# Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-heretic-v2 AutoRound W4A16

This is a 4-bit AutoRound export of `llmfan46/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-heretic-v2`.
## Quantization settings
- Method: AutoRound 0.9.2
- Scheme: W4A16
- Bits: 4
- Group size: 128
- Iterations: 200
- Calibration seqlen: 512
- Calibration samples: 64
- Batch size: 1
- Gradient accumulate steps: 8
- Low GPU memory mode: enabled
- Packing format: `auto_round:auto_gptq`
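For reproducibility, the settings above map onto AutoRound's Python API roughly as follows. This is an untested sketch (it needs a GPU and the full-precision source model), and the kwarg names reflect AutoRound 0.9.x; check them against your installed version before running:

```python
from transformers import Qwen3_5ForConditionalGeneration, AutoTokenizer
from auto_round import AutoRound

src = "llmfan46/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-heretic-v2"

model = Qwen3_5ForConditionalGeneration.from_pretrained(
    src, torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(src, trust_remote_code=True)

# Settings taken from the list above; the output directory name is illustrative.
ar = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    iters=200,
    seqlen=512,
    nsamples=64,
    batch_size=1,
    gradient_accumulate_steps=8,
    low_gpu_mem_usage=True,
)
ar.quantize()
ar.save_quantized("qwen3.5-27b-heretic-v2-w4a16", format="auto_round:auto_gptq")
```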
## Artifact summary

- Source size: 51G
- Quantized size: 18G
- Size reduction: 64.71%
- Final export: 5 safetensors shards
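The size-reduction figure follows directly from the two sizes above; a quick sanity check, treating the rounded 51G/18G figures as exact:

```python
src_gb, quant_gb = 51, 18
reduction = 100 * (src_gb - quant_gb) / src_gb
print(f"{reduction:.2f}%")  # 64.71%
```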
AutoRound reported 400/607 modules quantized. The vision tower, many `linear_attn.in_proj_a`/`in_proj_b` layers, and `lm_head` remain unquantized.
## Perplexity benchmark

WikiText-2 test split, sliding-window next-token perplexity with `seq_len=512`, `stride=256`, `max_tokens=8192`:
| Model | PPL | Tokens scored | Load time | Eval time |
|---|---|---|---|---|
| FP source | 7.518735 | 8161 | 9.829s | 106.855s |
| AutoRound W4A16 | 7.747804 | 8161 | 5.844s | 42.065s |
Degradation vs FP: +3.0466%
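As a sanity check, the window count and scored-token total implied by these settings can be reproduced in plain Python, along with the degradation percentage. The reported 8161 is consistent with each window losing one target token to the causal next-token shift, which is the assumption made here:

```python
seq_len, stride, max_tokens = 512, 256, 8192

# Enumerate sliding windows; each window scores only the tokens that are
# new relative to the previous window, minus one lost to the causal shift.
windows, prev_end, scored = 0, 0, 0
for begin in range(0, max_tokens, stride):
    end = min(begin + seq_len, max_tokens)
    scored += (end - prev_end) - 1
    windows += 1
    prev_end = end
    if end == max_tokens:
        break

print(windows, scored)  # 31 windows, 8161 tokens scored

# Relative perplexity degradation of the quantized model vs. the FP source.
fp, quant = 7.518735, 7.747804
print(f"+{100 * (quant - fp) / fp:.4f}%")  # +3.0466%
```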
## Loading with Transformers

```python
from transformers import Qwen3_5ForConditionalGeneration, AutoTokenizer

model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "PATH/TO/THIS/MODEL",
    device_map="auto",
    trust_remote_code=True,
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "PATH/TO/THIS/MODEL",
    trust_remote_code=True,
)
```
## Loading with vLLM

```shell
python -m vllm.entrypoints.openai.api_server \
  --model PATH/TO/THIS/MODEL \
  --quantization gptq \
  --tensor-parallel-size 2
```
## Included benchmark/report artifacts

- `benchmark_summary.json`
- `heretic_v2_fp_ppl.json`
- `heretic_v2_quant_ppl.json`
- `HERETIC_V2_QUANTIZATION_REPORT.md`
## Notes

- The source model architecture is `Qwen3_5ForConditionalGeneration`.
- Text-only evaluation was run through the internal language-model path.
- Please follow the original model's usage terms and license expectations.