Qwen2.5-1.5B-Instruct · GPTQ 4-bit (v2, quality-optimized)

Part of a systematic 4-way quantization study on Qwen2.5-1.5B-Instruct. See the study overview for comparisons across all variants.

An improved GPTQ 4-bit quantization of Qwen/Qwen2.5-1.5B-Instruct, using a quality-optimized configuration derived from ablation against GPTQ v1.

Three targeted fixes over v1:

Smaller group size (128 → 64): each scaling factor covers fewer weights, reducing per-group quantization error at the cost of a roughly 3% larger file (1.16 GB → 1.19 GB on disk)
Activation ordering ON (desc_act=False → True): reorders weight columns by descending activation magnitude before quantizing, so the most influential weights are quantized first, when accumulated error is lowest
Instruction-domain calibration: OpenHermes-2.5 chat data instead of Wikitext-2 prose, a better-matched signal for an instruction-tuned model

Result: perplexity delta reduced from +3.46 (v1) to +1.19 (v2), about a 65% reduction in quantization-induced degradation for roughly 3% more disk space.
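The storage cost of a smaller group size can be estimated with a little arithmetic. The sketch below assumes each group stores one 16-bit scale and one 4-bit zero point alongside the 4-bit weights; that layout is an assumption about the checkpoint format, and the helper name `bits_per_weight` is hypothetical.

```python
def bits_per_weight(bits, group_size, scale_bits=16, zero_bits=4):
    """Effective storage per weight: payload plus per-group metadata."""
    return bits + (scale_bits + zero_bits) / group_size


bpw_128 = bits_per_weight(4, 128)
bpw_64 = bits_per_weight(4, 64)
growth = bpw_64 / bpw_128 - 1

print(f"group_size=128: {bpw_128:.4f} bits/weight")
print(f"group_size=64:  {bpw_64:.4f} bits/weight")
print(f"growth in quantized layers: {growth:.1%}")
```

Under these assumptions the quantized layers grow by a little under 4%. Layers that are typically left unquantized, such as the embeddings, do not grow at all, which would explain why the measured file size moves only from 1.16 GB to 1.19 GB.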

Benchmark Results

All measurements on A100-40GB, batch size 1, HuggingFace Transformers, 50 tokens generated, 10-run average.

Metric	FP16 baseline	GPTQ v1	This model	Delta vs FP16
VRAM usage	3.56 GB	1.63 GB	1.66 GB	−53.4%
Disk size	3.1 GB	1.16 GB	1.19 GB	−61.6%
Throughput	38.7 tok/s	16.1 tok/s	13.8 tok/s	−64%
Latency (TTFT)	26.8 ms	63.8 ms	76.9 ms	+187%
Perplexity (Wikitext-2)	11.90	15.36	13.09	+1.19
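The throughput and TTFT rows could be collected with a loop along these lines. This is a sketch, not the study's harness: `generate_stream` is a hypothetical callable that yields one token at a time, and `fake_stream` is a stand-in so the example runs without a model.

```python
import time


def benchmark(generate_stream, n_runs=10):
    """Average time-to-first-token (ms) and throughput (tok/s) over n_runs.

    generate_stream: hypothetical zero-argument callable returning an
    iterator that yields one generated token at a time.
    """
    ttfts, rates = [], []
    for _ in range(n_runs):
        start = time.perf_counter()
        first_token_at = None
        n_tokens = 0
        for _token in generate_stream():
            if first_token_at is None:
                first_token_at = time.perf_counter()
            n_tokens += 1
        elapsed = time.perf_counter() - start
        ttfts.append((first_token_at - start) * 1000.0)
        rates.append(n_tokens / elapsed)
    return sum(ttfts) / n_runs, sum(rates) / n_runs


def fake_stream(n_tokens=50, delay_s=0.001):
    """Stand-in generator so the sketch runs without a model."""
    for i in range(n_tokens):
        time.sleep(delay_s)
        yield i


ttft_ms, tok_per_s = benchmark(fake_stream, n_runs=3)
print(f"TTFT {ttft_ms:.1f} ms, throughput {tok_per_s:.1f} tok/s")
```

With a real model on GPU, the timestamps are only meaningful if the device is synchronized before each reading, since CUDA calls return asynchronously.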

Quality gate (PPL delta < 1.0): ❌ FAIL (borderline — 0.19 above threshold)
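The Delta column and the gate verdict follow directly from the table values. A quick check, using only the numbers reported above (the dictionary keys are illustrative names):

```python
fp16 = {"vram_gb": 3.56, "disk_gb": 3.1, "tok_s": 38.7, "ttft_ms": 26.8, "ppl": 11.90}
v2 = {"vram_gb": 1.66, "disk_gb": 1.19, "tok_s": 13.8, "ttft_ms": 76.9, "ppl": 13.09}

PPL_GATE = 1.0  # quality gate: absolute perplexity delta must stay below this

# Relative change vs FP16 for the resource metrics, in percent.
relative = {k: (v2[k] / fp16[k] - 1) * 100 for k in fp16 if k != "ppl"}
# Perplexity is compared as an absolute difference, not a percentage.
ppl_delta = v2["ppl"] - fp16["ppl"]
passes_gate = ppl_delta < PPL_GATE

for name, pct in relative.items():
    print(f"{name}: {pct:+.1f}%")
print(f"ppl delta: {ppl_delta:+.2f}, gate passed: {passes_gate}")
```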

The marginal quality gate failure is consistent with published literature: at INT4, 1.5B models typically see +1.0–2.0 PPL degradation due to limited parameter redundancy. A delta of +1.19 represents good quality preservation for this model size and bit-width.
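For readers less familiar with the metric: perplexity is the exponential of the mean negative log-likelihood per token, and the deltas quoted here are absolute differences between two such values. A minimal sketch with toy numbers follows; the study's exact evaluation settings (context length, stride) are not stated here, so none are assumed.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)


# Toy example: natural-log probabilities a model assigned to four tokens.
log_probs = [math.log(0.25), math.log(0.5), math.log(0.125), math.log(0.25)]
print(perplexity(log_probs))
```

A model that is less sure of the correct tokens assigns them lower probabilities, which raises the mean negative log-likelihood and therefore the perplexity.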

What Changed vs v1

Parameter	v1	v2	Effect
`group_size`	128	64	Finer scaling; contributes to the 65% PPL-delta reduction
`desc_act`	False	True	Activation ordering ON
`damp_percent`	0.01	0.01	Unchanged — stable for Qwen2
Calibration data	Wikitext-2	OpenHermes-2.5	Domain-matched
Quantization time	~4 min	~30 min	Cost of desc_act=True
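The activation ordering enabled by desc_act=True can be pictured with a toy sketch. It assumes channel importance is the sum of squared calibration activations per input channel (the diagonal of X^T X, the Hessian proxy GPTQ works with). The function names are hypothetical and this is not the auto-gptq implementation.

```python
def act_order_permutation(activations):
    """Return input-channel indices sorted by descending activation energy.

    activations: list of calibration rows, each a list with one value per
    input channel. Importance is the sum of squares per channel (assumed).
    """
    n_channels = len(activations[0])
    energy = [sum(row[c] ** 2 for row in activations) for c in range(n_channels)]
    # Highest-energy channels come first, so they are processed first.
    return sorted(range(n_channels), key=lambda c: energy[c], reverse=True)


def invert_permutation(perm):
    """Return the permutation that restores the original channel order."""
    inverse = [0] * len(perm)
    for new_pos, old_pos in enumerate(perm):
        inverse[old_pos] = new_pos
    return inverse


# Three calibration rows, four input channels.
calib = [
    [0.1, 2.0, -0.5, 1.0],
    [0.2, -1.5, 0.4, 1.2],
    [-0.1, 1.8, 0.6, -0.9],
]
perm = act_order_permutation(calib)
inv = invert_permutation(perm)
print(perm)
```

Sorting is cheap; the longer quantization time comes from what the reordering implies for the rest of the pipeline, which this sketch does not model.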

Quantization Config

from transformers import GPTQConfig

quantization_config = GPTQConfig(
    bits=4,
    group_size=64,       # finer scaling than the standard 128
    desc_act=True,       # activation ordering ON: key quality improvement
    damp_percent=0.01,   # Hessian damping for numerical stability (same as v1)
    dataset=openhermes_samples,  # instruction-domain calibration texts (list of str)
)

Parameter	Value
Method	GPTQ
Bits	4
Group size	64
desc_act	True
damp_percent	0.01
Calibration	OpenHermes-2.5, 128 samples
Framework	auto-gptq==0.7.1, transformers==4.44.0, optimum==1.16.0
Hardware	A100-40GB
Date	February 2025
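How the 128 calibration samples were selected is not documented here. One plausible preparation step is sketched below, assuming the dataset uses ShareGPT-style records (a `conversations` list of `from`/`value` turns). Both that schema and the helper names are assumptions.

```python
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def to_chat_messages(record):
    """Convert one ShareGPT-style record into role/content messages."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in record["conversations"]
        if turn["from"] in ROLE_MAP
    ]


def build_calibration_set(records, n_samples=128):
    """Take the first n_samples records that contain an assistant turn."""
    samples = []
    for record in records:
        messages = to_chat_messages(record)
        if any(m["role"] == "assistant" for m in messages):
            samples.append(messages)
        if len(samples) == n_samples:
            break
    return samples


example = {
    "conversations": [
        {"from": "human", "value": "What is quantization?"},
        {"from": "gpt", "value": "Storing weights at lower precision."},
    ]
}
calibration = build_calibration_set([example] * 200)
print(len(calibration), calibration[0][0])
```

Each message list could then be rendered to a string with `tokenizer.apply_chat_template(messages, tokenize=False)`, so that calibration text matches the chat format the model sees at inference.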

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Mohaaxa/qwen2.5-1.5b-gptq-4bit-v2"

# The GPTQ settings are read from the checkpoint's config.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantization in one paragraph."},
]
# Render the conversation with the chat template and open the assistant turn.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Follow the model's device instead of hard-coding "cuda".
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Greedy decoding for reproducible output.
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Requirements: pip install transformers==4.44.0 auto-gptq==0.7.1 optimum==1.16.0

Why is throughput slower than v1?

desc_act=True reorders weight columns during quantization, which produces more accurate weights. The side effect is that consecutive input channels no longer share a quantization group, so at inference time the GPTQ CUDA kernel needs an extra index lookup (or an activation reordering) to find the right scale and zero point for each channel. That overhead accounts for the drop from 16.1 to 13.8 tok/s and is a known tradeoff. For throughput-critical deployments, see the AWQ variant, which achieves 24.9 tok/s with comparable quality (+1.26 PPL).
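The effect on memory access can be seen in a toy example. GPTQ checkpoints with activation ordering carry a per-channel group index (commonly named `g_idx`). The sketch below builds one for a sequential layout and for a made-up permutation, then counts how often the group changes when channels are scanned in memory order. The helper names and the permutation are hypothetical.

```python
def group_index(perm, group_size):
    """Map each original input channel to its quantization group.

    perm[i] is the original channel processed at position i; consecutive
    positions share a group, so channel perm[i] is in group i // group_size.
    """
    g_idx = [0] * len(perm)
    for position, channel in enumerate(perm):
        g_idx[channel] = position // group_size
    return g_idx


def group_switches(g_idx):
    """Count group changes while scanning channels in memory order."""
    return sum(1 for a, b in zip(g_idx, g_idx[1:]) if a != b)


n_channels, group_size = 8, 4
sequential = list(range(n_channels))   # desc_act=False
reordered = [5, 0, 6, 1, 7, 2, 4, 3]   # hypothetical act-order permutation

g_seq = group_index(sequential, group_size)
g_act = group_index(reordered, group_size)
print(g_seq, group_switches(g_seq))
print(g_act, group_switches(g_act))
```

In the sequential case each block of four channels shares one scale. In the reordered case neighbouring channels can belong to different groups, so the scale has to be looked up per channel.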

Study Context

Variant	VRAM	Tok/s	PPL	PPL Δ	Quality
FP16 baseline	3.56 GB	38.7	11.90	—	reference
GPTQ v1	1.63 GB	16.1	15.36	+3.46	❌
GPTQ v2 (this)	1.66 GB	13.8	13.09	+1.19	❌
AWQ	1.16 GB	24.9	13.16	+1.26	❌

GPTQ v2 achieves the best raw perplexity of the three quantized variants (+1.19) at the cost of slowest inference (13.8 tok/s). If quality is the priority and throughput is secondary, this is the recommended variant. For deployment where both matter, AWQ provides a better overall tradeoff.

Study methodology: Mohaaxa

Model tree for Mohaaxa/qwen2.5-1.5b-gptq-4bit-v2: base model Qwen/Qwen2.5-1.5B → finetuned Qwen/Qwen2.5-1.5B-Instruct → quantized (this model)