Qwen2.5-1.5B-Instruct · GPTQ 4-bit (v2, quality-optimized)

Part of a systematic 4-way quantization study on Qwen2.5-1.5B-Instruct. See the study overview for comparisons across all variants.

An improved GPTQ 4-bit quantization of Qwen/Qwen2.5-1.5B-Instruct, using a quality-optimized configuration derived from ablation against GPTQ v1.

Three targeted fixes over v1:

  1. Smaller group size (128 → 64): each scaling factor covers fewer weights, reducing per-group quantization error at the cost of a slightly larger file (1.16 GB → 1.19 GB on disk, about 3%)
  2. Activation ordering ON (desc_act=False → True): reorders weight columns by descending activation magnitude before quantizing, so the most influential weights are quantized first, before error from other columns has accumulated
  3. Instruction-domain calibration: OpenHermes-2.5 chat data instead of Wikitext-2 prose, a better signal for an instruction-tuned model
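
To illustrate the first point, here is a minimal sketch of group-wise 4-bit quantization using plain round-to-nearest rather than GPTQ itself. The helper name `groupwise_rtn_error`, the synthetic Gaussian weights, and the asymmetric min/max scaling are assumptions chosen for illustration. This is not the code used to produce this model.

```python
import random

def groupwise_rtn_error(weights, group_size, bits=4):
    """Mean squared error of asymmetric round-to-nearest quantization,
    with one scale and one zero point per group of `group_size` weights."""
    levels = 2 ** bits - 1  # 15 steps between the 16 codes of a 4-bit grid
    total_sq_err = 0.0
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels or 1.0  # guard against a constant group
        for w in group:
            q = round((w - lo) / scale)    # integer code in [0, levels]
            w_hat = q * scale + lo         # dequantized value
            total_sq_err += (w - w_hat) ** 2
    return total_sq_err / len(weights)

# Synthetic weights (assumption): zero-mean Gaussian, fixed seed for repeatability.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(8192)]

err_128 = groupwise_rtn_error(weights, group_size=128)
err_64 = groupwise_rtn_error(weights, group_size=64)
print(f"MSE with group_size=128: {err_128:.3e}")
print(f"MSE with group_size=64:  {err_64:.3e}")
```

Each group of 64 spans a range no wider than the group of 128 that contains it, so its quantization step is never larger. The price is metadata: halving the group size doubles the number of scales and zero points that have to be stored.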

Result: perplexity delta reduced from +3.46 (v1) to +1.19 (v2), a 65% improvement in quality preservation for about 3% more disk space (1.16 GB → 1.19 GB).


Benchmark Results

All measurements on A100-40GB, batch size 1, HuggingFace Transformers, 50 tokens generated, 10-run average.

| Metric | FP16 baseline | GPTQ v1 | This model | Delta vs FP16 |
|---|---|---|---|---|
| VRAM usage | 3.56 GB | 1.63 GB | 1.66 GB | −53.4% |
| Disk size | 3.1 GB | 1.16 GB | 1.19 GB | −61.6% |
| Throughput | 38.7 tok/s | 16.1 tok/s | 13.8 tok/s | −64% |
| Latency (TTFT) | 26.8 ms | 63.8 ms | 76.9 ms | +187% |
| Perplexity (Wikitext-2) | 11.90 | 15.36 | 13.09 | +1.19 |
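
The throughput and TTFT rows could be measured with a small timing harness along the following lines. This is a sketch under assumptions, since the original benchmark script is not shown: `generate_fn` is a hypothetical callable that generates the requested number of new tokens and blocks until they are ready, and TTFT is approximated as the time needed to produce a single token.

```python
import time
from statistics import mean

def benchmark(generate_fn, new_tokens=50, runs=10):
    """Return (tokens per second, approximate TTFT in ms), averaged over runs.

    generate_fn(n) is a hypothetical callable that generates n new tokens and
    returns only once they are ready (on GPU, synchronize before returning).
    """
    throughputs, ttfts = [], []
    for _ in range(runs):
        start = time.perf_counter()
        generate_fn(1)                       # first token only
        ttfts.append((time.perf_counter() - start) * 1000.0)

        start = time.perf_counter()
        generate_fn(new_tokens)              # full generation
        throughputs.append(new_tokens / (time.perf_counter() - start))
    return mean(throughputs), mean(ttfts)

# Stand-in generator so the sketch runs without a model: about 1 ms per token.
def fake_generate(n):
    time.sleep(0.001 * n)

tok_per_s, ttft_ms = benchmark(fake_generate, new_tokens=50, runs=3)
print(f"{tok_per_s:.1f} tok/s, TTFT {ttft_ms:.1f} ms")
```

With the real model, `generate_fn` could wrap `model.generate(**inputs, max_new_tokens=n, do_sample=False)`. A warm-up call before timing is usually worthwhile so that kernel loading does not distort the first run.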

Quality gate (PPL delta < 1.0): ❌ FAIL (borderline: 0.19 above threshold)

The marginal quality gate failure is consistent with published literature: at INT4, 1.5B models typically see +1.0–2.0 PPL degradation due to limited parameter redundancy. A delta of +1.19 represents good quality preservation for this model size and bit-width.
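
The derived figures in this section follow directly from the raw measurements. The short sketch below recomputes them. The variable and function names are illustrative, and the numbers are copied from the benchmark table.

```python
def pct_change(new, baseline):
    """Relative change of `new` versus `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0

# Raw measurements from the benchmark table.
fp16 = {"vram_gb": 3.56, "disk_gb": 3.1, "tok_s": 38.7, "ttft_ms": 26.8, "ppl": 11.90}
v2 = {"vram_gb": 1.66, "disk_gb": 1.19, "tok_s": 13.8, "ttft_ms": 76.9, "ppl": 13.09}
v1_ppl = 15.36

# "Delta vs FP16" column for the percentage rows.
deltas = {key: pct_change(v2[key], fp16[key])
          for key in ("vram_gb", "disk_gb", "tok_s", "ttft_ms")}

# Perplexity is reported as an absolute difference, not a percentage.
ppl_delta_v2 = v2["ppl"] - fp16["ppl"]
ppl_delta_v1 = v1_ppl - fp16["ppl"]

# Quality gate: the perplexity delta must stay below 1.0.
passes_gate = ppl_delta_v2 < 1.0

# How much of the v1 perplexity degradation v2 removes, in percent.
delta_reduction = (ppl_delta_v1 - ppl_delta_v2) / ppl_delta_v1 * 100.0

for key, value in deltas.items():
    print(f"{key}: {value:+.1f}%")
print(f"PPL delta: {ppl_delta_v2:+.2f}, gate passed: {passes_gate}")
print(f"PPL delta reduction vs v1: {delta_reduction:.1f}%")
```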


What Changed vs v1

| Parameter | v1 | v2 | Effect |
|---|---|---|---|
| group_size | 128 | 64 | Finer scaling, 65% PPL improvement |
| desc_act | False | True | Activation ordering ON |
| damp_percent | 0.01 | 0.01 | Unchanged; stable for Qwen2 |
| Calibration data | Wikitext-2 | OpenHermes-2.5 | Domain-matched |
| Quantization time | ~4 min | ~30 min | Cost of desc_act=True |

Quantization Config

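from transformers import GPTQConfig

# openhermes_samples is expected to be a list of calibration strings
# (128 OpenHermes-2.5 samples); it is defined elsewhere.
quantization_config = GPTQConfig(
    bits=4,
    group_size=64,       # finer scaling than standard 128
    desc_act=True,       # activation ordering ON, the key quality improvement
    damp_percent=0.01,   # Hessian stability, default for Qwen2 architecture
    dataset=openhermes_samples,  # instruction-domain calibration
)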

| Parameter | Value |
|---|---|
| Method | GPTQ |
| Bits | 4 |
| Group size | 64 |
| desc_act | True |
| damp_percent | 0.01 |
| Calibration | OpenHermes-2.5, 128 samples |
| Framework | auto-gptq==0.7.1, transformers==4.44.0, optimum==1.16.0 |
| Hardware | A100-40GB |
| Date | February 2025 |
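
The exact preprocessing of the calibration samples is not shown here. As a hedged sketch, assuming OpenHermes-2.5 records follow the ShareGPT-style layout (a `conversations` list whose turns carry `from` and `value` fields), one record could be mapped to chat-template messages as follows. `ROLE_MAP` and `to_chat_messages` are hypothetical names, not part of any library.

```python
# Assumed mapping from ShareGPT-style speaker tags to chat-template roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def to_chat_messages(record):
    """Convert one ShareGPT-style record into a list of chat messages."""
    messages = []
    for turn in record["conversations"]:
        role = ROLE_MAP.get(turn["from"])
        if role is None:  # skip speaker tags this sketch does not know
            continue
        messages.append({"role": role, "content": turn["value"]})
    return messages

# Hand-written example record (not taken from the dataset).
example = {"conversations": [
    {"from": "system", "value": "You are a helpful assistant."},
    {"from": "human", "value": "What is GPTQ?"},
    {"from": "gpt", "value": "A post-training quantization method."},
]}

messages = to_chat_messages(example)
print(messages)
```

With a tokenizer loaded, each message list could then be turned into one calibration string via `tokenizer.apply_chat_template(messages, tokenize=False)`, so that the calibration text uses the same chat format the model sees at inference time.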

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Mohaaxa/qwen2.5-1.5b-gptq-4bit-v2",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Mohaaxa/qwen2.5-1.5b-gptq-4bit-v2")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantization in one paragraph."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)  # follow device_map placement

output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Requirements: pip install transformers==4.44.0 auto-gptq==0.7.1 optimum==1.16.0


Why is throughput slower than v1?

desc_act=True adds a column-reordering step during quantization that produces better weights. The cost shows up at inference time: the columns that share a quantization group are typically no longer contiguous, which makes the weights harder for the GPTQ CUDA kernel to process, hence 13.8 vs 16.1 tok/s (about 14% slower). This is a known tradeoff. For throughput-critical deployments, see the AWQ variant, which achieves 24.9 tok/s with comparable quality (+1.26 PPL).
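
One way to picture the effect is to look at which quantization group each weight column ends up in. The sketch below is a simplified illustration, not the AutoGPTQ implementation: `group_index` is a hypothetical helper, the activation magnitudes are made up, and groups are assumed to be formed in activation order.

```python
def group_index(act_magnitude, group_size, desc_act):
    """Return, for each column in its original position, the index of the
    quantization group (scale and zero point) it belongs to."""
    n = len(act_magnitude)
    if not desc_act:
        # Groups are simply consecutive blocks of columns.
        return [col // group_size for col in range(n)]
    # Columns are processed in order of descending activation magnitude,
    # and groups are formed along that processing order.
    order = sorted(range(n), key=lambda col: -act_magnitude[col])
    g_idx = [0] * n
    for rank, col in enumerate(order):
        g_idx[col] = rank // group_size
    return g_idx

# Made-up per-column activation magnitudes for an 8-column layer.
act = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.5, 0.4]

plain = group_index(act, group_size=4, desc_act=False)
reordered = group_index(act, group_size=4, desc_act=True)
print("desc_act=False:", plain)
print("desc_act=True: ", reordered)
```

Without activation ordering, neighbouring columns share a group, so a kernel can load one scale per block. With it, group membership is scattered across the row and has to be looked up per column, which is one plausible source of the lower throughput reported above.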


Study Context

| Variant | VRAM | Tok/s | PPL | PPL Δ | Quality |
|---|---|---|---|---|---|
| FP16 baseline | 3.56 GB | 38.7 | 11.90 | n/a | reference |
| GPTQ v1 | 1.63 GB | 16.1 | 15.36 | +3.46 | ❌ |
| GPTQ v2 (this) | 1.66 GB | 13.8 | 13.09 | +1.19 | ❌ |
| AWQ | 1.16 GB | 24.9 | 13.16 | +1.26 | ❌ |

GPTQ v2 achieves the best raw perplexity of the three quantized variants (+1.19) at the cost of slowest inference (13.8 tok/s). If quality is the priority and throughput is secondary, this is the recommended variant. For deployment where both matter, AWQ provides a better overall tradeoff.

Study methodology: Mohaaxa
