Qwen2.5-1.5B-Instruct · GPTQ 4-bit (v1)

Part of a systematic 4-way quantization study on Qwen2.5-1.5B-Instruct. See the study overview for comparisons across all variants.

A 4-bit GPTQ quantization of Qwen/Qwen2.5-1.5B-Instruct. This is the baseline quantization — preserved as a reference point to show what default GPTQ settings produce and why they fall short.


Benchmark Results

All measurements were taken on an A100-40GB at batch size 1 with HuggingFace Transformers, generating 50 tokens from a fixed prompt and averaging over 10 runs.
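The measurement procedure can be sketched roughly as follows. This is a minimal illustration, not the study's actual harness: `benchmark` and `fake_stream` are hypothetical names, and the fake stream stands in for a real model that yields tokens one at a time. On a GPU you would also need to synchronize the device before reading the clock.

```python
import time
from statistics import mean
from typing import Callable, Dict, Iterable


def benchmark(stream_tokens: Callable[[], Iterable[object]], runs: int = 10) -> Dict[str, float]:
    """Average time-to-first-token (ms) and throughput (tok/s) over several runs."""
    ttft_ms, tok_per_s = [], []
    for _ in range(runs):
        start = time.perf_counter()
        first_token_at = None
        n_tokens = 0
        for _token in stream_tokens():
            if first_token_at is None:
                # Record when the first token arrives (TTFT).
                first_token_at = time.perf_counter()
            n_tokens += 1
        elapsed = time.perf_counter() - start
        if n_tokens == 0:
            raise ValueError("stream produced no tokens")
        ttft_ms.append((first_token_at - start) * 1000.0)
        tok_per_s.append(n_tokens / elapsed)
    return {"ttft_ms": mean(ttft_ms), "tok_per_s": mean(tok_per_s)}


def fake_stream(n_tokens: int = 50, delay_s: float = 0.001):
    """Hypothetical stand-in for a model: yields n_tokens with a fixed delay."""
    for i in range(n_tokens):
        time.sleep(delay_s)
        yield i


result = benchmark(fake_stream, runs=3)
print(result)
```

Swapping `fake_stream` for a callable that streams tokens from the real model would give numbers comparable in kind to those reported here, assuming the same prompt and token budget.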

| Metric | FP16 baseline | This model | Delta |
|---|---|---|---|
| VRAM usage | 3.56 GB | 1.63 GB | 54.2% smaller |
| Disk size | 3.1 GB | 1.16 GB | 62.6% smaller |
| Throughput | 38.7 tok/s | 16.1 tok/s | −58% |
| Latency (TTFT) | 26.8 ms | 63.8 ms | +138% |
| Perplexity (Wikitext-2) | 11.90 | 15.36 | +3.46 ❌ |

Quality gate (PPL delta < 1.0): ❌ FAIL
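The deltas and the gate decision can be reproduced from the reported numbers with a few lines of arithmetic. The sketch below copies the values from the benchmark results; the variable and function names are illustrative only.

```python
# Values copied from the benchmark results for the FP16 baseline and this model.
fp16 = {"vram_gb": 3.56, "disk_gb": 3.1, "tok_s": 38.7, "ttft_ms": 26.8, "ppl": 11.90}
gptq = {"vram_gb": 1.63, "disk_gb": 1.16, "tok_s": 16.1, "ttft_ms": 63.8, "ppl": 15.36}

PPL_GATE = 1.0  # quality gate: perplexity delta must stay below this


def pct_change(base: float, new: float) -> float:
    """Signed percentage change from base to new."""
    return (new - base) / base * 100.0


deltas = {k: pct_change(fp16[k], gptq[k]) for k in ("vram_gb", "disk_gb", "tok_s", "ttft_ms")}
ppl_delta = gptq["ppl"] - fp16["ppl"]
passes_gate = ppl_delta < PPL_GATE

for name, value in deltas.items():
    print(f"{name}: {value:+.1f}%")
print(f"ppl delta: {ppl_delta:+.2f} -> {'PASS' if passes_gate else 'FAIL'}")
```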

The +3.46 perplexity degradation is significant. Root causes identified in this study:

  • group_size=128 is too coarse for a 1.5B model — fewer parameters means each scaling factor covers a larger proportion of the weight space
  • desc_act=False disables activation ordering, so columns are quantized in their stored order and the most activation-sensitive weights can end up absorbing error accumulated from earlier columns
  • Wikitext-2 calibration is domain-mismatched for an instruction-tuned model
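To make the group-size point concrete, here is a minimal sketch of how reconstruction error changes with group size. It uses plain round-to-nearest quantization with one scale and offset per group, not GPTQ itself, and the Gaussian weights are synthetic. All names and values are assumptions for illustration.

```python
import random


def rtn_group_mse(weights, group_size, bits=4):
    """Mean squared error of asymmetric round-to-nearest quantization per group."""
    levels = 2 ** bits - 1
    sq_err = 0.0
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        # One scale per group; fall back to 1.0 if the group is constant.
        scale = (hi - lo) / levels or 1.0
        for w in group:
            q = round((w - lo) / scale)  # integer code in [0, levels]
            sq_err += (w - (q * scale + lo)) ** 2
    return sq_err / len(weights)


rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]  # synthetic, not real model weights
mse_by_group = {g: rtn_group_mse(weights, g) for g in (128, 64, 32)}

for g, mse in mse_by_group.items():
    print(f"group_size={g:>3}: mse={mse:.3e}")
```

Smaller groups see a narrower value range, so each 4-bit step is finer and the error drops. Whether that carries over to GPTQ on this model is what the v2 configuration tests.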

See GPTQ v2 for the revised configuration, which narrows the perplexity delta to +1.19 but still falls short of the quality gate.


Quantization Config

GPTQConfig(
    bits=4,
    group_size=128,     # coarse — identified as root cause of quality failure
    desc_act=False,     # activation ordering OFF
    damp_percent=0.01,
    dataset=wikitext2_samples,   # domain mismatch for instruction model
)

| Parameter | Value |
|---|---|
| Method | GPTQ |
| Bits | 4 |
| Group size | 128 |
| desc_act | False |
| Calibration | Wikitext-2, 128 samples |
| Framework | auto-gptq==0.7.1, transformers==4.44.0 |
| Hardware | A100-40GB |
| Date | February 2025 |
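Group size also affects storage, since each group carries its own quantization metadata. A rough sketch of the effective bits per weight follows, assuming one 16-bit scale and one 4-bit zero point per group. That layout is an assumption for illustration, not something measured in this study, and it ignores other tensors such as embeddings.

```python
def effective_bits_per_weight(bits: int, group_size: int,
                              scale_bits: int = 16, zero_bits: int = 4) -> float:
    """Weight bits plus per-group metadata amortized over the group (assumed layout)."""
    return bits + (scale_bits + zero_bits) / group_size


overhead = {g: effective_bits_per_weight(4, g) for g in (128, 64, 32)}

for g, bpw in overhead.items():
    print(f"group_size={g:>3}: {bpw:.3f} bits/weight")
```

Under these assumptions, moving from group size 128 to 32 costs roughly half a bit per quantized weight, which is the trade-off to weigh against any quality gain.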

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Mohaaxa/qwen2.5-1.5b-gptq-4bit",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Mohaaxa/qwen2.5-1.5b-gptq-4bit")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantization in one paragraph."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)  # follow device_map placement

output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Requirements: pip install transformers==4.44.0 auto-gptq==0.7.1 optimum==1.16.0


Why is throughput slower than FP16?

This is expected on high-end GPUs. A100's FP16 tensor cores are fast enough that the overhead of INT4 dequantization during inference exceeds the memory bandwidth savings. GPTQ's real advantage is on memory-constrained hardware (consumer GPUs, edge devices like Jetson) where the 54% VRAM reduction allows running a model that otherwise wouldn't fit.
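One rough way to sanity-check this is to compare the measured throughput with a simple memory-bandwidth ceiling. The sketch below assumes every weight is read once per generated token, uses the disk sizes as a proxy for weight bytes, and takes 1555 GB/s as the A100-40GB peak bandwidth. That bandwidth figure and the function name are assumptions, not values from this study.

```python
A100_40GB_BANDWIDTH_GB_S = 1555.0  # assumed peak memory bandwidth, not measured here


def bandwidth_bound_tok_s(weight_gb: float,
                          bandwidth_gb_s: float = A100_40GB_BANDWIDTH_GB_S) -> float:
    """Upper bound on tok/s if decoding were limited only by reading the weights."""
    return bandwidth_gb_s / weight_gb


# (weight size in GB, measured tok/s), both taken from the benchmark results.
measured = {"FP16": (3.1, 38.7), "GPTQ v1": (1.16, 16.1)}

report = {}
for name, (size_gb, tok_s) in measured.items():
    bound = bandwidth_bound_tok_s(size_gb)
    report[name] = {"bound_tok_s": bound, "utilization": tok_s / bound}
    print(f"{name}: bound ~{bound:.0f} tok/s, measured {tok_s} tok/s "
          f"({tok_s / bound:.1%} of bound)")
```

Under these assumptions both variants run far below the bandwidth ceiling, which is consistent with per-token compute and kernel overhead, rather than weight reads, setting the pace in this setup.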


Study Context

This model is part of a 4-variant benchmark study:

| Variant | VRAM | Tok/s | PPL | PPL Δ | Quality |
|---|---|---|---|---|---|
| FP16 baseline | 3.56 GB | 38.7 | 11.90 | | reference |
| GPTQ v1 (this) | 1.63 GB | 16.1 | 15.36 | +3.46 | ❌ FAIL |
| GPTQ v2 | 1.66 GB | 13.8 | 13.09 | +1.19 | ❌ FAIL |
| AWQ | 1.16 GB | 24.9 | 13.16 | +1.26 | ❌ FAIL |

Key finding: On a 1.5B model at INT4, perplexity degradation above 1.0 is expected and consistent with the quantization research literature, which shows small models suffer disproportionately under aggressive quantization. AWQ achieves the best VRAM footprint (1.16 GB, 67% reduction) while matching GPTQ v2 quality. For this model size, GGUF Q4_K_M would likely achieve lower PPL delta through mixed-precision per-layer quantization.

Study methodology: Mohaaxa
