Qwen2.5-1.5B-Instruct · GPTQ 4-bit (v1)

Part of a systematic 4-way quantization study on Qwen2.5-1.5B-Instruct. See the study overview for comparisons across all variants.

A 4-bit GPTQ quantization of Qwen/Qwen2.5-1.5B-Instruct. This is the baseline quantization — preserved as a reference point to show what default GPTQ settings produce and why they fall short.

Benchmark Results

All measurements were taken on an A100-40GB at batch size 1 with HuggingFace Transformers, generating 50 tokens from a fixed prompt and averaging over 10 runs.
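The study's benchmark script is not reproduced here. As a rough sketch of the procedure described above (fixed prompt, 50 new tokens, 10 runs), a timing harness could look like the following. The `benchmark` function and the `generate(n)` callable are hypothetical names introduced for illustration, and using a one-token generation as a stand-in for TTFT is an assumption.

```python
import time
from statistics import mean


def benchmark(generate, new_tokens=50, runs=10, warmup=1):
    """Average decode throughput and time-to-first-token over several runs.

    `generate(n)` is a hypothetical callable that produces exactly `n` new
    tokens from the fixed prompt and returns only once they are ready.
    """
    for _ in range(warmup):          # warm-up runs are not timed
        generate(new_tokens)

    tok_per_s, ttft_ms = [], []
    for _ in range(runs):
        start = time.perf_counter()
        generate(1)                  # one token approximates TTFT
        ttft_ms.append((time.perf_counter() - start) * 1000.0)

        start = time.perf_counter()
        generate(new_tokens)
        tok_per_s.append(new_tokens / (time.perf_counter() - start))

    return {"tok_per_s": mean(tok_per_s), "ttft_ms": mean(ttft_ms)}
```

With Transformers, `generate(n)` would typically wrap `model.generate(**inputs, max_new_tokens=n, min_new_tokens=n, do_sample=False)` followed by `torch.cuda.synchronize()`, so that the timer stops only after the GPU has finished.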

Metric	FP16 baseline	This model	Delta
VRAM usage	3.56 GB	1.63 GB	54.2% smaller
Disk size	3.1 GB	1.16 GB	62.6% smaller
Throughput	38.7 tok/s	16.1 tok/s	−58%
Latency (TTFT)	26.8 ms	63.8 ms	+138%
Perplexity (Wikitext-2)	11.90	15.36	+3.46 ❌

Quality gate (PPL delta < 1.0): ❌ FAIL
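For readers who want to recompute the Delta column and the quality gate, here is a minimal sketch. The function names are illustrative and not part of the study's code; the input values are copied from the table above.

```python
def percent_change(baseline, value):
    """Relative change of `value` against `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


def passes_quality_gate(ppl_baseline, ppl_quantized, max_delta=1.0):
    """True when the absolute perplexity increase stays below `max_delta`."""
    return (ppl_quantized - ppl_baseline) < max_delta


# Values copied from the table above (FP16 baseline, this model).
print(round(percent_change(3.56, 1.63), 1))   # VRAM        -> -54.2
print(round(percent_change(38.7, 16.1), 1))   # throughput  -> -58.4
print(round(percent_change(26.8, 63.8), 1))   # TTFT        -> 138.1
print(passes_quality_gate(11.90, 15.36))      # gate        -> False
```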

The +3.46 perplexity degradation is significant. Root causes identified in this study:

group_size=128 is too coarse for a 1.5B model: with fewer parameters overall, each scaling factor covers a larger proportion of the weight space
desc_act=False disables activation ordering, so columns are quantized in their stored order rather than most-sensitive-first, and the most sensitive weights can end up quantized after error has already accumulated
Wikitext-2 calibration is domain-mismatched for an instruction-tuned model
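The effect of group size can be illustrated with a toy experiment. The sketch below is not GPTQ: it uses plain round-to-nearest quantization with one min-max scale per group, and it runs on synthetic Gaussian values rather than real model weights. Both are assumptions made to keep the example self-contained, so it shows only the direction of the effect, not the perplexity numbers reported here.

```python
import random


def rtn_mse(weights, group_size, bits=4):
    """Mean squared error of asymmetric min-max round-to-nearest quantization,
    with one scale and zero point per group of `group_size` weights."""
    levels = 2 ** bits - 1
    total = 0.0
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels or 1.0      # guard against a constant group
        for w in group:
            q = round((w - lo) / scale)        # integer code in [0, levels]
            total += (w - (lo + q * scale)) ** 2
    return total / len(weights)


random.seed(0)
# Synthetic stand-in for one weight row; not taken from the real model.
weights = [random.gauss(0.0, 0.02) for _ in range(8192)]

for g in (128, 64, 32):
    # Columns: group size, number of scale factors, reconstruction error.
    print(g, len(weights) // g, rtn_mse(weights, g))
```

Smaller groups cost more scale factors but give each one a narrower range to cover, which is the trade-off behind the group_size choice.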

See GPTQ v2 for the revised configuration, which narrows the perplexity delta to +1.19 but still misses the quality gate.

Quantization Config

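from transformers import GPTQConfig

# wikitext2_samples: list of 128 Wikitext-2 text strings used for calibration
GPTQConfig(
    bits=4,
    group_size=128,     # coarse; identified as a root cause of the quality failure
    desc_act=False,     # activation ordering OFF
    damp_percent=0.01,
    dataset=wikitext2_samples,   # domain mismatch for an instruction-tuned model
)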

Parameter	Value
Method	GPTQ
Bits	4
Group size	128
desc_act	False
Calibration	Wikitext-2, 128 samples
Framework	auto-gptq==0.7.1, transformers==4.44.0
Hardware	A100-40GB
Date	February 2025
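As a sketch of how a configuration like the one above is typically applied through Transformers, the following shows one possible end-to-end run. The helper names (`pick_calibration_texts`, `quantize_v1`), the `wikitext-2-raw-v1` dataset config, and the sample-selection rule are all assumptions; the study does not state how its 128 calibration samples were chosen.

```python
def pick_calibration_texts(lines, n_samples=128, min_chars=200):
    """Keep the first `n_samples` lines that look like real paragraphs.

    The selection rule (skip blanks, headings and short lines) is an
    assumption, not the study's documented procedure.
    """
    picked = []
    for line in lines:
        text = line.strip()
        if len(text) < min_chars or text.startswith("="):
            continue
        picked.append(text)
        if len(picked) == n_samples:
            break
    return picked


def quantize_v1(output_dir, base_id="Qwen/Qwen2.5-1.5B-Instruct"):
    """Quantize the base model with the v1 settings and save it (needs a GPU)."""
    # Heavy dependencies are imported lazily so the helper above works without them.
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

    raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    samples = pick_calibration_texts(raw["text"])

    tokenizer = AutoTokenizer.from_pretrained(base_id)
    config = GPTQConfig(
        bits=4,
        group_size=128,
        desc_act=False,
        damp_percent=0.01,
        dataset=samples,      # list of raw text strings
        tokenizer=tokenizer,
    )
    # Passing quantization_config triggers calibration and quantization at load time.
    model = AutoModelForCausalLM.from_pretrained(
        base_id, device_map="auto", quantization_config=config
    )
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
```

Calling `quantize_v1("qwen2.5-1.5b-gptq-4bit")` would write the quantized weights and tokenizer to that directory; the directory name here is only an example.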

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Mohaaxa/qwen2.5-1.5b-gptq-4bit",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Mohaaxa/qwen2.5-1.5b-gptq-4bit")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantization in one paragraph."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)  # follow device_map placement

output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Requirements: pip install transformers==4.44.0 auto-gptq==0.7.1 optimum==1.16.0

Why is throughput slower than FP16?

This is expected on high-end GPUs. The A100's FP16 tensor cores are fast enough that, at batch size 1, the overhead of dequantizing INT4 weights during inference outweighs the memory-bandwidth savings. GPTQ's real advantage is on memory-constrained hardware (consumer GPUs, edge devices such as Jetson), where the 54% VRAM reduction allows running a model that otherwise would not fit.

Study Context

This model is part of a 4-variant benchmark study:

Variant	VRAM	Tok/s	PPL	PPL Δ	Quality
FP16 baseline	3.56 GB	38.7	11.90	—	reference
GPTQ v1 (this)	1.63 GB	16.1	15.36	+3.46	❌ FAIL
GPTQ v2	1.66 GB	13.8	13.09	+1.19	❌ FAIL
AWQ	1.16 GB	24.9	13.16	+1.26	❌ FAIL

Key finding: On a 1.5B model at INT4, perplexity degradation above 1.0 is expected and consistent with the quantization research literature, which shows that small models suffer disproportionately under aggressive quantization. All three quantized variants miss the quality gate. AWQ achieves the best VRAM footprint (1.16 GB, a 67% reduction) while roughly matching GPTQ v2 quality (PPL 13.16 vs. 13.09). For this model size, GGUF Q4_K_M would likely achieve a lower PPL delta through mixed-precision per-layer quantization, although that variant was not measured in this study.

Study methodology: Mohaaxa
