Qwen2.5-1.5B-Instruct · GPTQ 4-bit (v2, quality-optimized)
Part of a systematic 4-way quantization study on Qwen2.5-1.5B-Instruct. See the study overview for comparisons across all variants.
An improved GPTQ 4-bit quantization of Qwen/Qwen2.5-1.5B-Instruct, using a quality-optimized configuration derived from ablation against GPTQ v1.
Three targeted fixes over v1:
- Smaller group size (`128 → 64`): each scaling factor covers fewer weights, reducing per-group quantization error at the cost of a slightly larger file (1.16 GB → 1.19 GB on disk).
- Activation ordering ON (`desc_act=False → True`): reorders weight columns by descending activation magnitude before quantizing, so the most influential weights are quantized first, before error from earlier columns has accumulated.
- Instruction-domain calibration: OpenHermes-2.5 chat data instead of Wikitext-2 prose, a better signal for an instruction-tuned model.
Result: the perplexity delta drops from +3.46 (v1) to +1.19 (v2), roughly a 65% improvement in quality preservation, while disk size grows only from 1.16 GB to 1.19 GB (about 3%).
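The effect of group size can be illustrated with a minimal sketch. It uses plain round-to-nearest quantization on synthetic Gaussian weights, which is an assumption made for brevity and not the GPTQ algorithm itself; the function name `groupwise_rtn_error` and the weight shape are hypothetical.

```python
import numpy as np

def groupwise_rtn_error(weights: np.ndarray, group_size: int, bits: int = 4) -> float:
    """Mean squared error of asymmetric round-to-nearest quantization, one scale per group."""
    levels = 2 ** bits - 1
    groups = weights.reshape(-1, group_size)           # one row per quantization group
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / levels                   # step size shared by the group
    scale = np.where(scale == 0, 1.0, scale)           # guard against constant groups
    q = np.clip(np.round((groups - w_min) / scale), 0, levels)
    dequant = q * scale + w_min
    return float(np.mean((groups - dequant) ** 2))

rng = np.random.default_rng(0)
# Synthetic stand-in for one weight matrix (assumption, not the real model weights)
weights = rng.normal(0.0, 0.02, size=1536 * 128)

err_128 = groupwise_rtn_error(weights, group_size=128)
err_64 = groupwise_rtn_error(weights, group_size=64)
print(f"MSE at group_size=128: {err_128:.3e}")
print(f"MSE at group_size=64:  {err_64:.3e}")
```

Smaller groups span a narrower value range, so each group gets a finer step size. GPTQ adds Hessian-based error compensation on top of this, so the numbers here only indicate the direction of the effect.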
Benchmark Results
All measurements on A100-40GB, batch size 1, HuggingFace Transformers, 50 tokens generated, 10-run average.
| Metric | FP16 baseline | GPTQ v1 | This model | Delta vs FP16 |
|---|---|---|---|---|
| VRAM usage | 3.56 GB | 1.63 GB | 1.66 GB | −53.4% |
| Disk size | 3.1 GB | 1.16 GB | 1.19 GB | −61.6% |
| Throughput | 38.7 tok/s | 16.1 tok/s | 13.8 tok/s | −64% |
| Latency (TTFT) | 26.8 ms | 63.8 ms | 76.9 ms | +187% |
| Perplexity (Wikitext-2) | 11.90 | 15.36 | 13.09 | +1.19 |
Quality gate (PPL delta < 1.0): FAIL (borderline, 0.19 above threshold)
The marginal quality gate failure is consistent with published literature: at INT4, 1.5B models typically see +1.0 to 2.0 PPL degradation due to limited parameter redundancy. A delta of +1.19 represents good quality preservation for this model size and bit-width.
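The gate arithmetic can be reproduced with a short sketch. The helper names `perplexity` and `quality_gate` are hypothetical, and the perplexity definition (exponential of the mean per-token negative log-likelihood) is the standard one, assumed here because the card does not state how perplexity was computed.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def quality_gate(ppl_quant, ppl_ref, max_delta=1.0):
    """Return (delta, passed) for the 'PPL delta < max_delta' rule."""
    delta = ppl_quant - ppl_ref
    return delta, delta < max_delta

# Perplexity values taken from the benchmark table
ppl_fp16, ppl_v1, ppl_v2 = 11.90, 15.36, 13.09

delta_v1, pass_v1 = quality_gate(ppl_v1, ppl_fp16)
delta_v2, pass_v2 = quality_gate(ppl_v2, ppl_fp16)
improvement = (delta_v1 - delta_v2) / delta_v1   # relative reduction of the PPL delta

print(f"v1 delta: {delta_v1:+.2f}, gate passed: {pass_v1}")
print(f"v2 delta: {delta_v2:+.2f}, gate passed: {pass_v2}")
print(f"relative improvement v1 -> v2: {improvement:.1%}")
```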
What Changed vs v1
| Parameter | v1 | v2 | Effect |
|---|---|---|---|
| `group_size` | 128 | 64 | Finer scaling, 65% PPL improvement |
| `desc_act` | False | True | Activation ordering ON |
| `damp_percent` | 0.01 | 0.01 | Unchanged, stable for Qwen2 |
| Calibration data | Wikitext-2 | OpenHermes-2.5 | Domain-matched |
| Quantization time | ~4 min | ~30 min | Cost of desc_act=True |
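The storage cost of halving the group size can be estimated with a back-of-the-envelope sketch. It assumes each group stores one 16-bit scale and one 4-bit zero point, a common GPTQ layout that this card does not confirm, and it ignores layers that stay unquantized. `bits_per_weight` is a hypothetical helper.

```python
def bits_per_weight(bits: int, group_size: int, scale_bits: int = 16, zero_bits: int = 4) -> float:
    """Effective storage per weight: payload bits plus per-group metadata spread over the group."""
    return bits + (scale_bits + zero_bits) / group_size

bpw_128 = bits_per_weight(4, 128)
bpw_64 = bits_per_weight(4, 64)
overhead = bpw_64 / bpw_128 - 1   # relative growth of the quantized layers only

print(f"group_size=128: {bpw_128:.4f} bits/weight")
print(f"group_size=64:  {bpw_64:.4f} bits/weight")
print(f"relative increase for quantized layers: {overhead:.1%}")
```

Under these assumptions the quantized layers grow by a few percent. Embeddings and other unquantized tensors typically dilute that figure, so the whole-file increase would be smaller.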
Quantization Config
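```python
from transformers import GPTQConfig

# openhermes_samples: list of calibration text strings (128 OpenHermes-2.5 samples)
gptq_config = GPTQConfig(
    bits=4,
    group_size=64,               # finer scaling than standard 128
    desc_act=True,               # activation ordering ON; key quality improvement
    damp_percent=0.01,           # Hessian stability, default for Qwen2 architecture
    dataset=openhermes_samples,  # instruction-domain calibration
)
```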
| Parameter | Value |
|---|---|
| Method | GPTQ |
| Bits | 4 |
| Group size | 64 |
| desc_act | True |
| damp_percent | 0.01 |
| Calibration | OpenHermes-2.5, 128 samples |
| Framework | auto-gptq==0.7.1, transformers==4.44.0, optimum==1.16.0 |
| Hardware | A100-40GB |
| Date | February 2025 |
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Mohaaxa/qwen2.5-1.5b-gptq-4bit-v2"

# Load the quantized weights; device_map="auto" places them on the available device
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantization in one paragraph."},
]

# Render the conversation with the chat template and append the assistant prompt
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Move inputs to the model's device instead of hard-coding "cuda"
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Greedy decoding for reproducible output
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
Requirements: pip install transformers==4.44.0 auto-gptq==0.7.1 optimum==1.16.0
Why is throughput slower than v1?
desc_act=True adds a column-reordering step during quantization that produces better weights, but those reordered weights are slightly harder for the GPTQ CUDA kernel to process at inference time, hence 13.8 vs 16.1 tok/s. This is a known tradeoff. For throughput-critical deployments, see the AWQ variant, which achieves 24.9 tok/s with comparable quality (+1.26 PPL).
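One plausible way to picture the kernel cost is through the per-channel group index. The sketch below assumes groups are formed over the activation-sorted column order and then mapped back to the original order, as in common GPTQ implementations. The card does not confirm this detail, and the activations and names (`g_idx_plain`, `g_idx_act`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
in_features, group_size = 256, 64

# Hypothetical calibration activations with uneven per-channel magnitudes
X = rng.normal(size=(512, in_features)) * rng.uniform(0.1, 3.0, size=in_features)
h_diag = (X ** 2).sum(axis=0)            # diagonal of H = X^T X, one value per input channel

perm = np.argsort(-h_diag)               # channels in descending activation magnitude
invperm = np.argsort(perm)               # position of each original channel in that order

# Without activation ordering: channel i belongs to group i // group_size
g_idx_plain = np.arange(in_features) // group_size
# With activation ordering: groups follow the sorted order, then map back to original channels
g_idx_act = (np.arange(in_features) // group_size)[invperm]

print("plain g_idx is sorted:    ", bool(np.all(np.diff(g_idx_plain) >= 0)))
print("act-order g_idx is sorted:", bool(np.all(np.diff(g_idx_act) >= 0)))
```

When the group index is no longer contiguous, a kernel has to look up the scale for each channel individually instead of reusing one scale across a run of consecutive channels, which would be consistent with the lower throughput reported here.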
Study Context
| Variant | VRAM | Tok/s | PPL | PPL Δ | Quality |
|---|---|---|---|---|---|
| FP16 baseline | 3.56 GB | 38.7 | 11.90 | n/a | reference |
| GPTQ v1 | 1.63 GB | 16.1 | 15.36 | +3.46 | FAIL |
| GPTQ v2 (this) | 1.66 GB | 13.8 | 13.09 | +1.19 | FAIL (borderline) |
| AWQ | 1.16 GB | 24.9 | 13.16 | +1.26 | FAIL |
GPTQ v2 achieves the best raw perplexity of the three quantized variants (+1.19) at the cost of the slowest inference (13.8 tok/s). If quality is the priority and throughput is secondary, this is the recommended variant. For deployments where both matter, AWQ provides a better overall tradeoff.
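The recommendation can be checked directly against the study table. The sketch below copies the table values into a dictionary; the structure and key names are illustrative only.

```python
# Values copied from the study table (quantized variants only)
variants = {
    "GPTQ v1": {"vram_gb": 1.63, "tok_s": 16.1, "ppl": 15.36},
    "GPTQ v2": {"vram_gb": 1.66, "tok_s": 13.8, "ppl": 13.09},
    "AWQ":     {"vram_gb": 1.16, "tok_s": 24.9, "ppl": 13.16},
}

best_quality = min(variants, key=lambda name: variants[name]["ppl"])    # lowest perplexity
fastest = max(variants, key=lambda name: variants[name]["tok_s"])       # highest throughput

print(f"best perplexity:    {best_quality}")
print(f"highest throughput: {fastest}")
```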
Study methodology: Mohaaxa