Qwen2.5-1.5B-Instruct · GPTQ 4-bit (v1)
Part of a systematic 4-way quantization study on Qwen2.5-1.5B-Instruct. See the study overview for comparisons across all variants.
A 4-bit GPTQ quantization of Qwen/Qwen2.5-1.5B-Instruct. This is the baseline quantization — preserved as a reference point to show what default GPTQ settings produce and why they fall short.
Benchmark Results
All measurements were taken on an A100-40GB at batch size 1 with Hugging Face Transformers, generating 50 tokens from a fixed prompt and averaging over 10 runs.
| Metric | FP16 baseline | This model | Delta |
|---|---|---|---|
| VRAM usage | 3.56 GB | 1.63 GB | 54.2% smaller |
| Disk size | 3.1 GB | 1.16 GB | 62.6% smaller |
| Throughput | 38.7 tok/s | 16.1 tok/s | −58% |
| Latency (TTFT) | 26.8 ms | 63.8 ms | +138% |
| Perplexity (Wikitext-2) | 11.90 | 15.36 | +3.46 ❌ |
Quality gate (PPL delta < 1.0): ❌ FAIL
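The Delta column and the quality gate can be recomputed from the raw numbers in the table. The following is a minimal sketch; the helper names `pct_change` and `passes_gate` are hypothetical and are not taken from the study's code.

```python
# Recompute the Delta column from the benchmark table (values copied from the table).
FP16 = {"vram_gb": 3.56, "disk_gb": 3.1, "tok_s": 38.7, "ttft_ms": 26.8, "ppl": 11.90}
GPTQ_V1 = {"vram_gb": 1.63, "disk_gb": 1.16, "tok_s": 16.1, "ttft_ms": 63.8, "ppl": 15.36}


def pct_change(baseline: float, value: float) -> float:
    """Signed relative change of `value` versus `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


def passes_gate(baseline_ppl: float, ppl: float, max_delta: float = 1.0) -> bool:
    """Quality gate from the text: the absolute perplexity increase must stay below max_delta."""
    return (ppl - baseline_ppl) < max_delta


# Relative change for every metric except perplexity, which the table reports as an absolute delta.
deltas = {k: round(pct_change(FP16[k], GPTQ_V1[k]), 1) for k in FP16 if k != "ppl"}
ppl_delta = round(GPTQ_V1["ppl"] - FP16["ppl"], 2)

print(deltas)
print(ppl_delta, passes_gate(FP16["ppl"], GPTQ_V1["ppl"]))
```

Negative values for VRAM and disk correspond to the "smaller" entries in the table; the throughput and latency signs match the table directly.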
The +3.46 perplexity degradation is significant. Root causes identified in this study:
- `group_size=128` is too coarse for a 1.5B model: fewer parameters means each scaling factor covers a larger proportion of the weight space
- `desc_act=False` disables activation ordering, quantizing the most sensitive weights with already-accumulated error
- Wikitext-2 calibration is domain-mismatched for an instruction-tuned model
See GPTQ v2 for the revised configuration, which lowers the perplexity delta to +1.19 but still misses the quality gate (see the table under Study Context).
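The effect of group size named in the first root cause can be illustrated in isolation. The sketch below uses plain min-max round-to-nearest quantization on a synthetic Gaussian weight row; it is not GPTQ's error-compensating procedure, the weights are not taken from the model, and the function names are hypothetical. It only shows that a smaller group narrows each group's value range and so shrinks the rounding step.

```python
import random


def quantize_group(weights, bits=4):
    """Asymmetric min-max round-to-nearest quantization of one group; returns dequantized values."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1            # 15 steps between 16 levels for 4-bit
    scale = (hi - lo) / levels or 1.0   # fall back to 1.0 if the group is constant
    return [round((w - lo) / scale) * scale + lo for w in weights]


def groupwise_mse(weights, group_size, bits=4):
    """Mean squared error after quantizing `weights` in consecutive groups of `group_size`."""
    total = 0.0
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        dequantized = quantize_group(group, bits)
        total += sum((w - d) ** 2 for w, d in zip(group, dequantized))
    return total / len(weights)


random.seed(0)
# Synthetic weight row (assumption: zero-mean Gaussian), not real model weights.
row = [random.gauss(0.0, 0.02) for _ in range(4096)]

mse_128 = groupwise_mse(row, group_size=128)
mse_32 = groupwise_mse(row, group_size=32)
print(f"group_size=128: {mse_128:.3e}  group_size=32: {mse_32:.3e}")
```

Finer groups cost extra storage for the additional scales and zero points, so the choice is a trade-off rather than a free improvement.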
Quantization Config
```python
GPTQConfig(
    bits=4,
    group_size=128,             # coarse; identified as root cause of quality failure
    desc_act=False,             # activation ordering OFF
    damp_percent=0.01,
    dataset=wikitext2_samples,  # domain mismatch for instruction model
)
```
| Parameter | Value |
|---|---|
| Method | GPTQ |
| Bits | 4 |
| Group size | 128 |
| desc_act | False |
| Calibration | Wikitext-2, 128 samples |
| Framework | auto-gptq==0.7.1, transformers==4.44.0 |
| Hardware | A100-40GB |
| Date | February 2025 |
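Group size also determines storage overhead, which is worth keeping in mind when comparing configurations. The sketch below estimates average bits per quantized weight, assuming one 16-bit scale and one 4-bit zero point per group. Those widths are assumptions about the packed format, and the estimate ignores layers left unquantized (such as embeddings), so it does not predict the file sizes reported above.

```python
def effective_bits(bits: int, group_size: int, scale_bits: int = 16, zero_bits: int = 4) -> float:
    """Average storage per weight: payload bits plus per-group scale and zero-point overhead."""
    return bits + (scale_bits + zero_bits) / group_size


# Compare the group size used here (128) with finer alternatives.
for g in (128, 64, 32):
    print(f"group_size={g}: {effective_bits(4, g):.3f} bits/weight")
```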
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the quantized checkpoint; device_map="auto" places it on the available device.
model = AutoModelForCausalLM.from_pretrained(
    "Mohaaxa/qwen2.5-1.5b-gptq-4bit",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Mohaaxa/qwen2.5-1.5b-gptq-4bit")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantization in one paragraph."},
]

# Render the chat template to text, then tokenize and move tensors to the model's device.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Greedy decoding for a reproducible answer.
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
Requirements: pip install transformers==4.44.0 auto-gptq==0.7.1 optimum==1.16.0
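To check throughput on your own hardware in the spirit of the benchmark setup (50 generated tokens, averaged over 10 runs), a timing harness along the following lines may help. The study's actual harness is not shown here, so `measure_throughput` and `fake_generate` are hypothetical names, and the stand-in workload exists only so that the sketch runs without a GPU.

```python
import time
from statistics import mean
from typing import Callable


def measure_throughput(generate: Callable[[int], None], new_tokens: int = 50, runs: int = 10) -> float:
    """Average tokens per second over `runs` calls, each producing `new_tokens` tokens."""
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(new_tokens)
        elapsed = time.perf_counter() - start
        rates.append(new_tokens / elapsed)
    return mean(rates)


# Stand-in workload (assumption): sleeps 1 ms per "token" instead of running a model.
def fake_generate(new_tokens: int) -> None:
    time.sleep(0.001 * new_tokens)


print(f"{measure_throughput(fake_generate, runs=3):.1f} tok/s")
```

With the model loaded as above, `generate` could wrap `model.generate(**inputs, max_new_tokens=n, min_new_tokens=n, do_sample=False)` so that every run emits the same number of tokens; on a GPU, calling `torch.cuda.synchronize()` before reading the clock avoids timing only the kernel launches.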
Why is throughput slower than FP16?
This is expected on high-end GPUs. The A100's FP16 tensor cores are fast enough that, at batch size 1, the overhead of INT4 dequantization during inference exceeds the memory-bandwidth savings. GPTQ's real advantage is on memory-constrained hardware (consumer GPUs, edge devices such as Jetson), where the 54% VRAM reduction allows running a model that otherwise would not fit.
Study Context
This model is part of a 4-variant benchmark study:
| Variant | VRAM | Tok/s | PPL | PPL Δ | Quality |
|---|---|---|---|---|---|
| FP16 baseline | 3.56 GB | 38.7 | 11.90 | — | reference |
| GPTQ v1 (this) | 1.63 GB | 16.1 | 15.36 | +3.46 | ❌ FAIL |
| GPTQ v2 | 1.66 GB | 13.8 | 13.09 | +1.19 | ❌ FAIL |
| AWQ | 1.16 GB | 24.9 | 13.16 | +1.26 | ❌ FAIL |
Key finding: On a 1.5B model at INT4, perplexity degradation above 1.0 is expected and consistent with the quantization research literature, which shows that small models suffer disproportionately under aggressive quantization. AWQ achieves the best VRAM footprint (1.16 GB, a 67% reduction) while roughly matching GPTQ v2 quality (+1.26 vs. +1.19 PPL). For this model size, GGUF Q4_K_M would likely achieve a lower PPL delta through mixed-precision per-layer quantization.
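Because perplexity is the exponential of the mean per-token negative log-likelihood, the same results can also be read as an increase in per-token loss, which does not depend on the baseline's absolute perplexity. The sketch below is a supplementary view computed from the table values; it is not the metric the study's quality gate uses, and `extra_nats_per_token` is a hypothetical helper name.

```python
import math

FP16_PPL = 11.90
VARIANT_PPL = {"GPTQ v1": 15.36, "GPTQ v2": 13.09, "AWQ": 13.16}  # values from the table above


def extra_nats_per_token(ppl: float, baseline_ppl: float = FP16_PPL) -> float:
    """Increase in mean per-token negative log-likelihood, using PPL = exp(mean NLL)."""
    return math.log(ppl / baseline_ppl)


for name, ppl in VARIANT_PPL.items():
    print(f"{name}: PPL delta {ppl - FP16_PPL:+.2f}, "
          f"extra loss {extra_nats_per_token(ppl):.3f} nats/token")
```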
Study methodology: Mohaaxa