# Qwen3-14B INT4 Pure (GPTQ + Hadamard)

Pure INT4 quantization of Qwen/Qwen3-14B: all layers are quantized to INT4. Smaller size: 9.2 GB vs 11 GB for the mixed version. Slightly lower quality, but still competitive.
## Quality
| Metric | This Model | Mixed INT4/INT8 | llama.cpp Q4_K_M |
|---|---|---|---|
| Perplexity (WikiText-2) | 7.787 | 7.692 | 7.715 |
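For context, perplexity is the exponential of the mean per-token negative log-likelihood over the evaluation set, so lower is better and the three models above are within about 1% of each other. A minimal sketch of the metric itself (not the full WikiText-2 harness):

```python
import math

def perplexity(token_logprobs):
    """Perplexity from the natural-log probabilities of the observed tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)  # mean negative log-likelihood
    return math.exp(nll)

# A model that assigns probability 0.5 to every token has perplexity 2.
print(perplexity([math.log(0.5)] * 4))  # ≈ 2.0
```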
## Performance (AMD Radeon AI PRO R9700)
| Metric | Value |
|---|---|
| Decode (ctx=128) | 62.7 t/s |
| Prefill (pp512) | 2076 t/s |
| VRAM | ~8.5 GB |
## Quantization Details
- Method: INT4 asymmetric with Hadamard rotation + GPTQ calibration
- All layers: INT4 (no INT8 fallback for sensitive layers)
- Block size: 32
- Calibration: 256 samples × 256 tokens from RedPajama
- KV cache: FP8 E4M3 at inference time
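The Hadamard rotation spreads weight outliers across the channel dimension before block-wise quantization, which shrinks the per-block dynamic range. A NumPy sketch of block-wise asymmetric INT4 round-tripping behind a Hadamard rotation, omitting GPTQ error compensation (which needs calibration data); all names here are illustrative, not the rdna4-quant implementation:

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester construction (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quant_dequant_int4(w, block=32):
    """Asymmetric INT4 quantize + dequantize per block of 32 columns."""
    out = np.empty_like(w)
    for start in range(0, w.shape[1], block):
        blk = w[:, start:start + block]
        wmin = blk.min(axis=1, keepdims=True)
        wmax = blk.max(axis=1, keepdims=True)
        scale = np.maximum(wmax - wmin, 1e-8) / 15.0   # 16 levels: 0..15
        zp = np.round(-wmin / scale)                    # asymmetric zero point
        q = np.clip(np.round(blk / scale) + zp, 0, 15)  # stored INT4 codes
        out[:, start:start + block] = (q - zp) * scale  # dequantized values
    return out

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 64))
H = hadamard(64)
w_rot = w @ H                              # rotate to spread outliers
w_hat = quant_dequant_int4(w_rot) @ H.T    # quantize, dequantize, rotate back
err = np.abs(w - w_hat).max()
print("max abs reconstruction error:", round(err, 3))
```

Because the rotation is orthonormal, it can be folded into adjacent weight matrices at export time and adds no inference cost.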
## File Format

Custom `.pt` format; requires the rdna4-quant engine.

```
embed.pt        # embedding weights (FP16)
layer_000.pt    # layer 0 (quantized weights + scales + metadata)
...
layer_039.pt    # layer 39
final_norm.pt   # final RMSNorm
lm_head.pt      # LM head (FP16)
meta.pt         # quantization metadata
```
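Since the engine expects this exact per-file layout, a quick pre-flight check can catch an incomplete download before a failed load. A small stdlib sketch; the 40-layer count comes from Qwen3-14B, and the filenames simply mirror the listing above:

```python
from pathlib import Path

NUM_LAYERS = 40  # Qwen3-14B: layer_000.pt .. layer_039.pt

def expected_files(num_layers: int = NUM_LAYERS) -> list[str]:
    """Filenames the checkpoint directory should contain."""
    names = ["embed.pt"]
    names += [f"layer_{i:03d}.pt" for i in range(num_layers)]
    names += ["final_norm.pt", "lm_head.pt", "meta.pt"]
    return names

def missing_files(quant_dir: str) -> list[str]:
    """Return any expected files absent from quant_dir."""
    root = Path(quant_dir)
    return [n for n in expected_files() if not (root / n).is_file()]

# e.g. missing_files("quantized_v5_pure_int4") == [] for a complete download
```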
## Usage

```shell
git clone https://github.com/JohnTDI-cpu/rdna4-quant
cd rdna4-quant
pip install -r requirements.txt

# Download weights
huggingface-cli download JohnTdi/Qwen3-14B-INT4-Pure-GPTQ --local-dir quantized_v5_pure_int4

# Run inference (interactive chat)
python int4_engine_v5.py --quant-dir quantized_v5_pure_int4 --chat

# Or start an API server
python api_server.py --quant-dir quantized_v5_pure_int4
```
## Hardware Requirements

- AMD GPU supported by ROCm (RDNA3/RDNA4, MI300X)
- ~9 GB of VRAM
- ROCm 6.x or 7.x, PyTorch 2.6+
## License

MIT, the same as the inference engine. The base model (Qwen3-14B) has its own license from Alibaba.