Qwen3-14B INT4 Pure (GPTQ + Hadamard)

Pure INT4 quantization of Qwen/Qwen3-14B: all layers are quantized to INT4.

Smaller size: 9.2 GB vs. 11 GB for the mixed INT4/INT8 version. Quality is slightly lower but still competitive.

Quality

Metric                    This Model   Mixed INT4/INT8   llama.cpp Q4_K_M
Perplexity (WikiText-2)   7.787        7.692             7.715
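Perplexity is the exponential of the mean per-token negative log-likelihood, so a WikiText-2 score of 7.787 corresponds to a mean NLL of roughly 2.05 nats per token. A minimal sketch (the NLL values here are illustrative, not from an actual evaluation run):

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp of the mean per-token negative log-likelihood (nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A mean NLL of ~2.052 nats/token corresponds to a perplexity near 7.78,
# in the same range as the WikiText-2 numbers above.
ppl = perplexity([2.052, 2.052, 2.052])
```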

Performance (AMD Radeon AI PRO R9700)

Metric             Value
Decode (ctx=128)   62.7 t/s
Prefill (pp512)    2076 t/s
VRAM               ~8.5 GB

Quantization Details

  • Method: INT4 asymmetric with Hadamard rotation + GPTQ calibration
  • All layers: INT4 (no INT8 sensitive layers)
  • Block size: 32
  • Calibration: 256 samples × 256 tokens from RedPajama
  • KV cache: FP8 E4M3 at inference time
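The core of the scheme above is block-wise asymmetric INT4: each block of 32 weights gets its own scale and zero-point so that 16 levels (0..15) cover the block's range. A minimal NumPy sketch of that step (GPTQ error compensation and the Hadamard rotation are omitted here):

```python
import numpy as np

BLOCK = 32  # block size from the details above

def quantize_int4_asym(w):
    """Block-wise asymmetric INT4 quantization (sketch only; the real
    pipeline also applies a Hadamard rotation and GPTQ calibration)."""
    w = w.reshape(-1, BLOCK)
    lo = w.min(axis=1, keepdims=True)
    hi = w.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0            # 16 quantization levels: 0..15
    scale[scale == 0] = 1.0             # guard against all-constant blocks
    zero = np.round(-lo / scale)        # zero-point per block
    q = np.clip(np.round(w / scale) + zero, 0, 15).astype(np.uint8)
    return q, scale, zero

def dequantize(q, scale, zero):
    return (q.astype(np.float32) - zero) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(4 * BLOCK).astype(np.float32)
q, s, z = quantize_int4_asym(w)
err = np.abs(dequantize(q, s, z).ravel() - w).max()  # bounded by scale / 2
```

The asymmetric zero-point matters because weight blocks are rarely centered at zero; it shifts the 16 levels to span exactly [min, max] of each block.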

File Format

Custom .pt format; requires the rdna4-quant engine.

embed.pt          - embedding weights (FP16)
layer_000.pt      - layer 0 (quantized weights + scales + metadata)
...
layer_039.pt      - layer 39
final_norm.pt     - final RMSNorm
lm_head.pt        - LM head (FP16)
meta.pt           - quantization metadata
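Given the layout above, a loader would expect one file per transformer layer plus the four fixed files. A small sketch that enumerates the expected filenames (the per-file tensor keys are internal to rdna4-quant and not assumed here):

```python
# Qwen3-14B has 40 transformer layers (layer_000.pt .. layer_039.pt above).
NUM_LAYERS = 40

def expected_files(num_layers=NUM_LAYERS):
    """List the files the quantized checkpoint directory should contain."""
    files = ["embed.pt"]
    files += [f"layer_{i:03d}.pt" for i in range(num_layers)]
    files += ["final_norm.pt", "lm_head.pt", "meta.pt"]
    return files

files = expected_files()
```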

Usage

git clone https://github.com/JohnTDI-cpu/rdna4-quant
cd rdna4-quant
pip install -r requirements.txt

# Download weights
huggingface-cli download JohnTdi/Qwen3-14B-INT4-Pure-GPTQ --local-dir quantized_v5_pure_int4

# Run inference
python int4_engine_v5.py --quant-dir quantized_v5_pure_int4 --chat

# Or start API server
python api_server.py --quant-dir quantized_v5_pure_int4
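The exact request schema served by api_server.py is not documented here; assuming an OpenAI-style chat-completion endpoint (a common choice for local servers, but an assumption), a client request body might look like:

```python
import json

# Hypothetical request payload; field names follow the OpenAI chat schema,
# which api_server.py may or may not implement.
payload = {
    "model": "Qwen3-14B-INT4-Pure-GPTQ",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 128,
}
body = json.dumps(payload)
```

You would POST this to the server's chat endpoint; check api_server.py for the actual route and port.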

Hardware Requirements

  • AMD GPU with ROCm support (RDNA3/RDNA4, MI300X)
  • ~9 GB VRAM
  • ROCm 6.x or 7.x, PyTorch 2.6+
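The ~9 GB figure is consistent with a back-of-envelope estimate: Qwen3-14B has about 14.8B parameters, and at 4 bits per weight plus per-block metadata (block size 32, assuming an FP16 scale and a 2-byte zero-point per block) the weights alone land near the 9.2 GB file size; activations and the FP8 KV cache come on top.

```python
# Back-of-envelope size estimate for the INT4 weights (assumptions:
# ~14.8e9 parameters, 4 bits each, one FP16 scale + one 2-byte
# zero-point per 32-weight block; activations/KV cache not included).
params = 14.8e9
weight_bytes = params * 4 / 8                 # 4-bit packed weights
overhead_bytes = (params / 32) * (2 + 2)      # per-block scale + zero-point
total_gb = (weight_bytes + overhead_bytes) / 1e9
```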

License

MIT, the same license as the inference engine. The base model (Qwen3-14B) is distributed under its own license from Alibaba.
