# Qwen3-14B INT4 Pure (GPTQ + Hadamard)

Pure INT4 quantization of Qwen/Qwen3-14B: all layers are quantized to INT4. Smaller size: 9.2 GB vs 11 GB for the mixed version. Slightly lower quality, but still competitive.
## Quality
| Metric | This Model | Mixed INT4/INT8 | llama.cpp Q4_K_M |
|---|---|---|---|
| Perplexity (WikiText-2) | 7.787 | 7.692 | 7.715 |
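For context, perplexity is the exponential of the mean per-token negative log-likelihood over the evaluation set, so lower is better and the three models above are within about 1% of each other. A minimal sketch of the metric itself (not the full WikiText-2 harness):

```python
import math

def perplexity(token_logprobs):
    """Perplexity from the natural-log probabilities of the observed tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)  # mean negative log-likelihood
    return math.exp(nll)

# A model that assigns probability 0.5 to every token has perplexity 2.
print(perplexity([math.log(0.5)] * 4))  # ≈ 2.0
```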
## Performance (AMD Radeon AI PRO R9700)
| Metric | Value |
|---|---|
| Decode (ctx=128) | 62.7 t/s |
| Prefill (pp512) | 2076 t/s |
| VRAM | ~8.5 GB |
## Quantization Details
- Method: INT4 asymmetric with Hadamard rotation + GPTQ calibration
- All layers: INT4 (no INT8 fallback for sensitive layers)
- Block size: 32
- Calibration: 256 samples × 256 tokens from RedPajama
- KV cache: FP8 E4M3 at inference time
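The Hadamard rotation spreads weight outliers across the channel dimension before block-wise quantization, which shrinks the per-block dynamic range. A NumPy sketch of block-wise asymmetric INT4 round-tripping behind a Hadamard rotation, omitting GPTQ error compensation (which needs calibration data); all names here are illustrative, not the rdna4-quant implementation:

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester construction (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quant_dequant_int4(w, block=32):
    """Asymmetric INT4 quantize + dequantize per block of 32 columns."""
    out = np.empty_like(w)
    for start in range(0, w.shape[1], block):
        blk = w[:, start:start + block]
        wmin = blk.min(axis=1, keepdims=True)
        wmax = blk.max(axis=1, keepdims=True)
        scale = np.maximum(wmax - wmin, 1e-8) / 15.0   # 16 levels: 0..15
        zp = np.round(-wmin / scale)                    # asymmetric zero point
        q = np.clip(np.round(blk / scale) + zp, 0, 15)  # stored INT4 codes
        out[:, start:start + block] = (q - zp) * scale  # dequantized values
    return out

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 64))
H = hadamard(64)
w_rot = w @ H                              # rotate to spread outliers
w_hat = quant_dequant_int4(w_rot) @ H.T    # quantize, dequantize, rotate back
err = np.abs(w - w_hat).max()
print("max abs reconstruction error:", round(err, 3))
```

Because the rotation is orthonormal, it can be folded into adjacent weight matrices at export time and adds no inference cost.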
## File Format

Custom `.pt` format; requires the rdna4-quant engine.

```
embed.pt        # embedding weights (FP16)
layer_000.pt    # layer 0 (quantized weights + scales + metadata)
...
layer_039.pt    # layer 39
final_norm.pt   # final RMSNorm
lm_head.pt      # LM head (FP16)
meta.pt         # quantization metadata
```
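Since the engine expects this exact per-file layout, a quick pre-flight check can catch an incomplete download before a failed load. A small stdlib sketch; the 40-layer count comes from Qwen3-14B, and the filenames simply mirror the listing above:

```python
from pathlib import Path

NUM_LAYERS = 40  # Qwen3-14B: layer_000.pt .. layer_039.pt

def expected_files(num_layers: int = NUM_LAYERS) -> list[str]:
    """Filenames the checkpoint directory should contain."""
    names = ["embed.pt"]
    names += [f"layer_{i:03d}.pt" for i in range(num_layers)]
    names += ["final_norm.pt", "lm_head.pt", "meta.pt"]
    return names

def missing_files(quant_dir: str) -> list[str]:
    """Return any expected files absent from quant_dir."""
    root = Path(quant_dir)
    return [n for n in expected_files() if not (root / n).is_file()]

# e.g. missing_files("quantized_v5_pure_int4") == [] for a complete download
```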
## Usage

```shell
git clone https://github.com/JohnTDI-cpu/rdna4-quant
cd rdna4-quant
pip install -r requirements.txt

# Download weights
huggingface-cli download JohnTdi/Qwen3-14B-INT4-Pure-GPTQ --local-dir quantized_v5_pure_int4

# Run inference (interactive chat)
python int4_engine_v5.py --quant-dir quantized_v5_pure_int4 --chat

# Or start an API server
python api_server.py --quant-dir quantized_v5_pure_int4
```
## Hardware Requirements

- AMD GPU supported by ROCm (RDNA3/RDNA4, MI300X)
- ~9 GB of VRAM
- ROCm 6.x or 7.x, PyTorch 2.6+
## License

MIT, the same as the inference engine. The base model (Qwen3-14B) has its own license from Alibaba.