GPT-OSS-20B INT4 — TurboQuant KV Cache Quantization
GPT-OSS-20B (OpenAI's 32-expert MoE model) quantized to INT4 via AutoRound, with TurboQuant V3 KV cache quantization applied at runtime on an Intel Arc 140V GPU.
TurboQuant applies Lloyd-Max + Random Rotation vector quantization to key/value cache tensors after each decode step via the OpenVINO state API, reducing KV memory usage.
Base model: OpenVINO/gpt-oss-20b-int4-ov
Runtime: OpenVINO 2026.1.0 + optimum-intel
Device: Intel Arc 140V GPU (Lunar Lake iGPU)
Benchmark Results
Test configuration
| Item | Value |
|---|---|
| Device | Intel Arc 140V GPU |
| Max new tokens | 200 |
| Input context | ~15–25 tokens (short prompts) |
| Total context length | ~215–225 tokens |
| Runs per prompt | 3 (averaged) |
| TurboQuant K bits | 6 |
| TurboQuant V bits | 4 |
| Residual window | 64 (recent tokens kept in FP32) |
| Compression frequency | Every 4 decode steps |
Results vs Baseline (3 prompts averaged)
| Mode | Avg Latency (s) | TPOT (ms/tok) | Throughput (tok/s) | Mem Δ (MB) |
|---|---|---|---|---|
| Baseline (no quant) | 9.41 | 47.1 | 21.25 | ~5 |
| TurboQuant K6/V4 | 22.97 | 114.9 | 8.70 | ~114 |
| Change | +144% | +144% | -59% | +109 MB |
Per-prompt detail
| Prompt | Mode | Latency (s) | TPOT (ms) | Throughput (tok/s) | Mem Δ (MB) |
|---|---|---|---|---|---|
| MoE vs Dense | Baseline | 9.378 | 46.9 | 21.33 | +9.9 |
| MoE vs Dense | TurboQuant | 22.955 | 114.8 | 8.71 | +248.1 |
| Fibonacci | Baseline | 9.427 | 47.1 | 21.21 | +2.6 |
| Fibonacci | TurboQuant | 22.930 | 114.7 | 8.72 | +95.2 |
| OpenVINO advantages (Korean prompt) | Baseline | 9.432 | 47.2 | 21.21 | +2.9 |
| OpenVINO advantages (Korean prompt) | TurboQuant | 23.030 | 115.1 | 8.68 | −1.1 |
Note: Performance overhead stems from GPU↔CPU round-trips for each of the 48 KV state tensors (24 layers × K+V) per compression step. Production deployment requires GPU-native KV kernels.
Repository Contents
| File | Description |
|---|---|
| `openvino_model.bin` | INT4-quantized model weights (12 GB, git-lfs) |
| `openvino_model.xml` | OpenVINO IR graph definition |
| `openvino_tokenizer.bin/xml` | OpenVINO tokenizer |
| `openvino_detokenizer.bin/xml` | OpenVINO detokenizer |
| `config.json` | Model configuration |
| `export.py` | Download the model from HuggingFace |
| `infer.py` | Single-prompt inference with TurboQuant |
| `benchmark.py` | Baseline vs TurboQuant latency/memory benchmark |
| `turboquant/compressors_v3.py` | TurboQuant V3: MSECompressor, TurboQuantV3 |
| `turboquant/turboquant.py` | Core TurboQuant: rotation matrix, Lloyd-Max quantizer |
| `turboquant/lloyd_max.py` | Lloyd-Max codebooks for Beta/Gaussian distributions |
Installation
```shell
pip install "optimum[openvino]" transformers openvino psutil scipy huggingface_hub
```
Usage
Download the model
```shell
python export.py --output-dir ./model
```
Single inference with TurboQuant
```shell
python infer.py \
  --model-dir . \
  --device GPU \
  --prompt "Explain MoE transformer architectures." \
  --k-bits 6 \
  --v-bits 4 \
  --residual-window 64
```
Benchmark: Baseline vs TurboQuant
```shell
python benchmark.py \
  --model-dir . \
  --device GPU \
  --k-bits 6 \
  --v-bits 4 \
  --residual-window 64 \
  --compress-every 4 \
  --runs 3 \
  --output results.json
```
Arguments
| Argument | Default | Description |
|---|---|---|
| `--model-dir` | `.` | OpenVINO model directory |
| `--device` | `GPU` | GPU or CPU (automatic fallback) |
| `--k-bits` | `6` | Key quantization bit-width |
| `--v-bits` | `4` | Value quantization bit-width |
| `--residual-window` | `64` | Recent tokens kept in FP32 |
| `--compress-every` | `4` | Apply compression every N decode steps |
| `--max-new-tokens` | `200` | Tokens to generate |
| `--runs` | `3` | Benchmark runs per prompt |
How TurboQuant Works
TurboQuant V3 (community-informed improvements from the ICLR 2026 paper):
- Random rotation: apply an orthogonal matrix `P` to K/V vectors so each coordinate becomes near-Gaussian
- Lloyd-Max quantization: an MSE-optimal scalar quantizer (centroids minimize mean squared error) applied per coordinate
- Bit packing: pack quantized indices to reduce memory (e.g. packed 6-bit indices take 75% of byte-aligned int8 storage)
- Residual window: keep the most recent `residual_window` tokens in FP32 to preserve generation quality
- Asymmetric K/V bits: keys get more bits (6) than values (4), since keys need higher precision for attention inner products
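The rotate-then-quantize steps above can be sketched in NumPy. This is a minimal illustration of the idea, not the repo's `turboquant` implementation: the function names, the shared per-tensor codebook, and the toy cache shape are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(dim: int) -> np.ndarray:
    # QR decomposition of a Gaussian matrix yields a random orthogonal matrix
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q

def lloyd_max_1d(samples: np.ndarray, bits: int, iters: int = 20) -> np.ndarray:
    # 1-D Lloyd-Max (k-means on scalars): alternate nearest-level assignment
    # and centroid update to approach the MSE-optimal codebook.
    levels = np.quantile(samples, np.linspace(0.0, 1.0, 2 ** bits))
    for _ in range(iters):
        idx = np.abs(samples[:, None] - levels[None, :]).argmin(axis=1)
        for j in range(levels.size):
            members = samples[idx == j]
            if members.size:
                levels[j] = members.mean()
    return np.sort(levels)

# Toy K-cache slice: (seq_len, head_dim). Rotate, quantize each coordinate
# against a shared 6-bit codebook, then rotate back.
head_dim = 64
k = rng.standard_normal((128, head_dim))
P = random_rotation(head_dim)
k_rot = k @ P                                            # near-Gaussian coords
levels = lloyd_max_1d(k_rot.ravel(), bits=6)             # 64-level codebook
idx = np.abs(k_rot[..., None] - levels).argmin(axis=-1)  # indices to store/pack
k_rec = levels[idx] @ P.T                                # dequantize + unrotate
mse = float(np.mean((k - k_rec) ** 2))
```

Because `P` is orthogonal, quantization error in the rotated basis equals the error in the original basis, which is why a single Gaussian-tuned codebook can serve every coordinate.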
KV state access via OV state API:
# Read KV state tensor for layer 0 key
states = model.request.query_state()
state_map = {s.name: s for s in states}
k_arr = np.copy(state_map["past_key_values.0.key"].state.data) # (1, 8, seq, 64)
# Compress and write back
compressed = compressor.compress(torch.from_numpy(k_arr))
decompressed = compressor.decompress(compressed)
state_map["past_key_values.0.key"].state = ov.Tensor(decompressed.numpy())
State shape: (1, kv_heads=8, seq_len, head_dim=64) — 48 states (24 layers × K + V).
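The residual-window bookkeeping can be shown on a raw state array of that shape. A hedged sketch: `compress_kv_state` and the FP16 round-trip stand-in are hypothetical names, not the repo's API; the real path would call the TurboQuant compressor on the old tokens instead.

```python
import numpy as np

def compress_kv_state(arr: np.ndarray, residual_window: int, roundtrip) -> np.ndarray:
    # arr: (1, kv_heads, seq_len, head_dim). Tokens inside the residual
    # window stay in FP32; older tokens go through the lossy round-trip.
    seq_len = arr.shape[2]
    if seq_len <= residual_window:
        return arr  # nothing old enough to compress yet
    old, recent = arr[:, :, :-residual_window], arr[:, :, -residual_window:]
    return np.concatenate([roundtrip(old), recent], axis=2)

# Dummy lossy round-trip standing in for TurboQuant compress()/decompress()
fake_roundtrip = lambda x: x.astype(np.float16).astype(np.float32)

state = np.random.default_rng(1).standard_normal((1, 8, 200, 64)).astype(np.float32)
out = compress_kv_state(state, residual_window=64, roundtrip=fake_roundtrip)
```

In the benchmark configuration this split would run over all 48 states once every `--compress-every` decode steps.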
Hardware Requirements
- Intel Arc GPU (Xe series) or any Intel CPU
- At least 16 GB system RAM
- OpenVINO 2026.1.0+
License
Model weights follow the OpenAI GPT-OSS usage policy.
TurboQuant algorithm: ICLR 2026 paper.
Scripts in this repository are released under the Apache 2.0 License.