
GPT-OSS-20B INT4 — TurboQuant KV Cache Quantization

GPT-OSS-20B (OpenAI's 20B-parameter mixture-of-experts model, 32 experts) quantized to INT4 via AutoRound, with TurboQuant V3 KV cache quantization applied at runtime on an Intel Arc 140V GPU.

TurboQuant applies Lloyd-Max + Random Rotation vector quantization to key/value cache tensors after each decode step via the OpenVINO state API, reducing KV memory usage.

Base model: OpenVINO/gpt-oss-20b-int4-ov
Runtime: OpenVINO 2026.1.0 + optimum-intel
Device: Intel Arc 140V GPU (Lunar Lake iGPU)


Benchmark Results

Test configuration

| Item | Value |
|---|---|
| Device | Intel Arc 140V GPU |
| Max new tokens | 200 |
| Input context | ~15–25 tokens (short prompts) |
| Total context length | ~215–225 tokens |
| Runs per prompt | 3 (averaged) |
| TurboQuant K bits | 6 |
| TurboQuant V bits | 4 |
| Residual window | 64 (recent tokens kept in FP32) |
| Compression frequency | Every 4 decode steps |
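As a sanity check on the memory numbers, here is a back-of-envelope estimate of the KV-cache footprint under this configuration (model shape taken from this card: 24 layers, 8 KV heads, head_dim 64; `kv_bytes` is a hypothetical helper for illustration, not part of the repository):

```python
# Rough KV-cache size for GPT-OSS-20B under TurboQuant K6/V4.
# Shape assumptions from this card: 24 layers, 8 KV heads, head_dim 64.
LAYERS, KV_HEADS, HEAD_DIM = 24, 8, 64
VALS = KV_HEADS * HEAD_DIM  # 512 values per K (or V) per layer per token

def kv_bytes(seq_len, k_bits, v_bits, window=0):
    """Total KV-cache bytes: quantized tokens plus an FP32 residual window."""
    quant = max(seq_len - window, 0)          # tokens that get quantized
    resid = min(seq_len, window)              # recent tokens kept in FP32
    per_layer = quant * VALS * (k_bits + v_bits) / 8 \
              + resid * VALS * 2 * 32 / 8     # K and V at 32 bits each
    return LAYERS * per_layer

baseline = kv_bytes(220, 16, 16)              # FP16 K and V, no window
turbo = kv_bytes(220, 6, 4, window=64)        # K6/V4 + 64-token FP32 window
print(f"baseline ~{baseline/2**20:.1f} MiB, turbo ~{turbo/2**20:.1f} MiB")
```

At these short (~220-token) contexts the FP32 residual window dominates, so the quantized cache is only modestly smaller than the FP16 baseline; the theoretical savings only become significant at context lengths well beyond the residual window.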

Results vs Baseline (3 prompts averaged)

| Mode | Avg Latency (s) | TPOT (ms/tok) | Throughput (tok/s) | Mem Δ (MB) |
|---|---|---|---|---|
| Baseline (no quant) | 9.41 | 47.1 | 21.25 | ~5 |
| TurboQuant K6/V4 | 22.97 | 114.9 | 8.70 | ~114 |
| Change | +144% | +144% | −59% | +109 MB |

Per-prompt detail

| Prompt | Mode | Latency (s) | TPOT (ms) | Throughput (tok/s) | Mem Δ (MB) |
|---|---|---|---|---|---|
| MoE vs Dense | Baseline | 9.378 | 46.9 | 21.33 | +9.9 |
| MoE vs Dense | TurboQuant | 22.955 | 114.8 | 8.71 | +248.1 |
| Fibonacci | Baseline | 9.427 | 47.1 | 21.21 | +2.6 |
| Fibonacci | TurboQuant | 22.930 | 114.7 | 8.72 | +95.2 |
| OpenVINO advantages (Korean) | Baseline | 9.432 | 47.2 | 21.21 | +2.9 |
| OpenVINO advantages (Korean) | TurboQuant | 23.030 | 115.1 | 8.68 | −1.1 |

Note: Performance overhead stems from GPU↔CPU round-trips for each of the 48 KV state tensors (24 layers × K+V) per compression step. Production deployment requires GPU-native KV kernels.
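The scale of that overhead can be estimated with simple arithmetic (a hypothetical `roundtrip_bytes` helper; assumes FP32 transfer of each (1, 8, seq, 64) state at a ~220-token context, illustrative only):

```python
# Per-compression-step GPU<->CPU traffic for the 48 KV state tensors.
# Assumes each (1, 8, seq, 64) state is moved as FP32 in both directions.
STATES, KV_HEADS, HEAD_DIM, FP32_BYTES = 48, 8, 64, 4

def roundtrip_bytes(seq_len):
    one_state = KV_HEADS * HEAD_DIM * seq_len * FP32_BYTES
    return 2 * STATES * one_state  # read to host + write back to device

print(f"~{roundtrip_bytes(220)/2**20:.0f} MiB moved per compression step")
```

With one compression every 4 decode steps, that is on the order of 10 MiB of extra transfer traffic per generated token, which helps explain the ~2.4× latency increase.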


Repository Contents

| File | Description |
|---|---|
| openvino_model.bin | INT4-quantized model weights (12 GB, git-lfs) |
| openvino_model.xml | OpenVINO IR graph definition |
| openvino_tokenizer.bin/xml | OpenVINO tokenizer |
| openvino_detokenizer.bin/xml | OpenVINO detokenizer |
| config.json | Model configuration |
| export.py | Downloads the model from Hugging Face |
| infer.py | Single-prompt inference with TurboQuant |
| benchmark.py | Baseline vs TurboQuant latency/memory benchmark |
| turboquant/compressors_v3.py | TurboQuant V3: MSECompressor, TurboQuantV3 |
| turboquant/turboquant.py | Core TurboQuant: rotation matrix, Lloyd-Max quantizer |
| turboquant/lloyd_max.py | Lloyd-Max codebooks for Beta/Gaussian distributions |

Installation

```bash
pip install "optimum[openvino]" transformers openvino psutil scipy huggingface_hub
```

Usage

Download the model

```bash
python export.py --output-dir ./model
```

Single inference with TurboQuant

```bash
python infer.py \
  --model-dir . \
  --device GPU \
  --prompt "Explain MoE transformer architectures." \
  --k-bits 6 \
  --v-bits 4 \
  --residual-window 64
```

Benchmark: Baseline vs TurboQuant

```bash
python benchmark.py \
  --model-dir . \
  --device GPU \
  --k-bits 6 \
  --v-bits 4 \
  --residual-window 64 \
  --compress-every 4 \
  --runs 3 \
  --output results.json
```

Arguments

| Argument | Default | Description |
|---|---|---|
| --model-dir | . | OpenVINO model directory |
| --device | GPU | GPU or CPU (automatic fallback) |
| --k-bits | 6 | Key quantization bit-width |
| --v-bits | 4 | Value quantization bit-width |
| --residual-window | 64 | Number of recent tokens kept in FP32 |
| --compress-every | 4 | Apply compression every N decode steps |
| --max-new-tokens | 200 | Number of tokens to generate |
| --runs | 3 | Benchmark runs per prompt |

How TurboQuant Works

TurboQuant V3 (community-informed improvements on the ICLR 2026 paper):

  1. Random rotation: apply a random orthogonal matrix Π to K/V vectors, so the rotated coordinates become near-Gaussian
  2. Lloyd-Max quantization: an MSE-optimal scalar quantizer (MSE-minimizing centroids) per coordinate
  3. Bit packing: pack quantized indices to reduce memory (e.g. 6-bit indices take 37.5% of FP16 storage)
  4. Residual window: keep the most recent residual_window tokens in FP32 to preserve generation quality
  5. Asymmetric K/V bits: keys get more bits (6) than values (4), since attention scores are more sensitive to key inner-product precision
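A minimal NumPy sketch of steps 1–2 (illustrative only — the repository's actual implementation lives in turboquant/; here the Lloyd-Max codebook is fitted directly to the data with Lloyd's algorithm rather than precomputed for a Gaussian):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Haar-random orthogonal matrix via QR of a Gaussian matrix
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # sign fix makes the distribution uniform

def lloyd_max(x, bits, iters=25):
    # 1-D Lloyd's algorithm: alternate nearest-centroid partition / mean update
    levels = 2 ** bits
    c = np.quantile(x, (np.arange(levels) + 0.5) / levels)  # quantile init
    for _ in range(iters):
        idx = np.digitize(x, (c[:-1] + c[1:]) / 2)  # decision boundaries
        for k in np.unique(idx):
            c[k] = x[idx == k].mean()               # MSE-minimizing centroid
    return c

def quantize(v, rot, bits):
    y = rot @ v                          # rotated coordinates ~ near-Gaussian
    c = lloyd_max(y, bits)
    idx = np.digitize(y, (c[:-1] + c[1:]) / 2)
    return idx, c                        # idx would then be bit-packed

def dequantize(idx, c, rot):
    return rot.T @ c[idx]                # rot is orthogonal: inverse = transpose
```

More bits shrink the reconstruction error roughly as expected for a scalar quantizer, and the orthogonal rotation preserves inner products up to that quantization error.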

KV state access via OV state API:

```python
import numpy as np
import openvino as ov
import torch

# Read the KV state tensor for layer 0's keys
states = model.request.query_state()
state_map = {s.name: s for s in states}
k_arr = np.copy(state_map["past_key_values.0.key"].state.data)  # (1, 8, seq, 64)

# Compress, decompress, and write the dequantized tensor back
compressed = compressor.compress(torch.from_numpy(k_arr))
decompressed = compressor.decompress(compressed)
state_map["past_key_values.0.key"].state = ov.Tensor(decompressed.numpy())
```

State shape: (1, kv_heads=8, seq_len, head_dim=64); 48 states in total (24 layers × 2 for K and V).
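For reference, the full set of state names follows the pattern above; a hypothetical helper that enumerates them and routes keys and values to their respective compressors (names and the K/V split assumed from this card) could look like:

```python
LAYERS = 24

# All 48 KV state names, following the naming pattern shown above
state_names = [f"past_key_values.{i}.{kv}"
               for i in range(LAYERS) for kv in ("key", "value")]

def compressor_for(name, k_comp, v_comp):
    # Keys and values use different bit-widths (K6 / V4 in this card)
    return k_comp if name.endswith(".key") else v_comp
```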


Hardware Requirements

  • Intel Arc GPU (Xe series) or any Intel CPU
  • At least 16 GB system RAM
  • OpenVINO 2026.1.0+

License

Model weights follow the OpenAI GPT-OSS usage policy.
TurboQuant algorithm: ICLR 2026 paper.
Scripts in this repository are released under the Apache 2.0 License.
