
GPT-OSS-20B INT4 — TurboQuant KV Cache Quantization

GPT-OSS-20B (OpenAI's 20B-parameter mixture-of-experts model, 32 experts) quantized to INT4 via AutoRound, with TurboQuant V3 KV cache quantization applied at runtime on an Intel Arc 140V GPU.

TurboQuant applies Lloyd-Max + Random Rotation vector quantization to key/value cache tensors after each decode step via the OpenVINO state API, reducing KV memory usage.

Base model: OpenVINO/gpt-oss-20b-int4-ov
Runtime: OpenVINO 2026.1.0 + optimum-intel
Device: Intel Arc 140V GPU (Lunar Lake iGPU)


Benchmark Results

Test configuration

| Item | Value |
|---|---|
| Device | Intel Arc 140V GPU |
| Max new tokens | 200 |
| Input context | ~15–25 tokens (short prompts) |
| Total context length | ~215–225 tokens |
| Runs per prompt | 3 (averaged) |
| TurboQuant K bits | 6 |
| TurboQuant V bits | 4 |
| Residual window | 64 (recent tokens kept in FP32) |
| Compression frequency | Every 4 decode steps |
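As a sanity check on the memory numbers, here is a back-of-envelope estimate of the KV-cache footprint under this configuration (model shape taken from this card: 24 layers, 8 KV heads, head_dim 64; `kv_bytes` is a hypothetical helper for illustration, not part of the repository):

```python
# Rough KV-cache size for GPT-OSS-20B under TurboQuant K6/V4.
# Shape assumptions from this card: 24 layers, 8 KV heads, head_dim 64.
LAYERS, KV_HEADS, HEAD_DIM = 24, 8, 64
VALS = KV_HEADS * HEAD_DIM  # 512 values per K (or V) per layer per token

def kv_bytes(seq_len, k_bits, v_bits, window=0):
    """Total KV-cache bytes: quantized tokens plus an FP32 residual window."""
    quant = max(seq_len - window, 0)          # tokens that get quantized
    resid = min(seq_len, window)              # recent tokens kept in FP32
    per_layer = quant * VALS * (k_bits + v_bits) / 8 \
              + resid * VALS * 2 * 32 / 8     # K and V at 32 bits each
    return LAYERS * per_layer

baseline = kv_bytes(220, 16, 16)              # FP16 K and V, no window
turbo = kv_bytes(220, 6, 4, window=64)        # K6/V4 + 64-token FP32 window
print(f"baseline ~{baseline/2**20:.1f} MiB, turbo ~{turbo/2**20:.1f} MiB")
```

At these short (~220-token) contexts the FP32 residual window dominates, so the quantized cache is only modestly smaller than the FP16 baseline; the theoretical savings only become significant at context lengths well beyond the residual window.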

Results vs Baseline (3 prompts averaged)

| Mode | Avg Latency (s) | TPOT (ms/tok) | Throughput (tok/s) | Mem Δ (MB) |
|---|---|---|---|---|
| Baseline (no quant) | 9.41 | 47.1 | 21.25 | ~5 |
| TurboQuant K6/V4 | 22.97 | 114.9 | 8.70 | ~114 |
| Change | +144% | +144% | −59% | +109 MB |

Per-prompt detail

| Prompt | Mode | Latency (s) | TPOT (ms) | Throughput (tok/s) | Mem Δ (MB) |
|---|---|---|---|---|---|
| MoE vs Dense | Baseline | 9.378 | 46.9 | 21.33 | +9.9 |
| MoE vs Dense | TurboQuant | 22.955 | 114.8 | 8.71 | +248.1 |
| Fibonacci | Baseline | 9.427 | 47.1 | 21.21 | +2.6 |
| Fibonacci | TurboQuant | 22.930 | 114.7 | 8.72 | +95.2 |
| OpenVINO advantages (Korean) | Baseline | 9.432 | 47.2 | 21.21 | +2.9 |
| OpenVINO advantages (Korean) | TurboQuant | 23.030 | 115.1 | 8.68 | −1.1 |

Note: Performance overhead stems from GPU↔CPU round-trips for each of the 48 KV state tensors (24 layers × K+V) per compression step. Production deployment requires GPU-native KV kernels.
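The scale of that overhead can be estimated with simple arithmetic (a hypothetical `roundtrip_bytes` helper; assumes FP32 transfer of each (1, 8, seq, 64) state at a ~220-token context, illustrative only):

```python
# Per-compression-step GPU<->CPU traffic for the 48 KV state tensors.
# Assumes each (1, 8, seq, 64) state is moved as FP32 in both directions.
STATES, KV_HEADS, HEAD_DIM, FP32_BYTES = 48, 8, 64, 4

def roundtrip_bytes(seq_len):
    one_state = KV_HEADS * HEAD_DIM * seq_len * FP32_BYTES
    return 2 * STATES * one_state  # read to host + write back to device

print(f"~{roundtrip_bytes(220)/2**20:.0f} MiB moved per compression step")
```

With one compression every 4 decode steps, that is on the order of 10 MiB of extra transfer traffic per generated token, which helps explain the ~2.4× latency increase.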


Repository Contents

| File | Description |
|---|---|
| openvino_model.bin | INT4-quantized model weights (12 GB, git-lfs) |
| openvino_model.xml | OpenVINO IR graph definition |
| openvino_tokenizer.bin/xml | OpenVINO tokenizer |
| openvino_detokenizer.bin/xml | OpenVINO detokenizer |
| config.json | Model configuration |
| export.py | Downloads the model from Hugging Face |
| infer.py | Single-prompt inference with TurboQuant |
| benchmark.py | Baseline vs TurboQuant latency/memory benchmark |
| turboquant/compressors_v3.py | TurboQuant V3: MSECompressor, TurboQuantV3 |
| turboquant/turboquant.py | Core TurboQuant: rotation matrix, Lloyd-Max quantizer |
| turboquant/lloyd_max.py | Lloyd-Max codebooks for Beta/Gaussian distributions |

Installation

```bash
pip install "optimum[openvino]" transformers openvino psutil scipy huggingface_hub
```

Usage

Download the model

```bash
python export.py --output-dir ./model
```

Single inference with TurboQuant

```bash
python infer.py \
  --model-dir . \
  --device GPU \
  --prompt "Explain MoE transformer architectures." \
  --k-bits 6 \
  --v-bits 4 \
  --residual-window 64
```

Benchmark: Baseline vs TurboQuant

```bash
python benchmark.py \
  --model-dir . \
  --device GPU \
  --k-bits 6 \
  --v-bits 4 \
  --residual-window 64 \
  --compress-every 4 \
  --runs 3 \
  --output results.json
```

Arguments

| Argument | Default | Description |
|---|---|---|
| --model-dir | . | OpenVINO model directory |
| --device | GPU | GPU or CPU (automatic fallback) |
| --k-bits | 6 | Key quantization bit-width |
| --v-bits | 4 | Value quantization bit-width |
| --residual-window | 64 | Number of recent tokens kept in FP32 |
| --compress-every | 4 | Apply compression every N decode steps |
| --max-new-tokens | 200 | Number of tokens to generate |
| --runs | 3 | Benchmark runs per prompt |

How TurboQuant Works

TurboQuant V3 (community-informed improvements on the ICLR 2026 paper):

  1. Random rotation: apply a random orthogonal matrix Π to K/V vectors, so the rotated coordinates become near-Gaussian
  2. Lloyd-Max quantization: an MSE-optimal scalar quantizer (MSE-minimizing centroids) per coordinate
  3. Bit packing: pack quantized indices to reduce memory (e.g. 6-bit indices take 37.5% of FP16 storage)
  4. Residual window: keep the most recent residual_window tokens in FP32 to preserve generation quality
  5. Asymmetric K/V bits: keys get more bits (6) than values (4), since attention scores are more sensitive to key inner-product precision
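A minimal NumPy sketch of steps 1–2 (illustrative only — the repository's actual implementation lives in turboquant/; here the Lloyd-Max codebook is fitted directly to the data with Lloyd's algorithm rather than precomputed for a Gaussian):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Haar-random orthogonal matrix via QR of a Gaussian matrix
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # sign fix makes the distribution uniform

def lloyd_max(x, bits, iters=25):
    # 1-D Lloyd's algorithm: alternate nearest-centroid partition / mean update
    levels = 2 ** bits
    c = np.quantile(x, (np.arange(levels) + 0.5) / levels)  # quantile init
    for _ in range(iters):
        idx = np.digitize(x, (c[:-1] + c[1:]) / 2)  # decision boundaries
        for k in np.unique(idx):
            c[k] = x[idx == k].mean()               # MSE-minimizing centroid
    return c

def quantize(v, rot, bits):
    y = rot @ v                          # rotated coordinates ~ near-Gaussian
    c = lloyd_max(y, bits)
    idx = np.digitize(y, (c[:-1] + c[1:]) / 2)
    return idx, c                        # idx would then be bit-packed

def dequantize(idx, c, rot):
    return rot.T @ c[idx]                # rot is orthogonal: inverse = transpose
```

More bits shrink the reconstruction error roughly as expected for a scalar quantizer, and the orthogonal rotation preserves inner products up to that quantization error.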

KV state access via OV state API:

```python
import numpy as np
import openvino as ov
import torch

# Read the KV state tensor for layer 0's keys
states = model.request.query_state()
state_map = {s.name: s for s in states}
k_arr = np.copy(state_map["past_key_values.0.key"].state.data)  # (1, 8, seq, 64)

# Compress, decompress, and write the dequantized tensor back
compressed = compressor.compress(torch.from_numpy(k_arr))
decompressed = compressor.decompress(compressed)
state_map["past_key_values.0.key"].state = ov.Tensor(decompressed.numpy())
```

State shape: (1, kv_heads=8, seq_len, head_dim=64); 48 states in total (24 layers × 2 for K and V).
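For reference, the full set of state names follows the pattern above; a hypothetical helper that enumerates them and routes keys and values to their respective compressors (names and the K/V split assumed from this card) could look like:

```python
LAYERS = 24

# All 48 KV state names, following the naming pattern shown above
state_names = [f"past_key_values.{i}.{kv}"
               for i in range(LAYERS) for kv in ("key", "value")]

def compressor_for(name, k_comp, v_comp):
    # Keys and values use different bit-widths (K6 / V4 in this card)
    return k_comp if name.endswith(".key") else v_comp
```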


Hardware Requirements

  • Intel Arc GPU (Xe series) or any Intel CPU
  • At least 16 GB system RAM
  • OpenVINO 2026.1.0+

License

Model weights follow the OpenAI GPT-OSS usage policy.
TurboQuant algorithm: ICLR 2026 paper.
Scripts in this repository are released under the Apache 2.0 License.
