
GPT-OSS-20B INT4 — TriAttention KV Token Pruning

GPT-OSS-20B (OpenAI's Mixture-of-Experts model, 32 experts), quantized to INT4 via AutoRound, with TriAttention key-norm-based KV token pruning applied at runtime on an Intel Arc 140V GPU.

TriAttention scores token importance by the L2 norm of key vectors and zeroes out the K/V entries of low-importance historical tokens (soft-pruning), keeping the KV tensor shape intact.

Base model: OpenVINO/gpt-oss-20b-int4-ov
Runtime: OpenVINO 2026.1.0 + optimum-intel
Device: Intel Arc 140V GPU (Lunar Lake iGPU)


Benchmark Results

Test configuration

| Item | Value |
|---|---|
| Device | Intel Arc 140V GPU |
| Max new tokens | 200 |
| Input context | ~15–25 tokens (short prompts) |
| Total context length | ~215–225 tokens |
| Runs per prompt | 3 (averaged) |
| KV budget | 128 tokens |
| Recent window | 32 tokens (always preserved) |
| Prune trigger | `seq_len > budget` (128) |
| Prune events | ~83–89 per 200-token generation |
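The prune-event count follows from the trigger condition: one prune pass runs on every decode step once the sequence length exceeds the budget. A back-of-the-envelope check (illustrative arithmetic, not taken from the benchmark code — the reported ~83–89 differs slightly due to exact first-trigger accounting):

```python
budget = 128
max_new_tokens = 200

def expected_prune_events(input_len: int) -> int:
    # One prune pass per decode step whose sequence length exceeds the budget.
    total_len = input_len + max_new_tokens
    return max(0, total_len - budget)

print(expected_prune_events(15), expected_prune_events(25))  # 87 97
```

So for the ~15–25 token prompts used here, roughly 87–97 decode steps trip the pruning path, which is why the per-step KV round-trip cost dominates the latency numbers below.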

Results vs Baseline (3 prompts averaged)

| Mode | Avg Latency (s) | TPOT (ms/tok) | Throughput (tok/s) | Mem Δ (MB) |
|---|---|---|---|---|
| Baseline (no prune) | 9.56 | 47.8 | 20.92 | ~6 |
| TriAttention soft-prune | 20.69 | 103.5 | 9.66 | ~138 (variable) |
| Change | +116% | +117% | −54% | +132 MB |

Per-prompt detail

| Prompt | Mode | Latency (s) | TPOT (ms) | Throughput (tok/s) | Mem Δ (MB) |
|---|---|---|---|---|---|
| MoE vs Dense | Baseline | 9.789 | 48.9 | 20.43 | +9.9 |
| MoE vs Dense | TriAttention | 20.483 | 102.4 | 9.76 | +442.5 |
| Fibonacci | Baseline | 9.416 | 47.1 | 21.24 | +2.7 |
| Fibonacci | TriAttention | 20.780 | 103.9 | 9.62 | −132.3 |
| OpenVINO advantages (Korean) | Baseline | 9.480 | 47.4 | 21.10 | +4.9 |
| OpenVINO advantages (Korean) | TriAttention | 20.817 | 104.1 | 9.61 | +102.9 |

Note: Performance overhead stems from GPU↔CPU round-trips for 48 KV state tensors (24 layers × K+V) on every decode step after the budget is exceeded. Memory delta is highly variable due to GPU driver lazy allocation and GC timing. Production deployment requires GPU-native KV pruning kernels.


Repository Contents

| File | Description |
|---|---|
| openvino_model.bin | INT4-quantized model weights (12 GB, git-lfs) |
| openvino_model.xml | OpenVINO IR graph definition |
| openvino_tokenizer.bin/xml | OpenVINO tokenizer |
| openvino_detokenizer.bin/xml | OpenVINO detokenizer |
| config.json | Model configuration |
| export.py | Downloads the model from Hugging Face |
| infer.py | Single-prompt inference with TriAttention pruning |
| benchmark.py | Baseline vs TriAttention latency/memory benchmark |

Installation

```shell
pip install "optimum[openvino]" transformers openvino psutil huggingface_hub
```

Usage

Download the model

```shell
python export.py --output-dir ./model
```

Single inference with TriAttention

```shell
python infer.py \
  --model-dir . \
  --device GPU \
  --prompt "Explain MoE transformer architectures." \
  --budget 128 \
  --recent-window 32
```

Benchmark: Baseline vs TriAttention

```shell
python benchmark.py \
  --model-dir . \
  --device GPU \
  --budget 128 \
  --recent-window 32 \
  --runs 3 \
  --output results.json
```

Arguments

| Argument | Default | Description |
|---|---|---|
| `--model-dir` | `.` | OpenVINO model directory |
| `--device` | `GPU` | `GPU` or `CPU` (automatic fallback) |
| `--budget` | `128` | Max KV tokens to keep (prune when exceeded) |
| `--recent-window` | `32` | Most recent tokens, always preserved |
| `--max-new-tokens` | `200` | Tokens to generate |
| `--runs` | `3` | Benchmark runs per prompt |
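The flags above could be wired up with `argparse` along these lines (a sketch of the interface only; the actual parsers in infer.py and benchmark.py may differ):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Mirrors the argument table; defaults match the documented values.
    p = argparse.ArgumentParser(description="TriAttention KV-pruning inference")
    p.add_argument("--model-dir", default=".", help="OpenVINO model directory")
    p.add_argument("--device", default="GPU", choices=["GPU", "CPU"],
                   help="Inference device (falls back to CPU if no GPU)")
    p.add_argument("--budget", type=int, default=128,
                   help="Max KV tokens to keep; prune when exceeded")
    p.add_argument("--recent-window", type=int, default=32,
                   help="Most recent tokens always preserved")
    p.add_argument("--max-new-tokens", type=int, default=200,
                   help="Tokens to generate")
    p.add_argument("--runs", type=int, default=3,
                   help="Benchmark runs per prompt")
    return p

args = build_parser().parse_args([])  # [] -> use defaults
print(args.budget, args.recent_window)  # 128 32
```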

How TriAttention Soft-Prune Works

TriAttention scores each historical token by the mean L2 norm of its key vector across attention heads. Tokens with low key norms have near-zero attention weights and are "evicted" by zeroing their K/V entries.

For each decode step where `seq_len > budget`:

1. Read all 48 KV state tensors via `model.request.query_state()`.
2. Compute `key_norm[t] = mean over heads of ||k_t||_2` for the historical tokens.
3. Zero out K/V for the `hist_len - keep_hist` lowest-norm tokens.
4. Write the modified tensors back via `state.state = ov.Tensor(...)`.

Why soft-prune instead of hard-prune (token removal)?

OpenVINO stateful models manage the attention mask internally, with a fixed shape equal to the current sequence length. Removing tokens (hard-prune) would shrink `seq_len` while the model's internal mask kept its old shape, producing a `Broadcast incorrect target shape` error. Soft-prune keeps the shape constant while still eliminating the pruned tokens' contribution: a zeroed value vector adds nothing to the attention output, and a zeroed key yields a logit of 0, i.e. a near-minimal attention weight.
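A small numpy check of this rationale (illustrative only, not from the repository): with a token's value vector zeroed, its contribution to the attention output vanishes entirely, even though its zeroed key still receives a small residual softmax weight — which is what makes the pruning "soft":

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
q = rng.normal(size=d)           # query for the current decode step
K = rng.normal(size=(5, d))      # keys of 5 cached tokens
V = rng.normal(size=(5, d))      # values of 5 cached tokens

K[2] = 0.0  # zeroed key  -> logit q . K[2] == 0 (small but nonzero weight)
V[2] = 0.0  # zeroed value -> token 2 adds nothing to the weighted sum

logits = K @ q / np.sqrt(d)
w = np.exp(logits) / np.exp(logits).sum()
out = w @ V

# Identical to summing only over the surviving tokens (token 2's residual
# weight is simply absorbed, slightly scaling the output down):
mask = np.arange(5) != 2
out_without_token2 = w[mask] @ V[mask]
assert np.allclose(out, out_without_token2)
```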

```python
import numpy as np

# Key-norm scoring (simplified)
k_hist = k_arr[:, :, :hist_len, :]  # (1, heads, hist, head_dim)
key_norms = np.linalg.norm(k_hist[0], axis=-1).mean(axis=0)  # (hist,)

# Zero out the zero_count lowest-norm tokens
zero_idx = np.argpartition(key_norms, zero_count)[:zero_count]
k_arr[:, :, zero_idx, :] = 0.0
v_arr[:, :, zero_idx, :] = 0.0
```
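The scoring snippet above can be wrapped into a complete per-step routine. This is a minimal numpy-only sketch assuming a `(1, heads, seq, head_dim)` KV layout and the budget/recent-window semantics described earlier; the repository's infer.py is the reference implementation:

```python
import numpy as np

def soft_prune(k_arr, v_arr, seq_len, budget=128, recent_window=32):
    """Zero out the lowest-key-norm historical tokens in-place (soft-prune).

    k_arr, v_arr: KV state tensors of shape (1, heads, seq, head_dim).
    The trailing `recent_window` tokens are never considered for pruning.
    """
    if seq_len <= budget:
        return k_arr, v_arr                # under budget: nothing to do
    hist_len = seq_len - recent_window     # prunable (historical) region
    keep_hist = budget - recent_window     # historical tokens to keep
    zero_count = hist_len - keep_hist      # == seq_len - budget
    if zero_count <= 0:
        return k_arr, v_arr

    # Mean L2 norm of each historical token's key vector across heads.
    key_norms = np.linalg.norm(k_arr[0, :, :hist_len, :], axis=-1).mean(axis=0)

    # Indices of the zero_count lowest-norm tokens (O(hist) partial sort).
    zero_idx = np.argpartition(key_norms, zero_count)[:zero_count]
    k_arr[:, :, zero_idx, :] = 0.0
    v_arr[:, :, zero_idx, :] = 0.0
    return k_arr, v_arr
```

In the real pipeline the arrays would be read from the stateful model via `model.request.query_state()` and written back as `ov.Tensor` objects, once per KV state, which is exactly the GPU↔CPU round-trip the benchmark note identifies as the overhead.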

Hardware Requirements

  • Intel Arc GPU (Xe series) or any Intel CPU
  • At least 16 GB system RAM
  • OpenVINO 2026.1.0+

License

Model weights follow the OpenAI GPT-OSS usage policy.
Scripts in this repository are released under the Apache 2.0 License.
