# GPT-OSS-20B INT4 — TriAttention KV Token Pruning

GPT-OSS-20B (OpenAI Mixture-of-Experts model, 32 experts) quantized to INT4 via AutoRound, with TriAttention key-norm-based KV token pruning applied at runtime on an Intel Arc 140V GPU.
TriAttention scores token importance by the L2 norm of key vectors and zeroes out the K/V entries of low-importance historical tokens (soft-pruning), keeping the KV tensor shape intact.
- Base model: `OpenVINO/gpt-oss-20b-int4-ov`
- Runtime: OpenVINO 2026.1.0 + optimum-intel
- Device: Intel Arc 140V GPU (Lunar Lake iGPU)
## Benchmark Results

### Test configuration
| Item | Value |
|---|---|
| Device | Intel Arc 140V GPU |
| Max new tokens | 200 |
| Input context | ~15–25 tokens (short prompts) |
| Total context length | ~215–225 tokens |
| Runs per prompt | 3 (averaged) |
| KV Budget | 128 tokens |
| Recent window | 32 tokens (always preserved) |
| Prune trigger | seq_len > budget (128) |
| Prune events | ~83–89 per 200-token generation |
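The prune-event count follows directly from the trigger condition. A minimal simulation, assuming one prune event per decode step once `seq_len` exceeds the budget and a 15-token prompt, lands inside the reported ~83–89 range:

```python
BUDGET = 128
MAX_NEW_TOKENS = 200
input_len = 15  # assumed short prompt (source reports ~15-25 tokens)

prune_events = 0
first_trigger_step = None
for step in range(1, MAX_NEW_TOKENS + 1):
    seq_len = input_len + step        # context length after this decode step
    if seq_len > BUDGET:              # prune trigger: seq_len > budget
        prune_events += 1
        if first_trigger_step is None:
            first_trigger_step = step

print(prune_events, first_trigger_step)  # 87 events, first prune at step 114
```

The exact count shifts with prompt length (one extra event per extra input token), which accounts for the spread observed across prompts.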
### Results vs Baseline (3 prompts averaged)
| Mode | Avg Latency (s) | TPOT (ms/tok) | Throughput (tok/s) | Mem Δ (MB) |
|---|---|---|---|---|
| Baseline (no prune) | 9.56 | 47.8 | 20.92 | ~6 |
| TriAttention soft-prune | 20.69 | 103.5 | 9.66 | ~138 (variable) |
| Change | +116% | +117% | −54% | +132 MB |
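As a sanity check, the Change row can be reproduced from the two mode averages above:

```python
base_lat, tri_lat = 9.56, 20.69      # avg latency (s)
base_tpot, tri_tpot = 47.8, 103.5    # TPOT (ms/token)
base_tps, tri_tps = 20.92, 9.66      # throughput (tok/s)

def pct_change(new, old):
    """Rounded percentage change from old to new."""
    return round((new / old - 1) * 100)

lat_change = pct_change(tri_lat, base_lat)      # 116
tpot_change = pct_change(tri_tpot, base_tpot)   # 117
tps_change = pct_change(tri_tps, base_tps)      # -54
```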
### Per-prompt detail
| Prompt | Mode | Latency (s) | TPOT (ms) | Throughput (tok/s) | Mem Δ (MB) |
|---|---|---|---|---|---|
| MoE vs Dense | Baseline | 9.789 | 48.9 | 20.43 | +9.9 |
| MoE vs Dense | TriAttention | 20.483 | 102.4 | 9.76 | +442.5 |
| Fibonacci | Baseline | 9.416 | 47.1 | 21.24 | +2.7 |
| Fibonacci | TriAttention | 20.780 | 103.9 | 9.62 | −132.3 |
| OpenVINO advantages (Korean prompt) | Baseline | 9.480 | 47.4 | 21.10 | +4.9 |
| OpenVINO advantages (Korean prompt) | TriAttention | 20.817 | 104.1 | 9.61 | +102.9 |
Note: Performance overhead stems from GPU↔CPU round-trips for 48 KV state tensors (24 layers × K+V) on every decode step after the budget is exceeded. Memory delta is highly variable due to GPU driver lazy allocation and GC timing. Production deployment requires GPU-native KV pruning kernels.
## Repository Contents

| File | Description |
|---|---|
| `openvino_model.bin` | INT4-quantized model weights (12 GB, git-lfs) |
| `openvino_model.xml` | OpenVINO IR graph definition |
| `openvino_tokenizer.bin/xml` | OpenVINO tokenizer |
| `openvino_detokenizer.bin/xml` | OpenVINO detokenizer |
| `config.json` | Model configuration |
| `export.py` | Download the model from Hugging Face |
| `infer.py` | Single-prompt inference with TriAttention pruning |
| `benchmark.py` | Baseline vs TriAttention latency/memory benchmark |
## Installation

```bash
pip install "optimum[openvino]" transformers openvino psutil huggingface_hub
```
## Usage

### Download the model

```bash
python export.py --output-dir ./model
```

### Single inference with TriAttention

```bash
python infer.py \
  --model-dir . \
  --device GPU \
  --prompt "Explain MoE transformer architectures." \
  --budget 128 \
  --recent-window 32
```

### Benchmark: Baseline vs TriAttention

```bash
python benchmark.py \
  --model-dir . \
  --device GPU \
  --budget 128 \
  --recent-window 32 \
  --runs 3 \
  --output results.json
```
### Arguments

| Argument | Default | Description |
|---|---|---|
| `--model-dir` | `.` | OpenVINO model directory |
| `--device` | `GPU` | GPU or CPU (auto fallback) |
| `--budget` | `128` | Max KV tokens to keep (prune when exceeded) |
| `--recent-window` | `32` | Number of most recent tokens always preserved |
| `--max-new-tokens` | `200` | Number of tokens to generate |
| `--runs` | `3` | Benchmark runs per prompt |
## How TriAttention Soft-Prune Works
TriAttention scores each historical token by the mean L2 norm of its key vector across attention heads. Tokens with low key norms have near-zero attention weights and are "evicted" by zeroing their K/V entries.
For each decode step where `seq_len > budget`:

1. Read all 48 KV state tensors via `model.request.query_state()`
2. Compute `key_norm[t]` = mean over heads of `||k_t||_2` for each historical token
3. Zero out the K/V entries of the `(hist_len - keep_hist)` lowest-norm tokens
4. Write the modified tensors back via `state.state = ov.Tensor(...)`
### Why soft-prune instead of hard-prune (token removal)?

OpenVINO stateful models manage the attention mask internally, with a fixed shape equal to the current sequence length. Removing tokens (hard-prune) would change `seq_len` while the model's internal mask kept its old shape, producing a "Broadcast incorrect target shape" error. Soft-pruning keeps the shape constant while effectively eliminating the contribution of zeroed tokens (their attention weights collapse to ~0).
```python
import numpy as np

# k_arr, v_arr: KV state tensors of shape (1, heads, seq_len, head_dim);
# hist_len and zero_count are derived from the budget and recent window.

# Key-norm scoring (simplified)
k_hist = k_arr[:, :, :hist_len, :]                           # (1, heads, hist, head_dim)
key_norms = np.linalg.norm(k_hist[0], axis=-1).mean(axis=0)  # (hist,)

# Zero out the bottom-ranked tokens
zero_idx = np.argpartition(key_norms, zero_count)[:zero_count]
k_arr[:, :, zero_idx, :] = 0.0
v_arr[:, :, zero_idx, :] = 0.0
```
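Putting the steps together, here is a self-contained sketch of the soft-prune on plain NumPy arrays, with the recent window excluded from pruning. The OpenVINO state read/write (steps 1 and 4) is omitted, and the `(1, heads, seq, head_dim)` tensor layout is an assumption:

```python
import numpy as np

def soft_prune_kv(k_arr, v_arr, seq_len, budget=128, recent_window=32):
    """Soft-prune: zero the K/V rows of the lowest-key-norm historical tokens.

    Tensor shapes (1, heads, seq_len, head_dim) are left intact; only the
    values of pruned tokens are zeroed. Returns the number of zeroed tokens.
    """
    if seq_len <= budget:
        return 0                              # under budget: nothing to prune
    hist_len = seq_len - recent_window        # the recent window is never pruned
    zero_count = hist_len - (budget - recent_window)
    if zero_count <= 0:
        return 0
    # Mean L2 norm of each historical key across attention heads
    key_norms = np.linalg.norm(k_arr[0, :, :hist_len, :], axis=-1).mean(axis=0)
    # Indices of the zero_count lowest-norm historical tokens
    zero_idx = np.argpartition(key_norms, zero_count)[:zero_count]
    k_arr[:, :, zero_idx, :] = 0.0
    v_arr[:, :, zero_idx, :] = 0.0
    return zero_count
```

With `seq_len=150`, `budget=128`, and `recent_window=32`, this zeroes 22 tokens: 118 historical tokens are eligible, and 96 of them may be kept within budget.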
## Hardware Requirements
- Intel Arc GPU (Xe series) or any Intel CPU
- At least 16 GB system RAM
- OpenVINO 2026.1.0+
## License
Model weights follow the OpenAI GPT-OSS usage policy.
Scripts in this repository are released under the Apache 2.0 License.