
GPT-OSS-20B INT4 — TriAttention KV Token Pruning

GPT-OSS-20B (OpenAI's Mixture-of-Experts model, 32 experts), quantized to INT4 via AutoRound, with TriAttention key-norm-based KV token pruning applied at runtime on an Intel Arc 140V GPU.

TriAttention scores token importance by the L2 norm of key vectors and zeroes out the K/V entries of low-importance historical tokens (soft-pruning), keeping the KV tensor shape intact.

Base model: OpenVINO/gpt-oss-20b-int4-ov
Runtime: OpenVINO 2026.1.0 + optimum-intel
Device: Intel Arc 140V GPU (Lunar Lake iGPU)


Benchmark Results

Test configuration

| Item | Value |
|---|---|
| Device | Intel Arc 140V GPU |
| Max new tokens | 200 |
| Input context | ~15–25 tokens (short prompts) |
| Total context length | ~215–225 tokens |
| Runs per prompt | 3 (averaged) |
| KV budget | 128 tokens |
| Recent window | 32 tokens (always preserved) |
| Prune trigger | `seq_len > budget` (128) |
| Prune events | ~83–89 per 200-token generation |
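The prune-event count follows from the trigger condition: one prune pass runs on every decode step once the sequence length exceeds the budget. A back-of-the-envelope check (illustrative arithmetic, not taken from the benchmark code — the reported ~83–89 differs slightly due to exact first-trigger accounting):

```python
budget = 128
max_new_tokens = 200

def expected_prune_events(input_len: int) -> int:
    # One prune pass per decode step whose sequence length exceeds the budget.
    total_len = input_len + max_new_tokens
    return max(0, total_len - budget)

print(expected_prune_events(15), expected_prune_events(25))  # 87 97
```

So for the ~15–25 token prompts used here, roughly 87–97 decode steps trip the pruning path, which is why the per-step KV round-trip cost dominates the latency numbers below.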

Results vs Baseline (3 prompts averaged)

| Mode | Avg Latency (s) | TPOT (ms/tok) | Throughput (tok/s) | Mem Δ (MB) |
|---|---|---|---|---|
| Baseline (no prune) | 9.56 | 47.8 | 20.92 | ~6 |
| TriAttention soft-prune | 20.69 | 103.5 | 9.66 | ~138 (variable) |
| Change | +116% | +117% | −54% | +132 MB |

Per-prompt detail

| Prompt | Mode | Latency (s) | TPOT (ms) | Throughput (tok/s) | Mem Δ (MB) |
|---|---|---|---|---|---|
| MoE vs Dense | Baseline | 9.789 | 48.9 | 20.43 | +9.9 |
| MoE vs Dense | TriAttention | 20.483 | 102.4 | 9.76 | +442.5 |
| Fibonacci | Baseline | 9.416 | 47.1 | 21.24 | +2.7 |
| Fibonacci | TriAttention | 20.780 | 103.9 | 9.62 | −132.3 |
| OpenVINO advantages (Korean) | Baseline | 9.480 | 47.4 | 21.10 | +4.9 |
| OpenVINO advantages (Korean) | TriAttention | 20.817 | 104.1 | 9.61 | +102.9 |

Note: Performance overhead stems from GPU↔CPU round-trips for 48 KV state tensors (24 layers × K+V) on every decode step after the budget is exceeded. Memory delta is highly variable due to GPU driver lazy allocation and GC timing. Production deployment requires GPU-native KV pruning kernels.


Repository Contents

| File | Description |
|---|---|
| openvino_model.bin | INT4-quantized model weights (12 GB, git-lfs) |
| openvino_model.xml | OpenVINO IR graph definition |
| openvino_tokenizer.bin/xml | OpenVINO tokenizer |
| openvino_detokenizer.bin/xml | OpenVINO detokenizer |
| config.json | Model configuration |
| export.py | Downloads the model from Hugging Face |
| infer.py | Single-prompt inference with TriAttention pruning |
| benchmark.py | Baseline vs TriAttention latency/memory benchmark |

Installation

```shell
pip install "optimum[openvino]" transformers openvino psutil huggingface_hub
```

Usage

Download the model

```shell
python export.py --output-dir ./model
```

Single inference with TriAttention

```shell
python infer.py \
  --model-dir . \
  --device GPU \
  --prompt "Explain MoE transformer architectures." \
  --budget 128 \
  --recent-window 32
```

Benchmark: Baseline vs TriAttention

```shell
python benchmark.py \
  --model-dir . \
  --device GPU \
  --budget 128 \
  --recent-window 32 \
  --runs 3 \
  --output results.json
```

Arguments

| Argument | Default | Description |
|---|---|---|
| `--model-dir` | `.` | OpenVINO model directory |
| `--device` | `GPU` | `GPU` or `CPU` (automatic fallback) |
| `--budget` | `128` | Max KV tokens to keep (prune when exceeded) |
| `--recent-window` | `32` | Most recent tokens, always preserved |
| `--max-new-tokens` | `200` | Tokens to generate |
| `--runs` | `3` | Benchmark runs per prompt |
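The flags above could be wired up with `argparse` along these lines (a sketch of the interface only; the actual parsers in infer.py and benchmark.py may differ):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Mirrors the argument table; defaults match the documented values.
    p = argparse.ArgumentParser(description="TriAttention KV-pruning inference")
    p.add_argument("--model-dir", default=".", help="OpenVINO model directory")
    p.add_argument("--device", default="GPU", choices=["GPU", "CPU"],
                   help="Inference device (falls back to CPU if no GPU)")
    p.add_argument("--budget", type=int, default=128,
                   help="Max KV tokens to keep; prune when exceeded")
    p.add_argument("--recent-window", type=int, default=32,
                   help="Most recent tokens always preserved")
    p.add_argument("--max-new-tokens", type=int, default=200,
                   help="Tokens to generate")
    p.add_argument("--runs", type=int, default=3,
                   help="Benchmark runs per prompt")
    return p

args = build_parser().parse_args([])  # [] -> use defaults
print(args.budget, args.recent_window)  # 128 32
```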

How TriAttention Soft-Prune Works

TriAttention scores each historical token by the mean L2 norm of its key vector across attention heads. Tokens with low key norms have near-zero attention weights and are "evicted" by zeroing their K/V entries.

For each decode step where `seq_len > budget`:

1. Read all 48 KV state tensors via `model.request.query_state()`.
2. Compute `key_norm[t] = mean over heads of ||k_t||_2` for the historical tokens.
3. Zero out K/V for the `hist_len - keep_hist` lowest-norm tokens.
4. Write the modified tensors back via `state.state = ov.Tensor(...)`.

Why soft-prune instead of hard-prune (token removal)?

OpenVINO stateful models manage the attention mask internally, with a fixed shape equal to the current sequence length. Removing tokens (hard-prune) would shrink `seq_len` while the model's internal mask kept its old shape, producing a `Broadcast incorrect target shape` error. Soft-prune keeps the shape constant while still eliminating the pruned tokens' contribution: a zeroed value vector adds nothing to the attention output, and a zeroed key yields a logit of 0, i.e. a near-minimal attention weight.
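A small numpy check of this rationale (illustrative only, not from the repository): with a token's value vector zeroed, its contribution to the attention output vanishes entirely, even though its zeroed key still receives a small residual softmax weight — which is what makes the pruning "soft":

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
q = rng.normal(size=d)           # query for the current decode step
K = rng.normal(size=(5, d))      # keys of 5 cached tokens
V = rng.normal(size=(5, d))      # values of 5 cached tokens

K[2] = 0.0  # zeroed key  -> logit q . K[2] == 0 (small but nonzero weight)
V[2] = 0.0  # zeroed value -> token 2 adds nothing to the weighted sum

logits = K @ q / np.sqrt(d)
w = np.exp(logits) / np.exp(logits).sum()
out = w @ V

# Identical to summing only over the surviving tokens (token 2's residual
# weight is simply absorbed, slightly scaling the output down):
mask = np.arange(5) != 2
out_without_token2 = w[mask] @ V[mask]
assert np.allclose(out, out_without_token2)
```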

```python
import numpy as np

# Key-norm scoring (simplified)
k_hist = k_arr[:, :, :hist_len, :]  # (1, heads, hist, head_dim)
key_norms = np.linalg.norm(k_hist[0], axis=-1).mean(axis=0)  # (hist,)

# Zero out the zero_count lowest-norm tokens
zero_idx = np.argpartition(key_norms, zero_count)[:zero_count]
k_arr[:, :, zero_idx, :] = 0.0
v_arr[:, :, zero_idx, :] = 0.0
```
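The scoring snippet above can be wrapped into a complete per-step routine. This is a minimal numpy-only sketch assuming a `(1, heads, seq, head_dim)` KV layout and the budget/recent-window semantics described earlier; the repository's infer.py is the reference implementation:

```python
import numpy as np

def soft_prune(k_arr, v_arr, seq_len, budget=128, recent_window=32):
    """Zero out the lowest-key-norm historical tokens in-place (soft-prune).

    k_arr, v_arr: KV state tensors of shape (1, heads, seq, head_dim).
    The trailing `recent_window` tokens are never considered for pruning.
    """
    if seq_len <= budget:
        return k_arr, v_arr                # under budget: nothing to do
    hist_len = seq_len - recent_window     # prunable (historical) region
    keep_hist = budget - recent_window     # historical tokens to keep
    zero_count = hist_len - keep_hist      # == seq_len - budget
    if zero_count <= 0:
        return k_arr, v_arr

    # Mean L2 norm of each historical token's key vector across heads.
    key_norms = np.linalg.norm(k_arr[0, :, :hist_len, :], axis=-1).mean(axis=0)

    # Indices of the zero_count lowest-norm tokens (O(hist) partial sort).
    zero_idx = np.argpartition(key_norms, zero_count)[:zero_count]
    k_arr[:, :, zero_idx, :] = 0.0
    v_arr[:, :, zero_idx, :] = 0.0
    return k_arr, v_arr
```

In the real pipeline the arrays would be read from the stateful model via `model.request.query_state()` and written back as `ov.Tensor` objects, once per KV state, which is exactly the GPU↔CPU round-trip the benchmark note identifies as the overhead.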

Hardware Requirements

  • Intel Arc GPU (Xe series) or any Intel CPU
  • At least 16 GB system RAM
  • OpenVINO 2026.1.0+

License

Model weights follow the OpenAI GPT-OSS usage policy.
Scripts in this repository are released under the Apache 2.0 License.
