# GPT-OSS-20B INT4 — optimum-intel OVModelForCausalLM
GPT-OSS-20B (OpenAI's Mixture-of-Experts model, 32 experts), quantized to INT4 with AutoRound and benchmarked on an Intel Arc 140V GPU through the optimum-intel `OVModelForCausalLM` interface.
This runtime exposes the internal KV cache state tensors through OpenVINO's state API (`model.request.query_state()`), which the TurboQuant and TriAttention experiments require.
- Base model: `OpenVINO/gpt-oss-20b-int4-ov`
- Runtime: OpenVINO 2026.1.0 + optimum-intel
- Device: Intel Arc 140V GPU (Lunar Lake iGPU)
## Benchmark Results
### Test configuration
| Item | Value |
|---|---|
| Device | Intel Arc 140V GPU |
| Max new tokens | 200 |
| Input context | ~15–25 tokens (short prompts) |
| Total context length | ~215–225 tokens |
| Runs per prompt | 3 (averaged) |
| Inference mode | Greedy (do_sample=False) |
| Stateful mode | `stateful=True` |
### Results (3 prompts, averaged)
| Prompt | Latency (s) | TPOT (ms/tok) | Throughput (tok/s) |
|---|---|---|---|
| MoE vs Dense transformer | 9.421 | 47.1 | 21.23 |
| Fibonacci memoization | 9.426 | 47.1 | 21.22 |
| OpenVINO advantages | 9.403 | 47.0 | 21.27 |
| **Average** | **9.42** | **47.1** | **21.24** |
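As a quick sanity check, the three columns are mutually consistent; the relation below reflects the usual definitions of these metrics, not anything read out of benchmark.py:

```python
# Relation between the reported metrics (assumed definitions):
latency_s, new_tokens = 9.42, 200        # averaged run from the table above

tpot_ms = latency_s / new_tokens * 1e3   # ≈ 47.1 ms/token
throughput = new_tokens / latency_s      # ≈ 21.2 tokens/s
```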
Compared to `openvino_genai.LLMPipeline` (27.0 tok/s), the optimum-intel interface is ~21% slower due to Hugging Face `model.generate()` overhead. However, it enables direct KV cache state access for post-processing (quantization, pruning).
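For reference, a minimal sketch of how that `openvino_genai` baseline is typically run (the prompt and config here are illustrative assumptions, not the exact benchmark setup):

```python
import openvino_genai

# Load the same INT4 model directory with the native GenAI pipeline
pipe = openvino_genai.LLMPipeline(".", "GPU")

config = openvino_genai.GenerationConfig()
config.max_new_tokens = 200
config.do_sample = False  # greedy decoding, matching the benchmark settings

print(pipe.generate("Explain the key differences between MoE and dense transformers.", config))
```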
## Repository Contents
| File | Description |
|---|---|
| `openvino_model.bin` | INT4-quantized model weights (12 GB, git-lfs) |
| `openvino_model.xml` | OpenVINO IR graph definition |
| `openvino_tokenizer.bin/.xml` | OpenVINO tokenizer |
| `openvino_detokenizer.bin/.xml` | OpenVINO detokenizer |
| `config.json` | Model configuration |
| `export.py` | Downloads the model from Hugging Face |
| `infer.py` | Single-prompt inference |
| `benchmark.py` | Latency & memory benchmark suite |
## Installation
```bash
pip install "optimum[openvino]" transformers psutil huggingface_hub
```
## Usage
### Download the model
```bash
python export.py --output-dir ./model
```
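The table above describes export.py only as a downloader; a minimal sketch of what that amounts to, assuming it simply pulls the pre-quantized IR from the Hub (the repo id comes from the base-model note above):

```python
import argparse

from huggingface_hub import snapshot_download

parser = argparse.ArgumentParser()
parser.add_argument("--output-dir", default="./model")
args = parser.parse_args()

# Fetch the pre-quantized OpenVINO IR files from the Hugging Face Hub
snapshot_download("OpenVINO/gpt-oss-20b-int4-ov", local_dir=args.output_dir)
```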
### Single inference
```bash
python infer.py \
    --model-dir . \
    --device GPU \
    --prompt "Explain the key differences between MoE and dense transformers."
```
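The equivalent inline call looks roughly like this (a sketch; it assumes the model directory also contains the Hugging Face tokenizer files, which infer.py may handle differently):

```python
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model = OVModelForCausalLM.from_pretrained(".", device="GPU")
tokenizer = AutoTokenizer.from_pretrained(".")  # assumes HF tokenizer files are present

inputs = tokenizer(
    "Explain the key differences between MoE and dense transformers.",
    return_tensors="pt",
)
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```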
### Benchmark (latency / memory)
```bash
python benchmark.py \
    --model-dir . \
    --device GPU \
    --max-new-tokens 200 \
    --runs 3 \
    --output results.json
```
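At its core such a benchmark is a timed generate loop; a sketch of what benchmark.py presumably does (the actual script may add warm-up runs and memory tracking):

```python
import time

# Assumes `model`, `tokenizer`, and a `prompt` string are set up as in the
# inline inference example above.
runs, max_new_tokens = 3, 200
inputs = tokenizer(prompt, return_tensors="pt")

times = []
for _ in range(runs):
    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    times.append(time.perf_counter() - start)

generated = output.shape[1] - inputs["input_ids"].shape[1]
avg = sum(times) / len(times)
print(f"latency {avg:.3f} s, TPOT {avg / generated * 1000:.1f} ms/tok, "
      f"throughput {generated / avg:.2f} tok/s")
```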
### Arguments
| Argument | Default | Description |
|---|---|---|
| `--model-dir` | `.` | Path to the OpenVINO model directory |
| `--device` | `GPU` | `GPU` or `CPU` (automatic fallback to CPU) |
| `--max-new-tokens` | `200` | Number of tokens to generate |
| `--runs` | `3` | Benchmark runs per prompt |
| `--output` | `results_optimum.json` | JSON result output path |
## KV Cache State Access
This repository loads the model with `stateful=True`, which enables the OpenVINO state API:
```python
from optimum.intel import OVModelForCausalLM

model = OVModelForCausalLM.from_pretrained(".", device="GPU", stateful=True)

# Access KV cache states after each forward pass
states = model.request.query_state()
for s in states:
    print(s.name, s.state.data.shape)  # e.g. (1, 8, seq_len, 64)
```
KV state shape: `(1, kv_heads=8, seq_len, head_dim=64)` — 48 states in total (24 layers × 2, one K and one V state per layer).
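At these dimensions the cache itself is small; a back-of-envelope size check (assuming FP16 state tensors, which is an assumption rather than something read from the IR):

```python
layers, kv_heads, head_dim = 24, 8, 64   # from the state shape above
seq_len = 225                            # ~ the benchmark's total context
bytes_per_value = 2                      # assumed FP16 states

kv_bytes = layers * 2 * kv_heads * seq_len * head_dim * bytes_per_value
print(f"KV cache ≈ {kv_bytes / 1024**2:.1f} MiB")  # ≈ 10.5 MiB
```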
## Hardware Requirements
- Intel Arc GPU (Xe series) or any Intel CPU
- At least 16 GB system RAM
- OpenVINO 2026.1.0+
## License
Model weights follow the OpenAI GPT-OSS usage policy.
Scripts in this repository are released under the Apache 2.0 License.