
GPT-OSS-20B INT4 — optimum-intel OVModelForCausalLM

GPT-OSS-20B (OpenAI's Mixture-of-Experts model, 32 experts per MoE layer) quantized to INT4 via AutoRound, benchmarked on an Intel Arc 140V GPU through optimum-intel's OVModelForCausalLM.

This runtime provides access to the internal KV cache state tensors via OpenVINO's state API (model.request.query_state()), which is required for the TurboQuant and TriAttention experiments.

Base model: OpenVINO/gpt-oss-20b-int4-ov
Runtime: OpenVINO 2026.1.0 + optimum-intel
Device: Intel Arc 140V GPU (Lunar Lake iGPU)


Benchmark Results

Test configuration

| Item | Value |
|---|---|
| Device | Intel Arc 140V GPU |
| Max new tokens | 200 |
| Input context | ~15–25 tokens (short prompts) |
| Total context length | ~215–225 tokens |
| Runs per prompt | 3 (averaged) |
| Inference mode | Greedy (`do_sample=False`) |
| `stateful` | `True` |

Results (3 prompts averaged)

| Prompt | Latency (s) | TPOT (ms/tok) | Throughput (tok/s) |
|---|---|---|---|
| MoE vs Dense transformer | 9.421 | 47.1 | 21.23 |
| Fibonacci memoization | 9.426 | 47.1 | 21.22 |
| OpenVINO advantages | 9.403 | 47.0 | 21.27 |
| Average | 9.42 | 47.1 | 21.24 |

Compared with openvino_genai.LLMPipeline (27.0 tok/s), the optimum-intel interface is ~21% slower because of the overhead of Hugging Face's model.generate() loop. In exchange, it exposes the KV cache state directly for post-processing (quantization, pruning).
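The reported metrics are related by simple arithmetic (TPOT ≈ latency / tokens generated, throughput ≈ tokens / latency), which makes the table easy to sanity-check:

```python
# Sanity-check the reported metrics: each run generates 200 new tokens,
# so TPOT = latency / tokens and throughput = tokens / latency.
max_new_tokens = 200

for prompt, latency_s in [
    ("MoE vs Dense transformer", 9.421),
    ("Fibonacci memoization", 9.426),
    ("OpenVINO advantages", 9.403),
]:
    tpot_ms = latency_s / max_new_tokens * 1000   # ms per output token
    throughput = max_new_tokens / latency_s       # tokens per second
    print(f"{prompt}: TPOT={tpot_ms:.1f} ms/tok, {throughput:.2f} tok/s")
```

For the first prompt this reproduces the table values (47.1 ms/tok, 21.23 tok/s), confirming that TPOT here is averaged over the full generation rather than excluding the first token.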


Repository Contents

| File | Description |
|---|---|
| openvino_model.bin | INT4-quantized model weights (12 GB, git-lfs) |
| openvino_model.xml | OpenVINO IR graph definition |
| openvino_tokenizer.bin/xml | OpenVINO tokenizer |
| openvino_detokenizer.bin/xml | OpenVINO detokenizer |
| config.json | Model configuration |
| export.py | Downloads the model from Hugging Face |
| infer.py | Single-prompt inference |
| benchmark.py | Latency & memory benchmark suite |

Installation

pip install optimum[openvino] transformers psutil huggingface_hub

Usage

Download the model

python export.py --output-dir ./model

Single inference

python infer.py \
  --model-dir . \
  --device GPU \
  --prompt "Explain the key differences between MoE and dense transformers."

Benchmark (latency / memory)

python benchmark.py \
  --model-dir . \
  --device GPU \
  --max-new-tokens 200 \
  --runs 3 \
  --output results.json

Arguments

| Argument | Default | Description |
|---|---|---|
| --model-dir | . | Path to OpenVINO model directory |
| --device | GPU | GPU or CPU (auto fallback to CPU) |
| --max-new-tokens | 200 | Number of tokens to generate |
| --runs | 3 | Benchmark runs per prompt |
| --output | results_optimum.json | JSON result output path |
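The actual argument handling in benchmark.py is not shown here; a minimal argparse setup matching the defaults above might look like this (a sketch, not the shipped script):

```python
import argparse

# Hypothetical reconstruction of benchmark.py's CLI, matching the table above.
parser = argparse.ArgumentParser(description="GPT-OSS-20B OpenVINO benchmark")
parser.add_argument("--model-dir", default=".",
                    help="Path to OpenVINO model directory")
parser.add_argument("--device", default="GPU", choices=["GPU", "CPU"],
                    help="GPU or CPU (auto fallback to CPU)")
parser.add_argument("--max-new-tokens", type=int, default=200,
                    help="Number of tokens to generate")
parser.add_argument("--runs", type=int, default=3,
                    help="Benchmark runs per prompt")
parser.add_argument("--output", default="results_optimum.json",
                    help="JSON result output path")

args = parser.parse_args([])  # empty list -> all defaults
print(args.model_dir, args.device, args.max_new_tokens, args.runs, args.output)
```

Note that argparse converts `--max-new-tokens` to the attribute `args.max_new_tokens` (hyphens become underscores).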

KV Cache State Access

This repository exports the model with stateful=True, which enables the OpenVINO state API:

from optimum.intel import OVModelForCausalLM

model = OVModelForCausalLM.from_pretrained(".", device="GPU", stateful=True)

# Access KV cache states after each forward pass
states = model.request.query_state()
for s in states:
    print(s.name, s.state.data.shape)  # e.g. (1, 8, seq_len, 64)

KV state shape: (1, kv_heads=8, seq_len, head_dim=64) — 48 states in total (24 layers × 2, one K and one V state per layer).
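With those shapes, the KV cache footprint is straightforward to estimate (a sketch assuming FP16 state tensors; the runtime may store them in a different precision or layout):

```python
# Estimate KV cache memory for the shapes above:
# 48 states of shape (1, kv_heads=8, seq_len, head_dim=64), 2 bytes/element (FP16).
NUM_STATES, KV_HEADS, HEAD_DIM, BYTES_PER_ELEM = 48, 8, 64, 2

def kv_cache_bytes(seq_len: int) -> int:
    """Total bytes held by all KV state tensors at a given sequence length."""
    return NUM_STATES * 1 * KV_HEADS * seq_len * HEAD_DIM * BYTES_PER_ELEM

# ~220 tokens, matching the benchmark's total context length:
print(f"{kv_cache_bytes(220) / 2**20:.1f} MiB")  # -> 10.3 MiB
```

At ~220 tokens of context the cache is only about 10 MiB, but it grows linearly with sequence length, which is what makes KV cache quantization and pruning experiments worthwhile at longer contexts.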


Hardware Requirements

  • Intel Arc GPU (Xe series) or any Intel CPU
  • At least 16 GB system RAM
  • OpenVINO 2026.1.0+

License

Model weights follow the OpenAI GPT-OSS usage policy.
Scripts in this repository are released under the Apache 2.0 License.
