
GPT-OSS-20B INT4 — optimum-intel OVModelForCausalLM

GPT-OSS-20B (OpenAI's Mixture-of-Experts model, 32 experts per MoE layer) quantized to INT4 via AutoRound, benchmarked on an Intel Arc 140V GPU through optimum-intel's OVModelForCausalLM.

This runtime provides access to the internal KV cache state tensors via OpenVINO's state API (model.request.query_state()), which is required for the TurboQuant and TriAttention experiments.

Base model: OpenVINO/gpt-oss-20b-int4-ov
Runtime: OpenVINO 2026.1.0 + optimum-intel
Device: Intel Arc 140V GPU (Lunar Lake iGPU)


Benchmark Results

Test configuration

| Item | Value |
|---|---|
| Device | Intel Arc 140V GPU |
| Max new tokens | 200 |
| Input context | ~15–25 tokens (short prompts) |
| Total context length | ~215–225 tokens |
| Runs per prompt | 3 (averaged) |
| Inference mode | Greedy (`do_sample=False`) |
| `stateful` | `True` |

Results (3 prompts averaged)

| Prompt | Latency (s) | TPOT (ms/tok) | Throughput (tok/s) |
|---|---|---|---|
| MoE vs Dense transformer | 9.421 | 47.1 | 21.23 |
| Fibonacci memoization | 9.426 | 47.1 | 21.22 |
| OpenVINO advantages | 9.403 | 47.0 | 21.27 |
| Average | 9.42 | 47.1 | 21.24 |

Compared with openvino_genai.LLMPipeline (27.0 tok/s), the optimum-intel interface is ~21% slower because of the overhead of Hugging Face's model.generate() loop. In exchange, it exposes the KV cache state directly for post-processing (quantization, pruning).
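The reported metrics are related by simple arithmetic (TPOT ≈ latency / tokens generated, throughput ≈ tokens / latency), which makes the table easy to sanity-check:

```python
# Sanity-check the reported metrics: each run generates 200 new tokens,
# so TPOT = latency / tokens and throughput = tokens / latency.
max_new_tokens = 200

for prompt, latency_s in [
    ("MoE vs Dense transformer", 9.421),
    ("Fibonacci memoization", 9.426),
    ("OpenVINO advantages", 9.403),
]:
    tpot_ms = latency_s / max_new_tokens * 1000   # ms per output token
    throughput = max_new_tokens / latency_s       # tokens per second
    print(f"{prompt}: TPOT={tpot_ms:.1f} ms/tok, {throughput:.2f} tok/s")
```

For the first prompt this reproduces the table values (47.1 ms/tok, 21.23 tok/s), confirming that TPOT here is averaged over the full generation rather than excluding the first token.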


Repository Contents

| File | Description |
|---|---|
| openvino_model.bin | INT4-quantized model weights (12 GB, git-lfs) |
| openvino_model.xml | OpenVINO IR graph definition |
| openvino_tokenizer.bin/xml | OpenVINO tokenizer |
| openvino_detokenizer.bin/xml | OpenVINO detokenizer |
| config.json | Model configuration |
| export.py | Downloads the model from Hugging Face |
| infer.py | Single-prompt inference |
| benchmark.py | Latency & memory benchmark suite |

Installation

pip install optimum[openvino] transformers psutil huggingface_hub

Usage

Download the model

python export.py --output-dir ./model

Single inference

python infer.py \
  --model-dir . \
  --device GPU \
  --prompt "Explain the key differences between MoE and dense transformers."

Benchmark (latency / memory)

python benchmark.py \
  --model-dir . \
  --device GPU \
  --max-new-tokens 200 \
  --runs 3 \
  --output results.json

Arguments

| Argument | Default | Description |
|---|---|---|
| --model-dir | . | Path to OpenVINO model directory |
| --device | GPU | GPU or CPU (auto fallback to CPU) |
| --max-new-tokens | 200 | Number of tokens to generate |
| --runs | 3 | Benchmark runs per prompt |
| --output | results_optimum.json | JSON result output path |
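The actual argument handling in benchmark.py is not shown here; a minimal argparse setup matching the defaults above might look like this (a sketch, not the shipped script):

```python
import argparse

# Hypothetical reconstruction of benchmark.py's CLI, matching the table above.
parser = argparse.ArgumentParser(description="GPT-OSS-20B OpenVINO benchmark")
parser.add_argument("--model-dir", default=".",
                    help="Path to OpenVINO model directory")
parser.add_argument("--device", default="GPU", choices=["GPU", "CPU"],
                    help="GPU or CPU (auto fallback to CPU)")
parser.add_argument("--max-new-tokens", type=int, default=200,
                    help="Number of tokens to generate")
parser.add_argument("--runs", type=int, default=3,
                    help="Benchmark runs per prompt")
parser.add_argument("--output", default="results_optimum.json",
                    help="JSON result output path")

args = parser.parse_args([])  # empty list -> all defaults
print(args.model_dir, args.device, args.max_new_tokens, args.runs, args.output)
```

Note that argparse converts `--max-new-tokens` to the attribute `args.max_new_tokens` (hyphens become underscores).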

KV Cache State Access

This repository exports the model with stateful=True, which enables the OpenVINO state API:

from optimum.intel import OVModelForCausalLM

model = OVModelForCausalLM.from_pretrained(".", device="GPU", stateful=True)

# Access KV cache states after each forward pass
states = model.request.query_state()
for s in states:
    print(s.name, s.state.data.shape)  # e.g. (1, 8, seq_len, 64)

KV state shape: (1, kv_heads=8, seq_len, head_dim=64) — 48 states in total (24 layers × 2, one K and one V state per layer).
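With those shapes, the KV cache footprint is straightforward to estimate (a sketch assuming FP16 state tensors; the runtime may store them in a different precision or layout):

```python
# Estimate KV cache memory for the shapes above:
# 48 states of shape (1, kv_heads=8, seq_len, head_dim=64), 2 bytes/element (FP16).
NUM_STATES, KV_HEADS, HEAD_DIM, BYTES_PER_ELEM = 48, 8, 64, 2

def kv_cache_bytes(seq_len: int) -> int:
    """Total bytes held by all KV state tensors at a given sequence length."""
    return NUM_STATES * 1 * KV_HEADS * seq_len * HEAD_DIM * BYTES_PER_ELEM

# ~220 tokens, matching the benchmark's total context length:
print(f"{kv_cache_bytes(220) / 2**20:.1f} MiB")  # -> 10.3 MiB
```

At ~220 tokens of context the cache is only about 10 MiB, but it grows linearly with sequence length, which is what makes KV cache quantization and pruning experiments worthwhile at longer contexts.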


Hardware Requirements

  • Intel Arc GPU (Xe series) or any Intel CPU
  • At least 16 GB system RAM
  • OpenVINO 2026.1.0+

License

Model weights follow the OpenAI GPT-OSS usage policy.
Scripts in this repository are released under the Apache 2.0 License.
