
GPT-OSS-20B INT4 — OpenVINO GenAI Inference

GPT-OSS-20B (OpenAI's open-weight Mixture-of-Experts model, 32 experts per MoE layer) quantized to INT4 via AutoRound and benchmarked on an Intel Arc 140V GPU using openvino_genai.LLMPipeline.

Base model: OpenVINO/gpt-oss-20b-int4-ov
Runtime: OpenVINO 2026.1.0 + openvino-genai
Device: Intel Arc 140V GPU (Lunar Lake iGPU)


Benchmark Results

Test configuration

| Item | Value |
|---|---|
| Device | Intel Arc 140V GPU |
| Max new tokens | 200 |
| Input context | ~15–25 tokens (short prompts) |
| Total context length | ~215–225 tokens |
| Runs per prompt | 3 (averaged) |
| Inference mode | Greedy (do_sample=False) |

Results (3 prompts averaged)

| Prompt | Latency (s) | TPOT (ms/tok) | Throughput (tok/s) |
|---|---|---|---|
| MoE vs Dense transformer | 7.534 | 54.2 | 26.55 |
| Fibonacci memoization | 7.338 | 53.2 | 27.25 |
| OpenVINO advantages | 7.361 | 56.5 | 27.17 |
| **Average** | 7.41 | 54.6 | 27.0 |

openvino_genai.LLMPipeline delivers the best throughput (~27 tok/s) among all tested runtimes, thanks to its built-in continuous batching and scheduling optimizations.
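The throughput column follows directly from the latency figures, assuming throughput is computed as tokens generated divided by end-to-end latency; a quick arithmetic sanity check against the table above:

```python
# Sanity-check the reported throughput: tokens generated / end-to-end latency.
# Latencies and the 200-token budget come from the benchmark table above.
MAX_NEW_TOKENS = 200

runs = {
    "MoE vs Dense transformer": 7.534,
    "Fibonacci memoization": 7.338,
    "OpenVINO advantages": 7.361,
}

for prompt, latency_s in runs.items():
    throughput = MAX_NEW_TOKENS / latency_s  # tokens per second
    print(f"{prompt}: {throughput:.2f} tok/s")

avg_latency = sum(runs.values()) / len(runs)
print(f"Average latency: {avg_latency:.2f} s")  # ≈ 7.41 s
```

The first row reproduces the reported 26.55 tok/s exactly; small last-digit differences in the other rows come from rounding the published latencies.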


Repository Contents

| File | Description |
|---|---|
| `openvino_model.bin` | INT4-quantized model weights (12 GB, git-lfs) |
| `openvino_model.xml` | OpenVINO IR graph definition |
| `openvino_tokenizer.bin/xml` | OpenVINO tokenizer |
| `openvino_detokenizer.bin/xml` | OpenVINO detokenizer |
| `config.json` | Model configuration |
| `export.py` | Downloads the model from the Hugging Face Hub |
| `infer.py` | Single-prompt inference |
| `benchmark.py` | Latency & memory benchmark suite |

Installation

```bash
pip install openvino-genai openvino psutil huggingface_hub
```

Usage

Download the model

```bash
python export.py --output-dir ./model
```

Single inference

```bash
python infer.py \
  --model-dir . \
  --device GPU \
  --prompt "Explain the key differences between MoE and dense transformers."
```
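The core of a single-prompt script like this is only a few lines of openvino_genai. A minimal sketch, with flag names mirroring the command above (the script body is an illustration, not the repository's actual infer.py):

```python
# Minimal sketch of single-prompt inference with openvino_genai.LLMPipeline.
# Assumes the INT4 IR files (openvino_model.xml/.bin, tokenizer) sit in --model-dir.
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Single-prompt OpenVINO GenAI inference")
    parser.add_argument("--model-dir", default=".", help="Path to the OpenVINO IR directory")
    parser.add_argument("--device", default="GPU", help="GPU or CPU")
    parser.add_argument("--prompt", required=True)
    parser.add_argument("--max-new-tokens", type=int, default=200)
    return parser.parse_args(argv)

def run(args):
    # Deferred import so argument parsing works without the runtime installed.
    import openvino_genai as ov_genai
    pipe = ov_genai.LLMPipeline(args.model_dir, args.device)
    config = ov_genai.GenerationConfig()
    config.max_new_tokens = args.max_new_tokens
    config.do_sample = False  # greedy decoding, matching the benchmark setup
    return pipe.generate(args.prompt, config)
```

Calling `run(parse_args())` from a `__main__` guard reproduces the CLI behaviour shown above.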

Benchmark (latency / memory)

```bash
python benchmark.py \
  --model-dir . \
  --device GPU \
  --max-new-tokens 200 \
  --runs 3 \
  --output results.json
```
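The measurement loop behind a benchmark like this is a warm-up pass followed by timed runs averaged per prompt. A runtime-agnostic sketch (the helper name and structure are illustrative, not the repository's actual benchmark.py):

```python
# Sketch of a warm-up + averaged-runs benchmark loop, as used for the
# "3 runs per prompt, averaged" figures above.
import time
import statistics

def benchmark(generate, prompts, runs=3, warmup=1):
    """Time `generate(prompt)` over several runs and average per prompt.

    `generate` is any callable that produces the completion, e.g. a
    closure over an LLMPipeline; injecting it keeps the timing logic
    independent of the runtime.
    """
    results = {}
    for prompt in prompts:
        for _ in range(warmup):  # warm-up: compile kernels, fill caches
            generate(prompt)
        latencies = []
        for _ in range(runs):
            start = time.perf_counter()
            generate(prompt)
            latencies.append(time.perf_counter() - start)
        results[prompt] = statistics.mean(latencies)
    return results

# Usage with an LLMPipeline (pipe, config as in the inference example):
#   results = benchmark(lambda p: pipe.generate(p, config), prompts, runs=3)
#   json.dump(results, open("results.json", "w"), indent=2)
```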

Arguments

| Argument | Default | Description |
|---|---|---|
| `--model-dir` | `.` | Path to OpenVINO model directory |
| `--device` | `GPU` | GPU or CPU (auto fallback to CPU) |
| `--max-new-tokens` | 200 | Number of tokens to generate |
| `--runs` | 3 | Benchmark runs per prompt |
| `--output` | `results_genai.json` | JSON result output path |
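The GPU→CPU fallback mentioned above can be implemented by checking OpenVINO's reported devices before constructing the pipeline. A sketch (the `select_device` helper is illustrative, not the repository's code):

```python
# Pick the preferred device if OpenVINO reports it, else fall back to CPU.
def select_device(available_devices, preferred="GPU"):
    """`available_devices` is a list like ["CPU", "GPU"] or ["CPU", "GPU.0"],
    as returned by openvino.Core().available_devices."""
    if any(d == preferred or d.startswith(preferred + ".") for d in available_devices):
        return preferred
    return "CPU"

# Usage (requires the openvino packages and the exported model):
#   from openvino import Core
#   import openvino_genai as ov_genai
#   device = select_device(Core().available_devices, preferred="GPU")
#   pipe = ov_genai.LLMPipeline("./model", device)
```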

Hardware Requirements

  • Intel Arc GPU (Xe series) or any Intel CPU
  • At least 16 GB system RAM
  • OpenVINO 2026.1.0+

License

Model weights follow the OpenAI GPT-OSS usage policy.
Scripts in this repository are released under the Apache 2.0 License.
