
GPT-OSS-20B INT4 — OpenVINO GenAI Inference

GPT-OSS-20B (OpenAI's open-weight Mixture-of-Experts model, 32 experts per MoE layer) quantized to INT4 via AutoRound and benchmarked on an Intel Arc 140V GPU using openvino_genai.LLMPipeline.

Base model: OpenVINO/gpt-oss-20b-int4-ov
Runtime: OpenVINO 2026.1.0 + openvino-genai
Device: Intel Arc 140V GPU (Lunar Lake iGPU)


Benchmark Results

Test configuration

| Item | Value |
|---|---|
| Device | Intel Arc 140V GPU |
| Max new tokens | 200 |
| Input context | ~15–25 tokens (short prompts) |
| Total context length | ~215–225 tokens |
| Runs per prompt | 3 (averaged) |
| Inference mode | Greedy (do_sample=False) |

Results (3 prompts averaged)

| Prompt | Latency (s) | TPOT (ms/tok) | Throughput (tok/s) |
|---|---|---|---|
| MoE vs Dense transformer | 7.534 | 54.2 | 26.55 |
| Fibonacci memoization | 7.338 | 53.2 | 27.25 |
| OpenVINO advantages | 7.361 | 56.5 | 27.17 |
| **Average** | 7.41 | 54.6 | 27.0 |

openvino_genai.LLMPipeline delivers the best throughput (~27 tok/s) among all tested runtimes, thanks to its built-in continuous batching and scheduling optimizations.
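The throughput column follows directly from the latency figures, assuming throughput is computed as tokens generated divided by end-to-end latency; a quick arithmetic sanity check against the table above:

```python
# Sanity-check the reported throughput: tokens generated / end-to-end latency.
# Latencies and the 200-token budget come from the benchmark table above.
MAX_NEW_TOKENS = 200

runs = {
    "MoE vs Dense transformer": 7.534,
    "Fibonacci memoization": 7.338,
    "OpenVINO advantages": 7.361,
}

for prompt, latency_s in runs.items():
    throughput = MAX_NEW_TOKENS / latency_s  # tokens per second
    print(f"{prompt}: {throughput:.2f} tok/s")

avg_latency = sum(runs.values()) / len(runs)
print(f"Average latency: {avg_latency:.2f} s")  # ≈ 7.41 s
```

The first row reproduces the reported 26.55 tok/s exactly; small last-digit differences in the other rows come from rounding the published latencies.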


Repository Contents

| File | Description |
|---|---|
| `openvino_model.bin` | INT4-quantized model weights (12 GB, git-lfs) |
| `openvino_model.xml` | OpenVINO IR graph definition |
| `openvino_tokenizer.bin/xml` | OpenVINO tokenizer |
| `openvino_detokenizer.bin/xml` | OpenVINO detokenizer |
| `config.json` | Model configuration |
| `export.py` | Downloads the model from the Hugging Face Hub |
| `infer.py` | Single-prompt inference |
| `benchmark.py` | Latency & memory benchmark suite |

Installation

```bash
pip install openvino-genai openvino psutil huggingface_hub
```

Usage

Download the model

```bash
python export.py --output-dir ./model
```

Single inference

```bash
python infer.py \
  --model-dir . \
  --device GPU \
  --prompt "Explain the key differences between MoE and dense transformers."
```
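The core of a single-prompt script like this is only a few lines of openvino_genai. A minimal sketch, with flag names mirroring the command above (the script body is an illustration, not the repository's actual infer.py):

```python
# Minimal sketch of single-prompt inference with openvino_genai.LLMPipeline.
# Assumes the INT4 IR files (openvino_model.xml/.bin, tokenizer) sit in --model-dir.
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Single-prompt OpenVINO GenAI inference")
    parser.add_argument("--model-dir", default=".", help="Path to the OpenVINO IR directory")
    parser.add_argument("--device", default="GPU", help="GPU or CPU")
    parser.add_argument("--prompt", required=True)
    parser.add_argument("--max-new-tokens", type=int, default=200)
    return parser.parse_args(argv)

def run(args):
    # Deferred import so argument parsing works without the runtime installed.
    import openvino_genai as ov_genai
    pipe = ov_genai.LLMPipeline(args.model_dir, args.device)
    config = ov_genai.GenerationConfig()
    config.max_new_tokens = args.max_new_tokens
    config.do_sample = False  # greedy decoding, matching the benchmark setup
    return pipe.generate(args.prompt, config)
```

Calling `run(parse_args())` from a `__main__` guard reproduces the CLI behaviour shown above.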

Benchmark (latency / memory)

```bash
python benchmark.py \
  --model-dir . \
  --device GPU \
  --max-new-tokens 200 \
  --runs 3 \
  --output results.json
```
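The measurement loop behind a benchmark like this is a warm-up pass followed by timed runs averaged per prompt. A runtime-agnostic sketch (the helper name and structure are illustrative, not the repository's actual benchmark.py):

```python
# Sketch of a warm-up + averaged-runs benchmark loop, as used for the
# "3 runs per prompt, averaged" figures above.
import time
import statistics

def benchmark(generate, prompts, runs=3, warmup=1):
    """Time `generate(prompt)` over several runs and average per prompt.

    `generate` is any callable that produces the completion, e.g. a
    closure over an LLMPipeline; injecting it keeps the timing logic
    independent of the runtime.
    """
    results = {}
    for prompt in prompts:
        for _ in range(warmup):  # warm-up: compile kernels, fill caches
            generate(prompt)
        latencies = []
        for _ in range(runs):
            start = time.perf_counter()
            generate(prompt)
            latencies.append(time.perf_counter() - start)
        results[prompt] = statistics.mean(latencies)
    return results

# Usage with an LLMPipeline (pipe, config as in the inference example):
#   results = benchmark(lambda p: pipe.generate(p, config), prompts, runs=3)
#   json.dump(results, open("results.json", "w"), indent=2)
```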

Arguments

| Argument | Default | Description |
|---|---|---|
| `--model-dir` | `.` | Path to OpenVINO model directory |
| `--device` | `GPU` | GPU or CPU (auto fallback to CPU) |
| `--max-new-tokens` | 200 | Number of tokens to generate |
| `--runs` | 3 | Benchmark runs per prompt |
| `--output` | `results_genai.json` | JSON result output path |
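The GPU→CPU fallback mentioned above can be implemented by checking OpenVINO's reported devices before constructing the pipeline. A sketch (the `select_device` helper is illustrative, not the repository's code):

```python
# Pick the preferred device if OpenVINO reports it, else fall back to CPU.
def select_device(available_devices, preferred="GPU"):
    """`available_devices` is a list like ["CPU", "GPU"] or ["CPU", "GPU.0"],
    as returned by openvino.Core().available_devices."""
    if any(d == preferred or d.startswith(preferred + ".") for d in available_devices):
        return preferred
    return "CPU"

# Usage (requires the openvino packages and the exported model):
#   from openvino import Core
#   import openvino_genai as ov_genai
#   device = select_device(Core().available_devices, preferred="GPU")
#   pipe = ov_genai.LLMPipeline("./model", device)
```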

Hardware Requirements

  • Intel Arc GPU (Xe series) or any Intel CPU
  • At least 16 GB system RAM
  • OpenVINO 2026.1.0+

License

Model weights follow the OpenAI GPT-OSS usage policy.
Scripts in this repository are released under the Apache 2.0 License.
