# GPT-OSS-20B INT4 — OpenVINO GenAI Inference
GPT-OSS-20B (OpenAI's Mixture-of-Experts architecture, 32 experts) quantized to INT4 via AutoRound,
benchmarked on an Intel Arc 140V GPU using `openvino_genai.LLMPipeline`.
- Base model: `OpenVINO/gpt-oss-20b-int4-ov`
- Runtime: OpenVINO 2026.1.0 + `openvino-genai`
- Device: Intel Arc 140V GPU (Lunar Lake iGPU)
## Benchmark Results
### Test configuration
| Item | Value |
|---|---|
| Device | Intel Arc 140V GPU |
| Max new tokens | 200 |
| Input context | ~15–25 tokens (short prompts) |
| Total context length | ~215–225 tokens |
| Runs per prompt | 3 (averaged) |
| Inference mode | Greedy (do_sample=False) |
### Results (3 prompts averaged)
| Prompt | Latency (s) | TPOT (ms/tok) | Throughput (tok/s) |
|---|---|---|---|
| MoE vs Dense transformer | 7.534 | 54.2 | 26.55 |
| Fibonacci memoization | 7.338 | 53.2 | 27.25 |
| OpenVINO advantages | 7.361 | 56.5 | 27.17 |
| Average | 7.41 | 54.6 | 27.0 |
`openvino_genai.LLMPipeline` delivers the best throughput (~27 tok/s) among all tested runtimes, thanks to its built-in continuous batching and scheduling optimizations.
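For reference, the throughput figures above follow directly from the generated-token count and total latency. A minimal sketch of the metric arithmetic (the helper names are illustrative, not taken from `benchmark.py`; TPOT is typically measured per decode step and so need not equal latency divided by token count):

```python
def throughput_tok_s(n_tokens: int, latency_s: float) -> float:
    """End-to-end throughput: generated tokens divided by total latency."""
    return n_tokens / latency_s

def tpot_ms(latency_s: float, ttft_s: float, n_tokens: int) -> float:
    """Time per output token: decode time (latency minus time-to-first-token)
    spread over the remaining n_tokens - 1 tokens."""
    return (latency_s - ttft_s) * 1000 / (n_tokens - 1)

# The MoE-vs-dense row: 200 new tokens in 7.534 s.
print(round(throughput_tok_s(200, 7.534), 2))  # → 26.55
```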
## Repository Contents
| File | Description |
|---|---|
| `openvino_model.bin` | INT4-quantized model weights (12 GB, git-lfs) |
| `openvino_model.xml` | OpenVINO IR graph definition |
| `openvino_tokenizer.bin/xml` | OpenVINO tokenizer |
| `openvino_detokenizer.bin/xml` | OpenVINO detokenizer |
| `config.json` | Model configuration |
| `export.py` | Download the model from Hugging Face |
| `infer.py` | Single-prompt inference |
| `benchmark.py` | Latency & memory benchmark suite |
## Installation

```bash
pip install openvino-genai openvino psutil huggingface_hub
```
## Usage
### Download the model

```bash
python export.py --output-dir ./model
```
### Single inference

```bash
python infer.py \
  --model-dir . \
  --device GPU \
  --prompt "Explain the key differences between MoE and dense transformers."
```
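At its core the script reduces to a single `LLMPipeline` call. A hedged sketch of what `infer.py` plausibly looks like (argument names mirror the CLI above; the actual script may differ):

```python
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Single-prompt OpenVINO GenAI inference")
    parser.add_argument("--model-dir", default=".")
    parser.add_argument("--device", default="GPU")
    parser.add_argument("--prompt", required=True)
    parser.add_argument("--max-new-tokens", type=int, default=200)
    return parser.parse_args(argv)

def main():
    # Imported here so the CLI can be inspected without the runtime installed.
    import openvino_genai as ov_genai
    args = parse_args()
    pipe = ov_genai.LLMPipeline(args.model_dir, args.device)
    config = ov_genai.GenerationConfig()
    config.max_new_tokens = args.max_new_tokens
    config.do_sample = False  # greedy decoding, matching the benchmark setup
    print(pipe.generate(args.prompt, config))

if __name__ == "__main__":
    main()
```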
### Benchmark (latency / memory)

```bash
python benchmark.py \
  --model-dir . \
  --device GPU \
  --max-new-tokens 200 \
  --runs 3 \
  --output results.json
```
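The measurement strategy is standard: one warm-up generation per prompt (to exclude model compile and cache effects), then timed runs that are averaged. A minimal sketch of that loop, independent of the actual `benchmark.py` internals:

```python
import statistics
import time

def benchmark(generate, prompts, runs=3):
    """Average wall-clock latency of generate(prompt) over `runs` timed calls."""
    results = {}
    for prompt in prompts:
        generate(prompt)  # warm-up run, excluded from timing
        times = []
        for _ in range(runs):
            start = time.perf_counter()
            generate(prompt)
            times.append(time.perf_counter() - start)
        results[prompt] = statistics.mean(times)
    return results
```

With `openvino_genai`, `generate` would be something like `lambda p: pipe.generate(p, config)`.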
### Arguments
| Argument | Default | Description |
|---|---|---|
| `--model-dir` | `.` | Path to OpenVINO model directory |
| `--device` | `GPU` | `GPU` or `CPU` (automatic fallback to CPU) |
| `--max-new-tokens` | `200` | Number of tokens to generate |
| `--runs` | `3` | Benchmark runs per prompt |
| `--output` | `results_genai.json` | JSON result output path |
## Hardware Requirements
- Intel Arc GPU (Xe series) or any Intel CPU
- At least 16 GB system RAM
- OpenVINO 2026.1.0+
## License
Model weights follow the OpenAI GPT-OSS usage policy.
Scripts in this repository are released under the Apache 2.0 License.