vLLM 0.19.0 for Jetson Orin (CUDA 12.6, SM 8.7)

Pre-built vLLM 0.19.0 wheel for the NVIDIA Jetson Orin family, published within 72 hours of the Gemma 4 release.

Highlights

  • vLLM 0.19.0: latest release with native Gemma 4 architecture support
  • SM 8.7 tensor cores: compiled specifically for the Jetson Orin family
  • CUDA 12.6 / JetPack 6.2: matches the current Jetson SDK
  • 529 MB wheel: single-file install, no compilation needed

Supported Devices

| Device | Status |
|---|---|
| Jetson AGX Orin 64GB | Tested |
| Jetson AGX Orin 32GB | Compatible |
| Jetson Orin NX 16GB | Compatible |
| Jetson Orin Nano 8GB | Memory constrained |
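As a rough illustration of why the Orin Nano 8GB is listed as memory constrained, here is a back-of-envelope estimate. The parameter count (~4B for an "E4B" model) and the 4-bit weight size (~0.5 bytes per parameter) are assumptions for illustration, not figures from this card; the 0.65 memory fraction matches the serve command in this README.

```python
# Back-of-envelope memory estimate for a W4A16 model on an 8GB Jetson board.
# Assumptions (NOT from the model card): ~4e9 parameters, 4-bit weights
# (~0.5 bytes per parameter).

PARAMS = 4e9                  # assumed parameter count for an "E4B" model
WEIGHT_BYTES_PER_PARAM = 0.5  # 4-bit quantized weights

weights_gb = PARAMS * WEIGHT_BYTES_PER_PARAM / 1e9
budget_gb = 8 * 0.65          # usable budget at --gpu-memory-utilization 0.65

print(f"weights ~{weights_gb:.1f} GB of a ~{budget_gb:.1f} GB budget")
```

The remaining headroom must hold activations and the KV cache, and on Jetson the GPU shares unified memory with the OS, so the real margin on an 8GB board is tighter still.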

Install (wheel)

```bash
# Create venv
python3 -m venv ~/vllm-env && source ~/vllm-env/bin/activate

# Install PyTorch for Jetson (bundled in this repo)
pip install https://huggingface.co/YuyiRobot/vllm-jetson-orin/resolve/main/torch-2.10.0-cp310-cp310-linux_aarch64.whl

# Install vLLM
pip install https://huggingface.co/YuyiRobot/vllm-jetson-orin/resolve/main/vllm-0.19.0+cu126-cp310-cp310-linux_aarch64.whl

# Install transformers for Gemma 4
pip install --no-deps "transformers>=5.5.0"
```
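Both wheels are tagged `cp310` / `linux_aarch64`, so they require Python 3.10 on 64-bit ARM. As a quick sanity check before installing, you can read those tags straight off the filename (split per the PEP 427 wheel naming convention; the `wheel_tags` helper is my own sketch, not part of this repo):

```python
# Sketch: split a wheel filename (PEP 427) into its compatibility tags.
import sys
import platform

def wheel_tags(filename: str) -> dict:
    """Split 'name-version(-build)?-python-abi-platform.whl' into its tags."""
    parts = filename.removesuffix(".whl").split("-")
    python_tag, abi_tag, platform_tag = parts[-3], parts[-2], parts[-1]
    return {"python": python_tag, "abi": abi_tag, "platform": platform_tag}

tags = wheel_tags("vllm-0.19.0+cu126-cp310-cp310-linux_aarch64.whl")
print(tags)  # {'python': 'cp310', 'abi': 'cp310', 'platform': 'linux_aarch64'}

# On the Jetson itself you would expect:
#   sys.version_info[:2] == (3, 10) and platform.machine() == "aarch64"
```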

Run Gemma 4

```bash
vllm serve /path/to/gemma-4-E4B-it-W4A16 \
    --host 0.0.0.0 --port 8000 \
    --served-model-name gemma-4-e4b \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.65 \
    --enable-prefix-caching
```
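`vllm serve` exposes an OpenAI-compatible API, so once the server is up you can query it over plain HTTP. A minimal stdlib-only sketch (the host, port, and model name come from the serve command above; the prompt and `max_tokens` value are illustrative):

```python
# Minimal chat request against vLLM's OpenAI-compatible endpoint, stdlib only.
import json
import urllib.request

def build_chat_request(prompt: str,
                       base_url: str = "http://localhost:8000",
                       model: str = "gemma-4-e4b") -> urllib.request.Request:
    """Build a POST to the /v1/chat/completions endpoint."""
    body = {
        "model": model,  # must match --served-model-name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    # Only works while the server above is actually listening.
    with urllib.request.urlopen(build_chat_request("Hello from Jetson!")) as resp:
        reply = json.load(resp)
        print(reply["choices"][0]["message"]["content"])
```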

Benchmark (Jetson AGX Orin 64GB)

Model: Gemma-4-E4B-IT-W4A16 (GPTQ W4A16)

| Test | Time | Prompt tokens | Generated tokens | Speed |
|---|---|---|---|---|
| Short in / Short out (TTFT) | 3,511 ms | - | 11 | 3.1 tok/s |
| Short in / Long out (Decode) | 16,586 ms | 27 | 512 | 30.9 tok/s |
| Long in / Short out (Prefill) | 1,409 ms | 609 | 20 | 432.2 tok/s |

Docker results: Decode 28.3 tok/s, Prefill 563.4 tok/s (similar performance)
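The speed column is simply tokens divided by wall time, so the table's decode and prefill figures can be reproduced from its raw numbers:

```python
# Recompute the table's tok/s figures from its raw times and token counts.
decode_speed = 512 / 16.586    # 512 generated tokens in 16,586 ms
prefill_speed = 609 / 1.409    # 609 prompt tokens prefilled in 1,409 ms

print(f"decode: {decode_speed:.1f} tok/s")    # 30.9 tok/s, matching the table
print(f"prefill: {prefill_speed:.1f} tok/s")  # 432.2 tok/s, matching the table
```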

Useful Links

Acknowledgements
