vLLM 0.19.0 for Jetson Orin (CUDA 12.6, SM 8.7)

Pre-built vLLM 0.19.0 wheel for the NVIDIA Jetson Orin family, published within 72 hours of the Gemma 4 release.

Highlights

  • vLLM 0.19.0: latest release with native Gemma 4 architecture support
  • SM 8.7 tensor cores: compiled specifically for the Jetson Orin family
  • CUDA 12.6 / JetPack 6.2: matches the current Jetson SDK
  • 529 MB wheel: single-file install, no compilation needed

Supported Devices

| Device | Status |
|---|---|
| Jetson AGX Orin 64GB | Tested |
| Jetson AGX Orin 32GB | Compatible |
| Jetson Orin NX 16GB | Compatible |
| Jetson Orin Nano 8GB | Memory constrained |
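As a rough illustration of why the Orin Nano 8GB is listed as memory constrained, here is a back-of-envelope estimate. The parameter count (~4B for an "E4B" model) and the 4-bit weight size (~0.5 bytes per parameter) are assumptions for illustration, not figures from this card; the 0.65 memory fraction matches the serve command in this README.

```python
# Back-of-envelope memory estimate for a W4A16 model on an 8GB Jetson board.
# Assumptions (NOT from the model card): ~4e9 parameters, 4-bit weights
# (~0.5 bytes per parameter).

PARAMS = 4e9                  # assumed parameter count for an "E4B" model
WEIGHT_BYTES_PER_PARAM = 0.5  # 4-bit quantized weights

weights_gb = PARAMS * WEIGHT_BYTES_PER_PARAM / 1e9
budget_gb = 8 * 0.65          # usable budget at --gpu-memory-utilization 0.65

print(f"weights ~{weights_gb:.1f} GB of a ~{budget_gb:.1f} GB budget")
```

The remaining headroom must hold activations and the KV cache, and on Jetson the GPU shares unified memory with the OS, so the real margin on an 8GB board is tighter still.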

Install (wheel)

```bash
# Create venv
python3 -m venv ~/vllm-env && source ~/vllm-env/bin/activate

# Install PyTorch for Jetson (bundled in this repo)
pip install https://huggingface.co/YuyiRobot/vllm-jetson-orin/resolve/main/torch-2.10.0-cp310-cp310-linux_aarch64.whl

# Install vLLM
pip install https://huggingface.co/YuyiRobot/vllm-jetson-orin/resolve/main/vllm-0.19.0+cu126-cp310-cp310-linux_aarch64.whl

# Install transformers for Gemma 4
pip install --no-deps "transformers>=5.5.0"
```
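Both wheels are tagged `cp310` / `linux_aarch64`, so they require Python 3.10 on 64-bit ARM. As a quick sanity check before installing, you can read those tags straight off the filename (split per the PEP 427 wheel naming convention; the `wheel_tags` helper is my own sketch, not part of this repo):

```python
# Sketch: split a wheel filename (PEP 427) into its compatibility tags.
import sys
import platform

def wheel_tags(filename: str) -> dict:
    """Split 'name-version(-build)?-python-abi-platform.whl' into its tags."""
    parts = filename.removesuffix(".whl").split("-")
    python_tag, abi_tag, platform_tag = parts[-3], parts[-2], parts[-1]
    return {"python": python_tag, "abi": abi_tag, "platform": platform_tag}

tags = wheel_tags("vllm-0.19.0+cu126-cp310-cp310-linux_aarch64.whl")
print(tags)  # {'python': 'cp310', 'abi': 'cp310', 'platform': 'linux_aarch64'}

# On the Jetson itself you would expect:
#   sys.version_info[:2] == (3, 10) and platform.machine() == "aarch64"
```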

Run Gemma 4

```bash
vllm serve /path/to/gemma-4-E4B-it-W4A16 \
    --host 0.0.0.0 --port 8000 \
    --served-model-name gemma-4-e4b \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.65 \
    --enable-prefix-caching
```
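`vllm serve` exposes an OpenAI-compatible API, so once the server is up you can query it over plain HTTP. A minimal stdlib-only sketch (the host, port, and model name come from the serve command above; the prompt and `max_tokens` value are illustrative):

```python
# Minimal chat request against vLLM's OpenAI-compatible endpoint, stdlib only.
import json
import urllib.request

def build_chat_request(prompt: str,
                       base_url: str = "http://localhost:8000",
                       model: str = "gemma-4-e4b") -> urllib.request.Request:
    """Build a POST to the /v1/chat/completions endpoint."""
    body = {
        "model": model,  # must match --served-model-name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    # Only works while the server above is actually listening.
    with urllib.request.urlopen(build_chat_request("Hello from Jetson!")) as resp:
        reply = json.load(resp)
        print(reply["choices"][0]["message"]["content"])
```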

Benchmark (Jetson AGX Orin 64GB)

Model: Gemma-4-E4B-IT-W4A16 (GPTQ W4A16)

| Test | Time | Prompt tokens | Generated tokens | Speed |
|---|---|---|---|---|
| Short in / Short out (TTFT) | 3,511 ms | - | 11 | 3.1 tok/s |
| Short in / Long out (Decode) | 16,586 ms | 27 | 512 | 30.9 tok/s |
| Long in / Short out (Prefill) | 1,409 ms | 609 | 20 | 432.2 tok/s |

Docker results: Decode 28.3 tok/s, Prefill 563.4 tok/s (similar performance)
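The speed column is simply tokens divided by wall time, so the table's decode and prefill figures can be reproduced from its raw numbers:

```python
# Recompute the table's tok/s figures from its raw times and token counts.
decode_speed = 512 / 16.586    # 512 generated tokens in 16,586 ms
prefill_speed = 609 / 1.409    # 609 prompt tokens prefilled in 1,409 ms

print(f"decode: {decode_speed:.1f} tok/s")    # 30.9 tok/s, matching the table
print(f"prefill: {prefill_speed:.1f} tok/s")  # 432.2 tok/s, matching the table
```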

Useful Links

Acknowledgements
