# vLLM 0.19.0 for Jetson Orin (CUDA 12.6, SM 8.7)
Pre-built vLLM 0.19.0 wheel for the NVIDIA Jetson Orin family – ready within 72 hours of the Gemma 4 release.
## Highlights
- vLLM 0.19.0 – latest release with native Gemma 4 architecture support
- SM 8.7 tensor cores – compiled specifically for Jetson Orin
- CUDA 12.6 / JetPack 6.2 – matches the current Jetson SDK
- 529 MB wheel – single-file install, no compilation needed
## Supported Devices
| Device | Status |
|---|---|
| Jetson AGX Orin 64GB | Tested |
| Jetson AGX Orin 32GB | Compatible |
| Jetson Orin NX 16GB | Compatible |
| Jetson Orin Nano 8GB | Memory constrained |
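Whether a given model fits a given board can be sanity-checked with a rough back-of-envelope calculation. The sketch below is illustrative only: the function name, the 2 GB runtime/KV-cache margin, and the 0.5 bytes-per-parameter figure for 4-bit weights are assumptions, not measurements from this release.

```python
# Rough memory-fit heuristic for a W4A16 (4-bit weight) model on Jetson
# devices, where CPU and GPU share unified memory. All constants below are
# illustrative assumptions.

def fits_in_memory(params_b: float, device_gb: float, util: float = 0.65) -> bool:
    """Return True if 4-bit weights plus a working margin fit the GPU budget.

    params_b  : model size in billions of parameters
    device_gb : total device memory in GB
    util      : fraction vLLM may claim (the --gpu-memory-utilization flag)
    """
    weights_gb = params_b * 0.5   # 4-bit weights: ~0.5 bytes per parameter
    overhead_gb = 2.0             # assumed margin for runtime + KV cache
    return weights_gb + overhead_gb <= device_gb * util

# An assumed 4B-parameter model on an 8 GB Orin Nano at 0.65 utilization:
# 2.0 GB weights + 2.0 GB margin vs. a 5.2 GB budget
print(fits_in_memory(4.0, 8.0))   # True, but tight - hence "Memory constrained"
```

This is why the smaller boards work but sit close to the limit: lowering `--gpu-memory-utilization` or `--max-model-len` trades throughput headroom for stability.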
## Install (wheel)
```shell
# Create venv
python3 -m venv ~/vllm-env && source ~/vllm-env/bin/activate

# Install PyTorch for Jetson (bundled in this repo)
pip install https://huggingface.co/YuyiRobot/vllm-jetson-orin/resolve/main/torch-2.10.0-cp310-cp310-linux_aarch64.whl

# Install vLLM
pip install https://huggingface.co/YuyiRobot/vllm-jetson-orin/resolve/main/vllm-0.19.0+cu126-cp310-cp310-linux_aarch64.whl

# Install transformers for Gemma 4
pip install --no-deps "transformers>=5.5.0"
```
## Run Gemma 4
```shell
vllm serve /path/to/gemma-4-E4B-it-W4A16 \
  --host 0.0.0.0 --port 8000 \
  --served-model-name gemma-4-e4b \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.65 \
  --enable-prefix-caching
```
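Once the server is up, it speaks vLLM's standard OpenAI-compatible API. A minimal client sketch, assuming the server runs on `localhost:8000` with the flags above (`build_chat_request` is a hypothetical helper, not part of this release):

```python
# Build a request for vLLM's OpenAI-compatible chat endpoint.
# Assumes the serve command above: port 8000, --served-model-name gemma-4-e4b.
import json
from urllib import request

def build_chat_request(prompt: str, host: str = "localhost", port: int = 8000):
    """Return (url, payload) for a /v1/chat/completions call."""
    url = f"http://{host}:{port}/v1/chat/completions"
    payload = {
        "model": "gemma-4-e4b",   # must match --served-model-name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return url, payload

url, payload = build_chat_request("Hello from Jetson!")
print(url)

# To actually send it (requires the server to be running):
# req = request.Request(url, data=json.dumps(payload).encode(),
#                       headers={"Content-Type": "application/json"})
# print(request.urlopen(req).read().decode())
```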
## Benchmark (Jetson AGX Orin 64GB)
Model: Gemma-4-E4B-IT-W4A16 (GPTQ W4A16)
| Test | Time | Prompt tokens | Generated tokens | Speed |
|---|---|---|---|---|
| Short in / Short out (TTFT) | 3,511 ms | - | 11 | 3.1 tok/s |
| Short in / Long out (Decode) | 16,586 ms | 27 | 512 | 30.9 tok/s |
| Long in / Short out (Prefill) | 1,409 ms | 609 | 20 | 432.2 tok/s |
Docker results: Decode 28.3 tok/s, Prefill 563.4 tok/s (similar performance)
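The Speed column is simply tokens divided by wall-clock time; a quick arithmetic check against the decode and prefill rows above (the helper name is illustrative):

```python
# Reproduce the Speed column: tokens / seconds, rounded to one decimal.
# Decode uses generated tokens; prefill uses prompt tokens.

def tok_per_s(tokens: int, millis: float) -> float:
    return round(tokens / (millis / 1000.0), 1)

print(tok_per_s(512, 16586))   # decode: 512 tokens in 16,586 ms -> 30.9
print(tok_per_s(609, 1409))    # prefill: 609 prompt tokens in 1,409 ms -> 432.2
```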
## Useful Links
- Docker image (recommended for production): `ghcr.io/yuyirobotlab/vllm-orin:0.19.0`
- Build from source
## Acknowledgements
- vLLM – the inference engine
- thehighnotes/vllm-jetson-orin – pioneered vLLM wheel distribution for Jetson Orin
- NVIDIA Jetson AI Lab – PyTorch wheels and ecosystem