vLLM Docker Image – Qwen3.5-122B-A10B-NVFP4 on Jetson AGX Thor
First confirmed deployment of Qwen3.5-122B-A10B at NVFP4 precision on a single NVIDIA Jetson AGX Thor (128GB unified memory).
This repository hosts the compressed Docker image tarballs needed to run a vLLM inference server for Qwen3.5-122B-A10B on Jetson AGX Thor. For full documentation, reproduction scripts, patches, and the Dockerfile, see the companion GitHub repository:
📦 GitHub (scripts, patches, docs): patrickbdevaney/qwen-3.5-122b-a10b-jetson-thor
🤗 Base model weights: Qwen/Qwen3.5-122B-A10B
What's in This Repository
This HuggingFace repo contains only the Docker image as compressed tar archives. The image has all required patches pre-applied and is ready to load on a Jetson AGX Thor system.
| File | Description |
|---|---|
| `vllm-thor-qwen35-latest.tar.gz.*` | Split compressed Docker image tarballs |
| `sha256sums.txt` | Checksums for verifying image integrity |
Model weights are not included. You must supply the Qwen3.5-122B-A10B-NVFP4 weights separately. See the GitHub repo for the resharding script (`01_reshard_nvfp4.sh`).
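Before mounting the weights into the container, a quick sanity check of the resharded directory can save a failed model load. This is an illustrative sketch (the file names checked are assumptions, not part of the repo's scripts):

```python
# Hypothetical sanity check for a resharded weights directory before serving.
from pathlib import Path


def check_weights_dir(path: str) -> list[str]:
    """Return a list of problems found; an empty list means the directory looks usable."""
    p = Path(path)
    problems = []
    if not (p / "config.json").is_file():
        problems.append("missing config.json")
    if not sorted(p.glob("*.safetensors")):
        problems.append("no *.safetensors shards found")
    return problems


if __name__ == "__main__":
    # Default output path used by 01_reshard_nvfp4.sh
    print(check_weights_dir(str(Path.home() / "Qwen3.5-122B-A10B-NVFP4" / "resharded")))
```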
Hardware Requirements
| Component | Specification |
|---|---|
| Platform | NVIDIA Jetson AGX Thor |
| SoC | Thor (Blackwell GPU architecture) |
| Unified Memory | 128 GB LPDDR5x |
| GPU | Integrated Blackwell GPU (FP4 native tensor cores) |
| CPU | 12-core Arm Cortex-X925 |
| Storage | NVMe SSD (model weights on local disk) |
| OS | Ubuntu 24.04 (aarch64) |
| CUDA | 12.x (Jetson JetPack) |
| Architecture | aarch64 |
| Docker | nvidia-container-runtime |
⚠️ This image is aarch64 only and requires a Blackwell-architecture GPU for NVFP4 execution. It will not run on x86 or older Jetson platforms.
Observed Performance
| Metric | Value |
|---|---|
| Decode throughput | 18.9 t/s |
| TTFT (with `--enforce-eager`) | ~120 s ⚠️ do not use this flag |
| TTFT (without `--enforce-eager`) | ~10–20 s (estimated after CUDA graph warmup) |
| VRAM at init (peak) | ~97 GB |
| VRAM steady-state | ~75–80 GB |
| Max context length | 16,384 tokens |
| GPU memory utilization flag | 0.72 |
| Quantization | NVFP4 / compressed-tensors |
| Attention backend | FlashInfer |
| Model weights size (resharded) | ~75 GB |
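For rough capacity planning, the table above supports a simple back-of-envelope latency model: time to first token plus steady-state decode time. The 15 s TTFT below is just the midpoint of the estimated 10–20 s range, not a measured figure:

```python
def estimate_latency_s(ttft_s: float, output_tokens: int, decode_tps: float) -> float:
    """Rough end-to-end request latency: TTFT plus steady-state decode time."""
    return ttft_s + output_tokens / decode_tps


# Midpoint TTFT estimate (15 s) with CUDA graphs enabled, 18.9 t/s decode:
print(round(estimate_latency_s(15.0, 500, 18.9), 1))  # ~41.5 s for a 500-token reply
```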
Quick Start
1. Reassemble and load the image
```shell
# Reassemble split tarballs
cat vllm-thor-qwen35-latest.tar.gz.* > vllm-thor-qwen35-latest.tar.gz

# Verify integrity
sha256sum -c sha256sums.txt

# Load into Docker
docker load -i vllm-thor-qwen35-latest.tar.gz
```
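If `sha256sum` is unavailable on your system, the same integrity check can be sketched in Python against the digests listed in `sha256sums.txt`:

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 without loading it all into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()


# Compare against the first field of the matching line in sha256sums.txt:
# expected = "..."
# assert sha256_of("vllm-thor-qwen35-latest.tar.gz") == expected
```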
2. Prepare model weights
Download and reshard the base BF16 weights to NVFP4 using the script from the GitHub repo:
```shell
git clone https://github.com/patrickbdevaney/qwen-3.5-122b-a10b-jetson-thor.git
cd qwen-3.5-122b-a10b-jetson-thor
bash scripts/01_reshard_nvfp4.sh
# Resharded weights output to: ~/Qwen3.5-122B-A10B-NVFP4/resharded/
```
3. Serve
```shell
bash scripts/03_serve.sh
```
Or manually:
```shell
docker run --rm --runtime=nvidia \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e LD_PRELOAD=/usr/lib/aarch64-linux-gnu/nvidia/libcuda.so.1 \
  -e HF_HUB_DISABLE_XET=1 \
  -v ~/Qwen3.5-122B-A10B-NVFP4/resharded:/model \
  -v ~/thor-vllm-cache:/root/.cache/vllm \
  -p 8000:8000 \
  vllm-thor:qwen35-latest \
  python -m vllm.entrypoints.openai.api_server \
    --model /model \
    --quantization compressed-tensors \
    --attention-backend FLASHINFER \
    --gpu-memory-utilization 0.72 \
    --max-model-len 16384 \
    --max-num-seqs 2
```
Critical Notes
❌ Do NOT use `--enforce-eager`
With `--enforce-eager`, every forward pass goes through Python dispatch with no CUDA graph optimization. For a 94-layer MoE model this causes ~120 s TTFT on prompts of ~900 tokens. Remove the flag and allow CUDA graph warmup at startup. The first startup after removing the flag will take 10–20 minutes longer while graphs are captured and cached to the mounted cache directory (`/root/.cache/vllm` in the container, `~/thor-vllm-cache` on the host).
Required Environment Variables
| Variable | Value | Purpose |
|---|---|---|
| `VLLM_USE_FLASHINFER_MOE_FP4` | `0` | MoE FP4 FlashInfer kernel broken on Thor |
| `LD_PRELOAD` | `/usr/lib/aarch64-linux-gnu/nvidia/libcuda.so.1` | Required for CUDA library resolution on Jetson |
| `HF_HUB_DISABLE_XET` | `1` | Disables experimental HuggingFace XET transfer protocol |
GPU Memory Utilization
Values above 0.72 cause OOM during KV cache profiling at model load. Do not increase this value without testing.
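A sketch of the arithmetic behind the 0.72 figure, under the assumption that the flag bounds vLLM's share of the 128 GB unified pool (note that the ~97 GB init peak in the table above shows transient allocations can exceed this budget):

```python
# Back-of-envelope memory budget on the 128 GB unified pool (GB vs GiB rounding ignored).
total_gb = 128.0      # unified LPDDR5x pool, shared by CPU and GPU
gpu_mem_util = 0.72   # --gpu-memory-utilization
weights_gb = 75.0     # resharded NVFP4 weights (approx., from the table above)

vllm_budget_gb = total_gb * gpu_mem_util
kv_and_overhead_gb = vllm_budget_gb - weights_gb

print(f"vLLM budget: {vllm_budget_gb:.1f} GB")              # ~92.2 GB
print(f"KV cache + overhead: {kv_and_overhead_gb:.1f} GB")  # ~17.2 GB
```

The remaining ~36 GB stays available to the OS and CPU-side processes, which share the same physical pool on Jetson.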
--max-num-seqs 2
Higher values increase CUDA graph capture time and VRAM pressure during warmup. Keep at 2 unless you have tested higher values.
CUDA Graph Cache
CUDA graphs are saved to the volume mounted at `/root/.cache/vllm`. Always mount a persistent host directory here. Without this, graphs are recaptured on every container start (adds 10–20 min per boot).
Benign Startup Warning
You will see this at container start; it is safe to ignore:

```
ERROR: ld.so: object '/usr/lib/aarch64-linux-gnu/nvidia/libcuda.so.1' from
LD_PRELOAD cannot be preloaded (file too short): ignored.
```
The CUDA stack loads correctly via other paths.
Import Errors Outside a GPU Container
Running any vLLM import outside of a GPU-enabled container (e.g. for patching or verification) will fail with:
```
ImportError: /tmp/vllm/vllm/_C.abi3.so: undefined symbol: cuPointerGetAttribute
```
This is expected. Verify patches using grep, not Python imports.
What's Pre-Applied in This Image
The image is built from a pinned vLLM source commit with two required fixes applied. Neither fix is present in the upstream vLLM version baked into the base Jetson container at the time of this build.
Patch 1 – RMSNormGated activation parameter
Adds a missing `activation` parameter to `RMSNormGated.__init__` in `vllm/model_executor/layers/layernorm.py`. Without this, model load fails with:

```
AttributeError: 'RMSNormGated' object has no attribute 'activation'
```
Root cause: RMSNormGated.forward() references self.activation at line 595, but the __init__ method never accepted or stored this parameter. The upstream vLLM code assumed it was already set but the version baked into the Jetson container was missing it.
The fix is two additions to `__init__`:

```python
# Added to signature (after norm_before_gate):
activation: str = "silu",

# Added to body (after self.norm_before_gate = norm_before_gate):
self.activation = activation
```
Full patched `__init__` signature:

```python
def __init__(
    self,
    hidden_size: int,
    eps: float = 1e-6,
    group_size: Optional[int] = None,
    norm_before_gate: bool = False,
    activation: str = "silu",  # ← ADDED
    device: torch.device | None = None,
    dtype: torch.dtype | None = None,
):
```
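The failure mode is ordinary Python: `forward()` reads an attribute that `__init__` never stored. A minimal stand-in class (not the actual vLLM code) illustrating the behavior before and after the patch:

```python
class GatedNormBefore:
    """Mirrors the broken __init__: no activation parameter, no attribute stored."""

    def __init__(self, hidden_size: int):
        self.hidden_size = hidden_size

    def forward(self):
        return self.activation  # AttributeError: attribute was never set


class GatedNormAfter:
    """Mirrors the patch: accept and store activation (default "silu")."""

    def __init__(self, hidden_size: int, activation: str = "silu"):
        self.hidden_size = hidden_size
        self.activation = activation

    def forward(self):
        return self.activation


try:
    GatedNormBefore(64).forward()
except AttributeError as e:
    print(f"broken: {e}")
print("patched:", GatedNormAfter(64).forward())  # patched: silu
```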
Patch 2 – FlashInfer MoE FP4 kernel disable
Not a code patch; applied via environment variable (`VLLM_USE_FLASHINFER_MOE_FP4=0`). The FlashInfer MoE FP4 kernel is broken on Jetson Thor at this vLLM version. Setting this variable falls back to the standard MoE kernel. FlashInfer attention (non-MoE) works correctly and is kept enabled via `--attention-backend FLASHINFER`.
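The GitHub repo ships a `docker-compose.yml`; the sketch below shows how the same environment variables and flags could be captured in compose form. This is an illustrative assumption, not the shipped file:

```yaml
services:
  vllm:
    image: vllm-thor:qwen35-latest
    runtime: nvidia
    environment:
      VLLM_USE_FLASHINFER_MOE_FP4: "0"   # broken MoE FP4 kernel on Thor
      LD_PRELOAD: /usr/lib/aarch64-linux-gnu/nvidia/libcuda.so.1
      HF_HUB_DISABLE_XET: "1"
    volumes:
      - ~/Qwen3.5-122B-A10B-NVFP4/resharded:/model
      - ~/thor-vllm-cache:/root/.cache/vllm   # persistent CUDA graph cache
    ports:
      - "8000:8000"
    command: >
      python -m vllm.entrypoints.openai.api_server
      --model /model
      --quantization compressed-tensors
      --attention-backend FLASHINFER
      --gpu-memory-utilization 0.72
      --max-model-len 16384
      --max-num-seqs 2
```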
Verify patches are present in the loaded image
```shell
docker run --rm vllm-thor:qwen35-latest \
  grep -n "self.activation\|activation: str" \
  /tmp/vllm/vllm/model_executor/layers/layernorm.py

# Expected output:
# 511: activation: str = "silu",
# 536: self.activation = activation
# 597: activation=self.activation,
```
Full patch files are available in the GitHub repo under reproduce/patches/.
Software Stack
| Component | Version / Detail |
|---|---|
| vLLM | Built from source, pinned commit (see GitHub) |
| Python | 3.12.12 (cpython-3.12.12-linux-aarch64) |
| PyTorch | Compatible with Jetson Thor JetPack CUDA stack |
| Docker runtime | nvidia container runtime |
| Base Docker image | NVIDIA Jetson vLLM container (aarch64) |
| Quantization format | compressed-tensors (NVFP4) |
| Attention kernel | FlashInfer (MoE FP4 kernel disabled) |
Repository Structure (GitHub)
```
QWEN-3.5-122B-A10B-JETSON-.../
├── README.MD
├── DOCKERFILE
├── docker-compose.yml
├── .gitignore
├── layernorm_rmsnormgated_activation.patch
├── chat/
│   ├── templates/
│   │   └── chat.html
│   ├── main.py
│   └── requirements.txt
└── scripts/
    ├── 01_reshard_nvfp4.sh
    ├── 02_build_docker.sh
    ├── 03_serve.sh
    ├── 04_verify.sh
    └── 05_patch_existing_image.sh
```
Relationship to GitHub Repository
| Resource | Location |
|---|---|
| Docker image tarballs | This HuggingFace repo |
| Dockerfile & build scripts | GitHub |
| Patch files | GitHub (reproduce/patches/) |
| Resharding script | GitHub (scripts/01_reshard_nvfp4.sh) |
| Chat UI | GitHub (chat/) |
| Full documentation | GitHub (README.MD) |
License
- Reproduction scripts / vLLM: Apache 2.0
- Qwen3.5 model weights: Qwen License; see the model card before use
Citation
If this deployment is useful in your work, please consider citing or linking to the GitHub repository.
Patrick Devaney – vLLM Jetson Thor: Qwen3.5-122B-A10B-NVFP4 (February 2026)
https://github.com/patrickbdevaney/qwen-3.5-122b-a10b-jetson-thor