vLLM Docker Image: Qwen3.5-122B-A10B-NVFP4 on Jetson AGX Thor

First confirmed deployment of Qwen3.5-122B-A10B at NVFP4 precision on a single NVIDIA Jetson AGX Thor (128GB unified memory).

This repository hosts the compressed Docker image tarballs needed to run a vLLM inference server for Qwen3.5-122B-A10B on Jetson AGX Thor. For full documentation, reproduction scripts, patches, and the Dockerfile, see the companion GitHub repository:

📦 GitHub (scripts, patches, docs): patrickbdevaney/qwen-3.5-122b-a10b-jetson-thor

🤖 Base model weights: Qwen/Qwen3.5-122B-A10B


What's in This Repository

This HuggingFace repo contains only the Docker image as compressed tar archives. The image has all required patches pre-applied and is ready to load on a Jetson AGX Thor system.

| File | Description |
|------|-------------|
| `vllm-thor-qwen35-latest.tar.gz.*` | Split compressed Docker image tarballs |
| `sha256sums.txt` | Checksums for verifying image integrity |

Model weights are not included. You must supply Qwen3.5-122B-A10B-NVFP4 weights separately. See the GitHub repo for the resharding script (01_reshard_nvfp4.sh).


Hardware Requirements

| Component | Specification |
|-----------|---------------|
| Platform | NVIDIA Jetson AGX Thor |
| SoC | Thor (Blackwell GPU architecture) |
| Unified memory | 128 GB LPDDR5X |
| GPU | Integrated Blackwell GPU (FP4-native tensor cores) |
| CPU | 12-core Arm Cortex-X925 |
| Storage | NVMe SSD (model weights on local disk) |
| OS | Ubuntu 24.04 (aarch64) |
| CUDA | 12.x (Jetson JetPack) |
| Architecture | aarch64 |
| Docker | nvidia-container-runtime |

⚠️ This image is aarch64 only and requires a Blackwell-architecture GPU for NVFP4 execution. It will not run on x86 or older Jetson platforms.


Observed Performance

| Metric | Value |
|--------|-------|
| Decode throughput | 18.9 t/s |
| TTFT (with `--enforce-eager`) | ~120 s ⚠️ do not use this flag |
| TTFT (without `--enforce-eager`) | ~10–20 s (estimated, after CUDA graph warmup) |
| VRAM at init (peak) | ~97 GB |
| VRAM steady-state | ~75–80 GB |
| Max context length | 16,384 tokens |
| GPU memory utilization flag | 0.72 |
| Quantization | NVFP4 / compressed-tensors |
| Attention backend | FlashInfer |
| Model weights size (resharded) | ~75 GB |
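As a rough sanity check, the throughput and TTFT figures above combine into a back-of-envelope latency estimate. This is a sketch, not a benchmark: the 15 s TTFT midpoint is taken from the estimated 10–20 s range, and `estimated_latency_s` is an illustrative helper name, not part of vLLM.

```python
# Back-of-envelope latency from the observed numbers above:
# TTFT ~15 s (midpoint of the estimated 10-20 s without --enforce-eager)
# plus decode at 18.9 tokens/s.
def estimated_latency_s(new_tokens: int,
                        ttft_s: float = 15.0,
                        decode_tps: float = 18.9) -> float:
    """Approximate wall-clock seconds to generate `new_tokens` tokens."""
    return ttft_s + new_tokens / decode_tps

# A 512-token reply lands in roughly 42 s end to end.
print(f"{estimated_latency_s(512):.1f} s")  # -> 42.1 s
```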

Quick Start

1. Reassemble and load the image

```bash
# Reassemble split tarballs
cat vllm-thor-qwen35-latest.tar.gz.* > vllm-thor-qwen35-latest.tar.gz

# Verify integrity
sha256sum -c sha256sums.txt

# Load into Docker
docker load -i vllm-thor-qwen35-latest.tar.gz
```

2. Prepare model weights

Download and reshard the base BF16 weights to NVFP4 using the script from the GitHub repo:

```bash
git clone https://github.com/patrickbdevaney/qwen-3.5-122b-a10b-jetson-thor.git
cd qwen-3.5-122b-a10b-jetson-thor
bash scripts/01_reshard_nvfp4.sh
# Resharded weights output to: ~/Qwen3.5-122B-A10B-NVFP4/resharded/
```

3. Serve

```bash
bash scripts/03_serve.sh
```

Or manually:

```bash
docker run --rm --runtime=nvidia \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e LD_PRELOAD=/usr/lib/aarch64-linux-gnu/nvidia/libcuda.so.1 \
  -e HF_HUB_DISABLE_XET=1 \
  -v ~/Qwen3.5-122B-A10B-NVFP4/resharded:/model \
  -v ~/thor-vllm-cache:/root/.cache/vllm \
  -p 8000:8000 \
  vllm-thor:qwen35-latest \
  python -m vllm.entrypoints.openai.api_server \
    --model /model \
    --quantization compressed-tensors \
    --attention-backend FLASHINFER \
    --gpu-memory-utilization 0.72 \
    --max-model-len 16384 \
    --max-num-seqs 2
```
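Once the server is up, you can smoke-test the OpenAI-compatible endpoint. A minimal stdlib-only client sketch follows; `build_chat_request` and `chat` are our illustrative helper names, and the `"/model"` model name assumes the default served name (vLLM uses the `--model` path unless `--served-model-name` overrides it):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"

def build_chat_request(prompt: str, model: str = "/model",
                       max_tokens: int = 64) -> urllib.request.Request:
    """Build a POST to the OpenAI-compatible chat completions endpoint."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

def chat(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# With the server running:  print(chat("Say hello in one sentence."))
```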

Critical Notes

❌ Do NOT use `--enforce-eager`

With `--enforce-eager`, every forward pass goes through Python dispatch with no CUDA graph optimization. For a 94-layer MoE model this causes ~120 s TTFT on prompts of ~900 tokens. Remove the flag and allow CUDA graph warmup at startup. The first startup after removing the flag will take 10–20 minutes longer while graphs are captured and cached to the mounted cache directory (`/root/.cache/vllm` in the container, `~/thor-vllm-cache` on the host).

Required Environment Variables

| Variable | Value | Purpose |
|----------|-------|---------|
| `VLLM_USE_FLASHINFER_MOE_FP4` | `0` | MoE FP4 FlashInfer kernel is broken on Thor |
| `LD_PRELOAD` | `/usr/lib/aarch64-linux-gnu/nvidia/libcuda.so.1` | Required for CUDA library resolution on Jetson |
| `HF_HUB_DISABLE_XET` | `1` | Disables the experimental HuggingFace XET transfer protocol |

GPU Memory Utilization

Values above 0.72 cause OOM during KV cache profiling at model load. Do not increase this value without testing.
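The 0.72 figure can be sanity-checked against the numbers elsewhere in this document (128 GB unified memory, ~75 GB of resharded weights). This is rough arithmetic only, not vLLM's actual memory-profiling logic:

```python
# Rough memory budget implied by --gpu-memory-utilization 0.72 on a
# 128 GB unified-memory Thor. vLLM's real accounting is more involved;
# this only shows why 0.72 leaves limited KV-cache headroom.
total_gb = 128
budget_gb = 0.72 * total_gb      # ~92.2 GB that vLLM is allowed to claim
weights_gb = 75                  # resharded NVFP4 weights (see table above)
kv_headroom_gb = budget_gb - weights_gb

print(f"budget ~{budget_gb:.1f} GB, KV-cache headroom ~{kv_headroom_gb:.1f} GB")
# -> budget ~92.2 GB, KV-cache headroom ~17.2 GB
```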

--max-num-seqs 2

Higher values increase CUDA graph capture time and VRAM pressure during warmup. Keep at 2 unless you have tested higher values.

CUDA Graph Cache

CUDA graphs are saved to the volume mounted at /root/.cache/vllm. Always mount a persistent host directory here. Without this, graphs are recaptured on every container start (adds 10–20 min per boot).

Benign Startup Warning

You will see this at container start; it is safe to ignore:

```
ERROR: ld.so: object '/usr/lib/aarch64-linux-gnu/nvidia/libcuda.so.1' from
LD_PRELOAD cannot be preloaded (file too short): ignored.
```

The CUDA stack loads correctly via other paths.

Import Errors Outside a GPU Container

Running any vLLM import outside of a GPU-enabled container (e.g. for patching or verification) will fail with:

```
ImportError: /tmp/vllm/vllm/_C.abi3.so: undefined symbol: cuPointerGetAttribute
```

This is expected. Verify patches using grep, not Python imports.


What's Pre-Applied in This Image

The image is built from a pinned vLLM source commit with two required fixes applied. Neither fix is present in the upstream vLLM version baked into the base Jetson container at the time of this build.

Patch 1: RMSNormGated activation parameter

Adds a missing activation parameter to RMSNormGated.__init__ in vllm/model_executor/layers/layernorm.py. Without this, model load fails with:

```
AttributeError: 'RMSNormGated' object has no attribute 'activation'
```

Root cause: `RMSNormGated.forward()` references `self.activation` at line 595, but the `__init__` method never accepted or stored this parameter. Upstream vLLM code assumes the attribute is set, but the version baked into the Jetson container was missing it.

The fix: two additions to `__init__`:

```python
# Added to signature (after norm_before_gate):
activation: str = "silu",

# Added to body (after self.norm_before_gate = norm_before_gate):
self.activation = activation
```

Full patched __init__ signature:

```python
def __init__(
    self,
    hidden_size: int,
    eps: float = 1e-6,
    group_size: Optional[int] = None,
    norm_before_gate: bool = False,
    activation: str = "silu",          # <-- ADDED
    device: torch.device | None = None,
    dtype: torch.dtype | None = None,
):
```

Patch 2: FlashInfer MoE FP4 kernel disable

Not a code patch; applied via environment variable (`VLLM_USE_FLASHINFER_MOE_FP4=0`). The FlashInfer MoE FP4 kernel is broken on Jetson Thor at this vLLM version. Setting this variable falls back to the standard MoE kernel. FlashInfer attention (non-MoE) works correctly and is kept enabled via `--attention-backend FLASHINFER`.

Verify patches are present in the loaded image

```bash
docker run --rm vllm-thor:qwen35-latest \
  grep -n "self.activation\|activation: str" \
  /tmp/vllm/vllm/model_executor/layers/layernorm.py
# Expected output:
# 511:        activation: str = "silu",
# 536:        self.activation = activation
# 597:            activation=self.activation,
```

Full patch files are available in the GitHub repo under reproduce/patches/.


Software Stack

| Component | Version / Detail |
|-----------|------------------|
| vLLM | Built from source, pinned commit (see GitHub) |
| Python | 3.12.12 (cpython-3.12.12-linux-aarch64) |
| PyTorch | Compatible with the Jetson Thor JetPack CUDA stack |
| Docker runtime | nvidia-container-runtime |
| Base Docker image | NVIDIA Jetson vLLM container (aarch64) |
| Quantization format | compressed-tensors (NVFP4) |
| Attention kernel | FlashInfer (MoE FP4 kernel disabled) |

Repository Structure (GitHub)

```
QWEN-3.5-122B-A10B-JETSON-.../
├── README.MD
├── DOCKERFILE
├── docker-compose.yml
├── .gitignore
├── layernorm_rmsnormgated_activation.patch
├── chat/
│   ├── templates/
│   │   └── chat.html
│   ├── main.py
│   └── requirements.txt
└── scripts/
    ├── 01_reshard_nvfp4.sh
    ├── 02_build_docker.sh
    ├── 03_serve.sh
    ├── 04_verify.sh
    └── 05_patch_existing_image.sh
```

Relationship to GitHub Repository

| Resource | Location |
|----------|----------|
| Docker image tarballs | This HuggingFace repo |
| Dockerfile & build scripts | GitHub |
| Patch files | GitHub (`reproduce/patches/`) |
| Resharding script | GitHub (`scripts/01_reshard_nvfp4.sh`) |
| Chat UI | GitHub (`chat/`) |
| Full documentation | GitHub (`README.MD`) |

License

- Reproduction scripts / vLLM: Apache 2.0
- Qwen3.5 model weights: Qwen License (see the model card before use)

Citation

If this deployment is useful in your work, please consider citing or linking to the GitHub repository.

Patrick Devaney, vLLM Jetson Thor: Qwen3.5-122B-A10B-NVFP4 (February 2026)
https://github.com/patrickbdevaney/qwen-3.5-122b-a10b-jetson-thor