vLLM Docker Image: Qwen3.5-122B-A10B-NVFP4 on Jetson AGX Thor

First confirmed deployment of Qwen3.5-122B-A10B at NVFP4 precision on a single NVIDIA Jetson AGX Thor (128GB unified memory).

This repository hosts the compressed Docker image tarballs needed to run a vLLM inference server for Qwen3.5-122B-A10B on Jetson AGX Thor. For full documentation, reproduction scripts, patches, and the Dockerfile, see the companion GitHub repository:

📦 GitHub (scripts, patches, docs): patrickbdevaney/qwen-3.5-122b-a10b-jetson-thor

🤖 Base model weights: Qwen/Qwen3.5-122B-A10B


What's in This Repository

This HuggingFace repo contains only the Docker image as compressed tar archives. The image has all required patches pre-applied and is ready to load on a Jetson AGX Thor system.

| File | Description |
|------|-------------|
| `vllm-thor-qwen35-latest.tar.gz.*` | Split compressed Docker image tarballs |
| `sha256sums.txt` | Checksums for verifying image integrity |

Model weights are not included. You must supply Qwen3.5-122B-A10B-NVFP4 weights separately. See the GitHub repo for the resharding script (01_reshard_nvfp4.sh).


Hardware Requirements

| Component | Specification |
|-----------|---------------|
| Platform | NVIDIA Jetson AGX Thor |
| SoC | Thor (Blackwell GPU architecture) |
| Unified memory | 128 GB LPDDR5X |
| GPU | Integrated Blackwell GPU (FP4-native tensor cores) |
| CPU | 12-core Arm Cortex-X925 |
| Storage | NVMe SSD (model weights on local disk) |
| OS | Ubuntu 24.04 (aarch64) |
| CUDA | 12.x (Jetson JetPack) |
| Architecture | aarch64 |
| Docker | nvidia-container-runtime |

⚠️ This image is aarch64 only and requires a Blackwell-architecture GPU for NVFP4 execution. It will not run on x86 or older Jetson platforms.


Observed Performance

| Metric | Value |
|--------|-------|
| Decode throughput | 18.9 t/s |
| TTFT (with `--enforce-eager`) | ~120 s ⚠️ do not use this flag |
| TTFT (without `--enforce-eager`) | ~10–20 s (estimated, after CUDA graph warmup) |
| VRAM at init (peak) | ~97 GB |
| VRAM steady-state | ~75–80 GB |
| Max context length | 16,384 tokens |
| GPU memory utilization flag | 0.72 |
| Quantization | NVFP4 / compressed-tensors |
| Attention backend | FlashInfer |
| Model weights size (resharded) | ~75 GB |
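As a rough sanity check, the throughput and TTFT figures above combine into a back-of-envelope latency estimate. This is a sketch, not a benchmark: the 15 s TTFT midpoint is taken from the estimated 10–20 s range, and `estimated_latency_s` is an illustrative helper name, not part of vLLM.

```python
# Back-of-envelope latency from the observed numbers above:
# TTFT ~15 s (midpoint of the estimated 10-20 s without --enforce-eager)
# plus decode at 18.9 tokens/s.
def estimated_latency_s(new_tokens: int,
                        ttft_s: float = 15.0,
                        decode_tps: float = 18.9) -> float:
    """Approximate wall-clock seconds to generate `new_tokens` tokens."""
    return ttft_s + new_tokens / decode_tps

# A 512-token reply lands in roughly 42 s end to end.
print(f"{estimated_latency_s(512):.1f} s")  # -> 42.1 s
```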

Quick Start

1. Reassemble and load the image

```bash
# Reassemble split tarballs
cat vllm-thor-qwen35-latest.tar.gz.* > vllm-thor-qwen35-latest.tar.gz

# Verify integrity
sha256sum -c sha256sums.txt

# Load into Docker
docker load -i vllm-thor-qwen35-latest.tar.gz
```

2. Prepare model weights

Download and reshard the base BF16 weights to NVFP4 using the script from the GitHub repo:

```bash
git clone https://github.com/patrickbdevaney/qwen-3.5-122b-a10b-jetson-thor.git
cd qwen-3.5-122b-a10b-jetson-thor
bash scripts/01_reshard_nvfp4.sh
# Resharded weights output to: ~/Qwen3.5-122B-A10B-NVFP4/resharded/
```

3. Serve

```bash
bash scripts/03_serve.sh
```

Or manually:

```bash
docker run --rm --runtime=nvidia \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e LD_PRELOAD=/usr/lib/aarch64-linux-gnu/nvidia/libcuda.so.1 \
  -e HF_HUB_DISABLE_XET=1 \
  -v ~/Qwen3.5-122B-A10B-NVFP4/resharded:/model \
  -v ~/thor-vllm-cache:/root/.cache/vllm \
  -p 8000:8000 \
  vllm-thor:qwen35-latest \
  python -m vllm.entrypoints.openai.api_server \
    --model /model \
    --quantization compressed-tensors \
    --attention-backend FLASHINFER \
    --gpu-memory-utilization 0.72 \
    --max-model-len 16384 \
    --max-num-seqs 2
```
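Once the server is up, you can smoke-test the OpenAI-compatible endpoint. A minimal stdlib-only client sketch follows; `build_chat_request` and `chat` are our illustrative helper names, and the `"/model"` model name assumes the default served name (vLLM uses the `--model` path unless `--served-model-name` overrides it):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"

def build_chat_request(prompt: str, model: str = "/model",
                       max_tokens: int = 64) -> urllib.request.Request:
    """Build a POST to the OpenAI-compatible chat completions endpoint."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

def chat(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# With the server running:  print(chat("Say hello in one sentence."))
```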

Critical Notes

❌ Do NOT use `--enforce-eager`

With `--enforce-eager`, every forward pass goes through Python dispatch with no CUDA graph optimization. For a 94-layer MoE model this causes ~120 s TTFT on prompts of ~900 tokens. Remove the flag and allow CUDA graph warmup at startup. The first startup after removing the flag will take 10–20 minutes longer while graphs are captured and cached to the mounted cache directory (`/root/.cache/vllm` in the container, `~/thor-vllm-cache` on the host).

Required Environment Variables

| Variable | Value | Purpose |
|----------|-------|---------|
| `VLLM_USE_FLASHINFER_MOE_FP4` | `0` | MoE FP4 FlashInfer kernel is broken on Thor |
| `LD_PRELOAD` | `/usr/lib/aarch64-linux-gnu/nvidia/libcuda.so.1` | Required for CUDA library resolution on Jetson |
| `HF_HUB_DISABLE_XET` | `1` | Disables the experimental HuggingFace XET transfer protocol |

GPU Memory Utilization

Values above 0.72 cause OOM during KV cache profiling at model load. Do not increase this value without testing.
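The 0.72 figure can be sanity-checked against the numbers elsewhere in this document (128 GB unified memory, ~75 GB of resharded weights). This is rough arithmetic only, not vLLM's actual memory-profiling logic:

```python
# Rough memory budget implied by --gpu-memory-utilization 0.72 on a
# 128 GB unified-memory Thor. vLLM's real accounting is more involved;
# this only shows why 0.72 leaves limited KV-cache headroom.
total_gb = 128
budget_gb = 0.72 * total_gb      # ~92.2 GB that vLLM is allowed to claim
weights_gb = 75                  # resharded NVFP4 weights (see table above)
kv_headroom_gb = budget_gb - weights_gb

print(f"budget ~{budget_gb:.1f} GB, KV-cache headroom ~{kv_headroom_gb:.1f} GB")
# -> budget ~92.2 GB, KV-cache headroom ~17.2 GB
```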

--max-num-seqs 2

Higher values increase CUDA graph capture time and VRAM pressure during warmup. Keep at 2 unless you have tested higher values.

CUDA Graph Cache

CUDA graphs are saved to the volume mounted at /root/.cache/vllm. Always mount a persistent host directory here. Without this, graphs are recaptured on every container start (adds 10–20 min per boot).

Benign Startup Warning

You will see this at container start; it is safe to ignore:

```
ERROR: ld.so: object '/usr/lib/aarch64-linux-gnu/nvidia/libcuda.so.1' from
LD_PRELOAD cannot be preloaded (file too short): ignored.
```

The CUDA stack loads correctly via other paths.

Import Errors Outside a GPU Container

Running any vLLM import outside of a GPU-enabled container (e.g. for patching or verification) will fail with:

```
ImportError: /tmp/vllm/vllm/_C.abi3.so: undefined symbol: cuPointerGetAttribute
```

This is expected. Verify patches using grep, not Python imports.


What's Pre-Applied in This Image

The image is built from a pinned vLLM source commit with two required fixes applied. Neither fix is present in the upstream vLLM version baked into the base Jetson container at the time of this build.

Patch 1: RMSNormGated activation parameter

Adds a missing activation parameter to RMSNormGated.__init__ in vllm/model_executor/layers/layernorm.py. Without this, model load fails with:

```
AttributeError: 'RMSNormGated' object has no attribute 'activation'
```

Root cause: `RMSNormGated.forward()` references `self.activation` at line 595, but the `__init__` method never accepted or stored this parameter. Upstream vLLM code assumes the attribute is set, but the version baked into the Jetson container was missing it.

The fix: two additions to `__init__`:

```python
# Added to signature (after norm_before_gate):
activation: str = "silu",

# Added to body (after self.norm_before_gate = norm_before_gate):
self.activation = activation
```

Full patched __init__ signature:

```python
def __init__(
    self,
    hidden_size: int,
    eps: float = 1e-6,
    group_size: Optional[int] = None,
    norm_before_gate: bool = False,
    activation: str = "silu",          # <-- ADDED
    device: torch.device | None = None,
    dtype: torch.dtype | None = None,
):
```

Patch 2: FlashInfer MoE FP4 kernel disable

Not a code patch; applied via environment variable (`VLLM_USE_FLASHINFER_MOE_FP4=0`). The FlashInfer MoE FP4 kernel is broken on Jetson Thor at this vLLM version. Setting this variable falls back to the standard MoE kernel. FlashInfer attention (non-MoE) works correctly and is kept enabled via `--attention-backend FLASHINFER`.

Verify patches are present in the loaded image

```bash
docker run --rm vllm-thor:qwen35-latest \
  grep -n "self.activation\|activation: str" \
  /tmp/vllm/vllm/model_executor/layers/layernorm.py
# Expected output:
# 511:        activation: str = "silu",
# 536:        self.activation = activation
# 597:            activation=self.activation,
```

Full patch files are available in the GitHub repo under reproduce/patches/.


Software Stack

| Component | Version / Detail |
|-----------|------------------|
| vLLM | Built from source, pinned commit (see GitHub) |
| Python | 3.12.12 (cpython-3.12.12-linux-aarch64) |
| PyTorch | Compatible with the Jetson Thor JetPack CUDA stack |
| Docker runtime | nvidia-container-runtime |
| Base Docker image | NVIDIA Jetson vLLM container (aarch64) |
| Quantization format | compressed-tensors (NVFP4) |
| Attention kernel | FlashInfer (MoE FP4 kernel disabled) |

Repository Structure (GitHub)

```
QWEN-3.5-122B-A10B-JETSON-.../
├── README.MD
├── DOCKERFILE
├── docker-compose.yml
├── .gitignore
├── layernorm_rmsnormgated_activation.patch
├── chat/
│   ├── templates/
│   │   └── chat.html
│   ├── main.py
│   └── requirements.txt
└── scripts/
    ├── 01_reshard_nvfp4.sh
    ├── 02_build_docker.sh
    ├── 03_serve.sh
    ├── 04_verify.sh
    └── 05_patch_existing_image.sh
```

Relationship to GitHub Repository

| Resource | Location |
|----------|----------|
| Docker image tarballs | This HuggingFace repo |
| Dockerfile & build scripts | GitHub |
| Patch files | GitHub (`reproduce/patches/`) |
| Resharding script | GitHub (`scripts/01_reshard_nvfp4.sh`) |
| Chat UI | GitHub (`chat/`) |
| Full documentation | GitHub (`README.MD`) |

License

- Reproduction scripts / vLLM: Apache 2.0
- Qwen3.5 model weights: Qwen License (see the model card before use)

Citation

If this deployment is useful in your work, please consider citing or linking to the GitHub repository.

Patrick Devaney, vLLM Jetson Thor: Qwen3.5-122B-A10B-NVFP4 (February 2026)
https://github.com/patrickbdevaney/qwen-3.5-122b-a10b-jetson-thor