Official Jetson Thor vLLM 0.16 container fails to load Nemotron 3 Super -- reports mixed-precision quant config as unsupported
Motivation
I picked up a Jetson AGX Thor at the San Jose convention two weeks ago and am struggling to get Nemotron 3 Super working on NVIDIA's latest container. Below is context about the environment, assembled from an AI-assisted debugging session, with technical details in case someone more knowledgeable can confirm whether this is a real issue that could be fixed in a future release of the vLLM container. I'm still fairly new to the space, so any help or pointers are appreciated. Thank you.
Human Context
NVIDIA's official Jetson Thor vLLM container (ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor) fails to load NVIDIA's Nemotron 3 Super 120B model. The AGX Thor's 128 GB of unified memory should fit the model, and the docs (including the OpenClaw tutorial) recommend this container together with the NVFP4 checkpoint for Thor, implying it should work.
So far, both the Nano 30B and Nano 4B models run perfectly; only Super is affected. The problem appears to stem from the model's mixed-precision quantization format (NVFP4 for routed MoE experts, FP8 for shared experts), which the container's vLLM 0.16 does not support. Newer vLLM versions (0.19+) added ModelOptMixedPrecisionConfig, which handles this format correctly. If the AI helping me debug this understands correctly, there is a version gap: the container was built before the Super NVFP4 checkpoint was published with its mixed-precision format. If so, the fix is straightforward: update the Thor container to a vLLM version that includes mixed-precision ModelOpt support.
AI Technical Details
Environment
- Hardware: NVIDIA AGX Thor Developer Kit, 128GB unified memory, SM110a (Blackwell)
- JetPack: 7.2, L4T R38.4, CUDA 13.2, Ubuntu 24.04 aarch64 SBSA
- Container: ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor
- Tag: 0.16.0-g15d76f74e-r38.2-arm64-sbsa-cu130-24.04
- vLLM version: 0.16.0rc2.dev479+g15d76f74e.d20260225
- Model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 (HuggingFace, latest revision)
Steps to reproduce
docker run --rm -it \
--runtime=nvidia --network host \
-e VLLM_USE_FLASHINFER_MOE_FP4=1 \
-e VLLM_FLASHINFER_MOE_BACKEND=throughput \
-e HF_TOKEN="$HF_TOKEN" \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor \
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
--gpu-memory-utilization 0.75 \
--max-model-len 16384 \
--max-num-seqs 8 \
--dtype auto \
--kv-cache-dtype fp8 \
--trust-remote-code
Error
Value error, ModelOpt currently only supports: ['FP8', 'FP8_PER_CHANNEL_PER_TOKEN', 'FP8_PB_WO', 'NVFP4', 'MXFP8'] quantizations in vLLM.
Please check the `hf_quant_config.json` file for your model's quant configuration.
Root cause
The Super 120B NVFP4 checkpoint uses per-layer mixed-precision quantization in its hf_quant_config.json. Routed MoE experts are quantized to NVFP4, while shared experts use FP8:
"quant_algo": "NVFP4",
"group_size": 16
},
"backbone.layers.87.mixer.shared_experts.up_proj": {
"quant_algo": "FP8"
},
"backbone.layers.87.mixer.shared_experts.down_proj": {
"quant_algo": "FP8"
}
vLLM 0.16 in this container only supports a single top-level quant_algo value. It does not have the ModelOptMixedPrecisionConfig class that was added in later vLLM versions to handle per-layer mixed quantization.
By contrast, the Nano 30B NVFP4 checkpoint uses a simple top-level format:
"producer": { "name": "modelopt", "version": "0.29.0" },
"quantization": {
"quant_algo": "NVFP4",
"kv_cache_quant_algo": "FP8",
"group_size": 16,
"exclude_modules": ["lm_head", ...]
}
}
This loads and runs correctly on the same container (benchmarked at 43.76 t/s).
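Since the two checkpoints differ only in how hf_quant_config.json is laid out, a quick way to predict whether a checkpoint will trip this error is to scan its config for distinct quant_algo values. The sketch below is deliberately structure-agnostic: it walks the whole JSON tree rather than assuming where the per-layer map lives, since that layout is not guaranteed.

```python
import json

def collect_quant_algos(node, found=None):
    """Recursively gather every distinct "quant_algo" value found
    anywhere in an hf_quant_config.json structure."""
    if found is None:
        found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "quant_algo" and isinstance(value, str):
                found.add(value)
            else:
                collect_quant_algos(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_quant_algos(item, found)
    return found

def is_mixed_precision(config):
    """More than one distinct quant_algo means a per-layer
    mixed-precision checkpoint, which vLLM 0.16's ModelOpt
    loader rejects."""
    return len(collect_quant_algos(config)) > 1

if __name__ == "__main__":
    with open("hf_quant_config.json") as f:
        config = json.load(f)
    print(sorted(collect_quant_algos(config)), is_mixed_precision(config))
```

Note that kv_cache_quant_algo is a different key and is intentionally not counted; an fp8 KV cache on an NVFP4 model (as with Nano 30B) is not per-layer mixed precision.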
What works on this container
| Model | HF Repo ID | Status |
|---|---|---|
| Nano 4B BF16 | nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 | ✅ Works (FlashInfer attention, FLASHINFER_CUTLASS MoE) |
| Nano 30B NVFP4 | nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | ✅ Works (43.76 t/s, FLASHINFER_CUTLASS, fp8 KV cache) |
| Super 120B NVFP4 | nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 | ❌ Fails (mixed-precision quant config) |
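For the checkpoints that do load, this is the minimal smoke test I used against vLLM's OpenAI-compatible completions endpoint (stdlib only; the URL assumes vLLM's default port 8000, and the model ID is whichever checkpoint was served):

```python
import json
import urllib.request

# Assumes the server was started with the docker command above,
# which uses --network host and vLLM's default port 8000.
VLLM_URL = "http://localhost:8000/v1/completions"

def build_request(model_id, prompt, max_tokens=32):
    """Build an OpenAI-style completion request for a vLLM server."""
    payload = {"model": model_id, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        VLLM_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    req = build_request("nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4", "Hello")
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["text"])
```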
Workaround
The same Super 120B NVFP4 checkpoint loads and runs correctly on a self-built vLLM 0.19 image (vllm:r39.0.arm64-sbsa-cu132-24.04) at ~10.6 t/s. However, this requires building vLLM from source with multiple manual workarounds for Thor (TRITON_ATTN, NINJA_MAX_JOBS serialization, etc.), which the official container was intended to eliminate.
Suggested fix
Update ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor to a vLLM version that includes ModelOptMixedPrecisionConfig support (added to the vLLM codebase after 0.16). The latest vLLM docs show this class at vllm/model_executor/layers/quantization/modelopt.py, and it explicitly supports "checkpoints where different layers use different quantization algorithms (e.g., FP8 for dense layers and NVFP4 for MoE experts)" -- exactly the Super checkpoint format.
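A quick way to check whether the vLLM build inside any given container has the fix, without loading a model, is to probe for the class at the module path the docs cite. This is just a presence check; I'm assuming the class name alone is a reliable indicator:

```python
from importlib import import_module

def has_mixed_precision_modelopt():
    """Return True if the installed vLLM exposes
    ModelOptMixedPrecisionConfig (added after 0.16)."""
    try:
        mod = import_module(
            "vllm.model_executor.layers.quantization.modelopt")
    except ImportError:
        return False  # vLLM (or this module path) not present
    return hasattr(mod, "ModelOptMixedPrecisionConfig")

if __name__ == "__main__":
    print(has_mixed_precision_modelopt())
```

Run inside the container; the official 0.16 image should print False, while a 0.19+ build should print True.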
References
- OpenClaw tutorial referencing this container + model: https://www.jetson-ai-lab.com/tutorials/openclaw/
- Jetson AI Lab genai tutorial: https://www.jetson-ai-lab.com/tutorials/genai-on-jetson-llms-vlms/
- vLLM ModelOptMixedPrecisionConfig docs: https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/quantization/modelopt/
- Container package page: https://github.com/orgs/nvidia-ai-iot/packages/container/package/vllm
- Super NVFP4 HuggingFace repo: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4