NVFP4 Quantized Qwen3.5-122B-A10B (Resharded for DGX Spark)

This is a resharded version of RedHatAI/Qwen3.5-122B-A10B-NVFP4 optimized for NVIDIA DGX Spark (128GB unified memory).

The original model ships as 2 safetensors shards (roughly 47 GiB + 25 GiB). On DGX Spark's unified memory architecture, the fastsafetensors loader allocates a contiguous GPU buffer per shard, and a 47 GiB allocation fails on 128GB unified memory once PyTorch/CUDA overhead is accounted for. This resharded version splits the same weights into 16 shards of ~5GB each, matching the shard sizes used by other models that load successfully on Spark (Nemotron, INT4-AutoRound).

The weights are identical to the original RedHatAI model — only the shard layout has changed.
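The "identical weights" claim can be spot-checked with a layout-invariant checksum: hash each tensor's bytes keyed by name, so the digest cannot depend on how tensors are grouped into files. A minimal sketch, using plain dicts of bytes to stand in for loaded shards (in practice safetensors' safe_open would supply the real tensor data; the helper name below is my own, not from any library):

```python
import hashlib

def layout_invariant_digest(shards):
    """Digest over (name, bytes) pairs sorted by tensor name, so
    regrouping the same tensors into different shards cannot change it."""
    h = hashlib.sha256()
    merged = {}
    for shard in shards:  # each shard: {tensor_name: raw_bytes}
        merged.update(shard)
    for name in sorted(merged):
        h.update(name.encode())
        h.update(merged[name])
    return h.hexdigest()

# Two shards vs. a finer resplit of the same tensors: same digest.
original = [{"a": b"\x01\x02", "b": b"\x03"}, {"c": b"\x04\x05"}]
resharded = [{"a": b"\x01\x02"}, {"b": b"\x03"}, {"c": b"\x04\x05"}]
assert layout_invariant_digest(original) == layout_invariant_digest(resharded)
```

Running the same digest over the original RedHatAI shards and the resharded files would confirm the tensors are byte-for-byte identical.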

Usage

vllm serve sjug/Qwen3.5-122B-A10B-NVFP4-resharded \
  --load-format fastsafetensors \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.7 \
  --max-model-len 262144 \
  --enable-prefix-caching \
  --trust-remote-code

For DGX Spark with the Marlin backend (recommended; ~2% faster than CUTLASS):

VLLM_NVFP4_GEMM_BACKEND=marlin \
VLLM_TEST_FORCE_FP8_MARLIN=1 \
VLLM_MARLIN_USE_ATOMIC_ADD=1 \
vllm serve sjug/Qwen3.5-122B-A10B-NVFP4-resharded \
  --load-format fastsafetensors \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.7 \
  --max-model-len 262144 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 10 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --trust-remote-code

See spark-vllm-docker recipes for full deployment configs.
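Once serving, vLLM exposes an OpenAI-compatible API (by default on port 8000). A minimal request body for it, assuming the default port and the model name from the serve commands above (the prompt is illustrative):

```python
import json

# OpenAI-compatible /v1/chat/completions body for the server above;
# POST it to http://localhost:8000/v1/chat/completions.
payload = {
    "model": "sjug/Qwen3.5-122B-A10B-NVFP4-resharded",
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 64,
}
body = json.dumps(payload)
print(body)
```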

Resharding Details

  • Source: RedHatAI/Qwen3.5-122B-A10B-NVFP4 (2 shards: 50.0GB + 26.4GB)
  • Output: 16 shards of ~4.7GB each (last shard 1.4GB)
  • Total: 72GB, 149,100 tensors
  • Method: Tensors were read via safe_open, accumulated on GPU, and written to new shards at ~5GB boundaries
  • Verification: Loaded and served successfully on DGX Spark with fastsafetensors
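The 5GB-boundary packing described above amounts to a greedy split: walk the tensors in order and start a new shard whenever the next tensor would push the current one past the budget. A small sketch over (name, size-in-bytes) pairs (the helper name is my own, not from the resharding script):

```python
def plan_shards(tensors, budget=5 * 1024**3):
    """Greedily group (name, nbytes) pairs into shards of at most
    `budget` bytes; a tensor larger than the budget gets its own shard."""
    shards, current, used = [], [], 0
    for name, nbytes in tensors:
        if current and used + nbytes > budget:
            shards.append(current)
            current, used = [], 0
        current.append(name)
        used += nbytes
    if current:
        shards.append(current)
    return shards

# Toy example with a 10-byte budget:
tensors = [("a", 4), ("b", 4), ("c", 4), ("d", 9)]
print(plan_shards(tensors, budget=10))  # [['a', 'b'], ['c'], ['d']]
```

Because the split is greedy and tensors vary in size, all shards land near but under the budget, with whatever remains in a smaller final shard (here, the 1.4GB tail).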

Performance on DGX Spark

Metric                  Value
Model loading           71.3 GiB in ~90s via fastsafetensors
Token generation        ~17.5 tok/s (Marlin), ~17.1 tok/s (CUTLASS)
KV cache (0.7 util)     6.9 GiB, 150K tokens, 2.1x concurrency at 262K context
Backend                 Marlin recommended (+2% over CUTLASS)

Original Model Evaluations

From RedHatAI:

                 Qwen/Qwen3.5-122B-A10B   RedHatAI/Qwen3.5-122B-A10B-NVFP4
GSM8k Accuracy   88.2                     85.8
Recovery         -                        97.0%

Note: More rigorous evaluations are currently in progress and will be available soon.
