# NVFP4 Quantized Qwen3.5-122B-A10B (Resharded for DGX Spark)
This is a resharded version of RedHatAI/Qwen3.5-122B-A10B-NVFP4 optimized for NVIDIA DGX Spark (128GB unified memory).
The original model ships as two safetensors shards of roughly 47 GiB and 25 GiB (50.0 GB and 26.4 GB). On DGX Spark's unified memory architecture, the fastsafetensors loader allocates one contiguous GPU buffer per shard, and a 47 GiB allocation fails on 128 GB of unified memory once PyTorch/CUDA overhead is accounted for. This resharded version splits the same weights into 16 shards of ~5 GB each, matching the shard sizes of other models that load successfully on Spark (Nemotron, INT4-AutoRound).
The weights are identical to the original RedHatAI model — only the shard layout has changed.
## Usage
```shell
vllm serve sjug/Qwen3.5-122B-A10B-NVFP4-resharded \
  --load-format fastsafetensors \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.7 \
  --max-model-len 262144 \
  --enable-prefix-caching \
  --trust-remote-code
```
For DGX Spark with the Marlin backend (recommended; ~2% faster inference):
```shell
VLLM_NVFP4_GEMM_BACKEND=marlin \
VLLM_TEST_FORCE_FP8_MARLIN=1 \
VLLM_MARLIN_USE_ATOMIC_ADD=1 \
vllm serve sjug/Qwen3.5-122B-A10B-NVFP4-resharded \
  --load-format fastsafetensors \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.7 \
  --max-model-len 262144 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 10 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --trust-remote-code
```
See spark-vllm-docker recipes for full deployment configs.
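Once serving, vLLM exposes an OpenAI-compatible API (by default on port 8000). A minimal stdlib-only sketch of a chat request follows; the host, port, and generation parameters are assumptions for illustration, so adjust them to your deployment:

```python
import json
import urllib.request


def build_chat_payload(prompt: str, max_tokens: int = 256) -> dict:
    """Assemble an OpenAI-style chat completion request body."""
    return {
        "model": "sjug/Qwen3.5-122B-A10B-NVFP4-resharded",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }


def query(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """POST a chat request to the vLLM endpoint and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

The same endpoint also works with the official `openai` Python client by pointing its `base_url` at the server.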
## Resharding Details
- Source: RedHatAI/Qwen3.5-122B-A10B-NVFP4 (2 shards: 50.0 GB + 26.4 GB)
- Output: 16 shards of ~4.7 GB each (last shard 1.4 GB)
- Total: 72 GB, 149,100 tensors
- Method: read tensors via `safe_open`, accumulated on GPU, written to new shards at 5 GB boundaries
- Verification: loaded and served successfully on DGX Spark with fastsafetensors
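The resharding step above can be sketched as follows. This is a simplified reconstruction, not the exact script used: it stages tensors on CPU rather than GPU for clarity, and omits rewriting the `model.safetensors.index.json` that maps tensor names to shards. Shard planning greedily accumulates tensors up to the ~5 GB boundary; the I/O pass then copies each group into its own file (the `safetensors` imports are kept inside the I/O helper so the planning logic stands alone):

```python
def plan_shards(tensor_bytes: dict, max_shard_bytes: int = 5 * 1024**3):
    """Greedily group tensor names into shards of at most max_shard_bytes.

    tensor_bytes maps tensor name -> serialized size in bytes; dict
    insertion order is preserved, so tensors keep their on-disk order.
    A single tensor larger than the limit still gets its own shard.
    """
    shards, current, current_size = [], [], 0
    for name, size in tensor_bytes.items():
        if current and current_size + size > max_shard_bytes:
            shards.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        shards.append(current)
    return shards


def reshard(input_files, output_dir, max_shard_bytes=5 * 1024**3):
    """Read tensors with safe_open and rewrite them into ~5 GB shards."""
    import os
    from safetensors import safe_open
    from safetensors.torch import save_file

    tensors, sizes = {}, {}
    for path in input_files:
        with safe_open(path, framework="pt", device="cpu") as f:
            for name in f.keys():
                t = f.get_tensor(name)
                tensors[name] = t
                sizes[name] = t.numel() * t.element_size()

    groups = plan_shards(sizes, max_shard_bytes)
    for i, names in enumerate(groups, start=1):
        out = os.path.join(
            output_dir, f"model-{i:05d}-of-{len(groups):05d}.safetensors"
        )
        save_file({n: tensors[n] for n in names}, out)
```

Greedy packing in original tensor order keeps each shard a contiguous slice of the source layout, which is what lets fastsafetensors map each ~5 GB file into its own GPU buffer.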
## Performance on DGX Spark
| Metric | Value |
|---|---|
| Model loading | 71.3 GiB, ~90s via fastsafetensors |
| Token generation | ~17.5 tok/s (Marlin), ~17.1 tok/s (CUTLASS) |
| KV cache (at 0.7 util) | 6.9 GiB, 150K tokens, 2.1x concurrency at 262K context |
| Backend | Marlin recommended (+2% over CUTLASS) |
## Original Model Evaluations
From RedHatAI:
| | Qwen/Qwen3.5-122B-A10B | RedHatAI/Qwen3.5-122B-A10B-NVFP4 |
|---|---|---|
| GSM8k Accuracy | 88.2 | 85.8 |
| Recovery | - | 97.0% |
Note: More rigorous evaluations are currently in progress and will be available soon.