NVFP4 Quantized Qwen3.5-122B-A10B (Resharded for DGX Spark)

This is a resharded version of RedHatAI/Qwen3.5-122B-A10B-NVFP4 optimized for NVIDIA DGX Spark (128GB unified memory).

The original model ships as 2 safetensors shards (roughly 47 GiB + 25 GiB). On DGX Spark's unified memory architecture, the fastsafetensors loader allocates a contiguous GPU buffer per shard, and a 47 GiB allocation fails on 128GB unified memory once PyTorch/CUDA overhead is accounted for. This resharded version splits the same weights into 16 shards of ~5GB each, matching the shard sizes used by other models that load successfully on Spark (Nemotron, INT4-AutoRound).

The weights are identical to the original RedHatAI model — only the shard layout has changed.
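The "identical weights" claim can be spot-checked with a layout-invariant checksum: hash each tensor's bytes keyed by name, so the digest cannot depend on how tensors are grouped into files. A minimal sketch, using plain dicts of bytes to stand in for loaded shards (in practice safetensors' safe_open would supply the real tensor data; the helper name below is my own, not from any library):

```python
import hashlib

def layout_invariant_digest(shards):
    """Digest over (name, bytes) pairs sorted by tensor name, so
    regrouping the same tensors into different shards cannot change it."""
    h = hashlib.sha256()
    merged = {}
    for shard in shards:  # each shard: {tensor_name: raw_bytes}
        merged.update(shard)
    for name in sorted(merged):
        h.update(name.encode())
        h.update(merged[name])
    return h.hexdigest()

# Two shards vs. a finer resplit of the same tensors: same digest.
original = [{"a": b"\x01\x02", "b": b"\x03"}, {"c": b"\x04\x05"}]
resharded = [{"a": b"\x01\x02"}, {"b": b"\x03"}, {"c": b"\x04\x05"}]
assert layout_invariant_digest(original) == layout_invariant_digest(resharded)
```

Running the same digest over the original RedHatAI shards and the resharded files would confirm the tensors are byte-for-byte identical.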

Usage

vllm serve sjug/Qwen3.5-122B-A10B-NVFP4-resharded \
  --load-format fastsafetensors \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.7 \
  --max-model-len 262144 \
  --enable-prefix-caching \
  --trust-remote-code

For DGX Spark with the Marlin backend (recommended; ~2% faster than CUTLASS):

VLLM_NVFP4_GEMM_BACKEND=marlin \
VLLM_TEST_FORCE_FP8_MARLIN=1 \
VLLM_MARLIN_USE_ATOMIC_ADD=1 \
vllm serve sjug/Qwen3.5-122B-A10B-NVFP4-resharded \
  --load-format fastsafetensors \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.7 \
  --max-model-len 262144 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 10 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --trust-remote-code

See spark-vllm-docker recipes for full deployment configs.
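Once serving, vLLM exposes an OpenAI-compatible API (by default on port 8000). A minimal request body for it, assuming the default port and the model name from the serve commands above (the prompt is illustrative):

```python
import json

# OpenAI-compatible /v1/chat/completions body for the server above;
# POST it to http://localhost:8000/v1/chat/completions.
payload = {
    "model": "sjug/Qwen3.5-122B-A10B-NVFP4-resharded",
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 64,
}
body = json.dumps(payload)
print(body)
```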

Resharding Details

  • Source: RedHatAI/Qwen3.5-122B-A10B-NVFP4 (2 shards: 50.0GB + 26.4GB)
  • Output: 16 shards of ~4.7GB each (last shard 1.4GB)
  • Total: 72GB, 149,100 tensors
  • Method: Tensors were read via safe_open, accumulated on GPU, and written to new shards at ~5GB boundaries
  • Verification: Loaded and served successfully on DGX Spark with fastsafetensors
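The 5GB-boundary packing described above amounts to a greedy split: walk the tensors in order and start a new shard whenever the next tensor would push the current one past the budget. A small sketch over (name, size-in-bytes) pairs (the helper name is my own, not from the resharding script):

```python
def plan_shards(tensors, budget=5 * 1024**3):
    """Greedily group (name, nbytes) pairs into shards of at most
    `budget` bytes; a tensor larger than the budget gets its own shard."""
    shards, current, used = [], [], 0
    for name, nbytes in tensors:
        if current and used + nbytes > budget:
            shards.append(current)
            current, used = [], 0
        current.append(name)
        used += nbytes
    if current:
        shards.append(current)
    return shards

# Toy example with a 10-byte budget:
tensors = [("a", 4), ("b", 4), ("c", 4), ("d", 9)]
print(plan_shards(tensors, budget=10))  # [['a', 'b'], ['c'], ['d']]
```

Because the split is greedy and tensors vary in size, all shards land near but under the budget, with whatever remains in a smaller final shard (here, the 1.4GB tail).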

Performance on DGX Spark

Metric                  Value
Model loading           71.3 GiB in ~90s via fastsafetensors
Token generation        ~17.5 tok/s (Marlin), ~17.1 tok/s (CUTLASS)
KV cache (0.7 util)     6.9 GiB, 150K tokens, 2.1x concurrency at 262K context
Backend                 Marlin recommended (+2% over CUTLASS)

Original Model Evaluations

From RedHatAI:

                 Qwen/Qwen3.5-122B-A10B   RedHatAI/Qwen3.5-122B-A10B-NVFP4
GSM8k Accuracy   88.2                     85.8
Recovery         -                        97.0%

Note: More rigorous evaluations are currently in progress and will be available soon.
