Loading this variant with vLLM
The checkpoint has heterogeneous per-layer expert counts. vLLM's stock
Qwen3NextSparseMoeBlock builds every layer with config.num_experts; our
bundled vllm_pruned_patch.py overrides each layer to use its own count from
config.per_layer_num_experts.
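For reference, the core of the override can be sketched as a small monkey-patch. This is illustrative only: the constructor signature and the way the layer index is recovered are assumptions (they vary across vLLM versions), and the bundled vllm_pruned_patch.py is the authoritative version.

import re
from vllm.model_executor.models import qwen3_next

_orig_init = qwen3_next.Qwen3NextSparseMoeBlock.__init__

def _patched_init(self, *args, **kwargs):
    # vLLM hands each module a dotted prefix like "model.layers.7.mlp";
    # recover the layer index from it. (Assumption: prefix arrives as a kwarg.)
    prefix = kwargs.get("prefix", "")
    config = kwargs.get("config", args[0] if args else None)
    m = re.search(r"layers\.(\d+)", prefix)
    if config is not None and m is not None:
        # Layers are constructed sequentially, so pointing num_experts at
        # this layer's own count just before construction is safe here.
        config.num_experts = config.per_layer_num_experts[int(m.group(1))]
    _orig_init(self, *args, **kwargs)

qwen3_next.Qwen3NextSparseMoeBlock.__init__ = _patched_init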
One-liner
PYTHONPATH=$(pwd):${PYTHONPATH:-} python -c "
from vllm import LLM, SamplingParams
llm = LLM(model='.', tensor_parallel_size=4, dtype='bfloat16',
          gpu_memory_utilization=0.85, trust_remote_code=True,
          enforce_eager=True)
print(llm.generate(['def fib(n):'], SamplingParams(max_tokens=128))[0].outputs[0].text)
"
Why PYTHONPATH?
vLLM spawns worker subprocesses via multiprocessing's spawn method (the safe
choice with CUDA). Those workers re-import vllm from scratch, so any
monkey-patch applied in the parent process is gone by the time they build the
model. Python's site module, however, imports sitecustomize automatically at
startup in every interpreter, and PYTHONPATH entries are already on sys.path
when that import happens. Because this folder ships a sitecustomize.py that
applies the patch, putting the folder on PYTHONPATH is enough: every worker
re-applies the override as it starts.
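A minimal sitecustomize.py for this mechanism amounts to a single guarded import (a sketch, assuming the patch module applies itself on import; the bundled file may differ):

# sitecustomize.py -- imported by Python's site module at startup in every
# interpreter, parent process and spawned vLLM workers alike.
try:
    import vllm_pruned_patch  # applies the per-layer expert-count override
except ImportError:
    pass  # interpreters without vllm on sys.path still start normally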
Tensor parallelism notes
- TP (tensor parallel) works fine with heterogeneous counts: it shards the
  hidden dimension inside each expert, so the per-layer expert count never
  enters the partitioning.
- EP (expert parallel) assumes a layer's experts shard evenly across ranks,
  which breaks with heterogeneous counts. Keep expert parallelism disabled,
  i.e. do not pass --enable-expert-parallel (it is off by default).
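To sanity-check the layout before launching, the per-layer counts are plain data in this folder's config.json (assuming the per_layer_num_experts key described above):

python -c "
import json
counts = json.load(open('config.json'))['per_layer_num_experts']
print(len(counts), 'layers; expert counts present:', sorted(set(counts)))
"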
With lm-eval-harness
PYTHONPATH=$(pwd):${PYTHONPATH:-} \
lm_eval --model vllm \
  --model_args "pretrained=.,tensor_parallel_size=4,dtype=bfloat16,gpu_memory_utilization=0.85,max_model_len=4096,trust_remote_code=True,enforce_eager=True" \
  --tasks humaneval,mbpp \
  --batch_size auto

Depending on your lm-eval version, code-execution tasks such as humaneval and
mbpp may additionally require HF_ALLOW_CODE_EVAL=1 in the environment and the
--confirm_run_unsafe_code flag.