# Loading this variant with vLLM

The checkpoint has heterogeneous per-layer expert counts. vLLM's stock `Qwen3NextSparseMoeBlock` builds every layer with `config.num_experts`; our bundled `vllm_pruned_patch.py` overrides each layer to use its own count from `config.per_layer_num_experts` (a sketch of the override idea appears at the end of this section).

## One-liner

```bash
PYTHONPATH=$(pwd):${PYTHONPATH:-} python -c "
from vllm import LLM, SamplingParams

llm = LLM(model='.', tensor_parallel_size=4, dtype='bfloat16',
          gpu_memory_utilization=0.85, trust_remote_code=True,
          enforce_eager=True)
print(llm.generate(['def fib(n):'],
                   SamplingParams(max_tokens=128))[0].outputs[0].text)
"
```

## Why PYTHONPATH?

vLLM spawns worker subprocesses via `multiprocessing` with the `spawn` start method (`fork` is unsafe once CUDA is initialized). Those workers re-import `vllm` from scratch, so any monkey-patch applied in the parent process is lost. Python's `sitecustomize.py` mechanism, however, runs automatically in **every** interpreter that has the relevant directory on `sys.path`. Putting this variant folder, which carries a `sitecustomize.py` that imports the patch, on `PYTHONPATH` is therefore enough: parent and workers alike apply the override at startup (see the sketch below).

## Tensor parallelism notes

- **TP (tensor parallel)** works with heterogeneous counts, since TP shards the hidden dimensions inside each expert rather than the expert set itself.
- **EP (expert parallel)** assumes each layer's expert set shards evenly across ranks, which breaks with heterogeneous counts. Leave `--enable-eplb` off (a quick divisibility check follows below).

## With lm-eval-harness

```bash
PYTHONPATH=$(pwd):${PYTHONPATH:-} \
lm_eval --model vllm \
  --model_args "pretrained=.,tensor_parallel_size=4,dtype=bfloat16,gpu_memory_utilization=0.85,max_model_len=4096,trust_remote_code=True,enforce_eager=True" \
  --tasks humaneval,mbpp \
  --batch_size auto
```
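
## How the override works (sketch)

The bundled `vllm_pruned_patch.py` is authoritative; the sketch below only illustrates the shape such an override can take. It assumes the module path `vllm.model_executor.models.qwen3_next` (internal paths move between vLLM versions) and that the block's constructor receives the HF config plus a weight `prefix` like `model.layers.7.mlp`, from which the layer index can be parsed — both are assumptions, not guarantees about the bundled file.

```python
# Illustrative sketch only -- the bundled vllm_pruned_patch.py is the real
# implementation. Assumes: (1) the module path below matches your vLLM
# version, and (2) __init__ receives the HF config and a prefix such as
# "model.layers.7.mlp" from which the layer index can be recovered.
import re

from vllm.model_executor.models import qwen3_next

_original_init = qwen3_next.Qwen3NextSparseMoeBlock.__init__


def _patched_init(self, *args, **kwargs):
    config = kwargs.get("config", args[0] if args else None)
    prefix = kwargs.get("prefix", "")
    match = re.search(r"layers\.(\d+)", prefix)
    per_layer = getattr(config, "per_layer_num_experts", None)
    if match and per_layer is not None:
        layer_idx = int(match.group(1))
        # Temporarily swap in this layer's expert count so everything inside
        # __init__ that reads config.num_experts sees the pruned value.
        saved = config.num_experts
        config.num_experts = per_layer[layer_idx]
        try:
            _original_init(self, *args, **kwargs)
        finally:
            config.num_experts = saved
    else:
        _original_init(self, *args, **kwargs)


qwen3_next.Qwen3NextSparseMoeBlock.__init__ = _patched_init
```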
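
## What `sitecustomize.py` does (sketch)

A minimal sketch of how the patch survives worker respawns. The load-bearing fact is standard Python behavior: the `site` module imports `sitecustomize` at interpreter startup whenever its folder is on `sys.path`, which is exactly what `PYTHONPATH=$(pwd)` arranges for vLLM's spawned workers. The bundled file may additionally log or version-check.

```python
# sitecustomize.py -- Python's site module imports this automatically at
# startup in every interpreter that has this folder on sys.path, including
# vLLM's spawned workers. Minimal sketch; the bundled file may do more.
try:
    import vllm_pruned_patch  # noqa: F401 -- the patch applies as an import side effect
except Exception:
    # Never break unrelated interpreters that merely inherit PYTHONPATH
    # (e.g. ones without vLLM installed).
    pass
```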
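
## Checking EP divisibility (optional)

To see concretely why EP is off the table: expert parallelism splits each layer's expert set across EP ranks, so every per-layer count must divide evenly by the EP size. A quick hedged check, assuming this variant stores `per_layer_num_experts` in its `config.json` and using a hypothetical EP size of 4:

```python
# Hedged sanity check: with heterogeneous per-layer counts, at least some
# layers typically fail the divisibility requirement that EP imposes.
import json

with open("config.json") as f:
    cfg = json.load(f)

ep_size = 4  # hypothetical expert-parallel world size
uneven = [(i, n) for i, n in enumerate(cfg["per_layer_num_experts"]) if n % ep_size]
print(f"{len(uneven)} layers shard unevenly at EP={ep_size}:", uneven[:8])
```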