# Loading this variant with vLLM

The checkpoint has heterogeneous per-layer expert counts. vLLM's stock
`Qwen3NextSparseMoeBlock` builds every layer with `config.num_experts`; our
bundled `vllm_pruned_patch.py` overrides each layer to use its own count from
`config.per_layer_num_experts`.
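
`vllm_pruned_patch.py` is the authoritative implementation; the sketch below
only illustrates the general shape of such a monkey-patch. The import path,
the `prefix` keyword, and the index parsing are assumptions about vLLM
internals, not a copy of the bundled code.

```python
# Hedged sketch: wrap the stock __init__ and swap in this layer's expert
# count before the block is built. Names below are illustrative.
import copy

from vllm.model_executor.models import qwen3_next  # import path is an assumption

_original_init = qwen3_next.Qwen3NextSparseMoeBlock.__init__

def _patched_init(self, config, *args, prefix: str = "", **kwargs):
    # Assumes vLLM hands each module a dotted prefix like "model.layers.7.mlp",
    # from which this block's layer index can be recovered.
    layer_idx = int(prefix.split("layers.")[1].split(".", 1)[0])
    per_layer_config = copy.copy(config)
    per_layer_config.num_experts = config.per_layer_num_experts[layer_idx]
    _original_init(self, per_layer_config, *args, prefix=prefix, **kwargs)

qwen3_next.Qwen3NextSparseMoeBlock.__init__ = _patched_init
```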

## One-liner

```bash
PYTHONPATH=$(pwd):${PYTHONPATH:-} python -c "
from vllm import LLM, SamplingParams
llm = LLM(model='.', tensor_parallel_size=4, dtype='bfloat16',
          gpu_memory_utilization=0.85, trust_remote_code=True,
          enforce_eager=True)
print(llm.generate(['def fib(n):'], SamplingParams(max_tokens=128))[0].outputs[0].text)
"
```

## Why PYTHONPATH?

vLLM spawns worker subprocesses with `multiprocessing`'s `spawn` start method
(the safe choice with CUDA). Those workers re-import `vllm` from scratch, so
a monkey-patch applied only in the parent process is lost. Python's `site`
module, however, imports `sitecustomize` automatically at startup in **every**
interpreter that can find it on `sys.path`. This variant folder ships a
`sitecustomize.py` that applies the patch, so putting the folder on
`PYTHONPATH` is enough: parent and workers alike patch themselves before any
model code runs.
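
To reproduce the mechanism elsewhere, a minimal `sitecustomize.py` could look
like the sketch below. It is not the bundled file, and `apply_patch` is a
hypothetical entry point.

```python
# sitecustomize.py: imported automatically by Python's site module in every
# interpreter that finds this directory on sys.path (e.g. via PYTHONPATH).
try:
    import vllm_pruned_patch  # the patch module bundled with this variant
    vllm_pruned_patch.apply_patch()  # hypothetical entry point
except Exception:
    # Never break unrelated interpreters that happen to inherit PYTHONPATH.
    pass
```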

## Tensor parallelism notes

- **TP (tensor parallel)** works fine with heterogeneous counts: TP shards
  the hidden dimensions inside each expert, so the number of experts per
  layer never enters the sharding math.
- **EP (expert parallel)** assumes experts shard evenly and identically
  across ranks, which breaks with heterogeneous counts (see the sketch after
  this list). Keep `--enable-eplb` off.
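
A tiny numeric illustration of the EP mismatch (the per-layer counts below
are made up, not this checkpoint's):

```python
# With a uniform expert count, every rank holds the same number of experts in
# every layer. With per-layer counts, the per-rank share varies from layer to
# layer, which expert-parallel placement (and EPLB rebalancing) assumes away.
per_layer_num_experts = [512, 384, 256, 448]  # hypothetical pruned counts
ep_size = 4

experts_per_rank = [n // ep_size for n in per_layer_num_experts]
print(experts_per_rank)  # [128, 96, 64, 112] -- non-uniform across layers
```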

## With lm-eval-harness

```bash
PYTHONPATH=$(pwd):${PYTHONPATH:-} \
lm_eval --model vllm \
  --model_args "pretrained=.,tensor_parallel_size=4,dtype=bfloat16,gpu_memory_utilization=0.85,max_model_len=4096,trust_remote_code=True,enforce_eager=True" \
  --tasks humaneval,mbpp \
  --batch_size auto
```
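
The same run through lm-eval's Python API looks roughly like the sketch
below; launch it with the `PYTHONPATH` prefix above so spawned vLLM workers
still pick up the patch.

```python
# Run as: PYTHONPATH=$(pwd):${PYTHONPATH:-} python eval_pruned.py
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=.,tensor_parallel_size=4,dtype=bfloat16,"
        "gpu_memory_utilization=0.85,max_model_len=4096,"
        "trust_remote_code=True,enforce_eager=True"
    ),
    tasks=["humaneval", "mbpp"],
    batch_size="auto",
)
print(results["results"])
```

Depending on your lm-eval version, code-execution tasks such as HumanEval may
additionally require `HF_ALLOW_CODE_EVAL=1` in the environment.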