# Loading this variant with vLLM
The checkpoint has heterogeneous per-layer expert counts. vLLM's stock
`Qwen3NextSparseMoeBlock` builds every layer with `config.num_experts`; our
bundled `vllm_pruned_patch.py` overrides each layer to use its own count from
`config.per_layer_num_experts`.
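A quick way to see the two fields involved — a sketch using `transformers`, run from inside the variant folder; the field names are the ones described above:

```python
# Sanity check (sketch): compare the global expert count with the per-layer counts.
# `num_experts` is what stock vLLM would build every layer with;
# `per_layer_num_experts` is the list the bundled patch reads instead.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained('.', trust_remote_code=True)
print(cfg.num_experts)
print(cfg.per_layer_num_experts)
```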
## One-liner
```bash
PYTHONPATH=$(pwd):${PYTHONPATH:-} python -c "
from vllm import LLM, SamplingParams
llm = LLM(model='.', tensor_parallel_size=4, dtype='bfloat16',
          gpu_memory_utilization=0.85, trust_remote_code=True,
          enforce_eager=True)
print(llm.generate(['def fib(n):'], SamplingParams(max_tokens=128))[0].outputs[0].text)
"
```
## Why PYTHONPATH?
vLLM spawns worker subprocesses with `multiprocessing`'s `spawn` start method (the
CUDA-safe choice). Those workers re-import `vllm` from scratch, so any monkey-patch
applied in the parent process is gone. Python's `sitecustomize.py` mechanism, however,
runs automatically in **every** interpreter that can find a `sitecustomize.py` on
`sys.path`. Putting this variant folder on `PYTHONPATH` is therefore enough: the
folder's `sitecustomize.py` is imported at startup in the parent and in every worker,
and applies the bundled patch before the model is built.
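For reference, a minimal `sitecustomize.py` along these lines would work — a sketch,
not necessarily the exact file shipped here, and it assumes `vllm_pruned_patch` applies
the override as a side effect of being imported:

```python
# sitecustomize.py — minimal sketch. Python imports this module automatically at
# interpreter startup whenever its directory is on sys.path (e.g. via PYTHONPATH),
# including in vLLM's spawned workers.
try:
    import vllm_pruned_patch  # noqa: F401  assumed to apply the per-layer patch on import
except Exception:
    # Interpreters that pick up this PYTHONPATH but don't have vLLM installed
    # should start normally rather than fail here.
    pass
```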
## Tensor parallelism notes
- **TP (tensor parallel)** works fine with heterogeneous counts — TP shards
hidden dimensions inside each expert.
- **EP (expert parallel)** assumes experts shard evenly across ranks, which breaks
  with heterogeneous counts. Keep `--enable-eplb` off (see the launch sketch after
  this list).
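Spelled out, a TP-only launch looks like the one-liner above with expert parallelism
explicitly left at its default — treat the `enable_expert_parallel` engine argument as
an assumption about your vLLM version:

```python
# TP-only launch sketch: TP shards each expert's hidden dimensions, so uneven
# per-layer expert counts are fine. Expert parallelism and EPLB stay off.
from vllm import LLM

llm = LLM(
    model='.',
    tensor_parallel_size=4,         # TP: safe with heterogeneous expert counts
    enable_expert_parallel=False,   # EP would try to split experts evenly across ranks
    dtype='bfloat16',
    gpu_memory_utilization=0.85,
    trust_remote_code=True,
    enforce_eager=True,
)
```

As with the one-liner, keep the variant folder on `PYTHONPATH` so each worker still picks up the patch.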
## With lm-eval-harness
```bash
PYTHONPATH=$(pwd):${PYTHONPATH:-} \
lm_eval --model vllm \
--model_args "pretrained=.,tensor_parallel_size=4,dtype=bfloat16,gpu_memory_utilization=0.85,max_model_len=4096,trust_remote_code=True,enforce_eager=True" \
--tasks humaneval,mbpp \
--batch_size auto
```