Loading this variant with vLLM

The checkpoint has heterogeneous per-layer expert counts. vLLM's stock Qwen3NextSparseMoeBlock builds every layer with config.num_experts; our bundled vllm_pruned_patch.py overrides each layer to use its own count from config.per_layer_num_experts.
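As a quick sanity check (not the patch itself), you can print the per-layer counts straight from the config; this sketch assumes only the two fields named above and that the checkpoint ships custom config code, hence trust_remote_code:

from transformers import AutoConfig

# Load this variant's config from the current directory.
cfg = AutoConfig.from_pretrained(".", trust_remote_code=True)

# cfg.num_experts is the single count stock vLLM would apply to every layer;
# cfg.per_layer_num_experts lists the counts actually kept in each layer.
for layer_idx, n_experts in enumerate(cfg.per_layer_num_experts):
    print(f"layer {layer_idx:3d}: {n_experts} experts "
          f"(stock vLLM would build {cfg.num_experts})")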

One-liner

PYTHONPATH=$(pwd):${PYTHONPATH:-} python -c "
from vllm import LLM, SamplingParams
llm = LLM(model='.', tensor_parallel_size=4, dtype='bfloat16',
          gpu_memory_utilization=0.85, trust_remote_code=True,
          enforce_eager=True)
print(llm.generate(['def fib(n):'], SamplingParams(max_tokens=128))[0].outputs[0].text)
"

Why PYTHONPATH?

vLLM spawns worker subprocesses via multiprocessing.spawn (safe with CUDA). Those workers re-import vllm fresh, so any monkey-patch applied in the parent process is gone by the time the model is built. Python's sitecustomize.py mechanism, however, runs automatically in every interpreter that has the containing directory on sys.path, parent and workers alike. Putting this variant folder on PYTHONPATH is enough for its sitecustomize.py to apply the patch everywhere.
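A minimal sketch of what such a sitecustomize.py can look like; the file actually bundled with this variant may differ, and the assumption here is that importing vllm_pruned_patch is all it takes to apply the override:

try:
    # Assumed: importing the module applies the per-layer expert-count override.
    import vllm_pruned_patch
except Exception as exc:
    # Never break unrelated interpreters that happen to share this PYTHONPATH.
    import sys
    print(f"[sitecustomize] vllm_pruned_patch not applied: {exc}", file=sys.stderr)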

Tensor parallelism notes

  • TP (tensor parallel) works fine with heterogeneous counts: it shards the hidden dimensions inside each expert, so the number of experts per layer does not matter.
  • EP (expert parallel) assumes each layer's experts shard evenly across ranks, an assumption that breaks with heterogeneous counts (see the sketch below). Keep --enable-eplb off.
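To make the EP constraint concrete, here is a toy illustration with made-up per-layer counts (the real numbers live in config.per_layer_num_experts): experts only distribute evenly across EP ranks when every layer's count is divisible by the EP world size.

ep_size = 4  # hypothetical EP world size, for illustration only
per_layer_num_experts = [64, 50, 48, 37]  # made-up counts, not from this checkpoint
for layer_idx, n_experts in enumerate(per_layer_num_experts):
    remainder = n_experts % ep_size
    verdict = "even split" if remainder == 0 else f"uneven ({remainder} experts left over)"
    print(f"layer {layer_idx}: {n_experts} experts over {ep_size} ranks -> {verdict}")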

With lm-eval-harness

PYTHONPATH=$(pwd):${PYTHONPATH:-} \
  lm_eval --model vllm \
    --model_args "pretrained=.,tensor_parallel_size=4,dtype=bfloat16,gpu_memory_utilization=0.85,max_model_len=4096,trust_remote_code=True,enforce_eager=True" \
    --tasks humaneval,mbpp \
    --batch_size auto