Model loading time reduced from 5.5 minutes to 5.26 seconds
In Qwen3_5.py inside vLLM, the fused-expert branch becomes:
if is_fused_expert:
    if name_mapped not in params_dict:
        success = False
    else:
        param = params_dict[name_mapped]
        tp_rank = get_tensor_model_parallel_rank()
        tp_size = get_tensor_model_parallel_world_size()
        n_exp = min(loaded_weight.shape[0], param.data.shape[0])
        if "gate_up_proj" in name:
            half_param = param.data.shape[1] // 2
            full_N = loaded_weight.shape[1] // 2
            start = tp_rank * half_param
            d2 = min(loaded_weight.shape[2], param.data.shape[2])
            # gate half
            param.data[:n_exp, :half_param, :d2].copy_(
                loaded_weight[:n_exp, start:start + half_param, :d2])
            # up half
            param.data[:n_exp, half_param:, :d2].copy_(
                loaded_weight[:n_exp,
                              full_N + start:full_N + start + half_param,
                              :d2])
        else:
            # w2: TP shards the input dim (last dim)
            shard = param.data.shape[2]
            start = tp_rank * shard
            sliced = loaded_weight[:n_exp, :, start:start + shard]
            d1 = min(sliced.shape[1], param.data.shape[1])
            param.data[:n_exp, :d1, :].copy_(sliced[:, :d1, :])
        success = True
Why faster: The old path called load_fused_expert_weights, which looped through all 256 experts individually, calling weight_loader once per expert. For gate_up_proj, it first called .chunk(2) to split the tensor into gate/up halves, then ran the loop twice (w1 and w3). Each weight_loader call went through shard_id dispatch, expert_id mapping, TP slicing logic, and a per-expert copy. That's ~1024 Python function calls and ~1024 small tensor copies per layer × 48 layers.
The new path does 2 bulk tensor.copy_() calls for gate_up (one for the gate half, one for the up half) and 1 for down_proj, per tensor. 4 tensors per layer = 4 bulk copies. No loops, no per-expert dispatch, no chunking.
