Serving on two devices
It looks like it's currently impossible to serve on two devices due to how the model was exported:
```
ValueError: MiMoV2ForCausalLM requires effective attention TP size 4 because its fused qkv_proj weights are TP=4-interleaved; got 2 (tp_size=2, dp_size=1, enable_dp_attention=False, attn_cp_size=1). Set --tp, --dp, --enable-dp-attention, and --attention-context-parallel-size so the effective attention TP size is 4.
```
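For intuition, here is a hypothetical sketch of why a fused weight exported pre-interleaved for 4 attention ranks can't simply be split in half for TP=2 (the row counts and layout here are illustrative assumptions, not read from the actual checkpoint):

```python
# Hypothetical sketch: a fused qkv_proj exported "TP=4-interleaved",
# i.e. output rows grouped per export rank as [q0,k0,v0, q1,k1,v1, ...].
# Row counts are illustrative, not the real model's.
ROWS = {"q": 8, "k": 4, "v": 4}  # rows per rank per projection

fused = [
    f"{tag}{rank}_{i}"
    for rank in range(4)            # exported for 4 attention TP ranks
    for tag in ("q", "k", "v")
    for i in range(ROWS[tag])
]

# A TP=2 loader expecting the usual contiguous [Q | K | V] fused layout
# would take a contiguous half of the rows per rank:
half = len(fused) // 2
rank0 = fused[:half]

# ...but that half interleaves q/k/v blocks from two export ranks instead
# of being a contiguous Q half, K half, V half -- the rows are all there,
# just not where a TP=2 column-parallel loader expects them, so the
# loader would have to de-interleave and re-fuse, which it refuses to do.
```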
Are you using the container shown in the model card?
Yes, I am serving that custom image on top of Kubernetes, though that part is likely irrelevant. Additionally, I had to duplicate this repo and add the modeling_mimo files, because otherwise it complained that mimo_v2 is not part of transformers. I tried loading this with both your image and the sglang CUDA 13 MiMo dev image (with Triton attention, since it is the only supported backend that has both Diff KV and attention sinks and supports SM120).
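For reference, the launch shape the error seems to be asking for would be roughly this (a sketch, assuming four GPUs on a single host; the model path is a placeholder and the flags are the standard sglang server arguments):

```shell
# Sketch: satisfy the "effective attention TP size 4" the loader demands.
# MODEL_PATH is a placeholder; --attention-backend triton matches the
# backend mentioned above. Assumes 4 GPUs are visible to the container.
python3 -m sglang.launch_server \
  --model-path "$MODEL_PATH" \
  --tp 4 \
  --attention-backend triton
```

The open question is whether any flag combination reaches an effective attention TP of 4 with only two devices.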
Does your attention backend possibly assume devices are all on a single host?