Serving on two devices
It looks like it's currently impossible to serve on two devices due to how the model was exported:
```
ValueError: MiMoV2ForCausalLM requires effective attention TP size 4 because its fused qkv_proj weights are TP=4-interleaved; got 2 (tp_size=2, dp_size=1, enable_dp_attention=False, attn_cp_size=1). Set --tp, --dp, --enable-dp-attention, and --attention-context-parallel-size so the effective attention TP size is 4.
```
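For intuition, here is a hypothetical sketch of why a fused weight exported pre-interleaved for 4 attention ranks can't simply be split in half for TP=2 (the row counts and layout here are illustrative assumptions, not read from the actual checkpoint):

```python
# Hypothetical sketch: a fused qkv_proj exported "TP=4-interleaved",
# i.e. output rows grouped per export rank as [q0,k0,v0, q1,k1,v1, ...].
# Row counts are illustrative, not the real model's.
ROWS = {"q": 8, "k": 4, "v": 4}  # rows per rank per projection

fused = [
    f"{tag}{rank}_{i}"
    for rank in range(4)            # exported for 4 attention TP ranks
    for tag in ("q", "k", "v")
    for i in range(ROWS[tag])
]

# A TP=2 loader expecting the usual contiguous [Q | K | V] fused layout
# would take a contiguous half of the rows per rank:
half = len(fused) // 2
rank0 = fused[:half]

# ...but that half interleaves q/k/v blocks from two export ranks instead
# of being a contiguous Q half, K half, V half -- the rows are all there,
# just not where a TP=2 column-parallel loader expects them, so the
# loader would have to de-interleave and re-fuse, which it refuses to do.
```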
Are you using the container shown in the model card?
Yes, I am serving that custom image on top of Kubernetes, though that part is likely irrelevant. Additionally, I had to duplicate this repo and add the modeling_mimo files, because otherwise it complained that mimo_v2 is not part of transformers. I tried loading this with both your image and the sglang CUDA 13 MiMo dev image (with Triton attention, since it is the only supported backend that has both Diff KV and attention sinks and supports SM120).
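For reference, the launch shape the error seems to be asking for would be roughly this (a sketch, assuming four GPUs on a single host; the model path is a placeholder and the flags are the standard sglang server arguments):

```shell
# Sketch: satisfy the "effective attention TP size 4" the loader demands.
# MODEL_PATH is a placeholder; --attention-backend triton matches the
# backend mentioned above. Assumes 4 GPUs are visible to the container.
python3 -m sglang.launch_server \
  --model-path "$MODEL_PATH" \
  --tp 4 \
  --attention-backend triton
```

The open question is whether any flag combination reaches an effective attention TP of 4 with only two devices.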
Does your attention backend possibly assume devices are all on a single host?