Working good on 96GB VRAM + DDR5 Setup

#2
by phakio - opened

Just leaving a post to say that, along with the mentioned PR, I got this model running at decent speeds on my setup (1x 4090, 3x 3090, and 512GB of DDR5 RAM for offload).

The IQ3 just barely doesn't make the cut to fit into 96GB of VRAM. I'd be interested in trying a Q2 quant just to see how fast it can run with full GPU offload, but I think the model's output is already affected at Q3.

Nonetheless, here are my stats and launch command so others can try it if they want!

I've found this model decent, except when it decides to overthink. It's funny: sometimes the model thinks very briefly and it's impressive, and other times it thinks more than the original Qwen 3 did at launch. I'd be very interested in trying out MiMo-v2.5 Pro if you ever quantize that one.

/home/phone/mimo-llama/llama.cpp/build/bin/llama-server \
    --model /run/media/phone/SharedData/LocalModelsBIG/mimo/MiMo-V2.5-IQ3_S-00001-of-00004.gguf \
    --alias AesSedai/MiMo-V2.5-GGUF \
    --ctx-size 20000 \
    -ngl 999 \
    -ot "blk\.(0|1|2|3|4|5|6|7)\..*=CUDA0,blk\.(9|10|11|12|13|14|15|16|39)\..*=CUDA1,blk\.(17|18|19|20|21|22|23|24|38)\..*=CUDA2,blk\.(25|26|27|28|29|30|31|32)\..*=CUDA3" \
    --parallel 1 \
    --cpu-moe \
    --threads 48 \
    --threads-batch 56 \
    --host 0.0.0.0 \
    --port 8081 \
    --jinja \
    --no-mmap \
    --mlock \
    --fit off \
    -fa off

[screenshots of performance stats]

Hi, thanks for the feedback! I will be quantizing and uploading Pro as well. I wasn't sure if there would be more requested changes in the PR, so I'm waiting until it's ready to merge before pulling the trigger there, given that it's a 2TB BF16 to wrangle.

Re: Q2, once the PR is merged I'm sure Bart / Ubergarm / Unsloth will provide the usual full suite of quantizations :)

Try setting -ub 2048; you should see a decent bump in PP tok/s. Also, I think you are offloading whole layers instead of MoE layers with that -ot parameter. Either use -ot or -ncmoe, or just set -ncmoe to something like 10 to test.
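For example, something along these lines (paths copied from your command above; the numbers are just for a quick test, not tuned):

# Example only: raise the micro-batch size and let -ncmoe place the expert tensors,
# instead of hand-pinning whole blocks with -ot.
# -ncmoe N keeps the MoE expert weights of the first N layers on the CPU.
/home/phone/mimo-llama/llama.cpp/build/bin/llama-server \
    --model /run/media/phone/SharedData/LocalModelsBIG/mimo/MiMo-V2.5-IQ3_S-00001-of-00004.gguf \
    --ctx-size 20000 \
    -ngl 999 \
    -ub 2048 \
    -ncmoe 10 \
    --threads 48 \
    --host 0.0.0.0 \
    --port 8081 \
    --jinja --no-mmap --mlock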

I'll try it out later! I'll admit that ever since the "--fit" option I haven't used -ot, so it took longer than I'd like to admit to get my command running! Thanks for the advice!

--- edit

I dropped the -ot and just set --cpu-moe. I still get a solid 20 t/s generation and slightly higher prompt processing speeds, but my GPUs are now all only half utilized. This method would let me use much higher context, and I'm also able to run the Q4_K_M at the same speed. I'm also getting good results with --reasoning off, although I use LLMs more as a study partner to explain new concepts I come across. I don't think using this model, or any model really, without reasoning would be good for things like coding. It really speeds up general chat though.

I'm having fun testing it out! I'm looking forward to putting MiMo 2.5 Pro up against Kimi 2.6, as the two seem to be open-source rivals as of late. Interesting times!

I did dig into the flash attention (FA) issue a bit and got a pretty good speedup, tested on the Q8_0 of the non-Pro version:

Before:

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  8192 |   2048 |      0 |   14.860 |   551.27 |   59.324 |    34.52 |
|  8192 |   2048 |   8192 |   27.317 |   299.89 |  198.618 |    10.31 |
|  8192 |   2048 |  16384 |   40.400 |   202.77 |  220.156 |     9.30 |
|  8192 |   2048 |  24576 |   53.788 |   152.30 |  240.882 |     8.50 |

After:

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  8192 |   2048 |      0 |    2.646 |  3096.50 |   26.495 |    77.30 |
|  8192 |   2048 |   8192 |    2.849 |  2875.61 |   28.700 |    71.36 |
|  8192 |   2048 |  16384 |    3.035 |  2698.98 |   28.985 |    70.66 |
|  8192 |   2048 |  24576 |    3.224 |  2541.33 |   29.247 |    70.02 |

and PPL is still sane: Final estimate: PPL = 5.1331 +/- 0.03025
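For reference, that PPL number comes from a standard llama-perplexity run; the sketch below is roughly how to reproduce it (the model path, test file, and offload flags are illustrative, not my exact command):

# Rough reproduction sketch (model path, test file, and offload flags are illustrative).
./build/bin/llama-perplexity \
    -m MiMo-V2.5-Q8_0.gguf \
    -f wikitext-2-raw/wiki.test.raw \
    -ngl 999 --cpu-moe \
    -fa on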

I've pushed it to a new branch, based on the branch from this PR: https://github.com/AesSedai/llama.cpp/tree/mimo-v2.5-fattn
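If anyone wants to try it, a standard llama.cpp CUDA build of that branch should be enough, something like:

# Standard CUDA build of the linked branch; adjust CUDA options / -j for your machine.
git clone --branch mimo-v2.5-fattn https://github.com/AesSedai/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j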

I tested this on my 6000 Pros, but I think it would require more work for older arches / Vulkan / etc. For newish arches + CUDA it should be fine, though?

I compiled the new branch, but right now I'm tight on SSD space, and for some reason, in my sleep-deprived state last night while cleaning out old models, I decided to keep only the Q8 quant of this model lmao. Let's see how it fares. I actually found the responses between the Q8 and Q4_K_M very similar, varying by just one or two tokens in most cases (and even then the variation didn't change the final answer). I think I just wanted to keep the Q8 because I knew it was currently the most accurate quant, and the speeds were still decent for my use case.


New Branch Build - 116 T/S PP // 16 T/S Generation Speeds

This is, in my opinion, very usable: most of the model is offloaded to the CPU, and the only things on GPU are the non-MoE dense layers and the context cache. Actually, looking at it, out of my pool of 96GB of VRAM I'm currently only using a total of 21GB.

I don't have exact numbers, but PP is about 4x higher, and token generation is about 5 tokens per second faster, compared to the Q8 on the other PR build.

Really not bad, considering that in my testing above I was using the Q3 GGUF with most of the model offloaded to the GPUs. So basically I'm now running a much more accurate quant, 95% on CPU, at half the speed I was getting with a heavy quant!

Thanks for looking into it, this is a great improvement!

--- EDIT

For consistency I redownloaded the original IQ3 quant variation that I tested.

New results with new build + Q3: 623 t/s PP // 20 t/s TG

slot print_timing: id  0 | task 0 | 
prompt eval time =    5282.13 ms /  3291 tokens (    1.61 ms per token,   623.04 tokens per second)
       eval time =   33720.49 ms /   695 tokens (   48.52 ms per token,    20.61 tokens per second)
      total time =   39002.62 ms /  3986 tokens

Slower token gen than before, but much better PP.
