Working good on 96GB VRAM + DDR5 Setup

#2
by phakio - opened

Just leaving a post to say that, along with the mentioned PR, I got this model running at decent speeds on my setup (1x 4090, 3x 3090, and 512GB of DDR5 RAM for offload).

The IQ3 just barely doesn't make the cut to fit into 96GB of VRAM. I'd be interested in trying a Q2 quant just to see how fast it can run with full GPU offload, but I think the model's output is already affected at Q3.

Nonetheless, here are my stats and launch command so others can try it if they want!

I've found this model decent, except when it decides to overthink. It's funny: sometimes the model thinks very briefly and it's impressive, and other times it thinks more than the original Qwen 3 did at launch. I'd be very interested in trying out MiMo-v2.5 Pro if you ever quantize that one.

/home/phone/mimo-llama/llama.cpp/build/bin/llama-server \
    --model /run/media/phone/SharedData/LocalModelsBIG/mimo/MiMo-V2.5-IQ3_S-00001-of-00004.gguf \
    --alias AesSedai/MiMo-V2.5-GGUF \
    --ctx-size 20000 \
    -ngl 999 \
    -ot "blk\.(0|1|2|3|4|5|6|7)\..*=CUDA0,blk\.(9|10|11|12|13|14|15|16|39)\..*=CUDA1,blk\.(17|18|19|20|21|22|23|24|38)\..*=CUDA2,blk\.(25|26|27|28|29|30|31|32)\..*=CUDA3" \
    --parallel 1 \
    --cpu-moe \
    --threads 48 \
    --threads-batch 56 \
    --host 0.0.0.0 \
    --port 8081 \
    --jinja \
    --no-mmap \
    --mlock \
    --fit off \
    -fa off

[screenshots of performance stats]

Hi, thanks for the feedback! I will be quantizing and uploading Pro as well. I wasn't sure if there would be more requested changes in the PR, so I'm waiting until it's ready to merge before pulling the trigger there, given that it's a 2TB BF16 to wrangle.

Re: Q2, once the PR is merged I'm sure Bart / Ubergarm / Unsloth will provide the usual full suite of quantizations :)

Try setting -ub 2048; you should see a decent bump in PP tok/s. Also, I think you are offloading whole layers instead of MoE layers with that -ot parameter. Either use -ot or -ncmoe, or just set -ncmoe to something like 10 to test.
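For example, something along these lines (paths copied from your command above; the numbers are just for a quick test, not tuned):

# Example only: raise the micro-batch size and let -ncmoe place the expert tensors,
# instead of hand-pinning whole blocks with -ot.
# -ncmoe N keeps the MoE expert weights of the first N layers on the CPU.
/home/phone/mimo-llama/llama.cpp/build/bin/llama-server \
    --model /run/media/phone/SharedData/LocalModelsBIG/mimo/MiMo-V2.5-IQ3_S-00001-of-00004.gguf \
    --ctx-size 20000 \
    -ngl 999 \
    -ub 2048 \
    -ncmoe 10 \
    --threads 48 \
    --host 0.0.0.0 \
    --port 8081 \
    --jinja --no-mmap --mlock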

I'll try it out later! I'll admit that ever since the "--fit" option I haven't used -ot, so it took longer than I'd like to admit to get my command running! Thanks for the advice!

--- edit

I dropped the -ot and just set --cpu-moe. I still get a solid 20 t/s generation and slightly higher prompt processing speeds, but my GPUs are now all only half utilized. This method would let me use much higher context, and I'm also able to run the Q4_K_M at the same speed. I'm also getting good results with --reasoning off, although I use LLMs more as a study partner to explain new concepts I come across. I don't think using this model, or any model really, without reasoning would be good for things like coding. It really speeds up general chat though.

I'm having fun testing it out! I'm looking forward to putting MiMo 2.5 Pro up against Kimi 2.6, as the two seem to be open-source rivals as of late. Interesting times!

I did dig into the flash attention (FA) issue a bit and got a pretty good speedup, tested on the Q8_0 of the non-Pro version:

Before:

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  8192 |   2048 |      0 |   14.860 |   551.27 |   59.324 |    34.52 |
|  8192 |   2048 |   8192 |   27.317 |   299.89 |  198.618 |    10.31 |
|  8192 |   2048 |  16384 |   40.400 |   202.77 |  220.156 |     9.30 |
|  8192 |   2048 |  24576 |   53.788 |   152.30 |  240.882 |     8.50 |

After:

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  8192 |   2048 |      0 |    2.646 |  3096.50 |   26.495 |    77.30 |
|  8192 |   2048 |   8192 |    2.849 |  2875.61 |   28.700 |    71.36 |
|  8192 |   2048 |  16384 |    3.035 |  2698.98 |   28.985 |    70.66 |
|  8192 |   2048 |  24576 |    3.224 |  2541.33 |   29.247 |    70.02 |

and PPL is still sane: Final estimate: PPL = 5.1331 +/- 0.03025
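For reference, that PPL number comes from a standard llama-perplexity run; the sketch below is roughly how to reproduce it (the model path, test file, and offload flags are illustrative, not my exact command):

# Rough reproduction sketch (model path, test file, and offload flags are illustrative).
./build/bin/llama-perplexity \
    -m MiMo-V2.5-Q8_0.gguf \
    -f wikitext-2-raw/wiki.test.raw \
    -ngl 999 --cpu-moe \
    -fa on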

I've pushed it to a new branch, based on the branch from this PR: https://github.com/AesSedai/llama.cpp/tree/mimo-v2.5-fattn
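If anyone wants to try it, a standard llama.cpp CUDA build of that branch should be enough, something like:

# Standard CUDA build of the linked branch; adjust CUDA options / -j for your machine.
git clone --branch mimo-v2.5-fattn https://github.com/AesSedai/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j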

I tested this on my 6000 Pros, but I think it would require more work for older arches / Vulkan / etc. For newish arches + CUDA it should be fine, though?

I compiled the new branch, but right now I'm tight on SSD space, and for some reason, in my sleep-deprived state last night while cleaning out old models, I decided to keep only the Q8 quant of this model lmao. Let's see how it fares. I actually found the responses between the Q8 and Q4_K_M very similar, varying by just one or two tokens in most cases (and even then the variation didn't change the final answer). I think I just wanted to keep the Q8 because I knew it was currently the most accurate quant, and the speeds were still decent for my use case.


New Branch Build - 116 T/S PP // 16 T/S Generation Speeds

This is, in my opinion, very usable: most of the model is offloaded to the CPU, and the only things on GPU are the non-MoE dense layers and the context cache. Actually, looking at it, out of my pool of 96GB of VRAM I'm currently only using a total of 21GB.

I don't have exact numbers, but PP is about 4x higher, and token generation is about 5 tokens per second faster, compared to the Q8 on the other PR build.

Really not bad, considering that in my testing above I was using the Q3 GGUF with most of the model offloaded to the GPUs. So basically I'm now running a much more accurate quant, 95% on CPU, at half the speed I was getting with a heavy quant!

Thanks for looking into it, this is a great improvement!

--- EDIT

For consistency I redownloaded the original IQ3 quant variation that I tested.

New results with new build + Q3: 623 t/s PP // 20 t/s TG

slot print_timing: id  0 | task 0 | 
prompt eval time =    5282.13 ms /  3291 tokens (    1.61 ms per token,   623.04 tokens per second)
       eval time =   33720.49 ms /   695 tokens (   48.52 ms per token,    20.61 tokens per second)
      total time =   39002.62 ms /  3986 tokens

Slower token gen than before, but much better PP.
