Vision not working
I recompiled and redownloaded the latest code and GGUF, but it keeps producing @@@@ for image prompts and even some simple QA.
IDK, everything works for me with a one-shot compile. Zero problems.
A fix worked out with Claude:
MiMo-V2.5 vision tower: late-layer FFN overflow on Apple Metal → NaN embeddings → @@@@-only LLM output
Symptoms
llama-server, mimo-v2.5 model + mmproj-MiMo-V2.5-F16.gguf, Mac Metal backend.
Any image prompt produces only token id 31 (@) / 62182 (@@@@).
Cold-start text works on the first request; after one image inference, subsequent text requests also fail (the slot's KV cache inherits NaN from the broken image embedding).
MTMD_DEBUG_EMBEDDINGS=1 shows the merger output has mean=nan, sum=nan, with 98 of 340 image tokens entirely NaN (4096 NaN values per token); the other 242 tokens are completely fine.
Root cause (two issues, both in tools/mtmd/models/mimovl.cpp)
Issue 1: post_ln is built as LayerNorm but should be RMSNorm.
// Before:
inpL = build_norm(inpL, model.post_ln_w, model.post_ln_b, NORM_TYPE_NORMAL, 1e-6f, n_layer);
The mmproj only stores v.post_ln.weight (no .bias), and the converter explicitly tags the vision tower as RMSNorm:
convert_hf_to_gguf.py L9533
MiMoV2 vision RMSNorm: HF uses getattr(config, "rms_norm_eps", 1e-6) ...
self.gguf_writer.add_vision_attention_layernorm_eps(self.rms_norm_eps)
In-block ln1 / ln2 already use NORM_TYPE_RMS. Just post_ln was wrong.
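For reference, the two norm types reduce to different ggml ops; a minimal sketch of the difference (assuming clip's build_norm dispatches NORM_TYPE_RMS to ggml_rms_norm and NORM_TYPE_NORMAL to ggml_norm; tensor names are just the ones used above):
// Sketch only, not a patch: what the two NORM_TYPE_* paths boil down to.
// RMSNorm:   x / sqrt(mean(x^2) + eps) * w              (no mean-centering, no bias)
// LayerNorm: (x - mean(x)) / sqrt(var(x) + eps) * w + b
ggml_tensor * as_rms  = ggml_mul(ctx0, ggml_rms_norm(ctx0, inpL, eps), model.post_ln_w);
ggml_tensor * as_norm = ggml_mul(ctx0, ggml_norm(ctx0, inpL, eps), model.post_ln_w);
if (model.post_ln_b) {
    as_norm = ggml_add(ctx0, as_norm, model.post_ln_b); // bias only applies to the LayerNorm form
}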
Issue 2: Late-layer SwiGLU output overflows F16 inside the down_proj mat-mul on Apple Silicon.
Per-layer trace from a cb_eval walk of the graph (Metal backend, llama-server, mmproj F16):
ffn_inp-27        max=3513.63
ffn_inp_normed-27 max=20.24
ffn_gate-27       max=277.98    (F32, sane)
ffn_up-27         max=853.11    (F32, sane)
ffn_swiglu-27     max=234621.67 (F32, sane; silu(gate) * up grows here)
ffn_down-27       max=281042.72 inf=186880  ← overflow appears here
ffn_out-27        inf=186880
layer_out-27      inf=186880
node_995 RMS_NORM bad=186880               ← Inf/Inf = NaN, propagates to merger output
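For completeness, the kind of hook used for this walk looks roughly like the following (illustrative sketch; check_node and the printing format are hypothetical, the real cb_eval code may differ). It scans every computed F32 node for Inf/NaN via the scheduler's eval callback:
// Hypothetical per-node NaN/Inf scan registered through ggml's scheduler eval callback.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>
#include "ggml-backend.h"

static bool check_node(struct ggml_tensor * t, bool ask, void * /*user_data*/) {
    if (ask) {
        return true; // yes, observe every node after it is computed
    }
    if (t->type != GGML_TYPE_F32) {
        return true; // this sketch only scans F32 outputs
    }
    std::vector<float> buf(ggml_nelements(t));
    ggml_backend_tensor_get(t, buf.data(), 0, ggml_nbytes(t)); // copy back from the Metal buffer
    int64_t n_inf = 0, n_nan = 0;
    float vmax = -INFINITY;
    for (float v : buf) {
        if      (std::isinf(v)) n_inf++;
        else if (std::isnan(v)) n_nan++;
        else    vmax = std::max(vmax, v);
    }
    if (n_inf > 0 || n_nan > 0) {
        printf("%-20s op=%s max=%.2f inf=%lld nan=%lld\n",
               t->name, ggml_op_name(t->op), vmax, (long long) n_inf, (long long) n_nan);
    }
    return true;
}
// Registered once on the scheduler that runs the vision graph:
//   ggml_backend_sched_set_eval_callback(sched, check_node, nullptr);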
The interesting part is that this overflow happens even with F32 inputs:
node[988] 'ffn_down-27' op=MUL_MAT type=f32 inf=186880
src0='v.blk.27.ffn_down.weight (copy)'(f32) ← weight cast to F32
src1='ffn_swiglu-27'(f32) ← input is F32 (max 234621)
This is because Metal's mul_mm kernel uses simdgroup-matrix multiply (simdgroup_float8x8). On Apple Silicon, the simdgroup matrix multiply runs at F16 multiplicand precision (only the accumulator is F32), even when both operands are typed float. F32 input values >65504 get clamped to ±Inf inside the multiply step, so the per-row sums become Inf.
ggml_mul_mat_set_prec(GGML_PREC_F32) is silently ignored on Metal: grep -rn GGML_PREC_F32 ggml/src/ggml-metal/ returns nothing.
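The clipping threshold is easy to see in isolation; a tiny standalone check (illustrative only, not part of the fix) using ggml's FP16 conversion helpers:
// Standalone illustration: any value above the F16 max (65504) saturates to +Inf when
// narrowed to half precision, which is what happens to the multiplicands inside Metal's
// simdgroup matrix multiply.
#include <cstdio>
#include "ggml.h"

int main() {
    const float vals[] = { 917.0f, 65504.0f, 234621.67f }; // last one: observed swiglu max
    for (float v : vals) {
        const ggml_fp16_t h = ggml_fp32_to_fp16(v);
        printf("f32 %12.2f -> f16 -> f32 %12.2f\n", v, ggml_fp16_to_fp32(h));
    }
    return 0; // prints 917.00, 65504.00, inf
}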
I verified this is not "input has Inf" / "weight has Inf":
Host-side inp_raw vector is clean (bad=0, min=-1.79, max=2.15).
ffn_swiglu-27 inf=0 ← the input to ffn_down-27 has no Inf yet.
Casting ff_down_w to F32 explicitly (ggml_cast) does not help β the cb_eval log confirms the cast tensor reaches mat-mul as F32, and the simdgroup-matrix multiply still overflows.
Fix
tools/mtmd/models/mimovl.cpp, two changes:
// Merger post-norm: NORMAL → RMS, hardcoded eps → hparams eps
inpL = build_norm(inpL, model.post_ln_w, model.post_ln_b, NORM_TYPE_RMS, eps, n_layer);
// Inline build_ffn so we can pre/post-scale around the down-projection.
// down(swiglu * (1/256)) * 256 == down(swiglu), but the scaled input fits
// in F16 so the simdgroup multiply no longer overflows. Bias is added
// after the unscale to preserve down(x) + b semantics.
{
    ggml_tensor * tmp = build_mm(layer.ff_up_w, cur);
    if (layer.ff_up_b)   tmp = ggml_add(ctx0, tmp, layer.ff_up_b);
    cur = build_mm(layer.ff_gate_w, cur);
    if (layer.ff_gate_b) cur = ggml_add(ctx0, cur, layer.ff_gate_b);
    cur = ggml_swiglu_split(ctx0, cur, tmp);

    const float kFFNDownScale = 1.0f / 256.0f;
    cur = ggml_scale(ctx0, cur, kFFNDownScale);
    cur = build_mm(layer.ff_down_w, cur);
    cur = ggml_scale(ctx0, cur, 1.0f / kFFNDownScale);
    if (layer.ff_down_b) cur = ggml_add(ctx0, cur, layer.ff_down_b);
}
Choice of 256: the maximum swiglu value observed is ≈234621; 234621 / 256 ≈ 917, comfortably within F16 range (max 65504). Smaller factors (even ~8 or ~16) would also work for this trace; 256 leaves margin for variance across images.
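The workaround relies only on the linearity of the down-projection; a toy standalone check of that identity (plain arithmetic, not ggml code):
// Toy check of the identity used by the workaround: for a linear map W,
// (W * (x * s)) * (1/s) == W * x, so pre/post-scaling changes nothing mathematically;
// it only shrinks the values that pass through the F16 multiply.
#include <cassert>
#include <cmath>
#include <cstdio>

int main() {
    const float W[2][3] = { {0.5f, -1.0f, 2.0f}, {1.5f, 0.25f, -0.75f} };
    const float x[3]    = { 234621.67f, -1234.5f, 917.0f };
    const float s       = 1.0f / 256.0f;
    for (int r = 0; r < 2; ++r) {
        float direct = 0.0f, scaled = 0.0f;
        for (int c = 0; c < 3; ++c) {
            direct += W[r][c] * x[c];
            scaled += W[r][c] * (x[c] * s);
        }
        scaled /= s; // undo the pre-scale after the projection
        printf("row %d: direct=%.3f scaled=%.3f\n", r, direct, scaled);
        assert(std::fabs(direct - scaled) < 1e-2f * std::fabs(direct));
    }
    return 0;
}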
Diff stat
1 file changed, 30 insertions(+), 8 deletions(-)
Commit on local branch mimo-v2.5-vision-ffn-overflow-fix.
Long-term
The root cause is a Metal backend limitation: simdgroup_matrix<float, 8, 8> doesn't actually use F32 multiply on Apple's tensor units, and ggml_mul_mat_set_prec(GGML_PREC_F32) is a no-op there. A proper upstream fix would be either (a) honor GGML_PREC_F32 in Metal's mul_mm kernel selection by routing to a scalar F32 path when set, or (b) keep mmproj weights as BF16 (BF16 has F32 exponent range, so the simdgroup multiply step doesn't clip). MiMo-V2.5's vision tower is unusual in producing such large pre-down activations; most ViTs stay safely within F16 range.
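For illustration, option (a) from the graph side is just the existing precision hint; a minimal sketch (using raw ggml_mul_mat rather than the build_mm helper, and the hypothetical local name down) of what the down-projection would request if the Metal kernel honored it:
// Sketch: ask for F32 precision on the down-projection mat-mul. ggml_mul_mat_set_prec is
// a real ggml call, but the Metal backend currently ignores the hint.
ggml_tensor * down = ggml_mul_mat(ctx0, layer.ff_down_w, cur);
ggml_mul_mat_set_prec(down, GGML_PREC_F32);
cur = down;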
Ah, I did see that overflow issue on CUDA and added the F32 cast to address it. I don't have any Apple hardware I can run or test on, though, so I didn't know there was a gap remaining there. I'll look into the LayerNorm vs RMSNorm question later; that HF uses getattr(config, "rms_norm_eps", 1e-6) was a default from the MiMo-V2.5 transformers impl and isn't actually in the config. post_ln using RMS might explain a couple of the vision oddities though, which would be nice to see fixed.
I'm more hesitant about the scale-down fix and will have to investigate it further. Can you try the BF16 mmproj instead? I don't think that one overflows, and it might avoid needing those hacks.
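In case there isn't a published BF16 mmproj, something like this should produce one locally (assuming your checkout's convert_hf_to_gguf.py supports the --mmproj and --outtype flags; the paths and output filename here are placeholders):
python convert_hf_to_gguf.py /path/to/MiMo-V2.5 --mmproj --outtype bf16 --outfile mmproj-MiMo-V2.5-BF16.gguf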
The BF16 mmproj works well; F16 and F32 don't.
I threw your response into an LLM to double-check; its output:
The LayerNorm vs RMSNorm claim is wrong; the current code is correct.
HF reference (modeling_mimo_v2.py:768-781):
class MiMoVisionPatchMerger(nn.Module):
    def __init__(self, dim: int, context_dim: int, spatial_merge_size: int = 2):
        super().__init__()
        self.hidden_size = context_dim * (spatial_merge_size**2)
        self.ln_q = nn.LayerNorm(context_dim, eps=1e-6)  # ← LayerNorm, hardcoded eps
        ...
Compare to the in-block norms (modeling_mimo_v2.py:745-746):
self.norm1 = nn.RMSNorm(dim, eps=rms_norm_eps)
self.norm2 = nn.RMSNorm(dim, eps=rms_norm_eps)
The vision tower deliberately uses two different norm types: RMSNorm for the pre-attn / pre-FFN norms inside each block, and LayerNorm for the merger's pre-projection norm. Current mimovl.cpp mirrors that exactly: NORM_TYPE_RMS for ln_1 / ln_2, NORM_TYPE_NORMAL for post_ln. The hardcoded 1e-6f matches HF too.
Why the reporter inferred RMSNorm: they noticed the mmproj only stores v.post_ln.weight and no v.post_ln.bias. That's true, but it doesn't imply RMSNorm. The HF checkpoint itself only ships visual.merger.ln_q.weight (no .bias). Verified against model.safetensors.index.json:
visual.merger.ln_q.weight   (present)
visual.merger.mlp.0.weight  (present, also no .bias)
visual.merger.mlp.2.weight  (present, also no .bias)
nn.LayerNorm(context_dim, eps=1e-6) defaults to bias=True, but the published checkpoint just omits zero biases (the merger's nn.Linear layers also have no biases stored, same pattern). clip_graph::build_norm already handles mb == nullptr cleanly: ggml_norm(x, eps) * weight with no add, i.e. a LayerNorm with zero bias, which is mathematically what HF does.
The proposed Issue 1 "fix" would make the merger norm wrong: it would apply RMSNorm semantics (no mean-centering) where HF applies LayerNorm semantics (mean-centered). For activations with non-zero mean (and post-attention residual streams almost always have non-zero means), this changes the output distribution and would likely shift downstream LLM behavior, possibly subtly enough to look "the same on CUDA but somehow different on Metal."
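The mean-centering difference is easy to demonstrate numerically; a toy standalone check (not model code) showing LayerNorm and RMSNorm disagree for a non-zero-mean input:
// Toy check: for a non-zero-mean input, LayerNorm and RMSNorm (both with unit weight and
// zero bias) produce clearly different outputs, so swapping the merger norm type is not benign.
#include <cmath>
#include <cstdio>

int main() {
    const float x[4] = { 3.0f, 4.0f, 5.0f, 6.0f }; // mean = 4.5, clearly non-zero
    const float eps  = 1e-6f;
    float mean = 0.0f, mean_sq = 0.0f;
    for (float v : x) { mean += v; mean_sq += v * v; }
    mean /= 4.0f; mean_sq /= 4.0f;
    const float var = mean_sq - mean * mean;     // population variance
    const float rms = std::sqrt(mean_sq + eps);  // RMSNorm denominator
    const float dev = std::sqrt(var + eps);      // LayerNorm denominator
    for (float v : x) {
        printf("x=%.1f  layernorm=%+.3f  rmsnorm=%+.3f\n", v, (v - mean) / dev, v / rms);
    }
    return 0;
}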
On the converter side, add_vision_attention_layernorm_eps(self.rms_norm_eps) is just storing the eps value (the field name is "vision attention layernorm eps"; it's a single eps scalar, and the runtime decides whether to feed it to LayerNorm or RMSNorm based on the build code). That value is used for ln_1 / ln_2 (which are RMSNorm with rms_norm_eps), and the merger norm uses its own hardcoded 1e-6 in the runtime. So the eps wiring is correct too.
On Issue 2 (the real Metal bug): the diagnosis looks right. Metal's simdgroup_float8x8 multiply only has F16 multiplicand precision, and ggml_mul_mat_set_prec(GGML_PREC_F32) is silently a no-op there (grep -rn GGML_PREC_F32 ggml/src/ggml-metal/ returns nothing). So an F32 input >65504 gets clipped to ±Inf inside the multiply. The pre/post-scale workaround is mathematically valid (linear scaling commutes with mat-mul) but model-specific. The cleaner long-term fixes are exactly what the reporter listed: a Metal mul_mm kernel that respects GGML_PREC_F32, or using the mmproj as BF16.
I think it's just going to stay as an Apple/Metal issue for now, and the RMSNorm change is wrong.