Vision not working
I recompiled and redownloaded the latest code and GGUF, but it keeps producing @@@@ for image prompts and even some simple QA.
IDK, everything works for me with a one-shot compile. Zero problems.
A fix worked out with Claude:
MiMo-V2.5 vision tower: late-layer FFN overflow on Apple Metal → NaN embeddings → @@@@-only LLM output
Symptoms
llama-server, mimo-v2.5 model + mmproj-MiMo-V2.5-F16.gguf, Mac Metal backend.
Any image prompt produces only token id 31 (@) / 62182 (@@@@).
Cold-start text works on the first request; after one image inference, subsequent text requests also fail (the slot's KV cache inherits NaN from the broken image embedding).
MTMD_DEBUG_EMBEDDINGS=1 shows the merger output has mean=nan, sum=nan, with 98 of 340 image tokens entirely NaN (4096 NaN values per token); the other 242 tokens are completely fine.
Root cause (two issues, both in tools/mtmd/models/mimovl.cpp)
Issue 1: post_ln is built as LayerNorm but should be RMSNorm.
// Before:
inpL = build_norm(inpL, model.post_ln_w, model.post_ln_b, NORM_TYPE_NORMAL, 1e-6f, n_layer);
The mmproj only stores v.post_ln.weight (no .bias), and the converter explicitly tags the vision tower as RMSNorm:
convert_hf_to_gguf.py L9533
MiMoV2 vision RMSNorm: HF uses getattr(config, "rms_norm_eps", 1e-6) ...
self.gguf_writer.add_vision_attention_layernorm_eps(self.rms_norm_eps)
In-block ln1 / ln2 already use NORM_TYPE_RMS. Just post_ln was wrong.
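For reference, the two norm types reduce to different ggml ops; a minimal sketch of the difference (assuming clip's build_norm dispatches NORM_TYPE_RMS to ggml_rms_norm and NORM_TYPE_NORMAL to ggml_norm; tensor names are just the ones used above):
// Sketch only, not a patch: what the two NORM_TYPE_* paths boil down to.
// RMSNorm:   x / sqrt(mean(x^2) + eps) * w              (no mean-centering, no bias)
// LayerNorm: (x - mean(x)) / sqrt(var(x) + eps) * w + b
ggml_tensor * as_rms  = ggml_mul(ctx0, ggml_rms_norm(ctx0, inpL, eps), model.post_ln_w);
ggml_tensor * as_norm = ggml_mul(ctx0, ggml_norm(ctx0, inpL, eps), model.post_ln_w);
if (model.post_ln_b) {
    as_norm = ggml_add(ctx0, as_norm, model.post_ln_b); // bias only applies to the LayerNorm form
}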
Issue 2: Late-layer SwiGLU output overflows F16 inside the down_proj mat-mul on Apple Silicon.
Per-layer trace from a cb_eval walk of the graph (Metal backend, llama-server, mmproj F16):
ffn_inp-27        max=3513.63
ffn_inp_normed-27 max=20.24
ffn_gate-27       max=277.98    (F32, sane)
ffn_up-27         max=853.11    (F32, sane)
ffn_swiglu-27     max=234621.67 (F32, sane; silu(gate) * up grows here)
ffn_down-27       max=281042.72 inf=186880  ← overflow appears here
ffn_out-27        inf=186880
layer_out-27      inf=186880
node_995 RMS_NORM bad=186880               ← Inf/Inf = NaN, propagates to merger output
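For completeness, the kind of hook used for this walk looks roughly like the following (illustrative sketch; check_node and the printing format are hypothetical, the real cb_eval code may differ). It scans every computed F32 node for Inf/NaN via the scheduler's eval callback:
// Hypothetical per-node NaN/Inf scan registered through ggml's scheduler eval callback.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>
#include "ggml-backend.h"

static bool check_node(struct ggml_tensor * t, bool ask, void * /*user_data*/) {
    if (ask) {
        return true; // yes, observe every node after it is computed
    }
    if (t->type != GGML_TYPE_F32) {
        return true; // this sketch only scans F32 outputs
    }
    std::vector<float> buf(ggml_nelements(t));
    ggml_backend_tensor_get(t, buf.data(), 0, ggml_nbytes(t)); // copy back from the Metal buffer
    int64_t n_inf = 0, n_nan = 0;
    float vmax = -INFINITY;
    for (float v : buf) {
        if      (std::isinf(v)) n_inf++;
        else if (std::isnan(v)) n_nan++;
        else    vmax = std::max(vmax, v);
    }
    if (n_inf > 0 || n_nan > 0) {
        printf("%-20s op=%s max=%.2f inf=%lld nan=%lld\n",
               t->name, ggml_op_name(t->op), vmax, (long long) n_inf, (long long) n_nan);
    }
    return true;
}
// Registered once on the scheduler that runs the vision graph:
//   ggml_backend_sched_set_eval_callback(sched, check_node, nullptr);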
The interesting part is that this overflow happens even with F32 inputs:
node[988] 'ffn_down-27' op=MUL_MAT type=f32 inf=186880
src0='v.blk.27.ffn_down.weight (copy)'(f32) ← weight cast to F32
src1='ffn_swiglu-27'(f32) ← input is F32 (max 234621)
This is because Metal's mul_mm kernel uses simdgroup-matrix multiply (simdgroup_float8x8). On Apple Silicon, the simdgroup matrix multiply runs at F16 multiplicand precision (only the accumulator is F32), even when both operands are typed float. F32 input values >65504 get clamped to ±Inf inside the multiply step, so the per-row sums become Inf.
ggml_mul_mat_set_prec(GGML_PREC_F32) is silently ignored on Metal: grep -rn GGML_PREC_F32 ggml/src/ggml-metal/ returns nothing.
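The clipping threshold is easy to see in isolation; a tiny standalone check (illustrative only, not part of the fix) using ggml's FP16 conversion helpers:
// Standalone illustration: any value above the F16 max (65504) saturates to +Inf when
// narrowed to half precision, which is what happens to the multiplicands inside Metal's
// simdgroup matrix multiply.
#include <cstdio>
#include "ggml.h"

int main() {
    const float vals[] = { 917.0f, 65504.0f, 234621.67f }; // last one: observed swiglu max
    for (float v : vals) {
        const ggml_fp16_t h = ggml_fp32_to_fp16(v);
        printf("f32 %12.2f -> f16 -> f32 %12.2f\n", v, ggml_fp16_to_fp32(h));
    }
    return 0; // prints 917.00, 65504.00, inf
}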
I verified this is not "input has Inf" / "weight has Inf":
Host-side inp_raw vector is clean (bad=0, min=-1.79, max=2.15).
ffn_swiglu-27 inf=0 ← the input to ffn_down-27 has no Inf yet.
Casting ff_down_w to F32 explicitly (ggml_cast) does not help β the cb_eval log confirms the cast tensor reaches mat-mul as F32, and the simdgroup-matrix multiply still overflows.
Fix
tools/mtmd/models/mimovl.cpp, two changes:
// Merger post-norm: NORMAL → RMS, hardcoded eps → hparams eps
inpL = build_norm(inpL, model.post_ln_w, model.post_ln_b, NORM_TYPE_RMS, eps, n_layer);
// Inline build_ffn so we can pre/post-scale around the down-projection.
// down(swiglu * (1/256)) * 256 == down(swiglu), but the scaled input fits
// in F16 so the simdgroup multiply no longer overflows. Bias is added
// after the unscale to preserve down(x) + b semantics.
{
    ggml_tensor * tmp = build_mm(layer.ff_up_w, cur);
    if (layer.ff_up_b)   tmp = ggml_add(ctx0, tmp, layer.ff_up_b);
    cur = build_mm(layer.ff_gate_w, cur);
    if (layer.ff_gate_b) cur = ggml_add(ctx0, cur, layer.ff_gate_b);
    cur = ggml_swiglu_split(ctx0, cur, tmp);

    const float kFFNDownScale = 1.0f / 256.0f;
    cur = ggml_scale(ctx0, cur, kFFNDownScale);
    cur = build_mm(layer.ff_down_w, cur);
    cur = ggml_scale(ctx0, cur, 1.0f / kFFNDownScale);
    if (layer.ff_down_b) cur = ggml_add(ctx0, cur, layer.ff_down_b);
}
Choice of 256: the maximum swiglu value observed is ≈234621; 234621 / 256 ≈ 917, comfortably within F16 range (max 65504). Smaller factors (even ~8 or ~16) would also work for this trace; 256 leaves margin for variance across images.
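The workaround relies only on the linearity of the down-projection; a toy standalone check of that identity (plain arithmetic, not ggml code):
// Toy check of the identity used by the workaround: for a linear map W,
// (W * (x * s)) * (1/s) == W * x, so pre/post-scaling changes nothing mathematically;
// it only shrinks the values that pass through the F16 multiply.
#include <cassert>
#include <cmath>
#include <cstdio>

int main() {
    const float W[2][3] = { {0.5f, -1.0f, 2.0f}, {1.5f, 0.25f, -0.75f} };
    const float x[3]    = { 234621.67f, -1234.5f, 917.0f };
    const float s       = 1.0f / 256.0f;
    for (int r = 0; r < 2; ++r) {
        float direct = 0.0f, scaled = 0.0f;
        for (int c = 0; c < 3; ++c) {
            direct += W[r][c] * x[c];
            scaled += W[r][c] * (x[c] * s);
        }
        scaled /= s; // undo the pre-scale after the projection
        printf("row %d: direct=%.3f scaled=%.3f\n", r, direct, scaled);
        assert(std::fabs(direct - scaled) < 1e-2f * std::fabs(direct));
    }
    return 0;
}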
Diff stat
1 file changed, 30 insertions(+), 8 deletions(-)
Commit on local branch mimo-v2.5-vision-ffn-overflow-fix.
Long-term
The root cause is a Metal backend limitation: simdgroup_matrix<float, 8, 8> doesn't actually use F32 multiply on Apple's tensor units, and ggml_mul_mat_set_prec(GGML_PREC_F32) is a no-op there. A proper upstream fix would be either (a) honor GGML_PREC_F32 in Metal's mul_mm kernel selection by routing to a scalar F32 path when set, or (b) keep mmproj weights as BF16 (BF16 has F32 exponent range, so the simdgroup multiply step doesn't clip). MiMo-V2.5's vision tower is unusual in producing such large pre-down activations; most ViTs stay safely within F16 range.
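For illustration, option (a) from the graph side is just the existing precision hint; a minimal sketch (using raw ggml_mul_mat rather than the build_mm helper, and the hypothetical local name down) of what the down-projection would request if the Metal kernel honored it:
// Sketch: ask for F32 precision on the down-projection mat-mul. ggml_mul_mat_set_prec is
// a real ggml call, but the Metal backend currently ignores the hint.
ggml_tensor * down = ggml_mul_mat(ctx0, layer.ff_down_w, cur);
ggml_mul_mat_set_prec(down, GGML_PREC_F32);
cur = down;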
Ah, I did see that overflow issue on CUDA and added the F32 cast to address it. I don't have any Apple hardware I can run or test on, though, so I didn't know there was a gap remaining there. I'll look into the LayerNorm vs RMSNorm question later; that HF uses getattr(config, "rms_norm_eps", 1e-6) was a default from the MiMo-V2.5 transformers impl and isn't actually in the config. post_ln using RMS might explain a couple of the vision oddities though, which would be nice to see fixed.
I'm more hesitant about the scale-down fix and will have to investigate it further. Can you try the BF16 mmproj instead? I don't think that one overflows, and it might avoid needing those hacks.
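In case there isn't a published BF16 mmproj, something like this should produce one locally (assuming your checkout's convert_hf_to_gguf.py supports the --mmproj and --outtype flags; the paths and output filename here are placeholders):
python convert_hf_to_gguf.py /path/to/MiMo-V2.5 --mmproj --outtype bf16 --outfile mmproj-MiMo-V2.5-BF16.gguf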
The BF16 mmproj works well; F16 and F32 don't.
I threw your response into an LLM to double-check; its output:
The LayerNorm vs RMSNorm claim is wrong; the current code is correct.
HF reference (modeling_mimo_v2.py:768-781):
class MiMoVisionPatchMerger(nn.Module):
    def __init__(self, dim: int, context_dim: int, spatial_merge_size: int = 2):
        super().__init__()
        self.hidden_size = context_dim * (spatial_merge_size**2)
        self.ln_q = nn.LayerNorm(context_dim, eps=1e-6)  # ← LayerNorm, hardcoded eps
        ...
Compare to the in-block norms (modeling_mimo_v2.py:745-746):
self.norm1 = nn.RMSNorm(dim, eps=rms_norm_eps)
self.norm2 = nn.RMSNorm(dim, eps=rms_norm_eps)
The vision tower deliberately uses two different norm types: RMSNorm for the pre-attn / pre-FFN norms inside each block, and LayerNorm for the merger's pre-projection norm. Current mimovl.cpp mirrors that exactly: NORM_TYPE_RMS for ln_1 / ln_2, NORM_TYPE_NORMAL for post_ln. The hardcoded 1e-6f matches HF too.
Why the reporter inferred RMSNorm: they noticed the mmproj only stores v.post_ln.weight and no v.post_ln.bias. That's true, but it doesn't imply RMSNorm. The HF checkpoint itself only ships visual.merger.ln_q.weight (no .bias). Verified against model.safetensors.index.json:
visual.merger.ln_q.weight   (present)
visual.merger.mlp.0.weight  (present, also no .bias)
visual.merger.mlp.2.weight  (present, also no .bias)
nn.LayerNorm(context_dim, eps=1e-6) defaults to bias=True, but the published checkpoint just omits zero biases (the merger's nn.Linear layers also have no biases stored, same pattern). clip_graph::build_norm already handles mb == nullptr cleanly: ggml_norm(x, eps) * weight with no add, i.e. a LayerNorm with zero bias, which is mathematically what HF does.
The proposed Issue 1 "fix" would make the merger norm wrong: it would apply RMSNorm semantics (no mean-centering) where HF applies LayerNorm semantics (mean-centered). For activations with non-zero mean (and post-attention residual streams almost always have non-zero means), this changes the output distribution and would likely shift downstream LLM behavior, possibly subtly enough to look "the same on CUDA but somehow different on Metal."
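The mean-centering difference is easy to demonstrate numerically; a toy standalone check (not model code) showing LayerNorm and RMSNorm disagree for a non-zero-mean input:
// Toy check: for a non-zero-mean input, LayerNorm and RMSNorm (both with unit weight and
// zero bias) produce clearly different outputs, so swapping the merger norm type is not benign.
#include <cmath>
#include <cstdio>

int main() {
    const float x[4] = { 3.0f, 4.0f, 5.0f, 6.0f }; // mean = 4.5, clearly non-zero
    const float eps  = 1e-6f;
    float mean = 0.0f, mean_sq = 0.0f;
    for (float v : x) { mean += v; mean_sq += v * v; }
    mean /= 4.0f; mean_sq /= 4.0f;
    const float var = mean_sq - mean * mean;     // population variance
    const float rms = std::sqrt(mean_sq + eps);  // RMSNorm denominator
    const float dev = std::sqrt(var + eps);      // LayerNorm denominator
    for (float v : x) {
        printf("x=%.1f  layernorm=%+.3f  rmsnorm=%+.3f\n", v, (v - mean) / dev, v / rms);
    }
    return 0;
}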
On the converter side, add_vision_attention_layernorm_eps(self.rms_norm_eps) is just storing the eps value (the field name is "vision attention layernorm eps"; it's a single eps scalar, and the runtime decides whether to feed it to LayerNorm or RMSNorm based on the build code). That value is used for ln_1 / ln_2 (which are RMSNorm with rms_norm_eps), and the merger norm uses its own hardcoded 1e-6 in the runtime. So the eps wiring is correct too.
On Issue 2 (the real Metal bug): the diagnosis looks right. Metal's simdgroup_float8x8 multiply only has F16 multiplicand precision, and ggml_mul_mat_set_prec(GGML_PREC_F32) is silently a no-op there (grep -rn GGML_PREC_F32 ggml/src/ggml-metal/ returns nothing). So an F32 input >65504 gets clipped to ±Inf inside the multiply. The pre/post-scale workaround is mathematically valid (linear scaling commutes with mat-mul) but model-specific. The cleaner long-term fixes are exactly what the reporter listed: a Metal mul_mm kernel that respects GGML_PREC_F32, or using the mmproj as BF16.
I think it's just going to stay as an Apple/Metal issue for now, and the RMSNorm change is wrong.