---
license: mit
language:
- zh
- en
pipeline_tag: automatic-speech-recognition
tags:
- audio
- speech-recognition
- gguf
- mimo
- qwen2
library_name: ggml
base_model: XiaomiMiMo/MiMo-V2.5-ASR
---

# MiMo-V2.5-ASR (GGUF)

GGUF conversion of [`XiaomiMiMo/MiMo-V2.5-ASR`](https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR) for **[CrispStrobe/CrispASR](https://github.com/CrispStrobe/CrispASR)**. Pure C++ inference: no Python, no Transformers; runs on Apple Silicon (Metal), CPU, and CUDA.

The runtime works end-to-end: greedy decoding through the 36-layer Qwen2 LM, KV-cached prefill and step-decode graphs, and prompt construction that exactly matches the upstream `MimoAudio.asr_sft` reference. The JFK transcription test passes verbatim.

## Available variants

| File | Quant | Size | Layout | Recommended |
|---|---|---|---|---|
| `mimo-asr-f16.gguf` | F16 | 14.9 GB | separate Q/K/V | Full precision; needs ~16 GB RAM during inference |
| `mimo-asr-q4_k.gguf` | Q4_K | 4.2 GB | **fused QKV** | **Default**: fits in 8 GB RAM; no visible quality loss on JFK |

The default `mimo-asr-q4_k.gguf` (re-uploaded May 2026, PLAN #60d) fuses the per-LM-layer Q/K/V projections into a single `model.layers.{i}.attn.qkv.{weight,bias}` tensor pair. On M1 this yields roughly 1.7× faster per-step decode than the prior unfused layout (3058 ms/step → 1806 ms/step on a contended-disk run; about 1.1-1.2× pure-compute on a quiet box). The CrispASR runtime auto-detects either layout, so the F16 file above keeps working unchanged via the separate-Q/K/V fallback path. A fused F16 re-upload is queued behind disk-headroom availability.
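The equivalence behind the fusion can be sketched as follows (toy dimensions standing in for hidden=4096, and numpy standing in for the ggml matmuls; this is an illustration, not the converter's actual code):

```python
import numpy as np

# Toy dims: hidden state size, Q output dim, K/V output dim (K/V are smaller under GQA).
hidden, q_dim, kv_dim = 64, 64, 16

rng = np.random.default_rng(0)
x = rng.standard_normal((3, hidden))        # a small batch of hidden states
wq = rng.standard_normal((q_dim, hidden))   # separate projections (rows = output dim)
wk = rng.standard_normal((kv_dim, hidden))
wv = rng.standard_normal((kv_dim, hidden))
bq, bk, bv = (rng.standard_normal(d) for d in (q_dim, kv_dim, kv_dim))

# Unfused layout: 3 matmuls + 3 bias adds per layer.
q, k, v = x @ wq.T + bq, x @ wk.T + bk, x @ wv.T + bv

# Fused layout: concatenate along the output dimension -> 1 matmul + 1 add,
# then slice the combined result back into Q/K/V.
w_qkv = np.concatenate([wq, wk, wv], axis=0)
b_qkv = np.concatenate([bq, bk, bv])
qkv = x @ w_qkv.T + b_qkv
q2, k2, v2 = np.split(qkv, [q_dim, q_dim + kv_dim], axis=1)

assert np.allclose(q, q2) and np.allclose(k, k2) and np.allclose(v, v2)
```

Splitting the fused output back into Q/K/V is just slicing and computes the same per-row dot products, so results are unchanged; the saving comes from fewer matmul dispatches and one contiguous weight read per layer.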

Pair with **[`cstr/mimo-tokenizer-GGUF`](https://huggingface.co/cstr/mimo-tokenizer-GGUF)**: the audio tokenizer is a separate model that converts 16 kHz PCM into the 8-channel RVQ codes this LM consumes.

## Architecture

- **Audio path**: 6-layer input_local_transformer (1024d, 64 heads, group size GS=4, SiLU, sinusoidal RoPE on Q/K) + 8-channel RVQ codebook embeddings + linear group-downcast to 4096d
- **LM**: 36-layer Qwen2 (hidden=4096, 32 attention heads, 8 KV heads, intermediate=12288, RMSNorm, SwiGLU, RoPE θ=640K, max_pos=8192)
- **LM head**: untied, vocab=151680
- **Total params**: ~7.5B
- **Languages**: Mandarin (with Wu / Cantonese / Hokkien / Sichuanese dialect support), English, and code-switching
- **License**: MIT (matches upstream)
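A few numbers are implied by the bullet list above (derived here, not read from the upstream config): 32 query heads over hidden=4096 give a head dim of 128, and the 8 KV heads make this 4-to-1 grouped-query attention with a 1024-wide K/V projection:

```python
hidden, n_head, n_kv_head = 4096, 32, 8

head_dim = hidden // n_head       # 128 dims per head
q_dim = n_head * head_dim         # 4096: query projection output width
kv_dim = n_kv_head * head_dim     # 1024: K/V projection width (GQA, 4 Q heads per KV head)
gqa_ratio = n_head // n_kv_head

print(head_dim, q_dim, kv_dim, gqa_ratio)  # 128 4096 1024 4
```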

## Usage with CrispASR

```bash
# Build (one-time)
git clone https://github.com/CrispStrobe/CrispASR.git
cd CrispASR
cmake -B build-ninja-compile -G Ninja -DCMAKE_BUILD_TYPE=Release
cmake --build build-ninja-compile --target crispasr

# Download both halves
hf download cstr/mimo-asr-GGUF mimo-asr-q4_k.gguf --local-dir models/
hf download cstr/mimo-tokenizer-GGUF mimo-tokenizer-q4_k.gguf --local-dir models/

# Transcribe
build-ninja-compile/bin/crispasr \
  --backend mimo-asr \
  -m models/mimo-asr-q4_k.gguf \
  --codec-model models/mimo-tokenizer-q4_k.gguf \
  -f samples/jfk.wav
```

If `--codec-model` is omitted, the runtime auto-discovers `mimo-tokenizer-q4_k.gguf` (or `mimo-tokenizer.gguf`, `mimo-audio-tokenizer.gguf`) next to the LM file.

### Expected output (JFK sample)

```
And so, my fellow Americans, ask not what your country can do for you. Ask what you can do for your country.
```

This matches the upstream Python `MimoAudio.asr_sft` reference verbatim.

### Performance

On Apple M1, Metal backend, Q4_K, warm-cache:

| Phase | Time |
|---|---|
| LM load (mmap, lazy) | ~1 s |
| Audio tokenize (11 s sample) | ~0.5 s |
| Prefill (T_groups=71) | ~3 s |
| Step decode (~25 tokens) | ~20 s with the fused-QKV file (≈0.8 s/token; was ~30 s pre-fusion) |
| **End-to-end** | **~25-30 s for 11 s audio (~0.4× realtime)** |

Per-step decode is the bottleneck. PLAN #60d (May 2026) fused the per-LM-layer Q/K/V projections into one matmul, replacing 3 mul_mat + 3 ggml_add per layer × 36 layers with 1 + 1, for a measured ~1.7× speedup at the same disk-pressure level. KV-cache reuse via cached step graphs (PLAN #51b') is also live. Future perf wins: F16 with fused QKV (queued behind disk headroom), and `CRISPASR_KV_QUANT=q8_0` for hour-long inputs (the PLAN #60e env flag is already plumbed; the default stays F16 until the per-backend rollout completes).
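To put the `CRISPASR_KV_QUANT=q8_0` option in perspective: ggml's q8_0 format stores each block of 32 values as one f16 scale plus 32 int8s (34 bytes), so a quantised KV cache costs a bit more than half the F16 footprint. A back-of-envelope sketch using the LM dimensions from the Architecture section:

```python
n_layer, n_kv_head, head_dim = 36, 8, 128

# K and V entries cached per generated position, across all layers and KV heads.
elems_per_token = 2 * n_layer * n_kv_head * head_dim

f16_bytes = elems_per_token * 2                # 2 bytes per element
q8_0_bytes = (elems_per_token // 32) * 34      # 34-byte blocks of 32 elements

print(f16_bytes, q8_0_bytes)  # 147456 78336
```

That is roughly 144 KiB per position at F16 versus ~77 KiB at q8_0 (a ~1.88× reduction), which is where the appeal for hour-long inputs comes from.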

## Validation

Stage-by-stage cosine similarity on JFK, comparing the Q4_K GGUF activations against the bf16 PyTorch reference:

| Stage | cos_mean | cos_min |
|---|---|---|
| `prefill_audio_features` | 0.998 | 0.992 |
| `prefill_text_embeds` | 0.996 | 0.901 |
| `prefill_inputs_embeds` | 0.998 | 0.901 |
| `prefill_last_hidden` | 0.963 | 0.963 |
| `prefill_text_logits_step0` | 0.981 | 0.981 |

Argmax of the step-0 logits is token 1597 (`' And'`), matching the Python reference. The strict cos ≥ 0.999 gate is tracked under F16 + fp32 reference but requires >28 GB RAM; in practice the Q4_K + bf16-ref ceiling reflects quantisation noise, not bugs.
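The metric in the table is plain cosine similarity over flattened activations; a minimal stand-in for the stage check, using dummy tensors in place of the real dumped activations:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two flattened activation tensors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

ref = np.linspace(-1.0, 1.0, 4096)             # stand-in for a bf16 reference activation
quant = ref + 0.01 * np.sin(np.arange(4096))   # stand-in for the Q4_K activation + noise

print(round(cosine(ref, ref), 3))   # 1.0
print(cosine(ref, quant) > 0.99)    # True
```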

## Conversion (reproducibility)

```bash
# Set OMP_NUM_THREADS=1 to avoid a torch+OpenMP deadlock during bf16→f16
OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 MKL_NUM_THREADS=1 PYTHONUNBUFFERED=1 \
  python models/convert-mimo-asr-to-gguf.py \
    --input XiaomiMiMo/MiMo-V2.5-ASR \
    --output mimo-asr-f16.gguf \
    --outtype f16

build-ninja-compile/bin/crispasr-quantize \
  mimo-asr-f16.gguf mimo-asr-q4_k.gguf q4_k
```

Vocab is padded to 151680 entries (151643 BPE + 30 special + 7 unused slots) and `tokenizer.ggml.merges` is populated (151291 entries). Earlier filenames (`mimo-asr.gguf`, `mimo-asr-q2_k.gguf`) predate commit `2191a70`, shipped with a truncated vocab and missing merges, and were removed from the repo on 2026-05-01.
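The vocab accounting above is a fixed sum; a quick sanity check of the padding arithmetic (illustrative only):

```python
BPE_TOKENS, SPECIAL_TOKENS, TARGET_VOCAB = 151643, 30, 151680

# Slots left after real tokens are placed; these are the padded "unused" entries.
unused_slots = TARGET_VOCAB - BPE_TOKENS - SPECIAL_TOKENS

print(unused_slots)                                   # 7
print(BPE_TOKENS + SPECIAL_TOKENS + unused_slots)     # 151680
```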

## Citation

```bibtex
@misc{mimo2025v25asr,
  title = {MiMo-V2.5-ASR},
  author = {Xiaomi MiMo Team},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR}
}
```

## License

MIT, same as upstream.