---
license: mit
language:
- zh
- en
pipeline_tag: automatic-speech-recognition
tags:
- audio
- speech-recognition
- gguf
- mimo
- qwen2
library_name: ggml
base_model: XiaomiMiMo/MiMo-V2.5-ASR
---

# MiMo-V2.5-ASR (GGUF)

GGUF conversion of [`XiaomiMiMo/MiMo-V2.5-ASR`](https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR) for **[CrispStrobe/CrispASR](https://github.com/CrispStrobe/CrispASR)**. Pure C++ inference: no Python, no Transformers. Runs on Apple Silicon (Metal), CPU, and CUDA.

The runtime is functional end-to-end: greedy decode through the 36-layer Qwen2 LM, full prefill and step-decode graphs with KV caching, and prompt construction that matches the upstream `MimoAudio.asr_sft` reference exactly. The JFK transcription test passes verbatim.

## Available variants

| File | Quant | Size | Layout | Notes |
|---|---|---|---|---|
| `mimo-asr-f16.gguf` | F16 | 14.9 GB | separate Q/K/V | Full precision; needs ~16 GB RAM during inference |
| `mimo-asr-q4_k.gguf` | Q4_K | 4.2 GB | **fused QKV** | **Default**: fits in 8 GB RAM, no visible quality loss on JFK |

The default `mimo-asr-q4_k.gguf` (re-uploaded May 2026, PLAN #60d) ships with each LM layer's Q/K/V projections fused into a single `model.layers.{i}.attn.qkv.{weight,bias}` tensor pair. On M1 this gives ~1.7× faster per-step decode than the prior unfused layout (3058 ms/step → 1806 ms/step on a contended-disk run; ~1.1-1.2× pure-compute on a quiet box). The CrispASR runtime auto-detects either layout, so the F16 file above keeps working unchanged via the separate-Q/K/V fallback path. Re-upload of a fused F16 file is queued until enough disk headroom is available.
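
To illustrate why fusing is numerically equivalent, here is a minimal sketch in plain Python with toy dimensions. It is not the CrispASR implementation: the Q-then-K-then-V stacking order inside the fused tensor and the helper names (`matvec`, `make_weights`) are assumptions made for the example. Integer weights are used so the comparison is exact.

```python
# Toy dimensions; the real model would use hidden=4096, 32 attn heads, 8 KV heads.
HIDDEN, N_HEADS, N_KV_HEADS = 8, 4, 1
HEAD_DIM = HIDDEN // N_HEADS  # assumption: head_dim = hidden / attn heads


def matvec(w, b, x):
    """y = W x + b for a row-major weight matrix (list of rows)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def make_weights(rows, cols, seed):
    """Deterministic small integer weights so results compare exactly."""
    w = [[((seed + 3 * r + 5 * c) % 7) - 3 for c in range(cols)] for r in range(rows)]
    b = [((seed + r) % 5) - 2 for r in range(rows)]
    return w, b


q_rows = N_HEADS * HEAD_DIM      # output rows of the Q projection
kv_rows = N_KV_HEADS * HEAD_DIM  # output rows of the K and V projections (GQA)

wq, bq = make_weights(q_rows, HIDDEN, seed=1)
wk, bk = make_weights(kv_rows, HIDDEN, seed=2)
wv, bv = make_weights(kv_rows, HIDDEN, seed=3)

# Fusion: stack the three weight matrices (and biases) along the output dimension.
# The Q|K|V order is an assumption for this sketch.
w_qkv = wq + wk + wv
b_qkv = bq + bk + bv

x = [1, -2, 3, 0, 2, -1, 4, 1]

# Unfused path: three matmuls + three bias adds.
q, k, v = matvec(wq, bq, x), matvec(wk, bk, x), matvec(wv, bv, x)

# Fused path: one matmul + one bias add, then slice the result.
qkv = matvec(w_qkv, b_qkv, x)
q_f = qkv[:q_rows]
k_f = qkv[q_rows:q_rows + kv_rows]
v_f = qkv[q_rows + kv_rows:]

fused_matches = (q_f, k_f, v_f) == (q, k, v)
print(fused_matches)
```

Under the same head-dimension assumption, the real model's fused weight would have 4096 + 1024 + 1024 = 6144 output rows per layer. The speedup comes from issuing one larger matmul instead of three smaller ones, not from doing less arithmetic.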

Pair with **[`cstr/mimo-tokenizer-GGUF`](https://huggingface.co/cstr/mimo-tokenizer-GGUF)**: the audio tokenizer is a separate model that converts 16 kHz PCM into the 8-channel RVQ codes that this LM consumes.

## Architecture

- **Audio path:** 6-layer `input_local_transformer` (1024d, 64 heads, group size GS=4, SiLU, sinusoidal RoPE on Q/K) + 8-channel RVQ codebook embeddings + linear group-downcast to 4096d
- **LM:** 36-layer Qwen2 (hidden=4096, 32 attn heads, 8 KV heads, intermediate=12288, RMSNorm, SwiGLU, RoPE θ=640K, max_pos=8192)
- **LM head:** untied, vocab=151680
- **Total params:** ~7.5B
- **Languages:** Mandarin (with Wu / Cantonese / Hokkien / Sichuanese dialect support), English, code-switching
- **License:** MIT (matches upstream)

## Usage with CrispASR

```bash
# Build (one-time)
git clone https://github.com/CrispStrobe/CrispASR.git
cd CrispASR
cmake -B build-ninja-compile -G Ninja -DCMAKE_BUILD_TYPE=Release
cmake --build build-ninja-compile --target crispasr

# Download both halves (LM + audio tokenizer)
hf download cstr/mimo-asr-GGUF mimo-asr-q4_k.gguf --local-dir models/
hf download cstr/mimo-tokenizer-GGUF mimo-tokenizer-q4_k.gguf --local-dir models/

# Transcribe
build-ninja-compile/bin/crispasr \
  --backend mimo-asr \
  -m models/mimo-asr-q4_k.gguf \
  --codec-model models/mimo-tokenizer-q4_k.gguf \
  -f samples/jfk.wav
```

If `--codec-model` is omitted, the runtime auto-discovers `mimo-tokenizer-q4_k.gguf` (or `mimo-tokenizer.gguf`, `mimo-audio-tokenizer.gguf`) next to the LM file.
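
The lookup can be pictured roughly as follows. This is a sketch in Python rather than the runtime's C++ code; the function name `find_codec_model` is hypothetical, and the search order (the order in which the names are listed above) is an assumption.

```python
from pathlib import Path
from typing import Optional

# Candidate names come from the sentence above; the search order is an assumption.
CODEC_CANDIDATES = (
    "mimo-tokenizer-q4_k.gguf",
    "mimo-tokenizer.gguf",
    "mimo-audio-tokenizer.gguf",
)


def find_codec_model(lm_path: str) -> Optional[Path]:
    """Return the first candidate tokenizer file found beside the LM file, else None."""
    lm_dir = Path(lm_path).resolve().parent
    for name in CODEC_CANDIDATES:
        candidate = lm_dir / name
        if candidate.is_file():
            return candidate
    return None
```

For example, `find_codec_model("models/mimo-asr-q4_k.gguf")` would return the tokenizer path if both files were downloaded into `models/`, and `None` otherwise, in which case passing `--codec-model` explicitly is the safe choice.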

### Expected output (JFK sample)

```
And so, my fellow Americans, ask not what your country can do for you. Ask what you can do for your country.
```

This matches the upstream Python `MimoAudio.asr_sft` reference verbatim.

### Performance

On Apple M1, Metal backend, Q4_K, warm cache:

| Phase | Time |
|---|---|
| LM load (mmap, lazy) | ~1 s |
| Audio tokenize (11 s sample) | ~0.5 s |
| Prefill (T_groups=71) | ~3 s |
| Step decode (~25 tokens) | ~20 s with the fused-QKV file (≈0.8 s/token; was ~30 s pre-fusion) |
| **End-to-end** | **~25-30 s for 11 s audio (~0.4× realtime)** |
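
As a quick sanity check, the table's figures can be combined with a few lines of Python. The inputs are the approximate per-phase numbers from the table; the variable names are chosen for this example only.

```python
# Approximate figures copied from the table above (Apple M1, Q4_K, warm cache).
audio_s = 11.0
phases_s = {"load": 1.0, "tokenize": 0.5, "prefill": 3.0, "decode": 20.0}
decode_tokens = 25

total_s = sum(phases_s.values())                  # sum of the per-phase times
per_token_s = phases_s["decode"] / decode_tokens  # average step-decode latency
speed_vs_realtime = audio_s / total_s             # values below 1 are slower than realtime
decode_share = phases_s["decode"] / total_s       # fraction of time spent in step decode

print(f"total = {total_s:.1f} s, {per_token_s:.2f} s/token")
print(f"speed = {speed_vs_realtime:.2f}x realtime, decode share = {decode_share:.0%}")
```

The per-phase sum sits at the low end of the quoted 25-30 s range, and step decode accounts for roughly four fifths of it.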

Per-step decode is the bottleneck. PLAN #60d (May 2026) fused each LM layer's Q/K/V projections into one matmul, replacing 3 `mul_mat` + 3 `ggml_add` per layer × 36 layers with 1 + 1, for a measured ~1.7× speedup at the same disk-pressure level. KV cache reuse via cached step graphs (PLAN #51b') is also live. Future performance wins: F16 with fused QKV (queued behind disk headroom) and `CRISPASR_KV_QUANT=q8_0` for hour-long inputs (the PLAN #60e env flag is already plumbed; the default stays F16 until the per-backend rollout completes).
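
To give a feel for what a quantised KV cache would save, here is a rough sizing sketch. It assumes `head_dim = hidden / attn heads = 128`, and that the `q8_0` setting corresponds to ggml's Q8_0 block format (32 int8 values plus one fp16 scale, 34 bytes per block). Neither is confirmed for this runtime, and allocator overhead is ignored.

```python
# Architecture figures from this card; HEAD_DIM is an assumption (hidden / attn heads).
N_LAYERS, N_KV_HEADS, HIDDEN, N_HEADS, MAX_POS = 36, 8, 4096, 32, 8192
HEAD_DIM = HIDDEN // N_HEADS

# Bytes per cached element. F16 uses 2 bytes; a ggml Q8_0 block stores
# 32 int8 values plus one fp16 scale, i.e. 34 bytes per 32 elements.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_bytes(n_tokens: int, kv_type: str) -> float:
    """Size of the K and V caches across all layers for n_tokens positions."""
    elems_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM  # K + V
    return n_tokens * elems_per_token * BYTES_PER_ELEM[kv_type]


for kv_type in ("f16", "q8_0"):
    mib = kv_cache_bytes(MAX_POS, kv_type) / 2**20
    print(f"{kv_type}: {mib:.0f} MiB at {MAX_POS} positions")
```

Under these assumptions the cache at `max_pos=8192` shrinks from 1152 MiB to 612 MiB, a little under half.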

## Validation

Stage-by-stage cosine similarity against the bf16 PyTorch reference on JFK (Q4_K weights, bf16 reference):

| Stage | cos_mean | cos_min |
|---|---|---|
| `prefill_audio_features` | 0.998 | 0.992 |
| `prefill_text_embeds` | 0.996 | 0.901 |
| `prefill_inputs_embeds` | 0.998 | 0.901 |
| `prefill_last_hidden` | 0.963 | 0.963 |
| `prefill_text_logits_step0` | 0.981 | 0.981 |

Argmax of the step-0 logits is token 1597 (`' And'`), matching the Python reference. The strict cos ≥ 0.999 gate is tracked for the F16 + fp32-reference configuration, but that run requires >28 GB RAM; in practice the ceiling for Q4_K against the bf16 reference reflects quantisation noise, not bugs.
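
For readers who want to reproduce this kind of comparison on their own dumps, the two columns can be computed along these lines. This is a sketch, not the project's validation script: it assumes `cos_mean` and `cos_min` are the mean and minimum of per-row (per-position) cosine similarities, which would be consistent with the single-vector stages above reporting identical values in both columns. The sample activations are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def stage_stats(test_rows, ref_rows):
    """Return (cos_mean, cos_min) over matching rows of two activation dumps."""
    sims = [cosine(t, r) for t, r in zip(test_rows, ref_rows)]
    return sum(sims) / len(sims), min(sims)


# Hypothetical 2-row activations; a real dump would have one row per position.
ref = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0]]
test = [[1.0, 0.1, 2.0], [0.5, -1.0, 0.0]]
cos_mean, cos_min = stage_stats(test, ref)
print(f"cos_mean={cos_mean:.4f} cos_min={cos_min:.4f}")
```

Reporting the minimum alongside the mean is useful because a single badly matched row (for example one text-embedding position) can hide behind a high average.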

## Conversion (reproducibility)

```bash
# Set OMP_NUM_THREADS=1 to avoid a torch+OpenMP deadlock during bf16→f16 conversion
OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 MKL_NUM_THREADS=1 PYTHONUNBUFFERED=1 \
  python models/convert-mimo-asr-to-gguf.py \
  --input XiaomiMiMo/MiMo-V2.5-ASR \
  --output mimo-asr-f16.gguf \
  --outtype f16

# Quantise the F16 file to Q4_K
build-ninja-compile/bin/crispasr-quantize \
  mimo-asr-f16.gguf mimo-asr-q4_k.gguf q4_k
```

The vocab is padded to 151680 entries (151643 BPE + 30 special + 7 unused slots) and `tokenizer.ggml.merges` is populated (151291 entries). Earlier files (`mimo-asr.gguf`, `mimo-asr-q2_k.gguf`) shipped before commit `2191a70` with a truncated vocab and missing merges; they were removed from the repo on 2026-05-01.
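
The padding arithmetic can be checked with a short sketch. The helper `pad_vocab` and the `<unused_N>` placeholder naming are hypothetical; the converter's actual placeholder strings are not documented here.

```python
N_BPE, N_SPECIAL, N_UNUSED = 151_643, 30, 7  # counts from the paragraph above
VOCAB_SIZE = 151_680                         # rows in the LM head


def pad_vocab(tokens, target_size, fmt="<unused_{}>"):
    """Append placeholder tokens until the list has target_size entries."""
    missing = target_size - len(tokens)
    if missing < 0:
        raise ValueError("tokenizer has more entries than the LM head")
    return tokens + [fmt.format(i) for i in range(missing)]


# The three groups must add up to the LM head size.
assert N_BPE + N_SPECIAL + N_UNUSED == VOCAB_SIZE

# Toy example: 5 real tokens padded to 8 slots.
padded = pad_vocab(["a", "b", "c", "d", "e"], 8)
print(padded[-3:])
```

Padding matters because the decoder takes an argmax over all 151680 logits: every index the LM head can emit needs a vocab entry, even if the slot is never used in practice.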

## Citation

```bibtex
@misc{mimo2025v25asr,
  title = {MiMo-V2.5-ASR},
  author = {Xiaomi MiMo Team},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR}
}
```

## License

MIT, same as upstream.