@@ -14,32 +14,118 @@ library_name: ggml
 base_model: XiaomiMiMo/MiMo-V2.5-ASR
 ---
-# MiMo-V2.5-ASR -- GGUF
-GGUF conversion of [`XiaomiMiMo/MiMo-V2.5-ASR`](https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR) for use with **[CrispStrobe/CrispASR](https://github.com/CrispStrobe/CrispASR)**.
 ## Available variants
-| File | Quant | Size | Notes |
 |---|---|---|---|
-| `mimo-asr.gguf` | F16 | 15.3 GB | Full precision (719 tensors) |
-| `mimo-asr-q4_k.gguf` | Q4_K | ~4.5 GB | Quantized (pending) |
-## Model details
-- **Architecture:** 8-channel RVQ speech embeddings + 6-layer input transformer (1024d, 64 heads) + 36-layer Qwen2 LLM (4096d, 32 heads, 8 KV heads, SiLU, RoPE θ=640K)
-- **Parameters:** ~8B
-- **Languages:** Mandarin Chinese, English, Chinese dialects (Wu, Cantonese, Hokkien, Sichuanese), code-switched speech
-- **License:** MIT
-- **Source:** [`XiaomiMiMo/MiMo-V2.5-ASR`](https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR)
-## Notes
-- Requires separate `MiMo-Audio-Tokenizer` to convert waveform → RVQ tokens first
-- CrispASR runtime implementation in progress
 ## Usage with CrispASR
 ```bash
-./build/bin/crispasr --backend mimo-asr -m mimo-asr-q4_k.gguf -f audio.wav
 ```

 base_model: XiaomiMiMo/MiMo-V2.5-ASR
 ---
+# MiMo-V2.5-ASR — GGUF
+GGUF conversion of [`XiaomiMiMo/MiMo-V2.5-ASR`](https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR) for **[CrispStrobe/CrispASR](https://github.com/CrispStrobe/CrispASR)**. Pure C++ inference — no Python, no Transformers, runs on Apple Silicon (Metal), CPU, and CUDA.
+The runtime is functional end-to-end: greedy decoding through the 36-layer Qwen2 LM, prefill and step-decode graphs with KV caching, and prompt construction that matches the upstream `MimoAudio.asr_sft` reference exactly. The JFK sample transcribes verbatim against that reference.
 ## Available variants
+| File | Quant | Size | Notes |
 |---|---|---|---|
+| `mimo-asr-f16.gguf` | F16 | 14.9 GB | Full precision; needs ~16 GB RAM during inference |
+| `mimo-asr-q4_k.gguf` | Q4_K | 4.5 GB | **Default** — fits in 8 GB RAM, no quality loss visible on JFK |
+| `mimo-asr-q2_k.gguf` | Q2_K | not listed | Legacy quant; see commit history |
+Pair with **[`cstr/mimo-tokenizer-GGUF`](https://huggingface.co/cstr/mimo-tokenizer-GGUF)** — the audio tokenizer is a separate model that converts 16 kHz PCM → 8-channel RVQ codes that this LM consumes.
+## Architecture
+- **Audio path** — 6-layer input_local_transformer (1024d, 64 heads, GS=4 group size, SiLU, sinusoidal RoPE on Q/K) + 8-channel RVQ codebook embeddings + linear group-downcast to 4096d
+- **LM** — 36-layer Qwen2 (hidden=4096, 32 attn heads, 8 KV heads, intermediate=12288, RMSNorm, SwiGLU, RoPE θ=640K, max_pos=8192)
+- **LM head** — untied, vocab=151680
+- **Total params** — ~7.5B
+- **Languages** — Mandarin (with Wu / Cantonese / Hokkien / Sichuanese dialect support), English, code-switching
+- **License** — MIT (matches upstream)
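The LM figures above imply a few derived quantities that are handy when sizing memory. The sketch below is illustrative only: it assumes `head_dim = hidden / n_heads` (the usual Qwen2 layout) and an F16 KV cache, neither of which is stated in this card, and `LMConfig` is a hypothetical helper, not a CrispASR type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LMConfig:
    # Values copied from the architecture list above.
    hidden: int = 4096
    n_heads: int = 32
    n_kv_heads: int = 8
    n_layers: int = 36
    max_pos: int = 8192

    @property
    def head_dim(self) -> int:
        # Assumption: head_dim = hidden / n_heads.
        return self.hidden // self.n_heads

    @property
    def gqa_ratio(self) -> int:
        # Number of query heads sharing each KV head.
        return self.n_heads // self.n_kv_heads

    def kv_cache_bytes(self, n_tokens: int, bytes_per_elem: int = 2) -> int:
        # K and V tensors, per layer, per KV head.
        # bytes_per_elem=2 assumes an F16 cache.
        per_token = 2 * self.n_layers * self.n_kv_heads * self.head_dim
        return per_token * n_tokens * bytes_per_elem


cfg = LMConfig()
print(cfg.head_dim, cfg.gqa_ratio)              # 128 4
print(cfg.kv_cache_bytes(1))                    # 147456 bytes per token
print(cfg.kv_cache_bytes(cfg.max_pos) / 2**30)  # 1.125 GiB at max_pos
```

Under these assumptions the KV cache stays small relative to the weights, which is consistent with the Q4_K file fitting in 8 GB RAM.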
 ## Usage with CrispASR
 ```bash
+# Build (one-time)
+git clone https://github.com/CrispStrobe/CrispASR.git
+cd CrispASR
+cmake -B build-ninja-compile -G Ninja -DCMAKE_BUILD_TYPE=Release
+cmake --build build-ninja-compile --target crispasr
+# Download both halves
+huggingface-cli download cstr/mimo-asr-GGUF mimo-asr-q4_k.gguf --local-dir models/
+huggingface-cli download cstr/mimo-tokenizer-GGUF mimo-tokenizer-q4_k.gguf --local-dir models/
+# Transcribe
+build-ninja-compile/bin/crispasr \
+  --backend mimo-asr \
+  -m models/mimo-asr-q4_k.gguf \
+  --codec-model models/mimo-tokenizer-q4_k.gguf \
+  -f samples/jfk.wav
+```
+If `--codec-model` is omitted, the runtime auto-discovers `mimo-tokenizer-q4_k.gguf` (or `mimo-tokenizer.gguf`, `mimo-audio-tokenizer.gguf`) next to the LM file.
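The auto-discovery rule can be pictured as a sibling-file lookup. This is a minimal Python sketch of that idea, not the CrispASR implementation (which is C++): the candidate names come from the sentence above, while the search order and the function name `find_codec_model` are assumptions.

```python
from pathlib import Path
from typing import Optional

# Candidate file names from the paragraph above; the order is an assumption.
CODEC_CANDIDATES = (
    "mimo-tokenizer-q4_k.gguf",
    "mimo-tokenizer.gguf",
    "mimo-audio-tokenizer.gguf",
)


def find_codec_model(lm_path: str) -> Optional[Path]:
    """Return the first candidate tokenizer file found next to the LM file."""
    lm_dir = Path(lm_path).resolve().parent
    for name in CODEC_CANDIDATES:
        candidate = lm_dir / name
        if candidate.is_file():
            return candidate
    return None  # caller would then have to pass --codec-model explicitly


# Prints a path if a tokenizer sits next to the LM file, otherwise None.
print(find_codec_model("models/mimo-asr-q4_k.gguf"))
```

Keeping both GGUF files in the same directory, as the download commands above do, is what makes this lookup succeed.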
+### Expected output (JFK sample)
 ```
+And so, my fellow Americans, ask not what your country can do for you. Ask what you can do for your country.
+```
+This matches the upstream Python `MimoAudio.asr_sft` reference verbatim.
+### Performance
+On Apple M1, Metal backend, Q4_K:
+| Phase | Time |
+|---|---|
+| LM load (mmap, lazy) | ~1 s |
+| Audio tokenize (11 s sample) | ~0.5 s |
+| Prefill (T_groups=71) | ~3 s |
+| Step decode (~25 tokens) | ~30 s (≈1.2 s/token) |
+| **End-to-end** | **~37 s for 11 s audio (≈0.3× realtime speed)** |
+Step decode dominates; per-step Q4_K dequantisation is the bottleneck. F16 is faster per step (no dequant), but the file is 14.9 GB on disk and inference needs ~16 GB RAM. Possible future performance wins: KV cache reuse across step graphs, block-batched generation, and F16 with mmap-loaded weights.
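For readers who prefer a real-time factor, the timings in the table can be converted as follows. The inputs are the approximate figures reported above; the variable names are only illustrative.

```python
# Approximate figures copied from the table above (Apple M1, Metal, Q4_K).
audio_s = 11.0
end_to_end_s = 37.0
decode_s = 30.0
decode_tokens = 25

speed = audio_s / end_to_end_s          # fraction of realtime speed
rtf = end_to_end_s / audio_s            # real-time factor: processing time / audio time
s_per_token = decode_s / decode_tokens  # average step-decode latency
decode_share = decode_s / end_to_end_s  # share of wall time spent in step decode

print(f"speed: {speed:.2f}x realtime")  # speed: 0.30x realtime
print(f"RTF: {rtf:.2f}")                # RTF: 3.36
print(f"decode: {s_per_token:.2f} s/token, {decode_share:.0%} of wall time")
# decode: 1.20 s/token, 81% of wall time
```

Since step decode accounts for roughly four fifths of the wall time, per-step cost is the first place to look for speedups.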
+## Validation
+Stage-by-stage cosine similarity of the Q4_K model against the bf16 PyTorch reference on the JFK sample:
+| Stage | cos_mean | cos_min |
+|---|---|---|
+| `prefill_audio_features` | 0.998 | 0.992 |
+| `prefill_text_embeds` | 0.996 | 0.901 |
+| `prefill_inputs_embeds` | 0.998 | 0.901 |
+| `prefill_last_hidden` | 0.963 | 0.963 |
+| `prefill_text_logits_step0` | 0.981 | 0.981 |
+Argmax of step-0 logits is token 1597 (`' And'`), matching the Python reference. The strict cos ≥ 0.999 gate is defined for F16 weights against an fp32 reference, but that run requires >28 GB RAM; the lower figures above, for Q4_K against a bf16 reference, are attributed to quantisation noise rather than bugs.
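One plausible reading of `cos_mean` and `cos_min` is the mean and minimum of per-row cosine similarities between a dumped activation and its reference. The sketch below implements that reading in plain Python; it is an assumption about how the metrics are defined, not the project's actual diff tool, and `cos_stats` is a hypothetical name.

```python
import math
from typing import Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (both must be non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cos_stats(test: Sequence[Sequence[float]],
              ref: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return (cos_mean, cos_min) over rows of two (rows x dim) activations."""
    sims = [cosine(t, r) for t, r in zip(test, ref)]
    return sum(sims) / len(sims), min(sims)


# Toy data: the first row matches exactly, the second is 45 degrees off.
ref = [[1.0, 0.0], [0.0, 1.0]]
test = [[1.0, 0.0], [1.0, 1.0]]
cos_mean, cos_min = cos_stats(test, ref)
print(round(cos_mean, 4), round(cos_min, 4))  # 0.8536 0.7071
```

Under this reading, a stage with a single row gives `cos_mean == cos_min`, which would explain the identical values for `prefill_last_hidden` and `prefill_text_logits_step0`.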
+## Conversion (reproducibility)
+```bash
+# Set OMP_NUM_THREADS=1 to avoid a torch+OpenMP deadlock during bf16→f16
+OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 MKL_NUM_THREADS=1 PYTHONUNBUFFERED=1 \
+  python models/convert-mimo-asr-to-gguf.py \
+    --input XiaomiMiMo/MiMo-V2.5-ASR \
+    --output mimo-asr-f16.gguf \
+    --outtype f16
+build-ninja-compile/bin/crispasr-quantize \
+  mimo-asr-f16.gguf mimo-asr-q4_k.gguf q4_k
+```
+Vocab is padded to 151680 entries (151643 BPE + 30 special + 7 unused slots) and `tokenizer.ggml.merges` is populated (151291 entries). The earlier truncated-vocab GGUFs in this repo predate commit `2191a70` (CrispASR) and should not be used.
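The padding arithmetic can be checked with a short sketch. The counts come from the paragraph above; `pad_vocab` and the `<unused_i>` placeholder naming are hypothetical and do not reflect what the conversion script actually writes.

```python
N_BPE = 151_643       # BPE tokens (from the text above)
N_SPECIAL = 30        # special tokens (from the text above)
VOCAB_SIZE = 151_680  # padded vocab size, equal to the LM head vocab


def pad_vocab(tokens, target_size):
    """Append placeholder entries until the vocab has target_size tokens."""
    if len(tokens) > target_size:
        raise ValueError("vocab is larger than the target size")
    n_pad = target_size - len(tokens)
    # Placeholder naming is hypothetical.
    return list(tokens) + [f"<unused_{i}>" for i in range(n_pad)]


# Stand-in token strings; a real converter would read them from the tokenizer.
tokens = [f"tok{i}" for i in range(N_BPE + N_SPECIAL)]
padded = pad_vocab(tokens, VOCAB_SIZE)
print(len(padded) - len(tokens))  # 7 padding slots
```

Padding keeps every token id that the 151680-entry LM head can emit mapped to some vocab entry, which a truncated vocab would not.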
+## Citation
+```bibtex
+@misc{mimo2025v25asr,
+  title = {MiMo-V2.5-ASR},
+  author = {Xiaomi MiMo Team},
+  year = {2025},
+  publisher = {Hugging Face},
+  url = {https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR}
+}
+```
+## License
+MIT — same as upstream.