Upload README.md with huggingface_hub
README.md
CHANGED
|
@@ -14,32 +14,118 @@ library_name: ggml
|
|
| 14 |
base_model: XiaomiMiMo/MiMo-V2.5-ASR
|
| 15 |
---
|
| 16 |
|
| 17 |
-
# MiMo-V2.5-ASR
|
| 18 |
|
| 19 |
-
GGUF conversion of [`XiaomiMiMo/MiMo-V2.5-ASR`](https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR) for
|
| 20 |
|
| 21 |
## Available variants
|
| 22 |
|
| 23 |
-
| File | Quant | Size |
|
| 24 |
|---|---|---|---|
|
| 25 |
-
| `mimo-asr.gguf` | F16 |
|
| 26 |
-
| `mimo-asr-q4_k.gguf` | Q4_K |
|
| 27 |
-
|
| 28 |
-
## Model details
|
| 29 |
|
| 30 |
-
|
| 31 |
-
- **Parameters:** ~8B
|
| 32 |
-
- **Languages:** Mandarin Chinese, English, Chinese dialects (Wu, Cantonese, Hokkien, Sichuanese), code-switched speech
|
| 33 |
-
- **License:** MIT
|
| 34 |
-
- **Source:** [`XiaomiMiMo/MiMo-V2.5-ASR`](https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR)
|
| 35 |
|
| 36 |
-
##
|
| 37 |
|
| 38 |
-
-
|
| 39 |
-
-
|
| 40 |
|
| 41 |
## Usage with CrispASR
|
| 42 |
|
| 43 |
```bash
|
| 44 |
-
|
| 45 |
```
|
| 14 |
base_model: XiaomiMiMo/MiMo-V2.5-ASR
|
| 15 |
---
|
| 16 |
|
| 17 |
+
# MiMo-V2.5-ASR — GGUF
|
| 18 |
|
| 19 |
+
GGUF conversion of [`XiaomiMiMo/MiMo-V2.5-ASR`](https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR) for **[CrispStrobe/CrispASR](https://github.com/CrispStrobe/CrispASR)**. Inference is pure C++ with no Python or Transformers dependency, and runs on Apple Silicon (Metal), CPU, and CUDA.
|
| 20 |
+
|
| 21 |
+
The runtime is functional end-to-end: greedy decoding through the 36-layer Qwen2 LM, KV-cached prefill and step-decode graphs, and prompt construction that matches the upstream `MimoAudio.asr_sft` reference exactly. The JFK transcription test reproduces the reference transcript verbatim.
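As an illustration of the prefill plus step-decode pattern described above, here is a minimal greedy-decoding sketch in Python. It is not CrispASR code: `step_fn`, its `(logits, state)` return shape, and the toy model are hypothetical stand-ins for a KV-cached graph.

```python
def argmax(values):
    """Index of the largest element (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)

def greedy_decode(step_fn, prompt, eos_id, max_new_tokens):
    """Greedy decode with one prefill call, then one call per new token.

    step_fn(tokens, state) -> (logits, new_state) is a hypothetical interface;
    `state` plays the role of the KV cache carried between calls.
    """
    logits, state = step_fn(prompt, None)       # prefill: whole prompt at once
    output = []
    for _ in range(max_new_tokens):
        token = argmax(logits)                  # greedy: take the top logit
        if token == eos_id:
            break
        output.append(token)
        logits, state = step_fn([token], state) # step decode: one token in
    return output

def make_toy_model(script, vocab_size):
    """Toy stand-in that emits a fixed token script, one token per call."""
    def step_fn(tokens, state):
        calls = 0 if state is None else state
        target = script[min(calls, len(script) - 1)]
        logits = [0.0] * vocab_size
        logits[target] = 1.0
        return logits, calls + 1
    return step_fn

toy = make_toy_model(script=[3, 1, 4, 0], vocab_size=5)
decoded = greedy_decode(toy, prompt=[2, 2], eos_id=0, max_new_tokens=10)
```

The point of carrying `state` between calls is that each step only processes the newly generated token instead of re-running the whole sequence.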
|
| 22 |
|
| 23 |
## Available variants
|
| 24 |
|
| 25 |
+
| File | Quant | Size | Recommended |
|
| 26 |
|---|---|---|---|
|
| 27 |
+
| `mimo-asr-f16.gguf` | F16 | 14.9 GB | Unquantized half precision; needs ~16 GB RAM during inference |
|
| 28 |
+
| `mimo-asr-q4_k.gguf` | Q4_K | 4.5 GB | **Default**; fits in 8 GB RAM, no visible quality loss on the JFK sample |
|
| 29 |
+
| `mimo-asr-q2_k.gguf` | Q2_K | (legacy) | Older quant, see commit history |
|
| 30 |
|
| 31 |
+
Pair with **[`cstr/mimo-tokenizer-GGUF`](https://huggingface.co/cstr/mimo-tokenizer-GGUF)**: the audio tokenizer is a separate model that converts 16 kHz PCM into the 8-channel RVQ codes this LM consumes.
|
| 32 |
|
| 33 |
+
## Architecture
|
| 34 |
|
| 35 |
+
- **Audio path** — 6-layer input_local_transformer (1024d, 64 heads, GS=4 group size, SiLU, sinusoidal RoPE on Q/K) + 8-channel RVQ codebook embeddings + linear group-downcast to 4096d
|
| 36 |
+
- **LM** — 36-layer Qwen2 (hidden=4096, 32 attn heads, 8 KV heads, intermediate=12288, RMSNorm, SwiGLU, RoPE θ=640K, max_pos=8192)
|
| 37 |
+
- **LM head** — untied, vocab=151680
|
| 38 |
+
- **Total params** — ~7.5B
|
| 39 |
+
- **Languages** — Mandarin (with Wu / Cantonese / Hokkien / Sichuanese dialect support), English, code-switching
|
| 40 |
+
- **License** — MIT (matches upstream)
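The LM figures in the list above (36 layers, 32 attention heads, 8 KV heads, hidden size 4096, max_pos 8192) allow a rough estimate of the KV-cache footprint. The sketch below assumes a head dimension of `hidden / heads = 128` and an F16 cache; neither is stated in this README, so treat the result as a back-of-the-envelope figure rather than what the runtime actually allocates.

```python
# Values taken from the architecture list.
N_LAYERS = 36
N_HEADS = 32
N_KV_HEADS = 8
HIDDEN = 4096
MAX_POS = 8192

# Assumptions (not stated in the README):
HEAD_DIM = HIDDEN // N_HEADS   # 128, the usual hidden/heads split
BYTES_PER_ELEM = 2             # F16 cache entries

def kv_cache_bytes(n_tokens, bytes_per_elem=BYTES_PER_ELEM):
    """Bytes needed to cache keys and values for n_tokens positions."""
    # Factor 2 covers the separate K and V tensors in every layer.
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bytes_per_elem
    return per_token * n_tokens

per_token_kib = kv_cache_bytes(1) / 1024
full_context_gib = kv_cache_bytes(MAX_POS) / 1024**3
print(f"{per_token_kib:.0f} KiB per token, {full_context_gib:.3f} GiB at max_pos")
```

Because only the 8 KV heads are cached, not all 32 attention heads, the cache is a quarter of the size it would be without grouped-query attention.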
|
| 41 |
|
| 42 |
## Usage with CrispASR
|
| 43 |
|
| 44 |
```bash
|
| 45 |
+
# Build (one-time)
|
| 46 |
+
git clone https://github.com/CrispStrobe/CrispASR.git
|
| 47 |
+
cd CrispASR
|
| 48 |
+
cmake -B build-ninja-compile -G Ninja -DCMAKE_BUILD_TYPE=Release
|
| 49 |
+
cmake --build build-ninja-compile --target crispasr
|
| 50 |
+
|
| 51 |
+
# Download both halves
|
| 52 |
+
huggingface-cli download cstr/mimo-asr-GGUF mimo-asr-q4_k.gguf --local-dir models/
|
| 53 |
+
huggingface-cli download cstr/mimo-tokenizer-GGUF mimo-tokenizer-q4_k.gguf --local-dir models/
|
| 54 |
+
|
| 55 |
+
# Transcribe
|
| 56 |
+
build-ninja-compile/bin/crispasr \
|
| 57 |
+
--backend mimo-asr \
|
| 58 |
+
-m models/mimo-asr-q4_k.gguf \
|
| 59 |
+
--codec-model models/mimo-tokenizer-q4_k.gguf \
|
| 60 |
+
-f samples/jfk.wav
|
| 61 |
+
```
|
| 62 |
+
|
| 63 |
+
If `--codec-model` is omitted, the runtime auto-discovers `mimo-tokenizer-q4_k.gguf` (or `mimo-tokenizer.gguf`, `mimo-audio-tokenizer.gguf`) next to the LM file.
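The auto-discovery rule can be pictured as a lookup next to the LM file. The Python sketch below is illustrative only: the function name is hypothetical, and trying the three filenames in the order listed above is an assumption about priority, not a description of the C++ implementation.

```python
from pathlib import Path
from typing import Optional

# Candidate filenames, in the order the README lists them (priority is assumed).
CODEC_CANDIDATES = (
    "mimo-tokenizer-q4_k.gguf",
    "mimo-tokenizer.gguf",
    "mimo-audio-tokenizer.gguf",
)

def find_codec_model(lm_path: str) -> Optional[Path]:
    """Return the first candidate found beside the LM file, or None."""
    lm_dir = Path(lm_path).resolve().parent
    for name in CODEC_CANDIDATES:
        candidate = lm_dir / name
        if candidate.is_file():
            return candidate
    return None
```

In practice this means that downloading both GGUF files into the same `models/` directory, as in the commands above, is enough for the tokenizer to be picked up without `--codec-model`.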
|
| 64 |
+
|
| 65 |
+
### Expected output (JFK sample)
|
| 66 |
+
|
| 67 |
```
|
| 68 |
+
And so, my fellow Americans, ask not what your country can do for you. Ask what you can do for your country.
|
| 69 |
+
```
|
| 70 |
+
|
| 71 |
+
This matches the upstream Python `MimoAudio.asr_sft` reference verbatim.
|
| 72 |
+
|
| 73 |
+
### Performance
|
| 74 |
+
|
| 75 |
+
On Apple M1, Metal backend, Q4_K:
|
| 76 |
+
|
| 77 |
+
| Phase | Time |
|
| 78 |
+
|---|---|
|
| 79 |
+
| LM load (mmap, lazy) | ~1 s |
|
| 80 |
+
| Audio tokenize (11 s sample) | ~0.5 s |
|
| 81 |
+
| Prefill (T_groups=71) | ~3 s |
|
| 82 |
+
| Step decode (~25 tokens) | ~30 s (≈1 s/token) |
|
| 83 |
+
| **End-to-end** | **~37 s for 11 s audio (≈0.3× realtime speed, RTF ≈ 3.4)** |
|
| 84 |
+
|
| 85 |
+
Step decode dominates the runtime; per-step Q4_K dequantization is the bottleneck. F16 is faster per step (no dequantization) but takes 14.9 GB on disk and needs all of it resident in RAM during compute. Possible future performance wins: KV cache reuse across step graphs, block-batched generation, and F16 with mmap-loaded weights.
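The arithmetic behind the table can be checked in a few lines. The sketch below uses the approximate figures from the table as inputs; the variable names are illustrative, and the listed phases sum to slightly less than the reported end-to-end time, so some overhead is presumably not itemized.

```python
# Approximate phase timings from the table above, in seconds.
phases = {
    "lm_load": 1.0,
    "audio_tokenize": 0.5,
    "prefill": 3.0,
    "step_decode": 30.0,
}
audio_seconds = 11.0
reported_end_to_end = 37.0   # as reported; includes time not itemized above
decoded_tokens = 25          # approximate token count from the table

phase_sum = sum(phases.values())
realtime_speed = audio_seconds / reported_end_to_end   # >1 means faster than realtime
rtf = reported_end_to_end / audio_seconds              # real-time factor (inverse)
decode_share = phases["step_decode"] / phase_sum
sec_per_token = phases["step_decode"] / decoded_tokens

print(f"itemized phases: {phase_sum:.1f} s")
print(f"speed: {realtime_speed:.2f}x realtime, RTF: {rtf:.2f}")
print(f"step decode share: {decode_share:.0%}, {sec_per_token:.1f} s/token")
```

Since step decode accounts for most of the itemized time, any per-token speedup translates almost directly into end-to-end speedup.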
|
| 86 |
+
|
| 87 |
+
## Validation
|
| 88 |
+
|
| 89 |
+
Stage-by-stage cosine similarity between the Q4_K GGUF runtime and the bf16 PyTorch reference on the JFK sample:
|
| 90 |
+
|
| 91 |
+
| Stage | cos_mean | cos_min |
|
| 92 |
+
|---|---|---|
|
| 93 |
+
| `prefill_audio_features` | 0.998 | 0.992 |
|
| 94 |
+
| `prefill_text_embeds` | 0.996 | 0.901 |
|
| 95 |
+
| `prefill_inputs_embeds` | 0.998 | 0.901 |
|
| 96 |
+
| `prefill_last_hidden` | 0.963 | 0.963 |
|
| 97 |
+
| `prefill_text_logits_step0` | 0.981 | 0.981 |
|
| 98 |
+
|
| 99 |
+
The argmax of the step-0 logits is token 1597 (`' And'`), matching the Python reference. The stricter cos ≥ 0.999 gate is tracked for the F16 + fp32-reference configuration, which requires >28 GB RAM; in practice the ceiling seen here (Q4_K against a bf16 reference) reflects quantisation noise, not bugs.
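For readers who want to reproduce this kind of check, a minimal sketch of the two statistics follows. It assumes `cos_mean` and `cos_min` are the mean and minimum of per-row (per-position) cosine similarities between a runtime dump and a reference dump; the function names are hypothetical and this is not the project's validation tooling.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def stage_stats(test_rows, ref_rows):
    """Return (cos_mean, cos_min) over matching rows of two 2-D dumps."""
    if len(test_rows) != len(ref_rows):
        raise ValueError("row count mismatch between runtime and reference dump")
    sims = [cosine(t, r) for t, r in zip(test_rows, ref_rows)]
    return sum(sims) / len(sims), min(sims)

# A single-row stage necessarily has cos_mean == cos_min.
mean_one, min_one = stage_stats([[1.0, 2.0, 3.0]], [[2.0, 4.0, 6.0]])
```

Under this reading, stages such as `prefill_last_hidden` and `prefill_text_logits_step0` report identical mean and minimum because they consist of a single vector.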
|
| 100 |
+
|
| 101 |
+
## Conversion (reproducibility)
|
| 102 |
+
|
| 103 |
+
```bash
|
| 104 |
+
# Set OMP_NUM_THREADS=1 to avoid a torch+OpenMP deadlock during bf16→f16
|
| 105 |
+
OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 MKL_NUM_THREADS=1 PYTHONUNBUFFERED=1 \
|
| 106 |
+
python models/convert-mimo-asr-to-gguf.py \
|
| 107 |
+
--input XiaomiMiMo/MiMo-V2.5-ASR \
|
| 108 |
+
--output mimo-asr-f16.gguf \
|
| 109 |
+
--outtype f16
|
| 110 |
+
|
| 111 |
+
build-ninja-compile/bin/crispasr-quantize \
|
| 112 |
+
mimo-asr-f16.gguf mimo-asr-q4_k.gguf q4_k
|
| 113 |
+
```
|
| 114 |
+
|
| 115 |
+
Vocab is padded to 151680 entries (151643 BPE + 30 special + 7 unused slots) and `tokenizer.ggml.merges` is populated (151291 entries). The earlier truncated-vocab GGUFs in this repo predate commit `2191a70` (CrispASR) and should not be used.
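The vocabulary bookkeeping above is simple arithmetic, sketched here in Python. The padding helper and the `<unused_N>` placeholder naming are hypothetical illustrations; the README does not say what the converter writes into the unused slots.

```python
# Counts taken from the paragraph above.
N_BPE = 151_643
N_SPECIAL = 30
PADDED_VOCAB = 151_680

# Slots left over once real tokens are placed.
n_unused = PADDED_VOCAB - (N_BPE + N_SPECIAL)

def pad_vocab(tokens, target_size, filler="<unused_{}>"):
    """Pad a token list to target_size with placeholder entries.

    The placeholder format is an assumption made for this sketch.
    """
    if len(tokens) > target_size:
        raise ValueError("vocabulary is larger than the target size")
    padding = [filler.format(i) for i in range(target_size - len(tokens))]
    return list(tokens) + padding

example = pad_vocab(["a", "b", "<eos>"], 5)
print(n_unused, example)
```

Padding matters because the untied LM head has 151680 output rows; a tokenizer table shorter than that leaves some logit indices without a token to map to.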
|
| 116 |
+
|
| 117 |
+
## Citation
|
| 118 |
+
|
| 119 |
+
```bibtex
|
| 120 |
+
@misc{mimo2025v25asr,
|
| 121 |
+
title = {MiMo-V2.5-ASR},
|
| 122 |
+
author = {Xiaomi MiMo Team},
|
| 123 |
+
year = {2025},
|
| 124 |
+
publisher = {Hugging Face},
|
| 125 |
+
url = {https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR}
|
| 126 |
+
}
|
| 127 |
+
```
|
| 128 |
+
|
| 129 |
+
## License
|
| 130 |
+
|
| 131 |
+
MIT — same as upstream.
|