cstr committed on
Commit
bb6a49b
·
verified ·
1 Parent(s): a320115

Upload README.md with huggingface_hub

  1. README.md +102 -16
README.md CHANGED
@@ -14,32 +14,118 @@ library_name: ggml
14
  base_model: XiaomiMiMo/MiMo-V2.5-ASR
15
  ---
16
 
17
- # MiMo-V2.5-ASR -- GGUF
18
 
19
- GGUF conversion of [`XiaomiMiMo/MiMo-V2.5-ASR`](https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR) for use with **[CrispStrobe/CrispASR](https://github.com/CrispStrobe/CrispASR)**.
 
 
20
 
21
  ## Available variants
22
 
23
- | File | Quant | Size | Notes |
24
  |---|---|---|---|
25
- | `mimo-asr.gguf` | F16 | 15.3 GB | Full precision (719 tensors) |
26
- | `mimo-asr-q4_k.gguf` | Q4_K | ~4.5 GB | Quantized (pending) |
27
-
28
- ## Model details
29
 
30
- - **Architecture:** 8-channel RVQ speech embeddings + 6-layer input transformer (1024d, 64 heads) + 36-layer Qwen2 LLM (4096d, 32 heads, 8 KV heads, SiLU, RoPE θ=640K)
31
- - **Parameters:** ~8B
32
- - **Languages:** Mandarin Chinese, English, Chinese dialects (Wu, Cantonese, Hokkien, Sichuanese), code-switched speech
33
- - **License:** MIT
34
- - **Source:** [`XiaomiMiMo/MiMo-V2.5-ASR`](https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR)
35
 
36
- ## Notes
37
 
38
- - Requires separate `MiMo-Audio-Tokenizer` to convert waveform → RVQ tokens first
39
- - CrispASR runtime implementation in progress
 
 
 
 
40
 
41
  ## Usage with CrispASR
42
 
43
  ```bash
44
- ./build/bin/crispasr --backend mimo-asr -m mimo-asr-q4_k.gguf -f audio.wav
45
  ```
14
  base_model: XiaomiMiMo/MiMo-V2.5-ASR
15
  ---
16
 
17
+ # MiMo-V2.5-ASR — GGUF
18
 
19
+ GGUF conversion of [`XiaomiMiMo/MiMo-V2.5-ASR`](https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR) for **[CrispStrobe/CrispASR](https://github.com/CrispStrobe/CrispASR)**. Inference is pure C++ (no Python, no Transformers) and runs on Apple Silicon (Metal), CPU, and CUDA.
20
+
21
+ The runtime works end to end: greedy decoding through the 36-layer Qwen2 LM, KV-cached prefill and step-decode graphs, and prompt construction that matches the upstream `MimoAudio.asr_sft` reference exactly. The JFK transcription test reproduces the reference output verbatim.
22
 
23
  ## Available variants
24
 
25
+ | File | Quant | Size | Notes |
26
  |---|---|---|---|
27
+ | `mimo-asr-f16.gguf` | F16 | 14.9 GB | Full precision; needs ~16 GB RAM during inference |
28
+ | `mimo-asr-q4_k.gguf` | Q4_K | 4.5 GB | **Default**: fits in 8 GB RAM, no visible quality loss on the JFK sample |
29
+ | `mimo-asr-q2_k.gguf` | Q2_K | (legacy) | Older quant, see commit history |
 
30
 
31
+ Pair with **[`cstr/mimo-tokenizer-GGUF`](https://huggingface.co/cstr/mimo-tokenizer-GGUF)**: the audio tokenizer is a separate model that converts 16 kHz PCM into the 8-channel RVQ codes this LM consumes.
 
 
 
 
32
 
33
+ ## Architecture
34
 
35
+ - **Audio path** — 6-layer input_local_transformer (1024d, 64 heads, GS=4 group size, SiLU, sinusoidal RoPE on Q/K) + 8-channel RVQ codebook embeddings + linear group-downcast to 4096d
36
+ - **LM** — 36-layer Qwen2 (hidden=4096, 32 attn heads, 8 KV heads, intermediate=12288, RMSNorm, SwiGLU, RoPE θ=640K, max_pos=8192)
37
+ - **LM head** — untied, vocab=151680
38
+ - **Total params** — ~7.5B
39
+ - **Languages** — Mandarin (with Wu / Cantonese / Hokkien / Sichuanese dialect support), English, code-switching
40
+ - **License** — MIT (matches upstream)
41
 
42
  ## Usage with CrispASR
43
 
44
  ```bash
45
+ # Build (one-time)
46
+ git clone https://github.com/CrispStrobe/CrispASR.git
47
+ cd CrispASR
48
+ cmake -B build-ninja-compile -G Ninja -DCMAKE_BUILD_TYPE=Release
49
+ cmake --build build-ninja-compile --target crispasr
50
+
51
+ # Download both halves
52
+ huggingface-cli download cstr/mimo-asr-GGUF mimo-asr-q4_k.gguf --local-dir models/
53
+ huggingface-cli download cstr/mimo-tokenizer-GGUF mimo-tokenizer-q4_k.gguf --local-dir models/
54
+
55
+ # Transcribe
56
+ build-ninja-compile/bin/crispasr \
57
+ --backend mimo-asr \
58
+ -m models/mimo-asr-q4_k.gguf \
59
+ --codec-model models/mimo-tokenizer-q4_k.gguf \
60
+ -f samples/jfk.wav
61
+ ```
62
+
63
+ If `--codec-model` is omitted, the runtime auto-discovers `mimo-tokenizer-q4_k.gguf` (or `mimo-tokenizer.gguf`, `mimo-audio-tokenizer.gguf`) next to the LM file.
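
The lookup described above can be illustrated with a short Python sketch. This is not the CrispASR implementation (the runtime is C++). The function name `find_codec_model` is hypothetical, and treating the order in which the filenames are listed as the search priority is an assumption.

```python
from pathlib import Path
from typing import Optional

# Candidate tokenizer filenames, in the order the README lists them.
# Using that order as the search priority is an assumption.
CANDIDATES = (
    "mimo-tokenizer-q4_k.gguf",
    "mimo-tokenizer.gguf",
    "mimo-audio-tokenizer.gguf",
)


def find_codec_model(lm_path: str, codec_model: Optional[str] = None) -> Optional[Path]:
    """Return the explicit path if given, else the first candidate next to the LM file."""
    if codec_model is not None:
        return Path(codec_model)
    lm_dir = Path(lm_path).resolve().parent
    for name in CANDIDATES:
        candidate = lm_dir / name
        if candidate.is_file():
            return candidate
    return None  # nothing found; the caller has to report the missing tokenizer
```

Calling `find_codec_model("models/mimo-asr-q4_k.gguf")` would return the tokenizer path in `models/` if one of the candidate files exists there, and `None` otherwise.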
64
+
65
+ ### Expected output (JFK sample)
66
+
67
  ```
68
+ And so, my fellow Americans, ask not what your country can do for you. Ask what you can do for your country.
69
+ ```
70
+
71
+ This matches the upstream Python `MimoAudio.asr_sft` reference verbatim.
72
+
73
+ ### Performance
74
+
75
+ On Apple M1, Metal backend, Q4_K:
76
+
77
+ | Phase | Time |
78
+ |---|---|
79
+ | LM load (mmap, lazy) | ~1 s |
80
+ | Audio tokenize (11 s sample) | ~0.5 s |
81
+ | Prefill (T_groups=71) | ~3 s |
82
+ | Step decode (~25 tokens) | ~30 s (≈1.2 s/token) |
83
+ | **End-to-end** | **~37 s for 11 s audio (≈0.3× real-time speed, RTF ≈ 3.4)** |
84
+
85
+ The step decode dominates; per-step Q4_K dequantization is the bottleneck. F16 is faster per step (no dequantization) but takes 16 GB on disk and must fit fully in RAM during compute. Possible future performance wins: KV cache reuse across step graphs, block-batched generation, and F16 with mmap-loaded weights.
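
To make the table's arithmetic explicit, the following sketch derives the real-time factor, the per-token decode cost, and the decode share from the approximate timings above. The variable names are illustrative, and every input is a rounded figure from the table, so the results are estimates only.

```python
# Approximate phase timings copied from the performance table (seconds).
phases = {
    "lm_load": 1.0,
    "audio_tokenize": 0.5,
    "prefill": 3.0,
    "step_decode": 30.0,
}
audio_seconds = 11.0
end_to_end_seconds = 37.0  # reported total; includes overhead not itemized in the table
decoded_tokens = 25        # approximate token count from the table

itemized = sum(phases.values())
rtf = end_to_end_seconds / audio_seconds      # > 1 means slower than real time
speed = audio_seconds / end_to_end_seconds    # the "x realtime" figure
seconds_per_token = phases["step_decode"] / decoded_tokens
decode_share = phases["step_decode"] / itemized

print(f"itemized phases: {itemized:.1f} s")
print(f"real-time factor: {rtf:.2f} (speed {speed:.2f}x realtime)")
print(f"decode: {seconds_per_token:.2f} s/token, {decode_share:.0%} of itemized time")
```

The gap between the itemized sum and the reported end-to-end time is presumably setup and teardown overhead that the table does not list.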
86
+
87
+ ## Validation
88
+
89
+ Stage-by-stage cosine similarity against the bf16 PyTorch reference on the JFK sample, using Q4_K weights:
90
+
91
+ | Stage | cos_mean | cos_min |
92
+ |---|---|---|
93
+ | `prefill_audio_features` | 0.998 | 0.992 |
94
+ | `prefill_text_embeds` | 0.996 | 0.901 |
95
+ | `prefill_inputs_embeds` | 0.998 | 0.901 |
96
+ | `prefill_last_hidden` | 0.963 | 0.963 |
97
+ | `prefill_text_logits_step0` | 0.981 | 0.981 |
98
+
99
+ The argmax of the step-0 logits is token 1597 (`' And'`), matching the Python reference. The strict cos ≥ 0.999 gate is tracked for the F16 model against an fp32 reference, but that run requires more than 28 GB of RAM; in practice, the similarity ceiling of Q4_K against the bf16 reference reflects quantisation noise, not bugs.
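
The README does not define `cos_mean` and `cos_min`. A plausible reading, assumed here, is the mean and minimum of the per-row (per-position) cosine similarity between a stage's output and the reference. Under that reading, single-row stages such as `prefill_last_hidden` have equal mean and minimum, which is consistent with the table. The helper names below are hypothetical.

```python
import math
from typing import Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def stage_cosine_stats(
    test_rows: Sequence[Sequence[float]],
    ref_rows: Sequence[Sequence[float]],
) -> Tuple[float, float]:
    """Return (cos_mean, cos_min) over matching rows of one stage's output."""
    if len(test_rows) != len(ref_rows) or not ref_rows:
        raise ValueError("stage outputs must be non-empty and have matching row counts")
    sims = [cosine(t, r) for t, r in zip(test_rows, ref_rows)]
    return sum(sims) / len(sims), min(sims)


# Toy data: the first row matches exactly, the second is 45 degrees off.
cos_mean, cos_min = stage_cosine_stats([[1.0, 0.0], [1.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
print(f"cos_mean={cos_mean:.3f} cos_min={cos_min:.3f}")
```

Reporting the minimum alongside the mean is useful because a single badly mismatched position can hide behind a high average.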
100
+
101
+ ## Conversion (reproducibility)
102
+
103
+ ```bash
104
+ # Set OMP_NUM_THREADS=1 to avoid a torch+OpenMP deadlock during bf16→f16
105
+ OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 MKL_NUM_THREADS=1 PYTHONUNBUFFERED=1 \
106
+ python models/convert-mimo-asr-to-gguf.py \
107
+ --input XiaomiMiMo/MiMo-V2.5-ASR \
108
+ --output mimo-asr-f16.gguf \
109
+ --outtype f16
110
+
111
+ build-ninja-compile/bin/crispasr-quantize \
112
+ mimo-asr-f16.gguf mimo-asr-q4_k.gguf q4_k
113
+ ```
114
+
115
+ The vocabulary is padded to 151680 entries (151643 BPE + 30 special + 7 unused slots) and `tokenizer.ggml.merges` is populated (151291 entries). Earlier truncated-vocab GGUFs in this repo predate CrispASR commit `2191a70` and should not be used.
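
As a sanity check on the vocabulary figures, the sketch below verifies that the three counts add up to the padded size and shows one way a token list could be padded to a fixed length. The `pad_vocab` helper and the `<unused_N>` placeholder format are hypothetical; the actual conversion script may name the padding slots differently.

```python
from typing import List, Sequence

# Counts quoted in the README.
BPE_TOKENS = 151_643
SPECIAL_TOKENS = 30
UNUSED_SLOTS = 7
PADDED_VOCAB = 151_680


def pad_vocab(tokens: Sequence[str], target_size: int, fmt: str = "<unused_{}>") -> List[str]:
    """Append placeholder tokens until the vocabulary reaches target_size."""
    if len(tokens) > target_size:
        raise ValueError("vocabulary is already larger than the target size")
    padding = [fmt.format(i) for i in range(target_size - len(tokens))]
    return list(tokens) + padding


# The three components account for every row of the untied LM head.
assert BPE_TOKENS + SPECIAL_TOKENS + UNUSED_SLOTS == PADDED_VOCAB
```

Padding matters because the LM head has one output row per vocabulary entry, so a truncated token list leaves some logit indices without a token to map to.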
116
+
117
+ ## Citation
118
+
119
+ ```bibtex
120
+ @misc{mimo2025v25asr,
121
+ title = {MiMo-V2.5-ASR},
122
+ author = {Xiaomi MiMo Team},
123
+ year = {2025},
124
+ publisher = {Hugging Face},
125
+ url = {https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR}
126
+ }
127
+ ```
128
+
129
+ ## License
130
+
131
+ MIT — same as upstream.