MarkDaniel212 committed on
Commit
28133b9
·
verified ·
1 Parent(s): f101323

Add model card README

Files changed (1)
  1. README.md +173 -3
README.md CHANGED
@@ -1,3 +1,173 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ license: mit
3
+ library_name: transformers
4
+ ---
5
+
6
+ <div align="center">
7
+ <picture>
8
+ <source srcset="https://github.com/XiaomiMiMo/MiMo-VL/raw/main/figures/Xiaomi_MiMo_darkmode.png?raw=true" media="(prefers-color-scheme: dark)">
9
+ <img src="https://github.com/XiaomiMiMo/MiMo-VL/raw/main/figures/Xiaomi_MiMo.png?raw=true" width="60%" alt="Xiaomi-MiMo" />
10
+ </picture>
11
+ </div>
12
+
13
+ <div align="center">
14
+ <h3>
15
+ <b>
16
+ <span>━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━</span><br/>
17
+ MiMo-V2.5-ASR: Robust Speech Recognition Across<br/>
18
+ Languages, Dialects, and Complex Acoustic Scenarios<br/>
19
+ <span>━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━</span>
20
+ </b>
21
+ </h3>
22
+ </div>
23
+
24
+ <br/>
25
+
26
+ <div align="center" style="line-height: 1;">
27
+ |
28
+ <a href="https://huggingface.co/collections/XiaomiMiMo/mimo-v2.5-asr" target="_blank">🤗 HuggingFace</a>
29
+ &nbsp;|
30
+ <a href="https://github.com/XiaomiMiMo/MiMo-V2.5-ASR" target="_blank">💻 GitHub</a>
31
+ &nbsp;|
32
+ <a href="https://xiaomimimo.github.io/MiMo-V2.5-ASR-Demo" target="_blank">📰 Blog</a>
33
+ &nbsp;|
34
+
35
+ <br/>
36
+ </div>
37
+
38
+ <br/>
39
+
40
+ ## Introduction
41
+
42
+ **MiMo-V2.5-ASR** is an end-to-end automatic speech recognition (ASR) model developed by the Xiaomi MiMo team. It is built to deliver accurate and robust transcription across Mandarin Chinese and English, multiple Chinese dialects, code-switched speech, song lyrics, knowledge-intensive content, noisy acoustic environments, and multi-speaker conversations. It achieves state-of-the-art results on a wide range of public benchmarks.
43
+
44
+ ## Abstract
45
+
46
+ Automatic speech recognition systems are expected to faithfully transcribe speech signals that originate from diverse languages, dialects, accents, and domains, and that are captured under a wide variety of acoustic conditions. While conventional end-to-end models perform well on in-domain data, they still fall short of real-world requirements in challenging scenarios such as dialect mixing, code-switching, knowledge-intensive content, noisy environments, and multi-speaker conversations. We present **MiMo-V2.5-ASR**, a large-scale end-to-end speech recognition model developed by the Xiaomi MiMo team. Through large-scale mid-training, high-quality supervised fine-tuning, and a novel reinforcement-learning algorithm, MiMo-V2.5-ASR achieves systematic improvements along the following dimensions:
47
+
48
+ - πŸ—£οΈ **Chinese Dialects**: Native support for Wu, Min-nan, Cantonese, Sichuanese, and other major Chinese dialects.
49
+ - 🔀 **Code-Switch**: Fluent transcription of Chinese–English code-switched speech without any language tag prompting.
50
+ - 🎵 **Song Lyrics**: Accurate lyric transcription for both Chinese and English songs, even when vocals are mixed with accompaniment.
51
+ - 🔊 **Noisy Conditions**: Robust recognition in high-noise and far-field environments.
52
+ - 👥 **Multi-Speaker**: Accurate transcription of overlapping and cross-talk conversations, such as meeting scenarios.
53
+ - 🇬🇧 **Complex English Scenarios**: Leading performance among non-English-only models on English multi-speaker meeting benchmarks such as AMI.
54
+ - 📚 **Knowledge-Intensive Recognition**: Precise recognition of classical poetry, technical terminology, and named entities (people, places, organizations).
55
+
56
+ ## Results
57
+
58
+ MiMo-V2.5-ASR has been evaluated across a broad set of benchmarks spanning standard Mandarin and English, Chinese dialects, singing, code-switching, noisy conditions, and multi-speaker scenarios. Highlights of our results are shown below.
59
+
60
+ ### Standard Mandarin Chinese
61
+
62
+ ![Results - Standard Chinese](https://github.com/XiaomiMiMo/MiMo-V2.5-ASR/raw/main/assets/results_chinese.png)
63
+
64
+ ### Standard English
65
+
66
+ ![Results - Standard English](https://github.com/XiaomiMiMo/MiMo-V2.5-ASR/raw/main/assets/results_english.png)
67
+
68
+ ### Chinese Dialects
69
+
70
+ ![Results - Chinese Dialects](https://github.com/XiaomiMiMo/MiMo-V2.5-ASR/raw/main/assets/results_dialect.png)
71
+
72
+ ### Singing & Code-Switch
73
+
74
+ ![Results - Singing and Code-Switch](https://github.com/XiaomiMiMo/MiMo-V2.5-ASR/raw/main/assets/results_singing_codeswitch.png)
75
+
76
+ ## Model Download
77
+
78
+ | Models | 🤗 Hugging Face |
79
+ |-------|-------|
80
+ | MiMo-Audio-Tokenizer | [XiaomiMiMo/MiMo-Audio-Tokenizer](https://huggingface.co/XiaomiMiMo/MiMo-Audio-Tokenizer) |
81
+ | MiMo-V2.5-ASR | [XiaomiMiMo/MiMo-V2.5-ASR](https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR) |
82
+
83
+ ```bash
84
+ pip install -U huggingface-hub
85
+
86
+ hf download XiaomiMiMo/MiMo-Audio-Tokenizer --local-dir ./models/MiMo-Audio-Tokenizer
87
+ hf download XiaomiMiMo/MiMo-V2.5-ASR --local-dir ./models/MiMo-V2.5-ASR
88
+ ```
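The same two downloads can also be scripted from Python. The sketch below is an illustration rather than part of the repository: it uses `snapshot_download` from `huggingface_hub`, and the helper names `local_dir_for` and `download_all` are hypothetical. The repository IDs and the `./models/<name>` layout are taken from the table and commands above.

```python
from pathlib import Path

# Repository IDs from the model table; the local layout mirrors the CLI commands.
REPOS = ("XiaomiMiMo/MiMo-Audio-Tokenizer", "XiaomiMiMo/MiMo-V2.5-ASR")


def local_dir_for(repo_id: str, root: str = "./models") -> Path:
    """Map 'Org/Name' to '<root>/Name', matching the --local-dir values used with the CLI."""
    return Path(root) / repo_id.split("/", 1)[1]


def download_all(root: str = "./models") -> None:
    """Fetch both repositories; needs network access and the huggingface_hub package."""
    # Imported lazily so local_dir_for() stays usable without the dependency installed.
    from huggingface_hub import snapshot_download

    for repo_id in REPOS:
        snapshot_download(repo_id=repo_id, local_dir=str(local_dir_for(repo_id, root)))


# download_all()  # uncomment to fetch both repositories into ./models
```

Keeping the path mapping in its own function makes it easy to point the demo or the Python API at the same directories later.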
89
+
90
+ ## Getting Started
91
+
92
+ Spin up the MiMo-V2.5-ASR demo in minutes with the built-in Gradio app.
93
+
94
+ ### Prerequisites (Linux)
95
+
96
+ * Python 3.12
97
+ * CUDA >= 12.0
98
+
99
+ ### Installation
100
+
101
+ ```bash
102
+ git clone https://github.com/XiaomiMiMo/MiMo-V2.5-ASR.git
103
+ cd MiMo-V2.5-ASR
104
+ pip install -r requirements.txt
105
+ pip install flash-attn==2.7.4.post1
106
+ ```
107
+
108
+ > [!NOTE]
109
+ > If the compilation of flash-attn takes too long, you can download the precompiled wheel and install it manually:
110
+ >
111
+ > * [Download Precompiled Wheel](https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl)
112
+ >
113
+ > ```sh
114
+ > pip install /path/to/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
115
+ > ```
116
+
117
+ ### Run the Demo
118
+
119
+ ```bash
120
+ python run_mimo_asr.py
121
+ ```
122
+
123
+ This launches a local Gradio interface for MiMo-V2.5-ASR. You can:
124
+
125
+ * Upload an audio file **or** record directly from your microphone.
126
+ * Optionally specify a **language tag** (Chinese / English / Auto) to bias the model toward a specific language, or leave it on **Auto** for automatic language detection (recommended for code-switched speech).
127
+ * The demo calls the `asr_sft()` interface under the hood.
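One way to picture how the language option could translate into an `asr_sft()` call is sketched below. This is an assumption about the mapping, not the demo's actual code: `tag_kwargs` is a hypothetical helper, while the `audio_tag` parameter and the `<chinese>` / `<english>` tag strings are the ones used in the Python API section of this README.

```python
from typing import Dict

# Tag strings as they appear in the README's Python API example.
_TAGS = {"chinese": "<chinese>", "english": "<english>"}


def tag_kwargs(language: str = "Auto") -> Dict[str, str]:
    """Turn a UI choice (Chinese / English / Auto) into keyword arguments for asr_sft()."""
    key = language.strip().lower()
    if key == "auto":
        # No tag: the model detects the language itself.
        return {}
    if key not in _TAGS:
        raise ValueError(f"unsupported language option: {language!r}")
    return {"audio_tag": _TAGS[key]}


# Hypothetical usage with a loaded model:
#   text = model.asr_sft("path/to/audio.wav", **tag_kwargs("Chinese"))
```

Returning an empty dict for Auto keeps the call identical to the untagged form `model.asr_sft(path)`.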
128
+
129
+ To load the model and tokenizer automatically at startup, pass their paths on the command line:
130
+
131
+ ```bash
132
+ python run_mimo_asr.py \
133
+     --model-path ./models/MiMo-V2.5-ASR \
134
+     --tokenizer-path ./models/MiMo-Audio-Tokenizer
135
+ ```
136
+
137
+ Otherwise, enter the local paths for `MiMo-Audio-Tokenizer` and `MiMo-V2.5-ASR` in the **Model Configuration** tab, then start transcribing!
138
+
139
+ ## Python API
140
+
141
+ Basic usage with the `asr_sft` interface:
142
+
143
+ ```python
144
+ from src.mimo_audio.mimo_audio import MimoAudio
145
+
146
+ model = MimoAudio(
147
+     model_path="./models/MiMo-V2.5-ASR",
148
+     tokenizer_path="./models/MiMo-Audio-Tokenizer",
149
+ )
150
+
151
+ # Automatic language detection (recommended for code-switching)
152
+ text = model.asr_sft("path/to/audio.wav")
153
+ print(text)
154
+
155
+ # With explicit language tag
156
+ text_zh = model.asr_sft("path/to/audio.wav", audio_tag="<chinese>")
157
+ text_en = model.asr_sft("path/to/audio.wav", audio_tag="<english>")
158
+ ```
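For more than one file, a thin loop around `asr_sft` is usually enough. The sketch below is illustrative only: `transcribe_many` and `fake_asr` are hypothetical names, and the stub stands in for `model.asr_sft` so the example runs without the model weights. It assumes `asr_sft` takes a file path and returns the transcript as a string, as in the example above.

```python
from typing import Callable, Dict, Iterable


def transcribe_many(transcribe: Callable[[str], str], paths: Iterable[str]) -> Dict[str, str]:
    """Run `transcribe` on each path and keep going if one file fails."""
    results: Dict[str, str] = {}
    for path in paths:
        try:
            results[path] = transcribe(path)
        except Exception as exc:  # record the failure instead of aborting the batch
            results[path] = f"<error: {exc}>"
    return results


# Stand-in for model.asr_sft so the sketch runs without the model weights.
def fake_asr(path: str) -> str:
    if path.endswith(".txt"):
        raise ValueError("not an audio file")
    return f"transcript of {path}"


print(transcribe_many(fake_asr, ["a.wav", "notes.txt"]))
```

With a loaded model, the call would be `transcribe_many(model.asr_sft, paths)`.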
159
+
160
+ ## Citation
161
+
162
+ ```bibtex
163
+ @misc{coreteam2026mimov25asr,
164
+   title={MiMo-V2.5-ASR: Robust Speech Recognition Across Languages, Dialects, and Complex Acoustic Scenarios},
165
+   author={LLM-Core-Team Xiaomi},
166
+   year={2026},
167
+   url={https://github.com/XiaomiMiMo/MiMo-V2.5-ASR},
168
+ }
169
+ ```
170
+
171
+ ## Contact
172
+
173
+ Please contact us at [mimo@xiaomi.com](mailto:mimo@xiaomi.com) or open an issue if you have any questions.