---
license: mit
library_name: transformers
language:
- zh
- en
- yue
pipeline_tag: automatic-speech-recognition
tags:
- safetensors
- text-generation-inference
---

<div align="center">
  <picture>
    <source srcset="https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR/resolve/main/assets/XiaomiMIMO.png" media="(prefers-color-scheme: dark)">
    <img src="https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR/resolve/main/assets/XiaomiMIMO.png" width="60%" alt="Xiaomi-MiMo" />
  </picture>
</div>

<div align="center">
  <h3>
    <b>
      <span>━━━━━━━━━━━━━━━━━━━━━━━━━━━</span><br/>
      MiMo-V2.5-ASR: Robust Speech Recognition Across<br/>
      Languages, Dialects, and Complex Acoustic Scenarios<br/>
      <span>━━━━━━━━━━━━━━━━━━━━━━━━━━━</span>
    </b>
  </h3>
</div>

<br/>

<div align="center" style="line-height: 1;">
  |
  <a href="https://github.com/XiaomiMiMo/MiMo-V2.5-ASR" target="_blank">💻 GitHub</a>
  &nbsp;|
  <a href="https://huggingface.co/spaces/XiaomiMiMo/MiMo-V2.5-ASR" target="_blank">🚀 Online Demo</a>
  &nbsp;|
  <a href="https://mimo.xiaomi.com/mimo-v2-5-asr" target="_blank">📰 Blog</a>
  &nbsp;|

  <br/>
</div>

<br/>

## Introduction

**MiMo-V2.5-ASR** is an end-to-end automatic speech recognition (ASR) model developed by the Xiaomi MiMo team. It is built to deliver accurate and robust transcription across Mandarin Chinese and English, multiple Chinese dialects, code-switched speech, song lyrics, knowledge-intensive content, noisy acoustic environments, and multi-speaker conversations, and it achieves state-of-the-art results on a wide range of public benchmarks.

## Abstract

Automatic speech recognition systems are expected to faithfully transcribe speech signals that originate from diverse languages, dialects, accents, and domains, and that are captured under a wide variety of acoustic conditions. While conventional end-to-end models perform well on in-domain data, they still fall short of real-world requirements in challenging scenarios such as dialect mixing, code-switching, knowledge-intensive content, noisy environments, and multi-speaker conversations. We present **MiMo-V2.5-ASR**, a large-scale end-to-end speech recognition model developed by the Xiaomi MiMo team. Through large-scale mid-training, high-quality supervised fine-tuning, and a novel reinforcement-learning algorithm, MiMo-V2.5-ASR achieves systematic improvements along the following dimensions:

- 🗣️ **Chinese Dialects**: Native support for Wu, Cantonese, Hokkien, Sichuanese, and more.
- 🔀 **Code-Switching**: Seamless Chinese–English code-switching transcription with no language tags required.
- 🎵 **Song Recognition**: High-precision lyrics transcription for Chinese and English songs, even with mixed accompaniment and vocals.
- 🔊 **Noisy Environments**: Robust recognition under heavy noise, far-field capture, and other adverse acoustic conditions.
- 👥 **Multi-Speaker**: Accurate transcription of overlapping, multi-party conversations such as meetings.
- 🇬🇧 **Complex English Scenarios**: Leading performance on the [Open ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard) for challenging English benchmarks such as AMI.
- 📚 **Knowledge-Intensive Recognition**: Precise recognition of classical poetry, technical terminology, personal names, place names, and other knowledge-dense material.
- 📝 **Native Punctuation**: Punctuation generated natively from prosody and semantics, delivering ready-to-use transcripts with no post-processing needed.

## Results

MiMo-V2.5-ASR has been evaluated across a broad set of benchmarks spanning standard Mandarin and English, Chinese dialects, lyric recognition, and internal business scenarios. The chart below summarizes the average performance of MiMo-V2.5-ASR across these scenarios.

![ASR Results](https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR/resolve/main/assets/MiMo_ASR_Results.png)

For per-benchmark numbers and specific qualitative cases, please refer to our [blog](https://mimo.xiaomi.com/mimo-v2-5-asr).

## Model Download

| Models | 🤗 Hugging Face | 🤖 ModelScope |
|-------|-------|-------|
| MiMo-Audio-Tokenizer | [XiaomiMiMo/MiMo-Audio-Tokenizer](https://huggingface.co/XiaomiMiMo/MiMo-Audio-Tokenizer) | [XiaomiMiMo/MiMo-Audio-Tokenizer](https://modelscope.cn/models/XiaomiMiMo/MiMo-Audio-Tokenizer)|
| MiMo-V2.5-ASR | [XiaomiMiMo/MiMo-V2.5-ASR](https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR) | [XiaomiMiMo/MiMo-V2.5-ASR](https://modelscope.cn/models/XiaomiMiMo/MiMo-V2.5-ASR) |

```bash
pip install huggingface-hub

hf download XiaomiMiMo/MiMo-Audio-Tokenizer --local-dir ./models/MiMo-Audio-Tokenizer
hf download XiaomiMiMo/MiMo-V2.5-ASR --local-dir ./models/MiMo-V2.5-ASR
```
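If you prefer to stay in Python, the same checkpoints can also be fetched with the `huggingface_hub` library's `snapshot_download`. A minimal sketch; the `download_models` helper is illustrative and not part of this repo:

```python
# Illustrative helper (not part of this repo): fetch both checkpoints via the
# huggingface_hub Python API instead of the `hf` CLI.
MODEL_REPOS = {
    "MiMo-Audio-Tokenizer": "XiaomiMiMo/MiMo-Audio-Tokenizer",
    "MiMo-V2.5-ASR": "XiaomiMiMo/MiMo-V2.5-ASR",
}

def download_models(base_dir: str = "./models") -> dict:
    """Download every repo in MODEL_REPOS and return a name -> local-path map."""
    from huggingface_hub import snapshot_download  # pip install huggingface-hub
    return {
        name: snapshot_download(repo_id=repo_id, local_dir=f"{base_dir}/{name}")
        for name, repo_id in MODEL_REPOS.items()
    }
```

Calling `download_models()` places both models under `./models/`, matching the paths used by the CLI commands above.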

## Getting Started

Spin up the MiMo-V2.5-ASR demo in minutes with the built-in Gradio app.

### Prerequisites (Linux)

* Python 3.12
* CUDA >= 12.0

### Installation

```bash
git clone https://github.com/XiaomiMiMo/MiMo-V2.5-ASR.git
cd MiMo-V2.5-ASR
pip install -r requirements.txt
pip install flash-attn==2.7.4.post1
```

> [!NOTE]
> If the compilation of flash-attn takes too long, you can download the precompiled wheel and install it manually:
>
> * [Download Precompiled Wheel](https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl)
>
> ```bash
> pip install /path/to/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
> ```

### Run the Demo

```bash
python run_mimo_asr.py
```

This launches a local Gradio interface for MiMo-V2.5-ASR. You can:

* Upload an audio file **or** record directly from your microphone.
* Optionally specify a **language tag** (Chinese / English / Auto) to bias the model toward a specific language, or leave it set to **Auto** for automatic language detection (recommended for code-switched speech).
* The demo calls the `asr_sft()` interface under the hood.

The interface provides a **Model Configuration** tab for setting local model and tokenizer paths, and a **Speech Recognition** tab where you drop in audio, pick a language tag, and hit *Transcribe*; the decoded text and processing status stream into the panels on the right.

<p align="center">
  <img src="https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR/resolve/main/assets/MiMo_ASR_Demo.png" alt="MiMo-V2.5-ASR Gradio Demo" width="90%" />
  <br/>
  <em>Figure: Gradio demo for MiMo-V2.5-ASR: upload an audio clip or record from your microphone, choose a language tag, and get the transcription on the right.</em>
</p>

To load the model and tokenizer automatically at startup, pass their paths on the command line:

```bash
python run_mimo_asr.py \
    --model-path ./models/MiMo-V2.5-ASR \
    --tokenizer-path ./models/MiMo-Audio-Tokenizer
```

Otherwise, enter the local paths for `MiMo-Audio-Tokenizer` and `MiMo-V2.5-ASR` in the **Model Configuration** tab, then start transcribing!

## Python API

Basic usage with the `asr_sft` interface:

```python
from src.mimo_audio.mimo_audio import MimoAudio

model = MimoAudio(
    model_path="./models/MiMo-V2.5-ASR",
    tokenizer_path="./models/MiMo-Audio-Tokenizer",
)

# Automatic language detection (recommended for code-switching)
text = model.asr_sft("path/to/audio.wav")
print(text)

# With explicit language tag
text_zh = model.asr_sft("path/to/audio.wav", audio_tag="<chinese>")
text_en = model.asr_sft("path/to/audio.wav", audio_tag="<english>")
```

## Citation

```bibtex
@misc{coreteam2026mimov25asr,
      title={MiMo-V2.5-ASR: Robust Speech Recognition Across Languages, Dialects, and Complex Acoustic Scenarios},
      author={LLM-Core-Team Xiaomi},
      year={2026},
      url={https://github.com/XiaomiMiMo/MiMo-V2.5-ASR},
}
```

## Contact

Please contact us at [mimo@xiaomi.com](mailto:mimo@xiaomi.com) or open an issue if you have any questions.