---
license: mit
library_name: transformers
language:
- zh
- en
- yue
pipeline_tag: automatic-speech-recognition
tags:
- safetensors
- text-generation-inference
---
<div align="center">
<picture>
<source srcset="https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR/resolve/main/assets/XiaomiMIMO.png" media="(prefers-color-scheme: dark)">
<img src="https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR/resolve/main/assets/XiaomiMIMO.png" width="60%" alt="Xiaomi-MiMo" />
</picture>
</div>
<div align="center">
<h3>
<b>
<span>━━━━━━━━━━━━━━━━━━━━━━━━━━━</span><br/>
MiMo-V2.5-ASR: Robust Speech Recognition Across<br/>
Languages, Dialects, and Complex Acoustic Scenarios<br/>
<span>━━━━━━━━━━━━━━━━━━━━━━━━━━━</span>
</b>
</h3>
</div>
<br/>
<div align="center" style="line-height: 1;">
|
<a href="https://github.com/XiaomiMiMo/MiMo-V2.5-ASR" target="_blank">💻 GitHub</a>
|
<a href="https://huggingface.co/spaces/XiaomiMiMo/MiMo-V2.5-ASR" target="_blank">🚀 Online Demo</a>
|
<a href="https://mimo.xiaomi.com/mimo-v2-5-asr" target="_blank">📰 Blog</a>
|
<br/>
</div>
<br/>
## Introduction
**MiMo-V2.5-ASR** is a state-of-the-art end-to-end automatic speech recognition (ASR) model developed by the Xiaomi MiMo team. It is built to deliver accurate and robust transcription across Mandarin Chinese and English, multiple Chinese dialects, code-switched speech, song lyrics, knowledge-intensive content, noisy acoustic environments, and multi-speaker conversations, and it achieves leading results on a wide range of public benchmarks.
## Abstract
Automatic speech recognition systems are expected to faithfully transcribe speech signals that originate from diverse languages, dialects, accents, and domains, and that are captured under a wide variety of acoustic conditions. While conventional end-to-end models perform well on in-domain data, they still fall short of real-world requirements in challenging scenarios such as dialect mixing, code-switching, knowledge-intensive content, noisy environments, and multi-speaker conversations. We present **MiMo-V2.5-ASR**, a large-scale end-to-end speech recognition model developed by the Xiaomi MiMo team. Through large-scale mid-training, high-quality supervised fine-tuning, and a novel reinforcement-learning algorithm, MiMo-V2.5-ASR achieves systematic improvements along the following dimensions:
- 🗣️ **Chinese Dialects**: Native support for Wu, Cantonese, Hokkien, Sichuanese, and more.
- 🔀 **Code-Switching**: Seamless Chinese–English code-switching transcription with no language tags required.
- 🎵 **Song Recognition**: High-precision lyrics transcription for Chinese and English songs, even with mixed accompaniment and vocals.
- 🔊 **Noisy Environments**: Robust recognition under heavy noise, far-field capture, and other adverse acoustic conditions.
- 👥 **Multi-Speaker**: Accurate transcription of overlapping, multi-party conversations such as meetings.
- 🇬🇧 **Complex English Scenarios**: Leading performance on the [Open ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard) for challenging English benchmarks such as AMI.
- 📚 **Knowledge-Intensive Recognition**: Precise recognition of classical poetry, technical terminology, personal names, place names, and other knowledge-dense material.
- 📝 **Native Punctuation**: Punctuation generated natively from prosody and semantics, delivering ready-to-use transcripts with no post-processing needed.
## Results
MiMo-V2.5-ASR has been evaluated across a broad set of benchmarks spanning standard Mandarin and English, Chinese dialects, lyric recognition, and internal business scenarios. The chart below summarizes the average performance of MiMo-V2.5-ASR across these scenarios.

For per-benchmark numbers and specific qualitative cases, please refer to our [blog](https://mimo.xiaomi.com/mimo-v2-5-asr).
## Model Download
| Models | 🤗 Hugging Face | 🤖️ ModelScope |
|-------|-------|-------|
| MiMo-Audio-Tokenizer | [XiaomiMiMo/MiMo-Audio-Tokenizer](https://huggingface.co/XiaomiMiMo/MiMo-Audio-Tokenizer) | [XiaomiMiMo/MiMo-Audio-Tokenizer](https://modelscope.cn/models/XiaomiMiMo/MiMo-Audio-Tokenizer)|
| MiMo-V2.5-ASR | [XiaomiMiMo/MiMo-V2.5-ASR](https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR) | [XiaomiMiMo/MiMo-V2.5-ASR](https://modelscope.cn/models/XiaomiMiMo/MiMo-V2.5-ASR) |
```bash
pip install huggingface-hub
hf download XiaomiMiMo/MiMo-Audio-Tokenizer --local-dir ./models/MiMo-Audio-Tokenizer
hf download XiaomiMiMo/MiMo-V2.5-ASR --local-dir ./models/MiMo-V2.5-ASR
```
## Getting Started
Spin up the MiMo-V2.5-ASR demo in minutes with the built-in Gradio app.
### Prerequisites (Linux)
* Python 3.12
* CUDA >= 12.0
### Installation
```bash
git clone https://github.com/XiaomiMiMo/MiMo-V2.5-ASR.git
cd MiMo-V2.5-ASR
pip install -r requirements.txt
pip install flash-attn==2.7.4.post1
```
> [!NOTE]
> If the compilation of flash-attn takes too long, you can download the precompiled wheel and install it manually:
>
> * [Download Precompiled Wheel](https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl)
>
> ```sh
> pip install /path/to/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
> ```
### Run the Demo
```bash
python run_mimo_asr.py
```
This launches a local Gradio interface for MiMo-V2.5-ASR. You can:
* Upload an audio file **or** record directly from your microphone.
* Optionally specify a **language tag** (Chinese / English / Auto) to bias the model toward a specific language, or leave it set to **Auto** for automatic language detection (recommended for code-switched speech).
* The demo calls the `asr_sft()` interface under the hood.
The interface provides a **Model Configuration** tab for setting local model and tokenizer paths, and a **Speech Recognition** tab where you drop in audio, pick a language tag, and hit *Transcribe*; the decoded text and processing status stream into the panels on the right.
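For scripting the same behavior outside the UI, the demo's three language choices correspond to the tags accepted by `asr_sft()` (the tag strings are the ones shown in the Python API section). The mapping below is an illustrative sketch, not the demo's actual source:

```python
# Illustrative mapping of the demo's language choices to asr_sft audio tags.
# None means no tag is passed, i.e. automatic language detection.
LANGUAGE_TAGS = {
    "chinese": "<chinese>",
    "english": "<english>",
    "auto": None,
}

def audio_tag_for(choice: str):
    """Translate a UI language choice into an audio_tag value (None = auto-detect)."""
    key = choice.strip().lower()
    if key not in LANGUAGE_TAGS:
        raise ValueError(f"unknown language choice: {choice!r}")
    return LANGUAGE_TAGS[key]
```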
<p align="center">
<img src="https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR/resolve/main/assets/MiMo_ASR_Demo.png" alt="MiMo-V2.5-ASR Gradio Demo" width="90%" />
<br/>
<em>Figure: Gradio demo for MiMo-V2.5-ASR: upload an audio clip or record from your microphone, choose a language tag, and get the transcription on the right.</em>
</p>
To load the model and tokenizer automatically at startup, pass their paths on the command line:
```bash
python run_mimo_asr.py \
--model-path ./models/MiMo-V2.5-ASR \
--tokenizer-path ./models/MiMo-Audio-Tokenizer
```
Otherwise, enter the local paths for `MiMo-Audio-Tokenizer` and `MiMo-V2.5-ASR` in the **Model Configuration** tab, then start transcribing!
## Python API
Basic usage with the `asr_sft` interface:
```python
from src.mimo_audio.mimo_audio import MimoAudio
model = MimoAudio(
model_path="./models/MiMo-V2.5-ASR",
tokenizer_path="./models/MiMo-Audio-Tokenizer",
)
# Automatic language detection (recommended for code-switching)
text = model.asr_sft("path/to/audio.wav")
print(text)
# With explicit language tag
text_zh = model.asr_sft("path/to/audio.wav", audio_tag="<chinese>")
text_en = model.asr_sft("path/to/audio.wav", audio_tag="<english>")
```
## Citation
```bibtex
@misc{coreteam2026mimov25asr,
title={MiMo-V2.5-ASR: Robust Speech Recognition Across Languages, Dialects, and Complex Acoustic Scenarios},
author={LLM-Core-Team Xiaomi},
year={2026},
url={https://github.com/XiaomiMiMo/MiMo-V2.5-ASR},
}
```
## Contact
Please contact us at [mimo@xiaomi.com](mailto:mimo@xiaomi.com) or open an issue if you have any questions.