| --- |
| license: apache-2.0 |
| library_name: mlx |
| pipeline_tag: automatic-speech-recognition |
| tags: |
| - mlx |
| - asr |
| - automatic-speech-recognition |
| - speech-recognition |
| - mimo |
| base_model: |
| - XiaomiMiMo/MiMo-V2.5-ASR |
| language: |
| - zh |
| - en |
| --- |
| |
| Current variant: `4bit` (default entry) |
|
|
| <div align="center"> |
| <img src="https://raw.githubusercontent.com/XiaomiMiMo/MiMo-V2.5-ASR/main/assets/XiaomiMIMO.png" width="60%" alt="Xiaomi-MiMo" /> |
| </div> |
|
|
| <div align="center"> |
| <h3> |
| <b> |
| <span>━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━</span><br/> |
| MiMo-V2.5-ASR: Robust Speech Recognition Across<br/> |
| Languages, Dialects, and Complex Acoustic Scenarios<br/> |
| <span>━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━</span> |
| </b> |
| </h3> |
| </div> |
| |
| <br/> |
|
|
| <div align="center" style="line-height: 1;"> |
| | |
| <a href="https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR" target="_blank">🤗 Official Model</a> |
| | |
| <a href="https://huggingface.co/spaces/XiaomiMiMo/MiMo-V2.5-ASR" target="_blank">🚀 Official Demo</a> |
| | |
| <a href="https://mimo.xiaomi.com/mimo-v2-5-asr" target="_blank">📰 Official Blog</a> |
| | |
| <a href="https://github.com/XiaomiMiMo/MiMo-V2.5-ASR" target="_blank">💻 Official Code</a> |
| | |
| </div> |
|
|
| <br/> |
|
|
| ## MLX Note |
|
|
| This repository is a community MLX conversion of the official `XiaomiMiMo/MiMo-V2.5-ASR` release for Apple silicon. The model description below is preserved from the official release; the MLX-specific sections on this page are additions that cover local MLX deployment. |
|
|
| ## MLX Usage |
|
|
| Current MLX usage is documented in the GitHub forks below: |
|
|
| - [ailuntx/MiMo-V2.5-ASR-MLX](https://github.com/ailuntx/MiMo-V2.5-ASR-MLX) |
| - [ailuntx/MiMo-Audio-Tokenizer-MLX](https://github.com/ailuntx/MiMo-Audio-Tokenizer-MLX) |
|
|
| Install `mlx-audio` from the branch that currently carries MiMo support: |
|
|
| ```bash |
| pip install git+https://github.com/ailuntx/mlx-audio@feat/mimo-v25-asr |
| ``` |
|
|
| Download the MLX checkpoints: |
|
|
| ```bash |
| hf download mlx-community/MiMo-Audio-Tokenizer --local-dir ./models/MiMo-Audio-Tokenizer |
| hf download mlx-community/MiMo-V2.5-ASR-MLX --local-dir ./models/MiMo-V2.5-ASR-MLX |
| ``` |
|
|
| Run transcription with the helper script from `ailuntx/MiMo-V2.5-ASR-MLX`: |
|
|
| ```bash |
| git clone https://github.com/ailuntx/MiMo-V2.5-ASR-MLX.git |
| cd MiMo-V2.5-ASR-MLX |
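| # The checkpoints were downloaded to ./models before the `cd`, so they now sit one level up. |
| # Adjust the path if you downloaded them somewhere else. |
| python run_mimo_asr_mlx.py \ |
|   --model ../models/MiMo-V2.5-ASR-MLX \ |
|   --audio path/to/audio.wav |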
| ``` |
|
|
| Python: |
|
|
| ```python |
| from mlx_audio.stt import load |
| |
| model = load("./models/MiMo-V2.5-ASR-MLX") |
| result = model.generate("path/to/audio.wav", language="en") |
| print(result.text) |
| ``` |
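For more than one file, the call above can be wrapped in a small loop. The following is a minimal sketch, not part of `mlx-audio`: `transcribe_folder` is a hypothetical helper, and the default suffix set only contains `.wav` because that is the only format shown on this page.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable


def transcribe_folder(
    folder: str,
    transcribe: Callable[[str], str],
    suffixes: Iterable[str] = (".wav",),  # assumption: extend if your decoder handles more
) -> Dict[str, str]:
    """Return {file name: transcript} for audio files directly inside `folder`."""
    wanted = {s.lower() for s in suffixes}
    results: Dict[str, str] = {}
    # Sort so the output order is stable from run to run.
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in wanted:
            results[path.name] = transcribe(str(path))
    return results


# Usage with the model loaded as in the snippet above:
#   model = load("./models/MiMo-V2.5-ASR-MLX")
#   texts = transcribe_folder("recordings", lambda p: model.generate(p, language="en").text)
```

Passing the transcription step in as a callable keeps the loop independent of the model object, so it can be tried with a stub before the checkpoint is downloaded.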
|
|
| Notes: |
|
|
| - `mlx-community/MiMo-V2.5-ASR-MLX` resolves `mlx-community/MiMo-Audio-Tokenizer` through `mlx_manifest.json`. |
| - The current install path depends on the MiMo support branch in `ailuntx/mlx-audio`. |
| - The usage section here will be simplified once MiMo lands in upstream `mlx-audio` and `mlx-audio-swift`. |
|
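To see how a downloaded checkpoint points at its tokenizer, the manifest can simply be printed. This sketch assumes only that `mlx_manifest.json` sits at the top level of the model directory and contains JSON; it deliberately assumes no key names, since the schema is not documented on this page. `read_manifest` is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import Any, Optional


def read_manifest(model_dir: str, name: str = "mlx_manifest.json") -> Optional[Any]:
    """Return the parsed manifest, or None when the file is absent."""
    path = Path(model_dir) / name
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    manifest = read_manifest("./models/MiMo-V2.5-ASR-MLX")
    if manifest is None:
        print("No mlx_manifest.json found; check the download path.")
    else:
        # Pretty-print whatever the file contains; no key names are assumed.
        print(json.dumps(manifest, indent=2, ensure_ascii=False))
```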
|
| ## Introduction |
|
|
| **MiMo-V2.5-ASR** is an end-to-end automatic speech recognition (ASR) model developed by the Xiaomi MiMo team. It is built to deliver accurate and robust transcription across Mandarin Chinese and English, multiple Chinese dialects, code-switched speech, song lyrics, knowledge-intensive content, noisy acoustic environments, and multi-speaker conversations. MiMo-V2.5-ASR achieves state-of-the-art results on a wide range of public benchmarks. |
|
|
| ## Abstract |
|
|
| Automatic speech recognition systems are expected to faithfully transcribe speech signals that originate from diverse languages, dialects, accents, and domains, and that are captured under a wide variety of acoustic conditions. While conventional end-to-end models perform well on in-domain data, they still fall short of real-world requirements in challenging scenarios such as dialect mixing, code-switching, knowledge-intensive content, noisy environments, and multi-speaker conversations. To close these gaps, we present **MiMo-V2.5-ASR**, an end-to-end speech recognition model developed by the Xiaomi MiMo team. Through large-scale mid-training, high-quality supervised fine-tuning, and a novel reinforcement-learning algorithm, MiMo-V2.5-ASR achieves systematic improvements along the following dimensions: |
|
|
| - 🗣️ **Chinese Dialects**: Native support for Wu, Cantonese, Hokkien, Sichuanese, and more. |
| - 🔀 **Code-Switch**: Seamless Chinese-English code-switching transcription with no language tags required. |
| - 🎵 **Song Recognition**: High-precision lyrics transcription for Chinese and English songs, even with mixed accompaniment and vocals. |
| - 🔊 **Noisy Environments**: Robust recognition under heavy noise, far-field capture, and other adverse acoustic conditions. |
| - 👥 **Multi-Speaker**: Accurate transcription of overlapping, multi-party conversations such as meetings. |
| - 🇬🇧 **Complex English Scenarios**: Leading performance on the Open ASR Leaderboard for challenging English benchmarks such as AMI. |
| - 📚 **Knowledge-Intensive Recognition**: Precise recognition of classical poetry, technical terminology, personal names, place names, and other knowledge-dense material. |
| - 📝 **Native Punctuation**: Punctuation generated natively from prosody and semantics, delivering ready-to-use transcripts with no post-processing needed. |
|
|
| ## Results |
|
|
| MiMo-V2.5-ASR has been evaluated across a broad set of benchmarks spanning standard Mandarin and English, Chinese dialects, lyric recognition, and internal business scenarios. The chart below summarizes the average performance of MiMo-V2.5-ASR across these scenarios. |
|
|
|  |
|
|
| For per-benchmark numbers and specific qualitative cases, please refer to the official [blog](https://mimo.xiaomi.com/mimo-v2-5-asr). |
|
|
| ## Model Download |
|
|
| | Models | 🤗 Hugging Face | |
| |-------|-------| |
| | MiMo-Audio-Tokenizer | [XiaomiMiMo/MiMo-Audio-Tokenizer](https://huggingface.co/XiaomiMiMo/MiMo-Audio-Tokenizer) | |
| | MiMo-V2.5-ASR | [XiaomiMiMo/MiMo-V2.5-ASR](https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR) | |
|
|
| ```bash |
| pip install huggingface-hub |
| |
| hf download XiaomiMiMo/MiMo-Audio-Tokenizer --local-dir ./models/MiMo-Audio-Tokenizer |
| hf download XiaomiMiMo/MiMo-V2.5-ASR --local-dir ./models/MiMo-V2.5-ASR |
| ``` |
|
|
| ## MLX Releases |
|
|
| The following repositories are MLX conversions derived from the official release: |
|
|
| | Variant | Precision | Size | Local smoke-test time | Smoke-test result | |
| | --- | --- | ---: | ---: | --- | |
| | `MiMo-V2.5-ASR-MLX` | 4bit | 4.2 GB | 0.88 s | `Intention.` | |
| | `MiMo-V2.5-ASR-MLX-4bit` | 4bit | 4.2 GB | 0.88 s | `Intention.` | |
| | `MiMo-V2.5-ASR-MLX-8bit` | 8bit | 8.0 GB | 10.80 s | `Intention.` | |
| | `MiMo-V2.5-ASR-MLX-bf16` | bf16 | 15 GB | - | dense reference export | |
| | `MiMo-V2.5-ASR-MLX-fp32` | fp32 | 30 GB | - | dense reference export | |
|
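The sizes in the table can be turned into a rough rule for picking a variant. The sketch below is an illustration only: `largest_variant_within` and the `headroom` factor of 1.5 are assumptions made here, not measurements, and on-disk size is only a lower bound on the memory needed at run time.

```python
from typing import Optional

# Checkpoint sizes in GB, copied from the table above.
VARIANT_SIZES_GB = {
    "MiMo-V2.5-ASR-MLX-4bit": 4.2,
    "MiMo-V2.5-ASR-MLX-8bit": 8.0,
    "MiMo-V2.5-ASR-MLX-bf16": 15.0,
    "MiMo-V2.5-ASR-MLX-fp32": 30.0,
}


def largest_variant_within(budget_gb: float, headroom: float = 1.5) -> Optional[str]:
    """Pick the largest variant whose size times `headroom` fits in `budget_gb`.

    `headroom` is an arbitrary safety factor (assumption), not a measured value.
    """
    fitting = {
        name: size
        for name, size in VARIANT_SIZES_GB.items()
        if size * headroom <= budget_gb
    }
    if not fitting:
        return None
    return max(fitting, key=fitting.get)
```

With these numbers, a 16 GB budget selects the 8bit variant and an 8 GB budget selects the 4bit one; treat the result as a starting point and confirm with a real run.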
|
| MLX conversion notes: |
|
|
| - Base model: `XiaomiMiMo/MiMo-V2.5-ASR` |
| - Tokenizer resolution: automatic via `mlx-community/MiMo-Audio-Tokenizer` |
| - Conversion date: `2026-05-12` |
| - Local validation runtimes: `mlx-audio` and `mlx-audio-swift` |
| - Recommended default: `MiMo-V2.5-ASR-MLX` |
|
|
| Example downloads: |
|
|
| ```bash |
| hf download mlx-community/MiMo-V2.5-ASR-MLX --local-dir ./models/MiMo-V2.5-ASR-MLX |
| hf download mlx-community/MiMo-V2.5-ASR-MLX-8bit --local-dir ./models/MiMo-V2.5-ASR-MLX-8bit |
| ``` |
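The same downloads can be scripted from Python with `huggingface_hub.snapshot_download`. This is a sketch under two assumptions: that every variant listed in the table is published under the `mlx-community` organization (only the default and 8bit repositories are shown explicitly above), and that the local layout mirrors the `./models/<name>` convention used on this page. `download_target` and `download` are hypothetical helper names.

```python
from pathlib import Path
from typing import Tuple


def download_target(variant: str = "", root: str = "./models") -> Tuple[str, str]:
    """Map a variant suffix ("", "4bit", "8bit", "bf16", "fp32") to (repo_id, local_dir)."""
    name = "MiMo-V2.5-ASR-MLX" + (f"-{variant}" if variant else "")
    # Assumption: all variants live under the mlx-community organization.
    return f"mlx-community/{name}", str(Path(root) / name)


def download(variant: str = "", root: str = "./models") -> str:
    """Fetch one checkpoint and return the local folder path."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    repo_id, local_dir = download_target(variant, root)
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Calling `download("8bit")` would correspond to the second `hf download` command above.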
|
|
| ## Validation |
|
|
| Local smoke validation was run with `mlx-audio` and `mlx-audio-swift`. |
|
|
| - `intention.wav` -> `Intention.` |
| - `conversational_a.wav` -> matched the expected paragraph (the coffee / Kaldi passage) |
|
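A smoke check of this kind can be reproduced with a tolerant string comparison. The sketch below is an assumption about how such a check might look, not the script that was actually used: `normalize` and `smoke_check` are hypothetical names, and only ASCII punctuation is stripped, so Chinese punctuation would need extra handling.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    stripped = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.split())


def smoke_check(transcript: str, expected: str) -> bool:
    """True when the transcript equals the expected text after normalization."""
    return normalize(transcript) == normalize(expected)


# Example: the transcript reported for intention.wav passes against "intention".
assert smoke_check("Intention.", "intention")
```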
|
| ## Getting Started |
|
|
| The following section is preserved from the official project and describes the original Python/CUDA workflow. |
|
|
| Spin up the MiMo-V2.5-ASR demo in minutes with the built-in Gradio app. |
|
|
| ### Prerequisites (Linux) |
|
|
| * Python 3.12 |
| * CUDA >= 12.0 |
|
|
| ### Installation |
|
|
| ```bash |
| git clone https://github.com/XiaomiMiMo/MiMo-V2.5-ASR.git |
| cd MiMo-V2.5-ASR |
| pip install -r requirements.txt |
| pip install flash-attn==2.7.4.post1 |
| ``` |
|
|
| > [!Note] |
| > If the compilation of flash-attn takes too long, you can download the precompiled wheel and install it manually: |
| > |
| > * [Download Precompiled Wheel](https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl) |
| > |
| > ```sh |
| > pip install /path/to/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl |
| > ``` |
| |
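Before installing a precompiled wheel, it can help to confirm that its Python tag matches the running interpreter. The sketch below relies only on the standard wheel file-name layout (`name-version[-build]-python-abi-platform.whl`); `wheel_python_tag` and `wheel_matches_python` are hypothetical helpers, and the check does not cover the CUDA or PyTorch versions encoded in the version string.

```python
import sys
from typing import Optional, Tuple


def wheel_python_tag(filename: str) -> str:
    """Return the Python tag (for example "cp312") from a wheel file name."""
    parts = filename.removesuffix(".whl").split("-")
    if len(parts) < 5:
        raise ValueError(f"not a wheel file name: {filename!r}")
    # The last three fields are always python tag, ABI tag, platform tag.
    return parts[-3]


def wheel_matches_python(filename: str, version: Optional[Tuple[int, int]] = None) -> bool:
    """Compare the wheel's Python tag with `version` (default: the running interpreter)."""
    major, minor = version or sys.version_info[:2]
    return wheel_python_tag(filename) == f"cp{major}{minor}"


WHEEL = "flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl"
print(wheel_matches_python(WHEEL))  # True only on CPython 3.12
```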
| ### Run the Demo |
| |
| ```bash |
| python run_mimo_asr.py |
| ``` |
| |
|  |
| |
| This launches a local Gradio interface for MiMo-V2.5-ASR, which calls the `asr_sft()` interface under the hood. You can: |
| |
| * Upload an audio file **or** record directly from your microphone. |
| * Optionally specify a **language tag** (Chinese / English / Auto) to bias the model toward a specific language, or leave it on **Auto** for automatic language detection (recommended for code-switched speech). |
|
|
| To load the model and tokenizer automatically at startup, pass their paths on the command line: |
|
|
| ```bash |
| python run_mimo_asr.py \ |
| --model-path ./models/MiMo-V2.5-ASR \ |
| --tokenizer-path ./models/MiMo-Audio-Tokenizer |
| ``` |
|
|
| Otherwise, enter the local paths for `MiMo-Audio-Tokenizer` and `MiMo-V2.5-ASR` in the **Model Configuration** tab, then start transcribing. |
|
|
| ## Python API |
|
|
| The following API example is preserved from the official project. |
|
|
| Basic usage with the `asr_sft` interface: |
|
|
| ```python |
| from src.mimo_audio.mimo_audio import MimoAudio |
| |
| model = MimoAudio( |
| model_path="./models/MiMo-V2.5-ASR", |
| tokenizer_path="./models/MiMo-Audio-Tokenizer", |
| ) |
| |
| # Automatic language detection (recommended for code-switching) |
| text = model.asr_sft("path/to/audio.wav") |
| print(text) |
| |
| # With explicit language tag |
| text_zh = model.asr_sft("path/to/audio.wav", audio_tag="<chinese>") |
| text_en = model.asr_sft("path/to/audio.wav", audio_tag="<english>") |
| ``` |
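The three call patterns above can be folded into one helper that maps a short language code to the matching tag. This is a sketch, not part of the official API: the `transcribe` wrapper and the `"zh"` / `"en"` / `"auto"` codes are assumptions introduced here, while the `<chinese>` and `<english>` tags and the `audio_tag` parameter come from the example above.

```python
from typing import Any, Optional

# Tags taken from the example above; "auto" means no tag is passed.
AUDIO_TAGS = {"zh": "<chinese>", "en": "<english>", "auto": None}


def transcribe(model: Any, audio_path: str, language: str = "auto") -> str:
    """Call model.asr_sft with the tag for `language`, or without a tag for "auto"."""
    if language not in AUDIO_TAGS:
        raise ValueError(
            f"unsupported language {language!r}; use one of {sorted(AUDIO_TAGS)}"
        )
    tag: Optional[str] = AUDIO_TAGS[language]
    if tag is None:
        # Automatic language detection, recommended for code-switched speech.
        return model.asr_sft(audio_path)
    return model.asr_sft(audio_path, audio_tag=tag)
```

Rejecting unknown codes up front avoids silently passing an unsupported tag string to the model.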
|
|
| ## Citation |
|
|
| ```bibtex |
| @misc{coreteam2026mimov25asr, |
| title={MiMo-V2.5-ASR: Robust Speech Recognition Across Languages, Dialects, and Complex Acoustic Scenarios}, |
| author={LLM-Core-Team Xiaomi}, |
| year={2026}, |
| url={https://github.com/XiaomiMiMo/MiMo-V2.5-ASR}, |
| } |
| ``` |
|
|
| ## Contact |
|
|
| Please contact [mimo@xiaomi.com](mailto:mimo@xiaomi.com) or open an issue in the official project if you have questions about the original model. |
|
|