mlx-community/MiMo-V2.5-ASR-MLX-8bit

@@ -15,53 +15,85 @@ language:
 - en
 ---
-# MiMo-V2.5-ASR-MLX-8bit
 Current variant: `8bit`
-This repository is a community MLX conversion of the official `XiaomiMiMo/MiMo-V2.5-ASR` release for local inference on Apple silicon. The original model, tokenizer, benchmark claims, demo, and project materials remain with the Xiaomi MiMo team. The MLX-specific notes in this repository are added as an incremental deployment layer on top of the official release.
-Official resources:
-- Model: `XiaomiMiMo/MiMo-V2.5-ASR`
-- Tokenizer: `XiaomiMiMo/MiMo-Audio-Tokenizer`
-- Demo: `XiaomiMiMo/MiMo-V2.5-ASR` Space
-- Blog: `mimo.xiaomi.com/mimo-v2-5-asr`
-- Code: `XiaomiMiMo/MiMo-V2.5-ASR`
 ## Introduction
-**MiMo-V2.5-ASR** is an end-to-end automatic speech recognition model developed by the Xiaomi MiMo team. It is designed for robust transcription across Mandarin Chinese and English, Chinese dialects, code-switched speech, lyrics, noisy recordings, meetings, and knowledge-intensive content.
-The official release highlights the following capabilities:
-- Native support for Chinese dialects including Wu, Cantonese, Hokkien, and Sichuanese.
-- Seamless Chinese-English code-switching transcription without language tags.
-- Lyrics transcription for Chinese and English songs.
-- Robust recognition under heavy noise and far-field capture.
-- Accurate transcription for multi-speaker and overlapping conversations.
-- Strong performance on complex English meeting-style benchmarks.
-- Reliable handling of terminology, names, places, and other knowledge-dense material.
-- Native punctuation generation without a separate post-processing stage.
 ## Results
-For benchmark charts, qualitative examples, and the original project presentation, please refer to the official model page and blog:
-- Official model card: `XiaomiMiMo/MiMo-V2.5-ASR`
-- Official blog: `mimo.xiaomi.com/mimo-v2-5-asr`
-## MLX Conversion
-This repository packages the official release as an MLX-ready model family for Apple silicon. The conversion was built from the official model weights together with `XiaomiMiMo/MiMo-Audio-Tokenizer`.
-- Base model: `XiaomiMiMo/MiMo-V2.5-ASR`
-- Required tokenizer: `XiaomiMiMo/MiMo-Audio-Tokenizer`
-- Conversion date: `2026-05-12`
-- Runtime used for validation: `mlx-audio-swift`
-- Recommended default: `MiMo-V2.5-ASR-MLX`
-## Variant Summary
 | Variant | Precision | Size | Local smoke time | Smoke result |
 | --- | --- | ---: | ---: | --- |
@@ -71,15 +103,104 @@ This repository packages the official release as an MLX-ready model family for A
 | `MiMo-V2.5-ASR-MLX-bf16` | bf16 | 15 GB | - | dense reference export |
 | `MiMo-V2.5-ASR-MLX-fp32` | fp32 | 30 GB | - | dense reference export |
 ## Validation
 Local smoke validation was run with `mlx-audio-swift` on `Tests/media/intention.wav`.
 - Output: `Intention.`
-## Citation
-If you use the original model, please cite the official project:
 ```bibtex
 @misc{coreteam2026mimov25asr,
@@ -92,7 +213,4 @@ If you use the original model, please cite the official project:
 ## Contact
-For questions about the original model, please refer to the official project channels:
-- `mimo@xiaomi.com`
-- `XiaomiMiMo/MiMo-V2.5-ASR`

 - en
 ---
 Current variant: `8bit`
+<div align="center">
+  <img src="https://raw.githubusercontent.com/XiaomiMiMo/MiMo-V2.5-ASR/main/assets/XiaomiMIMO.png" width="60%" alt="Xiaomi-MiMo" />
+</div>
+<div align="center">
+  <h3>
+    <b>
+      <span>━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━</span><br/>
+      MiMo-V2.5-ASR: Robust Speech Recognition Across<br/>
+      Languages, Dialects, and Complex Acoustic Scenarios<br/>
+      <span>━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━</span>
+    </b>
+  </h3>
+</div>
+<br/>
+<div align="center" style="line-height: 1;">
+  |
+  <a href="https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR" target="_blank">🤗 Official Model</a>
+  &nbsp;|
+  <a href="https://huggingface.co/spaces/XiaomiMiMo/MiMo-V2.5-ASR" target="_blank">🚀 Official Demo</a>
+  &nbsp;|
+  <a href="https://mimo.xiaomi.com/mimo-v2-5-asr" target="_blank">📰 Official Blog</a>
+  &nbsp;|
+  <a href="https://github.com/XiaomiMiMo/MiMo-V2.5-ASR" target="_blank">💻 Official Code</a>
+  &nbsp;|
+</div>
+<br/>
+## MLX Note
+This repository is a community MLX conversion of the official `XiaomiMiMo/MiMo-V2.5-ASR` release for Apple silicon. The original model description below is preserved from the official release, and the MLX-specific material in this page is added as an incremental note for local MLX deployment.
 ## Introduction
+**MiMo-V2.5-ASR** is a state-of-the-art end-to-end automatic speech recognition (ASR) model developed by the Xiaomi MiMo team. It is built to deliver accurate and robust transcription across Mandarin Chinese and English, multiple Chinese dialects, code-switched speech, song lyrics, knowledge-intensive content, noisy acoustic environments, and multi-speaker conversations. MiMo-V2.5-ASR achieves state-of-the-art results on a wide range of public benchmarks.
+## Abstract
+Automatic speech recognition systems are expected to faithfully transcribe speech signals that originate from diverse languages, dialects, accents, and domains, and that are captured under a wide variety of acoustic conditions. While conventional end-to-end models perform well on in-domain data, they still fall short of real-world requirements in challenging scenarios such as dialect mixing, code-switching, knowledge-intensive content, noisy environments, and multi-speaker conversations. Therefore, we present **MiMo-V2.5-ASR**, an end-to-end speech recognition model developed by the Xiaomi MiMo team. Through large-scale mid-training, high-quality supervised fine-tuning, and a novel reinforcement-learning algorithm, MiMo-V2.5-ASR achieves systematic improvements along the following dimensions:
+- 🗣️ **Chinese Dialects**: Native support for Wu, Cantonese, Hokkien, Sichuanese, and more.
+- 🔀 **Code-Switch**: Seamless Chinese-English code-switching transcription with no language tags required.
+- 🎵 **Song Recognition**: High-precision lyrics transcription for Chinese and English songs, even with mixed accompaniment and vocals.
+- 🔊 **Noisy Environments**: Robust recognition under heavy noise, far-field capture, and other adverse acoustic conditions.
+- 👥 **Multi-Speaker**: Accurate transcription of overlapping, multi-party conversations such as meetings.
+- 🇬🇧 **Complex English Scenarios**: Leading performance on the Open ASR Leaderboard for challenging English benchmarks such as AMI.
+- 📚 **Knowledge-Intensive Recognition**: Precise recognition of classical poetry, technical terminology, personal names, place names, and other knowledge-dense material.
+- 📝 **Native Punctuation**: Punctuation generated natively from prosody and semantics, delivering ready-to-use transcripts with no post-processing needed.
 ## Results
+MiMo-V2.5-ASR has been evaluated across a broad set of benchmarks spanning standard Mandarin and English, Chinese dialects, lyric recognition, and internal business scenarios. The chart below summarizes the average performance of MiMo-V2.5-ASR across these scenarios.
+![Results](https://raw.githubusercontent.com/XiaomiMiMo/MiMo-V2.5-ASR/main/assets/MiMo_ASR_Results.png)
+For per-benchmark numbers and specific qualitative cases, please refer to the official [blog](https://mimo.xiaomi.com/mimo-v2-5-asr).
+## Model Download
+| Models   | 🤗 Hugging Face |
+|-------|-------|
+| MiMo-Audio-Tokenizer | [XiaomiMiMo/MiMo-Audio-Tokenizer](https://huggingface.co/XiaomiMiMo/MiMo-Audio-Tokenizer) |
+| MiMo-V2.5-ASR | [XiaomiMiMo/MiMo-V2.5-ASR](https://huggingface.co/XiaomiMiMo/MiMo-V2.5-ASR) |
+```bash
+pip install huggingface-hub
+hf download XiaomiMiMo/MiMo-Audio-Tokenizer --local-dir ./models/MiMo-Audio-Tokenizer
+hf download XiaomiMiMo/MiMo-V2.5-ASR --local-dir ./models/MiMo-V2.5-ASR
+```
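The same two downloads can also be scripted from Python. The following is a minimal sketch, assuming the `huggingface_hub` package is installed and using its `snapshot_download` function; the helper names `plan_downloads` and `download_all` are hypothetical and not part of the official project.

```python
from pathlib import Path

# Repositories listed in the table above, mapped to the folder names
# used by the `hf download` commands.
REPOS = {
    "XiaomiMiMo/MiMo-Audio-Tokenizer": "MiMo-Audio-Tokenizer",
    "XiaomiMiMo/MiMo-V2.5-ASR": "MiMo-V2.5-ASR",
}


def plan_downloads(root="./models"):
    """Return (repo_id, local_dir) pairs mirroring the CLI commands."""
    return [(repo_id, str(Path(root) / name)) for repo_id, name in REPOS.items()]


def download_all(root="./models"):
    """Download every repository in REPOS (hypothetical helper)."""
    # Imported lazily so plan_downloads() works without the package installed.
    from huggingface_hub import snapshot_download

    for repo_id, local_dir in plan_downloads(root):
        snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Calling `download_all()` fetches both repositories into `./models`, while `plan_downloads()` only reports what would be fetched, which is handy for checking paths before starting a large transfer.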
+## MLX Releases
+The following repositories are MLX conversions derived from the official release:
 | Variant | Precision | Size | Local smoke time | Smoke result |
 | --- | --- | ---: | ---: | --- |
 | `MiMo-V2.5-ASR-MLX-bf16` | bf16 | 15 GB | - | dense reference export |
 | `MiMo-V2.5-ASR-MLX-fp32` | fp32 | 30 GB | - | dense reference export |
+MLX conversion notes:
+- Base model: `XiaomiMiMo/MiMo-V2.5-ASR`
+- Required tokenizer: `XiaomiMiMo/MiMo-Audio-Tokenizer`
+- Conversion date: `2026-05-12`
+- Local validation runtime: `mlx-audio-swift`
+- Recommended default: `MiMo-V2.5-ASR-MLX`
+Example downloads:
+```bash
+hf download ailuntz/MiMo-V2.5-ASR-MLX --local-dir ./models/MiMo-V2.5-ASR-MLX
+hf download ailuntz/MiMo-V2.5-ASR-MLX-8bit --local-dir ./models/MiMo-V2.5-ASR-MLX-8bit
+```
 ## Validation
 Local smoke validation was run with `mlx-audio-swift` on `Tests/media/intention.wav`.
 - Output: `Intention.`
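A smoke check of this kind amounts to comparing the transcript with the expected text. The sketch below is illustrative only: it is not the script used for the validation above, and `normalize` and `smoke_passes` are hypothetical helpers that ignore case, surrounding whitespace, and ASCII punctuation.

```python
import string


def normalize(text):
    """Trim, casefold, and drop ASCII punctuation before comparing."""
    stripped = text.strip().casefold()
    return stripped.translate(str.maketrans("", "", string.punctuation))


def smoke_passes(transcript, expected="Intention."):
    """Return True when the transcript matches the expected text after normalization."""
    return normalize(transcript) == normalize(expected)
```

The transcript can come from any runtime; the comparison only needs the resulting string.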
+## Getting Started
+The following section is preserved from the official project and describes the original Python/CUDA workflow.
+Spin up the MiMo-V2.5-ASR demo in minutes with the built-in Gradio app.
+### Prerequisites (Linux)
+* Python 3.12
+* CUDA >= 12.0
+### Installation
+```bash
+git clone https://github.com/XiaomiMiMo/MiMo-V2.5-ASR.git
+cd MiMo-V2.5-ASR
+pip install -r requirements.txt
+pip install flash-attn==2.7.4.post1
+```
+> [!Note]
+> If the compilation of flash-attn takes too long, you can download the precompiled wheel and install it manually:
+>
+> * [Download Precompiled Wheel](https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl)
+>
+> ```sh
+> pip install /path/to/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
+> ```
+### Run the Demo
+```bash
+python run_mimo_asr.py
+```
+![MiMo-V2.5-ASR Demo](https://raw.githubusercontent.com/XiaomiMiMo/MiMo-V2.5-ASR/main/assets/MiMo_ASR_Demo.png)
+This launches a local Gradio interface for MiMo-V2.5-ASR. You can:
+* Upload an audio file **or** record directly from your microphone.
+* Optionally specify a **language tag** (Chinese / English / Auto) to bias the model toward a specific language, or leave it set to **Auto** for automatic language detection (recommended for code-switched speech).
+* The demo calls the `asr_sft()` interface under the hood.
+To load the model and tokenizer automatically at startup, pass their paths on the command line:
+```bash
+python run_mimo_asr.py \
+    --model-path ./models/MiMo-V2.5-ASR \
+    --tokenizer-path ./models/MiMo-Audio-Tokenizer
+```
+Otherwise, enter the local paths for `MiMo-Audio-Tokenizer` and `MiMo-V2.5-ASR` in the **Model Configuration** tab, then start transcribing.
+## Python API
+The following API example is preserved from the official project.
+Basic usage with the `asr_sft` interface:
+```python
+from src.mimo_audio.mimo_audio import MimoAudio
+model = MimoAudio(
+    model_path="./models/MiMo-V2.5-ASR",
+    tokenizer_path="./models/MiMo-Audio-Tokenizer",
+)
+# Automatic language detection (recommended for code-switching)
+text = model.asr_sft("path/to/audio.wav")
+print(text)
+# With explicit language tag
+text_zh = model.asr_sft("path/to/audio.wav", audio_tag="<chinese>")
+text_en = model.asr_sft("path/to/audio.wav", audio_tag="<english>")
+```
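To transcribe several recordings, the call above can be wrapped in a small loop. This is a minimal sketch rather than part of the official API: `transcribe_folder` is a hypothetical helper that accepts any callable taking a file path and returning text, so it can be tried without loading the model.

```python
from pathlib import Path


def transcribe_folder(transcribe, folder, pattern="*.wav"):
    """Apply `transcribe` to each matching file and return {filename: text}."""
    results = {}
    # Sort the paths so the output order is stable across runs.
    for path in sorted(Path(folder).glob(pattern)):
        results[path.name] = transcribe(str(path))
    return results
```

With the model loaded as in the example above, `transcribe_folder(model.asr_sft, "path/to/recordings")` uses automatic language detection, and `transcribe_folder(lambda p: model.asr_sft(p, audio_tag="<chinese>"), "path/to/recordings")` applies an explicit language tag.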
+## Citation
 ```bibtex
 @misc{coreteam2026mimov25asr,
 ## Contact
+Please contact [mimo@xiaomi.com](mailto:mimo@xiaomi.com) or open an issue in the official project if you have questions about the original model.