mlx-community/MiMo-V2.5-ASR-MLX-4bit

@@ -52,6 +52,52 @@ Current variant: `4bit`
 This repository is a community MLX conversion of the official `XiaomiMiMo/MiMo-V2.5-ASR` release for Apple silicon. The original model description below is preserved from the official release, and the MLX-specific material on this page is added as an incremental note for local MLX deployment.
 ## Introduction
 **MiMo-V2.5-ASR** is a state-of-the-art end-to-end automatic speech recognition (ASR) model developed by the Xiaomi MiMo team. It is built to deliver accurate and robust transcription across Mandarin Chinese and English, multiple Chinese dialects, code-switched speech, song lyrics, knowledge-intensive content, noisy acoustic environments, and multi-speaker conversations. MiMo-V2.5-ASR achieves state-of-the-art results on a wide range of public benchmarks.
@@ -106,10 +152,9 @@ The following repositories are MLX conversions derived from the official release
 MLX conversion notes:
 - Base model: `XiaomiMiMo/MiMo-V2.5-ASR`
-- One-line MLX loading: `fromPretrained("mlx-community/MiMo-V2.5-ASR-MLX")` auto-resolves the tokenizer mirror
 - Tokenizer resolution: automatic via `mlx-community/MiMo-Audio-Tokenizer`
 - Conversion date: `2026-05-12`
-- Local validation runtime: `mlx-audio-swift`
 - Recommended default: `MiMo-V2.5-ASR-MLX`
 Example downloads:
@@ -121,9 +166,10 @@ hf download mlx-community/MiMo-V2.5-ASR-MLX-8bit --local-dir ./models/MiMo-V2.5-
 ## Validation
-Local smoke validation was run with `mlx-audio-swift` on `Tests/media/intention.wav`.
-- Output: `Intention.`
 ## Getting Started

 This repository is a community MLX conversion of the official `XiaomiMiMo/MiMo-V2.5-ASR` release for Apple silicon. The original model description below is preserved from the official release, and the MLX-specific material on this page is added as an incremental note for local MLX deployment.
+## MLX Usage
+Current MLX usage is documented in the GitHub forks below:
+- [ailuntx/MiMo-V2.5-ASR](https://github.com/ailuntx/MiMo-V2.5-ASR)
+- [ailuntx/MiMo-Audio-Tokenizer](https://github.com/ailuntx/MiMo-Audio-Tokenizer)
+Install the MiMo support branch of `mlx-audio`:
+```bash
+pip install git+https://github.com/ailuntx/mlx-audio@feat/mimo-v25-asr
+```
+Download the MLX checkpoints:
+```bash
+hf download mlx-community/MiMo-Audio-Tokenizer --local-dir ./models/MiMo-Audio-Tokenizer
+hf download mlx-community/MiMo-V2.5-ASR-MLX --local-dir ./models/MiMo-V2.5-ASR-MLX
+```
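+Run transcription from the helper script in `ailuntx/MiMo-V2.5-ASR`. The script runs from inside the cloned repository, so the model path points back to the `./models` directory created by the download step:
+```bash
+git clone https://github.com/ailuntx/MiMo-V2.5-ASR.git
+cd MiMo-V2.5-ASR
+python run_mimo_asr_mlx.py \
+    --model ../models/MiMo-V2.5-ASR-MLX \
+    --audio path/to/audio.wav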
+```
+Python:
+```python
+from mlx_audio.stt import load
+model = load("./models/MiMo-V2.5-ASR-MLX")
+result = model.generate("path/to/audio.wav", language="en")
+print(result.text)
+```
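To transcribe several files, the same `load` / `generate` calls can be wrapped in a loop. The following is a minimal sketch, not part of `mlx-audio`: `transcribe_dir` is a hypothetical helper, and it takes the transcription function as an argument so the file handling can be exercised without a model.

```python
from pathlib import Path
from typing import Callable, Dict


def transcribe_dir(audio_dir: str, transcribe: Callable[[str], str]) -> Dict[str, str]:
    """Apply `transcribe` to every .wav file in audio_dir, in name order.

    Hypothetical helper: `transcribe` is any callable that maps a file path
    to its transcript text.
    """
    results: Dict[str, str] = {}
    for wav in sorted(Path(audio_dir).glob("*.wav")):
        results[wav.name] = transcribe(str(wav))
    return results
```

With `model` loaded as in the Python snippet above, a call might look like `transcribe_dir("./audio", lambda p: model.generate(p, language="en").text)`, where `./audio` is a placeholder directory.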
+Notes:
+- `mlx-community/MiMo-V2.5-ASR-MLX` resolves `mlx-community/MiMo-Audio-Tokenizer` through `mlx_manifest.json`.
+- The current install path depends on the MiMo support branch in `ailuntx/mlx-audio`.
+- This usage section will be simplified once MiMo support lands in upstream `mlx-audio` and `mlx-audio-swift`.
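The first note says the tokenizer repository is resolved through `mlx_manifest.json`. To see what a local download actually declares, the file can be read directly. This is a sketch under assumptions: the manifest schema is not documented on this page, so the code only parses and returns the JSON without relying on any field names, and `read_manifest` is a hypothetical helper rather than an `mlx-audio` function.

```python
import json
from pathlib import Path
from typing import Any, Optional


def read_manifest(model_dir: str) -> Optional[Any]:
    """Return the parsed mlx_manifest.json from model_dir, or None if absent."""
    manifest_path = Path(model_dir) / "mlx_manifest.json"
    if not manifest_path.is_file():
        return None
    with manifest_path.open(encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    # Path taken from the download step; adjust if the checkpoint lives elsewhere.
    manifest = read_manifest("./models/MiMo-V2.5-ASR-MLX")
    if manifest is None:
        print("mlx_manifest.json not found; was the checkpoint downloaded?")
    else:
        print(json.dumps(manifest, indent=2))
```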
 ## Introduction
 **MiMo-V2.5-ASR** is a state-of-the-art end-to-end automatic speech recognition (ASR) model developed by the Xiaomi MiMo team. It is built to deliver accurate and robust transcription across Mandarin Chinese and English, multiple Chinese dialects, code-switched speech, song lyrics, knowledge-intensive content, noisy acoustic environments, and multi-speaker conversations. MiMo-V2.5-ASR achieves state-of-the-art results on a wide range of public benchmarks.
 MLX conversion notes:
 - Base model: `XiaomiMiMo/MiMo-V2.5-ASR`
 - Tokenizer resolution: automatic via `mlx-community/MiMo-Audio-Tokenizer`
 - Conversion date: `2026-05-12`
+- Local validation runtimes: `mlx-audio` and `mlx-audio-swift`
 - Recommended default: `MiMo-V2.5-ASR-MLX`
 Example downloads:
 ## Validation
+Local smoke validation was run with `mlx-audio` and `mlx-audio-swift`.
+- `intention.wav` -> `Intention.`
+- `conversational_a.wav` -> the expected coffee / Kaldi paragraph
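A smoke check of this kind can be scripted as a normalized string comparison between the transcript and the expected text. The sketch below is an assumption about how such a check might look, not the script used for the validation reported here; `normalize` and `matches_expected` are hypothetical helpers.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    stripped = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.split())


def matches_expected(transcript: str, expected: str) -> bool:
    """Compare a transcript with the expected text after normalization."""
    return normalize(transcript) == normalize(expected)


if __name__ == "__main__":
    # Mirrors the first validation case: `intention.wav` -> `Intention.`
    print(matches_expected("Intention.", "intention"))  # True
```

Normalizing before comparing keeps the check tolerant of casing and punctuation differences between runtimes while still failing on wrong words.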
 ## Getting Started