---
license: mit
base_model: XiaomiMiMo/MiMo-Audio-Tokenizer
tags:
- mlx
- speech
- audio-tokenizer
- automatic-speech-recognition
---
<div align="center">
  <picture>
    <source srcset="https://github.com/XiaomiMiMo/MiMo-VL/raw/main/figures/Xiaomi_MiMo_darkmode.png?raw=true" media="(prefers-color-scheme: dark)">
    <img src="https://github.com/XiaomiMiMo/MiMo-VL/raw/main/figures/Xiaomi_MiMo.png?raw=true" width="60%" alt="Xiaomi-MiMo" />
  </picture>
</div>

<h3 align="center">
  <b>
    <span>━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━</span>
    <br/>
    MiMo Audio: Audio Language Models are Few-Shot Learners
    <br/>
    <span>━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━</span>
    <br/>
  </b>
</h3>

<br/>

<div align="center" style="line-height: 1;">
  |
  <a href="https://huggingface.co/collections/XiaomiMiMo/mimo-audio-68cc7202692c27dae881cce0" target="_blank">🤗 HuggingFace</a>
  &nbsp;|
  <a href="https://github.com/XiaomiMiMo/MiMo-Audio/blob/main/MiMo-Audio-Technical-Report.pdf" target="_blank">📄 Paper</a>
  &nbsp;|
  <a href="https://xiaomimimo.github.io/MiMo-Audio-Demo" target="_blank">📰 Blog</a>
  &nbsp;|
  <a href="https://huggingface.co/spaces/XiaomiMiMo/mimo_audio_chat" target="_blank">🔥 Online Demo</a>
  &nbsp;|
  <a href="https://github.com/XiaomiMiMo/MiMo-Audio-Eval" target="_blank">📊 MiMo-Audio-Eval</a>
  &nbsp;|

  <br/>
</div>

<br/>

## MLX Conversion

This repository is the MLX export used by `mlx-community/MiMo-V2.5-ASR-MLX`.

- Default precision is `fp32`.
- This export keeps the encoder and RVQ path used by MiMo ASR.
- Decoder and vocoder weights are omitted here because they are not used in the ASR pipeline.
- The published MLX weights are therefore an ASR-focused inference subset, not a byte-for-byte mirror of the full official tokenizer release.
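
The subsetting described above can be pictured as a filter over a flat weight dictionary. The sketch below is illustrative only: the key prefixes (`encoder.`, `quantizer.`, `decoder.`, `vocoder.`) and the helper name `select_asr_subset` are assumptions, not the names used in the actual checkpoint or conversion script.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical prefixes: the real checkpoint may name its modules differently.
ASR_PREFIXES: Tuple[str, ...] = ("encoder.", "quantizer.")


def select_asr_subset(
    weights: Dict[str, object],
    keep_prefixes: Iterable[str] = ASR_PREFIXES,
) -> Dict[str, object]:
    """Return only the entries whose key starts with one of keep_prefixes."""
    prefixes = tuple(keep_prefixes)
    return {name: value for name, value in weights.items() if name.startswith(prefixes)}


# Toy stand-in for a checkpoint; values would normally be arrays.
full = {
    "encoder.layers.0.weight": 1,
    "quantizer.codebooks.0": 2,
    "decoder.layers.0.weight": 3,
    "vocoder.conv.weight": 4,
}
subset = select_asr_subset(full)
print(sorted(subset))
```

Filtering by prefix keeps the exported file small while leaving the retained tensors untouched, which is consistent with the "inference subset" framing above.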

## MLX Usage

Current MLX usage is documented in:

- [ailuntx/MiMo-V2.5-ASR-MLX](https://github.com/ailuntx/MiMo-V2.5-ASR-MLX)
- [ailuntx/MiMo-Audio-Tokenizer-MLX](https://github.com/ailuntx/MiMo-Audio-Tokenizer-MLX)

Install the current MLX path:

```bash
pip install git+https://github.com/ailuntx/mlx-audio@feat/mimo-v25-asr
```

Download the tokenizer:

```bash
hf download mlx-community/MiMo-Audio-Tokenizer --local-dir ./models/MiMo-Audio-Tokenizer
```
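
The same download can likely be done from Python with `huggingface_hub.snapshot_download`. This is a minimal sketch rather than the documented path: it assumes the `huggingface_hub` package is installed, and the wrapper name `download_tokenizer` is hypothetical.

```python
from pathlib import Path

REPO_ID = "mlx-community/MiMo-Audio-Tokenizer"  # repo name from the command above


def download_tokenizer(local_dir: str = "./models/MiMo-Audio-Tokenizer") -> Path:
    """Download the tokenizer snapshot and return its local directory."""
    # Imported lazily so this module can be loaded without the package installed.
    from huggingface_hub import snapshot_download

    path = snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
    return Path(path)
```

Calling `download_tokenizer()` requires network access; the default target directory mirrors the `--local-dir` value used in the CLI command.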

This tokenizer is consumed automatically by:

- [mlx-community/MiMo-V2.5-ASR-MLX](https://huggingface.co/mlx-community/MiMo-V2.5-ASR-MLX)

If you are following the standalone GitHub path, clone the MiMo ASR fork and use its helper script:

```bash
git clone https://github.com/ailuntx/MiMo-V2.5-ASR-MLX.git
cd MiMo-V2.5-ASR-MLX
python run_mimo_asr_mlx.py \
    --model ./models/MiMo-V2.5-ASR-MLX \
    --audio path/to/audio.wav
```
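
To transcribe several files, the helper script can be wrapped in a small loop. The sketch below relies only on the `--model` and `--audio` flags shown above; `build_asr_command` is a hypothetical helper name, the audio paths are placeholders, and the commands are printed rather than executed.

```python
import shlex
import sys
from typing import List


def build_asr_command(model_dir: str, audio_path: str) -> List[str]:
    """Build the argv for one run of the helper script shown above."""
    return [
        sys.executable,
        "run_mimo_asr_mlx.py",
        "--model", model_dir,
        "--audio", audio_path,
    ]


audio_files = ["a.wav", "b.wav"]  # placeholder paths
commands = [build_asr_command("./models/MiMo-V2.5-ASR-MLX", f) for f in audio_files]
for cmd in commands:
    # To execute, pass cmd to subprocess.run(cmd, check=True) from inside the cloned repo.
    print(shlex.join(cmd))
```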

Notes:

- `mlx-community/MiMo-V2.5-ASR-MLX` resolves this tokenizer through `mlx_manifest.json`.
- This repo is not meant to be the primary user entrypoint; use the MiMo ASR repo above for end-to-end transcription.
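
The layout of `mlx_manifest.json` is not documented here, so the following is purely an illustration of how a manifest-based lookup might work. The `tokenizer` key, its `repo_id` field, and the function name `resolve_tokenizer_repo` are all assumptions and may not match the real file.

```python
import json
import tempfile
from pathlib import Path


def resolve_tokenizer_repo(manifest_path: Path) -> str:
    """Read a manifest and return the tokenizer repo it points to.

    Assumes a hypothetical schema: {"tokenizer": {"repo_id": "<org>/<name>"}}.
    """
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    try:
        return manifest["tokenizer"]["repo_id"]
    except (KeyError, TypeError) as exc:
        raise ValueError(f"no tokenizer entry in {manifest_path}") from exc


# Demo with a throwaway manifest written in the assumed schema.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "mlx_manifest.json"
    demo.write_text(
        json.dumps({"tokenizer": {"repo_id": "mlx-community/MiMo-Audio-Tokenizer"}}),
        encoding="utf-8",
    )
    resolved = resolve_tokenizer_repo(demo)
print(resolved)
```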

## Introduction

Existing audio language models typically rely on task-specific fine-tuning to accomplish particular audio tasks. In contrast, humans can generalize to new audio tasks from only a few examples or simple instructions. GPT-3 showed that scaling next-token prediction pretraining enables strong generalization in text, and we believe this paradigm applies equally to the audio domain. By scaling MiMo-Audio's pretraining data to over one hundred million hours, we observe the emergence of few-shot learning capabilities across a diverse set of audio tasks. We develop a systematic evaluation of these capabilities and find that MiMo-Audio-7B-Base achieves SOTA performance among open-source models on both speech intelligence and audio understanding benchmarks. Beyond standard metrics, MiMo-Audio-7B-Base generalizes to tasks absent from its training data, such as voice conversion, style transfer, and speech editing. It also demonstrates strong speech continuation capabilities, generating highly realistic talk shows, recitations, livestreams, and debates. At the post-training stage, we curate a diverse instruction-tuning corpus and introduce thinking mechanisms into both audio understanding and generation. MiMo-Audio-7B-Instruct achieves open-source SOTA on audio understanding benchmarks, spoken dialogue benchmarks, and instruct-TTS evaluations, approaching or surpassing closed-source models.

<p align="center">
  <img width="95%" src="https://github.com/XiaomiMiMo/MiMo-Audio/blob/main/assets/Results.png?raw=true">
</p>

## Architecture

### MiMo-Audio-Tokenizer

MiMo-Audio-Tokenizer is a 1.2B-parameter Transformer operating at 25 Hz. It employs an eight-layer RVQ stack to generate 200 tokens per second. By jointly optimizing semantic and reconstruction objectives, we train MiMo-Audio-Tokenizer from scratch on a 10-million-hour corpus, achieving superior reconstruction quality and facilitating downstream language modeling.

For clarity: the official Xiaomi release above describes the full tokenizer stack. This MLX repository publishes the encoder/RVQ subset used by `MiMo-V2.5-ASR`, which is why the Hugging Face file summary for this repo is about `0.64B` parameters instead of the full `1.2B`.
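
As a rough sanity check on what the parameter counts mean for disk and memory use, the snippet below estimates raw weight size at `fp32`. It is a back-of-the-envelope sketch: it assumes every parameter is stored as a 4-byte float and ignores file metadata, and both counts are the approximate figures quoted above.

```python
BYTES_PER_FP32 = 4


def fp32_size_gb(num_params: float) -> float:
    """Approximate weight size in decimal gigabytes for fp32 storage."""
    return num_params * BYTES_PER_FP32 / 1e9


subset_gb = fp32_size_gb(0.64e9)  # encoder/RVQ subset published in this repo
full_gb = fp32_size_gb(1.2e9)     # full tokenizer per the official description
print(f"subset ~{subset_gb:.2f} GB, full ~{full_gb:.2f} GB")
```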

<p align="center">
  <img width="95%" src="https://github.com/XiaomiMiMo/MiMo-Audio/blob/main/assets/tokenizer.png?raw=true">
</p>

### MiMo-Audio

MiMo-Audio couples a patch encoder, an LLM, and a patch decoder to improve modeling efficiency for high-rate sequences and bridge the length mismatch between speech and text. The patch encoder aggregates four consecutive time steps of RVQ tokens into a single patch, downsampling the sequence to a 6.25 Hz representation for the LLM. The patch decoder autoregressively generates the full 25 Hz RVQ token sequence via a delayed-generation scheme.

<p align="center">
  <img width="95%" src="https://github.com/XiaomiMiMo/MiMo-Audio/blob/main/assets/architecture.png?raw=true">
</p>
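
The rates quoted in this section follow from simple arithmetic: 25 frames per second times 8 RVQ layers gives 200 tokens per second, and grouping 4 frames per patch gives 6.25 patches per second. The sketch below checks those numbers and shows the grouping on dummy data. The function name `group_into_patches` is hypothetical, and dropping an incomplete trailing patch is an assumption made for simplicity, not a statement about how MiMo-Audio handles sequence ends.

```python
from typing import List

FRAME_RATE_HZ = 25  # tokenizer frame rate
RVQ_LAYERS = 8      # RVQ token ids per frame
PATCH_SIZE = 4      # consecutive frames grouped into one patch

tokens_per_second = FRAME_RATE_HZ * RVQ_LAYERS  # 25 * 8
patch_rate_hz = FRAME_RATE_HZ / PATCH_SIZE      # 25 / 4


def group_into_patches(
    frames: List[List[int]], patch_size: int = PATCH_SIZE
) -> List[List[List[int]]]:
    """Group consecutive frames into patches, dropping any incomplete tail."""
    usable = len(frames) - len(frames) % patch_size
    return [frames[i:i + patch_size] for i in range(0, usable, patch_size)]


# One second of dummy audio: 25 frames, each holding 8 RVQ token ids.
one_second = [[0] * RVQ_LAYERS for _ in range(FRAME_RATE_HZ)]
patches = group_into_patches(one_second)
print(tokens_per_second, patch_rate_hz, len(patches))
```

Each patch here is a 4 x 8 block of token ids, which is the unit the LLM would see at the lower rate.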

## Explore MiMo-Audio Now! 🚀🚀🚀

- 🎧 **Try the Hugging Face demo:** [MiMo-Audio Demo](https://huggingface.co/spaces/XiaomiMiMo/mimo_audio_chat)
- 📰 **Read the Official Blog:** [MiMo-Audio Blog](https://xiaomimimo.github.io/MiMo-Audio-Demo)
- 📄 **Dive into the Technical Report:** [MiMo-Audio Technical Report](https://github.com/XiaomiMiMo/MiMo-Audio/blob/main/MiMo-Audio-Technical-Report.pdf)

## Model Download

| Models | 🤗 Hugging Face |
|-------|-------|
| MiMo-Audio-Tokenizer | [XiaomiMiMo/MiMo-Audio-Tokenizer](https://huggingface.co/XiaomiMiMo/MiMo-Audio-Tokenizer) |
| MiMo-Audio-7B-Base | [XiaomiMiMo/MiMo-Audio-7B-Base](https://huggingface.co/XiaomiMiMo/MiMo-Audio-7B-Base) |
| MiMo-Audio-7B-Instruct | [XiaomiMiMo/MiMo-Audio-7B-Instruct](https://huggingface.co/XiaomiMiMo/MiMo-Audio-7B-Instruct) |

## Getting Started

Spin up the MiMo-Audio demo in minutes with the built-in Gradio app.

### Installation

```sh
git clone https://github.com/XiaomiMiMo/MiMo-Audio.git
cd MiMo-Audio
pip install -e .
```

### Run the demo

```sh
python run_mimo_audio.py
```

This launches a local Gradio interface where you can try MiMo-Audio interactively.

<p align="center">
  <img width="95%" src="https://github.com/XiaomiMiMo/MiMo-Audio/blob/main/assets/demo_ui.jpg?raw=true">
</p>

Enter the local paths for `MiMo-Audio-Tokenizer` and `MiMo-Audio-7B-Instruct`, then enjoy the full functionality of MiMo-Audio!

## Inference Scripts

### Base Model

We provide an example script to explore the **in-context learning** capabilities of `MiMo-Audio-7B-Base`.  
See: [`inference_example_pretrain.py`](https://github.com/XiaomiMiMo/MiMo-Audio/blob/main/inference_example_pretrain.py)

### Instruct Model

To try the instruction-tuned model `MiMo-Audio-7B-Instruct`, use the corresponding inference script.  
See: [`inference_example_sft.py`](https://github.com/XiaomiMiMo/MiMo-Audio/blob/main/inference_example_sft.py)

## Evaluation Toolkit

The full evaluation suite is available at 🌐 [MiMo-Audio-Eval](https://github.com/XiaomiMiMo/MiMo-Audio-Eval).

This toolkit is designed to evaluate MiMo-Audio and other recent audio LLMs as mentioned in the paper. It provides a flexible and extensible framework, supporting a wide range of datasets, tasks, and models.

## Validation

This MLX export was validated locally with `mlx-audio-swift` and `MiMo-V2.5-ASR-MLX`.

- Smoke samples: `intention.wav`, `conversational_a.wav`, `noisy_audio.wav`
- Release precision: `fp32`
- Lower-precision internal experiments were kept out of the Hub release to avoid frontend drift and naming ambiguity

## Citation

```bibtex
@misc{coreteam2025mimoaudio,
      title={MiMo-Audio: Audio Language Models are Few-Shot Learners}, 
      author={LLM-Core-Team Xiaomi},
      year={2025},
      url={https://github.com/XiaomiMiMo/MiMo-Audio},
}
```

## Contact

Please contact us at [mimo@xiaomi.com](mailto:mimo@xiaomi.com) or open an issue if you have any questions.