This repository is the MLX export used by `mlx-community/MiMo-V2.5-ASR-MLX`.

- Decoder and vocoder weights are omitted here because they are not used in the ASR pipeline.
- The published MLX weights are therefore an ASR-focused inference subset, not a byte-for-byte mirror of the full official tokenizer release.

## MLX Usage

Current MLX usage is documented in:

- [ailuntx/MiMo-V2.5-ASR](https://github.com/ailuntx/MiMo-V2.5-ASR)
- [ailuntx/MiMo-Audio-Tokenizer](https://github.com/ailuntx/MiMo-Audio-Tokenizer)

Install the current MLX path:

```bash
pip install git+https://github.com/ailuntx/mlx-audio@feat/mimo-v25-asr
```
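
After installing, you can confirm the package resolved correctly. This is a minimal sketch that assumes the fork keeps the upstream `mlx_audio` package name; adjust the name if the fork installs under something else:

```python
import importlib.util

def mlx_audio_available() -> bool:
    # Assumption: the fork installs under the upstream package name `mlx_audio`.
    return importlib.util.find_spec("mlx_audio") is not None

print("mlx_audio importable:", mlx_audio_available())
```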

Download the tokenizer:

```bash
hf download mlx-community/MiMo-Audio-Tokenizer --local-dir ./models/MiMo-Audio-Tokenizer
```

This tokenizer is consumed automatically by:

- [mlx-community/MiMo-V2.5-ASR-MLX](https://huggingface.co/mlx-community/MiMo-V2.5-ASR-MLX)

If you are following the standalone GitHub path, clone the MiMo ASR fork and use its helper script:

```bash
git clone https://github.com/ailuntx/MiMo-V2.5-ASR.git
cd MiMo-V2.5-ASR
python run_mimo_asr_mlx.py \
  --model ./models/MiMo-V2.5-ASR-MLX \
  --audio path/to/audio.wav
```
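
If you prefer to drive the helper script from Python (for example, to batch several files), the invocation above can be sketched as a small wrapper. The script name and flags come from the command shown; the wrapper itself is illustrative:

```python
import shlex
import subprocess

def build_asr_command(model_dir: str, audio_path: str) -> list[str]:
    # Mirrors the CLI call above; flags are those of the MiMo-V2.5-ASR helper script.
    return [
        "python", "run_mimo_asr_mlx.py",
        "--model", model_dir,
        "--audio", audio_path,
    ]

cmd = build_asr_command("./models/MiMo-V2.5-ASR-MLX", "path/to/audio.wav")
print(shlex.join(cmd))
# Run inside the cloned MiMo-V2.5-ASR checkout:
# subprocess.run(cmd, check=True)
```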

Notes:

- `mlx-community/MiMo-V2.5-ASR-MLX` resolves this tokenizer through `mlx_manifest.json`.
- This repo is not meant to be the primary user entrypoint; use the MiMo ASR repo above for end-to-end transcription.

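To illustrate how a manifest-based lookup works, here is a minimal sketch. The actual `mlx_manifest.json` schema is not documented here, so the keys below are assumptions:

```python
import json

# Hypothetical manifest content; the real mlx_manifest.json keys may differ.
manifest_text = '{"tokenizer": {"repo_id": "mlx-community/MiMo-Audio-Tokenizer"}}'
manifest = json.loads(manifest_text)

# The ASR model resolves its tokenizer repo from the manifest rather than hard-coding it.
tokenizer_repo = manifest["tokenizer"]["repo_id"]
print(tokenizer_repo)
```
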
## Introduction
Existing audio language models typically rely on task-specific fine-tuning to accomplish particular audio tasks. In contrast, humans can generalize to new audio tasks from only a few examples or simple instructions. GPT-3 showed that scaling next-token prediction pretraining enables strong generalization in text, and we believe this paradigm applies equally to the audio domain. By scaling MiMo-Audio's pretraining data to over one hundred million hours, we observe the emergence of few-shot learning capabilities across a diverse set of audio tasks. We develop a systematic evaluation of these capabilities and find that MiMo-Audio-7B-Base achieves SOTA performance on both speech intelligence and audio understanding benchmarks among open-source models. Beyond standard metrics, MiMo-Audio-7B-Base generalizes to tasks absent from its training data, such as voice conversion, style transfer, and speech editing. It also demonstrates powerful speech continuation, generating highly realistic talk shows, recitations, livestreams, and debates. At the post-training stage, we curate a diverse instruction-tuning corpus and introduce thinking mechanisms into both audio understanding and generation. MiMo-Audio-7B-Instruct achieves open-source SOTA on audio understanding, spoken dialogue, and instruct-TTS benchmarks, approaching or surpassing closed-source models.