ailuntz committed (verified) · Commit 35106e4 · 1 Parent(s): cf7f321

Upload README.md with huggingface_hub

Files changed (1): README.md (+38 −0)
This repository is the MLX export used by `mlx-community/MiMo-V2.5-ASR-MLX`.

- Decoder and vocoder weights are omitted here because they are not used in the ASR pipeline.
- The published MLX weights are therefore an ASR-focused inference subset, not a byte-for-byte mirror of the full official tokenizer release.
## MLX Usage

Current MLX usage is documented in:

- [ailuntx/MiMo-V2.5-ASR](https://github.com/ailuntx/MiMo-V2.5-ASR)
- [ailuntx/MiMo-Audio-Tokenizer](https://github.com/ailuntx/MiMo-Audio-Tokenizer)

Install the MLX code path (a feature branch of `mlx-audio`):

```bash
pip install git+https://github.com/ailuntx/mlx-audio@feat/mimo-v25-asr
```
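
After installing from the branch, it can help to confirm the package is importable before moving on. A minimal check, assuming the pip package installs under the module name `mlx_audio` (the usual convention for `mlx-audio`, but an assumption here):

```python
from importlib import util


def module_available(name: str) -> bool:
    """Return True if `name` can be imported in this environment."""
    # find_spec returns None (rather than raising) for a missing top-level module.
    return util.find_spec(name) is not None


if __name__ == "__main__":
    print("mlx_audio importable:", module_available("mlx_audio"))
```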

Download the tokenizer:

```bash
hf download mlx-community/MiMo-Audio-Tokenizer --local-dir ./models/MiMo-Audio-Tokenizer
```
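
A quick way to sanity-check the download before wiring the path into the pipeline is to list what actually landed in `--local-dir`. This is a generic helper, not part of the tokenizer repo; the exact filenames depend on the repo contents:

```python
from pathlib import Path


def list_tokenizer_artifacts(local_dir: str) -> dict[str, list[str]]:
    """Group downloaded files by suffix so a partial or failed
    download is easy to spot at a glance."""
    groups: dict[str, list[str]] = {}
    for path in sorted(Path(local_dir).rglob("*")):
        if path.is_file():
            groups.setdefault(path.suffix or "(none)", []).append(path.name)
    return groups


if __name__ == "__main__":
    for suffix, names in list_tokenizer_artifacts("./models/MiMo-Audio-Tokenizer").items():
        print(suffix, "→", names)
```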

This tokenizer is consumed automatically by:

- [mlx-community/MiMo-V2.5-ASR-MLX](https://huggingface.co/mlx-community/MiMo-V2.5-ASR-MLX)

If you are following the standalone GitHub path, clone the MiMo ASR fork and use its helper script:

```bash
git clone https://github.com/ailuntx/MiMo-V2.5-ASR.git
cd MiMo-V2.5-ASR
python run_mimo_asr_mlx.py \
  --model ./models/MiMo-V2.5-ASR-MLX \
  --audio path/to/audio.wav
```
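
To transcribe many files, the helper script can be driven from Python instead of the shell. A sketch that builds the same command line as above and captures stdout — it assumes the script prints the transcript to stdout, which is not guaranteed by this README:

```python
import subprocess


def build_asr_command(model_dir: str, audio_path: str) -> list[str]:
    """Assemble argv for run_mimo_asr_mlx.py using the flags shown above."""
    return [
        "python", "run_mimo_asr_mlx.py",
        "--model", model_dir,
        "--audio", audio_path,
    ]


def transcribe(model_dir: str, audio_path: str) -> str:
    """Run the helper script once and return whatever it wrote to stdout."""
    result = subprocess.run(
        build_asr_command(model_dir, audio_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```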

Notes:

- `mlx-community/MiMo-V2.5-ASR-MLX` resolves this tokenizer through `mlx_manifest.json`.
- This repo is not meant to be the primary user entrypoint; use the MiMo ASR repo above for end-to-end transcription.
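
As an illustration of the manifest lookup in the first note, the resolution could look like the sketch below. The schema — a top-level `"tokenizer"` key holding a repo id — is a hypothetical assumption, not the documented `mlx_manifest.json` format:

```python
import json
from pathlib import Path


def resolve_tokenizer_repo(model_dir: str) -> str:
    """Read mlx_manifest.json from the model directory and return the
    tokenizer repo id it points at.

    Assumption: the id lives under a top-level "tokenizer" key;
    the real schema may differ.
    """
    manifest = json.loads(Path(model_dir, "mlx_manifest.json").read_text())
    return manifest["tokenizer"]
```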

## Introduction

Existing audio language models typically rely on task-specific fine-tuning to accomplish particular audio tasks. In contrast, humans are able to generalize to new audio tasks with only a few examples or simple instructions. GPT-3 showed that scaling next-token prediction pretraining enables strong generalization in text, and we believe this paradigm is equally applicable to the audio domain. By scaling MiMo-Audio's pretraining data to over one hundred million hours, we observe the emergence of few-shot learning capabilities across a diverse set of audio tasks. We develop a systematic evaluation of these capabilities and find that MiMo-Audio-7B-Base achieves SOTA performance on both speech intelligence and audio understanding benchmarks among open-source models. Beyond standard metrics, MiMo-Audio-7B-Base generalizes to tasks absent from its training data, such as voice conversion, style transfer, and speech editing. It also demonstrates powerful speech continuation capabilities, generating highly realistic talk shows, recitations, livestreams, and debates. At the post-training stage, we curate a diverse instruction-tuning corpus and introduce thinking mechanisms into both audio understanding and generation. MiMo-Audio-7B-Instruct achieves open-source SOTA on audio understanding benchmarks, spoken dialogue benchmarks, and instruct-TTS evaluations, approaching or surpassing closed-source models.