ailuntz committed
Commit cf7f321 · verified · 1 parent: 2616113

Clarify MLX tokenizer subset scope

Files changed (1): README.md (+3 −0)
README.md CHANGED
@@ -52,6 +52,7 @@ This repository is the MLX export used by `mlx-community/MiMo-V2.5-ASR-MLX`.
 - Default precision is `fp32`.
 - This export keeps the encoder and RVQ path used by MiMo ASR.
 - Decoder and vocoder weights are omitted here because they are not used in the ASR pipeline.
+- The published MLX weights are therefore an ASR-focused inference subset, not a byte-for-byte mirror of the full official tokenizer release.
 
 ## Introduction
 
@@ -67,6 +68,8 @@ Existing audio language models typically rely on task-specific fine-tuning to ac
 
 MiMo-Audio-Tokenizer is a 1.2B-parameter Transformer operating at 25 Hz. It employs an eight-layer RVQ stack to generate 200 tokens per second. By jointly optimizing semantic and reconstruction objectives, we train MiMo-Audio-Tokenizer from scratch on a 10-million-hour corpus, achieving superior reconstruction quality and facilitating downstream language modeling.
 
+For clarity: the official Xiaomi release above describes the full tokenizer stack. This MLX repository publishes the encoder/RVQ subset used by `MiMo-V2.5-ASR`, which is why the Hugging Face file summary for this repo is about `0.64B` parameters instead of the full `1.2B`.
+
 <p align="center">
 <img width="95%" src="https://github.com/XiaomiMiMo/MiMo-Audio/blob/main/assets/tokenizer.png?raw=true">
 </p>
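The figures quoted in the diff are internally consistent, and the arithmetic can be checked directly. A minimal sketch (the 25 Hz frame rate and eight RVQ codebooks come from the README paragraph; the `0.64B`/`1.2B` split comes from the commit's added note):

```python
# Sanity-check the token rate quoted in the README:
# 25 frames per second x 8 RVQ codebooks per frame.
frame_rate_hz = 25
rvq_codebooks = 8
tokens_per_second = frame_rate_hz * rvq_codebooks
print(tokens_per_second)  # 200, matching "200 tokens per second"

# Rough share of the full 1.2B-parameter model kept in the ASR subset.
subset_params_b = 0.64
full_params_b = 1.2
print(round(subset_params_b / full_params_b, 2))  # ~0.53
```

So the published subset carries roughly half the full model's parameters, consistent with dropping the decoder and vocoder.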
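The "encoder/RVQ subset" scoping described in the commit can be expressed as a simple filter over checkpoint keys. A hypothetical sketch; the parameter names below are invented for illustration and the real export's key names may differ:

```python
# Hypothetical weight names; the actual export's keys may differ.
all_keys = [
    "encoder.layers.0.attention.wq",
    "quantizer.rvq.codebooks.0",
    "decoder.layers.0.attention.wq",
    "vocoder.upsample.0.weight",
]

# Keep only the encoder/RVQ path used by the ASR pipeline;
# decoder and vocoder weights are dropped from the export.
kept = [k for k in all_keys if k.split(".")[0] in {"encoder", "quantizer"}]
print(kept)  # encoder and quantizer keys only
```

This mirrors the commit's point: the MLX repository is a filtered inference subset, not a full mirror of the official release.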