ASR Model Compression
Collection
Compressed ASR models for on-device speech recognition on Apple Silicon: CoreML and MLX variants of Cohere Transcribe, optimized for PressType.
The most aggressive MLX quantization of CohereLabs/cohere-transcribe-03-2026 that still produces correct transcripts. Encoder at 3-bit, decoder at 4-bit. Runs entirely on-device via Apple MLX on Apple Silicon.
| Metric | Value |
|---|---|
| Size | 891 MB (vs 3.9 GB FP16 → 4.4x smaller) |
| WER (LibriSpeech test-clean) | 1.07% |
| WER (LibriSpeech test-other) | 2.17% |
| Composite WER | 1.62% |
| RTFx (M4 Air) | 23.9x real-time |
| Effective bits/param | ~3.25 |

| Component | Quantization |
|---|---|
| Encoder | 3-bit linear (per-group scale, group size 64) |
| Decoder | 4-bit affine (per-group scale, group size 64) |
| Format | MLX safetensors (model.safetensors) |
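The ~3.25 effective bits/param figure is consistent with the storage cost of group quantization: each group of 64 weights stores one fp16 scale (plus one fp16 bias in affine mode) on top of the packed weight bits. A back-of-the-envelope sketch (the `effective_bits` helper is illustrative, not part of MLX):

```python
def effective_bits(base_bits: int, group_size: int,
                   scale_bits: int = 16, affine: bool = False) -> float:
    """Bits per parameter under group quantization: packed weight bits
    plus the per-group scale (and bias/zero-point in affine mode)."""
    overhead = scale_bits * (2 if affine else 1)
    return base_bits + overhead / group_size

# Encoder: 3-bit linear quantization, group size 64, fp16 scales
enc = effective_bits(3, 64)               # 3 + 16/64 = 3.25
# Decoder: 4-bit affine quantization, group size 64, fp16 scale + bias
dec = effective_bits(4, 64, affine=True)  # 4 + 32/64 = 4.5
print(enc, dec)
```

The composite ~3.25 bits/param figure matches the encoder cost, suggesting the encoder dominates the parameter count.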
1×1 Conv1d layers are converted to equivalent Linear layers so that they can be quantized along with the rest of the network.
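This conversion is lossless because a 1×1 convolution is just a per-timestep matrix multiply: its weight tensor of shape `(out, in, 1)` can be squeezed into a Linear weight of shape `(out, in)`. A NumPy sketch of the equivalence (illustrative, not the repo's actual conversion code):

```python
import numpy as np

rng = np.random.default_rng(0)
c_in, c_out, t = 8, 16, 10

# 1x1 Conv1d weight: (out_channels, in_channels, kernel_size=1), plus bias
w_conv = rng.standard_normal((c_out, c_in, 1))
b = rng.standard_normal(c_out)
x = rng.standard_normal((c_in, t))  # (channels, time)

# Conv1d with kernel size 1: y[o, t] = sum_i w[o, i, 0] * x[i, t] + b[o]
y_conv = np.einsum("oik,it->ot", w_conv, x) + b[:, None]

# Equivalent Linear layer: drop the kernel axis, apply per time step
w_lin = w_conv[:, :, 0]          # (out, in)
y_lin = w_lin @ x + b[:, None]

assert np.allclose(y_conv, y_lin)
```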
Requires mlx-audio installed from git main:

```bash
pip install "mlx-audio[stt] @ git+https://github.com/Blaizzy/mlx-audio.git"
```

```python
from mlx_audio.stt import load

model, processor = load("MarkChen1214/cohere-transcribe-03-2026-MLX-Mixed-3bit4bit")
result = model.generate(audio="audio.wav")
print(result["text"])
```
Note: when using the mlx-audio CLI, the quantization patch must be applied (`--apply-patch` with `mlx_audio_cohere_quant_patch.py`).
| Dataset | Samples | Audio Hours | WER | RTFx |
|---|---|---|---|---|
| LibriSpeech test-clean | 2,620 | 5.4h | 1.07% | 25.2x |
| LibriSpeech test-other | 2,939 | 5.34h | 2.17% | 22.8x |
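The composite WER in the metrics table appears to be the unweighted mean of the two splits (a sample-weighted mean over the 2,620 + 2,939 utterances would give ~1.65%). Assuming that convention:

```python
# Per-split WER figures from the benchmark table
clean_wer, other_wer = 1.07, 2.17

# Composite = unweighted mean of the two splits
composite = (clean_wer + other_wer) / 2
print(f"Composite WER: {composite:.2f}%")  # Composite WER: 1.62%
```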
GPL-3.0; see LICENSE.
The base model (CohereLabs/cohere-transcribe-03-2026) is Apache 2.0.