---
library_name: mlx
license: apache-2.0
base_model: ibm-granite/granite-speech-4.1-2b-nar
language: [en, fr, de, es, pt]
pipeline_tag: automatic-speech-recognition
tags:
- mlx
- mlx-audio
- speech-to-text
- non-autoregressive
- granite
---

# Granite Speech 4.1 2B NAR (MLX)

MLX port of [`ibm-granite/granite-speech-4.1-2b-nar`](https://huggingface.co/ibm-granite/granite-speech-4.1-2b-nar) for Apple Silicon. Runs via [mlx-audio](https://github.com/Blaizzy/mlx-audio).

## Architecture

Non-autoregressive ASR: a CTC hypothesis is refined by a bidirectional LM editor, in four steps:

1. **16-layer Conformer encoder** (543M params) produces an initial BPE CTC hypothesis.
2. **2-layer windowed Q-Former projector** (80M params) converts multi-layer encoder states into audio embeddings.
3. **40-layer bidirectional Granite editor** (1.6B params) takes the concatenated sequence `[audio | hypothesis_tokens]` and emits edited logits in a single forward pass, with no autoregression and no KV cache.
4. A final CTC collapse over the logits at the text positions yields the transcript.
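
The CTC collapse in step 4 can be illustrated with a minimal, pure-Python sketch. It is not code from this port: the blank index (`BLANK_ID = 0`), the toy 4-token vocabulary, and the helper names `argmax_ids` and `ctc_collapse` are all assumptions made for illustration. It shows the standard greedy rule: take the best token at each position, merge consecutive repeats, then drop blanks.

```python
from itertools import groupby

BLANK_ID = 0  # assumption: the real blank index depends on the model's vocabulary


def argmax_ids(logits):
    """Pick the highest-scoring token id at each position."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def ctc_collapse(token_ids, blank_id=BLANK_ID):
    """Greedy CTC collapse: merge consecutive repeats, then drop blanks."""
    merged = [tok for tok, _ in groupby(token_ids)]
    return [tok for tok in merged if tok != blank_id]


# Toy per-position logits over a hypothetical 4-token vocabulary (id 0 = blank).
toy_logits = [
    [0.1, 2.0, 0.0, 0.0],  # best id: 1
    [0.1, 1.5, 0.0, 0.0],  # best id: 1 (repeat, merged with the previous 1)
    [3.0, 0.0, 0.0, 0.0],  # best id: 0 (blank, dropped)
    [0.0, 1.0, 0.2, 0.0],  # best id: 1 (kept, a blank separates it from the first 1)
    [0.0, 0.0, 0.3, 2.5],  # best id: 3
]

frame_ids = argmax_ids(toy_logits)
print(frame_ids)                # [1, 1, 0, 1, 3]
print(ctc_collapse(frame_ids))  # [1, 1, 3]
```

The blank between the two runs of token 1 is what lets the collapse keep a genuinely repeated token instead of merging it away.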

Total: ~2.25B params, bf16.
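
As a rough guide to what the bf16 weights occupy, the component sizes quoted above can be summed and multiplied by two bytes per parameter. This is a back-of-the-envelope sketch only: the per-component counts are the rounded figures from this card (which is presumably why they sum to slightly less than the ~2.25B total), and the estimate ignores activations and any runtime overhead.

```python
# Rounded component sizes as quoted in the Architecture section.
PARAM_COUNTS = {
    "conformer_encoder": 543_000_000,
    "qformer_projector": 80_000_000,
    "granite_editor": 1_600_000_000,
}
BYTES_PER_BF16_PARAM = 2  # bfloat16 is a 16-bit format


def weight_bytes(param_counts, bytes_per_param=BYTES_PER_BF16_PARAM):
    """Estimate the size of the weights alone, in bytes."""
    return sum(param_counts.values()) * bytes_per_param


total_params = sum(PARAM_COUNTS.values())
print(f"params (rounded components): {total_params / 1e9:.2f}B")
print(f"bf16 weights: {weight_bytes(PARAM_COUNTS) / 1e9:.2f} GB")
```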

## Quickstart

```python
from pathlib import Path
from mlx_audio.stt.utils import load_model

# Load the MLX weights for this repository.
model = load_model(Path("mlx-community/granite-speech-4.1-2b-nar-mlx"))
# Transcribe a local audio file ("audio.wav" is a placeholder path).
out = model.generate("audio.wav")
print(out.text)
```

## Limitations

- Batch size 1 only.
- bf16 baseline only; no quantized variants yet.
- No streaming inference.
- Requires macOS 14+ on Apple Silicon.
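
Because the port runs at batch size 1, several files have to be transcribed one after another. The helper below is a minimal sketch of that loop, not part of mlx-audio: `transcribe_files` is a hypothetical name, and it assumes only what the Quickstart shows, namely that `model.generate(path)` returns an object with a `.text` attribute.

```python
def transcribe_files(model, audio_paths):
    """Transcribe files sequentially (batch size 1).

    `model` is any object exposing `generate(path)` whose result has a
    `.text` attribute, as in the Quickstart. Returns {path: transcript}.
    """
    results = {}
    for path in audio_paths:
        out = model.generate(str(path))  # one file per forward pass
        results[str(path)] = out.text
    return results
```

With a model loaded as in the Quickstart, `transcribe_files(model, ["a.wav", "b.wav"])` would return a dictionary mapping each path to its transcript (the file names here are placeholders).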

## Reference

Upstream model card: https://huggingface.co/ibm-granite/granite-speech-4.1-2b-nar

Validated against the upstream PyTorch reference on the example wav: the output is an exact match on both the 44-token sequence and the transcript string.
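
An exact-match check of this kind can be sketched as a comparison of two token-ID sequences. This is not the validation script used for this port: `first_mismatch` is a hypothetical helper, and the IDs in the usage lines are made up. The reference IDs would have to come from a separate run of the upstream PyTorch model.

```python
def first_mismatch(reference_ids, candidate_ids):
    """Return the index of the first differing token, or None if identical."""
    for i, (ref, cand) in enumerate(zip(reference_ids, candidate_ids)):
        if ref != cand:
            return i
    if len(reference_ids) != len(candidate_ids):
        # One sequence is a strict prefix of the other.
        return min(len(reference_ids), len(candidate_ids))
    return None


# Made-up token IDs, for illustration only.
print(first_mismatch([5, 9, 2], [5, 9, 2]))  # None
print(first_mismatch([5, 9, 2], [5, 7, 2]))  # 1
```

Reporting the first differing position, rather than a plain boolean, makes it easier to see where a port starts to diverge from the reference.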

## License

Apache-2.0, matching the upstream model.