---
library_name: mlx
license: apache-2.0
base_model: ibm-granite/granite-speech-4.1-2b-nar
language: [en, fr, de, es, pt]
pipeline_tag: automatic-speech-recognition
tags:
- mlx
- mlx-audio
- speech-to-text
- non-autoregressive
- granite
---

# Granite Speech 4.1 2B NAR — MLX

MLX port of [`ibm-granite/granite-speech-4.1-2b-nar`](https://huggingface.co/ibm-granite/granite-speech-4.1-2b-nar) for Apple Silicon. Runs via [mlx-audio](https://github.com/Blaizzy/mlx-audio).

## Architecture

Non-autoregressive ASR via CTC + bidirectional LM editing:

1. **16-layer Conformer encoder** (543M params) produces an initial BPE CTC hypothesis.
2. **2-layer windowed Q-Former projector** (80M params) converts multi-layer encoder states into audio embeddings.
3. **40-layer bidirectional Granite editor** (1.6B params) takes `[audio | hypothesis_tokens]` and emits edited logits in a single forward pass, with no autoregression and no KV cache.
4. Final CTC collapse on text-position logits yields the transcript.

Total: ~2.25B params, bf16.

## Quickstart

```python
from pathlib import Path
from mlx_audio.stt.utils import load_model

model = load_model(Path("mlx-community/granite-speech-4.1-2b-nar-mlx"))
out = model.generate("audio.wav")
print(out.text)
```

## Limitations

- Batch size 1.
- bf16 baseline only; no quantized variants yet.
- No streaming inference.
- macOS 14+, Apple Silicon.

## Reference

Upstream model card: https://huggingface.co/ibm-granite/granite-speech-4.1-2b-nar

Validated against the upstream PyTorch reference: exact 44-token match and exact transcript string on the example wav.

## License

Apache-2.0, matching the upstream model.
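
## Appendix: CTC collapse sketch

The architecture ends with a CTC collapse over the text-position logits. As a rough illustration of that step only, and not the port's actual code, the sketch below applies the standard greedy CTC rule to a list of per-position argmax token IDs: merge consecutive repeats, then drop the blank symbol. The function name `ctc_collapse`, the sample IDs, and `blank_id=0` are assumptions made for the example; the real blank index depends on the model's tokenizer configuration.

```python
from itertools import groupby


def ctc_collapse(token_ids, blank_id=0):
    """Greedy CTC collapse: merge consecutive repeats, then drop blanks.

    blank_id=0 is an assumption for this sketch, not the model's known value.
    """
    # groupby yields one key per run of identical consecutive IDs.
    merged = [tok for tok, _ in groupby(token_ids)]
    # Blanks separate genuine repeats, so they are removed only after merging.
    return [tok for tok in merged if tok != blank_id]


# Hypothetical per-position argmax IDs (0 is the assumed blank).
position_ids = [0, 7, 7, 0, 7, 3, 3, 0, 0, 5]
print(ctc_collapse(position_ids))  # [7, 7, 3, 5]
```

The order of the two passes matters: the blank between the two runs of `7` is what keeps them as two separate tokens in the output, so dropping blanks first would wrongly merge them.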