---
library_name: mlx
license: apache-2.0
base_model: ibm-granite/granite-speech-4.1-2b-nar
language: [en, fr, de, es, pt]
pipeline_tag: automatic-speech-recognition
tags:
- mlx
- mlx-audio
- speech-to-text
- non-autoregressive
- granite
---
# Granite Speech 4.1 2B NAR — MLX
MLX port of [`ibm-granite/granite-speech-4.1-2b-nar`](https://huggingface.co/ibm-granite/granite-speech-4.1-2b-nar) for Apple Silicon. Runs via [mlx-audio](https://github.com/Blaizzy/mlx-audio).
## Architecture
Non-autoregressive ASR: a CTC hypothesis is refined by a bidirectional LM editor in four stages:
1. **16-layer Conformer encoder** (543M params) produces an initial BPE CTC hypothesis.
2. **2-layer windowed Q-Former projector** (80M params) converts multi-layer encoder states into audio embeddings.
3. **40-layer bidirectional Granite editor** (1.6B params) takes `[audio | hypothesis_tokens]` and emits edited logits in a single forward pass — no autoregression, no KV cache.
4. Final CTC collapse on text-position logits yields the transcript.
Total: ~2.25B params, bf16.
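The CTC collapse in step 4 can be illustrated with a minimal sketch. It assumes greedy decoding (per-position argmax, merge consecutive repeats, drop blanks) and a blank ID of `0`; both are illustrative assumptions, and the blank index and decoding details of the actual model may differ.

```python
from itertools import groupby


def ctc_collapse(frame_ids, blank_id=0):
    """Greedy CTC collapse: merge consecutive repeats, then drop blanks.

    `blank_id=0` is an assumption for this sketch, not the model's
    documented blank index.
    """
    # Merge runs of identical IDs into a single ID.
    merged = [token for token, _ in groupby(frame_ids)]
    # Remove blank symbols; a blank between two equal IDs keeps both.
    return [token for token in merged if token != blank_id]


# Toy per-position argmax IDs (not real model output).
frame_ids = [0, 7, 7, 0, 7, 3, 3, 0, 0, 5]
print(ctc_collapse(frame_ids))  # [7, 7, 3, 5]
```

Note that the blank between the two runs of `7` is what lets a repeated token survive the merge step.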
## Quickstart
```python
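from pathlib import Path
from mlx_audio.stt.utils import load_model

# Load the MLX port of the model through mlx-audio.
model = load_model(Path("mlx-community/granite-speech-4.1-2b-nar-mlx"))

# Transcribe an audio file; the result exposes the transcript as `.text`.
out = model.generate("audio.wav")
print(out.text)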
```
## Limitations
- Only batch size 1 is supported.
- bf16 baseline only; no quantized variants yet.
- No streaming inference.
- Requires macOS 14 or later on Apple Silicon.
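Since only bf16 weights are available, a rough weight-memory estimate may help when sizing a machine. The sketch below uses the component parameter counts from the Architecture section and 2 bytes per bf16 parameter. It covers weights only; activations, audio buffers, and runtime overhead are not included, so treat the result as a lower bound.

```python
# Approximate parameter counts, taken from the Architecture section.
PARAMS = {"encoder": 543e6, "projector": 80e6, "editor": 1.6e9}

BYTES_PER_BF16 = 2  # bf16 stores each parameter in 16 bits


def weight_gib(num_params, bytes_per_param=BYTES_PER_BF16):
    """Return the weight footprint in GiB for a given parameter count."""
    return num_params * bytes_per_param / 2**30


for name, count in PARAMS.items():
    print(f"{name:>9}: {weight_gib(count):.2f} GiB")

total = sum(PARAMS.values())
print(f"    total: {weight_gib(total):.2f} GiB (weights only)")
```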
## Reference
Upstream model card: https://huggingface.co/ibm-granite/granite-speech-4.1-2b-nar
Validated against the upstream PyTorch reference: on the example WAV file, the MLX output matches exactly in both the 44-token sequence and the transcript string.
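A parity check of this kind can be sketched as follows. The helper `first_mismatch` and the placeholder token lists are hypothetical and only illustrate the comparison; they are not the script used for the validation above, and the values are not real model output.

```python
def first_mismatch(reference, candidate):
    """Return the index of the first difference, or None if identical."""
    for index, (ref, cand) in enumerate(zip(reference, candidate)):
        if ref != cand:
            return index
    # One sequence is a strict prefix of the other.
    if len(reference) != len(candidate):
        return min(len(reference), len(candidate))
    return None


# Placeholder values standing in for the two implementations' outputs.
reference_tokens = [101, 7, 42, 9]
mlx_tokens = [101, 7, 42, 9]
reference_text = "placeholder transcript"
mlx_text = "placeholder transcript"

assert first_mismatch(reference_tokens, mlx_tokens) is None
assert reference_text == mlx_text
print("token IDs and transcript match exactly")
```

Comparing token IDs as well as the final string catches cases where two different token sequences happen to decode to the same text.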
## License
Apache-2.0, matching the upstream model.