Granite Speech 4.1 2B NAR — MLX

MLX port of ibm-granite/granite-speech-4.1-2b-nar for Apple Silicon. Runs via mlx-audio.

Architecture

Non-autoregressive ASR via CTC + bidirectional LM editing:

  1. 16-layer Conformer encoder (543M params) produces an initial BPE CTC hypothesis.
  2. 2-layer windowed Q-Former projector (80M params) converts multi-layer encoder states into audio embeddings.
  3. 40-layer bidirectional Granite editor (1.6B params) takes [audio | hypothesis_tokens] and emits edited logits in a single forward pass — no autoregression, no KV cache.
  4. Final CTC collapse on text-position logits yields the transcript.
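
The CTC collapse in step 4 can be sketched in plain Python. This is a minimal illustration of generic greedy CTC decoding (per-frame argmax, merge consecutive repeats, drop blanks), not the code this port actually runs. The function name `ctc_greedy_collapse`, the blank index `0`, and the toy logits are assumptions made for the example; the real blank id and vocabulary come from the model's tokenizer.

```python
from typing import List, Optional, Sequence

BLANK_ID = 0  # assumption: the real model's blank index may differ


def ctc_greedy_collapse(
    logits: Sequence[Sequence[float]], blank_id: int = BLANK_ID
) -> List[int]:
    """Greedy CTC decode: per-frame argmax, merge repeats, drop blanks."""
    tokens: List[int] = []
    prev: Optional[int] = None
    for frame in logits:
        # Index of the highest-scoring symbol in this frame.
        best = max(range(len(frame)), key=frame.__getitem__)
        # Emit only when the symbol changes and is not the blank.
        if best != prev and best != blank_id:
            tokens.append(best)
        prev = best
    return tokens


# Toy input: 6 frames over a 4-symbol vocabulary (id 0 is the blank).
toy_logits = [
    [0.1, 2.0, 0.0, 0.0],  # 1
    [0.1, 1.5, 0.0, 0.0],  # 1 again, merged with the previous frame
    [3.0, 0.0, 0.0, 0.0],  # blank
    [0.0, 1.0, 0.2, 0.0],  # 1, kept because a blank separates it
    [0.0, 0.0, 0.3, 2.5],  # 3
    [1.0, 0.0, 0.0, 0.2],  # blank
]
print(ctc_greedy_collapse(toy_logits))  # [1, 1, 3]
```

The blank between the second and fourth frames is what lets the same token appear twice in a row in the output.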

Total: ~2.25B params, bf16.
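
As a rough sanity check on the figures above, the weight footprint can be estimated from the listed component sizes at 2 bytes per bf16 parameter. The counts below are the rounded numbers from this card, so the sum lands slightly under the stated total; the estimate covers weights only and ignores activations and framework overhead.

```python
# Component parameter counts as listed in this card (rounded figures).
PARAMS = {
    "conformer_encoder": 543e6,
    "qformer_projector": 80e6,
    "granite_editor": 1.6e9,
}
BYTES_PER_PARAM_BF16 = 2  # bf16 stores each parameter in 16 bits

total_params = sum(PARAMS.values())
weight_bytes = total_params * BYTES_PER_PARAM_BF16
print(f"{total_params / 1e9:.2f}B params, {weight_bytes / 1e9:.2f} GB of weights")
```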

Quickstart

from pathlib import Path
from mlx_audio.stt.utils import load_model

# Load the MLX weights for this port.
model = load_model(Path("mlx-community/granite-speech-4.1-2b-nar-mlx"))

# Transcribe a single audio file and print the transcript.
out = model.generate("audio.wav")
print(out.text)

Limitations

  • Batch size 1.
  • bf16 baseline only — no quantized variants yet.
  • No streaming inference.
  • macOS 14+, Apple Silicon.
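
Because batching is not supported, several files have to be transcribed one at a time. A minimal sketch, assuming only the `model.generate(path)` call and the `.text` attribute shown in the Quickstart; the helper name `transcribe_files` is hypothetical and not part of mlx-audio.

```python
from typing import Dict, Iterable


def transcribe_files(model, paths: Iterable[str]) -> Dict[str, str]:
    """Transcribe files sequentially and map each path to its transcript."""
    results: Dict[str, str] = {}
    for path in paths:
        out = model.generate(path)  # same call as in the Quickstart
        results[path] = out.text
    return results
```

With a model loaded as in the Quickstart, `transcribe_files(model, ["a.wav", "b.wav"])` would return a dictionary keyed by file path.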

Reference

Upstream model card: https://huggingface.co/ibm-granite/granite-speech-4.1-2b-nar

Validated against the upstream PyTorch reference: exact 44-token match and exact transcript string on the example wav.

License

Apache-2.0, matching the upstream model.
