Granite Speech 4.1 2B NAR — MLX (5-bit)

5-bit quantized MLX port of ibm-granite/granite-speech-4.1-2b-nar for Apple Silicon. Runs via mlx-audio.

The bf16 baseline lives at mlx-community/granite-speech-4.1-2b-nar-mlx. This 5-bit variant has a ~47% smaller weights file (2.37 GB vs 4.51 GB) and runs about 11% slower on the reference clip (1.48 s vs 1.33 s). The cost is two minor punctuation differences in the transcript.

Size and performance

Variant              model.safetensors    Inference (24.9 s audio, M-series)    Speed (inverse RTF)
bf16 (baseline)      4.51 GB              1.33 s                                18.7× real-time
5-bit (this repo)    2.37 GB              1.48 s                                16.8× real-time
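The speed and size figures follow from simple ratios. The short check below uses only the numbers reported in the table; the variable names are illustrative.

```python
# Figures copied from the table above.
audio_seconds = 24.9
inference_seconds = {"bf16": 1.33, "5-bit": 1.48}
size_gb = {"bf16": 4.51, "5-bit": 2.37}

# Speed factor: seconds of audio transcribed per second of compute.
speed = {name: audio_seconds / t for name, t in inference_seconds.items()}

# Fraction of the bf16 file size saved by the 5-bit variant.
size_reduction = 1 - size_gb["5-bit"] / size_gb["bf16"]

for name, factor in speed.items():
    print(f"{name}: {factor:.1f}x real-time")
print(f"size reduction: {size_reduction:.1%}")
```

The speed factor is the inverse of the usual real-time factor (compute time divided by audio time), so higher is better here.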

Quantization is applied only to the editor's Linear projections (1.6 B params); the configured attention_multiplier, embedding_multiplier, logits_scaling, and residual_multiplier values are preserved. The Conformer encoder (543 M params) and Q-Former projector (80 M params) stay at full precision (bf16): they are sensitive to quantization noise, and they are small enough that quantizing them would save little memory.
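One way to express "quantize only the editor's Linear layers" is a predicate over module paths. The sketch below is not the conversion script used for this repo. The "editor." path prefix is a hypothetical name, and the stand-in classes exist only so the predicate can be run without MLX installed.

```python
def should_quantize(path: str, module: object) -> bool:
    """Return True only for Linear layers that live under the editor.

    `path` is the dotted module path and `module` is the layer itself.
    The "editor." prefix is hypothetical; check the real parameter tree.
    """
    is_linear = type(module).__name__ == "Linear"
    in_editor = path.startswith("editor.")
    return is_linear and in_editor


# Stand-in classes (hypothetical) for exercising the predicate.
class Linear: ...
class LayerNorm: ...


print(should_quantize("editor.layers.0.mlp.up_proj", Linear()))  # True
print(should_quantize("encoder.layers.0.ff.linear", Linear()))   # False
print(should_quantize("editor.layers.0.norm", LayerNorm()))      # False
```

In MLX, a predicate with this `(path, module)` shape can be passed as the `class_predicate` argument of `mlx.nn.quantize` together with `bits=5`. Whether this repo was produced that way is an assumption.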

Quality drift vs bf16 on the multilingual reference sample is limited to:

  • One added comma (la nuit suivante, vs la nuit suivante) — arguably more correct French.
  • Spacing inside a quoted dialogue ("si vous vs " si vous).

Word-level transcript is otherwise identical to the bf16 baseline. No characters or accents are corrupted (paraîtra, pêcheur, soeur all intact).
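A word-level comparison of this kind can be sketched by stripping punctuation and whitespace differences before comparing. The two strings below are short illustrative fragments built from the examples above, not the full reference transcripts, and the `words` helper is hypothetical.

```python
import re


def words(transcript: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    no_punct = re.sub(r"[^\w\s]", " ", transcript.lower())
    return no_punct.split()


# Illustrative fragments only (not the real transcripts).
bf16 = 'la nuit suivante " si vous'
five_bit = 'la nuit suivante, "si vous'

print(bf16 == five_bit)                # False
print(words(bf16) == words(five_bit))  # True
```

Accented characters count as word characters in Python 3 regular expressions, so words such as paraîtra and pêcheur survive the normalization unchanged.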

Architecture

Non-autoregressive ASR via CTC + bidirectional LM editing:

  1. 16-layer Conformer encoder (543 M params) produces an initial BPE CTC hypothesis.
  2. 2-layer windowed Q-Former projector (80 M params) converts multi-layer encoder states into audio embeddings.
  3. 40-layer bidirectional Granite editor (1.6 B params) takes [audio | hypothesis_tokens] and emits edited logits in a single forward pass — no autoregression, no KV cache.
  4. Final CTC collapse on text-position logits yields the transcript.
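The CTC collapse in step 4 can be illustrated with the standard greedy rule: merge consecutive repeats, then drop blanks. This is a minimal sketch. The blank ID of 0 and the token IDs are assumptions, and the model's actual decoding code may differ.

```python
from itertools import groupby


def ctc_collapse(token_ids: list[int], blank_id: int = 0) -> list[int]:
    """Merge consecutive repeats, then drop blank tokens."""
    merged = [token for token, _ in groupby(token_ids)]
    return [token for token in merged if token != blank_id]


# Per-position argmax IDs (hypothetical values); 0 is the assumed blank ID.
position_ids = [0, 7, 7, 0, 7, 3, 3, 0, 0, 5]
print(ctc_collapse(position_ids))  # [7, 7, 3, 5]
```

The blank between the two runs of 7 is what allows the same token to appear twice in a row in the output.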

Total: ~2.25 B params. Editor quantized to 5-bit; encoder + projector remain bf16.
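As a rough sanity check, the file sizes can be estimated from the parameter counts in the list above. The group size of 64 and the per-group overhead of one 16-bit scale plus one 16-bit bias are assumptions; this card does not state them.

```python
# Parameter counts from the architecture list above.
ENCODER_PARAMS = 543e6
PROJECTOR_PARAMS = 80e6
EDITOR_PARAMS = 1.6e9

BF16_BITS = 16
QUANT_BITS = 5
GROUP_SIZE = 64         # assumption: not stated in this card
OVERHEAD_BITS = 2 * 16  # assumption: 16-bit scale + 16-bit bias per group


def gigabytes(params: float, bits_per_weight: float) -> float:
    """Convert a parameter count and bit width to decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


# Effective bits per quantized weight, including per-group overhead.
editor_bits = QUANT_BITS + OVERHEAD_BITS / GROUP_SIZE
full_precision = gigabytes(ENCODER_PARAMS + PROJECTOR_PARAMS, BF16_BITS)

bf16_total = full_precision + gigabytes(EDITOR_PARAMS, BF16_BITS)
mixed_total = full_precision + gigabytes(EDITOR_PARAMS, editor_bits)

print(f"bf16 estimate:  {bf16_total:.2f} GB")
print(f"5-bit estimate: {mixed_total:.2f} GB")
```

Under these assumptions the estimates come out slightly below the reported 4.51 GB and 2.37 GB. Rounded parameter counts and editor tensors that are not quantized would plausibly account for the gap.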

Quickstart

from pathlib import Path
from mlx_audio.stt.utils import load_model

# Load the 5-bit model.
model = load_model(Path("mlx-community/granite-speech-4.1-2b-nar-mlx-5bit"))

# Transcribe one audio file and print the resulting text.
out = model.generate("audio.wav")
print(out.text)

Limitations

  • Batch size 1 only.
  • No streaming inference.
  • macOS 14+, Apple Silicon (M-series).
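Because only batch size 1 is supported, several files have to be transcribed one after another. The sketch below relies only on the `generate(...)` call and `.text` attribute shown in the Quickstart. `transcribe_all`, `FakeModel`, and `FakeResult` are hypothetical names, and the fake classes stand in for the real model so the loop can be tried anywhere.

```python
from typing import Iterable


def transcribe_all(model, audio_paths: Iterable[str]) -> dict[str, str]:
    """Transcribe files one by one, since batching is not supported."""
    return {path: model.generate(path).text for path in audio_paths}


# Stand-in objects (hypothetical) so the loop runs without the real model.
class FakeResult:
    def __init__(self, text: str) -> None:
        self.text = text


class FakeModel:
    def generate(self, path: str) -> FakeResult:
        return FakeResult(f"transcript of {path}")


results = transcribe_all(FakeModel(), ["a.wav", "b.wav"])
print(results["a.wav"])  # transcript of a.wav
```

With the real model, the object returned by `load_model` in the Quickstart would take the place of `FakeModel()`.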

Reference

Validated against the upstream PyTorch reference: exact transcript match on the bf16 baseline. The 5-bit variant matches the bf16 baseline at the word level (with the two minor punctuation differences listed above).

License

Apache-2.0, matching the upstream model.
