granite-speech-4.1-2b-nar-mlx

Duplicate from mouddane/granite-speech-4.1-2b-nar-mlx

	---
	library_name: mlx
	license: apache-2.0
	base_model: ibm-granite/granite-speech-4.1-2b-nar
	language: [en, fr, de, es, pt]
	pipeline_tag: automatic-speech-recognition
	tags:
	- mlx
	- mlx-audio
	- speech-to-text
	- non-autoregressive
	- granite
	---

	# Granite Speech 4.1 2B NAR — MLX

	MLX port of [`ibm-granite/granite-speech-4.1-2b-nar`](https://huggingface.co/ibm-granite/granite-speech-4.1-2b-nar) for Apple Silicon. Runs via [mlx-audio](https://github.com/Blaizzy/mlx-audio).

	## Architecture

	Non-autoregressive ASR via CTC + bidirectional LM editing:

	1. 16-layer Conformer encoder (543M params) produces an initial BPE CTC hypothesis.
	2. 2-layer windowed Q-Former projector (80M params) converts multi-layer encoder states into audio embeddings.
	3. 40-layer bidirectional Granite editor (1.6B params) takes `[audio | hypothesis_tokens]` and emits edited logits in a single forward pass, with no autoregression and no KV cache.
	4. Final CTC collapse on text-position logits yields the transcript.

	Total: ~2.25B params, bf16.
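
The CTC collapse in step 4 can be illustrated with a minimal sketch. It assumes the per-position logits have already been reduced to token ids with an argmax, and that the blank token has id 0; both are assumptions made for illustration, and the function name `ctc_collapse` is hypothetical rather than part of this port or of mlx-audio.

```python
from itertools import groupby
from typing import List, Sequence


def ctc_collapse(frame_ids: Sequence[int], blank_id: int = 0) -> List[int]:
    """Greedy CTC collapse: merge consecutive repeats, then drop blanks.

    blank_id=0 is an assumed value; the real blank id depends on the tokenizer.
    """
    # Step 1: merge runs of identical ids into a single id.
    merged = [token for token, _ in groupby(frame_ids)]
    # Step 2: remove blank symbols, which only separate repeated tokens.
    return [token for token in merged if token != blank_id]


# Toy input: a blank between the two 7s keeps them as distinct tokens.
frames = [0, 7, 7, 0, 7, 3, 3, 0, 0]
print(ctc_collapse(frames))
```

Merging repeats before removing blanks is what lets a blank separate two genuine occurrences of the same token.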

	## Quickstart

	```python
	from pathlib import Path
	from mlx_audio.stt.utils import load_model

	model = load_model(Path("mlx-community/granite-speech-4.1-2b-nar-mlx"))
	out = model.generate("audio.wav")
	print(out.text)
	```

	## Limitations

	- Batch size 1 only.
	- bf16 baseline only; no quantized variants yet.
	- No streaming inference.
	- Requires macOS 14+ on Apple Silicon.
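
Because the batch size is 1, several recordings have to be transcribed one after another. The sketch below shows one way to do that; the helper `transcribe_sequentially` and its `transcribe` callback are hypothetical names introduced here, not part of mlx-audio.

```python
from typing import Callable, Dict, Iterable


def transcribe_sequentially(
    paths: Iterable[str],
    transcribe: Callable[[str], str],
) -> Dict[str, str]:
    """Run a single-file transcription callback over many files, one at a time.

    `transcribe` is a hypothetical callback that maps an audio path to text.
    """
    results: Dict[str, str] = {}
    for path in paths:
        # One call per file, since batched inference is not supported.
        results[path] = transcribe(path)
    return results
```

With the model from the Quickstart, the callback could be `lambda path: model.generate(path).text`.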

	## Reference

	Upstream model card: https://huggingface.co/ibm-granite/granite-speech-4.1-2b-nar

	Validated against the upstream PyTorch reference on the example WAV file: all 44 output tokens and the final transcript string match exactly.
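
A token-level parity check of this kind can be sketched as follows. How the token ids are extracted from the PyTorch and MLX implementations is not covered here, and the helper `first_mismatch` is a hypothetical name used only for illustration.

```python
from typing import Optional, Sequence


def first_mismatch(reference: Sequence[int], candidate: Sequence[int]) -> Optional[int]:
    """Return the index of the first differing position, or None if identical."""
    for index, (ref_id, cand_id) in enumerate(zip(reference, candidate)):
        if ref_id != cand_id:
            return index
    # All shared positions agree; a length difference is still a mismatch.
    if len(reference) != len(candidate):
        return min(len(reference), len(candidate))
    return None
```

A result of `None` corresponds to an exact token match. Any other value gives the position where the two outputs first diverge.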

	## License

	Apache-2.0, matching the upstream model.