GGUF + pure-C++ runtime in CrispASR — Granite 4.1-2B-NAR (single-forward, ~3×)

#4
by cstr - opened

We've added the non-autoregressive 4.1-2B-NAR variant to CrispASR. One C++ binary, one GGUF — no Python.

The NAR pipeline is interesting because almost everything you'd expect to be autoregressive isn't:

  • Single LLM forward over the concatenation [audio, text+slots] with is_causal=False everywhere.
  • Encoder self-conditioning at layer 8 — the layer-8 CTC softmax is fed back into the hidden stream as a 1024-dim residual, and the per-frame blank probability captured here also drives a posterior-weighted pool of the BPE auxiliary head (100353-vocab).
  • 4-layer encoder hidden-state concatenation into the projector (twice as many feeds as PLUS).
  • Slot argmax + unique_consecutive + drop-EOS decode — no greedy/beam loop.
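The decode step in the last bullet is small enough to sketch in full. This is a minimal illustration, not the code in granite_nle.cpp: the function name `decode_slots`, the row-major `[n_slots x n_vocab]` logits layout, and the `eos_id` parameter are all assumptions made for the example. It also assumes the three operations run in the order listed (argmax, then collapse consecutive repeats, then drop EOS).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not the CrispASR API).
// logits: row-major [n_slots x n_vocab] slot logits from the single LLM forward.
// eos_id: assumed id of the EOS token in the text vocabulary.
std::vector<int32_t> decode_slots(const std::vector<float>& logits,
                                  std::size_t n_slots, std::size_t n_vocab,
                                  int32_t eos_id) {
    assert(n_vocab > 0 && logits.size() == n_slots * n_vocab);
    std::vector<int32_t> out;
    bool have_prev = false;
    int32_t prev = 0;
    for (std::size_t s = 0; s < n_slots; ++s) {
        const float* row = logits.data() + s * n_vocab;
        // 1) Per-slot argmax: each slot is decided independently.
        std::size_t best = 0;
        for (std::size_t v = 1; v < n_vocab; ++v) {
            if (row[v] > row[best]) best = v;
        }
        const int32_t tok = static_cast<int32_t>(best);
        // 2) unique_consecutive: skip a token equal to the previous slot's token.
        if (have_prev && tok == prev) continue;
        have_prev = true;
        prev = tok;
        // 3) Drop EOS from the output; it still separates repeats on either side.
        if (tok != eos_id) out.push_back(tok);
    }
    return out;
}
```

Because every slot is resolved with one argmax and one comparison against its neighbour, the cost is linear in the number of slots and no token depends on a previously emitted one.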

It lives in a separate granite_nle.cpp runtime (a sibling of granite_speech.cpp, intentionally not merged; see LEARNINGS, "Lesson 3 — sibling-not-merge for Conformer dialects"). The encoder also runs as a single ggml graph, with the layer-8 self-cond residual, the snapshot concat, and the final CTC logits all captured inline. Runtime drops from 19.27 s to 6.41 s on M1 + Q4_K (~3.0×), and the output is bit-exact end-to-end on JFK via crispasr-diff granite-nle.
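For readers unfamiliar with encoder self-conditioning, the layer-8 step can be sketched per frame as below. This is only a rough illustration under stated assumptions: the function name `self_condition_frame`, the two weight matrices `w_ctc` and `w_back`, their row-major layouts, the omission of biases, and the `blank` index are all hypothetical and not taken from granite_nle.cpp. The real runtime builds this inside the ggml graph rather than with scalar loops.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical per-frame self-conditioning step (illustrative names and layouts).
// h:      hidden vector for one frame, length d_model (1024 in the post); updated in place.
// w_ctc:  row-major [n_ctc x d_model], maps hidden state to CTC logits.
// w_back: row-major [d_model x n_ctc], maps the CTC posterior back to d_model.
// blank:  assumed index of the CTC blank symbol.
// Returns the blank probability of this frame so the caller can reuse it later.
float self_condition_frame(std::vector<float>& h,
                           const std::vector<float>& w_ctc,
                           const std::vector<float>& w_back,
                           std::size_t n_ctc, std::size_t blank) {
    const std::size_t d = h.size();
    std::vector<float> p(n_ctc, 0.0f);
    // Intermediate CTC logits for this frame.
    for (std::size_t i = 0; i < n_ctc; ++i) {
        for (std::size_t j = 0; j < d; ++j) p[i] += w_ctc[i * d + j] * h[j];
    }
    // Numerically stable softmax over the CTC vocabulary.
    const float m = *std::max_element(p.begin(), p.end());
    float sum = 0.0f;
    for (float& x : p) { x = std::exp(x - m); sum += x; }
    for (float& x : p) x /= sum;
    // Project the posterior back to d_model and add it as a residual.
    for (std::size_t j = 0; j < d; ++j) {
        float acc = 0.0f;
        for (std::size_t i = 0; i < n_ctc; ++i) acc += w_back[j * n_ctc + i] * p[i];
        h[j] += acc;
    }
    return p[blank];
}
```

Returning the blank probability from the same pass mirrors the point made in the list above: the posterior is computed once at layer 8 and then reused, rather than recomputed for the auxiliary head.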

Pre-quantised GGUFs (Apache-2.0): cstr/granite-speech-4.1-2b-nar-GGUF

./build/bin/crispasr --backend granite-4.1-nar -m auto -f audio.wav -osrt

Sibling AR variants: 4.1-2b (already discussed in #5) and 4.1-2b-plus.
