GGUF + pure-C++ runtime in CrispASR — Granite 4.1-2B-Plus on the GPU graph
We've added Granite Speech 4.1-2B-PLUS to CrispASR: one C++ binary, one GGUF, no Python.
The interesting bit is how PLUS rides the GPU path. PLUS's projector takes the concatenation of two encoder hidden-state layers (cat_hidden_layers: [3, …], 1024+1024 → 2048). Done naively, this would mean dropping out of the ggml graph mid-encode to grab the layer-3 post-norm activation. Instead, we capture the post-norm tensors inline with ggml_set_output() and join them to the final encoder output with ggml_concat(). The entire encoder still compiles to a single ggml graph, so PLUS gets full Metal/CUDA/Vulkan acceleration.
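For readers who want the data flow spelled out, here is a minimal sketch of the capture-and-concatenate idea in plain C++. It does not use ggml and is not the CrispASR implementation. `Hidden`, `Layer`, and `encode_with_capture` are hypothetical names invented for illustration, and both the 0-based layer index and the [captured ; final] ordering are assumptions. The point is only that the intermediate activation is saved while the layer loop keeps running, so the encode is never interrupted.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// One frame's hidden state as a flat feature vector (hypothetical type).
using Hidden = std::vector<float>;
// Hypothetical stand-in for one encoder layer, post-norm included.
using Layer = std::function<Hidden(const Hidden &)>;

// Runs every layer in order, keeps a copy of the output of layer
// `capture_idx` (0-based, an assumption), and returns the captured
// activation followed by the final output along the feature axis.
// With 1024-wide layers the result is 1024 + 1024 = 2048 wide.
Hidden encode_with_capture(const std::vector<Layer> & layers,
                           const Hidden & input,
                           std::size_t capture_idx) {
    assert(capture_idx < layers.size());

    Hidden h = input;
    Hidden captured;
    for (std::size_t i = 0; i < layers.size(); ++i) {
        h = layers[i](h);
        if (i == capture_idx) {
            captured = h;  // save the intermediate, keep going
        }
    }

    // Concatenate [captured ; final]; the ordering is an assumption.
    Hidden out;
    out.reserve(captured.size() + h.size());
    out.insert(out.end(), captured.begin(), captured.end());
    out.insert(out.end(), h.begin(), h.end());
    return out;
}
```

Calling `encode_with_capture(layers, frame, 3)` with 1024-wide layers yields a 2048-wide vector for the projector. In the real runtime the same shape arithmetic happens inside the ggml graph rather than in a host-side loop like this one.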
End-to-end on M1 + Q4_K: 9.41 s baseline → 3.74 s with the graph encoder (~2.5×), same transcript byte-for-byte (LEARNINGS "Granite Speech 4.1").
Pre-quantised GGUFs (Apache-2.0): cstr/granite-speech-4.1-2b-plus-GGUF
./build/bin/crispasr --backend granite-4.1-plus -m auto -f audio.wav -osrt
# (set GRANITE_DISABLE_ENCODER_GRAPH=1 to fall back to the per-op CPU loop for diffing)
Sibling 4.1 variants: base #5 and the non-autoregressive 4.1-nar.