Commit 4c6a0ed (verified) by cstr · 1 parent: 968cd88

README: fused-QKV layout for Q4_K (PLAN #60d)

Files changed (1): README.md (+12, −10)
README.md CHANGED
@@ -22,10 +22,12 @@ The runtime is functional end-to-end: greedy decode through the 36-layer Qwen2 L
 
 ## Available variants
 
-| File | Quant | Size | Recommended |
-|---|---|---|---|
-| `mimo-asr-f16.gguf` | F16 | 14.9 GB | Full precision; needs ~16 GB RAM during inference |
-| `mimo-asr-q4_k.gguf` | Q4_K | 4.5 GB | **Default** — fits in 8 GB RAM, no quality loss visible on JFK |
+| File | Quant | Size | Layout | Recommended |
+|---|---|---|---|---|
+| `mimo-asr-f16.gguf` | F16 | 14.9 GB | separate Q/K/V | Full precision; needs ~16 GB RAM during inference |
+| `mimo-asr-q4_k.gguf` | Q4_K | 4.2 GB | **fused QKV** | **Default** — fits in 8 GB RAM, no quality loss visible on JFK |
+
+The default `mimo-asr-q4_k.gguf` (re-uploaded May 2026, PLAN #60d) ships with the per-LM-layer Q/K/V projections fused into a single `model.layers.{i}.attn.qkv.{weight,bias}` tensor pair, yielding ~1.7× faster per-step decode on M1 vs the prior unfused layout (3058 ms/step → 1806 ms/step on a contended-disk run; ~1.1-1.2× pure-compute on a quiet box). The CrispASR runtime auto-detects either layout: the F16 file above keeps working unchanged via the separate-Q/K/V fallback path. Re-upload of a fused F16 is queued behind disk-headroom availability.
 
 Pair with **[`cstr/mimo-tokenizer-GGUF`](https://huggingface.co/cstr/mimo-tokenizer-GGUF)** — the audio tokenizer is a separate model that converts 16 kHz PCM → 8-channel RVQ codes that this LM consumes.
 
@@ -48,8 +50,8 @@ cmake -B build-ninja-compile -G Ninja -DCMAKE_BUILD_TYPE=Release
 cmake --build build-ninja-compile --target crispasr
 
 # Download both halves
-huggingface-cli download cstr/mimo-asr-GGUF mimo-asr-q4_k.gguf --local-dir models/
-huggingface-cli download cstr/mimo-tokenizer-GGUF mimo-tokenizer-q4_k.gguf --local-dir models/
+hf download cstr/mimo-asr-GGUF mimo-asr-q4_k.gguf --local-dir models/
+hf download cstr/mimo-tokenizer-GGUF mimo-tokenizer-q4_k.gguf --local-dir models/
 
 # Transcribe
 build-ninja-compile/bin/crispasr \
@@ -71,17 +73,17 @@ This matches the upstream Python `MimoAudio.asr_sft` reference verbatim.
 
 ### Performance
 
-On Apple M1, Metal backend, Q4_K:
+On Apple M1, Metal backend, Q4_K, warm-cache:
 
 | Phase | Time |
 |---|---|
 | LM load (mmap, lazy) | ~1 s |
 | Audio tokenize (11 s sample) | ~0.5 s |
 | Prefill (T_groups=71) | ~3 s |
-| Step decode (~25 tokens) | ~30 s (≈1 s/token) |
-| **End-to-end** | **~37 s for 11 s audio (0.3× realtime)** |
+| Step decode (~25 tokens) | ~20 s with the fused-QKV file (≈0.8 s/token; was ~30 s pre-fusion) |
+| **End-to-end** | **~25-30 s for 11 s audio (~0.4× realtime)** |
 
-The step decode dominates; per-step Q4_K dequant is the bottleneck. F16 is faster per-step (no dequant) but 16 GB on disk and needs the full RAM during compute. Future perf wins: KV cache reuse across step graphs, block-batched generation, F16 with mmap-loaded weights.
+Per-step decode is the bottleneck; PLAN #60d (May 2026) fused the per-LM-layer Q/K/V projections into one matmul, replacing 3 mul_mat + 3 ggml_add per layer × 36 layers with 1 + 1, for a measured ~1.7× speedup at the same disk-pressure level. KV cache reuse via cached step graphs (PLAN #51b') is also live. Future perf wins: F16 with fused QKV (queued behind disk headroom), `CRISPASR_KV_QUANT=q8_0` for hour-long inputs (PLAN #60e env-flag is already plumbed; default stays F16 until per-backend rollout completes).
 
 ## Validation
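The QKV fusion in this commit is a pure weight-layout transform: stacking the three projection matrices row-wise lets one matmul plus one bias add produce the same Q, K, and V as three of each. A minimal NumPy sketch of that equivalence, with toy dimensions and random weights (not the real Qwen2 shapes, and not the ggml/CrispASR implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_kv = 64, 64  # toy sizes only; the real model uses the Qwen2 dims

# Separate projections, as in the F16 file's layout.
w_q = rng.standard_normal((d_model, d_model)).astype(np.float32)
w_k = rng.standard_normal((n_kv, d_model)).astype(np.float32)
w_v = rng.standard_normal((n_kv, d_model)).astype(np.float32)
b_q = rng.standard_normal(d_model).astype(np.float32)
b_k = rng.standard_normal(n_kv).astype(np.float32)
b_v = rng.standard_normal(n_kv).astype(np.float32)

x = rng.standard_normal((5, d_model)).astype(np.float32)  # 5 tokens

# Unfused: three matmuls + three bias adds per layer.
q = x @ w_q.T + b_q
k = x @ w_k.T + b_k
v = x @ w_v.T + b_v

# Fused: stack weight rows and bias entries, then one matmul + one add.
w_qkv = np.concatenate([w_q, w_k, w_v], axis=0)
b_qkv = np.concatenate([b_q, b_k, b_v])
qkv = x @ w_qkv.T + b_qkv

# Slice the fused output back into Q/K/V; results match the unfused path.
q2, k2, v2 = np.split(qkv, [d_model, d_model + n_kv], axis=1)
assert np.allclose(q, q2) and np.allclose(k, k2) and np.allclose(v, v2)
```

Because the fused output is just the three projections concatenated, the speedup comes entirely from launching one larger kernel instead of six small ones per layer, not from any numerical shortcut.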
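The README states the runtime auto-detects either layout per file. A sketch of how such a check could work, given a listing of GGUF tensor names: the fused name is the one the commit documents; the separate-projection name below is an illustrative guess, not confirmed by this README.

```python
def detect_attn_layout(tensor_names, layer_idx):
    """Pick the attention layout for one layer from available tensor names."""
    # Fused files ship one tensor pair per layer (name per the README).
    if f"model.layers.{layer_idx}.attn.qkv.weight" in tensor_names:
        return "fused"
    # Otherwise fall back to per-projection tensors (e.g. the F16 file).
    return "separate"

names_q4 = {"model.layers.0.attn.qkv.weight", "model.layers.0.attn.qkv.bias"}
names_f16 = {"model.layers.0.attn.q_proj.weight"}  # hypothetical name
print(detect_attn_layout(names_q4, 0))   # fused
print(detect_attn_layout(names_f16, 0))  # separate
```

Keying the decision off tensor presence rather than a metadata flag is what lets the older F16 upload keep working unchanged.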