README: fused-QKV layout for Q4_K (PLAN #60d)
README.md (changed)

## Available variants

| File | Quant | Size | Layout | Recommended |
|---|---|---|---|---|
| `mimo-asr-f16.gguf` | F16 | 14.9 GB | separate Q/K/V | Full precision; needs ~16 GB RAM during inference |
| `mimo-asr-q4_k.gguf` | Q4_K | 4.2 GB | **fused QKV** | **Default** — fits in 8 GB RAM, no quality loss visible on the JFK sample |

The default `mimo-asr-q4_k.gguf` (re-uploaded May 2026, PLAN #60d) ships with per-LM-layer Q/K/V projections fused into a single `model.layers.{i}.attn.qkv.{weight,bias}` tensor pair, yielding ~1.7× faster per-step decode on M1 vs the prior unfused layout (3058 ms/step → 1806 ms/step on a contended-disk run; ~1.1-1.2× pure-compute on a quiet box). The CrispASR runtime auto-detects either layout: the F16 file above keeps working unchanged via the separate-Q/K/V fallback path. Re-upload of a fused F16 is queued behind disk-headroom availability.
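
For illustration, here is a minimal sketch of how that auto-detection can work in a ggml-based loader: probe for the fused tensor name first and fall back to separate projections when it is absent. This is not the actual CrispASR code; the fused names are the ones documented above, while the separate-projection names (`self_attn.q_proj` / `.k_proj` / `.v_proj`) are assumed HF-style and may differ in the real F16 file.

```cpp
// Hypothetical sketch, not the CrispASR loader. Assumes weights are addressable
// by name in a ggml_context after GGUF loading; ggml_get_tensor() returns NULL
// for a missing name, which is what drives the layout detection.
#include <string>
#include "ggml.h"

struct layer_attn {
    ggml_tensor *qkv_w = nullptr, *qkv_b = nullptr;              // fused layout (PLAN #60d Q4_K file)
    ggml_tensor *q_w = nullptr, *k_w = nullptr, *v_w = nullptr;  // separate-Q/K/V fallback (F16 file)
    bool fused = false;
};

static layer_attn detect_attn_layout(ggml_context *ctx, int layer) {
    layer_attn a;
    const std::string base = "model.layers." + std::to_string(layer);

    // Prefer the fused tensor pair when the file ships it.
    a.qkv_w = ggml_get_tensor(ctx, (base + ".attn.qkv.weight").c_str());
    if (a.qkv_w != nullptr) {
        a.qkv_b = ggml_get_tensor(ctx, (base + ".attn.qkv.bias").c_str());
        a.fused = true;
        return a;
    }

    // Otherwise fall back to separate projections (names assumed, HF-style).
    a.q_w = ggml_get_tensor(ctx, (base + ".self_attn.q_proj.weight").c_str());
    a.k_w = ggml_get_tensor(ctx, (base + ".self_attn.k_proj.weight").c_str());
    a.v_w = ggml_get_tensor(ctx, (base + ".self_attn.v_proj.weight").c_str());
    return a;
}
```

Either branch feeds the same attention graph; only the projection step differs between the two file layouts.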

Pair with **[`cstr/mimo-tokenizer-GGUF`](https://huggingface.co/cstr/mimo-tokenizer-GGUF)** — the audio tokenizer is a separate model that converts 16 kHz PCM → 8-channel RVQ codes that this LM consumes.

```bash
cmake -B build-ninja-compile -G Ninja -DCMAKE_BUILD_TYPE=Release
cmake --build build-ninja-compile --target crispasr

# Download both halves
hf download cstr/mimo-asr-GGUF mimo-asr-q4_k.gguf --local-dir models/
hf download cstr/mimo-tokenizer-GGUF mimo-tokenizer-q4_k.gguf --local-dir models/

# Transcribe
build-ninja-compile/bin/crispasr \
```

This matches the upstream Python `MimoAudio.asr_sft` reference verbatim.

### Performance

On Apple M1, Metal backend, Q4_K, warm-cache:

| Phase | Time |
|---|---|
| LM load (mmap, lazy) | ~1 s |
| Audio tokenize (11 s sample) | ~0.5 s |
| Prefill (T_groups=71) | ~3 s |
| Step decode (~25 tokens) | ~20 s with the fused-QKV file (≈0.8 s/token; was ~30 s pre-fusion) |
| **End-to-end** | **~25-30 s for 11 s audio (~0.4× realtime)** |

Per-step decode is the bottleneck; PLAN #60d (May 2026) fused the per-LM-layer Q/K/V projections into one matmul, replacing 3 mul_mat + 3 ggml_add per layer (× 36 layers) with 1 + 1, for a measured ~1.7× speedup at the same disk-pressure level. KV-cache reuse via cached step graphs (PLAN #51b') is also live. Future perf wins: F16 with fused QKV (queued behind disk headroom), and `CRISPASR_KV_QUANT=q8_0` for hour-long inputs (the PLAN #60e env flag is already plumbed; the default stays F16 until the per-backend rollout completes).
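
As a rough sketch of what that fusion looks like in a ggml graph (illustrative only, not the actual CrispASR graph builder; `n_q` / `n_kv` are placeholder projection widths, not the real Qwen2 config), the three projections collapse into one matmul plus one bias add, with Q/K/V recovered as byte-offset views of the fused result:

```cpp
// Illustrative only. cur: [n_embd, n_tokens] activations; qkv_w / qkv_b: the fused
// model.layers.{i}.attn.qkv.{weight,bias} pair; n_q / n_kv: assumed output widths
// of the Q and K/V projections.
#include "ggml.h"

static void qkv_fused(ggml_context *ctx, ggml_tensor *cur,
                      ggml_tensor *qkv_w, ggml_tensor *qkv_b,
                      int64_t n_q, int64_t n_kv,
                      ggml_tensor **Q, ggml_tensor **K, ggml_tensor **V) {
    const int64_t n_tokens = cur->ne[1];

    // Before PLAN #60d: 3x ggml_mul_mat + 3x ggml_add per layer, x36 layers.
    // After: 1x ggml_mul_mat + 1x ggml_add, then three zero-copy views.
    ggml_tensor *qkv = ggml_add(ctx, ggml_mul_mat(ctx, qkv_w, cur), qkv_b);

    const size_t es = ggml_element_size(qkv);
    *Q = ggml_view_2d(ctx, qkv, n_q,  n_tokens, qkv->nb[1], 0);
    *K = ggml_view_2d(ctx, qkv, n_kv, n_tokens, qkv->nb[1], n_q * es);
    *V = ggml_view_2d(ctx, qkv, n_kv, n_tokens, qkv->nb[1], (n_q + n_kv) * es);
    // Downstream head reshapes may need ggml_cont() on these non-contiguous views.
}
```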
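
A tiny sketch of how such an env flag is typically plumbed (assumed wiring, not the shipped PLAN #60e code): the value only selects the KV-cache tensor type at context setup, so the default stays F16 unless explicitly overridden.

```cpp
// Assumed wiring for the CRISPASR_KV_QUANT opt-in; not the actual implementation.
#include <cstdlib>
#include <cstring>
#include "ggml.h"

static ggml_type kv_cache_type() {
    const char *v = std::getenv("CRISPASR_KV_QUANT");
    if (v != nullptr && std::strcmp(v, "q8_0") == 0) {
        return GGML_TYPE_Q8_0;  // roughly halves KV-cache memory vs F16 for hour-long inputs
    }
    return GGML_TYPE_F16;       // default until the per-backend rollout completes
}
```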
## Validation