Commit 4c6a0ed (verified) by cstr · 1 parent: 968cd88

README: fused-QKV layout for Q4_K (PLAN #60d)

Files changed (1): README.md (+12, −10)
README.md CHANGED
@@ -22,10 +22,12 @@ The runtime is functional end-to-end: greedy decode through the 36-layer Qwen2 L
 
 ## Available variants
 
-| File | Quant | Size | Recommended |
-|---|---|---|---|
-| `mimo-asr-f16.gguf` | F16 | 14.9 GB | Full precision; needs ~16 GB RAM during inference |
-| `mimo-asr-q4_k.gguf` | Q4_K | 4.5 GB | **Default** — fits in 8 GB RAM, no quality loss visible on JFK |
+| File | Quant | Size | Layout | Recommended |
+|---|---|---|---|---|
+| `mimo-asr-f16.gguf` | F16 | 14.9 GB | separate Q/K/V | Full precision; needs ~16 GB RAM during inference |
+| `mimo-asr-q4_k.gguf` | Q4_K | 4.2 GB | **fused QKV** | **Default** — fits in 8 GB RAM, no quality loss visible on JFK |
+
+The default `mimo-asr-q4_k.gguf` (re-uploaded May 2026, PLAN #60d) ships with the per-LM-layer Q/K/V projections fused into a single `model.layers.{i}.attn.qkv.{weight,bias}` tensor pair, yielding ~1.7× faster per-step decode on M1 vs the prior unfused layout (3058 ms/step → 1806 ms/step on a contended-disk run; ~1.1-1.2× pure-compute on a quiet box). The CrispASR runtime auto-detects either layout: the F16 file above keeps working unchanged via the separate-Q/K/V fallback path. Re-upload of a fused F16 is queued behind disk-headroom availability.
 
 Pair with **[`cstr/mimo-tokenizer-GGUF`](https://huggingface.co/cstr/mimo-tokenizer-GGUF)** — the audio tokenizer is a separate model that converts 16 kHz PCM → 8-channel RVQ codes that this LM consumes.
 
@@ -48,8 +50,8 @@ cmake -B build-ninja-compile -G Ninja -DCMAKE_BUILD_TYPE=Release
 cmake --build build-ninja-compile --target crispasr
 
 # Download both halves
-huggingface-cli download cstr/mimo-asr-GGUF mimo-asr-q4_k.gguf --local-dir models/
-huggingface-cli download cstr/mimo-tokenizer-GGUF mimo-tokenizer-q4_k.gguf --local-dir models/
+hf download cstr/mimo-asr-GGUF mimo-asr-q4_k.gguf --local-dir models/
+hf download cstr/mimo-tokenizer-GGUF mimo-tokenizer-q4_k.gguf --local-dir models/
 
 # Transcribe
 build-ninja-compile/bin/crispasr \
@@ -71,17 +73,17 @@ This matches the upstream Python `MimoAudio.asr_sft` reference verbatim.
 
 ### Performance
 
-On Apple M1, Metal backend, Q4_K:
+On Apple M1, Metal backend, Q4_K, warm-cache:
 
 | Phase | Time |
 |---|---|
 | LM load (mmap, lazy) | ~1 s |
 | Audio tokenize (11 s sample) | ~0.5 s |
 | Prefill (T_groups=71) | ~3 s |
-| Step decode (~25 tokens) | ~30 s (≈1 s/token) |
-| **End-to-end** | **~37 s for 11 s audio (0.3× realtime)** |
+| Step decode (~25 tokens) | ~20 s with the fused-QKV file (≈0.8 s/token; was ~30 s pre-fusion) |
+| **End-to-end** | **~25-30 s for 11 s audio (~0.4× realtime)** |
 
-The step decode dominates; per-step Q4_K dequant is the bottleneck. F16 is faster per-step (no dequant) but 16 GB on disk and needs the full RAM during compute. Future perf wins: KV cache reuse across step graphs, block-batched generation, F16 with mmap-loaded weights.
+Per-step decode is the bottleneck; PLAN #60d (May 2026) fused the per-LM-layer Q/K/V projections into one matmul, replacing 3 mul_mat + 3 ggml_add per layer × 36 layers with 1 + 1, for a measured ~1.7× speedup at the same disk-pressure level. KV cache reuse via cached step graphs (PLAN #51b') is also live. Future perf wins: F16 with fused QKV (queued behind disk headroom), `CRISPASR_KV_QUANT=q8_0` for hour-long inputs (PLAN #60e env-flag is already plumbed; default stays F16 until per-backend rollout completes).
 
 ## Validation
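The QKV fusion in this commit is a pure weight-layout transform: stacking the three projection matrices row-wise lets one matmul plus one bias add produce the same Q, K, and V as three of each. A minimal NumPy sketch of that equivalence, with toy dimensions and random weights (not the real Qwen2 shapes, and not the ggml/CrispASR implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_kv = 64, 64  # toy sizes only; the real model uses the Qwen2 dims

# Separate projections, as in the F16 file's layout.
w_q = rng.standard_normal((d_model, d_model)).astype(np.float32)
w_k = rng.standard_normal((n_kv, d_model)).astype(np.float32)
w_v = rng.standard_normal((n_kv, d_model)).astype(np.float32)
b_q = rng.standard_normal(d_model).astype(np.float32)
b_k = rng.standard_normal(n_kv).astype(np.float32)
b_v = rng.standard_normal(n_kv).astype(np.float32)

x = rng.standard_normal((5, d_model)).astype(np.float32)  # 5 tokens

# Unfused: three matmuls + three bias adds per layer.
q = x @ w_q.T + b_q
k = x @ w_k.T + b_k
v = x @ w_v.T + b_v

# Fused: stack weight rows and bias entries, then one matmul + one add.
w_qkv = np.concatenate([w_q, w_k, w_v], axis=0)
b_qkv = np.concatenate([b_q, b_k, b_v])
qkv = x @ w_qkv.T + b_qkv

# Slice the fused output back into Q/K/V; results match the unfused path.
q2, k2, v2 = np.split(qkv, [d_model, d_model + n_kv], axis=1)
assert np.allclose(q, q2) and np.allclose(k, k2) and np.allclose(v, v2)
```

Because the fused output is just the three projections concatenated, the speedup comes entirely from launching one larger kernel instead of six small ones per layer, not from any numerical shortcut.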
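The README states the runtime auto-detects either layout per file. A sketch of how such a check could work, given a listing of GGUF tensor names: the fused name is the one the commit documents; the separate-projection name below is an illustrative guess, not confirmed by this README.

```python
def detect_attn_layout(tensor_names, layer_idx):
    """Pick the attention layout for one layer from available tensor names."""
    # Fused files ship one tensor pair per layer (name per the README).
    if f"model.layers.{layer_idx}.attn.qkv.weight" in tensor_names:
        return "fused"
    # Otherwise fall back to per-projection tensors (e.g. the F16 file).
    return "separate"

names_q4 = {"model.layers.0.attn.qkv.weight", "model.layers.0.attn.qkv.bias"}
names_f16 = {"model.layers.0.attn.q_proj.weight"}  # hypothetical name
print(detect_attn_layout(names_q4, 0))   # fused
print(detect_attn_layout(names_f16, 0))  # separate
```

Keying the decision off tensor presence rather than a metadata flag is what lets the older F16 upload keep working unchanged.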