GGUF + pure-C++ runtime in CrispASR — MiMo-V2.5-ASR (matches reference verbatim on JFK)
#3
by cstr - opened
We've added MiMo-V2.5-ASR to CrispASR as the mimo-asr backend. C++ binary, GGUF — no Python.
src/mimo_asr.cpp — 6L input_local_transformer (1024d, 64 heads) + 36L Qwen2 LM (4096d, 32 Q heads / 8 KV heads, SiLU, RoPE θ=640K). MiMo-ASR is token-based ASR: it consumes 8-channel RVQ codes from a separate audio tokenizer, not a mel spectrogram. So we ship two GGUFs:
- cstr/mimo-asr-GGUF: the LM (Q4_K 4.5 GB / F16 15.3 GB)
- cstr/mimo-tokenizer-GGUF: the MiMo-Audio-Tokenizer (loaded via --codec-model)
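For readers sizing buffers against the LM shapes quoted above, here is a rough sketch of the grouped-query attention arithmetic. It is not taken from src/mimo_asr.cpp: the struct and function names are hypothetical, and it assumes the usual convention that the per-head dimension is d_model divided by the number of Q heads.

```cpp
#include <cassert>

// Hypothetical helper type, for illustration only (not CrispASR's API).
struct AttnDims {
    int d_model;     // hidden size
    int n_q_heads;   // number of query heads
    int n_kv_heads;  // number of key/value heads (grouped-query attention)
};

// ASSUMPTION: head_dim = d_model / n_q_heads, the common Qwen2-style convention.
inline int head_dim(const AttnDims& a) {
    assert(a.n_q_heads > 0 && a.d_model % a.n_q_heads == 0);
    return a.d_model / a.n_q_heads;
}

// How many query heads share one K/V head.
inline int q_heads_per_kv(const AttnDims& a) {
    assert(a.n_kv_heads > 0 && a.n_q_heads % a.n_kv_heads == 0);
    return a.n_q_heads / a.n_kv_heads;
}

// Width of the K (or V) projection output, i.e. per-token KV cache row size.
inline int kv_proj_dim(const AttnDims& a) {
    return head_dim(a) * a.n_kv_heads;
}
```

Under that assumption, the LM figures (4096d, 32 Q heads, 8 KV heads) give a head dimension of 128, 4 query heads per KV head, and a 1024-wide K/V projection; the input_local_transformer figures (1024d, 64 heads) give a head dimension of 16.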
End-to-end JFK transcription matches the upstream MimoAudio.asr_sft reference verbatim (PLAN #51).
A handful of MiMo-specific lessons from the port:
- prompt_inputs_embeds shape is [9, T]: row 0 is text, rows 1–8 are the 8 RVQ channels. Easy to get wrong.
- Capture tensors MUST be marked with ggml_set_output(). Reading prefill_inputs_embeds back via ggml_graph_get_tensor() without it gave us cos≈0.003 against the reference for ~3 hours before the scheduler-allocator interaction clicked.
- Always prefix the converter with OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 MKL_NUM_THREADS=1 PYTHONUNBUFFERED=1: torch.Tensor.to(torch.float16) on bf16 weights silently deadlocks under OpenMP for ~30 minutes.
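The first two lessons can be made concrete with a small, self-contained sketch. It is illustrative only and not the code in src/mimo_asr.cpp: PromptGrid and cosine_similarity are hypothetical names, and the row-major layout (each of the 9 rows holding T contiguous entries) is an assumption about how a [9, T] array would typically be stored on the host side.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr int kRows = 9;         // row 0 = text, rows 1..8 = the 8 RVQ channels
constexpr int kRvqChannels = 8;

// Hypothetical [9, T] container; ASSUMES row-major storage (T contiguous per row).
template <typename Elem>
struct PromptGrid {
    int T;
    std::vector<Elem> data;

    explicit PromptGrid(int t) : T(t), data(static_cast<std::size_t>(kRows) * t) {}

    Elem& at(int row, int t) {
        assert(row >= 0 && row < kRows && t >= 0 && t < T);
        return data[static_cast<std::size_t>(row) * T + t];
    }
    // Named accessors make the "row 0 is text" rule hard to get wrong.
    Elem& text(int t) { return at(0, t); }
    Elem& rvq(int channel, int t) {
        assert(channel >= 0 && channel < kRvqChannels);
        return at(1 + channel, t);  // channel 0 lives in row 1
    }
};

// Cosine similarity between a captured tensor and a reference dump.
// Returns 0.0 if either vector has zero norm.
inline double cosine_similarity(const std::vector<float>& a,
                                const std::vector<float>& b) {
    assert(a.size() == b.size());
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += static_cast<double>(a[i]) * b[i];
        na  += static_cast<double>(a[i]) * a[i];
        nb  += static_cast<double>(b[i]) * b[i];
    }
    if (na == 0.0 || nb == 0.0) return 0.0;
    return dot / (std::sqrt(na) * std::sqrt(nb));
}
```

A captured tensor that really matches the reference should score close to 1.0 here. A value near zero, like the 0.003 above, suggests the buffer being read back does not hold the expected data at all, which points at capture or allocation rather than at a numerical precision problem.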
Quick start:
git clone https://github.com/CrispStrobe/CrispASR && cd CrispASR
cmake -S . -B build && cmake --build build -j8
./build/bin/crispasr --backend mimo-asr \
-m mimo-asr-q4_k.gguf \
--codec-model mimo-tokenizer-q4_k.gguf \
-f audio.wav