GGUF + pure-C++ runtime in CrispASR — MiMo-V2.5-ASR (matches reference verbatim on JFK)

by cstr - opened 12 days ago

We've added MiMo-V2.5-ASR to CrispASR as the mimo-asr backend. C++ binary, GGUF — no Python.

src/mimo_asr.cpp implements a 6-layer input_local_transformer (1024d, 64 heads) plus a 36-layer Qwen2 LM (4096d, 32 Q heads / 8 KV heads, SiLU, RoPE θ=640K). MiMo-ASR is token-based ASR: it consumes 8-channel RVQ codes from a separate audio tokenizer rather than a mel spectrogram, so we ship two GGUFs:

cstr/mimo-asr-GGUF — the LM (Q4_K 4.5 GB / F16 15.3 GB)
cstr/mimo-tokenizer-GGUF — the MiMo-Audio-Tokenizer (loaded via --codec-model)
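
For readers unfamiliar with grouped-query attention, the LM figures quoted above (4096d, 32 Q heads / 8 KV heads, RoPE θ=640K) can be unpacked with a little arithmetic. The sketch below is illustrative only and is not taken from src/mimo_asr.cpp: it assumes head_dim = hidden / Q heads, the usual contiguous Q-to-KV grouping, the standard RoPE inverse-frequency formula, and that "640K" means 640,000. All names (kHeadDim, kv_head_for, rope_inv_freq) are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Figures quoted in the post for the 36-layer Qwen2 LM.
constexpr int kHidden  = 4096;
constexpr int kQHeads  = 32;
constexpr int kKVHeads = 8;

// Assumption: "RoPE θ=640K" is read as a base of 640,000.
constexpr double kRopeTheta = 640000.0;

// Assumption: head_dim = hidden / n_q_heads (not stated in the post).
constexpr int kHeadDim = kHidden / kQHeads;

// Number of query heads that share one KV head under grouped-query attention.
constexpr int kGroup = kQHeads / kKVHeads;

// KV head serving query head q, assuming contiguous grouping
// (query heads 0..3 share KV head 0, 4..7 share KV head 1, and so on).
int kv_head_for(int q) {
    assert(q >= 0 && q < kQHeads);
    return q / kGroup;
}

// Standard RoPE inverse frequency for dimension pair i: theta^(-2i / head_dim).
double rope_inv_freq(int i) {
    assert(i >= 0 && i < kHeadDim / 2);
    return std::pow(kRopeTheta, -2.0 * i / kHeadDim);
}
```

Under these assumptions each KV head is shared by 4 query heads, so the KV cache is a quarter of the size it would be with one KV head per query head.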

End-to-end JFK transcription matches the upstream MimoAudio.asr_sft reference verbatim (PLAN #51).

A handful of MiMo-specific lessons from the port:

prompt_inputs_embeds shape is [9, T] — row 0 is text, rows 1–8 are the 8 RVQ channels. Easy to get wrong.
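Capture tensors MUST be marked with ggml_set_output(). Without it, reading prefill_inputs_embeds back via ggml_graph_get_tensor() gave us cos≈0.003 against the reference, and it took ~3 hours before the scheduler-allocator interaction clicked (presumably the allocator is free to reuse the buffer of any tensor not flagged as an output).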
Always prefix the converter invocation with OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 MKL_NUM_THREADS=1 PYTHONUNBUFFERED=1. Otherwise torch.Tensor.to(torch.float16) on bf16 weights stalls silently under OpenMP for ~30 minutes.
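
The first two lessons can be made concrete with a small self-contained sketch. It shows one way to index a [9, T] token grid (row 0 text, rows 1 to 8 the RVQ channels) and the cosine-similarity check of the kind that produced the cos≈0.003 figure. This is a rough illustration, not CrispASR code: TokenGrid, its row-major storage, and cosine_similarity are hypothetical names and assumptions, and no ggml calls are involved.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr int kRows = 9;  // row 0 = text, rows 1..8 = RVQ channels (from the post)

// Hypothetical container for a [9, T] token grid.
// Assumption: row-major storage, i.e. element (row, t) lives at row * T + t.
struct TokenGrid {
    int T;
    std::vector<int> ids;

    explicit TokenGrid(int t) : T(t), ids(static_cast<std::size_t>(kRows) * t, 0) {}

    // Text token at time step t (row 0).
    int& text(int t) { return at(0, t); }

    // RVQ code for zero-based channel 0..7 at time step t (rows 1..8).
    int& rvq(int channel, int t) {
        assert(channel >= 0 && channel < kRows - 1);
        return at(channel + 1, t);
    }

    int& at(int row, int t) {
        assert(row >= 0 && row < kRows && t >= 0 && t < T);
        return ids[static_cast<std::size_t>(row) * T + t];
    }
};

// Cosine similarity between a captured tensor and its reference dump.
// Returns 0 when either vector has zero norm.
double cosine_similarity(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += static_cast<double>(a[i]) * b[i];
        na  += static_cast<double>(a[i]) * a[i];
        nb  += static_cast<double>(b[i]) * b[i];
    }
    if (na == 0.0 || nb == 0.0) return 0.0;
    return dot / (std::sqrt(na) * std::sqrt(nb));
}
```

A value near 1.0 means the captured activations agree with the reference in direction. A value near 0, like the 0.003 above, usually means the buffer being read holds unrelated data rather than a slightly wrong computation.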

Quick start:

git clone https://github.com/CrispStrobe/CrispASR && cd CrispASR
cmake -S . -B build && cmake --build build -j8
./build/bin/crispasr --backend mimo-asr \
    -m mimo-asr-q4_k.gguf \
    --codec-model mimo-tokenizer-q4_k.gguf \
    -f audio.wav
