GGUF + pure-C++ runtime in CrispASR — MiMo-V2.5-ASR (matches reference verbatim on JFK)

#3
by cstr

We've added MiMo-V2.5-ASR to CrispASR as the mimo-asr backend. C++ binary, GGUF — no Python.

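src/mimo_asr.cpp implements a 6-layer input_local_transformer (1024d, 64 heads) plus a 36-layer Qwen2 LM (4096d, 32 Q heads / 8 KV heads, SiLU, RoPE θ=640K). MiMo-ASR is token-based ASR: it consumes 8-channel RVQ codes from a separate audio tokenizer, not a mel spectrogram. So we ship two GGUFs: one for the ASR model and one for the audio tokenizer (passed as -m and --codec-model in the quick start below).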

End-to-end JFK transcription matches the upstream MimoAudio.asr_sft reference verbatim (PLAN #51).
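
For readers sizing buffers against the numbers above, here is a minimal sketch of the shapes they imply. The struct and constant names are illustrative and not taken from src/mimo_asr.cpp. It assumes head_dim = d_model / n_head for both stacks and that the local transformer uses plain multi-head attention; the post states neither.

```cpp
// Illustrative only: names are hypothetical, not from src/mimo_asr.cpp.
// The numbers are the ones quoted in the post.
struct StackDims {
    int n_layer;
    int d_model;
    int n_head;     // query heads
    int n_head_kv;  // key/value heads

    // Assumes head_dim = d_model / n_head (not stated in the post).
    constexpr int head_dim() const { return d_model / n_head; }
    // Query heads sharing one KV head under grouped-query attention.
    constexpr int gqa_group() const { return n_head / n_head_kv; }
    // Width of the K (or V) projection output.
    constexpr int kv_dim() const { return head_dim() * n_head_kv; }
};

// Assumption: the local transformer has n_head_kv == n_head; the post
// only gives its width and head count.
constexpr StackDims kInputLocal{6, 1024, 64, 64};
constexpr StackDims kQwen2LM{36, 4096, 32, 8};

static_assert(kQwen2LM.head_dim() == 128, "4096 / 32");
static_assert(kQwen2LM.gqa_group() == 4, "32 Q heads over 8 KV heads");
static_assert(kQwen2LM.kv_dim() == 1024, "8 KV heads * 128");
static_assert(kInputLocal.head_dim() == 16, "1024 / 64");
```

Under these assumptions each KV head in the LM serves 4 query heads, so the K and V projections are a quarter of the Q width.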

A handful of MiMo-specific lessons from the port:

  • prompt_inputs_embeds shape is [9, T] — row 0 is text, rows 1–8 are the 8 RVQ channels. Easy to get wrong.
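  • Capture tensors MUST be marked with ggml_set_output(). Without the flag the graph allocator may reuse the tensor's buffer for later nodes, so the read-back returns unrelated data. Reading prefill_inputs_embeds back via ggml_graph_get_tensor() without it gave us cos≈0.003 against the reference for ~3 hours before the scheduler-allocator interaction clicked.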
  • Always prefix the converter with OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 MKL_NUM_THREADS=1 PYTHONUNBUFFERED=1. Otherwise torch.Tensor.to(torch.float16) on bf16 weights deadlocks under OpenMP, silently, for ~30 minutes.
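
To make the [9, T] layout from the first bullet concrete, here is a minimal sketch of packing one text row and eight RVQ rows into a row-major grid. PromptGrid and make_prompt_grid are hypothetical names, not CrispASR APIs. The sketch also assumes the entries are integer ids and that the tokenizer returns codes time-major (codes[t][c]), which is exactly the situation where a transposed grid slips through unnoticed.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

constexpr int kRvqChannels = 8;
constexpr int kRows = 1 + kRvqChannels;  // row 0 = text, rows 1..8 = RVQ channels

// Row-major [9, T] grid: element (row, t) lives at ids[row * T + t].
struct PromptGrid {
    int T = 0;
    std::vector<std::int32_t> ids;

    std::int32_t at(int row, int t) const {
        assert(row >= 0 && row < kRows && t >= 0 && t < T);
        return ids[static_cast<std::size_t>(row) * T + t];
    }
};

// Hypothetical helper. Assumes codes[t][c] is channel c at step t,
// so the RVQ part of the grid is the transpose of the input.
PromptGrid make_prompt_grid(
    const std::vector<std::int32_t>& text_ids,
    const std::vector<std::array<std::int32_t, kRvqChannels>>& codes) {
    if (text_ids.size() != codes.size()) {
        throw std::invalid_argument("text row and RVQ codes must share length T");
    }
    PromptGrid g;
    g.T = static_cast<int>(text_ids.size());
    g.ids.resize(static_cast<std::size_t>(kRows) * g.T);
    for (int t = 0; t < g.T; ++t) {
        g.ids[t] = text_ids[t];  // row 0: text
        for (int c = 0; c < kRvqChannels; ++c) {
            // rows 1..8: one row per RVQ channel
            g.ids[static_cast<std::size_t>(c + 1) * g.T + t] = codes[t][c];
        }
    }
    return g;
}
```

With this layout, g.at(0, t) is the text token at step t and g.at(c + 1, t) is RVQ channel c at the same step.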

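The cos≈0.003 figure above is a cosine similarity between a captured tensor and the reference dump. A minimal sketch of that check, assuming both tensors have already been flattened into float vectors of equal length; cosine_similarity is a hypothetical helper, not CrispASR's own diff tooling.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: cosine similarity of two flattened tensors.
// Returns NaN if the sizes differ, the input is empty, or either norm is zero.
double cosine_similarity(const std::vector<float>& a, const std::vector<float>& b) {
    const double nan = std::numeric_limits<double>::quiet_NaN();
    if (a.size() != b.size() || a.empty()) return nan;

    double dot = 0.0, na = 0.0, nb = 0.0;  // accumulate in double for stability
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += static_cast<double>(a[i]) * b[i];
        na  += static_cast<double>(a[i]) * a[i];
        nb  += static_cast<double>(b[i]) * b[i];
    }
    if (na == 0.0 || nb == 0.0) return nan;
    return dot / (std::sqrt(na) * std::sqrt(nb));
}
```

A value close to 1 means the captured tensor tracks the reference. A value near 0, like the 0.003 above, means the two buffers are essentially unrelated, which points at a capture problem before it points at a math bug.
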
Quick start:

git clone https://github.com/CrispStrobe/CrispASR && cd CrispASR
cmake -S . -B build && cmake --build build -j8
./build/bin/crispasr --backend mimo-asr \
    -m mimo-asr-q4_k.gguf \
    --codec-model mimo-tokenizer-q4_k.gguf \
    -f audio.wav
