Continuous Mimi Encoder

This is a modified ONNX export of the Mimi audio codec encoder.

It bypasses the Residual Vector Quantization (RVQ) to directly output continuous feature representations instead of discrete tokens. This preserves fine-grained acoustic details, making it ideal for downstream tasks like real-time dialogue analysis or speech emotion recognition.

Figure: Mimi encoder architecture with the RVQ module bypassed to yield continuous latents. (Adapted from the Moshi technical report.)
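The bypass can be pictured with toy stand-ins (these modules are illustrative only, not the real Mimi architecture): a quantizer snaps each latent frame to its nearest codebook entry, while the continuous export returns the encoder output untouched.

```python
import torch
import torch.nn as nn

class ToyRVQStage(nn.Module):
    """Toy single-stage vector quantizer: snap each frame to its nearest codebook entry."""
    def __init__(self, dim: int, codebook_size: int):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(codebook_size, dim))

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        # latents: (batch, frames, dim); pick the nearest codebook vector per frame
        dists = torch.cdist(latents, self.codebook.unsqueeze(0).expand(latents.shape[0], -1, -1))
        return self.codebook[dists.argmin(dim=-1)]

torch.manual_seed(0)
encoder = nn.Linear(16, 8)                        # stand-in for the convolutional encoder
quantizer = ToyRVQStage(dim=8, codebook_size=32)  # stand-in for Mimi's RVQ

frames = torch.randn(1, 10, 16)  # (batch, frames, features)
latents = encoder(frames)        # continuous embeddings: what this export returns
tokenized = quantizer(latents)   # quantized path: what stock Mimi feeds downstream
print(latents.shape, tokenized.shape)
```

The continuous path keeps whatever fine-grained detail the quantizer would have rounded away, which is the point of this export.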

βš–οΈ License and Attribution

This model is a derivative work based on the "Mimi" model.

  • Original Developer: Kyutai
  • License: CC BY 4.0
  • Modifications: Bypassed the RVQ module to output continuous embeddings and exported to ONNX format.

Speed and accuracy (reference)

Numbers below are illustrative (desktop CPU, FP32 ONNX vs PyTorch kyutai/mimi, streaming chunk emit=1280 @ 16 kHz, inference.py --compare --mic). Your machine, thread counts, and ORT / PyTorch builds will shift them.

Speed: RTF (processing time ÷ real-time audio duration per chunk)
In repeated 1-second windows, Torch often lands around RTF ≈ 0.22–0.32 and FP32 ONNX around RTF ≈ 0.14–0.18, i.e. ONNX typically needs about half to two-thirds of Torch's wall time for the same window. Both RTF values stay below 1 in those traces, so either path can run faster than real time in this configuration.
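As a concrete check of the arithmetic (the constants come from the streaming setup above; the helper name is mine):

```python
SAMPLE_RATE = 16_000  # Hz, mic streaming rate
EMIT_SAMPLES = 1280   # samples emitted per streaming step -> 80 ms of audio

def rtf(proc_seconds: float, n_samples: int = EMIT_SAMPLES, sr: int = SAMPLE_RATE) -> float:
    """Real-time factor: processing time divided by the audio duration it covers."""
    return proc_seconds / (n_samples / sr)

# 11.73 ms of compute per 80 ms chunk -> comfortably faster than real time
print(f"{rtf(0.01173):.3f}")  # 0.147
```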

Accuracy: Torch vs FP32 ONNX embeddings
With identical windows fed to both backends, many seconds show max absolute differences of ~1e−6 … 1e−3 and cosine similarity ≳ 0.9999. During loud transients or speech onsets, max_abs can briefly reach ~1e−2 … 5e−2 while mean_abs stays much smaller; cosine may dip toward ~0.97 in the worst second before tightening again. The published FP32 ONNX pair is the intended match to Torch; INT8 trades more numerical error for CPU throughput.
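The three agreement metrics are straightforward to reproduce; a sketch with numpy (the function name is mine):

```python
import numpy as np

def embedding_diff(a: np.ndarray, b: np.ndarray) -> dict:
    """max_abs / mean_abs / cosine similarity between two embedding tensors."""
    d = np.abs(a - b)
    av, bv = a.ravel(), b.ravel()
    cos = float(np.dot(av, bv) / (np.linalg.norm(av) * np.linalg.norm(bv)))
    return {"max_abs": float(d.max()), "mean_abs": float(d.mean()), "cos": cos}

rng = np.random.default_rng(0)
ref = rng.standard_normal((1, 100, 512)).astype(np.float32)
noisy = ref + rng.standard_normal(ref.shape).astype(np.float32) * 1e-4
print(embedding_diff(ref, noisy))
```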

Getting Started

Requirements

Use Python 3.10+ and install the following (versions should match your CUDA/CPU setup for PyTorch).

Recommended: transformers==5.5.3 (streaming Mimi + cache behaviour used by the ONNX export and inference.py are validated against this release).

  • Always: torch, transformers (5.5.3 recommended; Mimi / Kyutai support), einops, numpy, huggingface_hub
  • ONNX path: onnxruntime or onnxruntime-gpu (GPU builds use the CUDA execution provider when --device cuda)
  • Microphone (--mic): pyaudio (optional; platform-specific wheels may be required on Windows)

Example install for CPU ONNX + Torch:

pip install "torch" "transformers==5.5.3" einops numpy huggingface_hub onnxruntime

For GPU ONNX, replace onnxruntime with onnxruntime-gpu and install a matching torch build. For live microphone streaming, add pyaudio when your platform supports it.

The Torch encoder path downloads kyutai/mimi weights via Hugging Face on first use. The ONNX path, if you do not pass local paths, downloads continuous_mimi_fp32.onnx / .json (or INT8 counterparts) from this Hugging Face repository using huggingface_hub.

inference.py

inference.py provides a small CLI around the streaming Mimi encoder: PyTorch (kyutai/mimi) or ONNX, optional 16 kHz microphone streaming, and an optional Torch–ONNX comparison mode. Run it with python inference.py ... after installing the packages above.

Quick checks

# One synthetic streaming step; PyTorch Mimi encoder (HF weights)
python inference.py --backend torch

# Same, but ONNX runtime (FP32 bundle from this Hub, cached locally after first run)
python inference.py --backend onnx --onnx-precision fp32

# Use ONNX files you already downloaded next to the script
python inference.py --backend onnx --onnx-model ./continuous_mimi_fp32.onnx --onnx-meta ./continuous_mimi_fp32.json

# INT8 ONNX is intended for CPU in this script (do not pair INT8 with CUDA here)
python inference.py --backend onnx --onnx-precision int8 --device cpu
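The device/precision pairing these commands imply (CUDA EP only for FP32, INT8 pinned to CPU) can be sketched as a small helper. This is my reading of the flags, not code taken from inference.py:

```python
def select_providers(device: str, precision: str) -> list[str]:
    """ONNX Runtime execution providers: CUDA EP only for FP32 on a cuda device;
    INT8 (and anything requested on cpu) stays on the CPU EP."""
    if device.startswith("cuda") and precision == "fp32":
        return ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return ["CPUExecutionProvider"]

print(select_providers("cuda:0", "fp32"))  # GPU path with CPU fallback
print(select_providers("cpu", "int8"))     # INT8 is CPU-only in this script
```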

Microphone (16 kHz, mono float32)

Requires pyaudio. Prints RTF (processing time ÷ real-time audio duration) and per-chunk timing on a fixed wall-clock interval (default 1 s).

python inference.py --mic --backend onnx
python inference.py --mic --backend onnx --mic-device-index 1 --max-steps 500 --mic-report-interval 1.0
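Mic audio is consumed in overlapping windows; below is a minimal numpy sketch of emit=1280 / overlap=320 chunking at 16 kHz. This is my interpretation of those parameters, and the real script may window differently:

```python
import numpy as np

SR, EMIT, OVERLAP = 16_000, 1280, 320

def stream_windows(audio: np.ndarray):
    """Yield overlapping float32 windows: each step advances by EMIT samples
    but carries OVERLAP samples of the previous chunk for context."""
    window = EMIT + OVERLAP
    padded = np.concatenate([np.zeros(OVERLAP, dtype=np.float32), audio])
    pos = 0
    while pos + window <= padded.size:
        yield padded[pos:pos + window]
        pos += EMIT

one_second = np.zeros(SR, dtype=np.float32)  # 1 s of silence as a placeholder for mic input
chunks = list(stream_windows(one_second))
print(len(chunks), chunks[0].size)  # 12 windows of 1600 samples (80 ms stride)
```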

Torch vs ONNX comparison

Initializes both backends and feeds identical streaming windows; reports numeric differences and per-backend timing / RTF.

# Synthetic multi-step stream (default 10 steps; override with --compare-steps)
python inference.py --compare --compare-steps 20

# Same comparison using live microphone chunks
python inference.py --compare --mic

Example (--compare --mic on CPU, Ctrl+C to stop)

Each [1s] line aggregates every encoder step in that wall-clock second: embedding agreement (max_abs, mean_abs, cos) plus per-chunk mean inference time and RTF for Torch and ONNX separately.

$ python inference.py --mic --compare
Loading weights: 100%|████████████████| 350/350 [00:00<00:00, 5623.51it/s]
Froze EncoderMimi!
Loading weights: 100%|████████████████| 350/350 [00:00<00:00, 5653.81it/s]
Froze EncoderMimiOnnx!
Mic compare: 16 kHz mono, emit=1280, overlap=320. Diff summary every 1s. Ctrl+C to stop.
[1s] compare  max_abs(worst step)=1.646578e-06  mean_abs(avg over steps)=1.316559e-07  cos(min over steps)=0.99999988  |  torch proc/chunk=22.89ms RTF=0.286  onnx proc/chunk=11.73ms RTF=0.147
[1s] compare  max_abs(worst step)=4.466623e-06  mean_abs(avg over steps)=2.932540e-07  cos(min over steps)=0.99999988  |  torch proc/chunk=17.16ms RTF=0.214  onnx proc/chunk=11.40ms RTF=0.142
[1s] compare  max_abs(worst step)=1.005828e-06  mean_abs(avg over steps)=9.214972e-08  cos(min over steps)=0.99999988  |  torch proc/chunk=18.35ms RTF=0.229  onnx proc/chunk=11.71ms RTF=0.146
[1s] compare  max_abs(worst step)=1.596969e-03  mean_abs(avg over steps)=9.473910e-05  cos(min over steps)=0.99998349  |  torch proc/chunk=18.77ms RTF=0.235  onnx proc/chunk=11.46ms RTF=0.143
[1s] compare  max_abs(worst step)=2.210568e-02  mean_abs(avg over steps)=2.323536e-03  cos(min over steps)=0.99507332  |  torch proc/chunk=19.79ms RTF=0.247  onnx proc/chunk=12.36ms RTF=0.155
[1s] compare  max_abs(worst step)=4.575500e-02  mean_abs(avg over steps)=4.142828e-03  cos(min over steps)=0.97149730  |  torch proc/chunk=18.23ms RTF=0.228  onnx proc/chunk=11.81ms RTF=0.148
... further `[1s]` lines each second while audio continues ...
Interrupted.
[summary] compare  max_abs(worst step)=3.545750e-03  mean_abs(avg over steps)=6.147290e-04  cos(min over steps)=0.99984562  |  torch proc/chunk=23.35ms RTF=0.292  onnx proc/chunk=11.98ms RTF=0.150

When --compare is set, --backend is ignored (both paths run). With --compare and without --mic, --compare-steps sets how many synthetic chunks to run.
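The per-second roll-up in those [1s] lines (worst max_abs, average mean_abs, minimum cosine over the steps in that second) can be sketched as:

```python
def aggregate_second(steps: list[dict]) -> dict:
    """Collapse per-step compare stats into one [1s] summary:
    worst max_abs, average mean_abs, minimum cosine across steps."""
    return {
        "max_abs": max(s["max_abs"] for s in steps),
        "mean_abs": sum(s["mean_abs"] for s in steps) / len(steps),
        "cos": min(s["cos"] for s in steps),
    }

steps = [
    {"max_abs": 1.6e-06, "mean_abs": 1.3e-07, "cos": 0.9999999},
    {"max_abs": 4.5e-06, "mean_abs": 2.9e-07, "cos": 0.9999998},
]
print(aggregate_second(steps))
```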

Other useful flags

  • --device: cpu, cuda, cuda:0, …
  • --frame-hz: output frame-rate alignment for the frame-rate conv (default 12.5)
  • --hf-cache-dir: Hugging Face Hub cache directory override
  • --force-download: force re-fetch of Hub ONNX artifacts
  • --onnx-cpu-intra-threads / --onnx-cpu-inter-threads: ORT session thread counts on the CPU EP