Moonshine v2: Ergodic Streaming Encoder ASR for Latency-Critical Speech Applications
Paper: arXiv 2602.12241
ONNX INT8 (dynamic quantization) export of UsefulSensors/moonshine-streaming-small, a fast streaming ASR model designed for real-time on-device speech recognition.
Based on the paper: Moonshine v2: Ergodic Streaming Encoder ASR
| Details | |
|---|---|
| Architecture | Encoder-decoder with sliding-window attention |
| Parameters | ~200M |
| Quantization | Dynamic INT8 (weight-only, MatMul/Gemm ops) |
| Total size | 341 MB (vs ~1.3 GB FP32) |
| Input | 16kHz mono float32 audio + attention_mask |
| Output | Text tokens (BPE, 32768 vocab) |
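To make the "sliding-window attention" row concrete: each frame attends only to neighbors within a fixed band. A minimal numpy illustration (the window size of 2 here is arbitrary for the sketch, not the model's actual configuration):

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where position i may attend to j iff |i - j| <= window."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = sliding_window_mask(6, 2)
# Each row has at most 2*window + 1 True entries, centered on the diagonal.
```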
| File | Size | Description |
|---|---|---|
| `encoder_model_int8.onnx` | 72 MB | Audio → encoder hidden states (620-dim) |
| `decoder_model_int8.onnx` | 142 MB | First decode step, no KV cache → logits + KV |
| `decoder_with_past_model_int8.onnx` | 127 MB | Subsequent steps with KV cache → logits + updated KV |
```bash
pip install onnxruntime numpy tokenizers

# For real-time mic input:
pip install sounddevice
```
```python
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

MODEL_DIR = "moonshine-streaming-onnx"
BOS, EOS = 1, 2

# Load models
opts = ort.SessionOptions()
opts.intra_op_num_threads = 4
providers = ["CPUExecutionProvider"]
encoder = ort.InferenceSession(f"{MODEL_DIR}/encoder_model_int8.onnx", opts, providers=providers)
decoder = ort.InferenceSession(f"{MODEL_DIR}/decoder_model_int8.onnx", opts, providers=providers)
decoder_past = ort.InferenceSession(f"{MODEL_DIR}/decoder_with_past_model_int8.onnx", opts, providers=providers)
tokenizer = Tokenizer.from_file(f"{MODEL_DIR}/tokenizer.json")

# Prepare audio (16kHz float32, padded to a multiple of 80 samples)
audio = np.random.randn(16000).astype(np.float32)  # replace with real audio
remainder = len(audio) % 80
if remainder:
    audio = np.pad(audio, (0, 80 - remainder))
audio_input = audio[np.newaxis, :]
attention_mask = np.ones_like(audio_input, dtype=np.int64)

# Encode
(enc_out,) = encoder.run(None, {
    "input_values": audio_input,
    "attention_mask": attention_mask,
})

# Decode first token
outs = decoder.run(None, {
    "decoder_input_ids": np.array([[BOS]], dtype=np.int64),
    "encoder_hidden_states": enc_out,
})
logits, past_kvs = outs[0], outs[1:]
token = int(np.argmax(logits[0, -1, :]))

# Build KV cache mapping (decoder outputs → decoder_with_past inputs)
dec_out_names = [o.name for o in decoder.get_outputs()][1:]
past_in_names = {i.name for i in decoder_past.get_inputs()} - {"decoder_input_ids", "encoder_hidden_states"}
kv_dict = {}
for name, tensor in zip(dec_out_names, past_kvs):
    mapped = name.replace("present_", "past_", 1)
    if mapped in past_in_names:
        kv_dict[mapped] = tensor
    elif name + "_orig" in past_in_names:
        kv_dict[name + "_orig"] = tensor

# Autoregressive decode loop
tokens = [token]
past_out_names = [o.name for o in decoder_past.get_outputs()][1:]
for _ in range(255):
    if token == EOS:
        break
    inputs = {"decoder_input_ids": np.array([[token]], dtype=np.int64), "encoder_hidden_states": enc_out}
    inputs.update(kv_dict)
    outs = decoder_past.run(None, inputs)
    token = int(np.argmax(outs[0][0, -1, :]))
    tokens.append(token)
    # Update KV cache for the next step
    kv_dict = {}
    for name, tensor in zip(past_out_names, outs[1:]):
        mapped = name.replace("present_", "past_", 1)
        if mapped in past_in_names:
            kv_dict[mapped] = tensor
        elif name + "_orig" in past_in_names:
            kv_dict[name + "_orig"] = tensor

text = tokenizer.decode(tokens)
print(text)
```
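The example above feeds random noise; to transcribe real speech, replace `audio` with a 16 kHz mono clip. A minimal stdlib-only loader sketch (the helper name and the assumption of 16-bit PCM input are this sketch's choices, not part of the model's API):

```python
import wave
import numpy as np

def load_wav_16k_mono(path: str) -> np.ndarray:
    """Load a 16 kHz mono 16-bit PCM WAV as float32 in [-1, 1]."""
    with wave.open(path, "rb") as w:
        assert w.getframerate() == 16000, "model expects 16 kHz audio"
        assert w.getnchannels() == 1, "model expects mono audio"
        assert w.getsampwidth() == 2, "this helper assumes 16-bit PCM"
        pcm = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    return pcm.astype(np.float32) / 32768.0

# audio = load_wav_16k_mono("speech.wav")  # hypothetical path
```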
The encoder requires the `attention_mask` input (all ones for unpadded audio). Without it, the sliding-window bidirectional attention mask is not constructed and the model produces no transcription.
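For padded inputs, the mask marks real samples with 1 and padding with 0. A sketch of right-padding a batch of variable-length clips (batched inference itself is an assumption based on the exported dynamic shapes; verify against the model's actual input shapes):

```python
import numpy as np

def pad_batch(clips):
    """Right-pad variable-length clips to a common length; return batch + mask."""
    max_len = max(len(c) for c in clips)
    # Keep the padded length a multiple of 80 samples, as the encoder expects
    if max_len % 80:
        max_len += 80 - max_len % 80
    batch = np.zeros((len(clips), max_len), dtype=np.float32)
    mask = np.zeros((len(clips), max_len), dtype=np.int64)
    for i, c in enumerate(clips):
        batch[i, : len(c)] = c
        mask[i, : len(c)] = 1
    return batch, mask

batch, mask = pad_batch([np.ones(100, np.float32), np.ones(250, np.float32)])
```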
Tested on MacBook Air M3, ONNX Runtime CPU:
| Metric | Value |
|---|---|
| Encode latency | ~350-420ms (for ~6s audio) |
| Decode latency | ~35-40ms/token |
| Real-time factor | 0.22x (4.5x faster than real-time) |
| Peak RAM | ~1.4 GB |
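Real-time factor is processing time divided by audio duration. A quick sanity check against the table's numbers (the token count here is a hypothetical transcript length, chosen only to illustrate the arithmetic):

```python
# Real-time factor (RTF) = processing time / audio duration
audio_seconds = 6.0
encode_seconds = 0.40             # midpoint of the measured 350-420 ms range
n_tokens = 25                     # hypothetical transcript length for a 6 s clip
decode_seconds = n_tokens * 0.0375  # midpoint of 35-40 ms/token

rtf = (encode_seconds + decode_seconds) / audio_seconds
speedup = 1.0 / rtf
print(f"RTF = {rtf:.2f}, {speedup:.1f}x faster than real time")
# prints "RTF = 0.22, 4.5x faster than real time"
```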
Export details:

- `torch.onnx.export(dynamo=True)` to handle the vmap-based sliding-window attention masking
- `torch.onnx.export(dynamo=True)` with symbolic dynamic shapes for batch, sequence, and KV cache dimensions
- `onnxruntime.quantization.quantize_dynamic` targeting MatMul/Gemm ops, symmetric int8 weights

Key patches applied during export:

- `DynamicLayer.update`: skip `torch.cat` on the first call (empty tensor → rank mismatch fix)
- Pass `attention_mask` through to the encoder

```bibtex
@article{kudlur2025moonshine,
  title={Moonshine v2: Ergodic Streaming Encoder ASR},
  author={Kudlur, Manjunath and King, Evan and Wang, James and Warden, Pete},
  journal={arXiv preprint arXiv:2602.12241},
  year={2025}
}
```
License: MIT (same as base model)

Base model: UsefulSensors/moonshine-streaming-small