Moonshine v2 Streaming Small – ONNX INT8

ONNX INT8 (dynamic quantization) export of UsefulSensors/moonshine-streaming-small, a fast streaming ASR model designed for real-time on-device speech recognition.

Based on the paper: Moonshine v2: Ergodic Streaming Encoder ASR

Model Details

| Detail | Value |
|---|---|
| Architecture | Encoder-decoder with sliding-window attention |
| Parameters | ~200M |
| Quantization | Dynamic INT8 (weight-only, MatMul/Gemm ops) |
| Total size | 341 MB (vs. ~1.3 GB FP32) |
| Input | 16 kHz mono float32 audio + attention_mask |
| Output | Text tokens (BPE, 32,768-entry vocab) |

ONNX Files

| File | Size | Description |
|---|---|---|
| encoder_model_int8.onnx | 72 MB | Audio → encoder hidden states (620-dim) |
| decoder_model_int8.onnx | 142 MB | First decode step, no KV cache → logits + KV |
| decoder_with_past_model_int8.onnx | 127 MB | Subsequent steps with KV cache → logits + updated KV |
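Before running inference, it can help to confirm all three graphs are on disk. A minimal check (`MODEL_DIR` is an assumed local download path, matching the usage example below):

```python
import os

MODEL_DIR = "moonshine-streaming-onnx"  # assumed local download directory

# The three graphs listed above
EXPECTED_FILES = [
    "encoder_model_int8.onnx",
    "decoder_model_int8.onnx",
    "decoder_with_past_model_int8.onnx",
]

missing = [f for f in EXPECTED_FILES
           if not os.path.exists(os.path.join(MODEL_DIR, f))]
print("missing:", missing if missing else "none")
```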

Usage

Requirements

pip install onnxruntime numpy tokenizers
# For real-time mic input:
pip install sounddevice

Basic Inference

import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

MODEL_DIR = "moonshine-streaming-onnx"
BOS, EOS = 1, 2

# Load models
opts = ort.SessionOptions()
opts.intra_op_num_threads = 4
providers = ["CPUExecutionProvider"]

encoder = ort.InferenceSession(f"{MODEL_DIR}/encoder_model_int8.onnx", opts, providers=providers)
decoder = ort.InferenceSession(f"{MODEL_DIR}/decoder_model_int8.onnx", opts, providers=providers)
decoder_past = ort.InferenceSession(f"{MODEL_DIR}/decoder_with_past_model_int8.onnx", opts, providers=providers)
tokenizer = Tokenizer.from_file(f"{MODEL_DIR}/tokenizer.json")

# Prepare audio (16kHz float32, padded to multiple of 80)
audio = np.random.randn(16000).astype(np.float32)  # replace with real audio
remainder = len(audio) % 80
if remainder:
    audio = np.pad(audio, (0, 80 - remainder))

audio_input = audio[np.newaxis, :]
attention_mask = np.ones_like(audio_input, dtype=np.int64)

# Encode
(enc_out,) = encoder.run(None, {
    "input_values": audio_input,
    "attention_mask": attention_mask,
})

# Decode first token
outs = decoder.run(None, {
    "decoder_input_ids": np.array([[BOS]], dtype=np.int64),
    "encoder_hidden_states": enc_out,
})
logits, past_kvs = outs[0], outs[1:]
token = int(np.argmax(logits[0, -1, :]))

# Build KV cache mapping (decoder outputs → decoder_with_past inputs)
dec_out_names = [o.name for o in decoder.get_outputs()][1:]
past_in_names = {i.name for i in decoder_past.get_inputs()} - {"decoder_input_ids", "encoder_hidden_states"}

kv_dict = {}
for name, tensor in zip(dec_out_names, past_kvs):
    mapped = name.replace("present_", "past_", 1)
    if mapped in past_in_names:
        kv_dict[mapped] = tensor
    elif name + "_orig" in past_in_names:
        kv_dict[name + "_orig"] = tensor

# Autoregressive decode loop
tokens = [token]
past_out_names = [o.name for o in decoder_past.get_outputs()][1:]

for _ in range(255):
    if token == EOS:
        break
    inputs = {"decoder_input_ids": np.array([[token]], dtype=np.int64), "encoder_hidden_states": enc_out}
    inputs.update(kv_dict)
    outs = decoder_past.run(None, inputs)
    token = int(np.argmax(outs[0][0, -1, :]))
    tokens.append(token)
    # Update KV cache
    kv_dict = {}
    for name, tensor in zip(past_out_names, outs[1:]):
        mapped = name.replace("present_", "past_", 1)
        if mapped in past_in_names:
            kv_dict[mapped] = tensor
        elif name + "_orig" in past_in_names:
            kv_dict[name + "_orig"] = tensor

text = tokenizer.decode(tokens)
print(text)

Important: Attention Mask Required

The encoder requires an attention_mask input of the same shape as the audio (all ones for unpadded audio). Without it, the sliding-window bidirectional attention mask is not constructed and the model produces no transcription.
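For padded audio, the mask should be zero over the padding samples. A minimal numpy sketch combining the multiple-of-80 padding with a matching mask (the helper name is illustrative, not part of the model API):

```python
import numpy as np

def pad_with_mask(audio: np.ndarray, multiple: int = 80):
    """Pad 1-D float32 audio to a multiple of `multiple` samples and
    return (padded audio, attention mask) with a leading batch dim.
    Mask convention: 1 = real sample, 0 = padding."""
    pad = (multiple - len(audio) % multiple) % multiple
    padded = np.pad(audio.astype(np.float32), (0, pad))
    mask = np.ones(len(padded), dtype=np.int64)
    if pad:
        mask[-pad:] = 0
    return padded[np.newaxis, :], mask[np.newaxis, :]

audio, mask = pad_with_mask(np.zeros(16001, dtype=np.float32))
print(audio.shape, int(mask.sum()))  # (1, 16080) 16001
```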

Performance

Tested on MacBook Air M3, ONNX Runtime CPU:

| Metric | Value |
|---|---|
| Encode latency | ~350-420 ms (for ~6 s audio) |
| Decode latency | ~35-40 ms/token |
| Real-time factor | 0.22 (~4.5x faster than real time) |
| Peak RAM | ~1.4 GB |
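The real-time factor follows directly from the per-stage latencies. A back-of-envelope check, assuming a 6 s utterance that decodes to roughly 25 tokens (the token count is an assumption for illustration, not a measured value):

```python
def real_time_factor(encode_s: float, per_token_s: float,
                     n_tokens: int, audio_s: float) -> float:
    """Wall-clock processing time divided by audio duration."""
    return (encode_s + per_token_s * n_tokens) / audio_s

rtf = real_time_factor(encode_s=0.40, per_token_s=0.0375,
                       n_tokens=25, audio_s=6.0)
print(round(rtf, 2))  # 0.22, i.e. ~4.5x faster than real time
```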

How It Was Exported

  1. Encoder: Exported with torch.onnx.export(dynamo=True) to handle the vmap-based sliding-window attention masking
  2. Decoders: Exported with torch.onnx.export(dynamo=True) with symbolic dynamic shapes for batch, sequence, and KV cache dimensions
  3. Quantization: onnxruntime.quantization.quantize_dynamic targeting MatMul/Gemm ops, symmetric int8 weights
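The quantization step can be sketched with onnxruntime's dynamic quantizer (file paths are illustrative, and the exact extra options used for the release are not reproduced here):

```python
import os

# Only weight-bearing matmul ops are quantized, per step 3 above
OP_TYPES = ["MatMul", "Gemm"]

def quantize(src: str, dst: str) -> None:
    # Imported lazily; requires `pip install onnxruntime`
    from onnxruntime.quantization import quantize_dynamic, QuantType
    quantize_dynamic(
        src,
        dst,
        op_types_to_quantize=OP_TYPES,
        weight_type=QuantType.QInt8,  # symmetric int8 weights
    )

for name in ("encoder_model", "decoder_model", "decoder_with_past_model"):
    src = f"{name}.onnx"
    if os.path.exists(src):  # no-op when the FP32 exports are absent
        quantize(src, f"{name}_int8.onnx")
```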

Key patches applied during export:

  • DynamicLayer.update: Skip torch.cat on first call (empty tensor β†’ rank mismatch fix)
  • Encoder wrapper passes attention_mask through to the encoder

Limitations

  • English only
  • Optimized for short utterances (streaming chunks of 1-5 seconds)
  • INT8 quantization has negligible accuracy impact for greedy decoding; may affect beam search/sampling slightly

Citation

@article{kudlur2025moonshine,
  title={Moonshine v2: Ergodic Streaming Encoder ASR},
  author={Kudlur, Manjunath and King, Evan and Wang, James and Warden, Pete},
  journal={arXiv preprint arXiv:2602.12241},
  year={2025}
}

License

MIT (same as base model)
