# Moonshine Streaming Medium – ONNX INT8

Dynamic INT8 quantized ONNX export of UsefulSensors/moonshine-streaming-medium for fast CPU inference with ONNX Runtime.
## Model Overview
Moonshine v2 Streaming is an encoder-decoder ASR model designed for real-time streaming speech recognition. The encoder uses causal sliding-window attention (no positional embeddings), enabling it to process audio incrementally. The decoder uses RoPE-based causal attention with cross-attention to encoder states.
| | Details |
|---|---|
| Architecture | Encoder-Decoder Transformer (streaming) |
| Parameters | ~330M (FP32) |
| Encoder | 14 layers, 768-dim, 10 heads, sliding-window attention |
| Decoder | 14 layers, 640-dim, 10 heads, RoPE + cross-attention |
| Vocab | 32,768 BPE tokens |
| Audio Input | 16 kHz mono, 5 ms frames (80 samples) |
| Quantization | Dynamic INT8 (weight-only, symmetric) |
| Latency | Real-time capable on modern CPUs |
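The audio-input row and the encoder's 50 Hz output rate (see Architecture Details) imply a fixed framing arithmetic; a quick sketch of the numbers, where the stride-of-4 reading is an inference from the stated rates rather than something verified against the graphs:

```python
# Framing arithmetic: 16 kHz mono input, 5 ms input frames, and a
# 50 Hz encoder output rate (one hidden state per 20 ms of audio).
sample_rate = 16_000                                  # Hz
frame_ms = 5                                          # input frame length
samples_per_frame = sample_rate * frame_ms // 1000    # 80 samples
encoder_rate = 50                                     # Hz
ms_per_encoder_frame = 1000 // encoder_rate           # 20 ms
stride = ms_per_encoder_frame // frame_ms             # 4 input frames per output
print(samples_per_frame, ms_per_encoder_frame, stride)
```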
## Files
| File | Size | Description |
|---|---|---|
| `encoder_model_int8.onnx` | 135 MB | Audio → 768-dim encoder hidden states |
| `decoder_model_int8.onnx` | 225 MB | First decode step (initializes KV cache) |
| `decoder_with_past_model_int8.onnx` | 202 MB | Subsequent decode steps (streaming KV cache) |
| `config.json` | – | Model architecture configuration |
| `tokenizer.json` | 3.6 MB | BPE tokenizer (32,768 vocab) |
| `processor_config.json` | – | Audio processor settings |
| `tokenizer_config.json` | – | Tokenizer metadata |
| **Total** | **~562 MB** | |
## Quick Start

### Installation

```bash
pip install onnxruntime numpy tokenizers
```

### Basic Inference
```python
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

MODEL_DIR = "moonshine-streaming-medium-onnx"

# Load the three ONNX graphs
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
providers = ["CPUExecutionProvider"]
encoder = ort.InferenceSession(f"{MODEL_DIR}/encoder_model_int8.onnx", opts, providers=providers)
decoder = ort.InferenceSession(f"{MODEL_DIR}/decoder_model_int8.onnx", opts, providers=providers)
decoder_past = ort.InferenceSession(f"{MODEL_DIR}/decoder_with_past_model_int8.onnx", opts, providers=providers)
tokenizer = Tokenizer.from_file(f"{MODEL_DIR}/tokenizer.json")

# Encode audio (16 kHz mono float32)
audio = np.random.randn(1, 16000).astype(np.float32)  # 1 second of noise as a placeholder
mask = np.ones((1, 16000), dtype=np.int64)
enc_out = encoder.run(None, {"input_values": audio, "attention_mask": mask})[0]

# First decode step (BOS token = 1) initializes the KV cache
bos = np.array([[1]], dtype=np.int64)
first_out = decoder.run(None, {"decoder_input_ids": bos, "encoder_hidden_states": enc_out})
logits = first_out[0]
token_id = int(np.argmax(logits[0, -1, :]))

# Each "present_*" output of one step feeds the matching "past_*"
# input of decoder_with_past; discover the mapping from the graphs.
dec_out_names = [o.name for o in decoder.get_outputs()][1:]
past_out_names = [o.name for o in decoder_past.get_outputs()][1:]
dec_past_in_names = {i.name for i in decoder_past.get_inputs()
                     if i.name not in ("decoder_input_ids", "encoder_hidden_states")}

def map_cache(names, tensors):
    kv = {}
    for name, tensor in zip(names, tensors):
        past_name = name.replace("present_", "past_", 1)
        mapped = past_name if past_name in dec_past_in_names else (
            name + "_orig" if name + "_orig" in dec_past_in_names else name)
        kv[mapped] = tensor
    return kv

kv = map_cache(dec_out_names, first_out[1:])

# Autoregressive greedy decoding
tokens = [token_id]
EOS = 2
while token_id != EOS and len(tokens) < 256:
    inputs = {"decoder_input_ids": np.array([[token_id]], dtype=np.int64),
              "encoder_hidden_states": enc_out}
    inputs.update(kv)
    past_out = decoder_past.run(None, inputs)
    token_id = int(np.argmax(past_out[0][0, -1, :]))
    tokens.append(token_id)
    kv = map_cache(past_out_names, past_out[1:])  # update KV cache

text = tokenizer.decode(tokens)
print(text)
```
## Real-Time Microphone Streaming

The companion CLI tool provides real-time streaming ASR with voice activity detection:

```bash
pip install sounddevice
python inference_moonshine.py --model-dir moonshine_streaming_medium
```
## Architecture Details

### Three-Model Design

The model is split into three ONNX graphs for efficient streaming:

1. **Encoder**: Processes raw audio with causal stride-2 convolutions and sliding-window attention. Outputs 768-dim hidden states at 50 Hz (one frame per 20 ms of audio).
2. **Decoder (first step)**: Takes the BOS token plus encoder states, produces the first token logits, and initializes 56 KV-cache tensors (14 layers × 2 attention types × key+value).
3. **Decoder with past**: Takes the previous token, encoder states, and the KV cache; produces next-token logits and an updated cache. The self-attention KV grows each step; the cross-attention KV stays constant.
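The cache-tensor count follows directly from that layout; a tiny sketch of the enumeration, where the name pattern is a hypothetical modeled on the common `present_`/`past_` export convention (the Quick Start code discovers the real names from the graphs at runtime):

```python
# 14 decoder layers, each with self-attention (decoder) and
# cross-attention (encoder), each caching a key and a value tensor.
def kv_tensor_names(num_layers=14):
    names = []
    for layer in range(num_layers):
        for attn in ("decoder", "encoder"):   # self- vs cross-attention
            for kind in ("key", "value"):
                names.append(f"present_key_values.{layer}.{attn}.{kind}")
    return names

print(len(kv_tensor_names()))    # 14 layers x 2 attention types x 2 (K, V) = 56
```

The same formula gives 40 tensors for the 10-layer Small variant, matching the comparison table below.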
### Sliding Window Attention

The encoder uses per-layer sliding-window sizes for streaming efficiency:
| Layers | Window | Lookahead | Purpose |
|---|---|---|---|
| 0–1 | 16 frames | 4 frames | Initial context with lookahead |
| 2–11 | 16 frames | 0 frames | Causal processing (no future) |
| 12–13 | 16 frames | 4 frames | Final refinement with lookahead |
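An illustrative way to read the table: query position `i` can attend to key position `j` when `i - window < j <= i + lookahead`. A minimal NumPy sketch of that mask pattern (a sketch of the visibility rule, not the actual kernel used by the model):

```python
import numpy as np

def sliding_window_mask(seq_len, window=16, lookahead=0):
    # True where query i may attend to key j:
    # i - window < j <= i + lookahead
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j > i - window) & (j <= i + lookahead)

m = sliding_window_mask(32, window=16, lookahead=4)
```

With `lookahead=0` (layers 2–11) the mask is strictly causal; layers 0–1 and 12–13 see 4 future frames.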
## Quantization Details
- Method: Dynamic INT8 (weight-only)
- Target ops: MatMul, Gemm (transformer compute)
- Weights: Symmetric INT8
- Activations: Remain FP32 at runtime
- Audio frontend: Conv/ConvTranspose kept at full precision
- Accuracy impact: Negligible (<0.0001 max absolute encoder diff vs FP32)
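To make "symmetric, weight-only INT8" concrete: each weight tensor gets one scale derived from its maximum absolute value, the zero-point is fixed at 0, and values are rounded into [-127, 127]; activations are left in FP32. This is illustrative math only, not the ONNX Runtime quantizer itself:

```python
import numpy as np

def quantize_symmetric_int8(w):
    # Per-tensor symmetric quantization: zero-point 0, scale from max |w|
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.array([-0.5, 0.0, 0.25, 0.5], dtype=np.float32)
q, scale = quantize_symmetric_int8(w)
dequant = q.astype(np.float32) * scale  # round-trip error bounded by ~scale/2
```

The bounded round-trip error is why the encoder output diff vs. FP32 stays below 1e-4.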
## Comparison with Small

| | Small | Medium |
|---|---|---|
| Encoder layers | 10 | 14 |
| Encoder hidden | 620 | 768 |
| Decoder layers | 10 | 14 |
| Decoder hidden | 512 | 640 |
| Attention heads | 8 | 10 |
| INT8 size | ~358 MB | ~562 MB |
| KV tensors | 40 | 56 |
## Execution Providers

Works with any ONNX Runtime execution provider:

| Provider | Platform | Notes |
|---|---|---|
| `CPUExecutionProvider` | All | Default, always available |
| `CoreMLExecutionProvider` | macOS | Hardware-accelerated on Apple Silicon |
| `CUDAExecutionProvider` | Linux/Windows | NVIDIA GPU |
| `DirectMLExecutionProvider` | Windows | DirectX 12 GPU |
## Export Reproduction

To reproduce this export from the original safetensors weights:

```bash
pip install "transformers>=5.2.0" "huggingface_hub>=0.23" torch onnx onnxruntime
python export_moonshine_streaming_medium.py
```

Export script: onnx-creator
## License

This model inherits the MIT License from the original Moonshine model by Useful Sensors.
## Citation

```bibtex
@article{jeffries2024moonshine,
  title={Moonshine: Speech Recognition for Live Transcription and Voice Commands},
  author={Jeffries, Nat and King, Evan and Kudlur, Manjunath and Nicholson, Guy and Wang, James and Warden, Pete},
  journal={arXiv preprint arXiv:2410.15608},
  year={2024}
}
```