# Moonshine Streaming Medium – ONNX INT8

Dynamic INT8 quantized ONNX export of UsefulSensors/moonshine-streaming-medium for fast CPU inference with ONNX Runtime.
## Model Overview
Moonshine v2 Streaming is an encoder-decoder ASR model designed for real-time streaming speech recognition. The encoder uses causal sliding-window attention (no positional embeddings), enabling it to process audio incrementally. The decoder uses RoPE-based causal attention with cross-attention to encoder states.
| | Details |
|---|---|
| Architecture | Encoder-Decoder Transformer (streaming) |
| Parameters | ~330M (FP32) |
| Encoder | 14 layers, 768-dim, 10 heads, sliding-window attention |
| Decoder | 14 layers, 640-dim, 10 heads, RoPE + cross-attention |
| Vocab | 32,768 BPE tokens |
| Audio Input | 16 kHz mono, 5 ms frames (80 samples) |
| Quantization | Dynamic INT8 (weight-only, symmetric) |
| Latency | Real-time capable on modern CPUs |
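The audio-input row and the encoder's 50 Hz output rate (see Architecture Details) imply a fixed framing arithmetic; a quick sketch of the numbers, where the stride-of-4 reading is an inference from the stated rates rather than something verified against the graphs:

```python
# Framing arithmetic: 16 kHz mono input, 5 ms input frames, and a
# 50 Hz encoder output rate (one hidden state per 20 ms of audio).
sample_rate = 16_000                                  # Hz
frame_ms = 5                                          # input frame length
samples_per_frame = sample_rate * frame_ms // 1000    # 80 samples
encoder_rate = 50                                     # Hz
ms_per_encoder_frame = 1000 // encoder_rate           # 20 ms
stride = ms_per_encoder_frame // frame_ms             # 4 input frames per output
print(samples_per_frame, ms_per_encoder_frame, stride)
```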
## Files
| File | Size | Description |
|---|---|---|
| `encoder_model_int8.onnx` | 135 MB | Audio → 768-dim encoder hidden states |
| `decoder_model_int8.onnx` | 225 MB | First decode step (initializes KV cache) |
| `decoder_with_past_model_int8.onnx` | 202 MB | Subsequent decode steps (streaming KV cache) |
| `config.json` | – | Model architecture configuration |
| `tokenizer.json` | 3.6 MB | BPE tokenizer (32,768 vocab) |
| `processor_config.json` | – | Audio processor settings |
| `tokenizer_config.json` | – | Tokenizer metadata |
| **Total** | **~562 MB** | |
## Quick Start

### Installation

```bash
pip install onnxruntime numpy tokenizers
```

### Basic Inference
```python
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

MODEL_DIR = "moonshine-streaming-medium-onnx"

# Load the three ONNX graphs
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
providers = ["CPUExecutionProvider"]
encoder = ort.InferenceSession(f"{MODEL_DIR}/encoder_model_int8.onnx", opts, providers=providers)
decoder = ort.InferenceSession(f"{MODEL_DIR}/decoder_model_int8.onnx", opts, providers=providers)
decoder_past = ort.InferenceSession(f"{MODEL_DIR}/decoder_with_past_model_int8.onnx", opts, providers=providers)
tokenizer = Tokenizer.from_file(f"{MODEL_DIR}/tokenizer.json")

# Encode audio (16 kHz mono float32)
audio = np.random.randn(1, 16000).astype(np.float32)  # 1 second of noise as a placeholder
mask = np.ones((1, 16000), dtype=np.int64)
enc_out = encoder.run(None, {"input_values": audio, "attention_mask": mask})[0]

# First decode step (BOS token = 1) initializes the KV cache
bos = np.array([[1]], dtype=np.int64)
first_out = decoder.run(None, {"decoder_input_ids": bos, "encoder_hidden_states": enc_out})
logits = first_out[0]
token_id = int(np.argmax(logits[0, -1, :]))

# Each "present_*" output of one step feeds the matching "past_*"
# input of decoder_with_past; discover the mapping from the graphs.
dec_out_names = [o.name for o in decoder.get_outputs()][1:]
past_out_names = [o.name for o in decoder_past.get_outputs()][1:]
dec_past_in_names = {i.name for i in decoder_past.get_inputs()
                     if i.name not in ("decoder_input_ids", "encoder_hidden_states")}

def map_cache(names, tensors):
    kv = {}
    for name, tensor in zip(names, tensors):
        past_name = name.replace("present_", "past_", 1)
        mapped = past_name if past_name in dec_past_in_names else (
            name + "_orig" if name + "_orig" in dec_past_in_names else name)
        kv[mapped] = tensor
    return kv

kv = map_cache(dec_out_names, first_out[1:])

# Autoregressive greedy decoding
tokens = [token_id]
EOS = 2
while token_id != EOS and len(tokens) < 256:
    inputs = {"decoder_input_ids": np.array([[token_id]], dtype=np.int64),
              "encoder_hidden_states": enc_out}
    inputs.update(kv)
    past_out = decoder_past.run(None, inputs)
    token_id = int(np.argmax(past_out[0][0, -1, :]))
    tokens.append(token_id)
    kv = map_cache(past_out_names, past_out[1:])  # update KV cache

text = tokenizer.decode(tokens)
print(text)
```
## Real-Time Microphone Streaming

The companion CLI tool provides real-time streaming ASR with voice activity detection:

```bash
pip install sounddevice
python inference_moonshine.py --model-dir moonshine_streaming_medium
```
## Architecture Details

### Three-Model Design

The model is split into three ONNX graphs for efficient streaming:

1. **Encoder**: Processes raw audio with causal stride-2 convolutions and sliding-window attention. Outputs 768-dim hidden states at 50 Hz (one frame per 20 ms of audio).
2. **Decoder (first step)**: Takes the BOS token plus encoder states, produces the first token logits, and initializes 56 KV-cache tensors (14 layers × 2 attention types × key+value).
3. **Decoder with past**: Takes the previous token, encoder states, and the KV cache; produces next-token logits and an updated cache. The self-attention KV grows each step; the cross-attention KV stays constant.
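The cache-tensor count follows directly from that layout; a tiny sketch of the enumeration, where the name pattern is a hypothetical modeled on the common `present_`/`past_` export convention (the Quick Start code discovers the real names from the graphs at runtime):

```python
# 14 decoder layers, each with self-attention (decoder) and
# cross-attention (encoder), each caching a key and a value tensor.
def kv_tensor_names(num_layers=14):
    names = []
    for layer in range(num_layers):
        for attn in ("decoder", "encoder"):   # self- vs cross-attention
            for kind in ("key", "value"):
                names.append(f"present_key_values.{layer}.{attn}.{kind}")
    return names

print(len(kv_tensor_names()))    # 14 layers x 2 attention types x 2 (K, V) = 56
```

The same formula gives 40 tensors for the 10-layer Small variant, matching the comparison table below.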
### Sliding Window Attention

The encoder uses per-layer sliding-window sizes for streaming efficiency:
| Layers | Window | Lookahead | Purpose |
|---|---|---|---|
| 0–1 | 16 frames | 4 frames | Initial context with lookahead |
| 2–11 | 16 frames | 0 frames | Causal processing (no future) |
| 12–13 | 16 frames | 4 frames | Final refinement with lookahead |
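An illustrative way to read the table: query position `i` can attend to key position `j` when `i - window < j <= i + lookahead`. A minimal NumPy sketch of that mask pattern (a sketch of the visibility rule, not the actual kernel used by the model):

```python
import numpy as np

def sliding_window_mask(seq_len, window=16, lookahead=0):
    # True where query i may attend to key j:
    # i - window < j <= i + lookahead
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j > i - window) & (j <= i + lookahead)

m = sliding_window_mask(32, window=16, lookahead=4)
```

With `lookahead=0` (layers 2–11) the mask is strictly causal; layers 0–1 and 12–13 see 4 future frames.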
## Quantization Details
- Method: Dynamic INT8 (weight-only)
- Target ops: MatMul, Gemm (transformer compute)
- Weights: Symmetric INT8
- Activations: Remain FP32 at runtime
- Audio frontend: Conv/ConvTranspose kept at full precision
- Accuracy impact: Negligible (<0.0001 max absolute encoder diff vs FP32)
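To make "symmetric, weight-only INT8" concrete: each weight tensor gets one scale derived from its maximum absolute value, the zero-point is fixed at 0, and values are rounded into [-127, 127]; activations are left in FP32. This is illustrative math only, not the ONNX Runtime quantizer itself:

```python
import numpy as np

def quantize_symmetric_int8(w):
    # Per-tensor symmetric quantization: zero-point 0, scale from max |w|
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.array([-0.5, 0.0, 0.25, 0.5], dtype=np.float32)
q, scale = quantize_symmetric_int8(w)
dequant = q.astype(np.float32) * scale  # round-trip error bounded by ~scale/2
```

The bounded round-trip error is why the encoder output diff vs. FP32 stays below 1e-4.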
## Comparison with Small

| | Small | Medium |
|---|---|---|
| Encoder layers | 10 | 14 |
| Encoder hidden | 620 | 768 |
| Decoder layers | 10 | 14 |
| Decoder hidden | 512 | 640 |
| Attention heads | 8 | 10 |
| INT8 size | ~358 MB | ~562 MB |
| KV tensors | 40 | 56 |
## Execution Providers

Works with any ONNX Runtime execution provider:

| Provider | Platform | Notes |
|---|---|---|
| `CPUExecutionProvider` | All | Default, always available |
| `CoreMLExecutionProvider` | macOS | Hardware-accelerated on Apple Silicon |
| `CUDAExecutionProvider` | Linux/Windows | NVIDIA GPU |
| `DirectMLExecutionProvider` | Windows | DirectX 12 GPU |
## Export Reproduction

To reproduce this export from the original safetensors weights:

```bash
pip install "transformers>=5.2.0" "huggingface_hub>=0.23" torch onnx onnxruntime
python export_moonshine_streaming_medium.py
```

Export script: onnx-creator
## License

This model inherits the MIT License from the original Moonshine model by Useful Sensors.
## Citation

```bibtex
@article{jeffries2024moonshine,
  title={Moonshine: Speech Recognition for Live Transcription and Voice Commands},
  author={Jeffries, Nat and King, Evan and Kudlur, Manjunath and Nicholson, Guy and Wang, James and Warden, Pete},
  journal={arXiv preprint arXiv:2410.15608},
  year={2024}
}
```