# SenseVoice ASR – ONNX (int8)
ONNX export and int8 quantization of SenseVoice by FunAudioLLM (Alibaba/DAMO Academy). Exported via sherpa-onnx by k2-fsa.
This repo mirrors the model for CDN stability as a dependency of the @omote/core SDK.
## Model Summary
| Property | Value |
|---|---|
| Architecture | SenseVoiceSmall (50 SANM layers, CTC encoder-only) |
| Parameters | ~234M |
| Vocabulary | 25,055 tokens (SentencePiece BPE) |
| Languages | zh, en, ja, ko, yue + 45 others |
| Tasks | ASR + Language ID + Emotion + Audio Events |
| Quantization | Dynamic int8 (QUInt8, MatMul-only) |
| Size | ~228 MB |
| Opset | 13 |
| Min ORT Version | 1.17.0 |
## Files
| File | Size | Description |
|---|---|---|
| `model.int8.onnx` | ~228 MB | Int8 quantized model (single file) |
| `tokens.txt` | ~309 KB | SentencePiece vocabulary (25,055 tokens) |
## Quick Start
### TypeScript (@omote/core)
```ts
import { createSenseVoice } from '@omote/core';

// Zero-config: uses this Hugging Face CDN automatically
const asr = createSenseVoice();
await asr.load();

const { text, emotion, language } = await asr.transcribe(audioSamples);
console.log(`[${language}] [${emotion}] ${text}`);
```
### Python (ONNX Runtime)
```python
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model.int8.onnx", providers=["CPUExecutionProvider"])

# Input: preprocessed features (fbank → LFR → CMVN)
# See the "Preprocessing" section below
feeds = {
    "x": features,  # [1, num_lfr_frames, 560] float32
    "x_length": np.array([num_lfr_frames], dtype=np.int32),
    "language": np.array([0], dtype=np.int32),    # 0 = auto
    "text_norm": np.array([14], dtype=np.int32),  # 14 = with_itn
}
logits = session.run(["logits"], feeds)[0]
# CTC-decode the logits to text
```
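The last step is a greedy CTC decode: take the argmax token per frame, collapse consecutive repeats, and drop blanks. A minimal sketch (this assumes blank is token id 0, the usual convention in the sherpa-onnx `tokens.txt`; the surviving ids are then mapped through `tokens.txt`, and SenseVoice additionally emits special tags such as language and emotion tokens at the start of the hypothesis):

```python
import numpy as np

def ctc_greedy_decode(frame_logits, blank_id=0):
    """Greedy CTC: argmax per frame, collapse repeats, drop blanks."""
    ids = frame_logits.argmax(axis=-1)  # [T]
    out, prev = [], -1
    for i in ids:
        if i != blank_id and i != prev:
            out.append(int(i))
        prev = i
    return out

# Toy example: 4-token vocabulary, frames argmax to [1, 1, 0, 2, 2, 3]
toy = np.eye(4, dtype=np.float32)[[1, 1, 0, 2, 2, 3]]
print(ctc_greedy_decode(toy))  # [1, 2, 3]
```

For production use, pass `logits[0]` (shape `[T, 25055]`) from the session run above.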
## Input / Output Specification

### Inputs
| Name | Shape | Dtype | Description |
|---|---|---|---|
| `x` | `[1, T, 560]` | float32 | LFR-stacked + CMVN-normalized mel features |
| `x_length` | `[1]` | int32 | Number of LFR frames |
| `language` | `[1]` | int32 | Language ID (0=auto, 3=zh, 4=en, 7=yue, 11=ja, 12=ko) |
| `text_norm` | `[1]` | int32 | Text normalization (14=with_itn, 15=without_itn) |
### Outputs
| Name | Shape | Dtype | Description |
|---|---|---|---|
| `logits` | `[1, T, 25055]` | float32 | CTC logits over the vocabulary |
## Preprocessing
All preprocessing is external to the ONNX model. The model expects already-preprocessed features:

- **Kaldi Fbank**: raw 16 kHz audio → 80-bin log-mel spectrogram (25 ms frame, 10 ms shift, Hamming window, 0.97 preemphasis)
- **LFR stacking**: stack 7 frames with stride 6 → `[T_reduced, 560]`
- **CMVN**: normalize using statistics stored in the model metadata (`neg_mean`, `inv_stddev`: 560 floats each)

The @omote/core SDK handles all preprocessing automatically.
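For reference, the LFR and CMVN steps can be sketched in NumPy. This is a simplified version of the FunASR-style lfr_m=7 / lfr_n=6 stacking; the exact edge-padding behavior here is an assumption, not code taken from this repo, so treat it as illustrative:

```python
import numpy as np

def apply_lfr(feats, lfr_m=7, lfr_n=6):
    """Low Frame Rate stacking: concatenate lfr_m consecutive 80-dim frames,
    advancing by lfr_n, giving [ceil(T / lfr_n), 80 * lfr_m] features."""
    left = (lfr_m - 1) // 2
    # pad the left edge by repeating the first frame (assumed convention)
    padded = np.vstack([np.tile(feats[0], (left, 1)), feats])
    t_lfr = int(np.ceil(feats.shape[0] / lfr_n))
    rows = []
    for i in range(t_lfr):
        chunk = padded[i * lfr_n : i * lfr_n + lfr_m]
        if chunk.shape[0] < lfr_m:  # repeat the last frame at the tail
            pad = np.tile(padded[-1], (lfr_m - chunk.shape[0], 1))
            chunk = np.vstack([chunk, pad])
        rows.append(chunk.reshape(-1))
    return np.stack(rows)

def apply_cmvn(lfr_feats, neg_mean, inv_stddev):
    """CMVN with the model's stored stats: (x + neg_mean) * inv_stddev."""
    return (lfr_feats + neg_mean) * inv_stddev

fbank = np.random.randn(20, 80).astype(np.float32)   # placeholder fbank output
stacked = apply_lfr(fbank)                            # shape (4, 560)
normed = apply_cmvn(stacked, np.zeros(560, np.float32), np.ones(560, np.float32))
```

The fbank step itself (e.g. via `kaldi-native-fbank` or `torchaudio`) is omitted here.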
### CMVN Statistics
Stored as comma-separated strings in the ONNX model's custom metadata:

- `neg_mean`: 560 float values (negative mean for normalization)
- `inv_stddev`: 560 float values (inverse standard deviation)

Access in Python: `session.get_modelmeta().custom_metadata_map["neg_mean"]`
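The metadata values can be turned into arrays with a plain string split (sketch; the key names come from the metadata described above, the helper name is ours):

```python
import numpy as np

def parse_meta_floats(value: str) -> np.ndarray:
    """Parse a comma-separated float string from the ONNX custom metadata."""
    return np.array([float(v) for v in value.split(",")], dtype=np.float32)

# With a loaded session:
#   meta = session.get_modelmeta().custom_metadata_map
#   neg_mean = parse_meta_floats(meta["neg_mean"])      # shape (560,)
#   inv_stddev = parse_meta_floats(meta["inv_stddev"])  # shape (560,)
print(parse_meta_floats("1.5,2.0,-3.25"))  # [ 1.5   2.   -3.25]
```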
## Attribution
- Model: SenseVoiceSmall by FunAudioLLM (Alibaba/DAMO Academy)
- Paper: FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
- ONNX Export: sherpa-onnx by k2-fsa (@csukuangfj)
- Original HuggingFace: csukuangfj/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17
## License
Model weights are licensed under the FunASR Model License v1.1 by Alibaba Group. The license permits copying, modifying, and sharing with attribution.
The ONNX export code (sherpa-onnx) is licensed under Apache 2.0.