# SenseVoice ASR – ONNX (int8)
ONNX export and int8 quantization of SenseVoice by FunAudioLLM (Alibaba/DAMO Academy). Exported via sherpa-onnx by k2-fsa.
This repo mirrors the model for CDN stability as a dependency of the @omote/core SDK.
## Model Summary
| Property | Value |
|---|---|
| Architecture | SenseVoiceSmall (50 SANM layers, CTC encoder-only) |
| Parameters | ~234M |
| Vocabulary | 25,055 tokens (SentencePiece BPE) |
| Languages | zh, en, ja, ko, yue + 45 others |
| Tasks | ASR + Language ID + Emotion + Audio Events |
| Quantization | Dynamic int8 (QUInt8, MatMul-only) |
| Size | ~228 MB |
| Opset | 13 |
| Min ORT Version | 1.17.0 |
## Files
| File | Size | Description |
|---|---|---|
| `model.int8.onnx` | ~228 MB | Int8 quantized model (single file) |
| `tokens.txt` | ~309 KB | SentencePiece vocabulary (25,055 tokens) |
## Quick Start
### TypeScript (@omote/core)
```ts
import { createSenseVoice } from '@omote/core';

// Zero-config: uses this Hugging Face CDN automatically
const asr = createSenseVoice();
await asr.load();

const { text, emotion, language } = await asr.transcribe(audioSamples);
console.log(`[${language}] [${emotion}] ${text}`);
```
### Python (ONNX Runtime)
```python
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model.int8.onnx", providers=["CPUExecutionProvider"])

# Input: preprocessed features (fbank → LFR → CMVN)
# See the "Preprocessing" section below
feeds = {
    "x": features,  # [1, num_lfr_frames, 560] float32
    "x_length": np.array([num_lfr_frames], dtype=np.int32),
    "language": np.array([0], dtype=np.int32),    # 0 = auto
    "text_norm": np.array([14], dtype=np.int32),  # 14 = with_itn
}
logits = session.run(["logits"], feeds)[0]
# CTC-decode the logits to text
```
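The last step is a greedy CTC decode: take the argmax token per frame, collapse consecutive repeats, and drop blanks. A minimal sketch (this assumes blank is token id 0, the usual convention in the sherpa-onnx `tokens.txt`; the surviving ids are then mapped through `tokens.txt`, and SenseVoice additionally emits special tags such as language and emotion tokens at the start of the hypothesis):

```python
import numpy as np

def ctc_greedy_decode(frame_logits, blank_id=0):
    """Greedy CTC: argmax per frame, collapse repeats, drop blanks."""
    ids = frame_logits.argmax(axis=-1)  # [T]
    out, prev = [], -1
    for i in ids:
        if i != blank_id and i != prev:
            out.append(int(i))
        prev = i
    return out

# Toy example: 4-token vocabulary, frames argmax to [1, 1, 0, 2, 2, 3]
toy = np.eye(4, dtype=np.float32)[[1, 1, 0, 2, 2, 3]]
print(ctc_greedy_decode(toy))  # [1, 2, 3]
```

For production use, pass `logits[0]` (shape `[T, 25055]`) from the session run above.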
## Input / Output Specification

### Inputs
| Name | Shape | Dtype | Description |
|---|---|---|---|
| `x` | `[1, T, 560]` | float32 | LFR-stacked + CMVN-normalized mel features |
| `x_length` | `[1]` | int32 | Number of LFR frames |
| `language` | `[1]` | int32 | Language ID (0=auto, 3=zh, 4=en, 7=yue, 11=ja, 12=ko) |
| `text_norm` | `[1]` | int32 | Text normalization (14=with_itn, 15=without_itn) |
### Outputs
| Name | Shape | Dtype | Description |
|---|---|---|---|
| `logits` | `[1, T, 25055]` | float32 | CTC logits over the vocabulary |
## Preprocessing
All preprocessing is external to the ONNX model. The model expects already-preprocessed features:

- **Kaldi Fbank**: raw 16 kHz audio → 80-bin log-mel spectrogram (25 ms frame, 10 ms shift, Hamming window, 0.97 preemphasis)
- **LFR stacking**: stack 7 frames with stride 6 → `[T_reduced, 560]`
- **CMVN**: normalize using statistics stored in the model metadata (`neg_mean`, `inv_stddev`: 560 floats each)

The @omote/core SDK handles all preprocessing automatically.
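For reference, the LFR and CMVN steps can be sketched in NumPy. This is a simplified version of the FunASR-style lfr_m=7 / lfr_n=6 stacking; the exact edge-padding behavior here is an assumption, not code taken from this repo, so treat it as illustrative:

```python
import numpy as np

def apply_lfr(feats, lfr_m=7, lfr_n=6):
    """Low Frame Rate stacking: concatenate lfr_m consecutive 80-dim frames,
    advancing by lfr_n, giving [ceil(T / lfr_n), 80 * lfr_m] features."""
    left = (lfr_m - 1) // 2
    # pad the left edge by repeating the first frame (assumed convention)
    padded = np.vstack([np.tile(feats[0], (left, 1)), feats])
    t_lfr = int(np.ceil(feats.shape[0] / lfr_n))
    rows = []
    for i in range(t_lfr):
        chunk = padded[i * lfr_n : i * lfr_n + lfr_m]
        if chunk.shape[0] < lfr_m:  # repeat the last frame at the tail
            pad = np.tile(padded[-1], (lfr_m - chunk.shape[0], 1))
            chunk = np.vstack([chunk, pad])
        rows.append(chunk.reshape(-1))
    return np.stack(rows)

def apply_cmvn(lfr_feats, neg_mean, inv_stddev):
    """CMVN with the model's stored stats: (x + neg_mean) * inv_stddev."""
    return (lfr_feats + neg_mean) * inv_stddev

fbank = np.random.randn(20, 80).astype(np.float32)   # placeholder fbank output
stacked = apply_lfr(fbank)                            # shape (4, 560)
normed = apply_cmvn(stacked, np.zeros(560, np.float32), np.ones(560, np.float32))
```

The fbank step itself (e.g. via `kaldi-native-fbank` or `torchaudio`) is omitted here.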
### CMVN Statistics
Stored as comma-separated strings in the ONNX model's custom metadata:

- `neg_mean`: 560 float values (negative mean for normalization)
- `inv_stddev`: 560 float values (inverse standard deviation)

Access in Python: `session.get_modelmeta().custom_metadata_map["neg_mean"]`
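The metadata values can be turned into arrays with a plain string split (sketch; the key names come from the metadata described above, the helper name is ours):

```python
import numpy as np

def parse_meta_floats(value: str) -> np.ndarray:
    """Parse a comma-separated float string from the ONNX custom metadata."""
    return np.array([float(v) for v in value.split(",")], dtype=np.float32)

# With a loaded session:
#   meta = session.get_modelmeta().custom_metadata_map
#   neg_mean = parse_meta_floats(meta["neg_mean"])      # shape (560,)
#   inv_stddev = parse_meta_floats(meta["inv_stddev"])  # shape (560,)
print(parse_meta_floats("1.5,2.0,-3.25"))  # [ 1.5   2.   -3.25]
```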
## Attribution
- Model: SenseVoiceSmall by FunAudioLLM (Alibaba/DAMO Academy)
- Paper: FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
- ONNX Export: sherpa-onnx by k2-fsa (@csukuangfj)
- Original HuggingFace: csukuangfj/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17
## License
Model weights are licensed under the FunASR Model License v1.1 by Alibaba Group. The license permits copying, modifying, and sharing with attribution.
The ONNX export code (sherpa-onnx) is licensed under Apache 2.0.