SenseVoice ASR - ONNX (int8)

ONNX export and int8 quantization of SenseVoice by FunAudioLLM (Alibaba/DAMO Academy). Exported via sherpa-onnx by k2-fsa.

This repo mirrors the model for CDN stability as a dependency of the @omote/core SDK.

Model Summary

| Property | Value |
|---|---|
| Architecture | SenseVoiceSmall (50 SANM layers, CTC encoder-only) |
| Parameters | ~234M |
| Vocabulary | 25,055 tokens (SentencePiece BPE) |
| Languages | zh, en, ja, ko, yue + 45 others |
| Tasks | ASR + Language ID + Emotion + Audio Events |
| Quantization | Dynamic int8 (QUInt8, MatMul-only) |
| Size | ~228 MB |
| Opset | 13 |
| Min ORT Version | 1.17.0 |

Files

| File | Size | Description |
|---|---|---|
| model.int8.onnx | ~228 MB | Int8 quantized model (single file) |
| tokens.txt | ~309 KB | SentencePiece vocabulary (25,055 tokens) |

Quick Start

TypeScript (@omote/core)

import { createSenseVoice } from '@omote/core';

// Zero-config: uses this HuggingFace CDN automatically
const asr = createSenseVoice();
await asr.load();

const { text, emotion, language } = await asr.transcribe(audioSamples);
console.log(`[${language}] [${emotion}] ${text}`);

Python (ONNX Runtime)

import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model.int8.onnx", providers=["CPUExecutionProvider"])

# Input: preprocessed features (fbank → LFR → CMVN); see the "Preprocessing" section below.
# `features` and `num_lfr_frames` must be produced by that pipeline before this point.
feeds = {
    "x": features,           # [1, num_lfr_frames, 560] float32
    "x_length": np.array([num_lfr_frames], dtype=np.int32),
    "language": np.array([0], dtype=np.int32),   # 0=auto
    "text_norm": np.array([14], dtype=np.int32), # 14=with_itn
}

logits = session.run(["logits"], feeds)[0]
# CTC decode logits to text

Input / Output Specification

Inputs

| Name | Shape | Dtype | Description |
|---|---|---|---|
| x | [1, T, 560] | float32 | LFR-stacked + CMVN-normalized mel features |
| x_length | [1] | int32 | Number of LFR frames |
| language | [1] | int32 | Language ID (0=auto, 3=zh, 4=en, 7=yue, 11=ja, 12=ko) |
| text_norm | [1] | int32 | Text normalization (14=with_itn, 15=without_itn) |
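
As a convenience, the four inputs can be assembled from a feature array and two human-readable options. The sketch below is illustrative only: `build_feeds`, `LANGUAGE_IDS`, and `TEXT_NORM_IDS` are hypothetical helper names, and the ID values are copied from the inputs listed above.

```python
import numpy as np

# ID values taken from the input specification; the dict names are illustrative.
LANGUAGE_IDS = {"auto": 0, "zh": 3, "en": 4, "yue": 7, "ja": 11, "ko": 12}
TEXT_NORM_IDS = {"with_itn": 14, "without_itn": 15}


def build_feeds(features, language="auto", text_norm="with_itn"):
    """Build the ONNX Runtime input dict from features shaped [1, T, 560]."""
    features = np.asarray(features, dtype=np.float32)
    if features.ndim != 3 or features.shape[0] != 1 or features.shape[2] != 560:
        raise ValueError(f"expected shape [1, T, 560], got {features.shape}")
    return {
        "x": features,
        # x_length is the number of LFR frames, i.e. the T dimension.
        "x_length": np.array([features.shape[1]], dtype=np.int32),
        "language": np.array([LANGUAGE_IDS[language]], dtype=np.int32),
        "text_norm": np.array([TEXT_NORM_IDS[text_norm]], dtype=np.int32),
    }


# Example with dummy features (12 LFR frames of zeros).
feeds = build_feeds(np.zeros((1, 12, 560)), language="en")
```

The resulting dict can be passed directly as the second argument of `session.run`.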

Outputs

| Name | Shape | Dtype | Description |
|---|---|---|---|
| logits | [1, T, 25055] | float32 | CTC logits over vocabulary |
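
One common way to turn CTC logits into text is greedy decoding: take the argmax per frame, collapse consecutive repeats, and drop the blank symbol. The sketch below is a minimal version under stated assumptions: the blank ID is assumed to be 0 (check the model metadata before relying on this), `id_to_token` is assumed to be a list built from tokens.txt, and any special tokens the model emits for language, emotion, or audio events are not handled. `ctc_greedy_decode` is a hypothetical helper name.

```python
import numpy as np


def ctc_greedy_decode(logits, id_to_token, blank_id=0):
    """Greedy CTC decode for logits shaped [T, vocab_size].

    blank_id=0 is an assumption, not confirmed by this model card.
    """
    frame_ids = np.asarray(logits).argmax(axis=-1).tolist()
    pieces = []
    prev = None
    for token_id in frame_ids:
        # Emit a token only when it differs from the previous frame and is not blank.
        if token_id != prev and token_id != blank_id:
            pieces.append(id_to_token[token_id])
        prev = token_id
    # SentencePiece marks word boundaries with U+2581; map it back to a space.
    return "".join(pieces).replace("\u2581", " ").strip()


# Toy example with a three-entry vocabulary (not the real tokens.txt).
toy_vocab = ["<blank>", "\u2581hi", "\u2581there"]
toy_logits = np.eye(3, dtype=np.float32)[[0, 1, 1, 0, 2, 2]]  # shape [6, 3]
text = ctc_greedy_decode(toy_logits, toy_vocab)
```

With the real model, pass `logits[0]` to remove the batch dimension.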

Preprocessing

All preprocessing is external to the ONNX model. The model expects already-preprocessed features:

  1. Kaldi Fbank: Raw 16 kHz audio → 80-bin log-mel spectrogram (25 ms frame, 10 ms shift, Hamming window, 0.97 preemphasis)
  2. LFR Stacking: Stack 7 frames with stride 6 → [T_reduced, 560] (7 × 80 = 560)
  3. CMVN: Normalize using statistics stored in model metadata (neg_mean and inv_stddev, 560 floats each)

The @omote/core SDK handles all preprocessing automatically.
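
For readers implementing the pipeline themselves, the LFR stacking step (step 2) can be sketched in NumPy as follows. This is a minimal illustration, not the SDK's implementation: `stack_lfr` is a hypothetical helper, and it applies no edge padding, whereas reference front ends may pad the sequence before stacking, which would change the frame count slightly.

```python
import numpy as np


def stack_lfr(fbank, window=7, stride=6):
    """Stack `window` consecutive frames every `stride` frames.

    fbank: [T, 80] log-mel features. Returns [T_reduced, window * 80].
    No edge padding is applied (an assumption of this sketch).
    """
    num_frames, num_bins = fbank.shape
    if num_frames < window:
        raise ValueError(f"need at least {window} frames, got {num_frames}")
    out_frames = (num_frames - window) // stride + 1
    # Row i of idx holds the source frame indices i*stride .. i*stride + window - 1.
    idx = np.arange(out_frames)[:, None] * stride + np.arange(window)[None, :]
    # fbank[idx] is [out_frames, window, num_bins]; flatten the last two axes.
    return fbank[idx].reshape(out_frames, window * num_bins)


# Example: 100 frames of 80-bin features.
fbank = np.arange(100 * 80, dtype=np.float32).reshape(100, 80)
lfr = stack_lfr(fbank)  # shape (16, 560)
```

Each output row is the concatenation of 7 consecutive 80-bin frames, which is where the 560-dimensional input comes from.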

CMVN Statistics

Stored as comma-separated strings in the ONNX model's custom metadata:

  • neg_mean: 560 float values (negative mean for normalization)
  • inv_stddev: 560 float values (inverse standard deviation)

Access in Python: session.get_modelmeta().custom_metadata_map["neg_mean"]
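
A minimal sketch of parsing and applying these statistics is shown below. It assumes the normalization is `(x + neg_mean) * inv_stddev`, as the field names suggest; `parse_cmvn` and `apply_cmvn` are hypothetical helper names, and the two-value metadata dict is a toy stand-in for the real 560-value strings.

```python
import numpy as np


def parse_cmvn(metadata):
    """Parse comma-separated neg_mean / inv_stddev strings into float32 arrays."""
    neg_mean = np.array(metadata["neg_mean"].split(","), dtype=np.float32)
    inv_stddev = np.array(metadata["inv_stddev"].split(","), dtype=np.float32)
    if neg_mean.shape != inv_stddev.shape:
        raise ValueError("neg_mean and inv_stddev must have the same length")
    return neg_mean, inv_stddev


def apply_cmvn(features, neg_mean, inv_stddev):
    """Normalize features [T, D]; assumes (x + neg_mean) * inv_stddev."""
    return ((features + neg_mean) * inv_stddev).astype(np.float32)


# Toy metadata with 2 dimensions instead of 560, for illustration only.
toy_meta = {"neg_mean": "-1.0,-2.0", "inv_stddev": "0.5,0.25"}
neg_mean, inv_stddev = parse_cmvn(toy_meta)
normalized = apply_cmvn(np.array([[3.0, 6.0]], dtype=np.float32), neg_mean, inv_stddev)
```

With the real model, pass `session.get_modelmeta().custom_metadata_map` as `metadata` and the LFR-stacked features as `features`.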

Attribution

License

Model weights are licensed under the FunASR Model License v1.1 by Alibaba Group. The license permits copying, modifying, and sharing with attribution.

The ONNX export code (sherpa-onnx) is licensed under Apache 2.0.
