Part of the **Sherpa Vietnamese ASR** collection: ONNX models for github.com/welcomyou/sherpa-vietnamese-asr (offline Vietnamese ASR, CPU-only).
ONNX Runtime export of 3D-Speaker CAM++ speaker embedding model speech_campplus_sv_zh_en_16k-common_advanced (192-dim, trained on 200k speakers across VoxCeleb + CNCeleb + 3D-Speaker corpus).
This is the only public ONNX export of this specific 200k-speaker variant: the upstream HF mirror funasr/campplus only ships the smaller PyTorch .bin, and the ModelScope ONNX is not always accessible outside China.
| Spec | Value |
|---|---|
| Architecture | CAM++ (FCM + TDNN + DenseTDNN + StatsPool + Dense) |
| Input | 80-dim fbank, (batch, frames, 80) float32 |
| Output | 192-dim L2-normalized speaker embedding, (batch, 192) |
| Sample rate | 16 kHz |
| File size | 27 MB |
| Speakers (training) | ~200k (VoxCeleb 1+2 + CNCeleb 1+2 + 3D-Speaker) |
| Languages | Chinese + English (works well cross-lingual incl. Vietnamese) |
```python
import numpy as np
import onnxruntime as ort
import kaldi_native_fbank as knf
import soundfile as sf
from huggingface_hub import hf_hub_download

# 1. Load model
model_path = hf_hub_download("welcomyou/campplus-3dspeaker-200k-onnx",
                             "campplus_cn_en_common_200k.onnx")
sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])

# 2. Compute 80-dim fbank from 16 kHz mono audio
audio, sr = sf.read("speaker.wav", dtype="float32")  # must be 16 kHz mono
opts = knf.FbankOptions()
opts.frame_opts.samp_freq = 16000
opts.frame_opts.dither = 0.0
opts.mel_opts.num_bins = 80
fbank_extractor = knf.OnlineFbank(opts)
fbank_extractor.accept_waveform(16000, audio * 32768.0)  # Kaldi fbank expects int16-range samples
fbank_extractor.input_finished()
frames = np.stack([fbank_extractor.get_frame(i)
                   for i in range(fbank_extractor.num_frames_ready)])
frames -= frames.mean(axis=0, keepdims=True)  # mean-only CMVN

# 3. Run inference
embedding = sess.run(None, {"input": frames[np.newaxis]})[0][0]
print(embedding.shape)  # (192,)

# 4. Cosine similarity for verification
emb_a = embedding / np.linalg.norm(embedding)
# emb_b = ... (another speaker)
# similarity = emb_a @ emb_b
```
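Step 4 above can be completed as a small scoring helper. This is a minimal sketch using random vectors as stand-ins for two real 192-dim embeddings; the 0.5 threshold is an illustrative assumption only and should be tuned on a held-out trial list for your domain.

```python
import numpy as np

def cosine_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings, in [-1, 1]."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return float(a @ b)

# Toy stand-ins for two real embeddings produced by the model above
rng = np.random.default_rng(0)
emb_a = rng.standard_normal(192).astype(np.float32)
emb_b = rng.standard_normal(192).astype(np.float32)

score = cosine_score(emb_a, emb_b)
THRESHOLD = 0.5  # illustrative operating point, not from the upstream model card
decision = "same speaker" if score >= THRESHOLD else "different speaker"
print(f"{score:.3f} -> {decision}")
```

Higher scores mean the two recordings are more likely to come from the same speaker; pairs of unrelated random vectors score near zero.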
```bash
# 1. Download the PyTorch checkpoint from ModelScope
pip install modelscope
python -c "from modelscope import snapshot_download; \
snapshot_download('iic/speech_campplus_sv_zh_en_16k-common_advanced')"

# 2. Run the export script (re-implements the CAM++ architecture, loads the weights, exports ONNX)
python convert_onnx/export_campplus_onnx.py \
  --pt_path modelscope_cache/iic/speech_campplus_sv_zh_en_16k-common_advanced/campplus_cn_en_common.pt \
  --onnx_path models/campp-3dspeaker/campplus_cn_en_common_200k.onnx
```
Script: convert_onnx/export_campplus_onnx.py copies the FCM, TDNN, DenseTDNN, StatsPool, and CAMPPlus modules from the 3D-Speaker repo, loads campplus_cn_en_common.pt, and traces with torch.onnx.export.
Upstream model: speech_campplus_sv_zh_en_16k-common_advanced (ModelScope).

License: Apache 2.0 (inherited from 3D-Speaker).