Part of the **Sherpa Vietnamese ASR** collection: ONNX models for github.com/welcomyou/sherpa-vietnamese-asr (offline Vietnamese ASR, CPU-only).
ONNX Runtime export of 3D-Speaker CAM++ speaker embedding model speech_campplus_sv_zh_en_16k-common_advanced (192-dim, trained on 200k speakers across VoxCeleb + CNCeleb + 3D-Speaker corpus).
This is the only public ONNX export of this specific 200k-speaker variant: the upstream HF mirror funasr/campplus only ships the smaller PyTorch .bin, and the ModelScope ONNX is not always accessible outside China.
| Spec | Value |
|---|---|
| Architecture | CAM++ (FCM + TDNN + DenseTDNN + StatsPool + Dense) |
| Input | 80-dim fbank, (batch, frames, 80) float32 |
| Output | 192-dim L2-normalized speaker embedding, (batch, 192) |
| Sample rate | 16 kHz |
| File size | 27 MB |
| Speakers (training) | ~200k (VoxCeleb 1+2 + CNCeleb 1+2 + 3D-Speaker) |
| Languages | Chinese + English (works well cross-lingual incl. Vietnamese) |
```python
import numpy as np
import onnxruntime as ort
import kaldi_native_fbank as knf
import soundfile as sf
from huggingface_hub import hf_hub_download

# 1. Load model
model_path = hf_hub_download("welcomyou/campplus-3dspeaker-200k-onnx",
                             "campplus_cn_en_common_200k.onnx")
sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])

# 2. Compute 80-dim fbank from 16 kHz mono audio
audio, sr = sf.read("speaker.wav", dtype="float32")  # must be 16 kHz mono
opts = knf.FbankOptions()
opts.frame_opts.samp_freq = 16000
opts.frame_opts.dither = 0.0
opts.mel_opts.num_bins = 80
fbank_extractor = knf.OnlineFbank(opts)
fbank_extractor.accept_waveform(16000, audio * 32768.0)  # Kaldi fbank expects int16-range samples
fbank_extractor.input_finished()
frames = np.stack([fbank_extractor.get_frame(i)
                   for i in range(fbank_extractor.num_frames_ready)])
frames -= frames.mean(axis=0, keepdims=True)  # mean-only CMVN

# 3. Run inference
embedding = sess.run(None, {"input": frames[np.newaxis]})[0][0]
print(embedding.shape)  # (192,)

# 4. Cosine similarity for verification
emb_a = embedding / np.linalg.norm(embedding)
# emb_b = ... (another speaker)
# similarity = emb_a @ emb_b
```
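Step 4 above can be completed as a small scoring helper. This is a minimal sketch using random vectors as stand-ins for two real 192-dim embeddings; the 0.5 threshold is an illustrative assumption only and should be tuned on a held-out trial list for your domain.

```python
import numpy as np

def cosine_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings, in [-1, 1]."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return float(a @ b)

# Toy stand-ins for two real embeddings produced by the model above
rng = np.random.default_rng(0)
emb_a = rng.standard_normal(192).astype(np.float32)
emb_b = rng.standard_normal(192).astype(np.float32)

score = cosine_score(emb_a, emb_b)
THRESHOLD = 0.5  # illustrative operating point, not from the upstream model card
decision = "same speaker" if score >= THRESHOLD else "different speaker"
print(f"{score:.3f} -> {decision}")
```

Higher scores mean the two recordings are more likely to come from the same speaker; pairs of unrelated random vectors score near zero.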
```bash
# 1. Download the PyTorch checkpoint from ModelScope
pip install modelscope
python -c "from modelscope import snapshot_download; \
snapshot_download('iic/speech_campplus_sv_zh_en_16k-common_advanced')"

# 2. Run the export script (re-implements the CAM++ architecture, loads the weights, exports ONNX)
python convert_onnx/export_campplus_onnx.py \
  --pt_path modelscope_cache/iic/speech_campplus_sv_zh_en_16k-common_advanced/campplus_cn_en_common.pt \
  --onnx_path models/campp-3dspeaker/campplus_cn_en_common_200k.onnx
```
Script: convert_onnx/export_campplus_onnx.py copies the FCM, TDNN, DenseTDNN, StatsPool, and CAMPPlus modules from the 3D-Speaker repo, loads campplus_cn_en_common.pt, and traces with torch.onnx.export.
Upstream model: speech_campplus_sv_zh_en_16k-common_advanced (ModelScope).

License: Apache 2.0 (inherited from 3D-Speaker).