CAM++ 3D-Speaker (200k speakers, 192-dim) – ONNX

ONNX Runtime export of the 3D-Speaker CAM++ speaker-embedding model speech_campplus_sv_zh_en_16k-common_advanced: 192-dim embeddings, trained on ~200k speakers across the VoxCeleb, CNCeleb, and 3D-Speaker corpora.

This is the only public ONNX export of this specific 200k-speaker variant: the upstream HF mirror funasr/campplus ships only the smaller PyTorch .bin, and the ModelScope ONNX is not always accessible outside China.

Spec
Architecture: CAM++ (FCM + TDNN + DenseTDNN + StatsPool + Dense)
Input: 80-dim fbank, (batch, frames, 80) float32
Output: 192-dim L2-normalized speaker embedding, (batch, 192)
Sample rate: 16 kHz
File size: 27 MB
Training speakers: ~200k (VoxCeleb 1+2, CNCeleb 1+2, 3D-Speaker)
Languages: Chinese + English (works well cross-lingually, incl. Vietnamese)
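Since the input is a float32 tensor of shape (batch, frames, 80), scoring several utterances in one call means padding their fbank matrices to a common frame count. A minimal numpy sketch of that step; zero-padding (zero is the per-utterance mean after CMVN) is an assumption here, and whether the exported graph actually has a dynamic frames axis should be confirmed via sess.get_inputs():

```python
import numpy as np

def batch_fbanks(fbanks):
    """Zero-pad a list of (frames, 80) fbank matrices to one
    (batch, max_frames, 80) float32 tensor."""
    max_frames = max(f.shape[0] for f in fbanks)
    batch = np.zeros((len(fbanks), max_frames, fbanks[0].shape[1]),
                     dtype=np.float32)
    for i, f in enumerate(fbanks):
        batch[i, : f.shape[0]] = f  # frames beyond f's length stay zero
    return batch

a = np.random.randn(120, 80).astype(np.float32)
b = np.random.randn(95, 80).astype(np.float32)
batch = batch_fbanks([a, b])
print(batch.shape)  # (2, 120, 80)
```

Padding changes the statistics pooled over frames, so for exact scores it is safer to run each utterance with batch size 1.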

Quick start

import numpy as np
import onnxruntime as ort
import kaldi_native_fbank as knf
import soundfile as sf
from huggingface_hub import hf_hub_download

# 1. Load model
model_path = hf_hub_download("welcomyou/campplus-3dspeaker-200k-onnx",
                              "campplus_cn_en_common_200k.onnx")
sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])

# 2. Compute 80-dim fbank from 16kHz mono audio
audio, sr = sf.read("speaker.wav", dtype="float32")  # must be 16 kHz mono
opts = knf.FbankOptions()
opts.frame_opts.samp_freq = 16000
opts.frame_opts.dither = 0.0
opts.mel_opts.num_bins = 80
fbank_extractor = knf.OnlineFbank(opts)
fbank_extractor.accept_waveform(16000, audio * 32768.0)  # kaldi convention: int16-range samples
fbank_extractor.input_finished()
frames = np.stack([fbank_extractor.get_frame(i)
                   for i in range(fbank_extractor.num_frames_ready)])
frames -= frames.mean(axis=0, keepdims=True)  # CMVN

# 3. Run inference
embedding = sess.run(None, {"input": frames[np.newaxis]})[0][0]
print(embedding.shape)  # (192,)

# 4. Cosine similarity for verification
emb_a = embedding / np.linalg.norm(embedding)  # no-op if the output is already L2-normalized
# emb_b = ... (embedding of another utterance, computed the same way)
# similarity = float(emb_a @ emb_b)
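To turn that similarity into an accept/reject decision, wrap it in a small helper. The 0.4 threshold below is a placeholder assumption, not a value calibrated for this model; tune it on your own enrollment/test data:

```python
import numpy as np

def cosine_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two embeddings; normalizing here makes the
    score correct even if the inputs are not unit-length."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return float(a @ b)

def same_speaker(emb_a, emb_b, threshold=0.4):
    # threshold is an uncalibrated placeholder; tune it on a dev set
    return cosine_score(emb_a, emb_b) >= threshold

x = np.random.randn(192).astype(np.float32)
print(round(cosine_score(x, x), 3))   # 1.0 (identical embeddings)
print(round(cosine_score(x, -x), 3))  # -1.0
```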

Reproducing the export

# 1. Download PyTorch checkpoint from ModelScope
pip install modelscope
python -c "from modelscope import snapshot_download; \
           snapshot_download('iic/speech_campplus_sv_zh_en_16k-common_advanced')"

# 2. Run export script (re-implements CAM++ architecture + loads weights + exports ONNX)
python convert_onnx/export_campplus_onnx.py \
    --pt_path   modelscope_cache/iic/speech_campplus_sv_zh_en_16k-common_advanced/campplus_cn_en_common.pt \
    --onnx_path models/campp-3dspeaker/campplus_cn_en_common_200k.onnx

Script: convert_onnx/export_campplus_onnx.py – copies the FCM, TDNN, DenseTDNN, StatsPool, and CAMPPlus modules from the 3D-Speaker repo, loads campplus_cn_en_common.pt, and traces the model with torch.onnx.export.

Credits & License

Original model and training recipe: the 3D-Speaker project (ModelScope iic/speech_campplus_sv_zh_en_16k-common_advanced).

License: Apache 2.0 (inherited from 3D-Speaker).

Used by

  • Sherpa Vietnamese ASR – Senko speaker diarization pipeline (~2.5× faster than ResNet34 on CPU).
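In a diarization pipeline like the one above, per-segment embeddings are grouped by speaker via their cosine similarity. A minimal greedy-clustering sketch in plain numpy; production pipelines typically use spectral or agglomerative clustering instead, and the 0.5 threshold is an assumption:

```python
import numpy as np

def greedy_cluster(embeddings, threshold=0.5):
    """Assign each embedding to the first cluster whose seed embedding is
    similar enough, else open a new cluster. Returns one label per embedding."""
    seeds, labels = [], []
    for emb in embeddings:
        emb = emb / np.linalg.norm(emb)
        sims = [float(emb @ s) for s in seeds]
        if sims and max(sims) >= threshold:
            labels.append(int(np.argmax(sims)))
        else:
            seeds.append(emb)
            labels.append(len(seeds) - 1)
    return labels

# two synthetic "speakers": tight clusters around orthogonal directions
rng = np.random.default_rng(0)
spk_a = rng.normal(0, 0.02, (3, 192)) + np.eye(192)[0]
spk_b = rng.normal(0, 0.02, (3, 192)) + np.eye(192)[1]
print(greedy_cluster(np.vstack([spk_a, spk_b])))  # [0, 0, 0, 1, 1, 1]
```

Greedy assignment is order-dependent; it is fine as a baseline but a distance-matrix clustering gives more stable diarization output.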