
# Conv-TasNet Speaker Separation (ONNX)

Speaker separation model that splits overlapping speech into individual per-speaker streams. Exported from SpeechBrain's pretrained SepFormer trained on WSJ0-2Mix.

## Available Models

| File | Format | Size | SNR vs FP32 | Notes |
|------|--------|------|-------------|-------|
| `conv_tasnet_libri2mix_int8.onnx` | QUInt8 | 29.7 MB | 22.6 dB | Recommended: 71% smaller, negligible quality loss |
| `conv_tasnet_libri2mix.onnx` | FP32 | 101.2 MB | baseline | Full-precision reference |
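The "SNR vs FP32" column can be reproduced by treating the FP32 model's output as the reference signal and the int8 output as the estimate. A minimal sketch of the metric (plain SNR in dB; `snr_db` is an illustrative helper, not part of this repo):

```python
import numpy as np

def snr_db(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Signal-to-noise ratio of `estimate` against `reference`, in dB.

    The "noise" is the difference between the two waveforms; a small
    epsilon avoids division by zero for identical signals.
    """
    noise = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / (np.sum(noise ** 2) + 1e-12))
```

Run both models on the same mixture and feed the two `separated` arrays to this function, per source.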

## Model Details

| Property | Value |
|----------|-------|
| Architecture | SepFormer (encoder + Transformer masknet + decoder) |
| Training data | WSJ0-2Mix |
| Sample rate | 8 kHz mono |
| Input | `mixture`: `[batch, time]` float32 waveform |
| Output | `separated`: `[batch, 2, time]` float32 (2 separated sources) |
| ONNX opset | 17 |
| Parameters | ~26M |
| Quantization | Dynamic QUInt8 on MatMul ops (Transformer attention weights) |
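Given the I/O shapes above, a small guard before inference avoids the most common shape mismatch (1-D audio passed where `[batch, time]` is expected). `as_mixture` is an illustrative helper, not part of the model repo:

```python
import numpy as np

def as_mixture(waveform: np.ndarray) -> np.ndarray:
    """Coerce a 1-D or 2-D waveform into the [batch, time] float32 layout."""
    x = np.asarray(waveform, dtype=np.float32)
    if x.ndim == 1:
        x = x[np.newaxis, :]  # [time] -> [1, time]
    if x.ndim != 2:
        raise ValueError(f"expected [batch, time], got shape {x.shape}")
    return x
```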

## Usage (Python)

```python
import onnxruntime as ort
import numpy as np

# Use the int8 model (recommended)
sess = ort.InferenceSession("conv_tasnet_libri2mix_int8.onnx")

# Input: mono waveform at 8 kHz
mixture = np.random.randn(1, 8000).astype(np.float32)  # 1 second
separated = sess.run(None, {"mixture": mixture})[0]

# separated.shape == (1, 2, 8000): two speaker sources
source_1 = separated[0, 0, :]
source_2 = separated[0, 1, :]
```
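The separated sources are plain float32 waveforms, so they can be written to disk with the standard-library `wave` module. A sketch, assuming samples are roughly in [-1, 1] (`write_wav` is a hypothetical helper, not part of this repo):

```python
import wave
import numpy as np

def write_wav(path: str, samples: np.ndarray, sample_rate: int = 8000) -> None:
    """Write a mono float32 waveform in [-1, 1] as 16-bit PCM."""
    pcm = (np.clip(samples, -1.0, 1.0) * 32767.0).astype(np.int16)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)      # mono
        f.setsampwidth(2)      # 16-bit
        f.setframerate(sample_rate)
        f.writeframes(pcm.tobytes())
```

For example, `write_wav("speaker1.wav", source_1)` and `write_wav("speaker2.wav", source_2)` save each stream for listening.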

## Usage (Rust / ort)

```rust
use ort::session::Session;
use ndarray::Array2;

let session = Session::builder()?.commit_from_file("conv_tasnet_libri2mix_int8.onnx")?;
let input = Array2::<f32>::zeros((1, 8000)); // 1 s at 8 kHz
let outputs = session.run(ort::inputs!["mixture" => input.view()])?;
// outputs[0] shape: [1, 2, 8000]
```

## Integration in Second Brain

Used by `core-asr` for real-time speaker separation when overlapping speech is detected:

```
System audio (16 kHz) → overlap detected?
  ├─ NO  → normal single-speaker path
  └─ YES → resample 16→8 kHz → SepFormer → 2 sources → resample 8→16 kHz
           → WeSpeaker embedding per source → cluster → per-speaker decode
```
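The two resampling hops in the pipeline above can be sketched with SciPy's polyphase resampler. This is an illustration only; the actual `core-asr` path is in Rust, and `to_8k`/`to_16k` are hypothetical names:

```python
import numpy as np
from scipy.signal import resample_poly

def to_8k(waveform_16k: np.ndarray) -> np.ndarray:
    """Downsample 16 kHz audio to the model's 8 kHz rate (polyphase filter)."""
    return resample_poly(waveform_16k, 1, 2).astype(np.float32)

def to_16k(waveform_8k: np.ndarray) -> np.ndarray:
    """Upsample separated 8 kHz sources back to the 16 kHz pipeline rate."""
    return resample_poly(waveform_8k, 2, 1).astype(np.float32)
```

Polyphase resampling applies an anti-aliasing filter, which matters here because aliasing artifacts would degrade the downstream speaker embeddings.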

Auto-downloaded via `ensure_moonshine_models()` from `Mazino0/conv-tasnet-onnx`.

## Export & Quantization

```bash
pip install speechbrain torch onnx onnxruntime

# Export FP32 model
python scripts/export_conv_tasnet.py

# Quantize to int8
python -c "
from onnxruntime.quantization import quantize_dynamic, QuantType
quantize_dynamic(
    'conv_tasnet_libri2mix.onnx',
    'conv_tasnet_libri2mix_int8.onnx',
    weight_type=QuantType.QUInt8,
    op_types_to_quantize=['MatMul'],
)
"
```

## License

The SepFormer model weights are from SpeechBrain (Apache 2.0). WSJ0-2Mix data is from the Wall Street Journal corpus (LDC license).
