# Conv-TasNet 16 kHz (Libri2Mix sepclean) → ONNX
ONNX Runtime export of JorisCos/ConvTasNet_Libri2Mix_sepclean_16k: a Conv-TasNet trained on the Libri2Mix sepclean task for 2-speaker overlap separation at 16 kHz.

This export replaces the PyTorch + asteroid runtime (~2 GB of dependencies) with ONNX Runtime (~50 MB) at no accuracy cost (SNR of 62 dB vs. the PyTorch output on speech).
| Spec | Value |
|---|---|
| Architecture | Conv-TasNet (encoder + temporal-conv mask network + decoder) |
| Input | mono waveform, (batch, time) float32, 16 kHz |
| Output | 2 separated source streams, (batch, 2, time) float32 |
| File size | 19 MB (FP32) |
| Trained on | Libri2Mix sepclean (16 kHz, 2-speaker mixtures) |
| Authors | Joris Cosentino et al. |
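You can confirm this I/O contract directly from the exported graph. A quick sketch (the printed input name should match the `mixture` input used in the quick start below; the dynamic time axis prints as a symbolic dimension):

```python
import onnxruntime as ort

sess = ort.InferenceSession("convtasnet_16k.onnx",
                            providers=["CPUExecutionProvider"])
for t in sess.get_inputs() + sess.get_outputs():
    print(t.name, t.shape, t.type)  # e.g. mixture ['batch', 'time'] tensor(float)
```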
## Why FP32 only? (no INT8)
Dynamic INT8 quantization of this model is broken: SNR vs. PyTorch drops to 17 dB, and it runs **5× slower** than FP32 on the ONNX Runtime CPU execution provider. Conv-TasNet's many small Conv1d layers don't benefit from dynamic quantization, so this repo ships FP32 only.

If you need a smaller model, consider TIGER, Mossformer2, or static quantization with calibration data (sketched below), not dynamic INT8.
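If you do attempt static quantization, the ONNX Runtime API looks roughly like this. This is an untested sketch for this model; the calibration file list is hypothetical, and feeding real 16 kHz mixtures as calibration data is the part that matters:

```python
import numpy as np
import soundfile as sf
from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize_static

class MixtureReader(CalibrationDataReader):
    """Feeds real 16 kHz mono mixtures as calibration batches."""
    def __init__(self, files):
        self._it = iter(
            [{"mixture": sf.read(f, dtype="float32")[0][np.newaxis]} for f in files]
        )

    def get_next(self):
        return next(self._it, None)

quantize_static(
    "convtasnet_16k.onnx",
    "convtasnet_16k_int8.onnx",
    MixtureReader(["mix1.wav", "mix2.wav"]),  # hypothetical calibration files
    weight_type=QuantType.QInt8,
)
```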
## Quick start
```python
import numpy as np
import onnxruntime as ort
import soundfile as sf
from huggingface_hub import hf_hub_download

# 1. Load model
model_path = hf_hub_download("welcomyou/convtasnet-libri2mix-16k-onnx",
                             "convtasnet_16k.onnx")
opts = ort.SessionOptions()
opts.intra_op_num_threads = 4
sess = ort.InferenceSession(model_path, opts, providers=["CPUExecutionProvider"])

# 2. Load 16 kHz mono mixture
audio, sr = sf.read("mixture.wav", dtype="float32")  # must be 16 kHz mono
assert sr == 16000 and audio.ndim == 1

# 3. Separate
x = audio[np.newaxis]                        # (1, T)
sources = sess.run(None, {"mixture": x})[0]  # (1, 2, T)
spk_a, spk_b = sources[0, 0], sources[0, 1]

# 4. Save, trimmed to the mixture length (Conv-TasNet may pad output by a few samples)
T = len(audio)
sf.write("source_a.wav", spk_a[:T], 16000)
sf.write("source_b.wav", spk_b[:T], 16000)
```
## Performance
PyTorch (asteroid) vs. ONNX Runtime FP32, single-threaded, on a typical CPU:
| Audio length | PyTorch | ONNX FP32 | Speedup |
|---|---|---|---|
| 3 s | 661 ms | 449 ms | 1.5× |
| 5 s | 1054 ms | 961 ms | 1.1× |
| 10 s | 2034 ms | 2056 ms | 1.0× |
Latency is roughly equal on longer clips; the win is eliminating the ~2 GB torch + asteroid dependency at runtime (cold start drops from ~30 s to ~0.5 s).
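To reproduce the ONNX side of the table, a minimal timing loop might look like this (a sketch; clip lengths and iteration counts are arbitrary):

```python
import time
import numpy as np
import onnxruntime as ort

opts = ort.SessionOptions()
opts.intra_op_num_threads = 1              # match the single-threaded setting above
sess = ort.InferenceSession("convtasnet_16k.onnx", opts,
                            providers=["CPUExecutionProvider"])

x = np.random.randn(1, 16000 * 5).astype(np.float32)  # 5 s of noise at 16 kHz
sess.run(None, {"mixture": x})                         # warm-up run
t0 = time.perf_counter()
for _ in range(5):
    sess.run(None, {"mixture": x})
print(f"{(time.perf_counter() - t0) / 5 * 1000:.0f} ms per 5 s clip")
```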
## Reproducing the export
```bash
pip install torch asteroid asteroid_filterbanks onnxruntime numpy

python convert_onnx/export_convtasnet_onnx.py \
    --output convtasnet_16k.onnx \
    --verify
```
Script: convert_onnx/export_convtasnet_onnx.py wraps torch.onnx.export with a dynamic time axis, so the exported graph accepts any input length.
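For reference, the core of such an export looks roughly like this. This is a sketch under assumptions, not the script itself: the output name `sources`, the opset version, and the final SNR check (mirroring what `--verify` presumably does) are all assumptions:

```python
import numpy as np
import onnxruntime as ort
import torch
from asteroid.models import ConvTasNet

model = ConvTasNet.from_pretrained("JorisCos/ConvTasNet_Libri2Mix_sepclean_16k")
model.eval()

dummy = torch.randn(1, 16000 * 3)  # 3 s dummy mixture, (batch, time)
torch.onnx.export(
    model, dummy, "convtasnet_16k.onnx",
    input_names=["mixture"],
    output_names=["sources"],                        # assumed output name
    dynamic_axes={"mixture": {0: "batch", 1: "time"},
                  "sources": {0: "batch", 2: "time"}},
    opset_version=17,                                # assumed opset
)

# Parity check: SNR of the ONNX output against the PyTorch output
with torch.no_grad():
    ref = model(dummy).numpy()
sess = ort.InferenceSession("convtasnet_16k.onnx",
                            providers=["CPUExecutionProvider"])
est = sess.run(None, {"mixture": dummy.numpy()})[0]
print(10 * np.log10((ref ** 2).sum() / ((ref - est) ** 2).sum()), "dB")
```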
Credits & License
- Original PyTorch model: JorisCos/ConvTasNet_Libri2Mix_sepclean_16k
- Architecture: Conv-TasNet (Luo & Mesgarani, 2019, arXiv:1809.07454)
- Training data: Libri2Mix (mixtures of LibriSpeech utterances)
- Training framework: Asteroid
License: CC-BY-SA-4.0 (inherited from JorisCos/ConvTasNet_Libri2Mix_sepclean_16k; derivative works must use the same license).
## Used by
- Sherpa Vietnamese ASR: overlap separation in the speaker diarization pipeline (works well on phone-call audio despite the English training data).