# Conv-TasNet 16 kHz (Libri2Mix sepclean) → ONNX
ONNX Runtime export of JorisCos/ConvTasNet_Libri2Mix_sepclean_16k: a Conv-TasNet trained on the Libri2Mix sepclean task for 2-speaker overlap separation at 16 kHz.

This export replaces the PyTorch + asteroid runtime (~2 GB of dependencies) with ONNX Runtime (~50 MB) at no accuracy cost (SNR of 62 dB vs. the PyTorch output on speech).
| Spec | Value |
|---|---|
| Architecture | Conv-TasNet (encoder + temporal-conv mask network + decoder) |
| Input | mono waveform, (batch, time) float32, 16 kHz |
| Output | 2 separated source streams, (batch, 2, time) float32 |
| File size | 19 MB (FP32) |
| Trained on | Libri2Mix sepclean (16 kHz, 2-speaker mixtures) |
| Authors | Joris Cosentino et al. |
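You can confirm this I/O contract directly from the exported graph. A quick sketch (the printed input name should match the `mixture` input used in the quick start below; the dynamic time axis prints as a symbolic dimension):

```python
import onnxruntime as ort

sess = ort.InferenceSession("convtasnet_16k.onnx",
                            providers=["CPUExecutionProvider"])
for t in sess.get_inputs() + sess.get_outputs():
    print(t.name, t.shape, t.type)  # e.g. mixture ['batch', 'time'] tensor(float)
```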
## Why FP32 only? (no INT8)
Dynamic INT8 quantization of this model is broken: SNR vs. PyTorch drops to 17 dB, and it runs **5× slower** than FP32 on the ONNX Runtime CPU execution provider. Conv-TasNet's many small Conv1d layers don't benefit from dynamic quantization, so this repo ships FP32 only.

If you need a smaller model, consider TIGER, Mossformer2, or static quantization with calibration data (sketched below), not dynamic INT8.
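If you do attempt static quantization, the ONNX Runtime API looks roughly like this. This is an untested sketch for this model; the calibration file list is hypothetical, and feeding real 16 kHz mixtures as calibration data is the part that matters:

```python
import numpy as np
import soundfile as sf
from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize_static

class MixtureReader(CalibrationDataReader):
    """Feeds real 16 kHz mono mixtures as calibration batches."""
    def __init__(self, files):
        self._it = iter(
            [{"mixture": sf.read(f, dtype="float32")[0][np.newaxis]} for f in files]
        )

    def get_next(self):
        return next(self._it, None)

quantize_static(
    "convtasnet_16k.onnx",
    "convtasnet_16k_int8.onnx",
    MixtureReader(["mix1.wav", "mix2.wav"]),  # hypothetical calibration files
    weight_type=QuantType.QInt8,
)
```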
## Quick start
```python
import numpy as np
import onnxruntime as ort
import soundfile as sf
from huggingface_hub import hf_hub_download

# 1. Load model
model_path = hf_hub_download("welcomyou/convtasnet-libri2mix-16k-onnx",
                             "convtasnet_16k.onnx")
opts = ort.SessionOptions()
opts.intra_op_num_threads = 4
sess = ort.InferenceSession(model_path, opts, providers=["CPUExecutionProvider"])

# 2. Load 16 kHz mono mixture
audio, sr = sf.read("mixture.wav", dtype="float32")  # must be 16 kHz mono
assert sr == 16000 and audio.ndim == 1

# 3. Separate
x = audio[np.newaxis]                        # (1, T)
sources = sess.run(None, {"mixture": x})[0]  # (1, 2, T)
spk_a, spk_b = sources[0, 0], sources[0, 1]

# 4. Save, trimmed to the mixture length (Conv-TasNet may pad output by a few samples)
T = len(audio)
sf.write("source_a.wav", spk_a[:T], 16000)
sf.write("source_b.wav", spk_b[:T], 16000)
```
## Performance
PyTorch (asteroid) vs. ONNX Runtime FP32, single-threaded, on a typical CPU:
| Audio length | PyTorch | ONNX FP32 | Speedup |
|---|---|---|---|
| 3 s | 661 ms | 449 ms | 1.5× |
| 5 s | 1054 ms | 961 ms | 1.1× |
| 10 s | 2034 ms | 2056 ms | 1.0× |
Latency is roughly equal on longer clips; the win is eliminating the ~2 GB torch + asteroid dependency at runtime (cold start drops from ~30 s to ~0.5 s).
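To reproduce the ONNX side of the table, a minimal timing loop might look like this (a sketch; clip lengths and iteration counts are arbitrary):

```python
import time
import numpy as np
import onnxruntime as ort

opts = ort.SessionOptions()
opts.intra_op_num_threads = 1              # match the single-threaded setting above
sess = ort.InferenceSession("convtasnet_16k.onnx", opts,
                            providers=["CPUExecutionProvider"])

x = np.random.randn(1, 16000 * 5).astype(np.float32)  # 5 s of noise at 16 kHz
sess.run(None, {"mixture": x})                         # warm-up run
t0 = time.perf_counter()
for _ in range(5):
    sess.run(None, {"mixture": x})
print(f"{(time.perf_counter() - t0) / 5 * 1000:.0f} ms per 5 s clip")
```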
## Reproducing the export
```bash
pip install torch asteroid asteroid_filterbanks onnxruntime numpy

python convert_onnx/export_convtasnet_onnx.py \
    --output convtasnet_16k.onnx \
    --verify
```
Script: convert_onnx/export_convtasnet_onnx.py wraps torch.onnx.export with a dynamic time axis, so the exported graph accepts any input length.
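For reference, the core of such an export looks roughly like this. This is a sketch under assumptions, not the script itself: the output name `sources`, the opset version, and the final SNR check (mirroring what `--verify` presumably does) are all assumptions:

```python
import numpy as np
import onnxruntime as ort
import torch
from asteroid.models import ConvTasNet

model = ConvTasNet.from_pretrained("JorisCos/ConvTasNet_Libri2Mix_sepclean_16k")
model.eval()

dummy = torch.randn(1, 16000 * 3)  # 3 s dummy mixture, (batch, time)
torch.onnx.export(
    model, dummy, "convtasnet_16k.onnx",
    input_names=["mixture"],
    output_names=["sources"],                        # assumed output name
    dynamic_axes={"mixture": {0: "batch", 1: "time"},
                  "sources": {0: "batch", 2: "time"}},
    opset_version=17,                                # assumed opset
)

# Parity check: SNR of the ONNX output against the PyTorch output
with torch.no_grad():
    ref = model(dummy).numpy()
sess = ort.InferenceSession("convtasnet_16k.onnx",
                            providers=["CPUExecutionProvider"])
est = sess.run(None, {"mixture": dummy.numpy()})[0]
print(10 * np.log10((ref ** 2).sum() / ((ref - est) ** 2).sum()), "dB")
```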
Credits & License
- Original PyTorch model: JorisCos/ConvTasNet_Libri2Mix_sepclean_16k
- Architecture: Conv-TasNet (Luo & Mesgarani, 2019, arXiv:1809.07454)
- Training data: Libri2Mix (mixtures of LibriSpeech utterances)
- Training framework: Asteroid
License: CC-BY-SA-4.0 (inherited from JorisCos/ConvTasNet_Libri2Mix_sepclean_16k; derivative works must use the same license).
## Used by
- Sherpa Vietnamese ASR: overlap separation in the speaker diarization pipeline (works well on phone-call audio despite the English training data).