# Conv-TasNet 16 kHz (Libri2Mix sepclean) - ONNX

An ONNX Runtime export of `JorisCos/ConvTasNet_Libri2Mix_sepclean_16k`: a Conv-TasNet trained on Libri2Mix sepclean for 2-speaker overlap separation at 16 kHz.

This export replaces the PyTorch + asteroid runtime stack (about 2 GB of dependencies) with ONNX Runtime (about 50 MB) at no accuracy cost: outputs match the PyTorch model at 62 dB SNR on speech.

## Spec

| Property | Value |
|---|---|
| Architecture | Conv-TasNet (encoder + temporal-conv mask network + decoder) |
| Input | mono waveform, `(batch, time)` float32, 16 kHz |
| Output | 2 separated source streams, `(batch, 2, time)` float32 |
| File size | 19 MB (FP32) |
| Trained on | Libri2Mix sepclean (16 kHz, 2-speaker mixtures) |
| Authors | Joris Cosentino et al. |

## Why FP32 only? (no INT8)

Dynamic INT8 quantization of this model is broken: SNR against the PyTorch reference drops to 17 dB, and inference runs **5× slower** than FP32 on the ONNX Runtime CPU execution provider. Conv-TasNet's many small Conv1d layers don't benefit from dynamic quantization, so this repo ships FP32 only.

If you need a smaller model, consider TIGER, MossFormer2, or static quantization with calibration data (sketched below); avoid dynamic INT8 for this model.
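Static quantization, unlike dynamic, runs representative audio through the model to calibrate activation ranges ahead of time. Below is a minimal, untested sketch using ONNX Runtime's `quantize_static`; the calibration file names are placeholders, and whether accuracy actually recovers for this model has not been verified here.

```python
import numpy as np
import soundfile as sf
from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize_static

class MixtureReader(CalibrationDataReader):
    """Feeds 16 kHz mono calibration mixtures to the quantizer, one at a time."""
    def __init__(self, wav_paths):
        self._paths = iter(wav_paths)

    def get_next(self):
        path = next(self._paths, None)
        if path is None:
            return None  # None signals the end of calibration data
        audio, _ = sf.read(path, dtype="float32")
        return {"mixture": audio[np.newaxis]}  # (1, T), matching the model input

# Placeholder list; a few dozen representative mixtures is typical.
calib = MixtureReader(["calib_000.wav", "calib_001.wav"])
quantize_static("convtasnet_16k.onnx", "convtasnet_16k_int8.onnx", calib,
                weight_type=QuantType.QInt8)
```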

## Quick start

```python
import numpy as np
import onnxruntime as ort
import soundfile as sf
from huggingface_hub import hf_hub_download

# 1. Load model
model_path = hf_hub_download("welcomyou/convtasnet-libri2mix-16k-onnx",
                             "convtasnet_16k.onnx")
opts = ort.SessionOptions()
opts.intra_op_num_threads = 4
sess = ort.InferenceSession(model_path, opts, providers=["CPUExecutionProvider"])

# 2. Load 16 kHz mono mixture
audio, sr = sf.read("mixture.wav", dtype="float32")  # must be 16 kHz mono
assert sr == 16000 and audio.ndim == 1

# 3. Separate
x = audio[np.newaxis]  # (1, T)
sources = sess.run(None, {"mixture": x})[0]  # (1, 2, T)
spk_a, spk_b = sources[0, 0], sources[0, 1]

# 4. Save, trimming to the mixture length (Conv-TasNet may pad output by a few samples)
T = len(audio)
sf.write("source_a.wav", spk_a[:T], 16000)
sf.write("source_b.wav", spk_b[:T], 16000)
```

## Performance

PyTorch (asteroid) vs ONNX Runtime FP32, single-threaded on a typical CPU:

| Audio length | PyTorch | ONNX FP32 | Speedup |
|---|---|---|---|
| 3 s | 661 ms | 449 ms | 1.5× |
| 5 s | 1054 ms | 961 ms | 1.1× |
| 10 s | 2034 ms | 2056 ms | 1.0× |

Latency is roughly equal (ONNX is somewhat faster on short clips); the real win is eliminating the 2 GB torch + asteroid dependency at runtime, which drops cold start from ~30 s to ~0.5 s.
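To check the ONNX side of these numbers on your own hardware, a minimal timing sketch (single-threaded session to match the table, one warm-up run, then an averaged loop):

```python
import time
import numpy as np
import onnxruntime as ort

opts = ort.SessionOptions()
opts.intra_op_num_threads = 1  # match the single-threaded numbers above
sess = ort.InferenceSession("convtasnet_16k.onnx", opts,
                            providers=["CPUExecutionProvider"])

for seconds in (3, 5, 10):
    x = np.random.randn(1, 16000 * seconds).astype(np.float32)
    sess.run(None, {"mixture": x})  # warm-up
    runs = 5
    t0 = time.perf_counter()
    for _ in range(runs):
        sess.run(None, {"mixture": x})
    print(f"{seconds:>2} s audio: {(time.perf_counter() - t0) / runs * 1000:.0f} ms")
```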

## Reproducing the export

```bash
pip install torch asteroid asteroid_filterbanks onnxruntime numpy
python convert_onnx/export_convtasnet_onnx.py \
    --output convtasnet_16k.onnx \
    --verify
```

The script `convert_onnx/export_convtasnet_onnx.py` wraps `torch.onnx.export` with a dynamic time axis, so the exported model accepts any input length.
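The core of such an export looks roughly like the sketch below. Assumptions on my part: the checkpoint loads via asteroid's `from_pretrained`, and opset 17 is one workable choice; the actual script may differ in details.

```python
import torch
from asteroid.models import ConvTasNet

# Load the original PyTorch checkpoint from the Hub
model = ConvTasNet.from_pretrained("JorisCos/ConvTasNet_Libri2Mix_sepclean_16k")
model.eval()

dummy = torch.randn(1, 16000)  # 1 s of 16 kHz audio, (batch, time)
torch.onnx.export(
    model, dummy, "convtasnet_16k.onnx",
    input_names=["mixture"], output_names=["sources"],
    # Mark batch and time axes as dynamic so any input length is accepted
    dynamic_axes={"mixture": {0: "batch", 1: "time"},
                  "sources": {0: "batch", 2: "time"}},
    opset_version=17,
)
```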

Credits & License

License: CC-BY-SA-4.0, inherited from `JorisCos/ConvTasNet_Libri2Mix_sepclean_16k`; derivative works must use the same license.

## Used by

  • Sherpa Vietnamese ASR β€” overlap separation in the speaker diarization pipeline (works well on phone-call audio despite English training data).