VoxLingua107 ECAPA-TDNN — ONNX export for Vernacula

Re-packaged ONNX export of speechbrain/lang-id-voxlingua107-ecapa for use as the language-identification backend in Vernacula.

Highlights

  • STFT op replaced with two Conv1D passes (cos + sin DFT basis, windowed). The stock SpeechBrain export forced a host fallback for the STFT op; with the replacement, preprocessing's share of wall time drops from 85.6% to 14%, and the graph is roughly 27× faster on CUDA than the stock export.
  • Single 83 MB FP32 ONNX file, weights inlined, no .data sidecar — minimal distribution friction for clients.
  • Parity validated at Δprob 3e-11 to 6e-5 and cosine similarity ≥ 0.9999 across a 5-clip set (en, de, fr, ru, hu; 90–602 s). Top-1 language matches PyTorch on every clip.
  • Duration-accuracy sweep (sweep_duration_accuracy.py) shows confidence plateau beyond ~30 s; clips under 5 s are the noisy regime.
  • IOBinding profiling harness (bench_iobinding.py) isolates H2D / D2H allocation and copy overhead by comparing numpy ↔ session.run vs GPU OrtValue buffers + run_with_iobinding for both serial and batched (b=16) workloads.
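
The parity figures in the list above (Δprob, cosine similarity, top-1 agreement) are the kind of metrics a per-clip comparison of two backends produces. A minimal sketch, assuming the class probabilities from each backend are already available as plain Python lists; the function names and the sample numbers are illustrative and are not taken from the export scripts:

```python
import math

def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top1_matches(probs_ref, probs_test):
    """True when both probability vectors peak at the same class index."""
    def argmax(v):
        return max(range(len(v)), key=v.__getitem__)
    return argmax(probs_ref) == argmax(probs_test)

# Made-up numbers standing in for PyTorch vs ONNX outputs on one clip.
probs_torch = [0.90, 0.07, 0.03]
probs_onnx = [0.90001, 0.06999, 0.03]
print("max abs diff:", max_abs_diff(probs_torch, probs_onnx))
print("cosine similarity:", cosine_similarity(probs_torch, probs_onnx))
print("top-1 match:", top1_matches(probs_torch, probs_onnx))
```

The same helpers apply to the 256-dim embeddings, where cosine similarity is usually the more informative of the two numbers.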

Contents

File               Purpose
voxlingua107.onnx  End-to-end graph: raw 16 kHz audio → 107-class logits + 256-dim embedding
lang_map.json      Class index → { iso, name } lookup
manifest.json      Per-file MD5 hashes for integrity checks
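
A downloaded snapshot can be checked against manifest.json before loading the model. The sketch below is hedged: the layout of manifest.json is not documented here, so it assumes a flat JSON object mapping each file name to its MD5 hex digest. The helper names `md5_of_file` and `verify_manifest` are hypothetical.

```python
import hashlib
import json
from pathlib import Path

def md5_of_file(path, chunk_size=1 << 20):
    """Stream a file through MD5 so large models are not read into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_manifest(model_dir, manifest_name="manifest.json"):
    """Return the names of files that are missing or whose MD5 differs.

    ASSUMPTION: the manifest is a flat {"file name": "md5 hex"} mapping.
    Adjust the loop if the real file nests its entries differently.
    """
    model_dir = Path(model_dir)
    with open(model_dir / manifest_name, encoding="utf-8") as fh:
        expected = json.load(fh)
    mismatched = []
    for name, want in expected.items():
        target = model_dir / name
        if not target.is_file() or md5_of_file(target) != want.lower():
            mismatched.append(name)
    return mismatched
```

An empty return value means every listed file matched; anything else is a list of files worth re-downloading.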

Preprocessing (FBANK via Conv1D, per-utterance mean-variance norm) is folded into the graph, so consumers just send raw PCM.

Export provenance

Exported via scripts/voxlingua107_export/ in the Vernacula repo. The STFT op is replaced with two Conv1D passes (cos + sin basis, windowed) so the preprocessing path has CUDA kernels end-to-end — roughly a 27× speedup on CUDA vs the stock SpeechBrain export.
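
The idea behind the Conv1D replacement can be illustrated without any ML framework: a strided convolution whose kernels are windowed cosine and sine waves produces the real and imaginary parts of each STFT frame. The following is a small pure-Python sketch of that equivalence, not the export code itself. The Hann window, the frame size of 16, the hop of 4, and the absence of padding are assumptions chosen for the demo.

```python
import cmath
import math

def hann(n_fft):
    """Periodic Hann window (an assumed window choice for this sketch)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]

def dft_kernels(n_fft, window):
    """Windowed cos and -sin kernels, one row per frequency bin 0..n_fft//2."""
    bins = range(n_fft // 2 + 1)
    cos_k = [[window[n] * math.cos(2 * math.pi * k * n / n_fft) for n in range(n_fft)]
             for k in bins]
    sin_k = [[-window[n] * math.sin(2 * math.pi * k * n / n_fft) for n in range(n_fft)]
             for k in bins]
    return cos_k, sin_k

def conv1d(signal, kernels, hop):
    """Strided, unpadded cross-correlation: out[k][t] = sum_n kernels[k][n] * signal[t*hop + n]."""
    n_fft = len(kernels[0])
    n_frames = 1 + (len(signal) - n_fft) // hop
    return [[sum(w * signal[t * hop + n] for n, w in enumerate(kernel))
             for t in range(n_frames)]
            for kernel in kernels]

def stft_via_conv(signal, n_fft, hop):
    """Two convolution passes: one yields the real part, the other the imaginary part."""
    cos_k, sin_k = dft_kernels(n_fft, hann(n_fft))
    return conv1d(signal, cos_k, hop), conv1d(signal, sin_k, hop)

def stft_reference(signal, n_fft, hop):
    """Direct windowed DFT of each frame, used only to check the convolution result."""
    window = hann(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    return [[sum(window[n] * signal[t * hop + n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                 for n in range(n_fft))
             for t in range(n_frames)]
            for k in range(n_fft // 2 + 1)]

signal = [math.sin(0.3 * i) + 0.5 * math.cos(1.1 * i) for i in range(64)]
real, imag = stft_via_conv(signal, n_fft=16, hop=4)
ref = stft_reference(signal, n_fft=16, hop=4)
max_err = max(abs(complex(real[k][t], imag[k][t]) - ref[k][t])
              for k in range(len(ref)) for t in range(len(ref[0])))
print(f"max |conv - DFT| = {max_err:.2e}")
```

Because both passes are ordinary convolutions, an ONNX graph built this way needs only Conv nodes for the transform, which is what lets the preprocessing path stay on the GPU.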

License

Apache-2.0, inherited from the SpeechBrain source model.

Using these files

In Vernacula, language-ID runs automatically when the active ASR backend needs to choose a language. Outside Vernacula, pull with huggingface_hub and run with onnxruntime:

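import json

import numpy as np
import onnxruntime as ort
from huggingface_hub import snapshot_download

path = snapshot_download(repo_id="christopherthompson81/voxlingua107-lid-onnx")
# Prefer CUDA when available; onnxruntime falls back to CPU otherwise.
providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
sess = ort.InferenceSession(f"{path}/voxlingua107.onnx", providers=providers)
with open(f"{path}/lang_map.json", encoding="utf-8") as f:
    lang_map = json.load(f)

# Feed raw 16 kHz mono PCM as float32 with shape [batch, samples].
audio = np.random.randn(1, 16000).astype(np.float32)  # placeholder: 1 s of noise
input_name = sess.get_inputs()[0].name  # read the input name from the graph
# Outputs: logits [batch, 107] and embedding [batch, 256]
outputs = sess.run(None, {input_name: audio})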
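
Turning the 107 class scores into a language label takes a softmax and a lookup in lang_map.json. A minimal sketch under stated assumptions: the keys of lang_map are taken to be class indices stored as strings (JSON object keys are always strings), each entry is taken to carry `iso` and `name` fields as the Contents table describes, and `top_language` is a hypothetical helper name.

```python
import math

def softmax(scores):
    """Numerically stable softmax over one row of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_language(scores, lang_map):
    """Return (iso, name, probability) for the highest-scoring class.

    ASSUMPTION: lang_map maps str(class_index) -> {"iso": ..., "name": ...}.
    """
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    entry = lang_map[str(best)]
    return entry["iso"], entry["name"], probs[best]

# Toy 3-class example; the real model emits 107 scores per clip.
toy_map = {
    "0": {"iso": "en", "name": "English"},
    "1": {"iso": "de", "name": "German"},
    "2": {"iso": "fr", "name": "French"},
}
print(top_language([2.0, 0.5, -1.0], toy_map))
```

With real model output, pass one row of the logits array (for example `logits[0].tolist()`) together with the loaded lang_map.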

Limitations

Covers 107 languages (see the upstream VoxLingua107 paper for the full list). Accuracy varies by language and acoustic domain; the model was trained on YouTube audio and performs best on similar conversational speech. Short clips (<3 s) are noticeably less reliable than longer ones.

Citation

For the underlying model:

@inproceedings{valk2021slt,
  title={{VoxLingua107}: a Dataset for Spoken Language Recognition},
  author={J{\"o}rgen Valk and Tanel Alum{\"a}e},
  booktitle={Proc. IEEE SLT Workshop},
  year={2021},
}

See the upstream model card for additional citations.

Acknowledgments

  • Original model: SpeechBrain (Jörgen Valk, Tanel Alumäe, Tallinn University of Technology)
  • ONNX repackaging: Chris Thompson for Vernacula

Issues with the ONNX export specifically: open an issue on the Vernacula repo. Issues with the underlying model: see the upstream model card.
