IndicConformer 600M — ONNX export for Vernacula

Re-packaged ONNX export of AI4Bharat's 22-language ai4bharat/indic-conformer-600m-multilingual, in the on-disk shape that Vernacula's desktop ASR app expects. Only the CTC head is included; the RNNT components from the source repo are not shipped here.

Conversion script: scripts/indicconformer_export/
Vernacula: github.com/christopherthompson81/vernacula
Upstream model: ai4bharat/indic-conformer-600m-multilingual

All numerical behavior is identical to the upstream encoder + CTC graph; only the on-disk packaging differs.

Highlights

22 languages, one shared CTC head. Encoder dim → 5633 logits with the shared blank at id 5632; per-language vocab spans live in language_spans.json as {start, length} pairs (22 × 256 tokens). Language selection is a C# post-argmax mask, not an ONNX input — one model serves every language.
Phase 1 parity (CPU-CPU FP32): max-abs logits delta 6.87e-5 at 1e-3 tolerance. CUDA-vs-CPU cross-device drift hit 1e-2 scale — typical for a 17-layer Conformer; CPU is the numerically exact path.
Real-audio parity on a 9.4 s hi-IN Fleurs clip: about 2 word edits against an 11-word reference, which confirmed vocab, SentencePiece detokenisation, and language-span masking end-to-end.
Repackaged 2.43 GB of AI4Bharat external-data from 366 per-tensor blobs into a single .data sidecar. The repackaging walks initialisers + node attributes recursively so nothing is left behind.
Reused the Parakeet DFT preprocessor (no STFT op), with a getattr() shim for NeMo 1.23.0rc0 compatibility (exact_pad, stft_pad_amount are 2.x-only field names).
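
The Phase 1 parity figure above is a max-abs comparison of two logit tensors against a tolerance. A minimal sketch of that kind of check, assuming both tensors have already been flattened to plain Python sequences; the helper names `max_abs_delta` and `within_tolerance` are illustrative and are not taken from the export scripts:

```python
def max_abs_delta(reference, candidate):
    """Largest element-wise absolute difference between two logit sequences."""
    if len(reference) != len(candidate):
        raise ValueError("logit sequences must have the same length")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def within_tolerance(reference, candidate, tol=1e-3):
    """True when the worst-case logit delta stays at or below `tol`."""
    return max_abs_delta(reference, candidate) <= tol
```

With NumPy arrays the same quantity is `np.max(np.abs(a - b))`. A delta on the order of 6.87e-5 passes a 1e-3 tolerance, while drift at 1e-2 scale would not.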

File	Purpose
`encoder-model.onnx` (+ `.data`)	Conformer encoder, `[features, features_lens] -> [encoded, encoded_lens]`
`ctc_decoder-model.onnx`	Single `Conv1d` → 5633-dim logits (22 × 256 language tokens + 1 shared CTC blank at id 5632)
`nemo128.onnx`	DFT-conv1d 80-mel preprocessor, `[waveforms, waveforms_lens] -> [features, features_lens]`
`vocab.txt`	Flat 5632-line vocab, id = line index; shared CTC blank is implicit at id 5632
`language_spans.json`	22 × `{start, length}` — which slice of `vocab.txt` each language's 256 tokens occupy
`config.json`	Preprocessor frontend params + CTC blank id
`manifest.json`	Per-file MD5 hashes (used by Vernacula's download verifier)
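
Outside Vernacula, the MD5 hashes in `manifest.json` can be checked by hand. The sketch below assumes the manifest is a flat JSON object mapping file name to MD5 hex digest; the real schema is not documented here, so adjust the parsing if it differs. `md5_of` and `find_mismatches` are hypothetical helper names.

```python
import hashlib
import json
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Stream a file through MD5 so a multi-gigabyte sidecar never sits in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(package_dir, manifest_name="manifest.json"):
    """Return the names of files that are missing or whose MD5 differs.

    ASSUMPTION: the manifest is a flat JSON object {file name: MD5 hex digest}.
    """
    package_dir = Path(package_dir)
    manifest = json.loads((package_dir / manifest_name).read_text(encoding="utf-8"))
    bad = []
    for name, expected in manifest.items():
        target = package_dir / name
        if not target.is_file() or md5_of(target) != expected.lower():
            bad.append(name)
    return bad
```

An empty list from `find_mismatches(path)` means every file listed in the manifest is present and matches its recorded hash.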

Export provenance

Exported via scripts/indicconformer_export/ in the Vernacula repo. The export uses AI4Bharat's NeMo fork, installed in its own venv and kept isolated from the main NeMo export tooling because the fork pins different NeMo internals.

License

MIT, inherited from the upstream ai4bharat/indic-conformer-600m-multilingual model.

Using these files

In Vernacula, select IndicConformer as the ASR backend in Settings and the package will be downloaded and verified automatically. Outside Vernacula, pull with huggingface_hub and load with onnxruntime:

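from huggingface_hub import snapshot_download

# Download every file in the repo into the local Hugging Face cache; returns the snapshot directory.
path = snapshot_download(repo_id="christopherthompson81/indicconformer-600m-onnx")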

CTC decoding is performed against vocab.txt with the blank id at 5632. The language_spans.json file lets you mask the logits to a specific language's 256-token span before greedy / beam decoding. See scripts/indicconformer_export/README.md for details.
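
The masking step described above can be sketched in plain Python. This is an illustration, not the Vernacula C# implementation: it assumes logits arrive as per-frame lists of 5633 scores and that a span is a `{"start", "length"}` dict as in `language_spans.json`. The function name `greedy_ctc_decode` is hypothetical.

```python
BLANK_ID = 5632  # shared CTC blank id, per the file table


def greedy_ctc_decode(logits, span, vocab, blank_id=BLANK_ID):
    """Greedy CTC decode restricted to one language's token span.

    logits: per-frame sequences of scores, one score per output id.
    span:   {"start": int, "length": int}, as in language_spans.json.
    vocab:  list of token strings where id == line index in vocab.txt.
    """
    # Only this language's ids plus the shared blank may win the argmax.
    allowed = list(range(span["start"], span["start"] + span["length"])) + [blank_id]
    tokens = []
    previous = blank_id
    for frame in logits:
        # Argmax over the allowed ids; every other language is masked out.
        best = max(allowed, key=lambda i: frame[i])
        # Standard CTC collapse: drop blanks and consecutive repeats.
        if best != blank_id and best != previous:
            tokens.append(vocab[best])
        previous = best
    return tokens
```

The returned pieces still need SentencePiece detokenisation to become readable text. Restricting the argmax to the span and the blank keeps tokens from the other 21 languages out of the output even when they score higher on a given frame.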

Limitations

Covers 22 official Indian languages (listed in frontmatter). Accuracy and known failure modes inherit from the upstream AI4Bharat model card. The RNNT head from the source model is not included — only the CTC path — which trades a small amount of accuracy for substantially simpler decoding.

Citation

For the canonical citation of the underlying model, see the upstream model card.

Acknowledgments

Original model: AI4Bharat (IIT Madras)
ONNX repackaging: Chris Thompson for Vernacula

Issues with the ONNX export specifically: open an issue on the Vernacula repo. Issues with the underlying model: see the upstream model card.
