| --- |
| license: mit |
| library_name: onnxruntime |
| pipeline_tag: automatic-speech-recognition |
| tags: |
| - onnx |
| - onnxruntime |
| - automatic-speech-recognition |
| - indic |
| - indicconformer |
| - ai4bharat |
| - vernacula |
| base_model: ai4bharat/indic-conformer-600m-multilingual |
| language: |
| - as |
| - bn |
| - brx |
| - doi |
| - gu |
| - hi |
| - kn |
| - kok |
| - ks |
| - mai |
| - ml |
| - mni |
| - mr |
| - ne |
| - or |
| - pa |
| - sa |
| - sat |
| - sd |
| - ta |
| - te |
| - ur |
| --- |
| |
| # IndicConformer 600M: ONNX export for Vernacula |
|
|
| Re-packaged ONNX export of AI4Bharat's 22-language |
| [`ai4bharat/indic-conformer-600m-multilingual`](https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual), |
| in the on-disk shape that [Vernacula](https://github.com/christopherthompson81/vernacula)'s |
| desktop ASR app expects. Only the CTC head is included; the RNNT components from the |
| source repo are not shipped here. |
|
|
| - **Conversion script:** [`scripts/indicconformer_export/`](https://github.com/christopherthompson81/vernacula/tree/main/scripts/indicconformer_export) |
| - **Vernacula:** [github.com/christopherthompson81/vernacula](https://github.com/christopherthompson81/vernacula) |
| - **Upstream model:** [`ai4bharat/indic-conformer-600m-multilingual`](https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual) |
|
|
| All numerical behavior is identical to the upstream encoder + CTC graph; |
| only the on-disk packaging differs. |
|
|
| ## Highlights |
|
|
| - **22 languages, one shared CTC head.** Encoder dim → 5633 logits with the shared blank at id 5632; per-language vocab spans live in `language_spans.json` as `{start, length}` pairs (22 × 256 tokens). Language selection is a C# post-argmax mask, not an ONNX input, so one model serves every language. |
| - **Phase 1 parity (CPU-CPU FP32): max-abs logits delta 6.87e-5** against a 1e-3 tolerance. CUDA-vs-CPU cross-device drift reached the 1e-2 scale, which is typical for a 17-layer Conformer; CPU is the numerically exact path. |
| - **Real-audio parity on a 9.4 s hi-IN FLEURS clip:** about 2 word edits against an 11-word reference, which confirmed the vocab, SentencePiece detokenisation, and language-span masking end to end. |
| - **Repackaged 2.43 GB of AI4Bharat external-data from 366 per-tensor blobs** into a single `.data` sidecar. The repackaging walks initialisers + node attributes recursively so nothing is left behind. |
| - **Reused the Parakeet DFT preprocessor** (no `STFT` op), with a `getattr()` shim for NeMo 1.23.0rc0 compatibility (`exact_pad`, `stft_pad_amount` are 2.x-only field names). |
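
The parity figure in the list above is a maximum absolute difference between two logits tensors, compared against a tolerance. The following is a minimal sketch of that kind of check, not the export script's actual code; the function names and sample values are hypothetical, and plain nested lists are used only to keep the example dependency-free.

```python
def max_abs_delta(a, b):
    """Largest |a_i - b_i| over two equally shaped nested sequences of floats."""
    if isinstance(a, (int, float)):
        return abs(a - b)
    if len(a) != len(b):
        raise ValueError("shape mismatch between reference and candidate")
    return max((max_abs_delta(x, y) for x, y in zip(a, b)), default=0.0)


def within_tolerance(reference, candidate, tolerance):
    """True when every element of candidate is within tolerance of reference."""
    return max_abs_delta(reference, candidate) <= tolerance


# Hypothetical 2 x 2 logits from a reference run and a re-exported graph.
reference = [[0.10, -2.00], [3.50, 0.00]]
candidate = [[0.10, -2.00005], [3.50, 0.00002]]
print(within_tolerance(reference, candidate, 1e-3))
```

On real model outputs the same comparison is usually written with NumPy as `np.abs(a - b).max()`.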
|
|
| ## Contents |
|
|
| | File | Purpose | |
| |---|---| |
| | `encoder-model.onnx` (+ `.data`) | Conformer encoder, `[features, features_lens] -> [encoded, encoded_lens]` | |
| | `ctc_decoder-model.onnx` | Single `Conv1d` → 5633-dim logits (22 × 256 language tokens + 1 shared CTC blank at id 5632) | |
| | `nemo128.onnx` | DFT-conv1d 80-mel preprocessor, `[waveforms, waveforms_lens] -> [features, features_lens]` | |
| | `vocab.txt` | Flat 5632-line vocab, id = line index; shared CTC blank is implicit at id 5632 | |
| | `language_spans.json` | 22 × `{start, length}` entries: which slice of `vocab.txt` each language's 256 tokens occupy | |
| | `config.json` | Preprocessor frontend params + CTC blank id | |
| | `manifest.json` | Per-file MD5 hashes (used by Vernacula's download verifier) | |
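
Outside Vernacula, the MD5 hashes in `manifest.json` can be checked by hand. The sketch below assumes the manifest is a flat JSON object mapping file name to MD5 hex digest; that schema is an assumption rather than something this card documents, so adjust the parsing to the real layout. The names `md5_of` and `verify_package` are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Stream a file through MD5 so the multi-GB sidecar is never loaded whole."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_package(root, manifest_name="manifest.json"):
    """Return the names of files that are missing or whose MD5 differs.

    Assumption: the manifest is a flat JSON object {file name: md5 hex}.
    """
    root = Path(root)
    expected = json.loads((root / manifest_name).read_text(encoding="utf-8"))
    bad = []
    for name, want in expected.items():
        target = root / name
        if not target.is_file() or md5_of(target) != want.lower():
            bad.append(name)
    return bad
```

An empty list from `verify_package(path)` means every listed file is present and matches its hash.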
|
|
| ## Export provenance |
|
|
| Exported via [`scripts/indicconformer_export/`](https://github.com/christopherthompson81/vernacula/tree/main/scripts/indicconformer_export) |
| in the [Vernacula](https://github.com/christopherthompson81/vernacula) repo. The export uses [AI4Bharat's NeMo fork](https://github.com/AI4Bharat/NeMo), |
| kept in a venv isolated from the main [NeMo](https://github.com/NVIDIA/NeMo) export tooling because the fork pins |
| different NeMo internals. |
|
|
| ## License |
|
|
| [MIT](https://opensource.org/licenses/MIT), inherited from the upstream |
| [`ai4bharat/indic-conformer-600m-multilingual`](https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual) |
| model. |
|
|
| ## Using these files |
|
|
| In Vernacula, select IndicConformer as the ASR backend in Settings and the |
| package will be downloaded and verified automatically. Outside Vernacula, |
| pull with `huggingface_hub` and load with `onnxruntime`: |
|
|
| ```python |
| from huggingface_hub import snapshot_download |
| path = snapshot_download(repo_id="christopherthompson81/indicconformer-600m-onnx") |
| ``` |
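
Once downloaded, the three graphs can be chained in the order given in the Contents table: preprocessor, encoder, CTC decoder. The following is a rough sketch under stated assumptions, not Vernacula's implementation: it assumes each graph's inputs are declared in the order shown in that table, that the decoder accepts the encoder output (and optionally its lengths), and that CPU execution is wanted. The helper names are hypothetical.

```python
from pathlib import Path


def model_paths(root):
    """Map pipeline stage to ONNX file, using the file names from the Contents table."""
    root = Path(root)
    return {
        "preprocessor": root / "nemo128.onnx",
        "encoder": root / "encoder-model.onnx",
        "ctc_decoder": root / "ctc_decoder-model.onnx",
    }


def load_sessions(root):
    """Create one CPU InferenceSession per stage."""
    import onnxruntime as ort  # imported lazily so the other helpers work without it

    return {
        stage: ort.InferenceSession(str(path), providers=["CPUExecutionProvider"])
        for stage, path in model_paths(root).items()
    }


def run_stage(session, inputs):
    """Run a session, pairing positional inputs with the graph's own input names."""
    names = [arg.name for arg in session.get_inputs()]
    # zip() stops at the shorter sequence, so surplus positional inputs are dropped.
    return session.run(None, dict(zip(names, inputs)))


def compute_logits(sessions, waveforms, waveforms_lens):
    """waveforms -> features -> encoded -> CTC logits."""
    features, features_lens = run_stage(
        sessions["preprocessor"], [waveforms, waveforms_lens]
    )
    encoded, encoded_lens = run_stage(sessions["encoder"], [features, features_lens])
    # Assumption: the decoder's first input is the encoder output.
    logits = run_stage(sessions["ctc_decoder"], [encoded, encoded_lens])[0]
    return logits, encoded_lens
```

With the `path` from `snapshot_download`, a call would look like `compute_logits(load_sessions(path), waveforms, waveforms_lens)`. The expected audio format and array dtypes are not stated here; check `config.json` and each session's `get_inputs()` before feeding real data.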
|
|
| CTC decoding is performed against `vocab.txt` with the blank id at 5632. |
| The `language_spans.json` file lets you mask the logits to a specific |
| language's 256-token span before greedy / beam decoding. See [`scripts/indicconformer_export/README.md`](https://github.com/christopherthompson81/vernacula/tree/main/scripts/indicconformer_export) |
| for details. |
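
As an illustration of the masking step, here is a minimal greedy CTC decoder restricted to one language span. It is a sketch, not Vernacula's C# implementation. It assumes `language_spans.json` is keyed by language code with `start` and `length` fields, and that the vocab uses SentencePiece-style pieces with U+2581 marking word starts; both are assumptions to verify against the actual files. The function names are hypothetical.

```python
import json

BLANK_ID = 5632  # shared CTC blank id, as stated in this card


def load_vocab(path):
    """Read vocab.txt so that token id == line index."""
    with open(path, encoding="utf-8") as fh:
        return [line.rstrip("\n") for line in fh]


def load_span(path, lang):
    """Assumption: JSON shaped like {"hi": {"start": ..., "length": 256}, ...}."""
    with open(path, encoding="utf-8") as fh:
        span = json.load(fh)[lang]
    return span["start"], span["length"]


def greedy_ctc_decode(logits, start, length, vocab, blank_id=BLANK_ID):
    """Greedy CTC over one utterance, restricted to one language span.

    logits: sequence of frames, each an indexable row of per-token scores
    (a list or a 1-D array).
    """
    allowed = list(range(start, start + length)) + [blank_id]
    pieces, prev = [], blank_id
    for frame in logits:
        # Argmax over the allowed ids only: this is the language mask.
        best = max(allowed, key=frame.__getitem__)
        # CTC collapse: skip blanks and immediate repeats of the same id.
        if best != blank_id and best != prev:
            pieces.append(vocab[best])
        prev = best
    # Assumption: U+2581 is the SentencePiece word-boundary marker.
    return "".join(pieces).replace("\u2581", " ").strip()
```

Restricting the argmax to the span plus the blank means tokens from other languages can never be emitted, even when they score higher on a given frame.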
|
|
| ## Limitations |
|
|
| Covers 22 official Indian languages (listed in frontmatter). Accuracy and |
| known failure modes inherit from the |
| [upstream AI4Bharat model card](https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual). |
| The RNNT head from the source model is not included; only the CTC path is shipped, |
| which trades a small amount of accuracy for substantially simpler decoding. |
|
|
| ## Citation |
|
|
| For the underlying model, see the |
| [upstream model card](https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual) |
| for the canonical citation. |
|
|
| ## Acknowledgments |
|
|
| - Original model: [AI4Bharat](https://ai4bharat.iitm.ac.in/) (IIT Madras) |
| - ONNX repackaging: [Chris Thompson](https://github.com/christopherthompson81) for [Vernacula](https://github.com/christopherthompson81/vernacula) |
|
|
| Issues with the ONNX export specifically: open an issue on |
| [the Vernacula repo](https://github.com/christopherthompson81/vernacula/issues). |
| Issues with the underlying model: see the upstream model card. |
|
|
| ## See also |
|
|
| - [Vernacula on GitHub](https://github.com/christopherthompson81/vernacula): the speech pipeline app this package is built for |
| - [Conversion script (`scripts/indicconformer_export/`)](https://github.com/christopherthompson81/vernacula/tree/main/scripts/indicconformer_export): the export pipeline that produced these files |
| - [`ai4bharat/indic-conformer-600m-multilingual`](https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual): upstream model card |
| - [AI4Bharat](https://ai4bharat.iitm.ac.in/): upstream research group at IIT Madras |
| - [Other Vernacula model packages](https://huggingface.co/christopherthompson81) |
|
|