---
license: mit
library_name: onnxruntime
pipeline_tag: automatic-speech-recognition
tags:
- onnx
- onnxruntime
- automatic-speech-recognition
- indic
- indicconformer
- ai4bharat
- vernacula
base_model: ai4bharat/indic-conformer-600m-multilingual
language:
- as
- bn
- brx
- doi
- gu
- hi
- kn
- kok
- ks
- mai
- ml
- mni
- mr
- ne
- or
- pa
- sa
- sat
- sd
- ta
- te
- ur
---
# IndicConformer 600M: ONNX export for Vernacula
Re-packaged ONNX export of AI4Bharat's 22-language
[`ai4bharat/indic-conformer-600m-multilingual`](https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual),
in the on-disk shape that [Vernacula](https://github.com/christopherthompson81/vernacula)'s
desktop ASR app expects. Only the CTC head is included; the RNNT components from the
source repo are not shipped here.
- **Conversion script:** [`scripts/indicconformer_export/`](https://github.com/christopherthompson81/vernacula/tree/main/scripts/indicconformer_export)
- **Vernacula:** [github.com/christopherthompson81/vernacula](https://github.com/christopherthompson81/vernacula)
- **Upstream model:** [`ai4bharat/indic-conformer-600m-multilingual`](https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual)
All numerical behavior is identical to the upstream encoder + CTC graph;
only the on-disk packaging differs.
## Highlights
- **22 languages, one shared CTC head.** Encoder dim → 5633 logits with the shared blank at id 5632; per-language vocab spans live in `language_spans.json` as `{start, length}` pairs (22 × 256 tokens). Language selection is a C# post-argmax mask, not an ONNX input, so one model serves every language.
- **Phase 1 parity (CPU-CPU FP32): max-abs logits delta 6.87e-5** at 1e-3 tolerance. CUDA-vs-CPU cross-device drift hit 1e-2 scale, which is typical for a 17-layer Conformer; CPU is the numerically exact path.
- **Real-audio parity on a 9.4 s hi-IN FLEURS clip:** about 2 word edits against an 11-word reference (roughly 18% WER), which confirmed vocab, SentencePiece detokenisation, and language-span masking end-to-end.
- **Repackaged 2.43 GB of AI4Bharat external-data from 366 per-tensor blobs** into a single `.data` sidecar. The repackaging walks initialisers + node attributes recursively so nothing is left behind.
- **Reused the Parakeet DFT preprocessor** (no `STFT` op), with a `getattr()` shim for NeMo 1.23.0rc0 compatibility (`exact_pad`, `stft_pad_amount` are 2.x-only field names).
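
For the real-audio bullet above, word error rate is the word-level edit distance divided by the number of reference words. A minimal sketch, assuming whitespace tokenisation; the function names and the placeholder word lists are illustrative and are not taken from the Vernacula test harness:

```python
def word_edit_distance(reference, hypothesis):
    """Levenshtein distance between two word lists (substitutions, insertions, deletions)."""
    # prev[j] is the distance between the reference prefix seen so far and hypothesis[:j].
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(reference_text, hypothesis_text):
    """WER = word edits / number of reference words."""
    reference = reference_text.split()
    hypothesis = hypothesis_text.split()
    if not reference:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(reference, hypothesis) / len(reference)


# Placeholder tokens, not the actual FLEURS transcript.
reference = "w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11"
hypothesis = "w1 w2 x3 w4 w5 w6 w7 w8 x9 w10 w11"
print(f"{word_error_rate(reference, hypothesis):.3f}")  # 2 edits over 11 words
```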
## Contents
| File | Purpose |
|---|---|
| `encoder-model.onnx` (+ `.data`) | Conformer encoder, `[features, features_lens] -> [encoded, encoded_lens]` |
| `ctc_decoder-model.onnx` | Single `Conv1d` → 5633-dim logits (22 × 256 language tokens + 1 shared CTC blank at id 5632) |
| `nemo128.onnx` | DFT-conv1d 80-mel preprocessor, `[waveforms, waveforms_lens] -> [features, features_lens]` |
| `vocab.txt` | Flat 5632-line vocab, id = line index; shared CTC blank is implicit at id 5632 |
| `language_spans.json` | 22 × `{start, length}`: which slice of `vocab.txt` each language's 256 tokens occupy |
| `config.json` | Preprocessor frontend params + CTC blank id |
| `manifest.json` | Per-file MD5 hashes (used by Vernacula's download verifier) |
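
Outside Vernacula, the hashes in `manifest.json` can be checked by hand. The sketch below assumes the manifest has already been parsed into a plain `{file name: MD5 hex digest}` mapping; the actual layout of `manifest.json` is not documented on this card, so adapt the lookup to whatever the file contains. `md5_of_file` and `find_mismatches` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Hex MD5 of a file, read in chunks so the multi-GB `.data` sidecar never sits in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(directory, expected):
    """Return the file names that are missing or whose MD5 differs from `expected`.

    `expected` is an assumed {file name: md5 hex} mapping, not the documented manifest schema.
    """
    bad = []
    for name, want in expected.items():
        path = Path(directory) / name
        if not path.is_file() or md5_of_file(path) != want.lower():
            bad.append(name)
    return bad
```

An empty list from `find_mismatches` means every listed file is present and matches its hash.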
## Export provenance
Exported via [`scripts/indicconformer_export/`](https://github.com/christopherthompson81/vernacula/tree/main/scripts/indicconformer_export)
in the [Vernacula](https://github.com/christopherthompson81/vernacula) repo. The export uses [AI4Bharat's NeMo fork](https://github.com/AI4Bharat/NeMo)
(kept in a venv isolated from the main [NeMo](https://github.com/NVIDIA/NeMo) export tooling, since the fork pins
different NeMo internals).
## License
[MIT](https://opensource.org/licenses/MIT), inherited from the upstream
[`ai4bharat/indic-conformer-600m-multilingual`](https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual)
model.
## Using these files
In Vernacula, select IndicConformer as the ASR backend in Settings and the
package will be downloaded and verified automatically. Outside Vernacula,
pull with `huggingface_hub` and load with `onnxruntime`:
```python
from huggingface_hub import snapshot_download

# Downloads the repo (or reuses the local cache) and returns the local directory path.
path = snapshot_download(repo_id="christopherthompson81/indicconformer-600m-onnx")
```
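
Loading could then look like the sketch below. It assumes the files sit at the top level of the downloaded snapshot under the names in the Contents table, and it pins the CPU execution provider because this card treats CPU as the reference path. `model_paths`, `load_sessions`, and `describe` are hypothetical helper names; the tensor names are read from each session rather than assumed.

```python
from pathlib import Path

# File names as listed in the Contents table; their location at the snapshot root is an assumption.
MODEL_FILES = {
    "preprocessor": "nemo128.onnx",
    "encoder": "encoder-model.onnx",
    "ctc_decoder": "ctc_decoder-model.onnx",
}


def model_paths(root):
    """Map each pipeline stage to the path of its ONNX file under `root`."""
    return {stage: Path(root) / name for stage, name in MODEL_FILES.items()}


def load_sessions(root):
    """Create one CPU InferenceSession per stage.

    The encoder's `.data` sidecar must stay next to `encoder-model.onnx`,
    because ONNX Runtime resolves external data relative to the model file.
    """
    import onnxruntime as ort  # imported lazily so the path helper works without it

    return {
        stage: ort.InferenceSession(str(path), providers=["CPUExecutionProvider"])
        for stage, path in model_paths(root).items()
    }


def describe(sessions):
    """Print the input and output tensor names of every session."""
    for stage, session in sessions.items():
        inputs = [arg.name for arg in session.get_inputs()]
        outputs = [arg.name for arg in session.get_outputs()]
        print(f"{stage}: {inputs} -> {outputs}")
```

With `path` from `snapshot_download`, `describe(load_sessions(path))` lists the tensor names to feed to `session.run`; audio flows through the preprocessor, then the encoder, then the CTC decoder.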
CTC decoding is performed against `vocab.txt` with the blank id at 5632.
The `language_spans.json` file lets you mask the logits to a specific
language's 256-token span before greedy / beam decoding. See [`scripts/indicconformer_export/README.md`](https://github.com/christopherthompson81/vernacula/tree/main/scripts/indicconformer_export)
for details.
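
As a rough illustration of span masking followed by greedy CTC decoding, the sketch below restricts each frame's argmax to one language's span plus the shared blank, collapses repeats, and drops blanks. It assumes the logits arrive as a `[frames][5633]` nested sequence (a 2-D NumPy array also works), that `start` and `length` come from that language's entry in `language_spans.json`, and that `vocab.txt` holds SentencePiece pieces using `▁` as the word-boundary marker. The function names are illustrative; Vernacula's C# implementation may differ.

```python
BLANK_ID = 5632  # shared CTC blank, per this card


def greedy_ctc_ids(logits, start, length, blank_id=BLANK_ID):
    """Greedy CTC over one language span: masked argmax, collapse repeats, drop blanks."""
    allowed = list(range(start, start + length)) + [blank_id]
    ids = []
    previous = blank_id
    for frame in logits:
        # Only ids inside the language span (or the blank) can win the argmax.
        best = max(allowed, key=lambda i: frame[i])
        if best != previous and best != blank_id:
            ids.append(best)
        previous = best
    return ids


def detokenise(ids, vocab):
    """Join pieces and turn the SentencePiece boundary marker (U+2581) into spaces.

    The marker convention is an assumption about the contents of vocab.txt.
    """
    return "".join(vocab[i] for i in ids).replace("\u2581", " ").strip()
```

Here `vocab` would be the list of lines read from `vocab.txt`, so that a token id indexes its piece directly.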
## Limitations
Covers 22 official Indian languages (listed in frontmatter). Accuracy and
known failure modes inherit from the
[upstream AI4Bharat model card](https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual).
The RNNT head from the source model is not included, only the CTC path,
which trades a small amount of accuracy for substantially simpler decoding.
## Citation
For the underlying model, see the
[upstream model card](https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual)
for the canonical citation.
## Acknowledgments
- Original model: [AI4Bharat](https://ai4bharat.iitm.ac.in/) (IIT Madras)
- ONNX repackaging: [Chris Thompson](https://github.com/christopherthompson81) for [Vernacula](https://github.com/christopherthompson81/vernacula)
Issues with the ONNX export specifically: open an issue on
[the Vernacula repo](https://github.com/christopherthompson81/vernacula/issues).
Issues with the underlying model: see the upstream model card.
## See also
- [Vernacula on GitHub](https://github.com/christopherthompson81/vernacula): the speech pipeline app this package is built for
- [Conversion script (`scripts/indicconformer_export/`)](https://github.com/christopherthompson81/vernacula/tree/main/scripts/indicconformer_export): the export pipeline that produced these files
- [`ai4bharat/indic-conformer-600m-multilingual`](https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual): upstream model card
- [AI4Bharat](https://ai4bharat.iitm.ac.in/): upstream research group at IIT Madras
- [Other Vernacula model packages](https://huggingface.co/christopherthompson81)