---
license: mit
library_name: onnxruntime
pipeline_tag: automatic-speech-recognition
tags:
  - onnx
  - onnxruntime
  - automatic-speech-recognition
  - indic
  - indicconformer
  - ai4bharat
  - vernacula
base_model: ai4bharat/indic-conformer-600m-multilingual
language:
  - as
  - bn
  - brx
  - doi
  - gu
  - hi
  - kn
  - kok
  - ks
  - mai
  - ml
  - mni
  - mr
  - ne
  - or
  - pa
  - sa
  - sat
  - sd
  - ta
  - te
  - ur
---

# IndicConformer 600M — ONNX export for Vernacula

Re-packaged ONNX export of AI4Bharat's 22-language
[`ai4bharat/indic-conformer-600m-multilingual`](https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual),
in the on-disk shape that [Vernacula](https://github.com/christopherthompson81/vernacula)'s
desktop ASR app expects. Only the CTC head is included; the RNNT components from the
source repo are not shipped here.

- **Conversion script:** [`scripts/indicconformer_export/`](https://github.com/christopherthompson81/vernacula/tree/main/scripts/indicconformer_export)
- **Vernacula:** [github.com/christopherthompson81/vernacula](https://github.com/christopherthompson81/vernacula)
- **Upstream model:** [`ai4bharat/indic-conformer-600m-multilingual`](https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual)

Numerical behavior matches the upstream encoder + CTC graph within FP32 tolerance (see the parity figures under Highlights);
only the on-disk packaging differs.

## Highlights

- **22 languages, one shared CTC head.** Encoder dim → 5633 logits with the shared blank at id 5632; per-language vocab spans live in `language_spans.json` as `{start, length}` pairs (22 × 256 tokens). Language selection is a C# post-argmax mask, not an ONNX input — one model serves every language.
- **Phase 1 parity (CPU vs. CPU, FP32): max-abs logits delta of 6.87e-5** against a 1e-3 tolerance. CUDA-vs-CPU cross-device drift reached the 1e-2 scale, which is typical for a 17-layer Conformer; CPU is the path that reproduces the reference numerics.
- **Real-audio parity on a 9.4 s hi-IN Fleurs clip:** about 2 word edits against an 11-word reference, which confirmed the vocab, SentencePiece detokenisation, and language-span masking end-to-end.
- **Repackaged 2.43 GB of AI4Bharat external-data from 366 per-tensor blobs** into a single `.data` sidecar. The repackaging walks initialisers + node attributes recursively so nothing is left behind.
- **Reused the Parakeet DFT preprocessor** (no `STFT` op), with a `getattr()` shim for NeMo 1.23.0rc0 compatibility (`exact_pad`, `stft_pad_amount` are 2.x-only field names).
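
The two parity figures above (max-abs logits delta and word-level edits) can be checked with helpers along these lines. This is a minimal sketch in plain Python, not the script that produced the numbers; the function names are illustrative, and the logits are assumed to be flattened into equal-length sequences before comparison.

```python
def max_abs_delta(reference, candidate):
    """Largest absolute element-wise difference between two flat logit sequences."""
    if len(reference) != len(candidate):
        raise ValueError("logit sequences must have the same length")
    return max(abs(r - c) for r, c in zip(reference, candidate))


def word_edits(reference, hypothesis):
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    ref, hyp = reference.split(), hypothesis.split()
    row = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        prev_diag, row[0] = row[0], i
        for j, hyp_word in enumerate(hyp, start=1):
            cur = min(
                row[j] + 1,                         # delete ref_word
                row[j - 1] + 1,                     # insert hyp_word
                prev_diag + (ref_word != hyp_word)  # substitute, or match at cost 0
            )
            prev_diag, row[j] = row[j], cur
    return row[-1]
```

A parity check would then be `max_abs_delta(a, b) < 1e-3`, and WER is `word_edits(ref, hyp) / len(ref.split())`; 2 edits over 11 words corresponds to a WER of roughly 0.18.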

## Contents

| File | Purpose |
|---|---|
| `encoder-model.onnx` (+ `.data`) | Conformer encoder, `[features, features_lens] -> [encoded, encoded_lens]` |
| `ctc_decoder-model.onnx` | Single `Conv1d` → 5633-dim logits (22 × 256 language tokens + 1 shared CTC blank at id 5632) |
| `nemo128.onnx` | DFT-conv1d 80-mel preprocessor, `[waveforms, waveforms_lens] -> [features, features_lens]` |
| `vocab.txt` | Flat 5632-line vocab, id = line index; shared CTC blank is implicit at id 5632 |
| `language_spans.json` | 22 × `{start, length}` — which slice of `vocab.txt` each language's 256 tokens occupy |
| `config.json` | Preprocessor frontend params + CTC blank id |
| `manifest.json` | Per-file MD5 hashes (used by Vernacula's download verifier) |
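
Outside Vernacula, the hashes in `manifest.json` can be checked by hand. The sketch below assumes the manifest is a flat `{"filename": "md5-hex"}` JSON object; that schema is a guess made for illustration, so inspect the real file and adjust the parsing if it is nested differently. `verify_manifest` and `md5_of` are hypothetical helper names.

```python
import hashlib
import json
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Stream a file through MD5 so multi-GB sidecars are not read into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(model_dir, manifest_name="manifest.json"):
    """Return the names of files that are missing or whose MD5 does not match."""
    model_dir = Path(model_dir)
    # ASSUMPTION: the manifest is a flat {"filename": "md5-hex"} mapping.
    expected = json.loads((model_dir / manifest_name).read_text(encoding="utf-8"))
    bad = []
    for name, want in expected.items():
        target = model_dir / name
        if not target.is_file() or md5_of(target) != want.lower():
            bad.append(name)
    return bad
```

An empty list from `verify_manifest(path)` means every listed file is present and matches its recorded hash.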

## Export provenance

Exported via [`scripts/indicconformer_export/`](https://github.com/christopherthompson81/vernacula/tree/main/scripts/indicconformer_export)
in the [Vernacula](https://github.com/christopherthompson81/vernacula) repo. The export uses [AI4Bharat's NeMo fork](https://github.com/AI4Bharat/NeMo),
installed in its own venv, isolated from the main [NeMo](https://github.com/NVIDIA/NeMo) export tooling, because the fork pins
different NeMo internals.

## License

[MIT](https://opensource.org/licenses/MIT), inherited from the upstream
[`ai4bharat/indic-conformer-600m-multilingual`](https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual)
model.

## Using these files

In Vernacula, select IndicConformer as the ASR backend in Settings and the
package will be downloaded and verified automatically. Outside Vernacula,
pull with `huggingface_hub` and load with `onnxruntime`:

```python
from huggingface_hub import snapshot_download

# Downloads every file in the repo and returns the local snapshot directory.
path = snapshot_download(repo_id="christopherthompson81/indicconformer-600m-onnx")
```
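
Once downloaded, the three graphs can be chained with `onnxruntime`. The following is a rough sketch rather than Vernacula's actual loader: the file names come from the Contents table, but the stage order and the positional wiring of one stage's outputs into the next stage's inputs are assumptions, and `load_sessions` / `run_chain` are hypothetical helper names. Input names are read from each session instead of being hard-coded.

```python
from pathlib import Path

# File names are taken from the Contents table; running them in this order
# and wiring outputs to inputs positionally are assumptions of this sketch.
STAGE_FILES = ("nemo128.onnx", "encoder-model.onnx", "ctc_decoder-model.onnx")


def load_sessions(model_dir, providers=("CPUExecutionProvider",)):
    """Open one onnxruntime session per stage (CPU by default)."""
    import onnxruntime as ort  # imported lazily; needs `pip install onnxruntime`

    return [
        ort.InferenceSession(str(Path(model_dir) / name), providers=list(providers))
        for name in STAGE_FILES
    ]


def run_chain(sessions, waveforms, waveforms_lens):
    """Feed each stage's outputs, in order, into the next stage's inputs."""
    tensors = [waveforms, waveforms_lens]
    for session in sessions:
        input_names = [arg.name for arg in session.get_inputs()]
        # zip() stops at the shorter list, so a single-input stage only
        # receives the first tensor produced by the previous stage.
        tensors = session.run(None, dict(zip(input_names, tensors)))
    return tensors[0]  # first output of the last stage: the CTC logits
```

Usage would look like `logits = run_chain(load_sessions(path), audio, lengths)`. The expected dtypes and shapes (for example float32 waveforms with integer lengths) are not stated in this card; check `session.get_inputs()` on each graph and `config.json` for the frontend parameters before feeding real audio.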

CTC decoding is performed against `vocab.txt` with the blank id at 5632.
The `language_spans.json` file lets you mask the logits to a specific
language's 256-token span before greedy / beam decoding. See [`scripts/indicconformer_export/README.md`](https://github.com/christopherthompson81/vernacula/tree/main/scripts/indicconformer_export)
for details.
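
A minimal sketch of that masking step followed by greedy CTC decoding, in plain Python. It assumes the logits have already been arranged as one 5633-long row per frame (the decoder's actual output layout may need a transpose first) and that `vocab.txt` stores SentencePiece pieces with the usual `▁` word-boundary marker. The function names are illustrative and are not taken from Vernacula's code.

```python
BLANK_ID = 5632  # shared CTC blank id stated in this card


def masked_greedy_ids(frames, start, length, blank_id=BLANK_ID):
    """Greedy CTC over frames, considering only one language span plus blank.

    frames: iterable of per-frame logit rows (ASSUMED layout: [time][vocab]).
    start, length: the chosen language's entry from language_spans.json.
    """
    ids, prev = [], blank_id
    for row in frames:
        # Masked argmax: only the language's span and the blank may win.
        best = blank_id
        for token_id in range(start, start + length):
            if row[token_id] > row[best]:
                best = token_id
        # CTC collapse: drop blanks and immediate repeats of the same token.
        if best != blank_id and best != prev:
            ids.append(best)
        prev = best
    return ids


def pieces_to_text(pieces):
    """Join SentencePiece pieces; ASSUMES U+2581 marks the start of a word."""
    return "".join(pieces).replace("\u2581", " ").strip()
```

With `vocab` loaded as the lines of `vocab.txt`, the transcript would be `pieces_to_text(vocab[i] for i in ids)`. Restricting the argmax to the span is equivalent to setting every out-of-language logit to negative infinity before decoding.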

## Limitations

Covers 22 official Indian languages (listed in frontmatter). Accuracy and
known failure modes inherit from the
[upstream AI4Bharat model card](https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual).
The RNNT head from the source model is not included — only the CTC path —
which trades a small amount of accuracy for substantially simpler decoding.

## Citation

To cite the underlying model, use the canonical citation given on the
[upstream model card](https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual).

## Acknowledgments

- Original model: [AI4Bharat](https://ai4bharat.iitm.ac.in/) (IIT Madras)
- ONNX repackaging: [Chris Thompson](https://github.com/christopherthompson81) for [Vernacula](https://github.com/christopherthompson81/vernacula)

Issues with the ONNX export specifically: open an issue on
[the Vernacula repo](https://github.com/christopherthompson81/vernacula/issues).
Issues with the underlying model: see the upstream model card.

## See also

- [Vernacula on GitHub](https://github.com/christopherthompson81/vernacula) — the speech pipeline app this package is built for
- [Conversion script (`scripts/indicconformer_export/`)](https://github.com/christopherthompson81/vernacula/tree/main/scripts/indicconformer_export) — the export pipeline that produced these files
- [`ai4bharat/indic-conformer-600m-multilingual`](https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual) — upstream model card
- [AI4Bharat](https://ai4bharat.iitm.ac.in/) — upstream research group at IIT Madras
- [Other Vernacula model packages](https://huggingface.co/christopherthompson81)