	---
	license: mit
	library_name: onnxruntime
	pipeline_tag: automatic-speech-recognition
	tags:
	- onnx
	- onnxruntime
	- automatic-speech-recognition
	- indic
	- indicconformer
	- ai4bharat
	- vernacula
	base_model: ai4bharat/indic-conformer-600m-multilingual
	language:
	- as
	- bn
	- brx
	- doi
	- gu
	- hi
	- kn
	- kok
	- ks
	- mai
	- ml
	- mni
	- mr
	- ne
	- or
	- pa
	- sa
	- sat
	- sd
	- ta
	- te
	- ur
	---

	# IndicConformer 600M — ONNX export for Vernacula

	Re-packaged ONNX export of AI4Bharat's 22-language
	[`ai4bharat/indic-conformer-600m-multilingual`](https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual),
	in the on-disk shape that [Vernacula](https://github.com/christopherthompson81/vernacula)'s
	desktop ASR app expects. Only the CTC head is included; the RNNT components from the
	source repo are not shipped here.

	- Conversion script: [`scripts/indicconformer_export/`](https://github.com/christopherthompson81/vernacula/tree/main/scripts/indicconformer_export)
	- Vernacula: [github.com/christopherthompson81/vernacula](https://github.com/christopherthompson81/vernacula)
	- Upstream model: [`ai4bharat/indic-conformer-600m-multilingual`](https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual)

	Numerical behavior matches the upstream encoder + CTC graph within the parity
	tolerance reported under Highlights; only the on-disk packaging differs.

	## Highlights

	- 22 languages, one shared CTC head. Encoder dim → 5633 logits with the shared blank at id 5632; per-language vocab spans live in `language_spans.json` as `{start, length}` pairs (22 × 256 tokens). Language selection is a C# post-argmax mask, not an ONNX input — one model serves every language.
	- Phase 1 parity (CPU-CPU FP32): max-abs logits delta 6.87e-5 at 1e-3 tolerance. CUDA-vs-CPU cross-device drift hit 1e-2 scale — typical for a 17-layer Conformer; CPU is the numerically exact path.
	- Real-audio parity on a 9.4 s hi-IN FLEURS clip: roughly 2 word edits against an 11-word reference, which confirmed the vocab, SentencePiece detokenisation, and language-span masking end-to-end.
	- Repackaged 2.43 GB of AI4Bharat external-data from 366 per-tensor blobs into a single `.data` sidecar. The repackaging walks initialisers + node attributes recursively so nothing is left behind.
	- Reused the Parakeet DFT preprocessor (no `STFT` op), with a `getattr()` shim for NeMo 1.23.0rc0 compatibility (`exact_pad`, `stft_pad_amount` are 2.x-only field names).
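The post-argmax language mask described in the first bullet is implemented in C# inside Vernacula; the following is only a minimal Python sketch of the same idea, not the app's code. It assumes the logits arrive with the vocabulary on the last axis, and that `start` and `length` are taken from one entry of `language_spans.json`. The function name `masked_argmax` is hypothetical.

```python
import numpy as np

BLANK_ID = 5632  # shared CTC blank id, as stated in this card


def masked_argmax(logits: np.ndarray, start: int, length: int,
                  blank_id: int = BLANK_ID) -> np.ndarray:
    """Per-frame argmax restricted to one language span plus the blank.

    ASSUMPTION: the vocabulary is the last axis of `logits`
    (for example [frames, 5633]); transpose first if it is not.
    """
    allowed = np.zeros(logits.shape[-1], dtype=bool)
    allowed[start:start + length] = True  # the language's 256-token slice
    allowed[blank_id] = True              # the blank is shared by every language
    # Disallowed ids are pushed to -inf so they can never win the argmax.
    masked = np.where(allowed, logits, -np.inf)
    return masked.argmax(axis=-1)
```

Because the mask is applied outside the graph, the same ONNX outputs can be re-decoded for a different language without running the encoder again.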

	## Contents

	| File | Purpose |
	|---|---|
	| `encoder-model.onnx` (+ `.data`) | Conformer encoder, `[features, features_lens] -> [encoded, encoded_lens]` |
	| `ctc_decoder-model.onnx` | Single `Conv1d` → 5633-dim logits (22 × 256 language tokens + 1 shared CTC blank at id 5632) |
	| `nemo128.onnx` | DFT-conv1d 80-mel preprocessor, `[waveforms, waveforms_lens] -> [features, features_lens]` |
	| `vocab.txt` | Flat 5632-line vocab, id = line index; shared CTC blank is implicit at id 5632 |
	| `language_spans.json` | 22 × `{start, length}`: which slice of `vocab.txt` each language's 256 tokens occupy |
	| `config.json` | Preprocessor frontend params + CTC blank id |
	| `manifest.json` | Per-file MD5 hashes (used by Vernacula's download verifier) |
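Outside Vernacula, the MD5 hashes in `manifest.json` can be checked by hand. The sketch below is one way to do that; it is not Vernacula's verifier. The manifest schema is not documented in this card, so the code assumes a flat JSON object mapping each file name to its MD5 hex digest. Adjust the parsing if the real layout differs. The helper names `md5_of` and `verify_package` are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through MD5 so the 2.43 GB sidecar never sits in memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_package(root: Path, manifest_name: str = "manifest.json") -> list[str]:
    """Return the names of files that are missing or whose MD5 differs."""
    # ASSUMPTION: the manifest is a flat {"file name": "md5 hex"} object.
    manifest = json.loads((root / manifest_name).read_text(encoding="utf-8"))
    mismatched = []
    for name, expected in manifest.items():
        target = root / name
        if not target.is_file() or md5_of(target) != expected.lower():
            mismatched.append(name)
    return mismatched
```

An empty list from `verify_package(Path(path))` means every listed file matched.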

	## Export provenance

	Exported via [`scripts/indicconformer_export/`](https://github.com/christopherthompson81/vernacula/tree/main/scripts/indicconformer_export)
	in the [Vernacula](https://github.com/christopherthompson81/vernacula) repo. The export uses [AI4Bharat's NeMo fork](https://github.com/AI4Bharat/NeMo)
	(kept in an isolated venv from the main [NeMo](https://github.com/NVIDIA/NeMo) export tooling, since the fork pins
	different NeMo internals).

	## License

	[MIT](https://opensource.org/licenses/MIT), inherited from the upstream
	[`ai4bharat/indic-conformer-600m-multilingual`](https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual)
	model.

	## Using these files

	In Vernacula, select IndicConformer as the ASR backend in Settings and the
	package will be downloaded and verified automatically. Outside Vernacula,
	pull with `huggingface_hub` and load with `onnxruntime`:

	```python
	from huggingface_hub import snapshot_download
	path = snapshot_download(repo_id="christopherthompson81/indicconformer-600m-onnx")
	```
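Once the snapshot is on disk, the three graphs can be chained with `onnxruntime`. The following is a hedged sketch rather than a reference pipeline. The tensor names for the preprocessor and encoder come from the Contents table. The dtypes (float32 audio, int64 lengths), the output ordering, and the single-input CTC decoder are assumptions; the decoder's input name is read from the graph rather than guessed. The waveform is assumed to be mono and already at the sample rate the preprocessor expects. `load_sessions` and `run_ctc` are hypothetical helper names.

```python
from pathlib import Path

import numpy as np


def load_sessions(path: str):
    """Open the preprocessor, encoder, and CTC decoder graphs on CPU."""
    import onnxruntime as ort  # imported lazily; only needed when loading real files

    root = Path(path)
    names = ("nemo128.onnx", "encoder-model.onnx", "ctc_decoder-model.onnx")
    return tuple(
        ort.InferenceSession(str(root / n), providers=["CPUExecutionProvider"])
        for n in names
    )


def run_ctc(sessions, waveform) -> np.ndarray:
    """Chain preprocessor -> encoder -> CTC decoder for a single clip."""
    pre, enc, dec = sessions
    # ASSUMPTION: float32 samples with a leading batch axis, int64 lengths.
    wav = np.asarray(waveform, dtype=np.float32)[None, :]
    wav_len = np.array([wav.shape[1]], dtype=np.int64)
    # ASSUMPTION: outputs come back in the order shown in the Contents table.
    features, features_lens = pre.run(
        None, {"waveforms": wav, "waveforms_lens": wav_len})
    encoded, _encoded_lens = enc.run(
        None, {"features": features, "features_lens": features_lens})
    # The decoder's input name is not listed in this card, so ask the graph.
    dec_input = dec.get_inputs()[0].name
    return dec.run(None, {dec_input: encoded})[0]
```

A typical call would be `logits = run_ctc(load_sessions(path), samples)`. The CPU provider is chosen here because this card describes CPU as the numerically exact path.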

	CTC decoding is performed against `vocab.txt` with the blank id at 5632.
	The `language_spans.json` file lets you mask the logits to a specific
	language's 256-token span before greedy / beam decoding. See [`scripts/indicconformer_export/README.md`](https://github.com/christopherthompson81/vernacula/tree/main/scripts/indicconformer_export)
	for details.
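As an illustration of the greedy path, the sketch below collapses per-frame ids (taken after any language masking) into text. It uses the standard CTC rule of merging repeats and then dropping blanks. The detokenisation step assumes each line of `vocab.txt` holds one SentencePiece piece with `▁` (U+2581) marking word starts; that format is an assumption, and the function names are hypothetical.

```python
from itertools import groupby

BLANK_ID = 5632  # shared CTC blank id, as stated in this card


def ctc_collapse(frame_ids, blank_id=BLANK_ID):
    """Greedy CTC: merge consecutive repeats, then drop blanks."""
    return [int(i) for i, _ in groupby(frame_ids) if int(i) != blank_id]


def detokenise(token_ids, vocab):
    """Join pieces and turn the SentencePiece word marker into spaces.

    ASSUMPTION: vocab[i] is the piece on line i of vocab.txt.
    """
    pieces = "".join(vocab[i] for i in token_ids)
    return pieces.replace("\u2581", " ").strip()
```

With `vocab` loaded as a list of lines from `vocab.txt`, `detokenise(ctc_collapse(frame_ids), vocab)` yields the transcript for one clip. A blank between two identical ids keeps both, which is how CTC represents doubled tokens.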

	## Limitations

	Covers 22 official Indian languages (listed in frontmatter). Accuracy and
	known failure modes inherit from the
	[upstream AI4Bharat model card](https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual).
	Only the CTC path is included, not the RNNT head from the source model;
	this trades a small amount of accuracy for substantially simpler decoding.

	## Citation

	To cite the underlying model, use the canonical citation given on the
	[upstream model card](https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual).

	## Acknowledgments

	- Original model: [AI4Bharat](https://ai4bharat.iitm.ac.in/) (IIT Madras)
	- ONNX repackaging: [Chris Thompson](https://github.com/christopherthompson81) for [Vernacula](https://github.com/christopherthompson81/vernacula)

	Issues with the ONNX export specifically: open an issue on
	[the Vernacula repo](https://github.com/christopherthompson81/vernacula/issues).
	Issues with the underlying model: see the upstream model card.

	## See also

	- [Vernacula on GitHub](https://github.com/christopherthompson81/vernacula) — the speech pipeline app this package is built for
	- [Conversion script (`scripts/indicconformer_export/`)](https://github.com/christopherthompson81/vernacula/tree/main/scripts/indicconformer_export) — the export pipeline that produced these files
	- [`ai4bharat/indic-conformer-600m-multilingual`](https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual) — upstream model card
	- [AI4Bharat](https://ai4bharat.iitm.ac.in/) — upstream research group at IIT Madras
	- [Other Vernacula model packages](https://huggingface.co/christopherthompson81)