	---
	language:
	- hi
	license: apache-2.0
	tags:
	- automatic-speech-recognition
	- hindi
	- conformer
	- ctc
	- kenlm
	- indic
	- asr
	- speech
	- vistaar
	- indian-languages
	datasets:
	- ai4bharat/vistaar
	metrics:
	- wer
	pipeline_tag: automatic-speech-recognition
	library_name: ctc
	model-index:
	- name: indic-conformer-600m
	  results:
	  - task:
	      type: automatic-speech-recognition
	      name: Automatic Speech Recognition
	    dataset:
	      name: Vistaar (Kathbath)
	      type: ai4bharat/vistaar
	    metrics:
	    - type: wer
	      value: 9.00
	      name: WER (+ Hindi-5M LM)
	  - task:
	      type: automatic-speech-recognition
	      name: Automatic Speech Recognition
	    dataset:
	      name: Vistaar (Kathbath Noisy)
	      type: ai4bharat/vistaar
	    metrics:
	    - type: wer
	      value: 10.19
	      name: WER (+ Hindi-5M LM)
	  - task:
	      type: automatic-speech-recognition
	      name: Automatic Speech Recognition
	    dataset:
	      name: Vistaar (FLEURS)
	      type: ai4bharat/vistaar
	    metrics:
	    - type: wer
	      value: 11.18
	      name: WER (+ Hindi-5M LM)
	  - task:
	      type: automatic-speech-recognition
	      name: Automatic Speech Recognition
	    dataset:
	      name: Vistaar (CommonVoice)
	      type: ai4bharat/vistaar
	    metrics:
	    - type: wer
	      value: 12.54
	      name: WER (+ Hindi-5M LM)
	  - task:
	      type: automatic-speech-recognition
	      name: Automatic Speech Recognition
	    dataset:
	      name: Vistaar (MUCS)
	      type: ai4bharat/vistaar
	    metrics:
	    - type: wer
	      value: 9.05
	      name: WER (+ Hindi-5M LM)
	  - task:
	      type: automatic-speech-recognition
	      name: Automatic Speech Recognition
	    dataset:
	      name: Vistaar (Gramvaani)
	      type: ai4bharat/vistaar
	    metrics:
	    - type: wer
	      value: 24.09
	      name: WER (+ Hindi-5M LM)
	---

	# Indic Conformer ASR — Hindi (600M)

	600M-parameter Conformer encoder for Hindi automatic speech recognition, evaluated on all 7 subsets of the [Vistaar benchmark](https://arxiv.org/abs/2305.15386). Achieves 12.09% average WER with a custom 5-gram KenLM across read speech, noisy speech, broadcast, conversational, and rural dialectal Hindi.

	Runs locally on CPU, Apple Silicon MPS, and NVIDIA CUDA; a GPU is optional. On an Apple M4 CPU the real-time factor (RTF) is 0.27× (about 3.7× faster than real time). On Apple MPS it is ~0.03–0.05× (roughly 20–30× faster than real time).
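
For readers who want to reproduce the speed figures on their own hardware, RTF is conventionally the time spent transcribing divided by the duration of the audio, and the speed-up over real time is its reciprocal. The sketch below shows that arithmetic only; the function names and the 60 s / 16.2 s timing are illustrative assumptions, not measurements from this repository.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent transcribing / duration of the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def speedup(rtf: float) -> float:
    """How many times faster than real time a given RTF is."""
    return 1.0 / rtf


# Hypothetical timing: a 60 s clip transcribed in 16.2 s.
rtf = real_time_factor(16.2, 60.0)
print(f"RTF {rtf:.2f}, {speedup(rtf):.1f}x faster than real time")
```

When timing a real run, it is usually worth excluding model loading and a first warm-up pass, since both can dominate short clips.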

	Code and evaluation scripts: [github.com/abhayverma6300/indic-asr-conformer](https://github.com/abhayverma6300/indic-asr-conformer/)

	---

	## Vistaar Results

	WER with Devanagari-aware normalisation (dandas and punctuation stripped). Beam width 100.

	| Dataset | Domain | Greedy WER | + Hindi-5M LM |
	|---|---|---|---|
	| Kathbath | Read speech | 10.34% | 9.00% |
	| Kathbath Noisy | Noisy read speech | 11.86% | 10.19% |
	| FLEURS | Broadcast / read | 12.68% | 11.18% |
	| CommonVoice | Crowd-sourced read | 16.57% | 12.54% |
	| IndicTTS | TTS-derived | 9.49% | 8.55% |
	| MUCS | Conversational | 10.41% | 9.05% |
	| Gramvaani | Rural / dialectal | 27.61% | 24.09% |
	| Average | | 14.14% | 12.09% |
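
The scoring described above can be approximated as follows. This is a minimal sketch, assuming that "Devanagari-aware normalisation" means dropping the danda (।), double danda (॥), and all Unicode punctuation while keeping Devanagari vowel signs and the virama; the exact rules used by the evaluation scripts may differ. `normalise` and `word_error_rate` are hypothetical helper names.

```python
import unicodedata

# Assumed rule set: dandas and every Unicode punctuation character become spaces.
DANDAS = "\u0964\u0965"  # । and ॥


def normalise(text: str) -> str:
    """Strip dandas and punctuation, then collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    kept = [
        " " if ch in DANDAS or unicodedata.category(ch).startswith("P") else ch
        for ch in text
    ]
    return " ".join("".join(kept).split())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = normalise(reference).split()
    hyp = normalise(hypothesis).split()
    if not ref:
        raise ValueError("reference is empty after normalisation")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("यह एक परीक्षण है।", "यह एक परीक्षा है"))
```

Corpus-level WER is normally computed as total edits over total reference words rather than as a mean of per-utterance scores, so a full evaluation would accumulate both counts across the test set.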

	### Leaderboard context

	| Model | Avg WER | Open weights | CPU inference |
	|---|---|---|---|
	| Indic Conformer 600M + Hindi-5M LM | 12.09% | yes | yes |
	| IndicWhisper (Whisper-medium fine-tuned) | 13.6% | yes | slow |
	| Nvidia NeMo large | 18.6% | yes | no |
	| Azure STT | ~20% | no | no |
	| Google STT | ~24% | no | no |

	Numbers for other models from the [Vistaar paper](https://arxiv.org/abs/2305.15386) (AI4Bharat, 2023).

	---

	## Model files

	| File | Size | Description |
	|---|---|---|
	| `am_model.pt` | 2.4 GB | Original TorchScript AM (CUDA device literals) |
	| `am_model_cpu.pt` | 2.4 GB | Patched for CPU inference |
	| `am_model_mps.pt` | 2.4 GB | Patched for Apple Silicon MPS |
	| `preprocessor.pt` | ~92 KB | Log-Mel frontend |
	| `lm/hindi/hi.bin` | 145 MB | 5-gram KenLM (Hindi-5M) |
	| `lm/hindi/unigrams.txt` | - | 201k Hindi words for pyctcdecode |

	---

	## Quickstart

	### Install dependencies

	```bash
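	# kenlm backs pyctcdecode's LM scoring; huggingface_hub provides huggingface-cli
	pip install torch torchaudio pyctcdecode kenlm huggingface_hub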
	```

	### CPU inference

	```bash
	git clone https://github.com/Abhay-Verma031/indic-asr-conformer
	cd indic-asr-conformer

	huggingface-cli download Abhay-Verma031/indic-conformer-600m \
	--local-dir extracted_models_v3/

	python inference/cpu_infer.py \
	--audio speech.wav \
	--language hi \
	--preprocessor extracted_models_v3/preprocessor.pt \
	--am extracted_models_v3/am_model_cpu.pt \
	--lm extracted_models_v3/lm/hindi/hi.bin
	```

	### Apple Silicon MPS

	```bash
	python inference/cpu_infer.py \
	--audio speech.wav \
	--language hi \
	--preprocessor extracted_models_v3/preprocessor.pt \
	--am extracted_models_v3/am_model_mps.pt \
	--device mps \
	--lm extracted_models_v3/lm/hindi/hi.bin
	```

	### NVIDIA GPU

	```bash
	python inference/gpu_infer.py \
	--audio speech.wav \
	--language hi \
	--preprocessor extracted_models_v3/preprocessor.pt \
	--am extracted_models_v3/am_model.pt \
	--lm extracted_models_v3/lm/hindi/hi.bin
	```

	---

	## Architecture

	```
	AUDIO (16 kHz mono, FP32)
	    │
	    ▼
	asr_preprocessor   80-dim log-Mel filterbank [B, 80, T']
	    │
	    ▼
	asr_am             Conformer encoder, ~600M params
	                   output: CTC logprobs [B, T', 257]
	                   (256 Hindi BPE tokens + CTC blank)
	    │
	    ▼
	asr_decoder        pyctcdecode CTC beam search + KenLM
	                   α=0.3 β=1.0 beam_width=100
	    │
	    ▼
	TRANSCRIPT
	```

	The AM is a multilingual model covering all 22 scheduled Indian languages via a 5633-token multilingual BPE vocabulary, a size consistent with 22 × 256 language tokens plus one CTC blank. Each language uses a 256-token slice at a fixed offset; for Hindi the slice starts at offset 1536. The model is exported as TorchScript, so running the network itself requires only `torch` and `torchaudio`; LM-fused beam search additionally uses `pyctcdecode`.
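
The vocabulary slicing and the greedy baseline can be illustrated with plain Python lists. This is a sketch under stated assumptions: it supposes the full 5633-way scores are available per frame and that the CTC blank sits at the last index (5632), neither of which is confirmed here, since the exported model may already emit the 257-way Hindi output. `hindi_slice`, `greedy_ctc`, and `one_hot_frame` are hypothetical helpers.

```python
VOCAB_SIZE = 5633              # 22 languages x 256 tokens + 1 CTC blank
SLICE_SIZE = 256               # tokens per language
HINDI_OFFSET = 1536            # start of the Hindi slice
BLANK_INDEX = VOCAB_SIZE - 1   # assumption: blank is the last index


def hindi_slice(frame):
    """Map one 5633-dim frame to 257 scores: 256 Hindi tokens, then blank."""
    return frame[HINDI_OFFSET:HINDI_OFFSET + SLICE_SIZE] + [frame[BLANK_INDEX]]


def greedy_ctc(frames, blank=SLICE_SIZE):
    """Argmax per frame, collapse consecutive repeats, drop blanks."""
    ids, prev = [], None
    for frame in frames:
        best = max(range(len(frame)), key=frame.__getitem__)
        if best != prev and best != blank:
            ids.append(best)
        prev = best
    return ids


def one_hot_frame(index, size=VOCAB_SIZE):
    """Toy log-prob frame with all mass on a single index."""
    frame = [-10.0] * size
    frame[index] = 0.0
    return frame


# Toy sequence: token 5 twice, blank, token 5 again, then token 7.
peaks = (HINDI_OFFSET + 5, HINDI_OFFSET + 5, BLANK_INDEX, HINDI_OFFSET + 5, HINDI_OFFSET + 7)
frames = [one_hot_frame(i) for i in peaks]
token_ids = greedy_ctc([hindi_slice(f) for f in frames])
print(token_ids)
```

The blank between the two occurrences of token 5 is what lets CTC emit the same token twice in a row; without it the repeat would be collapsed into one.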

	---

	## Hindi language model

	The greedy CTC baseline (14.14% average WER) is already competitive. Decoding with the Hindi-5M KenLM lowers it to 12.09%, an absolute reduction of 2.05 percentage points (about 14.5% relative), by scoring beam candidates with a 5-gram language model.

	| | Hindi-5M |
	|---|---|
	| Order | 5-gram |
	| Binary size | 145 MB |
	| Training sentences | 5,000,000 |
	| Unigrams | 201,136 |
	| α | 0.3 |
	| β | 1.0 |
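
To show how α and β enter decoding, the sketch below uses the usual shallow-fusion form, in which a candidate's score is its acoustic log-probability plus α times its LM log-probability plus β per word. This is an approximation of what a CTC beam-search decoder such as pyctcdecode does, not a copy of its internals, and the candidate scores are made-up numbers for illustration.

```python
ALPHA = 0.3  # LM weight, from the table above
BETA = 1.0   # per-word bonus, from the table above


def fused_score(am_logprob: float, lm_logprob: float, n_words: int,
                alpha: float = ALPHA, beta: float = BETA) -> float:
    """Shallow fusion: acoustic score + weighted LM score + word bonus."""
    return am_logprob + alpha * lm_logprob + beta * n_words


# Hypothetical beam candidates: (text, acoustic log-prob, LM log-prob).
candidates = [
    ("मैं घर जा रहा हु", -4.0, -20.0),   # slightly better acoustically, unlikely under the LM
    ("मैं घर जा रहा हूँ", -4.5, -12.0),  # slightly worse acoustically, fluent Hindi
]

best = max(
    candidates,
    key=lambda c: fused_score(c[1], c[2], n_words=len(c[0].split())),
)[0]
print(best)
```

With these numbers the LM term outweighs the small acoustic gap, so the fluent candidate wins; a larger α would lean further on the LM, while β offsets the LM's tendency to favour shorter outputs.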

	Training corpus: Wikipedia (hi), CC-100 (hi), CulturaX (hi), OSCAR-2301 (hi), C4 (hi) — ~5M sentences after deduplication and Devanagari filtering.
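
The deduplication and Devanagari filtering step might look roughly like the following. It is a sketch only: the 0.9 script-ratio threshold, the exact-match deduplication, and the helper names are assumptions for illustration, and the actual corpus pipeline is not described in this card.

```python
import re
import unicodedata

DEVANAGARI = re.compile(r"[\u0900-\u097F]")  # Devanagari Unicode block


def devanagari_ratio(sentence: str) -> float:
    """Fraction of non-space characters that fall in the Devanagari block."""
    chars = [ch for ch in sentence if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(1 for ch in chars if DEVANAGARI.match(ch)) / len(chars)


def clean_corpus(sentences, min_ratio=0.9):
    """Yield unique, mostly-Devanagari sentences (min_ratio is hypothetical)."""
    seen = set()
    for raw in sentences:
        sentence = " ".join(unicodedata.normalize("NFC", raw).split())
        if not sentence or sentence in seen:
            continue
        if devanagari_ratio(sentence) < min_ratio:
            continue
        seen.add(sentence)
        yield sentence


raw_lines = [
    "यह एक वाक्य है",
    "यह  एक वाक्य है ",       # duplicate after whitespace normalisation
    "This is English",
    "हिंदी text mixed here",  # mostly Latin script
]
cleaned = list(clean_corpus(raw_lines))
print(cleaned)
```

At the scale of several web corpora, the in-memory `seen` set would typically be replaced by hashing or a disk-backed store, but the filtering logic stays the same.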

	---

	## Citation

	```bibtex
	@misc{indic-conformer-600m,
	author = {Abhay Verma},
	title = {Indic Conformer ASR — Hindi 600M},
	year = {2026},
	url = {https://huggingface.co/abhayverma6300/indic-conformer-600m}
	}
	```