| --- |
| language: |
| - hi |
| license: apache-2.0 |
| tags: |
| - automatic-speech-recognition |
| - hindi |
| - conformer |
| - ctc |
| - kenlm |
| - indic |
| - asr |
| - speech |
| - vistaar |
| - indian-languages |
| datasets: |
| - ai4bharat/vistaar |
| metrics: |
| - wer |
| pipeline_tag: automatic-speech-recognition |
| library_name: ctc |
| model-index: |
| - name: indic-conformer-600m |
|   results: |
|   - task: |
|       type: automatic-speech-recognition |
|       name: Automatic Speech Recognition |
|     dataset: |
|       name: Vistaar (Kathbath) |
|       type: ai4bharat/vistaar |
|     metrics: |
|     - type: wer |
|       value: 9.00 |
|       name: WER (+ Hindi-5M LM) |
|   - task: |
|       type: automatic-speech-recognition |
|       name: Automatic Speech Recognition |
|     dataset: |
|       name: Vistaar (Kathbath Noisy) |
|       type: ai4bharat/vistaar |
|     metrics: |
|     - type: wer |
|       value: 10.19 |
|       name: WER (+ Hindi-5M LM) |
|   - task: |
|       type: automatic-speech-recognition |
|       name: Automatic Speech Recognition |
|     dataset: |
|       name: Vistaar (FLEURS) |
|       type: ai4bharat/vistaar |
|     metrics: |
|     - type: wer |
|       value: 11.18 |
|       name: WER (+ Hindi-5M LM) |
|   - task: |
|       type: automatic-speech-recognition |
|       name: Automatic Speech Recognition |
|     dataset: |
|       name: Vistaar (CommonVoice) |
|       type: ai4bharat/vistaar |
|     metrics: |
|     - type: wer |
|       value: 12.54 |
|       name: WER (+ Hindi-5M LM) |
|   - task: |
|       type: automatic-speech-recognition |
|       name: Automatic Speech Recognition |
|     dataset: |
|       name: Vistaar (MUCS) |
|       type: ai4bharat/vistaar |
|     metrics: |
|     - type: wer |
|       value: 9.05 |
|       name: WER (+ Hindi-5M LM) |
|   - task: |
|       type: automatic-speech-recognition |
|       name: Automatic Speech Recognition |
|     dataset: |
|       name: Vistaar (Gramvaani) |
|       type: ai4bharat/vistaar |
|     metrics: |
|     - type: wer |
|       value: 24.09 |
|       name: WER (+ Hindi-5M LM) |
| --- |
| |
| # Indic Conformer ASR — Hindi (600M) |
|
|
| 600M-parameter Conformer encoder for Hindi automatic speech recognition, evaluated on all 7 subsets of the [Vistaar benchmark](https://arxiv.org/abs/2305.15386). Achieves **12.09% average WER** with a custom 5-gram KenLM across read speech, noisy speech, broadcast, conversational, and rural dialectal Hindi. |
|
|
| Runs locally on CPU, Apple Silicon (MPS), and NVIDIA CUDA; a GPU is optional. Measured real-time factor (RTF): **0.27×** on an Apple M4 CPU (about 3.7× faster than real time) and **~0.03–0.05×** on Apple MPS (roughly 20–33× faster than real time). |
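
RTF is processing time divided by audio duration, so values below 1 mean faster than real time and the speed-up is the reciprocal. A minimal sketch of that conversion follows; the function names are illustrative and are not part of the repository.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent transcribing / duration of the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def speedup_over_real_time(rtf: float) -> float:
    """Speed-up relative to real time is the reciprocal of the RTF."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


if __name__ == "__main__":
    # Hypothetical timing: a 10 s clip transcribed in 2.7 s gives RTF 0.27.
    rtf = real_time_factor(2.7, 10.0)
    print(f"RTF {rtf:.2f} -> {speedup_over_real_time(rtf):.1f}x faster than real time")
```
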
|
|
| Code and evaluation scripts: [github.com/abhayverma6300/indic-asr-conformer](https://github.com/abhayverma6300/indic-asr-conformer/) |
|
|
| --- |
|
|
| ## Vistaar Results |
|
|
| WER is computed after Devanagari-aware normalisation (dandas and punctuation stripped). LM results use beam search with a beam width of 100. |
|
|
| | Dataset | Domain | Greedy WER | + Hindi-5M LM | |
| |---|---|---|---| |
| | Kathbath | Read speech | 10.34% | **9.00%** | |
| | Kathbath Noisy | Noisy read speech | 11.86% | **10.19%** | |
| | FLEURS | Broadcast / read | 12.68% | **11.18%** | |
| | CommonVoice | Crowd-sourced read | 16.57% | **12.54%** | |
| | IndicTTS | TTS-derived | 9.49% | **8.55%** | |
| | MUCS | Conversational | 10.41% | **9.05%** | |
| | Gramvaani | Rural / dialectal | 27.61% | **24.09%** | |
| | **Average** | | **14.14%** | **12.09%** | |
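
The scoring described above can be sketched as follows: strip dandas and punctuation, split on whitespace, then divide the word-level edit distance by the number of reference words. This is an illustration under assumptions, not the repository's evaluation script. In particular, the exact punctuation set in `_PUNCT` is a guess, and the repository's normaliser may strip more or fewer characters.

```python
from typing import List

# Assumption: danda (U+0964), double danda (U+0965) and common ASCII punctuation.
_PUNCT = "\u0964\u0965" + ".,!?;:\"'()"
_TABLE = str.maketrans({ch: " " for ch in _PUNCT})


def normalise(text: str) -> List[str]:
    """Replace dandas/punctuation with spaces, then split into words."""
    return text.translate(_TABLE).split()


def word_errors(reference: List[str], hypothesis: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """WER in percent: word errors divided by the number of reference words."""
    ref, hyp = normalise(reference), normalise(hypothesis)
    if not ref:
        raise ValueError("reference has no words after normalisation")
    return 100.0 * word_errors(ref, hyp) / len(ref)


if __name__ == "__main__":
    # The trailing danda is ignored; one dropped word out of five gives 20% WER.
    print(wer("मैं घर जा रहा हूँ।", "मैं घर रहा हूँ"))
```

Corpus-level WER is normally computed by summing errors and reference words over all utterances before dividing, rather than averaging per-utterance percentages.
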
|
|
| ### Leaderboard context |
|
|
| | Model | Avg WER | Open weights | CPU inference | |
| |---|---|---|---| |
| | **Indic Conformer 600M + Hindi-5M LM** | **12.09%** | yes | yes | |
| | IndicWhisper (Whisper-medium fine-tuned) | 13.6% | yes | slow | |
| | NVIDIA NeMo large | 18.6% | yes | no | |
| | Azure STT | ~20% | no | no | |
| | Google STT | ~24% | no | no | |
|
|
| Numbers for other models from the [Vistaar paper](https://arxiv.org/abs/2305.15386) (AI4Bharat, 2023). |
|
|
| --- |
|
|
| ## Model files |
|
|
| | File | Size | Description | |
| |---|---|---| |
| | `am_model.pt` | 2.4 GB | Original TorchScript AM (hard-coded CUDA device literals; for NVIDIA GPUs) | |
| | `am_model_cpu.pt` | 2.4 GB | Patched for CPU inference | |
| | `am_model_mps.pt` | 2.4 GB | Patched for Apple Silicon MPS | |
| | `preprocessor.pt` | ~92 KB | Log-Mel frontend | |
| | `lm/hindi/hi.bin` | 145 MB | 5-gram KenLM (Hindi-5M) | |
| | `lm/hindi/unigrams.txt` | — | 201k Hindi words for pyctcdecode | |
|
|
| --- |
|
|
| ## Quickstart |
|
|
| ### Install dependencies |
|
|
| ```bash |
| pip install torch torchaudio pyctcdecode kenlm huggingface_hub |
| ``` |
|
|
| ### CPU inference |
|
|
| ```bash |
| git clone https://github.com/Abhay-Verma031/indic-asr-conformer |
| cd indic-asr-conformer |
| |
| huggingface-cli download Abhay-Verma031/indic-conformer-600m \ |
| --local-dir extracted_models_v3/ |
| |
| python inference/cpu_infer.py \ |
| --audio speech.wav \ |
| --language hi \ |
| --preprocessor extracted_models_v3/preprocessor.pt \ |
| --am extracted_models_v3/am_model_cpu.pt \ |
| --lm extracted_models_v3/lm/hindi/hi.bin |
| ``` |
|
|
| ### Apple Silicon MPS |
|
|
| ```bash |
| python inference/cpu_infer.py \ |
| --audio speech.wav \ |
| --language hi \ |
| --preprocessor extracted_models_v3/preprocessor.pt \ |
| --am extracted_models_v3/am_model_mps.pt \ |
| --device mps \ |
| --lm extracted_models_v3/lm/hindi/hi.bin |
| ``` |
|
|
| ### NVIDIA GPU |
|
|
| ```bash |
| python inference/gpu_infer.py \ |
| --audio speech.wav \ |
| --language hi \ |
| --preprocessor extracted_models_v3/preprocessor.pt \ |
| --am extracted_models_v3/am_model.pt \ |
| --lm extracted_models_v3/lm/hindi/hi.bin |
| ``` |
|
|
| --- |
|
|
| ## Architecture |
|
|
| ``` |
| AUDIO (16 kHz mono, FP32) |
| │ |
| ▼ |
| asr_preprocessor 80-dim log-Mel filterbank [B, 80, T'] |
| │ |
| ▼ |
| asr_am Conformer encoder, ~600M params |
| output: CTC logprobs [B, T', 257] |
| (256 Hindi BPE tokens + CTC blank) |
| │ |
| ▼ |
| asr_decoder pyctcdecode CTC beam search + KenLM |
| α=0.3 β=1.0 beam_width=100 |
| │ |
| ▼ |
| TRANSCRIPT |
| ``` |
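
The greedy column in the results table corresponds to decoding without the LM. A standard best-path CTC decode takes the argmax of each frame, collapses consecutive repeats, and drops blanks. The sketch below works on plain Python lists for one utterance. Placing the blank at index 256 (the last of the 257 classes) is an assumption, and mapping the resulting ids back to text needs the BPE vocabulary, which is not shown here.

```python
from typing import List, Sequence

BLANK_ID = 256  # assumption: blank is the last of the 257 output classes


def ctc_greedy_ids(
    logprobs: Sequence[Sequence[float]], blank_id: int = BLANK_ID
) -> List[int]:
    """Best-path CTC decode for one utterance given [T', vocab] scores."""
    token_ids: List[int] = []
    previous = blank_id
    for frame in logprobs:
        # Index of the highest-scoring class in this frame.
        best = max(range(len(frame)), key=frame.__getitem__)
        # Emit only non-blank tokens that differ from the previous frame's argmax.
        if best != blank_id and best != previous:
            token_ids.append(best)
        previous = best
    return token_ids


if __name__ == "__main__":
    # Toy 3-class example (blank_id=2); per-frame argmax is 0, 0, blank, 0, 1, 1, blank.
    frames = [
        [0.9, 0.05, 0.05], [0.8, 0.1, 0.1], [0.1, 0.1, 0.8], [0.7, 0.2, 0.1],
        [0.1, 0.8, 0.1], [0.2, 0.7, 0.1], [0.1, 0.1, 0.8],
    ]
    print(ctc_greedy_ids(frames, blank_id=2))
```

A blank between two identical tokens is what allows a genuine repeat to survive the collapse step, as with the two `0` tokens in the toy example.
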
|
|
| The AM is a multilingual model covering all 22 scheduled Indian languages through a 5633-token multilingual BPE vocabulary. Each language uses a 256-token slice at a fixed offset; the Hindi slice starts at offset 1536. The model is exported as TorchScript, so running the preprocessor and AM requires only `torch` and `torchaudio`; beam-search decoding with the LM additionally uses `pyctcdecode`. |
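
The vocabulary arithmetic can be checked with a short sketch: 22 languages × 256 tokens gives 5632, and one extra entry brings the total to 5633. Treating that extra entry as a shared CTC blank at the final index is an assumption made for illustration, as are all names below. The exported model may do this selection internally.

```python
from typing import List, Sequence

TOKENS_PER_LANGUAGE = 256
NUM_LANGUAGES = 22
HINDI_OFFSET = 1536
VOCAB_SIZE = NUM_LANGUAGES * TOKENS_PER_LANGUAGE + 1  # 5633
BLANK_INDEX = VOCAB_SIZE - 1  # assumption: shared blank stored at the last index


def language_columns(offset: int, blank_index: int = BLANK_INDEX) -> List[int]:
    """Column indices for one language: its 256 BPE tokens followed by blank."""
    if offset % TOKENS_PER_LANGUAGE != 0 or not 0 <= offset < blank_index:
        raise ValueError("offset must be a multiple of 256 inside the vocabulary")
    return list(range(offset, offset + TOKENS_PER_LANGUAGE)) + [blank_index]


def select_language(frame: Sequence[float], offset: int) -> List[float]:
    """Reduce one 5633-wide frame of scores to the 257 columns of one language."""
    return [frame[i] for i in language_columns(offset)]


if __name__ == "__main__":
    columns = language_columns(HINDI_OFFSET)
    print(len(columns), columns[0], columns[255], columns[-1])
```
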
|
|
| --- |
|
|
| ## Hindi language model |
|
|
| The greedy CTC baseline (14.14% average WER) is already competitive. Adding the Hindi-5M KenLM lowers it to 12.09%, an absolute reduction of 2.05 percentage points, by scoring beam-search candidates with a 5-gram language model. |
|
|
| | | Hindi-5M | |
| |---|---| |
| | Order | 5-gram | |
| | Binary size | 145 MB | |
| | Training sentences | 5,000,000 | |
| | Unigrams | 201,136 | |
| | α | 0.3 | |
| | β | 1.0 | |
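
To show how α and β enter decoding, here is a sketch of the commonly used shallow-fusion score: acoustic log-probability plus α times the LM log-probability plus β per word. The exact bookkeeping inside the beam search may differ, and the candidate scores in the example are made-up numbers for illustration only.

```python
import math
from typing import Iterable, Tuple

ALPHA = 0.3  # LM weight
BETA = 1.0   # per-word bonus


def fused_score(
    acoustic_logprob: float,
    lm_log10_prob: float,
    num_words: int,
    alpha: float = ALPHA,
    beta: float = BETA,
) -> float:
    """Acoustic log-prob + alpha * LM log-prob + beta * word count (natural log)."""
    lm_logprob = lm_log10_prob * math.log(10.0)  # KenLM scores are log10 values
    return acoustic_logprob + alpha * lm_logprob + beta * num_words


def pick_best(candidates: Iterable[Tuple[str, float, float]]) -> str:
    """candidates: (text, acoustic_logprob, lm_log10_prob) triples."""
    return max(
        candidates,
        key=lambda c: fused_score(c[1], c[2], len(c[0].split())),
    )[0]


if __name__ == "__main__":
    # Made-up scores: the second candidate is slightly better acoustically,
    # but the LM prefers the first, which wins after fusion.
    beams = [
        ("मैं घर जा रहा हूँ", -4.0, -3.0),
        ("मैं गर जा रहा हूँ", -3.8, -6.0),
    ]
    print(pick_best(beams))
```

A larger α trusts the LM more; a larger β counteracts the LM's tendency to favour shorter outputs.
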
|
|
| Training corpus: Wikipedia (hi), CC-100 (hi), CulturaX (hi), OSCAR-2301 (hi), C4 (hi) — ~5M sentences after deduplication and Devanagari filtering. |
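
The deduplication and Devanagari filtering step could look roughly like the sketch below. The 0.9 character-ratio threshold, the whitespace normalisation, and the function names are all assumptions chosen for illustration. The actual corpus pipeline is not described in this card.

```python
from typing import Iterable, Iterator


def is_devanagari_sentence(sentence: str, min_ratio: float = 0.9) -> bool:
    """True if at least min_ratio of non-space characters lie in U+0900..U+097F."""
    chars = [ch for ch in sentence if not ch.isspace()]
    if not chars:
        return False
    devanagari = sum(1 for ch in chars if "\u0900" <= ch <= "\u097f")
    return devanagari / len(chars) >= min_ratio


def filter_and_deduplicate(
    sentences: Iterable[str], min_ratio: float = 0.9
) -> Iterator[str]:
    """Yield each Devanagari sentence once, with whitespace collapsed."""
    seen = set()
    for sentence in sentences:
        cleaned = " ".join(sentence.split())
        if cleaned in seen or not is_devanagari_sentence(cleaned, min_ratio):
            continue
        seen.add(cleaned)
        yield cleaned


if __name__ == "__main__":
    raw = ["यह एक वाक्य है", "यह  एक वाक्य है", "This is English", ""]
    print(list(filter_and_deduplicate(raw)))
```
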
|
|
| --- |
|
|
| ## Citation |
|
|
| ```bibtex |
| @misc{indic-conformer-600m, |
| author = {Abhay Verma}, |
| title = {Indic Conformer ASR — Hindi 600M}, |
| year = {2026}, |
| url = {https://huggingface.co/abhayverma6300/indic-conformer-600m} |
| } |
| ``` |
|
|