---
language:
- hi
license: apache-2.0
tags:
- automatic-speech-recognition
- hindi
- conformer
- ctc
- kenlm
- indic
- asr
- speech
- vistaar
- indian-languages
datasets:
- ai4bharat/vistaar
metrics:
- wer
pipeline_tag: automatic-speech-recognition
library_name: ctc
model-index:
- name: indic-conformer-600m
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Vistaar (Kathbath)
      type: ai4bharat/vistaar
    metrics:
    - type: wer
      value: 9.00
      name: WER (+ Hindi-5M LM)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Vistaar (Kathbath Noisy)
      type: ai4bharat/vistaar
    metrics:
    - type: wer
      value: 10.19
      name: WER (+ Hindi-5M LM)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Vistaar (FLEURS)
      type: ai4bharat/vistaar
    metrics:
    - type: wer
      value: 11.18
      name: WER (+ Hindi-5M LM)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Vistaar (CommonVoice)
      type: ai4bharat/vistaar
    metrics:
    - type: wer
      value: 12.54
      name: WER (+ Hindi-5M LM)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Vistaar (MUCS)
      type: ai4bharat/vistaar
    metrics:
    - type: wer
      value: 9.05
      name: WER (+ Hindi-5M LM)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Vistaar (Gramvaani)
      type: ai4bharat/vistaar
    metrics:
    - type: wer
      value: 24.09
      name: WER (+ Hindi-5M LM)
---

# Indic Conformer ASR: Hindi (600M)

600M-parameter Conformer encoder for Hindi automatic speech recognition, evaluated on all 7 subsets of the [Vistaar benchmark](https://arxiv.org/abs/2305.15386). Achieves **12.09% average WER** with a custom 5-gram KenLM across read speech, noisy speech, broadcast, conversational, and rural dialectal Hindi.

Runs locally on CPU, Apple Silicon MPS, and NVIDIA CUDA (no GPU required).
On Apple M4 CPU: **0.27× RTF** (3.7× faster than real-time).
On Apple MPS: **~0.03–0.05× RTF** (20–30× faster than real-time).

Code and evaluation scripts: [github.com/abhayverma6300/indic-asr-conformer](https://github.com/abhayverma6300/indic-asr-conformer/)

---

## Vistaar Results

WER with Devanagari-aware normalisation (dandas and punctuation stripped). Beam width 100.

| Dataset | Domain | Greedy WER | + Hindi-5M LM |
|---|---|---|---|
| Kathbath | Read speech | 10.34% | **9.00%** |
| Kathbath Noisy | Noisy read speech | 11.86% | **10.19%** |
| FLEURS | Broadcast / read | 12.68% | **11.18%** |
| CommonVoice | Crowd-sourced read | 16.57% | **12.54%** |
| IndicTTS | TTS-derived | 9.49% | **8.55%** |
| MUCS | Conversational | 10.41% | **9.05%** |
| Gramvaani | Rural / dialectal | 27.61% | **24.09%** |
| **Average** | | **14.14%** | **12.09%** |

### Leaderboard context

| Model | Avg WER | Open weights | CPU inference |
|---|---|---|---|
| **Indic Conformer 600M + Hindi-5M LM** | **12.09%** | yes | yes |
| IndicWhisper (Whisper-medium fine-tuned) | 13.6% | yes | slow |
| Nvidia NeMo large | 18.6% | yes | no |
| Azure STT | ~20% | no | no |
| Google STT | ~24% | no | no |

Numbers for other models from the [Vistaar paper](https://arxiv.org/abs/2305.15386) (AI4Bharat, 2023).
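
The two averages in the results table can be re-derived from the per-dataset rows. The sketch below copies the table values and takes an unweighted mean over the seven datasets. That averaging scheme is an assumption (it does reproduce the reported figures), and the variable names are illustrative only.

```python
# Per-dataset WER (%) copied from the results table: (greedy, + Hindi-5M LM).
wer = {
    "Kathbath": (10.34, 9.00),
    "Kathbath Noisy": (11.86, 10.19),
    "FLEURS": (12.68, 11.18),
    "CommonVoice": (16.57, 12.54),
    "IndicTTS": (9.49, 8.55),
    "MUCS": (10.41, 9.05),
    "Gramvaani": (27.61, 24.09),
}

# Assumption: the reported averages are unweighted means over datasets,
# not weighted by the number of utterances or words in each subset.
greedy_avg = sum(g for g, _ in wer.values()) / len(wer)
lm_avg = sum(lm for _, lm in wer.values()) / len(wer)

abs_gain = greedy_avg - lm_avg             # in percentage points
rel_gain = 100.0 * abs_gain / greedy_avg   # as a share of the greedy average

print(f"greedy avg: {greedy_avg:.2f}%")    # 14.14%
print(f"LM avg:     {lm_avg:.2f}%")        # 12.09%
print(f"gain:       {abs_gain:.2f} pp ({rel_gain:.1f}% relative)")
```

Because the mean is unweighted, Gramvaani contributes as much to the average as any other subset despite having by far the highest WER.
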

---

## Model files

| File | Size | Description |
|---|---|---|
| `am_model.pt` | 2.4 GB | Original TorchScript AM (CUDA device literals) |
| `am_model_cpu.pt` | 2.4 GB | Patched for CPU inference |
| `am_model_mps.pt` | 2.4 GB | Patched for Apple Silicon MPS |
| `preprocessor.pt` | ~92 KB | Log-Mel frontend |
| `lm/hindi/hi.bin` | 145 MB | 5-gram KenLM (Hindi-5M) |
| `lm/hindi/unigrams.txt` | not listed | 201k Hindi words for pyctcdecode |

---

## Quickstart

### Install dependencies

```bash
pip install torch torchaudio pyctcdecode
```

### CPU inference

```bash
git clone https://github.com/Abhay-Verma031/indic-asr-conformer
cd indic-asr-conformer

huggingface-cli download Abhay-Verma031/indic-conformer-600m \
  --local-dir extracted_models_v3/

python inference/cpu_infer.py \
  --audio speech.wav \
  --language hi \
  --preprocessor extracted_models_v3/preprocessor.pt \
  --am extracted_models_v3/am_model_cpu.pt \
  --lm extracted_models_v3/lm/hindi/hi.bin
```

### Apple Silicon MPS

```bash
python inference/cpu_infer.py \
  --audio speech.wav \
  --language hi \
  --preprocessor extracted_models_v3/preprocessor.pt \
  --am extracted_models_v3/am_model_mps.pt \
  --device mps \
  --lm extracted_models_v3/lm/hindi/hi.bin
```

### NVIDIA GPU

```bash
python inference/gpu_infer.py \
  --audio speech.wav \
  --language hi \
  --preprocessor extracted_models_v3/preprocessor.pt \
  --am extracted_models_v3/am_model.pt \
  --lm extracted_models_v3/lm/hindi/hi.bin
```

---

## Architecture

```
AUDIO (16 kHz mono, FP32)
    │
    ▼
asr_preprocessor   80-dim log-Mel filterbank [B, 80, T']
    │
    ▼
asr_am             Conformer encoder, ~600M params
                   output: CTC logprobs [B, T', 257]
                   (256 Hindi BPE tokens + CTC blank)
    │
    ▼
asr_decoder        pyctcdecode CTC beam search + KenLM
                   α=0.3 β=1.0 beam_width=100
    │
    ▼
TRANSCRIPT
```

The AM is a multilingual model covering all 22 scheduled Indian languages via a 5633-token multilingual BPE vocabulary. Each language uses a 256-token slice at a fixed offset; for Hindi the slice starts at offset 1536.
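
The vocabulary layout described above can be checked with a little index arithmetic. The sketch below assumes the 5633 entries are 22 consecutive 256-token slices followed by one shared CTC blank at the last index. The text only states the vocabulary size, the slice width, and the Hindi offset, so the blank position and the helper names are assumptions made for illustration. The 257-column output shown in the diagram suggests this selection already happens inside the exported model, so this is not a step you need to run for inference.

```python
NUM_LANGUAGES = 22
TOKENS_PER_LANGUAGE = 256
VOCAB_SIZE = 5633
HINDI_OFFSET = 1536

# Assumption: the single index left over after the 22 slices is the CTC blank.
ASSUMED_BLANK_INDEX = NUM_LANGUAGES * TOKENS_PER_LANGUAGE  # 5632


def language_columns(offset, width=TOKENS_PER_LANGUAGE, blank=ASSUMED_BLANK_INDEX):
    """Column indices for one language's tokens, followed by the blank."""
    return list(range(offset, offset + width)) + [blank]


def select_language(frame, offset):
    """Pick one language's scores out of a full-vocabulary frame (list of floats)."""
    return [frame[i] for i in language_columns(offset)]


# Dummy frame whose score at each column equals the column index,
# so the selected values show which columns were picked.
frame = [float(i) for i in range(VOCAB_SIZE)]
hindi_frame = select_language(frame, HINDI_OFFSET)

print(NUM_LANGUAGES * TOKENS_PER_LANGUAGE + 1 == VOCAB_SIZE)  # True
print(HINDI_OFFSET // TOKENS_PER_LANGUAGE)                    # 6 (zero-based slice index)
print(len(hindi_frame), hindi_frame[0], hindi_frame[-1])      # 257 1536.0 5632.0
```

With tensors the same selection would be a column slice plus the blank column concatenated on the last axis. The sliced scores are no longer normalised over the reduced vocabulary, so a real pipeline would typically re-apply a log-softmax afterwards.
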
The model is exported as TorchScript; running the acoustic model requires only `torch` and `torchaudio`, and LM-fused decoding additionally uses `pyctcdecode`.

---

## Hindi language model

The greedy CTC baseline (14.14% avg WER) is already competitive. The Hindi-5M KenLM lowers it to 12.09%, a reduction of 2.05 percentage points, by rescoring beam candidates with 5-gram language model scores.

| Property | Hindi-5M |
|---|---|
| Order | 5-gram |
| Binary size | 145 MB |
| Training sentences | 5,000,000 |
| Unigrams | 201,136 |
| α (LM weight) | 0.3 |
| β (length score weight) | 1.0 |

Training corpus: Wikipedia (hi), CC-100 (hi), CulturaX (hi), OSCAR-2301 (hi), C4 (hi). About 5M sentences remain after deduplication and Devanagari filtering.

---

## Citation

```bibtex
@misc{indic-conformer-600m,
  author = {Abhay Verma},
  title  = {Indic Conformer ASR: Hindi 600M},
  year   = {2026},
  url    = {https://huggingface.co/abhayverma6300/indic-conformer-600m}
}
```