language:
- hi
license: apache-2.0
tags:
- automatic-speech-recognition
- hindi
- conformer
- ctc
- kenlm
- indic
- asr
- speech
- vistaar
- indian-languages
datasets:
- ai4bharat/vistaar
metrics:
- wer
pipeline_tag: automatic-speech-recognition
library_name: ctc
model-index:
- name: indic-conformer-600m
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Vistaar (Kathbath)
      type: ai4bharat/vistaar
    metrics:
    - type: wer
      value: 9
      name: WER (+ Hindi-5M LM)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Vistaar (Kathbath Noisy)
      type: ai4bharat/vistaar
    metrics:
    - type: wer
      value: 10.19
      name: WER (+ Hindi-5M LM)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Vistaar (FLEURS)
      type: ai4bharat/vistaar
    metrics:
    - type: wer
      value: 11.18
      name: WER (+ Hindi-5M LM)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Vistaar (CommonVoice)
      type: ai4bharat/vistaar
    metrics:
    - type: wer
      value: 12.54
      name: WER (+ Hindi-5M LM)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Vistaar (MUCS)
      type: ai4bharat/vistaar
    metrics:
    - type: wer
      value: 9.05
      name: WER (+ Hindi-5M LM)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Vistaar (Gramvaani)
      type: ai4bharat/vistaar
    metrics:
    - type: wer
      value: 24.09
      name: WER (+ Hindi-5M LM)
Indic Conformer ASR — Hindi (600M)
600M-parameter Conformer encoder for Hindi automatic speech recognition, evaluated on all 7 subsets of the Vistaar benchmark. Achieves 12.09% average WER with a custom 5-gram KenLM across read speech, noisy speech, broadcast, conversational, and rural dialectal Hindi.
Runs locally on CPU, Apple Silicon MPS, and NVIDIA CUDA; no GPU is required. On an Apple M4 CPU the real-time factor (RTF) is 0.27× (3.7× faster than real time). On Apple MPS it is ~0.03–0.05× (roughly 20–33× faster than real time).
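For readers unfamiliar with RTF, the relation between the two numbers quoted above can be sketched as follows. The function names and the 60-second timing are hypothetical and chosen only to reproduce the 0.27× figure; they are not taken from the repository's benchmark scripts.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent transcribing / duration of the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def speedup(rtf: float) -> float:
    """How many times faster than real time a given RTF is."""
    return 1.0 / rtf


# Hypothetical timing: a 60 s clip transcribed in 16.2 s.
rtf = real_time_factor(16.2, 60.0)
print(f"RTF = {rtf:.2f}, speed-up = {speedup(rtf):.1f}x")
```

An RTF below 1.0 means the audio is transcribed faster than it plays back.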
Code and evaluation scripts: github.com/abhayverma6300/indic-asr-conformer
Vistaar Results
WER is computed after Devanagari-aware normalisation (dandas and punctuation stripped). The LM column uses beam search with a beam width of 100.
| Dataset | Domain | Greedy WER | + Hindi-5M LM |
|---|---|---|---|
| Kathbath | Read speech | 10.34% | 9.00% |
| Kathbath Noisy | Noisy read speech | 11.86% | 10.19% |
| FLEURS | Broadcast / read | 12.68% | 11.18% |
| CommonVoice | Crowd-sourced read | 16.57% | 12.54% |
| IndicTTS | TTS-derived | 9.49% | 8.55% |
| MUCS | Conversational | 10.41% | 9.05% |
| Gramvaani | Rural / dialectal | 27.61% | 24.09% |
| Average | | 14.14% | 12.09% |
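The scoring procedure behind the table can be approximated with the sketch below. It is a minimal illustration, not the repository's evaluation script: the exact normalisation rules are an assumption (here, the danda, the double danda, and every character in a Unicode punctuation category are replaced by spaces), and `normalise` and `wer` are hypothetical names.

```python
import re
import unicodedata

# Assumption: danda (U+0964) and double danda (U+0965) are stripped, together
# with any character whose Unicode category starts with "P" (punctuation).
DANDAS = "\u0964\u0965"


def normalise(text: str) -> str:
    """Strip dandas and punctuation, then collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    kept = [
        " " if ch in DANDAS or unicodedata.category(ch).startswith("P") else ch
        for ch in text
    ]
    return re.sub(r"\s+", " ", "".join(kept)).strip()


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = normalise(reference).split()
    hyp = normalise(hypothesis).split()
    if not ref:
        raise ValueError("reference is empty after normalisation")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
print(wer("मैं घर जा रहा हूँ।", "मैं घर जा रही हूँ"))
```

Vowel signs and the virama are combining marks rather than punctuation, so this filter leaves them in place. A corpus-level figure is usually computed as total errors over total reference words rather than as a mean of per-utterance values.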
Leaderboard context
| Model | Avg WER | Open weights | CPU inference |
|---|---|---|---|
| Indic Conformer 600M + Hindi-5M LM | 12.09% | yes | yes |
| IndicWhisper (Whisper-medium fine-tuned) | 13.6% | yes | slow |
| Nvidia NeMo large | 18.6% | yes | no |
| Azure STT | ~20% | no | no |
| Google STT | ~24% | no | no |
Numbers for the other models are taken from the Vistaar paper (AI4Bharat, 2023).
Model files
| File | Size | Description |
|---|---|---|
| am_model.pt | 2.4 GB | Original TorchScript AM (CUDA device literals) |
| am_model_cpu.pt | 2.4 GB | Patched for CPU inference |
| am_model_mps.pt | 2.4 GB | Patched for Apple Silicon MPS |
| preprocessor.pt | ~92 KB | Log-Mel frontend |
| lm/hindi/hi.bin | 145 MB | 5-gram KenLM (Hindi-5M) |
| lm/hindi/unigrams.txt | n/a | 201k Hindi words for pyctcdecode |
Quickstart
Install dependencies
pip install torch torchaudio pyctcdecode kenlm huggingface_hub
CPU inference
git clone https://github.com/Abhay-Verma031/indic-asr-conformer
cd indic-asr-conformer
huggingface-cli download Abhay-Verma031/indic-conformer-600m \
--local-dir extracted_models_v3/
python inference/cpu_infer.py \
--audio speech.wav \
--language hi \
--preprocessor extracted_models_v3/preprocessor.pt \
--am extracted_models_v3/am_model_cpu.pt \
--lm extracted_models_v3/lm/hindi/hi.bin
Apple Silicon MPS
python inference/cpu_infer.py \
--audio speech.wav \
--language hi \
--preprocessor extracted_models_v3/preprocessor.pt \
--am extracted_models_v3/am_model_mps.pt \
--device mps \
--lm extracted_models_v3/lm/hindi/hi.bin
NVIDIA GPU
python inference/gpu_infer.py \
--audio speech.wav \
--language hi \
--preprocessor extracted_models_v3/preprocessor.pt \
--am extracted_models_v3/am_model.pt \
--lm extracted_models_v3/lm/hindi/hi.bin
Architecture
AUDIO (16 kHz mono, FP32)
│
▼
asr_preprocessor 80-dim log-Mel filterbank [B, 80, T']
│
▼
asr_am Conformer encoder, ~600M params
output: CTC logprobs [B, T', 257]
(256 Hindi BPE tokens + CTC blank)
│
▼
asr_decoder pyctcdecode CTC beam search + KenLM
α=0.3 β=1.0 beam_width=100
│
▼
TRANSCRIPT
The AM is a multilingual model covering all 22 scheduled Indian languages via a 5633-token multilingual BPE vocabulary. Each language uses a 256-token slice at a fixed offset; for Hindi the slice starts at offset 1536. The model is exported as TorchScript, so running the AM itself requires only torch and torchaudio; beam-search decoding with the LM additionally uses pyctcdecode.
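The index arithmetic behind the per-language slices can be sketched as below. The numbers 256, 1536, 5633 and 22 come from the paragraph above. Everything else is an assumption for illustration: that the one extra entry (5633 = 22 × 256 + 1) is the CTC blank, that it sits at the last index, and the helper names themselves. The diagram shows the exported model already emitting 257 scores per frame, so the slicing may well happen inside the TorchScript graph.

```python
TOKENS_PER_LANGUAGE = 256  # from the text above
HINDI_OFFSET = 1536        # from the text above
TOTAL_VOCAB = 5633         # from the text above
NUM_LANGUAGES = 22         # from the text above

# Assumption: the single extra entry is the CTC blank and sits at the last
# index. The real export may place it elsewhere.
BLANK_INDEX = TOTAL_VOCAB - 1


def language_token_range(offset: int) -> range:
    """Token IDs owned by the language whose slice starts at `offset`."""
    return range(offset, offset + TOKENS_PER_LANGUAGE)


def select_language_scores(frame_scores, offset):
    """Reduce one frame of 5633 scores to 257: the language slice plus blank."""
    if len(frame_scores) != TOTAL_VOCAB:
        raise ValueError(f"expected {TOTAL_VOCAB} scores, got {len(frame_scores)}")
    language_part = [frame_scores[i] for i in language_token_range(offset)]
    return language_part + [frame_scores[BLANK_INDEX]]


# Dummy frame whose score at each position equals its own index.
frame = list(range(TOTAL_VOCAB))
hindi = select_language_scores(frame, HINDI_OFFSET)
print(len(hindi), hindi[0], hindi[255], hindi[256])
```

With this layout Hindi occupies token IDs 1536 to 1791, which is the seventh 256-token slice (1536 / 256 = 6, counting from zero).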
Hindi language model
The greedy CTC baseline (14.14% average WER) is already competitive. The Hindi-5M KenLM lowers it to 12.09%, an absolute reduction of 2.05 percentage points, by rescoring beam candidates with 5-gram language model scores.
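For reference, greedy CTC decoding (the baseline mentioned above) takes the best-scoring token in each frame, merges consecutive repeats, and drops blanks. The sketch below shows that rule on plain Python lists; `greedy_ctc_decode` is a hypothetical name, the position of the blank index is an assumption, and mapping token IDs back to BPE pieces and text is not shown.

```python
from itertools import groupby


def greedy_ctc_decode(frame_scores, blank_index):
    """Return token IDs after argmax, repeat-collapsing and blank removal."""
    # 1. Best token per frame.
    best_path = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    # 2. Merge consecutive duplicates.
    collapsed = [token for token, _ in groupby(best_path)]
    # 3. Remove the CTC blank.
    return [token for token in collapsed if token != blank_index]


# Toy example with 2 real tokens (0, 1) and the blank at index 2.
frames = [
    [0.9, 0.0, 0.1],  # 0
    [0.8, 0.1, 0.1],  # 0 (repeat, merged)
    [0.1, 0.1, 0.8],  # blank
    [0.7, 0.2, 0.1],  # 0 (kept: separated from the first 0 by a blank)
    [0.1, 0.8, 0.1],  # 1
    [0.2, 0.7, 0.1],  # 1 (repeat, merged)
    [0.0, 0.1, 0.9],  # blank
]
print(greedy_ctc_decode(frames, blank_index=2))
```

Beam search differs in that it keeps several partial hypotheses alive per frame, which is what gives the language model candidates to rescore.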
| Hindi-5M | |
|---|---|
| Order | 5-gram |
| Binary size | 145 MB |
| Training sentences | 5,000,000 |
| Unigrams | 201,136 |
| α | 0.3 |
| β | 1.0 |
Training corpus: Wikipedia (hi), CC-100 (hi), CulturaX (hi), OSCAR-2301 (hi), C4 (hi) — ~5M sentences after deduplication and Devanagari filtering.
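The roles of α and β in the table above can be illustrated with a generic shallow-fusion score, where α weights the language-model log-probability and β is a per-word bonus. This is a simplified sketch and not pyctcdecode's internal code: the `Candidate` class, the function names and all score values are made up for illustration, and both log-probabilities are assumed to be natural logs.

```python
from dataclasses import dataclass

ALPHA = 0.3  # LM weight, from the table above
BETA = 1.0   # per-word bonus, from the table above


@dataclass
class Candidate:
    text: str
    am_logprob: float  # acoustic (CTC) log-probability, hypothetical value
    lm_logprob: float  # 5-gram LM log-probability, hypothetical value


def fused_score(c: Candidate, alpha: float = ALPHA, beta: float = BETA) -> float:
    """Acoustic score + alpha * LM score + beta * number of words."""
    n_words = len(c.text.split())
    return c.am_logprob + alpha * c.lm_logprob + beta * n_words


def rescore(candidates):
    """Pick the beam candidate with the highest fused score."""
    return max(candidates, key=fused_score)


beam = [
    # Slightly better acoustic score, but an unlikely word sequence.
    Candidate("मैं घर जा रही हूँ", am_logprob=-4.0, lm_logprob=-30.0),
    # Slightly worse acoustic score, but preferred by the LM.
    Candidate("मैं घर जा रहा हूँ", am_logprob=-4.5, lm_logprob=-20.0),
]
print(rescore(beam).text)
```

In this toy beam the acoustic model alone would choose the first candidate, while the fused score selects the second. The β term offsets the LM's tendency to favour shorter outputs, since every extra word lowers the LM log-probability.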
Citation
@misc{indic-conformer-600m,
author = {Abhay Verma},
title = {Indic Conformer ASR — Hindi 600M},
year = {2026},
url = {https://huggingface.co/abhayverma6300/indic-conformer-600m}
}