---
language:
- hi
license: apache-2.0
tags:
- automatic-speech-recognition
- hindi
- conformer
- ctc
- kenlm
- indic
- asr
- speech
- vistaar
- indian-languages
datasets:
- ai4bharat/vistaar
metrics:
- wer
pipeline_tag: automatic-speech-recognition
library_name: ctc
model-index:
- name: indic-conformer-600m
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Vistaar (Kathbath)
      type: ai4bharat/vistaar
    metrics:
    - type: wer
      value: 9.00
      name: WER (+ Hindi-5M LM)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Vistaar (Kathbath Noisy)
      type: ai4bharat/vistaar
    metrics:
    - type: wer
      value: 10.19
      name: WER (+ Hindi-5M LM)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Vistaar (FLEURS)
      type: ai4bharat/vistaar
    metrics:
    - type: wer
      value: 11.18
      name: WER (+ Hindi-5M LM)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Vistaar (CommonVoice)
      type: ai4bharat/vistaar
    metrics:
    - type: wer
      value: 12.54
      name: WER (+ Hindi-5M LM)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Vistaar (MUCS)
      type: ai4bharat/vistaar
    metrics:
    - type: wer
      value: 9.05
      name: WER (+ Hindi-5M LM)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Vistaar (Gramvaani)
      type: ai4bharat/vistaar
    metrics:
    - type: wer
      value: 24.09
      name: WER (+ Hindi-5M LM)
---
# Indic Conformer ASR — Hindi (600M)
A 600M-parameter Conformer encoder for Hindi automatic speech recognition, evaluated on all 7 subsets of the [Vistaar benchmark](https://arxiv.org/abs/2305.15386). With a custom 5-gram KenLM it achieves **12.09% average WER** across read speech, noisy speech, broadcast, conversational, and rural dialectal Hindi.
Runs locally on CPU, Apple Silicon MPS, and NVIDIA CUDA; no GPU is required. Real-time factor (RTF, processing time divided by audio duration) is **0.27×** on an Apple M4 CPU (3.7× faster than real time) and **~0.03–0.05×** on Apple MPS (20–30× faster than real time).
Code and evaluation scripts: [github.com/abhayverma6300/indic-asr-conformer](https://github.com/abhayverma6300/indic-asr-conformer/)
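
For readers unfamiliar with RTF, the relationship between the figure and the "faster than real time" multiplier is a simple ratio. The sketch below uses a made-up timing (2.7 s to transcribe a 10 s clip) purely for illustration; the helper names are hypothetical and not part of the repository.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent transcribing / duration of the audio (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def speedup(rtf: float) -> float:
    """How many times faster than real time a given RTF is."""
    return 1.0 / rtf


# Hypothetical measurement: 2.7 s of compute for a 10 s clip.
rtf = real_time_factor(2.7, 10.0)
print(f"RTF {rtf:.2f}, {speedup(rtf):.1f}x faster than real time")
```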
---
## Vistaar Results
WER is computed after Devanagari-aware normalisation (dandas and punctuation stripped). LM results use a beam width of 100.
| Dataset | Domain | Greedy WER | + Hindi-5M LM |
|---|---|---|---|
| Kathbath | Read speech | 10.34% | **9.00%** |
| Kathbath Noisy | Noisy read speech | 11.86% | **10.19%** |
| FLEURS | Broadcast / read | 12.68% | **11.18%** |
| CommonVoice | Crowd-sourced read | 16.57% | **12.54%** |
| IndicTTS | TTS-derived | 9.49% | **8.55%** |
| MUCS | Conversational | 10.41% | **9.05%** |
| Gramvaani | Rural / dialectal | 27.61% | **24.09%** |
| **Average** | | **14.14%** | **12.09%** |
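
As a rough illustration of how the normalisation affects scoring, here is a minimal WER sketch. It assumes that stripping every Unicode punctuation character (which includes the danda `।` and double danda `॥`) approximates the normalisation described above; the evaluation scripts in the repository may differ in detail, and `normalise` and `word_error_rate` are illustrative names.

```python
import unicodedata


def normalise(text: str) -> list[str]:
    """Replace Unicode punctuation (category P*, incl. dandas) with spaces, then split into words."""
    text = unicodedata.normalize("NFC", text)
    cleaned = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )
    return cleaned.split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = normalise(reference), normalise(hypothesis)
    if not ref:
        raise ValueError("reference has no words after normalisation")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five; the trailing danda is ignored.
print(f"{word_error_rate('मैं घर जा रहा हूँ।', 'मैं घर जा रही हूँ'):.1%}")
```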
### Leaderboard context
| Model | Avg WER | Open weights | CPU inference |
|---|---|---|---|
| **Indic Conformer 600M + Hindi-5M LM** | **12.09%** | yes | yes |
| IndicWhisper (Whisper-medium fine-tuned) | 13.6% | yes | slow |
| Nvidia NeMo large | 18.6% | yes | no |
| Azure STT | ~20% | no | no |
| Google STT | ~24% | no | no |
Numbers for other models from the [Vistaar paper](https://arxiv.org/abs/2305.15386) (AI4Bharat, 2023).
---
## Model files
| File | Size | Description |
|---|---|---|
| `am_model.pt` | 2.4 GB | Original TorchScript AM (CUDA device literals) |
| `am_model_cpu.pt` | 2.4 GB | Patched for CPU inference |
| `am_model_mps.pt` | 2.4 GB | Patched for Apple Silicon MPS |
| `preprocessor.pt` | ~92 KB | Log-Mel frontend |
| `lm/hindi/hi.bin` | 145 MB | 5-gram KenLM (Hindi-5M) |
| `lm/hindi/unigrams.txt` | — | 201k Hindi words for pyctcdecode |
---
## Quickstart
### Install dependencies
```bash
pip install torch torchaudio pyctcdecode kenlm huggingface_hub
```
### CPU inference
```bash
# Fetch the inference scripts
git clone https://github.com/Abhay-Verma031/indic-asr-conformer
cd indic-asr-conformer

# Download the model files and the Hindi LM
huggingface-cli download Abhay-Verma031/indic-conformer-600m \
  --local-dir extracted_models_v3/

# Transcribe with the CPU-patched acoustic model
python inference/cpu_infer.py \
  --audio speech.wav \
  --language hi \
  --preprocessor extracted_models_v3/preprocessor.pt \
  --am extracted_models_v3/am_model_cpu.pt \
  --lm extracted_models_v3/lm/hindi/hi.bin
```
### Apple Silicon MPS
```bash
python inference/cpu_infer.py \
  --audio speech.wav \
  --language hi \
  --preprocessor extracted_models_v3/preprocessor.pt \
  --am extracted_models_v3/am_model_mps.pt \
  --device mps \
  --lm extracted_models_v3/lm/hindi/hi.bin
```
### NVIDIA GPU
```bash
python inference/gpu_infer.py \
  --audio speech.wav \
  --language hi \
  --preprocessor extracted_models_v3/preprocessor.pt \
  --am extracted_models_v3/am_model.pt \
  --lm extracted_models_v3/lm/hindi/hi.bin
```
---
## Architecture
```
AUDIO (16 kHz mono, FP32)
        │
        ▼
asr_preprocessor    80-dim log-Mel filterbank    [B, 80, T']
        │
        ▼
asr_am              Conformer encoder, ~600M params
                    output: CTC logprobs [B, T', 257]
                    (256 Hindi BPE tokens + CTC blank)
        │
        ▼
asr_decoder         pyctcdecode CTC beam search + KenLM
                    α=0.3  β=1.0  beam_width=100
        │
        ▼
TRANSCRIPT
```
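
The pipeline expects 16 kHz mono audio. A minimal pre-flight check using only Python's standard `wave` module might look like the sketch below. `describe_wav` and `input_problems` are illustrative names, not part of the repository, and `wave` reads PCM WAV files only, so other formats would need converting first.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Read sample rate, channel count and duration from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "seconds": wav.getnframes() / rate,
        }


def input_problems(path, expected_rate=16000, expected_channels=1):
    """Return human-readable mismatches against the expected input format."""
    info = describe_wav(path)
    problems = []
    if info["sample_rate"] != expected_rate:
        problems.append(
            f"sample rate is {info['sample_rate']} Hz, expected {expected_rate} Hz"
        )
    if info["channels"] != expected_channels:
        problems.append(
            f"{info['channels']} channels found, expected {expected_channels}"
        )
    return problems


# Demo: write one second of 16 kHz mono silence, then check it.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.wav")
    with wave.open(demo_path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)  # 16-bit PCM
        out.setframerate(16000)
        out.writeframes(b"\x00\x00" * 16000)
    print(input_problems(demo_path))  # an empty list means the header looks fine
```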
The acoustic model (AM) is multilingual, covering all 22 scheduled Indian languages via a 5633-token multilingual BPE vocabulary (consistent with 22 × 256 language tokens plus one CTC blank). Each language uses a 256-token slice at a fixed offset; for Hindi the slice starts at offset 1536. The model is exported as TorchScript, so running it requires only `torch` and `torchaudio`; beam-search decoding with the LM additionally uses `pyctcdecode`.
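
The per-language slicing reduces to simple index arithmetic. The sketch below rests on two assumptions that this card does not state: the CTC blank sits at the last index of the 5633-token vocabulary, and the per-language output is ordered as the 256-token slice followed by the blank. `language_columns` and `greedy_ctc` are illustrative helpers, not functions from the repository.

```python
TOKENS_PER_LANGUAGE = 256
NUM_LANGUAGES = 22
HINDI_OFFSET = 1536  # from the model card

# Assumption: blank is the final entry of the multilingual vocabulary.
BLANK_INDEX = TOKENS_PER_LANGUAGE * NUM_LANGUAGES  # 5632, so the vocab size is 5633


def language_columns(offset):
    """Indices of the 257 outputs for one language: its 256-token slice, then blank."""
    return list(range(offset, offset + TOKENS_PER_LANGUAGE)) + [BLANK_INDEX]


def greedy_ctc(frames, blank):
    """Best-path CTC decoding: per-frame argmax, collapse repeats, drop blanks."""
    token_ids = []
    previous = None
    for frame in frames:
        best = max(range(len(frame)), key=frame.__getitem__)
        if best != previous and best != blank:
            token_ids.append(best)
        previous = best
    return token_ids


hindi = language_columns(HINDI_OFFSET)
print(len(hindi), hindi[0], hindi[255], hindi[-1])

# Toy example with 3 tokens + blank (index 3); scores are made up.
toy_frames = [
    [0.9, 0.0, 0.0, 0.1],  # token 0
    [0.8, 0.1, 0.0, 0.1],  # token 0 again (collapsed)
    [0.1, 0.0, 0.0, 0.9],  # blank
    [0.7, 0.1, 0.1, 0.1],  # token 0 (kept: separated by a blank)
    [0.1, 0.8, 0.0, 0.1],  # token 1
]
print(greedy_ctc(toy_frames, blank=3))
```

Under these assumptions Hindi is the seventh slice (1536 / 256 = 6, counting from zero), and the 257 columns match the `[B, T', 257]` output shape in the diagram.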
---
## Hindi language model
The greedy CTC baseline (14.14% average WER) is already competitive. The Hindi-5M KenLM lowers it to 12.09%, an absolute reduction of 2.05 percentage points, by rescoring beam candidates with 5-gram language model scores.
| | Hindi-5M |
|---|---|
| Order | 5-gram |
| Binary size | 145 MB |
| Training sentences | 5,000,000 |
| Unigrams | 201,136 |
| α | 0.3 |
| β | 1.0 |
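
To give a feel for what α and β do, the sketch below applies a simplified shallow-fusion score (acoustic log-probability plus α times the LM log-probability plus β per word) to two candidate transcripts. This approximates the general idea behind pyctcdecode's scoring and is not its exact implementation; all log-probabilities here are made-up numbers, and `fused_score` and `rescore` are hypothetical helpers.

```python
ALPHA = 0.3  # LM weight, from the table above
BETA = 1.0   # per-word bonus, from the table above


def fused_score(acoustic_logp, lm_logp, num_words, alpha=ALPHA, beta=BETA):
    """Simplified shallow fusion: acoustic score + weighted LM score + word bonus."""
    return acoustic_logp + alpha * lm_logp + beta * num_words


def rescore(candidates):
    """Pick the candidate with the highest fused score."""
    return max(
        candidates,
        key=lambda c: fused_score(c[1], c[2], len(c[0].split())),
    )


# (text, acoustic log-prob, LM log-prob): illustrative values only.
candidates = [
    ("मैं घर जा रहा हूँ", -4.2, -9.0),   # well-formed, slightly worse acoustically
    ("मैं घर जा रहा हु", -4.0, -16.0),   # misspelt last word, penalised by the LM
]

acoustic_only = max(candidates, key=lambda c: c[1])
print("acoustic only:", acoustic_only[0])
print("with LM:      ", rescore(candidates)[0])
```

With these toy numbers the acoustic score alone prefers the misspelt candidate, while the fused score prefers the well-formed one, which is the kind of correction the LM contributes on top of greedy decoding.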
Training corpus: Wikipedia (hi), CC-100 (hi), CulturaX (hi), OSCAR-2301 (hi), C4 (hi) — ~5M sentences after deduplication and Devanagari filtering.
---
## Citation
```bibtex
@misc{indic-conformer-600m,
author = {Abhay Verma},
title = {Indic Conformer ASR — Hindi 600M},
year = {2026},
url = {https://huggingface.co/Abhay-Verma031/indic-conformer-600m}
}
```