---
language:
- hi
license: apache-2.0
tags:
- automatic-speech-recognition
- hindi
- conformer
- ctc
- kenlm
- indic
- asr
- speech
- vistaar
- indian-languages
datasets:
- ai4bharat/vistaar
metrics:
- wer
pipeline_tag: automatic-speech-recognition
library_name: ctc
model-index:
- name: indic-conformer-600m
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Vistaar (Kathbath)
      type: ai4bharat/vistaar
    metrics:
    - type: wer
      value: 9.00
      name: WER (+ Hindi-5M LM)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Vistaar (Kathbath Noisy)
      type: ai4bharat/vistaar
    metrics:
    - type: wer
      value: 10.19
      name: WER (+ Hindi-5M LM)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Vistaar (FLEURS)
      type: ai4bharat/vistaar
    metrics:
    - type: wer
      value: 11.18
      name: WER (+ Hindi-5M LM)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Vistaar (CommonVoice)
      type: ai4bharat/vistaar
    metrics:
    - type: wer
      value: 12.54
      name: WER (+ Hindi-5M LM)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Vistaar (MUCS)
      type: ai4bharat/vistaar
    metrics:
    - type: wer
      value: 9.05
      name: WER (+ Hindi-5M LM)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Vistaar (Gramvaani)
      type: ai4bharat/vistaar
    metrics:
    - type: wer
      value: 24.09
      name: WER (+ Hindi-5M LM)
---

# Indic Conformer ASR — Hindi (600M)

A 600M-parameter Conformer encoder for Hindi automatic speech recognition, evaluated on all 7 subsets of the [Vistaar benchmark](https://arxiv.org/abs/2305.15386). With a custom 5-gram KenLM it achieves **12.09% average WER** across read speech, noisy speech, broadcast, conversational, and rural dialectal Hindi.

Runs locally on CPU, Apple Silicon MPS, and NVIDIA CUDA; no GPU is required. The real-time factor (RTF, processing time divided by audio duration) is **0.27×** on an Apple M4 CPU (3.7× faster than real time) and **~0.03–0.05×** on Apple MPS (roughly 20–33× faster than real time).
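
The relationship between RTF and the "faster than real time" figures can be sketched as follows. This is a minimal illustration, not the benchmark script from the repository; the function names and the 10-second clip are hypothetical.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent transcribing / duration of the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def speedup(rtf: float) -> float:
    """How many times faster than real time a given RTF is."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# Hypothetical example: a 10 s clip transcribed in 2.7 s.
rtf = real_time_factor(2.7, 10.0)
print(f"RTF {rtf:.2f}, {speedup(rtf):.1f}x faster than real time")
```

An RTF below 1.0 means transcription finishes before the audio would have finished playing.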

Code and evaluation scripts: [github.com/abhayverma6300/indic-asr-conformer](https://github.com/abhayverma6300/indic-asr-conformer/)

---

## Vistaar Results

WER is computed after Devanagari-aware normalisation (dandas and punctuation stripped). LM results use a beam width of 100.

| Dataset | Domain | Greedy WER | + Hindi-5M LM |
|---|---|---|---|
| Kathbath | Read speech | 10.34% | **9.00%** |
| Kathbath Noisy | Noisy read speech | 11.86% | **10.19%** |
| FLEURS | Broadcast / read | 12.68% | **11.18%** |
| CommonVoice | Crowd-sourced read | 16.57% | **12.54%** |
| IndicTTS | TTS-derived | 9.49% | **8.55%** |
| MUCS | Conversational | 10.41% | **9.05%** |
| Gramvaani | Rural / dialectal | 27.61% | **24.09%** |
| **Average** | | **14.14%** | **12.09%** |
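
For readers who want to reproduce the metric, here is a minimal sketch of Devanagari-aware normalisation followed by word-level WER. It assumes that "dandas and punctuation" means the danda (U+0964), the double danda (U+0965), and ASCII punctuation, and that text is NFC-normalised first; the evaluation scripts in the repository may differ in these details.

```python
import re
import string
import unicodedata

# Assumed punctuation set: ASCII punctuation plus danda and double danda.
_PUNCT = re.compile("[" + re.escape(string.punctuation + "\u0964\u0965") + "]")


def normalise(text: str) -> str:
    """NFC-normalise, replace punctuation with spaces, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = _PUNCT.sub(" ", text)
    return " ".join(text.split())


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref = normalise(reference).split()
    hyp = normalise(hypothesis).split()
    if not ref:
        raise ValueError("reference is empty after normalisation")
    # Rolling-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four, so WER is 0.25.
print(wer("यह एक परीक्षण है।", "यह एक परीक्षा है"))
```

Because the danda is stripped before scoring, a hypothesis that omits sentence-final punctuation is not penalised for it.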

### Leaderboard context

| Model | Avg WER | Open weights | CPU inference |
|---|---|---|---|
| **Indic Conformer 600M + Hindi-5M LM** | **12.09%** | yes | yes |
| IndicWhisper (Whisper-medium fine-tuned) | 13.6% | yes | slow |
| Nvidia NeMo large | 18.6% | yes | no |
| Azure STT | ~20% | no | no |
| Google STT | ~24% | no | no |

Numbers for the other models are taken from the [Vistaar paper](https://arxiv.org/abs/2305.15386) (AI4Bharat, 2023).

---

## Model files

| File | Size | Description |
|---|---|---|
| `am_model.pt` | 2.4 GB | Original TorchScript AM (hard-coded CUDA device literals; for NVIDIA GPUs) |
| `am_model_cpu.pt` | 2.4 GB | Patched for CPU inference |
| `am_model_mps.pt` | 2.4 GB | Patched for Apple Silicon MPS |
| `preprocessor.pt` | ~92 KB | Log-Mel frontend |
| `lm/hindi/hi.bin` | 145 MB | 5-gram KenLM (Hindi-5M) |
| `lm/hindi/unigrams.txt` | — | 201k Hindi words for pyctcdecode |
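
Choosing the right AM file for a device can be sketched as below. The file names come from the table above; the helper names `am_path` and `load_models` are hypothetical, and passing `map_location=device` for both modules is an assumption rather than what the repository scripts necessarily do.

```python
from pathlib import Path

# Device name -> AM file, as listed in the model files table.
AM_FILES = {
    "cpu": "am_model_cpu.pt",
    "mps": "am_model_mps.pt",
    "cuda": "am_model.pt",
}


def am_path(model_dir: str, device: str) -> Path:
    """Return the path of the AM export that matches the device."""
    if device not in AM_FILES:
        raise ValueError(f"unsupported device {device!r}; expected one of {sorted(AM_FILES)}")
    return Path(model_dir) / AM_FILES[device]


def load_models(model_dir: str, device: str = "cpu"):
    """Load the preprocessor and AM as TorchScript modules (sketch)."""
    import torch  # imported lazily so am_path() works without torch installed

    preprocessor = torch.jit.load(str(Path(model_dir) / "preprocessor.pt"), map_location=device)
    am = torch.jit.load(str(am_path(model_dir, device)), map_location=device)
    return preprocessor.eval(), am.eval()


print(am_path("extracted_models_v3", "mps"))
```

The separate per-device files exist because the original export hard-codes CUDA devices, so `map_location` alone is presumably not enough to move `am_model.pt` to CPU or MPS.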

---

## Quickstart

### Install dependencies

```bash
pip install torch torchaudio pyctcdecode kenlm "huggingface_hub[cli]"
```

### CPU inference

```bash
git clone https://github.com/Abhay-Verma031/indic-asr-conformer
cd indic-asr-conformer

huggingface-cli download Abhay-Verma031/indic-conformer-600m \
    --local-dir extracted_models_v3/

python inference/cpu_infer.py \
    --audio speech.wav \
    --language hi \
    --preprocessor extracted_models_v3/preprocessor.pt \
    --am extracted_models_v3/am_model_cpu.pt \
    --lm extracted_models_v3/lm/hindi/hi.bin
```

### Apple Silicon MPS

```bash
python inference/cpu_infer.py \
    --audio speech.wav \
    --language hi \
    --preprocessor extracted_models_v3/preprocessor.pt \
    --am extracted_models_v3/am_model_mps.pt \
    --device mps \
    --lm extracted_models_v3/lm/hindi/hi.bin
```

### NVIDIA GPU

```bash
python inference/gpu_infer.py \
    --audio speech.wav \
    --language hi \
    --preprocessor extracted_models_v3/preprocessor.pt \
    --am extracted_models_v3/am_model.pt \
    --lm extracted_models_v3/lm/hindi/hi.bin
```
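
The architecture diagram in this card lists the expected input as 16 kHz mono FP32 audio. If a recording is in another format, it could be converted with `torchaudio` before inference. This is a hedged sketch: the helper `load_16k_mono` is hypothetical, and the inference scripts may already do this conversion themselves.

```python
TARGET_SAMPLE_RATE = 16000  # from the architecture diagram


def load_16k_mono(path: str):
    """Load an audio file as a float32 tensor of shape [1, samples] at 16 kHz."""
    import torch       # imported lazily; both packages are in the install step
    import torchaudio

    waveform, sample_rate = torchaudio.load(path)  # [channels, samples]
    waveform = waveform.mean(dim=0, keepdim=True)  # downmix to mono
    if sample_rate != TARGET_SAMPLE_RATE:
        waveform = torchaudio.functional.resample(
            waveform, orig_freq=sample_rate, new_freq=TARGET_SAMPLE_RATE
        )
    return waveform.to(torch.float32)
```

The returned tensor can then be saved with `torchaudio.save` and passed to the scripts above through `--audio`.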

---

## Architecture

```
AUDIO (16 kHz mono, FP32)
        │
        ▼
  asr_preprocessor      80-dim log-Mel filterbank  [B, 80, T']
        │
        ▼
      asr_am             Conformer encoder, ~600M params
                         output: CTC logprobs  [B, T', 257]
                         (256 Hindi BPE tokens + CTC blank)
        │
        ▼
    asr_decoder          pyctcdecode CTC beam search + KenLM
                         α=0.3  β=1.0  beam_width=100
        │
        ▼
    TRANSCRIPT
```
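
The greedy baseline reported in the results table skips the beam search and LM in the last stage. A minimal sketch of greedy CTC decoding over a `[T', 257]` matrix of log-probabilities is shown below. It assumes the blank is the last class (index 256), which this card does not state, and it returns token IDs rather than text because the BPE detokeniser is not described here.

```python
from itertools import groupby
from typing import List, Sequence


def ctc_greedy_decode(logprobs: Sequence[Sequence[float]], blank_id: int) -> List[int]:
    """Greedy CTC: best class per frame, merge repeats, then drop blanks."""
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in logprobs]
    collapsed = [token for token, _ in groupby(best_path)]
    return [token for token in collapsed if token != blank_id]


def one_hot(index: int, size: int = 3) -> List[float]:
    """Toy frame of log-probabilities that peaks at `index`."""
    return [0.0 if k == index else -10.0 for k in range(size)]


# Toy example with 3 classes and blank_id=2.
# The best path 0 0 2 0 1 1 2 collapses to 0 2 0 1 2, then to 0 0 1.
frames = [one_hot(i) for i in [0, 0, 2, 0, 1, 1, 2]]
print(ctc_greedy_decode(frames, blank_id=2))
```

The blank between the two `0` frames is what allows a repeated token to survive the merge step.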

The AM is a multilingual model covering all 22 scheduled Indian languages through a 5633-token multilingual BPE vocabulary. Each language uses a 256-token slice at a fixed offset; for Hindi the slice starts at offset 1536. The model is exported as TorchScript, so running the preprocessor and AM requires only `torch` and `torchaudio`; beam-search decoding with the LM additionally uses `pyctcdecode`.
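
The vocabulary layout described above can be illustrated with plain index arithmetic. This sketch assumes that the one token beyond 22 × 256 is the CTC blank and that it sits at the last index (5632); neither detail is stated in this card, and the exported AM may already apply the slice internally, since the diagram lists 257 outputs.

```python
from typing import List, Sequence

TOKENS_PER_LANGUAGE = 256
NUM_LANGUAGES = 22
VOCAB_SIZE = NUM_LANGUAGES * TOKENS_PER_LANGUAGE + 1  # 5633
HINDI_OFFSET = 1536
BLANK_ID = VOCAB_SIZE - 1  # assumption: blank is the last index


def language_slice(offset: int) -> range:
    """Indices of the 256 tokens that belong to one language."""
    return range(offset, offset + TOKENS_PER_LANGUAGE)


def select_language(frame: Sequence[float], offset: int, blank_id: int = BLANK_ID) -> List[float]:
    """Reduce one 5633-wide frame to 257 values: the language slice plus the blank."""
    return [frame[i] for i in language_slice(offset)] + [frame[blank_id]]


# Toy frame whose value at each position equals its index.
frame = list(range(VOCAB_SIZE))
hindi = select_language(frame, HINDI_OFFSET)
print(len(hindi), hindi[0], hindi[-2], hindi[-1])
```

With this layout Hindi occupies indices 1536 to 1791, which is the seventh 256-token block.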

---

## Hindi language model

The greedy CTC baseline (14.14% avg WER) is already competitive. The Hindi-5M KenLM lowers it to 12.09%, a further 2.05 pp absolute, by adding 5-gram language model scores to the candidates during CTC beam search.
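
The averages and the gain quoted above can be checked against the per-dataset numbers in the Vistaar Results table. The short script below only redoes that arithmetic and adds the relative reduction, which follows from the same figures.

```python
# Per-dataset WER (%) copied from the Vistaar Results table.
greedy = {
    "Kathbath": 10.34, "Kathbath Noisy": 11.86, "FLEURS": 12.68,
    "CommonVoice": 16.57, "IndicTTS": 9.49, "MUCS": 10.41, "Gramvaani": 27.61,
}
with_lm = {
    "Kathbath": 9.00, "Kathbath Noisy": 10.19, "FLEURS": 11.18,
    "CommonVoice": 12.54, "IndicTTS": 8.55, "MUCS": 9.05, "Gramvaani": 24.09,
}

avg_greedy = sum(greedy.values()) / len(greedy)
avg_lm = sum(with_lm.values()) / len(with_lm)
absolute_gain = avg_greedy - avg_lm              # percentage points
relative_gain = 100.0 * absolute_gain / avg_greedy  # percent of the greedy WER

print(f"greedy {avg_greedy:.2f}%, with LM {avg_lm:.2f}%")
print(f"gain {absolute_gain:.2f} pp absolute, {relative_gain:.1f}% relative")
```

The largest per-dataset gain is on CommonVoice (16.57% to 12.54%), while the gains on the other subsets are between about 1 and 3.5 points.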

| Property | Hindi-5M |
|---|---|
| Order | 5-gram |
| Binary size | 145 MB |
| Training sentences | 5,000,000 |
| Unigrams | 201,136 |
| α (LM weight) | 0.3 |
| β (word-length bonus) | 1.0 |

Training corpus: Wikipedia (hi), CC-100 (hi), CulturaX (hi), OSCAR-2301 (hi), and C4 (hi). About 5M sentences remain after deduplication and Devanagari filtering.
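
Wiring the LM into `pyctcdecode` with the settings from the table could look like the sketch below. The helper names are hypothetical, the unigram file is assumed to hold one word per line, and `labels` must be the 257 output labels of the AM in order, which this card does not list (`pyctcdecode` conventionally treats an empty-string label as the CTC blank). The inference scripts in the repository may construct the decoder differently.

```python
from typing import List

ALPHA = 0.3       # LM weight, from the table above
BETA = 1.0        # word-length bonus, from the table above
BEAM_WIDTH = 100  # beam width used for the reported results


def read_unigrams(path: str) -> List[str]:
    """Read a word list, assuming one word per line and skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def build_hindi_decoder(
    labels: List[str],
    lm_path: str = "extracted_models_v3/lm/hindi/hi.bin",
    unigram_path: str = "extracted_models_v3/lm/hindi/unigrams.txt",
):
    """Build a pyctcdecode beam-search decoder backed by the Hindi-5M KenLM."""
    from pyctcdecode import build_ctcdecoder  # imported lazily

    return build_ctcdecoder(
        labels,
        kenlm_model_path=lm_path,
        unigrams=read_unigrams(unigram_path),
        alpha=ALPHA,
        beta=BETA,
    )

# Usage, given a NumPy array `logprobs` of shape [T', 257] for one utterance:
#   decoder = build_hindi_decoder(labels)
#   text = decoder.decode(logprobs, beam_width=BEAM_WIDTH)
```

Raising α makes the decoder trust the LM more relative to the acoustic scores, while β offsets the tendency of the LM to favour shorter outputs.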

---

## Citation

```bibtex
@misc{indic-conformer-600m,
  author = {Abhay Verma},
  title  = {Indic Conformer ASR — Hindi 600M},
  year   = {2026},
  url    = {https://huggingface.co/abhayverma6300/indic-conformer-600m}
}
```