Whisper Medium — Hindi R-MFT
Fine-tuned Hindi ASR model based on openai/whisper-medium, trained using the Reverse Multi-Stage Fine-Tuning (R-MFT) recipe introduced in Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition.
This model is part of a set of Malayalam and Hindi Whisper models released by Adalat AI alongside the Vividh-ASR benchmark.
Model Description
R-MFT trains in three stages with a decreasing learning rate schedule, presenting the hardest acoustic data first during the highest-plasticity phase:
| Stage | Data | LR |
|---|---|---|
| 1 | Tier C — Spontaneous (~558.7 hrs) | 2e-4 |
| 2 | Tier B — Broadcast (~1359.9 hrs) | 1e-4 |
| 3 | Tier A — Studio + Tier C mix (~272.1 hrs) | 1e-5 |
Training uses AdamW (weight decay 0.1), linear warmup for the first 10% of steps, and cosine annealing to zero. Trained on NVIDIA H100 GPUs using HuggingFace Transformers.
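The per-stage schedule described above (linear warmup over the first 10% of steps, then cosine annealing to zero) can be sketched as a small function; this is an illustrative reconstruction from the recipe text, not the exact training code:

```python
import math

def rmft_lr(step: int, total_steps: int, peak_lr: float, warmup_frac: float = 0.1) -> float:
    """Linear warmup to peak_lr over the first warmup_frac of steps,
    then cosine annealing down to zero for the rest of the stage."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Peak learning rates for the three R-MFT stages (hardest data first).
stage_peak_lrs = [2e-4, 1e-4, 1e-5]
for lr in stage_peak_lrs:
    print(rmft_lr(0, 1000, lr), rmft_lr(500, 1000, lr), rmft_lr(999, 1000, lr))
```

Each stage restarts this schedule with its own peak learning rate, so plasticity is highest while the hardest (Tier C) data is presented.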
Benchmark Results (Vividh-ASR)
Benchmark WER is measured using faster-whisper with 7s VAD segmentation for long-form audio. See the blog post for full evaluation details.
| Model | Tier A (Studio) | Tier B (Broadcast) | Tier C (Spontaneous) | Tier D (Noise) | Global |
|---|---|---|---|---|---|
| whisper-medium-hi-high-lr | 13.63 | 11.33 | 18.98 | 14.05 | 15.73 |
| whisper-medium-hi-rmft (This model) | 15.82 | 10.11 | 22.71 | 17.27 | 18.14 |
| whisper-small-hi-high-lr | 16.96 | 11.05 | 23.02 | 16.77 | 18.73 |
| whisper-small-hi-rmft | 18.60 | 11.49 | 25.34 | 20.97 | 20.70 |
| indic-whisper-hi | 16.24 | 11.62 | 39.87 | 14.99 | 25.01 |
| vaani-whisper-large-v3-hindi | 12.55 | 17.61 | 28.91 | 14.52 | 21.05 |
| whisper-medium-vaani-hindi | 18.15 | 25.92 | 22.85 | 17.19 | 21.51 |
| whisper-small-vaani-hindi | 23.39 | 30.37 | 26.63 | 22.10 | 25.92 |
WER %. Lower is better. See Vividh-ASR benchmark for full evaluation details.
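For reference, WER is the word-level edit distance between hypothesis and reference, divided by the reference length. A minimal reference implementation (the benchmark itself may use a library implementation with additional text normalisation):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / max(1, len(ref))
```

For example, one substitution plus one deletion against a four-word reference gives a WER of 0.5 (reported as 50%).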
Usage
```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="adalat-ai/whisper-medium-hi-rmft",
    chunk_length_s=30,
    device="cuda",
)

result = asr("audio.wav")
print(result["text"])
```
Note: For long-form audio, the benchmark results above were obtained with faster-whisper using 7s VAD segmentation. For short clips, the HuggingFace pipeline above should produce comparable results.
Training Data
Training data is a superset of the Vividh-ASR benchmark evaluation splits. Sources used:
| Tier | Hours | Sources |
|---|---|---|
| A (Studio) | 272.1 | FLEURS, IndicTTS, Kathbath, Common Voice, MUCS |
| B (Broadcast) | 1359.9 | Shrutilipi |
| C (Spontaneous) | 558.7 | IndicVoices |
| Total | 2190.7 | |
Intended Use & Limitations
This model is intended as a general-purpose Hindi ASR model optimised for verbatim transcription accuracy across diverse acoustic conditions.
Limitations:
- Evaluated on Hindi and Malayalam only; generalisation to other Indic languages is untested
- Tier D evaluation uses synthetic noise profiles; performance on real-world degraded audio may differ
Citation
If you use this model or the Vividh-ASR benchmark, please cite:
@misc{vividhasr2025,
title = {Vividh-ASR: Diagnosing and Fixing Studio-Bias in Whisper for Indic Languages},
author = {Kush Juvekar and Kavya Manohar and Kumaramanas Nethil},
year = {2026},
url = {https://huggingface.co/blog/adalat-ai/vividh-benchmark}
}
@misc{vividh2026,
title = {Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition},
author = {Kush Juvekar and Kavya Manohar and Aditya Srinivas Menon and Arghya Bhattacharya and Kumarmanas Nethil},
year = {2026},
eprint = {2605.13087},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
url = {https://arxiv.org/abs/2605.13087}
}
Related Models and Datasets
See the Vividh collection.
Developed by Adalat AI. Released under Apache 2.0.
Model tree for adalat-ai/whisper-medium-hi-rmft
Base model
openai/whisper-medium