Whisper Medium — Hindi R-MFT
Fine-tuned Hindi ASR model based on openai/whisper-medium, trained using the Reverse Multi-Stage Fine-Tuning (R-MFT) recipe introduced in Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition.
This model is part of a set of Malayalam and Hindi Whisper models released by Adalat AI alongside the Vividh-ASR benchmark.
Model Description
R-MFT trains in three stages with a decreasing learning rate schedule, presenting the hardest acoustic data first during the highest-plasticity phase:
| Stage | Data | LR |
|---|---|---|
| 1 | Tier C — Spontaneous (~558.7 hrs) | 2e-4 |
| 2 | Tier B — Broadcast (~1359.9 hrs) | 1e-4 |
| 3 | Tier A — Studio + Tier C mix (~272.1 hrs) | 1e-5 |
Training uses AdamW (weight decay 0.1), linear warmup for the first 10% of steps, and cosine annealing to zero. Trained on NVIDIA H100 GPUs using HuggingFace Transformers.
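The per-stage schedule described above (linear warmup over the first 10% of steps, then cosine annealing to zero) can be sketched as a small function; this is an illustrative reconstruction from the recipe text, not the exact training code:

```python
import math

def rmft_lr(step: int, total_steps: int, peak_lr: float, warmup_frac: float = 0.1) -> float:
    """Linear warmup to peak_lr over the first warmup_frac of steps,
    then cosine annealing down to zero for the rest of the stage."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Peak learning rates for the three R-MFT stages (hardest data first).
stage_peak_lrs = [2e-4, 1e-4, 1e-5]
for lr in stage_peak_lrs:
    print(rmft_lr(0, 1000, lr), rmft_lr(500, 1000, lr), rmft_lr(999, 1000, lr))
```

Each stage restarts this schedule with its own peak learning rate, so plasticity is highest while the hardest (Tier C) data is presented.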
Benchmark Results (Vividh-ASR)
Benchmark WER is measured using faster-whisper with 7s VAD segmentation for long-form audio. See the blog post for full evaluation details.
| Model | Tier A (Studio) | Tier B (Broadcast) | Tier C (Spontaneous) | Tier D (Noise) | Global |
|---|---|---|---|---|---|
| whisper-medium-hi-high-lr | 13.63 | 11.33 | 18.98 | 14.05 | 15.73 |
| whisper-medium-hi-rmft (This model) | 15.82 | 10.11 | 22.71 | 17.27 | 18.14 |
| whisper-small-hi-high-lr | 16.96 | 11.05 | 23.02 | 16.77 | 18.73 |
| whisper-small-hi-rmft | 18.60 | 11.49 | 25.34 | 20.97 | 20.70 |
| indic-whisper-hi | 16.24 | 11.62 | 39.87 | 14.99 | 25.01 |
| vaani-whisper-large-v3-hindi | 12.55 | 17.61 | 28.91 | 14.52 | 21.05 |
| whisper-medium-vaani-hindi | 18.15 | 25.92 | 22.85 | 17.19 | 21.51 |
| whisper-small-vaani-hindi | 23.39 | 30.37 | 26.63 | 22.10 | 25.92 |
WER %. Lower is better. See Vividh-ASR benchmark for full evaluation details.
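For reference, WER is the word-level edit distance between hypothesis and reference, divided by the reference length. A minimal reference implementation (the benchmark itself may use a library implementation with additional text normalisation):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / max(1, len(ref))
```

For example, one substitution plus one deletion against a four-word reference gives a WER of 0.5 (reported as 50%).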
Usage
```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="adalat-ai/whisper-medium-hi-rmft",
    chunk_length_s=30,
    device="cuda",
)

result = asr("audio.wav")
print(result["text"])
```
Note: For long-form audio, the benchmark results above were obtained with faster-whisper using 7s VAD segmentation. For short clips, the HuggingFace pipeline above should produce comparable results.
Training Data
Training data is a superset of the Vividh-ASR benchmark evaluation splits. Sources used:
| Tier | Hours | Sources |
|---|---|---|
| A (Studio) | 272.1 | FLEURS, IndicTTS, Kathbath, Common Voice, MUCS |
| B (Broadcast) | 1359.9 | Shrutilipi |
| C (Spontaneous) | 558.7 | IndicVoices |
| Total | 2190.7 | |
Intended Use & Limitations
This model is intended as a general-purpose Hindi ASR model optimised for verbatim transcription accuracy across diverse acoustic conditions.
Limitations:
- Evaluated on Hindi and Malayalam only; generalisation to other Indic languages is untested
- Tier D evaluation uses synthetic noise profiles; performance on real-world degraded audio may differ
Citation
If you use this model or the Vividh-ASR benchmark, please cite:
@misc{vividhasr2025,
title = {Vividh-ASR: Diagnosing and Fixing Studio-Bias in Whisper for Indic Languages},
author = {Kush Juvekar and Kavya Manohar and Kumaramanas Nethil},
year = {2026},
url = {https://huggingface.co/blog/adalat-ai/vividh-benchmark}
}
@misc{vividh2026,
title = {Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition},
author = {Kush Juvekar and Kavya Manohar and Aditya Srinivas Menon and Arghya Bhattacharya and Kumarmanas Nethil},
year = {2026},
eprint = {2605.13087},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
url = {https://arxiv.org/abs/2605.13087}
}
Related Models and Datasets
See the Vividh collection.
Developed by Adalat AI. Released under Apache 2.0.
Model tree for adalat-ai/whisper-medium-hi-rmft
Base model
openai/whisper-medium