| base_model: openai/whisper-medium | |
| language: | |
| - ml | |
| license: apache-2.0 | |
| metrics: | |
| - wer | |
| library_name: transformers | |
| pipeline_tag: automatic-speech-recognition | |
| tags: | |
| - whisper | |
| - automatic-speech-recognition | |
| - malayalam | |
| - indic-asr | |
| - fine-tuned | |
| # Whisper Medium — Malayalam R-MFT | |
| This model was introduced in the paper [Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition](https://huggingface.co/papers/2605.13087). | |
| Fine-tuned Malayalam ASR model based on | |
| [openai/whisper-medium](https://huggingface.co/openai/whisper-medium), trained | |
| using the Reverse Multi-Stage Fine-Tuning (R-MFT) recipe introduced in | |
| [Vividh-ASR: Diagnosing and Fixing Studio-Bias in Whisper for Indic Languages](https://huggingface.co/blog/adalat-ai/vividh-benchmark). | |
| This model is part of a set of Malayalam and Hindi Whisper models released by | |
| [Adalat AI](https://www.adalat.ai/) alongside the Vividh-ASR benchmark. | |
| --- | |
| ## Model Description | |
| R-MFT trains in three stages with a decreasing learning rate schedule, presenting | |
| the hardest acoustic data first during the highest-plasticity phase: | |
| | Stage | Data | LR | | |
| |---|---|---| | |
| | 1 | Tier C — Spontaneous (~512.5 hrs) | 2e-4 | | |
| | 2 | Tier B — Broadcast (~200 hrs) | 1e-4 | | |
| | 3 | Tier A — Studio + Tier C mix (~182.2 hrs) | 1e-5 | | |
| Each stage uses AdamW (weight decay 0.1) with linear warmup over the first 10% of | |
| steps, followed by cosine annealing to zero. Models were trained on NVIDIA H100 | |
| GPUs using Hugging Face Transformers. | |
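The warmup and decay described above can be sketched in plain Python. This is a minimal illustration, not the released training code: it assumes the schedule restarts at the beginning of every stage (the card does not say whether warmup is per stage or global), and `STAGE_PEAK_LR`, `lr_at_step`, and the step count are hypothetical names and values chosen for the example.

```python
import math

# Peak learning rates per R-MFT stage, taken from the stage table above.
STAGE_PEAK_LR = {1: 2e-4, 2: 1e-4, 3: 1e-5}


def lr_at_step(step: int, total_steps: int, peak_lr: float,
               warmup_frac: float = 0.10) -> float:
    """Linear warmup to peak_lr, then cosine annealing to zero.

    Assumption: the schedule restarts at the beginning of every stage.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


if __name__ == "__main__":
    total = 1000  # hypothetical number of optimizer steps in one stage
    for stage, peak in STAGE_PEAK_LR.items():
        samples = [lr_at_step(s, total, peak) for s in (0, 99, 550, 999)]
        print(stage, [f"{lr:.2e}" for lr in samples])
```

In practice the same shape is usually obtained from the scheduler utilities of the training framework; the function here only makes the stage-wise peak and decay explicit.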
| --- | |
| ## Benchmark Results (Vividh-ASR) | |
| Benchmark WER is measured using [faster-whisper](https://github.com/SYSTRAN/faster-whisper) | |
| with 7s VAD segmentation for long-form audio. See the | |
| [blog post](https://huggingface.co/blog/adalat-ai/vividh-benchmark) for full evaluation details. | |
| | Model | Tier A (Studio) | Tier B (Broadcast) | Tier C (Spontaneous) | Tier D (Noise) | Global | | |
| |---|---|---|---|---|---| | |
| | [whisper-medium-ml-high-lr](https://huggingface.co/adalat-ai/whisper-medium-ml-high-lr) | 35.04 | 30.48 | 50.30 | 50.78 | 40.85 | | |
| | **whisper-medium-ml-rmft (This model)** | 37.56 | 31.66 | 46.10 | 45.73 | 39.64 | | |
| | [whisper-small-ml-high-lr](https://huggingface.co/adalat-ai/whisper-small-ml-high-lr) | 39.05 | 32.50 | 54.39 | 51.08 | 43.93 | | |
| | [whisper-small-ml-rmft](https://huggingface.co/adalat-ai/whisper-small-ml-rmft) | 40.26 | 35.05 | 53.77 | 48.04 | 44.53 | | |
| | [IndicWhisper](https://github.com/AI4Bharat/vistaar/tree/master?tab=readme-ov-file#evaluating-asr-models) | 38.07 | 32.43 | 65.74 | 46.92 | 47.96 | | |
| | [Vegam Whisper](https://huggingface.co/smcproject/vegam-whisper-medium-ml-int8_float16) | 38.74 | 55.10 | 58.53 | 54.46 | 53.39 | | |
| *WER %. Lower is better. See [Vividh-ASR benchmark](https://huggingface.co/datasets/adalat-ai/vividh-test-malayalam) for full | |
| evaluation details.* | |
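For readers unfamiliar with the metric, word error rate can be sketched as a word-level edit distance divided by the reference length. The function below is a minimal illustration, assuming plain whitespace tokenisation and no text normalisation; the benchmark's actual normalisation and aggregation are described in the linked evaluation details, and `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference. Corpus-level scores are normally computed from total errors over total reference words rather than by averaging per-utterance values.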
| --- | |
| ## Usage | |
| ```python | |
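| import torch | |
| from transformers import pipeline | |
| # Use the GPU when one is available, otherwise fall back to CPU. | |
| device = "cuda" if torch.cuda.is_available() else "cpu" | |
| asr = pipeline( | |
|     "automatic-speech-recognition", | |
|     model="adalat-ai/whisper-medium-ml-rmft", | |
|     chunk_length_s=30,  # process long audio in 30-second windows | |
|     device=device, | |
| ) | |
| # "audio.wav" is a placeholder path to a local audio file. | |
| result = asr("audio.wav") | |
| print(result["text"]) | |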
| ``` | |
| > **Note:** For long-form audio, benchmark results use | |
| > [faster-whisper](https://github.com/SYSTRAN/faster-whisper) with 7s VAD | |
| > segmentation. For short clips, the Hugging Face pipeline above will produce | |
| > equivalent results. | |
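A long-form setup along the lines of the note above might look like the following sketch. Several things here are assumptions rather than the benchmark configuration: that "7s VAD segmentation" corresponds to capping speech segments with the `max_speech_duration_s` VAD option, that the checkpoint has first been converted to CTranslate2 format (for example with the `ct2-transformers-converter` tool), and the names `vad_transcribe_kwargs`, `transcribe_long_form`, and `ct2_model_dir`, which are hypothetical.

```python
def vad_transcribe_kwargs(max_segment_s: float = 7.0) -> dict:
    """Keyword arguments for WhisperModel.transcribe with VAD enabled.

    Assumption: the 7s segmentation is expressed as a cap on the length of
    each detected speech segment; the exact benchmark settings may differ.
    """
    return {
        "language": "ml",
        "vad_filter": True,
        "vad_parameters": {"max_speech_duration_s": max_segment_s},
    }


def transcribe_long_form(audio_path: str, ct2_model_dir: str) -> str:
    """Transcribe one file with a CTranslate2-converted copy of the model."""
    # Imported lazily so the helper above works without faster-whisper installed.
    from faster_whisper import WhisperModel

    # ct2_model_dir is a hypothetical local directory holding the converted model.
    model = WhisperModel(ct2_model_dir, device="auto", compute_type="default")
    segments, _info = model.transcribe(audio_path, **vad_transcribe_kwargs())
    # segments is a generator; decoding happens as it is consumed.
    return " ".join(segment.text.strip() for segment in segments)
```

Keeping the VAD settings in a separate function makes it easy to compare a few segment lengths on the same recording before settling on one.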
| --- | |
| ## Training Data | |
| Training data is a superset of the Vividh-ASR benchmark evaluation splits. | |
| Sources used: | |
| | Tier | Hours | Sources | | |
| |---|---|---| | |
| | A (Studio) | 182.2 | [FLEURS](https://huggingface.co/datasets/google/fleurs), [IndicTTS](https://www.iitm.ac.in/donlab/indictts/database.html), [OpenSLR](https://openslr.org/63/), [IMaSC](https://huggingface.co/datasets/thennal/IMaSC) | |
| | B (Broadcast) | 200.0 | [Shrutilipi](https://huggingface.co/datasets/ai4bharat/Shrutilipi) | | |
| | C (Spontaneous) | 512.5 | [IndicVoices](https://huggingface.co/datasets/ai4bharat/IndicVoices), [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | | |
| | **Total** | **894.7** | | | |
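The totals in the table can be checked with a few lines of Python, which also show how the training mix is weighted toward spontaneous speech. The hours are copied from the table; `TIER_HOURS` is just an illustrative name.

```python
# Hours per tier, copied from the training data table above.
TIER_HOURS = {
    "A (Studio)": 182.2,
    "B (Broadcast)": 200.0,
    "C (Spontaneous)": 512.5,
}

total_hours = sum(TIER_HOURS.values())
# Share of the full training set contributed by each tier, in percent.
shares = {tier: 100.0 * hours / total_hours for tier, hours in TIER_HOURS.items()}

for tier, hours in TIER_HOURS.items():
    print(f"{tier}: {hours:.1f} h ({shares[tier]:.1f}% of {total_hours:.1f} h)")
```

Tier C alone accounts for more than half of the training hours, which is consistent with the emphasis R-MFT places on spontaneous speech.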
| --- | |
| ## Intended Use & Limitations | |
| This model is intended as a general-purpose Malayalam ASR model optimised for | |
| verbatim transcription accuracy across diverse acoustic conditions. | |
| **Limitations:** | |
| - Evaluated on Hindi and Malayalam only; generalisation to other Indic languages is untested | |
| - Tier D evaluation uses synthetic noise profiles; performance on real-world | |
| degraded audio may differ | |
| --- | |
| ## Citation | |
| If you use this model or the Vividh-ASR benchmark, please cite: | |
| ```bibtex | |
| @misc{vividhasr2025, | |
| title = {Vividh-ASR: Diagnosing and Fixing Studio-Bias in Whisper | |
| for Indic Languages}, | |
| author = {Kush Juvekar and Kavya Manohar and Kumarmanas Nethil}, | |
| year = {2026}, | |
| url = {https://huggingface.co/blog/adalat-ai/vividh-benchmark} | |
| } | |
| ``` | |
| ```bibtex | |
| @misc{vividh2026, | |
| title={Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition}, | |
| author={Kush Juvekar and Kavya Manohar and Aditya Srinivas Menon and Arghya Bhattacharya and Kumarmanas Nethil}, | |
| year={2026}, | |
| eprint={2605.13087}, | |
| archivePrefix={arXiv}, | |
| primaryClass={cs.CL}, | |
| url={https://arxiv.org/abs/2605.13087}, | |
| } | |
| ``` | |
| --- | |
| ## Related Models and Datasets | |
| See the [Vividh collection](https://huggingface.co/collections/adalat-ai/vividh-asr). | |
| --- | |
| *Developed by [Adalat AI](https://www.adalat.ai/). Released under Apache 2.0.* |