---
base_model: openai/whisper-medium
language:
- ml
license: apache-2.0
metrics:
- wer
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- whisper
- automatic-speech-recognition
- malayalam
- indic-asr
- fine-tuned
---

# Whisper Medium — Malayalam R-MFT

This model was introduced in the paper [Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition](https://huggingface.co/papers/2605.13087).

This is a fine-tuned Malayalam ASR model based on
[openai/whisper-medium](https://huggingface.co/openai/whisper-medium), trained
using the Reverse Multi-Stage Fine-Tuning (R-MFT) recipe introduced in
[Vividh-ASR: Diagnosing and Fixing Studio-Bias in Whisper for Indic Languages](https://huggingface.co/blog/adalat-ai/vividh-benchmark).

This model is part of a set of Malayalam and Hindi Whisper models released by
[Adalat AI](https://www.adalat.ai/) alongside the Vividh-ASR benchmark.

---

## Model Description

R-MFT trains in three stages with a decreasing learning rate schedule, presenting
the hardest acoustic data first during the highest-plasticity phase:

| Stage | Data | LR |
|---|---|---|
| 1 | Tier C — Spontaneous (~512.5 hrs) | 2e-4 |
| 2 | Tier B — Broadcast (~200 hrs) | 1e-4 |
| 3 | Tier A — Studio + Tier C mix (~182.2 hrs) | 1e-5 |

Training uses AdamW (weight decay 0.1), linear warmup for the first 10% of
steps, and cosine annealing to zero. The model was trained on NVIDIA H100 GPUs
using Hugging Face Transformers.
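
The schedule described above can be sketched as a small pure-Python function. This is an illustrative sketch, not the training code: it assumes the warmup and cosine decay restart within each stage, and the step count of 1000 per stage is a hypothetical value chosen only for the example. The names `STAGE_PEAK_LR` and `lr_at_step` are introduced here for illustration.

```python
import math

# Peak learning rate per R-MFT stage, taken from the table above.
STAGE_PEAK_LR = {1: 2e-4, 2: 1e-4, 3: 1e-5}


def lr_at_step(step: int, total_steps: int, peak_lr: float,
               warmup_frac: float = 0.1) -> float:
    """Linear warmup over the first `warmup_frac` of steps, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp linearly from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


if __name__ == "__main__":
    total = 1000  # hypothetical number of optimizer steps per stage
    for stage, peak in STAGE_PEAK_LR.items():
        samples = [lr_at_step(s, total, peak) for s in (0, 50, 100, 550, 1000)]
        print(f"stage {stage}:", [f"{v:.2e}" for v in samples])
```

Under these assumptions, each stage starts at zero, reaches its peak learning rate after 10% of its steps, and decays back to zero by the final step, so the highest learning rates coincide with the Tier C data in stage 1.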

---

## Benchmark Results (Vividh-ASR)

Benchmark WER is measured using [faster-whisper](https://github.com/SYSTRAN/faster-whisper)
with 7s VAD segmentation for long-form audio. See the
[blog post](https://huggingface.co/blog/adalat-ai/vividh-benchmark) for full evaluation details.

| Model | Tier A (Studio) | Tier B (Broadcast) | Tier C (Spontaneous) | Tier D (Noise) | Global |
|---|---|---|---|---|---|
| [whisper-medium-ml-high-lr](https://huggingface.co/adalat-ai/whisper-medium-ml-high-lr) | 35.04 | 30.48 | 50.30 | 50.78 | 40.85 |
| **whisper-medium-ml-rmft (This model)** | 37.56 | 31.66 | 46.10 | 45.73 | 39.64 |
| [whisper-small-ml-high-lr](https://huggingface.co/adalat-ai/whisper-small-ml-high-lr) | 39.05 | 32.50 | 54.39 | 51.08 | 43.93 |
| [whisper-small-ml-rmft](https://huggingface.co/adalat-ai/whisper-small-ml-rmft) | 40.26 | 35.05 | 53.77 | 48.04 | 44.53 |
| [IndicWhisper](https://github.com/AI4Bharat/vistaar/tree/master?tab=readme-ov-file#evaluating-asr-models) | 38.07 | 32.43 | 65.74 | 46.92 | 47.96 |
| [Vegam Whisper](https://huggingface.co/smcproject/vegam-whisper-medium-ml-int8_float16) | 38.74 | 55.10 | 58.53 | 54.46 | 53.39 |

*WER (%); lower is better. See the [Vividh-ASR benchmark](https://huggingface.co/datasets/adalat-ai/vividh-test-malayalam) for full
evaluation details.*
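
For readers unfamiliar with the metric, the following is a minimal sketch of word error rate computed as a word-level edit distance. It assumes plain whitespace tokenisation and applies no text normalisation; the normalisation and aggregation used for the benchmark numbers above are not specified in this card, so this function is not guaranteed to reproduce them. The name `word_error_rate` is introduced here for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(100 * word_error_rate("a b c d", "a x c"))
```

Multiplying the returned ratio by 100 gives a percentage on the same scale as the table.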

---

## Usage

```python
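import torch
from transformers import pipeline

# Use the GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

asr = pipeline(
    "automatic-speech-recognition",
    model="adalat-ai/whisper-medium-ml-rmft",
    chunk_length_s=30,  # process long audio in 30-second chunks
    device=device,
)

# Transcribe a local audio file and print the recognised text.
result = asr("audio.wav")
print(result["text"])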
```

> **Note:** For long-form audio, benchmark results use
> [faster-whisper](https://github.com/SYSTRAN/faster-whisper) with 7s VAD
> segmentation. For short clips, the Hugging Face pipeline above will produce
> equivalent results.

---

## Training Data

Training data is a superset of the Vividh-ASR benchmark evaluation splits.
Sources used:

| Tier | Hours | Sources |
|---|---|---|
| A (Studio) | 182.2 | [Fleurs](https://huggingface.co/datasets/google/fleurs), [IndicTTS](https://www.iitm.ac.in/donlab/indictts/database.html), [OpenSLR](https://openslr.org/63/), [IMaSC](https://huggingface.co/datasets/thennal/IMaSC) |
| B (Broadcast) | 200.0 | [Shrutilipi](https://huggingface.co/datasets/ai4bharat/Shrutilipi) |
| C (Spontaneous) | 512.5 | [IndicVoices](https://huggingface.co/datasets/ai4bharat/IndicVoices), [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) |
| **Total** | **894.7** | |

---

## Intended Use & Limitations

This model is intended as a general-purpose Malayalam ASR model optimised for
verbatim transcription accuracy across diverse acoustic conditions.

**Limitations:**
- Evaluated on Hindi and Malayalam only; generalisation to other Indic languages is untested
- Tier D evaluation uses synthetic noise profiles; performance on real-world
  degraded audio may differ

---

## Citation

If you use this model or the Vividh-ASR benchmark, please cite:

```bibtex
@misc{vividhasr2025,
  title   = {Vividh-ASR: Diagnosing and Fixing Studio-Bias in Whisper
             for Indic Languages},
  author  = {Kush Juvekar and Kavya Manohar and Kumaramanas Nethil},
  year    = {2026},
  url     = {https://huggingface.co/blog/adalat-ai/vividh-benchmark}
}
```

```bibtex
@misc{vividh2026,
      title={Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition},
      author={Kush Juvekar and Kavya Manohar and Aditya Srinivas Menon and Arghya Bhattacharya and Kumarmanas Nethil},
      year={2026},
      eprint={2605.13087},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.13087}, 
}
```

---

## Related Models and Datasets

See the [Vividh collection](https://huggingface.co/collections/adalat-ai/vividh-asr).

---

*Developed by [Adalat AI](https://www.adalat.ai/). Released under Apache 2.0.*