	---
	base_model: openai/whisper-medium
	language:
	- ml
	license: apache-2.0
	metrics:
	- wer
	library_name: transformers
	pipeline_tag: automatic-speech-recognition
	tags:
	- whisper
	- automatic-speech-recognition
	- malayalam
	- indic-asr
	- fine-tuned
	---

	# Whisper Medium — Malayalam R-MFT

	This model was introduced in the paper [Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition](https://huggingface.co/papers/2605.13087).

	Fine-tuned Malayalam ASR model based on
	[openai/whisper-medium](https://huggingface.co/openai/whisper-medium), trained
	using the Reverse Multi-Stage Fine-Tuning (R-MFT) recipe introduced in
	[Vividh-ASR: Diagnosing and Fixing Studio-Bias in Whisper for Indic Languages](https://huggingface.co/blog/adalat-ai/vividh-benchmark).

	This model is part of a set of Malayalam and Hindi Whisper models released by
	[Adalat AI](https://www.adalat.ai/) alongside the Vividh-ASR benchmark.

	---

	## Model Description

	R-MFT trains in three stages with a decreasing learning rate schedule, presenting
	the hardest acoustic data first during the highest-plasticity phase:

	| Stage | Data | LR |
	|---|---|---|
	| 1 | Tier C: Spontaneous (~512.5 hrs) | 2e-4 |
	| 2 | Tier B: Broadcast (~200 hrs) | 1e-4 |
	| 3 | Tier A: Studio + Tier C mix (~182.2 hrs) | 1e-5 |

	Training uses AdamW (weight decay 0.1) with linear warmup over the first 10% of
	steps, followed by cosine annealing to zero. The model was trained on NVIDIA
	H100 GPUs with Hugging Face Transformers.
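
The warmup and decay described above can be sketched as a small function. This is a minimal illustration, not the training code: `lr_at_step`, `STAGE_PEAK_LR`, and the step count are hypothetical, and it assumes the learning rates in the table are per-stage peaks and that the schedule restarts in each stage, neither of which is stated explicitly here.

```python
import math

# Assumed per-stage peak learning rates, copied from the stage table.
STAGE_PEAK_LR = {1: 2e-4, 2: 1e-4, 3: 1e-5}


def lr_at_step(step, total_steps, peak_lr, warmup_frac=0.10):
    """Linear warmup over the first `warmup_frac` of steps, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp linearly from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical step count, for illustration only.
total_steps = 1000
for stage, peak in STAGE_PEAK_LR.items():
    samples = [lr_at_step(s, total_steps, peak) for s in (0, 100, 550, 1000)]
    print(stage, [f"{v:.2e}" for v in samples])
```

Transformers ships `get_cosine_schedule_with_warmup`, which produces the same shape for a PyTorch optimizer; whether it was used for this model is not stated.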

	---

	## Benchmark Results (Vividh-ASR)

	Benchmark WER is measured using [faster-whisper](https://github.com/SYSTRAN/faster-whisper)
	with 7s VAD segmentation for long-form audio. See the
	[blog post](https://huggingface.co/blog/adalat-ai/vividh-benchmark) for full evaluation details.

	| Model | Tier A (Studio) | Tier B (Broadcast) | Tier C (Spontaneous) | Tier D (Noise) | Global |
	|---|---|---|---|---|---|
	| [whisper-medium-ml-high-lr](https://huggingface.co/adalat-ai/whisper-medium-ml-high-lr) | 35.04 | 30.48 | 50.30 | 50.78 | 40.85 |
	| whisper-medium-ml-rmft (This model) | 37.56 | 31.66 | 46.10 | 45.73 | 39.64 |
	| [whisper-small-ml-high-lr](https://huggingface.co/adalat-ai/whisper-small-ml-high-lr) | 39.05 | 32.50 | 54.39 | 51.08 | 43.93 |
	| [whisper-small-ml-rmft](https://huggingface.co/adalat-ai/whisper-small-ml-rmft) | 40.26 | 35.05 | 53.77 | 48.04 | 44.53 |
	| [IndicWhisper](https://github.com/AI4Bharat/vistaar/tree/master?tab=readme-ov-file#evaluating-asr-models) | 38.07 | 32.43 | 65.74 | 46.92 | 47.96 |
	| [Vegam Whisper](https://huggingface.co/smcproject/vegam-whisper-medium-ml-int8_float16) | 38.74 | 55.10 | 58.53 | 54.46 | 53.39 |

	*WER %. Lower is better. See [Vividh-ASR benchmark](https://huggingface.co/datasets/adalat-ai/vividh-test-malayalam) for full
	evaluation details.*
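
For readers unfamiliar with the metric, WER is the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below is illustrative only: `word_error_rate` is a hypothetical helper, and it assumes plain whitespace tokenisation with no text normalisation, which may differ from the normalisation and aggregation used for the table above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent for a single reference/hypothesis pair."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Placeholder tokens: one substitution (b -> x) and one deletion (d).
print(word_error_rate("a b c d", "a x c"))
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many inserted words.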

	---

	## Usage

	```python
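	from transformers import pipeline

	# Load the fine-tuned checkpoint; use device="cpu" if no GPU is available.
	asr = pipeline(
	    "automatic-speech-recognition",
	    model="adalat-ai/whisper-medium-ml-rmft",
	    chunk_length_s=30,  # process long audio in 30-second chunks
	    device="cuda",
	)

	# Transcribe a local audio file and print the recognised text.
	result = asr("audio.wav")
	print(result["text"])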
	```

	> Note: For long-form audio, benchmark results use
	> [faster-whisper](https://github.com/SYSTRAN/faster-whisper) with 7s VAD
	> segmentation. For short clips, the HuggingFace pipeline above will produce
	> equivalent results.
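
A long-form setup along the lines of the note above might look like the following sketch. Several things here are assumptions rather than the benchmark's actual configuration: the checkpoint must first be converted to CTranslate2 format (for example with the `ct2-transformers-converter` tool), `whisper-medium-ml-rmft-ct2` is a hypothetical local output directory, and mapping "7s VAD segmentation" to `max_speech_duration_s=7` is a guess at the intended setting.

```python
# Assumed VAD setting: cap each detected speech segment at 7 seconds.
VAD_PARAMETERS = {"max_speech_duration_s": 7}


def transcribe_long_form(audio_path, model_dir="whisper-medium-ml-rmft-ct2"):
    """Transcribe a long recording with faster-whisper and VAD filtering."""
    # Imported lazily so this module loads even without faster-whisper installed.
    from faster_whisper import WhisperModel

    # `model_dir` is a hypothetical path to a CTranslate2 conversion of the model.
    model = WhisperModel(model_dir, device="cuda", compute_type="float16")
    segments, _info = model.transcribe(
        audio_path,
        language="ml",          # Malayalam
        vad_filter=True,        # drop non-speech regions before decoding
        vad_parameters=VAD_PARAMETERS,
    )
    # `segments` is a generator; decoding happens as it is consumed.
    return " ".join(segment.text.strip() for segment in segments)
```

Calling `transcribe_long_form("audio.wav")` returns the concatenated transcript as a single string.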

	---

	## Training Data

	Training data is a superset of the Vividh-ASR benchmark evaluation splits.
	Sources used:

	| Tier | Hours | Sources |
	|---|---|---|
	| A (Studio) | 182.2 | [Fleurs](https://huggingface.co/datasets/google/fleurs), [IndicTTS](https://www.iitm.ac.in/donlab/indictts/database.html), [OpenSLR](https://openslr.org/63/), [IMASC](https://huggingface.co/datasets/thennal/IMaSC) |
	| B (Broadcast) | 200.0 | [Shrutilipi](https://huggingface.co/datasets/ai4bharat/Shrutilipi) |
	| C (Spontaneous) | 512.5 | [IndicVoices](https://huggingface.co/datasets/ai4bharat/IndicVoices), [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) |
	| Total | 894.7 | |

	---

	## Intended Use & Limitations

	This model is intended as a general-purpose Malayalam ASR model optimised for
	verbatim transcription accuracy across diverse acoustic conditions.

	Limitations:
	- The R-MFT recipe has been evaluated on Hindi and Malayalam only; generalisation to other Indic languages is untested.
	- Tier D evaluation uses synthetic noise profiles; performance on real-world
	  degraded audio may differ.

	---

	## Citation

	If you use this model or the Vividh-ASR benchmark, please cite:

	```bibtex
	@misc{vividhasr2025,
	title = {Vividh-ASR: Diagnosing and Fixing Studio-Bias in Whisper
	for Indic Languages},
	author = {Kush Juvekar and Kavya Manohar and Kumarmanas Nethil},
	year = {2026},
	url = {https://huggingface.co/blog/adalat-ai/vividh-benchmark}
	}
	```

	```bibtex
	@misc{vividh2026,
	title={Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition},
	author={Kush Juvekar and Kavya Manohar and Aditya Srinivas Menon and Arghya Bhattacharya and Kumarmanas Nethil},
	year={2026},
	eprint={2605.13087},
	archivePrefix={arXiv},
	primaryClass={cs.CL},
	url={https://arxiv.org/abs/2605.13087},
	}
	```

	---

	## Related Models and Datasets

	See the [Vividh collection](https://huggingface.co/collections/adalat-ai/vividh-asr).

	---

	Developed by [Adalat AI](https://www.adalat.ai/). Released under Apache 2.0.