Vividh-ASR: Diagnosing and Fixing Studio-Bias in Whisper for Indic Languages

Community Article · Published May 15, 2026
  • Existing ASR models for Indic languages are biased toward studio and broadcast recordings and degrade on spontaneous speech. We address this on two fronts: Vividh-ASR, a benchmark that stratifies evaluation by acoustic complexity across four tiers, and a Whisper fine-tuning recipe that systematically improves robustness across all of them.
  • Most of the performance gain comes from one change: fine-tuning Whisper with a high learning rate (2e-4). This alone consistently outperforms existing public Hindi and Malayalam ASR models across all acoustic conditions.
  • Training on easier examples first — the standard curriculum approach — provides no benefit and often hurts. Training on harder conditions first helps for Malayalam, producing further gains on spontaneous and noisy speech; for Hindi, the high learning rate alone is already sufficient.
  • Together, these findings enable a 244M parameter Whisper model to outperform publicly available models up to six times its size on overall WER, without any architectural changes or proprietary data.

At Adalat AI, speech-to-text is not a peripheral feature — it is the foundation of everything we build. Our platform serves the Indian judiciary, and building production-grade ASR for that environment involves two distinct challenges. The first is acoustic: judges and lawyers speak spontaneously and work in environments that bear no resemblance to a recording studio. The second is operational: our inference pipeline must serve large numbers of concurrent courtroom users efficiently and reliably at scale, a systems problem we address separately in our work on Scalable Offline ASR for Command-Style Dictation in Courtrooms.

Our production models layer legal vocabulary and domain adaptation on top of a general acoustic foundation. But that foundation has to be strong first — a model that degrades on spontaneous, varied-condition audio will fail regardless of what you fine-tune on top of it. The models and recipe we are sharing here represent that foundation.

Whisper is a natural starting point for this work. Its encoder is trained on hundreds of thousands of hours of diverse audio, giving it acoustic representations that transfer remarkably well even to unseen conditions. But out of the box, base Whisper fails catastrophically on Indic languages — zero-shot WER routinely exceeds 100% for Malayalam and sometimes for Hindi, a consequence of both script mismatch and the near-absence of Indic language data in its pretraining mix. Fine-tuning closes this gap substantially, and the community has made real progress here. The question we set out to answer is not whether fine-tuning helps — it clearly does — but how much is achievable, and how best to do it without trading robustness on one condition for gains on another.
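To make the zero-shot failure concrete, here is a minimal sketch of how to reproduce it with HuggingFace Transformers on the public FLEURS Malayalam split. This is illustrative only — not our evaluation pipeline — and the dataset choice, sample size, and decoding settings are arbitrary choices for the sketch.

```python
# Minimal zero-shot sanity check of base Whisper on Malayalam read speech.
# Illustrative only, not our evaluation pipeline; sample size and decoding
# settings are arbitrary.
import torch
import jiwer
from datasets import load_dataset, Audio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_id = "openai/whisper-medium"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).eval()

# Force Malayalam transcription so language auto-detection or translation
# does not mask the underlying failure.
forced_ids = processor.get_decoder_prompt_ids(language="malayalam", task="transcribe")

ds = load_dataset("google/fleurs", "ml_in", split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

refs, hyps = [], []
for ex in ds.select(range(50)):  # small sample, enough to see the failure
    feats = processor(ex["audio"]["array"], sampling_rate=16_000,
                      return_tensors="pt").input_features
    with torch.no_grad():
        ids = model.generate(feats, forced_decoder_ids=forced_ids)
    hyps.append(processor.batch_decode(ids, skip_special_tokens=True)[0])
    refs.append(ex["transcription"])

# Insertion-heavy hypotheses can push WER above 1.0, i.e. above 100%.
print(f"zero-shot WER: {100 * jiwer.wer(refs, hyps):.1f}%")
```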

The standard answers — conservative learning rates to protect the pre-trained prior, easy-to-hard curriculum ordering, fine-tuning on whatever new data is available — turn out to be the wrong prescription. This post explains what we found instead.

We are releasing:

  • The Vividh-ASR benchmark: open-source evaluation sets for Hindi and Malayalam, stratified into four acoustic-complexity tiers.
  • Fine-tuned Whisper models for both languages, in High LR and R-MFT variants at Small (244M) and Medium (769M) scales.
  • The training recipe behind those models, described in the rest of this post.

The Problem: Studio-Bias in Indic ASR

Every Indic ASR practitioner has encountered this: a model that performs beautifully on clean read speech but falls apart the moment it hears natural, spontaneous conversation. We call this studio-bias — and it is not a data problem alone.

It is a persistent pattern across the ecosystem. Models fine-tuned predominantly on studio-recorded, read speech — regardless of scale or training data volume — tend to generalize well to clean audio while degrading sharply on spontaneous, crowdsourced, or noisy conditions. As we show in our benchmark results, this holds even for large 1.5B parameter models trained on thousands of hours of Hindi. The gap between studio performance and real-world performance is wide, and closing it requires more than simply adding more data.

We developed our recipe through Malayalam experiments first, then validated it on Hindi. What we found challenges two standard assumptions about how to fine-tune Whisper for low-resource languages — and the results generalize across model scales, languages, and acoustic conditions.

Introducing the Vividh-ASR Benchmark

Existing benchmarks for Indic ASR largely organize evaluation by domain — medical, legal, and so on. We organize ours by acoustic complexity, which is what actually determines where models fail in deployment.

Vividh-ASR is built entirely from open-source datasets, stratified into four tiers. The tier definitions are identical across languages; the specific datasets within each tier vary where necessary to reflect what is available per language.

| Tier | Category | Datasets | Mal. Duration | Hindi Duration |
|------|----------|----------|---------------|----------------|
| Tier A | Studio, Read | Fleurs, IndicTTS, Kathbath†, OpenSLRᴹ, CommonVoiceᴴ, MUCSᴴ | 7.27 hrs | 12.50 hrs |
| Tier B | Broadcast, Fast | Shrutilipi | 6.33 hrs | 6.82 hrs |
| Tier C | Spontaneous, Crowdsourced | IndicVoices, CommonVoiceᴹ | 9.62 hrs | 14.64 hrs |
| Tier D | Synthetic Noise | Kathbath Noisy | 3.08 hrs | 3.00 hrs |
| Total | | | 26.31 hrs | 36.96 hrs |

ᴹ Malayalam only · ᴴ Hindi only · unmarked datasets appear in both languages

† Tier A uses clean Kathbath splits; Tier D uses noise-augmented splits of the same corpus. These are strictly non-overlapping.

Tier C is intentionally the largest subset — it represents the hardest and most practically important conditions, and it is where the studio-bias problem manifests most clearly. Tier D is evaluation-only, serving as a zero-shot stress test for acoustic generalization that no model sees during training.

A note on dataset placement across languages: CommonVoice sits in Tier C for Malayalam, where its crowdsourced variability makes it acoustically harder than studio recordings. For Hindi, it belongs to Tier A — Hindi CommonVoice recordings tend to be more structured and closer in character to read speech, placing it more naturally alongside other studio-condition data. This reflects a genuine difference in corpus character rather than an inconsistency in tier design.
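As a rough sketch of how tier-stratified scoring can be wired up (not the exact benchmark harness), the snippet below computes per-tier WER with jiwer and a duration-weighted global average using the Malayalam tier durations from the table above. Whether the benchmark's global weighted average is weighted by duration or by reference word count is an assumption here.

```python
# Sketch of complexity-stratified scoring in the spirit of Vividh-ASR.
# Assumption: the global weighted average is weighted by tier duration; the
# benchmark's exact weighting (e.g. by reference word count) may differ.
import jiwer

# Malayalam tier durations (hours), taken from the table above.
TIER_HOURS_ML = {"A": 7.27, "B": 6.33, "C": 9.62, "D": 3.08}

def stratified_wer(refs_by_tier, hyps_by_tier, tier_hours):
    """refs_by_tier / hyps_by_tier: dict mapping tier -> list of transcripts."""
    per_tier = {t: 100 * jiwer.wer(refs_by_tier[t], hyps_by_tier[t])
                for t in tier_hours}
    total_hours = sum(tier_hours.values())
    weighted = sum(per_tier[t] * tier_hours[t] for t in tier_hours) / total_hours
    return per_tier, weighted

# per_tier, global_avg = stratified_wer(refs, hyps, TIER_HOURS_ML)
# Reporting per_tier next to global_avg is what exposes studio-bias: a model
# can improve the weighted average while quietly regressing on Tier B or D.
```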

Why Standard Fine-tuning Fails

We began where most practitioners do: evaluating IndicWhisper (2023) on our benchmark and then fine-tuning it further on new data. After the original IndicWhisper release, AI4Bharat published IndicVoices (2024) — a large-scale, continuously updated spontaneous speech dataset covering all Indic languages, exactly the kind of real-world, unscripted audio that IndicWhisper had not been trained on. Fine-tuning on this new data was the natural next step. We explored this path in Malayalam first and later validated it on Hindi; the details follow below.

| Malayalam Models (WER %) | Tier A | Tier B | Tier C | Tier D | Global Weighted Avg |
|--------------------------|--------|--------|--------|--------|---------------------|
| IndicWhisper-ml | 38.07 | 32.43 | 65.74 | 46.92 | 47.96 |
| IndicWhisper-FT-TierC — High LR | 35.87 | 43.12 | 41.57 | 51.36 | 42.26 |
| IndicWhisper-FT-TierC — Low LR | 37.77 | 36.97 | 55.39 | 53.75 | 46.46 |

The results confirmed what many practitioners have experienced: fine-tuning on spontaneous data improves Tier C, but degrades everywhere else. The global average can look better overall, creating the illusion of an improved model when in practice you have traded robustness on broadcast quality and noisy conditions for gains on a single tier. This is a crucial and underappreciated failure mode — if you benchmark only on spontaneous data, you will not see the regression.

Fine-tuning OpenAI Whisper

Fine-tuning IndicWhisper had hit a wall: any gain on Tier C came at the cost of every other tier, and no learning rate choice resolved the trade-off. This raised a more fundamental question — was the problem the starting point itself? IndicWhisper had been initialized from Whisper and then fine-tuned predominantly on clean data, potentially inheriting and deepening a studio-biased prior. We decided to go back to the base Whisper checkpoint and fine-tune on all available data at once — all tiers mixed, no curriculum, a single training run — asking whether the issue was what the model had already learned rather than what we were adding to it. We ran this at two learning rates more than an order of magnitude apart.

High Learning Rate Breaks the Pre-trained Prior

The loss curves reveal the answer immediately. The conservative low-LR schedule (1e-5) plateaus within the first few thousand steps at a loss value an order of magnitude higher than the high-LR run. It gets stuck — the pre-trained decoder carries a strong linguistic prior shaped by English and high-resource languages, creating a deep narrow basin that small gradient updates simply cannot escape. A high learning rate gives the decoder the energy it needs to break out.

Training loss: High LR (2e-4) vs Low LR (1e-5) for Malayalam Whisper-Medium. The low-LR run plateaus prematurely and never recovers.

The WER results confirm what the loss curves suggest:

| Malayalam Models (Single-Stage WER %) | Tier A | Tier B | Tier C | Tier D | Global Weighted Avg |
|---------------------------------------|--------|--------|--------|--------|---------------------|
| IndicWhisper-ml | 38.07 | 32.43 | 65.74 | 46.92 | 47.96 |
| Whisper-Medium FT — Low LR (1e-5) | 63.82 | 78.64 | 82.37 | 82.17 | 77.56 |
| Whisper-Medium FT — High LR (2e-4) | 35.04 | 30.48 | 50.30 | 50.78 | 40.38 |

The high-LR model is dramatically better than the low-LR run on every tier and already surpasses IndicWhisper on the global average. But look closely at Tier C: even with a high LR, single-stage training leaves spontaneous speech WER at 50.30 and noisy speech at 50.78 — still a long way from where we need it. This raised a follow-up question: if high LR is the necessary ingredient, does it also matter when the model first encounters the hardest data — when its plasticity is at its peak?

Curriculum Learning on Top: Reverse Multi-Stage Fine-Tuning (R-MFT)

With the learning rate insight established, we asked a secondary question: given that high-LR training is essential, does the order in which you present data types matter?

Standard curriculum learning says easy-to-hard: start with clean studio data and work toward spontaneous, noisy speech. We tested the reverse. If the model is going to be in its highest-plasticity phase regardless, why not force it to tackle the hardest acoustic problems first?

[Figure: curriculum comparison — Standard MFT (easy-to-hard) vs R-MFT (hard-to-easy)]

Both recipes use a decreasing LR schedule — and that direction is not arbitrary. We ran the same experiments with an increasing schedule (1e-5 → 1e-4 → 2e-4), applying peak plasticity last. The final global weighted WER for both curriculum orderings worsened by 12+ points, with the hard-to-easy increasing variant starting catastrophically at 130 WER before partially recovering. High plasticity must come first — delayed high-energy updates cannot undo a conservative initialization.


WER across training stages for Malayalam Whisper-Medium. Both recipes start from the same zero-shot Whisper baseline (>100% WER, off-chart). R-MFT Stage 1 immediately drives Tier C from >100% → 44.4 while Standard MFT's Stage 1 on clean data barely moves it (63.7). Standard MFT's Stage 2 then worsens Tier C to 82 before Stage 3 rescues it — confirming the consolidation stages are doing real recovery work, not just incremental refinement. The pattern is consistent at Small scale across both languages.

The third stage is a consolidation stage: mixing clean studio data back in at a very low learning rate lets the model recover any studio precision lost during Stage 2, without overwriting the spontaneous speech robustness built in Stage 1.

The intermediate stage results are themselves instructive. After R-MFT Stage 1 on spontaneous data alone, Tier C already reaches 44.4 — while Tier A and Tier B regress sharply, the model has already captured the hardest acoustic conditions during its highest-plasticity phase. Standard MFT tells a different story: Stage 1 on clean data brings Tier C to 63.7, but Stage 2 on broadcast data then worsens it dramatically to 81.97 before Stage 3 finally recovers it to 50.9. The subsequent stages in both recipes are doing real recovery work — not just incremental refinement — which is precisely why the consolidation design matters.
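For readers who want to see the shape of the recipe in code, here is a sketch of the R-MFT loop under the optimizer settings described in the training-setup note below. The post specifies spontaneous data first at high LR, a decreasing per-stage learning rate schedule, and a final consolidation mix of clean and spontaneous data at a very low LR; the intermediate stage, the exact per-stage learning rates, the batch size, and the `build_dataset`/`speech_collator` helpers are illustrative assumptions, not our exact configuration.

```python
# Sketch of the R-MFT (hard-to-easy) schedule. Per-stage data mixes and LRs
# below are illustrative assumptions consistent with the description above.
from transformers import (Seq2SeqTrainer, Seq2SeqTrainingArguments,
                          WhisperForConditionalGeneration)

STAGES = [
    # (name, data mix, peak learning rate)
    ("stage1_spontaneous", "tier_c_spontaneous",                2e-4),  # hardest audio first
    ("stage2_broadcast",   "tier_b_broadcast",                  1e-4),  # assumed middle stage
    ("stage3_consolidate", "tier_a_clean + tier_c_spontaneous", 1e-5),  # recover studio precision
]

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium")

for name, data_key, peak_lr in STAGES:
    args = Seq2SeqTrainingArguments(
        output_dir=f"checkpoints/{name}",
        learning_rate=peak_lr,
        warmup_ratio=0.10,               # linear warmup over the first 10% of steps
        lr_scheduler_type="cosine",      # cosine annealing to zero from the peak
        weight_decay=0.1,                # AdamW weight decay
        per_device_train_batch_size=32,  # assumption
        bf16=True,                       # assumption (H100s support bf16)
        num_train_epochs=1,              # assumption
    )
    trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=build_dataset(data_key),  # hypothetical data helper
        data_collator=speech_collator,          # hypothetical collator
    )
    trainer.train()
    model = trainer.model  # each stage continues from the previous one's weights
```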

For efficient data handling at this scale, we used the WebDataset format for I/O — packing preprocessed log-Mel features and token IDs into tar shards. This approach, described in detail by Collabora in their Whisper-Hindi v2 post, eliminates millions of small file reads during training runs. The training data draws on a superset of the corpora represented in the benchmark — for Malayalam we additionally include IMASC, Festvox-IIITH, and other publicly available corpora; Hindi training similarly draws on additional open-source sources beyond those in the benchmark tiers. All models are trained on NVIDIA H100 GPUs using HuggingFace Transformers, with AdamW (weight decay 0.1), linear warmup for the first 10% of steps, and cosine annealing to zero from the peak LR. Tier D is strictly held out from all training and validation splits.
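The following is a minimal sketch of that WebDataset packing and loading pattern. The shard layout, key names, and shuffle buffer sizes are assumptions for illustration, not the exact format of our training shards.

```python
# Minimal sketch of the WebDataset packing/loading pattern described above.
# Shard layout, key names, and buffer sizes are assumptions.
import numpy as np
import webdataset as wds

# Packing: write precomputed log-Mel features and token IDs into a tar shard,
# so training reads are sequential instead of millions of small file opens.
with wds.TarWriter("shards/train-000000.tar") as sink:
    for i, (mel, tokens) in enumerate(preprocessed_examples):  # hypothetical iterable
        sink.write({
            "__key__": f"utt{i:08d}",
            "mel.npy": np.asarray(mel, dtype=np.float32),      # log-Mel frames
            "tokens.npy": np.asarray(tokens, dtype=np.int64),  # Whisper token IDs
        })

# Loading: stream shards with shuffling at both the shard and sample level.
dataset = (
    wds.WebDataset("shards/train-{000000..000099}.tar", shardshuffle=True)
    .shuffle(1000)                       # in-memory shuffle buffer
    .decode()                            # .npy payloads -> numpy arrays
    .to_tuple("mel.npy", "tokens.npy")
)
```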

Validating on Hindi

We developed the R-MFT recipe entirely through Malayalam experimentation. Before running the Hindi experiments, our expectation was clear: the same pattern should hold. High LR early is mandatory — the pre-trained basin argument applies to any low-resource Indic language regardless of script or phonology. We were less certain about the curriculum ordering. Malayalam's complex phonotactics and prosodic variability made it a strong candidate for benefiting from the hard-to-easy approach, but Hindi, with a larger and more broadcast-heavy training corpus, might not need that extra push.

The results were partly expected and partly surprising. High LR proved just as critical for Hindi as for Malayalam — conservative initialization failed consistently, confirming that finding is not Malayalam-specific. The surprise was how little the curriculum ordering mattered: R-MFT and Standard MFT converged to nearly identical global weighted WER for Hindi, while for Malayalam R-MFT held a meaningful edge. More unexpectedly, single-stage high LR — the simplest possible recipe — turned out to be the outright winner for Hindi at both model scales, a result we did not anticipate from the Malayalam experiments. Whether this reflects the differences in phonotactic complexity between the two language families, the composition of the training data, or both, remains an open question worth investigating across more Indic languages.


WER across training stages for Hindi Whisper-Medium. Both recipes start from the same zero-shot baseline (~ 100 WER, off-chart). R-MFT Stage 1 immediately reaches 18.3 on Tier C — the best value either recipe achieves at any stage — before Stage 2 temporarily worsens it to 29.6 and Stage 3 recovers to 23.9. Standard MFT descends more gradually before both recipes converge to nearly identical final global WER (~ 18%), confirming that for Hindi, high LR early is the dominant ingredient and curriculum ordering adds minimal further gain.

We repeated the full set of experiments — same architecture, same curriculum structure, same learning rate schedules — so the comparison is clean. Results follow below.

Results

Having established the recipe through Malayalam and validated it on Hindi, we now present the full benchmark results across both languages, alongside external models for context. The heatmaps below show WER across all four Vividh-ASR tiers — each row is a model, each column a tier, greener is better.

Malayalam Results

[Heatmap: Malayalam WER by model and Vividh-ASR tier — greener is better]

Hindi Results

[Heatmap: Hindi WER by model and Vividh-ASR tier — greener is better]

Across both languages, our models substantially outperform the public Indic baselines. For Malayalam, R-MFT Medium reaches 39.64 global weighted WER — a 17% relative reduction over IndicWhisper (47.96) and a wide margin over Vegam Whisper (53.39). For Hindi, single-stage high LR Medium lands at 15.73 — beating the 1.5B Vaani Large-v3 (21.05) by 25% relative and IndicWhisper (25.01) by 37% relative. In both languages, our 244M Small models outperform the 769M IndicWhisper, which is the result that matters most where compute is the binding constraint.

The single largest hyperparameter effect we observed is the learning rate. The low-LR baselines collapse: Hindi Medium degrades from 15.73 to 24.40 global, and Hindi Small from 18.73 to 38.74 — more than doubling the WER. High LR is non-negotiable; every other choice in the pipeline is fine-tuning on top of that.

The curriculum story is more interesting and turns out to be language-dependent. For Malayalam, the standard easy-to-hard curriculum (Standard MFT) actually regresses against single-stage high LR (42.66 vs 40.85). Reversing the direction (R-MFT, hard-to-easy) flips this and produces our best Malayalam result at 39.64, with the gap concentrated where it matters most for deployment — Tier C Spontaneous (46.10 vs 50.91 for Standard MFT) and Tier D Noise (45.73 vs 51.51). For Malayalam, the curriculum direction genuinely matters: a 3-point global spread separates Standard MFT from R-MFT. For Hindi, that effect disappears. Standard MFT and R-MFT converge to nearly identical global WER (18.07 and 18.14), and both fall short of the 15.73 that single-stage High LR achieves on its own. Multi-stage fine-tuning offers no benefit for Hindi regardless of curriculum direction.

So the takeaway is layered. Universally: high LR is mandatory and our recipes beat the public Indic baselines at every scale we tested. Language-specific: for Malayalam, R-MFT is the best recipe, and the hard-to-easy direction is what makes it work — beating both Standard MFT and single-stage High LR. For Hindi, single-stage High LR is the best recipe, and the multi-stage curriculum converges to a similar (slightly worse) point regardless of direction. We don't yet have a clean account of why the curriculum direction matters in Malayalam but not in Hindi — candidates include data volume, language-specific acoustic difficulty, or that Hindi's single-stage training already finds a basin that staged exposure can't improve on.

To our knowledge, these are the strongest open-source Whisper-based results for Hindi and Malayalam across a complexity-stratified benchmark of this scope.

Takeaways for Practitioners

If you are fine-tuning Whisper for a low-resource Indic language, the conventional playbook — conservative learning rates, easy-to-hard curriculum — is likely leaving significant performance on the table. Here is what our experiments suggest:

  1. Use a high learning rate — without exception. A low LR will trap your model in the pre-trained basin.

  2. The conventional easy-to-hard curriculum is not a safe default. In both languages we tested, standard easy-to-hard multi-stage fine-tuning regressed against single-stage high LR. If you are going to use a curriculum, reverse the direction: hard-to-easy matched or beat easy-to-hard in every configuration we ran, and for Malayalam it surpasses single-stage training entirely.

  3. That said, single-stage High LR may be your ceiling. For Hindi, it outperformed the full three-stage R-MFT recipe even with properly stratified training data. If you cannot cleanly segment your data by acoustic complexity, single-stage training with a high LR captures the majority of the gains. Treat the curriculum as an optional refinement.

  4. If you do go multi-stage, consolidate at the end. A final stage that blends clean and spontaneous data recovers orthographic precision without sacrificing the acoustic robustness built up in earlier stages.

  5. Benchmark across complexity tiers, not just global weighted WER. A single aggregate metric hides regressions. A model that looks better on the global weighted average may have quietly degraded on spontaneous or noisy conditions — which are typically the conditions your deployment actually faces.

On the models we are releasing: we provide both High LR and R-MFT variants for Hindi and Malayalam at Medium (769M) and Small (244M) scales. R-MFT Medium is the strongest overall model for Malayalam; High LR Medium leads for Hindi. That said, the right choice depends on your deployment conditions — if your audio is primarily clean read speech, High LR Medium performs well on Tier A across both languages. Small models are included for deployments where compute efficiency or inference latency is a constraint; they trail the Medium models by a modest margin while substantially outperforming prior open-source baselines.

Standing on the Shoulders of Giants

None of this work would exist without the researchers and communities who built and shared the datasets and tools we depend on. Building robust Indic ASR is a collective project, and we are grateful to be part of that ecosystem.

  • Datasets: AI4Bharat for IndicVoices, Shrutilipi, Kathbath, and the Vistaar benchmark; IIT Madras SpringLab; Mozilla Common Voice; Google FLEURS; and all contributors to IMASC, OpenSLR, IndicTTS, Festvox, and Common Voice.
  • Models and prior work: The IndicWhisper team at AI4Bharat, whose work this builds directly on. OpenAI for Whisper.
  • Tooling and infrastructure: The HuggingFace ecosystem and the broader open-source ML community whose libraries, tutorials, and shared experiments made our own possible.

To everyone who annotated audio, wrote documentation, published findings openly, or answered a forum question that saved someone three days — this is for you too.

Limitations

This work has a few honest boundaries worth stating. The recipe was developed and validated on two languages — Malayalam and Hindi — representing the Dravidian and Indo-Aryan families respectively. Whether these findings generalize to other Indic language families, or to languages with significantly less spontaneous training data available, remains to be tested. Tier D, our acoustic robustness measure, uses synthetic noise profiles rather than real-world degradation; it is a useful proxy but not a substitute for evaluation on in-the-wild audio. The curriculum ordering in R-MFT also assumes you can stratify your training data by acoustic complexity — for languages where most available data is studio-recorded read speech, the hard-first stage has little material to work with, and single-stage high LR is the more practical starting point. Finally, all experiments use the Whisper architecture. We see extending Vividh-ASR to additional languages and architectures as the most important next step.

Additionally, our experiments rely on full fine-tuning of all parameters; whether selective encoder freezing could achieve comparable results at lower compute cost, or whether these findings transfer to parameter-efficient methods such as LoRA, remains untested.

Additional Notes

A note on our evaluation setup
All results are reported using our production-like inference pipeline rather than the default HuggingFace Transformers inference. One way we achieve scalable concurrent serving is by managing audio context windows rather than always passing Whisper the full 30-second context it was trained on. There is an additional reason this matters for Indic languages specifically: Whisper's output is limited to 448 tokens per segment, and a 30-second window of spontaneous Indic speech routinely exceeds this limit in ways that English does not, due to the higher token density of Indic scripts. The default HuggingFace implementation can produce abrupt truncations in these cases; our pipeline handles arbitrarily long audio without this failure mode, at the cost of slightly reduced context per segment. Where audio fits comfortably within the token limit, HuggingFace inference may report marginally better numbers — but for realistic deployment conditions, our numbers are the more honest measure.
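As a generic illustration — explicitly not our production pipeline — a windowed decoding loop along the following lines avoids the 448-token truncation by keeping each decoded segment well under the limit. The 20-second window and the naive concatenation of window outputs are assumptions made for this sketch; a real pipeline would also handle overlaps and segment boundaries more carefully.

```python
# Generic windowed decoding sketch (not our production pipeline). Transcribing
# windows shorter than Whisper's 30 s training context keeps each decoded
# segment well below the 448-token cap. Window length and naive concatenation
# are assumptions for this sketch.
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

SR = 16_000
WINDOW_SECONDS = 20.0  # assumption: trade a little context for truncation safety

processor = WhisperProcessor.from_pretrained("openai/whisper-medium")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium").eval()
forced_ids = processor.get_decoder_prompt_ids(language="hindi", task="transcribe")

def transcribe_long(audio):
    """audio: 1-D float array sampled at 16 kHz, arbitrary length."""
    step = int(WINDOW_SECONDS * SR)
    pieces = []
    for start in range(0, len(audio), step):
        window = audio[start:start + step]
        feats = processor(window, sampling_rate=SR,
                          return_tensors="pt").input_features
        with torch.no_grad():
            ids = model.generate(feats, forced_decoder_ids=forced_ids)
        pieces.append(processor.batch_decode(ids, skip_special_tokens=True)[0])
    return " ".join(pieces)
```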
A note on WER in Indic ASR
Word Error Rate means something very different here than in English benchmarks. For English clean read speech, sub-2% WER is routine. For Indic languages, error rates are substantially higher even for the best available models — the Vistaar benchmark puts this in perspective, showing that across Indian languages, even strong commercial and open-source systems struggle on spontaneous and noisy conditions, with numbers that would be considered unacceptable in English ASR. This reflects a combination of genuine acoustic difficulty, script complexity, and the fact that many Indic words have multiple valid orthographic forms that a single ground-truth transcript cannot fully capture. Our focus here is exclusively on the acoustic robustness gap — the orthographic ambiguity problem is a separate and ongoing challenge for the field.
