equal-ai/whisper-transliterate

A fine-tuned version of Whisper large-v3 for Hinglish speech recognition. See the full release on Hugging Face.


What This Model Does

Most ASR systems treat Hinglish as a problem to be corrected — collapsing it into formal Hindi or transliterated English. This model takes the opposite stance: it accepts Hinglish as a valid, first-class output format and transcribes speech the way people actually talk.

Key capabilities:

  • Native Hinglish Output — Produces transcriptions that reflect real conversational speech patterns rather than standardized Hindi or English, cutting down on structural mismatches in downstream use.
  • Whisper-Compatible Architecture — Drops into any pipeline that already uses OpenAI Whisper, with full support from the 🤗 Transformers library.
  • Reliable in Noisy Conditions — Trained heavily on degraded, real-environment audio; produces silence rather than hallucinated output when no speech is present.
  • Reduced Hallucination Rate — Explicit design choices during training suppress phantom transcriptions, keeping the model grounded when audio quality drops.
  • ~39% Average WER Reduction — across the three public benchmarks reported below, this model consistently outperforms base Whisper large-v3 by a substantial margin.

Training Details

Dataset

  • ~550 hours of Hindi audio sourced from real Indian environments — homes, streets, offices — with genuine background noise throughout.
  • Entirely proprietary — no public Hinglish ASR corpus matched the acoustic diversity needed, so a dataset was built from the ground up.
  • Hybrid labeling pipeline — a state-of-the-art model produced first-pass transcriptions, then human reviewers corrected errors and handled edge cases, ensuring label quality without full manual transcription cost.
  • Noise-forward collection strategy — rather than filtering for clean audio, data collection deliberately targeted noisy recordings to match deployment conditions.
  • Minimal preprocessing — audio clips were segmented to under 30 seconds with at most two speakers per clip; beyond that, the original acoustic character was preserved without normalization or filtering.
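The segmentation constraint above can be sketched in a few lines. This assumes 16 kHz mono audio (the input rate Whisper models expect) and naive fixed-window boundaries; the card does not describe how clip boundaries were actually chosen, so the windowing strategy here is an assumption.

```python
import numpy as np

SAMPLE_RATE = 16_000      # Whisper models expect 16 kHz input
MAX_CLIP_SECONDS = 30     # matches the <30 s constraint described above

def segment(audio: np.ndarray, max_seconds: int = MAX_CLIP_SECONDS) -> list:
    """Split a mono waveform into consecutive clips of at most max_seconds.

    Illustrative only: fixed windows are an assumption; the card does not
    say how clip boundaries were chosen in the actual pipeline.
    """
    window = max_seconds * SAMPLE_RATE
    return [audio[i:i + window] for i in range(0, len(audio), window)]

# A 70-second waveform splits into 30 s + 30 s + 10 s clips.
clips = segment(np.zeros(70 * SAMPLE_RATE, dtype=np.float32))
print([len(c) / SAMPLE_RATE for c in clips])
```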

Fine-tuning Approach

  • Custom training loop — a purpose-built trainer with detailed callbacks provided granular visibility into training dynamics at each stage.
  • Dynamic layer freezing — activation patterns were profiled on a representative slice of the training data to identify the most task-sensitive layers; only those layers remained trainable, reducing compute without sacrificing convergence quality.
  • DeepSpeed-backed training — enabled memory-efficient, high-throughput runs across the full dataset.
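The layer-freezing step can be approximated with a small helper that freezes everything except parameters whose names match a chosen set of layers. The concrete layer selection is hypothetical: the card derives it from activation profiling but does not publish which layers were kept trainable.

```python
import torch.nn as nn

def freeze_except(model: nn.Module, trainable_substrings) -> int:
    """Freeze every parameter whose name contains none of the substrings.

    Returns the count of parameters left trainable. Which layers to keep
    trainable is an assumption here; the card identifies them via
    activation profiling but does not list them.
    """
    n_trainable = 0
    for name, param in model.named_parameters():
        param.requires_grad = any(s in name for s in trainable_substrings)
        n_trainable += int(param.requires_grad)
    return n_trainable

# For a Whisper checkpoint this might look like (layer indices hypothetical):
#   freeze_except(model, ("model.decoder.layers.30", "model.decoder.layers.31"))
```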

How to Use

Install the Transformers library first:

pip install -U transformers

Then load and run the model using the pipeline API:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Device and precision setup
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "equal-ai/whisper-transliterate"

# Load model
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True
)
model.to(device)

# Load processor
processor = AutoProcessor.from_pretrained(model_id)

# Build pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={
        "task": "transcribe",
        "language": "en"
    }
)

# Run inference
result = pipe("sample.wav")
print(result["text"])

Evaluation Results

Quantitative Benchmarks

WER Benchmarks (Public Datasets)

WER scores compare Hinglish output from this model against unmodified Whisper large-v3 on the same audio. Lower is better.

| Dataset | Whisper Large V3 | equal-ai/whisper-transliterate | Relative Improvement |
|---|---|---|---|
| Common Voice | 60.87 | 31.32 | ~48.6% |
| FLEURS | 51.66 | 29.37 | ~43.2% |
| Indic-Voices | 84.74 | 57.26 | ~32.4% |
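The relative-improvement column is just the fractional WER reduction, and can be checked from the WER pairs above (last-digit differences come down to rounding):

```python
# Relative WER improvement: (baseline - finetuned) / baseline.
# WER pairs below are the public-benchmark figures reported in this card.
def relative_improvement(baseline_wer: float, finetuned_wer: float) -> float:
    """Fractional reduction in WER relative to the baseline model."""
    return (baseline_wer - finetuned_wer) / baseline_wer

benchmarks = {
    "Common Voice": (60.87, 31.32),
    "FLEURS": (51.66, 29.37),
    "Indic-Voices": (84.74, 57.26),
}

for name, (base, tuned) in benchmarks.items():
    print(f"{name}: {relative_improvement(base, tuned):.1%}")
```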

Internal Evaluation vs. Sarvam ASR (34 Sessions)

To benchmark against a strong India-focused ASR baseline, we evaluated both models on 34 real-world sessions using two scoring dimensions:

  • Exact Score — strict token-level match between predicted and reference transcription
  • Semantic Score — embedding-based similarity that captures meaning even when surface wording differs

| Model | Exact Score (↑) | Semantic Score (↑) |
|---|---|---|
| Sarvam ASR | 0.7189 | 0.7829 |
| equal-ai/whisper-transliterate | 0.6219 | 0.7247 |

Scores are averages across all 34 held-out sessions. The fine-tuned model performs competitively on semantic similarity (~7.4% gap), suggesting that even when surface transcriptions differ, the underlying meaning is largely preserved. The larger gap on exact matching reflects Hinglish's inherent orthographic variability — the same spoken phrase can be written multiple valid ways — which strict token matching penalizes disproportionately.
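The orthographic-variability point is easy to see with a toy exact-match metric. The function below is a simplified stand-in; the card does not specify the alignment or normalization its actual Exact Score uses, so this position-wise comparison is an assumption.

```python
def exact_score(reference: str, prediction: str) -> float:
    """Position-wise token match rate; a simplified stand-in for Exact Score."""
    ref, hyp = reference.split(), prediction.split()
    if not ref and not hyp:
        return 1.0
    matches = sum(r == h for r, h in zip(ref, hyp))
    return matches / max(len(ref), len(hyp))

# Two valid romanizations of the same spoken phrase: the meaning is
# identical, but strict token matching credits only 2 of 3 tokens.
print(exact_score("mujhe coffee chahiye", "mujhe kofi chahiye"))
```

A semantic metric built on sentence embeddings would score such pairs as near-identical, which is consistent with the smaller semantic-score gap reported above.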

Per-session breakdown (34 sessions)
| # | Session ID | Sarvam Exact | Sarvam Semantic | Finetuned Exact | Finetuned Semantic |
|---|---|---|---|---|---|
| 1 | a2322970-150c-4c0c-9bb2-27f714da5df0 | 0.805 | 0.75 | 0.796 | 0.98 |
| 2 | 06fc6bff-82ad-4faa-8dfd-037c92a85ab0 | 0.881 | 0.99 | 0.958 | 0.96 |
| 3 | 34b53a39-b1ad-48cc-812f-746c95c458ab | 0.636 | 0.78 | 0.457 | 0.88 |
| 4 | 4f52fa89-b7ef-4efe-974a-9bac60f8b6d5 | 0.573 | 0.7 | 0.551 | 0.35 |
| 5 | fca11fc4-666c-46a3-acdb-7856004630c4 | 0.741 | 0.92 | 0.95 | 0.87 |
| 6 | 8e55b4cf-e3de-4d65-b4a8-65bf811c6320 | 1.0 | 1.0 | 0.794 | 0.85 |
| 7 | 0d9bf5e9-b82c-444b-bdc5-9590368549ca | 0.887 | 0.65 | 0.894 | 0.93 |
| 8 | 99bdf060-958c-4dd3-934d-c21b588ce80c | 0.864 | 0.92 | 0.922 | 0.97 |
| 9 | 9e0bdf6f-4cc1-40c5-b103-06f7df7c4037 | 0.782 | 1.0 | 0.163 | 0.5 |
| 10 | 8bb896d8-fde3-4cbb-ad44-70a6271a660d | 0.933 | 0.9 | 0.819 | 0.88 |
| 11 | 3745a1af-eabf-4739-88c5-ae4d11d55669 | 0.013 | 0.05 | 0.027 | 0.05 |
| 12 | 2cd26f08-341f-466a-96c6-e0dad456b58f | 0.641 | 0.65 | 0.581 | 0.6 |
| 13 | b3ae8227-228e-4a01-b3ba-1603adc93946 | 0.637 | 0.5 | 0.574 | 0.78 |
| 14 | b91d2bcd-9687-42d6-bc7a-43cd59c0a159 | 0.687 | 0.58 | 0.276 | 0.5 |
| 15 | 8094d493-90a7-48fa-aaa2-191da2be3e58 | 0.795 | 0.99 | 0.656 | 0.95 |
| 16 | cf3f1cf0-bae8-411a-b3fa-b129a00559f9 | 0.817 | 0.96 | 0.695 | 0.71 |
| 17 | 10c30325-e2bf-4573-92c2-37e7c44a51bd | 0.826 | 0.92 | 0.782 | 0.87 |
| 18 | 28e2ea96-bccf-4a0d-ada6-234a9afd4f03 | 0.814 | 0.92 | 0.901 | 0.88 |
| 19 | 50ccee2b-ece6-41f9-b69f-7c8a170a0a95 | 0.616 | 0.45 | 0.544 | 0.62 |
| 20 | d24ea561-1d55-4a3e-aa3a-38dd1c6a4e73 | 0.894 | 1.0 | 0.828 | 0.99 |
| 21 | 0cd9f6cb-50da-43c7-8854-369b94a4c52e | 0.692 | 0.5 | 0.584 | 0.5 |
| 22 | 06be5e3e-8dd1-4f5b-97a2-7b88fa0b0161 | 0.814 | 0.85 | 0.745 | 0.82 |
| 23 | ca7ff7c3-461d-4d06-8489-31158d1b3395 | 0.818 | 0.7 | 0.601 | 0.9 |
| 24 | f2c1e117-1ec4-4053-a307-dbe83ff0d33e | 0.15 | 0.75 | 0.018 | 0.85 |
| 25 | 35df3f7d-0b6b-4cab-b632-46b5f159a314 | 0.948 | 0.93 | 0.626 | 0.85 |
| 26 | 7bb34ab9-1110-45a6-9041-d43f8f2823a1 | 0.007 | 0.05 | 0.012 | 0.05 |
| 27 | 054085aa-cec5-4e0d-b822-865e9e0a9fa1 | 0.814 | 0.87 | 0.604 | 0.9 |
| 28 | 59cea2b9-46f0-4017-93fa-8d786a14d7f9 | 0.947 | 0.99 | 0.921 | 0.99 |
| 29 | e735fbd1-0cb1-4bf6-9ec9-296047d84347 | 0.826 | 0.97 | 0.977 | 0.98 |
| 30 | 1335f05f-a6de-44d5-ba19-ff2825c6b064 | 0.975 | 0.99 | 0.458 | 0.8 |
| 31 | f6387191-af5a-4e78-8168-d41d300e0d08 | 0.474 | 0.88 | 0.201 | 0.2 |
| 32 | d7f0bc64-3ea3-4fcc-8010-7ccd1f974ef1 | 0.771 | 0.68 | 0.842 | 0.65 |
| 33 | c0ea0733-465b-49f3-9181-e5f11dae9648 | 0.546 | 0.88 | 0.771 | 0.58 |
| 34 | 58251587-4b4f-49a3-a93b-559df5148527 | 0.818 | 0.95 | 0.616 | 0.45 |
| Avg | | 0.7189 | 0.7829 | 0.6219 | 0.7247 |

Qualitative Observations

  • On conversational Hinglish with heavy code-switching, the model maintains coherent output where Whisper large-v3 tends to drift toward full Hindi or full English.
  • In high-noise scenarios (street audio, overlapping background speech), this model suppresses output rather than producing confident but incorrect transcriptions.
  • Multi-speaker clips (≤2 speakers) are handled without explicit diarization, though accuracy drops as overlap increases.

Additional Notes

  • This model does not perform speaker diarization.
  • Audio inputs should be under 30 seconds for best results; the pipeline handles longer files via chunking.
  • Language code should be set to "en" in generate_kwargs regardless of the Hindi content — this is intentional and reflects the Hinglish output format.
  • For production deployments with strict latency requirements, quantization via bitsandbytes or ctranslate2 is recommended.
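For longer recordings, the chunking mentioned above can be requested at call time. This fragment reuses the `pipe` object built in the usage section; the filename is hypothetical.

```python
# Long-form audio: the pipeline splits the file into 30-second windows
# internally and stitches the chunk transcriptions back together.
result = pipe(
    "long_recording.wav",   # hypothetical path
    chunk_length_s=30,
    batch_size=8,           # decode several chunks in parallel
)
print(result["text"])
```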