equal-ai/whisper-transliterate

A fine-tuned version of openai/whisper-large-v3 for Hinglish speech recognition; see the full release on Hugging Face.
What This Model Does
Most ASR systems treat Hinglish as a problem to be corrected — collapsing it into formal Hindi or transliterated English. This model takes the opposite stance: it accepts Hinglish as a valid, first-class output format and transcribes speech the way people actually talk.
Key capabilities:
- Native Hinglish Output — Produces transcriptions that reflect real conversational speech patterns rather than standardized Hindi or English, cutting down on structural mismatches in downstream use.
- Whisper-Compatible Architecture — Drops into any pipeline that already uses OpenAI Whisper, with full support from the 🤗 Transformers library.
- Reliable in Noisy Conditions — Trained heavily on degraded, real-environment audio; produces silence rather than hallucinated output when no speech is present.
- Reduced Hallucination Rate — Explicit design choices during training suppress phantom transcriptions, keeping the model grounded when audio quality drops.
- ~39% WER Improvement — Averaged across all standard benchmarks, this model consistently outperforms base Whisper large-v3 by a substantial margin.
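The silence-over-hallucination behaviour can also be reinforced downstream. As a minimal sketch (not part of the model itself), a caller might gate transcription on clip energy; the function names and the threshold value here are illustrative:

```python
# Hypothetical downstream guard mirroring the model's silence-over-hallucination
# behaviour: skip transcription when a clip's RMS energy is below a threshold.
# The threshold value is illustrative, not taken from the model card.

def rms(samples):
    """Root-mean-square energy of float samples in [-1, 1]."""
    if not samples:
        return 0.0
    return (sum(s * s for s in samples) / len(samples)) ** 0.5

def should_transcribe(samples, threshold=0.01):
    """Return True only when the clip carries enough energy to plausibly contain speech."""
    return rms(samples) >= threshold

# Near-silent clip vs. a clip with audible content
silence = [0.001] * 16000
speech_like = [0.2, -0.3, 0.25, -0.2] * 4000

print(should_transcribe(silence))      # low energy: skip transcription
print(should_transcribe(speech_like))  # enough energy: transcribe
```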
Training Details
Dataset
- ~550 hours of Hindi audio sourced from real Indian environments — homes, streets, offices — with genuine background noise throughout.
- Entirely proprietary — no public Hinglish ASR corpus matched the acoustic diversity needed, so a dataset was built from the ground up.
- Hybrid labeling pipeline — a state-of-the-art model produced first-pass transcriptions, then human reviewers corrected errors and handled edge cases, ensuring label quality without full manual transcription cost.
- Noise-forward collection strategy — rather than filtering for clean audio, data collection deliberately targeted noisy recordings to match deployment conditions.
- Minimal preprocessing — audio clips were segmented to under 30 seconds with at most two speakers per clip; beyond that, the original acoustic character was preserved without normalization or filtering.
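The under-30-second segmentation step above reduces to simple boundary arithmetic over sample indices. A sketch, with function and parameter names that are stand-ins rather than the actual pipeline's:

```python
# Sketch of the <=30 s segmentation described above: given a clip's length in
# samples, compute (start, end) sample ranges of at most 30 s each.

def segment_bounds(n_samples, sample_rate=16000, max_seconds=30):
    """Yield (start, end) sample indices for consecutive chunks of <= max_seconds."""
    chunk = max_seconds * sample_rate
    return [(start, min(start + chunk, n_samples))
            for start in range(0, n_samples, chunk)]

# A 70 s clip at 16 kHz splits into 30 s + 30 s + 10 s
bounds = segment_bounds(70 * 16000)
print(bounds)  # [(0, 480000), (480000, 960000), (960000, 1120000)]
```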
Fine-tuning Approach
- Custom training loop — a purpose-built trainer with detailed callbacks provided granular visibility into training dynamics at each stage.
- Dynamic layer freezing — activation patterns were profiled on a representative slice of the training data to identify the most task-sensitive layers; only those layers remained trainable, reducing compute without sacrificing convergence quality.
- DeepSpeed-backed training — enabled memory-efficient, high-throughput runs across the full dataset.
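The layer-freezing step can be illustrated in plain PyTorch. Which layers count as "task-sensitive" comes from the activation-profiling pass, so the selection below is a stand-in, and the toy model replaces the real Whisper checkpoint:

```python
import torch.nn as nn

def freeze_except(model, trainable_names):
    """Freeze every parameter except those whose name contains one of the
    given substrings (stand-ins for the profiled task-sensitive layers)."""
    for name, param in model.named_parameters():
        param.requires_grad = any(key in name for key in trainable_names)

# Tiny stand-in model; in practice this would be the Whisper checkpoint.
model = nn.Sequential(
    nn.Linear(8, 8),   # layer "0": imagine a low-level acoustic layer (frozen)
    nn.Linear(8, 8),   # layer "1": imagine a task-sensitive layer (trainable)
)
freeze_except(model, trainable_names=["1."])

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['1.weight', '1.bias']
```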
How to Use
Install the Transformers library first:
```shell
pip install -U transformers
```
Then load and run the model using the pipeline API:
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Device and precision setup
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "equal-ai/whisper-transliterate"

# Load model
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)
model.to(device)

# Load processor
processor = AutoProcessor.from_pretrained(model_id)

# Build pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={
        "task": "transcribe",
        "language": "en",
    },
)

# Run inference
result = pipe("sample.wav")
print(result["text"])
```
Evaluation Results
Quantitative Benchmarks
WER Benchmarks (Public Datasets)
WER scores compare Hinglish output from this model against unmodified Whisper large-v3 on the same audio. Lower is better.
| Dataset | Whisper Large V3 | equal-ai/whisper-transliterate | Relative Improvement |
|---|---|---|---|
| Common Voice | 60.87 | 31.32 | ~48.6% |
| FLEURS | 51.66 | 29.37 | ~43.2% |
| Indic-Voices | 84.74 | 57.26 | ~32.4% |
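The relative-improvement column follows directly from the two WER columns, computed as (baseline − fine-tuned) / baseline; a quick check (small rounding differences from the table's approximate figures are expected):

```python
# Reproduce the "Relative Improvement" column from the WER table above.
benchmarks = {
    "Common Voice": (60.87, 31.32),
    "FLEURS": (51.66, 29.37),
    "Indic-Voices": (84.74, 57.26),
}

for name, (base, finetuned) in benchmarks.items():
    improvement = 100 * (base - finetuned) / base
    print(f"{name}: {improvement:.1f}%")
```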
Internal Evaluation vs. Sarvam ASR (34 Sessions)
To benchmark against a strong India-focused ASR baseline, we evaluated both models on 34 real-world sessions using two scoring dimensions:
- Exact Score — strict token-level match between predicted and reference transcription
- Semantic Score — embedding-based similarity that captures meaning even when surface wording differs
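The actual scoring code is not published, so the following token-level similarity is an assumed approximation of the Exact Score, built on Python's `difflib`; it also shows why Hinglish spelling variants are penalized under strict matching:

```python
from difflib import SequenceMatcher

def exact_score(predicted, reference):
    """Token-level similarity in [0, 1]; a stand-in for the card's Exact Score,
    whose precise implementation is not published."""
    pred_tokens = predicted.lower().split()
    ref_tokens = reference.lower().split()
    return SequenceMatcher(None, pred_tokens, ref_tokens).ratio()

# Orthographic variation ("naam" vs "nam") lowers the score even though the
# spoken phrase is identical, which strict token matching penalizes.
print(exact_score("mera naam Rahul hai", "mera nam Rahul hai"))  # 0.75
```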
| Model | Exact Score (↑) | Semantic Score (↑) |
|---|---|---|
| Sarvam ASR | 0.7189 | 0.7829 |
| equal-ai/whisper-transliterate | 0.6219 | 0.7247 |
Scores are averages across all 34 held-out sessions. The fine-tuned model performs competitively on semantic similarity (~7.4% gap), suggesting that even when surface transcriptions differ, the underlying meaning is largely preserved. The larger gap on exact matching reflects Hinglish's inherent orthographic variability — the same spoken phrase can be written multiple valid ways — which strict token matching penalizes disproportionately.
Per-session breakdown (34 sessions)
| # | Session ID | Sarvam Exact | Sarvam Semantic | Finetuned Exact | Finetuned Semantic |
|---|---|---|---|---|---|
| 1 | a2322970-150c-4c0c-9bb2-27f714da5df0 | 0.805 | 0.75 | 0.796 | 0.98 |
| 2 | 06fc6bff-82ad-4faa-8dfd-037c92a85ab0 | 0.881 | 0.99 | 0.958 | 0.96 |
| 3 | 34b53a39-b1ad-48cc-812f-746c95c458ab | 0.636 | 0.78 | 0.457 | 0.88 |
| 4 | 4f52fa89-b7ef-4efe-974a-9bac60f8b6d5 | 0.573 | 0.7 | 0.551 | 0.35 |
| 5 | fca11fc4-666c-46a3-acdb-7856004630c4 | 0.741 | 0.92 | 0.95 | 0.87 |
| 6 | 8e55b4cf-e3de-4d65-b4a8-65bf811c6320 | 1.0 | 1.0 | 0.794 | 0.85 |
| 7 | 0d9bf5e9-b82c-444b-bdc5-9590368549ca | 0.887 | 0.65 | 0.894 | 0.93 |
| 8 | 99bdf060-958c-4dd3-934d-c21b588ce80c | 0.864 | 0.92 | 0.922 | 0.97 |
| 9 | 9e0bdf6f-4cc1-40c5-b103-06f7df7c4037 | 0.782 | 1.0 | 0.163 | 0.5 |
| 10 | 8bb896d8-fde3-4cbb-ad44-70a6271a660d | 0.933 | 0.9 | 0.819 | 0.88 |
| 11 | 3745a1af-eabf-4739-88c5-ae4d11d55669 | 0.013 | 0.05 | 0.027 | 0.05 |
| 12 | 2cd26f08-341f-466a-96c6-e0dad456b58f | 0.641 | 0.65 | 0.581 | 0.6 |
| 13 | b3ae8227-228e-4a01-b3ba-1603adc93946 | 0.637 | 0.5 | 0.574 | 0.78 |
| 14 | b91d2bcd-9687-42d6-bc7a-43cd59c0a159 | 0.687 | 0.58 | 0.276 | 0.5 |
| 15 | 8094d493-90a7-48fa-aaa2-191da2be3e58 | 0.795 | 0.99 | 0.656 | 0.95 |
| 16 | cf3f1cf0-bae8-411a-b3fa-b129a00559f9 | 0.817 | 0.96 | 0.695 | 0.71 |
| 17 | 10c30325-e2bf-4573-92c2-37e7c44a51bd | 0.826 | 0.92 | 0.782 | 0.87 |
| 18 | 28e2ea96-bccf-4a0d-ada6-234a9afd4f03 | 0.814 | 0.92 | 0.901 | 0.88 |
| 19 | 50ccee2b-ece6-41f9-b69f-7c8a170a0a95 | 0.616 | 0.45 | 0.544 | 0.62 |
| 20 | d24ea561-1d55-4a3e-aa3a-38dd1c6a4e73 | 0.894 | 1.0 | 0.828 | 0.99 |
| 21 | 0cd9f6cb-50da-43c7-8854-369b94a4c52e | 0.692 | 0.5 | 0.584 | 0.5 |
| 22 | 06be5e3e-8dd1-4f5b-97a2-7b88fa0b0161 | 0.814 | 0.85 | 0.745 | 0.82 |
| 23 | ca7ff7c3-461d-4d06-8489-31158d1b3395 | 0.818 | 0.7 | 0.601 | 0.9 |
| 24 | f2c1e117-1ec4-4053-a307-dbe83ff0d33e | 0.15 | 0.75 | 0.018 | 0.85 |
| 25 | 35df3f7d-0b6b-4cab-b632-46b5f159a314 | 0.948 | 0.93 | 0.626 | 0.85 |
| 26 | 7bb34ab9-1110-45a6-9041-d43f8f2823a1 | 0.007 | 0.05 | 0.012 | 0.05 |
| 27 | 054085aa-cec5-4e0d-b822-865e9e0a9fa1 | 0.814 | 0.87 | 0.604 | 0.9 |
| 28 | 59cea2b9-46f0-4017-93fa-8d786a14d7f9 | 0.947 | 0.99 | 0.921 | 0.99 |
| 29 | e735fbd1-0cb1-4bf6-9ec9-296047d84347 | 0.826 | 0.97 | 0.977 | 0.98 |
| 30 | 1335f05f-a6de-44d5-ba19-ff2825c6b064 | 0.975 | 0.99 | 0.458 | 0.8 |
| 31 | f6387191-af5a-4e78-8168-d41d300e0d08 | 0.474 | 0.88 | 0.201 | 0.2 |
| 32 | d7f0bc64-3ea3-4fcc-8010-7ccd1f974ef1 | 0.771 | 0.68 | 0.842 | 0.65 |
| 33 | c0ea0733-465b-49f3-9181-e5f11dae9648 | 0.546 | 0.88 | 0.771 | 0.58 |
| 34 | 58251587-4b4f-49a3-a93b-559df5148527 | 0.818 | 0.95 | 0.616 | 0.45 |
| Avg | 0.7189 | 0.7829 | 0.6219 | 0.7247 |
Qualitative Observations
- On conversational Hinglish with heavy code-switching, the model maintains coherent output where Whisper large-v3 tends to drift toward full Hindi or full English.
- In high-noise scenarios (street audio, overlapping background speech), this model suppresses output rather than producing confident but incorrect transcriptions.
- Multi-speaker clips (≤2 speakers) are handled without explicit diarization, though accuracy drops as overlap increases.
Additional Notes
- This model does not perform speaker diarization.
- Audio inputs should be under 30 seconds for best results; the pipeline handles longer files via chunking.
- Language code should be set to `"en"` in `generate_kwargs` regardless of the Hindi content; this is intentional and reflects the Hinglish output format.
- For production deployments with strict latency requirements, quantization via `bitsandbytes` or `ctranslate2` is recommended.
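The exact quantization recipe depends on the serving stack. As a minimal illustration of the idea only, here is generic PyTorch dynamic int8 quantization on a toy module, not the `bitsandbytes` or `ctranslate2` path a production deployment would use:

```python
import torch
import torch.nn as nn

# Generic dynamic quantization on a toy module; production use would target
# the actual checkpoint through bitsandbytes or ctranslate2 instead.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8))

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8  # int8 weights for Linear layers
)

x = torch.randn(1, 64)
print(quantized(x).shape)  # torch.Size([1, 8])
```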
Model tree for equal-ai/whisper-transliterate

Base model: openai/whisper-large-v3

Evaluation results (WER, self-reported):
- google/fleurs test set: 29.37
- mozilla-foundation/common_voice_20_0 test set: 31.32
- Indic-Voices test set: 57.26