equal-ai/whisper-transliterate

A fine-tuned version of openai/whisper-large-v3 for Hinglish speech recognition; see the full release on Hugging Face.
What This Model Does
Most ASR systems treat Hinglish as a problem to be corrected — collapsing it into formal Hindi or transliterated English. This model takes the opposite stance: it accepts Hinglish as a valid, first-class output format and transcribes speech the way people actually talk.
Key capabilities:
- Native Hinglish Output — Produces transcriptions that reflect real conversational speech patterns rather than standardized Hindi or English, cutting down on structural mismatches in downstream use.
- Whisper-Compatible Architecture — Drops into any pipeline that already uses OpenAI Whisper, with full support from the 🤗 Transformers library.
- Reliable in Noisy Conditions — Trained heavily on degraded, real-environment audio; produces silence rather than hallucinated output when no speech is present.
- Reduced Hallucination Rate — Explicit design choices during training suppress phantom transcriptions, keeping the model grounded when audio quality drops.
- ~39% WER Improvement — Averaged across all standard benchmarks, this model consistently outperforms base Whisper large-v3 by a substantial margin.
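The silence-over-hallucination behaviour can also be reinforced downstream. As a minimal sketch (not part of the model itself), a caller might gate transcription on clip energy; the function names and the threshold value here are illustrative:

```python
# Hypothetical downstream guard mirroring the model's silence-over-hallucination
# behaviour: skip transcription when a clip's RMS energy is below a threshold.
# The threshold value is illustrative, not taken from the model card.

def rms(samples):
    """Root-mean-square energy of float samples in [-1, 1]."""
    if not samples:
        return 0.0
    return (sum(s * s for s in samples) / len(samples)) ** 0.5

def should_transcribe(samples, threshold=0.01):
    """Return True only when the clip carries enough energy to plausibly contain speech."""
    return rms(samples) >= threshold

# Near-silent clip vs. a clip with audible content
silence = [0.001] * 16000
speech_like = [0.2, -0.3, 0.25, -0.2] * 4000

print(should_transcribe(silence))      # low energy: skip transcription
print(should_transcribe(speech_like))  # enough energy: transcribe
```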
Training Details
Dataset
- ~550 hours of Hindi audio sourced from real Indian environments — homes, streets, offices — with genuine background noise throughout.
- Entirely proprietary — no public Hinglish ASR corpus matched the acoustic diversity needed, so a dataset was built from the ground up.
- Hybrid labeling pipeline — a state-of-the-art model produced first-pass transcriptions, then human reviewers corrected errors and handled edge cases, ensuring label quality without full manual transcription cost.
- Noise-forward collection strategy — rather than filtering for clean audio, data collection deliberately targeted noisy recordings to match deployment conditions.
- Minimal preprocessing — audio clips were segmented to under 30 seconds with at most two speakers per clip; beyond that, the original acoustic character was preserved without normalization or filtering.
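The under-30-second segmentation step above reduces to simple boundary arithmetic over sample indices. A sketch, with function and parameter names that are stand-ins rather than the actual pipeline's:

```python
# Sketch of the <=30 s segmentation described above: given a clip's length in
# samples, compute (start, end) sample ranges of at most 30 s each.

def segment_bounds(n_samples, sample_rate=16000, max_seconds=30):
    """Yield (start, end) sample indices for consecutive chunks of <= max_seconds."""
    chunk = max_seconds * sample_rate
    return [(start, min(start + chunk, n_samples))
            for start in range(0, n_samples, chunk)]

# A 70 s clip at 16 kHz splits into 30 s + 30 s + 10 s
bounds = segment_bounds(70 * 16000)
print(bounds)  # [(0, 480000), (480000, 960000), (960000, 1120000)]
```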
Fine-tuning Approach
- Custom training loop — a purpose-built trainer with detailed callbacks provided granular visibility into training dynamics at each stage.
- Dynamic layer freezing — activation patterns were profiled on a representative slice of the training data to identify the most task-sensitive layers; only those layers remained trainable, reducing compute without sacrificing convergence quality.
- DeepSpeed-backed training — enabled memory-efficient, high-throughput runs across the full dataset.
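The layer-freezing step can be illustrated in plain PyTorch. Which layers count as "task-sensitive" comes from the activation-profiling pass, so the selection below is a stand-in, and the toy model replaces the real Whisper checkpoint:

```python
import torch.nn as nn

def freeze_except(model, trainable_names):
    """Freeze every parameter except those whose name contains one of the
    given substrings (stand-ins for the profiled task-sensitive layers)."""
    for name, param in model.named_parameters():
        param.requires_grad = any(key in name for key in trainable_names)

# Tiny stand-in model; in practice this would be the Whisper checkpoint.
model = nn.Sequential(
    nn.Linear(8, 8),   # layer "0": imagine a low-level acoustic layer (frozen)
    nn.Linear(8, 8),   # layer "1": imagine a task-sensitive layer (trainable)
)
freeze_except(model, trainable_names=["1."])

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['1.weight', '1.bias']
```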
How to Use
Install the Transformers library first:
```shell
pip install -U transformers
```
Then load and run the model using the pipeline API:
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Device and precision setup
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "equal-ai/whisper-transliterate"

# Load model
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)
model.to(device)

# Load processor
processor = AutoProcessor.from_pretrained(model_id)

# Build pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={
        "task": "transcribe",
        "language": "en",
    },
)

# Run inference
result = pipe("sample.wav")
print(result["text"])
```
Evaluation Results
Quantitative Benchmarks
WER Benchmarks (Public Datasets)
WER scores compare Hinglish output from this model against unmodified Whisper large-v3 on the same audio. Lower is better.
| Dataset | Whisper Large V3 | equal-ai/whisper-transliterate | Relative Improvement |
|---|---|---|---|
| Common Voice | 60.87 | 31.32 | ~48.6% |
| FLEURS | 51.66 | 29.37 | ~43.2% |
| Indic-Voices | 84.74 | 57.26 | ~32.4% |
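The relative-improvement column follows directly from the two WER columns, computed as (baseline − fine-tuned) / baseline; a quick check (small rounding differences from the table's approximate figures are expected):

```python
# Reproduce the "Relative Improvement" column from the WER table above.
benchmarks = {
    "Common Voice": (60.87, 31.32),
    "FLEURS": (51.66, 29.37),
    "Indic-Voices": (84.74, 57.26),
}

for name, (base, finetuned) in benchmarks.items():
    improvement = 100 * (base - finetuned) / base
    print(f"{name}: {improvement:.1f}%")
```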
Internal Evaluation vs. Sarvam ASR (34 Sessions)
To benchmark against a strong India-focused ASR baseline, we evaluated both models on 34 real-world sessions using two scoring dimensions:
- Exact Score — strict token-level match between predicted and reference transcription
- Semantic Score — embedding-based similarity that captures meaning even when surface wording differs
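The actual scoring code is not published, so the following token-level similarity is an assumed approximation of the Exact Score, built on Python's `difflib`; it also shows why Hinglish spelling variants are penalized under strict matching:

```python
from difflib import SequenceMatcher

def exact_score(predicted, reference):
    """Token-level similarity in [0, 1]; a stand-in for the card's Exact Score,
    whose precise implementation is not published."""
    pred_tokens = predicted.lower().split()
    ref_tokens = reference.lower().split()
    return SequenceMatcher(None, pred_tokens, ref_tokens).ratio()

# Orthographic variation ("naam" vs "nam") lowers the score even though the
# spoken phrase is identical, which strict token matching penalizes.
print(exact_score("mera naam Rahul hai", "mera nam Rahul hai"))  # 0.75
```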
| Model | Exact Score (↑) | Semantic Score (↑) |
|---|---|---|
| Sarvam ASR | 0.7189 | 0.7829 |
| equal-ai/whisper-transliterate | 0.6219 | 0.7247 |
Scores are averages across all 34 held-out sessions. The fine-tuned model performs competitively on semantic similarity (~7.4% gap), suggesting that even when surface transcriptions differ, the underlying meaning is largely preserved. The larger gap on exact matching reflects Hinglish's inherent orthographic variability — the same spoken phrase can be written multiple valid ways — which strict token matching penalizes disproportionately.
Per-session breakdown (34 sessions)
| # | Session ID | Sarvam Exact | Sarvam Semantic | Finetuned Exact | Finetuned Semantic |
|---|---|---|---|---|---|
| 1 | a2322970-150c-4c0c-9bb2-27f714da5df0 | 0.805 | 0.75 | 0.796 | 0.98 |
| 2 | 06fc6bff-82ad-4faa-8dfd-037c92a85ab0 | 0.881 | 0.99 | 0.958 | 0.96 |
| 3 | 34b53a39-b1ad-48cc-812f-746c95c458ab | 0.636 | 0.78 | 0.457 | 0.88 |
| 4 | 4f52fa89-b7ef-4efe-974a-9bac60f8b6d5 | 0.573 | 0.7 | 0.551 | 0.35 |
| 5 | fca11fc4-666c-46a3-acdb-7856004630c4 | 0.741 | 0.92 | 0.95 | 0.87 |
| 6 | 8e55b4cf-e3de-4d65-b4a8-65bf811c6320 | 1.0 | 1.0 | 0.794 | 0.85 |
| 7 | 0d9bf5e9-b82c-444b-bdc5-9590368549ca | 0.887 | 0.65 | 0.894 | 0.93 |
| 8 | 99bdf060-958c-4dd3-934d-c21b588ce80c | 0.864 | 0.92 | 0.922 | 0.97 |
| 9 | 9e0bdf6f-4cc1-40c5-b103-06f7df7c4037 | 0.782 | 1.0 | 0.163 | 0.5 |
| 10 | 8bb896d8-fde3-4cbb-ad44-70a6271a660d | 0.933 | 0.9 | 0.819 | 0.88 |
| 11 | 3745a1af-eabf-4739-88c5-ae4d11d55669 | 0.013 | 0.05 | 0.027 | 0.05 |
| 12 | 2cd26f08-341f-466a-96c6-e0dad456b58f | 0.641 | 0.65 | 0.581 | 0.6 |
| 13 | b3ae8227-228e-4a01-b3ba-1603adc93946 | 0.637 | 0.5 | 0.574 | 0.78 |
| 14 | b91d2bcd-9687-42d6-bc7a-43cd59c0a159 | 0.687 | 0.58 | 0.276 | 0.5 |
| 15 | 8094d493-90a7-48fa-aaa2-191da2be3e58 | 0.795 | 0.99 | 0.656 | 0.95 |
| 16 | cf3f1cf0-bae8-411a-b3fa-b129a00559f9 | 0.817 | 0.96 | 0.695 | 0.71 |
| 17 | 10c30325-e2bf-4573-92c2-37e7c44a51bd | 0.826 | 0.92 | 0.782 | 0.87 |
| 18 | 28e2ea96-bccf-4a0d-ada6-234a9afd4f03 | 0.814 | 0.92 | 0.901 | 0.88 |
| 19 | 50ccee2b-ece6-41f9-b69f-7c8a170a0a95 | 0.616 | 0.45 | 0.544 | 0.62 |
| 20 | d24ea561-1d55-4a3e-aa3a-38dd1c6a4e73 | 0.894 | 1.0 | 0.828 | 0.99 |
| 21 | 0cd9f6cb-50da-43c7-8854-369b94a4c52e | 0.692 | 0.5 | 0.584 | 0.5 |
| 22 | 06be5e3e-8dd1-4f5b-97a2-7b88fa0b0161 | 0.814 | 0.85 | 0.745 | 0.82 |
| 23 | ca7ff7c3-461d-4d06-8489-31158d1b3395 | 0.818 | 0.7 | 0.601 | 0.9 |
| 24 | f2c1e117-1ec4-4053-a307-dbe83ff0d33e | 0.15 | 0.75 | 0.018 | 0.85 |
| 25 | 35df3f7d-0b6b-4cab-b632-46b5f159a314 | 0.948 | 0.93 | 0.626 | 0.85 |
| 26 | 7bb34ab9-1110-45a6-9041-d43f8f2823a1 | 0.007 | 0.05 | 0.012 | 0.05 |
| 27 | 054085aa-cec5-4e0d-b822-865e9e0a9fa1 | 0.814 | 0.87 | 0.604 | 0.9 |
| 28 | 59cea2b9-46f0-4017-93fa-8d786a14d7f9 | 0.947 | 0.99 | 0.921 | 0.99 |
| 29 | e735fbd1-0cb1-4bf6-9ec9-296047d84347 | 0.826 | 0.97 | 0.977 | 0.98 |
| 30 | 1335f05f-a6de-44d5-ba19-ff2825c6b064 | 0.975 | 0.99 | 0.458 | 0.8 |
| 31 | f6387191-af5a-4e78-8168-d41d300e0d08 | 0.474 | 0.88 | 0.201 | 0.2 |
| 32 | d7f0bc64-3ea3-4fcc-8010-7ccd1f974ef1 | 0.771 | 0.68 | 0.842 | 0.65 |
| 33 | c0ea0733-465b-49f3-9181-e5f11dae9648 | 0.546 | 0.88 | 0.771 | 0.58 |
| 34 | 58251587-4b4f-49a3-a93b-559df5148527 | 0.818 | 0.95 | 0.616 | 0.45 |
| Avg | 0.7189 | 0.7829 | 0.6219 | 0.7247 |
Qualitative Observations
- On conversational Hinglish with heavy code-switching, the model maintains coherent output where Whisper large-v3 tends to drift toward full Hindi or full English.
- In high-noise scenarios (street audio, overlapping background speech), this model suppresses output rather than producing confident but incorrect transcriptions.
- Multi-speaker clips (≤2 speakers) are handled without explicit diarization, though accuracy drops as overlap increases.
Additional Notes
- This model does not perform speaker diarization.
- Audio inputs should be under 30 seconds for best results; the pipeline handles longer files via chunking.
- Language code should be set to `"en"` in `generate_kwargs` regardless of the Hindi content; this is intentional and reflects the Hinglish output format.
- For production deployments with strict latency requirements, quantization via `bitsandbytes` or `ctranslate2` is recommended.
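The exact quantization recipe depends on the serving stack. As a minimal illustration of the idea only, here is generic PyTorch dynamic int8 quantization on a toy module, not the `bitsandbytes` or `ctranslate2` path a production deployment would use:

```python
import torch
import torch.nn as nn

# Generic dynamic quantization on a toy module; production use would target
# the actual checkpoint through bitsandbytes or ctranslate2 instead.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8))

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8  # int8 weights for Linear layers
)

x = torch.randn(1, 64)
print(quantized(x).shape)  # torch.Size([1, 8])
```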
Model tree for equal-ai/whisper-transliterate

Base model: openai/whisper-large-v3

Evaluation results (WER, self-reported):
- google/fleurs test set: 29.37
- mozilla-foundation/common_voice_20_0 test set: 31.32
- Indic-Voices test set: 57.26