# Whisper-large-v3-turbo Polish Medical ASR
A fine-tuned openai/whisper-large-v3-turbo specialized for Polish medical speech using an
anti-forgetting training recipe (knowledge distillation + medical oversampling + general-domain replay)
and partial LoRA merge (α=0.75 weight interpolation) for implicit regularization.
This model reduces content WER on held-out Polish medical test sets by 24–55% relative to the base model, while leaving general Polish performance essentially unchanged and substantially improving out-of-distribution European Parliament audio (VoxPopuli).
## Benchmark results (content WER: lowercase, no punctuation)
Held-out test sets (3,205 samples, fair-eval methodology — no train/test text overlap):
| Test Set | Base whisper-large-v3-turbo | This model | Δ (pp) | Relative |
|---|---|---|---|---|
| admed_anoni (medical, synthetic) | 16.58 % | 11.30 % | −5.28 | −32 % |
| admed_human (medical, human read) | 17.07 % | 7.64 % | −9.43 | −55 % |
| youtube (medical-adjacent) | 10.07 % | 5.60 % | −4.47 | −44 % |
| gemini (medical test2) | 6.43 % | 4.88 % | −1.55 | −24 % |
| bigos (general Polish) | 5.37 % | 5.50 % | +0.13 | +2 % |
| VoxPopuli (OOD, EU Parliament) | 15.88 % | 9.28 % | −6.60 | −42 % |
No catastrophic forgetting: bigos (general Polish) is essentially unchanged (+0.13pp). The model even improves on out-of-distribution formal Polish (VoxPopuli −6.6pp) because fine-tuning with Polish data fixes a language-detection issue in the base model.
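The "content WER" metric (lowercase, punctuation stripped before scoring) can be sketched in a few lines. This pure-Python version is an assumption about the exact normalization, not the project's evaluation code; libraries such as `jiwer` are typically used in practice:

```python
import re

def normalize(text):
    """Content-WER normalization: lowercase, drop punctuation, split on whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def wer(ref, hyp):
    """Word error rate via word-level Levenshtein distance (single-row DP)."""
    r, h = normalize(ref), normalize(hyp)
    d = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            # deletion, insertion, substitution/match
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (rw != hw))
    return d[len(h)] / max(len(r), 1)
```

With this normalization, casing and punctuation differences cost nothing: `wer("Ala ma kota.", "ala ma Kota")` is 0.0.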
## Error quality (qualitative analysis, n=2000 medical samples)
| Metric | This model |
|---|---|
| Regressions (FT worse than base, >5pp per sample) | 59 / 2000 (2.95%) |
| Improvements (FT better than base, >5pp per sample) | 622 / 2000 (31.1%) |
| Unchanged | 1319 / 2000 (66.0%) |
Regression types are predominantly minor: Polish compound-word boundary shifts ("niereagujące" → "nie reagujące"), rare grammar insertions, and 2 repeat-loop hallucinations on audio where the base model also fails. No systematic formatting habits or deploy-hazardous patterns were detected.
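The per-sample bucketing above reads as a ±5 pp threshold on the WER delta between the fine-tuned and base model. A minimal sketch of that classification (the function name and tie handling are assumptions, not the project's analysis code):

```python
def classify_sample(base_wer, ft_wer, threshold=0.05):
    """Bucket one sample by per-sample WER delta (fractions, 0.05 = 5 pp)."""
    delta = ft_wer - base_wer
    if delta > threshold:
        return "regression"    # fine-tuned >5 pp worse than base
    if delta < -threshold:
        return "improvement"   # fine-tuned >5 pp better than base
    return "unchanged"
```

Applied over the 2,000-sample medical set, counting each bucket yields the table above.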
## Training recipe
| Component | Value |
|---|---|
| Base model | openai/whisper-large-v3-turbo |
| Adapter | LoRA r=64, α=128, dropout=0.0 |
| LoRA targets | encoder + decoder attention + FFN projections (49M trainable params) |
| Learning rate | 2e-4 (cosine, 10% warmup) |
| Epochs | 5 |
| Batch size | 16 × 4 GPUs |
| Precision | fp16, gradient checkpointing |
| Anti-forgetting | KD α=0.3, T=2.0 from frozen base |
| Data mix | Medical × 2 oversampled + bigos 10k general replay |
| Post-training | Partial LoRA merge: α=0.75 weight interpolation |
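The data-mix row can be read as: concatenate the medical sets twice, add the 10k bigos replay samples, and shuffle everything into one training pool. A toy sketch with placeholder sample IDs (the exact mixing code is an assumption):

```python
import random

# Placeholder IDs standing in for the medical and bigos replay samples
medical = [f"med_{i}" for i in range(5)]
replay = [f"gen_{i}" for i in range(3)]

# Medical oversampled x2 plus general-domain replay, shuffled together
train = medical * 2 + replay
random.seed(0)
random.shuffle(train)
```

Each medical sample appears twice per pass, shifting the medical/general balance without dropping the replay data.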
### Partial merge technique
Instead of fully merging the LoRA adapter (which overfits on some test sets), we interpolate weights:

```
final_weights = 0.75 × merged_lora_weights + 0.25 × base_weights
```
This acts as implicit regularization, reducing per-sample regressions by ~20% compared to full merge while preserving nearly all of the medical WER improvement. The optimal α=0.75 was found via an 8-point sweep across [0.25, 0.85].
Training: ~6h on 4×A100 (SXM4-40GB). Partial merge adds ~5 min of post-processing.
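The interpolation itself is a per-parameter weighted sum. In practice it would run over torch state_dicts after merging the LoRA adapter; this numpy sketch keeps it self-contained:

```python
import numpy as np

def partial_merge(base_state, merged_state, alpha=0.75):
    """final = alpha * merged + (1 - alpha) * base, parameter by parameter."""
    return {name: alpha * merged_state[name] + (1.0 - alpha) * base_state[name]
            for name in base_state}

# Toy example: one parameter tensor interpolated at alpha=0.75
base = {"w": np.zeros(4)}
merged = {"w": np.ones(4)}
final = partial_merge(base, merged)
```

Setting `alpha=1.0` recovers the full merge; `alpha=0.0` recovers the base model, so the sweep over [0.25, 0.85] explores points between the two.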
## Datasets

| Dataset | Role | Samples (train) |
|---|---|---|
| lion-ai/admed_voice (admed_anoni) | Medical (synthetic) | 8,516 × 2 |
| lion-ai/admed_voice (admed_human) | Medical (human read) | 5,693 × 2 |
| lion-ai/youtube_asr_30 | Medical-adjacent YouTube | 3,712 × 2 |
| lion-ai/pl_med_asr_test2 | Medical (test2) | 1,301 × 2 |
| lion-ai/bigos | General Polish (replay) | 10,000 |
Evaluation uses held-out test splits from all five datasets (3,205 samples total), plus 200 out-of-distribution samples from VoxPopuli European Parliament.
## Usage

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torch
import librosa

model_id = "lion-ai/eskulap-asr-turbo-beta"
model = WhisperForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")
processor = WhisperProcessor.from_pretrained(model_id, language="Polish", task="transcribe")

# Load 16 kHz mono audio and run inference
audio, sr = librosa.load("sample.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt").to("cuda")
inputs["input_features"] = inputs["input_features"].half()

with torch.no_grad():
    predicted_ids = model.generate(**inputs, language="pl", task="transcribe")

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
## Why anti-forgetting matters
Naively fine-tuning Whisper on medical-only data destroys performance on general Polish. This recipe combines four techniques:
- Data replay — mixing general-domain (bigos) samples in training
- Knowledge distillation — KL divergence loss to frozen base preserves its output distribution
- Medical oversampling — repeats the medical training data 2× to shift the balance
- Partial merge — weight interpolation at deploy time provides implicit regularization
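The card does not spell out the distillation objective; the standard temperature-scaled formulation with the stated α=0.3 and T=2.0 might look like the following numpy sketch over decoder logits (an illustrative assumption, not the project's training code):

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, labels, alpha=0.3, T=2.0):
    """(1 - alpha) * CE(student, labels) + alpha * T^2 * KL(teacher || student)."""
    # Cross-entropy against ground-truth token ids
    probs = softmax(student_logits)
    n = len(labels)
    ce = -np.mean(np.log(probs[np.arange(n), labels] + 1e-12))
    # KL between temperature-softened teacher and student distributions
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.mean(np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1))
    return (1 - alpha) * ce + alpha * (T ** 2) * kl
```

When the student matches the frozen teacher exactly, the KL term vanishes and only the cross-entropy term remains, which is how the teacher anchors the output distribution without blocking task learning.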
Result: strong medical WER improvement with no general-domain forgetting.
## Known limitations
- Medical terminology with complex Polish morphology (e.g., "pozapiramidowego", "gruczolakorak") remains challenging — both base and fine-tuned models make similar errors on these terms (acoustic limit).
- ~3% of samples show minor regressions vs base (mostly compound-word boundary shifts).
- Trained primarily on read/dictated medical speech; spontaneous conversational medical speech may differ.
## Related work
Part of the Eskulap project — Polish medical ASR research.
See also the smaller variant based on openai/whisper-medium.
## License
Apache 2.0 (inherits from base model).