Whisper-large-v3-turbo Polish Medical ASR

A fine-tuned openai/whisper-large-v3-turbo specialized for Polish medical speech using an anti-forgetting training recipe (knowledge distillation + medical oversampling + general-domain replay) and partial LoRA merge (α=0.75 weight interpolation) for implicit regularization.

This model reduces content WER on held-out Polish medical test sets by 24–55% relative to the base model while also improving performance on general Polish speech (including out-of-distribution European Parliament audio).

Benchmark results (content WER — lowercase, no punctuation)

Held-out test sets (3,205 samples, fair-eval methodology — no train/test text overlap):

| Test set | Base whisper-large-v3-turbo | This model | Δ (pp) | Relative |
|---|---|---|---|---|
| admed_anoni (medical, synthetic) | 16.58 % | 11.30 % | −5.28 | −32 % |
| admed_human (medical, human read) | 17.07 % | 7.64 % | −9.43 | −55 % |
| youtube (medical-adjacent) | 10.07 % | 5.60 % | −4.47 | −44 % |
| gemini (medical test2) | 6.43 % | 4.88 % | −1.55 | −24 % |
| bigos (general Polish) | 5.37 % | 5.50 % | +0.13 | +2 % |
| VoxPopuli (OOD, EU Parliament) | 15.88 % | 9.28 % | −6.60 | −42 % |

No catastrophic forgetting: bigos (general Polish) is essentially unchanged (+0.13pp). The model even improves on out-of-distribution formal Polish (VoxPopuli −6.6pp) because fine-tuning with Polish data fixes a language-detection issue in the base model.
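"Content WER" here means WER computed after lowercasing and stripping punctuation. A minimal self-contained sketch of that metric (plain Python; a production evaluation would typically use a library such as jiwer, and note that `string.punctuation` covers only ASCII punctuation):

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase and strip (ASCII) punctuation, per the content-WER methodology."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance on normalized text."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

Under this normalization, casing and punctuation differences cost nothing, so the numbers above reflect content errors only.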

Error quality (qualitative analysis, n=2000 medical samples)

| Metric | This model |
|---|---|
| Regressions (FT worse than base, >5 pp per sample) | 59 / 2000 (2.95 %) |
| Improvements (FT better than base, >5 pp per sample) | 622 / 2000 (31.1 %) |
| Unchanged | 1319 / 2000 (66.0 %) |

Regression types are predominantly minor: Polish compound-word boundary shifts ("niereagujące" → "nie reagujące"), rare grammar insertions, and 2 repeat-loop hallucinations on audio where the base model also fails. No systematic formatting habits or deploy-hazardous patterns were detected.

Training recipe

| Component | Value |
|---|---|
| Base model | openai/whisper-large-v3-turbo |
| Adapter | LoRA r=64, α=128, dropout=0.0 |
| LoRA targets | encoder + decoder attention + FFN projections (49M trainable params) |
| Learning rate | 2e-4 (cosine, 10% warmup) |
| Epochs | 5 |
| Batch size | 16 × 4 GPUs |
| Precision | fp16, gradient checkpointing |
| Anti-forgetting | KD α=0.3, T=2.0 from frozen base |
| Data mix | medical × 2 oversampled + bigos 10k general replay |
| Post-training | partial LoRA merge: α=0.75 weight interpolation |

Partial merge technique

Instead of fully merging the LoRA adapter (which overfits on some test sets), we interpolate weights:

final_weights = 0.75 × merged_lora_weights + 0.25 × base_weights

This acts as implicit regularization, reducing per-sample regressions by ~20% compared to full merge while preserving nearly all of the medical WER improvement. The optimal α=0.75 was found via an 8-point sweep across [0.25, 0.85].
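The interpolation itself is a per-parameter affine blend of the two checkpoints. A schematic sketch using plain Python lists as stand-ins for tensors (in practice this would run over the PyTorch `state_dict` of the fully-merged and base models):

```python
def partial_merge(merged_state, base_state, alpha=0.75):
    """Interpolate parameter-wise between fully-merged LoRA weights and base weights.

    alpha=1.0 reproduces the full merge; alpha=0.0 recovers the base model.
    Each dict value is a flat list of floats standing in for a weight tensor.
    """
    assert merged_state.keys() == base_state.keys()
    return {
        name: [alpha * m + (1.0 - alpha) * b
               for m, b in zip(merged_state[name], base_state[name])]
        for name in merged_state
    }


# Toy example with two "parameters"
base = {"w": [1.0, 2.0], "b": [0.0]}
merged = {"w": [2.0, 4.0], "b": [1.0]}
final = partial_merge(merged, base, alpha=0.75)
```

Because the blend is linear per parameter, sweeping α only requires the two checkpoints, which is what makes the 8-point sweep cheap.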

Training: ~6h on 4×A100 (SXM4-40GB). Partial merge adds ~5 min of post-processing.

Datasets

| Dataset | Role | Samples (train) |
|---|---|---|
| lion-ai/admed_voice (admed_anoni) | Medical (synthetic) | 8,516 × 2 |
| lion-ai/admed_voice (admed_human) | Medical (human read) | 5,693 × 2 |
| lion-ai/youtube_asr_30 | Medical-adjacent YouTube | 3,712 × 2 |
| lion-ai/pl_med_asr_test2 | Medical (test2) | 1,301 × 2 |
| lion-ai/bigos | General Polish (replay) | 10,000 |

Evaluation uses held-out test splits from all five datasets (3,205 samples total), plus 200 out-of-distribution samples from VoxPopuli European Parliament.
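The data mix above (each medical set repeated ×2, plus a fixed-size replay sample from the general pool) can be sketched as simple list construction. The sizes mirror the table; the seed and helper name are illustrative, not the project's actual pipeline:

```python
import random


def build_train_mix(medical_sets, general_pool, replay_size=10_000,
                    oversample=2, seed=42):
    """Oversample each medical set and append a random general-domain replay buffer."""
    rng = random.Random(seed)
    mix = []
    for samples in medical_sets:
        mix.extend(samples * oversample)   # medical × 2 oversampling
    replay = rng.sample(general_pool, min(replay_size, len(general_pool)))
    mix.extend(replay)                     # general-domain replay (bigos)
    rng.shuffle(mix)
    return mix


# Toy stand-ins with the train-set sizes from the table
medical = [["m"] * 8516, ["m"] * 5693, ["m"] * 3712, ["m"] * 1301]
general = ["g"] * 50_000
mix = build_train_mix(medical, general)
```

With these sizes the mix is roughly 79% medical to 21% general, which shifts the balance toward the target domain while keeping replay data in every epoch.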

Usage

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import librosa
import torch

model_id = "lion-ai/eskulap-asr-turbo-beta"
model = WhisperForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")
processor = WhisperProcessor.from_pretrained(model_id, language="Polish", task="transcribe")

# Inference: Whisper expects 16 kHz audio
audio, sr = librosa.load("sample.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt").to("cuda")
inputs["input_features"] = inputs["input_features"].half()  # match the fp16 weights

with torch.no_grad():
    predicted_ids = model.generate(**inputs, language="pl", task="transcribe")
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```

Why anti-forgetting matters

Naively fine-tuning Whisper on medical-only data destroys performance on general Polish. This recipe combines four techniques:

  1. Data replay — mixing general-domain (bigos) samples in training
  2. Knowledge distillation — KL divergence loss to frozen base preserves its output distribution
  3. Medical oversampling — repeats the medical training data 2× to shift the balance
  4. Partial merge — weight interpolation at deploy time provides implicit regularization

Result: strong medical WER improvement with no general-domain forgetting.
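The KD term (α=0.3, T=2.0) blends a temperature-softened divergence from the frozen base with the usual cross-entropy on labels. A minimal per-token sketch over raw logit lists; the standard T² scaling of the KL term follows the usual distillation convention and is an assumption here, as is the exact combination formula (a real implementation would use `torch.nn.functional.kl_div` on batched decoder logits):

```python
import math


def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a flat list of logits."""
    exps = [math.exp(x / T) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, alpha=0.3, T=2.0):
    """alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * CE(student, label)."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    ce = -math.log(softmax(student_logits)[label])
    return alpha * T * T * kl + (1.0 - alpha) * ce
```

When the student matches the teacher exactly, the KL term vanishes and only the (down-weighted) cross-entropy remains; the KL term grows as the student's output distribution drifts from the frozen base, which is what limits forgetting.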

Known limitations

  • Medical terminology with complex Polish morphology (e.g., "pozapiramidowego", "gruczolakorak") remains challenging — both base and fine-tuned models make similar errors on these terms (acoustic limit).
  • ~3% of samples show minor regressions vs base (mostly compound-word boundary shifts).
  • Trained primarily on read/dictated medical speech; spontaneous conversational medical speech may differ.

Related work

Part of the Eskulap project — Polish medical ASR research. See also the smaller variant based on openai/whisper-medium.

License

Apache 2.0 (inherits from base model).
