Whisper-large-v3-turbo Polish Medical ASR

A fine-tuned openai/whisper-large-v3-turbo specialized for Polish medical speech using an anti-forgetting training recipe (knowledge distillation + medical oversampling + general-domain replay) and partial LoRA merge (α=0.75 weight interpolation) for implicit regularization.

This model reduces content WER on held-out Polish medical test sets by 24–55% relative to the base model while also improving performance on general Polish speech (including out-of-distribution European Parliament audio).

Benchmark results (content WER — lowercase, no punctuation)

Held-out test sets (3,205 samples, fair-eval methodology — no train/test text overlap):

| Test set | Base whisper-large-v3-turbo | This model | Δ (pp) | Relative |
|---|---|---|---|---|
| admed_anoni (medical, synthetic) | 16.58 % | 11.30 % | −5.28 | −32 % |
| admed_human (medical, human read) | 17.07 % | 7.64 % | −9.43 | −55 % |
| youtube (medical-adjacent) | 10.07 % | 5.60 % | −4.47 | −44 % |
| gemini (medical test2) | 6.43 % | 4.88 % | −1.55 | −24 % |
| bigos (general Polish) | 5.37 % | 5.50 % | +0.13 | +2 % |
| VoxPopuli (OOD, EU Parliament) | 15.88 % | 9.28 % | −6.60 | −42 % |

No catastrophic forgetting: bigos (general Polish) is essentially unchanged (+0.13pp). The model even improves on out-of-distribution formal Polish (VoxPopuli −6.6pp) because fine-tuning with Polish data fixes a language-detection issue in the base model.
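"Content WER" here means WER computed after lowercasing and stripping punctuation. A minimal self-contained sketch of that metric (plain Python; a production evaluation would typically use a library such as jiwer, and note that `string.punctuation` covers only ASCII punctuation):

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase and strip (ASCII) punctuation, per the content-WER methodology."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance on normalized text."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

Under this normalization, casing and punctuation differences cost nothing, so the numbers above reflect content errors only.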

Error quality (qualitative analysis, n=2000 medical samples)

| Metric | This model |
|---|---|
| Regressions (FT worse than base, >5 pp per sample) | 59 / 2000 (2.95 %) |
| Improvements (FT better than base, >5 pp per sample) | 622 / 2000 (31.1 %) |
| Unchanged | 1319 / 2000 (66.0 %) |

Regression types are predominantly minor: Polish compound-word boundary shifts ("niereagujące" → "nie reagujące"), rare grammar insertions, and 2 repeat-loop hallucinations on audio where the base model also fails. No systematic formatting habits or deploy-hazardous patterns were detected.

Training recipe

| Component | Value |
|---|---|
| Base model | openai/whisper-large-v3-turbo |
| Adapter | LoRA r=64, α=128, dropout=0.0 |
| LoRA targets | encoder + decoder attention + FFN projections (49M trainable params) |
| Learning rate | 2e-4 (cosine, 10% warmup) |
| Epochs | 5 |
| Batch size | 16 × 4 GPUs |
| Precision | fp16, gradient checkpointing |
| Anti-forgetting | KD α=0.3, T=2.0 from frozen base |
| Data mix | medical × 2 oversampled + bigos 10k general replay |
| Post-training | partial LoRA merge: α=0.75 weight interpolation |

Partial merge technique

Instead of fully merging the LoRA adapter (which overfits on some test sets), we interpolate weights:

final_weights = 0.75 × merged_lora_weights + 0.25 × base_weights

This acts as implicit regularization, reducing per-sample regressions by ~20% compared to full merge while preserving nearly all of the medical WER improvement. The optimal α=0.75 was found via an 8-point sweep across [0.25, 0.85].
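The interpolation itself is a per-parameter affine blend of the two checkpoints. A schematic sketch using plain Python lists as stand-ins for tensors (in practice this would run over the PyTorch `state_dict` of the fully-merged and base models):

```python
def partial_merge(merged_state, base_state, alpha=0.75):
    """Interpolate parameter-wise between fully-merged LoRA weights and base weights.

    alpha=1.0 reproduces the full merge; alpha=0.0 recovers the base model.
    Each dict value is a flat list of floats standing in for a weight tensor.
    """
    assert merged_state.keys() == base_state.keys()
    return {
        name: [alpha * m + (1.0 - alpha) * b
               for m, b in zip(merged_state[name], base_state[name])]
        for name in merged_state
    }


# Toy example with two "parameters"
base = {"w": [1.0, 2.0], "b": [0.0]}
merged = {"w": [2.0, 4.0], "b": [1.0]}
final = partial_merge(merged, base, alpha=0.75)
```

Because the blend is linear per parameter, sweeping α only requires the two checkpoints, which is what makes the 8-point sweep cheap.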

Training: ~6h on 4×A100 (SXM4-40GB). Partial merge adds ~5 min of post-processing.

Datasets

| Dataset | Role | Samples (train) |
|---|---|---|
| lion-ai/admed_voice (admed_anoni) | Medical (synthetic) | 8,516 × 2 |
| lion-ai/admed_voice (admed_human) | Medical (human read) | 5,693 × 2 |
| lion-ai/youtube_asr_30 | Medical-adjacent YouTube | 3,712 × 2 |
| lion-ai/pl_med_asr_test2 | Medical (test2) | 1,301 × 2 |
| lion-ai/bigos | General Polish (replay) | 10,000 |

Evaluation uses held-out test splits from all five datasets (3,205 samples total), plus 200 out-of-distribution samples from VoxPopuli European Parliament.
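The data mix above (each medical set repeated ×2, plus a fixed-size replay sample from the general pool) can be sketched as simple list construction. The sizes mirror the table; the seed and helper name are illustrative, not the project's actual pipeline:

```python
import random


def build_train_mix(medical_sets, general_pool, replay_size=10_000,
                    oversample=2, seed=42):
    """Oversample each medical set and append a random general-domain replay buffer."""
    rng = random.Random(seed)
    mix = []
    for samples in medical_sets:
        mix.extend(samples * oversample)   # medical × 2 oversampling
    replay = rng.sample(general_pool, min(replay_size, len(general_pool)))
    mix.extend(replay)                     # general-domain replay (bigos)
    rng.shuffle(mix)
    return mix


# Toy stand-ins with the train-set sizes from the table
medical = [["m"] * 8516, ["m"] * 5693, ["m"] * 3712, ["m"] * 1301]
general = ["g"] * 50_000
mix = build_train_mix(medical, general)
```

With these sizes the mix is roughly 79% medical to 21% general, which shifts the balance toward the target domain while keeping replay data in every epoch.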

Usage

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import librosa
import torch

model_id = "lion-ai/eskulap-asr-turbo-beta"
model = WhisperForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")
processor = WhisperProcessor.from_pretrained(model_id, language="Polish", task="transcribe")

# Inference: Whisper expects 16 kHz audio
audio, sr = librosa.load("sample.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt").to("cuda")
inputs["input_features"] = inputs["input_features"].half()  # match the fp16 weights

with torch.no_grad():
    predicted_ids = model.generate(**inputs, language="pl", task="transcribe")
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```

Why anti-forgetting matters

Naively fine-tuning Whisper on medical-only data destroys performance on general Polish. This recipe combines four techniques:

  1. Data replay — mixing general-domain (bigos) samples in training
  2. Knowledge distillation — KL divergence loss to frozen base preserves its output distribution
  3. Medical oversampling — repeats the medical training data 2× to shift the balance
  4. Partial merge — weight interpolation at deploy time provides implicit regularization

Result: strong medical WER improvement with no general-domain forgetting.
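The KD term (α=0.3, T=2.0) blends a temperature-softened divergence from the frozen base with the usual cross-entropy on labels. A minimal per-token sketch over raw logit lists; the standard T² scaling of the KL term follows the usual distillation convention and is an assumption here, as is the exact combination formula (a real implementation would use `torch.nn.functional.kl_div` on batched decoder logits):

```python
import math


def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a flat list of logits."""
    exps = [math.exp(x / T) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, alpha=0.3, T=2.0):
    """alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * CE(student, label)."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    ce = -math.log(softmax(student_logits)[label])
    return alpha * T * T * kl + (1.0 - alpha) * ce
```

When the student matches the teacher exactly, the KL term vanishes and only the (down-weighted) cross-entropy remains; the KL term grows as the student's output distribution drifts from the frozen base, which is what limits forgetting.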

Known limitations

  • Medical terminology with complex Polish morphology (e.g., "pozapiramidowego", "gruczolakorak") remains challenging — both base and fine-tuned models make similar errors on these terms (acoustic limit).
  • ~3% of samples show minor regressions vs base (mostly compound-word boundary shifts).
  • Trained primarily on read/dictated medical speech; spontaneous conversational medical speech may differ.

Related work

Part of the Eskulap project — Polish medical ASR research. See also the smaller variant based on openai/whisper-medium.

License

Apache 2.0 (inherits from base model).
