Whisper Large V3 Turbo Swiss German
This model is a fine-tuned version of openai/whisper-large-v3-turbo for Swiss German automatic speech recognition. It is intended to transcribe Swiss German speech into Standard German text.
The current upload corresponds to the best checkpoint from the March 20, 2026 retraining run: checkpoint-750.
Summary
- Base model: `openai/whisper-large-v3-turbo`
- Task: Swiss German speech recognition
- Output language: Standard German text
- Best uploaded checkpoint: `checkpoint-750`
- Training data for this checkpoint: about 301 hours of curated private Swiss German training audio
- Training infrastructure: 4x A100 80GB GPUs
- Checkpoint format: `safetensors`
Training Data
The training data for this model is private.
For readers, the important part is the data mix and scale:
- about 301 hours of training audio were used for this published checkpoint
- about 335 hours are included across the corresponding train, validation, and test splits
- the curated subset consists primarily of Swiss German parliamentary and other semi-formal speech, plus read or prompted Swiss German speech
- transcripts are in Standard German
Because the corpus is private, this model card does not list internal dataset names, split names, or source identifiers. The March 20, 2026 retraining run used a filtered subset of the broader private Swiss German corpus after earlier experiments showed that some internal sources reduced validation quality.
Public corpus references that help explain the broader data provenance are:
- SwissDial Dataset (ETH Zurich): around 24 hours across 8 major Swiss German dialects, with Swiss German and High German transcripts
- Swiss Parliaments Corpus V2 (FHNW): 293 hours of Swiss German parliamentary speech with Standard German transcripts
- All Swiss German Dialects Test Set (FHNW): 13 hours with a dialect distribution intended to be close to real-world Swiss German
- ArchiMob Release 2 (UZH / SWISSUbase): a transcribed spoken Swiss German corpus covering linguistic varieties across Switzerland
Those public references are included here for reader context and provenance. They should not be read as a verbatim public listing of the exact filtered private subset used for this published checkpoint.
Why This Checkpoint
The training run improved steadily up to checkpoint-750, then degraded afterward. The uploaded model is therefore the best checkpoint from the run, not the final checkpoint.
Validation trajectory during the successful run:
| Step | WER | Normalized WER |
|---|---|---|
| 250 | 41.05 | 40.21 |
| 500 | 39.63 | 38.86 |
| 750 | 37.96 | 37.25 |
| 1000 | 42.92 | 42.20 |
| 1250 | 43.64 | 42.92 |
| 1500 | 43.64 | 42.89 |
This is why checkpoint-750 is the shipped model.
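Model selection was done by validation word error rate. As a reader aid, here is a minimal, self-contained sketch of how WER is typically computed (word-level edit distance divided by reference length); this is an illustration, not the internal evaluation script, which may also apply text normalization before scoring.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # (len(ref)+1) x (len(hyp)+1) edit-distance table
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + sub,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One deleted word out of four reference words -> WER 0.25
print(wer("das ist ein test", "das ist test"))  # 0.25
```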
Comparison To Base Whisper V3 Turbo
The tuned model was compared against the base openai/whisper-large-v3-turbo on a large random sample from the same private training corpus regime used for this retraining run.
Comparison setup:
- split evaluated: training split
- sample size: 16,384
Results:
| Model | WER | Normalized WER |
|---|---|---|
| Base openai/whisper-large-v3-turbo | 45.71 | 44.52 |
| This model | 39.18 | 38.48 |
Absolute improvement over base on that sampled training slice:
- WER: -6.54
- normalized WER: -6.04
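The absolute deltas above can also be read as relative error reduction, which some readers find easier to compare across models. A quick check using the numbers from the table:

```python
# WER figures from the comparison table above
base_wer, tuned_wer = 45.71, 39.18
base_norm, tuned_norm = 44.52, 38.48

# Relative error reduction = (base - tuned) / base
rel_wer = (base_wer - tuned_wer) / base_wer * 100
rel_norm = (base_norm - tuned_norm) / base_norm * 100

print(f"relative WER reduction: {rel_wer:.1f}%")             # ~14.3%
print(f"relative normalized WER reduction: {rel_norm:.1f}%")  # ~13.6%
```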
Intended Use
This model is intended for Swiss German ASR workloads where the target transcription is Standard German text.
It is the right version to try if:
- you want a Whisper Turbo model adapted for Swiss German speech
- your audio is reasonably clean conversational or semi-formal speech
- you want a stronger Swiss German starting point than zero-shot base Whisper Turbo
Limitations
- The training data is private, so the reported metrics are self-reported from internal evaluation.
- The reported best validation metric is from the curated private validation slice used for model selection.
- The run overfit after checkpoint-750; later checkpoints were worse.
- Performance can still vary by dialect, speaker population, audio quality, and domain.
Usage
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

model_id = "Flurin17/whisper-large-v3-turbo-swiss-german"
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the fine-tuned checkpoint (safetensors format)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe("path/to/audio.wav")
print(result["text"])
```
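Whisper models operate on 16 kHz audio; the pipeline's feature extractor handles resampling of decodable inputs, but it can still be useful to inspect a WAV file's format up front when debugging unexpected transcriptions. The helper below is a hypothetical convenience using only the standard library `wave` module, not part of this model's API:

```python
import wave

def check_wav(path: str) -> tuple[int, int]:
    """Return (sample_rate, channels) of a WAV file.

    Whisper models expect 16 kHz mono features; other rates and channel
    counts are resampled/downmixed by the feature extractor upstream.
    """
    with wave.open(path, "rb") as f:
        return f.getframerate(), f.getnchannels()
```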
Technical Notes
- architecture: Whisper Large V3 Turbo
- framework: PyTorch + Transformers
- optimizer: adamw_torch_fused
- scheduler: cosine
- learning rate: 1e-5
- epochs configured: 5
- model selection: best checkpoint by validation WER
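For readers unfamiliar with the cosine scheduler named above, the sketch below shows the shape of the decay from the peak learning rate of 1e-5. It ignores warmup (which the Transformers cosine schedule also supports; warmup settings for this run are not reported here), so it is an illustration of the schedule's shape, not the exact run configuration:

```python
import math

def cosine_lr(step: int, total_steps: int, peak_lr: float = 1e-5) -> float:
    """Cosine decay from peak_lr at step 0 to ~0 at total_steps (no warmup)."""
    return 0.5 * peak_lr * (1 + math.cos(math.pi * step / total_steps))

print(cosine_lr(0, 1500))    # peak: 1e-05
print(cosine_lr(750, 1500))  # halfway: ~5e-06
```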
License
This model is distributed under the Creative Commons Attribution-NonCommercial 4.0 license (cc-by-nc-4.0).
Evaluation results
- Word Error Rate on private Swiss German validation split (self-reported): 37.963
- Normalized Word Error Rate on private Swiss German validation split (self-reported): 37.250