# whisper-large-v3 - Fine-tuned for Ukrainian ASR
This model is a fine-tuned version of openai/whisper-large-v3 on the Dido Yvanchyk Audio Dataset v2 for Ukrainian speech recognition.
## Model Description

Fine-tuned from openai/whisper-large-v3 on the Dido-Yvanchyk dataset for Ukrainian automatic speech recognition (ASR).
## Training Details

### Training Data

The model was trained on the Dido Yvanchyk Audio Dataset v2.

### Training Hyperparameters
| Parameter | Value |
|---|---|
| Base model | openai/whisper-large-v3 |
| Learning rate | 1e-05 |
| Warmup steps | 500 |
| Max steps | 8000 |
| Batch size (per device) | 4 |
| Gradient accumulation steps | 4 |
| Effective batch size | 16 |
| FP16 | True |
| Gradient checkpointing | False |
| Eval strategy | steps |
| Eval/Save steps | 500 |
| Metric for best model | cer |
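The effective batch size in the table follows from the two rows above it. A small sketch of that relationship (the dictionary keys are illustrative and not taken from the actual training script, which is not published):

```python
# Training configuration reconstructed from the hyperparameter table above.
# Key names mirror common transformers argument names but are illustrative.
config = {
    "learning_rate": 1e-5,
    "warmup_steps": 500,
    "max_steps": 8000,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "fp16": True,
    "gradient_checkpointing": False,
    "eval_steps": 500,
    "save_steps": 500,
    "metric_for_best_model": "cer",
}

# The effective batch size is the per-device batch size times the number of
# gradient-accumulation steps (times the GPU count, which is 1 here).
effective_batch_size = (
    config["per_device_train_batch_size"] * config["gradient_accumulation_steps"]
)
print(effective_batch_size)  # 16
```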
### Training Results
The model was trained for 8000 steps with evaluation every 500 steps. The best checkpoint was selected based on the lowest CER.
| Step | Train Loss | Eval Loss | Eval CER (%) | Eval WER (%) |
|---|---|---|---|---|
| 500 | 0.15 | 0.211 | 5.36 | 16.24 |
| 1000 | 0.093 | 0.196 | 4.64 | 14.34 |
| 1500 | 0.06 | 0.205 | 4.19 | 14.43 |
| 2000 | 0.044 | 0.211 | 4.19 | 13.94 |
| 2500 | 0.021 | 0.227 | 4.09 | 13.04 |
| 3000 | 0.011 | 0.238 | 3.93 | 13.06 |
| 3500 | 0.005 | 0.25 | 3.97 | 13.19 |
| 4000 | 0.003 | 0.253 | 3.84 | 13.08 |
| 4500 | 0.002 | 0.255 | 3.85 | 12.74 |
| 5000 | 0.002 | 0.267 | 3.86 | 12.61 |
| 5500 | 0.001 | 0.275 | 3.89 | 12.82 |
| 6000 | 0.001 | 0.274 | 3.73 | 12.44 |
| 6500 | 0.001 | 0.28 | 3.87 | 12.62 |
| 7000 | 0.0 | 0.286 | 3.71 | 12.41 |
| 7500 | 0.0 | 0.29 | 3.73 | 12.42 |
| 8000 | 0.0 | 0.291 | 3.73 | 12.4 |
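Since the best checkpoint is simply the evaluation step with the lowest CER, the selection can be sketched directly from the log (values copied from the table above):

```python
# (step, eval CER %) pairs from the evaluation log above.
eval_cer = [
    (500, 5.36), (1000, 4.64), (1500, 4.19), (2000, 4.19),
    (2500, 4.09), (3000, 3.93), (3500, 3.97), (4000, 3.84),
    (4500, 3.85), (5000, 3.86), (5500, 3.89), (6000, 3.73),
    (6500, 3.87), (7000, 3.71), (7500, 3.73), (8000, 3.73),
]

# Pick the checkpoint with the lowest character error rate.
best_step, best_cer = min(eval_cer, key=lambda pair: pair[1])
print(best_step, best_cer)  # 7000 3.71
```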
**Best model checkpoint:** step 7000
## Final Evaluation Metrics

### Validation Set
| Metric | Value |
|---|---|
| CER | 3.73% |
| WER | 12.4% |
| Eval loss | 0.291 |
### Test Set
| Metric | Value |
|---|---|
| CER | 3.9% |
| WER | 13.2% |
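Both metrics are normalized Levenshtein distances: edit distance over characters (CER) or words (WER), divided by the reference length. Training used the `evaluate` package; the sketch below re-implements the computation in plain Python to show what the numbers mean (the Ukrainian example strings are illustrative, not from the dataset):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences via dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (0 if characters match)
            ))
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: char-level edit distance / reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

def wer(ref, hyp):
    """Word error rate: word-level edit distance / reference word count."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

ref = "добрий день друзі"
hyp = "добрий ден друзі"   # one missing character, so one word is wrong
print(round(cer(ref, hyp), 3), round(wer(ref, hyp), 3))  # 0.059 0.333
```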
## Usage

### Using the Pipeline (Recommended)
```python
from transformers import pipeline
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="KSE-RESEARCH-Group/whisper-large-v3-dido-yvanchyk-v2",
    device=device,
)

result = pipe(
    "path/to/audio.wav",
    generate_kwargs={
        "task": "transcribe",
        "language": "ukrainian",
    },
    chunk_length_s=30,  # chunked decoding for audio longer than 30 s
)
print(result["text"])
```
### Using Transformers Directly
```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torch

model_id = "KSE-RESEARCH-Group/whisper-large-v3-dido-yvanchyk-v2"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# audio_array: 1-D float waveform sampled at 16 kHz
# (e.g. loaded with librosa.load(path, sr=16000))
input_features = processor(
    audio_array,
    sampling_rate=16000,
    return_tensors="pt",
).input_features.to(device)

predicted_ids = model.generate(
    input_features,
    task="transcribe",
    language="ukrainian",
)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
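Whisper's feature extractor expects 16 kHz mono audio, so files recorded at other rates must be resampled first. In practice use `torchaudio.functional.resample` or `librosa.resample`, which apply proper anti-aliasing; the pure-Python linear-interpolation sketch below only illustrates the idea (all names and the test signal are made up for the example):

```python
def resample_linear(samples, src_rate, dst_rate=16000):
    """Naive linear-interpolation resampling; illustrative only.
    Production code should use torchaudio or librosa, which filter
    out frequencies above the new Nyquist limit before downsampling."""
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # fractional source index
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# Downsampling a 48 kHz signal to 16 kHz yields one third as many samples.
audio_48k = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5] * 100
audio_16k = resample_linear(audio_48k, src_rate=48000)
print(len(audio_48k), len(audio_16k))  # 600 200
```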
## Infrastructure

### Hardware
| Component | Specification |
|---|---|
| GPU | NVIDIA GeForce RTX 4090 |
| GPU memory | 47.38 GB |
| GPU count | 1 |
| CUDA compute capability | 8.9 |
### Environment
| Package | Version |
|---|---|
| Python | 3.12.12 |
| PyTorch | 2.8.0+cu128 |
| CUDA | 12.8 |
| Transformers | 4.52.4 |
| Datasets | 3.6.0 |
| Evaluate | 0.4.3 |
### Training Time
| Metric | Value |
|---|---|
| Total training time | 8:55:01 |
| Training started | 2025-12-26 02:55:57 |
| Training completed | 2025-12-26 11:50:58 |
## Experiment Details
| Property | Value |
|---|---|
| Experiment ID | whisper-large-v3-011 |
| WandB project | dido-yvanchik-stt |
| WandB run | whisper-large-v3-011 |
## Citation
If you use this model, please cite:
```bibtex
@misc{KSE-RESEARCH-Group-whisper-large-v3-dido-yvanchyk-v2,
  author    = {KSE-RESEARCH-Group},
  title     = {whisper-large-v3 - Fine-tuned for Ukrainian ASR},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/KSE-RESEARCH-Group/whisper-large-v3-dido-yvanchyk-v2}
}
```
## License
This model is released under the MIT license.
## Acknowledgements