# whisper-large-v3 - Fine-tuned for Ukrainian ASR
This model is a fine-tuned version of openai/whisper-large-v3 on the Dido Yvanchyk Audio Dataset v2 for Ukrainian speech recognition.
## Model Description

Fine-tuned from openai/whisper-large-v3 on the Dido-Yvanchyk dataset for Ukrainian automatic speech recognition (ASR).
## Training Details

### Training Data

The model was trained on the Dido Yvanchyk Audio Dataset v2.

### Training Hyperparameters
| Parameter | Value |
|---|---|
| Base model | openai/whisper-large-v3 |
| Learning rate | 1e-05 |
| Warmup steps | 500 |
| Max steps | 8000 |
| Batch size (per device) | 4 |
| Gradient accumulation steps | 4 |
| Effective batch size | 16 |
| FP16 | True |
| Gradient checkpointing | False |
| Eval strategy | steps |
| Eval/Save steps | 500 |
| Metric for best model | cer |
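The effective batch size in the table follows from the two rows above it. A small sketch of that relationship (the dictionary keys are illustrative and not taken from the actual training script, which is not published):

```python
# Training configuration reconstructed from the hyperparameter table above.
# Key names mirror common transformers argument names but are illustrative.
config = {
    "learning_rate": 1e-5,
    "warmup_steps": 500,
    "max_steps": 8000,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "fp16": True,
    "gradient_checkpointing": False,
    "eval_steps": 500,
    "save_steps": 500,
    "metric_for_best_model": "cer",
}

# The effective batch size is the per-device batch size times the number of
# gradient-accumulation steps (times the GPU count, which is 1 here).
effective_batch_size = (
    config["per_device_train_batch_size"] * config["gradient_accumulation_steps"]
)
print(effective_batch_size)  # 16
```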
### Training Results
The model was trained for 8000 steps with evaluation every 500 steps. The best checkpoint was selected based on the lowest CER.
| Step | Train Loss | Eval Loss | Eval CER (%) | Eval WER (%) |
|---|---|---|---|---|
| 500 | 0.15 | 0.211 | 5.36 | 16.24 |
| 1000 | 0.093 | 0.196 | 4.64 | 14.34 |
| 1500 | 0.06 | 0.205 | 4.19 | 14.43 |
| 2000 | 0.044 | 0.211 | 4.19 | 13.94 |
| 2500 | 0.021 | 0.227 | 4.09 | 13.04 |
| 3000 | 0.011 | 0.238 | 3.93 | 13.06 |
| 3500 | 0.005 | 0.25 | 3.97 | 13.19 |
| 4000 | 0.003 | 0.253 | 3.84 | 13.08 |
| 4500 | 0.002 | 0.255 | 3.85 | 12.74 |
| 5000 | 0.002 | 0.267 | 3.86 | 12.61 |
| 5500 | 0.001 | 0.275 | 3.89 | 12.82 |
| 6000 | 0.001 | 0.274 | 3.73 | 12.44 |
| 6500 | 0.001 | 0.28 | 3.87 | 12.62 |
| 7000 | 0.0 | 0.286 | 3.71 | 12.41 |
| 7500 | 0.0 | 0.29 | 3.73 | 12.42 |
| 8000 | 0.0 | 0.291 | 3.73 | 12.4 |
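Since the best checkpoint is simply the evaluation step with the lowest CER, the selection can be sketched directly from the log (values copied from the table above):

```python
# (step, eval CER %) pairs from the evaluation log above.
eval_cer = [
    (500, 5.36), (1000, 4.64), (1500, 4.19), (2000, 4.19),
    (2500, 4.09), (3000, 3.93), (3500, 3.97), (4000, 3.84),
    (4500, 3.85), (5000, 3.86), (5500, 3.89), (6000, 3.73),
    (6500, 3.87), (7000, 3.71), (7500, 3.73), (8000, 3.73),
]

# Pick the checkpoint with the lowest character error rate.
best_step, best_cer = min(eval_cer, key=lambda pair: pair[1])
print(best_step, best_cer)  # 7000 3.71
```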
**Best model checkpoint:** step 7000
## Final Evaluation Metrics

### Validation Set
| Metric | Value |
|---|---|
| CER | 3.73% |
| WER | 12.4% |
| Eval loss | 0.291 |
### Test Set
| Metric | Value |
|---|---|
| CER | 3.9% |
| WER | 13.2% |
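Both metrics are normalized Levenshtein distances: edit distance over characters (CER) or words (WER), divided by the reference length. Training used the `evaluate` package; the sketch below re-implements the computation in plain Python to show what the numbers mean (the Ukrainian example strings are illustrative, not from the dataset):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences via dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (0 if characters match)
            ))
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: char-level edit distance / reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

def wer(ref, hyp):
    """Word error rate: word-level edit distance / reference word count."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

ref = "добрий день друзі"
hyp = "добрий ден друзі"   # one missing character, so one word is wrong
print(round(cer(ref, hyp), 3), round(wer(ref, hyp), 3))  # 0.059 0.333
```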
## Usage

### Using the Pipeline (Recommended)
```python
from transformers import pipeline
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="KSE-RESEARCH-Group/whisper-large-v3-dido-yvanchyk-v2",
    device=device,
)

result = pipe(
    "path/to/audio.wav",
    generate_kwargs={
        "task": "transcribe",
        "language": "ukrainian",
    },
    chunk_length_s=30,  # chunked decoding for audio longer than 30 s
)
print(result["text"])
```
### Using Transformers Directly
```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torch

model_id = "KSE-RESEARCH-Group/whisper-large-v3-dido-yvanchyk-v2"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# audio_array: 1-D float waveform sampled at 16 kHz
# (e.g. loaded with librosa.load(path, sr=16000))
input_features = processor(
    audio_array,
    sampling_rate=16000,
    return_tensors="pt",
).input_features.to(device)

predicted_ids = model.generate(
    input_features,
    task="transcribe",
    language="ukrainian",
)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
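Whisper's feature extractor expects 16 kHz mono audio, so files recorded at other rates must be resampled first. In practice use `torchaudio.functional.resample` or `librosa.resample`, which apply proper anti-aliasing; the pure-Python linear-interpolation sketch below only illustrates the idea (all names and the test signal are made up for the example):

```python
def resample_linear(samples, src_rate, dst_rate=16000):
    """Naive linear-interpolation resampling; illustrative only.
    Production code should use torchaudio or librosa, which filter
    out frequencies above the new Nyquist limit before downsampling."""
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # fractional source index
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# Downsampling a 48 kHz signal to 16 kHz yields one third as many samples.
audio_48k = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5] * 100
audio_16k = resample_linear(audio_48k, src_rate=48000)
print(len(audio_48k), len(audio_16k))  # 600 200
```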
## Infrastructure

### Hardware
| Component | Specification |
|---|---|
| GPU | NVIDIA GeForce RTX 4090 |
| GPU memory | 47.38 GB |
| GPU count | 1 |
| CUDA compute capability | 8.9 |
### Environment
| Package | Version |
|---|---|
| Python | 3.12.12 |
| PyTorch | 2.8.0+cu128 |
| CUDA | 12.8 |
| Transformers | 4.52.4 |
| Datasets | 3.6.0 |
| Evaluate | 0.4.3 |
### Training Time
| Metric | Value |
|---|---|
| Total training time | 8:55:01 |
| Training started | 2025-12-26 02:55:57 |
| Training completed | 2025-12-26 11:50:58 |
## Experiment Details
| Property | Value |
|---|---|
| Experiment ID | whisper-large-v3-011 |
| WandB project | dido-yvanchik-stt |
| WandB run | whisper-large-v3-011 |
## Citation
If you use this model, please cite:
```bibtex
@misc{KSE-RESEARCH-Group-whisper-large-v3-dido-yvanchyk-v2,
  author    = {KSE-RESEARCH-Group},
  title     = {whisper-large-v3 - Fine-tuned for Ukrainian ASR},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/KSE-RESEARCH-Group/whisper-large-v3-dido-yvanchyk-v2}
}
```
## License
This model is released under the MIT license.
## Acknowledgements