# whisper-small - Fine-tuned for Ukrainian ASR
This model is a fine-tuned version of openai/whisper-small on the Dido Yvanchyk Audio Dataset v2 for Ukrainian speech recognition.
## Model Description
This model fine-tunes openai/whisper-small on the Dido-Yvanchyk dataset.
## Training Details

### Training Data

### Training Hyperparameters
| Parameter | Value |
|---|---|
| Base model | openai/whisper-small |
| Learning rate | 1e-05 |
| Warmup steps | 500 |
| Max steps | 8000 |
| Batch size (per device) | 4 |
| Gradient accumulation steps | 4 |
| Effective batch size | 16 |
| FP16 | True |
| Gradient checkpointing | False |
| Eval strategy | steps |
| Eval/Save steps | 500 |
| Metric for best model | cer |
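The hyperparameters above map onto a `Seq2SeqTrainingArguments` configuration. The following is a hypothetical reconstruction, not the author's actual training script: the `output_dir` is a placeholder, and `load_best_model_at_end`/`greater_is_better` are assumptions implied by the "metric for best model: cer" row.

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical reconstruction of the configuration from the table above.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-uk",  # placeholder path
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=8000,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size = 4 * 4 = 16
    fp16=True,
    gradient_checkpointing=False,
    eval_strategy="steps",
    eval_steps=500,
    save_steps=500,
    metric_for_best_model="cer",
    greater_is_better=False,         # lower CER is better
    load_best_model_at_end=True,
)
```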
### Training Results
The model was trained for 8000 steps with evaluation every 500 steps. The best checkpoint was selected based on the lowest CER.
| Step | Train Loss | Eval Loss | Eval CER (%) | Eval WER (%) |
|---|---|---|---|---|
| 500 | 0.239 | 0.315 | 8.39 | 24.72 |
| 1000 | 0.132 | 0.252 | 6.33 | 19.48 |
| 1500 | 0.079 | 0.251 | 5.73 | 19.37 |
| 2000 | 0.06 | 0.263 | 5.79 | 19.34 |
| 2500 | 0.027 | 0.28 | 5.1 | 18.42 |
| 3000 | 0.015 | 0.298 | 5.32 | 18.31 |
| 3500 | 0.007 | 0.305 | 4.88 | 18.02 |
| 4000 | 0.006 | 0.316 | 4.79 | 17.42 |
| 4500 | 0.002 | 0.326 | 4.89 | 17.56 |
| 5000 | 0.002 | 0.335 | 4.75 | 17.36 |
| 5500 | 0.001 | 0.339 | 4.81 | 17.58 |
| 6000 | 0.001 | 0.345 | 4.75 | 17.43 |
| 6500 | 0.001 | 0.347 | 4.68 | 17.45 |
| 7000 | 0.001 | 0.352 | 4.77 | 17.49 |
| 7500 | 0.0 | 0.355 | 4.71 | 17.59 |
| 8000 | 0.001 | 0.357 | 4.71 | 17.6 |
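The best-checkpoint selection is simply the row with the lowest eval CER. A quick sketch, with the (step, CER %) pairs transcribed from the table above:

```python
# (step, eval CER %) pairs transcribed from the training-results table
evals = [
    (500, 8.39), (1000, 6.33), (1500, 5.73), (2000, 5.79),
    (2500, 5.1), (3000, 5.32), (3500, 4.88), (4000, 4.79),
    (4500, 4.89), (5000, 4.75), (5500, 4.81), (6000, 4.75),
    (6500, 4.68), (7000, 4.77), (7500, 4.71), (8000, 4.71),
]

# pick the checkpoint with the lowest character error rate
best_step, best_cer = min(evals, key=lambda e: e[1])
print(best_step, best_cer)  # → 6500 4.68
```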
**Best Model Checkpoint:** Step 6500
## Final Evaluation Metrics

### Validation Set

| Metric | Value |
|---|---|
| CER | 4.71% |
| WER | 17.6% |
| Eval Loss | 0.357 |
### Test Set

| Metric | Value |
|---|---|
| CER | 4.72% |
| WER | 17.84% |
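CER and WER are normalized Levenshtein edit distances over characters and words, respectively (the card's environment lists the Evaluate package, which is the likely tool used; the names below are a self-contained illustration, not the actual evaluation code):

```python
def edit_distance(ref, hyp):
    # classic dynamic-programming Levenshtein distance over two sequences
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                          # deletion
                dp[j - 1] + 1,                      # insertion
                prev + (ref[i - 1] != hyp[j - 1]),  # substitution / match
            )
            prev = cur
    return dp[n]

def cer(reference, hypothesis):
    # character error rate: edits divided by reference length in characters
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

def wer(reference, hypothesis):
    # word error rate: edits divided by reference length in words
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)
```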
## Usage

### Using Pipeline (Recommended)
```python
from transformers import pipeline
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="artem-orlovskyi/whisper-small-whisper-small-dido-yvanchyk-v2",
    device=device,
)

result = pipe(
    "path/to/audio.wav",
    generate_kwargs={
        "task": "transcribe",
        "language": "ukrainian",
    },
    chunk_length_s=30,
)
print(result["text"])
```
### Using Transformers Directly
```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torch

model_id = "artem-orlovskyi/whisper-small-whisper-small-dido-yvanchyk-v2"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# audio_array must be a 1-D float array sampled at 16 kHz, e.g.:
#   audio_array, _ = librosa.load("path/to/audio.wav", sr=16000)
input_features = processor(
    audio_array,
    sampling_rate=16000,
    return_tensors="pt",
).input_features.to(device)

predicted_ids = model.generate(
    input_features,
    task="transcribe",
    language="ukrainian",
)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
## Infrastructure

### Hardware

| Component | Specification |
|---|---|
| GPU | NVIDIA GeForce RTX 4090 |
| GPU Memory | 47.4 GB |
| GPU Count | 1 |
| CUDA Compute Capability | 8.9 |
### Environment

| Package | Version |
|---|---|
| Python | 3.12.12 |
| PyTorch | 2.8.0+cu128 |
| CUDA | 12.8 |
| Transformers | 4.57.3 |
| Datasets | 2.21.0 |
| Evaluate | 0.4.6 |
### Training Time

| Metric | Value |
|---|---|
| Total training time | 1 day, 14:19:54 |
| Training started | 2025-12-27 16:03:58 |
| Training completed | 2025-12-29 06:23:52 |
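For reference, the reported wall-clock time over the 8000 optimizer steps works out to roughly 17.25 seconds per step:

```python
from datetime import timedelta

# total training time from the table above (microseconds ignored)
total = timedelta(days=1, hours=14, minutes=19, seconds=54)

# 8000 is the max_steps value from the hyperparameter table
avg_step = total.total_seconds() / 8000
print(f"{avg_step:.2f} s/step")  # → 17.25 s/step
```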
## Experiment Details

| Property | Value |
|---|---|
| Experiment ID | whisper-small-005 |
| WandB Project | dido-yvanchik-stt |
| WandB Run | whisper-small-005 |
## Citation
If you use this model, please cite:
```bibtex
@misc{artem-orlovskyi-whisper-small-whisper-small-dido-yvanchyk-v2,
  author    = {artem-orlovskyi},
  title     = {whisper-small - Fine-tuned for Ukrainian ASR},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/artem-orlovskyi/whisper-small-whisper-small-dido-yvanchyk-v2}
}
```
## License
This model is released under the MIT license.
## Acknowledgements