# whisper-small - Fine-tuned for Ukrainian ASR

This model is a fine-tuned version of openai/whisper-small on the Dido Yvanchyk Audio Dataset v2 for Ukrainian speech recognition.

## Model Description

`openai/whisper-small` fine-tuned on the Dido-Yvanchyk Audio Dataset v2 for Ukrainian automatic speech recognition.

## Training Details

### Training Data

| Property | Value |
|---|---|
| Dataset | KSE-RESEARCH-Group/Dido-Yvanchyk-Audio-Dataset-v2 |
| Training samples | 6811 |
| Validation samples | 757 |
| Test samples | 842 |
| Language | Ukrainian |
| Max token length | 448 |

### Training Hyperparameters

| Parameter | Value |
|---|---|
| Base model | openai/whisper-small |
| Learning rate | 1e-05 |
| Warmup steps | 500 |
| Max steps | 8000 |
| Batch size (per device) | 4 |
| Gradient accumulation steps | 4 |
| Effective batch size | 16 |
| FP16 | True |
| Gradient checkpointing | False |
| Eval strategy | steps |
| Eval/Save steps | 500 |
| Metric for best model | CER |
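The hyperparameters above correspond roughly to the following `Seq2SeqTrainingArguments` configuration. This is an illustrative sketch, not the actual training script (which is not part of this card); the output directory name is a placeholder.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the configuration implied by the table above (assumed, not the
# published training script).
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-dido-yvanchyk-v2",  # placeholder path
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=8000,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size = 4 * 4 = 16
    fp16=True,
    gradient_checkpointing=False,
    eval_strategy="steps",
    eval_steps=500,
    save_steps=500,
    metric_for_best_model="cer",
    greater_is_better=False,         # lower CER is better
    load_best_model_at_end=True,
    predict_with_generate=True,
)
```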

### Training Results

The model was trained for 8000 steps with evaluation every 500 steps. The best checkpoint was selected based on the lowest CER.

| Step | Train Loss | Eval Loss | Eval CER (%) | Eval WER (%) |
|---|---|---|---|---|
| 500 | 0.239 | 0.315 | 8.39 | 24.72 |
| 1000 | 0.132 | 0.252 | 6.33 | 19.48 |
| 1500 | 0.079 | 0.251 | 5.73 | 19.37 |
| 2000 | 0.060 | 0.263 | 5.79 | 19.34 |
| 2500 | 0.027 | 0.280 | 5.10 | 18.42 |
| 3000 | 0.015 | 0.298 | 5.32 | 18.31 |
| 3500 | 0.007 | 0.305 | 4.88 | 18.02 |
| 4000 | 0.006 | 0.316 | 4.79 | 17.42 |
| 4500 | 0.002 | 0.326 | 4.89 | 17.56 |
| 5000 | 0.002 | 0.335 | 4.75 | 17.36 |
| 5500 | 0.001 | 0.339 | 4.81 | 17.58 |
| 6000 | 0.001 | 0.345 | 4.75 | 17.43 |
| 6500 | 0.001 | 0.347 | 4.68 | 17.45 |
| 7000 | 0.001 | 0.352 | 4.77 | 17.49 |
| 7500 | 0.000 | 0.355 | 4.71 | 17.59 |
| 8000 | 0.001 | 0.357 | 4.71 | 17.60 |

**Best Model Checkpoint:** Step 6500
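The best-checkpoint selection can be reproduced directly from the table above: since the selection metric is CER and lower is better, the best step is the one with the minimum eval CER.

```python
# (step, eval CER %) pairs copied from the training results table above
eval_cer = {
    500: 8.39, 1000: 6.33, 1500: 5.73, 2000: 5.79, 2500: 5.10,
    3000: 5.32, 3500: 4.88, 4000: 4.79, 4500: 4.89, 5000: 4.75,
    5500: 4.81, 6000: 4.75, 6500: 4.68, 7000: 4.77, 7500: 4.71,
    8000: 4.71,
}

# Select the checkpoint with the lowest CER (lower is better)
best_step = min(eval_cer, key=eval_cer.get)
print(best_step, eval_cer[best_step])  # 6500 4.68
```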

## Final Evaluation Metrics

### Validation Set

| Metric | Value |
|---|---|
| CER | 4.71% |
| WER | 17.6% |
| Eval Loss | 0.357 |

### Test Set

| Metric | Value |
|---|---|
| CER | 4.72% |
| WER | 17.84% |
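Both metrics are edit-distance rates: the Levenshtein distance between hypothesis and reference, divided by the reference length — over words for WER, over characters for CER. The evaluation here used the `evaluate` library listed under Environment; the following is only a minimal pure-Python sketch of the definitions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, via dynamic programming."""
    n = len(hyp)
    dp = list(range(n + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: char-level edit distance / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)
```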

## Usage

### Using Pipeline (Recommended)

```python
from transformers import pipeline
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="artem-orlovskyi/whisper-small-whisper-small-dido-yvanchyk-v2",
    device=device,
)

result = pipe(
    "path/to/audio.wav",
    generate_kwargs={
        "task": "transcribe",
        "language": "ukrainian",
    },
    chunk_length_s=30,
)
print(result["text"])
```

### Using Transformers Directly

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torch

model_id = "artem-orlovskyi/whisper-small-whisper-small-dido-yvanchyk-v2"

processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# Move to GPU if available
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Process audio (audio_array should be a NumPy array sampled at 16 kHz)
input_features = processor(
    audio_array,
    sampling_rate=16000,
    return_tensors="pt",
).input_features.to(device)

# Generate transcription
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```

## Infrastructure

### Hardware

| Component | Specification |
|---|---|
| GPU | NVIDIA GeForce RTX 4090 |
| GPU Memory | 47.4 GB |
| GPU Count | 1 |
| CUDA Compute Capability | 8.9 |

### Environment

| Package | Version |
|---|---|
| Python | 3.12.12 |
| PyTorch | 2.8.0+cu128 |
| CUDA | 12.8 |
| Transformers | 4.57.3 |
| Datasets | 2.21.0 |
| Evaluate | 0.4.6 |

### Training Time

| Metric | Value |
|---|---|
| Total training time | 1 day, 14:19:54 |
| Training started | 2025-12-27 16:03:58 |
| Training completed | 2025-12-29 06:23:52 |
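From the timestamps above and the hyperparameter table, rough throughput figures can be derived (a back-of-the-envelope sketch; all inputs are taken from the tables in this card):

```python
from datetime import datetime

# Timestamps from the training-time table above
start = datetime(2025, 12, 27, 16, 3, 58)
end = datetime(2025, 12, 29, 6, 23, 52)
elapsed = (end - start).total_seconds()  # 1 day 14:19:54 = 137994 s

steps = 8000
effective_batch = 16      # per-device batch 4 * gradient accumulation 4
train_samples = 6811

steps_per_sec = steps / elapsed
epochs = steps * effective_batch / train_samples

print(f"{steps_per_sec:.3f} steps/s, ~{epochs:.1f} epochs")  # 0.058 steps/s, ~18.8 epochs
```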

## Experiment Details

| Property | Value |
|---|---|
| Experiment ID | whisper-small-005 |
| WandB Project | dido-yvanchik-stt |
| WandB Run | whisper-small-005 |

## Citation

If you use this model, please cite:

```bibtex
@misc{artem-orlovskyi-whisper-small-whisper-small-dido-yvanchyk-v2,
  author = {artem-orlovskyi},
  title = {whisper-small - Fine-tuned for Ukrainian ASR},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/artem-orlovskyi/whisper-small-whisper-small-dido-yvanchyk-v2}
}
```

## License

This model is released under the MIT license.
