# Wav2Vec2-XLSR-300M-UA-uk-dido-tuned

This model is a fine-tuned version of Yehor/w2v-xls-r-uk on the Dido Yvanchyk Audio Dataset v2 for Ukrainian speech recognition.
## Model Description

This model uses the Yehor/w2v-xls-r-uk architecture, fine-tuned specifically for the Hutsul dialect of Ukrainian on the dataset described below. It transcribes audio input into Hutsul-dialect text with high accuracy.
## Training Details

### Training Data
| Property | Value |
|---|---|
| Dataset | Dido Yvanchyk Audio Dataset v2 |
| Training samples | 6,813 |
| Test samples | 787 |
| Language | uk |
| Epochs | 50 |
### Training Hyperparameters
| Parameter | Value |
|---|---|
| Base model | Yehor/w2v-xls-r-uk |
| Learning rate | 1e-4 |
| Warmup steps | 1000 |
| Batch size (per device) | 8 |
| Gradient accumulation steps | 8 |
| Effective batch size | 64 |
| FP16 | True |
| Gradient checkpointing | True |
| Eval strategy | epoch |
| Metric for best model | cer |
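The effective batch size in the table follows directly from the per-device batch size, the gradient accumulation steps, and the GPU count. A minimal sketch of that arithmetic, with values taken from the tables in this card:

```python
# Effective batch size = per-device batch size x gradient accumulation steps x GPU count.
per_device_batch_size = 8
gradient_accumulation_steps = 8
num_gpus = 1  # single GPU, per the Infrastructure section

effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 64
```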
## Final Evaluation Metrics
The model was evaluated on the test split of the Dido dataset after 50 epochs of training.
| Metric | Value |
|---|---|
| WER | 13.61% |
| CER | 2.43% |
| Test Loss | 0.2063 |
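For reference, both metrics are normalized edit distances: WER counts word-level edits, CER character-level edits, each divided by the reference length (training used the JiWER package for this). A minimal pure-Python sketch, using hypothetical example strings:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # min of: deletion, insertion, substitution/match
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def wer(ref, hyp):
    """Word error rate: word edits divided by reference word count."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

def cer(ref, hyp):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)

print(wer("the cat sat", "the cat sit"))  # 0.333... (1 substitution / 3 words)
print(cer("abcd", "abed"))                # 0.25 (1 substitution / 4 characters)
```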
## Usage

### Using Pipeline (Recommended)
```python
from transformers import pipeline
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
pipe = pipeline(
    "automatic-speech-recognition",
    model="KSE-RESEARCH-Group/Wav2Vec2-XLSR-300M-UA-uk-dido-tuned",
    device=device,
)

# Note: Wav2Vec2 expects 16 kHz audio.
# You can pass a path to a file or a dataset item.
result = pipe("path/to/audio.wav")
print(result["text"])
```
### Using Transformers Directly
```python
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
import librosa

model_id = "KSE-RESEARCH-Group/Wav2Vec2-XLSR-300M-UA-uk-dido-tuned"
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# Move to GPU if available
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Load audio (must be 16 kHz for Wav2Vec2)
audio_array, sr = librosa.load("path/to/audio.wav", sr=16000)

# Process audio
input_values = processor(
    audio_array,
    sampling_rate=16000,
    return_tensors="pt",
    padding=True,
).input_values.to(device)

# Run CTC inference and greedily decode
with torch.no_grad():
    logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
```
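For intuition, the argmax-then-`batch_decode` step above performs greedy CTC decoding: repeated per-frame predictions are collapsed and the blank token is dropped. A minimal sketch of that collapse rule, using a hypothetical frame sequence and blank id:

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse consecutive repeats, then drop blanks (greedy CTC decoding)."""
    out = []
    prev = None
    for t in frame_ids:
        if t != prev and t != blank_id:
            out.append(t)
        prev = t
    return out

# Hypothetical per-frame argmax ids (0 = blank)
frames = [5, 5, 0, 9, 9, 0, 0, 12]
print(ctc_greedy_collapse(frames))  # [5, 9, 12]

# A blank between identical ids preserves genuinely doubled tokens:
print(ctc_greedy_collapse([5, 0, 5]))  # [5, 5]
```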
## Infrastructure

### Hardware
| Component | Specification |
|---|---|
| GPU | NVIDIA GeForce RTX 4090 |
| GPU Memory | 48 GB |
| GPU Count | 1 |
### Environment
| Package | Version |
|---|---|
| Python | 3.11 |
| CUDA | 12.4 |
| PyTorch | 2.5.1 |
| Torchaudio | 2.5.1 |
| Transformers | 4.48.2 |
| Accelerate | 1.12.0 |
| Datasets | 4.5.0 |
| Evaluate | 0.4.6 |
| JiWER | 4.0.0 |
| PyCTCDecode | 0.5.0 |
| WandB | 0.24.2 |
## Training Stats
| Metric | Value |
|---|---|
| Total training time | ~5h 49m |
| Total FLOS | 6.92e19 |
| Final Train Loss | 0.4393 |
## Citation
If you use this model, please cite:
```bibtex
@misc{wav2vec2-xlsr-300m-ua-uk-dido-tuned,
  author = {KSE RESEARCH Group},
  title = {Wav2Vec2-XLSR-300M-UA-uk-dido-tuned},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/KSE-RESEARCH-Group/Wav2Vec2-XLSR-300M-UA-uk-dido-tuned}
}
```
## License
This model is released under the MIT license.
## Acknowledgements
- Base model: Yehor/w2v-xls-r-uk
- Dataset: Dido Yvanchyk Audio Dataset v2
- Training infrastructure: NVIDIA GeForce RTX 4090