# Wav2Vec2-XLSR-300M-UA-uk-dido-tuned

This model is a fine-tuned version of Yehor/w2v-xls-r-uk, trained on the Dido Yvanchyk Audio Dataset v2 for Ukrainian speech recognition.

## Model Description

This model uses the Yehor/w2v-xls-r-uk architecture, fine-tuned specifically for the Hutsul dialect of Ukrainian on the Dido Yvanchyk Audio Dataset v2. It transcribes audio input into Hutsul-dialect text with high accuracy (see the evaluation metrics below).

## Training Details

### Training Data

| Property | Value |
|----------|-------|
| Dataset | Dido Yvanchyk Audio Dataset v2 |
| Training samples | 6,813 |
| Test samples | 787 |
| Language | uk |
| Epochs | 50 |

### Training Hyperparameters

| Parameter | Value |
|-----------|-------|
| Base model | Yehor/w2v-xls-r-uk |
| Learning rate | 1e-4 |
| Warmup steps | 1000 |
| Batch size (per device) | 8 |
| Gradient accumulation steps | 8 |
| Effective batch size | 64 |
| FP16 | True |
| Gradient checkpointing | True |
| Eval strategy | epoch |
| Metric for best model | cer |
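The effective batch size follows from the per-device batch size and gradient accumulation (values from the table above, single GPU):

```python
# Effective batch size = per-device batch size x gradient accumulation steps x GPU count
per_device_batch_size = 8
gradient_accumulation_steps = 8
num_gpus = 1

effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 64
```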

## Final Evaluation Metrics

The model was evaluated on the test split of the Dido dataset after 50 epochs of training.

| Metric | Value |
|--------|-------|
| WER | 13.61% |
| CER | 2.43% |
| Test Loss | 0.2063 |
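WER and CER are edit-distance metrics: the Levenshtein distance between hypothesis and reference, normalized by reference length, counted over words (WER) or characters (CER). The evaluation itself used JiWER; the stdlib-only sketch below (with hypothetical example strings) just illustrates the definition:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance via dynamic programming over a single row."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # dp[j] = D[i-1][j], dp[j-1] = D[i][j-1], prev = D[i-1][j-1]
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

ref, hyp = "добрий день", "добрий деень"
print(wer(ref, hyp))  # 0.5  (1 of 2 words wrong)
print(cer(ref, hyp))  # ~0.09 (1 inserted character over 11)
```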

## Usage

### Using the Pipeline (Recommended)

```python
from transformers import pipeline
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="KSE-RESEARCH-Group/Wav2Vec2-XLSR-300M-UA-uk-dido-tuned",
    device=device,
)

# Note: Wav2Vec2 usually requires 16 kHz audio
# You can pass a path to a file or a dataset item
result = pipe("path/to/audio.wav")
print(result["text"])
```
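Wav2Vec2 expects 16 kHz mono input, so audio recorded at another rate must be resampled first. A naive linear-interpolation resampler in NumPy illustrates the idea; in practice you would use librosa or torchaudio, which apply proper anti-aliasing:

```python
import numpy as np

def resample_linear(audio, orig_sr, target_sr=16_000):
    """Naive linear-interpolation resampling (illustration only)."""
    duration = len(audio) / orig_sr
    n_target = int(round(duration * target_sr))
    old_t = np.linspace(0.0, duration, num=len(audio), endpoint=False)
    new_t = np.linspace(0.0, duration, num=n_target, endpoint=False)
    return np.interp(new_t, old_t, audio).astype(np.float32)

# One second of 44.1 kHz audio becomes 16,000 samples at 16 kHz
audio_44k = np.zeros(44_100, dtype=np.float32)
audio_16k = resample_linear(audio_44k, orig_sr=44_100)
print(audio_16k.shape)  # (16000,)
```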

### Using Transformers Directly

```python
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
import librosa

model_id = "KSE-RESEARCH-Group/Wav2Vec2-XLSR-300M-UA-uk-dido-tuned"

processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# Move to GPU if available
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

# Load audio (must be 16 kHz mono for Wav2Vec2)
audio_array, sr = librosa.load("path/to/audio.wav", sr=16000)

# Process audio into model inputs
input_values = processor(
    audio_array,
    sampling_rate=16000,
    return_tensors="pt",
    padding=True,
).input_values.to(device)

# Run CTC inference and decode greedily
with torch.no_grad():
    logits = model(input_values).logits

predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]

print(transcription)
```
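The argmax step above is greedy CTC decoding: the per-frame best token ids are collapsed by merging consecutive repeats and dropping blanks, which is what `processor.batch_decode` does internally. A toy illustration with a hypothetical four-symbol vocabulary:

```python
BLANK = 0
VOCAB = {1: "т", 2: "а", 3: "к"}  # hypothetical toy vocabulary

def ctc_greedy_collapse(frame_ids):
    """Merge consecutive repeats, then drop CTC blanks."""
    out, prev = [], None
    for t in frame_ids:
        if t != prev and t != BLANK:
            out.append(VOCAB[t])
        prev = t
    return "".join(out)

# Per-frame argmax ids for 8 audio frames
print(ctc_greedy_collapse([1, 1, 0, 2, 2, 0, 3, 3]))  # "так"
```

Note that a blank between two identical ids keeps both (e.g. `[1, 0, 1]` decodes to "тт"), which is how CTC represents doubled letters.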

## Infrastructure

### Hardware

| Component | Specification |
|-----------|---------------|
| GPU | NVIDIA GeForce RTX 4090 |
| GPU Memory | 48 GB |
| GPU Count | 1 |

### Environment

| Package | Version |
|---------|---------|
| Python | 3.11 |
| CUDA | 12.4 |
| PyTorch | 2.5.1 |
| Torchaudio | 2.5.1 |
| Transformers | 4.48.2 |
| Accelerate | 1.12.0 |
| Datasets | 4.5.0 |
| Evaluate | 0.4.6 |
| JiWER | 4.0.0 |
| PyCTCDecode | 0.5.0 |
| WandB | 0.24.2 |

### Training Stats

| Metric | Value |
|--------|-------|
| Total training time | ~5h 49m |
| Total FLOS | 6.92e19 |
| Final Train Loss | 0.4393 |
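From these totals one can estimate the average training throughput (a back-of-envelope figure; wall-clock time includes evaluation and checkpointing, so the true compute rate during training steps is somewhat higher):

```python
total_flos = 6.92e19                 # from the table above
wall_clock_s = 5 * 3600 + 49 * 60    # ~5h 49m in seconds

throughput = total_flos / wall_clock_s
print(f"{throughput:.2e} FLOP/s")  # roughly 3.3e15
```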

## Citation

If you use this model, please cite:

```bibtex
@misc{KSE-RESEARCH-Group-Wav2Vec2-XLSR-300M-UA-uk-dido-tuned,
  author    = {KSE RESEARCH Group},
  title     = {Wav2Vec2-XLSR-300M-UA-uk-dido-tuned},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/KSE-RESEARCH-Group/Wav2Vec2-XLSR-300M-UA-uk-dido-tuned}
}
```

## License

This model is released under the MIT license.

## Acknowledgements

- Base model: Yehor/w2v-xls-r-uk
- Dataset: Dido Yvanchyk Audio Dataset v2
- Training infrastructure: NVIDIA GeForce RTX 4090