# Wav2Vec2-XLSR-300M-UA-uk-dido-tuned

This model is a fine-tuned version of Yehor/w2v-xls-r-uk, trained on the Dido Yvanchyk Audio Dataset v2 for Ukrainian speech recognition.

## Model Description

This model uses the Yehor/w2v-xls-r-uk architecture, fine-tuned specifically for the Hutsul dialect of Ukrainian on the Dido Yvanchyk Audio Dataset v2. It transcribes audio input into Hutsul-dialect text with high accuracy (see the evaluation metrics below).

## Training Details

### Training Data

| Property | Value |
|----------|-------|
| Dataset | Dido Yvanchyk Audio Dataset v2 |
| Training samples | 6,813 |
| Test samples | 787 |
| Language | uk |
| Epochs | 50 |

### Training Hyperparameters

| Parameter | Value |
|-----------|-------|
| Base model | Yehor/w2v-xls-r-uk |
| Learning rate | 1e-4 |
| Warmup steps | 1000 |
| Batch size (per device) | 8 |
| Gradient accumulation steps | 8 |
| Effective batch size | 64 |
| FP16 | True |
| Gradient checkpointing | True |
| Eval strategy | epoch |
| Metric for best model | cer |
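The effective batch size follows from the per-device batch size and gradient accumulation (values from the table above, single GPU):

```python
# Effective batch size = per-device batch size x gradient accumulation steps x GPU count
per_device_batch_size = 8
gradient_accumulation_steps = 8
num_gpus = 1

effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 64
```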

## Final Evaluation Metrics

The model was evaluated on the test split of the Dido dataset after 50 epochs of training.

| Metric | Value |
|--------|-------|
| WER | 13.61% |
| CER | 2.43% |
| Test Loss | 0.2063 |
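WER and CER are edit-distance metrics: the Levenshtein distance between hypothesis and reference, normalized by reference length, counted over words (WER) or characters (CER). The evaluation itself used JiWER; the stdlib-only sketch below (with hypothetical example strings) just illustrates the definition:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance via dynamic programming over a single row."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # dp[j] = D[i-1][j], dp[j-1] = D[i][j-1], prev = D[i-1][j-1]
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

ref, hyp = "добрий день", "добрий деень"
print(wer(ref, hyp))  # 0.5  (1 of 2 words wrong)
print(cer(ref, hyp))  # ~0.09 (1 inserted character over 11)
```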

## Usage

### Using the Pipeline (Recommended)

```python
from transformers import pipeline
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="KSE-RESEARCH-Group/Wav2Vec2-XLSR-300M-UA-uk-dido-tuned",
    device=device,
)

# Note: Wav2Vec2 usually requires 16 kHz audio
# You can pass a path to a file or a dataset item
result = pipe("path/to/audio.wav")
print(result["text"])
```
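Wav2Vec2 expects 16 kHz mono input, so audio recorded at another rate must be resampled first. A naive linear-interpolation resampler in NumPy illustrates the idea; in practice you would use librosa or torchaudio, which apply proper anti-aliasing:

```python
import numpy as np

def resample_linear(audio, orig_sr, target_sr=16_000):
    """Naive linear-interpolation resampling (illustration only)."""
    duration = len(audio) / orig_sr
    n_target = int(round(duration * target_sr))
    old_t = np.linspace(0.0, duration, num=len(audio), endpoint=False)
    new_t = np.linspace(0.0, duration, num=n_target, endpoint=False)
    return np.interp(new_t, old_t, audio).astype(np.float32)

# One second of 44.1 kHz audio becomes 16,000 samples at 16 kHz
audio_44k = np.zeros(44_100, dtype=np.float32)
audio_16k = resample_linear(audio_44k, orig_sr=44_100)
print(audio_16k.shape)  # (16000,)
```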

### Using Transformers Directly

```python
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
import librosa

model_id = "KSE-RESEARCH-Group/Wav2Vec2-XLSR-300M-UA-uk-dido-tuned"

processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# Move to GPU if available
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

# Load audio (must be 16 kHz mono for Wav2Vec2)
audio_array, sr = librosa.load("path/to/audio.wav", sr=16000)

# Process audio into model inputs
input_values = processor(
    audio_array,
    sampling_rate=16000,
    return_tensors="pt",
    padding=True,
).input_values.to(device)

# Run CTC inference and decode greedily
with torch.no_grad():
    logits = model(input_values).logits

predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]

print(transcription)
```
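The argmax step above is greedy CTC decoding: the per-frame best token ids are collapsed by merging consecutive repeats and dropping blanks, which is what `processor.batch_decode` does internally. A toy illustration with a hypothetical four-symbol vocabulary:

```python
BLANK = 0
VOCAB = {1: "т", 2: "а", 3: "к"}  # hypothetical toy vocabulary

def ctc_greedy_collapse(frame_ids):
    """Merge consecutive repeats, then drop CTC blanks."""
    out, prev = [], None
    for t in frame_ids:
        if t != prev and t != BLANK:
            out.append(VOCAB[t])
        prev = t
    return "".join(out)

# Per-frame argmax ids for 8 audio frames
print(ctc_greedy_collapse([1, 1, 0, 2, 2, 0, 3, 3]))  # "так"
```

Note that a blank between two identical ids keeps both (e.g. `[1, 0, 1]` decodes to "тт"), which is how CTC represents doubled letters.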

## Infrastructure

### Hardware

| Component | Specification |
|-----------|---------------|
| GPU | NVIDIA GeForce RTX 4090 |
| GPU Memory | 48 GB |
| GPU Count | 1 |

### Environment

| Package | Version |
|---------|---------|
| Python | 3.11 |
| CUDA | 12.4 |
| PyTorch | 2.5.1 |
| Torchaudio | 2.5.1 |
| Transformers | 4.48.2 |
| Accelerate | 1.12.0 |
| Datasets | 4.5.0 |
| Evaluate | 0.4.6 |
| JiWER | 4.0.0 |
| PyCTCDecode | 0.5.0 |
| WandB | 0.24.2 |

### Training Stats

| Metric | Value |
|--------|-------|
| Total training time | ~5h 49m |
| Total FLOS | 6.92e19 |
| Final Train Loss | 0.4393 |
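From these totals one can estimate the average training throughput (a back-of-envelope figure; wall-clock time includes evaluation and checkpointing, so the true compute rate during training steps is somewhat higher):

```python
total_flos = 6.92e19                 # from the table above
wall_clock_s = 5 * 3600 + 49 * 60    # ~5h 49m in seconds

throughput = total_flos / wall_clock_s
print(f"{throughput:.2e} FLOP/s")  # roughly 3.3e15
```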

## Citation

If you use this model, please cite:

```bibtex
@misc{KSE-RESEARCH-Group-Wav2Vec2-XLSR-300M-UA-uk-dido-tuned,
  author    = {KSE RESEARCH Group},
  title     = {Wav2Vec2-XLSR-300M-UA-uk-dido-tuned},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/KSE-RESEARCH-Group/Wav2Vec2-XLSR-300M-UA-uk-dido-tuned}
}
```

## License

This model is released under the MIT license.

## Acknowledgements

- Base model: Yehor/w2v-xls-r-uk
- Dataset: Dido Yvanchyk Audio Dataset v2
- Training infrastructure: NVIDIA GeForce RTX 4090