# Hungarian ASR Model fine-tuned on Common Voice 24.0

## Model Description
This model is a Hungarian Automatic Speech Recognition (ASR) model fine-tuned on the Common Voice 24.0 Hungarian dataset (scripted speech).
It is based on the multilingual XLSR Wav2Vec2 speech representation model, optimized for Hungarian speech recognition tasks.
Fine-tuning was performed with a CTC training objective for robust character sequence prediction from audio inputs.
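For readers unfamiliar with the CTC objective, the sketch below shows how a CTC loss is computed in PyTorch. The shapes, vocabulary size, and blank index are illustrative assumptions, not the exact training configuration of this model:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical shapes: T=50 encoder frames, batch N=2, vocab C=32 (blank=0)
T, N, C = 50, 2, 32
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)  # model outputs
targets = torch.randint(1, C, (N, 10), dtype=torch.long)  # label ids, blank excluded
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

# CTC marginalizes over all frame-level alignments of the target sequence
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```

Because CTC sums over alignments, the model never needs frame-level transcripts; only the character sequence per utterance is required.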
## Intended Uses

### Primary Use Cases
- Transcribing Hungarian speech (scripted, read speech).
- Research and benchmarking in Hungarian ASR.
- Offline or API-based speech recognition tasks.
- Integration into speech-to-text pipelines that require open-source Hungarian models.
### Example usage
```python
import torch
import soundfile as sf
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# =========================================================
# CONFIG
# =========================================================
CHECKPOINT_PATH = "GaborMadarasz/wav2vec2-large-xlsr-53-hungarian1"
WAV_PATH = "sample.wav"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# =========================================================
# MODEL LOAD
# =========================================================
print("Loading model...")
processor = Wav2Vec2Processor.from_pretrained(CHECKPOINT_PATH)
model = Wav2Vec2ForCTC.from_pretrained(CHECKPOINT_PATH)
model.to(DEVICE)
model.eval()

# =========================================================
# INFERENCE ENGINE
# =========================================================
class RobustInferenceEngine:
    def __init__(self, model, processor, device="cpu"):
        self.model = model
        self.processor = processor
        self.device = device
        # Tokens that should never appear in the transcript
        self.special_tokens = {
            "[PAD]", "[UNK]",
            "<pad>", "<unk>",
            "<s>", "</s>",
        }
        self.tokenizer = processor.tokenizer

    @torch.no_grad()
    def decode_tokens(self, pred_ids: torch.Tensor) -> str:
        tokens = self.tokenizer.convert_ids_to_tokens(pred_ids.tolist())
        prev_token = None
        output_chars = []
        for token in tokens:
            # Special tokens (including the CTC blank) are dropped, but they
            # still reset the repeat tracker so genuine doubled letters survive.
            if token in self.special_tokens:
                prev_token = token
                continue
            # CTC collapse: merge consecutive repeats of the same token
            if token == prev_token:
                continue
            if token.startswith("▁"):
                output_chars.append(" ")
                output_chars.append(token[1:])
            else:
                output_chars.append(token)
            prev_token = token
        text = "".join(output_chars)
        text = text.replace("▁", " ")
        return text.strip()

    @torch.no_grad()
    def transcribe_file(self, wav_path: str) -> str:
        audio_input, sr = sf.read(wav_path)
        # Mono conversion
        if len(audio_input.shape) > 1:
            audio_input = audio_input.mean(axis=1)
        # Note: the feature extractor expects 16 kHz audio; resample
        # beforehand (e.g. with torchaudio or librosa) if sr differs.
        inputs = self.processor(
            audio_input,
            sampling_rate=sr,
            return_tensors="pt",
            padding=True,
        )
        inputs = {k: v.to(self.device) for k, v in inputs.items()}
        logits = self.model(**inputs).logits
        pred_ids = torch.argmax(logits, dim=-1)
        return self.decode_tokens(pred_ids[0])

# =========================================================
# RUN SAMPLE INFERENCE
# =========================================================
engine = RobustInferenceEngine(
    model=model,
    processor=processor,
    device=DEVICE,
)
print("Transcribing sample.wav ...")
text = engine.transcribe_file(WAV_PATH)
print("\nTranscription:")
print(text)
```
## Training Details

- Base model: facebook/wav2vec2-xls-r-300m
- Fine-tuned on: Common Voice 24.0 Hungarian (scripted speech)
- Loss: CTC (Connectionist Temporal Classification)
- Decoder: greedy CTC decoding
## Benchmark Comparison (Hungarian – Common Voice 24.0)
Evaluated on the Common Voice 24.0 Hungarian test set.
| Model | Architecture | Parameters | Decoding | Hungarian Adaptation | WER (CV HU) | Evaluation Source |
|---|---|---|---|---|---|---|
| This model (XLSR fine-tuned) | Wav2Vec2-CTC (encoder-only) | ~300M | Greedy CTC | Fine-tuned on CV 24.0 HU | 0.1647 | Measured (this work) |
| Whisper-Large | Encoder-Decoder (seq2seq Transformer) | ~1.55B | Beam search | Multilingual pretraining | ~0.08–0.12 | Reported (public benchmarks) |
| Google Speech-to-Text | Proprietary hybrid DNN/Transformer | Not disclosed | Internal LM + beam | Production-scale multilingual | ~0.07–0.12 | Reported (vendor benchmarks) |
| XLSR (base, not fine-tuned) | Wav2Vec2-CTC (encoder-only) | ~300M | Greedy CTC | None | ~0.35+ | Reported (zero-shot HU) |
These results indicate strong character-level accuracy and reasonable word-level transcription performance for scripted Hungarian speech.
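The WER figures above follow the standard definition: word-level edit distance divided by the number of reference words. A minimal, dependency-free sketch (libraries such as `jiwer` compute the same metric with additional normalization options):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("ez egy magyar mondat", "ez egy magyar mondat"))  # → 0.0
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why it is reported as a ratio rather than a percentage capped at 100.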
## Limitations and Bias

- This model was trained and evaluated on scripted speech; performance may degrade on spontaneous, conversational, or noisy recordings.
- Greedy CTC decoding is used; incorporating a language model or beam-search decoder can improve quality.
- The Common Voice dataset reflects the demographics of its volunteer contributors, which may introduce dataset bias.
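To illustrate the beam-search alternative to greedy decoding, here is a minimal, LM-free CTC prefix beam search over per-frame log-probabilities. It is a sketch for toy-sized inputs, not a production decoder; libraries such as `pyctcdecode` implement the same idea with n-gram language-model fusion on top:

```python
import math
from collections import defaultdict

def logsum(a: float, b: float) -> float:
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def ctc_beam_search(log_probs, beam_width=3, blank=0):
    """CTC prefix beam search.

    log_probs: sequence of per-frame lists of vocabulary log-probabilities.
    Returns the most probable collapsed label sequence as a tuple of ids.
    """
    # Each prefix tracks two scores: paths ending in blank vs. non-blank.
    beams = {(): (0.0, -math.inf)}
    for frame in log_probs:
        next_beams = defaultdict(lambda: (-math.inf, -math.inf))
        for prefix, (p_b, p_nb) in beams.items():
            for c, lp in enumerate(frame):
                if c == blank:
                    b, nb = next_beams[prefix]
                    next_beams[prefix] = (logsum(b, logsum(p_b, p_nb) + lp), nb)
                elif prefix and prefix[-1] == c:
                    # Repeated symbol: it extends the prefix only after a
                    # blank; otherwise it merges into the same prefix.
                    ext = prefix + (c,)
                    b, nb = next_beams[ext]
                    next_beams[ext] = (b, logsum(nb, p_b + lp))
                    b, nb = next_beams[prefix]
                    next_beams[prefix] = (b, logsum(nb, p_nb + lp))
                else:
                    ext = prefix + (c,)
                    b, nb = next_beams[ext]
                    next_beams[ext] = (b, logsum(nb, logsum(p_b, p_nb) + lp))
        # Keep only the top beam_width prefixes by total probability
        beams = dict(sorted(next_beams.items(),
                            key=lambda kv: logsum(*kv[1]),
                            reverse=True)[:beam_width])
    return max(beams, key=lambda k: logsum(*beams[k]))
```

Unlike greedy argmax decoding, this sums probability mass over all alignments that collapse to the same prefix, which is where language-model rescoring would also hook in.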
## Comparison to Other Models
The fine-tuned model offers a competitive open-source alternative, though proprietary cloud solutions and larger seq2seq ASR models often yield lower WER. Performance will vary depending on domain and recording conditions.
## Ethical Considerations

This model should not be used for sensitive or safety-critical applications without thorough domain-specific evaluation. The model is released under the Apache-2.0 license.
## Citation

If you use this model in research or applications, please cite:

```bibtex
@misc{madarasz2026hungarianasr,
  title={Hungarian ASR model fine-tuned on Common Voice 24.0},
  author={Gabor Madarasz},
  year={2026},
  howpublished={Hugging Face Model Hub, GaborMadarasz/wav2vec2-large-xlsr-53-hungarian1},
}
```