Hungarian ASR Model fine-tuned on Common Voice 24.0

Model Description

This model is a Hungarian Automatic Speech Recognition (ASR) model fine-tuned on the Common Voice 24.0 Hungarian dataset (scripted speech).
It is based on the multilingual XLSR Wav2Vec2 speech representation model, optimized for Hungarian speech recognition tasks.

Fine-tuning was performed with a CTC training objective for robust character sequence prediction from audio inputs.
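As a concrete illustration of what CTC prediction means at decode time (a minimal sketch with an illustrative toy vocabulary, not this checkpoint's actual one), greedy CTC decoding collapses repeated per-frame predictions and drops blank tokens:

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse a per-frame CTC prediction: merge repeats, then drop blanks."""
    out = []
    prev = None
    for i in frame_ids:
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    return out

# Toy vocabulary and frame sequence, for illustration only
vocab = {0: "<blank>", 1: "h", 2: "a", 3: "z"}
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3]
print("".join(vocab[i] for i in ctc_greedy_collapse(frames)))  # haz
```

Note that a blank between two identical symbols keeps both: `[1, 0, 1]` decodes to two characters, which is how CTC represents doubled letters.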

Intended Uses

Primary Use Cases

  • Transcribing Hungarian speech (scripted, read speech).
  • Research and benchmarking in Hungarian ASR.
  • Offline or API-based speech recognition tasks.
  • Integration into speech-to-text pipelines that require open-source Hungarian models.

Example usage

import torch
import soundfile as sf
import numpy as np
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# =========================================================
# CONFIG
# =========================================================

CHECKPOINT_PATH = "GaborMadarasz/wav2vec2-large-xlsr-53-hungarian1"
WAV_PATH = "sample.wav"

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# =========================================================
# MODEL LOAD
# =========================================================

print("Loading model...")

processor = Wav2Vec2Processor.from_pretrained(CHECKPOINT_PATH)

model = Wav2Vec2ForCTC.from_pretrained(CHECKPOINT_PATH)
model.to(DEVICE)
model.eval()

# =========================================================
# INFERENCE ENGINE
# =========================================================

class RobustInferenceEngine:

    def __init__(self, model, processor, device="cpu"):
        self.model = model
        self.processor = processor
        self.device = device

        self.special_tokens = set([
            "[PAD]", "[UNK]",
            "<pad>", "<unk>",
            "<s>", "</s>"
        ])

        self.tokenizer = processor.tokenizer

    @torch.no_grad()
    def decode_tokens(self, pred_ids: torch.Tensor) -> str:

        tokens = self.tokenizer.convert_ids_to_tokens(
            pred_ids.tolist()
        )

        prev_token = None
        output_chars = []

        for token in tokens:

            if token in self.special_tokens:
                prev_token = token
                continue

            if token == prev_token:
                continue

            if token == "|":
                # Wav2Vec2 character tokenizers mark word boundaries with "|"
                output_chars.append(" ")
            elif token.startswith("▁"):
                # SentencePiece-style vocabularies mark them with "▁" instead
                output_chars.append(" ")
                output_chars.append(token[1:])
            else:
                output_chars.append(token)

            prev_token = token

        text = "".join(output_chars)
        text = text.replace("▁", " ")
        # Collapse any runs of whitespace and trim
        return " ".join(text.split())

    @torch.no_grad()
    def transcribe_file(self, wav_path: str) -> str:

        audio_input, sr = sf.read(wav_path)

        # Mono conversion
        if audio_input.ndim > 1:
            audio_input = audio_input.mean(axis=1)

        # Wav2Vec2 expects 16 kHz input; apply a simple linear resample if needed
        TARGET_SR = 16_000
        if sr != TARGET_SR:
            target_len = int(len(audio_input) * TARGET_SR / sr)
            audio_input = np.interp(
                np.linspace(0, len(audio_input), target_len, endpoint=False),
                np.arange(len(audio_input)),
                audio_input,
            )
            sr = TARGET_SR

        inputs = self.processor(
            audio_input,
            sampling_rate=sr,
            return_tensors="pt",
            padding=True
        )

        inputs = {k: v.to(self.device) for k, v in inputs.items()}

        logits = self.model(**inputs).logits
        pred_ids = torch.argmax(logits, dim=-1)

        return self.decode_tokens(pred_ids[0])

# =========================================================
# RUN SAMPLE INFERENCE
# =========================================================

engine = RobustInferenceEngine(
    model=model,
    processor=processor,
    device=DEVICE
)

print("Transcribing sample.wav ...")

text = engine.transcribe_file(WAV_PATH)

print("\nTranscription:")
print(text)

Training Details

  • Base model: facebook/wav2vec2-xls-r-300m
  • Fine-tuned on: Common Voice 24.0 Hungarian (scripted speech)
  • Loss: CTC (Connectionist Temporal Classification)
  • Decoder: Greedy CTC decoding

Benchmark Comparison (Hungarian – Common Voice 24.0)

Evaluated on the Common Voice 24.0 Hungarian test set.

| Model | Architecture | Parameters | Decoding | Hungarian Adaptation | WER (CV HU) | Evaluation Source |
|---|---|---|---|---|---|---|
| This model (XLSR fine-tuned) | Wav2Vec2-CTC (encoder-only) | ~300M | Greedy CTC | Fine-tuned on CV 24.0 HU | 0.1647 | Measured (this work) |
| Whisper-Large | Encoder-decoder (seq2seq Transformer) | ~1.55B | Beam search | Multilingual pretraining | ~0.08–0.12 | Reported (public benchmarks) |
| Google Speech-to-Text | Proprietary hybrid DNN/Transformer | Not disclosed | Internal LM + beam | Production-scale multilingual | ~0.07–0.12 | Reported (vendor benchmarks) |
| XLSR (base, not fine-tuned) | Wav2Vec2-CTC (encoder-only) | ~300M | Greedy CTC | None | ~0.35+ | Reported (zero-shot HU) |
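For reference, the WER figures above follow the standard word-level edit-distance definition. A minimal self-contained implementation (illustrative only, not the exact scoring script used for this evaluation) looks like:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("a kutya ugat", "a kutya ugat"))   # 0.0
print(wer("a kutya ugat", "a macska ugat"))  # one substitution out of three words
```

A WER of 0.1647 thus means roughly one word error per six reference words on the test set.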

With a WER of 0.1647 on scripted Hungarian read speech, the model delivers reasonable word-level transcription accuracy for an open-source, greedy-decoded system.

Limitations and Bias

This model was trained and evaluated on scripted speech; performance may degrade on spontaneous, conversational, or noisy recordings.

Greedy CTC decoding is used; incorporating a language model or beam-search decoder can improve quality.
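To sketch what beam-search decoding adds over greedy decoding, here is a minimal CTC prefix beam search in pure Python (no language model; the vocabulary and probability matrix are illustrative, not from this checkpoint):

```python
from collections import defaultdict

def ctc_prefix_beam_search(probs, vocab, blank=0, beam_width=4):
    """probs: T x V list of per-timestep symbol probabilities."""
    # Each beam maps a prefix (tuple of symbol ids) to (p_blank, p_non_blank):
    # the probability of emitting that prefix ending in a blank vs. a symbol.
    beams = {(): (1.0, 0.0)}
    for t_probs in probs:
        next_beams = defaultdict(lambda: (0.0, 0.0))
        for prefix, (p_b, p_nb) in beams.items():
            for s, p_s in enumerate(t_probs):
                if s == blank:
                    nb_b, nb_nb = next_beams[prefix]
                    next_beams[prefix] = (nb_b + (p_b + p_nb) * p_s, nb_nb)
                elif prefix and prefix[-1] == s:
                    # Repeated symbol: merged unless separated by a blank
                    nb_b, nb_nb = next_beams[prefix]
                    next_beams[prefix] = (nb_b, nb_nb + p_nb * p_s)
                    ext = prefix + (s,)
                    nb_b, nb_nb = next_beams[ext]
                    next_beams[ext] = (nb_b, nb_nb + p_b * p_s)
                else:
                    ext = prefix + (s,)
                    nb_b, nb_nb = next_beams[ext]
                    next_beams[ext] = (nb_b, nb_nb + (p_b + p_nb) * p_s)
        # Keep only the most probable prefixes
        beams = dict(sorted(next_beams.items(),
                            key=lambda kv: kv[1][0] + kv[1][1],
                            reverse=True)[:beam_width])
    best = max(beams.items(), key=lambda kv: kv[1][0] + kv[1][1])[0]
    return "".join(vocab[s] for s in best)

vocab = ["-", "a", "b"]  # index 0 is the CTC blank
probs = [
    [0.1, 0.8, 0.1],
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],
]
print(ctc_prefix_beam_search(probs, vocab))  # ab
```

Unlike greedy decoding, this sums probability over all alignments of each prefix; in practice a library such as pyctcdecode adds an n-gram language model on top of this search.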

The Common Voice dataset reflects the demographics of volunteer contributors, which may introduce dataset bias.

Comparison to Other Models

The fine-tuned model offers a competitive open-source alternative, though proprietary cloud solutions and larger seq2seq ASR models often yield lower WER. Performance will vary depending on domain and recording conditions.

Ethical Considerations

This model should not be used for sensitive or safety-critical applications without thorough domain-specific evaluation. The model is released under the Apache-2.0 license.

Citation

If you use this model in research or applications, please reference:

@misc{wav2vec2-large-xlsr-53-hungarian1,
  title={Hungarian ASR Model Fine-tuned on Common Voice 24.0},
  author={Gabor Madarasz},
  year={2026},
  howpublished={Hugging Face Model Hub}
}