Hungarian ASR Model fine-tuned on Common Voice 24.0

Model Description

This model is a Hungarian Automatic Speech Recognition (ASR) model fine-tuned on the Common Voice 24.0 Hungarian dataset (scripted speech).
It is based on the multilingual XLSR Wav2Vec2 speech representation model, optimized for Hungarian speech recognition tasks.

Fine-tuning was performed with a CTC training objective for robust character sequence prediction from audio inputs.
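As a concrete illustration of what CTC prediction means at decode time (a minimal sketch with an illustrative toy vocabulary, not this checkpoint's actual one), greedy CTC decoding collapses repeated per-frame predictions and drops blank tokens:

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse a per-frame CTC prediction: merge repeats, then drop blanks."""
    out = []
    prev = None
    for i in frame_ids:
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    return out

# Toy vocabulary and frame sequence, for illustration only
vocab = {0: "<blank>", 1: "h", 2: "a", 3: "z"}
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3]
print("".join(vocab[i] for i in ctc_greedy_collapse(frames)))  # haz
```

Note that a blank between two identical symbols keeps both: `[1, 0, 1]` decodes to two characters, which is how CTC represents doubled letters.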

Intended Uses

Primary Use Cases

  • Transcribing Hungarian speech (scripted, read speech).
  • Research and benchmarking in Hungarian ASR.
  • Offline or API-based speech recognition tasks.
  • Integration into speech-to-text pipelines that require open-source Hungarian models.

Example usage

import torch
import soundfile as sf
import numpy as np
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# =========================================================
# CONFIG
# =========================================================

CHECKPOINT_PATH = "GaborMadarasz/wav2vec2-large-xlsr-53-hungarian1"
WAV_PATH = "sample.wav"

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# =========================================================
# MODEL LOAD
# =========================================================

print("Loading model...")

processor = Wav2Vec2Processor.from_pretrained(CHECKPOINT_PATH)

model = Wav2Vec2ForCTC.from_pretrained(CHECKPOINT_PATH)
model.to(DEVICE)
model.eval()

# =========================================================
# INFERENCE ENGINE
# =========================================================

class RobustInferenceEngine:

    def __init__(self, model, processor, device="cpu"):
        self.model = model
        self.processor = processor
        self.device = device

        self.special_tokens = set([
            "[PAD]", "[UNK]",
            "<pad>", "<unk>",
            "<s>", "</s>"
        ])

        self.tokenizer = processor.tokenizer

    @torch.no_grad()
    def decode_tokens(self, pred_ids: torch.Tensor) -> str:

        tokens = self.tokenizer.convert_ids_to_tokens(
            pred_ids.tolist()
        )

        prev_token = None
        output_chars = []

        for token in tokens:

            if token in self.special_tokens:
                prev_token = token
                continue

            if token == prev_token:
                continue

            if token == "|":
                # Wav2Vec2 character tokenizers mark word boundaries with "|"
                output_chars.append(" ")
            elif token.startswith("▁"):
                # SentencePiece-style vocabularies mark them with "▁" instead
                output_chars.append(" ")
                output_chars.append(token[1:])
            else:
                output_chars.append(token)

            prev_token = token

        text = "".join(output_chars)
        text = text.replace("▁", " ")
        # Collapse any runs of whitespace and trim
        return " ".join(text.split())

    @torch.no_grad()
    def transcribe_file(self, wav_path: str) -> str:

        audio_input, sr = sf.read(wav_path)

        # Mono conversion
        if audio_input.ndim > 1:
            audio_input = audio_input.mean(axis=1)

        # Wav2Vec2 expects 16 kHz input; apply a simple linear resample if needed
        TARGET_SR = 16_000
        if sr != TARGET_SR:
            target_len = int(len(audio_input) * TARGET_SR / sr)
            audio_input = np.interp(
                np.linspace(0, len(audio_input), target_len, endpoint=False),
                np.arange(len(audio_input)),
                audio_input,
            )
            sr = TARGET_SR

        inputs = self.processor(
            audio_input,
            sampling_rate=sr,
            return_tensors="pt",
            padding=True
        )

        inputs = {k: v.to(self.device) for k, v in inputs.items()}

        logits = self.model(**inputs).logits
        pred_ids = torch.argmax(logits, dim=-1)

        return self.decode_tokens(pred_ids[0])

# =========================================================
# RUN SAMPLE INFERENCE
# =========================================================

engine = RobustInferenceEngine(
    model=model,
    processor=processor,
    device=DEVICE
)

print("Transcribing sample.wav ...")

text = engine.transcribe_file(WAV_PATH)

print("\nTranscription:")
print(text)

Training Details

  • Base model: facebook/wav2vec2-xls-r-300m
  • Fine-tuned on: Common Voice 24.0 Hungarian (scripted speech)
  • Loss: CTC (Connectionist Temporal Classification)
  • Decoder: Greedy CTC decoding

Benchmark Comparison (Hungarian – Common Voice 24.0)

Evaluated on the Common Voice 24.0 Hungarian test set.

| Model | Architecture | Parameters | Decoding | Hungarian Adaptation | WER (CV HU) | Evaluation Source |
|---|---|---|---|---|---|---|
| This model (XLSR fine-tuned) | Wav2Vec2-CTC (encoder-only) | ~300M | Greedy CTC | Fine-tuned on CV 24.0 HU | 0.1647 | Measured (this work) |
| Whisper-Large | Encoder-decoder (seq2seq Transformer) | ~1.55B | Beam search | Multilingual pretraining | ~0.08–0.12 | Reported (public benchmarks) |
| Google Speech-to-Text | Proprietary hybrid DNN/Transformer | Not disclosed | Internal LM + beam | Production-scale multilingual | ~0.07–0.12 | Reported (vendor benchmarks) |
| XLSR (base, not fine-tuned) | Wav2Vec2-CTC (encoder-only) | ~300M | Greedy CTC | None | ~0.35+ | Reported (zero-shot HU) |
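For reference, the WER figures above follow the standard word-level edit-distance definition. A minimal self-contained implementation (illustrative only, not the exact scoring script used for this evaluation) looks like:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("a kutya ugat", "a kutya ugat"))   # 0.0
print(wer("a kutya ugat", "a macska ugat"))  # one substitution out of three words
```

A WER of 0.1647 thus means roughly one word error per six reference words on the test set.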

With a WER of 0.1647 on scripted Hungarian read speech, the model delivers reasonable word-level transcription accuracy for an open-source, greedy-decoded system.

Limitations and Bias

This model was trained and evaluated on scripted speech; performance may degrade on spontaneous, conversational, or noisy recordings.

Greedy CTC decoding is used; incorporating a language model or beam-search decoder can improve quality.
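To sketch what beam-search decoding adds over greedy decoding, here is a minimal CTC prefix beam search in pure Python (no language model; the vocabulary and probability matrix are illustrative, not from this checkpoint):

```python
from collections import defaultdict

def ctc_prefix_beam_search(probs, vocab, blank=0, beam_width=4):
    """probs: T x V list of per-timestep symbol probabilities."""
    # Each beam maps a prefix (tuple of symbol ids) to (p_blank, p_non_blank):
    # the probability of emitting that prefix ending in a blank vs. a symbol.
    beams = {(): (1.0, 0.0)}
    for t_probs in probs:
        next_beams = defaultdict(lambda: (0.0, 0.0))
        for prefix, (p_b, p_nb) in beams.items():
            for s, p_s in enumerate(t_probs):
                if s == blank:
                    nb_b, nb_nb = next_beams[prefix]
                    next_beams[prefix] = (nb_b + (p_b + p_nb) * p_s, nb_nb)
                elif prefix and prefix[-1] == s:
                    # Repeated symbol: merged unless separated by a blank
                    nb_b, nb_nb = next_beams[prefix]
                    next_beams[prefix] = (nb_b, nb_nb + p_nb * p_s)
                    ext = prefix + (s,)
                    nb_b, nb_nb = next_beams[ext]
                    next_beams[ext] = (nb_b, nb_nb + p_b * p_s)
                else:
                    ext = prefix + (s,)
                    nb_b, nb_nb = next_beams[ext]
                    next_beams[ext] = (nb_b, nb_nb + (p_b + p_nb) * p_s)
        # Keep only the most probable prefixes
        beams = dict(sorted(next_beams.items(),
                            key=lambda kv: kv[1][0] + kv[1][1],
                            reverse=True)[:beam_width])
    best = max(beams.items(), key=lambda kv: kv[1][0] + kv[1][1])[0]
    return "".join(vocab[s] for s in best)

vocab = ["-", "a", "b"]  # index 0 is the CTC blank
probs = [
    [0.1, 0.8, 0.1],
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],
]
print(ctc_prefix_beam_search(probs, vocab))  # ab
```

Unlike greedy decoding, this sums probability over all alignments of each prefix; in practice a library such as pyctcdecode adds an n-gram language model on top of this search.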

The Common Voice dataset reflects the demographics of volunteer contributors, which may introduce dataset bias.

Comparison to Other Models

The fine-tuned model offers a competitive open-source alternative, though proprietary cloud solutions and larger seq2seq ASR models often yield lower WER. Performance will vary depending on domain and recording conditions.

Ethical Considerations

This model should not be used for sensitive or safety-critical applications without thorough domain-specific evaluation. The model is released under the Apache-2.0 license.

Citation

If you use this model in research or applications, please reference:

@misc{wav2vec2-large-xlsr-53-hungarian1,
  title={Hungarian ASR Model Fine-tuned on Common Voice 24.0},
  author={Gabor Madarasz},
  year={2026},
  howpublished={Hugging Face Model Hub}
}