parakeet-tdt-lt — Lithuanian fine-tune of NVIDIA Parakeet TDT 0.6B v3

A fine-tuned version of nvidia/parakeet-tdt-0.6b-v3 trained on ~43 hours of Lithuanian speech. It achieves a 46.1% relative WER reduction on the Common Voice 25 Lithuanian test set (16.53% → 8.91% with beam search + a domain 5-gram language model, scored with BasicTextNormalizer).

Results

| Configuration | CV25 LT WER | CV25 LT CER | FLEURS LT WER |
|---|---|---|---|
| Baseline (pretrained, greedy) | 16.53% | 4.29% | 22.15%* |
| Fine-tuned epoch 11 (greedy) | 13.55% | 2.76% | — |
| Fine-tuned + beam + domain 5-gram LM (α=0.5) | 9.40% | 2.15% | — |
| Same, BasicTextNormalizer (leaderboard) | 8.91% | 2.07% | 15.87% |

\* Baseline FLEURS WER scored with BasicTextNormalizer. Live results: speechbench-viz.web.app
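The headline relative reduction follows directly from the table numbers; a quick stdlib check:

```python
# Relative WER reduction: (baseline - new) / baseline, on CV25 LT test.
baseline_wer = 16.53   # pretrained, greedy
finetuned_wer = 8.91   # fine-tuned + beam + 5-gram LM, BasicTextNormalizer

relative_reduction = (baseline_wer - finetuned_wer) / baseline_wer
print(f"{relative_reduction:.1%}")  # -> 46.1%
```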

Usage

```python
import nemo.collections.asr as nemo_asr

# Greedy decoding (no external LM)
model = nemo_asr.models.ASRModel.from_pretrained("sliderforthewin/parakeet-tdt-lt")
transcriptions = model.transcribe(["audio.wav"])
```

With beam search + language model (best quality)

```python
import nemo.collections.asr as nemo_asr
from omegaconf import open_dict
from huggingface_hub import hf_hub_download

model = nemo_asr.models.ASRModel.from_pretrained("sliderforthewin/parakeet-tdt-lt")

# Download the recommended token-level 5-gram LM (see Files below)
lm_path = hf_hub_download("sliderforthewin/parakeet-tdt-lt", "lt_europarl_wiki_subs_5gram.arpa")

# Switch to beam search (modified adaptive expansion search) with shallow LM fusion
decoding_cfg = model.cfg.decoding
with open_dict(decoding_cfg):
    decoding_cfg.strategy = "maes"
    decoding_cfg.beam.beam_size = 4
    decoding_cfg.beam.return_best_hypothesis = True
    decoding_cfg.beam.ngram_lm_model = lm_path
    decoding_cfg.beam.ngram_lm_alpha = 0.5
model.change_decoding_strategy(decoding_cfg)

transcriptions = model.transcribe(["audio.wav"])
```
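The `ngram_lm_alpha` weight controls shallow fusion: when ranking beam hypotheses, the acoustic log-probability and the LM log-probability are combined log-linearly. A toy illustration of the scoring rule (not NeMo's actual beam search; the probabilities are made up):

```python
import math

def fused_score(am_logprob: float, lm_logprob: float, alpha: float) -> float:
    """Shallow fusion: log P_AM + alpha * log P_LM."""
    return am_logprob + alpha * lm_logprob

# Two hypothetical hypotheses: the acoustic model slightly prefers hyp_a,
# but the LM strongly prefers hyp_b (e.g. a real Lithuanian word form).
hyp_a = {"am": math.log(0.40), "lm": math.log(0.01)}
hyp_b = {"am": math.log(0.35), "lm": math.log(0.20)}

alpha = 0.5
score_a = fused_score(hyp_a["am"], hyp_a["lm"], alpha)
score_b = fused_score(hyp_b["am"], hyp_b["lm"], alpha)
print(score_b > score_a)  # True: the LM flips the ranking
```

With `alpha = 0` the acoustic model alone would pick hyp_a; raising alpha trades acoustic evidence for LM fluency, and 0.5 was the value that scored best here.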

Training details

  • Base model: nvidia/parakeet-tdt-0.6b-v3 (multilingual European, 25 languages)
  • Architecture: Conformer encoder + Token-and-Duration Transducer (TDT)
  • Training data: ~43h Lithuanian speech
    • Common Voice 25 LT train+validated+other (17.9h, 12.7k clips)
    • shunyalabs/lithuanian-speech-dataset (14.7h, 2.9k clips)
    • FLEURS LT (9.8h, 2.9k clips)
    • VoxPopuli LT (1.3h, 456 clips)
  • Recipe: Raw PyTorch loop (no Lightning), AdamW lr=1e-6, 5 epochs, BatchNorm frozen to eval mode
  • Critical insight: BatchNorm running statistics must be frozen — updating them with fine-tuning data destroys the pretrained encoder representations. See lessons.
  • Language model: 4-gram subword-token LM trained on training transcripts + Lithuanian Wikipedia (~240MB text), using the model's own SentencePiece tokenizer
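The BatchNorm insight can be illustrated with a toy simulation: running statistics are exponential moving averages, so even a modest number of steps on data whose feature statistics differ from pretraining will pull them far from the values the encoder was trained against. A pure-Python sketch, using PyTorch's default momentum of 0.1 and an invented shift of +2.0 in the batch mean:

```python
# Toy model of a BatchNorm running-mean update during fine-tuning.
# PyTorch's rule: running_mean = (1 - momentum) * running_mean + momentum * batch_mean
momentum = 0.1
pretrained_mean = 0.0       # statistic the pretrained encoder expects
finetune_batch_mean = 2.0   # hypothetical shifted statistics of fine-tuning batches

running_mean = pretrained_mean
for _ in range(50):  # ~50 training steps
    running_mean = (1 - momentum) * running_mean + momentum * finetune_batch_mean

print(round(running_mean, 3))  # essentially converged to 2.0, far from pretraining
```

Freezing (e.g. calling `.eval()` on BatchNorm modules before each training step) keeps forward passes on the pretrained running statistics and prevents this drift.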

Reproduce

```bash
git clone https://github.com/jasontitus/finetuneparakeet.git
cd finetuneparakeet
bash scripts/gcp_eval.sh  # run on a GCP VM with a GPU
```

Files

  • parakeet-tdt-lt.nemo — NeMo checkpoint (epoch 11, best WER)
  • lt_europarl_wiki_subs_5gram.arpa — Europarl+wiki+subs 5-gram token LM (recommended)
  • lt_token_4gram.arpa — Original 4-gram token LM (smaller, still good)

License

CC-BY-4.0 (same as the training data sources)
