parakeet-tdt-lt — Lithuanian fine-tune of NVIDIA Parakeet TDT 0.6B v3

A fine-tuned version of nvidia/parakeet-tdt-0.6b-v3 trained on ~43 hours of Lithuanian speech. It achieves a 46.1% relative WER reduction on the Common Voice 25 Lithuanian test set (16.53% → 8.91% with beam search + a domain 5-gram language model, scored with BasicTextNormalizer).

Results

| Configuration | CV25 LT WER | CV25 LT CER | FLEURS LT WER |
|---|---|---|---|
| Baseline (pretrained, greedy) | 16.53% | 4.29% | 22.15%* |
| Fine-tuned epoch 11 (greedy) | 13.55% | 2.76% | — |
| Fine-tuned + beam + domain 5-gram LM (α=0.5) | 9.40% | 2.15% | — |
| Same, BasicTextNormalizer (leaderboard) | 8.91% | 2.07% | 15.87% |

\* Baseline FLEURS WER scored with BasicTextNormalizer. Live results: speechbench-viz.web.app
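The headline relative reduction follows directly from the table numbers; a quick stdlib check:

```python
# Relative WER reduction: (baseline - new) / baseline, on CV25 LT test.
baseline_wer = 16.53   # pretrained, greedy
finetuned_wer = 8.91   # fine-tuned + beam + 5-gram LM, BasicTextNormalizer

relative_reduction = (baseline_wer - finetuned_wer) / baseline_wer
print(f"{relative_reduction:.1%}")  # -> 46.1%
```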

Usage

```python
import nemo.collections.asr as nemo_asr

# Greedy decoding (no external LM)
model = nemo_asr.models.ASRModel.from_pretrained("sliderforthewin/parakeet-tdt-lt")
transcriptions = model.transcribe(["audio.wav"])
```

With beam search + language model (best quality)

```python
import nemo.collections.asr as nemo_asr
from omegaconf import open_dict
from huggingface_hub import hf_hub_download

model = nemo_asr.models.ASRModel.from_pretrained("sliderforthewin/parakeet-tdt-lt")

# Download the recommended token-level 5-gram LM (see Files below)
lm_path = hf_hub_download("sliderforthewin/parakeet-tdt-lt", "lt_europarl_wiki_subs_5gram.arpa")

# Switch to beam search (modified adaptive expansion search) with shallow LM fusion
decoding_cfg = model.cfg.decoding
with open_dict(decoding_cfg):
    decoding_cfg.strategy = "maes"
    decoding_cfg.beam.beam_size = 4
    decoding_cfg.beam.return_best_hypothesis = True
    decoding_cfg.beam.ngram_lm_model = lm_path
    decoding_cfg.beam.ngram_lm_alpha = 0.5
model.change_decoding_strategy(decoding_cfg)

transcriptions = model.transcribe(["audio.wav"])
```
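The `ngram_lm_alpha` weight controls shallow fusion: when ranking beam hypotheses, the acoustic log-probability and the LM log-probability are combined log-linearly. A toy illustration of the scoring rule (not NeMo's actual beam search; the probabilities are made up):

```python
import math

def fused_score(am_logprob: float, lm_logprob: float, alpha: float) -> float:
    """Shallow fusion: log P_AM + alpha * log P_LM."""
    return am_logprob + alpha * lm_logprob

# Two hypothetical hypotheses: the acoustic model slightly prefers hyp_a,
# but the LM strongly prefers hyp_b (e.g. a real Lithuanian word form).
hyp_a = {"am": math.log(0.40), "lm": math.log(0.01)}
hyp_b = {"am": math.log(0.35), "lm": math.log(0.20)}

alpha = 0.5
score_a = fused_score(hyp_a["am"], hyp_a["lm"], alpha)
score_b = fused_score(hyp_b["am"], hyp_b["lm"], alpha)
print(score_b > score_a)  # True: the LM flips the ranking
```

With `alpha = 0` the acoustic model alone would pick hyp_a; raising alpha trades acoustic evidence for LM fluency, and 0.5 was the value that scored best here.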

Training details

  • Base model: nvidia/parakeet-tdt-0.6b-v3 (multilingual European, 25 languages)
  • Architecture: Conformer encoder + Token-and-Duration Transducer (TDT)
  • Training data: ~43h Lithuanian speech
    • Common Voice 25 LT train+validated+other (17.9h, 12.7k clips)
    • shunyalabs/lithuanian-speech-dataset (14.7h, 2.9k clips)
    • FLEURS LT (9.8h, 2.9k clips)
    • VoxPopuli LT (1.3h, 456 clips)
  • Recipe: Raw PyTorch loop (no Lightning), AdamW lr=1e-6, 5 epochs, BatchNorm frozen to eval mode
  • Critical insight: BatchNorm running statistics must be frozen — updating them with fine-tuning data destroys the pretrained encoder representations. See lessons.
  • Language model: 4-gram subword-token LM trained on training transcripts + Lithuanian Wikipedia (~240MB text), using the model's own SentencePiece tokenizer
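The BatchNorm insight can be illustrated with a toy simulation: running statistics are exponential moving averages, so even a modest number of steps on data whose feature statistics differ from pretraining will pull them far from the values the encoder was trained against. A pure-Python sketch, using PyTorch's default momentum of 0.1 and an invented shift of +2.0 in the batch mean:

```python
# Toy model of a BatchNorm running-mean update during fine-tuning.
# PyTorch's rule: running_mean = (1 - momentum) * running_mean + momentum * batch_mean
momentum = 0.1
pretrained_mean = 0.0       # statistic the pretrained encoder expects
finetune_batch_mean = 2.0   # hypothetical shifted statistics of fine-tuning batches

running_mean = pretrained_mean
for _ in range(50):  # ~50 training steps
    running_mean = (1 - momentum) * running_mean + momentum * finetune_batch_mean

print(round(running_mean, 3))  # essentially converged to 2.0, far from pretraining
```

Freezing (e.g. calling `.eval()` on BatchNorm modules before each training step) keeps forward passes on the pretrained running statistics and prevents this drift.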

Reproduce

```bash
git clone https://github.com/jasontitus/finetuneparakeet.git
cd finetuneparakeet
bash scripts/gcp_eval.sh  # run on a GCP VM with a GPU
```

Files

  • parakeet-tdt-lt.nemo — NeMo checkpoint (epoch 11, best WER)
  • lt_europarl_wiki_subs_5gram.arpa — Europarl+wiki+subs 5-gram token LM (recommended)
  • lt_token_4gram.arpa — Original 4-gram token LM (smaller, still good)

License

CC-BY-4.0 (same as the training data sources)
