# parakeet-tdt-lt — Lithuanian fine-tune of NVIDIA Parakeet TDT 0.6B v3

A fine-tune of nvidia/parakeet-tdt-0.6b-v3 on ~43 hours of Lithuanian speech. Achieves a 45.5% relative WER reduction on the Common Voice 25 Lithuanian test set (16.53% → 8.91% with beam search + a domain 5-gram language model, scored with BasicTextNormalizer).
## Results
| Configuration | CV25 LT WER | CV25 LT CER | FLEURS LT WER |
|---|---|---|---|
| Baseline (pretrained, greedy) | 16.53% | 4.29% | 22.15%* |
| Fine-tuned epoch 11 (greedy) | 13.55% | 2.76% | — |
| Fine-tuned + beam + domain 5-gram LM α=0.5 | 9.40% | 2.15% | — |
| Same, BasicTextNormalizer (leaderboard) | 8.91% | 2.07% | 15.87% |
\* Scored with BasicTextNormalizer. Live results: speechbench-viz.web.app
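WER and CER are word- and character-level edit distances divided by the reference length; the leaderboard rows additionally lowercase and strip punctuation (BasicTextNormalizer) before scoring. A minimal stdlib sketch of the word-level metric, with a rough stand-in for the normalizer (the `normalize` helper here is a simplified approximation, not the actual BasicTextNormalizer):

```python
import re

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Single rolling row of the edit-distance DP table.
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        row = [i]
        for j, h in enumerate(hyp, 1):
            row.append(min(prev_row[j] + 1,              # deletion
                           row[j - 1] + 1,               # insertion
                           prev_row[j - 1] + (r != h)))  # substitution
        prev_row = row
    return prev_row[-1] / len(ref)

def normalize(text: str) -> str:
    # Rough stand-in for BasicTextNormalizer: lowercase, drop punctuation.
    # \w is Unicode-aware, so Lithuanian letters are preserved.
    return re.sub(r"[^\w\s]", "", text.lower())

print(wer(normalize("Labas rytas!"), normalize("labas ritas")))  # → 0.5
```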
## Usage
### Greedy decoding

```python
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("sliderforthewin/parakeet-tdt-lt")
transcriptions = model.transcribe(["audio.wav"])
```
### With beam search + language model (best quality)
```python
import nemo.collections.asr as nemo_asr
from omegaconf import open_dict
from huggingface_hub import hf_hub_download

model = nemo_asr.models.ASRModel.from_pretrained("sliderforthewin/parakeet-tdt-lt")

# Download the token-level LM
lm_path = hf_hub_download("sliderforthewin/parakeet-tdt-lt", "lt_token_4gram.arpa")

# Switch to MAES beam search with n-gram LM fusion
decoding_cfg = model.cfg.decoding
with open_dict(decoding_cfg):
    decoding_cfg.strategy = "maes"
    decoding_cfg.beam.beam_size = 4
    decoding_cfg.beam.return_best_hypothesis = True
    decoding_cfg.beam.ngram_lm_model = lm_path
    decoding_cfg.beam.ngram_lm_alpha = 0.5
model.change_decoding_strategy(decoding_cfg)

transcriptions = model.transcribe(["audio.wav"])
```
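The `ngram_lm_alpha` weight controls shallow fusion: during beam search, each hypothesis is scored by its acoustic log-probability plus α times its LM log-probability, so a linguistically implausible transcript can be outranked even when it scores slightly better acoustically. A toy illustration with made-up numbers (not NeMo's actual internals):

```python
# Hypothetical beam hypotheses: (text, acoustic log-prob, LM log-prob)
hypotheses = [
    ("labas rytas", -4.2, -3.1),
    ("labas ritas", -4.0, -9.5),  # acoustically likely, linguistically odd
]

def fused_score(acoustic_logp: float, lm_logp: float, alpha: float = 0.5) -> float:
    # Shallow fusion: add the n-gram LM score with weight alpha.
    return acoustic_logp + alpha * lm_logp

best = max(hypotheses, key=lambda h: fused_score(h[1], h[2]))
print(best[0])  # → "labas rytas": the LM rescues the correct spelling
```

Sweeping α on a held-out set (values around 0.3–0.7 are typical starting points) is how a weight like 0.5 is usually chosen.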
## Training details
- Base model: nvidia/parakeet-tdt-0.6b-v3 (multilingual European, 25 languages)
- Architecture: Conformer encoder + Token-and-Duration Transducer (TDT)
- Training data: ~43h Lithuanian speech
  - Common Voice 25 LT train+validated+other (17.9h, 12.7k clips)
  - shunyalabs/lithuanian-speech-dataset (14.7h, 2.9k clips)
  - FLEURS LT (9.8h, 2.9k clips)
  - VoxPopuli LT (1.3h, 456 clips)
- Recipe: Raw PyTorch loop (no Lightning), AdamW lr=1e-6, 5 epochs, BatchNorm frozen to eval mode
- Critical insight: BatchNorm running statistics must be frozen — updating them with fine-tuning data destroys the pretrained encoder representations. See lessons.
- Language model: 4-gram subword-token LM trained on training transcripts + Lithuanian Wikipedia (~240MB text), using the model's own SentencePiece tokenizer
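Freezing BatchNorm amounts to putting every BatchNorm layer in eval mode so its running mean/variance stop updating, while the rest of the network still trains. A minimal PyTorch sketch (the tiny `encoder` below is a hypothetical stand-in for the Conformer, not the repo's actual code):

```python
import torch
import torch.nn as nn

def freeze_batchnorm(model: nn.Module) -> None:
    """Put all BatchNorm layers in eval mode so their running statistics
    are not updated by fine-tuning batches. Reapply after every call to
    model.train(), since train() flips BN layers back to training mode."""
    for module in model.modules():
        if isinstance(module, nn.modules.batchnorm._BatchNorm):
            module.eval()

# Toy stand-in for the pretrained encoder (hypothetical).
encoder = nn.Sequential(nn.Conv1d(8, 8, 3, padding=1), nn.BatchNorm1d(8), nn.ReLU())
encoder.train()
freeze_batchnorm(encoder)

bn = encoder[1]
before = bn.running_mean.clone()
encoder(torch.randn(4, 8, 50))                # a "training" forward pass
assert torch.equal(bn.running_mean, before)   # running stats untouched
```

Note that only the running statistics are frozen; the BN affine parameters still receive gradients unless `requires_grad` is disabled separately.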
## Reproduce

```bash
git clone https://github.com/jasontitus/finetuneparakeet.git
cd finetuneparakeet
bash scripts/gcp_eval.sh  # on a GCP VM with a GPU
```
## Files

- `parakeet-tdt-lt.nemo` — NeMo checkpoint (epoch 11, best WER)
- `lt_europarl_wiki_subs_5gram.arpa` — Europarl + Wikipedia + subtitles 5-gram token LM (recommended)
- `lt_token_4gram.arpa` — original 4-gram token LM (smaller, still good)
## License
CC-BY-4.0 (same as the training data sources)
## Evaluation results (self-reported)

Common Voice 25 Lithuanian test set:
- WER 9.40% (beam + Europarl+wiki+subs 5-gram LM, α=0.5)
- CER 2.15% (beam + domain 5-gram LM, α=0.5)
- WER 13.55% (greedy)
- CER 2.76% (greedy)

FLEURS Lithuanian test set:
- WER 15.22% (beam + Europarl+wiki+subs 5-gram LM, α=0.5)
- WER 19.21% (greedy)
- CER 4.77% (greedy)