---
license: llama3.2
language:
- en
base_model: nvidia/parakeet-tdt-0.6b-v3
tags:
- automatic-speech-recognition
- asr
- atc
- air-traffic-control
- aviation
- parakeet
- nemo
- fastconformer
- tdt
- finetuned
- built-with-llama
datasets:
- twangodev/radiotalk-us-audio-tada-noisy
- jlvdoorn/atco2-asr
- jlvdoorn/atco2-asr-atcosim
metrics:
- wer
- cer
library_name: nemo
pipeline_tag: automatic-speech-recognition
model-index:
- name: rasr-parakeet-v1
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech-to-Text
    dataset:
      name: ATCO2 (jlvdoorn/atco2-asr validation)
      type: jlvdoorn/atco2-asr
      split: validation
    metrics:
    - type: wer
      value: 0.1246
      name: Word Error Rate
    - type: cer
      value: 0.0780
      name: Character Error Rate
---

# rasr-parakeet-v1

ATC ASR finetune of `nvidia/parakeet-tdt-0.6b-v3` on a synthetic US-style ATC corpus (`radiotalk-us-audio-tada-noisy`) with a small real-ATC anchor (ATCO2 + ATCOSIM train splits). Trained as v1 of the [rasr](https://github.com/twangodev/rasr) toolkit.

## Headline

| Metric | This model | Prior public SOTA (`jlvdoorn/whisper-large-v3-atco2-asr`) |
|---|---|---|
| **ATCO2 val WER** | **0.125** | 0.157 |
| **ATCO2 val CER** | **0.078** | 0.088 |
| **ATCO2 val numeric WER** | **0.050** | 0.074 |

This is a 21% relative WER reduction over the previous public SOTA on the ATCO2 validation benchmark, achieved with a smaller base model (0.6B params vs 1.55B).

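The relative reductions implied by the headline table can be checked with a few lines of arithmetic. This is a minimal sketch: the WER and CER values for this model are the unrounded figures from the card metadata (0.1246 and 0.0780), and the helper name `relative_reduction` is illustrative, not part of rasr.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Fractional error-rate reduction of `ours` relative to `baseline`."""
    return (baseline - ours) / baseline


# (prior public SOTA, this model) on the ATCO2 validation split
headline = {
    "WER": (0.157, 0.1246),
    "CER": (0.088, 0.0780),
    "numeric WER": (0.074, 0.050),
}

reductions = {
    name: relative_reduction(baseline, ours)
    for name, (baseline, ours) in headline.items()
}

for name, value in reductions.items():
    print(f"ATCO2 val {name}: {value:.1%} relative reduction")
```

The WER line reproduces the roughly 21% figure quoted above; the same formula applied to the other two rows gives their relative reductions.
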
## Quick start

```python
import nemo.collections.asr as nemo_asr

# Download the checkpoint from the Hugging Face Hub and load it as a NeMo ASR model.
model = nemo_asr.models.ASRModel.from_pretrained("twangodev/rasr-parakeet-v1")

# Transcribe a list of audio files; one result is returned per input file.
result = model.transcribe(["atc_clip.wav"])
print(result[0].text)
```

Or via the rasr eval toolkit:

```bash
pip install rasr
rasr eval run \
  -m nemo:hf://twangodev/rasr-parakeet-v1 \
  -d hf:jlvdoorn/atco2-asr:validation \
  --language en --batch-size 16
```

## Architecture

- **Base**: [`nvidia/parakeet-tdt-0.6b-v3`](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) (FastConformer encoder + TDT decoder, 0.6B params)
- **Tokenizer**: kept from the base model (SentencePiece BPE, 8192 tokens, multilingual)
- **Sample rate**: 16 kHz mono
- **Max input duration**: 18 seconds (longer inputs may degrade because of TDT joint memory)

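Before transcription, it can help to verify that a clip matches the input format listed above. The following is a minimal sketch using only the Python standard library; it handles uncompressed PCM WAV files only, and the function name `check_wav` is hypothetical rather than part of NeMo or rasr.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # Hz, from the Architecture list
TARGET_CHANNELS = 1          # mono
MAX_DURATION_S = 18.0        # seconds


def check_wav(path: str) -> list[str]:
    """Return a list of problems; an empty list means the clip conforms."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate

    problems = []
    if rate != TARGET_SAMPLE_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {TARGET_SAMPLE_RATE} Hz")
    if channels != TARGET_CHANNELS:
        problems.append(f"{channels} channels, expected mono")
    if duration > MAX_DURATION_S:
        problems.append(f"duration {duration:.1f} s exceeds {MAX_DURATION_S} s")
    return problems
```

Clips that fail the check would need to be resampled, downmixed, or split before use; how best to do that depends on the audio tooling available in your environment.
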
## Training data

**This model was trained on transcripts generated by Llama 3.2 and audio synthesized via the Tada TTS pipeline.** Specifically:

| Source | Type | Role |
|---|---|---|
| [`twangodev/radiotalk-us-audio-tada-noisy`](https://huggingface.co/datasets/twangodev/radiotalk-us-audio-tada-noisy) (200k subset) | Synthetic US ATC | Bulk training audio. Dialogue transcripts generated by **Llama 3.2**, audio synthesized by [Tada](https://github.com/twangodev/tada) (TTS) with a VHF channel degradation pipeline. |
| [`jlvdoorn/atco2-asr`](https://huggingface.co/datasets/jlvdoorn/atco2-asr) (train split, ~446 clips) | Real European ATC | Real-data anchor; upweighted 10× to supply real-radio acoustic priors and European operator vocabulary. |
| [`jlvdoorn/atco2-asr-atcosim`](https://huggingface.co/datasets/jlvdoorn/atco2-asr-atcosim) (train split, ~10k clips) | Real EU ATC + simulator | Real-data anchor; upweighted 10×. |

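One simple way to realize the 10× upweighting is to repeat the real-data entries when concatenating NeMo-style JSON-lines manifests. The sketch below assumes plain repetition followed by a shuffle; the actual rasr mixing code may work differently, and `mix_manifests` plus the placeholder file names are hypothetical.

```python
import json
import random


def mix_manifests(sources: list[tuple[list[dict], int]], seed: int = 0) -> list[dict]:
    """Concatenate manifests, repeating each source `weight` times, then shuffle."""
    mixed = []
    for entries, weight in sources:
        mixed.extend(entries * weight)
    random.Random(seed).shuffle(mixed)
    return mixed


# Tiny stand-in manifests; paths and transcripts are placeholders.
synthetic = [
    {"audio_filepath": f"synth_{i}.wav", "duration": 4.0, "text": "placeholder"}
    for i in range(20)
]
real = [
    {"audio_filepath": f"real_{i}.wav", "duration": 4.0, "text": "placeholder"}
    for i in range(2)
]

mixed = mix_manifests([(synthetic, 1), (real, 10)])
real_share = sum(e["audio_filepath"].startswith("real_") for e in mixed) / len(mixed)
print(f"{len(mixed)} entries, real share {real_share:.0%}")

# NeMo manifests are JSON lines: one object per line.
manifest_text = "\n".join(json.dumps(entry) for entry in mixed)
```

Under the same repetition assumption, the approximate clip counts in the table (200k synthetic, ~446 and ~10k real, both ×10) would give about 304k manifest entries, of which roughly a third are real audio.
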
### Llama 3.2 attribution

This model is "Built with Llama" under the [Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE). Llama 3.2 was used to generate the ATC dialogue transcripts in the `radiotalk-us-audio-tada-noisy` dataset; those transcripts are the supervised targets the model learned to produce. The audio itself was synthesized by Tada (not Llama).

## Training recipe

Full reproducible recipe: [`configs/train/rtx6kpro/parakeet-mixed.yaml`](https://github.com/twangodev/rasr/blob/main/configs/train/rtx6kpro/parakeet-mixed.yaml).

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW, β=(0.9, 0.98), weight_decay=1e-3 |
| Learning rate | 1e-4 |
| Schedule | CosineAnnealing, warmup 5000 steps, min_lr=1e-6 |
| Batch size | 32 (effective) |
| Precision | bf16-mixed |
| Max steps | 50,000 |
| Augmentation | SpecAugment (default), speed perturb 0.95-1.05 |
| Max audio duration | 18.0 s |
| Mixing | weighted manifest concat (radiotalk ×1, ATCO2 train ×10, ATCO2+ATCOSIM train ×10) |
| Hardware | NVIDIA RTX PRO 6000 Blackwell (96 GB) |
| Wall clock | ~12 hours |

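The learning-rate schedule in the table can be sketched as a linear warmup followed by a cosine decay. This is an approximation written from the listed hyperparameters; NeMo's `CosineAnnealing` implementation may differ in details such as how the first warmup step is counted.

```python
import math

PEAK_LR = 1e-4
MIN_LR = 1e-6
WARMUP_STEPS = 5_000
MAX_STEPS = 50_000


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR at MAX_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS))
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


for step in (0, 2_500, 5_000, 27_500, 50_000):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

At the effective batch size of 32, the 50,000 steps correspond to 1.6 million training examples drawn from the mixed manifest.
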
## Strengths

- **Structurally robust ATC output.** Position-call grammar (CTAF + towered), runway IDs, headings, and altitude readbacks are recovered cleanly.
- **Strong on numeric/safety-critical content.** Per-utterance numeric WER is 0.050 on ATCO2 val, against 0.074 for the prior SOTA on the same axis (about 32% lower).
- **Stable on out-of-distribution audio.** Zero runaway hallucinations observed on real US GA audio (TartanAviation KBTP), unlike LLM-decoder ASR models (e.g., Canary-Qwen, Granite Speech), which confabulate confidently on hard audio.
- **Small footprint.** 0.6B params, fits in 4 GB VRAM at inference; ~10× faster than larger Whisper-based ATC finetunes.

## Limitations

This model was trained on a US-style synthetic corpus plus a European real-data anchor. The combination produces specific biases users should be aware of:

1. **Operator substitution bias.** The model has been observed substituting unfamiliar callsigns with familiar ones from its training distribution, e.g., emitting "Lufthansa" or "Delta" where the audio contained a less-common operator. This is particularly noticeable on US general aviation (GA) traffic, where N-number tail callsigns (e.g., "Cessna Eight One Niner Charlie Mike") may be replaced with major airline prefixes.
2. **Limited US GA airport name coverage.** The model has not seen most small US GA airport names during training. On real US GA audio (e.g., TartanAviation KBTP recordings), it produces phonetically similar substitutions for the airport name ("Bravo Traffic", "Bello Traffic") instead of the correct name ("Butler Traffic").
3. **European real-anchor contamination on US output.** Training included real European ATCO2/ATCOSIM data to anchor the distribution and reach the SOTA result on ATCO2 val. This European prior is visible in US-style transcription (occasional "Swiss", "Bern Tower", "Belfast Tower" tokens that should not appear).
4. **Sanity rate on real US GA audio: 77%** (10% CLEAN + 67% PLAUSIBLE-MISHEARD across 69 TartanAviation KBTP clips). In the imperfect cases, the failure is overwhelmingly a *wrong word substituted into the correct slot*, not garbling or hallucination.
5. **Evaluation distribution.** This model is benchmarked against ATCO2 (European real ATC). It has not been evaluated against a US ATC benchmark, because no fully public US ATC ASR test set with annotations currently exists.

## Recommended usage

- **For European ATC** (or audio matching ATCO2-style distribution): deploy as-is. Numbers above are the expected performance.
- **For US ATC**: use with **inference-time hot-word biasing** against a known callsign + airport-name vocabulary specific to the deployment region. NeMo's TDT decoder supports hot-word biasing via `change_decoding_strategy()`. Most substitution failures collapse to correct output with appropriate biasing.
- **For safety-critical applications**: always layer with confidence-based rejection. This model is intended as a research/development checkpoint, not as a safety-certified ATC transcription system.

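Confidence-based rejection can be layered on top of the transcripts as a simple gate. The sketch below is illustrative only: it assumes a per-utterance confidence score in [0, 1] is already available from the decoding setup, the 0.85 threshold is an arbitrary placeholder to be tuned on held-out data, and `gate`, `Decision`, and `SUSPECT_TOKENS` are hypothetical names. The suspect-token list reuses the European-prior tokens named under Limitations and only makes sense for US deployments.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    text: str
    accepted: bool
    reason: str


# Tokens reported under Limitations as European-prior leakage on US audio.
SUSPECT_TOKENS = ("swiss", "bern tower", "belfast tower")


def gate(text: str, confidence: float, threshold: float = 0.85) -> Decision:
    """Accept a transcript only if it is confident and free of suspect tokens."""
    if confidence < threshold:
        return Decision(text, False, f"confidence {confidence:.2f} below {threshold:.2f}")
    lowered = text.lower()
    for token in SUSPECT_TOKENS:
        if token in lowered:
            return Decision(text, False, f"suspect token: {token}")
    return Decision(text, True, "ok")


for text, confidence in [
    ("Butler Traffic Cessna Eight One Niner Charlie Mike", 0.95),
    ("Bern Tower good day", 0.99),
    ("unreadable transmission", 0.40),
]:
    print(gate(text, confidence))
```

Rejected utterances would typically be routed to a human reviewer or a fallback path rather than silently dropped.
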
## Citation

If you use this model, please cite the project and the underlying components:

```bibtex
@software{rasr,
  author = {Ding, James},
  title = {rasr: ATC ASR finetuning toolkit},
  url = {https://github.com/twangodev/rasr},
  year = {2026}
}
```

And the base model:

```bibtex
@misc{parakeet-tdt,
  author = {NVIDIA},
  title = {Parakeet-TDT-0.6B-v3},
  url = {https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3}
}
```

And Llama 3.2 (training transcripts):

```bibtex
@misc{llama3.2,
  author = {{Meta AI}},
  title = {The Llama 3.2 Herd of Models},
  year = {2024},
  url = {https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/}
}
```

## License

Released under the **[Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE)** ("Built with Llama"). This is the binding upstream license because the training transcripts were generated by Llama 3.2, and the resulting model is treated as a derivative work of Llama Materials for licensing purposes.

In addition to the Llama 3.2 terms, this model also inherits attribution and use requirements from its other parents:

- **Parakeet-TDT-0.6B-v3** ([CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/), NVIDIA): base model
- **ATCO2 corpus** (CC-BY-4.0): real-data anchor (train split)
- **ATCOSIM corpus** (research use; see [source](https://www.spsc.tugraz.at/databases-and-tools/atcosim-air-traffic-control-simulation-speech-corpus.html))
- **radiotalk-us-audio-tada-noisy** (Llama 3.2 Community License; transcripts generated by Llama 3.2, audio synthesized via Tada): synthetic training audio

To redistribute or deploy:

1. Include a copy of the Llama 3.2 Community License.
2. Display "Built with Llama" in your product / user interface / about page.
3. Comply with the [Llama 3.2 Acceptable Use Policy](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/USE_POLICY.md).
4. If your service exceeds 700M monthly active users, request a separate commercial license from Meta.

This is not legal advice. If you are deploying this model commercially or at scale, consult a lawyer regarding the interaction of the upstream licenses.