---
license: llama3.2
language:
- en
base_model: nvidia/parakeet-tdt-0.6b-v3
tags:
- automatic-speech-recognition
- asr
- atc
- air-traffic-control
- aviation
- parakeet
- nemo
- fastconformer
- tdt
- finetuned
- built-with-llama
datasets:
- twangodev/radiotalk-us-audio-tada-noisy
- jlvdoorn/atco2-asr
- jlvdoorn/atco2-asr-atcosim
metrics:
- wer
- cer
library_name: nemo
pipeline_tag: automatic-speech-recognition
model-index:
- name: rasr-parakeet-v1
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech-to-Text
    dataset:
      name: ATCO2 (jlvdoorn/atco2-asr validation)
      type: jlvdoorn/atco2-asr
      split: validation
    metrics:
    - type: wer
      value: 0.1246
      name: Word Error Rate
    - type: cer
      value: 0.0780
      name: Character Error Rate
---

# rasr-parakeet-v1

ATC ASR finetune of `nvidia/parakeet-tdt-0.6b-v3` on a synthetic US-style ATC corpus (`radiotalk-us-audio-tada-noisy`) with a small real-ATC anchor (ATCO2 + ATCOSIM train splits). Trained as v1 of the [rasr](https://github.com/twangodev/rasr) toolkit.

## Headline

| Metric | This model | Prior public SOTA (`jlvdoorn/whisper-large-v3-atco2-asr`) |
|---|---|---|
| **ATCO2 val WER** | **0.125** | 0.157 |
| **ATCO2 val CER** | **0.078** | 0.088 |
| **ATCO2 val numeric WER** | **0.050** | 0.074 |

21% relative WER reduction over the previous public SOTA on the ATCO2 validation benchmark, with a smaller base model (0.6B params vs 1.55B).
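
The relative reductions implied by the headline table can be checked with a few lines of arithmetic. This is a minimal sketch: `relative_reduction` is an illustrative helper (not part of the rasr toolkit), and the WER row uses the unrounded 0.1246 from the `model-index` metadata rather than the rounded 0.125 shown in the table.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fractional reduction of `new` relative to `baseline`."""
    return (baseline - new) / baseline


# (prior SOTA, this model), copied from the headline table and model-index metadata.
rows = {
    "WER": (0.157, 0.1246),
    "CER": (0.088, 0.078),
    "numeric WER": (0.074, 0.050),
}

for name, (baseline, ours) in rows.items():
    print(f"{name}: {relative_reduction(baseline, ours):.1%} relative reduction")
```

This prints 20.6% for WER (the 21% quoted above, after rounding), 11.4% for CER, and 32.4% for numeric WER.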

## Quick start

```python
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("twangodev/rasr-parakeet-v1")
result = model.transcribe(["atc_clip.wav"])
print(result[0].text)
```

Or via the rasr eval toolkit:

```bash
pip install rasr
rasr eval run \
  -m nemo:hf://twangodev/rasr-parakeet-v1 \
  -d hf:jlvdoorn/atco2-asr:validation \
  --language en --batch-size 16
```

## Architecture

- **Base**: [`nvidia/parakeet-tdt-0.6b-v3`](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) (FastConformer encoder + TDT decoder, 0.6B params)
- **Tokenizer**: kept from base (SentencePiece BPE, 8192 tokens, multilingual)
- **Sample rate**: 16 kHz mono
- **Max input duration**: 18 seconds (extended-length inputs may degrade due to TDT joint memory)

## Training data

**This model was trained on transcripts generated by Llama 3.2 and audio synthesized via the Tada TTS pipeline.** Specifically:

| Source | Type | Role |
|---|---|---|
| [`twangodev/radiotalk-us-audio-tada-noisy`](https://huggingface.co/datasets/twangodev/radiotalk-us-audio-tada-noisy) (200k subset) | Synthetic US ATC | Bulk training audio. Dialogue transcripts generated by **Llama 3.2**, audio synthesized by [Tada](https://github.com/twangodev/tada) (TTS) with a VHF channel degradation pipeline. |
| [`jlvdoorn/atco2-asr`](https://huggingface.co/datasets/jlvdoorn/atco2-asr) (train split, ~446 clips) | Real European ATC | Real-data anchor; upweighted 10× to supply real-radio acoustic priors and European operator vocabulary. |
| [`jlvdoorn/atco2-asr-atcosim`](https://huggingface.co/datasets/jlvdoorn/atco2-asr-atcosim) (train, ~10k clips) | Real EU ATC + simulator | Real-data anchor; upweighted 10×. |

### Llama 3.2 attribution

This model is "Built with Llama" under the [Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE).
Llama 3.2 was used to generate the ATC dialogue transcripts in the `radiotalk-us-audio-tada-noisy` dataset; those transcripts are the supervised targets the model learned to produce. The audio itself was synthesized by Tada (not Llama).

## Training recipe

Full reproducible recipe: [`configs/train/rtx6kpro/parakeet-mixed.yaml`](https://github.com/twangodev/rasr/blob/main/configs/train/rtx6kpro/parakeet-mixed.yaml).

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW, β=(0.9, 0.98), weight_decay=1e-3 |
| Learning rate | 1e-4 |
| Schedule | CosineAnnealing, warmup 5000 steps, min_lr=1e-6 |
| Batch size | 32 (effective) |
| Precision | bf16-mixed |
| Max steps | 50,000 |
| Augmentation | SpecAugment (default), speed perturb 0.95-1.05 |
| Max audio duration | 18.0 s |
| Mixing | weighted manifest concat (radiotalk ×1, ATCO2 train ×10, ATCO2+ATCOSIM train ×10) |
| Hardware | NVIDIA RTX PRO 6000 Blackwell (96 GB) |
| Wall clock | ~12 hours |

## Strengths

- **Structurally robust ATC output.** Position-call grammar (CTAF + towered), runway IDs, headings, and altitude readbacks are recovered cleanly.
- **Strong on numeric/safety-critical content.** Per-utterance numeric WER 0.050 on ATCO2 val (about 32% lower than the prior SOTA's 0.074 on the same axis).
- **Stable on out-of-distribution audio.** Zero runaway hallucinations observed on real US GA audio (TartanAviation KBTP), unlike LLM-decoder ASR models (e.g., Canary-Qwen, Granite Speech), which confabulate confidently on hard audio.
- **Small footprint.** 0.6B params, fits in 4 GB VRAM at inference; ~10× faster than larger Whisper-based ATC finetunes.

## Limitations

This model was trained on a US-style synthetic corpus plus a European real-data anchor. The combination produces specific biases users should be aware of:

1. **Operator substitution bias.** The model has been observed replacing unfamiliar callsigns with familiar ones from its training distribution, e.g., emitting "Lufthansa" or "Delta" where the audio contained a less-common operator. This is particularly noticeable on US general aviation (GA) traffic, where N-number tail callsigns (e.g., "Cessna Eight One Niner Charlie Mike") may be replaced with major airline prefixes.
2. **Limited US GA airport name coverage.** The model has not seen most small US GA airport names during training. On real US GA audio (e.g., TartanAviation KBTP recordings), it produces phonetically similar substitutions for the airport name ("Bravo Traffic", "Bello Traffic") instead of the correct name ("Butler Traffic").
3. **European real-anchor contamination on US output.** Training included real European ATCO2/ATCOSIM data to anchor the distribution and unblock the SOTA result on ATCO2 val. This European prior is visible in US-style transcription (occasional "Swiss", "Bern Tower", "Belfast Tower" tokens that should not appear).
4. **Sanity rate on real US GA audio: 77%** (10% CLEAN + 67% PLAUSIBLE-MISHEARD across 69 TartanAviation KBTP clips). Of the imperfect cases, the failure is overwhelmingly *substitution of correct word in correct slot*, not garbling or hallucination.
5. **Evaluation distribution.** This model is benchmarked against ATCO2 (European real ATC). It has not been evaluated against a US ATC benchmark; no fully public US ATC ASR test set with annotations currently exists.

## Recommended usage

- **For European ATC** (or audio matching ATCO2-style distribution): deploy as-is. Numbers above are the expected performance.
- **For US ATC**: use with **inference-time hot-word biasing** against a known callsign + airport-name vocabulary specific to the deployment region. NeMo's TDT decoder supports hot-word biasing via `change_decoding_strategy()`. Most substitution failures collapse to correct output with appropriate biasing.
- **For safety-critical applications**: always layer with confidence-based rejection. This model is intended as a research/development checkpoint, not as a safety-certified ATC transcription system.

## Citation

If you use this model, please cite the project and the underlying components:

```bibtex
@software{rasr,
  author = {Ding, James},
  title = {rasr: ATC ASR finetuning toolkit},
  url = {https://github.com/twangodev/rasr},
  year = {2026}
}
```

And the base model:

```bibtex
@misc{parakeet-tdt,
  author = {NVIDIA},
  title = {Parakeet-TDT-0.6B-v3},
  url = {https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3}
}
```

And Llama 3.2 (training transcripts):

```bibtex
@misc{llama3.2,
  author = {{Meta AI}},
  title = {The Llama 3.2 Herd of Models},
  year = {2024},
  url = {https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/}
}
```

## License

Released under the **[Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE)** ("Built with Llama"). This is the binding upstream license because the training transcripts were generated by Llama 3.2, and the resulting model is treated as a derivative work of Llama Materials for licensing purposes.

In addition to the Llama 3.2 terms, this model also inherits attribution and use requirements from its other parents:

- **Parakeet-TDT-0.6B-v3** ([CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/), NVIDIA): base model
- **ATCO2 corpus** (CC-BY-4.0): real-data anchor (train split)
- **ATCOSIM corpus** (research use; see [source](https://www.spsc.tugraz.at/databases-and-tools/atcosim-air-traffic-control-simulation-speech-corpus.html))
- **radiotalk-us-audio-tada-noisy** (Llama 3.2 Community License; transcripts generated by Llama 3.2, audio synthesized via Tada): synthetic training audio

To redistribute or deploy:

1. Include a copy of the Llama 3.2 Community License.
2. Display "Built with Llama" in your product / user interface / about page.
3. Comply with the [Llama 3.2 Acceptable Use Policy](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/USE_POLICY.md).
4. If your service exceeds 700M monthly active users, request a separate commercial license from Meta.

This is not legal advice. If you are deploying this model commercially or at scale, consult a lawyer regarding the interaction of the upstream licenses.