---
license: llama3.2
language:
- en
base_model: nvidia/parakeet-tdt-0.6b-v3
tags:
- automatic-speech-recognition
- asr
- atc
- air-traffic-control
- aviation
- parakeet
- nemo
- fastconformer
- tdt
- finetuned
- built-with-llama
datasets:
- twangodev/radiotalk-us-audio-tada-noisy
- jlvdoorn/atco2-asr
- jlvdoorn/atco2-asr-atcosim
metrics:
- wer
- cer
library_name: nemo
pipeline_tag: automatic-speech-recognition
model-index:
- name: rasr-parakeet-v1
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech-to-Text
    dataset:
      name: ATCO2 (jlvdoorn/atco2-asr validation)
      type: jlvdoorn/atco2-asr
      split: validation
    metrics:
    - type: wer
      value: 0.1246
      name: Word Error Rate
    - type: cer
      value: 0.0780
      name: Character Error Rate
---
# rasr-parakeet-v1
ATC ASR finetune of `nvidia/parakeet-tdt-0.6b-v3` on a synthetic US-style ATC corpus (`radiotalk-us-audio-tada-noisy`) with a small real-ATC anchor (ATCO2 + ATCOSIM train splits). Trained as v1 of the [rasr](https://github.com/twangodev/rasr) toolkit.
## Headline
| Metric | This model | Prior public SOTA (`jlvdoorn/whisper-large-v3-atco2-asr`) |
|---|---|---|
| **ATCO2 val WER** | **0.125** | 0.157 |
| **ATCO2 val CER** | **0.078** | 0.088 |
| **ATCO2 val numeric WER** | **0.050** | 0.074 |
21% relative WER reduction over the previous public SOTA on the ATCO2 validation benchmark, with a smaller base model (0.6B params vs 1.55B).
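The relative figures can be checked directly from the table. The snippet below is only a sanity-check sketch; it uses the unrounded WER of 0.1246 from the model metadata, and `relative_reduction` is a helper name introduced here for illustration.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Fraction by which `ours` improves on `baseline` (lower is better)."""
    return (baseline - ours) / baseline


# Values taken from the Headline table and the model-index metadata.
print(f"WER:         {relative_reduction(0.157, 0.1246):.1%}")
print(f"CER:         {relative_reduction(0.088, 0.078):.1%}")
print(f"Numeric WER: {relative_reduction(0.074, 0.050):.1%}")
```

The WER line comes out at roughly 20.6%, which rounds to the 21% quoted above.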
## Quick start
```python
import nemo.collections.asr as nemo_asr
model = nemo_asr.models.ASRModel.from_pretrained("twangodev/rasr-parakeet-v1")
result = model.transcribe(["atc_clip.wav"])
print(result[0].text)
```
Or via the rasr eval toolkit:
```bash
pip install rasr
rasr eval run \
-m nemo:hf://twangodev/rasr-parakeet-v1 \
-d hf:jlvdoorn/atco2-asr:validation \
--language en --batch-size 16
```
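For readers unfamiliar with the reported metrics, here is a minimal sketch of WER and CER as word-level and character-level edit distance divided by reference length. It assumes only lowercasing and whitespace tokenization; the rasr toolkit's actual text normalization may differ, and the example utterance is illustrative rather than taken from a benchmark.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits / number of reference words."""
    ref_words = reference.lower().split()
    return edit_distance(ref_words, hypothesis.lower().split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits / reference length."""
    ref_chars = list(reference.lower())
    return edit_distance(ref_chars, list(hypothesis.lower())) / len(ref_chars)


ref = "butler traffic cessna eight one niner charlie mike"
hyp = "bello traffic cessna eight one niner charlie mike"
print(f"WER={wer(ref, hyp):.3f}  CER={cer(ref, hyp):.3f}")
```

One wrong word out of eight gives a WER of 0.125 for this utterance, while the CER is lower because most characters still match.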
## Architecture
- **Base**: [`nvidia/parakeet-tdt-0.6b-v3`](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) (FastConformer encoder + TDT decoder, 0.6B params)
- **Tokenizer**: kept from base — SentencePiece BPE 8192 tokens, multilingual
- **Sample rate**: 16 kHz mono
- **Max input duration**: 18 seconds, the cap used in training (longer inputs may degrade; the limit stems from TDT joint memory)
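Before transcribing, it can help to confirm that a clip matches the input format listed above. The following is a sketch using only Python's standard `wave` module, so it handles uncompressed PCM WAV files only; `check_clip` and the constant names are hypothetical helpers, not part of NeMo or rasr.

```python
import wave

TARGET_RATE = 16_000   # 16 kHz, per the model card
TARGET_CHANNELS = 1    # mono
MAX_SECONDS = 18.0     # max input duration, per the model card


def check_clip(path: str) -> list[str]:
    """Return a list of format problems; an empty list means the clip looks fine."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate

    problems = []
    if rate != TARGET_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {TARGET_RATE} Hz")
    if channels != TARGET_CHANNELS:
        problems.append(f"{channels} channels, expected mono")
    if duration > MAX_SECONDS:
        problems.append(f"duration {duration:.1f} s exceeds {MAX_SECONDS} s")
    return problems
```

Calling `check_clip("atc_clip.wav")` returns an empty list for a conforming clip; anything else would need to be resampled, downmixed, or split before transcription.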
## Training data
**This model was trained on transcripts generated by Llama 3.2 and audio synthesized via the Tada TTS pipeline.** Specifically:
| Source | Type | Role |
|---|---|---|
| [`twangodev/radiotalk-us-audio-tada-noisy`](https://huggingface.co/datasets/twangodev/radiotalk-us-audio-tada-noisy) (200k subset) | Synthetic US ATC | Bulk training audio. Dialogue transcripts generated by **Llama 3.2**, audio synthesized by [Tada](https://github.com/twangodev/tada) (TTS) with VHF channel degradation pipeline. |
| [`jlvdoorn/atco2-asr`](https://huggingface.co/datasets/jlvdoorn/atco2-asr) (train split, ~446 clips) | Real European ATC | Real-data anchor; upweighted 10× to supply real-radio acoustic priors and European operator vocabulary. |
| [`jlvdoorn/atco2-asr-atcosim`](https://huggingface.co/datasets/jlvdoorn/atco2-asr-atcosim) (train, ~10k clips) | Real EU ATC + simulator | Real-data anchor; upweighted 10×. |
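To get a feel for what the 10× upweighting means in practice, the sketch below estimates each source's share of the mixed training set. It assumes that "upweighted" means each manifest entry is simply repeated, uses the approximate clip counts from the table, and takes the ×1 weight for the synthetic corpus from the training recipe; the actual rasr mixing code may work differently. The entry format (`audio_filepath`, `duration`, `text`) follows the usual NeMo JSON-lines manifest.

```python
import random


def mix_manifests(sources, seed=0):
    """Concatenate manifests, repeating each source's entries `weight` times, then shuffle."""
    mixed = []
    for entries, weight in sources:
        mixed.extend(entries * weight)
    random.Random(seed).shuffle(mixed)
    return mixed


def effective_shares(counts_and_weights):
    """Map source name -> fraction of the mixed set, given (clip_count, weight) pairs."""
    weighted = {name: n * w for name, (n, w) in counts_and_weights.items()}
    total = sum(weighted.values())
    return {name: value / total for name, value in weighted.items()}


# Approximate clip counts from the table above; weights from the training recipe.
shares = effective_shares({
    "radiotalk (synthetic)": (200_000, 1),
    "atco2-asr train": (446, 10),
    "atco2-asr-atcosim train": (10_000, 10),
})
for name, share in shares.items():
    print(f"{name}: {share:.1%}")

# Tiny demonstration of the repetition itself, with placeholder entries.
synthetic = [{"audio_filepath": "syn.wav", "duration": 4.0, "text": "placeholder"}]
real = [{"audio_filepath": "real.wav", "duration": 3.0, "text": "placeholder"}]
print(len(mix_manifests([(synthetic, 1), (real, 10)])))
```

Under these assumptions the mix is roughly two thirds synthetic and one third real audio, even though the real clips are only about 5% of the unweighted data.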
### Llama 3.2 attribution
This model is "Built with Llama" under the [Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE). Llama 3.2 was used to generate the ATC dialogue transcripts in the `radiotalk-us-audio-tada-noisy` dataset — those transcripts are the supervised targets the model learned to produce. The audio itself was synthesized by Tada (not Llama).
## Training recipe
Full reproducible recipe: [`configs/train/rtx6kpro/parakeet-mixed.yaml`](https://github.com/twangodev/rasr/blob/main/configs/train/rtx6kpro/parakeet-mixed.yaml).
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW, β=(0.9, 0.98), weight_decay=1e-3 |
| Learning rate | 1e-4 |
| Schedule | CosineAnnealing, warmup 5000 steps, min_lr=1e-6 |
| Batch size | 32 (effective) |
| Precision | bf16-mixed |
| Max steps | 50,000 |
| Augmentation | SpecAugment (default), speed perturb 0.95-1.05 |
| Max audio duration | 18.0 s |
| Mixing | weighted manifest concat (radiotalk ×1, ATCO2 train ×10, ATCO2+ATCOSIM train ×10) |
| Hardware | NVIDIA RTX PRO 6000 Blackwell (96 GB) |
| Wall clock | ~12 hours |
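The learning-rate row can be pictured as a linear warmup followed by cosine decay. The function below is a generic sketch of that shape using the values from the table; it is not NeMo's `CosineAnnealing` implementation, which may differ in step offsets and edge handling, and `lr_at` is a name introduced here for illustration.

```python
import math


def lr_at(step, base_lr=1e-4, min_lr=1e-6, warmup_steps=5_000, max_steps=50_000):
    """Learning rate at `step`: linear warmup, then cosine decay down to `min_lr`."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / warmup_steps
    if step >= max_steps:
        return min_lr
    # Fraction of the decay phase completed, in [0, 1).
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 2_500, 5_000, 27_500, 50_000):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

With these settings the rate peaks at 1e-4 when warmup ends at step 5,000 and reaches the 1e-6 floor at step 50,000.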
## Strengths
- **Structurally robust ATC output.** Position-call grammar (CTAF + towered), runway IDs, headings, and altitude readbacks are recovered cleanly.
- **Strong on numeric/safety-critical content.** Per-utterance numeric WER of 0.050 on ATCO2 val, versus 0.074 for the prior SOTA on the same axis (about 32% lower).
- **Stable on out-of-distribution audio.** Zero runaway hallucinations observed on real US GA audio (TartanAviation KBTP), unlike LLM-decoder ASR models (e.g., Canary-Qwen, Granite Speech) which confabulate confidently on hard audio.
- **Small footprint.** 0.6B params, fits in 4 GB VRAM at inference; ~10× faster than larger Whisper-based ATC finetunes.
## Limitations
This model was trained on a US-style synthetic corpus plus a European real-data anchor. The combination produces specific biases users should be aware of:
1. **Operator substitution bias.** The model has been observed substituting unfamiliar callsigns with familiar ones from its training distribution — e.g., emitting "Lufthansa" or "Delta" where the audio contained a less-common operator. Particularly noticeable on US general aviation (GA) traffic, where N-number tail callsigns (e.g., "Cessna Eight One Niner Charlie Mike") may be mis-substituted with major airline prefixes.
2. **Limited US GA airport name coverage.** The model has not seen most small US GA airport names during training. On real US GA audio (e.g., TartanAviation KBTP recordings), it produces phonetically similar substitutions for the airport name ("Bravo Traffic", "Bello Traffic") instead of the correct name ("Butler Traffic").
3. **European real-anchor contamination on US output.** Training included real European ATCO2/ATCOSIM data to anchor the distribution and to reach the SOTA result on ATCO2 val. This European prior is visible in US-style transcription as occasional "Swiss", "Bern Tower", or "Belfast Tower" tokens that should not appear.
4. **Sanity rate on real US GA audio: 77%** (10% CLEAN + 67% PLAUSIBLE-MISHEARD across 69 TartanAviation KBTP clips). In the imperfect cases, the failure is overwhelmingly a *plausible but wrong word in the correct slot*, not garbling or hallucination.
5. **Evaluation distribution.** This model is benchmarked against ATCO2 (European real ATC). It has not been evaluated against a US ATC benchmark — no fully public US ATC ASR test set with annotations currently exists.
## Recommended usage
- **For European ATC** (or audio matching ATCO2-style distribution): deploy as-is. Numbers above are the expected performance.
- **For US ATC**: use with **inference-time hot-word biasing** against a known callsign + airport-name vocabulary specific to the deployment region. NeMo's TDT decoder supports hot-word biasing via `change_decoding_strategy()`. Most substitution failures collapse to correct output with appropriate biasing.
- **For safety-critical applications**: always layer with confidence-based rejection. This model is intended as a research/development checkpoint, not as a safety-certified ATC transcription system.
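As one way to picture confidence-based rejection, the sketch below routes low-confidence transcripts to human review instead of passing them downstream. Everything here is hypothetical: the `Hypothesis` class, the `triage` helper, the 0.85 threshold, and the example scores are illustrative, and how an utterance-level confidence is obtained depends on your decoding configuration.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    confidence: float  # assumed utterance-level score in [0, 1]


def triage(hypotheses, threshold=0.85):
    """Split hypotheses into (accepted, needs_review) by a confidence threshold."""
    accepted, needs_review = [], []
    for hyp in hypotheses:
        if hyp.confidence >= threshold:
            accepted.append(hyp)
        else:
            needs_review.append(hyp)
    return accepted, needs_review


# Illustrative scores only; not measured outputs of this model.
batch = [
    Hypothesis("butler traffic cessna eight one niner charlie mike", 0.93),
    Hypothesis("bello traffic cessna eight one niner charlie mike", 0.58),
]
accepted, needs_review = triage(batch)
print(f"accepted: {len(accepted)}, flagged for review: {len(needs_review)}")
```

The threshold should be tuned on held-out audio from the deployment region, since a substituted callsign can still receive a high score.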
## Citation
If you use this model, please cite the project and the underlying components:
```bibtex
@software{rasr,
author = {Ding, James},
title = {rasr: ATC ASR finetuning toolkit},
url = {https://github.com/twangodev/rasr},
year = {2026}
}
```
And the base model:
```bibtex
@misc{parakeet-tdt,
author = {NVIDIA},
title = {Parakeet-TDT-0.6B-v3},
url = {https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3}
}
```
And Llama 3.2 (training transcripts):
```bibtex
@misc{llama3.2,
author = {{Meta AI}},
title = {The Llama 3.2 Herd of Models},
year = {2024},
url = {https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/}
}
```
## License
Released under the **[Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE)** ("Built with Llama"). This is the binding upstream license because the training transcripts were generated by Llama 3.2, and the resulting model is treated as a derivative work of Llama Materials for licensing purposes.
In addition to the Llama 3.2 terms, this model also inherits attribution and use requirements from its other parents:
- **Parakeet-TDT-0.6B-v3** ([CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/), NVIDIA) — base model
- **ATCO2 corpus** (CC-BY-4.0) — real-data anchor (train split)
- **ATCOSIM corpus** (research use; see [source](https://www.spsc.tugraz.at/databases-and-tools/atcosim-air-traffic-control-simulation-speech-corpus.html))
- **radiotalk-us-audio-tada-noisy** (Llama 3.2 Community License — transcripts generated by Llama 3.2, audio synthesized via Tada) — synthetic training audio
To redistribute or deploy:
1. Include a copy of the Llama 3.2 Community License.
2. Display "Built with Llama" in your product / user interface / about page.
3. Comply with the [Llama 3.2 Acceptable Use Policy](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/USE_POLICY.md).
4. If your service exceeds 700M monthly active users, request a separate commercial license from Meta.
This is not legal advice. If you are deploying this model commercially or at scale, consult a lawyer regarding the interaction of the upstream licenses.