---
license: llama3.2
language:
- en
base_model: nvidia/parakeet-tdt-0.6b-v3
tags:
- automatic-speech-recognition
- asr
- atc
- air-traffic-control
- aviation
- parakeet
- nemo
- fastconformer
- tdt
- finetuned
- built-with-llama
datasets:
- twangodev/radiotalk-us-audio-tada-noisy
- jlvdoorn/atco2-asr
- jlvdoorn/atco2-asr-atcosim
metrics:
- wer
- cer
library_name: nemo
pipeline_tag: automatic-speech-recognition
model-index:
- name: rasr-parakeet-v1
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech-to-Text
    dataset:
      name: ATCO2 (jlvdoorn/atco2-asr validation)
      type: jlvdoorn/atco2-asr
      split: validation
    metrics:
    - type: wer
      value: 0.1246
      name: Word Error Rate
    - type: cer
      value: 0.0780
      name: Character Error Rate
---
# rasr-parakeet-v1
An air traffic control (ATC) automatic speech recognition (ASR) finetune of `nvidia/parakeet-tdt-0.6b-v3`, trained on a synthetic US-style ATC corpus (`radiotalk-us-audio-tada-noisy`) with a small real-ATC anchor (ATCO2 + ATCOSIM train splits). It is the v1 model trained with the [rasr](https://github.com/twangodev/rasr) toolkit.
## Headline
| Metric | This model | Prior public SOTA (`jlvdoorn/whisper-large-v3-atco2-asr`) |
|---|---|---|
| **ATCO2 val WER** | **0.125** | 0.157 |
| **ATCO2 val CER** | **0.078** | 0.088 |
| **ATCO2 val numeric WER** | **0.050** | 0.074 |
21% relative WER reduction over the previous public SOTA on the ATCO2 validation benchmark, with a smaller base model (0.6B params vs 1.55B).
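
The relative figures can be checked directly from the table. The short sketch below is only arithmetic on the published numbers; the helper name `relative_reduction` is illustrative, and the unrounded WER of 0.1246 comes from the model-index metadata.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fractional reduction of an error rate relative to a baseline."""
    return (baseline - new) / baseline


# Error rates from the headline table (WER uses the unrounded metadata value).
wer_gain = relative_reduction(0.157, 0.1246)
cer_gain = relative_reduction(0.088, 0.078)
numeric_wer_gain = relative_reduction(0.074, 0.050)

print(f"WER: {wer_gain:.1%}  CER: {cer_gain:.1%}  numeric WER: {numeric_wer_gain:.1%}")
```

These work out to roughly 21%, 11%, and 32% relative reductions respectively.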
## Quick start
```python
import nemo.collections.asr as nemo_asr
model = nemo_asr.models.ASRModel.from_pretrained("twangodev/rasr-parakeet-v1")
result = model.transcribe(["atc_clip.wav"])
print(result[0].text)
```
Or via the rasr eval toolkit:
```bash
pip install rasr
rasr eval run \
-m nemo:hf://twangodev/rasr-parakeet-v1 \
-d hf:jlvdoorn/atco2-asr:validation \
--language en --batch-size 16
```
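
For readers less familiar with the reported metrics, WER and CER are edit distances normalized by reference length. The following is a minimal sketch of that definition, not the rasr implementation: it assumes lowercasing and whitespace tokenization only, whereas a real evaluation pipeline may normalize text differently.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences (single-row DP)."""
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev_diag, row[0] = row[0], i
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            prev_diag, row[j] = row[j], min(
                row[j] + 1,        # deletion of a reference token
                row[j - 1] + 1,    # insertion of a hypothesis token
                prev_diag + cost,  # substitution or match
            )
    return row[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits over reference length."""
    ref, hyp = list(reference.lower()), list(hypothesis.lower())
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)


# One substituted word out of six reference words.
print(wer("cleared to land runway two seven", "cleared to land runway two nine"))
```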
## Architecture
- **Base**: [`nvidia/parakeet-tdt-0.6b-v3`](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) (FastConformer encoder + TDT decoder, 0.6B params)
- **Tokenizer**: kept from base — SentencePiece BPE 8192 tokens, multilingual
- **Sample rate**: 16 kHz mono
- **Max input duration**: 18 seconds, matching the training cap (longer inputs may degrade; the cap reflects TDT joint memory)
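
One way to respect the 18-second limit on longer recordings is to split them into overlapping windows before transcription. The sketch below only computes sample-index boundaries at 16 kHz; the function name and the 1-second overlap are assumptions for illustration, and it does not cover loading audio, merging overlapping transcripts, or avoiding cuts in the middle of a word (a voice-activity detector would be a better splitter in practice).

```python
SAMPLE_RATE = 16_000  # Hz, from the model card
MAX_SECONDS = 18.0    # maximum input duration, from the model card


def chunk_bounds(num_samples: int, overlap_s: float = 1.0) -> list[tuple[int, int]]:
    """Return (start, end) sample indices covering a clip in windows of <= 18 s."""
    window = int(MAX_SECONDS * SAMPLE_RATE)
    step = window - int(overlap_s * SAMPLE_RATE)
    if step <= 0:
        raise ValueError("overlap must be shorter than the window")
    if num_samples <= 0:
        return []

    bounds = []
    start = 0
    while True:
        end = min(start + window, num_samples)
        bounds.append((start, end))
        if end >= num_samples:  # the last window reaches the end of the clip
            break
        start += step
    return bounds


# A 40-second recording is covered by three overlapping windows.
for start, end in chunk_bounds(40 * SAMPLE_RATE):
    print(f"{start / SAMPLE_RATE:5.1f}s -> {end / SAMPLE_RATE:5.1f}s")
```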
## Training data
**This model was trained on transcripts generated by Llama 3.2 and audio synthesized via the Tada TTS pipeline.** Specifically:
| Source | Type | Role |
|---|---|---|
| [`twangodev/radiotalk-us-audio-tada-noisy`](https://huggingface.co/datasets/twangodev/radiotalk-us-audio-tada-noisy) (200k subset) | Synthetic US ATC | Bulk training audio. Dialogue transcripts generated by **Llama 3.2**; audio synthesized by [Tada](https://github.com/twangodev/tada) (TTS) and passed through a VHF channel degradation pipeline. |
| [`jlvdoorn/atco2-asr`](https://huggingface.co/datasets/jlvdoorn/atco2-asr) (train split, ~446 clips) | Real European ATC | Real-data anchor; upweighted 10× to supply real-radio acoustic priors and European operator vocabulary. |
| [`jlvdoorn/atco2-asr-atcosim`](https://huggingface.co/datasets/jlvdoorn/atco2-asr-atcosim) (train, ~10k clips) | Real EU ATC + simulator | Real-data anchor; upweighted 10×. |
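
To get a feel for what the 10× upweighting does to the training mix, the sketch below computes effective entry counts from the approximate clip counts in the table. It assumes upweighting is equivalent to repeating each manifest entry; the actual manifest construction in rasr may differ, and the counts are the rounded figures above rather than exact values.

```python
# (approximate clip count, upweighting factor) per source, from the table above.
sources = {
    "radiotalk (synthetic)": (200_000, 1),
    "atco2-asr train": (446, 10),
    "atco2-asr-atcosim train": (10_000, 10),
}

# Assumption: a weight of N behaves like repeating every entry N times.
effective = {name: count * weight for name, (count, weight) in sources.items()}
total = sum(effective.values())
real_share = (total - effective["radiotalk (synthetic)"]) / total

for name, n in effective.items():
    print(f"{name:26s} {n:>8,d} entries  {n / total:6.1%} of the mix")
print(f"real-audio share of the mix: {real_share:.1%}")
```

Under these assumptions, real audio makes up about a third of the sampled entries even though it is roughly 5% of the unique clips.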
### Llama 3.2 attribution
This model is "Built with Llama" under the [Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE). Llama 3.2 was used to generate the ATC dialogue transcripts in the `radiotalk-us-audio-tada-noisy` dataset — those transcripts are the supervised targets the model learned to produce. The audio itself was synthesized by Tada (not Llama).
## Training recipe
Full reproducible recipe: [`configs/train/rtx6kpro/parakeet-mixed.yaml`](https://github.com/twangodev/rasr/blob/main/configs/train/rtx6kpro/parakeet-mixed.yaml).
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW, β=(0.9, 0.98), weight_decay=1e-3 |
| Learning rate | 1e-4 |
| Schedule | CosineAnnealing, warmup 5000 steps, min_lr=1e-6 |
| Batch size | 32 (effective) |
| Precision | bf16-mixed |
| Max steps | 50,000 |
| Augmentation | SpecAugment (default), speed perturb 0.95-1.05 |
| Max audio duration | 18.0 s |
| Mixing | weighted manifest concat (radiotalk ×1, ATCO2 train ×10, ATCO2+ATCOSIM train ×10) |
| Hardware | NVIDIA RTX PRO 6000 Blackwell (96 GB) |
| Wall clock | ~12 hours |
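
The learning-rate schedule in the table can be visualized with a few lines of Python. This is a simplified sketch that assumes linear warmup from zero followed by cosine decay to `min_lr`; the exact warmup formula in NeMo's `CosineAnnealing` scheduler may differ slightly, so treat the values as approximate.

```python
import math

BASE_LR, MIN_LR = 1e-4, 1e-6            # from the hyperparameter table
WARMUP_STEPS, MAX_STEPS = 5_000, 50_000  # from the hyperparameter table


def lr_at(step: int) -> float:
    """Linear warmup to BASE_LR, then cosine decay to MIN_LR (simplified)."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS))
    return MIN_LR + 0.5 * (BASE_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


for step in (0, 2_500, 5_000, 27_500, 50_000):
    print(f"step {step:>6,d}: lr = {lr_at(step):.2e}")
```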
## Strengths
- **Structurally robust ATC output.** Position-call grammar (CTAF + towered), runway IDs, headings, and altitude readbacks are recovered cleanly.
- **Strong on numeric/safety-critical content.** Per-utterance numeric WER 0.050 on ATCO2 val, versus 0.074 for the prior SOTA (about 32% lower on the same axis).
- **Stable on out-of-distribution audio.** Zero runaway hallucinations observed on real US GA audio (TartanAviation KBTP), unlike LLM-decoder ASR models (e.g., Canary-Qwen, Granite Speech), which can confabulate confidently on hard audio.
- **Small footprint.** 0.6B params, fits in 4 GB VRAM at inference; ~10× faster than larger Whisper-based ATC finetunes.
## Limitations
This model was trained on a US-style synthetic corpus plus a European real-data anchor. The combination produces specific biases users should be aware of:
1. **Operator substitution bias.** The model has been observed replacing unfamiliar callsigns with familiar ones from its training distribution, e.g., emitting "Lufthansa" or "Delta" where the audio contained a less common operator. This is particularly noticeable on US general aviation (GA) traffic, where N-number tail callsigns (e.g., "Cessna Eight One Niner Charlie Mike") may be replaced with major airline prefixes.
2. **Limited US GA airport name coverage.** The model has not seen most small US GA airport names during training. On real US GA audio (e.g., TartanAviation KBTP recordings), it produces phonetically-similar substitutions for the airport name ("Bravo Traffic", "Bello Traffic") instead of the correct name ("Butler Traffic").
3. **European real-anchor contamination on US output.** Training included real European ATCO2/ATCOSIM data to anchor the model to real-radio audio and to reach the SOTA result on ATCO2 val. This European prior shows up in US-style transcription as occasional "Swiss", "Bern Tower", or "Belfast Tower" tokens that should not appear.
4. **Sanity rate on real US GA audio: 77%** (10% CLEAN + 67% PLAUSIBLE-MISHEARD across 69 TartanAviation KBTP clips). Of the imperfect cases, the failure is overwhelmingly *a plausible but wrong word in the correct slot*, not garbling or hallucination.
5. **Evaluation distribution.** This model is benchmarked against ATCO2 (European real ATC). It has not been evaluated against a US ATC benchmark — no fully public US ATC ASR test set with annotations currently exists.
## Recommended usage
- **For European ATC** (or audio matching ATCO2-style distribution): deploy as-is. Numbers above are the expected performance.
- **For US ATC**: use with **inference-time hot-word biasing** against a known callsign + airport-name vocabulary specific to the deployment region. NeMo's TDT decoder supports hot-word biasing via `change_decoding_strategy()`. Most substitution failures collapse to correct output with appropriate biasing.
- **For safety-critical applications**: always layer with confidence-based rejection. This model is intended as a research/development checkpoint, not as a safety-certified ATC transcription system.
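
As an illustration of confidence-based rejection, the sketch below accepts a transcript only if every word and the utterance as a whole clear a threshold. Everything here is hypothetical: the `Word` structure, the per-word confidence scores, and both threshold values are assumptions for the example, and how scores are obtained from the decoder is outside its scope. Thresholds would need tuning on held-out audio from the deployment region.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    confidence: float  # hypothetical per-word score in [0, 1]


def accept(words: list[Word], min_word_conf: float = 0.5, min_mean_conf: float = 0.8) -> bool:
    """Accept a transcript only if no word is very uncertain and the mean is high."""
    if not words:
        return False  # nothing recognized: route to a human or discard
    scores = [w.confidence for w in words]
    return min(scores) >= min_word_conf and sum(scores) / len(scores) >= min_mean_conf


confident = [Word("cessna", 0.95), Word("eight", 0.90), Word("one", 0.92), Word("niner", 0.88)]
doubtful = [Word("delta", 0.31), Word("eight", 0.90), Word("one", 0.92), Word("niner", 0.88)]

print(accept(confident))  # all words clear both thresholds
print(accept(doubtful))   # the low-confidence callsign triggers rejection
```

Rejecting on the minimum as well as the mean is a deliberate choice here: a single low-confidence callsign or number can matter more than the average quality of the utterance.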
## Citation
If you use this model, please cite the project and the underlying components:
```bibtex
@software{rasr,
author = {Ding, James},
title = {rasr: ATC ASR finetuning toolkit},
url = {https://github.com/twangodev/rasr},
year = {2026}
}
```
And the base model:
```bibtex
@misc{parakeet-tdt,
author = {NVIDIA},
title = {Parakeet-TDT-0.6B-v3},
url = {https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3}
}
```
And Llama 3.2 (training transcripts):
```bibtex
@misc{llama3.2,
author = {{Meta AI}},
title = {The Llama 3.2 Herd of Models},
year = {2024},
url = {https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/}
}
```
## License
Released under the **[Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE)** ("Built with Llama"). This is the binding upstream license because the training transcripts were generated by Llama 3.2, and the resulting model is treated as a derivative work of Llama Materials for licensing purposes.
In addition to the Llama 3.2 terms, this model also inherits attribution and use requirements from its other parents:
- **Parakeet-TDT-0.6B-v3** ([CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/), NVIDIA) — base model
- **ATCO2 corpus** (CC-BY-4.0) — real-data anchor (train split)
- **ATCOSIM corpus** (research use; see [source](https://www.spsc.tugraz.at/databases-and-tools/atcosim-air-traffic-control-simulation-speech-corpus.html))
- **radiotalk-us-audio-tada-noisy** (Llama 3.2 Community License — transcripts generated by Llama 3.2, audio synthesized via Tada) — synthetic training audio
To redistribute or deploy:
1. Include a copy of the Llama 3.2 Community License.
2. Display "Built with Llama" in your product / user interface / about page.
3. Comply with the [Llama 3.2 Acceptable Use Policy](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/USE_POLICY.md).
4. If your service exceeds 700M monthly active users, request a separate commercial license from Meta.
This is not legal advice. If you are deploying this model commercially or at scale, consult a lawyer regarding the interaction of the upstream licenses.