---
license: llama3.2
language:
- en
base_model: nvidia/parakeet-tdt-0.6b-v3
tags:
- automatic-speech-recognition
- asr
- atc
- air-traffic-control
- aviation
- parakeet
- nemo
- fastconformer
- tdt
- finetuned
- built-with-llama
datasets:
- twangodev/radiotalk-us-audio-tada-noisy
- jlvdoorn/atco2-asr
- jlvdoorn/atco2-asr-atcosim
metrics:
- wer
- cer
library_name: nemo
pipeline_tag: automatic-speech-recognition
model-index:
- name: rasr-parakeet-v1
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech-to-Text
    dataset:
      name: ATCO2 (jlvdoorn/atco2-asr validation)
      type: jlvdoorn/atco2-asr
      split: validation
    metrics:
    - type: wer
      value: 0.1246
      name: Word Error Rate
    - type: cer
      value: 0.0780
      name: Character Error Rate
---

# rasr-parakeet-v1

ATC ASR finetune of `nvidia/parakeet-tdt-0.6b-v3` on a synthetic US-style ATC corpus (`radiotalk-us-audio-tada-noisy`) with a small real-ATC anchor (ATCO2 + ATCOSIM train splits). Trained as v1 of the [rasr](https://github.com/twangodev/rasr) toolkit.

## Headline

| Metric | This model | Prior public SOTA (`jlvdoorn/whisper-large-v3-atco2-asr`) |
|---|---|---|
| **ATCO2 val WER** | **0.125** | 0.157 |
| **ATCO2 val CER** | **0.078** | 0.088 |
| **ATCO2 val numeric WER** | **0.050** | 0.074 |

21% relative WER reduction over the previous public SOTA on the ATCO2 validation benchmark, with a smaller base model (0.6B params vs 1.55B).
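
The relative figure can be reproduced from the numbers above. A minimal sketch: the helper name `relative_reduction` is illustrative, and the unrounded WER and CER of this model are taken from the model-index metadata (0.1246 and 0.0780).

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` improves on `baseline` (lower is better)."""
    return (baseline - new) / baseline


# (prior SOTA, this model) pairs from the headline table.
scores = {
    "WER": (0.157, 0.1246),
    "CER": (0.088, 0.0780),
    "numeric WER": (0.074, 0.050),
}

for name, (prior, ours) in scores.items():
    print(f"{name}: {relative_reduction(prior, ours):.1%} relative reduction")
```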

## Quick start

```python
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("twangodev/rasr-parakeet-v1")
result = model.transcribe(["atc_clip.wav"])
print(result[0].text)
```

Or via the rasr eval toolkit:

```bash
pip install rasr
rasr eval run \
  -m nemo:hf://twangodev/rasr-parakeet-v1 \
  -d hf:jlvdoorn/atco2-asr:validation \
  --language en --batch-size 16
```

## Architecture

- **Base**: [`nvidia/parakeet-tdt-0.6b-v3`](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) (FastConformer encoder + TDT decoder, 0.6B params)
- **Tokenizer**: kept from base — SentencePiece BPE 8192 tokens, multilingual
- **Sample rate**: 16 kHz mono
- **Max input duration**: 18 seconds (longer inputs may degrade output quality and increase TDT joint-network memory use)
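
One simple way to stay within the 18-second limit is to split longer recordings into fixed windows before transcription. The sketch below only computes sample-index spans; `chunk_sample_spans` is a hypothetical helper, not part of NeMo or rasr, and fixed windows are an assumption here (cuts may land mid-word, so splitting on silence would likely behave better).

```python
def chunk_sample_spans(
    num_samples: int,
    sample_rate: int = 16_000,   # model expects 16 kHz mono
    max_seconds: float = 18.0,   # max input duration stated above
) -> list[tuple[int, int]]:
    """Return (start, end) sample indices covering the audio in windows
    no longer than `max_seconds`. `end` is exclusive."""
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")
    max_len = int(max_seconds * sample_rate)
    return [
        (start, min(start + max_len, num_samples))
        for start in range(0, num_samples, max_len)
    ]


# A 40-second clip at 16 kHz is covered by three windows (18 s, 18 s, 4 s).
spans = chunk_sample_spans(40 * 16_000)
```

Each span can then be sliced out of the waveform and written to its own file before being passed to `model.transcribe(...)`.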

## Training data

**This model was trained on transcripts generated by Llama 3.2 and audio synthesized via the Tada TTS pipeline.** Specifically:

| Source | Type | Role |
|---|---|---|
| [`twangodev/radiotalk-us-audio-tada-noisy`](https://huggingface.co/datasets/twangodev/radiotalk-us-audio-tada-noisy) (200k subset) | Synthetic US ATC | Bulk training audio. Dialogue transcripts generated by **Llama 3.2**, audio synthesized by [Tada](https://github.com/twangodev/tada) (TTS) with VHF channel degradation pipeline. |
| [`jlvdoorn/atco2-asr`](https://huggingface.co/datasets/jlvdoorn/atco2-asr) (train split, ~446 clips) | Real European ATC | Real-data anchor; upweighted 10× to supply real-radio acoustic priors and European operator vocabulary. |
| [`jlvdoorn/atco2-asr-atcosim`](https://huggingface.co/datasets/jlvdoorn/atco2-asr-atcosim) (train, ~10k clips) | Real EU ATC + simulator | Real-data anchor; upweighted 10×. |
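
To get a feel for what the 10× upweighting does to the training mix, the sketch below repeats manifest entries by an integer weight and computes each source's effective share. It assumes upweighting is done by plain repetition and uses the approximate clip counts from the table; the function names and source keys are illustrative, and the actual recipe lives in the rasr config.

```python
def repeat_entries(entries: list[dict], weight: int) -> list[dict]:
    """Upweight a manifest by repeating every entry `weight` times."""
    return list(entries) * weight


def effective_shares(sources: dict[str, tuple[int, int]]) -> dict[str, float]:
    """Map source name -> fraction of the weighted mix.

    `sources` maps name -> (clip_count, weight).
    """
    weighted = {name: clips * weight for name, (clips, weight) in sources.items()}
    total = sum(weighted.values())
    return {name: count / total for name, count in weighted.items()}


# Approximate clip counts from the table above (assumed, not exact).
sources = {
    "radiotalk": (200_000, 1),
    "atco2_train": (446, 10),
    "atco2_atcosim_train": (10_000, 10),
}
shares = effective_shares(sources)
for name, share in shares.items():
    print(f"{name}: {share:.1%} of weighted training examples")
```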

### Llama 3.2 attribution

This model is "Built with Llama" under the [Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE). Llama 3.2 was used to generate the ATC dialogue transcripts in the `radiotalk-us-audio-tada-noisy` dataset — those transcripts are the supervised targets the model learned to produce. The audio itself was synthesized by Tada (not Llama).

## Training recipe

Full reproducible recipe: [`configs/train/rtx6kpro/parakeet-mixed.yaml`](https://github.com/twangodev/rasr/blob/main/configs/train/rtx6kpro/parakeet-mixed.yaml).

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW, β=(0.9, 0.98), weight_decay=1e-3 |
| Learning rate | 1e-4 |
| Schedule | CosineAnnealing, warmup 5000 steps, min_lr=1e-6 |
| Batch size | 32 (effective) |
| Precision | bf16-mixed |
| Max steps | 50,000 |
| Augmentation | SpecAugment (default), speed perturb 0.95-1.05 |
| Max audio duration | 18.0 s |
| Mixing | weighted manifest concat (radiotalk ×1, ATCO2 train ×10, ATCO2+ATCOSIM train ×10) |
| Hardware | NVIDIA RTX PRO 6000 Blackwell (96 GB) |
| Wall clock | ~12 hours |
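
The learning-rate schedule in the table can be sketched as linear warmup followed by cosine decay. This is an illustration under stated assumptions, not NeMo's `CosineAnnealing` implementation: the warmup shape (linear from zero) and the exact step indexing are assumed and may differ by small off-by-one details.

```python
import math


def lr_at_step(
    step: int,
    base_lr: float = 1e-4,
    min_lr: float = 1e-6,
    warmup_steps: int = 5_000,
    max_steps: int = 50_000,
) -> float:
    """Learning rate at `step` for linear warmup + cosine annealing."""
    if step < warmup_steps:
        # Assumed: linear ramp from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / (max_steps - warmup_steps))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 2_500, 5_000, 27_500, 50_000):
    print(step, f"{lr_at_step(step):.2e}")
```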

## Strengths

- **Structurally robust ATC output.** Position-call grammar (CTAF + towered), runway IDs, headings, and altitude readbacks are recovered cleanly.
- **Strong on numeric/safety-critical content.** Per-utterance numeric WER 0.050 on ATCO2 val, versus 0.074 for the prior SOTA (about a 32% relative reduction).
- **Stable on out-of-distribution audio.** Zero runaway hallucinations observed on real US GA audio (TartanAviation KBTP), unlike LLM-decoder ASR models (e.g., Canary-Qwen, Granite Speech) which confabulate confidently on hard audio.
- **Small footprint.** 0.6B params, fits in 4 GB VRAM at inference; ~10× faster than larger Whisper-based ATC finetunes.

## Limitations

This model was trained on a US-style synthetic corpus plus a European real-data anchor. The combination produces specific biases users should be aware of:

1. **Operator substitution bias.** The model has been observed substituting unfamiliar callsigns with familiar ones from its training distribution — e.g., emitting "Lufthansa" or "Delta" where the audio contained a less-common operator. Particularly noticeable on US general aviation (GA) traffic, where N-number tail callsigns (e.g., "Cessna Eight One Niner Charlie Mike") may be mis-substituted with major airline prefixes.

2. **Limited US GA airport name coverage.** The model has not seen most small US GA airport names during training. On real US GA audio (e.g., TartanAviation KBTP recordings), it produces phonetically similar substitutions for the airport name ("Bravo Traffic", "Bello Traffic") instead of the correct name ("Butler Traffic").

3. **European real-anchor contamination on US output.** Training included real European ATCO2/ATCOSIM data to anchor the model to real-radio audio and to reach the SOTA result on ATCO2 val. This European prior is visible in US-style transcription (occasional "Swiss", "Bern Tower", "Belfast Tower" tokens that should not appear).

4. **Sanity rate on real US GA audio: 77%** (10% CLEAN + 67% PLAUSIBLE-MISHEARD across 69 TartanAviation KBTP clips). In the imperfect cases, the failure is overwhelmingly a *wrong word substituted in the correct slot*, not garbling or hallucination.

5. **Evaluation distribution.** This model is benchmarked against ATCO2 (European real ATC). It has not been evaluated against a US ATC benchmark — no fully public US ATC ASR test set with annotations currently exists.
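
Since no public US benchmark is available, users may want to score the model on their own annotated clips. The following is a minimal, dependency-free sketch of word error rate (word-level edit distance divided by reference length). It lowercases and splits on whitespace only; the text normalization behind the numbers reported above is not specified here and may differ.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate of `hypothesis` against `reference`."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


# The airport-name substitution from limitation 2: one wrong word out of two.
print(wer("Butler Traffic", "Bello Traffic"))
```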

## Recommended usage

- **For European ATC** (or audio matching ATCO2-style distribution): deploy as-is. Numbers above are the expected performance.
- **For US ATC**: use with **inference-time hot-word biasing** against a known callsign + airport-name vocabulary specific to the deployment region. NeMo's TDT decoder supports hot-word biasing via `change_decoding_strategy()`. Most substitution failures collapse to correct output with appropriate biasing.
- **For safety-critical applications**: always layer with confidence-based rejection. This model is intended as a research/development checkpoint, not as a safety-certified ATC transcription system.
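
Confidence-based rejection can be as simple as routing low-confidence transcripts to a human instead of acting on them. The sketch below assumes each transcript already comes with a confidence score in [0, 1]; how that score is obtained from the model is outside its scope, and `ScoredTranscript`, `triage`, and the 0.85 threshold are hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScoredTranscript:
    text: str
    confidence: float  # assumed to lie in [0, 1]; source of the score not shown


def triage(
    transcripts: list[ScoredTranscript],
    threshold: float = 0.85,  # hypothetical value; tune on held-out data
) -> tuple[list[ScoredTranscript], list[ScoredTranscript]]:
    """Split transcripts into (accepted, rejected) by confidence."""
    accepted = [t for t in transcripts if t.confidence >= threshold]
    rejected = [t for t in transcripts if t.confidence < threshold]
    return accepted, rejected


batch = [
    ScoredTranscript("cessna eight one niner charlie mike", 0.93),
    ScoredTranscript("bello traffic", 0.41),
]
accepted, rejected = triage(batch)
```

Rejected items would typically be flagged for manual review rather than silently dropped.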

## Citation

If you use this model, please cite the project and the underlying components:

```bibtex
@software{rasr,
  author = {Ding, James},
  title = {rasr: ATC ASR finetuning toolkit},
  url = {https://github.com/twangodev/rasr},
  year = {2026}
}
```

And the base model:

```bibtex
@misc{parakeet-tdt,
  author = {NVIDIA},
  title = {Parakeet-TDT-0.6B-v3},
  url = {https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3}
}
```

And Llama 3.2 (training transcripts):

```bibtex
@misc{llama3.2,
  author = {{Meta AI}},
  title = {The Llama 3.2 Herd of Models},
  year = {2024},
  url = {https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/}
}
```

## License

Released under the **[Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE)** ("Built with Llama"). This is the binding upstream license because the training transcripts were generated by Llama 3.2, and the resulting model is treated as a derivative work of Llama Materials for licensing purposes.

In addition to the Llama 3.2 terms, this model also inherits attribution and use requirements from its other parents:

- **Parakeet-TDT-0.6B-v3** ([CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/), NVIDIA) — base model
- **ATCO2 corpus** (CC-BY-4.0) — real-data anchor (train split)
- **ATCOSIM corpus** (research use; see [source](https://www.spsc.tugraz.at/databases-and-tools/atcosim-air-traffic-control-simulation-speech-corpus.html))
- **radiotalk-us-audio-tada-noisy** (Llama 3.2 Community License — transcripts generated by Llama 3.2, audio synthesized via Tada) — synthetic training audio

To redistribute or deploy:
1. Include a copy of the Llama 3.2 Community License.
2. Display "Built with Llama" in your product / user interface / about page.
3. Comply with the [Llama 3.2 Acceptable Use Policy](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/USE_POLICY.md).
4. If your service exceeds 700M monthly active users, request a separate commercial license from Meta.

This is not legal advice. If you are deploying this model commercially or at scale, consult a lawyer regarding the interaction of the upstream licenses.