twangodev committed · Commit 1faf953 · 0 Parent(s)

Initial release: rasr-parakeet-v1

Files changed (4)
  1. .gitattributes +36 -0
  2. README.md +197 -0
  3. rasr-parakeet-v1.nemo +3 -0
  4. training_recipe.yaml +36 -0
.gitattributes ADDED
@@ -0,0 +1,36 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ rasr-parakeet-v1.nemo filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,197 @@
+ ---
+ license: llama3.2
+ language:
+ - en
+ base_model: nvidia/parakeet-tdt-0.6b-v3
+ tags:
+ - automatic-speech-recognition
+ - asr
+ - atc
+ - air-traffic-control
+ - aviation
+ - parakeet
+ - nemo
+ - fastconformer
+ - tdt
+ - finetuned
+ - built-with-llama
+ datasets:
+ - twangodev/radiotalk-us-audio-tada-noisy
+ - jlvdoorn/atco2-asr
+ - jlvdoorn/atco2-asr-atcosim
+ metrics:
+ - wer
+ - cer
+ library_name: nemo
+ pipeline_tag: automatic-speech-recognition
+ model-index:
+ - name: rasr-parakeet-v1
+   results:
+   - task:
+       type: automatic-speech-recognition
+       name: Speech-to-Text
+     dataset:
+       name: ATCO2 (jlvdoorn/atco2-asr validation)
+       type: jlvdoorn/atco2-asr
+       split: validation
+     metrics:
+     - type: wer
+       value: 0.1246
+       name: Word Error Rate
+     - type: cer
+       value: 0.0780
+       name: Character Error Rate
+ ---
+
+ # rasr-parakeet-v1
+
+ ATC ASR finetune of `nvidia/parakeet-tdt-0.6b-v3` on a synthetic US-style ATC corpus (`radiotalk-us-audio-tada-noisy`) with a small real-ATC anchor (ATCO2 + ATCOSIM train splits). Trained as v1 of the [rasr](https://github.com/twangodev/rasr) toolkit.
+
+ ## Headline
+
+ | Metric | This model | Prior public SOTA (`jlvdoorn/whisper-large-v3-atco2-asr`) |
+ |---|---|---|
+ | **ATCO2 val WER** | **0.125** | 0.157 |
+ | **ATCO2 val CER** | **0.078** | 0.088 |
+ | **ATCO2 val numeric WER** | **0.050** | 0.074 |
+
+ 21% relative WER reduction over the previous public SOTA on the ATCO2 validation benchmark, with a smaller base model (0.6B params vs 1.55B).
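
The relative reductions implied by the table above can be checked with a few lines of arithmetic. This is only a sketch: `relative_reduction` is a helper name introduced here, and the WER input uses the unrounded 0.1246 from the model-index metadata rather than the rounded 0.125 shown in the table.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` improves on `baseline` (lower is better)."""
    return (baseline - new) / baseline


# Baseline values come from the headline table; 0.1246 is the unrounded WER.
wer = relative_reduction(0.157, 0.1246)
cer = relative_reduction(0.088, 0.078)
numeric_wer = relative_reduction(0.074, 0.050)

print(f"WER {wer:.1%}, CER {cer:.1%}, numeric WER {numeric_wer:.1%}")
```

The WER figure comes out just under 21%, matching the headline claim once rounded to a whole percent.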
+
+ ## Quick start
+
+ ```python
+ import nemo.collections.asr as nemo_asr
+
+ model = nemo_asr.models.ASRModel.from_pretrained("twangodev/rasr-parakeet-v1")
+ result = model.transcribe(["atc_clip.wav"])
+ print(result[0].text)
+ ```
+
+ Or via the rasr eval toolkit:
+
+ ```bash
+ pip install rasr
+ rasr eval run \
+   -m nemo:hf://twangodev/rasr-parakeet-v1 \
+   -d hf:jlvdoorn/atco2-asr:validation \
+   --language en --batch-size 16
+ ```
+
+ ## Architecture
+
+ - **Base**: [`nvidia/parakeet-tdt-0.6b-v3`](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) (FastConformer encoder + TDT decoder, 0.6B params)
+ - **Tokenizer**: kept from base (SentencePiece BPE, 8192 tokens, multilingual)
+ - **Sample rate**: 16 kHz mono
+ - **Max input duration**: 18 seconds (longer inputs may degrade because of TDT joint memory)
+
+ ## Training data
+
+ **This model was trained on transcripts generated by Llama 3.2 and audio synthesized via the Tada TTS pipeline.** Specifically:
+
+ | Source | Type | Role |
+ |---|---|---|
+ | [`twangodev/radiotalk-us-audio-tada-noisy`](https://huggingface.co/datasets/twangodev/radiotalk-us-audio-tada-noisy) (200k subset) | Synthetic US ATC | Bulk training audio. Dialogue transcripts generated by **Llama 3.2**, audio synthesized by [Tada](https://github.com/twangodev/tada) (TTS) with VHF channel degradation pipeline. |
+ | [`jlvdoorn/atco2-asr`](https://huggingface.co/datasets/jlvdoorn/atco2-asr) (train split, ~446 clips) | Real European ATC | Real-data anchor; upweighted 10× to supply real-radio acoustic priors and European operator vocabulary. |
+ | [`jlvdoorn/atco2-asr-atcosim`](https://huggingface.co/datasets/jlvdoorn/atco2-asr-atcosim) (train, ~10k clips) | Real EU ATC + simulator | Real-data anchor; upweighted 10×. |
+
+ ### Llama 3.2 attribution
+
+ This model is "Built with Llama" under the [Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE). Llama 3.2 was used to generate the ATC dialogue transcripts in the `radiotalk-us-audio-tada-noisy` dataset; those transcripts are the supervised targets the model learned to produce. The audio itself was synthesized by Tada (not Llama).
101
+ ## Training recipe
102
+
103
+ Full reproducible recipe: [`configs/train/rtx6kpro/parakeet-mixed.yaml`](https://github.com/twangodev/rasr/blob/main/configs/train/rtx6kpro/parakeet-mixed.yaml).
104
+
105
+ | Hyperparameter | Value |
106
+ |---|---|
107
+ | Optimizer | AdamW, β=(0.9, 0.98), weight_decay=1e-3 |
108
+ | Learning rate | 1e-4 |
109
+ | Schedule | CosineAnnealing, warmup 5000 steps, min_lr=1e-6 |
110
+ | Batch size | 32 (effective) |
111
+ | Precision | bf16-mixed |
112
+ | Max steps | 50,000 |
113
+ | Augmentation | SpecAugment (default), speed perturb 0.95-1.05 |
114
+ | Max audio duration | 18.0 s |
115
+ | Mixing | weighted manifest concat (radiotalk ×1, ATCO2 train ×10, ATCO2+ATCOSIM train ×10) |
116
+ | Hardware | NVIDIA RTX PRO 6000 Blackwell (96 GB) |
117
+ | Wall clock | ~12 hours |
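
To make the schedule row concrete, here is a minimal sketch of a linear-warmup plus cosine-annealing curve using the values from the table (peak 1e-4, warmup 5000 steps, 50,000 max steps, min_lr 1e-6). The function `lr_at` is written for illustration only; it assumes warmup starts from zero and that the cosine phase spans the remaining steps, so the scheduler actually used in training may differ in such details.

```python
import math


def lr_at(step: int,
          peak_lr: float = 1e-4,
          min_lr: float = 1e-6,
          warmup_steps: int = 5000,
          max_steps: int = 50_000) -> float:
    """Learning rate at `step` for linear warmup followed by cosine decay."""
    if step < warmup_steps:
        # Assumed: linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min((step - warmup_steps) / (max_steps - warmup_steps), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for s in (0, 2500, 5000, 27_500, 50_000):
    print(s, f"{lr_at(s):.2e}")
```

Under these assumptions the rate peaks at the end of warmup, passes roughly the midpoint between peak and floor halfway through the decay phase, and reaches `min_lr` at the final step.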
+
+ ## Strengths
+
+ - **Structurally robust ATC output.** Position-call grammar (CTAF + towered), runway IDs, headings, and altitude readbacks are recovered cleanly.
+ - **Strong on numeric/safety-critical content.** Per-utterance numeric WER is 0.050 on ATCO2 val, versus 0.074 for the prior SOTA on the same axis (about 1.5× lower).
+ - **Stable on out-of-distribution audio.** Zero runaway hallucinations observed on real US GA audio (TartanAviation KBTP), unlike LLM-decoder ASR models (e.g., Canary-Qwen, Granite Speech), which confabulate confidently on hard audio.
+ - **Small footprint.** 0.6B params, fits in 4 GB VRAM at inference; ~10× faster than larger Whisper-based ATC finetunes.
+
+ ## Limitations
+
+ This model was trained on a US-style synthetic corpus plus a European real-data anchor. The combination produces specific biases users should be aware of:
+
+ 1. **Operator substitution bias.** The model has been observed substituting unfamiliar callsigns with familiar ones from its training distribution, e.g., emitting "Lufthansa" or "Delta" where the audio contained a less-common operator. This is particularly noticeable on US general aviation (GA) traffic, where N-number tail callsigns (e.g., "Cessna Eight One Niner Charlie Mike") may be replaced with major airline prefixes.
+
+ 2. **Limited US GA airport name coverage.** The model has not seen most small US GA airport names during training. On real US GA audio (e.g., TartanAviation KBTP recordings), it produces phonetically similar substitutions for the airport name ("Bravo Traffic", "Bello Traffic") instead of the correct name ("Butler Traffic").
+
+ 3. **European real-anchor contamination on US output.** Training included European-real ATCO2/ATCOSIM data to anchor distribution and unblock the SOTA result on ATCO2 val. This European prior is visible in US-style transcription (occasional "Swiss", "Bern Tower", "Belfast Tower" tokens that should not appear).
+
+ 4. **Sanity rate on real US GA audio: 77%** (10% CLEAN + 67% PLAUSIBLE-MISHEARD across 69 TartanAviation KBTP clips). Of the imperfect cases, the failure is overwhelmingly *a plausible but wrong word substituted in the correct slot*, not garbling or hallucination.
+
+ 5. **Evaluation distribution.** This model is benchmarked against ATCO2 (European real ATC). It has not been evaluated against a US ATC benchmark; no fully public US ATC ASR test set with annotations currently exists.
+
+ ## Recommended usage
+
+ - **For European ATC** (or audio matching ATCO2-style distribution): deploy as-is. Numbers above are the expected performance.
+ - **For US ATC**: use with **inference-time hot-word biasing** against a known callsign + airport-name vocabulary specific to the deployment region. NeMo's TDT decoder supports hot-word biasing via `change_decoding_strategy()`. Most substitution failures collapse to correct output with appropriate biasing.
+ - **For safety-critical applications**: always layer with confidence-based rejection. This model is intended as a research/development checkpoint, not as a safety-certified ATC transcription system.
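
As a rough illustration of the confidence-based rejection recommended above, the sketch below gates transcripts on an utterance-level score. Everything in it is hypothetical: the `Transcript` type, the `gate` function, and the 0.85 threshold are placeholders, and how a confidence score is obtained from the decoder is outside the scope of the sketch.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Transcript:
    text: str
    confidence: float  # hypothetical utterance-level score in [0, 1]


def gate(t: Transcript, threshold: float = 0.85) -> Tuple[Optional[str], bool]:
    """Return (accepted_text, needs_review).

    Transcripts below the threshold are withheld and flagged for a human
    to review instead of being passed downstream.
    """
    if t.confidence >= threshold:
        return t.text, False
    return None, True


print(gate(Transcript("cessna eight one niner charlie mike", 0.93)))
print(gate(Transcript("bravo traffic", 0.41)))
```

In practice the threshold would need tuning on held-out audio from the deployment region, since a good cut-off depends on how the score is calibrated.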
+
+ ## Citation
+
+ If you use this model, please cite the project and the underlying components:
+
+ ```bibtex
+ @software{rasr,
+   author = {Ding, James},
+   title = {rasr: ATC ASR finetuning toolkit},
+   url = {https://github.com/twangodev/rasr},
+   year = {2026}
+ }
+ ```
+
+ And the base model:
+
+ ```bibtex
+ @misc{parakeet-tdt,
+   author = {NVIDIA},
+   title = {Parakeet-TDT-0.6B-v3},
+   url = {https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3}
+ }
+ ```
+
+ And Llama 3.2 (training transcripts):
+
+ ```bibtex
+ @misc{llama3.2,
+   author = {{Meta AI}},
+   title = {The Llama 3.2 Herd of Models},
+   year = {2024},
+   url = {https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/}
+ }
+ ```
+
+ ## License
+
+ Released under the **[Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE)** ("Built with Llama"). This is the binding upstream license because the training transcripts were generated by Llama 3.2, and the resulting model is treated as a derivative work of Llama Materials for licensing purposes.
+
+ In addition to the Llama 3.2 terms, this model also inherits attribution and use requirements from its other parents:
+
+ - **Parakeet-TDT-0.6B-v3** ([CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/), NVIDIA): base model
+ - **ATCO2 corpus** (CC-BY-4.0): real-data anchor (train split)
+ - **ATCOSIM corpus** (research use; see [source](https://www.spsc.tugraz.at/databases-and-tools/atcosim-air-traffic-control-simulation-speech-corpus.html))
+ - **radiotalk-us-audio-tada-noisy** (Llama 3.2 Community License; transcripts generated by Llama 3.2, audio synthesized via Tada): synthetic training audio
+
+ To redistribute or deploy:
+ 1. Include a copy of the Llama 3.2 Community License.
+ 2. Display "Built with Llama" in your product / user interface / about page.
+ 3. Comply with the [Llama 3.2 Acceptable Use Policy](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/USE_POLICY.md).
+ 4. If your service exceeds 700M monthly active users, request a separate commercial license from Meta.
+
+ This is not legal advice. If you are deploying this model commercially or at scale, consult a lawyer regarding the interaction of the upstream licenses.
rasr-parakeet-v1.nemo ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:32171df9b141665764153d522b93a2a282aa6836ee80158fe77ff4b6f67f189d
+ size 2509332480
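
The three lines above are a Git LFS pointer, not the checkpoint itself: they record the spec version, the SHA-256 of the real file, and its size in bytes. As a sketch of how a downloaded `.nemo` file could be checked against such a pointer, the helpers below (`parse_lfs_pointer` and `matches_pointer`, both names made up for this example) parse the pointer text and compare size and digest.

```python
import hashlib

POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:32171df9b141665764153d522b93a2a282aa6836ee80158fe77ff4b6f67f189d
size 2509332480
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of an LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: str, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Stream the file and compare its size and SHA-256 to the pointer."""
    hasher = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


info = parse_lfs_pointer(POINTER)
print(f"{info['size'] / 2**30:.2f} GiB, {info['algo']}")
```

The recorded size works out to roughly 2.3 GiB, so streaming the file in chunks avoids loading the whole checkpoint into memory for the check.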
training_recipe.yaml ADDED
@@ -0,0 +1,36 @@
+ # 50k-step mixed run. Radiotalk synthetic + real ATCO2/ATCOSIM upweighted
+ # 10x to anchor distribution and supply the European/GA callsigns radiotalk
+ # doesn't cover. Target: meaningfully close the gap vs jlvdoorn's 0.157 WER
+ # on ATCO2 val. Expected wall clock: ~10-12 hours on the 6000 Pro
+ # (includes ~1 hr to dump the additional 100k radiotalk WAVs).
+
+ defaults: [base, rtx6kpro/hw]
+
+ name: parakeet-mixed
+
+ model:
+   scheme: parakeet
+   ref: nvidia/parakeet-tdt-0.6b-v3
+
+ data:
+   train:
+     - dataset: hf:twangodev/radiotalk-us-audio-tada-noisy:train
+       weight: 1.0
+       limit: 200000  # 2x the radiotalk-100k cache; remove when Lhotse lands
+     - dataset: hf:jlvdoorn/atco2-asr:train
+       weight: 10.0  # upweight real ATC 10x; small but anchors distribution
+     - dataset: hf:jlvdoorn/atco2-asr-atcosim:train
+       weight: 10.0
+   validation:
+     - dataset: hf:jlvdoorn/atco2-asr:validation
+
+ augmentation:
+   noise:
+     enabled: false  # leaving off until a noise corpus is wired up
+
+ trainer:
+   max_steps: 50000
+   val_check_interval: 2000
+
+ output:
+   dir: ckpt/${name}
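
For a sense of what the `weight` values in this recipe do to the training mix, the sketch below estimates each source's share of the data. It rests on two assumptions that the recipe does not spell out: that a weight of N is equivalent to repeating each clip N times in the concatenated manifest, and that the approximate clip counts quoted in the README (200k, ~446, ~10k) are accurate.

```python
# (approximate clip count, weight) per training source; counts are the
# approximate figures quoted in the README, not exact dataset sizes.
sources = {
    "radiotalk": (200_000, 1.0),
    "atco2": (446, 10.0),
    "atco2+atcosim": (10_000, 10.0),
}

# Assumed semantics: weight N behaves like N copies of every clip.
effective = {name: count * weight for name, (count, weight) in sources.items()}
total = sum(effective.values())
shares = {name: value / total for name, value in effective.items()}

for name, share in shares.items():
    print(f"{name:>14}: {share:.1%}")
```

Under these assumptions roughly a third of the examples seen in training would be real ATC audio, even though the real clips are only about 5% of the unweighted pool.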