	---
	license: llama3.2
	language:
	- en
	base_model: nvidia/parakeet-tdt-0.6b-v3
	tags:
	- automatic-speech-recognition
	- asr
	- atc
	- air-traffic-control
	- aviation
	- parakeet
	- nemo
	- fastconformer
	- tdt
	- finetuned
	- built-with-llama
	datasets:
	- twangodev/radiotalk-us-audio-tada-noisy
	- jlvdoorn/atco2-asr
	- jlvdoorn/atco2-asr-atcosim
	metrics:
	- wer
	- cer
	library_name: nemo
	pipeline_tag: automatic-speech-recognition
	model-index:
	- name: rasr-parakeet-v1
	  results:
	  - task:
	      type: automatic-speech-recognition
	      name: Speech-to-Text
	    dataset:
	      name: ATCO2 (jlvdoorn/atco2-asr validation)
	      type: jlvdoorn/atco2-asr
	      split: validation
	    metrics:
	    - type: wer
	      value: 0.1246
	      name: Word Error Rate
	    - type: cer
	      value: 0.0780
	      name: Character Error Rate
	---

	# rasr-parakeet-v1

	ATC ASR finetune of `nvidia/parakeet-tdt-0.6b-v3` on a synthetic US-style ATC corpus (`radiotalk-us-audio-tada-noisy`) with a small real-ATC anchor (ATCO2 + ATCOSIM train splits). Trained as v1 of the [rasr](https://github.com/twangodev/rasr) toolkit.

	## Headline

	| Metric | This model | Prior public SOTA (`jlvdoorn/whisper-large-v3-atco2-asr`) |
	|---|---|---|
	| ATCO2 val WER | 0.125 | 0.157 |
	| ATCO2 val CER | 0.078 | 0.088 |
	| ATCO2 val numeric WER | 0.050 | 0.074 |

	21% relative WER reduction over the previous public SOTA on the ATCO2 validation benchmark, with a smaller base model (0.6B params vs 1.55B).
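The relative figures follow directly from the table. A minimal sketch of the arithmetic, using the unrounded WER and CER values from the card metadata (0.1246 and 0.0780); `relative_reduction` is a helper name introduced here for illustration:

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fractional improvement of `new` over `baseline` (lower is better)."""
    return (baseline - new) / baseline

# (prior SOTA, this model), copied from the headline table and card metadata.
ROWS = {
    "WER": (0.157, 0.1246),
    "CER": (0.088, 0.0780),
    "numeric WER": (0.074, 0.050),
}

for name, (prior, ours) in ROWS.items():
    print(f"{name}: {relative_reduction(prior, ours):.1%} relative reduction")
```

This works out to roughly 21% for WER, 11% for CER, and 32% for numeric WER.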

	## Quick start

	```python
	import nemo.collections.asr as nemo_asr

	model = nemo_asr.models.ASRModel.from_pretrained("twangodev/rasr-parakeet-v1")
	result = model.transcribe(["atc_clip.wav"])
	print(result[0].text)
	```

	Or via the rasr eval toolkit:

	```bash
	pip install rasr
	rasr eval run \
	-m nemo:hf://twangodev/rasr-parakeet-v1 \
	-d hf:jlvdoorn/atco2-asr:validation \
	--language en --batch-size 16
	```

	## Architecture

	- Base: [`nvidia/parakeet-tdt-0.6b-v3`](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) (FastConformer encoder + TDT decoder, 0.6B params)
	- Tokenizer: kept from base — SentencePiece BPE 8192 tokens, multilingual
	- Sample rate: 16 kHz mono
	- Max input duration: 18 seconds, matching the training cap (longer inputs may degrade; the limit is tied to TDT joint memory)
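Before transcribing, it can help to confirm that a clip matches the input format listed above. A minimal sketch using only the Python standard library; it assumes uncompressed PCM WAV input, and `check_wav` is a hypothetical helper, not part of NeMo or rasr:

```python
import wave

TARGET_RATE_HZ = 16_000  # sample rate stated in this card
MAX_SECONDS = 18.0       # max input duration stated in this card

def check_wav(path: str) -> dict:
    """Report whether a PCM WAV file matches the expected input format."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        seconds = wav.getnframes() / rate
    return {
        "sample_rate_ok": rate == TARGET_RATE_HZ,
        "mono": channels == 1,
        "within_max_duration": seconds <= MAX_SECONDS,
        "seconds": seconds,
    }
```

Clips that fail a check would need resampling, downmixing, or splitting before being passed to `model.transcribe`.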

	## Training data

	This model was trained on transcripts generated by Llama 3.2 and audio synthesized via the Tada TTS pipeline. Specifically:

	| Source | Type | Role |
	|---|---|---|
	| [`twangodev/radiotalk-us-audio-tada-noisy`](https://huggingface.co/datasets/twangodev/radiotalk-us-audio-tada-noisy) (200k subset) | Synthetic US ATC | Bulk training audio. Dialogue transcripts generated by Llama 3.2, audio synthesized by [Tada](https://github.com/twangodev/tada) (TTS) with VHF channel degradation pipeline. |
	| [`jlvdoorn/atco2-asr`](https://huggingface.co/datasets/jlvdoorn/atco2-asr) (train split, ~446 clips) | Real European ATC | Real-data anchor; upweighted 10× to supply real-radio acoustic priors and European operator vocabulary. |
	| [`jlvdoorn/atco2-asr-atcosim`](https://huggingface.co/datasets/jlvdoorn/atco2-asr-atcosim) (train, ~10k clips) | Real EU ATC + simulator | Real-data anchor; upweighted 10×. |
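To see what the 10× upweighting means for the effective mix, the sketch below multiplies the approximate clip counts from the table by their weights. It assumes upweighting is done by repeating manifest entries, which may not be exactly how the rasr recipe implements it; `effective_counts` and `repeat_entries` are hypothetical helpers, and the entry keys follow the usual NeMo manifest fields (`audio_filepath`, `duration`, `text`).

```python
# Approximate clip counts and weights taken from the table above.
SOURCES = {
    "radiotalk": (200_000, 1),
    "atco2_train": (446, 10),
    "atco2_atcosim_train": (10_000, 10),
}

def effective_counts(sources: dict) -> dict:
    """Number of samples each source contributes per pass after weighting."""
    return {name: count * weight for name, (count, weight) in sources.items()}

def repeat_entries(entries: list, weight: int) -> list:
    """Upweight a manifest by repeating each entry `weight` times (assumed scheme)."""
    return [entry for entry in entries for _ in range(weight)]

counts = effective_counts(SOURCES)
real_share = (counts["atco2_train"] + counts["atco2_atcosim_train"]) / sum(counts.values())
print(counts, f"real-data share: {real_share:.0%}")

# Toy manifest entry, for illustration only.
toy = [{"audio_filepath": "clip.wav", "duration": 4.2, "text": "cleared to land"}]
print(len(repeat_entries(toy, 10)))
```

Under these assumptions, real audio makes up roughly a third of each weighted pass even though it is a small fraction of the unique clips.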

	### Llama 3.2 attribution

	This model is "Built with Llama" under the [Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE). Llama 3.2 was used to generate the ATC dialogue transcripts in the `radiotalk-us-audio-tada-noisy` dataset — those transcripts are the supervised targets the model learned to produce. The audio itself was synthesized by Tada (not Llama).

	## Training recipe

	Full reproducible recipe: [`configs/train/rtx6kpro/parakeet-mixed.yaml`](https://github.com/twangodev/rasr/blob/main/configs/train/rtx6kpro/parakeet-mixed.yaml).

	| Hyperparameter | Value |
	|---|---|
	| Optimizer | AdamW, β=(0.9, 0.98), weight_decay=1e-3 |
	| Learning rate | 1e-4 |
	| Schedule | CosineAnnealing, warmup 5000 steps, min_lr=1e-6 |
	| Batch size | 32 (effective) |
	| Precision | bf16-mixed |
	| Max steps | 50,000 |
	| Augmentation | SpecAugment (default), speed perturb 0.95-1.05 |
	| Max audio duration | 18.0 s |
	| Mixing | weighted manifest concat (radiotalk ×1, ATCO2 train ×10, ATCO2+ATCOSIM train ×10) |
	| Hardware | NVIDIA RTX PRO 6000 Blackwell (96 GB) |
	| Wall clock | ~12 hours |
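The schedule row can be read as linear warmup to the peak learning rate followed by cosine decay to `min_lr`. A minimal sketch of that shape, assuming warmup starts from zero and decay ends at the max step; NeMo's `CosineAnnealing` may differ in small details, and `lr_at` is a hypothetical helper rather than NeMo code:

```python
import math

def lr_at(step: int, peak: float = 1e-4, warmup: int = 5_000,
          max_steps: int = 50_000, min_lr: float = 1e-6) -> float:
    """Learning rate at a given optimizer step (warmup, then cosine decay)."""
    if step < warmup:
        # Linear ramp from 0 up to the peak learning rate.
        return peak * step / warmup
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup) / (max_steps - warmup))
    return min_lr + 0.5 * (peak - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 2_500, 5_000, 27_500, 50_000):
    print(step, lr_at(step))
```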

	## Strengths

	- Structurally robust ATC output. Position-call grammar (CTAF + towered), runway IDs, headings, and altitude readbacks are recovered cleanly.
	- Strong on numeric/safety-critical content. Per-utterance numeric WER is 0.050 on ATCO2 val, versus 0.074 for the prior SOTA (about 1.5× lower, a 32% relative reduction).
	- Stable on out-of-distribution audio. Zero runaway hallucinations observed on real US GA audio (TartanAviation KBTP), unlike LLM-decoder ASR models (e.g., Canary-Qwen, Granite Speech) which confabulate confidently on hard audio.
	- Small footprint. 0.6B params, fits in 4 GB VRAM at inference; ~10× faster than larger Whisper-based ATC finetunes.

	## Limitations

	This model was trained on a US-style synthetic corpus plus a European real-data anchor. The combination produces specific biases users should be aware of:

	1. Operator substitution bias. The model has been observed substituting unfamiliar callsigns with familiar ones from its training distribution — e.g., emitting "Lufthansa" or "Delta" where the audio contained a less-common operator. Particularly noticeable on US general aviation (GA) traffic, where N-number tail callsigns (e.g., "Cessna Eight One Niner Charlie Mike") may be mis-substituted with major airline prefixes.

	2. Limited US GA airport name coverage. The model has not seen most small US GA airport names during training. On real US GA audio (e.g., TartanAviation KBTP recordings), it produces phonetically-similar substitutions for the airport name ("Bravo Traffic", "Bello Traffic") instead of the correct name ("Butler Traffic").

	3. European real-anchor contamination on US output. Training included real European ATCO2/ATCOSIM data to anchor the model to real-radio acoustics and to reach the SOTA result on ATCO2 val. This European prior shows up in US-style transcription as occasional "Swiss", "Bern Tower", or "Belfast Tower" tokens that should not appear.

	4. Sanity rate on real US GA audio: 77% (10% CLEAN + 67% PLAUSIBLE-MISHEARD across 69 TartanAviation KBTP clips). In the imperfect cases, the failure is overwhelmingly a wrong word substituted in the correct slot, not garbling or hallucination.

	5. Evaluation distribution. This model is benchmarked against ATCO2 (European real ATC). It has not been evaluated against a US ATC benchmark — no fully public US ATC ASR test set with annotations currently exists.
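For readers checking these failure modes on their own audio, WER is the word-level edit distance between reference and hypothesis divided by the reference length. A minimal sketch of that definition (not the rasr toolkit's scorer; text normalization such as casing and number formatting is omitted), applied to the airport-name substitution from item 2:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

print(word_error_rate("butler traffic", "bravo traffic"))  # one substitution in two words
```

A single substituted airport name therefore dominates the score of a short CTAF call, which is why short-utterance WER on GA audio can look worse than the transcript reads.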

	## Recommended usage

	- For European ATC (or audio matching ATCO2-style distribution): deploy as-is. Numbers above are the expected performance.
	- For US ATC: use with inference-time hot-word biasing against a known callsign + airport-name vocabulary specific to the deployment region. NeMo's TDT decoder supports hot-word biasing via `change_decoding_strategy()`. Most substitution failures collapse to correct output with appropriate biasing.
	- For safety-critical applications: always layer with confidence-based rejection. This model is intended as a research/development checkpoint, not as a safety-certified ATC transcription system.
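Confidence-based rejection can be as simple as routing low-confidence transcripts to a human instead of acting on them. A minimal sketch, assuming each hypothesis already carries an utterance-level confidence score in [0, 1]; how that score is obtained from the decoder is outside this sketch, and `Hypothesis`, `triage`, and the 0.8 threshold are all hypothetical and would need tuning on held-out audio:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    confidence: float  # assumed: utterance-level score in [0, 1], higher is more certain

def triage(hypotheses, threshold: float = 0.8):
    """Split hypotheses into (accepted, needs_review) by confidence threshold."""
    accepted, needs_review = [], []
    for hyp in hypotheses:
        (accepted if hyp.confidence >= threshold else needs_review).append(hyp)
    return accepted, needs_review

accepted, needs_review = triage([
    Hypothesis("butler traffic cessna eight one niner charlie mike", 0.93),
    Hypothesis("bello traffic delta eight one niner", 0.41),
])
print(len(accepted), len(needs_review))
```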

	## Citation

	If you use this model, please cite the project and the underlying components:

	```bibtex
	@software{rasr,
	author = {Ding, James},
	title = {rasr: ATC ASR finetuning toolkit},
	url = {https://github.com/twangodev/rasr},
	year = {2026}
	}
	```

	And the base model:

	```bibtex
	@misc{parakeet-tdt,
	author = {NVIDIA},
	title = {Parakeet-TDT-0.6B-v3},
	url = {https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3}
	}
	```

	And Llama 3.2 (training transcripts):

	```bibtex
	@misc{llama3.2,
	author = {{Meta AI}},
	title = {Llama 3.2: Revolutionizing edge AI and vision with open, customizable models},
	year = {2024},
	url = {https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/}
	}
	```

	## License

	Released under the [Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE) ("Built with Llama"). This is the binding upstream license because the training transcripts were generated by Llama 3.2, and the resulting model is treated as a derivative work of Llama Materials for licensing purposes.

	In addition to the Llama 3.2 terms, this model also inherits attribution and use requirements from its other parents:

	- Parakeet-TDT-0.6B-v3 ([CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/), NVIDIA) — base model
	- ATCO2 corpus (CC-BY-4.0) — real-data anchor (train split)
	- ATCOSIM corpus (research use; see [source](https://www.spsc.tugraz.at/databases-and-tools/atcosim-air-traffic-control-simulation-speech-corpus.html))
	- radiotalk-us-audio-tada-noisy (Llama 3.2 Community License — transcripts generated by Llama 3.2, audio synthesized via Tada) — synthetic training audio

	To redistribute or deploy:
	1. Include a copy of the Llama 3.2 Community License.
	2. Display "Built with Llama" in your product / user interface / about page.
	3. Comply with the [Llama 3.2 Acceptable Use Policy](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/USE_POLICY.md).
	4. If your service exceeds 700M monthly active users, request a separate commercial license from Meta.

	This is not legal advice. If you are deploying this model commercially or at scale, consult a lawyer regarding the interaction of the upstream licenses.