---
license: llama3.2
language:
- en
base_model: nvidia/parakeet-tdt-0.6b-v3
tags:
- automatic-speech-recognition
- asr
- atc
- air-traffic-control
- aviation
- parakeet
- nemo
- fastconformer
- tdt
- finetuned
- built-with-llama
datasets:
- twangodev/radiotalk-us-audio-tada-noisy
- jlvdoorn/atco2-asr
- jlvdoorn/atco2-asr-atcosim
metrics:
- wer
- cer
library_name: nemo
pipeline_tag: automatic-speech-recognition
model-index:
- name: rasr-parakeet-v1
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech-to-Text
    dataset:
      name: ATCO2 (jlvdoorn/atco2-asr validation)
      type: jlvdoorn/atco2-asr
      split: validation
    metrics:
    - type: wer
      value: 0.1246
      name: Word Error Rate
    - type: cer
      value: 0.0780
      name: Character Error Rate
---

# rasr-parakeet-v1

ATC ASR finetune of `nvidia/parakeet-tdt-0.6b-v3` on a synthetic US-style ATC corpus (`radiotalk-us-audio-tada-noisy`) with a small real-ATC anchor (ATCO2 + ATCOSIM train splits). Trained as v1 of the [rasr](https://github.com/twangodev/rasr) toolkit.

## Headline

| Metric | This model | Prior public SOTA (`jlvdoorn/whisper-large-v3-atco2-asr`) |
|---|---|---|
| **ATCO2 val WER** | **0.125** | 0.157 |
| **ATCO2 val CER** | **0.078** | 0.088 |
| **ATCO2 val numeric WER** | **0.050** | 0.074 |

This is a 21% relative WER reduction over the previous public SOTA on the ATCO2 validation benchmark (0.1246 vs. 0.157), achieved with a smaller base model (0.6B params vs. 1.55B).
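
For readers who want to check the arithmetic, the sketch below shows one common way to compute WER, CER, and a relative reduction. It is a minimal illustration, not the rasr toolkit's scoring code: the function names are hypothetical, no text normalization is applied, and the example sentence is made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate for one utterance: word-level edits / reference words."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref, hyp):
    """Character error rate for one utterance: character-level edits / reference characters."""
    return edit_distance(list(ref), list(hyp)) / len(ref)


def relative_reduction(baseline, new):
    """Fraction by which `new` improves on `baseline`."""
    return (baseline - new) / baseline


# Made-up example: one substituted word out of six.
print(wer("butler traffic cessna eight one niner",
          "bravo traffic cessna eight one niner"))

# Headline figures: prior SOTA WER 0.157, this model 0.1246.
print(round(relative_reduction(0.157, 0.1246), 3))  # 0.206
```

Benchmark WER is usually pooled over the whole test set (total edits divided by total reference words) rather than averaged per utterance, so per-clip values from this sketch will not sum directly to the reported figure.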

## Quick start

```python
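import nemo.collections.asr as nemo_asr

# Load the checkpoint by its Hugging Face repo ID.
model = nemo_asr.models.ASRModel.from_pretrained("twangodev/rasr-parakeet-v1")

# Transcribe a list of audio paths; the model expects 16 kHz mono input.
result = model.transcribe(["atc_clip.wav"])
print(result[0].text)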
```

Or via the rasr eval toolkit:

```bash
pip install rasr
rasr eval run \
  -m nemo:hf://twangodev/rasr-parakeet-v1 \
  -d hf:jlvdoorn/atco2-asr:validation \
  --language en --batch-size 16
```

## Architecture

- **Base**: [`nvidia/parakeet-tdt-0.6b-v3`](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) (FastConformer encoder + TDT decoder, 0.6B params)
- **Tokenizer**: kept from base — SentencePiece BPE 8192 tokens, multilingual
- **Sample rate**: 16 kHz mono
- **Max input duration**: 18 seconds (extended-length inputs may degrade — TDT joint memory)
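
The input constraints above can be checked before transcription. The following is a minimal sketch using only the Python standard library; `check_wav` is a hypothetical helper, not part of NeMo or rasr, and the `wave` module only reads uncompressed PCM WAV files.

```python
import wave

EXPECTED_RATE = 16_000  # Hz, from the Architecture list
MAX_SECONDS = 18.0      # from the Architecture list


def check_wav(path):
    """Return a list of problems found; an empty list means the clip looks fine."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    if rate != EXPECTED_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE} Hz")
    if channels != 1:
        problems.append(f"{channels} channels, expected mono")
    if duration > MAX_SECONDS:
        problems.append(f"duration {duration:.1f} s exceeds {MAX_SECONDS} s")
    return problems
```

Clips that fail the check would need to be resampled, downmixed, or split into shorter segments before being passed to `model.transcribe()`.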

## Training data

**This model was trained on transcripts generated by Llama 3.2 and audio synthesized via the Tada TTS pipeline.** Specifically:

| Source | Type | Role |
|---|---|---|
| [`twangodev/radiotalk-us-audio-tada-noisy`](https://huggingface.co/datasets/twangodev/radiotalk-us-audio-tada-noisy) (200k subset) | Synthetic US ATC | Bulk training audio. Dialogue transcripts generated by **Llama 3.2**, audio synthesized by [Tada](https://github.com/twangodev/tada) (TTS) with VHF channel degradation pipeline. |
| [`jlvdoorn/atco2-asr`](https://huggingface.co/datasets/jlvdoorn/atco2-asr) (train split, ~446 clips) | Real European ATC | Real-data anchor; upweighted 10× to supply real-radio acoustic priors and European operator vocabulary. |
| [`jlvdoorn/atco2-asr-atcosim`](https://huggingface.co/datasets/jlvdoorn/atco2-asr-atcosim) (train, ~10k clips) | Real EU ATC + simulator | Real-data anchor; upweighted 10×. |
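
To give a feel for what the 10× upweighting does to the mix, the sketch below computes each source's effective share. It assumes the approximate clip counts from the table and assumes that upweighting is implemented by repeating manifest entries; the actual recipe may sample differently. `weighted_concat` and `effective_shares` are hypothetical helpers.

```python
# Approximate clip counts and weights taken from the table above.
SOURCES = {
    "radiotalk (synthetic)": (200_000, 1),
    "ATCO2 train": (446, 10),
    "ATCO2+ATCOSIM train": (10_000, 10),
}


def weighted_concat(manifests):
    """Repeat each manifest's entries `weight` times and concatenate them."""
    combined = []
    for entries, weight in manifests:
        combined.extend(list(entries) * weight)
    return combined


def effective_shares(sources):
    """Share of the weighted training pool contributed by each source."""
    weighted = {name: count * weight for name, (count, weight) in sources.items()}
    total = sum(weighted.values())
    return {name: n / total for name, n in weighted.items()}


for name, share in effective_shares(SOURCES).items():
    print(f"{name}: {share:.1%}")
```

Under these assumptions the weighted pool is roughly two thirds synthetic and one third real audio.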

### Llama 3.2 attribution

This model is "Built with Llama" under the [Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE). Llama 3.2 was used to generate the ATC dialogue transcripts in the `radiotalk-us-audio-tada-noisy` dataset — those transcripts are the supervised targets the model learned to produce. The audio itself was synthesized by Tada (not Llama).

## Training recipe

Full reproducible recipe: [`configs/train/rtx6kpro/parakeet-mixed.yaml`](https://github.com/twangodev/rasr/blob/main/configs/train/rtx6kpro/parakeet-mixed.yaml).

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW, β=(0.9, 0.98), weight_decay=1e-3 |
| Learning rate | 1e-4 |
| Schedule | CosineAnnealing, warmup 5000 steps, min_lr=1e-6 |
| Batch size | 32 (effective) |
| Precision | bf16-mixed |
| Max steps | 50,000 |
| Augmentation | SpecAugment (default), speed perturb 0.95-1.05 |
| Max audio duration | 18.0 s |
| Mixing | weighted manifest concat (radiotalk ×1, ATCO2 train ×10, ATCO2+ATCOSIM train ×10) |
| Hardware | NVIDIA RTX PRO 6000 Blackwell (96 GB) |
| Wall clock | ~12 hours |
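
The learning-rate schedule in the table can be sketched as linear warmup followed by cosine decay. This is a common textbook form written from the hyperparameters above, not a copy of NeMo's `CosineAnnealing` scheduler, which may differ in details such as how the warmup ramp is computed.

```python
import math

PEAK_LR = 1e-4         # "Learning rate" row
MIN_LR = 1e-6          # "Schedule" row
WARMUP_STEPS = 5_000   # "Schedule" row
MAX_STEPS = 50_000     # "Max steps" row


def lr_at(step):
    """Learning rate at a given optimizer step (assumed linear warmup, then cosine decay)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


for step in (0, 2_500, 5_000, 27_500, 50_000):
    print(step, f"{lr_at(step):.2e}")
```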

## Strengths

- **Structurally robust ATC output.** Position-call grammar (CTAF + towered), runway IDs, headings, and altitude readbacks are recovered cleanly.
- **Strong on numeric/safety-critical content.** Per-utterance numeric WER of 0.050 on ATCO2 val, about 32% lower than the prior SOTA's 0.074 on the same axis.
- **Stable on out-of-distribution audio.** Zero runaway hallucinations observed on real US GA audio (TartanAviation KBTP), unlike LLM-decoder ASR models (e.g., Canary-Qwen, Granite Speech), which confabulate confidently on hard audio.
- **Small footprint.** 0.6B params, fits in 4 GB VRAM at inference; ~10× faster than larger Whisper-based ATC finetunes.

## Limitations

This model was trained on a US-style synthetic corpus plus a European real-data anchor. The combination produces specific biases users should be aware of:

1. **Operator substitution bias.** The model has been observed replacing unfamiliar callsigns with familiar ones from its training distribution, e.g., emitting "Lufthansa" or "Delta" where the audio contained a less common operator. This is particularly noticeable on US general aviation (GA) traffic, where N-number tail callsigns (e.g., "Cessna Eight One Niner Charlie Mike") may be replaced with major airline prefixes.

2. **Limited US GA airport name coverage.** The model has not seen most small US GA airport names during training. On real US GA audio (e.g., TartanAviation KBTP recordings), it produces phonetically-similar substitutions for the airport name ("Bravo Traffic", "Bello Traffic") instead of the correct name ("Butler Traffic").

3. **European real-anchor contamination on US output.** Training included real European ATCO2/ATCOSIM data to anchor the model to real-radio acoustics and to reach the SOTA result on ATCO2 val. This European prior is visible in US-style transcription as occasional "Swiss", "Bern Tower", or "Belfast Tower" tokens that should not appear.

4. **Sanity rate on real US GA audio: 77%** (10% CLEAN + 67% PLAUSIBLE-MISHEARD across 69 TartanAviation KBTP clips). Of the imperfect cases, the failure is overwhelmingly a *plausible but wrong word in the correct slot*, not garbling or hallucination.

5. **Evaluation distribution.** This model is benchmarked against ATCO2 (European real ATC). It has not been evaluated against a US ATC benchmark — no fully public US ATC ASR test set with annotations currently exists.

## Recommended usage

- **For European ATC** (or audio matching ATCO2-style distribution): deploy as-is. Numbers above are the expected performance.
- **For US ATC**: use with **inference-time hot-word biasing** against a known callsign + airport-name vocabulary specific to the deployment region. NeMo's TDT decoder supports hot-word biasing via `change_decoding_strategy()`. Most substitution failures collapse to correct output with appropriate biasing.
- **For safety-critical applications**: always layer with confidence-based rejection. This model is intended as a research/development checkpoint, not as a safety-certified ATC transcription system.
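
Confidence-based rejection can be as simple as a threshold gate in front of downstream consumers. The sketch below assumes each transcript already carries an utterance-level confidence score in [0, 1]; how that score is obtained from the decoder is not shown here. The `Transcript` type, the `accept_or_reject` helper, and the 0.8 threshold are all hypothetical, and a real threshold should be tuned on held-out audio.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    text: str
    confidence: float  # hypothetical utterance-level score in [0, 1]


def accept_or_reject(transcripts, threshold=0.8):
    """Split transcripts into (accepted, rejected) by a confidence threshold."""
    accepted, rejected = [], []
    for t in transcripts:
        (accepted if t.confidence >= threshold else rejected).append(t)
    return accepted, rejected


# Made-up scores for illustration only.
clips = [
    Transcript("butler traffic cessna eight one niner charlie mike", 0.93),
    Transcript("bravo traffic", 0.41),
]
accepted, rejected = accept_or_reject(clips)
print(len(accepted), "accepted;", len(rejected), "flagged for human review")
```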

## Citation

If you use this model, please cite the project and the underlying components:

```bibtex
@software{rasr,
  author = {Ding, James},
  title = {rasr: ATC ASR finetuning toolkit},
  url = {https://github.com/twangodev/rasr},
  year = {2026}
}
```

And the base model:

```bibtex
@misc{parakeet-tdt,
  author = {NVIDIA},
  title = {Parakeet-TDT-0.6B-v3},
  url = {https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3}
}
```

And Llama 3.2 (training transcripts):

```bibtex
@misc{llama3.2,
  author = {{Meta AI}},
  title = {Llama 3.2: Revolutionizing edge AI and vision with open, customizable models},
  year = {2024},
  url = {https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/}
}
```

## License

Released under the **[Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE)** ("Built with Llama"). This is the binding upstream license because the training transcripts were generated by Llama 3.2, and the resulting model is treated as a derivative work of Llama Materials for licensing purposes.

In addition to the Llama 3.2 terms, this model also inherits attribution and use requirements from its other parents:

- **Parakeet-TDT-0.6B-v3** ([CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/), NVIDIA) — base model
- **ATCO2 corpus** (CC-BY-4.0) — real-data anchor (train split)
- **ATCOSIM corpus** (research use; see [source](https://www.spsc.tugraz.at/databases-and-tools/atcosim-air-traffic-control-simulation-speech-corpus.html))
- **radiotalk-us-audio-tada-noisy** (Llama 3.2 Community License — transcripts generated by Llama 3.2, audio synthesized via Tada) — synthetic training audio

To redistribute or deploy:
1. Include a copy of the Llama 3.2 Community License.
2. Display "Built with Llama" in your product / user interface / about page.
3. Comply with the [Llama 3.2 Acceptable Use Policy](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/USE_POLICY.md).
4. If your service exceeds 700M monthly active users, request a separate commercial license from Meta.

This is not legal advice. If you are deploying this model commercially or at scale, consult a lawyer regarding the interaction of the upstream licenses.