twangodev/rasr-parakeet-v1

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+rasr-parakeet-v1.nemo filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,197 @@

+---
+license: llama3.2
+language:
+- en
+base_model: nvidia/parakeet-tdt-0.6b-v3
+tags:
+- automatic-speech-recognition
+- asr
+- atc
+- air-traffic-control
+- aviation
+- parakeet
+- nemo
+- fastconformer
+- tdt
+- finetuned
+- built-with-llama
+datasets:
+- twangodev/radiotalk-us-audio-tada-noisy
+- jlvdoorn/atco2-asr
+- jlvdoorn/atco2-asr-atcosim
+metrics:
+- wer
+- cer
+library_name: nemo
+pipeline_tag: automatic-speech-recognition
+model-index:
+- name: rasr-parakeet-v1
+  results:
+  - task:
+      type: automatic-speech-recognition
+      name: Speech-to-Text
+    dataset:
+      name: ATCO2 (jlvdoorn/atco2-asr validation)
+      type: jlvdoorn/atco2-asr
+      split: validation
+    metrics:
+    - type: wer
+      value: 0.1246
+      name: Word Error Rate
+    - type: cer
+      value: 0.0780
+      name: Character Error Rate
+---
+# rasr-parakeet-v1
+ATC ASR finetune of `nvidia/parakeet-tdt-0.6b-v3` on a synthetic US-style ATC corpus (`radiotalk-us-audio-tada-noisy`) with a small real-ATC anchor (ATCO2 + ATCOSIM train splits). Trained as v1 of the [rasr](https://github.com/twangodev/rasr) toolkit.
+## Headline
+| Metric | This model | Prior public SOTA (`jlvdoorn/whisper-large-v3-atco2-asr`) |
+|---|---|---|
+| **ATCO2 val WER** | **0.125** | 0.157 |
+| **ATCO2 val CER** | **0.078** | 0.088 |
+| **ATCO2 val numeric WER** | **0.050** | 0.074 |
+This is a 21% relative WER reduction over the previous public SOTA on the ATCO2 validation benchmark, achieved with a smaller base model (0.6B vs. 1.55B parameters).
+## Quick start
+```python
+import nemo.collections.asr as nemo_asr
+model = nemo_asr.models.ASRModel.from_pretrained("twangodev/rasr-parakeet-v1")
+result = model.transcribe(["atc_clip.wav"])
+print(result[0].text)
+```
+Or via the rasr eval toolkit:
+```bash
+pip install rasr
+rasr eval run \
+  -m nemo:hf://twangodev/rasr-parakeet-v1 \
+  -d hf:jlvdoorn/atco2-asr:validation \
+  --language en --batch-size 16
+```
+## Architecture
+- **Base**: [`nvidia/parakeet-tdt-0.6b-v3`](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) (FastConformer encoder + TDT decoder, 0.6B params)
+- **Tokenizer**: kept from base — SentencePiece BPE 8192 tokens, multilingual
+- **Sample rate**: 16 kHz mono
+- **Max input duration**: 18 seconds (longer inputs may degrade; the limit comes from TDT joint memory use)
+## Training data
+**This model was trained on transcripts generated by Llama 3.2 and audio synthesized via the Tada TTS pipeline.** Specifically:
+| Source | Type | Role |
+|---|---|---|
+| [`twangodev/radiotalk-us-audio-tada-noisy`](https://huggingface.co/datasets/twangodev/radiotalk-us-audio-tada-noisy) (200k subset) | Synthetic US ATC | Bulk training audio. Dialogue transcripts generated by **Llama 3.2**, audio synthesized by [Tada](https://github.com/twangodev/tada) (TTS) with VHF channel degradation pipeline. |
+| [`jlvdoorn/atco2-asr`](https://huggingface.co/datasets/jlvdoorn/atco2-asr) (train split, ~446 clips) | Real European ATC | Real-data anchor; upweighted 10× to supply real-radio acoustic priors and European operator vocabulary. |
+| [`jlvdoorn/atco2-asr-atcosim`](https://huggingface.co/datasets/jlvdoorn/atco2-asr-atcosim) (train, ~10k clips) | Real EU ATC + simulator | Real-data anchor; upweighted 10×. |
+### Llama 3.2 attribution
+This model is "Built with Llama" under the [Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE). Llama 3.2 was used to generate the ATC dialogue transcripts in the `radiotalk-us-audio-tada-noisy` dataset — those transcripts are the supervised targets the model learned to produce. The audio itself was synthesized by Tada (not Llama).
+## Training recipe
+Full reproducible recipe: [`configs/train/rtx6kpro/parakeet-mixed.yaml`](https://github.com/twangodev/rasr/blob/main/configs/train/rtx6kpro/parakeet-mixed.yaml).
+| Hyperparameter | Value |
+|---|---|
+| Optimizer | AdamW, β=(0.9, 0.98), weight_decay=1e-3 |
+| Learning rate | 1e-4 |
+| Schedule | CosineAnnealing, warmup 5000 steps, min_lr=1e-6 |
+| Batch size | 32 (effective) |
+| Precision | bf16-mixed |
+| Max steps | 50,000 |
+| Augmentation | SpecAugment (default), speed perturb 0.95-1.05 |
+| Max audio duration | 18.0 s |
+| Mixing | weighted manifest concat (radiotalk ×1, ATCO2 train ×10, ATCO2+ATCOSIM train ×10) |
+| Hardware | NVIDIA RTX PRO 6000 Blackwell (96 GB) |
+| Wall clock | ~12 hours |
+## Strengths
+- **Structurally robust ATC output.** Position-call grammar (CTAF + towered), runway IDs, headings, and altitude readbacks are recovered cleanly.
+- **Strong on numeric/safety-critical content.** Per-utterance numeric WER is 0.050 on ATCO2 val, versus 0.074 for the prior SOTA on the same axis (about 32% lower in relative terms).
+- **Stable on out-of-distribution audio.** Zero runaway hallucinations observed on real US GA audio (TartanAviation KBTP), unlike LLM-decoder ASR models (e.g., Canary-Qwen, Granite Speech) which confabulate confidently on hard audio.
+- **Small footprint.** 0.6B params, fits in 4 GB VRAM at inference; ~10× faster than larger Whisper-based ATC finetunes.
+## Limitations
+This model was trained on a US-style synthetic corpus plus a European real-data anchor. The combination produces specific biases users should be aware of:
+1. **Operator substitution bias.** The model has been observed replacing unfamiliar callsigns with familiar ones from its training distribution, e.g., emitting "Lufthansa" or "Delta" where the audio contained a less common operator. This is particularly noticeable on US general aviation (GA) traffic, where N-number tail callsigns (e.g., "Cessna Eight One Niner Charlie Mike") may be replaced with major airline prefixes.
+2. **Limited US GA airport name coverage.** The model has not seen most small US GA airport names during training. On real US GA audio (e.g., TartanAviation KBTP recordings), it produces phonetically-similar substitutions for the airport name ("Bravo Traffic", "Bello Traffic") instead of the correct name ("Butler Traffic").
+3. **European real-anchor contamination on US output.** Training included real European ATCO2/ATCOSIM data to anchor the distribution and reach the SOTA result on ATCO2 val. This European prior is visible in US-style transcription as occasional "Swiss", "Bern Tower", or "Belfast Tower" tokens that should not appear.
+4. **Sanity rate on real US GA audio: 77%** (10% CLEAN + 67% PLAUSIBLE-MISHEARD across 69 TartanAviation KBTP clips). Of the imperfect cases, the failure is overwhelmingly a *plausible but wrong word in the correct slot*, not garbling or hallucination.
+5. **Evaluation distribution.** This model is benchmarked against ATCO2 (European real ATC). It has not been evaluated against a US ATC benchmark — no fully public US ATC ASR test set with annotations currently exists.
+## Recommended usage
+- **For European ATC** (or audio matching ATCO2-style distribution): deploy as-is. Numbers above are the expected performance.
+- **For US ATC**: use with **inference-time hot-word biasing** against a known callsign + airport-name vocabulary specific to the deployment region. NeMo's TDT decoder supports hot-word biasing via `change_decoding_strategy()`. Most substitution failures collapse to correct output with appropriate biasing.
+- **For safety-critical applications**: always layer with confidence-based rejection. This model is intended as a research/development checkpoint, not as a safety-certified ATC transcription system.
+## Citation
+If you use this model, please cite the project and the underlying components:
+```bibtex
+@software{rasr,
+  author = {Ding, James},
+  title = {rasr: ATC ASR finetuning toolkit},
+  url = {https://github.com/twangodev/rasr},
+  year = {2026}
+}
+```
+And the base model:
+```bibtex
+@misc{parakeet-tdt,
+  author = {NVIDIA},
+  title = {Parakeet-TDT-0.6B-v3},
+  url = {https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3}
+}
+```
+And Llama 3.2 (training transcripts):
+```bibtex
+@misc{llama3.2,
+  author = {{Meta AI}},
+  title = {Llama 3.2: Revolutionizing edge AI and vision with open, customizable models},
+  year = {2024},
+  url = {https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/}
+}
+```
+## License
+Released under the **[Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE)** ("Built with Llama"). This is the binding upstream license because the training transcripts were generated by Llama 3.2, and the resulting model is treated as a derivative work of Llama Materials for licensing purposes.
+In addition to the Llama 3.2 terms, this model inherits attribution and use requirements from its other parents:
+- **Parakeet-TDT-0.6B-v3** ([CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/), NVIDIA) — base model
+- **ATCO2 corpus** (CC-BY-4.0) — real-data anchor (train split)
+- **ATCOSIM corpus** (research use; see [source](https://www.spsc.tugraz.at/databases-and-tools/atcosim-air-traffic-control-simulation-speech-corpus.html))
+- **radiotalk-us-audio-tada-noisy** (Llama 3.2 Community License — transcripts generated by Llama 3.2, audio synthesized via Tada) — synthetic training audio
+To redistribute or deploy:
+1. Include a copy of the Llama 3.2 Community License.
+2. Display "Built with Llama" in your product / user interface / about page.
+3. Comply with the [Llama 3.2 Acceptable Use Policy](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/USE_POLICY.md).
+4. If your service exceeds 700M monthly active users, request a separate commercial license from Meta.
+This is not legal advice. If you are deploying this model commercially or at scale, consult a lawyer regarding the interaction of the upstream licenses.
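
The relative improvements implied by the Headline table in the README above can be checked with a few lines of arithmetic. This is a sketch and not part of the commit: it takes the WER from the `model-index` metadata (0.1246) and the other values from the table as printed, and `relative_reduction` is a helper name introduced here for illustration.

```python
# (this model, prior public SOTA) pairs, copied from the README.
# WER uses the unrounded model-index value; the table rounds it to 0.125.
HEADLINE = {
    "wer": (0.1246, 0.157),
    "cer": (0.078, 0.088),
    "numeric_wer": (0.050, 0.074),
}


def relative_reduction(ours: float, baseline: float) -> float:
    """Return the fraction by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline


reductions = {
    name: relative_reduction(ours, baseline)
    for name, (ours, baseline) in HEADLINE.items()
}

for name, value in reductions.items():
    print(f"{name}: {value:.1%} relative reduction")
```

With these inputs the WER reduction comes out just under 21%, the CER reduction near 11%, and the numeric WER reduction near 32%. Using the rounded table value of 0.125 for WER would give a slightly smaller figure.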

rasr-parakeet-v1.nemo ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:32171df9b141665764153d522b93a2a282aa6836ee80158fe77ff4b6f67f189d
+size 2509332480
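
The three lines above are a Git LFS pointer, not the checkpoint itself: they record the SHA-256 digest and byte size of the real `.nemo` file. A minimal sketch of how a downloaded copy could be checked against the pointer follows. The function names `parse_lfs_pointer` and `matches_pointer` are hypothetical helpers written for this note, and the local path in the usage comment is an assumption.

```python
import hashlib
from pathlib import Path

# Pointer contents copied from the diff above (without the leading "+").
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:32171df9b141665764153d522b93a2a282aa6836ee80158fe77ff4b6f67f189d
size 2509332480
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split 'key value' lines into a dict and unpack the oid field."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line.strip())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: Path, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Check a local file's size and digest against a parsed pointer."""
    if path.stat().st_size != pointer["size"]:
        return False  # cheap size check before hashing a multi-GB file
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
print(f"expected size: {pointer['size'] / 2**30:.2f} GiB")
# Usage, assuming the file was downloaded to the current directory:
# matches_pointer(Path("rasr-parakeet-v1.nemo"), pointer)
```

The size field works out to roughly 2.3 GiB, so hashing in fixed-size chunks avoids loading the whole checkpoint into memory.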

training_recipe.yaml ADDED

	@@ -0,0 +1,36 @@

+# 50k-step mixed run. Radiotalk synthetic + real ATCO2/ATCOSIM upweighted
+# 10x to anchor distribution and supply the European/GA callsigns radiotalk
+# doesn't cover. Target: meaningfully close the gap vs jlvdoorn's 0.157 WER
+# on ATCO2 val. Expected wall clock: ~10-12 hours on the 6000 Pro
+# (includes ~1 hr to dump the additional 100k radiotalk WAVs).
+defaults: [base, rtx6kpro/hw]
+name: parakeet-mixed
+model:
+  scheme: parakeet
+  ref: nvidia/parakeet-tdt-0.6b-v3
+data:
+  train:
+    - dataset: hf:twangodev/radiotalk-us-audio-tada-noisy:train
+      weight: 1.0
+      limit: 200000           # 2x the radiotalk-100k cache; remove when Lhotse lands
+    - dataset: hf:jlvdoorn/atco2-asr:train
+      weight: 10.0            # upweight real ATC 10x; small but anchors distribution
+    - dataset: hf:jlvdoorn/atco2-asr-atcosim:train
+      weight: 10.0
+  validation:
+    - dataset: hf:jlvdoorn/atco2-asr:validation
+augmentation:
+  noise:
+    enabled: false            # leaving off until a noise corpus is wired up
+trainer:
+  max_steps: 50000
+  val_check_interval: 2000
+output:
+  dir: ckpt/${name}
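
To see what the `weight` values in the recipe above imply for the training mix, the sketch below computes each source's share of the weighted pool. It assumes that a weight of 10 means each clip is repeated ten times in the concatenated manifest, which matches the README's "weighted manifest concat" wording, but the rasr implementation is not shown in this diff. The clip counts are the approximate figures quoted in the README, and the batch size is the README's "effective" value of 32.

```python
# Approximate clip counts from the README; weights from training_recipe.yaml.
# ASSUMPTION: weight == number of times each clip appears in the mixed manifest.
SOURCES = {
    "radiotalk": {"clips": 200_000, "weight": 1.0},
    "atco2": {"clips": 446, "weight": 10.0},
    "atco2_atcosim": {"clips": 10_000, "weight": 10.0},
}
MAX_STEPS = 50_000   # trainer.max_steps in the recipe
BATCH_SIZE = 32      # "effective" batch size from the README


def effective_counts(sources: dict) -> dict:
    """Number of manifest entries each source contributes after weighting."""
    return {name: s["clips"] * s["weight"] for name, s in sources.items()}


counts = effective_counts(SOURCES)
total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

# How many times the run cycles through the weighted pool.
passes = MAX_STEPS * BATCH_SIZE / total

for name, share in shares.items():
    print(f"{name}: {share:.1%} of the weighted mix")
print(f"passes over the weighted mix: {passes:.2f}")
```

Under these assumptions the synthetic radiotalk data is still about two thirds of the mix, ATCO2+ATCOSIM about one third, and the small ATCO2 train split between 1% and 2%. The run covers the weighted pool a little over five times.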