youtube-atc-fastconformer

A compact 115M-parameter FastConformer Hybrid RNNT-CTC model for automatic speech recognition in the air traffic control (ATC) domain, trained exclusively on pseudo-labeled data from YouTube recordings of virtual ATC simulator sessions (VATSIM/IVAO).

Overview

Training data for air traffic control speech recognition is scarce: operational recordings are restricted, and transcription requires expensive domain experts. This model demonstrates that a specialized ASR system can be trained effectively on large-scale, pseudo-labeled data from publicly available YouTube streams, without any manually annotated operational data.

The model was trained on the youtube-atc dataset containing over 800 hours of content spanning 709 videos from virtual airports in 17 countries, covering ground, tower, approach, and en-route operational domains with diverse speaker accents. Training data was generated using an automated pipeline with speaker diarization, multi-model transcription, and LLM-based transcript fusion.

Usage

Load from Hugging Face

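from nemo.collections.asr.models import EncDecHybridRNNTCTCModel

# Download the checkpoint from the Hugging Face Hub and restore the model.
model = EncDecHybridRNNTCTCModel.from_pretrained("niclaswue/youtube-atc-fastconformer")

# Transcribe one or more audio files.
results = model.transcribe(["audio.wav"], batch_size=16)
for hyp in results:
    # Each result is either a hypothesis object with a .text attribute or a plain string.
    text = hyp.text if hasattr(hyp, "text") else hyp
    print(text)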

Load from .nemo file

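from nemo.collections.asr.models import EncDecHybridRNNTCTCModel

# Restore the model from a local .nemo checkpoint.
model = EncDecHybridRNNTCTCModel.restore_from("FastConformer-Hybrid-Transducer-CTC-Char.nemo")
results = model.transcribe(["audio.wav"], batch_size=16)
for hyp in results:
    print(hyp.text if hasattr(hyp, "text") else hyp)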
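
Both loading paths hand back whatever transcribe() returns, and the exact shape can differ between NeMo versions. The helper below is a minimal sketch for turning that return value into plain strings. The name hypotheses_to_text is hypothetical, and the handling of a (best, all) tuple is an assumption about older NeMo releases, not something stated for this model.

```python
from typing import Any, List


def hypotheses_to_text(results: Any) -> List[str]:
    """Return plain transcript strings from a transcribe() result.

    Hypothetical helper, not part of NeMo.
    """
    # Assumption: some NeMo releases return a (best_hypotheses, all_hypotheses)
    # tuple for transducer models; keep only the best hypotheses in that case.
    if isinstance(results, tuple):
        results = results[0]

    texts = []
    for hyp in results:
        # Hypothesis-like objects carry the transcript in .text;
        # plain strings are passed through unchanged.
        texts.append(hyp.text if hasattr(hyp, "text") else str(hyp))
    return texts
```

With a loaded model, this could be used as hypotheses_to_text(model.transcribe(["audio.wav"], batch_size=16)).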

Training

The model was trained with the NVIDIA NeMo framework (v2.5.0) on the youtube-atc dataset, using character-level tokenization (a-z, space, apostrophe).
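
Because the output vocabulary is limited to these characters, reference transcripts usually need to be mapped onto the same character set before they are compared with model output. The following is a minimal sketch of such a normalization, not the preprocessing used by the authors. The function name normalize_transcript and the specific rules (accent stripping, replacing other characters with spaces) are assumptions made for illustration.

```python
import re
import unicodedata

# Character set taken from the description above: a-z, space, apostrophe.
ALLOWED = set("abcdefghijklmnopqrstuvwxyz '")


def normalize_transcript(text: str) -> str:
    """Map free-form text onto the a-z / space / apostrophe character set.

    Hypothetical helper for illustration only.
    """
    # Treat the typographic apostrophe as a plain ASCII apostrophe.
    text = text.replace("\u2019", "'")
    # Decompose accented letters and drop the combining marks ("ü" -> "u").
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    text = text.lower()
    # Replace everything outside the character set with a space.
    # Note: digits are removed here, so numbers should be spelled out
    # beforehand; that step is outside the scope of this sketch.
    text = "".join(ch if ch in ALLOWED else " " for ch in text)
    # Collapse repeated spaces and trim the ends.
    return re.sub(r" +", " ", text).strip()
```

For example, normalize_transcript("Cleared to land, Runway Two-Seven!") yields "cleared to land runway two seven".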

Dataset

The training data was collected from publicly available YouTube streams of virtual ATC simulator sessions. The full data collection pipeline and curated video collection are available at: github.com/niclaswue/youtube-atc

Citation

@inproceedings{dlr219501,
    title = {Can YouTube Stream Recordings Improve Automatic Speech Recognition for Air Traffic Control?},
    year = {2025},
    booktitle = {13th OpenSky Symposium},
    author = {W{\"u}stenbecker, Niclas and Ohneiser, Oliver and Kleinert, Matthias},
    month = {November},
    url = {https://elib.dlr.de/219501/},
    keywords = {Air Traffic Control; Automatic Speech Recognition; Public Dataset; Large Language Model;}
}

License

MIT
