Add build_dataset.py: YouTube scraping + HF datasets + synthetic generation pipeline (1400+ lines, no API keys needed) ddf61da verified Ellaft committed 12 days ago