---
title: README
emoji: 🏢
colorFrom: purple
colorTo: yellow
sdk: docker
pinned: false
---

# Reuben Data Lab

> 🏆 Work here was produced for the
> **[Uncharted Data Challenge](https://www.adaptionlabs.ai/blog/the-uncharted-data-challenge)**
> hosted by **[Adaption Labs](https://www.adaptionlabs.ai)**; credit to
> **Adaptive Data by Adaption** for organizing the hackathon.

Building **open, underserved datasets** for training and evaluating modern audio, speech, and multimodal models. Every release is open-sourced on Hugging Face with permissive licensing and rich metadata, targeting the three criteria the Uncharted Data Challenge cares about: **under-served problem domains**, **scarce open-source data**, and **under-resourced languages**.

## Datasets

### 🎵 [FMA Labeled — Multi-Attribute Music Dataset](https://huggingface.co/datasets/Reubencf/fma-labeled)

29k Creative Commons tracks from the Free Music Archive, automatically annotated with **lyrics, genre, sub-genres, mood, instruments, BPM, key, vocal type, energy, era, and audio quality** using `gemini-flash-latest`. Paired audio + text for music tagging, music-LM training, and auto-lyric research.

### 🗣️ [Multilingual Synthetic TTS (Qwen3)](https://huggingface.co/datasets/Reubencf/multilingual-synthetic-tts)

~69k synthetic speech clips across **9 languages** (en, ja, zh, ko, de, es, fr, ru, pt) generated with Qwen3-TTS-12Hz via zero-shot voice cloning from a rotating pool of reference speakers. Covers conversational, informational, technical, emotional, and proverb-style utterances; useful for TTS fine-tuning, ASR augmentation, and cross-lingual voice-conversion research.

## Focus Areas

- **Under-resourced languages**: expanding speech and text coverage beyond English-only datasets.
- **Rich supervision**: datasets ship with detailed structured metadata (genre/mood/BPM/key for music; language/style/voice for speech), not just audio + class labels.
- **Permissive licensing**: Creative Commons / CC0 where possible; synthetic outputs released for open research.
- **Reproducibility**: generation pipelines and labeling scripts are open-sourced alongside the data.

## Tooling & Pipeline

- **Labeling**: Google Gemini (`gemini-flash-latest`) via Flex and Batch APIs.
- **Speech synthesis**: Qwen3-TTS-12Hz-1.7B-Base on 2× H100 with zero-shot voice cloning.
- **Infra**: Hyperbolic GPU rentals, custom stall-watchers for long-running multi-GPU jobs, Hugging Face Hub for distribution.

## Get In Touch

- Hugging Face: [@Reubencf](https://huggingface.co/Reubencf)
- Datasets home: [ReubenDataLab](https://huggingface.co/ReubenDataLab)

---

*More datasets coming soon as part of the Uncharted Data Challenge submission.*