---
title: README
emoji: π’
colorFrom: purple
colorTo: yellow
sdk: docker
pinned: false
---

# Reuben Data Lab

> Work here was produced for the
> **[Uncharted Data Challenge](https://www.adaptionlabs.ai/blog/the-uncharted-data-challenge)**
> hosted by **[Adaption Labs](https://www.adaptionlabs.ai)**, with credit to
> **Adaptive Data by Adaption** for organizing the hackathon.

Building **open datasets for underserved domains** to train and evaluate modern
audio, speech, and multimodal models. Every release is open-sourced on
Hugging Face with permissive licensing and rich metadata, targeting the three
criteria the Uncharted Data Challenge cares about: **underserved problem
domains**, **scarce open-source data**, and **under-resourced languages**.

## Datasets

### 🎵 [FMA Labeled - Multi-Attribute Music Dataset](https://huggingface.co/datasets/Reubencf/fma-labeled)

29k Creative Commons tracks from the Free Music Archive, automatically
annotated with **lyrics, genre, sub-genres, mood, instruments, BPM, key,
vocal type, energy, era, and audio quality** using `gemini-flash-latest`.
Paired audio + text for music tagging, music-LM training, and auto-lyric
research.

### 🗣️ [Multilingual Synthetic TTS (Qwen3)](https://huggingface.co/datasets/Reubencf/multilingual-synthetic-tts)

~69k synthetic speech clips across **9 languages** (en, ja, zh, ko, de, es,
fr, ru, pt) generated with Qwen3-TTS-12Hz via zero-shot voice cloning from a
rotating pool of reference speakers. Covers conversational, informational,
technical, emotional, and proverb-style utterances, useful for TTS
fine-tuning, ASR augmentation, and cross-lingual voice-conversion research.

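
The "rotating pool of reference speakers" mentioned above can be pictured with a small sketch. This is not the actual generation pipeline: it assumes a plain round-robin rotation, and the function name `assign_reference_speakers` and the speaker IDs are hypothetical placeholders.

```python
from itertools import cycle


def assign_reference_speakers(utterances, speaker_pool):
    """Pair each utterance with the next speaker in a round-robin rotation.

    Assumption: the pool rotates in a fixed order. The real pipeline may
    sample speakers differently (for example, at random).
    """
    if not speaker_pool:
        raise ValueError("speaker_pool must not be empty")
    rotation = cycle(speaker_pool)
    return [(text, next(rotation)) for text in utterances]


# Hypothetical inputs, for illustration only.
utterances = ["Hello there.", "Guten Tag.", "Bonjour."]
speaker_pool = ["ref_speaker_a", "ref_speaker_b"]

for text, speaker in assign_reference_speakers(utterances, speaker_pool):
    print(speaker, text)
```

Rotating through the pool keeps any single reference voice from dominating the clips, which is one plausible reason to use a pool at all.
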
## Focus Areas

- **Under-resourced languages**: expanding speech and text coverage beyond
  English-only datasets.
- **Rich supervision**: datasets ship with detailed structured metadata
  (genre/mood/BPM/key for music; language/style/voice for speech), not just
  audio + class labels.
- **Permissive licensing**: Creative Commons / CC0 where possible; synthetic
  outputs released for open research.
- **Reproducibility**: generation pipelines and labeling scripts are
  open-sourced alongside the data.

## Tooling & Pipeline

- **Labeling**: Google Gemini (`gemini-flash-latest`) via Flex and Batch APIs.
- **Speech synthesis**: Qwen3-TTS-12Hz-1.7B-Base on 2× H100 with zero-shot
  voice cloning.
- **Infra**: Hyperbolic GPU rentals, custom stall-watchers for long-running
  multi-GPU jobs, Hugging Face Hub for distribution.

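
As a rough illustration of what a stall-watcher for a long-running job might look like, here is a minimal sketch. It is not the lab's actual tooling: it assumes the job touches a progress file (a log or heartbeat) as it works, and the names `is_stalled`, `watch`, and `on_stall` are hypothetical.

```python
import os
import time


def is_stalled(path, timeout_s, now=None):
    """Return True if `path` has not been modified within `timeout_s` seconds."""
    now = time.time() if now is None else now
    try:
        last_update = os.path.getmtime(path)
    except FileNotFoundError:
        # Assumption: a missing progress file counts as a stall in this sketch.
        return True
    return (now - last_update) > timeout_s


def watch(path, timeout_s, poll_s=30.0, on_stall=None):
    """Poll the progress file until it goes stale, then call `on_stall`.

    `on_stall` is a hypothetical hook, e.g. a function that restarts the job
    or sends an alert.
    """
    while not is_stalled(path, timeout_s):
        time.sleep(poll_s)
    if on_stall is not None:
        on_stall(path)
```

A call such as `watch("job.log", timeout_s=600, on_stall=print)` would block until `job.log` has gone ten minutes without an update. Checking file modification time keeps the watcher independent of the job's internals, at the cost of requiring the job to write output regularly.
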
## Get In Touch

- Hugging Face: [@Reubencf](https://huggingface.co/Reubencf)
- Datasets home: [ReubenDataLab](https://huggingface.co/ReubenDataLab)

---

*More datasets coming soon as part of the Uncharted Data Challenge submission.*