Update README.md
README.md
CHANGED
@@ -7,4 +7,60 @@ sdk: docker
 pinned: false
 ---
 
-
+# Reuben Data Lab
+
+> Work here was produced for the
+> **[Uncharted Data Challenge](https://www.adaptionlabs.ai/blog/the-uncharted-data-challenge)**
+> hosted by **[Adaption Labs](https://www.adaptionlabs.ai)**. Credit to
+> **Adaptive Data by Adaption** for organizing the hackathon.
+
+Building **open, underserved datasets** for training and evaluating modern
+audio, speech, and multimodal models. Every release is open-sourced on
+Hugging Face with permissive licensing and rich metadata, targeting the three
+criteria the Uncharted Data Challenge cares about: **underserved problem
+domains**, **scarce open-source data**, and **under-resourced languages**.
+
+## Datasets
+
+### 🎵 [FMA Labeled: Multi-Attribute Music Dataset](https://huggingface.co/datasets/Reubencf/fma-labeled)
+29k Creative-Commons tracks from the Free Music Archive, automatically
+annotated with **lyrics, genre, sub-genres, mood, instruments, BPM, key,
+vocal type, energy, era, and audio quality** using `gemini-flash-latest`.
+Paired audio + text for music tagging, music-LM training, and auto-lyric
+research.
+
+### 🗣️ [Multilingual Synthetic TTS (Qwen3)](https://huggingface.co/datasets/Reubencf/multilingual-synthetic-tts)
+~69k synthetic speech clips across **9 languages** (en, ja, zh, ko, de, es,
+fr, ru, pt) generated with Qwen3-TTS-12Hz via zero-shot voice cloning from a
+rotating pool of reference speakers. Covers conversational, informational,
+technical, emotional, and proverb-style utterances, useful for TTS
+fine-tuning, ASR augmentation, and cross-lingual voice-conversion research.
+
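The dataset entries above point at Hugging Face Hub repositories, so a reader would typically pull them with the `datasets` library. The following is a minimal sketch, assuming `datasets` is installed and that the repository exposes a `train` split. The column names `genre` and `bpm` and the helper names `filter_tracks` and `stream_fma_labeled` are hypothetical, chosen for illustration only; the dataset card is the place to confirm the real schema.

```python
from typing import Any, Dict, Iterable, Iterator


def filter_tracks(
    records: Iterable[Dict[str, Any]],
    genre: str,
    min_bpm: float,
) -> Iterator[Dict[str, Any]]:
    """Yield records whose genre matches (case-insensitive) and whose BPM is high enough.

    `genre` and `bpm` are hypothetical column names; records missing a BPM are skipped.
    Assumes BPM values, when present, are numeric.
    """
    wanted = genre.lower()
    for record in records:
        record_genre = str(record.get("genre", "")).lower()
        bpm = record.get("bpm")
        if record_genre == wanted and bpm is not None and float(bpm) >= min_bpm:
            yield record


def stream_fma_labeled(split: str = "train") -> Iterable[Dict[str, Any]]:
    """Stream the dataset from the Hub instead of downloading it in full.

    Assumes the `datasets` package is installed and that `split` exists.
    """
    from datasets import load_dataset  # lazy import; this call needs network access

    return load_dataset("Reubencf/fma-labeled", split=split, streaming=True)
```

With network access, `filter_tracks(stream_fma_labeled(), "rock", 120)` would iterate lazily over matching rows. Streaming is used here because the description mentions roughly 29k audio tracks, which may be large to download up front.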
+## Focus Areas
+
+- **Under-resourced languages**: expanding speech and text coverage beyond
+English-only datasets.
+- **Rich supervision**: datasets ship with detailed structured metadata
+(genre/mood/BPM/key for music; language/style/voice for speech), not just
+audio + class labels.
+- **Permissive licensing**: Creative Commons / CC0 where possible; synthetic
+outputs released for open research.
+- **Reproducibility**: generation pipelines and labeling scripts are
+open-sourced alongside the data.
+
+## Tooling & Pipeline
+
+- **Labeling**: Google Gemini (`gemini-flash-latest`) via Flex and Batch APIs.
+- **Speech synthesis**: Qwen3-TTS-12Hz-1.7B-Base on 2× H100 with zero-shot
+voice cloning.
+- **Infra**: Hyperbolic GPU rentals, custom stall-watchers for long-running
+multi-GPU jobs, Hugging Face Hub for distribution.
+
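The README mentions custom stall-watchers for long-running multi-GPU jobs but does not show them. One common way to build such a watcher is to poll the modification time of a file the job writes to, and to flag the job when that file stops changing. The sketch below illustrates that idea only and is not the author's implementation; the names `is_stalled`, `watch`, and `progress_file` are hypothetical.

```python
import time
from pathlib import Path
from typing import Callable, Optional


def is_stalled(progress_file: Path, timeout_s: float, now: Optional[float] = None) -> bool:
    """Return True if `progress_file` has not been modified for more than `timeout_s` seconds.

    A missing file counts as stalled, since the job never reported progress.
    """
    if not progress_file.exists():
        return True
    current = time.time() if now is None else now
    return (current - progress_file.stat().st_mtime) > timeout_s


def watch(
    progress_file: Path,
    timeout_s: float,
    poll_s: float,
    on_stall: Callable[[Path], None],
    max_polls: Optional[int] = None,
) -> bool:
    """Poll until a stall is detected, then call `on_stall` and return True.

    Returns False if `max_polls` checks pass without a stall; with
    `max_polls=None` the loop runs until a stall occurs.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        if is_stalled(progress_file, timeout_s):
            on_stall(progress_file)
            return True
        time.sleep(poll_s)
        polls += 1
    return False
```

In practice `on_stall` might send an alert or restart the job. The callback is left to the caller here because the README does not say what its watchers do when a stall is found.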
+## Get In Touch
+
+- Hugging Face: [@Reubencf](https://huggingface.co/Reubencf)
+- Datasets home: [ReubenDataLab](https://huggingface.co/ReubenDataLab)
+
+---
+
+*More datasets coming soon as part of the Uncharted Data Challenge submission.*