Reubencf committed on
Commit
34bf0cd
·
verified ·
1 Parent(s): 730f9d2

Update README.md

Files changed (1)
  1. README.md +57 -1
README.md CHANGED
@@ -7,4 +7,60 @@ sdk: docker
  pinned: false
  ---
 
- Edit this `README.md` markdown file to author your organization card.
+ # Reuben Data Lab
+
+ > 🏆 Work here was produced for the
+ > **[Uncharted Data Challenge](https://www.adaptionlabs.ai/blog/the-uncharted-data-challenge)**
+ > hosted by **[Adaption Labs](https://www.adaptionlabs.ai)**. Credit to
+ > **Adaptive Data by Adaption** for organizing the hackathon.
+
+ Building **open, underserved datasets** for training and evaluating modern
+ audio, speech, and multimodal models. Every release is open-sourced on
+ Hugging Face with permissive licensing and rich metadata, targeting the three
+ criteria the Uncharted Data Challenge cares about: **under-served problem
+ domains**, **scarce open-source data**, and **under-resourced languages**.
+
+ ## Datasets
+
+ ### 🎵 [FMA Labeled: Multi-Attribute Music Dataset](https://huggingface.co/datasets/Reubencf/fma-labeled)
+ 29k Creative-Commons tracks from the Free Music Archive, automatically
+ annotated with **lyrics, genre, sub-genres, mood, instruments, BPM, key,
+ vocal type, energy, era, and audio quality** using `gemini-flash-latest`.
+ Paired audio + text for music tagging, music-LM training, and auto-lyric
+ research.
+
+ ### 🗣️ [Multilingual Synthetic TTS (Qwen3)](https://huggingface.co/datasets/Reubencf/multilingual-synthetic-tts)
+ ~69k synthetic speech clips across **9 languages** (en, ja, zh, ko, de, es,
+ fr, ru, pt) generated with Qwen3-TTS-12Hz via zero-shot voice cloning from a
+ rotating pool of reference speakers. Covers conversational, informational,
+ technical, emotional, and proverb-style utterances, useful for TTS
+ fine-tuning, ASR augmentation, and cross-lingual voice-conversion research.
+
+ ## Focus Areas
+
+ - **Under-resourced languages**: expanding speech and text coverage beyond
+ English-only datasets.
+ - **Rich supervision**: datasets ship with detailed structured metadata
+ (genre/mood/BPM/key for music; language/style/voice for speech), not just
+ audio + class labels.
+ - **Permissive licensing**: Creative Commons / CC0 where possible; synthetic
+ outputs released for open research.
+ - **Reproducibility**: generation pipelines and labeling scripts are
+ open-sourced alongside the data.
+
+ ## Tooling & Pipeline
+
+ - **Labeling**: Google Gemini (`gemini-flash-latest`) via Flex and Batch APIs.
+ - **Speech synthesis**: Qwen3-TTS-12Hz-1.7B-Base on 2× H100 with zero-shot
+ voice cloning.
+ - **Infra**: Hyperbolic GPU rentals, custom stall-watchers for long-running
+ multi-GPU jobs, Hugging Face Hub for distribution.
+
+ ## Get In Touch
+
+ - Hugging Face: [@Reubencf](https://huggingface.co/Reubencf)
+ - Datasets home: [ReubenDataLab](https://huggingface.co/ReubenDataLab)
+
+ ---
+
+ *More datasets coming soon as part of the Uncharted Data Challenge submission.*
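
The README added in this commit mentions custom stall-watchers for long-running multi-GPU jobs but does not show how they work. As a rough illustration only, one common approach is to treat a job as stalled when its log file has not been written to for some time. The sketch below assumes that convention; the function names, the log-file signal, and the 600-second threshold are all hypothetical and are not taken from the repository.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import Optional


def seconds_since_update(path: Path, now: Optional[float] = None) -> float:
    """Return how many seconds have passed since `path` was last modified."""
    current = time.time() if now is None else now
    return current - path.stat().st_mtime


def is_stalled(path: Path, max_idle_seconds: float, now: Optional[float] = None) -> bool:
    """Hypothetical stall check: a job counts as stalled when its log file
    is missing or has not been modified within `max_idle_seconds`."""
    if not path.exists():
        # Assumption: a missing log means the job never started or crashed early.
        return True
    return seconds_since_update(path, now) > max_idle_seconds


if __name__ == "__main__":
    # Small demonstration using a temporary file in place of a real job log.
    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / "job.log"
        log.write_text("step 1 done\n")
        print("fresh log stalled?", is_stalled(log, max_idle_seconds=600))

        # Backdate the file's modification time to simulate an idle job.
        an_hour_ago = time.time() - 3600
        os.utime(log, (an_hour_ago, an_hour_ago))
        print("old log stalled?", is_stalled(log, max_idle_seconds=600))
```

A watcher built this way would typically call `is_stalled` in a loop with a sleep between checks and restart or alert on a positive result; how the actual pipeline reacts to a stall is not stated in the README.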