---
title: README
emoji: 🏒
colorFrom: purple
colorTo: yellow
sdk: docker
pinned: false
---

# Reuben Data Lab

> πŸ† Work here was produced for the
> **[Uncharted Data Challenge](https://www.adaptionlabs.ai/blog/the-uncharted-data-challenge)**
> hosted by **[Adaption Labs](https://www.adaptionlabs.ai)** β€” credit to
> **Adaptive Data by Adaption** for organizing the hackathon.

Reuben Data Lab builds **open datasets for underserved areas** of modern
audio, speech, and multimodal model training and evaluation. Every release is
open-sourced on Hugging Face with permissive licensing and rich metadata,
targeting the three criteria the Uncharted Data Challenge cares about:
**underserved problem domains**, **scarce open-source data**, and
**under-resourced languages**.

## Datasets

### 🎵 [FMA Labeled: Multi-Attribute Music Dataset](https://huggingface.co/datasets/Reubencf/fma-labeled)
29k Creative-Commons tracks from the Free Music Archive, automatically
annotated with **lyrics, genre, sub-genres, mood, instruments, BPM, key,
vocal type, energy, era, and audio quality** using `gemini-flash-latest`.
Paired audio + text for music tagging, music-LM training, and auto-lyric
research.
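
As a rough illustration of how the structured labels could be used downstream, the sketch below filters tracks by genre and tempo. The `TrackLabels` class, its field names, and `filter_tracks` are hypothetical and chosen only for this example; the actual column names on the dataset card may differ.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class TrackLabels:
    """Hypothetical per-track record; field names are illustrative only."""
    track_id: str
    genre: str
    mood: str
    bpm: float
    key: str
    instruments: List[str] = field(default_factory=list)


def filter_tracks(
    tracks: Iterable[TrackLabels],
    genre: Optional[str] = None,
    min_bpm: Optional[float] = None,
    max_bpm: Optional[float] = None,
) -> List[TrackLabels]:
    """Return tracks that satisfy every criterion that is not None."""
    selected = []
    for track in tracks:
        # Genre comparison is case-insensitive.
        if genre is not None and track.genre.lower() != genre.lower():
            continue
        if min_bpm is not None and track.bpm < min_bpm:
            continue
        if max_bpm is not None and track.bpm > max_bpm:
            continue
        selected.append(track)
    return selected
```

A call such as `filter_tracks(tracks, genre="rock", min_bpm=120)` would keep only rock tracks at 120 BPM or faster, which is the kind of subset one might build for a tagging or tempo-estimation experiment.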

### πŸ—£οΈ [Multilingual Synthetic TTS (Qwen3)](https://huggingface.co/datasets/Reubencf/multilingual-synthetic-tts)
~69k synthetic speech clips across **9 languages** (en, ja, zh, ko, de, es,
fr, ru, pt) generated with Qwen3-TTS-12Hz via zero-shot voice cloning from a
rotating pool of reference speakers. Covers conversational, informational,
technical, emotional, and proverb-style utterances, useful for TTS
fine-tuning, ASR augmentation, and cross-lingual voice-conversion research.
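
One way a rotating pool of reference speakers could work is simple round-robin assignment. The sketch below assumes the rotation happens per language; that detail, the function name `assign_reference_speakers`, and the speaker IDs are assumptions made for illustration, not a description of the actual generation pipeline.

```python
from itertools import cycle
from typing import Dict, Iterable, List, Tuple


def assign_reference_speakers(
    utterances: Iterable[Tuple[str, str]],
    speaker_pool: Dict[str, List[str]],
) -> List[Tuple[str, str, str]]:
    """Pair each (language, text) with the next speaker in that language's pool."""
    # One endless round-robin iterator per language with a non-empty pool.
    rotations = {
        lang: cycle(speakers)
        for lang, speakers in speaker_pool.items()
        if speakers
    }
    assigned = []
    for lang, text in utterances:
        if lang not in rotations:
            raise KeyError(f"no reference speakers for language {lang!r}")
        assigned.append((lang, text, next(rotations[lang])))
    return assigned
```

Rotating through the pool spreads utterances evenly across voices, so no single reference speaker dominates a language's clips.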

## Focus Areas

- **Under-resourced languages**: expanding speech and text coverage beyond
  English-only datasets.
- **Rich supervision**: datasets ship with detailed structured metadata
  (genre/mood/BPM/key for music; language/style/voice for speech), not just
  audio + class labels.
- **Permissive licensing**: Creative Commons / CC0 where possible; synthetic
  outputs released for open research.
- **Reproducibility**: generation pipelines and labeling scripts are
  open-sourced alongside the data.

## Tooling & Pipeline

- **Labeling**: Google Gemini (`gemini-flash-latest`) via Flex and Batch APIs.
- **Speech synthesis**: Qwen3-TTS-12Hz-1.7B-Base on 2× H100 with zero-shot
  voice cloning.
- **Infra**: Hyperbolic GPU rentals, custom stall-watchers for long-running
  multi-GPU jobs, Hugging Face Hub for distribution.
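
For readers unfamiliar with the idea, a stall-watcher can be as small as a heartbeat tracker that reports workers which have gone quiet. The sketch below is a generic, hypothetical version (the `StallWatcher` class, its method names, and the timeout are all assumptions); it is not the lab's actual tooling.

```python
import time
from typing import Callable, Dict, List


class StallWatcher:
    """Flags workers whose last heartbeat is older than `timeout_s` seconds."""

    def __init__(
        self,
        timeout_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.timeout_s = timeout_s
        self._clock = clock  # injectable so the watcher can be tested without sleeping
        self._last_seen: Dict[str, float] = {}

    def heartbeat(self, worker: str) -> None:
        """Record that `worker` made progress just now."""
        self._last_seen[worker] = self._clock()

    def stalled(self) -> List[str]:
        """Return the sorted names of workers that exceeded the timeout."""
        now = self._clock()
        return sorted(
            worker
            for worker, seen in self._last_seen.items()
            if now - seen > self.timeout_s
        )
```

In a multi-GPU job, each worker process would call `heartbeat` after finishing a batch, and a supervisor loop would poll `stalled()` to decide which workers to restart.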

## Get In Touch

- Hugging Face: [@Reubencf](https://huggingface.co/Reubencf)
- Datasets home: [ReubenDataLab](https://huggingface.co/ReubenDataLab)

---

*More datasets coming soon as part of the Uncharted Data Challenge submission.*