	---
	title: README
	emoji: 🏢
	colorFrom: purple
	colorTo: yellow
	sdk: docker
	pinned: false
	---

	# Reuben Data Lab

	> 🏆 Work here was produced for the
	> [Uncharted Data Challenge](https://www.adaptionlabs.ai/blog/the-uncharted-data-challenge)
	> hosted by [Adaption Labs](https://www.adaptionlabs.ai) — credit to
	> Adaptive Data by Adaption for organizing the hackathon.

	Reuben Data Lab builds open datasets for underserved areas of audio,
	speech, and multimodal model training and evaluation. Every release is
	open-sourced on Hugging Face with permissive licensing and rich metadata,
	targeting the three criteria the Uncharted Data Challenge cares about:
	**underserved problem domains, scarce open-source data, and
	under-resourced languages**.

	## Datasets

	### 🎵 [FMA Labeled — Multi-Attribute Music Dataset](https://huggingface.co/datasets/Reubencf/fma-labeled)
	29k Creative Commons tracks from the Free Music Archive, automatically
	annotated with **lyrics, genre, sub-genres, mood, instruments, BPM, key,
	vocal type, energy, era, and audio quality** using `gemini-flash-latest`.
	The paired audio and text are intended for music tagging, music-LM
	training, and auto-lyric research.
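
Because the labels are model-generated, it can help to validate each record before training on it. The following is a minimal sketch of that idea, not the dataset's actual schema: the `TrackLabels` class, the `parse_labels` helper, the JSON field names, and the 20 to 300 BPM plausibility range are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TrackLabels:
    """Hypothetical per-track label record; field names are illustrative."""
    genre: str
    sub_genres: List[str] = field(default_factory=list)
    mood: List[str] = field(default_factory=list)
    instruments: List[str] = field(default_factory=list)
    bpm: Optional[float] = None
    key: Optional[str] = None
    vocal_type: Optional[str] = None
    energy: Optional[str] = None
    era: Optional[str] = None
    audio_quality: Optional[str] = None
    lyrics: Optional[str] = None


def _as_list(value) -> List[str]:
    """Normalize a missing value, a single string, or a list into a list of strings."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return [str(v) for v in value]


def parse_labels(raw: str) -> TrackLabels:
    """Parse one JSON label record and apply light sanity checks."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("label record must be a JSON object")
    genre = data.get("genre")
    if not isinstance(genre, str) or not genre.strip():
        raise ValueError("genre is required")
    bpm = data.get("bpm")
    if bpm is not None:
        bpm = float(bpm)
        # Assumed plausibility range; out-of-range tempos are treated as missing.
        if not 20 <= bpm <= 300:
            bpm = None
    return TrackLabels(
        genre=genre.strip(),
        sub_genres=_as_list(data.get("sub_genres")),
        mood=_as_list(data.get("mood")),
        instruments=_as_list(data.get("instruments")),
        bpm=bpm,
        key=data.get("key"),
        vocal_type=data.get("vocal_type"),
        energy=data.get("energy"),
        era=data.get("era"),
        audio_quality=data.get("audio_quality"),
        lyrics=data.get("lyrics"),
    )
```

For example, `parse_labels('{"genre": "Rock", "mood": "energetic", "bpm": 128}')` returns a record whose `mood` is normalized to a one-element list, while a record without a `genre` raises `ValueError`.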

	### 🗣️ [Multilingual Synthetic TTS (Qwen3)](https://huggingface.co/datasets/Reubencf/multilingual-synthetic-tts)
	~69k synthetic speech clips across 9 languages (en, ja, zh, ko, de, es,
	fr, ru, pt) generated with Qwen3-TTS-12Hz via zero-shot voice cloning from a
	rotating pool of reference speakers. Covers conversational, informational,
	technical, emotional, and proverb-style utterances — useful for TTS
	fine-tuning, ASR augmentation, and cross-lingual voice-conversion research.
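
One simple way to picture a rotating pool of reference speakers is a round-robin assignment of reference clips to utterances. The sketch below is illustrative only: the function name `assign_reference_speakers` and the clip filenames are hypothetical, and strict round-robin order is an assumption (the actual pipeline may rotate differently, for example by random sampling).

```python
from itertools import cycle
from typing import List, Sequence, Tuple


def assign_reference_speakers(
    texts: Sequence[str],
    reference_clips: Sequence[str],
) -> List[Tuple[str, str]]:
    """Pair each utterance with a reference clip, cycling through the pool in order."""
    if not reference_clips:
        raise ValueError("reference_clips must not be empty")
    rotation = cycle(reference_clips)
    # Each utterance takes the next clip; the pool wraps around when exhausted.
    return [(text, next(rotation)) for text in texts]


# Hypothetical usage: three utterances, two reference speakers.
pairs = assign_reference_speakers(
    ["Hello there.", "Guten Tag.", "Bonjour."],
    ["speaker_a.wav", "speaker_b.wav"],
)
```

Each resulting pair would then be handed to the TTS model as the text to synthesize and the voice to clone.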

	## Focus Areas

	- Under-resourced languages — expanding speech and text coverage beyond
	English-only datasets.
	- Rich supervision — datasets ship with detailed structured metadata
	(genre/mood/BPM/key for music; language/style/voice for speech), not just
	audio + class labels.
	- Permissive licensing — Creative Commons / CC0 where possible; synthetic
	outputs released for open research.
	- Reproducibility — generation pipelines and labeling scripts are
	open-sourced alongside the data.

	## Tooling & Pipeline

	- Labeling: Google Gemini (`gemini-flash-latest`) via Flex and Batch APIs.
	- Speech synthesis: Qwen3-TTS-12Hz-1.7B-Base on 2× H100 with zero-shot
	voice cloning.
	- Infra: Hyperbolic GPU rentals, custom stall-watchers for long-running
	multi-GPU jobs, Hugging Face Hub for distribution.
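
To illustrate what a stall-watcher for a long-running job might look like, here is a minimal sketch under stated assumptions; it is not the lab's actual tooling. The `StallWatcher` class, its `update` method, and the idea of tracking a single integer progress counter (such as the number of clips written so far) are all hypothetical.

```python
import time
from typing import Callable, Optional


class StallWatcher:
    """Flags a job as stalled when its progress counter stops advancing."""

    def __init__(self, timeout_s: float, clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock  # injectable so the watcher can be tested without sleeping
        self._last_progress: Optional[int] = None
        self._last_change = clock()

    def update(self, progress: int) -> bool:
        """Record the latest progress value; return True if the job looks stalled."""
        now = self._clock()
        if progress != self._last_progress:
            # Progress moved, so reset the stall timer.
            self._last_progress = progress
            self._last_change = now
            return False
        return (now - self._last_change) >= self.timeout_s
```

A supervising loop could call `update` periodically with the current output count and restart or alert on a `True` result. Using a monotonic clock avoids false alarms when the system wall clock is adjusted.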

	## Get In Touch

	- Hugging Face: [@Reubencf](https://huggingface.co/Reubencf)
	- Datasets home: [ReubenDataLab](https://huggingface.co/ReubenDataLab)

	---

	More datasets coming soon as part of the Uncharted Data Challenge submission.