Spaces: ReubenDataLab/README

Reubencf committed 3 days ago

Commit 34bf0cd (verified) · 1 parent: 730f9d2

Update README.md

README.md +57 -1

@@ -7,4 +7,60 @@ sdk: docker
 pinned: false
 ---
-Edit this `README.md` markdown file to author your organization card.
+# Reuben Data Lab
+> 🏆 The work in this organization was produced for the
+> **[Uncharted Data Challenge](https://www.adaptionlabs.ai/blog/the-uncharted-data-challenge)**
+> hosted by **[Adaption Labs](https://www.adaptionlabs.ai)** — credit to
+> **Adaptive Data by Adaption** for organizing the hackathon.
+We build **open datasets for underserved domains**, for training and evaluating
+modern audio, speech, and multimodal models. Every release is open-sourced on
+Hugging Face with permissive licensing and rich metadata, targeting the three
+criteria the Uncharted Data Challenge cares about: **underserved problem
+domains**, **scarce open-source data**, and **under-resourced languages**.
+## Datasets
+### 🎵 [FMA Labeled — Multi-Attribute Music Dataset](https://huggingface.co/datasets/Reubencf/fma-labeled)
+29k Creative-Commons tracks from the Free Music Archive, automatically
+annotated with **lyrics, genre, sub-genres, mood, instruments, BPM, key,
+vocal type, energy, era, and audio quality** using `gemini-flash-latest`.
+Paired audio + text for music tagging, music-LM training, and auto-lyric
+research.
+### 🗣️ [Multilingual Synthetic TTS (Qwen3)](https://huggingface.co/datasets/Reubencf/multilingual-synthetic-tts)
+~69k synthetic speech clips across **9 languages** (en, ja, zh, ko, de, es,
+fr, ru, pt) generated with Qwen3-TTS-12Hz via zero-shot voice cloning from a
+rotating pool of reference speakers. Covers conversational, informational,
+technical, emotional, and proverb-style utterances — useful for TTS
+fine-tuning, ASR augmentation, and cross-lingual voice-conversion research.
+## Focus Areas
+- **Under-resourced languages** — expanding speech and text coverage beyond
+  English-only datasets.
+- **Rich supervision** — datasets ship with detailed structured metadata
+  (genre/mood/BPM/key for music; language/style/voice for speech), not just
+  audio + class labels.
+- **Permissive licensing** — Creative Commons / CC0 where possible; synthetic
+  outputs released for open research.
+- **Reproducibility** — generation pipelines and labeling scripts are
+  open-sourced alongside the data.
+## Tooling & Pipeline
+- **Labeling**: Google Gemini (`gemini-flash-latest`) via Flex and Batch APIs.
+- **Speech synthesis**: Qwen3-TTS-12Hz-1.7B-Base on 2× H100 with zero-shot
+  voice cloning.
+- **Infra**: Hyperbolic GPU rentals, custom stall-watchers for long-running
+  multi-GPU jobs, Hugging Face Hub for distribution.
+## Get In Touch
+- Hugging Face: [@Reubencf](https://huggingface.co/Reubencf)
+- Datasets home: [ReubenDataLab](https://huggingface.co/ReubenDataLab)
+---
+*More datasets coming soon as part of the Uncharted Data Challenge submission.*
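
The "Tooling & Pipeline" list in the README mentions custom stall-watchers for long-running multi-GPU jobs but does not show one. The sketch below is one possible shape for such a watcher, not the project's actual script. It assumes the job appends to a heartbeat or log file as it makes progress, and it treats a file that has not been modified within a timeout as a stall. The names `is_stalled` and `watch`, the heartbeat-file convention, and all parameters are hypothetical.

```python
import os
import time
from typing import Callable, Optional


def is_stalled(path: str, timeout_s: float, now: Optional[float] = None) -> bool:
    """Return True if `path` is missing or was last modified more than
    `timeout_s` seconds before `now` (defaults to the current time)."""
    now = time.time() if now is None else now
    try:
        last_update = os.path.getmtime(path)
    except FileNotFoundError:
        # Assumption: a job that has not written its heartbeat file yet
        # is treated the same as one that stopped writing to it.
        return True
    return (now - last_update) > timeout_s


def watch(
    path: str,
    timeout_s: float,
    poll_s: float,
    on_stall: Callable[[str], None],
    max_polls: Optional[int] = None,
) -> bool:
    """Poll `path` every `poll_s` seconds and call `on_stall(path)` once
    when a stall is detected.

    Returns True if a stall was reported, or False if `max_polls` checks
    passed without one. With `max_polls=None` the loop runs until a stall.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        if is_stalled(path, timeout_s):
            on_stall(path)
            return True
        time.sleep(poll_s)
        polls += 1
    return False
```

A caller might run `watch("run/heartbeat.log", timeout_s=900, poll_s=30, on_stall=restart_job)`, where `restart_job` is a hypothetical callback that kills and relaunches the job. Watching file modification time keeps the watcher independent of the training framework; the trade-off is that the job must write to the file often enough for the timeout to be meaningful.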