Experiment: Speaker Embedding Visualization on Expresso Dataset
Dataset: ylacombe/expresso
Setup:
- Model: Speaker embedding network (768-dimensional output)
- Dataset: Expresso audio dataset (4 speakers: ex01-ex04)
- Dimensionality reduction: t-SNE (768-D → 2-D); a pipeline sketch follows this list
- Emotions/styles: 11 categories (default, happy, sad, whisper, singing, enunciated, confused, emphasis, laughing, longform, essentials)
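
A minimal sketch of this pipeline is below. The actual embedding network is not specified in the post, so `embed_speaker` is a hypothetical placeholder (here a dummy that returns deterministic random vectors so the script runs end to end), and the `speaker_id`/`style` column names are assumptions to verify against the dataset card:

```python
import numpy as np
import matplotlib.pyplot as plt
from datasets import load_dataset
from sklearn.manifold import TSNE

def embed_speaker(waveform: np.ndarray, sampling_rate: int) -> np.ndarray:
    """Hypothetical placeholder for the (unspecified) 768-dim speaker
    embedding network. Returns a deterministic dummy vector so the
    script runs end to end; swap in your encoder's forward pass here."""
    rng = np.random.default_rng(abs(hash(waveform.tobytes())) % 2**32)
    return rng.standard_normal(768)

# Load Expresso; subsample for t-SNE speed (optional).
ds = load_dataset("ylacombe/expresso", split="train")
ds = ds.shuffle(seed=0).select(range(min(1000, len(ds))))

# Assumed column names: `audio`, `speaker_id`, `style` (check the dataset card).
embeddings, speakers, styles = [], [], []
for row in ds:
    audio = row["audio"]
    embeddings.append(embed_speaker(audio["array"], audio["sampling_rate"]))
    speakers.append(row["speaker_id"])
    styles.append(row["style"])
X = np.stack(embeddings)  # shape: (n_utterances, 768)

# t-SNE: 768-D -> 2-D, fit jointly over all speakers and styles.
xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# Scatter plot, one color per speaker (ex01-ex04); styles could
# additionally be distinguished by marker shape.
fig, ax = plt.subplots(figsize=(8, 8))
for spk in sorted(set(speakers)):
    mask = np.array(speakers) == spk
    ax.scatter(xy[mask, 0], xy[mask, 1], s=8, label=spk)
ax.legend(title="speaker")
ax.set_title("t-SNE of speaker embeddings (Expresso)")
plt.show()
```

Fitting t-SNE on all utterances jointly, rather than per speaker, is what makes the cross-speaker style comparison in the results below meaningful.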
Results:
- Strong speaker separation: the model clusters the four speakers into distinct, well-separated groups, demonstrating robust speaker-identity encoding
- Emotion/style substructure: within each speaker cluster, emotional and prosodic variations form visible subclusters, particularly for the "whisper", "singing", and "default" styles
- Cross-speaker consistency: similar styles (e.g., whisper) occupy consistent relative positions across speaker clusters, suggesting the embeddings encode both speaker identity and speaking style (a quantitative check is sketched after this list)
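
These observations are qualitative; one way to quantify them (not part of the original experiment) is to compute silhouette scores on the raw 768-D embeddings, reusing `X`, `speakers`, and `styles` from the sketch above:

```python
# Suggested quantitative check (not from the original post).
from sklearn.metrics import silhouette_score

# Scoring on the raw 768-D embeddings avoids t-SNE's distance distortions.
# A high score for speaker labels would back "strong speaker separation";
# a smaller but positive score for style labels would support the
# within-cluster emotion/style substructure.
print("speaker silhouette:", silhouette_score(X, speakers, metric="cosine"))
print("style silhouette:  ", silhouette_score(X, styles, metric="cosine"))
```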
Implications for TTS Voice Cloning:
- Since the embeddings carry both speaker identity and speaking style, the style of a reference clip (e.g., whisper) can be expected to transfer along with the voice during cloning, so choosing reference audio in the desired style matters as much as choosing the right speaker.