Experiment: Speaker Embedding Visualization on Expresso Dataset

by Simonlob

Dataset: ylacombe/expresso

Setup:

  • Model: Speaker embedding network (768-dimensional output)
  • Dataset: Expresso audio dataset (4 speakers: ex01-ex04)
  • Dimensionality reduction: t-SNE (768 β†’ 2D)
  • Emotions/styles: 11 categories (default, happy, sad, whisper, singing, enunciated, confused, emphasis, laughing, longform, essentials)
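The reduction step above can be sketched as follows. This is a minimal, hypothetical example: synthetic arrays stand in for the real 768-dimensional Expresso embeddings, and the variable names are assumptions, not the actual experiment code.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for real speaker embeddings: 4 "speakers" x 10 utterances,
# 768-dim, matching the dimensionality described above.
rng = np.random.default_rng(0)
embeddings = np.concatenate(
    [rng.normal(loc=i * 5.0, scale=1.0, size=(10, 768)) for i in range(4)]
)
speaker_ids = np.repeat(["ex01", "ex02", "ex03", "ex04"], 10)

# Reduce 768 -> 2 dimensions for plotting; perplexity must stay
# below the number of samples.
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
points_2d = tsne.fit_transform(embeddings)
print(points_2d.shape)
```

The resulting `points_2d` array can be scattered with per-speaker colors and per-style markers to reproduce the kind of plot discussed here.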

Results:

  • Strong speaker separation: Model successfully clusters speakers into distinct, well-separated groups, demonstrating robust speaker identity encoding
  • Emotion-style substructure: Within each speaker cluster, emotional and prosodic variations form visible subclusters, particularly for "whisper", "singing", and "default" styles
  • Cross-speaker consistency: Similar emotions (e.g., whisper) show consistent positioning across different speaker clusters, suggesting the embeddings encode both speaker identity and speaking style
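One way to back the visual impression of "strong speaker separation" with a number (not part of the original experiment, just a suggested check) is a silhouette score computed on the raw 768-dim embeddings with speaker labels as clusters; again synthetic data stands in for the real embeddings:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Synthetic stand-in: 4 speakers with distinct means in 768-dim space.
rng = np.random.default_rng(0)
embeddings = np.concatenate(
    [rng.normal(loc=i * 5.0, scale=1.0, size=(10, 768)) for i in range(4)]
)
speaker_ids = np.repeat(["ex01", "ex02", "ex03", "ex04"], 10)

# Silhouette is in [-1, 1]; values near 1 mean tight, well-separated
# clusters, complementing the qualitative t-SNE picture.
score = silhouette_score(embeddings, speaker_ids)
print(f"speaker silhouette: {score:.3f}")
```

Scoring in the original embedding space avoids the distortions t-SNE can introduce, so it is a more trustworthy separation measure than distances in the 2-D plot.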

Implications for TTS Voice Cloning:

  • Embeddings capture rich acoustic information beyond pure speaker identity
  • May require style normalization or multi-reference averaging for consistent voice cloning
  • Consider disentangling speaker identity from prosody/emotion for production use
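The multi-reference averaging idea mentioned above could look like the following sketch; the shapes and the unit-normalization step are assumptions about the embedding space, not confirmed details of the model:

```python
import numpy as np

# Hypothetical multi-reference averaging: pool several reference clips
# from one speaker so per-utterance style/emotion variation washes out,
# leaving a more stable speaker-identity vector.
rng = np.random.default_rng(0)
reference_embeddings = rng.normal(size=(5, 768))  # 5 clips, 768-dim each

centroid = reference_embeddings.mean(axis=0)
# Unit-normalize, a common convention when embeddings are compared
# with cosine similarity.
centroid /= np.linalg.norm(centroid)
print(centroid.shape)
```

If the embeddings really do encode style alongside identity, averaging references recorded in different styles should pull the centroid toward the identity component, which is exactly what consistent voice cloning needs.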
