Experiment: Speaker Embedding Visualization on Expresso Dataset

by Simonlob

Dataset: ylacombe/expresso

Setup:

  • Model: Speaker embedding network (768-dimensional output)
  • Dataset: Expresso audio dataset (4 speakers: ex01-ex04)
  • Dimensionality reduction: t-SNE (768 β†’ 2D)
  • Emotions/styles: 11 categories (default, happy, sad, whisper, singing, enunciated, confused, emphasis, laughing, longform, essentials)
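The reduction step above can be sketched as follows. This is a minimal, hypothetical example: synthetic arrays stand in for the real 768-dimensional Expresso embeddings, and the variable names are assumptions, not the actual experiment code.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for real speaker embeddings: 4 "speakers" x 10 utterances,
# 768-dim, matching the dimensionality described above.
rng = np.random.default_rng(0)
embeddings = np.concatenate(
    [rng.normal(loc=i * 5.0, scale=1.0, size=(10, 768)) for i in range(4)]
)
speaker_ids = np.repeat(["ex01", "ex02", "ex03", "ex04"], 10)

# Reduce 768 -> 2 dimensions for plotting; perplexity must stay
# below the number of samples.
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
points_2d = tsne.fit_transform(embeddings)
print(points_2d.shape)
```

The resulting `points_2d` array can be scattered with per-speaker colors and per-style markers to reproduce the kind of plot discussed here.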

Results:

  • Strong speaker separation: Model successfully clusters speakers into distinct, well-separated groups, demonstrating robust speaker identity encoding
  • Emotion-style substructure: Within each speaker cluster, emotional and prosodic variations form visible subclusters, particularly for "whisper", "singing", and "default" styles
  • Cross-speaker consistency: Similar emotions (e.g., whisper) show consistent positioning across different speaker clusters, suggesting the embeddings encode both speaker identity and speaking style
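One way to back the visual impression of "strong speaker separation" with a number (not part of the original experiment, just a suggested check) is a silhouette score computed on the raw 768-dim embeddings with speaker labels as clusters; again synthetic data stands in for the real embeddings:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Synthetic stand-in: 4 speakers with distinct means in 768-dim space.
rng = np.random.default_rng(0)
embeddings = np.concatenate(
    [rng.normal(loc=i * 5.0, scale=1.0, size=(10, 768)) for i in range(4)]
)
speaker_ids = np.repeat(["ex01", "ex02", "ex03", "ex04"], 10)

# Silhouette is in [-1, 1]; values near 1 mean tight, well-separated
# clusters, complementing the qualitative t-SNE picture.
score = silhouette_score(embeddings, speaker_ids)
print(f"speaker silhouette: {score:.3f}")
```

Scoring in the original embedding space avoids the distortions t-SNE can introduce, so it is a more trustworthy separation measure than distances in the 2-D plot.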

Implications for TTS Voice Cloning:

  • Embeddings capture rich acoustic information beyond pure speaker identity
  • May require style normalization or multi-reference averaging for consistent voice cloning
  • Consider disentangling speaker identity from prosody/emotion for production use
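The multi-reference averaging idea mentioned above could look like the following sketch; the shapes and the unit-normalization step are assumptions about the embedding space, not confirmed details of the model:

```python
import numpy as np

# Hypothetical multi-reference averaging: pool several reference clips
# from one speaker so per-utterance style/emotion variation washes out,
# leaving a more stable speaker-identity vector.
rng = np.random.default_rng(0)
reference_embeddings = rng.normal(size=(5, 768))  # 5 clips, 768-dim each

centroid = reference_embeddings.mean(axis=0)
# Unit-normalize, a common convention when embeddings are compared
# with cosine similarity.
centroid /= np.linalg.norm(centroid)
print(centroid.shape)
```

If the embeddings really do encode style alongside identity, averaging references recorded in different styles should pull the centroid toward the identity component, which is exactly what consistent voice cloning needs.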
