Minimal Example for English Text-to-Speech with VITS: Female and Male Voices

Community Article Published February 2, 2026

VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) is a popular neural TTS architecture that combines acoustic modeling and vocoding in a single end-to-end model. It produces natural-sounding speech through a relatively simple API.

In this post, we show a minimal example of generating:

  • An English female voice (single-speaker)
  • An English male voice (multi-speaker, speaker-selectable)

using open VITS models via the Coqui TTS library.


What this example demonstrates

  • How to load pretrained VITS models through the Coqui TTS API
  • How single-speaker and multi-speaker TTS models differ
  • How to select a specific speaker ID for multi-speaker synthesis
  • How to generate WAV files in just a few lines of Python

Install dependencies

pip install TTS torch torchaudio

For best performance, a GPU is recommended but not required.
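Before running the examples, you can sanity-check that the packages from the pip command above are importable. This is a small stdlib-only sketch; the package names are the import names of the three dependencies:

```python
import importlib.util

def missing_deps(packages=("TTS", "torch", "torchaudio")):
    """Return the subset of required packages that cannot be imported."""
    return [p for p in packages if importlib.util.find_spec(p) is None]

missing = missing_deps()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All dependencies installed.")
```

If anything is reported missing, rerun the pip install command before continuing.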


Example 1: English female voice (single-speaker VITS)

This model is trained on LJSpeech, a single-speaker English dataset. Because it’s single-speaker, no speaker ID is required.

from TTS.api import TTS

tts_female = TTS("tts_models/en/ljspeech/vits")

text = "Hello, I am an English female voice powered by VITS."

tts_female.tts_to_file(
    text=text,
    file_path="female.wav"
)

Output:

female.wav
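To double-check the result, you can inspect the generated file with Python's built-in wave module. This is a stdlib-only sketch; the LJSpeech VITS model typically produces 22.05 kHz mono audio, and the file name matches the example above:

```python
import os
import wave

def wav_info(path):
    """Return (sample_rate, n_samples, duration_seconds) for a WAV file."""
    with wave.open(path, "rb") as f:
        rate = f.getframerate()
        n = f.getnframes()
    return rate, n, n / rate

if os.path.exists("female.wav"):
    rate, n, dur = wav_info("female.wav")
    print(f"{rate} Hz, {n} samples, {dur:.2f} s")
```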

Example 2: English male voice (multi-speaker VITS)

This model is trained on VCTK, a multi-speaker English dataset. You must specify a speaker ID to choose the voice.

from TTS.api import TTS

tts_male = TTS("tts_models/en/vctk/vits")

# Example speaker ID
speaker = "p252"

text = "Hello, I am an English male voice powered by VITS just for testing."

tts_male.tts_to_file(
    text=text,
    speaker=speaker,
    file_path="male.wav"
)

Output:

male.wav
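If you want samples from several speakers, a simple loop works. The output_path helper below is my own illustration (not part of the Coqui API), and the commented-out loop assumes the tts_male object and text variable from the example above; the speaker IDs are examples:

```python
import os

def output_path(speaker, outdir="samples"):
    """Build a per-speaker output filename, e.g. samples/p252.wav."""
    os.makedirs(outdir, exist_ok=True)
    return os.path.join(outdir, f"{speaker}.wav")

# With tts_male from the example above (downloads the model on first use):
# for spk in ["p252", "p270", "p273"]:
#     tts_male.tts_to_file(text=text, speaker=spk, file_path=output_path(spk))
```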

About speaker IDs (VCTK)

The VCTK model includes many speakers, typically labeled:

p225, p226, p227, ... up to p376 (with some gaps)

In practice:

  • Many male voices fall roughly in the range p225–p272
  • Female voices appear throughout the remaining IDs

Exact gender or voice characteristics vary by speaker — experimentation is encouraged.

You can list available speakers programmatically:

print(tts_male.speakers)
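Since the speakers attribute is just a list of ID strings, you can filter it. Here is a small helper (my own sketch, not a library function) that selects IDs in a numeric range, for example to scan through the p225–p272 band mentioned above:

```python
def speakers_in_range(speakers, start="p225", end="p272"):
    """Filter VCTK-style speaker IDs (e.g. 'p252') within a numeric range."""
    lo, hi = int(start[1:]), int(end[1:])
    return [
        s for s in speakers
        if s.startswith("p") and s[1:].isdigit() and lo <= int(s[1:]) <= hi
    ]

# e.g. speakers_in_range(tts_male.speakers) with the model loaded above
```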

When to use each model

Use case             Recommended model
Simple English TTS   en/ljspeech/vits
Voice variation      en/vctk/vits
Speaker selection    en/vctk/vits
Prototyping          Both

Practical tips

  • Single-speaker models are simpler and more consistent

  • Multi-speaker models give flexibility but require speaker selection

  • Audio quality improves with:

    • Shorter sentences
    • Proper punctuation
    • GPU inference (when available)
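The "shorter sentences" tip can be automated with a naive splitter before synthesis. This is a stdlib-only sketch of my own; synthesize each returned chunk separately (e.g. one tts_to_file call per chunk) and concatenate the resulting WAVs:

```python
import re

def split_sentences(text, max_len=200):
    """Naively split text on ., !, ? boundaries and cap chunk length."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    for p in parts:
        # Hard-wrap any single sentence that is still too long.
        while len(p) > max_len:
            chunks.append(p[:max_len])
            p = p[max_len:]
        if p:
            chunks.append(p)
    return chunks
```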

Limitations

  • VITS models are not optimized for real-time synthesis
  • Voice emotion and style control are limited
  • Speaker identity is fixed by the dataset (no voice cloning)

For voice cloning or multilingual output, consider XTTS v2 instead.


Final thoughts

VITS remains a strong baseline for high-quality, open-source English TTS. With just a few lines of code, you can generate natural-sounding speech and explore different speaker identities.

If you’re building or learning TTS systems, this is an excellent place to start.


Are you using single-speaker TTS for consistency, or multi-speaker models for flexibility?
