# Minimal Example for English Text-to-Speech with VITS: Female and Male Voices
In this post, we show a minimal example of generating:
- An English female voice (single-speaker)
- An English male voice (multi-speaker, speaker-selectable)
using open VITS models via the Coqui TTS library.
## What this example demonstrates
- How to load pretrained VITS models from Hugging Face
- How single-speaker and multi-speaker TTS models differ
- How to select a specific speaker ID for multi-speaker synthesis
- How to generate WAV files in just a few lines of Python
## Install dependencies

```bash
pip install TTS torch torchaudio
```
For best performance, a GPU is recommended but not required.
## Example 1: English female voice (single-speaker VITS)
This model is trained on LJSpeech, a single-speaker English dataset. Because it’s single-speaker, no speaker ID is required.
```python
from TTS.api import TTS

# Single-speaker model trained on LJSpeech; no speaker ID needed
tts_female = TTS("tts_models/en/ljspeech/vits")

text = "Hello, I am an English female voice powered by VITS."
tts_female.tts_to_file(
    text=text,
    file_path="female.wav",
)
```
Output: `female.wav`
## Example 2: English male voice (multi-speaker VITS)
This model is trained on VCTK, a multi-speaker English dataset. You must specify a speaker ID to choose the voice.
```python
from TTS.api import TTS

# Multi-speaker model trained on VCTK; a speaker ID selects the voice
tts_male = TTS("tts_models/en/vctk/vits")

speaker = "p252"  # example speaker ID
text = "Hello, I am an English male voice powered by VITS just for testing."
tts_male.tts_to_file(
    text=text,
    speaker=speaker,
    file_path="male.wav",
)
```
Output: `male.wav`
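The two examples differ only in whether a `speaker` argument is passed. If you switch between single- and multi-speaker models often, a small wrapper can unify the call. This is a minimal sketch; `synthesize_to_file` is a hypothetical helper, not part of the Coqui TTS API:

```python
from typing import Optional


def synthesize_to_file(tts, text: str, file_path: str,
                       speaker: Optional[str] = None) -> str:
    """Call a Coqui-style TTS model, passing `speaker` only when given.

    `tts` is any object exposing Coqui's `tts_to_file(**kwargs)` interface.
    Returns the output path for convenience.
    """
    kwargs = {"text": text, "file_path": file_path}
    if speaker is not None:
        kwargs["speaker"] = speaker  # only multi-speaker models need this
    tts.tts_to_file(**kwargs)
    return file_path
```

With this helper, `synthesize_to_file(tts_female, text, "female.wav")` and `synthesize_to_file(tts_male, text, "male.wav", speaker="p252")` both work unchanged.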
## About speaker IDs (VCTK)
The VCTK model includes many speakers, typically labeled:
p225, p226, p227, ..., p376 (with some IDs missing)
In practice:
- Speaker IDs do not encode gender; male and female voices are mixed throughout the ID range
- Voice characteristics vary widely by speaker, so listening to a few candidates is the most reliable way to choose
You can list available speakers programmatically:

```python
print(tts_male.speakers)
```
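To audition several candidates, a loop like the following can help. The synthesis call is passed in as a parameter so the helper works with any model; `audition_speakers` is a hypothetical name, not a library function:

```python
def audition_speakers(speakers, text, synth_fn, out_dir="."):
    """Generate one sample file per speaker ID and return the file paths.

    synth_fn(text, speaker, file_path) performs the actual synthesis, e.g.
    lambda t, s, p: tts_male.tts_to_file(text=t, speaker=s, file_path=p)
    """
    paths = []
    for speaker in speakers:
        path = f"{out_dir}/sample_{speaker}.wav"
        synth_fn(text, speaker, path)
        paths.append(path)
    return paths
```

For example, `audition_speakers(["p225", "p252"], "Testing.", lambda t, s, p: tts_male.tts_to_file(text=t, speaker=s, file_path=p))` writes one sample per speaker so you can listen and compare.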
## When to use each model

| Use case | Recommended model |
|---|---|
| Simple English TTS | `en/ljspeech/vits` |
| Voice variation | `en/vctk/vits` |
| Speaker selection | `en/vctk/vits` |
| Prototyping | Both |
## Practical tips

- Single-speaker models are simpler and more consistent
- Multi-speaker models give flexibility but require speaker selection
- Audio quality improves with:
  - Shorter sentences
  - Proper punctuation
  - GPU inference (when available)
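Since shorter sentences tend to synthesize more cleanly, long input can be split before synthesis. Here is a minimal sketch using only the standard library; the splitting rule is a naive assumption, and a proper sentence tokenizer would handle abbreviations and edge cases better:

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naively split text after sentence-ending punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]
```

Each chunk can then be fed to `tts_to_file` separately, producing one short clip per sentence.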
## Limitations
- VITS models are not optimized for real-time synthesis
- Voice emotion and style control are limited
- Speaker identity is fixed by the dataset (no voice cloning)
For voice cloning or multilingual output, consider XTTS v2 instead.
## Final thoughts
VITS remains a strong baseline for high-quality, open-source English TTS. With just a few lines of code, you can generate natural-sounding speech and explore different speaker identities.
If you’re building or learning TTS systems, this is an excellent place to start.
Are you using single-speaker TTS for consistency, or multi-speaker models for flexibility?