# Minimal Example for English Text-to-Speech with VITS: Female and Male Voices
In this post, we show a minimal example of generating:
- An English female voice (single-speaker)
- An English male voice (multi-speaker, speaker-selectable)
using open VITS models via the Coqui TTS library.
## What this example demonstrates
- How to load pretrained VITS models from Hugging Face
- How single-speaker and multi-speaker TTS models differ
- How to select a specific speaker ID for multi-speaker synthesis
- How to generate WAV files in just a few lines of Python
## Install dependencies

```bash
pip install TTS torch torchaudio
```
For best performance, a GPU is recommended but not required.
## Example 1: English female voice (single-speaker VITS)
This model is trained on LJSpeech, a single-speaker English dataset. Because it’s single-speaker, no speaker ID is required.
```python
from TTS.api import TTS

# Single-speaker model trained on LJSpeech; no speaker ID needed
tts_female = TTS("tts_models/en/ljspeech/vits")

text = "Hello, I am an English female voice powered by VITS."
tts_female.tts_to_file(
    text=text,
    file_path="female.wav",
)
```
Output: `female.wav`
## Example 2: English male voice (multi-speaker VITS)
This model is trained on VCTK, a multi-speaker English dataset. You must specify a speaker ID to choose the voice.
```python
from TTS.api import TTS

# Multi-speaker model trained on VCTK; a speaker ID selects the voice
tts_male = TTS("tts_models/en/vctk/vits")

speaker = "p252"  # example speaker ID
text = "Hello, I am an English male voice powered by VITS just for testing."
tts_male.tts_to_file(
    text=text,
    speaker=speaker,
    file_path="male.wav",
)
```
Output: `male.wav`
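The two examples differ only in whether a `speaker` argument is passed. If you switch between single- and multi-speaker models often, a small wrapper can unify the call. This is a minimal sketch; `synthesize_to_file` is a hypothetical helper, not part of the Coqui TTS API:

```python
from typing import Optional


def synthesize_to_file(tts, text: str, file_path: str,
                       speaker: Optional[str] = None) -> str:
    """Call a Coqui-style TTS model, passing `speaker` only when given.

    `tts` is any object exposing Coqui's `tts_to_file(**kwargs)` interface.
    Returns the output path for convenience.
    """
    kwargs = {"text": text, "file_path": file_path}
    if speaker is not None:
        kwargs["speaker"] = speaker  # only multi-speaker models need this
    tts.tts_to_file(**kwargs)
    return file_path
```

With this helper, `synthesize_to_file(tts_female, text, "female.wav")` and `synthesize_to_file(tts_male, text, "male.wav", speaker="p252")` both work unchanged.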
## About speaker IDs (VCTK)
The VCTK model includes many speakers, typically labeled:
p225, p226, p227, ..., p376 (with some IDs missing)
In practice:
- Speaker IDs do not encode gender; male and female voices are mixed throughout the ID range
- Voice characteristics vary widely by speaker, so listening to a few candidates is the most reliable way to choose
You can list available speakers programmatically:

```python
print(tts_male.speakers)
```
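To audition several candidates, a loop like the following can help. The synthesis call is passed in as a parameter so the helper works with any model; `audition_speakers` is a hypothetical name, not a library function:

```python
def audition_speakers(speakers, text, synth_fn, out_dir="."):
    """Generate one sample file per speaker ID and return the file paths.

    synth_fn(text, speaker, file_path) performs the actual synthesis, e.g.
    lambda t, s, p: tts_male.tts_to_file(text=t, speaker=s, file_path=p)
    """
    paths = []
    for speaker in speakers:
        path = f"{out_dir}/sample_{speaker}.wav"
        synth_fn(text, speaker, path)
        paths.append(path)
    return paths
```

For example, `audition_speakers(["p225", "p252"], "Testing.", lambda t, s, p: tts_male.tts_to_file(text=t, speaker=s, file_path=p))` writes one sample per speaker so you can listen and compare.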
## When to use each model

| Use case | Recommended model |
|---|---|
| Simple English TTS | `en/ljspeech/vits` |
| Voice variation | `en/vctk/vits` |
| Speaker selection | `en/vctk/vits` |
| Prototyping | Both |
## Practical tips

- Single-speaker models are simpler and more consistent
- Multi-speaker models give flexibility but require speaker selection
- Audio quality improves with:
  - Shorter sentences
  - Proper punctuation
  - GPU inference (when available)
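Since shorter sentences tend to synthesize more cleanly, long input can be split before synthesis. Here is a minimal sketch using only the standard library; the splitting rule is a naive assumption, and a proper sentence tokenizer would handle abbreviations and edge cases better:

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naively split text after sentence-ending punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]
```

Each chunk can then be fed to `tts_to_file` separately, producing one short clip per sentence.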
## Limitations
- VITS models are not optimized for real-time synthesis
- Voice emotion and style control are limited
- Speaker identity is fixed by the dataset (no voice cloning)
For voice cloning or multilingual output, consider XTTS v2 instead.
## Final thoughts
VITS remains a strong baseline for high-quality, open-source English TTS. With just a few lines of code, you can generate natural-sounding speech and explore different speaker identities.
If you’re building or learning TTS systems, this is an excellent place to start.
Are you using single-speaker TTS for consistency, or multi-speaker models for flexibility?