Vietnamese Parler-TTS Voice Design

Vietnamese instruction-guided text-to-speech built on top of Parler-TTS.

This model generates Vietnamese speech from:

text: the content to speak
description: a natural-language voice instruction describing accent, gender, age, pitch, speed, loudness, or speaking style

Demo Idea

Example voice descriptions:

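Giọng nữ trẻ miền Bắc, nói chậm rãi, giọng cao (young Northern female voice, speaking slowly, high pitch)
Giọng nam trưởng thành miền Nam, nói nhanh và rất to (adult Southern male voice, speaking fast and very loudly)
Giọng nữ miền Trung, nói chậm rãi với âm lượng nhỏ, giọng trầm (Central female voice, speaking slowly at low volume, low pitch)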

Model Details

Model name: thangquang09/parler-tts-vietnamese-v1-stage2
Architecture: Parler-TTS
Language: Vietnamese
Use case: instruction TTS / voice design
Model type: controllable text-to-speech
Base model: parler-tts/parler-tts-mini-v1
Adaptation: Vietnamese fine-tuning with Vietnamese voice descriptions

How to Use

Python

import torch
import soundfile as sf
from transformers import AutoTokenizer
from parler_tts import ParlerTTSForConditionalGeneration

repo_id = "thangquang09/parler-tts-vietnamese-v1-stage2"
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the model and its tokenizer from the Hugging Face Hub.
model = ParlerTTSForConditionalGeneration.from_pretrained(repo_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(repo_id, use_fast=False)

# text is what gets spoken; description controls how the voice sounds.
text = "Xin chào, hôm nay bạn có khoẻ không?"
description = "Giọng nữ trẻ miền Bắc, nói chậm rãi, giọng cao"

desc_tokens = tokenizer(description, return_tensors="pt").to(device)
text_tokens = tokenizer(text, return_tensors="pt").to(device)

with torch.no_grad():
    # input_ids carries the voice description; prompt_input_ids carries the speech text.
    generation = model.generate(
        input_ids=desc_tokens.input_ids,
        attention_mask=desc_tokens.attention_mask,
        prompt_input_ids=text_tokens.input_ids,
        prompt_attention_mask=text_tokens.attention_mask,
        do_sample=True,  # sampling is on, so repeated runs can produce different audio
        temperature=1.0,
    )

# Move the waveform to the CPU as a 1-D float array and save it at the model's sampling rate.
audio = generation.cpu().float().numpy().squeeze()
sf.write("output.wav", audio, model.config.sampling_rate)
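If soundfile is not available, the same waveform can be saved with Python's standard library. The sketch below is not part of this model's code: write_wav_pcm16 is a hypothetical helper that clips float samples to [-1.0, 1.0] and stores them as mono 16-bit PCM using the built-in wave module.

```python
import struct
import wave


def write_wav_pcm16(path, samples, sample_rate):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip to avoid integer overflow
        frames += struct.pack("<h", int(round(s * 32767)))  # little-endian int16
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)           # mono
        wav.setsampwidth(2)           # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))


# Usage with the variables from the example above (assumed to be in scope):
# write_wav_pcm16("output.wav", audio, model.config.sampling_rate)
```

The per-sample loop is slow for long clips, so soundfile remains the more practical choice when it can be installed.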

Recommended Inference Repo

For a cleaner inference workflow with a CLI and a Gradio app, use the GitHub repository:

GitHub: https://github.com/thangquang09/vietnamese-parlertts

That repository provides:

api.py for CLI and Python inference
app.py for Gradio web UI
simpler loading from Hugging Face or local checkpoints

Example CLI from the GitHub repo:

python api.py \
  --text "Xin chào, hôm nay bạn có khoẻ không?" \
  --description "Giọng nữ trẻ miền Bắc, nói chậm rãi, giọng cao" \
  --hf-repo thangquang09/parler-tts-vietnamese-v1-stage2 \
  --output output.wav
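To try several voice descriptions on the same sentence, the CLI call above can be scripted. The following is a minimal sketch that only assembles argument lists from the four flags shown in the example (--text, --description, --hf-repo, --output). The helper names and the output file naming scheme are hypothetical, and any other api.py options are not assumed.

```python
import sys

DEFAULT_REPO = "thangquang09/parler-tts-vietnamese-v1-stage2"


def build_api_command(text, description, output, hf_repo=DEFAULT_REPO):
    """Return the argv list for one api.py call, using only the flags shown above."""
    return [
        sys.executable, "api.py",
        "--text", text,
        "--description", description,
        "--hf-repo", hf_repo,
        "--output", output,
    ]


def commands_for_descriptions(text, descriptions, prefix="voice"):
    """Build one command per description, writing to prefix_0.wav, prefix_1.wav, ..."""
    return [
        build_api_command(text, desc, f"{prefix}_{i}.wav")
        for i, desc in enumerate(descriptions)
    ]


# Usage (run from a checkout of the GitHub repo so that api.py is found):
# import subprocess
# for cmd in commands_for_descriptions("Xin chào", ["Giọng nữ trẻ miền Bắc", "Giọng nam miền Nam"]):
#     subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with Vietnamese text that contains commas or spaces.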

Input Format

This model expects two text inputs:

Speech text: the Vietnamese sentence to synthesize
Voice description: the natural-language instruction describing how the voice should sound

Good descriptions usually mention some of:

gender
age
accent or region
speaking rate
pitch
loudness
expressiveness or emotion
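The attributes above can be combined programmatically into a description string. The sketch below is illustrative only: the phrase tables are assembled from the example descriptions in this card (with "nói rất to" adapted from "nói nhanh và rất to"), they are not an official vocabulary of the model, and build_description is a hypothetical helper.

```python
# Hypothetical phrase tables based on the example descriptions in this card.
GENDER = {"female": "nữ", "male": "nam"}
AGE = {"young": "trẻ", "adult": "trưởng thành"}
REGION = {"north": "miền Bắc", "central": "miền Trung", "south": "miền Nam"}
RATE = {"slow": "nói chậm rãi", "fast": "nói nhanh"}
PITCH = {"high": "giọng cao", "low": "giọng trầm"}
LOUDNESS = {"quiet": "âm lượng nhỏ", "loud": "nói rất to"}


def build_description(gender, region, age=None, rate=None, pitch=None, loudness=None):
    """Compose a comma-separated Vietnamese voice description."""
    # The first clause names the speaker: "Giọng <gender> [<age>] <region>".
    head = ["Giọng", GENDER[gender]]
    if age is not None:
        head.append(AGE[age])
    head.append(REGION[region])
    parts = [" ".join(head)]
    # Optional clauses follow in a fixed order: rate, loudness, pitch.
    for table, key in ((RATE, rate), (LOUDNESS, loudness), (PITCH, pitch)):
        if key is not None:
            parts.append(table[key])
    return ", ".join(parts)


description = build_description("female", "north", age="young", rate="slow", pitch="high")
```

With these arguments the helper reproduces the first example description in this card. Free-form descriptions remain valid input; the tables only make it easier to sweep attribute combinations.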

Intended Use

This model is intended for:

Vietnamese TTS research
controllable speech generation
voice style prompting
demo systems and rapid prototyping

Limitations

Voice control is prompt-based, so how closely the output follows an instruction may vary with the wording of the description
Some accents, ages, or speaking styles may be reproduced more reliably than others
Output quality may differ with text length and prompting style
This is not a voice cloning model
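Because quality may vary with text length, one possible workaround is to split long input into sentence-sized chunks and synthesize each chunk separately. This is an assumption rather than a documented recommendation for this model: the split_into_chunks helper and the max_chars default are hypothetical, and this card states no length limit.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?…])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the current chunk and start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


chunks = split_into_chunks("Xin chào. Bạn khoẻ không? Tôi khoẻ.", max_chars=20)
```

Each chunk can then be passed as the speech text with the same voice description. Since generation uses sampling, the voice may not be perfectly consistent across chunks.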

Training Notes

This is a Vietnamese adaptation of Parler-TTS released for inference and downstream use.

If you want the cleaned inference-first codebase, scripts, and app, please use the GitHub repo: https://github.com/thangquang09/vietnamese-parlertts

Citation

If you use this model, please cite the original Parler-TTS work:

@misc{lacombe-etal-2024-parler-tts,
  author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
  title = {Parler-TTS},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huggingface/parler-tts}}
}

@misc{lyth2024natural,
  title={Natural language guidance of high-fidelity text-to-speech with synthetic annotations},
  author={Dan Lyth and Simon King},
  year={2024},
  eprint={2402.01912},
  archivePrefix={arXiv},
  primaryClass={cs.SD}
}

Acknowledgements

Parler-TTS
Hugging Face
Vietnamese adaptation and inference packaging by thangquang09
