Vietnamese Parler-TTS Voice Design

Vietnamese instruction-guided text-to-speech (TTS) built on Parler-TTS.

This model generates Vietnamese speech from:

  • text: the content to speak
  • description: a natural-language voice instruction describing accent, gender, age, pitch, speed, loudness, or speaking style

Demo Idea

Example voice descriptions:

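  • Giọng nữ trẻ miền Bắc, nói chậm rãi, giọng cao (a young Northern female voice, speaking slowly, high-pitched)
  • Giọng nam trưởng thành miền Nam, nói nhanh và rất to (an adult Southern male voice, speaking fast and very loudly)
  • Giọng nữ miền Trung, nói chậm rãi với âm lượng nhỏ, giọng trầm (a Central-region female voice, speaking slowly at low volume, low-pitched)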

Model Details

  • Model name: thangquang09/parler-tts-vietnamese-v1-stage2
  • Architecture: Parler-TTS
  • Language: Vietnamese
  • Use case: instruction TTS / voice design
  • Model type: controllable text-to-speech
  • Base family: Parler-TTS
  • Adaptation: Vietnamese fine-tuning with Vietnamese voice descriptions
  • Model size: 0.9B params (F32, Safetensors)

How to Use

Python

import torch
import soundfile as sf
from transformers import AutoTokenizer
from parler_tts import ParlerTTSForConditionalGeneration

repo_id = "thangquang09/parler-tts-vietnamese-v1-stage2"
# Use the first GPU when available; otherwise fall back to CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the model weights and the tokenizer from the Hugging Face Hub.
model = ParlerTTSForConditionalGeneration.from_pretrained(repo_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(repo_id, use_fast=False)

text = "Xin chào, hôm nay bạn có khoẻ không?"  # content to speak
description = "Giọng nữ trẻ miền Bắc, nói chậm rãi, giọng cao"  # voice instruction

# The same tokenizer encodes both the voice description and the text to speak.
desc_tokens = tokenizer(description, return_tensors="pt").to(device)
text_tokens = tokenizer(text, return_tensors="pt").to(device)

# The description conditions the voice (input_ids); the text is the prompt to be spoken.
with torch.no_grad():
    generation = model.generate(
        input_ids=desc_tokens.input_ids,
        attention_mask=desc_tokens.attention_mask,
        prompt_input_ids=text_tokens.input_ids,
        prompt_attention_mask=text_tokens.attention_mask,
        do_sample=True,  # sampling is on, so repeated runs can produce different audio
        temperature=1.0,
    )

# Convert the generated waveform to a 1-D float array and save it as a WAV file.
audio = generation.cpu().float().numpy().squeeze()
sf.write("output.wav", audio, model.config.sampling_rate)

Recommended Inference Repo

For a cleaner inference workflow with a CLI and a Gradio app, use the GitHub repository:

GitHub: https://github.com/thangquang09/vietnamese-parlertts

That repository provides:

  • api.py for CLI and Python inference
  • app.py for Gradio web UI
  • simpler loading from Hugging Face or local checkpoints

Example CLI from the GitHub repo:

python api.py \
  --text "Xin chào, hôm nay bạn có khoẻ không?" \
  --description "Giọng nữ trẻ miền Bắc, nói chậm rãi, giọng cao" \
  --hf-repo thangquang09/parler-tts-vietnamese-v1-stage2 \
  --output output.wav
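
To render the same sentence with several voice descriptions, the CLI call above can be driven from Python. The following is a minimal sketch, assuming it is run from the root of the cloned GitHub repo so that api.py is on the current path. It uses only the four flags shown in the example; build_command, synthesize_all, and the output_N.wav naming are hypothetical helpers introduced for illustration, not part of the repository.

```python
import subprocess
import sys
from typing import Callable, List, Sequence

HF_REPO = "thangquang09/parler-tts-vietnamese-v1-stage2"


def build_command(text: str, description: str, output: str,
                  hf_repo: str = HF_REPO) -> List[str]:
    """Build the argv list for one api.py call (flags taken from the CLI example)."""
    return [
        sys.executable, "api.py",
        "--text", text,
        "--description", description,
        "--hf-repo", hf_repo,
        "--output", output,
    ]


def synthesize_all(text: str, descriptions: Sequence[str],
                   run: Callable = subprocess.run) -> List[str]:
    """Call api.py once per description and return the output file names.

    The output_<index>.wav naming is an assumption made for this sketch.
    """
    outputs = []
    for index, description in enumerate(descriptions):
        output = f"output_{index}.wav"
        # check=True raises CalledProcessError if api.py exits with an error.
        run(build_command(text, description, output), check=True)
        outputs.append(output)
    return outputs
```

Passing the arguments as a list avoids shell quoting problems with Vietnamese text and commas. Note that each call starts a new process, so the model is presumably reloaded every time; for many files, in-process inference as in the Python example is likely faster.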

Input Format

This model expects two text inputs:

  • Speech text: the Vietnamese sentence to synthesize
  • Voice description: the natural-language instruction describing how the voice should sound

Good descriptions usually mention some of:

  • gender
  • age
  • accent or region
  • speaking rate
  • pitch
  • loudness
  • expressiveness or emotion
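
The attributes above can be combined into a description string programmatically. Below is a minimal sketch that reuses only the Vietnamese phrases from the example descriptions in this card. VoiceSpec and build_description are hypothetical names introduced here, and the word order simply mirrors those examples; whether other orderings or vocabulary work equally well is not stated in this card.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoiceSpec:
    """Hypothetical container for voice attributes (not part of Parler-TTS)."""
    gender: str                   # "nữ" (female) or "nam" (male)
    region: str                   # "miền Bắc", "miền Trung", or "miền Nam"
    age: Optional[str] = None     # e.g. "trẻ" (young), "trưởng thành" (adult)
    rate: Optional[str] = None    # e.g. "nói chậm rãi" (slowly), "nói nhanh" (fast)
    pitch: Optional[str] = None   # e.g. "giọng cao" (high), "giọng trầm" (low)


def build_description(spec: VoiceSpec) -> str:
    """Join the provided attributes in the order used by the example descriptions."""
    # Leading phrase: "Giọng <gender> [<age>] <region>"; missing parts are skipped.
    head = " ".join(p for p in ("Giọng", spec.gender, spec.age, spec.region) if p)
    # Optional trailing phrases, separated by commas.
    tail = [p for p in (spec.rate, spec.pitch) if p]
    return ", ".join([head, *tail])


spec = VoiceSpec(gender="nữ", age="trẻ", region="miền Bắc",
                 rate="nói chậm rãi", pitch="giọng cao")
print(build_description(spec))
# Giọng nữ trẻ miền Bắc, nói chậm rãi, giọng cao
```

The resulting string can be passed as the description input in place of a hand-written one.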

Intended Use

This model is intended for:

  • Vietnamese TTS research
  • controllable speech generation
  • voice style prompting
  • demo systems and rapid prototyping

Limitations

  • Voice control is prompt-based, so how closely the output follows an instruction depends on how the description is worded
  • Some accents, ages, or speaking styles may be rendered more reliably than others
  • Output quality may vary with text length and prompting style
  • This is not a voice cloning model
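
Because results may vary with text length, one common workaround is to split long input into sentence-sized chunks, synthesize each chunk with the same description, and concatenate the audio. This card does not prescribe that approach; the sketch below is only an illustration, and split_into_chunks and the 200-character default are assumptions chosen for the example.

```python
import re
from typing import List


def split_into_chunks(text: str, max_chars: int = 200) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut mid-sentence.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?…])\s+", text) if s.strip()]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed as the text input in the Python example, and the resulting arrays joined with numpy.concatenate. Since generation uses sampling, the voice may differ slightly between chunks.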

Training Notes

This is a Vietnamese adaptation of Parler-TTS released for inference and downstream use.

For the cleaned-up, inference-first codebase, scripts, and app, use the GitHub repo: https://github.com/thangquang09/vietnamese-parlertts

Citation

If you use this model, please cite the original Parler-TTS work:

@misc{lacombe-etal-2024-parler-tts,
  author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
  title = {Parler-TTS},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huggingface/parler-tts}}
}
@misc{lyth2024natural,
  title={Natural language guidance of high-fidelity text-to-speech with synthetic annotations},
  author={Dan Lyth and Simon King},
  year={2024},
  eprint={2402.01912},
  archivePrefix={arXiv},
  primaryClass={cs.SD}
}
