CapSpeech NAR Vietnamese Stage 2

Vietnamese instruction-guided text-to-speech model for controllable voice generation.

This repository contains the Stage 2 checkpoint of my Vietnamese CapSpeech NAR pipeline, fine-tuned on a curated instruction-TTS mixture for:

emotion control
accent control
age-group control
preservation of general speech quality (via replay data)

Model repo:

thangquang09/capspeech-nar-vietnamese-stage2

Project GitHub:

https://github.com/thangquang09/vietnamese-f5tts-voice-design

Base model:

thangquang09/capspeech-nar-vietnamese

Upstream research/codebase:

https://github.com/WangHelin1997/CapSpeech

What This Model Does

The model takes:

text: the Vietnamese sentence to synthesize
caption: a natural-language description of the desired voice

and generates speech that follows the requested speaking style.

Example captions:

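Giọng nữ trẻ, vui vẻ, nhịp nói nhanh (young female voice, cheerful, fast speaking pace)
Giọng nam trung niên miền Bắc, trầm, rõ chữ (middle-aged Northern male voice, deep, clearly articulated)
Giọng nữ cao tuổi miền Nam, chậm rãi, ấm áp (elderly Southern female voice, slow, warm)
Giọng thiếu niên, hào hứng, năng lượng cao (teenage voice, excited, high energy)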
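
The example captions follow a simple pattern: a speaker description followed by comma-separated style traits. A minimal sketch of composing such a caption programmatically is shown below. The helper `build_caption` is hypothetical and not part of the project code; it only joins strings and does not validate that the model was trained on a given trait.

```python
# Hypothetical helper (not part of the project API): composes a caption in the
# comma-separated style used by the example captions.
def build_caption(speaker: str, *traits: str) -> str:
    """Join a speaker description and style traits into one caption string."""
    parts = [speaker.strip()]
    # Skip empty traits so optional attributes can be passed as "".
    parts += [t.strip() for t in traits if t.strip()]
    return ", ".join(parts)


caption = build_caption("Giọng nữ trẻ", "vui vẻ", "nhịp nói nhanh")
print(caption)
```

The resulting string can be passed as the `caption` argument alongside the `text` to synthesize.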

Main Differences From Stage 1

Compared with the base Vietnamese Stage 1 model, this Stage 2 release is designed to improve controllability for:

emotion
regional accent
age-related speaking characteristics

while still keeping general Vietnamese TTS quality through replay data.

Model Overview

Architecture: CrossDiT-based non-autoregressive TTS
Language: Vietnamese
Vocoder: BigVGAN v2 24kHz 100-band
Caption encoder: ViT5-large
Text representation: Vietnamese character-level vocabulary
Base checkpoint: thangquang09/capspeech-nar-vietnamese
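
To illustrate what a character-level text representation means in practice, here is a rough sketch of mapping a Vietnamese string to integer ids. It assumes that the vocabulary file lists one symbol per line, that a symbol's id is its line index, and that unknown characters map to id 0; none of these details are confirmed by this model card, so treat the project's own tokenizer as the reference. The NFC normalization step is also an assumption, added because Vietnamese diacritics can be encoded in composed or decomposed form.

```python
import unicodedata


def load_vocab(lines):
    """Map each symbol to its line index (assumed format: one symbol per line)."""
    return {line.rstrip("\n"): i for i, line in enumerate(lines)}


def encode(text, vocab, unk_id=0):
    """Convert text to character ids after NFC normalization (assumed step)."""
    text = unicodedata.normalize("NFC", text)
    return [vocab.get(ch, unk_id) for ch in text]


# Toy vocabulary for illustration only; the real vocab.txt is larger.
toy_vocab = load_vocab([" \n", "a\n", "\u00e0\n", "c\n", "h\n", "o\n"])

# "chào" written with a decomposed grave accent (a + U+0300).
ids = encode("cha\u0300o", toy_vocab)
print(ids)  # [3, 4, 2, 5]
```

Without normalization, the decomposed "a" plus combining accent would be looked up as two separate characters and would not match the single composed entry in the vocabulary.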

Intended Use

This model is intended for:

Vietnamese controllable TTS research
voice design experiments with text instructions
prototyping demos with controllable emotion/accent/age prompts

This model is not packaged as a standard transformers checkpoint. The recommended way to use it is through the project inference code from GitHub.

Quick Start

1. Clone the GitHub repository

git clone https://github.com/thangquang09/vietnamese-f5tts-voice-design.git
cd vietnamese-f5tts-voice-design
pip install -r requirements.txt

2. Run the Python API

from api import InstructVoiceAPI

tts = InstructVoiceAPI(
    device="cuda:0",
    hf_model_repo="thangquang09/capspeech-nar-vietnamese-stage2",
)

tts.synthesize(
    text="Xin chào, rất vui được gặp bạn.",
    caption="Giọng nữ trẻ, nhẹ nhàng, vui vẻ",
    output_path="output.wav",
)
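
To compare how different captions affect the same sentence, the call above can be wrapped in a small loop. The sketch below relies only on the `synthesize(text=..., caption=..., output_path=...)` call shown above; the function name `synthesize_variants` and the output file naming are hypothetical choices for illustration.

```python
from pathlib import Path


def synthesize_variants(tts, text, captions, out_dir="outputs"):
    """Render `text` once per caption and return the output file paths.

    `tts` is any object exposing synthesize(text=..., caption=..., output_path=...),
    such as the InstructVoiceAPI instance created above.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, caption in enumerate(captions):
        # Hypothetical naming scheme: variant_00.wav, variant_01.wav, ...
        path = out / f"variant_{i:02d}.wav"
        tts.synthesize(text=text, caption=caption, output_path=str(path))
        paths.append(path)
    return paths
```

For example, `synthesize_variants(tts, "Xin chào, rất vui được gặp bạn.", ["Giọng nữ trẻ, vui vẻ, nhịp nói nhanh", "Giọng nam trung niên miền Bắc, trầm, rõ chữ"])` would request one file per caption under `outputs/`.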

3. Run the Gradio app

python app.py \
  --hf_model_repo thangquang09/capspeech-nar-vietnamese-stage2 \
  --device cuda:0 \
  --port 7860

Files In This Repository

checkpoint.pt: Stage 2 model weights
finetune_vn_stage2.yaml: training/inference config
vocab.txt: Vietnamese text vocabulary
duration_predictor/: duration prediction module
README.md: model card
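
If you fetch the repository manually instead of letting the project code do it, a quick layout check can catch incomplete downloads. The sketch below is an assumption-laden example: `missing_entries` and `download_and_check` are hypothetical helpers, the default directory name is arbitrary, and the download step uses `snapshot_download` from `huggingface_hub`, which requires network access.

```python
from pathlib import Path

# Entries listed in this model card; names ending with "/" are directories.
# README.md is left out because it is not needed for inference.
EXPECTED = [
    "checkpoint.pt",
    "finetune_vn_stage2.yaml",
    "vocab.txt",
    "duration_predictor/",
]


def missing_entries(local_dir):
    """Return the expected files or directories that are absent from local_dir."""
    root = Path(local_dir)
    missing = []
    for name in EXPECTED:
        path = root / name.rstrip("/")
        present = path.is_dir() if name.endswith("/") else path.is_file()
        if not present:
            missing.append(name)
    return missing


def download_and_check(local_dir="capspeech_stage2"):
    """Download a snapshot of the model repo, then verify its layout."""
    # Imported lazily so the layout check works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id="thangquang09/capspeech-nar-vietnamese-stage2",
        local_dir=local_dir,
    )
    return missing_entries(local_dir)
```

An empty list from `missing_entries` means every entry named above was found.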

Notes

This is a custom CapSpeech-based release, so users should rely on the GitHub inference code rather than generic Hugging Face auto-loading APIs.
If you build demos or benchmarks on top of this model, write your captions in the same style as the example captions above (a speaker description followed by comma-separated traits, in Vietnamese).

How To Read Download Counts

Hugging Face exposes repository-level download statistics.

In the UI

Open:

https://huggingface.co/thangquang09/capspeech-nar-vietnamese-stage2

and look for:

Downloads last month

In Python

from huggingface_hub import HfApi

api = HfApi()
info = api.model_info(
    "thangquang09/capspeech-nar-vietnamese-stage2",
    expand=["downloads", "downloadsAllTime"],
)

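print("Downloads last 30 days:", info.downloads)

# The all-time attribute name may differ between huggingface_hub versions.
# Compare against None so that a count of 0 is still reported.
all_time = getattr(info, "downloads_all_time", None)
if all_time is None:
    all_time = getattr(info, "downloadsAllTime", None)
print("Downloads all time:", all_time)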

Citation

If you use this model, please cite:

the original CapSpeech project: https://github.com/WangHelin1997/CapSpeech
this Vietnamese adaptation/release: https://github.com/thangquang09/vietnamese-f5tts-voice-design

Acknowledgements

This model is built on top of:

CapSpeech
BigVGAN
ViT5
the Vietnamese instruction-TTS data pipeline developed in this project
