CapSpeech NAR Vietnamese Stage 2
Vietnamese instruction-guided text-to-speech model for controllable voice generation.
This repository contains the Stage 2 checkpoint of my Vietnamese CapSpeech NAR pipeline, fine-tuned on a curated instruction-TTS mixture for:
- emotion control
- accent control
- age-group control
- replay/general speech preservation
Model repo:
thangquang09/capspeech-nar-vietnamese-stage2
Project GitHub:
https://github.com/thangquang09/vietnamese-f5tts-voice-design
Base model:
thangquang09/capspeech-nar-vietnamese
Upstream research/codebase:
https://github.com/WangHelin1997/CapSpeech
What This Model Does
The model takes:
- text: the Vietnamese sentence to synthesize
- caption: a natural-language description of the desired voice
and generates speech that follows the requested speaking style.
Example captions:
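- Giọng nữ trẻ, vui vẻ, nhịp nói nhanh (young female voice, cheerful, fast speaking pace)
- Giọng nam trung niên miền Bắc, trầm, rõ chữ (middle-aged Northern male voice, deep, clearly articulated)
- Giọng nữ cao tuổi miền Nam, chậm rãi, ấm áp (elderly Southern female voice, slow, warm)
- Giọng thiếu niên, hào hứng, năng lượng cao (teenage voice, excited, high energy)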
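The example captions seem to share a loose pattern: a voice type (gender and age), an optional region, then comma-separated style descriptors. The sketch below assembles captions that way. It assumes this pattern is what the model responds to best, which the examples suggest but do not confirm; `build_caption` and its parameters are hypothetical and not part of the project code.

```python
def build_caption(speaker, region=None, traits=()):
    """Join caption parts as 'Giọng <speaker> [miền <region>], <trait>, ...'.

    `speaker` is e.g. "nữ trẻ" or "nam trung niên"; `region` is e.g. "Bắc" or "Nam".
    The ordering is inferred from the example captions only (an assumption).
    """
    head = f"Giọng {speaker}"
    if region:
        head += f" miền {region}"
    # Drop empty descriptors so stray commas never reach the model.
    parts = [head] + [t.strip() for t in traits if t and t.strip()]
    return ", ".join(parts)


print(build_caption("nam trung niên", region="Bắc", traits=["trầm", "rõ chữ"]))
```

Keeping the caption assembly in one place makes it easier to vary a single attribute (for example only the region) while holding the rest of the prompt fixed.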
Main Differences From Stage 1
Compared with the base Vietnamese Stage 1 model, this Stage 2 release is designed to improve controllability for:
- emotion
- regional accent
- age-related speaking characteristics
while still keeping general Vietnamese TTS quality through replay data.
Model Overview
- Architecture: CrossDiT-based non-autoregressive TTS
- Language: Vietnamese
- Vocoder: BigVGAN v2 24kHz 100-band
- Caption encoder: ViT5-large
- Text representation: Vietnamese character-level vocabulary
- Base checkpoint: thangquang09/capspeech-nar-vietnamese
Intended Use
This model is intended for:
- Vietnamese controllable TTS research
- voice design experiments with text instructions
- prototyping demos with controllable emotion/accent/age prompts
This model is not packaged as a standard transformers checkpoint. The recommended way to use it is through the project inference code from GitHub.
Quick Start
1. Clone the GitHub repository
git clone https://github.com/thangquang09/vietnamese-f5tts-voice-design.git
cd vietnamese-f5tts-voice-design
pip install -r requirements.txt
2. Run the Python API
from api import InstructVoiceAPI

tts = InstructVoiceAPI(
    device="cuda:0",
    hf_model_repo="thangquang09/capspeech-nar-vietnamese-stage2",
)
tts.synthesize(
    text="Xin chào, rất vui được gặp bạn.",
    caption="Giọng nữ trẻ, nhẹ nhàng, vui vẻ",
    output_path="output.wav",
)
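To compare how different captions change the voice, the same sentence can be rendered once per caption. The following is a minimal sketch that relies only on the `synthesize(text=..., caption=..., output_path=...)` call shown above. The helper name `synthesize_variants`, the `outputs` directory and the `variant_NN.wav` naming are illustrative assumptions.

```python
from pathlib import Path


def synthesize_variants(tts, text, captions, out_dir="outputs"):
    """Render `text` once per caption and return the output file paths.

    `tts` is any object exposing synthesize(text=..., caption=..., output_path=...),
    as in the Quick Start call.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, caption in enumerate(captions):
        # One file per caption, numbered in the order the captions were given.
        path = out / f"variant_{i:02d}.wav"
        tts.synthesize(text=text, caption=caption, output_path=str(path))
        paths.append(path)
    return paths
```

With the `tts` object created in the Quick Start, calling `synthesize_variants(tts, "Xin chào, rất vui được gặp bạn.", ["Giọng nữ trẻ, vui vẻ", "Giọng nam trung niên miền Bắc, trầm"])` would request two files under `outputs/`, one per caption.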
3. Run the Gradio app
python app.py \
  --hf_model_repo thangquang09/capspeech-nar-vietnamese-stage2 \
  --device cuda:0 \
  --port 7860
Files In This Repository
- checkpoint.pt: Stage 2 model weights
- finetune_vn_stage2.yaml: training/inference config
- vocab.txt: Vietnamese text vocabulary
- duration_predictor/: duration prediction module
- README.md: model card
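The project API accepts `hf_model_repo`, so a manual download is presumably optional. For offline use or inspection, the files can also be fetched with `huggingface_hub.snapshot_download`. The sketch below downloads the repository and checks that the entries listed above are present; the helper names and the `capspeech_stage2` target directory are assumptions made for illustration.

```python
from pathlib import Path

REPO_ID = "thangquang09/capspeech-nar-vietnamese-stage2"
# Entries taken from the file list above; duration_predictor is a directory.
EXPECTED = ["checkpoint.pt", "finetune_vn_stage2.yaml", "vocab.txt", "duration_predictor"]


def fetch_repo(local_dir="capspeech_stage2"):
    """Download the full repository snapshot and return its local path."""
    # Imported lazily so missing_entries() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return Path(snapshot_download(repo_id=REPO_ID, local_dir=local_dir))


def missing_entries(root):
    """Return the expected files or directories that are absent under `root`."""
    root = Path(root)
    return [name for name in EXPECTED if not (root / name).exists()]
```

After a completed download, `missing_entries(fetch_repo())` should return an empty list, provided the file list above still matches the repository contents.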
Notes
- This is a custom CapSpeech-based release, so users should rely on the GitHub inference code rather than generic Hugging Face auto-loading APIs.
- If you build demos or benchmarks on top of this model, please make sure your prompt format matches the instruction style shown above.
How To Read Download Counts
Hugging Face exposes repository-level download statistics.
In the UI
Open:
https://huggingface.co/thangquang09/capspeech-nar-vietnamese-stage2
and look for:
Downloads last month
In Python
from huggingface_hub import HfApi

api = HfApi()
info = api.model_info(
    "thangquang09/capspeech-nar-vietnamese-stage2",
    expand=["downloads", "downloadsAllTime"],
)
print("Downloads last 30 days:", info.downloads)
# Probe both attribute spellings; an `is None` check keeps a legitimate count of 0.
all_time = getattr(info, "downloads_all_time", None)
if all_time is None:
    all_time = getattr(info, "downloadsAllTime", None)
print("Downloads all time:", all_time)
Citation
If you use this model, please cite:
- the original CapSpeech project: https://github.com/WangHelin1997/CapSpeech
- this Vietnamese adaptation/release: https://github.com/thangquang09/vietnamese-f5tts-voice-design
Acknowledgements
This model is built on top of:
- CapSpeech
- BigVGAN
- ViT5
- the Vietnamese instruction-TTS data pipeline developed in this project