CapSpeech NAR Vietnamese Stage 2

Vietnamese instruction-guided text-to-speech model for controllable voice generation.

This repository contains the Stage 2 checkpoint of my Vietnamese CapSpeech NAR pipeline, fine-tuned on a curated instruction-TTS mixture for:

  • emotion control
  • accent control
  • age-group control
  • preservation of general speech quality through replay data

Model repo:

  • thangquang09/capspeech-nar-vietnamese-stage2

Project GitHub:

  • https://github.com/thangquang09/vietnamese-f5tts-voice-design

Base model:

  • thangquang09/capspeech-nar-vietnamese

Upstream research/codebase:

  • https://github.com/WangHelin1997/CapSpeech

What This Model Does

The model takes:

  • text: the Vietnamese sentence to synthesize
  • caption: a natural-language description of the desired voice

and generates speech that follows the requested speaking style.

Example captions:

  • Giọng nữ trẻ, vui vẻ, nhịp nói nhanh
  • Giọng nam trung niên miền Bắc, trầm, rõ chữ
  • Giọng nữ cao tuổi miền Nam, chậm rãi, ấm áp
  • Giọng thiếu niên, hào hứng, năng lượng cao
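
Captions are free-form text, but for batch experiments it can help to assemble them from a few attributes. The sketch below shows one hypothetical way to do that. `build_caption` and its parameters are illustrative names, not part of the project code, and the comma-separated layout is only inferred from the example captions above.

```python
def build_caption(voice, traits):
    """Join a voice description and style traits into one caption string.

    Hypothetical helper, not part of the project API. It mirrors the
    comma-separated layout of the example captions.
    """
    # Drop empty traits so the caption never contains a dangling comma.
    parts = [voice.strip()] + [t.strip() for t in traits if t.strip()]
    return ", ".join(parts)


caption = build_caption("Giọng nữ trẻ", ["vui vẻ", "nhịp nói nhanh"])
print(caption)
```

Keeping the voice description first and the style traits after it stays close to the caption style the examples use.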

Main Differences From Stage 1

Compared with the base Vietnamese Stage 1 model, this Stage 2 release is designed to improve controllability for:

  • emotion
  • regional accent
  • age-related speaking characteristics

while still keeping general Vietnamese TTS quality through replay data.

Model Overview

  • Architecture: CrossDiT-based non-autoregressive TTS
  • Language: Vietnamese
  • Vocoder: BigVGAN v2 24kHz 100-band
  • Caption encoder: ViT5-large
  • Text representation: Vietnamese character-level vocabulary
  • Base checkpoint: thangquang09/capspeech-nar-vietnamese
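
To illustrate what a character-level text representation means in practice, here is a minimal sketch of turning a sentence into symbol ids. It assumes that `vocab.txt` lists one symbol per line and that a symbol's id is its line index. The model card does not state the file format, so treat that as an assumption. `load_vocab`, `encode_text`, and the `unk_id` fallback are hypothetical names for illustration; the project's own inference code is the reference for the real preprocessing.

```python
import unicodedata


def load_vocab(path):
    """Map each symbol to its line index, assuming one symbol per line."""
    with open(path, encoding="utf-8") as f:
        # Strip only the newline so that a line holding a single space survives.
        symbols = [line.rstrip("\n") for line in f]
    return {sym: idx for idx, sym in enumerate(symbols)}


def encode_text(text, vocab, unk_id=0):
    """Convert text to a list of ids, one per character.

    NFC normalization composes base letters and diacritics so that a
    character such as "ế" is looked up as a single symbol. Mapping unknown
    characters to `unk_id` is an assumption made for this sketch.
    """
    normalized = unicodedata.normalize("NFC", text)
    return [vocab.get(ch, unk_id) for ch in normalized]
```

Normalizing first matters for Vietnamese because the same accented letter can arrive either precomposed or as a base letter plus combining marks, and only one of those forms would match a single vocabulary entry.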

Intended Use

This model is intended for:

  • Vietnamese controllable TTS research
  • voice design experiments with text instructions
  • prototyping demos with controllable emotion/accent/age prompts

This model is not packaged as a standard transformers checkpoint. The recommended way to use it is through the project inference code from GitHub.

Quick Start

1. Clone the GitHub repository

git clone https://github.com/thangquang09/vietnamese-f5tts-voice-design.git
cd vietnamese-f5tts-voice-design
pip install -r requirements.txt

2. Run the Python API

from api import InstructVoiceAPI

tts = InstructVoiceAPI(
    device="cuda:0",
    hf_model_repo="thangquang09/capspeech-nar-vietnamese-stage2",
)

tts.synthesize(
    text="Xin chào, rất vui được gặp bạn.",
    caption="Giọng nữ trẻ, nhẹ nhàng, vui vẻ",
    output_path="output.wav",
)
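
To compare how different captions change the same sentence, the call above can be wrapped in a small loop. This is a sketch, not part of the project code: `synthesize_variants` and the `variant_NN.wav` naming are hypothetical, and the only assumption about `tts` is that it accepts the `text`, `caption`, and `output_path` keyword arguments shown in the snippet above.

```python
from pathlib import Path


def synthesize_variants(tts, text, captions, out_dir="outputs"):
    """Synthesize `text` once per caption and return the written paths.

    `tts` is any object exposing
    synthesize(text=..., caption=..., output_path=...),
    such as an InstructVoiceAPI instance.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, caption in enumerate(captions):
        # One file per caption, numbered in the order the captions were given.
        output_path = out_dir / f"variant_{index:02d}.wav"
        tts.synthesize(text=text, caption=caption, output_path=str(output_path))
        paths.append(output_path)
    return paths
```

With an `InstructVoiceAPI` instance named `tts`, calling `synthesize_variants(tts, "Xin chào, rất vui được gặp bạn.", ["Giọng nữ trẻ, vui vẻ", "Giọng nam trung niên miền Bắc, trầm"])` would write one file per caption into `outputs/`.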

3. Run the Gradio app

python app.py \
  --hf_model_repo thangquang09/capspeech-nar-vietnamese-stage2 \
  --device cuda:0 \
  --port 7860

Files In This Repository

  • checkpoint.pt: Stage 2 model weights
  • finetune_vn_stage2.yaml: training/inference config
  • vocab.txt: Vietnamese text vocabulary
  • duration_predictor/: duration prediction module
  • README.md: model card
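
The project inference code takes the repository name directly, so a manual download is usually unnecessary. For inspecting the files offline, the sketch below fetches the repository with `snapshot_download` from `huggingface_hub` and checks that the entries listed above are present. `missing_files`, `download_repo`, and the `capspeech_stage2` directory name are hypothetical and chosen for illustration only.

```python
from pathlib import Path

# File names taken from the list above; a trailing slash marks a directory.
EXPECTED_FILES = [
    "checkpoint.pt",
    "finetune_vn_stage2.yaml",
    "vocab.txt",
    "duration_predictor/",
]


def missing_files(local_dir):
    """Return the expected entries that are absent from `local_dir`."""
    root = Path(local_dir)
    missing = []
    for name in EXPECTED_FILES:
        path = root / name.rstrip("/")
        present = path.is_dir() if name.endswith("/") else path.is_file()
        if not present:
            missing.append(name)
    return missing


def download_repo(local_dir="capspeech_stage2"):
    """Download the full model repository and report missing entries."""
    # Imported lazily so that missing_files works without huggingface_hub.
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id="thangquang09/capspeech-nar-vietnamese-stage2",
        local_dir=local_dir,
    )
    return missing_files(local_dir)
```

An empty list from `download_repo()` means every expected entry was found in the local copy.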

Notes

  • This is a custom CapSpeech-based release, so users should rely on the GitHub inference code rather than generic Hugging Face auto-loading APIs.
  • If you build demos or benchmarks on top of this model, please make sure your prompt format matches the instruction style shown above.

How To Read Download Counts

Hugging Face exposes repository-level download statistics.

In the UI

Open:

  • https://huggingface.co/thangquang09/capspeech-nar-vietnamese-stage2

and look for:

  • Downloads last month

In Python

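from huggingface_hub import HfApi

api = HfApi()
# Request both download counters explicitly through `expand`.
info = api.model_info(
    "thangquang09/capspeech-nar-vietnamese-stage2",
    expand=["downloads", "downloadsAllTime"],
)

# `downloads` is the rolling 30-day count shown in the UI.
print("Downloads last 30 days:", info.downloads)
# getattr keeps this working on huggingface_hub versions that lack the attribute,
# and a legitimate count of 0 is printed as 0 rather than None.
print("Downloads all time:", getattr(info, "downloads_all_time", None))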

Citation

If you use this model, please cite:

  • the original CapSpeech project: https://github.com/WangHelin1997/CapSpeech
  • this Vietnamese adaptation/release: https://github.com/thangquang09/vietnamese-f5tts-voice-design

Acknowledgements

This model is built on top of:

  • CapSpeech
  • BigVGAN
  • ViT5
  • the Vietnamese instruction-TTS data pipeline developed in this project