🛑 Important Note ⚠️

This Text-to-Speech (TTS) model is provided solely for research, experimentation, and technology development purposes. Any audio content generated by the model does not represent the voice, identity, opinions, or endorsement of any real individual or organization. The authors and related parties assume no responsibility for any misuse, unlawful activities, violations of privacy, personality rights, intellectual property rights, or any direct or indirect damages arising from the use of this model.

Users bear full responsibility and legal liability for the deployment, distribution, and use of the model. The use of this model for impersonation, voice cloning of individuals without lawful consent, creating misleading content, fraud, manipulation of public opinion, or any purpose that violates applicable laws is strictly prohibited. When using or sharing generated audio, it is strongly recommended to clearly disclose that the content is AI-generated and to comply fully with all applicable legal regulations, platform policies, and ethical standards.

🎙️ ZipVoice-Vietnamese-2500h

ZipVoice is a series of fast and high-quality zero-shot TTS models based on flow matching.

Key features:

- Small and fast: only 123M parameters.
- High-quality voice cloning: state-of-the-art performance in speaker similarity, intelligibility, and naturalness.
- Multi-lingual: the base models support Chinese and English.
- Multi-mode: supports both single-speaker and dialogue speech generation.

This checkpoint is a compact fine-tuned version of ZipVoice trained on 2500 hours of Vietnamese speech.

🔗 For more fine-tuning and inference experiments, visit: https://github.com/k2-fsa/ZipVoice.

📜 License: CC-BY-NC-SA-4.0 — Non-commercial research use only.

📌 Model Details

Dataset: PhoAudioBook, ViVoice, TeacherDinh-UEH.
Total dataset duration: 2500 hours
Data processing:
- Remove background music from all audio using Facebook's Demucs model: https://github.com/facebookresearch/demucs
- Exclude audio files shorter than 1 second or longer than 30 seconds.
- Keep the original punctuation marks unchanged.
- Convert all transcript text to lowercase.
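
The duration filter and text normalization steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual preprocessing script: the `Clip` record and the `keep_clip`, `normalize_text`, and `prepare` helpers are hypothetical names, durations are assumed to be known in seconds, and the Demucs music-removal step is not covered.

```python
from dataclasses import dataclass
from typing import List

# Duration bounds taken from the processing notes above (in seconds).
MIN_SECONDS = 1.0
MAX_SECONDS = 30.0


@dataclass
class Clip:
    """Hypothetical record for one training utterance."""
    path: str        # location of the audio file
    duration: float  # length in seconds, assumed to be measured beforehand
    text: str        # transcript


def keep_clip(clip: Clip) -> bool:
    """Keep clips that are neither shorter than 1 s nor longer than 30 s."""
    return MIN_SECONDS <= clip.duration <= MAX_SECONDS


def normalize_text(text: str) -> str:
    """Lowercase the transcript and leave punctuation untouched."""
    return text.lower()


def prepare(clips: List[Clip]) -> List[Clip]:
    """Apply the duration filter, then normalize the surviving transcripts."""
    return [
        Clip(c.path, c.duration, normalize_text(c.text))
        for c in clips
        if keep_clip(c)
    ]


if __name__ == "__main__":
    sample = [
        Clip("a.wav", 0.5, "Ngắn"),
        Clip("b.wav", 12.0, "Xin Chào, Việt Nam!"),
        Clip("c.wav", 31.0, "Dài"),
    ]
    for clip in prepare(sample):
        print(clip.path, clip.text)
```

Python's `str.lower()` is Unicode-aware, so Vietnamese letters with diacritics (for example `Đ` to `đ`) are lowercased without a custom mapping.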
Training Configuration:
- Base model: ZipVoice, using espeak-ng (vi) as the tokenizer
- GPU: RTX 3090
- Batch size: max duration 200
Training progress: stopped at 525,000 steps (epoch 11)