Text-2-Voice: Qwen3-TTS Hindi Voice Cloning
Fine-tuning and inference pipeline for Qwen3-TTS-12Hz-1.7B-Base with voice cloning. Generate speech in Hindi and other languages using reference audio — no fine-tuning required for basic use.
Features
- Voice cloning — Use any reference audio (3–10s, single speaker) to clone voice
- Base model ready — Run the raw Qwen3-TTS Base model out of the box
- Hindi fine-tuning — Optional fine-tuning on IndicTTS-Hindi + English replay data
- Gradio UI — Interactive app with default voices and custom upload
Requirements
- Python 3.9+
- CUDA-capable GPU (recommended; CPU is very slow)
- ~8GB+ GPU VRAM for inference; ~16GB+ for fine-tuning
pip install -r requirements.txt
# Optional: pip install bitsandbytes # Reduces optimizer memory ~4x during training
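The VRAM figures above can be turned into a quick pre-flight check. The following is a minimal sketch, not part of this repository: the helper names are illustrative, the thresholds are simply the approximate numbers from the list above, and it assumes PyTorch is the framework in use (it falls back gracefully if `torch` is not importable).

```python
# Rough pre-flight check against the VRAM figures listed above.
# Thresholds are the README's approximate numbers; helper names are illustrative.

INFERENCE_MIN_GB = 8.0   # "~8GB+ GPU VRAM for inference"
TRAINING_MIN_GB = 16.0   # "~16GB+ for fine-tuning"


def vram_verdict(total_gb: float) -> str:
    """Classify a GPU by total memory: what it is likely large enough for."""
    if total_gb >= TRAINING_MIN_GB:
        return "fine-tuning"
    if total_gb >= INFERENCE_MIN_GB:
        return "inference"
    return "insufficient"


def detect_total_vram_gb(device_index: int = 0):
    """Return total VRAM in GiB for one CUDA device, or None if unavailable."""
    try:
        import torch  # assumption: PyTorch is installed via requirements.txt
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    props = torch.cuda.get_device_properties(device_index)
    return props.total_memory / 1024**3


if __name__ == "__main__":
    total = detect_total_vram_gb()
    if total is None:
        print("No CUDA GPU detected; expect very slow CPU inference.")
    else:
        print(f"{total:.1f} GiB VRAM -> {vram_verdict(total)}")
```

Drivers usually report slightly less than a card's nominal size, so a result just under a threshold should be read as borderline rather than as a hard failure.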
Quick Start: Inference (Base Model)
Generate speech from text using reference audio (voice cloning):
# Using sample reference audio (English)
python scripts/quick_test.py --ref-audio sample_ref.wav --text "नमस्ते, यह हिंदी में एक परीक्षण है।" --output output.wav
# Download sample ref audio if you don't have one
python scripts/quick_test.py --download-sample --text "अभी न जाओ छोड़ कर के दिल अभी भरा नहीं" --output hindi_output.wav
# Using Hindi reference voice
python scripts/quick_test.py --ref-audio data/hindi/audio/utt_000001.wav \
--text "अभी न जाओ छोड़ कर के दिल अभी भरा नहीं अभी अभी तो आई हो अभी अभी तो" \
--language Auto --output hindi_output.wav
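To synthesize several phrases with the same reference voice, the CLI calls above can be driven from Python. This is a minimal sketch that only uses flags shown in this README (`--ref-audio`, `--text`, `--output`, `--language`, `--model`); the function names and the `outputs/phrase_NNN.wav` naming scheme are assumptions made for illustration.

```python
import subprocess
import sys
from pathlib import Path


def build_command(ref_audio, text, output, language=None, model=None):
    """Build one quick_test.py invocation from the flags shown in this README."""
    cmd = [
        sys.executable, "scripts/quick_test.py",
        "--ref-audio", str(ref_audio),
        "--text", text,
        "--output", str(output),
    ]
    if language is not None:
        cmd += ["--language", language]
    if model is not None:
        cmd += ["--model", model]
    return cmd


def synthesize_all(ref_audio, phrases, out_dir="outputs", **kwargs):
    """Run quick_test.py once per phrase; stops at the first failing call."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for i, text in enumerate(phrases, start=1):
        output = out_dir / f"phrase_{i:03d}.wav"  # hypothetical naming scheme
        subprocess.run(build_command(ref_audio, text, output, **kwargs), check=True)
```

Each call starts a fresh process and therefore reloads the model, so this approach suits a handful of phrases; for larger batches the batch evaluation tab of the Gradio app is likely the better fit.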
Direct inference (scripts/inference.py)
python scripts/inference.py --ref-audio sample_ref.wav --text "Your text here" --output output.wav
Gradio App
Launch the web UI with default voices and optional custom reference upload:
# Default: Base model on port 7860
./scripts/run_radio.sh
# Or directly
python scripts/radio_app.py --model Qwen/Qwen3-TTS-12Hz-1.7B-Base --port 7860
Interactive tab: Select a default voice (Sample English, Hindi 1–3) or upload your own reference audio. Enter text, choose language, and generate.
Batch evaluation tab: Run predefined test phrases (Hindi, English, Chinese, Japanese) with selected or uploaded reference audio.
Fine-tuning (Optional)
Fine-tune the Base model on Hindi data for improved Hindi prosody:
# 1. Set GPU (e.g., dedicated GPU 3)
export CUDA_VISIBLE_DEVICES=3
# 2. Download datasets and run full pipeline
./scripts/train.sh
# Or step by step:
./scripts/train.sh download # Download IndicTTS-Hindi + replay data
./scripts/train.sh # Prepare data + fine-tune
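The pipeline mixes Hindi data with English replay data, and the configuration exposes `hindi_ratio` and `replay_ratio`. Replay data is typically added so that a model keeps its existing abilities while learning a new language. How `scripts/merge_with_replay.py` actually performs the mix is not shown in this README; the sketch below is one plausible interpretation, keeping every Hindi item and sampling enough replay items to approximate the requested proportions. The function name and signature are hypothetical.

```python
import random


def mix_with_replay(hindi, replay, hindi_ratio, replay_ratio, seed=0):
    """Keep all Hindi items and sample replay items so the final mix
    approximates hindi_ratio : replay_ratio. Illustrative only."""
    if hindi_ratio <= 0 or replay_ratio < 0:
        raise ValueError("hindi_ratio must be > 0 and replay_ratio >= 0")
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    wanted = round(len(hindi) * replay_ratio / hindi_ratio)
    n_replay = min(wanted, len(replay))  # cannot sample more than exists
    mixed = list(hindi) + rng.sample(list(replay), n_replay)
    rng.shuffle(mixed)
    return mixed
```

With 80 Hindi items and ratios of 0.8 and 0.2, this draws 20 replay items for a 100-item training set.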
Checkpoints are saved under output/. Use a checkpoint for inference:
python scripts/quick_test.py --model output/checkpoint-epoch-3 --ref-audio sample_ref.wav --text "नमस्ते" --output ft_output.wav
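When several epochs have been saved, it can be convenient to pick the most recent checkpoint automatically. The sketch below assumes every checkpoint directory follows the `checkpoint-epoch-N` pattern seen in the command above; the helper name is illustrative and not part of the repository.

```python
import re
from pathlib import Path

# Assumed naming pattern, inferred from "output/checkpoint-epoch-3".
_CHECKPOINT_RE = re.compile(r"^checkpoint-epoch-(\d+)$")


def latest_checkpoint(output_dir="output"):
    """Return the checkpoint-epoch-N directory with the largest N, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best, best_epoch = None, -1
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        if match and child.is_dir() and int(match.group(1)) > best_epoch:
            best_epoch = int(match.group(1))
            best = child
    return best
```

Comparing the epoch as an integer matters here: a plain string sort would rank `checkpoint-epoch-10` before `checkpoint-epoch-2`. The returned path can be passed to `--model`.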
Project Structure
text-2-voice/
├── config.yaml # GPU, paths, datasets, training config
├── scripts/
│ ├── inference.py # Core load_model + generate (voice cloning)
│ ├── quick_test.py # CLI for single-phrase generation
│ ├── radio_app.py # Gradio app (default voices + upload)
│ ├── run_radio.sh # Launch Gradio with config
│ ├── train.sh # Fine-tuning pipeline
│ ├── download_datasets.py
│ ├── merge_with_replay.py
│ └── eval_phrases.py # Test phrases for batch eval
├── finetuning/
│ ├── prepare_data.py # Extract audio codes for training
│ ├── sft_12hz.py # Fine-tuning script
│ └── dataset.py
├── data/
│ ├── hindi/ # Hindi audio + train_raw.jsonl
│ └── replay/ # English replay data
├── output/ # Fine-tuned checkpoints
└── sample_ref.wav # Sample reference (or use --download-sample)
Configuration
Edit config.yaml for:
- GPU: gpu.device_ids, gpu.device (target GPU for training/inference)
- Paths: paths.data_dir, paths.output_dir
- Datasets: datasets.hindi.primary, datasets.replay_sources
- Training: training.batch_size, lr, num_epochs, hindi_ratio, replay_ratio
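The dotted names above describe nested keys. As a minimal sketch of how such settings could be read once the file is parsed (for example with PyYAML's `yaml.safe_load`, assuming PyYAML is installed), the helper below walks a nested dict by dotted path. The example values are invented for illustration, and it is an assumption that `lr`, `num_epochs`, `hindi_ratio` and `replay_ratio` sit under `training`.

```python
# Hypothetical config mirroring the keys listed above; all values are made up.
EXAMPLE_CONFIG = {
    "gpu": {"device_ids": [0], "device": "cuda:0"},
    "paths": {"data_dir": "data", "output_dir": "output"},
    "training": {
        "batch_size": 2,
        "lr": 1e-5,
        "num_epochs": 3,
        "hindi_ratio": 0.8,
        "replay_ratio": 0.2,
    },
}

_MISSING = object()  # sentinel so that None can be used as a real default


def get_setting(config, dotted_key, default=_MISSING):
    """Look up a nested value such as "training.num_epochs"."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            if default is _MISSING:
                raise KeyError(dotted_key)
            return default
        node = node[part]
    return node
```

Raising `KeyError` with the full dotted path, rather than only the last segment, makes typos in setting names easier to spot.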
Reference Audio Tips
- Duration: 3–10 seconds of clean speech
- Quality: Single speaker, low background noise
- Format: WAV preferred; model resamples to 24 kHz internally
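The duration guideline can be checked before a clip is used as a reference. The sketch below relies only on Python's standard `wave` module, so it handles uncompressed PCM WAV files only; it cannot judge speaker count or background noise. The function name and the warning wording are illustrative.

```python
import wave


def check_reference_wav(path, min_seconds=3.0, max_seconds=10.0):
    """Return a list of warnings about a PCM WAV clip's duration (empty if OK)."""
    with wave.open(str(path), "rb") as wav:
        sample_rate = wav.getframerate()
        duration = wav.getnframes() / sample_rate
    warnings = []
    if duration < min_seconds:
        warnings.append(
            f"too short: {duration:.1f}s (want at least {min_seconds:g}s)"
        )
    elif duration > max_seconds:
        warnings.append(
            f"too long: {duration:.1f}s (want at most {max_seconds:g}s)"
        )
    return warnings
```

The sample rate is read only to compute the duration; since the model resamples internally, the check does not require a particular rate.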
License
See the Qwen3-TTS upstream license. This project uses Apache-2.0 where applicable.