Text-2-Voice: Qwen3-TTS Hindi Voice Cloning

Fine-tuning and inference pipeline for Qwen3-TTS-12Hz-1.7B-Base with voice cloning. Generate speech in Hindi and other languages using reference audio — no fine-tuning required for basic use.

Features

Voice cloning: use any reference audio clip (3–10 s, single speaker) to clone a voice
Base model ready: run the raw Qwen3-TTS Base model out of the box
Hindi fine-tuning: optional fine-tuning on IndicTTS-Hindi plus English replay data
Gradio UI: interactive app with default voices and custom reference upload

Requirements

Python 3.9+
CUDA-capable GPU (recommended; CPU is very slow)
Roughly 8 GB or more of GPU VRAM for inference; roughly 16 GB or more for fine-tuning

pip install -r requirements.txt
# Optional: reduces optimizer memory roughly 4x during training
# pip install bitsandbytes
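
Before installing, it can help to confirm that the machine meets the requirements above. The following is a minimal sketch, not part of this repository: `check_environment` is a hypothetical helper, and it assumes PyTorch is the GPU backend that requirements.txt pulls in (if PyTorch is missing, the GPU fields are simply left unset).

```python
import sys

MIN_PYTHON = (3, 9)        # from the requirements list
MIN_INFERENCE_VRAM_GB = 8  # approximate figure from the requirements list


def check_environment():
    """Report the Python version status and, if PyTorch is present, GPU memory."""
    report = {
        "python_ok": sys.version_info >= MIN_PYTHON,
        "cuda_available": False,
        "vram_gb": None,
    }
    try:
        import torch  # assumption: PyTorch is the backend; imported lazily
    except ImportError:
        return report
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        report["cuda_available"] = True
        report["vram_gb"] = round(props.total_memory / 1024**3, 1)
    return report


if __name__ == "__main__":
    info = check_environment()
    print(info)
    if info["vram_gb"] is not None and info["vram_gb"] < MIN_INFERENCE_VRAM_GB:
        print("Warning: less GPU memory than the ~8 GB suggested for inference.")
```

The check only looks at the first visible GPU, so set CUDA_VISIBLE_DEVICES first if the target card is not device 0.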

Quick Start: Inference (Base Model)

Generate speech from text using reference audio (voice cloning):

# Using sample reference audio (English)
python scripts/quick_test.py --ref-audio sample_ref.wav --text "नमस्ते, यह हिंदी में एक परीक्षण है।" --output output.wav

# Download sample ref audio if you don't have one
python scripts/quick_test.py --download-sample --text "अभी न जाओ छोड़ कर के दिल अभी भरा नहीं" --output hindi_output.wav

# Using Hindi reference voice
python scripts/quick_test.py --ref-audio data/hindi/audio/utt_000001.wav \
  --text "अभी न जाओ छोड़ कर के दिल अभी भरा नहीं अभी अभी तो आई हो अभी अभी तो" \
  --language Auto --output hindi_output.wav
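
To generate several phrases with the same reference voice, the CLI calls above can be driven from a short Python script. This is only a sketch: `build_command` and `generate_all` are hypothetical helpers, the `batch_NNN.wav` output names are illustrative, and the only flags used are the ones shown in this README (`--ref-audio`, `--text`, `--output`, `--model`, `--language`).

```python
import subprocess
import sys


def build_command(text, output, ref_audio=None, model=None, language=None):
    """Build the argument list for one scripts/quick_test.py call."""
    cmd = [sys.executable, "scripts/quick_test.py", "--text", text, "--output", output]
    if ref_audio:
        cmd += ["--ref-audio", ref_audio]
    if model:
        cmd += ["--model", model]
    if language:
        cmd += ["--language", language]
    return cmd


def generate_all(phrases, ref_audio, run=subprocess.run):
    """Run one CLI call per phrase and return the output file names."""
    outputs = []
    for index, text in enumerate(phrases, start=1):
        output = f"batch_{index:03d}.wav"  # illustrative naming scheme
        run(build_command(text, output, ref_audio=ref_audio), check=True)
        outputs.append(output)
    return outputs


if __name__ == "__main__":
    generate_all(["नमस्ते", "यह हिंदी में एक परीक्षण है।"], "sample_ref.wav")
```

Each call starts a new process, so the model is presumably loaded again for every phrase; for larger batches the Gradio batch evaluation tab may be the more practical route.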

Direct inference (scripts/inference.py)

python scripts/inference.py --ref-audio sample_ref.wav --text "Your text here" --output output.wav

Gradio App

Launch the web UI with default voices and optional custom reference upload:

# Default: Base model on port 7860
./scripts/run_radio.sh

# Or directly
python scripts/radio_app.py --model Qwen/Qwen3-TTS-12Hz-1.7B-Base --port 7860

Interactive tab: Select a default voice (Sample English, Hindi 1–3) or upload your own reference audio. Enter text, choose language, and generate.

Batch evaluation tab: Run predefined test phrases (Hindi, English, Chinese, Japanese) with selected or uploaded reference audio.

Fine-tuning (Optional)

Fine-tune the Base model on Hindi data for improved Hindi prosody:

# 1. Set GPU (e.g., dedicated GPU 3)
export CUDA_VISIBLE_DEVICES=3

# 2. Download datasets and run full pipeline
./scripts/train.sh

# Or step by step:
./scripts/train.sh download    # Download IndicTTS-Hindi + replay data
./scripts/train.sh             # Prepare data + fine-tune
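
The same GPU selection can be done from Python when training is launched by another script. The sketch below uses hypothetical helpers (`training_env`, `training_command`, `run_training`) and only the `./scripts/train.sh` invocations shown above. Note that once CUDA_VISIBLE_DEVICES is set to a single ID, that GPU appears as device 0 inside the process; how this interacts with the gpu.* settings in config.yaml is not covered here.

```python
import os
import subprocess


def training_env(gpu_id, base_env=None):
    """Copy the environment and restrict CUDA to a single physical GPU."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env


def training_command(step=None):
    """Return the train.sh call, optionally with a step such as 'download'."""
    cmd = ["./scripts/train.sh"]
    if step:
        cmd.append(step)
    return cmd


def run_training(gpu_id, step=None):
    """Launch train.sh on the chosen GPU and raise if it exits with an error."""
    return subprocess.run(training_command(step), env=training_env(gpu_id), check=True)
```

For example, `run_training(3, "download")` followed by `run_training(3)` mirrors the step-by-step commands above.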

Checkpoints are saved under output/. Use a checkpoint for inference:

python scripts/quick_test.py --model output/checkpoint-epoch-3 --ref-audio sample_ref.wav --text "नमस्ते" --output ft_output.wav
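
To pick the most recent checkpoint automatically, something like the following could work. It is a sketch under one assumption: that every checkpoint directory follows the `checkpoint-epoch-N` naming seen in the example above. `latest_checkpoint` is a hypothetical helper, not part of the repository.

```python
import re
from pathlib import Path

# Assumed naming scheme, inferred from the "checkpoint-epoch-3" example.
CHECKPOINT_PATTERN = re.compile(r"^checkpoint-epoch-(\d+)$")


def latest_checkpoint(output_dir="output"):
    """Return the checkpoint directory with the highest epoch number, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_path, best_epoch = None, -1
    for child in root.iterdir():
        match = CHECKPOINT_PATTERN.match(child.name)
        if child.is_dir() and match and int(match.group(1)) > best_epoch:
            best_path, best_epoch = child, int(match.group(1))
    return best_path


if __name__ == "__main__":
    print(latest_checkpoint() or "No checkpoints found under output/")
```

Epoch numbers are compared as integers, so `checkpoint-epoch-10` correctly sorts after `checkpoint-epoch-9`. The returned path can be passed to `--model`.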

Project Structure

text-2-voice/
├── config.yaml           # GPU, paths, datasets, training config
├── scripts/
│   ├── inference.py      # Core load_model + generate (voice cloning)
│   ├── quick_test.py     # CLI for single-phrase generation
│   ├── radio_app.py      # Gradio app (default voices + upload)
│   ├── run_radio.sh      # Launch Gradio with config
│   ├── train.sh          # Fine-tuning pipeline
│   ├── download_datasets.py
│   ├── merge_with_replay.py
│   └── eval_phrases.py   # Test phrases for batch eval
├── finetuning/
│   ├── prepare_data.py   # Extract audio codes for training
│   ├── sft_12hz.py       # Fine-tuning script
│   └── dataset.py
├── data/
│   ├── hindi/            # Hindi audio + train_raw.jsonl
│   └── replay/           # English replay data
├── output/               # Fine-tuned checkpoints
└── sample_ref.wav        # Sample reference (or use --download-sample)

Configuration

Edit config.yaml for:

GPU: gpu.device_ids, gpu.device — target GPU for training/inference
Paths: paths.data_dir, paths.output_dir
Datasets: datasets.hindi.primary, datasets.replay_sources
Training: training.batch_size, training.lr, training.num_epochs, training.hindi_ratio, training.replay_ratio
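
The dotted names above read as paths into a nested config. The sketch below shows how such a path could be resolved once the file is loaded (for example with PyYAML's `yaml.safe_load`). The nesting and every value in `EXAMPLE_CONFIG` are placeholders chosen for illustration, not the project's actual defaults, and `get_setting` is a hypothetical helper.

```python
# Placeholder structure and values: NOT the real contents of config.yaml.
EXAMPLE_CONFIG = {
    "gpu": {"device_ids": [3], "device": "cuda:0"},
    "paths": {"data_dir": "data", "output_dir": "output"},
    "training": {
        "batch_size": 2,
        "lr": 1e-5,
        "num_epochs": 3,
        "hindi_ratio": 0.8,
        "replay_ratio": 0.2,
    },
}


def get_setting(config, dotted_key):
    """Walk a nested dict using a dotted path such as 'paths.output_dir'."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            raise KeyError(dotted_key)
        node = node[part]
    return node


if __name__ == "__main__":
    print(get_setting(EXAMPLE_CONFIG, "paths.output_dir"))
    print(get_setting(EXAMPLE_CONFIG, "training.num_epochs"))
```

Raising `KeyError` with the full dotted path makes a misspelled setting easier to spot than a silent default would.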

Reference Audio Tips

Duration: 3–10 seconds of clean speech
Quality: Single speaker, low background noise
Format: WAV preferred; model resamples to 24 kHz internally
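
A reference clip can be checked against these tips before it is passed to the model. The following sketch uses only Python's standard `wave` module, so it handles uncompressed PCM WAV files only. `inspect_reference` is a hypothetical helper, and it checks duration but cannot judge noise or the number of speakers.

```python
import wave

MIN_SECONDS = 3.0   # lower bound from the tips above
MAX_SECONDS = 10.0  # upper bound from the tips above


def inspect_reference(path):
    """Return the duration, sample rate, and channel count of a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        channels = wav.getnchannels()
    duration = frames / rate
    problems = []
    if duration < MIN_SECONDS:
        problems.append(f"clip is {duration:.1f}s, shorter than {MIN_SECONDS:.0f}s")
    elif duration > MAX_SECONDS:
        problems.append(f"clip is {duration:.1f}s, longer than {MAX_SECONDS:.0f}s")
    return {
        "duration_s": duration,
        "sample_rate": rate,
        "channels": channels,
        "problems": problems,
    }


if __name__ == "__main__":
    print(inspect_reference("sample_ref.wav"))
```

The sample rate is reported for information only, since the model resamples to 24 kHz internally.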

License

See the Qwen3-TTS upstream license. This project uses Apache-2.0 where applicable.
