Svara-TTS v1 — Hindi Fine-tuned

kenpath/svara-tts-v1 fine-tuned on the Kaggle Hindi TTS dataset (11.8K samples).

Model Details

Property	Value
Base model	kenpath/svara-tts-v1 (Llama 3.2 + SNAC audio tokens)
Fine-tuning method	LoRA (rank 64, alpha 128)
Training data	10,876 Hindi audio-text pairs from the Kaggle Hindi TTS dataset
Training time	~2 hours on NVIDIA A100 80GB
Epochs	3
Final train loss	3.406
Final eval loss	3.393

A/B Test Results

Controlled comparison on identical inputs (speaker "Hindi (Male)", temperature=0.7, top_p=0.9, 5 sentences):

Metric	Base Model	Fine-tuned	Change
RMS (loudness/clarity)	2,225	4,010	+80%
Silence ratio	53.0%	39.6%	-13.4 percentage points
Audio duration	5.08s avg	4.21s avg	-17% (more concise)
Generation time	3.01s avg	2.48s avg	-17.5% (faster)
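
The table does not state how RMS and silence ratio were measured. The following is a minimal sketch of one plausible way to compute them, assuming 16-bit mono PCM samples held in a NumPy array. The function names, the 20 ms frame length, and the amplitude threshold for "silence" are illustrative assumptions, not the settings behind the numbers above.

```python
import numpy as np


def rms(samples: np.ndarray) -> float:
    """Root-mean-square amplitude of int16 PCM samples."""
    x = samples.astype(np.float64)  # widen first so squaring cannot overflow int16
    return float(np.sqrt(np.mean(x ** 2)))


def silence_ratio(samples: np.ndarray, frame_len: int = 480,
                  threshold: float = 200.0) -> float:
    """Fraction of fixed-size frames whose RMS is below `threshold`.

    frame_len=480 is 20 ms at 24 kHz; both defaults are assumed values.
    """
    n_frames = len(samples) // frame_len
    if n_frames == 0:
        return 0.0
    frames = samples[: n_frames * frame_len].astype(np.float64)
    frames = frames.reshape(n_frames, frame_len)
    frame_rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return float(np.mean(frame_rms < threshold))


def relative_change(base: float, new: float) -> float:
    """Percentage change from `base` to `new`."""
    return (new - base) / base * 100.0
```

With the table's RMS values, relative_change(2225, 4010) gives roughly +80%, matching the reported figure. The silence ratio row is a difference of two percentages (39.6 - 53.0), so it is a change in percentage points rather than a relative change.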

Usage

With vLLM

vllm serve samudr-ai/svara-tts-v1-hindi-ft \
    --port 8000 \
    --trust-remote-code \
    --gpu-memory-utilization 0.85 \
    --max-model-len 4096

With Svara-TTS API

export VLLM_MODEL=samudr-ai/svara-tts-v1-hindi-ft
export VLLM_BASE_URL=http://localhost:8000/v1
uvicorn api.server:app --host 0.0.0.0 --port 8080

Python (direct)

from tts_engine.orchestrator import SvaraTTSOrchestrator

orch = SvaraTTSOrchestrator(
    base_url="http://localhost:8000/v1",
    model="samudr-ai/svara-tts-v1-hindi-ft",
    speaker_id="Hindi (Male)",
)

# stream() yields audio chunks as bytes; collect them and join into one buffer
audio_chunks = list(orch.stream("नमस्ते, आप कैसे हैं?"))
audio = b"".join(audio_chunks)
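
To listen to the result, the joined bytes can be wrapped in a WAV container. This is a minimal sketch using only the standard library. It assumes the stream is raw 16-bit mono PCM at 24 kHz: the sample rate comes from the Architecture section, while the bit depth and channel count are assumptions to verify against the orchestrator's output. write_wav and duration_seconds are hypothetical helper names.

```python
import wave


def write_wav(path: str, pcm_bytes: bytes, sample_rate: int = 24000,
              channels: int = 1, sample_width: int = 2) -> None:
    """Write raw PCM bytes to a WAV file (sample_width is bytes per sample)."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(sample_rate)
        wf.writeframes(pcm_bytes)


def duration_seconds(pcm_bytes: bytes, sample_rate: int = 24000,
                     channels: int = 1, sample_width: int = 2) -> float:
    """Playback length implied by the byte count under the assumed format."""
    return len(pcm_bytes) / (sample_rate * channels * sample_width)
```

Under those assumptions, write_wav("output.wav", audio) saves the buffer from the example above, and duration_seconds(audio) gives its length in seconds.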

Architecture

Hindi text → Llama 3.2 tokenizer → text tokens
                                         ↓
                              Svara-TTS LLM (this model)
                                         ↓
                              SNAC audio tokens → SNAC decoder → 24kHz PCM audio

The model autoregressively generates discrete SNAC audio tokens, which the SNAC 24kHz codec then decodes into waveform audio.
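
Before decoding, the flat token stream has to be split into SNAC's three hierarchical codebook layers. The sketch below assumes the 7-tokens-per-frame interleaving used by Orpheus-style models (1 coarse, 2 medium, and 4 fine codes per frame, each frame position offset by a codebook size of 4096). Whether this model uses exactly that layout is an assumption to check against the Svara-TTS decoder code, and split_into_snac_layers is a hypothetical helper name.

```python
from typing import List, Tuple

CODEBOOK_SIZE = 4096  # entries per SNAC codebook (assumed)
FRAME_SIZE = 7        # tokens per frame: 1 + 2 + 4 across three layers (assumed)


def split_into_snac_layers(codes: List[int]) -> Tuple[List[int], List[int], List[int]]:
    """De-interleave flat audio codes into three SNAC layers.

    `codes` are assumed to have the vocabulary offset of the first audio
    token already subtracted, so position p in a frame holds a value in
    [p * CODEBOOK_SIZE, (p + 1) * CODEBOOK_SIZE).
    """
    layer_1: List[int] = []
    layer_2: List[int] = []
    layer_3: List[int] = []
    n_frames = len(codes) // FRAME_SIZE  # an incomplete trailing frame is dropped
    for i in range(n_frames):
        frame = codes[i * FRAME_SIZE:(i + 1) * FRAME_SIZE]
        layer_1.append(frame[0])
        layer_2.append(frame[1] - 1 * CODEBOOK_SIZE)
        layer_3.append(frame[2] - 2 * CODEBOOK_SIZE)
        layer_3.append(frame[3] - 3 * CODEBOOK_SIZE)
        layer_2.append(frame[4] - 4 * CODEBOOK_SIZE)
        layer_3.append(frame[5] - 5 * CODEBOOK_SIZE)
        layer_3.append(frame[6] - 6 * CODEBOOK_SIZE)
    return layer_1, layer_2, layer_3
```

The three lists would then be handed to the SNAC 24kHz decoder, which reconstructs the waveform from the coarse-to-fine codes.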

Training Details

Method: LoRA applied to q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Precision: bf16
Batch size: 2 × 8 gradient accumulation steps = 16 effective
Learning rate: 2e-4 with cosine schedule
Max sequence length: 2048 tokens
Loss masking: Only audio tokens are supervised (text prompt masked with -100)
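
The loss-masking step can be illustrated with a small sketch: labels copy the input IDs, except that every text-prompt position is set to -100, the index that PyTorch's cross-entropy loss ignores by default. The function name and the token IDs in the usage note are made up for illustration, and the truncation to 2048 tokens simply mirrors the max sequence length listed above. This is not the actual training code.

```python
from typing import List, Sequence, Tuple

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_labels(prompt_ids: Sequence[int], audio_ids: Sequence[int],
                 max_len: int = 2048) -> Tuple[List[int], List[int]]:
    """Return (input_ids, labels) with the text prompt excluded from the loss."""
    input_ids = list(prompt_ids) + list(audio_ids)
    # Prompt positions get IGNORE_INDEX; audio positions keep their token IDs.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(audio_ids)
    # Truncate both sequences identically so they stay aligned.
    return input_ids[:max_len], labels[:max_len]
```

For example, build_labels([1, 2, 3], [10, 11]) keeps all five tokens as model input but supervises only the last two, so the gradient comes entirely from predicting audio tokens.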

Limitations

Optimized specifically for Hindi; other languages supported by the base model may not benefit
Best results with the "Hindi (Male)" speaker ID from svara-tts-v1
Requires SNAC decoder for audio output (not a standalone audio model)