Svara-TTS v1 - Hindi Fine-tuned

A fine-tuned version of kenpath/svara-tts-v1, trained on the Kaggle Hindi TTS dataset (11.8K samples).

Model Details

| Property           | Value                                                        |
|--------------------|--------------------------------------------------------------|
| Base model         | kenpath/svara-tts-v1 (Llama 3.2 + SNAC audio tokens)         |
| Fine-tuning method | LoRA (rank 64, alpha 128)                                    |
| Training data      | 10,876 Hindi audio-text pairs from Kaggle Hindi TTS dataset  |
| Training time      | ~2 hours on NVIDIA A100 80GB                                 |
| Epochs             | 3                                                            |
| Final train loss   | 3.406                                                        |
| Final eval loss    | 3.393                                                        |

A/B Test Results

Controlled comparison on identical inputs (speaker "Hindi (Male)", temperature 0.7, top_p 0.9, 5 sentences):

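| Metric                 | Base model | Fine-tuned | Change               |
|------------------------|------------|------------|----------------------|
| RMS (loudness/clarity) | 2,225      | 4,010      | +80%                 |
| Silence ratio          | 53.0%      | 39.6%      | -13.4 pp             |
| Audio duration (avg)   | 5.08 s     | 4.21 s     | -17% (more concise)  |
| Generation time (avg)  | 3.01 s     | 2.48 s     | 17.5% faster         |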
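
The card does not state how these metrics were computed. As a rough sketch, assuming the audio is 16-bit little-endian mono PCM and that "silence" means samples below a fixed amplitude threshold, RMS and silence ratio could be measured as below. `SILENCE_THRESHOLD` and all function names are illustrative assumptions, not part of Svara-TTS.

```python
import math
import sys
from array import array

SILENCE_THRESHOLD = 500  # assumed cutoff on the int16 scale; not from the model card


def pcm_samples(pcm_bytes: bytes) -> array:
    """Interpret raw bytes as signed 16-bit little-endian samples (assumed format)."""
    samples = array("h")
    samples.frombytes(pcm_bytes)
    if sys.byteorder == "big":
        samples.byteswap()  # array uses native order; swap on big-endian hosts
    return samples


def rms(samples) -> float:
    """Root-mean-square amplitude; higher means louder on average."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def silence_ratio(samples, threshold: int = SILENCE_THRESHOLD) -> float:
    """Fraction of samples whose absolute amplitude is below the threshold."""
    if not samples:
        return 0.0
    quiet = sum(1 for s in samples if abs(s) < threshold)
    return quiet / len(samples)


def relative_change(base: float, new: float) -> float:
    """Relative change in percent between a base value and a new value."""
    return (new - base) / base * 100.0
```

For the RMS row, `relative_change(2225, 4010)` comes out at roughly 80, in line with the table. The silence ratio row is a difference of two percentages (53.0 minus 39.6), so it is a change in percentage points rather than a relative change.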

Usage

With vLLM

vllm serve samudr-ai/svara-tts-v1-hindi-ft \
    --port 8000 \
    --trust-remote-code \
    --gpu-memory-utilization 0.85 \
    --max-model-len 4096
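
Once the server is running, one way to confirm it is serving this model is to query the OpenAI-compatible `/v1/models` endpoint that vLLM exposes. This is a minimal sketch, assuming the server from the command above is reachable on localhost port 8000; the helper names are illustrative.

```python
import json
import urllib.request


def parse_model_ids(payload: dict) -> list:
    """Extract model IDs from an OpenAI-style model list response."""
    return [entry["id"] for entry in payload.get("data", [])]


def served_model_ids(base_url: str = "http://localhost:8000/v1",
                     timeout: float = 5.0) -> list:
    """Fetch the model list from a running server (requires the server to be up)."""
    url = base_url.rstrip("/") + "/models"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_model_ids(json.load(resp))
```

Against a running server, `served_model_ids()` should include `samudr-ai/svara-tts-v1-hindi-ft` in its result.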

With Svara-TTS API

export VLLM_MODEL=samudr-ai/svara-tts-v1-hindi-ft
export VLLM_BASE_URL=http://localhost:8000/v1
uvicorn api.server:app --host 0.0.0.0 --port 8080

Python (direct)

from tts_engine.orchestrator import SvaraTTSOrchestrator

orch = SvaraTTSOrchestrator(
    base_url="http://localhost:8000/v1",
    model="samudr-ai/svara-tts-v1-hindi-ft",
    speaker_id="Hindi (Male)",
)

audio_chunks = list(orch.stream("नमस्ते, आप कैसे हैं?"))  # "Hello, how are you?"
audio = b"".join(audio_chunks)
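
To listen to the result, the raw bytes need a container format. The following is a minimal sketch, assuming the streamed chunks are 16-bit mono PCM at 24 kHz. The sample rate comes from the Architecture section; the bit depth and channel count are assumptions, so check them against the orchestrator you run. `save_pcm_as_wav` is a hypothetical helper, not part of `tts_engine`.

```python
import wave


def save_pcm_as_wav(pcm: bytes, path: str, sample_rate: int = 24000,
                    channels: int = 1, sample_width: int = 2) -> float:
    """Wrap raw PCM bytes in a WAV header and return the clip duration in seconds."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)  # bytes per sample; 2 = 16-bit (assumed)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    frames = len(pcm) // (channels * sample_width)
    return frames / sample_rate
```

With the `audio` bytes from the snippet above, `save_pcm_as_wav(audio, "namaste.wav")` would write a playable file and return its length in seconds.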

Architecture

Hindi text → Llama 3.2 tokenizer → text tokens
                                         ↓
                              Svara-TTS LLM (this model)
                                         ↓
                              SNAC audio tokens → SNAC decoder → 24kHz PCM audio

The model generates discrete SNAC audio tokens autoregressively, which are then decoded to waveform audio using the SNAC 24kHz codec.
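
SNAC is a multi-scale codec, so the flat token stream from the LLM has to be regrouped into codebook layers before decoding. The card does not describe the token layout. The sketch below assumes the layout used by Orpheus-style Llama + SNAC models: 7 codes per frame (1 coarse, 2 mid, 4 fine), with the code at frame position p shifted by p × 4096. Treat the constants, the ordering, and the function name as assumptions to verify against the actual decoder.

```python
CODEBOOK_SIZE = 4096   # assumed number of entries per SNAC codebook
TOKENS_PER_FRAME = 7   # assumed: 1 coarse + 2 mid + 4 fine codes per frame


def split_snac_layers(codes):
    """Regroup a flat list of per-frame codes into three SNAC layers.

    `codes` is assumed to have the LLM vocabulary offset already removed,
    so position p within a frame holds a value in [p*4096, (p+1)*4096).
    """
    coarse, mid, fine = [], [], []
    # Drop a trailing partial frame, which cannot be decoded.
    usable = len(codes) - len(codes) % TOKENS_PER_FRAME
    for start in range(0, usable, TOKENS_PER_FRAME):
        chunk = codes[start:start + TOKENS_PER_FRAME]
        # Remove the per-position shift to recover raw codebook indices.
        frame = [c - pos * CODEBOOK_SIZE for pos, c in enumerate(chunk)]
        coarse.append(frame[0])
        mid.extend([frame[1], frame[4]])
        fine.extend([frame[2], frame[3], frame[5], frame[6]])
    return coarse, mid, fine
```

The three lists would then be converted to tensors and passed to the SNAC decoder; that call is left out here because it depends on the SNAC package and version in use.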

Training Details

  • Method: LoRA applied to q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • Precision: bf16
  • Batch size: 2 × 8 gradient accumulation = 16 effective
  • Learning rate: 2e-4 with cosine schedule
  • Max sequence length: 2048 tokens
  • Loss masking: Only audio tokens are supervised (text prompt masked with -100)
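
The loss-masking item can be illustrated in plain Python. This is a minimal sketch, assuming each training example is a text prompt followed by its audio tokens; `build_example` and the token IDs are made up for illustration. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so masked positions contribute nothing to the loss.

```python
IGNORE_INDEX = -100   # default ignore_index of PyTorch's CrossEntropyLoss
MAX_SEQ_LEN = 2048    # max sequence length from the list above


def build_example(prompt_ids, audio_ids, max_len=MAX_SEQ_LEN):
    """Concatenate prompt and audio tokens; supervise only the audio part."""
    input_ids = (list(prompt_ids) + list(audio_ids))[:max_len]
    # Prompt positions get IGNORE_INDEX so only audio tokens are scored.
    labels = ([IGNORE_INDEX] * len(prompt_ids) + list(audio_ids))[:max_len]
    return {"input_ids": input_ids, "labels": labels}


example = build_example(prompt_ids=[11, 12, 13], audio_ids=[901, 902])
print(example["labels"])  # [-100, -100, -100, 901, 902]
```

Masking the prompt keeps the model from spending capacity on reproducing the input text, so the gradient comes only from predicting audio tokens.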

Limitations

  • Optimized specifically for Hindi; other languages in the base model may not benefit
  • Best results with the "Hindi (Male)" speaker ID from svara-tts-v1
  • Requires SNAC decoder for audio output (not a standalone audio model)

License

Apache 2.0 (same as base model)
