Svara-TTS v1: Hindi Fine-tuned
kenpath/svara-tts-v1, fine-tuned on the Hindi TTS dataset (11.8K samples).
Model Details
| Property | Value |
|---|---|
| Base model | kenpath/svara-tts-v1 (Llama 3.2 + SNAC audio tokens) |
| Fine-tuning method | LoRA (rank 64, alpha 128) |
| Training data | 10,876 Hindi audio-text pairs from Kaggle Hindi TTS dataset |
| Training time | ~2 hours on NVIDIA A100 80GB |
| Epochs | 3 |
| Final train loss | 3.406 |
| Final eval loss | 3.393 |
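To give a sense of scale for the LoRA settings in the table, here is a rough trainable-parameter estimate. It is a sketch only: the layer shapes (hidden size 3072, MLP size 8192, 28 layers, 8 KV heads of dimension 128) are assumed from the upstream Llama 3.2 3B configuration and are not stated in this card. The target projections are the ones listed under Training Details.

```python
# Rough LoRA size estimate. The layer shapes below are ASSUMED from the
# upstream Llama 3.2 3B config; they are not stated in this model card.
HIDDEN = 3072          # assumed hidden size
MLP = 8192             # assumed MLP (intermediate) size
KV_DIM = 8 * 128       # assumed: 8 KV heads x head dimension 128
NUM_LAYERS = 28        # assumed number of transformer layers

RANK = 64              # from the table above
ALPHA = 128            # from the table above

# (input_dim, output_dim) of each projection that receives an adapter.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_params(d_in, d_out, rank=RANK):
    """A LoRA adapter adds two matrices: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out) for d_in, d_out in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
scaling = ALPHA / RANK  # standard LoRA scales the update by alpha / rank

print(f"LoRA scaling factor: {scaling}")
print(f"Adapter parameters per layer: {per_layer:,}")
print(f"Adapter parameters in total:  {total:,}")
```

Under these assumed shapes the adapters come to roughly 97 million trainable parameters, with the low-rank update scaled by alpha / rank = 2.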
A/B Test Results
Controlled comparison using identical inputs (speaker "Hindi (Male)", temperature 0.7, top_p 0.9, 5 sentences):
| Metric | Base Model | Fine-tuned | Improvement |
|---|---|---|---|
| RMS (loudness/clarity) | 2,225 | 4,010 | +80% |
| Silence ratio | 53.0% | 39.6% | -13.4 pp |
| Audio duration | 5.08s avg | 4.21s avg | -17% (more concise) |
| Generation speed | 3.01s avg | 2.48s avg | 17.5% faster |
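The card does not say how RMS and silence ratio were measured. As one plausible reading, the sketch below computes both on 16-bit mono PCM. The sample format, the 480-sample frame length (20 ms at 24 kHz), and the silence threshold of 500 are illustrative assumptions, not the settings behind the table.

```python
import math
import struct


def pcm16_to_samples(pcm_bytes):
    """Decode little-endian 16-bit PCM bytes into a list of ints (assumed format)."""
    count = len(pcm_bytes) // 2
    return list(struct.unpack(f"<{count}h", pcm_bytes[: count * 2]))


def rms(samples):
    """Root-mean-square amplitude; 0.0 for an empty input."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def silence_ratio(samples, frame_len=480, threshold=500.0):
    """Fraction of frames whose RMS falls below `threshold` (both values assumed)."""
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    if not frames:
        return 0.0
    silent = sum(1 for frame in frames if rms(frame) < threshold)
    return silent / len(frames)


def relative_change(before, after):
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100.0
```

With this helper, `relative_change(2225, 4010)` reproduces the roughly +80% RMS change in the table. The silence-ratio row is a difference in percentage points (53.0 minus 39.6), not a relative change.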
Usage
With vLLM
vllm serve samudr-ai/svara-tts-v1-hindi-ft \
--port 8000 \
--trust-remote-code \
--gpu-memory-utilization 0.85 \
--max-model-len 4096
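Once the server is running, a quick sanity check is to list the served models through vLLM's OpenAI-compatible `/v1/models` route. This is a minimal sketch using only the standard library. The helper names are hypothetical, and the base URL assumes the port used in the command above.

```python
import json
import urllib.request


def models_url(base_url):
    """Build the model-listing URL from an OpenAI-style base URL."""
    return base_url.rstrip("/") + "/models"


def served_model_ids(payload):
    """Extract model IDs from an OpenAI-style model list response."""
    return [model["id"] for model in payload.get("data", [])]


def list_served_models(base_url="http://localhost:8000/v1", timeout=5.0):
    """Query the running server; raises URLError if it is not reachable."""
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
        return served_model_ids(json.load(resp))


# Usage (requires the server to be up):
#   print(list_served_models())
```

If the server started correctly, the returned list should include `samudr-ai/svara-tts-v1-hindi-ft`.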
With Svara-TTS API
export VLLM_MODEL=samudr-ai/svara-tts-v1-hindi-ft
export VLLM_BASE_URL=http://localhost:8000/v1
uvicorn api.server:app --host 0.0.0.0 --port 8080
Python (direct)
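from tts_engine.orchestrator import SvaraTTSOrchestrator

# Point the orchestrator at the vLLM server's OpenAI-compatible endpoint.
orch = SvaraTTSOrchestrator(
    base_url="http://localhost:8000/v1",
    model="samudr-ai/svara-tts-v1-hindi-ft",
    speaker_id="Hindi (Male)",
)

# Stream the synthesis and join the chunks into one byte string.
# The input text means "Hello, how are you?"
audio_chunks = list(orch.stream("नमस्ते, आप कैसे हैं?"))
audio = b"".join(audio_chunks)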
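To listen to the result, the joined bytes can be wrapped in a WAV container. The sketch below assumes the stream yields raw 16-bit mono PCM at 24 kHz. The sample rate matches the Architecture section, but the bit depth and channel count are assumptions, and `pcm_to_wav_bytes` is a hypothetical helper rather than part of the Svara-TTS API.

```python
import io
import wave


def pcm_to_wav_bytes(pcm, sample_rate=24000, channels=1, sample_width=2):
    """Wrap raw PCM bytes in a WAV header (format parameters are assumed)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)   # bytes per sample; 2 = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()


# Usage, with `audio` taken from the snippet above:
#   with open("output.wav", "wb") as f:
#       f.write(pcm_to_wav_bytes(audio))
```

If playback sounds sped up, slowed down, or noisy, the assumed format is likely wrong and should be checked against the orchestrator's actual output.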
Architecture
Hindi text → Llama 3.2 tokenizer → text tokens
↓
Svara-TTS LLM (this model)
↓
SNAC audio tokens → SNAC decoder → 24kHz PCM audio
The model autoregressively generates discrete SNAC audio tokens, which the SNAC 24kHz codec then decodes into waveform audio.
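To illustrate the step between the LLM and the SNAC decoder, the sketch below regroups a flat stream of audio codes into the three codebook levels a SNAC decoder expects. It assumes the layout commonly used by Orpheus-style models: 7 codes per frame, each position shifted by a multiple of 4096, split 1/2/4 across the levels. Whether this checkpoint uses exactly that layout is an assumption, and the mapping from LLM vocabulary IDs to these codes is left out.

```python
CODEBOOK_SIZE = 4096    # assumed size of each SNAC codebook
TOKENS_PER_FRAME = 7    # assumed: 1 coarse + 2 medium + 4 fine codes per frame


def deinterleave(codes):
    """Split a flat code stream into (coarse, medium, fine) code lists.

    `codes` are assumed to already have the LLM vocabulary offset removed,
    so position p within a frame carries a value in
    [p * CODEBOOK_SIZE, (p + 1) * CODEBOOK_SIZE).
    """
    if len(codes) % TOKENS_PER_FRAME != 0:
        raise ValueError("code stream must contain whole 7-token frames")

    coarse, medium, fine = [], [], []
    for start in range(0, len(codes), TOKENS_PER_FRAME):
        frame = codes[start:start + TOKENS_PER_FRAME]
        # Remove the per-position offset so every code lies in [0, 4096).
        local = [code - pos * CODEBOOK_SIZE for pos, code in enumerate(frame)]
        coarse.append(local[0])
        medium.extend([local[1], local[4]])
        fine.extend([local[2], local[3], local[5], local[6]])
    return coarse, medium, fine
```

The three lists would then be turned into tensors and passed to the SNAC decoder. An incomplete trailing frame is rejected rather than padded, which keeps the sketch simple for non-streaming use.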
Training Details
- Method: LoRA applied to q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Precision: bf16
- Batch size: 2 × 8 gradient accumulation = 16 effective
- Learning rate: 2e-4 with cosine schedule
- Max sequence length: 2048 tokens
- Loss masking: Only audio tokens are supervised (text prompt masked with -100)
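The loss masking in the last bullet can be sketched as follows. This is a minimal illustration with plain lists and made-up token IDs, not the training script itself: prompt positions get the label -100, which PyTorch's cross-entropy loss ignores by default, so only audio tokens contribute to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def build_labels(prompt_ids, audio_ids, max_len=2048):
    """Return (input_ids, labels) with the text prompt masked out of the loss.

    Sequences longer than `max_len` are truncated from the right, matching
    the 2048-token limit listed above (the truncation side is an assumption).
    """
    input_ids = (list(prompt_ids) + list(audio_ids))[:max_len]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + list(audio_ids))[:max_len]
    return input_ids, labels


# Illustrative IDs only: three prompt tokens followed by two audio tokens.
input_ids, labels = build_labels([1, 2, 3], [10, 11])
print(input_ids)  # [1, 2, 3, 10, 11]
print(labels)     # [-100, -100, -100, 10, 11]
```

Because the labels keep the same length as the inputs, they can be passed to a causal language model unchanged; the shift by one position is normally handled inside the model's loss computation.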
Limitations
- Optimized specifically for Hindi; other languages supported by the base model may not benefit
- Best results with the "Hindi (Male)" speaker ID from svara-tts-v1
- Requires SNAC decoder for audio output (not a standalone audio model)
License
Apache 2.0 (same as base model)
Model tree for samudr-ai/svara-tts-v1-hindi-ft
- Base model: meta-llama/Llama-3.2-3B-Instruct
- Finetuned: canopylabs/orpheus-3b-0.1-pretrained
- Finetuned: canopylabs/3b-hi-ft-research_release
- Adapter: kenpath/svara-tts-v1