---
license: apache-2.0
language:
- hi
- kn
- bn
- gu
- te
- mr
- bh
- mai
- mag
- hne
tags:
- text-to-speech
- tts
- indic
- onnx
- onnxruntime-genai
- quantized
- zero-shot
- voice-cloning
pipeline_tag: text-to-speech
base_model:
- somyalab/Spark_somya_TTS
- SparkAudio/Spark-TTS-0.5B
---

# Spark-Somya-TTS

A zero-shot voice-cloning text-to-speech (TTS) model for Indic languages, fine-tuned from Spark-TTS-0.5B.

## Supported Languages

- Hindi (hi)
- Kannada (kn)
- Bengali (bn)
- Gujarati (gu)
- Telugu (te)
- Marathi (mr)
- Bhojpuri (bh)
- Maithili (mai)
- Magahi (mag)
- Chhattisgarhi (hne)

## Quick Start

### Installation

```bash
pip install torch transformers huggingface_hub unsloth soundfile librosa numpy
```

### Download Model

```python
from huggingface_hub import snapshot_download

model_dir = snapshot_download("somyalab/Spark_somya_TTS")
```

### Inference

```python
import torch
import numpy as np
import soundfile as sf
from unsloth import FastLanguageModel

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_dir,
    max_seq_length=2048,
    dtype=torch.bfloat16,
    load_in_4bit=False,
)
FastLanguageModel.for_inference(model)

# Load audio tokenizer (BiCodec)
import sys
sys.path.insert(0, model_dir)
from sparktts.models.audio_tokenizer import BiCodecTokenizer

audio_tokenizer = BiCodecTokenizer(model_dir, "cuda")

# Reference audio for voice cloning
import librosa
ref_audio, ref_sr = librosa.load("reference_voice.wav", sr=None)
ref_global_tokens, _ = audio_tokenizer.tokenize_audio(ref_audio, ref_sr)

# Generate speech
text = "नमस्ते, यह एक परीक्षण है।"

prompt = "".join([
    "<|task_tts|>",
    "<|start_content|>",
    text,
    "<|end_content|>",
    "<|start_global_token|>",
    ref_global_tokens,
    "<|end_global_token|>",
    "<|start_semantic_token|>",
])

inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=True,
    temperature=0.7,
)

# Decode to audio
generated_ids = outputs[:, inputs.input_ids.shape[1]:]
generated_tokens = tokenizer.convert_ids_to_tokens(generated_ids[0].tolist())

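# Extract semantic token IDs. The prefix is 19 characters long, so slice by
# its length instead of a hard-coded offset (t[18:-2] would keep the "_").
semantic_prefix = "<|bicodec_semantic_"
semantic_ids = []
for t in generated_tokens:
    if t.startswith(semantic_prefix) and t.endswith("|>"):
        semantic_ids.append(int(t[len(semantic_prefix):-2]))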

# Detokenize to waveform
import re
global_matches = re.findall(r"<\|bicodec_global_(\d+)\|>", ref_global_tokens)
global_ids = torch.tensor([int(t) for t in global_matches]).unsqueeze(0).unsqueeze(0)
semantic_ids = torch.tensor(semantic_ids).unsqueeze(0)

wav = audio_tokenizer.detokenize(
    global_ids.to("cuda").squeeze(0),
    semantic_ids.to("cuda"),
)

sf.write("output.wav", wav, 16000)
```
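
The inference script recovers integer IDs from two kinds of token strings: `<|bicodec_semantic_N|>` and `<|bicodec_global_N|>`. As a minimal sketch, both cases can share one regex-based helper. The name `extract_bicodec_ids` is hypothetical and is not part of this repository; it only assumes the token spelling shown in the script above.

```python
import re


def extract_bicodec_ids(token_text: str, kind: str) -> list[int]:
    """Return the integer IDs of all <|bicodec_{kind}_{N}|> tokens in order.

    Hypothetical helper for illustration; `kind` is "semantic" or "global".
    """
    if kind not in ("semantic", "global"):
        raise ValueError(f"unknown token kind: {kind!r}")
    # Escape the pipes so they are matched literally, and capture the digits.
    pattern = rf"<\|bicodec_{kind}_(\d+)\|>"
    return [int(n) for n in re.findall(pattern, token_text)]


# Example with a made-up decoded string; unrelated text is ignored.
decoded = "<|bicodec_semantic_12|><|bicodec_semantic_7|> trailing text"
semantic = extract_bicodec_ids(decoded, "semantic")
```

Matching on the whole token avoids depending on a fixed character offset, and an input with no matching tokens yields an empty list that the caller can check before detokenizing.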

## Model Architecture

- Base: Qwen2ForCausalLM (0.5B parameters)
- Fine-tuned for Indic languages with extended tokenizer
- Uses BiCodec for audio tokenization/detokenization
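
The split described above (a causal language model that emits BiCodec semantic tokens, conditioned on the text and on global tokens taken from the reference speaker) shows up in the prompt layout used in the Quick Start. The following is a minimal sketch of that layout only; `build_tts_prompt` is a hypothetical helper name, and the special tokens are copied from the inference script rather than from a documented API.

```python
def build_tts_prompt(text: str, global_tokens: str) -> str:
    """Assemble the TTS prompt in the order used by the Quick Start script.

    Hypothetical helper for illustration. `global_tokens` is assumed to be
    the reference speaker's tokens already rendered as a string such as
    "<|bicodec_global_1|><|bicodec_global_2|>".
    """
    if not isinstance(global_tokens, str):
        raise TypeError("global_tokens must be a string of <|bicodec_global_N|> tokens")
    return "".join([
        "<|task_tts|>",
        "<|start_content|>", text, "<|end_content|>",
        "<|start_global_token|>", global_tokens, "<|end_global_token|>",
        # The prompt ends here so that generation continues with semantic tokens.
        "<|start_semantic_token|>",
    ])


prompt = build_tts_prompt("नमस्ते", "<|bicodec_global_1|><|bicodec_global_2|>")
```

Ending the prompt at `<|start_semantic_token|>` is what lets the model's continuation be read as the semantic token sequence that BiCodec later turns into a waveform.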

## Citation

If you use this model, please cite:

```bibtex
@misc{spark-somya-tts,
  title={Spark-Somya-TTS},
  author={Somya Lab},
  year={2025},
  url={https://huggingface.co/somyalab/Spark_somya_TTS}
}
```

## License

Apache 2.0