| --- |
| license: apache-2.0 |
| language: |
| - hi |
| - kn |
| - bn |
| - gu |
| - te |
| - mr |
| - bn |
| - bh |
| - mai |
| - mag |
| - hne |
| tags: |
| - text-to-speech |
| - tts |
| - indic |
| - onnx |
| - onnxruntime-genai |
| - quantized |
| - zero-shot |
| - voice-cloning |
| pipeline_tag: text-to-speech |
| base_model: |
| - somyalab/Spark_somya_TTS |
| - SparkAudio/Spark-TTS-0.5B |
| --- |
| |
| # Spark-Somya-TTS |
|
|
| Zero-shot voice cloning TTS model for Indic languages, fine-tuned from Spark-TTS-0.5B. |
|
|
| ## Supported Languages |
|
|
| - Hindi (hi) |
| - Kannada (kn) |
| - Bengali (bn) |
| - Gujarati (gu) |
| - Telugu (te) |
| - Marathi (mr) |
| - Bhojpuri (bh) |
| - Maithili (mai) |
| - Maghahi (mag) |
| - Bangali (bn) |
| - chhattisgarhi (hne) |
|
|
| ## Quick Start |
|
|
| ### Installation |
|
|
| ```bash |
| pip install torch transformers huggingface_hub unsloth soundfile librosa numpy |
| ``` |
|
|
| ### Download Model |
|
|
| ```python |
| from huggingface_hub import snapshot_download |
| |
| model_dir = snapshot_download("somyalab/Spark_somya_TTS") |
| ``` |
|
|
| ### Inference |
|
|
| ```python |
| import torch |
| import numpy as np |
| import soundfile as sf |
| from unsloth import FastLanguageModel |
| |
| # Load model |
| model, tokenizer = FastLanguageModel.from_pretrained( |
| model_name=model_dir, |
| max_seq_length=2048, |
| dtype=torch.bfloat16, |
| load_in_4bit=False, |
| ) |
| FastLanguageModel.for_inference(model) |
| |
| # Load audio tokenizer (BiCodec) |
| import sys |
| sys.path.insert(0, model_dir) |
| from sparktts.models.audio_tokenizer import BiCodecTokenizer |
| |
| audio_tokenizer = BiCodecTokenizer(model_dir, "cuda") |
| |
| # Reference audio for voice cloning |
| import librosa |
| ref_audio, ref_sr = librosa.load("reference_voice.wav", sr=None) |
| ref_global_tokens, _ = audio_tokenizer.tokenize_audio(ref_audio, ref_sr) |
| |
| # Generate speech |
| text = "नमस्ते, यह एक परीक्षण है।" |
| |
| prompt = "".join([ |
| "<|task_tts|>", |
| "<|start_content|>", |
| text, |
| "<|end_content|>", |
| "<|start_global_token|>", |
| ref_global_tokens, |
| "<|end_global_token|>", |
| "<|start_semantic_token|>", |
| ]) |
| |
| inputs = tokenizer([prompt], return_tensors="pt").to("cuda") |
| outputs = model.generate( |
| **inputs, |
| max_new_tokens=2048, |
| do_sample=True, |
| temperature=0.7, |
| ) |
| |
| # Decode to audio |
| generated_ids = outputs[:, inputs.input_ids.shape[1]:] |
| generated_tokens = tokenizer.convert_ids_to_tokens(generated_ids[0].tolist()) |
| |
| # Extract semantic token IDs |
| semantic_ids = [] |
| for t in generated_tokens: |
| if t.startswith("<|bicodec_semantic_") and t.endswith("|>"): |
| semantic_ids.append(int(t[18:-2])) |
| |
| # Detokenize to waveform |
| import re |
| global_matches = re.findall(r"<\|bicodec_global_(\d+)\|>", ref_global_tokens) |
| global_ids = torch.tensor([int(t) for t in global_matches]).unsqueeze(0).unsqueeze(0) |
| semantic_ids = torch.tensor(semantic_ids).unsqueeze(0) |
| |
| wav = audio_tokenizer.detokenize( |
| global_ids.to("cuda").squeeze(0), |
| semantic_ids.to("cuda"), |
| ) |
| |
| sf.write("output.wav", wav, 16000) |
| ``` |
|
|
| ## Model Architecture |
|
|
| - Base: Qwen2ForCausalLM (0.5B parameters) |
| - Fine-tuned for Indic languages with extended tokenizer |
| - Uses BiCodec for audio tokenization/detokenization |
|
|
| ## Citation |
|
|
| If you use this model, please cite: |
|
|
| ```bibtex |
| @misc{spark-somya-tts, |
| title={Spark-Somya-TTS}, |
| author={Somya Lab}, |
| year={2025}, |
| url={https://huggingface.co/somyalab/Spark_somya_TTS} |
| } |
| ``` |
|
|
| ## License |
|
|
| Apache 2.0 |