--- license: apache-2.0 language: - hi - kn - bn - gu - te - mr - bn - bh - mai - mag - hne tags: - text-to-speech - tts - indic - onnx - onnxruntime-genai - quantized - zero-shot - voice-cloning pipeline_tag: text-to-speech base_model: - somyalab/Spark_somya_TTS - SparkAudio/Spark-TTS-0.5B --- # Spark-Somya-TTS Zero-shot voice cloning TTS model for Indic languages, fine-tuned from Spark-TTS-0.5B. ## Supported Languages - Hindi (hi) - Kannada (kn) - Bengali (bn) - Gujarati (gu) - Telugu (te) - Marathi (mr) - Bhojpuri (bh) - Maithili (mai) - Maghahi (mag) - Bangali (bn) - chhattisgarhi (hne) ## Quick Start ### Installation ```bash pip install torch transformers huggingface_hub unsloth soundfile librosa numpy ``` ### Download Model ```python from huggingface_hub import snapshot_download model_dir = snapshot_download("somyalab/Spark_somya_TTS") ``` ### Inference ```python import torch import numpy as np import soundfile as sf from unsloth import FastLanguageModel # Load model model, tokenizer = FastLanguageModel.from_pretrained( model_name=model_dir, max_seq_length=2048, dtype=torch.bfloat16, load_in_4bit=False, ) FastLanguageModel.for_inference(model) # Load audio tokenizer (BiCodec) import sys sys.path.insert(0, model_dir) from sparktts.models.audio_tokenizer import BiCodecTokenizer audio_tokenizer = BiCodecTokenizer(model_dir, "cuda") # Reference audio for voice cloning import librosa ref_audio, ref_sr = librosa.load("reference_voice.wav", sr=None) ref_global_tokens, _ = audio_tokenizer.tokenize_audio(ref_audio, ref_sr) # Generate speech text = "नमस्ते, यह एक परीक्षण है।" prompt = "".join([ "<|task_tts|>", "<|start_content|>", text, "<|end_content|>", "<|start_global_token|>", ref_global_tokens, "<|end_global_token|>", "<|start_semantic_token|>", ]) inputs = tokenizer([prompt], return_tensors="pt").to("cuda") outputs = model.generate( **inputs, max_new_tokens=2048, do_sample=True, temperature=0.7, ) # Decode to audio generated_ids = outputs[:, inputs.input_ids.shape[1]:] generated_tokens = tokenizer.convert_ids_to_tokens(generated_ids[0].tolist()) # Extract semantic token IDs semantic_ids = [] for t in generated_tokens: if t.startswith("<|bicodec_semantic_") and t.endswith("|>"): semantic_ids.append(int(t[18:-2])) # Detokenize to waveform import re global_matches = re.findall(r"<\|bicodec_global_(\d+)\|>", ref_global_tokens) global_ids = torch.tensor([int(t) for t in global_matches]).unsqueeze(0).unsqueeze(0) semantic_ids = torch.tensor(semantic_ids).unsqueeze(0) wav = audio_tokenizer.detokenize( global_ids.to("cuda").squeeze(0), semantic_ids.to("cuda"), ) sf.write("output.wav", wav, 16000) ``` ## Model Architecture - Base: Qwen2ForCausalLM (0.5B parameters) - Fine-tuned for Indic languages with extended tokenizer - Uses BiCodec for audio tokenization/detokenization ## Citation If you use this model, please cite: ```bibtex @misc{spark-somya-tts, title={Spark-Somya-TTS}, author={Somya Lab}, year={2025}, url={https://huggingface.co/somyalab/Spark_somya_TTS} } ``` ## License Apache 2.0