Orpheus 3B Hindi Male TTS

A LoRA fine-tune of `canopylabs/3b-hi-pretrain-research_release` on the Hindi male subset of `ai4bharat/Rasa`, producing a single, consistent male Hindi voice named `arjun`.

Training Details


| Parameter | Value |
|---|---|
| Base model | `canopylabs/3b-hi-pretrain-research_release` |
| Dataset | `ai4bharat/Rasa`, Hindi config, male speaker |
| Train examples | 12,116 utterances (~23.78 hours) |
| Codec | SNAC 24 kHz (`hubertsiuzdak/snac_24khz`) |
| LoRA rank | 32 (RSLoRA, α=64) |
| LoRA targets | All attention + MLP projections + `lm_head` + `embed_tokens` |
| Epochs | 3 |
| Batch size | 4 (effective, with gradient accumulation) |
| Learning rate | 2e-4 (cosine schedule) |
| Hardware | 1× NVIDIA A100-SXM4-40GB |
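
A few derived figures follow directly from the table. The sketch below is only back-of-the-envelope arithmetic. It assumes that "effective batch size 4" means four utterances per optimizer step, that the last partial batch is not dropped, and that RSLoRA scales the adapter update by α/√r rather than the α/r of standard LoRA. The function name `training_summary` is hypothetical.

```python
import math

# Values copied from the training table above.
TRAIN_EXAMPLES = 12_116
TRAIN_HOURS = 23.78
EPOCHS = 3
EFFECTIVE_BATCH = 4   # assumption: utterances per optimizer step
LORA_RANK = 32
LORA_ALPHA = 64


def training_summary():
    """Derive average clip length, step counts, and LoRA scaling factors."""
    avg_seconds = TRAIN_HOURS * 3600 / TRAIN_EXAMPLES
    steps_per_epoch = math.ceil(TRAIN_EXAMPLES / EFFECTIVE_BATCH)
    return {
        "avg_utterance_seconds": avg_seconds,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * EPOCHS,
        # Rank-stabilised LoRA scales by alpha / sqrt(r) ...
        "rslora_scale": LORA_ALPHA / math.sqrt(LORA_RANK),
        # ... whereas standard LoRA scales by alpha / r.
        "standard_lora_scale": LORA_ALPHA / LORA_RANK,
    }


summary = training_summary()
```

Under these assumptions the average utterance is about 7.07 seconds long, training runs for 3,029 steps per epoch (9,087 in total), and the RSLoRA scaling factor is about 11.3, compared with 2.0 for standard LoRA at the same rank and α.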

Generated Samples

All samples were generated with `temperature=0.4`, `top_p=0.9`, and `repetition_penalty=1.1`.

नमस्ते, मेरा नाम अर्जुन है और मैं दिल्ली में रहता हूँ। मुझे हिंदी में बात करना बहुत पसंद है क्योंकि यह मेरी मातृभाषा है।

आज सुबह से ही मौसम बहुत सुहाना है और आसमान में हल्के बादल छाए हुए हैं। ऐसे मौसम में चाय पीते हुए किताब पढ़ना बहुत अच्छा लगता है।

भारत एक विविधताओं से भरा हुआ देश है जहाँ अलग-अलग भाषाएँ, संस्कृतियाँ और परंपराएँ एक साथ फलती-फूलती हैं। यहाँ के लोग मिलजुल कर रहते हैं और एक-दूसरे की मदद करते हैं।

Usage

import torch
import wave
import numpy as np
from snac import SNAC
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID     = "edzsaji26/orpheus-3b-0.1-hindi-male-lora"
VOICE_NAME   = "arjun"
AUDIO_OFFSET = 128266
STOP_TOKEN   = 128258
SAMPLE_RATE  = 24_000

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto")
model.eval()

snac = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval().to(device)

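def synthesise(text):
    # The speaker name must prefix the text (see Prompt Format).
    prompt = f"{VOICE_NAME}: {text}"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    # Wrap the text ids in the special tokens that open and close the prompt.
    input_ids = torch.cat([
        torch.tensor([[128259]]),
        ids,
        torch.tensor([[128009, 128260, 128261, 128257]])
    ], dim=1).to(device)
    # Single unpadded sequence, so every position is attended to.
    attention_mask = torch.ones_like(input_ids)

    with torch.inference_mode():
        out = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=2000,
            do_sample=True,
            temperature=0.4,
            top_p=0.9,
            repetition_penalty=1.1,
            eos_token_id=STOP_TOKEN,
        )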

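    new_tokens = out[0, input_ids.shape[1]:].tolist()
    # Keep only SNAC audio tokens; this drops the stop token and any stray text or control tokens.
    audio_tokens = [t for t in new_tokens if AUDIO_OFFSET <= t < AUDIO_OFFSET + 7 * 4096]
    n_frames = len(audio_tokens) // 7
    if n_frames == 0:
        raise RuntimeError("No complete audio frames were generated.")
    audio_tokens = audio_tokens[:n_frames * 7]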

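    # Each 7-token frame holds 1 coarse (c0), 2 mid (c1), and 4 fine (c2) SNAC codes.
    # Slot k of a frame is shifted by k * 4096 on top of AUDIO_OFFSET, so undo that here.
    c0, c1, c2 = [], [], []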
    for f in range(n_frames):
        i = f * 7
        c0.append(audio_tokens[i]   - AUDIO_OFFSET)
        c1.append(audio_tokens[i+1] - AUDIO_OFFSET - 4096)
        c2.append(audio_tokens[i+2] - AUDIO_OFFSET - 2*4096)
        c2.append(audio_tokens[i+3] - AUDIO_OFFSET - 3*4096)
        c1.append(audio_tokens[i+4] - AUDIO_OFFSET - 4*4096)
        c2.append(audio_tokens[i+5] - AUDIO_OFFSET - 5*4096)
        c2.append(audio_tokens[i+6] - AUDIO_OFFSET - 6*4096)

    codes = [torch.tensor(c0).unsqueeze(0).to(device),
             torch.tensor(c1).unsqueeze(0).to(device),
             torch.tensor(c2).unsqueeze(0).to(device)]

    with torch.inference_mode():
        audio = snac.decode(codes)

    waveform = audio.squeeze().cpu().numpy()
    int16 = (waveform * 32767).clip(-32768, 32767).astype(np.int16)

    with wave.open("output.wav", "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(SAMPLE_RATE)
        wf.writeframes(int16.tobytes())

synthesise("नमस्ते, मेरा नाम अर्जुन है और मैं आपसे बात करके बहुत खुश हूँ।")
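
The frame de-interleaving inside `synthesise` can also be written as a small pure function, which makes the 7-token layout easier to inspect and test without loading the model. This is a sketch of the same mapping used in the script above. The names `split_frames`, `SLOT_TO_LAYER`, and `CODEBOOK_SIZE` are hypothetical, and the range check is an addition, not part of the original script.

```python
AUDIO_OFFSET = 128266   # first audio token id, as in the script above
CODEBOOK_SIZE = 4096    # per-slot offset used in the script above
FRAME_SIZE = 7          # tokens per audio frame

# SNAC layer (0 = coarsest) fed by each of the 7 slots in a frame.
SLOT_TO_LAYER = (0, 1, 2, 2, 1, 2, 2)


def split_frames(audio_tokens):
    """Split a flat list of audio token ids into three lists of SNAC codes.

    Trailing tokens that do not fill a whole frame are ignored.
    """
    n_frames = len(audio_tokens) // FRAME_SIZE
    layers = ([], [], [])
    for f in range(n_frames):
        for slot, layer in enumerate(SLOT_TO_LAYER):
            token = audio_tokens[f * FRAME_SIZE + slot]
            # Undo the global offset and the per-slot offset.
            code = token - AUDIO_OFFSET - slot * CODEBOOK_SIZE
            if not 0 <= code < CODEBOOK_SIZE:
                raise ValueError(f"token {token} is not valid for slot {slot}")
            layers[layer].append(code)
    return layers
```

Each returned list can then be turned into a tensor of shape `(1, n)` and passed to `snac.decode`, as the script does with `c0`, `c1`, and `c2`.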

Prompt Format

arjun: <hindi text here>

The voice name `arjun` is the learned speaker identity, so always include it as the prefix.
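
For reference, the full model input wraps the tokenized prompt in a few special tokens, as the usage script does. The sketch below shows that wrapping on plain Python lists. The token ids are copied from the usage script, but the constant names are assumptions based on the usual Orpheus naming, and `format_prompt` and `wrap_prompt_ids` are hypothetical helpers.

```python
# Token ids copied from the usage script; the names are assumed, not confirmed.
START_OF_HUMAN = 128259
END_OF_TEXT = 128009
END_OF_HUMAN = 128260
START_OF_AI = 128261
START_OF_SPEECH = 128257


def format_prompt(text, voice="arjun"):
    """Prefix the text with the speaker name, as the prompt format requires."""
    return f"{voice}: {text}"


def wrap_prompt_ids(text_ids):
    """Surround tokenized prompt ids with the special tokens the script adds."""
    return [
        START_OF_HUMAN,
        *text_ids,
        END_OF_TEXT,
        END_OF_HUMAN,
        START_OF_AI,
        START_OF_SPEECH,
    ]
```

In practice `text_ids` would come from `tokenizer(format_prompt(text)).input_ids`. The model then continues the sequence with audio tokens until it emits the stop token.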

Limitations

- Single male speaker (Arjun); not multi-speaker
- Hindi only
- Speaking styles follow the Rasa dataset: neutral speech, commands, conversations, news, emotions
- For best results, keep utterances under ~15 seconds
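
To relate the ~15 second guideline to `max_new_tokens`, the sketch below converts between audio duration and generated token count. It assumes each 7-token frame decodes to 2,048 samples at 24 kHz (about 11.7 frames per second, matching the coarsest SNAC codebook rate). That figure is an assumption about the codec rather than something stated in this card, and both helper names are hypothetical.

```python
import math

SAMPLE_RATE = 24_000
SAMPLES_PER_FRAME = 2048   # assumption: audio samples covered by one 7-token frame
TOKENS_PER_FRAME = 7


def tokens_for_seconds(seconds):
    """Tokens needed for `seconds` of audio, plus one stop token."""
    frames = math.ceil(seconds * SAMPLE_RATE / SAMPLES_PER_FRAME)
    return frames * TOKENS_PER_FRAME + 1


def seconds_for_tokens(n_tokens):
    """Audio duration covered by the complete frames in `n_tokens` tokens."""
    frames = n_tokens // TOKENS_PER_FRAME
    return frames * SAMPLES_PER_FRAME / SAMPLE_RATE
```

Under this assumption a 15 second utterance needs about 1,233 new tokens, and the `max_new_tokens=2000` used in the usage script caps output at roughly 24.3 seconds.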

Citation

If you use this model, please also cite the underlying work:

@misc{orpheus2025,
  title  = {Orpheus TTS},
  author = {Canopy Labs},
  year   = {2025},
  url    = {https://github.com/canopyai/Orpheus-TTS}
}

@inproceedings{ai4bharat2024rasa,
  author    = {Praveen Srinivasa Varadhan and Ashwin Sankar and Giri Raju and Mitesh M. Khapra},
  title     = {Rasa: Building Expressive Speech Synthesis Systems for Indian Languages in Low-resource Settings},
  booktitle = {Proc. INTERSPEECH 2024},
  year      = {2024}
}
