---
language:
  - en
license: other
pipeline_tag: text-to-speech
tags:
  - tts
  - voice-cloning
  - audio-generation
  - diffusion-transformer
  - flow-matching
  - ltx-2
library_name: ltx-audio-tts
---

Dramabox - Expressive TTS with Voice Cloning

Dramabox generates expressive, emotionally rich speech from scene descriptions, with optional voice cloning. It is built on a 3.3B-parameter Diffusion Transformer with flow matching, conditioned on Gemma 3 12B text embeddings.

Audio Samples

Regal Queen - Cold Fury to Venomous Whisper

Prompt: A regal woman speaks with cold fury in a measured, low voice. She sighs deeply, "I have told you a thousand times, and yet here we are again." Her voice sharpens with rising anger, "Do you honestly think I enjoy repeating myself?! Do you?!" She lets out a cold, mocking laugh, "Hahaha, how utterly pathetic you are." She drops to a venomous whisper, leaning close, "Now get out of my sight before I do something we will both regret."

Catgirl - Uncontrollable Giggling

Prompt: A playful girl speaks in a bright, singsong voice, already mid-giggle, "Hehehe, oh my gosh you should see your face right now, it is priceless!" She gasps for air between giggles, "Oh my, hehe, oh my, I cannot stop laughing!" She tries to compose herself with a long sigh, "Ahhhhh okay okay okay, I will stop, I promise I will stop." She leans in and whispers conspiratorially, "But seriously though, between you and me," then immediately loses it again, "Haha, no I, hehehe, I just cannot! You are way too funny, haha!" She snorts mid-laugh, "Pfft, oh no no no, that was so embarrassing, pretend you did not hear that!"

Action Hero - Panting Triumph

Prompt: A muscular man speaks with a thick accent, panting heavily, completely out of breath, "Hah... hah... we made it, we actually made it." He coughs roughly, "Ugh, that was the hardest fight of my entire life, I swear." He groans and clutches his side, "Argh, my ribs, I think something is broken." But then a grin spreads and he laughs heartily despite the pain, "Hahaha! But we WON! Can you believe it? We actually won!" He takes a deep, shuddering breath, "I told you, heh, I told you we would make it. Ahhh, it is finally over."

Villain - Sinister Laugh

Prompt: A deep-voiced villain speaks with theatrical menace, chuckling softly at first, "Heheheh. Hahahahahahaha! Oh, forgive me, forgive me." He catches his breath with a sinister grin, He clears his throat. "It is just SO amusing when they struggle, is it not?" His voice drips with contempt, "I expected more from you, truly I did. How disappointing." He leans in close and whispers with vicious intensity, "But fear not, my dear. The REAL entertainment has only just begun." He chuckles one last time, "Heheheh."

Talk Show Host - Wheezing Laughter

Prompt: A talk show host speaks with animated enthusiasm. He gasps with exaggerated shock, "No! You did NOT just say that, tell me you did not just say that!" He bursts into uncontrollable laughter, "HAHAHA! Oh my god, oh my god!" He wheezes, barely getting words out, "I cannot, I literally cannot breathe right now!" He wipes his eyes, sniffling, "Oh that is so good, that is really genuinely good." He sighs happily, "Ahhh okay okay, let me compose myself, I am a professional." He takes one breath then immediately cracks up again, "Pfft hehehe, no I absolutely cannot, I am so sorry everybody!" He claps, "Folks, THIS, this right here, is why I love my job!"

Model Description

Dramabox is a prompt-driven TTS model in which the text prompt controls everything: speaker identity, emotion, delivery style, laughs, sighs, pauses, and transitions. With voice cloning, a reference clip of about 10 seconds conditions the model to reproduce the speaker's timbre and characteristics.

Key Features

- Prompt-driven expressiveness: laughs, sighs, whispers, shouts, and emotional transitions are all controlled by the scene description
- Voice cloning from about 10 s of reference audio
- English speech synthesis
- Fast inference: about 2.5 s per generation with a warm server on an H100

Architecture

| Component | Details |
| --- | --- |
| Transformer | 3.3B-parameter DiT, 48 layers, flow matching (30-step Euler) |
| Text Encoder | Gemma 3 12B (4-bit quantized) + learned embeddings processor |
| Audio VAE | Encodes/decodes 48 kHz audio via mel-spectrogram latents |
| Voice Cloning | Reference audio tokens appended to the target tokens, with an asymmetric attention mask |
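
The table does not spell out what the asymmetric attention mask looks like. As a rough illustration only, the sketch below assumes one common reading: target tokens may attend to every token, while the appended reference tokens attend only to each other, so the reference clip is never influenced by the target being generated. The function name `build_asymmetric_mask` and the masking rule itself are assumptions, not details taken from the model.

```python
def build_asymmetric_mask(n_target: int, n_ref: int) -> list[list[bool]]:
    """Return mask[q][k] == True where query token q may attend to key token k.

    Assumed layout: target tokens first, reference tokens appended after them.
    Assumed rule (not confirmed by the model card): target queries see all
    tokens, reference queries see only reference keys.
    """
    total = n_target + n_ref
    mask = []
    for q in range(total):
        q_is_ref = q >= n_target
        row = []
        for k in range(total):
            k_is_ref = k >= n_target
            # Only one combination is blocked: reference query -> target key.
            row.append(k_is_ref or not q_is_ref)
        mask.append(row)
    return mask


# Print a small mask: "1" = attention allowed, "." = blocked.
for row in build_asymmetric_mask(n_target=3, n_ref=2):
    print("".join("1" if allowed else "." for allowed in row))
```

Under this assumed rule the mask is not symmetric: the target-to-reference block is open while the reference-to-target block is closed.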

Files

| File | Size | Description |
| --- | --- | --- |
| `dramabox-dit-v1.safetensors` | 6.6 GB | DiT transformer (voice cloning weights merged) |
| `dramabox-audio-components.safetensors` | 2.7 GB | Audio VAE encoder/decoder + vocoder + text projection |
| `assets/silence_latent_frame.pt` | 1.5 KB | VAE-encoded silence frame |
| `config.json` | - | Model configuration |

Additional requirement: unsloth/gemma-3-12b-it-bnb-4bit (text encoder, pre-quantized 4-bit, auto-downloaded)
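
As a quick sanity check on these sizes, the arithmetic below estimates weight storage from parameter count and precision. The 16-bit precision for the DiT is an assumption (it is not stated here), though it happens to match the listed 6.6 GB. The figure for the 4-bit text encoder ignores quantization metadata and any layers kept at higher precision, so treat it as a lower bound. The helper name `weight_size_gb` is hypothetical.

```python
def weight_size_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# 3.3B-parameter DiT, assuming 16-bit weights (an assumption, not stated above).
dit_gb = weight_size_gb(3.3e9, 16)

# 12B-parameter text encoder at 4 bits per weight; real checkpoints are larger
# because of quantization metadata and unquantized layers.
gemma_gb = weight_size_gb(12e9, 4)

print(f"DiT weights:         ~{dit_gb:.1f} GB")
print(f"Gemma 3 12B (4-bit): ~{gemma_gb:.1f} GB or more")
```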

Quick Start

from inference_server import TTSServer

# Models auto-download from HuggingFace
server = TTSServer(device="cuda", bnb_4bit=True)

# Text-to-speech
server.generate_to_file(
    prompt='A woman speaks warmly, "Hello, how are you today?" She laughs, "Hahaha, it is so good to see you!"',
    output="output.wav",
)

# Voice cloning
server.generate_to_file(
    prompt='A woman speaks warmly, "Hello, how are you today?"',
    output="cloned.wav",
    voice_ref="reference.wav",  # 10+ seconds of target voice
)
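
Since voice cloning expects 10 or more seconds of reference audio, it can help to check the clip length before calling the server. The sketch below uses only Python's standard `wave` module, so it assumes an uncompressed PCM WAV file. The helper names `wav_duration_seconds` and `check_voice_ref` are hypothetical and not part of `inference_server`.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of an uncompressed PCM WAV file, computed from its header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_voice_ref(path: str, min_seconds: float = 10.0) -> float:
    """Return the clip duration, or raise ValueError if it is too short."""
    duration = wav_duration_seconds(path)
    if duration < min_seconds:
        raise ValueError(
            f"{path} is {duration:.1f}s long; expected at least {min_seconds:.0f}s"
        )
    return duration
```

Calling `check_voice_ref("reference.wav")` before passing the same path as `voice_ref` surfaces a too-short clip as an explicit error.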

Prompt Format

The prompt is a scene description that controls how the model speaks:

<speaker description>, "<dialogue>" <action direction> "<more dialogue>"

What Works Inside Quotes (model produces actual sounds)

Laughs: "Hahaha" "Hehehe" (always as one word, never separated)
Sounds: "Mmmmm" "Ugh" "Argh" "Ahhh" "Hmm"

What Goes Outside Quotes (stage directions)

She sighs deeply. He gulps nervously. A long pause.
Her voice cracks. He clears his throat. She scoffs.
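
The inside/outside split above can be enforced mechanically when prompts are assembled in code. The helper below is a minimal sketch, not part of the Dramabox codebase: `build_prompt` is a hypothetical name, and it simply keeps directions outside the quotes and dialogue inside them.

```python
def build_prompt(speaker: str, segments: list[tuple[str, str]]) -> str:
    """Assemble a scene-description prompt.

    speaker:  opening description, e.g. "A woman speaks warmly"
    segments: (direction, dialogue) pairs; the first direction may be empty
              because the speaker description already introduces the line.
    """
    parts = []
    for i, (direction, dialogue) in enumerate(segments):
        if i == 0:
            lead = f"{speaker}. {direction}" if direction else speaker
        else:
            lead = direction
        lead = lead.rstrip(",. ")
        quoted = f'"{dialogue}"'
        # Stage directions stay outside the quotes; dialogue goes inside.
        parts.append(f"{lead}, {quoted}" if lead else quoted)
    return " ".join(parts)


prompt = build_prompt(
    "A woman speaks warmly",
    [
        ("", "Hello, how are you today?"),
        ("She laughs", "Hahaha, it is so good to see you!"),
    ],
)
print(prompt)
```

With these inputs the result is the same string as the first Quick Start prompt.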

Never Inside Quotes (model speaks them literally)

Ahem, Pfft, Sigh, Gasp, Cough
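
A small check can flag these words before a prompt is sent. The sketch below scans only the text inside straight double quotes and does a whole-word, case-insensitive match against the five words listed above. `find_literal_sound_words` is a hypothetical helper, and the word list is just the one from this section.

```python
import re

# Words this section lists as being spoken literally when placed in quotes.
LITERAL_WORDS = ("ahem", "pfft", "sigh", "gasp", "cough")


def find_literal_sound_words(prompt: str) -> list[str]:
    """Return the listed sound words that appear inside quoted dialogue."""
    hits = []
    for dialogue in re.findall(r'"([^"]*)"', prompt):
        for word in re.findall(r"[A-Za-z]+", dialogue):
            lowered = word.lower()
            if lowered in LITERAL_WORDS and lowered not in hits:
                hits.append(lowered)
    return hits


good = 'A man clears his throat. He coughs roughly, "Ugh, that was hard."'
bad = 'A man speaks, "Ahem, sigh, that was hard."'
print(find_literal_sound_words(good))
print(find_literal_sound_words(bad))
```

Directions such as "He coughs roughly" are not flagged because they sit outside the quotes, which is where this section says they belong.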

Inference Settings

| Parameter | Default | Notes |
| --- | --- | --- |
| cfg_scale | 2.5 | Text adherence (lower = more natural) |
| stg_scale | 1.5 | Skip-token guidance |
| rescale | 0.0 | No rescaling |
| modality | 1.0 | No modality guidance |
| duration_multiplier | 1.1 | 10% extra breathing room |
| steps | 30 | Euler flow matching |
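
To make the `steps` and `cfg_scale` settings more concrete, here is a generic sketch of fixed-step Euler sampling for a flow-matching model with classifier-free guidance. It is not the Dramabox sampler: the time direction (t from 0 to 1), the `velocity(x, t, cond)` interface, and `toy_model` are all assumptions made for illustration, and the skip-token guidance and rescaling terms are left out.

```python
from typing import Callable, Optional

Vector = list[float]
# Hypothetical model interface: cond=None means the unconditional branch.
VelocityFn = Callable[[Vector, float, Optional[Vector]], Vector]


def guided_velocity(
    model: VelocityFn, x: Vector, t: float, cond: Vector, cfg_scale: float
) -> Vector:
    """Classifier-free guidance: v_uncond + cfg_scale * (v_cond - v_uncond)."""
    v_cond = model(x, t, cond)
    v_uncond = model(x, t, None)
    return [u + cfg_scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(
    model: VelocityFn, x0: Vector, cond: Vector, steps: int = 30, cfg_scale: float = 2.5
) -> Vector:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt  # never reaches 1.0, so toy_model never divides by zero
        v = guided_velocity(model, x, t, cond, cfg_scale)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_model(x: Vector, t: float, cond: Optional[Vector]) -> Vector:
    """Stand-in for the DiT: straight-line velocity toward cond (or zeros)."""
    target = cond if cond is not None else [0.0] * len(x)
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]


# With cfg_scale=1.0 the guided velocity equals the conditional velocity.
sample = euler_sample(toy_model, x0=[0.3, -1.2], cond=[1.0, 2.0], cfg_scale=1.0)
print(sample)
```

In this toy setup, a `cfg_scale` above 1.0 pushes the result past the conditional target, which loosely mirrors the table's note that lower values sound more natural while higher values follow the text more strongly.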

VRAM Requirements

| Setup | VRAM | Time per generation |
| --- | --- | --- |
| Warm server (recommended) | ~24 GB | ~2.5 s |
| Cold inference (per-call loading) | ~8 GB peak | ~30 s |

License

Built on LTX-2.3 by Lightricks. Please refer to the LTX-2 license for usage terms.