---
language:
  - en
license: other
pipeline_tag: text-to-speech
tags:
  - tts
  - voice-cloning
  - audio-generation
  - diffusion-transformer
  - flow-matching
  - ltx-2
library_name: ltx-audio-tts
---

# Dramabox - Expressive TTS with Voice Cloning

Dramabox generates expressive, emotionally rich speech from scene descriptions, with optional voice cloning. It is built on a 3.3B-parameter Diffusion Transformer that uses flow matching and is conditioned on Gemma 3 12B text embeddings.

## Audio Samples

### Regal Queen - Cold Fury to Venomous Whisper

**Prompt:** A regal woman speaks with cold fury in a measured, low voice. She sighs deeply, "I have told you a thousand times, and yet here we are again." Her voice sharpens with rising anger, "Do you honestly think I enjoy repeating myself?! Do you?!" She lets out a cold, mocking laugh, "Hahaha, how utterly pathetic you are." She drops to a venomous whisper, leaning close, "Now get out of my sight before I do something we will both regret."

<audio controls src="https://huggingface.co/ResembleAI/Dramabox/resolve/main/samples/01_queen_sighs_rage.wav"></audio>

### Catgirl - Uncontrollable Giggling

**Prompt:** A playful girl speaks in a bright, singsong voice, already mid-giggle, "Hehehe, oh my gosh you should see your face right now, it is priceless!" She gasps for air between giggles, "Oh my, hehe, oh my, I cannot stop laughing!" She tries to compose herself with a long sigh, "Ahhhhh okay okay okay, I will stop, I promise I will stop." She leans in and whispers conspiratorially, "But seriously though, between you and me," then immediately loses it again, "Haha, no I, hehehe, I just cannot! You are way too funny, haha!" She snorts mid-laugh, "Pfft, oh no no no, that was so embarrassing, pretend you did not hear that!"

<audio controls src="https://huggingface.co/ResembleAI/Dramabox/resolve/main/samples/04_catgirl_giggles_snort.wav"></audio>

### Action Hero - Panting Triumph

**Prompt:** A muscular man speaks with a thick accent, panting heavily, completely out of breath, "Hah... hah... we made it, we actually made it." He coughs roughly, "Ugh, that was the hardest fight of my entire life, I swear." He groans and clutches his side, "Argh, my ribs, I think something is broken." But then a grin spreads and he laughs heartily despite the pain, "Hahaha! But we WON! Can you believe it? We actually won!" He takes a deep, shuddering breath, "I told you, heh, I told you we would make it. Ahhh, it is finally over."

<audio controls src="https://huggingface.co/ResembleAI/Dramabox/resolve/main/samples/06_arnie_panting_triumph.wav"></audio>

### Villain - Sinister Laugh

**Prompt:** A deep-voiced villain speaks with theatrical menace, chuckling softly at first, "Heheheh. Hahahahahahaha! Oh, forgive me, forgive me." He catches his breath with a sinister grin, He clears his throat. "It is just SO amusing when they struggle, is it not?" His voice drips with contempt, "I expected more from you, truly I did. How disappointing." He leans in close and whispers with vicious intensity, "But fear not, my dear. The REAL entertainment has only just begun." He chuckles one last time, "Heheheh."

<audio controls src="https://huggingface.co/ResembleAI/Dramabox/resolve/main/samples/09_villain_sinister_laugh.wav"></audio>

### Talk Show Host - Wheezing Laughter

**Prompt:** A talk show host speaks with animated enthusiasm. He gasps with exaggerated shock, "No! You did NOT just say that, tell me you did not just say that!" He bursts into uncontrollable laughter, "HAHAHA! Oh my god, oh my god!" He wheezes, barely getting words out, "I cannot, I literally cannot breathe right now!" He wipes his eyes, sniffling, "Oh that is so good, that is really genuinely good." He sighs happily, "Ahhh okay okay, let me compose myself, I am a professional." He takes one breath then immediately cracks up again, "Pfft hehehe, no I absolutely cannot, I am so sorry everybody!" He claps, "Folks, THIS, this right here, is why I love my job!"

<audio controls src="https://huggingface.co/ResembleAI/Dramabox/resolve/main/samples/13_conan_wheezing_laughter.wav"></audio>

---

## Model Description

Dramabox is a prompt-driven TTS model where **the text prompt controls everything** - speaker identity, emotion, delivery style, laughs, sighs, pauses, and transitions. With voice cloning, a 10-second reference clip conditions the model to reproduce the speaker's timbre and characteristics.

### Key Features

- **Prompt-driven expressiveness** - laughs, sighs, whispers, shouts, emotional transitions all controlled by the scene description
- **Voice cloning** from a 10-second reference clip
- **English** speech synthesis
- **Fast inference** - ~2.5s per generation with a warm server on an H100

### Architecture

| Component | Details |
|-----------|---------|
| **Transformer** | 3.3B parameter DiT, 48 layers, flow matching (30-step Euler) |
| **Text Encoder** | Gemma 3 12B (4-bit quantized via bitsandbytes) + learned embeddings processor |
| **Audio VAE** | Encodes/decodes 48kHz audio via mel spectrogram latents |
| **Voice Cloning** | Reference audio tokens appended to target with asymmetric attention mask |
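
The table does not spell out what "asymmetric" means for the voice-cloning mask. The sketch below shows one plausible reading, stated as an assumption rather than the model's actual implementation: target tokens may attend to every token, while the appended reference tokens attend only to each other, so the reference clip conditions the target without being altered by it. The function name `asymmetric_mask` is hypothetical.

```python
from typing import List


def asymmetric_mask(n_target: int, n_ref: int) -> List[List[bool]]:
    """Build a boolean attention mask for the sequence [target..., reference...].

    mask[q][k] is True when query position q may attend to key position k.
    ASSUMPTION: target queries see everything; reference queries see only
    reference keys. The real model may use a different rule.
    """
    total = n_target + n_ref
    mask = []
    for q in range(total):
        q_is_target = q < n_target
        row = []
        for k in range(total):
            k_is_reference = k >= n_target
            row.append(q_is_target or k_is_reference)
        mask.append(row)
    return mask


# Two target tokens followed by one reference token.
for row in asymmetric_mask(2, 1):
    print(row)
```

In a real transformer this mask would typically be converted to additive `-inf` biases on the attention logits; that step is omitted here.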

## Files

| File | Size | Description |
|------|------|-------------|
| `dramabox-dit-v1.safetensors` | 6.6 GB | DiT transformer (voice cloning weights merged) |
| `dramabox-audio-components.safetensors` | 2.7 GB | Audio VAE encoder/decoder + vocoder + text projection |
| `assets/silence_latent_frame.pt` | 1.5 KB | VAE-encoded silence frame |
| `config.json` | - | Model configuration |

**Additional requirement**: [unsloth/gemma-3-12b-it-bnb-4bit](https://huggingface.co/unsloth/gemma-3-12b-it-bnb-4bit) (text encoder, pre-quantized 4-bit, auto-downloaded)

## Quick Start

```python
from inference_server import TTSServer

# Models auto-download from HuggingFace
server = TTSServer(device="cuda", bnb_4bit=True)

# Text-to-speech
server.generate_to_file(
    prompt='A woman speaks warmly, "Hello, how are you today?" She laughs, "Hahaha, it is so good to see you!"',
    output="output.wav",
)

# Voice cloning
server.generate_to_file(
    prompt='A woman speaks warmly, "Hello, how are you today?"',
    output="cloned.wav",
    voice_ref="reference.wav",  # at least 10 seconds of the target voice
)
```

## Prompt Format

The prompt is a scene description that controls how the model speaks:

```
<speaker description>, "<dialogue>" <action direction> "<more dialogue>"
```

### What Works Inside Quotes (model produces actual sounds)
- Laughs: `"Hahaha"` `"Hehehe"` (always as one word, never separated)
- Sounds: `"Mmmmm"` `"Ugh"` `"Argh"` `"Ahhh"` `"Hmm"`

### What Goes Outside Quotes (stage directions)
- `She sighs deeply.` `He gulps nervously.` `A long pause.`
- `Her voice cracks.` `He clears his throat.` `She scoffs.`

### Never Inside Quotes (model speaks them literally)
- Ahem, Pfft, Sigh, Gasp, Cough
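
The rules above can be checked mechanically before a prompt is sent to the model. The following is a minimal sketch, not part of the Dramabox codebase: `build_prompt` and `literal_words_in_quotes` are hypothetical helpers that assemble a prompt in the format shown earlier and flag the listed words when they appear inside double quotes.

```python
import re
from typing import List, Tuple

# Words the list above says are spoken literally when placed inside quotes.
LITERAL_WORDS = {"ahem", "pfft", "sigh", "gasp", "cough"}


def build_prompt(speaker: str, beats: List[Tuple[str, str]]) -> str:
    """Join a speaker description with (direction, dialogue) beats.

    An empty direction means the dialogue follows the preceding text directly.
    """
    parts = [speaker.rstrip(" ,") + ","]
    for direction, dialogue in beats:
        if direction:
            parts.append(direction.strip())
        parts.append('"' + dialogue.strip() + '"')
    return " ".join(parts)


def literal_words_in_quotes(prompt: str) -> List[str]:
    """Return quoted words that the model would read aloud instead of performing."""
    found = []
    for quoted in re.findall(r'"([^"]*)"', prompt):
        for word in re.findall(r"[A-Za-z]+", quoted):
            if word.lower() in LITERAL_WORDS:
                found.append(word)
    return found


prompt = build_prompt(
    "A woman speaks warmly",
    [
        ("", "Hello, how are you today?"),
        ("She laughs,", "Hahaha, it is so good to see you!"),
    ],
)
print(prompt)
print(literal_words_in_quotes('He clears his throat, "Ahem, shall we begin?"'))
```

A flagged word is usually best moved outside the quotes as a stage direction, for example `He clears his throat.` in place of a quoted `Ahem`.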

## Inference Settings

| Parameter | Default | Notes |
|-----------|---------|-------|
| cfg_scale | 2.5 | Classifier-free guidance scale; controls text adherence (lower = more natural) |
| stg_scale | 1.5 | Skip-token guidance |
| rescale | 0.0 | No rescaling |
| modality | 1.0 | No modality guidance |
| duration_multiplier | 1.1 | 10% extra breathing room |
| steps | 30 | Euler flow matching |
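
To illustrate how `steps` and `cfg_scale` interact, here is a toy sketch of Euler integration with standard classifier-free guidance. It is not the Dramabox sampler: the velocity function is a stand-in for the DiT, the time convention (t runs from 0 to 1) is an assumption, and `stg_scale`, `rescale`, and `modality` are not modeled. The names `guided_velocity` and `euler_sample` are hypothetical.

```python
from typing import Callable, List

Vector = List[float]
# (x, t, conditioned) -> predicted velocity; a stand-in for the transformer.
VelocityFn = Callable[[Vector, float, bool], Vector]


def guided_velocity(v_cond: Vector, v_uncond: Vector, cfg_scale: float) -> Vector:
    """Classifier-free guidance: extrapolate from the unconditional estimate
    toward the text-conditioned one."""
    return [u + cfg_scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(
    velocity: VelocityFn, x0: Vector, steps: int = 30, cfg_scale: float = 2.5
) -> Vector:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = guided_velocity(velocity(x, t, True), velocity(x, t, False), cfg_scale)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy velocity field: a constant push from the start point toward a target.
start = [0.0, 1.0]
target = [2.0, -1.0]


def toy_velocity(x: Vector, t: float, conditioned: bool) -> Vector:
    return [b - a for a, b in zip(start, target)]


print(euler_sample(toy_velocity, start))
```

With a constant velocity field the Euler steps land on the target up to floating-point error, regardless of `cfg_scale`. In a real model the conditional and unconditional predictions differ, and each step costs two forward passes when guidance is active.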

## VRAM Requirements

| Setup | VRAM | Speed |
|-------|------|-------|
| Warm server (recommended) | **~24 GB** | **~2.5s** |
| Cold inference (per-call loading) | ~8 GB peak | ~30s |

## License

Built on [LTX-2.3](https://github.com/Lightricks/LTX-2) by Lightricks. Please refer to the LTX-2 license for usage terms.