--- license: other license_name: ltx-2-community license_link: https://huggingface.co/ResembleAI/Dramabox/blob/main/LICENSE base_model: ResembleAI/Dramabox tags: - tts - text-to-speech - audio - quantized - int8 - dramabox - torchao - diffusion-transformer - flow-matching library_name: pytorch pipeline_tag: text-to-speech --- # DramaBox DiT INT8 — Selective Weight-Only Quantization A selectively quantized version of the [DramaBox TTS](https://huggingface.co/ResembleAI/Dramabox) 3.3B DiT (Diffusion Transformer) model from [Resemble AI](https://huggingface.co/ResembleAI). Reduces VRAM by 20% and checkpoint size by 45% while preserving audio quality. > **Base model:** [ResembleAI/Dramabox](https://huggingface.co/ResembleAI/Dramabox) | **Code:** [resemble-ai/DramaBox](https://github.com/resemble-ai/DramaBox) | **Architecture:** LTX-2.3 DiT + Gemma 3 12B ## What's included | File | Size | Description | |------|------|-------------| | `dramabox-dit-int8-selective.safetensors` | 3.37 GB | Quantized DiT weights (INT8 data + BF16 scales) | | `config.json` | 28 KB | Layer map: which 562 layers are quantized | | `load_int8.py` | 3.6 KB | Loader script (works with or without torchao) | | `inference_optimized.py` | 4.3 KB | Full pipeline with INT8 + Gemma CPU offload | You still need the other components from [ResembleAI/Dramabox](https://huggingface.co/ResembleAI/Dramabox): - `dramabox-audio-components.safetensors` (1.9 GB) — VAE + vocoder - [unsloth/gemma-3-12b-it-bnb-4bit](https://huggingface.co/unsloth/gemma-3-12b-it-bnb-4bit) (~8 GB) — text encoder ## Results | Metric | Baseline (BF16) | This model (INT8) | Change | |--------|-----------------|-------------------|--------| | DiT checkpoint size | 6.1 GB | 3.37 GB | **-45%** | | Peak VRAM | 17.39 GB | 13.8 GB | **-20.6%** | | VRAM during denoising | 17.39 GB | 5.93 GB | **-65.9%** | | Audio quality (MCD) | 0.0 dB | 4.98 dB | Within threshold | | Generation time | 2.62s | 3.22s | +23% | MCD (Mel-Cepstral Distortion) measures spectral distance from the BF16 baseline. Lower is better. Scores below 5.0 dB are perceptually near-identical for speech. ## Quantization details **Method:** Selective INT8 weight-only quantization via [torchao](https://github.com/pytorch/ao) `Int8WeightOnlyConfig`. Weights are stored as INT8 with per-channel BF16 scales and dequantized at runtime during matrix multiplication. **What's quantized (562 layers, ~81.5% of DiT parameters):** - All attention projections (`to_q`, `to_k`, `to_v`, `to_out`) across all 48 transformer blocks - All `gate_logits` layers - All FFN GELU projections (`audio_ff.net.0.proj`) across all 48 blocks - FFN output projections (`audio_ff.net.2`) in blocks 15–47, excluding block 17 - Input/output projections (`audio_patchify_proj`, `audio_proj_out`) **What's NOT quantized (kept in BF16):** - All normalization layers — extremely sensitive to precision changes - AdaLN conditioning layers — controls the diffusion process globally - Timestep embedder — conditioning pathway, highly sensitive - FFN output projections in blocks 0–14 — early blocks are most sensitive to quantization - FFN output projection in block 17 — anomalously sensitive individual block This layer map was discovered through 80+ automated experiments using [Andrej Karpathy's auto-research methodology](https://github.com/karpathy/autoresearch), systematically testing each layer type and block index. ## Usage ### Option 1: Runtime quantization (simplest, no extra downloads) If you just want VRAM savings without downloading this checkpoint, you can apply quantization at load time to the original DramaBox model: ```python import torch, re from torchao.quantization import quantize_, Int8WeightOnlyConfig # After loading the standard DramaBox TTSServer: attn_proj_keys = ("to_q", "to_k", "to_v", "to_out") def dit_filter(mod, fqn): if not isinstance(mod, torch.nn.Linear): return False if "norm" in fqn: return False if "gate_logits" in fqn: return True if any(k in fqn for k in attn_proj_keys): return True if "audio_ff" in fqn: m = re.search(r'transformer_blocks\.(\d+)\.', fqn) if m: idx = int(m.group(1)) if "net.2" in fqn and idx >= 15 and idx != 17: return True if "net.0.proj" in fqn: return True return False def io_filter(mod, fqn): return fqn in ("audio_patchify_proj", "audio_proj_out") and isinstance(mod, torch.nn.Linear) quantize_(tts._velocity_model, Int8WeightOnlyConfig(), filter_fn=dit_filter) quantize_(tts._velocity_model, Int8WeightOnlyConfig(), filter_fn=io_filter) ``` ### Option 2: Load pre-quantized weights (faster startup) ```python from load_int8 import load_int8_dit # Loads the INT8 safetensors and reconstructs quantized Linear layers load_int8_dit(tts._velocity_model, "dramabox-dit-int8-selective.safetensors") ``` ### Option 3: Full optimized pipeline with Gemma offload For maximum VRAM savings (5.93 GB during denoising), use the included `inference_optimized.py` which also offloads Gemma 12B to CPU between text encoding and audio generation. ## Requirements - PyTorch >= 2.4 - torchao >= 0.15.0 - CUDA GPU with >= 16 GB VRAM (14 GB with Gemma offload) - The original DramaBox model and its dependencies ## How this was made We ran 80+ experiments using an automated loop inspired by Karpathy's auto-research methodology: 1. Start from the BF16 baseline 2. Modify quantization config (which layers, which precision, which blocks) 3. Generate 3 evaluation audio samples with fixed prompts/seeds 4. Measure peak VRAM, generation time, and MCD vs baseline 5. Keep the change if MCD < 5.0 dB, discard otherwise 6. Repeat Key findings from the search: - **Flow-matching diffusion models are far more precision-sensitive than autoregressive LLMs.** All 4-bit approaches (NF4, NVFP4, FP4, Int4) produced unacceptable quality (MCD 17–32 dB). - **FP8 is worse than INT8** for weight representation in this model (MCD 11.8 vs 4.35). - **`torch.compile` breaks audio output** even on the unquantized baseline (MCD 24–32 dB). The iterative denoising loop is numerically sensitive to graph optimizations. - **Early transformer blocks (0–14) are most sensitive** in their FFN output projections. Block 17 is an outlier. - **Attention projections and GELU gates are universally robust** to INT8 across all 48 blocks. ## Citation If you use this work, please cite the original DramaBox model: ```bibtex @misc{dramabox2025, title={DramaBox: Expressive Text to Speech Model}, author={Resemble AI}, year={2025}, url={https://github.com/resemble-ai/DramaBox} } ``` ## License Same as the base DramaBox model — [LTX-2 Community License](https://huggingface.co/ResembleAI/Dramabox/blob/main/LICENSE).