	---
	license: other
	license_name: ltx-2-community
	license_link: https://huggingface.co/ResembleAI/Dramabox/blob/main/LICENSE
	base_model: ResembleAI/Dramabox
	tags:
	- tts
	- text-to-speech
	- audio
	- quantized
	- int8
	- dramabox
	- torchao
	- diffusion-transformer
	- flow-matching
	library_name: pytorch
	pipeline_tag: text-to-speech
	---

	# DramaBox DiT INT8 — Selective Weight-Only Quantization

	A selectively quantized version of the [DramaBox TTS](https://huggingface.co/ResembleAI/Dramabox) 3.3B DiT (Diffusion Transformer) model from [Resemble AI](https://huggingface.co/ResembleAI). It cuts peak VRAM by about 20% and DiT checkpoint size by about 45%, while keeping Mel-Cepstral Distortion against the BF16 baseline under 5 dB.

	> Base model: [ResembleAI/Dramabox](https://huggingface.co/ResembleAI/Dramabox) | Code: [resemble-ai/DramaBox](https://github.com/resemble-ai/DramaBox) | Architecture: LTX-2.3 DiT + Gemma 3 12B

	## What's included

	| File | Size | Description |
	|------|------|-------------|
	| `dramabox-dit-int8-selective.safetensors` | 3.37 GB | Quantized DiT weights (INT8 data + BF16 scales) |
	| `config.json` | 28 KB | Layer map: which 562 layers are quantized |
	| `load_int8.py` | 3.6 KB | Loader script (works with or without torchao) |
	| `inference_optimized.py` | 4.3 KB | Full pipeline with INT8 + Gemma CPU offload |

	You still need two components that are not part of this repository, the first from [ResembleAI/Dramabox](https://huggingface.co/ResembleAI/Dramabox) and the second from unsloth:
	- `dramabox-audio-components.safetensors` (1.9 GB) — VAE + vocoder
	- [unsloth/gemma-3-12b-it-bnb-4bit](https://huggingface.co/unsloth/gemma-3-12b-it-bnb-4bit) (~8 GB) — text encoder

	## Results

	| Metric | Baseline (BF16) | This model (INT8) | Change |
	|--------|-----------------|-------------------|--------|
	| DiT checkpoint size | 6.1 GB | 3.37 GB | -45% |
	| Peak VRAM | 17.39 GB | 13.8 GB | -20.6% |
	| VRAM during denoising | 17.39 GB | 5.93 GB | -65.9% |
	| Audio quality (MCD) | 0.0 dB | 4.98 dB | Within threshold |
	| Generation time | 2.62 s | 3.22 s | +23% |

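	MCD (Mel-Cepstral Distortion) measures spectral distance from the BF16 baseline, so the baseline scores 0.0 dB by definition. Lower is better. This card uses 5.0 dB as its acceptance threshold and treats scores below it as perceptually near-identical for speech.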
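
The evaluation script is not part of this card, so the following is only a minimal sketch of the standard MCD formula, not the code that produced the numbers above. It assumes the mel-cepstral coefficients have already been extracted and time-aligned (real pipelines often run dynamic time warping first); the function name is illustrative.

```python
import math


def mel_cepstral_distortion(ref_frames, test_frames):
    """Mean MCD in dB between two equal-length sequences of mel-cepstral vectors.

    Each frame is a sequence of coefficients. Index 0 (the energy term) is
    skipped, as is conventional. Frames are assumed to be time-aligned.
    """
    if len(ref_frames) != len(test_frames):
        raise ValueError("frame sequences must have equal length")
    if not ref_frames:
        raise ValueError("need at least one frame")

    const = 10.0 / math.log(10.0)
    total = 0.0
    for ref, test in zip(ref_frames, test_frames):
        # Squared Euclidean distance over coefficients 1..D for this frame.
        sq_dist = sum((r - t) ** 2 for r, t in zip(ref[1:], test[1:]))
        total += const * math.sqrt(2.0 * sq_dist)
    return total / len(ref_frames)
```

Comparing a signal with itself returns 0.0 dB, which is why the baseline column in the table reads 0.0.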

	## Quantization details

	Method: Selective INT8 weight-only quantization via [torchao](https://github.com/pytorch/ao) `Int8WeightOnlyConfig`. Weights are stored as INT8 with per-channel BF16 scales and dequantized at runtime during matrix multiplication.
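
To make "INT8 with per-channel scales" concrete, here is a dependency-free sketch of symmetric per-channel quantization and dequantization on a small weight matrix. It illustrates the general idea only: torchao's kernels operate on tensors, keep scales in BF16, and may use a different rounding or integer range, so treat the helper names and details below as assumptions.

```python
def quantize_per_channel(weight):
    """Symmetric INT8 quantization with one scale per output channel (row)."""
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(w) for w in row)
        # An all-zero row gets a dummy scale to avoid dividing by zero.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize_per_channel(q_rows, scales):
    """Recover approximate weights: each integer times its row's scale."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


weight = [[0.5, -1.0, 0.25], [0.02, 0.01, -0.04]]
q, s = quantize_per_channel(weight)
approx = dequantize_per_channel(q, s)
```

Because each row has its own scale, a row with small weights (the second one here) keeps roughly the same relative precision as a row with large weights. The per-element reconstruction error is bounded by half of that row's scale.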

	What's quantized (562 layers, ~81.5% of DiT parameters):
	- All attention projections (`to_q`, `to_k`, `to_v`, `to_out`) across all 48 transformer blocks
	- All `gate_logits` layers
	- All FFN GELU projections (`audio_ff.net.0.proj`) across all 48 blocks
	- FFN output projections (`audio_ff.net.2`) in blocks 15–47, excluding block 17
	- Input/output projections (`audio_patchify_proj`, `audio_proj_out`)

	What's NOT quantized (kept in BF16):
	- All normalization layers — extremely sensitive to precision changes
	- AdaLN conditioning layers — controls the diffusion process globally
	- Timestep embedder — conditioning pathway, highly sensitive
	- FFN output projections in blocks 0–14 — early blocks are most sensitive to quantization
	- FFN output projection in block 17 — anomalously sensitive individual block
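
The block-index rule for the FFN output projections can be spelled out explicitly. The sketch below only restates the rule from the two lists above (48 blocks, quantize `audio_ff.net.2` from block 15 onward, skip block 17); the variable names are illustrative and are not taken from `config.json`.

```python
NUM_BLOCKS = 48  # transformer blocks in the DiT, per the lists above

# Blocks whose FFN output projection (audio_ff.net.2) is stored as INT8.
ffn_out_int8 = [i for i in range(NUM_BLOCKS) if i >= 15 and i != 17]

# Blocks whose FFN output projection stays in BF16: 0-14 plus the outlier 17.
ffn_out_bf16 = [i for i in range(NUM_BLOCKS) if i not in ffn_out_int8]

# 32 blocks quantized, 16 kept in BF16
print(len(ffn_out_int8), len(ffn_out_bf16))
```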

	This layer map was discovered through 80+ automated experiments using [Andrej Karpathy's auto-research methodology](https://github.com/karpathy/autoresearch), systematically testing each layer type and block index.

	## Usage

	### Option 1: Runtime quantization (simplest, no extra downloads)

	If you just want VRAM savings without downloading this checkpoint, you can apply quantization at load time to the original DramaBox model:

	```python
	import re

	import torch
	from torchao.quantization import quantize_, Int8WeightOnlyConfig

	# After loading the standard DramaBox TTSServer (assumed here to be bound to `tts`):
	attn_proj_keys = ("to_q", "to_k", "to_v", "to_out")

	def dit_filter(mod, fqn):
	    """Return True for Linear layers inside the DiT that are safe to quantize."""
	    if not isinstance(mod, torch.nn.Linear): return False
	    if "norm" in fqn: return False  # normalization layers stay in BF16
	    if "gate_logits" in fqn: return True
	    if any(k in fqn for k in attn_proj_keys): return True
	    if "audio_ff" in fqn:
	        m = re.search(r'transformer_blocks\.(\d+)\.', fqn)
	        if m:
	            idx = int(m.group(1))
	            # FFN output projection: only blocks 15-47, excluding block 17
	            if "net.2" in fqn and idx >= 15 and idx != 17: return True
	            # FFN GELU projection: all blocks
	            if "net.0.proj" in fqn: return True
	    return False

	def io_filter(mod, fqn):
	    """Return True for the top-level input/output projections."""
	    return fqn in ("audio_patchify_proj", "audio_proj_out") and isinstance(mod, torch.nn.Linear)

	quantize_(tts._velocity_model, Int8WeightOnlyConfig(), filter_fn=dit_filter)
	quantize_(tts._velocity_model, Int8WeightOnlyConfig(), filter_fn=io_filter)
	```

	### Option 2: Load pre-quantized weights (faster startup)

	```python
	from load_int8 import load_int8_dit

	# Loads the INT8 safetensors and reconstructs quantized Linear layers
	load_int8_dit(tts._velocity_model, "dramabox-dit-int8-selective.safetensors")
	```

	### Option 3: Full optimized pipeline with Gemma offload

	For maximum VRAM savings (5.93 GB during denoising), use the included `inference_optimized.py`, which combines the INT8 DiT with offloading Gemma 12B to CPU between text encoding and audio generation.
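
The offload idea itself is a generic pattern: keep the text encoder on the GPU only while it is encoding, then move it away before denoising starts. The sketch below shows that pattern as a context manager. It is not the contents of `inference_optimized.py`; `temporarily_on` and its parameters are hypothetical names, and whether `.to()` is supported on a bitsandbytes 4-bit model depends on the installed library versions.

```python
from contextlib import contextmanager


@contextmanager
def temporarily_on(module, device, offload_device="cpu", after_offload=None):
    """Move `module` to `device` for the duration of the block, then move it back.

    `module` is anything with a `.to(device)` method, such as a torch.nn.Module.
    `after_offload` is an optional zero-argument callback run after the move
    back; with PyTorch one would typically pass `torch.cuda.empty_cache`.
    """
    module.to(device)
    try:
        yield module
    finally:
        module.to(offload_device)
        if after_offload is not None:
            after_offload()


# Hypothetical usage (text_encoder and encode are placeholders, not DramaBox API):
# with temporarily_on(text_encoder, "cuda", after_offload=torch.cuda.empty_cache):
#     embeddings = encode(prompt)
# ...run the DiT denoising loop here, with the encoder back on the CPU...
```

The `try`/`finally` ensures the encoder is moved back even if encoding raises, so a failed request does not leave GPU memory occupied.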

	## Requirements

	- PyTorch >= 2.4
	- torchao >= 0.15.0
	- CUDA GPU with >= 16 GB VRAM (14 GB with Gemma offload)
	- The original DramaBox model and its dependencies

	## How this was made

	We ran 80+ experiments using an automated loop inspired by Karpathy's auto-research methodology:

	1. Start from the BF16 baseline
	2. Modify quantization config (which layers, which precision, which blocks)
	3. Generate 3 evaluation audio samples with fixed prompts/seeds
	4. Measure peak VRAM, generation time, and MCD vs baseline
	5. Keep the change if MCD < 5.0 dB, discard otherwise
	6. Repeat
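
The loop above can be sketched as a greedy accept/reject search. This is an illustration of the procedure as described, not the actual experiment harness: `greedy_search`, the config dictionary layout, and the `evaluate` callback (which would generate the evaluation samples and return the measured metrics) are all assumptions.

```python
def greedy_search(baseline, candidate_changes, evaluate, mcd_threshold=5.0):
    """Try candidate changes one at a time, keeping those under the MCD threshold.

    baseline:          dict describing the starting (BF16) configuration
    candidate_changes: iterable of dicts, each a modification to try
    evaluate:          callable(config) -> dict with at least an "mcd_db" key
    """
    current = dict(baseline)
    log = []
    for change in candidate_changes:
        trial = {**current, **change}      # step 2: modify the config
        metrics = evaluate(trial)          # steps 3-4: generate and measure
        kept = metrics["mcd_db"] < mcd_threshold
        if kept:                           # step 5: keep or discard
            current = trial
        log.append((change, metrics, kept))
    return current, log
```

Each accepted change becomes the starting point for the next trial, so the final configuration is the accumulation of every change that stayed under the threshold. The log keeps rejected trials too, which is how findings such as sensitive layer groups become visible.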

	Key findings from the search:
	- Flow-matching diffusion models are far more precision-sensitive than autoregressive LLMs. All 4-bit approaches (NF4, NVFP4, FP4, Int4) produced unacceptable quality (MCD 17–32 dB).
	- FP8 is worse than INT8 for weight representation in this model (MCD 11.8 dB vs 4.35 dB).
	- `torch.compile` breaks audio output even on the unquantized baseline (MCD 24–32 dB). The iterative denoising loop is numerically sensitive to graph optimizations.
	- Early transformer blocks (0–14) are most sensitive in their FFN output projections. Block 17 is an outlier.
	- Attention projections and FFN GELU projections (`audio_ff.net.0.proj`) are robust to INT8 across all 48 blocks.

	## Citation

	If you use this work, please cite the original DramaBox model:

	```bibtex
	@misc{dramabox2025,
	  title={DramaBox: Expressive Text to Speech Model},
	  author={Resemble AI},
	  year={2025},
	  url={https://github.com/resemble-ai/DramaBox}
	}
	```

	## License

	Same as the base DramaBox model — [LTX-2 Community License](https://huggingface.co/ResembleAI/Dramabox/blob/main/LICENSE).