DACVAE-RVQ: 16K Vocabulary, 400 Tokens/sec
A Residual Vector Quantizer (RVQ) trained to discretize DACVAE latent representations into discrete tokens suitable for autoregressive language model generation.
Why Quantize DACVAE?
The Problem
DACVAE is an exceptionally efficient latent audio codec: its 128-dimensional continuous latent space at 25 fps is ideal for diffusion models (DiTs), flow matching, and other continuous generative approaches. However:
DiTs struggle with language understanding. Diffusion Transformers are powerful for audio generation but are not naturally suited to understanding complex natural language instructions, e.g., "speak this line with a warm, slightly raspy voice, as if comforting a child" or "deliver this with growing frustration, starting calm and ending in a shout."
LLMs excel at instruction following. Large Language Models are unmatched at understanding nuanced text instructions, reasoning about style/emotion/prosody, and generating structured outputs, but they need discrete tokens, not continuous vectors.
The Solution: RVQ as a Bridge
This RVQ model bridges the gap between LLMs and DACVAE:
Text Instruction → LLM → Discrete RVQ Tokens → Codebook Lookup → DACVAE Latents → Audio

The LLM is the part that understands voice-design instructions; the codebook lookup is an instant decode, with no neural network involved, just a table lookup and a sum.
The key insight: RVQ tokens can be added to an LLM's vocabulary, enabling a unified model that understands both text and audio. The LLM generates coarse audio tokens that capture semantic content and speaker characteristics, while a lightweight acoustic model (or DiT) refines them into high-fidelity audio.
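One way to realize this is to append the shared 16,384-entry RVQ vocabulary after the LLM's text vocabulary. A minimal sketch follows; the base vocabulary size and the special-token ids are assumptions for illustration, not part of this repo:

```python
# Hypothetical sketch: mapping RVQ token ids into an extended LLM vocabulary.
# BASE_VOCAB (the text tokenizer's size) and the audio delimiter ids below
# are illustrative assumptions.

BASE_VOCAB = 128_000                      # assumed text vocabulary size
RVQ_VOCAB = 16_384                        # shared across all RVQ levels
AUDIO_START = BASE_VOCAB + RVQ_VOCAB      # special tokens placed after the audio block
AUDIO_END = AUDIO_START + 1

def rvq_to_llm(token_id: int) -> int:
    """Shift an RVQ token id (0..16383) into the LLM's extended vocabulary."""
    assert 0 <= token_id < RVQ_VOCAB
    return BASE_VOCAB + token_id

def llm_to_rvq(llm_id: int) -> int:
    """Recover the RVQ token id from an extended-vocabulary id."""
    assert BASE_VOCAB <= llm_id < BASE_VOCAB + RVQ_VOCAB
    return llm_id - BASE_VOCAB

# An audio span in the LLM's output stream:
seq = [AUDIO_START] + [rvq_to_llm(t) for t in (2341, 8821)] + [AUDIO_END]
```

Because the vocabulary is shared across levels, the same 16,384-id block covers every quantizer level; how levels are interleaved per frame is a modeling choice left to the LLM's sequence format.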
The Hybrid Architecture Vision
┌────────────────────────────────────────────────────────────────
│ LLM (Instruction Understanding)
│
│ Input:  "Say 'Hello world' in a deep, warm male voice"
│ Output: Coarse RVQ tokens (levels 1-2)
│         • +16K vocab, 50 tokens/sec of speech
│         • Captures speaker identity, phonetics, prosody
└───────────────────────────┬────────────────────────────────────
                            │ coarse tokens
                            ▼
┌────────────────────────────────────────────────────────────────
│ DiT / Flow Matching (Acoustic Refinement)
│
│ Input:  Coarse tokens + text alignment
│ Output: Full fine-grained DACVAE latents
│         • High-quality, natural-sounding speech
└───────────────────────────┬────────────────────────────────────
                            │ continuous latents
                            ▼
┌────────────────────────────────────────────────────────────────
│ DACVAE Decoder
│
│ Input:  128-dim latent vectors at 25 fps
│ Output: 48 kHz waveform
└────────────────────────────────────────────────────────────────
Model Details
| Property | Value |
|---|---|
| Architecture | Residual Vector Quantization (RVQ) |
| Quantizer levels | 16 |
| Codebook size | 16,384 per level |
| Input dimension | 128 (DACVAE latent dim) |
| Parameters | 67.4M (pure codebook lookup tables) |
| DACVAE frame rate | 25 fps |
| Total token rate | 400 tokens/sec (16 levels × 25 fps) |
| Vocabulary addition for LLM | +16,384 tokens (shared across levels) |
| Training data | TTS-AGI/maestrino-data-DACVAE |
| Training vectors | 9.5 billion (1.58B/epoch × 6 epochs) |
| Hardware | 8× NVIDIA A100-SXM4-80GB |
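The rate figures in the table are simple arithmetic on the frame rate, level count, and codebook size; the implied bitrate (derived here, not stated in the repo) is a useful sanity check:

```python
import math

FRAME_RATE = 25          # DACVAE frames per second
NUM_LEVELS = 16          # RVQ quantizer levels
CODEBOOK_SIZE = 16_384   # entries per level

tokens_per_sec = NUM_LEVELS * FRAME_RATE       # full stack: 400 tokens/sec
coarse_tokens_per_sec = 2 * FRAME_RATE         # levels 1-2 only: 50 tokens/sec
bits_per_token = math.log2(CODEBOOK_SIZE)      # 14 bits per token
bitrate_kbps = tokens_per_sec * bits_per_token / 1000  # 5.6 kbps at full depth
```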
Reconstruction Quality
| Metric | Value |
|---|---|
| Cosine Similarity | 0.9647 |
| MSE | 0.0390 |
| Angular Error | ~15° |
| SNR | ~11.6 dB |
Coarse-to-Fine Quality (Cumulative Levels)
The RVQ is hierarchical: each level refines the previous reconstruction:
| Levels | CosSim | What it captures | LLM use case |
|---|---|---|---|
| 1 | 0.656 | Rough speaker/pitch | Minimal |
| 1-2 | 0.747 | Speaker identity + phonetics | LLM coarse tokens (50 tok/s) |
| 1-4 | 0.832 | Good semantic content | LLM fine tokens (100 tok/s) |
| 1-8 | 0.906 | High quality | Acoustic model input |
| 1-12 | 0.943 | Very high quality | Near-transparent |
| 1-16 | 0.965 | Near-perfect | Full reconstruction |
Quick Start
Installation
pip install vector-quantize-pytorch torch numpy
Encode & Decode
from encode_decode import RVQCodec
import numpy as np
# Load the codec
codec = RVQCodec("rvq_cb16384_nq16.pt", device="cuda")
# Encode DACVAE latents to discrete tokens
latent_vectors = np.load("my_dacvae_latents.npy") # (T, 128) float16/32
tokens = codec.encode(latent_vectors) # (T, 16) int32
print(f"Compressed {latent_vectors.shape} to {tokens.shape}")
print("Token vocabulary: 0-16383")
# Decode back to continuous latents (pure lookup, instant)
reconstructed = codec.decode(tokens) # (T, 128) float32
# Coarse decode: only the first 2 levels (for LLM output)
coarse = codec.decode(tokens, num_levels=2)  # still (T, 128), but rougher
Integration with DACVAE
from dacvae import DACVAE
import torch
# Load DACVAE decoder
dacvae = DACVAE.load("facebook/dacvae-watermarked").eval().cuda()
# RVQ tokens → DACVAE latents → Audio
codec = RVQCodec("rvq_cb16384_nq16.pt", device="cuda")
reconstructed = codec.decode(tokens)  # (T, 128); `tokens` from the encoding example above
# DACVAE expects (batch, channels, time)
latent = torch.from_numpy(reconstructed).unsqueeze(0).permute(0, 2, 1).cuda()
audio = dacvae.decode(latent) # (1, 1, num_samples) at 48kHz
LLM Integration Strategy
For adding audio generation to an LLM:
# Use only the first two RVQ levels in the LLM vocabulary.
# Because the codebooks share one vocabulary, this adds just
# 16,384 tokens and produces 50 tokens/sec of speech.
# The LLM generates coarse tokens:
#   <audio_start> tok_2341 tok_8821 tok_1123 tok_9942 ... <audio_end>
# A separate acoustic model predicts the remaining 14 fine levels
# from the coarse tokens; then decode via RVQ + DACVAE.
coarse_tokens = llm_output                   # (T, 2) from the LLM
fine_tokens = acoustic_model(coarse_tokens)  # (T, 14) predicted
all_tokens = np.concatenate([coarse_tokens, fine_tokens], axis=1)  # (T, 16)
latents = codec.decode(all_tokens)
Architecture
There is no encoder or decoder neural network. The entire model is 16 codebook lookup tables:
Encoding (nearest-neighbor search):
Input x (128-dim float vector)
  → find the entry in codebook_0 nearest to x        → token_0, residual = x - codebook_0[token_0]
  → find the entry in codebook_1 nearest to residual → token_1, residual -= codebook_1[token_1]
  → ... repeat for all 16 levels
Output: 16 integer tokens (each 0-16383)
Decoding (pure addition):
reconstruction = codebook_0[t0] + codebook_1[t1] + ... + codebook_15[t15]
Each codebook is a (16384, 128) float32 matrix. Decoding is a single gather + sum operation, so no neural network inference is required.
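The two procedures above fit in a few lines of NumPy. This is a toy sketch with small random codebooks, not the trained model (the real codebooks are (16384, 128) and there are 16 levels):

```python
import numpy as np

# Toy illustration of greedy residual vector quantization.
rng = np.random.default_rng(0)
NUM_LEVELS, CODEBOOK_SIZE, DIM = 4, 64, 8
codebooks = rng.normal(size=(NUM_LEVELS, CODEBOOK_SIZE, DIM)).astype(np.float32)

def rvq_encode(x: np.ndarray) -> np.ndarray:
    """One nearest-neighbor search per level, quantizing the running residual."""
    residual = x.copy()
    tokens = np.empty((x.shape[0], NUM_LEVELS), dtype=np.int32)
    for level in range(NUM_LEVELS):
        # Squared distance from each residual to every codebook entry
        d = ((residual[:, None, :] - codebooks[level][None, :, :]) ** 2).sum(-1)
        tokens[:, level] = d.argmin(axis=1)
        residual -= codebooks[level][tokens[:, level]]
    return tokens

def rvq_decode(tokens: np.ndarray, num_levels: int = NUM_LEVELS) -> np.ndarray:
    """Pure gather + sum: no neural network involved."""
    return sum(codebooks[l][tokens[:, l]] for l in range(num_levels))

x = rng.normal(size=(10, DIM)).astype(np.float32)
tokens = rvq_encode(x)
recon = rvq_decode(tokens)
```

Decoding with a smaller `num_levels` simply truncates the sum, which is why coarse reconstructions come for free.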
Training Details
- Method: Exponential Moving Average (EMA) codebook updates, not gradient descent
- Dataset: TTS-AGI/maestrino-data-DACVAE (1,137 WebDataset tar files containing ~5.5M speech samples with 128-dim DACVAE latent vectors)
- Scale: 1.58 billion vectors per epoch, 6 epochs = 9.5 billion training vectors
- Streaming: Data streamed directly from HuggingFace (no local storage needed)
- Multi-GPU: 8 shards across 8× A100 GPUs, codebooks averaged after each epoch
- Library: vector-quantize-pytorch
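A simplified single-codebook version of the EMA update is sketched below; the decay and Laplace-smoothing constants are illustrative, not the training values, and the actual implementation lives in vector-quantize-pytorch:

```python
import numpy as np

def ema_codebook_update(codebook, ema_count, ema_sum, batch, decay=0.99, eps=1e-5):
    """One EMA update step for a single VQ codebook (k-means style, no gradients).

    codebook:  (K, D) current codebook entries
    ema_count: (K,)   running count of vectors assigned to each code
    ema_sum:   (K, D) running sum of vectors assigned to each code
    batch:     (N, D) training vectors
    """
    K = codebook.shape[0]
    # Assign each vector to its nearest code
    d = ((batch[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    onehot = np.eye(K, dtype=batch.dtype)[d.argmin(axis=1)]  # (N, K)

    # Exponential moving averages of per-code counts and vector sums
    ema_count = decay * ema_count + (1 - decay) * onehot.sum(axis=0)
    ema_sum = decay * ema_sum + (1 - decay) * (onehot.T @ batch)

    # Laplace smoothing avoids division by zero for rarely used codes
    n = ema_count.sum()
    smoothed = (ema_count + eps) / (n + K * eps) * n
    codebook = ema_sum / smoothed[:, None]
    return codebook, ema_count, ema_sum
```

Each code drifts toward the running mean of the vectors assigned to it, which is why the codebooks can simply be averaged across GPU shards after an epoch.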
Files
| File | Description |
|---|---|
| `rvq_cb16384_nq16.pt` | Model weights (258 MB): 16 codebook tables |
| `config.json` | Model configuration and training metadata |
| `encode_decode.py` | Python codec class for encoding/decoding |
| `README.md` | This file |
Citation
@misc{dacvae-rvq-2026,
title={DACVAE-RVQ: Residual Vector Quantization for DACVAE Latent Tokenization},
author={TTS-AGI},
year={2026},
url={https://huggingface.co/TTS-AGI/DACVAE-RVQ-vocab16k-400-hz}
}
License
MIT