DACVAE-RVQ β€” 16K Vocabulary, 400 Tokens/sec

A Residual Vector Quantizer (RVQ) trained to quantize DACVAE latent representations into discrete tokens suitable for autoregressive language-model generation.

Why Quantize DACVAE?

The Problem

DACVAE is an exceptionally efficient latent audio codec: its 128-dimensional continuous latent space at 25 fps is ideal for diffusion models (DiTs), flow matching, and other continuous generative approaches. However:

  1. DiTs struggle with language understanding. Diffusion Transformers are powerful audio generators, but they are not naturally suited to following complex natural-language instructions, e.g., "speak this line with a warm, slightly raspy voice, as if comforting a child" or "deliver this with growing frustration, starting calm and ending in a shout."

  2. LLMs excel at instruction following. Large Language Models are unmatched at understanding nuanced text instructions, reasoning about style, emotion, and prosody, and generating structured outputs, but they need discrete tokens, not continuous vectors.

The Solution: RVQ as a Bridge

This RVQ model bridges the gap between LLMs and DACVAE:

```
Text Instruction → LLM → Discrete RVQ Tokens → Codebook Lookup → DACVAE Latents → Audio
                   ^^^                         ^^^^^^^^^^^^^^^
              Understands voice           Instant decode (no neural net!)
              design instructions         Just a table lookup + sum
```

The key insight: RVQ tokens can be added to an LLM's vocabulary, enabling a unified model that understands both text and audio. The LLM generates coarse audio tokens that capture semantic content and speaker characteristics, while a lightweight acoustic model (or DiT) refines them into high-fidelity audio.
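As a concrete sketch of extending the vocabulary, coarse RVQ tokens can be offset past the LLM's text vocabulary and interleaved frame by frame. The 128K base vocabulary size and the interleaved layout below are illustrative assumptions, not part of this model card:

```python
BASE_VOCAB = 128_000    # assumed size of the LLM's text vocabulary (hypothetical)
CODEBOOK_SIZE = 16_384  # one shared 16K id block, reused across levels

def rvq_to_llm_ids(tokens):
    # tokens: (T, num_coarse_levels) integers in [0, 16384);
    # emit them interleaved, level by level within each frame,
    # shifted into the extended region of the LLM vocabulary
    return [BASE_VOCAB + int(t) for frame in tokens for t in frame]

def llm_ids_to_rvq(ids, num_levels=2):
    # inverse mapping: strip the offset and regroup into frames
    flat = [i - BASE_VOCAB for i in ids]
    return [flat[k:k + num_levels] for k in range(0, len(flat), num_levels)]
```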

The Hybrid Architecture Vision

```
┌──────────────────────────────────────────────────────────────┐
│                    LLM (Instruction Understanding)           │
│                                                              │
│  Input:  "Say 'Hello world' in a deep, warm male voice"      │
│  Output: Coarse RVQ tokens (levels 1-2)                      │
│          → +16K vocab, 50 tokens/sec of speech               │
│          → Captures speaker identity, phonetics, prosody     │
└───────────────────────┬──────────────────────────────────────┘
                        │ coarse tokens
                        ▼
┌──────────────────────────────────────────────────────────────┐
│              DiT / Flow Matching (Acoustic Refinement)       │
│                                                              │
│  Input:  Coarse tokens + text alignment                      │
│  Output: Full fine-grained DACVAE latents                    │
│          → High-quality, natural-sounding speech             │
└───────────────────────┬──────────────────────────────────────┘
                        │ continuous latents
                        ▼
┌──────────────────────────────────────────────────────────────┐
│                    DACVAE Decoder                            │
│                                                              │
│  Input:  128-dim latent vectors at 25 fps                    │
│  Output: 48 kHz waveform                                     │
└──────────────────────────────────────────────────────────────┘
```

Model Details

| Property | Value |
|---|---|
| Architecture | Residual Vector Quantization (RVQ) |
| Quantizer levels | 16 |
| Codebook size | 16,384 per level |
| Input dimension | 128 (DACVAE latent dim) |
| Parameters | 67.4M (pure codebook lookup tables) |
| DACVAE frame rate | 25 fps |
| Total token rate | 400 tokens/sec (16 levels × 25 fps) |
| Vocabulary addition for LLM | +16,384 tokens (shared across levels) |
| Training data | TTS-AGI/maestrino-data-DACVAE |
| Training vectors | 9.5 billion (1.58B/epoch × 6 epochs) |
| Hardware | 8× NVIDIA A100-SXM4-80GB |
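The token-rate figures in the table follow directly from the frame rate and level count:

```python
FRAME_RATE = 25   # DACVAE frames per second
NUM_LEVELS = 16   # RVQ quantizer levels

def token_count(seconds: float, levels: int = NUM_LEVELS) -> int:
    """Number of RVQ tokens needed to represent `seconds` of audio."""
    return int(seconds * FRAME_RATE) * levels

# Full RVQ:  25 fps x 16 levels = 400 tokens/sec
# Coarse (2 levels, for the LLM): 25 fps x 2 levels = 50 tokens/sec
```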

Reconstruction Quality

| Metric | Value |
|---|---|
| Cosine similarity | 0.9647 |
| MSE | 0.0390 |
| Angular error | ~15° |
| SNR | ~11.6 dB |

Coarse-to-Fine Quality (Cumulative Levels)

The RVQ is hierarchical: each level refines the reconstruction from the previous levels:

| Levels | CosSim | What it captures | LLM use case |
|---|---|---|---|
| 1 | 0.656 | Rough speaker/pitch | Minimal |
| 1-2 | 0.747 | Speaker identity + phonetics | LLM coarse tokens (50 tok/s) |
| 1-4 | 0.832 | Good semantic content | LLM fine tokens (100 tok/s) |
| 1-8 | 0.906 | High quality | Acoustic model input |
| 1-12 | 0.943 | Very high quality | Near-transparent |
| 1-16 | 0.965 | Near-perfect | Full reconstruction |
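The CosSim column can be reproduced with a few lines of NumPy, given original latents and their partial decodes (obtained as in the Quick Start below):

```python
import numpy as np

def mean_cosine_similarity(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Mean per-frame cosine similarity between two (T, D) arrays."""
    num = (x * x_hat).sum(axis=-1)
    den = np.linalg.norm(x, axis=-1) * np.linalg.norm(x_hat, axis=-1)
    return float((num / den).mean())

# Reproducing the table (assumes a loaded RVQCodec as in Quick Start):
# for k in (1, 2, 4, 8, 12, 16):
#     print(k, mean_cosine_similarity(latents, codec.decode(tokens, num_levels=k)))
```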

Quick Start

Installation

```bash
pip install vector-quantize-pytorch torch numpy
```

Encode & Decode

```python
from encode_decode import RVQCodec
import numpy as np

# Load the codec
codec = RVQCodec("rvq_cb16384_nq16.pt", device="cuda")

# Encode DACVAE latents to discrete tokens
latent_vectors = np.load("my_dacvae_latents.npy")  # (T, 128) float16/32
tokens = codec.encode(latent_vectors)              # (T, 16) int32
print(f"Compressed {latent_vectors.shape} to {tokens.shape}")
print("Token vocabulary: 0-16383")

# Decode back to continuous latents (pure lookup, instant)
reconstructed = codec.decode(tokens)  # (T, 128) float32

# Coarse decode: only the first 2 levels (for LLM output)
coarse = codec.decode(tokens, num_levels=2)  # still (T, 128), but rougher
```

Integration with DACVAE

```python
from encode_decode import RVQCodec
from dacvae import DACVAE
import torch

# Load DACVAE decoder
dacvae = DACVAE.load("facebook/dacvae-watermarked").eval().cuda()

# RVQ tokens → DACVAE latents → audio
codec = RVQCodec("rvq_cb16384_nq16.pt", device="cuda")
reconstructed = codec.decode(tokens)  # (T, 128)

# DACVAE expects (batch, channels, time)
latent = torch.from_numpy(reconstructed).unsqueeze(0).permute(0, 2, 1).cuda()
audio = dacvae.decode(latent)  # (1, 1, num_samples) at 48 kHz
```

LLM Integration Strategy

For adding audio generation to an LLM:

```python
import numpy as np

# Only the first 2 RVQ levels go into the LLM vocabulary.
# This adds just 16,384 tokens (shared across levels) and
# produces 50 tokens/sec of speech.

# The LLM generates coarse tokens:
#   <audio_start> tok_2341 tok_8821 tok_1123 tok_9942 ... <audio_end>

# A separate acoustic model predicts the remaining 14 fine levels
# from the coarse tokens; then decode via RVQ + DACVAE.

coarse_tokens = llm_output                   # (T, 2) from the LLM
fine_tokens = acoustic_model(coarse_tokens)  # (T, 14) predicted
all_tokens = np.concatenate([coarse_tokens, fine_tokens], axis=1)  # (T, 16)
latents = codec.decode(all_tokens)
```

Architecture

There is no encoder or decoder neural network. The entire model is 16 codebook lookup tables:

```
Encoding (nearest-neighbor search):
  Input x (128-dim float vector)
    → Find nearest entry in codebook_0 → token_0, residual = x - codebook_0[token_0]
    → Find nearest entry in codebook_1 → token_1, residual -= codebook_1[token_1]
    → ... repeat for all 16 levels
  Output: 16 integer tokens (each 0-16383)

Decoding (pure addition):
  reconstruction = codebook_0[t0] + codebook_1[t1] + ... + codebook_15[t15]
```

Each codebook is a (16384, 128) float32 matrix. Decoding is a single gather + sum operation; no neural network inference is required.

Training Details

- Method: Exponential Moving Average (EMA) codebook updates, not gradient-based
- Dataset: TTS-AGI/maestrino-data-DACVAE, 1,137 WebDataset tar files containing ~5.5M speech samples with 128-dim DACVAE latent vectors
- Scale: 1.58 billion vectors per epoch × 6 epochs = 9.5 billion training vectors
- Streaming: data streamed directly from HuggingFace (no local storage needed)
- Multi-GPU: 8 data shards across 8× A100 GPUs, codebooks averaged after each epoch
- Library: vector-quantize-pytorch

Files

| File | Description |
|---|---|
| rvq_cb16384_nq16.pt | Model weights (258 MB): 16 codebook tables |
| config.json | Model configuration and training metadata |
| encode_decode.py | Python codec class for encoding/decoding |
| README.md | This file |

Citation

```bibtex
@misc{dacvae-rvq-2026,
  title={DACVAE-RVQ: Residual Vector Quantization for DACVAE Latent Tokenization},
  author={TTS-AGI},
  year={2026},
  url={https://huggingface.co/TTS-AGI/DACVAE-RVQ-vocab16k-400-hz}
}
```

License

MIT
