DACVAE-RVQ: 16K Vocabulary, 400 Tokens/sec
A Residual Vector Quantizer (RVQ) trained to discretize DACVAE latent representations into discrete tokens suitable for autoregressive language model generation.
Why Quantize DACVAE?
The Problem
DACVAE is an exceptionally efficient latent audio codec: its 128-dimensional continuous latent space at 25 fps is ideal for diffusion models (DiTs), flow matching, and other continuous generative approaches. However:
DiTs struggle with language understanding. Diffusion Transformers are powerful for audio generation but are not naturally suited to understanding complex natural language instructions, e.g., "speak this line with a warm, slightly raspy voice, as if comforting a child" or "deliver this with growing frustration, starting calm and ending in a shout."
LLMs excel at instruction following. Large Language Models are unmatched at understanding nuanced text instructions, reasoning about style/emotion/prosody, and generating structured outputs, but they need discrete tokens, not continuous vectors.
The Solution: RVQ as a Bridge
This RVQ model bridges the gap between LLMs and DACVAE:
Text Instruction → LLM → Discrete RVQ Tokens → Codebook Lookup → DACVAE Latents → Audio

The LLM is the part that understands voice-design instructions; the codebook lookup is an instant decode, with no neural network involved, just a table lookup and a sum.
The key insight: RVQ tokens can be added to an LLM's vocabulary, enabling a unified model that understands both text and audio. The LLM generates coarse audio tokens that capture semantic content and speaker characteristics, while a lightweight acoustic model (or DiT) refines them into high-fidelity audio.
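One way to realize this is to append the shared 16,384-entry RVQ vocabulary after the LLM's text vocabulary. A minimal sketch follows; the base vocabulary size and the special-token ids are assumptions for illustration, not part of this repo:

```python
# Hypothetical sketch: mapping RVQ token ids into an extended LLM vocabulary.
# BASE_VOCAB (the text tokenizer's size) and the audio delimiter ids below
# are illustrative assumptions.

BASE_VOCAB = 128_000                      # assumed text vocabulary size
RVQ_VOCAB = 16_384                        # shared across all RVQ levels
AUDIO_START = BASE_VOCAB + RVQ_VOCAB      # special tokens placed after the audio block
AUDIO_END = AUDIO_START + 1

def rvq_to_llm(token_id: int) -> int:
    """Shift an RVQ token id (0..16383) into the LLM's extended vocabulary."""
    assert 0 <= token_id < RVQ_VOCAB
    return BASE_VOCAB + token_id

def llm_to_rvq(llm_id: int) -> int:
    """Recover the RVQ token id from an extended-vocabulary id."""
    assert BASE_VOCAB <= llm_id < BASE_VOCAB + RVQ_VOCAB
    return llm_id - BASE_VOCAB

# An audio span in the LLM's output stream:
seq = [AUDIO_START] + [rvq_to_llm(t) for t in (2341, 8821)] + [AUDIO_END]
```

Because the vocabulary is shared across levels, the same 16,384-id block covers every quantizer level; how levels are interleaved per frame is a modeling choice left to the LLM's sequence format.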
The Hybrid Architecture Vision
┌────────────────────────────────────────────────────────────────
│ LLM (Instruction Understanding)
│
│ Input:  "Say 'Hello world' in a deep, warm male voice"
│ Output: Coarse RVQ tokens (levels 1-2)
│         • +16K vocab, 50 tokens/sec of speech
│         • Captures speaker identity, phonetics, prosody
└───────────────────────────┬────────────────────────────────────
                            │ coarse tokens
                            ▼
┌────────────────────────────────────────────────────────────────
│ DiT / Flow Matching (Acoustic Refinement)
│
│ Input:  Coarse tokens + text alignment
│ Output: Full fine-grained DACVAE latents
│         • High-quality, natural-sounding speech
└───────────────────────────┬────────────────────────────────────
                            │ continuous latents
                            ▼
┌────────────────────────────────────────────────────────────────
│ DACVAE Decoder
│
│ Input:  128-dim latent vectors at 25 fps
│ Output: 48 kHz waveform
└────────────────────────────────────────────────────────────────
Model Details
| Property | Value |
|---|---|
| Architecture | Residual Vector Quantization (RVQ) |
| Quantizer levels | 16 |
| Codebook size | 16,384 per level |
| Input dimension | 128 (DACVAE latent dim) |
| Parameters | 67.4M (pure codebook lookup tables) |
| DACVAE frame rate | 25 fps |
| Total token rate | 400 tokens/sec (16 levels × 25 fps) |
| Vocabulary addition for LLM | +16,384 tokens (shared across levels) |
| Training data | TTS-AGI/maestrino-data-DACVAE |
| Training vectors | 9.5 billion (1.58B/epoch × 6 epochs) |
| Hardware | 8× NVIDIA A100-SXM4-80GB |
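The rate figures in the table are simple arithmetic on the frame rate, level count, and codebook size; the implied bitrate (derived here, not stated in the repo) is a useful sanity check:

```python
import math

FRAME_RATE = 25          # DACVAE frames per second
NUM_LEVELS = 16          # RVQ quantizer levels
CODEBOOK_SIZE = 16_384   # entries per level

tokens_per_sec = NUM_LEVELS * FRAME_RATE       # full stack: 400 tokens/sec
coarse_tokens_per_sec = 2 * FRAME_RATE         # levels 1-2 only: 50 tokens/sec
bits_per_token = math.log2(CODEBOOK_SIZE)      # 14 bits per token
bitrate_kbps = tokens_per_sec * bits_per_token / 1000  # 5.6 kbps at full depth
```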
Reconstruction Quality
| Metric | Value |
|---|---|
| Cosine Similarity | 0.9647 |
| MSE | 0.0390 |
| Angular Error | ~15° |
| SNR | ~11.6 dB |
Coarse-to-Fine Quality (Cumulative Levels)
The RVQ is hierarchical: each level refines the previous reconstruction:
| Levels | CosSim | What it captures | LLM use case |
|---|---|---|---|
| 1 | 0.656 | Rough speaker/pitch | Minimal |
| 1-2 | 0.747 | Speaker identity + phonetics | LLM coarse tokens (50 tok/s) |
| 1-4 | 0.832 | Good semantic content | LLM fine tokens (100 tok/s) |
| 1-8 | 0.906 | High quality | Acoustic model input |
| 1-12 | 0.943 | Very high quality | Near-transparent |
| 1-16 | 0.965 | Near-perfect | Full reconstruction |
Quick Start
Installation
pip install vector-quantize-pytorch torch numpy
Encode & Decode
from encode_decode import RVQCodec
import numpy as np
# Load the codec
codec = RVQCodec("rvq_cb16384_nq16.pt", device="cuda")
# Encode DACVAE latents to discrete tokens
latent_vectors = np.load("my_dacvae_latents.npy") # (T, 128) float16/32
tokens = codec.encode(latent_vectors) # (T, 16) int32
print(f"Compressed {latent_vectors.shape} to {tokens.shape}")
print("Token vocabulary: 0-16383")
# Decode back to continuous latents (pure lookup, instant)
reconstructed = codec.decode(tokens) # (T, 128) float32
# Coarse decode: only the first 2 levels (for LLM output)
coarse = codec.decode(tokens, num_levels=2)  # still (T, 128), but rougher
Integration with DACVAE
from dacvae import DACVAE
import torch
# Load DACVAE decoder
dacvae = DACVAE.load("facebook/dacvae-watermarked").eval().cuda()
# RVQ tokens → DACVAE latents → Audio
codec = RVQCodec("rvq_cb16384_nq16.pt", device="cuda")
reconstructed = codec.decode(tokens)  # (T, 128); `tokens` from the encoding example above
# DACVAE expects (batch, channels, time)
latent = torch.from_numpy(reconstructed).unsqueeze(0).permute(0, 2, 1).cuda()
audio = dacvae.decode(latent) # (1, 1, num_samples) at 48kHz
LLM Integration Strategy
For adding audio generation to an LLM:
# Use only the first two RVQ levels in the LLM vocabulary.
# Because the codebooks share one vocabulary, this adds just
# 16,384 tokens and produces 50 tokens/sec of speech.
# The LLM generates coarse tokens:
#   <audio_start> tok_2341 tok_8821 tok_1123 tok_9942 ... <audio_end>
# A separate acoustic model predicts the remaining 14 fine levels
# from the coarse tokens; then decode via RVQ + DACVAE.
coarse_tokens = llm_output                   # (T, 2) from the LLM
fine_tokens = acoustic_model(coarse_tokens)  # (T, 14) predicted
all_tokens = np.concatenate([coarse_tokens, fine_tokens], axis=1)  # (T, 16)
latents = codec.decode(all_tokens)
Architecture
There is no encoder or decoder neural network. The entire model is 16 codebook lookup tables:
Encoding (nearest-neighbor search):
Input x (128-dim float vector)
  → find the entry in codebook_0 nearest to x        → token_0, residual = x - codebook_0[token_0]
  → find the entry in codebook_1 nearest to residual → token_1, residual -= codebook_1[token_1]
  → ... repeat for all 16 levels
Output: 16 integer tokens (each 0-16383)
Decoding (pure addition):
reconstruction = codebook_0[t0] + codebook_1[t1] + ... + codebook_15[t15]
Each codebook is a (16384, 128) float32 matrix. Decoding is a single gather + sum operation, so no neural network inference is required.
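The two procedures above fit in a few lines of NumPy. This is a toy sketch with small random codebooks, not the trained model (the real codebooks are (16384, 128) and there are 16 levels):

```python
import numpy as np

# Toy illustration of greedy residual vector quantization.
rng = np.random.default_rng(0)
NUM_LEVELS, CODEBOOK_SIZE, DIM = 4, 64, 8
codebooks = rng.normal(size=(NUM_LEVELS, CODEBOOK_SIZE, DIM)).astype(np.float32)

def rvq_encode(x: np.ndarray) -> np.ndarray:
    """One nearest-neighbor search per level, quantizing the running residual."""
    residual = x.copy()
    tokens = np.empty((x.shape[0], NUM_LEVELS), dtype=np.int32)
    for level in range(NUM_LEVELS):
        # Squared distance from each residual to every codebook entry
        d = ((residual[:, None, :] - codebooks[level][None, :, :]) ** 2).sum(-1)
        tokens[:, level] = d.argmin(axis=1)
        residual -= codebooks[level][tokens[:, level]]
    return tokens

def rvq_decode(tokens: np.ndarray, num_levels: int = NUM_LEVELS) -> np.ndarray:
    """Pure gather + sum: no neural network involved."""
    return sum(codebooks[l][tokens[:, l]] for l in range(num_levels))

x = rng.normal(size=(10, DIM)).astype(np.float32)
tokens = rvq_encode(x)
recon = rvq_decode(tokens)
```

Decoding with a smaller `num_levels` simply truncates the sum, which is why coarse reconstructions come for free.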
Training Details
- Method: Exponential Moving Average (EMA) codebook updates, not gradient descent
- Dataset: TTS-AGI/maestrino-data-DACVAE (1,137 WebDataset tar files containing ~5.5M speech samples with 128-dim DACVAE latent vectors)
- Scale: 1.58 billion vectors per epoch, 6 epochs = 9.5 billion training vectors
- Streaming: Data streamed directly from HuggingFace (no local storage needed)
- Multi-GPU: 8 shards across 8× A100 GPUs, codebooks averaged after each epoch
- Library: vector-quantize-pytorch
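A simplified single-codebook version of the EMA update is sketched below; the decay and Laplace-smoothing constants are illustrative, not the training values, and the actual implementation lives in vector-quantize-pytorch:

```python
import numpy as np

def ema_codebook_update(codebook, ema_count, ema_sum, batch, decay=0.99, eps=1e-5):
    """One EMA update step for a single VQ codebook (k-means style, no gradients).

    codebook:  (K, D) current codebook entries
    ema_count: (K,)   running count of vectors assigned to each code
    ema_sum:   (K, D) running sum of vectors assigned to each code
    batch:     (N, D) training vectors
    """
    K = codebook.shape[0]
    # Assign each vector to its nearest code
    d = ((batch[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    onehot = np.eye(K, dtype=batch.dtype)[d.argmin(axis=1)]  # (N, K)

    # Exponential moving averages of per-code counts and vector sums
    ema_count = decay * ema_count + (1 - decay) * onehot.sum(axis=0)
    ema_sum = decay * ema_sum + (1 - decay) * (onehot.T @ batch)

    # Laplace smoothing avoids division by zero for rarely used codes
    n = ema_count.sum()
    smoothed = (ema_count + eps) / (n + K * eps) * n
    codebook = ema_sum / smoothed[:, None]
    return codebook, ema_count, ema_sum
```

Each code drifts toward the running mean of the vectors assigned to it, which is why the codebooks can simply be averaged across GPU shards after an epoch.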
Files
| File | Description |
|---|---|
| `rvq_cb16384_nq16.pt` | Model weights (258 MB): 16 codebook tables |
| `config.json` | Model configuration and training metadata |
| `encode_decode.py` | Python codec class for encoding/decoding |
| `README.md` | This file |
Citation
@misc{dacvae-rvq-2026,
title={DACVAE-RVQ: Residual Vector Quantization for DACVAE Latent Tokenization},
author={TTS-AGI},
year={2026},
url={https://huggingface.co/TTS-AGI/DACVAE-RVQ-vocab16k-400-hz}
}
License
MIT