Speechless TWI — Stage 1: RVQ for Gemma 4 Audio Encoder (512 codes)

A trained residual vector quantizer (RVQ) that converts Gemma 4 audio encoder (1024-dim Conformer) representations into discrete semantic tokens for Twi/Akan.

Architecture

Component Details

  • Audio Encoder: rnagabh/gemma4-audio-encoder (304.8M params, USM Conformer)
  • Encoder output: 1024-dim pre-projection embeddings @ 25 fps
  • RVQ: 8 codebooks × 512 entries each
  • Training data: mixed Twi/Akan (~40,261 samples, ~100 h)
  • Loss: 0.5·MSE + 0.5·(1 − cosine similarity) + commitment loss
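The training loss above can be sketched as follows. This is an illustrative reconstruction of the stated formula, not the repo's actual training code; the commitment-term weight `beta` is an assumption, since the card does not state it.

```python
import torch
import torch.nn.functional as F

def stage1_loss(target, recon, commit, beta=1.0):
    """Stage 1 objective (sketch): 0.5*MSE + 0.5*(1 - cosine) + commitment.

    target, recon: (batch, 1024) encoder embeddings and RVQ reconstructions.
    commit: precomputed commitment loss from the quantizer.
    beta: commitment weight -- hypothetical, not given in the card.
    """
    mse = F.mse_loss(recon, target)
    cos = F.cosine_similarity(recon, target, dim=-1).mean()
    return 0.5 * mse + 0.5 * (1.0 - cos) + beta * commit
```

A perfect reconstruction with zero commitment loss drives the total to zero, which is a quick sanity check when wiring this up.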

Training Data Sources

  • Financial Inclusion Twi Multispeaker (~30h, ~200 speakers)
  • Fante Multispeaker (~8h subsample)
  • BibleTTS Asante Twi (~5h subsample, studio quality)
  • BibleTTS Akuapem Twi (~5h subsample, studio quality)
  • Common Voice Twi (~3h)
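As a rough sanity check on scale, the figures above (25 fps encoder output, 8 codebooks, ~100 h of audio) imply the token rates below. The ~100 h total is approximate, per the card.

```python
# Token-rate arithmetic implied by the card's numbers.
FPS = 25             # encoder frames per second
NUM_QUANTIZERS = 8   # RVQ codebooks per frame
HOURS = 100          # approximate corpus size

codes_per_second = FPS * NUM_QUANTIZERS   # RVQ codes per second of audio
total_frames = HOURS * 3600 * FPS         # encoder frames over the corpus
print(codes_per_second, total_frames)     # 200 9000000
```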

Usage

import json, importlib.util, torch
from huggingface_hub import hf_hub_download

REPO_ID = "ik/speechless-twi-stage1-rvq-gemma4-512"

# Fetch the config, checkpoint, and model-class file from the Hub.
cfg_path = hf_hub_download(REPO_ID, "config_stage1.json")
ckpt_path = hf_hub_download(REPO_ID, "rvq_best.pt")
wrp_path = hf_hub_download(REPO_ID, "rvq_wrapper.py")

with open(cfg_path) as f:
    cfg = json.load(f)

# Import rvq_wrapper.py directly from the downloaded file.
spec = importlib.util.spec_from_file_location("rvq_wrapper", wrp_path)
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)

# Build the RVQ and load the best-eval checkpoint.
rvq = mod.RVQWrapper(cfg["rvq_dim"], cfg["rvq_num_quantizers"], cfg["rvq_codebook_size"])
ckpt = torch.load(ckpt_path, map_location="cpu")
rvq.load_state_dict(ckpt["rvq"], strict=False)
rvq.eval()
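For intuition about what the 8 × 512 quantizer does, here is a minimal greedy residual-VQ encoder: each stage picks the nearest code to the current residual and subtracts it, so a 1024-dim frame becomes 8 integer indices. This is an illustrative sketch only; the repo's rvq_wrapper.py defines the actual model and its API.

```python
import torch

def rvq_encode(x, codebooks):
    """Greedy residual VQ (sketch). x: (B, D); each codebook: (K, D).

    Returns (B, num_quantizers) integer codes and the reconstruction
    built by summing the chosen code from each stage.
    """
    residual = x
    recon = torch.zeros_like(x)
    codes = []
    for cb in codebooks:
        d = torch.cdist(residual, cb)   # (B, K) distances to every code
        idx = d.argmin(dim=-1)          # nearest code per vector
        chosen = cb[idx]                # (B, D) selected code vectors
        recon = recon + chosen
        residual = residual - chosen    # next stage quantizes what's left
        codes.append(idx)
    return torch.stack(codes, dim=-1), recon
```

In the trained model, D = 1024 with 8 codebooks of 512 entries, so each 40 ms frame is represented by 8 indices in [0, 512).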

Files

  • rvq_best.pt: best checkpoint (lowest eval MSE)
  • rvq_final.pt: final checkpoint (end of training)
  • config_stage1.json: training configuration
  • rvq_wrapper.py: model class definition (needed for loading)

Generated: 2026-04-13 12:38
