Speechless TWI — Stage 1: RVQ for Gemma 4 Audio Encoder (512 codes)

A trained residual vector quantizer (RVQ) that converts Gemma 4 audio encoder (1024-dim Conformer) representations into discrete semantic tokens for Twi/Akan.

Architecture

Component Details

  • Audio Encoder: rnagabh/gemma4-audio-encoder (304.8M params, USM Conformer)
  • Encoder output: 1024-dim pre-projection embeddings @ 25 fps
  • RVQ: 8 codebooks × 512 entries each
  • Training data: mixed Twi/Akan (~40,261 samples, ~100 h)
  • Loss: 0.5·MSE + 0.5·(1 − cosine similarity) + commitment loss
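The training loss above can be sketched as follows. This is an illustrative reconstruction of the stated formula, not the repo's actual training code; the commitment-term weight `beta` is an assumption, since the card does not state it.

```python
import torch
import torch.nn.functional as F

def stage1_loss(target, recon, commit, beta=1.0):
    """Stage 1 objective (sketch): 0.5*MSE + 0.5*(1 - cosine) + commitment.

    target, recon: (batch, 1024) encoder embeddings and RVQ reconstructions.
    commit: precomputed commitment loss from the quantizer.
    beta: commitment weight -- hypothetical, not given in the card.
    """
    mse = F.mse_loss(recon, target)
    cos = F.cosine_similarity(recon, target, dim=-1).mean()
    return 0.5 * mse + 0.5 * (1.0 - cos) + beta * commit
```

A perfect reconstruction with zero commitment loss drives the total to zero, which is a quick sanity check when wiring this up.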

Training Data Sources

  • Financial Inclusion Twi Multispeaker (~30h, ~200 speakers)
  • Fante Multispeaker (~8h subsample)
  • BibleTTS Asante Twi (~5h subsample, studio quality)
  • BibleTTS Akuapem Twi (~5h subsample, studio quality)
  • Common Voice Twi (~3h)
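As a rough sanity check on scale, the figures above (25 fps encoder output, 8 codebooks, ~100 h of audio) imply the token rates below. The ~100 h total is approximate, per the card.

```python
# Token-rate arithmetic implied by the card's numbers.
FPS = 25             # encoder frames per second
NUM_QUANTIZERS = 8   # RVQ codebooks per frame
HOURS = 100          # approximate corpus size

codes_per_second = FPS * NUM_QUANTIZERS   # RVQ codes per second of audio
total_frames = HOURS * 3600 * FPS         # encoder frames over the corpus
print(codes_per_second, total_frames)     # 200 9000000
```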

Usage

import json, importlib.util, torch
from huggingface_hub import hf_hub_download

REPO_ID = "ik/speechless-twi-stage1-rvq-gemma4-512"

# Fetch the config, checkpoint, and model-class file from the Hub.
cfg_path = hf_hub_download(REPO_ID, "config_stage1.json")
ckpt_path = hf_hub_download(REPO_ID, "rvq_best.pt")
wrp_path = hf_hub_download(REPO_ID, "rvq_wrapper.py")

with open(cfg_path) as f:
    cfg = json.load(f)

# Import rvq_wrapper.py directly from the downloaded file.
spec = importlib.util.spec_from_file_location("rvq_wrapper", wrp_path)
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)

# Build the RVQ and load the best-eval checkpoint.
rvq = mod.RVQWrapper(cfg["rvq_dim"], cfg["rvq_num_quantizers"], cfg["rvq_codebook_size"])
ckpt = torch.load(ckpt_path, map_location="cpu")
rvq.load_state_dict(ckpt["rvq"], strict=False)
rvq.eval()
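For intuition about what the 8 × 512 quantizer does, here is a minimal greedy residual-VQ encoder: each stage picks the nearest code to the current residual and subtracts it, so a 1024-dim frame becomes 8 integer indices. This is an illustrative sketch only; the repo's rvq_wrapper.py defines the actual model and its API.

```python
import torch

def rvq_encode(x, codebooks):
    """Greedy residual VQ (sketch). x: (B, D); each codebook: (K, D).

    Returns (B, num_quantizers) integer codes and the reconstruction
    built by summing the chosen code from each stage.
    """
    residual = x
    recon = torch.zeros_like(x)
    codes = []
    for cb in codebooks:
        d = torch.cdist(residual, cb)   # (B, K) distances to every code
        idx = d.argmin(dim=-1)          # nearest code per vector
        chosen = cb[idx]                # (B, D) selected code vectors
        recon = recon + chosen
        residual = residual - chosen    # next stage quantizes what's left
        codes.append(idx)
    return torch.stack(codes, dim=-1), recon
```

In the trained model, D = 1024 with 8 codebooks of 512 entries, so each 40 ms frame is represented by 8 indices in [0, 512).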

Files

  • rvq_best.pt: best checkpoint (lowest eval MSE)
  • rvq_final.pt: final checkpoint (end of training)
  • config_stage1.json: training configuration
  • rvq_wrapper.py: model class definition (needed for loading)

Generated: 2026-04-13 12:38
