# Speechless TWI Stage 1: RVQ for Gemma 4 Audio Encoder (512 codes)

A trained Residual Vector Quantizer (RVQ) that discretizes the 1024-dim Conformer representations of the Gemma 4 audio encoder into discrete semantic tokens for Twi/Akan.
## Architecture
| Component | Details |
|---|---|
| Audio Encoder | rnagabh/gemma4-audio-encoder (304.8M params, USM Conformer) |
| Encoder output | 1024-dim pre-projection embeddings @ 25 fps |
| RVQ | 8 codebooks x 512 entries |
| Training data | Mixed Twi/Akan (~40,261 samples, ~100h) |
| Loss | 0.5·MSE + 0.5·(1 − cosine) + commitment |
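To make the table concrete, here is a minimal NumPy sketch of greedy residual vector quantization using the same shapes (8 codebooks × 512 entries over 1024-dim frames). The codebooks here are random stand-ins for illustration, not the trained ones from this repo.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Greedy residual VQ: each codebook quantizes the residual left
    by the previous stage. x: (T, D); codebooks: (Q, K, D)."""
    quantized = np.zeros_like(x)
    codes = []
    for cb in codebooks:                              # cb: (K, D)
        residual = x - quantized
        # squared L2 distance from each frame to each codebook entry
        d = ((residual ** 2).sum(-1, keepdims=True)
             - 2.0 * residual @ cb.T
             + (cb ** 2).sum(-1))                     # (T, K)
        idx = d.argmin(axis=1)                        # nearest entry per frame
        codes.append(idx)
        quantized += cb[idx]                          # add this stage's contribution
    return np.stack(codes), quantized                 # (Q, T), (T, D)

rng = np.random.default_rng(0)
books = rng.standard_normal((8, 512, 1024))   # 8 codebooks x 512 entries
feats = rng.standard_normal((25, 1024))       # 1 s of encoder output @ 25 fps
codes, approx = rvq_encode(feats, books)
print(codes.shape)                            # (8, 25): 8 token streams
```

Each frame is thus represented by 8 indices in [0, 512), one per codebook, which is what makes the encoder output usable as discrete tokens.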
## Training Data Sources
- Financial Inclusion Twi Multispeaker (~30h, ~200 speakers)
- Fante Multispeaker (~8h subsample)
- BibleTTS Asante Twi (~5h subsample, studio quality)
- BibleTTS Akuapem Twi (~5h subsample, studio quality)
- Common Voice Twi (~3h)
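The Loss row in the Architecture table (0.5·MSE + 0.5·(1 − cosine) + commitment) can be sketched as below. This is an illustration, not the training code: `commit_weight` is an assumed value, and the stop-gradient that a real commitment term applies to the quantized targets has no NumPy equivalent, so it is omitted.

```python
import numpy as np

def stage1_loss(x, x_q, commit_weight=0.25):
    """Illustrative 0.5*MSE + 0.5*(1 - cosine) + commitment objective.
    x:   encoder features, (T, D)
    x_q: RVQ reconstruction, (T, D)
    NOTE: commit_weight is an assumed value; a real VQ commitment term
    applies a stop-gradient to x_q, which is omitted here."""
    mse = np.mean((x - x_q) ** 2)
    cos = np.sum(x * x_q, axis=-1) / (
        np.linalg.norm(x, axis=-1) * np.linalg.norm(x_q, axis=-1) + 1e-8)
    commitment = np.mean((x - x_q) ** 2)   # == MSE once stop-grad is dropped
    return 0.5 * mse + 0.5 * np.mean(1.0 - cos) + commit_weight * commitment

rng = np.random.default_rng(0)
x = rng.standard_normal((25, 1024))
print(stage1_loss(x, x))   # ~0.0 for a perfect reconstruction
```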
## Usage
```python
import importlib.util
import json

import torch
from huggingface_hub import hf_hub_download

REPO_ID = "ik/speechless-twi-stage1-rvq-gemma4-512"

# Download the config, checkpoint, and model-class definition
cfg_path = hf_hub_download(REPO_ID, "config_stage1.json")
ckpt_path = hf_hub_download(REPO_ID, "rvq_best.pt")
wrp_path = hf_hub_download(REPO_ID, "rvq_wrapper.py")

with open(cfg_path) as f:
    cfg = json.load(f)

# Import the RVQWrapper class directly from the downloaded file
spec = importlib.util.spec_from_file_location("rvq_wrapper", wrp_path)
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)

# Build the quantizer and load the trained weights
rvq = mod.RVQWrapper(cfg["rvq_dim"], cfg["rvq_num_quantizers"], cfg["rvq_codebook_size"])
ckpt = torch.load(ckpt_path, map_location="cpu")
rvq.load_state_dict(ckpt["rvq"], strict=False)
rvq.eval()
```
## Files
| File | Description |
|---|---|
| `rvq_best.pt` | Best checkpoint (lowest eval MSE) |
| `rvq_final.pt` | Final checkpoint (end of training) |
| `config_stage1.json` | Training configuration |
| `rvq_wrapper.py` | Model class definition (needed for loading) |
Generated: 2026-04-13 12:38