Speechless: Speech Instruction Training Without Speech for Low Resource Languages
Paper: [arXiv:2505.17417](https://arxiv.org/abs/2505.17417)
A trained Residual Vector Quantizer (RVQ) that discretizes Gemma 4 audio encoder (1024-dim Conformer) representations into discrete semantic tokens for Twi/Akan.
| Component | Details |
|---|---|
| Audio Encoder | rnagabh/gemma4-audio-encoder (304.8M params, USM Conformer) |
| Encoder output | 1024-dim pre-projection embeddings @ 25 fps |
| RVQ | 8 codebooks × 2048 entries |
| Training data | Mixed Twi/Akan (~40,261 samples, ~100h) |
| Loss | 0.5×MSE + 0.5×(1−cosine) + commitment |

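The loss in the table combines a reconstruction term with a cosine term and the usual VQ commitment loss. A minimal sketch of that combination (the actual implementation lives in `rvq_wrapper.py` and may differ; `commit_weight` is an assumed hyperparameter not stated on this card):

```python
import torch
import torch.nn.functional as F

def stage1_loss(recon, target, commit_loss, commit_weight=1.0):
    """0.5*MSE + 0.5*(1 - cosine) + commitment, per the table above.

    `commit_weight` is an assumed hyperparameter, not from the model card.
    """
    mse = F.mse_loss(recon, target)
    cos = F.cosine_similarity(recon, target, dim=-1).mean()
    return 0.5 * mse + 0.5 * (1.0 - cos) + commit_weight * commit_loss
```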
| Metric | Value |
|---|---|
| Reconstruction MSE | 1.8069 |
| Cosine Similarity | 0.9768 |
| Codebook Utilization | 70.0% |
| Codebook Perplexity | 1756.0 |
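Utilization and perplexity can be reproduced from the emitted code indices roughly as below (a sketch; how the single reported numbers aggregate over the 8 quantizers is not stated on this card):

```python
import torch

def codebook_stats(indices, codebook_size=2048):
    """Fraction of codes ever used, and exp-entropy (perplexity) of code usage.

    indices: LongTensor of code ids for a single quantizer.
    """
    counts = torch.bincount(indices.flatten(), minlength=codebook_size).float()
    probs = counts / counts.sum()
    utilization = (counts > 0).float().mean().item()
    entropy = -(probs * probs.clamp_min(1e-10).log()).sum()
    return utilization, entropy.exp().item()
```

A perplexity of 1756 against a 2048-entry codebook indicates usage close to uniform over the active codes.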

| File | Description |
|---|---|
| `rvq_best.pt` | Best checkpoint (lowest dev MSE) |
| `rvq_final.pt` | Final-epoch checkpoint |
| `rvq_averaged.pt` | Averaged checkpoint (recommended) |
| `config_stage1.json` | Full training config |
| `rvq_wrapper.py` | Model class definition |
| `twi_semantic_tokens.jsonl` | Eval set with semantic token strings |
| `twi_semantic_tokens_train.jsonl` | Train set with semantic token strings |
```python
import json
import numpy as np
import torch
from huggingface_hub import hf_hub_download
from transformers import Gemma4AudioModel, Gemma4AudioFeatureExtractor

# Load the RVQ
cfg = json.load(open(hf_hub_download("REPO_ID", "config_stage1.json")))
exec(open(hf_hub_download("REPO_ID", "rvq_wrapper.py")).read())
rvq = RVQWrapper(cfg["rvq_dim"], cfg["rvq_num_quantizers"], cfg["rvq_codebook_size"])
ckpt = torch.load(hf_hub_download("REPO_ID", "rvq_averaged.pt"), map_location="cpu")
rvq.load_state_dict(ckpt["rvq"])
rvq.eval()

# Load the Gemma 4 encoder
encoder = Gemma4AudioModel.from_pretrained("rnagabh/gemma4-audio-encoder", torch_dtype=torch.bfloat16)
feat_ext = Gemma4AudioFeatureExtractor.from_pretrained("rnagabh/gemma4-audio-encoder")

# Encode audio -> semantic tokens
wav = np.random.randn(16000).astype(np.float32)  # your audio here
feats = feat_ext([wav], sampling_rate=16000, return_tensors="pt")

# Capture the 1024-dim pre-projection embeddings via a forward hook
hook_store = {}
def hook(m, i, o):
    hook_store["h"] = i[0]
handle = encoder.output_proj.register_forward_hook(hook)
with torch.no_grad():
    encoder(feats["input_features"].to(torch.bfloat16))
    hs = hook_store["h"].float()
    _, indices, _ = rvq(hs)
handle.remove()
print("Token indices:", indices.shape)  # [1, Q, T]
```
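For Stage 2 training, the `[1, Q, T]` index tensor has to be serialized into the kind of "semantic token strings" the `.jsonl` files mention. The concrete string format used in those files is not specified here, so the hyphen-joined scheme below is one hypothetical choice, not the released format:

```python
import torch

def indices_to_string(indices):
    """Serialize RVQ indices [1, Q, T] as 'c1-c2-...-cQ' per frame, space-joined.

    The concrete format in twi_semantic_tokens*.jsonl is an assumption.
    """
    _, q, t = indices.shape
    return " ".join(
        "-".join(str(indices[0, qi, ti].item()) for qi in range(q))
        for ti in range(t)
    )
```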
This is Stage 1 of the Speechless pipeline for Twi:
| Stage | Purpose | Status |
|---|---|---|
| 1. RVQ | Discretize encoder embeddings | ✅ This model |
| 2. Speechless | Text → semantic tokens (no audio needed) | Next |
| 3. LLM fine-tune | Train Gemma 4 to respond in Twi | Planned |

Inference: Twi speech → Gemma 4 encoder → (same embedding space) → Gemma 4 LLM → Twi text response
Based on: Speechless: Speech Instruction Training Without Speech for Low Resource Languages (Dao et al., INTERSPEECH 2025, arXiv:2505.17417)
Trained: 2026-04-12