# Speechless Twi: Stage 1 RVQ for the Gemma 4 Audio Encoder

A trained Residual Vector Quantizer (RVQ) that discretizes representations from the Gemma 4 audio encoder (1024-dim Conformer) into semantic tokens for Twi/Akan.

## Architecture

| Component | Details |
|---|---|
| Audio encoder | `rnagabh/gemma4-audio-encoder` (304.8M params, USM Conformer) |
| Encoder output | 1024-dim pre-projection embeddings @ 25 fps |
| RVQ | 8 codebooks × 2048 entries |
| Training data | Mixed Twi/Akan (~40,261 samples, ~100 h) |
| Loss | 0.5 × MSE + 0.5 × (1 − cosine) + commitment |
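
The combined loss above can be sketched as follows. This is a minimal illustration, not the actual training code: `recon` and `target` stand for the RVQ reconstruction and the original encoder embeddings, `commit_loss` for the commitment term returned by the quantizer, and the `commit_weight` value is an assumption (the source does not state it):

```python
import torch
import torch.nn.functional as F

def stage1_loss(recon, target, commit_loss, commit_weight=0.25):
    # 0.5 * MSE + 0.5 * (1 - cosine similarity) + weighted commitment term
    mse = F.mse_loss(recon, target)
    cos = F.cosine_similarity(recon, target, dim=-1).mean()
    return 0.5 * mse + 0.5 * (1.0 - cos) + commit_weight * commit_loss

# toy example with random [batch, frames, 1024] tensors
recon = torch.randn(2, 50, 1024)
target = torch.randn(2, 50, 1024)
loss = stage1_loss(recon, target, commit_loss=torch.tensor(0.1))
print(loss.item())
```

The MSE term anchors absolute scale while the cosine term preserves direction in the 1024-dim embedding space; the commitment term keeps encoder outputs close to their assigned codes.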

## Training Data Sources

- Financial Inclusion Twi Multispeaker (~30 h, ~200 speakers)
- Fante Multispeaker (~8 h subsample)
- BibleTTS Asante Twi (~5 h subsample, studio quality)
- BibleTTS Akuapem Twi (~5 h subsample, studio quality)
- Common Voice Twi (~3 h)

## Results

| Metric | Value |
|---|---|
| Reconstruction MSE | 1.8069 |
| Cosine similarity | 0.9768 |
| Codebook utilization | 70.0% |
| Codebook perplexity | 1756.0 |
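
For reference, utilization (fraction of codebook entries ever used) and perplexity (effective number of codes in use, out of 2048) can be computed from token counts roughly as follows; this is an illustrative sketch, not the project's evaluation code:

```python
import torch

def codebook_stats(indices, codebook_size=2048):
    # indices: LongTensor of code ids from one quantizer level
    counts = torch.bincount(indices.flatten(), minlength=codebook_size).float()
    probs = counts / counts.sum()
    utilization = (counts > 0).float().mean().item()       # fraction of codes used
    nz = probs[probs > 0]
    perplexity = torch.exp(-(nz * nz.log()).sum()).item()  # effective code count
    return utilization, perplexity

# perfectly uniform usage of all 2048 entries: utilization 1.0, perplexity 2048
ids = torch.arange(2048)
u, p = codebook_stats(ids)
print(u, p)
```

A perplexity of 1756 out of 2048 indicates the codebooks are used fairly uniformly rather than collapsing onto a few entries.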

## Files

| File | Description |
|---|---|
| `rvq_best.pt` | Best checkpoint (lowest dev MSE) |
| `rvq_final.pt` | Final-epoch checkpoint |
| `rvq_averaged.pt` | Averaged checkpoint (recommended) |
| `config_stage1.json` | Full training config |
| `rvq_wrapper.py` | Model class definition |
| `twi_semantic_tokens.jsonl` | Eval set with semantic token strings |
| `twi_semantic_tokens_train.jsonl` | Train set with semantic token strings |

## Usage

```python
import json

import numpy as np
import torch
from huggingface_hub import hf_hub_download
from transformers import Gemma4AudioModel, Gemma4AudioFeatureExtractor

# Load the RVQ (the RVQWrapper class is shipped in rvq_wrapper.py)
cfg = json.load(open(hf_hub_download("REPO_ID", "config_stage1.json")))
exec(open(hf_hub_download("REPO_ID", "rvq_wrapper.py")).read())
rvq = RVQWrapper(cfg["rvq_dim"], cfg["rvq_num_quantizers"], cfg["rvq_codebook_size"])
ckpt = torch.load(hf_hub_download("REPO_ID", "rvq_averaged.pt"), map_location="cpu")
rvq.load_state_dict(ckpt["rvq"])
rvq.eval()

# Load the Gemma 4 encoder and its feature extractor
encoder = Gemma4AudioModel.from_pretrained("rnagabh/gemma4-audio-encoder", torch_dtype=torch.bfloat16)
encoder.eval()
feat_ext = Gemma4AudioFeatureExtractor.from_pretrained("rnagabh/gemma4-audio-encoder")

# Encode audio → semantic tokens
wav = np.random.randn(16000).astype(np.float32)  # replace with your 16 kHz mono audio
feats = feat_ext([wav], sampling_rate=16000, return_tensors="pt")

# Capture the 1024-dim pre-projection embeddings via a forward hook on output_proj
hook_store = {}
def hook(module, inputs, output):
    hook_store["h"] = inputs[0]
handle = encoder.output_proj.register_forward_hook(hook)
with torch.no_grad():
    encoder(feats["input_features"].to(torch.bfloat16))
    hs = hook_store["h"].float()
    _, indices, _ = rvq(hs)
handle.remove()

print("Token indices:", indices.shape)  # [1, Q, T]
```

## Pipeline Context

This is Stage 1 of the Speechless pipeline for Twi:

| Stage | Purpose | Status |
|---|---|---|
| 1. RVQ | Discretize encoder embeddings | ✅ This model |
| 2. Speechless | Text → semantic tokens (no audio needed) | Next |
| 3. LLM fine-tune | Train Gemma 4 to respond in Twi | Planned |

Inference: Twi speech → Gemma 4 encoder → (same embedding space) → Gemma 4 LLM → Twi text response

## Citation

Based on: *Speechless: Speech Instruction Training Without Speech for Low Resource Languages* (Dao et al., INTERSPEECH 2025, arXiv:2505.17417)

Trained: 2026-04-12
