Speechless: Speech Instruction Training Without Speech for Low Resource Languages
Paper: [arXiv:2505.17417](https://arxiv.org/abs/2505.17417)
A trained Residual Vector Quantizer (RVQ) that discretizes Gemma 4 audio encoder (1024-dim Conformer) representations into discrete semantic tokens for Twi/Akan.
| Component | Details |
|---|---|
| Audio Encoder | rnagabh/gemma4-audio-encoder (304.8M params, USM Conformer) |
| Encoder output | 1024-dim pre-projection embeddings @ 25 fps |
| RVQ | 8 codebooks × 2048 entries |
| Training data | Mixed Twi/Akan (~40,261 samples, ~100h) |
| Loss | 0.5×MSE + 0.5×(1−cosine) + commitment |

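The loss in the table combines a reconstruction term with a cosine term and the usual VQ commitment loss. A minimal sketch of that combination (the actual implementation lives in `rvq_wrapper.py` and may differ; `commit_weight` is an assumed hyperparameter not stated on this card):

```python
import torch
import torch.nn.functional as F

def stage1_loss(recon, target, commit_loss, commit_weight=1.0):
    """0.5*MSE + 0.5*(1 - cosine) + commitment, per the table above.

    `commit_weight` is an assumed hyperparameter, not from the model card.
    """
    mse = F.mse_loss(recon, target)
    cos = F.cosine_similarity(recon, target, dim=-1).mean()
    return 0.5 * mse + 0.5 * (1.0 - cos) + commit_weight * commit_loss
```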
| Metric | Value |
|---|---|
| Reconstruction MSE | 1.8069 |
| Cosine Similarity | 0.9768 |
| Codebook Utilization | 70.0% |
| Codebook Perplexity | 1756.0 |
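Utilization and perplexity can be reproduced from the emitted code indices roughly as below (a sketch; how the single reported numbers aggregate over the 8 quantizers is not stated on this card):

```python
import torch

def codebook_stats(indices, codebook_size=2048):
    """Fraction of codes ever used, and exp-entropy (perplexity) of code usage.

    indices: LongTensor of code ids for a single quantizer.
    """
    counts = torch.bincount(indices.flatten(), minlength=codebook_size).float()
    probs = counts / counts.sum()
    utilization = (counts > 0).float().mean().item()
    entropy = -(probs * probs.clamp_min(1e-10).log()).sum()
    return utilization, entropy.exp().item()
```

A perplexity of 1756 against a 2048-entry codebook indicates usage close to uniform over the active codes.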

| File | Description |
|---|---|
| `rvq_best.pt` | Best checkpoint (lowest dev MSE) |
| `rvq_final.pt` | Final-epoch checkpoint |
| `rvq_averaged.pt` | Averaged checkpoint (recommended) |
| `config_stage1.json` | Full training config |
| `rvq_wrapper.py` | Model class definition |
| `twi_semantic_tokens.jsonl` | Eval set with semantic token strings |
| `twi_semantic_tokens_train.jsonl` | Train set with semantic token strings |
```python
import json
import numpy as np
import torch
from huggingface_hub import hf_hub_download
from transformers import Gemma4AudioModel, Gemma4AudioFeatureExtractor

# Load the RVQ
cfg = json.load(open(hf_hub_download("REPO_ID", "config_stage1.json")))
exec(open(hf_hub_download("REPO_ID", "rvq_wrapper.py")).read())
rvq = RVQWrapper(cfg["rvq_dim"], cfg["rvq_num_quantizers"], cfg["rvq_codebook_size"])
ckpt = torch.load(hf_hub_download("REPO_ID", "rvq_averaged.pt"), map_location="cpu")
rvq.load_state_dict(ckpt["rvq"])
rvq.eval()

# Load the Gemma 4 encoder
encoder = Gemma4AudioModel.from_pretrained("rnagabh/gemma4-audio-encoder", torch_dtype=torch.bfloat16)
feat_ext = Gemma4AudioFeatureExtractor.from_pretrained("rnagabh/gemma4-audio-encoder")

# Encode audio -> semantic tokens
wav = np.random.randn(16000).astype(np.float32)  # your audio here
feats = feat_ext([wav], sampling_rate=16000, return_tensors="pt")

# Capture the 1024-dim pre-projection embeddings via a forward hook
hook_store = {}
def hook(m, i, o):
    hook_store["h"] = i[0]
handle = encoder.output_proj.register_forward_hook(hook)
with torch.no_grad():
    encoder(feats["input_features"].to(torch.bfloat16))
    hs = hook_store["h"].float()
    _, indices, _ = rvq(hs)
handle.remove()
print("Token indices:", indices.shape)  # [1, Q, T]
```
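For Stage 2 training, the `[1, Q, T]` index tensor has to be serialized into the kind of "semantic token strings" the `.jsonl` files mention. The concrete string format used in those files is not specified here, so the hyphen-joined scheme below is one hypothetical choice, not the released format:

```python
import torch

def indices_to_string(indices):
    """Serialize RVQ indices [1, Q, T] as 'c1-c2-...-cQ' per frame, space-joined.

    The concrete format in twi_semantic_tokens*.jsonl is an assumption.
    """
    _, q, t = indices.shape
    return " ".join(
        "-".join(str(indices[0, qi, ti].item()) for qi in range(q))
        for ti in range(t)
    )
```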
This is Stage 1 of the Speechless pipeline for Twi:
| Stage | Purpose | Status |
|---|---|---|
| 1. RVQ | Discretize encoder embeddings | ✅ This model |
| 2. Speechless | Text → semantic tokens (no audio needed) | Next |
| 3. LLM fine-tune | Train Gemma 4 to respond in Twi | Planned |

Inference: Twi speech → Gemma 4 encoder → (same embedding space) → Gemma 4 LLM → Twi text response
Based on: Speechless: Speech Instruction Training Without Speech for Low Resource Languages (Dao et al., INTERSPEECH 2025, arXiv:2505.17417)
Trained: 2026-04-12