---
license: mit
library_name: pytorch
tags:
  - speaker-recognition
  - speaker-encoding
  - speech
  - indic
  - cross-lingual
  - voice-cloning
language:
  - en
  - hi
  - te
  - ta
---

# LASE r1 — Language-Adversarial Speaker Encoder

Reference checkpoint for the paper "LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation" ([arXiv:2605.00777](https://arxiv.org/abs/2605.00777)).

LASE produces 256-dimensional speaker embeddings that preserve speaker identity across Devanagari, Telugu, Tamil, and Latin scripts. It wraps a frozen `microsoft/wavlm-base-plus` backbone with a 2-layer projection MLP and a gradient-reversal language classifier (~170k trainable parameters).
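For orientation, here is a minimal, illustrative sketch of the two trainable pieces named above: the projection MLP and a gradient-reversal layer (GRL) feeding a language classifier. The class names and structure are assumptions, not the repo's actual API; the real model code lives at github.com/praxelhq/lase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales gradients by -lambda in backward."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None


class ProjectionHead(nn.Module):
    """2-layer MLP projecting 768-d WavLM features to 256-d speaker embeddings."""

    def __init__(self, in_dim=768, hidden_dim=512, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)


# The language classifier sits behind the GRL, so minimising its cross-entropy
# pushes the shared embedding to discard language/script cues.
head = ProjectionHead()
lang_clf = nn.Linear(256, 4)                 # 4 languages: en, hi, te, ta
emb = head(torch.randn(8, 768))              # (B, 256) speaker embedding
lang_logits = lang_clf(GradReverse.apply(emb, 0.1))
```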

## Headline result

| Encoder | Western voices gap | Indian voices gap |
|---|---|---|
| WavLM-base-plus-sv (off-the-shelf) | 0.082 | 0.006 |
| ECAPA-TDNN (off-the-shelf) | 0.105 | 0.058 |
| ECAPA + GRL (ablation) | 0.027 | 0.037 |
| LASE r1 (ours) | 0.013 | −0.000 |

Lower is better. gap = within-script median minus cross-script median for the same speaker. LASE r1's bootstrap 95% CI on the gap straddles zero on both held-out corpora.
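The gap metric could be computed as below. This is an illustrative reading with a hypothetical helper name (`script_gap`) that assumes the medians are taken over pairwise cosine similarities of embeddings, which the card does not spell out.

```python
import torch
import torch.nn.functional as F


def script_gap(within_pairs, cross_pairs):
    """gap = median within-script similarity - median cross-script similarity.

    Each argument is a list of (emb_a, emb_b) embedding pairs for the same
    speaker. A gap near zero means identity is preserved equally well
    whether the two utterances share a script or not.
    """
    def median_cos(pairs):
        sims = torch.stack([F.cosine_similarity(a, b, dim=-1) for a, b in pairs])
        return sims.median()

    return (median_cos(within_pairs) - median_cos(cross_pairs)).item()
```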

## Usage

```python
from huggingface_hub import hf_hub_download
import torch

# Clone github.com/praxelhq/lase first for the model code.
from models.lase import LASE, LambdaSchedule, WavLMSpeakerEncoder

ckpt_path = hf_hub_download("Praxel/lase-r1", "last.pt")
backbone = WavLMSpeakerEncoder("microsoft/wavlm-base-plus",
                               embedding_dim=256, freeze_backbone=True)
model = LASE(backbone, embedding_dim=256, n_languages=4,
             lambda_schedule=LambdaSchedule(200, 500, 0.1))
model.load_state_dict(torch.load(ckpt_path, map_location="cpu")["model"], strict=False)
model.eval()

# wav: (B, T) float32 at 16 kHz, ~2 seconds
wav = torch.randn(1, 32000)  # dummy 2 s waveform; replace with real audio
with torch.no_grad():
    embedding = model(wav)["embedding"]   # (B, 256)
```
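The encoder expects 16 kHz input. A loading sketch using torchaudio (not part of the LASE repo; the mono downmix here is an assumption about typical preprocessing):

```python
import torchaudio

wav, sr = torchaudio.load("speech.wav")            # (channels, T)
wav = wav.mean(dim=0, keepdim=True)                # downmix to mono (assumed)
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)
with torch.no_grad():
    embedding = model(wav)["embedding"]            # (1, 256)
```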

## Training

- Backbone: `microsoft/wavlm-base-plus` (frozen)
- Projection MLP: 768 → 512 → 256 (~170k params)
- Losses: SupCon (voice identity) + GRL cross-entropy (4-language adversarial)
- λ schedule: hold λ at 0 for 200 warmup steps, ramp to 0.1 over 500 steps, then hold (see the sketch after this list)
- Optimisation: 1000 steps, batch size 16, AdamW, LR 1e-4
- Data: 1118 same-voice cross-script pairs from 8 ElevenLabs Multilingual voices, gated through WavLM-cosine ≥ 0.90
- Hardware: 1× A10G on Modal, ~17 min, ~$0.31
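A minimal re-implementation of that λ schedule, assuming `LambdaSchedule(200, 500, 0.1)` means (warmup steps, ramp steps, final λ) and that the ramp occupies the 500 steps after warmup; the real class lives in the LASE repo.

```python
def grl_lambda(step: int, warmup: int = 200, ramp: int = 500, max_lam: float = 0.1) -> float:
    """GRL weight: 0 during warmup, linear ramp, then hold at max_lam."""
    if step < warmup:
        return 0.0                                  # adversary off during warmup
    if step < warmup + ramp:
        return max_lam * (step - warmup) / ramp     # linear ramp to max_lam
    return max_lam                                  # hold for the remaining steps


assert grl_lambda(0) == 0.0
assert abs(grl_lambda(450) - 0.05) < 1e-9           # halfway up the ramp
assert grl_lambda(999) == 0.1                       # held through step 1000
```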

## Datasets

The training pairs are described under the Data bullet in Training above; the two evaluation corpora are held out (see Headline result).

## License

MIT.

## Citation

```bibtex
@misc{lase2026,
  title={{LASE}: Language-Adversarial Speaker Encoding for {Indic} Cross-Script Identity Preservation},
  author={Menta, Venkata Pushpak Teja},
  year={2026},
  eprint={2605.00777},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
}
```