---
license: mit
library_name: pytorch
tags:
- speaker-recognition
- speaker-encoding
- speech
- indic
- cross-lingual
- voice-cloning
language:
- en
- hi
- te
- ta
---

# LASE r1: Language-Adversarial Speaker Encoder

Reference checkpoint for the paper *"LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation"* ([arXiv:2605.00777](https://arxiv.org/abs/2605.00777)).

LASE is a speaker encoder that produces a 256-dimensional embedding designed to preserve speaker identity across Devanagari, Telugu, Tamil, and Latin scripts. It wraps a frozen `microsoft/wavlm-base-plus` backbone with a 2-layer projection MLP and a gradient-reversal language classifier (~170k trainable params).
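
The gradient-reversal language classifier can be illustrated with a short PyTorch sketch. This is not the code from the LASE repository: the class names `GradReverse` and `LanguageAdversaryHead` are hypothetical, and the single linear classifier layer is an assumption made for brevity. The layer is the identity in the forward pass and multiplies the gradient by `-lam` in the backward pass, so whatever helps the classifier identify the language is discouraged in the embedding.

```python
import torch
from torch import nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales the gradient by -lam in the backward pass."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # One return value per forward input; lam itself receives no gradient.
        return -ctx.lam * grad_output, None


class LanguageAdversaryHead(nn.Module):
    """Hypothetical language classifier placed behind a gradient-reversal layer."""

    def __init__(self, embedding_dim: int = 256, n_languages: int = 4):
        super().__init__()
        # Assumption: a single linear layer; the real classifier may be deeper.
        self.classifier = nn.Linear(embedding_dim, n_languages)

    def forward(self, embedding: torch.Tensor, lam: float) -> torch.Tensor:
        return self.classifier(GradReverse.apply(embedding, lam))
```

In training, a cross-entropy loss on these logits would push the classifier to recognise the language, while the reversed gradient would push the embedding to hide it.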

## Headline result

| Encoder | Western voices gap | Indian voices gap |
|---|---|---|
| WavLM-base-plus-sv (off-the-shelf) | 0.082 | 0.006 |
| ECAPA-TDNN (off-the-shelf) | 0.105 | 0.058 |
| ECAPA + GRL (ablation) | 0.027 | 0.037 |
| **LASE r1 (ours)** | **0.013** | **−0.000** |

Lower is better. *gap* = within-script median minus cross-script median for the same speaker. LASE r1's bootstrap 95% CI on gap straddles zero on both held-out corpora.
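
The gap statistic can be sketched in a few lines of standard-library Python. This is an illustration rather than the paper's evaluation code: it assumes the per-pair scores are similarity values (for example cosine similarities), and the function names `script_gap` and `bootstrap_gap_ci`, the resample count, and the choice of a percentile bootstrap are all assumptions.

```python
import random
import statistics


def script_gap(within_scores, cross_scores):
    """Within-script median minus cross-script median for one set of scores."""
    return statistics.median(within_scores) - statistics.median(cross_scores)


def bootstrap_gap_ci(within_scores, cross_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the gap, resampling each group independently."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_resamples):
        w = rng.choices(within_scores, k=len(within_scores))
        c = rng.choices(cross_scores, k=len(cross_scores))
        gaps.append(script_gap(w, c))
    gaps.sort()
    # Symmetric tail indices, computed with integers to avoid rounding surprises.
    tail = int(n_resamples * alpha / 2)
    return gaps[tail], gaps[n_resamples - 1 - tail]
```

An interval from `bootstrap_gap_ci` that contains zero is what "straddles zero" means here: the data do not distinguish within-script from cross-script similarity for the same speaker.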

## Usage

```python
import torch
from huggingface_hub import hf_hub_download

# Clone github.com/praxelhq/lase first; the model code is imported from that repo.
from models.lase import LASE, LambdaSchedule, WavLMSpeakerEncoder

ckpt_path = hf_hub_download("Praxel/lase-r1", "last.pt")
backbone = WavLMSpeakerEncoder("microsoft/wavlm-base-plus", embedding_dim=256, freeze_backbone=True)
model = LASE(backbone, embedding_dim=256, n_languages=4,
             lambda_schedule=LambdaSchedule(200, 500, 0.1))
# map_location="cpu" lets the checkpoint load on machines without a GPU.
state = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(state["model"], strict=False)
model.eval()

# wav: (B, T) float32 at 16 kHz, ~2 seconds
with torch.no_grad():  # inference only, so skip gradient tracking
    embedding = model(wav)["embedding"]  # (B, 256)
```
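
To prepare inputs and compare two utterances, something like the following may be useful. It is a sketch under assumptions, not part of the LASE repository: the helper names `fix_length` and `same_speaker_score` are hypothetical, the centre-crop and zero-pad policy is one choice among several, and scoring with cosine similarity is assumed rather than stated by this card.

```python
import torch
import torch.nn.functional as F

SAMPLE_RATE = 16_000   # the card specifies 16 kHz input
CLIP_SECONDS = 2.0     # the card suggests clips of roughly 2 seconds


def fix_length(wav: torch.Tensor, seconds: float = CLIP_SECONDS) -> torch.Tensor:
    """Centre-crop or right-pad a (B, T) batch to a fixed number of samples."""
    target = int(SAMPLE_RATE * seconds)
    n = wav.shape[-1]
    if n >= target:
        start = (n - target) // 2
        return wav[..., start:start + target]
    return F.pad(wav, (0, target - n))  # zero-pad on the right


def same_speaker_score(emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between matching rows of two (B, D) embedding batches."""
    return F.cosine_similarity(emb_a, emb_b, dim=-1)
```

With the model loaded as above, `same_speaker_score(model(fix_length(a))["embedding"], model(fix_length(b))["embedding"])` would give one score per pair; any decision threshold would need to be calibrated on your own data.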

## Training

- **Backbone**: `microsoft/wavlm-base-plus` (frozen)
- **Projection MLP**: 768 → 512 → 256 (~170k params)
- **Losses**: SupCon (voice identity) + GRL cross-entropy (4-language adversarial)
- **λ schedule**: held at 0 for 200 warmup steps, ramped to 0.1 over the next 500 steps, then held
- **Optimisation**: 1000 steps, batch size 16, AdamW, learning rate 1e-4
- **Data**: 1118 same-voice cross-script pairs from 8 ElevenLabs Multilingual voices, gated at WavLM cosine similarity ≥ 0.90
- **Hardware**: 1× A10G on Modal, ~17 min, ~$0.31
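
The λ schedule listed above can be written as a small function. This is a sketch, not the repository's `LambdaSchedule` class: the function name `grl_lambda` is hypothetical, and the linear shape of the ramp is an assumption, since the card gives only the endpoints and durations.

```python
def grl_lambda(step: int, warmup_steps: int = 200, ramp_steps: int = 500,
               max_lambda: float = 0.1) -> float:
    """Gradient-reversal strength at a given training step.

    Zero during warmup, then a linear ramp (assumed) up to max_lambda,
    then constant for the rest of training.
    """
    if step < warmup_steps:
        return 0.0
    progress = (step - warmup_steps) / ramp_steps
    return max_lambda * min(1.0, progress)
```

With the defaults, the adversarial term is inactive for the first 200 of the 1000 training steps and reaches full strength at step 700.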

## Datasets

- Training: [`Praxel/codeswitch-pairs-lase`](https://huggingface.co/datasets/Praxel/codeswitch-pairs-lase)
- Western held-out: [`Praxel/codeswitch-pairs-lase-heldout`](https://huggingface.co/datasets/Praxel/codeswitch-pairs-lase-heldout)
- Indian held-out: [`Praxel/codeswitch-pairs-lase-indian`](https://huggingface.co/datasets/Praxel/codeswitch-pairs-lase-indian)

## License

MIT.

## Citation

```bibtex
@misc{lase2026,
  title={{LASE}: Language-Adversarial Speaker Encoding for {Indic} Cross-Script Identity Preservation},
  author={Menta, Venkata Pushpak Teja},
  year={2026},
  eprint={2605.00777},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
}
```