---
license: mit
library_name: pytorch
tags:
- speaker-recognition
- speaker-encoding
- speech
- indic
- cross-lingual
- voice-cloning
language:
- en
- hi
- te
- ta
---
# LASE r1 — Language-Adversarial Speaker Encoder
Reference checkpoint for the paper *"LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation"* ([arXiv:2605.00777](https://arxiv.org/abs/2605.00777)).
LASE is a 256-d speaker embedding that preserves speaker identity across Devanagari, Telugu, Tamil, and Latin scripts. It wraps a frozen `microsoft/wavlm-base-plus` backbone with a 2-layer projection MLP and a gradient-reversal language classifier (~170k trainable params).
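For readers unfamiliar with gradient reversal, the following is a minimal, generic sketch of such a layer in PyTorch. It is not taken from the LASE repository: the names `GradReverse` and `grad_reverse` are hypothetical, and the actual implementation may differ. The layer is the identity in the forward pass and multiplies the gradient by `-lam` in the backward pass, so a language classifier placed after it pushes the upstream projection towards language-invariant embeddings.

```python
import torch


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales gradients by -lam in the backward pass."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Second return value is the gradient w.r.t. `lam`, which is not needed.
        return -ctx.lam * grad_output, None


def grad_reverse(x: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    """Hypothetical helper: apply gradient reversal with strength `lam`."""
    return GradReverse.apply(x, lam)
```

In a setup like the one described above, the language classifier would consume `grad_reverse(embedding, lam)` while the speaker loss consumes `embedding` directly.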
## Headline result
| Encoder | Western voices gap | Indian voices gap |
|---|---|---|
| WavLM-base-plus-sv (off-the-shelf) | 0.082 | 0.006 |
| ECAPA-TDNN (off-the-shelf) | 0.105 | 0.058 |
| ECAPA + GRL (ablation) | 0.027 | 0.037 |
| **LASE r1 (ours)** | **0.013** | **−0.000** |
Lower is better. *Gap* is the within-script median minus the cross-script median for the same speaker. The bootstrap 95% CI on LASE r1's gap includes zero on both held-out corpora.
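The gap definition above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's evaluation code: it assumes the medians are taken over pairwise cosine similarities between one speaker's utterance embeddings, and the helper names `cosine` and `script_gap` are hypothetical. How the paper aggregates across speakers is not covered here.

```python
import math
from statistics import median


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def script_gap(utterances):
    """Gap for ONE speaker.

    utterances: list of (script, embedding) tuples.
    Returns median within-script similarity minus median cross-script similarity
    (assumption: similarity is cosine similarity).
    """
    within, cross = [], []
    for i in range(len(utterances)):
        for j in range(i + 1, len(utterances)):
            (script_i, emb_i), (script_j, emb_j) = utterances[i], utterances[j]
            bucket = within if script_i == script_j else cross
            bucket.append(cosine(emb_i, emb_j))
    return median(within) - median(cross)


# Toy 2-d embeddings: identical within a script, orthogonal across scripts.
toy = [
    ("latin", [1.0, 0.0]),
    ("latin", [1.0, 0.0]),
    ("telugu", [0.0, 1.0]),
    ("telugu", [0.0, 1.0]),
]
toy_gap = script_gap(toy)
```

An encoder that is fully script-invariant would give a gap of zero under this definition, because within-script and cross-script pairs would score the same.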
## Usage
```python
import torch
from huggingface_hub import hf_hub_download

# Clone github.com/praxelhq/lase first; the model code lives in that repository.
from models.lase import LASE, LambdaSchedule, WavLMSpeakerEncoder

ckpt_path = hf_hub_download("Praxel/lase-r1", "last.pt")
backbone = WavLMSpeakerEncoder("microsoft/wavlm-base-plus", embedding_dim=256, freeze_backbone=True)
model = LASE(backbone, embedding_dim=256, n_languages=4,
             lambda_schedule=LambdaSchedule(200, 500, 0.1))

# Load on CPU so the checkpoint also opens on machines without a GPU.
state = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(state["model"], strict=False)
model.eval()

# wav: (B, T) float32 at 16 kHz, ~2 seconds
wav = torch.randn(1, 32000)  # placeholder input; replace with real audio
with torch.no_grad():  # inference only, no gradients needed
    embedding = model(wav)["embedding"]  # (B, 256)
```
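The usage snippet expects `wav` as a `(B, T)` float32 tensor at 16 kHz of roughly 2 seconds. A minimal sketch of preparing one clip in that shape follows. The helper `prepare_clip` is hypothetical and not part of the LASE repository, and fixing the length to exactly 2 seconds with a centre crop or right zero-padding is an assumption, since the card only says "~2 seconds". Audio at another sample rate must be resampled beforehand.

```python
import torch

TARGET_SR = 16_000          # sample rate stated in the usage comment
TARGET_LEN = 2 * TARGET_SR  # assumption: fix every clip to exactly 2 s


def prepare_clip(waveform: torch.Tensor, sample_rate: int) -> torch.Tensor:
    """Turn a (channels, T) or (T,) waveform into a (1, TARGET_LEN) float32 batch."""
    if sample_rate != TARGET_SR:
        raise ValueError(
            f"expected {TARGET_SR} Hz audio, got {sample_rate} Hz; resample first"
        )
    wav = waveform.to(torch.float32)
    if wav.dim() == 2:
        wav = wav.mean(dim=0)  # downmix channels to mono
    n = wav.shape[-1]
    if n >= TARGET_LEN:
        start = (n - TARGET_LEN) // 2
        wav = wav[start:start + TARGET_LEN]  # centre crop
    else:
        wav = torch.nn.functional.pad(wav, (0, TARGET_LEN - n))  # zero-pad on the right
    return wav.unsqueeze(0)  # add the batch dimension
```

Several prepared clips can be stacked with `torch.cat(clips, dim=0)` to form a larger batch.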
## Training
- **Backbone**: `microsoft/wavlm-base-plus` (frozen)
- **Projection MLP**: 768 → 512 → 256 (~170k params)
- **Losses**: SupCon (voice identity) + GRL CE (4-language adversarial)
- **λ schedule**: warmup 0 for 200 steps, ramp to 0.1 over 500 steps, hold
- **Optimisation**: 1000 steps, batch 16, AdamW, LR 1e-4
- **Data**: 1118 same-voice cross-script pairs from 8 ElevenLabs Multilingual voices, kept only when WavLM cosine similarity ≥ 0.90
- **Hardware**: 1× A10G on Modal, ~17 min, ~$0.31
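The λ schedule listed above can be sketched as a plain function. This is an illustration, not the repository's `LambdaSchedule` class: the function name and parameter names are hypothetical, and it assumes a linear ramp that starts after the warmup, which the list does not state explicitly.

```python
def grl_lambda(step: int, warmup_steps: int = 200, ramp_steps: int = 500,
               lambda_max: float = 0.1) -> float:
    """Gradient-reversal strength at a given training step.

    Assumptions: lambda is 0 during warmup, then rises linearly to
    `lambda_max` over `ramp_steps` steps, then stays constant.
    """
    if step < warmup_steps:
        return 0.0
    progress = min(1.0, (step - warmup_steps) / ramp_steps)
    return lambda_max * progress


# Values at a few points in a 1000-step run.
schedule_preview = {step: grl_lambda(step) for step in (0, 200, 450, 700, 1000)}
```

Under these assumptions the adversarial term is switched off for the first 200 steps, reaches full strength at step 700, and stays at 0.1 for the remaining 300 steps of the 1000-step run.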
## Datasets
- Training: [`Praxel/codeswitch-pairs-lase`](https://huggingface.co/datasets/Praxel/codeswitch-pairs-lase)
- Western held-out: [`Praxel/codeswitch-pairs-lase-heldout`](https://huggingface.co/datasets/Praxel/codeswitch-pairs-lase-heldout)
- Indian held-out: [`Praxel/codeswitch-pairs-lase-indian`](https://huggingface.co/datasets/Praxel/codeswitch-pairs-lase-indian)
## License
MIT.
## Citation
```bibtex
@misc{lase2026,
title={{LASE}: Language-Adversarial Speaker Encoding for {Indic} Cross-Script Identity Preservation},
author={Menta, Venkata Pushpak Teja},
year={2026},
eprint={2605.00777},
archivePrefix={arXiv},
primaryClass={eess.AS},
}
```