---
license: mit
library_name: pytorch
tags:
- speaker-recognition
- speaker-encoding
- speech
- indic
- cross-lingual
- voice-cloning
language:
- en
- hi
- te
- ta
---

# LASE r1 — Language-Adversarial Speaker Encoder

Reference checkpoint for the paper *"LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation"* (arXiv:TBD).

LASE is a 256-d speaker encoder that preserves speaker identity across Devanagari, Telugu, Tamil, and Latin scripts. It wraps a frozen `microsoft/wavlm-base-plus` backbone with a 2-layer projection MLP and a gradient-reversal language classifier (~170k trainable params).
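
The gradient-reversal trick behind the language classifier is the standard DANN construction (Ganin & Lempitsky, 2015). A minimal PyTorch sketch of such a layer (an illustration, not the repo's actual `models.lase` code):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; scales gradients by -lam on the backward pass."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Negated, scaled gradient flows back to x; no gradient for lam.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=0.1):
    return GradReverse.apply(x, lam)
```

Placed between the embedding and the language classifier, this makes the encoder ascend the language loss that the classifier descends, which is what pushes language information out of the embedding.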

## Headline result

| Encoder | Western voices gap | Indian voices gap |
|---|---|---|
| WavLM-base-plus-sv (off-the-shelf) | 0.082 | 0.006 |
| ECAPA-TDNN (off-the-shelf) | 0.105 | 0.058 |
| ECAPA + GRL (ablation) | 0.027 | 0.037 |
| **LASE r1 (ours)** | **0.013** | **−0.000** |

Lower is better. *Gap* = within-script median minus cross-script median for the same speaker. LASE r1's bootstrap 95% CI on the gap straddles zero on both held-out corpora.

## Usage

```python
from huggingface_hub import hf_hub_download
import torch

# clone github.com/praxelhq/lase first for the model code
from models.lase import LASE, LambdaSchedule, WavLMSpeakerEncoder

ckpt_path = hf_hub_download("Praxel/lase-r1", "last.pt")
backbone = WavLMSpeakerEncoder("microsoft/wavlm-base-plus", embedding_dim=256, freeze_backbone=True)
model = LASE(backbone, embedding_dim=256, n_languages=4,
             lambda_schedule=LambdaSchedule(200, 500, 0.1))
model.load_state_dict(torch.load(ckpt_path, map_location="cpu")["model"], strict=False)
model.eval()

wav = torch.randn(1, 32000)  # stand-in input: (B, T) float32 at 16 kHz, ~2 seconds
embedding = model(wav)["embedding"]  # (B, 256)
```
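
Embeddings from the snippet above are meant to be compared with cosine similarity. A small helper for same-speaker verification; the 0.75 decision threshold here is an illustrative guess, not a calibrated value:

```python
import torch
import torch.nn.functional as F

def same_speaker(emb_a: torch.Tensor, emb_b: torch.Tensor, threshold: float = 0.75) -> bool:
    """Cosine-compare two (256,) embeddings; True if they likely share a speaker."""
    score = F.cosine_similarity(emb_a.unsqueeze(0), emb_b.unsqueeze(0)).item()
    return score >= threshold

# Stand-in vectors; in practice pass model(wav)["embedding"][0] for two clips.
a = torch.randn(256)
print(same_speaker(a, a))  # identical vectors score 1.0 -> True
```

Calibrate the threshold on held-out same/different-speaker pairs before relying on it.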

## Training

- **Backbone**: `microsoft/wavlm-base-plus` (frozen)
- **Projection MLP**: 768 → 512 → 256 (~170k params)
- **Losses**: SupCon (voice identity) + GRL cross-entropy (4-language adversarial)
- **λ schedule**: hold at 0 for 200 warmup steps, ramp to 0.1 over 500 steps, then hold
- **Optimisation**: 1000 steps, batch 16, AdamW, LR 1e-4
- **Data**: 1118 same-voice cross-script pairs from 8 ElevenLabs Multilingual voices, gated through WavLM-cosine ≥ 0.90
- **Hardware**: 1× A10G on Modal, ~17 min, ~$0.31
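
The λ schedule above can be written as a plain function; this is a hypothetical re-implementation for readability, the repo's `LambdaSchedule(200, 500, 0.1)` is authoritative:

```python
def grl_lambda(step: int, warmup: int = 200, ramp: int = 500, max_lambda: float = 0.1) -> float:
    """0 during warmup, linear ramp to max_lambda over `ramp` steps, then constant."""
    if step < warmup:
        return 0.0
    return min((step - warmup) / ramp, 1.0) * max_lambda
```

Warming up with λ = 0 lets the language classifier train to a useful signal before its reversed gradient starts reshaping the embedding.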

## Datasets

- Training: [`Praxel/codeswitch-pairs-lase`](https://huggingface.co/datasets/Praxel/codeswitch-pairs-lase)
- Western held-out: [`Praxel/codeswitch-pairs-lase-heldout`](https://huggingface.co/datasets/Praxel/codeswitch-pairs-lase-heldout)
- Indian held-out: [`Praxel/codeswitch-pairs-lase-indian`](https://huggingface.co/datasets/Praxel/codeswitch-pairs-lase-indian)

## License

MIT.

## Citation

```bibtex
@misc{lase2026,
  title={{LASE}: Language-Adversarial Speaker Encoding for {Indic} Cross-Script Identity Preservation},
  author={Menta, Venkata Pushpak Teja},
  year={2026},
  eprint={TBD},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
}
```