---
license: mit
library_name: pytorch
tags:
- speaker-recognition
- speaker-encoding
- speech
- indic
- cross-lingual
- voice-cloning
language:
- en
- hi
- te
- ta
---

# LASE r1 — Language-Adversarial Speaker Encoder

Reference checkpoint for the paper *"LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation"* (arXiv:TBD).

LASE is a 256-d speaker encoder that preserves speaker identity across Devanagari, Telugu, Tamil, and Latin scripts. It wraps a frozen `microsoft/wavlm-base-plus` backbone with a 2-layer projection MLP and a gradient-reversal language classifier (~170k trainable params).
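
The gradient-reversal trick behind the language classifier is the standard DANN construction (Ganin & Lempitsky, 2015). A minimal PyTorch sketch of such a layer (an illustration, not the repo's actual `models.lase` code):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; scales gradients by -lam on the backward pass."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Negated, scaled gradient flows back to x; no gradient for lam.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=0.1):
    return GradReverse.apply(x, lam)
```

Placed between the embedding and the language classifier, this makes the encoder ascend the language loss that the classifier descends, which is what pushes language information out of the embedding.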

## Headline result

| Encoder | Western voices gap | Indian voices gap |
|---|---|---|
| WavLM-base-plus-sv (off-the-shelf) | 0.082 | 0.006 |
| ECAPA-TDNN (off-the-shelf) | 0.105 | 0.058 |
| ECAPA + GRL (ablation) | 0.027 | 0.037 |
| **LASE r1 (ours)** | **0.013** | **−0.000** |

Lower is better. *Gap* = within-script median minus cross-script median for the same speaker. LASE r1's bootstrap 95% CI on the gap straddles zero on both held-out corpora.

## Usage

```python
from huggingface_hub import hf_hub_download
import torch

# clone github.com/praxelhq/lase first for the model code
from models.lase import LASE, LambdaSchedule, WavLMSpeakerEncoder

ckpt_path = hf_hub_download("Praxel/lase-r1", "last.pt")
backbone = WavLMSpeakerEncoder("microsoft/wavlm-base-plus", embedding_dim=256, freeze_backbone=True)
model = LASE(backbone, embedding_dim=256, n_languages=4,
             lambda_schedule=LambdaSchedule(200, 500, 0.1))
model.load_state_dict(torch.load(ckpt_path, map_location="cpu")["model"], strict=False)
model.eval()

wav = torch.randn(1, 32000)  # stand-in input: (B, T) float32 at 16 kHz, ~2 seconds
embedding = model(wav)["embedding"]  # (B, 256)
```
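
Embeddings from the snippet above are meant to be compared with cosine similarity. A small helper for same-speaker verification; the 0.75 decision threshold here is an illustrative guess, not a calibrated value:

```python
import torch
import torch.nn.functional as F

def same_speaker(emb_a: torch.Tensor, emb_b: torch.Tensor, threshold: float = 0.75) -> bool:
    """Cosine-compare two (256,) embeddings; True if they likely share a speaker."""
    score = F.cosine_similarity(emb_a.unsqueeze(0), emb_b.unsqueeze(0)).item()
    return score >= threshold

# Stand-in vectors; in practice pass model(wav)["embedding"][0] for two clips.
a = torch.randn(256)
print(same_speaker(a, a))  # identical vectors score 1.0 -> True
```

Calibrate the threshold on held-out same/different-speaker pairs before relying on it.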

## Training

- **Backbone**: `microsoft/wavlm-base-plus` (frozen)
- **Projection MLP**: 768 → 512 → 256 (~170k params)
- **Losses**: SupCon (voice identity) + GRL cross-entropy (4-language adversarial)
- **λ schedule**: hold at 0 for 200 warmup steps, ramp to 0.1 over 500 steps, then hold
- **Optimisation**: 1000 steps, batch 16, AdamW, LR 1e-4
- **Data**: 1118 same-voice cross-script pairs from 8 ElevenLabs Multilingual voices, gated through WavLM-cosine ≥ 0.90
- **Hardware**: 1× A10G on Modal, ~17 min, ~$0.31
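
The λ schedule above can be written as a plain function; this is a hypothetical re-implementation for readability, the repo's `LambdaSchedule(200, 500, 0.1)` is authoritative:

```python
def grl_lambda(step: int, warmup: int = 200, ramp: int = 500, max_lambda: float = 0.1) -> float:
    """0 during warmup, linear ramp to max_lambda over `ramp` steps, then constant."""
    if step < warmup:
        return 0.0
    return min((step - warmup) / ramp, 1.0) * max_lambda
```

Warming up with λ = 0 lets the language classifier train to a useful signal before its reversed gradient starts reshaping the embedding.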

## Datasets

- Training: [`Praxel/codeswitch-pairs-lase`](https://huggingface.co/datasets/Praxel/codeswitch-pairs-lase)
- Western held-out: [`Praxel/codeswitch-pairs-lase-heldout`](https://huggingface.co/datasets/Praxel/codeswitch-pairs-lase-heldout)
- Indian held-out: [`Praxel/codeswitch-pairs-lase-indian`](https://huggingface.co/datasets/Praxel/codeswitch-pairs-lase-indian)

## License

MIT.

## Citation

```bibtex
@misc{lase2026,
  title={{LASE}: Language-Adversarial Speaker Encoding for {Indic} Cross-Script Identity Preservation},
  author={Menta, Venkata Pushpak Teja},
  year={2026},
  eprint={TBD},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
}
```