---
license: mit
library_name: pytorch
tags:
- speaker-recognition
- speaker-encoding
- speech
- indic
- cross-lingual
- voice-cloning
language:
- en
- hi
- te
- ta
---

# LASE r1 — Language-Adversarial Speaker Encoder

Reference checkpoint for the paper *"LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation"* (arXiv:TBD).

LASE produces 256-dimensional speaker embeddings that preserve speaker identity across Devanagari, Telugu, Tamil, and Latin scripts. It wraps a frozen `microsoft/wavlm-base-plus` backbone with a 2-layer projection MLP and a gradient-reversal language classifier (~170k trainable parameters).

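The gradient-reversal trick at the heart of the language classifier can be sketched in a few lines of PyTorch. This is a generic GRL, not the repo's actual implementation; the class and function names are illustrative.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negated, scaled gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The language classifier's gradient is flipped before it reaches
        # the encoder, pushing the embedding toward language invariance.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=0.1):
    return GradReverse.apply(x, lam)
```

A language classifier placed behind `grad_reverse` trains normally, while the encoder beneath it is penalised for encoding language.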
## Headline result

| Encoder | Western voices gap | Indian voices gap |
|---|---|---|
| WavLM-base-plus-sv (off-the-shelf) | 0.082 | 0.006 |
| ECAPA-TDNN (off-the-shelf) | 0.105 | 0.058 |
| ECAPA + GRL (ablation) | 0.027 | 0.037 |
| **LASE r1 (ours)** | **0.013** | **−0.000** |

Lower is better. *gap* = within-script median similarity minus cross-script median similarity for the same speaker. LASE r1's bootstrap 95% CI on the gap straddles zero on both held-out corpora.

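The *gap* statistic is easy to compute once pairwise similarity scores are in hand. A minimal sketch (the pairing and scoring pipeline is assumed; only the statistic itself is shown, with toy scores):

```python
import statistics

def script_gap(within_scores, cross_scores):
    """gap = median(within-script similarity) - median(cross-script similarity).

    Near zero means the encoder scores a speaker's voice the same
    regardless of script; a large positive gap means cross-script drift.
    """
    return statistics.median(within_scores) - statistics.median(cross_scores)

# Toy example: within-script pairs score slightly higher than cross-script ones.
within = [0.91, 0.88, 0.93]
cross = [0.90, 0.87, 0.92]
print(round(script_gap(within, cross), 3))  # 0.01
```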
## Usage

```python
from huggingface_hub import hf_hub_download
import torch

# clone github.com/praxelhq/lase first for the model code
from models.lase import LASE, LambdaSchedule, WavLMSpeakerEncoder

ckpt_path = hf_hub_download("Praxel/lase-r1", "last.pt")
backbone = WavLMSpeakerEncoder("microsoft/wavlm-base-plus", embedding_dim=256, freeze_backbone=True)
model = LASE(backbone, embedding_dim=256, n_languages=4,
             lambda_schedule=LambdaSchedule(200, 500, 0.1))
model.load_state_dict(torch.load(ckpt_path)["model"], strict=False)
model.eval()

# wav: (B, T) float32 at 16 kHz, ~2 seconds
embedding = model(wav)["embedding"]  # (B, 256)
```

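To compare two utterances, take the cosine similarity between their embeddings. The random tensors below stand in for two `model(wav)["embedding"]` outputs so the snippet runs standalone; note that the 0.90 value mentioned under Training is the WavLM-cosine gate used to filter training pairs, not a tuned verification threshold for LASE.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Stand-ins for two model(wav)["embedding"] outputs, shape (1, 256).
emb_a = torch.randn(1, 256)
emb_b = torch.randn(1, 256)

score = F.cosine_similarity(emb_a, emb_b, dim=-1).item()  # in [-1, 1]
print(f"cosine similarity: {score:.3f}")
```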
## Training

- **Backbone**: `microsoft/wavlm-base-plus` (frozen)
- **Projection MLP**: 768 → 512 → 256 (~170k params)
- **Losses**: SupCon (voice identity) + GRL cross-entropy (4-language adversarial)
- **λ schedule**: hold at 0 for 200 warmup steps, ramp to 0.1 over 500 steps, then hold
- **Optimisation**: 1000 steps, batch 16, AdamW, LR 1e-4
- **Data**: 1118 same-voice cross-script pairs from 8 ElevenLabs Multilingual voices, gated through WavLM-cosine ≥ 0.90
- **Hardware**: 1× A10G on Modal, ~17 min, ~$0.31

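The λ schedule above is simple enough to write down. A sketch, assuming a linear ramp that starts once warmup ends; the repo's `LambdaSchedule(200, 500, 0.1)` may define its arguments differently:

```python
def lambda_at(step, warmup=200, ramp=500, target=0.1):
    """GRL weight: 0 during warmup, ramp to `target` over `ramp` steps, then hold."""
    if step < warmup:
        return 0.0
    return min(target, target * (step - warmup) / ramp)
```

Under this reading, λ is 0 at step 100, 0.05 halfway through the ramp (step 450), and pinned at 0.1 from step 700 onward.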
65
+
66
+ - Training: [`Praxel/codeswitch-pairs-lase`](https://huggingface.co/datasets/Praxel/codeswitch-pairs-lase)
67
+ - Western held-out: [`Praxel/codeswitch-pairs-lase-heldout`](https://huggingface.co/datasets/Praxel/codeswitch-pairs-lase-heldout)
68
+ - Indian held-out: [`Praxel/codeswitch-pairs-lase-indian`](https://huggingface.co/datasets/Praxel/codeswitch-pairs-lase-indian)
69
+
70
+ ## License
71
+
72
+ MIT.
73
+
74
+ ## Citation
75
+
76
+ ```bibtex
77
+ @misc{lase2026,
78
+ title={{LASE}: Language-Adversarial Speaker Encoding for {Indic} Cross-Script Identity Preservation},
79
+ author={Menta, Venkata Pushpak Teja},
80
+ year={2026},
81
+ eprint={TBD},
82
+ archivePrefix={arXiv},
83
+ primaryClass={eess.AS},
84
+ }
85
+ ```