
	---
	license: mit
	library_name: pytorch
	tags:
	- speaker-recognition
	- speaker-encoding
	- speech
	- indic
	- cross-lingual
	- voice-cloning
	language:
	- en
	- hi
	- te
	- ta
	---

	# LASE r1 — Language-Adversarial Speaker Encoder

	Reference checkpoint for the paper "LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation" ([arXiv:2605.00777](https://arxiv.org/abs/2605.00777)).

	LASE is a 256-d speaker embedding that preserves speaker identity across Devanagari, Telugu, Tamil, and Latin scripts. It wraps a frozen `microsoft/wavlm-base-plus` backbone with a 2-layer projection MLP and a gradient-reversal language classifier (~170k trainable params).
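The gradient-reversal idea can be sketched in a few lines of PyTorch. This is a minimal illustration, not the code in `models.lase`: the names `GradReverse` and `grad_reverse` are hypothetical, and the actual LASE implementation may differ.

```python
import torch


class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; scales gradients by -lam on the backward pass."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the sign so the encoder is pushed to remove the language cues
        # that a classifier placed after this layer is trying to exploit.
        # The second return value is None because lam needs no gradient.
        return -ctx.lam * grad_output, None


def grad_reverse(x, lam):
    """Hypothetical convenience wrapper around GradReverse.apply."""
    return GradReverse.apply(x, lam)


# Toy check: the forward value is unchanged, the gradient is reversed and scaled.
x = torch.ones(3, requires_grad=True)
y = grad_reverse(x, 0.1)
y.sum().backward()
```

In a setup like the one described above, the language classifier would sit after such a layer, so minimising its cross-entropy trains the classifier normally while pushing the projection towards language-invariant embeddings.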

	## Headline result

	| Encoder | Western voices gap | Indian voices gap |
	|---|---|---|
	| WavLM-base-plus-sv (off-the-shelf) | 0.082 | 0.006 |
	| ECAPA-TDNN (off-the-shelf) | 0.105 | 0.058 |
	| ECAPA + GRL (ablation) | 0.027 | 0.037 |
	| LASE r1 (ours) | 0.013 | −0.000 |

	Lower is better. The gap is the within-script median minus the cross-script median for the same speaker. For LASE r1, the bootstrap 95% CI on the gap straddles zero on both held-out corpora.
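The gap and its bootstrap interval can be illustrated with the standard library alone. This is a sketch under stated assumptions: the inputs are taken to be per-pair similarity scores, the interval is a plain percentile bootstrap, and the function names and numbers are made up for illustration. The paper's exact procedure may differ.

```python
import random
from statistics import median


def gap(within, cross):
    """Within-script median minus cross-script median."""
    return median(within) - median(cross)


def bootstrap_ci(within, cross, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the gap, resampling each list with replacement."""
    rng = random.Random(seed)
    gaps = sorted(
        gap(rng.choices(within, k=len(within)), rng.choices(cross, k=len(cross)))
        for _ in range(n_boot)
    )
    lo = gaps[int((alpha / 2) * n_boot)]
    hi = gaps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Made-up similarity scores for one speaker, purely for illustration.
within_script = [0.91, 0.88, 0.93, 0.90, 0.89]
cross_script = [0.90, 0.87, 0.92, 0.91, 0.88]
point = gap(within_script, cross_script)
lo, hi = bootstrap_ci(within_script, cross_script)
```

An interval with `lo <= 0 <= hi` is what "straddles zero" means here: the data do not distinguish the within-script and cross-script medians.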

	## Usage

	```python
	from huggingface_hub import hf_hub_download
	import torch
	# Clone github.com/praxelhq/lase first so that `models.lase` is importable.
	from models.lase import LASE, LambdaSchedule, WavLMSpeakerEncoder

	ckpt_path = hf_hub_download("Praxel/lase-r1", "last.pt")
	backbone = WavLMSpeakerEncoder("microsoft/wavlm-base-plus", embedding_dim=256, freeze_backbone=True)
	model = LASE(backbone, embedding_dim=256, n_languages=4,
	             lambda_schedule=LambdaSchedule(200, 500, 0.1))
	# map_location="cpu" lets the checkpoint load on machines without a GPU.
	model.load_state_dict(torch.load(ckpt_path, map_location="cpu")["model"], strict=False)
	model.eval()

	# wav: (B, T) float32 at 16 kHz, ~2 seconds. Random noise stands in for real audio here.
	wav = torch.randn(1, 32000)
	with torch.no_grad():  # inference only, no gradients needed
	    embedding = model(wav)["embedding"]  # (B, 256)
	```
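One way to use the embeddings is to score two utterances with cosine similarity. The card does not prescribe a scoring function, so treat this as an assumption. The helper name `same_speaker_score` is hypothetical, and the random tensors stand in for real `model(wav)["embedding"]` outputs.

```python
import torch
import torch.nn.functional as F


def same_speaker_score(emb_a, emb_b):
    """Cosine similarity between two batches of embeddings, (B, D) -> (B,)."""
    return F.cosine_similarity(emb_a, emb_b, dim=-1)


# Random stand-ins for model(wav_a)["embedding"] and model(wav_b)["embedding"].
emb_a = torch.randn(2, 256)
emb_b = torch.randn(2, 256)
scores = same_speaker_score(emb_a, emb_b)  # shape (2,), values in [-1, 1]
```

Any decision threshold on these scores would need to be calibrated on your own data; none is given here.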

	## Training

	- Backbone: `microsoft/wavlm-base-plus` (frozen)
	- Projection MLP: 768 → 512 → 256 (~170k params)
	- Losses: SupCon (voice identity) + GRL CE (4-language adversarial)
	- λ schedule: held at 0 for a 200-step warmup, ramped to 0.1 over 500 steps, then held
	- Optimisation: 1000 steps, batch size 16, AdamW, LR 1e-4
	- Data: 1,118 same-voice cross-script pairs from 8 ElevenLabs Multilingual voices, keeping only pairs with WavLM cosine similarity ≥ 0.90
	- Hardware: 1× A10G on Modal, ~17 min, ~$0.31
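The λ schedule in the list above can be written as a small function. This sketch assumes a linear ramp that starts once the warmup ends, so λ reaches 0.1 at step 700. The card does not state the ramp shape, and the repository's `LambdaSchedule` may behave differently. The name `grl_lambda` is hypothetical.

```python
def grl_lambda(step, warmup_steps=200, ramp_steps=500, lam_max=0.1):
    """Gradient-reversal weight at a given training step (assumed linear ramp)."""
    if step < warmup_steps:
        return 0.0  # warmup: the adversarial signal is switched off
    # Fraction of the ramp completed, capped at 1.0 so lambda holds at lam_max.
    progress = min(1.0, (step - warmup_steps) / ramp_steps)
    return lam_max * progress


# Sample the schedule at a few steps of the 1000-step run.
values = {s: grl_lambda(s) for s in (0, 200, 450, 700, 1000)}
```

Under these assumptions the final 300 of the 1000 training steps would run at the full λ of 0.1.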

	## Datasets

	- Training: [`Praxel/codeswitch-pairs-lase`](https://huggingface.co/datasets/Praxel/codeswitch-pairs-lase)
	- Western held-out: [`Praxel/codeswitch-pairs-lase-heldout`](https://huggingface.co/datasets/Praxel/codeswitch-pairs-lase-heldout)
	- Indian held-out: [`Praxel/codeswitch-pairs-lase-indian`](https://huggingface.co/datasets/Praxel/codeswitch-pairs-lase-indian)

	## License

	MIT.

	## Citation

	```bibtex
	@misc{lase2026,
	  title={{LASE}: Language-Adversarial Speaker Encoding for {Indic} Cross-Script Identity Preservation},
	  author={Menta, Venkata Pushpak Teja},
	  year={2026},
	  eprint={2605.00777},
	  archivePrefix={arXiv},
	  primaryClass={eess.AS},
	}
	```