---
license: apache-2.0
pipeline_tag: audio-to-audio
---
# PASE: Phonologically Anchored Speech Enhancer
PASE is a state-of-the-art generative speech enhancement model trained to remove noise and reverberation while preserving linguistic content and speaker identity. It operates on 16 kHz mono audio.
---
## Model Details
### Model Description
<img src="framework_all.png" alt="High-level system design" width="80%">
PASE contains two main components:
- **Denoising WavLM (DeWavLM)**
Fine‑tuned from WavLM‑Large using denoising representation distillation (DRD).
Performs robust noise suppression while mitigating linguistic hallucinations by leveraging the phonological prior of self-supervised WavLM.
- **Dual‑Stream Vocoder**
Reconstructs audio using DeWavLM's dual-stream representations:
- **Phonetic representation**: high-level linguistic structure
- **Acoustic representation**: speaker identity and prosody
**Developed by:** Cisco Systems, Inc.
**Cisco product group:** Collaboration AI (Xiaobin Rong, Qinwen Hu, Mansur Yesilbursa, Kamil Wojcicki)
**Model type:** Generative Speech Enhancement
**License:** Apache 2.0
**Finetuned from:** [WavLM-Large](https://github.com/microsoft/unilm/tree/master/wavlm)
---
### Model Sources
- **Repository:** https://github.com/cisco-open/pase
- **Paper:** https://arxiv.org/abs/2511.13300
- **Demo:** https://xiaobin-rong.github.io/pase_demo/
---
## Uses
### Direct Use
- Enhance noisy or reverberant speech recordings
- Improve perceptual quality and intelligibility
- Preserve speaker identity and linguistic content
- Supports **16 kHz mono audio**
### Out-of-Scope Use
- Medical, legal, or safety‑critical decisions
- Voice conversion or identity manipulation
- Non‑speech audio enhancement
---
## How to Get Started
Refer to the repository for quick-start code and examples:
https://github.com/cisco-open/pase
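The repository provides the official inference entry points. Independent of that API, the model's expected input format (16 kHz mono) can be prepared with a short, self-contained sketch. The function name and the linear-interpolation resampler below are illustrative assumptions, not the repository's code; in practice a polyphase resampler (e.g. soxr or librosa) is preferable.

```python
import numpy as np

TARGET_SR = 16_000  # PASE operates on 16 kHz mono audio

def to_16k_mono(audio: np.ndarray, sr: int) -> np.ndarray:
    """Downmix to mono and resample to 16 kHz.

    `audio` holds float samples with shape (n,) or (n, channels).
    Linear interpolation is a rough stand-in for a proper resampler.
    """
    if audio.ndim == 2:          # average channels -> mono
        audio = audio.mean(axis=1)
    if sr == TARGET_SR:
        return audio.astype(np.float32)
    n_out = int(round(len(audio) * TARGET_SR / sr))
    t_in = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
    t_out = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(t_out, t_in, audio).astype(np.float32)

# Example: 1 s of stereo audio at 48 kHz -> 16,000 mono samples
stereo = np.random.randn(48_000, 2).astype(np.float32)
mono16k = to_16k_mono(stereo, 48_000)
print(mono16k.shape)  # (16000,)
```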
---
## Training Details
### Training Data
We release a PASE checkpoint that has been trained on an updated list of datasets. For this release, training used:
- Clean speech:
- DNS5 Challenge clean-speech resources derived from the LibriVox public-domain subset
- [LibriTTS](https://www.openslr.org/60/)
- [VCTK](https://datashare.ed.ac.uk/handle/10283/3443)
- Noise:
- DNS5 Challenge noise resources
- Room impulse responses:
- [OpenSLR26](https://www.openslr.org/26/)
- [OpenSLR28](https://www.openslr.org/28/)
These source datasets were used to prepare training mixtures and train the released model. The model card and repository do not redistribute the underlying dataset contents; please refer to the original dataset pages and licenses below.
### Dataset Attribution
- DNS5 Challenge clean speech (LibriVox subset): clean-speech material prepared from [LibriVox](https://librivox.org/) through the [DNS Challenge](https://github.com/microsoft/DNS-Challenge). The LibriVox recordings used for this portion are [public domain](https://librivox.org/pages/public-domain/) and were used as clean-speech training data for the released checkpoint.
- LibriTTS: [LibriTTS](https://www.openslr.org/60/) by Heiga Zen et al., licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/). It was used as clean-speech training data for the released checkpoint.
- VCTK Corpus: the [VCTK dataset](https://datashare.ed.ac.uk/handle/10283/3443) from the Centre for Speech Technology Research, University of Edinburgh, licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/). It was used as clean-speech training data for the released checkpoint.
- DNS5 Challenge noise resources: noise data prepared through the [DNS Challenge](https://github.com/microsoft/DNS-Challenge) and used to synthesize noisy training mixtures for the released checkpoint. For this release, the DNS5 noise resources draw on [AudioSet](https://research.google.com/audioset/index.html) material licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/), selected [Freesound](https://freesound.org/) files licensed under [CC0 1.0](https://creativecommons.org/publicdomain/zero/1.0/), and [DEMAND](https://zenodo.org/record/1227121#.XRKKxYhKiUk) environmental recordings licensed under [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/deed.en_CA).
- OpenSLR26 and OpenSLR28: [OpenSLR26](https://www.openslr.org/26/) and [OpenSLR28](https://www.openslr.org/28/) room impulse response resources, both licensed under Apache 2.0, were used to add reverberation during training.
All audio was resampled to 16 kHz.
### Training Procedure
#### Preprocessing
- Mixtures generated dynamically
- SNR sampled from –5 to 15 dB
- Reverberation applied with 50% probability
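The preprocessing steps above can be sketched as follows. This is a minimal NumPy illustration of on-the-fly mixture generation; the function names, the uniform SNR draw, and the truncating convolution are assumptions for exposition, not the released training code.

```python
import numpy as np

rng = np.random.default_rng(0)

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add."""
    noise = noise[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12  # avoid division by zero
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

def make_mixture(clean: np.ndarray, noise: np.ndarray, rir=None) -> np.ndarray:
    """Dynamic mixture: SNR ~ U(-5, 15) dB; reverb applied with probability 0.5."""
    if rir is not None and rng.random() < 0.5:
        clean = np.convolve(clean, rir)[: len(clean)]  # truncate to input length
    snr_db = rng.uniform(-5.0, 15.0)
    return mix_at_snr(clean, noise, snr_db)

# Example: mix 1 s of synthetic "speech" and "noise" at a random SNR
clean = rng.standard_normal(16_000).astype(np.float32)
noise = rng.standard_normal(16_000).astype(np.float32)
noisy = make_mixture(clean, noise)
print(noisy.shape)  # (16000,)
```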
#### Training Hyperparameters
- **DeWavLM:** 100k steps, LR 1e‑4, batch size 4
- **Vocoder:** 200k steps, LR 2e‑4, batch size 12
- Optimizer: AdamW with warmup + cosine decay
- Hardware: 4 × NVIDIA RTX 4090 GPUs
#### Speeds, Sizes, Times
- Total parameters: ~382M
- Inference compute: ~21.4 GMAC/s
---
## Evaluation
### Testing Data
- Simulated [LibriTTS](https://www.openslr.org/60/) test set (using test split)
- [DNS1 test set](https://github.com/microsoft/DNS-Challenge/tree/interspeech2020/master/datasets/test_set/synthetic) with/without reverberation
### Metrics
- DNSMOS, UTMOS
- LPS, SpeechBERTScore (SBS)
- Speaker Similarity (RawNet3)
- WER (OWSM v3.1)
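WER (word error rate) is computed from recognizer transcripts, with OWSM v3.1 as the recognizer here. The metric itself is word-level edit distance (substitutions + insertions + deletions) divided by reference length; a minimal dependency-free sketch:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(r)][len(h)] / max(len(r), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 ref words
```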
### Results
Performance of the released checkpoint compared to the results reported in the paper:
| Model | DNSMOS | UTMOS | SBS | LPS | SpkSim | WER (%) |
|:-----:|:------:|:-----:|:---:|:---:|:------:|:-------:|
| Vocoder-L24 (paper) | 3.23 | 3.40 | 0.94 | 0.97 | 0.65 | 2.86 |
| **Vocoder-L24 (released)** | 3.29 | 3.30 | 0.94 | 0.96 | 0.59 | 3.46 |
| DeWavLM (paper) | 3.26 | 3.42 | 0.88 | 0.93 | 0.57 | 7.62 |
| **DeWavLM (released)** | 3.31 | 3.39 | 0.88 | 0.93 | 0.52 | 7.25 |
| PASE (paper) | 3.12 | 3.09 |0.90 |0.93 |0.80 | 7.49 |
| **PASE (released)** | 3.08 | 3.21 | 0.91 | 0.94 | 0.80 | 6.76 |
On our simulated test set, the released checkpoint performs very close to the results reported in the paper.
Overall, PASE achieves:
- Lowest WER among evaluated generative and discriminative baselines
- Highest speaker similarity (SpkSim)
- Strong perceptual quality with low hallucination rates
- Consistent performance across noisy and reverberant conditions
---
## Bias, Risks, and Limitations
- Model trained primarily on English speech; performance may degrade for other languages.
- Very strong noise or mismatched reverberation conditions can introduce artifacts.
- Speaker characteristics are preserved but not guaranteed perfectly.
---
### Recommendations
Evaluate outputs for your specific use case. Avoid deployments where misunderstanding enhanced speech could have safety or legal consequences.
---
## Citation
If you use PASE in your research, please cite:
```bibtex
@article{PASE,
  title={{PASE: Leveraging the Phonological Prior of WavLM for Low-Hallucination Generative Speech Enhancement}},
  author={Rong, Xiaobin and Hu, Qinwen and Yesilbursa, Mansur and Wojcicki, Kamil and Lu, Jing},
  journal={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={40},
  number={39},
  pages={32826--32834},
  year={2026},
  month={mar},
  doi={10.1609/aaai.v40i39.40562}
}
```
Copyright © 2026 by Cisco Systems, Inc. All rights reserved.
## Model Card Authorship & Contact
- Mansur Yesilbursa: myesilbu@cisco.com