---
license: apache-2.0
pipeline_tag: audio-to-audio
---
# PASE: Phonologically Anchored Speech Enhancer
PASE is a state-of-the-art generative speech enhancement model trained to remove noise and reverberation while preserving linguistic content and speaker identity. It operates on 16 kHz mono audio.
---
## Model Details
### Model Description
<img src="framework_all.png" alt="High-level system design" width="80%">
PASE contains two main components:
- **Denoising WavLM (DeWavLM)**
Fine‑tuned from WavLM‑Large using denoising representation distillation (DRD).
Performs robust noise suppression while mitigating linguistic hallucinations by leveraging the phonological prior of self-supervised WavLM.
- **Dual‑Stream Vocoder**
Reconstructs audio using DeWavLM's dual-stream representations:
- **Phonetic representation**: high-level linguistic structure
- **Acoustic representation**: speaker identity and prosody
**Developed by:** Cisco Systems, Inc.
**Cisco product group:** Collaboration AI (Xiaobin Rong, Qinwen Hu, Mansur Yesilbursa, Kamil Wojcicki)
**Model type:** Generative Speech Enhancement
**License:** Apache 2.0
**Finetuned from:** [WavLM-Large](https://github.com/microsoft/unilm/tree/master/wavlm)
---
### Model Sources
- **Repository:** https://github.com/cisco-open/pase
- **Paper:** https://arxiv.org/abs/2511.13300
- **Demo:** https://xiaobin-rong.github.io/pase_demo/
---
## Uses
### Direct Use
- Enhance noisy or reverberant speech recordings
- Improve perceptual quality and intelligibility
- Preserve speaker identity and linguistic content
- Supports **16 kHz mono audio**
### Out-of-Scope Use
- Medical, legal, or safety‑critical decisions
- Voice conversion or identity manipulation
- Non‑speech audio enhancement
---
## How to Get Started
Refer to the repository for quick-start code and examples:
https://github.com/cisco-open/pase
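The repository provides the official inference entry points. Independent of that API, the model's expected input format (16 kHz mono) can be prepared with a short, self-contained sketch. The function name and the linear-interpolation resampler below are illustrative assumptions, not the repository's code; in practice a polyphase resampler (e.g. soxr or librosa) is preferable.

```python
import numpy as np

TARGET_SR = 16_000  # PASE operates on 16 kHz mono audio

def to_16k_mono(audio: np.ndarray, sr: int) -> np.ndarray:
    """Downmix to mono and resample to 16 kHz.

    `audio` holds float samples with shape (n,) or (n, channels).
    Linear interpolation is a rough stand-in for a proper resampler.
    """
    if audio.ndim == 2:          # average channels -> mono
        audio = audio.mean(axis=1)
    if sr == TARGET_SR:
        return audio.astype(np.float32)
    n_out = int(round(len(audio) * TARGET_SR / sr))
    t_in = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
    t_out = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(t_out, t_in, audio).astype(np.float32)

# Example: 1 s of stereo audio at 48 kHz -> 16,000 mono samples
stereo = np.random.randn(48_000, 2).astype(np.float32)
mono16k = to_16k_mono(stereo, 48_000)
print(mono16k.shape)  # (16000,)
```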
---
## Training Details
### Training Data
We release a PASE checkpoint that has been trained on an updated list of datasets. For this release, training used:
- Clean speech:
- DNS5 Challenge clean-speech resources derived from the LibriVox public-domain subset
- [LibriTTS](https://www.openslr.org/60/)
- [VCTK](https://datashare.ed.ac.uk/handle/10283/3443)
- Noise:
- DNS5 Challenge noise resources
- Room impulse responses:
- [OpenSLR26](https://www.openslr.org/26/)
- [OpenSLR28](https://www.openslr.org/28/)
These source datasets were used to prepare training mixtures and train the released model. The model card and repository do not redistribute the underlying dataset contents; please refer to the original dataset pages and licenses below.
### Dataset Attribution
- DNS5 Challenge clean speech (LibriVox subset): clean-speech material prepared from [LibriVox](https://librivox.org/) through the [DNS Challenge](https://github.com/microsoft/DNS-Challenge). The LibriVox recordings used for this portion are [public domain](https://librivox.org/pages/public-domain/) and were used as clean-speech training data for the released checkpoint.
- LibriTTS: [LibriTTS](https://www.openslr.org/60/) by Heiga Zen et al., licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/). It was used as clean-speech training data for the released checkpoint.
- VCTK Corpus: the [VCTK dataset](https://datashare.ed.ac.uk/handle/10283/3443) from the Centre for Speech Technology Research, University of Edinburgh, licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/). It was used as clean-speech training data for the released checkpoint.
- DNS5 Challenge noise resources: noise data prepared through the [DNS Challenge](https://github.com/microsoft/DNS-Challenge) and used to synthesize noisy training mixtures for the released checkpoint. For this release, the DNS5 noise resources draw on [AudioSet](https://research.google.com/audioset/index.html) material licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/), selected [Freesound](https://freesound.org/) files licensed under [CC0 1.0](https://creativecommons.org/publicdomain/zero/1.0/), and [DEMAND](https://zenodo.org/record/1227121#.XRKKxYhKiUk) environmental recordings licensed under [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/deed.en_CA).
- OpenSLR26 and OpenSLR28: [OpenSLR26](https://www.openslr.org/26/) and [OpenSLR28](https://www.openslr.org/28/) room impulse response resources, both licensed under Apache 2.0, were used to add reverberation during training.
All audio was resampled to 16 kHz.
### Training Procedure
#### Preprocessing
- Mixtures generated dynamically
- SNR sampled from –5 to 15 dB
- Reverberation applied with 50% probability
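The preprocessing steps above can be sketched as follows. This is a minimal NumPy illustration of on-the-fly mixture generation; the function names, the uniform SNR draw, and the truncating convolution are assumptions for exposition, not the released training code.

```python
import numpy as np

rng = np.random.default_rng(0)

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add."""
    noise = noise[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12  # avoid division by zero
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

def make_mixture(clean: np.ndarray, noise: np.ndarray, rir=None) -> np.ndarray:
    """Dynamic mixture: SNR ~ U(-5, 15) dB; reverb applied with probability 0.5."""
    if rir is not None and rng.random() < 0.5:
        clean = np.convolve(clean, rir)[: len(clean)]  # truncate to input length
    snr_db = rng.uniform(-5.0, 15.0)
    return mix_at_snr(clean, noise, snr_db)

# Example: mix 1 s of synthetic "speech" and "noise" at a random SNR
clean = rng.standard_normal(16_000).astype(np.float32)
noise = rng.standard_normal(16_000).astype(np.float32)
noisy = make_mixture(clean, noise)
print(noisy.shape)  # (16000,)
```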
#### Training Hyperparameters
- **DeWavLM:** 100k steps, LR 1e‑4, batch size 4
- **Vocoder:** 200k steps, LR 2e‑4, batch size 12
- Optimizer: AdamW with warmup + cosine decay
- Hardware: 4 × NVIDIA RTX 4090 GPUs
#### Speeds, Sizes, Times
- Total parameters: ~382M
- Inference compute: ~21.4 GMAC/s
---
## Evaluation
### Testing Data
- Simulated [LibriTTS](https://www.openslr.org/60/) test set (using test split)
- [DNS1 test set](https://github.com/microsoft/DNS-Challenge/tree/interspeech2020/master/datasets/test_set/synthetic) with/without reverberation
### Metrics
- DNSMOS, UTMOS
- LPS, SpeechBERTScore (SBS)
- Speaker Similarity (RawNet3)
- WER (OWSM v3.1)
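WER (word error rate) is computed from recognizer transcripts, with OWSM v3.1 as the recognizer here. The metric itself is word-level edit distance (substitutions + insertions + deletions) divided by reference length; a minimal dependency-free sketch:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(r)][len(h)] / max(len(r), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 ref words
```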
### Results
Performance of the released checkpoint compared to the results reported in the paper:
| Model | DNSMOS | UTMOS | SBS | LPS | SpkSim | WER (%) |
|:-----:|:------:|:-----:|:---:|:---:|:------:|:-------:|
| Vocoder-L24 (paper) | 3.23 | 3.40 | 0.94 | 0.97 | 0.65 | 2.86 |
| **Vocoder-L24 (released)** | 3.29 | 3.30 | 0.94 | 0.96 | 0.59 | 3.46 |
| DeWavLM (paper) | 3.26 | 3.42 | 0.88 | 0.93 | 0.57 | 7.62 |
| **DeWavLM (released)** | 3.31 | 3.39 | 0.88 | 0.93 | 0.52 | 7.25 |
| PASE (paper) | 3.12 | 3.09 |0.90 |0.93 |0.80 | 7.49 |
| **PASE (released)** | 3.08 | 3.21 | 0.91 | 0.94 | 0.80 | 6.76 |
On our simulated test set, the released checkpoint performs very close to the results reported in the paper.
Overall, PASE achieves:
- Lowest WER among evaluated generative and discriminative baselines
- Highest speaker similarity (SpkSim)
- Strong perceptual quality with low hallucination rates
- Consistent performance across noisy and reverberant conditions
---
## Bias, Risks, and Limitations
- Model trained primarily on English speech; performance may degrade for other languages.
- Very strong noise or mismatched reverberation conditions can introduce artifacts.
- Speaker characteristics are preserved but not guaranteed perfectly.
---
### Recommendations
Evaluate outputs for your specific use case. Avoid deployments where misunderstanding enhanced speech could have safety or legal consequences.
---
## Citation
If you use PASE in your research, please cite:
```bibtex
@article{PASE,
  title={{PASE: Leveraging the Phonological Prior of WavLM for Low-Hallucination Generative Speech Enhancement}},
  author={Rong, Xiaobin and Hu, Qinwen and Yesilbursa, Mansur and Wojcicki, Kamil and Lu, Jing},
  journal={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={40},
  number={39},
  pages={32826--32834},
  year={2026},
  month={mar},
  doi={10.1609/aaai.v40i39.40562}
}
```
Copyright © 2026 by Cisco Systems, Inc. All rights reserved.
## Model Card Authorship & Contact
- Mansur Yesilbursa: myesilbu@cisco.com