---
license: apache-2.0
pipeline_tag: audio-to-audio
---
# PASE: Phonologically Anchored Speech Enhancer

PASE is a state-of-the-art generative speech enhancement model trained to remove noise and reverberation while preserving linguistic content and speaker identity. It operates on 16 kHz mono audio.

---

## Model Details

### Model Description

<img src="framework_all.png" alt="High-level system design" width="80%">

PASE contains two main components:
- **Denoising WavLM (DeWavLM)**
  Fine-tuned from WavLM-Large using denoising representation distillation (DRD).
  Performs robust noise suppression while effectively mitigating linguistic hallucinations by leveraging the phonological prior of self-supervised WavLM.

- **Dual-Stream Vocoder**
  Reconstructs audio from DeWavLM's dual-stream representations:
  - **Phonetic representation**: high-level linguistic structure
  - **Acoustic representation**: speaker identity and prosody

**Developed by:** Cisco Systems, Inc.
**Cisco product group:** Collaboration AI (Xiaobin Rong, Qinwen Hu, Mansur Yesilbursa, Kamil Wojcicki)
**Model type:** Generative speech enhancement
**License:** Apache 2.0
**Fine-tuned from:** [WavLM-Large](https://github.com/microsoft/unilm/tree/master/wavlm)

---

### Model Sources

- **Repository:** https://github.com/cisco-open/pase
- **Paper:** https://arxiv.org/abs/2511.13300
- **Demo:** https://xiaobin-rong.github.io/pase_demo/

---

## Uses

### Direct Use

- Enhance noisy or reverberant speech recordings
- Improve perceptual quality and intelligibility
- Preserve speaker identity and linguistic content
- Supports **16 kHz mono audio**

### Out-of-Scope Use

- Medical, legal, or safety-critical decisions
- Voice conversion or identity manipulation
- Non-speech audio enhancement

---

## How to Get Started

Refer to the repository for quick-start code and examples:
https://github.com/cisco-open/pase
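
PASE expects 16 kHz mono input, so recordings at other sample rates or with multiple channels need to be converted first. The sketch below shows one minimal way to do that with NumPy; the helper name is illustrative and not part of the PASE API, and a dedicated polyphase resampler (e.g. soxr or torchaudio) is preferable in practice.

```python
import numpy as np

def to_16k_mono(audio: np.ndarray, sr: int, target_sr: int = 16000) -> np.ndarray:
    """Downmix to mono and linearly resample to `target_sr`.

    audio: float array shaped (samples,) or (samples, channels).
    Linear interpolation is a simple stand-in for a proper resampler.
    """
    if audio.ndim == 2:                       # average channels -> mono
        audio = audio.mean(axis=1)
    if sr == target_sr:
        return audio.astype(np.float32)
    duration = audio.shape[0] / sr            # clip length in seconds
    n_out = int(round(duration * target_sr))  # output length at 16 kHz
    t_in = np.arange(audio.shape[0]) / sr
    t_out = np.arange(n_out) / target_sr
    return np.interp(t_out, t_in, audio).astype(np.float32)

# Example: 1 s of 48 kHz stereo becomes 16 000 mono samples.
stereo = np.random.randn(48000, 2)
mono16k = to_16k_mono(stereo, 48000)
```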

---

## Training Details

### Training Data

We release a PASE checkpoint trained on an updated list of datasets. For this release, training used:

- Clean speech:
  - DNS5 Challenge clean-speech resources derived from the LibriVox public-domain subset
  - [LibriTTS](https://www.openslr.org/60/)
  - [VCTK](https://datashare.ed.ac.uk/handle/10283/3443)
- Noise:
  - DNS5 Challenge noise resources
- Room impulse responses:
  - [OpenSLR26](https://www.openslr.org/26/)
  - [OpenSLR28](https://www.openslr.org/28/)

These source datasets were used to prepare training mixtures and train the released model. The model card and repository do not redistribute the underlying dataset contents; please refer to the original dataset pages and licenses below.

### Dataset Attribution

- DNS5 Challenge clean speech (LibriVox subset): clean-speech material prepared from [LibriVox](https://librivox.org/) through the [DNS Challenge](https://github.com/microsoft/DNS-Challenge). The LibriVox recordings used for this portion are [public domain](https://librivox.org/pages/public-domain/) and were used as clean-speech training data for the released checkpoint.
- LibriTTS: [LibriTTS](https://www.openslr.org/60/) by Heiga Zen et al., licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/). It was used as clean-speech training data for the released checkpoint.
- VCTK Corpus: the [VCTK dataset](https://datashare.ed.ac.uk/handle/10283/3443) from the Centre for Speech Technology Research, University of Edinburgh, licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/). It was used as clean-speech training data for the released checkpoint.
- DNS5 Challenge noise resources: noise data prepared through the [DNS Challenge](https://github.com/microsoft/DNS-Challenge) and used to synthesize noisy training mixtures for the released checkpoint. For this release, the DNS5 noise resources draw on [AudioSet](https://research.google.com/audioset/index.html) material licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/), selected [Freesound](https://freesound.org/) files licensed under [CC0 1.0](https://creativecommons.org/publicdomain/zero/1.0/), and [DEMAND](https://zenodo.org/record/1227121#.XRKKxYhKiUk) environmental recordings licensed under [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/deed.en_CA).
- OpenSLR26 and OpenSLR28: [OpenSLR26](https://www.openslr.org/26/) and [OpenSLR28](https://www.openslr.org/28/) room impulse response resources, both licensed under Apache 2.0, were used to add reverberation during training.

All audio was resampled to 16 kHz.

### Training Procedure

#### Preprocessing

- Mixtures generated dynamically
- SNR sampled from -5 to 15 dB
- Reverberation applied with 50% probability
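
The mixture-generation step above can be sketched as follows. This is an illustrative reconstruction rather than the project's actual data pipeline: it scales a noise clip so that the clean-to-noise power ratio matches an SNR drawn uniformly from the stated range.

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the clean/noise power ratio equals `snr_db`, then mix."""
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # guard against silent noise
    # SNR(dB) = 10*log10(clean_power / (gain**2 * noise_power)); solve for gain.
    gain = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + gain * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)       # 1 s of "speech" at 16 kHz
noise = rng.standard_normal(16000)
snr_db = rng.uniform(-5.0, 15.0)         # SNR sampled from -5 to 15 dB
noisy = mix_at_snr(clean, noise, snr_db)

# The achieved SNR should match the sampled target.
achieved = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
```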

#### Training Hyperparameters

- **DeWavLM:** 100k steps, LR 1e-4, batch size 4
- **Vocoder:** 200k steps, LR 2e-4, batch size 12
- Optimizer: AdamW with warmup + cosine decay
- Hardware: 4 × NVIDIA RTX 4090 GPUs
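
The warmup-plus-cosine schedule can be written as a small function of the step index. The warmup length below is an assumption chosen for illustration; the card only states that AdamW with warmup and cosine decay was used.

```python
import math

def warmup_cosine_lr(step: int, peak_lr: float, warmup_steps: int,
                     total_steps: int, min_lr: float = 0.0) -> float:
    """Linear warmup to `peak_lr`, then cosine decay to `min_lr`."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Illustrative values for the DeWavLM stage (100k steps, peak LR 1e-4);
# the 5k-step warmup is an assumption, not taken from the paper.
peak, warmup, total = 1e-4, 5_000, 100_000
schedule = [warmup_cosine_lr(s, peak, warmup, total) for s in (0, warmup, total)]
```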

#### Speeds, Sizes, Times

- Total parameters: ~382M
- Inference compute: ~21.4 GMAC/s

---

## Evaluation

### Testing Data

- Simulated [LibriTTS](https://www.openslr.org/60/) test set (using the test split)
- [DNS1 test set](https://github.com/microsoft/DNS-Challenge/tree/interspeech2020/master/datasets/test_set/synthetic) with and without reverberation

### Metrics

- DNSMOS, UTMOS
- LPS, SpeechBERTScore (SBS)
- Speaker similarity (RawNet3)
- WER (OWSM v3.1)
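
Independent of the ASR front end (OWSM v3.1 here), WER itself is the word-level Levenshtein distance between reference and hypothesis transcripts, normalized by the reference length. A minimal reference implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / ref words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words
```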

### Results

Performance of the released checkpoints compared with the paper's results:

| Model | DNSMOS | UTMOS | SBS | LPS | SpkSim | WER (%) |
|:-----:|:------:|:-----:|:---:|:---:|:------:|:-------:|
| Vocoder-L24 (paper) | 3.23 | 3.40 | 0.94 | 0.97 | 0.65 | 2.86 |
| **Vocoder-L24 (released)** | 3.29 | 3.30 | 0.94 | 0.96 | 0.59 | 3.46 |
| DeWavLM (paper) | 3.26 | 3.42 | 0.88 | 0.93 | 0.57 | 7.62 |
| **DeWavLM (released)** | 3.31 | 3.39 | 0.88 | 0.93 | 0.52 | 7.25 |
| PASE (paper) | 3.12 | 3.09 | 0.90 | 0.93 | 0.80 | 7.49 |
| **PASE (released)** | 3.08 | 3.21 | 0.91 | 0.94 | 0.80 | 6.76 |

The released checkpoints perform very closely to the paper's results on our simulated test set.

Overall, PASE achieves:

- Lowest WER among evaluated generative and discriminative baselines
- Highest speaker similarity (SpkSim)
- Strong perceptual quality with low hallucination rates
- Consistent performance across noisy and reverberant conditions

---

## Bias, Risks, and Limitations

- The model was trained primarily on English speech; performance may degrade for other languages.
- Very strong noise or mismatched reverberation conditions can introduce artifacts.
- Speaker characteristics are preserved but not guaranteed perfectly.

---

### Recommendations

Evaluate outputs for your specific use case. Avoid deployments where misunderstanding enhanced speech could have safety or legal consequences.

---
## Citation

If you use PASE in your research, please cite:

```bibtex
@article{PASE,
  title={{PASE: Leveraging the Phonological Prior of WavLM for Low-Hallucination Generative Speech Enhancement}},
  volume={40},
  DOI={10.1609/aaai.v40i39.40562},
  number={39},
  journal={Proceedings of the AAAI Conference on Artificial Intelligence},
  author={Rong, Xiaobin and Hu, Qinwen and Yesilbursa, Mansur and Wojcicki, Kamil and Lu, Jing},
  year={2026},
  month={Mar.},
  pages={32826--32834}
}
```

Copyright © 2026 by Cisco Systems, Inc. All rights reserved.

## Model Card Authorship & Contact

- Mansur Yesilbursa: myesilbu@cisco.com