---
license: apache-2.0
pipeline_tag: audio-to-audio
---

# PASE: Phonologically Anchored Speech Enhancer

PASE is a state-of-the-art generative speech enhancement model trained to remove noise and reverberation while preserving linguistic content and speaker identity. It operates on 16 kHz mono audio.

---

## Model Details

### Model Description

High-level system design:

PASE contains two main components:

- **Denoising WavLM (DeWavLM)**
  Fine-tuned from WavLM-Large using denoising representation distillation (DRD). Performs robust noise suppression while effectively mitigating linguistic hallucinations by leveraging the phonological prior of self-supervised WavLM.
- **Dual-Stream Vocoder**
  Reconstructs audio using DeWavLM's dual-stream representations:
  - **Phonetic representation**: high-level linguistic structure
  - **Acoustic representation**: speaker identity and prosody

**Developed by:** Copyright © 2026 by Cisco Systems, Inc. All rights reserved.

**Cisco product group:** Collaboration AI: Xiaobin Rong, Qinwen Hu, Mansur Yesilbursa, Kamil Wojcicki

**Model type:** Generative Speech Enhancement

**License:** Apache 2.0

**Finetuned from:** [WavLM-Large](https://github.com/microsoft/unilm/tree/master/wavlm)

---

### Model Sources

- **Repository:** https://github.com/cisco-open/pase
- **Paper:** https://arxiv.org/abs/2511.13300
- **Demo:** https://xiaobin-rong.github.io/pase_demo/

---

## Uses

### Direct Use

- Enhance noisy or reverberant speech recordings
- Improve perceptual quality and intelligibility
- Preserve speaker identity and linguistic content
- Supports **16 kHz mono audio**

### Out-of-Scope Use

- Medical, legal, or safety-critical decisions
- Voice conversion or identity manipulation
- Non-speech audio enhancement

---

## How to Get Started

Refer to the repository for quick-start code and examples: https://github.com/cisco-open/pase

---

## Training Details

### Training Data

We release a PASE checkpoint that has been trained on an updated list of datasets.
For this release, training used:

- Clean speech:
  - DNS5 Challenge clean-speech resources derived from the LibriVox public-domain subset
  - [LibriTTS](https://www.openslr.org/60/)
  - [VCTK](https://datashare.ed.ac.uk/handle/10283/3443)
- Noise:
  - DNS5 Challenge noise resources
- Room impulse responses:
  - [OpenSLR26](https://www.openslr.org/26/)
  - [OpenSLR28](https://www.openslr.org/28/)

These source datasets were used to prepare training mixtures and train the released model. The model card and repository do not redistribute the underlying dataset contents; please refer to the original dataset pages and licenses below.

### Dataset Attribution

- DNS5 Challenge clean speech (LibriVox subset): clean-speech material prepared from [LibriVox](https://librivox.org/) through the [DNS Challenge](https://github.com/microsoft/DNS-Challenge). The LibriVox recordings used for this portion are [public domain](https://librivox.org/pages/public-domain/) and were used as clean-speech training data for the released checkpoint.
- LibriTTS: [LibriTTS](https://www.openslr.org/60/) by Heiga Zen et al., licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/). It was used as clean-speech training data for the released checkpoint.
- VCTK Corpus: the [VCTK dataset](https://datashare.ed.ac.uk/handle/10283/3443) from the Centre for Speech Technology Research, University of Edinburgh, licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/). It was used as clean-speech training data for the released checkpoint.
- DNS5 Challenge noise resources: noise data prepared through the [DNS Challenge](https://github.com/microsoft/DNS-Challenge) and used to synthesize noisy training mixtures for the released checkpoint.
  For this release, the DNS5 noise resources draw on [AudioSet](https://research.google.com/audioset/index.html) material licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/), selected [Freesound](https://freesound.org/) files licensed under [CC0 1.0](https://creativecommons.org/publicdomain/zero/1.0/), and [DEMAND](https://zenodo.org/record/1227121#.XRKKxYhKiUk) environmental recordings licensed under [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/deed.en_CA).
- OpenSLR26 and OpenSLR28: [OpenSLR26](https://www.openslr.org/26/) and [OpenSLR28](https://www.openslr.org/28/) room impulse response resources, both licensed under Apache 2.0, were used to add reverberation during training.

All audio was resampled to 16 kHz.

### Training Procedure

#### Preprocessing

- Mixtures generated dynamically
- SNR sampled from -5 to 15 dB
- Reverberation applied with 50% probability

#### Training Hyperparameters

- **DeWavLM:** 100k steps, LR 1e-4, batch size 4
- **Vocoder:** 200k steps, LR 2e-4, batch size 12
- Optimizer: AdamW with warmup + cosine decay
- Hardware: 4 × NVIDIA RTX 4090 GPUs

#### Speeds, Sizes, Times

- Total parameters: ~382M
- Inference compute: ~21.4 GMAC/s

---

## Evaluation

### Testing Data

- Simulated [LibriTTS](https://www.openslr.org/60/) test set (using the test split)
- [DNS1 test set](https://github.com/microsoft/DNS-Challenge/tree/interspeech2020/master/datasets/test_set/synthetic) with/without reverberation

### Metrics

- DNSMOS, UTMOS
- LPS, SpeechBERTScore (SBS)
- Speaker Similarity (RawNet3)
- WER (OWSM v3.1)

### Results

Performance of the released version compared to the paper's results:

| Model | DNSMOS | UTMOS | SBS | LPS | SpkSim | WER (%) |
|:-----:|:------:|:-----:|:---:|:---:|:------:|:-------:|
| Vocoder-L24 (paper) | 3.23 | 3.40 | 0.94 | 0.97 | 0.65 | 2.86 |
| **Vocoder-L24 (released)** | 3.29 | 3.30 | 0.94 | 0.96 | 0.59 | 3.46 |
| DeWavLM (paper) | 3.26 | 3.42 | 0.88 | 0.93 | 0.57 | 7.62 |
| **DeWavLM (released)** | 3.31 | 3.39 | 0.88 | 0.93 | 0.52 | 7.25 |
| PASE (paper) | 3.12 | 3.09 | 0.90 | 0.93 | 0.80 | 7.49 |
| **PASE (released)** | 3.08 | 3.21 | 0.91 | 0.94 | 0.80 | 6.76 |

The released version achieves performance very close to the paper's results on our simulated test set.

Overall, PASE achieves:

- Lowest WER among evaluated generative and discriminative baselines
- Highest speaker similarity (SpkSim)
- Strong perceptual quality with low hallucination rates
- Consistent performance across noisy and reverberant conditions

---

## Bias, Risks, and Limitations

- The model is trained primarily on English speech; performance may degrade for other languages.
- Very strong noise or mismatched reverberation conditions can introduce artifacts.
- Speaker characteristics are preserved but not guaranteed perfectly.

---

### Recommendations

Evaluate outputs for your specific use case. Avoid deployments where misunderstanding enhanced speech could have safety or legal consequences.

---

## Citation

If you use PASE in your research, please cite:

```bibtex
@article{PASE,
  title={{PASE: Leveraging the Phonological Prior of WavLM for Low-Hallucination Generative Speech Enhancement}},
  volume={40},
  DOI={10.1609/aaai.v40i39.40562},
  number={39},
  journal={Proceedings of the AAAI Conference on Artificial Intelligence},
  author={Rong, Xiaobin and Hu, Qinwen and Yesilbursa, Mansur and Wojcicki, Kamil and Lu, Jing},
  year={2026},
  month={Mar.},
  pages={32826--32834}
}
```

Copyright © 2026 by Cisco Systems, Inc. All rights reserved.

## Model Card Authorship & Contact

- Mansur Yesilbursa: myesilbu@cisco.com
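## Appendix: Mixture Generation Sketch

As a concrete illustration of the dynamic mixture generation described under Preprocessing (SNR sampled from -5 to 15 dB, reverberation applied with 50% probability), the sketch below mixes clean speech with noise at a target SNR using NumPy. The `mix_at_snr` helper and its signature are illustrative only and are not part of the PASE codebase; see the repository for the actual training pipeline.

```python
import numpy as np


def mix_at_snr(clean, noise, snr_db, rir=None, reverb_prob=0.5, rng=None):
    """Mix clean speech with noise at a target SNR in dB.

    Optionally convolves the clean signal with a room impulse response
    (rir), applied with probability reverb_prob (50% in training).
    Illustrative helper; not the actual PASE training code.
    """
    rng = rng if rng is not None else np.random.default_rng()
    if rir is not None and rng.random() < reverb_prob:
        # Apply reverberation, truncated to the original length.
        clean = np.convolve(clean, rir)[: len(clean)]
    # Tile/crop the noise clip to match the clean signal's length.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    # Scale the noise so the mixture hits the target SNR.
    clean_pow = np.mean(clean ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(clean_pow / (noise_pow * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise


# Dynamic mixture generation: sample an SNR uniformly from -5 to 15 dB.
rng = np.random.default_rng(0)
snr_db = rng.uniform(-5.0, 15.0)
clean = rng.standard_normal(16000)  # 1 s stand-in for 16 kHz speech
noise = rng.standard_normal(8000)   # shorter noise clip, tiled as needed
mixture = mix_at_snr(clean, noise, snr_db, rng=rng)
```

In actual training, `clean` and `noise` would come from the clean-speech and noise datasets listed above, and `rir` from the OpenSLR26/OpenSLR28 room impulse responses.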