---
license: apache-2.0
pipeline_tag: audio-to-audio
---
# PASE: Phonologically Anchored Speech Enhancer

PASE is a state-of-the-art generative speech enhancement model trained to remove noise and reverberation while preserving linguistic content and speaker identity. It operates on 16 kHz mono audio.

---

## Model Details

### Model Description


<img src="framework_all.png" alt="High-level system design" width="80%">

PASE contains two main components:

- **Denoising WavLM (DeWavLM)**  
  Fine‑tuned from WavLM‑Large using denoising representation distillation (DRD).  
  Performs robust noise suppression while effectively mitigating linguistic hallucinations by leveraging the phonological prior of self-supervised WavLM.

- **Dual‑Stream Vocoder**  
  Reconstructs audio using DeWavLM's dual-stream representations:
  - **Phonetic representation**: high-level linguistic structure  
  - **Acoustic representation**: speaker identity and prosody

**Developed by:** Cisco Systems, Inc. (Copyright © 2026 by Cisco Systems, Inc. All rights reserved.)  
**Cisco product group:** Collaboration AI (Xiaobin Rong, Qinwen Hu, Mansur Yesilbursa, Kamil Wojcicki)  
**Model type:** Generative Speech Enhancement  
**License:** Apache 2.0  
**Finetuned from:** [WavLM-Large](https://github.com/microsoft/unilm/tree/master/wavlm)


---

### Model Sources

- **Repository:** https://github.com/cisco-open/pase
- **Paper:** https://arxiv.org/abs/2511.13300
- **Demo:** https://xiaobin-rong.github.io/pase_demo/
---

## Uses

### Direct Use

- Enhance noisy or reverberant speech recordings  
- Improve perceptual quality and intelligibility  
- Preserve speaker identity and linguistic content  
- Supports **16 kHz mono audio**

### Out-of-Scope Use

- Medical, legal, or safety‑critical decisions  
- Voice conversion or identity manipulation  
- Non‑speech audio enhancement

---
## How to Get Started
Refer to the repository for quick-start code and examples:  
https://github.com/cisco-open/pase
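
Since PASE expects 16 kHz mono input, recordings at other sample rates or with multiple channels need to be converted first. The sketch below shows one way to do that with SciPy; the `to_16k_mono` helper is illustrative (not part of the PASE repository), and the commented-out `enhance` call at the end is a placeholder for the repository's actual entry point.

```python
import numpy as np
from math import gcd
from scipy.signal import resample_poly

TARGET_SR = 16_000  # PASE operates on 16 kHz mono audio

def to_16k_mono(wave: np.ndarray, sr: int) -> np.ndarray:
    """Downmix to mono and resample to 16 kHz.

    `wave` is float audio shaped (samples,) or (samples, channels).
    """
    if wave.ndim == 2:              # average channels down to mono
        wave = wave.mean(axis=1)
    if sr != TARGET_SR:             # polyphase resampling to 16 kHz
        g = gcd(sr, TARGET_SR)
        wave = resample_poly(wave, TARGET_SR // g, sr // g)
    return wave.astype(np.float32)

# Example: 1 s of stereo audio at 44.1 kHz -> 16,000 mono samples
stereo = np.random.randn(44_100, 2).astype(np.float32)
mono16k = to_16k_mono(stereo, 44_100)
# enhanced = pase.enhance(mono16k)  # hypothetical call; see the repository for the real API
```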

---
## Training Details
### Training Data
We release a PASE checkpoint trained on an updated set of datasets relative to the paper. For this release, training used:

- Clean speech:
  - DNS5 Challenge clean-speech resources derived from the LibriVox public-domain subset
  - [LibriTTS](https://www.openslr.org/60/)
  - [VCTK](https://datashare.ed.ac.uk/handle/10283/3443)
- Noise:
  - DNS5 Challenge noise resources
- Room impulse responses:
  - [OpenSLR26](https://www.openslr.org/26/)
  - [OpenSLR28](https://www.openslr.org/28/)

These source datasets were used to prepare training mixtures and train the released model. The model card and repository do not redistribute the underlying dataset contents; please refer to the original dataset pages and licenses below.

### Dataset Attribution
- DNS5 Challenge clean speech (LibriVox subset): clean-speech material prepared from [LibriVox](https://librivox.org/) through the [DNS Challenge](https://github.com/microsoft/DNS-Challenge). The LibriVox recordings used for this portion are [public domain](https://librivox.org/pages/public-domain/) and were used as clean-speech training data for the released checkpoint.
- LibriTTS: [LibriTTS](https://www.openslr.org/60/) by Heiga Zen et al., licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/). It was used as clean-speech training data for the released checkpoint.
- VCTK Corpus: the [VCTK dataset](https://datashare.ed.ac.uk/handle/10283/3443) from the Centre for Speech Technology Research, University of Edinburgh, licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/). It was used as clean-speech training data for the released checkpoint.
- DNS5 Challenge noise resources: noise data prepared through the [DNS Challenge](https://github.com/microsoft/DNS-Challenge) and used to synthesize noisy training mixtures for the released checkpoint. For this release, the DNS5 noise resources draw on [AudioSet](https://research.google.com/audioset/index.html) material licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/), selected [Freesound](https://freesound.org/) files licensed under [CC0 1.0](https://creativecommons.org/publicdomain/zero/1.0/), and [DEMAND](https://zenodo.org/record/1227121#.XRKKxYhKiUk) environmental recordings licensed under [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/deed.en_CA).
- OpenSLR26 and OpenSLR28: [OpenSLR26](https://www.openslr.org/26/) and [OpenSLR28](https://www.openslr.org/28/) room impulse response resources, both licensed under Apache 2.0, were used to add reverberation during training.

All audio was resampled to 16 kHz.

### Training Procedure
#### Preprocessing
- Mixtures generated dynamically  
- SNR sampled from –5 to 15 dB  
- Reverberation applied with 50% probability
#### Training Hyperparameters
- **DeWavLM:** 100k steps, LR 1e‑4, batch size 4  
- **Vocoder:** 200k steps, LR 2e‑4, batch size 12  
- Optimizer: AdamW with warmup + cosine decay  
- Hardware: 4 × NVIDIA RTX 4090 GPUs
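
The warmup-plus-cosine schedule mentioned above can be sketched as a plain function of the step index. The warmup length is not stated in this card, so the 5k-step value below is an assumed placeholder; peak LR and step counts match the DeWavLM stage.

```python
import math

def lr_at(step: int, total_steps: int, warmup_steps: int, peak_lr: float) -> float:
    """Linear warmup to `peak_lr`, then cosine decay toward zero over the remaining steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# e.g. the DeWavLM stage: 100k steps, peak LR 1e-4, assumed 5k warmup steps
schedule = [lr_at(s, 100_000, 5_000, 1e-4) for s in (0, 4_999, 50_000, 99_999)]
```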
#### Speeds, Sizes, Times
- Total parameters: ~382M  
- Inference compute: ~21.4 GMAC/s
---
## Evaluation
### Testing Data
- Simulated test set constructed from the [LibriTTS](https://www.openslr.org/60/) test split
- [DNS1 test set](https://github.com/microsoft/DNS-Challenge/tree/interspeech2020/master/datasets/test_set/synthetic) with/without reverberation
### Metrics
- DNSMOS, UTMOS  
- LPS, SpeechBERTScore (SBS)  
- Speaker Similarity (RawNet3)  
- WER (OWSM v3.1)
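
As a reference for how the WER numbers below are defined, the sketch here computes a generic word-level edit-distance WER. It is not the OWSM v3.1 evaluation pipeline itself, which also handles transcription and text normalization.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])  # substitution (or match)
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)  # deletion / insertion
    return dp[len(ref)][len(hyp)] / max(1, len(ref))
```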

### Results

The performance of the released version compared to the paper's results:
| Model | DNSMOS | UTMOS | SBS | LPS | SpkSim | WER (%) |
|:-----:|:------:|:-----:|:---:|:---:|:------:|:-------:|
| Vocoder-L24 (paper) | 3.23 | 3.40 | 0.94 | 0.97 | 0.65 | 2.86 |
| **Vocoder-L24 (released)** | 3.29 | 3.30 | 0.94 | 0.96 | 0.59 | 3.46 |
| DeWavLM (paper) | 3.26 | 3.42 | 0.88 | 0.93 | 0.57 | 7.62 |
| **DeWavLM (released)** | 3.31 | 3.39 | 0.88 | 0.93 | 0.52 | 7.25 |
| PASE (paper) | 3.12 | 3.09 | 0.90 | 0.93 | 0.80 | 7.49 |
| **PASE (released)** | 3.08 | 3.21 | 0.91 | 0.94 | 0.80 | 6.76 |

On our simulated test set, the released checkpoints perform on par with the results reported in the paper.

Overall, PASE achieves:
- Lowest WER among evaluated generative and discriminative baselines  
- Highest speaker similarity (SpkSim)  
- Strong perceptual quality with low hallucination rates  
- Consistent performance across noisy and reverberant conditions

---
## Bias, Risks, and Limitations
- Model trained primarily on English speech; performance may degrade for other languages.  
- Very strong noise or mismatched reverberation conditions can introduce artifacts.  
- Speaker characteristics are preserved but not guaranteed perfectly.

---
### Recommendations
Evaluate outputs for your specific use case. Avoid deployments where misunderstanding enhanced speech could have safety or legal consequences.

---
## Citation
If you use PASE in your research, please cite:
```bibtex
@article{PASE, 
    title={{PASE: Leveraging the Phonological Prior of WavLM for Low-Hallucination Generative Speech Enhancement}},
    volume={40},
    DOI={10.1609/aaai.v40i39.40562}, 
    number={39}, 
    journal={Proceedings of the AAAI Conference on Artificial Intelligence}, 
    author={Rong, Xiaobin and Hu, Qinwen and Yesilbursa, Mansur and Wojcicki, Kamil and Lu, Jing}, 
    year={2026},
    month={Mar.}, 
    pages={32826-32834}
}
```
Copyright © 2026 by Cisco Systems, Inc. All rights reserved.
## Model Card Authorship & Contact
- Mansur Yesilbursa: myesilbu@cisco.com