---
license: apache-2.0
pipeline_tag: audio-to-audio
---
# PASE: Phonologically Anchored Speech Enhancer

PASE is a state-of-the-art generative speech enhancement model trained to remove noise and reverberation while preserving linguistic content and speaker identity. It operates on 16 kHz mono audio.

---

## Model Details

### Model Description


<img src="framework_all.png" alt="High-level system design" width="80%">

PASE contains two main components:

- **Denoising WavLM (DeWavLM)**  
  Fine‑tuned from WavLM‑Large using denoising representation distillation (DRD).  
  Performs robust noise suppression while effectively mitigating linguistic hallucinations by leveraging the phonological prior of self-supervised WavLM.

- **Dual‑Stream Vocoder**  
  Reconstructs audio using DeWavLM's dual-stream representations:
  - **Phonetic representation**: high-level linguistic structure  
  - **Acoustic representation**: speaker identity and prosody

**Developed by:** Cisco Systems, Inc. (Copyright © 2026 by Cisco Systems, Inc. All rights reserved.)  
**Cisco product group:** Collaboration AI (Xiaobin Rong, Qinwen Hu, Mansur Yesilbursa, Kamil Wojcicki)  
**Model type:** Generative Speech Enhancement  
**License:** Apache 2.0  
**Finetuned from:** [WavLM-Large](https://github.com/microsoft/unilm/tree/master/wavlm)


---

### Model Sources

- **Repository:** https://github.com/cisco-open/pase
- **Paper:** https://arxiv.org/abs/2511.13300
- **Demo:** https://xiaobin-rong.github.io/pase_demo/
---

## Uses

### Direct Use

- Enhance noisy or reverberant speech recordings  
- Improve perceptual quality and intelligibility  
- Preserve speaker identity and linguistic content  
- Supports **16 kHz mono audio**

### Out-of-Scope Use

- Medical, legal, or safety‑critical decisions  
- Voice conversion or identity manipulation  
- Non‑speech audio enhancement

---
## How to Get Started
Refer to the repository for quick-start code and examples:  
https://github.com/cisco-open/pase
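
Since PASE expects 16 kHz mono input, recordings at other sample rates or with multiple channels need to be converted first. The sketch below shows one way to do that with SciPy; the `to_16k_mono` helper is illustrative (not part of the PASE repository), and the commented-out `enhance` call at the end is a placeholder for the repository's actual entry point.

```python
import numpy as np
from math import gcd
from scipy.signal import resample_poly

TARGET_SR = 16_000  # PASE operates on 16 kHz mono audio

def to_16k_mono(wave: np.ndarray, sr: int) -> np.ndarray:
    """Downmix to mono and resample to 16 kHz.

    `wave` is float audio shaped (samples,) or (samples, channels).
    """
    if wave.ndim == 2:              # average channels down to mono
        wave = wave.mean(axis=1)
    if sr != TARGET_SR:             # polyphase resampling to 16 kHz
        g = gcd(sr, TARGET_SR)
        wave = resample_poly(wave, TARGET_SR // g, sr // g)
    return wave.astype(np.float32)

# Example: 1 s of stereo audio at 44.1 kHz -> 16,000 mono samples
stereo = np.random.randn(44_100, 2).astype(np.float32)
mono16k = to_16k_mono(stereo, 44_100)
# enhanced = pase.enhance(mono16k)  # hypothetical call; see the repository for the real API
```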

---
## Training Details
### Training Data
We release a PASE checkpoint trained on an updated set of datasets relative to the paper. For this release, training used:

- Clean speech:
  - DNS5 Challenge clean-speech resources derived from the LibriVox public-domain subset
  - [LibriTTS](https://www.openslr.org/60/)
  - [VCTK](https://datashare.ed.ac.uk/handle/10283/3443)
- Noise:
  - DNS5 Challenge noise resources
- Room impulse responses:
  - [OpenSLR26](https://www.openslr.org/26/)
  - [OpenSLR28](https://www.openslr.org/28/)

These source datasets were used to prepare training mixtures and train the released model. The model card and repository do not redistribute the underlying dataset contents; please refer to the original dataset pages and licenses below.

### Dataset Attribution
- DNS5 Challenge clean speech (LibriVox subset): clean-speech material prepared from [LibriVox](https://librivox.org/) through the [DNS Challenge](https://github.com/microsoft/DNS-Challenge). The LibriVox recordings used for this portion are [public domain](https://librivox.org/pages/public-domain/) and were used as clean-speech training data for the released checkpoint.
- LibriTTS: [LibriTTS](https://www.openslr.org/60/) by Heiga Zen et al., licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/). It was used as clean-speech training data for the released checkpoint.
- VCTK Corpus: the [VCTK dataset](https://datashare.ed.ac.uk/handle/10283/3443) from the Centre for Speech Technology Research, University of Edinburgh, licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/). It was used as clean-speech training data for the released checkpoint.
- DNS5 Challenge noise resources: noise data prepared through the [DNS Challenge](https://github.com/microsoft/DNS-Challenge) and used to synthesize noisy training mixtures for the released checkpoint. For this release, the DNS5 noise resources draw on [AudioSet](https://research.google.com/audioset/index.html) material licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/), selected [Freesound](https://freesound.org/) files licensed under [CC0 1.0](https://creativecommons.org/publicdomain/zero/1.0/), and [DEMAND](https://zenodo.org/record/1227121#.XRKKxYhKiUk) environmental recordings licensed under [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/deed.en_CA).
- OpenSLR26 and OpenSLR28: [OpenSLR26](https://www.openslr.org/26/) and [OpenSLR28](https://www.openslr.org/28/) room impulse response resources, both licensed under Apache 2.0, were used to add reverberation during training.

All audio was resampled to 16 kHz.

### Training Procedure
#### Preprocessing
- Mixtures generated dynamically  
- SNR sampled from –5 to 15 dB  
- Reverberation applied with 50% probability
#### Training Hyperparameters
- **DeWavLM:** 100k steps, LR 1e‑4, batch size 4  
- **Vocoder:** 200k steps, LR 2e‑4, batch size 12  
- Optimizer: AdamW with warmup + cosine decay  
- Hardware: 4 × NVIDIA RTX 4090 GPUs
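
The warmup-plus-cosine schedule mentioned above can be sketched as a plain function of the step index. The warmup length is not stated in this card, so the 5k-step value below is an assumed placeholder; peak LR and step counts match the DeWavLM stage.

```python
import math

def lr_at(step: int, total_steps: int, warmup_steps: int, peak_lr: float) -> float:
    """Linear warmup to `peak_lr`, then cosine decay toward zero over the remaining steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# e.g. the DeWavLM stage: 100k steps, peak LR 1e-4, assumed 5k warmup steps
schedule = [lr_at(s, 100_000, 5_000, 1e-4) for s in (0, 4_999, 50_000, 99_999)]
```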
#### Speeds, Sizes, Times
- Total parameters: ~382M  
- Inference compute: ~21.4 GMAC/s
---
## Evaluation
### Testing Data
- Simulated test set constructed from the [LibriTTS](https://www.openslr.org/60/) test split
- [DNS1 test set](https://github.com/microsoft/DNS-Challenge/tree/interspeech2020/master/datasets/test_set/synthetic) with/without reverberation
### Metrics
- DNSMOS, UTMOS  
- LPS, SpeechBERTScore (SBS)  
- Speaker Similarity (RawNet3)  
- WER (OWSM v3.1)
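
As a reference for how the WER numbers below are defined, the sketch here computes a generic word-level edit-distance WER. It is not the OWSM v3.1 evaluation pipeline itself, which also handles transcription and text normalization.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])  # substitution (or match)
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)  # deletion / insertion
    return dp[len(ref)][len(hyp)] / max(1, len(ref))
```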

### Results

The performance of the released version compared to the paper's results:
| Model | DNSMOS | UTMOS | SBS | LPS | SpkSim | WER (%) |
|:-----:|:------:|:-----:|:---:|:---:|:------:|:-------:|
| Vocoder-L24 (paper) | 3.23 | 3.40 | 0.94 | 0.97 | 0.65 | 2.86 |
| **Vocoder-L24 (released)** | 3.29 | 3.30 | 0.94 | 0.96 | 0.59 | 3.46 |
| DeWavLM (paper) | 3.26 | 3.42 | 0.88 | 0.93 | 0.57 | 7.62 |
| **DeWavLM (released)** | 3.31 | 3.39 | 0.88 | 0.93 | 0.52 | 7.25 |
| PASE (paper) | 3.12 | 3.09 | 0.90 | 0.93 | 0.80 | 7.49 |
| **PASE (released)** | 3.08 | 3.21 | 0.91 | 0.94 | 0.80 | 6.76 |

On our simulated test set, the released checkpoints perform on par with the results reported in the paper.

Overall, PASE achieves:
- Lowest WER among evaluated generative and discriminative baselines  
- Highest speaker similarity (SpkSim)  
- Strong perceptual quality with low hallucination rates  
- Consistent performance across noisy and reverberant conditions

---
## Bias, Risks, and Limitations
- Model trained primarily on English speech; performance may degrade for other languages.  
- Very strong noise or mismatched reverberation conditions can introduce artifacts.  
- Speaker characteristics are preserved but not guaranteed perfectly.

---
### Recommendations
Evaluate outputs for your specific use case. Avoid deployments where misunderstanding enhanced speech could have safety or legal consequences.

---
## Citation
If you use PASE in your research, please cite:
```bibtex
@article{PASE, 
    title={{PASE: Leveraging the Phonological Prior of WavLM for Low-Hallucination Generative Speech Enhancement}},
    volume={40},
    DOI={10.1609/aaai.v40i39.40562}, 
    number={39}, 
    journal={Proceedings of the AAAI Conference on Artificial Intelligence}, 
    author={Rong, Xiaobin and Hu, Qinwen and Yesilbursa, Mansur and Wojcicki, Kamil and Lu, Jing}, 
    year={2026},
    month={Mar.}, 
    pages={32826-32834}
}
```
Copyright © 2026 by Cisco Systems, Inc. All rights reserved.
## Model Card Authorship & Contact
- Mansur Yesilbursa: myesilbu@cisco.com