tags:
- speech
- SE
- Neural-Audio-Codec
pipeline_tag: audio-to-audio
---
# Modeling strategies for speech enhancement in the latent space of a neural audio codec
This repository provides the official model checkpoints for the paper *[Modeling strategies for speech enhancement in the latent space of a neural audio codec](https://arxiv.org/abs/2510.26299)*, authored by Sofiene Kammoun, Xavier Alameda-Pineda, and Simon Leglaive, and published at IEEE ICASSP 2026.

We explore different modeling strategies (autoregressive vs. non-autoregressive) and representation spaces (discrete vs. continuous) for speech enhancement using neural audio codecs and Conformer-based architectures.

[arXiv](https://arxiv.org/abs/2510.26299) | [Code and Audio examples](https://sofienekammoun.github.io/SE-NAC-25/) | [BibTeX](#citation)

## Overview

Our work introduces and compares a family of speech enhancement models that systematically vary along two main axes:

- **Representation Type**
  - Discrete tokens
  - Continuous latent vectors

- **Modeling Strategy** (see the sketch below)
  - Autoregressive (AR): sequential prediction of the clean speech representation
  - Non-Autoregressive (NAR): parallel prediction of the clean speech representation
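
For intuition, the toy sketch below contrasts the two regimes on dummy tensors. The module, dimensions, and conditioning scheme are placeholders chosen for illustration; they are not the interface of the released Conformer models.

```python
import torch
import torch.nn as nn

# Placeholder one-layer "enhancer" standing in for the paper's Conformer models;
# the latent dimension and conditioning are illustrative assumptions only.
latent_dim = 128
predictor = nn.Linear(2 * latent_dim, latent_dim)

noisy = torch.randn(1, 50, latent_dim)  # (batch, frames, latent_dim) noisy latents

# Non-autoregressive (NAR): all clean frames are predicted in a single parallel
# pass, conditioned on the noisy latents only.
clean_nar = predictor(torch.cat([noisy, torch.zeros_like(noisy)], dim=-1))

# Autoregressive (AR): clean frames are generated one at a time, each step
# conditioned on the current noisy frame and the previously generated clean frame.
prev = torch.zeros(1, 1, latent_dim)
frames = []
for t in range(noisy.shape[1]):
    prev = predictor(torch.cat([noisy[:, t : t + 1], prev], dim=-1))
    frames.append(prev)
clean_ar = torch.cat(frames, dim=1)
```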

The current release includes the following models:

| Model Name | Modeling Strategy | Input Representation | Output Representation | Model Checkpoint |
|------------|-------------------|----------------------|------------------------|------------------|
| **D-AR**   | Autoregressive     | Discrete   | Discrete   | `D-AR_ckpt_300.pt` |
| **D-NAR**  | Non-Autoregressive | Discrete   | Discrete   | `D-NAR_ckpt_300.pt` |
| **D-NAR*** | Non-Autoregressive | Continuous | Discrete   | `D-NAR_star_ckpt_300.pt` |
| **C-AR**   | Autoregressive     | Continuous | Continuous | `C-AR_ckpt_300.pt` |
| **C-NAR**  | Non-Autoregressive | Continuous | Continuous | `C-NAR_ckpt_300.pt` |
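
As a sketch of how one of these checkpoints could be fetched and inspected (the `repo_id` below is a placeholder, and the internal structure of the `.pt` files is an assumption, not a documented format):

```python
import torch
from huggingface_hub import hf_hub_download

# Placeholder repository id; replace it with the Hub repo that hosts these checkpoints.
repo_id = "<user>/<repo>"

# Download one checkpoint from the table above, e.g. the continuous non-autoregressive model.
ckpt_path = hf_hub_download(repo_id=repo_id, filename="C-NAR_ckpt_300.pt")

# Load on CPU and peek at the contents; whether the file is a bare state dict or a
# dict with extra entries (config, optimizer, ...) is assumed here, not documented.
checkpoint = torch.load(ckpt_path, map_location="cpu")
if isinstance(checkpoint, dict):
    print(list(checkpoint.keys())[:10])
```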

Additional models:

- **C-FT** (`C-FT-encoder_ckpt_300.pt`) and **D-FT** (`D-FT-encoder_ckpt_300.pt`), where we only fine-tune the NAC's encoder with an MSE loss and a cross-entropy loss, respectively.
- **STFT-NAR** (`STFT_NAR_Mask_ckpt_300.pt`), where instead of the NAC embeddings we work with STFT representations and train the model to output an STFT mask (see the sketch below).
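
To make the STFT-NAR variant concrete, here is a minimal sketch of how a predicted time-frequency mask is typically applied to the noisy STFT and inverted back to a waveform. The FFT size, hop length, and the random mask standing in for the model's prediction are assumptions for illustration, not the paper's settings.

```python
import torch

# Illustrative STFT parameters; the values used in the paper may differ.
n_fft, hop_length = 512, 128
window = torch.hann_window(n_fft)

noisy = torch.randn(1, 16000)  # one second of dummy "noisy speech" at 16 kHz

# Complex STFT of the noisy waveform: (batch, freq_bins, frames).
noisy_stft = torch.stft(noisy, n_fft=n_fft, hop_length=hop_length,
                        window=window, return_complex=True)

# Stand-in for the mask predicted by STFT-NAR: one gain in [0, 1] per bin.
mask = torch.rand_like(noisy_stft.real)

# Masking the noisy STFT and inverting yields the enhanced waveform estimate.
enhanced_stft = noisy_stft * mask
enhanced = torch.istft(enhanced_stft, n_fft=n_fft, hop_length=hop_length,
                       window=window, length=noisy.shape[-1])
```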
|