SofieneK committed · Commit 4bf4200 · verified · 1 Parent(s): 5954576

Update README.md

# Modeling strategies for speech enhancement in the latent space of a neural audio codec

This repository provides the official model checkpoints of the paper *[Modeling strategies for speech enhancement in the latent space of a neural audio codec](https://arxiv.org/abs/2510.26299)* authored by Sofiene Kammoun, Xavier Alameda-Pineda, and Simon Leglaive, and published at IEEE ICASSP 2026.

We explore different modeling strategies (autoregressive vs. non-autoregressive) and representation spaces (discrete vs. continuous) for speech enhancement using neural audio codecs and Conformer-based architectures.

[arXiv](https://arxiv.org/abs/2510.26299) | [Audio examples](https://sofienekammoun.github.io/SE-NAC-25/) | [BibTeX](#citation)


## Overview

Our work introduces and compares a family of speech enhancement models that systematically vary along two main axes:

- **Representation Type**
  - Discrete tokens
  - Continuous latent vectors

- **Modeling Strategy**
  - Autoregressive (AR): Sequential prediction of the clean speech representation
  - Non-Autoregressive (NAR): Parallel prediction of the clean speech representation
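
To make the difference between the two modeling strategies concrete, here is a minimal sketch of sequential versus parallel decoding. It uses only the Python standard library, and `toy_predictor`, `decode_autoregressive`, and `decode_non_autoregressive` are hypothetical names introduced for illustration. The toy predictor is a stand-in for a trained network and is not the Conformer-based model from the paper.

```python
from typing import Callable, List, Sequence

# Hypothetical signature: (noisy frames, clean frames predicted so far, index) -> clean frame.
Predictor = Callable[[Sequence[float], Sequence[float], int], float]


def toy_predictor(noisy: Sequence[float], previous_clean: Sequence[float], t: int) -> float:
    """Stand-in for a trained network (assumption, not the paper's model)."""
    # If earlier predictions are available, average the noisy frame with the last one.
    if previous_clean:
        return 0.5 * (noisy[t] + previous_clean[-1])
    return noisy[t]


def decode_autoregressive(noisy: Sequence[float], predict: Predictor) -> List[float]:
    """Sequential decoding: frame t is conditioned on the frames predicted before t."""
    clean: List[float] = []
    for t in range(len(noisy)):
        clean.append(predict(noisy, clean, t))
    return clean


def decode_non_autoregressive(noisy: Sequence[float], predict: Predictor) -> List[float]:
    """Parallel decoding: each frame is predicted from the noisy input alone."""
    # No call sees another call's output, so the predictions are independent
    # and could be computed in parallel.
    return [predict(noisy, (), t) for t in range(len(noisy))]


noisy_frames = [1.0, 3.0, 5.0]
ar_output = decode_autoregressive(noisy_frames, toy_predictor)
nar_output = decode_non_autoregressive(noisy_frames, toy_predictor)
print(ar_output)
print(nar_output)
```

In the autoregressive loop, each step has to wait for the previous one, whereas the non-autoregressive version has no dependency between output positions.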

The current release includes the following models:

| Model Name | Modeling Strategy | Input Representation | Output Representation | Checkpoint |
|------------|-------------------|----------------------|-----------------------|------------|
| **D-AR** | Autoregressive | Discrete | Discrete | `D-AR_ckpt_300.pt` |
| **D-NAR** | Non-Autoregressive | Discrete | Discrete | `D-NAR_ckpt_300.pt` |
| **D-NAR\*** | Non-Autoregressive | Continuous | Discrete | `D-NAR-star_ckpt_300.pt` |
| **C-AR** | Autoregressive | Continuous | Continuous | `C-AR_ckpt_300.pt` |
| **C-NAR** | Non-Autoregressive | Continuous | Continuous | `C-NAR_ckpt_300.pt` |
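
For scripting against these checkpoints, the table above can be written as a small lookup structure. The sketch below is only an illustration: `ModelSpec`, `MODELS`, and `select` are hypothetical names that are not part of this release, and the entries simply restate the table. Loading the files is deliberately not shown, since the checkpoint format is not described here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ModelSpec:
    """One row of the model table (hypothetical helper, not part of the release)."""
    strategy: str     # "AR" or "NAR"
    input_repr: str   # "discrete" or "continuous"
    output_repr: str  # "discrete" or "continuous"
    checkpoint: str   # file name as listed in the table


MODELS: Dict[str, ModelSpec] = {
    "D-AR": ModelSpec("AR", "discrete", "discrete", "D-AR_ckpt_300.pt"),
    "D-NAR": ModelSpec("NAR", "discrete", "discrete", "D-NAR_ckpt_300.pt"),
    "D-NAR*": ModelSpec("NAR", "continuous", "discrete", "D-NAR-star_ckpt_300.pt"),
    "C-AR": ModelSpec("AR", "continuous", "continuous", "C-AR_ckpt_300.pt"),
    "C-NAR": ModelSpec("NAR", "continuous", "continuous", "C-NAR_ckpt_300.pt"),
}


def select(strategy: str, output_repr: str) -> List[str]:
    """Return the model names matching a strategy and an output representation."""
    return [
        name
        for name, spec in MODELS.items()
        if spec.strategy == strategy and spec.output_repr == output_repr
    ]


print(select("NAR", "discrete"))
print(MODELS["C-AR"].checkpoint)
```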

Additional models:
- **C-FT** (`C-FT-encoder_ckpt_300.pt`) and **D-FT** (`D-FT-encoder_ckpt_300.pt`), where only the encoder of the neural audio codec (NAC) is fine-tuned, with an MSE loss and a cross-entropy loss, respectively.
- **STFT-NAR** (`STFT_NAR_Mask_ckpt_300.pt`), which operates on STFT representations instead of the NAC embeddings and is trained to output an STFT mask.
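
The two fine-tuning objectives named for C-FT and D-FT can be illustrated with a minimal standard-library sketch. It assumes that the MSE loss compares a predicted continuous latent vector with a target latent vector, and that the cross-entropy loss scores a vector of logits over codebook entries against a target token index. These targets, and the function names `mse_loss` and `cross_entropy_loss`, are assumptions made for illustration and are not taken from the released training code.

```python
import math
from typing import Sequence


def mse_loss(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two continuous vectors of equal length."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def cross_entropy_loss(logits: Sequence[float], target_index: int) -> float:
    """Negative log-probability of the target token under softmax(logits)."""
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    log_normalizer = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_normalizer - logits[target_index]


# Continuous case: predicted latent vector vs. target latent vector.
continuous_loss = mse_loss([1.0, 2.0], [1.0, 4.0])

# Discrete case: logits over a toy two-entry codebook, target token 0.
discrete_loss = cross_entropy_loss([0.0, 0.0], 0)

print(continuous_loss)
print(discrete_loss)
```

The first loss treats the representation as a point in a vector space, while the second treats it as a class label, which mirrors the continuous versus discrete distinction used throughout this README.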

Files changed (1)
  1. README.md +9 -1

README.md CHANGED
@@ -1,3 +1,11 @@
 ---
 license: mit
----
+datasets:
+- westbrook/LibriMix
+language:
+- en
+tags:
+- speech
+- SE
+- Neural-Audio-Codec
+---