---
license: mit
datasets:
- westbrook/LibriMix
language:
- en
tags:
- speech
- SE
- Neural-Audio-Codec
pipeline_tag: audio-to-audio
---
# Modeling strategies for speech enhancement in the latent space of a neural audio codec
|
|
This repository provides the official model checkpoints for the paper *[Modeling strategies for speech enhancement in the latent space of a neural audio codec](https://arxiv.org/abs/2510.26299)*, authored by Sofiene Kammoun, Xavier Alameda-Pineda, and Simon Leglaive, and published at IEEE ICASSP 2026.
|
|
We explore different modeling strategies (autoregressive vs. non-autoregressive) and representation spaces (discrete vs. continuous) for speech enhancement using neural audio codecs and Conformer-based architectures.
|
|
[arXiv](https://arxiv.org/abs/2510.26299) | [Code and Audio examples](https://sofienekammoun.github.io/SE-NAC-25/) | [Bibtex](#citation)
|
|
|
|
## Overview
|
|
Our work introduces and compares a family of speech enhancement models that systematically vary along two main axes:
|
|
- **Representation Type**
  - Discrete tokens
  - Continuous latent vectors
|
|
- **Modeling Strategy**
  - Autoregressive (AR): sequential prediction of the clean speech representation
  - Non-autoregressive (NAR): parallel prediction of the clean speech representation
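
The difference between the two strategies can be sketched with a toy example. Here `denoise` is a dummy stand-in for the Conformer-based model, and the sketch only illustrates the data flow (AR feedback vs. parallel prediction), not the paper's actual interface or conditioning:

```python
# Toy sketch of AR vs. NAR enhancement over a sequence of latent frames.
# `denoise` is a placeholder for the Conformer model; the real models,
# inputs, and conditioning are not reproduced here.

def denoise(frame, context):
    # Dummy "model": strip a constant noise offset. `context` (the clean
    # frames predicted so far) is unused by this toy, but marks where AR
    # conditioning would enter a real model.
    return frame - 1.0

def enhance_ar(noisy):
    """Autoregressive: predict clean frames one at a time, each step
    conditioned on the previously predicted clean frames."""
    clean = []
    for frame in noisy:
        clean.append(denoise(frame, context=tuple(clean)))
    return clean

def enhance_nar(noisy):
    """Non-autoregressive: predict all clean frames in parallel, with no
    feedback from earlier predictions."""
    return [denoise(frame, context=None) for frame in noisy]
```

The AR variant pays a sequential decoding cost for access to its own past predictions, while the NAR variant predicts every frame in one parallel pass.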
|
|
The current release includes the following models:
|
|
| Model Name | Modeling Strategy | Input Representation | Output Representation | Model Checkpoint |
|------------|-------------------|----------------------|-----------------------|------------------|
| **D-AR** | Autoregressive | Discrete | Discrete | `D-AR_ckpt_300.pt` |
| **D-NAR** | Non-autoregressive | Discrete | Discrete | `D-NAR_ckpt_300.pt` |
| **D-NAR\*** | Non-autoregressive | Continuous | Discrete | `D-NAR_star_ckpt_300.pt` |
| **C-AR** | Autoregressive | Continuous | Continuous | `C-AR_ckpt_300.pt` |
| **C-NAR** | Non-autoregressive | Continuous | Continuous | `C-NAR_ckpt_300.pt` |
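
For convenience, the table can be expressed as a small lookup from model name to checkpoint filename. The filenames are the ones listed above; everything else (the helper name, and how the checkpoints should be deserialized, e.g. with `torch.load(..., map_location="cpu")`) is an assumption not specified by this card:

```python
# Checkpoint filenames from the table above, keyed by model name.
# `checkpoint_for` is a hypothetical helper, not part of any released code.
CHECKPOINTS = {
    "D-AR": "D-AR_ckpt_300.pt",
    "D-NAR": "D-NAR_ckpt_300.pt",
    "D-NAR*": "D-NAR_star_ckpt_300.pt",
    "C-AR": "C-AR_ckpt_300.pt",
    "C-NAR": "C-NAR_ckpt_300.pt",
}

def checkpoint_for(model_name: str) -> str:
    """Return the checkpoint filename for a model from the table."""
    try:
        return CHECKPOINTS[model_name]
    except KeyError:
        raise ValueError(f"unknown model {model_name!r}") from None
```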
|
|
Additional models:
- **C-FT** (`C-FT-encoder_ckpt_300.pt`) and **D-FT** (`D-FT-encoder_ckpt_300.pt`), where only the NAC encoder is fine-tuned, with an MSE loss and a cross-entropy loss, respectively.
- **STFT-NAR** (`STFT_NAR_Mask_ckpt_300.pt`), which operates on STFT representations instead of NAC embeddings and is trained to output an STFT mask.
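
As a generic illustration of the masking principle used by STFT-NAR, a predicted real-valued mask is applied pointwise to the noisy STFT. This is only the general idea of mask-based enhancement; the paper's exact mask type and bounds are not specified on this card:

```python
# Generic pointwise masking in the STFT domain (illustrative only).

def apply_mask(noisy_stft, mask):
    """Multiply a predicted mask with the noisy STFT, bin by bin.

    noisy_stft: complex STFT coefficients (flattened into a list here)
    mask:       real-valued gains of the same length, e.g. in [0, 1]
    """
    if len(noisy_stft) != len(mask):
        raise ValueError("mask and STFT must have the same shape")
    # A gain near 1 keeps a time-frequency bin; near 0 suppresses it.
    return [m * x for m, x in zip(mask, noisy_stft)]
```

The enhanced waveform would then be recovered by an inverse STFT of the masked coefficients.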