---
license: mit
datasets:
  - westbrook/LibriMix
language:
  - en
tags:
  - speech
  - SE
  - Neural-Audio-Codec
pipeline_tag: audio-to-audio
---

# Modeling strategies for speech enhancement in the latent space of a neural audio codec

This repository provides the official model checkpoints for the paper *Modeling strategies for speech enhancement in the latent space of a neural audio codec*, authored by Sofiene Kammoun, Xavier Alameda-Pineda, and Simon Leglaive, and published at IEEE ICASSP 2026.

We explore different modeling strategies (autoregressive vs. non-autoregressive) and representation spaces (discrete vs. continuous) for speech enhancement using neural audio codecs and Conformer-based architectures.

arXiv | Code and audio examples | BibTeX

## Overview

Our work introduces and compares a family of speech enhancement models that systematically vary along two main axes:

- **Representation type**
  - Discrete tokens
  - Continuous latent vectors
- **Modeling strategy**
  - Autoregressive (AR): sequential prediction of the clean speech representation
  - Non-autoregressive (NAR): parallel prediction of the clean speech representation
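The two strategies can be contrasted with a toy sketch (this is *not* the paper's Conformer architecture; `step` is a stand-in for a hypothetical enhancement network operating on latent frames):

```python
import numpy as np

T, D = 5, 4                  # number of frames, latent dimension
noisy = np.ones((T, D))      # dummy noisy latent sequence

def step(noisy_frame, past_clean):
    # Dummy predictor: mixes the noisy frame with the last clean frame.
    ctx = past_clean[-1] if past_clean else np.zeros(D)
    return 0.5 * (noisy_frame + ctx)

# Autoregressive (AR): clean frames are predicted sequentially,
# each conditioned on previously generated clean frames.
clean_ar = []
for t in range(T):
    clean_ar.append(step(noisy[t], clean_ar))
clean_ar = np.stack(clean_ar)

# Non-autoregressive (NAR): all clean frames are predicted in parallel
# from the noisy input alone.
clean_nar = 0.5 * noisy
```

AR decoding requires one forward pass per frame at inference time, while NAR decoding produces the whole sequence in a single pass.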

The current release includes the following models:

| Model Name | Modeling Strategy | Input Representation | Output Representation | Model Checkpoint |
|---|---|---|---|---|
| D-AR | Autoregressive | Discrete | Discrete | `D-AR_ckpt_300.pt` |
| D-NAR | Non-Autoregressive | Discrete | Discrete | `D-NAR_ckpt_300.pt` |
| D-NAR\* | Non-Autoregressive | Continuous | Discrete | `D-NAR_star_ckpt_300.pt` |
| C-AR | Autoregressive | Continuous | Continuous | `C-AR_ckpt_300.pt` |
| C-NAR | Non-Autoregressive | Continuous | Continuous | `C-NAR_ckpt_300.pt` |
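As a minimal sketch for selecting a checkpoint by model name (an illustrative helper, not an official API — the filenames come from the table above, and actually loading the weights additionally requires the model classes from the paper's code repository, e.g. via `torch.load(..., map_location="cpu")`):

```python
# Hypothetical mapping from released model variants to checkpoint filenames.
CHECKPOINTS = {
    "D-AR": "D-AR_ckpt_300.pt",          # autoregressive, discrete -> discrete
    "D-NAR": "D-NAR_ckpt_300.pt",        # non-autoregressive, discrete -> discrete
    "D-NAR*": "D-NAR_star_ckpt_300.pt",  # non-autoregressive, continuous -> discrete
    "C-AR": "C-AR_ckpt_300.pt",          # autoregressive, continuous -> continuous
    "C-NAR": "C-NAR_ckpt_300.pt",        # non-autoregressive, continuous -> continuous
}

def checkpoint_for(model_name: str) -> str:
    """Return the checkpoint filename for a released model variant."""
    if model_name not in CHECKPOINTS:
        raise ValueError(
            f"unknown model {model_name!r}; expected one of {sorted(CHECKPOINTS)}"
        )
    return CHECKPOINTS[model_name]
```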

Additional models:

- **C-FT** (`C-FT-encoder_ckpt_300.pt`) and **D-FT** (`D-FT-encoder_ckpt_300.pt`): only the NAC's encoder is fine-tuned, with an MSE loss and a cross-entropy loss, respectively.
- **STFT-NAR** (`STFT_NAR_Mask_ckpt_300.pt`): instead of the NAC's embeddings, the model works with STFT representations and is trained to output an STFT mask.
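The STFT-masking setup can be sketched as follows (all parameters are illustrative, and `predict_mask` is a placeholder for the actual STFT-NAR model): the predicted time-frequency mask is applied to the noisy STFT before inverting back to the waveform.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
rng = np.random.default_rng(0)
noisy = rng.standard_normal(fs)          # 1 s of noise as stand-in audio

# Noisy time-frequency representation.
f, t, X = stft(noisy, fs=fs, nperseg=512)

def predict_mask(spec):
    # Dummy magnitude-based mask in [0, 1]; the real mask would be
    # predicted by a trained network.
    mag = np.abs(spec)
    return mag / (mag.max() + 1e-8)

mask = predict_mask(X)                   # same shape as X
_, enhanced = istft(mask * X, fs=fs, nperseg=512)
```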