---
license: mit
datasets:
  - westbrook/LibriMix
language:
  - en
tags:
  - speech
  - SE
  - Neural-Audio-Codec
pipeline_tag: audio-to-audio
---

# Modeling strategies for speech enhancement in the latent space of a neural audio codec

This repository provides the official model checkpoints for the paper *Modeling strategies for speech enhancement in the latent space of a neural audio codec*, authored by Sofiene Kammoun, Xavier Alameda-Pineda, and Simon Leglaive, and published at IEEE ICASSP 2026.

We explore different modeling strategies (autoregressive vs. non-autoregressive) and representation spaces (discrete vs. continuous) for speech enhancement using neural audio codecs and Conformer-based architectures.

arXiv | Code and audio examples | BibTeX

## Overview

Our work introduces and compares a family of speech enhancement models that systematically vary along two main axes:

- **Representation type**
  - Discrete tokens
  - Continuous latent vectors
- **Modeling strategy**
  - Autoregressive (AR): sequential prediction of the clean speech representation
  - Non-autoregressive (NAR): parallel prediction of the clean speech representation
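The two strategies can be contrasted with a toy sketch (this is *not* the paper's Conformer architecture; `step` is a stand-in for a hypothetical enhancement network operating on latent frames):

```python
import numpy as np

T, D = 5, 4                  # number of frames, latent dimension
noisy = np.ones((T, D))      # dummy noisy latent sequence

def step(noisy_frame, past_clean):
    # Dummy predictor: mixes the noisy frame with the last clean frame.
    ctx = past_clean[-1] if past_clean else np.zeros(D)
    return 0.5 * (noisy_frame + ctx)

# Autoregressive (AR): clean frames are predicted sequentially,
# each conditioned on previously generated clean frames.
clean_ar = []
for t in range(T):
    clean_ar.append(step(noisy[t], clean_ar))
clean_ar = np.stack(clean_ar)

# Non-autoregressive (NAR): all clean frames are predicted in parallel
# from the noisy input alone.
clean_nar = 0.5 * noisy
```

AR decoding requires one forward pass per frame at inference time, while NAR decoding produces the whole sequence in a single pass.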

The current release includes the following models:

| Model Name | Modeling Strategy | Input Representation | Output Representation | Model Checkpoint |
|---|---|---|---|---|
| D-AR | Autoregressive | Discrete | Discrete | `D-AR_ckpt_300.pt` |
| D-NAR | Non-Autoregressive | Discrete | Discrete | `D-NAR_ckpt_300.pt` |
| D-NAR\* | Non-Autoregressive | Continuous | Discrete | `D-NAR_star_ckpt_300.pt` |
| C-AR | Autoregressive | Continuous | Continuous | `C-AR_ckpt_300.pt` |
| C-NAR | Non-Autoregressive | Continuous | Continuous | `C-NAR_ckpt_300.pt` |
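As a minimal sketch for selecting a checkpoint by model name (an illustrative helper, not an official API — the filenames come from the table above, and actually loading the weights additionally requires the model classes from the paper's code repository, e.g. via `torch.load(..., map_location="cpu")`):

```python
# Hypothetical mapping from released model variants to checkpoint filenames.
CHECKPOINTS = {
    "D-AR": "D-AR_ckpt_300.pt",          # autoregressive, discrete -> discrete
    "D-NAR": "D-NAR_ckpt_300.pt",        # non-autoregressive, discrete -> discrete
    "D-NAR*": "D-NAR_star_ckpt_300.pt",  # non-autoregressive, continuous -> discrete
    "C-AR": "C-AR_ckpt_300.pt",          # autoregressive, continuous -> continuous
    "C-NAR": "C-NAR_ckpt_300.pt",        # non-autoregressive, continuous -> continuous
}

def checkpoint_for(model_name: str) -> str:
    """Return the checkpoint filename for a released model variant."""
    if model_name not in CHECKPOINTS:
        raise ValueError(
            f"unknown model {model_name!r}; expected one of {sorted(CHECKPOINTS)}"
        )
    return CHECKPOINTS[model_name]
```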

Additional models:

- **C-FT** (`C-FT-encoder_ckpt_300.pt`) and **D-FT** (`D-FT-encoder_ckpt_300.pt`): only the NAC's encoder is fine-tuned, with an MSE loss and a cross-entropy loss, respectively.
- **STFT-NAR** (`STFT_NAR_Mask_ckpt_300.pt`): instead of the NAC's embeddings, the model works with STFT representations and is trained to output an STFT mask.
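The STFT-masking setup can be sketched as follows (all parameters are illustrative, and `predict_mask` is a placeholder for the actual STFT-NAR model): the predicted time-frequency mask is applied to the noisy STFT before inverting back to the waveform.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
rng = np.random.default_rng(0)
noisy = rng.standard_normal(fs)          # 1 s of noise as stand-in audio

# Noisy time-frequency representation.
f, t, X = stft(noisy, fs=fs, nperseg=512)

def predict_mask(spec):
    # Dummy magnitude-based mask in [0, 1]; the real mask would be
    # predicted by a trained network.
    mag = np.abs(spec)
    return mag / (mag.max() + 1e-8)

mask = predict_mask(X)                   # same shape as X
_, enhanced = istft(mask * X, fs=fs, nperseg=512)
```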