# HiFi-WaveGAN — 48kHz Singing Voice Vocoder
[![Paper](https://img.shields.io/badge/Paper-arXiv%202210.12740-red)](https://arxiv.org/abs/2210.12740)
[![License](https://img.shields.io/badge/License-MIT-blue)](LICENSE)
Full PyTorch implementation of:
> **HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation**
> Chunhui Lu et al., 2022
## Architecture Overview
```
┌──────────────────────────────────────────────────────────┐
│                       HiFi-WaveGAN                       │
├──────────────────┬────────────────────┬──────────────────┤
│ Generator        │ Discriminator 1    │ Discriminator 2  │
│ (ExWaveNet)      │ (MPD)              │ (MRSD)           │
├──────────────────┼────────────────────┼──────────────────┤
│ 3×18=54 layers   │ 5 sub-discs        │ 4 sub-discs      │
│ Kernels:         │ Periods:           │ STFT configs:    │
│ {3,3,9,9,17,17}  │ [2,3,5,7,11]       │ [512,1024,       │
│ Residual ch: 80  │ 2D Conv on         │  2048,4096]      │
│ ~9.5M params     │ reshaped waveform  │ 2D Conv on spec  │
│                  │ ~41M params        │ ~0.4M params     │
│ + Pulse Extract  │                    │                  │
│ + Noise Upsamp   │                    │                  │
└──────────────────┴────────────────────┴──────────────────┘
```
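The generator's 54 layers follow the cyclic dilation pattern `2^(i % 9)` per stack layer (stated below under Key Components). A quick sketch of the resulting schedule, derived from that stated pattern rather than copied from the repo's code:

```python
# Dilation schedule for the ExWaveNet generator: 3 stacks × 18 layers,
# with dilation 2^(i % 9) -> the cycle 1..256 repeats twice per stack.
n_stacks, n_layers_per_stack = 3, 18
dilations = [2 ** (i % 9)
             for _ in range(n_stacks)
             for i in range(n_layers_per_stack)]
print(len(dilations))    # 54
print(dilations[:9])     # [1, 2, 4, 8, 16, 32, 64, 128, 256]
```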
### Key Components
1. **Extended WaveNet Generator (ExWaveNet)**
- Non-causal WaveNet with 54 layers (3 stacks × 18 layers)
- Larger kernel sizes `{3,3,9,9,17,17}` for wider receptive field (vs standard kernel=3)
- Dilation pattern: `2^(i % 9)` per stack layer
- Transposed conv upsampling: mel (frame-level) → sample-level
- **Pulse Extractor**: F0-synchronized impulse train as additional constraint condition
2. **Multi-Period Discriminator (MPD)** — from HiFi-GAN
- 5 sub-discriminators with periods `[2, 3, 5, 7, 11]`
- Reshapes 1D waveform to 2D, applies 2D convolutions
3. **Multi-Resolution Spectrogram Discriminator (MRSD)** — from UnivNet
- 4 sub-discriminators with STFT configs:
- `(FFT=512, hop=50, win=240)`
- `(FFT=1024, hop=120, win=600)`
- `(FFT=2048, hop=240, win=1200)`
- `(FFT=4096, hop=480, win=2400)`
4. **Loss Functions**
- **Adversarial**: LSGAN formulation (Eqs. 4–5 of the paper)
- **Auxiliary**: Multi-resolution STFT (spectral convergence + log magnitude + phase)
- **Feature matching**: L1 on intermediate discriminator features
- **Weights**: `L_G = 1·L_adv + 120·L_aux + 10·L_fm`
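The generator-side loss combination above can be sketched as follows. This is a minimal illustration of the LSGAN adversarial term, L1 feature matching, and the paper's 1/120/10 weighting; the function and argument names are hypothetical, not the repo's API:

```python
import torch
import torch.nn.functional as F

def generator_loss(disc_fake_outputs, l_aux, fake_feats, real_feats):
    """L_G = 1·L_adv + 120·L_aux + 10·L_fm (weights from the paper)."""
    # LSGAN adversarial term: generator pushes D(fake) toward 1
    l_adv = sum(torch.mean((d - 1.0) ** 2) for d in disc_fake_outputs)
    # Feature matching: L1 between intermediate discriminator features
    l_fm = sum(F.l1_loss(f, r.detach())
               for f, r in zip(fake_feats, real_feats))
    return l_adv + 120.0 * l_aux + 10.0 * l_fm

# Toy check: D(fake)=1 and perfectly matched features leave only the
# auxiliary (multi-resolution STFT) term, scaled by 120.
loss = generator_loss([torch.ones(2, 3)], torch.tensor(0.5),
                      [torch.ones(2, 4)], [torch.ones(2, 4)])
print(loss.item())  # 60.0 (= 120 × 0.5)
```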
## Audio Configuration
| Parameter | Value |
|-----------|-------|
| Sample rate | 48,000 Hz |
| Mel bins | 120 |
| FFT size | 2048 |
| Window | 20ms (960 samples) |
| Hop | 5ms (240 samples) |
| F_min | 0 Hz |
| F_max | 24,000 Hz |
## Training Recipe (from paper)
| Parameter | Value |
|-----------|-------|
| Optimizer | AdamW |
| Learning rate | 2×10⁻⁴ |
| β₁, β₂ | 0.8, 0.99 |
| Weight decay | 0.01 |
| LR schedule | Exponential decay, γ=0.999 |
| Iterations | 200,000 |
| Batch size | 8 |
| Segment length | 4 seconds (192,000 samples) |
| Training time | ~70h on 4× V100 |
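The optimizer and scheduler rows of the recipe translate into standard PyTorch as below. The table does not say whether γ=0.999 is applied per iteration or per epoch; per iteration is shown here as an assumption, and the model is a placeholder:

```python
import torch

model = torch.nn.Linear(10, 10)  # placeholder for the generator/discriminators
opt = torch.optim.AdamW(model.parameters(), lr=2e-4,
                        betas=(0.8, 0.99), weight_decay=0.01)
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.999)

for _ in range(3):
    # in real training: loss.backward() before opt.step()
    opt.step()
    sched.step()

print(opt.param_groups[0]["lr"])  # 2e-4 * 0.999**3
```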
## Dataset
Training uses [GTSinger](https://huggingface.co/datasets/AaronZ345/GTSinger) — a high-quality **48kHz** singing voice dataset with:
- ~80 hours of singing across 20 professional singers
- 9 languages, 6 singing techniques
- Native 48kHz recording (no resampling needed)
## Quick Start
### Installation
```bash
pip install torch torchaudio numpy huggingface_hub
```
### Training
```bash
# Self-contained (downloads GTSinger automatically)
python train_hifi_wavegan.py
# Or modular version
python train.py --data_dir /path/to/audio --batch_size 8 --total_steps 200000
```
### Inference
```python
import torch
from hifi_wavegan.models.generator import ExWaveNetGenerator
from hifi_wavegan.config import HiFiWaveGANConfig
cfg = HiFiWaveGANConfig()
gen = ExWaveNetGenerator(
    n_mels=120, residual_ch=80, skip_ch=80,
    n_stacks=3, n_layers_per_stack=18,
    kernel_sizes=(3, 3, 9, 9, 17, 17),
    hop_size=240, sample_rate=48000, use_pulse=True,
)

# Load trained weights
gen.load_state_dict(torch.load("generator.pt", map_location="cpu"))
gen.eval()

# Generate from mel-spectrogram
# mel: [B, 120, T_frames], pitch: [B, 1, T_frames]
# f0:  [B, T_frames] (Hz), uv: [B, T_frames] (0/1)
wav = gen.inference(mel, pitch, f0, uv)  # → [B, 1, T_frames * 240]
```
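The `use_pulse=True` flag enables the Pulse Extractor, which conditions the generator on an F0-synchronized impulse train. A minimal sketch of how such a train can be built from a frame-level F0 contour — a hypothetical helper for illustration, not the repo's implementation:

```python
import torch

def pulse_train(f0, hop_size=240, sample_rate=48000):
    """Build an F0-synchronized impulse train from frame-level F0 in Hz.

    f0: [T_frames] tensor, 0 where unvoiced.
    Returns: [T_frames * hop_size] waveform of unit impulses at pitch marks.
    """
    # Upsample F0 to sample rate by repeating each frame's value
    f0_up = f0.repeat_interleave(hop_size)            # [T_samples]
    # Integrate instantaneous frequency -> phase in cycles
    phase = torch.cumsum(f0_up / sample_rate, dim=0)
    prev = torch.cat([phase.new_zeros(1), phase[:-1]])
    # Emit an impulse each time the accumulated phase crosses an integer
    train = (phase.floor() > prev.floor()).float()
    train[f0_up == 0] = 0.0                           # silence unvoiced regions
    return train

# Example: 100 frames of constant 200 Hz -> one pulse every 240 samples
f0 = torch.full((100,), 200.0)
train = pulse_train(f0)
print(train.shape, int(train.sum()))  # expect ~100 pulses over 24000 samples
```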
### Command-line inference
```bash
python inference.py --input singing.wav --output generated.wav --checkpoint generator.pt
```
## File Structure
```
├── hifi_wavegan/
│   ├── __init__.py
│   ├── config.py             # All hyperparameters
│   ├── dataset.py            # Data loading + mel/F0 extraction
│   ├── losses.py             # LSGAN + multi-res STFT + phase + FM losses
│   └── models/
│       ├── __init__.py
│       ├── generator.py      # ExWaveNet + PulseExtractor + UpsampleNet
│       └── discriminator.py  # MPD (HiFi-GAN) + MRSD (UnivNet)
├── train.py                  # Modular training script
├── train_hifi_wavegan.py     # Self-contained single-file training
├── inference.py              # Inference script
└── README.md
```
## Citation
```bibtex
@inproceedings{lu2023hifiwavegan,
title={HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation},
author={Lu, Chunhui and others},
booktitle={ICASSP 2023},
year={2023}
}
```
## References
- [Parallel WaveGAN](https://arxiv.org/abs/1910.11480) — Base WaveNet generator architecture
- [HiFi-GAN](https://arxiv.org/abs/2010.05646) — Multi-Period Discriminator
- [UnivNet](https://arxiv.org/abs/2106.07889) — Multi-Resolution Spectrogram Discriminator
- [GTSinger](https://huggingface.co/datasets/AaronZ345/GTSinger) — 48kHz singing voice dataset