# HiFi-WaveGAN — 48kHz Singing Voice Vocoder

[![Paper](https://img.shields.io/badge/Paper-arXiv%202210.12740-red)](https://arxiv.org/abs/2210.12740) [![License](https://img.shields.io/badge/License-MIT-blue)](LICENSE)

Full PyTorch implementation of:

> **HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation**
> Chunhui Lu et al., 2022

## Architecture Overview

```
┌─────────────────────────────────────────────────────────┐
│                      HiFi-WaveGAN                       │
├─────────────────┬───────────────────┬───────────────────┤
│    Generator    │  Discriminator 1  │  Discriminator 2  │
│   (ExWaveNet)   │       (MPD)       │      (MRSD)       │
├─────────────────┼───────────────────┼───────────────────┤
│ 3×18=54 layers  │ 5 sub-discs       │ 4 sub-discs       │
│ Kernels:        │ Periods:          │ STFT configs:     │
│ {3,3,9,9,17,17} │ [2,3,5,7,11]      │ [512,1024,        │
│ Residual ch: 80 │ 2D Conv on        │  2048,4096]       │
│ ~9.5M params    │ reshaped waveform │ 2D Conv on spec   │
│                 │ ~41M params       │ ~0.4M params      │
│ + Pulse Extract │                   │                   │
│ + Noise Upsamp  │                   │                   │
└─────────────────┴───────────────────┴───────────────────┘
```

### Key Components

1. **Extended WaveNet Generator (ExWaveNet)**
   - Non-causal WaveNet with 54 layers (3 stacks × 18 layers)
   - Larger kernel sizes `{3,3,9,9,17,17}` for a wider receptive field (vs. the standard kernel size of 3)
   - Dilation pattern: `2^(i % 9)` per stack layer
   - Transposed-conv upsampling: mel (frame-level) → sample-level
   - **Pulse Extractor**: F0-synchronized impulse train as an additional conditioning constraint

2. **Multi-Period Discriminator (MPD)** — from HiFi-GAN
   - 5 sub-discriminators with periods `[2, 3, 5, 7, 11]`
   - Reshapes the 1D waveform to 2D and applies 2D convolutions

3. **Multi-Resolution Spectrogram Discriminator (MRSD)** — from UnivNet
   - 4 sub-discriminators with STFT configs:
     - `(FFT=512, hop=50, win=240)`
     - `(FFT=1024, hop=120, win=600)`
     - `(FFT=2048, hop=240, win=1200)`
     - `(FFT=4096, hop=480, win=2400)`

4. **Loss Functions**
   - **Adversarial**: LSGAN format (Eq.
4-5)
   - **Auxiliary**: multi-resolution STFT (spectral convergence + log magnitude + phase)
   - **Feature matching**: L1 on intermediate discriminator features
   - **Weights**: `L_G = 1·L_adv + 120·L_aux + 10·L_fm`

## Audio Configuration

| Parameter | Value |
|-----------|-------|
| Sample rate | 48,000 Hz |
| Mel bins | 120 |
| FFT size | 2048 |
| Window | 20 ms (960 samples) |
| Hop | 5 ms (240 samples) |
| F_min | 0 Hz |
| F_max | 24,000 Hz |

## Training Recipe (from paper)

| Parameter | Value |
|-----------|-------|
| Optimizer | AdamW |
| Learning rate | 2×10⁻⁴ |
| β₁, β₂ | 0.8, 0.99 |
| Weight decay | 0.01 |
| LR schedule | Exponential decay, γ=0.999 |
| Iterations | 200,000 |
| Batch size | 8 |
| Segment length | 4 seconds (192,000 samples) |
| Training time | ~70 h on 4× V100 |

## Dataset

Training uses [GTSinger](https://huggingface.co/datasets/AaronZ345/GTSinger) — a high-quality **48kHz** singing voice dataset with:

- ~80 hours of singing across 20 professional singers
- 9 languages, 6 singing techniques
- Native 48kHz recording (no resampling needed)

## Quick Start

### Installation

```bash
pip install torch torchaudio numpy huggingface_hub
```

### Training

```bash
# Self-contained (downloads GTSinger automatically)
python train_hifi_wavegan.py

# Or modular version
python train.py --data_dir /path/to/audio --batch_size 8 --total_steps 200000
```

### Inference

```python
import torch

from hifi_wavegan.models.generator import ExWaveNetGenerator
from hifi_wavegan.config import HiFiWaveGANConfig

cfg = HiFiWaveGANConfig()
gen = ExWaveNetGenerator(
    n_mels=120, residual_ch=80, skip_ch=80,
    n_stacks=3, n_layers_per_stack=18,
    kernel_sizes=(3, 3, 9, 9, 17, 17),
    hop_size=240, sample_rate=48000, use_pulse=True,
)

# Load trained weights
gen.load_state_dict(torch.load("generator.pt", map_location="cpu"))
gen.eval()

# Generate from mel-spectrogram
# mel: [B, 120, T_frames], pitch: [B, 1, T_frames]
# f0: [B, T_frames] (Hz), uv: [B, T_frames] (0/1)
wav = gen.inference(mel, pitch, f0, uv)
# → [B, 1, T_frames * 240]
```

### Command-line inference

```bash
python inference.py --input singing.wav --output generated.wav --checkpoint generator.pt
```

## File Structure

```
├── hifi_wavegan/
│   ├── __init__.py
│   ├── config.py              # All hyperparameters
│   ├── dataset.py             # Data loading + mel/F0 extraction
│   ├── losses.py              # LSGAN + multi-res STFT + phase + FM losses
│   └── models/
│       ├── __init__.py
│       ├── generator.py       # ExWaveNet + PulseExtractor + UpsampleNet
│       └── discriminator.py   # MPD (HiFi-GAN) + MRSD (UnivNet)
├── train.py                   # Modular training script
├── train_hifi_wavegan.py      # Self-contained single-file training
├── inference.py               # Inference script
└── README.md
```

## Citation

```bibtex
@inproceedings{lu2023hifiwavegan,
  title={HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation},
  author={Lu, Chunhui and others},
  booktitle={ICASSP 2023},
  year={2023}
}
```

## References

- [Parallel WaveGAN](https://arxiv.org/abs/1910.11480) — base WaveNet generator architecture
- [HiFi-GAN](https://arxiv.org/abs/2010.05646) — Multi-Period Discriminator
- [UnivNet](https://arxiv.org/abs/2106.07889) — Multi-Resolution Spectrogram Discriminator
- [GTSinger](https://huggingface.co/datasets/AaronZ345/GTSinger) — 48kHz singing voice dataset
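The Pulse Extractor listed under Key Components conditions the generator on an F0-synchronized impulse train. The repository's implementation lives in `hifi_wavegan/models/generator.py`; the sketch below is a minimal toy version, not that code — the function name `f0_to_pulse`, the per-hop repetition upsampling, and the integer-phase-crossing pulse placement are all illustrative assumptions:

```python
import torch

def f0_to_pulse(f0, uv, hop_size=240, sample_rate=48000):
    """Toy F0-synchronized impulse train (illustrative, not the repo's code).

    f0, uv: [T_frames] frame-level pitch (Hz) and voiced flag (0/1).
    Returns a sample-level pulse signal of length T_frames * hop_size.
    """
    # Frame-level -> sample-level by repeating each frame's value over its hop
    f0_up = torch.repeat_interleave(f0, hop_size)
    uv_up = torch.repeat_interleave(uv, hop_size)
    # Accumulated phase in cycles; a pulse fires whenever it crosses an integer
    phase = torch.cumsum(f0_up / sample_rate, dim=-1)
    crossings = torch.diff(torch.floor(phase), prepend=phase.new_zeros(1)) > 0
    # Suppress pulses in unvoiced regions
    return crossings.float() * uv_up

# 1 second of a constant 100.5 Hz voiced track (200 frames × 240 samples)
f0 = torch.full((200,), 100.5)
uv = torch.ones(200)
pulse = f0_to_pulse(f0, uv)
print(tuple(pulse.shape), int(pulse.sum()))  # → (48000,) 100
```

One pulse per completed pitch period gives 100 impulses for 100.5 cycles in one second; in the model this sample-level train is fed alongside the upsampled noise as an extra generator input.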
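The auxiliary multi-resolution STFT loss from the Loss Functions section can likewise be sketched. This is a hedged illustration, not the API of `hifi_wavegan/losses.py`: the function names are hypothetical, the four MRSD STFT configs are assumed to be reused for the auxiliary term, and the paper's phase component is omitted, leaving the spectral-convergence and log-magnitude terms:

```python
import torch
import torch.nn.functional as F

def stft_mag(x, n_fft, hop, win):
    # Magnitude spectrogram; clamp avoids log(0) below
    window = torch.hann_window(win)
    spec = torch.stft(x, n_fft=n_fft, hop_length=hop, win_length=win,
                      window=window, return_complex=True)
    return spec.abs().clamp_min(1e-7)

def multi_res_stft_loss(pred, target,
                        configs=((512, 50, 240), (1024, 120, 600),
                                 (2048, 240, 1200), (4096, 480, 2400))):
    """Spectral convergence + log-magnitude L1, averaged over resolutions.

    pred, target: [B, T] waveforms. Phase term omitted for brevity.
    """
    total = 0.0
    for n_fft, hop, win in configs:
        p = stft_mag(pred, n_fft, hop, win)
        t = stft_mag(target, n_fft, hop, win)
        sc = torch.norm(t - p, p="fro") / torch.norm(t, p="fro")
        log_mag = F.l1_loss(torch.log(p), torch.log(t))
        total = total + sc + log_mag
    return total / len(configs)

x = torch.randn(2, 48000)
print(float(multi_res_stft_loss(x, x)))  # identical inputs → 0.0
```

Evaluating the loss at several FFT sizes keeps the generator from overfitting any single time-frequency trade-off; in training it is weighted by the factor 120 from `L_G` above.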