HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation
Paper • arXiv:2210.12740 • Published
Full PyTorch implementation of:
HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation
Chunhui Lu et al., 2022
```
┌───────────────────────────────────────────────────────────┐
│                       HiFi-WaveGAN                        │
├───────────────────┬───────────────────┬───────────────────┤
│ Generator         │ Discriminator 1   │ Discriminator 2   │
│ (ExWaveNet)       │ (MPD)             │ (MRSD)            │
├───────────────────┼───────────────────┼───────────────────┤
│ 3×18=54 layers    │ 5 sub-discs       │ 4 sub-discs       │
│ Kernels:          │ Periods:          │ STFT configs:     │
│ {3,3,9,9,17,17}   │ [2,3,5,7,11]      │ [512,1024,        │
│ Residual ch: 80   │ 2D Conv on        │  2048,4096]       │
│ ~9.5M params      │ reshaped waveform │ 2D Conv on spec   │
│                   │ ~41M params       │ ~0.4M params      │
│ + Pulse Extract   │                   │                   │
│ + Noise Upsamp    │                   │                   │
└───────────────────┴───────────────────┴───────────────────┘
```
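The MRSD consumes magnitude spectrograms of the waveform at the four STFT resolutions shown in the diagram (hop/window values per the MRSD config list in this README). A minimal sketch of computing those discriminator inputs with `torch.stft` — the repo's actual `discriminator.py` may normalise or log-scale them differently:

```python
import torch

# (n_fft, hop, win) per MRSD sub-discriminator, as listed in this README
STFT_CONFIGS = [(512, 50, 240), (1024, 120, 600), (2048, 240, 1200), (4096, 480, 2400)]

def multi_resolution_specs(wav: torch.Tensor) -> list:
    """wav: [B, T] waveform -> list of magnitude spectrograms [B, F, frames]."""
    specs = []
    for n_fft, hop, win in STFT_CONFIGS:
        window = torch.hann_window(win, device=wav.device)
        s = torch.stft(wav, n_fft=n_fft, hop_length=hop, win_length=win,
                       window=window, return_complex=True)
        specs.append(s.abs())  # magnitude only; phase enters via the aux loss
    return specs
```

With a 1 s batch at 48 kHz (`wav` of shape `[1, 48000]`), the four outputs range from fine time resolution (hop 50) to fine frequency resolution (FFT 4096).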
Extended WaveNet Generator (ExWaveNet)
- Kernel sizes {3, 3, 9, 9, 17, 17} for a wider receptive field (vs. the standard kernel=3)
- Dilation 2^(i % 9) per stack layer

Multi-Period Discriminator (MPD), from HiFi-GAN
- Periods [2, 3, 5, 7, 11]

Multi-Resolution Spectrogram Discriminator (MRSD), from UnivNet
- (FFT=512, hop=50, win=240)
- (FFT=1024, hop=120, win=600)
- (FFT=2048, hop=240, win=1200)
- (FFT=4096, hop=480, win=2400)

Loss Functions

L_G = 1·L_adv + 120·L_aux + 10·L_fm

Audio/feature configuration:

| Parameter | Value |
|---|---|
| Sample rate | 48,000 Hz |
| Mel bins | 120 |
| FFT size | 2048 |
| Window | 20ms (960 samples) |
| Hop | 5ms (240 samples) |
| F_min | 0 Hz |
| F_max | 24,000 Hz |

Training configuration:

| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 2×10⁻⁴ |
| β₁, β₂ | 0.8, 0.99 |
| Weight decay | 0.01 |
| LR schedule | Exponential decay, γ=0.999 |
| Iterations | 200,000 |
| Batch size | 8 |
| Segment length | 4 seconds (192,000 samples) |
| Training time | ~70 h on 4× V100 |
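The generator objective L_G = 1·L_adv + 120·L_aux + 10·L_fm can be sketched as follows. This is illustrative only: it uses the LSGAN generator term (per `losses.py`) and L1 feature matching, and takes the auxiliary spectrogram-phase loss as a precomputed input; the repo's actual loss code may differ:

```python
import torch
import torch.nn.functional as F

def generator_loss(disc_fake_outs, fake_feats, real_feats, l_aux):
    """Sketch of L_G = 1*L_adv + 120*L_aux + 10*L_fm (weights from this README)."""
    # LSGAN generator term: push each discriminator output on fakes towards 1
    l_adv = sum(F.mse_loss(d, torch.ones_like(d)) for d in disc_fake_outs)
    # Feature matching: L1 distance between discriminator feature maps
    l_fm = sum(F.l1_loss(f, r) for f, r in zip(fake_feats, real_feats))
    return 1.0 * l_adv + 120.0 * l_aux + 10.0 * l_fm
```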
Training uses GTSinger, a high-quality 48 kHz singing-voice dataset.
```bash
pip install torch torchaudio numpy huggingface_hub
```

```bash
# Self-contained (downloads GTSinger automatically)
python train_hifi_wavegan.py

# Or modular version
python train.py --data_dir /path/to/audio --batch_size 8 --total_steps 200000
```
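The optimiser settings from the training table can be wired up as below (a sketch; the real setup lives in `train.py`, and whether the exponential decay is stepped per iteration or per epoch is not specified here):

```python
import torch

model = torch.nn.Linear(120, 240)  # stand-in for the generator
opt = torch.optim.AdamW(model.parameters(), lr=2e-4, betas=(0.8, 0.99),
                        weight_decay=0.01)
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.999)

for step in range(3):  # the full run is 200,000 iterations
    opt.zero_grad()
    loss = model(torch.randn(8, 120)).pow(2).mean()
    loss.backward()
    opt.step()
    sched.step()       # lr *= 0.999 per step in this sketch
```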
```python
import torch
from hifi_wavegan.models.generator import ExWaveNetGenerator
from hifi_wavegan.config import HiFiWaveGANConfig

cfg = HiFiWaveGANConfig()
gen = ExWaveNetGenerator(
    n_mels=120, residual_ch=80, skip_ch=80,
    n_stacks=3, n_layers_per_stack=18,
    kernel_sizes=(3, 3, 9, 9, 17, 17),
    hop_size=240, sample_rate=48000, use_pulse=True,
)

# Load trained weights
gen.load_state_dict(torch.load("generator.pt", map_location="cpu"))
gen.eval()

# Generate from mel-spectrogram
# mel: [B, 120, T_frames], pitch: [B, 1, T_frames]
# f0:  [B, T_frames] (Hz), uv: [B, T_frames] (0/1)
wav = gen.inference(mel, pitch, f0, uv)  # -> [B, 1, T_frames * 240]
```
```bash
python inference.py --input singing.wav --output generated.wav --checkpoint generator.pt
```
```
├── hifi_wavegan/
│   ├── __init__.py
│   ├── config.py              # All hyperparameters
│   ├── dataset.py             # Data loading + mel/F0 extraction
│   ├── losses.py              # LSGAN + multi-res STFT + phase + FM losses
│   └── models/
│       ├── __init__.py
│       ├── generator.py       # ExWaveNet + PulseExtractor + UpsampleNet
│       └── discriminator.py   # MPD (HiFi-GAN) + MRSD (UnivNet)
├── train.py                   # Modular training script
├── train_hifi_wavegan.py      # Self-contained single-file training
├── inference.py               # Inference script
└── README.md
```
```bibtex
@inproceedings{lu2023hifiwavegan,
  title={HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation},
  author={Lu, Chunhui and others},
  booktitle={ICASSP 2023},
  year={2023}
}
```