# HiFi-WaveGAN: 48kHz Singing Voice Vocoder

[arXiv:2210.12740](https://arxiv.org/abs/2210.12740)
[License](LICENSE)

Full PyTorch implementation of:

> **HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation**
> Chunhui Lu et al., 2022

## Architecture Overview

```
┌───────────────────────────────────────────────────────────┐
│                       HiFi-WaveGAN                        │
├───────────────────┬────────────────────┬──────────────────┤
│ Generator         │ Discriminator 1    │ Discriminator 2  │
│ (ExWaveNet)       │ (MPD)              │ (MRSD)           │
├───────────────────┼────────────────────┼──────────────────┤
│ 3×18=54 layers    │ 5 sub-discs        │ 4 sub-discs      │
│ Kernels:          │ Periods:           │ STFT configs:    │
│ {3,3,9,9,17,17}   │ [2,3,5,7,11]       │ [512,1024,       │
│ Residual ch: 80   │ 2D Conv on         │  2048,4096]      │
│ ~9.5M params      │ reshaped waveform  │ 2D Conv on spec  │
│ + Pulse Extract   │ ~41M params        │ ~0.4M params     │
│ + Noise Upsamp    │                    │                  │
└───────────────────┴────────────────────┴──────────────────┘
```

### Key Components

1. **Extended WaveNet Generator (ExWaveNet)**
   - Non-causal WaveNet with 54 layers (3 stacks × 18 layers)
   - Larger kernel sizes `{3,3,9,9,17,17}` for a wider receptive field (vs. the standard kernel size of 3)
   - Dilation pattern: `2^(i % 9)` per layer within each stack
   - Transposed-conv upsampling: mel (frame level) → sample level
   - **Pulse Extractor**: F0-synchronized impulse train as an additional conditioning signal
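The Pulse Extractor's impulse train can be sketched roughly as follows. This is a minimal illustration, not the repo's actual `PulseExtractor` API, and the phase-accumulation approach is my assumption:

```python
import torch

def pulse_train(f0, uv, hop_size=240, sample_rate=48000):
    """Build an F0-synchronized impulse train: one pulse per pitch period.

    f0: [T_frames] frame-level F0 in Hz; uv: [T_frames] voiced flags (0/1).
    Returns a sample-level signal of shape [T_frames * hop_size].
    """
    # Upsample frame-level F0 and voicing to sample rate by repetition
    f0_up = f0.repeat_interleave(hop_size)
    uv_up = uv.repeat_interleave(hop_size)
    # Accumulate instantaneous phase; a pulse fires each time it crosses an integer
    phase = torch.cumsum(f0_up / sample_rate, dim=0)
    pulses = (phase.floor().diff(prepend=phase.new_zeros(1)) > 0).float()
    return pulses * uv_up  # zero out pulses in unvoiced regions
```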

2. **Multi-Period Discriminator (MPD)**, from HiFi-GAN
   - 5 sub-discriminators with periods `[2, 3, 5, 7, 11]`
   - Reshapes the 1D waveform to 2D and applies 2D convolutions
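The period-reshaping trick at the heart of MPD can be sketched as follows (a simplified stand-in for HiFi-GAN's padding logic):

```python
import torch
import torch.nn.functional as F

def to_period_2d(wav, period):
    """Reshape a waveform [B, 1, T] to [B, 1, T//period, period] so that
    a 2D convolution sees samples spaced `period` steps apart."""
    b, c, t = wav.shape
    if t % period != 0:
        pad = period - (t % period)      # right-pad so T divides the period
        wav = F.pad(wav, (0, pad), mode="reflect")
        t += pad
    return wav.view(b, c, t // period, period)
```

Each of the 5 sub-discriminators applies this with its own period before its 2D conv stack.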

3. **Multi-Resolution Spectrogram Discriminator (MRSD)**, from UnivNet
   - 4 sub-discriminators with STFT configs:
     - `(FFT=512, hop=50, win=240)`
     - `(FFT=1024, hop=120, win=600)`
     - `(FFT=2048, hop=240, win=1200)`
     - `(FFT=4096, hop=480, win=2400)`
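Each MRSD sub-discriminator operates on the magnitude spectrogram at one of these resolutions; the shared front end might look like this (a sketch, not the repo's exact code):

```python
import torch

STFT_CONFIGS = [(512, 50, 240), (1024, 120, 600),
                (2048, 240, 1200), (4096, 480, 2400)]  # (FFT, hop, win)

def multi_res_spectrograms(wav):
    """wav: [B, T] -> list of 4 magnitude spectrograms [B, n_fft//2+1, frames]."""
    specs = []
    for n_fft, hop, win in STFT_CONFIGS:
        s = torch.stft(wav, n_fft=n_fft, hop_length=hop, win_length=win,
                       window=torch.hann_window(win, device=wav.device),
                       return_complex=True)
        specs.append(s.abs())
    return specs
```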

4. **Loss Functions**
   - **Adversarial**: LSGAN formulation (Eq. 4-5 in the paper)
   - **Auxiliary**: Multi-resolution STFT (spectral convergence + log magnitude + phase)
   - **Feature matching**: L1 on intermediate discriminator features
   - **Weights**: `L_G = 1·L_adv + 120·L_aux + 10·L_fm`
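A sketch of the auxiliary terms and the weighted total, with the phase term omitted for brevity (function names are illustrative; the real implementations live in `losses.py`):

```python
import torch
import torch.nn.functional as F

def stft_aux_losses(mag_real, mag_fake, eps=1e-7):
    """Spectral convergence + log-magnitude L1 at one STFT resolution."""
    sc = torch.norm(mag_real - mag_fake, p="fro") / (torch.norm(mag_real, p="fro") + eps)
    log_mag = F.l1_loss(torch.log(mag_fake + eps), torch.log(mag_real + eps))
    return sc + log_mag

def generator_loss(l_adv, l_aux, l_fm):
    """Total generator objective: L_G = 1*L_adv + 120*L_aux + 10*L_fm."""
    return 1.0 * l_adv + 120.0 * l_aux + 10.0 * l_fm
```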

## Audio Configuration

| Parameter | Value |
|-----------|-------|
| Sample rate | 48,000 Hz |
| Mel bins | 120 |
| FFT size | 2048 |
| Window | 20 ms (960 samples) |
| Hop | 5 ms (240 samples) |
| F_min | 0 Hz |
| F_max | 24,000 Hz |

## Training Recipe (from the paper)

| Parameter | Value |
|-----------|-------|
| Optimizer | AdamW |
| Learning rate | 2×10⁻⁴ |
| β₁, β₂ | 0.8, 0.99 |
| Weight decay | 0.01 |
| LR schedule | Exponential decay, γ=0.999 |
| Iterations | 200,000 |
| Batch size | 8 |
| Segment length | 4 s (192,000 samples) |
| Training time | ~70 h on 4× V100 |
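These settings map directly onto PyTorch. How often the exponential decay is stepped is not pinned down by the table (per-epoch is a common choice), and the `Linear` module below is just a stand-in for the generator:

```python
import torch

model = torch.nn.Linear(10, 10)  # stand-in for the generator/discriminators

opt = torch.optim.AdamW(model.parameters(), lr=2e-4,
                        betas=(0.8, 0.99), weight_decay=0.01)
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.999)

opt.step()    # one training iteration (gradient computation omitted here)
sched.step()  # lr is now 2e-4 * 0.999
```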

## Dataset

Training uses [GTSinger](https://huggingface.co/datasets/AaronZ345/GTSinger), a high-quality **48kHz** singing voice dataset with:
- ~80 hours of singing across 20 professional singers
- 9 languages and 6 singing techniques
- Native 48kHz recordings (no resampling needed)

## Quick Start

### Installation

```bash
pip install torch torchaudio numpy huggingface_hub
```

### Training

```bash
# Self-contained (downloads GTSinger automatically)
python train_hifi_wavegan.py

# Or the modular version
python train.py --data_dir /path/to/audio --batch_size 8 --total_steps 200000
```

### Inference

```python
import torch

from hifi_wavegan.models.generator import ExWaveNetGenerator
from hifi_wavegan.config import HiFiWaveGANConfig

cfg = HiFiWaveGANConfig()
gen = ExWaveNetGenerator(
    n_mels=120, residual_ch=80, skip_ch=80,
    n_stacks=3, n_layers_per_stack=18,
    kernel_sizes=(3, 3, 9, 9, 17, 17),
    hop_size=240, sample_rate=48000, use_pulse=True,
)

# Load trained weights
gen.load_state_dict(torch.load("generator.pt", map_location="cpu"))
gen.eval()

# Generate a waveform from a mel-spectrogram
# mel: [B, 120, T_frames], pitch: [B, 1, T_frames]
# f0: [B, T_frames] (Hz), uv: [B, T_frames] (0/1)
wav = gen.inference(mel, pitch, f0, uv)  # -> [B, 1, T_frames * 240]
```

### Command-line inference

```bash
python inference.py --input singing.wav --output generated.wav --checkpoint generator.pt
```

## File Structure

```
├── hifi_wavegan/
│   ├── __init__.py
│   ├── config.py            # All hyperparameters
│   ├── dataset.py           # Data loading + mel/F0 extraction
│   ├── losses.py            # LSGAN + multi-res STFT + phase + FM losses
│   └── models/
│       ├── __init__.py
│       ├── generator.py     # ExWaveNet + PulseExtractor + UpsampleNet
│       └── discriminator.py # MPD (HiFi-GAN) + MRSD (UnivNet)
├── train.py                 # Modular training script
├── train_hifi_wavegan.py    # Self-contained single-file training
├── inference.py             # Inference script
└── README.md
```

## Citation

```bibtex
@inproceedings{lu2023hifiwavegan,
  title={HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation},
  author={Lu, Chunhui and others},
  booktitle={ICASSP 2023},
  year={2023}
}
```

## References

- [Parallel WaveGAN](https://arxiv.org/abs/1910.11480): base WaveNet generator architecture
- [HiFi-GAN](https://arxiv.org/abs/2010.05646): Multi-Period Discriminator
- [UnivNet](https://arxiv.org/abs/2106.07889): Multi-Resolution Spectrogram Discriminator
- [GTSinger](https://huggingface.co/datasets/AaronZ345/GTSinger): 48kHz singing voice dataset