
HiFi-WaveGAN — 48kHz Singing Voice Vocoder


Full PyTorch implementation of:

HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation
Chunhui Lu et al., 2022

Architecture Overview

```
┌─────────────────────────────────────────────────────────┐
│                    HiFi-WaveGAN                         │
├─────────────────┬───────────────────┬───────────────────┤
│   Generator     │   Discriminator 1 │   Discriminator 2 │
│   (ExWaveNet)   │   (MPD)           │   (MRSD)          │
├─────────────────┼───────────────────┼───────────────────┤
│ 3×18=54 layers  │ 5 sub-discs       │ 4 sub-discs       │
│ Kernels:        │ Periods:          │ STFT configs:     │
│ {3,3,9,9,17,17} │ [2,3,5,7,11]      │ [512,1024,        │
│ Residual ch: 80 │ 2D Conv on        │  2048,4096]       │
│ ~9.5M params    │ reshaped waveform │ 2D Conv on spec   │
│                 │ ~41M params       │ ~0.4M params      │
│ + Pulse Extract │                   │                   │
│ + Noise Upsamp  │                   │                   │
└─────────────────┴───────────────────┴───────────────────┘
```

Key Components

  1. Extended WaveNet Generator (ExWaveNet)

    • Non-causal WaveNet with 54 layers (3 stacks × 18 layers)
    • Larger kernel sizes {3,3,9,9,17,17} for wider receptive field (vs standard kernel=3)
    • Dilation pattern: 2^(i % 9) per stack layer
    • Transposed conv upsampling: mel (frame-level) β†’ sample-level
    • Pulse Extractor: F0-synchronized impulse train as additional constraint condition
  2. Multi-Period Discriminator (MPD) — from HiFi-GAN

    • 5 sub-discriminators with periods [2, 3, 5, 7, 11]
    • Reshapes 1D waveform to 2D, applies 2D convolutions
  3. Multi-Resolution Spectrogram Discriminator (MRSD) — from UnivNet

    • 4 sub-discriminators with STFT configs:
      • (FFT=512, hop=50, win=240)
      • (FFT=1024, hop=120, win=600)
      • (FFT=2048, hop=240, win=1200)
      • (FFT=4096, hop=480, win=2400)
  4. Loss Functions

    • Adversarial: LSGAN format (Eq. 4-5)
    • Auxiliary: Multi-resolution STFT (spectral convergence + log magnitude + phase)
    • Feature matching: L1 on intermediate discriminator features
    • Weights: L_G = 1·L_adv + 120·L_aux + 10·L_fm
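The depth figures for the generator can be sanity-checked directly from the dilation rule above. The sketch below assumes kernel size 3 for every layer when estimating the receptive field; the actual generator mixes kernels {3,3,9,9,17,17}, so its true receptive field is wider:

```python
# Dilation schedule for ExWaveNet: 3 stacks x 18 layers, dilation 2**(i % 9)
# within each stack, so the cycle 1..256 repeats twice per stack.
n_stacks, layers_per_stack = 3, 18
dilations = [2 ** (i % 9) for _ in range(n_stacks) for i in range(layers_per_stack)]

assert len(dilations) == 54                          # 3 x 18 = 54 layers total
assert dilations[:9] == [1, 2, 4, 8, 16, 32, 64, 128, 256]

# Each dilated conv layer with kernel k and dilation d widens the receptive
# field by (k - 1) * d samples. Illustrative lower bound with k = 3 throughout:
receptive = 1 + sum((3 - 1) * d for d in dilations)
assert receptive == 6133                             # ~128 ms at 48 kHz
```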

Audio Configuration

| Parameter | Value |
|---|---|
| Sample rate | 48,000 Hz |
| Mel bins | 120 |
| FFT size | 2048 |
| Window | 20 ms (960 samples) |
| Hop | 5 ms (240 samples) |
| F_min | 0 Hz |
| F_max | 24,000 Hz |
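A quick arithmetic check of this configuration (plain Python, no dependencies): the 240-sample hop at 48 kHz gives 200 mel frames per second, which fixes how frame counts and sample counts convert:

```python
sr = 48_000
hop = 240      # 5 ms at 48 kHz
win = 960      # 20 ms at 48 kHz

assert hop / sr == 0.005          # hop duration in seconds
assert win / sr == 0.020          # window duration in seconds

frames_per_second = sr // hop
assert frames_per_second == 200   # 200 mel frames per second

# A 4-second training segment (see the recipe below):
segment_samples = 4 * sr          # 192,000 samples
segment_frames = segment_samples // hop
assert segment_frames == 800      # conditioned on 800 mel frames
```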

Training Recipe (from paper)

| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 2×10⁻⁴ |
| β₁, β₂ | 0.8, 0.99 |
| Weight decay | 0.01 |
| LR schedule | Exponential decay, γ = 0.999 |
| Iterations | 200,000 |
| Batch size | 8 |
| Segment length | 4 seconds (192,000 samples) |
| Training time | ~70 h on 4× V100 |
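Under the exponential schedule above, the learning rate after n decay steps is 2×10⁻⁴ · 0.999ⁿ. A minimal sketch (whether the decay is applied per iteration or per epoch is not specified here, so treat the step unit as an assumption):

```python
def lr_at_step(n, base_lr=2e-4, gamma=0.999):
    """Exponentially decayed learning rate after n decay steps."""
    return base_lr * gamma ** n

assert lr_at_step(0) == 2e-4
# After 1000 decay steps the rate has dropped to roughly 37% of its
# initial value (0.999**1000 is approximately e**-1):
assert 7.3e-5 < lr_at_step(1000) < 7.4e-5
```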

Dataset

Training uses GTSinger — a high-quality 48kHz singing voice dataset with:

  • ~80 hours of singing across 20 professional singers
  • 9 languages, 6 singing techniques
  • Native 48kHz recording (no resampling needed)

Quick Start

Installation

```bash
pip install torch torchaudio numpy huggingface_hub
```

Training

```bash
# Self-contained (downloads GTSinger automatically)
python train_hifi_wavegan.py

# Or modular version
python train.py --data_dir /path/to/audio --batch_size 8 --total_steps 200000
```

Inference

```python
import torch
from hifi_wavegan.models.generator import ExWaveNetGenerator
from hifi_wavegan.config import HiFiWaveGANConfig

cfg = HiFiWaveGANConfig()
gen = ExWaveNetGenerator(
    n_mels=120, residual_ch=80, skip_ch=80,
    n_stacks=3, n_layers_per_stack=18,
    kernel_sizes=(3, 3, 9, 9, 17, 17),
    hop_size=240, sample_rate=48000, use_pulse=True
)

# Load trained weights
gen.load_state_dict(torch.load("generator.pt", map_location="cpu"))
gen.eval()

# Dummy conditioning inputs for illustration (replace with real features):
# mel: [B, 120, T_frames], pitch: [B, 1, T_frames]
# f0: [B, T_frames] (Hz), uv: [B, T_frames] (0/1)
B, T_frames = 1, 400
mel = torch.randn(B, 120, T_frames)
pitch = torch.randn(B, 1, T_frames)
f0 = torch.full((B, T_frames), 220.0)
uv = torch.ones(B, T_frames)

# Generate from mel-spectrogram
wav = gen.inference(mel, pitch, f0, uv)  # → [B, 1, T_frames * 240]
```
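The f0/uv inputs feed the Pulse Extractor, which builds an F0-synchronized impulse train at sample level. A plain-Python sketch of that idea (the actual extractor in generator.py may differ; this is only illustrative):

```python
def pulse_train(f0, uv, hop=240, sr=48000):
    """F0-synchronized impulse train: one unit pulse per pitch period.

    f0: per-frame fundamental frequency in Hz; uv: per-frame voiced flag (0/1).
    Returns a sample-level list of length len(f0) * hop.
    """
    out = [0.0] * (len(f0) * hop)
    phase = 0.0
    for t in range(len(out)):
        frame = t // hop
        if uv[frame] < 0.5 or f0[frame] <= 0:
            phase = 0.0            # reset phase in unvoiced regions
            continue
        phase += f0[frame] / sr    # advance by one sample's worth of cycles
        if phase >= 1.0:           # period boundary crossed: emit a pulse
            phase -= 1.0
            out[t] = 1.0
    return out

# 10 voiced frames at 1500 Hz -> period of exactly 32 samples at 48 kHz,
# so the 2400-sample output contains 2400 / 32 = 75 pulses.
assert sum(pulse_train([1500.0] * 10, [1] * 10)) == 75.0
# Fully unvoiced input produces no pulses:
assert sum(pulse_train([1500.0] * 10, [0] * 10)) == 0.0
```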

Command-line inference

```bash
python inference.py --input singing.wav --output generated.wav --checkpoint generator.pt
```

File Structure

```
├── hifi_wavegan/
│   ├── __init__.py
│   ├── config.py                  # All hyperparameters
│   ├── dataset.py                 # Data loading + mel/F0 extraction
│   ├── losses.py                  # LSGAN + multi-res STFT + phase + FM losses
│   └── models/
│       ├── __init__.py
│       ├── generator.py           # ExWaveNet + PulseExtractor + UpsampleNet
│       └── discriminator.py       # MPD (HiFi-GAN) + MRSD (UnivNet)
├── train.py                       # Modular training script
├── train_hifi_wavegan.py          # Self-contained single-file training
├── inference.py                   # Inference script
└── README.md
```

Citation

```bibtex
@inproceedings{lu2023hifiwavegan,
  title={HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation},
  author={Lu, Chunhui and others},
  booktitle={ICASSP 2023},
  year={2023}
}
```

References

  • Parallel WaveGAN — Base WaveNet generator architecture
  • HiFi-GAN — Multi-Period Discriminator
  • UnivNet — Multi-Resolution Spectrogram Discriminator
  • GTSinger — 48kHz singing voice dataset