HiFi-GAN Vocoder - Russian (RUSLAN corpus)

HiFi-GAN V1 vocoder trained on the RUSLAN Russian speech corpus for high-quality mel-to-audio conversion.

Training Details

  • Architecture: HiFi-GAN V1 (14M parameters, 512 initial channels)
  • Training code: jik876/hifi-gan
  • Dataset: RUSLAN corpus (single male speaker, studio quality), 13,865 training files (~16 hours)
  • Steps: 160,000 (~185 epochs)
  • Hardware: NVIDIA RTX 4090 (24 GB)
  • Batch size: 16
  • Final mel-spec error: ~0.29
  • Stopping reason: No further perceptible improvement in audio quality beyond 160k steps
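
The step and epoch figures above are consistent with each other, which a little arithmetic confirms. The sketch below assumes one optimizer step per batch and that an incomplete final batch is dropped; the variable names are illustrative only.

```python
# Figures taken from the training details above
num_files = 13_865
batch_size = 16
total_steps = 160_000

# Assumption: one step per batch, incomplete last batch dropped
steps_per_epoch = num_files // batch_size
epochs = total_steps / steps_per_epoch

print(steps_per_epoch)   # 866
print(round(epochs))     # 185
```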

Mel Spectrogram Parameters

These must match exactly when computing mel spectrograms for input:

Parameter     Value
sample_rate   22050
n_fft         1024
hop_size      256
win_size      1024
num_mels      80
fmin          0
fmax          8000
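
Since a single mismatched value is enough to degrade the output, it can help to check a feature extractor's settings against this table before running inference. The following is a minimal sketch using only the standard library. The function name `find_mismatches` and the `candidate` settings are hypothetical. The key names follow the table; the config.json in jik876/hifi-gan spells the first one `sampling_rate`, so adjust the keys to whatever your settings use.

```python
# Expected values copied from the table above
EXPECTED = {
    "sample_rate": 22050, "n_fft": 1024, "hop_size": 256, "win_size": 1024,
    "num_mels": 80, "fmin": 0, "fmax": 8000,
}

def find_mismatches(params):
    """Return {name: (expected, actual)} for every parameter that differs or is missing."""
    return {
        name: (expected, params.get(name))
        for name, expected in EXPECTED.items()
        if params.get(name) != expected
    }

# Hypothetical feature-extractor settings with a wrong hop size
candidate = dict(EXPECTED, hop_size=512)
print(find_mismatches(candidate))   # {'hop_size': (256, 512)}

# Derived timing: mel frames per second and hop duration in milliseconds
frames_per_second = EXPECTED["sample_rate"] / EXPECTED["hop_size"]
hop_ms = 1000 * EXPECTED["hop_size"] / EXPECTED["sample_rate"]
print(round(frames_per_second, 2), round(hop_ms, 2))   # 86.13 11.61
```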

Important: Mel normalization must use HiFi-GAN's standard dynamic_range_compression:

# CORRECT: matches training format
mel = torch.log(torch.clamp(mel_linear, min=1e-5))
# range: approximately [-11.5, 0.9]

# WRONG: will produce artifacts
mel = torch.log(mel_linear + 1e-9)
# range: approximately [-20, 8]; the vocoder was NOT trained on this
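
The two formulas differ mainly at low energies, which is where the lower bounds of the two ranges come from. The sketch below reproduces both on plain floats with the standard library, so it runs without PyTorch. The helper names `log_clamp` and `log_epsilon` are illustrative and not part of HiFi-GAN.

```python
import math

CLAMP_FLOOR = 1e-5   # mirrors torch.clamp(mel_linear, min=1e-5)
EPSILON = 1e-9       # mirrors mel_linear + 1e-9

def log_clamp(x):
    """Log with a hard floor: values below 1e-5 all map to log(1e-5)."""
    return math.log(max(x, CLAMP_FLOOR))

def log_epsilon(x):
    """Log with an additive epsilon: quiet values keep falling toward log(1e-9)."""
    return math.log(x + EPSILON)

# Compare the two on silent, very quiet, moderate, and loud mel energies
for energy in (0.0, 1e-7, 1e-3, 1.0):
    print(f"{energy:g}: clamp={log_clamp(energy):.2f}  epsilon={log_epsilon(energy):.2f}")
```

For a silent bin the clamp version bottoms out near -11.5 while the epsilon version reaches about -20.7, and the two agree closely once the energy is well above 1e-5. Quiet frames are therefore the ones that land outside the range the vocoder saw during training.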

Usage

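import json

import torch
import torchaudio

# These two imports come from a local clone of the jik876/hifi-gan repo
from env import AttrDict
from models import Generator

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load model config; Generator reads hyperparameters as attributes (h.upsample_rates, ...),
# so the plain dict from json.load is wrapped in AttrDict
with open('config.json') as f:
    h = AttrDict(json.load(f))

generator = Generator(h).to(device)
ckpt = torch.load('generator.pth', map_location=device)
generator.load_state_dict(ckpt['generator'])
generator.eval()

# Load audio, downmix to mono, and resample to 22050 Hz if needed
waveform, sr = torchaudio.load('audio.wav')
waveform = waveform.mean(dim=0, keepdim=True)
if sr != 22050:
    waveform = torchaudio.functional.resample(waveform, sr, 22050)

# Compute mel (must use clamp, not epsilon addition)
# Note: torchaudio names the frequency bounds f_min / f_max.
# Note: jik876/hifi-gan's own meldataset.py uses a magnitude spectrogram (power=1)
# with librosa (Slaney) mel filters; if outputs sound off, compare against that pipeline.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, n_mels=80,
    hop_length=256, win_length=1024, f_min=0, f_max=8000,
    power=2.0, normalized=False,
)
mel_linear = mel_transform(waveform)  # shape: [1, 80, frames]
mel = torch.log(torch.clamp(mel_linear, min=1e-5))  # Standard HiFi-GAN normalization

# Generate audio
with torch.no_grad():
    audio = generator(mel.to(device)).squeeze().cpu()
torchaudio.save('output.wav', audio.unsqueeze(0), 22050)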
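
Because each mel frame corresponds to hop_size = 256 samples, the length of the generated audio can be predicted from the input length. This is a rough sketch under two assumptions. First, the mel transform pads the signal at both ends (torchaudio's default `center=True`), which yields 1 + n_samples // hop frames. Second, the generator upsamples each frame by exactly 256, as HiFi-GAN V1's upsample rates 8 × 8 × 2 × 2 imply. The helper names are hypothetical.

```python
HOP_SIZE = 256
SAMPLE_RATE = 22050

def mel_frames(num_samples, hop=HOP_SIZE):
    """Number of mel frames for a centered STFT over num_samples samples."""
    return 1 + num_samples // hop

def output_samples(num_samples, hop=HOP_SIZE):
    """Samples produced by the vocoder: one hop of audio per mel frame."""
    return mel_frames(num_samples, hop) * hop

n = 3 * SAMPLE_RATE          # 3 seconds of input audio
print(mel_frames(n))         # 259
print(output_samples(n))     # 66304
```

Under these assumptions the output is slightly longer than the input (66304 versus 66150 samples here), so slicing the result to the original length, for example `audio[:n]`, keeps the two aligned when that matters.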

License

MIT (same as original HiFi-GAN)
