Audio Quality Spectrogram Patch Autoencoder

Overview

This repo publishes a raw PyTorch checkpoint for a convolutional autoencoder that reconstructs grayscale spectrogram patches.

  • Input: 1 x 512 x 256 spectrogram patch
  • Output: reconstructed patch of the same shape
  • Training data: synthetic spectrograms from TashaSkyUp/audio-quality-dataset-nfe4-30-step2
  • Training target: patch reconstruction, not classification

This artifact is best understood as a reconstruction/compression experiment on synthetic speech spectrograms.

What This Repo Is Not

This repo does not publish:

  • a waveform model
  • a text-to-speech model
  • a speech-quality classifier
  • a human-rated perceptual quality model
  • a Transformers-format model package

The checkpoint reconstructs spectrogram images only.

Files In This Repo

  • best_model.pt: preserved best-by-validation-loss checkpoint
  • load_smoke_test.json: local load-and-forward verification report
  • load_smoke_reconstruction.png: smoke-test original vs reconstruction
  • val_reconstruction_preview.png: validation preview from training
  • source_run.log: copied training log
  • preservation_metadata.json: preserved provenance summary

How To Load It

The checkpoint loads into ConvAutoencoder from:

  • tools/train_patch_autoencoder_ssim/train_patch_autoencoder_ssim.py

Minimal load pattern:

import torch
from tools.train_patch_autoencoder_ssim.train_patch_autoencoder_ssim import ConvAutoencoder

checkpoint = torch.load("best_model.pt", map_location="cpu")
model = ConvAutoencoder()
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()

The repo includes a verified local smoke test:

  • load_smoke_test.json

That smoke test confirms CPU load, state-dict compatibility, and a successful forward pass on a real 1 x 1 x 512 x 256 patch.

Training Data And Provenance

This model was trained only on synthetic spectrograms from:

  • TashaSkyUp/audio-quality-dataset-nfe4-30-step2

That dataset contains:

  • 200 short English prompt sentences
  • 2800 synthetic runs (200 sentences × 14 NFE settings)
  • 14 NFE settings: 4, 6, 8, ..., 30
  • fixed seed 1024

Synthetic generation path:

  1. prompt text from the repo-local TSV tmp/audio_quality_dataset/short_sentences_200.tsv
  2. raw synthetic speech from meituan-longcat/LongCat-AudioDiT-3.5B
  3. one ClearVoice MossFormer2_SR_48K post-pass
  4. grayscale spectrogram rendering at 1024 x 512
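Downstream of this pipeline, training slices each 1024 x 512 rendering into overlapping quarter-width patches (width 256, stride 64, per the training configuration below). A minimal sketch of that slicing, assuming width-major patching along the time axis; the zeros array stands in for a real rendering:

```python
import numpy as np

# Hypothetical slicing of one rendered spectrogram into training patches.
spec = np.zeros((512, 1024), dtype=np.float32)  # (freq bins, time frames)
patch_width, stride = 256, 64                   # quarter-width patches, published stride

patches = [
    spec[:, x:x + patch_width]
    for x in range(0, spec.shape[1] - patch_width + 1, stride)
]
# 13 overlapping patches, each 512 x 256
```

With these parameters each 1024-frame rendering yields 13 overlapping patches, which is where the 1 x 512 x 256 model input shape comes from.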

Important provenance notes:

  • all training data is synthetic
  • no human speech recordings are included in this published bundle
  • no manual perceptual labels were used to train this autoencoder
  • the dataset contains procedural weak labels, but this model did not use them
  • repo revision for this export: 064a6bd4df88b3222459350d74341933dcfda075

Model And Training Summary

Model family:

  • convolutional autoencoder
  • encoder channels: 32, 64, 96, 128, 192, 256
  • 6 stride-2 convolutional downsampling stages
  • 6 transposed-convolution decoder stages
  • sigmoid output head
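The summary above can be sketched as a PyTorch module. This is a hypothetical reconstruction, not the published ConvAutoencoder: kernel sizes, padding, and the intermediate ReLU activations are assumptions; only the channel widths, stage counts, and sigmoid head come from the summary.

```python
import torch
import torch.nn as nn

class ConvAutoencoderSketch(nn.Module):
    """Hypothetical sketch of the described architecture (assumptions noted above)."""

    def __init__(self):
        super().__init__()
        chans = [1, 32, 64, 96, 128, 192, 256]
        enc, dec = [], []
        # 6 stride-2 downsampling stages: 512 x 256 -> 8 x 4 spatially
        for cin, cout in zip(chans, chans[1:]):
            enc += [nn.Conv2d(cin, cout, 3, stride=2, padding=1), nn.ReLU(inplace=True)]
        # 6 transposed-convolution upsampling stages, mirroring the encoder
        rev = chans[::-1]
        for cin, cout in zip(rev, rev[1:]):
            dec += [nn.ConvTranspose2d(cin, cout, 3, stride=2, padding=1,
                                       output_padding=1), nn.ReLU(inplace=True)]
        dec[-1] = nn.Sigmoid()  # sigmoid output head per the summary
        self.encoder = nn.Sequential(*enc)
        self.decoder = nn.Sequential(*dec)

    def forward(self, x):
        return self.decoder(self.encoder(x))

# A 1 x 1 x 512 x 256 patch round-trips to the same shape, values in [0, 1].
model = ConvAutoencoderSketch()
with torch.no_grad():
    recon = model(torch.rand(1, 1, 512, 256))
```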

Published training configuration:

  • device during the preserved run: remote RTX 3090
  • batch size: 32
  • learning rate: 3e-4
  • scheduler: cosine annealing, resumed
  • validation split: sentence-level 0.2
  • patch width: 256 (quarter of the 1024-pixel spectrogram width)
  • patch stride: 64
  • loss: (1 - 0.85) * MSE + 0.85 * (1 - SSIM)
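The loss line above blends a pixel term and a structural term. A sketch of just the weighting, taking precomputed MSE and SSIM values as inputs (computing SSIM itself is out of scope here):

```python
def combined_loss(mse, ssim, alpha=0.85):
    """Weighted reconstruction objective per the published config:
    (1 - alpha) * MSE + alpha * (1 - SSIM), with alpha = 0.85."""
    return (1.0 - alpha) * mse + alpha * (1.0 - ssim)

# A perfect reconstruction (MSE = 0, SSIM = 1) gives zero loss.
perfect = combined_loss(0.0, 1.0)
```

With alpha at 0.85, the objective is dominated by the structural-similarity term, which matches the SSIM-focused naming of the training script.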

Checkpoint Selection

The published best_model.pt is the best-by-val_loss checkpoint from the copied run, not the best-by-PSNR checkpoint.

  • preserved checkpoint epoch: 2761
  • preserved checkpoint validation loss: 0.10661878556340605
  • best PSNR seen in the copied run log: 32.57 dB at epoch 2722
  • latest epoch recorded in the copied run log: 2839
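For context on the PSNR figure, the standard formulation for images scaled to [0, 1] is shown below; whether the run log used exactly this convention is an assumption:

```python
import math

def psnr(mse, peak=1.0):
    """PSNR in dB for reconstructions scaled to [0, peak]."""
    return 10.0 * math.log10((peak ** 2) / mse)

# Example: a per-pixel MSE of 0.01 corresponds to 20 dB.
example_db = psnr(0.01)
```

Under this convention, the logged 32.57 dB would correspond to a per-pixel MSE of roughly 5.5e-4.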

Intended Use

Reasonable uses:

  • reconstruction experiments on synthetic speech spectrogram patches
  • compression / latent-space experiments on the published synthetic dataset family
  • reproducing the repo-local autoencoder workflow

Limitations

This checkpoint should not be treated as:

  • a validated real-world speech-quality model
  • evidence about human perceptual quality on natural recordings
  • a general benchmark result for audio reconstruction

Because the training data is synthetic, this model primarily characterizes the published LongCat + ClearVoice artifact pipeline, not general speech audio in the wild.

Licensing And Attribution

This repo is marked license: other because it is a repo-local experiment export and does not assert a new standalone permissive license over the generated artifacts.

Synthetic generation in this workflow depended on:

  • LongCat-AudioDiT from Meituan, specifically meituan-longcat/LongCat-AudioDiT-3.5B
  • ClearVoice, specifically MossFormer2_SR_48K

This Hugging Face repo is not an official upstream release of either dependency. Check upstream terms before redistributing or reusing generated artifacts at scale.
