Audio Quality Spectrogram Patch Autoencoder

Overview

This repo publishes a raw PyTorch checkpoint for a convolutional autoencoder that reconstructs grayscale spectrogram patches.

  • Input: 1 x 512 x 256 spectrogram patch
  • Output: reconstructed patch of the same shape
  • Training data: synthetic spectrograms from TashaSkyUp/audio-quality-dataset-nfe4-30-step2
  • Training target: patch reconstruction, not classification

This artifact is best understood as a reconstruction/compression experiment on synthetic speech spectrograms.

What This Repo Is Not

This repo does not publish:

  • a waveform model
  • a text-to-speech model
  • a speech-quality classifier
  • a human-rated perceptual quality model
  • a Transformers-format model package

The checkpoint reconstructs spectrogram images only.

Files In This Repo

  • best_model.pt: preserved best-by-validation-loss checkpoint
  • load_smoke_test.json: local load-and-forward verification report
  • load_smoke_reconstruction.png: smoke-test original vs reconstruction
  • val_reconstruction_preview.png: validation preview from training
  • source_run.log: copied training log
  • preservation_metadata.json: preserved provenance summary

How To Load It

The checkpoint loads into ConvAutoencoder from:

  • tools/train_patch_autoencoder_ssim/train_patch_autoencoder_ssim.py

Minimal load pattern:

import torch
from tools.train_patch_autoencoder_ssim.train_patch_autoencoder_ssim import ConvAutoencoder

checkpoint = torch.load("best_model.pt", map_location="cpu")
model = ConvAutoencoder()
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()

The repo includes a verified local smoke test:

  • load_smoke_test.json

That smoke test confirms CPU load, state-dict compatibility, and a successful forward pass on a real 1 x 1 x 512 x 256 patch.

Training Data And Provenance

This model was trained only on synthetic spectrograms from:

  • TashaSkyUp/audio-quality-dataset-nfe4-30-step2

That dataset contains:

  • 200 short English prompt sentences
  • 2800 synthetic runs (200 sentences × 14 NFE settings)
  • 14 NFE settings: 4, 6, 8, ..., 30
  • fixed seed 1024

Synthetic generation path:

  1. prompt text from the repo-local TSV tmp/audio_quality_dataset/short_sentences_200.tsv
  2. raw synthetic speech from meituan-longcat/LongCat-AudioDiT-3.5B
  3. one ClearVoice MossFormer2_SR_48K post-pass
  4. grayscale spectrogram rendering at 1024 x 512
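Downstream of this pipeline, training slices each 1024 x 512 rendering into overlapping quarter-width patches (width 256, stride 64, per the training configuration below). A minimal sketch of that slicing, assuming width-major patching along the time axis; the zeros array stands in for a real rendering:

```python
import numpy as np

# Hypothetical slicing of one rendered spectrogram into training patches.
spec = np.zeros((512, 1024), dtype=np.float32)  # (freq bins, time frames)
patch_width, stride = 256, 64                   # quarter-width patches, published stride

patches = [
    spec[:, x:x + patch_width]
    for x in range(0, spec.shape[1] - patch_width + 1, stride)
]
# 13 overlapping patches, each 512 x 256
```

With these parameters each 1024-frame rendering yields 13 overlapping patches, which is where the 1 x 512 x 256 model input shape comes from.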

Important provenance notes:

  • all training data is synthetic
  • no human speech recordings are included in this published bundle
  • no manual perceptual labels were used to train this autoencoder
  • the dataset contains procedural weak labels, but this model did not use them
  • repo revision for this export: 064a6bd4df88b3222459350d74341933dcfda075

Model And Training Summary

Model family:

  • convolutional autoencoder
  • encoder channels: 32, 64, 96, 128, 192, 256
  • 6 stride-2 convolutional downsampling stages
  • 6 transposed-convolution decoder stages
  • sigmoid output head
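The summary above can be sketched as a PyTorch module. This is a hypothetical reconstruction, not the published ConvAutoencoder: kernel sizes, padding, and the intermediate ReLU activations are assumptions; only the channel widths, stage counts, and sigmoid head come from the summary.

```python
import torch
import torch.nn as nn

class ConvAutoencoderSketch(nn.Module):
    """Hypothetical sketch of the described architecture (assumptions noted above)."""

    def __init__(self):
        super().__init__()
        chans = [1, 32, 64, 96, 128, 192, 256]
        enc, dec = [], []
        # 6 stride-2 downsampling stages: 512 x 256 -> 8 x 4 spatially
        for cin, cout in zip(chans, chans[1:]):
            enc += [nn.Conv2d(cin, cout, 3, stride=2, padding=1), nn.ReLU(inplace=True)]
        # 6 transposed-convolution upsampling stages, mirroring the encoder
        rev = chans[::-1]
        for cin, cout in zip(rev, rev[1:]):
            dec += [nn.ConvTranspose2d(cin, cout, 3, stride=2, padding=1,
                                       output_padding=1), nn.ReLU(inplace=True)]
        dec[-1] = nn.Sigmoid()  # sigmoid output head per the summary
        self.encoder = nn.Sequential(*enc)
        self.decoder = nn.Sequential(*dec)

    def forward(self, x):
        return self.decoder(self.encoder(x))

# A 1 x 1 x 512 x 256 patch round-trips to the same shape, values in [0, 1].
model = ConvAutoencoderSketch()
with torch.no_grad():
    recon = model(torch.rand(1, 1, 512, 256))
```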

Published training configuration:

  • device during the preserved run: remote RTX 3090
  • batch size: 32
  • learning rate: 3e-4
  • scheduler: cosine annealing, resumed
  • validation split: sentence-level 0.2
  • patch width: 256 (quarter of the 1024-pixel spectrogram width)
  • patch stride: 64
  • loss: (1 - 0.85) * MSE + 0.85 * (1 - SSIM)
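The loss line above blends a pixel term and a structural term. A sketch of just the weighting, taking precomputed MSE and SSIM values as inputs (computing SSIM itself is out of scope here):

```python
def combined_loss(mse, ssim, alpha=0.85):
    """Weighted reconstruction objective per the published config:
    (1 - alpha) * MSE + alpha * (1 - SSIM), with alpha = 0.85."""
    return (1.0 - alpha) * mse + alpha * (1.0 - ssim)

# A perfect reconstruction (MSE = 0, SSIM = 1) gives zero loss.
perfect = combined_loss(0.0, 1.0)
```

With alpha at 0.85, the objective is dominated by the structural-similarity term, which matches the SSIM-focused naming of the training script.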

Checkpoint Selection

The published best_model.pt is the best-by-val_loss checkpoint from the copied run, not the best-by-PSNR checkpoint.

  • preserved checkpoint epoch: 2761
  • preserved checkpoint validation loss: 0.10661878556340605
  • best PSNR seen in the copied run log: 32.57 dB at epoch 2722
  • latest epoch recorded in the copied run log: 2839
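For context on the PSNR figure, the standard formulation for images scaled to [0, 1] is shown below; whether the run log used exactly this convention is an assumption:

```python
import math

def psnr(mse, peak=1.0):
    """PSNR in dB for reconstructions scaled to [0, peak]."""
    return 10.0 * math.log10((peak ** 2) / mse)

# Example: a per-pixel MSE of 0.01 corresponds to 20 dB.
example_db = psnr(0.01)
```

Under this convention, the logged 32.57 dB would correspond to a per-pixel MSE of roughly 5.5e-4.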

Intended Use

Reasonable uses:

  • reconstruction experiments on synthetic speech spectrogram patches
  • compression / latent-space experiments on the published synthetic dataset family
  • reproducing the repo-local autoencoder workflow

Limitations

This checkpoint should not be treated as:

  • a validated real-world speech-quality model
  • evidence about human perceptual quality on natural recordings
  • a general benchmark result for audio reconstruction

Because the training data is synthetic, this model primarily characterizes the published LongCat + ClearVoice artifact pipeline, not general speech audio in the wild.

Licensing And Attribution

This repo is marked license: other because it is a repo-local experiment export and does not assert a new standalone permissive license over the generated artifacts.

Synthetic generation in this workflow depended on:

  • LongCat-AudioDiT from Meituan, specifically meituan-longcat/LongCat-AudioDiT-3.5B
  • ClearVoice, specifically MossFormer2_SR_48K

This Hugging Face repo is not an official upstream release of either dependency. Check upstream terms before redistributing or reusing generated artifacts at scale.
