# Audio Quality Spectrogram Patch Autoencoder

## Overview
This repo publishes a raw PyTorch checkpoint for a convolutional autoencoder that reconstructs grayscale spectrogram patches.
- Input: `1 x 512 x 256` spectrogram patch
- Output: reconstructed patch of the same shape
- Training data: synthetic spectrograms from `TashaSkyUp/audio-quality-dataset-nfe4-30-step2`
- Training target: patch reconstruction, not classification
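As a sketch of how full-height, quarter-width patches could be cut from a `1024 x 512` rendering (patch width `256`, stride `64`, per the training summary below) — this helper is illustrative, not the repo's actual extraction code:

```python
import numpy as np

def extract_patches(spec: np.ndarray, width: int = 256, stride: int = 64) -> np.ndarray:
    """Slide a full-height window along the time axis of an (H, W) spectrogram."""
    h, w = spec.shape
    starts = range(0, w - width + 1, stride)
    return np.stack([spec[:, s:s + width] for s in starts])

spec = np.random.rand(512, 1024).astype(np.float32)  # H x W rendering
patches = extract_patches(spec)
print(patches.shape)  # (13, 512, 256): (1024 - 256) // 64 + 1 = 13 patches
```

Each patch then maps to the model's `1 x 512 x 256` input after adding a channel axis.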
This artifact is best understood as a reconstruction/compression experiment on synthetic speech spectrograms.
## What This Repo Is Not
This repo does not publish:
- a waveform model
- a text-to-speech model
- a speech-quality classifier
- a human-rated perceptual quality model
- a Transformers-format model package
The checkpoint reconstructs spectrogram images only.
## Files In This Repo

- `best_model.pt`: preserved best-by-validation-loss checkpoint
- `load_smoke_test.json`: local load-and-forward verification report
- `load_smoke_reconstruction.png`: smoke-test original vs. reconstruction
- `val_reconstruction_preview.png`: validation preview from training
- `source_run.log`: copied training log
- `preservation_metadata.json`: preserved provenance summary
## How To Load It

The checkpoint loads into `ConvAutoencoder` from:

`tools/train_patch_autoencoder_ssim/train_patch_autoencoder_ssim.py`
Minimal load pattern:

```python
import torch
from tools.train_patch_autoencoder_ssim.train_patch_autoencoder_ssim import ConvAutoencoder

checkpoint = torch.load("best_model.pt", map_location="cpu")
model = ConvAutoencoder()
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()
```
The repo includes a verified local smoke test, `load_smoke_test.json`, which confirms CPU load, state-dict compatibility, and a successful forward pass on a real `1 x 1 x 512 x 256` patch.
## Training Data And Provenance

This model was trained only on synthetic spectrograms from `TashaSkyUp/audio-quality-dataset-nfe4-30-step2`.
That dataset contains:

- `200` short English prompt sentences
- `2800` synthetic runs
- `14` NFE settings: `4, 6, 8, ..., 30`
- a fixed seed of `1024`
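The dataset counts are internally consistent, as a quick arithmetic check shows:

```python
sentences = 200
nfe_settings = list(range(4, 31, 2))  # 4, 6, 8, ..., 30

print(len(nfe_settings))              # 14 NFE settings
print(sentences * len(nfe_settings))  # 2800 synthetic runs
```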
Synthetic generation path:

- prompt text from the repo-local TSV `tmp/audio_quality_dataset/short_sentences_200.tsv`
- raw synthetic speech from `meituan-longcat/LongCat-AudioDiT-3.5B`
- one ClearVoice `MossFormer2_SR_48K` post-pass
- grayscale spectrogram rendering at `1024 x 512`
Important provenance notes:

- all training data is synthetic
- no human speech recordings are included in this published bundle
- no manual perceptual labels were used to train this autoencoder
- the dataset contains procedural weak labels, but this model did not use them
- repo revision for this export: `064a6bd4df88b3222459350d74341933dcfda075`
## Model And Training Summary

Model family:

- convolutional autoencoder
- encoder channels: `32, 64, 96, 128, 192, 256`
- 6 stride-2 convolutional downsampling stages
- 6 transposed-convolution decoder stages
- sigmoid output head
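A stand-in module with the listed channel progression illustrates the shape arithmetic; this is a sketch inferred from the summary above, not the repo's actual `ConvAutoencoder` (kernel sizes, padding, and activations are assumptions):

```python
import torch
import torch.nn as nn

class TinyPatchAE(nn.Module):
    """Sketch: 6 stride-2 conv stages down, 6 transposed-conv stages up, sigmoid head."""
    def __init__(self, channels=(32, 64, 96, 128, 192, 256)):
        super().__init__()
        enc, c_in = [], 1
        for c in channels:  # each stage halves H and W
            enc += [nn.Conv2d(c_in, c, 3, stride=2, padding=1), nn.ReLU()]
            c_in = c
        self.encoder = nn.Sequential(*enc)
        dec = []
        for c in reversed(channels[:-1]):  # mirror the encoder
            dec += [nn.ConvTranspose2d(c_in, c, 4, stride=2, padding=1), nn.ReLU()]
            c_in = c
        dec += [nn.ConvTranspose2d(c_in, 1, 4, stride=2, padding=1), nn.Sigmoid()]
        self.decoder = nn.Sequential(*dec)

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = TinyPatchAE().eval()
with torch.no_grad():
    z = model.encoder(torch.rand(1, 1, 512, 256))
    y = model(torch.rand(1, 1, 512, 256))
print(z.shape)  # torch.Size([1, 256, 8, 4]): 512 x 256 halved six times
print(y.shape)  # torch.Size([1, 1, 512, 256]): same shape as the input
```

Six stride-2 halvings take `512 x 256` down to an `8 x 4` bottleneck, and six stride-2 transposed convolutions bring it back.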
Published training configuration:

- device during the preserved run: remote `RTX 3090`
- batch size: `32`
- learning rate: `3e-4`
- scheduler: cosine annealing resume
- validation split: sentence-level, `0.2`
- patch width: quarter-width patches, `256`
- patch stride: `64`
- loss: `(1 - 0.85) * MSE + 0.85 * (1 - SSIM)`
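A numerical sketch of that weighting, using a simplified single-window SSIM over the whole patch in place of the windowed SSIM a training script would typically use (the constants `c1`, `c2` follow the common SSIM defaults; this is illustrative only):

```python
import numpy as np

def global_ssim(x: np.ndarray, y: np.ndarray, c1=0.01**2, c2=0.03**2) -> float:
    """Crude whole-patch SSIM: one window covering the entire image."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))

def combined_loss(x: np.ndarray, y: np.ndarray, alpha=0.85) -> float:
    """(1 - alpha) * MSE + alpha * (1 - SSIM), as in the published configuration."""
    return (1 - alpha) * float(np.mean((x - y) ** 2)) + alpha * (1 - global_ssim(x, y))

x = np.random.rand(512, 256)
print(combined_loss(x, x))  # 0.0: perfect reconstruction zeroes both terms
```

With `alpha = 0.85`, the SSIM term dominates, which matches the stated emphasis on structural similarity over raw pixel error.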
## Checkpoint Selection

The published `best_model.pt` is the best-by-`val_loss` checkpoint from the copied run, not the best-by-PSNR checkpoint.
- preserved checkpoint epoch: `2761`
- preserved checkpoint validation loss: `0.10661878556340605`
- best PSNR seen in the copied run log: `32.57 dB` at epoch `2722`
- latest epoch recorded in the copied run log: `2839`
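The selection rule (best by validation loss, regardless of PSNR) can be sketched as follows; the per-epoch records here use placeholder metric values except for the epochs named above:

```python
# Hypothetical run-log records; only the epoch numbers come from the actual log.
history = [
    {"epoch": 2722, "val_loss": 0.1071, "psnr_db": 32.57},  # best PSNR in the log
    {"epoch": 2761, "val_loss": 0.1066, "psnr_db": 32.40},  # best val_loss
    {"epoch": 2839, "val_loss": 0.1069, "psnr_db": 32.50},  # latest logged epoch
]

best = min(history, key=lambda r: r["val_loss"])
print(best["epoch"])  # 2761: the checkpoint preserved as best_model.pt
```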
## Intended Use
Reasonable uses:
- reconstruction experiments on synthetic speech spectrogram patches
- compression / latent-space experiments on the published synthetic dataset family
- reproducing the repo-local autoencoder workflow
## Limitations
This checkpoint should not be treated as:
- a validated real-world speech-quality model
- evidence about human perceptual quality on natural recordings
- a general benchmark result for audio reconstruction
Because the training data is synthetic, this model primarily characterizes the published LongCat + ClearVoice artifact pipeline, not general speech audio in the wild.
## Licensing And Attribution

This repo is marked `license: other` because it is a repo-local experiment export and does not assert a new standalone permissive license over the generated artifacts.
Synthetic generation in this workflow depended on:

- `LongCat-AudioDiT` from Meituan, specifically `meituan-longcat/LongCat-AudioDiT-3.5B`
- ClearVoice, specifically `MossFormer2_SR_48K`
This Hugging Face repo is not an official upstream release of either dependency. Check upstream terms before redistributing or reusing generated artifacts at scale.