How is the real image training data constructed?

#10
by scf - opened

Thanks for your great work! How are the 256x256 and 512x512 training image pairs constructed?

Owner

The process looks like this:

  • start with high resolution, high quality image (4k+ res)
  • downsample 0.5x w/ area filter to reduce noise, sensor artifacts, etc
  • random crop 512px - this is the HQ target
  • downsample 0.5x w/ area filter again - this is the LQ input
  • VAE encode LQ
  • degrade latent using either diffusion model or small convolutional proxy model trained to mimic the degradation
  • VAE decode, typical L1 + perceptual + GAN losses against HQ target
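The image-space part of the pipeline above (downsample, crop, downsample again) can be sketched roughly like this. This is my own reading of the steps, not the author's code; `make_pair` and its argument names are made up for illustration:

```python
import torch
import torch.nn.functional as F

def make_pair(hq_image: torch.Tensor, crop: int = 512):
    """Build one (HQ target, LQ input) pair from a high-res image.

    hq_image: float tensor of shape (1, C, H, W) -- the 4k+ source.
    Returns (512px HQ target, 256px LQ input).
    """
    # downsample 0.5x with the area filter to reduce noise, sensor artifacts
    clean = F.interpolate(hq_image, scale_factor=0.5, mode="area")

    # random crop 512px -- this is the HQ target
    _, _, h, w = clean.shape
    top = torch.randint(0, h - crop + 1, (1,)).item()
    left = torch.randint(0, w - crop + 1, (1,)).item()
    hq = clean[:, :, top:top + crop, left:left + crop]

    # downsample 0.5x with the area filter again -- this is the LQ input
    lq = F.interpolate(hq, scale_factor=0.5, mode="area")
    return hq, lq
```

The remaining steps (VAE encode, latent degradation, decode + losses) depend on the specific VAE and degradation model, so they're omitted here.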

Thanks for your reply. In the second step, the area filter reduces noise, which should produce a higher-quality sample. But in the fourth step the area filter generates the LQ input. Does this mean the two area filters are different?

No, they're the same. Area just averages all pixels in the sampled region equally; it's the only good option in PyTorch for downsampling with minimal aliasing. Bilinear is also okay-ish, but it tends to stairstep more on hard edges when used to downscale.
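A tiny sanity check of what "averages all pixels equally" means for `mode="area"`:

```python
import torch
import torch.nn.functional as F

# A 1x1x2x2 image: area downsampling by 0.5x averages each 2x2 block equally.
x = torch.tensor([[[[0.0, 1.0],
                    [2.0, 3.0]]]])
area = F.interpolate(x, scale_factor=0.5, mode="area")
# (0 + 1 + 2 + 3) / 4 = 1.5
print(area.item())
```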

LQ here isn't "low quality" in the RealESRGAN sense, where the image has blur/noise/JPEG/etc. degradation applied. It's just the low-resolution half of the pair. The only degradation happens in latent space, because that's what you actually get out of diffusion models and what you want the decoder to be less sensitive to.

What prompt did you use to generate the degraded latent with the diffusion model? And what noise level did you add, and how many denoising steps? The degradation quality seems very sensitive to these factors.

For images I sampled random uniform sigmas in [0, 0.12] during decoder training. The proxy degradation model was trained on a wider range of sigmas since I wasn't sure in advance what would be needed. Prompt was the generic negative for Wan, but empty would be fine too. Test on some real inputs and compare PCA of the clean and degraded latents, and the decoded outputs. You want it to be strong enough to lose some details and cause small scale artifacts but not distort structure too much.
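A heavily simplified sketch of the sigma sampling described above. This just adds Gaussian noise at a random sigma in [0, 0.12] per sample; the actual pipeline runs a diffusion model or the trained convolutional proxy at that noise level, which this stand-in does not capture:

```python
import torch

def degrade_latent(z: torch.Tensor, max_sigma: float = 0.12) -> torch.Tensor:
    """Noise a latent at a per-sample sigma drawn uniformly from [0, max_sigma].

    Hypothetical stand-in: the real degradation is produced by a diffusion
    model (or a proxy trained to mimic it), not by raw Gaussian noise.
    """
    # one sigma per batch element, broadcast over (C, H, W)
    sigma = torch.rand(z.shape[0], 1, 1, 1, device=z.device) * max_sigma
    return z + sigma * torch.randn_like(z)
```

As the reply suggests, whatever degradation you use is worth checking visually: compare PCA projections of clean vs. degraded latents and the decoded outputs.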

Will you open source the datasets you used to train the model? Thanks!
