How is the real image training data constructed?

#10
by scf - opened

Thanks for your great work! How are the 256x256 and 512x512 training image pairs constructed?

Owner

The process looks like this:

  • start with high resolution, high quality image (4k+ res)
  • downsample 0.5x w/ area filter to reduce noise, sensor artifacts, etc
  • random crop 512px - this is the HQ target
  • downsample 0.5x w/ area filter again - this is the LQ input
  • VAE encode LQ
  • degrade latent using either diffusion model or small convolutional proxy model trained to mimic the degradation
  • VAE decode, typical L1 + perceptual + GAN losses against HQ target
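The image-space part of the pipeline above (downsample, crop, downsample again) can be sketched roughly like this. This is my own reading of the steps, not the author's code; `make_pair` and its argument names are made up for illustration:

```python
import torch
import torch.nn.functional as F

def make_pair(hq_image: torch.Tensor, crop: int = 512):
    """Build one (HQ target, LQ input) pair from a high-res image.

    hq_image: float tensor of shape (1, C, H, W) -- the 4k+ source.
    Returns (512px HQ target, 256px LQ input).
    """
    # downsample 0.5x with the area filter to reduce noise, sensor artifacts
    clean = F.interpolate(hq_image, scale_factor=0.5, mode="area")

    # random crop 512px -- this is the HQ target
    _, _, h, w = clean.shape
    top = torch.randint(0, h - crop + 1, (1,)).item()
    left = torch.randint(0, w - crop + 1, (1,)).item()
    hq = clean[:, :, top:top + crop, left:left + crop]

    # downsample 0.5x with the area filter again -- this is the LQ input
    lq = F.interpolate(hq, scale_factor=0.5, mode="area")
    return hq, lq
```

The remaining steps (VAE encode, latent degradation, decode + losses) depend on the specific VAE and degradation model, so they're omitted here.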

Thanks for your reply. In the second step, the area filter reduces noise, which should produce a higher-quality sample. But in the fourth step the area filter generates the LQ input. Does this mean the two area filters are different?

No, they're the same. Area just averages all pixels in the sampled region equally; it's the only good option in PyTorch for downsampling with minimal aliasing. Bilinear is also okay-ish, but it tends to stairstep more on hard edges when used to downscale.
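A tiny sanity check of what "averages all pixels equally" means for `mode="area"`:

```python
import torch
import torch.nn.functional as F

# A 1x1x2x2 image: area downsampling by 0.5x averages each 2x2 block equally.
x = torch.tensor([[[[0.0, 1.0],
                    [2.0, 3.0]]]])
area = F.interpolate(x, scale_factor=0.5, mode="area")
# (0 + 1 + 2 + 3) / 4 = 1.5
print(area.item())
```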

LQ here isn't "low quality" in the RealESRGAN sense, where the image has blur/noise/JPEG/etc. degradation applied. It's just the low-resolution half of the pair. The only degradation happens in latent space, because that's what you actually get out of diffusion models and what you want the decoder to be less sensitive to.

What prompt did you use to generate the degraded latent with the diffusion model? And what noise level did you add, and how many denoising steps? The degradation quality seems very sensitive to these factors.

For images I sampled random uniform sigmas in [0, 0.12] during decoder training. The proxy degradation model was trained on a wider range of sigmas since I wasn't sure in advance what would be needed. Prompt was the generic negative for Wan, but empty would be fine too. Test on some real inputs and compare PCA of the clean and degraded latents, and the decoded outputs. You want it to be strong enough to lose some details and cause small scale artifacts but not distort structure too much.
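A heavily simplified sketch of the sigma sampling described above. This just adds Gaussian noise at a random sigma in [0, 0.12] per sample; the actual pipeline runs a diffusion model or the trained convolutional proxy at that noise level, which this stand-in does not capture:

```python
import torch

def degrade_latent(z: torch.Tensor, max_sigma: float = 0.12) -> torch.Tensor:
    """Noise a latent at a per-sample sigma drawn uniformly from [0, max_sigma].

    Hypothetical stand-in: the real degradation is produced by a diffusion
    model (or a proxy trained to mimic it), not by raw Gaussian noise.
    """
    # one sigma per batch element, broadcast over (C, H, W)
    sigma = torch.rand(z.shape[0], 1, 1, 1, device=z.device) * max_sigma
    return z + sigma * torch.randn_like(z)
```

As the reply suggests, whatever degradation you use is worth checking visually: compare PCA projections of clean vs. degraded latents and the decoded outputs.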

Will you open source the datasets you used to train the model? Thanks!
