Fashionable Spatial Encoder
Please note: This is only compatible with Cozyberry
Second note: the trained model has not yet drifted significantly from the original.
This is a research repo for a diffusion model.
There are two straightforward ways to condition the diffusion model:
- by improving the text encoder, in this case a BERT model
- by modifying the cross-attention weights
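The cross-attention route above can be sketched roughly as follows. This is a minimal NumPy illustration, not the repo's actual implementation: all names and shapes are hypothetical, and the text tokens stand in for the hidden states a BERT encoder would produce.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(img_tokens, txt_tokens, W_q, W_k, W_v):
    # Queries come from the image features; keys and values come from
    # the text encoder, so each image patch attends over prompt tokens.
    q = img_tokens @ W_q                              # (n_img, d)
    k = txt_tokens @ W_k                              # (n_txt, d)
    v = txt_tokens @ W_v                              # (n_txt, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (n_img, n_txt)
    return attn @ v                                   # (n_img, d)

rng = np.random.default_rng(0)
d = 8
img = rng.normal(size=(16, d))   # 16 image patches
txt = rng.normal(size=(4, d))    # 4 text tokens (e.g. BERT outputs)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out = cross_attention(img, txt, W_q, W_k, W_v)
print(out.shape)  # (16, 8)
```

Modifying the cross-attention weights means training (or fine-tuning) the `W_q`, `W_k`, `W_v` projections; improving the text encoder means changing how `txt` itself is produced.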
During inference, we know exactly which part of the prompt applies to which part of the image.
The fundamental issue is that millions of samples pass through the diffusion model during training, yet only the mean loss of text-driven image generation is computed. Meanwhile, both the text and the image contain a variety of colours, shapes, and other objects, and the positions of these features in the encoded data are simply discarded.
This is what changes here: failing to follow the exact spatial description in the prompt incurs an additional penalty.
Source data
- synthetic booru fashion
- horizontal scenes
