Image Generator modifying Images using AutoEncoder

by MartialTerran - opened Oct 10, 2025

Oct 10, 2025

Hi. I am https://huggingface.co/MartialTerran

I am impressed by your latent vector manipulations.
https://colab.research.google.com/drive/1CF5Lr1bxoAFC_IPX5I0azu4X8UDz_zp-#scrollTo=buKgoKwXFSyv

You also inspired this guy: https://www.youtube.com/watch?v=fWiieyG2zes

I just saw at https://x.com/ronenklo/status/1976248980687421604 this related work in the 2D image generator field.
https://ronen94.github.io/SAEdit/

AutoEncoders normally have narrow bottlenecks; Sparse AutoEncoders (SAE) have large ones!
SAEdit leverages that to learn a high-dimensional, disentangled sparse space, where linear moves on text embeddings translate into fine-grained visual edits. https://ronen94.github.io/SAEdit/

"We use pairs of text prompts that differ only by the intended edit.
By analyzing how these prompts shift in the sparse space, we identify the dimensions that capture the desired change."
[Same/Similar to your subtraction method?]
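To make the comparison with the subtraction method concrete, here is a rough sketch of how a direction could be pulled out of prompt pairs. It assumes the sparse codes for each prompt are already available as plain lists of floats, and it assumes the "dimensions that capture the change" are simply the ones with the largest average shift. The function name `edit_direction` and the `top_k` cutoff are my own placeholders, not something taken from the SAEdit paper, whose actual selection rule may differ.

```python
def edit_direction(pairs, top_k=2):
    """Average the sparse-code shift over prompt pairs, keep the top_k dimensions.

    pairs: list of (z_source, z_target) sparse codes, each a list of floats.
    Returns a list of the same length that is zero outside the kept dimensions.
    NOTE: top-k by absolute mean shift is an assumed selection rule.
    """
    dim = len(pairs[0][0])
    mean_shift = [0.0] * dim
    for z_src, z_tgt in pairs:
        for i in range(dim):
            # The subtraction step: target code minus source code.
            mean_shift[i] += (z_tgt[i] - z_src[i]) / len(pairs)

    # Keep only the dimensions that moved the most on average.
    ranked = sorted(range(dim), key=lambda i: abs(mean_shift[i]), reverse=True)
    keep = set(ranked[:top_k])
    return [mean_shift[i] if i in keep else 0.0 for i in range(dim)]


# Toy example: two prompt pairs whose codes differ mostly in dimension 2.
toy_pairs = [
    ([0.0, 1.0, 0.0, 0.0], [0.0, 1.0, 2.0, 0.1]),
    ([1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 2.0, 0.1]),
]
direction = edit_direction(toy_pairs, top_k=1)
```

With `top_k=1` the small shift in the last dimension is dropped, so only the dominant dimension survives in `direction`.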

"When generating the image, we add the learned direction in sparse space to the target token’s embedding, scaled by an intensity factor controlling the edit’s strength."
[Same/Similar to your factor/interpolation method?]
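A minimal sketch of what that quoted step might look like, under heavy assumptions: the SAE is reduced to a toy ReLU encoder and a linear decoder with hand-picked weights, embeddings are plain Python lists, and the names `sae_encode`, `sae_decode`, and `edit_token` are hypothetical. The real method works on the text encoder's token embeddings with a trained SAE, so treat this only as an illustration of "encode, add scaled direction, decode".

```python
def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sae_encode(x, W_enc):
    """Toy SAE encoder: linear map followed by ReLU (biases omitted)."""
    return [max(0.0, a) for a in matvec(W_enc, x)]


def sae_decode(z, W_dec):
    """Toy SAE decoder: linear map back to embedding space."""
    return matvec(W_dec, z)


def edit_token(x, direction, strength, W_enc, W_dec):
    """Shift one token embedding along a sparse-space direction.

    strength is the intensity factor: 0.0 leaves the code unchanged,
    larger values push further along the direction.
    """
    z = sae_encode(x, W_enc)
    z_edit = [zi + strength * di for zi, di in zip(z, direction)]
    return sae_decode(z_edit, W_dec)


# Hand-picked toy weights (2-dim embedding, 4-dim sparse code) chosen so
# that decode(encode(x)) == x; a trained SAE only approximates this.
W_ENC = [[1, 0], [0, 1], [-1, 0], [0, -1]]
W_DEC = [[1, 0, -1, 0], [0, 1, 0, -1]]

token = [1.0, -2.0]
direction = [0, 1, 0, 0]  # assumed edit direction in sparse space

unchanged = edit_token(token, direction, 0.0, W_ENC, W_DEC)
edited = edit_token(token, direction, 0.5, W_ENC, W_DEC)
```

Here `strength` plays the same role as an interpolation factor: sweeping it from 0 upward moves the decoded embedding smoothly along one axis while the other component stays put.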

Maybe you can help Ronen build an image morphing engine:
20+ years ago, a famous video morphed men into women and morphed between different people. Can you produce a morphing video using a sequence of scaled shift-dimensions? E.g., gradually convert a kid into the adult version of the same person (age progression)? You could sell that to the MissingKids org.
https://x.com/ronenklo/status/1976248980687421604
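As a rough idea of what "a sequence of scaled shifts" could mean for a morphing video, the sketch below builds an evenly spaced list of edit strengths, one per frame. The names `strength_schedule` and `render_frame` are hypothetical; `render_frame` stands in for whatever actually generates an image at a given strength, which is not shown here. A linear ramp is also just an assumption, since a real age-progression morph might need a non-linear schedule.

```python
def strength_schedule(n_frames, max_strength):
    """Evenly spaced edit strengths from 0.0 up to max_strength, inclusive."""
    if n_frames < 2:
        return [0.0]
    return [max_strength * i / (n_frames - 1) for i in range(n_frames)]


def morph_frames(render_frame, n_frames, max_strength):
    """Call a user-supplied renderer once per strength to get the frame list."""
    return [render_frame(s) for s in strength_schedule(n_frames, max_strength)]


# Stand-in renderer that just labels each frame; a real one would run
# the image generator with the edit applied at the given strength.
frames = morph_frames(lambda s: f"frame@{s:.2f}", n_frames=5, max_strength=2.0)
```

Keeping the same random seed and prompt across frames would presumably matter for a smooth morph, but that depends on the generator being used.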

MartialTerran

Oct 10, 2025

More details in the paper:
SAEdit: Token-level control for continuous image editing via Sparse AutoEncoder
https://arxiv.org/abs/2510.05081

Large-scale text-to-image diffusion models have become the backbone of modern image editing, yet text prompts alone do not offer adequate control over the editing process. Two properties are especially desirable: disentanglement, where changing one attribute does not unintentionally alter others, and continuous control, where the strength of an edit can be smoothly adjusted. We introduce a method for disentangled and continuous editing through token-level manipulation of text embeddings. The edits are applied by manipulating the embeddings along carefully chosen directions, which control the strength of the target attribute. To identify such directions, we employ a Sparse Autoencoder (SAE), whose sparse latent space exposes semantically isolated dimensions. Our method operates directly on text embeddings without modifying the diffusion process, making it model agnostic and broadly applicable to various image synthesis backbones. Experiments show that it enables intuitive and efficient manipulations with continuous control across diverse attributes and domains.

See also: How might LLMs store facts | Deep Learning Chapter 7
"Unpacking the multilayer perceptrons in a transformer, and how they may store facts"
https://www.youtube.com/watch?v=9-Jl0dxWQs8

This guy https://www.youtube.com/watch?v=fWiieyG2zes built a 9-array deep super-MLP. I reconstructed it and wrote a bakeoff script in Colab. The super-MLP ended training with a loss that, while still small, was about 10x larger than that of a regular MLP with a slightly larger parameter count. I have not yet figured out whether that higher final loss is good or bad. Sometimes a higher training loss implies greater generalization?
