Would you consider doing it for a VLM? Yes, there would be no CLIP use, but you could discuss images with it before and after generation.

#8
by Andyx1976 - opened

For one, it would give the option of showing it an image similar to what you want, to make your intent clearer, or of reverse-engineering a prompt for it using its knowledge of Z-Image (and Flux Klein) prompting. I also actually found it quite interesting to have a (standard) VLM like Qwen3 VL 8B write the prompt using a prompting guide, and then judge it afterwards by discussing the resulting image.
Although in my tests, good old Gemma 3 (27B) had the most constructive suggestions for improving the prompt based on the returned image, more so than even Qwen3 VL 32B (not that I'm suggesting training those, though, because they're too big).
Ideally it would be based on an abliterated model so it doesn't just refuse any ... interesting image.

Since you can simply install and load it in LM Studio, using it would not be complicated. Of course it is not compatible as a CLIP/text encoder for ComfyUI (at least for Z-Image), but it may still be worth it purely as a prompt engineer (which is what it's called).
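For what it's worth, the pre/post discussion loop described above boils down to sending the generated image plus the original prompt to the locally served VLM. A minimal sketch, assuming LM Studio's OpenAI-compatible local server; the function name, system prompt, and model id here are illustrative placeholders, not part of any shipped tool:

```python
import base64
import json
import urllib.request

# Hypothetical helper: build an OpenAI-style chat request asking a local
# VLM (e.g. Qwen3 VL 8B loaded in LM Studio) to judge an image against
# the prompt that produced it and suggest improvements.
def build_critique_request(image_bytes: bytes, original_prompt: str,
                           model: str = "qwen3-vl-8b") -> dict:
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "You are a prompt engineer for a text-to-image "
                        "model. Judge the image against the prompt and "
                        "suggest concrete prompt improvements."},
            {"role": "user",
             "content": [
                 {"type": "text",
                  "text": f"Original prompt:\n{original_prompt}"},
                 # Image is embedded inline as a base64 data URL.
                 {"type": "image_url",
                  "image_url": {"url": f"data:image/png;base64,{b64}"}},
             ]},
        ],
    }

def send_to_lmstudio(req: dict, url: str = "http://localhost:1234/v1/chat/completions") -> str:
    # POST the request to the local server (LM Studio's default port is 1234).
    data = json.dumps(req).encode("utf-8")
    http_req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(http_req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Iterating is then just feeding the VLM's suggested revision back into the image model and repeating with the new image.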

Gemma makes a great judge model for writing/prose, true. But did the images actually turn out better? I have found Gemma models lacking when it comes to producing prompts that result in superior images in ZIT.

My dataset could be applied to any Qwen3 model, but it lacks iterative-refinement chat training; previous versions of Z-Image Engineer have relied on the strong instruction following of the base model to chat with the user, suggest improvements, etc. (plus a solid system prompt). So, to do a proper VL finetune I would need a totally different type of dataset(s) and an entirely new training pipeline... Not looking to do that right now, sorry!

BennyDaBall changed discussion status to closed