Qwen3.5-4B-Base-ZitGen-V1

An image captioning fine-tune of Qwen 3.5 4B optimized for Z-Image Turbo image generation (image-to-prompt).

What makes this unique: The dataset (images + prompts) was generated entirely by LLMs tasked with regenerating a target image.

How The Data Is Generated

The process is as follows:

  1. The target image and the last generated image (blank if it's the first step) are provided to an LLM with a comparison prompt.
  2. The LLM outputs a detailed description of each image and the key differences between them.
  3. The comparison results and the last generated prompt (empty if it's the first step) are provided to an LLM with an SD generation prompt.
  4. The output prompt is sent to the ComfyUI API using Z-Image Turbo, and the output image is captured.
  5. Repeat N times.

The system employed between 4 and 6 rounds of comparison and correction to generate each prompt-image pair. In theory, this process adapts the prompt to minimize the difference between the target image and the generated image, thereby tailoring the prompt specifically to the image model.

The prompts were then ranked and filtered to remove occasional LLM errors, such as residuals from the original prompt or undesirable artifacts (e.g., watermarks). Finally, the prompts and images were formatted into the ShareGPT dataset format and used to train Qwen 3.5 4B.

Usage

Load the model from HuggingFace, launch inference server and provide an image to generate an optimized Zit prompt.

e.g. ./llama-server --host 0.0.0.0 -m ./Qwen3.5-4B-Base-ZitGen-V1-BF16.gguf --mmproj ./Qwen3.5-4B-mmproj-BF16.gguf --chat-template-kwargs '{"enable_thinking":false}'

Example

image

A massive, multi-legged steampunk creature resembling a mechanical spider or crab dominates the foreground, constructed entirely from polished brass, copper, and steel components with highly reflective metallic surfaces. Its body features exposed intricate gears, cylindrical segments, and rivets, with two large lens-like structures serving as eyes that emit a soft amber glow. Six pairs of jointed metallic legs extend outward from the central body, ending in sleek, smooth, bullet-shaped golden tips that contrast with the overall industrial aesthetic. The creature possesses multiple small glowing orange lights embedded along its chassis, creating a warm bioluminescent effect against the cool surroundings. A human figure wearing a dark grey overcoat, white shirt, black bow tie, and a tall black top hat sits atop the machine in the upper right quadrant, holding a steaming cup and reading a newspaper, appearing small and proportionate to the massive scale of the creature. The ground is covered in dense, thick volumetric steam that completely obscures the cobblestone street surface, hiding all footprints and debris, with a slight elevation increase from left to right. Lighting is dramatic and high-contrast, dominated by a large bright full moon positioned in the center-top of the frame casting cool blue shadows, supplemented by warm artificial light sources from the mechanical creature and a single visible street lamp on the far left emitting a gentle golden glow. The color palette features deep cool blues and greys in the shadows and sky, contrasted sharply against warm amber highlights from the machine and street lamp, with muted orange tones in the steam. Textures include smooth wet metallic surfaces on the machine, dense swirling vapor, and rough dry cobblestones partially visible through the haze. Background buildings are tall stone structures with arched windows and distinct turrets, rendered with heavy atmospheric haze to create depth and perspective, fading into darkness in the distance. The composition uses a low-angle perspective emphasizing the enormous size of the mechanical creature, centered in the frame with symmetrical leg arrangement, and a shallow depth of field focusing sharply on the foreground. Shot with a 35mm lens at f/2.8 for shallow depth of field, capturing high dynamic range with rich color saturation and fine grain structure reminiscent of Kodak Vision3 500T film stock. The scene exudes a moody, cinematic atmosphere with tangible metallic textures and steamy warmth, inspired by steampunk aesthetics and H.G. Wells' mechanical inventions, featuring no smokestacks, no flags, no litter or debris on the ground, and no wet reflective stone surfaces.

Downloads last month
3,576
Safetensors
Model size
4B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for lolzinventor/Qwen3.5-4B-Base-ZitGen-V1

Quantized
(12)
this model
Quantizations
1 model