Spaces: techfreakworm/z-image-studio

techfreakworm committed 7 days ago

Commit 2035fc8 (unverified) · 1 parent: 2e18e13

fix(controlnet): crop control image to multiple-of-16 dims

Same modulus-of-16 bug pattern as the upscale fix in 9514256, hit on the
ControlNet path instead. With an input image whose dims aren't already
multiples of 16, DiffSynth's VAE encode rounds the latent shape *down*
while the noise allocator for inpaint_mask rounds *up*, so the
torch.concat([control_latents, inpaint_mask, inpaint_latent], dim=1) in
the pipeline raises a shape mismatch (e.g. 45 vs 46 on dim 3).
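
The mismatch can be reproduced with plain integer arithmetic. The sketch below assumes a spatial downsample factor of 8 between pixels and latents; that factor and both function names are illustrative assumptions and are not taken from DiffSynth's code.

```python
# Illustrative only: LATENT_FACTOR = 8 is an assumption, not read from DiffSynth.
LATENT_FACTOR = 8


def latent_size_floor(pixels: int, factor: int = LATENT_FACTOR) -> int:
    """Latent length when the size computation rounds down."""
    return pixels // factor


def latent_size_ceil(pixels: int, factor: int = LATENT_FACTOR) -> int:
    """Latent length when the size computation rounds up."""
    return -(-pixels // factor)  # ceiling division without floats


height = 367  # height of the 618x367 test image
print(latent_size_floor(height), latent_size_ceil(height))  # 45 46

aligned = (height // 16) * 16  # 352
print(latent_size_floor(aligned), latent_size_ceil(aligned))  # 44 44
```

Once the pixel size is a multiple of 16, both rounding directions agree, so the concatenated tensors should line up.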

Crop the control image (or fallback raw input) to (W//16*16, H//16*16)
before handing it to the pipe, mirroring the upscale fix.
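
As a minimal sketch of the alignment step, the helper below computes a top-left-anchored crop box for a given size. `aligned_crop_box` is a hypothetical name used for illustration; the commit itself inlines the same arithmetic in `call_controlnet`.

```python
def aligned_crop_box(width: int, height: int, multiple: int = 16) -> tuple[int, int, int, int]:
    """Return a (left, upper, right, lower) box whose size is rounded down
    to the nearest multiple of `multiple`, anchored at the top-left corner."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    aligned_w = (width // multiple) * multiple
    aligned_h = (height // multiple) * multiple
    return (0, 0, aligned_w, aligned_h)


# The 618x367 test image loses 10 columns and 15 rows.
print(aligned_crop_box(618, 367))  # (0, 0, 608, 352)

# Already-aligned sizes are returned unchanged.
print(aligned_crop_box(1024, 768))  # (0, 0, 1024, 768)
```

The box uses Pillow's (left, upper, right, lower) ordering, so it can be passed directly to `Image.crop`. An image narrower or shorter than 16 pixels would produce a zero-sized box, which a caller would need to handle separately.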

Verified live with cn-input.webp (618x367) + Toon5_E10 LoRA on Z-Image
Turbo locally.

Files changed (1)

modes.py +8 -0

@@ -111,6 +111,14 @@ def call_controlnet(pipe: Any, params: dict[str, Any]) -> tuple[Image.Image, dic
         )
         control_image = input_image
+    # Same modulus-of-16 dance as call_upscale: DiffSynth's VAE encode rounds *down*
+    # for control_latents while the noise allocator rounds *up* for inpaint_mask, so
+    # an unaligned image makes torch.concat on control_context raise.
+    w, h = control_image.size
+    aligned_w, aligned_h = (w // 16) * 16, (h // 16) * 16
+    if (aligned_w, aligned_h) != (w, h):
+        control_image = control_image.crop((0, 0, aligned_w, aligned_h))
     _swap_transformer(pipe, "Turbo")
     cn_input = ControlNetInput(image=control_image, scale=float(params.get("controlnet_scale", 1.0)))