Commit 2035fc8 by techfreakworm · unverified · 1 Parent(s): 2e18e13

fix(controlnet): crop control image to multiple-of-16 dims

Same modulus-of-16 bug pattern as the upscale fix in 9514256, hit on the
ControlNet path instead. With an input image whose dims aren't already
multiples of 16, DiffSynth's VAE encode rounds the latent shape *down*
while the noise allocator for inpaint_mask rounds *up*, so the
torch.concat([control_latents, inpaint_mask, inpaint_latent], dim=1) in
the pipeline raises a shape mismatch (e.g. 45 vs 46 on dim 3).
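
The floor-versus-ceil disagreement can be reproduced with plain arithmetic. The sketch below is illustrative only: `latent_extent` is a hypothetical helper, and the downsampling factor of 8 is an assumption chosen because it reproduces the 45 vs 46 figures for a 367-pixel side. It is not taken from DiffSynth's code.

```python
import math


def latent_extent(pixels: int, factor: int, round_up: bool) -> int:
    """Latent length along one axis for a given downsampling factor.

    Hypothetical helper: `factor` and the rounding mode stand in for
    whatever the encoder and the mask allocator actually do.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    return math.ceil(pixels / factor) if round_up else pixels // factor


def extents_agree(pixels: int, factor: int) -> bool:
    """True when rounding down and rounding up give the same latent length."""
    return latent_extent(pixels, factor, False) == latent_extent(pixels, factor, True)


# Assumed factor of 8: an unaligned side of 367 px yields two different lengths.
down = latent_extent(367, 8, round_up=False)  # 45
up = latent_extent(367, 8, round_up=True)     # 46

# After cropping to a multiple of 16 (352 px), both roundings agree.
aligned_ok = extents_agree(352, 8)            # True
```

Any side length that is an exact multiple of the factor makes the two roundings coincide, which is why aligning the input removes the mismatch.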

Crop the control image (or fallback raw input) to (W//16*16, H//16*16)
before handing it to the pipe, mirroring the upscale fix.
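
A minimal sketch of the crop-box computation, separated from the image object so it can be checked on its own. `aligned_crop_box` is a hypothetical name, not a function in modes.py. The guard against images smaller than the multiple is an addition that the commit itself does not include.

```python
def aligned_crop_box(width: int, height: int, multiple: int = 16) -> tuple[int, int, int, int]:
    """Return a (left, upper, right, lower) box anchored at the top-left corner
    whose width and height are rounded down to the nearest `multiple`.

    Hypothetical helper for illustration; the tuple has the shape expected by
    PIL's Image.crop.
    """
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    aligned_w = (width // multiple) * multiple
    aligned_h = (height // multiple) * multiple
    if aligned_w == 0 or aligned_h == 0:
        # Not handled in the commit: a side shorter than `multiple` would
        # otherwise produce an empty crop.
        raise ValueError("image is smaller than the alignment multiple")
    return (0, 0, aligned_w, aligned_h)


# The 618x367 test image from the commit message loses 10 px of width
# and 15 px of height from the right and bottom edges.
box = aligned_crop_box(618, 367)  # (0, 0, 608, 352)
```

With Pillow the result would be applied as `control_image.crop(box)`. Cropping from the top-left keeps the image origin fixed, at the cost of dropping up to 15 pixels from the right and bottom edges.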

Verified live with cn-input.webp (618x367) + Toon5_E10 LoRA on Z-Image
Turbo locally.

Files changed (1)
  1. modes.py +8 -0
modes.py CHANGED
@@ -111,6 +111,14 @@ def call_controlnet(pipe: Any, params: dict[str, Any]) -> tuple[Image.Image, dic
     )
     control_image = input_image
 
+    # Same modulus-of-16 dance as call_upscale: DiffSynth's VAE encode rounds *down*
+    # for control_latents while the noise allocator rounds *up* for inpaint_mask, so
+    # an unaligned image makes torch.concat on control_context raise.
+    w, h = control_image.size
+    aligned_w, aligned_h = (w // 16) * 16, (h // 16) * 16
+    if (aligned_w, aligned_h) != (w, h):
+        control_image = control_image.crop((0, 0, aligned_w, aligned_h))
+
     _swap_transformer(pipe, "Turbo")
 
     cn_input = ControlNetInput(image=control_image, scale=float(params.get("controlnet_scale", 1.0)))