I tried to make a workflow with motion control and input audio and I am having problems with lipsyncing.

#66
by gpundt - opened

Hi, I have been working on a ComfyUI workflow that combines IC-LoRA motion control with audio input for lipsync using LTX 2.3, and I cannot get both working at the same time.

What I am trying to do:
I want to input a reference image, a motion control video, and an audio file, and have LTX 2.3 generate a video where the character follows the motion from the control video while also lipsyncing to the input audio.

What works individually:

  • My plain image + audio to video workflow produces perfect lipsync with no issues
  • My motion control workflow with IC-LoRA produces good body movement following the reference video
  • But when I combine both, lipsync completely disappears

What I have tried:

  • Started with DensePosePreprocessor: no lipsync; tried resolutions 512 and 256
  • Switched to DWPose Estimator with detect_face disabled so no facial keypoints compete with the audio conditioning: still no lipsync
  • Tried DepthAnythingV2 with a blur node to soften the depth map: still no lipsync
  • Adjusted IC-LoRA strength_model between 0.3 and 1.0 and guide strength between 0.2 and 1.0 in various combinations
  • Made sure the audio gender matches the reference image gender
  • Removed all speech references from the text prompt so only the audio latent drives mouth movement
  • Confirmed the audio pipeline is correctly wired: LoadAudio → LTXVAudioVAEEncode → SetLatentNoiseMask → LTXVConcatAVLatent
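
For reference, the shape logic of that chain can be sketched in plain numpy. Everything here is illustrative: the channel counts, the spatial broadcast of the audio latent, and the channel-append behavior of LTXVConcatAVLatent are assumptions, not the actual LTX internals.

```python
import numpy as np

# Illustrative stand-ins for the latents flowing through the chain.
# Real LTX channel counts / layouts differ; this only shows the wiring idea.
T, C_vid, H, W = 24, 16, 32, 32   # video latent: frames x channels x H x W
C_aud = 8                          # audio latent channels (assumed)

video_latent = np.random.randn(T, C_vid, H, W)
# LTXVAudioVAEEncode output, assumed broadcast to spatial dims for this sketch
audio_latent = np.random.randn(T, C_aud, H, W)

# SetLatentNoiseMask: 1 = region the sampler may rewrite, 0 = keep as-is
noise_mask = np.ones((T, 1, H, W))

# LTXVConcatAVLatent: joint AV latent, audio channels appended after video
# channels (assumption)
av_latent = np.concatenate([video_latent, audio_latent], axis=1)
print(av_latent.shape)  # (24, 24, 32, 32)
```

The point of the sketch is just that the audio conditioning lives in extra channels riding alongside the video latent, so anything downstream that rebuilds the latent video-sized would silently discard it.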

My current theory:
The IC-LoRA conditioning and audio conditioning may be competing for the same pathways, with IC-LoRA winning every time regardless of strength settings. I am wondering if the audio latent needs to pass through the LTXAddVideoICLoRAGuide node rather than being concatenated separately after the fact.
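
One way to picture that theory: if the guide node rebuilds the latent from the video/control side only, then the order of operations decides whether the audio channels survive. This is a toy numpy sketch of the hypothesis, not the actual behavior of LTXAddVideoICLoRAGuide; the function names and channel layout are invented for illustration.

```python
import numpy as np

T, C_vid, C_aud, H, W = 8, 16, 8, 4, 4
video = np.random.randn(T, C_vid, H, W)
audio = np.random.randn(T, C_aud, H, W)
control = np.random.randn(T, C_vid, H, W)

def concat_av(vid, aud):
    # stand-in for LTXVConcatAVLatent: audio channels appended last (assumption)
    return np.concatenate([vid, aud], axis=1)

def apply_motion_guide(latent, ctrl):
    # toy guide: rebuilds the latent from the control signal, video-sized only.
    # If a real guide node did anything like this, appended audio channels
    # would be dropped regardless of strength settings.
    return ctrl.copy()

# Order 1: concat audio first, then apply the guide -> audio channels gone
order1 = apply_motion_guide(concat_av(video, audio), control)
print(order1.shape[1])  # 16 -> audio channels lost

# Order 2: apply the guide first, then concat audio -> audio channels survive
order2 = concat_av(apply_motion_guide(video, control), audio)
print(order2.shape[1])  # 24 -> audio channels intact
```

If something like this is happening, it would match the symptom of IC-LoRA "winning every time" at any strength, and would explain why routing the audio latent after the guide node could help.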

My setup:

  • RTX 3080 10GB
  • LTX 2.3 22b dev Q5_K_M GGUF
  • ltx-2.3-22b-ic-lora-union-control-ref0.5.safetensors
  • ltx-2.3-22b-distilled-lora-384.safetensors
  • gemma_3_12B_it_fp4_mixed as text encoder
  • Clean Qwen3 TTS audio input, no background noise

Has anyone successfully combined IC-LoRA motion control with audio lipsync in a single workflow? Is there a specific node wiring that makes both work together? Any help would be hugely appreciated.

I haven't tried both motion control and lip-sync together; it should work.. at least in theory ;-) I'll try and see if I can make it work

I can send you my workflow somehow if you want to look at it; just let me know.

Sure, if you want to upload it somewhere.. for example pastebin.com .. or anywhere really ;-)

I just uploaded the workflow I was using to my Hugging Face if you want to look at it. The reference image is not the one I was actually using because the original was NSFW lol. Hopefully you can see the structure I was going for; hope it helps.

Downloaded it. Will take a look asap ;-)

I've been trying the same thing. It seems like some nodes strip the lip-sync match (conditioning/guide) from the latent. Turn them off and it syncs fine in my workflow, but turn them on and it's strictly voice-over.

Oh, I totally forgot about this challenge.
Is it the motion control that strips the lip-sync?
I didn't get around to trying yet, actually forgot about it.. but will give it a try

Yeah, the lip syncing is very fragile with LTX 2.3. Recently I've been trying to build an extend-any-video workflow with input audio encoding, like I was trying to do with the motion control, and it is very hard to get the extended video to lip sync to the input audio. With this extend-any-video workflow, if you encode input frames from the reference video where the subject's mouth isn't moving, it will not lip sync at all; but with video extension you kind of need to feed it a second or two of frames from the end of your reference video to make the extension smooth. I basically want the subject to go from not talking in the reference video to talking and lip syncing the input audio in the extension, but the only way I could do this is with an EmptyLTXVLatentVideo node feeding into the LTXVAddLatentGuide node's latent input. Doing that, though, there's a hard cut in the middle of the video right where the extension starts.
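
One generic trick that sometimes softens that kind of hard cut (this is a plain latent-blending idea, not a specific LTX node): linearly cross-fade the last few reference frames into the first few extension frames before decoding. A toy numpy sketch, with all shapes and the overlap length chosen arbitrarily:

```python
import numpy as np

T_ref, T_ext, C, H, W = 16, 48, 16, 8, 8
ref_latent = np.random.randn(T_ref, C, H, W)   # tail of the reference video's latent
ext_latent = np.random.randn(T_ext, C, H, W)   # start of the generated extension

overlap = 4  # frames to blend across the seam (tunable)

# Linear ramp 0 -> 1 over the overlap window, broadcast over C, H, W
w = np.linspace(0.0, 1.0, overlap).reshape(overlap, 1, 1, 1)

# Cross-fade: starts as pure reference, ends as pure extension
blended = (1.0 - w) * ref_latent[-overlap:] + w * ext_latent[:overlap]

# Final sequence: reference body, blended seam, rest of the extension
out = np.concatenate([ref_latent[:-overlap], blended, ext_latent[overlap:]], axis=0)
print(out.shape[0])  # 12 + 4 + 44 = 60
```

Whether this is feasible inside ComfyUI depends on what latent-math nodes you have available, and blending may smear mouth motion across the seam, so it's a trade-off rather than a fix.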

@Gpundt - That is an interesting idea! I will have to play around with splicing it in after starting with an EMPTY LTXVLatentVideo. Thanks.

Yeah, I'm trying to mess with an FFLF + ControlNet (to guide the in-between) + injected audio workflow, but ultimately the guide from LTXVAddGuide/AddGuideMulti seems to fight the audio lip sync.
