Workflow : I2V & T2V with Custom Audio
I2V & T2V - With Custom Audio
https://huggingface.co/RuneXX/LTX-2.3-Workflows/tree/main
Use your own audio files with lip sync, and synced motion
(was quite surprised how well the drummer syncs to the drum solo input .. ;-))
THAT DRUMMER SYNCED TO CUSTOM AUDIO?! No way.. I thought that was impossible! Now I know what I'm doing today.
Sure is ;-) from a drum solo audio file
Here is the audio source : https://www.youtube.com/watch?v=kFjrZzbMhp8
Just grabbed that audio to use as example ;-)
Mel Band has an instrument-isolating node that you can plug in after splitting vocals. Going to test that now.
I could never get this image-plus-audio-to-video working with LTX 2.0 or WAN 2.1. But it finally worked.
That looks awesome ;-) lip sync, but also camera and body language. It even has lens flare from the lights behind.. LTX-2 surprises me all the time ;-)
Wan has plenty of good sides as well, lots of fun experiments with that.
But LTX-2 is for sure a bit of jack-of-all-trades magic in a box ;-) hehe
Not bad!
Yeah, it's quite surprising that it understands music ;-)
Already made it better, this is nuts! I can't believe it's possible.
I wonder if you could prompt it to actually use the hi-hat (sounds like open hi-hat/closed hi-hat).
Hmm.. that is strange. I can't see the video posted by @MattHVisual using Chrome mobile on Android; only audio playback is available. Videos posted by others play normally.
Did you use a different encoding for the video?
I can't see the video posted by @MattHVisual using Chrome mobile on Android
I'm on desktop and can't see it with Firefox either. The format and MIME type aren't supported. However, if you right-click on it you can save it as a video, and the downloaded file plays with sound (it looks much better than my test). The video was created with the H.265 codec, not the default H.264. I thought Chrome, Firefox etc. would play H.265 videos too? But H.265 is an option in the 'Video Combine' node from Video Helper Suite.
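For anyone hitting the same playback issue, one workaround is to re-encode the H.265 output to H.264 before sharing. A minimal sketch that shells out to ffmpeg (the file names and CRF value are just placeholders, not settings from the workflow):

```python
import subprocess

def h265_to_h264_cmd(src: str, dst: str, crf: int = 18) -> list[str]:
    """Build an ffmpeg command that re-encodes HEVC (H.265) video to H.264
    so browsers can play it inline; the audio stream is copied untouched."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-c:v", "libx264", "-crf", str(crf),  # re-encode video to H.264
        "-c:a", "copy",                       # keep audio as-is
        dst,
    ]

# To actually run it (requires ffmpeg on PATH):
# subprocess.run(h265_to_h264_cmd("clip_hevc.mp4", "clip_h264.mp4"), check=True)
```

Audio is stream-copied, so only the video track is re-encoded and lip sync is preserved.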
Long-length custom audio workflow incoming. In theory infinite length, but the model will forget what it rendered in past windows, so realistically it's for shots of a minute or two (unless you use the static camera LoRA, then you can go longer, but RAM might become a limit). There is also slight quality degradation and drift over time, which may become apparent in the person's identity (unless you create a character LoRA), so it's probably best to keep it to around a minute or so. But you can do multiple long-length shots and combine them (with the same or different camera angles etc.), and edit them into music videos or similar.
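To make the window/forgetting behavior concrete, here is an illustrative Python sketch (the window and overlap lengths are made-up numbers, not the workflow's actual settings) of how a long clip can be split into overlapping generation windows, where only the overlap carries context forward:

```python
def plan_windows(total_s: float, window_s: float = 10.0, overlap_s: float = 2.0):
    """Split a clip of total_s seconds into overlapping generation windows.
    Each window re-uses overlap_s seconds of the previous one as context;
    everything before that overlap is effectively 'forgotten', which is why
    identity and quality can drift over long generations."""
    windows, start = [], 0.0
    step = window_s - overlap_s
    while start < total_s:
        end = min(start + window_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break
        start += step
    return windows
```

For example, a 24 s clip with 10 s windows and 2 s overlap yields three windows: 0-10 s, 8-18 s, and 16-24 s.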
a different take
Wow this is amazing. Can you please share the prompt, input image and audio? I'd like to try if I can reproduce this. Thanks.
A very "simple" one, since the generation is so long that trying to describe a specific action would just make it repeat itself.
So basically something like this:
A woman singing an emotional song, while playing guitar. The woman is seated in a chair. The scene is indoor small cosy room. Static camera with subtle slow back and forth sideways panning, always keeping the camera facing the woman. Cinematic video
But if you are going to try, try the single-pass long video workflow first ;-)
It has some needed improvements and better consistency (I'm updating the other long-video workflows to include some of the same improvements ASAP).
From a test run with the single-pass workflow:
- For the 2 hallelujah videos above, did you enable this camera control lora? Or did you use any other loras to have such nice camera movements?
Yes, I think I had it at strength 1. It doesn't really "work" though; it's made for LTX-2.0. But I imagine it helped a little (not entirely sure).
The LoRA does work if you set the strength higher (around 1.5 to 2), but in the few test runs I did with that, the image quality suffered. And a completely still camera seems to lead to a "burned" image over time.
So it's good to have subtle camera movement. That's why I put it in the prompt.
And not specifying any camera placement is fine too, but then LTX will start to drift around and might even film her from the back etc. That can be nice too.
Depends on what you aim for. Prompting is probably the best way to keep the camera on her, though: "always keep the camera facing her from the front, static camera with subtle panning back and forth", for example.
- I think the 40s video looks much more real than the 1:15 one. Is it because of the input image of the 40s video has more "natural" lighting?
Probably. The 40s video has a much more "romantic" tone to it; the 1:15 has harsh light from the input image ;-) Plus, in the input image she had her eyes closed, so the model seems to have a bit of "drifting eyes" for the first few seconds. I'm sure a better input image, or the same one as the 40s video, would look great ;-)
The same workflow can also do T2V, and sometimes that looks even better (but of course then you have a little less control over the exact look).
OK. So I've tried the single-pass long video workflow. Without camera control lora, it's pretty random, depending on the input image. I tried the "static" camera control lora and it worked (camera not moving at all). I also tried a "dolly left" and it also worked (but it keeps dolly left forever, as long as the total video length).
Yes, it's a loop that extends a video to be longer. But one limitation of a loop is that the "same things" apply in each iteration (LoRA, prompt etc.).
But you can do multiple long-length shots and edit them together into, say, a full music video (using a video editor to combine the different parts).
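If you'd rather stitch the shots together on the command line instead of a video editor, ffmpeg's concat demuxer works too. A small hypothetical helper that builds the list file it expects (file names are placeholders):

```python
from pathlib import Path

def write_concat_list(clips: list[str], list_path: str = "shots.txt") -> str:
    """Write the list file for ffmpeg's concat demuxer, then join losslessly with:
    ffmpeg -f concat -safe 0 -i shots.txt -c copy music_video.mp4
    Stream copy (-c copy) only works if all clips share the same codec settings."""
    text = "".join(f"file '{c}'\n" for c in clips)
    Path(list_path).write_text(text)
    return text

# write_concat_list(["shot1.mp4", "shot2.mp4", "shot3.mp4"])
```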
And create different scenes with same character using Flux Klein or Qwen Image Edit, and use that as the input image ;-)
I've been meaning to make a variation of this workflow where you can change things along the way; I was originally only thinking about the prompt and reference image (for scene changes).
So that you can direct what happens in each extended part. But a LoRA setting could be added too ;-)
Does each loop use the initial input image? Even with a static camera, I feel the further the loop iteration, the more deviation from the initial image.
Yes, the same input reference image is used all the way. But it also uses a lot of reference frames from the last loop to be able to continue motion etc.
So it might drift a little over time. I'll try adding more reference and guider, and see if it can be made even better ;-)
This is a WAN InfiniteTalk workflow that extends in 3s increments. Very consistent (static camera though). Hope you can bring something good from it into yours.
https://www.runninghub.ai/post/2024844430603194369?inviteCode=kol02-rh003
Wan InfiniteTalk has a very special kind of magic built into the node, if I remember right. It's a "FramePack"-like logic where it does some clever things to extend.
Maybe something like that will come for LTX too, perhaps Kijai will do some magic ;-)
But LTX works quite differently than WAN, so I'm not sure it's even possible.
https://lllyasviel.github.io/frame_pack_gitpage/
For LTX, the mask node is the best bet currently, plus some of the newer LTX latent guider nodes for consistency.
I'll play around with it a bit and try to see what works best. The single-pass workflow already has a latent guider in it..
This ID-LoRA looks interesting https://id-lora.github.io/
Very. It will make a consistent voice across multiple video generations easy, with just a short audio reference.
Currently it's not supported by the native nodes in ComfyUI (it does have a ComfyUI wrapper already, but that works stand-alone and can't be connected to or used in other workflows).
But hopefully support comes soon ;-)
ah very nice ;-) thanks a ton, will be useful with an easy audio ref.
This ID-LoRA looks interesting https://id-lora.github.io/
It will make things simpler. The current workaround is to first use Qwen TTS voice cloning with reference audio, then use LTX I2V with the custom audio. But with this LoRA, background sounds can be added too.

