Adding Voice to existing Video

#89
by APCOTech - opened

Hello RuneXX,
First, thank you so much for your great and well-organized workflows; they are really helpful. I read a comment here where you said it is possible to add voice to an existing video. I was wondering if you succeeded in doing it? I have a video that I made with Wan, and it is quite complicated (not just a talking head). I tried the first/last-frame approach, but I can't get the action reproduced perfectly in LTX, and InfiniteTalk wasn't good. So is there a way to achieve this?

If it's a very complicated Wan video, it might be tricky. But in theory it should work.

You can try one of these workflows https://huggingface.co/RuneXX/LTX-2.3-Workflows/tree/main/Video-2-Video

Those might be worth a try ;-)

And if you are stuck, feel free to upload a Wan video, and I can try too...

And if all fails, you can do an LTX recreation of the video by grabbing the first frame of the Wan video and simply prompting what should happen (to try to mimic what you had in Wan). Even a First-Last or First-Last-Middle frame workflow if it's a bit more complicated scene.

Thank you so much for your reply. I tried the First-Last and First-Last-Middle frame workflows, but they failed to achieve the same action. I tried the Foley workflow before; it is amazing, but I want to add speech to my current video. The problem is: I do not want to extend it, I just want to add the speech to the existing video. I will try the Just Talk one with Extend set to zero; maybe it will work.

The Just Talk with no extend could be worth a try.
If it's a very complicated scene, you can alter the Sam3 prompt to target a specific person. It will try to keep track of that person even if they leave the scene and return later, etc.
Sam3 uses natural language, but it can take a few tries before getting the mask right ;-)

The Sam3 node is near the middle of the workflow, and has a prompt box to specify what to mask.

The biggest challenge is often that LTX isn't really an inpaint model (without an inpaint lora). So sometimes the changes are a bit too subtle. But give it a try ;-)

(And it's really only been tested on "talking head" sorts of videos; it might not work great if the video is very complicated. LTX seems to love a good "portrait close-up", but at a distance and with high action it's less ideal.)

Seems like I lost you here :(
Where can I alter the Sam3 prompt?
I am using this workflow now: LTX-2.3_-_V2V_Just_Talk_prompt_lipsynced-voice_to_any_video.json
PS: I just got this error (ValueError: Masks batch size must have a multiple of 8 masks + 1.) in the LTX preprocess Masks node

(image attached)

Where can I alter the Sam3 prompt?

Sorry, my bad. This workflow uses the Face Segmentation node, I see now.
There you just toggle what you want to mask (and therefore change). The default is set to the mouth area, but you can expand it to more parts of the face.

I uploaded a "Just Talk" workflow with Sam3 as a variant you can try, where you just prompt what to mask ;-)
https://huggingface.co/RuneXX/LTX-2.3-Workflows/tree/main/Video-2-Video

(In this particular workflow the mouth area is the aim, but it can really be anything ...and the whole face can often be beneficial. LTX will try to inpaint the area according to your prompt, so in theory you can do other edits as well ;-))

(image attached)

Man, you are a real legend.. Thank you so much.

PS: I just got this error (ValueError: Masks batch size must have a multiple of 8 masks + 1.) in the LTX preprocess Masks

LTX is a cry-baby when it comes to the number of frames.
I'll try to add some automagic calculations, but for now, at the Load Video node, try setting frame_load_cap to a number that is a multiple of 8 plus 1 ... 241 frames, or something like that.

Unfortunately, ComfyUI auto-fixes wrong frame counts for the input frames, but not for the mask frames. I'll try to add something that auto-adjusts.
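As a rough sketch of the rule the error message describes (assuming the constraint really is "a multiple of 8, plus 1", as the 241 example suggests; the function name here is made up for illustration, not part of any node):

```python
def nearest_valid_frame_count(requested: int) -> int:
    """Round down to the nearest count of the form 8*k + 1,
    e.g. 241 = 8*30 + 1, which the mask preprocessor accepts."""
    if requested < 1:
        return 1
    return ((requested - 1) // 8) * 8 + 1

print(nearest_valid_frame_count(241))  # 241 (already valid)
print(nearest_valid_frame_count(250))  # 249
```

So if your source clip has, say, 250 frames, setting frame_load_cap to 249 should satisfy the check.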

Do you have any preview examples of outputs? That would be good for people working in the v2v lipsync area.

Do you have any preview examples of outputs? That would be good for people working in the v2v lipsync area.

Made some quickly for you, with some random Wan videos from CivitAI. It probably works best for videos where it's more obvious that the person "should" talk.
LTX often ends up with a voice-over narrator if not. But you can prompt insistently enough, saying stuff like "the woman in the red dress opens and moves her mouth, she is talking and she says: ""
And increasing the mask area gives LTX a bit more to work with (a person really talks with their whole face, so masking the full face is a bit more natural, but the end result will change more... sometimes for the best, though).

And you might have to do a few runs and try a few different seeds before you get a good result ;-) It's, after all, a bit of a "hack", not really what the model was meant for.
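To illustrate what "increasing the mask area" does under the hood (the workflow's own nodes handle this; the snippet below is just a hedged numpy sketch of binary dilation, not code from the workflow):

```python
import numpy as np

def dilate_mask(mask: np.ndarray, radius: int) -> np.ndarray:
    """Grow a binary mask outward by `radius` pixels using
    repeated 1-pixel shifts (a simple cross-shaped dilation)."""
    out = mask.astype(bool).copy()
    for _ in range(radius):
        grown = out.copy()
        grown[1:, :] |= out[:-1, :]   # spread downward
        grown[:-1, :] |= out[1:, :]   # spread upward
        grown[:, 1:] |= out[:, :-1]   # spread right
        grown[:, :-1] |= out[:, 1:]   # spread left
        out = grown
    return out

# a single masked pixel grows into a diamond of 13 pixels at radius 2
m = np.zeros((5, 5), dtype=bool)
m[2, 2] = True
print(int(dilate_mask(m, 2).sum()))  # 13
```

The same idea, applied to a mouth mask, spreads the editable region toward the whole face, giving the model more pixels it is allowed to change.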

I asked and you delivered, not just one but two workflows! Thank you so much.
I had a question and a little suggestion:

  1. What's the purpose of "v3_sd15_adapter.ckpt"? Disabling it doesn't alter anything.
  2. I found this new LTX 2.3 inpainting lora: https://huggingface.co/Alissonerdx/LTX-LoRAs/tree/main, it may help.
  1. What's the purpose of "v3_sd15_adapter.ckpt"? Disabling it doesn't alter anything.

Oh, that's just a placeholder. I will remove that lora node since it can not be set to "none". The intention of that particular lora "box" is to be able to use user-made loras that are not trained on audio (and hence might distort LTX sound output). But I see now that it has a "default" lora loaded, which might be confusing. You can simply delete that node, or Ctrl+B to bypass it.

  2. I found this new LTX 2.3 inpainting lora: https://huggingface.co/Alissonerdx/LTX-LoRAs/tree/main, it may help.

Yes, I saw this lora as well. Been trying it out. It seems to like particular kinds of masks. But I will add it to a workflow ;-) There is also an outpainting lora out now that I will do a workflow for.

It seems to me that LTX doesn't respect the mask area. At the 7th second in this video, it changed the background smoke, removed some light from the face, and changed the dress slightly. It has happened to me in countless attempts: it tends to change small details too, and that ruins my video. I think the reason for this is the upscaler in the second pass; I am not sure, but I am sure that something bypasses the masked area. I am really grateful for all of your efforts and help. Thank you so much <3
Just a stupid question, I know: is there any way to lower the strength of the 2nd pass?

Yes, it's not a 100% mask-out of the masked area. The model takes its freedoms.
Bypassing the 2nd pass and rendering the video at full size in one pass gives fewer changes.
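If the model still changes pixels outside the mask, one workaround (not part of the workflow, just a post-processing sketch with a hypothetical helper) is to composite the original frames back over the output everywhere outside the mask:

```python
import numpy as np

def composite_outside_mask(original: np.ndarray,
                           generated: np.ndarray,
                           mask: np.ndarray) -> np.ndarray:
    """Keep generated pixels only inside the mask and restore the
    original video elsewhere. Shapes: (frames, H, W, C) for the
    videos, (frames, H, W) for the mask, with mask values in [0, 1]."""
    m = mask[..., None].astype(np.float32)  # add a channel axis for broadcasting
    blended = (generated.astype(np.float32) * m
               + original.astype(np.float32) * (1.0 - m))
    return blended.astype(original.dtype)
```

A hard 0/1 mask gives visible seams, so in practice you would feather (blur) the mask edges before blending.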

Yes, it's not a 100% mask-out of the masked area. The model takes its freedoms.
Bypassing the 2nd pass and rendering the video at full size in one pass gives fewer changes.

True, I just tried it, and I also used the new reasoning lora (VBVR). The results are amazing and it does not drift from the first and last frames (although it is much slower now).

(image attached)

True, I just tried it, and I also used the new reasoning lora (VBVR). The results are amazing and it does not drift from the first and last frames (although it is much slower now).

Been meaning to try out that lora ;-) I tried it with Wan, and it sure can help with logic and prompt understanding.

Idk bro, but the lipsync isn't accurate. Is this the best lipsync quality of LTX 2.3 v2v for now? Cuz I'm in need of something stronger than this and don't wanna submit myself to sync 3, which is closed source.

Idk bro, but the lipsync isn't accurate. Is this the best lipsync quality of LTX 2.3 v2v for now? Cuz I'm in need of something stronger than this and don't wanna submit myself to sync 3, which is closed source.

Depends on the seed and prompt. And there are a couple of loras that can help as well, the Talking-Head lora for example.

Hey, would you like to check this out?
https://huggingface.co/HiDream-ai/ReCo
The examples look promising, but I do not know how to run it, or if it is even possible to run in ComfyUI ..

Hey, would you like to check this out?
https://huggingface.co/HiDream-ai/ReCo
The examples look promising, but I do not know how to run it, or if it is even possible to run in ComfyUI ..

Should be possible. But it's Wan-based, and more specifically Wan VACE.
VACE is a really robust multi-feature model that can do all sorts of "magic" things, like replacing items etc.

Hopefully there will be a VACE for LTX or something similar (lora). LTX is a bit of a different architecture, so it might take a while before more people jump on board.
But Wan (when it was open source) gradually got more and more side-projects and derivative models that added all sorts of great features...

Hopefully a similar "ecosystem" will evolve around LTX, since it's the best and most current open-source model at the moment.
https://huggingface.co/Kijai/WanVideo_comfy/tree/main just see how many models are based on Wan ;-) And I even think ReCo is supported in the WanVideo Wrapper, if I remember right.

Oh, I did not read; I just looked at the examples :)
Yes, VACE is great.. but LTX is better because of the voice features. I hope it gets as many upgrades and finetunes as Wan soon.

I heard about the ungrade lora recently; I don't know if its purpose is to focus on an area and unblur it, or the opposite, but the results seem workable. Regarding the realism of LTX 2.3 lipsync, since there was the update recently, could you redo the same render but with 1.1? I believe that's the best way for people, including me, to have something to refer to, so we know if it improved or not.

Yes, the new v1.1 distilled model seems to be a nice improvement, for sound, aesthetics, and more. It might also help lip-sync; I haven't tested that specifically yet ;-) but it might be worth trying, true.

(The ungrade lora (color ungrade) is meant to make the output video a bit less saturated, if I remember right, toward more muted, natural colors (although personally I think the LTX output colors are all okay).)
