When video and audio concat don't align
I used LTX-2.3_-_I2V_T2V_Music-Video-Creator_multi-scene_custom_audio.json to generate a video, and SamplerCustomAdvanced threw this error:
RuntimeError: Expected all tensors to be on the same device, but got tensors is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA_cat)
How can I fix that problem?
That's a strange one.
Try updating ComfyUI and KJ Nodes.
It still doesn't work.
I suspect MelBandRoFormer is outputting an np.array type.
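If that guess is right, the fix is just to coerce the separator's output into a torch tensor on the expected device before it reaches the sampler. A minimal sketch (the helper name `to_audio_tensor` is hypothetical, not part of any node):

```python
import numpy as np
import torch

def to_audio_tensor(waveform, device="cpu"):
    # Hypothetical helper: accept either a NumPy array (as an audio-separator
    # node might return) or a torch tensor, and return a tensor on `device`.
    if isinstance(waveform, np.ndarray):
        waveform = torch.from_numpy(waveform)
    return waveform.to(device)

# e.g. a mono waveform the separator produced as np.float32
vocals = np.zeros((1, 44100), dtype=np.float32)
print(to_audio_tensor(vocals).device)  # cpu
```

In a CUDA workflow you would pass `device="cuda:0"` so the audio lands on the same device as the video tensors.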
Did you run into this problem?
Not sure what this error is.. it says one model is on the CPU while the other is on the GPU.
But not sure where that comes from.
Since you mention MelbandRoFormer, does it work if you disable that?
Yes, I disabled all the MelbandRoFormer nodes and related ones. It works now, but the generated video looks foggy.
Note: This error usually occurs when the video tensor is on the GPU and the voice tensor is on the CPU, and they are being concatenated.
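That mismatch is easy to reproduce: `torch.cat` requires all inputs on one device. A defensive sketch (the `safe_cat` wrapper is my own illustration, not a ComfyUI function) that moves everything to the first tensor's device first:

```python
import torch

def safe_cat(tensors, dim=0):
    # torch.cat raises the "Expected all tensors to be on the same device"
    # RuntimeError if inputs are split across cpu and cuda. Moving every
    # tensor to the first tensor's device avoids that.
    device = tensors[0].device
    return torch.cat([t.to(device) for t in tensors], dim=dim)

video = torch.zeros(2, 3)  # imagine this lives on cuda:0 in the workflow
audio = torch.ones(2, 3)   # ...while this came back on the cpu
out = safe_cat([video, audio])
print(out.shape)  # torch.Size([4, 3])
```

`t.to(device)` is a no-op when the tensor is already on the right device, so the wrapper is cheap in the normal case.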
hmmm MelbandRoformer is an audio separator (extracts vocals). It should have zero impact on the video result (other than the lip-sync)
Try double-checking that your VAE etc. are loaded and that all the models are correct.
But just to be sure, I'll double-check here also if there is anything.
Might be something to ask Kijai about. He made that node, but I've never had any issues myself.
Question: Any reason for using MelbandRoformer for this task? Is it better than audio-separation-node and Deepxtractv2, or a matter of VRAM usage? I'm curious because I make my drumless tracks with both of those, but I've never tried vocal extraction, since I play drums :)
There are many audio separator tools out there by now. You can swap it out for another if you prefer.
The MelbandRoformer is really good at vocal extraction, though.
(The only reason for using it is to improve lip-sync if the audio input is "muffled". But in many cases you want the music too, for example playing guitar or dancing.)
And it's only what LTX "hears"; the final output has full audio, even if you feed vocals only to LTX.
Yeah, I got the "why" part earlier and thought that was pretty smart :) This is why I love your workflows, smart.
https://huggingface.co/WanApp/LtxMTV/resolve/main/LtxMTV.json
Work in progress (I just upped this fast because I was happy with my take on your MTV concept. The workflow is a mess, I didn't take time to rename anything or even take decent angles, just wanted to test before bed :)