get_I2V_pipeline doesn't exist

#1
by 4ervonec19 - opened

Am I right that for the model ai-forever/Kandinsky-5.0-I2V-Lite-5s you present an example of an image-to-video inference class, but the function itself is not in the GitHub repository (https://github.com/ai-forever/KandinskyVideo/tree/main/kandinsky_video)?

I mean, you provide this code in the current HuggingFace repository:

import torch
from kandinsky import get_I2V_pipeline

device_map = {
    "dit": torch.device('cuda:0'), 
    "vae": torch.device('cuda:0'), 
    "text_embedder": torch.device('cuda:0')
}

pipe = get_I2V_pipeline(device_map, conf_path="configs/config_5s_i2v.yaml")

images = pipe(
    seed=42,
    time_length=5,
    save_path='./test.mp4',
    text="The Dragon breaths fire.",
    image="assets/test_image.jpg",
)

But there is no get_I2V_pipeline inside the GitHub repository, and there is also no YAML file configs/config_5s_i2v.yaml there.

Or does the missing check mark ✅ next to Diffusers Integration for I2V mean that this class is not implemented yet?

Thanks for publishing the inference class for the I2V pipeline!

When running multiple generations, I encountered a consistent VRAM error like this:

OutOfMemoryError: CUDA out of memory. Tried to allocate 562.00 MiB. GPU 0 has a total capacity of 79.15 GiB of which 473.69 MiB is free. Process 3818579 has 78.68 GiB memory in use. Of the allocated memory 78.07 GiB is allocated by PyTorch, and 113.96 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
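As a side note, the mitigation the error message itself suggests can be applied by setting the allocator flag before launching the process. This only reduces fragmentation overhead; it would not fix a genuine leak like the one described below:

```shell
# Allocator hint from the error message: let the CUDA caching allocator
# grow segments instead of fragmenting into fixed-size blocks.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```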

When using the AutoencoderKLHunyuanVideo decoder for video generation, there was a severe CUDA memory leak: each call to vae.decode() would accumulate GPU memory that could not be freed with torch.cuda.empty_cache() (which did not work for me). This made it impossible to generate video after the diffusion process.

The memory leak probably occurred because PyTorch was retaining intermediate tensors and computation graphs even during inference. Debugging confirmed this issue. Running the code below:

# [DEBUG]
import torch
from kandinsky.models.vae import AutoencoderKLHunyuanVideo, HunyuanVideoDecoder3D
torch.cuda.empty_cache()
torch.cuda.synchronize()

vae = AutoencoderKLHunyuanVideo().to('cuda').half().eval()

z = torch.randn(1, 16, 3, 90, 66, device="cuda:0", dtype=torch.float16)

def find_gpu_tensors():
    """Find all GPU tensors and print their sizes."""
    import gc

    print("=== GPU TENSORS ===")
    gpu_tensors = []

    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) and obj.device.type == 'cuda':
                size_mb = obj.element_size() * obj.nelement() / 1024**2
                gpu_tensors.append((size_mb, obj.shape, obj.dtype, type(obj)))
        except Exception:
            pass
    
    gpu_tensors.sort(reverse=True, key=lambda x: x[0])
    
    total_size = 0
    for size_mb, shape, dtype, obj_type in gpu_tensors[:20]:
        print(f"{size_mb:8.2f}MB - {shape} - {dtype} - {obj_type}")
        total_size += size_mb
    
    print(f"\nTotal tensors size: {total_size/1024:.2f}GB")
    print(f"Number of tensors: {len(gpu_tensors)}")

res = vae.decode(z)
find_gpu_tensors()

Produced this output (after just one inference call):

=== GPU TENSORS ===
 2055.30MB - torch.Size([1, 256, 11, 722, 530]) - torch.float16 - <class 'torch.Tensor'>
 2055.30MB - torch.Size([1, 256, 11, 722, 530]) - torch.float16 - <class 'torch.Tensor'>
 1670.62MB - torch.Size([1, 256, 9, 720, 528]) - torch.float16 - <class 'torch.Tensor'>
 1670.62MB - torch.Size([1, 256, 9, 720, 528]) - torch.float16 - <class 'torch.Tensor'>
 1670.62MB - torch.Size([1, 256, 9, 720, 528]) - torch.float16 - <class 'torch.Tensor'>
 1670.62MB - torch.Size([1, 256, 9, 720, 528]) - torch.float16 - <class 'torch.Tensor'>
 1670.62MB - torch.Size([1, 256, 9, 720, 528]) - torch.float16 - <class 'torch.Tensor'>
 1027.65MB - torch.Size([1, 128, 11, 722, 530]) - torch.float16 - <class 'torch.Tensor'>
 1027.65MB - torch.Size([1, 128, 11, 722, 530]) - torch.float16 - <class 'torch.Tensor'>
 1027.65MB - torch.Size([1, 128, 11, 722, 530]) - torch.float16 - <class 'torch.Tensor'>
 1027.65MB - torch.Size([1, 128, 11, 722, 530]) - torch.float16 - <class 'torch.Tensor'>
 1027.65MB - torch.Size([1, 128, 11, 722, 530]) - torch.float16 - <class 'torch.Tensor'>
 1027.65MB - torch.Size([1, 128, 11, 722, 530]) - torch.float16 - <class 'torch.Tensor'>
  835.31MB - torch.Size([1, 128, 9, 720, 528]) - torch.float16 - <class 'torch.Tensor'>
  835.31MB - torch.Size([1, 128, 9, 720, 528]) - torch.float16 - <class 'torch.Tensor'>
  835.31MB - torch.Size([1, 128, 9, 720, 528]) - torch.float16 - <class 'torch.Tensor'>
  835.31MB - torch.Size([1, 128, 9, 720, 528]) - torch.float16 - <class 'torch.Tensor'>
  835.31MB - torch.Size([1, 128, 9, 720, 528]) - torch.float16 - <class 'torch.Tensor'>
  835.31MB - torch.Size([1, 128, 9, 720, 528]) - torch.float16 - <class 'torch.Tensor'>
  835.31MB - torch.Size([1, 128, 9, 720, 528]) - torch.float16 - <class 'torch.Tensor'>
Total tensors size: 23.90GB
Number of tensors: 386
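The retention can be reproduced without the VAE at all: model parameters require grad by default, so even in eval mode every forward call records an autograd graph and keeps intermediate activations alive. A minimal CPU sketch with a toy Conv2d (not the actual decoder):

```python
import torch

conv = torch.nn.Conv2d(3, 8, kernel_size=3)
x = torch.randn(1, 3, 16, 16)

# Parameters require grad by default, so the forward pass builds an
# autograd graph and retains its saved intermediate tensors.
y = conv(x)
assert y.grad_fn is not None  # a graph is kept alive after the call

# Under no_grad, no graph is recorded and intermediates can be freed eagerly.
with torch.no_grad():
    y_ng = conv(x)
assert y_ng.grad_fn is None
```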

As a solution, adding the @torch.no_grad() wrapper to both the _decode and decode class methods resolved the issue for me:

@apply_forward_hook
@torch.no_grad()  # <-- this fix worked for me (also applied to _decode)
def decode(
        self, z: torch.Tensor, return_dict: bool = True
    ) -> Union[DecoderOutput, torch.Tensor]:
        r"""
        Decode a batch of images.

        Args:
            z (`torch.Tensor`): Input batch of latent vectors.
            return_dict (`bool`, *optional*, defaults to `True`):
                Whether to return a [`~models.vae.DecoderOutput`] instead of a plain tuple.

        Returns:
            [`~models.vae.DecoderOutput`] or `tuple`:
                If return_dict is True, a [`~models.vae.DecoderOutput`] is returned,
                otherwise a plain `tuple` is returned.
        """
        tile_size, tile_stride = self.get_dec_optimal_tiling(z.shape)
        if tile_size != self.tile_size:
            self.tile_size = tile_size
            self.apply_tiling(tile_size, tile_stride)

        decoded = self._decode(z).sample

        if not return_dict:
            return (decoded,)

        return DecoderOutput(sample=decoded)
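If you would rather not patch the class, the same effect can be obtained at the call site with torch.inference_mode(). A sketch with a hypothetical stand-in decoder (ToyDecoder below is not the real AutoencoderKLHunyuanVideo; the wrapping pattern is the same for vae.decode()):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the real VAE decoder.
class ToyDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Conv3d(16, 3, kernel_size=1)

    def decode(self, z):
        return self.net(z)

vae = ToyDecoder().eval()
z = torch.randn(1, 16, 2, 8, 8)

# inference_mode (or no_grad) disables graph construction for the whole
# call, so no intermediate activations are retained after it returns.
with torch.inference_mode():
    out = vae.decode(z)

assert not out.requires_grad  # no autograd graph was built
```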

Hope this helps fix the memory issues! There were memory problems running this code out of the box, or perhaps I made some mistakes during my setup attempts 🤗

Well, this morning everything works fine using the code out of the box. Perhaps it was a problem with my GPU environment; I don't know why memory cumulatively increased during decoder inference. In the end, the current code works fine with no additional fixes. Thanks)
