How to extract visual feature map?

#38
by InspiringWaves - opened

Hi,
I am planning to use this multi-modal model to create a visual feature map.
I can create image embeddings with this:

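    import torch

    # Pick the device once so the inputs and the vision tower end up in the same place.
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Tokenize the text and preprocess the images/videos into model inputs.
    inputs = processor(
        text=text,
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to(device)

    # Patch pixels and the per-image (t, h, w) patch grid sizes.
    pixel_values = inputs["pixel_values"]
    grid_thw = inputs["image_grid_thw"]
    model.visual.to(device)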

However, this seems to produce only image embeddings. Is there a way to extract the visual feature map?
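
One possible direction, sketched under assumptions rather than confirmed for this model: if the vision tower returns a flat sequence with one row per merged patch token, in row-major (t, h, w) order, then `image_grid_thw` may be enough to fold that sequence back into a spatial map. The helper `feature_map_layout` and the `merge_size=2` default are hypothetical, as is the `model.visual(...)` call in the trailing comments; they should be checked against the model's vision config.

```python
def feature_map_layout(grid_thw, merge_size=2):
    """Compute where each image's tokens sit in a flat embedding sequence.

    grid_thw: list of [t, h, w] patch-grid sizes, one entry per image
              (e.g. inputs["image_grid_thw"].tolist()).
    merge_size: ASSUMED spatial merge factor of the vision tower; each
                merge_size x merge_size block of patches becomes one token.

    Returns a list of (start, end, (t, h_out, w_out)) tuples.
    """
    layout = []
    start = 0
    for t, h, w in grid_thw:
        if h % merge_size or w % merge_size:
            raise ValueError(f"grid {h}x{w} is not divisible by {merge_size}")
        h_out, w_out = h // merge_size, w // merge_size
        n_tokens = t * h_out * w_out
        layout.append((start, start + n_tokens, (t, h_out, w_out)))
        start += n_tokens
    return layout


# One image with a 4 x 6 patch grid and 2 x 2 merging gives a 2 x 3 map.
layout = feature_map_layout([[1, 4, 6]])
print(layout)  # [(0, 6, (1, 2, 3))]

# Hypothetical use with the tensors from the snippet above (not verified here):
# embeds = model.visual(pixel_values, grid_thw=grid_thw)   # assumed call and return shape
# for start, end, (t, h_out, w_out) in feature_map_layout(grid_thw.tolist()):
#     fmap = embeds[start:end].reshape(t, h_out, w_out, -1)  # (t, H, W, hidden)
```

If the token order or merge factor differs for this checkpoint, the reshape would still run but would scramble the spatial layout, so it is worth verifying on an image with an obvious left/right or top/bottom pattern.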

Thank you in advance!

InspiringWaves changed discussion title from How to extract embeddings? to How to extract visual feature map?
InspiringWaves changed discussion status to closed
