How to extract visual feature map?

#38

by InspiringWaves - opened Aug 5, 2025

Discussion

InspiringWaves

Aug 5, 2025

•

edited Aug 5, 2025

Hi,
I am planning to use this multi-modal model to create a visual feature map.
I can create image embeddings with this:

    import torch

    # Pick the device once, so the code also runs on a CPU-only machine.
    device = "cuda" if torch.cuda.is_available() else "cpu"

    inputs = processor(
        text=text,
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to(device)

    # Flattened image patches and the (t, h, w) patch grid of each image.
    pixel_values = inputs["pixel_values"]
    grid_thw = inputs["image_grid_thw"]
    model.visual.to(device)

However, this seems to give only image embeddings. Is there a way to extract the visual feature map?

Thank you in advance!
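
One possible approach, offered as a sketch rather than a confirmed answer: if the vision tower returns a flat tensor of shape (num_tokens, hidden_dim) whose tokens are laid out in row-major (t, h, w) order per image, the tokens can be split per image and reshaped into a spatial map using `grid_thw`. The helper name `tokens_to_feature_maps`, the `merge_size=2` default (the factor by which the tower is assumed to merge neighbouring patches), and the token ordering are all assumptions that should be checked against the actual model. Dummy tensors stand in for real model outputs here.

```python
import torch


def tokens_to_feature_maps(embeds, grid_thw, merge_size=2):
    """Split flat visual tokens per image and reshape each to (T, H, W, C).

    Assumes (hypothetically) that tokens are in row-major (t, h, w) order
    and that the tower merged patches by `merge_size` along height and width.
    """
    maps = []
    start = 0
    for t, h, w in grid_thw.tolist():
        hm, wm = h // merge_size, w // merge_size
        n = t * hm * wm  # number of tokens belonging to this image
        maps.append(embeds[start:start + n].reshape(t, hm, wm, -1))
        start += n
    if start != embeds.shape[0]:
        raise ValueError("token count does not match grid_thw and merge_size")
    return maps


# Dummy stand-ins: two images with patch grids 1x4x6 and 1x2x2, hidden size 8.
grid_thw = torch.tensor([[1, 4, 6], [1, 2, 2]])
embeds = torch.randn(1 * 2 * 3 + 1 * 1 * 1, 8)

feature_maps = tokens_to_feature_maps(embeds, grid_thw)
shapes = [tuple(m.shape) for m in feature_maps]
```

In a real run, `embeds` would be whatever tensor the vision tower (`model.visual` in the snippet above) returns for `pixel_values` and `grid_thw`. If the token count check fails, the merge factor or ordering assumption probably does not hold for this model.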

InspiringWaves changed discussion title from How to extract embeddings? to How to extract visual feature map? Aug 5, 2025

InspiringWaves changed discussion status to closed Aug 6, 2025
