multimodalart (HF Staff) and Claude Opus 4.6 (1M context) committed
Commit 535cf33 · 1 parent: 0edb2a4

Add VOID VLM-Mask-Reasoner quadmask generation demo


4-stage pipeline matching the Netflix VOID repo:
- Stage 1: SAM2 video segmentation (transformers Sam2VideoModel)
- Stage 2: Gemini VLM scene analysis (repo code directly)
- Stage 3: SAM3 text-prompted segmentation (transformers Sam3Model)
- Stage 4: Lossless quadmask combination (repo code directly)

Gradio UI with point-click selection, progress tracking,
overlay visualization, and lossless quadmask download.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

README.md CHANGED
@@ -1,13 +1,13 @@
  ---
  title: VOID Quadmask Reasoner
- emoji: 🚀
- colorFrom: green
- colorTo: yellow
+ emoji: 🎭
+ colorFrom: gray
+ colorTo: purple
  sdk: gradio
  sdk_version: 6.11.0
  python_version: '3.12'
  app_file: app.py
  pinned: false
+ license: mit
+ short_description: 'VLM-Mask-Reasoner: Generate quadmasks for VOID video inpainting'
  ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
VLM-MASK-REASONER/README.md ADDED
@@ -0,0 +1,116 @@
# VLM Mask Reasoner — Mask Generation Pipeline

Generates quadmasks for video inpainting by combining user-guided SAM2 segmentation with VLM (Gemini) scene reasoning. The output `quadmask_0.mp4` encodes four semantic layers that the inpainting model uses to understand what to remove and what to preserve.

---

## Quadmask Values

| Value | Meaning |
|-------|---------|
| `0` | Primary object (to be removed) |
| `63` | Overlap of primary and affected regions |
| `127` | Affected objects (shadows, reflections, held items) |
| `255` | Background — keep as-is |

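The four values can be derived from two binary masks, one for the primary object and one for the affected regions. A minimal sketch of that encoding, assuming two boolean NumPy arrays of the same shape (the function name and signature are illustrative, not the repo's API):

```python
import numpy as np

def combine_quadmask(primary: np.ndarray, affected: np.ndarray) -> np.ndarray:
    """Encode a primary (black) mask and an affected (grey) mask into a
    single quadmask frame using the value table above."""
    quad = np.full(primary.shape, 255, dtype=np.uint8)  # background everywhere
    quad[affected] = 127                                # affected objects
    quad[primary] = 0                                   # primary object
    quad[primary & affected] = 63                       # overlap of both
    return quad
```

Because the assignments go from weakest to strongest claim (background, affected, primary, overlap), each pixel ends up with exactly one of the four values.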
---

## Step 1 — Select Points via GUI

Launch the point selector GUI:

```bash
python point_selector_gui.py
```

Use this GUI to place sparse click points on the object(s) you want removed.

**A few things to know:**

- Points can be placed on **any frame**, not just the first. If the object you want to remove only appears later in the video, navigate to that frame and click there.
- You can place points across **multiple frames** — useful when there are multiple distinct objects to remove, or when an object's position shifts significantly over time.
- The GUI saves your selections to a `config_points.json` file. Keep track of where this is saved — you'll pass it to the pipeline next.

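The exact schema of `config_points.json` is defined by `point_selector_gui.py`; the layout below is purely hypothetical, shown only to illustrate a quick sanity check before launching the pipeline:

```python
import json

# Hypothetical layout -- the real schema is whatever point_selector_gui.py
# writes; field names here are illustrative only.
example = {
    "video_path": "input_video.mp4",
    "output_dir": "results/clip_0",
    "points": [
        {"frame": 0, "x": 412, "y": 230, "label": 1},
        {"frame": 48, "x": 150, "y": 310, "label": 1},
    ],
}

def has_points(config: dict) -> bool:
    """Basic sanity check: at least one click point is present."""
    return bool(config.get("points"))

# Round-trip through JSON as if the config had been read from disk.
config = json.loads(json.dumps(example))
```

A check like this catches an empty or mis-saved config before any GPU time is spent.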
---

## Step 2 — Run the Pipeline

Once you have your `config_points.json`, run all stages with a single command:

```bash
bash run_pipeline.sh <config_points.json>
```

Optional flags:

```bash
bash run_pipeline.sh <config_points.json> \
    --sam2-checkpoint ../sam2_hiera_large.pt \
    --device cuda
```

This runs four stages automatically:

1. **Stage 1 — SAM2 Segmentation:** Propagates your point clicks into a per-frame black mask for the primary object.
2. **Stage 2 — VLM Analysis (Gemini):** Analyzes the scene to identify affected objects — things like shadows, reflections, or items the primary object is interacting with.
3. **Stage 3 — Grey Mask Generation:** Produces a grey mask track for the affected objects identified in Stage 2.
4. **Stage 4 — Combine into Quadmask:** Merges the black and grey masks into the final `quadmask_0.mp4`.

The output `quadmask_0.mp4` is written into each video's `output_dir` as specified in the config.

> **Note on grey values in frame 1:** The inpainting model was trained with grey-valued regions (`127`) starting from frame 1 onward — not on the very first frame. We find this convention improves inference quality, so the pipeline automatically clears any grey pixels from frame 0 of the final quadmask before saving.

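The frame-0 convention in the note above maps directly onto the quadmask values: grey-only pixels revert to background, and overlap pixels keep their black component. A minimal sketch, assuming the quadmask is a list of `uint8` NumPy frames (the helper name is illustrative, not the repo's API):

```python
import numpy as np

def clear_grey_in_first_frame(frames: list) -> None:
    """Clear grey regions from frame 0 in place:
    127 (affected only) -> 255 (background),
    63 (overlap)        -> 0   (primary, black kept)."""
    if not frames:
        return
    f0 = frames[0]
    f0[f0 == 127] = 255
    f0[f0 == 63] = 0
```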
---

## Step 3 (Optional) — Manual Mask Correction

If the generated quadmask needs refinement, you can correct it interactively:

```bash
python edit_quadmask.py
```

Point the GUI to the folder containing `quadmask_0.mp4`. You can paint over regions frame by frame to fix any mask errors before running inference. The corrected mask is saved back to `quadmask_0.mp4` in the same folder.

---

## Installation & Dependencies

### 1. Python dependencies

Install the main requirements from the repo root:

```bash
pip install -r requirements.txt
```

### 2. SAM2

SAM2 must be installed separately (it is not on PyPI):

```bash
pip install git+https://github.com/facebookresearch/segment-anything-2.git
```

Then download the SAM2 checkpoint. The pipeline defaults to `sam2_hiera_large.pt` one level above this directory:

```bash
# from the repo root (or wherever you want to store it)
wget https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_large.pt
```

If you place the checkpoint elsewhere, pass it explicitly:

```bash
bash run_pipeline.sh config_points.json --sam2-checkpoint /path/to/sam2_hiera_large.pt
```

> SAM2 requires **Python ≥ 3.10** and **PyTorch ≥ 2.3.1** with CUDA. See the [SAM2 repo](https://github.com/facebookresearch/segment-anything-2) for full system requirements.

### 3. Gemini API key

Stage 2 uses the Gemini VLM. Export your API key before running the pipeline:

```bash
export GEMINI_API_KEY="your_key_here"
```
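Stage 2 will fail without this variable set, so it is worth checking before any GPU work starts. A small sketch of a fail-fast guard (the pipeline's own error handling may differ; the helper name is illustrative):

```python
import os

def require_gemini_key() -> str:
    """Return the Gemini API key from the environment, or raise a clear
    error before any pipeline work starts."""
    key = os.environ.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            'GEMINI_API_KEY is not set. Run: export GEMINI_API_KEY="your_key_here"'
        )
    return key
```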
VLM-MASK-REASONER/__pycache__/stage1_sam2_segmentation.cpython-312.pyc ADDED
Binary file (19.3 kB)

VLM-MASK-REASONER/__pycache__/stage2_vlm_analysis.cpython-312.pyc ADDED
Binary file (42.6 kB)

VLM-MASK-REASONER/__pycache__/stage3a_generate_grey_masks_v2.cpython-312.pyc ADDED
Binary file (26 kB)

VLM-MASK-REASONER/__pycache__/stage4_combine_masks.cpython-312.pyc ADDED
Binary file (9.67 kB)
VLM-MASK-REASONER/edit_quadmask.py ADDED
@@ -0,0 +1,831 @@
#!/usr/bin/env python3
"""
Mask Editor GUI - Edit gridified video masks with grid toggling and brush tools
"""
import cv2
import numpy as np
import tkinter as tk
from tkinter import ttk, filedialog, messagebox
from PIL import Image, ImageTk
import subprocess
from pathlib import Path
import copy
import time

class MaskEditorGUI:
    def __init__(self, root):
        self.root = root
        self.root.title("Mask Editor")

        # Video data
        self.rgb_frames = []
        self.mask_frames = []
        self.current_frame = 0
        self.grid_rows = 0
        self.grid_cols = 0
        self.min_grid = 8

        # Edit state
        self.undo_stack = []
        self.redo_stack = []
        self.current_tool = "grid"  # "grid" or "brush"
        self.brush_size = 20
        self.brush_mode = "add"  # "add" or "erase"

        # Display state
        self.display_scale = 1.0
        self.rgb_photo = None
        self.mask_photo = None
        self.dragging = False
        self.last_brush_pos = None
        self.last_update_time = 0
        self.update_interval = 0.2  # Update every 200ms during dragging (5 FPS - less choppy)
        self.cached_rgb_frame = None  # Cache current RGB frame
        self.cached_frame_idx = -1  # Track which frame is cached
        self.pending_update = False  # Track if update is needed after drag
        self.brush_repeat_id = None  # Timer for continuous brush application

        # Paths
        self.folder_path = None
        self.mask_path = None
        self.rgb_path = None

        self.setup_ui()

    def setup_ui(self):
        """Setup the GUI layout"""
        # Menu bar
        menubar = tk.Menu(self.root)
        self.root.config(menu=menubar)

        file_menu = tk.Menu(menubar, tearoff=0)
        menubar.add_cascade(label="File", menu=file_menu)
        file_menu.add_command(label="Open Folder", command=self.load_folder)
        file_menu.add_command(label="Save Mask", command=self.save_mask)
        file_menu.add_separator()
        file_menu.add_command(label="Exit", command=self.root.quit)

        edit_menu = tk.Menu(menubar, tearoff=0)
        menubar.add_cascade(label="Edit", menu=edit_menu)
        edit_menu.add_command(label="Undo", command=self.undo, accelerator="Ctrl+Z")
        edit_menu.add_command(label="Redo", command=self.redo, accelerator="Ctrl+Y")

        # Keyboard shortcuts
        self.root.bind("<Control-z>", lambda e: self.undo())
        self.root.bind("<Control-y>", lambda e: self.redo())
        self.root.bind("<Left>", lambda e: self.prev_frame())
        self.root.bind("<Right>", lambda e: self.next_frame())

        # Top toolbar
        toolbar = ttk.Frame(self.root)
        toolbar.pack(side=tk.TOP, fill=tk.X, padx=5, pady=5)

        ttk.Label(toolbar, text="Folder:").pack(side=tk.LEFT)
        self.folder_label = ttk.Label(toolbar, text="None", foreground="gray")
        self.folder_label.pack(side=tk.LEFT, padx=5)

        ttk.Button(toolbar, text="Open Folder", command=self.load_folder).pack(side=tk.LEFT, padx=5)
        ttk.Button(toolbar, text="Save Mask", command=self.save_mask).pack(side=tk.LEFT, padx=5)

        # Main content area
        content = ttk.Frame(self.root)
        content.pack(side=tk.TOP, fill=tk.BOTH, expand=True, padx=5, pady=5)

        # Left panel - Original video
        left_panel = ttk.LabelFrame(content, text="Original Video")
        left_panel.pack(side=tk.LEFT, fill=tk.BOTH, expand=True, padx=5)

        self.rgb_canvas = tk.Canvas(left_panel, width=640, height=480, bg='black')
        self.rgb_canvas.pack(fill=tk.BOTH, expand=True)

        # Right panel - Mask
        right_panel = ttk.LabelFrame(content, text="Mask (Editable)")
        right_panel.pack(side=tk.LEFT, fill=tk.BOTH, expand=True, padx=5)

        self.mask_canvas = tk.Canvas(right_panel, width=640, height=480, bg='black')
        self.mask_canvas.pack(fill=tk.BOTH, expand=True)
        self.mask_canvas.bind("<Button-1>", self.on_mask_click)
        self.mask_canvas.bind("<B1-Motion>", self.on_mask_drag)
        self.mask_canvas.bind("<ButtonRelease-1>", self.on_mask_release)

        # Bottom controls
        controls = ttk.Frame(self.root)
        controls.pack(side=tk.BOTTOM, fill=tk.X, padx=5, pady=5)

        # Frame navigation
        nav_frame = ttk.LabelFrame(controls, text="Frame Navigation")
        nav_frame.pack(side=tk.TOP, fill=tk.X, pady=5)

        ttk.Button(nav_frame, text="<<", command=self.first_frame, width=5).pack(side=tk.LEFT, padx=2)
        ttk.Button(nav_frame, text="<", command=self.prev_frame, width=5).pack(side=tk.LEFT, padx=2)

        self.frame_label = ttk.Label(nav_frame, text="Frame: 0 / 0")
        self.frame_label.pack(side=tk.LEFT, padx=10)

        ttk.Button(nav_frame, text=">", command=self.next_frame, width=5).pack(side=tk.LEFT, padx=2)
        ttk.Button(nav_frame, text=">>", command=self.last_frame, width=5).pack(side=tk.LEFT, padx=2)

        self.frame_slider = ttk.Scale(nav_frame, from_=0, to=100, orient=tk.HORIZONTAL,
                                      command=self.on_slider_change)
        self.frame_slider.pack(side=tk.LEFT, fill=tk.X, expand=True, padx=10)

        # Tool selection
        tool_frame = ttk.LabelFrame(controls, text="Tools")
        tool_frame.pack(side=tk.TOP, fill=tk.X, pady=5)

        self.tool_var = tk.StringVar(value="grid")
        ttk.Radiobutton(tool_frame, text="Grid Toggle", variable=self.tool_var,
                        value="grid", command=self.on_tool_change).pack(side=tk.LEFT, padx=5)
        ttk.Radiobutton(tool_frame, text="Grid Black Toggle", variable=self.tool_var,
                        value="grid_black", command=self.on_tool_change).pack(side=tk.LEFT, padx=5)
        ttk.Radiobutton(tool_frame, text="Brush (Add Black)", variable=self.tool_var,
                        value="brush_add", command=self.on_tool_change).pack(side=tk.LEFT, padx=5)
        ttk.Radiobutton(tool_frame, text="Brush (Erase Black)", variable=self.tool_var,
                        value="brush_erase", command=self.on_tool_change).pack(side=tk.LEFT, padx=5)

        ttk.Label(tool_frame, text="Brush Size:").pack(side=tk.LEFT, padx=10)
        self.brush_slider = ttk.Scale(tool_frame, from_=5, to=100, orient=tk.HORIZONTAL,
                                      command=self.on_brush_size_change)
        self.brush_slider.set(20)
        self.brush_slider.pack(side=tk.LEFT, fill=tk.X, expand=True, padx=5)
        self.brush_size_label = ttk.Label(tool_frame, text="20")
        self.brush_size_label.pack(side=tk.LEFT, padx=5)

        # Copy from previous frame
        copy_frame = ttk.LabelFrame(controls, text="Copy from Previous Frame")
        copy_frame.pack(side=tk.TOP, fill=tk.X, pady=5)

        ttk.Button(copy_frame, text="Copy Black Mask",
                   command=self.copy_black_from_previous).pack(side=tk.LEFT, padx=5)
        ttk.Button(copy_frame, text="Copy Grey Mask",
                   command=self.copy_grey_from_previous).pack(side=tk.LEFT, padx=5)

        # Info panel
        info_frame = ttk.Frame(controls)
        info_frame.pack(side=tk.TOP, fill=tk.X, pady=5)

        self.info_label = ttk.Label(info_frame, text="Load a folder to begin", foreground="blue")
        self.info_label.pack(side=tk.LEFT, padx=5)

        self.grid_info_label = ttk.Label(info_frame, text="Grid: N/A")
        self.grid_info_label.pack(side=tk.RIGHT, padx=5)

    def calculate_square_grid(self, width, height, min_grid=8):
        """Calculate grid dimensions to make square cells"""
        aspect_ratio = width / height

        if width >= height:
            grid_rows = min_grid
            grid_cols = max(min_grid, round(min_grid * aspect_ratio))
        else:
            grid_cols = min_grid
            grid_rows = max(min_grid, round(min_grid / aspect_ratio))

        return grid_rows, grid_cols

    def load_folder(self):
        """Load a folder containing rgb_full.mp4/input_video.mp4 and quadmask_0.mp4"""
        folder = filedialog.askdirectory(title="Select Folder")
        if not folder:
            return

        folder_path = Path(folder)

        # Find RGB video
        rgb_path = None
        for name in ["rgb_full.mp4", "input_video.mp4"]:
            candidate = folder_path / name
            if candidate.exists():
                rgb_path = candidate
                break

        mask_path = folder_path / "quadmask_0.mp4"

        if not rgb_path or not mask_path.exists():
            messagebox.showerror("Error", "Folder must contain quadmask_0.mp4 and rgb_full.mp4 or input_video.mp4")
            return

        self.folder_path = folder_path
        self.rgb_path = rgb_path
        self.mask_path = mask_path

        # Load videos
        self.load_videos()

    def load_videos(self):
        """Load RGB and mask videos into memory"""
        self.info_label.config(text="Loading videos...")
        self.root.update()

        # Load RGB frames
        self.rgb_frames = self.read_video_frames(self.rgb_path)

        # Load mask frames
        self.mask_frames = self.read_video_frames(self.mask_path)

        if len(self.rgb_frames) != len(self.mask_frames):
            messagebox.showwarning("Warning",
                                   f"Frame count mismatch: RGB={len(self.rgb_frames)}, Mask={len(self.mask_frames)}")

        if len(self.mask_frames) == 0:
            messagebox.showerror("Error", "No frames loaded")
            return

        # Calculate grid dimensions
        height, width = self.mask_frames[0].shape[:2]
        self.grid_rows, self.grid_cols = self.calculate_square_grid(width, height, self.min_grid)

        # Calculate display scale
        max_width = 600
        max_height = 450
        scale_w = max_width / width
        scale_h = max_height / height
        self.display_scale = min(scale_w, scale_h, 1.0)

        # Update UI
        self.folder_label.config(text=self.folder_path.name, foreground="black")
        self.grid_info_label.config(text=f"Grid: {self.grid_rows}x{self.grid_cols}")
        self.frame_slider.config(to=len(self.mask_frames)-1)
        self.current_frame = 0
        self.undo_stack = []
        self.redo_stack = []

        self.update_display()
        self.info_label.config(text=f"Loaded {len(self.mask_frames)} frames", foreground="green")

    def read_video_frames(self, video_path):
        """Read all frames from a video"""
        cap = cv2.VideoCapture(str(video_path))
        frames = []
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            # Convert to grayscale if needed
            if len(frame.shape) == 3:
                frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            frames.append(frame)
        cap.release()
        return frames

    def write_video_frames(self, frames, output_path, fps=12):
        """Write frames to a video file using lossless H.264"""
        if not frames:
            return

        height, width = frames[0].shape[:2]

        # Write temp AVI first
        temp_avi = output_path.with_suffix('.avi')
        fourcc = cv2.VideoWriter_fourcc(*'FFV1')
        out = cv2.VideoWriter(str(temp_avi), fourcc, fps, (width, height), isColor=False)

        for frame in frames:
            if len(frame.shape) == 3:
                frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            out.write(frame)

        out.release()

        # Convert to LOSSLESS H.264 (qp=0)
        cmd = [
            'ffmpeg', '-y', '-i', str(temp_avi),
            '-c:v', 'libx264', '-qp', '0', '-preset', 'ultrafast',
            '-pix_fmt', 'yuv444p', '-r', '12',
            str(output_path)
        ]
        subprocess.run(cmd, capture_output=True)
        temp_avi.unlink()

    def update_display(self, fast_mode=False):
        """Update both canvas displays (or just mask in fast mode)"""
        if not self.mask_frames:
            return

        # Cache RGB frame if needed (only in full mode)
        if not fast_mode and self.cached_frame_idx != self.current_frame:
            if self.current_frame < len(self.rgb_frames):
                rgb_frame = self.rgb_frames[self.current_frame]
                self.cached_rgb_frame = cv2.cvtColor(rgb_frame, cv2.COLOR_GRAY2RGB) if len(rgb_frame.shape) == 2 else rgb_frame.copy()
                self.cached_frame_idx = self.current_frame
            else:
                self.cached_rgb_frame = None

        if not fast_mode:
            # Update frame label
            self.frame_label.config(text=f"Frame: {self.current_frame + 1} / {len(self.mask_frames)}")
            self.frame_slider.set(self.current_frame)

            # Display RGB frame
            if self.cached_rgb_frame is not None:
                rgb_display = cv2.resize(self.cached_rgb_frame, None, fx=self.display_scale, fy=self.display_scale)
                rgb_image = Image.fromarray(rgb_display)
                self.rgb_photo = ImageTk.PhotoImage(rgb_image)
                self.rgb_canvas.delete("all")
                self.rgb_canvas.create_image(0, 0, anchor=tk.NW, image=self.rgb_photo)

        # Display mask frame with grid overlay
        mask_frame = self.mask_frames[self.current_frame]
        # Use simple mode (no RGB blending) during fast updates for speed
        mask_display = self.create_mask_visualization(mask_frame, self.cached_rgb_frame, simple_mode=fast_mode)
        mask_display = cv2.resize(mask_display, None, fx=self.display_scale, fy=self.display_scale)
        mask_image = Image.fromarray(mask_display)
        self.mask_photo = ImageTk.PhotoImage(mask_image)
        self.mask_canvas.delete("all")
        self.mask_canvas.create_image(0, 0, anchor=tk.NW, image=self.mask_photo)

        # Draw grid overlay (skip in fast mode for performance)
        if not fast_mode:
            self.draw_grid_overlay()

    def create_mask_visualization(self, mask_frame, rgb_frame=None, simple_mode=False):
        """Create RGB visualization of mask with color coding and RGB background"""
        height, width = mask_frame.shape
        vis = np.zeros((height, width, 3), dtype=np.uint8)

        if simple_mode:
            # Fast simple mode - no blending, just solid colors
            vis[mask_frame == 255] = [150, 150, 150]  # Background - gray
            vis[mask_frame == 127] = [0, 200, 0]      # Gridified - green
            vis[mask_frame == 63] = [200, 200, 0]     # Overlap - yellow
            vis[mask_frame == 0] = [200, 0, 0]        # Black - red
            return vis

        # If RGB frame is provided, use it as background
        if rgb_frame is not None:
            if len(rgb_frame.shape) == 2:
                # Convert grayscale to RGB
                rgb_background = cv2.cvtColor(rgb_frame, cv2.COLOR_GRAY2RGB)
            else:
                rgb_background = rgb_frame.copy()

            # Use it at 50% opacity as base
            vis = (rgb_background * 0.5).astype(np.uint8)

        # Color coding with transparency to show background:
        # 0 (black) -> Red tint (to indicate removal area)
        # 63 (overlap) -> Yellow tint
        # 127 (gridified) -> Green tint
        # 255 (background) -> Keep RGB background visible

        # Background areas - show RGB at 60% brightness
        bg_mask = mask_frame == 255
        if rgb_frame is not None:
            vis[bg_mask] = (rgb_background[bg_mask] * 0.6).astype(np.uint8)
        else:
            vis[bg_mask] = [150, 150, 150]

        # Green overlay for gridified areas - blend 40% background + 60% green tint
        green_mask = mask_frame == 127
        if rgb_frame is not None:
            vis[green_mask] = np.clip(rgb_background[green_mask] * 0.4 + np.array([0, 180, 0]) * 0.6, 0, 255).astype(np.uint8)
        else:
            vis[green_mask] = [0, 200, 0]

        # Yellow overlay for overlap areas - blend 40% background + 60% yellow tint
        yellow_mask = mask_frame == 63
        if rgb_frame is not None:
            vis[yellow_mask] = np.clip(rgb_background[yellow_mask] * 0.4 + np.array([180, 180, 0]) * 0.6, 0, 255).astype(np.uint8)
        else:
            vis[yellow_mask] = [200, 200, 0]

        # Red tint for black areas (removal) - blend 30% background + 70% red tint
        black_mask = mask_frame == 0
        if rgb_frame is not None:
            vis[black_mask] = np.clip(rgb_background[black_mask] * 0.3 + np.array([200, 0, 0]) * 0.7, 0, 255).astype(np.uint8)
        else:
            vis[black_mask] = [200, 0, 0]

        return vis

    def draw_grid_overlay(self):
        """Draw grid lines on mask canvas"""
        if not self.mask_frames:
            return

        height, width = self.mask_frames[0].shape

        scaled_width = int(width * self.display_scale)
        scaled_height = int(height * self.display_scale)

        cell_width = scaled_width / self.grid_cols
        cell_height = scaled_height / self.grid_rows

        # Draw vertical lines
        for col in range(self.grid_cols + 1):
            x = int(col * cell_width)
            self.mask_canvas.create_line(x, 0, x, scaled_height, fill='red', width=1, tags='grid')

        # Draw horizontal lines
        for row in range(self.grid_rows + 1):
            y = int(row * cell_height)
            self.mask_canvas.create_line(0, y, scaled_width, y, fill='red', width=1, tags='grid')

    def get_grid_from_pos(self, x, y):
        """Get grid row, col from canvas position"""
        if not self.mask_frames:
            return None, None

        height, width = self.mask_frames[0].shape

        # Convert to frame coordinates
        frame_x = int(x / self.display_scale)
        frame_y = int(y / self.display_scale)

        if frame_x < 0 or frame_x >= width or frame_y < 0 or frame_y >= height:
            return None, None

        cell_width = width / self.grid_cols
        cell_height = height / self.grid_rows

        col = int(frame_x / cell_width)
        row = int(frame_y / cell_height)

        return row, col

    def toggle_grid(self, row, col):
        """Toggle a grid cell between 127 and 255, handling 63 overlaps"""
        if row is None or col is None:
            return

        if row < 0 or row >= self.grid_rows or col < 0 or col >= self.grid_cols:
            return

        # Save state for undo
        self.save_state()

        mask = self.mask_frames[self.current_frame]
        height, width = mask.shape

        cell_width = width / self.grid_cols
        cell_height = height / self.grid_rows

        y1 = int(row * cell_height)
        y2 = int((row + 1) * cell_height)
        x1 = int(col * cell_width)
        x2 = int((col + 1) * cell_width)

        grid_region = mask[y1:y2, x1:x2]

        # Check if grid has any 127 or 63 values
        has_active = np.any((grid_region == 127) | (grid_region == 63))

        if has_active:
            # Turn OFF: 127->255, 63->0, keep 0 and 255 as is
            mask[y1:y2, x1:x2] = np.where(grid_region == 127, 255,
                                          np.where(grid_region == 63, 0, grid_region))
        else:
            # Turn ON: 255->127, 0->63, keep others as is
            mask[y1:y2, x1:x2] = np.where(grid_region == 255, 127,
                                          np.where(grid_region == 0, 63, grid_region))

        self.update_display()

    def toggle_grid_black(self, row, col):
        """Toggle black mask in a grid cell"""
        if row is None or col is None:
            return

        if row < 0 or row >= self.grid_rows or col < 0 or col >= self.grid_cols:
            return

        # Save state for undo
        self.save_state()

        mask = self.mask_frames[self.current_frame]
        height, width = mask.shape

        cell_width = width / self.grid_cols
        cell_height = height / self.grid_rows

        y1 = int(row * cell_height)
        y2 = int((row + 1) * cell_height)
        x1 = int(col * cell_width)
        x2 = int((col + 1) * cell_width)

        grid_region = mask[y1:y2, x1:x2]

        # Check if grid has any black (0 or 63 values)
        has_black = np.any((grid_region == 0) | (grid_region == 63))

        if has_black:
            # Turn OFF black: 0->255, 63->127, keep 127 and 255 as is
            mask[y1:y2, x1:x2] = np.where(grid_region == 0, 255,
                                          np.where(grid_region == 63, 127, grid_region))
        else:
            # Turn ON black: 255->0, 127->63, keep others as is
            mask[y1:y2, x1:x2] = np.where(grid_region == 255, 0,
                                          np.where(grid_region == 127, 63, grid_region))

        self.update_display()

    def apply_brush(self, x, y, mode="add"):
        """Apply brush to add/erase black mask (vectorized for speed)"""
        if not self.mask_frames:
            return

        mask = self.mask_frames[self.current_frame]
        height, width = mask.shape

        # Convert to frame coordinates
        frame_x = int(x / self.display_scale)
        frame_y = int(y / self.display_scale)

        if frame_x < 0 or frame_x >= width or frame_y < 0 or frame_y >= height:
            return

        # Create circular brush using vectorized operations
        radius = int(self.brush_size / 2)

        y1 = max(0, frame_y - radius)
        y2 = min(height, frame_y + radius + 1)
        x1 = max(0, frame_x - radius)
        x2 = min(width, frame_x + radius + 1)

        # Get the region
        region = mask[y1:y2, x1:x2]

        # Create coordinate grids for distance calculation
        yy, xx = np.ogrid[y1:y2, x1:x2]
        dist = np.sqrt((xx - frame_x)**2 + (yy - frame_y)**2)
        brush_mask = dist <= radius

        if mode == "add":
            # Add black: 255->0, 127->63
            region[brush_mask & (region == 255)] = 0
            region[brush_mask & (region == 127)] = 63
        else:  # erase
            # Erase black: 0->255, 63->127
            region[brush_mask & (region == 0)] = 255
            region[brush_mask & (region == 63)] = 127

    def on_mask_click(self, event):
        """Handle click on mask canvas"""
        if not self.mask_frames:
            return

        tool = self.tool_var.get()

        if tool == "grid":
            row, col = self.get_grid_from_pos(event.x, event.y)
            self.toggle_grid(row, col)
        elif tool == "grid_black":
            row, col = self.get_grid_from_pos(event.x, event.y)
            self.toggle_grid_black(row, col)
        elif tool in ["brush_add", "brush_erase"]:
            self.save_state()
            mode = "add" if tool == "brush_add" else "erase"
            self.apply_brush(event.x, event.y, mode)
            self.dragging = True
            self.last_brush_pos = (event.x, event.y)
            self.last_update_time = time.time()
            self.update_display(fast_mode=True)
            # Start continuous brush application
            self.schedule_brush_repeat()

    def on_mask_drag(self, event):
        """Handle drag on mask canvas with throttled updates"""
        if not self.dragging:
            return

        tool = self.tool_var.get()
        if tool in ["brush_add", "brush_erase"]:
            # Update brush position when moving
            self.last_brush_pos = (event.x, event.y)
            mode = "add" if tool == "brush_add" else "erase"
            self.apply_brush(event.x, event.y, mode)

            # Only update display if enough time has passed (fast mode - no grid)
            current_time = time.time()
            if current_time - self.last_update_time >= self.update_interval:
                self.update_display(fast_mode=True)
                self.last_update_time = current_time

    def on_mask_release(self, event):
        """Handle release on mask canvas"""
        self.dragging = False
        self.last_brush_pos = None
        # Cancel continuous brush application
        if self.brush_repeat_id:
            self.root.after_cancel(self.brush_repeat_id)
            self.brush_repeat_id = None
        # Final full update when releasing to show the complete result with blending
        self.update_display(fast_mode=False)

    def schedule_brush_repeat(self):
        """Schedule continuous brush application while mouse is held down"""
        if self.dragging and self.last_brush_pos:
            tool = self.tool_var.get()
            if tool in ["brush_add", "brush_erase"]:
                mode = "add" if tool == "brush_add" else "erase"
                x, y = self.last_brush_pos
                self.apply_brush(x, y, mode)

                # Update display if enough time has passed
                current_time = time.time()
                if current_time - self.last_update_time >= self.update_interval:
                    self.update_display(fast_mode=True)
                    self.last_update_time = current_time

            # Schedule next application (every 30ms for smooth continuous painting)
            self.brush_repeat_id = self.root.after(30, self.schedule_brush_repeat)

    def copy_black_from_previous(self):
        """Copy ONLY black component from previous frame, preserving grey in current frame"""
        if not self.mask_frames:
            messagebox.showwarning("Warning", "No mask loaded")
            return

        if self.current_frame == 0:
            messagebox.showwarning("Warning", "Cannot copy from previous frame - already at first frame")
            return

        # Save state for undo
        self.save_state()

        prev_mask = self.mask_frames[self.current_frame - 1]
        curr_mask = self.mask_frames[self.current_frame]

        # Copy ONLY the black component from previous frame
        # Where prev has black (0 or 63): add black to curr
        # Where prev doesn't have black (127 or 255): remove black from curr

        has_black_in_prev = (prev_mask == 0) | (prev_mask == 63)
        no_black_in_prev = (prev_mask == 127) | (prev_mask == 255)

        # Remove black where prev doesn't have it (preserve grey)
        curr_mask[no_black_in_prev & (curr_mask == 0)] = 255   # 0 -> 255
        curr_mask[no_black_in_prev & (curr_mask == 63)] = 127  # 63 -> 127 (keep grey)

        # Add black where prev has it (preserve grey)
        curr_mask[has_black_in_prev & (curr_mask == 255)] = 0  # 255 -> 0
        curr_mask[has_black_in_prev & (curr_mask == 127)] = 63 # 127 -> 63 (keep grey, add black)

        self.update_display()
        self.info_label.config(text="Copied black mask from previous frame", foreground="green")

    def copy_grey_from_previous(self):
        """Copy ONLY grey component from previous frame, preserving black in current frame"""
        if not self.mask_frames:
            messagebox.showwarning("Warning", "No mask loaded")
            return

        if self.current_frame == 0:
            messagebox.showwarning("Warning", "Cannot copy from previous frame - already at first frame")
            return

        # Save state for undo
        self.save_state()
679
+
680
+ prev_mask = self.mask_frames[self.current_frame - 1]
681
+ curr_mask = self.mask_frames[self.current_frame]
682
+
683
+ # Copy ONLY the grey component from previous frame
684
+ # Where prev has grey (127 or 63): add grey to curr
685
+ # Where prev doesn't have grey (0 or 255): remove grey from curr
686
+
687
+ has_grey_in_prev = (prev_mask == 127) | (prev_mask == 63)
688
+ no_grey_in_prev = (prev_mask == 0) | (prev_mask == 255)
689
+
690
+ # Remove grey where prev doesn't have it (preserve black)
691
+ curr_mask[no_grey_in_prev & (curr_mask == 127)] = 255 # 127 β†’ 255
692
+ curr_mask[no_grey_in_prev & (curr_mask == 63)] = 0 # 63 β†’ 0 (keep black)
693
+
694
+ # Add grey where prev has it (preserve black)
695
+ curr_mask[has_grey_in_prev & (curr_mask == 255)] = 127 # 255 β†’ 127
696
+ curr_mask[has_grey_in_prev & (curr_mask == 0)] = 63 # 0 β†’ 63 (keep black, add grey)
697
+
698
+ self.update_display()
699
+ self.info_label.config(text="Copied grey mask from previous frame", foreground="green")
700
+
701
+ def save_state(self):
702
+ """Save current state for undo"""
703
+ if not self.mask_frames:
704
+ return
705
+
706
+ # Save deep copy of current frame
707
+ state = {
708
+ 'frame': self.current_frame,
709
+ 'mask': self.mask_frames[self.current_frame].copy()
710
+ }
711
+ self.undo_stack.append(state)
712
+ self.redo_stack.clear()
713
+
714
+ # Limit undo stack size
715
+ if len(self.undo_stack) > 50:
716
+ self.undo_stack.pop(0)
717
+
718
+ def undo(self):
719
+ """Undo last edit"""
720
+ if not self.undo_stack:
721
+ return
722
+
723
+ # Save current state to redo
724
+ redo_state = {
725
+ 'frame': self.current_frame,
726
+ 'mask': self.mask_frames[self.current_frame].copy()
727
+ }
728
+ self.redo_stack.append(redo_state)
729
+
730
+ # Restore previous state
731
+ state = self.undo_stack.pop()
732
+ self.current_frame = state['frame']
733
+ self.mask_frames[self.current_frame] = state['mask']
734
+
735
+ self.update_display()
736
+
737
+ def redo(self):
738
+ """Redo last undone edit"""
739
+ if not self.redo_stack:
740
+ return
741
+
742
+ # Save current state to undo
743
+ undo_state = {
744
+ 'frame': self.current_frame,
745
+ 'mask': self.mask_frames[self.current_frame].copy()
746
+ }
747
+ self.undo_stack.append(undo_state)
748
+
749
+ # Restore redo state
750
+ state = self.redo_stack.pop()
751
+ self.current_frame = state['frame']
752
+ self.mask_frames[self.current_frame] = state['mask']
753
+
754
+ self.update_display()
755
+
756
+ def save_mask(self):
757
+ """Save edited mask back to quadmask_0.mp4"""
758
+ if not self.mask_frames or not self.mask_path:
759
+ messagebox.showwarning("Warning", "No mask loaded")
760
+ return
761
+
762
+ # Confirm save
763
+ result = messagebox.askyesno("Confirm Save",
764
+ f"Save mask to {self.mask_path.name}?\nThis will overwrite the existing file.")
765
+ if not result:
766
+ return
767
+
768
+ self.info_label.config(text="Saving mask...", foreground="blue")
769
+ self.root.update()
770
+
771
+ # Write video
772
+ self.write_video_frames(self.mask_frames, self.mask_path)
773
+
774
+ self.info_label.config(text="Mask saved successfully!", foreground="green")
775
+ messagebox.showinfo("Success", f"Mask saved to {self.mask_path.name}!")
776
+
777
+ def first_frame(self):
778
+ """Go to first frame"""
779
+ self.current_frame = 0
780
+ self.update_display()
781
+
782
+ def last_frame(self):
783
+ """Go to last frame"""
784
+ if self.mask_frames:
785
+ self.current_frame = len(self.mask_frames) - 1
786
+ self.update_display()
787
+
788
+ def prev_frame(self):
789
+ """Go to previous frame"""
790
+ if self.current_frame > 0:
791
+ self.current_frame -= 1
792
+ self.update_display()
793
+
794
+ def next_frame(self):
795
+ """Go to next frame"""
796
+ if self.mask_frames and self.current_frame < len(self.mask_frames) - 1:
797
+ self.current_frame += 1
798
+ self.update_display()
799
+
800
+ def on_slider_change(self, value):
801
+ """Handle slider change"""
802
+ if not self.mask_frames:
803
+ return
804
+
805
+ new_frame = int(float(value))
806
+ if new_frame != self.current_frame:
807
+ self.current_frame = new_frame
808
+ self.update_display()
809
+
810
+ def on_tool_change(self):
811
+ """Handle tool selection change"""
812
+ tool = self.tool_var.get()
813
+ if tool == "grid":
814
+ self.info_label.config(text="Grid Toggle: Click grids to toggle 127↔255", foreground="blue")
815
+ elif tool == "grid_black":
816
+ self.info_label.config(text="Grid Black Toggle: Click grids to toggle black mask (0/63)", foreground="blue")
817
+ elif tool == "brush_add":
818
+ self.info_label.config(text="Brush (Add): Paint black mask areas", foreground="blue")
819
+ else: # brush_erase
820
+ self.info_label.config(text="Brush (Erase): Erase black mask areas", foreground="blue")
821
+
822
+ def on_brush_size_change(self, value):
823
+ """Handle brush size change"""
824
+ self.brush_size = int(float(value))
825
+ self.brush_size_label.config(text=str(self.brush_size))
826
+
827
+ if __name__ == "__main__":
828
+ root = tk.Tk()
829
+ root.geometry("1400x800")
830
+ app = MaskEditorGUI(root)
831
+ root.mainloop()
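
The `copy_black_from_previous` / `copy_grey_from_previous` editors above treat each quadmask pixel as two independent boolean layers (black and grey) packed into four grey values: 255 = neither, 127 = grey only, 0 = black only, 63 = both. A minimal standalone sketch of that encoding, with illustrative helper names (`decompose`, `compose`, `copy_black` are not part of the editor):

```python
import numpy as np

# Quadmask pixel encoding: 255 = background, 127 = grey layer only,
# 0 = black layer only, 63 = black + grey.

def decompose(mask: np.ndarray):
    """Split a quadmask frame into (black, grey) boolean layers."""
    black = (mask == 0) | (mask == 63)
    grey = (mask == 127) | (mask == 63)
    return black, grey

def compose(black: np.ndarray, grey: np.ndarray) -> np.ndarray:
    """Recombine the boolean layers into the four quadmask values."""
    out = np.full(black.shape, 255, dtype=np.uint8)
    out[grey & ~black] = 127
    out[black & ~grey] = 0
    out[black & grey] = 63
    return out

def copy_black(prev: np.ndarray, curr: np.ndarray) -> np.ndarray:
    """Layer-level equivalent of copy_black_from_previous:
    take the black layer from prev, keep curr's grey layer."""
    prev_black, _ = decompose(prev)
    _, curr_grey = decompose(curr)
    return compose(prev_black, curr_grey)
```

Working in decomposed layers gives the same result as the in-place value remapping in the editor, and makes the "preserve grey while replacing black" invariant explicit.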
VLM-MASK-REASONER/point_selector_gui.py ADDED
@@ -0,0 +1,601 @@
+ #!/usr/bin/env python3
+ """
+ Point Selector GUI - Multi-Frame Support
+
+ NEW: Support adding points across multiple frames for complex cases
+ Example: Car appears at frame 0, hand carrying it appears at frame 30
+     → Add points on car at frame 0, points on hand at frame 30
+     → Both get segmented together as "primary object to remove"
+
+ Usage:
+     python point_selector_gui.py --config pexel_test_config.json
+ """
+
+ import cv2
+ import numpy as np
+ import tkinter as tk
+ from tkinter import ttk, filedialog, messagebox
+ from PIL import Image, ImageTk
+ import json
+ import argparse
+ from pathlib import Path
+ from typing import List, Dict, Tuple
+
+
+ class PointSelectorGUI:
+     def __init__(self, root, config_path=None):
+         self.root = root
+         self.root.title("Point Selector - Multi-Frame Support")
+
+         # Data
+         self.config_path = config_path
+         self.config_data = None
+         self.current_video_idx = 0
+         self.current_frame_idx = 0
+         self.video_captures = []
+         self.total_frames_list = []
+
+         # NEW: Points organized by frame
+         self.points_by_frame = {}  # {frame_idx: [(x, y), ...]}
+         self.all_points_by_frame = []  # List of dicts for all videos
+
+         # Display
+         self.display_scale = 1.0
+         self.photo = None
+         self.point_radius = 8
+
+         self.setup_ui()
+
+         if config_path:
+             self.load_config_direct(config_path)
+
+     def setup_ui(self):
+         """Setup the GUI layout"""
+         # Menu bar
+         menubar = tk.Menu(self.root)
+         self.root.config(menu=menubar)
+
+         file_menu = tk.Menu(menubar, tearoff=0)
+         menubar.add_cascade(label="File", menu=file_menu)
+         file_menu.add_command(label="Load Config", command=self.load_config)
+         file_menu.add_command(label="Save Points", command=self.save_points)
+         file_menu.add_separator()
+         file_menu.add_command(label="Exit", command=self.root.quit)
+
+         # Top toolbar
+         toolbar = ttk.Frame(self.root)
+         toolbar.pack(side=tk.TOP, fill=tk.X, padx=5, pady=5)
+
+         ttk.Label(toolbar, text="Config:").pack(side=tk.LEFT)
+         self.config_label = ttk.Label(toolbar, text="None", foreground="gray")
+         self.config_label.pack(side=tk.LEFT, padx=5)
+
+         ttk.Button(toolbar, text="Load Config", command=self.load_config).pack(side=tk.LEFT, padx=5)
+         ttk.Button(toolbar, text="Save All Points", command=self.save_points).pack(side=tk.LEFT, padx=5)
+
+         # Video info
+         info_frame = ttk.Frame(self.root)
+         info_frame.pack(side=tk.TOP, fill=tk.X, padx=5, pady=5)
+
+         self.video_label = ttk.Label(info_frame, text="Video: None", font=("Arial", 10, "bold"))
+         self.video_label.pack(side=tk.LEFT, padx=5)
+
+         self.instruction_label = ttk.Label(info_frame, text="", foreground="blue")
+         self.instruction_label.pack(side=tk.LEFT, padx=10)
+
+         # Frame navigation controls - COMPACT (Ctrl+←/→ shortcuts)
+         frame_nav = ttk.LabelFrame(self.root, text="Frame Navigation")
+         frame_nav.pack(side=tk.TOP, fill=tk.X, padx=5, pady=2)
+
+         btn_frame = ttk.Frame(frame_nav)
+         btn_frame.pack(side=tk.TOP, fill=tk.X, padx=5, pady=2)
+
+         # Compact buttons
+         ttk.Button(btn_frame, text="<<", command=self.first_frame, width=3).pack(side=tk.LEFT, padx=1)
+         ttk.Button(btn_frame, text="<10", command=lambda: self.prev_frame(10), width=3).pack(side=tk.LEFT, padx=1)
+         ttk.Button(btn_frame, text="<", command=lambda: self.prev_frame(1), width=3).pack(side=tk.LEFT, padx=1)
+
+         self.frame_label = ttk.Label(btn_frame, text="F: 0/0", font=("Arial", 9))
+         self.frame_label.pack(side=tk.LEFT, padx=8)
+
+         ttk.Button(btn_frame, text=">", command=lambda: self.next_frame(1), width=3).pack(side=tk.LEFT, padx=1)
+         ttk.Button(btn_frame, text="10>", command=lambda: self.next_frame(10), width=3).pack(side=tk.LEFT, padx=1)
+         ttk.Button(btn_frame, text=">>", command=self.last_frame, width=3).pack(side=tk.LEFT, padx=1)
+
+         # Slider inline
+         self.frame_slider = ttk.Scale(btn_frame, from_=0, to=100, orient=tk.HORIZONTAL, command=self.on_slider_change, length=250)
+         self.frame_slider.pack(side=tk.LEFT, padx=5)
+
+         # Frames with points inline
+         ttk.Label(btn_frame, text="Points:", font=("Arial", 8)).pack(side=tk.LEFT, padx=3)
+         self.frames_with_points_label = ttk.Label(btn_frame, text="None", foreground="blue", font=("Arial", 8))
+         self.frames_with_points_label.pack(side=tk.LEFT)
+
+         # Main canvas - SMALLER to fit everything
+         canvas_frame = ttk.LabelFrame(self.root, text="Click to add points")
+         canvas_frame.pack(side=tk.TOP, fill=tk.BOTH, expand=True, padx=5, pady=2)
+
+         self.canvas = tk.Canvas(canvas_frame, width=800, height=450, bg='black', cursor="crosshair")
+         self.canvas.pack(fill=tk.BOTH, expand=True)
+         self.canvas.bind("<Button-1>", self.on_canvas_click)
+
+         # Bottom controls - COMPACT
+         controls = ttk.Frame(self.root)
+         controls.pack(side=tk.BOTTOM, fill=tk.X, padx=5, pady=2)
+
+         # Point info - compact
+         point_info = ttk.Frame(controls)
+         point_info.pack(side=tk.TOP, fill=tk.X, pady=2)
+
+         self.point_count_label = ttk.Label(point_info, text="Pts: 0", font=("Arial", 9))
+         self.point_count_label.pack(side=tk.LEFT, padx=5)
+
+         ttk.Button(point_info, text="Clear Frame", command=self.clear_current_frame, width=10).pack(side=tk.LEFT, padx=2)
+         ttk.Button(point_info, text="Clear ALL", command=self.clear_all_frames, width=9).pack(side=tk.LEFT, padx=2)
+         ttk.Button(point_info, text="Undo", command=self.undo_last_point, width=6).pack(side=tk.LEFT, padx=2)
+
+         # Video navigation - compact
+         nav_frame = ttk.Frame(controls)
+         nav_frame.pack(side=tk.TOP, fill=tk.X, pady=2)
+
+         ttk.Button(nav_frame, text="<< First", command=self.first_video, width=8).pack(side=tk.LEFT, padx=2)
+         ttk.Button(nav_frame, text="< Prev", command=self.prev_video, width=8).pack(side=tk.LEFT, padx=2)
+
+         self.nav_label = ttk.Label(nav_frame, text="Video: 0/0", font=("Arial", 10, "bold"))
+         self.nav_label.pack(side=tk.LEFT, padx=15)
+
+         ttk.Button(nav_frame, text="Save & Next >", command=self.save_and_next, width=12).pack(side=tk.LEFT, padx=2)
+         ttk.Button(nav_frame, text="Last >>", command=self.last_video, width=8).pack(side=tk.LEFT, padx=2)
+
+         # Status - compact
+         self.status_label = ttk.Label(controls, text="Load config", foreground="blue", font=("Arial", 8))
+         self.status_label.pack(side=tk.TOP, pady=2)
+
+         # Keyboard shortcuts
+         self.root.bind("<space>", lambda e: self.save_and_next())
+         self.root.bind("<Left>", lambda e: self.prev_video())
+         self.root.bind("<Right>", lambda e: self.save_and_next())
+         self.root.bind("<Control-z>", lambda e: self.undo_last_point())
+         self.root.bind("<Control-Left>", lambda e: self.prev_frame(1))
+         self.root.bind("<Control-Right>", lambda e: self.next_frame(1))
+         self.root.bind("<Control-Shift-Left>", lambda e: self.prev_frame(10))
+         self.root.bind("<Control-Shift-Right>", lambda e: self.next_frame(10))
+
+     def load_config_direct(self, config_path):
+         """Load config from path (for command line usage)"""
+         self.config_path = Path(config_path)
+
+         try:
+             with open(self.config_path, 'r') as f:
+                 self.config_data = json.load(f)
+         except Exception as e:
+             messagebox.showerror("Error", f"Failed to load config: {e}")
+             return
+
+         self.process_config()
+
+     def load_config(self):
+         """Load JSON config file via dialog"""
+         filepath = filedialog.askopenfilename(
+             title="Select Config JSON",
+             filetypes=[("JSON files", "*.json"), ("All files", "*.*")]
+         )
+
+         if not filepath:
+             return
+
+         self.config_path = Path(filepath)
+
+         try:
+             with open(self.config_path, 'r') as f:
+                 self.config_data = json.load(f)
+         except Exception as e:
+             messagebox.showerror("Error", f"Failed to load config: {e}")
+             return
+
+         self.process_config()
+
+     def process_config(self):
+         """Process loaded config"""
+         # Validate config
+         if isinstance(self.config_data, list):
+             videos = self.config_data
+         elif isinstance(self.config_data, dict) and "videos" in self.config_data:
+             videos = self.config_data["videos"]
+         else:
+             messagebox.showerror("Error", "Config must be a list or have 'videos' key")
+             return
+
+         if not isinstance(videos, list) or len(videos) == 0:
+             messagebox.showerror("Error", "No videos in config")
+             return
+
+         self.videos = videos
+
+         # Open video captures
+         self.status_label.config(text="Opening video files...", foreground="blue")
+         self.root.update()
+
+         self.open_videos()
+
+         # Initialize storage - now dict per video
+         self.all_points_by_frame = [{} for _ in range(len(self.videos))]
+
+         # Load existing points if available
+         self.load_existing_points()
+
+         # Update UI
+         self.config_label.config(text=self.config_path.name, foreground="black")
+         self.current_video_idx = 0
+         self.current_frame_idx = 0
+         self.display_current_video()
+
+         self.status_label.config(
+             text=f"Loaded {len(self.videos)} videos. Navigate frames and click points. Can add points on multiple frames!",
+             foreground="green"
+         )
+
+     def open_videos(self):
+         """Open all videos for frame navigation"""
+         self.video_captures = []
+         self.total_frames_list = []
+
+         for i, video_info in enumerate(self.videos):
+             video_path = video_info.get("video_path", "")
+
+             if not video_path:
+                 self.video_captures.append(None)
+                 self.total_frames_list.append(0)
+                 continue
+
+             video_path = Path(video_path)
+             if not video_path.is_absolute():
+                 video_path = self.config_path.parent / video_path
+
+             if not video_path.exists():
+                 messagebox.showwarning("Warning", f"Video not found: {video_path}")
+                 self.video_captures.append(None)
+                 self.total_frames_list.append(0)
+                 continue
+
+             cap = cv2.VideoCapture(str(video_path))
+             if not cap.isOpened():
+                 messagebox.showwarning("Warning", f"Failed to open video: {video_path}")
+                 self.video_captures.append(None)
+                 self.total_frames_list.append(0)
+                 continue
+
+             total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
+             self.video_captures.append(cap)
+             self.total_frames_list.append(total_frames)
+
+             self.status_label.config(text=f"Opened video {i+1}/{len(self.videos)}", foreground="blue")
+             self.root.update()
+
+     def load_existing_points(self):
+         """Load existing points from output file if it exists"""
+         output_path = self.config_path.parent / f"{self.config_path.stem}_points.json"
+
+         if not output_path.exists():
+             return
+
+         try:
+             with open(output_path, 'r') as f:
+                 existing_data = json.load(f)
+
+             if isinstance(existing_data, list):
+                 existing_videos = existing_data
+             elif isinstance(existing_data, dict) and "videos" in existing_data:
+                 existing_videos = existing_data["videos"]
+             else:
+                 return
+
+             for i, video_data in enumerate(existing_videos):
+                 if i < len(self.all_points_by_frame):
+                     # Load multi-frame format
+                     points_by_frame = video_data.get("primary_points_by_frame", {})
+                     # Convert string keys to int
+                     self.all_points_by_frame[i] = {int(k): v for k, v in points_by_frame.items()}
+
+             self.status_label.config(text="Loaded existing points", foreground="green")
+         except Exception as e:
+             print(f"Warning: Could not load existing points: {e}")
+
+     def get_current_frame(self):
+         """Get frame at current_frame_idx from current video"""
+         if self.current_video_idx >= len(self.video_captures):
+             return None
+
+         cap = self.video_captures[self.current_video_idx]
+         if cap is None:
+             return None
+
+         cap.set(cv2.CAP_PROP_POS_FRAMES, self.current_frame_idx)
+         ret, frame = cap.read()
+
+         if not ret:
+             return None
+
+         return cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
+
+     def display_current_video(self):
+         """Display current video frame"""
+         if not self.video_captures:
+             return
+
+         video_info = self.videos[self.current_video_idx]
+         video_path = video_info.get("video_path", "")
+
+         # Update labels
+         self.video_label.config(text=f"Video: {Path(video_path).name}")
+         instruction = video_info.get("instruction", "")
+         if instruction:
+             self.instruction_label.config(text=f"Instruction: {instruction}")
+
+         self.nav_label.config(text=f"Video: {self.current_video_idx + 1}/{len(self.videos)}")
+
+         # Load points for this video
+         self.points_by_frame = self.all_points_by_frame[self.current_video_idx].copy()
+
+         # Update frame controls
+         total_frames = self.total_frames_list[self.current_video_idx]
+         self.frame_slider.config(to=max(1, total_frames - 1))
+         self.frame_slider.set(self.current_frame_idx)
+         self.frame_label.config(text=f"F: {self.current_frame_idx}/{total_frames - 1}")
+
+         # Update frames with points display
+         self.update_frames_display()
+
+         self.display_frame()
+
+     def update_frames_display(self):
+         """Update display showing which frames have points"""
+         if not self.points_by_frame:
+             self.frames_with_points_label.config(text="None", foreground="gray")
+         else:
+             frames = sorted(self.points_by_frame.keys())
+             frames_str = ", ".join(f"F{f}" for f in frames)
+             total_points = sum(len(pts) for pts in self.points_by_frame.values())
+             self.frames_with_points_label.config(
+                 text=f"{frames_str} ({total_points} total points)",
+                 foreground="green"
+             )
+
+     def display_frame(self):
+         """Display current frame with points"""
+         frame = self.get_current_frame()
+         if frame is None:
+             return
+
+         # Draw points for CURRENT frame
+         vis = frame.copy()
+         current_points = self.points_by_frame.get(self.current_frame_idx, [])
+
+         for i, (x, y) in enumerate(current_points):
+             cv2.circle(vis, (x, y), self.point_radius, (255, 0, 0), -1)
+             cv2.circle(vis, (x, y), self.point_radius + 2, (255, 255, 255), 2)
+             cv2.putText(vis, str(i + 1), (x + 12, y + 12), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (255, 255, 255), 2)
+
+         # Show indicator if other frames have points
+         if len(self.points_by_frame) > 0:
+             other_frames = [f for f in self.points_by_frame.keys() if f != self.current_frame_idx]
+             if other_frames:
+                 text = f"Other frames with points: {', '.join(map(str, sorted(other_frames)))}"
+                 cv2.putText(vis, text, (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 255), 2)
+
+         # Scale for display
+         h, w = vis.shape[:2]
+         max_width, max_height = 800, 450
+         scale_w = max_width / w
+         scale_h = max_height / h
+         self.display_scale = min(scale_w, scale_h, 1.0)
+
+         new_w = int(w * self.display_scale)
+         new_h = int(h * self.display_scale)
+         vis_resized = cv2.resize(vis, (new_w, new_h))
+
+         # Convert to PIL and display
+         pil_img = Image.fromarray(vis_resized)
+         self.photo = ImageTk.PhotoImage(pil_img)
+         self.canvas.delete("all")
+         self.canvas.create_image(0, 0, anchor=tk.NW, image=self.photo)
+
+         self.point_count_label.config(text=f"Pts on F{self.current_frame_idx}: {len(current_points)}")
+
+     def on_canvas_click(self, event):
+         """Handle click on canvas - add point to CURRENT frame"""
+         # Convert to frame coordinates
+         x = int(event.x / self.display_scale)
+         y = int(event.y / self.display_scale)
+
+         # Add to current frame
+         if self.current_frame_idx not in self.points_by_frame:
+             self.points_by_frame[self.current_frame_idx] = []
+
+         self.points_by_frame[self.current_frame_idx].append((x, y))
+         self.update_frames_display()
+         self.display_frame()
+
+     def clear_current_frame(self):
+         """Clear points for current frame only"""
+         if self.current_frame_idx in self.points_by_frame:
+             del self.points_by_frame[self.current_frame_idx]
+             self.update_frames_display()
+             self.display_frame()
+
+     def clear_all_frames(self):
+         """Clear all points for current video"""
+         result = messagebox.askyesno("Clear All", "Clear points from ALL frames?")
+         if result:
+             self.points_by_frame = {}
+             self.update_frames_display()
+             self.display_frame()
+
+     def undo_last_point(self):
+         """Remove last point from current frame"""
+         if self.current_frame_idx in self.points_by_frame and self.points_by_frame[self.current_frame_idx]:
+             self.points_by_frame[self.current_frame_idx].pop()
+             if not self.points_by_frame[self.current_frame_idx]:
+                 del self.points_by_frame[self.current_frame_idx]
+             self.update_frames_display()
+             self.display_frame()
+
+     # Frame navigation methods
+     def first_frame(self):
+         """Jump to first frame"""
+         self.current_frame_idx = 0
+         self.frame_slider.set(self.current_frame_idx)
+         self.update_frame_display()
+
+     def last_frame(self):
+         """Jump to last frame"""
+         total_frames = self.total_frames_list[self.current_video_idx]
+         self.current_frame_idx = max(0, total_frames - 1)
+         self.frame_slider.set(self.current_frame_idx)
+         self.update_frame_display()
+
+     def prev_frame(self, step=1):
+         """Go to previous frame"""
+         self.current_frame_idx = max(0, self.current_frame_idx - step)
+         self.frame_slider.set(self.current_frame_idx)
+         self.update_frame_display()
+
+     def next_frame(self, step=1):
+         """Go to next frame"""
+         total_frames = self.total_frames_list[self.current_video_idx]
+         self.current_frame_idx = min(total_frames - 1, self.current_frame_idx + step)
+         self.frame_slider.set(self.current_frame_idx)
+         self.update_frame_display()
+
+     def on_slider_change(self, value):
+         """Handle slider change"""
+         self.current_frame_idx = int(float(value))
+         self.update_frame_display()
+
+     def update_frame_display(self):
+         """Update frame label and display"""
+         total_frames = self.total_frames_list[self.current_video_idx]
+         self.frame_label.config(text=f"F: {self.current_frame_idx}/{total_frames - 1}")
+         self.display_frame()
+
+     # Video navigation
+     def first_video(self):
+         """Jump to first video"""
+         self.save_current_points()
+         self.current_video_idx = 0
+         self.current_frame_idx = 0
+         self.display_current_video()
+
+     def last_video(self):
+         """Jump to last video"""
+         self.save_current_points()
+         self.current_video_idx = len(self.videos) - 1
+         self.current_frame_idx = 0
+         self.display_current_video()
+
+     def prev_video(self):
+         """Go to previous video"""
+         if self.current_video_idx > 0:
+             self.save_current_points()
+             self.current_video_idx -= 1
+             self.current_frame_idx = 0
+             self.display_current_video()
+
+     def save_and_next(self):
+         """Save current points and move to next video"""
+         if len(self.points_by_frame) == 0:
+             result = messagebox.askyesno("No Points", "No points selected for any frame. Continue to next video?")
+             if not result:
+                 return
+
+         self.save_current_points()
+
+         if self.current_video_idx < len(self.videos) - 1:
+             self.current_video_idx += 1
+             self.current_frame_idx = 0
+             self.display_current_video()
+         else:
+             messagebox.showinfo("Complete", "All videos processed!")
+
+     def save_current_points(self):
+         """Save current video's points to storage"""
+         self.all_points_by_frame[self.current_video_idx] = self.points_by_frame.copy()
+
+     def save_points(self):
+         """Save all points to JSON file"""
+         if not self.config_path:
+             messagebox.showerror("Error", "No config loaded")
+             return
+
+         # Save current video first
+         self.save_current_points()
+
+         # Build output
+         output_videos = []
+         for i, video_info in enumerate(self.videos):
+             video_data = video_info.copy()
+
+             points_by_frame = self.all_points_by_frame[i]
+
+             # Convert to serializable format (int keys → string keys for JSON)
+             video_data["primary_points_by_frame"] = {
+                 str(frame_idx): points for frame_idx, points in points_by_frame.items()
+             }
+
+             # Also save list of frames for easy access
+             video_data["primary_frames"] = sorted(points_by_frame.keys())
+
+             # Backwards compatibility: if only one frame, save as before
+             if len(points_by_frame) == 1:
+                 frame_idx = list(points_by_frame.keys())[0]
+                 video_data["first_appears_frame"] = frame_idx
+                 video_data["primary_points"] = points_by_frame[frame_idx]
+             elif len(points_by_frame) > 1:
+                 # Multiple frames - use first frame as "first_appears_frame"
+                 video_data["first_appears_frame"] = min(points_by_frame.keys())
+                 # Flatten all points for backwards compat (not ideal but helps)
+                 all_points = []
+                 for frame_idx in sorted(points_by_frame.keys()):
+                     all_points.extend(points_by_frame[frame_idx])
+                 video_data["primary_points"] = all_points
+
+             output_videos.append(video_data)
+
+         # Match input format
+         if isinstance(self.config_data, list):
+             output_data = output_videos
+         else:
+             output_data = {"videos": output_videos}
+
+         # Save
+         output_path = self.config_path.parent / f"{self.config_path.stem}_points.json"
+
+         try:
+             with open(output_path, 'w') as f:
+                 json.dump(output_data, f, indent=2)
+
+             self.status_label.config(text=f"Saved to {output_path.name}", foreground="green")
+             messagebox.showinfo("Success", f"Points saved to:\n{output_path}")
+         except Exception as e:
+             messagebox.showerror("Error", f"Failed to save: {e}")
+
+     def __del__(self):
+         """Clean up video captures"""
+         for cap in self.video_captures:
+             if cap is not None:
+                 cap.release()
+
+
+ def main():
+     parser = argparse.ArgumentParser(description="Point Selector GUI - Multi-Frame Support")
+     parser.add_argument("--config", help="Config JSON file to load")
+     args = parser.parse_args()
+
+     root = tk.Tk()
+     root.geometry("900x750")  # Compact height to fit on screen
+     gui = PointSelectorGUI(root, config_path=args.config)
+     root.mainloop()
+
+
+ if __name__ == "__main__":
+     main()
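
The `_points.json` written by `save_points` keys `primary_points_by_frame` by stringified frame index (JSON object keys must be strings) and mirrors a flattened `primary_points` list for legacy single-frame consumers, while `load_existing_points` converts the keys back to ints. A small standalone sketch of that round-trip (the helper names `serialize_points` / `deserialize_points` are illustrative, not part of the GUI):

```python
import json

def serialize_points(points_by_frame):
    """Mirror save_points: string keys for JSON, plus backwards-compat fields."""
    data = {
        "primary_points_by_frame": {str(f): pts for f, pts in points_by_frame.items()},
        "primary_frames": sorted(points_by_frame),
    }
    if points_by_frame:
        data["first_appears_frame"] = min(points_by_frame)
        # Flattened view in frame order, as the legacy single-frame format expects
        data["primary_points"] = [p for f in sorted(points_by_frame)
                                  for p in points_by_frame[f]]
    return data

def deserialize_points(data):
    """Mirror load_existing_points: convert string keys back to ints."""
    return {int(k): v for k, v in data.get("primary_points_by_frame", {}).items()}

points = {0: [(120, 80)], 30: [(300, 200), (310, 210)]}
blob = json.dumps(serialize_points(points))
restored = deserialize_points(json.loads(blob))
```

Note that point tuples come back as two-element lists after a JSON round-trip, which is why downstream stages index them positionally rather than relying on tuple type.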
VLM-MASK-REASONER/run_pipeline.sh ADDED
@@ -0,0 +1,74 @@
+ #!/bin/bash
+ # run_pipeline.sh
+ # Runs stages 1-4 given a points config JSON (output of point_selector_gui.py)
+ #
+ # Usage:
+ #   bash run_pipeline.sh <config_points.json> [--sam2-checkpoint PATH] [--device cuda]
+ #
+ # Example:
+ #   bash run_pipeline.sh my_config_points.json
+ #   bash run_pipeline.sh my_config_points.json --sam2-checkpoint ../sam2_hiera_large.pt
+
+ set -e
+
+ # ── Arguments ──────────────────────────────────────────────────────────────────
+ CONFIG="$1"
+ if [ -z "$CONFIG" ]; then
+     echo "Usage: bash run_pipeline.sh <config_points.json> [--sam2-checkpoint PATH] [--device cuda]"
+     exit 1
+ fi
+
+ SAM2_CHECKPOINT="../sam2_hiera_large.pt"
+ DEVICE="cuda"
+
+ # Parse optional flags
+ shift
+ while [[ $# -gt 0 ]]; do
+     case "$1" in
+         --sam2-checkpoint) SAM2_CHECKPOINT="$2"; shift 2 ;;
+         --device) DEVICE="$2"; shift 2 ;;
+         *) echo "Unknown argument: $1"; exit 1 ;;
+     esac
+ done
+
+ SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+
+ echo "=========================================="
+ echo "Void Mask Generation Pipeline"
+ echo "=========================================="
+ echo "Config: $CONFIG"
+ echo "SAM2 checkpoint: $SAM2_CHECKPOINT"
+ echo "Device: $DEVICE"
+ echo "=========================================="
+
+ # ── Stage 1: SAM2 Segmentation ─────────────────────────────────────────────────
+ echo ""
+ echo "[1/4] SAM2 segmentation..."
+ python "$SCRIPT_DIR/stage1_sam2_segmentation.py" \
+     --config "$CONFIG" \
+     --sam2-checkpoint "$SAM2_CHECKPOINT" \
+     --device "$DEVICE"
+
+ # ── Stage 2: VLM Analysis ──────────────────────────────────────────────────────
+ echo ""
+ echo "[2/4] VLM analysis (Gemini)..."
+ python "$SCRIPT_DIR/stage2_vlm_analysis.py" \
+     --config "$CONFIG"
+
+ # ── Stage 3a: Generate Grey Masks ─────────────────────────────────────────────
+ echo ""
+ echo "[3/4] Generating grey masks..."
+ python "$SCRIPT_DIR/stage3a_generate_grey_masks_v2.py" \
+     --config "$CONFIG"
+
+ # ── Stage 4: Combine into Quadmask ────────────────────────────────────────────
+ echo ""
+ echo "[4/4] Combining masks into quadmask_0.mp4..."
+ python "$SCRIPT_DIR/stage4_combine_masks.py" \
+     --config "$CONFIG"
+
+ echo ""
+ echo "=========================================="
+ echo "Pipeline complete!"
+ echo "Output: quadmask_0.mp4 in each video's output_dir"
+ echo "=========================================="
VLM-MASK-REASONER/stage1_sam2_segmentation.py ADDED
@@ -0,0 +1,419 @@
+ #!/usr/bin/env python3
+ """
+ Stage 1: SAM2 Point-Prompted Segmentation
+
+ Takes user-selected points and generates pixel-perfect masks of primary objects
+ using SAM2 video tracking.
+
+ Input: <config>_points.json (with primary_points)
+ Output: For each video:
+     - black_mask.mp4: Primary object mask (0=object, 255=background)
+     - first_frame.jpg: First frame for VLM analysis
+     - segmentation_info.json: Metadata
+
+ Usage:
+     python stage1_sam2_segmentation.py --config more_dyn_2_config_points.json
+ """
+
+ import os
+ import sys
+ import json
+ import argparse
+ import cv2
+ import numpy as np
+ import torch
+ import tempfile
+ import shutil
+ from pathlib import Path
+ from typing import Dict, List, Tuple
+ import subprocess
+
+ # Check SAM2 availability
+ try:
+     from sam2.build_sam import build_sam2_video_predictor
+     SAM2_AVAILABLE = True
+ except ImportError:
+     SAM2_AVAILABLE = False
+     print("⚠️ SAM2 not installed. Install with:")
+     print("   pip install git+https://github.com/facebookresearch/segment-anything-2.git")
+     sys.exit(1)
+
+
+ class SAM2PointSegmenter:
+     """SAM2 video segmentation with point prompts"""
+
+     def __init__(self, checkpoint_path: str, model_cfg: str = "sam2_hiera_l.yaml", device: str = "cuda"):
+         print(f"  Loading SAM2 video predictor...")
+         self.device = device
+         self.predictor = build_sam2_video_predictor(model_cfg, checkpoint_path, device=device)
+         print(f"  ✓ SAM2 loaded on {device}")
+
+     def segment_video(self, video_path: str, points: List[List[int]] = None,
+                       output_mask_path: str = None, temp_dir: str = None,
+                       first_appears_frame: int = 0,
+                       points_by_frame: Dict[int, List[List[int]]] = None) -> Dict:
+         """
+         Segment video using point prompts (single or multi-frame).
+
+         Args:
+             video_path: Path to input video
+             points: List of [x, y] points on object (single frame, legacy)
+             output_mask_path: Path to save mask video
+             temp_dir: Directory for temporary frames
+             first_appears_frame: Frame index where points were selected (for single frame)
+             points_by_frame: Dict mapping frame_idx → [[x, y], ...] (multi-frame support)
+
+         Returns:
+             Dict with segmentation metadata
+         """
+         # Handle both old and new formats
+         if points_by_frame is not None:
+             # Multi-frame format
+             if not points_by_frame or len(points_by_frame) == 0:
+                 raise ValueError("No points provided")
+         elif points is not None:
+             # Single frame format (backwards compat)
+             if not points or len(points) == 0:
+                 raise ValueError("No points provided")
+             points_by_frame = {first_appears_frame: points}
+         else:
+             raise ValueError("Must provide either points or points_by_frame")
+
+         # Create temp directory for frames
+         if temp_dir is None:
+             temp_dir = tempfile.mkdtemp(prefix="sam2_frames_")
+             cleanup = True
+         else:
+             Path(temp_dir).mkdir(parents=True, exist_ok=True)
+             cleanup = False
+
+         print(f"  Extracting frames to: {temp_dir}")
+         frame_files = self._extract_frames(video_path, temp_dir)
+
+         if len(frame_files) == 0:
+             raise RuntimeError(f"No frames extracted from {video_path}")
+
+         # Get video properties
+         cap = cv2.VideoCapture(video_path)
+         fps = cap.get(cv2.CAP_PROP_FPS)
+         frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
+         frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
+         total_frames = len(frame_files)
+         cap.release()
+
+         # Count total points across all frames
+         total_points = sum(len(pts) for pts in points_by_frame.values())
+         print(f"  Video: {frame_width}x{frame_height}, {total_frames} frames @ {fps} fps")
+         print(f"  Using {total_points} points across {len(points_by_frame)} frame(s) for segmentation")
+
+         # Initialize SAM2
+         print(f"  Initializing SAM2...")
+         inference_state = self.predictor.init_state(video_path=temp_dir)
+
+         # Add points for each frame (all with obj_id=1 to merge into single mask)
+         for frame_idx in sorted(points_by_frame.keys()):
+             frame_points = points_by_frame[frame_idx]
+
+             # Convert points to numpy array
+             points_np = np.array(frame_points, dtype=np.float32)
+             labels_np = np.ones(len(frame_points), dtype=np.int32)  # All positive
+
+             # Calculate bounding box from points (with 10% margin for hair/clothes)
+             x_coords = points_np[:, 0]
+             y_coords = points_np[:, 1]
+
+             x_min, x_max = x_coords.min(), x_coords.max()
+             y_min, y_max = y_coords.min(), y_coords.max()
+
+             # Add 10% margin
+             x_margin = (x_max - x_min) * 0.1
+             y_margin = (y_max - y_min) * 0.1
+
+             box = np.array([
+                 max(0, x_min - x_margin),
+                 max(0, y_min - y_margin),
+                 min(frame_width, x_max + x_margin),
+                 min(frame_height, y_max + y_margin)
+             ], dtype=np.float32)
+
+             print(f"  Adding {len(frame_points)} points + box to frame {frame_idx}")
+             print(f"    Points: {frame_points[:3]}..." if len(frame_points) > 3 else f"    Points: {frame_points}")
+             print(f"    Box: [{int(box[0])}, {int(box[1])}, {int(box[2])}, {int(box[3])}]")
+
+             # Add points + box to this frame (all use obj_id=1 to merge)
+             _, out_obj_ids, out_mask_logits = self.predictor.add_new_points_or_box(
+                 inference_state=inference_state,
+                 frame_idx=frame_idx,
+                 obj_id=1,
+                 points=points_np,
+                 labels=labels_np,
+                 box=box,
+             )
+
+         print(f"  Propagating through video...")
+
+         # Propagate through video
+         video_segments = {}
+         for out_frame_idx, out_obj_ids, out_mask_logits in self.predictor.propagate_in_video(inference_state):
+             # Get mask for object ID 1
+             mask_logits = out_mask_logits[out_obj_ids.index(1)]
+             mask = (mask_logits > 0.0).cpu().numpy().squeeze()
+             video_segments[out_frame_idx] = mask
+
+         print(f"  ✓ Segmented {len(video_segments)} frames")
+
+         # Write mask video
+         print(f"  Writing mask video...")
+         self._write_mask_video(video_segments, output_mask_path, fps, frame_width, frame_height)
+
+         # Cleanup
+         if cleanup:
+             shutil.rmtree(temp_dir)
+
+         # Build metadata
+         metadata = {
+             "total_frames": total_frames,
+             "frame_width": frame_width,
+             "frame_height": frame_height,
+             "fps": fps,
+         }
+
+         # Add points info based on format
+         if points_by_frame:
+             total_points = sum(len(pts) for pts in points_by_frame.values())
+             metadata["num_points"] = total_points
+             metadata["points_by_frame"] = {str(k): v for k, v in points_by_frame.items()}
+         else:
+             metadata["num_points"] = len(points) if points else 0
+             metadata["points"] = points
+
+         return metadata
+
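The box-from-points logic in `segment_video` can be exercised on its own. A small sketch (NumPy only, with made-up click coordinates) showing how the 10% margin expands the point extent while staying clamped to the frame bounds:

```python
import numpy as np

# Hypothetical clicks on an object in a 1280x720 frame
frame_width, frame_height = 1280, 720
points_np = np.array([[400, 200], [600, 500], [500, 350]], dtype=np.float32)

x_min, x_max = points_np[:, 0].min(), points_np[:, 0].max()
y_min, y_max = points_np[:, 1].min(), points_np[:, 1].max()

# 10% margin on each side, clamped to the frame bounds
x_margin = (x_max - x_min) * 0.1  # 20.0
y_margin = (y_max - y_min) * 0.1  # 30.0
box = np.array([
    max(0, x_min - x_margin),
    max(0, y_min - y_margin),
    min(frame_width, x_max + x_margin),
    min(frame_height, y_max + y_margin),
], dtype=np.float32)

print(box.tolist())  # [380.0, 170.0, 620.0, 530.0]
```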
+     def _extract_frames(self, video_path: str, output_dir: str) -> List[str]:
+         """Extract video frames as JPG files"""
+         Path(output_dir).mkdir(parents=True, exist_ok=True)
+
+         cap = cv2.VideoCapture(video_path)
+         frame_idx = 0
+         frame_files = []
+
+         while True:
+             ret, frame = cap.read()
+             if not ret:
+                 break
+
+             # SAM2 expects frames named as frame_000000.jpg, frame_000001.jpg, etc.
+             frame_filename = f"{frame_idx:06d}.jpg"
+             frame_path = os.path.join(output_dir, frame_filename)
+             cv2.imwrite(frame_path, frame)
+             frame_files.append(frame_path)
+             frame_idx += 1
+
+             if frame_idx % 20 == 0:
+                 print(f"    Extracted {frame_idx} frames...", end='\r')
+
+         cap.release()
+         print(f"    Extracted {frame_idx} frames")
+
+         return frame_files
+
+     def _write_mask_video(self, masks: Dict[int, np.ndarray], output_path: str,
+                           fps: float, width: int, height: int):
+         """Write masks to video file"""
+         # Write temp AVI first
+         temp_avi = Path(output_path).with_suffix('.avi')
+         fourcc = cv2.VideoWriter_fourcc(*'FFV1')
+         out = cv2.VideoWriter(str(temp_avi), fourcc, fps, (width, height), isColor=False)
+
+         for frame_idx in sorted(masks.keys()):
+             mask = masks[frame_idx]
+             # Convert boolean mask to 0/255
+             mask_uint8 = np.where(mask, 0, 255).astype(np.uint8)
+             out.write(mask_uint8)
+
+         out.release()
+
+         # Convert to lossless MP4
+         cmd = [
+             'ffmpeg', '-y', '-i', str(temp_avi),
+             '-c:v', 'libx264', '-qp', '0', '-preset', 'ultrafast',
+             '-pix_fmt', 'yuv444p',
+             str(output_path)
+         ]
+         subprocess.run(cmd, capture_output=True)
+         temp_avi.unlink()
+
+         print(f"  ✓ Saved mask video: {output_path}")
+
+
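Note the inverted polarity in `_write_mask_video`: `np.where(mask, 0, 255)` writes the tracked object as black (0) and the background as white (255), matching the `black_mask.mp4` convention stated in the module docstring. A tiny NumPy check of that mapping:

```python
import numpy as np

# 2x3 boolean mask: True = object pixel, False = background
mask = np.array([[True, False, False],
                 [True, True, False]])

# Same inversion as _write_mask_video: object -> 0, background -> 255
mask_uint8 = np.where(mask, 0, 255).astype(np.uint8)

print(mask_uint8.tolist())
# [[0, 255, 255], [0, 0, 255]]

# Recovering the object region later (as stage 2 does with mask_frame == 0)
object_mask = (mask_uint8 == 0)
assert (object_mask == mask).all()
```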
+ def process_config(config_path: str, sam2_checkpoint: str, device: str = "cuda"):
+     """Process all videos in config"""
+     config_path = Path(config_path)
+
+     # Load config
+     with open(config_path, 'r') as f:
+         config_data = json.load(f)
+
+     # Handle both formats
+     if isinstance(config_data, list):
+         videos = config_data
+     elif isinstance(config_data, dict) and "videos" in config_data:
+         videos = config_data["videos"]
+     else:
+         raise ValueError("Config must be a list or have 'videos' key")
+
+     print(f"\n{'='*70}")
+     print(f"Stage 1: SAM2 Point-Prompted Segmentation")
+     print(f"{'='*70}")
+     print(f"Config: {config_path.name}")
+     print(f"Videos: {len(videos)}")
+     print(f"Device: {device}")
+     print(f"{'='*70}\n")
+
+     # Initialize SAM2
+     segmenter = SAM2PointSegmenter(sam2_checkpoint, device=device)
+
+     # Process each video
+     for i, video_info in enumerate(videos):
+         video_path = video_info.get("video_path", "")
+         instruction = video_info.get("instruction", "")
+         output_dir = video_info.get("output_dir", "")
+
+         # Read points - support both single-frame and multi-frame formats
+         points_by_frame_raw = video_info.get("primary_points_by_frame", None)
+         points = video_info.get("primary_points", [])
+         first_appears_frame = video_info.get("first_appears_frame", 0)
+
+         # Convert points_by_frame from string keys to int keys
+         points_by_frame = None
+         if points_by_frame_raw:
+             points_by_frame = {int(k): v for k, v in points_by_frame_raw.items()}
+
+         if not video_path:
+             print(f"\n⚠️ Video {i+1}: No video_path, skipping")
+             continue
+
+         if not points and not points_by_frame:
+             print(f"\n⚠️ Video {i+1}: No primary_points selected, skipping")
+             continue
+
+         video_path = Path(video_path)
+         if not video_path.exists():
+             print(f"\n⚠️ Video {i+1}: File not found: {video_path}, skipping")
+             continue
+
+         print(f"\n{'─'*70}")
+         print(f"Video {i+1}/{len(videos)}: {video_path.name}")
+         print(f"{'─'*70}")
+         print(f"Instruction: {instruction}")
+
+         if points_by_frame:
+             total_points = sum(len(pts) for pts in points_by_frame.values())
+             print(f"Points: {total_points} across {len(points_by_frame)} frame(s)")
+             print(f"Frames: {sorted(points_by_frame.keys())}")
+         else:
+             print(f"Points: {len(points)}")
+             print(f"First appears frame: {first_appears_frame}")
+
+         # Setup output directory
+         if output_dir:
+             output_dir = Path(output_dir)
+         else:
+             # Create unique output directory per video
+             video_name = video_path.stem  # Get video name without extension
+             output_dir = video_path.parent / f"{video_name}_masks_output"
+
+         output_dir.mkdir(parents=True, exist_ok=True)
+         print(f"Output: {output_dir}")
+
+         try:
+             # Segment video - use multi-frame or single-frame format
+             black_mask_path = output_dir / "black_mask.mp4"
+             if points_by_frame:
+                 metadata = segmenter.segment_video(
+                     str(video_path),
+                     output_mask_path=str(black_mask_path),
+                     points_by_frame=points_by_frame
+                 )
+                 # Use first frame for VLM analysis
+                 first_frame_for_vlm = min(points_by_frame.keys())
+             else:
+                 metadata = segmenter.segment_video(
+                     str(video_path),
+                     points=points,
+                     output_mask_path=str(black_mask_path),
+                     first_appears_frame=first_appears_frame
+                 )
+                 first_frame_for_vlm = first_appears_frame
+
+             # Save frame where object appears for VLM analysis
+             cap = cv2.VideoCapture(str(video_path))
+             cap.set(cv2.CAP_PROP_POS_FRAMES, first_frame_for_vlm)
+             ret, first_frame = cap.read()
+             cap.release()
+
+             if ret:
+                 first_frame_path = output_dir / "first_frame.jpg"
+                 cv2.imwrite(str(first_frame_path), first_frame)
+                 print(f"  ✓ Saved first frame (frame {first_frame_for_vlm}): {first_frame_path.name}")
+
+             # Copy input video
+             input_copy_path = output_dir / "input_video.mp4"
+             if not input_copy_path.exists():
+                 shutil.copy2(video_path, input_copy_path)
+                 print(f"  ✓ Copied input video")
+
+             # Save metadata
+             metadata["video_path"] = str(video_path)
+             metadata["instruction"] = instruction
+             if points_by_frame:
+                 metadata["primary_points_by_frame"] = {str(k): v for k, v in points_by_frame.items()}
+                 metadata["primary_frames"] = sorted(points_by_frame.keys())
+                 metadata["first_appears_frame"] = min(points_by_frame.keys())
+             else:
+                 metadata["primary_points"] = points
+                 metadata["first_appears_frame"] = first_appears_frame
+
+             metadata_path = output_dir / "segmentation_info.json"
+             with open(metadata_path, 'w') as f:
+                 json.dump(metadata, f, indent=2)
+             print(f"  ✓ Saved metadata: {metadata_path.name}")
+
+             print(f"\n✅ Video {i+1} complete!")
+
+         except Exception as e:
+             print(f"\n❌ Error processing video {i+1}: {e}")
+             import traceback
+             traceback.print_exc()
+             continue
+
+     print(f"\n{'='*70}")
+     print(f"Stage 1 Complete!")
+     print(f"{'='*70}\n")
+
+
+ def main():
+     parser = argparse.ArgumentParser(description="Stage 1: SAM2 Point-Prompted Segmentation")
+     parser.add_argument("--config", required=True, help="Config JSON with primary_points")
+     parser.add_argument("--sam2-checkpoint", default="../sam2_hiera_large.pt",
+                         help="Path to SAM2 checkpoint")
+     parser.add_argument("--device", default="cuda", help="Device (cuda/cpu)")
+     args = parser.parse_args()
+
+     if not SAM2_AVAILABLE:
+         print("❌ SAM2 not available")
+         sys.exit(1)
+
+     # Check checkpoint exists
+     checkpoint_path = Path(args.sam2_checkpoint)
+     if not checkpoint_path.exists():
+         print(f"❌ Checkpoint not found: {checkpoint_path}")
+         print(f"   Download with:")
+         print(f"   wget https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_large.pt")
+         sys.exit(1)
+
+     process_config(args.config, str(checkpoint_path), args.device)
+
+
+ if __name__ == "__main__":
+     main()
VLM-MASK-REASONER/stage2_vlm_analysis.py ADDED
@@ -0,0 +1,1022 @@
+ #!/usr/bin/env python3
+ """
+ Stage 2: VLM Analysis - Identify Affected Objects & Physics
+
+ Analyzes videos with primary masks to identify:
+ - Integral belongings (to add to black mask)
+ - Affected objects (shadows, reflections, held items)
+ - Physics behavior (will_move, needs_trajectory)
+
+ Input: Config from Stage 1 (with output_dir containing black_mask.mp4, first_frame.jpg)
+ Output: For each video:
+     - vlm_analysis.json: Identified objects and physics reasoning
+
+ Usage:
+     python stage2_vlm_analysis.py --config more_dyn_2_config_points_absolute.json
+ """
+
+ import os
+ import sys
+ import json
+ import argparse
+ import cv2
+ import numpy as np
+ import base64
+ from pathlib import Path
+ from typing import Dict, List
+ from PIL import Image, ImageDraw
+
+ import openai
+
+ DEFAULT_MODEL = "gemini-3-pro-preview"
+
+
+ def image_to_data_url(image_path: str) -> str:
+     """Convert image file to base64 data URL"""
+     with open(image_path, 'rb') as f:
+         img_data = base64.b64encode(f.read()).decode('utf-8')
+
+     # Detect format
+     ext = Path(image_path).suffix.lower()
+     if ext == '.png':
+         mime = 'image/png'
+     elif ext in ['.jpg', '.jpeg']:
+         mime = 'image/jpeg'
+     else:
+         mime = 'image/jpeg'
+
+     return f"data:{mime};base64,{img_data}"
+
+
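The data-URL scheme used here can be checked without any image files. A minimal sketch using stand-in bytes in place of real JPEG data:

```python
import base64

# Stand-in for file bytes read from disk (a real call would read a .jpg)
fake_jpeg_bytes = b"\xff\xd8\xff\xe0 not a real image"
img_data = base64.b64encode(fake_jpeg_bytes).decode("utf-8")
data_url = f"data:image/jpeg;base64,{img_data}"

print(data_url[:23])  # data:image/jpeg;base64,

# The base64 payload round-trips losslessly
assert base64.b64decode(img_data) == fake_jpeg_bytes
```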
+ def video_to_data_url(video_path: str) -> str:
+     """Convert video file to base64 data URL"""
+     with open(video_path, 'rb') as f:
+         video_data = base64.b64encode(f.read()).decode('utf-8')
+     return f"data:video/mp4;base64,{video_data}"
+
+
+ def calculate_square_grid(width: int, height: int, min_grid: int = 8) -> tuple:
+     """Calculate grid dimensions matching stage3a logic"""
+     aspect_ratio = width / height
+     if width >= height:
+         grid_rows = min_grid
+         grid_cols = max(min_grid, round(min_grid * aspect_ratio))
+     else:
+         grid_cols = min_grid
+         grid_rows = max(min_grid, round(min_grid / aspect_ratio))
+     return grid_rows, grid_cols
+
+
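As a worked example of the grid sizing (the function is restated here so the snippet runs standalone): a 16:9 landscape frame keeps the minimum 8 rows and widens to 14 columns so that cells stay roughly square, and a portrait frame mirrors that.

```python
# Same logic as calculate_square_grid in this file, repeated so the
# example is self-contained.
def calculate_square_grid(width: int, height: int, min_grid: int = 8) -> tuple:
    aspect_ratio = width / height
    if width >= height:
        grid_rows = min_grid
        grid_cols = max(min_grid, round(min_grid * aspect_ratio))
    else:
        grid_cols = min_grid
        grid_rows = max(min_grid, round(min_grid / aspect_ratio))
    return grid_rows, grid_cols

print(calculate_square_grid(1920, 1080))  # (8, 14)
print(calculate_square_grid(1080, 1920))  # (14, 8)
print(calculate_square_grid(640, 640))    # (8, 8)
```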
+ def create_first_frame_with_mask_overlay(first_frame_path: str, black_mask_path: str,
+                                          output_path: str, frame_idx: int = 0) -> str:
+     """Create visualization of first frame with red overlay on primary object
+
+     Args:
+         first_frame_path: Path to first_frame.jpg
+         black_mask_path: Path to black_mask.mp4
+         output_path: Where to save overlay
+         frame_idx: Which frame to extract from black_mask.mp4 (default: 0)
+     """
+     # Load first frame
+     frame = cv2.imread(first_frame_path)
+     if frame is None:
+         raise ValueError(f"Failed to load first frame: {first_frame_path}")
+
+     # Load black mask video and get the specified frame
+     cap = cv2.VideoCapture(black_mask_path)
+     cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)
+     ret, mask_frame = cap.read()
+     cap.release()
+
+     if not ret:
+         raise ValueError(f"Failed to load black mask frame {frame_idx}: {black_mask_path}")
+
+     # Convert mask to binary (0 = object, 255 = background)
+     if len(mask_frame.shape) == 3:
+         mask_frame = cv2.cvtColor(mask_frame, cv2.COLOR_BGR2GRAY)
+
+     object_mask = (mask_frame == 0)
+
+     # Create red overlay on object
+     overlay = frame.copy()
+     overlay[object_mask] = [0, 0, 255]  # Red in BGR
+
+     # Blend: 60% original + 40% red overlay
+     result = cv2.addWeighted(frame, 0.6, overlay, 0.4, 0)
+
+     # Save
+     cv2.imwrite(output_path, result)
+     return output_path
+
+
+ def create_gridded_frame_overlay(first_frame_path: str, black_mask_path: str,
+                                  output_path: str, min_grid: int = 8) -> tuple:
+     """Create first frame with BOTH red mask overlay AND grid lines
+
+     Returns: (output_path, grid_rows, grid_cols)
+     """
+     # Load first frame
+     frame = cv2.imread(first_frame_path)
+     if frame is None:
+         raise ValueError(f"Failed to load first frame: {first_frame_path}")
+
+     h, w = frame.shape[:2]
+
+     # Load black mask
+     cap = cv2.VideoCapture(black_mask_path)
+     ret, mask_frame = cap.read()
+     cap.release()
+
+     if not ret:
+         raise ValueError(f"Failed to load black mask: {black_mask_path}")
+
+     if len(mask_frame.shape) == 3:
+         mask_frame = cv2.cvtColor(mask_frame, cv2.COLOR_BGR2GRAY)
+
+     object_mask = (mask_frame == 0)
+
+     # Create red overlay
+     overlay = frame.copy()
+     overlay[object_mask] = [0, 0, 255]
+     result = cv2.addWeighted(frame, 0.6, overlay, 0.4, 0)
+
+     # Calculate grid
+     grid_rows, grid_cols = calculate_square_grid(w, h, min_grid)
+
+     # Draw grid lines
+     cell_width = w / grid_cols
+     cell_height = h / grid_rows
+
+     # Vertical lines
+     for col in range(1, grid_cols):
+         x = int(col * cell_width)
+         cv2.line(result, (x, 0), (x, h), (255, 255, 0), 1)  # Yellow lines
+
+     # Horizontal lines
+     for row in range(1, grid_rows):
+         y = int(row * cell_height)
+         cv2.line(result, (0, y), (w, y), (255, 255, 0), 1)
+
+     # Add grid labels
+     font = cv2.FONT_HERSHEY_SIMPLEX
+     font_scale = 0.3
+     thickness = 1
+
+     # Label columns at top
+     for col in range(grid_cols):
+         x = int((col + 0.5) * cell_width)
+         cv2.putText(result, str(col), (x-5, 15), font, font_scale, (255, 255, 0), thickness)
+
+     # Label rows on left
+     for row in range(grid_rows):
+         y = int((row + 0.5) * cell_height)
+         cv2.putText(result, str(row), (5, y+5), font, font_scale, (255, 255, 0), thickness)
+
+     cv2.imwrite(output_path, result)
+     return output_path, grid_rows, grid_cols
+
+
+ def create_multi_frame_grid_samples(video_path: str, output_dir: Path,
+                                     min_grid: int = 8,
+                                     sample_points: list = [0.0, 0.11, 0.22, 0.33, 0.44, 0.56, 0.67, 0.78, 0.89, 1.0]) -> tuple:
+     """
+     Create gridded frame samples at multiple time points in video.
+     Helps VLM see objects that appear mid-video with grid reference.
+
+     Args:
+         video_path: Path to video
+         output_dir: Where to save samples
+         min_grid: Minimum grid size
+         sample_points: List of normalized positions [0.0-1.0] to sample
+
+     Returns: (sample_paths, grid_rows, grid_cols)
+     """
+     cap = cv2.VideoCapture(str(video_path))
+     total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
+     w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
+     h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
+
+     # Calculate grid (same for all frames)
+     grid_rows, grid_cols = calculate_square_grid(w, h, min_grid)
+     cell_width = w / grid_cols
+     cell_height = h / grid_rows
+
+     sample_paths = []
+
+     for i, t in enumerate(sample_points):
+         frame_idx = int(t * (total_frames - 1))
+         frame_idx = max(0, min(frame_idx, total_frames - 1))
+
+         cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)
+         ret, frame = cap.read()
+         if not ret:
+             continue
+
+         # Draw grid
+         result = frame.copy()
+
+         # Vertical lines
+         for col in range(1, grid_cols):
+             x = int(col * cell_width)
+             cv2.line(result, (x, 0), (x, h), (255, 255, 0), 2)
+
+         # Horizontal lines
+         for row in range(1, grid_rows):
+             y = int(row * cell_height)
+             cv2.line(result, (0, y), (w, y), (255, 255, 0), 2)
+
+         # Add grid labels
+         font = cv2.FONT_HERSHEY_SIMPLEX
+         font_scale = 0.4
+         thickness = 1
+
+         # Label columns
+         for col in range(grid_cols):
+             x = int((col + 0.5) * cell_width)
+             cv2.putText(result, str(col), (x-8, 20), font, font_scale, (255, 255, 0), thickness)
+
+         # Label rows
+         for row in range(grid_rows):
+             y = int((row + 0.5) * cell_height)
+             cv2.putText(result, str(row), (10, y+8), font, font_scale, (255, 255, 0), thickness)
+
+         # Add frame number and percentage
+         label = f"Frame {frame_idx} ({int(t*100)}%)"
+         cv2.putText(result, label, (10, h-10), font, 0.5, (255, 255, 0), 2)
+
+         # Save
+         output_path = output_dir / f"grid_sample_frame_{frame_idx:04d}.jpg"
+         cv2.imwrite(str(output_path), result)
+         sample_paths.append(output_path)
+
+     cap.release()
+     return sample_paths, grid_rows, grid_cols
+
+
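The normalized sample positions map to concrete frame indices via `int(t * (total_frames - 1))`, clamped to the valid range. For a hypothetical 100-frame clip that gives:

```python
sample_points = [0.0, 0.11, 0.22, 0.33, 0.44, 0.56, 0.67, 0.78, 0.89, 1.0]
total_frames = 100  # made-up clip length

# Same mapping as create_multi_frame_grid_samples
frame_indices = [
    max(0, min(int(t * (total_frames - 1)), total_frames - 1))
    for t in sample_points
]
print(frame_indices)  # [0, 10, 21, 32, 43, 55, 66, 77, 88, 99]
```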
+ def make_vlm_analysis_prompt(instruction: str, grid_rows: int, grid_cols: int,
+                              has_multi_frame_grids: bool = False) -> str:
+     """Create VLM prompt for analyzing video with primary mask"""
+
+     grid_context = ""
+     if has_multi_frame_grids:
+         grid_context = f"""
+ 1. **Multiple Grid Reference Frames**: Sampled frames at 0%, 11%, 22%, 33%, 44%, 56%, 67%, 78%, 89%, 100% of video
+    - Each frame shows YELLOW GRID with {grid_rows} rows × {grid_cols} columns
+    - Grid cells labeled (row, col) starting from (0, 0) at top-left
+    - Frame number shown at bottom
+    - Use these to locate objects that appear MID-VIDEO and track object positions across time
+ 2. **First Frame with RED mask**: Shows what will be REMOVED (primary object)
+ 3. **Full Video**: Complete action and interactions"""
+     else:
+         grid_context = f"""
+ 1. **First Frame with Grid**: PRIMARY OBJECT highlighted in RED + GRID OVERLAY
+    - The red overlay shows what will be REMOVED (already masked)
+    - Yellow grid with {grid_rows} rows × {grid_cols} columns
+    - Grid cells are labeled (row, col) starting from (0, 0) at top-left
+ 2. **Full Video**: Complete scene and action"""
+
+     return f"""
+ You are an expert video analyst specializing in physics and object interactions.
+
+ ═══════════════════════════════════════════════════════════════════
+ CONTEXT
+ ═══════════════════════════════════════════════════════════════════
+
+ You will see MULTIPLE inputs:
+ {grid_context}
+
+ Edit instruction: "{instruction}"
+
+ IMPORTANT: Some objects may NOT appear in first frame. They may enter later.
+ Watch the ENTIRE video and note when each object first appears.
+
+ ═══════════════════════════════════════════════════════════════════
+ YOUR TASK
+ ═══════════════════════════════════════════════════════════════════
+
+ Analyze what would happen if the PRIMARY OBJECT (shown in red) is removed.
+ Watch the ENTIRE video to see all interactions and movements.
+
+ STEP 1: IDENTIFY INTEGRAL BELONGINGS (0-3 items)
+ ─────────────────────────────────────────────────
+ Items that should be ADDED to the primary removal mask (removed WITH primary object):
+
+ ✓ INCLUDE:
+   • Distinct wearable items: hat, backpack, jacket (if separate/visible)
+   • Vehicles/equipment being ridden: bike, skateboard, surfboard, scooter
+   • Large carried items that are part of the subject
+
+ ✗ DO NOT INCLUDE:
+   • Generic clothing (shirt, pants, shoes) - already captured with person
+   • Held items that could be set down: guitar, cup, phone, tools
+   • Objects they're interacting with but not wearing/riding
+
+ Examples:
+   • Person on bike → integral: "bike"
+   • Person with guitar → integral: none (guitar is affected, not integral)
+   • Surfer → integral: "surfboard"
+   • Boxer → integral: "boxing gloves" (wearable equipment)
+
+ STEP 2: IDENTIFY AFFECTED OBJECTS (0-5 objects)
+ ────────────────────────────────────────────────
+ Objects/effects that are SEPARATE from primary but affected by its removal.
+
+ CRITICAL: Do NOT include integral belongings from Step 1.
+
+ Two categories:
+
+ A) VISUAL ARTIFACTS (disappear when primary removed):
+    • shadow, reflection, wake, ripples, splash, footprints
+    • These vanish completely - no physics needed
+
+    **CRITICAL FOR VISUAL ARTIFACTS:**
+    You MUST provide GRID LOCALIZATIONS across the reference frames.
+    Keyword segmentation fails to isolate specific shadows/reflections.
+
+    For each visual artifact:
+    - Look at each grid reference frame you were shown
+    - Identify which grid cells the artifact occupies in EACH frame
+    - List all grid cells (row, col) that contain any part of it
+    - Be thorough - include ALL touched cells (over-mask is better than under-mask)
+
+    Format:
+    {{
+      "noun": "shadow",
+      "category": "visual_artifact",
+      "grid_localizations": [
+        {{"frame": 0, "grid_regions": [{{"row": 6, "col": 3}}, {{"row": 6, "col": 4}}, ...]}},
+        {{"frame": 5, "grid_regions": [{{"row": 6, "col": 4}}, ...]}},
+        // ... for each reference frame shown
+      ]
+    }}
+
+ B) PHYSICAL OBJECTS (may move, fall, or stay):
+
+    CRITICAL - Understand the difference:
+
+    **SUPPORTING vs ACTING ON:**
+    • SUPPORTING = holding UP against gravity → object WILL FALL when removed
+      Examples: holding guitar, carrying cup, person sitting on chair
+      → will_move: TRUE
+
+    • ACTING ON = touching/manipulating but object rests on stable surface → object STAYS
+      Examples: hand crushing can (can on table), hand opening can (can on counter),
+                hand pushing object (object on floor)
+      → will_move: FALSE
+
+    **Key Questions:**
+    1. Is the primary object HOLDING THIS UP against gravity?
+       - YES → will_move: true, needs_trajectory: true
+       - NO → Check next question
+
+    2. Is this object RESTING ON a stable surface (table, floor, counter)?
+       - YES → will_move: false (stays on surface when primary removed)
+       - NO → will_move: true
+
+    3. Is the primary object DOING an action TO this object?
+       - Opening can, crushing can, pushing button, turning knob
+       - When primary removed → action STOPS, object stays in current state
+       - will_move: false
+
+    **SPECIAL CASE - Object Currently Moving But Should Have Stayed:**
+    If primary object CAUSES another object to move (hitting, kicking, throwing):
383
+ - The object is currently moving in the video
384
+ - But WITHOUT primary, it would have stayed at its original position
385
+ - You MUST provide:
386
+ β€’ "currently_moving": true
387
+ β€’ "should_have_stayed": true
388
+ β€’ "original_position_grid": {{"row": R, "col": C}} - Where it started
389
+
390
+ Examples:
391
+ - Golf club hits ball β†’ Ball at tee, then flies (mark original tee position)
392
+ - Person kicks soccer ball β†’ Ball on ground, then rolls (mark original ground position)
393
+ - Hand throws object β†’ Object held, then flies (mark original held position)
394
+
395
+ Format:
396
+ {{
397
+ "noun": "golf ball",
398
+ "category": "physical",
399
+ "currently_moving": true,
400
+ "should_have_stayed": true,
401
+ "original_position_grid": {{"row": 6, "col": 7}},
402
+ "why": "ball was stationary until club hit it"
403
+ }}
404
+
405
+ For each physical object, determine:
406
+ - **will_move**: true ONLY if object will fall/move when support removed
407
+ - **first_appears_frame**: frame number object first appears (0 if from start)
408
+ - **why**: Brief explanation of relationship to primary object
409
+
410
+ IF will_move=TRUE, also provide GRID-BASED TRAJECTORY:
411
+ - **object_size_grids**: {{"rows": R, "cols": C}} - How many grid cells object occupies
412
+ IMPORTANT: Add 1 extra cell padding for safety (better to over-mask than under-mask)
413
+ Example: Object looks 2Γ—1 β†’ report as 3Γ—2
414
+
415
+ - **trajectory_path**: List of keyframe positions as grid coordinates
416
+ Format: [{{"frame": N, "grid_row": R, "grid_col": C}}, ...]
417
+ - IMPORTANT: First keyframe should be at first_appears_frame (not frame 0 if object appears later!)
418
+ - Provide 3-5 keyframes spanning from first appearance to end
419
+ - (grid_row, grid_col) is the CENTER position of object at that frame
420
+ - Use the yellow grid reference frames to determine positions
421
+ - For objects appearing mid-video: use the grid samples to locate them
422
+ - Example: Object appears at frame 15, falls to bottom
423
+ [{{"frame": 15, "grid_row": 3, "grid_col": 5}}, ← First appearance
424
+ {{"frame": 25, "grid_row": 6, "grid_col": 5}}, ← Mid-fall
425
+ {{"frame": 35, "grid_row": 9, "grid_col": 5}}] ← On ground
426
+
427
+ βœ“ Objects held/carried at ANY point in video
428
+ βœ“ Objects the primary supports or interacts with
429
+ βœ“ Visual effects visible at any time
430
+
431
+ βœ— Background objects never touched
432
+ βœ— Other people/animals with no contact
433
+ βœ— Integral belongings (already in Step 1)
434
+
435
+ STEP 3: SCENE DESCRIPTION
436
+ ──────────────────────────
437
+ Describe scene WITHOUT the primary object (1-2 sentences).
438
+ Focus on what remains and any dynamic changes (falling objects, etc).
439
+
440
+ ═══════════════════════════════════════════════════════════════════
441
+ OUTPUT FORMAT (STRICT JSON ONLY)
442
+ ═══════════════════════════════════════════════════════════════════
443
+
444
+ EXAMPLES TO LEARN FROM:
445
+
446
+ Example 1: Person holding guitar
447
+ {{
448
+ "affected_objects": [
449
+ {{
450
+ "noun": "guitar",
451
+ "will_move": true,
452
+ "why": "person is SUPPORTING guitar against gravity by holding it",
453
+ "object_size_grids": {{"rows": 3, "cols": 2}},
454
+ "trajectory_path": [
455
+ {{"frame": 0, "grid_row": 4, "grid_col": 5}},
456
+ {{"frame": 15, "grid_row": 6, "grid_col": 5}},
457
+ {{"frame": 30, "grid_row": 8, "grid_col": 6}}
458
+ ]
459
+ }}
460
+ ]
461
+ }}
462
+
463
+ Example 2: Hand crushing can on table
464
+ {{
465
+ "affected_objects": [
466
+ {{
467
+ "noun": "can",
468
+ "will_move": false,
469
+ "why": "can RESTS ON TABLE - hand is just acting on it. When hand removed, can stays on table (uncrushed)"
470
+ }}
471
+ ]
472
+ }}
473
+
474
+ Example 3: Hands opening can on counter
475
+ {{
476
+ "affected_objects": [
477
+ {{
478
+ "noun": "can",
479
+ "will_move": false,
480
+ "why": "can RESTS ON COUNTER - hands are doing opening action. When hands removed, can stays closed on counter"
481
+ }}
482
+ ]
483
+ }}
484
+
485
+ Example 4: Person sitting on chair
486
+ {{
487
+ "affected_objects": [
488
+ {{
489
+ "noun": "chair",
490
+ "will_move": false,
491
+ "why": "chair RESTS ON FLOOR - person sitting on it doesn't make it fall. Chair stays on floor when person removed"
492
+ }}
493
+ ]
494
+ }}
495
+
496
+ Example 5: Person throws ball (ball appears at frame 12)
497
+ {{
498
+ "affected_objects": [
499
+ {{
500
+ "noun": "ball",
501
+ "category": "physical",
502
+ "will_move": true,
503
+ "first_appears_frame": 12,
504
+ "why": "ball is SUPPORTED by person's hand, then thrown",
505
+ "object_size_grids": {{"rows": 2, "cols": 2}},
506
+ "trajectory_path": [
507
+ {{"frame": 12, "grid_row": 4, "grid_col": 3}},
508
+ {{"frame": 20, "grid_row": 2, "grid_col": 6}},
509
+ {{"frame": 28, "grid_row": 5, "grid_col": 8}}
510
+ ]
511
+ }}
512
+ ]
513
+ }}
514
+
515
+ Example 6: Person with shadow (shadow needs grid localization)
516
+ {{
517
+ "affected_objects": [
518
+ {{
519
+ "noun": "shadow",
520
+ "category": "visual_artifact",
521
+ "why": "cast by person on the floor",
522
+ "will_move": false,
523
+ "first_appears_frame": 0,
524
+ "movement_description": "Disappears entirely as visual artifact",
525
+ "grid_localizations": [
526
+ {{"frame": 0, "grid_regions": [{{"row": 6, "col": 3}}, {{"row": 6, "col": 4}}, {{"row": 7, "col": 3}}, {{"row": 7, "col": 4}}]}},
527
+ {{"frame": 12, "grid_regions": [{{"row": 6, "col": 4}}, {{"row": 6, "col": 5}}, {{"row": 7, "col": 4}}]}},
528
+ {{"frame": 23, "grid_regions": [{{"row": 5, "col": 4}}, {{"row": 6, "col": 4}}, {{"row": 6, "col": 5}}]}},
529
+ {{"frame": 35, "grid_regions": [{{"row": 6, "col": 3}}, {{"row": 6, "col": 4}}, {{"row": 7, "col": 3}}]}},
530
+ {{"frame": 47, "grid_regions": [{{"row": 6, "col": 3}}, {{"row": 7, "col": 3}}, {{"row": 7, "col": 4}}]}}
531
+ ]
532
+ }}
533
+ ]
534
+ }}
535
+
536
+ Example 7: Golf club hits ball (Case 4 - currently moving but should stay)
537
+ {{
538
+ "affected_objects": [
539
+ {{
540
+ "noun": "golf ball",
541
+ "category": "physical",
542
+ "currently_moving": true,
543
+ "should_have_stayed": true,
544
+ "original_position_grid": {{"row": 6, "col": 7}},
545
+ "first_appears_frame": 0,
546
+ "why": "ball was stationary on tee until club hit it. Without club, ball would remain at original position."
547
+ }}
548
+ ]
549
+ }}
550
+
551
+ YOUR OUTPUT FORMAT:
552
+ {{
553
+ "edit_instruction": "{instruction}",
554
+ "integral_belongings": [
555
+ {{
556
+ "noun": "bike",
557
+ "why": "person is riding the bike throughout the video"
558
+ }}
559
+ ],
560
+ "affected_objects": [
561
+ {{
562
+ "noun": "guitar",
563
+ "category": "physical",
564
+ "why": "person is SUPPORTING guitar against gravity by holding it",
565
+ "will_move": true,
566
+ "first_appears_frame": 0,
567
+ "movement_description": "Will fall from held position to the ground",
568
+ "object_size_grids": {{"rows": 3, "cols": 2}},
569
+ "trajectory_path": [
570
+ {{"frame": 0, "grid_row": 3, "grid_col": 6}},
571
+ {{"frame": 20, "grid_row": 6, "grid_col": 6}},
572
+ {{"frame": 40, "grid_row": 9, "grid_col": 7}}
573
+ ]
574
+ }},
575
+ {{
576
+ "noun": "shadow",
577
+ "category": "visual_artifact",
578
+ "why": "cast by person on floor",
579
+ "will_move": false,
580
+ "first_appears_frame": 0,
581
+ "movement_description": "Disappears entirely as visual artifact"
582
+ }}
583
+ ],
584
+ "scene_description": "An acoustic guitar falling to the ground in an empty room. Natural window lighting.",
585
+ "confidence": 0.85
586
+ }}
587
+
588
+ CRITICAL REMINDERS:
589
+ β€’ Watch ENTIRE video before answering
590
+ β€’ SUPPORTING vs ACTING ON:
591
+ - Primary HOLDS UP object against gravity β†’ will_move=TRUE (provide grid trajectory)
592
+ - Primary ACTS ON object (crushing, opening) but object on stable surface β†’ will_move=FALSE
593
+ - Object RESTS ON stable surface (table, floor) β†’ will_move=FALSE
594
+ β€’ For visual artifacts (shadow, reflection): will_move=false (no trajectory needed)
595
+ β€’ For held objects (guitar, cup): will_move=true (MUST provide object_size_grids + trajectory_path)
596
+ β€’ For objects on surfaces being acted on (can being crushed, can being opened): will_move=false
597
+ β€’ Grid trajectory: Add +1 cell padding to object size (over-mask is better than under-mask)
598
+ β€’ Grid trajectory: Use the yellow grid overlay to determine (row, col) positions
599
+ β€’ Be conservative - when in doubt, DON'T include
600
+ β€’ Output MUST be valid JSON only
601
+
602
+ GRID INFO: {grid_rows} rows Γ— {grid_cols} columns
603
+ EDIT INSTRUCTION: {instruction}
604
+ """.strip()
+
+
+ def call_vlm_with_images_and_video(client, model: str, image_data_urls: list,
+                                    video_data_url: str, prompt: str) -> str:
+     """Call VLM with multiple images and video"""
+     content = []
+
+     # Add all images first
+     for img_url in image_data_urls:
+         content.append({"type": "image_url", "image_url": {"url": img_url}})
+
+     # Add video
+     content.append({"type": "image_url", "image_url": {"url": video_data_url}})
+
+     # Add prompt
+     content.append({"type": "text", "text": prompt})
+
+     resp = client.chat.completions.create(
+         model=model,
+         messages=[
+             {
+                 "role": "system",
+                 "content": "You are an expert video analyst with deep understanding of physics and object interactions. Always output valid JSON only."
+             },
+             {
+                 "role": "user",
+                 "content": content
+             },
+         ],
+     )
+     return resp.choices[0].message.content
+
+
+ def parse_vlm_response(raw: str) -> Dict:
+     """Parse VLM JSON response"""
+     # Strip markdown code blocks
+     cleaned = raw.strip()
+     if cleaned.startswith("```"):
+         lines = cleaned.split('\n')
+         if lines[0].startswith("```"):
+             lines = lines[1:]
+         if lines and lines[-1].strip() == "```":
+             lines = lines[:-1]
+         cleaned = '\n'.join(lines)
+
+     try:
+         parsed = json.loads(cleaned)
+     except json.JSONDecodeError:
+         # Try to find JSON in response
+         start = cleaned.find("{")
+         end = cleaned.rfind("}")
+         if start != -1 and end != -1 and end > start:
+             parsed = json.loads(cleaned[start:end+1])
+         else:
+             raise ValueError("Failed to parse VLM response as JSON")
+
+     # Validate structure
+     result = {
+         "edit_instruction": parsed.get("edit_instruction", ""),
+         "integral_belongings": [],
+         "affected_objects": [],
+         "scene_description": parsed.get("scene_description", ""),
+         "confidence": float(parsed.get("confidence", 0.0))
+     }
+
+     # Parse integral belongings
+     for item in parsed.get("integral_belongings", [])[:3]:
+         obj = {
+             "noun": str(item.get("noun", "")).strip().lower(),
+             "why": str(item.get("why", "")).strip()[:200]
+         }
+         if obj["noun"]:
+             result["integral_belongings"].append(obj)
+
+     # Parse affected objects
+     for item in parsed.get("affected_objects", [])[:5]:
+         obj = {
+             "noun": str(item.get("noun", "")).strip().lower(),
+             "category": str(item.get("category", "physical")).strip().lower(),
+             "why": str(item.get("why", "")).strip()[:200],
+             "will_move": bool(item.get("will_move", False)),
+             "first_appears_frame": int(item.get("first_appears_frame", 0)),
+             "movement_description": str(item.get("movement_description", "")).strip()[:300]
+         }
+
+         # Parse Case 4: currently moving but should have stayed
+         if "currently_moving" in item:
+             obj["currently_moving"] = bool(item.get("currently_moving", False))
+         if "should_have_stayed" in item:
+             obj["should_have_stayed"] = bool(item.get("should_have_stayed", False))
+         if "original_position_grid" in item:
+             orig_grid = item.get("original_position_grid", {})
+             obj["original_position_grid"] = {
+                 "row": int(orig_grid.get("row", 0)),
+                 "col": int(orig_grid.get("col", 0))
+             }
+
+         # Parse grid localizations for visual artifacts
+         if "grid_localizations" in item:
+             grid_locs = []
+             for loc in item.get("grid_localizations", []):
+                 frame_loc = {
+                     "frame": int(loc.get("frame", 0)),
+                     "grid_regions": []
+                 }
+                 for region in loc.get("grid_regions", []):
+                     frame_loc["grid_regions"].append({
+                         "row": int(region.get("row", 0)),
+                         "col": int(region.get("col", 0))
+                     })
+                 if frame_loc["grid_regions"]:  # Only add if has regions
+                     grid_locs.append(frame_loc)
+             if grid_locs:
+                 obj["grid_localizations"] = grid_locs
+
+         # Parse grid trajectory if will_move=true
+         if obj["will_move"] and "object_size_grids" in item and "trajectory_path" in item:
+             size_grids = item.get("object_size_grids", {})
+             obj["object_size_grids"] = {
+                 "rows": int(size_grids.get("rows", 2)),
+                 "cols": int(size_grids.get("cols", 2))
+             }
+
+             trajectory = []
+             for point in item.get("trajectory_path", []):
+                 trajectory.append({
+                     "frame": int(point.get("frame", 0)),
+                     "grid_row": int(point.get("grid_row", 0)),
+                     "grid_col": int(point.get("grid_col", 0))
+                 })
+
+             if trajectory:  # Only add if we have valid trajectory points
+                 obj["trajectory_path"] = trajectory
+
+         if obj["noun"]:
+             result["affected_objects"].append(obj)
+
+     return result
+
+
+ def process_video(video_info: Dict, client, model: str):
+     """Process a single video with VLM analysis"""
+     video_path = video_info.get("video_path", "")
+     instruction = video_info.get("instruction", "")
+     output_dir = video_info.get("output_dir", "")
+
+     if not output_dir:
+         print(f"  ⚠️ No output_dir specified, skipping")
+         return None
+
+     output_dir = Path(output_dir)
+     if not output_dir.exists():
+         print(f"  ⚠️ Output directory not found: {output_dir}")
+         print(f"     Run Stage 1 first to create black masks")
+         return None
+
+     # Check required files from Stage 1
+     black_mask_path = output_dir / "black_mask.mp4"
+     first_frame_path = output_dir / "first_frame.jpg"
+     input_video_path = output_dir / "input_video.mp4"
+     segmentation_info_path = output_dir / "segmentation_info.json"
+
+     if not black_mask_path.exists():
+         print(f"  ⚠️ black_mask.mp4 not found in {output_dir}")
+         print(f"     Run Stage 1 first")
+         return None
+
+     if not first_frame_path.exists():
+         print(f"  ⚠️ first_frame.jpg not found in {output_dir}")
+         return None
+
+     if not input_video_path.exists():
+         # Try original video path
+         if Path(video_path).exists():
+             input_video_path = Path(video_path)
+         else:
+             print(f"  ⚠️ Video not found: {video_path}")
+             return None
+
+     # Read segmentation metadata to get correct frame index
+     frame_idx = 0  # Default
+     if segmentation_info_path.exists():
+         try:
+             with open(segmentation_info_path, 'r') as f:
+                 seg_info = json.load(f)
+             frame_idx = seg_info.get("first_appears_frame", 0)
+             print(f"  Using frame {frame_idx} from segmentation metadata")
+         except Exception as e:
+             print(f"  Warning: Could not read segmentation_info.json: {e}")
+             print(f"  Using frame 0 as fallback")
+
+     # Get min_grid for grid calculation
+     min_grid = video_info.get('min_grid', 8)
+     use_multi_frame_grids = video_info.get('multi_frame_grids', True)  # Default: use multi-frame
+     max_video_size_mb = video_info.get('max_video_size_for_multiframe', 25)  # Default: 25MB limit
+
+     # Check video size and auto-disable multi-frame for large videos
+     if use_multi_frame_grids:
+         video_size_mb = input_video_path.stat().st_size / (1024 * 1024)
+         if video_size_mb > max_video_size_mb:
+             print(f"  ⚠️ Video size ({video_size_mb:.1f} MB) exceeds {max_video_size_mb} MB")
+             print(f"     Auto-disabling multi-frame grids to avoid API errors")
+             use_multi_frame_grids = False
+
+     print(f"  Creating frame overlays and grids...")
+     overlay_path = output_dir / "first_frame_with_mask.jpg"
+     gridded_path = output_dir / "first_frame_with_grid.jpg"
+
+     # Create regular overlay (for backwards compatibility)
+     create_first_frame_with_mask_overlay(
+         str(first_frame_path),
+         str(black_mask_path),
+         str(overlay_path),
+         frame_idx=frame_idx
+     )
+
+     image_data_urls = []
+
+     if use_multi_frame_grids:
+         # Create multi-frame grid samples for objects appearing mid-video
+         print(f"  Creating multi-frame grid samples (0%, 25%, 50%, 75%, 100%)...")
+         sample_paths, grid_rows, grid_cols = create_multi_frame_grid_samples(
+             str(input_video_path),
+             output_dir,
+             min_grid=min_grid
+         )
+
+         # Encode all grid samples
+         for sample_path in sample_paths:
+             image_data_urls.append(image_to_data_url(str(sample_path)))
+
+         # Also add the first frame with mask overlay
+         _, _, _ = create_gridded_frame_overlay(
+             str(first_frame_path),
+             str(black_mask_path),
+             str(gridded_path),
+             min_grid=min_grid
+         )
+         image_data_urls.append(image_to_data_url(str(gridded_path)))
+
+         print(f"  Grid: {grid_rows}x{grid_cols}, {len(sample_paths)} sample frames + masked frame")
+
+     else:
+         # Single gridded first frame (old approach)
+         _, grid_rows, grid_cols = create_gridded_frame_overlay(
+             str(first_frame_path),
+             str(black_mask_path),
+             str(gridded_path),
+             min_grid=min_grid
+         )
+         image_data_urls.append(image_to_data_url(str(gridded_path)))
+         print(f"  Grid: {grid_rows}x{grid_cols} (single frame)")
+
+     print(f"  Encoding video for VLM...")
+
+     # Check video size
+     video_size_mb = input_video_path.stat().st_size / (1024 * 1024)
+     print(f"  Video size: {video_size_mb:.1f} MB")
+
+     if video_size_mb > 20:
+         print(f"  ⚠️ Warning: Large video may cause API errors")
+         if use_multi_frame_grids:
+             print(f"     Consider setting multi_frame_grids=false for large videos")
+
+     video_data_url = video_to_data_url(str(input_video_path))
+
+     print(f"  Calling {model}...")
+     prompt = make_vlm_analysis_prompt(instruction, grid_rows, grid_cols,
+                                       has_multi_frame_grids=use_multi_frame_grids)
+
+     try:
+         try:
+             raw_response = call_vlm_with_images_and_video(
+                 client, model, image_data_urls, video_data_url, prompt
+             )
+         except Exception as e:
+             # If multi-frame fails (likely payload size issue), fall back to single frame
+             if use_multi_frame_grids and "400" in str(e):
+                 print(f"  ⚠️ Multi-frame request failed (payload too large?)")
+                 print(f"     Falling back to single-frame grid mode...")
+
+                 # Retry with just the gridded first frame
+                 image_data_urls = [image_to_data_url(str(gridded_path))]
+                 prompt = make_vlm_analysis_prompt(instruction, grid_rows, grid_cols,
+                                                   has_multi_frame_grids=False)
+
+                 try:
+                     raw_response = call_vlm_with_images_and_video(
+                         client, model, image_data_urls, video_data_url, prompt
+                     )
+                     print(f"  ✓ Single-frame fallback succeeded")
+                 except Exception as e2:
+                     raise e2  # Re-raise if fallback also fails
+             else:
+                 raise  # Re-raise if not a 400 or not multi-frame mode
+
+         # Parse and save results (runs whether first call succeeded or fallback succeeded)
+         print(f"  Parsing VLM response...")
+         analysis = parse_vlm_response(raw_response)
+
+         # Save results
+         output_path = output_dir / "vlm_analysis.json"
+         with open(output_path, 'w') as f:
+             json.dump(analysis, f, indent=2)
+
+         print(f"  ✓ Saved VLM analysis: {output_path.name}")
+
+         # Print summary
+         print(f"\n  Summary:")
+         print(f"  - Integral belongings: {len(analysis['integral_belongings'])}")
+         for obj in analysis['integral_belongings']:
+             print(f"    • {obj['noun']}: {obj['why']}")
+
+         print(f"  - Affected objects: {len(analysis['affected_objects'])}")
+         for obj in analysis['affected_objects']:
+             move_str = "WILL MOVE" if obj['will_move'] else "STAYS/DISAPPEARS"
+             traj_str = ""
+             if obj.get('will_move') and 'trajectory_path' in obj:
+                 num_points = len(obj['trajectory_path'])
+                 size = obj.get('object_size_grids', {})
+                 traj_str = f" (trajectory: {num_points} keyframes, size: {size.get('rows')}×{size.get('cols')} grids)"
+             print(f"    • {obj['noun']}: {move_str}{traj_str}")
+
+         return analysis
+
+     except Exception as e:
+         print(f"  ❌ VLM analysis failed: {e}")
+         import traceback
+         traceback.print_exc()
+         return None
+
+
+ def process_config(config_path: str, model: str = DEFAULT_MODEL):
+     """Process all videos in config"""
+     config_path = Path(config_path)
+
+     # Load config
+     with open(config_path, 'r') as f:
+         config_data = json.load(f)
+
+     # Handle both formats
+     if isinstance(config_data, list):
+         videos = config_data
+     elif isinstance(config_data, dict) and "videos" in config_data:
+         videos = config_data["videos"]
+     else:
+         raise ValueError("Config must be a list or have 'videos' key")
+
+     print(f"\n{'='*70}")
+     print(f"Stage 2: VLM Analysis - Identify Affected Objects")
+     print(f"{'='*70}")
+     print(f"Config: {config_path.name}")
+     print(f"Videos: {len(videos)}")
+     print(f"Model: {model}")
+     print(f"{'='*70}\n")
+
+     # Initialize VLM client
+     api_key = os.environ.get("GEMINI_API_KEY")
+     if not api_key:
+         raise RuntimeError("GEMINI_API_KEY environment variable not set")
+     client = openai.OpenAI(
+         api_key=api_key,
+         base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
+     )
+
+     # Process each video
+     results = []
+     for i, video_info in enumerate(videos):
+         video_path = video_info.get("video_path", "")
+         instruction = video_info.get("instruction", "")
+
+         print(f"\n{'─'*70}")
+         print(f"Video {i+1}/{len(videos)}: {Path(video_path).name}")
+         print(f"{'─'*70}")
+         print(f"Instruction: {instruction}")
+
+         try:
+             analysis = process_video(video_info, client, model)
+             results.append({
+                 "video": video_path,
+                 "success": analysis is not None,
+                 "analysis": analysis
+             })
+
+             if analysis:
+                 print(f"\n✅ Video {i+1} complete!")
+             else:
+                 print(f"\n⚠️ Video {i+1} skipped")
+
+         except Exception as e:
+             print(f"\n❌ Error processing video {i+1}: {e}")
+             results.append({
+                 "video": video_path,
+                 "success": False,
+                 "error": str(e)
+             })
+             continue
+
+     # Summary
+     print(f"\n{'='*70}")
+     print(f"Stage 2 Complete!")
+     print(f"{'='*70}")
+     successful = sum(1 for r in results if r["success"])
+     print(f"Successful: {successful}/{len(videos)}")
+     print(f"{'='*70}\n")
+
+
+ def main():
+     parser = argparse.ArgumentParser(description="Stage 2: VLM Analysis")
+     parser.add_argument("--config", required=True, help="Config JSON from Stage 1")
+     parser.add_argument("--model", default=DEFAULT_MODEL, help="VLM model name")
+     args = parser.parse_args()
+
+     process_config(args.config, args.model)
+
+
+ if __name__ == "__main__":
+     main()
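The fence-stripping path in `parse_vlm_response` is worth exercising on its own, since Gemini frequently wraps its JSON answer in a Markdown ```json block. A self-contained sketch of that same stripping logic (the `strip_fences` name is illustrative, not part of the script):

```python
import json

def strip_fences(raw: str) -> str:
    # Same behavior as the fence handling in parse_vlm_response:
    # drop a leading ``` / ```json line and a trailing ``` line,
    # keep everything in between untouched.
    cleaned = raw.strip()
    if cleaned.startswith("```"):
        lines = cleaned.split('\n')
        if lines[0].startswith("```"):
            lines = lines[1:]
        if lines and lines[-1].strip() == "```":
            lines = lines[:-1]
        cleaned = '\n'.join(lines)
    return cleaned

raw = '```json\n{"confidence": 0.85, "affected_objects": []}\n```'
parsed = json.loads(strip_fences(raw))
print(parsed["confidence"])  # 0.85
```

If the model emits prose around the JSON instead of a fence, the script's fallback (`find("{")` / `rfind("}")`) covers that case; only a response with no brace-delimited object at all raises.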
VLM-MASK-REASONER/stage2_vlm_analysis_cf.py ADDED
@@ -0,0 +1,1024 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ Stage 2: VLM Analysis - Identify Affected Objects & Physics
4
+ (Cloudflare AI Gateway variant)
5
+
6
+ Identical to stage2_vlm_analysis.py but routes through the internal CF AI Gateway
7
+ instead of calling the Gemini API directly. Video is sent as sampled frames rather
8
+ than a raw video data URL (not supported by the OpenAI-compat endpoint).
9
+
10
+ Required environment variables:
11
+ CF_PROJECT_ID - Cloudflare AI Gateway project ID
12
+ CF_USER_ID - Cloudflare AI Gateway user ID
13
+ MODEL_ID - Model identifier to use (e.g. "gemini-3-pro-preview")
14
+
15
+ Usage:
16
+ python stage2_vlm_analysis_cf.py --config my_config_points.json
17
+ """
18
+
19
+ import os
20
+ import sys
21
+ import json
22
+ import argparse
23
+ import cv2
24
+ import numpy as np
25
+ import base64
26
+ from pathlib import Path
27
+ from typing import Dict, List
28
+ from PIL import Image, ImageDraw
29
+
30
+ import openai
31
+
32
+ DEFAULT_MODEL = "gemini-3-pro-preview"
33
+
34
+
35
+ def image_to_data_url(image_path: str) -> str:
36
+ """Convert image file to base64 data URL"""
37
+ with open(image_path, 'rb') as f:
38
+ img_data = base64.b64encode(f.read()).decode('utf-8')
39
+
40
+ # Detect format
41
+ ext = Path(image_path).suffix.lower()
42
+ if ext == '.png':
43
+ mime = 'image/png'
44
+ elif ext in ['.jpg', '.jpeg']:
45
+ mime = 'image/jpeg'
46
+ else:
47
+ mime = 'image/jpeg'
48
+
49
+ return f"data:{mime};base64,{img_data}"
50
+
51
+
52
+ def video_to_data_url(video_path: str) -> str:
53
+ """Convert video file to base64 data URL"""
54
+ with open(video_path, 'rb') as f:
55
+ video_data = base64.b64encode(f.read()).decode('utf-8')
56
+ return f"data:video/mp4;base64,{video_data}"
57
+
58
+
59
+ def calculate_square_grid(width: int, height: int, min_grid: int = 8) -> tuple:
60
+ """Calculate grid dimensions matching stage3a logic"""
61
+ aspect_ratio = width / height
62
+ if width >= height:
63
+ grid_rows = min_grid
64
+ grid_cols = max(min_grid, round(min_grid * aspect_ratio))
65
+ else:
66
+ grid_cols = min_grid
67
+ grid_rows = max(min_grid, round(min_grid / aspect_ratio))
68
+ return grid_rows, grid_cols
69
+
70
+
71
+ def create_first_frame_with_mask_overlay(first_frame_path: str, black_mask_path: str,
72
+ output_path: str, frame_idx: int = 0) -> str:
73
+ """Create visualization of first frame with red overlay on primary object
74
+
75
+ Args:
76
+ first_frame_path: Path to first_frame.jpg
77
+ black_mask_path: Path to black_mask.mp4
78
+ output_path: Where to save overlay
79
+ frame_idx: Which frame to extract from black_mask.mp4 (default: 0)
80
+ """
81
+ # Load first frame
82
+ frame = cv2.imread(first_frame_path)
83
+ if frame is None:
84
+ raise ValueError(f"Failed to load first frame: {first_frame_path}")
85
+
86
+ # Load black mask video and get the specified frame
87
+ cap = cv2.VideoCapture(black_mask_path)
88
+ cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)
89
+ ret, mask_frame = cap.read()
90
+ cap.release()
91
+
92
+ if not ret:
93
+ raise ValueError(f"Failed to load black mask frame {frame_idx}: {black_mask_path}")
94
+
95
+ # Convert mask to binary (0 = object, 255 = background)
96
+ if len(mask_frame.shape) == 3:
97
+ mask_frame = cv2.cvtColor(mask_frame, cv2.COLOR_BGR2GRAY)
98
+
99
+ object_mask = (mask_frame == 0)
100
+
101
+ # Create red overlay on object
102
+ overlay = frame.copy()
103
+ overlay[object_mask] = [0, 0, 255] # Red in BGR
104
+
105
+ # Blend: 60% original + 40% red overlay
106
+ result = cv2.addWeighted(frame, 0.6, overlay, 0.4, 0)
107
+
108
+ # Save
109
+ cv2.imwrite(output_path, result)
110
+ return output_path
111
+
112
+
113
+ def create_gridded_frame_overlay(first_frame_path: str, black_mask_path: str,
+ output_path: str, min_grid: int = 8) -> tuple:
+ """Create first frame with BOTH red mask overlay AND grid lines
+
+ Returns: (output_path, grid_rows, grid_cols)
+ """
+ # Load first frame
+ frame = cv2.imread(first_frame_path)
+ if frame is None:
+ raise ValueError(f"Failed to load first frame: {first_frame_path}")
+
+ h, w = frame.shape[:2]
+
+ # Load black mask
+ cap = cv2.VideoCapture(black_mask_path)
+ ret, mask_frame = cap.read()
+ cap.release()
+
+ if not ret:
+ raise ValueError(f"Failed to load black mask: {black_mask_path}")
+
+ if len(mask_frame.shape) == 3:
+ mask_frame = cv2.cvtColor(mask_frame, cv2.COLOR_BGR2GRAY)
+
+ object_mask = (mask_frame == 0)
+
+ # Create red overlay
+ overlay = frame.copy()
+ overlay[object_mask] = [0, 0, 255]
+ result = cv2.addWeighted(frame, 0.6, overlay, 0.4, 0)
+
+ # Calculate grid
+ grid_rows, grid_cols = calculate_square_grid(w, h, min_grid)
+
+ # Draw grid lines
+ cell_width = w / grid_cols
+ cell_height = h / grid_rows
+
+ # Vertical lines
+ for col in range(1, grid_cols):
+ x = int(col * cell_width)
+ cv2.line(result, (x, 0), (x, h), (0, 255, 255), 1) # Yellow in BGR is (0, 255, 255)
+
+ # Horizontal lines
+ for row in range(1, grid_rows):
+ y = int(row * cell_height)
+ cv2.line(result, (0, y), (w, y), (0, 255, 255), 1)
+
+ # Add grid labels
+ font = cv2.FONT_HERSHEY_SIMPLEX
+ font_scale = 0.3
+ thickness = 1
+
+ # Label columns at top
+ for col in range(grid_cols):
+ x = int((col + 0.5) * cell_width)
+ cv2.putText(result, str(col), (x-5, 15), font, font_scale, (0, 255, 255), thickness)
+
+ # Label rows on left
+ for row in range(grid_rows):
+ y = int((row + 0.5) * cell_height)
+ cv2.putText(result, str(row), (5, y+5), font, font_scale, (0, 255, 255), thickness)
+
+ cv2.imwrite(output_path, result)
+ return output_path, grid_rows, grid_cols
+
+
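`calculate_square_grid` (defined earlier in this file) fixes the cell counts, and the drawing code above maps grid coordinates to pixels with `cell_width = w / grid_cols` and `cell_height = h / grid_rows`. Later stages need the inverse mapping, from a VLM's (row, col) answer back to a pixel rectangle; a sketch under assumed numbers (the function name and the 1280x720 frame with a 9x16 grid are illustrative):

```python
# Illustrative inverse of the grid drawing above: pixel bounds of one cell.
def cell_to_rect(w, h, grid_rows, grid_cols, row, col):
    """Return (x0, y0, x1, y1) pixel bounds of grid cell (row, col)."""
    cw, ch = w / grid_cols, h / grid_rows
    return (int(col * cw), int(row * ch), int((col + 1) * cw), int((row + 1) * ch))

print(cell_to_rect(1280, 720, 9, 16, 0, 0))   # (0, 0, 80, 80) -- top-left cell
print(cell_to_rect(1280, 720, 9, 16, 8, 15))  # (1200, 640, 1280, 720) -- bottom-right cell
```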
+ def create_multi_frame_grid_samples(video_path: str, output_dir: Path,
+ min_grid: int = 8,
+ sample_points: list = [0.0, 0.11, 0.22, 0.33, 0.44, 0.56, 0.67, 0.78, 0.89, 1.0]) -> tuple:
+ """
+ Create gridded frame samples at multiple time points in video.
+ Helps VLM see objects that appear mid-video with grid reference.
+
+ Args:
+ video_path: Path to video
+ output_dir: Where to save samples
+ min_grid: Minimum grid size
+ sample_points: List of normalized positions [0.0-1.0] to sample
+
+ Returns: (sample_paths, grid_rows, grid_cols)
+ """
+ cap = cv2.VideoCapture(str(video_path))
+ total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
+ w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
+ h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
+
+ # Calculate grid (same for all frames)
+ grid_rows, grid_cols = calculate_square_grid(w, h, min_grid)
+ cell_width = w / grid_cols
+ cell_height = h / grid_rows
+
+ sample_paths = []
+
+ for i, t in enumerate(sample_points):
+ frame_idx = int(t * (total_frames - 1))
+ frame_idx = max(0, min(frame_idx, total_frames - 1))
+
+ cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)
+ ret, frame = cap.read()
+ if not ret:
+ continue
+
+ # Draw grid
+ result = frame.copy()
+
+ # Vertical lines
+ for col in range(1, grid_cols):
+ x = int(col * cell_width)
+ cv2.line(result, (x, 0), (x, h), (0, 255, 255), 2) # Yellow in BGR
+
+ # Horizontal lines
+ for row in range(1, grid_rows):
+ y = int(row * cell_height)
+ cv2.line(result, (0, y), (w, y), (0, 255, 255), 2)
+
+ # Add grid labels
+ font = cv2.FONT_HERSHEY_SIMPLEX
+ font_scale = 0.4
+ thickness = 1
+
+ # Label columns
+ for col in range(grid_cols):
+ x = int((col + 0.5) * cell_width)
+ cv2.putText(result, str(col), (x-8, 20), font, font_scale, (0, 255, 255), thickness)
+
+ # Label rows
+ for row in range(grid_rows):
+ y = int((row + 0.5) * cell_height)
+ cv2.putText(result, str(row), (10, y+8), font, font_scale, (0, 255, 255), thickness)
+
+ # Add frame number and percentage
+ label = f"Frame {frame_idx} ({int(t*100)}%)"
+ cv2.putText(result, label, (10, h-10), font, 0.5, (0, 255, 255), 2)
+
+ # Save
+ output_path = output_dir / f"grid_sample_frame_{frame_idx:04d}.jpg"
+ cv2.imwrite(str(output_path), result)
+ sample_paths.append(output_path)
+
+ cap.release()
+ return sample_paths, grid_rows, grid_cols
+
+
+ def make_vlm_analysis_prompt(instruction: str, grid_rows: int, grid_cols: int,
+ has_multi_frame_grids: bool = False) -> str:
+ """Create VLM prompt for analyzing video with primary mask"""
+
+ grid_context = ""
+ if has_multi_frame_grids:
+ grid_context = f"""
+ 1. **Multiple Grid Reference Frames**: Sampled frames at 0%, 11%, 22%, 33%, 44%, 56%, 67%, 78%, 89%, 100% of video
+ - Each frame shows YELLOW GRID with {grid_rows} rows × {grid_cols} columns
+ - Grid cells labeled (row, col) starting from (0, 0) at top-left
+ - Frame number shown at bottom
+ - Use these to locate objects that appear MID-VIDEO and track object positions across time
+ 2. **First Frame with RED mask**: Shows what will be REMOVED (primary object)
+ 3. **Full Video**: Complete action and interactions"""
+ else:
+ grid_context = f"""
+ 1. **First Frame with Grid**: PRIMARY OBJECT highlighted in RED + GRID OVERLAY
+ - The red overlay shows what will be REMOVED (already masked)
+ - Yellow grid with {grid_rows} rows × {grid_cols} columns
+ - Grid cells are labeled (row, col) starting from (0, 0) at top-left
+ 2. **Full Video**: Complete scene and action"""
+
+ return f"""
+ You are an expert video analyst specializing in physics and object interactions.
+
+ ═══════════════════════════════════════════════════════════════════
+ CONTEXT
+ ═══════════════════════════════════════════════════════════════════
+
+ You will see MULTIPLE inputs:
+ {grid_context}
+
+ Edit instruction: "{instruction}"
+
+ IMPORTANT: Some objects may NOT appear in first frame. They may enter later.
+ Watch the ENTIRE video and note when each object first appears.
+
+ ═══════════════════════════════════════════════════════════════════
+ YOUR TASK
+ ═══════════════════════════════════════════════════════════════════
+
+ Analyze what would happen if the PRIMARY OBJECT (shown in red) is removed.
+ Watch the ENTIRE video to see all interactions and movements.
+
+ STEP 1: IDENTIFY INTEGRAL BELONGINGS (0-3 items)
+ ─────────────────────────────────────────────────
+ Items that should be ADDED to the primary removal mask (removed WITH primary object):
+
+ ✓ INCLUDE:
+ • Distinct wearable items: hat, backpack, jacket (if separate/visible)
+ • Vehicles/equipment being ridden: bike, skateboard, surfboard, scooter
+ • Large carried items that are part of the subject
+
+ ✗ DO NOT INCLUDE:
+ • Generic clothing (shirt, pants, shoes) - already captured with person
+ • Held items that could be set down: guitar, cup, phone, tools
+ • Objects they're interacting with but not wearing/riding
+
+ Examples:
+ • Person on bike → integral: "bike"
+ • Person with guitar → integral: none (guitar is affected, not integral)
+ • Surfer → integral: "surfboard"
+ • Boxer → integral: "boxing gloves" (wearable equipment)
+
+ STEP 2: IDENTIFY AFFECTED OBJECTS (0-5 objects)
+ ────────────────────────────────────────────────
+ Objects/effects that are SEPARATE from primary but affected by its removal.
+
+ CRITICAL: Do NOT include integral belongings from Step 1.
+
+ Two categories:
+
+ A) VISUAL ARTIFACTS (disappear when primary removed):
+ • shadow, reflection, wake, ripples, splash, footprints
+ • These vanish completely - no physics needed
+
+ **CRITICAL FOR VISUAL ARTIFACTS:**
+ You MUST provide GRID LOCALIZATIONS across the reference frames.
+ Keyword segmentation fails to isolate specific shadows/reflections.
+
+ For each visual artifact:
+ - Look at each grid reference frame you were shown
+ - Identify which grid cells the artifact occupies in EACH frame
+ - List all grid cells (row, col) that contain any part of it
+ - Be thorough - include ALL touched cells (over-mask is better than under-mask)
+
+ Format:
+ {{
+ "noun": "shadow",
+ "category": "visual_artifact",
+ "grid_localizations": [
+ {{"frame": 0, "grid_regions": [{{"row": 6, "col": 3}}, {{"row": 6, "col": 4}}, ...]}},
+ {{"frame": 5, "grid_regions": [{{"row": 6, "col": 4}}, ...]}},
+ // ... for each reference frame shown
+ ]
+ }}
+
+ B) PHYSICAL OBJECTS (may move, fall, or stay):
+
+ CRITICAL - Understand the difference:
+
+ **SUPPORTING vs ACTING ON:**
+ • SUPPORTING = holding UP against gravity → object WILL FALL when removed
+ Examples: holding guitar, carrying cup
+ → will_move: TRUE
+
+ • ACTING ON = touching/manipulating but object rests on stable surface → object STAYS
+ Examples: hand crushing can (can on table), hand opening can (can on counter),
+ hand pushing object (object on floor)
+ → will_move: FALSE
+
+ **Key Questions:**
+ 1. Is the primary object HOLDING THIS UP against gravity?
+ - YES → will_move: true, needs_trajectory: true
+ - NO → Check next question
+
+ 2. Is this object RESTING ON a stable surface (table, floor, counter)?
+ - YES → will_move: false (stays on surface when primary removed)
+ - NO → will_move: true
+
+ 3. Is the primary object DOING an action TO this object?
+ - Opening can, crushing can, pushing button, turning knob
+ - When primary removed → action STOPS, object stays in current state
+ - will_move: false
+
+ **SPECIAL CASE - Object Currently Moving But Should Have Stayed:**
+ If primary object CAUSES another object to move (hitting, kicking, throwing):
+ - The object is currently moving in the video
+ - But WITHOUT primary, it would have stayed at its original position
+ - You MUST provide:
+ • "currently_moving": true
+ • "should_have_stayed": true
+ • "original_position_grid": {{"row": R, "col": C}} - Where it started
+
+ Examples:
+ - Golf club hits ball → Ball at tee, then flies (mark original tee position)
+ - Person kicks soccer ball → Ball on ground, then rolls (mark original ground position)
+ - Hand throws object → Object held, then flies (mark original held position)
+
+ Format:
+ {{
+ "noun": "golf ball",
+ "category": "physical",
+ "currently_moving": true,
+ "should_have_stayed": true,
+ "original_position_grid": {{"row": 6, "col": 7}},
+ "why": "ball was stationary until club hit it"
+ }}
+
+ For each physical object, determine:
+ - **will_move**: true ONLY if object will fall/move when support removed
+ - **first_appears_frame**: frame number object first appears (0 if from start)
+ - **why**: Brief explanation of relationship to primary object
+
+ IF will_move=TRUE, also provide GRID-BASED TRAJECTORY:
+ - **object_size_grids**: {{"rows": R, "cols": C}} - How many grid cells object occupies
+ IMPORTANT: Add 1 extra cell padding for safety (better to over-mask than under-mask)
+ Example: Object looks 2×1 → report as 3×2
+
+ - **trajectory_path**: List of keyframe positions as grid coordinates
+ Format: [{{"frame": N, "grid_row": R, "grid_col": C}}, ...]
+ - IMPORTANT: First keyframe should be at first_appears_frame (not frame 0 if object appears later!)
+ - Provide 3-5 keyframes spanning from first appearance to end
+ - (grid_row, grid_col) is the CENTER position of object at that frame
+ - Use the yellow grid reference frames to determine positions
+ - For objects appearing mid-video: use the grid samples to locate them
+ - Example: Object appears at frame 15, falls to bottom
+ [{{"frame": 15, "grid_row": 3, "grid_col": 5}}, ← First appearance
+ {{"frame": 25, "grid_row": 6, "grid_col": 5}}, ← Mid-fall
+ {{"frame": 35, "grid_row": 9, "grid_col": 5}}] ← On ground
+
+ ✓ Objects held/carried at ANY point in video
+ ✓ Objects the primary supports or interacts with
+ ✓ Visual effects visible at any time
+
+ ✗ Background objects never touched
+ ✗ Other people/animals with no contact
+ ✗ Integral belongings (already in Step 1)
+
+ STEP 3: SCENE DESCRIPTION
+ ──────────────────────────
+ Describe scene WITHOUT the primary object (1-2 sentences).
+ Focus on what remains and any dynamic changes (falling objects, etc).
+
+ ═══════════════════════════════════════════════════════════════════
+ OUTPUT FORMAT (STRICT JSON ONLY)
+ ═══════════════════════════════════════════════════════════════════
+
+ EXAMPLES TO LEARN FROM:
+
+ Example 1: Person holding guitar
+ {{
+ "affected_objects": [
+ {{
+ "noun": "guitar",
+ "will_move": true,
+ "why": "person is SUPPORTING guitar against gravity by holding it",
+ "object_size_grids": {{"rows": 3, "cols": 2}},
+ "trajectory_path": [
+ {{"frame": 0, "grid_row": 4, "grid_col": 5}},
+ {{"frame": 15, "grid_row": 6, "grid_col": 5}},
+ {{"frame": 30, "grid_row": 8, "grid_col": 6}}
+ ]
+ }}
+ ]
+ }}
+
+ Example 2: Hand crushing can on table
+ {{
+ "affected_objects": [
+ {{
+ "noun": "can",
+ "will_move": false,
+ "why": "can RESTS ON TABLE - hand is just acting on it. When hand removed, can stays on table (uncrushed)"
+ }}
+ ]
+ }}
+
+ Example 3: Hands opening can on counter
+ {{
+ "affected_objects": [
+ {{
+ "noun": "can",
+ "will_move": false,
+ "why": "can RESTS ON COUNTER - hands are doing opening action. When hands removed, can stays closed on counter"
+ }}
+ ]
+ }}
+
+ Example 4: Person sitting on chair
+ {{
+ "affected_objects": [
+ {{
+ "noun": "chair",
+ "will_move": false,
+ "why": "chair RESTS ON FLOOR - person sitting on it doesn't make it fall. Chair stays on floor when person removed"
+ }}
+ ]
+ }}
+
+ Example 5: Person throws ball (ball appears at frame 12)
+ {{
+ "affected_objects": [
+ {{
+ "noun": "ball",
+ "category": "physical",
+ "will_move": true,
+ "first_appears_frame": 12,
+ "why": "ball is SUPPORTED by person's hand, then thrown",
+ "object_size_grids": {{"rows": 2, "cols": 2}},
+ "trajectory_path": [
+ {{"frame": 12, "grid_row": 4, "grid_col": 3}},
+ {{"frame": 20, "grid_row": 2, "grid_col": 6}},
+ {{"frame": 28, "grid_row": 5, "grid_col": 8}}
+ ]
+ }}
+ ]
+ }}
+
+ Example 6: Person with shadow (shadow needs grid localization)
+ {{
+ "affected_objects": [
+ {{
+ "noun": "shadow",
+ "category": "visual_artifact",
+ "why": "cast by person on the floor",
+ "will_move": false,
+ "first_appears_frame": 0,
+ "movement_description": "Disappears entirely as visual artifact",
+ "grid_localizations": [
+ {{"frame": 0, "grid_regions": [{{"row": 6, "col": 3}}, {{"row": 6, "col": 4}}, {{"row": 7, "col": 3}}, {{"row": 7, "col": 4}}]}},
+ {{"frame": 12, "grid_regions": [{{"row": 6, "col": 4}}, {{"row": 6, "col": 5}}, {{"row": 7, "col": 4}}]}},
+ {{"frame": 23, "grid_regions": [{{"row": 5, "col": 4}}, {{"row": 6, "col": 4}}, {{"row": 6, "col": 5}}]}},
+ {{"frame": 35, "grid_regions": [{{"row": 6, "col": 3}}, {{"row": 6, "col": 4}}, {{"row": 7, "col": 3}}]}},
+ {{"frame": 47, "grid_regions": [{{"row": 6, "col": 3}}, {{"row": 7, "col": 3}}, {{"row": 7, "col": 4}}]}}
+ ]
+ }}
+ ]
+ }}
+
+ Example 7: Golf club hits ball (Case 4 - currently moving but should stay)
+ {{
+ "affected_objects": [
+ {{
+ "noun": "golf ball",
+ "category": "physical",
+ "currently_moving": true,
+ "should_have_stayed": true,
+ "original_position_grid": {{"row": 6, "col": 7}},
+ "first_appears_frame": 0,
+ "why": "ball was stationary on tee until club hit it. Without club, ball would remain at original position."
+ }}
+ ]
+ }}
+
+ YOUR OUTPUT FORMAT:
+ {{
+ "edit_instruction": "{instruction}",
+ "integral_belongings": [
+ {{
+ "noun": "bike",
+ "why": "person is riding the bike throughout the video"
+ }}
+ ],
+ "affected_objects": [
+ {{
+ "noun": "guitar",
+ "category": "physical",
+ "why": "person is SUPPORTING guitar against gravity by holding it",
+ "will_move": true,
+ "first_appears_frame": 0,
+ "movement_description": "Will fall from held position to the ground",
+ "object_size_grids": {{"rows": 3, "cols": 2}},
+ "trajectory_path": [
+ {{"frame": 0, "grid_row": 3, "grid_col": 6}},
+ {{"frame": 20, "grid_row": 6, "grid_col": 6}},
+ {{"frame": 40, "grid_row": 9, "grid_col": 7}}
+ ]
+ }},
+ {{
+ "noun": "shadow",
+ "category": "visual_artifact",
+ "why": "cast by person on floor",
+ "will_move": false,
+ "first_appears_frame": 0,
+ "movement_description": "Disappears entirely as visual artifact"
+ }}
+ ],
+ "scene_description": "An acoustic guitar falling to the ground in an empty room. Natural window lighting.",
+ "confidence": 0.85
+ }}
+
+ CRITICAL REMINDERS:
+ • Watch ENTIRE video before answering
+ • SUPPORTING vs ACTING ON:
+ - Primary HOLDS UP object against gravity → will_move=TRUE (provide grid trajectory)
+ - Primary ACTS ON object (crushing, opening) but object on stable surface → will_move=FALSE
+ - Object RESTS ON stable surface (table, floor) → will_move=FALSE
+ • For visual artifacts (shadow, reflection): will_move=false (no trajectory needed)
+ • For held objects (guitar, cup): will_move=true (MUST provide object_size_grids + trajectory_path)
+ • For objects on surfaces being acted on (can being crushed, can being opened): will_move=false
+ • Grid trajectory: Add +1 cell padding to object size (over-mask is better than under-mask)
+ • Grid trajectory: Use the yellow grid overlay to determine (row, col) positions
+ • Be conservative - when in doubt, DON'T include
+ • Output MUST be valid JSON only
+
+ GRID INFO: {grid_rows} rows × {grid_cols} columns
+ EDIT INSTRUCTION: {instruction}
+ """.strip()
+
+
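The prompt asks for only 3-5 trajectory keyframes, so a later stage presumably interpolates a position for every frame in between. A hedged sketch of that linear interpolation over `trajectory_path` entries (this helper is illustrative, not code from this repo):

```python
# Illustrative: linearly interpolate a VLM trajectory_path
# ([{"frame": N, "grid_row": R, "grid_col": C}, ...]) to a per-frame position.
def interp_trajectory(path, frame):
    path = sorted(path, key=lambda p: p["frame"])
    if frame <= path[0]["frame"]:
        return (path[0]["grid_row"], path[0]["grid_col"])
    if frame >= path[-1]["frame"]:
        return (path[-1]["grid_row"], path[-1]["grid_col"])
    for a, b in zip(path, path[1:]):
        if a["frame"] <= frame <= b["frame"]:
            t = (frame - a["frame"]) / (b["frame"] - a["frame"])
            return (a["grid_row"] + t * (b["grid_row"] - a["grid_row"]),
                    a["grid_col"] + t * (b["grid_col"] - a["grid_col"]))

path = [{"frame": 0, "grid_row": 4, "grid_col": 5},
        {"frame": 30, "grid_row": 8, "grid_col": 6}]
print(interp_trajectory(path, 15))  # (6.0, 5.5)
```

Positions before the first keyframe clamp to the first entry and positions after the last clamp to the last, matching the prompt's rule that the first keyframe sits at `first_appears_frame`.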
+ def call_vlm_with_images_and_video(client, model: str, image_data_urls: list,
+ video_data_url: str, prompt: str) -> str:
+ """Call VLM with sampled frame images.
+
+ The CF AI Gateway OpenAI-compat endpoint does not support video/mp4 base64
+ data URLs, so we rely solely on the sampled grid frames already in
+ image_data_urls. video_data_url is accepted for signature compatibility but
+ intentionally not sent.
+ """
+ content = []
+
+ # Add all sampled frame images
+ for img_url in image_data_urls:
+ content.append({"type": "image_url", "image_url": {"url": img_url}})
+
+ # Add prompt
+ content.append({"type": "text", "text": prompt})
+
+ resp = client.chat.completions.create(
+ model=model,
+ messages=[
+ {
+ "role": "system",
+ "content": "You are an expert video analyst with deep understanding of physics and object interactions. Always output valid JSON only."
+ },
+ {
+ "role": "user",
+ "content": content
+ },
+ ],
+ )
+ return resp.choices[0].message.content
+
+
+ def parse_vlm_response(raw: str) -> Dict:
+ """Parse VLM JSON response"""
+ # Strip markdown code blocks
+ cleaned = raw.strip()
+ if cleaned.startswith("```"):
+ lines = cleaned.split('\n')
+ if lines[0].startswith("```"):
+ lines = lines[1:]
+ if lines and lines[-1].strip() == "```":
+ lines = lines[:-1]
+ cleaned = '\n'.join(lines)
+
+ try:
+ parsed = json.loads(cleaned)
+ except json.JSONDecodeError:
+ # Try to find JSON in response
+ start = cleaned.find("{")
+ end = cleaned.rfind("}")
+ if start != -1 and end != -1 and end > start:
+ parsed = json.loads(cleaned[start:end+1])
+ else:
+ raise ValueError("Failed to parse VLM response as JSON")
+
+ # Validate structure
+ result = {
+ "edit_instruction": parsed.get("edit_instruction", ""),
+ "integral_belongings": [],
+ "affected_objects": [],
+ "scene_description": parsed.get("scene_description", ""),
+ "confidence": float(parsed.get("confidence", 0.0))
+ }
+
+ # Parse integral belongings
+ for item in parsed.get("integral_belongings", [])[:3]:
+ obj = {
+ "noun": str(item.get("noun", "")).strip().lower(),
+ "why": str(item.get("why", "")).strip()[:200]
+ }
+ if obj["noun"]:
+ result["integral_belongings"].append(obj)
+
+ # Parse affected objects
+ for item in parsed.get("affected_objects", [])[:5]:
+ obj = {
+ "noun": str(item.get("noun", "")).strip().lower(),
+ "category": str(item.get("category", "physical")).strip().lower(),
+ "why": str(item.get("why", "")).strip()[:200],
+ "will_move": bool(item.get("will_move", False)),
+ "first_appears_frame": int(item.get("first_appears_frame", 0)),
+ "movement_description": str(item.get("movement_description", "")).strip()[:300]
+ }
+
+ # Parse Case 4: currently moving but should have stayed
+ if "currently_moving" in item:
+ obj["currently_moving"] = bool(item.get("currently_moving", False))
+ if "should_have_stayed" in item:
+ obj["should_have_stayed"] = bool(item.get("should_have_stayed", False))
+ if "original_position_grid" in item:
+ orig_grid = item.get("original_position_grid", {})
+ obj["original_position_grid"] = {
+ "row": int(orig_grid.get("row", 0)),
+ "col": int(orig_grid.get("col", 0))
+ }
+
+ # Parse grid localizations for visual artifacts
+ if "grid_localizations" in item:
+ grid_locs = []
+ for loc in item.get("grid_localizations", []):
+ frame_loc = {
+ "frame": int(loc.get("frame", 0)),
+ "grid_regions": []
+ }
+ for region in loc.get("grid_regions", []):
+ frame_loc["grid_regions"].append({
+ "row": int(region.get("row", 0)),
+ "col": int(region.get("col", 0))
+ })
+ if frame_loc["grid_regions"]: # Only add if has regions
+ grid_locs.append(frame_loc)
+ if grid_locs:
+ obj["grid_localizations"] = grid_locs
+
+ # Parse grid trajectory if will_move=true
+ if obj["will_move"] and "object_size_grids" in item and "trajectory_path" in item:
+ size_grids = item.get("object_size_grids", {})
+ obj["object_size_grids"] = {
+ "rows": int(size_grids.get("rows", 2)),
+ "cols": int(size_grids.get("cols", 2))
+ }
+
+ trajectory = []
+ for point in item.get("trajectory_path", []):
+ trajectory.append({
+ "frame": int(point.get("frame", 0)),
+ "grid_row": int(point.get("grid_row", 0)),
+ "grid_col": int(point.get("grid_col", 0))
+ })
+
+ if trajectory: # Only add if we have valid trajectory points
+ obj["trajectory_path"] = trajectory
+
+ if obj["noun"]:
+ result["affected_objects"].append(obj)
+
+ return result
+
+
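`parse_vlm_response` first strips the markdown code fence that chat models often wrap around JSON. A standalone sketch of that stripping step on a synthetic response (the triple-backtick fence is built indirectly here only to keep the example readable):

```python
# Sketch of the fence-stripping step: models often return
# ```json ... ``` instead of bare JSON.
import json

fence = "`" * 3  # literal triple backtick
raw = fence + 'json\n{"affected_objects": [], "confidence": 0.9}\n' + fence

cleaned = raw.strip()
if cleaned.startswith(fence):
    lines = cleaned.split("\n")
    if lines[0].startswith(fence):
        lines = lines[1:]          # drop opening ```json line
    if lines and lines[-1].strip() == fence:
        lines = lines[:-1]         # drop closing ``` line
    cleaned = "\n".join(lines)

print(json.loads(cleaned)["confidence"])  # 0.9
```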
+ def process_video(video_info: Dict, client, model: str):
+ """Process a single video with VLM analysis"""
+ video_path = video_info.get("video_path", "")
+ instruction = video_info.get("instruction", "")
+ output_dir = video_info.get("output_dir", "")
+
+ if not output_dir:
+ print(f" ⚠️ No output_dir specified, skipping")
+ return None
+
+ output_dir = Path(output_dir)
+ if not output_dir.exists():
+ print(f" ⚠️ Output directory not found: {output_dir}")
+ print(f" Run Stage 1 first to create black masks")
+ return None
+
+ # Check required files from Stage 1
+ black_mask_path = output_dir / "black_mask.mp4"
+ first_frame_path = output_dir / "first_frame.jpg"
+ input_video_path = output_dir / "input_video.mp4"
+ segmentation_info_path = output_dir / "segmentation_info.json"
+
+ if not black_mask_path.exists():
+ print(f" ⚠️ black_mask.mp4 not found in {output_dir}")
+ print(f" Run Stage 1 first")
+ return None
+
+ if not first_frame_path.exists():
+ print(f" ⚠️ first_frame.jpg not found in {output_dir}")
+ return None
+
+ if not input_video_path.exists():
+ # Try original video path
+ if Path(video_path).exists():
+ input_video_path = Path(video_path)
+ else:
+ print(f" ⚠️ Video not found: {video_path}")
+ return None
+
+ # Read segmentation metadata to get correct frame index
+ frame_idx = 0 # Default
+ if segmentation_info_path.exists():
+ try:
+ with open(segmentation_info_path, 'r') as f:
+ seg_info = json.load(f)
+ frame_idx = seg_info.get("first_appears_frame", 0)
+ print(f" Using frame {frame_idx} from segmentation metadata")
+ except Exception as e:
+ print(f" Warning: Could not read segmentation_info.json: {e}")
+ print(f" Using frame 0 as fallback")
+
+ # Get min_grid for grid calculation
+ min_grid = video_info.get('min_grid', 8)
+ use_multi_frame_grids = video_info.get('multi_frame_grids', True) # Default: use multi-frame
+ max_video_size_mb = video_info.get('max_video_size_for_multiframe', 25) # Default: 25MB limit
+
+ # Check video size and auto-disable multi-frame for large videos
+ if use_multi_frame_grids:
+ video_size_mb = input_video_path.stat().st_size / (1024 * 1024)
+ if video_size_mb > max_video_size_mb:
+ print(f" ⚠️ Video size ({video_size_mb:.1f} MB) exceeds {max_video_size_mb} MB")
+ print(f" Auto-disabling multi-frame grids to avoid API errors")
+ use_multi_frame_grids = False
+
+ print(f" Creating frame overlays and grids...")
+ overlay_path = output_dir / "first_frame_with_mask.jpg"
+ gridded_path = output_dir / "first_frame_with_grid.jpg"
+
+ # Create regular overlay (for backwards compatibility)
+ create_first_frame_with_mask_overlay(
+ str(first_frame_path),
+ str(black_mask_path),
+ str(overlay_path),
+ frame_idx=frame_idx
+ )
+
+ image_data_urls = []
+
+ if use_multi_frame_grids:
+ # Create multi-frame grid samples for objects appearing mid-video
+ print(f" Creating multi-frame grid samples (10 frames, 0%-100%)...")
+ sample_paths, grid_rows, grid_cols = create_multi_frame_grid_samples(
+ str(input_video_path),
+ output_dir,
+ min_grid=min_grid
+ )
+
+ # Encode all grid samples
+ for sample_path in sample_paths:
+ image_data_urls.append(image_to_data_url(str(sample_path)))
+
+ # Also add the first frame with mask overlay
+ _, _, _ = create_gridded_frame_overlay(
+ str(first_frame_path),
+ str(black_mask_path),
+ str(gridded_path),
+ min_grid=min_grid
+ )
+ image_data_urls.append(image_to_data_url(str(gridded_path)))
+
+ print(f" Grid: {grid_rows}x{grid_cols}, {len(sample_paths)} sample frames + masked frame")
+
+ else:
+ # Single gridded first frame (old approach)
+ _, grid_rows, grid_cols = create_gridded_frame_overlay(
+ str(first_frame_path),
+ str(black_mask_path),
+ str(gridded_path),
+ min_grid=min_grid
+ )
+ image_data_urls.append(image_to_data_url(str(gridded_path)))
+ print(f" Grid: {grid_rows}x{grid_cols} (single frame)")
+
+ # CF gateway does not support video/mp4 base64 -- pass None; frames already
+ # captured in image_data_urls above.
+ video_data_url = None
+
+ print(f" Calling {model}...")
+ prompt = make_vlm_analysis_prompt(instruction, grid_rows, grid_cols,
+ has_multi_frame_grids=use_multi_frame_grids)
+
+ try:
+ try:
+ raw_response = call_vlm_with_images_and_video(
+ client, model, image_data_urls, video_data_url, prompt
+ )
+ except Exception as e:
+ # If multi-frame fails (likely payload size issue), fall back to single frame
+ if use_multi_frame_grids and "400" in str(e):
+ print(f" ⚠️ Multi-frame request failed (payload too large?)")
+ print(f" Falling back to single-frame grid mode...")
+
+ # Retry with just the gridded first frame
+ image_data_urls = [image_to_data_url(str(gridded_path))]
+ prompt = make_vlm_analysis_prompt(instruction, grid_rows, grid_cols,
+ has_multi_frame_grids=False)
+
+ try:
+ raw_response = call_vlm_with_images_and_video(
+ client, model, image_data_urls, video_data_url, prompt
+ )
+ print(f" ✓ Single-frame fallback succeeded")
+ except Exception as e2:
+ raise e2 # Re-raise if fallback also fails
+ else:
+ raise # Re-raise if not a 400 or not multi-frame mode
+
+ # Parse and save results (runs whether first call succeeded or fallback succeeded)
+ print(f" Parsing VLM response...")
+ analysis = parse_vlm_response(raw_response)
+
+ # Save results
+ output_path = output_dir / "vlm_analysis.json"
+ with open(output_path, 'w') as f:
+ json.dump(analysis, f, indent=2)
+
+ print(f" ✓ Saved VLM analysis: {output_path.name}")
+
+ # Print summary
+ print(f"\n Summary:")
+ print(f" - Integral belongings: {len(analysis['integral_belongings'])}")
+ for obj in analysis['integral_belongings']:
+ print(f" • {obj['noun']}: {obj['why']}")
+
+ print(f" - Affected objects: {len(analysis['affected_objects'])}")
+ for obj in analysis['affected_objects']:
+ move_str = "WILL MOVE" if obj['will_move'] else "STAYS/DISAPPEARS"
+ traj_str = ""
+ if obj.get('will_move') and 'trajectory_path' in obj:
+ num_points = len(obj['trajectory_path'])
+ size = obj.get('object_size_grids', {})
+ traj_str = f" (trajectory: {num_points} keyframes, size: {size.get('rows')}×{size.get('cols')} grids)"
+ print(f" • {obj['noun']}: {move_str}{traj_str}")
+
+ return analysis
+
+ except Exception as e:
+ print(f" ❌ VLM analysis failed: {e}")
+ import traceback
+ traceback.print_exc()
+ return None
+
+
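The size guard in `process_video` boils down to one comparison between the file size in megabytes and `max_video_size_for_multiframe` (default 25 MB); a small sketch (the helper name is illustrative):

```python
# Illustrative: decide whether multi-frame grid samples are safe to send,
# mirroring the 25 MB default guard in process_video.
def use_multiframe(size_bytes, max_mb=25):
    return size_bytes / (1024 * 1024) <= max_mb

print(use_multiframe(10 * 1024 * 1024))  # True  -> send all grid samples
print(use_multiframe(30 * 1024 * 1024))  # False -> fall back to single frame
```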
+ def process_config(config_path: str, model: str = DEFAULT_MODEL):
+ """Process all videos in config"""
+ config_path = Path(config_path)
+
+ # Load config
+ with open(config_path, 'r') as f:
+ config_data = json.load(f)
+
+ # Handle both formats
+ if isinstance(config_data, list):
+ videos = config_data
+ elif isinstance(config_data, dict) and "videos" in config_data:
+ videos = config_data["videos"]
+ else:
+ raise ValueError("Config must be a list or have 'videos' key")
+
+ print(f"\n{'='*70}")
+ print(f"Stage 2: VLM Analysis - Identify Affected Objects")
+ print(f"{'='*70}")
+ print(f"Config: {config_path.name}")
+ print(f"Videos: {len(videos)}")
+ print(f"Model: {model}")
+ print(f"{'='*70}\n")
+
+ # Initialize VLM client (CF AI Gateway)
+ cf_project_id = os.environ.get("CF_PROJECT_ID")
+ cf_user_id = os.environ.get("CF_USER_ID")
+ if not cf_project_id or not cf_user_id:
+ raise RuntimeError("CF_PROJECT_ID and CF_USER_ID environment variables must be set")
+
+ metadata = json.dumps({"project_id": cf_project_id, "user_id": cf_user_id})
+ client = openai.OpenAI(
+ api_key=os.environ.get("GEMINI_API_KEY", "placeholder"),
+ base_url="https://ai-gateway.plain-flower-4887.workers.dev/compat",
+ default_headers={"cf-aig-metadata": metadata},
+ )
+
+ # Model comes from MODEL_ID env var; fall back to --model arg
+ model = os.environ.get("MODEL_ID", model)
+
+ # Process each video
+ results = []
+ for i, video_info in enumerate(videos):
+ video_path = video_info.get("video_path", "")
+ instruction = video_info.get("instruction", "")
+
+ print(f"\n{'─'*70}")
+ print(f"Video {i+1}/{len(videos)}: {Path(video_path).name}")
+ print(f"{'─'*70}")
+ print(f"Instruction: {instruction}")
+
+ try:
+ analysis = process_video(video_info, client, model)
+ results.append({
+ "video": video_path,
+ "success": analysis is not None,
+ "analysis": analysis
+ })
+
+ if analysis:
+ print(f"\n✅ Video {i+1} complete!")
+ else:
+ print(f"\n⚠️ Video {i+1} skipped")
+
+ except Exception as e:
+ print(f"\n❌ Error processing video {i+1}: {e}")
+ results.append({
+ "video": video_path,
+ "success": False,
+ "error": str(e)
+ })
+ continue
+
+ # Summary
+ print(f"\n{'='*70}")
+ print(f"Stage 2 Complete!")
+ print(f"{'='*70}")
+ successful = sum(1 for r in results if r["success"])
+ print(f"Successful: {successful}/{len(videos)}")
+ print(f"{'='*70}\n")
+
+
+ def main():
1015
+ parser = argparse.ArgumentParser(description="Stage 2: VLM Analysis")
1016
+ parser.add_argument("--config", required=True, help="Config JSON from Stage 1")
1017
+ parser.add_argument("--model", default=DEFAULT_MODEL, help="VLM model name")
1018
+ args = parser.parse_args()
1019
+
1020
+ process_config(args.config, args.model)
1021
+
1022
+
1023
+ if __name__ == "__main__":
1024
+ main()
VLM-MASK-REASONER/stage3a_generate_grey_masks.py ADDED
@@ -0,0 +1,436 @@
+#!/usr/bin/env python3
+"""
+Stage 3a: Generate Grey Masks - Combine VLM Logic + User Trajectories
+
+Generates grey masks (127 = affected regions) by combining:
+1. VLM-identified affected objects (segmented + gridified)
+2. User-drawn trajectories (from Stage 3b)
+3. Proximity filtering (only mask near the primary object)
+
+Input:  - vlm_analysis.json (Stage 2)
+        - black_mask.mp4 (Stage 1)
+        - trajectories.json (Stage 3b, optional)
+Output: - grey_mask.mp4 (127 = affected, 255 = background)
+
+Usage:
+    python stage3a_generate_grey_masks.py --config more_dyn_2_config_points_absolute.json
+"""
+
+import os
+import sys
+import json
+import argparse
+import subprocess
+from pathlib import Path
+from typing import Dict, List, Tuple
+
+import cv2
+import numpy as np
+from PIL import Image
+
+# Segmentation models (optional imports)
+try:
+    from sam3.model_builder import build_sam3_image_model
+    from sam3.model.sam3_image_processor import Sam3Processor
+    SAM3_AVAILABLE = True
+except ImportError:
+    SAM3_AVAILABLE = False
+
+try:
+    from lang_sam import LangSAM
+    LANGSAM_AVAILABLE = True
+except ImportError:
+    LANGSAM_AVAILABLE = False
+
+
+class SegmentationModel:
+    """Thin wrapper over the available text-prompted segmenters."""
+
+    def __init__(self, model_type: str = "sam3"):
+        self.model_type = model_type.lower()
+
+        if self.model_type == "sam3":
+            if not SAM3_AVAILABLE:
+                raise ImportError("SAM3 not available")
+            print("  Loading SAM3...")
+            model = build_sam3_image_model()
+            self.processor = Sam3Processor(model)
+            self.model = model
+        elif self.model_type == "langsam":
+            if not LANGSAM_AVAILABLE:
+                raise ImportError("LangSAM not available")
+            print("  Loading LangSAM...")
+            self.model = LangSAM()
+        else:
+            raise ValueError(f"Unknown model: {model_type}")
+
+    def segment(self, image_pil: Image.Image, prompt: str) -> np.ndarray:
+        """Segment an object using a text prompt; returns a boolean mask."""
+        if self.model_type == "sam3":
+            return self._segment_sam3(image_pil, prompt)
+        return self._segment_langsam(image_pil, prompt)
+
+    def _segment_sam3(self, image_pil: Image.Image, prompt: str) -> np.ndarray:
+        import torch
+        h, w = image_pil.height, image_pil.width
+        union = np.zeros((h, w), dtype=bool)
+
+        try:
+            inference_state = self.processor.set_image(image_pil)
+            output = self.processor.set_text_prompt(state=inference_state, prompt=prompt)
+            masks = output.get("masks")
+
+            if masks is None or len(masks) == 0:
+                return union
+
+            if torch.is_tensor(masks):
+                masks = masks.cpu().numpy()
+
+            # Union across however many instances/batches came back
+            if masks.ndim == 2:
+                union = masks.astype(bool)
+            elif masks.ndim == 3:
+                union = masks.any(axis=0).astype(bool)
+            elif masks.ndim == 4:
+                union = masks.any(axis=(0, 1)).astype(bool)
+
+        except Exception as e:
+            print(f"    Warning: SAM3 segmentation failed for '{prompt}': {e}")
+
+        return union
+
+    def _segment_langsam(self, image_pil: Image.Image, prompt: str) -> np.ndarray:
+        h, w = image_pil.height, image_pil.width
+        union = np.zeros((h, w), dtype=bool)
+
+        try:
+            results = self.model.predict([image_pil], [prompt])
+            if not results:
+                return union
+
+            r0 = results[0]
+            if isinstance(r0, dict) and "masks" in r0:
+                masks = r0["masks"]
+                if masks.ndim == 4 and masks.shape[0] == 1:
+                    masks = masks[0]
+                if masks.ndim == 3:
+                    union = masks.any(axis=0).astype(bool)
+                elif masks.ndim == 2:
+                    union = masks.astype(bool)
+
+        except Exception as e:
+            print(f"    Warning: LangSAM segmentation failed for '{prompt}': {e}")
+
+        return union
+
+
+def calculate_square_grid(width: int, height: int, min_grid: int = 8) -> Tuple[int, int]:
+    """Calculate grid dimensions so cells are approximately square."""
+    aspect_ratio = width / height
+    if width >= height:
+        grid_rows = min_grid
+        grid_cols = max(min_grid, round(min_grid * aspect_ratio))
+    else:
+        grid_cols = min_grid
+        grid_rows = max(min_grid, round(min_grid / aspect_ratio))
+    return grid_rows, grid_cols
+
+
+def gridify_mask(mask: np.ndarray, grid_rows: int, grid_cols: int) -> np.ndarray:
+    """Convert a pixel mask to a gridified mask: any pixel in a cell fills the whole cell."""
+    h, w = mask.shape
+    gridified = np.zeros((h, w), dtype=bool)
+
+    cell_width = w / grid_cols
+    cell_height = h / grid_rows
+
+    for row in range(grid_rows):
+        for col in range(grid_cols):
+            y1 = int(row * cell_height)
+            y2 = int((row + 1) * cell_height)
+            x1 = int(col * cell_width)
+            x2 = int((col + 1) * cell_width)
+
+            if mask[y1:y2, x1:x2].any():
+                gridified[y1:y2, x1:x2] = True
+
+    return gridified
+
+
+def grid_cells_to_mask(grid_cells: List[List[int]], grid_rows: int, grid_cols: int,
+                       frame_width: int, frame_height: int) -> np.ndarray:
+    """Convert a list of (row, col) grid cells to a pixel mask."""
+    mask = np.zeros((frame_height, frame_width), dtype=bool)
+
+    cell_width = frame_width / grid_cols
+    cell_height = frame_height / grid_rows
+
+    for row, col in grid_cells:
+        y1 = int(row * cell_height)
+        y2 = int((row + 1) * cell_height)
+        x1 = int(col * cell_width)
+        x2 = int((col + 1) * cell_width)
+        mask[y1:y2, x1:x2] = True
+
+    return mask
+
+
+def dilate_mask(mask: np.ndarray, kernel_size: int = 15) -> np.ndarray:
+    """Dilate a mask to create a proximity region."""
+    kernel = np.ones((kernel_size, kernel_size), np.uint8)
+    return cv2.dilate(mask.astype(np.uint8), kernel, iterations=1).astype(bool)
+
+
+def filter_by_proximity(mask: np.ndarray, primary_mask: np.ndarray, dilation: int = 15) -> np.ndarray:
+    """Keep only the parts of `mask` that fall near the primary mask."""
+    # Dilate the primary mask to create the proximity region
+    proximity_region = dilate_mask(primary_mask, dilation)
+    # Keep the mask only where it overlaps the proximity region
+    return mask & proximity_region
+
+
+def process_video_grey_masks(video_info: Dict, segmenter: SegmentationModel,
+                             trajectory_data: Dict = None):
+    """Generate grey masks for a single video."""
+    video_path = video_info.get("video_path", "")
+    output_dir = Path(video_info.get("output_dir", ""))
+
+    if not output_dir.exists():
+        print(f"  ⚠️ Output directory not found: {output_dir}")
+        return
+
+    # Load required files
+    vlm_analysis_path = output_dir / "vlm_analysis.json"
+    black_mask_path = output_dir / "black_mask.mp4"
+    input_video_path = output_dir / "input_video.mp4"
+
+    if not vlm_analysis_path.exists():
+        print("  ⚠️ vlm_analysis.json not found")
+        return
+
+    if not black_mask_path.exists():
+        print("  ⚠️ black_mask.mp4 not found")
+        return
+
+    if not input_video_path.exists():
+        input_video_path = Path(video_path)
+        if not input_video_path.exists():
+            print("  ⚠️ Video not found")
+            return
+
+    # Load VLM analysis
+    with open(vlm_analysis_path, 'r') as f:
+        analysis = json.load(f)
+
+    # Get video properties
+    cap = cv2.VideoCapture(str(input_video_path))
+    fps = cap.get(cv2.CAP_PROP_FPS)
+    frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
+    frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
+    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
+    cap.release()
+
+    # Calculate grid
+    min_grid = video_info.get('min_grid', 8)
+    grid_rows, grid_cols = calculate_square_grid(frame_width, frame_height, min_grid)
+
+    print(f"  Video: {frame_width}x{frame_height}, {total_frames} frames, grid: {grid_rows}x{grid_cols}")
+
+    # Load first frame
+    cap = cv2.VideoCapture(str(input_video_path))
+    ret, first_frame = cap.read()
+    cap.release()
+
+    if not ret:
+        print("  ⚠️ Failed to read first frame")
+        return
+
+    first_frame_rgb = cv2.cvtColor(first_frame, cv2.COLOR_BGR2RGB)
+    first_frame_pil = Image.fromarray(first_frame_rgb)
+
+    # Load black mask (first frame, for proximity filtering)
+    black_cap = cv2.VideoCapture(str(black_mask_path))
+    ret, black_mask_frame = black_cap.read()
+    black_cap.release()
+
+    if not ret:
+        print("  ⚠️ Failed to read black mask")
+        return
+
+    if len(black_mask_frame.shape) == 3:
+        black_mask_frame = cv2.cvtColor(black_mask_frame, cv2.COLOR_BGR2GRAY)
+
+    primary_mask = (black_mask_frame == 0)  # 0 = primary object
+
+    # Initialize grey mask
+    grey_mask_combined = np.zeros((frame_height, frame_width), dtype=bool)
+
+    # Process affected objects from VLM
+    affected_objects = analysis.get('affected_objects', [])
+
+    print(f"  Processing {len(affected_objects)} affected object(s)...")
+
+    for obj in affected_objects:
+        noun = obj.get('noun', '')
+        category = obj.get('category', 'physical')
+        will_move = obj.get('will_move', False)
+        needs_trajectory = obj.get('needs_trajectory', False)
+
+        if not noun:
+            continue
+
+        print(f"  β€’ {noun} ({category})")
+
+        # Check whether we have trajectory data for this object
+        has_trajectory = False
+        if needs_trajectory and trajectory_data:
+            for traj in trajectory_data:
+                if traj.get('object_noun', '') == noun:
+                    # Use the trajectory grid cells
+                    print(f"    Using user-drawn trajectory ({len(traj['trajectory_grid_cells'])} cells)")
+                    traj_mask = grid_cells_to_mask(
+                        traj['trajectory_grid_cells'],
+                        grid_rows, grid_cols,
+                        frame_width, frame_height
+                    )
+                    grey_mask_combined |= traj_mask
+                    has_trajectory = True
+                    break
+
+        # If there is no trajectory, or the object doesn't need one, segment normally
+        if not has_trajectory:
+            obj_mask = segmenter.segment(first_frame_pil, noun)
+
+            if obj_mask.any():
+                print(f"    Segmented {obj_mask.sum()} pixels")
+
+                # Filter by proximity to the primary mask
+                obj_mask_filtered = filter_by_proximity(obj_mask, primary_mask, dilation=50)
+
+                if obj_mask_filtered.any():
+                    print(f"    After proximity filter: {obj_mask_filtered.sum()} pixels")
+
+                    # Gridify and add to the combined grey mask
+                    obj_mask_gridified = gridify_mask(obj_mask_filtered, grid_rows, grid_cols)
+                    grey_mask_combined |= obj_mask_gridified
+
+                    print("    βœ“ Added to grey mask")
+                else:
+                    print("    ⚠️ No pixels near primary object, skipping")
+            else:
+                print("    ⚠️ Segmentation failed")
+
+    # Generate grey mask video
+    print("  Generating grey mask video...")
+
+    # For simplicity, use the same mask for all frames
+    # (a future version could track objects through the video)
+    grey_mask_uint8 = np.where(grey_mask_combined, 127, 255).astype(np.uint8)
+
+    # Write a temp AVI (FFV1 = lossless)
+    temp_avi = output_dir / "grey_mask_temp.avi"
+    fourcc = cv2.VideoWriter_fourcc(*'FFV1')
+    out = cv2.VideoWriter(str(temp_avi), fourcc, fps, (frame_width, frame_height), isColor=False)
+
+    for _ in range(total_frames):
+        out.write(grey_mask_uint8)
+
+    out.release()
+
+    # Convert to lossless MP4
+    grey_mask_mp4 = output_dir / "grey_mask.mp4"
+    cmd = [
+        'ffmpeg', '-y', '-i', str(temp_avi),
+        '-c:v', 'libx264', '-qp', '0', '-preset', 'ultrafast',
+        '-pix_fmt', 'yuv444p',
+        str(grey_mask_mp4)
+    ]
+    subprocess.run(cmd, capture_output=True)
+    temp_avi.unlink()
+
+    print("  βœ“ Saved grey_mask.mp4")
+
+    # Save debug visualization (note: OpenCV writes BGR channel order)
+    debug_vis = np.zeros((frame_height, frame_width, 3), dtype=np.uint8)
+    debug_vis[grey_mask_combined] = [0, 255, 0]  # Green for affected regions
+    debug_vis[primary_mask] = [0, 0, 255]        # Red (BGR) for primary object
+    debug_path = output_dir / "debug_grey_mask.jpg"
+    cv2.imwrite(str(debug_path), debug_vis)
+    print("  βœ“ Saved debug visualization")
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Stage 3a: Generate Grey Masks")
+    parser.add_argument("--config", required=True, help="Config JSON")
+    parser.add_argument("--segmentation-model", default="sam3", choices=["langsam", "sam3"],
+                        help="Segmentation model")
+    args = parser.parse_args()
+
+    config_path = Path(args.config)
+
+    # Load config
+    with open(config_path, 'r') as f:
+        config_data = json.load(f)
+
+    if isinstance(config_data, list):
+        videos = config_data
+    elif isinstance(config_data, dict) and "videos" in config_data:
+        videos = config_data["videos"]
+    else:
+        raise ValueError("Invalid config format")
+
+    # Load trajectory data if it exists
+    trajectory_path = config_path.parent / f"{config_path.stem}_trajectories.json"
+    trajectory_data = None
+
+    if trajectory_path.exists():
+        print(f"Loading trajectory data from: {trajectory_path.name}")
+        with open(trajectory_path, 'r') as f:
+            trajectory_data = json.load(f)
+        print(f"  Loaded {len(trajectory_data)} trajectory(s)")
+    else:
+        print("No trajectory data found (Stage 3b not run, or no objects needed trajectories)")
+
+    print(f"\n{'='*70}")
+    print(f"Stage 3a: Generate Grey Masks")
+    print(f"{'='*70}")
+    print(f"Videos: {len(videos)}")
+    print(f"Segmentation: {args.segmentation_model.upper()}")
+    print(f"{'='*70}\n")
+
+    # Load segmentation model
+    segmenter = SegmentationModel(args.segmentation_model)
+
+    # Process each video
+    for i, video_info in enumerate(videos):
+        video_path = video_info.get('video_path', '')
+        print(f"\n{'─'*70}")
+        print(f"Video {i+1}/{len(videos)}: {Path(video_path).parent.name}")
+        print(f"{'─'*70}")
+
+        try:
+            process_video_grey_masks(video_info, segmenter, trajectory_data)
+            print(f"\nβœ… Video {i+1} complete!")
+
+        except Exception as e:
+            print(f"\n❌ Error processing video {i+1}: {e}")
+            import traceback
+            traceback.print_exc()
+            continue
+
+    print(f"\n{'='*70}")
+    print(f"βœ… Stage 3a Complete!")
+    print(f"{'='*70}")
+    print(f"Generated grey_mask.mp4 for all videos")
+    print(f"Next: Run Stage 4 to combine black + grey masks")
+    print(f"{'='*70}\n")
+
+
+if __name__ == "__main__":
+    main()
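As a standalone sanity check of the grid logic used above, the snippet below re-implements `calculate_square_grid` and the cell-filling rule with the same arithmetic (the short side gets `min_grid` cells; a cell with any foreground pixel is filled entirely). This is an illustrative sketch, not part of the committed files:

```python
import numpy as np

def calculate_square_grid(width, height, min_grid=8):
    # Short side gets min_grid cells; long side scales by aspect ratio
    # so cells stay roughly square (same rule as the stage3a scripts).
    aspect_ratio = width / height
    if width >= height:
        return min_grid, max(min_grid, round(min_grid * aspect_ratio))
    return max(min_grid, round(min_grid / aspect_ratio)), min_grid

def gridify(mask, grid_rows, grid_cols):
    # If ANY pixel in a cell is set, the ENTIRE cell becomes set.
    h, w = mask.shape
    out = np.zeros_like(mask)
    for row in range(grid_rows):
        for col in range(grid_cols):
            y1, y2 = int(row * h / grid_rows), int((row + 1) * h / grid_rows)
            x1, x2 = int(col * w / grid_cols), int((col + 1) * w / grid_cols)
            if mask[y1:y2, x1:x2].any():
                out[y1:y2, x1:x2] = True
    return out

rows, cols = calculate_square_grid(1920, 1080)  # 16:9 frame
mask = np.zeros((1080, 1920), dtype=bool)
mask[10, 10] = True                             # one stray foreground pixel
g = gridify(mask, rows, cols)
print(rows, cols)    # 8 14
print(int(g.sum()))  # 18495 -- the full 135x137 top-left cell
```

A single pixel thus inflates to a whole cell, which is exactly the coarsening the quadmask relies on.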
VLM-MASK-REASONER/stage3a_generate_grey_masks_v2.py ADDED
@@ -0,0 +1,576 @@
+#!/usr/bin/env python3
+"""
+Stage 3a: Generate Grey Masks (CORRECTED)
+
+Corrected pipeline:
+1. For EACH affected object (from the VLM analysis):
+   a) IF the user drew a trajectory (Stage 3b):
+      - Segment the object in first_appears_frame to get its SIZE
+      - Apply that object SIZE along the trajectory path across all frames
+   b) ELSE (no user trajectory):
+      - Segment the object through ALL frames (captures any movement/changes).
+        This handles:
+        * static objects (can, chair)
+        * objects that move during the video (golf ball)
+        * dynamic effects (paint strokes, shadows)
+      - Filter by proximity to the primary object
+
+2. Accumulate all masks (one combined mask per frame)
+
+3. Gridify ALL accumulated masks:
+   - if ANY pixel falls in a grid cell, the ENTIRE cell becomes 127
+
+4. Write grey_mask.mp4
+
+Key insight: will_move / needs_trajectory matter ONLY for Stage 3b (user input).
+In Stage 3a we segment ALL affected objects through ALL frames.
+
+Input:  - vlm_analysis.json (Stage 2)
+        - black_mask.mp4 (Stage 1)
+        - trajectories.json (Stage 3b, optional)
+Output: - grey_mask.mp4 (127 = affected, 255 = background)
+
+Usage:
+    python stage3a_generate_grey_masks_v2.py --config more_dyn_2_config_points_absolute.json
+"""
+
+import os
+import sys
+import json
+import argparse
+import subprocess
+from pathlib import Path
+from typing import Dict, List, Tuple, Optional
+
+import cv2
+import numpy as np
+from PIL import Image
+
+# SAM2 for video tracking
+try:
+    from sam2.build_sam import build_sam2_video_predictor
+    SAM2_AVAILABLE = True
+except ImportError:
+    SAM2_AVAILABLE = False
+
+# SAM3 for single-frame, text-prompted segmentation
+try:
+    from sam3.model_builder import build_sam3_image_model
+    from sam3.model.sam3_image_processor import Sam3Processor
+    SAM3_AVAILABLE = True
+except ImportError:
+    SAM3_AVAILABLE = False
+
+# LangSAM fallback
+try:
+    from lang_sam import LangSAM
+    LANGSAM_AVAILABLE = True
+except ImportError:
+    LANGSAM_AVAILABLE = False
+
+
+class SegmentationModel:
+    """Thin wrapper over the available text-prompted segmenters."""
+
+    def __init__(self, model_type: str = "sam3"):
+        self.model_type = model_type.lower()
+
+        if self.model_type == "sam3":
+            if not SAM3_AVAILABLE:
+                raise ImportError("SAM3 not available")
+            print("  Loading SAM3...")
+            model = build_sam3_image_model()
+            self.processor = Sam3Processor(model)
+            self.model = model
+        elif self.model_type == "langsam":
+            if not LANGSAM_AVAILABLE:
+                raise ImportError("LangSAM not available")
+            print("  Loading LangSAM...")
+            self.model = LangSAM()
+        else:
+            raise ValueError(f"Unknown model: {model_type}")
+
+    def segment(self, image_pil: Image.Image, prompt: str) -> np.ndarray:
+        """Segment an object using a text prompt; returns a boolean mask."""
+        if self.model_type == "sam3":
+            return self._segment_sam3(image_pil, prompt)
+        return self._segment_langsam(image_pil, prompt)
+
+    def _segment_sam3(self, image_pil: Image.Image, prompt: str) -> np.ndarray:
+        import torch
+        h, w = image_pil.height, image_pil.width
+        union = np.zeros((h, w), dtype=bool)
+
+        try:
+            inference_state = self.processor.set_image(image_pil)
+            output = self.processor.set_text_prompt(state=inference_state, prompt=prompt)
+            masks = output.get("masks")
+
+            if masks is None or len(masks) == 0:
+                return union
+
+            if torch.is_tensor(masks):
+                masks = masks.cpu().numpy()
+
+            # Union across however many instances/batches came back
+            if masks.ndim == 2:
+                union = masks.astype(bool)
+            elif masks.ndim == 3:
+                union = masks.any(axis=0).astype(bool)
+            elif masks.ndim == 4:
+                union = masks.any(axis=(0, 1)).astype(bool)
+
+        except Exception as e:
+            print(f"    Warning: SAM3 failed: {e}")
+
+        return union
+
+    def _segment_langsam(self, image_pil: Image.Image, prompt: str) -> np.ndarray:
+        h, w = image_pil.height, image_pil.width
+        union = np.zeros((h, w), dtype=bool)
+
+        try:
+            results = self.model.predict([image_pil], [prompt])
+            if not results:
+                return union
+
+            r0 = results[0]
+            if isinstance(r0, dict) and "masks" in r0:
+                masks = r0["masks"]
+                if masks.ndim == 4 and masks.shape[0] == 1:
+                    masks = masks[0]
+                if masks.ndim == 3:
+                    union = masks.any(axis=0).astype(bool)
+                elif masks.ndim == 2:
+                    union = masks.astype(bool)
+
+        except Exception as e:
+            print(f"    Warning: LangSAM failed: {e}")
+
+        return union
+
+
+def calculate_square_grid(width: int, height: int, min_grid: int = 8) -> Tuple[int, int]:
+    """Calculate grid dimensions so cells are approximately square."""
+    aspect_ratio = width / height
+    if width >= height:
+        grid_rows = min_grid
+        grid_cols = max(min_grid, round(min_grid * aspect_ratio))
+    else:
+        grid_cols = min_grid
+        grid_rows = max(min_grid, round(min_grid / aspect_ratio))
+    return grid_rows, grid_cols
+
+
+def gridify_masks(masks: List[np.ndarray], grid_rows: int, grid_cols: int) -> List[np.ndarray]:
+    """
+    Gridify masks: if ANY pixel falls in a grid cell, the ENTIRE cell becomes True.
+
+    Args:
+        masks: list of boolean masks (one per frame)
+        grid_rows, grid_cols: grid dimensions
+
+    Returns:
+        List of gridified boolean masks
+    """
+    gridified_masks = []
+
+    for mask in masks:
+        h, w = mask.shape
+        gridified = np.zeros((h, w), dtype=bool)
+
+        cell_width = w / grid_cols
+        cell_height = h / grid_rows
+
+        for row in range(grid_rows):
+            for col in range(grid_cols):
+                y1 = int(row * cell_height)
+                y2 = int((row + 1) * cell_height)
+                x1 = int(col * cell_width)
+                x2 = int((col + 1) * cell_width)
+
+                # If ANY pixel in the cell is set, fill the ENTIRE cell
+                if mask[y1:y2, x1:x2].any():
+                    gridified[y1:y2, x1:x2] = True
+
+        gridified_masks.append(gridified)
+
+    return gridified_masks
+
+
+def get_object_size(mask: np.ndarray) -> Tuple[int, int]:
+    """Get the bounding-box size (width, height) of an object mask."""
+    rows = np.any(mask, axis=1)
+    cols = np.any(mask, axis=0)
+
+    if not rows.any() or not cols.any():
+        return 0, 0
+
+    y1, y2 = np.where(rows)[0][[0, -1]]
+    x1, x2 = np.where(cols)[0][[0, -1]]
+
+    width = x2 - x1 + 1
+    height = y2 - y1 + 1
+
+    return width, height
+
+
+def apply_object_along_trajectory(obj_mask: np.ndarray, trajectory_points: List[Tuple[int, int]],
+                                  total_frames: int, frame_shape: Tuple[int, int]) -> List[np.ndarray]:
+    """
+    Place the object along the trajectory path across frames.
+
+    Args:
+        obj_mask: object mask from first_appears_frame
+        trajectory_points: list of (x, y) points defining the path
+        total_frames: total number of frames in the video
+        frame_shape: (height, width)
+
+    Returns:
+        List of masks (one per frame) with the object placed along the trajectory
+    """
+    h, w = frame_shape
+    masks = [np.zeros((h, w), dtype=bool) for _ in range(total_frames)]
+
+    if len(trajectory_points) < 2:
+        return masks
+
+    # Get the object's bounding-box size
+    obj_width, obj_height = get_object_size(obj_mask)
+
+    if obj_width == 0 or obj_height == 0:
+        return masks
+
+    num_traj_points = len(trajectory_points)
+
+    for frame_idx in range(total_frames):
+        # Map the frame index to a trajectory point
+        t = frame_idx / max(total_frames - 1, 1)  # 0.0 to 1.0
+        traj_idx = min(int(t * (num_traj_points - 1)), num_traj_points - 1)
+
+        # Get the position on the trajectory
+        x_center, y_center = trajectory_points[traj_idx]
+
+        # Place the object's bounding box centered at this position (clamped to frame)
+        x1 = max(0, int(x_center - obj_width // 2))
+        y1 = max(0, int(y_center - obj_height // 2))
+        x2 = min(w, x1 + obj_width)
+        y2 = min(h, y1 + obj_height)
+
+        masks[frame_idx][y1:y2, x1:x2] = True
+
+    return masks
+
+
+def segment_object_all_frames(video_path: str, obj_noun: str, segmenter: SegmentationModel,
+                              frame_stride: int = 1) -> List[np.ndarray]:
+    """
+    Segment an object through all frames.
+
+    Args:
+        video_path: path to the video
+        obj_noun: object to segment
+        segmenter: segmentation model
+        frame_stride: process every Nth frame (for speed); skipped frames reuse the previous mask
+
+    Returns:
+        List of boolean masks (one per frame)
+    """
+    cap = cv2.VideoCapture(video_path)
+    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
+    frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
+    frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
+
+    masks = []
+    frame_idx = 0
+
+    while True:
+        ret, frame = cap.read()
+        if not ret:
+            break
+
+        if frame_idx % frame_stride == 0:
+            # Segment this frame
+            frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
+            frame_pil = Image.fromarray(frame_rgb)
+            masks.append(segmenter.segment(frame_pil, obj_noun))
+
+            if (frame_idx + 1) % 10 == 0:
+                print(f"    Frame {frame_idx + 1}/{total_frames}...", end='\r')
+        else:
+            # Reuse the previous mask
+            if masks:
+                masks.append(masks[-1])
+            else:
+                masks.append(np.zeros((frame_height, frame_width), dtype=bool))
+
+        frame_idx += 1
+
+    cap.release()
+    print(f"    Segmented {total_frames} frames")
+
+    return masks
+
+
+def dilate_mask(mask: np.ndarray, kernel_size: int = 15) -> np.ndarray:
+    """Dilate a mask for proximity checking."""
+    kernel = np.ones((kernel_size, kernel_size), np.uint8)
+    return cv2.dilate(mask.astype(np.uint8), kernel, iterations=1).astype(bool)
+
+
+def filter_masks_by_proximity(masks: List[np.ndarray], primary_mask: np.ndarray,
+                              dilation: int = 50) -> List[np.ndarray]:
+    """Keep only the parts of each mask that fall near the primary mask."""
+    proximity_region = dilate_mask(primary_mask, dilation)
+    return [mask & proximity_region for mask in masks]
+
+
+def process_video_grey_masks(video_info: Dict, segmenter: SegmentationModel,
+                             trajectory_data: List[Dict] = None):
+    """Generate grey masks for a single video."""
+    video_path = video_info.get("video_path", "")
+    output_dir = Path(video_info.get("output_dir", ""))
+
+    if not output_dir.exists():
+        print("  ⚠️ Output directory not found")
+        return
+
+    # Load required files
+    vlm_analysis_path = output_dir / "vlm_analysis.json"
+    black_mask_path = output_dir / "black_mask.mp4"
+    input_video_path = output_dir / "input_video.mp4"
+
+    if not vlm_analysis_path.exists():
+        print("  ⚠️ vlm_analysis.json not found")
+        return
+
+    if not black_mask_path.exists():
+        print("  ⚠️ black_mask.mp4 not found")
+        return
+
+    if not input_video_path.exists():
+        input_video_path = Path(video_path)
+
+    # Load VLM analysis
+    with open(vlm_analysis_path, 'r') as f:
+        analysis = json.load(f)
+
+    # Get video properties
+    cap = cv2.VideoCapture(str(input_video_path))
+    fps = cap.get(cv2.CAP_PROP_FPS)
+    frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
+    frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
+    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
+    cap.release()
+
+    # Calculate grid
+    min_grid = video_info.get('min_grid', 8)
+    grid_rows, grid_cols = calculate_square_grid(frame_width, frame_height, min_grid)
+
+    print(f"  Video: {frame_width}x{frame_height}, {total_frames} frames, grid: {grid_rows}x{grid_cols}")
+
+    # Load black mask (first frame, for proximity filtering)
+    black_cap = cv2.VideoCapture(str(black_mask_path))
+    ret, black_mask_frame = black_cap.read()
+    black_cap.release()
+
+    if not ret:
+        print("  ⚠️ Failed to read black mask")
+        return
+
+    if len(black_mask_frame.shape) == 3:
+        black_mask_frame = cv2.cvtColor(black_mask_frame, cv2.COLOR_BGR2GRAY)
+
+    primary_mask = (black_mask_frame == 0)  # 0 = primary object
+
+    # Initialize accumulated masks (one per frame)
+    accumulated_masks = [np.zeros((frame_height, frame_width), dtype=bool) for _ in range(total_frames)]
+
+    # Process affected objects
+    affected_objects = analysis.get('affected_objects', [])
+    print(f"  Processing {len(affected_objects)} affected object(s)...")
+
+    for obj in affected_objects:
+        noun = obj.get('noun', '')
+
+        if not noun:
+            continue
+
+        print(f"  β€’ {noun}")
+
+        # Check whether the user drew a trajectory for this object
+        has_trajectory = False
+        if trajectory_data:
+            for traj in trajectory_data:
+                if traj.get('object_noun', '') == noun and not traj.get('skipped', False):
+                    has_trajectory = True
+                    traj_points = traj.get('trajectory_points', [])
+
+                    print(f"    Using user-drawn trajectory ({len(traj_points)} points)")
+
+                    # Segment the object in first_appears_frame to get its SIZE
+                    first_frame_idx = obj.get('first_appears_frame', 0)
+                    cap = cv2.VideoCapture(str(input_video_path))
+                    cap.set(cv2.CAP_PROP_POS_FRAMES, first_frame_idx)
+                    ret, frame = cap.read()
+                    cap.release()
+
+                    if ret:
+                        frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
+                        frame_pil = Image.fromarray(frame_rgb)
+                        obj_mask = segmenter.segment(frame_pil, noun)
+
+                        if obj_mask.any():
+                            obj_width, obj_height = get_object_size(obj_mask)
+                            print(f"    Segmented object (size: {obj_width}x{obj_height} px)")
+
+                            # Apply the object SIZE along the trajectory
+                            traj_masks = apply_object_along_trajectory(
+                                obj_mask, traj_points, total_frames, (frame_height, frame_width)
+                            )
+
+                            # Accumulate
+                            for i in range(total_frames):
+                                accumulated_masks[i] |= traj_masks[i]
+
+                            print(f"    βœ“ Applied object along trajectory across {total_frames} frames")
+                        else:
+                            print("    ⚠️ Segmentation failed, using trajectory grid cells only")
+                            # Fallback: just use the trajectory grid cells
+                            grid_cells = traj.get('trajectory_grid_cells', [])
+                            for row, col in grid_cells:
+                                y1 = int(row * frame_height / grid_rows)
+                                y2 = int((row + 1) * frame_height / grid_rows)
+                                x1 = int(col * frame_width / grid_cols)
+                                x2 = int((col + 1) * frame_width / grid_cols)
+                                for i in range(total_frames):
+                                    accumulated_masks[i][y1:y2, x1:x2] = True
+
+                    break
+
+        # If there is NO user trajectory, segment through ALL frames.
+        # This captures static objects, objects that move during the video, and dynamic effects.
+        if not has_trajectory:
+            print("    Segmenting through ALL frames (captures any movement/changes)...")
+            obj_masks = segment_object_all_frames(str(input_video_path), noun, segmenter, frame_stride=5)
+
+            # Filter by proximity to the primary mask
+            obj_masks_filtered = filter_masks_by_proximity(obj_masks, primary_mask, dilation=50)
+
+            # Accumulate
+            for i in range(len(obj_masks_filtered)):
+                if i < len(accumulated_masks):
+                    accumulated_masks[i] |= obj_masks_filtered[i]
+
+            pixel_count = sum(mask.sum() for mask in obj_masks_filtered)
+            print(f"    βœ“ Segmented across {len(obj_masks_filtered)} frames ({pixel_count} total pixels)")
+
+    # GRIDIFY all accumulated masks
+    print("  Gridifying masks...")
476
+ gridified_masks = gridify_masks(accumulated_masks, grid_rows, grid_cols)
477
+
478
+ # Convert to uint8 (127 = grey, 255 = background)
479
+ grey_masks_uint8 = [np.where(mask, 127, 255).astype(np.uint8) for mask in gridified_masks]
480
+
481
+ # Write video
482
+ print(f" Writing grey_mask.mp4...")
483
+ temp_avi = output_dir / "grey_mask_temp.avi"
484
+ fourcc = cv2.VideoWriter_fourcc(*'FFV1')
485
+ out = cv2.VideoWriter(str(temp_avi), fourcc, fps, (frame_width, frame_height), isColor=False)
486
+
487
+ for mask in grey_masks_uint8:
488
+ out.write(mask)
489
+
490
+ out.release()
491
+
492
+ # Convert to MP4
493
+ grey_mask_mp4 = output_dir / "grey_mask.mp4"
494
+ cmd = [
495
+ 'ffmpeg', '-y', '-i', str(temp_avi),
496
+ '-c:v', 'libx264', '-qp', '0', '-preset', 'ultrafast',
497
+ '-pix_fmt', 'yuv444p',
498
+ str(grey_mask_mp4)
499
+ ]
500
+ result = subprocess.run(cmd, capture_output=True, text=True)
+ if result.returncode != 0:
+ print(f" ⚠️ ffmpeg conversion failed:\n{result.stderr}")
501
+ temp_avi.unlink()
502
+
503
+ print(f" βœ“ Saved grey_mask.mp4")
504
+
505
+ # Save debug visualization (first frame)
506
+ debug_vis = np.zeros((frame_height, frame_width, 3), dtype=np.uint8)
507
+ debug_vis[gridified_masks[0]] = [0, 255, 0] # Green
508
+ debug_vis[primary_mask] = [0, 0, 255] # Red (cv2.imwrite expects BGR)
509
+ debug_path = output_dir / "debug_grey_mask.jpg"
510
+ cv2.imwrite(str(debug_path), debug_vis)
511
+
512
+
513
+ def main():
514
+ parser = argparse.ArgumentParser(description="Stage 3a: Generate Grey Masks (Corrected)")
515
+ parser.add_argument("--config", required=True, help="Config JSON")
516
+ parser.add_argument("--segmentation-model", default="sam3", choices=["langsam", "sam3"],
517
+ help="Segmentation model")
518
+ args = parser.parse_args()
519
+
520
+ config_path = Path(args.config)
521
+
522
+ # Load config
523
+ with open(config_path, 'r') as f:
524
+ config_data = json.load(f)
525
+
526
+ if isinstance(config_data, list):
527
+ videos = config_data
528
+ elif isinstance(config_data, dict) and "videos" in config_data:
529
+ videos = config_data["videos"]
530
+ else:
531
+ raise ValueError("Invalid config format")
532
+
533
+ # Load trajectory data
534
+ trajectory_path = config_path.parent / f"{config_path.stem}_trajectories.json"
535
+ trajectory_data = None
536
+
537
+ if trajectory_path.exists():
538
+ print(f"Loading trajectory data: {trajectory_path.name}")
539
+ with open(trajectory_path, 'r') as f:
540
+ trajectory_data = json.load(f)
541
+ print(f" Loaded {len(trajectory_data)} trajectory(s)")
542
+
543
+ print(f"\n{'='*70}")
544
+ print(f"Stage 3a: Generate Grey Masks (CORRECTED)")
545
+ print(f"{'='*70}")
546
+ print(f"Videos: {len(videos)}")
547
+ print(f"Segmentation: {args.segmentation_model.upper()}")
548
+ print(f"{'='*70}\n")
549
+
550
+ # Load segmentation model
551
+ segmenter = SegmentationModel(args.segmentation_model)
552
+
553
+ # Process each video
554
+ for i, video_info in enumerate(videos):
555
+ video_path = video_info.get('video_path', '')
556
+ print(f"\n{'─'*70}")
557
+ print(f"Video {i+1}/{len(videos)}: {Path(video_path).parent.name}")
558
+ print(f"{'─'*70}")
559
+
560
+ try:
561
+ process_video_grey_masks(video_info, segmenter, trajectory_data)
562
+ print(f"\nβœ… Video {i+1} complete!")
563
+
564
+ except Exception as e:
565
+ print(f"\n❌ Error: {e}")
566
+ import traceback
567
+ traceback.print_exc()
568
+ continue
569
+
570
+ print(f"\n{'='*70}")
571
+ print(f"βœ… Stage 3a Complete!")
572
+ print(f"{'='*70}\n")
573
+
574
+
575
+ if __name__ == "__main__":
576
+ main()
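The grey-mask value scheme above (127 = affected object, 255 = background) round-trips exactly as long as the encoding stays lossless. A minimal sketch of the conversion used in stage 3a (the array values here are illustrative):

```python
import numpy as np

# Boolean mask -> uint8 grey-mask frame: 127 marks affected-object pixels,
# 255 is background, exactly as written by stage3a before video encoding.
mask = np.array([[True, False],
                 [False, True]])
grey_frame = np.where(mask, 127, 255).astype(np.uint8)

print(grey_frame)   # [[127 255]
                    #  [255 127]]

# Decoding is exact provided no lossy codec touched the pixel values.
recovered = grey_frame == 127
assert (recovered == mask).all()
```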
VLM-MASK-REASONER/stage3b_trajectory_gui.py ADDED
@@ -0,0 +1,432 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Stage 3b: Trajectory Drawing GUI (Simplified - No Segmentation)
4
+
5
+ For objects with needs_trajectory=true, user draws movement paths.
6
+
7
+ Input: Config with vlm_analysis.json in output_dir
8
+ Output: trajectory_data.json with user-drawn paths as grid cells
9
+
10
+ Usage:
11
+ python stage3b_trajectory_gui.py --config more_dyn_2_config_points_absolute.json
12
+ """
13
+
14
+ import os
15
+ import sys
16
+ import json
17
+ import argparse
18
+ import cv2
19
+ import numpy as np
20
+ import tkinter as tk
21
+ from tkinter import ttk, messagebox
22
+ from PIL import Image, ImageTk, ImageDraw, ImageFont
23
+ from pathlib import Path
24
+ from typing import Dict, List, Tuple
25
+
26
+
27
+ def calculate_square_grid(width: int, height: int, min_grid: int = 8) -> Tuple[int, int]:
28
+ """Calculate grid dimensions for square cells"""
29
+ aspect_ratio = width / height
30
+ if width >= height:
31
+ grid_rows = min_grid
32
+ grid_cols = max(min_grid, round(min_grid * aspect_ratio))
33
+ else:
34
+ grid_cols = min_grid
35
+ grid_rows = max(min_grid, round(min_grid / aspect_ratio))
36
+ return grid_rows, grid_cols
37
+
38
+
39
+ def points_to_grid_cells(points: List[Tuple[int, int]], grid_rows: int, grid_cols: int,
40
+ frame_width: int, frame_height: int) -> List[List[int]]:
41
+ """Convert trajectory points to grid cells"""
42
+ cell_width = frame_width / grid_cols
43
+ cell_height = frame_height / grid_rows
44
+
45
+ grid_cells = set()
46
+ for x, y in points:
47
+ col = int(x / cell_width)
48
+ row = int(y / cell_height)
49
+ if 0 <= row < grid_rows and 0 <= col < grid_cols:
50
+ grid_cells.add((row, col))
51
+
52
+ # Sort by row, then col
53
+ return sorted([[r, c] for r, c in grid_cells])
54
+
55
+
56
+ class TrajectoryGUI:
57
+ def __init__(self, root, objects_data: List[Dict]):
58
+ self.root = root
59
+ self.root.title("Stage 3b: Trajectory Drawing")
60
+
61
+ self.objects_data = objects_data # List of {video_info, objects_needing_trajectory}
62
+ self.current_video_idx = 0
63
+ self.current_object_idx = 0
64
+
65
+ # Current state
66
+ self.frame = None
67
+ self.trajectory_points = []
68
+ self.drawing = False
69
+
70
+ # Display
71
+ self.display_scale = 1.0
72
+ self.photo = None
73
+
74
+ # Results storage
75
+ self.all_trajectories = [] # List of trajectories for all videos
76
+
77
+ self.setup_ui()
78
+ self.load_current_object()
79
+
80
+ def setup_ui(self):
81
+ """Setup GUI layout"""
82
+ # Top info
83
+ info_frame = ttk.Frame(self.root)
84
+ info_frame.pack(side=tk.TOP, fill=tk.X, padx=5, pady=5)
85
+
86
+ self.video_label = ttk.Label(info_frame, text="Video: ", font=("Arial", 10, "bold"))
87
+ self.video_label.pack(side=tk.LEFT, padx=5)
88
+
89
+ self.object_label = ttk.Label(info_frame, text="Object: ", foreground="blue")
90
+ self.object_label.pack(side=tk.LEFT, padx=10)
91
+
92
+ # Instructions
93
+ inst_frame = ttk.LabelFrame(self.root, text="Instructions")
94
+ inst_frame.pack(side=tk.TOP, fill=tk.X, padx=5, pady=5)
95
+
96
+ ttk.Label(inst_frame, text="1. See the frame where object is visible", foreground="blue").pack(anchor=tk.W, padx=5)
97
+ ttk.Label(inst_frame, text="2. Click and drag to draw trajectory path (RED line)", foreground="red").pack(anchor=tk.W, padx=5)
98
+ ttk.Label(inst_frame, text="3. Draw from object's current position to where it should end up", foreground="orange").pack(anchor=tk.W, padx=5)
99
+ ttk.Label(inst_frame, text="4. Click 'Clear' to restart, 'Save & Next' when done", foreground="green").pack(anchor=tk.W, padx=5)
100
+
101
+ # Canvas
102
+ canvas_frame = ttk.LabelFrame(self.root, text="Draw Trajectory Path")
103
+ canvas_frame.pack(side=tk.TOP, fill=tk.BOTH, expand=True, padx=5, pady=5)
104
+
105
+ self.canvas = tk.Canvas(canvas_frame, width=800, height=600, bg='black', cursor="crosshair")
106
+ self.canvas.pack(fill=tk.BOTH, expand=True)
107
+ self.canvas.bind("<Button-1>", self.on_canvas_click)
108
+ self.canvas.bind("<B1-Motion>", self.on_canvas_drag)
109
+ self.canvas.bind("<ButtonRelease-1>", self.on_canvas_release)
110
+
111
+ # Controls
112
+ controls = ttk.Frame(self.root)
113
+ controls.pack(side=tk.BOTTOM, fill=tk.X, padx=5, pady=5)
114
+
115
+ self.status_label = ttk.Label(controls, text="Draw trajectory path for object", foreground="blue")
116
+ self.status_label.pack(side=tk.TOP, pady=5)
117
+
118
+ button_frame = ttk.Frame(controls)
119
+ button_frame.pack(side=tk.TOP)
120
+
121
+ ttk.Button(button_frame, text="Clear Trajectory", command=self.clear_trajectory).pack(side=tk.LEFT, padx=5)
122
+ ttk.Button(button_frame, text="Skip Object", command=self.skip_object).pack(side=tk.LEFT, padx=5)
123
+ ttk.Button(button_frame, text="Save & Next", command=self.save_and_next).pack(side=tk.LEFT, padx=5)
124
+
125
+ self.progress_label = ttk.Label(controls, text="", font=("Arial", 9))
126
+ self.progress_label.pack(side=tk.TOP, pady=5)
127
+
128
+ def load_current_object(self):
129
+ """Load current object for trajectory drawing"""
130
+ if self.current_video_idx >= len(self.objects_data):
131
+ # All done
132
+ self.finish()
133
+ return
134
+
135
+ data = self.objects_data[self.current_video_idx]
136
+ video_info = data['video_info']
137
+ objects_needing_traj = data['objects']
138
+
139
+ if self.current_object_idx >= len(objects_needing_traj):
140
+ # Done with this video, move to next
141
+ self.current_video_idx += 1
142
+ self.current_object_idx = 0
143
+ self.load_current_object()
144
+ return
145
+
146
+ obj = objects_needing_traj[self.current_object_idx]
147
+ video_path = video_info.get('video_path', '')
148
+ output_dir = Path(video_info.get('output_dir', ''))
149
+
150
+ # Update labels
151
+ self.video_label.config(text=f"Video: {Path(video_path).parent.name}/{Path(video_path).name}")
152
+ self.object_label.config(text=f"Object: {obj['noun']} (will fall/move)")
153
+
154
+ total_objects = sum(len(d['objects']) for d in self.objects_data)
155
+ current_obj_num = sum(len(self.objects_data[i]['objects']) for i in range(self.current_video_idx)) + self.current_object_idx + 1
156
+ self.progress_label.config(text=f"Object {current_obj_num}/{total_objects} across {len(self.objects_data)} video(s)")
157
+
158
+ # Extract frame
159
+ frame_idx = obj.get('first_appears_frame', 0)
160
+ input_video = output_dir / "input_video.mp4"
161
+
162
+ if not input_video.exists():
163
+ input_video = Path(video_path)
164
+
165
+ cap = cv2.VideoCapture(str(input_video))
166
+ cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)
167
+ ret, frame = cap.read()
168
+ cap.release()
169
+
170
+ if not ret:
171
+ messagebox.showerror("Error", f"Failed to read frame {frame_idx} from video")
172
+ self.skip_object()
173
+ return
174
+
175
+ self.frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
176
+
177
+ print(f"\n Loaded frame {frame_idx} for '{obj['noun']}'")
178
+
179
+ # Clear trajectory
180
+ self.trajectory_points = []
181
+
182
+ # Calculate grid for this video
183
+ h, w = self.frame.shape[:2]
184
+ min_grid = video_info.get('min_grid', 8)
185
+ self.grid_rows, self.grid_cols = calculate_square_grid(w, h, min_grid)
186
+
187
+ # Display
188
+ self.status_label.config(text="Draw trajectory path (click and drag)", foreground="blue")
189
+ self.display_frame()
190
+
191
+ def display_frame(self):
192
+ """Display frame with trajectory"""
193
+ if self.frame is None:
194
+ return
195
+
196
+ # Create visualization
197
+ vis = self.frame.copy()
198
+ h, w = vis.shape[:2]
199
+
200
+ # Draw trajectory
201
+ if len(self.trajectory_points) > 1:
202
+ for i in range(len(self.trajectory_points) - 1):
203
+ pt1 = self.trajectory_points[i]
204
+ pt2 = self.trajectory_points[i + 1]
205
+ cv2.line(vis, pt1, pt2, (255, 0, 0), 5) # Thicker line for visibility
206
+
207
+ # Draw start point (green) and end point (red)
208
+ if len(self.trajectory_points) > 0:
209
+ start_pt = self.trajectory_points[0]
210
+ end_pt = self.trajectory_points[-1]
211
+ cv2.circle(vis, start_pt, 8, (0, 255, 0), -1) # Green start
212
+ cv2.circle(vis, end_pt, 8, (255, 0, 0), -1) # Red end
213
+
214
+ # Scale for display
215
+ max_width, max_height = 800, 600
216
+ scale_w = max_width / w
217
+ scale_h = max_height / h
218
+ self.display_scale = min(scale_w, scale_h, 1.0)
219
+
220
+ new_w = int(w * self.display_scale)
221
+ new_h = int(h * self.display_scale)
222
+ vis_resized = cv2.resize(vis, (new_w, new_h))
223
+
224
+ # Convert to PIL and display
225
+ pil_img = Image.fromarray(vis_resized)
226
+ self.photo = ImageTk.PhotoImage(pil_img)
227
+ self.canvas.delete("all")
228
+ self.canvas.create_image(0, 0, anchor=tk.NW, image=self.photo)
229
+
230
+ def on_canvas_click(self, event):
231
+ """Start drawing trajectory"""
232
+ # Convert to frame coordinates
233
+ x = int(event.x / self.display_scale)
234
+ y = int(event.y / self.display_scale)
235
+
236
+ self.trajectory_points = [(x, y)]
237
+ self.drawing = True
238
+
239
+ def on_canvas_drag(self, event):
240
+ """Continue drawing trajectory"""
241
+ if not self.drawing:
242
+ return
243
+
244
+ x = int(event.x / self.display_scale)
245
+ y = int(event.y / self.display_scale)
246
+
247
+ # Add point if far enough from last point
248
+ if len(self.trajectory_points) > 0:
249
+ last_x, last_y = self.trajectory_points[-1]
250
+ dist = np.sqrt((x - last_x)**2 + (y - last_y)**2)
251
+ if dist > 5: # Minimum distance between points
252
+ self.trajectory_points.append((x, y))
253
+ self.display_frame()
254
+
255
+ def on_canvas_release(self, event):
256
+ """Finish drawing trajectory"""
257
+ self.drawing = False
258
+ if len(self.trajectory_points) > 0:
259
+ x = int(event.x / self.display_scale)
260
+ y = int(event.y / self.display_scale)
261
+ self.trajectory_points.append((x, y))
262
+ self.display_frame()
263
+
264
+ def clear_trajectory(self):
265
+ """Clear drawn trajectory"""
266
+ self.trajectory_points = []
267
+ self.display_frame()
268
+ self.status_label.config(text="Trajectory cleared. Draw again.", foreground="blue")
269
+
270
+ def skip_object(self):
271
+ """Skip current object without saving trajectory"""
272
+ result = messagebox.askyesno("Skip Object", "Skip this object without drawing trajectory?")
273
+ if not result:
274
+ return
275
+
276
+ # Save empty trajectory
277
+ data = self.objects_data[self.current_video_idx]
278
+ obj = data['objects'][self.current_object_idx]
279
+
280
+ self.all_trajectories.append({
281
+ 'video_path': data['video_info']['video_path'],
282
+ 'object_noun': obj['noun'],
283
+ 'trajectory_points': [],
284
+ 'trajectory_grid_cells': [],
285
+ 'skipped': True
286
+ })
287
+
288
+ self.current_object_idx += 1
289
+ self.load_current_object()
290
+
291
+ def save_and_next(self):
292
+ """Save trajectory and move to next object"""
293
+ if len(self.trajectory_points) < 2:
294
+ messagebox.showwarning("Warning", "Draw a trajectory path first (at least 2 points)")
295
+ return
296
+
297
+ # Convert to grid cells
298
+ data = self.objects_data[self.current_video_idx]
299
+ obj = data['objects'][self.current_object_idx]
300
+
301
+ grid_cells = points_to_grid_cells(
302
+ self.trajectory_points,
303
+ self.grid_rows,
304
+ self.grid_cols,
305
+ self.frame.shape[1],
306
+ self.frame.shape[0]
307
+ )
308
+
309
+ # Save
310
+ self.all_trajectories.append({
311
+ 'video_path': data['video_info']['video_path'],
312
+ 'object_noun': obj['noun'],
313
+ 'first_appears_frame': obj.get('first_appears_frame', 0),
314
+ 'trajectory_points': self.trajectory_points,
315
+ 'trajectory_grid_cells': grid_cells,
316
+ 'grid_rows': self.grid_rows,
317
+ 'grid_cols': self.grid_cols,
318
+ 'skipped': False
319
+ })
320
+
321
+ print(f" βœ“ Saved trajectory for '{obj['noun']}': {len(grid_cells)} grid cells")
322
+
323
+ self.current_object_idx += 1
324
+ self.load_current_object()
325
+
326
+ def finish(self):
327
+ """All objects done"""
328
+ self.status_label.config(text="All trajectories complete!", foreground="green")
329
+ messagebox.showinfo("Complete", "All trajectory drawings complete!\n\nSaving results...")
330
+ self.root.quit()
331
+
332
+
333
+ def find_objects_needing_trajectory(config_path: str) -> List[Dict]:
334
+ """Find all objects that need trajectory input"""
335
+ config_path = Path(config_path)
336
+
337
+ with open(config_path, 'r') as f:
338
+ config_data = json.load(f)
339
+
340
+ if isinstance(config_data, list):
341
+ videos = config_data
342
+ elif isinstance(config_data, dict) and "videos" in config_data:
343
+ videos = config_data["videos"]
344
+ else:
345
+ raise ValueError("Invalid config format")
346
+
347
+ objects_data = []
348
+
349
+ for video_info in videos:
350
+ output_dir = Path(video_info.get('output_dir', ''))
351
+ vlm_analysis_path = output_dir / "vlm_analysis.json"
352
+
353
+ if not vlm_analysis_path.exists():
354
+ print(f" Skipping {output_dir.parent.name}: no vlm_analysis.json")
355
+ continue
356
+
357
+ with open(vlm_analysis_path, 'r') as f:
358
+ analysis = json.load(f)
359
+
360
+ # Find objects with needs_trajectory=true
361
+ objects_needing_traj = [
362
+ obj for obj in analysis.get('affected_objects', [])
363
+ if obj.get('needs_trajectory', False)
364
+ ]
365
+
366
+ if objects_needing_traj:
367
+ objects_data.append({
368
+ 'video_info': video_info,
369
+ 'objects': objects_needing_traj,
370
+ 'output_dir': output_dir
371
+ })
372
+
373
+ return objects_data
374
+
375
+
376
+ def main():
377
+ parser = argparse.ArgumentParser(description="Stage 3b: Trajectory Drawing GUI")
378
+ parser.add_argument("--config", required=True, help="Config JSON")
379
+ args = parser.parse_args()
380
+
381
+ print(f"\n{'='*70}")
382
+ print(f"Stage 3b: Trajectory Drawing GUI")
383
+ print(f"{'='*70}\n")
384
+
385
+ # Find objects needing trajectories
386
+ print("Finding objects that need trajectory input...")
387
+ objects_data = find_objects_needing_trajectory(args.config)
388
+
389
+ if not objects_data:
390
+ print("\nβœ… No objects need trajectory input!")
391
+ print("All objects are either stationary or visual artifacts.")
392
+ print("Proceeding to Stage 3a for mask generation...")
393
+ return
394
+
395
+ total_objects = sum(len(d['objects']) for d in objects_data)
396
+ print(f"\nFound {total_objects} object(s) needing trajectories across {len(objects_data)} video(s):")
397
+ for d in objects_data:
398
+ video_name = Path(d['video_info']['video_path']).parent.name
399
+ print(f" β€’ {video_name}: {', '.join(obj['noun'] for obj in d['objects'])}")
400
+
401
+ # Launch GUI
402
+ print("\nLaunching trajectory drawing GUI...")
403
+ print("Instructions:")
404
+ print(" 1. See the frame where the object is visible")
405
+ print(" 2. Click and drag to draw trajectory path (RED line)")
406
+ print(" 3. Draw from object's current position to where it should end up")
407
+ print(" 4. Click 'Save & Next' when done with each object")
408
+ print("")
409
+
410
+ root = tk.Tk()
411
+ root.geometry("900x800")
412
+ gui = TrajectoryGUI(root, objects_data)
413
+ root.mainloop()
414
+
415
+ # Save trajectories
416
+ config_path = Path(args.config)
417
+ output_path = config_path.parent / f"{config_path.stem}_trajectories.json"
418
+
419
+ with open(output_path, 'w') as f:
420
+ json.dump(gui.all_trajectories, f, indent=2)
421
+
422
+ print(f"\n{'='*70}")
423
+ print(f"βœ… Stage 3b Complete!")
424
+ print(f"{'='*70}")
425
+ print(f"Saved trajectories to: {output_path}")
426
+ print(f"Total trajectories: {len(gui.all_trajectories)}")
427
+ print(f"\nNext: Run Stage 3a to generate grey masks (includes trajectories)")
428
+ print(f"{'='*70}\n")
429
+
430
+
431
+ if __name__ == "__main__":
432
+ main()
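The two grid helpers at the top of stage3b are pure functions and easy to sanity-check in isolation. A self-contained sketch with the same logic as the file above (the 1920x1080 example values are illustrative):

```python
from typing import List, Tuple

def calculate_square_grid(width: int, height: int, min_grid: int = 8) -> Tuple[int, int]:
    # Short side gets min_grid cells; the long side scales by aspect ratio.
    aspect_ratio = width / height
    if width >= height:
        return min_grid, max(min_grid, round(min_grid * aspect_ratio))
    return max(min_grid, round(min_grid / aspect_ratio)), min_grid

def points_to_grid_cells(points, grid_rows, grid_cols, frame_width, frame_height):
    # Map pixel coordinates to (row, col) grid cells, deduplicated and sorted.
    cell_width = frame_width / grid_cols
    cell_height = frame_height / grid_rows
    cells = set()
    for x, y in points:
        col = int(x / cell_width)
        row = int(y / cell_height)
        if 0 <= row < grid_rows and 0 <= col < grid_cols:
            cells.add((row, col))
    return sorted([r, c] for r, c in cells)

rows, cols = calculate_square_grid(1920, 1080)  # 16:9 frame -> 8 x 14 grid
cells = points_to_grid_cells([(0, 0), (1919, 1079)], rows, cols, 1920, 1080)
print(rows, cols, cells)   # 8 14 [[0, 0], [7, 13]]
```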
VLM-MASK-REASONER/stage4_combine_masks.py ADDED
@@ -0,0 +1,241 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Stage 4: Combine Black and Grey Masks into Tri/Quad Mask
4
+
5
+ Combines the black mask (primary object) and grey masks (affected objects)
6
+ into a single tri-mask or quad-mask video.
7
+
8
+ Mask values:
9
+ - 0: Primary object (from black mask)
10
+ - 63: Overlap of primary and affected objects
11
+ - 127: Affected objects only (from grey masks)
12
+ - 255: Background (keep)
13
+ """
14
+
15
+ import json
16
+ import argparse
17
+ from pathlib import Path
18
+ import cv2
19
+ import numpy as np
20
+ from tqdm import tqdm
21
+
22
+
23
+ def combine_masks(black_frame, grey_frame):
24
+ """
25
+ Combine black and grey mask frames.
26
+
27
+ Rules:
28
+ - black=0, grey=255 β†’ 0 (primary object only)
29
+ - black=255, grey=127 β†’ 127 (affected object only)
30
+ - black=0, grey=127 β†’ 63 (overlap)
31
+ - black=255, grey=255 β†’ 255 (background)
32
+
33
+ Args:
34
+ black_frame: Frame from black_mask.mp4 (0=object, 255=background)
35
+ grey_frame: Frame from grey_mask.mp4 (127=object, 255=background)
36
+
37
+ Returns:
38
+ Combined mask frame
39
+ """
40
+ # Initialize with background (255)
41
+ combined = np.full_like(black_frame, 255, dtype=np.uint8)
42
+
43
+ # Primary object only (black=0, grey=255)
44
+ primary_only = (black_frame == 0) & (grey_frame == 255)
45
+ combined[primary_only] = 0
46
+
47
+ # Affected object only (black=255, grey=127)
48
+ affected_only = (black_frame == 255) & (grey_frame == 127)
49
+ combined[affected_only] = 127
50
+
51
+ # Overlap (black=0, grey=127)
52
+ overlap = (black_frame == 0) & (grey_frame == 127)
53
+ combined[overlap] = 63
54
+
55
+ return combined
56
+
57
+
58
+ def process_video(black_mask_path: Path, grey_mask_path: Path, output_path: Path):
59
+ """Combine black and grey mask videos into trimask/quadmask"""
60
+ import subprocess
61
+
62
+ print(f" Loading black mask: {black_mask_path.name}")
63
+ black_cap = cv2.VideoCapture(str(black_mask_path))
64
+
65
+ print(f" Loading grey mask: {grey_mask_path.name}")
66
+ grey_cap = cv2.VideoCapture(str(grey_mask_path))
67
+
68
+ # Get video properties
69
+ fps = black_cap.get(cv2.CAP_PROP_FPS)
70
+ width = int(black_cap.get(cv2.CAP_PROP_FRAME_WIDTH))
71
+ height = int(black_cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
72
+ total_frames = int(black_cap.get(cv2.CAP_PROP_FRAME_COUNT))
73
+
74
+ # Check grey mask has same properties
75
+ grey_total_frames = int(grey_cap.get(cv2.CAP_PROP_FRAME_COUNT))
76
+ if total_frames != grey_total_frames:
77
+ print(f" ⚠️ Warning: Frame count mismatch (black: {total_frames}, grey: {grey_total_frames})")
78
+ total_frames = min(total_frames, grey_total_frames)
79
+
80
+ print(f" Video: {width}x{height} @ {fps:.2f}fps, {total_frames} frames")
81
+ print(f" Combining masks...")
82
+
83
+ # Collect all frames first
84
+ combined_frames = []
85
+
86
+ # Process frames
87
+ for frame_idx in tqdm(range(total_frames), desc=" Combining"):
88
+ ret_black, black_frame = black_cap.read()
89
+ ret_grey, grey_frame = grey_cap.read()
90
+
91
+ if not ret_black or not ret_grey:
92
+ print(f" ⚠️ Warning: Could not read frame {frame_idx}")
93
+ break
94
+
95
+ # Convert to grayscale if needed
96
+ if len(black_frame.shape) == 3:
97
+ black_frame = cv2.cvtColor(black_frame, cv2.COLOR_BGR2GRAY)
98
+ if len(grey_frame.shape) == 3:
99
+ grey_frame = cv2.cvtColor(grey_frame, cv2.COLOR_BGR2GRAY)
100
+
101
+ # Combine
102
+ combined_frame = combine_masks(black_frame, grey_frame)
103
+ combined_frames.append(combined_frame)
104
+
105
+ # Cleanup
106
+ black_cap.release()
107
+ grey_cap.release()
108
+
109
+ # On the first frame, clamp near-grey values (100–135) to 255 (background).
110
+ # Video codecs can introduce slight luma drift around 127; this ensures no
111
+ # grey pixels survive into the final quadmask on frame 0.
112
+ if combined_frames:
113
+ f0 = combined_frames[0]
114
+ grey_pixels = (f0 > 100) & (f0 < 135)
115
+ f0[grey_pixels] = 255
116
+ combined_frames[0] = f0
117
+
118
+ # Write using LOSSLESS encoding to preserve exact mask values
119
+ print(f" Writing lossless video...")
120
+
121
+ # Write temp AVI with FFV1 codec (lossless)
122
+ temp_avi = output_path.with_suffix('.avi')
123
+ fourcc = cv2.VideoWriter_fourcc(*'FFV1')
124
+ out = cv2.VideoWriter(str(temp_avi), fourcc, fps, (width, height), isColor=False)
125
+
126
+ for frame in combined_frames:
127
+ out.write(frame)
128
+ out.release()
129
+
130
+ # Convert to LOSSLESS H.264 (qp=0, yuv444p to preserve all luma values)
131
+ cmd = [
132
+ 'ffmpeg', '-y', '-i', str(temp_avi),
133
+ '-c:v', 'libx264', '-qp', '0', '-preset', 'ultrafast',
134
+ '-pix_fmt', 'yuv444p',
135
+ '-r', str(fps),
136
+ str(output_path)
137
+ ]
138
+ result = subprocess.run(cmd, capture_output=True, text=True)
139
+ if result.returncode != 0:
140
+ print(f" ⚠️ Warning: ffmpeg conversion had issues")
141
+ print(result.stderr)
142
+
143
+ # Clean up temp file
144
+ temp_avi.unlink()
145
+
146
+ print(f" βœ“ Saved: {output_path.name}")
147
+
148
+
149
+ def process_config(config_path: str):
150
+ """Process all videos in config"""
151
+ config_path = Path(config_path)
152
+
153
+ # Load config
154
+ with open(config_path, 'r') as f:
155
+ config_data = json.load(f)
156
+
157
+ # Handle both formats
158
+ if isinstance(config_data, list):
159
+ videos = config_data
160
+ elif isinstance(config_data, dict) and "videos" in config_data:
161
+ videos = config_data["videos"]
162
+ else:
163
+ raise ValueError("Config must be a list or have 'videos' key")
164
+
165
+ print(f"\n{'='*70}")
166
+ print(f"Stage 4: Combine Masks into Tri/Quad Mask")
167
+ print(f"{'='*70}")
168
+ print(f"Config: {config_path.name}")
169
+ print(f"Videos: {len(videos)}")
170
+ print(f"{'='*70}\n")
171
+
172
+ # Process each video
173
+ success_count = 0
174
+ for i, video_info in enumerate(videos):
175
+ video_path = video_info.get("video_path", "")
176
+ output_dir = video_info.get("output_dir", "")
177
+
178
+ print(f"\n{'─'*70}")
179
+ print(f"Video {i+1}/{len(videos)}: {Path(video_path).name}")
180
+ print(f"{'─'*70}")
181
+
182
+ if not output_dir:
183
+ print(f" ⚠️ No output_dir specified, skipping")
184
+ continue
185
+
186
+ output_dir = Path(output_dir)
187
+ if not output_dir.exists():
188
+ print(f" ⚠️ Output directory not found: {output_dir}")
189
+ continue
190
+
191
+ # Check for required masks
192
+ black_mask_path = output_dir / "black_mask.mp4"
193
+ grey_mask_path = output_dir / "grey_mask.mp4"
194
+
195
+ if not black_mask_path.exists():
196
+ print(f" ⚠️ black_mask.mp4 not found, skipping")
197
+ continue
198
+
199
+ if not grey_mask_path.exists():
200
+ print(f" ⚠️ grey_mask.mp4 not found, skipping")
201
+ continue
202
+
203
+ # Output path
204
+ output_path = output_dir / "quadmask_0.mp4"
205
+
206
+ try:
207
+ process_video(black_mask_path, grey_mask_path, output_path)
208
+ success_count += 1
209
+ print(f"\nβœ… Video {i+1} complete!")
210
+ except Exception as e:
211
+ print(f"\n❌ Error processing video {i+1}: {e}")
212
+ import traceback
213
+ traceback.print_exc()
214
+ continue
215
+
216
+ # Summary
217
+ print(f"\n{'='*70}")
218
+ print(f"Stage 4 Complete!")
219
+ print(f"{'='*70}")
220
+ print(f"Successful: {success_count}/{len(videos)}")
221
+ print(f"Failed: {len(videos) - success_count}/{len(videos)}")
222
+ print(f"{'='*70}\n")
223
+
224
+
225
+ def main():
226
+ parser = argparse.ArgumentParser(
227
+ description="Stage 4: Combine black and grey masks into tri/quad mask"
228
+ )
229
+ parser.add_argument(
230
+ "--config",
231
+ type=str,
232
+ required=True,
233
+ help="Path to config JSON (with output_dir for each video)"
234
+ )
235
+
236
+ args = parser.parse_args()
237
+ process_config(args.config)
238
+
239
+
240
+ if __name__ == "__main__":
241
+ main()
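The combination rule in `combine_masks` is a pure per-pixel function of the two input masks, so it can be exercised on a tiny array. A self-contained sketch of the same logic (the example arrays are illustrative):

```python
import numpy as np

def combine_masks(black_frame, grey_frame):
    # Same rules as stage4_combine_masks.py:
    # 0 = primary object, 63 = overlap, 127 = affected object, 255 = background.
    combined = np.full_like(black_frame, 255, dtype=np.uint8)
    combined[(black_frame == 0) & (grey_frame == 255)] = 0
    combined[(black_frame == 255) & (grey_frame == 127)] = 127
    combined[(black_frame == 0) & (grey_frame == 127)] = 63
    return combined

black = np.array([[0, 0], [255, 255]], dtype=np.uint8)   # 0 = primary object
grey = np.array([[255, 127], [127, 255]], dtype=np.uint8)  # 127 = affected object
quad = combine_masks(black, grey)
print(quad)   # [[  0  63]
              #  [127 255]]
```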
VLM-MASK-REASONER/test_gemini_video.py ADDED
@@ -0,0 +1,98 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Quick Gemini API smoke test β€” samples a few frames from a video and asks a
4
+ simple question. Use this to verify your API key works before running the
5
+ full pipeline.
6
+
7
+ Usage:
8
+ export GEMINI_API_KEY="your_aistudio_key"
9
+ python test_gemini_video.py --video path/to/video.mp4
10
+ """
11
+
12
+ import os
13
+ import sys
14
+ import base64
15
+ import argparse
16
+ import cv2
17
+ import numpy as np
18
+ from pathlib import Path
19
+
20
+ import openai
21
+
22
+
23
+ FREE_TIER_MODEL = "gemini-2.0-flash"
24
+ NUM_FRAMES = 4 # keep low for free tier rate limits
25
+
26
+
27
+ def sample_frames(video_path: str, n: int = NUM_FRAMES):
28
+ """Sample n evenly-spaced frames from the video, return as base64 data URLs."""
29
+ cap = cv2.VideoCapture(video_path)
30
+ total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
31
+ n = max(2, min(n, total)) # need >= 2 samples to avoid divide-by-zero; cap at frame count
+ indices = [int(i * (total - 1) / (n - 1)) for i in range(n)]
32
+
33
+ data_urls = []
34
+ for idx in indices:
35
+ cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
36
+ ret, frame = cap.read()
37
+ if not ret:
38
+ continue
39
+ _, buf = cv2.imencode(".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, 80])
40
+ b64 = base64.b64encode(buf).decode("utf-8")
41
+ data_urls.append(f"data:image/jpeg;base64,{b64}")
42
+
43
+ cap.release()
44
+ return data_urls
45
+
46
+
47
+ def main():
48
+ parser = argparse.ArgumentParser(description="Gemini API smoke test with video frames")
49
+ parser.add_argument("--video", required=True, help="Path to a video file")
50
+ parser.add_argument("--model", default=FREE_TIER_MODEL, help="Gemini model to use")
51
+ parser.add_argument("--frames", type=int, default=NUM_FRAMES, help="Number of frames to sample")
52
+ args = parser.parse_args()
53
+
54
+ api_key = os.environ.get("GEMINI_API_KEY")
55
+ if not api_key:
56
+ print("ERROR: GEMINI_API_KEY environment variable not set")
57
+ sys.exit(1)
58
+
59
+ video_path = Path(args.video)
60
+ if not video_path.exists():
61
+ print(f"ERROR: Video not found: {video_path}")
62
+ sys.exit(1)
63
+
64
+ print(f"Video: {video_path.name}")
65
+ print(f"Model: {args.model}")
66
+ print(f"Frames: {args.frames}")
67
+ print()
68
+
69
+ print(f"Sampling {args.frames} frames...")
70
+ data_urls = sample_frames(str(video_path), args.frames)
71
+ print(f"Got {len(data_urls)} frames. Sending to Gemini...")
72
+
73
+ client = openai.OpenAI(
74
+ api_key=api_key,
75
+ base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
76
+ )
77
+
78
+ content = [
79
+ {"type": "image_url", "image_url": {"url": url}} for url in data_urls
80
+ ]
81
+ content.append({
82
+ "type": "text",
83
+ "text": "These are evenly-spaced frames from a short video. In one sentence, describe what is happening in the video."
84
+ })
85
+
86
+ response = client.chat.completions.create(
87
+ model=args.model,
88
+ messages=[{"role": "user", "content": content}],
89
+ )
90
+
91
+ print("\n--- Gemini response ---")
92
+ print(response.choices[0].message.content)
93
+ print("-----------------------")
94
+ print("\nβœ… API key works!")
95
+
96
+
97
+ if __name__ == "__main__":
98
+ main()
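
The smoke test above inlines each sampled frame as a base64 data URL. The encoding step can be exercised standalone (a minimal sketch, independent of OpenCV; `to_data_url` is an illustrative helper name, not part of the script):

```python
import base64


def to_data_url(jpeg_bytes: bytes) -> str:
    """Wrap raw JPEG bytes in a data URL accepted by OpenAI-style image_url content."""
    b64 = base64.b64encode(jpeg_bytes).decode("utf-8")
    return f"data:image/jpeg;base64,{b64}"


# The Gemini OpenAI-compat endpoint accepts these in `image_url` content parts.
print(to_data_url(b"abc"))  # data:image/jpeg;base64,YWJj
```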
app.py CHANGED
@@ -1,18 +1,546 @@
  import gradio as gr
  import spaces
 
  @spaces.GPU
- def dudu():
-     pass
 
- def greet(name, intensity):
-     return "Hello, " + name + "!" * int(intensity)
 
- demo = gr.Interface(
-     fn=greet,
-     inputs=["text", "slider"],
-     outputs=["text"],
-     api_name="predict"
- )
 
- demo.launch()
+ """
+ VOID VLM-Mask-Reasoner — Quadmask Generation Demo
+ Generates 4-level semantic masks for interaction-aware video inpainting.
+
+ Pipeline from https://github.com/Netflix/void-model:
+   Stage 1: SAM2 segmentation → black mask (transformers Sam2Model)
+   Stage 2: Gemini VLM scene analysis → affected objects JSON (repo code)
+   Stage 3: SAM3 text-prompted segmentation → grey mask (transformers Sam3Model)
+   Stage 4: Combine black + grey → quadmask (0/63/127/255) (repo code)
+ """
+
+ import os
+ import sys
+ import json
+ import tempfile
+ import shutil
+ import subprocess
+ from pathlib import Path
+
+ import cv2
+ import numpy as np
+ import torch
  import gradio as gr
  import spaces
+ import imageio
+ from PIL import Image, ImageDraw
+ from huggingface_hub import hf_hub_download
+ import openai
+
+ # ── Add repo modules to path ─────────────────────────────────────────────────
+ sys.path.insert(0, os.path.join(os.path.dirname(__file__), "VLM-MASK-REASONER"))
+
+ # ── Repo imports: Stage 2 (VLM) and Stage 4 (combine) ────────────────────────
+ from stage2_vlm_analysis import (
+     process_video as vlm_process_video,
+     calculate_square_grid,
+ )
+ from stage4_combine_masks import process_video as combine_process_video
+ # Stage 3 helpers (grid logic, mask combination — not the SegmentationModel)
+ from stage3a_generate_grey_masks_v2 import (
+     calculate_square_grid as calc_grid_3a,
+     gridify_masks,
+     filter_masks_by_proximity,
+     segment_object_all_frames as _repo_segment_all_frames,
+     process_video_grey_masks,
+ )
+
+ # ── Constants ─────────────────────────────────────────────────────────────────
+ SAM2_MODEL_ID = "facebook/sam2.1-hiera-large"
+ SAM3_MODEL_ID = "jetjodh/sam3"
+ DEFAULT_VLM_MODEL = "gemini-3-flash-preview"
+ MAX_FRAMES = 197
+ FPS_DEFAULT = 12
+ FRAME_STRIDE = 4  # Process every Nth frame for SAM2 tracking
+
+ # ── Load transformers SAM2 (video model with propagation support) ─────────────
+ print("Loading SAM2 video model (transformers)...")
+ from transformers import Sam2VideoModel, Sam2VideoProcessor
+ from transformers.models.sam2_video.modeling_sam2_video import Sam2VideoInferenceSession
+ sam2_model = Sam2VideoModel.from_pretrained(SAM2_MODEL_ID).to("cuda")
+ sam2_processor = Sam2VideoProcessor.from_pretrained(SAM2_MODEL_ID)
+ print("SAM2 video model ready.")
+
+ # ── Load transformers SAM3 ───────────────────────────────────────────────────
+ print("Loading SAM3 model (transformers)...")
+ from transformers import Sam3Model, Sam3Processor
+ sam3_model = Sam3Model.from_pretrained(SAM3_MODEL_ID).to("cuda")
+ sam3_processor = Sam3Processor.from_pretrained(SAM3_MODEL_ID)
+ print("SAM3 ready.")
+
+
+ # ══════════════════════════════════════════════════════════════════════════════
+ # STAGE 1: SAM2 VIDEO SEGMENTATION (transformers Sam2VideoModel)
+ # Uses proper video propagation with memory — matches repo's propagate_in_video
+ # ══════════════════════════════════════════════════════════════════════════════
+
+ def stage1_segment_video(frames: list, points: list, **kwargs) -> list:
+     """Segment the primary object across all video frames using SAM2 video propagation.
+     Matches repo: point prompts + bounding box on frame 0, propagate through video.
+     Returns a list of uint8 masks (0=object, 255=background)."""
+     total = len(frames)
+     h, w = frames[0].shape[:2]
+
+     # Preprocess all frames
+     pil_frames = [Image.fromarray(f) for f in frames]
+     inputs = sam2_processor(images=pil_frames, return_tensors="pt").to(sam2_model.device)
+
+     # Create inference session with all frames
+     session = Sam2VideoInferenceSession(
+         video=inputs["pixel_values"],
+         video_height=h,
+         video_width=w,
+         inference_device=sam2_model.device,
+         inference_state_device=sam2_model.device,
+         dtype=torch.float32,
+     )
+
+     # Add point prompts + bounding box on frame 0 via the processor
+     # (handles normalization, object registration, and obj_with_new_inputs)
+     pts = np.array(points, dtype=np.float32)
+     x_min, x_max = pts[:, 0].min(), pts[:, 0].max()
+     y_min, y_max = pts[:, 1].min(), pts[:, 1].max()
+     x_margin = max((x_max - x_min) * 0.1, 10)
+     y_margin = max((y_max - y_min) * 0.1, 10)
+     box = [
+         max(0, x_min - x_margin),
+         max(0, y_min - y_margin),
+         min(w, x_max + x_margin),
+         min(h, y_max + y_margin),
+     ]
+
+     sam2_processor.process_new_points_or_boxes_for_video_frame(
+         inference_session=session,
+         frame_idx=0,
+         obj_ids=[1],
+         input_points=[[[[float(p[0]), float(p[1])] for p in points]]],
+         input_labels=[[[1] * len(points)]],
+         input_boxes=[[[float(box[0]), float(box[1]), float(box[2]), float(box[3])]]],
+     )
+
+     # Run forward on the prompted frame first (populates cond_frame_outputs)
+     with torch.no_grad():
+         sam2_model(session, frame_idx=0)
+
+     # Propagate through all frames (matches repo's propagate_in_video)
+     video_segments = {}
+     with torch.no_grad():
+         for output in sam2_model.propagate_in_video_iterator(session):
+             frame_idx = output.frame_idx
+             # pred_masks shape varies — get the raw logits and resize to original
+             mask_logits = output.pred_masks[0].cpu().float()  # first object
+             # Ensure 4D for interpolation: (1, 1, H_model, W_model)
+             while mask_logits.dim() < 4:
+                 mask_logits = mask_logits.unsqueeze(0)
+             mask_resized = torch.nn.functional.interpolate(
+                 mask_logits, size=(h, w), mode="bilinear", align_corners=False
+             )
+             mask = (mask_resized.squeeze() > 0.0).numpy()
+             video_segments[frame_idx] = mask
+
+     # Convert to uint8 masks (0=object, 255=background)
+     all_masks = []
+     for idx in range(total):
+         if idx in video_segments:
+             mask_bool = video_segments[idx]
+         else:
+             nearest = min(video_segments.keys(), key=lambda k: abs(k - idx))
+             mask_bool = video_segments[nearest]
+         mask_uint8 = np.where(mask_bool, 0, 255).astype(np.uint8)
+         all_masks.append(mask_uint8)
+
+     return all_masks
+
+
+ def write_mask_video(masks: list, fps: float, output_path: str):
+     """Write a list of uint8 grayscale masks to a lossless MP4."""
+     h, w = masks[0].shape[:2]
+     temp_avi = str(Path(output_path).with_suffix('.avi'))
+     fourcc = cv2.VideoWriter_fourcc(*'FFV1')
+     out = cv2.VideoWriter(temp_avi, fourcc, fps, (w, h), isColor=False)
+     for mask in masks:
+         out.write(mask)
+     out.release()
+
+     cmd = [
+         'ffmpeg', '-y', '-i', temp_avi,
+         '-c:v', 'libx264', '-qp', '0', '-preset', 'ultrafast',
+         '-pix_fmt', 'yuv444p', str(output_path),
+     ]
+     subprocess.run(cmd, capture_output=True)
+     if os.path.exists(temp_avi):
+         os.unlink(temp_avi)
+
+
+ # ══════════════════════════════════════════════════════════════════════════════
+ # STAGE 3: SAM3 TEXT-PROMPTED SEGMENTATION (transformers)
+ # — Drop-in replacement for repo's SegmentationModel.segment()
+ # ══════════════════════════════════════════════════════════════════════════════
+
+ class TransformersSam3Segmenter:
+     """Matches the interface of the repo's SegmentationModel for stage3a."""
+     model_type = "sam3"
+
+     def segment(self, image_pil: Image.Image, prompt: str) -> np.ndarray:
+         """Segment object by text prompt. Returns boolean mask."""
+         h, w = image_pil.height, image_pil.width
+         union = np.zeros((h, w), dtype=bool)
+
+         try:
+             inputs = sam3_processor(
+                 images=image_pil, text=prompt, return_tensors="pt"
+             ).to(sam3_model.device)
+
+             with torch.no_grad():
+                 outputs = sam3_model(**inputs)
+
+             results = sam3_processor.post_process_instance_segmentation(
+                 outputs,
+                 threshold=0.3,
+                 mask_threshold=0.5,
+                 target_sizes=inputs.get("original_sizes").tolist(),
+             )[0]
+
+             masks = results.get("masks")
+             if masks is not None and len(masks) > 0:
+                 if torch.is_tensor(masks):
+                     masks = masks.cpu().numpy()
+                 if masks.ndim == 2:
+                     union = masks.astype(bool)
+                 elif masks.ndim == 3:
+                     union = masks.any(axis=0).astype(bool)
+                 elif masks.ndim == 4:
+                     union = masks.any(axis=(0, 1)).astype(bool)
+         except Exception as e:
+             print(f"  Warning: SAM3 segmentation failed for '{prompt}': {e}")
+
+         return union
+
+
+ seg_model = TransformersSam3Segmenter()
+
+
+ # ══════════════════════════════════════════════════════════════════════════════
+ # HELPERS
+ # ══════════════════════════════════════════════════════════════════════════════
+
+ def extract_frames(video_path: str, max_frames: int = MAX_FRAMES):
+     """Extract frames from video. Returns (frames_rgb_list, fps)."""
+     cap = cv2.VideoCapture(video_path)
+     fps = cap.get(cv2.CAP_PROP_FPS) or FPS_DEFAULT
+     frames = []
+     while len(frames) < max_frames:
+         ret, frame = cap.read()
+         if not ret:
+             break
+         frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
+     cap.release()
+     return frames, fps
+
+
+ def draw_points_on_image(image: np.ndarray, points: list, radius: int = 6) -> np.ndarray:
+     pil_img = Image.fromarray(image.copy())
+     draw = ImageDraw.Draw(pil_img)
+     for i, (x, y) in enumerate(points):
+         r = radius
+         draw.ellipse([x - r, y - r, x + r, y + r], fill="red", outline="white", width=2)
+         draw.text((x + r + 2, y - r), str(i + 1), fill="white")
+     return np.array(pil_img)
+
+
+ def frames_to_video(frames: list, fps: float) -> str:
+     tmp = tempfile.NamedTemporaryFile(suffix=".mp4", delete=False)
+     tmp_path = tmp.name
+     tmp.close()
+     writer = imageio.get_writer(tmp_path, fps=fps, codec='libx264',
+                                 output_params=['-crf', '18', '-pix_fmt', 'yuv420p'])
+     for frame in frames:
+         writer.append_data(frame)
+     writer.close()
+     return tmp_path
+
+
+ def create_quadmask_visualization(video_path: str, quadmask_path: str) -> str:
+     cap_vid = cv2.VideoCapture(video_path)
+     cap_qm = cv2.VideoCapture(quadmask_path)
+     fps = cap_vid.get(cv2.CAP_PROP_FPS) or FPS_DEFAULT
+
+     vis_frames = []
+     while True:
+         ret_v, frame = cap_vid.read()
+         ret_q, qm_frame = cap_qm.read()
+         if not ret_v or not ret_q:
+             break
+         frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
+         qm = cv2.cvtColor(qm_frame, cv2.COLOR_BGR2GRAY) if len(qm_frame.shape) == 3 else qm_frame
+
+         # Snap compression noise back to the 4 canonical levels
+         qm = np.where(qm <= 31, 0, qm)
+         qm = np.where((qm > 31) & (qm <= 95), 63, qm)
+         qm = np.where((qm > 95) & (qm <= 191), 127, qm)
+         qm = np.where(qm > 191, 255, qm)
+
+         overlay = frame_rgb.copy()
+         overlay[qm == 0] = [255, 50, 50]
+         overlay[qm == 63] = [255, 200, 0]
+         overlay[qm == 127] = [50, 255, 50]
+         result = cv2.addWeighted(frame_rgb, 0.5, overlay, 0.5, 0)
+         result[qm == 255] = frame_rgb[qm == 255]
+         vis_frames.append(result)
+
+     cap_vid.release()
+     cap_qm.release()
+     return frames_to_video(vis_frames, fps) if vis_frames else None
+
+
+ def create_quadmask_color_video(quadmask_path: str) -> str:
+     cap = cv2.VideoCapture(quadmask_path)
+     fps = cap.get(cv2.CAP_PROP_FPS) or FPS_DEFAULT
+     color_frames = []
+     while True:
+         ret, frame = cap.read()
+         if not ret:
+             break
+         qm = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY) if len(frame.shape) == 3 else frame
+         qm = np.where(qm <= 31, 0, qm)
+         qm = np.where((qm > 31) & (qm <= 95), 63, qm)
+         qm = np.where((qm > 95) & (qm <= 191), 127, qm)
+         qm = np.where(qm > 191, 255, qm)
+         h, w = qm.shape
+         color = np.full((h, w, 3), 255, dtype=np.uint8)
+         color[qm == 0] = [0, 0, 0]
+         color[qm == 63] = [80, 80, 80]
+         color[qm == 127] = [160, 160, 160]
+         color_frames.append(color)
+     cap.release()
+     return frames_to_video(color_frames, fps) if color_frames else None
+
+
+ # ══════════════════════════════════════════════════════════════════════════════
+ # MAIN PIPELINE
+ # ══════════════════════════════════════════════════════════════════════════════
 
  @spaces.GPU
+ def run_pipeline(video_path: str, points_json: str, instruction: str,
+                  progress=gr.Progress(track_tqdm=False)):
+     """Run the full VLM-Mask-Reasoner pipeline."""
+     if not video_path:
+         raise gr.Error("Please upload a video.")
+     if not points_json or points_json == "[]":
+         raise gr.Error("Please click on the image to select at least one point on the primary object.")
+     if not instruction.strip():
+         raise gr.Error("Please enter an edit instruction.")
 
+     points = json.loads(points_json)
+     if len(points) == 0:
+         raise gr.Error("Please select at least one point on the primary object.")
 
+     api_key = os.environ.get("GEMINI_API_KEY", "")
+
+     # Create temp output directory
+     output_dir = Path(tempfile.mkdtemp(prefix="void_quadmask_"))
+     input_video_path = output_dir / "input_video.mp4"
+     shutil.copy2(video_path, input_video_path)
+
+     # ── Stage 1: SAM2 Segmentation ──────────────────────────────────────────
+     progress(0.05, desc="Stage 1: SAM2 segmentation...")
+     frames, fps = extract_frames(str(input_video_path))
+     if len(frames) < 2:
+         raise gr.Error("Video must have at least 2 frames.")
+
+     black_masks = stage1_segment_video(frames, points, stride=FRAME_STRIDE)
+     black_mask_path = output_dir / "black_mask.mp4"
+     write_mask_video(black_masks, fps, str(black_mask_path))
+
+     # Save first frame for VLM analysis
+     first_frame_path = output_dir / "first_frame.jpg"
+     cv2.imwrite(str(first_frame_path), cv2.cvtColor(frames[0], cv2.COLOR_RGB2BGR))
+
+     # Save segmentation metadata (Stage 2 expects this)
+     seg_info = {
+         "total_frames": len(frames),
+         "frame_width": frames[0].shape[1],
+         "frame_height": frames[0].shape[0],
+         "fps": fps,
+         "video_path": str(input_video_path),
+         "instruction": instruction,
+         "primary_points_by_frame": {"0": points},
+         "first_appears_frame": 0,
+     }
+     with open(output_dir / "segmentation_info.json", 'w') as f:
+         json.dump(seg_info, f, indent=2)
+
+     progress(0.3, desc="Stage 1 complete.")
+
+     # ── Stage 2: VLM Analysis (repo code) ───────────────────────────────────
+     analysis = None
+     if api_key:
+         progress(0.35, desc="Stage 2: VLM analysis (calling Gemini)...")
+         try:
+             video_info = {
+                 "video_path": str(input_video_path),
+                 "instruction": instruction,
+                 "output_dir": str(output_dir),
+                 "multi_frame_grids": True,
+             }
+             client = openai.OpenAI(
+                 api_key=api_key,
+                 base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
+             )
+             analysis = vlm_process_video(video_info, client, DEFAULT_VLM_MODEL)
+             progress(0.55, desc="Stage 2 complete.")
+         except Exception as e:
+             gr.Warning(f"VLM analysis failed: {e}. Generating binary mask only.")
+             analysis = None
+     else:
+         gr.Warning("No GEMINI_API_KEY set. Generating binary mask only (no VLM analysis).")
+
+     # ── Stage 3: Grey Mask Generation (repo logic + transformers SAM3) ──────
+     grey_mask_path = output_dir / "grey_mask.mp4"
+     vlm_analysis_path = output_dir / "vlm_analysis.json"
+
+     if analysis and vlm_analysis_path.exists():
+         progress(0.6, desc="Stage 3: Generating grey masks (SAM3 segmentation)...")
+         try:
+             video_info_3 = {
+                 "video_path": str(input_video_path),
+                 "output_dir": str(output_dir),
+                 "min_grid": 8,
+             }
+             # Uses the repo's process_video_grey_masks with our TransformersSam3Segmenter
+             process_video_grey_masks(video_info_3, seg_model)
+             progress(0.8, desc="Stage 3 complete.")
+         except Exception as e:
+             gr.Warning(f"Stage 3 failed: {e}. Generating binary mask only.")
+
+     # ── Stage 4: Combine into Quadmask (repo code) ──────────────────────────
+     quadmask_path = output_dir / "quadmask_0.mp4"
+     if grey_mask_path.exists():
+         progress(0.85, desc="Stage 4: Combining into quadmask...")
+         combine_process_video(black_mask_path, grey_mask_path, quadmask_path)
+     else:
+         shutil.copy2(black_mask_path, quadmask_path)
+
+     progress(0.9, desc="Creating visualizations...")
+
+     # ── Visualization outputs ───────────────────────────────────────────────
+     overlay_path = create_quadmask_visualization(str(input_video_path), str(quadmask_path))
+     color_path = create_quadmask_color_video(str(quadmask_path))
+
+     analysis_text = ""
+     if vlm_analysis_path.exists():
+         with open(vlm_analysis_path) as f:
+             analysis_text = f.read()
+     else:
+         analysis_text = "No VLM analysis available."
+
+     progress(1.0, desc="Done!")
+     return str(quadmask_path), overlay_path, color_path, analysis_text
+
+
+ # ══════════════════════════════════════════════════════════════════════════════
+ # GRADIO UI
+ # ══════════════════════════════════════════════════════════════════════════════
+
+ def on_video_upload(video_path):
+     if not video_path:
+         return None, None, "[]", gr.update(interactive=False)
+     frames, _ = extract_frames(video_path, max_frames=1)
+     if not frames:
+         return None, None, "[]", gr.update(interactive=False)
+     return frames[0], frames[0], "[]", gr.update(interactive=True)
+
+
+ def on_frame_select(clean_frame, points_json, evt: gr.SelectData):
+     if clean_frame is None:
+         return None, points_json
+     points = json.loads(points_json) if points_json else []
+     x, y = evt.index
+     points.append([int(x), int(y)])
+     annotated = draw_points_on_image(clean_frame, points)
+     return annotated, json.dumps(points)
+
+
+ def on_clear_points(clean_frame):
+     if clean_frame is not None:
+         return clean_frame, "[]"
+     return None, "[]"
+
+
+ DESCRIPTION = """
+ # VOID VLM-Mask-Reasoner — Quadmask Generation
+
+ Generate **4-level semantic masks** (quadmasks) for interaction-aware video inpainting with [VOID](https://github.com/Netflix/void-model).
+
+ **Pipeline:** Click points on object → SAM2 segments it → Gemini VLM reasons about interactions → SAM3 segments affected objects → Quadmask generated
+
+ Use the generated quadmask with the [VOID inpainting demo](https://huggingface.co/spaces/sam-motamed/VOID).
+ """
+
+ QUADMASK_EXPLAINER = """
+ ### Quadmask format
+
+ | Pixel Value | Color | Meaning |
+ |-------------|-------|---------|
+ | **0** (black) | Red overlay | Primary object to remove |
+ | **63** (dark grey) | Yellow overlay | Overlap of primary + affected zone |
+ | **127** (mid grey) | Green overlay | Affected region (shadows, reflections, physics) |
+ | **255** (white) | Original | Background — keep as-is |
+ """
+
+ with gr.Blocks(title="VOID VLM-Mask-Reasoner", theme=gr.themes.Default()) as demo:
+     gr.Markdown(DESCRIPTION)
+
+     points_state = gr.State("[]")
+     clean_frame_state = gr.State(None)
+
+     with gr.Row():
+         with gr.Column(scale=1):
+             video_input = gr.Video(label="Upload Video", sources=["upload"])
+             frame_display = gr.Image(
+                 label="Click to select primary object points (click multiple spots on the object)",
+                 interactive=True, type="numpy",
+             )
+             with gr.Row():
+                 clear_btn = gr.Button("Clear Points", size="sm")
+                 points_display = gr.Textbox(label="Selected Points", value="[]",
+                                             interactive=False, max_lines=2)
+             instruction_input = gr.Textbox(
+                 label="Edit instruction — describe what to remove",
+                 placeholder="e.g., remove the person", lines=1,
+             )
+             generate_btn = gr.Button("Generate Quadmask", variant="primary", size="lg",
+                                      interactive=False)
+
+         with gr.Column(scale=1):
+             output_quadmask_file = gr.File(label="Download lossless quadmask_0.mp4 (use this with VOID)")
+             with gr.Tabs():
+                 with gr.TabItem("Quadmask Overlay"):
+                     output_overlay = gr.Video(label="Quadmask overlay on original video")
+                 with gr.TabItem("Raw Quadmask"):
+                     output_color = gr.Video(label="Color-coded quadmask")
+                 with gr.TabItem("VLM Analysis"):
+                     output_analysis = gr.Code(label="VLM Analysis JSON", language="json")
+
+     video_input.change(
+         fn=on_video_upload, inputs=[video_input],
+         outputs=[frame_display, clean_frame_state, points_state, generate_btn],
+     )
+     points_state.change(lambda p: p, inputs=points_state, outputs=points_display)
+     frame_display.select(
+         fn=on_frame_select, inputs=[clean_frame_state, points_state],
+         outputs=[frame_display, points_state],
+     )
+     clear_btn.click(
+         fn=on_clear_points, inputs=[clean_frame_state],
+         outputs=[frame_display, points_state],
+     )
+     generate_btn.click(
+         fn=run_pipeline, inputs=[video_input, points_state, instruction_input],
+         outputs=[output_quadmask_file, output_overlay, output_color, output_analysis],
+     )
+
+     gr.Markdown(QUADMASK_EXPLAINER)
 
+ if __name__ == "__main__":
+     demo.launch()
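
The visualization helpers above snap compressed pixel values back to the four canonical quadmask levels before coloring them. That bucketing can be sketched standalone with NumPy (thresholds copied from the app code; `quantize_quadmask` is an illustrative name, not a function in the repo):

```python
import numpy as np


def quantize_quadmask(qm: np.ndarray) -> np.ndarray:
    """Snap near-lossless grayscale values to the 4 quadmask levels 0/63/127/255."""
    qm = np.where(qm <= 31, 0, qm)
    qm = np.where((qm > 31) & (qm <= 95), 63, qm)
    qm = np.where((qm > 95) & (qm <= 191), 127, qm)
    qm = np.where(qm > 191, 255, qm)
    return qm.astype(np.uint8)


# Values near each level collapse onto it: 0, 63, 127, 255
print(quantize_quadmask(np.array([3, 60, 130, 250])))
```

Since `quadmask_0.mp4` is written losslessly (`-qp 0`, `yuv444p`), this step is a safety net rather than a necessity, but it keeps the overlays correct even for re-encoded masks.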
requirements.txt ADDED
@@ -0,0 +1,13 @@
+ torchvision==0.24
+ transformers>=4.50.0
+ accelerate
+ gradio
+ numpy<2.0
+ opencv-python-headless
+ Pillow
+ imageio
+ imageio-ffmpeg
+ openai
+ huggingface_hub
+ spaces
+ tqdm