Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning

This repository contains the model weights for CRPO, a dual-branch reinforcement learning framework designed to improve the spatiotemporal sensitivity of Video LLMs.

Paper: Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning
Project Page: https://ddz16.github.io/crpo.github.io/
Code: https://github.com/ddz16/CRPO

Introduction

Video Large Language Models (Video LLMs) often rely on shortcuts, such as single-frame cues and language priors, instead of tracking spatiotemporal dynamics. Counterfactual Relational Policy Optimization (CRPO) addresses this with a dual-branch reinforcement learning (RL) framework.

CRPO constructs counterfactual videos (e.g., through horizontal flips and temporal reversals) and introduces a Counterfactual Relation Reward (CRR). This reward encourages the model's answers to change for dynamic questions when the visual world changes, and to remain unchanged for static questions, making it difficult for shortcut-based policies to be consistently rewarded.
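
To make the idea concrete, here is a minimal sketch of the two counterfactual transforms named above and a simplified relation-style reward. It is an illustration under stated assumptions, not the CRPO implementation: the function names (`horizontal_flip`, `temporal_reverse`, `relation_reward`), the `(T, H, W, C)` array layout, the binary 0/1 reward, and the string comparison of answers are all hypothetical. The actual CRR is defined in the paper and the code repository and may differ.

```python
import numpy as np


def horizontal_flip(video: np.ndarray) -> np.ndarray:
    """Mirror every frame left-to-right. Assumes shape (T, H, W, C)."""
    return video[:, :, ::-1, :]


def temporal_reverse(video: np.ndarray) -> np.ndarray:
    """Play the frames in reverse order. Assumes time is axis 0."""
    return video[::-1]


def relation_reward(answer_orig: str, answer_cf: str, is_dynamic: bool) -> float:
    """Hypothetical, simplified relation reward (an assumption, not the paper's CRR).

    Dynamic question: reward 1.0 only if the answer changes on the counterfactual video.
    Static question:  reward 1.0 only if the answer stays the same.
    """
    changed = answer_orig.strip().lower() != answer_cf.strip().lower()
    return 1.0 if changed == is_dynamic else 0.0


if __name__ == "__main__":
    # Tiny dummy clip: 4 frames of 2x3 RGB pixels.
    clip = np.arange(4 * 2 * 3 * 3).reshape(4, 2, 3, 3)
    print(horizontal_flip(clip).shape, temporal_reverse(clip).shape)

    # A shortcut policy that ignores the video gives the same answer on both
    # branches, so it is rewarded on static questions but not on dynamic ones.
    print(relation_reward("left", "left", is_dynamic=True))
    print(relation_reward("red", "red", is_dynamic=False))
```

Under this simplified reward, a policy that answers from language priors or a single frame cannot score on dynamic questions, which is the intuition behind making shortcut-based policies hard to reward consistently.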

Evaluation

The model was evaluated using DyBench, a paired counterfactual video benchmark with over 3,000 videos covering:

Reversible dynamics
Moving directions
Event sequences

Experiments show that CRPO significantly outperforms prior RL methods on spatiotemporal-sensitive evaluations while maintaining competitive general video performance. On Qwen3-VL-8B, CRPO improves DyBench pair-accuracy, indicating improved sensitivity to video dynamics rather than reliance on static shortcuts.
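
For readers unfamiliar with paired metrics, the sketch below shows one plausible way to compute a pair-accuracy score. It assumes that a pair counts as correct only when both the original video and its counterfactual counterpart are answered correctly; that definition and the `pair_accuracy` helper are assumptions for illustration, and the exact DyBench metric is specified in the paper.

```python
from typing import Sequence, Tuple


def pair_accuracy(results: Sequence[Tuple[bool, bool]]) -> float:
    """Fraction of pairs where BOTH answers are correct.

    Each item is (correct_on_original, correct_on_counterfactual).
    This strict both-correct definition is an assumption, not taken from DyBench.
    """
    if not results:
        raise ValueError("results must contain at least one pair")
    both_correct = sum(1 for orig_ok, cf_ok in results if orig_ok and cf_ok)
    return both_correct / len(results)


if __name__ == "__main__":
    # Hypothetical outcomes for four video pairs (not real benchmark data).
    outcomes = [(True, True), (True, False), (False, False), (True, True)]
    print(pair_accuracy(outcomes))
```

A metric of this shape penalizes a model that gives the same answer to a video and its flipped or reversed version, since at most one of the two answers can be right for a dynamic question.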

Safetensors
Model size: 570k params
Tensor type: BF16

Paper for ddz16/Qwen3-VL-4B-CRPO

Paper • 2605.21988 • Published 7 days ago