Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning

This repository contains the model weights for CRPO, a dual-branch reinforcement learning framework designed to improve the spatiotemporal sensitivity of Video LLMs.

Introduction

Video Large Language Models (Video LLMs) often rely on shortcuts, such as single-frame cues and language priors, rather than tracking spatiotemporal dynamics. Counterfactual Relational Policy Optimization (CRPO) addresses this with a dual-branch RL framework.

CRPO constructs counterfactual videos (e.g., through horizontal flips and temporal reversals) and introduces a Counterfactual Relation Reward (CRR). This reward encourages the model's answers to change for dynamic questions when the visual world changes, and to remain unchanged for static questions, making it difficult for shortcut-based policies to be consistently rewarded.
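
The two counterfactual transforms and the idea behind the reward can be illustrated with a small sketch. This is not the released training code: the toy list-based video format, the function names, and the binary 0/1 reward are all assumptions made for illustration, and the actual CRR formulation may differ.

```python
from typing import List

Frame = List[List[int]]  # toy grayscale frame: rows of pixel values
Video = List[Frame]      # frames in temporal order


def temporal_reverse(video: Video) -> Video:
    """Return the frames in the opposite temporal order."""
    return video[::-1]


def horizontal_flip(video: Video) -> Video:
    """Mirror every frame left-to-right."""
    return [[row[::-1] for row in frame] for frame in video]


def relation_reward(question_type: str,
                    answer_original: str,
                    answer_counterfactual: str) -> float:
    """Hypothetical binary stand-in for the Counterfactual Relation Reward.

    Dynamic questions are rewarded when the answer changes between the
    original and the counterfactual video; static questions are rewarded
    when the answer stays the same.
    """
    changed = (answer_original.strip().lower()
               != answer_counterfactual.strip().lower())
    if question_type == "dynamic":
        return 1.0 if changed else 0.0
    if question_type == "static":
        return 0.0 if changed else 1.0
    raise ValueError(f"unknown question type: {question_type!r}")


# A one-row, three-pixel video of a dot moving left to right.
video = [[[1, 0, 0]], [[0, 1, 0]], [[0, 0, 1]]]
reversed_video = temporal_reverse(video)   # dot now moves right to left
flipped_video = horizontal_flip(video)     # dot also moves right to left

# A policy that ignores the video gives the same answer on both branches,
# so it is rewarded on the static question but not on the dynamic one.
shortcut_dynamic = relation_reward("dynamic", "left to right", "left to right")
shortcut_static = relation_reward("static", "a red ball", "a red ball")
```

In this sketch, a policy that answers from a single frame or from language priors alone earns the reward only on static questions, which is the behavior the text above describes.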

Evaluation

The model was evaluated using DyBench, a paired counterfactual video benchmark with over 3,000 videos covering:

  • Reversible dynamics
  • Moving directions
  • Event sequences

Experiments show that CRPO significantly outperforms prior RL methods on spatiotemporal-sensitive evaluations while maintaining competitive general video performance. On Qwen3-VL-8B, CRPO improves DyBench pair-accuracy, indicating improved sensitivity to video dynamics rather than reliance on static shortcuts.
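
To show why a paired metric is harder to game than per-video accuracy, here is a minimal sketch. It assumes that pair-accuracy counts a pair as correct only when both the original and the counterfactual video are answered correctly; this definition and the function names are assumptions for illustration, not taken from the DyBench evaluation code.

```python
from typing import List, Tuple

# Each entry is (original_correct, counterfactual_correct) for one video pair.
PairResult = Tuple[bool, bool]


def per_video_accuracy(results: List[PairResult]) -> float:
    """Fraction of individual videos answered correctly."""
    if not results:
        return 0.0
    correct = sum(int(orig) + int(cf) for orig, cf in results)
    return correct / (2 * len(results))


def pair_accuracy(results: List[PairResult]) -> float:
    """Fraction of pairs with both videos answered correctly (assumed definition)."""
    if not results:
        return 0.0
    return sum(1 for orig, cf in results if orig and cf) / len(results)


# A shortcut policy that gives the same answer to both videos of a pair
# gets exactly one video per pair right when the correct answers differ.
shortcut_results = [(True, False), (False, True), (True, False), (True, False)]
shortcut_video_acc = per_video_accuracy(shortcut_results)
shortcut_pair_acc = pair_accuracy(shortcut_results)

# A hypothetical mixed outcome.
mixed_results = [(True, False), (True, False), (False, True), (True, True)]
mixed_video_acc = per_video_accuracy(mixed_results)
mixed_pair_acc = pair_accuracy(mixed_results)
```

Under this assumed definition, the shortcut policy reaches 50% per-video accuracy but 0% pair-accuracy, so gains on the paired metric are difficult to obtain without responding to the change in the video.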

Paper for ddz16/Qwen3-VL-4B-CRPO