---
base_model: Qwen/Qwen3-8B
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: video-text-to-text
tags:
- video-understanding
- long-video-understanding
- agentic-llm
- video-question-answering
- vision-language-model
- grpo
- reinforcement-learning
- icml-2026
---

<h2 align="center">🎬 VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding by Decoupling Answer Authority</h2>

<p align="center">
  <a href="https://huggingface.co/papers/2605.12571"><img alt="Paper" src="https://img.shields.io/badge/Paper-HF--Paper-red"></a>
  <a href="https://github.com/Echochef/VideoSEAL"><img alt="Code" src="https://img.shields.io/badge/Code-GitHub-black?logo=github"></a>
  <a href="https://huggingface.co/CewEhao/VideoSEAL_8B"><img alt="HF Model" src="https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-VideoSEAL__8B-yellow"></a>
  <img alt="ICML 2026" src="https://img.shields.io/badge/ICML-2026-blue">
</p>

<p align="center">
  πŸ€— HuggingFace model:
  <a href="https://huggingface.co/CewEhao/VideoSEAL_8B">CewEhao/VideoSEAL_8B</a>
  &nbsp;Β·&nbsp;
  πŸ’» Code:
  <a href="https://github.com/Echochef/VideoSEAL">Echochef/VideoSEAL</a>
  &nbsp;Β·&nbsp;
  πŸ“„ Paper:
  <a href="https://huggingface.co/papers/2605.12571">2605.12571</a>
</p>

## 👉 Introduction

This is the official model card for **VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding by Decoupling Answer Authority** (ICML 2026).

VideoSEAL is an agentic framework for long-video question answering. It separates the *planner* role (deciding which evidence to gather) from the *answerer* role (judging that evidence), mitigating *evidence misalignment*: cases where a model produces a correct-looking answer that is not actually supported by the evidence it retrieved.
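
The decoupling can be sketched in a few lines of Python (all names and data structures below are illustrative, not VideoSEAL's actual API): the planner only decides which clips to inspect, while the answerer alone turns the gathered evidence into a final answer.

```python
# Illustrative sketch of decoupled answer authority; names and structures
# here are hypothetical, not the repo's API.
from dataclasses import dataclass

@dataclass
class Evidence:
    clip_id: str
    caption: str

def planner(question: str, index: dict) -> list:
    """Decides WHICH evidence to gather; holds no answer authority."""
    keywords = question.lower().split()
    return [
        Evidence(cid, cap)
        for cid, cap in index.items()
        if any(k in cap.lower() for k in keywords)
    ]

def answerer(question: str, evidence: list) -> str:
    """Judges the retrieved evidence alone; never sees the planner's state."""
    if not evidence:
        return "insufficient evidence"
    return evidence[0].caption  # the answer must come from evidence only

index = {"clip_001": "A chef plates the dessert", "clip_002": "Credits roll"}
print(answerer("dessert", planner("dessert", index)))
```

Because the answerer only consumes `Evidence` objects, it cannot emit an answer that bypasses the retrieved evidence, which is the misalignment described above.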

VideoSEAL provides offline build utilities for long video indexing:

- OCR subtitles (SRT) → OCR captions + (optional) embeddings
- Clip captions (VLM) → clip captions + (optional) embeddings
- Merge into a unified semantic index under `indexes/semantic/<video_id>/`
- (Optional) generate a global `full_story.txt` summary
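
As a rough sketch of the resulting on-disk layout (the file names below are assumptions for illustration; the build scripts under `scripts/` define the authoritative layout):

```python
# Sketch of an assumed layout under indexes/semantic/<video_id>/.
# File names are illustrative guesses, not the build scripts' actual output.
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp()) / "indexes" / "semantic" / "demo_video"
root.mkdir(parents=True)

(root / "ocr_captions.jsonl").write_text('{"t": 12.0, "text": "a subtitle line"}\n')
(root / "clip_captions.jsonl").write_text('{"clip": 0, "caption": "a clip caption"}\n')
(root / "full_story.txt").write_text("Optional global summary of the video.\n")

print(sorted(p.name for p in root.iterdir()))
```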

## 📦 Layout

- 🧰 Shell entrypoints: `scripts/`
- 🐍 Python package: `videoseal/`
- ✅ Tests: `test/`
- 🧩 OCR toolchain (vendored): `third_party/video-subtitle-extractor/`

## ⚙️ Configuration

- Defaults live in the scripts under `scripts/`.
- Put real API keys and endpoints in your shell environment or job launcher, not in the scripts themselves.
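
Before launching a build, it can help to verify that every key is set. A minimal sketch (the key names are taken from the run example in this README; `missing_keys` is a hypothetical helper, not part of the package):

```python
# Pre-flight check for the API keys the offline build expects.
import os

REQUIRED_KEYS = [
    "MLLM_API_KEY",
    "EMBEDDING_API_KEY",
    "AGENT_LLM_API_KEY",
    "VISUAL_INSPECT_API_KEY",
]

def missing_keys(env) -> list:
    """Return the required keys that are unset or empty in `env`."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]

# Check the real environment with missing_keys(os.environ); for example:
print(missing_keys({"MLLM_API_KEY": "sk_x", "EMBEDDING_API_KEY": "sk_y"}))
# → ['AGENT_LLM_API_KEY', 'VISUAL_INSPECT_API_KEY']
```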

## πŸ—οΈ Run offline build

```bash
cd /path/to/VideoSEAL

export MLLM_API_KEY="sk_your_api_key"
export EMBEDDING_API_KEY="sk_your_api_key"
export AGENT_LLM_API_KEY="sk_your_api_key"
export VISUAL_INSPECT_API_KEY="sk_your_api_key"
VIDEO=/path/to/video.mp4 BENCHMARK=LVBench ./scripts/run_offline_build.sh
```

## ✅ Run tests

```bash
# Run from the repo root with the Python interpreter of your environment.
python -m unittest discover -s test -v
```

## πŸ‹οΈ GRPO training (video tool workflow)

This repo vendors minimal copies of the `rllm/` and `verl/` Python packages (under the repo root)
so that the video tool-agent GRPO workflow is runnable without an extra repository checkout.

### 🧪 Training environment (conda)

```bash
conda create -n videoseal python=3.12 -y
conda activate videoseal

pip install vllm==0.11.0

cd rllm
pip install -e .

cd ../verl
pip install -e .
```

### 🚀 Launcher

- `scripts/train/run_video_workflow_grpo.sh`

### 🧩 Example

```bash
cd /path/to/VideoSEAL

# Export real API keys/endpoints in your environment before launching.

TRAIN_PARQUET='["/path/to/train.parquet"]' \
VAL_PARQUET='/path/to/val.parquet' \
MODEL_PATH='Qwen/Qwen3-8B' \
./scripts/train/run_video_workflow_grpo.sh train
```
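
Note that `TRAIN_PARQUET` is passed as a JSON list while `VAL_PARQUET` is a bare path. A consumer of these variables might normalize both forms like this (a hypothetical helper for illustration, not the launcher's actual parsing code):

```python
# Normalize a parquet argument that may be a JSON list or a single path,
# mirroring the two forms used in the launcher example above.
import json

def parse_parquet_arg(value: str) -> list:
    """Return a list of paths from either a JSON list or a bare path."""
    value = value.strip()
    if value.startswith("["):
        return json.loads(value)
    return [value]

print(parse_parquet_arg('["/path/to/train.parquet"]'))  # → ['/path/to/train.parquet']
print(parse_parquet_arg("/path/to/val.parquet"))        # → ['/path/to/val.parquet']
```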

### 🔎 Quick checks

```bash
./scripts/train/run_video_workflow_grpo.sh test-reward
pytest -q tests/rewards/test_video_reward_tool_env_integration.py
```

## 📜 Citation

```bibtex
@inproceedings{videoseal2026,
  title={VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding by Decoupling Answer Authority},
  author={Dongyang Liu and others},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2026},
  url={https://huggingface.co/papers/2605.12571}
}
```