---
license: apache-2.0
library_name: transformers
pipeline_tag: video-text-to-text
base_model: Qwen/Qwen3-8B
language:
- en
tags:
- video-understanding
- long-video-understanding
- agentic-llm
- video-question-answering
- vision-language-model
- grpo
- reinforcement-learning
- icml-2026
---

<h2 align="center">🎬 VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding by Decoupling Answer Authority</h2>

<p align="center">
  <a href="https://github.com/Echochef/VideoSEAL"><img alt="Code" src="https://img.shields.io/badge/Code-GitHub-black?logo=github"></a>
  <a href="https://huggingface.co/CewEhao/VideoSEAL_8B"><img alt="HF Model" src="https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-VideoSEAL__8B-yellow"></a>
  <img alt="ICML 2026" src="https://img.shields.io/badge/ICML-2026-blue">
</p>

<p align="center">
  πŸ€— HuggingFace model:
  <a href="https://huggingface.co/CewEhao/VideoSEAL_8B">CewEhao/VideoSEAL_8B</a>
  &nbsp;Β·&nbsp;
  πŸ’» Code:
  <a href="https://github.com/Echochef/VideoSEAL">Echochef/VideoSEAL</a>
</p>

## πŸ‘‰ Introduction

This is the official model card for **VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding by Decoupling Answer Authority** (ICML 2026).

VideoSEAL provides offline build utilities for long video indexing:

- OCR subtitles (SRT) β†’ OCR captions + (optional) embeddings
- Clip captions (VLM) β†’ clip captions + (optional) embeddings
- Merge into a unified semantic index under `indexes/semantic/<video_id>/`
- (Optional) generate a global `full_story.txt` summary
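To make the first step concrete, here is a minimal sketch of parsing SRT subtitle text into timed caption records. This is an illustrative helper, not the actual API of the `videoseal/` package:

```python
import re

def parse_srt(srt_text: str) -> list[dict]:
    """Parse SRT subtitle text into (start, end, text) caption records.

    Illustrative sketch of the OCR-subtitle ingestion step; the real
    pipeline in `videoseal/` may structure this differently.
    """
    blocks = re.split(r"\n\s*\n", srt_text.strip())
    captions = []
    for block in blocks:
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue  # skip malformed cues
        # lines[0] is the cue index, lines[1] the "start --> end" timecode
        start, end = (t.strip() for t in lines[1].split("-->"))
        captions.append({"start": start, "end": end, "text": " ".join(lines[2:])})
    return captions

sample = """1
00:00:01,000 --> 00:00:03,500
Hello, world.

2
00:00:04,000 --> 00:00:06,000
Second caption
spanning two lines."""

records = parse_srt(sample)  # two caption records with start/end timecodes
```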

## πŸ“¦ Layout

- 🧰 Shell entrypoints: `scripts/`
- 🐍 Python package: `videoseal/`
- βœ… Tests: `test/`
- 🧩 OCR toolchain (vendored): `third_party/video-subtitle-extractor/`

## βš™οΈ Configuration

- Defaults live in the scripts under `scripts/`.
- Set real API keys and endpoints via your shell environment or job launcher; keep them out of the scripts themselves.
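One common pattern is to read the keys from the environment at startup and fail loudly if any is missing. A minimal sketch (the variable names match the build script's environment variables; the loader itself is illustrative, not part of the package):

```python
import os

# API keys the offline build expects in the environment.
REQUIRED_KEYS = [
    "MLLM_API_KEY",
    "EMBEDDING_API_KEY",
    "AGENT_LLM_API_KEY",
    "VISUAL_INSPECT_API_KEY",
]

def load_api_keys(env=os.environ) -> dict:
    """Collect required keys, raising early if any is unset."""
    missing = [k for k in REQUIRED_KEYS if not env.get(k)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {k: env[k] for k in REQUIRED_KEYS}
```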

## πŸ—οΈ Run offline build

```bash
cd /path/to/VideoSEAL

export MLLM_API_KEY="sk_your_api_key"
export EMBEDDING_API_KEY="sk_your_api_key"
export AGENT_LLM_API_KEY="sk_your_api_key"
export VISUAL_INSPECT_API_KEY="sk_your_api_key"
VIDEO=/path/to/video.mp4 BENCHMARK=LVBench ./scripts/run_offline_build.sh
```
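Once built, the semantic index can back embedding retrieval over captions. A toy sketch of cosine-similarity lookup (the index records and field names here are assumptions for illustration, not the package's actual schema):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, index, k=2):
    """Rank indexed captions by similarity to the query embedding."""
    scored = [(cosine(query_vec, e["embedding"]), e["caption"]) for e in index]
    return [c for _, c in sorted(scored, key=lambda s: -s[0])[:k]]

# Toy index: in practice the vectors come from the embedding API.
index = [
    {"caption": "a man opens a door", "embedding": [1.0, 0.0, 0.0]},
    {"caption": "a dog runs in a park", "embedding": [0.0, 1.0, 0.0]},
    {"caption": "a man closes a door", "embedding": [0.9, 0.1, 0.0]},
]

hits = top_k([1.0, 0.05, 0.0], index, k=2)  # the two door captions rank first
```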

## βœ… Run tests

```bash
python -m unittest discover -s test -v  # run from the repo root, inside your environment
```

## πŸ‹οΈ GRPO training (video tool workflow)

This repo vendors minimal copies of the `rllm/` and `verl/` Python packages (under the repo root) so the video tool-agent GRPO workflow is runnable without an extra repository checkout.

### πŸ§ͺ Training environment (conda)

```bash
conda create -n videoseal python=3.12 -y
conda activate videoseal

pip install vllm==0.11.0

cd rllm
pip install -e .

cd ../verl
pip install -e .
```

### πŸš€ Launcher

- `scripts/train/run_video_workflow_grpo.sh`

### 🧩 Example

```bash
cd /path/to/VideoSEAL

# Export real API keys/endpoints in your environment before launching.

TRAIN_PARQUET='["/path/to/train.parquet"]' \
VAL_PARQUET='/path/to/val.parquet' \
MODEL_PATH='Qwen/Qwen3-8B' \
./scripts/train/run_video_workflow_grpo.sh train
```
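GRPO's core step is group-relative advantage estimation: each rollout's reward is normalized against the mean and standard deviation of the other rollouts for the same prompt. A minimal sketch of that computation (illustrative only; the vendored `verl` trainer implements the full algorithm):

```python
import statistics

def grpo_advantages(group_rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# Four rollouts for one prompt: two correct (reward 1.0), two incorrect (0.0).
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
# Correct rollouts get positive advantages, incorrect ones negative,
# and the advantages sum to zero within the group.
```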

### πŸ”Ž Quick checks

```bash
./scripts/train/run_video_workflow_grpo.sh test-reward
pytest -q tests/rewards/test_video_reward_tool_env_integration.py
```