---
base_model: Qwen/Qwen3-8B
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: video-text-to-text
tags:
- video-understanding
- long-video-understanding
- agentic-llm
- video-question-answering
- vision-language-model
- grpo
- reinforcement-learning
- icml-2026
---
<h2 align="center">🎬 VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding by Decoupling Answer Authority</h2>
<p align="center">
<a href="https://huggingface.co/papers/2605.12571"><img alt="Paper" src="https://img.shields.io/badge/Paper-HF--Paper-red"></a>
<a href="https://github.com/Echochef/VideoSEAL"><img alt="Code" src="https://img.shields.io/badge/Code-GitHub-black?logo=github"></a>
<a href="https://huggingface.co/CewEhao/VideoSEAL_8B"><img alt="HF Model" src="https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-VideoSEAL__8B-yellow"></a>
<img alt="ICML 2026" src="https://img.shields.io/badge/ICML-2026-blue">
</p>
<p align="center">
🤗 HuggingFace model:
<a href="https://huggingface.co/CewEhao/VideoSEAL_8B">CewEhao/VideoSEAL_8B</a>
&nbsp;·&nbsp;
💻 Code:
<a href="https://github.com/Echochef/VideoSEAL">Echochef/VideoSEAL</a>
&nbsp;·&nbsp;
📄 Paper:
<a href="https://huggingface.co/papers/2605.12571">2605.12571</a>
</p>
## 👉 Introduction
This is the official model card for **VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding by Decoupling Answer Authority** (ICML 2026).
VideoSEAL is an agentic framework for long-video question answering. It separates the *planner* role (deciding which evidence to gather) from the *answerer* role (judging the gathered evidence), mitigating "evidence misalignment": the failure mode in which a model produces a correct answer that is not actually supported by the evidence it retrieved.
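To make the decoupling concrete, here is a minimal sketch of the intended control flow. All names here (`plan_next_query`, `search`, `judge_evidence`, the `Evidence`/`State` containers) are illustrative placeholders, not the repo's actual API:
```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    clip_id: str
    caption: str

@dataclass
class State:
    question: str
    evidence: list[Evidence] = field(default_factory=list)

def answer_question(question, planner, answerer, index, max_steps=8):
    """Hypothetical planner/answerer loop: the planner only decides what
    evidence to gather next; the answerer alone holds answer authority,
    and judges nothing but the evidence actually gathered."""
    state = State(question)
    for _ in range(max_steps):
        query = planner.plan_next_query(state)      # planner: what to look at next?
        if query is None:                           # planner thinks evidence suffices
            break
        state.evidence.extend(index.search(query))  # retrieve clips/captions
    return answerer.judge_evidence(state.question, state.evidence)
```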
VideoSEAL provides offline build utilities for long-video indexing (see the sketch after this list):
- OCR subtitles (SRT) → OCR captions + (optional) embeddings
- Clip captions (VLM) → clip captions + (optional) embeddings
- Merge into a unified semantic index under `indexes/semantic/<video_id>/`
- (Optional) generate a global `full_story.txt` summary
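As a sketch of what to expect on disk, the helper below simply lists whatever a build wrote for one video. Only the `indexes/semantic/<video_id>/` location and the optional `full_story.txt` come from this README; the remaining artifact names are determined by the build scripts themselves:
```python
from pathlib import Path

def list_index_artifacts(repo_root: str, video_id: str) -> list[str]:
    """List the files an offline build produced under indexes/semantic/<video_id>/."""
    index_dir = Path(repo_root) / "indexes" / "semantic" / video_id
    return [str(p.relative_to(index_dir))
            for p in sorted(index_dir.rglob("*")) if p.is_file()]

# full_story.txt should show up here if the optional global summary was generated.
print(list_index_artifacts("/path/to/VideoSEAL", "my_video"))
```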
## 📦 Layout
- 🧰 Shell entrypoints: `scripts/`
- 🐍 Python package: `videoseal/`
- ✅ Tests: `test/`
- 🧩 OCR toolchain (vendored): `third_party/video-subtitle-extractor/`
## βš™οΈ Configuration
- Defaults live in the scripts under `scripts/`.
- Put real API keys/endpoints in your shell environment / job launcher.
## πŸ—οΈ Run offline build
```bash
cd /path/to/VideoSEAL
# Export the API keys required by the build (replace the placeholders with real values).
export MLLM_API_KEY="sk_your_api_key"
export EMBEDDING_API_KEY="sk_your_api_key"
export AGENT_LLM_API_KEY="sk_your_api_key"
export VISUAL_INSPECT_API_KEY="sk_your_api_key"
# VIDEO and BENCHMARK are passed to the entrypoint as environment variables.
VIDEO=/path/to/video.mp4 BENCHMARK=LVBench ./scripts/run_offline_build.sh
```
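To index a whole directory of videos, a small driver can invoke the same entrypoint once per file. This loop is a convenience sketch, not part of the repo; it assumes the API keys are already exported in the calling shell:
```python
import os
import subprocess
from pathlib import Path

REPO = "/path/to/VideoSEAL"
VIDEO_DIR = Path("/path/to/videos")  # directory of .mp4 files to index

for video in sorted(VIDEO_DIR.glob("*.mp4")):
    # Same contract as the manual run: VIDEO and BENCHMARK via environment variables.
    subprocess.run(
        ["./scripts/run_offline_build.sh"],
        env={**os.environ, "VIDEO": str(video), "BENCHMARK": "LVBench"},
        cwd=REPO,
        check=True,
    )
```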
## ✅ Run tests
```bash
# Adjust the interpreter path if your environment lives elsewhere.
/root/miniconda3/envs/rllm/bin/python -m unittest discover -s test -v
```
## πŸ‹οΈ GRPO training (video tool workflow)
This repo vendors a minimal copy of the `rllm/` + `verl/` Python packages (under the repo root)
to make the video tool-agent GRPO workflow runnable without an extra repo checkout.
### πŸ§ͺ Training environment (conda)
```bash
conda create -n videoseal python=3.12 -y
conda activate videoseal
pip install vllm==0.11.0
# Install the vendored packages in editable mode.
cd rllm
pip install -e .
cd ../verl
pip install -e .
```
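A quick sanity check that the editable installs took effect (assuming the two vendored packages expose `rllm` and `verl` top-level modules, which is inferred from their directory names):
```python
# Run inside the videoseal env; all three imports should succeed.
import vllm
import rllm
import verl

print(vllm.__version__)  # expect 0.11.0, matching the pin above
```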
### πŸš€ Launcher
- `scripts/train/run_video_workflow_grpo.sh`
### 🧩 Example
```bash
cd /path/to/VideoSEAL
# Export real API keys/endpoints in your environment before launching.
# TRAIN_PARQUET takes a JSON list of parquet paths; VAL_PARQUET takes a single path.
TRAIN_PARQUET='["/path/to/train.parquet"]' \
VAL_PARQUET='/path/to/val.parquet' \
MODEL_PATH='Qwen/Qwen3-8B' \
./scripts/train/run_video_workflow_grpo.sh train
```
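The launcher expects training data in parquet form. To assemble a toy file for smoke-testing it, something like the pandas snippet below works; note that the column names here are illustrative placeholders only, since the actual schema is defined by the repo's data pipeline:
```python
import pandas as pd  # requires pyarrow (or fastparquet) for to_parquet

# Hypothetical schema: replace these fields with the columns the
# repo's data pipeline actually expects.
rows = [
    {
        "video_path": "/path/to/video.mp4",
        "question": "What does the presenter do after opening the laptop?",
        "answer": "Starts the demo",
    }
]
pd.DataFrame(rows).to_parquet("/path/to/train.parquet", index=False)
```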
### 🔎 Quick checks
```bash
./scripts/train/run_video_workflow_grpo.sh test-reward
pytest -q tests/rewards/test_video_reward_tool_env_integration.py
```
## 📜 Citation
```bibtex
@inproceedings{videoseal2026,
  title     = {VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding by Decoupling Answer Authority},
  author    = {Dongyang Liu and others},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2026},
  url       = {https://huggingface.co/papers/2605.12571}
}
```