---
base_model: Qwen/Qwen3-8B
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: video-text-to-text
tags:
- video-understanding
- long-video-understanding
- agentic-llm
- video-question-answering
- vision-language-model
- grpo
- reinforcement-learning
- icml-2026
---
<h2 align="center">🎬 VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding by Decoupling Answer Authority</h2>
<p align="center">
<a href="https://huggingface.co/papers/2605.12571"><img alt="Paper" src="https://img.shields.io/badge/Paper-HF--Paper-red"></a>
<a href="https://github.com/Echochef/VideoSEAL"><img alt="Code" src="https://img.shields.io/badge/Code-GitHub-black?logo=github"></a>
<a href="https://huggingface.co/CewEhao/VideoSEAL_8B"><img alt="HF Model" src="https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-VideoSEAL__8B-yellow"></a>
<img alt="ICML 2026" src="https://img.shields.io/badge/ICML-2026-blue">
</p>
<p align="center">
🤗 HuggingFace model:
<a href="https://huggingface.co/CewEhao/VideoSEAL_8B">CewEhao/VideoSEAL_8B</a>
·
💻 Code:
<a href="https://github.com/Echochef/VideoSEAL">Echochef/VideoSEAL</a>
·
📄 Paper:
<a href="https://huggingface.co/papers/2605.12571">2605.12571</a>
</p>
## 📖 Introduction
This is the official model card for **VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding by Decoupling Answer Authority** (ICML 2026).
VideoSEAL is an agentic framework for long-video question answering. It separates the *planner* role (deciding which evidence to gather) from the *answerer* role (judging the gathered evidence), mitigating "evidence misalignment", where a model produces a correct answer that is not actually supported by the evidence it retrieved.
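The planner/answerer decoupling can be sketched roughly as follows. This is a minimal illustration, not the repo's actual API: the function names, the keyword-matching planner, and the caption-based index are all assumptions made for the example.

```python
# Illustrative sketch of decoupled answer authority: the planner only gathers
# evidence; the answerer may only answer from what was gathered.

STOPWORDS = {"the", "a", "an", "what", "does", "do", "is", "in"}

def plan_evidence(question, index):
    """Planner: decide which clips to gather as evidence (naive keyword match)."""
    terms = set(question.lower().split()) - STOPWORDS
    return [clip for clip in index if terms & set(clip["caption"].lower().split())]

def answer_from_evidence(question, evidence):
    """Answerer: judge only the gathered evidence; it has no authority to
    answer beyond what the evidence supports."""
    if not evidence:
        return "insufficient evidence"
    return evidence[0]["caption"]

index = [
    {"clip_id": 1, "caption": "a chef slices onions in the kitchen"},
    {"clip_id": 2, "caption": "a dog runs across the park"},
]
evidence = plan_evidence("what does the chef do", index)
answer = answer_from_evidence("what does the chef do", evidence)
print(answer)
```

Because the answerer never sees the raw video, an unsupported answer cannot bypass the evidence check: with no matching clips it must return "insufficient evidence" rather than guess.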
VideoSEAL provides offline build utilities for long video indexing:
- OCR subtitles (SRT) → OCR captions + (optional) embeddings
- Clip captions (VLM) → clip captions + (optional) embeddings
- Merge into a unified semantic index under `indexes/semantic/<video_id>/`
- (Optional) generate a global `full_story.txt` summary
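As an illustration, the merge step above could be sketched like this. The `index.json` filename and the entry schema are assumptions for the example, not the repo's actual on-disk layout.

```python
import json
import os

def build_semantic_index(video_id, ocr_captions, clip_captions, root="indexes/semantic"):
    """Merge OCR captions and VLM clip captions into one time-sorted index
    under indexes/semantic/<video_id>/ (layout illustrative)."""
    entries = (
        [{"t": t, "source": "ocr", "text": s} for t, s in ocr_captions]
        + [{"t": t, "source": "clip", "text": s} for t, s in clip_captions]
    )
    entries.sort(key=lambda e: e["t"])  # unify both sources on the timeline
    out_dir = os.path.join(root, video_id)
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "index.json"), "w") as f:
        json.dump(entries, f, indent=2)
    return entries

merged = build_semantic_index(
    "demo_video",
    ocr_captions=[(12.0, "BREAKING NEWS banner")],
    clip_captions=[(5.0, "a reporter stands outside"), (20.0, "crowd gathers")],
)
```

Sorting by timestamp keeps OCR and clip-level evidence interleaved in playback order, so downstream retrieval can reason over a single timeline.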
## 📦 Layout
- 🧰 Shell entrypoints: `scripts/`
- 🐍 Python package: `videoseal/`
- ✅ Tests: `test/`
- 🧩 OCR toolchain (vendored): `third_party/video-subtitle-extractor/`
## ⚙️ Configuration
- Defaults live in the scripts under `scripts/`.
- Put real API keys and endpoints in your shell environment or job launcher; do not commit them to the scripts.
## 🏗️ Run offline build
```bash
cd /path/to/VideoSEAL
export MLLM_API_KEY="sk_your_api_key"
export EMBEDDING_API_KEY="sk_your_api_key"
export AGENT_LLM_API_KEY="sk_your_api_key"
export VISUAL_INSPECT_API_KEY="sk_your_api_key"
VIDEO=/path/to/video.mp4 BENCHMARK=LVBench ./scripts/run_offline_build.sh
```
## ✅ Run tests
```bash
/root/miniconda3/envs/rllm/bin/python -m unittest discover -s test -v
```
## 🏋️ GRPO training (video tool workflow)
This repo vendors a minimal copy of the `rllm/` and `verl/` Python packages (under the repo root) so the video tool-agent GRPO workflow runs without an extra repository checkout.
### 🧪 Training environment (conda)
```bash
conda create -n videoseal python=3.12 -y
conda activate videoseal
pip install vllm==0.11.0
cd rllm
pip install -e .
cd ../verl
pip install -e .
```
### 🚀 Launcher
- `scripts/train/run_video_workflow_grpo.sh`
### 🧩 Example
```bash
cd /path/to/VideoSEAL
# Export real API keys/endpoints in your environment before launching.
TRAIN_PARQUET='["/path/to/train.parquet"]' \
VAL_PARQUET='/path/to/val.parquet' \
MODEL_PATH='Qwen/Qwen3-8B' \
./scripts/train/run_video_workflow_grpo.sh train
```
### 🔍 Quick checks
```bash
./scripts/train/run_video_workflow_grpo.sh test-reward
pytest -q tests/rewards/test_video_reward_tool_env_integration.py
```
## 📚 Citation
```bibtex
@inproceedings{videoseal2026,
title={VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding by Decoupling Answer Authority},
author={Dongyang Liu and others},
booktitle={International Conference on Machine Learning (ICML)},
year={2026},
url={https://huggingface.co/papers/2605.12571}
}
```