Add paper link and pipeline metadata (#1)
by nielsr (HF Staff) - opened

README.md CHANGED
@@ -2,6 +2,8 @@
 license: apache-2.0
 tags:
 - pytorch
+- video-saliency-prediction
+pipeline_tag: other
 ---
 
 <a id="top"></a>
@@ -27,8 +29,8 @@ tags:
 
 These are the official implementation, pre-trained model weights, and configuration files for **ViSAGE**, designed for the NTIRE 2026 Challenge on Video Saliency Prediction (CVPRW 2026).
 
-📄 **Paper:** [
-💻 **GitHub Repository:** [iLearn-Lab/CVPRW26-ViSAGE](https://github.com/iLearn-Lab/CVPRW26-ViSAGE
+📄 **Paper:** [ViSAGE @ NTIRE 2026 Challenge on Video Saliency Prediction](https://huggingface.co/papers/2604.08613)
+💻 **GitHub Repository:** [iLearn-Lab/CVPRW26-ViSAGE](https://github.com/iLearn-Lab/CVPRW26-ViSAGE)
 🏆 **Challenge Page:** [NTIRE 2026 VSP Challenge](https://www.codabench.org/competitions/12842/)
 
 ---
@@ -42,7 +44,7 @@ These are the official implementation, pre-trained model weights, and configurat
 ## 📋 Model Information
 
 ### 1. Model Name
-**ViSAGE(Video Saliency with Adaptive Gated Experts)**
+**ViSAGE (Video Saliency with Adaptive Gated Experts)**
 
 ### 2. Task Type & Applicable Tasks
 - **Task Type:** Video Saliency Prediction (VSP) / Computer Vision
@@ -51,7 +53,7 @@ These are the official implementation, pre-trained model weights, and configurat
 ### 3. Project Introduction
 Video Saliency Prediction requires capturing complex spatio-temporal dynamics and human visual priors. **ViSAGE** tackles this by leveraging a powerful multi-expert ensemble framework.
 
-> 💡 **Method Highlight:** The framework consists of a shared **InternVideo2 backbone** adapted via two-stage LoRA fine-tuning, alongside dual specialized experts utilizing Temporal Modulation (for explicit spatial priors) and Multi-Scale Fusion (for adaptive data-driven perception). For robust performance, the **Ensemble Fusion Module** obtains the final prediction by converting the expert outputs to logit space before averaging
+> 💡 **Method Highlight:** The framework consists of a shared **InternVideo2 backbone** adapted via two-stage LoRA fine-tuning, alongside dual specialized experts utilizing Temporal Modulation (for explicit spatial priors) and Multi-Scale Fusion (for adaptive data-driven perception). For robust performance, the **Ensemble Fusion Module** obtains the final prediction by converting the expert outputs to logit space before averaging.
 
 ### 4. Training Data Source
 - Dataset provided by the **NTIRE 2026 Video Saliency Prediction Challenge** (Private Test and Validation sets).
@@ -65,25 +67,19 @@ Clone the GitHub repository and set up the Conda environment:
 ```bash
 git clone https://github.com/iLearn-Lab/CVPRW26-ViSAGE.git
 cd ViSAGE
-```
-```bash
 conda create -n visage python=3.10 -y
 conda activate visage
 pip install -r requirements.txt
 ```
 
 ### Step 2: Data & Pre-trained Weights Preparation
-1. **Challenge Data:** Use the provided scripts to extract frames from the source videos.
-*(⚠️ **Important:** Do not modify the output directory name `derived_fullfps` unless you manually update the path configs in all inference scripts.)*
+1. **Challenge Data:** Use the provided scripts to extract frames from the source videos.
 ```bash
 python video_to_frames.py
 ```
-2. **
-3. **
-
-git clone https://github.com/OpenGVLab/InternVideo.git
-*(Update the pre-trained weight paths in `Expert1/inference.py` and `Expert2/inference.py` to match your local directory).*
-```
+2. **InternVideo2 Backbone:** Download the pre-trained `InternVideo2-Stage2_6B-224p-f4` model from [Hugging Face](https://huggingface.co/OpenGVLab/InternVideo2-Stage2_6B-224p-f4) and clone the `InternVideo` repo.
+3. **Paths:** Update the pre-trained weight paths in `Expert1/inference.py` and `Expert2/inference.py` to match your local directory.
+
 ### Step 3: Run Inference & Ensemble
 
 **1. Inference:** Generate predictions for both experts.
@@ -95,14 +91,14 @@ python Expert2/inference.py
 ```bash
 python ensemble.py
 ```
-**3. Format Check & Video Generation:**
+**3. Format Check & Video Generation:**
 ```bash
 python check.py
 python makevideos.py
 ```
 
 ### Step 4: Training (Optional)
-
+Run the two-stage LoRA fine-tuning pipeline:
 ```bash
 python trainnew.py # Stage 1
 python trainnew2.py # Stage 2
@@ -112,15 +108,8 @@ python trainnew2.py # Stage 2
 
 ## ⚠️ Limitations & Notes
 
-**Disclaimer:** This framework and its pre-trained weights are intended for **academic research purposes only**.
 - The model relies heavily on the InternVideo2 backbone; out-of-memory (OOM) errors may occur on GPUs with less than 24GB VRAM.
-
-
----
-
-## 🤝 Acknowledgements & Contact
-
-- **Contact:** If you have any questions or encounter issues, feel free to open an issue or contact the author Kun Wang at `khylon.kun.wang@gmail.com`.
+- This framework and its pre-trained weights are intended for **academic research purposes only**.
 
 ---
 
@@ -128,10 +117,11 @@ python trainnew2.py # Stage 2
 
 If you find this project useful for your research, please consider citing:
 
-
+```bibtex
 @inproceedings{ntire26visage,
 title={{ViSAGE @ NTIRE 2026 Challenge on Video Saliency Prediction: Methods and Results}},
 author={Wang, Kun and Hu, Yupeng and Li, Zhiran and Liu, Hao and Xiang, Qianlong and Nie, Liqiang},
 booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
 year={2026}
-}
+}
+```
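The Method Highlight in the README describes the Ensemble Fusion Module as converting expert outputs to logit space before averaging. As a minimal illustrative sketch only (this is not the repository's `ensemble.py`, whose actual inputs, outputs, and expert weighting are not shown in this diff), logit-space averaging of per-expert saliency maps might look like:

```python
import numpy as np

def logit(p, eps=1e-6):
    # Clip to (eps, 1 - eps) so log() stays finite for saliency values of 0 or 1.
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ensemble_saliency(maps):
    # Average the per-expert saliency maps in logit space,
    # then map the result back to the (0, 1) probability range.
    stacked = np.stack([logit(np.asarray(m, dtype=float)) for m in maps])
    return sigmoid(stacked.mean(axis=0))

# Two hypothetical expert predictions for a 2x2 frame.
expert1 = np.array([[0.9, 0.2], [0.5, 0.7]])
expert2 = np.array([[0.8, 0.4], [0.5, 0.3]])
fused = ensemble_saliency([expert1, expert2])
```

A mean of logits is equivalent to a geometric mean of the experts' odds, which tempers the influence of a single near-0 or near-1 prediction compared with averaging raw probabilities.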