Add paper link and pipeline metadata

#1
opened by nielsr (HF Staff)
Files changed (1)
  1. README.md +16 -26
README.md CHANGED
@@ -2,6 +2,8 @@
  license: apache-2.0
  tags:
  - pytorch
  ---

  <a id="top"></a>
@@ -27,8 +29,8 @@ tags:
 
  These are the official implementation, pre-trained model weights, and configuration files for **ViSAGE**, designed for the NTIRE 2026 Challenge on Video Saliency Prediction (CVPRW 2026).
 
- πŸ”— **Paper:** [Accepted by CVPRW 2026](https://arxiv.org)
- πŸ”— **GitHub Repository:** [iLearn-Lab/CVPRW26-ViSAGE](https://github.com/iLearn-Lab/CVPRW26-ViSAGE.git)
  πŸ”— **Challenge Page:** [NTIRE 2026 VSP Challenge](https://www.codabench.org/competitions/12842/)
 
  ---
@@ -42,7 +44,7 @@ These are the official implementation, pre-trained model weights, and configurat
  ## πŸ“Œ Model Information
 
  ### 1. Model Name
- **ViSAGE(Video Saliency with Adaptive Gated Experts)**
 
  ### 2. Task Type & Applicable Tasks
  - **Task Type:** Video Saliency Prediction (VSP) / Computer Vision
@@ -51,7 +53,7 @@ These are the official implementation, pre-trained model weights, and configurat
  ### 3. Project Introduction
  Video Saliency Prediction requires capturing complex spatio-temporal dynamics and human visual priors. **ViSAGE** tackles this by leveraging a powerful multi-expert ensemble framework.
 
- > πŸ’‘ **Method Highlight:** The framework consists of a shared **InternVideo2 backbone** adapted via two-stage LoRA fine-tuning, alongside dual specialized experts utilizing Temporal Modulation (for explicit spatial priors) and Multi-Scale Fusion (for adaptive data-driven perception). For robust performance, the **Ensemble Fusion Module** obtains the final prediction by converting the expert outputs to logit space before averaging, which provides significantly more accurate estimation than simple saliency map averaging.
 
  ### 4. Training Data Source
  - Dataset provided by the **NTIRE 2026 Video Saliency Prediction Challenge** (Private Test and Validation sets).
@@ -65,25 +67,19 @@ Clone the GitHub repository and set up the Conda environment:
  ```bash
  git clone https://github.com/iLearn-Lab/CVPRW26-ViSAGE.git
  cd ViSAGE
- ```
- ```bash
  conda create -n visage python=3.10 -y
  conda activate visage
  pip install -r requirements.txt
  ```
 
  ### Step 2: Data & Pre-trained Weights Preparation
- 1. **Challenge Data:** Use the provided scripts to extract frames from the source videos. The extracted frames will be automatically saved to `derived_fullfps`.
- *(⚠️ **Important:** Do not modify the output directory name `derived_fullfps` unless you manually update the path configs in all inference scripts.)*
  ```bash
  python video_to_frames.py
  ```
- 2. **ViSAGE Checkpoints:** Download our model checkpoints from [Hugging Face](https://huggingface.co/iLearn-Lab/CVPRW26-ViSAGE).
- 3. **InternVideo2 Backbone:** Download the pre-trained `InternVideo2-Stage2_6B-224p-f4` model from [Hugging Face](https://huggingface.co/OpenGVLab/InternVideo2-Stage2_6B-224p-f4) and clone the `InternVideo` repo:
- ```bash
- git clone https://github.com/OpenGVLab/InternVideo.git
- ```
- *(Update the pre-trained weight paths in `Expert1/inference.py` and `Expert2/inference.py` to match your local directory.)*
  ### Step 3: Run Inference & Ensemble
 
  **1. Inference:** Generate predictions for both experts.
@@ -95,14 +91,14 @@ python Expert2/inference.py
  ```bash
  python ensemble.py
  ```
- **3. Format Check & Video Generation:** Validate your submission format and render the predicted saliency outputs onto the source video frames.
  ```bash
  python check.py
  python makevideos.py
  ```
 
  ### Step 4: Training (Optional)
- If you wish to train the model from scratch, run the two-stage LoRA fine-tuning pipeline:
  ```bash
  python trainnew.py   # Stage 1
  python trainnew2.py  # Stage 2
@@ -112,15 +108,8 @@ python trainnew2.py # Stage 2
 
  ## ⚠️ Limitations & Notes
 
- **Disclaimer:** This framework and its pre-trained weights are intended for **academic research purposes only**.
  - The model relies heavily on the InternVideo2 backbone; out-of-memory (OOM) errors may occur on GPUs with less than 24GB VRAM.
- - Inference speed and performance may fluctuate depending on the hardware utilized.
-
- ---
-
- ## 🀝 Acknowledgements & Contact
-
- - **Contact:** If you have any questions or encounter issues, feel free to open an issue or contact the author Kun Wang at `khylon.kun.wang@gmail.com`.
 
  ---
 
@@ -128,10 +117,11 @@ python trainnew2.py # Stage 2
  If you find this project useful for your research, please consider citing:
 
-
  @inproceedings{ntire26visage,
    title={{ViSAGE @ NTIRE 2026 Challenge on Video Saliency Prediction: Methods and Results}},
    author={Wang, Kun and Hu, Yupeng and Li, Zhiran and Liu, Hao and Xiang, Qianlong and Nie, Liqiang},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    year={2026}
- }
  license: apache-2.0
  tags:
  - pytorch
+ - video-saliency-prediction
+ pipeline_tag: other
  ---
 
  <a id="top"></a>
 
 
  These are the official implementation, pre-trained model weights, and configuration files for **ViSAGE**, designed for the NTIRE 2026 Challenge on Video Saliency Prediction (CVPRW 2026).
 
+ πŸ”— **Paper:** [ViSAGE @ NTIRE 2026 Challenge on Video Saliency Prediction](https://huggingface.co/papers/2604.08613)
+ πŸ”— **GitHub Repository:** [iLearn-Lab/CVPRW26-ViSAGE](https://github.com/iLearn-Lab/CVPRW26-ViSAGE)
  πŸ”— **Challenge Page:** [NTIRE 2026 VSP Challenge](https://www.codabench.org/competitions/12842/)
 
  ---
 
  ## πŸ“Œ Model Information
 
  ### 1. Model Name
+ **ViSAGE (Video Saliency with Adaptive Gated Experts)**
 
  ### 2. Task Type & Applicable Tasks
  - **Task Type:** Video Saliency Prediction (VSP) / Computer Vision
 
  ### 3. Project Introduction
  Video Saliency Prediction requires capturing complex spatio-temporal dynamics and human visual priors. **ViSAGE** tackles this by leveraging a powerful multi-expert ensemble framework.
 
+ > πŸ’‘ **Method Highlight:** The framework consists of a shared **InternVideo2 backbone** adapted via two-stage LoRA fine-tuning, alongside dual specialized experts utilizing Temporal Modulation (for explicit spatial priors) and Multi-Scale Fusion (for adaptive data-driven perception). For robust performance, the **Ensemble Fusion Module** obtains the final prediction by converting the expert outputs to logit space before averaging.
 
  ### 4. Training Data Source
  - Dataset provided by the **NTIRE 2026 Video Saliency Prediction Challenge** (Private Test and Validation sets).
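As an illustration of the logit-space averaging described in the highlight above, here is a minimal numpy sketch. It is not the repository's `ensemble.py`; the function name, clipping epsilon, and toy inputs are assumptions:

```python
import numpy as np

def fuse_logits(saliency_maps, eps=1e-6):
    """Average per-expert saliency maps in logit space.

    Each map holds values in (0, 1). Clipping avoids infinities at 0/1;
    the sigmoid at the end maps the averaged logits back to (0, 1).
    """
    logits = [np.log(np.clip(m, eps, 1 - eps) / (1 - np.clip(m, eps, 1 - eps)))
              for m in saliency_maps]
    return 1.0 / (1.0 + np.exp(-np.mean(logits, axis=0)))

# Toy 2x2 predictions from two hypothetical experts
expert1 = np.array([[0.9, 0.2], [0.5, 0.7]])
expert2 = np.array([[0.8, 0.1], [0.5, 0.6]])
fused = fuse_logits([expert1, expert2])
```

Because the averaging happens in logit space, predictions near 0 or 1 carry large-magnitude logits and therefore influence the fused map more strongly than under plain averaging of the saliency values.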
 
  ```bash
  git clone https://github.com/iLearn-Lab/CVPRW26-ViSAGE.git
  cd ViSAGE
  conda create -n visage python=3.10 -y
  conda activate visage
  pip install -r requirements.txt
  ```
 
  ### Step 2: Data & Pre-trained Weights Preparation
+ 1. **Challenge Data:** Use the provided scripts to extract frames from the source videos.
  ```bash
  python video_to_frames.py
  ```
+ 2. **InternVideo2 Backbone:** Download the pre-trained `InternVideo2-Stage2_6B-224p-f4` model from [Hugging Face](https://huggingface.co/OpenGVLab/InternVideo2-Stage2_6B-224p-f4) and clone the `InternVideo` repo.
+ 3. **Paths:** Update the pre-trained weight paths in `Expert1/inference.py` and `Expert2/inference.py` to match your local directory.
 
 
 
  ### Step 3: Run Inference & Ensemble
 
  **1. Inference:** Generate predictions for both experts.
 
  ```bash
  python ensemble.py
  ```
+ **3. Format Check & Video Generation:**
  ```bash
  python check.py
  python makevideos.py
  ```
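For intuition about what a submission format check involves, the stand-alone sketch below verifies that every extracted frame has a same-named prediction map. It is not the logic of the repository's `check.py`; the directory layout, `.png` extension, and function name are assumptions:

```python
from pathlib import Path

def check_predictions(pred_root, frame_root):
    """Report frames that lack a matching prediction map."""
    problems = []
    for video_dir in sorted(Path(frame_root).iterdir()):
        if not video_dir.is_dir():
            continue
        pred_dir = Path(pred_root) / video_dir.name
        if not pred_dir.is_dir():
            problems.append(f"missing prediction folder: {video_dir.name}")
            continue
        frames = {p.stem for p in video_dir.glob("*.png")}
        preds = {p.stem for p in pred_dir.glob("*.png")}
        for stem in sorted(frames - preds):
            problems.append(f"{video_dir.name}: no prediction for frame {stem}")
    return problems  # an empty list means the layout passes
```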
 
  ### Step 4: Training (Optional)
+ Run the two-stage LoRA fine-tuning pipeline:
  ```bash
  python trainnew.py   # Stage 1
  python trainnew2.py  # Stage 2
  ```
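For background on the two-stage pipeline above: LoRA keeps the pretrained weight frozen and trains only a low-rank additive update, which keeps the number of trainable parameters small. A minimal numpy sketch of the generic LoRA computation (not the repository's training code; the shapes and `alpha`/`r` values are illustrative):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16, r=4):
    """Generic LoRA forward pass: y = x @ W + (alpha / r) * x @ A @ B.

    W is the frozen pretrained weight; only the low-rank factors
    A (d_in x r) and B (r x d_out) are trained, so the trainable
    parameter count scales with r rather than d_in * d_out.
    """
    return x @ W + (alpha / r) * ((x @ A) @ B)

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 6, 4
W = rng.normal(size=(d_in, d_out))   # frozen backbone weight
A = rng.normal(size=(d_in, r))       # trainable down-projection
B = np.zeros((r, d_out))             # zero-init: update starts as a no-op
x = rng.normal(size=(2, d_in))
y = lora_forward(x, W, A, B, r=r)
```

With `B` zero-initialized, the adapted layer initially reproduces the frozen backbone exactly, so fine-tuning starts from the pretrained behavior.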
 
 
  ## ⚠️ Limitations & Notes
 
  - The model relies heavily on the InternVideo2 backbone; out-of-memory (OOM) errors may occur on GPUs with less than 24GB VRAM.
+ - This framework and its pre-trained weights are intended for **academic research purposes only**.
 
  ---
 
 
  If you find this project useful for your research, please consider citing:
 
+ ```bibtex
  @inproceedings{ntire26visage,
    title={{ViSAGE @ NTIRE 2026 Challenge on Video Saliency Prediction: Methods and Results}},
    author={Wang, Kun and Hu, Yupeng and Li, Zhiran and Liu, Hao and Xiang, Qianlong and Nie, Liqiang},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    year={2026}
+ }
+ ```