Add paper link and pipeline metadata (#1)
by nielsr (HF Staff) - opened

README.md CHANGED
@@ -2,6 +2,8 @@
 license: apache-2.0
 tags:
 - pytorch
+- video-saliency-prediction
+pipeline_tag: other
 ---
 
 <a id="top"></a>
@@ -27,8 +29,8 @@ tags:
 
 These are the official implementation, pre-trained model weights, and configuration files for **ViSAGE**, designed for the NTIRE 2026 Challenge on Video Saliency Prediction (CVPRW 2026).
 
-📄 **Paper:** [
-💻 **GitHub Repository:** [iLearn-Lab/CVPRW26-ViSAGE](https://github.com/iLearn-Lab/CVPRW26-ViSAGE
+📄 **Paper:** [ViSAGE @ NTIRE 2026 Challenge on Video Saliency Prediction](https://huggingface.co/papers/2604.08613)
+💻 **GitHub Repository:** [iLearn-Lab/CVPRW26-ViSAGE](https://github.com/iLearn-Lab/CVPRW26-ViSAGE)
 🏆 **Challenge Page:** [NTIRE 2026 VSP Challenge](https://www.codabench.org/competitions/12842/)
 
 ---
@@ -42,7 +44,7 @@ These are the official implementation, pre-trained model weights, and configurat
 ## 📋 Model Information
 
 ### 1. Model Name
-**ViSAGE(Video Saliency with Adaptive Gated Experts)**
+**ViSAGE (Video Saliency with Adaptive Gated Experts)**
 
 ### 2. Task Type & Applicable Tasks
 - **Task Type:** Video Saliency Prediction (VSP) / Computer Vision
@@ -51,7 +53,7 @@ These are the official implementation, pre-trained model weights, and configurat
 ### 3. Project Introduction
 Video Saliency Prediction requires capturing complex spatio-temporal dynamics and human visual priors. **ViSAGE** tackles this by leveraging a powerful multi-expert ensemble framework.
 
-> 💡 **Method Highlight:** The framework consists of a shared **InternVideo2 backbone** adapted via two-stage LoRA fine-tuning, alongside dual specialized experts utilizing Temporal Modulation (for explicit spatial priors) and Multi-Scale Fusion (for adaptive data-driven perception). For robust performance, the **Ensemble Fusion Module** obtains the final prediction by converting the expert outputs to logit space before averaging
+> 💡 **Method Highlight:** The framework consists of a shared **InternVideo2 backbone** adapted via two-stage LoRA fine-tuning, alongside dual specialized experts utilizing Temporal Modulation (for explicit spatial priors) and Multi-Scale Fusion (for adaptive data-driven perception). For robust performance, the **Ensemble Fusion Module** obtains the final prediction by converting the expert outputs to logit space before averaging.
 
 ### 4. Training Data Source
 - Dataset provided by the **NTIRE 2026 Video Saliency Prediction Challenge** (Private Test and Validation sets).
@@ -65,25 +67,19 @@ Clone the GitHub repository and set up the Conda environment:
 ```bash
 git clone https://github.com/iLearn-Lab/CVPRW26-ViSAGE.git
 cd ViSAGE
-```
-```bash
 conda create -n visage python=3.10 -y
 conda activate visage
 pip install -r requirements.txt
 ```
 
 ### Step 2: Data & Pre-trained Weights Preparation
-1. **Challenge Data:** Use the provided scripts to extract frames from the source videos.
-*(⚠️ **Important:** Do not modify the output directory name `derived_fullfps` unless you manually update the path configs in all inference scripts.)*
+1. **Challenge Data:** Use the provided scripts to extract frames from the source videos.
 ```bash
 python video_to_frames.py
 ```
-2. **
-3. **
-
-git clone https://github.com/OpenGVLab/InternVideo.git
-*(Update the pre-trained weight paths in `Expert1/inference.py` and `Expert2/inference.py` to match your local directory).*
-```
+2. **InternVideo2 Backbone:** Download the pre-trained `InternVideo2-Stage2_6B-224p-f4` model from [Hugging Face](https://huggingface.co/OpenGVLab/InternVideo2-Stage2_6B-224p-f4) and clone the `InternVideo` repo.
+3. **Paths:** Update the pre-trained weight paths in `Expert1/inference.py` and `Expert2/inference.py` to match your local directory.
+
 ### Step 3: Run Inference & Ensemble
 
 **1. Inference:** Generate predictions for both experts.
@@ -95,14 +91,14 @@ python Expert2/inference.py
 ```bash
 python ensemble.py
 ```
-**3. Format Check & Video Generation:**
+**3. Format Check & Video Generation:**
 ```bash
 python check.py
 python makevideos.py
 ```
 
 ### Step 4: Training (Optional)
-
+Run the two-stage LoRA fine-tuning pipeline:
 ```bash
 python trainnew.py # Stage 1
 python trainnew2.py # Stage 2
@@ -112,15 +108,8 @@ python trainnew2.py # Stage 2
 
 ## ⚠️ Limitations & Notes
 
-**Disclaimer:** This framework and its pre-trained weights are intended for **academic research purposes only**.
 - The model relies heavily on the InternVideo2 backbone; out-of-memory (OOM) errors may occur on GPUs with less than 24GB VRAM.
-
-
----
-
-## 🤝 Acknowledgements & Contact
-
-- **Contact:** If you have any questions or encounter issues, feel free to open an issue or contact the author Kun Wang at `khylon.kun.wang@gmail.com`.
+- This framework and its pre-trained weights are intended for **academic research purposes only**.
 
 ---
 
@@ -128,10 +117,11 @@ python trainnew2.py # Stage 2
 
 If you find this project useful for your research, please consider citing:
 
-
+```bibtex
 @inproceedings{ntire26visage,
 title={{ViSAGE @ NTIRE 2026 Challenge on Video Saliency Prediction: Methods and Results}},
 author={Wang, Kun and Hu, Yupeng and Li, Zhiran and Liu, Hao and Xiang, Qianlong and Nie, Liqiang},
 booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
 year={2026}
-}
+}
+```
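The Method Highlight in the README describes the Ensemble Fusion Module as converting expert outputs to logit space before averaging. As a minimal illustrative sketch only (this is not the repository's `ensemble.py`, whose actual inputs, outputs, and expert weighting are not shown in this diff), logit-space averaging of per-expert saliency maps might look like:

```python
import numpy as np

def logit(p, eps=1e-6):
    # Clip to (eps, 1 - eps) so log() stays finite for saliency values of 0 or 1.
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ensemble_saliency(maps):
    # Average the per-expert saliency maps in logit space,
    # then map the result back to the (0, 1) probability range.
    stacked = np.stack([logit(np.asarray(m, dtype=float)) for m in maps])
    return sigmoid(stacked.mean(axis=0))

# Two hypothetical expert predictions for a 2x2 frame.
expert1 = np.array([[0.9, 0.2], [0.5, 0.7]])
expert2 = np.array([[0.8, 0.4], [0.5, 0.3]])
fused = ensemble_saliency([expert1, expert2])
```

A mean of logits is equivalent to a geometric mean of the experts' odds, which tempers the influence of a single near-0 or near-1 prediction compared with averaging raw probabilities.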