Swindl committed · Commit 8c20b86 · verified · 1 Parent(s): 85c1572

Update README.md

Files changed (1): README.md (+191 −3)
---
license: apache-2.0
library_name: transformers
---

# GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmentation
<div style="text-align: left;">
<p>
<a href="https://openreview.net/profile?id=~Lang_Lin3">Lang Lin</a>*,
<a href="https://openreview.net/profile?id=~Xueyang_Yu1">Xueyang Yu</a>*,
<a href="https://ziqipang.github.io/">Ziqi Pang</a>*,
<a href="https://yxw.web.illinois.edu/">Yu-Xiong Wang</a>
</p>
</div>
[[`Project Page`](https://glus-video.github.io/)] [[`arXiv`](https://arxiv.org/abs/2504.07962)] [[`GitHub`](https://github.com/GLUS-video/GLUS)]

[![arXiv](https://img.shields.io/badge/arXiv-2504.07962-A42C25?style=flat&logo=arXiv&logoColor=A42C25)](https://arxiv.org/abs/2504.07962)
[![Project](https://img.shields.io/badge/Project-Page-green?style=flat&logo=Google%20chrome&logoColor=green)](https://glus-video.github.io/)
## Overview

**RefVOS in complex scenarios** places high demands on a model's video understanding and fine-grained localization capabilities. Recently, numerous models leveraging **MLLM-based** comprehension and reasoning abilities have been proposed to address this challenge. **GLUS** advances further along this methodological path.

🚀 **GLUS is principled.** It uses global-local reasoning to combine holistic video understanding with detailed per-frame understanding, unlocking fine-grained segmentation in complex scenarios.

✨ **GLUS is powerful.** It unifies an end-to-end memory bank, object contrastive learning, and key frame selection to tackle mask inconsistency and object obfuscation, achieving state-of-the-art performance on complex-scenario RefVOS benchmarks.

📌 **GLUS is simple.** It elegantly integrates the whole approach for complex-scenario RefVOS within a single MLLM framework, eliminating the need for separate external modules.
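To picture the global-local idea, the toy sketch below samples a sparse set of frames spanning the whole video (global context) plus a dense window of consecutive frames (local detail). This is only our own illustration of the concept; the function name and parameters are hypothetical and not taken from the GLUS codebase.

```python
def global_local_indices(num_frames: int, n_global: int = 4,
                         n_local: int = 4, center: int = 0):
    """Toy sketch: pick sparse 'global' frames spanning the whole clip
    and a dense 'local' window of consecutive frames around `center`."""
    # Global frames: evenly spaced across the entire video.
    global_idx = [round(i * (num_frames - 1) / (n_global - 1))
                  for i in range(n_global)]
    # Local frames: a contiguous window, clamped to the video bounds.
    start = max(0, min(center, num_frames - n_local))
    local_idx = list(range(start, start + n_local))
    return global_idx, local_idx
```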
34
+ ## Installation
35
+ ```shell
36
+ git clone git@github.com:GLUS-video/GLUS.git && cd GLUS
37
+ pip install -r requirements.txt
38
+ pip install ./model/segment-anything-2
39
+ pip install flash-attn==2.6.2 --no-build-isolation
40
+ ```
41
+
## Model Zoo

For easier follow-up work, we also provide checkpoints of GLUS trained without object contrastive learning.

| Model | Training Datasets | Methods | Download | MeViS J\&F | Ref-Youtube-VOS J\&F |
|--------------------------------------|---------------------------------|--------------|----------|-----------|-----------------------|
| **GLUS<sup><i>S</i></sup><sub>partial</sub>** | MeViS, Ref-Youtube-VOS | GLU + MB | [HuggingFace](https://huggingface.co/Swindl/GLUS-S-partial/tree/main), [ModelScope](https://www.modelscope.cn/models/LangLin/GLUS-S-partial/files) | 49.5 | 65.2 |
| **GLUS<sup><i>S</i></sup>** | MeViS, Ref-Youtube-VOS | GLU + MB + OC + KFS | [HuggingFace](https://huggingface.co/Swindl/GLUS-S/tree/main), [ModelScope](https://www.modelscope.cn/models/LangLin/GLUS-S/files) | 50.3 | 66.6 |
| **GLUS<sup><i>A</i></sup>** | + RefDAVIS17, ReVOS, LVVIS | GLU + MB | [HuggingFace](https://huggingface.co/Swindl/GLUS-A/tree/main), [ModelScope](https://www.modelscope.cn/models/LangLin/GLUS-A/files) | 51.3 | 67.3 |

**Notes**: "GLU": global-local unification, "MB": end-to-end memory bank, "OC": object contrastive loss, "KFS": key frame selection.
GLUS<sup><i>S</i></sup> refers to the model trained on a subset of existing RefVOS datasets (MeViS and Ref-Youtube-VOS), while GLUS<sup><i>A</i></sup> denotes the model trained on the full set of available datasets.

We recommend downloading the pretrained weights and storing them at ``GLUS_ROOT/checkpoints``.
## Training and Validation

### 1. Data Preparation

Please prepare the datasets following the directory structure below. We recommend setting ``DATASET_ROOT`` to ``GLUS_ROOT/data``.

1. RefVOS Datasets: [MeViS](https://github.com/henghuiding/MeViS), [Refer-YouTube-VOS](https://codalab.lisn.upsaclay.fr/competitions/3282#participate-get-data), [Ref-DAVIS17](https://github.com/wjn922/ReferFormer/blob/main/docs/data.md).
2. Reasoning VOS Datasets: [ReVOS](https://github.com/cilinyan/ReVOS-api), [ReasonVOS](https://github.com/showlab/VideoLISA/blob/main/BENCHMARK.md).
3. Open-Vocabulary Video Instance Segmentation Dataset: [LV-VIS](https://github.com/haochenheheda/LVVIS/tree/main).
<details open>
<summary> <strong>Datasets Architecture</strong> </summary>

```
DATASET_ROOT
├── mevis
│   ├── train
│   │   ├── JPEGImages
│   │   ├── mask_dict.json
│   │   └── meta_expressions.json
│   ├── valid
│   │   ├── JPEGImages
│   │   └── meta_expressions.json
│   └── valid_u
│       ├── JPEGImages
│       ├── mask_dict.json
│       └── meta_expressions.json
├── Refer-YouTube-VOS
│   ├── meta_expressions
│   │   ├── train/meta_expressions.json
│   │   └── valid/meta_expressions.json
│   ├── train
│   │   ├── JPEGImages
│   │   └── Annotations
│   └── valid
│       └── JPEGImages
├── DAVIS17
│   ├── meta_expressions
│   │   ├── train/meta_expressions.json
│   │   └── valid/meta_expressions.json
│   ├── train
│   │   ├── JPEGImages
│   │   └── Annotations
│   └── valid
│       ├── JPEGImages
│       └── Annotations
├── LVVIS
│   ├── train
│   │   └── JPEGImages
│   ├── mask_dict.json
│   └── meta_expressions.json
├── ReVOS
│   ├── JPEGImages
│   ├── mask_dict.json
│   ├── mask_dict_foreground.json
│   ├── meta_expressions_train_.json
│   └── meta_expressions_valid_.json
└── ReasonVOS
    ├── JPEGImages
    ├── Annotations
    └── meta_expressions.json
```

</details>
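Before training, it can help to sanity-check part of this layout with a small path-existence script. The helper below is our own illustration (not part of the GLUS repo) and covers only the MeViS splits from the tree above:

```python
import os

# Expected MeViS layout from the directory tree above (illustrative only).
MEVIS_EXPECTED = {
    "train": ["JPEGImages", "mask_dict.json", "meta_expressions.json"],
    "valid": ["JPEGImages", "meta_expressions.json"],
    "valid_u": ["JPEGImages", "mask_dict.json", "meta_expressions.json"],
}

def missing_mevis_entries(dataset_root: str) -> list:
    """Return the expected MeViS paths that do not exist under DATASET_ROOT."""
    missing = []
    for split, entries in MEVIS_EXPECTED.items():
        for entry in entries:
            path = os.path.join(dataset_root, "mevis", split, entry)
            if not os.path.exists(path):
                missing.append(path)
    return missing
```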
### 2. Model Weights Preparation

Follow the steps below to prepare the pretrained weights of LISA and SAM-2 for training GLUS:

1. Download the pretrained weights of LISA from [LISA-7B-v1](https://huggingface.co/xinlai/LISA-7B-v1/tree/main).
2. Download the pretrained weights of SAM-2 from [sam2_hiera_large](https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_large.pt).

<details>
<summary> Then organize them in the following architecture: </summary>

```
WEIGHTS_ROOT
├── LISA-7B-v1
└── sam2_hiera_large.pt
```

We recommend setting ``WEIGHTS_ROOT`` to ``GLUS_ROOT/checkpoints``.

</details>
### 3. Training

Set the paths in the scripts and then run ``scripts/train_glus_s.sh`` or ``scripts/train_glus_a.sh``. The scripts automatically start training and convert the saved checkpoint into Hugging Face format when training finishes.

#### Key Frame Selection
For the usage of key frame selection, please refer to the [KFS_README](kfs/README.md).
### 4. Evaluation

Set the paths, ``val_set``, and ``set_name`` in ``scripts/inference.sh``, and then run it. The script first detects the available GPUs and then runs parallelizable inference on each GPU individually.
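The per-GPU parallelism can be pictured as splitting the video list into one shard per GPU, with each shard handled by a separate process. The sketch below is our own illustration of the idea, not the actual code in ``scripts/inference.sh``:

```python
def shard_videos(video_ids: list, num_gpus: int) -> list:
    """Round-robin split of the video list into one balanced shard per GPU.
    Each shard would then be processed by one worker, e.g. pinned via
    CUDA_VISIBLE_DEVICES (illustrative sketch only)."""
    return [video_ids[gpu::num_gpus] for gpu in range(num_gpus)]
```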

#### Evaluation with Key Frame Selection
Set the args ``use_kf`` and ``kf_path`` in ``scripts/inference_kf.sh``, and then run it. We provide our key-frame JSON files on MeViS and Ref-Youtube-VOS for **GLUS<sup><i>S</i></sup>** on [Google Drive](https://drive.google.com/drive/folders/1NcjOguZUmal7Xk7rihyhvs5GRK_RzQSO?usp=sharing).

After the masks are fully generated, run the corresponding evaluation Python file in ``utils``. You may need to set the ground-truth mask path, predicted mask path, and expressions JSON file path. Please refer to the eval files for help on the arguments.

An example:

```shell
python utils/eval_mevis.py \
  --mevis_exp_path="$GLUS_ROOT/data/mevis/valid_u/meta_expressions.json" \
  --mevis_mask_path="$GLUS_ROOT/data/mevis/valid_u/mask_dict.json" \
  --mevis_pred_path="$GLUS_ROOT/generated"
```

In particular, to evaluate performance on the ``Refer-YouTube-VOS Valid`` or ``MeViS Valid`` benchmarks, you need to submit the predicted mask results following the guidance at [MeViS-Evaluation-Server](https://codalab.lisn.upsaclay.fr/competitions/15094) or [RefYoutube-Evaluation-Server](https://codalab.lisn.upsaclay.fr/competitions/3282).
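The J\&F scores reported above average region similarity J (mask IoU) with contour accuracy F. As a reference point, J for a single frame can be sketched as below; this is our own minimal illustration assuming binary NumPy masks, and the eval scripts in ``utils`` remain the authoritative implementation:

```python
import numpy as np

def region_similarity_j(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J: intersection-over-union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return float(np.logical_and(pred, gt).sum() / union)
```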

## Inference and Demo

Please refer to ``demo.ipynb`` to run inference on your own videos and referring expressions.

For more examples, please refer to our [Project Page](https://glus-video.github.io/).

## Citation
If you find this work useful in your research, please consider citing:
```bibtex
@inproceedings{lin2025glus,
  title={GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmentation},
  author={Lin, Lang and Yu, Xueyang and Pang, Ziqi and Wang, Yu-Xiong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}
```

## Acknowledgement
We thank the contributors to the following open-source projects. Our project would be impossible without the inspiration from these excellent researchers.
* [LISA](https://github.com/dvlab-research/LISA)
* [SAM2](https://github.com/facebookresearch/sam2)
* [MeViS](https://github.com/henghuiding/MeViS)
* [VISA](https://github.com/cilinyan/VISA)