wkun03 committed (verified) · Commit dc6b41d · 1 parent: 4c02260

Upload 5 files
README.md CHANGED
Previous content (3 lines, removed): a minimal YAML front matter declaring only `license: apache-2.0`.
---
license: apache-2.0
tags:
- video-moment-localization
- point-supervised
- vision-language
- multimodal
- pytorch
---

<a id="top"></a>
<div align="center">
<h1>🚀 SG-SCI: Explicit Granularity and Implicit Scale Correspondence Learning for Point-Supervised Video Moment Localization</h1>

<p>
<b>Kun Wang</b><sup>1</sup>&nbsp;
<b>Hao Liu</b><sup>1</sup>&nbsp;
<b>Lirong Jie</b><sup>1</sup>&nbsp;
<b>Zixu Li</b><sup>1</sup>&nbsp;
<b>Yupeng Hu</b><sup>1</sup>&nbsp;
<b>Liqiang Nie</b><sup>2</sup>
</p>

<p>
<sup>1</sup>School of Software, Shandong University, Jinan, China<br>
<sup>2</sup>School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
</p>
</div>

This repository provides the official implementation, pre-trained model weights, and configuration files for **SG-SCI**, a framework designed to address explicit granularity alignment and implicit scale perception in Video Moment Localization (VML).

🔗 **Paper:** [Accepted by ACM MM 2024](https://dl.acm.org/doi/10.1145/3664647.3680774)
🔗 **GitHub Repository:** [iLearn-Lab/SG-SCI](https://github.com/iLearn-Lab/SG-SCI)

---

## 📌 Model Information

### 1. Model Name
**SG-SCI** (Semantic Granularity and Scale Correspondence Integration)

### 2. Task Type & Applicable Tasks
- **Task Type:** Point-supervised Video Moment Localization (VML) / Vision-Language / Multimodal Learning
- **Applicable Tasks:** Localizing specific moments in untrimmed videos from textual queries, using only single-frame (point-level) annotations during training to reduce annotation costs.

### 3. Project Introduction
Existing point-supervised VML methods often struggle with explicit granularity alignment and implicit scale perception. **SG-SCI** introduces a Semantic Granularity and Scale Correspondence Integration framework that models the semantic alignment between video and text, allowing the model to enhance and exploit modal feature representations of varying granularities and scales.

> 💡 **Method Highlight:** SG-SCI explicitly models semantic relations across feature granularities (via the GCA module) and adaptively mines implicit semantic scales (via the SCL strategy). It fully supports end-to-end training and multi-modal interaction using only single-frame annotations.

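Point supervision starts from a single annotated frame per moment. One common way to turn that point into a soft training signal is a Gaussian prior centered on the annotated clip; the sketch below is our illustrative assumption (not necessarily SG-SCI's exact formulation), motivated by the `sigma_factor` key in the shipped configs:

```python
import math

def gaussian_prior(point_idx, num_clips, sigma_factor):
    """Soft per-clip weights centered on the single annotated clip.

    sigma is a fraction of the clip count, mirroring the `sigma_factor`
    hyperparameter in the released configs (an assumption on our part).
    """
    sigma = max(sigma_factor * num_clips, 1e-6)
    weights = [math.exp(-0.5 * ((i - point_idx) / sigma) ** 2)
               for i in range(num_clips)]
    total = sum(weights)
    return [w / total for w in weights]

w = gaussian_prior(point_idx=10, num_clips=32, sigma_factor=0.3)
assert max(range(32), key=lambda i: w[i]) == 10  # peak at the annotated clip
```

Clips near the annotated frame receive most of the weight, so the model can learn moment extent without explicit boundary labels.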
### 4. Training Data Source
The model is evaluated on standard VML datasets using point-level supervision:
- **Charades-STA**
- **TACoS**

*(Splits produced by [ViGA](https://github.com/r-cui/ViGA/tree/master))*

---

## 🚀 Usage & Basic Inference

### Step 1: Prepare the Environment
Clone the GitHub repository and set up the Conda environment (evaluated with Python 3.7 and PyTorch 1.10.0):
```bash
git clone https://github.com/iLearn-Lab/SG-SCI.git
cd SG-SCI
```
```bash
conda create --name sg-sci python=3.7 -y
conda activate sg-sci
```

```bash
conda install pytorch=1.10.0 cudatoolkit=11.3.1 -c pytorch -y
pip install numpy scipy pyyaml tqdm
```

### Step 2: Download Model Weights & Data
1. **Pre-trained Checkpoints:** Download the model checkpoints from this repository and place them in your designated `LOGGER_PATH`.
2. **Datasets:** Arrange the Charades-STA and TACoS datasets under the `src/dataset/` directory according to the splits provided by ViGA.

### Step 3: Run Training & Evaluation

**Training from scratch:**
Run the command for the dataset you want to train on:

#### For TACoS
```bash
python -m experiment.train --task tacos
```

#### For Charades-STA
```bash
python -m experiment.train --task charadessta
```

**Evaluation:**
Put the downloaded checkpoints in your `LOGGER_PATH`, then run:

```bash
python -m src.experiment.eval --exp $LOGGER_PATH
```

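VML results are conventionally reported as Recall@1 at temporal IoU thresholds (e.g. IoU ≥ 0.5). A minimal sketch of the temporal IoU computation, following the generic VML convention rather than code taken from this repository:

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) moments, in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))  # overlap length
    union = max(pe, ge) - min(ps, gs)            # span of both moments
    return inter / union if union > 0 else 0.0

# A prediction covering most of the ground truth clears the IoU >= 0.5 bar.
assert temporal_iou((2.0, 8.0), (3.0, 9.0)) > 0.5
```

A predicted moment counts as correct at a given threshold when its IoU with the annotated ground-truth span meets that threshold.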
---

## ⚠️ Limitations & Notes

**Disclaimer:** This framework and its pre-trained weights are intended for **academic research purposes only**.
- The model requires access to the original source datasets (Charades-STA, TACoS) for full evaluation.
- As a point-supervised method, it significantly reduces annotation costs compared with fully-supervised methods, but localization boundary precision may still be inherently limited by the single-frame nature of the ground-truth signals.

---

## 🤝 Acknowledgements & Contact

- **Acknowledgement:** Thanks to the [ViGA](https://github.com/r-cui/ViGA/tree/master) open-source community for strong baselines and tooling, and to all collaborators and contributors of this project.
- **Contact:** If you have any questions, feel free to contact me at `khylon.kun.wang@gmail.com`.

---

## 📝⭐️ Citation

If you find our work or this repository useful in your research, please consider citing our paper:

```bibtex
@inproceedings{wang2024explicit,
  title={Explicit granularity and implicit scale correspondence learning for point-supervised video moment localization},
  author={Wang, Kun and Liu, Hao and Jie, Lirong and Li, Zixu and Hu, Yupeng and Nie, Liqiang},
  booktitle={Proceedings of the 32nd ACM International Conference on Multimedia},
  pages={9214--9223},
  year={2024}
}
```
charades/config.yaml ADDED
```yaml
charadessta:
  batch_size: 512
  clip_frames:
  - 8
  epoch: 30
  feature_dim: 1024
  feature_dir: data/charadessta/i3d
  inter_loss: 0.1
  intra_loss: 0.1
  moment_length_factors:
  - 0.25
  - 0.3
  - 0.35
  momentum: 0.3
  overlapping_factors:
  - 0.0
  - 0.1
  - 0.2
  - 0.3
  - 0.4
  - 0.5
  - 0.6
  - 0.7
  - 0.8
  - 0.9
  pooling_func: max_pooling
  sigma_factor: 0.3
  stride: 4
  video_feature_len: 128
dataset_name: charadessta
exp_dir: log
gpu: '1'
model:
  dim: 512
  dropout: 0.1
  glove_path: data/glove.840B.300d.txt
  momentum_alpha: 0.7
  n_layers: 2
  temp: 0.1
  topk: 1
seed: 0
train:
  clip_norm: 1.0
  dev: false
  init_lr: 0.0001
```
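The `moment_length_factors` and `overlapping_factors` entries suggest a multi-scale sliding-window proposal scheme. Below is a sketch of how such candidate moments could be enumerated; this is our interpretation of the hyperparameters, not code from the repository:

```python
def enumerate_proposals(duration, length_factors, overlap_factors):
    """Enumerate sliding-window moment candidates.

    One window size per length factor (fraction of video duration);
    each overlap factor defines a stride: overlap 0.5 -> stride = 0.5 * window.
    """
    proposals = set()
    for lf in length_factors:
        window = lf * duration
        for ov in overlap_factors:
            stride = window * (1.0 - ov)
            if stride <= 0:
                continue
            start = 0.0
            while start + window <= duration + 1e-9:
                proposals.add((round(start, 4), round(start + window, 4)))
                start += stride
    return sorted(proposals)

props = enumerate_proposals(60.0, [0.25, 0.3, 0.35], [0.0, 0.5])
assert all(0.0 <= s < e <= 60.0 for s, e in props)  # all inside the video
```

Denser overlap factors produce more (and more redundant) candidates, trading recall for compute.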
charades/model_best.pt ADDED
```
version https://git-lfs.github.com/spec/v1
oid sha256:1a763be0003fed60cf8311959e1080076acdc789aefa82d52de4ad17d1f30ac1
size 98963819
```
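`model_best.pt` is stored as a Git LFS pointer: the `oid` is the SHA-256 digest of the real file and `size` its byte count. After downloading the actual checkpoint, it can be verified against the pointer; a generic sketch (the pointer text comes from this repository, the helper names are our own):

```python
import hashlib

POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:1a763be0003fed60cf8311959e1080076acdc789aefa82d52de4ad17d1f30ac1
size 98963819
"""

def parse_lfs_pointer(text):
    """Parse a Git LFS pointer into its algorithm, digest, and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"algo": algo, "digest": digest, "size": int(fields["size"])}

def verify_checkpoint(path, pointer_text):
    """True if the file at `path` matches the pointer's SHA-256 digest."""
    meta = parse_lfs_pointer(pointer_text)
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == meta["digest"]

meta = parse_lfs_pointer(POINTER)
assert meta["algo"] == "sha256" and meta["size"] == 98963819
```

The same check applies to `tacos/model_best.pt` with its own pointer below.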
tacos/config.yaml ADDED
```yaml
tacos:
  batch_size: 64
  clip_frames:
  - 32
  epoch: 80
  feature_dim: 4096
  feature_dir: data/tacos/c3d
  inter_loss: 0.1
  intra_loss: 0.1
  moment_length_factors:
  - 0.05
  - 0.1
  - 0.15
  - 0.2
  - 0.25
  - 0.3
  - 0.35
  - 0.4
  momentum: 0.3
  overlapping_factors:
  - 0.0
  - 0.1
  - 0.2
  - 0.3
  - 0.4
  - 0.5
  - 0.6
  - 0.7
  - 0.8
  - 0.9
  pooling_func: mean_pooling
  sigma_factor: 1.0
  stride: 16
  video_feature_len: 512
train:
  clip_norm: 1.0
  dev: false
  init_lr: 0.0001
dataset_name: tacos
exp_dir: log
gpu: '0'
model:
  dim: 512
  dropout: 0.1
  glove_path: data/glove.840B.300d.txt
  momentum_alpha: 0.7
  n_layers: 2
  temp: 0.1
  topk: 1
seed: 0
```
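The two configs differ in `pooling_func` (`max_pooling` for Charades-STA, `mean_pooling` for TACoS), i.e. how per-frame feature vectors within a clip are aggregated into one clip feature. A toy illustration with a hypothetical helper, not repository code:

```python
def pool_clip(features, mode):
    """Aggregate a list of per-frame feature vectors into one clip vector."""
    dims = range(len(features[0]))
    if mode == "max_pooling":
        return [max(f[d] for f in features) for d in dims]
    if mode == "mean_pooling":
        return [sum(f[d] for f in features) / len(features) for d in dims]
    raise ValueError(f"unknown pooling_func: {mode}")

frames = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
assert pool_clip(frames, "max_pooling") == [3.0, 4.0]
assert pool_clip(frames, "mean_pooling") == [2.0, 2.0]
```

Max pooling keeps the strongest activation per dimension; mean pooling smooths over all frames, which can suit the longer cooking videos in TACoS.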
tacos/model_best.pt ADDED
```
version https://git-lfs.github.com/spec/v1
oid sha256:91aa6f88d84be4641fa24d4335c347c10738e56ac5fdc6a777ae5764fe821695
size 106066283
```