Upload 5 files

- README.md +130 -3
- charades/config.yaml +45 -0
- charades/model_best.pt +3 -0
- tacos/config.yaml +50 -0
- tacos/model_best.pt +3 -0

README.md
CHANGED
---
license: apache-2.0
tags:
- video-moment-localization
- point-supervised
- vision-language
- multimodal
- pytorch
---

<a id="top"></a>
<div align="center">
<h1>🚀 SG-SCI: Explicit Granularity and Implicit Scale Correspondence Learning for Point-Supervised Video Moment Localization</h1>

<p>
<b>Kun Wang</b><sup>1</sup>
<b>Hao Liu</b><sup>1</sup>
<b>Lirong Jie</b><sup>1</sup>
<b>Zixu Li</b><sup>1</sup>
<b>Yupeng Hu</b><sup>1</sup>
<b>Liqiang Nie</b><sup>2</sup>
</p>

<p>
<sup>1</sup>School of Software, Shandong University, Jinan, China<br>
<sup>2</sup>School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
</p>
</div>

This repository contains the official implementation, pre-trained model weights, and configuration files for **SG-SCI**, a novel framework designed to address explicit granularity alignment and implicit scale perception in Video Moment Localization (VML).

🔗 **Paper:** [Accepted by ACM MM 2024](https://dl.acm.org/doi/10.1145/3664647.3680774)
🔗 **GitHub Repository:** [iLearn-Lab/SG-SCI](https://github.com/iLearn-Lab/SG-SCI)

---

## 📌 Model Information

### 1. Model Name
**SG-SCI** (Semantic Granularity and Scale Correspondence Integration)

### 2. Task Type & Applicable Tasks
- **Task Type:** Point-supervised Video Moment Localization (VML) / Vision-Language / Multimodal Learning
- **Applicable Tasks:** Localizing specific moments in untrimmed videos from textual queries, using only single-frame (point-level) annotations during training to reduce annotation costs.

### 3. Project Introduction
Existing point-supervised Video Moment Localization (VML) methods often struggle with explicit granularity alignment and implicit scale perception. **SG-SCI** introduces a Semantic Granularity and Scale Correspondence Integration framework that models the semantic alignment between video and text, helping the model effectively enhance and exploit modal feature representations of varying granularities and scales.

> 💡 **Method Highlight:** SG-SCI explicitly models semantic relations across feature granularities (via the GCA module) and adaptively mines implicit semantic scales (via the SCL strategy). It fully supports end-to-end training and multimodal interaction using only single-frame annotations.
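
To make "granularity alignment" concrete, here is a purely illustrative sketch of scoring video-text similarity at two granularities (clip-word and video-sentence). This is a toy example with random features, not the SG-SCI architecture or the GCA module itself:

```python
import torch
import torch.nn.functional as F

# Toy features: 128 clip features and 12 word features, both 512-d.
clips = F.normalize(torch.randn(128, 512), dim=-1)   # fine-grained video
words = F.normalize(torch.randn(12, 512), dim=-1)    # fine-grained text

# Fine-grained alignment: a clip-word cosine-similarity matrix.
fine = clips @ words.T                                # shape (128, 12)

# Coarse-grained alignment: pooled video vs. pooled sentence feature.
video = F.normalize(clips.mean(dim=0), dim=-1)
sentence = F.normalize(words.mean(dim=0), dim=-1)
coarse = video @ sentence                             # scalar similarity

print(fine.shape, float(coarse))
```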

### 4. Training Data Source
The model is trained and evaluated on two standard VML datasets with point-level supervision:
- **Charades-STA**
- **TACoS**

*(Splits produced by [ViGA](https://github.com/r-cui/ViGA/tree/master))*

---

## 🚀 Usage & Basic Inference

### Step 1: Prepare the Environment
Clone the GitHub repository and set up the Conda environment (evaluated on Python 3.7 and PyTorch 1.10.0):

```bash
git clone https://github.com/iLearn-Lab/SG-SCI.git
cd SG-SCI
```

```bash
conda create --name sg-sci python=3.7 -y
source activate sg-sci
```

```bash
conda install pytorch=1.10.0 cudatoolkit=11.3.1 -c pytorch -c conda-forge -y
pip install numpy scipy pyyaml tqdm
```
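
Optionally, a quick sanity check that the installed build is usable (prints the PyTorch version and whether CUDA is visible):

```python
import torch

# Expect 1.10.0; CUDA availability depends on your driver and GPU.
print(torch.__version__)
print(torch.cuda.is_available())
```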

### Step 2: Download Model Weights & Data
1. **Pre-trained Checkpoints:** Download the model checkpoints from this repository and place them in your designated `LOGGER_PATH` (a quick load check is sketched below).
2. **Datasets:** Ensure the Charades-STA and TACoS datasets are structured in the `src/dataset/` directory according to the splits provided by ViGA.
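
To verify a downloaded checkpoint is readable, a minimal sketch (this assumes the `.pt` file is a standard PyTorch serialized object; the exact contents of the checkpoint are not documented here):

```python
import torch

# Load on CPU; map_location avoids requiring a GPU for inspection.
ckpt = torch.load("charades/model_best.pt", map_location="cpu")

# Depending on how the repo saves checkpoints, this may be a raw
# state_dict or a dict wrapping one (an assumption, not documented).
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:10])
```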

### Step 3: Run Training & Evaluation

**Training from Scratch:**
Depending on the dataset you want to train on, run one of the following commands:

#### For TACoS
```bash
python -m experiment.train --task tacos
```

#### For Charades-STA
```bash
python -m experiment.train --task charadessta
```

**Evaluation:**
Put the downloaded checkpoints in your `LOGGER_PATH`, then run:

```bash
python -m src.experiment.eval --exp $LOGGER_PATH
```

---

## ⚠️ Limitations & Notes

**Disclaimer:** This framework and its pre-trained weights are intended for **academic research purposes only**.
- The model requires access to the original source datasets (Charades-STA, TACoS) for full evaluation.
- As a point-supervised method, SG-SCI significantly reduces annotation costs compared to fully-supervised methods; however, localization boundary precision may be inherently limited by the single-frame nature of the ground-truth signals.

---

## 🤝 Acknowledgements & Contact

- **Acknowledgement:** Thanks to the [ViGA](https://github.com/r-cui/ViGA/tree/master) open-source community for strong baselines and tooling. Thanks to all collaborators and contributors of this project.
- **Contact:** If you have any questions, feel free to contact me at `khylon.kun.wang@gmail.com`.

---

## 📝⭐️ Citation

If you find our work or this repository useful in your research, please consider citing our paper:

```bibtex
@inproceedings{wang2024explicit,
  title={Explicit granularity and implicit scale correspondence learning for point-supervised video moment localization},
  author={Wang, Kun and Liu, Hao and Jie, Lirong and Li, Zixu and Hu, Yupeng and Nie, Liqiang},
  booktitle={Proceedings of the 32nd ACM International Conference on Multimedia},
  pages={9214--9223},
  year={2024}
}
```

charades/config.yaml
ADDED

```yaml
charadessta:
  batch_size: 512
  clip_frames:
  - 8
  epoch: 30
  feature_dim: 1024
  feature_dir: data/charadessta/i3d
  inter_loss: 0.1
  intra_loss: 0.1
  moment_length_factors:
  - 0.25
  - 0.3
  - 0.35
  momentum: 0.3
  overlapping_factors:
  - 0.0
  - 0.1
  - 0.2
  - 0.3
  - 0.4
  - 0.5
  - 0.6
  - 0.7
  - 0.8
  - 0.9
  pooling_func: max_pooling
  sigma_factor: 0.3
  stride: 4
  video_feature_len: 128
dataset_name: charadessta
exp_dir: log
gpu: '1'
model:
  dim: 512
  dropout: 0.1
  glove_path: data/glove.840B.300d.txt
  momentum_alpha: 0.7
  n_layers: 2
  temp: 0.1
  topk: 1
seed: 0
train:
  clip_norm: 1.0
  dev: false
  init_lr: 0.0001
```
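
As an illustration of how such a config might be consumed, a minimal sketch using the `pyyaml` dependency installed in Step 1 (the nesting under the dataset key is an assumption from the file layout; the repo's actual config-loading code may differ):

```python
import yaml

with open("charades/config.yaml") as f:
    cfg = yaml.safe_load(f)

# Dataset-specific hyperparameters live under the dataset key (assumed).
ds = cfg["charadessta"]
print(ds["batch_size"], ds["feature_dim"], ds["moment_length_factors"])

# Shared settings sit at the top level.
print(cfg["model"]["dim"], cfg["train"]["init_lr"])
```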
charades/model_best.pt
ADDED

```text
version https://git-lfs.github.com/spec/v1
oid sha256:1a763be0003fed60cf8311959e1080076acdc789aefa82d52de4ad17d1f30ac1
size 98963819
```
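
The weights are stored via Git LFS, so the pointer above records the SHA-256 of the real file. After `git lfs pull` (or a manual download), one way to verify the checkpoint matches the pointer — a minimal sketch:

```python
import hashlib

EXPECTED = "1a763be0003fed60cf8311959e1080076acdc789aefa82d52de4ad17d1f30ac1"

h = hashlib.sha256()
with open("charades/model_best.pt", "rb") as f:
    # Stream in 1 MiB chunks so the ~99 MB file need not fit in memory.
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)

assert h.hexdigest() == EXPECTED, "checkpoint does not match the LFS pointer"
print("OK:", h.hexdigest())
```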
tacos/config.yaml
ADDED

```yaml
tacos:
  batch_size: 64
  clip_frames:
  - 32
  epoch: 80
  feature_dim: 4096
  feature_dir: data/tacos/c3d
  inter_loss: 0.1
  intra_loss: 0.1
  moment_length_factors:
  - 0.05
  - 0.1
  - 0.15
  - 0.2
  - 0.25
  - 0.3
  - 0.35
  - 0.4
  momentum: 0.3
  overlapping_factors:
  - 0.0
  - 0.1
  - 0.2
  - 0.3
  - 0.4
  - 0.5
  - 0.6
  - 0.7
  - 0.8
  - 0.9
  pooling_func: mean_pooling
  sigma_factor: 1.0
  stride: 16
  video_feature_len: 512
train:
  clip_norm: 1.0
  dev: false
  init_lr: 0.0001
dataset_name: tacos
exp_dir: log
gpu: '0'
model:
  dim: 512
  dropout: 0.1
  glove_path: data/glove.840B.300d.txt
  momentum_alpha: 0.7
  n_layers: 2
  temp: 0.1
  topk: 1
seed: 0
```
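
The `moment_length_factors` and `overlapping_factors` lists suggest sliding-window candidate moments whose lengths are fractions of the video. A hypothetical sketch of how such windows could be enumerated over the 512 clip features above (purely an illustration of the config semantics, not the repo's actual proposal code):

```python
def enumerate_windows(num_clips, length_factors, overlap_factors):
    """Enumerate hypothetical candidate moments as (start, end) clip indices."""
    windows = set()
    for lf in length_factors:
        win = max(1, round(lf * num_clips))
        for ov in overlap_factors:
            # Larger overlap -> smaller stride between window starts.
            step = max(1, round(win * (1.0 - ov)))
            for start in range(0, num_clips - win + 1, step):
                windows.add((start, start + win))
    return sorted(windows)

# Example with the TACoS settings: windows of 5%-40% of the video length.
cands = enumerate_windows(512,
                          [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4],
                          [0.0, 0.3, 0.6, 0.9])
print(len(cands), cands[:3])
```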
tacos/model_best.pt
ADDED

```text
version https://git-lfs.github.com/spec/v1
oid sha256:91aa6f88d84be4641fa24d4335c347c10738e56ac5fdc6a777ae5764fe821695
size 106066283
```