wkun03 committed on
Commit 35d6ec5 · verified · 1 Parent(s): f19e74a

Upload 3 files

Files changed (3):
  1. README.md +126 -3
  2. f30k_H64_bert.tar +3 -0
  3. f30k_H64_bigru.tar +3 -0
README.md CHANGED
---
license: apache-2.0
tags:
- image-text-retrieval
- vision-language
- multimodal
- redundancy-mitigation
- deep-hashing
- pytorch
---

<a id="top"></a>
<div align="center">
<h1>🚀 MEET: Redundancy Mitigation: Towards Accurate and Efficient Image-Text Retrieval</h1>

<p>
<b>Kun Wang</b><sup>1</sup>&nbsp;
<b>Yupeng Hu</b><sup>1</sup>&nbsp;
<b>Hao Liu</b><sup>1</sup>&nbsp;
<b>Lirong Jie</b><sup>1</sup>&nbsp;
<b>Liqiang Nie</b><sup>2</sup>
</p>

<p>
<sup>1</sup>School of Software, Shandong University, Jinan, China<br>
<sup>2</sup>School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
</p>
</div>

This repository provides the official implementation, pre-trained model weights, and configuration files for **MEET**, a framework explicitly designed to mitigate semantic and relationship redundancy in Image-Text Retrieval (ITR).

🔗 **Paper:** [Accepted by TCSVT 2025](https://ieeexplore.ieee.org/document/11299108)
🔗 **GitHub Repository:** [iLearn-Lab/TCSVT25-MEET](https://github.com/iLearn-Lab/TCSVT25-MEET)

---

## 📌 Model Information

### 1. Model Name
**MEET** (iMage-text retrieval rEdundancy miTigation)

### 2. Task Type & Applicable Tasks
- **Task Type:** Image-Text Retrieval (ITR) / Vision-Language / Multimodal Learning
- **Applicable Tasks:** Accurate and efficient cross-modal retrieval. MEET specifically targets semantic redundancy within unimodal representations and relationship redundancy in cross-modal alignments.

### 3. Project Introduction
Existing Image-Text Retrieval methods often suffer from a fundamental yet overlooked challenge: redundancy. **MEET** is an iMage-text retrieval rEdundancy miTigation framework that explicitly analyzes and addresses the ITR problem from a redundancy perspective, helping the model produce compact yet highly discriminative representations for accurate and efficient retrieval.

> 💡 **Method Highlight:** MEET mitigates semantic redundancy by repurposing deep hashing and quantization, and progressively refines the cross-modal alignment space by filtering misleading negative samples and adaptively reweighting informative pairs. It supports end-to-end model training, diverse feature encoders, and unified optimization.

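The hashing idea in the highlight above can be illustrated with a toy sketch. This is **not** the MEET implementation: it only shows why compact binary codes make retrieval cheap, using plain sign-thresholding and Hamming-distance ranking on synthetic embeddings. The 64-bit code length mirrors the `--H 64` training flag; `--M`/`--K` presumably configure quantization codebooks, which this sketch omits.

```python
import random

def binary_hash(vec):
    """Sign-threshold a real-valued embedding into a 0/1 code (a toy stand-in
    for a learned deep-hashing head)."""
    return [1 if v > 0 else 0 for v in vec]

def hamming(a, b):
    """Hamming distance between two equal-length binary codes."""
    return sum(x != y for x, y in zip(a, b))

rng = random.Random(0)
H = 64                                                   # code length, cf. --H 64
images = [[rng.gauss(0, 1) for _ in range(H)] for _ in range(100)]
# Paired "text" embeddings: lightly perturbed copies of the image embeddings.
texts = [[v + 0.1 * rng.gauss(0, 1) for v in img] for img in images]

db = [binary_hash(img) for img in images]                # compact image codes
query = binary_hash(texts[7])                            # code for one caption
ranked = sorted(range(len(db)), key=lambda i: hamming(query, db[i]))
print(ranked[0])                                         # the paired image, index 7
```

Because the codes are short and binary, ranking the whole database costs only bitwise comparisons, which is the efficiency motivation behind hashing-based ITR.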
### 4. Training Data Source
The model is trained and evaluated with Bi-GRU and BERT text features on standard ITR datasets:
- **MSCOCO** (1K and 5K test splits)
- **Flickr30K**

*(Splits follow HREM.)*

---

## 🚀 Usage & Basic Inference

### Step 1: Prepare the Environment
Clone the GitHub repository and install the required dependencies (evaluated with Python >= 3.8 and PyTorch >= 1.7.0):
```bash
git clone https://github.com/iLearn-Lab/TCSVT25-MEET.git
cd MEET
pip install "torchvision>=0.8.0" "transformers>=2.1.1" opencv-python tensorboard
```
(The version specifiers are quoted so the shell does not interpret `>=` as a redirection.)

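When checking the minimums above by hand, note that naive string comparison misorders versions (`"1.10.2" < "1.7.0"` lexicographically). A small helper, assuming plain dotted numeric version strings (real builds may append local tags such as `+cu117`, which this sketch does not handle):

```python
def meets_minimum(installed: str, required: str) -> bool:
    """Compare dotted version strings numerically, component by component."""
    parse = lambda v: tuple(int(part) for part in v.split("."))
    return parse(installed) >= parse(required)

# e.g. against torch.__version__ / sys.version checks:
print(meets_minimum("1.10.2", "1.7.0"))   # True: compared numerically, not lexically
print(meets_minimum("1.6.0", "1.7.0"))    # False: below the PyTorch minimum
```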
### Step 2: Download Model Weights & Data
1. **Pre-trained Checkpoints:** Download the model checkpoints and place them in your designated `LOGGER_PATH`.
2. **Language Models & Features:**
   - Obtain the pretrained files for [BERT-base](https://huggingface.co/bert-base-uncased).
   - Obtain pretrained VSE model checkpoints (e.g., [ESA](https://github.com/KevinLight831/ESA)).
3. **Datasets:** Structure the MSCOCO and Flickr30K datasets as outlined in the data tree structure (e.g., `coco_precomp`, `f30k_precomp`, `vocab`, `VSE`).

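For orientation, one plausible layout consistent with the directory names mentioned above (this tree is an illustration only; consult the GitHub repository for the authoritative structure):

```
data/
├── coco_precomp/   # precomputed MSCOCO features
├── f30k_precomp/   # precomputed Flickr30K features
├── vocab/          # vocabulary files for the Bi-GRU text encoder
└── VSE/            # pretrained VSE checkpoints (e.g., ESA)
```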
### Step 3: Run Training & Evaluation

**Evaluation:**
Depending on the text features you are using, open the corresponding script (`at/lib/test.py` for BiGRU, `at_bert/lib/test.py` for BERT) and set `RUN_PATH` to your checkpoint.

```bash
# For the MSCOCO 1K 5-fold splits, first generate the folds:
python scripts/make_coco_1k_folds.py

# Run testing (ensure MODEL_PATH is set to the correct VSE weights):
PYTHONPATH=. python -m lib.test
```

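The test scripts report standard ITR metrics. For reference, here is a minimal Recall@K sketch, assuming one ground-truth item per query (the actual MSCOCO/Flickr30K protocol typically pairs each image with five captions, which the repo's scripts handle):

```python
def recall_at_k(sim, k):
    """sim[i][j] = similarity of query i to gallery item j.
    Returns the fraction of queries whose ground-truth item (index i)
    appears among their top-k retrieved items."""
    hits = 0
    for i, row in enumerate(sim):
        topk = sorted(range(len(row)), key=lambda j: -row[j])[:k]
        hits += i in topk
    return hits / len(sim)

sim = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
    [0.6, 0.7, 0.5],   # query 2's true match is only ranked 2nd or lower
]
print(recall_at_k(sim, 1))  # 2 of 3 queries rank their match first
```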
**Training from Scratch:**
Specify the dataset name (`coco_precomp` or `f30k_precomp`) via the `--data_name` flag:

```bash
PYTHONPATH=. python hq_train.py --num_epochs 12 --batch_size 128 --workers 8 --H 64 --M 8 --K 8 --data_name <dataset_name>
```

---

## ⚠️ Limitations & Notes

**Disclaimer:** This framework and its pre-trained weights are intended for **academic research purposes only**.
- Full evaluation requires access to the original source datasets (MSCOCO, Flickr30K).
- Although designed for redundancy mitigation, performance may still fluctuate under extreme domain shifts not covered by the training distribution.

---

## ⚠️ Acknowledgements & Contact

- **Acknowledgement:** Thanks to the [HREM](https://github.com/crossmodalgroup/hrem) open-source community for strong baselines and tooling, and to all collaborators and contributors to this project.
- **Contact:** If you have any questions, feel free to contact me at `khylon.kun.wang@gmail.com`.

---

## 📝⭐️ Citation

If you find our work or this repository useful in your research, please consider citing our paper:

```bibtex
@article{wang2025redundancy,
  title={Redundancy Mitigation: Towards Accurate and Efficient Image-Text Retrieval},
  author={Wang, Kun and Hu, Yupeng and Liu, Hao and Jie, Lirong and Nie, Liqiang},
  journal={IEEE Transactions on Circuits and Systems for Video Technology},
  year={2025},
  publisher={IEEE}
}
```
f30k_H64_bert.tar ADDED (Git LFS pointer)

    version https://git-lfs.github.com/spec/v1
    oid sha256:62f96f14faccfde11b48678eb2415e039845a436eaef010ff3ecee466c1f5988
    size 524705767

f30k_H64_bigru.tar ADDED (Git LFS pointer)

    version https://git-lfs.github.com/spec/v1
    oid sha256:b5a76922af6e436242ee0ef38dfa5a03ad008c066df0a9e86d08d5b201cb3900
    size 69563673