---
license: apache-2.0
tags:
- image-text-retrieval
- vision-language
- multimodal
- redundancy-mitigation
- deep-hashing
- pytorch
---

<a id="top"></a>
<div align="center">
<h1>MEET: Redundancy Mitigation: Towards Accurate and Efficient Image-Text Retrieval</h1>

<p>
<b>Kun Wang</b><sup>1</sup>
<b>Yupeng Hu</b><sup>1</sup>
<b>Hao Liu</b><sup>1</sup>
<b>Lirong Jie</b><sup>1</sup>
<b>Liqiang Nie</b><sup>2</sup>
</p>

<p>
<sup>1</sup>School of Software, Shandong University, Jinan, China<br>
<sup>2</sup>School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
</p>
</div>

This repository provides the official implementation, pre-trained model weights, and configuration files for **MEET**, a framework explicitly designed to address semantic and relationship redundancy in Image-Text Retrieval (ITR).

**Paper:** [Accepted by TCSVT 2025](https://ieeexplore.ieee.org/document/11299108)

**GitHub Repository:** [iLearn-Lab/TCSVT25-MEET](https://github.com/iLearn-Lab/TCSVT25-MEET)

---

## Model Information

### 1. Model Name
**MEET** (iMage-text retrieval rEdundancy miTigation)

### 2. Task Type & Applicable Tasks
- **Task Type:** Image-Text Retrieval (ITR) / Vision-Language / Multimodal Learning
- **Applicable Tasks:** Accurate and efficient cross-modal retrieval. MEET specifically targets redundancy, mitigating semantic redundancy within unimodal representations and relationship redundancy in cross-modal alignments.

### 3. Project Introduction
Existing Image-Text Retrieval methods suffer from a fundamental yet overlooked challenge: redundancy. **MEET** introduces an iMage-text retrieval rEdundancy miTigation framework that explicitly analyzes and addresses the ITR problem from a redundancy perspective, enabling the model to produce compact yet highly discriminative representations for accurate and efficient retrieval.

> 💡 **Method Highlight:** MEET mitigates semantic redundancy by repurposing deep hashing and quantization, and progressively refines the cross-modal alignment space by filtering misleading negative samples and adaptively reweighting informative pairs. It supports end-to-end model training, diverse feature encoders, and unified optimization.

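The deep-hashing idea behind this highlight can be sketched in a few lines: binarize real-valued embeddings into compact codes and retrieve by Hamming distance. The toy below is illustrative only, not MEET's actual hashing head; `binarize`, `hamming`, and `retrieve` are hypothetical helpers, and H=64 simply echoes the code length of the released `f30k_H64_*` checkpoints.

```python
# Illustrative sketch (not MEET's exact method): sign-based hashing plus
# Hamming-distance retrieval, showing why binary codes are compact and fast.

def binarize(vec):
    """Map a real-valued embedding to a binary hash code (list of 0/1 bits)."""
    return [1 if v > 0 else 0 for v in vec]

def hamming(code_a, code_b):
    """Hamming distance between two equal-length binary codes."""
    return sum(a != b for a, b in zip(code_a, code_b))

def retrieve(query_code, gallery_codes):
    """Return gallery indices ranked by ascending Hamming distance."""
    dists = [hamming(query_code, g) for g in gallery_codes]
    return sorted(range(len(gallery_codes)), key=lambda i: dists[i])

# Toy example: one text query against three image codes.
text = binarize([0.3, -1.2, 0.8, 0.1])
images = [binarize(v) for v in ([0.5, -0.4, 0.9, 0.2],   # near-duplicate of the query
                                [-0.1, 0.7, -0.3, 0.4],
                                [0.2, -0.6, 0.1, -0.9])]
ranking = retrieve(text, images)  # the matching image is ranked first
```

In practice the binary codes are learned end-to-end rather than produced by a fixed sign threshold, but the retrieval side (XOR/popcount over short codes instead of dense dot products) is what delivers the efficiency gain.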
### 4. Training Data Source
The model is evaluated using features from Bi-GRU and BERT on standard ITR datasets:
- **MSCOCO** (1K and 5K splits)
- **Flickr30K**

*(Splits produced by HREM)*

---

## Usage & Basic Inference

### Step 1: Prepare the Environment
Clone the GitHub repository and install the required dependencies (evaluated on Python >= 3.8 and PyTorch >= 1.7.0):

```bash
git clone https://github.com/iLearn-Lab/TCSVT25-MEET.git
cd TCSVT25-MEET
pip install "torchvision>=0.8.0" "transformers>=2.1.1" opencv-python tensorboard
```

### Step 2: Download Model Weights & Data
1. **Pre-trained Checkpoints:** Download the model checkpoints and place them in your designated `LOGGER_PATH`.
2. **Language Models & Features:**
   - Obtain the pretrained files for [BERT-base](https://huggingface.co/bert-base-uncased).
   - Obtain pretrained VSE model checkpoints (e.g., [ESA](https://github.com/KevinLight831/ESA)).
3. **Datasets:** Structure the MSCOCO and Flickr30K datasets as outlined in the data tree structure (e.g., `coco_precomp`, `f30k_precomp`, `vocab`, `VSE`).

| 78 |
+
### Step 3: Run Training & Evaluation
|
| 79 |
+
|
| 80 |
+
**Evaluation (Eval):**
|
| 81 |
+
Depending on the text features you are using, open the corresponding script (`at/lib/test.py` for BiGRU, `at_bert/lib/test.py` for BERT) and modify the `RUN_PATH`.
|
| 82 |
+
|
| 83 |
+
#### For MSCOCO 1K 5-fold splits, first generate the folds:
|
| 84 |
+
python scripts/make_coco_1k_folds.py
|
| 85 |
+
|
| 86 |
+
#### Run testing (ensure MODEL_PATH is set to the correct VSE weights)
|
| 87 |
+
PYTHONPATH=. python -m lib.test
|
| 88 |
+
|
| 89 |
+
|
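The 5-fold MSCOCO 1K protocol amounts to computing recall@K on each fold and averaging. A minimal sketch of that metric, assuming ground-truth matches lie on the diagonal of the similarity matrix (`recall_at_k` is a hypothetical helper, not the repo's evaluation code):

```python
# Illustrative recall@K: count queries whose ground-truth match (item i for
# query i) appears among the top-K ranked gallery items.

def recall_at_k(sim, k):
    """sim[i][j] = similarity of query i to gallery item j; match is (i, i)."""
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: -row[j])
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)

# Averaged over folds, as done for the MSCOCO 1K splits:
folds = [
    [[0.9, 0.1, 0.2], [0.3, 0.8, 0.1], [0.2, 0.4, 0.7]],  # every match ranked first
    [[0.1, 0.9, 0.2], [0.8, 0.3, 0.1], [0.2, 0.4, 0.7]],  # two matches missed at K=1
]
r1_per_fold = [recall_at_k(f, 1) for f in folds]
mean_r1 = sum(r1_per_fold) / len(r1_per_fold)
```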
**Training from Scratch:**
Make sure to specify the dataset name (`coco_precomp` or `f30k_precomp`) after the `--data_name` flag:

```bash
PYTHONPATH=. python hq_train.py --num_epochs 12 --batch_size 128 --workers 8 --H 64 --M 8 --K 8 --data_name <dataset_name>
```

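For intuition about the alignment objective such a run optimizes, here is a generic hinge-based triplet loss with hardest negatives, a common ITR baseline. This is illustrative only: the actual objective in `hq_train.py`, with its negative filtering and adaptive reweighting, differs.

```python
# Generic hardest-negative triplet loss over an in-batch similarity matrix
# (a standard ITR baseline objective, not MEET's exact loss).

def triplet_loss(sim, margin=0.2):
    """sim[i][j]: image i vs. text j similarity; positives on the diagonal."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        # hardest negative text for image i, and hardest negative image for text i
        hardest_i2t = max(sim[i][j] for j in range(n) if j != i)
        hardest_t2i = max(sim[j][i] for j in range(n) if j != i)
        total += max(0.0, margin + hardest_i2t - pos)
        total += max(0.0, margin + hardest_t2i - pos)
    return total / n

# A well-separated batch incurs zero loss: every negative sits at least
# `margin` below its positive.
good = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
loss = triplet_loss(good)
```

The "misleading negatives" the paper filters are exactly the off-diagonal pairs that this naive hardest-negative rule can latch onto even when they are semantically valid matches.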
---

## ⚠️ Limitations & Notes

**Disclaimer:** This framework and its pre-trained weights are intended for **academic research purposes only**.
- The model requires access to the original source datasets (MSCOCO, Flickr30K) for full evaluation.
- Although designed for redundancy mitigation, performance may still fluctuate under extreme domain shifts not covered by the training distribution.

---

## Acknowledgements & Contact

- **Acknowledgement:** Thanks to the [HREM](https://github.com/crossmodalgroup/hrem) open-source community for strong baselines and tooling, and to all collaborators and contributors of this project.
- **Contact:** If you have any questions, feel free to contact me at `khylon.kun.wang@gmail.com`.

---

## ⭐️ Citation

If you find our work or this repository useful in your research, please consider citing our paper:

```bibtex
@article{wang2025redundancy,
  title={Redundancy Mitigation: Towards Accurate and Efficient Image-Text Retrieval},
  author={Wang, Kun and Hu, Yupeng and Liu, Hao and Jie, Lirong and Nie, Liqiang},
  journal={IEEE Transactions on Circuits and Systems for Video Technology},
  year={2025},
  publisher={IEEE}
}
```
**f30k_H64_bert.tar** (added as a Git LFS pointer):

```text
version https://git-lfs.github.com/spec/v1
oid sha256:62f96f14faccfde11b48678eb2415e039845a436eaef010ff3ecee466c1f5988
size 524705767
```

**f30k_H64_bigru.tar** (added as a Git LFS pointer):

```text
version https://git-lfs.github.com/spec/v1
oid sha256:b5a76922af6e436242ee0ef38dfa5a03ad008c066df0a9e86d08d5b201cb3900
size 69563673
```
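The LFS pointers above record each checkpoint's expected SHA-256 digest and size. A small sketch for verifying a downloaded file before use (`sha256_of` is a hypothetical helper, not part of the repo):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            digest.update(block)
    return digest.hexdigest()

# Usage: compare against the `oid sha256:...` line of the pointer, e.g.
# sha256_of("f30k_H64_bert.tar") should equal
# "62f96f14faccfde11b48678eb2415e039845a436eaef010ff3ecee466c1f5988".
```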