# 🚀 MEET: Redundancy Mitigation: Towards Accurate and Efficient Image-Text Retrieval
Kun Wang1
Yupeng Hu1
Hao Liu1
Lirong Jie1
Liqiang Nie2
1School of Software, Shandong University, Jinan, China
2School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
This repository provides the official implementation, pre-trained model weights, and configuration files for **MEET**, a novel framework explicitly designed to mitigate semantic and relationship redundancy in Image-Text Retrieval (ITR).
🔗 **Paper:** [Accepted by TCSVT 2025](https://ieeexplore.ieee.org/document/11299108)
🔗 **GitHub Repository:** [iLearn-Lab/TCSVT25-MEET](https://github.com/iLearn-Lab/TCSVT25-MEET)
---
## 📌 Model Information
### 1. Model Name
**MEET** (iMage-text retrieval rEdundancy miTigation)
### 2. Task Type & Applicable Tasks
- **Task Type:** Image-Text Retrieval (ITR) / Vision-Language / Multimodal Learning
- **Applicable Tasks:** Accurate and efficient cross-modal retrieval. It specifically addresses redundancy by mitigating semantic redundancy within unimodal representations and relationship redundancy in cross-modal alignments.
### 3. Project Introduction
Existing Image-Text Retrieval methods often suffer from a fundamental yet overlooked challenge: redundancy. **MEET** introduces an iMage-text retrieval rEdundancy miTigation framework to explicitly analyze and address the ITR problem from a redundancy perspective. This approach helps the model effectively produce compact yet highly discriminative representations for accurate and efficient retrieval.
> 💡 **Method Highlight:** MEET mitigates semantic redundancy by repurposing deep hashing and quantization, and progressively refines the cross-modal alignment space by filtering misleading negative samples and adaptively reweighting informative pairs. It supports end-to-end model training, diverse feature encoders, and unified optimization.
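The highlight above mentions repurposing deep hashing and quantization to compress unimodal representations. As a rough illustration of the general technique (not MEET's actual implementation; the reading of the training flags `--H 64 --M 8 --K 8` as code length, subspace count, and codewords per subspace is our assumption), here is a minimal NumPy sketch of product quantization, which snaps each subspace of an embedding to its nearest codeword to yield compact, redundancy-reduced codes:

```python
import numpy as np

def product_quantize(x, codebooks):
    """Assign each vector in x (N, D) to its nearest codeword per subspace.

    codebooks has shape (M, K, D // M): M subspaces, K codewords each.
    Returns integer codes (N, M) and the reconstructed vectors (N, D).
    """
    n, d = x.shape
    m, k, sub = codebooks.shape
    parts = x.reshape(n, m, sub)                                     # split into M subspaces
    dists = ((parts[:, :, None, :] - codebooks[None]) ** 2).sum(-1)  # (N, M, K) squared distances
    codes = dists.argmin(-1)                                         # nearest codeword per subspace
    recon = codebooks[np.arange(m), codes]                           # gather codewords, (N, M, sub)
    return codes, recon.reshape(n, d)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 64))      # toy embeddings, dim 64 (cf. --H 64, assumed meaning)
cb = rng.normal(size=(8, 8, 8))   # 8 subspaces x 8 codewords (cf. --M 8 --K 8, assumed)
codes, recon = product_quantize(x, cb)
print(codes.shape, recon.shape)   # (4, 8) (4, 64)
```

Storing the `(N, 8)` integer codes instead of the `(N, 64)` float embeddings is what makes retrieval both compact and fast; see the paper for how MEET learns the codebooks end-to-end.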
### 4. Training Data Source
The model is trained and evaluated with Bi-GRU and BERT text features on standard ITR datasets:
- **MSCOCO** (1K and 5K splits)
- **Flickr30K**
*(Splits produced by HREM)*
---
## 🚀 Usage & Basic Inference
### Step 1: Prepare the Environment
Clone the GitHub repository and install the required dependencies (evaluated on Python >= 3.8 and PyTorch >= 1.7.0):
```bash
git clone https://github.com/iLearn-Lab/TCSVT25-MEET.git
cd TCSVT25-MEET
pip install "torchvision>=0.8.0" "transformers>=2.1.1" opencv-python tensorboard
```
### Step 2: Download Model Weights & Data
1. **Pre-trained Checkpoints:** Download the model checkpoints and place them in your designated `LOGGER_PATH`.
2. **Language Models & Features:**
- Obtain pretrained files for [BERT-base](https://huggingface.co/bert-base-uncased).
- Obtain pretrained VSE model checkpoints (e.g., [ESA](https://github.com/KevinLight831/ESA) as an example).
3. **Datasets:** Structure the MSCOCO and Flickr30K datasets as outlined in the data tree structure (e.g., `coco_precomp`, `f30k_precomp`, `vocab`, `VSE`).
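The exact tree is documented in the repository; one plausible arrangement of the directories named above (the comments are our guesses, so verify against the repository's data tree) is:

```
data/
├── coco_precomp/   # pre-extracted MSCOCO features (assumed)
├── f30k_precomp/   # pre-extracted Flickr30K features (assumed)
├── vocab/          # vocabulary files for the Bi-GRU encoder (assumed)
└── VSE/            # pretrained VSE checkpoints, e.g. ESA (assumed)
```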
### Step 3: Run Training & Evaluation
**Evaluation:**
Depending on the text features you are using, open the corresponding script (`at/lib/test.py` for BiGRU, `at_bert/lib/test.py` for BERT) and modify the `RUN_PATH`.
#### For MSCOCO 1K 5-fold splits, first generate the folds:
```bash
python scripts/make_coco_1k_folds.py
```
#### Run testing (ensure `MODEL_PATH` is set to the correct VSE weights):
```bash
PYTHONPATH=. python -m lib.test
```
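For context, ITR benchmarks report Recall@K: the fraction of queries whose ground-truth match appears among the top K retrieved results. A minimal sketch of the metric itself (not the repository's evaluation code), assuming one matching text per image stored at the same index:

```python
import numpy as np

def recall_at_k(sim, k):
    """Recall@K for image-to-text retrieval.

    sim: (N, N) similarity matrix where sim[i, j] scores image i against
    text j, and the matching text for image i sits at column i.
    """
    ranks = (-sim).argsort(axis=1)    # texts sorted best-first for each image
    hits = (ranks[:, :k] == np.arange(len(sim))[:, None]).any(axis=1)
    return hits.mean()

sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.2, 0.8],
                [0.1, 0.7, 0.4]])
print(recall_at_k(sim, 1))            # only image 0 ranks its own text first -> 1/3
```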
**Training from Scratch:**
Make sure to specify the dataset name (`coco_precomp` or `f30k_precomp`) after the `--data_name` flag:
```bash
PYTHONPATH=. python hq_train.py --num_epochs 12 --batch_size 128 --workers 8 --H 64 --M 8 --K 8 --data_name
```