---
license: apache-2.0
tags:
- pytorch
---

# 🚀 MEET: Redundancy Mitigation: Towards Accurate and Efficient Image-Text Retrieval

Kun Wang¹  Yupeng Hu¹  Hao Liu¹  Lirong Jie¹  Liqiang Nie²

¹School of Software, Shandong University, Jinan, China
²School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China

This repository provides the official implementation, pre-trained model weights, and configuration files for **MEET**, a novel framework explicitly designed to address semantic and relationship redundancy in Image-Text Retrieval (ITR).

🔗 **Paper:** [Accepted by TCSVT 2025](https://ieeexplore.ieee.org/document/11299108)
🔗 **GitHub Repository:** [iLearn-Lab/TCSVT25-MEET](https://github.com/iLearn-Lab/TCSVT25-MEET)

---

## 📌 Model Information

### 1. Model Name

**MEET** (iMage-text retrieval rEdundancy miTigation)

### 2. Task Type & Applicable Tasks

- **Task Type:** Image-Text Retrieval (ITR) / Vision-Language / Multimodal Learning
- **Applicable Tasks:** Accurate and efficient cross-modal retrieval. MEET specifically addresses redundancy by mitigating semantic redundancy within unimodal representations and relationship redundancy in cross-modal alignments.

### 3. Project Introduction

Existing Image-Text Retrieval methods often suffer from a fundamental yet overlooked challenge: redundancy. **MEET** introduces an iMage-text retrieval rEdundancy miTigation framework that explicitly analyzes and addresses the ITR problem from a redundancy perspective, enabling the model to produce compact yet highly discriminative representations for accurate and efficient retrieval.

> 💡 **Method Highlight:** MEET mitigates semantic redundancy by repurposing deep hashing and quantization, and progressively refines the cross-modal alignment space by filtering misleading negative samples and adaptively reweighting informative pairs. It supports end-to-end model training, diverse feature encoders, and unified optimization.

### 4. Training Data Source

The model is evaluated using features from Bi-GRU and BERT on standard ITR datasets:

- **MSCOCO** (1K and 5K splits)
- **Flickr30K**

*(Splits produced by HREM.)*

---

## 🚀 Usage & Basic Inference

### Step 1: Prepare the Environment

Clone the GitHub repository and install the required dependencies (evaluated on Python >= 3.8 and PyTorch >= 1.7.0):

```bash
git clone https://github.com/iLearn-Lab/TCSVT25-MEET.git
cd MEET
pip install "torchvision>=0.8.0" "transformers>=2.1.1" opencv-python tensorboard
```

### Step 2: Download Model Weights & Data

1. **Pre-trained Checkpoints:** Download the model checkpoints and place them in your designated `LOGGER_PATH`.
2. **Language Models & Features:**
   - Obtain the pretrained files for [BERT-base](https://huggingface.co/bert-base-uncased).
   - Obtain pretrained VSE model checkpoints (e.g., [ESA](https://github.com/KevinLight831/ESA)).
3. **Datasets:** Structure the MSCOCO and Flickr30K datasets as outlined in the repository's data tree (e.g., `coco_precomp`, `f30k_precomp`, `vocab`, `VSE`).

### Step 3: Run Training & Evaluation

**Evaluation:** Depending on the text features you are using, open the corresponding script (`at/lib/test.py` for Bi-GRU, `at_bert/lib/test.py` for BERT) and modify `RUN_PATH`.

For the MSCOCO 1K five-fold splits, first generate the folds:

```bash
python scripts/make_coco_1k_folds.py
```

Then run testing (ensure `MODEL_PATH` is set to the correct VSE weights):

```bash
PYTHONPATH=. python -m lib.test
```

**Training from Scratch:** Specify the dataset name (`coco_precomp` or `f30k_precomp`) after the `--data_name` flag, e.g.:

```bash
PYTHONPATH=. python hq_train.py --num_epochs 12 --batch_size 128 --workers 8 --H 64 --M 8 --K 8 --data_name coco_precomp
```

---

## ⚠️ Limitations & Notes

**Disclaimer:** This framework and its pre-trained weights are intended for **academic research purposes only**.

- The model requires access to the original source datasets (MSCOCO, Flickr30K) for full evaluation.
- Although designed for redundancy mitigation, performance may still fluctuate under extreme domain shifts not covered by the training distribution.

---

## 🙏 Acknowledgements & Contact

- **Acknowledgement:** Thanks to the [HREM](https://github.com/crossmodalgroup/hrem) open-source community for strong baselines and tooling, and to all collaborators and contributors of this project.
- **Contact:** If you have any questions, feel free to contact me at `khylon.kun.wang@gmail.com`.

---

## 📝⭐️ Citation

If you find our work or this repository useful in your research, please consider citing our paper:

```bibtex
@article{wang2025redundancy,
  title={Redundancy Mitigation: Towards Accurate and Efficient Image-Text Retrieval},
  author={Wang, Kun and Hu, Yupeng and Liu, Hao and Jie, Lirong and Nie, Liqiang},
  journal={IEEE Transactions on Circuits and Systems for Video Technology},
  year={2025},
  publisher={IEEE}
}
```
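
---

## 📐 Appendix: What the Evaluation Reports

The evaluation scripts report the standard ITR metric, Recall@K. For illustration only, here is a minimal, generic sketch of Recall@K computed from an image-text similarity matrix; the `recall_at_k` helper and the toy identity-matrix input are assumptions for this sketch, not part of the MEET codebase (use `lib/test.py` from the repository for actual evaluation):

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """Recall@K for image-to-text retrieval.

    sim[i, j] is the similarity between image i and caption j;
    for this sketch, the ground-truth caption of image i is caption i.
    """
    n = sim.shape[0]
    # Sort captions for each image by descending similarity.
    order = np.argsort(-sim, axis=1)
    # Rank of the ground-truth caption for each image (0 = retrieved first).
    ranks = np.array([np.where(order[i] == i)[0][0] for i in range(n)])
    # Recall@K = fraction of queries whose ground truth ranks in the top K.
    return {k: float(np.mean(ranks < k)) for k in ks}

# Toy check: an identity similarity matrix ranks every ground truth first.
print(recall_at_k(np.eye(4)))  # → {1: 1.0, 5: 1.0, 10: 1.0}
```

A real evaluation would build `sim` from the encoded image and caption features of a test split (e.g., MSCOCO 5K) and additionally report the text-to-image direction.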