(ICASSP 2026) HINT: Composed Image Retrieval with Dual-Path Compositional Contextualized Network (Model Weights)

Mingyu Zhang1, Zixu Li1, Zhiwei Chen1, Zhiheng Fu1, Xiaowei Zhu1, Jiajia Nie1, Yinwei Wei1, Yupeng Hu1✉
1School of Software, Shandong University    
✉ Corresponding author


This repository hosts the official pre-trained checkpoints for **HINT**, a framework for Composed Image Retrieval (CIR) that addresses two gaps in prior methods: the neglect of contextual information and the absence of a discrepancy-amplification mechanism.

📌 Model Information

1. Model Name

HINT (dual-patH composItional coNtextualized neTwork) Checkpoints.

2. Task Type & Applicable Tasks

  • Task Type: Composed Image Retrieval (CIR) / Vision-Language Retrieval.
  • Applicable Tasks: Retrieving target images based on a reference image and a modification text.
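
To make the task interface concrete, the sketch below ranks a small gallery against a composed query. It is only an illustration of CIR inputs and outputs, not HINT's composition module: the toy embeddings, the additive `compose` step, and every function name here are assumptions made for this example.

```python
import math
from typing import Dict, List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def compose(ref_img: Sequence[float], mod_text: Sequence[float]) -> List[float]:
    """Placeholder composition: add the two embeddings, then normalize.

    Assumption for illustration only; HINT learns its own composition.
    """
    return l2_normalize([r + t for r, t in zip(ref_img, mod_text)])


def rank_gallery(query: Sequence[float], gallery: Dict[str, Sequence[float]]) -> List[str]:
    """Return gallery image ids sorted by cosine similarity to the query, best first."""
    def score(item_id: str) -> float:
        cand = l2_normalize(gallery[item_id])
        return sum(q * c for q, c in zip(query, cand))
    return sorted(gallery, key=score, reverse=True)


# Toy 3-d embeddings (hypothetical values).
reference_image = [1.0, 0.0, 0.0]     # stands in for the reference image
modification_text = [0.0, 1.0, 0.0]   # stands in for the modification text
gallery = {
    "img_a": [1.0, 0.0, 0.0],  # agrees with the reference only
    "img_b": [1.0, 1.0, 0.0],  # agrees with reference and modification
    "img_c": [0.0, 0.0, 1.0],  # unrelated
}

ranking = rank_gallery(compose(reference_image, modification_text), gallery)
print(ranking)
```

The candidate that reflects both the reference image and the requested modification is expected to rank first, which is the behaviour a CIR model is trained to produce.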

3. Project Introduction

Existing Composed Image Retrieval (CIR) methods often neglect contextual information when discriminating matching samples, so they struggle with complex modifications and implicit dependencies in real-world scenarios. HINT addresses this through three components:

  • 🧩 Dual Context Extraction (DCE): Extracts both intra-modal context and cross-modal context, enhancing joint semantic representation by integrating multimodal contextual information.

  • 📏 Quantification of Contextual Relevance (QCR): Measures the relevance between cross-modal contextual information and the target image semantics, enabling the quantification of the implicit dependencies.

  • ⚖️ Dual-Path Consistency Constraints (DPCC): Optimizes the training process by constraining representation consistency, ensuring the stable enhancement of similarity for matching instances while lowering it for non-matching ones.
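
The effect described for DPCC (raising similarity for matching pairs while lowering it for non-matching ones) is the general behaviour of contrastive objectives. The sketch below is a generic in-batch contrastive loss in plain Python, offered only to illustrate that principle. It is not HINT's DPCC formulation; the temperature value and the function name are assumptions.

```python
import math
from typing import List


def in_batch_contrastive_loss(sim: List[List[float]], temperature: float = 0.07) -> float:
    """Mean cross-entropy over rows of a query-by-target similarity matrix.

    sim[i][j] is the similarity between composed query i and target j;
    the diagonal holds the matching pairs.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]  # -log softmax at the matching index
    return total / len(sim)


# Hypothetical similarity matrices for a batch of two queries.
well_separated = [[0.9, 0.1], [0.2, 0.8]]
poorly_separated = [[0.5, 0.5], [0.5, 0.5]]

loss_good = in_batch_contrastive_loss(well_separated)
loss_bad = in_batch_contrastive_loss(poorly_separated)
print(loss_good < loss_bad)
```

Minimizing such a loss pushes diagonal (matching) similarities up relative to the off-diagonal (non-matching) ones, so the well-separated batch incurs the lower loss.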

Built on the BLIP-2 architecture, HINT achieves state-of-the-art (SOTA) retrieval performance on both open-domain and fashion-domain benchmarks.

4. Training Data Source & Hosted Weights

The models were trained on the FashionIQ and CIRR datasets. This Hugging Face repository provides the corresponding .pt checkpoint files, organized by dataset:

  • fashioniq.pt (Trained on FashionIQ)

  • cirr.pt (Trained on CIRR)


🚀 Usage & Basic Inference

These weights are designed to be evaluated seamlessly using the official HINT GitHub repository.

Step 1: Prepare the Environment

Clone the GitHub repository and install dependencies:

git clone https://github.com/iLearn-Lab/ICASSP26-HINT
cd ICASSP26-HINT
conda create -n hint python=3.8 -y
conda activate hint
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121
pip install open-clip-torch==2.24.0 scikit-learn==1.3.2 transformers==4.25.0 salesforce-lavis==1.0.2 timm==0.9.16
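
After installation, it can help to confirm that the environment matches the pinned versions. The following is a minimal sketch using the standard-library `importlib.metadata`; the helper name `find_mismatches` is hypothetical, and the pins are copied from the pip commands above.

```python
from importlib import metadata
from typing import Callable, Dict

# Pins copied from the pip commands above.
EXPECTED = {
    "torch": "2.1.0",
    "torchvision": "0.16.0",
    "torchaudio": "2.1.0",
    "open-clip-torch": "2.24.0",
    "scikit-learn": "1.3.2",
    "transformers": "4.25.0",
    "salesforce-lavis": "1.0.2",
    "timm": "0.9.16",
}


def find_mismatches(
    expected: Dict[str, str],
    get_version: Callable[[str], str] = metadata.version,
) -> Dict[str, str]:
    """Return {package: installed version or 'missing'} for packages that deviate from the pins."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = "missing"
        # Local build tags such as "+cu121" are ignored in the comparison.
        if installed.split("+")[0] != wanted:
            mismatches[name] = installed
    return mismatches


# An empty dict means every pinned package is installed at the expected version.
print(find_mismatches(EXPECTED))
```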

Step 2: Download Model Weights

Download the specific .pt files you wish to evaluate from this Hugging Face repository. Place them into a checkpoints/ directory within your cloned GitHub repo. For example, to evaluate the CIRR model:

ICASSP26-HINT/
└── checkpoints/
        └── cirr.pt  <-- (Rename to best_model.pt if required by your specific test script)
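
The download and placement can also be scripted. The sketch below assumes the `huggingface_hub` package is installed, that the repository id is `iLearn-Lab/HINT`, and that `cirr.pt` sits at the repository root; all three are assumptions, so adjust them to the actual file layout. The helper names are hypothetical.

```python
import shutil
from pathlib import Path


def fetch_checkpoint(filename: str, repo_id: str = "iLearn-Lab/HINT") -> Path:
    """Download one checkpoint file from the Hub and return its local cache path."""
    # Imported lazily so the staging helper below works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return Path(hf_hub_download(repo_id=repo_id, filename=filename))


def stage_checkpoint(src: Path, ckpt_dir: Path, target_name: str = "") -> Path:
    """Copy a checkpoint into ckpt_dir, optionally under a new file name."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    dest = ckpt_dir / (target_name or src.name)
    shutil.copy2(src, dest)
    return dest


# Example (requires network access), run from the ICASSP26-HINT directory:
# stage_checkpoint(fetch_checkpoint("cirr.pt"), Path("checkpoints"), "best_model.pt")
```

Passing a `target_name` covers the case where the test script expects `best_model.pt`; omit it to keep the original file name.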

Step 3: Run Testing / Evaluation

To generate prediction files on the CIRR dataset for the CIRR Evaluation Server, point the test script to the directory containing your downloaded checkpoint:

python src/cirr_test_submission.py checkpoints/

(The script will automatically output .json files based on the checkpoint for online evaluation.)
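
Before launching a long evaluation run, a small pre-flight check can catch an empty or misplaced checkpoint directory. The sketch below only verifies that the directory contains at least one .pt file and then builds the command shown above; the helper name is hypothetical, and nothing here inspects how the script itself selects a checkpoint.

```python
import sys
from pathlib import Path
from typing import List


def build_cirr_test_command(
    ckpt_dir: str, script: str = "src/cirr_test_submission.py"
) -> List[str]:
    """Return the argv for the CIRR submission script after checking the checkpoint directory."""
    directory = Path(ckpt_dir)
    checkpoints = sorted(p.name for p in directory.glob("*.pt"))
    if not checkpoints:
        raise FileNotFoundError(f"no .pt checkpoint found in {directory}")
    return [sys.executable, script, str(directory)]


# Run from the repository root, for example:
# import subprocess
# subprocess.run(build_cirr_test_command("checkpoints/"), check=True)
```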


⚠️ Limitations & Notes

  • Hardware Requirements: HINT is built on the BLIP-2 architecture, so inference and further fine-tuning require a GPU with ample memory (e.g., an NVIDIA A40 48 GB or V100 32 GB is recommended).
  • Intended Use: These weights are provided for academic research and to facilitate reproducibility of the ICASSP 2026 paper.
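
To check whether a machine is in the recommended memory range, the sketch below queries `nvidia-smi` and converts the reported totals to GiB. The threshold, the sample rows, and the helper names are assumptions for illustration; reported totals can be slightly below a card's nominal size, so the threshold is set a little under 32.

```python
import subprocess
from typing import List, Tuple


def parse_gpu_memory(csv_text: str) -> List[Tuple[str, float]]:
    """Parse `name, memory.total` CSV rows (memory in MiB) into (name, GiB) pairs."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, mem_mib = (field.strip() for field in line.rsplit(",", 1))
        gpus.append((name, float(mem_mib) / 1024))
    return gpus


def query_gpus() -> List[Tuple[str, float]]:
    """Ask nvidia-smi for each GPU's name and total memory."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_gpu_memory(result.stdout)


MIN_GIB = 30  # assumed threshold, slightly under the 32G card mentioned above

# Made-up sample rows in the format nvidia-smi emits; call query_gpus() on a real machine.
sample = "NVIDIA A40, 46068\nTesla V100-SXM2-32GB, 32768\n"
for name, gib in parse_gpu_memory(sample):
    print(f"{name}: {gib:.1f} GiB, meets threshold: {gib >= MIN_GIB}")
```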

📝⭐️ Citation

If you find our work, code, or these model weights useful in your research, please consider leaving a Star ⭐️ on our GitHub repository and citing our paper:

@inproceedings{HINT2026,
  title={HINT: COMPOSED IMAGE RETRIEVAL WITH DUAL-PATH COMPOSITIONAL CONTEXTUALIZED NETWORK},
  author={Zhang, Mingyu and Li, Zixu and Chen, Zhiwei and Fu, Zhiheng and Zhu, Xiaowei and Nie, Jiajia and Wei, Yinwei and Hu, Yupeng},
  booktitle={Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2026}
}