FashionMV: Product-Level Composed Image Retrieval with Multi-View Fashion Data
Paper: arXiv:2604.10297
[Paper (arXiv)] | [Code (GitHub)] | [Dataset]
ProCIR (0.8B) is a multi-view composed image retrieval (CIR) model trained on the FashionMV dataset and built on Qwen3.5-0.8B. It adopts a perception-reasoning decoupled dialogue architecture and uses image-text alignment to inject product knowledge, enabling effective product-level CIR across multiple views.
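At inference time, CIR ranks a gallery of product images by similarity between a composed query (reference image + modification text) and the gallery embeddings. A minimal sketch of that ranking step with toy vectors, assuming L2-normalized embeddings (the averaging fusion here is a simple placeholder, not ProCIR's actual architecture):

```python
import numpy as np

def compose_query(img_emb: np.ndarray, txt_emb: np.ndarray) -> np.ndarray:
    # Placeholder fusion: average the image and text embeddings.
    # ProCIR itself fuses the two modalities inside the language model.
    q = (img_emb + txt_emb) / 2.0
    return q / np.linalg.norm(q)

def rank_gallery(query: np.ndarray, gallery: np.ndarray) -> np.ndarray:
    # Gallery rows are L2-normalized, so cosine similarity = dot product.
    sims = gallery @ query
    return np.argsort(-sims)  # gallery indices, best match first

# Toy example: 5 gallery items with 8-dim embeddings.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(5, 8))
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)

img, txt = rng.normal(size=8), rng.normal(size=8)
order = rank_gallery(compose_query(img, txt), gallery)
print(order)
```

Recall@K is then read off this ranking: a query counts as a hit if the ground-truth product index appears in `order[:K]`.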
| Dataset | R@5 (%) | R@10 (%) |
|---|---|---|
| DeepFashion | 89.2 | 94.9 |
| Fashion200K | 77.6 | 86.6 |
| FashionGen-val | 75.0 | 85.3 |
| Average | 80.6 | 88.9 |
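R@K (Recall at K) counts a query as correct when the ground-truth product appears among the top-K retrieved items. A minimal reference implementation with hypothetical toy rankings (not the official evaluation code):

```python
def recall_at_k(ranked_ids, gt_ids, k):
    """Percentage of queries whose ground-truth id is in the top-k ranking."""
    hits = sum(gt in ids[:k] for ids, gt in zip(ranked_ids, gt_ids))
    return 100.0 * hits / len(gt_ids)

# Toy rankings for 4 queries (hypothetical product ids).
ranked = [[3, 7, 1], [2, 5, 9], [8, 4, 6], [1, 0, 2]]
gt = [7, 9, 4, 5]
print(recall_at_k(ranked, gt, 1))  # 0.0
print(recall_at_k(ranked, gt, 2))  # 50.0
```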
See our GitHub repository for evaluation code and data preparation instructions.
```python
from transformers import AutoProcessor, Qwen3_5ForConditionalGeneration

# Load the ProCIR processor and weights from the Hugging Face Hub.
processor = AutoProcessor.from_pretrained("yuandaxia/ProCIR")
model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "yuandaxia/ProCIR", torch_dtype="bfloat16"
)
```
```bibtex
@article{yuan2026fashionmv,
  title={FashionMV: Product-Level Composed Image Retrieval with Multi-View Fashion Data},
  author={Yuan, Peng and Mei, Bingyin and Zhang, Hui},
  journal={arXiv preprint arXiv:2604.10297},
  year={2026}
}
```
Model weights are released under the same license as the base model (Qwen3.5).