---
license: apache-2.0
tags:
  - image-cropping
  - aesthetic-cropping
  - computer-vision
  - retrieval-augmented
  - conditional-detr
pipeline_tag: image-to-image
library_name: pytorch
datasets:
  - BWGZK/procrop_dataset
language:
  - en
---

# ProCrop: Learning Aesthetic Image Cropping from Professional Compositions

[![arXiv](https://img.shields.io/badge/arXiv-2505.22490-b31b1b.svg)](https://arxiv.org/abs/2505.22490)
[![GitHub](https://img.shields.io/badge/GitHub-ProCrop-blue)](https://github.com/BWGZK-keke/ProCrop)

This is the **headline supervised checkpoint** for the AAAI 2026 paper "ProCrop: Learning Aesthetic Image Cropping from Professional Compositions" by Zhang et al.

## Model Description

ProCrop is a retrieval-augmented framework for aesthetic image cropping that leverages professional photography compositions as guidance. Given a query image, ProCrop:

1. **Retrieves** compositionally similar professional images from a large database (AVA / CGL) using SAM embeddings and Faiss nearest-neighbor search.
2. **Fuses** retrieved features with the query via cross-attention.
3. **Predicts** diverse crop proposals ranked by aesthetic score using a Conditional DETR decoder.
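The retrieval step can be pictured as a nearest-neighbor lookup over a precomputed embedding database. Below is a minimal sketch using cosine similarity in NumPy as a stand-in for the actual Faiss search over SAM embeddings; all names and shapes here are illustrative, not the repository's API:

```python
import numpy as np

def retrieve_top_k(query_emb, database_embs, k=3):
    """Return indices of the k database embeddings most similar to the query.

    Cosine similarity stands in for the Faiss nearest-neighbor search; the
    real pipeline compares SAM embeddings of the query image against the
    AVA / CGL reference databases.
    """
    q = query_emb / np.linalg.norm(query_emb)
    db = database_embs / np.linalg.norm(database_embs, axis=1, keepdims=True)
    sims = db @ q                     # cosine similarity to every reference
    return np.argsort(-sims)[:k]      # indices of the k nearest references

# Toy database of 5 reference embeddings (dim 8) and a query close to entry 2.
rng = np.random.default_rng(0)
db = rng.normal(size=(5, 8))
query = db[2] + 0.01 * rng.normal(size=8)

print(retrieve_top_k(query, db, k=3))
```

The retrieved embeddings are then fused into the query features via cross-attention before decoding.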

## Reported Performance (FLMS supervised setting)

| Metric | Value |
|--------|-------|
| **IoU** | **0.843** |
| **BDE (Boundary Displacement Error)** | **0.036** |

This checkpoint matches the FLMS row of Table 3 in the paper.
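For reference, both metrics are straightforward to compute from box coordinates. The sketch below assumes boxes in `(x1, y1, x2, y2)` pixel form and takes BDE as the mean of the four edge displacements normalized by image width/height — consult the paper for the exact definition used in evaluation:

```python
def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) form."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def bde(a, b, img_w, img_h):
    """Boundary displacement error: mean normalized offset of the four edges."""
    return (abs(a[0] - b[0]) / img_w + abs(a[2] - b[2]) / img_w
            + abs(a[1] - b[1]) / img_h + abs(a[3] - b[3]) / img_h) / 4

pred, gt = (10, 10, 110, 110), (20, 10, 110, 120)
print(round(iou(pred, gt), 3), round(bde(pred, gt, 200, 200), 4))
```

Higher IoU and lower BDE both indicate predicted crops closer to the ground-truth annotation.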

## Checkpoint Details

| Property | Value |
|----------|-------|
| File | `procrop_flms_supervised.pth` |
| Size | 512 MB |
| Original filename | `checkpoint0008200.8425250053405762.pth` |
| Trainable params | ~44.8M |
| Backbone | ResNet-50 (DC5) + Transformer encoder/decoder |
| Training data | CPCDataset (supervised) + AVA retrieval references |
| Evaluation | FLMS test set, IoU = 0.8425 |
| Training epoch | 83 |
| Crop queries | 24 (Conditional DETR style) |

## How to Use

### 1. Clone the GitHub repository

```bash
git clone https://github.com/BWGZK-keke/ProCrop.git
cd ProCrop
pip install -r requirements.txt
pip install git+https://github.com/openai/CLIP.git
```

### 2. Download this checkpoint

```python
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="BWGZK/ProCrop",
    filename="procrop_flms_supervised.pth"
)
```

Or with the CLI:
```bash
huggingface-cli download BWGZK/ProCrop procrop_flms_supervised.pth --local-dir ./checkpoints
```

### 3. Run inference on a single image

```bash
cd cropping
python test_singleimage.py \
    --dataset_root /path/to/your/images \
    --retrieval_cache_dir /path/to/retrieval_tables \
    --retrieval_img_dir /path/to/CGL_images \
    --resume ./checkpoints/procrop_flms_supervised.pth \
    --crop_savepath ./results
```

### 4. Evaluate on FLMS

```bash
cd cropping
python main_cpc.py \
    --dataset_root /path/to/FLMS \
    --retrieval_cache_dir /path/to/retrieval_tables \
    --resume ./checkpoints/procrop_flms_supervised.pth \
    --eval
```

You also need:
- **Precomputed retrieval tables** from [BWGZK/procrop_dataset](https://huggingface.co/datasets/BWGZK/procrop_dataset)
- **SAM ViT-B checkpoint** if training on GAIC/CAD: [download here](https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth)

## Architecture

ProCrop extends **Conditional DETR** with a retrieval augmentation module:

- **Backbone**: ResNet-50 with dilated C5 stage
- **Encoder**: 6-layer transformer encoder for the query image
- **Retrieval fusion**: Cross-attention between query features and top-K retrieved SAM embeddings (64×256)
- **Decoder**: 6-layer transformer decoder with N=24 learnable crop queries
- **Heads**:
  - 4-dim bounding-box MLP (3 layers)
  - 1-dim aesthetic-score classification head (binary focal loss)
- **EMA self-distillation**: Mean-teacher framework for weakly-supervised training on CAD

Core implementation: [`cropping/models/conditional_detr_cpc.py`](https://github.com/BWGZK-keke/ProCrop/blob/main/cropping/models/conditional_detr_cpc.py)
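At inference time, the decoder's N=24 crop queries yield 24 box proposals with matching aesthetic scores, and the final crop is simply the highest-scoring proposal. A minimal sketch of that selection step (box format and score semantics are assumptions, not the repository's exact interface):

```python
def best_crop(boxes, scores):
    """Pick the proposal with the highest aesthetic score.

    `boxes` holds (x1, y1, x2, y2) crop proposals — in ProCrop these would be
    the N=24 outputs of the Conditional DETR decoder — and `scores` holds the
    matching outputs of the aesthetic-score classification head.
    """
    best = max(range(len(scores)), key=lambda i: scores[i])
    return boxes[best], scores[best]

boxes = [(0, 0, 100, 80), (10, 5, 90, 75), (20, 20, 80, 60)]
scores = [0.42, 0.91, 0.67]
print(best_crop(boxes, scores))   # proposal with the 0.91 score wins
```

Keeping all ranked proposals, rather than only the argmax, is what lets ProCrop offer diverse crop suggestions per image.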

## Related Resources

- **Code (GitHub)**: https://github.com/BWGZK-keke/ProCrop
- **Paper (arXiv)**: https://arxiv.org/abs/2505.22490
- **Dataset (HuggingFace)**: https://huggingface.co/datasets/BWGZK/procrop_dataset
  - CAD dataset (242K weakly annotated images)
  - Precomputed retrieval tables
  - Pre-extracted SAM embedding databases

## Citation

```bibtex
@article{ProCrop2025,
  title={ProCrop: Learning Aesthetic Image Cropping from Professional Compositions},
  author={Zhang, Ke and Ding, Tianyu and Jiang, Jiachen and Chen, Tianyi and Zharkov, Ilya and Patel, Vishal M. and Liang, Luming},
  journal={arXiv preprint arXiv:2505.22490},
  year={2025}
}
```

## License

Apache 2.0. The model builds on [ConditionalDETR](https://github.com/Atten4Vis/ConditionalDETR), [RALF](https://github.com/CyberAgentAILab/RALF), and [Segment Anything](https://github.com/facebookresearch/segment-anything) — please consult their respective licenses.