+---
+license: apache-2.0
+tags:
+  - image-cropping
+  - aesthetic-cropping
+  - computer-vision
+  - retrieval-augmented
+  - conditional-detr
+pipeline_tag: image-to-image
+library_name: pytorch
+datasets:
+  - BWGZK/procrop_dataset
+language:
+  - en
+---
+# ProCrop: Learning Aesthetic Image Cropping from Professional Compositions
+[![arXiv](https://img.shields.io/badge/arXiv-2505.22490-b31b1b.svg)](https://arxiv.org/abs/2505.22490)
+[![GitHub](https://img.shields.io/badge/GitHub-ProCrop-blue)](https://github.com/BWGZK-keke/ProCrop)
+
+This is the **headline supervised checkpoint** for the AAAI 2026 paper "ProCrop: Learning Aesthetic Image Cropping from Professional Compositions" by Zhang et al.
+## Model Description
+ProCrop is a retrieval-augmented framework for aesthetic image cropping that leverages professional photography compositions as guidance. Given a query image, ProCrop:
+1. **Retrieves** compositionally similar professional images from a large database (AVA / CGL) using SAM embeddings and Faiss nearest-neighbor search.
+2. **Fuses** retrieved features with the query via cross-attention.
+3. **Predicts** diverse crop proposals ranked by aesthetic score using a Conditional DETR decoder.
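
The retrieve-then-fuse idea in steps 1 and 2 can be illustrated with a toy, dependency-free sketch. It is not ProCrop's implementation: brute-force cosine similarity stands in for the Faiss index, short hand-written vectors stand in for SAM embeddings, and the attention is single-head with no learned projections. `retrieve_top_k` and `cross_attention` are hypothetical names.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query: Vector, database: List[Vector], k: int) -> List[int]:
    """Brute-force stand-in for a Faiss search: indices of the k most similar entries."""
    scores = [cosine(query, entry) for entry in database]
    return sorted(range(len(database)), key=lambda i: scores[i], reverse=True)[:k]


def cross_attention(queries: List[Vector], memory: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product attention: each query token attends over the
    retrieved reference tokens, which serve as both keys and values."""
    dim = len(memory[0])
    fused = []
    for q in queries:
        logits = [sum(x * y for x, y in zip(q, m)) / math.sqrt(dim) for m in memory]
        peak = max(logits)  # subtract the max so exp() cannot overflow
        weights = [math.exp(s - peak) for s in logits]
        total = sum(weights)
        # Each output token is a softmax-weighted average of the memory tokens.
        fused.append([sum(w * m[j] for w, m in zip(weights, memory)) / total for j in range(dim)])
    return fused


database = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # toy "professional image" embeddings
query = [0.9, 0.1]                                 # toy query-image embedding
top = retrieve_top_k(query, database, k=2)         # -> [0, 2]
fused = cross_attention([query], [database[i] for i in top])
```

Cosine similarity equals the inner product of L2-normalised vectors, so an exact inner-product Faiss index over normalised embeddings returns the same ranking as the brute-force loop, just much faster on a large database.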
+## Reported Performance (FLMS supervised setting)
+| Metric | Value |
+|--------|-------|
+| **IoU** (higher is better) | **0.843** |
+| **BDE** (boundary displacement error, lower is better) | **0.036** |
+
+This checkpoint matches the FLMS row of Table 3 in the paper.
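
For reference, both metrics can be computed per image as below. This assumes boxes given as pixel `(x0, y0, x1, y1)` and the BDE definition common in the cropping literature (mean absolute offset of the four crop edges, normalised by image width or height). The helper names are hypothetical, and the repository's own evaluation code remains authoritative, including how it treats FLMS images that have several reference crops.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def boundary_displacement(pred: Box, gt: Box, width: float, height: float) -> float:
    """Mean absolute shift of the four crop edges; x-edges are divided by the
    image width and y-edges by the image height."""
    scale = (width, height, width, height)
    return sum(abs(p - g) / s for p, g, s in zip(pred, gt, scale)) / 4.0


pred, gt = (0, 0, 50, 50), (0, 0, 100, 50)
print(iou(pred, gt), boundary_displacement(pred, gt, 100, 100))  # 0.5 0.125
```

In this example only the right edge moves, by half of a 100-pixel-wide image, so BDE is 0.5 / 4 = 0.125 while the prediction covers half of the ground-truth area.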
+## Checkpoint Details
+| Property | Value |
+|----------|-------|
+| File | `procrop_flms_supervised.pth` |
+| Size | 512 MB |
+| Original filename | `checkpoint0008200.8425250053405762.pth` |
+| Trainable params | ~44.8M |
+| Backbone | ResNet-50 (DC5) + Transformer encoder/decoder |
+| Training data | CPCDataset (supervised) + AVA retrieval references |
+| Evaluation | FLMS test set, IoU = 0.8425 |
+| Training epoch | 83 |
+| Crop queries | 24 (Conditional DETR style) |
+## How to Use
+### 1. Clone the GitHub repository
+```bash
+git clone https://github.com/BWGZK-keke/ProCrop.git
+cd ProCrop
+pip install -r requirements.txt
+pip install git+https://github.com/openai/CLIP.git
+```
+### 2. Download this checkpoint
+```python
+from huggingface_hub import hf_hub_download
+ckpt_path = hf_hub_download(
+    repo_id="BWGZK/ProCrop",
+    filename="procrop_flms_supervised.pth",
+    local_dir="./checkpoints",  # same location as the CLI --local-dir option
+)
+```
+Or with the CLI:
+```bash
+huggingface-cli download BWGZK/ProCrop procrop_flms_supervised.pth --local-dir ./checkpoints
+```
+### 3. Run inference on a single image
+```bash
+cd cropping
+python test_singleimage.py \
+    --dataset_root /path/to/your/images \
+    --retrieval_cache_dir /path/to/retrieval_tables \
+    --retrieval_img_dir /path/to/CGL_images \
+    --resume ../checkpoints/procrop_flms_supervised.pth \
+    --crop_savepath ./results
+```
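
How raw decoder outputs could become ranked crops is sketched below, assuming ProCrop keeps the Conditional DETR convention of one normalised `(cx, cy, w, h)` box and one score logit per crop query. `decode_crops` and its `top_n` parameter are hypothetical; see `test_singleimage.py` for what the script actually does.

```python
import math
from typing import List, Sequence, Tuple

Crop = Tuple[float, Tuple[float, float, float, float]]  # (score, (x0, y0, x1, y1))


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def decode_crops(boxes: Sequence[Tuple[float, float, float, float]],
                 logits: Sequence[float], width: int, height: int,
                 top_n: int = 3) -> List[Crop]:
    """Map normalised (cx, cy, w, h) query boxes to pixel crops, best score first."""
    crops = []
    for (cx, cy, w, h), logit in zip(boxes, logits):
        # Convert centre/size to corners, scale to pixels, clamp to the image.
        x0 = max(0.0, (cx - w / 2) * width)
        y0 = max(0.0, (cy - h / 2) * height)
        x1 = min(float(width), (cx + w / 2) * width)
        y1 = min(float(height), (cy + h / 2) * height)
        crops.append((sigmoid(logit), (x0, y0, x1, y1)))
    crops.sort(key=lambda c: c[0], reverse=True)
    return crops[:top_n]


# Two toy query outputs for a 200x100 image.
ranked = decode_crops([(0.5, 0.5, 0.5, 0.5), (0.25, 0.25, 0.5, 0.5)], [0.0, 2.0], 200, 100)
print(ranked[0][1])  # (0.0, 0.0, 100.0, 50.0)
```

With 24 crop queries per image, `top_n` only decides how many of the scored proposals to keep; ranking by score is what lets the model return several diverse crops instead of a single box.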
+### 4. Evaluate on FLMS
+```bash
+cd cropping
+python main_cpc.py \
+    --dataset_root /path/to/FLMS \
+    --retrieval_cache_dir /path/to/retrieval_tables \
+    --resume ../checkpoints/procrop_flms_supervised.pth \
+    --eval
+```
+You also need:
+- **Precomputed retrieval tables** from [BWGZK/procrop_dataset](https://huggingface.co/datasets/BWGZK/procrop_dataset)
+- **SAM ViT-B checkpoint** if training on GAIC/CAD: [download here](https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth)
+## Architecture
+ProCrop extends **Conditional DETR** with a retrieval augmentation module:
+- **Backbone**: ResNet-50 with dilated C5 stage
+- **Encoder**: 6-layer transformer encoder for the query image
+- **Retrieval fusion**: Cross-attention between query features and top-K retrieved SAM embeddings (64×256)
+- **Decoder**: 6-layer transformer decoder with N=24 learnable crop queries
+- **Heads**:
+  - 4-dim bounding-box MLP (3 layers)
+  - 1-dim aesthetic-score classification head (binary focal loss)
+- **EMA self-distillation**: Mean-teacher framework for weakly-supervised training on CAD
+
+Core implementation: [`cropping/models/conditional_detr_cpc.py`](https://github.com/BWGZK-keke/ProCrop/blob/main/cropping/models/conditional_detr_cpc.py)
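
Two training-side components in the list above, the binary focal loss on the score head and the EMA mean teacher, reduce to short update rules. The sketch below uses scalars in place of tensors, the common focal-loss defaults `alpha=0.25` and `gamma=2.0`, and a momentum of 0.999; these values are assumptions rather than ProCrop's configuration, and the function names are hypothetical.

```python
import math
from typing import Dict


def softplus(z: float) -> float:
    """log(1 + e^z), computed without overflow."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def binary_focal_loss(logit: float, target: int, alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Sigmoid focal loss for one prediction; target is 1 (positive) or 0 (negative)."""
    if target == 1:
        log_pt = -softplus(-logit)  # log sigmoid(logit)
        alpha_t = alpha
    else:
        log_pt = -softplus(logit)   # log(1 - sigmoid(logit))
        alpha_t = 1.0 - alpha
    p_t = math.exp(log_pt)
    # (1 - p_t)^gamma down-weights examples the model already classifies well.
    return -alpha_t * (1.0 - p_t) ** gamma * log_pt


def ema_update(teacher: Dict[str, float], student: Dict[str, float], momentum: float = 0.999) -> None:
    """Mean-teacher step, in place: teacher <- momentum * teacher + (1 - momentum) * student."""
    for name, value in student.items():
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * value


print(binary_focal_loss(0.0, 1), binary_focal_loss(4.0, 1))  # uncertain vs. confident positive
teacher = {"w": 1.0}
ema_update(teacher, {"w": 0.0}, momentum=0.9)
```

In the general mean-teacher scheme the teacher receives no gradients: after each student optimisation step its weights are blended toward the student's, and its smoother predictions serve as targets on weakly labelled data.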
+## Related Resources
+- **Code (GitHub)**: https://github.com/BWGZK-keke/ProCrop
+- **Paper (arXiv)**: https://arxiv.org/abs/2505.22490
+- **Dataset (HuggingFace)**: https://huggingface.co/datasets/BWGZK/procrop_dataset
+  - CAD dataset (242K weakly annotated images)
+  - Precomputed retrieval tables
+  - Pre-extracted SAM embedding databases
+## Citation
+```bibtex
+@article{ProCrop2025,
+  title={ProCrop: Learning Aesthetic Image Cropping from Professional Compositions},
+  author={Zhang, Ke and Ding, Tianyu and Jiang, Jiachen and Chen, Tianyi and Zharkov, Ilya and Patel, Vishal M. and Liang, Luming},
+  journal={arXiv preprint arXiv:2505.22490},
+  year={2025}
+}
+```
+## License
+Apache 2.0. The model builds on [ConditionalDETR](https://github.com/Atten4Vis/ConditionalDETR), [RALF](https://github.com/CyberAgentAILab/RALF), and [Segment Anything](https://github.com/facebookresearch/segment-anything) — please consult their respective licenses.