| --- |
| license: apache-2.0 |
| tags: |
| - image-cropping |
| - aesthetic-cropping |
| - computer-vision |
| - retrieval-augmented |
| - conditional-detr |
| pipeline_tag: image-to-image |
| library_name: pytorch |
| datasets: |
| - BWGZK/procrop_dataset |
| language: |
| - en |
| --- |
| |
| # ProCrop: Learning Aesthetic Image Cropping from Professional Compositions |
|
|
| [Paper (arXiv)](https://arxiv.org/abs/2505.22490) |
| [Code (GitHub)](https://github.com/BWGZK-keke/ProCrop) |
|
|
| This repository hosts the **supervised checkpoint behind the headline FLMS result** of the AAAI 2026 paper "ProCrop: Learning Aesthetic Image Cropping from Professional Compositions" by Zhang et al. |
|
|
| ## Model Description |
|
|
| ProCrop is a retrieval-augmented framework for aesthetic image cropping that leverages professional photography compositions as guidance. Given a query image, ProCrop: |
|
|
| 1. **Retrieves** compositionally similar professional images from a large database (AVA / CGL) using SAM embeddings and Faiss nearest-neighbor search. |
| 2. **Fuses** retrieved features with the query via cross-attention. |
| 3. **Predicts** diverse crop proposals ranked by aesthetic score using a Conditional DETR decoder. |
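
The retrieval step above can be illustrated with a brute-force nearest-neighbor search. This is a minimal sketch, not the repository's code: it uses plain Python and cosine similarity as a stand-in for Faiss, the toy 3-dimensional vectors stand in for SAM embeddings, and the names `cosine_similarity` and `retrieve_top_k` are hypothetical. The similarity measure ProCrop actually uses is not stated here, so cosine similarity is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query, database, k=3):
    """Return indices of the k database embeddings most similar to the query."""
    scored = [(cosine_similarity(query, emb), idx) for idx, emb in enumerate(database)]
    # Highest similarity first; ties are broken by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


# Toy embeddings standing in for pooled SAM features (hypothetical data).
database = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.0, 0.1]
neighbors = retrieve_top_k(query, database, k=2)
print(neighbors)
```

A Faiss index would replace the linear scan with an optimized search over the full reference database; the returned indices play the same role, selecting which professional images feed the fusion step.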
|
|
| ## Reported Performance (FLMS supervised setting) |
|
|
| | Metric | Value | |
| |--------|-------| |
| | **IoU** | **0.843** | |
| | **BDE (boundary displacement error)** | **0.036** | |
|
|
| This checkpoint matches the FLMS row of Table 3 in the paper. |
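
For readers unfamiliar with the two metrics, the sketch below shows how IoU and BDE are commonly computed for a single predicted crop. It assumes boxes in pixel `(x1, y1, x2, y2)` format and the usual definition of BDE as the mean displacement of the four crop edges, each normalized by the image width or height. The function names are illustrative, and the evaluation script in the repository may differ in detail.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def bde(pred, gt, width, height):
    """Mean displacement of the four edges, normalized by image size."""
    dx1 = abs(pred[0] - gt[0]) / width
    dy1 = abs(pred[1] - gt[1]) / height
    dx2 = abs(pred[2] - gt[2]) / width
    dy2 = abs(pred[3] - gt[3]) / height
    return (dx1 + dy1 + dx2 + dy2) / 4.0


# Hypothetical 100x100 image: the prediction misses the left edge by 10 pixels.
gt_box = (10, 10, 90, 90)
pred_box = (20, 10, 90, 90)
example_iou = iou(pred_box, gt_box)
example_bde = bde(pred_box, gt_box, width=100, height=100)
print(example_iou, example_bde)
```

Higher IoU and lower BDE both indicate a predicted crop closer to the annotated one.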
|
|
| ## Checkpoint Details |
|
|
| | Property | Value | |
| |----------|-------| |
| | File | `procrop_flms_supervised.pth` | |
| | Size | 512 MB | |
| | Original filename | `checkpoint0008200.8425250053405762.pth` | |
| | Trainable params | ~44.8M | |
| | Backbone | ResNet-50 (DC5) + Transformer encoder/decoder | |
| | Training data | CPCDataset (supervised) + AVA retrieval references | |
| | Evaluation | FLMS test set, IoU = 0.8425 | |
| | Training epoch | 83 | |
| | Crop queries | 24 (Conditional DETR style) | |
|
|
| ## How to Use |
|
|
| ### 1. Clone the GitHub repository |
|
|
| ```bash |
| git clone https://github.com/BWGZK-keke/ProCrop.git |
| cd ProCrop |
| pip install -r requirements.txt |
| pip install git+https://github.com/openai/CLIP.git |
| ``` |
|
|
| ### 2. Download this checkpoint |
|
|
| ```python |
| from huggingface_hub import hf_hub_download |
| |
| ckpt_path = hf_hub_download( |
| repo_id="BWGZK/ProCrop", |
| filename="procrop_flms_supervised.pth" |
| ) |
| ``` |
|
|
| Or with the CLI: |
| ```bash |
| huggingface-cli download BWGZK/ProCrop procrop_flms_supervised.pth --local-dir ./checkpoints |
| ``` |
|
|
| ### 3. Run inference on a single image |
|
|
| ```bash |
| cd cropping |
| python test_singleimage.py \ |
| --dataset_root /path/to/your/images \ |
| --retrieval_cache_dir /path/to/retrieval_tables \ |
| --retrieval_img_dir /path/to/CGL_images \ |
| --resume ./checkpoints/procrop_flms_supervised.pth \ |
| --crop_savepath ./results |
| ``` |
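
If you post-process raw model outputs yourself, a sketch like the following may help. It assumes the model emits DETR-style normalized `(cx, cy, w, h)` boxes with one aesthetic score per crop query; that output format is an assumption based on the Conditional DETR design, not something verified against `test_singleimage.py`. The helper names are hypothetical.

```python
def cxcywh_to_pixel_xyxy(box, width, height):
    """Convert a normalized (cx, cy, w, h) box to clamped pixel (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    x1 = round((cx - w / 2) * width)
    y1 = round((cy - h / 2) * height)
    x2 = round((cx + w / 2) * width)
    y2 = round((cy + h / 2) * height)
    # Clamp to the image so the crop is always valid.
    return (max(0, x1), max(0, y1), min(width, x2), min(height, y2))


def best_crop(boxes, scores, width, height):
    """Pick the proposal with the highest aesthetic score and convert it to pixels."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return cxcywh_to_pixel_xyxy(boxes[best], width, height)


# Two hypothetical proposals for a 640x480 image.
boxes = [(0.5, 0.5, 0.8, 0.6), (0.4, 0.5, 0.6, 0.9)]
scores = [0.31, 0.87]
crop = best_crop(boxes, scores, width=640, height=480)
print(crop)
```

The resulting tuple can be passed directly to an image library's crop function, for example `PIL.Image.Image.crop`.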
|
|
| ### 4. Evaluate on FLMS |
|
|
| ```bash |
| cd cropping |
| python main_cpc.py \ |
| --dataset_root /path/to/FLMS \ |
| --retrieval_cache_dir /path/to/retrieval_tables \ |
| --resume ./checkpoints/procrop_flms_supervised.pth \ |
| --eval |
| ``` |
|
|
| Additional requirements: |
| - **Precomputed retrieval tables** (the directory passed as `--retrieval_cache_dir`), available from [BWGZK/procrop_dataset](https://huggingface.co/datasets/BWGZK/procrop_dataset) |
| - **SAM ViT-B checkpoint**, needed only if you train on GAIC/CAD: [download here](https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth) |
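
One way to fetch the dataset repository that holds the retrieval tables is `snapshot_download` from `huggingface_hub`. The sketch below downloads the whole repository because the file layout inside it is not documented here; the wrapper name `download_procrop_assets` and the default `local_dir` are illustrative assumptions.

```python
def download_procrop_assets(local_dir="./procrop_dataset"):
    """Download the ProCrop dataset repository and return its local path."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="BWGZK/procrop_dataset",
        repo_type="dataset",  # the repository is a dataset, not a model
        local_dir=local_dir,
    )


if __name__ == "__main__":
    print(download_procrop_assets())
```

The repository also contains the CAD images, so the full download may be large. Once you know which files you need, `snapshot_download` accepts an `allow_patterns` argument to restrict the download.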
|
|
| ## Architecture |
|
|
| ProCrop extends **Conditional DETR** with a retrieval augmentation module: |
|
|
| - **Backbone**: ResNet-50 with dilated C5 stage |
| - **Encoder**: 6-layer transformer encoder for the query image |
| - **Retrieval fusion**: Cross-attention between query features and top-K retrieved SAM embeddings (64×256) |
| - **Decoder**: 6-layer transformer decoder with N=24 learnable crop queries |
| - **Heads**: |
| - 4-dim bounding-box MLP (3 layers) |
| - 1-dim aesthetic-score classification head (binary focal loss) |
| - **EMA self-distillation**: Mean-teacher framework for weakly-supervised training on CAD |
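
The binary focal loss named for the aesthetic-score head can be written out for a single prediction as below. This is a sketch of the standard formulation, FL = -alpha_t * (1 - p_t)^gamma * log(p_t). The defaults `alpha=0.25` and `gamma=2.0` are the values commonly used in DETR-style detectors and are assumed here, not taken from the ProCrop configuration.

```python
import math


def binary_focal_loss(prob, target, alpha=0.25, gamma=2.0):
    """Focal loss for one predicted probability and a 0/1 target."""
    # p_t is the probability the model assigns to the true class.
    p_t = prob if target == 1 else 1.0 - prob
    alpha_t = alpha if target == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confident correct prediction is down-weighted far more than a wrong one.
easy = binary_focal_loss(0.9, 1)
hard = binary_focal_loss(0.1, 1)
print(easy, hard)
```

The `(1 - p_t)^gamma` factor is what makes the loss concentrate on hard examples, which matters when only a few of the 24 crop queries match a good crop.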
|
|
| Core implementation: [`cropping/models/conditional_detr_cpc.py`](https://github.com/BWGZK-keke/ProCrop/blob/main/cropping/models/conditional_detr_cpc.py) |
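
The mean-teacher idea behind the EMA self-distillation item can be sketched as a parameter update in which the teacher slowly tracks the student. This is a generic illustration with plain floats in a dictionary, not the repository's implementation. The function name `ema_update` and the momentum value are assumptions.

```python
def ema_update(teacher, student, momentum=0.999):
    """Move each teacher parameter toward the matching student parameter."""
    for name, value in student.items():
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * value
    return teacher


# Toy parameters (hypothetical); real models would hold tensors instead of floats.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
ema_update(teacher, student, momentum=0.9)
print(teacher)
```

In a mean-teacher setup the teacher's predictions typically serve as pseudo-labels for the student on weakly annotated data such as CAD, while only the student receives gradient updates.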
|
|
| ## Related Resources |
|
|
| - **Code (GitHub)**: https://github.com/BWGZK-keke/ProCrop |
| - **Paper (arXiv)**: https://arxiv.org/abs/2505.22490 |
| - **Dataset (HuggingFace)**: https://huggingface.co/datasets/BWGZK/procrop_dataset |
| - CAD dataset (242K weakly annotated images) |
| - Precomputed retrieval tables |
| - Pre-extracted SAM embedding databases |
| |
| ## Citation |
| |
| ```bibtex |
| @article{ProCrop2025, |
| title={ProCrop: Learning Aesthetic Image Cropping from Professional Compositions}, |
| author={Zhang, Ke and Ding, Tianyu and Jiang, Jiachen and Chen, Tianyi and Zharkov, Ilya and Patel, Vishal M. and Liang, Luming}, |
| journal={arXiv preprint arXiv:2505.22490}, |
| year={2025} |
| } |
| ``` |
| |
| ## License |
| |
| This checkpoint is released under the Apache 2.0 license. The model builds on [ConditionalDETR](https://github.com/Atten4Vis/ConditionalDETR), [RALF](https://github.com/CyberAgentAILab/RALF), and [Segment Anything](https://github.com/facebookresearch/segment-anything); please consult their respective licenses. |
| |