---
license: apache-2.0
tags:
- image-cropping
- aesthetic-cropping
- computer-vision
- retrieval-augmented
- conditional-detr
pipeline_tag: image-to-image
library_name: pytorch
datasets:
- BWGZK/procrop_dataset
language:
- en
---

# ProCrop: Learning Aesthetic Image Cropping from Professional Compositions

[![arXiv](https://img.shields.io/badge/arXiv-2505.22490-b31b1b.svg)](https://arxiv.org/abs/2505.22490) [![GitHub](https://img.shields.io/badge/GitHub-ProCrop-blue)](https://github.com/BWGZK-keke/ProCrop)

This is the **headline supervised checkpoint** for the AAAI 2026 paper "ProCrop: Learning Aesthetic Image Cropping from Professional Compositions" by Zhang et al.

## Model Description

ProCrop is a retrieval-augmented framework for aesthetic image cropping that leverages professional photography compositions as guidance. Given a query image, ProCrop:

1. **Retrieves** compositionally similar professional images from a large database (AVA / CGL) using SAM embeddings and Faiss nearest-neighbor search.
2. **Fuses** retrieved features with the query via cross-attention.
3. **Predicts** diverse crop proposals ranked by aesthetic score using a Conditional DETR decoder.

## Reported Performance (FLMS supervised setting)

| Metric | Value |
|--------|-------|
| **IoU** | **0.843** |
| **BDE (Displacement)** | **0.036** |

This checkpoint matches the FLMS row of Table 3 in the paper.

## Checkpoint Details

| Property | Value |
|----------|-------|
| File | `procrop_flms_supervised.pth` |
| Size | 512 MB |
| Original filename | `checkpoint0008200.8425250053405762.pth` |
| Trainable params | ~44.8M |
| Backbone | ResNet-50 (DC5) + Transformer encoder/decoder |
| Training data | CPCDataset (supervised) + AVA retrieval references |
| Evaluation | FLMS test set, IoU = 0.8425 |
| Training epoch | 83 |
| Crop queries | 24 (Conditional DETR style) |

## How to Use

### 1. Clone the GitHub repository

```bash
git clone https://github.com/BWGZK-keke/ProCrop.git
cd ProCrop
pip install -r requirements.txt
pip install git+https://github.com/openai/CLIP.git
```

### 2. Download this checkpoint

```python
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="BWGZK/ProCrop",
    filename="procrop_flms_supervised.pth"
)
```

Or with the CLI:

```bash
huggingface-cli download BWGZK/ProCrop procrop_flms_supervised.pth --local-dir ./checkpoints
```

### 3. Run inference on a single image

```bash
cd cropping
python test_singleimage.py \
    --dataset_root /path/to/your/images \
    --retrieval_cache_dir /path/to/retrieval_tables \
    --retrieval_img_dir /path/to/CGL_images \
    --resume ./checkpoints/procrop_flms_supervised.pth \
    --crop_savepath ./results
```

### 4. Evaluate on FLMS

```bash
cd cropping
python main_cpc.py \
    --dataset_root /path/to/FLMS \
    --retrieval_cache_dir /path/to/retrieval_tables \
    --resume ./checkpoints/procrop_flms_supervised.pth \
    --eval
```

You also need:

- **Precomputed retrieval tables** from [BWGZK/procrop_dataset](https://huggingface.co/datasets/BWGZK/procrop_dataset)
- **SAM ViT-B checkpoint** if training on GAIC/CAD: [download here](https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth)

## Architecture

ProCrop extends **Conditional DETR** with a retrieval augmentation module:

- **Backbone**: ResNet-50 with dilated C5 stage
- **Encoder**: 6-layer transformer encoder for the query image
- **Retrieval fusion**: Cross-attention between query features and top-K retrieved SAM embeddings (64×256)
- **Decoder**: 6-layer transformer decoder with N=24 learnable crop queries
- **Heads**:
  - 4-dim bounding-box MLP (3 layers)
  - 1-dim aesthetic-score classification head (binary focal loss)
- **EMA self-distillation**: Mean-teacher framework for weakly-supervised training on CAD

Core implementation:
[`cropping/models/conditional_detr_cpc.py`](https://github.com/BWGZK-keke/ProCrop/blob/main/cropping/models/conditional_detr_cpc.py)

## Related Resources

- **Code (GitHub)**: https://github.com/BWGZK-keke/ProCrop
- **Paper (arXiv)**: https://arxiv.org/abs/2505.22490
- **Dataset (HuggingFace)**: https://huggingface.co/datasets/BWGZK/procrop_dataset
  - CAD dataset (242K weakly annotated images)
  - Precomputed retrieval tables
  - Pre-extracted SAM embedding databases

## Citation

```bibtex
@article{ProCrop2025,
  title={ProCrop: Learning Aesthetic Image Cropping from Professional Compositions},
  author={Zhang, Ke and Ding, Tianyu and Jiang, Jiachen and Chen, Tianyi and Zharkov, Ilya and Patel, Vishal M. and Liang, Luming},
  journal={arXiv preprint arXiv:2505.22490},
  year={2025}
}
```

## License

Apache 2.0. The model builds on [ConditionalDETR](https://github.com/Atten4Vis/ConditionalDETR), [RALF](https://github.com/CyberAgentAILab/RALF), and [Segment Anything](https://github.com/facebookresearch/segment-anything); please consult their respective licenses.
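## Appendix: Retrieval Sketch

The retrieval step described under *Model Description* (finding compositionally similar reference images via nearest-neighbor search over SAM embeddings) can be illustrated in isolation. The sketch below uses plain NumPy cosine similarity as a stand-in for the Faiss index the repository actually builds; the function name, array shapes, and toy data are illustrative only and are not part of the ProCrop API.

```python
import numpy as np

def retrieve_top_k(query_emb, db_embs, k=5):
    """Return indices of the k database embeddings most similar to the
    query by cosine similarity (illustrative stand-in for a Faiss index)."""
    # Normalize so that a dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    db = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    sims = db @ q                      # (N,) similarity scores
    return np.argsort(-sims)[:k]       # indices sorted by descending similarity

# Toy database: 100 random "SAM embeddings" of dimension 256.
rng = np.random.default_rng(0)
db = rng.normal(size=(100, 256))
# A query that is a slightly perturbed copy of entry 42.
query = db[42] + 0.01 * rng.normal(size=256)

print(retrieve_top_k(query, db, k=3))  # entry 42 should rank first
```

In the real pipeline the returned indices would be used to fetch the corresponding precomputed SAM feature maps, which are then fused with the query features via cross-attention.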