---
library_name: transformers
tags:
- vision-language-model
- image-decomposition
---

# SynLayers

This repository contains the assets behind SynLayers, our two-stage image decomposition system.

At the root is the bbox-caption model. Given a single image, it predicts:

- a whole-image caption
- bounding boxes for visible objects or layers

The repository also includes the Stage 2 SynLayers pipeline, which performs the layer decomposition itself.

The easiest way to try the full system is our public demo: [SynLayers/synlayers](https://huggingface.co/spaces/SynLayers/synlayers)

This repo is not meant to be used as a single generic `DiffusionPipeline(prompt)` model. The full SynLayers pipeline is:

1. bbox + whole-image caption prediction
2. layer decomposition into transparent RGBA outputs

If you only need the Stage 1 model at the repo root, you can load it with `transformers`:

```python
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "SynLayers/Bbox-caption-8b",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("SynLayers/Bbox-caption-8b")
```

For more details on our implementation, please see our paper: https://arxiv.org/abs/2605.15167

If you find our work useful, please consider citing:

```bibtex
@misc{wu2026doessyntheticlayereddesign,
  title={Does Synthetic Layered Design Data Benefit Layered Design Decomposition?},
  author={Kam Man Wu and Haolin Yang and Qingyu Chen and Yihu Tang and Jingye Chen and Qifeng Chen},
  year={2026},
  eprint={2605.15167},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2605.15167},
}
```

Thanks for trying SynLayers.
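Once the Stage 1 model is loaded, running it follows the usual Qwen-VL chat workflow. The sketch below is a minimal, non-authoritative example: the prompt wording, the `build_messages` helper, and the expected output format are our assumptions, not part of this repo's API — check the demo Space for the exact prompt the model was trained with.

```python
def build_messages(image_path: str, prompt: str) -> list:
    # Qwen-VL style chat message pairing one image with a text instruction.
    # The prompt text is illustrative only (an assumption, not the trained prompt).
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": prompt},
            ],
        }
    ]


def predict_caption_and_boxes(image_path: str) -> str:
    # Heavy imports kept local so build_messages stays dependency-free.
    from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

    model = Qwen3VLForConditionalGeneration.from_pretrained(
        "SynLayers/Bbox-caption-8b", torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained("SynLayers/Bbox-caption-8b")

    messages = build_messages(
        image_path,
        "Describe the image and give bounding boxes for each layer.",
    )
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=512)
    # Strip the prompt tokens so only the model's answer is decoded.
    trimmed = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]
```

The returned string contains the whole-image caption and bounding boxes, which Stage 2 then consumes for layer decomposition; parsing it depends on the model's output format, which the demo illustrates end to end.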