---
library_name: transformers
license: mit
tags:
- image-segmentation
- semantic-segmentation
- segformer
- facade
- building
- mixed-rectification
- unrectified
- vision
pipeline_tag: image-segmentation
datasets:
- Xpitfire/cmp_facade
- merve/scene_parse_150
metrics:
- mean_iou
model-index:
- name: segformer-b0-facade-mixed
  results:
  - task:
      type: image-segmentation
      name: Semantic Segmentation
    dataset:
      type: mixed
      name: CMP Facade + ADE20K filtered
      split: validation
    metrics:
    - type: mean_iou
      value: 0.0
      name: Mean IoU
---

# SegFormer-B0 — Facade Segmentation (Mixed Rectified + Unrectified)

> **Status:** Training script ready. Run `train.py` to produce this model. See **How to Train** below.

A **SegFormer-B0** model trained on **mixed rectified and unrectified facade data** for a 2-pass pipeline:

1. **Pass 1** — Run on the raw street photo (unrectified/perspective) to detect the dominant facade wall
2. **Rectify** the detected facade via homography
3. **Pass 2** — Run on the rectified crop to cleanly extract windows, doors, and balconies

| | |
|---|---|
| **Architecture** | SegFormer-B0 (Mix Transformer encoder + all-MLP decoder) |
| **Parameters** | ~3.7 M |
| **Input** | RGB image, any resolution (resized to 512×512) |
| **Output** | 6-class pixel mask |
| **Format** | SafeTensors |
| **Base model** | [Marco333/segformer-b0-facade-cmp](https://huggingface.co/Marco333/segformer-b0-facade-cmp) |

## 6-Class Taxonomy

| ID | Class | Function | Pass |
|:--:|-------|----------|:--:|
| 0 | `background` | Sky, ground, non-facade regions | Both |
| 1 | `facade_wall` | Main wall surface (merged: facade, molding, cornice, pillar, sill, deco) | Both |
| 2 | `window` | Windows + blinds + shopfronts | Both |
| 3 | `door` | Doors + shopfronts | Both |
| 4 | `balcony` | Balconies | Both |
| 5 | `vegetation_occluder` | Trees and plants occluding the facade | Both |

## Two-Pass Pipeline

```python
from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation
from PIL import Image
import torch
import torch.nn.functional as F

processor = SegformerImageProcessor.from_pretrained("Marco333/segformer-b0-facade-mixed")
model = SegformerForSemanticSegmentation.from_pretrained("Marco333/segformer-b0-facade-mixed")
model.eval()

# Pass 1: raw street photo
image = Image.open("street_photo.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Logits come out at reduced resolution; upsample to the input size before argmax.
# PIL's .size is (W, H), so reverse it for interpolate's (H, W).
mask = F.interpolate(
    outputs.logits, size=image.size[::-1], mode="bilinear", align_corners=False
).argmax(dim=1)[0]

# Find the biggest facade_wall blob (class 1), compute homography, rectify...
# Then Pass 2 on the rectified crop:
rectified = Image.open("rectified_facade.jpg").convert("RGB")
inputs2 = processor(images=rectified, return_tensors="pt")
with torch.no_grad():
    outputs2 = model(**inputs2)
mask2 = F.interpolate(
    outputs2.logits, size=rectified.size[::-1], mode="bilinear", align_corners=False
).argmax(dim=1)[0]
```

## How to Train

This repo contains the training script. Run it on any GPU (T4 or better):

```bash
pip install transformers datasets evaluate accelerate torch torchvision Pillow numpy
python train.py
```

**What the script does:**

1. Loads the **CMP Facade** dataset (~492 rectified images, 12 classes) and remaps it to the 6-class taxonomy
2. Loads **ADE20K scene_parse_150** (~20K images), filters it to building-containing scenes, and remaps it to the same 6 classes
3. Applies **perspective augmentation** (RandomPerspective, p=0.4) during training to simulate oblique camera angles
4. Starts from the existing model (`segformer-b0-facade-cmp`, 48.56% mIoU)
5. Trains for 80 epochs and pushes the best model to this Hub repo

**Training time:** ~4-6 h on a T4 GPU

## Data Sources

| Dataset | Type | Images | Geometry | Classes (raw) | Classes (remapped) |
|---------|------|--------|----------|---------------|--------------------|
| [CMP Facade](https://huggingface.co/datasets/Xpitfire/cmp_facade) | Primary | ~492 | Rectified | 12 | 6 (background, wall, window, door, balcony, ignore) |
| [ADE20K scene_parse_150](https://huggingface.co/datasets/merve/scene_parse_150) | Augmentation | ~5K filtered | Unrectified (perspective) | 150 | 6 (same taxonomy) |

### Why mix these?

- **CMP alone** is excellent on rectified facades but fails on street-view perspective
- **ADE20K** adds natural perspective building scenes (wall, building, house, skyscraper classes)
- **Perspective augmentation** (`RandomPerspective`, distortion=0.3, p=0.4) closes the geometric domain gap

The literature confirms the gap: [Texture2LoD3](https://huggingface.co/papers/2504.05249) measured a drop of roughly 10 pp IoU for SegFormer on unrectified vs. rectified facades. Perspective augmentation during training is the practical fix.

## Training Configuration

| Parameter | Value |
|-----------|-------|
| Base checkpoint | `Marco333/segformer-b0-facade-cmp` |
| Optimizer | AdamW |
| Learning rate | 6 × 10⁻⁵ |
| LR schedule | Polynomial decay |
| Warmup | 10% of steps |
| Weight decay | 0.01 |
| Effective batch size | 8 (4 per device × 2 grad accum) |
| Resolution | 512 × 512 |
| Precision | FP16 |
| Epochs | 80 |
| Augmentation | ColorJitter + RandomPerspective (p=0.4, distortion=0.3) |
| Selection metric | Highest mean IoU on validation |

## Expected Improvements vs. CMP-Only Model

| Capability | CMP-only (baseline) | Mixed (this model) |
|---|---|---|
| Rectified facades | ✅ 48.6% mIoU | ✅ Likely 55-70% (more data + transfer) |
| Unrectified street photos | ❌ Untrained | ✅ Trained on ADE20K perspective scenes |
| Perspective robustness | ~10 pp IoU drop | Gap closed via augmentation |

## Citation

CMP Facade:

```bibtex
@INPROCEEDINGS{Tylecek13,
  author    = {Radim Tyle{\v c}ek and Radim {\v S}{\' a}ra},
  title     = {Spatial Pattern Templates for Recognition of Objects with Regular Structure},
  booktitle = {Proc. GCPR},
  year      = {2013},
}
```

ADE20K:

```bibtex
@inproceedings{zhou2017scene,
  title     = {Scene Parsing through ADE20K Dataset},
  author    = {Zhou, Bolei and Zhao, Hang and Puig, Xavier and Fidler, Sanja and Barriuso, Adela and Torralba, Antonio},
  booktitle = {CVPR},
  year      = {2017}
}
```

SegFormer:

```bibtex
@article{xie2021segformer,
  title   = {SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers},
  author  = {Xie, Enze and Wang, Wenhai and Yu, Zhiding and Anandkumar, Anima and Alvarez, Jose M and Luo, Ping},
  journal = {arXiv preprint arXiv:2105.15203},
  year    = {2021}
}
```
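
## Appendix: Finding the Dominant Facade Blob

The two-pass pipeline leaves "find the biggest `facade_wall` blob" as an exercise. Below is a minimal, dependency-free sketch of that step. The function name `largest_facade_bbox` and the 4-connectivity choice are my assumptions, not part of the training script; a production pipeline would more likely use `cv2.connectedComponentsWithStats` and estimate a full homography from the blob's corners rather than take an axis-aligned bounding box.

```python
from collections import deque

import numpy as np

FACADE_WALL = 1  # class id for `facade_wall` in the 6-class taxonomy


def largest_facade_bbox(mask: np.ndarray, cls: int = FACADE_WALL):
    """Inclusive (top, bottom, left, right) bounding box of the largest
    4-connected blob of `cls` in a 2-D integer class mask; None if absent."""
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    best, best_size = None, 0
    for y in range(h):
        for x in range(w):
            if mask[y, x] != cls or seen[y, x]:
                continue
            # BFS over one connected component, tracking its extent
            queue = deque([(y, x)])
            seen[y, x] = True
            ys, xs = [y], [x]
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and not seen[ny, nx] and mask[ny, nx] == cls:
                        seen[ny, nx] = True
                        queue.append((ny, nx))
                        ys.append(ny)
                        xs.append(nx)
            if len(ys) > best_size:
                best_size = len(ys)
                best = (min(ys), max(ys), min(xs), max(xs))
    return best
```

Applied to the Pass 1 output as `largest_facade_bbox(mask.cpu().numpy())`, the returned box gives the candidate region to rectify before Pass 2.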
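
## Appendix: Class Remapping Sketch

The 12→6 (CMP) and 150→6 (ADE20K) remapping described under **How to Train** can be implemented as a NumPy lookup table. The mapping dict below is purely illustrative — the raw ids must be checked against each dataset's actual label map before training.

```python
import numpy as np

# Illustrative raw-id -> 6-class table (placeholder ids; verify against
# the dataset's real label map). Unlisted raw ids fall back to background.
CMP_TO_SIX = {0: 0, 2: 1, 3: 2, 4: 3, 7: 4}


def remap_mask(mask: np.ndarray, table: dict) -> np.ndarray:
    """Vectorized id remap via lookup table; unmapped ids become 0."""
    size = max(int(mask.max()), max(table)) + 1
    lut = np.zeros(size, dtype=np.uint8)  # default: 0 (background)
    for raw_id, new_id in table.items():
        lut[raw_id] = new_id
    return lut[mask]
```

The LUT approach keeps the remap a single fancy-indexing operation per mask, which matters when it runs inside the training dataloader.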