---
library_name: transformers
license: mit
tags:
- image-segmentation
- semantic-segmentation
- segformer
- facade
- building
- mixed-rectification
- unrectified
- vision
pipeline_tag: image-segmentation
datasets:
- Xpitfire/cmp_facade
- merve/scene_parse_150
metrics:
- mean_iou
model-index:
- name: segformer-b0-facade-mixed
  results:
  - task:
      type: image-segmentation
      name: Semantic Segmentation
    dataset:
      type: mixed
      name: CMP Facade + ADE20K filtered
      split: validation
    metrics:
    - type: mean_iou
      value: 0.0
      name: Mean IoU
---

# SegFormer-B0 – Facade Segmentation (Mixed Rectified + Unrectified)

> **Status:** Training script ready. Run `train.py` to produce this model. See **How to Train** below.

A **SegFormer-B0** model trained on **mixed rectified and unrectified facade data** for a two-pass pipeline:

1. **Pass 1** – run on the raw street photo (unrectified/perspective) to detect the dominant facade wall
2. **Rectify** the detected facade via homography
3. **Pass 2** – run on the rectified crop to cleanly extract windows, doors, and balconies

| | |
|---|---|
| **Architecture** | SegFormer-B0 (Mix Transformer encoder + all-MLP decoder) |
| **Parameters** | ~3.7 M |
| **Input** | RGB image, any resolution (resized to 512×512) |
| **Output** | 6-class pixel mask |
| **Format** | SafeTensors |
| **Base model** | [Marco333/segformer-b0-facade-cmp](https://huggingface.co/Marco333/segformer-b0-facade-cmp) |

## 6-Class Taxonomy

| ID | Class | Function | Pass |
|:--:|-------|----------|:--:|
| 0 | `background` | Sky, ground, non-facade regions | Both |
| 1 | `facade_wall` | Main wall surface (merged: facade, molding, cornice, pillar, sill, deco) | Both |
| 2 | `window` | Windows + blinds + shopfronts | Both |
| 3 | `door` | Doors + shopfronts | Both |
| 4 | `balcony` | Balconies | Both |
| 5 | `vegetation_occluder` | Trees, plants occluding the facade | Both |
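
For reference, the taxonomy as the `id2label`/`label2id` dicts a training script would pass to the model config (a sketch; once the model is trained, the checkpoint's own `config.json` is authoritative):

```python
# 6-class taxonomy as id2label/label2id dicts (mirrors the table above).
id2label = {
    0: "background",
    1: "facade_wall",
    2: "window",
    3: "door",
    4: "balcony",
    5: "vegetation_occluder",
}
label2id = {name: idx for idx, name in id2label.items()}
```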

## Two-Pass Pipeline

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation

processor = SegformerImageProcessor.from_pretrained("Marco333/segformer-b0-facade-mixed")
model = SegformerForSemanticSegmentation.from_pretrained("Marco333/segformer-b0-facade-mixed")
model.eval()

# Pass 1: raw street photo
image = Image.open("street_photo.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Logits are at 1/4 of the input resolution; upsample to (H, W) before argmax.
mask = F.interpolate(outputs.logits, size=image.size[::-1],
                     mode="bilinear", align_corners=False).argmax(dim=1)[0]

# Find the biggest facade_wall blob (class 1), compute a homography, rectify
# (see the sketch below). Then Pass 2 on the rectified crop:
rectified = Image.open("rectified_facade.jpg").convert("RGB")
inputs2 = processor(images=rectified, return_tensors="pt")
with torch.no_grad():
    outputs2 = model(**inputs2)
mask2 = F.interpolate(outputs2.logits, size=rectified.size[::-1],
                      mode="bilinear", align_corners=False).argmax(dim=1)[0]
```
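
The rectification between the passes is only sketched as a comment above. Below is a minimal OpenCV version, assuming the detected facade outline is roughly quadrilateral; `rectify_largest_facade`, the corner-ordering trick, and the fixed 512×512 output size are illustrative choices, not code from this repo:

```python
import cv2
import numpy as np

def rectify_largest_facade(image_bgr, mask, wall_id=1, out_size=(512, 512)):
    """Warp the largest facade_wall component to a fronto-parallel crop."""
    wall = (mask == wall_id).astype(np.uint8)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(wall, connectivity=8)
    if n < 2:
        return None  # no facade pixels found
    biggest = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))
    blob = (labels == biggest).astype(np.uint8)

    # Approximate the blob outline with a quadrilateral.
    contours, _ = cv2.findContours(blob, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    hull = cv2.convexHull(max(contours, key=cv2.contourArea))
    quad = cv2.approxPolyDP(hull, 0.02 * cv2.arcLength(hull, True), True)
    if len(quad) != 4:
        return None  # outline is not quad-like; skip rectification

    # Order corners as (tl, tr, br, bl): tl minimizes x+y, br maximizes x+y,
    # tr minimizes y-x, bl maximizes y-x.
    pts = quad.reshape(4, 2).astype(np.float32)
    s, d = pts.sum(axis=1), np.diff(pts, axis=1).ravel()
    src = np.array([pts[np.argmin(s)], pts[np.argmin(d)],
                    pts[np.argmax(s)], pts[np.argmax(d)]], dtype=np.float32)
    w, h = out_size
    dst = np.array([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]], dtype=np.float32)
    H = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(image_bgr, H, out_size)
```

Fed from Pass 1 above, e.g. `rectified = rectify_largest_facade(cv2.imread("street_photo.jpg"), mask.numpy())`.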

## How to Train

This repo contains the training script. Run it on any GPU (T4 or better):

```bash
pip install transformers datasets evaluate accelerate torch torchvision Pillow numpy
python train.py
```

**What the script does:**

1. Loads the **CMP Facade** dataset (~492 rectified images, 12 classes) → remaps to the 6-class taxonomy (see the sketch below)
2. Loads **ADE20K scene_parse_150** (~20K images) → filters to building-containing scenes → remaps to the same 6 classes
3. Applies **perspective augmentation** (RandomPerspective, p=0.4) during training → simulates oblique camera angles
4. Starts from the existing model (`segformer-b0-facade-cmp`, 48.56% mIoU)
5. Trains for 80 epochs and pushes the best model to this Hub repo
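
A sketch of the 12-to-6 remap in step 1, assuming the standard CMP Facade label order (1 = background, 2 = facade, 3 = window, 4 = door, 5 = cornice, 6 = sill, 7 = balcony, 8 = blind, 9 = deco, 10 = molding, 11 = pillar, 12 = shop); verify the IDs against the dataset you actually load, and note that where the `shop` class lands is a design choice, not fixed by this card:

```python
import numpy as np

# CMP (1-12) -> 6-class taxonomy. Mapping shop -> window here is one
# possible reading of the taxonomy table above.
CMP_TO_6 = {
    1: 0,                                   # background
    2: 1, 5: 1, 6: 1, 9: 1, 10: 1, 11: 1,   # wall surface + merged detail classes
    3: 2, 8: 2, 12: 2,                      # window + blind + shop
    4: 3,                                   # door
    7: 4,                                   # balcony
}

def remap_cmp(mask: np.ndarray) -> np.ndarray:
    out = np.zeros_like(mask)
    for src, dst in CMP_TO_6.items():
        out[mask == src] = dst
    return out
```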

**Training time:** ~4–6 h on a T4 GPU

## Data Sources

| Dataset | Type | Images | Geometry | Classes (raw) | Classes (remapped) |
|---------|------|--------|----------|--------------|-------------------|
| [CMP Facade](https://huggingface.co/datasets/Xpitfire/cmp_facade) | Primary | ~492 | Rectified | 12 | 6 (background, wall, window, door, balcony, ignore) |
| [ADE20K scene_parse_150](https://huggingface.co/datasets/merve/scene_parse_150) | Augmentation | ~5K filtered | Unrectified (perspective) | 150 | 6 (same taxonomy) |
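
A sketch of the building-scene filter behind the "~5K filtered" count, assuming the common 1-indexed scene_parse_150 IDs (2 = building, 26 = house, 49 = skyscraper) and an `annotation` column holding the label map; the 10% area threshold is an illustrative choice, not read from `train.py`:

```python
import numpy as np
from datasets import load_dataset

BUILDING_IDS = [2, 26, 49]    # assumed ADE20K IDs: building, house, skyscraper
MIN_BUILDING_FRACTION = 0.10  # assumed threshold for "building-containing"

def has_building(example):
    # Keep scenes where building-like classes cover enough of the label map.
    labels = np.array(example["annotation"])
    return np.isin(labels, BUILDING_IDS).mean() >= MIN_BUILDING_FRACTION

ade = load_dataset("merve/scene_parse_150", split="train")
ade_buildings = ade.filter(has_building)
```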

### Why mix these?

- **CMP alone** = excellent on rectified facades, but fails on street-view perspective
- **ADE20K** adds natural-perspective building scenes (wall, building, house, skyscraper classes)
- **Perspective augmentation** (`RandomPerspective`, distortion=0.3, p=0.4) closes the geometric domain gap (see the sketch below)
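
A minimal sketch of that augmentation as a joint image+mask transform using torchvision's v2 API; only `distortion_scale=0.3` and `p=0.4` come from this card, the ColorJitter strengths are assumptions:

```python
from torchvision import tv_tensors
from torchvision.transforms import v2

# Geometric transforms must hit image and mask with identical parameters;
# v2 transforms do this when both are passed together, and tv_tensors.Mask
# gets nearest-neighbor interpolation so class IDs stay intact.
augment = v2.Compose([
    v2.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),  # assumed strengths
    v2.RandomPerspective(distortion_scale=0.3, p=0.4, fill=0),
])

def augment_pair(image, mask):
    img = tv_tensors.Image(image)  # PIL image or CHW tensor
    msk = tv_tensors.Mask(mask)    # HW integer label map
    return augment(img, msk)
```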

The literature confirms the gap: [Texture2LoD3](https://huggingface.co/papers/2504.05249) measured a ~10 pp IoU drop for SegFormer on unrectified versus rectified facades. Perspective augmentation during training is the practical fix.

## Training Configuration

| Parameter | Value |
|-----------|-------|
| Base checkpoint | `Marco333/segformer-b0-facade-cmp` |
| Optimizer | AdamW |
| Learning rate | 6 × 10⁻⁵ |
| LR schedule | Polynomial decay |
| Warmup | 10% of steps |
| Weight decay | 0.01 |
| Effective batch size | 8 (4 per device × 2 grad accum) |
| Resolution | 512 × 512 |
| Precision | FP16 |
| Epochs | 80 |
| Augmentation | ColorJitter + RandomPerspective (p=0.4, distortion=0.3) |
| Selection metric | Highest mean IoU on validation |
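
How this table might map onto `transformers.TrainingArguments` (a sketch; the values come from the table above, but `train.py` is the source of truth for the output dir and eval cadence):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="segformer-b0-facade-mixed",  # assumed
    learning_rate=6e-5,
    lr_scheduler_type="polynomial",
    warmup_ratio=0.1,
    weight_decay=0.01,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,  # effective batch size 8
    num_train_epochs=80,
    fp16=True,
    eval_strategy="epoch",          # assumed cadence
    save_strategy="epoch",
    load_best_model_at_end=True,    # keep checkpoint with highest mean IoU
    metric_for_best_model="mean_iou",
    greater_is_better=True,
    push_to_hub=True,
)
```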

## Expected Improvements vs. CMP-Only Model

| Capability | CMP-only (baseline) | Mixed (this model) |
|---|---|---|
| Rectified facades | ✅ 48.6% mIoU | ✅ Likely 55-70% (more data + transfer) |
| Unrectified street photos | ❌ Untrained | ✅ Trained on ADE20K perspective scenes |
| Perspective robustness | ~10 pp IoU drop | Gap closed via augmentation |

## Citation

CMP Facade:
```bibtex
@inproceedings{Tylecek13,
  author = {Radim Tyle{\v c}ek and Radim {\v S}{\' a}ra},
  title = {Spatial Pattern Templates for Recognition of Objects with Regular Structure},
  booktitle = {Proc. GCPR},
  year = {2013}
}
```

ADE20K:
```bibtex
@inproceedings{zhou2017scene,
  title = {Scene Parsing through ADE20K Dataset},
  author = {Zhou, Bolei and Zhao, Hang and Puig, Xavier and Fidler, Sanja and Barriuso, Adela and Torralba, Antonio},
  booktitle = {CVPR},
  year = {2017}
}
```

SegFormer:
```bibtex
@article{xie2021segformer,
  title = {SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers},
  author = {Xie, Enze and Wang, Wenhai and Yu, Zhiding and Anandkumar, Anima and Alvarez, Jose M and Luo, Ping},
  journal = {arXiv preprint arXiv:2105.15203},
  year = {2021}
}
```