SegFormer-B0 β€” Facade Segmentation (Mixed Rectified + Unrectified)

Status: Training script ready. Run train.py to produce this model. See How to Train below.

A SegFormer-B0 model trained on mixed rectified and unrectified facade data for a 2-pass pipeline:

  1. Pass 1 β€” Run on raw street photo (unrectified/perspective) to detect the dominant facade wall
  2. Rectify detected facade via homography
  3. Pass 2 β€” Run on rectified crop to extract windows, doors, balconies cleanly
| Property | Value |
|---|---|
| Architecture | SegFormer-B0 (Mix Transformer encoder + all-MLP decoder) |
| Parameters | ~3.7 M |
| Input | RGB image, any resolution (resized to 512 × 512) |
| Output | 6-class pixel mask |
| Format | SafeTensors |
| Base model | Marco333/segformer-b0-facade-cmp |

6-Class Taxonomy

| ID | Class | Function | Pass |
|---|---|---|---|
| 0 | background | Sky, ground, non-facade regions | Both |
| 1 | facade_wall | Main wall surface (merged: facade, molding, cornice, pillar, sill, deco) | Both |
| 2 | window | Windows + blinds | Both |
| 3 | door | Doors + shopfronts | Both |
| 4 | balcony | Balconies | Both |
| 5 | vegetation_occluder | Trees, plants occluding facade | Both |

Two-Pass Pipeline

```python
from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation
from PIL import Image
import torch
import torch.nn.functional as F

processor = SegformerImageProcessor.from_pretrained("Marco333/segformer-b0-facade-mixed")
model = SegformerForSemanticSegmentation.from_pretrained("Marco333/segformer-b0-facade-mixed")
model.eval()

# Pass 1: raw street photo
image = Image.open("street_photo.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Upsample logits to the input resolution; PIL's .size is (W, H), so reverse it.
mask = F.interpolate(outputs.logits, size=image.size[::-1],
                     mode="bilinear", align_corners=False).argmax(dim=1)[0]

# Find the biggest facade_wall blob (class 1), compute a homography, rectify...
# Then Pass 2 on the rectified crop:
rectified = Image.open("rectified_facade.jpg").convert("RGB")
inputs2 = processor(images=rectified, return_tensors="pt")
with torch.no_grad():
    outputs2 = model(**inputs2)
mask2 = F.interpolate(outputs2.logits, size=rectified.size[::-1],
                      mode="bilinear", align_corners=False).argmax(dim=1)[0]
```
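The elided middle step (rectifying the detected facade) reduces to a 4-point homography solve once the corners of the dominant facade_wall blob are known. Below is a minimal numpy sketch of that solve via the direct linear transform; corner extraction itself (e.g. contour approximation on the class-1 mask) is left out, and all names and coordinates here are illustrative, not part of this repo:

```python
import numpy as np

def homography_from_corners(src, dst):
    """Solve dst ~ H @ src for the 3x3 homography H via the direct
    linear transform (DLT). src/dst are (4, 2) corner arrays."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # H is the right null vector of A (smallest singular value).
    _, _, vt = np.linalg.svd(np.asarray(A, dtype=np.float64))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

def warp_points(H, pts):
    """Apply H to (N, 2) points using homogeneous coordinates."""
    p = np.hstack([pts, np.ones((len(pts), 1))])
    q = p @ H.T
    return q[:, :2] / q[:, 2:3]

# Corners of a tilted facade quad (TL, TR, BR, BL; illustrative values),
# mapped to an upright 120x140 rectangle.
src = np.array([[40, 30], [160, 50], [150, 170], [50, 150]], dtype=np.float64)
dst = np.array([[0, 0], [120, 0], [120, 140], [0, 140]], dtype=np.float64)
H = homography_from_corners(src, dst)
```

With OpenCV available, the same result comes from `cv2.getPerspectiveTransform(src, dst)`, and the image itself is warped with `cv2.warpPerspective`.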

How to Train

This repo contains the training script. Run it on any GPU (T4 or better):

```shell
pip install transformers datasets evaluate accelerate torch torchvision Pillow numpy
python train.py
```

What the script does:

  1. Loads CMP Facade dataset (~492 rectified images, 12 classes) β†’ remaps to 6-class
  2. Loads ADE20K scene_parse_150 (~20K images) β†’ filters to building-containing scenes β†’ remaps to same 6-class
  3. Applies perspective augmentation (RandomPerspective, p=0.4) during training β€” simulates oblique camera angles
  4. Starts from the existing CMP-only checkpoint (segformer-b0-facade-cmp, 48.56% mIoU)
  5. Trains 80 epochs, pushes best model to this Hub repo
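The label remap in step 1 is a single lookup-table indexing operation. The sketch below is illustrative: the raw CMP class IDs and their order are assumptions, not the actual mapping in train.py. Step 3's warp is torchvision's RandomPerspective(distortion_scale=0.3, p=0.4), applied jointly to image and mask.

```python
import numpy as np

# Hypothetical CMP 12-class -> 6-class lookup table (index = raw CMP ID).
# Target IDs: 0 background, 1 facade_wall, 2 window, 3 door, 4 balcony,
# 5 vegetation_occluder. The raw-ID order below is an assumption.
CMP_TO_6CLASS = np.array([
    0,  # background
    1,  # facade   -> facade_wall
    2,  # window
    3,  # door
    1,  # cornice  -> facade_wall
    1,  # sill     -> facade_wall
    4,  # balcony
    2,  # blind    -> window
    1,  # deco     -> facade_wall
    1,  # molding  -> facade_wall
    1,  # pillar   -> facade_wall
    3,  # shop     -> door
], dtype=np.uint8)

def remap_mask(raw_mask):
    """Vectorized remap of a raw CMP label mask to the 6-class taxonomy."""
    return CMP_TO_6CLASS[raw_mask]
```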

Training time: ~4–6 hours on a T4 GPU

Data Sources

| Dataset | Type | Images | Geometry | Classes (raw) | Classes (remapped) |
|---|---|---|---|---|---|
| CMP Facade | Primary | ~492 | Rectified | 12 | 6 (background, wall, window, door, balcony, ignore) |
| ADE20K scene_parse_150 | Augmentation | ~5K filtered | Unrectified (perspective) | 150 | 6 (same taxonomy) |

Why mix these?

  • CMP alone = excellent on rectified facades, fails on street-view perspective
  • ADE20K adds natural perspective building scenes (wall, building, house, skyscraper classes)
  • Perspective augmentation (RandomPerspective, distortion=0.3, p=0.4) closes the geometric domain gap

Literature confirms the gap: Texture2LoD3 measured a drop of ~10 pp IoU for SegFormer on unrectified vs. rectified facades. Perspective augmentation during training is the practical fix.

Training Configuration

| Parameter | Value |
|---|---|
| Base checkpoint | Marco333/segformer-b0-facade-cmp |
| Optimizer | AdamW |
| Learning rate | 6 × 10⁻⁵ |
| LR schedule | Polynomial decay |
| Warmup | 10% of steps |
| Weight decay | 0.01 |
| Effective batch size | 8 (4 per device × 2 gradient accumulation steps) |
| Resolution | 512 × 512 |
| Precision | FP16 |
| Epochs | 80 |
| Augmentation | ColorJitter + RandomPerspective (p=0.4, distortion=0.3) |
| Selection metric | Highest mean IoU on validation |
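The configuration above translates almost one-to-one into Hugging Face `TrainingArguments`. A hedged sketch, not the literal train.py contents (output directory, evaluation/save strategies, and Hub push wiring are assumptions):

```python
from transformers import TrainingArguments

# Sketch of the configuration table above; strategies and output dir
# are assumptions, not the exact train.py contents.
args = TrainingArguments(
    output_dir="segformer-b0-facade-mixed",
    learning_rate=6e-5,                # 6 x 10^-5
    lr_scheduler_type="polynomial",    # polynomial decay
    warmup_ratio=0.1,                  # 10% of steps
    weight_decay=0.01,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,     # effective batch size 8
    num_train_epochs=80,
    fp16=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="mean_iou",  # highest validation mIoU wins
    push_to_hub=True,
)
```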

Expected Improvements vs. CMP-Only Model

| Capability | CMP-only (baseline) | Mixed (this model) |
|---|---|---|
| Rectified facades | ✅ 48.6% mIoU | ✅ Likely 55–70% (more data + transfer) |
| Unrectified street photos | ❌ Untrained | ✅ Trained on ADE20K perspective scenes |
| Perspective robustness | ~10 pp IoU drop | Gap closed via augmentation |
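For reference, the selection metric (mean IoU over the classes present in a prediction/ground-truth pair) can be computed as below. This is a self-contained illustration; train.py may instead use the `evaluate` library's `mean_iou` metric.

```python
import numpy as np

def mean_iou(pred, gt, num_classes=6, ignore_index=255):
    """Per-class intersection-over-union, averaged over classes that
    actually occur (union > 0), with ignore_index pixels masked out."""
    valid = gt != ignore_index
    pred, gt = pred[valid], gt[valid]
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```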

Citation

CMP Facade:

```bibtex
@inproceedings{Tylecek13,
  author = {Radim Tyle{\v c}ek and Radim {\v S}{\' a}ra},
  title = {Spatial Pattern Templates for Recognition of Objects with Regular Structure},
  booktitle = {Proc. GCPR},
  year = {2013}
}
```

ADE20K:

```bibtex
@inproceedings{zhou2017scene,
  title = {Scene Parsing through ADE20K Dataset},
  author = {Zhou, Bolei and Zhao, Hang and Puig, Xavier and Fidler, Sanja and Barriuso, Adela and Torralba, Antonio},
  booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2017}
}
```

SegFormer:

```bibtex
@article{xie2021segformer,
  title = {SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers},
  author = {Xie, Enze and Wang, Wenhai and Yu, Zhiding and Anandkumar, Anima and Alvarez, Jose M and Luo, Ping},
  journal = {arXiv preprint arXiv:2105.15203},
  year = {2021}
}
```