SegFormer-B0 β€” Facade Segmentation (Mixed Rectified + Unrectified)

Status: Training script ready. Run train.py to produce this model. See How to Train below.

A SegFormer-B0 model trained on mixed rectified and unrectified facade data for a 2-pass pipeline:

  1. Pass 1 β€” Run on raw street photo (unrectified/perspective) to detect the dominant facade wall
  2. Rectify detected facade via homography
  3. Pass 2 β€” Run on rectified crop to extract windows, doors, balconies cleanly
| Property | Value |
|---|---|
| Architecture | SegFormer-B0 (Mix Transformer encoder + all-MLP decoder) |
| Parameters | ~3.7 M |
| Input | RGB image, any resolution (resized to 512 × 512) |
| Output | 6-class pixel mask |
| Format | SafeTensors |
| Base model | Marco333/segformer-b0-facade-cmp |

6-Class Taxonomy

| ID | Class | Function | Pass |
|---|---|---|---|
| 0 | background | Sky, ground, non-facade regions | Both |
| 1 | facade_wall | Main wall surface (merged: facade, molding, cornice, pillar, sill, deco) | Both |
| 2 | window | Windows + blinds | Both |
| 3 | door | Doors + shopfronts | Both |
| 4 | balcony | Balconies | Both |
| 5 | vegetation_occluder | Trees, plants occluding facade | Both |

Two-Pass Pipeline

```python
from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation
from PIL import Image
import torch
import torch.nn.functional as F

processor = SegformerImageProcessor.from_pretrained("Marco333/segformer-b0-facade-mixed")
model = SegformerForSemanticSegmentation.from_pretrained("Marco333/segformer-b0-facade-mixed")
model.eval()

# Pass 1: raw street photo
image = Image.open("street_photo.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Upsample logits to the input resolution; PIL's .size is (W, H), so reverse it.
mask = F.interpolate(outputs.logits, size=image.size[::-1],
                     mode="bilinear", align_corners=False).argmax(dim=1)[0]

# Find the biggest facade_wall blob (class 1), compute a homography, rectify...
# Then Pass 2 on the rectified crop:
rectified = Image.open("rectified_facade.jpg").convert("RGB")
inputs2 = processor(images=rectified, return_tensors="pt")
with torch.no_grad():
    outputs2 = model(**inputs2)
mask2 = F.interpolate(outputs2.logits, size=rectified.size[::-1],
                      mode="bilinear", align_corners=False).argmax(dim=1)[0]
```
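The elided middle step (rectifying the detected facade) reduces to a 4-point homography solve once the corners of the dominant facade_wall blob are known. Below is a minimal numpy sketch of that solve via the direct linear transform; corner extraction itself (e.g. contour approximation on the class-1 mask) is left out, and all names and coordinates here are illustrative, not part of this repo:

```python
import numpy as np

def homography_from_corners(src, dst):
    """Solve dst ~ H @ src for the 3x3 homography H via the direct
    linear transform (DLT). src/dst are (4, 2) corner arrays."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # H is the right null vector of A (smallest singular value).
    _, _, vt = np.linalg.svd(np.asarray(A, dtype=np.float64))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

def warp_points(H, pts):
    """Apply H to (N, 2) points using homogeneous coordinates."""
    p = np.hstack([pts, np.ones((len(pts), 1))])
    q = p @ H.T
    return q[:, :2] / q[:, 2:3]

# Corners of a tilted facade quad (TL, TR, BR, BL; illustrative values),
# mapped to an upright 120x140 rectangle.
src = np.array([[40, 30], [160, 50], [150, 170], [50, 150]], dtype=np.float64)
dst = np.array([[0, 0], [120, 0], [120, 140], [0, 140]], dtype=np.float64)
H = homography_from_corners(src, dst)
```

With OpenCV available, the same result comes from `cv2.getPerspectiveTransform(src, dst)`, and the image itself is warped with `cv2.warpPerspective`.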

How to Train

This repo contains the training script. Run it on any GPU (T4 or better):

```shell
pip install transformers datasets evaluate accelerate torch torchvision Pillow numpy
python train.py
```

What the script does:

  1. Loads CMP Facade dataset (~492 rectified images, 12 classes) β†’ remaps to 6-class
  2. Loads ADE20K scene_parse_150 (~20K images) β†’ filters to building-containing scenes β†’ remaps to same 6-class
  3. Applies perspective augmentation (RandomPerspective, p=0.4) during training β€” simulates oblique camera angles
  4. Starts from the existing CMP-only checkpoint (segformer-b0-facade-cmp, 48.56% mIoU)
  5. Trains 80 epochs, pushes best model to this Hub repo
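The label remap in step 1 is a single lookup-table indexing operation. The sketch below is illustrative: the raw CMP class IDs and their order are assumptions, not the actual mapping in train.py. Step 3's warp is torchvision's RandomPerspective(distortion_scale=0.3, p=0.4), applied jointly to image and mask.

```python
import numpy as np

# Hypothetical CMP 12-class -> 6-class lookup table (index = raw CMP ID).
# Target IDs: 0 background, 1 facade_wall, 2 window, 3 door, 4 balcony,
# 5 vegetation_occluder. The raw-ID order below is an assumption.
CMP_TO_6CLASS = np.array([
    0,  # background
    1,  # facade   -> facade_wall
    2,  # window
    3,  # door
    1,  # cornice  -> facade_wall
    1,  # sill     -> facade_wall
    4,  # balcony
    2,  # blind    -> window
    1,  # deco     -> facade_wall
    1,  # molding  -> facade_wall
    1,  # pillar   -> facade_wall
    3,  # shop     -> door
], dtype=np.uint8)

def remap_mask(raw_mask):
    """Vectorized remap of a raw CMP label mask to the 6-class taxonomy."""
    return CMP_TO_6CLASS[raw_mask]
```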

Training time: ~4–6 hours on a T4 GPU

Data Sources

| Dataset | Type | Images | Geometry | Classes (raw) | Classes (remapped) |
|---|---|---|---|---|---|
| CMP Facade | Primary | ~492 | Rectified | 12 | 6 (background, wall, window, door, balcony, ignore) |
| ADE20K scene_parse_150 | Augmentation | ~5K filtered | Unrectified (perspective) | 150 | 6 (same taxonomy) |

Why mix these?

  • CMP alone = excellent on rectified facades, fails on street-view perspective
  • ADE20K adds natural perspective building scenes (wall, building, house, skyscraper classes)
  • Perspective augmentation (RandomPerspective, distortion=0.3, p=0.4) closes the geometric domain gap

Literature confirms the gap: Texture2LoD3 measured a drop of ~10 pp IoU for SegFormer on unrectified vs. rectified facades. Perspective augmentation during training is the practical fix.

Training Configuration

| Parameter | Value |
|---|---|
| Base checkpoint | Marco333/segformer-b0-facade-cmp |
| Optimizer | AdamW |
| Learning rate | 6 × 10⁻⁵ |
| LR schedule | Polynomial decay |
| Warmup | 10% of steps |
| Weight decay | 0.01 |
| Effective batch size | 8 (4 per device × 2 gradient accumulation steps) |
| Resolution | 512 × 512 |
| Precision | FP16 |
| Epochs | 80 |
| Augmentation | ColorJitter + RandomPerspective (p=0.4, distortion=0.3) |
| Selection metric | Highest mean IoU on validation |
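The configuration above translates almost one-to-one into Hugging Face `TrainingArguments`. A hedged sketch, not the literal train.py contents (output directory, evaluation/save strategies, and Hub push wiring are assumptions):

```python
from transformers import TrainingArguments

# Sketch of the configuration table above; strategies and output dir
# are assumptions, not the exact train.py contents.
args = TrainingArguments(
    output_dir="segformer-b0-facade-mixed",
    learning_rate=6e-5,                # 6 x 10^-5
    lr_scheduler_type="polynomial",    # polynomial decay
    warmup_ratio=0.1,                  # 10% of steps
    weight_decay=0.01,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,     # effective batch size 8
    num_train_epochs=80,
    fp16=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="mean_iou",  # highest validation mIoU wins
    push_to_hub=True,
)
```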

Expected Improvements vs. CMP-Only Model

| Capability | CMP-only (baseline) | Mixed (this model) |
|---|---|---|
| Rectified facades | ✅ 48.6% mIoU | ✅ Likely 55–70% (more data + transfer) |
| Unrectified street photos | ❌ Untrained | ✅ Trained on ADE20K perspective scenes |
| Perspective robustness | ~10 pp IoU drop | Gap closed via augmentation |
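For reference, the selection metric (mean IoU over the classes present in a prediction/ground-truth pair) can be computed as below. This is a self-contained illustration; train.py may instead use the `evaluate` library's `mean_iou` metric.

```python
import numpy as np

def mean_iou(pred, gt, num_classes=6, ignore_index=255):
    """Per-class intersection-over-union, averaged over classes that
    actually occur (union > 0), with ignore_index pixels masked out."""
    valid = gt != ignore_index
    pred, gt = pred[valid], gt[valid]
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```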

Citation

CMP Facade:

```bibtex
@inproceedings{Tylecek13,
  author = {Radim Tyle{\v c}ek and Radim {\v S}{\' a}ra},
  title = {Spatial Pattern Templates for Recognition of Objects with Regular Structure},
  booktitle = {Proc. GCPR},
  year = {2013}
}
```

ADE20K:

```bibtex
@inproceedings{zhou2017scene,
  title = {Scene Parsing through ADE20K Dataset},
  author = {Zhou, Bolei and Zhao, Hang and Puig, Xavier and Fidler, Sanja and Barriuso, Adela and Torralba, Antonio},
  booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2017}
}
```

SegFormer:

```bibtex
@article{xie2021segformer,
  title = {SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers},
  author = {Xie, Enze and Wang, Wenhai and Yu, Zhiding and Anandkumar, Anima and Alvarez, Jose M and Luo, Ping},
  journal = {arXiv preprint arXiv:2105.15203},
  year = {2021}
}
```