---
library_name: transformers
license: mit
tags:
- image-segmentation
- semantic-segmentation
- segformer
- facade
- building
- mixed-rectification
- unrectified
- vision
pipeline_tag: image-segmentation
datasets:
- Xpitfire/cmp_facade
- merve/scene_parse_150
metrics:
- mean_iou
model-index:
- name: segformer-b0-facade-mixed
results:
- task:
type: image-segmentation
name: Semantic Segmentation
dataset:
type: mixed
name: CMP Facade + ADE20K filtered
split: validation
metrics:
- type: mean_iou
value: 0.0
name: Mean IoU
---
# SegFormer-B0 — Facade Segmentation (Mixed Rectified + Unrectified)
> **Status:** Training script ready. Run `train.py` to produce this model. See **How to Train** below.
A **SegFormer-B0** model trained on **mixed rectified and unrectified facade data** for a two-pass pipeline:
1. **Pass 1** — Run on the raw street photo (unrectified/perspective) to detect the dominant facade wall
2. **Rectify** the detected facade via a homography
3. **Pass 2** — Run on the rectified crop to cleanly extract windows, doors, and balconies
| | |
|---|---|
| **Architecture** | SegFormer-B0 (Mix Transformer encoder + all-MLP decoder) |
| **Parameters** | ~3.7 M |
| **Input** | RGB image, any resolution (resized to 512×512) |
| **Output** | 6-class pixel mask |
| **Format** | SafeTensors |
| **Base model** | [Marco333/segformer-b0-facade-cmp](https://huggingface.co/Marco333/segformer-b0-facade-cmp) |
## 6-Class Taxonomy
| ID | Class | Function | Pass |
|:--:|-------|----------|:--:|
| 0 | `background` | Sky, ground, non-facade regions | Both |
| 1 | `facade_wall` | Main wall surface (merged: facade, molding, cornice, pillar, sill, deco) | Both |
| 2 | `window` | Windows + blinds + shopfronts | Both |
| 3 | `door` | Doors + shopfronts | Both |
| 4 | `balcony` | Balconies | Both |
| 5 | `vegetation_occluder` | Trees, plants occluding facade | Both |
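For reference, the taxonomy above in the `id2label`/`label2id` form that `transformers` expects. This is a sketch derived from the table, not read from the repo's `config.json`:

```python
# 6-class taxonomy from the table above, as the id2label/label2id
# dictionaries used by SegformerForSemanticSegmentation.
id2label = {
    0: "background",
    1: "facade_wall",
    2: "window",
    3: "door",
    4: "balcony",
    5: "vegetation_occluder",
}
label2id = {name: idx for idx, name in id2label.items()}
```

When fine-tuning from a generic checkpoint, these can be passed as `from_pretrained(..., num_labels=6, id2label=id2label, label2id=label2id)`.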
## Two-Pass Pipeline
```python
from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation
from PIL import Image
import torch
import torch.nn.functional as F

processor = SegformerImageProcessor.from_pretrained("Marco333/segformer-b0-facade-mixed")
model = SegformerForSemanticSegmentation.from_pretrained("Marco333/segformer-b0-facade-mixed")
model.eval()

# Pass 1: raw street photo
image = Image.open("street_photo.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Upsample logits to the original resolution; PIL size is (W, H), so reverse it
mask = F.interpolate(outputs.logits, size=image.size[::-1],
                     mode="bilinear", align_corners=False).argmax(dim=1)[0]

# Find the biggest facade_wall blob (class 1), compute homography, rectify...
# Then Pass 2 on the rectified crop:
rectified = Image.open("rectified_facade.jpg").convert("RGB")
inputs2 = processor(images=rectified, return_tensors="pt")
with torch.no_grad():
    outputs2 = model(**inputs2)
mask2 = F.interpolate(outputs2.logits, size=rectified.size[::-1],
                      mode="bilinear", align_corners=False).argmax(dim=1)[0]
```
## How to Train
This repo contains the training script. Run it on any GPU (T4 or better):
```bash
pip install transformers datasets evaluate accelerate torch torchvision Pillow numpy
python train.py
```
**What the script does:**
1. Loads **CMP Facade** dataset (~492 rectified images, 12 classes) β†’ remaps to 6-class
2. Loads **ADE20K scene_parse_150** (~20K images) β†’ filters to building-containing scenes β†’ remaps to same 6-class
3. Applies **perspective augmentation** (RandomPerspective, p=0.4) during training to simulate oblique camera angles
4. Starts from the existing `segformer-b0-facade-cmp` checkpoint (48.56% mIoU)
5. Trains for 80 epochs and pushes the best checkpoint to this Hub repo
**Training time:** ~4-6 h on a T4 GPU
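Step 1's 12-class to 6-class remap can be done with a vectorized lookup table. A minimal sketch (the toy mapping below is illustrative only; the actual CMP id assignments live in `train.py`):

```python
import numpy as np

def remap_mask(mask, mapping, default=0):
    """Remap integer label ids in a mask via a lookup table.

    Assumes every id present in the mask is covered by `mapping`
    (ids above max(mapping) would index out of bounds).
    """
    lut = np.full(max(mapping) + 1, default, dtype=np.uint8)
    for src, dst in mapping.items():
        lut[src] = dst
    return lut[mask]

# Illustrative toy mapping only, not the real CMP table
toy = {0: 0, 1: 1, 2: 2, 3: 2}
print(remap_mask(np.array([[0, 3], [2, 1]]), toy))  # -> [[0 2] [2 1]]
```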
## Data Sources
| Dataset | Type | Images | Geometry | Classes (raw) | Classes (remapped) |
|---------|------|--------|----------|--------------|-------------------|
| [CMP Facade](https://huggingface.co/datasets/Xpitfire/cmp_facade) | Primary | ~492 | Rectified | 12 | 6 (background, wall, window, door, balcony, ignore) |
| [ADE20K scene_parse_150](https://huggingface.co/datasets/merve/scene_parse_150) | Augmentation | ~5K filtered | Unrectified (perspective) | 150 | 6 (same taxonomy) |
### Why mix these?
- **CMP alone** = excellent on rectified facades, fails on street-view perspective
- **ADE20K** adds natural perspective building scenes (wall, building, house, skyscraper classes)
- **Perspective augmentation** (`RandomPerspective`, distortion=0.3, p=0.4) closes the geometric domain gap
Literature confirms the gap: [Texture2LoD3](https://huggingface.co/papers/2504.05249) measured a ~10 pp IoU drop for SegFormer on unrectified vs. rectified facades. Perspective augmentation during training is the practical fix.
## Training Configuration
| Parameter | Value |
|-----------|-------|
| Base checkpoint | `Marco333/segformer-b0-facade-cmp` |
| Optimizer | AdamW |
| Learning rate | 6e-5 |
| LR schedule | Polynomial decay |
| Warmup | 10% of steps |
| Weight decay | 0.01 |
| Effective batch size | 8 (4 per device × 2 grad accum) |
| Resolution | 512 × 512 |
| Precision | FP16 |
| Epochs | 80 |
| Augmentation | ColorJitter + RandomPerspective (p=0.4, distortion=0.3) |
| Selection metric | Highest mean IoU on validation |
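The table above corresponds roughly to the following `TrainingArguments`. This is a sketch, not the actual `train.py`: it assumes a recent `transformers` release (`eval_strategy` was `evaluation_strategy` in older versions), and the output-dir and metric names are placeholders:

```python
from transformers import TrainingArguments

# Approximate reconstruction of the configuration table; placeholder names
args = TrainingArguments(
    output_dir="segformer-b0-facade-mixed",
    learning_rate=6e-5,
    lr_scheduler_type="polynomial",
    warmup_ratio=0.1,                # 10% of steps
    weight_decay=0.01,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,   # effective batch size 8
    num_train_epochs=80,
    fp16=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="mean_iou",  # assumed key from the eval metrics dict
    greater_is_better=True,
    push_to_hub=True,
)
```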
## Expected Improvements vs. CMP-Only Model
| Capability | CMP-only (baseline) | Mixed (this model) |
|---|---|---|
| Rectified facades | ✅ 48.6% mIoU | ✅ Likely 55-70% (more data + transfer) |
| Unrectified street photos | ❌ Untrained | ✅ Trained on ADE20K perspective scenes |
| Perspective robustness | ~10pp IoU drop | Gap closed via augmentation |
## Citation
CMP Facade:
```bibtex
@INPROCEEDINGS{Tylecek13,
author = {Radim Tyle{\v c}ek and Radim {\v S}{\' a}ra},
title = {Spatial Pattern Templates for Recognition of Objects with Regular Structure},
booktitle = {Proc. GCPR},
year = {2013},
}
```
ADE20K:
```bibtex
@article{zhou2017scene,
title={Scene Parsing through ADE20K Dataset},
author={Zhou, Bolei and Zhao, Hang and Puig, Xavier and Fidler, Sanja and Barriuso, Adela and Torralba, Antonio},
journal={CVPR},
year={2017}
}
```
SegFormer:
```bibtex
@article{xie2021segformer,
title={SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers},
author={Xie, Enze and Wang, Wenhai and Yu, Zhiding and Anandkumar, Anima and Alvarez, Jose M and Luo, Ping},
journal={arXiv preprint arXiv:2105.15203},
year={2021}
}
```