---
library_name: transformers
license: mit
tags:
- image-segmentation
- semantic-segmentation
- segformer
- facade
- building
- mixed-rectification
- unrectified
- vision
pipeline_tag: image-segmentation
datasets:
- Xpitfire/cmp_facade
- merve/scene_parse_150
metrics:
- mean_iou
model-index:
- name: segformer-b0-facade-mixed
  results:
  - task:
      type: image-segmentation
      name: Semantic Segmentation
    dataset:
      type: mixed
      name: CMP Facade + ADE20K filtered
      split: validation
    metrics:
    - type: mean_iou
      value: 0.0
      name: Mean IoU
---

# SegFormer-B0 — Facade Segmentation (Mixed Rectified + Unrectified)

> **Status:** Training script ready. Run `train.py` to produce this model. See **How to Train** below.

A **SegFormer-B0** model trained on **mixed rectified and unrectified facade data** for a 2-pass pipeline:

1. **Pass 1** — Run on the raw street photo (unrectified/perspective) to detect the dominant facade wall
2. **Rectify** the detected facade via homography
3. **Pass 2** — Run on the rectified crop to cleanly extract windows, doors, and balconies

| | |
|---|---|
| **Architecture** | SegFormer-B0 (Mix Transformer encoder + all-MLP decoder) |
| **Parameters** | ~3.7 M |
| **Input** | RGB image, any resolution (resized to 512×512) |
| **Output** | 6-class pixel mask |
| **Format** | SafeTensors |
| **Base model** | [Marco333/segformer-b0-facade-cmp](https://huggingface.co/Marco333/segformer-b0-facade-cmp) |

## 6-Class Taxonomy

| ID | Class | Function | Pass |
|:--:|-------|----------|:--:|
| 0 | `background` | Sky, ground, non-facade regions | Both |
| 1 | `facade_wall` | Main wall surface (merged: facade, molding, cornice, pillar, sill, deco) | Both |
| 2 | `window` | Windows + blinds + shopfronts | Both |
| 3 | `door` | Doors + shopfronts | Both |
| 4 | `balcony` | Balconies | Both |
| 5 | `vegetation_occluder` | Trees and plants occluding the facade | Both |

## Two-Pass Pipeline

```python
from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation
from PIL import Image
import torch
import torch.nn.functional as F

processor = SegformerImageProcessor.from_pretrained("Marco333/segformer-b0-facade-mixed")
model = SegformerForSemanticSegmentation.from_pretrained("Marco333/segformer-b0-facade-mixed")
model.eval()

# Pass 1: raw street photo
image = Image.open("street_photo.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Logits come out at reduced resolution; upsample to the input size before argmax.
# PIL's .size is (W, H), so reverse it for interpolate's (H, W).
mask = F.interpolate(
    outputs.logits, size=image.size[::-1], mode="bilinear", align_corners=False
).argmax(dim=1)[0]

# Find the biggest facade_wall blob (class 1), compute homography, rectify...
# Then Pass 2 on the rectified crop:
rectified = Image.open("rectified_facade.jpg").convert("RGB")
inputs2 = processor(images=rectified, return_tensors="pt")
with torch.no_grad():
    outputs2 = model(**inputs2)
mask2 = F.interpolate(
    outputs2.logits, size=rectified.size[::-1], mode="bilinear", align_corners=False
).argmax(dim=1)[0]
```

## How to Train

This repo contains the training script. Run it on any GPU (T4 or better):

```bash
pip install transformers datasets evaluate accelerate torch torchvision Pillow numpy
python train.py
```

**What the script does:**

1. Loads the **CMP Facade** dataset (~492 rectified images, 12 classes) and remaps it to the 6-class taxonomy
2. Loads **ADE20K scene_parse_150** (~20K images), filters it to building-containing scenes, and remaps it to the same 6 classes
3. Applies **perspective augmentation** (RandomPerspective, p=0.4) during training to simulate oblique camera angles
4. Starts from the existing model (`segformer-b0-facade-cmp`, 48.56% mIoU)
5. Trains for 80 epochs and pushes the best model to this Hub repo

**Training time:** ~4-6 h on a T4 GPU

## Data Sources

| Dataset | Type | Images | Geometry | Classes (raw) | Classes (remapped) |
|---------|------|--------|----------|---------------|--------------------|
| [CMP Facade](https://huggingface.co/datasets/Xpitfire/cmp_facade) | Primary | ~492 | Rectified | 12 | 6 (background, wall, window, door, balcony, ignore) |
| [ADE20K scene_parse_150](https://huggingface.co/datasets/merve/scene_parse_150) | Augmentation | ~5K filtered | Unrectified (perspective) | 150 | 6 (same taxonomy) |

### Why mix these?

- **CMP alone** is excellent on rectified facades but fails on street-view perspective
- **ADE20K** adds natural perspective building scenes (wall, building, house, skyscraper classes)
- **Perspective augmentation** (`RandomPerspective`, distortion=0.3, p=0.4) closes the geometric domain gap

The literature confirms the gap: [Texture2LoD3](https://huggingface.co/papers/2504.05249) measured a drop of roughly 10 pp IoU for SegFormer on unrectified vs. rectified facades. Perspective augmentation during training is the practical fix.

## Training Configuration

| Parameter | Value |
|-----------|-------|
| Base checkpoint | `Marco333/segformer-b0-facade-cmp` |
| Optimizer | AdamW |
| Learning rate | 6 × 10⁻⁵ |
| LR schedule | Polynomial decay |
| Warmup | 10% of steps |
| Weight decay | 0.01 |
| Effective batch size | 8 (4 per device × 2 grad accum) |
| Resolution | 512 × 512 |
| Precision | FP16 |
| Epochs | 80 |
| Augmentation | ColorJitter + RandomPerspective (p=0.4, distortion=0.3) |
| Selection metric | Highest mean IoU on validation |

## Expected Improvements vs. CMP-Only Model

| Capability | CMP-only (baseline) | Mixed (this model) |
|---|---|---|
| Rectified facades | ✅ 48.6% mIoU | ✅ Likely 55-70% (more data + transfer) |
| Unrectified street photos | ❌ Untrained | ✅ Trained on ADE20K perspective scenes |
| Perspective robustness | ~10 pp IoU drop | Gap closed via augmentation |

## Citation

CMP Facade:

```bibtex
@INPROCEEDINGS{Tylecek13,
  author    = {Radim Tyle{\v c}ek and Radim {\v S}{\' a}ra},
  title     = {Spatial Pattern Templates for Recognition of Objects with Regular Structure},
  booktitle = {Proc. GCPR},
  year      = {2013},
}
```

ADE20K:

```bibtex
@inproceedings{zhou2017scene,
  title     = {Scene Parsing through ADE20K Dataset},
  author    = {Zhou, Bolei and Zhao, Hang and Puig, Xavier and Fidler, Sanja and Barriuso, Adela and Torralba, Antonio},
  booktitle = {CVPR},
  year      = {2017}
}
```

SegFormer:

```bibtex
@article{xie2021segformer,
  title   = {SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers},
  author  = {Xie, Enze and Wang, Wenhai and Yu, Zhiding and Anandkumar, Anima and Alvarez, Jose M and Luo, Ping},
  journal = {arXiv preprint arXiv:2105.15203},
  year    = {2021}
}
```
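
## Appendix: Finding the Dominant Facade Blob

The two-pass pipeline leaves "find the biggest `facade_wall` blob" as an exercise. Below is a minimal, dependency-free sketch of that step. The function name `largest_facade_bbox` and the 4-connectivity choice are my assumptions, not part of the training script; a production pipeline would more likely use `cv2.connectedComponentsWithStats` and estimate a full homography from the blob's corners rather than take an axis-aligned bounding box.

```python
from collections import deque

import numpy as np

FACADE_WALL = 1  # class id for `facade_wall` in the 6-class taxonomy


def largest_facade_bbox(mask: np.ndarray, cls: int = FACADE_WALL):
    """Inclusive (top, bottom, left, right) bounding box of the largest
    4-connected blob of `cls` in a 2-D integer class mask; None if absent."""
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    best, best_size = None, 0
    for y in range(h):
        for x in range(w):
            if mask[y, x] != cls or seen[y, x]:
                continue
            # BFS over one connected component, tracking its extent
            queue = deque([(y, x)])
            seen[y, x] = True
            ys, xs = [y], [x]
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and not seen[ny, nx] and mask[ny, nx] == cls:
                        seen[ny, nx] = True
                        queue.append((ny, nx))
                        ys.append(ny)
                        xs.append(nx)
            if len(ys) > best_size:
                best_size = len(ys)
                best = (min(ys), max(ys), min(xs), max(xs))
    return best
```

Applied to the Pass 1 output as `largest_facade_bbox(mask.cpu().numpy())`, the returned box gives the candidate region to rectify before Pass 2.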
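
## Appendix: Class Remapping Sketch

The 12→6 (CMP) and 150→6 (ADE20K) remapping described under **How to Train** can be implemented as a NumPy lookup table. The mapping dict below is purely illustrative — the raw ids must be checked against each dataset's actual label map before training.

```python
import numpy as np

# Illustrative raw-id -> 6-class table (placeholder ids; verify against
# the dataset's real label map). Unlisted raw ids fall back to background.
CMP_TO_SIX = {0: 0, 2: 1, 3: 2, 4: 3, 7: 4}


def remap_mask(mask: np.ndarray, table: dict) -> np.ndarray:
    """Vectorized id remap via lookup table; unmapped ids become 0."""
    size = max(int(mask.max()), max(table)) + 1
    lut = np.zeros(size, dtype=np.uint8)  # default: 0 (background)
    for raw_id, new_id in table.items():
        lut[raw_id] = new_id
    return lut[mask]
```

The LUT approach keeps the remap a single fancy-indexing operation per mask, which matters when it runs inside the training dataloader.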