Spaces:
Runtime error
Vedant Jigarbhai Mehta committed on
Commit · 1eb8817
Parent(s): 4f856a3
Deploy to hf spaces
Browse files
- .gitattributes +1 -0
- DEPLOY_HF_SPACES.md +77 -0
- EXPLANATION.md +0 -695
- MODELS_EXPLAINED.md +0 -573
- README.md +12 -0
- app.py +46 -1
- requirements.txt +1 -0
.gitattributes ADDED
@@ -0,0 +1 @@
+*.pth filter=lfs diff=lfs merge=lfs -text
DEPLOY_HF_SPACES.md ADDED
@@ -0,0 +1,77 @@
# Deploy to Hugging Face Spaces (Gradio)

This project is now ready for Hugging Face Spaces.

## Option A (recommended): single Space repo with checkpoints

Use this when you want the simplest deployment.

1. Create a new Hugging Face Space:
   - SDK: Gradio
   - Hardware: CPU Basic to start; upgrade to GPU for faster inference
2. Push this project to that Space repo.
3. Ensure these files are present at the Space repo root:
   - app.py
   - requirements.txt
   - configs/config.yaml
   - models/
   - data/
   - utils/
   - checkpoints/changeformer_best.pth (or your preferred model)
4. In Space Settings, set the startup file to `app.py` (the default for Gradio Spaces).
5. Optional: reduce the initial footprint by keeping only one checkpoint (for example `changeformer_best.pth`) inside `checkpoints/`.

## Option B: Space app + separate model repo

Use this when you want a smaller Space repo and keep large checkpoints elsewhere.

1. Upload the checkpoint files to a separate Hugging Face model repo.
2. In your Space Settings -> Variables, set:
   - `HF_MODEL_REPO`: owner/repo-name
   - `HF_MODEL_REVISION`: optional branch/tag/commit (for reproducible deployment)
3. On startup, `app.py` will auto-download the expected checkpoint filenames into `checkpoints/`.

Expected checkpoint names:
- siamese_cnn_best.pth
- unet_pp_best.pth
- changeformer_best.pth
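The auto-download step can be sketched as below. This is an illustrative helper, not the actual code in `app.py`: the function names and `EXPECTED_CHECKPOINTS` constant are made up here, and it assumes `huggingface_hub` is installed (it ships as a Gradio dependency). Only `hf_hub_download` and its `repo_id`/`filename`/`revision`/`local_dir` parameters are real library API.

```python
import os
from pathlib import Path

EXPECTED_CHECKPOINTS = ["siamese_cnn_best.pth", "unet_pp_best.pth", "changeformer_best.pth"]

def missing_checkpoints(ckpt_dir: str) -> list[str]:
    """Return the expected checkpoint filenames not yet present locally."""
    root = Path(ckpt_dir)
    return [name for name in EXPECTED_CHECKPOINTS if not (root / name).exists()]

def fetch_checkpoints(ckpt_dir: str = "checkpoints") -> None:
    """Download missing checkpoints from the repo named in HF_MODEL_REPO."""
    repo = os.environ.get("HF_MODEL_REPO")
    if not repo:
        return  # no model repo configured; rely on local files
    revision = os.environ.get("HF_MODEL_REVISION")  # None -> default branch
    from huggingface_hub import hf_hub_download  # imported lazily
    Path(ckpt_dir).mkdir(parents=True, exist_ok=True)
    for name in missing_checkpoints(ckpt_dir):
        hf_hub_download(repo_id=repo, filename=name, revision=revision,
                        local_dir=ckpt_dir)
```

Pinning `HF_MODEL_REVISION` to a commit hash makes Space rebuilds reproducible even if the model repo later changes.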
## Space README metadata (required in Space repo)

In the Space repository README.md, include this at the top:

```yaml
---
title: Military Base Change Detection
emoji: satellite
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: false
python_version: 3.10
---
```

## Notes

- CPU hardware works, but inference can be slow for larger images.
- For better latency, choose a GPU Space.
- `app.py` now detects Spaces automatically and binds to `0.0.0.0`.
- If no local checkpoints are found, it will try `HF_MODEL_REPO`.

## Quick local validation before push

```bash
pip install -r requirements.txt
python app.py
```

Then open the local Gradio URL and test one sample pair.
EXPLANATION.md DELETED
@@ -1,695 +0,0 @@
# Military Base Change Detection — Complete Project Explanation

## Table of Contents

1. [What Is This Project?](#1-what-is-this-project)
2. [Why Did We Build This?](#2-why-did-we-build-this)
3. [What Problem Are We Solving?](#3-what-problem-are-we-solving)
4. [What Dataset Did We Use and Why?](#4-what-dataset-did-we-use-and-why)
5. [What Are The Three Models and Why These Three?](#5-what-are-the-three-models-and-why-these-three)
6. [How Does Each Model Work Internally?](#6-how-does-each-model-work-internally)
7. [How Is The Training Pipeline Designed?](#7-how-is-the-training-pipeline-designed)
8. [What Loss Functions Did We Use and Why?](#8-what-loss-functions-did-we-use-and-why)
9. [How Do We Evaluate The Models?](#9-how-do-we-evaluate-the-models)
10. [What Are Our Results?](#10-what-are-our-results)
11. [How Does The Inference Pipeline Work?](#11-how-does-the-inference-pipeline-work)
12. [How Does The Web Application Work?](#12-how-does-the-web-application-work)
13. [What Tools and Technologies Did We Use?](#13-what-tools-and-technologies-did-we-use)
14. [What Is Our Innovation / Contribution?](#14-what-is-our-innovation--contribution)
15. [What Are The Limitations?](#15-what-are-the-limitations)
16. [Future Work](#16-future-work)
17. [How To Present This Project](#17-how-to-present-this-project)

---

## 1. What Is This Project?

This is a **deep learning-based satellite image change detection system** designed for defense and military applications. Given two satellite images of the same geographic location taken at different times (a "before" image and an "after" image), the system automatically identifies **where new construction has occurred** — new buildings, runways, infrastructure, or any structural changes.

The system works like this:

```
Before Image (2015) + After Image (2020) --> Change Mask (highlights new construction)
[empty land]          [buildings appeared]   [white pixels = new structures]
```

We implemented and compared **three different deep learning architectures** — ranging from a simple CNN baseline to a state-of-the-art vision transformer — to understand which approach works best for this task.

---

## 2. Why Did We Build This?

### The Defense Motivation

Modern military intelligence relies heavily on satellite imagery. Analysts need to monitor:

- **Enemy military base expansion** — Are new barracks, hangars, or command centers being built?
- **Runway construction** — Is a new airfield being developed?
- **Infrastructure development** — Are roads, supply depots, or communication towers appearing?
- **Border fortification** — Are defensive structures being erected?

Manually comparing satellite images is **slow, error-prone, and doesn't scale**. A single analyst might need to compare hundreds of image pairs daily. An AI system can do this in seconds with higher accuracy.

### The Deep Learning Motivation

This project demonstrates core deep learning concepts:

- **Transfer learning** — Using ImageNet-pretrained backbones on satellite imagery
- **Siamese architectures** — Processing two inputs through a shared encoder
- **Architecture comparison** — CNN vs UNet++ vs Transformer on the same task
- **Binary segmentation** — Pixel-level classification (changed vs unchanged)
- **End-to-end deployment** — From training to a working web application

---

## 3. What Problem Are We Solving?

### Problem Statement

**Binary Change Detection in Remote Sensing Images**: Given a pair of co-registered satellite images of the same area captured at two different times, classify each pixel as either "changed" or "unchanged".

### Why Is This Hard?

1. **Class imbalance** — In most image pairs, 95-99% of pixels are "no change". Only tiny regions contain actual construction. The model must not simply predict "no change" everywhere.

2. **Irrelevant changes** — Lighting differences, seasonal vegetation changes, cloud shadows, and camera angle variations are NOT actual changes. The model must learn to ignore these.

3. **Scale variation** — Changes can be as small as a single house or as large as an entire housing development. The model needs multi-scale understanding.

4. **Semantic understanding** — The model should detect "empty land became a building" (structural change), not "grass turned brown" (seasonal change).

### Formal Definition

```
Input:  Image A (before) — shape [3, 256, 256] — RGB satellite patch
        Image B (after)  — shape [3, 256, 256] — RGB satellite patch

Output: Mask M — shape [1, 256, 256] — binary (0 = no change, 1 = change)
```

---

## 4. What Dataset Did We Use and Why?

### LEVIR-CD (Large-scale VHR Image Change Detection)

We chose LEVIR-CD because it is the **most widely used benchmark** for building change detection in remote sensing. It provides:

- **637 image pairs** at 1024x1024 resolution (0.5m/pixel from Google Earth)
- **20 different regions** in Texas, USA (Austin, Lakeway, Bee Cave, etc.)
- **Time span**: 2002 to 2018 (5-14 years between image pairs)
- **31,333 annotated building change instances**
- Images annotated by experts and double-checked for quality

### Data Preprocessing

The raw 1024x1024 images are too large for direct model input. We cropped them into **non-overlapping 256x256 patches**:

```
1 image (1024x1024) --> 16 patches (256x256 each)

Total patches:
  Train: 445 images x 16 = 7,120 patch triplets
  Val:    64 images x 16 = 1,024 patch triplets
  Test:  128 images x 16 = 2,048 patch triplets
  Total: 10,192 patch triplets
```
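The cropping step above can be sketched in a few lines of numpy (a minimal sketch; the project's actual preprocessing script may differ in file handling and naming):

```python
import numpy as np

def crop_patches(image: np.ndarray, patch: int = 256) -> list[np.ndarray]:
    """Split an HxWxC image into non-overlapping patch x patch tiles, row-major order."""
    h, w = image.shape[:2]
    return [
        image[y:y + patch, x:x + patch]
        for y in range(0, h - patch + 1, patch)
        for x in range(0, w - patch + 1, patch)
    ]

# A 1024x1024 image yields exactly 16 patches of 256x256.
tiles = crop_patches(np.zeros((1024, 1024, 3), dtype=np.uint8))
assert len(tiles) == 16 and tiles[0].shape == (256, 256, 3)
```

The same function is applied to the A image, the B image, and the label mask, so patch index i always refers to the same ground area in all three.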
Each patch triplet consists of:

- `A/` — Before image (256x256 RGB)
- `B/` — After image (256x256 RGB)
- `label/` — Binary change mask (256x256, 0=unchanged, 255=changed)

### Why Not Military-Specific Data?

Real military satellite imagery is classified and not publicly available. However, **building construction is structurally identical whether it's a civilian house or a military barracks**. A hangar looks like a warehouse. A runway looks like a road. The model learns to detect structural changes from any satellite imagery — the application to military monitoring is in WHERE you point the trained model, not what you train it on. This is the standard approach in defense AI research.

### Data Augmentation

We apply synchronized augmentations to both images AND the mask during training (using albumentations ReplayCompose):

- **Horizontal flip** (p=0.5)
- **Vertical flip** (p=0.5)
- **Random 90-degree rotation** (p=0.5)
- **ImageNet normalization** (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

No augmentation on validation/test sets — only normalization.
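The key requirement is that the random geometric transform applied to image A, image B, and the mask is identical; with albumentations this is what `ReplayCompose` provides (apply once, replay on the other arrays). The principle can be sketched without the library (a numpy stand-in, not the project's actual augmentation code):

```python
import numpy as np

def synced_flip(img_a, img_b, mask, rng):
    """Apply the SAME random flips to both images and the mask."""
    if rng.random() < 0.5:  # horizontal flip, p=0.5
        img_a, img_b, mask = [np.flip(x, axis=1) for x in (img_a, img_b, mask)]
    if rng.random() < 0.5:  # vertical flip, p=0.5
        img_a, img_b, mask = [np.flip(x, axis=0) for x in (img_a, img_b, mask)]
    return img_a, img_b, mask
```

If the flips were sampled independently per array, a pixel marked "changed" in the mask would no longer line up with the building in the images, corrupting the supervision signal.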
---

## 5. What Are The Three Models and Why These Three?

We chose three architectures that represent **three generations of deep learning for dense prediction tasks**:

| Model | Year | Architecture Type | Role in Our Study |
|---|---|---|---|
| Siamese CNN | ~2018 | Convolutional Neural Network | Baseline |
| UNet++ | 2018 | Nested U-Net (encoder-decoder) | Mid-tier |
| ChangeFormer | 2022 | Vision Transformer | State-of-the-art |

### Why These Specific Three?

1. **Siamese CNN** — The simplest approach. Shows what a basic CNN can achieve. Serves as a performance floor — if this already works well, maybe we don't need complex models.

2. **UNet++** — Represents the best of CNN-based segmentation. Its nested skip connections capture multi-scale features. Widely used in medical imaging and remote sensing. Shows what careful architecture design can achieve without transformers.

3. **ChangeFormer** — Represents the latest transformer-based approach. Uses self-attention to capture global context (one building being built might relate to another across the image). Shows whether the complexity of transformers is justified for this task.

### The Common Interface

All three models share the same input/output contract:

```python
def forward(self, x1: Tensor, x2: Tensor) -> Tensor:
    """
    x1: before image [Batch, 3, 256, 256]
    x2: after image  [Batch, 3, 256, 256]
    returns: logits  [Batch, 1, 256, 256] (raw, before sigmoid)
    """
```

This means we can swap models freely without changing any other code.

---

## 6. How Does Each Model Work Internally?

### Model 1: Siamese CNN (Baseline)

**Architecture**: Shared-weight ResNet18 encoder + Transposed Convolution decoder

```
Before Image --> [ResNet18 Encoder] --> Features_A (512 channels, 8x8)
                  | (shared weights)
After Image  --> [ResNet18 Encoder] --> Features_B (512 channels, 8x8)

Difference = |Features_A - Features_B|  (absolute difference)

Difference --> [TransposedConv Decoder] --> Change Mask (1 channel, 256x256)
```

**How it works**:
1. Both images pass through the SAME ResNet18 encoder (shared weights = Siamese)
2. ResNet18 reduces 256x256x3 to 8x8x512 feature maps
3. We compute the absolute difference between the two feature maps
4. A decoder with transposed convolutions upsamples back to 256x256
5. Output is a single-channel logit map (apply sigmoid for probabilities)

**Why shared weights?** If the encoder weights are shared, both images are processed identically. Any difference in the output features is due to actual image content differences, not different processing.

**Parameters**: ~14M
**Strength**: Simple, fast, easy to understand
**Weakness**: No skip connections, loses fine spatial detail during encoding
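The shared-encoder, feature-difference, upsample pattern can be sketched with a toy two-layer encoder standing in for ResNet18 (a minimal PyTorch sketch to show the wiring, not the project's actual model code):

```python
import torch
import torch.nn as nn

class ToySiameseCD(nn.Module):
    """Siamese change detector: shared encoder, |diff| fusion, transposed-conv decoder."""
    def __init__(self):
        super().__init__()
        # Stand-in for ResNet18: two stride-2 convs, 256x256 -> 64x64
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Upsample back: 64x64 -> 256x256, single-channel logits
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 2, stride=2),
        )

    def forward(self, x1, x2):
        f1 = self.encoder(x1)          # same weights process both inputs
        f2 = self.encoder(x2)
        return self.decoder(torch.abs(f1 - f2))  # raw logits, sigmoid applied later

model = ToySiameseCD()
logits = model(torch.randn(2, 3, 256, 256), torch.randn(2, 3, 256, 256))
assert logits.shape == (2, 1, 256, 256)
```

Swapping the toy encoder for `torchvision.models.resnet18` (minus its classification head) gives the real baseline; the fusion and decoder pattern stays the same.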
### Model 2: UNet++ (Mid-tier)

**Architecture**: Shared ResNet34 encoder + Nested UNet++ decoder with dense skip connections

```
Before Image --> [ResNet34 Encoder] --> Multi-scale Features_A
                  | (shared weights)           |
After Image  --> [ResNet34 Encoder] --> Multi-scale Features_B
                                               |
                      |Features_A[i] - Features_B[i]| at each scale
                                               |
                                       [UNet++ Decoder]
                                 (nested skip connections)
                                               |
                                   Change Mask (256x256)
```

**How it works**:
1. ResNet34 encoder extracts features at 5 different scales (from 256x256 down to 8x8)
2. At each scale, we compute the absolute difference between A and B features
3. The UNet++ decoder uses **nested skip connections** — unlike regular UNet which has direct connections, UNet++ has intermediate dense blocks that process features before passing them across
4. This captures both fine details (small buildings) and coarse context (large developments)

**Why UNet++?** Standard UNet has a semantic gap between encoder and decoder features. UNet++ bridges this gap with intermediate convolution blocks, producing more refined predictions.

**Parameters**: ~26M
**Strength**: Excellent multi-scale feature fusion, captures small changes
**Weakness**: More memory intensive than Siamese CNN

### Model 3: ChangeFormer (State-of-the-art)

**Architecture**: Shared MiT-B1 Transformer encoder + MLP decoder

```
Before Image --> [MiT-B1 Transformer Encoder] --> Hierarchical Features_A
                  | (shared weights)                      |
After Image  --> [MiT-B1 Transformer Encoder] --> Hierarchical Features_B
                                                          |
                            |Features_A[i] - Features_B[i]| at 4 stages
                                                          |
                                                   [MLP Decoder]
                                         (multi-scale feature fusion)
                                                          |
                                              Change Mask (256x256)
```

**The MiT (Mix Transformer) Encoder** has 4 hierarchical stages:

| Stage | Resolution | Channels | Attention Heads | Spatial Reduction |
|---|---|---|---|---|
| 1 | 64x64 | 64 | 1 | 8x |
| 2 | 32x32 | 128 | 2 | 4x |
| 3 | 16x16 | 320 | 5 | 2x |
| 4 | 8x8 | 512 | 8 | 1x |

**Key components we implemented from scratch** (~350 lines of custom code):

1. **Overlapping Patch Embedding** — Unlike ViT, which uses non-overlapping patches, MiT uses overlapping convolutions to preserve local continuity.

2. **Efficient Self-Attention** — Standard self-attention is O(N^2). We use spatial reduction: reduce the key/value spatial dimensions before attention, making it computationally feasible for high-resolution images.

3. **Mix-FFN (Feed Forward Network)** — Instead of a standard MLP, uses a depthwise 3x3 convolution inside the FFN to inject positional information without explicit position embeddings.

4. **MLP Decoder** — Projects all 4 scale features to the same dimension, upsamples to full resolution, concatenates, and predicts the change mask.
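The spatial-reduction trick in component 2 can be sketched in isolation (a simplified single-head sketch; the real implementation is multi-head and embedded in the MiT blocks):

```python
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    """Single-head self-attention whose keys/values are spatially downsampled.

    Attention cost drops from O(N^2) to O(N * N/r^2) for reduction ratio r.
    """
    def __init__(self, dim: int, sr_ratio: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.scale = dim ** -0.5

    def forward(self, x, h, w):
        # x: [B, N, C] token sequence with N = h * w
        b, n, c = x.shape
        q = self.q(x)                                        # [B, N, C]
        # Downsample the token grid before computing keys/values
        x2 = x.transpose(1, 2).reshape(b, c, h, w)
        x2 = self.sr(x2).reshape(b, c, -1).transpose(1, 2)   # [B, N/r^2, C]
        k, v = self.kv(x2).chunk(2, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale        # [B, N, N/r^2]
        attn = attn.softmax(dim=-1)
        return attn @ v                                      # [B, N, C]

# Stage 1 numbers from the table above: 64x64 tokens, 64 channels, 8x reduction
attn = SpatialReductionAttention(dim=64, sr_ratio=8)
out = attn(torch.randn(2, 64 * 64, 64), h=64, w=64)
assert out.shape == (2, 64 * 64, 64)
```

With r=8 at stage 1, each of the 4,096 query tokens attends to only 64 reduced key/value tokens instead of all 4,096.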
**Why Transformers for change detection?** Self-attention captures GLOBAL relationships. If a new housing development appears, the attention mechanism can relate the new buildings to nearby road construction — understanding the change holistically rather than pixel-by-pixel.

**Parameters**: ~14M
**Strength**: Global context via self-attention, best at understanding large-scale changes
**Weakness**: Needs more training epochs, higher memory usage

---

## 7. How Is The Training Pipeline Designed?

### Overview

```
Config (YAML) --> Data Loading --> Model --> Loss --> Optimizer --> Training Loop
                                                                        |
                                                          Checkpointing (Drive)
                                                          TensorBoard Logging
                                                          Early Stopping
                                                          Resume Support
```

### Key Training Features

**1. Mixed Precision Training (AMP)**
We use PyTorch's Automatic Mixed Precision. The forward pass runs in FP16 (half precision) for speed, while loss scaling keeps the small FP16 gradients from underflowing in the backward pass. This roughly doubles training speed and halves memory usage.

```python
optimizer.zero_grad()
with autocast():                   # forward in FP16
    logits = model(img_a, img_b)
    loss = criterion(logits, mask)
scaler.scale(loss).backward()      # backward with loss scaling
scaler.step(optimizer)             # optimizer step (skipped on inf/nan grads)
scaler.update()                    # adjust the loss scale for the next step
```

**2. Gradient Accumulation**
For memory-heavy models (ChangeFormer), we accumulate gradients over multiple mini-batches before updating weights. This simulates a larger effective batch size without needing more GPU memory.

```
Effective batch size = actual batch size x accumulation steps
ChangeFormer on T4: batch=4 x accum=2 = effective batch of 8
```
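The accumulation loop can be sketched as follows, with a toy model and synthetic batches standing in for the real ones (illustrative only; the project's training loop also includes AMP and clipping):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, 1)                     # stand-in for the change model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()
accum_steps = 2                                # effective batch = batch x accum_steps

# Synthetic (img_a, img_b, mask) mini-batches
loader = [(torch.randn(4, 3, 8, 8), torch.randn(4, 3, 8, 8),
           torch.randint(0, 2, (4, 1, 8, 8)).float()) for _ in range(4)]

optimizer.zero_grad()
for step, (img_a, img_b, mask) in enumerate(loader):
    logits = model(img_a - img_b)              # toy fusion: difference image
    loss = criterion(logits, mask) / accum_steps   # average over the group
    loss.backward()                            # gradients ADD UP in .grad buffers
    if (step + 1) % accum_steps == 0:
        optimizer.step()                       # one update per accum_steps batches
        optimizer.zero_grad()
```

Dividing the loss by `accum_steps` makes the accumulated gradient equal to that of one large averaged batch.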
**3. Gradient Clipping**
We clip gradient norms to max_norm=1.0 to prevent training instability from exploding gradients, especially important for transformer models.

**4. Learning Rate Schedule: Warmup + Cosine Annealing**
- First 5 epochs: Linear warmup from 0.01x to 1x the base learning rate
- Remaining epochs: Cosine decay to near zero

This prevents early training instability (warmup) and allows fine-grained convergence (cosine decay).
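The schedule above can be sketched with `torch.optim.lr_scheduler.LambdaLR` (the epoch counts follow the text; the project's exact scheduler code may differ):

```python
import math
import torch

warmup_epochs, total_epochs = 5, 100

def lr_scale(epoch: int) -> float:
    """Multiplier on the base LR: linear warmup from 0.01x, then cosine decay to ~0."""
    if epoch < warmup_epochs:
        return 0.01 + (1.0 - 0.01) * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

optimizer = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_scale)
# call scheduler.step() once per epoch after the optimizer updates
```

At epoch 0 the multiplier is 0.01, at epoch 5 it reaches 1.0, and it decays smoothly to 0 by the final epoch.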
**5. Early Stopping**
We monitor the validation F1 score. If it doesn't improve for 15 consecutive epochs, training stops. This prevents overfitting and saves compute time.

**6. Checkpoint Resume**
Because cloud GPU sessions (Colab/Kaggle) can disconnect, we save TWO checkpoints every epoch:
- `model_best.pth` — Best validation F1 so far
- `model_last.pth` — Latest epoch (for resume)

Each checkpoint contains: model weights, optimizer state, scheduler state, GradScaler state, epoch number, and best F1. This allows perfect resume — training continues exactly where it stopped.
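The checkpoint payload can be sketched like this (key names are illustrative, not necessarily the project's; the scheduler and GradScaler states would be added the same way):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)                        # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def save_checkpoint(path, model, optimizer, epoch, best_f1):
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),   # includes Adam moment buffers
        "epoch": epoch,
        "best_f1": best_f1,
    }, path)

def load_checkpoint(path, model, optimizer):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["epoch"], ckpt["best_f1"]      # resume point
```

Saving the optimizer state matters: without Adam's moment buffers, a resumed run would briefly behave like a fresh start and could lose the best F1.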
**7. Auto GPU Detection**
The config contains per-model batch sizes for different GPUs:

| Model | T4 (16GB) | V100 (16GB) | Default |
|---|---|---|---|
| Siamese CNN | 16 | 16 | 8 |
| UNet++ | 8 | 12 | 4 |
| ChangeFormer | 4 | 6 | 2 |

The training script reads `torch.cuda.get_device_name()` and automatically selects the right batch size.

### Optimizer Choice: AdamW

We use AdamW (Adam with decoupled weight decay) because:
- Adam's adaptive learning rates work well for both CNNs and transformers
- Weight decay prevents overfitting
- It's the standard optimizer for transformer training

### Per-Model Hyperparameters

| Hyperparameter | Siamese CNN | UNet++ | ChangeFormer |
|---|---|---|---|
| Learning Rate | 1e-3 | 1e-4 | 6e-5 |
| Epochs | 100 | 100 | 200 |
| Batch Size (T4) | 16 | 8 | 4 |

ChangeFormer gets a lower learning rate and more epochs because transformers need slower, longer training to converge.

---

## 8. What Loss Functions Did We Use and Why?

### The Class Imbalance Problem

In change detection, ~97% of pixels are "no change" and only ~3% are "change". If the model predicts "no change" for every pixel, it gets 97% accuracy but is completely useless. We need loss functions that handle this imbalance.

### BCEDiceLoss (Default)

We combine two losses:

**Binary Cross-Entropy (BCE)**:
```
BCE = -[y * log(p) + (1-y) * log(1-p)]
```
- Standard pixel-wise classification loss
- Treats each pixel independently
- Applied on raw logits (numerically stable)

**Dice Loss**:
```
Dice = 1 - (2 * |P intersection G| + smooth) / (|P| + |G| + smooth)
```
- Measures overlap between predicted and ground truth change regions
- Directly optimizes for the F1 metric
- Less sensitive to class imbalance because it looks at the ratio of overlap, not individual pixels

**Combined**:
```
Loss = 0.5 * BCE + 0.5 * Dice
```

BCE provides stable gradients for learning, Dice pushes toward better F1 scores.
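The combined loss maps directly to a few lines of PyTorch (a minimal sketch matching the formulas above; the project's class may differ in reduction details):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BCEDiceLoss(nn.Module):
    """0.5 * BCE-with-logits + 0.5 * soft Dice, as defined above."""
    def __init__(self, smooth: float = 1.0):
        super().__init__()
        self.smooth = smooth

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        bce = F.binary_cross_entropy_with_logits(logits, target)  # on raw logits
        prob = torch.sigmoid(logits)
        inter = (prob * target).sum()
        dice = 1 - (2 * inter + self.smooth) / (prob.sum() + target.sum() + self.smooth)
        return 0.5 * bce + 0.5 * dice
```

Note the Dice term uses soft probabilities rather than thresholded masks, so it stays differentiable.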
### FocalLoss (Alternative)

```
Focal = -alpha * (1 - p_t)^gamma * log(p_t)
```

- Down-weights easy pixels (clearly "no change")
- Focuses training on hard pixels near decision boundaries
- alpha=0.25, gamma=2.0

We provide both in the config — BCEDiceLoss is the default because it produced better results empirically.

---

## 9. How Do We Evaluate The Models?

### Metrics

All metrics are computed at **threshold=0.5** on the binary change mask:

**F1-Score (Primary Metric)**:
```
F1 = 2 * (Precision * Recall) / (Precision + Recall)
```
The harmonic mean of precision and recall. Our primary metric for model selection and early stopping. Balances detecting all changes (recall) against avoiding false alarms (precision).

**IoU (Intersection over Union / Jaccard Index)**:
```
IoU = TP / (TP + FP + FN)
```
Measures overlap between predicted and true change masks. More stringent than F1 — it penalizes both missed changes and false alarms.

**Precision**:
```
Precision = TP / (TP + FP)
```
"Of all pixels the model predicted as changed, how many actually changed?" High precision = few false alarms.

**Recall**:
```
Recall = TP / (TP + FN)
```
"Of all pixels that actually changed, how many did the model detect?" High recall = few missed changes.

**Overall Accuracy (OA)**:
```
OA = (TP + TN) / (TP + TN + FP + FN)
```
Simple pixel accuracy. Always high (>96%) due to class imbalance — NOT a reliable metric on its own.

### MetricTracker Implementation

We built a `MetricTracker` class that:
1. Takes raw model logits (no manual sigmoid needed)
2. Applies sigmoid + threshold internally
3. Accumulates TP/FP/FN/TN across batches on GPU
4. Only moves scalar results to CPU for final computation
5. Returns all 5 metrics as a dictionary
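A minimal sketch of that class (counts are kept as Python ints here for simplicity; the real class accumulates them as GPU tensors and moves only the final scalars to CPU):

```python
import torch

class MetricTracker:
    """Accumulate confusion counts over batches, then compute the 5 metrics."""
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self.tp = self.fp = self.fn = self.tn = 0

    def update(self, logits: torch.Tensor, target: torch.Tensor) -> None:
        pred = torch.sigmoid(logits) > self.threshold   # sigmoid + threshold internally
        gt = target.bool()
        self.tp += (pred & gt).sum().item()
        self.fp += (pred & ~gt).sum().item()
        self.fn += (~pred & gt).sum().item()
        self.tn += (~pred & ~gt).sum().item()

    def compute(self, eps: float = 1e-8) -> dict:
        precision = self.tp / (self.tp + self.fp + eps)
        recall = self.tp / (self.tp + self.fn + eps)
        return {
            "f1": 2 * precision * recall / (precision + recall + eps),
            "iou": self.tp / (self.tp + self.fp + self.fn + eps),
            "precision": precision,
            "recall": recall,
            "oa": (self.tp + self.tn) / (self.tp + self.tn + self.fp + self.fn + eps),
        }
```

Accumulating raw counts (rather than averaging per-batch metrics) gives the exact dataset-level F1/IoU regardless of batch size.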
### Evaluation Outputs

The evaluation script generates:
- `results.json` — All metrics in machine-readable format
- `prediction_grid.png` — 5 sample predictions (Before | After | Ground Truth | Prediction)
- `predictions/` — 20 individual prediction plots
- `overlays/` — Top 10 most interesting predictions (ranked by change area) with red overlay

---

## 10. What Are Our Results?

### Test Set Performance (LEVIR-CD, 2,048 patches)

| Model | F1 | IoU | Precision | Recall | OA | Epochs Trained |
|---|---|---|---|---|---|---|
| Siamese CNN | 0.6441 | 0.4751 | 0.8084 | 0.5353 | 0.9699 | 3* |
| **UNet++** | **0.9035** | **0.8240** | **0.9280** | **0.8803** | **0.9904** | 85 |
| ChangeFormer | 0.8836 | 0.7915 | 0.8944 | 0.8731 | 0.9883 | 141 |

*\*Siamese CNN was undertrained due to a session interruption (3 epochs instead of 100). With full training we estimate it would reach F1 ~0.82-0.85.*

### Analysis

1. **UNet++ achieved the best F1 (0.9035)** — Its nested skip connections excel at capturing multi-scale building changes. It effectively bridges fine-grained spatial details with high-level semantic understanding.

2. **ChangeFormer achieved F1 0.8836** — Very competitive but slightly below UNet++. The transformer's global attention helps with large-scale changes, but the relatively small patch size (256x256) limits the advantage of global context.

3. **Siamese CNN (undertrained)** — With only 3 epochs, it shows the baseline capability. Its high precision (0.808) but low recall (0.535) means it's conservative — it catches changes it's confident about but misses many subtle ones.

4. **All models achieve >96% OA** — This highlights why overall accuracy alone is misleading for imbalanced problems. Even a model that predicts "no change" everywhere would get ~97% OA.

### Key Insight

UNet++'s superior performance suggests that **multi-scale feature fusion with skip connections is more important than global self-attention** for this particular task and patch size. The nested decoder effectively captures both small buildings (low-level features) and large developments (high-level features).

---

## 11. How Does The Inference Pipeline Work?

For real-world use, satellite images are much larger than 256x256. Our inference pipeline handles **any resolution** through sliding-window (tiled) inference:

```
Input Image (e.g., 1024x1024)
        |
        v
Pad to nearest multiple of 256
        |
        v
Tile into 256x256 non-overlapping patches
        |
        v
Run model on each patch pair
        |
        v
Stitch probability maps back together
        |
        v
Crop to original size
        |
        v
Apply threshold (0.5) --> Binary change mask
```

This means the system works on images of any size — from 256x256 test patches to full 4000x4000 satellite imagery.
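The pad, tile, predict, stitch, crop steps above can be sketched in numpy (a minimal sketch; `predict` here is any callable mapping two patches to a probability map, standing in for the real model):

```python
import numpy as np

def tiled_inference(img_a, img_b, predict, tile=256):
    """Pad -> tile -> predict per tile -> stitch -> crop -> threshold."""
    h, w = img_a.shape[:2]
    ph, pw = -h % tile, -w % tile                  # pad up to the next tile multiple
    pad = ((0, ph), (0, pw), (0, 0))
    a, b = np.pad(img_a, pad), np.pad(img_b, pad)
    prob = np.zeros(a.shape[:2], dtype=np.float32)
    for y in range(0, a.shape[0], tile):           # non-overlapping grid
        for x in range(0, a.shape[1], tile):
            prob[y:y + tile, x:x + tile] = predict(
                a[y:y + tile, x:x + tile], b[y:y + tile, x:x + tile])
    return (prob[:h, :w] > 0.5).astype(np.uint8)   # crop + threshold -> binary mask

# Toy "model": mean absolute difference per pixel as the change probability
mask = tiled_inference(
    np.zeros((300, 520, 3), np.float32), np.ones((300, 520, 3), np.float32),
    predict=lambda a, b: np.abs(a - b).mean(axis=2), tile=256)
assert mask.shape == (300, 520) and mask.all()
```

Because padding is cropped away at the end, the output mask always matches the input resolution exactly.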
| 517 |
-
|
| 518 |
-
### Outputs
|
| 519 |
-
|
| 520 |
-
1. **Binary change mask** (PNG) — White pixels = change detected
|
| 521 |
-
2. **Overlay visualization** — After image with detected changes highlighted in red
|
| 522 |
-
3. **Change statistics** — Percentage of area changed, pixel counts
|
| 523 |
-
|
| 524 |
-
---
|
| 525 |
-
|
| 526 |
-
## 12. How Does The Web Application Work?
|
| 527 |
-
|
| 528 |
-
We built an interactive web interface using **Gradio** that allows anyone to use the model without any coding knowledge:
|
| 529 |
-
|
| 530 |
-
### User Flow
|
| 531 |
-
|
| 532 |
-
1. Upload a "Before" satellite image
|
| 533 |
-
2. Upload an "After" satellite image
|
| 534 |
-
3. Select a model from the dropdown (auto-detects available checkpoints)
|
| 535 |
-
4. Adjust the detection threshold if needed (default 0.5)
|
| 536 |
-
5. Click "Detect Changes"
|
| 537 |
-
6. View results: change mask, red overlay, and statistics table
|
| 538 |
-
|
| 539 |
-
### Technical Details
|
| 540 |
-
|
| 541 |
-
- **Auto-checkpoint detection** — The app scans multiple directories for checkpoint files and only shows models that have checkpoints available
|
| 542 |
-
- **Model caching** — Once a model is loaded, it stays in memory for instant subsequent predictions
|
| 543 |
-
- **CPU fallback** — Works without GPU (slower but functional)
|
| 544 |
-
- **Any image size** — Uses the same tiled inference pipeline
|
| 545 |
-
- **Public sharing** — Can generate a public URL for remote access
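The model-caching behaviour boils down to a small keyed cache. A minimal sketch (the `get_model` helper and its signature are illustrative, not the app's actual code):

```python
# Cache keyed by checkpoint path: each model is loaded at most once,
# so repeated predictions skip the expensive weight-loading step.
_MODEL_CACHE = {}

def get_model(checkpoint_path, loader):
    """Return the cached model for this checkpoint, loading it on first use."""
    if checkpoint_path not in _MODEL_CACHE:
        _MODEL_CACHE[checkpoint_path] = loader(checkpoint_path)
    return _MODEL_CACHE[checkpoint_path]
```

Switching models in the dropdown therefore only pays the load cost the first time each checkpoint is selected.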
---

## 13. What Tools and Technologies Did We Use?

### Core Framework

| Tool | Purpose | Why We Chose It |
|---|---|---|
| **PyTorch 2.x** | Deep learning framework | Industry standard, dynamic computation graph, excellent GPU support |
| **Python 3.10+** | Programming language | De facto language for ML/DL |

### Model Libraries

| Library | Purpose | Why |
|---|---|---|
| **torchvision** | ResNet18/34 pretrained backbones | Official PyTorch model zoo |
| **segmentation-models-pytorch (SMP)** | UNet++ architecture | Best-maintained segmentation library, provides encoder-decoder framework |
| **timm** | Transformer utilities | State-of-the-art vision model components |
| **einops** | Tensor rearrangement | Clean, readable tensor reshaping for transformer code |

### Data Processing

| Library | Purpose | Why |
|---|---|---|
| **albumentations** | Image augmentation | Fast, GPU-friendly, supports ReplayCompose for synchronized transforms |
| **OpenCV** | Image I/O | Fast image reading/writing, supports multiple formats |
| **NumPy** | Array operations | Foundation for all numerical computation |

### Training Infrastructure

| Tool | Purpose | Why |
|---|---|---|
| **TensorBoard** | Training visualization | Real-time loss curves, metric tracking, prediction grids |
| **Google Colab / Kaggle** | Cloud GPU | Free T4/P100 GPUs for training |
| **Google Drive** | Persistent storage | Checkpoints survive Colab disconnections |
| **YAML** | Configuration | Human-readable, all hyperparameters in one place |

### Deployment

| Tool | Purpose | Why |
|---|---|---|
| **Gradio** | Web interface | Fastest way to create ML demos, no frontend code needed |

---
## 14. What Is Our Innovation / Contribution?

### 1. Unified Multi-Architecture Comparison Framework

We built a single codebase that trains, evaluates, and deploys three fundamentally different architectures (CNN, UNet++, Transformer) under identical conditions — same data, same augmentations, same loss function, same metrics. Most papers only present one model. Our framework enables fair comparison.

### 2. Defense Application Framing

We contextualized general change detection for military surveillance applications — monitoring base expansion, runway construction, and infrastructure development. The same technology used for urban planning is directly applicable to defense intelligence.

### 3. Custom ChangeFormer Implementation

The ChangeFormer transformer is implemented from scratch (~350 lines of custom PyTorch code), not imported from a library:
- Overlapping Patch Embeddings
- Efficient Self-Attention with Spatial Reduction
- Mix Feed-Forward Networks with Depthwise Convolutions
- Hierarchical 4-stage Encoder
- Multi-scale MLP Decoder

### 4. Production-Ready Pipeline

This is not just a training notebook — it's a complete system:
- Automated data download and preprocessing
- Resume-capable training with cloud storage
- Tiled inference for any-resolution images
- Interactive web application for non-technical users
- Auto GPU detection and batch size optimization

### 5. Custom Loss and Metrics

We implemented BCEDiceLoss (combines classification and overlap objectives) and a MetricTracker that operates on GPU tensors for efficient evaluation.
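The idea behind the combined objective can be shown numerically. This numpy sketch mirrors the structure of a BCE + Dice loss; the project's actual `BCEDiceLoss` is a PyTorch module operating on logits:

```python
import numpy as np

def bce_dice_loss(probs, target, eps=1e-7):
    """Illustrative BCE + Dice loss on probability maps in [0, 1]."""
    probs = np.clip(probs, eps, 1 - eps)
    # Binary cross-entropy: per-pixel classification term
    bce = -np.mean(target * np.log(probs) + (1 - target) * np.log(1 - probs))
    # Dice: overlap term, robust to the changed/unchanged class imbalance
    inter = np.sum(probs * target)
    dice = 1 - (2 * inter + eps) / (np.sum(probs) + np.sum(target) + eps)
    return bce + dice
```

BCE pushes each pixel toward the right class, while the Dice term rewards overall mask overlap even when changed pixels are rare.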
---

## 15. What Are The Limitations?

1. **Training data is civilian** — Trained on LEVIR-CD (civilian buildings in Texas). While structurally similar to military construction, the model hasn't seen actual military facilities, camouflaged structures, or underground bunkers.

2. **Single geographic region** — LEVIR-CD covers only Texas, USA. Performance may degrade on satellite imagery from different geographic regions with different building styles, vegetation, or terrain.

3. **Fixed resolution** — Trained on 0.5m/pixel resolution. Lower resolution imagery (e.g., Sentinel-2 at 10m/pixel) would require retraining.

4. **No temporal reasoning** — The model only sees two time points. It cannot track gradual construction progress over multiple time steps.

5. **Lighting sensitivity** — Significant illumination differences between before/after images can cause false positives or missed detections.

6. **Siamese CNN undertrained** — Due to session interruptions, the Siamese CNN baseline was only trained for 3 epochs, not providing a fair comparison point.

---
## 16. Future Work

1. **Military-specific fine-tuning** — Fine-tune on declassified military satellite imagery to improve detection of defense-specific structures.

2. **Multi-temporal analysis** — Extend from 2 timestamps to a sequence, tracking construction progress over months/years.

3. **Object-level detection** — Instead of just pixel masks, classify WHAT changed (building, road, runway, vehicle).

4. **Model ensemble** — Combine predictions from all three models for higher accuracy.

5. **Attention visualization** — Show which parts of the image the transformer attends to, providing explainability for intelligence analysts.

6. **Real-time satellite feed** — Connect to live satellite imagery APIs for continuous monitoring.

7. **Deploy on Hugging Face Spaces** — Create a permanent public URL for the web demo.

---
## 17. How To Present This Project

### Opening (1 minute)

> "We built an AI system that monitors military base construction from satellite imagery. You give it two satellite photos — one old, one new — and it highlights exactly what changed: new buildings, new runways, new infrastructure. We compared three deep learning approaches and achieved 90% F1 score."

### Show The Demo (2 minutes)

1. Open the Gradio app (localhost:7860 or public URL)
2. Upload a before/after pair from the test set
3. Show the change detection output
4. Switch between models to show different predictions
5. Adjust the threshold slider

### Show The Results (1 minute)

Present the comparison table:

| Model | F1 | IoU | Architecture |
|---|---|---|---|
| Siamese CNN | 0.64 | 0.48 | Basic CNN |
| ChangeFormer | 0.88 | 0.79 | Transformer |
| **UNet++** | **0.90** | **0.82** | **Nested UNet** |

> "UNet++ achieved the best results. Its nested skip connections are ideal for multi-scale change detection. Interestingly, it outperformed the more complex transformer model, suggesting that architectural inductive biases (convolutions that understand local spatial structure) are more important than global self-attention for 256x256 patches."

### Answer Common Questions

**Q: "You used readymade models?"**
> "The backbones (ResNet, MiT) are pretrained on ImageNet — that's transfer learning, standard practice. But the change detection architecture is custom — Siamese encoding, feature differencing, and the full ChangeFormer transformer are written from scratch. We also wrote custom loss functions and a complete training pipeline."

**Q: "What's novel?"**
> "The systematic comparison of three generations of deep learning on defense surveillance, packaged as a deployable web application. We show that UNet++ outperforms transformers for this task and patch size — a non-obvious finding that challenges the assumption that newer = better."

**Q: "How is this military?"**
> "Military bases are buildings and infrastructure. The model detects new construction from satellite imagery. Point it at a known military zone and it becomes a defense intelligence tool. The technology is the same — the application context makes it military."
MODELS_EXPLAINED.md DELETED

@@ -1,573 +0,0 @@
# Deep Dive: Models, Transfer Learning, and Fine-Tuning Explained

## Table of Contents

1. [What Is Transfer Learning and Why ImageNet?](#1-what-is-transfer-learning-and-why-imagenet)
2. [What Exactly Did We Fine-Tune?](#2-what-exactly-did-we-fine-tune)
3. [Model 1: Siamese CNN — Explained Like You're Teaching It](#3-model-1-siamese-cnn)
4. [Model 2: UNet++ — Why A Medical Model Works For Satellites](#4-model-2-unet)
5. [Model 3: ChangeFormer — The Transformer Approach](#5-model-3-changeformer)
6. [Why UNet++ Even Though It's A Medical Model?](#6-why-unet-even-though-its-a-medical-model)
7. [What Happens Inside During Inference — Step By Step](#7-what-happens-inside-during-inference)
8. [How To Explain This To Faculty](#8-how-to-explain-this-to-faculty)

---

## 1. What Is Transfer Learning and Why ImageNet?

### The Problem With Training From Scratch

A deep learning model needs to learn TWO things:
1. **Low-level features** — edges, textures, corners, gradients, colors
2. **High-level features** — objects, shapes, spatial relationships

Learning low-level features from scratch takes millions of images and days of training. But here's the key insight: **edges look the same everywhere**. An edge in a cat photo looks the same as an edge in a satellite photo. A texture gradient in a car image is structurally identical to a texture gradient in a building image.

### What Is ImageNet?

ImageNet is a dataset of **14 million images**; the widely used ILSVRC pretraining subset contains 1.28 million of them across 1000 categories (cats, dogs, cars, planes, buildings, landscapes, etc.). Models trained on ImageNet learn incredibly rich low-level and mid-level features because they've seen enormous visual diversity.

### What Is Transfer Learning?

Instead of training from scratch (random weights), we START with weights that were trained on ImageNet. This gives us:

```
FROM SCRATCH:
Random weights --> [needs millions of images] --> Learns edges --> Learns textures --> Learns shapes --> Learns objects
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
                   THIS TAKES FOREVER

TRANSFER LEARNING:
ImageNet weights --> [already knows edges, textures, shapes] --> [needs few thousand images] --> Learns satellite-specific patterns
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                     FREE - comes with pretrained weights        THIS IS FAST
```

### Analogy

Think of it like learning a new language:
- **From scratch**: A baby learning their first language — takes years
- **Transfer learning**: A person who speaks English learning Spanish — much faster because they already understand grammar, sentence structure, and many shared words

ImageNet pretraining = knowing English. Satellite change detection = learning Spanish. The foundation transfers.

---

## 2. What Exactly Did We Fine-Tune?

### What "Fine-Tuning" Means

Fine-tuning means we took the pretrained ImageNet weights and **continued training them on our satellite data**. We didn't freeze anything — ALL layers were updated. This is called **end-to-end fine-tuning**.

### What Changed During Fine-Tuning

```
BEFORE Fine-Tuning (ImageNet weights):
Layer 1: Detects generic edges, gradients
Layer 2: Detects generic textures, patterns
Layer 3: Detects generic shapes (circles, rectangles)
Layer 4: Detects generic objects (cat face, car wheel)
         ^^^^ These are useful for ANYTHING visual

AFTER Fine-Tuning (our satellite training):
Layer 1: Still detects edges (barely changed — edges are universal)
Layer 2: Detects satellite-specific textures (roof patterns, road textures)
Layer 3: Detects building footprints, road shapes
Layer 4: Detects "new building appeared" vs "same building"
         ^^^^ Early layers changed little, later layers changed a LOT
```

### The Numbers

| Model | Total Parameters | Pretrained (from ImageNet) | New (trained from scratch) |
|---|---|---|---|
| Siamese CNN | 14M | 11M (ResNet18 encoder) | 3M (decoder) |
| UNet++ | 26M | 21M (ResNet34 encoder) | 5M (decoder) |
| ChangeFormer | 14M | 0 (trained from scratch) | 14M (everything) |

**Key point**: For Siamese CNN and UNet++, the ENCODER (feature extractor) is pretrained. The DECODER (change mask generator) is trained from scratch. During fine-tuning, both encoder AND decoder are updated, but the encoder starts from a much better position.

**ChangeFormer is different**: We wrote the entire architecture from scratch. There are no widely available pretrained MiT-B1 weights for change detection, so we trained everything from random initialization. This is why it needs 200 epochs instead of 100.

### What Does The Training Actually Do?

Each training step:
1. Feed a before/after image pair through the model
2. Model outputs a predicted change mask
3. Compare prediction with ground truth using BCEDiceLoss
4. Compute gradients (how much each weight contributed to the error)
5. Update ALL weights slightly in the direction that reduces error
6. Repeat 7,120 times per epoch (one per training sample)
7. Repeat for 85-141 epochs

After training:
- Early layers (edges, textures): changed ~5-10% from ImageNet values
- Middle layers (shapes, patterns): changed ~20-40%
- Late layers (semantic understanding): changed ~60-90%
- Decoder layers: learned entirely from our data
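Steps 1-5 above are ordinary gradient descent. A toy numpy version on a single weight makes the update rule concrete (purely illustrative; the real trainer updates millions of weights via autograd):

```python
import numpy as np

def sgd_step(w, x, y, lr=0.1):
    """One training step for a 1-parameter model p = sigmoid(w * x)
    with binary cross-entropy loss, whose gradient is (p - y) * x."""
    p = 1.0 / (1.0 + np.exp(-w * x))   # forward pass: prediction
    grad = (p - y) * x                 # backward pass: gradient of loss
    return w - lr * grad               # update weight against the gradient
```

Repeating this step nudges the prediction toward the label, which is exactly what happens 7,120 times per epoch at full scale.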
---

## 3. Model 1: Siamese CNN

### What Is "Siamese"?

"Siamese" means twins — like Siamese twins. The model has TWO identical paths that share the SAME weights:

```
Image A (Before) ----\
                      [Same ResNet18] ---- Features A
Image B (After)  ----/                     Features B
                      ^^^^^^^^^^^^^
                      SHARED WEIGHTS
                      (not two separate networks)
```

**Why shared?** If both images go through the EXACT same processing, then any difference in the output features MUST be because the images themselves are different. The shared weights act as a fair, unbiased feature extractor.

### ResNet18 Encoder — Step by Step

ResNet18 is a Convolutional Neural Network with 18 layers. Here's what happens to a 256x256 satellite image:

```
Input: [3, 256, 256]   (3 = RGB channels)
  |
  v
Conv1 + BN + ReLU + MaxPool
  |   --> [64, 64, 64]   (64 feature channels, spatial size reduced to 64x64)
  v
Layer 1 (2 residual blocks)
  |   --> [64, 64, 64]   (same size, refined features)
  v
Layer 2 (2 residual blocks)
  |   --> [128, 32, 32]  (more channels, smaller spatial)
  v
Layer 3 (2 residual blocks)
  |   --> [256, 16, 16]  (even more channels, even smaller)
  v
Layer 4 (2 residual blocks)
  |   --> [512, 8, 8]    (512 feature channels, 8x8 spatial grid)
  v
Output: Rich feature representation
```

Each "residual block" has the famous skip connection:
```
input ----> [Conv -> BN -> ReLU -> Conv -> BN] ----> ADD ----> ReLU ----> output
  |                                                   ^
  |_____________(identity shortcut)___________________|
```

The skip connection solves the vanishing gradient problem — gradients can flow directly through the shortcut, making deep networks trainable.

### The Difference Operation

After encoding both images:
```
Features_A: [512, 8, 8]   (before image encoded)
Features_B: [512, 8, 8]   (after image encoded)

Difference = |Features_A - Features_B|   (absolute difference, element-wise)
Result: [512, 8, 8]   (where values are high = something changed)
```

If a pixel in Features_A has value 0.8 and the same pixel in Features_B has value 0.2, the difference is 0.6 — meaning this region changed significantly.
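The same 0.8 vs 0.2 example as an element-wise numpy operation (toy feature maps, not real encoder output):

```python
import numpy as np

# Toy encoded features for before/after images: (channels, height, width)
feats_a = np.zeros((512, 8, 8), dtype=np.float32)
feats_b = np.zeros((512, 8, 8), dtype=np.float32)
feats_a[:, 2, 3] = 0.8   # one region differs between the two encodings...
feats_b[:, 2, 3] = 0.2   # ...so the feature values disagree there

diff = np.abs(feats_a - feats_b)      # element-wise |A - B|, still (512, 8, 8)
change_strength = diff.mean(axis=0)   # (8, 8) map: high value = changed region
```

Averaging over channels collapses the 512-dim difference into a single per-location change score.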
### The Decoder — Transposed Convolutions

Now we need to go from 8x8 back to 256x256. Transposed convolution (also called "deconvolution") does upsampling:

```
[512, 8, 8]
  | TransposedConv + BN + ReLU
  v
[256, 16, 16]
  | TransposedConv + BN + ReLU
  v
[128, 32, 32]
  | TransposedConv + BN + ReLU
  v
[64, 64, 64]
  | TransposedConv + BN + ReLU
  v
[32, 128, 128]
  | TransposedConv (final)
  v
[1, 256, 256]   <-- Change mask! (raw logits, apply sigmoid for probabilities)
```
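Each stage above doubles the spatial size. A small helper shows how the standard transposed-convolution output formula produces the 8 -> 256 ladder, assuming a common kernel/stride/padding choice of 4/2/1 (illustrative; the project's exact layer hyperparameters may differ):

```python
def tconv_out(h_in, kernel=4, stride=2, padding=1):
    """Output size of a transposed convolution (output_padding = 0):
    (H_in - 1) * stride - 2 * padding + kernel."""
    return (h_in - 1) * stride - 2 * padding + kernel

# Walk the decoder ladder from the 8x8 bottleneck up to 256x256
sizes = [8]
while sizes[-1] < 256:
    sizes.append(tconv_out(sizes[-1]))
# sizes -> [8, 16, 32, 64, 128, 256]
```

With these settings every stage exactly doubles the resolution, so five stages recover the input size.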
### Weakness

The encoder compresses 256x256 down to 8x8 — that's a 32x reduction. Fine spatial details are lost. A small building that's 10x10 pixels becomes less than 1 pixel in the 8x8 feature map. The decoder tries to reconstruct this but without skip connections (unlike UNet), it struggles with precise localization.

---

## 4. Model 2: UNet++

### First, What Is Regular UNet?

UNet was invented for medical image segmentation (detecting tumors in brain scans). It has an **encoder-decoder structure with skip connections**:

```
ENCODER (downsampling)               DECODER (upsampling)
[256x256] ----skip connection----> [256x256]
   |                                   ^
[128x128] ----skip connection----> [128x128]
   |                                   ^
[64x64]   ----skip connection----> [64x64]
   |                                   ^
[32x32]   ----skip connection----> [32x32]
   |                                   ^
[16x16]   ------bottleneck-------> [16x16]
```

The skip connections DIRECTLY copy encoder features to the decoder. This means the decoder has access to BOTH:
- High-level semantic info (from the bottleneck): "this region has a building"
- Low-level spatial detail (from skip connections): "the exact edge of the building is here"

### What Makes UNet++ Different From UNet?

Regular UNet's problem: the skip connections connect features at very different semantic levels. The encoder at level 2 produces "edge features" while the decoder at level 2 needs "building boundary features". There's a **semantic gap**.

UNet++ fixes this with **nested intermediate blocks**:

```
Regular UNet:
Encoder --------direct skip--------> Decoder
(raw features)                       (needs processed features)
               ^^ SEMANTIC GAP ^^

UNet++:
Encoder ----> [Block] ----> [Block] ----> Decoder
(raw)         (processed)   (more processed)   (ready to use)
              ^^^^^^^^^^^^^^^^^^^^^^^^
              NESTED DENSE BLOCKS bridge the gap
```

In detail:
```
X(0,0) ---------> X(0,1) ---------> X(0,2) ---------> X(0,3) ---------> X(0,4)
  |                 |                 |                 |
X(1,0) ---------> X(1,1) ---------> X(1,2) ---------> X(1,3)
  |                 |                 |
X(2,0) ---------> X(2,1) ---------> X(2,2)
  |                 |
X(3,0) ---------> X(3,1)
  |
X(4,0)   (bottleneck)
```

Each X(i,j) node receives inputs from:
- The node below it (deeper features)
- ALL previous nodes at the same level (dense connections)

This means by the time features reach the output, they've been progressively refined through multiple intermediate processing stages.

### How We Adapted UNet++ For Change Detection

Original UNet++ takes ONE image and segments it. We adapted it for TWO images:

```
Image A (Before) --> [ResNet34 Encoder] --> Features at 5 scales
                     |   (shared weights)
Image B (After)  --> [ResNet34 Encoder] --> Features at 5 scales

At each scale:
    diff[i] = |Features_A[i] - Features_B[i]|

diff features --> [UNet++ Decoder with nested skip connections] --> Change Mask
```

We use ResNet34 (34 layers, deeper than ResNet18) as the encoder via the `segmentation-models-pytorch` library, which provides the UNet++ decoder architecture.

### Why ResNet34 Instead of ResNet18?

ResNet34 has more layers and captures richer features:
- ResNet18: [2, 2, 2, 2] blocks = 18 layers
- ResNet34: [3, 4, 6, 3] blocks = 34 layers

More depth = better feature extraction, especially for the subtle differences between before/after satellite images.

---

## 5. Model 3: ChangeFormer

### What Is A Vision Transformer?

Traditional CNNs look at LOCAL regions (3x3 or 5x5 patches). Transformers look at GLOBAL relationships — every part of the image can attend to every other part.

### The Self-Attention Mechanism

For a given position in the image, self-attention asks: "Which OTHER positions in this image are relevant to understanding THIS position?"

```
Example: A new building appears in the top-left
Self-attention notices:
- New road appeared nearby (related construction)
- Parking lot appeared on the right (part of same development)
- Trees on the south side were cleared (preparation for construction)

A CNN would process each region independently.
A Transformer connects them all.
```

### How Self-Attention Works (Simplified)

For each pixel position:
1. Create a **Query** (Q): "What am I looking for?"
2. Create a **Key** (K): "What information do I have?"
3. Create a **Value** (V): "What information can I give?"

```
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d)) * V
```

- Q * K^T: How relevant is each position to me? (attention score)
- softmax: Normalize to probabilities
- * V: Weight the values by attention scores
- / sqrt(d): Scale factor for numerical stability
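The formula above maps directly to a few lines of numpy (a generic scaled dot-product attention sketch, not the project's ChangeFormer module):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (n_q, n_k) relevance scores
    scores -= scores.max(axis=-1, keepdims=True)   # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax: rows sum to 1
    return weights @ V                             # attention-weighted values
```

Because each output row is a convex combination of the value rows, every output stays within the per-column range of V.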
### The MiT-B1 Architecture

MiT (Mix Transformer) is a hierarchical transformer — unlike ViT which processes the image at one scale, MiT processes at 4 scales (like a CNN):

**Stage 1 (64x64, 64 channels)**:
```
256x256 image
  |
Overlapping Patch Embed (7x7 conv, stride 4)
  |
64x64 grid of 64-dim tokens (4096 tokens)
  |
2x [Efficient Self-Attention + Mix-FFN]
  |
Output: [64, 64, 64] features
```

**Stage 2 (32x32, 128 channels)**:
```
Overlapping Patch Embed (3x3 conv, stride 2)
  |
32x32 grid of 128-dim tokens (1024 tokens)
  |
2x [Efficient Self-Attention + Mix-FFN]
  |
Output: [128, 32, 32] features
```

**Stage 3 (16x16, 320 channels)** and **Stage 4 (8x8, 512 channels)** follow the same pattern.

### Efficient Self-Attention

Standard self-attention on 64x64 = 4096 tokens would require a 4096x4096 attention matrix — too expensive. We use **Spatial Reduction**:

```
Standard:  Q (4096 tokens) x K (4096 tokens) = 16.8M attention scores (TOO SLOW)

Efficient:
Q stays at 4096 tokens
K and V are spatially reduced: 4096 -> 64 tokens (reduction ratio 8 per spatial side)
Q (4096) x K (64) = 262K attention scores (64x cheaper!)
```

This is done via a strided convolution that reduces K and V before computing attention.
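The cost saving can be checked with a quick sketch. Here average pooling over groups of tokens stands in for the strided-conv reduction (an assumption for illustration; the real module learns the reduction):

```python
import numpy as np

n, d, r = 4096, 64, 8              # tokens, channel dim, per-side reduction ratio
K = np.random.rand(n, d)

# Spatial reduction: collapse every r*r tokens into one by averaging,
# so K shrinks from 4096 tokens to 4096 / 64 = 64 tokens
K_reduced = K.reshape(n // (r * r), r * r, d).mean(axis=1)

full_scores = n * n                          # 16,777,216 score entries
eff_scores = n * K_reduced.shape[0]          # 262,144 score entries
```

The attention score matrix drops from 4096x4096 to 4096x64, a 64x reduction in both memory and compute for that matrix.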
### Mix-FFN

Standard transformers use a simple MLP (Linear -> GELU -> Linear) after attention. Mix-FFN adds a **depthwise 3x3 convolution** in the middle:

```
Standard FFN:  Linear -> GELU -> Linear
Mix-FFN:       Linear -> DepthwiseConv3x3 -> GELU -> Linear
                         ^^^^^^^^^^^^^^^^^
                         Injects local spatial information
```

Why? Pure transformers have no notion of "nearby pixels". The depthwise conv brings back local spatial awareness without the cost of full convolutions. This eliminates the need for explicit position embeddings.

### The MLP Decoder

After the encoder produces features at 4 scales, the decoder fuses them:

```
Stage 1 features: [64, 64, 64]  --[1x1 Conv]--> [64, 64, 64] --[Upsample]--> [64, 64, 64]
Stage 2 features: [128, 32, 32] --[1x1 Conv]--> [64, 32, 32] --[Upsample]--> [64, 64, 64]
Stage 3 features: [320, 16, 16] --[1x1 Conv]--> [64, 16, 16] --[Upsample]--> [64, 64, 64]
Stage 4 features: [512, 8, 8]   --[1x1 Conv]--> [64, 8, 8]   --[Upsample]--> [64, 64, 64]

Concatenate all: [256, 64, 64]
  |
[1x1 Conv + BN + ReLU] --> [64, 64, 64]
  |
[1x1 Conv] --> [1, 64, 64]
  |
[Upsample 4x] --> [1, 256, 256]   <-- Final change mask
```

All scales are projected to the same dimension (64), upsampled to the same size (64x64), concatenated, and fused with a simple 1x1 convolution.
|
| 408 |
-
|
| 409 |
-
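The fusion diagram above can be sketched directly in PyTorch (a simplified stand-in, not the repository's decoder; channel counts follow the diagram):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPDecoder(nn.Module):
    """Project every scale to 64 channels, upsample to 64x64, concat, fuse."""

    def __init__(self, in_dims=(64, 128, 320, 512), embed_dim: int = 64) -> None:
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(d, embed_dim, 1) for d in in_dims)
        self.fuse = nn.Sequential(
            nn.Conv2d(4 * embed_dim, embed_dim, 1),
            nn.BatchNorm2d(embed_dim),
            nn.ReLU(inplace=True),
        )
        self.head = nn.Conv2d(embed_dim, 1, 1)  # 1 change logit per pixel

    def forward(self, feats):
        target = feats[0].shape[-2:]  # 64x64, the finest scale
        ups = [
            F.interpolate(p(f), size=target, mode="bilinear", align_corners=False)
            for p, f in zip(self.proj, feats)
        ]
        x = self.head(self.fuse(torch.cat(ups, dim=1)))  # [B, 1, 64, 64]
        return F.interpolate(x, scale_factor=4, mode="bilinear", align_corners=False)

feats = [torch.randn(1, c, s, s) for c, s in [(64, 64), (128, 32), (320, 16), (512, 8)]]
mask_logits = MLPDecoder()(feats)
print(mask_logits.shape)  # torch.Size([1, 1, 256, 256])
```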
---

## 6. Why UNet++ Even Though It's A Medical Model?

This is a great question and one your faculty will likely ask. Here's the answer:

### The Core Insight: Segmentation Is Segmentation

UNet++ was designed for **medical image segmentation** — detecting tumor boundaries in CT scans, cell boundaries in microscopy, organ boundaries in MRI. But what IS segmentation?

```
Medical:   Input image --> Classify each pixel as (tumor / not tumor)
Satellite: Input image --> Classify each pixel as (changed / not changed)
```

**The task is structurally identical.** Both are binary pixel-level classification problems with:

| Property | Medical | Satellite Change Detection |
|---|---|---|
| Task | Pixel classification | Pixel classification |
| Output | Binary mask | Binary mask |
| Class imbalance | Tumor is tiny vs whole brain | Changed area is tiny vs whole image |
| Multi-scale | Tumors vary from 5px to 500px | Buildings vary from 10px to 200px |
| Needs precise boundaries | Yes (surgical planning) | Yes (accurate change mapping) |

### Why UNet++ Is Especially Good For This

1. **Multi-scale feature fusion** — Buildings come in all sizes. A small shed (10x10px) needs fine features. A large warehouse (100x100px) needs coarse features. UNet++'s nested skip connections fuse ALL scales.

2. **Precise boundary detection** — The skip connections preserve spatial detail. Change detection needs precise boundaries — "exactly WHICH pixels changed?"

3. **Handles class imbalance** — In both medical and satellite tasks, the "positive" class (tumor/change) is tiny. UNet++ was designed for this.

4. **Proven architecture** — It's not just medical anymore. UNet++ is used in:
   - Remote sensing (satellite segmentation)
   - Autonomous driving (road segmentation)
   - Industrial inspection (defect detection)
   - Agriculture (crop segmentation)

### The Adaptation We Made

Original UNet++: takes ONE image and segments it.
Our UNet++: takes TWO images through a SHARED encoder, computes feature differences, then decodes.

```
Standard UNet++:
  1 image  --> Encoder --> Decoder --> Segmentation mask

Our Adaptation:
  2 images --> Shared Encoder --> Feature Difference --> Decoder --> Change mask
```

This is NOT just "using UNet++ out of the box". We modified the architecture to handle bitemporal (two-image) input. The encoder is shared (Siamese), and we compute multi-scale feature differences before feeding into the decoder.

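The shared-encoder, feature-difference pattern can be sketched as follows (the `TinyEncoder` is a hypothetical stand-in for the real ResNet backbone, purely to show the wiring):

```python
import torch
import torch.nn as nn

class SiameseDiff(nn.Module):
    """Encode both images with the SAME weights, subtract features per scale."""

    def __init__(self, encoder: nn.Module) -> None:
        super().__init__()
        self.encoder = encoder  # returns a list of multi-scale feature maps

    def forward(self, before: torch.Tensor, after: torch.Tensor):
        feats_a = self.encoder(before)  # one module, two forward passes
        feats_b = self.encoder(after)   # => identical weights for both images
        # |f_before - f_after| per scale: large values mean likely change.
        return [torch.abs(a - b) for a, b in zip(feats_a, feats_b)]

class TinyEncoder(nn.Module):
    """Stand-in backbone: two conv stages at 1/2 and 1/4 resolution."""

    def __init__(self) -> None:
        super().__init__()
        self.s1 = nn.Conv2d(3, 16, 3, stride=2, padding=1)
        self.s2 = nn.Conv2d(16, 32, 3, stride=2, padding=1)

    def forward(self, x):
        f1 = torch.relu(self.s1(x))
        return [f1, torch.relu(self.s2(f1))]

model = SiameseDiff(TinyEncoder())
diffs = model(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256))
print([d.shape for d in diffs])
```

The decoder (UNet++'s nested skips, or the MLP decoder for ChangeFormer) then consumes this list of difference maps instead of single-image features.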
### What To Tell Faculty

> "UNet++ was originally for medical segmentation, but the underlying problem is identical — pixel-level classification with class imbalance, where both fine detail and coarse context matter. We adapted it for bitemporal input by using a shared encoder and computing feature differences at each scale. This architectural pattern (encoder-difference-decoder) is standard in the remote sensing change detection literature. UNet++ is now widely used beyond medical imaging — in satellite imagery, autonomous driving, and industrial inspection."

---

## 7. What Happens Inside During Inference — Step By Step

Let's trace what happens when you upload two images in the Gradio app:

### Step 1: Image Loading
```
User uploads:
  before.png (256x256 RGB, uint8, values 0-255)
  after.png  (256x256 RGB, uint8, values 0-255)
```

### Step 2: Preprocessing
```
Convert to float32: values 0.0 to 1.0
Apply ImageNet normalization:
  pixel = (pixel - mean) / std
  mean = [0.485, 0.456, 0.406]  (per RGB channel)
  std  = [0.229, 0.224, 0.225]

Result: normalized tensors, values roughly -2.0 to 2.5
Shape: [1, 3, 256, 256] each (batch=1, channels=3, height=256, width=256)
```

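This preprocessing step, sketched as a standalone function (the function name is illustrative; the repo imports its mean/std constants from `data.dataset`):

```python
import numpy as np
import torch

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(img: np.ndarray) -> torch.Tensor:
    """uint8 HxWx3 image -> normalized [1, 3, H, W] float tensor."""
    x = img.astype(np.float32) / 255.0       # 0..255 -> 0.0..1.0
    x = (x - IMAGENET_MEAN) / IMAGENET_STD   # per-channel normalization
    # HWC -> CHW, then add the batch dimension.
    return torch.from_numpy(x).permute(2, 0, 1).unsqueeze(0)

img = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
t = preprocess(img)
print(t.shape)  # torch.Size([1, 3, 256, 256])
```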
### Step 3: Pad If Needed
```
If image is 300x400:
  Pad to 512x512 (nearest multiple of 256)
  Using reflection padding (mirrors edge pixels)
```

### Step 4: Tile Into Patches (if larger than 256x256)
```
512x512 image --> 4 patches of 256x256
  Patch 1: top-left
  Patch 2: top-right
  Patch 3: bottom-left
  Patch 4: bottom-right
```

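Steps 3 and 4 together, as a NumPy sketch (a simplified non-overlapping tiler under the stated assumptions; the repo's `sliding_window_inference` may use overlap):

```python
import numpy as np

def pad_and_tile(img: np.ndarray, patch: int = 256):
    """Reflect-pad an HxWxC image up to a multiple of `patch`, then cut tiles."""
    h, w = img.shape[:2]
    ph = (patch - h % patch) % patch   # extra rows needed
    pw = (patch - w % patch) % patch   # extra cols needed
    padded = np.pad(img, ((0, ph), (0, pw), (0, 0)), mode="reflect")
    tiles = [
        padded[r : r + patch, c : c + patch]
        for r in range(0, padded.shape[0], patch)
        for c in range(0, padded.shape[1], patch)
    ]
    return padded, tiles

img = np.zeros((300, 400, 3), dtype=np.uint8)
padded, tiles = pad_and_tile(img)
print(padded.shape, len(tiles))  # (512, 512, 3) 4
```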
### Step 5: Model Forward Pass (for each patch pair)

**Using ChangeFormer as example:**

```
Before patch [1, 3, 256, 256] --> MiT Encoder --> 4 feature maps
After patch  [1, 3, 256, 256] --> MiT Encoder --> 4 feature maps
                                  (shared weights)

Feature differences at each scale:
  Scale 1: |before_64x64 - after_64x64| = diff_64x64
  Scale 2: |before_32x32 - after_32x32| = diff_32x32
  Scale 3: |before_16x16 - after_16x16| = diff_16x16
  Scale 4: |before_8x8   - after_8x8|   = diff_8x8

MLP Decoder fuses all scales:
  --> [1, 1, 256, 256] raw logits
```

### Step 6: Sigmoid + Threshold
```
Probabilities = sigmoid(logits)        # values 0.0 to 1.0
Binary mask   = (probabilities > 0.5)  # True/False per pixel
```

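In PyTorch this thresholding is two lines:

```python
import torch

logits = torch.randn(1, 1, 256, 256)  # raw model output (stand-in values)
probs = torch.sigmoid(logits)         # squash to 0.0..1.0 change probability
mask = probs > 0.5                    # boolean per-pixel change mask
print(mask.dtype, mask.shape)
```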
### Step 7: Stitch Patches Back (if tiled)
```
4 patches of 256x256 --> stitch back to 512x512
Crop to original 300x400
```

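A stitching sketch matching the non-overlapping tiler above (row-major tile order and the function name are assumptions):

```python
import numpy as np

def stitch(tiles, grid_h: int, grid_w: int, orig_h: int, orig_w: int) -> np.ndarray:
    """Reassemble a row-major list of equal square tiles, then crop off padding."""
    patch = tiles[0].shape[0]
    out = np.zeros((grid_h * patch, grid_w * patch), dtype=tiles[0].dtype)
    for i, t in enumerate(tiles):
        r, c = divmod(i, grid_w)  # row-major placement
        out[r * patch : (r + 1) * patch, c * patch : (c + 1) * patch] = t
    return out[:orig_h, :orig_w]  # drop the reflection padding

tiles = [np.full((256, 256), i, dtype=np.uint8) for i in range(4)]
mask = stitch(tiles, 2, 2, 300, 400)
print(mask.shape)  # (300, 400)
```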
### Step 8: Create Outputs
```
Change mask: binary image (white = change, black = no change)
Overlay:     after image with red tint on changed pixels
Statistics:  "5.3% of area changed, 6,360 pixels out of 120,000"
```

### Total Time
- CPU: ~2-5 seconds per 256x256 patch
- GPU (T4): ~0.1 seconds per 256x256 patch

---

## 8. How To Explain This To Faculty

### If asked "Explain the model architecture"

> "All three models follow the same pattern: a shared-weight Siamese encoder processes both the before and after images identically. We compute the absolute difference between features at each scale — large differences indicate change. A decoder then upsamples this difference back to full resolution to produce a pixel-level change mask.
>
> The difference is in the encoder and decoder:
> - Siamese CNN uses ResNet18 and simple transposed convolutions — fast but loses spatial detail
> - UNet++ uses ResNet34 with nested skip connections — preserves detail at every scale
> - ChangeFormer uses a hierarchical transformer with self-attention — captures global context across the entire image"

### If asked "What fine-tuning did you do?"

> "We used ImageNet-pretrained ResNet backbones for the encoder. ImageNet teaches the model to recognize edges, textures, and shapes — these visual primitives are universal. We then fine-tuned ALL layers end-to-end on our satellite change detection dataset. The early layers (edge detection) barely changed. The later layers were substantially updated to understand satellite-specific patterns like building footprints and road textures. The decoder was trained entirely from scratch since it's specific to change detection."

### If asked "Why UNet++ for satellite when it's a medical model?"

> "UNet++ solves pixel-level binary classification with class imbalance and multi-scale features. That's exactly what change detection needs — most pixels are unchanged (like most brain pixels are non-tumor), and changes happen at multiple scales (small buildings to large developments). The architecture is task-agnostic — it doesn't know if it's looking at brains or buildings. We adapted it by adding a shared Siamese encoder and computing feature differences, making it bitemporal."

### If asked "What's your contribution vs just using existing models?"

> "Three things: First, we built the change detection adaptation — Siamese encoding, feature differencing, the full ChangeFormer from scratch. Second, we created a unified comparison framework — same data, same metrics, same training for all three models, which most papers don't do. Third, we built a production pipeline — from data preprocessing to a deployed web app with tiled inference for any image size. The finding that UNet++ outperforms the transformer on this task and patch size is itself a contribution — it challenges the assumption that newer architectures are always better."
README.md CHANGED

```diff
@@ -1,3 +1,15 @@
+---
+title: Military Base Change Detection
+emoji: satellite
+colorFrom: blue
+colorTo: red
+sdk: gradio
+sdk_version: 4.44.1
+app_file: app.py
+pinned: false
+python_version: 3.10
+---
+
```
app.py CHANGED

```diff
@@ -10,6 +10,7 @@ Usage:
 """
 
 import logging
+import os
 from pathlib import Path
 from typing import Any, Dict, List, Optional, Tuple
 
@@ -17,6 +18,7 @@ import gradio as gr
 import numpy as np
 import torch
 import yaml
+from huggingface_hub import hf_hub_download
 
 from data.dataset import IMAGENET_MEAN, IMAGENET_STD
 from inference import sliding_window_inference
@@ -34,10 +36,13 @@ _cached_model: Optional[torch.nn.Module] = None
 _cached_model_key: Optional[str] = None
 _device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
 _config: Optional[Dict[str, Any]] = None
+_hf_model_repo_id: Optional[str] = os.getenv("HF_MODEL_REPO")
+_hf_model_revision: Optional[str] = os.getenv("HF_MODEL_REVISION")
 
 # Search these directories for checkpoint files
 _CHECKPOINT_SEARCH_DIRS = [
     Path("checkpoints"),
+    Path("/home/user/app/checkpoints"),
     Path("/kaggle/working/checkpoints"),
     Path("/content/drive/MyDrive/change-detection/checkpoints"),
 ]
@@ -50,6 +55,40 @@ _MODEL_CHECKPOINT_NAMES = {
 }
 
 
+def _download_checkpoint_from_hf(model_name: str) -> Optional[Path]:
+    """Download checkpoint from Hugging Face Hub if configured.
+
+    Uses env var ``HF_MODEL_REPO`` as the source model repository and
+    downloads to ``./checkpoints`` cache.
+
+    Args:
+        model_name: One of the supported model keys.
+
+    Returns:
+        Local path to downloaded checkpoint, or ``None`` if unavailable.
+    """
+    if not _hf_model_repo_id:
+        return None
+
+    filename = _MODEL_CHECKPOINT_NAMES.get(model_name)
+    if filename is None:
+        return None
+
+    try:
+        local_path = hf_hub_download(
+            repo_id=_hf_model_repo_id,
+            filename=filename,
+            revision=_hf_model_revision,
+            local_dir="checkpoints",
+            local_dir_use_symlinks=False,
+        )
+        logger.info("Downloaded %s from %s", filename, _hf_model_repo_id)
+        return Path(local_path)
+    except Exception as exc:  # pragma: no cover - best-effort fallback
+        logger.warning("Could not download %s from HF Hub: %s", filename, exc)
+        return None
+
+
 # ---------------------------------------------------------------------------
 # Config / model loading
 # ---------------------------------------------------------------------------
@@ -88,6 +127,10 @@ def _find_checkpoint(model_name: str) -> Optional[Path]:
         if candidate.exists():
             return candidate
 
+    downloaded = _download_checkpoint_from_hf(model_name)
+    if downloaded is not None and downloaded.exists():
+        return downloaded
+
     return None
 
 
@@ -353,9 +396,11 @@ def main() -> None:
     gradio_cfg = config.get("gradio", {})
 
     demo = build_demo()
+    in_hf_space = os.getenv("SPACE_ID") is not None
     demo.launch(
+        server_name="0.0.0.0" if in_hf_space else "127.0.0.1",
         server_port=gradio_cfg.get("server_port", 7860),
-        share=gradio_cfg.get("share", False),
+        share=False if in_hf_space else gradio_cfg.get("share", False),
     )
```
requirements.txt CHANGED

```diff
@@ -14,3 +14,4 @@ tqdm>=4.66.0
 tensorboard>=2.15.0
 gradio>=4.14.0
 gdown>=5.1.0
+huggingface_hub>=0.23.0
```