Vedant Jigarbhai Mehta committed
Commit 1eb8817 · 1 Parent(s): 4f856a3

Deploy to HF Spaces

Browse files
Files changed (7)
  1. .gitattributes +1 -0
  2. DEPLOY_HF_SPACES.md +77 -0
  3. EXPLANATION.md +0 -695
  4. MODELS_EXPLAINED.md +0 -573
  5. README.md +12 -0
  6. app.py +46 -1
  7. requirements.txt +1 -0
.gitattributes ADDED
@@ -0,0 +1 @@
+ *.pth filter=lfs diff=lfs merge=lfs -text
DEPLOY_HF_SPACES.md ADDED
@@ -0,0 +1,77 @@
+ # Deploy to Hugging Face Spaces (Gradio)
+
+ This project is now ready for Hugging Face Spaces.
+
+ ## Option A (recommended): single Space repo with checkpoints
+
+ Use this when you want the simplest deployment.
+
+ 1. Create a new Hugging Face Space:
+    - SDK: Gradio
+    - Hardware: CPU Basic to start; upgrade to a GPU for faster inference
+
+ 2. Push this project to that Space repo.
+
+ 3. Ensure these files are present at the Space repo root:
+    - app.py
+    - requirements.txt
+    - configs/config.yaml
+    - models/
+    - data/
+    - utils/
+    - checkpoints/changeformer_best.pth (or your preferred model)
+
+ 4. In Space Settings, set the startup file to `app.py` (the default for Gradio Spaces).
+
+ 5. Optional: reduce the initial footprint by keeping only one checkpoint (for example `changeformer_best.pth`) inside `checkpoints/`.
+
+ ## Option B: Space app + separate model repo
+
+ Use this when you want a smaller Space repo and prefer to host large checkpoints elsewhere.
+
+ 1. Upload the checkpoint files to a separate Hugging Face model repo.
+
+ 2. In your Space Settings -> Variables, set:
+    - `HF_MODEL_REPO`: owner/repo-name
+    - `HF_MODEL_REVISION`: optional branch/tag/commit (for reproducible deployments)
+
+ 3. On startup, `app.py` will auto-download the expected checkpoint files into `checkpoints/`.
+
+ Expected checkpoint names:
+ - siamese_cnn_best.pth
+ - unet_pp_best.pth
+ - changeformer_best.pth
+
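The startup download in step 3 can be sketched as below. The commit view does not show `app.py` itself, so the helper names (`missing_checkpoints`, `fetch_checkpoints`) are hypothetical; only the `HF_MODEL_REPO`/`HF_MODEL_REVISION` variables and the three checkpoint filenames come from this document.

```python
import os

EXPECTED = ["siamese_cnn_best.pth", "unet_pp_best.pth", "changeformer_best.pth"]

def missing_checkpoints(ckpt_dir="checkpoints", expected=EXPECTED):
    """Return the expected checkpoint files not yet present locally."""
    return [f for f in expected if not os.path.exists(os.path.join(ckpt_dir, f))]

def fetch_checkpoints(ckpt_dir="checkpoints"):
    """Download any missing checkpoints from the repo named in HF_MODEL_REPO."""
    repo = os.environ.get("HF_MODEL_REPO")
    if not repo:
        return  # nothing configured: rely on local checkpoint files
    # deferred import: huggingface_hub is only needed when a remote repo is set
    from huggingface_hub import hf_hub_download
    for name in missing_checkpoints(ckpt_dir):
        hf_hub_download(
            repo_id=repo,
            filename=name,
            revision=os.environ.get("HF_MODEL_REVISION"),  # None -> default branch
            local_dir=ckpt_dir,
        )
```

Keeping the `huggingface_hub` import inside the function means a purely local run never touches the network.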
+ ## Space README metadata (required in Space repo)
+
+ In the Space repository README.md, include this at the top:
+
+ ```yaml
+ ---
+ title: Military Base Change Detection
+ emoji: satellite
+ colorFrom: blue
+ colorTo: red
+ sdk: gradio
+ sdk_version: 4.44.1
+ app_file: app.py
+ pinned: false
+ python_version: 3.10
+ ---
+ ```
+
+ ## Notes
+
+ - CPU hardware works, but inference can be slow for larger images.
+ - For better latency, choose a GPU Space.
+ - `app.py` now detects Spaces automatically and binds to `0.0.0.0`.
+ - If no local checkpoints are found, it will try `HF_MODEL_REPO`.
+
+ ## Quick local validation before push
+
+ ```bash
+ pip install -r requirements.txt
+ python app.py
+ ```
+
+ Then open the local Gradio URL and test one sample pair.
EXPLANATION.md DELETED
@@ -1,695 +0,0 @@
- # Military Base Change Detection — Complete Project Explanation
-
- ## Table of Contents
-
- 1. [What Is This Project?](#1-what-is-this-project)
- 2. [Why Did We Build This?](#2-why-did-we-build-this)
- 3. [What Problem Are We Solving?](#3-what-problem-are-we-solving)
- 4. [What Dataset Did We Use and Why?](#4-what-dataset-did-we-use-and-why)
- 5. [What Are The Three Models and Why These Three?](#5-what-are-the-three-models-and-why-these-three)
- 6. [How Does Each Model Work Internally?](#6-how-does-each-model-work-internally)
- 7. [How Is The Training Pipeline Designed?](#7-how-is-the-training-pipeline-designed)
- 8. [What Loss Functions Did We Use and Why?](#8-what-loss-functions-did-we-use-and-why)
- 9. [How Do We Evaluate The Models?](#9-how-do-we-evaluate-the-models)
- 10. [What Are Our Results?](#10-what-are-our-results)
- 11. [How Does The Inference Pipeline Work?](#11-how-does-the-inference-pipeline-work)
- 12. [How Does The Web Application Work?](#12-how-does-the-web-application-work)
- 13. [What Tools and Technologies Did We Use?](#13-what-tools-and-technologies-did-we-use)
- 14. [What Is Our Innovation / Contribution?](#14-what-is-our-innovation--contribution)
- 15. [What Are The Limitations?](#15-what-are-the-limitations)
- 16. [Future Work](#16-future-work)
- 17. [How To Present This Project](#17-how-to-present-this-project)
-
- ---
-
- ## 1. What Is This Project?
-
- This is a **deep learning-based satellite image change detection system** designed for defense and military applications. Given two satellite images of the same geographic location taken at different times (a "before" image and an "after" image), the system automatically identifies **where new construction has occurred** — new buildings, runways, infrastructure, or any structural changes.
-
- The system works like this:
-
- ```
- Before Image (2015) + After Image (2020) --> Change Mask (highlights new construction)
-   [empty land]        [buildings appeared]     [white pixels = new structures]
- ```
-
- We implemented and compared **three different deep learning architectures** — ranging from a simple CNN baseline to a state-of-the-art vision transformer — to understand which approach works best for this task.
-
- ---
-
- ## 2. Why Did We Build This?
-
- ### The Defense Motivation
-
- Modern military intelligence relies heavily on satellite imagery. Analysts need to monitor:
-
- - **Enemy military base expansion** — Are new barracks, hangars, or command centers being built?
- - **Runway construction** — Is a new airfield being developed?
- - **Infrastructure development** — Are roads, supply depots, or communication towers appearing?
- - **Border fortification** — Are defensive structures being erected?
-
- Manually comparing satellite images is **slow, error-prone, and doesn't scale**. A single analyst might need to compare hundreds of image pairs daily. An AI system can do this in seconds with higher accuracy.
-
- ### The Deep Learning Motivation
-
- This project demonstrates core deep learning concepts:
-
- - **Transfer learning** — Using ImageNet-pretrained backbones on satellite imagery
- - **Siamese architectures** — Processing two inputs through a shared encoder
- - **Architecture comparison** — CNN vs UNet++ vs Transformer on the same task
- - **Binary segmentation** — Pixel-level classification (changed vs unchanged)
- - **End-to-end deployment** — From training to a working web application
-
- ---
-
- ## 3. What Problem Are We Solving?
-
- ### Problem Statement
-
- **Binary Change Detection in Remote Sensing Images**: Given a pair of co-registered satellite images of the same area captured at two different times, classify each pixel as either "changed" or "unchanged".
-
- ### Why Is This Hard?
-
- 1. **Class imbalance** — In most image pairs, 95-99% of pixels are "no change". Only tiny regions contain actual construction. The model must not simply predict "no change" everywhere.
-
- 2. **Irrelevant changes** — Lighting differences, seasonal vegetation changes, cloud shadows, and camera angle variations are NOT actual changes. The model must learn to ignore these.
-
- 3. **Scale variation** — Changes can be as small as a single house or as large as an entire housing development. The model needs multi-scale understanding.
-
- 4. **Semantic understanding** — The model should detect "empty land became a building" (structural change), not "grass turned brown" (seasonal change).
-
- ### Formal Definition
-
- ```
- Input:  Image A (before) — shape [3, 256, 256] — RGB satellite patch
-         Image B (after)  — shape [3, 256, 256] — RGB satellite patch
-
- Output: Mask M — shape [1, 256, 256] — binary (0 = no change, 1 = change)
- ```
-
- ---
-
- ## 4. What Dataset Did We Use and Why?
-
- ### LEVIR-CD (Large-scale VHR Image Change Detection)
-
- We chose LEVIR-CD because it is the **most widely used benchmark** for building change detection in remote sensing. It provides:
-
- - **637 image pairs** at 1024x1024 resolution (0.5m/pixel from Google Earth)
- - **20 different regions** in Texas, USA (Austin, Lakeway, Bee Cave, etc.)
- - **Time span**: 2002 to 2018 (5-14 years between image pairs)
- - **31,333 annotated building change instances**
- - Images annotated by experts and double-checked for quality
-
- ### Data Preprocessing
-
- The raw 1024x1024 images are too large for direct model input. We cropped them into **non-overlapping 256x256 patches**:
-
- ```
- 1 image (1024x1024) --> 16 patches (256x256 each)
-
- Total patches:
-   Train: 445 images x 16 = 7,120 patch triplets
-   Val:    64 images x 16 = 1,024 patch triplets
-   Test:  128 images x 16 = 2,048 patch triplets
-   Total: 10,192 patch triplets
- ```
-
- Each patch triplet consists of:
- - `A/` — Before image (256x256 RGB)
- - `B/` — After image (256x256 RGB)
- - `label/` — Binary change mask (256x256, 0=unchanged, 255=changed)
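The 1024x1024-to-16-patch cropping described above can be sketched with NumPy. This is an illustrative sketch, not the project's actual preprocessing script (which is not shown in this commit); the function name `crop_patches` is hypothetical.

```python
import numpy as np

def crop_patches(img, patch=256):
    """Split an HxWxC image into non-overlapping patch x patch tiles (row-major order)."""
    h, w = img.shape[:2]
    # LEVIR-CD images are 1024x1024, an exact multiple of 256, so no padding is needed here
    assert h % patch == 0 and w % patch == 0
    return [
        img[y:y + patch, x:x + patch]
        for y in range(0, h, patch)
        for x in range(0, w, patch)
    ]

# a 1024x1024 image yields the 16 patches quoted above
tiles = crop_patches(np.zeros((1024, 1024, 3), dtype=np.uint8))
```

The same function is applied to `A/`, `B/`, and `label/` so the three tilings stay aligned.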
-
- ### Why Not Military-Specific Data?
-
- Real military satellite imagery is classified and not publicly available. However, **building construction is structurally identical whether it's a civilian house or a military barracks**. A hangar looks like a warehouse. A runway looks like a road. The model learns to detect structural changes from any satellite imagery — the application to military monitoring is in WHERE you point the trained model, not what you train it on. This is a standard approach in defense AI research.
-
- ### Data Augmentation
-
- We apply synchronized augmentations to both images AND the mask during training (using albumentations ReplayCompose):
-
- - **Horizontal flip** (p=0.5)
- - **Vertical flip** (p=0.5)
- - **Random 90-degree rotation** (p=0.5)
- - **ImageNet normalization** (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
-
- No augmentation on validation/test sets — only normalization.
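ReplayCompose works by recording the parameters sampled for one image and replaying them on the others. A dependency-free sketch of the same synchronization idea (the function `synced_augment` is illustrative, not the project's actual transform code):

```python
import numpy as np

def synced_augment(img_a, img_b, mask, rng):
    """Sample flip/rotation parameters once, then apply them identically to A, B, and the mask."""
    ops = {
        "hflip": rng.random() < 0.5,
        "vflip": rng.random() < 0.5,
        "rot90": int(rng.integers(0, 4)),  # number of 90-degree turns
    }
    def apply(x):
        if ops["hflip"]:
            x = x[:, ::-1]
        if ops["vflip"]:
            x = x[::-1, :]
        return np.rot90(x, ops["rot90"]).copy()
    # same sampled parameters for all three arrays, so labels stay pixel-aligned
    return apply(img_a), apply(img_b), apply(mask)
```

If the parameters were resampled per array, the mask would no longer line up with the images, which is exactly the bug ReplayCompose prevents.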
-
- ---
-
- ## 5. What Are The Three Models and Why These Three?
-
- We chose three architectures that represent **three generations of deep learning for dense prediction tasks**:
-
- | Model | Year | Architecture Type | Role in Our Study |
- |---|---|---|---|
- | Siamese CNN | ~2018 | Convolutional Neural Network | Baseline |
- | UNet++ | 2018 | Nested U-Net (encoder-decoder) | Mid-tier |
- | ChangeFormer | 2022 | Vision Transformer | State-of-the-art |
-
- ### Why These Specific Three?
-
- 1. **Siamese CNN** — The simplest approach. Shows what a basic CNN can achieve. Serves as a performance floor — if this already works well, maybe we don't need complex models.
-
- 2. **UNet++** — Represents the best of CNN-based segmentation. Its nested skip connections capture multi-scale features. Widely used in medical imaging and remote sensing. Shows what careful architecture design can achieve without transformers.
-
- 3. **ChangeFormer** — Represents the latest transformer-based approach. Uses self-attention to capture global context (one building being built might relate to another across the image). Shows whether the complexity of transformers is justified for this task.
-
- ### The Common Interface
-
- All three models share the same input/output contract:
-
- ```python
- def forward(self, x1: Tensor, x2: Tensor) -> Tensor:
-     """
-     x1: before image [Batch, 3, 256, 256]
-     x2: after image  [Batch, 3, 256, 256]
-     returns: logits  [Batch, 1, 256, 256] (raw, before sigmoid)
-     """
- ```
-
- This means we can swap models freely without changing any other code.
-
- ---
-
- ## 6. How Does Each Model Work Internally?
-
- ### Model 1: Siamese CNN (Baseline)
-
- **Architecture**: Shared-weight ResNet18 encoder + Transposed Convolution decoder
-
- ```
- Before Image --> [ResNet18 Encoder] --> Features_A (512 channels, 8x8)
-                        | (shared weights)
- After Image  --> [ResNet18 Encoder] --> Features_B (512 channels, 8x8)
-
- Difference = |Features_A - Features_B|   (absolute difference)
-
- Difference --> [TransposedConv Decoder] --> Change Mask (1 channel, 256x256)
- ```
-
- **How it works**:
- 1. Both images pass through the SAME ResNet18 encoder (shared weights = Siamese)
- 2. ResNet18 reduces 256x256x3 to 8x8x512 feature maps
- 3. We compute the absolute difference between the two feature maps
- 4. A decoder with transposed convolutions upsamples back to 256x256
- 5. Output is a single-channel logit map (apply sigmoid for probabilities)
-
- **Why shared weights?** If the encoder weights are shared, both images are processed identically. Any difference in the output features is due to actual image content differences, not different processing.
-
- **Parameters**: ~14M
- **Strength**: Simple, fast, easy to understand
- **Weakness**: No skip connections, loses fine spatial detail during encoding
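The five steps above can be condensed into a few lines of PyTorch. This is a toy sketch: a two-layer conv stack stands in for the ResNet18 encoder, and the spatial sizes are scaled down, but the shape of the computation (shared encoder, absolute difference, transposed-conv decoder, raw logits out) matches the description.

```python
import torch
import torch.nn as nn

class TinySiameseCD(nn.Module):
    """Toy Siamese change detector: shared encoder, |f1 - f2|, transposed-conv decoder."""
    def __init__(self):
        super().__init__()
        # stand-in for ResNet18: two stride-2 convs (downsample x4 instead of x32)
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        # transposed convs upsample the difference map back to input resolution
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 2, stride=2),
        )
    def forward(self, x1, x2):
        f1, f2 = self.encoder(x1), self.encoder(x2)  # same weights for both inputs
        return self.decoder(torch.abs(f1 - f2))      # single-channel logits
```

Because `self.encoder` is a single module applied twice, the "shared weights" property falls out for free: there is only one set of encoder parameters to train.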
-
- ### Model 2: UNet++ (Mid-tier)
-
- **Architecture**: Shared ResNet34 encoder + Nested UNet++ decoder with dense skip connections
-
- ```
- Before Image --> [ResNet34 Encoder] --> Multi-scale Features_A
-                        | (shared weights)        |
- After Image  --> [ResNet34 Encoder] --> Multi-scale Features_B
-                                                  |
-                    |Features_A[i] - Features_B[i]| at each scale
-                                                  |
-                                          [UNet++ Decoder]
-                                    (nested skip connections)
-                                                  |
-                                        Change Mask (256x256)
- ```
-
- **How it works**:
- 1. ResNet34 encoder extracts features at 5 different scales (from 256x256 down to 8x8)
- 2. At each scale, we compute the absolute difference between A and B features
- 3. The UNet++ decoder uses **nested skip connections** — unlike regular UNet which has direct connections, UNet++ has intermediate dense blocks that process features before passing them across
- 4. This captures both fine details (small buildings) and coarse context (large developments)
-
- **Why UNet++?** Standard UNet has a semantic gap between encoder and decoder features. UNet++ bridges this gap with intermediate convolution blocks, producing more refined predictions.
-
- **Parameters**: ~26M
- **Strength**: Excellent multi-scale feature fusion, captures small changes
- **Weakness**: More memory intensive than Siamese CNN
-
- ### Model 3: ChangeFormer (State-of-the-art)
-
- **Architecture**: Shared MiT-B1 Transformer encoder + MLP decoder
-
- ```
- Before Image --> [MiT-B1 Transformer Encoder] --> Hierarchical Features_A
-                        | (shared weights)                 |
- After Image  --> [MiT-B1 Transformer Encoder] --> Hierarchical Features_B
-                                                          |
-                            |Features_A[i] - Features_B[i]| at 4 stages
-                                                          |
-                                                   [MLP Decoder]
-                                          (multi-scale feature fusion)
-                                                          |
-                                             Change Mask (256x256)
- ```
-
- **The MiT (Mix Transformer) Encoder** has 4 hierarchical stages:
-
- | Stage | Resolution | Channels | Attention Heads | Spatial Reduction |
- |---|---|---|---|---|
- | 1 | 64x64 | 64 | 1 | 8x |
- | 2 | 32x32 | 128 | 2 | 4x |
- | 3 | 16x16 | 320 | 5 | 2x |
- | 4 | 8x8 | 512 | 8 | 1x |
-
- **Key components we implemented from scratch** (~350 lines of custom code):
-
- 1. **Overlapping Patch Embedding** — Unlike ViT which uses non-overlapping patches, MiT uses overlapping convolutions to preserve local continuity.
-
- 2. **Efficient Self-Attention** — Standard self-attention is O(N^2). We use spatial reduction: reduce the key/value spatial dimensions before attention, making it computationally feasible for high-resolution images.
-
- 3. **Mix-FFN (Feed Forward Network)** — Instead of standard MLP, uses a depthwise 3x3 convolution inside the FFN to inject positional information without explicit position embeddings.
-
- 4. **MLP Decoder** — Projects all 4 scale features to the same dimension, upsamples to full resolution, concatenates, and predicts the change mask.
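The spatial-reduction trick behind component 2 can be sketched as below. This is an illustrative sketch, not the project's ~350-line implementation: it borrows `nn.MultiheadAttention` rather than hand-rolling attention, and the class name `SRAttention` is hypothetical. The point is that only the keys/values are downsampled, so the output keeps one token per query position.

```python
import torch
import torch.nn as nn

class SRAttention(nn.Module):
    """Self-attention with spatially reduced keys/values: O(N * N/sr^2) instead of O(N^2)."""
    def __init__(self, dim, heads, sr):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # strided conv shrinks the K/V token grid by sr in each spatial dimension
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr, stride=sr) if sr > 1 else nn.Identity()

    def forward(self, x, h, w):                       # x: [B, h*w, dim]
        b, n, c = x.shape
        kv = x.transpose(1, 2).reshape(b, c, h, w)    # back to a spatial grid
        kv = self.sr(kv).flatten(2).transpose(1, 2)   # [B, (h/sr)*(w/sr), dim]
        out, _ = self.attn(x, kv, kv)                 # queries keep full resolution
        return out                                    # [B, h*w, dim]
```

With `sr=8` at stage 1 (per the table above), a 64x64 token grid attends over only an 8x8 grid of keys/values, which is what makes attention affordable at that resolution.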
-
- **Why Transformers for change detection?** Self-attention captures GLOBAL relationships. If a new housing development appears, the attention mechanism can relate the new buildings to nearby road construction — understanding the change holistically rather than pixel-by-pixel.
-
- **Parameters**: ~14M
- **Strength**: Global context via self-attention, best at understanding large-scale changes
- **Weakness**: Needs more training epochs, higher memory usage
-
- ---
-
- ## 7. How Is The Training Pipeline Designed?
-
- ### Overview
-
- ```
- Config (YAML) --> Data Loading --> Model --> Loss --> Optimizer --> Training Loop
-                                                                         |
-                                                         Checkpointing (Drive)
-                                                         TensorBoard Logging
-                                                         Early Stopping
-                                                         Resume Support
- ```
-
- ### Key Training Features
-
- **1. Mixed Precision Training (AMP)**
- We use PyTorch's Automatic Mixed Precision. The forward pass runs selected ops in FP16 (half precision) for speed, while loss scaling and FP32 master weights keep the backward pass numerically stable. This roughly doubles training speed and halves memory usage.
-
- ```python
- with autocast():                      # forward in FP16
-     logits = model(img_a, img_b)
-     loss = criterion(logits, mask)
- scaler.scale(loss).backward()         # backward with loss scaling
- scaler.step(optimizer)                # unscale gradients + optimizer step
- scaler.update()                       # adjust the loss scale for the next step
- optimizer.zero_grad(set_to_none=True)
- ```
-
- **2. Gradient Accumulation**
- For memory-heavy models (ChangeFormer), we accumulate gradients over multiple mini-batches before updating weights. This simulates a larger effective batch size without needing more GPU memory.
-
- ```
- Effective batch size = actual batch size x accumulation steps
- ChangeFormer on T4: batch=4 x accum=2 = effective batch of 8
- ```
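The accumulation loop can be sketched as follows. This is a minimal illustration, not the project's trainer: `TinyCD` is a throwaway stand-in model, and AMP/clipping are omitted to isolate the accumulation logic. Dividing each micro-batch loss by `accum_steps` makes the summed gradients equal the gradient of the averaged big-batch loss.

```python
import torch
import torch.nn as nn

class TinyCD(nn.Module):
    """Minimal change-detection stand-in: 1x1 conv over |A - B|."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Conv2d(3, 1, kernel_size=1)
    def forward(self, x1, x2):
        return self.proj((x1 - x2).abs())  # logits

def train_step_accumulated(model, micro_batches, optimizer, accum_steps=2):
    """One optimizer update per accum_steps micro-batches (effective batch = their sum)."""
    criterion = nn.BCEWithLogitsLoss()
    optimizer.zero_grad(set_to_none=True)
    for i, (xa, xb, y) in enumerate(micro_batches, start=1):
        loss = criterion(model(xa, xb), y) / accum_steps  # rescale so grads average
        loss.backward()                                   # grads accumulate in .grad
        if i % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
```

On a T4, this is how `batch=4 x accum=2` behaves like a batch of 8 without the extra memory.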
-
- **3. Gradient Clipping**
- We clip gradient norms to max_norm=1.0 to prevent training instability from exploding gradients, which is especially important for transformer models.
-
- **4. Learning Rate Schedule: Warmup + Cosine Annealing**
- - First 5 epochs: linear warmup from 0.01x to 1x the base learning rate
- - Remaining epochs: cosine decay to near zero
-
- This prevents early training instability (warmup) and allows fine-grained convergence (cosine decay).
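The schedule can be written as a single function of the epoch, suitable for a PyTorch `LambdaLR`. A sketch under the parameters stated above (5 warmup epochs, 0.01x start factor); the function name is illustrative, not the project's actual scheduler code.

```python
import math

def warmup_cosine_lr(epoch, base_lr=1e-4, warmup=5, total=100, start_factor=0.01):
    """Linear warmup from start_factor*base_lr to base_lr, then cosine decay to ~0."""
    if epoch < warmup:
        t = epoch / warmup
        return base_lr * (start_factor + (1 - start_factor) * t)
    t = (epoch - warmup) / max(1, total - warmup)   # 0 at end of warmup, 1 at last epoch
    return base_lr * 0.5 * (1 + math.cos(math.pi * t))
```

Epoch 0 starts at 1% of the base rate, epoch 5 hits the base rate exactly, and the final epoch lands at (near) zero.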
-
- **5. Early Stopping**
- We monitor validation F1 score. If it doesn't improve for 15 consecutive epochs, training stops. This prevents overfitting and saves compute time.
-
- **6. Checkpoint Resume**
- Because cloud GPU sessions (Colab/Kaggle) can disconnect, we save TWO checkpoints every epoch:
- - `model_best.pth` — Best validation F1 so far
- - `model_last.pth` — Latest epoch (for resume)
-
- Each checkpoint contains: model weights, optimizer state, scheduler state, GradScaler state, epoch number, and best F1. This allows perfect resume — training continues exactly where it stopped.
-
- **7. Auto GPU Detection**
- The config contains per-model batch sizes for different GPUs:
-
- | Model | T4 (16GB) | V100 (16GB) | Default |
- |---|---|---|---|
- | Siamese CNN | 16 | 16 | 8 |
- | UNet++ | 8 | 12 | 4 |
- | ChangeFormer | 4 | 6 | 2 |
-
- The training script reads `torch.cuda.get_device_name()` and automatically selects the right batch size.
-
- ### Optimizer Choice: AdamW
-
- We use AdamW (Adam with decoupled weight decay) because:
- - Adam's adaptive learning rates work well for both CNNs and transformers
- - Decoupled weight decay regularizes more predictably than L2 regularization inside Adam
- - It's the standard optimizer for transformer training
-
- ### Per-Model Hyperparameters
-
- | Hyperparameter | Siamese CNN | UNet++ | ChangeFormer |
- |---|---|---|---|
- | Learning Rate | 1e-3 | 1e-4 | 6e-5 |
- | Epochs | 100 | 100 | 200 |
- | Batch Size (T4) | 16 | 8 | 4 |
-
- ChangeFormer gets a lower learning rate and more epochs because transformers need slower, longer training to converge.
-
- ---
-
- ## 8. What Loss Functions Did We Use and Why?
-
- ### The Class Imbalance Problem
-
- In change detection, ~97% of pixels are "no change" and only ~3% are "change". If the model predicts "no change" for every pixel, it gets 97% accuracy but is completely useless. We need loss functions that handle this imbalance.
-
- ### BCEDiceLoss (Default)
-
- We combine two losses:
-
- **Binary Cross-Entropy (BCE)**:
- ```
- BCE = -[y * log(p) + (1-y) * log(1-p)]
- ```
- - Standard pixel-wise classification loss
- - Treats each pixel independently
- - Applied on raw logits (numerically stable)
-
- **Dice Loss**:
- ```
- Dice = 1 - (2 * |P intersection G| + smooth) / (|P| + |G| + smooth)
- ```
- - Measures overlap between predicted and ground truth change regions
- - Directly optimizes for the F1 metric
- - Less sensitive to class imbalance because it looks at the ratio of overlap, not individual pixels
-
- **Combined**:
- ```
- Loss = 0.5 * BCE + 0.5 * Dice
- ```
-
- BCE provides stable gradients for learning, Dice pushes toward better F1 scores.
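The three formulas above combine into a few lines of PyTorch. A sketch of the described loss, not the project's exact `BCEDiceLoss` class; the `smooth` default is an assumption.

```python
import torch
import torch.nn.functional as F

def bce_dice_loss(logits, target, smooth=1.0, bce_weight=0.5):
    """0.5 * BCE + 0.5 * Dice, computed on raw logits for numerical stability."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum()                         # |P intersection G| (soft)
    dice = 1 - (2 * inter + smooth) / (probs.sum() + target.sum() + smooth)
    return bce_weight * bce + (1 - bce_weight) * dice
```

Note the Dice term is computed over the whole batch of pixels at once, which is what makes it insensitive to the 97/3 class ratio: a trivial all-"no change" prediction has zero intersection and therefore a high Dice loss.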
-
- ### FocalLoss (Alternative)
-
- ```
- Focal = -alpha * (1 - p_t)^gamma * log(p_t)
- ```
-
- - Down-weights easy pixels (clearly "no change")
- - Focuses training on hard pixels near decision boundaries
- - alpha=0.25, gamma=2.0
-
- We provide both in config — BCEDiceLoss is the default because it produced better results empirically.
-
- ---
-
- ## 9. How Do We Evaluate The Models?
-
- ### Metrics
-
- All metrics are computed at **threshold=0.5** on the binary change mask:
-
- **F1-Score (Primary Metric)**:
- ```
- F1 = 2 * (Precision * Recall) / (Precision + Recall)
- ```
- The harmonic mean of precision and recall. Our primary metric for model selection and early stopping. Balances between detecting all changes (recall) and avoiding false alarms (precision).
-
- **IoU (Intersection over Union / Jaccard Index)**:
- ```
- IoU = TP / (TP + FP + FN)
- ```
- Measures overlap between predicted and true change masks. More stringent than F1 — penalizes both missed changes and false alarms.
-
- **Precision**:
- ```
- Precision = TP / (TP + FP)
- ```
- "Of all pixels the model predicted as changed, how many actually changed?" High precision = few false alarms.
-
- **Recall**:
- ```
- Recall = TP / (TP + FN)
- ```
- "Of all pixels that actually changed, how many did the model detect?" High recall = few missed changes.
-
- **Overall Accuracy (OA)**:
- ```
- OA = (TP + TN) / (TP + TN + FP + FN)
- ```
- Simple pixel accuracy. Always high (>96%) due to class imbalance — NOT a reliable metric alone.
-
- ### MetricTracker Implementation
-
- We built a `MetricTracker` class that:
- 1. Takes raw model logits (no manual sigmoid needed)
- 2. Applies sigmoid + threshold internally
- 3. Accumulates TP/FP/FN/TN across batches on GPU
- 4. Only moves scalar results to CPU for final computation
- 5. Returns all 5 metrics as a dictionary
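The tracker's shape can be sketched as below. This is an illustrative sketch, not the project's `MetricTracker` source: for brevity the counts here are moved to Python scalars per batch, whereas the real class (per point 3) keeps the running counts as GPU tensors and only transfers scalars at the end.

```python
import torch

class MetricTracker:
    """Accumulate a confusion matrix over batches, then derive the 5 metrics."""
    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.tp = self.fp = self.fn = self.tn = 0

    def update(self, logits, target):
        # sigmoid + threshold applied internally, so callers pass raw logits
        pred = (torch.sigmoid(logits) > self.threshold).float()
        self.tp += int(((pred == 1) & (target == 1)).sum())
        self.fp += int(((pred == 1) & (target == 0)).sum())
        self.fn += int(((pred == 0) & (target == 1)).sum())
        self.tn += int(((pred == 0) & (target == 0)).sum())

    def compute(self, eps=1e-8):
        p = self.tp / (self.tp + self.fp + eps)
        r = self.tp / (self.tp + self.fn + eps)
        return {
            "precision": p,
            "recall": r,
            "f1": 2 * p * r / (p + r + eps),
            "iou": self.tp / (self.tp + self.fp + self.fn + eps),
            "oa": (self.tp + self.tn) / (self.tp + self.tn + self.fp + self.fn + eps),
        }
```

Accumulating counts rather than per-batch metrics matters: averaging per-batch F1 scores is not the same as the F1 of the pooled confusion matrix.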
-
- ### Evaluation Outputs
-
- The evaluation script generates:
- - `results.json` — All metrics in machine-readable format
- - `prediction_grid.png` — 5 sample predictions (Before | After | Ground Truth | Prediction)
- - `predictions/` — 20 individual prediction plots
- - `overlays/` — Top 10 most interesting predictions (ranked by change area) with red overlay
-
- ---
-
- ## 10. What Are Our Results?
-
- ### Test Set Performance (LEVIR-CD, 2,048 patches)
-
- | Model | F1 | IoU | Precision | Recall | OA | Epochs Trained |
- |---|---|---|---|---|---|---|
- | Siamese CNN | 0.6441 | 0.4751 | 0.8084 | 0.5353 | 0.9699 | 3* |
- | **UNet++** | **0.9035** | **0.8240** | **0.9280** | **0.8803** | **0.9904** | 85 |
- | ChangeFormer | 0.8836 | 0.7915 | 0.8944 | 0.8731 | 0.9883 | 141 |
-
- *\*Siamese CNN was undertrained due to session interruption (3 epochs instead of 100). With full training it would likely reach F1 in the ~0.82-0.85 range.*
-
- ### Analysis
-
- 1. **UNet++ achieved the best F1 (0.9035)** — Its nested skip connections excel at capturing multi-scale building changes. It effectively bridges fine-grained spatial details with high-level semantic understanding.
-
- 2. **ChangeFormer achieved F1 0.8836** — Very competitive but slightly below UNet++. The transformer's global attention helps with large-scale changes, but the relatively small patch size (256x256) limits the advantage of global context.
-
- 3. **Siamese CNN (undertrained)** — With only 3 epochs, it shows the baseline capability. Its high precision (0.808) but low recall (0.535) means it's conservative — it catches changes it's confident about but misses many subtle ones.
-
- 4. **All models achieve >96% OA** — This highlights why overall accuracy alone is misleading for imbalanced problems. Even a model that predicts "no change" everywhere would get ~97% OA.
-
- ### Key Insight
-
- UNet++'s superior performance suggests that **multi-scale feature fusion with skip connections is more important than global self-attention** for this particular task and patch size. The nested decoder effectively captures both small buildings (low-level features) and large developments (high-level features).
-
- ---
-
- ## 11. How Does The Inference Pipeline Work?
-
- For real-world use, satellite images are much larger than 256x256. Our inference pipeline handles **any resolution** through sliding window (tiled) inference:
-
- ```
- Input Image (e.g., 1024x1024)
-   |
-   v
- Pad to nearest multiple of 256
-   |
-   v
- Tile into 256x256 non-overlapping patches
-   |
-   v
- Run model on each patch pair
-   |
-   v
- Stitch probability maps back together
-   |
-   v
- Crop to original size
-   |
-   v
- Apply threshold (0.5) --> Binary change mask
- ```
-
- This means the system works on images of any size — from 256x256 test patches to full 4000x4000 satellite imagery.
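The pad/tile/stitch/crop steps above can be sketched with NumPy. An illustrative sketch rather than the project's inference module: `predict_patch` stands in for a model call that maps one before/after patch pair to a per-pixel probability map.

```python
import numpy as np

def tiled_inference(img_a, img_b, predict_patch, patch=256, threshold=0.5):
    """Pad both images to a multiple of `patch`, predict per tile, stitch, crop, threshold."""
    h, w = img_a.shape[:2]
    ph, pw = -h % patch, -w % patch                 # padding up to the next multiple
    pad = ((0, ph), (0, pw)) + ((0, 0),) * (img_a.ndim - 2)
    a, b = np.pad(img_a, pad), np.pad(img_b, pad)
    prob = np.zeros(a.shape[:2], dtype=np.float32)  # stitched probability map
    for y in range(0, a.shape[0], patch):
        for x in range(0, a.shape[1], patch):
            prob[y:y + patch, x:x + patch] = predict_patch(
                a[y:y + patch, x:x + patch], b[y:y + patch, x:x + patch])
    return prob[:h, :w] >= threshold                # crop to original size, then binarize
```

Because padding and cropping are handled inside the loop's bookkeeping, the caller can pass any image size, which is exactly the "any resolution" property the pipeline claims.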
-
- ### Outputs
-
- 1. **Binary change mask** (PNG) — White pixels = change detected
- 2. **Overlay visualization** — After image with detected changes highlighted in red
- 3. **Change statistics** — Percentage of area changed, pixel counts
-
- ---
-
- ## 12. How Does The Web Application Work?
-
- We built an interactive web interface using **Gradio** that allows anyone to use the model without any coding knowledge:
-
- ### User Flow
-
- 1. Upload a "Before" satellite image
- 2. Upload an "After" satellite image
- 3. Select a model from the dropdown (auto-detects available checkpoints)
- 4. Adjust the detection threshold if needed (default 0.5)
- 5. Click "Detect Changes"
- 6. View results: change mask, red overlay, and statistics table
-
- ### Technical Details
-
- - **Auto-checkpoint detection** — The app scans multiple directories for checkpoint files and only shows models that have checkpoints available
- - **Model caching** — Once a model is loaded, it stays in memory for instant subsequent predictions
- - **CPU fallback** — Works without GPU (slower but functional)
- - **Any image size** — Uses the same tiled inference pipeline
- - **Public sharing** — Can generate a public URL for remote access
-
- ---
-
- ## 13. What Tools and Technologies Did We Use?
-
- ### Core Framework
-
- | Tool | Purpose | Why We Chose It |
- |---|---|---|
- | **PyTorch 2.x** | Deep learning framework | Industry standard, dynamic computation graph, excellent GPU support |
- | **Python 3.10+** | Programming language | De facto language for ML/DL |
-
- ### Model Libraries
-
- | Library | Purpose | Why |
- |---|---|---|
- | **torchvision** | ResNet18/34 pretrained backbones | Official PyTorch model zoo |
- | **segmentation-models-pytorch (SMP)** | UNet++ architecture | Best-maintained segmentation library, provides encoder-decoder framework |
- | **timm** | Transformer utilities | State-of-the-art vision model components |
- | **einops** | Tensor rearrangement | Clean, readable tensor reshaping for transformer code |
-
- ### Data Processing
-
- | Library | Purpose | Why |
- |---|---|---|
- | **albumentations** | Image augmentation | Fast, flexible, supports ReplayCompose for synchronized transforms |
- | **OpenCV** | Image I/O | Fast image reading/writing, supports multiple formats |
- | **NumPy** | Array operations | Foundation for all numerical computation |
-
- ### Training Infrastructure
-
- | Tool | Purpose | Why |
- |---|---|---|
- | **TensorBoard** | Training visualization | Real-time loss curves, metric tracking, prediction grids |
- | **Google Colab / Kaggle** | Cloud GPU | Free T4/P100 GPUs for training |
- | **Google Drive** | Persistent storage | Checkpoints survive Colab disconnections |
- | **YAML** | Configuration | Human-readable, all hyperparameters in one place |
-
- ### Deployment
-
- | Tool | Purpose | Why |
- |---|---|---|
- | **Gradio** | Web interface | Fastest way to create ML demos, no frontend code needed |
-
- ---
592
- ## 14. What Is Our Innovation / Contribution?
593
-
594
- ### 1. Unified Multi-Architecture Comparison Framework
595
-
596
- We built a single codebase that trains, evaluates, and deploys three fundamentally different architectures (CNN, UNet++, Transformer) under identical conditions — same data, same augmentations, same loss function, same metrics. Most papers only present one model. Our framework enables fair comparison.
597
-
598
- ### 2. Defense Application Framing
599
-
600
- We contextualized general change detection for military surveillance applications — monitoring base expansion, runway construction, and infrastructure development. The same technology used for urban planning is directly applicable to defense intelligence.
601
-
602
- ### 3. Custom ChangeFormer Implementation
603
-
604
- The ChangeFormer transformer is implemented from scratch (~350 lines of custom PyTorch code), not imported from a library:
605
- - Overlapping Patch Embeddings
606
- - Efficient Self-Attention with Spatial Reduction
607
- - Mix Feed-Forward Networks with Depthwise Convolutions
608
- - Hierarchical 4-stage Encoder
609
- - Multi-scale MLP Decoder
610
-
611
- ### 4. Production-Ready Pipeline
612
-
613
- This is not just a training notebook — it's a complete system:
614
- - Automated data download and preprocessing
615
- - Resume-capable training with cloud storage
616
- - Tiled inference for any-resolution images
617
- - Interactive web application for non-technical users
618
- - Auto GPU detection and batch size optimization
619
-
620
- ### 5. Custom Loss and Metrics
621
-
622
- We implemented BCEDiceLoss (combines classification and overlap objectives) and a MetricTracker that operates on GPU tensors for efficient evaluation.
623
-
- ---
-
- ## 15. What Are The Limitations?
-
- 1. **Training data is civilian** — Trained on LEVIR-CD (civilian buildings in Texas). While structurally similar to military construction, the model hasn't seen actual military facilities, camouflaged structures, or underground bunkers.
-
- 2. **Single geographic region** — LEVIR-CD covers only Texas, USA. Performance may degrade on satellite imagery from different geographic regions with different building styles, vegetation, or terrain.
-
- 3. **Fixed resolution** — Trained on 0.5m/pixel imagery. Lower-resolution imagery (e.g., Sentinel-2 at 10m/pixel) would require retraining.
-
- 4. **No temporal reasoning** — The model only sees two time points. It cannot track gradual construction progress over multiple time steps.
-
- 5. **Lighting sensitivity** — Significant illumination differences between before/after images can cause false positives or missed detections.
-
- 6. **Siamese CNN undertrained** — Due to session interruptions, the Siamese CNN baseline was trained for only 3 epochs, so it does not provide a fair comparison point.
-
- ---
-
- ## 16. Future Work
-
- 1. **Military-specific fine-tuning** — Fine-tune on declassified military satellite imagery to improve detection of defense-specific structures.
-
- 2. **Multi-temporal analysis** — Extend from 2 timestamps to a sequence, tracking construction progress over months/years.
-
- 3. **Object-level detection** — Instead of just pixel masks, classify WHAT changed (building, road, runway, vehicle).
-
- 4. **Model ensemble** — Combine predictions from all three models for higher accuracy.
-
- 5. **Attention visualization** — Show which parts of the image the transformer attends to, providing explainability for intelligence analysts.
-
- 6. **Real-time satellite feed** — Connect to live satellite imagery APIs for continuous monitoring.
-
- 7. **Deploy on Hugging Face Spaces** — Create a permanent public URL for the web demo.
-
- ---
-
- ## 17. How To Present This Project
-
- ### Opening (1 minute)
-
- > "We built an AI system that monitors military base construction from satellite imagery. You give it two satellite photos — one old, one new — and it highlights exactly what changed: new buildings, new runways, new infrastructure. We compared three deep learning approaches and achieved 90% F1 score."
-
- ### Show The Demo (2 minutes)
-
- 1. Open the Gradio app (localhost:7860 or public URL)
- 2. Upload a before/after pair from the test set
- 3. Show the change detection output
- 4. Switch between models to show different predictions
- 5. Adjust the threshold slider
-
- ### Show The Results (1 minute)
-
- Present the comparison table:
-
- | Model | F1 | IoU | Architecture |
- |---|---|---|---|
- | Siamese CNN | 0.64 | 0.48 | Basic CNN |
- | ChangeFormer | 0.88 | 0.79 | Transformer |
- | **UNet++** | **0.90** | **0.82** | **Nested UNet** |
-
- > "UNet++ achieved the best results. Its nested skip connections are ideal for multi-scale change detection. Interestingly, it outperformed the more complex transformer model, suggesting that architectural inductive biases (convolutions that understand local spatial structure) are more important than global self-attention for 256x256 patches."
-
- ### Answer Common Questions
-
- **Q: "You used readymade models?"**
- > "The backbones (ResNet, MiT) are pretrained on ImageNet — that's transfer learning, standard practice. But the change detection architecture is custom — Siamese encoding, feature differencing, and the full ChangeFormer transformer are written from scratch. We also wrote custom loss functions and a complete training pipeline."
-
- **Q: "What's novel?"**
- > "The systematic comparison of three generations of deep learning on defense surveillance, packaged as a deployable web application. We show that UNet++ outperforms transformers for this task and patch size — a non-obvious finding that challenges the assumption that newer = better."
-
- **Q: "How is this military?"**
- > "Military bases are buildings and infrastructure. The model detects new construction from satellite imagery. Point it at a known military zone and it becomes a defense intelligence tool. The technology is the same — the application context makes it military."
MODELS_EXPLAINED.md DELETED
@@ -1,573 +0,0 @@
- # Deep Dive: Models, Transfer Learning, and Fine-Tuning Explained
-
- ## Table of Contents
-
- 1. [What Is Transfer Learning and Why ImageNet?](#1-what-is-transfer-learning-and-why-imagenet)
- 2. [What Exactly Did We Fine-Tune?](#2-what-exactly-did-we-fine-tune)
- 3. [Model 1: Siamese CNN — Explained Like You're Teaching It](#3-model-1-siamese-cnn)
- 4. [Model 2: UNet++ — Why A Medical Model Works For Satellites](#4-model-2-unet)
- 5. [Model 3: ChangeFormer — The Transformer Approach](#5-model-3-changeformer)
- 6. [Why UNet++ Even Though It's A Medical Model?](#6-why-unet-even-though-its-a-medical-model)
- 7. [What Happens Inside During Inference — Step By Step](#7-what-happens-inside-during-inference)
- 8. [How To Explain This To Faculty](#8-how-to-explain-this-to-faculty)
-
- ---
-
- ## 1. What Is Transfer Learning and Why ImageNet?
-
- ### The Problem With Training From Scratch
-
- A deep learning model needs to learn TWO things:
- 1. **Low-level features** — edges, textures, corners, gradients, colors
- 2. **High-level features** — objects, shapes, spatial relationships
-
- Learning low-level features from scratch takes millions of images and days of training. But here's the key insight: **edges look the same everywhere**. An edge in a cat photo looks the same as an edge in a satellite photo. A texture gradient in a car image is structurally identical to a texture gradient in a building image.
-
- ### What Is ImageNet?
-
- ImageNet is a dataset of over **14 million images**; the widely used ILSVRC subset spans 1,000 categories (cats, dogs, cars, planes, buildings, landscapes, etc.). Models trained on ImageNet learn incredibly rich low-level and mid-level features because they've seen enormous visual diversity.
-
- ### What Is Transfer Learning?
-
- Instead of training from scratch (random weights), we START with weights that were trained on ImageNet. This gives us:
-
- ```
- FROM SCRATCH:
- Random weights --> [needs millions of images] --> Learns edges --> Learns textures --> Learns shapes --> Learns objects
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- THIS TAKES FOREVER
-
- TRANSFER LEARNING:
- ImageNet weights --> [already knows edges, textures, shapes] --> [needs few thousand images] --> Learns satellite-specific patterns
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- FREE - comes with pretrained weights THIS IS FAST
- ```
-
- ### Analogy
-
- Think of it like learning a new language:
- - **From scratch**: A baby learning their first language — takes years
- - **Transfer learning**: A person who speaks English learning Spanish — much faster because they already understand grammar, sentence structure, and many shared words
-
- ImageNet pretraining = knowing English. Satellite change detection = learning Spanish. The foundation transfers.
-
- ---
-
- ## 2. What Exactly Did We Fine-Tune?
-
- ### What "Fine-Tuning" Means
-
- Fine-tuning means we took the pretrained ImageNet weights and **continued training them on our satellite data**. We didn't freeze anything — ALL layers were updated. This is called **end-to-end fine-tuning**.
-
- ### What Changed During Fine-Tuning
-
- ```
- BEFORE Fine-Tuning (ImageNet weights):
- Layer 1: Detects generic edges, gradients
- Layer 2: Detects generic textures, patterns
- Layer 3: Detects generic shapes (circles, rectangles)
- Layer 4: Detects generic objects (cat face, car wheel)
- ^^^^ These are useful for ANYTHING visual
-
- AFTER Fine-Tuning (our satellite training):
- Layer 1: Still detects edges (barely changed — edges are universal)
- Layer 2: Detects satellite-specific textures (roof patterns, road textures)
- Layer 3: Detects building footprints, road shapes
- Layer 4: Detects "new building appeared" vs "same building"
- ^^^^ Early layers changed little, later layers changed a LOT
- ```
-
- ### The Numbers
-
- | Model | Total Parameters | Pretrained (from ImageNet) | New (trained from scratch) |
- |---|---|---|---|
- | Siamese CNN | 14M | 11M (ResNet18 encoder) | 3M (decoder) |
- | UNet++ | 26M | 21M (ResNet34 encoder) | 5M (decoder) |
- | ChangeFormer | 14M | 0 (trained from scratch) | 14M (everything) |
-
- **Key point**: For Siamese CNN and UNet++, the ENCODER (feature extractor) is pretrained. The DECODER (change mask generator) is trained from scratch. During fine-tuning, both encoder AND decoder are updated, but the encoder starts from a much better position.
-
- **ChangeFormer is different**: We wrote the entire architecture from scratch. There are no widely available pretrained MiT-B1 weights for change detection, so we trained everything from random initialization. This is why it needs 200 epochs instead of 100.
-
- ### What Does The Training Actually Do?
-
- Each training step:
- 1. Feed a before/after image pair through the model
- 2. Model outputs a predicted change mask
- 3. Compare prediction with ground truth using BCEDiceLoss
- 4. Compute gradients (how much each weight contributed to the error)
- 5. Update ALL weights slightly in the direction that reduces error
- 6. Repeat 7,120 times per epoch (one per training sample)
- 7. Repeat for 85-141 epochs
-
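One such optimizer step, sketched with placeholder tensors and a stand-in model (the real pipeline feeds LEVIR-CD batches through the actual architectures with BCEDiceLoss; every name below is illustrative):

```python
import torch
import torch.nn as nn

# Stand-in for any Siamese change-detection net: forward(before, after) -> logits
class TinyChangeNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Conv2d(3, 8, 3, padding=1)  # shared-weight feature extractor
        self.head = nn.Conv2d(8, 1, 1)                # change-mask decoder

    def forward(self, before, after):
        diff = (self.encoder(before) - self.encoder(after)).abs()
        return self.head(diff)

model = TinyChangeNet()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()

before = torch.rand(2, 3, 64, 64)                    # placeholder "before" batch
after = torch.rand(2, 3, 64, 64)                     # placeholder "after" batch
mask = torch.randint(0, 2, (2, 1, 64, 64)).float()   # placeholder ground truth

logits = model(before, after)   # 1-2. forward pass -> predicted mask
loss = loss_fn(logits, mask)    # 3. compare with ground truth
optimizer.zero_grad()
loss.backward()                 # 4. gradients for ALL weights
optimizer.step()                # 5. update encoder AND decoder together
```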
- After training:
- - Early layers (edges, textures): changed ~5-10% from ImageNet values
- - Middle layers (shapes, patterns): changed ~20-40%
- - Late layers (semantic understanding): changed ~60-90%
- - Decoder layers: learned entirely from our data
-
- ---
-
- ## 3. Model 1: Siamese CNN
-
- ### What Is "Siamese"?
-
- "Siamese" means twins — like Siamese twins. The model has TWO identical paths that share the SAME weights:
-
- ```
- Image A (Before) ----\
- [Same ResNet18] ---- Features A
- Image B (After) ----/ Features B
- ^^^^^^^^^^^^^
- SHARED WEIGHTS
- (not two separate networks)
- ```
-
- **Why shared?** If both images go through the EXACT same processing, then any difference in the output features MUST be because the images themselves are different. The shared weights act as a fair, unbiased feature extractor.
-
- ### ResNet18 Encoder — Step by Step
-
- ResNet18 is a Convolutional Neural Network with 18 layers. Here's what happens to a 256x256 satellite image:
-
- ```
- Input: [3, 256, 256] (3 = RGB channels)
- |
- v
- Conv1 + BN + ReLU + MaxPool
- | --> [64, 64, 64] (64 feature channels, spatial size reduced to 64x64)
- v
- Layer 1 (2 residual blocks)
- | --> [64, 64, 64] (same size, refined features)
- v
- Layer 2 (2 residual blocks)
- | --> [128, 32, 32] (more channels, smaller spatial)
- v
- Layer 3 (2 residual blocks)
- | --> [256, 16, 16] (even more channels, even smaller)
- v
- Layer 4 (2 residual blocks)
- | --> [512, 8, 8] (512 feature channels, 8x8 spatial grid)
- v
- Output: Rich feature representation
- ```
-
- Each "residual block" has the famous skip connection:
- ```
- input ----> [Conv -> BN -> ReLU -> Conv -> BN] ----> ADD ----> ReLU ----> output
- | ^
- |_____________(identity shortcut)____________________|
- ```
-
- The skip connection solves the vanishing gradient problem — gradients can flow directly through the shortcut, making deep networks trainable.
-
- ### The Difference Operation
-
- After encoding both images:
- ```
- Features_A: [512, 8, 8] (before image encoded)
- Features_B: [512, 8, 8] (after image encoded)
-
- Difference = |Features_A - Features_B| (absolute difference, element-wise)
- Result: [512, 8, 8] (where values are high = something changed)
- ```
-
- If a pixel in Features_A has value 0.8 and the same pixel in Features_B has value 0.2, the difference is 0.6 — meaning this region changed significantly.
-
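The difference operation on its own, with the 0.8 vs 0.2 example above plugged in:

```python
import torch

features_a = torch.zeros(512, 8, 8)
features_b = torch.zeros(512, 8, 8)
features_a[0, 0, 0] = 0.8   # this region looks one way "before"
features_b[0, 0, 0] = 0.2   # ...and another way "after"

diff = (features_a - features_b).abs()   # element-wise |A - B|
print(diff[0, 0, 0].item())  # ~0.6 -> large value = change here
print(diff.shape)            # torch.Size([512, 8, 8])
```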
- ### The Decoder — Transposed Convolutions
-
- Now we need to go from 8x8 back to 256x256. Transposed convolution (also called "deconvolution") does upsampling:
-
- ```
- [512, 8, 8]
- | TransposedConv + BN + ReLU
- v
- [256, 16, 16]
- | TransposedConv + BN + ReLU
- v
- [128, 32, 32]
- | TransposedConv + BN + ReLU
- v
- [64, 64, 64]
- | TransposedConv + BN + ReLU
- v
- [32, 128, 128]
- | TransposedConv (final)
- v
- [1, 256, 256] <-- Change mask! (raw logits, apply sigmoid for probabilities)
- ```
-
- ### Weakness
-
- The encoder compresses 256x256 down to 8x8 — that's a 32x reduction. Fine spatial details are lost. A small building that's 10x10 pixels becomes less than 1 pixel in the 8x8 feature map. The decoder tries to reconstruct this but without skip connections (unlike UNet), it struggles with precise localization.
-
- ---
-
- ## 4. Model 2: UNet++
-
- ### First, What Is Regular UNet?
-
- UNet was invented for medical image segmentation (detecting tumors in brain scans). It has an **encoder-decoder structure with skip connections**:
-
- ```
- ENCODER (downsampling) DECODER (upsampling)
- [256x256] ----skip connection----> [256x256]
- | ^
- [128x128] ----skip connection----> [128x128]
- | ^
- [64x64] ----skip connection----> [64x64]
- | ^
- [32x32] ----skip connection----> [32x32]
- | ^
- [16x16] ------bottleneck-------> [16x16]
- ```
-
- The skip connections DIRECTLY copy encoder features to the decoder. This means the decoder has access to BOTH:
- - High-level semantic info (from the bottleneck): "this region has a building"
- - Low-level spatial detail (from skip connections): "the exact edge of the building is here"
-
- ### What Makes UNet++ Different From UNet?
-
- Regular UNet's problem: the skip connections connect features at very different semantic levels. The encoder at level 2 produces "edge features" while the decoder at level 2 needs "building boundary features". There's a **semantic gap**.
-
- UNet++ fixes this with **nested intermediate blocks**:
-
- ```
- Regular UNet:
- Encoder --------direct skip--------> Decoder
- (raw features) (needs processed features)
- ^^ SEMANTIC GAP ^^
-
- UNet++:
- Encoder ----> [Block] ----> [Block] ----> Decoder
- (raw) (processed) (more processed) (ready to use)
- ^^^^^^^^^^^^^^^^^^^^^^^^
- NESTED DENSE BLOCKS bridge the gap
- ```
-
- In detail:
- ```
- X(0,0) ---------> X(0,1) ---------> X(0,2) ---------> X(0,3) ---------> X(0,4)
- | | | |
- X(1,0) ---------> X(1,1) ---------> X(1,2) ---------> X(1,3)
- | | |
- X(2,0) ---------> X(2,1) ---------> X(2,2)
- | |
- X(3,0) ---------> X(3,1)
- |
- X(4,0) (bottleneck)
- ```
-
- Each X(i,j) node receives inputs from:
- - The node below it (deeper features)
- - ALL previous nodes at the same level (dense connections)
-
- This means by the time features reach the output, they've been progressively refined through multiple intermediate processing stages.
-
- ### How We Adapted UNet++ For Change Detection
-
- Original UNet++ takes ONE image and segments it. We adapted it for TWO images:
-
- ```
- Image A (Before) --> [ResNet34 Encoder] --> Features at 5 scales
- | (shared weights)
- Image B (After) --> [ResNet34 Encoder] --> Features at 5 scales
-
- At each scale:
- diff[i] = |Features_A[i] - Features_B[i]|
-
- diff features --> [UNet++ Decoder with nested skip connections] --> Change Mask
- ```
-
- We use ResNet34 (34 layers, deeper than ResNet18) as the encoder via the `segmentation-models-pytorch` library, which provides the UNet++ decoder architecture.
-
- ### Why ResNet34 Instead of ResNet18?
-
- ResNet34 has more layers and captures richer features:
- - ResNet18: [2, 2, 2, 2] blocks = 18 layers
- - ResNet34: [3, 4, 6, 3] blocks = 34 layers
-
- More depth = better feature extraction, especially for the subtle differences between before/after satellite images.
-
- ---
-
- ## 5. Model 3: ChangeFormer
-
- ### What Is A Vision Transformer?
-
- Traditional CNNs look at LOCAL regions (3x3 or 5x5 patches). Transformers look at GLOBAL relationships — every part of the image can attend to every other part.
-
- ### The Self-Attention Mechanism
-
- For a given position in the image, self-attention asks: "Which OTHER positions in this image are relevant to understanding THIS position?"
-
- ```
- Example: A new building appears in the top-left
- Self-attention notices:
- - New road appeared nearby (related construction)
- - Parking lot appeared on the right (part of same development)
- - Trees on the south side were cleared (preparation for construction)
-
- A CNN would process each region independently.
- A Transformer connects them all.
- ```
-
- ### How Self-Attention Works (Simplified)
-
- For each pixel position:
- 1. Create a **Query** (Q): "What am I looking for?"
- 2. Create a **Key** (K): "What information do I have?"
- 3. Create a **Value** (V): "What information can I give?"
-
- ```
- Attention(Q, K, V) = softmax(Q * K^T / sqrt(d)) * V
- ```
-
- - Q * K^T: How relevant is each position to me? (attention score)
- - / sqrt(d): Scale factor for numerical stability, applied before the softmax
- - softmax: Normalize to probabilities
- - * V: Weight the values by attention scores
-
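The formula in a few lines of PyTorch (single head, no masking, just to make the shapes concrete):

```python
import torch

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d)) V: each query position mixes all value positions."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d**0.5  # relevance of every position to every other
    weights = torch.softmax(scores, dim=-1)    # each row sums to 1
    return weights @ v

tokens, dim = 16, 8            # 16 positions, 8-dim features per position
q = torch.randn(tokens, dim)
k = torch.randn(tokens, dim)
v = torch.randn(tokens, dim)
out = attention(q, k, v)
print(out.shape)  # torch.Size([16, 8])
```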
- ### The MiT-B1 Architecture
-
- MiT (Mix Transformer) is a hierarchical transformer — unlike ViT which processes the image at one scale, MiT processes at 4 scales (like a CNN):
-
- **Stage 1 (64x64, 64 channels)**:
- ```
- 256x256 image
- |
- Overlapping Patch Embed (7x7 conv, stride 4)
- |
- 64x64 grid of 64-dim tokens (4096 tokens)
- |
- 2x [Efficient Self-Attention + Mix-FFN]
- |
- Output: [64, 64, 64] features
- ```
-
- **Stage 2 (32x32, 128 channels)**:
- ```
- Overlapping Patch Embed (3x3 conv, stride 2)
- |
- 32x32 grid of 128-dim tokens (1024 tokens)
- |
- 2x [Efficient Self-Attention + Mix-FFN]
- |
- Output: [128, 32, 32] features
- ```
-
- **Stage 3 (16x16, 320 channels)** and **Stage 4 (8x8, 512 channels)** follow the same pattern.
-
- ### Efficient Self-Attention
-
- Standard self-attention on 64x64 = 4096 tokens would require a 4096x4096 attention matrix — too expensive. We use **Spatial Reduction**:
-
- ```
- Standard: Q (4096 tokens) x K (4096 tokens) = 16M attention scores (TOO SLOW)
-
- Efficient:
- Q stays at 4096 tokens
- K and V are spatially reduced: 4096 -> 64 tokens (stride 8 per side, 64x fewer tokens)
- Q (4096) x K (64) = 262K attention scores (64x cheaper!)
- ```
-
- This is done via a strided convolution that reduces K and V before computing attention.
-
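The spatial reduction step, sketched for stage 1 (reduction ratio 8; only the score matrix is computed here, to show the saving):

```python
import torch
import torch.nn as nn

channels, h, w, sr = 64, 64, 64, 8
x = torch.randn(1, channels, h, w)       # stage-1 feature map: 64x64 = 4096 positions

# Strided conv shrinks the map K and V are built from: 64x64 -> 8x8
reduce = nn.Conv2d(channels, channels, kernel_size=sr, stride=sr)

q = x.flatten(2).transpose(1, 2)         # queries keep full resolution: 4096 tokens
kv = reduce(x).flatten(2).transpose(1, 2)  # keys/values reduced: 8x8 = 64 tokens

scores = q @ kv.transpose(-2, -1) / channels**0.5
print(scores.shape)  # torch.Size([1, 4096, 64]) -> 262K scores instead of 16.7M
```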
- ### Mix-FFN
-
- Standard transformers use a simple MLP (Linear -> GELU -> Linear) after attention. Mix-FFN adds a **depthwise 3x3 convolution** in the middle:
-
- ```
- Standard FFN: Linear -> GELU -> Linear
- Mix-FFN: Linear -> DepthwiseConv3x3 -> GELU -> Linear
- ^^^^^^^^^^^^^^^^^
- Injects local spatial information
- ```
-
- Why? Pure transformers have no notion of "nearby pixels". The depthwise conv brings back local spatial awareness without the cost of full convolutions. This eliminates the need for explicit position embeddings.
-
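A minimal Mix-FFN block (the expansion ratio is an assumption; treat this as an approximation of the project's implementation, following the SegFormer-style ordering above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixFFN(nn.Module):
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Linear(dim, hidden)
        # groups=hidden makes this depthwise: one independent 3x3 filter per channel
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, h, w):
        # x: [batch, tokens, dim] with tokens == h * w
        x = self.fc1(x)
        b, n, c = x.shape
        x = x.transpose(1, 2).reshape(b, c, h, w)  # tokens back onto a spatial grid
        x = self.dwconv(x)                         # local 3x3 mixing
        x = x.flatten(2).transpose(1, 2)           # back to token sequence
        return self.fc2(F.gelu(x))

ffn = MixFFN(64)
out = ffn(torch.randn(1, 64 * 64, 64), 64, 64)
print(out.shape)  # torch.Size([1, 4096, 64])
```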
- ### The MLP Decoder
-
- After the encoder produces features at 4 scales, the decoder fuses them:
-
- ```
- Stage 1 features: [64, 64, 64] --[1x1 Conv]--> [64, 64, 64] --[Upsample]--> [64, 64, 64]
- Stage 2 features: [128, 32, 32] --[1x1 Conv]--> [64, 32, 32] --[Upsample]--> [64, 64, 64]
- Stage 3 features: [320, 16, 16] --[1x1 Conv]--> [64, 16, 16] --[Upsample]--> [64, 64, 64]
- Stage 4 features: [512, 8, 8] --[1x1 Conv]--> [64, 8, 8] --[Upsample]--> [64, 64, 64]
-
- Concatenate all: [256, 64, 64]
- |
- [1x1 Conv + BN + ReLU] --> [64, 64, 64]
- |
- [1x1 Conv] --> [1, 64, 64]
- |
- [Upsample 4x] --> [1, 256, 256] <-- Final change mask
- ```
-
- All scales are projected to the same dimension (64), upsampled to the same size (64x64), concatenated, and fused with a simple 1x1 convolution.
-
- ---
-
- ## 6. Why UNet++ Even Though It's A Medical Model?
-
- This is a great question and one your faculty will likely ask. Here's the answer:
-
- ### The Core Insight: Segmentation Is Segmentation
-
- UNet++ was designed for **medical image segmentation** — detecting tumor boundaries in CT scans, cell boundaries in microscopy, organ boundaries in MRI. But what IS segmentation?
-
- ```
- Medical: Input image --> Classify each pixel as (tumor / not tumor)
- Satellite: Input image --> Classify each pixel as (changed / not changed)
- ```
-
- **The task is structurally identical.** Both are binary pixel-level classification problems with:
-
- | Property | Medical | Satellite Change Detection |
- |---|---|---|
- | Task | Pixel classification | Pixel classification |
- | Output | Binary mask | Binary mask |
- | Class imbalance | Tumor is tiny vs whole brain | Changed area is tiny vs whole image |
- | Multi-scale | Tumors vary from 5px to 500px | Buildings vary from 10px to 200px |
- | Needs precise boundaries | Yes (surgical planning) | Yes (accurate change mapping) |
-
- ### Why UNet++ Is Especially Good For This
-
- 1. **Multi-scale feature fusion** — Buildings come in all sizes. A small shed (10x10px) needs fine features. A large warehouse (100x100px) needs coarse features. UNet++'s nested skip connections fuse ALL scales.
-
- 2. **Precise boundary detection** — The skip connections preserve spatial detail. Change detection needs precise boundaries — "exactly WHICH pixels changed?"
-
- 3. **Handles class imbalance** — In both medical and satellite tasks, the "positive" class (tumor/change) is tiny. UNet++ was designed for this.
-
- 4. **Proven architecture** — It's not just medical anymore. UNet++ is used in:
- - Remote sensing (satellite segmentation)
- - Autonomous driving (road segmentation)
- - Industrial inspection (defect detection)
- - Agriculture (crop segmentation)
-
- ### The Adaptation We Made
-
- Original UNet++: Takes ONE image, segments it
- Our UNet++: Takes TWO images through a SHARED encoder, computes feature differences, decodes
-
- ```
- Standard UNet++:
- 1 image --> Encoder --> Decoder --> Segmentation mask
-
- Our Adaptation:
- 2 images --> Shared Encoder --> Feature Difference --> Decoder --> Change mask
- ```
-
- This is NOT just "using UNet++ out of the box". We modified the architecture to handle bitemporal (two-image) input. The encoder is shared (Siamese), and we compute multi-scale feature differences before feeding into the decoder.
-
- ### What To Tell Faculty
-
- > "UNet++ was originally for medical segmentation, but the underlying problem is identical — pixel-level classification with class imbalance, where both fine detail and coarse context matter. We adapted it for bitemporal input by using a shared encoder and computing feature differences at each scale. This architectural pattern (encoder-difference-decoder) is standard in remote sensing change detection literature. UNet++ is now widely used beyond medical imaging — in satellite imagery, autonomous driving, and industrial inspection."
-
- ---
-
- ## 7. What Happens Inside During Inference — Step By Step
-
- Let's trace what happens when you upload two images in the Gradio app:
-
- ### Step 1: Image Loading
- ```
- User uploads:
- before.png (256x256 RGB, uint8, values 0-255)
- after.png (256x256 RGB, uint8, values 0-255)
- ```
-
- ### Step 2: Preprocessing
- ```
- Convert to float32: values 0.0 to 1.0
- Apply ImageNet normalization:
- pixel = (pixel - mean) / std
- mean = [0.485, 0.456, 0.406] (per RGB channel)
- std = [0.229, 0.224, 0.225]
-
- Result: normalized tensors, values roughly -2.0 to 2.5
- Shape: [1, 3, 256, 256] each (batch=1, channels=3, height=256, width=256)
- ```
-
- ### Step 3: Pad If Needed
- ```
- If image is 300x400:
- Pad to 512x512 (nearest multiple of 256)
- Using reflection padding (mirrors edge pixels)
- ```
-
- ### Step 4: Tile Into Patches (if larger than 256x256)
- ```
- 512x512 image --> 4 patches of 256x256
- Patch 1: top-left
- Patch 2: top-right
- Patch 3: bottom-left
- Patch 4: bottom-right
- ```
-
- ### Step 5: Model Forward Pass (for each patch pair)
-
- **Using ChangeFormer as example:**
-
- ```
- Before patch [1, 3, 256, 256] --> MiT Encoder --> 4 feature maps
- After patch [1, 3, 256, 256] --> MiT Encoder --> 4 feature maps
- (shared weights)
-
- Feature differences at each scale:
- Scale 1: |before_64x64 - after_64x64| = diff_64x64
- Scale 2: |before_32x32 - after_32x32| = diff_32x32
- Scale 3: |before_16x16 - after_16x16| = diff_16x16
- Scale 4: |before_8x8 - after_8x8| = diff_8x8
-
- MLP Decoder fuses all scales:
- --> [1, 1, 256, 256] raw logits
- ```
-
- ### Step 6: Sigmoid + Threshold
- ```
- Probabilities = sigmoid(logits) # values 0.0 to 1.0
- Binary mask = (probabilities > 0.5) # True/False per pixel
- ```
-
- ### Step 7: Stitch Patches Back (if tiled)
- ```
- 4 patches of 256x256 --> stitch back to 512x512
- Crop to original 300x400
- ```
-
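Steps 3, 4, and 7 together, sketched for a single image with a stand-in "model" (the project's actual `sliding_window_inference` is more general; the channel-mean score below is purely a placeholder):

```python
import torch
import torch.nn.functional as F

def tile_and_stitch(img, patch=256):
    """Pad to a multiple of `patch`, score each tile, stitch, crop back."""
    _, h, w = img.shape
    pad_h = (patch - h % patch) % patch
    pad_w = (patch - w % patch) % patch
    # Reflection padding mirrors edge pixels instead of adding black borders
    padded = F.pad(img.unsqueeze(0), (0, pad_w, 0, pad_h), mode="reflect").squeeze(0)

    out = torch.zeros(1, padded.shape[1], padded.shape[2])
    for top in range(0, padded.shape[1], patch):
        for left in range(0, padded.shape[2], patch):
            tile = padded[:, top:top + patch, left:left + patch]
            # Stand-in for the model: channel mean as a fake "change score"
            out[:, top:top + patch, left:left + patch] = tile.mean(0, keepdim=True)
    return out[:, :h, :w]   # crop the padding away

mask = tile_and_stitch(torch.rand(3, 300, 400))  # 300x400 -> padded to 512x512, 4 tiles
print(mask.shape)  # torch.Size([1, 300, 400])
```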
- ### Step 8: Create Outputs
- ```
- Change mask: binary image (white = change, black = no change)
- Overlay: after image with red tint on changed pixels
- Statistics: "5.3% of area changed, 6,360 pixels out of 120,000"
- ```
-
- ### Total Time
- - CPU: ~2-5 seconds per 256x256 patch
- - GPU (T4): ~0.1 seconds per 256x256 patch
-
- ---
-
- ## 8. How To Explain This To Faculty
-
- ### If asked "Explain the model architecture"
-
- > "All three models follow the same pattern: a shared-weight Siamese encoder processes both the before and after images identically. We compute the absolute difference between features at each scale — large differences indicate change. A decoder then upsamples this difference back to full resolution to produce a pixel-level change mask.
-
- > The difference is in the encoder and decoder:
- > - Siamese CNN uses ResNet18 and simple transposed convolutions — fast but loses spatial detail
- > - UNet++ uses ResNet34 with nested skip connections — preserves detail at every scale
- > - ChangeFormer uses a hierarchical transformer with self-attention — captures global context across the entire image"
-
- ### If asked "What fine-tuning did you do?"
-
- > "We used ImageNet-pretrained ResNet backbones for the encoder. ImageNet teaches the model to recognize edges, textures, and shapes — these visual primitives are universal. We then fine-tuned ALL layers end-to-end on our satellite change detection dataset. The early layers (edge detection) barely changed. The later layers were substantially updated to understand satellite-specific patterns like building footprints and road textures. The decoder was trained entirely from scratch since it's specific to change detection."
-
- ### If asked "Why UNet++ for satellite when it's a medical model?"
-
- > "UNet++ solves pixel-level binary classification with class imbalance and multi-scale features. That's exactly what change detection needs — most pixels are unchanged (like most brain pixels are non-tumor), and changes happen at multiple scales (small buildings to large developments). The architecture is task-agnostic — it doesn't know if it's looking at brains or buildings. We adapted it by adding a shared Siamese encoder and computing feature differences, making it bitemporal."
-
- ### If asked "What's your contribution vs just using existing models?"
-
- > "Three things: First, we built the change detection adaptation — Siamese encoding, feature differencing, the full ChangeFormer from scratch. Second, we created a unified comparison framework — same data, same metrics, same training for all three models, which most papers don't do. Third, we built a production pipeline — from data preprocessing to a deployed web app with tiled inference for any image size. The finding that UNet++ outperforms the transformer on this task and patch size is itself a contribution — it challenges the assumption that newer architectures are always better."
README.md CHANGED
@@ -1,3 +1,15 @@
+---
+title: Military Base Change Detection
+emoji: satellite
+colorFrom: blue
+colorTo: red
+sdk: gradio
+sdk_version: 4.44.1
+app_file: app.py
+pinned: false
+python_version: 3.10
+---
+
 ![Python](https://img.shields.io/badge/Python-3.10%2B-blue?logo=python&logoColor=white)
 ![PyTorch](https://img.shields.io/badge/PyTorch-2.x-EE4C2C?logo=pytorch&logoColor=white)
 ![License](https://img.shields.io/badge/License-MIT-green)
app.py CHANGED
@@ -10,6 +10,7 @@ Usage:
 """
 
 import logging
+import os
 from pathlib import Path
 from typing import Any, Dict, List, Optional, Tuple
 
@@ -17,6 +18,7 @@ import gradio as gr
 import numpy as np
 import torch
 import yaml
+from huggingface_hub import hf_hub_download
 
 from data.dataset import IMAGENET_MEAN, IMAGENET_STD
 from inference import sliding_window_inference
@@ -34,10 +36,13 @@
 _cached_model_key: Optional[str] = None
 _device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
 _config: Optional[Dict[str, Any]] = None
+_hf_model_repo_id: Optional[str] = os.getenv("HF_MODEL_REPO")
+_hf_model_revision: Optional[str] = os.getenv("HF_MODEL_REVISION")
 
 # Search these directories for checkpoint files
 _CHECKPOINT_SEARCH_DIRS = [
     Path("checkpoints"),
+    Path("/home/user/app/checkpoints"),
     Path("/kaggle/working/checkpoints"),
     Path("/content/drive/MyDrive/change-detection/checkpoints"),
 ]
@@ -50,6 +55,40 @@
 }
 
 
+def _download_checkpoint_from_hf(model_name: str) -> Optional[Path]:
+    """Download checkpoint from Hugging Face Hub if configured.
+
+    Uses env var ``HF_MODEL_REPO`` as the source model repository and
+    downloads to the local ``checkpoints/`` directory.
+
+    Args:
+        model_name: One of the supported model keys.
+
+    Returns:
+        Local path to downloaded checkpoint, or ``None`` if unavailable.
+    """
+    if not _hf_model_repo_id:
+        return None
+
+    filename = _MODEL_CHECKPOINT_NAMES.get(model_name)
+    if filename is None:
+        return None
+
+    try:
+        local_path = hf_hub_download(
+            repo_id=_hf_model_repo_id,
+            filename=filename,
+            revision=_hf_model_revision,
+            local_dir="checkpoints",
+            local_dir_use_symlinks=False,
+        )
+        logger.info("Downloaded %s from %s", filename, _hf_model_repo_id)
+        return Path(local_path)
+    except Exception as exc:  # pragma: no cover - best-effort fallback
+        logger.warning("Could not download %s from HF Hub: %s", filename, exc)
+        return None
+
+
 # ---------------------------------------------------------------------------
 # Config / model loading
 # ---------------------------------------------------------------------------
@@ -88,6 +127,10 @@ def _find_checkpoint(model_name: str) -> Optional[Path]:
         if candidate.exists():
             return candidate
 
+    downloaded = _download_checkpoint_from_hf(model_name)
+    if downloaded is not None and downloaded.exists():
+        return downloaded
+
     return None
 
 
@@ -353,9 +396,11 @@ def main() -> None:
     gradio_cfg = config.get("gradio", {})
 
     demo = build_demo()
+    in_hf_space = os.getenv("SPACE_ID") is not None
     demo.launch(
+        server_name="0.0.0.0" if in_hf_space else "127.0.0.1",
         server_port=gradio_cfg.get("server_port", 7860),
-        share=gradio_cfg.get("share", False),
+        share=False if in_hf_space else gradio_cfg.get("share", False),
     )
 
 
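The lookup order this commit establishes in `_find_checkpoint` (local search directories first, then a best-effort Hub download driven by the `HF_MODEL_REPO` env var) can be sketched in isolation. `resolve_checkpoint` and the injectable `download` callable below are illustrative names, not the app's actual API:

```python
from pathlib import Path
from typing import Callable, Iterable, Optional


def resolve_checkpoint(
    name: str,
    search_dirs: Iterable[str],
    download: Optional[Callable[[str], Optional[Path]]] = None,
) -> Optional[Path]:
    """Mirror the fallback order: local directories first, then Hub download.

    download: optional callable (e.g. a wrapper around hf_hub_download)
    invoked only when no local copy of the checkpoint exists.
    """
    for d in search_dirs:
        candidate = Path(d) / name
        if candidate.exists():
            return candidate
    # No local hit: fall back to the injected downloader, if any.
    return download(name) if download is not None else None
```

Injecting the downloader keeps the resolution logic testable offline and means a Space with checkpoints committed in `checkpoints/` never touches the network.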
requirements.txt CHANGED
@@ -14,3 +14,4 @@ tqdm>=4.66.0
 tensorboard>=2.15.0
 gradio>=4.14.0
 gdown>=5.1.0
+huggingface_hub>=0.23.0