Spaces:
Runtime error
Vedant Jigarbhai Mehta committed on
Commit · 1eb8817
Parent(s): 4f856a3
Deploy to hf spaces
Browse files
- .gitattributes +1 -0
- DEPLOY_HF_SPACES.md +77 -0
- EXPLANATION.md +0 -695
- MODELS_EXPLAINED.md +0 -573
- README.md +12 -0
- app.py +46 -1
- requirements.txt +1 -0
.gitattributes ADDED
@@ -0,0 +1 @@
+*.pth filter=lfs diff=lfs merge=lfs -text
DEPLOY_HF_SPACES.md ADDED
@@ -0,0 +1,77 @@
# Deploy to Hugging Face Spaces (Gradio)

This project is now ready for Hugging Face Spaces.

## Option A (recommended): single Space repo with checkpoints

Use this when you want the simplest deployment.

1. Create a new Hugging Face Space:
   - SDK: Gradio
   - Hardware: CPU Basic to start; upgrade to GPU for faster inference
2. Push this project to that Space repo.
3. Ensure these files are present at the Space repo root:
   - app.py
   - requirements.txt
   - configs/config.yaml
   - models/
   - data/
   - utils/
   - checkpoints/changeformer_best.pth (or your preferred model)
4. In Space Settings, set the startup file to `app.py` (the default for Gradio Spaces).
5. Optional: reduce the initial footprint by keeping only one checkpoint (for example `changeformer_best.pth`) inside `checkpoints/`.

## Option B: Space app + separate model repo

Use this when you want a smaller Space repo and keep large checkpoints elsewhere.

1. Upload the checkpoint files to a separate Hugging Face model repo.
2. In your Space Settings -> Variables, set:
   - `HF_MODEL_REPO`: owner/repo-name
   - `HF_MODEL_REVISION`: optional branch/tag/commit (for reproducible deployment)
3. On startup, `app.py` will auto-download the expected checkpoint filenames into `checkpoints/`.

Expected checkpoint names:
- siamese_cnn_best.pth
- unet_pp_best.pth
- changeformer_best.pth
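The auto-download step can be sketched as below. This is an illustrative helper, not the actual code in `app.py`: the function names and `EXPECTED_CHECKPOINTS` constant are made up here, and it assumes `huggingface_hub` is installed (it ships as a Gradio dependency). Only `hf_hub_download` and its `repo_id`/`filename`/`revision`/`local_dir` parameters are real library API.

```python
import os
from pathlib import Path

EXPECTED_CHECKPOINTS = ["siamese_cnn_best.pth", "unet_pp_best.pth", "changeformer_best.pth"]

def missing_checkpoints(ckpt_dir: str) -> list[str]:
    """Return the expected checkpoint filenames not yet present locally."""
    root = Path(ckpt_dir)
    return [name for name in EXPECTED_CHECKPOINTS if not (root / name).exists()]

def fetch_checkpoints(ckpt_dir: str = "checkpoints") -> None:
    """Download missing checkpoints from the repo named in HF_MODEL_REPO."""
    repo = os.environ.get("HF_MODEL_REPO")
    if not repo:
        return  # no model repo configured; rely on local files
    revision = os.environ.get("HF_MODEL_REVISION")  # None -> default branch
    from huggingface_hub import hf_hub_download  # imported lazily
    Path(ckpt_dir).mkdir(parents=True, exist_ok=True)
    for name in missing_checkpoints(ckpt_dir):
        hf_hub_download(repo_id=repo, filename=name, revision=revision,
                        local_dir=ckpt_dir)
```

Pinning `HF_MODEL_REVISION` to a commit hash makes Space rebuilds reproducible even if the model repo later changes.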
## Space README metadata (required in Space repo)

In the Space repository README.md, include this at the top:

```yaml
---
title: Military Base Change Detection
emoji: satellite
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: false
python_version: 3.10
---
```

## Notes

- CPU hardware works, but inference can be slow for larger images.
- For better latency, choose a GPU Space.
- `app.py` now detects Spaces automatically and binds to `0.0.0.0`.
- If no local checkpoints are found, it will try `HF_MODEL_REPO`.

## Quick local validation before push

```bash
pip install -r requirements.txt
python app.py
```

Then open the local Gradio URL and test one sample pair.
EXPLANATION.md DELETED
@@ -1,695 +0,0 @@
# Military Base Change Detection — Complete Project Explanation

## Table of Contents

1. [What Is This Project?](#1-what-is-this-project)
2. [Why Did We Build This?](#2-why-did-we-build-this)
3. [What Problem Are We Solving?](#3-what-problem-are-we-solving)
4. [What Dataset Did We Use and Why?](#4-what-dataset-did-we-use-and-why)
5. [What Are The Three Models and Why These Three?](#5-what-are-the-three-models-and-why-these-three)
6. [How Does Each Model Work Internally?](#6-how-does-each-model-work-internally)
7. [How Is The Training Pipeline Designed?](#7-how-is-the-training-pipeline-designed)
8. [What Loss Functions Did We Use and Why?](#8-what-loss-functions-did-we-use-and-why)
9. [How Do We Evaluate The Models?](#9-how-do-we-evaluate-the-models)
10. [What Are Our Results?](#10-what-are-our-results)
11. [How Does The Inference Pipeline Work?](#11-how-does-the-inference-pipeline-work)
12. [How Does The Web Application Work?](#12-how-does-the-web-application-work)
13. [What Tools and Technologies Did We Use?](#13-what-tools-and-technologies-did-we-use)
14. [What Is Our Innovation / Contribution?](#14-what-is-our-innovation--contribution)
15. [What Are The Limitations?](#15-what-are-the-limitations)
16. [Future Work](#16-future-work)
17. [How To Present This Project](#17-how-to-present-this-project)

---

## 1. What Is This Project?

This is a **deep learning-based satellite image change detection system** designed for defense and military applications. Given two satellite images of the same geographic location taken at different times (a "before" image and an "after" image), the system automatically identifies **where new construction has occurred** — new buildings, runways, infrastructure, or any structural changes.

The system works like this:

```
Before Image (2015) + After Image (2020) --> Change Mask (highlights new construction)
[empty land]          [buildings appeared]   [white pixels = new structures]
```

We implemented and compared **three different deep learning architectures** — ranging from a simple CNN baseline to a state-of-the-art vision transformer — to understand which approach works best for this task.

---

## 2. Why Did We Build This?

### The Defense Motivation

Modern military intelligence relies heavily on satellite imagery. Analysts need to monitor:

- **Enemy military base expansion** — Are new barracks, hangars, or command centers being built?
- **Runway construction** — Is a new airfield being developed?
- **Infrastructure development** — Are roads, supply depots, or communication towers appearing?
- **Border fortification** — Are defensive structures being erected?

Manually comparing satellite images is **slow, error-prone, and doesn't scale**. A single analyst might need to compare hundreds of image pairs daily. An AI system can do this in seconds with higher accuracy.

### The Deep Learning Motivation

This project demonstrates core deep learning concepts:

- **Transfer learning** — Using ImageNet-pretrained backbones on satellite imagery
- **Siamese architectures** — Processing two inputs through a shared encoder
- **Architecture comparison** — CNN vs UNet++ vs Transformer on the same task
- **Binary segmentation** — Pixel-level classification (changed vs unchanged)
- **End-to-end deployment** — From training to a working web application

---

## 3. What Problem Are We Solving?

### Problem Statement

**Binary Change Detection in Remote Sensing Images**: Given a pair of co-registered satellite images of the same area captured at two different times, classify each pixel as either "changed" or "unchanged".

### Why Is This Hard?

1. **Class imbalance** — In most image pairs, 95-99% of pixels are "no change". Only tiny regions contain actual construction. The model must not simply predict "no change" everywhere.

2. **Irrelevant changes** — Lighting differences, seasonal vegetation changes, cloud shadows, and camera angle variations are NOT actual changes. The model must learn to ignore these.

3. **Scale variation** — Changes can be as small as a single house or as large as an entire housing development. The model needs multi-scale understanding.

4. **Semantic understanding** — The model should detect "empty land became a building" (structural change), not "grass turned brown" (seasonal change).

### Formal Definition

```
Input:  Image A (before) — shape [3, 256, 256] — RGB satellite patch
        Image B (after)  — shape [3, 256, 256] — RGB satellite patch

Output: Mask M — shape [1, 256, 256] — binary (0 = no change, 1 = change)
```

---

## 4. What Dataset Did We Use and Why?

### LEVIR-CD (Large-scale VHR Image Change Detection)

We chose LEVIR-CD because it is the **most widely used benchmark** for building change detection in remote sensing. It provides:

- **637 image pairs** at 1024x1024 resolution (0.5m/pixel from Google Earth)
- **20 different regions** in Texas, USA (Austin, Lakeway, Bee Cave, etc.)
- **Time span**: 2002 to 2018 (5-14 years between image pairs)
- **31,333 annotated building change instances**
- Images annotated by experts and double-checked for quality

### Data Preprocessing

The raw 1024x1024 images are too large for direct model input. We cropped them into **non-overlapping 256x256 patches**:

```
1 image (1024x1024) --> 16 patches (256x256 each)

Total patches:
  Train: 445 images x 16 = 7,120 patch triplets
  Val:    64 images x 16 = 1,024 patch triplets
  Test:  128 images x 16 = 2,048 patch triplets
  Total: 10,192 patch triplets
```
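The cropping step above can be sketched in a few lines of numpy (a minimal sketch; the project's actual preprocessing script may differ in file handling and naming):

```python
import numpy as np

def crop_patches(image: np.ndarray, patch: int = 256) -> list[np.ndarray]:
    """Split an HxWxC image into non-overlapping patch x patch tiles, row-major order."""
    h, w = image.shape[:2]
    return [
        image[y:y + patch, x:x + patch]
        for y in range(0, h - patch + 1, patch)
        for x in range(0, w - patch + 1, patch)
    ]

# A 1024x1024 image yields exactly 16 patches of 256x256.
tiles = crop_patches(np.zeros((1024, 1024, 3), dtype=np.uint8))
assert len(tiles) == 16 and tiles[0].shape == (256, 256, 3)
```

The same function is applied to the A image, the B image, and the label mask, so patch index i always refers to the same ground area in all three.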
Each patch triplet consists of:

- `A/` — Before image (256x256 RGB)
- `B/` — After image (256x256 RGB)
- `label/` — Binary change mask (256x256, 0=unchanged, 255=changed)

### Why Not Military-Specific Data?

Real military satellite imagery is classified and not publicly available. However, **building construction is structurally identical whether it's a civilian house or a military barracks**. A hangar looks like a warehouse. A runway looks like a road. The model learns to detect structural changes from any satellite imagery — the application to military monitoring is in WHERE you point the trained model, not what you train it on. This is the standard approach in defense AI research.

### Data Augmentation

We apply synchronized augmentations to both images AND the mask during training (using albumentations ReplayCompose):

- **Horizontal flip** (p=0.5)
- **Vertical flip** (p=0.5)
- **Random 90-degree rotation** (p=0.5)
- **ImageNet normalization** (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

No augmentation on validation/test sets — only normalization.
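The key requirement is that the random geometric transform applied to image A, image B, and the mask is identical; with albumentations this is what `ReplayCompose` provides (apply once, replay on the other arrays). The principle can be sketched without the library (a numpy stand-in, not the project's actual augmentation code):

```python
import numpy as np

def synced_flip(img_a, img_b, mask, rng):
    """Apply the SAME random flips to both images and the mask."""
    if rng.random() < 0.5:  # horizontal flip, p=0.5
        img_a, img_b, mask = [np.flip(x, axis=1) for x in (img_a, img_b, mask)]
    if rng.random() < 0.5:  # vertical flip, p=0.5
        img_a, img_b, mask = [np.flip(x, axis=0) for x in (img_a, img_b, mask)]
    return img_a, img_b, mask
```

If the flips were sampled independently per array, a pixel marked "changed" in the mask would no longer line up with the building in the images, corrupting the supervision signal.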
---

## 5. What Are The Three Models and Why These Three?

We chose three architectures that represent **three generations of deep learning for dense prediction tasks**:

| Model | Year | Architecture Type | Role in Our Study |
|---|---|---|---|
| Siamese CNN | ~2018 | Convolutional Neural Network | Baseline |
| UNet++ | 2018 | Nested U-Net (encoder-decoder) | Mid-tier |
| ChangeFormer | 2022 | Vision Transformer | State-of-the-art |

### Why These Specific Three?

1. **Siamese CNN** — The simplest approach. Shows what a basic CNN can achieve. Serves as a performance floor — if this already works well, maybe we don't need complex models.

2. **UNet++** — Represents the best of CNN-based segmentation. Its nested skip connections capture multi-scale features. Widely used in medical imaging and remote sensing. Shows what careful architecture design can achieve without transformers.

3. **ChangeFormer** — Represents the latest transformer-based approach. Uses self-attention to capture global context (one building being built might relate to another across the image). Shows whether the complexity of transformers is justified for this task.

### The Common Interface

All three models share the same input/output contract:

```python
def forward(self, x1: Tensor, x2: Tensor) -> Tensor:
    """
    x1: before image [Batch, 3, 256, 256]
    x2: after image  [Batch, 3, 256, 256]
    returns: logits  [Batch, 1, 256, 256] (raw, before sigmoid)
    """
```

This means we can swap models freely without changing any other code.

---

## 6. How Does Each Model Work Internally?

### Model 1: Siamese CNN (Baseline)

**Architecture**: Shared-weight ResNet18 encoder + Transposed Convolution decoder

```
Before Image --> [ResNet18 Encoder] --> Features_A (512 channels, 8x8)
                  | (shared weights)
After Image  --> [ResNet18 Encoder] --> Features_B (512 channels, 8x8)

Difference = |Features_A - Features_B|  (absolute difference)

Difference --> [TransposedConv Decoder] --> Change Mask (1 channel, 256x256)
```

**How it works**:
1. Both images pass through the SAME ResNet18 encoder (shared weights = Siamese)
2. ResNet18 reduces 256x256x3 to 8x8x512 feature maps
3. We compute the absolute difference between the two feature maps
4. A decoder with transposed convolutions upsamples back to 256x256
5. Output is a single-channel logit map (apply sigmoid for probabilities)

**Why shared weights?** If the encoder weights are shared, both images are processed identically. Any difference in the output features is due to actual image content differences, not different processing.

**Parameters**: ~14M
**Strength**: Simple, fast, easy to understand
**Weakness**: No skip connections, loses fine spatial detail during encoding
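The shared-encoder, feature-difference, upsample pattern can be sketched with a toy two-layer encoder standing in for ResNet18 (a minimal PyTorch sketch to show the wiring, not the project's actual model code):

```python
import torch
import torch.nn as nn

class ToySiameseCD(nn.Module):
    """Siamese change detector: shared encoder, |diff| fusion, transposed-conv decoder."""
    def __init__(self):
        super().__init__()
        # Stand-in for ResNet18: two stride-2 convs, 256x256 -> 64x64
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Upsample back: 64x64 -> 256x256, single-channel logits
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 2, stride=2),
        )

    def forward(self, x1, x2):
        f1 = self.encoder(x1)          # same weights process both inputs
        f2 = self.encoder(x2)
        return self.decoder(torch.abs(f1 - f2))  # raw logits, sigmoid applied later

model = ToySiameseCD()
logits = model(torch.randn(2, 3, 256, 256), torch.randn(2, 3, 256, 256))
assert logits.shape == (2, 1, 256, 256)
```

Swapping the toy encoder for `torchvision.models.resnet18` (minus its classification head) gives the real baseline; the fusion and decoder pattern stays the same.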
### Model 2: UNet++ (Mid-tier)

**Architecture**: Shared ResNet34 encoder + Nested UNet++ decoder with dense skip connections

```
Before Image --> [ResNet34 Encoder] --> Multi-scale Features_A
                  | (shared weights)           |
After Image  --> [ResNet34 Encoder] --> Multi-scale Features_B
                                               |
                      |Features_A[i] - Features_B[i]| at each scale
                                               |
                                       [UNet++ Decoder]
                                 (nested skip connections)
                                               |
                                   Change Mask (256x256)
```

**How it works**:
1. ResNet34 encoder extracts features at 5 different scales (from 256x256 down to 8x8)
2. At each scale, we compute the absolute difference between A and B features
3. The UNet++ decoder uses **nested skip connections** — unlike regular UNet which has direct connections, UNet++ has intermediate dense blocks that process features before passing them across
4. This captures both fine details (small buildings) and coarse context (large developments)

**Why UNet++?** Standard UNet has a semantic gap between encoder and decoder features. UNet++ bridges this gap with intermediate convolution blocks, producing more refined predictions.

**Parameters**: ~26M
**Strength**: Excellent multi-scale feature fusion, captures small changes
**Weakness**: More memory intensive than Siamese CNN

### Model 3: ChangeFormer (State-of-the-art)

**Architecture**: Shared MiT-B1 Transformer encoder + MLP decoder

```
Before Image --> [MiT-B1 Transformer Encoder] --> Hierarchical Features_A
                  | (shared weights)                      |
After Image  --> [MiT-B1 Transformer Encoder] --> Hierarchical Features_B
                                                          |
                            |Features_A[i] - Features_B[i]| at 4 stages
                                                          |
                                                   [MLP Decoder]
                                         (multi-scale feature fusion)
                                                          |
                                              Change Mask (256x256)
```

**The MiT (Mix Transformer) Encoder** has 4 hierarchical stages:

| Stage | Resolution | Channels | Attention Heads | Spatial Reduction |
|---|---|---|---|---|
| 1 | 64x64 | 64 | 1 | 8x |
| 2 | 32x32 | 128 | 2 | 4x |
| 3 | 16x16 | 320 | 5 | 2x |
| 4 | 8x8 | 512 | 8 | 1x |

**Key components we implemented from scratch** (~350 lines of custom code):

1. **Overlapping Patch Embedding** — Unlike ViT, which uses non-overlapping patches, MiT uses overlapping convolutions to preserve local continuity.

2. **Efficient Self-Attention** — Standard self-attention is O(N^2). We use spatial reduction: reduce the key/value spatial dimensions before attention, making it computationally feasible for high-resolution images.

3. **Mix-FFN (Feed Forward Network)** — Instead of a standard MLP, uses a depthwise 3x3 convolution inside the FFN to inject positional information without explicit position embeddings.

4. **MLP Decoder** — Projects all 4 scale features to the same dimension, upsamples to full resolution, concatenates, and predicts the change mask.
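The spatial-reduction trick in component 2 can be sketched in isolation (a simplified single-head sketch; the real implementation is multi-head and embedded in the MiT blocks):

```python
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    """Single-head self-attention whose keys/values are spatially downsampled.

    Attention cost drops from O(N^2) to O(N * N/r^2) for reduction ratio r.
    """
    def __init__(self, dim: int, sr_ratio: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.scale = dim ** -0.5

    def forward(self, x, h, w):
        # x: [B, N, C] token sequence with N = h * w
        b, n, c = x.shape
        q = self.q(x)                                        # [B, N, C]
        # Downsample the token grid before computing keys/values
        x2 = x.transpose(1, 2).reshape(b, c, h, w)
        x2 = self.sr(x2).reshape(b, c, -1).transpose(1, 2)   # [B, N/r^2, C]
        k, v = self.kv(x2).chunk(2, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale        # [B, N, N/r^2]
        attn = attn.softmax(dim=-1)
        return attn @ v                                      # [B, N, C]

# Stage 1 numbers from the table above: 64x64 tokens, 64 channels, 8x reduction
attn = SpatialReductionAttention(dim=64, sr_ratio=8)
out = attn(torch.randn(2, 64 * 64, 64), h=64, w=64)
assert out.shape == (2, 64 * 64, 64)
```

With r=8 at stage 1, each of the 4,096 query tokens attends to only 64 reduced key/value tokens instead of all 4,096.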
**Why Transformers for change detection?** Self-attention captures GLOBAL relationships. If a new housing development appears, the attention mechanism can relate the new buildings to nearby road construction — understanding the change holistically rather than pixel-by-pixel.

**Parameters**: ~14M
**Strength**: Global context via self-attention, best at understanding large-scale changes
**Weakness**: Needs more training epochs, higher memory usage

---

## 7. How Is The Training Pipeline Designed?

### Overview

```
Config (YAML) --> Data Loading --> Model --> Loss --> Optimizer --> Training Loop
                                                                        |
                                                          Checkpointing (Drive)
                                                          TensorBoard Logging
                                                          Early Stopping
                                                          Resume Support
```

### Key Training Features

**1. Mixed Precision Training (AMP)**
We use PyTorch's Automatic Mixed Precision. The forward pass runs in FP16 (half precision) for speed, while loss scaling keeps the small FP16 gradients from underflowing in the backward pass. This roughly doubles training speed and halves memory usage.

```python
optimizer.zero_grad()
with autocast():                   # forward in FP16
    logits = model(img_a, img_b)
    loss = criterion(logits, mask)
scaler.scale(loss).backward()      # backward with loss scaling
scaler.step(optimizer)             # optimizer step (skipped on inf/nan grads)
scaler.update()                    # adjust the loss scale for the next step
```

**2. Gradient Accumulation**
For memory-heavy models (ChangeFormer), we accumulate gradients over multiple mini-batches before updating weights. This simulates a larger effective batch size without needing more GPU memory.

```
Effective batch size = actual batch size x accumulation steps
ChangeFormer on T4: batch=4 x accum=2 = effective batch of 8
```
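The accumulation loop can be sketched as follows, with a toy model and synthetic batches standing in for the real ones (illustrative only; the project's training loop also includes AMP and clipping):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, 1)                     # stand-in for the change model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()
accum_steps = 2                                # effective batch = batch x accum_steps

# Synthetic (img_a, img_b, mask) mini-batches
loader = [(torch.randn(4, 3, 8, 8), torch.randn(4, 3, 8, 8),
           torch.randint(0, 2, (4, 1, 8, 8)).float()) for _ in range(4)]

optimizer.zero_grad()
for step, (img_a, img_b, mask) in enumerate(loader):
    logits = model(img_a - img_b)              # toy fusion: difference image
    loss = criterion(logits, mask) / accum_steps   # average over the group
    loss.backward()                            # gradients ADD UP in .grad buffers
    if (step + 1) % accum_steps == 0:
        optimizer.step()                       # one update per accum_steps batches
        optimizer.zero_grad()
```

Dividing the loss by `accum_steps` makes the accumulated gradient equal to that of one large averaged batch.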
**3. Gradient Clipping**
We clip gradient norms to max_norm=1.0 to prevent training instability from exploding gradients, especially important for transformer models.

**4. Learning Rate Schedule: Warmup + Cosine Annealing**
- First 5 epochs: Linear warmup from 0.01x to 1x the base learning rate
- Remaining epochs: Cosine decay to near zero

This prevents early training instability (warmup) and allows fine-grained convergence (cosine decay).
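The schedule above can be sketched with `torch.optim.lr_scheduler.LambdaLR` (the epoch counts follow the text; the project's exact scheduler code may differ):

```python
import math
import torch

warmup_epochs, total_epochs = 5, 100

def lr_scale(epoch: int) -> float:
    """Multiplier on the base LR: linear warmup from 0.01x, then cosine decay to ~0."""
    if epoch < warmup_epochs:
        return 0.01 + (1.0 - 0.01) * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

optimizer = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_scale)
# call scheduler.step() once per epoch after the optimizer updates
```

At epoch 0 the multiplier is 0.01, at epoch 5 it reaches 1.0, and it decays smoothly to 0 by the final epoch.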
**5. Early Stopping**
We monitor the validation F1 score. If it doesn't improve for 15 consecutive epochs, training stops. This prevents overfitting and saves compute time.

**6. Checkpoint Resume**
Because cloud GPU sessions (Colab/Kaggle) can disconnect, we save TWO checkpoints every epoch:
- `model_best.pth` — Best validation F1 so far
- `model_last.pth` — Latest epoch (for resume)

Each checkpoint contains: model weights, optimizer state, scheduler state, GradScaler state, epoch number, and best F1. This allows perfect resume — training continues exactly where it stopped.
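The checkpoint payload can be sketched like this (key names are illustrative, not necessarily the project's; the scheduler and GradScaler states would be added the same way):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)                        # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def save_checkpoint(path, model, optimizer, epoch, best_f1):
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),   # includes Adam moment buffers
        "epoch": epoch,
        "best_f1": best_f1,
    }, path)

def load_checkpoint(path, model, optimizer):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["epoch"], ckpt["best_f1"]      # resume point
```

Saving the optimizer state matters: without Adam's moment buffers, a resumed run would briefly behave like a fresh start and could lose the best F1.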
**7. Auto GPU Detection**
The config contains per-model batch sizes for different GPUs:

| Model | T4 (16GB) | V100 (16GB) | Default |
|---|---|---|---|
| Siamese CNN | 16 | 16 | 8 |
| UNet++ | 8 | 12 | 4 |
| ChangeFormer | 4 | 6 | 2 |

The training script reads `torch.cuda.get_device_name()` and automatically selects the right batch size.

### Optimizer Choice: AdamW

We use AdamW (Adam with decoupled weight decay) because:
- Adam's adaptive learning rates work well for both CNNs and transformers
- Weight decay prevents overfitting
- It's the standard optimizer for transformer training

### Per-Model Hyperparameters

| Hyperparameter | Siamese CNN | UNet++ | ChangeFormer |
|---|---|---|---|
| Learning Rate | 1e-3 | 1e-4 | 6e-5 |
| Epochs | 100 | 100 | 200 |
| Batch Size (T4) | 16 | 8 | 4 |

ChangeFormer gets a lower learning rate and more epochs because transformers need slower, longer training to converge.

---

## 8. What Loss Functions Did We Use and Why?

### The Class Imbalance Problem

In change detection, ~97% of pixels are "no change" and only ~3% are "change". If the model predicts "no change" for every pixel, it gets 97% accuracy but is completely useless. We need loss functions that handle this imbalance.

### BCEDiceLoss (Default)

We combine two losses:

**Binary Cross-Entropy (BCE)**:
```
BCE = -[y * log(p) + (1-y) * log(1-p)]
```
- Standard pixel-wise classification loss
- Treats each pixel independently
- Applied on raw logits (numerically stable)

**Dice Loss**:
```
Dice = 1 - (2 * |P intersection G| + smooth) / (|P| + |G| + smooth)
```
- Measures overlap between predicted and ground truth change regions
- Directly optimizes for the F1 metric
- Less sensitive to class imbalance because it looks at the ratio of overlap, not individual pixels

**Combined**:
```
Loss = 0.5 * BCE + 0.5 * Dice
```

BCE provides stable gradients for learning, Dice pushes toward better F1 scores.
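The combined loss maps directly to a few lines of PyTorch (a minimal sketch matching the formulas above; the project's class may differ in reduction details):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BCEDiceLoss(nn.Module):
    """0.5 * BCE-with-logits + 0.5 * soft Dice, as defined above."""
    def __init__(self, smooth: float = 1.0):
        super().__init__()
        self.smooth = smooth

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        bce = F.binary_cross_entropy_with_logits(logits, target)  # on raw logits
        prob = torch.sigmoid(logits)
        inter = (prob * target).sum()
        dice = 1 - (2 * inter + self.smooth) / (prob.sum() + target.sum() + self.smooth)
        return 0.5 * bce + 0.5 * dice
```

Note the Dice term uses soft probabilities rather than thresholded masks, so it stays differentiable.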
### FocalLoss (Alternative)

```
Focal = -alpha * (1 - p_t)^gamma * log(p_t)
```

- Down-weights easy pixels (clearly "no change")
- Focuses training on hard pixels near decision boundaries
- alpha=0.25, gamma=2.0

We provide both in the config — BCEDiceLoss is the default because it produced better results empirically.

---

## 9. How Do We Evaluate The Models?

### Metrics

All metrics are computed at **threshold=0.5** on the binary change mask:

**F1-Score (Primary Metric)**:
```
F1 = 2 * (Precision * Recall) / (Precision + Recall)
```
The harmonic mean of precision and recall. Our primary metric for model selection and early stopping. Balances detecting all changes (recall) against avoiding false alarms (precision).

**IoU (Intersection over Union / Jaccard Index)**:
```
IoU = TP / (TP + FP + FN)
```
Measures overlap between predicted and true change masks. More stringent than F1 — it penalizes both missed changes and false alarms.

**Precision**:
```
Precision = TP / (TP + FP)
```
"Of all pixels the model predicted as changed, how many actually changed?" High precision = few false alarms.

**Recall**:
```
Recall = TP / (TP + FN)
```
"Of all pixels that actually changed, how many did the model detect?" High recall = few missed changes.

**Overall Accuracy (OA)**:
```
OA = (TP + TN) / (TP + TN + FP + FN)
```
Simple pixel accuracy. Always high (>96%) due to class imbalance — NOT a reliable metric on its own.

### MetricTracker Implementation

We built a `MetricTracker` class that:
1. Takes raw model logits (no manual sigmoid needed)
2. Applies sigmoid + threshold internally
3. Accumulates TP/FP/FN/TN across batches on GPU
4. Only moves scalar results to CPU for final computation
5. Returns all 5 metrics as a dictionary
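A minimal sketch of that class (counts are kept as Python ints here for simplicity; the real class accumulates them as GPU tensors and moves only the final scalars to CPU):

```python
import torch

class MetricTracker:
    """Accumulate confusion counts over batches, then compute the 5 metrics."""
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self.tp = self.fp = self.fn = self.tn = 0

    def update(self, logits: torch.Tensor, target: torch.Tensor) -> None:
        pred = torch.sigmoid(logits) > self.threshold   # sigmoid + threshold internally
        gt = target.bool()
        self.tp += (pred & gt).sum().item()
        self.fp += (pred & ~gt).sum().item()
        self.fn += (~pred & gt).sum().item()
        self.tn += (~pred & ~gt).sum().item()

    def compute(self, eps: float = 1e-8) -> dict:
        precision = self.tp / (self.tp + self.fp + eps)
        recall = self.tp / (self.tp + self.fn + eps)
        return {
            "f1": 2 * precision * recall / (precision + recall + eps),
            "iou": self.tp / (self.tp + self.fp + self.fn + eps),
            "precision": precision,
            "recall": recall,
            "oa": (self.tp + self.tn) / (self.tp + self.tn + self.fp + self.fn + eps),
        }
```

Accumulating raw counts (rather than averaging per-batch metrics) gives the exact dataset-level F1/IoU regardless of batch size.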
### Evaluation Outputs

The evaluation script generates:
- `results.json` — All metrics in machine-readable format
- `prediction_grid.png` — 5 sample predictions (Before | After | Ground Truth | Prediction)
- `predictions/` — 20 individual prediction plots
- `overlays/` — Top 10 most interesting predictions (ranked by change area) with red overlay

---

## 10. What Are Our Results?

### Test Set Performance (LEVIR-CD, 2,048 patches)

| Model | F1 | IoU | Precision | Recall | OA | Epochs Trained |
|---|---|---|---|---|---|---|
| Siamese CNN | 0.6441 | 0.4751 | 0.8084 | 0.5353 | 0.9699 | 3* |
| **UNet++** | **0.9035** | **0.8240** | **0.9280** | **0.8803** | **0.9904** | 85 |
| ChangeFormer | 0.8836 | 0.7915 | 0.8944 | 0.8731 | 0.9883 | 141 |

*\*Siamese CNN was undertrained due to a session interruption (3 epochs instead of 100). With full training we estimate it would reach F1 ~0.82-0.85.*

### Analysis

1. **UNet++ achieved the best F1 (0.9035)** — Its nested skip connections excel at capturing multi-scale building changes. It effectively bridges fine-grained spatial details with high-level semantic understanding.

2. **ChangeFormer achieved F1 0.8836** — Very competitive but slightly below UNet++. The transformer's global attention helps with large-scale changes, but the relatively small patch size (256x256) limits the advantage of global context.

3. **Siamese CNN (undertrained)** — With only 3 epochs, it shows the baseline capability. Its high precision (0.808) but low recall (0.535) means it's conservative — it catches changes it's confident about but misses many subtle ones.

4. **All models achieve >96% OA** — This highlights why overall accuracy alone is misleading for imbalanced problems. Even a model that predicts "no change" everywhere would get ~97% OA.

### Key Insight

UNet++'s superior performance suggests that **multi-scale feature fusion with skip connections is more important than global self-attention** for this particular task and patch size. The nested decoder effectively captures both small buildings (low-level features) and large developments (high-level features).

---

## 11. How Does The Inference Pipeline Work?

For real-world use, satellite images are much larger than 256x256. Our inference pipeline handles **any resolution** through sliding-window (tiled) inference:

```
Input Image (e.g., 1024x1024)
        |
        v
Pad to nearest multiple of 256
        |
        v
Tile into 256x256 non-overlapping patches
        |
        v
Run model on each patch pair
        |
        v
Stitch probability maps back together
        |
        v
Crop to original size
        |
        v
Apply threshold (0.5) --> Binary change mask
```

This means the system works on images of any size — from 256x256 test patches to full 4000x4000 satellite imagery.
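The pad, tile, predict, stitch, crop steps above can be sketched in numpy (a minimal sketch; `predict` here is any callable mapping two patches to a probability map, standing in for the real model):

```python
import numpy as np

def tiled_inference(img_a, img_b, predict, tile=256):
    """Pad -> tile -> predict per tile -> stitch -> crop -> threshold."""
    h, w = img_a.shape[:2]
    ph, pw = -h % tile, -w % tile                  # pad up to the next tile multiple
    pad = ((0, ph), (0, pw), (0, 0))
    a, b = np.pad(img_a, pad), np.pad(img_b, pad)
    prob = np.zeros(a.shape[:2], dtype=np.float32)
    for y in range(0, a.shape[0], tile):           # non-overlapping grid
        for x in range(0, a.shape[1], tile):
            prob[y:y + tile, x:x + tile] = predict(
                a[y:y + tile, x:x + tile], b[y:y + tile, x:x + tile])
    return (prob[:h, :w] > 0.5).astype(np.uint8)   # crop + threshold -> binary mask

# Toy "model": mean absolute difference per pixel as the change probability
mask = tiled_inference(
    np.zeros((300, 520, 3), np.float32), np.ones((300, 520, 3), np.float32),
    predict=lambda a, b: np.abs(a - b).mean(axis=2), tile=256)
assert mask.shape == (300, 520) and mask.all()
```

Because padding is cropped away at the end, the output mask always matches the input resolution exactly.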
| 517 |
-
|
| 518 |
-
### Outputs
|
| 519 |
-
|
| 520 |
-
1. **Binary change mask** (PNG) — White pixels = change detected
|
| 521 |
-
2. **Overlay visualization** — After image with detected changes highlighted in red
|
| 522 |
-
3. **Change statistics** — Percentage of area changed, pixel counts
|
| 523 |
-
|
| 524 |
-
---
|
| 525 |
-
|
| 526 |
-
## 12. How Does The Web Application Work?
|
| 527 |
-
|
| 528 |
-
We built an interactive web interface using **Gradio** that allows anyone to use the model without any coding knowledge:
|
| 529 |
-
|
| 530 |
-
### User Flow
|
| 531 |
-
|
| 532 |
-
1. Upload a "Before" satellite image
|
| 533 |
-
2. Upload an "After" satellite image
|
| 534 |
-
3. Select a model from the dropdown (auto-detects available checkpoints)
|
| 535 |
-
4. Adjust the detection threshold if needed (default 0.5)
|
| 536 |
-
5. Click "Detect Changes"
|
| 537 |
-
6. View results: change mask, red overlay, and statistics table
|
| 538 |
-
|
| 539 |
-
### Technical Details
|
| 540 |
-
|
| 541 |
-
- **Auto-checkpoint detection** — The app scans multiple directories for checkpoint files and only shows models that have checkpoints available
|
| 542 |
-
- **Model caching** — Once a model is loaded, it stays in memory for instant subsequent predictions
|
| 543 |
-
- **CPU fallback** — Works without GPU (slower but functional)
|
| 544 |
-
- **Any image size** — Uses the same tiled inference pipeline
|
| 545 |
-
- **Public sharing** — Can generate a public URL for remote access
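The model-caching behaviour boils down to a small keyed cache. A minimal sketch (the `get_model` helper and its signature are illustrative, not the app's actual code):

```python
# Cache keyed by checkpoint path: each model is loaded at most once,
# so repeated predictions skip the expensive weight-loading step.
_MODEL_CACHE = {}

def get_model(checkpoint_path, loader):
    """Return the cached model for this checkpoint, loading it on first use."""
    if checkpoint_path not in _MODEL_CACHE:
        _MODEL_CACHE[checkpoint_path] = loader(checkpoint_path)
    return _MODEL_CACHE[checkpoint_path]
```

Switching models in the dropdown therefore only pays the load cost the first time each checkpoint is selected.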
---

## 13. What Tools and Technologies Did We Use?

### Core Framework

| Tool | Purpose | Why We Chose It |
|---|---|---|
| **PyTorch 2.x** | Deep learning framework | Industry standard, dynamic computation graph, excellent GPU support |
| **Python 3.10+** | Programming language | De facto language for ML/DL |

### Model Libraries

| Library | Purpose | Why |
|---|---|---|
| **torchvision** | ResNet18/34 pretrained backbones | Official PyTorch model zoo |
| **segmentation-models-pytorch (SMP)** | UNet++ architecture | Best-maintained segmentation library, provides encoder-decoder framework |
| **timm** | Transformer utilities | State-of-the-art vision model components |
| **einops** | Tensor rearrangement | Clean, readable tensor reshaping for transformer code |

### Data Processing

| Library | Purpose | Why |
|---|---|---|
| **albumentations** | Image augmentation | Fast, GPU-friendly, supports ReplayCompose for synchronized transforms |
| **OpenCV** | Image I/O | Fast image reading/writing, supports multiple formats |
| **NumPy** | Array operations | Foundation for all numerical computation |

### Training Infrastructure

| Tool | Purpose | Why |
|---|---|---|
| **TensorBoard** | Training visualization | Real-time loss curves, metric tracking, prediction grids |
| **Google Colab / Kaggle** | Cloud GPU | Free T4/P100 GPUs for training |
| **Google Drive** | Persistent storage | Checkpoints survive Colab disconnections |
| **YAML** | Configuration | Human-readable, all hyperparameters in one place |

### Deployment

| Tool | Purpose | Why |
|---|---|---|
| **Gradio** | Web interface | Fastest way to create ML demos, no frontend code needed |

---
## 14. What Is Our Innovation / Contribution?

### 1. Unified Multi-Architecture Comparison Framework

We built a single codebase that trains, evaluates, and deploys three fundamentally different architectures (CNN, UNet++, Transformer) under identical conditions — same data, same augmentations, same loss function, same metrics. Most papers only present one model. Our framework enables fair comparison.

### 2. Defense Application Framing

We contextualized general change detection for military surveillance applications — monitoring base expansion, runway construction, and infrastructure development. The same technology used for urban planning is directly applicable to defense intelligence.

### 3. Custom ChangeFormer Implementation

The ChangeFormer transformer is implemented from scratch (~350 lines of custom PyTorch code), not imported from a library:
- Overlapping Patch Embeddings
- Efficient Self-Attention with Spatial Reduction
- Mix Feed-Forward Networks with Depthwise Convolutions
- Hierarchical 4-stage Encoder
- Multi-scale MLP Decoder

### 4. Production-Ready Pipeline

This is not just a training notebook — it's a complete system:
- Automated data download and preprocessing
- Resume-capable training with cloud storage
- Tiled inference for any-resolution images
- Interactive web application for non-technical users
- Auto GPU detection and batch size optimization

### 5. Custom Loss and Metrics

We implemented BCEDiceLoss (combines classification and overlap objectives) and a MetricTracker that operates on GPU tensors for efficient evaluation.
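The idea behind the combined objective can be shown numerically. This numpy sketch mirrors the structure of a BCE + Dice loss; the project's actual `BCEDiceLoss` is a PyTorch module operating on logits:

```python
import numpy as np

def bce_dice_loss(probs, target, eps=1e-7):
    """Illustrative BCE + Dice loss on probability maps in [0, 1]."""
    probs = np.clip(probs, eps, 1 - eps)
    # Binary cross-entropy: per-pixel classification term
    bce = -np.mean(target * np.log(probs) + (1 - target) * np.log(1 - probs))
    # Dice: overlap term, robust to the changed/unchanged class imbalance
    inter = np.sum(probs * target)
    dice = 1 - (2 * inter + eps) / (np.sum(probs) + np.sum(target) + eps)
    return bce + dice
```

BCE pushes each pixel toward the right class, while the Dice term rewards overall mask overlap even when changed pixels are rare.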
---

## 15. What Are The Limitations?

1. **Training data is civilian** — Trained on LEVIR-CD (civilian buildings in Texas). While structurally similar to military construction, the model hasn't seen actual military facilities, camouflaged structures, or underground bunkers.

2. **Single geographic region** — LEVIR-CD covers only Texas, USA. Performance may degrade on satellite imagery from different geographic regions with different building styles, vegetation, or terrain.

3. **Fixed resolution** — Trained on 0.5m/pixel resolution. Lower resolution imagery (e.g., Sentinel-2 at 10m/pixel) would require retraining.

4. **No temporal reasoning** — The model only sees two time points. It cannot track gradual construction progress over multiple time steps.

5. **Lighting sensitivity** — Significant illumination differences between before/after images can cause false positives or missed detections.

6. **Siamese CNN undertrained** — Due to session interruptions, the Siamese CNN baseline was only trained for 3 epochs, not providing a fair comparison point.

---
## 16. Future Work

1. **Military-specific fine-tuning** — Fine-tune on declassified military satellite imagery to improve detection of defense-specific structures.

2. **Multi-temporal analysis** — Extend from 2 timestamps to a sequence, tracking construction progress over months/years.

3. **Object-level detection** — Instead of just pixel masks, classify WHAT changed (building, road, runway, vehicle).

4. **Model ensemble** — Combine predictions from all three models for higher accuracy.

5. **Attention visualization** — Show which parts of the image the transformer attends to, providing explainability for intelligence analysts.

6. **Real-time satellite feed** — Connect to live satellite imagery APIs for continuous monitoring.

7. **Deploy on Hugging Face Spaces** — Create a permanent public URL for the web demo.

---
## 17. How To Present This Project

### Opening (1 minute)

> "We built an AI system that monitors military base construction from satellite imagery. You give it two satellite photos — one old, one new — and it highlights exactly what changed: new buildings, new runways, new infrastructure. We compared three deep learning approaches and achieved 90% F1 score."

### Show The Demo (2 minutes)

1. Open the Gradio app (localhost:7860 or public URL)
2. Upload a before/after pair from the test set
3. Show the change detection output
4. Switch between models to show different predictions
5. Adjust the threshold slider

### Show The Results (1 minute)

Present the comparison table:

| Model | F1 | IoU | Architecture |
|---|---|---|---|
| Siamese CNN | 0.64 | 0.48 | Basic CNN |
| ChangeFormer | 0.88 | 0.79 | Transformer |
| **UNet++** | **0.90** | **0.82** | **Nested UNet** |

> "UNet++ achieved the best results. Its nested skip connections are ideal for multi-scale change detection. Interestingly, it outperformed the more complex transformer model, suggesting that architectural inductive biases (convolutions that understand local spatial structure) are more important than global self-attention for 256x256 patches."

### Answer Common Questions

**Q: "You used readymade models?"**
> "The backbones (ResNet, MiT) are pretrained on ImageNet — that's transfer learning, standard practice. But the change detection architecture is custom — Siamese encoding, feature differencing, and the full ChangeFormer transformer are written from scratch. We also wrote custom loss functions and a complete training pipeline."

**Q: "What's novel?"**
> "The systematic comparison of three generations of deep learning on defense surveillance, packaged as a deployable web application. We show that UNet++ outperforms transformers for this task and patch size — a non-obvious finding that challenges the assumption that newer = better."

**Q: "How is this military?"**
> "Military bases are buildings and infrastructure. The model detects new construction from satellite imagery. Point it at a known military zone and it becomes a defense intelligence tool. The technology is the same — the application context makes it military."
MODELS_EXPLAINED.md DELETED

@@ -1,573 +0,0 @@
# Deep Dive: Models, Transfer Learning, and Fine-Tuning Explained

## Table of Contents

1. [What Is Transfer Learning and Why ImageNet?](#1-what-is-transfer-learning-and-why-imagenet)
2. [What Exactly Did We Fine-Tune?](#2-what-exactly-did-we-fine-tune)
3. [Model 1: Siamese CNN — Explained Like You're Teaching It](#3-model-1-siamese-cnn)
4. [Model 2: UNet++ — Why A Medical Model Works For Satellites](#4-model-2-unet)
5. [Model 3: ChangeFormer — The Transformer Approach](#5-model-3-changeformer)
6. [Why UNet++ Even Though It's A Medical Model?](#6-why-unet-even-though-its-a-medical-model)
7. [What Happens Inside During Inference — Step By Step](#7-what-happens-inside-during-inference)
8. [How To Explain This To Faculty](#8-how-to-explain-this-to-faculty)

---

## 1. What Is Transfer Learning and Why ImageNet?

### The Problem With Training From Scratch

A deep learning model needs to learn TWO things:
1. **Low-level features** — edges, textures, corners, gradients, colors
2. **High-level features** — objects, shapes, spatial relationships

Learning low-level features from scratch takes millions of images and days of training. But here's the key insight: **edges look the same everywhere**. An edge in a cat photo looks the same as an edge in a satellite photo. A texture gradient in a car image is structurally identical to a texture gradient in a building image.

### What Is ImageNet?

ImageNet is a dataset of **14 million images**; the widely used ILSVRC pretraining subset contains 1.28 million of them across 1000 categories (cats, dogs, cars, planes, buildings, landscapes, etc.). Models trained on ImageNet learn incredibly rich low-level and mid-level features because they've seen enormous visual diversity.

### What Is Transfer Learning?

Instead of training from scratch (random weights), we START with weights that were trained on ImageNet. This gives us:

```
FROM SCRATCH:
Random weights --> [needs millions of images] --> Learns edges --> Learns textures --> Learns shapes --> Learns objects
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
                   THIS TAKES FOREVER

TRANSFER LEARNING:
ImageNet weights --> [already knows edges, textures, shapes] --> [needs few thousand images] --> Learns satellite-specific patterns
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                     FREE - comes with pretrained weights        THIS IS FAST
```

### Analogy

Think of it like learning a new language:
- **From scratch**: A baby learning their first language — takes years
- **Transfer learning**: A person who speaks English learning Spanish — much faster because they already understand grammar, sentence structure, and many shared words

ImageNet pretraining = knowing English. Satellite change detection = learning Spanish. The foundation transfers.

---

## 2. What Exactly Did We Fine-Tune?

### What "Fine-Tuning" Means

Fine-tuning means we took the pretrained ImageNet weights and **continued training them on our satellite data**. We didn't freeze anything — ALL layers were updated. This is called **end-to-end fine-tuning**.

### What Changed During Fine-Tuning

```
BEFORE Fine-Tuning (ImageNet weights):
Layer 1: Detects generic edges, gradients
Layer 2: Detects generic textures, patterns
Layer 3: Detects generic shapes (circles, rectangles)
Layer 4: Detects generic objects (cat face, car wheel)
         ^^^^ These are useful for ANYTHING visual

AFTER Fine-Tuning (our satellite training):
Layer 1: Still detects edges (barely changed — edges are universal)
Layer 2: Detects satellite-specific textures (roof patterns, road textures)
Layer 3: Detects building footprints, road shapes
Layer 4: Detects "new building appeared" vs "same building"
         ^^^^ Early layers changed little, later layers changed a LOT
```

### The Numbers

| Model | Total Parameters | Pretrained (from ImageNet) | New (trained from scratch) |
|---|---|---|---|
| Siamese CNN | 14M | 11M (ResNet18 encoder) | 3M (decoder) |
| UNet++ | 26M | 21M (ResNet34 encoder) | 5M (decoder) |
| ChangeFormer | 14M | 0 (trained from scratch) | 14M (everything) |

**Key point**: For Siamese CNN and UNet++, the ENCODER (feature extractor) is pretrained. The DECODER (change mask generator) is trained from scratch. During fine-tuning, both encoder AND decoder are updated, but the encoder starts from a much better position.

**ChangeFormer is different**: We wrote the entire architecture from scratch. There are no widely available pretrained MiT-B1 weights for change detection, so we trained everything from random initialization. This is why it needs 200 epochs instead of 100.

### What Does The Training Actually Do?

Each training step:
1. Feed a before/after image pair through the model
2. Model outputs a predicted change mask
3. Compare prediction with ground truth using BCEDiceLoss
4. Compute gradients (how much each weight contributed to the error)
5. Update ALL weights slightly in the direction that reduces error
6. Repeat 7,120 times per epoch (one per training sample)
7. Repeat for 85-141 epochs

After training:
- Early layers (edges, textures): changed ~5-10% from ImageNet values
- Middle layers (shapes, patterns): changed ~20-40%
- Late layers (semantic understanding): changed ~60-90%
- Decoder layers: learned entirely from our data
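Steps 1-5 above are ordinary gradient descent. A toy numpy version on a single weight makes the update rule concrete (purely illustrative; the real trainer updates millions of weights via autograd):

```python
import numpy as np

def sgd_step(w, x, y, lr=0.1):
    """One training step for a 1-parameter model p = sigmoid(w * x)
    with binary cross-entropy loss, whose gradient is (p - y) * x."""
    p = 1.0 / (1.0 + np.exp(-w * x))   # forward pass: prediction
    grad = (p - y) * x                 # backward pass: gradient of loss
    return w - lr * grad               # update weight against the gradient
```

Repeating this step nudges the prediction toward the label, which is exactly what happens 7,120 times per epoch at full scale.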
---

## 3. Model 1: Siamese CNN

### What Is "Siamese"?

"Siamese" means twins — like Siamese twins. The model has TWO identical paths that share the SAME weights:

```
Image A (Before) ----\
                      [Same ResNet18] ---- Features A
Image B (After)  ----/                     Features B
                      ^^^^^^^^^^^^^
                      SHARED WEIGHTS
                      (not two separate networks)
```

**Why shared?** If both images go through the EXACT same processing, then any difference in the output features MUST be because the images themselves are different. The shared weights act as a fair, unbiased feature extractor.

### ResNet18 Encoder — Step by Step

ResNet18 is a Convolutional Neural Network with 18 layers. Here's what happens to a 256x256 satellite image:

```
Input: [3, 256, 256]   (3 = RGB channels)
  |
  v
Conv1 + BN + ReLU + MaxPool
  |   --> [64, 64, 64]   (64 feature channels, spatial size reduced to 64x64)
  v
Layer 1 (2 residual blocks)
  |   --> [64, 64, 64]   (same size, refined features)
  v
Layer 2 (2 residual blocks)
  |   --> [128, 32, 32]  (more channels, smaller spatial)
  v
Layer 3 (2 residual blocks)
  |   --> [256, 16, 16]  (even more channels, even smaller)
  v
Layer 4 (2 residual blocks)
  |   --> [512, 8, 8]    (512 feature channels, 8x8 spatial grid)
  v
Output: Rich feature representation
```

Each "residual block" has the famous skip connection:
```
input ----> [Conv -> BN -> ReLU -> Conv -> BN] ----> ADD ----> ReLU ----> output
  |                                                   ^
  |_____________(identity shortcut)___________________|
```

The skip connection solves the vanishing gradient problem — gradients can flow directly through the shortcut, making deep networks trainable.

### The Difference Operation

After encoding both images:
```
Features_A: [512, 8, 8]   (before image encoded)
Features_B: [512, 8, 8]   (after image encoded)

Difference = |Features_A - Features_B|   (absolute difference, element-wise)
Result: [512, 8, 8]   (where values are high = something changed)
```

If a pixel in Features_A has value 0.8 and the same pixel in Features_B has value 0.2, the difference is 0.6 — meaning this region changed significantly.
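The same 0.8 vs 0.2 example as an element-wise numpy operation (toy feature maps, not real encoder output):

```python
import numpy as np

# Toy encoded features for before/after images: (channels, height, width)
feats_a = np.zeros((512, 8, 8), dtype=np.float32)
feats_b = np.zeros((512, 8, 8), dtype=np.float32)
feats_a[:, 2, 3] = 0.8   # one region differs between the two encodings...
feats_b[:, 2, 3] = 0.2   # ...so the feature values disagree there

diff = np.abs(feats_a - feats_b)      # element-wise |A - B|, still (512, 8, 8)
change_strength = diff.mean(axis=0)   # (8, 8) map: high value = changed region
```

Averaging over channels collapses the 512-dim difference into a single per-location change score.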
### The Decoder — Transposed Convolutions

Now we need to go from 8x8 back to 256x256. Transposed convolution (also called "deconvolution") does upsampling:

```
[512, 8, 8]
  | TransposedConv + BN + ReLU
  v
[256, 16, 16]
  | TransposedConv + BN + ReLU
  v
[128, 32, 32]
  | TransposedConv + BN + ReLU
  v
[64, 64, 64]
  | TransposedConv + BN + ReLU
  v
[32, 128, 128]
  | TransposedConv (final)
  v
[1, 256, 256]   <-- Change mask! (raw logits, apply sigmoid for probabilities)
```
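Each stage above doubles the spatial size. A small helper shows how the standard transposed-convolution output formula produces the 8 -> 256 ladder, assuming a common kernel/stride/padding choice of 4/2/1 (illustrative; the project's exact layer hyperparameters may differ):

```python
def tconv_out(h_in, kernel=4, stride=2, padding=1):
    """Output size of a transposed convolution (output_padding = 0):
    (H_in - 1) * stride - 2 * padding + kernel."""
    return (h_in - 1) * stride - 2 * padding + kernel

# Walk the decoder ladder from the 8x8 bottleneck up to 256x256
sizes = [8]
while sizes[-1] < 256:
    sizes.append(tconv_out(sizes[-1]))
# sizes -> [8, 16, 32, 64, 128, 256]
```

With these settings every stage exactly doubles the resolution, so five stages recover the input size.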
### Weakness

The encoder compresses 256x256 down to 8x8 — that's a 32x reduction. Fine spatial details are lost. A small building that's 10x10 pixels becomes less than 1 pixel in the 8x8 feature map. The decoder tries to reconstruct this but without skip connections (unlike UNet), it struggles with precise localization.

---

## 4. Model 2: UNet++

### First, What Is Regular UNet?

UNet was invented for medical image segmentation (detecting tumors in brain scans). It has an **encoder-decoder structure with skip connections**:

```
ENCODER (downsampling)               DECODER (upsampling)
[256x256] ----skip connection----> [256x256]
   |                                   ^
[128x128] ----skip connection----> [128x128]
   |                                   ^
[64x64]   ----skip connection----> [64x64]
   |                                   ^
[32x32]   ----skip connection----> [32x32]
   |                                   ^
[16x16]   ------bottleneck-------> [16x16]
```

The skip connections DIRECTLY copy encoder features to the decoder. This means the decoder has access to BOTH:
- High-level semantic info (from the bottleneck): "this region has a building"
- Low-level spatial detail (from skip connections): "the exact edge of the building is here"

### What Makes UNet++ Different From UNet?

Regular UNet's problem: the skip connections connect features at very different semantic levels. The encoder at level 2 produces "edge features" while the decoder at level 2 needs "building boundary features". There's a **semantic gap**.

UNet++ fixes this with **nested intermediate blocks**:

```
Regular UNet:
Encoder --------direct skip--------> Decoder
(raw features)                       (needs processed features)
               ^^ SEMANTIC GAP ^^

UNet++:
Encoder ----> [Block] ----> [Block] ----> Decoder
(raw)         (processed)   (more processed)   (ready to use)
              ^^^^^^^^^^^^^^^^^^^^^^^^
              NESTED DENSE BLOCKS bridge the gap
```

In detail:
```
X(0,0) ---------> X(0,1) ---------> X(0,2) ---------> X(0,3) ---------> X(0,4)
  |                 |                 |                 |
X(1,0) ---------> X(1,1) ---------> X(1,2) ---------> X(1,3)
  |                 |                 |
X(2,0) ---------> X(2,1) ---------> X(2,2)
  |                 |
X(3,0) ---------> X(3,1)
  |
X(4,0)   (bottleneck)
```

Each X(i,j) node receives inputs from:
- The node below it (deeper features)
- ALL previous nodes at the same level (dense connections)

This means by the time features reach the output, they've been progressively refined through multiple intermediate processing stages.

### How We Adapted UNet++ For Change Detection

Original UNet++ takes ONE image and segments it. We adapted it for TWO images:

```
Image A (Before) --> [ResNet34 Encoder] --> Features at 5 scales
                     |   (shared weights)
Image B (After)  --> [ResNet34 Encoder] --> Features at 5 scales

At each scale:
    diff[i] = |Features_A[i] - Features_B[i]|

diff features --> [UNet++ Decoder with nested skip connections] --> Change Mask
```

We use ResNet34 (34 layers, deeper than ResNet18) as the encoder via the `segmentation-models-pytorch` library, which provides the UNet++ decoder architecture.

### Why ResNet34 Instead of ResNet18?

ResNet34 has more layers and captures richer features:
- ResNet18: [2, 2, 2, 2] blocks = 18 layers
- ResNet34: [3, 4, 6, 3] blocks = 34 layers

More depth = better feature extraction, especially for the subtle differences between before/after satellite images.

---

## 5. Model 3: ChangeFormer

### What Is A Vision Transformer?

Traditional CNNs look at LOCAL regions (3x3 or 5x5 patches). Transformers look at GLOBAL relationships — every part of the image can attend to every other part.

### The Self-Attention Mechanism

For a given position in the image, self-attention asks: "Which OTHER positions in this image are relevant to understanding THIS position?"

```
Example: A new building appears in the top-left
Self-attention notices:
- New road appeared nearby (related construction)
- Parking lot appeared on the right (part of same development)
- Trees on the south side were cleared (preparation for construction)

A CNN would process each region independently.
A Transformer connects them all.
```

### How Self-Attention Works (Simplified)

For each pixel position:
1. Create a **Query** (Q): "What am I looking for?"
2. Create a **Key** (K): "What information do I have?"
3. Create a **Value** (V): "What information can I give?"

```
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d)) * V
```

- Q * K^T: How relevant is each position to me? (attention score)
- softmax: Normalize to probabilities
- * V: Weight the values by attention scores
- / sqrt(d): Scale factor for numerical stability
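The formula above maps directly to a few lines of numpy (a generic scaled dot-product attention sketch, not the project's ChangeFormer module):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (n_q, n_k) relevance scores
    scores -= scores.max(axis=-1, keepdims=True)   # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax: rows sum to 1
    return weights @ V                             # attention-weighted values
```

Because each output row is a convex combination of the value rows, every output stays within the per-column range of V.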
### The MiT-B1 Architecture

MiT (Mix Transformer) is a hierarchical transformer — unlike ViT which processes the image at one scale, MiT processes at 4 scales (like a CNN):

**Stage 1 (64x64, 64 channels)**:
```
256x256 image
  |
Overlapping Patch Embed (7x7 conv, stride 4)
  |
64x64 grid of 64-dim tokens (4096 tokens)
  |
2x [Efficient Self-Attention + Mix-FFN]
  |
Output: [64, 64, 64] features
```

**Stage 2 (32x32, 128 channels)**:
```
Overlapping Patch Embed (3x3 conv, stride 2)
  |
32x32 grid of 128-dim tokens (1024 tokens)
  |
2x [Efficient Self-Attention + Mix-FFN]
  |
Output: [128, 32, 32] features
```

**Stage 3 (16x16, 320 channels)** and **Stage 4 (8x8, 512 channels)** follow the same pattern.

### Efficient Self-Attention

Standard self-attention on 64x64 = 4096 tokens would require a 4096x4096 attention matrix — too expensive. We use **Spatial Reduction**:

```
Standard:  Q (4096 tokens) x K (4096 tokens) = 16.8M attention scores (TOO SLOW)

Efficient:
Q stays at 4096 tokens
K and V are spatially reduced: 4096 -> 64 tokens (reduction ratio 8 per spatial side)
Q (4096) x K (64) = 262K attention scores (64x cheaper!)
```

This is done via a strided convolution that reduces K and V before computing attention.
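The cost saving can be checked with a quick sketch. Here average pooling over groups of tokens stands in for the strided-conv reduction (an assumption for illustration; the real module learns the reduction):

```python
import numpy as np

n, d, r = 4096, 64, 8              # tokens, channel dim, per-side reduction ratio
K = np.random.rand(n, d)

# Spatial reduction: collapse every r*r tokens into one by averaging,
# so K shrinks from 4096 tokens to 4096 / 64 = 64 tokens
K_reduced = K.reshape(n // (r * r), r * r, d).mean(axis=1)

full_scores = n * n                          # 16,777,216 score entries
eff_scores = n * K_reduced.shape[0]          # 262,144 score entries
```

The attention score matrix drops from 4096x4096 to 4096x64, a 64x reduction in both memory and compute for that matrix.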
### Mix-FFN

Standard transformers use a simple MLP (Linear -> GELU -> Linear) after attention. Mix-FFN adds a **depthwise 3x3 convolution** in the middle:

```
Standard FFN:  Linear -> GELU -> Linear
Mix-FFN:       Linear -> DepthwiseConv3x3 -> GELU -> Linear
                         ^^^^^^^^^^^^^^^^^
                         Injects local spatial information
```

Why? Pure transformers have no notion of "nearby pixels". The depthwise conv brings back local spatial awareness without the cost of full convolutions. This eliminates the need for explicit position embeddings.

### The MLP Decoder

After the encoder produces features at 4 scales, the decoder fuses them:

```
Stage 1 features: [64, 64, 64]  --[1x1 Conv]--> [64, 64, 64] --[Upsample]--> [64, 64, 64]
Stage 2 features: [128, 32, 32] --[1x1 Conv]--> [64, 32, 32] --[Upsample]--> [64, 64, 64]
Stage 3 features: [320, 16, 16] --[1x1 Conv]--> [64, 16, 16] --[Upsample]--> [64, 64, 64]
Stage 4 features: [512, 8, 8]   --[1x1 Conv]--> [64, 8, 8]   --[Upsample]--> [64, 64, 64]

Concatenate all: [256, 64, 64]
  |
[1x1 Conv + BN + ReLU] --> [64, 64, 64]
  |
[1x1 Conv] --> [1, 64, 64]
  |
[Upsample 4x] --> [1, 256, 256]   <-- Final change mask
```

All scales are projected to the same dimension (64), upsampled to the same size (64x64), concatenated, and fused with a simple 1x1 convolution.
|
| 408 |
-
|
| 409 |
-
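The fusion diagram above can be sketched directly in PyTorch (a simplified stand-in, not the repository's decoder; channel counts follow the diagram):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPDecoder(nn.Module):
    """Project every scale to 64 channels, upsample to 64x64, concat, fuse."""

    def __init__(self, in_dims=(64, 128, 320, 512), embed_dim: int = 64) -> None:
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(d, embed_dim, 1) for d in in_dims)
        self.fuse = nn.Sequential(
            nn.Conv2d(4 * embed_dim, embed_dim, 1),
            nn.BatchNorm2d(embed_dim),
            nn.ReLU(inplace=True),
        )
        self.head = nn.Conv2d(embed_dim, 1, 1)  # 1 change logit per pixel

    def forward(self, feats):
        target = feats[0].shape[-2:]  # 64x64, the finest scale
        ups = [
            F.interpolate(p(f), size=target, mode="bilinear", align_corners=False)
            for p, f in zip(self.proj, feats)
        ]
        x = self.head(self.fuse(torch.cat(ups, dim=1)))  # [B, 1, 64, 64]
        return F.interpolate(x, scale_factor=4, mode="bilinear", align_corners=False)

feats = [torch.randn(1, c, s, s) for c, s in [(64, 64), (128, 32), (320, 16), (512, 8)]]
mask_logits = MLPDecoder()(feats)
print(mask_logits.shape)  # torch.Size([1, 1, 256, 256])
```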
---

## 6. Why UNet++ Even Though It's A Medical Model?

This is a great question and one your faculty will likely ask. Here's the answer:

### The Core Insight: Segmentation Is Segmentation

UNet++ was designed for **medical image segmentation** — detecting tumor boundaries in CT scans, cell boundaries in microscopy, organ boundaries in MRI. But what IS segmentation?

```
Medical:   Input image --> Classify each pixel as (tumor / not tumor)
Satellite: Input image --> Classify each pixel as (changed / not changed)
```

**The task is structurally identical.** Both are binary pixel-level classification problems with:

| Property | Medical | Satellite Change Detection |
|---|---|---|
| Task | Pixel classification | Pixel classification |
| Output | Binary mask | Binary mask |
| Class imbalance | Tumor is tiny vs whole brain | Changed area is tiny vs whole image |
| Multi-scale | Tumors vary from 5px to 500px | Buildings vary from 10px to 200px |
| Needs precise boundaries | Yes (surgical planning) | Yes (accurate change mapping) |

### Why UNet++ Is Especially Good For This

1. **Multi-scale feature fusion** — Buildings come in all sizes. A small shed (10x10px) needs fine features. A large warehouse (100x100px) needs coarse features. UNet++'s nested skip connections fuse ALL scales.

2. **Precise boundary detection** — The skip connections preserve spatial detail. Change detection needs precise boundaries — "exactly WHICH pixels changed?"

3. **Handles class imbalance** — In both medical and satellite tasks, the "positive" class (tumor/change) is tiny. UNet++ was designed for this.

4. **Proven architecture** — It's not just medical anymore. UNet++ is used in:
   - Remote sensing (satellite segmentation)
   - Autonomous driving (road segmentation)
   - Industrial inspection (defect detection)
   - Agriculture (crop segmentation)

### The Adaptation We Made

Original UNet++: takes ONE image and segments it.
Our UNet++: takes TWO images through a SHARED encoder, computes feature differences, then decodes.

```
Standard UNet++:
  1 image  --> Encoder --> Decoder --> Segmentation mask

Our Adaptation:
  2 images --> Shared Encoder --> Feature Difference --> Decoder --> Change mask
```

This is NOT just "using UNet++ out of the box". We modified the architecture to handle bitemporal (two-image) input. The encoder is shared (Siamese), and we compute multi-scale feature differences before feeding into the decoder.

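The shared-encoder, feature-difference pattern can be sketched as follows (the `TinyEncoder` is a hypothetical stand-in for the real ResNet backbone, purely to show the wiring):

```python
import torch
import torch.nn as nn

class SiameseDiff(nn.Module):
    """Encode both images with the SAME weights, subtract features per scale."""

    def __init__(self, encoder: nn.Module) -> None:
        super().__init__()
        self.encoder = encoder  # returns a list of multi-scale feature maps

    def forward(self, before: torch.Tensor, after: torch.Tensor):
        feats_a = self.encoder(before)  # one module, two forward passes
        feats_b = self.encoder(after)   # => identical weights for both images
        # |f_before - f_after| per scale: large values mean likely change.
        return [torch.abs(a - b) for a, b in zip(feats_a, feats_b)]

class TinyEncoder(nn.Module):
    """Stand-in backbone: two conv stages at 1/2 and 1/4 resolution."""

    def __init__(self) -> None:
        super().__init__()
        self.s1 = nn.Conv2d(3, 16, 3, stride=2, padding=1)
        self.s2 = nn.Conv2d(16, 32, 3, stride=2, padding=1)

    def forward(self, x):
        f1 = torch.relu(self.s1(x))
        return [f1, torch.relu(self.s2(f1))]

model = SiameseDiff(TinyEncoder())
diffs = model(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256))
print([d.shape for d in diffs])
```

The decoder (UNet++'s nested skips, or the MLP decoder for ChangeFormer) then consumes this list of difference maps instead of single-image features.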
### What To Tell Faculty

> "UNet++ was originally for medical segmentation, but the underlying problem is identical — pixel-level classification with class imbalance, where both fine detail and coarse context matter. We adapted it for bitemporal input by using a shared encoder and computing feature differences at each scale. This architectural pattern (encoder-difference-decoder) is standard in the remote sensing change detection literature. UNet++ is now widely used beyond medical imaging — in satellite imagery, autonomous driving, and industrial inspection."

---

## 7. What Happens Inside During Inference — Step By Step

Let's trace what happens when you upload two images in the Gradio app:

### Step 1: Image Loading
```
User uploads:
  before.png (256x256 RGB, uint8, values 0-255)
  after.png  (256x256 RGB, uint8, values 0-255)
```

### Step 2: Preprocessing
```
Convert to float32: values 0.0 to 1.0
Apply ImageNet normalization:
  pixel = (pixel - mean) / std
  mean = [0.485, 0.456, 0.406]  (per RGB channel)
  std  = [0.229, 0.224, 0.225]

Result: normalized tensors, values roughly -2.0 to 2.5
Shape: [1, 3, 256, 256] each (batch=1, channels=3, height=256, width=256)
```

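This preprocessing step, sketched as a standalone function (the function name is illustrative; the repo imports its mean/std constants from `data.dataset`):

```python
import numpy as np
import torch

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(img: np.ndarray) -> torch.Tensor:
    """uint8 HxWx3 image -> normalized [1, 3, H, W] float tensor."""
    x = img.astype(np.float32) / 255.0       # 0..255 -> 0.0..1.0
    x = (x - IMAGENET_MEAN) / IMAGENET_STD   # per-channel normalization
    # HWC -> CHW, then add the batch dimension.
    return torch.from_numpy(x).permute(2, 0, 1).unsqueeze(0)

img = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
t = preprocess(img)
print(t.shape)  # torch.Size([1, 3, 256, 256])
```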
### Step 3: Pad If Needed
```
If image is 300x400:
  Pad to 512x512 (nearest multiple of 256)
  Using reflection padding (mirrors edge pixels)
```

### Step 4: Tile Into Patches (if larger than 256x256)
```
512x512 image --> 4 patches of 256x256
  Patch 1: top-left
  Patch 2: top-right
  Patch 3: bottom-left
  Patch 4: bottom-right
```

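Steps 3 and 4 together, as a NumPy sketch (a simplified non-overlapping tiler under the stated assumptions; the repo's `sliding_window_inference` may use overlap):

```python
import numpy as np

def pad_and_tile(img: np.ndarray, patch: int = 256):
    """Reflect-pad an HxWxC image up to a multiple of `patch`, then cut tiles."""
    h, w = img.shape[:2]
    ph = (patch - h % patch) % patch   # extra rows needed
    pw = (patch - w % patch) % patch   # extra cols needed
    padded = np.pad(img, ((0, ph), (0, pw), (0, 0)), mode="reflect")
    tiles = [
        padded[r : r + patch, c : c + patch]
        for r in range(0, padded.shape[0], patch)
        for c in range(0, padded.shape[1], patch)
    ]
    return padded, tiles

img = np.zeros((300, 400, 3), dtype=np.uint8)
padded, tiles = pad_and_tile(img)
print(padded.shape, len(tiles))  # (512, 512, 3) 4
```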
### Step 5: Model Forward Pass (for each patch pair)

**Using ChangeFormer as example:**

```
Before patch [1, 3, 256, 256] --> MiT Encoder --> 4 feature maps
After patch  [1, 3, 256, 256] --> MiT Encoder --> 4 feature maps
                                  (shared weights)

Feature differences at each scale:
  Scale 1: |before_64x64 - after_64x64| = diff_64x64
  Scale 2: |before_32x32 - after_32x32| = diff_32x32
  Scale 3: |before_16x16 - after_16x16| = diff_16x16
  Scale 4: |before_8x8   - after_8x8|   = diff_8x8

MLP Decoder fuses all scales:
  --> [1, 1, 256, 256] raw logits
```

### Step 6: Sigmoid + Threshold
```
Probabilities = sigmoid(logits)        # values 0.0 to 1.0
Binary mask   = (probabilities > 0.5)  # True/False per pixel
```

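In PyTorch this thresholding is two lines:

```python
import torch

logits = torch.randn(1, 1, 256, 256)  # raw model output (stand-in values)
probs = torch.sigmoid(logits)         # squash to 0.0..1.0 change probability
mask = probs > 0.5                    # boolean per-pixel change mask
print(mask.dtype, mask.shape)
```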
### Step 7: Stitch Patches Back (if tiled)
```
4 patches of 256x256 --> stitch back to 512x512
Crop to original 300x400
```

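A stitching sketch matching the non-overlapping tiler above (row-major tile order and the function name are assumptions):

```python
import numpy as np

def stitch(tiles, grid_h: int, grid_w: int, orig_h: int, orig_w: int) -> np.ndarray:
    """Reassemble a row-major list of equal square tiles, then crop off padding."""
    patch = tiles[0].shape[0]
    out = np.zeros((grid_h * patch, grid_w * patch), dtype=tiles[0].dtype)
    for i, t in enumerate(tiles):
        r, c = divmod(i, grid_w)  # row-major placement
        out[r * patch : (r + 1) * patch, c * patch : (c + 1) * patch] = t
    return out[:orig_h, :orig_w]  # drop the reflection padding

tiles = [np.full((256, 256), i, dtype=np.uint8) for i in range(4)]
mask = stitch(tiles, 2, 2, 300, 400)
print(mask.shape)  # (300, 400)
```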
### Step 8: Create Outputs
```
Change mask: binary image (white = change, black = no change)
Overlay:     after image with red tint on changed pixels
Statistics:  "5.3% of area changed, 6,360 pixels out of 120,000"
```

### Total Time
- CPU: ~2-5 seconds per 256x256 patch
- GPU (T4): ~0.1 seconds per 256x256 patch

---

## 8. How To Explain This To Faculty

### If asked "Explain the model architecture"

> "All three models follow the same pattern: a shared-weight Siamese encoder processes both the before and after images identically. We compute the absolute difference between features at each scale — large differences indicate change. A decoder then upsamples this difference back to full resolution to produce a pixel-level change mask.
>
> The difference is in the encoder and decoder:
> - Siamese CNN uses ResNet18 and simple transposed convolutions — fast but loses spatial detail
> - UNet++ uses ResNet34 with nested skip connections — preserves detail at every scale
> - ChangeFormer uses a hierarchical transformer with self-attention — captures global context across the entire image"

### If asked "What fine-tuning did you do?"

> "We used ImageNet-pretrained ResNet backbones for the encoder. ImageNet teaches the model to recognize edges, textures, and shapes — these visual primitives are universal. We then fine-tuned ALL layers end-to-end on our satellite change detection dataset. The early layers (edge detection) barely changed. The later layers were substantially updated to understand satellite-specific patterns like building footprints and road textures. The decoder was trained entirely from scratch since it's specific to change detection."

### If asked "Why UNet++ for satellite when it's a medical model?"

> "UNet++ solves pixel-level binary classification with class imbalance and multi-scale features. That's exactly what change detection needs — most pixels are unchanged (like most brain pixels are non-tumor), and changes happen at multiple scales (small buildings to large developments). The architecture is task-agnostic — it doesn't know if it's looking at brains or buildings. We adapted it by adding a shared Siamese encoder and computing feature differences, making it bitemporal."

### If asked "What's your contribution vs just using existing models?"

> "Three things: First, we built the change detection adaptation — Siamese encoding, feature differencing, the full ChangeFormer from scratch. Second, we created a unified comparison framework — same data, same metrics, same training for all three models, which most papers don't do. Third, we built a production pipeline — from data preprocessing to a deployed web app with tiled inference for any image size. The finding that UNet++ outperforms the transformer on this task and patch size is itself a contribution — it challenges the assumption that newer architectures are always better."
README.md CHANGED

```diff
@@ -1,3 +1,15 @@
+---
+title: Military Base Change Detection
+emoji: satellite
+colorFrom: blue
+colorTo: red
+sdk: gradio
+sdk_version: 4.44.1
+app_file: app.py
+pinned: false
+python_version: 3.10
+---
+
```
app.py CHANGED

```diff
@@ -10,6 +10,7 @@ Usage:
 """
 
 import logging
+import os
 from pathlib import Path
 from typing import Any, Dict, List, Optional, Tuple
 
@@ -17,6 +18,7 @@ import gradio as gr
 import numpy as np
 import torch
 import yaml
+from huggingface_hub import hf_hub_download
 
 from data.dataset import IMAGENET_MEAN, IMAGENET_STD
 from inference import sliding_window_inference
@@ -34,10 +36,13 @@ _cached_model: Optional[torch.nn.Module] = None
 _cached_model_key: Optional[str] = None
 _device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
 _config: Optional[Dict[str, Any]] = None
+_hf_model_repo_id: Optional[str] = os.getenv("HF_MODEL_REPO")
+_hf_model_revision: Optional[str] = os.getenv("HF_MODEL_REVISION")
 
 # Search these directories for checkpoint files
 _CHECKPOINT_SEARCH_DIRS = [
     Path("checkpoints"),
+    Path("/home/user/app/checkpoints"),
     Path("/kaggle/working/checkpoints"),
     Path("/content/drive/MyDrive/change-detection/checkpoints"),
 ]
@@ -50,6 +55,40 @@ _MODEL_CHECKPOINT_NAMES = {
 }
 
 
+def _download_checkpoint_from_hf(model_name: str) -> Optional[Path]:
+    """Download checkpoint from Hugging Face Hub if configured.
+
+    Uses env var ``HF_MODEL_REPO`` as the source model repository and
+    downloads to ``./checkpoints`` cache.
+
+    Args:
+        model_name: One of the supported model keys.
+
+    Returns:
+        Local path to downloaded checkpoint, or ``None`` if unavailable.
+    """
+    if not _hf_model_repo_id:
+        return None
+
+    filename = _MODEL_CHECKPOINT_NAMES.get(model_name)
+    if filename is None:
+        return None
+
+    try:
+        local_path = hf_hub_download(
+            repo_id=_hf_model_repo_id,
+            filename=filename,
+            revision=_hf_model_revision,
+            local_dir="checkpoints",
+            local_dir_use_symlinks=False,
+        )
+        logger.info("Downloaded %s from %s", filename, _hf_model_repo_id)
+        return Path(local_path)
+    except Exception as exc:  # pragma: no cover - best-effort fallback
+        logger.warning("Could not download %s from HF Hub: %s", filename, exc)
+        return None
+
+
 # ---------------------------------------------------------------------------
 # Config / model loading
 # ---------------------------------------------------------------------------
@@ -88,6 +127,10 @@ def _find_checkpoint(model_name: str) -> Optional[Path]:
         if candidate.exists():
             return candidate
 
+    downloaded = _download_checkpoint_from_hf(model_name)
+    if downloaded is not None and downloaded.exists():
+        return downloaded
+
     return None
 
 
@@ -353,9 +396,11 @@ def main() -> None:
     gradio_cfg = config.get("gradio", {})
 
     demo = build_demo()
+    in_hf_space = os.getenv("SPACE_ID") is not None
     demo.launch(
+        server_name="0.0.0.0" if in_hf_space else "127.0.0.1",
         server_port=gradio_cfg.get("server_port", 7860),
-        share=gradio_cfg.get("share", False),
+        share=False if in_hf_space else gradio_cfg.get("share", False),
     )
```
requirements.txt CHANGED

```diff
@@ -14,3 +14,4 @@ tqdm>=4.66.0
 tensorboard>=2.15.0
 gradio>=4.14.0
 gdown>=5.1.0
+huggingface_hub>=0.23.0
```