# Phase 15 - TerraMind NYC Multi-head: ONE Model, Multiple Tasks
## Goal
The defensible single artifact. ONE TerraMind checkpoint trained
simultaneously on multiple NYC tasks via a shared backbone with
multiple decoder heads. A multi-task model is harder to overclaim,
less prone to forgetting, and more honest about model capacity than a
chain of separate fine-tunes.
This is the alternative to Phase 12 (TiM) and Phase 13 (buildings):
INSTEAD OF training them as separate checkpoints, we train one model
that does both at the same time.
## Why this is the right shape
- **One artifact** to publish, one card, one repro recipe. Simpler.
- **Shared encoder** learns features that help BOTH tasks; can be more
parameter-efficient than separate models.
- **No catastrophic forgetting**: both tasks are in the loss at every
  step, so neither gets overwritten by later training on the other.
- **Honest claim**: "the same backbone produces these outputs" is
defensible; "we trained five separate models" sounds less rigorous.
- **Real downstream use**: Riprap's `terramind_nyc` specialist gets
multiple class-fraction signals from one forward pass.
## Architecture
```
                      ┌───────────────────────────────┐
S2L2A (12 bands) ──►  │                               │
S1RTC (2 bands)  ──►  │  TerraMind v1 base encoder    │  shared
DEM (1 band)     ──►  │  (167M trainable params)      │
                      │                               │
                      └──────┬───────────────┬────────┘
                             │               │
                             ▼               ▼
                  ┌──────────────────┐   ┌──────────────────┐
                  │   UNet decoder   │   │   UNet decoder   │
                  │  LULC head (5)   │   │  Buildings head  │
                  │                  │   │     (binary)     │
                  └────────┬─────────┘   └────────┬─────────┘
                           │                      │
                           ▼                      ▼
                  (LULC prediction)      (Building footprint)

loss = α * dice(LULC) + β * dice(Buildings)
```
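A minimal PyTorch sketch of that wiring and the joint loss. Everything here is illustrative: the real model would wrap the TerraMind v1 backbone from terratorch and UNet decoders and feed the three modalities separately, whereas the stand-in encoder and conv heads below just stack bands into one tensor so the example stays self-contained and runnable.

```python
# Sketch: shared encoder, two decoder heads, joint dice loss as in the diagram.
# The real model would wrap the TerraMind v1 backbone (via terratorch) and UNet
# decoders; the stand-ins below only make the example self-contained.
import torch
import torch.nn as nn
import torch.nn.functional as F


def dice_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft multi-class dice; `target` holds integer class ids of shape (B, H, W)."""
    probs = logits.softmax(dim=1)
    onehot = F.one_hot(target, logits.shape[1]).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(2, 3))
    union = probs.sum(dim=(2, 3)) + onehot.sum(dim=(2, 3))
    return 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()


class MultiHeadSegmenter(nn.Module):
    """ONE shared encoder, one decoder head per task, one scalar joint loss."""

    def __init__(self, encoder: nn.Module, feat_ch: int, alpha: float = 1.0, beta: float = 1.0):
        super().__init__()
        self.encoder = encoder                        # shared backbone (TerraMind in the real model)
        self.lulc_head = self._head(feat_ch, 5)       # 5 WorldCover classes
        self.buildings_head = self._head(feat_ch, 2)  # background / building
        self.alpha, self.beta = alpha, beta

    @staticmethod
    def _head(in_ch: int, out_ch: int) -> nn.Module:
        # Stand-in for the UNet decoders in the diagram.
        return nn.Sequential(nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(),
                             nn.Conv2d(in_ch, out_ch, 1))

    def forward(self, image: torch.Tensor) -> dict:
        feats = self.encoder(image)                   # one pass through the shared backbone
        return {"lulc": self.lulc_head(feats), "buildings": self.buildings_head(feats)}

    def loss(self, preds: dict, masks: dict) -> torch.Tensor:
        # loss = alpha * dice(LULC) + beta * dice(Buildings)
        return (self.alpha * dice_loss(preds["lulc"], masks["lulc"])
                + self.beta * dice_loss(preds["buildings"], masks["buildings"]))


# Smoke check with a stand-in encoder over 15 stacked bands (12 S2 + 2 S1 + 1 DEM).
encoder = nn.Sequential(nn.Conv2d(15, 64, 3, padding=1), nn.ReLU())
model = MultiHeadSegmenter(encoder, feat_ch=64)
images = torch.randn(2, 15, 64, 64)
masks = {"lulc": torch.randint(0, 5, (2, 64, 64)), "buildings": torch.randint(0, 2, (2, 64, 64))}
print(model.loss(model(images), masks))
```

The design point this mirrors: one `forward` produces both heads' logits, and the scalar loss sums the per-task dice terms, so every optimizer step carries gradients for both tasks.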
Could extend to a third head (flood mask) once Phase 14's Prithvi
NYC dataset exists (same chip → flood mask via Prithvi labels), but
flood is a different signal and may want a separate model.
Stick to LULC + Buildings for the multi-head experiment.
## Training data
Same 22 parent chips × 16 sub-chips = 336 training tiles (Phase 2 dataset).
Each sub-chip now has TWO labels:
- `MASK_LULC/<chip_id>.tif`: 5-class WorldCover labels (Phase 2)
- `MASK_BUILDINGS/<chip_id>.tif`: binary NYC building footprint (Phase 13)
Both rasterized onto the same chip grid in the same prep pipeline.
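A hedged sketch of the buildings half of that prep step, assuming the footprints arrive as a GeoDataFrame and each sub-chip already has a georeferenced raster to copy the grid from; the helper name and paths are illustrative, not the actual `slice_and_label_nyc.py` code.

```python
# Sketch: burn building footprints onto an existing sub-chip grid so that
# MASK_BUILDINGS lines up pixel-for-pixel with MASK_LULC. Paths/names illustrative.
import geopandas as gpd
import numpy as np
import rasterio
from rasterio.features import rasterize


def write_buildings_mask(chip_path: str, footprints: gpd.GeoDataFrame, out_path: str) -> None:
    with rasterio.open(chip_path) as chip:               # any existing sub-chip band defines the grid
        transform, crs = chip.transform, chip.crs
        height, width = chip.height, chip.width

    clipped = footprints.to_crs(crs)                     # reproject footprints into the chip CRS
    shapes = [(geom, 1) for geom in clipped.geometry if geom is not None and not geom.is_empty]
    if shapes:
        mask = rasterize(shapes, out_shape=(height, width), transform=transform,
                         fill=0, dtype="uint8")          # 1 inside any footprint, 0 elsewhere
    else:
        mask = np.zeros((height, width), dtype="uint8")  # chip with no buildings at all

    profile = {"driver": "GTiff", "height": height, "width": width, "count": 1,
               "dtype": "uint8", "crs": crs, "transform": transform}
    with rasterio.open(out_path, "w", **profile) as dst:
        dst.write(mask, 1)
```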
## Plan
1. Scaffold (this file).
2. Extend `slice_and_label_nyc.py` to write BOTH MASK_LULC and
MASK_BUILDINGS per sub-chip (currently only LULC).
3. Write `multihead_datamodule.py`: yields `(image_dict, {"lulc": tensor,
   "buildings": tensor})` per batch (see the dataset sketch after this list).
4. Write `terramind_multihead_model.py`: TerraMind backbone + two
   decoder heads, joint forward, joint loss (sketched under the
   architecture diagram above).
5. Write `phase5_multihead.yaml`: training config.
6. Smoke-test on 1 sub-chip with both losses summing.
7. Run full fine-tune (~6 GPU-hr).
8. Eval BOTH heads independently against the held-out test set.
   Compare: multi-head LULC IoU vs Phase 2 single-task LULC IoU.
   Compare: multi-head Buildings IoU vs Phase 13 single-task IoU.
9. Publish as `msradam/TerraMind-base-NYC-multitask`.
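For step 3, a minimal sketch of the per-sample contract the datamodule should honor. The modality folder names and the bare `Dataset` wrapper are assumptions; the real `multihead_datamodule.py` would also own splits, normalization, and batching.

```python
# Sketch of the (image_dict, target_dict) contract from plan step 3.
# Folder names for the image modalities are assumptions.
import rasterio
import torch
from torch.utils.data import Dataset


class MultiHeadChipDataset(Dataset):
    def __init__(self, root: str, chip_ids: list[str]):
        self.root, self.chip_ids = root, chip_ids

    def __len__(self) -> int:
        return len(self.chip_ids)

    def _read(self, folder: str, chip_id: str):
        with rasterio.open(f"{self.root}/{folder}/{chip_id}.tif") as src:
            return src.read()                            # (bands, H, W) numpy array

    def __getitem__(self, idx: int):
        cid = self.chip_ids[idx]
        image = {
            "S2L2A": torch.from_numpy(self._read("S2L2A", cid)).float(),
            "S1RTC": torch.from_numpy(self._read("S1RTC", cid)).float(),
            "DEM": torch.from_numpy(self._read("DEM", cid)).float(),
        }
        targets = {
            "lulc": torch.from_numpy(self._read("MASK_LULC", cid)[0]).long(),
            "buildings": torch.from_numpy(self._read("MASK_BUILDINGS", cid)[0]).long(),
        }
        return image, targets
```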
## Eval gate
Strong: BOTH heads within 1pp of their respective single-task baselines,
AND the model is published as a single deployment artifact.
Acceptable: one head loses up to 3pp while the other gains at least as
much, AND the multi-head story is told honestly.
Negative: both heads drop ≥ 3pp from their single-task baselines;
multi-task interference is real, so publish the negative result.
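Stated mechanically, the gate is two deltas. A tiny illustrative helper (not project code), taking multi-head-minus-baseline IoU deltas in percentage points:

```python
# Classify a run against the eval gate. d_lulc / d_buildings are the multi-head
# IoU minus the corresponding single-task baseline IoU, in percentage points.
def eval_gate(d_lulc: float, d_buildings: float) -> str:
    if d_lulc >= -1.0 and d_buildings >= -1.0:
        return "strong"       # both heads within 1pp of their baselines
    if d_lulc <= -3.0 and d_buildings <= -3.0:
        return "negative"     # both heads drop >= 3pp: multi-task interference
    worst, best = min(d_lulc, d_buildings), max(d_lulc, d_buildings)
    if worst >= -3.0 and best >= -worst:
        return "acceptable"   # one head trades <= 3pp for the other to gain at least as much
    return "between gates: judge case by case"
```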
## Risk
Higher than separate models (more surface area for bugs across the
dataloader, the multi-task loss, and the dual heads), but the artifact is
much more compelling. To a judge, this reads as "real ML engineering"
rather than "ran the recipe N times."
## What it adds to Riprap
`app/context/terramind_nyc.py` gets a single `fetch(lat, lon)` call that
returns BOTH building density AND LULC class fractions. This halves
inference cost and surfaces correlated features (a high-density building
tile usually has a high "developed" class fraction; the multi-head model
sees this jointly).
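A hedged sketch of that call's shape; `chip_around`, the 5-class assumption, and the return keys are all hypothetical, and the real specialist would load the published multitask checkpoint once and reuse it across calls.

```python
# Hypothetical shape of the combined Riprap context call: ONE forward pass, two signals.
import torch


def fetch(lat: float, lon: float, model: torch.nn.Module, chip_around) -> dict:
    image = chip_around(lat, lon)              # stand-in: assembles the model input for this location
    with torch.no_grad():
        preds = model(image)                   # both heads come out of the same forward pass

    lulc = preds["lulc"].argmax(dim=1)         # (1, H, W) class ids
    built = preds["buildings"].argmax(dim=1)   # (1, H, W) values in {0, 1}
    n = lulc.numel()
    return {
        "building_density": float(built.sum()) / n,                             # fraction of built-up pixels
        "lulc_fractions": {c: float((lulc == c).sum()) / n for c in range(5)},  # per-class cover fractions
    }
```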
## Reproduction (planned)
```bash
python3 experiments/15_terramind_multihead/build_multihead_dataset.py
docker exec terramind terratorch fit --config /root/config_multihead.yaml
docker exec terramind terratorch test --config /root/config_multihead.yaml
```