# AirTrackLM: LLM4STP Adapted for ADS-B Air Track Prediction

## Complete Architecture & Implementation Plan

---

## 1. Executive Summary

We adapt the LLM4STP multi-feature fusion architecture (originally for maritime AIS ship trajectory prediction) to work with **ADS-B air track data**. The model uses a **decoder-only transformer** with four specialized embedding types (Prompt, Uncertainty, Geohash, and Temporal) fused together for **next-state prediction** pretraining. Once pretrained, the model is adaptable to downstream tasks such as activity classification.

This design is grounded in published results from:
- **FTP-LLM** (arXiv:2501.17459): LLaMA-3.1-8B for flight trajectory prediction
- **H3-CLM** (arXiv:2405.09596): H3 geohash + causal LM for maritime trajectories
- **GeoFormer** (arXiv:2311.05092): GPT-style geospatial tokenization
- **TrAISFormer** (arXiv:2109.03958): discrete tokenization of AIS features

---

## 2. System Architecture Overview

```
+---------------------------------------------------------------+
|                        RAW ADS-B INPUT                        |
|           (timestamp, latitude, longitude, altitude)          |
+-------------------------------+-------------------------------+
                                |
                                v
+---------------------------------------------------------------+
|                  FEATURE DERIVATION PIPELINE                  |
|                                                               |
|   Raw:     lat, lon, alt                                      |
|   Derived: COG, SOG, ROT, altitude_rate                       |
|   Meta:    timestamp -> (hour, day_of_week, month)            |
|                                                               |
|   Output per timestep:                                        |
|   state_t = [lat, lon, alt, COG, SOG, ROT, alt_rate]          |
+-------------------------------+-------------------------------+
                                |
                                v
+---------------------------------------------------------------+
|                    TOKENIZATION / ENCODING                    |
|                                                               |
|   Geohash Tokenizer:      lat, lon, alt -> H3 cell + alt_band |
|   Continuous Discretizer: COG, SOG, ROT, alt_rate -> bin IDs  |
|   Temporal Encoder:       hour, dow, month -> time IDs        |
|                                                               |
|   Each token stream has its own embedding table (d_model).    |
+-------------------------------+-------------------------------+
                                |
                                v
+---------------------------------------------------------------+
|                    EMBEDDING FUSION LAYER                     |
|                                                               |
|   Geohash, Feature, Temporal, and Uncertainty embeddings      |
|   (each d_model) are summed per timestep:                     |
|                                                               |
|     E_state = E_geo + E_feat + E_temp + E_uncert              |
|                                                               |
|   Prompt embedding (prepended prefix):                        |
|     [PROMPT_1, PROMPT_2, ..., PROMPT_k]                       |
|                                                               |
|   Input: [PROMPT_TOKENS | STATE_1 | STATE_2 | ... | STATE_T]  |
|     -> Linear projection to d_model                           |
|     -> + Positional encoding (sinusoidal)                     |
+-------------------------------+-------------------------------+
                                |
                                v
+---------------------------------------------------------------+
|              DECODER-ONLY TRANSFORMER BACKBONE                |
|                                                               |
|   Transformer block x N_layers, each:                         |
|     1. Causal multi-head self-attention (masked: each         |
|        position attends only to itself and earlier positions) |
|     2. LayerNorm + residual connection                        |
|     3. Feed-forward network (Linear -> GELU -> Linear),       |
|        d_model -> 4*d_model -> d_model                        |
|     4. LayerNorm + residual connection                        |
+-------------------------------+-------------------------------+
                                |
                                v
+---------------------------------------------------------------+
|                         OUTPUT HEADS                          |
|                                                               |
|   PRETRAINING: next-state prediction head.                    |
|   For each position t, predict the state at t+1:              |
|                                                               |
|     h_t -> Linear -> softmax -> P(geohash_token_{t+1})        |
|     h_t -> Linear -> softmax -> P(COG_bin_{t+1})              |
|     h_t -> Linear -> softmax -> P(SOG_bin_{t+1})              |
|     h_t -> Linear -> softmax -> P(ROT_bin_{t+1})              |
|     h_t -> Linear -> softmax -> P(alt_rate_bin_{t+1})         |
|     h_t -> Linear -> softmax -> P(alt_band_{t+1})             |
|                                                               |
|   Loss = sum over heads of CrossEntropy(predicted, true)      |
|                                                               |
|   DOWNSTREAM: activity classification head                    |
|   (attached after pretraining, frozen or fine-tuned):         |
|                                                               |
|     h_[BOS] or mean(h_1:T) -> MLP -> softmax -> class label   |
+---------------------------------------------------------------+
```

---

## 3. The Four Embedding Types (Detailed)

### 3.1 Geohash Embeddings: Spatial Position Encoding

**Purpose**: Encode the aircraft's 3D geographic position as a discrete token.

**Method**: We use the **H3 hexagonal hierarchical spatial index** (Uber's H3) at resolution 5 (hex area ≈ 252 km², edge ≈ 9.85 km) for en-route flight, with an option to use resolution 7 (≈ 5.16 km², edge ≈ 1.22 km) for terminal areas. This follows the H3-CLM paper's approach, adapted for aviation's larger spatial scale.

**3D Extension**: Since aircraft operate in 3D, we combine the H3 cell with an **altitude band**:
```
Geohash Token = H3_cell_index × N_alt_bands + alt_band_index

Altitude bands (1000 ft increments):
  Band 0:     0 - 1,000 ft    (ground / taxi)
  Band 1:  1,000 - 2,000 ft   (initial climb / approach)
  ...
  Band 45: 44,000 - 45,000 ft (high cruise)

  N_alt_bands = 46
```

**Vocabulary size**: At H3 resolution 5, the number of unique cells covering typical airspace is ~100K-200K. With altitude bands: `~200K × 46 ≈ 9.2M`, which is too large for a direct embedding table.

**Solution β€” Factored Embedding**:
```
E_geohash = E_h3[h3_cell_id] + E_alt[alt_band_id]

E_h3:  learned embedding table, vocab = N_h3_cells (~200K or hashing trick to 50K)
E_alt: learned embedding table, vocab = 46

Both project to d_model dimensions.
```

The **hashing trick**: Map H3 cell indices through a hash function to a fixed vocabulary of ~50,000 buckets. This bounds memory while maintaining spatial discrimination.

**Why H3 over traditional geohash**: H3 hexagons have near-uniform area (no polar distortion), hierarchical nesting, and consistent neighbor relationships, all of which matter for trajectory continuity.
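
The factored lookup and the hashing trick can be sketched as follows. This is a minimal NumPy stand-in: the tables would be learned `nn.Embedding` layers in PyTorch, and the hash multiplier and example H3 index are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

N_BUCKETS = 50_000    # hashed H3 vocabulary (per the plan above)
N_ALT_BANDS = 46
D_MODEL = 256

# Random stand-ins for the learned embedding tables.
E_h3 = rng.normal(0, 0.02, size=(N_BUCKETS, D_MODEL))
E_alt = rng.normal(0, 0.02, size=(N_ALT_BANDS, D_MODEL))

def alt_band(alt_ft):
    """Map altitude in feet to a 1,000 ft band, clipped to [0, 45]."""
    return int(np.clip(alt_ft // 1000, 0, N_ALT_BANDS - 1))

def h3_bucket(h3_index):
    """Hashing trick: fold a 64-bit H3 cell index into a fixed bucket range."""
    return (h3_index * 0x9E3779B97F4A7C15 % (1 << 64)) % N_BUCKETS

def geohash_embedding(h3_index, alt_ft):
    """Factored 3D position embedding: E_h3[bucket] + E_alt[band]."""
    return E_h3[h3_bucket(h3_index)] + E_alt[alt_band(alt_ft)]

e = geohash_embedding(0x85283473FFFFFFF, 35_000)  # example res-5 cell, FL350
```

The multiplicative hash keeps the table bounded at 50K rows; collisions between distant cells are tolerated because the kinematic and temporal embeddings disambiguate them.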

### 3.2 Temporal Embeddings: When Is the Aircraft Flying?

**Purpose**: Encode temporal context; time of day affects traffic density, routes, and behavior.

**Method**: Additive composition of multiple temporal scales:
```
E_temporal = E_hour[hour_of_day] + E_dow[day_of_week] + E_month[month]

E_hour:  24 entries  (captures rush hour vs. night patterns)
E_dow:    7 entries  (weekday vs. weekend traffic)
E_month: 12 entries  (seasonal routes, weather patterns)

All project to d_model dimensions.
```

**Optional β€” Sinusoidal Sub-minute Encoding**: For sub-minute resolution:
```
E_minute = [sin(2π × minute / 60), cos(2π × minute / 60)]  →  linear  →  d_model
```

### 3.3 Uncertainty Embeddings: How Confident Are We?

**Purpose**: Encode the model's uncertainty about the current trajectory state. Aircraft in straight-and-level cruise have low uncertainty; aircraft maneuvering near airports have high uncertainty.

**Method**: Compute a **trajectory smoothness score** from recent states, then discretize:

```
Uncertainty sources (sliding window of k=5 recent states):

1. Position variance:      σ²_pos = var(Δlat) + var(Δlon)
2. Heading variance:       σ²_COG = circular_var(COG_{t-k:t})
3. Speed variance:         σ²_SOG = var(SOG_{t-k:t})
4. Vertical-rate variance: σ²_alt = var(alt_rate_{t-k:t})

Combined uncertainty score:
  U_t = w1·σ²_pos + w2·σ²_COG + w3·σ²_SOG + w4·σ²_alt

Discretize into N_uncert = 16 bins (quantile binning on training data)

E_uncertainty = E_uncert_table[bin(U_t)]   →  d_model
```

**Weights w1-w4**: Hyperparameters tuned on validation data, or learned as part of the model.

**During inference**: For multi-step prediction, uncertainty can be updated using MC-Dropout or ensemble disagreement.
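
A minimal sketch of the score and the quantile binning, assuming unit weights. `circular_var` here is the standard "1 minus mean resultant length" definition; the bin edges would be fit once on training-set scores and reused at inference.

```python
import numpy as np

def circular_var(deg):
    """Circular variance of angles in degrees: 1 - |mean resultant vector|."""
    rad = np.radians(np.asarray(deg, dtype=float))
    return 1.0 - np.hypot(np.cos(rad).mean(), np.sin(rad).mean())

def uncertainty_score(dlat, dlon, cog, sog, alt_rate, w=(1.0, 1.0, 1.0, 1.0)):
    """Combined uncertainty U_t over one sliding window of k recent states."""
    return (w[0] * (np.var(dlat) + np.var(dlon))
            + w[1] * circular_var(cog)
            + w[2] * np.var(sog)
            + w[3] * np.var(alt_rate))

def fit_bins(scores, n_bins=16):
    """Quantile bin edges, fit once on training-set scores."""
    return np.quantile(scores, np.linspace(0, 1, n_bins + 1)[1:-1])

def uncert_bin(score, edges):
    return int(np.searchsorted(edges, score))

# Straight-and-level cruise: constant deltas -> near-zero uncertainty.
steady = uncertainty_score([0.01]*5, [0.01]*5, [90]*5, [450]*5, [0]*5)
# Maneuvering: swinging heading, speed, and vertical rate -> larger score.
maneuver = uncertainty_score([0.01, -0.02, 0.03, 0.0, -0.01],
                             [0.02, 0.01, -0.03, 0.02, 0.0],
                             [90, 140, 200, 260, 330],
                             [250, 230, 210, 240, 220],
                             [-500, 800, 0, -1200, 300])
```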

### 3.4 Prompt Embeddings: Task and Context Metadata

**Purpose**: Provide metadata context about the flight, analogous to system prompts in LLMs. Enables task conditioning and multi-task learning.

**Method**: Learnable prompt tokens prepended to the trajectory:

```
Prompt token vocabulary:
  - Aircraft category:  [HEAVY, LARGE, SMALL, ROTORCRAFT, GLIDER, UAV, UNKNOWN]  (7)
  - Flight phase:       [CLIMB, CRUISE, DESCENT, APPROACH, GROUND, UNKNOWN]       (6)
  - Region:             [CONUS, EUROPE, ASIA, OTHER]                               (4)
  - Task:               [PREDICT, CLASSIFY, DETECT_ANOMALY]                        (3)
  - Special:            [BOS, EOS, PAD, MASK]                                      (4)

Total prompt vocab: ~24 tokens

Prompt sequence (prepended):
  [BOS, TASK_TOKEN, AIRCRAFT_TOKEN, PHASE_TOKEN, REGION_TOKEN]

Each has a learned embedding of dimension d_model.
```

**For downstream classification**: set TASK_TOKEN to CLASSIFY; the output at the BOS position is used for classification.
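
Prompt construction under this vocabulary can be sketched as follows. The concrete IDs are illustrative, and the two UNKNOWN entries are disambiguated here (UNKNOWN_AC, UNKNOWN_PHASE) so the flat vocabulary stays at 24 tokens.

```python
# Flat prompt vocabulary; in the model each entry has a learned d_model vector.
PROMPT_VOCAB = [
    "BOS", "EOS", "PAD", "MASK",                                   # special (4)
    "HEAVY", "LARGE", "SMALL", "ROTORCRAFT", "GLIDER", "UAV",
    "UNKNOWN_AC",                                                  # aircraft (7)
    "CLIMB", "CRUISE", "DESCENT", "APPROACH", "GROUND",
    "UNKNOWN_PHASE",                                               # phase (6)
    "CONUS", "EUROPE", "ASIA", "OTHER",                            # region (4)
    "PREDICT", "CLASSIFY", "DETECT_ANOMALY",                       # task (3)
]
TOK = {name: i for i, name in enumerate(PROMPT_VOCAB)}

def prompt_ids(task, aircraft, phase, region):
    """[BOS, TASK, AIRCRAFT, PHASE, REGION] as embedding-table indices."""
    return [TOK["BOS"], TOK[task], TOK[aircraft], TOK[phase], TOK[region]]

ids = prompt_ids("PREDICT", "LARGE", "CRUISE", "EUROPE")
```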

---

## 4. Feature Derivation Pipeline

### 4.1 Raw Input
```
timestamp (Unix epoch seconds)
latitude  (degrees, WGS84)
longitude (degrees, WGS84)
altitude  (feet, barometric or geometric)
```

### 4.2 Derived Features

```python
import numpy as np

def derive_features(timestamps, lats, lons, alts):
    """
    Derive COG, SOG, ROT, and altitude rate from raw position data.
    All inputs: numpy arrays of shape (N,) for a single trajectory.
    Returns arrays of shape (N,); the first element is NaN (first two for ROT).
    """
    dt = np.diff(timestamps)  # seconds
    dt = np.maximum(dt, 1e-6)  # avoid division by zero
    
    # --- Course Over Ground (COG) ---
    lat1, lat2 = np.radians(lats[:-1]), np.radians(lats[1:])
    dlon = np.radians(np.diff(lons))
    
    x = np.sin(dlon) * np.cos(lat2)
    y = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(dlon)
    COG = np.degrees(np.arctan2(x, y)) % 360  # [0, 360)
    
    # --- Speed Over Ground (SOG) ---
    dlat = np.radians(np.diff(lats))
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1-a))
    distance_nm = 3440.065 * c  # Earth radius in nautical miles
    SOG = distance_nm / (dt / 3600)  # knots
    
    # --- Rate of Turn (ROT) ---
    dCOG = np.diff(COG)
    dCOG = (dCOG + 180) % 360 - 180  # normalize to [-180, 180]
    ROT = np.full(len(lats), np.nan)
    ROT[2:] = dCOG / dt[1:]  # degrees per second
    
    # --- Rate of Altitude Change ---
    dalt = np.diff(alts)  # feet
    alt_rate = dalt / (dt / 60)  # feet per minute
    
    # Pad first elements
    COG_full = np.concatenate([[np.nan], COG])
    SOG_full = np.concatenate([[np.nan], SOG])
    alt_rate_full = np.concatenate([[np.nan], alt_rate])
    
    return COG_full, SOG_full, ROT, alt_rate_full
```
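
A quick numerical sanity check of the bearing and haversine formulas above: two fixes 60 s apart and one arcminute of latitude apart cover 1 NM of ground track due north, so SOG should come out near 60 kt and COG near 0°.

```python
import numpy as np

# Two fixes, 60 s apart, 1 arcminute due north of each other.
lat1, lon1 = 47.0, 8.0
lat2, lon2 = 47.0 + 1/60, 8.0
dt = 60.0

p1, p2 = np.radians(lat1), np.radians(lat2)
dlon = np.radians(lon2 - lon1)

# Initial bearing (same formula as derive_features)
x = np.sin(dlon) * np.cos(p2)
y = np.cos(p1) * np.sin(p2) - np.sin(p1) * np.cos(p2) * np.cos(dlon)
cog = np.degrees(np.arctan2(x, y)) % 360

# Haversine distance in nautical miles (R = 3440.065 NM)
dlat = p2 - p1
a = np.sin(dlat/2)**2 + np.cos(p1) * np.cos(p2) * np.sin(dlon/2)**2
dist_nm = 3440.065 * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
sog = dist_nm / (dt / 3600)  # knots
```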

### 4.3 Feature Discretization

| Feature       | Range             | Bin Width  | N_bins | Notes             |
|---------------|-------------------|------------|--------|-------------------|
| COG           | [0, 360)          | 5°         | 72     | Circular          |
| SOG           | [0, 600] kts      | 5 knots    | 121    | Capped at ~Mach 1 |
| ROT           | [-6, 6] °/s       | 0.25 °/s   | 49     | Capped ±6 °/s     |
| Altitude Rate | [-6000, 6000] fpm | 200 ft/min | 61     | Capped ±6000 fpm  |

Outliers beyond the caps are clipped to the boundary bin.
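
The binning can be sketched as follows; with the ranges and widths from the table it reproduces the stated bin counts (72/121/49/61), treating COG as circular and clipping the other features to their caps.

```python
import numpy as np

def bin_feature(values, lo, hi, width):
    """Clip to [lo, hi], then map to integer bin IDs of the given width.
    Values at the upper cap fall into the last (boundary) bin."""
    v = np.clip(np.asarray(values, dtype=float), lo, hi)
    last = int((hi - lo) // width)
    return np.minimum(((v - lo) // width).astype(int), last)

def bin_cog(cog_deg):
    """Circular feature: wrap into [0, 360) before 5-degree binning."""
    return (np.asarray(cog_deg, dtype=float) % 360 // 5).astype(int)
```

For example, `bin_feature(sog, 0, 600, 5)` yields IDs 0..120 (121 bins) and `bin_feature(rot, -6, 6, 0.25)` yields 0..48 (49 bins), matching the table.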

### 4.4 Trajectory Preprocessing Pipeline

```
1. Segment raw ADS-B by ICAO24 + temporal gaps > 15 min → individual flights
2. Resample to fixed Δt = 60 seconds (linear interp for position, circular for heading)
3. Derive features (COG, SOG, ROT, alt_rate)
4. Drop first 2 points per trajectory (NaN from derivation)
5. Filter: remove trajectories with < 20 points (< 20 minutes)
6. Compute H3 cell (res 5) + altitude band for each point
7. Discretize all continuous features into bins
8. Compute uncertainty scores (sliding window k=5)
9. Extract temporal features (hour, dow, month)
10. Construct prompt tokens from metadata (if available)
```
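
Step 1 (gap-based segmentation) can be sketched as follows, assuming one aircraft's timestamps are already sorted:

```python
import numpy as np

def segment_by_gaps(timestamps, gap_s=15 * 60):
    """Split one aircraft's sorted timestamps into flights wherever the
    gap between consecutive reports exceeds 15 minutes.
    Returns a list of index arrays, one per flight segment."""
    t = np.asarray(timestamps, dtype=float)
    breaks = np.where(np.diff(t) > gap_s)[0] + 1
    return np.split(np.arange(len(t)), breaks)

# One aircraft: two flights separated by a 2-hour gap.
ts = [0, 60, 120, 180, 180 + 7200, 180 + 7260]
segments = segment_by_gaps(ts)
```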

---

## 5. Model Hyperparameters

### 5.1 Model Dimensions

| Parameter        | Value  | Rationale                                          |
|------------------|--------|----------------------------------------------------|
| d_model          | 256    | H3-CLM found 256-1024 effective                    |
| n_heads          | 8      | head_dim = 32                                       |
| n_layers         | 8      | Moderate depth for ~10M param model                 |
| d_ff             | 1024   | 4× d_model (standard)                              |
| max_seq_len      | 128    | 128 states × 60 s ≈ 2 hours of flight              |
| n_prompt_tokens  | 5      | [BOS, TASK, AIRCRAFT, PHASE, REGION]                |
| dropout          | 0.1    |                                                     |

**Total parameters**: ~8-12M (trainable on a single GPU in hours)

### 5.2 Vocabulary Sizes

| Embedding        | Vocab  | Dim |
|------------------|--------|-----|
| H3 cells         | 50,000 | 256 |
| Altitude bands   | 46     | 256 |
| COG bins         | 72     | 256 |
| SOG bins         | 121    | 256 |
| ROT bins         | 49     | 256 |
| Alt rate bins    | 61     | 256 |
| Hour of day      | 24     | 256 |
| Day of week      | 7      | 256 |
| Month            | 12     | 256 |
| Uncertainty bins | 16     | 256 |
| Prompt tokens    | 24     | 256 |

### 5.3 State Token Composition

Each timestep β†’ single state token via additive fusion:

```
E_state_t = E_h3[h3_id_t] + E_alt_band[alt_band_t]            # Geohash (3D position)
          + E_COG[cog_bin_t] + E_SOG[sog_bin_t]                # Kinematics
          + E_ROT[rot_bin_t] + E_alt_rate[alt_rate_bin_t]       # Dynamics
          + E_hour[hour_t] + E_dow[dow_t] + E_month[month_t]   # Temporal
          + E_uncert[uncert_bin_t]                               # Uncertainty

E_state_t ∈ R^{d_model}
```

This additive fusion follows BERT (token + segment + position) and TrAISFormer.
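
The additive fusion can be sketched with NumPy stand-ins for the learned tables (vocabulary sizes from Section 5.2); in PyTorch each table would be an `nn.Embedding` and the lookups would be batched.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 256

# Random stand-ins for the ten learned embedding tables (Section 5.2 sizes).
tables = {name: rng.normal(0, 0.02, size=(vocab, D)) for name, vocab in [
    ("h3", 50_000), ("alt_band", 46), ("cog", 72), ("sog", 121),
    ("rot", 49), ("alt_rate", 61), ("hour", 24), ("dow", 7),
    ("month", 12), ("uncert", 16),
]}

def state_embedding(ids):
    """Additive fusion: one d_model vector per timestep, summed over all
    per-feature lookups (geohash, kinematics, temporal, uncertainty)."""
    return sum(tables[name][ids[name]] for name in tables)

e = state_embedding({"h3": 1234, "alt_band": 35, "cog": 18, "sog": 90,
                     "rot": 24, "alt_rate": 30, "hour": 14, "dow": 2,
                     "month": 6, "uncert": 3})
```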

---

## 6. Training Recipe

### 6.1 Pretraining: Next-State Prediction (Causal LM)

**Objective**: Given states 1..T, predict state at T+1 (applied autoregressively at every position).

**Loss**:
```
L = Σ_{t=1}^{T-1} [ λ_geo · CE(ŷ_geo_t, y_geo_{t+1})
                    + λ_COG · CE(ŷ_COG_t, y_COG_{t+1})
                    + λ_SOG · CE(ŷ_SOG_t, y_SOG_{t+1})
                    + λ_ROT · CE(ŷ_ROT_t, y_ROT_{t+1})
                    + λ_alt · CE(ŷ_alt_rate_t, y_alt_rate_{t+1})
                    + λ_altb · CE(ŷ_alt_band_t, y_alt_band_{t+1}) ]

λ values default to 1.0 (equal weighting).
```
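
A minimal single-position sketch of the multi-head loss; in training this would be the framework's batched cross-entropy over all positions, but the structure is the same:

```python
import numpy as np

def cross_entropy(logits, target):
    """Single-example CE with a numerically stable log-softmax."""
    z = logits - logits.max()
    return float(np.log(np.exp(z).sum()) - z[target])

def next_state_loss(head_logits, targets, lam=None):
    """Sum of per-head CE at one position; heads keyed by feature name."""
    lam = lam or {k: 1.0 for k in head_logits}
    return sum(lam[k] * cross_entropy(head_logits[k], targets[k])
               for k in head_logits)

rng = np.random.default_rng(2)
logits = {"geo": rng.normal(size=50_000), "cog": rng.normal(size=72),
          "sog": rng.normal(size=121), "rot": rng.normal(size=49),
          "alt_rate": rng.normal(size=61), "alt_band": rng.normal(size=46)}
targets = {"geo": 7, "cog": 10, "sog": 20, "rot": 5,
           "alt_rate": 30, "alt_band": 35}
loss = next_state_loss(logits, targets)
```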

**Training hyperparameters** (based on FTP-LLM + H3-CLM):

| Parameter            | Value               |
|----------------------|---------------------|
| Optimizer            | AdamW               |
| Learning rate        | 5e-4                |
| LR Schedule          | Cosine + 5% warmup  |
| Batch size (per GPU) | 64                  |
| Gradient accumulation| 4 (effective = 256) |
| Max epochs           | 30 (early stopping, patience=5) |
| Weight decay         | 0.01                |
| Gradient clipping    | 1.0                 |
| Mixed precision      | bf16                |

**Data windowing**: Sliding window size=128, stride=64 (50% overlap).
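
The windowing can be sketched as start indices; a window is only emitted when all 128 states fit within the trajectory:

```python
def sliding_windows(n_points, size=128, stride=64):
    """Start indices of full windows over a trajectory of n_points states.
    stride=64 with size=128 gives 50% overlap between consecutive windows."""
    if n_points < size:
        return []
    return list(range(0, n_points - size + 1, stride))

starts = sliding_windows(300)
```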

### 6.2 Downstream: Activity Classification

After pretraining, attach classification head:
```
h_BOS → Linear(256, 128) → GELU → Dropout(0.1) → Linear(128, N_classes)
```

**Fine-tuning options**:
- **A**: Freeze backbone, train head only (fast, small data)
- **B**: Full fine-tune, backbone lr=1e-5, head lr=1e-3

---

## 7. Dataset Strategy

### 7.1 Prototyping: `traffic` Python Library

```python
from traffic.data.samples import landing_zurich_2019
# ~2,000 flights near Zurich
# Columns: timestamp, icao24, callsign, latitude, longitude, altitude,
#          groundspeed, track, vertical_rate, ...
```

Pros: instant access, clean, well-documented. Cons: single airport, limited diversity.

### 7.2 Training: OpenSky Network

```python
from pyopensky.trino import Trino
trino = Trino()
df = trino.rawquery("""
    SELECT time, icao24, lat, lon, baroaltitude, velocity, heading, vertrate
    FROM state_vectors_data4
    WHERE hour >= '2024-01-15 00:00:00'
      AND hour <  '2024-01-15 12:00:00'
      AND lat BETWEEN 40 AND 55
      AND lon BETWEEN -10 AND 20
    ORDER BY icao24, time
""")
```

**Target**:
- **Region A** (train): Europe, 1 month → ~500K-1M flights
- **Region B** (OOD test): US CONUS, 1 week → ~200K flights
- **Region C** (far test): East Asia, 1 week → ~100K flights

### 7.3 Alternative: SCAT Dataset

~170K en-route flights over Sweden, distributed via Zenodo. Pre-segmented and clean.

### 7.4 Data Split

```
Training:    70% of Region A flights
Validation:  15% of Region A flights  
Test (IID):  15% of Region A flights
Test (OOD):  100% of Region B flights
Test (Far):  100% of Region C flights
```

Split by **flight** (not time window) to avoid data leakage.
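
A flight-level split can be made deterministic by hashing the flight ID, so every window from a flight lands in the same partition regardless of processing order; the ID format here is illustrative.

```python
import hashlib

def split_of(flight_id):
    """Deterministic 70/15/15 split keyed on the flight ID, so all windows
    from one flight land in the same partition (no leakage)."""
    h = int(hashlib.sha256(flight_id.encode()).hexdigest(), 16) % 100
    if h < 70:
        return "train"
    return "val" if h < 85 else "test"

# With many flights, all three partitions appear in roughly 70/15/15 ratio.
splits = {split_of(f"icao{i}_2024-01-{i % 28 + 1:02d}") for i in range(1000)}
```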

---

## 8. Ablation Study: Geohash Geographic Dependency

### 8.1 Hypothesis

> Geohash embeddings encode **absolute geographic position**, causing the model to memorize region-specific patterns (airways, approach paths, airspace structure). This improves in-distribution performance but degrades transfer to unseen regions.

### 8.2 Experimental Variants

| Variant | Geohash Type | Description |
|---------|-------------|-------------|
| **V1: Full Model** | H3 absolute | Complete architecture as described |
| **V2: No Geohash** | None | Remove geohash entirely; model sees only kinematics + temporal + uncertainty |
| **V3: Relative Geohash** | H3 relative | H3 cell of (Δlat, Δlon) from trajectory start; position-invariant |
| **V4: Multi-Resolution** | H3 res 3+5+7 | 3 resolutions summed (coarse to fine) |
| **V5: Continuous Position** | Linear projection | `Linear([lat, lon, alt] → d_model)`; no discretization |

### 8.3 Evaluation Metrics

For each variant × each test set (IID, OOD, Far):

| Metric | Description |
|--------|-------------|
| Geo Accuracy | % correct H3 cell prediction |
| Position MAE | Mean absolute error in km |
| COG MAE | Heading error in degrees |
| SOG MAE | Speed error in knots |
| Multi-step ADE | Average displacement error over 5 predicted steps |
| Multi-step FDE | Final displacement error at step 5 |
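Both displacement metrics reduce to great-circle distances between predicted and true positions. A self-contained sketch (function names are illustrative):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    R = 6371.0  # mean Earth radius, km
    p1, p2 = radians(lat1), radians(lat2)
    dphi, dlmb = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * R * asin(sqrt(a))

def ade_fde(pred, true):
    """Average / final displacement error over a predicted horizon.

    pred, true: equal-length lists of (lat, lon), one entry per step.
    ADE averages the per-step errors; FDE keeps only the last step.
    """
    errors = [haversine_km(p[0], p[1], t[0], t[1]) for p, t in zip(pred, true)]
    return sum(errors) / len(errors), errors[-1]

pred = [(48.00, 2.00), (48.10, 2.10)]
true = [(48.00, 2.00), (48.12, 2.10)]
ade, fde = ade_fde(pred, true)
```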

### 8.4 Key Comparisons

| Comparison | Tests |
|-----------|-------|
| V1 vs V2 (IID) | How much geohash helps when test = train region |
| V1 vs V2 (OOD) | If V2 > V1 on OOD → geohash causes geographic overfitting |
| V1 vs V3 (OOD) | If V3 good on both IID and OOD → relative geohash is the sweet spot |
| V4 (all) | Multi-resolution: coarse cells transfer, fine cells specialize? |
| V5 (all) | Does continuous encoding avoid discretization issues? |

### 8.5 Expected Outcomes

- **V1**: Best IID, worst OOD (the hypothesis)
- **V3**: Best compromise — the predicted winner
- **V5**: May struggle (loses the discrete token structure transformers excel at)
- **V2**: Strong OOD baseline at the cost of IID accuracy

### 8.6 Additional Analysis

- **Attention visualization**: V1 vs V3 attention patterns
- **Embedding clustering**: t-SNE of geohash embeddings colored by region
- **Learning curves**: IID vs OOD performance vs training data size

---

## 9. Implementation Phases

### Phase 1: Data Pipeline (Week 1)
- Set up `traffic` library, extract sample trajectories
- Implement feature derivation (COG, SOG, ROT, alt_rate)
- Implement H3 geohash encoding + altitude banding
- Implement feature discretization (binning)
- Implement uncertainty score computation
- Build PyTorch Dataset class with sliding window
- Unit tests for all derivation functions
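The COG/SOG derivation in Phase 1 amounts to the initial great-circle bearing and haversine speed between consecutive fixes. A sketch assuming 60 s spacing (`derive_cog_sog` is an illustrative helper, not part of the codebase):

```python
from math import radians, degrees, sin, cos, atan2, asin, sqrt

def derive_cog_sog(lat1, lon1, lat2, lon2, dt_s=60.0):
    """Course over ground (deg, 0-360) and speed over ground (knots)
    derived from two consecutive fixes dt_s seconds apart."""
    p1, p2 = radians(lat1), radians(lat2)
    dlmb = radians(lon2 - lon1)
    # Initial great-circle bearing
    y = sin(dlmb) * cos(p2)
    x = cos(p1) * sin(p2) - sin(p1) * cos(p2) * cos(dlmb)
    cog = (degrees(atan2(y, x)) + 360.0) % 360.0
    # Haversine distance, then distance/time -> speed
    a = sin((p2 - p1) / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    dist_km = 2 * 6371.0 * asin(sqrt(a))
    sog_kt = dist_km / dt_s * 3600.0 / 1.852  # km per dt -> km/h -> knots
    return cog, sog_kt

cog, sog = derive_cog_sog(48.00, 2.00, 48.10, 2.00)  # one minute due north
```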

### Phase 2: Model Architecture (Week 1-2)
- Implement all embedding tables
- Implement additive fusion layer
- Implement prompt token prepending
- Implement decoder-only transformer backbone
- Implement multi-head output (6 prediction heads)
- Implement classification head (for downstream)
- Forward pass test with dummy data
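The additive fusion step above is just a sum of per-feature embedding lookups. A NumPy sketch with toy sizes (all names and dimensions are illustrative; a real implementation would use `nn.Embedding` tables in PyTorch):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8  # toy width; the plan uses a larger d_model

# One embedding table per discretized feature (sizes are illustrative)
tables = {
    "h3_cell":  rng.normal(size=(50_000, d_model)),
    "alt_band": rng.normal(size=(46, d_model)),
    "cog_bin":  rng.normal(size=(72, d_model)),
    "sog_bin":  rng.normal(size=(64, d_model)),
}

def fuse(token: dict) -> np.ndarray:
    """Additive fusion: look up each feature's embedding and sum them.

    The fused vector stays at d_model no matter how many feature streams
    are added, unlike concatenation which grows the width per feature.
    """
    return sum(tables[name][idx] for name, idx in token.items())

x = fuse({"h3_cell": 12345, "alt_band": 30, "cog_bin": 18, "sog_bin": 40})
```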

### Phase 3: Pretraining (Week 2-3)
- Implement training loop with multi-task loss
- Prototyping run on `traffic` data (small, fast iteration)
- Scale to OpenSky data
- Monitor loss curves, validate convergence
- Save best checkpoint

### Phase 4: Downstream Adaptation (Week 3-4)
- Implement classification fine-tuning pipeline
- Test on activity classification task
- Compare frozen vs. fine-tuned backbone

### Phase 5: Ablation Study (Week 4-5)
- Implement all 5 geohash variants
- Train each variant with identical hyperparameters
- Evaluate on IID, OOD, and Far test sets
- Generate comparison tables and visualizations
- Write analysis of geographic dependency findings

---

## 10. Key Design Decisions & Rationale

| Decision | Choice | Why |
|----------|--------|-----|
| Custom model vs. pretrained LLM | Custom ~10M param transformer | FTP-LLM showed text-tokenized LLMs work, but custom allows proper multi-feature fusion. 10M params trains in hours. |
| H3 vs. traditional geohash | H3 | Uniform hexagonal cells, no polar distortion, hierarchical. Proven by H3-CLM. |
| Additive vs. concatenative fusion | Additive | BERT/TrAISFormer paradigm. Keeps d_model constant. Concatenation → d_model × N_features = massive. |
| 60s time resolution | 60 seconds | FTP-LLM validated 1-min aggregation. 128 steps ≈ 2+ hours. |
| Factored geohash (H3 + alt) | Separate tables, summed | Avoids combinatorial explosion (9.2M → 50K + 46). |
| Multi-head output | Separate softmax per feature | More interpretable, allows per-feature analysis. |
| Uncertainty from smoothness | Variance-based | Computable at data time, no inference overhead. |
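One way to realize the variance-based uncertainty score from the table above: take the variance of wrapped step-to-step heading changes over a window and normalize it (the window, the `max_var` threshold, and the helper name are all illustrative):

```python
from statistics import pvariance

def smoothness_uncertainty(headings, max_var=400.0):
    """Score in [0, 1] from the variance of step-to-step heading changes.

    Smooth cruise -> ~0, erratic or noisy track -> ~1. Computed once at
    data-preparation time, so it adds no inference overhead.
    """
    deltas = [(b - a + 180.0) % 360.0 - 180.0  # wrap to (-180, 180]
              for a, b in zip(headings, headings[1:])]
    return min(pvariance(deltas) / max_var, 1.0)

smooth = smoothness_uncertainty([90, 90.5, 91, 91.5, 92])  # steady turn
erratic = smoothness_uncertainty([90, 140, 60, 170, 20])   # noisy track
```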

---

## 11. Risk Analysis

| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| Geohash overfits to region | High | High | Ablation study; V3 (relative) is fallback |
| OpenSky access issues | Medium | High | Fallback: `traffic` samples + SCAT |
| 60s too coarse for terminal | Medium | Low | Separate terminal model at 10s |
| Model too small | Low | Medium | Scale: d_model→512, n_layers→16 (~40M) |
| Alt discretization too coarse | Low | Low | Refine to 500ft bands (92) |

---

## 12. Monitoring & Evaluation

**During training** (Trackio):
- Total loss + per-feature loss curves
- Validation loss each epoch
- LR schedule, GPU utilization

**After training**:
- Next-state accuracy (top-1, top-5 per feature)
- Position error in km
- Multi-step prediction (1, 5, 10, 20 steps ahead)
- Downstream classification F1/precision/recall
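Per-feature top-k accuracy is a rank check on each head's logits; a pure-Python sketch (names are illustrative):

```python
def topk_accuracy(logits_batch, targets, k=5):
    """Fraction of samples whose true bin is among the k highest logits.

    logits_batch: one list of per-class scores per sample;
    targets: the true class index per sample.
    """
    hits = 0
    for logits, target in zip(logits_batch, targets):
        topk = sorted(range(len(logits)),
                      key=lambda i: logits[i], reverse=True)[:k]
        hits += target in topk
    return hits / len(targets)

logits = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
acc1 = topk_accuracy(logits, [1, 1], k=1)
acc2 = topk_accuracy(logits, [1, 1], k=2)
```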

---

*Grounded in: FTP-LLM, H3-CLM, GeoFormer, TrAISFormer, and LLM4STP (reconstructed). Ready for implementation upon approval.*