# AirTrackLM: LLM4STP Adapted for ADS-B Air Track Prediction

## Complete Architecture & Implementation Plan

---

## 1. Executive Summary

We adapt the LLM4STP multi-feature fusion architecture (originally for maritime AIS ship trajectory prediction) to work with **ADS-B air track data**. The model uses a **decoder-only transformer** with four specialized embedding types (Prompt, Uncertainty, Geohash, and Temporal) fused together for **next-state prediction** pretraining. Once pretrained, the model is adaptable to downstream tasks like activity classification.

This design is grounded in published results from:
- **FTP-LLM** (arXiv:2501.17459): LLaMA-3.1-8B for flight trajectory prediction
- **H3-CLM** (arXiv:2405.09596): H3 geohash + causal LM for maritime trajectories
- **GeoFormer** (arXiv:2311.05092): GPT-style geospatial tokenization
- **TrAISFormer** (arXiv:2109.03958): discrete tokenization of AIS features

---

## 2. System Architecture Overview

```
┌─────────────────────────────────────────────────────────────┐
│                       RAW ADS-B INPUT                       │
│          (timestamp, latitude, longitude, altitude)         │
└──────────────────────────────┬──────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                 FEATURE DERIVATION PIPELINE                 │
│                                                             │
│   Raw:     lat, lon, alt                                    │
│   Derived: COG, SOG, ROT, altitude_rate                     │
│   Meta:    timestamp → (hour, day_of_week, month)           │
│                                                             │
│   Output per timestep:                                      │
│   state_t = [lat, lon, alt, COG, SOG, ROT, alt_rate]        │
└──────────────────────────────┬──────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                   TOKENIZATION / ENCODING                   │
│                                                             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │   Geohash    │  │  Continuous  │  │   Temporal   │       │
│  │  Tokenizer   │  │ Discretizer  │  │   Encoder    │       │
│  │              │  │              │  │              │       │
│  │ lat,lon,alt  │  │ COG,SOG,ROT  │  │  hour,dow,   │       │
│  │ → H3 cell +  │  │   alt_rate   │  │    month     │       │
│  │   alt_band   │  │  → bin IDs   │  │ → time IDs   │       │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘       │
│         ▼                 ▼                 ▼               │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │   Geohash    │  │   Feature    │  │   Temporal   │       │
│  │  Embedding   │  │  Embeddings  │  │  Embedding   │       │
│  │    Table     │  │    Tables    │  │    Table     │       │
│  │  (d_model)   │  │  (d_model)   │  │  (d_model)   │       │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘       │
└─────────┼─────────────────┼─────────────────┼───────────────┘
          ▼                 ▼                 ▼
┌─────────────────────────────────────────────────────────────┐
│                   EMBEDDING FUSION LAYER                    │
│                                                             │
│   Geohash + Feature + Temporal + Uncertainty Embeddings     │
│   (each of dimension d_model)                               │
│                              │                              │
│                              ▼                              │
│   E_state = E_geo + E_feat + E_temp + E_uncert              │
│                              │                              │
│                              ▼                              │
│   Prompt Embedding (prepended prefix):                      │
│   [PROMPT_1, PROMPT_2, ..., PROMPT_k]                       │
│                              │                              │
│                              ▼                              │
│   Input: [PROMPT_TOKENS | STATE_1 | STATE_2 | ... | STATE_T]│
│                              │                              │
│                              ▼                              │
│   Linear Projection → d_model                               │
│                              │                              │
│                              ▼                              │
│   + Positional Encoding (sinusoidal)                        │
└──────────────────────────────┬──────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│              DECODER-ONLY TRANSFORMER BACKBONE              │
│                                                             │
│   Transformer Block × N_layers:                             │
│                                                             │
│     Causal Multi-Head Self-Attention                        │
│     (masked: each position attends only to                  │
│      itself and earlier positions)                          │
│                              │                              │
│                              ▼                              │
│     LayerNorm + Residual Connection                         │
│                              │                              │
│                              ▼                              │
│     Feed-Forward Network (Linear → GELU → Linear)           │
│     d_model → 4*d_model → d_model                           │
│                              │                              │
│                              ▼                              │
│     LayerNorm + Residual Connection                         │
└──────────────────────────────┬──────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                        OUTPUT HEADS                         │
│                                                             │
│   PRETRAINING: Next-State Prediction Head                   │
│                                                             │
│   For each position t, predict state at t+1:                │
│     h_t → Linear → softmax → P(geohash_token_{t+1})         │
│     h_t → Linear → softmax → P(COG_bin_{t+1})               │
│     h_t → Linear → softmax → P(SOG_bin_{t+1})               │
│     h_t → Linear → softmax → P(ROT_bin_{t+1})               │
│     h_t → Linear → softmax → P(alt_rate_bin_{t+1})          │
│     h_t → Linear → softmax → P(alt_band_{t+1})              │
│                                                             │
│   Loss = Σ CrossEntropy(predicted_feature, true_feature)    │
│                                                             │
│   DOWNSTREAM: Activity Classification Head                  │
│   (attached after pretraining, frozen or fine-tuned)        │
│                                                             │
│   h_[BOS] or mean(h_1:T) → MLP → softmax → class label      │
└─────────────────────────────────────────────────────────────┘
```

---

## 3. The Four Embedding Types (Detailed)

### 3.1 Geohash Embeddings: Spatial Position Encoding

**Purpose**: Encode the aircraft's 3D geographic position as a discrete token.

**Method**: We use the **H3 hexagonal hierarchical spatial index** (Uber's H3) at resolution 5 (hex area ≈ 252 km², average edge ≈ 9.85 km) for en-route flight, with an option to use resolution 7 (≈ 5.16 km², edge ≈ 1.22 km) for terminal areas. This follows the H3-CLM paper's approach, adapted for aviation's larger spatial scale.

**3D Extension**: Since aircraft operate in 3D, we combine the H3 cell with an **altitude band**:
```
Geohash Token = H3_cell_index × N_alt_bands + alt_band_index

Altitude bands (1,000 ft increments):
  Band 0:  0 - 1,000 ft        (ground / taxi)
  Band 1:  1,000 - 2,000 ft    (initial climb / approach)
  ...
  Band 45: 45,000 - 46,000 ft  (high cruise)

N_alt_bands = 46
```

**Vocabulary size**: At H3 resolution 5, the number of unique cells covering typical airspace is ~100K-200K. With altitude bands: `~200K × 46 ≈ 9.2M`, far too large for a direct embedding table.

**Solution: Factored Embedding**:
```
E_geohash = E_h3[h3_cell_id] + E_alt[alt_band_id]

E_h3:  learned embedding table, vocab = N_h3_cells (~200K, or hashing trick to 50K)
E_alt: learned embedding table, vocab = 46

Both project to d_model dimensions.
```

The **hashing trick**: Map H3 cell indices through a hash function to a fixed vocabulary of ~50,000 buckets. This bounds memory while maintaining spatial discrimination.
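The factored lookup plus hashing trick can be sketched as follows. This is a minimal sketch: the random tables stand in for learned embedding weights, the helper names are illustrative, and the H3 index used at the end is just an example 64-bit cell value.

```python
import zlib
import numpy as np

N_BUCKETS = 50_000    # hashed H3 vocabulary (per the text above)
N_ALT_BANDS = 46
D_MODEL = 256

rng = np.random.default_rng(0)
E_h3 = rng.standard_normal((N_BUCKETS, D_MODEL), dtype=np.float32)
E_alt = rng.standard_normal((N_ALT_BANDS, D_MODEL), dtype=np.float32)

def h3_bucket(h3_cell_index: int) -> int:
    """Hashing trick: fold a 64-bit H3 cell index into a fixed bucket range."""
    return zlib.crc32(h3_cell_index.to_bytes(8, "little")) % N_BUCKETS

def alt_band(altitude_ft: float) -> int:
    """1,000 ft bands, clipped into [0, 45]."""
    return int(min(max(altitude_ft, 0.0) // 1000, N_ALT_BANDS - 1))

def geohash_embedding(h3_cell_index: int, altitude_ft: float) -> np.ndarray:
    """Factored embedding: E_geohash = E_h3[hash(cell)] + E_alt[band]."""
    return E_h3[h3_bucket(h3_cell_index)] + E_alt[alt_band(altitude_ft)]

vec = geohash_embedding(0x85283473FFFFFFF, 35_000.0)  # example res-5 H3 index
```

CRC32 is used here only because it is deterministic across runs; any stable hash with good dispersion would do.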

**Why H3 over traditional geohash**: H3 hexagons have near-uniform area (no polar distortion), hierarchical nesting, and consistent neighbor relationships, which is critical for trajectory continuity.

### 3.2 Temporal Embeddings: When Is the Aircraft Flying?

**Purpose**: Encode temporal context; time of day affects traffic density, routes, and behavior.

**Method**: Additive composition of multiple temporal scales:
```
E_temporal = E_hour[hour_of_day] + E_dow[day_of_week] + E_month[month]

E_hour:  24 entries (captures rush hour vs. night patterns)
E_dow:   7 entries  (weekday vs. weekend traffic)
E_month: 12 entries (seasonal routes, weather patterns)

All project to d_model dimensions.
```

**Optional: Sinusoidal Minute-of-Hour Encoding**: For finer-than-hour resolution:
```
E_minute = [sin(2π × minute / 60), cos(2π × minute / 60)] → linear → d_model
```
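The index derivation above can be sketched with the standard library alone; timestamps are assumed to be UTC Unix seconds, and the helper names are illustrative:

```python
import math
from datetime import datetime, timezone

def temporal_ids(unix_ts: float):
    """Map a Unix timestamp to (hour, day_of_week, month) lookup indices."""
    t = datetime.fromtimestamp(unix_ts, tz=timezone.utc)
    return t.hour, t.weekday(), t.month - 1   # 0-23, 0-6 (Mon=0), 0-11

def minute_features(unix_ts: float):
    """Optional sinusoidal minute-of-hour features (fed through a linear layer)."""
    minute = datetime.fromtimestamp(unix_ts, tz=timezone.utc).minute
    angle = 2 * math.pi * minute / 60
    return math.sin(angle), math.cos(angle)

# 2024-01-15 12:30:00 UTC (a Monday)
hour, dow, month = temporal_ids(1705321800)
```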

### 3.3 Uncertainty Embeddings: How Confident Are We?

**Purpose**: Encode the model's uncertainty about the current trajectory state. Aircraft in straight-and-level cruise have low uncertainty; aircraft maneuvering near airports have high uncertainty.

**Method**: Compute a **trajectory smoothness score** from recent states, then discretize:

```
Uncertainty sources (sliding window of k=5 recent states):

1. Position variance:      σ²_pos = var(Δlat) + var(Δlon)
2. Heading variance:       σ²_COG = circular_var(COG_{t-k:t})
3. Speed variance:         σ²_SOG = var(SOG_{t-k:t})
4. Vertical-rate variance: σ²_alt_rate = var(alt_rate_{t-k:t})

Combined uncertainty score:
U_t = w1·σ²_pos + w2·σ²_COG + w3·σ²_SOG + w4·σ²_alt_rate

Discretize into N_uncert = 16 bins (quantile binning on training data)

E_uncertainty = E_uncert_table[bin(U_t)] → d_model
```

**Weights w1-w4**: Hyperparameters tuned on validation data, or learned as part of the model.
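A numpy sketch of the score for one window; the equal default weights and the "1 - R" (mean resultant length) definition of circular variance are assumptions of this sketch:

```python
import numpy as np

def circular_var(deg):
    """Circular variance: 1 - length of the mean resultant vector."""
    rad = np.radians(np.asarray(deg, dtype=float))
    return 1.0 - np.hypot(np.cos(rad).mean(), np.sin(rad).mean())

def uncertainty_score(lats, lons, cogs, sogs, alt_rates, w=(1.0, 1.0, 1.0, 1.0)):
    """U_t over one k-state window: w1*var_pos + w2*var_COG + w3*var_SOG + w4*var_alt_rate."""
    return float(w[0] * (np.var(np.diff(lats)) + np.var(np.diff(lons)))
                 + w[1] * circular_var(cogs)
                 + w[2] * np.var(sogs)
                 + w[3] * np.var(alt_rates))

# Straight-and-level cruise vs. a 180-degree turn over the same window
cruise = uncertainty_score(np.linspace(47.0, 47.1, 5), np.linspace(8.0, 8.4, 5),
                           np.full(5, 70.0), np.full(5, 450.0), np.zeros(5))
turning = uncertainty_score(np.linspace(47.0, 47.1, 5), np.linspace(8.0, 8.4, 5),
                            np.array([0.0, 45.0, 90.0, 135.0, 180.0]),
                            np.full(5, 450.0), np.zeros(5))
```

Circular variance is what makes a heading swing from 359° to 1° read as a small change rather than a huge one.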

**During inference**: For multi-step prediction, uncertainty can be updated using MC-Dropout or ensemble disagreement.

### 3.4 Prompt Embeddings: Task and Context Metadata

**Purpose**: Provide metadata context about the flight, analogous to system prompts in LLMs. This enables task conditioning and multi-task learning.

**Method**: Learnable prompt tokens prepended to the trajectory:

```
Prompt token vocabulary:
- Aircraft category: [HEAVY, LARGE, SMALL, ROTORCRAFT, GLIDER, UAV, UNKNOWN] (7)
- Flight phase:      [CLIMB, CRUISE, DESCENT, APPROACH, GROUND, UNKNOWN]     (6)
- Region:            [CONUS, EUROPE, ASIA, OTHER]                            (4)
- Task:              [PREDICT, CLASSIFY, DETECT_ANOMALY]                     (3)
- Special:           [BOS, EOS, PAD, MASK]                                   (4)

Total prompt vocab: 24 tokens

Prompt sequence (prepended):
[BOS, TASK_TOKEN, AIRCRAFT_TOKEN, PHASE_TOKEN, REGION_TOKEN]

Each has a learned embedding of dimension d_model.
```

**For downstream classification**: Change TASK_TOKEN to CLASSIFY; the output at the BOS position is used for classification.

---

## 4. Feature Derivation Pipeline

### 4.1 Raw Input
```
timestamp (Unix epoch seconds)
latitude  (degrees, WGS84)
longitude (degrees, WGS84)
altitude  (feet, barometric or geometric)
```

### 4.2 Derived Features

```python
import numpy as np

def derive_features(timestamps, lats, lons, alts):
    """
    Derive COG, SOG, ROT, and altitude rate from raw position data.
    All inputs: numpy arrays of shape (N,) for a single trajectory.
    Returns arrays of shape (N,); the first element of each is NaN
    (the first two for ROT, which needs two consecutive headings).
    """
    dt = np.diff(timestamps)          # seconds
    dt = np.maximum(dt, 1e-6)         # avoid division by zero

    # --- Course Over Ground (COG): initial great-circle bearing ---
    lat1, lat2 = np.radians(lats[:-1]), np.radians(lats[1:])
    dlon = np.radians(np.diff(lons))

    x = np.sin(dlon) * np.cos(lat2)
    y = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(dlon)
    COG = np.degrees(np.arctan2(x, y)) % 360          # [0, 360)

    # --- Speed Over Ground (SOG): haversine distance / elapsed time ---
    dlat = np.radians(np.diff(lats))
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    distance_nm = 3440.065 * c        # Earth radius in nautical miles
    SOG = distance_nm / (dt / 3600)   # knots

    # --- Rate of Turn (ROT) ---
    dCOG = np.diff(COG)
    dCOG = (dCOG + 180) % 360 - 180   # normalize to [-180, 180)
    ROT = np.full(len(lats), np.nan)
    ROT[2:] = dCOG / dt[1:]           # degrees per second

    # --- Rate of Altitude Change ---
    dalt = np.diff(alts)              # feet
    alt_rate = dalt / (dt / 60)       # feet per minute

    # Pad first elements (undefined for the first report)
    COG_full = np.concatenate([[np.nan], COG])
    SOG_full = np.concatenate([[np.nan], SOG])
    alt_rate_full = np.concatenate([[np.nan], alt_rate])

    return COG_full, SOG_full, ROT, alt_rate_full
```

### 4.3 Feature Discretization

| Feature       | Range             | Bin Width  | N_bins | Notes             |
|---------------|-------------------|------------|--------|-------------------|
| COG           | [0, 360)          | 5°         | 72     | Circular          |
| SOG           | [0, 600] kts      | 5 knots    | 121    | Capped at ~Mach 1 |
| ROT           | [-6, 6] °/s       | 0.25 °/s   | 49     | Capped ±6°/s      |
| Altitude Rate | [-6000, 6000] fpm | 200 ft/min | 61     | Capped ±6000 fpm  |

Outliers beyond the caps are clipped into the boundary bin.
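One consistent reading of the bin counts above (N = range/width + 1 for the capped features, range/width for the circular one) is nearest-bin-center indexing with clipping; that reading is an assumption of this sketch:

```python
import numpy as np

def bin_linear(x, lo, hi, width):
    """Clip to [lo, hi], then index by nearest bin center: ids 0..(hi-lo)/width."""
    x = np.clip(np.asarray(x, dtype=float), lo, hi)
    return np.rint((x - lo) / width).astype(int)

def bin_circular(deg, width=5.0):
    """COG: round to nearest 5-degree center, wrapping so 359° and 1° stay adjacent."""
    return np.rint(np.asarray(deg, dtype=float) / width).astype(int) % int(360 / width)

sog_bins = bin_linear([0, 3, 612, 450], lo=0, hi=600, width=5)   # ids in 0..120
rot_bins = bin_linear([-7.0, 0.1], lo=-6, hi=6, width=0.25)      # ids in 0..48
cog_bins = bin_circular([359.0, 2.0, 180.0])                     # ids in 0..71
```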

### 4.4 Trajectory Preprocessing Pipeline

```
1.  Segment raw ADS-B by ICAO24 + temporal gaps > 15 min → individual flights
2.  Resample to fixed Δt = 60 seconds (linear interp for position, circular for heading)
3.  Derive features (COG, SOG, ROT, alt_rate)
4.  Drop first 2 points per trajectory (NaN from derivation)
5.  Filter: remove trajectories with < 20 points (< 20 minutes)
6.  Compute H3 cell (res 5) + altitude band for each point
7.  Discretize all continuous features into bins
8.  Compute uncertainty scores (sliding window k=5)
9.  Extract temporal features (hour, dow, month)
10. Construct prompt tokens from metadata (if available)
```
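Step 1 (gap-based segmentation) can be sketched as follows; the 15-minute threshold comes from the list above, and the helper name is illustrative:

```python
import numpy as np

GAP_SECONDS = 15 * 60

def segment_flights(timestamps):
    """Split one ICAO24's time-sorted reports into flights wherever the gap
    between consecutive reports exceeds 15 minutes. Returns index arrays."""
    ts = np.asarray(timestamps, dtype=float)
    breaks = np.where(np.diff(ts) > GAP_SECONDS)[0] + 1
    return np.split(np.arange(len(ts)), breaks)

segments = segment_flights([0, 60, 120, 5000, 5060])  # ~81-min gap after index 2
```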

---

## 5. Model Hyperparameters

### 5.1 Model Dimensions

| Parameter       | Value | Rationale                             |
|-----------------|-------|---------------------------------------|
| d_model         | 256   | H3-CLM found 256-1024 effective       |
| n_heads         | 8     | head_dim = 32                         |
| n_layers        | 8     | Moderate depth for ~10M param model   |
| d_ff            | 1024  | 4× d_model (standard)                 |
| max_seq_len     | 128   | 128 states × 60 s ≈ 2 hours of flight |
| n_prompt_tokens | 5     | [BOS, TASK, AIRCRAFT, PHASE, REGION]  |
| dropout         | 0.1   |                                       |

**Total parameters**: ~8-12M (trainable on a single GPU in hours)

### 5.2 Vocabulary Sizes

| Embedding        | Vocab  | Dim |
|------------------|--------|-----|
| H3 cells         | 50,000 | 256 |
| Altitude bands   | 46     | 256 |
| COG bins         | 72     | 256 |
| SOG bins         | 121    | 256 |
| ROT bins         | 49     | 256 |
| Alt rate bins    | 61     | 256 |
| Hour of day      | 24     | 256 |
| Day of week      | 7      | 256 |
| Month            | 12     | 256 |
| Uncertainty bins | 16     | 256 |
| Prompt tokens    | 24     | 256 |

### 5.3 State Token Composition

Each timestep becomes a single state token via additive fusion:

```
E_state_t = E_h3[h3_id_t] + E_alt_band[alt_band_t]            # Geohash (3D position)
          + E_COG[cog_bin_t] + E_SOG[sog_bin_t]               # Kinematics
          + E_ROT[rot_bin_t] + E_alt_rate[alt_rate_bin_t]     # Dynamics
          + E_hour[hour_t] + E_dow[dow_t] + E_month[month_t]  # Temporal
          + E_uncert[uncert_bin_t]                            # Uncertainty

E_state_t ∈ R^{d_model}
```

This additive fusion follows the BERT paradigm (token + segment + position embeddings summed) and TrAISFormer.
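The additive composition can be sketched in numpy (random tables stand in for learned embedding weights; vocab sizes follow Section 5.2):

```python
import numpy as np

D_MODEL = 256
rng = np.random.default_rng(0)

# One lookup table per field; in the real model these are learned parameters.
VOCABS = {"h3": 50_000, "alt_band": 46, "cog": 72, "sog": 121, "rot": 49,
          "alt_rate": 61, "hour": 24, "dow": 7, "month": 12, "uncert": 16}
tables = {name: rng.standard_normal((vocab, D_MODEL), dtype=np.float32)
          for name, vocab in VOCABS.items()}

def fuse_state(ids):
    """Additive fusion: sum the ten per-field embedding rows into one state token."""
    return sum(tables[name][ids[name]] for name in VOCABS)

state = fuse_state({"h3": 123, "alt_band": 35, "cog": 14, "sog": 90, "rot": 24,
                    "alt_rate": 30, "hour": 12, "dow": 0, "month": 0, "uncert": 3})
```

Because every field maps to the same d_model space, the sum keeps sequence width constant no matter how many fields are fused.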

---

## 6. Training Recipe

### 6.1 Pretraining: Next-State Prediction (Causal LM)

**Objective**: Given states 1..T, predict the state at T+1 (applied autoregressively at every position).

**Loss**:
```
L = Σ_{t=1}^{T-1} [ λ_geo  · CE(ŷ_geo_t,      y_geo_{t+1})
                  + λ_COG  · CE(ŷ_COG_t,      y_COG_{t+1})
                  + λ_SOG  · CE(ŷ_SOG_t,      y_SOG_{t+1})
                  + λ_ROT  · CE(ŷ_ROT_t,      y_ROT_{t+1})
                  + λ_alt  · CE(ŷ_alt_rate_t, y_alt_rate_{t+1})
                  + λ_altb · CE(ŷ_alt_band_t, y_alt_band_{t+1}) ]

λ values default to 1.0 (equal weighting).
```
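In numpy terms, the summed multi-task loss at a single position looks like this; the dict-based interface and feature names are illustrative:

```python
import numpy as np

def cross_entropy(logits, target):
    """CE for one position: -log softmax(logits)[target], computed stably."""
    z = logits - logits.max()
    logp = z - np.log(np.exp(z).sum())
    return -logp[target]

def next_state_loss(head_logits, targets, lambdas=None):
    """Sum of per-feature CE terms. `head_logits` maps feature name -> logits
    at position t; `targets` holds the ground-truth bin ids at t+1."""
    lambdas = lambdas or {k: 1.0 for k in head_logits}
    return sum(lambdas[k] * cross_entropy(head_logits[k], targets[k])
               for k in head_logits)

logits = {"geo": np.array([2.0, 0.1, -1.0]), "cog": np.array([0.0, 3.0])}
loss = next_state_loss(logits, {"geo": 0, "cog": 1})
```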

**Training hyperparameters** (based on FTP-LLM + H3-CLM):

| Parameter             | Value               |
|-----------------------|---------------------|
| Optimizer             | AdamW               |
| Learning rate         | 5e-4                |
| LR schedule           | Cosine + 5% warmup  |
| Batch size (per GPU)  | 64                  |
| Gradient accumulation | 4 (effective = 256) |
| Max epochs            | 30 (early stop p=5) |
| Weight decay          | 0.01                |
| Gradient clipping     | 1.0                 |
| Mixed precision       | bf16                |

**Data windowing**: Sliding windows of size 128 with stride 64 (50% overlap).
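The windowing scheme in pure Python; in this sketch, a trailing remainder shorter than the window is simply not emitted:

```python
def window_starts(n_points, size=128, stride=64):
    """Start indices of sliding windows over one trajectory
    (size 128, stride 64 = 50% overlap)."""
    return list(range(0, n_points - size + 1, stride))

starts = window_starts(300)   # a 300-step (~5-hour) trajectory
```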

### 6.2 Downstream: Activity Classification

After pretraining, attach a classification head:
```
h_BOS → Linear(256, 128) → GELU → Dropout(0.1) → Linear(128, N_classes)
```

**Fine-tuning options**:
- **A**: Freeze backbone, train head only (fast; works with small data)
- **B**: Full fine-tune, backbone lr=1e-5, head lr=1e-3

---

## 7. Dataset Strategy

### 7.1 Prototyping: `traffic` Python Library

```python
from traffic.data.samples import landing_zurich_2019
# ~2,000 flights near Zurich
# Columns: timestamp, icao24, callsign, latitude, longitude, altitude,
#          groundspeed, track, vertical_rate, ...
```

Instant access, clean, well-documented. Single airport, so limited diversity.

### 7.2 Training: OpenSky Network

```python
from pyopensky.trino import Trino

trino = Trino()
df = trino.rawquery("""
    SELECT time, icao24, lat, lon, baroaltitude, velocity, heading, vertrate
    FROM state_vectors_data4
    WHERE hour >= '2024-01-15 00:00:00'
      AND hour < '2024-01-15 12:00:00'
      AND lat BETWEEN 40 AND 55
      AND lon BETWEEN -10 AND 20
    ORDER BY icao24, time
""")
```

**Target**:
- **Region A** (train): Europe, 1 month → ~500K-1M flights
- **Region B** (OOD test): US CONUS, 1 week → ~200K flights
- **Region C** (far test): East Asia, 1 week → ~100K flights

### 7.3 Alternative: SCAT Dataset

~170K en-route flights over Sweden, hosted on Zenodo. Pre-segmented and clean.

### 7.4 Data Split

```
Training:   70% of Region A flights
Validation: 15% of Region A flights
Test (IID): 15% of Region A flights
Test (OOD): 100% of Region B flights
Test (Far): 100% of Region C flights
```

Split by **flight** (not by time window) to avoid data leakage between sets.
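Flight-level splitting can be sketched as follows; the fixed seed and the 70/15/15 fractions follow the split above, and `flight_ids` would in practice be keys like ICAO24 + departure time (an assumption of this sketch):

```python
import random

def split_flights(flight_ids, seed=0):
    """70/15/15 split at the flight level, so no flight's windows can leak
    across train/validation/test."""
    ids = sorted(set(flight_ids))
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = round(0.70 * n), round(0.15 * n)
    return (set(ids[:n_train]),
            set(ids[n_train:n_train + n_val]),
            set(ids[n_train + n_val:]))

train, val, test = split_flights([f"icao{i}" for i in range(1000)])
```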

---

## 8. Ablation Study: Geohash Geographic Dependency

### 8.1 Hypothesis

> Geohash embeddings encode **absolute geographic position**, causing the model to memorize region-specific patterns (airways, approach paths, airspace structure). This improves in-distribution performance but degrades transfer to unseen regions.

### 8.2 Experimental Variants

| Variant | Geohash Type | Description |
|---------|--------------|-------------|
| **V1: Full Model** | H3 absolute | Complete architecture as described |
| **V2: No Geohash** | None | Remove geohash entirely; the model sees only kinematics + temporal + uncertainty |
| **V3: Relative Geohash** | H3 relative | H3 cell of (Δlat, Δlon) from the trajectory start; position-invariant |
| **V4: Multi-Resolution** | H3 res 3+5+7 | Three resolutions summed (coarse → fine) |
| **V5: Continuous Position** | Linear projection | `Linear([lat, lon, alt] → d_model)`; no discretization |

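The core of variant V3 is position invariance. A minimal illustration of the idea (the full variant would then re-index these offsets with H3 around a fixed reference origin, which is an assumption of this sketch):

```python
import numpy as np

def relative_track(lats, lons):
    """V3: offsets (dlat, dlon) from the trajectory start, so two identical
    maneuvers flown in Europe and in CONUS produce identical tokens."""
    lats, lons = np.asarray(lats, dtype=float), np.asarray(lons, dtype=float)
    return lats - lats[0], lons - lons[0]

# The same right-hand offset flown near Zurich and near Philadelphia
dlat_eu, dlon_eu = relative_track([47.0, 47.1, 47.2], [8.0, 8.0, 8.1])
dlat_us, dlon_us = relative_track([40.0, 40.1, 40.2], [-75.0, -75.0, -74.9])
```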
### 8.3 Evaluation Metrics

For each variant × each test set (IID, OOD, Far):

| Metric | Description |
|--------|-------------|
| Geo Accuracy | % of correct H3 cell predictions |
| Position MAE | Mean absolute position error in km |
| COG MAE | Heading error in degrees |
| SOG MAE | Speed error in knots |
| Multi-step ADE | Average displacement error over 5 predicted steps |
| Multi-step FDE | Final displacement error at step 5 |

### 8.4 Key Comparisons

| Comparison | Tests |
|------------|-------|
| V1 vs V2 (IID) | How much geohash helps when the test region equals the train region |
| V1 vs V2 (OOD) | If V2 > V1 on OOD → geohash causes geographic overfitting |
| V1 vs V3 (OOD) | If V3 is strong on both IID and OOD → relative geohash is the sweet spot |
| V4 (all) | Multi-resolution: do coarse cells transfer while fine cells specialize? |
| V5 (all) | Does continuous encoding avoid discretization issues? |

### 8.5 Expected Outcomes

- **V1**: Best IID, worst OOD (per the hypothesis)
- **V3**: Best compromise; the predicted winner
- **V5**: May struggle (loses the discrete token structure transformers excel at)
- **V2**: Strong OOD baseline, but sacrifices IID performance

### 8.6 Additional Analysis

- **Attention visualization**: V1 vs V3 attention patterns
- **Embedding clustering**: t-SNE of geohash embeddings colored by region
- **Learning curves**: IID vs OOD performance vs. training data size

---

## 9. Implementation Phases

### Phase 1: Data Pipeline (Week 1)
- Set up the `traffic` library, extract sample trajectories
- Implement feature derivation (COG, SOG, ROT, alt_rate)
- Implement H3 geohash encoding + altitude banding
- Implement feature discretization (binning)
- Implement uncertainty score computation
- Build a PyTorch Dataset class with sliding windows
- Unit tests for all derivation functions

### Phase 2: Model Architecture (Weeks 1-2)
- Implement all embedding tables
- Implement the additive fusion layer
- Implement prompt token prepending
- Implement the decoder-only transformer backbone
- Implement the multi-head output (6 prediction heads)
- Implement the classification head (for downstream tasks)
- Forward-pass test with dummy data

### Phase 3: Pretraining (Weeks 2-3)
- Implement the training loop with multi-task loss
- Prototyping run on `traffic` data (small, fast iteration)
- Scale to OpenSky data
- Monitor loss curves, validate convergence
- Save the best checkpoint

### Phase 4: Downstream Adaptation (Weeks 3-4)
- Implement the classification fine-tuning pipeline
- Test on the activity classification task
- Compare frozen vs. fine-tuned backbone

### Phase 5: Ablation Study (Weeks 4-5)
- Implement all 5 geohash variants
- Train each variant with identical hyperparameters
- Evaluate on the IID, OOD, and Far test sets
- Generate comparison tables and visualizations
- Write up the analysis of geographic dependency findings

---

## 10. Key Design Decisions & Rationale

| Decision | Choice | Why |
|----------|--------|-----|
| Custom model vs. pretrained LLM | Custom ~10M param transformer | FTP-LLM showed text-tokenized LLMs work, but a custom model allows proper multi-feature fusion and trains in hours at 10M params. |
| H3 vs. traditional geohash | H3 | Near-uniform hexagonal cells, no polar distortion, hierarchical. Proven by H3-CLM. |
| Additive vs. concatenative fusion | Additive | BERT/TrAISFormer paradigm. Keeps d_model constant; concatenation would give d_model × N_features, which is massive. |
| 60 s time resolution | 60 seconds | FTP-LLM validated 1-min aggregation; 128 steps ≈ 2+ hours. |
| Factored geohash (H3 + alt) | Separate tables, summed | Avoids combinatorial explosion (9.2M → 50K + 46). |
| Multi-head output | Separate softmax per feature | More interpretable; allows per-feature analysis. |
| Uncertainty from smoothness | Variance-based | Computable at data-preparation time; no inference overhead. |

---

## 11. Risk Analysis

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Geohash overfits to region | High | High | Ablation study; V3 (relative) is the fallback |
| OpenSky access issues | Medium | High | Fallback: `traffic` samples + SCAT |
| 60 s too coarse for terminal areas | Medium | Low | Separate terminal-area model at 10 s |
| Model too small | Low | Medium | Scale up: d_model → 512, n_layers → 16 (~40M) |
| Altitude discretization too coarse | Low | Low | Refine to 500 ft bands (92 bands) |

---

## 12. Monitoring & Evaluation

**During training** (Trackio):
- Total loss + per-feature loss curves
- Validation loss each epoch
- LR schedule, GPU utilization

**After training**:
- Next-state accuracy (top-1, top-5 per feature)
- Position error in km
- Multi-step prediction (1, 5, 10, 20 steps ahead)
- Downstream classification F1 / precision / recall

---

*Grounded in: FTP-LLM, H3-CLM, GeoFormer, TrAISFormer, and LLM4STP (reconstructed). Ready for implementation upon approval.*