
Riprap — Owner's Brief


1. The system in one paragraph

Riprap takes any NYC address, neighborhood, or development-permit query and produces a four-section flood-exposure briefing where every numeric claim is anchored to a [doc_id] citation that traces back to the source dataset, agency report, or model output. A natural-language planner (Granite 4.1 3b) routes each query to one of four intent paths; the chosen path fans out across up to ~25 atomic data specialists; a synthesizer (Granite 4.1 8b) reads only the specialist outputs that fired and writes the briefing; a Mellea rejection sampler checks four grounding invariants and rerolls if any fail. The system is NYC-specific and public-record-only: all data comes from NYC OpenData, USGS, NOAA, NWS, or FloodNet, and all four models run inside the container — no vendor LLM is contacted at runtime. The output is a tier 1–4 exposure score (deterministic, published rubric, not generated by the LLM) plus a cited paragraph in prose. What Riprap does not do: damage probability, insurance rating, flood prediction, or any claim about basement apartments or infrastructure that isn't in a public register.


2. Architecture map

HTTP request lifecycle

User browser → GET /api/agent/stream?q=<query>
  web/main.py: api_agent_stream()  (async SSE generator)
    runs runner() in a threadpool executor
      app/planner.plan(q, on_token=...)   → streams plan_token events while Granite generates
        returns Plan(intent, targets, specialists, rationale)
      out_q.put({kind:"plan", ...})       → SSE plan event
      intent dispatch:
        "single_address"     → app/intents/single_address.run(plan, q, progress_q, strict=True)
        "neighborhood"       → app/intents/neighborhood.run(plan, q, progress_q, strict=True)
        "development_check"  → app/intents/development_check.run(plan, q, progress_q, strict=True)
        "live_now"           → app/intents/live_now.run(plan, q, progress_q)
        "not_implemented"    → inline JSON response, no FSM
      each intent calls fsm.iter_steps() or its own specialist loop
        → out_q.put({kind:"step", ...})   per specialist
        → out_q.put({kind:"token", ...})  per Granite reconcile chunk
        → out_q.put({kind:"mellea_attempt", ...})  per Mellea pass/fail
      out_q.put({kind:"final", ...})
  event_stream() async generator reads out_q, wraps steps in
    stone_start / stone_done envelope keyed by _STEP_TO_STONE dict,
    yields SSE frames
  SSE response headers: Cache-Control: no-cache, X-Accel-Buffering: no
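The runner-to-SSE bridge above can be sketched as a thread-safe queue drained from the async side (a minimal sketch; `_runner` is a hypothetical stand-in for the intent dispatch, and the real generator in web/main.py emits the full event vocabulary listed in this section):

```python
import asyncio
import json
import queue
import threading

def _runner(out_q: queue.Queue) -> None:
    # Hypothetical stand-in for the intent dispatch running in a
    # threadpool: push progress events, then a sentinel.
    out_q.put({"kind": "plan", "intent": "single_address"})
    out_q.put({"kind": "step", "step": "geocode", "ok": True})
    out_q.put({"kind": "final", "paragraph": "..."})
    out_q.put(None)

async def event_stream(out_q: queue.Queue):
    # Drain the thread-safe queue from the async side without blocking
    # the event loop; wrap each record as an SSE frame.
    loop = asyncio.get_running_loop()
    while True:
        item = await loop.run_in_executor(None, out_q.get)
        if item is None:
            yield "event: done\ndata: {}\n\n"
            return
        yield f"event: {item['kind']}\ndata: {json.dumps(item)}\n\n"

async def main() -> list[str]:
    out_q: queue.Queue = queue.Queue()
    threading.Thread(target=_runner, args=(out_q,), daemon=True).start()
    return [frame async for frame in event_stream(out_q)]

frames = asyncio.run(main())
```

FastAPI can serve a generator like this directly via StreamingResponse(..., media_type="text/event-stream").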

Planner: app/planner.py

  • Entry: plan(query, model, on_token) → Plan
  • Model: RIPRAP_PLANNER_MODEL env, default granite4.1:3b
  • Uses llm.chat(format="json") with temperature=0 for deterministic JSON output via Ollama's constrained-decode mode
  • Pre-filter: _not_implemented_message(query) checks two regex patterns (retrospective, ranking) and returns early with a Plan(intent="not_implemented") so no LLM call is made
  • Post-validator: _validate(d, raw_query) sanitizes intent, targets, specialists against the declared INTENTS/SPECIALISTS dicts; adds floor specialists via _required_specialists(intent) if planner omitted them
  • Floor specialists (always added regardless of planner output): geocode+sandy+dep_stormwater+microtopo for single_address; nta_resolve+sandy+dep_stormwater+nyc311 for neighborhood; nws_alerts+noaa_tides for live_now
  • Returns: Plan(intent, targets: list[dict], specialists: list[str], rationale: str)
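The validator's keep-then-backfill behavior can be sketched as follows (a minimal sketch; the registry and floor table here are abbreviated stand-ins for INTENTS/SPECIALISTS and _required_specialists() in app/planner.py):

```python
# Abbreviated specialist registry and floor table (assumptions for the sketch).
SPECIALISTS = {"geocode", "sandy", "dep_stormwater", "microtopo",
               "floodnet", "nyc311", "nta_resolve", "nws_alerts", "noaa_tides"}
FLOOR = {
    "single_address": ["geocode", "sandy", "dep_stormwater", "microtopo"],
    "neighborhood": ["nta_resolve", "sandy", "dep_stormwater", "nyc311"],
    "live_now": ["nws_alerts", "noaa_tides"],
}

def validate(intent: str, specialists: list[str]) -> list[str]:
    # Keep only declared specialists, then append any floor specialists
    # the planner omitted, preserving order and deduplicating.
    kept = [s for s in specialists if s in SPECIALISTS]
    for s in FLOOR.get(intent, []):
        if s not in kept:
            kept.append(s)
    return kept

print(validate("single_address", ["sandy", "made_up_tool"]))
# ['sandy', 'geocode', 'dep_stormwater', 'microtopo']
```

The unknown name is dropped and the missing floor set is appended, which is what prevents a silent missing-Stone briefing.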

FSM: app/fsm.py

  • Entry: build_app(query) → Burr Application; run(query) → dict; iter_steps(query) → generator
  • Burr 0.x ApplicationBuilder with with_state(query=query, trace=[]), with_entrypoint("geocode")
  • Actions registered in dict order, transitions are consecutive pairs (linear, not DAG)
  • Each @action writes one state key + appends to trace list
  • Out-of-NYC guard: _NYC_S/W/N/E = 40.49, -74.27, 40.92, -73.69 — NYC-specific specialists skip with "out of NYC scope" reason; live/national specialists (NWS/NOAA/TTM) run unconditionally
  • Thread-locals for streaming (since Burr runs sync in a background thread):
    • set_strict_mode(bool) / _current_strict_mode()
    • set_token_callback(fn) / _current_token_callback()
    • set_mellea_attempt_callback(fn) / _current_mellea_attempt_callback()
    • set_planned_specialists(set) / _current_planned_specialists()
    • set_user_query(str) / _current_user_query()
    • set_planner_intent(str) / _current_planner_intent()
  • iter_steps spawns a daemon thread running app.iterate(halt_after=["reconcile"]); snaps threadlocals from caller thread and re-installs on iterate thread; deduplicates trace records by (step_name, started_at)
  • Heavy-specialist gate: _HEAVY_SPECIALISTS_ENABLED = True when RIPRAP_LLM_PRIMARY != ollama OR RIPRAP_ML_BASE_URL is set; otherwise False. Controls whether prithvi_live, terramind, eo_chip, terramind_lulc, terramind_buildings fire
  • NYCHA register gate: _NYCHA_REGISTERS_ENABLED = controlled by RIPRAP_NYCHA_REGISTERS=1 (default off); registers load a 91 MB GeoJSON file on first call

Full action sequence (default, single_address)

| # | Action name | State key written | Data source |
|---|---|---|---|
| 1 | geocode | geocode, lat, lon | NYC DCP Geosearch → OSM Nominatim fallback |
| 2 | sandy | sandy | data/sandy_inundation.geojson (lru_cache) |
| 3 | dep | dep | data/dep/*.gdb (3 scenarios, lru_cache) |
| 4 | floodnet | floodnet | api.floodnet.nyc Hasura GraphQL |
| 5 | nyc311 | nyc311 | Socrata erm2-nwe9 |
| 6 | noaa_tides | noaa_tides | api.tidesandcurrents.noaa.gov |
| 7 | nws_alerts | nws_alerts | api.weather.gov/alerts/active |
| 8 | nws_obs | nws_obs | api.weather.gov/stations/<id>/observations |
| 9 | ttm_forecast | ttm_forecast | ibm-granite/granite-timeseries-ttm-r2 (in-process or remote) |
| 10 | ttm_311_forecast | ttm_311_forecast | TTM r2 on local 311 weekly series |
| 11 | floodnet_forecast | floodnet_forecast | TTM r2 on nearest FloodNet sensor history |
| 12 | ttm_battery_surge | ttm_battery_surge | msradam/Granite-TTM-r2-Battery-Surge (remote or local) |
| 13 | microtopo | microtopo | data/nyc_dem_30m.tif, twi.tif, hand.tif |
| 14 | ida_hwm | ida_hwm | data/ida_2021_hwms_ny.geojson |
| 15 | mta_entrances | mta_entrances | data/mta_entrances.geojson |
| 16 | prithvi | prithvi_water | data/prithvi_ida_2021.geojson (166 polygons) |
| 17–22 | prithvi_live, terramind, eo_chip, terramind_lulc, terramind_buildings | (heavy, if enabled) | STAC/Sentinel-2, msradam/TerraMind-NYC-Adapters |
| 23 | rag | rag | Granite Embedding 278M over corpus/*.pdf (5 PDFs) |
| 24 | gliner | gliner | GLiNER typed-entity extraction over RAG hits |
| 25 | reconcile | paragraph, audit, mellea | Granite 4.1:8b via Mellea strict sampler |
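Because the chain is linear rather than a DAG, the transition wiring can be derived mechanically from the registration order (a sketch with abbreviated action names; Burr-specific builder calls omitted):

```python
# Actions are registered in dict order; transitions are consecutive pairs.
# Abbreviated stand-in for the real action dict in app/fsm.py.
actions = {"geocode": None, "sandy": None, "dep": None, "reconcile": None}

names = list(actions)                       # insertion order is preserved
transitions = list(zip(names, names[1:]))   # consecutive pairs
print(transitions)
# [('geocode', 'sandy'), ('sandy', 'dep'), ('dep', 'reconcile')]
```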

Capstone reconciliation: app/reconcile.py + app/mellea_validator.py

  • build_documents(state) → list[dict] — emits one {"role": "document <doc_id>", "content": "..."} per specialist that fired, in Stone order; gated by both specialist fire status and the out-of-NYC guard
  • trim_docs_to_plan(doc_msgs, planned_specialists) — drops doc messages not matching the planner's specialist set; saves ~30–50% prompt tokens; RIPRAP_TRIM_DOCS=0 disables
  • EXTRA_SYSTEM_PROMPT — the 4-section skeleton with the citation-discipline rules
  • augment_system_prompt(EXTRA_SYSTEM_PROMPT, query, intent) — calls app/framing.detect() to classify the question type (11 types, deterministic regex), then appends a QUESTION-AWARE OPENING: directive to the system prompt for non-generic questions
  • Strict path (production): reconcile_strict_streaming(doc_msgs, system_prompt, ...) in app/mellea_validator.py
    • Streams each attempt's tokens via the on_token(delta, attempt_idx) callback
    • After each attempt, runs the four checks and fires the on_attempt_end(attempt_idx, passed, failed) callback
    • On failure, appends a feedback user-turn naming the failing sentences and rerolls
    • Budget: DEFAULT_LOOP_BUDGET = 2 (Ollama primary) or 3 (vLLM primary), overridable via RIPRAP_MELLEA_MAX_ATTEMPTS
  • Legacy path (non-strict): reconcile.reconcile(state) → streams tokens, then calls verify_paragraph(), which drops sentences with ungrounded numbers (post-hoc filtering, not rejection sampling)
  • The step_reconcile action detects strict mode via _current_strict_mode() and routes to one path or the other
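trim_docs_to_plan() can be sketched as a filter over the "document <doc_id>" role strings (a sketch; the prefix-match rule for scenario-suffixed doc_ids like dep_extreme_2080 is an assumption of this sketch):

```python
def trim_docs_to_plan(doc_msgs: list[dict], planned: set[str]) -> list[dict]:
    # Drop document messages whose doc_id is not covered by the planner's
    # specialist set. Assumes each role is "document <doc_id>" as emitted
    # by build_documents(); prefix matching for suffixed ids is assumed.
    kept = []
    for msg in doc_msgs:
        doc_id = msg["role"].split(" ", 1)[1]
        if any(doc_id == s or doc_id.startswith(s + "_") for s in planned):
            kept.append(msg)
    return kept

docs = [{"role": "document sandy", "content": "..."},
        {"role": "document dep_extreme_2080", "content": "..."},
        {"role": "document nyc311", "content": "..."}]
print([d["role"] for d in trim_docs_to_plan(docs, {"sandy", "dep"})])
# ['document sandy', 'document dep_extreme_2080']
```

Dropping nyc311 here is the token saving: documents the planner never asked for never reach the synthesizer's prompt.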

Four Mellea grounding checks (app/mellea_validator.py)

  1. numerics_grounded — _check_no_invented_numbers(): every non-trivial number in the output appears verbatim in the haystack (joined document content). Trivial set: {0–10, 100, 311, 911, 211}. Number regex: \b-?\d[\d,]*(?:\.\d+)?\b (word-boundary anchored, so identifiers like QN1206 and B12 are skipped)
  2. no_placeholder_tokens — _check_no_placeholder_tokens(): output contains none of [source], <document, </document, [doc_id]
  3. citations_dense — _check_every_claim_cited(): each non-trivial number has a [doc_id] citation somewhere in the same sentence (sentence boundary: \.[\s)] or end of string)
  4. citations_resolve — _check_referenced_doc_ids_exist(): every [id] cited in the output is a member of the input doc_id set
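Check 1 can be sketched directly from the regex and trivial set above (a sketch mirroring _check_no_invented_numbers(), not the production code):

```python
import re

# Number regex and trivial set as described for the numerics_grounded check.
_NUM_RE = re.compile(r"\b-?\d[\d,]*(?:\.\d+)?\b")
_TRIVIAL = {str(n) for n in range(11)} | {"100", "311", "911", "211"}

def numerics_grounded(output: str, haystack: str) -> bool:
    # Every non-trivial number in the output must appear verbatim in the
    # joined document content (the haystack).
    for m in _NUM_RE.finditer(output):
        if m.group(0) not in _TRIVIAL and m.group(0) not in haystack:
            return False
    return True

docs = "Sandy depth 4.2 ft [sandy]; 17 complaints [nyc311]"
assert numerics_grounded("There were 17 complaints [nyc311].", docs)
assert not numerics_grounded("There were 23 complaints [nyc311].", docs)
assert numerics_grounded("Call 311 for updates.", docs)  # trivial number
```

The leading \b is what makes the regex identifier-aware: in "QN1206" there is no word boundary between "N" and "1", so no match starts inside the code.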

SSE event vocabulary (/api/agent/stream)

| event | payload | when |
|---|---|---|
| hello | {query} | connection open |
| plan_token | {delta} | planner JSON tokens |
| plan | {intent, targets, specialists, rationale} | planner done |
| stone_start | {name, tagline, description} | first step in a Stone fires |
| step | {step, ok, elapsed_s, result?, err?} | each FSM action completes |
| token | {delta, attempt?} | Granite reconcile chunk (attempt idx resets on reroll) |
| mellea_attempt | {attempt, passed, failed} | end of each Mellea attempt |
| stone_done | {name, tagline, description, n_steps} | last step in a Stone done |
| final | full state dict (geocode, sandy, dep, paragraph, mellea, energy, ...) | reconcile done |
| error | {err} | exception |
| done | {} | stream closing |
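On the consuming side the wire format is plain text/event-stream; a minimal parser for these frames (a sketch, independent of any SSE client library):

```python
import json

def parse_sse(raw: str) -> list[tuple[str, dict]]:
    # Parse text/event-stream into (event, payload) pairs. Assumes each
    # frame is "event: <name>\ndata: <json>" separated by blank lines,
    # which is the shape /api/agent/stream emits.
    events = []
    for frame in raw.strip().split("\n\n"):
        name, data = "message", "{}"
        for line in frame.splitlines():
            if line.startswith("event:"):
                name = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data = line[len("data:"):].strip()
        events.append((name, json.loads(data)))
    return events

raw = ('event: hello\ndata: {"query": "80 Pioneer St"}\n\n'
       'event: step\ndata: {"step": "geocode", "ok": true}\n\n'
       'event: done\ndata: {}\n\n')
print([name for name, _ in parse_sse(raw)])
# ['hello', 'step', 'done']
```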

3. The Five Stones, one section each

Cornerstone — Hazard Reader

Job: Establish the historical and modeled flood record at the address. These are static datasets that do not change between queries.

Specialists (file:function, what it does):

| Specialist | File:function | What it returns |
|---|---|---|
| step_sandy | fsm.py:step_sandy | Boolean: inside 2012 Sandy Inundation Zone; gpd.sjoin point-in-polygon against data/sandy_inundation.geojson (91 MB) |
| step_dep | fsm.py:step_dep | Three DEP stormwater scenarios: dep_extreme_2080 (3.66 in/hr rainfall, 2080 SLR), dep_moderate_2050 (2.13 in/hr, 2050 SLR), dep_moderate_current; depth class 1–3 per point |
| step_microtopo | fsm.py:step_microtopo | Point elevation (m), HAND (height above nearest drainage, m), TWI (topographic wetness index), rel_elev_pct_200m, rel_elev_pct_750m, basin_relief_m from rasters in data/ |
| step_ida_hwm | fsm.py:step_ida_hwm | USGS Hurricane Ida 2021 high-water marks; n_within_800m, max_height_above_gnd_ft, nearest_dist_m |
| step_prithvi | fsm.py:step_prithvi | Point-in-polygon against 166 pre-computed polygons in data/prithvi_ida_2021.geojson; inside_water_polygon bool, nearest_distance_m |

Data sources:

  • Sandy: NYC OpenData 5xsi-dfpx — downloaded to data/sandy_inundation.geojson
  • DEP: NYC DEP Stormwater Flood Maps (2021), Esri FileGDBs at data/dep/*.gdb
  • Microtopo: USGS 3DEP 30 m DEM via py3dep + whitebox-workflows for TWI/HAND computation, baked to data/nyc_dem_30m.tif, data/twi.tif, data/hand.tif by scripts/compute_hydrology_indices.py
  • Ida HWMs: USGS STN Event 312 (NY State), baked to data/ida_2021_hwms_ny.geojson by scripts/fetch_ida_hwms.py
  • Prithvi polygons: offline Prithvi-EO 2.0 segmentation on Sentinel-2 HLS tile (pre-event 2021-08-25, post-event 2021-09-02), baked to data/prithvi_ida_2021.geojson by scripts/run_prithvi_ida.py

Models invoked: Prithvi-EO 2.0 ran offline (TerraTorch) to produce the 166-polygon GeoJSON; no live model at query time for this Stone

Failure modes: Sandy/DEP/Prithvi fail silently on GeoJSON/GDB load errors; microtopo/ida_hwm fail if the raster files are absent from data/; all check _in_nyc() and skip with "out of NYC scope" for non-NYC addresses

UI evidence cards: Sandy inundation zone (boolean), DEP scenario depth classes (three cards), microtopo terrain indices, Ida HWM count/height, Prithvi polygon proximity


Keystone — Asset Register

Job: Quantify what public assets (transit, housing, schools, hospitals, buildings) are exposed to the hazards the Cornerstone established.

Specialists (file:function):

| Specialist | File:function | What it returns |
|---|---|---|
| step_mta_entrances | fsm.py:step_mta_entrances | MTA subway entrances within 500 m: n_entrances, n_inside_sandy_2012, n_in_dep_extreme_2080; per-entrance elevation + HAND |
| step_nycha | fsm.py:step_nycha | NYCHA developments within 1.5 km: n_developments, n_majority_inside_sandy_2012, n_with_dep_2080_overlap; per-development footprint overlap percentages |
| step_doe_schools | fsm.py:step_doe_schools | DOE schools within 1 km: n_schools, n_inside_sandy_2012, n_in_dep_extreme_2080 |
| step_doh_hospitals | fsm.py:step_doh_hospitals | NYS DOH hospitals within 2 km: n_hospitals, n_inside_sandy_2012, n_in_dep_extreme_2080 |
| step_terramind_buildings | fsm.py:step_terramind_buildings | msradam/TerraMind-NYC-Adapters Buildings LoRA: pct_buildings, n_building_components in the per-query Sentinel-2 chip; heavy, needs _HEAVY_SPECIALISTS_ENABLED |

Data sources:

  • MTA: data/mta_entrances.geojson (pre-computed register with elevation + flood layer joins)
  • NYCHA: data/registers/nycha.json (built by scripts/build_nycha_register.py)
  • DOE schools: data/registers/schools.json
  • DOH hospitals: fetched from NYS DOH Health Facility Certification (vn5v-hh5r) at register-build time
  • TerraMind-Buildings: msradam/TerraMind-NYC-Adapters adapter nyc-buildings-v1, via app/context/terramind_nyc.py:buildings()

Models invoked: msradam/TerraMind-NYC-Adapters (TerraMind 1.0 base + Buildings LoRA), ~1.6 GB base + ~325 MB LoRA, loaded lazily and cached; runs on RIPRAP_ML remote if configured

Failure modes: NYCHA/DOE/DOH registers require RIPRAP_NYCHA_REGISTERS=1 and the 91 MB Sandy GeoJSON to be loaded — they are disabled by default in local dev. TerraMind-Buildings skips silently if eo_chip didn't fire, deps are unavailable, or heavy specialists are disabled

UI evidence cards: MTA entrance exposure summary, NYCHA development exposure, school exposure, hospital exposure, TerraMind building footprint fraction


Touchstone — Live Observer

Job: Report current conditions — what sensors, 311 data, and EO imagery show right now.

Specialists (file:function):

| Specialist | File:function | What it returns |
|---|---|---|
| step_floodnet | fsm.py:step_floodnet | FloodNet sensors within 600 m: n_sensors, n_sensors_with_events, n_flood_events_3y, peak_event (max_depth_mm) |
| step_311 | fsm.py:step_311 | NYC 311 flood complaints within 200 m, last 5 years: count, by_descriptor breakdown, by_year |
| step_nws_obs | fsm.py:step_nws_obs | Nearest ASOS hourly METAR: station_id, precip_last_hour_mm, precip_last_3h_mm, precip_last_6h_mm |
| step_noaa_tides | fsm.py:step_noaa_tides | Nearest of 3 NOAA gauges (Battery 8518750, Kings Point 8516945, Sandy Hook 8531680): observed_ft_mllw, predicted_ft_mllw, residual_ft |
| step_prithvi_live | fsm.py:step_prithvi_live | Live Sentinel-2 L2A water segmentation via msradam/Prithvi-EO-2.0-NYC-Pluvial v2; pct_water_within_500m, pct_water_full, scene_date, cloud_cover; heavy |
| step_terramind_lulc | fsm.py:step_terramind_lulc | msradam/TerraMind-NYC-Adapters LULC LoRA: dominant_class, dominant_pct, per-class fractions; heavy |

Data sources:

  • FloodNet: https://api.floodnet.nyc/v1/graphql — Hasura GraphQL, no auth; ~350 sensors
  • 311: Socrata erm2-nwe9 (live API call, 200 m buffer, last 5 years)
  • NWS obs: https://api.weather.gov/stations/<id>/observations/latest; nearest of KNYC, KLGA, KJFK, KEWR, KFRG
  • NOAA tides: https://api.tidesandcurrents.noaa.gov/api/prod/datagetter; 6-min cadence
  • Prithvi live: Microsoft Planetary Computer STAC API for Sentinel-2 L2A; msradam/Prithvi-EO-2.0-NYC-Pluvial v2 weights
  • TerraMind LULC: shared chip from step_eo_chip (also STAC/Planetary Computer)

Models invoked: Prithvi-EO-2.0-NYC-Pluvial v2 (300 M params, TerraTorch, flood IoU 0.5979 vs 0.10 base); TerraMind-NYC-Adapters LULC LoRA (mIoU 0.5866, +6.13 pp over full-FT)

Failure modes: the FloodNet GraphQL call sets verify=False (self-signed cert); the 311 Socrata call times out gracefully; NOAA/NWS calls have 15–20 s timeouts; Prithvi/TerraMind LULC require _HEAVY_SPECIALISTS_ENABLED and app/context/eo_chip_cache.py:fetch() succeeding


Lodestone — Projector

Job: Report forward-looking signals — NWS alerts, surge forecasts, and complaint-rate trends.

Specialists (file:function):

| Specialist | File:function | What it returns |
|---|---|---|
| step_nws_alerts | fsm.py:step_nws_alerts | Active NWS flood-relevant alerts at point (Flash Flood, Coastal Flood, etc.): n_active, list of alerts with event/severity/urgency/expires |
| step_ttm_forecast | fsm.py:step_ttm_forecast | TTM r2 zero-shot Battery surge residual: context 512 steps (51 h at 6-min), horizon 96 steps (9.6 h); forecast_peak_ft, forecast_peak_minutes_ahead; only emits doc when interesting (peak > 0.3 ft) |
| step_ttm_311_forecast | fsm.py:step_ttm_311_forecast | TTM r2 zero-shot on 52 weeks of 311 complaint history → 4-week forecast; forecast_mean_per_week, forecast_peak_per_week, accelerating flag |
| step_floodnet_forecast | fsm.py:step_floodnet_forecast | TTM r2 on nearest FloodNet sensor flood-event recurrence; forecast_28d_expected_events, accelerating; silent if sensor history too sparse |
| step_ttm_battery_surge | fsm.py:step_ttm_battery_surge | msradam/Granite-TTM-r2-Battery-Surge fine-tune: hourly cadence, 96 h horizon; forecast_peak_m, forecast_peak_hours_ahead; only emits doc when interesting |

Data sources:

  • NWS alerts: https://api.weather.gov/alerts/active filtered to flood event types at the point's county
  • TTM context data: live pull from NOAA CO-OPS 6-min water level (for Battery/Kings Point/Sandy Hook); Socrata 311 history; FloodNet GraphQL event history
  • Battery surge fine-tune: NOAA hourly verified water level from Battery gauge (NOAA 8518750), loaded by app/live/ttm_battery_surge.py

Models invoked: ibm-granite/granite-timeseries-ttm-r2 (1.5 M params, ~30 MB, CPU-viable, zero-shot); msradam/Granite-TTM-r2-Battery-Surge fine-tune (same backbone, test MAE 0.1091 m, −41% vs persistence, −25% vs zero-shot)

Failure modes: NWS alerts call gracefully returns n_active=0 on timeout; TTM models loaded lazily via app/live/ttm_forecast.py:_load_model() with _DEPS_OK = False fallback pattern; all Lodestone specialists fire unconditionally (no NYC bbox gate except floodnet/311 which are NYC-specific)


Capstone — Synthesizer

Job: Read all documents produced by the four data-Stones and write a citation-grounded four-section prose briefing.

Entry: app/mellea_validator.py:reconcile_strict_streaming(doc_msgs, system_prompt, user_prompt, loop_budget, on_token, on_attempt_end)

Document ordering in prompt: geocode preamble → Cornerstone (sandy, dep_*, ida_hwm, prithvi_water, microtopo) → Keystone (mta_entrance_*, nycha_dev_*, doe_school_*, nyc_hospital_*, tm_buildings) → Touchstone (floodnet, nyc311, nws_obs, noaa_tides, prithvi_live, tm_lulc) → Lodestone (nws_alerts, ttm_forecast, ttm_311_forecast, floodnet_forecast_*, ttm_battery) → Policy (rag_*, gliner_*)

Four-section skeleton (from EXTRA_SYSTEM_PROMPT):

  • Status. — dominant exposure signal, strongest doc_id citation
  • Empirical evidence. — Sandy, 311, FloodNet, Ida HWMs, Prithvi polygons
  • Modeled scenarios. — DEP dep_* scenarios, microtopo terrain (HAND, TWI, percentile)
  • Policy context. — one sentence per RAG hit, citing agency name + rag_* doc_id

Four grounding checks (described in Β§2 above): numerics_grounded, no_placeholder_tokens, citations_dense, citations_resolve

Reroll feedback mechanism: _failing_sentences_for_citations(text) identifies sentences with uncited numbers; on reroll the feedback user-turn names those specific sentences and instructs surgical citation additions
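The sentence selector can be sketched by combining the number regex and sentence-boundary rule from §2 (a sketch; the real _failing_sentences_for_citations() may differ in detail):

```python
import re

# Regexes mirror the brief's description; the citation pattern is assumed.
_NUM_RE = re.compile(r"\b-?\d[\d,]*(?:\.\d+)?\b")
_CITE_RE = re.compile(r"\[[a-z0-9_]+\]")
_TRIVIAL = {str(n) for n in range(11)} | {"100", "311", "911", "211"}

def failing_sentences(text: str) -> list[str]:
    # A sentence fails when it contains a non-trivial number but no
    # [doc_id] citation; these sentences are named in the reroll feedback.
    sentences = re.split(r"\.[\s)]|\.$", text)
    bad = []
    for s in sentences:
        nums = [m.group(0) for m in _NUM_RE.finditer(s)]
        if any(n not in _TRIVIAL for n in nums) and not _CITE_RE.search(s):
            bad.append(s.strip())
    return bad

text = ("The address saw 17 complaints [nyc311]. "
        "Peak surge reached 4.2 ft. "
        "Call 311 for updates.")
print(failing_sentences(text))
# ['Peak surge reached 4.2 ft']
```

Only the uncited-number sentence is flagged: the cited sentence passes, and "311" is in the trivial set.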

Model: RIPRAP_RECONCILER_MODEL env, default granite4.1:8b; num_ctx=4096, num_predict=400

Return shape from step_reconcile: {paragraph, audit: {raw, dropped}, mellea: {rerolls, n_attempts, requirements_passed, requirements_failed, requirements_total, model, loop_budget}}


4. The three NYC fine-tunes

msradam/Prithvi-EO-2.0-NYC-Pluvial

  • HF Hub path: msradam/Prithvi-EO-2.0-NYC-Pluvial
  • Base model: IBM/NASA Prithvi-EO 2.0 (300 M params, ViT-L foundation model pre-trained on HLS Sentinel-2 multispectral imagery), Apache-2.0
  • Training data: NYC HLS Sentinel-2 tiles with pluvial flood labels derived from USGS Ida HWM survey and NYC DEP records; Lovász-Softmax loss with copy-paste augmentation; trained on AMD Instinct MI300X
  • Metrics: Test flood IoU 0.5979 vs 0.10 on Sen1Floods11 base (6× improvement)
  • Invocation: Two paths:
    • Offline (Cornerstone): produced data/prithvi_ida_2021.geojson via scripts/run_prithvi_ida.py; runtime does point-in-polygon, no model call
    • Live (Touchstone): app/flood_layers/prithvi_live.py:fetch(lat, lon) — fetches the latest Sentinel-2 L2A chip from Planetary Computer STAC, runs a model forward pass, returns pct_water_within_500m, pct_water_full; slow (~30 s), gated by _HEAVY_SPECIALISTS_ENABLED; input 6-band S2L2A chip, output binary segmentation mask
  • Degradation: If Planetary Computer STAC is unavailable or cloud cover too high, fetch() returns {ok: False, skipped: "...reason..."} and no doc is emitted

msradam/TerraMind-NYC-Adapters

  • HF Hub path: msradam/TerraMind-NYC-Adapters
  • Base model: TerraMind 1.0 (IBM/ESA any-to-any generative EO foundation model), Apache-2.0
  • Training data: NYC Sentinel-2 + SAR chips matched to ESRI Land Cover 2020–2022 labels (LULC adapter) and NYC building footprints (Buildings adapter); trained on AMD Instinct MI300X in ~18 minutes
  • Metrics: LULC test mIoU 0.5866 (+6.13 pp over full-FT baseline); Buildings test mIoU 0.5511; TiM 0.6023
  • Two adapters:
    • lulc — 5-class land cover (water, built, vegetation, bare, agriculture); invoked by step_terramind_lulc via app/context/terramind_nyc.py:lulc(s2_tensor, s1rtc, dem)
    • buildings — binary building footprint mask; invoked by step_terramind_buildings via app/context/terramind_nyc.py:buildings(s2_tensor, s1rtc, dem)
  • Shared chip: Both consume tensors from step_eo_chip → app/context/eo_chip_cache.py:fetch(lat, lon), which fetches the S2L2A + S1RTC + DEM chip once per query
  • Degradation: If eo_chip didn't fire successfully, both LoRA specialists silently no-op. Lazy load + cached in-process; first call ~30 s, subsequent calls ~3–7 s
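The once-per-query chip sharing can be sketched as a memoized fetch (a sketch; _download_chip is a hypothetical stand-in for the STAC call, and the real cache in app/context/eo_chip_cache.py may key and size differently):

```python
from functools import lru_cache

CALLS = {"n": 0}  # instrumentation for the sketch only

def _download_chip(lat: float, lon: float) -> dict:
    # Hypothetical stand-in for the expensive STAC/Planetary Computer fetch.
    CALLS["n"] += 1
    return {"lat": lat, "lon": lon, "bands": "S2L2A+S1RTC+DEM"}

@lru_cache(maxsize=32)
def fetch(lat: float, lon: float) -> dict:
    # Memoized on the exact (lat, lon) of the query, so the LULC and
    # Buildings adapters triggering for the same point share one download.
    return _download_chip(lat, lon)

fetch(40.6772, -74.0070)   # first adapter: downloads
fetch(40.6772, -74.0070)   # second adapter: cache hit
assert CALLS["n"] == 1
```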

msradam/Granite-TTM-r2-Battery-Surge

  • HF Hub path: msradam/Granite-TTM-r2-Battery-Surge
  • Base model: ibm-granite/granite-timeseries-ttm-r2 (1.5 M params, Tiny Time Mixer, Ekambaram et al. NeurIPS 2024), Apache-2.0
  • Training data: NOAA CO-OPS Battery gauge (station 8518750) hourly verified water level, surge residual computed as verified minus harmonic tide; trained on AMD Instinct MI300X
  • Metrics: Test MAE 0.1091 m, −41% vs persistence, −25% vs zero-shot TTM r2
  • Invocation: app/live/ttm_battery_surge.py:fetch() — loads the model via tsfm_public.get_model(), fetches NOAA hourly context, returns {available, context_hours, horizon_hours: 96, forecast_peak_m, forecast_peak_hours_ahead, interesting}; in-process on CPU
  • Input shape: (context_length, 1) float tensor of hourly surge residuals; context = 336 h (~14 days)
  • Output shape: (96,) hourly forecast, scanned for peak
  • Degradation: _DEPS_OK module-level flag set at import time; on failure returns {available: False, reason: "..."}, no doc emitted
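The _DEPS_OK pattern generalizes to any optional heavy dependency: probe once at import time, then degrade to a structured refusal at call time (a sketch; the available branch body is elided):

```python
# Sketch of the _DEPS_OK degradation pattern used by app/live/ttm_forecast.py
# and app/live/ttm_battery_surge.py.
try:
    import tsfm_public  # noqa: F401  -- heavy, optional dependency
    _DEPS_OK = True
except ImportError:
    _DEPS_OK = False

def fetch() -> dict:
    if not _DEPS_OK:
        # Caller emits no document; the briefing simply omits this signal.
        return {"available": False, "reason": "tsfm_public not installed"}
    # Real code would load the TTM model, pull NOAA context, and forecast.
    return {"available": True}

result = fetch()
```

Because the failure is a value rather than an exception, the FSM's silence-over-confabulation contract (state key present but unavailable, no doc emitted) holds without try/except at every call site.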

5. The deployment topology

Local development

  • Python 3.12 venv (.venv), uv for package management
  • Ollama serving granite4.1:3b + granite4.1:8b locally
  • uvicorn web.main:app --host 127.0.0.1 --port 7860
  • _HEAVY_SPECIALISTS_ENABLED = False by default (no RIPRAP_ML_BASE_URL set, no vLLM)
  • RIPRAP_NYCHA_REGISTERS = 0 by default (avoids the heavy 91 MB GeoJSON load)
  • Granite Embedding 278M and TTM r2 download to HF cache on first query (~280 MB + ~30 MB)
  • SvelteKit UI built at web/sveltekit/build/; rebuild only needed when sources change

HF Space (production demo URL)

  • URL: https://lablab-ai-amd-developer-hackathon-riprap-nyc.hf.space
  • Docker SDK, base nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04; hardware is cpu-basic (ARCHITECTURE.md mentions a T4, but the Dockerfile's GPU notes are aspirational)
  • Python 3.10 inside container (pinning mellea<0.4, transformers<5, huggingface_hub<1)
  • entrypoint.sh flow:
    1. Attempts the EO toolchain install at runtime into $HOME/.eo-pkgs (bypasses the HF build disk limit); if it fails, terramind/prithvi-live silently skip
    2. Starts ollama serve in background, polls until ready (up to 60 s)
    3. Pulls granite4.1:8b at runtime if not cached (~5 GB, ~2 min first cold start); 3b is optional
    4. Pre-warms 8b via curl POST /api/generate with keep_alive=24h
    5. Launches uvicorn web.main:app --host 0.0.0.0 --port 7860
  • RIPRAP_OLLAMA_3B_TAG=granite4.1:8b set in Dockerfile so planner routes to 8b (avoids disk cost of two separate model pulls)
  • web/main.py:_warm_caches() on startup: loads sandy + DEP layers, optionally NYCHA registers, warms RAG (Granite Embedding 278M + 5 PDFs), pre-imports heavy ML stacks to avoid import races, warms Ollama models via HTTP

AMD MI300X droplet (demo GPU path — currently destroyed)

  • Two Docker containers on same host, both with --device=/dev/kfd --device=/dev/dri
  • Container 1: vllm/vllm-openai-rocm:v0.17.1 — serves granite-4.1-8b on port 8001
    • --max-model-len 8192, --served-model-name granite-4.1-8b
    • GLOO_SOCKET_IFNAME=eth0 required or gloo fails to bind
  • Container 2: riprap-models:latest (built from services/riprap-models/Dockerfile) — FastAPI on port 8002 (or 7860 per scripts)
    • Endpoints: GET /healthz, POST /v1/prithvi-pluvial, POST /v1/terramind, POST /v1/ttm-forecast, POST /v1/granite-embed, POST /v1/gliner-extract
    • Model loading: lazy + per-model threading.Lock to prevent double-load on concurrent requests
    • ROCm device: cuda (ROCm's CUDA shim maps cuda to first /dev/kfd device)

Env vars to connect HF Space to droplet:

RIPRAP_LLM_PRIMARY=vllm
RIPRAP_LLM_BASE_URL=http://<ip>:8001/v1
RIPRAP_LLM_API_KEY=<token>
RIPRAP_ML_BASE_URL=http://<ip>:8002
RIPRAP_ML_API_KEY=<token>

What breaks if the droplet IP changes: set the env vars above via huggingface-cli space variables and restart the Space. The LiteLLM Router builds from env at import time, so the restart is required.

Deterministic redeploy: scripts/deploy_droplet.sh <new-ip> $TOKEN — idempotent, ~10–20 min on the first run (pulls images, builds riprap-models); re-runs on the same droplet take ~1 min. Known fragile: the safetensors==0.8.0rc0 pin in services/riprap-models/requirements-full.txt is a release candidate and may fail on future pip resolves.


6. One query traced end-to-end: "80 Pioneer Street, Brooklyn"

Query enters: GET /api/agent/stream?q=80+Pioneer+Street%2C+Brooklyn

1. Planner (app/planner.py:plan)

  • No not-implemented regex matches
  • Calls llm.chat(model="granite4.1:3b", messages=[system, user], format="json", stream=True, temperature=0)
  • Streams plan_token SSE events as JSON generates
  • Returns Plan(intent="single_address", targets=[{type:"address", text:"80 Pioneer Street, Brooklyn"}], specialists=[...], rationale="...")
  • Validator adds floor specialists: geocode, sandy, dep_stormwater, microtopo
  • SSE: plan event emitted

2. single_address.run (app/intents/single_address.py:run)

  • Sets threadlocals: strict=True, planned_specialists={...}, user_query="80 Pioneer Street, Brooklyn", planner_intent="single_address"
  • Registers on_token and on_mellea_attempt callbacks on progress_q
  • Calls fsm.iter_steps("80 Pioneer Street, Brooklyn")

3. FSM: step_geocode

  • app/geocode.py:geocode_one("80 Pioneer Street, Brooklyn")
  • Detects borough hint "Brooklyn", calls DCP Geosearch with size=8, filters for Brooklyn results
  • Returns GeocodeHit(address="80 Pioneer Street, Brooklyn, NY 11231", borough="Brooklyn", lat=40.6772, lon=-74.0070, bbl="3-00589-0003", ...)
  • State: {geocode: {...}, lat: 40.6772, lon: -74.0070}
  • SSE: step event {step: "geocode", ok: true, elapsed_s: 0.4, result: {address:..., lat:..., lon:...}}

4. FSM: step_sandy

  • Confirmed inside NYC bbox
  • sandy_inundation.join(point) — spatial join against data/sandy_inundation.geojson
  • Red Hook is inside the 2012 Sandy inundation zone → sandy=True
  • State: {sandy: True}
  • SSE: step event → opens stone_start: Cornerstone

5. FSM: step_dep

  • dep_stormwater.join(pt, scen) for each of 3 scenarios against data/dep/*.gdb
  • Likely returns dep_moderate_2050: depth_class=2 (Deep & Contiguous 1-4 ft), dep_extreme_2080: depth_class=3 (Deep Contiguous >4 ft), dep_moderate_current: depth_class=1

6–8. FSM: step_floodnet, step_311, step_noaa_tides

  • FloodNet: GraphQL POST to api.floodnet.nyc — checks sensors within 600 m of (40.6772, -74.0070)
  • 311: Socrata API call for flood complaints within 200 m, last 5 years
  • NOAA: fetches Battery gauge (closest of 3 stations to Red Hook), returns observed/predicted/residual

9–12. FSM: TTM forecast steps

  • ttm_forecast.summary_for_point(40.6772, -74.0070): loads ibm-granite/granite-timeseries-ttm-r2, fetches 512 steps of Battery residual history via NOAA, forecasts 96 steps ahead; emits doc only if peak > 0.3 ft
  • ttm_311_forecast.weekly_311_forecast_for_point(...): fetches 52-week complaint history for 200 m buffer from 311, runs TTM zero-shot
  • floodnet_forecast.summary_for_point(...): nearest sensor historical events → TTM recurrence forecast
  • ttm_battery_surge.fetch(): msradam/Granite-TTM-r2-Battery-Surge, hourly context → 96 h forecast

13–14. FSM: step_microtopo, step_ida_hwm

  • microtopo.microtopo_at(40.6772, -74.0070): samples data/nyc_dem_30m.tif, hand.tif, twi.tif at point; returns elevation ~3 m, HAND ~0.8 m (near drainage), TWI ~11
  • ida_hwm.summary_for_point(...): checks data/ida_2021_hwms_ny.geojson within 800 m — Ida hit Queens hardest, Red Hook had no USGS HWMs

15. FSM: step_mta_entrances

  • app/registers/mta_entrances.py:summary_for_point(...): loads data/mta_entrances.geojson, finds entrances within 500 m (likely the Smith–9th Sts and Carroll St F/G stations)

16. FSM: step_prithvi

  • prithvi_water.summary_for_point(40.6772, -74.0070): point-in-polygon against the 166 polygons in data/prithvi_ida_2021.geojson; Red Hook is coastal — likely inside_water_polygon=True or close proximity

17. FSM: step_rag

  • Builds query: "address 80 Pioneer Street, Brooklyn; inside Hurricane Sandy 2012 inundation zone; in Deep Contiguous pluvial scenario; flood resilience plan..."
  • rag.retrieve(q, k=3, min_score=0.45): Granite Embedding 278M cosine similarity over embedded corpus; likely returns rag_npcc4 (NPCC4 coastal) + rag_mta (MTA Resilience Roadmap coastal references) + rag_comptroller
  • External reads: none after startup (RAG index built at startup via rag.warm())

18. FSM: step_gliner

  • gliner_extract.extract_for_rag_hits(hits): GLiNER NER extraction over RAG paragraphs; extracts agency names, dollar amounts, infrastructure projects, NYC locations, date ranges
  • Emits gliner_{source} doc messages

19. FSM: step_reconcile

  • _current_strict_mode() = True
  • build_documents(snap) β†’ ~15 doc messages
  • trim_docs_to_plan(doc_msgs, planned_specialists) β†’ drops specialists planner didn't ask for
  • augment_system_prompt(EXTRA_SYSTEM_PROMPT, query="80 Pioneer Street, Brooklyn", intent="single_address") β†’ framing.detect() β†’ generic_exposure β†’ no directive added (Red Hook query has no question-shape keywords)
  • reconcile_strict_streaming(doc_msgs, framed_prompt, loop_budget=2, on_token=..., on_attempt_end=...)
    • Attempt 0: streams tokens to frontend; runs 4 checks; likely passes
    • If fails: feedback user-turn names failing sentences, attempt 1
  • Emits: paragraph, mellea metadata (rerolls=0, requirements_passed=[4/4])
  • SSE: multiple token events β†’ mellea_attempt event β†’ stone_done: Capstone β†’ final event

Scoring (computed in web/main.py from final state, or explicitly via app/score.py:composite()):

  • sandy=True β†’ empirical.sandy=1.0 β†’ floor triggered (tier capped at 2)
  • dep_moderate_2050 depth_class=2 β†’ regulatory.dep_moderate_2050=0.75
  • microtopo HAND=0.8 β†’ hydrological.hand_band=1.0 (HAND < 1 m)
  • composite likely ≥ 1.5 → raw tier 1; floor_applied=True → final tier = min(1, 2) = 1. The Sandy floor guarantees a tier no worse than 2; it is a floor on exposure, not a ceiling, so a raw tier 1 (better than tier 2) already satisfies it and passes through unchanged
  • Final tier: 1 (High exposure)
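
The floor rule is the easy part to get wrong, so here it is as a one-line sketch (lower tier number = higher exposure; min() implements "no worse than tier 2"):

```python
def final_tier(raw_tier, sandy_hit):
    """Apply the empirical floor: a confirmed Sandy inundation hit guarantees a
    final tier no worse than 2 (lower tier = higher exposure). The floor can pull
    a weaker tier down to 2; it never pushes a stronger tier up."""
    return min(raw_tier, 2) if sandy_hit else raw_tier
```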

7. What's robust vs fragile

Robust (load-bearing, tested)

  • Silence-over-confabulation in specialists: Every FSM action returns the declared state key as None on failure; build_documents() gates on state.get(key) is not None; Granite never invents content from absent documents. Pattern is consistent across 25 specialists.
  • NYC-scope guard: _in_nyc() check in every FSM action + build_documents() scope_note mechanism for out-of-NYC addresses. National specialists (NOAA, NWS) still fire and a live-conditions-only briefing is produced.
  • LiteLLM Router failover: app/llm.py automatically fails over from vLLM to Ollama on timeout/5xx. num_retries=0, so the Router doesn't burn seconds re-hitting dead endpoints. The Ollama fallback fires from the same call site.
  • Planner validator floor: _required_specialists() adds geocode/sandy/dep/microtopo even if planner forgot them; prevents silent missing-Stone briefings.
  • Four Mellea grounding checks with reroll feedback: The _failing_sentences_for_citations() targeted feedback mechanism is the reason neighborhood queries went from chronic 3/4 β†’ 4/4. The identifier-aware \b regex in _NUM_RE is specifically why it stopped false-firing on NTA codes.
  • End-to-end probe suite: scripts/probe_addresses.py drives /api/agent/stream against 5 addresses (442 E Houston, 80 Pioneer, 100 Gold, Hollis, Coney Island), asserts Stone fire patterns + Mellea 4/4 + four-section structure. Last green run: 5/5, 5.8–13.1 s per address at RIPRAP_MELLEA_MAX_ATTEMPTS=3.
  • Startup warmup in web/main.py:_warm_caches(): Sandy, DEP, RAG, Ollama models, and heavy ML module pre-imports all happen before the first request. The startup function catches exceptions individually so one failure doesn't kill the app.
  • Threadlocal cleanup in finally: blocks: app/intents/single_address.py always resets all five threadlocals in a finally: clause, preventing state bleeding between requests.
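
The planner validator-floor pattern from the list above is simple enough to sketch; the floor set shown here is taken from the description, but the actual set and ordering live in app/planner.py:

```python
# Floor set the validator enforces even when the planner omits them (illustrative).
REQUIRED = ("geocode", "sandy", "dep", "microtopo")

def with_required(planned):
    """Append any floor specialists the planner forgot, preserving plan order."""
    out = list(planned)
    for spec in REQUIRED:
        if spec not in out:
            out.append(spec)
    return out
```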

Fragile (single points of failure, missing error handling)

  • Burr FSM concurrent queries: iter_steps() mutates module-level Burr state. Two concurrent single_address queries to the same uvicorn worker will interleave threadlocals. No per-request isolation. Production HF Space is single-worker; local dev with --workers 2 would break.
  • build_documents() complexity radon F=101: ~750-line function with one if/elif branch per specialist. Order matters for the Granite prompt. Small edits risk subtle doc-ordering regressions that are silent but affect citation density.
  • entrypoint.sh EO install: Runtime pip install --target for terratorch/einops/diffusers/timm/torchvision into $HOME/.eo-pkgs is brittle β€” if pip fails mid-install the marker isn't created and the next container start retries, but if the Space's filesystem cache persists a partial install, it might never clear. The build log won't show this failure clearly.
  • Droplet redeploy: the services Dockerfile is unverified end-to-end. The last full E2E Dockerfile build was never confirmed; the bootstrap droplet was destroyed before final verification. safetensors==0.8.0rc0 in services/riprap-models/requirements-full.txt is an RC that may fail on a fresh pip resolve.
  • NOAA/NWS live calls without rate-limit handling: app/context/noaa_tides.py and nws_obs.py call live APIs on every request with no caching, no retry-after handling. Under concurrent load or NOAA outage, specialists fail silently (returns error key in result dict) but every request re-hits the failed endpoint.
  • FloodNet GraphQL verify=False: Certificate validation is disabled in app/context/floodnet.py:_gql(). This is a permanent workaround for FloodNet's self-signed cert, not a temporary one.
  • Static asset cache: web/sveltekit/build/ assets have no cache-busting. When iterating on Svelte sources, browser hard-reload is required.
  • Planner 3b β†’ 8b alias on HF Space: RIPRAP_OLLAMA_3B_TAG=granite4.1:8b in the Dockerfile means both planner and reconciler use the 8b on the Space. If 3b is never pulled, the granite4.1:3b model is absent and an explicit call to that tag would fail. Current routing via the alias system prevents this, but a direct tag reference in new code would break.
  • vLLM [doc_id=X] normalization in app/llm.py:_normalize_citations(): Applied per-chunk in streaming and once on non-streaming responses. If vLLM ever splits a citation tag across two stream chunks, the regex would miss it. This hasn't happened in practice but is a known theoretical gap.
  • RAG startup failure doesn't prevent startup: rag.warm() is wrapped in a try/except that prints and continues. If sentence-transformers fails to load, all queries return without policy context β€” the briefing still works but silently loses the RAG section.
  • Mellea API shape versioning: reconcile_strict() uses mellea.start_session(backend_name="ollama") from Mellea 0.3/0.4 (HF has 0.3, local has 0.4). The _extract_text() and _extract_attempts() helpers duck-type multiple attribute names. reconcile_strict_streaming() avoids Mellea's session entirely (hand-rolled) and is version-independent β€” this is the production path. The reconcile_strict() function is only exercised in offline contexts.
  • NYC 311 Socrata calls uncached: Each query fetches fresh from Socrata. Under rate-limit or extended 311 maintenance, the specialist returns n=0 and no 311 doc is emitted; the briefing silently lacks that signal.
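
For reference, the kind of TTL cache the NOAA/NWS and 311 specialists currently lack could be as small as the sketch below. The decorator name and shape are hypothetical, not existing Riprap code:

```python
import time

def ttl_cached(ttl_s):
    """Cache a single-key fetcher's result for ttl_s seconds per key."""
    def deco(fn):
        store = {}  # key -> (expires_at, value)
        def wrapper(key):
            now = time.monotonic()
            hit = store.get(key)
            if hit and hit[0] > now:
                return hit[1]          # fresh cached value: skip the live call
            value = fn(key)
            store[key] = (now + ttl_s, value)
            return value
        return wrapper
    return deco
```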

Known gaps / out-of-scope

  • The compare intent is defined in planner.py's INTENTS dict, but no routing to a compare.py intent module exists in web/main.py:api_agent_stream. The planner would route to it; the runner falls through to single_address.
  • Retrospective mode (what would Riprap have said on date X): blocked at planner with not-implemented message. No historical data replay exists.
  • Cross-register ranking (rank top 5 neighborhoods by flood exposure): blocked at planner. Would require a cross-register join that doesn't exist.
  • FEMA NFHL integration: FEMA 1% and 0.2% floodplain indicators are in the scoring rubric (app/score.py:REGULATORY) but the corresponding FSM step and data layer are absent β€” they're stubbed at 0 in practice. The score still works but the FEMA regulatory sub-index doesn't contribute.
  • Sub-surface flooding (Ida basement mode): Optical satellites can't see basement flooding. Prithvi correctly emits no polygons for inland Queens. This is documented as an honest scope limit, not a bug.
  • /api/compare endpoint exists at web/main.py:compare_stream and works as a two-parallel-FSM-runs endpoint, but the SvelteKit UI doesn't expose a compare page (legacy compare.html was retired in v0.4.5).

8. The non-obvious decisions

Why not a risk score from 0–100

The tier is a deterministic, published rubric (Cutter et al. 2003 construction, Tate 2012 equal-weights argument, Balica 2012 empirical floor). A continuous score would imply calibration against labeled damage outcomes β€” which don't exist here. Riprap has no closed claim records; producing "flood risk 0.73" without claims-driven calibration would be a fabricated precision. The tier is explicitly a prior (METHODOLOGY.md Β§1). FEMA Risk Rating 2.0 is the product to use if you want claims-driven numbers.

Why silence over confabulation

Specialists that don't fire emit nothing. build_documents() gates on state.get(key) is not None. Granite's post-training includes grounded-generation discipline ("don't generate from absent documents"). This plus the Mellea citation checks means a calm-weather query produces no NWS-alerts section in the briefing rather than "no alerts were found" β€” that would be correct but uncitable. The section is absent. This is explicit in the system prompt: "Omit any section whose supporting facts are absent from the documents."
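
The gating pattern described above can be sketched as follows. This is a two-specialist simplification; the real build_documents() covers ~25 specialists with one gated block each:

```python
def build_documents(state):
    """Emit a doc message only when the specialist actually produced data.
    An absent key emits no document, so the model has nothing to cite and
    the corresponding briefing section is simply omitted."""
    docs = []
    if state.get("sandy") is not None:
        docs.append({"doc_id": "sandy", "text": f"Sandy inundation: {state['sandy']}"})
    if state.get("nws_alerts") is not None:
        docs.append({"doc_id": "nws_alerts", "text": f"Active alerts: {state['nws_alerts']}"})
    return docs
```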

Why public-record-only at runtime

Data governance: a newsroom with FOIL'd documents, or an agency with internal capital plans, can't paste that data into a vendor LLM (ARCHITECTURE.md Β§11). All specialist data comes from NYC OpenData, USGS, NOAA, NWS, FloodNet NYC (public sensor network). No commercial data; no private address databases. The system is reproducible and auditable.

Why the four epistemic tiers (empirical / modeled / proxy / synthetic)

The distinction matters for how much weight to give each signal, documented in ARCHITECTURE.md Β§1.2. Empirical (Sandy HWMs, Ida HWMs, FloodNet events) = something flooded a place and was measured. Modeled scenarios (DEP, FEMA NFHL) = hydraulic simulation under assumptions. Proxy (311 complaints, HAND, TWI) = indirect indicators. Synthetic prior (TerraMind synthesis) = generative model output, never "imaged" or "reconstructed." The build_documents() function embeds these interpretive framing sentences directly into the doc bodies so Granite is instructed in the document itself how to characterize each source.
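
The embed-the-framing-in-the-document idea can be sketched like this. The framing sentences below are paraphrases of the four-tier descriptions above, not the actual wording in build_documents():

```python
# Paraphrased framing per epistemic tier (illustrative, not the real strings).
FRAMING = {
    "empirical": "Measured flooding: treat as observed fact.",
    "modeled": "Hydraulic simulation under stated assumptions; say 'modeled', not 'observed'.",
    "proxy": "Indirect indicator; characterize as a correlate, not a measurement.",
    "synthetic": "Generative model output; never describe as imaged or reconstructed.",
}

def frame_doc(tier, body):
    """Prepend the tier's interpretive framing so the instruction travels
    inside the document itself, where the model reads it."""
    return f"{FRAMING[tier]} {body}"
```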

Why the Five Stones names

Functional grouping for a trace UI with 25+ specialists. Stonework vocabulary maps to function: Cornerstone remembers the foundation (static hazard record); Keystone is the load-bearing arch piece (what's exposed); Touchstone is the evaluative reference (current state); Lodestone draws you toward something (forecast pull); Capstone is the crown that holds the vault (synthesis). The names let a non-technical demo audience follow the 25-step trace without reading each step label.

Why citation-grounded prose vs structured output

JSON structured output (tier + per-field arrays) is easy to produce but hard to cite in a grant application or news article. The four-section prose format with [doc_id] tags produces text a planner can quote in a FEMA BRIC sub-application or a journalist can use verbatim with inline sourcing. The citation tags map to clickable source chips in the frontend. Structured JSON of the underlying specialist outputs is also available in the final SSE event for machine consumption.

Why Mellea rejection sampling (vs post-hoc sentence dropping)

The original verify_paragraph() in app/reconcile.py drops sentences after generation. This produces a shorter briefing and a silent quality improvement β€” but the user sees a briefing that may have had sentences removed. The Mellea rejection sampler rerolls the entire generation when it fails, and streams each attempt's tokens to the user live (visible progress), then shows a green/amber inline banner. The user understands the system is enforcing quality, not silently deleting content. Psychologically this is more defensible in a professional context.
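
The reroll loop can be sketched as a generic rejection sampler. generate and the check functions stand in for the Granite call and the four grounding invariants; the feedback wording is illustrative:

```python
def reconcile(generate, checks, loop_budget=2):
    """Rejection-sampling loop: generate, run every grounding check, and
    reroll with targeted feedback until all pass or the budget runs out.
    Returns (text, attempts_used, all_passed)."""
    feedback = None
    text = ""
    for attempt in range(loop_budget):
        text = generate(feedback)
        failures = [name for name, check in checks if not check(text)]
        if not failures:
            return text, attempt, True
        feedback = "Fix the sentences failing these checks: " + ", ".join(failures)
    return text, loop_budget - 1, False
```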

Why planner-then-Capstone two-LLM split

The planner is a structured-output routing task (small JSON, deterministic, temperature=0). It should be fast and cheap. The reconciler is a long-form synthesis task requiring dense citation discipline β€” it benefits from the larger context window and stronger instruction-following of the 8b model. Using 3b for routing keeps TTFB low (planner JSON appears in ~2 s vs ~8 s for 8b). On the HF Space both aliases map to 8b via RIPRAP_OLLAMA_3B_TAG=granite4.1:8b to avoid disk cost, accepting the TTFB penalty.

Why LiteLLM Router

The alternative was a hand-rolled if primary == "vllm": ... else: ollama.chat(...) dispatch. LiteLLM's Router gives model aliasing, failover, and a common call signature for free. The ~250-line shim in app/llm.py covers: Ollama-vs-vLLM backend selection, document-role message extraction for vLLM's HF chat template, [doc_id=X] β†’ [X] citation normalization, JSON-mode translation, and backend info for the UI badge. Any future backend (mlx-lm, llama.cpp, etc.) is a 10-line entry in _build_router().

Why vLLM emits [doc_id=X] while Ollama emits [X]

Ollama's Granite 4.1 Modelfile template lifts role="document <id>" messages into a <documents> block and the model emits bare [X] citations. The HF tokenizer template used by vLLM emits [doc_id=X]. The rest of Riprap (Mellea regex, frontend citation chip parser, sources footer) was written against [X]. The _CITE_NORMALIZE_RE in app/llm.py normalizes per-chunk in streaming, preventing any vLLM-specific citation format from leaking downstream.
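
The normalization itself is a single substitution; the pattern below is an illustrative stand-in for the real _CITE_NORMALIZE_RE in app/llm.py:

```python
import re

# Matches vLLM-style [doc_id=X] citation tags (illustrative pattern).
_CITE = re.compile(r"\[doc_id=([A-Za-z0-9_\-]+)\]")

def normalize_citations(chunk):
    """Rewrite [doc_id=X] to the bare [X] form the rest of the pipeline expects."""
    return _CITE.sub(r"[\1]", chunk)
```

Because this runs per streaming chunk, a tag split across two chunks would escape the regex, which is exactly the theoretical gap noted earlier.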

Why Prithvi runs offline (baked GeoJSON) while TTM runs live

Prithvi-EO 2.0 with TerraTorch needs GPU and minutes per HLS tile. Running it per-query on a CPU-basic Space is not viable. The 166-polygon GeoJSON was computed once on AMD MI300X, filtered (>30,000 sqft to drop noise, <1 kmΒ² to drop tidal artifacts), and committed. The runtime FSM does point-in-polygon (milliseconds). This is honest about what EO models earn their keep on: a one-time defensible event-level signal, not per-request inference. TTM r2 at 1.5 M params runs in milliseconds on CPU β€” no such tradeoff exists.
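
The millisecond runtime check is cheap enough to sketch in pure Python. This ray-casting version is illustrative; the real code may use a geometry library, and real polygons can have holes:

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting point-in-polygon test against one GeoJSON-style exterior
    ring (list of [lon, lat] pairs). Counts crossings of a ray cast east."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):  # edge straddles the point's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside
```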

Why citations_dense uses sentence scope, not character window

The original implementation used ~40 chars proximity between a number and its citation tag. This was fragile for normal English sentence structure ("The address has 11 flood-related complaints [nyc311] within 200 m"). The citation might be 60 chars from the number. Switching to sentence scope (.[\s)] split) eliminated the chronic 3/4 neighborhood-query failure mode. "Sentence scope" is also how human readers actually assign attribution β€” the citation at the end of the sentence covers the claim anywhere in that sentence.
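
A sketch of the sentence-scope check follows. The regexes are illustrative stand-ins for _NUM_RE and the citation pattern, not the actual patterns in app/mellea_validator.py; the negative lookbehind is the identifier-aware trick that keeps codes like BK73 from false-firing:

```python
import re

_SENT_SPLIT = re.compile(r"(?<=\.)[\s)]+")       # sentence boundary: period, then space/paren
_NUM = re.compile(r"(?<![\w\[])\d+(?:\.\d+)?")   # numbers not embedded in identifiers or tags
_CITE = re.compile(r"\[[a-z0-9_]+\]")            # bare [doc_id] citation tag

def uncited_numeric_sentences(paragraph):
    """Sentences containing a number but no citation anywhere in the sentence."""
    bad = []
    for sent in _SENT_SPLIT.split(paragraph):
        if _NUM.search(sent) and not _CITE.search(sent):
            bad.append(sent.strip())
    return bad
```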


9. What's next

From OPEN-ISSUES.md, CLAUDE.md polish targets, and code-level TODO comments, in priority order for the May 13 ASCE presentation:

  1. Demo-script dry run against live Space. Space sometimes sleeps after idle; cold start is 30–90 s. Pre-ping the Space before presenting. Verify the backend pill shows correct hardware.
  2. compare intent wiring. planner.py declares the compare intent (the NOT_IMPLEMENTED comment is misleading: the planner doesn't short-circuit compare, it just routes to single_address by default). To make the compare flow work end-to-end, web/main.py:api_agent_stream needs routing that runs i_addr.run twice in parallel, or a new compare.py intent module.
  3. FEMA NFHL layer. The scoring rubric has fema_1pct and fema_02pct weights but no FSM step or data layer. Adding the FEMA NFHL download and a step_fema_nfhl action would materially improve Regulatory sub-index accuracy for addresses in AE/VE zones that aren't in Sandy extent.
  4. NYCHA/DOE/DOH registers on Space. RIPRAP_NYCHA_REGISTERS=0 by default. Enabling on HF Space would add 3 more Keystone specialists to every single_address query but requires the 91 MB sandy GeoJSON pre-load to complete within Space startup time.
  5. Droplet redeploy verification. The services/riprap-models/Dockerfile was never tested end-to-end. The safetensors==0.8.0rc0 RC pin is the most likely failure point. Next droplet bring-up should test this first.
  6. Experiments OPEN-ISSUES.md items. All four issues are in experiments/ only (F821 numpy annotation in exp17, f-string Py 3.12+ syntax in exp18, B023 closure variable in exp05, F841 unused api in exp18). Won't affect production but clean up the codebase.
  7. Reranker integration. app/rag.py has a full _ensure_reranker() and RIPRAP_RERANKER_ENABLE flag for ibm-granite/granite-embedding-reranker-english-r2 cross-encoder. Off by default (no HF Space disk for the CrossEncoder model). Enabling on the AMD droplet path would improve Policy context quality at no latency cost.
  8. Historical replay / retrospective mode. Blocked at planner with not-implemented message. Substantial feature: would require snapshotting specialist output at query time or storing NOAA/311/FloodNet historical pull results.

10. Quick reference: files that matter

| Task | Open first |
| --- | --- |
| Add a new specialist | app/fsm.py (add @action + wire into build_app()), app/reconcile.py:build_documents() (add doc emission), app/intents/single_address.py (usually no change needed), web/sveltekit/src/ (add step label + source card) |
| Change the briefing structure / system prompt | app/reconcile.py:EXTRA_SYSTEM_PROMPT, then app/intents/neighborhood.py:EXTRA_SYSTEM_PROMPT for the neighborhood path; rebuild web/sveltekit if adding new section rendering |
| Tune the Mellea grounding checks | app/mellea_validator.py: _NUM_RE, _TRIVIAL_NUMS, _check_every_claim_cited(), _failing_sentences_for_citations() |
| Change which backend (vLLM vs Ollama) | app/llm.py env vars; no code change needed |
| Add a new intent | app/planner.py:INTENTS + SPECIALISTS entries, _required_specialists(), then a new app/intents/<intent>.py; wire in web/main.py:api_agent_stream and api_agent |
| Change the exposure tier scoring | app/score.py:REGULATORY/HYDROLOGICAL/EMPIRICAL dicts + TIER_BREAKPOINTS; update METHODOLOGY.md |
| Debug why a specialist fired wrong | scripts/probe_mellea.py --query "<address>" --runs 1; check step events in the SSE stream; look at final.mellea.requirements_failed |
| Rebuild the frontend | cd web/sveltekit && npm run build (new design-system UI); cd web/svelte && npm run build (legacy Svelte 5 custom elements to web/static/dist/riprap.js) |
| Run the full end-to-end test | .venv/bin/python scripts/probe_addresses.py |
| Rebuild the pre-computed registers | scripts/build_mta_entrances_register.py, scripts/build_nycha_register.py, scripts/build_schools_register.py |
| Rebuild Prithvi Ida polygons | scripts/run_prithvi_ida.py (needs GPU + TerraTorch) |
| Rebuild the pitch deck | cd slides && make pdf html pptx (needs marp-cli) |
| Add a question-type framing | app/framing.py:_PATTERNS + _DIRECTIVES |
| Understand why a doc was missing from the briefing | Check build_documents() in app/reconcile.py (each block has an explicit gate condition); also check trim_docs_to_plan() |
| Understand the SSE stream structure | web/main.py:api_agent_stream, the _STEP_TO_STONE dict, and the stone_start/stone_done wrapping logic |
| Deploy to HF Space | git push && git push huggingface main; monitor rebuild via `curl -sf "https://huggingface.co/api/spaces/lablab-ai-amd-developer-hackathon/riprap-nyc/runtime"` |
| Deploy to AMD droplet | scripts/deploy_droplet.sh <ip> <token>, then set Space env vars via huggingface-cli space variables, restart Space |