# Riprap — Owner's Brief

---

## 1. The system in one paragraph

Riprap takes any NYC address, neighborhood, or development-permit query and produces a four-section flood-exposure briefing where every numeric claim is anchored to a `[doc_id]` citation that traces back to the source dataset, agency report, or model output. A natural-language planner (Granite 4.1 3b) routes each query to one of four intent paths; the chosen path fans out across up to ~25 atomic data specialists; a synthesizer (Granite 4.1 8b) reads only the specialist outputs that fired and writes the briefing; a Mellea rejection sampler checks four grounding invariants and rerolls if any fail. The system is NYC-specific and public-record-only: all data comes from NYC OpenData, USGS, NOAA, NWS, or FloodNet, and all four models run inside the container — no vendor LLM is contacted at runtime. The output is a tier 1–4 exposure score (deterministic, published rubric, not generated by the LLM) plus a cited paragraph in prose. What Riprap does not do: damage probability, insurance rating, flood prediction, or any claim about basement apartments or infrastructure that isn't in a public register.

---

## 2. Architecture map

### HTTP request lifecycle

```
User browser → GET /api/agent/stream?q=
  web/main.py: api_agent_stream() (async SSE generator)
    runs runner() in a threadpool executor
      app/planner.plan(q, on_token=...)
        → streams plan_token events while Granite generates
        returns Plan(intent, targets, specialists, rationale)
      out_q.put({kind:"plan", ...}) → SSE plan event
      intent dispatch:
        "single_address"    → app/intents/single_address.run(plan, q, progress_q, strict=True)
        "neighborhood"      → app/intents/neighborhood.run(plan, q, progress_q, strict=True)
        "development_check" → app/intents/development_check.run(plan, q, progress_q, strict=True)
        "live_now"          → app/intents/live_now.run(plan, q, progress_q)
        "not_implemented"   → inline JSON response, no FSM
      each intent calls fsm.iter_steps() or its own specialist loop
        → out_q.put({kind:"step", ...}) per specialist
        → out_q.put({kind:"token", ...}) per Granite reconcile chunk
        → out_q.put({kind:"mellea_attempt", ...}) per Mellea pass/fail
      out_q.put({kind:"final", ...})
      event_stream() async generator reads out_q, wraps steps in the
        stone_start / stone_done envelope keyed by the _STEP_TO_STONE dict,
        yields SSE frames
      SSE response headers: Cache-Control: no-cache, X-Accel-Buffering: no
```

### Planner: `app/planner.py`

- Entry: `plan(query, model, on_token) → Plan`
- Model: `RIPRAP_PLANNER_MODEL` env, default `granite4.1:3b`
- Uses `llm.chat(format="json")` with `temperature=0` for deterministic JSON output via Ollama's constrained-decode mode
- Pre-filter: `_not_implemented_message(query)` checks two regex patterns (retrospective, ranking) and returns early with a `Plan(intent="not_implemented")` so no LLM call is made
- Post-validator: `_validate(d, raw_query)` sanitizes intent, targets, and specialists against the declared INTENTS/SPECIALISTS dicts; adds floor specialists via `_required_specialists(intent)` if the planner omitted them
- Floor specialists (always added regardless of planner output): geocode+sandy+dep_stormwater+microtopo for single_address; nta_resolve+sandy+dep_stormwater+nyc311 for neighborhood; nws_alerts+noaa_tides for live_now
- Returns: `Plan(intent, targets: list[dict], specialists: list[str], rationale: str)`

### FSM: `app/fsm.py`

- Entry:
  `build_app(query) → Burr Application`; `run(query) → dict`; `iter_steps(query) → generator`
- Burr 0.x `ApplicationBuilder` with `with_state(query=query, trace=[])`, `with_entrypoint("geocode")`
- Actions registered in dict order; transitions are consecutive pairs (linear, not a DAG)
- Each `@action` writes one state key and appends to the `trace` list
- Out-of-NYC guard: `_NYC_S/W/N/E = 40.49, -74.27, 40.92, -73.69` — NYC-specific specialists skip with an `"out of NYC scope"` reason; live/national specialists (NWS/NOAA/TTM) run unconditionally
- Thread-locals for streaming (since Burr runs sync in a background thread):
  - `set_strict_mode(bool)` / `_current_strict_mode()`
  - `set_token_callback(fn)` / `_current_token_callback()`
  - `set_mellea_attempt_callback(fn)` / `_current_mellea_attempt_callback()`
  - `set_planned_specialists(set)` / `_current_planned_specialists()`
  - `set_user_query(str)` / `_current_user_query()`
  - `set_planner_intent(str)` / `_current_planner_intent()`
- `iter_steps` spawns a daemon thread running `app.iterate(halt_after=["reconcile"])`; snapshots the threadlocals from the caller thread and re-installs them on the iterate thread; deduplicates trace records by (step_name, started_at)
- Heavy-specialist gate: `_HEAVY_SPECIALISTS_ENABLED` = True when `RIPRAP_LLM_PRIMARY != ollama` OR `RIPRAP_ML_BASE_URL` is set; otherwise False. Controls whether prithvi_live, terramind, eo_chip, terramind_lulc, and terramind_buildings fire
- NYCHA register gate: `_NYCHA_REGISTERS_ENABLED`, set by `RIPRAP_NYCHA_REGISTERS=1` (default off); the registers load a 91 MB GeoJSON file on first call

### Full action sequence (default, single_address)

| # | Action name | State key written | Data source |
|---|---|---|---|
| 1 | geocode | geocode, lat, lon | NYC DCP Geosearch → OSM Nominatim fallback |
| 2 | sandy | sandy | data/sandy_inundation.geojson (lru_cache) |
| 3 | dep | dep | data/dep/*.gdb (3 scenarios, lru_cache) |
| 4 | floodnet | floodnet | api.floodnet.nyc Hasura GraphQL |
| 5 | nyc311 | nyc311 | Socrata erm2-nwe9 |
| 6 | noaa_tides | noaa_tides | api.tidesandcurrents.noaa.gov |
| 7 | nws_alerts | nws_alerts | api.weather.gov/alerts/active |
| 8 | nws_obs | nws_obs | api.weather.gov/stations//observations |
| 9 | ttm_forecast | ttm_forecast | ibm-granite/granite-timeseries-ttm-r2 (in-process or remote) |
| 10 | ttm_311_forecast | ttm_311_forecast | TTM r2 on local 311 weekly series |
| 11 | floodnet_forecast | floodnet_forecast | TTM r2 on nearest FloodNet sensor history |
| 12 | ttm_battery_surge | ttm_battery_surge | msradam/Granite-TTM-r2-Battery-Surge (remote or local) |
| 13 | microtopo | microtopo | data/nyc_dem_30m.tif, twi.tif, hand.tif |
| 14 | ida_hwm | ida_hwm | data/ida_2021_hwms_ny.geojson |
| 15 | mta_entrances | mta_entrances | data/mta_entrances.geojson |
| 16 | prithvi | prithvi_water | data/prithvi_ida_2021.geojson (166 polygons) |
| 17–22 | (heavy, if enabled) | prithvi_live, terramind, eo_chip, terramind_lulc, terramind_buildings | STAC/Sentinel-2, msradam/TerraMind-NYC-Adapters |
| 23 | rag | rag | Granite Embedding 278M over corpus/*.pdf (5 PDFs) |
| 24 | gliner | gliner | GLiNER typed-entity extraction over RAG hits |
| 25 | reconcile | paragraph, audit, mellea | Granite 4.1:8b via Mellea strict sampler |

### Capstone reconciliation: `app/reconcile.py` + `app/mellea_validator.py`

- `build_documents(state) → list[dict]` — emits one `{"role": "document", "content": "..."}` per specialist that fired, in Stones order; gated by both specialist fire status and the out-of-NYC guard
- `trim_docs_to_plan(doc_msgs, planned_specialists)` — drops doc messages not matching the planner's specialist set; saves ~30–50% prompt tokens; `RIPRAP_TRIM_DOCS=0` disables
- `EXTRA_SYSTEM_PROMPT` — the 4-section skeleton with the citation-discipline rules
- `augment_system_prompt(EXTRA_SYSTEM_PROMPT, query, intent)` — calls `app/framing.detect()` to classify the question type (11 types, deterministic regex), then appends a `QUESTION-AWARE OPENING:` directive to the system prompt for non-generic questions
- Strict path (production): `reconcile_strict_streaming(doc_msgs, system_prompt, ...)` in `app/mellea_validator.py`
  - Streams each attempt's tokens via the `on_token(delta, attempt_idx)` callback
  - After each attempt, runs the four checks and fires the `on_attempt_end(attempt_idx, passed, failed)` callback
  - On failure, appends a feedback user-turn naming the failing sentences and rerolls
  - Budget: `DEFAULT_LOOP_BUDGET` = 2 (Ollama primary) or 3 (vLLM primary), overridable via `RIPRAP_MELLEA_MAX_ATTEMPTS`
- Legacy path (non-strict): `reconcile.reconcile(state)` → streams tokens, then calls `verify_paragraph()`, which drops sentences with ungrounded numbers (post-hoc, not rejection sampling)
- The `step_reconcile` action detects strict mode via `_current_strict_mode()` and routes to one path or the other

### Four Mellea grounding checks (`app/mellea_validator.py`)

1. **`numerics_grounded`** — `_check_no_invented_numbers()`: every non-trivial number in the output appears verbatim in the haystack (joined document content). Trivial set: `{0–10, 100, 311, 911, 211}`. Number regex: `\b-?\d[\d,]*(?:\.\d+)?\b` (word-boundary — skips `QN1206`, `B12`)
2.
   **`no_placeholder_tokens`** — `_check_no_placeholder_tokens()`: the output contains none of the known placeholder tokens, e.g. `[source]` or `/observations/latest`
3. **`citations_dense`** — every sentence that carries a number must also carry a `[doc_id]` citation
4. **`citations_resolve`** — every cited doc_id must resolve to a document actually supplied in the prompt

---

## 3. The Five Stones

### Touchstone

**Data sources:**

- NWS observations: nearest of KNYC, KLGA, KJFK, KEWR, KFRG
- NOAA tides: `https://api.tidesandcurrents.noaa.gov/api/prod/datagetter`; 6-min cadence
- Prithvi live: Microsoft Planetary Computer STAC API for Sentinel-2 L2A; msradam/Prithvi-EO-2.0-NYC-Pluvial v2 weights
- TerraMind LULC: shared chip from `step_eo_chip` (also STAC/Planetary Computer)

**Models invoked:** Prithvi-EO-2.0-NYC-Pluvial v2 (300 M params, TerraTorch, flood IoU 0.5979 vs 0.10 base); TerraMind-NYC-Adapters LULC LoRA (mIoU 0.5866, +6.13 pp over full-FT)

**Failure modes:** the FloodNet GraphQL call sets `verify=False` (self-signed cert); 311 Socrata times out gracefully; NOAA/NWS calls have 15–20 s timeouts; Prithvi/TerraMind LULC require `_HEAVY_SPECIALISTS_ENABLED` and a successful `app/context/eo_chip_cache.py:fetch()`

---

### Lodestone — Projector

**Job:** Report forward-looking signals — NWS alerts, surge forecasts, and complaint-rate trends.
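The surge-forecast specialists below emit a document only when the forecast is "interesting" (for the surge path, a peak above 0.3 ft). A minimal sketch of that gate with hypothetical names (`summarize_surge_forecast` is illustrative, not the actual `fsm.py` helper):

```python
def summarize_surge_forecast(forecast_ft, step_minutes=6, threshold_ft=0.3):
    """Scan a TTM forecast horizon for its peak and decide whether
    a doc should be emitted at all (silence over confabulation)."""
    peak_idx = max(range(len(forecast_ft)), key=lambda i: forecast_ft[i])
    peak = forecast_ft[peak_idx]
    return {
        "forecast_peak_ft": peak,
        "forecast_peak_minutes_ahead": (peak_idx + 1) * step_minutes,
        "interesting": peak > threshold_ft,  # no doc when False
    }

# A 0.41 ft peak three 6-min steps out clears the 0.3 ft bar:
print(summarize_surge_forecast([0.05, 0.12, 0.41, 0.38]))
# → {'forecast_peak_ft': 0.41, 'forecast_peak_minutes_ahead': 18, 'interesting': True}
```

When `interesting` is false, no Lodestone doc reaches the synthesizer, so a quiet harbor produces no forecast sentence rather than an uncitable "no surge expected" claim.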
**Specialists (file:function):**

| Specialist | File:function | What it returns |
|---|---|---|
| `step_nws_alerts` | `fsm.py:step_nws_alerts` | Active NWS flood-relevant alerts at the point (Flash Flood, Coastal Flood, etc.): n_active, list of alerts with event/severity/urgency/expires |
| `step_ttm_forecast` | `fsm.py:step_ttm_forecast` | TTM r2 zero-shot Battery surge residual: context 512 steps (~51 h at 6-min), horizon 96 steps (~9.6 h); forecast_peak_ft, forecast_peak_minutes_ahead; only emits a doc when interesting (peak > 0.3 ft) |
| `step_ttm_311_forecast` | `fsm.py:step_ttm_311_forecast` | TTM r2 zero-shot on 52 weeks of 311 complaint history → 4-week forecast; forecast_mean_per_week, forecast_peak_per_week, accelerating flag |
| `step_floodnet_forecast` | `fsm.py:step_floodnet_forecast` | TTM r2 on nearest FloodNet sensor flood-event recurrence; forecast_28d_expected_events, accelerating; silent if sensor history is too sparse |
| `step_ttm_battery_surge` | `fsm.py:step_ttm_battery_surge` | msradam/Granite-TTM-r2-Battery-Surge fine-tune: hourly cadence, 96 h horizon; forecast_peak_m, forecast_peak_hours_ahead; only emits a doc when interesting |

**Data sources:**

- NWS alerts: `https://api.weather.gov/alerts/active` filtered to flood event types at the point's county
- TTM context data: live pull from NOAA CO-OPS 6-min water level (for Battery/Kings Point/Sandy Hook); Socrata 311 history; FloodNet GraphQL event history
- Battery surge fine-tune: NOAA hourly verified water level from the Battery gauge (NOAA 8518750), loaded by `app/live/ttm_battery_surge.py`

**Models invoked:** ibm-granite/granite-timeseries-ttm-r2 (1.5 M params, ~30 MB, CPU-viable, zero-shot); msradam/Granite-TTM-r2-Battery-Surge fine-tune (same backbone, test MAE 0.1091 m, −41% vs persistence, −25% vs zero-shot)

**Failure modes:** the NWS alerts call gracefully returns `n_active=0` on timeout; TTM models load lazily via `app/live/ttm_forecast.py:_load_model()` with the `_DEPS_OK = False` fallback pattern; all Lodestone specialists fire unconditionally (no NYC bbox gate, except floodnet/311, which are NYC-specific)

---

### Capstone — Synthesizer

**Job:** Read all documents produced by the four data-Stones and write a citation-grounded four-section prose briefing.

**Entry:** `app/mellea_validator.py:reconcile_strict_streaming(doc_msgs, system_prompt, user_prompt, loop_budget, on_token, on_attempt_end)`

**Document ordering in prompt:** geocode preamble → Cornerstone (sandy, dep_*, ida_hwm, prithvi_water, microtopo) → Keystone (mta_entrance_*, nycha_dev_*, doe_school_*, nyc_hospital_*, tm_buildings) → Touchstone (floodnet, nyc311, nws_obs, noaa_tides, prithvi_live, tm_lulc) → Lodestone (nws_alerts, ttm_forecast, ttm_311_forecast, floodnet_forecast_*, ttm_battery) → Policy (rag_*, gliner_*)

**Four-section skeleton (from `EXTRA_SYSTEM_PROMPT`):**

- **Status.** — dominant exposure signal, strongest doc_id citation
- **Empirical evidence.** — Sandy, 311, FloodNet, Ida HWMs, Prithvi polygons
- **Modeled scenarios.** — DEP dep_* scenarios, microtopo terrain (HAND, TWI, percentile)
- **Policy context.** — one sentence per RAG hit, citing agency name + rag_* doc_id

**Four grounding checks (described in §2 above):** `numerics_grounded`, `no_placeholder_tokens`, `citations_dense`, `citations_resolve`

**Reroll feedback mechanism:** `_failing_sentences_for_citations(text)` identifies sentences with uncited numbers; on reroll the feedback user-turn names those specific sentences and instructs surgical citation additions

**Model:** `RIPRAP_RECONCILER_MODEL` env, default `granite4.1:8b`; `num_ctx=4096`, `num_predict=400`

**Return shape from `step_reconcile`:** `{paragraph, audit: {raw, dropped}, mellea: {rerolls, n_attempts, requirements_passed, requirements_failed, requirements_total, model, loop_budget}}`

---

## 4. The three NYC fine-tunes

### msradam/Prithvi-EO-2.0-NYC-Pluvial

- **HF Hub path:** `msradam/Prithvi-EO-2.0-NYC-Pluvial`
- **Base model:** IBM/NASA Prithvi-EO 2.0 (300 M params, ViT-L foundation model pre-trained on HLS Sentinel-2 multispectral imagery), Apache-2.0
- **Training data:** NYC HLS Sentinel-2 tiles with pluvial flood labels derived from the USGS Ida HWM survey and NYC DEP records; Lovász-Softmax loss with copy-paste augmentation; trained on AMD Instinct MI300X
- **Metrics:** Test flood IoU 0.5979 vs 0.10 on the Sen1Floods11 base (6× improvement)
- **Invocation:** Two paths:
  - Offline (Cornerstone): produced `data/prithvi_ida_2021.geojson` via `scripts/run_prithvi_ida.py`; runtime does point-in-polygon, no model call
  - Live (Touchstone): `app/flood_layers/prithvi_live.py:fetch(lat, lon)` — fetches the latest Sentinel-2 L2A chip from Planetary Computer STAC, runs a model forward pass, returns `pct_water_within_500m`, `pct_water_full`; slow (~30 s), gated by `_HEAVY_SPECIALISTS_ENABLED`; input 6-band S2L2A chip, output binary segmentation mask
- **Degradation:** If Planetary Computer STAC is unavailable or cloud cover is too high, `fetch()` returns `{ok: False, skipped: "...reason..."}` and no doc is emitted

### msradam/TerraMind-NYC-Adapters

- **HF Hub path:** `msradam/TerraMind-NYC-Adapters`
- **Base model:** TerraMind 1.0 (IBM/ESA any-to-any generative EO foundation model), Apache-2.0
- **Training data:** NYC Sentinel-2 + SAR chips matched to ESRI Land Cover 2020–2022 labels (LULC adapter) and NYC building footprints (Buildings adapter); trained on AMD Instinct MI300X in ~18 minutes
- **Metrics:** LULC test mIoU 0.5866 (+6.13 pp over full-FT baseline); Buildings test mIoU 0.5511; TiM 0.6023
- **Two adapters:**
  - `lulc` — 5-class land cover (water, built, vegetation, bare, agriculture); invoked by `step_terramind_lulc` via `app/context/terramind_nyc.py:lulc(s2_tensor, s1rtc, dem)`
  - `buildings` — binary building footprint mask; invoked by `step_terramind_buildings` via `app/context/terramind_nyc.py:buildings(s2_tensor, s1rtc, dem)`
- **Shared chip:** Both consume tensors from `step_eo_chip` → `app/context/eo_chip_cache.py:fetch(lat, lon)`, which fetches the S2L2A + S1RTC + DEM chip once per query
- **Degradation:** If `eo_chip` didn't fire successfully, both LoRA specialists silently no-op. Lazy load + cached in-process; first call ~30 s, subsequent calls ~3–7 s

### msradam/Granite-TTM-r2-Battery-Surge

- **HF Hub path:** `msradam/Granite-TTM-r2-Battery-Surge`
- **Base model:** ibm-granite/granite-timeseries-ttm-r2 (1.5 M params, Tiny Time Mixer, Ekambaram et al. NeurIPS 2024), Apache-2.0
- **Training data:** NOAA CO-OPS Battery gauge (station 8518750) hourly verified water level, surge residual computed as verified minus harmonic tide; trained on AMD Instinct MI300X
- **Metrics:** Test MAE 0.1091 m, −41% vs persistence, −25% vs zero-shot TTM r2
- **Invocation:** `app/live/ttm_battery_surge.py:fetch()` — loads the model via `tsfm_public.get_model()`, fetches NOAA hourly context, returns `{available, context_hours, horizon_hours: 96, forecast_peak_m, forecast_peak_hours_ahead, interesting}`; in-process on CPU
- **Input shape:** `(context_length, 1)` float tensor of hourly surge residuals; context = 336 h (~14 days)
- **Output shape:** `(96,)` hourly forecast, scanned for its peak
- **Degradation:** `_DEPS_OK` module-level flag set at import time; on failure returns `{available: False, reason: "..."}`, no doc emitted

---
## 5. The deployment topology

### Local development

- Python 3.12 venv (`.venv`), `uv` for package management
- Ollama serving `granite4.1:3b` + `granite4.1:8b` locally
- `uvicorn web.main:app --host 127.0.0.1 --port 7860`
- `_HEAVY_SPECIALISTS_ENABLED = False` by default (no `RIPRAP_ML_BASE_URL` set, no vLLM)
- `RIPRAP_NYCHA_REGISTERS = 0` by default (skips the heavy 91 MB GeoJSON load)
- Granite Embedding 278M and TTM r2 download to the HF cache on first query (~280 MB + ~30 MB)
- SvelteKit UI built at `web/sveltekit/build/`; rebuild only needed when sources change

### HF Space (production demo URL)

- URL: `https://lablab-ai-amd-developer-hackathon-riprap-nyc.hf.space`
- Docker SDK, base `nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04`, hardware `cpu-basic` (actual hardware is cpu-basic, not T4 — ARCHITECTURE.md mentions T4, but the Dockerfile's GPU notes are aspirational)
- Python 3.10 inside the container (pinning `mellea<0.4`, `transformers<5`, `huggingface_hub<1`)
- `entrypoint.sh` flow:
  1. Attempts EO toolchain install at runtime to `$HOME/.eo-pkgs` (bypasses the HF build disk limit); if this fails, terramind/prithvi-live silently skip
  2. Starts `ollama serve` in the background, polls until ready (up to 60 s)
  3. Pulls `granite4.1:8b` at runtime if not cached (~5 GB, ~2 min first cold start); 3b is optional
  4. Pre-warms 8b via `curl POST /api/generate` with `keep_alive=24h`
  5. Launches `uvicorn web.main:app --host 0.0.0.0 --port 7860`
- `RIPRAP_OLLAMA_3B_TAG=granite4.1:8b` set in the Dockerfile so the planner routes to 8b (avoids the disk cost of two separate model pulls)
- `web/main.py:_warm_caches()` on startup: loads sandy + DEP layers, optionally NYCHA registers, warms RAG (Granite Embedding 278M + 5 PDFs), pre-imports heavy ML stacks to avoid import races, warms Ollama models via HTTP

### AMD MI300X droplet (demo GPU path — currently destroyed)

- Two Docker containers on the same host, both with `--device=/dev/kfd --device=/dev/dri`
- Container 1: `vllm/vllm-openai-rocm:v0.17.1` — serves `granite-4.1-8b` on port 8001
  - `--max-model-len 8192`, `--served-model-name granite-4.1-8b`
  - `GLOO_SOCKET_IFNAME=eth0` required or gloo fails to bind
- Container 2: `riprap-models:latest` (built from `services/riprap-models/Dockerfile`) — FastAPI on port 8002 (or 7860 per scripts)
  - Endpoints: `GET /healthz`, `POST /v1/prithvi-pluvial`, `POST /v1/terramind`, `POST /v1/ttm-forecast`, `POST /v1/granite-embed`, `POST /v1/gliner-extract`
  - Model loading: lazy + a per-model threading.Lock to prevent double-load on concurrent requests
  - ROCm device: `cuda` (ROCm's CUDA shim maps `cuda` to the first `/dev/kfd` device)

**Env vars to connect the HF Space to the droplet:**

```
RIPRAP_LLM_PRIMARY=vllm
RIPRAP_LLM_BASE_URL=http://:8001/v1
RIPRAP_LLM_API_KEY=
RIPRAP_ML_BASE_URL=http://:8002
RIPRAP_ML_API_KEY=
```

**What breaks if the droplet IP changes:** Set the four env vars above via `huggingface-cli space variables` and restart the Space. The LiteLLM Router builds at import time from env, so a Space restart is required.

**Deterministic redeploy:** `scripts/deploy_droplet.sh $TOKEN` — idempotent, ~10–20 min first run (pulls images, builds riprap-models); re-runs on the same droplet take ~1 min. Known fragile: the `safetensors==0.8.0rc0` pin in `services/riprap-models/requirements-full.txt` is an RC and may fail on future pip resolves.

---
## 6. One query traced end-to-end: "80 Pioneer Street, Brooklyn"

**Query enters:** `GET /api/agent/stream?q=80+Pioneer+Street%2C+Brooklyn`

**1. Planner** (`app/planner.py:plan`)

- No not-implemented regex matches
- Calls `llm.chat(model="granite4.1:3b", messages=[system, user], format="json", stream=True, temperature=0)`
- Streams `plan_token` SSE events as the JSON generates
- Returns `Plan(intent="single_address", targets=[{type:"address", text:"80 Pioneer Street, Brooklyn"}], specialists=[...], rationale="...")`
- Validator adds floor specialists: geocode, sandy, dep_stormwater, microtopo
- SSE: `plan` event emitted

**2. single_address.run** (`app/intents/single_address.py:run`)

- Sets threadlocals: `strict=True`, `planned_specialists={...}`, `user_query="80 Pioneer Street, Brooklyn"`, `planner_intent="single_address"`
- Registers `on_token` and `on_mellea_attempt` callbacks on `progress_q`
- Calls `fsm.iter_steps("80 Pioneer Street, Brooklyn")`

**3. FSM: step_geocode**

- `app/geocode.py:geocode_one("80 Pioneer Street, Brooklyn")`
- Detects the borough hint "Brooklyn", calls DCP Geosearch with `size=8`, filters for Brooklyn results
- Returns `GeocodeHit(address="80 Pioneer Street, Brooklyn, NY 11231", borough="Brooklyn", lat=40.6772, lon=-74.0070, bbl="3-00589-0003", ...)`
- State: `{geocode: {...}, lat: 40.6772, lon: -74.0070}`
- SSE: `step` event `{step: "geocode", ok: true, elapsed_s: 0.4, result: {address:..., lat:..., lon:...}}`

**4. FSM: step_sandy**

- Confirmed inside the NYC bbox
- `sandy_inundation.join(point)` — spatial join against `data/sandy_inundation.geojson`
- Red Hook is inside the 2012 Sandy inundation zone → `sandy=True`
- State: `{sandy: True}`
- SSE: `step` event → opens `stone_start: Cornerstone`

**5. FSM: step_dep**

- `dep_stormwater.join(pt, scen)` for each of 3 scenarios against `data/dep/*.gdb`
- Likely returns `dep_moderate_2050: depth_class=2 (Deep & Contiguous 1-4 ft)`, `dep_extreme_2080: depth_class=3 (Deep Contiguous >4 ft)`, `dep_moderate_current: depth_class=1`

**6–8. FSM: step_floodnet, step_311, step_noaa_tides**

- FloodNet: GraphQL POST to `api.floodnet.nyc` — checks sensors within 600 m of (40.6772, -74.0070)
- 311: Socrata API call for flood complaints within 200 m, last 5 years
- NOAA: fetches the Battery gauge (closest of the 3 stations to Red Hook), returns observed/predicted/residual

**9–12. FSM: TTM forecast steps**

- `ttm_forecast.summary_for_point(40.6772, -74.0070)`: loads ibm-granite/granite-timeseries-ttm-r2, fetches 512 steps of Battery residual history via NOAA, forecasts 96 steps ahead; emits a doc only if peak > 0.3 ft
- `ttm_311_forecast.weekly_311_forecast_for_point(...)`: fetches the 52-week complaint history for the 200 m buffer from 311, runs TTM zero-shot
- `floodnet_forecast.summary_for_point(...)`: nearest sensor's historical events → TTM recurrence forecast
- `ttm_battery_surge.fetch()`: msradam/Granite-TTM-r2-Battery-Surge, hourly context → 96 h forecast

**13–14. FSM: step_microtopo, step_ida_hwm**

- `microtopo.microtopo_at(40.6772, -74.0070)`: samples `data/nyc_dem_30m.tif`, `hand.tif`, `twi.tif` at the point; returns elevation ~3 m, HAND ~0.8 m (near drainage), TWI ~11
- `ida_hwm.summary_for_point(...)`: checks `data/ida_2021_hwms_ny.geojson` within 800 m — Ida hit Queens hardest; Red Hook had no USGS HWMs

**15. FSM: step_mta_entrances**

- `app/registers/mta_entrances.py:summary_for_point(...)`: loads `data/mta_entrances.geojson`, finds entrances within 500 m (likely the Smith-9th and Carroll St F/G stations)

**16. FSM: step_prithvi**

- `prithvi_water.summary_for_point(40.6772, -74.0070)`: point-in-polygon against the 166 polygons in `data/prithvi_ida_2021.geojson`; Red Hook is coastal — likely `inside_water_polygon=True` or close proximity

**17. FSM: step_rag**

- Builds the query: "address 80 Pioneer Street, Brooklyn; inside Hurricane Sandy 2012 inundation zone; in Deep Contiguous pluvial scenario; flood resilience plan..."
- `rag.retrieve(q, k=3, min_score=0.45)`: Granite Embedding 278M cosine similarity over the embedded corpus; likely returns `rag_npcc4` (NPCC4 coastal) + `rag_mta` (MTA Resilience Roadmap coastal references) + `rag_comptroller`
- External reads: none after startup (the RAG index is built at startup via `rag.warm()`)

**18. FSM: step_gliner**

- `gliner_extract.extract_for_rag_hits(hits)`: GLiNER NER extraction over the RAG paragraphs; extracts agency names, dollar amounts, infrastructure projects, NYC locations, date ranges
- Emits `gliner_{source}` doc messages

**19. FSM: step_reconcile**

- `_current_strict_mode() = True`
- `build_documents(snap)` → ~15 doc messages
- `trim_docs_to_plan(doc_msgs, planned_specialists)` → drops specialists the planner didn't ask for
- `augment_system_prompt(EXTRA_SYSTEM_PROMPT, query="80 Pioneer Street, Brooklyn", intent="single_address")` → `framing.detect()` → `generic_exposure` → no directive added (the Red Hook query has no question-shape keywords)
- `reconcile_strict_streaming(doc_msgs, framed_prompt, loop_budget=2, on_token=..., on_attempt_end=...)`
  - Attempt 0: streams tokens to the frontend; runs the 4 checks; likely passes
  - If it fails: the feedback user-turn names the failing sentences; attempt 1
- Emits: paragraph, mellea metadata (`rerolls=0`, `requirements_passed=[4/4]`)
- SSE: multiple `token` events → `mellea_attempt` event → `stone_done: Capstone` → `final` event

**Scoring** (computed in `web/main.py` from the final state, or explicitly via `app/score.py:composite()`):

- `sandy=True` → empirical.sandy=1.0 → floor triggered (tier capped at 2)
- `dep_moderate_2050 depth_class=2` → regulatory.dep_moderate_2050=0.75
- `microtopo HAND=0.8` → hydrological.hand_band=1.0 (HAND < 1 m)
- composite likely ≥ 1.5 → raw tier 1; floor_applied=True. The floor rule caps the tier at no worse than 2; it is a floor, not a ceiling, so a raw tier 1 (already better than tier 2) satisfies it and the final tier stays 1
- Final tier: 1 (High exposure)

---

## 7. What's robust vs fragile

### Robust (load-bearing, tested)

- **Silence-over-confabulation in specialists:** Every FSM action returns its declared state key as `None` on failure; `build_documents()` gates on `state.get(key) is not None`; Granite never invents content from absent documents. The pattern is consistent across the 25 specialists.
- **NYC-scope guard:** the `_in_nyc()` check in every FSM action + the `build_documents()` scope_note mechanism for out-of-NYC addresses. National specialists (NOAA, NWS) still fire and a live-conditions-only briefing is produced.
- **LiteLLM Router failover:** `app/llm.py` auto-fails over from vLLM to Ollama on timeout/5xx. `num_retries=0` so the Router doesn't burn seconds re-hitting dead endpoints. The Ollama fallback fires from the same call site.
- **Planner validator floor:** `_required_specialists()` adds geocode/sandy/dep/microtopo even if the planner forgot them; prevents silent missing-Stone briefings.
- **Four Mellea grounding checks with reroll feedback:** The `_failing_sentences_for_citations()` targeted feedback mechanism is the reason neighborhood queries went from a chronic 3/4 to 4/4. The identifier-aware `\b` regex in `_NUM_RE` is specifically why it stopped false-firing on NTA codes.
- **End-to-end probe suite:** `scripts/probe_addresses.py` drives `/api/agent/stream` against 5 addresses (442 E Houston, 80 Pioneer, 100 Gold, Hollis, Coney Island), asserting Stone fire patterns + Mellea 4/4 + the four-section structure. Last green run: 5/5, 5.8–13.1 s per address at `RIPRAP_MELLEA_MAX_ATTEMPTS=3`.
- **Startup warmup in `web/main.py:_warm_caches()`:** Sandy, DEP, RAG, Ollama models, and heavy ML module pre-imports all happen before the first request. The startup function catches exceptions individually so one failure doesn't kill the app.
- **Threadlocal cleanup in `finally:` blocks:** `app/intents/single_address.py` always resets all five threadlocals in a `finally:` clause, preventing state bleeding between requests. ### Fragile (single points of failure, missing error handling) - **Burr FSM concurrent queries:** `iter_steps()` mutates module-level Burr state. Two concurrent `single_address` queries to the same uvicorn worker will interleave threadlocals. No per-request isolation. Production HF Space is single-worker; local dev with `--workers 2` would break. - **`build_documents()` complexity radon F=101:** ~750-line function with one `if`/`elif` branch per specialist. Order matters for the Granite prompt. Small edits risk subtle doc-ordering regressions that are silent but affect citation density. - **entrypoint.sh EO install:** Runtime `pip install --target` for terratorch/einops/diffusers/timm/torchvision into `$HOME/.eo-pkgs` is brittle — if pip fails mid-install the marker isn't created and the next container start retries, but if the Space's filesystem cache persists a partial install, it might never clear. The build log won't show this failure clearly. - **Droplet redeploy: Dockerfile unverified end-to-end:** The last full E2E Dockerfile build was never confirmed — the bootstrap droplet was destroyed before final verification. `safetensors==0.8.0rc0` in `services/riprap-models/requirements-full.txt` is an RC that may fail on a fresh pip resolve. - **NOAA/NWS live calls without rate-limit handling:** `app/context/noaa_tides.py` and `nws_obs.py` call live APIs on every request with no caching, no retry-after handling. Under concurrent load or NOAA outage, specialists fail silently (returns `error` key in result dict) but every request re-hits the failed endpoint. - **FloodNet GraphQL `verify=False`:** Certificate validation disabled in `app/context/floodnet.py:_gql()`. This is a permanent workaround for FloodNet's self-signed cert, not a temporary workaround. 
- **Static asset cache:** `web/sveltekit/build/` assets have no cache-busting. When iterating on Svelte sources, browser hard-reload is required. - **Planner 3b → 8b alias on HF Space:** `RIPRAP_OLLAMA_3B_TAG=granite4.1:8b` in the Dockerfile means both planner and reconciler use the 8b on the Space. If 3b is never pulled, the `granite4.1:3b` model is absent and an explicit call to that tag would fail. Current routing via the alias system prevents this, but a direct tag reference in new code would break. - **vLLM `[doc_id=X]` normalization in `app/llm.py:_normalize_citations()`:** Applied per-chunk in streaming and once on non-streaming responses. If vLLM ever batches citation tokens across two stream chunks, the regex would miss them. This hasn't happened in practice but is a known theoretical gap. - **RAG startup failure doesn't prevent startup:** `rag.warm()` is wrapped in a try/except that prints and continues. If sentence-transformers fails to load, all queries return without policy context — the briefing still works but silently loses the RAG section. - **Mellea API shape versioning:** `reconcile_strict()` uses `mellea.start_session(backend_name="ollama")` from Mellea 0.3/0.4 (HF has 0.3, local has 0.4). The `_extract_text()` and `_extract_attempts()` helpers duck-type multiple attribute names. `reconcile_strict_streaming()` avoids Mellea's session entirely (hand-rolled) and is version-independent — this is the production path. The `reconcile_strict()` function is only exercised in offline contexts. - **NYC 311 Socrata calls uncached:** Each query fetches fresh from Socrata. Under rate-limit or extended 311 maintenance, the specialist returns `n=0` and no 311 doc is emitted; the briefing silently lacks that signal. ### Known gaps / out-of-scope - **`compare` intent defined in planner.py INTENTS dict** but no routing to a `compare.py` intent module exists in `web/main.py:api_agent_stream`. 
Planner would route to it but the runner would fall through to `single_address`. - **Retrospective mode** (`what would Riprap have said on date X`): blocked at planner with not-implemented message. No historical data replay exists. - **Cross-register ranking** (`rank top 5 neighborhoods by flood exposure`): blocked at planner. Would require a cross-register join that doesn't exist. - **FEMA NFHL integration:** FEMA 1% and 0.2% floodplain indicators are in the scoring rubric (`app/score.py:REGULATORY`) but the corresponding FSM step and data layer are absent — they're stubbed at 0 in practice. The score still works but the FEMA regulatory sub-index doesn't contribute. - **Sub-surface flooding (Ida basement mode):** Optical satellites can't see basement flooding. Prithvi correctly emits no polygons for inland Queens. This is documented as an honest scope limit, not a bug. - **`/api/compare` endpoint** exists at `web/main.py:compare_stream` and works as a two-parallel-FSM-runs endpoint, but the SvelteKit UI doesn't expose a compare page (legacy `compare.html` was retired in v0.4.5). --- ## 8. The non-obvious decisions ### Why not a risk score from 0–100 The tier is a deterministic, published rubric (Cutter et al. 2003 construction, Tate 2012 equal-weights argument, Balica 2012 empirical floor). A continuous score would imply calibration against labeled damage outcomes — which don't exist here. Riprap has no closed claim records; producing "flood risk 0.73" without claims-driven calibration would be a fabricated precision. The tier is explicitly a prior (METHODOLOGY.md §1). FEMA Risk Rating 2.0 is the product to use if you want claims-driven numbers. ### Why silence over confabulation Specialists that don't fire emit nothing. `build_documents()` gates on `state.get(key) is not None`. Granite's post-training includes grounded-generation discipline ("don't generate from absent documents"). 
This, plus the Mellea citation checks, means a calm-weather query produces no NWS-alerts section in the briefing rather than a "no alerts were found" sentence; the latter would be correct but uncitable. The section is simply absent. This is explicit in the system prompt: "Omit any section whose supporting facts are absent from the documents."

### Why public-record-only at runtime

Data governance: a newsroom with FOIL'd documents, or an agency with internal capital plans, can't paste that data into a vendor LLM (ARCHITECTURE.md §11). All specialist data comes from NYC OpenData, USGS, NOAA, NWS, and FloodNet NYC (a public sensor network). No commercial data; no private address databases. The system is reproducible and auditable.

### Why the four epistemic tiers (empirical / modeled / proxy / synthetic)

The distinction matters for how much weight to give each signal, documented in ARCHITECTURE.md §1.2. Empirical (Sandy HWMs, Ida HWMs, FloodNet events) = something flooded a place and was measured. Modeled scenarios (DEP, FEMA NFHL) = hydraulic simulation under assumptions. Proxy (311 complaints, HAND, TWI) = indirect indicators. Synthetic prior (TerraMind synthesis) = generative model output, never described as "imaged" or "reconstructed." The `build_documents()` function embeds these interpretive framing sentences directly into the doc bodies, so Granite is instructed in the document itself how to characterize each source.

### Why the Five Stones names

Functional grouping for a trace UI with 25+ specialists. The stonework vocabulary maps to function: Cornerstone remembers the foundation (static hazard record); Keystone is the load-bearing arch piece (what's exposed); Touchstone is the evaluative reference (current state); Lodestone draws you toward something (forecast pull); Capstone is the crown that holds the vault (synthesis). The names let a non-technical demo audience follow the 25-step trace without reading each step label.
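Mechanically, the grouping is just a step→stone lookup used when wrapping SSE step events. A minimal sketch, with hypothetical step names (the real mapping is the `_STEP_TO_STONE` dict in `web/main.py`, whose actual keys may differ):

```python
# Hypothetical step -> stone grouping, in the spirit of web/main.py:_STEP_TO_STONE.
# Step names here are illustrative, not the production keys.
STEP_TO_STONE = {
    "sandy": "cornerstone",          # static hazard record
    "dep_stormwater": "cornerstone",
    "nycha": "keystone",             # what's exposed
    "schools": "keystone",
    "floodnet": "touchstone",        # current state
    "nyc311": "touchstone",
    "noaa_tides": "lodestone",       # forecast pull
    "nws_alerts": "lodestone",
    "reconcile": "capstone",         # synthesis
}

def stone_for(step: str) -> str:
    # Unmapped steps fall back to the synthesis bucket so the trace UI
    # never drops a step event on the floor.
    return STEP_TO_STONE.get(step, "capstone")
```

The fallback choice is the interesting part: a new specialist added without a mapping entry should degrade to a visible-but-ungrouped trace row, not a crash.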
### Why citation-grounded prose vs structured output

JSON structured output (tier + per-field arrays) is easy to produce but hard to cite in a grant application or news article. The four-section prose format with `[doc_id]` tags produces text a planner can quote in a FEMA BRIC sub-application, or that a journalist can use verbatim with inline sourcing. The citation tags map to clickable source chips in the frontend. Structured JSON of the underlying specialist outputs is also available in the `final` SSE event for machine consumption.

### Why Mellea rejection sampling (vs post-hoc sentence dropping)

The original `verify_paragraph()` in `app/reconcile.py` drops sentences after generation. This produces a shorter briefing and a silent quality improvement, but the user sees a briefing that may have had sentences removed. The Mellea rejection sampler rerolls the entire generation when it fails, streams each attempt's tokens to the user live (visible progress), then shows a green/amber inline banner. The user understands the system is enforcing quality, not silently deleting content. Psychologically this is more defensible in a professional context.

### Why the planner-then-Capstone two-LLM split

The planner is a structured-output routing task (small JSON, deterministic, temperature=0); it should be fast and cheap. The reconciler is a long-form synthesis task requiring dense citation discipline; it benefits from the larger context window and stronger instruction-following of the 8b model. Using the 3b for routing keeps TTFB low (planner JSON appears in ~2 s vs ~8 s for the 8b). On the HF Space, both aliases map to the 8b via `RIPRAP_OLLAMA_3B_TAG=granite4.1:8b` to avoid disk cost, accepting the TTFB penalty.

### Why LiteLLM Router

The alternative was a hand-rolled `if primary == "vllm": ... else: ollama.chat(...)` dispatch. LiteLLM's Router gives model aliasing, failover, and a common call signature for free.
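As a hedged sketch of the alias pattern: LiteLLM's `Router` takes a `model_list` of alias entries, each pairing a caller-facing `model_name` with `litellm_params`. Everything below is illustrative, not the actual contents of `app/llm.py:_build_router()`; in particular, the alias names and the `RIPRAP_OLLAMA_8B_TAG` env var are assumptions (the document only attests `RIPRAP_PLANNER_MODEL` and `RIPRAP_OLLAMA_3B_TAG`).

```python
import os

def build_model_list() -> list[dict]:
    """Hypothetical model_list in LiteLLM Router shape.

    Alias names ("riprap-planner", "riprap-capstone") and the 8b env var
    are illustrative; RIPRAP_PLANNER_MODEL defaults to granite4.1:3b per
    the planner docs above.
    """
    planner_tag = os.environ.get("RIPRAP_PLANNER_MODEL", "granite4.1:3b")
    capstone_tag = os.environ.get("RIPRAP_OLLAMA_8B_TAG", "granite4.1:8b")  # assumed name
    return [
        # Callers ask for the alias; swapping backends means editing only this list.
        {"model_name": "riprap-planner",
         "litellm_params": {"model": f"ollama/{planner_tag}"}},
        {"model_name": "riprap-capstone",
         "litellm_params": {"model": f"ollama/{capstone_tag}"}},
    ]
```

The HF Space trick described above falls out naturally: setting both env vars to `granite4.1:8b` makes both aliases resolve to the same pulled model without any code change.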
The ~250-line shim in `app/llm.py` covers: Ollama-vs-vLLM backend selection, document-role message extraction for vLLM's HF chat template, `[doc_id=X]` → `[X]` citation normalization, JSON-mode translation, and backend info for the UI badge. Any future backend (mlx-lm, llama.cpp, etc.) is a 10-line entry in `_build_router()`.

### Why vLLM emits `[doc_id=X]` while Ollama emits `[X]`

Ollama's Granite 4.1 Modelfile template lifts `role="document"` messages into a dedicated documents block, and the model emits bare `[X]` citations. The HF tokenizer template used by vLLM emits `[doc_id=X]`. The rest of Riprap (the Mellea regex, the frontend citation-chip parser, the sources footer) was written against `[X]`. The `_CITE_NORMALIZE_RE` in `app/llm.py` normalizes per-chunk in streaming, preventing any vLLM-specific citation format from leaking downstream.

### Why Prithvi runs offline (baked GeoJSON) while TTM runs live

Prithvi-EO 2.0 with TerraTorch needs a GPU and minutes per HLS tile. Running it per-query on a CPU-basic Space is not viable. The 166-polygon GeoJSON was computed once on an AMD MI300X, filtered (>30,000 sq ft to drop noise, <1 km² to drop tidal artifacts), and committed. The runtime FSM does point-in-polygon lookups (milliseconds). This is honest about what EO models earn their keep on: a one-time, defensible, event-level signal, not per-request inference. TTM r2 at 1.5 M params runs in milliseconds on CPU; no such tradeoff exists there.

### Why `citations_dense` uses sentence scope, not a character window

The original implementation used `~40 chars` proximity between a number and its citation tag. This was fragile for normal English sentence structure ("The address has **11 flood-related complaints** [nyc311] within 200 m"): the citation might be 60 chars from the number. Switching to sentence scope (`.[\s)]` split) eliminated the chronic 3/4 neighborhood-query failure mode.
"Sentence scope" is also how human readers actually assign attribution: the citation at the end of the sentence covers the claim anywhere in that sentence.

---

## 9. What's next

From `OPEN-ISSUES.md`, `CLAUDE.md` polish targets, and code-level TODO comments, in priority order for the May 13 ASCE presentation:

1. **Demo-script dry run against the live Space.** The Space sometimes sleeps after idle; cold start is 30–90 s. Pre-ping the Space before presenting. Verify the backend pill shows the correct hardware.
2. **`compare` intent wiring.** `planner.py` declares the `compare` intent (noted in a `NOT_IMPLEMENTED` comment; in fact the planner doesn't short-circuit compare, it just routes to `single_address` by default). To make the compare flow work end-to-end, `web/main.py:api_agent_stream` needs routing to `i_addr.run` twice in parallel, or a new `compare.py` intent module.
3. **FEMA NFHL layer.** The scoring rubric has `fema_1pct` and `fema_02pct` weights but no FSM step or data layer. Adding the FEMA NFHL download and a `step_fema_nfhl` action would materially improve Regulatory sub-index accuracy for addresses in AE/VE zones that aren't in the Sandy extent.
4. **NYCHA/DOE/DOH registers on Space.** `RIPRAP_NYCHA_REGISTERS=0` by default. Enabling it on the HF Space would add 3 more Keystone specialists to every single_address query, but requires the 91 MB Sandy GeoJSON pre-load to complete within Space startup time.
5. **Droplet redeploy verification.** The `services/riprap-models/Dockerfile` was never tested end-to-end. The `safetensors==0.8.0rc0` RC pin is the most likely failure point. The next droplet bring-up should test this first.
6. **Experiments `OPEN-ISSUES.md` items.** All four issues are in `experiments/` only (F821 numpy annotation in exp17, f-string Py 3.12+ syntax in exp18, B023 closure variable in exp05, F841 unused api in exp18). They won't affect production, but clean them up.
7. **Reranker integration.** `app/rag.py` has a full `_ensure_reranker()` and a `RIPRAP_RERANKER_ENABLE` flag for the `ibm-granite/granite-embedding-reranker-english-r2` cross-encoder. Off by default (no HF Space disk for the CrossEncoder model). Enabling it on the AMD droplet path would improve Policy context quality at no latency cost.
8. **Historical replay / retrospective mode.** Blocked at the planner with a not-implemented message. A substantial feature: it would require snapshotting specialist output at query time, or storing NOAA/311/FloodNet historical pull results.

---

## 10. Quick reference: files that matter

| Task | Open first |
|---|---|
| **Add a new specialist** | `app/fsm.py` (add `@action` + wire into `build_app()`), `app/reconcile.py:build_documents()` (add doc emission), `app/intents/single_address.py` (no change usually needed), `web/sveltekit/src/` (add step label + source card) |
| **Change the briefing structure / system prompt** | `app/reconcile.py:EXTRA_SYSTEM_PROMPT`, then `app/intents/neighborhood.py:EXTRA_SYSTEM_PROMPT` for the neighborhood path; rebuild `web/sveltekit` if adding new section rendering |
| **Tune the Mellea grounding checks** | `app/mellea_validator.py`: `_NUM_RE`, `_TRIVIAL_NUMS`, `_check_every_claim_cited()`, `_failing_sentences_for_citations()` |
| **Change which backend (vLLM vs Ollama)** | `app/llm.py` env vars; no code change needed |
| **Add a new intent** | `app/planner.py:INTENTS` + `SPECIALISTS` entries, `_required_specialists()`, then a new `app/intents/<intent>.py`; wire in `web/main.py:api_agent_stream` and `api_agent` |
| **Change the exposure tier scoring** | `app/score.py:REGULATORY/HYDROLOGICAL/EMPIRICAL` dicts + `TIER_BREAKPOINTS`; update `METHODOLOGY.md` |
| **Debug why a specialist fired wrong** | `scripts/probe_mellea.py --query "<query>" --runs 1`; check step events in the SSE stream; look at `final.mellea.requirements_failed` |
| **Rebuild the frontend** | `cd web/sveltekit && npm run build` (new design-system UI); `cd web/svelte && npm run build` (legacy Svelte 5 custom elements to `web/static/dist/riprap.js`) |
| **Run the full end-to-end test** | `.venv/bin/python scripts/probe_addresses.py` |
| **Rebuild the pre-computed registers** | `scripts/build_mta_entrances_register.py`, `scripts/build_nycha_register.py`, `scripts/build_schools_register.py` |
| **Rebuild Prithvi Ida polygons** | `scripts/run_prithvi_ida.py` (needs GPU + TerraTorch) |
| **Rebuild the pitch deck** | `cd slides && make pdf html pptx` (needs marp-cli) |
| **Add a question-type framing** | `app/framing.py:_PATTERNS` + `_DIRECTIVES` |
| **Understand why a doc was missing from the briefing** | Check `build_documents()` in `app/reconcile.py` (each block has an explicit gate condition); also check `trim_docs_to_plan()` |
| **Understand the SSE stream structure** | `web/main.py:api_agent_stream`, the `_STEP_TO_STONE` dict, and the stone_start/stone_done wrapping logic |
| **Deploy to HF Space** | `git push && git push huggingface main`; monitor the rebuild via `curl -sf "https://huggingface.co/api/spaces/lablab-ai-amd-developer-hackathon/riprap-nyc/runtime" \| python3 -m json.tool` |
| **Deploy to AMD droplet** | `scripts/deploy_droplet.sh <host>`, then set Space env vars via `huggingface-cli space variables`, restart the Space |
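For the "Tune the Mellea grounding checks" row above, the sentence-scope citation check described in §8 can be sketched roughly as follows. The regexes and the trivial-number set here are illustrative assumptions, not the real `_NUM_RE` / `_TRIVIAL_NUMS` from `app/mellea_validator.py`; only the sentence-split pattern (`.[\s)]`) comes from the document.

```python
import re

# Illustrative stand-ins for app/mellea_validator.py internals (assumed, not real):
_NUM = re.compile(r"\d[\d,.]*")        # any numeric token
_CITE = re.compile(r"\[[a-z0-9_]+\]")  # a bare [doc_id] citation tag
_TRIVIAL = {"1", "2"}                  # counts too small to be data claims

def citations_dense(text: str) -> list[str]:
    """Return sentences that contain a nontrivial number but no [doc_id] tag.

    Sentence scope is the key design choice: the citation anywhere in the
    sentence covers every numeric claim in that sentence, so a tag 60 chars
    from its number still counts.
    """
    failing = []
    for sentence in re.split(r"\.[\s)]", text):
        nums = [n for n in _NUM.findall(sentence) if n not in _TRIVIAL]
        if nums and not _CITE.search(sentence):
            failing.append(sentence.strip())
    return failing
```

An empty return means the paragraph passes this invariant; a non-empty return is the list a rejection sampler would use to decide whether to reroll.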