
Riprap — Owner's Brief


1. The system in one paragraph

Riprap takes any NYC address, neighborhood, or development-permit query and produces a four-section flood-exposure briefing where every numeric claim is anchored to a [doc_id] citation that traces back to the source dataset, agency report, or model output. A natural-language planner (Granite 4.1 3b) routes each query to one of four intent paths; the chosen path fans out across up to ~25 atomic data specialists; a synthesizer (Granite 4.1 8b) reads only the specialist outputs that fired and writes the briefing; a Mellea rejection sampler checks four grounding invariants and rerolls if any fail. The system is NYC-specific and public-record-only: all data comes from NYC OpenData, USGS, NOAA, NWS, or FloodNet, and all four models run inside the container — no vendor LLM is contacted at runtime. The output is a tier 1–4 exposure score (deterministic, published rubric, not generated by the LLM) plus a cited paragraph in prose. What Riprap does not do: damage probability, insurance rating, flood prediction, or any claim about basement apartments or infrastructure that isn't in a public register.


2. Architecture map

HTTP request lifecycle

User browser → GET /api/agent/stream?q=<query>
  web/main.py: api_agent_stream()  (async SSE generator)
    runs runner() in a threadpool executor
      app/planner.plan(q, on_token=...)   → streams plan_token events while Granite generates
        returns Plan(intent, targets, specialists, rationale)
      out_q.put({kind:"plan", ...})       → SSE plan event
      intent dispatch:
        "single_address"     → app/intents/single_address.run(plan, q, progress_q, strict=True)
        "neighborhood"       → app/intents/neighborhood.run(plan, q, progress_q, strict=True)
        "development_check"  → app/intents/development_check.run(plan, q, progress_q, strict=True)
        "live_now"           → app/intents/live_now.run(plan, q, progress_q)
        "not_implemented"    → inline JSON response, no FSM
      each intent calls fsm.iter_steps() or its own specialist loop
        → out_q.put({kind:"step", ...})   per specialist
        → out_q.put({kind:"token", ...})  per Granite reconcile chunk
        → out_q.put({kind:"mellea_attempt", ...})  per Mellea pass/fail
      out_q.put({kind:"final", ...})
  event_stream() async generator reads out_q, wraps steps in
    stone_start / stone_done envelope keyed by _STEP_TO_STONE dict,
    yields SSE frames
  SSE response headers: Cache-Control: no-cache, X-Accel-Buffering: no
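The runner-to-SSE bridge above can be sketched as a thread-safe queue drained from the async side (a minimal sketch; `_runner` is a hypothetical stand-in for the intent dispatch, and the real generator in web/main.py emits the full event vocabulary listed in this section):

```python
import asyncio
import json
import queue
import threading

def _runner(out_q: queue.Queue) -> None:
    # Hypothetical stand-in for the intent dispatch running in a
    # threadpool: push progress events, then a sentinel.
    out_q.put({"kind": "plan", "intent": "single_address"})
    out_q.put({"kind": "step", "step": "geocode", "ok": True})
    out_q.put({"kind": "final", "paragraph": "..."})
    out_q.put(None)

async def event_stream(out_q: queue.Queue):
    # Drain the thread-safe queue from the async side without blocking
    # the event loop; wrap each record as an SSE frame.
    loop = asyncio.get_running_loop()
    while True:
        item = await loop.run_in_executor(None, out_q.get)
        if item is None:
            yield "event: done\ndata: {}\n\n"
            return
        yield f"event: {item['kind']}\ndata: {json.dumps(item)}\n\n"

async def main() -> list[str]:
    out_q: queue.Queue = queue.Queue()
    threading.Thread(target=_runner, args=(out_q,), daemon=True).start()
    return [frame async for frame in event_stream(out_q)]

frames = asyncio.run(main())
```

FastAPI can serve a generator like this directly via StreamingResponse(..., media_type="text/event-stream").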

Planner: app/planner.py

  • Entry: plan(query, model, on_token) → Plan
  • Model: RIPRAP_PLANNER_MODEL env, default granite4.1:3b
  • Uses llm.chat(format="json") with temperature=0 for deterministic JSON output via Ollama's constrained-decode mode
  • Pre-filter: _not_implemented_message(query) checks two regex patterns (retrospective, ranking) and returns early with a Plan(intent="not_implemented") so no LLM call is made
  • Post-validator: _validate(d, raw_query) sanitizes intent, targets, specialists against the declared INTENTS/SPECIALISTS dicts; adds floor specialists via _required_specialists(intent) if planner omitted them
  • Floor specialists (always added regardless of planner output): geocode+sandy+dep_stormwater+microtopo for single_address; nta_resolve+sandy+dep_stormwater+nyc311 for neighborhood; nws_alerts+noaa_tides for live_now
  • Returns: Plan(intent, targets: list[dict], specialists: list[str], rationale: str)
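The validator's keep-then-backfill behavior can be sketched as follows (a minimal sketch; the registry and floor table here are abbreviated stand-ins for INTENTS/SPECIALISTS and _required_specialists() in app/planner.py):

```python
# Abbreviated specialist registry and floor table (assumptions for the sketch).
SPECIALISTS = {"geocode", "sandy", "dep_stormwater", "microtopo",
               "floodnet", "nyc311", "nta_resolve", "nws_alerts", "noaa_tides"}
FLOOR = {
    "single_address": ["geocode", "sandy", "dep_stormwater", "microtopo"],
    "neighborhood": ["nta_resolve", "sandy", "dep_stormwater", "nyc311"],
    "live_now": ["nws_alerts", "noaa_tides"],
}

def validate(intent: str, specialists: list[str]) -> list[str]:
    # Keep only declared specialists, then append any floor specialists
    # the planner omitted, preserving order and deduplicating.
    kept = [s for s in specialists if s in SPECIALISTS]
    for s in FLOOR.get(intent, []):
        if s not in kept:
            kept.append(s)
    return kept

print(validate("single_address", ["sandy", "made_up_tool"]))
# ['sandy', 'geocode', 'dep_stormwater', 'microtopo']
```

The unknown name is dropped and the missing floor set is appended, which is what prevents a silent missing-Stone briefing.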

FSM: app/fsm.py

  • Entry: build_app(query) → Burr Application; run(query) → dict; iter_steps(query) → generator
  • Burr 0.x ApplicationBuilder with with_state(query=query, trace=[]), with_entrypoint("geocode")
  • Actions registered in dict order, transitions are consecutive pairs (linear, not DAG)
  • Each @action writes one state key + appends to trace list
  • Out-of-NYC guard: _NYC_S/W/N/E = 40.49, -74.27, 40.92, -73.69 — NYC-specific specialists skip with "out of NYC scope" reason; live/national specialists (NWS/NOAA/TTM) run unconditionally
  • Thread-locals for streaming (since Burr runs sync in a background thread):
    • set_strict_mode(bool) / _current_strict_mode()
    • set_token_callback(fn) / _current_token_callback()
    • set_mellea_attempt_callback(fn) / _current_mellea_attempt_callback()
    • set_planned_specialists(set) / _current_planned_specialists()
    • set_user_query(str) / _current_user_query()
    • set_planner_intent(str) / _current_planner_intent()
  • iter_steps spawns a daemon thread running app.iterate(halt_after=["reconcile"]); snaps threadlocals from caller thread and re-installs on iterate thread; deduplicates trace records by (step_name, started_at)
  • Heavy-specialist gate: _HEAVY_SPECIALISTS_ENABLED = True when RIPRAP_LLM_PRIMARY != ollama OR RIPRAP_ML_BASE_URL is set; otherwise False. Controls whether prithvi_live, terramind, eo_chip, terramind_lulc, terramind_buildings fire
  • NYCHA register gate: _NYCHA_REGISTERS_ENABLED = controlled by RIPRAP_NYCHA_REGISTERS=1 (default off); registers load a 91 MB GeoJSON file on first call

Full action sequence (default, single_address)

| # | Action name | State key written | Data source |
|---|---|---|---|
| 1 | geocode | geocode, lat, lon | NYC DCP Geosearch → OSM Nominatim fallback |
| 2 | sandy | sandy | data/sandy_inundation.geojson (lru_cache) |
| 3 | dep | dep | data/dep/*.gdb (3 scenarios, lru_cache) |
| 4 | floodnet | floodnet | api.floodnet.nyc Hasura GraphQL |
| 5 | nyc311 | nyc311 | Socrata erm2-nwe9 |
| 6 | noaa_tides | noaa_tides | api.tidesandcurrents.noaa.gov |
| 7 | nws_alerts | nws_alerts | api.weather.gov/alerts/active |
| 8 | nws_obs | nws_obs | api.weather.gov/stations/<id>/observations |
| 9 | ttm_forecast | ttm_forecast | ibm-granite/granite-timeseries-ttm-r2 (in-process or remote) |
| 10 | ttm_311_forecast | ttm_311_forecast | TTM r2 on local 311 weekly series |
| 11 | floodnet_forecast | floodnet_forecast | TTM r2 on nearest FloodNet sensor history |
| 12 | ttm_battery_surge | ttm_battery_surge | msradam/Granite-TTM-r2-Battery-Surge (remote or local) |
| 13 | microtopo | microtopo | data/nyc_dem_30m.tif, twi.tif, hand.tif |
| 14 | ida_hwm | ida_hwm | data/ida_2021_hwms_ny.geojson |
| 15 | mta_entrances | mta_entrances | data/mta_entrances.geojson |
| 16 | prithvi | prithvi_water | data/prithvi_ida_2021.geojson (166 polygons) |
| 17–22 | prithvi_live, terramind, eo_chip, terramind_lulc, terramind_buildings | (heavy, if enabled) | STAC/Sentinel-2, msradam/TerraMind-NYC-Adapters |
| 23 | rag | rag | Granite Embedding 278M over corpus/*.pdf (5 PDFs) |
| 24 | gliner | gliner | GLiNER typed-entity extraction over RAG hits |
| 25 | reconcile | paragraph, audit, mellea | Granite 4.1:8b via Mellea strict sampler |
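Because the chain is linear rather than a DAG, the transition wiring can be derived mechanically from the registration order (a sketch with abbreviated action names; Burr-specific builder calls omitted):

```python
# Actions are registered in dict order; transitions are consecutive pairs.
# Abbreviated stand-in for the real action dict in app/fsm.py.
actions = {"geocode": None, "sandy": None, "dep": None, "reconcile": None}

names = list(actions)                       # insertion order is preserved
transitions = list(zip(names, names[1:]))   # consecutive pairs
print(transitions)
# [('geocode', 'sandy'), ('sandy', 'dep'), ('dep', 'reconcile')]
```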

Capstone reconciliation: app/reconcile.py + app/mellea_validator.py

  • build_documents(state) → list[dict] — emits one {"role": "document <doc_id>", "content": "..."} per specialist that fired, in Stone order; gated by both specialist fire status and the out-of-NYC guard
  • trim_docs_to_plan(doc_msgs, planned_specialists) — drops doc messages not matching the planner's specialist set; saves ~30–50% prompt tokens; RIPRAP_TRIM_DOCS=0 disables
  • EXTRA_SYSTEM_PROMPT — the 4-section skeleton with the citation-discipline rules
  • augment_system_prompt(EXTRA_SYSTEM_PROMPT, query, intent) — calls app/framing.detect() to classify the question type (11 types, deterministic regex), then appends a QUESTION-AWARE OPENING: directive to the system prompt for non-generic questions
  • Strict path (production): reconcile_strict_streaming(doc_msgs, system_prompt, ...) in app/mellea_validator.py
    • Streams each attempt's tokens via the on_token(delta, attempt_idx) callback
    • After each attempt, runs the four checks and fires the on_attempt_end(attempt_idx, passed, failed) callback
    • On failure, appends a feedback user-turn naming the failing sentences and rerolls
    • Budget: DEFAULT_LOOP_BUDGET = 2 (Ollama primary) or 3 (vLLM primary), overridable via RIPRAP_MELLEA_MAX_ATTEMPTS
  • Legacy path (non-strict): reconcile.reconcile(state) → streams tokens, then calls verify_paragraph(), which drops sentences with ungrounded numbers (post-hoc filtering, not rejection sampling)
  • The step_reconcile action detects strict mode via _current_strict_mode() and routes to one path or the other
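trim_docs_to_plan() can be sketched as a filter over the "document <doc_id>" role strings (a sketch; the prefix-match rule for scenario-suffixed doc_ids like dep_extreme_2080 is an assumption of this sketch):

```python
def trim_docs_to_plan(doc_msgs: list[dict], planned: set[str]) -> list[dict]:
    # Drop document messages whose doc_id is not covered by the planner's
    # specialist set. Assumes each role is "document <doc_id>" as emitted
    # by build_documents(); prefix matching for suffixed ids is assumed.
    kept = []
    for msg in doc_msgs:
        doc_id = msg["role"].split(" ", 1)[1]
        if any(doc_id == s or doc_id.startswith(s + "_") for s in planned):
            kept.append(msg)
    return kept

docs = [{"role": "document sandy", "content": "..."},
        {"role": "document dep_extreme_2080", "content": "..."},
        {"role": "document nyc311", "content": "..."}]
print([d["role"] for d in trim_docs_to_plan(docs, {"sandy", "dep"})])
# ['document sandy', 'document dep_extreme_2080']
```

Dropping nyc311 here is the token saving: documents the planner never asked for never reach the synthesizer's prompt.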

Four Mellea grounding checks (app/mellea_validator.py)

  1. numerics_grounded — _check_no_invented_numbers(): every non-trivial number in the output appears verbatim in the haystack (joined document content). Trivial set: {0–10, 100, 311, 911, 211}. Number regex: \b-?\d[\d,]*(?:\.\d+)?\b (word-boundary anchored, so identifiers like QN1206 and B12 are skipped)
  2. no_placeholder_tokens — _check_no_placeholder_tokens(): output contains none of [source], <document, </document, [doc_id]
  3. citations_dense — _check_every_claim_cited(): each non-trivial number has a [doc_id] citation somewhere in the same sentence (sentence boundary: \.[\s)] or end of string)
  4. citations_resolve — _check_referenced_doc_ids_exist(): every [id] cited in the output is a member of the input doc_id set
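Check 1 can be sketched directly from the regex and trivial set above (a sketch mirroring _check_no_invented_numbers(), not the production code):

```python
import re

# Number regex and trivial set as described for the numerics_grounded check.
_NUM_RE = re.compile(r"\b-?\d[\d,]*(?:\.\d+)?\b")
_TRIVIAL = {str(n) for n in range(11)} | {"100", "311", "911", "211"}

def numerics_grounded(output: str, haystack: str) -> bool:
    # Every non-trivial number in the output must appear verbatim in the
    # joined document content (the haystack).
    for m in _NUM_RE.finditer(output):
        if m.group(0) not in _TRIVIAL and m.group(0) not in haystack:
            return False
    return True

docs = "Sandy depth 4.2 ft [sandy]; 17 complaints [nyc311]"
assert numerics_grounded("There were 17 complaints [nyc311].", docs)
assert not numerics_grounded("There were 23 complaints [nyc311].", docs)
assert numerics_grounded("Call 311 for updates.", docs)  # trivial number
```

The leading \b is what makes the regex identifier-aware: in "QN1206" there is no word boundary between "N" and "1", so no match starts inside the code.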

SSE event vocabulary (/api/agent/stream)

| event | payload | when |
|---|---|---|
| hello | {query} | connection open |
| plan_token | {delta} | planner JSON tokens |
| plan | {intent, targets, specialists, rationale} | planner done |
| stone_start | {name, tagline, description} | first step in a Stone fires |
| step | {step, ok, elapsed_s, result?, err?} | each FSM action completes |
| token | {delta, attempt?} | Granite reconcile chunk (attempt idx resets on reroll) |
| mellea_attempt | {attempt, passed, failed} | end of each Mellea attempt |
| stone_done | {name, tagline, description, n_steps} | last step in a Stone done |
| final | full state dict (geocode, sandy, dep, paragraph, mellea, energy, ...) | reconcile done |
| error | {err} | exception |
| done | {} | stream closing |
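On the consuming side the wire format is plain text/event-stream; a minimal parser for these frames (a sketch, independent of any SSE client library):

```python
import json

def parse_sse(raw: str) -> list[tuple[str, dict]]:
    # Parse text/event-stream into (event, payload) pairs. Assumes each
    # frame is "event: <name>\ndata: <json>" separated by blank lines,
    # which is the shape /api/agent/stream emits.
    events = []
    for frame in raw.strip().split("\n\n"):
        name, data = "message", "{}"
        for line in frame.splitlines():
            if line.startswith("event:"):
                name = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data = line[len("data:"):].strip()
        events.append((name, json.loads(data)))
    return events

raw = ('event: hello\ndata: {"query": "80 Pioneer St"}\n\n'
       'event: step\ndata: {"step": "geocode", "ok": true}\n\n'
       'event: done\ndata: {}\n\n')
print([name for name, _ in parse_sse(raw)])
# ['hello', 'step', 'done']
```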

3. The Five Stones, one section each

Cornerstone — Hazard Reader

Job: Establish the historical and modeled flood record at the address. These are static datasets that do not change between queries.

Specialists (file:function, what it does):

| Specialist | File:function | What it returns |
|---|---|---|
| step_sandy | fsm.py:step_sandy | Boolean: inside 2012 Sandy Inundation Zone; gpd.sjoin point-in-polygon against data/sandy_inundation.geojson (91 MB) |
| step_dep | fsm.py:step_dep | Three DEP stormwater scenarios: dep_extreme_2080 (3.66 in/hr rainfall, 2080 SLR), dep_moderate_2050 (2.13 in/hr, 2050 SLR), dep_moderate_current; depth class 1–3 per point |
| step_microtopo | fsm.py:step_microtopo | Point elevation (m), HAND (height above nearest drainage, m), TWI (topographic wetness index), rel_elev_pct_200m, rel_elev_pct_750m, basin_relief_m from rasters in data/ |
| step_ida_hwm | fsm.py:step_ida_hwm | USGS Hurricane Ida 2021 high-water marks; n_within_800m, max_height_above_gnd_ft, nearest_dist_m |
| step_prithvi | fsm.py:step_prithvi | Point-in-polygon against 166 pre-computed polygons in data/prithvi_ida_2021.geojson; inside_water_polygon bool, nearest_distance_m |

Data sources:

  • Sandy: NYC OpenData 5xsi-dfpx — downloaded to data/sandy_inundation.geojson
  • DEP: NYC DEP Stormwater Flood Maps (2021), Esri FileGDBs at data/dep/*.gdb
  • Microtopo: USGS 3DEP 30 m DEM via py3dep + whitebox-workflows for TWI/HAND computation, baked to data/nyc_dem_30m.tif, data/twi.tif, data/hand.tif by scripts/compute_hydrology_indices.py
  • Ida HWMs: USGS STN Event 312 (NY State), baked to data/ida_2021_hwms_ny.geojson by scripts/fetch_ida_hwms.py
  • Prithvi polygons: offline Prithvi-EO 2.0 segmentation on Sentinel-2 HLS tile (pre-event 2021-08-25, post-event 2021-09-02), baked to data/prithvi_ida_2021.geojson by scripts/run_prithvi_ida.py

Models invoked: Prithvi-EO 2.0 ran offline (TerraTorch) to produce the 166-polygon GeoJSON; no live model at query time for this Stone

Failure modes: Sandy/DEP/Prithvi fail silently on GeoJSON/GDB load errors; microtopo/ida_hwm fail if the raster files are absent from data/; all check _in_nyc() and skip with "out of NYC scope" for non-NYC addresses

UI evidence cards: Sandy inundation zone (boolean), DEP scenario depth classes (three cards), microtopo terrain indices, Ida HWM count/height, Prithvi polygon proximity


Keystone — Asset Register

Job: Quantify what public assets (transit, housing, schools, hospitals, buildings) are exposed to the hazards the Cornerstone established.

Specialists (file:function):

| Specialist | File:function | What it returns |
|---|---|---|
| step_mta_entrances | fsm.py:step_mta_entrances | MTA subway entrances within 500 m: n_entrances, n_inside_sandy_2012, n_in_dep_extreme_2080; per-entrance elevation + HAND |
| step_nycha | fsm.py:step_nycha | NYCHA developments within 1.5 km: n_developments, n_majority_inside_sandy_2012, n_with_dep_2080_overlap; per-development footprint overlap percentages |
| step_doe_schools | fsm.py:step_doe_schools | DOE schools within 1 km: n_schools, n_inside_sandy_2012, n_in_dep_extreme_2080 |
| step_doh_hospitals | fsm.py:step_doh_hospitals | NYS DOH hospitals within 2 km: n_hospitals, n_inside_sandy_2012, n_in_dep_extreme_2080 |
| step_terramind_buildings | fsm.py:step_terramind_buildings | msradam/TerraMind-NYC-Adapters Buildings LoRA: pct_buildings, n_building_components in the per-query Sentinel-2 chip; heavy, needs _HEAVY_SPECIALISTS_ENABLED |

Data sources:

  • MTA: data/mta_entrances.geojson (pre-computed register with elevation + flood layer joins)
  • NYCHA: data/registers/nycha.json (built by scripts/build_nycha_register.py)
  • DOE schools: data/registers/schools.json
  • DOH hospitals: fetched from NYS DOH Health Facility Certification (vn5v-hh5r) at register-build time
  • TerraMind-Buildings: msradam/TerraMind-NYC-Adapters adapter nyc-buildings-v1, via app/context/terramind_nyc.py:buildings()

Models invoked: msradam/TerraMind-NYC-Adapters (TerraMind 1.0 base + Buildings LoRA), ~1.6 GB base + ~325 MB LoRA, loaded lazily and cached; runs on RIPRAP_ML remote if configured

Failure modes: NYCHA/DOE/DOH registers require RIPRAP_NYCHA_REGISTERS=1 and the 91 MB Sandy GeoJSON to be loaded — they are disabled by default in local dev. TerraMind-Buildings skips silently if eo_chip didn't fire, deps are unavailable, or heavy specialists are disabled

UI evidence cards: MTA entrance exposure summary, NYCHA development exposure, school exposure, hospital exposure, TerraMind building footprint fraction


Touchstone — Live Observer

Job: Report current conditions — what sensors, 311 data, and EO imagery show right now.

Specialists (file:function):

| Specialist | File:function | What it returns |
|---|---|---|
| step_floodnet | fsm.py:step_floodnet | FloodNet sensors within 600 m: n_sensors, n_sensors_with_events, n_flood_events_3y, peak_event (max_depth_mm) |
| step_311 | fsm.py:step_311 | NYC 311 flood complaints within 200 m, last 5 years: count, by_descriptor breakdown, by_year |
| step_nws_obs | fsm.py:step_nws_obs | Nearest ASOS hourly METAR: station_id, precip_last_hour_mm, precip_last_3h_mm, precip_last_6h_mm |
| step_noaa_tides | fsm.py:step_noaa_tides | Nearest of 3 NOAA gauges (Battery 8518750, Kings Point 8516945, Sandy Hook 8531680): observed_ft_mllw, predicted_ft_mllw, residual_ft |
| step_prithvi_live | fsm.py:step_prithvi_live | Live Sentinel-2 L2A water segmentation via msradam/Prithvi-EO-2.0-NYC-Pluvial v2; pct_water_within_500m, pct_water_full, scene_date, cloud_cover; heavy |
| step_terramind_lulc | fsm.py:step_terramind_lulc | msradam/TerraMind-NYC-Adapters LULC LoRA: dominant_class, dominant_pct, per-class fractions; heavy |

Data sources:

  • FloodNet: https://api.floodnet.nyc/v1/graphql — Hasura GraphQL, no auth; ~350 sensors
  • 311: Socrata erm2-nwe9 (live API call, 200 m buffer, last 5 years)
  • NWS obs: https://api.weather.gov/stations/<id>/observations/latest; nearest of KNYC, KLGA, KJFK, KEWR, KFRG
  • NOAA tides: https://api.tidesandcurrents.noaa.gov/api/prod/datagetter; 6-min cadence
  • Prithvi live: Microsoft Planetary Computer STAC API for Sentinel-2 L2A; msradam/Prithvi-EO-2.0-NYC-Pluvial v2 weights
  • TerraMind LULC: shared chip from step_eo_chip (also STAC/Planetary Computer)

Models invoked: Prithvi-EO-2.0-NYC-Pluvial v2 (300 M params, TerraTorch, flood IoU 0.5979 vs 0.10 base); TerraMind-NYC-Adapters LULC LoRA (mIoU 0.5866, +6.13 pp over full-FT)

Failure modes: the FloodNet GraphQL call sets verify=False (self-signed cert); the 311 Socrata call times out gracefully; NOAA/NWS calls have 15–20 s timeouts; Prithvi/TerraMind LULC require _HEAVY_SPECIALISTS_ENABLED and app/context/eo_chip_cache.py:fetch() succeeding


Lodestone — Projector

Job: Report forward-looking signals — NWS alerts, surge forecasts, and complaint-rate trends.

Specialists (file:function):

| Specialist | File:function | What it returns |
|---|---|---|
| step_nws_alerts | fsm.py:step_nws_alerts | Active NWS flood-relevant alerts at point (Flash Flood, Coastal Flood, etc.): n_active, list of alerts with event/severity/urgency/expires |
| step_ttm_forecast | fsm.py:step_ttm_forecast | TTM r2 zero-shot Battery surge residual: context 512 steps (51 h at 6-min), horizon 96 steps (9.6 h); forecast_peak_ft, forecast_peak_minutes_ahead; only emits doc when interesting (peak > 0.3 ft) |
| step_ttm_311_forecast | fsm.py:step_ttm_311_forecast | TTM r2 zero-shot on 52 weeks of 311 complaint history → 4-week forecast; forecast_mean_per_week, forecast_peak_per_week, accelerating flag |
| step_floodnet_forecast | fsm.py:step_floodnet_forecast | TTM r2 on nearest FloodNet sensor flood-event recurrence; forecast_28d_expected_events, accelerating; silent if sensor history too sparse |
| step_ttm_battery_surge | fsm.py:step_ttm_battery_surge | msradam/Granite-TTM-r2-Battery-Surge fine-tune: hourly cadence, 96 h horizon; forecast_peak_m, forecast_peak_hours_ahead; only emits doc when interesting |

Data sources:

  • NWS alerts: https://api.weather.gov/alerts/active filtered to flood event types at the point's county
  • TTM context data: live pull from NOAA CO-OPS 6-min water level (for Battery/Kings Point/Sandy Hook); Socrata 311 history; FloodNet GraphQL event history
  • Battery surge fine-tune: NOAA hourly verified water level from Battery gauge (NOAA 8518750), loaded by app/live/ttm_battery_surge.py

Models invoked: ibm-granite/granite-timeseries-ttm-r2 (1.5 M params, ~30 MB, CPU-viable, zero-shot); msradam/Granite-TTM-r2-Battery-Surge fine-tune (same backbone, test MAE 0.1091 m, −41% vs persistence, −25% vs zero-shot)

Failure modes: NWS alerts call gracefully returns n_active=0 on timeout; TTM models loaded lazily via app/live/ttm_forecast.py:_load_model() with _DEPS_OK = False fallback pattern; all Lodestone specialists fire unconditionally (no NYC bbox gate except floodnet/311 which are NYC-specific)


Capstone — Synthesizer

Job: Read all documents produced by the four data-Stones and write a citation-grounded four-section prose briefing.

Entry: app/mellea_validator.py:reconcile_strict_streaming(doc_msgs, system_prompt, user_prompt, loop_budget, on_token, on_attempt_end)

Document ordering in prompt: geocode preamble → Cornerstone (sandy, dep_*, ida_hwm, prithvi_water, microtopo) → Keystone (mta_entrance_*, nycha_dev_*, doe_school_*, nyc_hospital_*, tm_buildings) → Touchstone (floodnet, nyc311, nws_obs, noaa_tides, prithvi_live, tm_lulc) → Lodestone (nws_alerts, ttm_forecast, ttm_311_forecast, floodnet_forecast_*, ttm_battery) → Policy (rag_*, gliner_*)

Four-section skeleton (from EXTRA_SYSTEM_PROMPT):

  • Status. — dominant exposure signal, strongest doc_id citation
  • Empirical evidence. — Sandy, 311, FloodNet, Ida HWMs, Prithvi polygons
  • Modeled scenarios. — DEP dep_* scenarios, microtopo terrain (HAND, TWI, percentile)
  • Policy context. — one sentence per RAG hit, citing agency name + rag_* doc_id

Four grounding checks (described in Β§2 above): numerics_grounded, no_placeholder_tokens, citations_dense, citations_resolve

Reroll feedback mechanism: _failing_sentences_for_citations(text) identifies sentences with uncited numbers; on reroll the feedback user-turn names those specific sentences and instructs surgical citation additions
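The sentence selector can be sketched by combining the number regex and sentence-boundary rule from §2 (a sketch; the real _failing_sentences_for_citations() may differ in detail):

```python
import re

# Regexes mirror the brief's description; the citation pattern is assumed.
_NUM_RE = re.compile(r"\b-?\d[\d,]*(?:\.\d+)?\b")
_CITE_RE = re.compile(r"\[[a-z0-9_]+\]")
_TRIVIAL = {str(n) for n in range(11)} | {"100", "311", "911", "211"}

def failing_sentences(text: str) -> list[str]:
    # A sentence fails when it contains a non-trivial number but no
    # [doc_id] citation; these sentences are named in the reroll feedback.
    sentences = re.split(r"\.[\s)]|\.$", text)
    bad = []
    for s in sentences:
        nums = [m.group(0) for m in _NUM_RE.finditer(s)]
        if any(n not in _TRIVIAL for n in nums) and not _CITE_RE.search(s):
            bad.append(s.strip())
    return bad

text = ("The address saw 17 complaints [nyc311]. "
        "Peak surge reached 4.2 ft. "
        "Call 311 for updates.")
print(failing_sentences(text))
# ['Peak surge reached 4.2 ft']
```

Only the uncited-number sentence is flagged: the cited sentence passes, and "311" is in the trivial set.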

Model: RIPRAP_RECONCILER_MODEL env, default granite4.1:8b; num_ctx=4096, num_predict=400

Return shape from step_reconcile: {paragraph, audit: {raw, dropped}, mellea: {rerolls, n_attempts, requirements_passed, requirements_failed, requirements_total, model, loop_budget}}


4. The three NYC fine-tunes

msradam/Prithvi-EO-2.0-NYC-Pluvial

  • HF Hub path: msradam/Prithvi-EO-2.0-NYC-Pluvial
  • Base model: IBM/NASA Prithvi-EO 2.0 (300 M params, ViT-L foundation model pre-trained on HLS Sentinel-2 multispectral imagery), Apache-2.0
  • Training data: NYC HLS Sentinel-2 tiles with pluvial flood labels derived from USGS Ida HWM survey and NYC DEP records; Lovász-Softmax loss with copy-paste augmentation; trained on AMD Instinct MI300X
  • Metrics: Test flood IoU 0.5979 vs 0.10 on Sen1Floods11 base (6× improvement)
  • Invocation: Two paths:
    • Offline (Cornerstone): produced data/prithvi_ida_2021.geojson via scripts/run_prithvi_ida.py; runtime does point-in-polygon, no model call
    • Live (Touchstone): app/flood_layers/prithvi_live.py:fetch(lat, lon) — fetches the latest Sentinel-2 L2A chip from Planetary Computer STAC, runs a model forward pass, returns pct_water_within_500m, pct_water_full; slow (~30 s), gated by _HEAVY_SPECIALISTS_ENABLED; input 6-band S2L2A chip, output binary segmentation mask
  • Degradation: If Planetary Computer STAC is unavailable or cloud cover too high, fetch() returns {ok: False, skipped: "...reason..."} and no doc is emitted

msradam/TerraMind-NYC-Adapters

  • HF Hub path: msradam/TerraMind-NYC-Adapters
  • Base model: TerraMind 1.0 (IBM/ESA any-to-any generative EO foundation model), Apache-2.0
  • Training data: NYC Sentinel-2 + SAR chips matched to ESRI Land Cover 2020–2022 labels (LULC adapter) and NYC building footprints (Buildings adapter); trained on AMD Instinct MI300X in ~18 minutes
  • Metrics: LULC test mIoU 0.5866 (+6.13 pp over full-FT baseline); Buildings test mIoU 0.5511; TiM 0.6023
  • Two adapters:
    • lulc — 5-class land cover (water, built, vegetation, bare, agriculture); invoked by step_terramind_lulc via app/context/terramind_nyc.py:lulc(s2_tensor, s1rtc, dem)
    • buildings — binary building footprint mask; invoked by step_terramind_buildings via app/context/terramind_nyc.py:buildings(s2_tensor, s1rtc, dem)
  • Shared chip: Both consume tensors from step_eo_chip → app/context/eo_chip_cache.py:fetch(lat, lon), which fetches the S2L2A + S1RTC + DEM chip once per query
  • Degradation: If eo_chip didn't fire successfully, both LoRA specialists silently no-op. Lazy load + cached in-process; first call ~30 s, subsequent calls ~3–7 s
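The once-per-query chip sharing can be sketched as a memoized fetch (a sketch; _download_chip is a hypothetical stand-in for the STAC call, and the real cache in app/context/eo_chip_cache.py may key and size differently):

```python
from functools import lru_cache

CALLS = {"n": 0}  # instrumentation for the sketch only

def _download_chip(lat: float, lon: float) -> dict:
    # Hypothetical stand-in for the expensive STAC/Planetary Computer fetch.
    CALLS["n"] += 1
    return {"lat": lat, "lon": lon, "bands": "S2L2A+S1RTC+DEM"}

@lru_cache(maxsize=32)
def fetch(lat: float, lon: float) -> dict:
    # Memoized on the exact (lat, lon) of the query, so the LULC and
    # Buildings adapters triggering for the same point share one download.
    return _download_chip(lat, lon)

fetch(40.6772, -74.0070)   # first adapter: downloads
fetch(40.6772, -74.0070)   # second adapter: cache hit
assert CALLS["n"] == 1
```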

msradam/Granite-TTM-r2-Battery-Surge

  • HF Hub path: msradam/Granite-TTM-r2-Battery-Surge
  • Base model: ibm-granite/granite-timeseries-ttm-r2 (1.5 M params, Tiny Time Mixer, Ekambaram et al. NeurIPS 2024), Apache-2.0
  • Training data: NOAA CO-OPS Battery gauge (station 8518750) hourly verified water level, surge residual computed as verified minus harmonic tide; trained on AMD Instinct MI300X
  • Metrics: Test MAE 0.1091 m, −41% vs persistence, −25% vs zero-shot TTM r2
  • Invocation: app/live/ttm_battery_surge.py:fetch() — loads the model via tsfm_public.get_model(), fetches NOAA hourly context, returns {available, context_hours, horizon_hours: 96, forecast_peak_m, forecast_peak_hours_ahead, interesting}; in-process on CPU
  • Input shape: (context_length, 1) float tensor of hourly surge residuals; context = 336 h (~14 days)
  • Output shape: (96,) hourly forecast, scanned for peak
  • Degradation: _DEPS_OK module-level flag set at import time; on failure returns {available: False, reason: "..."}, no doc emitted
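The _DEPS_OK pattern generalizes to any optional heavy dependency: probe once at import time, then degrade to a structured refusal at call time (a sketch; the available branch body is elided):

```python
# Sketch of the _DEPS_OK degradation pattern used by app/live/ttm_forecast.py
# and app/live/ttm_battery_surge.py.
try:
    import tsfm_public  # noqa: F401  -- heavy, optional dependency
    _DEPS_OK = True
except ImportError:
    _DEPS_OK = False

def fetch() -> dict:
    if not _DEPS_OK:
        # Caller emits no document; the briefing simply omits this signal.
        return {"available": False, "reason": "tsfm_public not installed"}
    # Real code would load the TTM model, pull NOAA context, and forecast.
    return {"available": True}

result = fetch()
```

Because the failure is a value rather than an exception, the FSM's silence-over-confabulation contract (state key present but unavailable, no doc emitted) holds without try/except at every call site.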

5. The deployment topology

Local development

  • Python 3.12 venv (.venv), uv for package management
  • Ollama serving granite4.1:3b + granite4.1:8b locally
  • uvicorn web.main:app --host 127.0.0.1 --port 7860
  • _HEAVY_SPECIALISTS_ENABLED = False by default (no RIPRAP_ML_BASE_URL set, no vLLM)
  • RIPRAP_NYCHA_REGISTERS = 0 by default (avoids the heavy 91 MB GeoJSON load)
  • Granite Embedding 278M and TTM r2 download to HF cache on first query (~280 MB + ~30 MB)
  • SvelteKit UI built at web/sveltekit/build/; rebuild only needed when sources change

HF Space (production demo URL)

  • URL: https://lablab-ai-amd-developer-hackathon-riprap-nyc.hf.space
  • Docker SDK, base nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04; hardware is cpu-basic (ARCHITECTURE.md mentions a T4, but the Dockerfile's GPU notes are aspirational)
  • Python 3.10 inside container (pinning mellea<0.4, transformers<5, huggingface_hub<1)
  • entrypoint.sh flow:
    1. Attempts the EO toolchain install at runtime into $HOME/.eo-pkgs (bypasses the HF build disk limit); if it fails, terramind/prithvi-live silently skip
    2. Starts ollama serve in background, polls until ready (up to 60 s)
    3. Pulls granite4.1:8b at runtime if not cached (~5 GB, ~2 min first cold start); 3b is optional
    4. Pre-warms 8b via curl POST /api/generate with keep_alive=24h
    5. Launches uvicorn web.main:app --host 0.0.0.0 --port 7860
  • RIPRAP_OLLAMA_3B_TAG=granite4.1:8b set in Dockerfile so planner routes to 8b (avoids disk cost of two separate model pulls)
  • web/main.py:_warm_caches() on startup: loads sandy + DEP layers, optionally NYCHA registers, warms RAG (Granite Embedding 278M + 5 PDFs), pre-imports heavy ML stacks to avoid import races, warms Ollama models via HTTP

AMD MI300X droplet (demo GPU path — currently destroyed)

  • Two Docker containers on same host, both with --device=/dev/kfd --device=/dev/dri
  • Container 1: vllm/vllm-openai-rocm:v0.17.1 — serves granite-4.1-8b on port 8001
    • --max-model-len 8192, --served-model-name granite-4.1-8b
    • GLOO_SOCKET_IFNAME=eth0 required or gloo fails to bind
  • Container 2: riprap-models:latest (built from services/riprap-models/Dockerfile) — FastAPI on port 8002 (or 7860 per scripts)
    • Endpoints: GET /healthz, POST /v1/prithvi-pluvial, POST /v1/terramind, POST /v1/ttm-forecast, POST /v1/granite-embed, POST /v1/gliner-extract
    • Model loading: lazy + per-model threading.Lock to prevent double-load on concurrent requests
    • ROCm device: cuda (ROCm's CUDA shim maps cuda to first /dev/kfd device)

Env vars to connect HF Space to droplet:

RIPRAP_LLM_PRIMARY=vllm
RIPRAP_LLM_BASE_URL=http://<ip>:8001/v1
RIPRAP_LLM_API_KEY=<token>
RIPRAP_ML_BASE_URL=http://<ip>:8002
RIPRAP_ML_API_KEY=<token>

What breaks if the droplet IP changes: set the env vars above via huggingface-cli space variables and restart the Space. The LiteLLM Router builds from env at import time, so the restart is required.

Deterministic redeploy: scripts/deploy_droplet.sh <new-ip> $TOKEN — idempotent, ~10–20 min on the first run (pulls images, builds riprap-models); re-runs on the same droplet take ~1 min. Known fragile: the safetensors==0.8.0rc0 pin in services/riprap-models/requirements-full.txt is a release candidate and may fail on future pip resolves.


6. One query traced end-to-end: "80 Pioneer Street, Brooklyn"

Query enters: GET /api/agent/stream?q=80+Pioneer+Street%2C+Brooklyn

1. Planner (app/planner.py:plan)

  • No not-implemented regex matches
  • Calls llm.chat(model="granite4.1:3b", messages=[system, user], format="json", stream=True, temperature=0)
  • Streams plan_token SSE events as JSON generates
  • Returns Plan(intent="single_address", targets=[{type:"address", text:"80 Pioneer Street, Brooklyn"}], specialists=[...], rationale="...")
  • Validator adds floor specialists: geocode, sandy, dep_stormwater, microtopo
  • SSE: plan event emitted

2. single_address.run (app/intents/single_address.py:run)

  • Sets threadlocals: strict=True, planned_specialists={...}, user_query="80 Pioneer Street, Brooklyn", planner_intent="single_address"
  • Registers on_token and on_mellea_attempt callbacks on progress_q
  • Calls fsm.iter_steps("80 Pioneer Street, Brooklyn")

3. FSM: step_geocode

  • app/geocode.py:geocode_one("80 Pioneer Street, Brooklyn")
  • Detects borough hint "Brooklyn", calls DCP Geosearch with size=8, filters for Brooklyn results
  • Returns GeocodeHit(address="80 Pioneer Street, Brooklyn, NY 11231", borough="Brooklyn", lat=40.6772, lon=-74.0070, bbl="3-00589-0003", ...)
  • State: {geocode: {...}, lat: 40.6772, lon: -74.0070}
  • SSE: step event {step: "geocode", ok: true, elapsed_s: 0.4, result: {address:..., lat:..., lon:...}}

4. FSM: step_sandy

  • Confirmed inside NYC bbox
  • sandy_inundation.join(point) — spatial join against data/sandy_inundation.geojson
  • Red Hook is inside the 2012 Sandy inundation zone → sandy=True
  • State: {sandy: True}
  • SSE: step event → opens stone_start: Cornerstone

5. FSM: step_dep

  • dep_stormwater.join(pt, scen) for each of 3 scenarios against data/dep/*.gdb
  • Likely returns dep_moderate_2050: depth_class=2 (Deep & Contiguous 1-4 ft), dep_extreme_2080: depth_class=3 (Deep Contiguous >4 ft), dep_moderate_current: depth_class=1

6–8. FSM: step_floodnet, step_311, step_noaa_tides

  • FloodNet: GraphQL POST to api.floodnet.nyc — checks sensors within 600 m of (40.6772, -74.0070)
  • 311: Socrata API call for flood complaints within 200 m, last 5 years
  • NOAA: fetches Battery gauge (closest of 3 stations to Red Hook), returns observed/predicted/residual

9–12. FSM: TTM forecast steps

  • ttm_forecast.summary_for_point(40.6772, -74.0070): loads ibm-granite/granite-timeseries-ttm-r2, fetches 512 steps of Battery residual history via NOAA, forecasts 96 steps ahead; emits doc only if peak > 0.3 ft
  • ttm_311_forecast.weekly_311_forecast_for_point(...): fetches 52-week complaint history for 200 m buffer from 311, runs TTM zero-shot
  • floodnet_forecast.summary_for_point(...): nearest sensor historical events → TTM recurrence forecast
  • ttm_battery_surge.fetch(): msradam/Granite-TTM-r2-Battery-Surge, hourly context → 96 h forecast

13–14. FSM: step_microtopo, step_ida_hwm

  • microtopo.microtopo_at(40.6772, -74.0070): samples data/nyc_dem_30m.tif, hand.tif, twi.tif at point; returns elevation ~3 m, HAND ~0.8 m (near drainage), TWI ~11
  • ida_hwm.summary_for_point(...): checks data/ida_2021_hwms_ny.geojson within 800 m — Ida hit Queens hardest, Red Hook had no USGS HWMs

15. FSM: step_mta_entrances

  • app/registers/mta_entrances.py:summary_for_point(...): loads data/mta_entrances.geojson, finds entrances within 500 m (likely the Smith–9th Sts and Carroll St F/G stations)

16. FSM: step_prithvi

  • prithvi_water.summary_for_point(40.6772, -74.0070): point-in-polygon against the 166 polygons in data/prithvi_ida_2021.geojson; Red Hook is coastal — likely inside_water_polygon=True or close proximity

17. FSM: step_rag

  • Builds query: "address 80 Pioneer Street, Brooklyn; inside Hurricane Sandy 2012 inundation zone; in Deep Contiguous pluvial scenario; flood resilience plan..."
  • rag.retrieve(q, k=3, min_score=0.45): Granite Embedding 278M cosine similarity over embedded corpus; likely returns rag_npcc4 (NPCC4 coastal) + rag_mta (MTA Resilience Roadmap coastal references) + rag_comptroller
  • External reads: none after startup (RAG index built at startup via rag.warm())

18. FSM: step_gliner

  • gliner_extract.extract_for_rag_hits(hits): GLiNER NER extraction over RAG paragraphs; extracts agency names, dollar amounts, infrastructure projects, NYC locations, date ranges
  • Emits gliner_{source} doc messages

19. FSM: step_reconcile

  • _current_strict_mode() = True
  • build_documents(snap) β†’ ~15 doc messages
  • trim_docs_to_plan(doc_msgs, planned_specialists) β†’ drops specialists planner didn't ask for
  • augment_system_prompt(EXTRA_SYSTEM_PROMPT, query="80 Pioneer Street, Brooklyn", intent="single_address") β†’ framing.detect() β†’ generic_exposure β†’ no directive added (Red Hook query has no question-shape keywords)
  • reconcile_strict_streaming(doc_msgs, framed_prompt, loop_budget=2, on_token=..., on_attempt_end=...)
    • Attempt 0: streams tokens to frontend; runs 4 checks; likely passes
    • If fails: feedback user-turn names failing sentences, attempt 1
  • Emits: paragraph, mellea metadata (rerolls=0, requirements_passed=[4/4])
  • SSE: multiple token events β†’ mellea_attempt event β†’ stone_done: Capstone β†’ final event

Scoring (computed in web/main.py from final state, or explicitly via app/score.py:composite()):

  • sandy=True β†’ empirical.sandy=1.0 β†’ floor triggered (tier capped at 2)
  • dep_moderate_2050 depth_class=2 β†’ regulatory.dep_moderate_2050=0.75
  • microtopo HAND=0.8 β†’ hydrological.hand_band=1.0 (HAND < 1 m)
  • composite likely ≥ 1.5 → raw tier 1; floor_applied=True → final tier = min(1, 2) = 1. The Sandy floor guarantees a tier no worse than 2; it is a floor on exposure, not a ceiling, so a raw tier 1 (better than tier 2) already satisfies it and passes through unchanged
  • Final tier: 1 (High exposure)
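
The floor rule is the easy part to get wrong, so here it is as a one-line sketch (lower tier number = higher exposure; min() implements "no worse than tier 2"):

```python
def final_tier(raw_tier, sandy_hit):
    """Apply the empirical floor: a confirmed Sandy inundation hit guarantees a
    final tier no worse than 2 (lower tier = higher exposure). The floor can pull
    a weaker tier down to 2; it never pushes a stronger tier up."""
    return min(raw_tier, 2) if sandy_hit else raw_tier
```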

7. What's robust vs fragile

Robust (load-bearing, tested)

  • Silence-over-confabulation in specialists: Every FSM action returns the declared state key as None on failure; build_documents() gates on state.get(key) is not None; Granite never invents content from absent documents. Pattern is consistent across 25 specialists.
  • NYC-scope guard: _in_nyc() check in every FSM action + build_documents() scope_note mechanism for out-of-NYC addresses. National specialists (NOAA, NWS) still fire and a live-conditions-only briefing is produced.
  • LiteLLM Router failover: app/llm.py automatically fails over from vLLM to Ollama on timeout/5xx. num_retries=0, so the Router doesn't burn seconds re-hitting dead endpoints. The Ollama fallback fires from the same call site.
  • Planner validator floor: _required_specialists() adds geocode/sandy/dep/microtopo even if planner forgot them; prevents silent missing-Stone briefings.
  • Four Mellea grounding checks with reroll feedback: The _failing_sentences_for_citations() targeted feedback mechanism is the reason neighborhood queries went from chronic 3/4 β†’ 4/4. The identifier-aware \b regex in _NUM_RE is specifically why it stopped false-firing on NTA codes.
  • End-to-end probe suite: scripts/probe_addresses.py drives /api/agent/stream against 5 addresses (442 E Houston, 80 Pioneer, 100 Gold, Hollis, Coney Island), asserts Stone fire patterns + Mellea 4/4 + four-section structure. Last green run: 5/5, 5.8–13.1 s per address at RIPRAP_MELLEA_MAX_ATTEMPTS=3.
  • Startup warmup in web/main.py:_warm_caches(): Sandy, DEP, RAG, Ollama models, and heavy ML module pre-imports all happen before the first request. The startup function catches exceptions individually so one failure doesn't kill the app.
  • Threadlocal cleanup in finally: blocks: app/intents/single_address.py always resets all five threadlocals in a finally: clause, preventing state bleeding between requests.
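
The planner validator-floor pattern from the list above is simple enough to sketch; the floor set shown here is taken from the description, but the actual set and ordering live in app/planner.py:

```python
# Floor set the validator enforces even when the planner omits them (illustrative).
REQUIRED = ("geocode", "sandy", "dep", "microtopo")

def with_required(planned):
    """Append any floor specialists the planner forgot, preserving plan order."""
    out = list(planned)
    for spec in REQUIRED:
        if spec not in out:
            out.append(spec)
    return out
```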

Fragile (single points of failure, missing error handling)

  • Burr FSM concurrent queries: iter_steps() mutates module-level Burr state. Two concurrent single_address queries to the same uvicorn worker will interleave threadlocals. No per-request isolation. Production HF Space is single-worker; local dev with --workers 2 would break.
  • build_documents() complexity radon F=101: ~750-line function with one if/elif branch per specialist. Order matters for the Granite prompt. Small edits risk subtle doc-ordering regressions that are silent but affect citation density.
  • entrypoint.sh EO install: Runtime pip install --target for terratorch/einops/diffusers/timm/torchvision into $HOME/.eo-pkgs is brittle β€” if pip fails mid-install the marker isn't created and the next container start retries, but if the Space's filesystem cache persists a partial install, it might never clear. The build log won't show this failure clearly.
  • Droplet redeploy: the services Dockerfile is unverified end-to-end. The last full E2E Dockerfile build was never confirmed; the bootstrap droplet was destroyed before final verification. safetensors==0.8.0rc0 in services/riprap-models/requirements-full.txt is an RC that may fail on a fresh pip resolve.
  • NOAA/NWS live calls without rate-limit handling: app/context/noaa_tides.py and nws_obs.py call live APIs on every request with no caching, no retry-after handling. Under concurrent load or NOAA outage, specialists fail silently (returns error key in result dict) but every request re-hits the failed endpoint.
  • FloodNet GraphQL verify=False: Certificate validation is disabled in app/context/floodnet.py:_gql(). This is a permanent workaround for FloodNet's self-signed cert, not a temporary one.
  • Static asset cache: web/sveltekit/build/ assets have no cache-busting. When iterating on Svelte sources, browser hard-reload is required.
  • Planner 3b β†’ 8b alias on HF Space: RIPRAP_OLLAMA_3B_TAG=granite4.1:8b in the Dockerfile means both planner and reconciler use the 8b on the Space. If 3b is never pulled, the granite4.1:3b model is absent and an explicit call to that tag would fail. Current routing via the alias system prevents this, but a direct tag reference in new code would break.
  • vLLM [doc_id=X] normalization in app/llm.py:_normalize_citations(): Applied per-chunk in streaming and once on non-streaming responses. If vLLM ever splits a citation tag across two stream chunks, the regex would miss it. This hasn't happened in practice but is a known theoretical gap.
  • RAG startup failure doesn't prevent startup: rag.warm() is wrapped in a try/except that prints and continues. If sentence-transformers fails to load, all queries return without policy context β€” the briefing still works but silently loses the RAG section.
  • Mellea API shape versioning: reconcile_strict() uses mellea.start_session(backend_name="ollama") from Mellea 0.3/0.4 (HF has 0.3, local has 0.4). The _extract_text() and _extract_attempts() helpers duck-type multiple attribute names. reconcile_strict_streaming() avoids Mellea's session entirely (hand-rolled) and is version-independent β€” this is the production path. The reconcile_strict() function is only exercised in offline contexts.
  • NYC 311 Socrata calls uncached: Each query fetches fresh from Socrata. Under rate-limit or extended 311 maintenance, the specialist returns n=0 and no 311 doc is emitted; the briefing silently lacks that signal.
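
For reference, the kind of TTL cache the NOAA/NWS and 311 specialists currently lack could be as small as the sketch below. The decorator name and shape are hypothetical, not existing Riprap code:

```python
import time

def ttl_cached(ttl_s):
    """Cache a single-key fetcher's result for ttl_s seconds per key."""
    def deco(fn):
        store = {}  # key -> (expires_at, value)
        def wrapper(key):
            now = time.monotonic()
            hit = store.get(key)
            if hit and hit[0] > now:
                return hit[1]          # fresh cached value: skip the live call
            value = fn(key)
            store[key] = (now + ttl_s, value)
            return value
        return wrapper
    return deco
```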

Known gaps / out-of-scope

  • The compare intent is defined in planner.py's INTENTS dict, but no routing to a compare.py intent module exists in web/main.py:api_agent_stream. The planner would route to it; the runner falls through to single_address.
  • Retrospective mode (what would Riprap have said on date X): blocked at planner with not-implemented message. No historical data replay exists.
  • Cross-register ranking (rank top 5 neighborhoods by flood exposure): blocked at planner. Would require a cross-register join that doesn't exist.
  • FEMA NFHL integration: FEMA 1% and 0.2% floodplain indicators are in the scoring rubric (app/score.py:REGULATORY) but the corresponding FSM step and data layer are absent β€” they're stubbed at 0 in practice. The score still works but the FEMA regulatory sub-index doesn't contribute.
  • Sub-surface flooding (Ida basement mode): Optical satellites can't see basement flooding. Prithvi correctly emits no polygons for inland Queens. This is documented as an honest scope limit, not a bug.
  • /api/compare endpoint exists at web/main.py:compare_stream and works as a two-parallel-FSM-runs endpoint, but the SvelteKit UI doesn't expose a compare page (legacy compare.html was retired in v0.4.5).

8. The non-obvious decisions

Why not a risk score from 0–100

The tier is a deterministic, published rubric (Cutter et al. 2003 construction, Tate 2012 equal-weights argument, Balica 2012 empirical floor). A continuous score would imply calibration against labeled damage outcomes β€” which don't exist here. Riprap has no closed claim records; producing "flood risk 0.73" without claims-driven calibration would be a fabricated precision. The tier is explicitly a prior (METHODOLOGY.md Β§1). FEMA Risk Rating 2.0 is the product to use if you want claims-driven numbers.

Why silence over confabulation

Specialists that don't fire emit nothing. build_documents() gates on state.get(key) is not None. Granite's post-training includes grounded-generation discipline ("don't generate from absent documents"). This plus the Mellea citation checks means a calm-weather query produces no NWS-alerts section in the briefing rather than "no alerts were found" β€” that would be correct but uncitable. The section is absent. This is explicit in the system prompt: "Omit any section whose supporting facts are absent from the documents."
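
The gating pattern described above can be sketched as follows. This is a two-specialist simplification; the real build_documents() covers ~25 specialists with one gated block each:

```python
def build_documents(state):
    """Emit a doc message only when the specialist actually produced data.
    An absent key emits no document, so the model has nothing to cite and
    the corresponding briefing section is simply omitted."""
    docs = []
    if state.get("sandy") is not None:
        docs.append({"doc_id": "sandy", "text": f"Sandy inundation: {state['sandy']}"})
    if state.get("nws_alerts") is not None:
        docs.append({"doc_id": "nws_alerts", "text": f"Active alerts: {state['nws_alerts']}"})
    return docs
```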

Why public-record-only at runtime

Data governance: a newsroom with FOIL'd documents, or an agency with internal capital plans, can't paste that data into a vendor LLM (ARCHITECTURE.md Β§11). All specialist data comes from NYC OpenData, USGS, NOAA, NWS, FloodNet NYC (public sensor network). No commercial data; no private address databases. The system is reproducible and auditable.

Why the four epistemic tiers (empirical / modeled / proxy / synthetic)

The distinction matters for how much weight to give each signal, documented in ARCHITECTURE.md Β§1.2. Empirical (Sandy HWMs, Ida HWMs, FloodNet events) = something flooded a place and was measured. Modeled scenarios (DEP, FEMA NFHL) = hydraulic simulation under assumptions. Proxy (311 complaints, HAND, TWI) = indirect indicators. Synthetic prior (TerraMind synthesis) = generative model output, never "imaged" or "reconstructed." The build_documents() function embeds these interpretive framing sentences directly into the doc bodies so Granite is instructed in the document itself how to characterize each source.
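
The embed-the-framing-in-the-document idea can be sketched like this. The framing sentences below are paraphrases of the four-tier descriptions above, not the actual wording in build_documents():

```python
# Paraphrased framing per epistemic tier (illustrative, not the real strings).
FRAMING = {
    "empirical": "Measured flooding: treat as observed fact.",
    "modeled": "Hydraulic simulation under stated assumptions; say 'modeled', not 'observed'.",
    "proxy": "Indirect indicator; characterize as a correlate, not a measurement.",
    "synthetic": "Generative model output; never describe as imaged or reconstructed.",
}

def frame_doc(tier, body):
    """Prepend the tier's interpretive framing so the instruction travels
    inside the document itself, where the model reads it."""
    return f"{FRAMING[tier]} {body}"
```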

Why the Five Stones names

Functional grouping for a trace UI with 25+ specialists. Stonework vocabulary maps to function: Cornerstone remembers the foundation (static hazard record); Keystone is the load-bearing arch piece (what's exposed); Touchstone is the evaluative reference (current state); Lodestone draws you toward something (forecast pull); Capstone is the crown that holds the vault (synthesis). The names let a non-technical demo audience follow the 25-step trace without reading each step label.

Why citation-grounded prose vs structured output

JSON structured output (tier + per-field arrays) is easy to produce but hard to cite in a grant application or news article. The four-section prose format with [doc_id] tags produces text a planner can quote in a FEMA BRIC sub-application or a journalist can use verbatim with inline sourcing. The citation tags map to clickable source chips in the frontend. Structured JSON of the underlying specialist outputs is also available in the final SSE event for machine consumption.

Why Mellea rejection sampling (vs post-hoc sentence dropping)

The original verify_paragraph() in app/reconcile.py drops sentences after generation. This produces a shorter briefing and a silent quality improvement β€” but the user sees a briefing that may have had sentences removed. The Mellea rejection sampler rerolls the entire generation when it fails, and streams each attempt's tokens to the user live (visible progress), then shows a green/amber inline banner. The user understands the system is enforcing quality, not silently deleting content. Psychologically this is more defensible in a professional context.
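
The reroll loop can be sketched as a generic rejection sampler. generate and the check functions stand in for the Granite call and the four grounding invariants; the feedback wording is illustrative:

```python
def reconcile(generate, checks, loop_budget=2):
    """Rejection-sampling loop: generate, run every grounding check, and
    reroll with targeted feedback until all pass or the budget runs out.
    Returns (text, attempts_used, all_passed)."""
    feedback = None
    text = ""
    for attempt in range(loop_budget):
        text = generate(feedback)
        failures = [name for name, check in checks if not check(text)]
        if not failures:
            return text, attempt, True
        feedback = "Fix the sentences failing these checks: " + ", ".join(failures)
    return text, loop_budget - 1, False
```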

Why planner-then-Capstone two-LLM split

The planner is a structured-output routing task (small JSON, deterministic, temperature=0). It should be fast and cheap. The reconciler is a long-form synthesis task requiring dense citation discipline β€” it benefits from the larger context window and stronger instruction-following of the 8b model. Using 3b for routing keeps TTFB low (planner JSON appears in ~2 s vs ~8 s for 8b). On the HF Space both aliases map to 8b via RIPRAP_OLLAMA_3B_TAG=granite4.1:8b to avoid disk cost, accepting the TTFB penalty.

Why LiteLLM Router

The alternative was a hand-rolled if primary == "vllm": ... else: ollama.chat(...) dispatch. LiteLLM's Router gives model aliasing, failover, and a common call signature for free. The ~250-line shim in app/llm.py covers: Ollama-vs-vLLM backend selection, document-role message extraction for vLLM's HF chat template, [doc_id=X] β†’ [X] citation normalization, JSON-mode translation, and backend info for the UI badge. Any future backend (mlx-lm, llama.cpp, etc.) is a 10-line entry in _build_router().

Why vLLM emits [doc_id=X] while Ollama emits [X]

Ollama's Granite 4.1 Modelfile template lifts role="document <id>" messages into a <documents> block and the model emits bare [X] citations. The HF tokenizer template used by vLLM emits [doc_id=X]. The rest of Riprap (Mellea regex, frontend citation chip parser, sources footer) was written against [X]. The _CITE_NORMALIZE_RE in app/llm.py normalizes per-chunk in streaming, preventing any vLLM-specific citation format from leaking downstream.
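
The normalization itself is a single substitution; the pattern below is an illustrative stand-in for the real _CITE_NORMALIZE_RE in app/llm.py:

```python
import re

# Matches vLLM-style [doc_id=X] citation tags (illustrative pattern).
_CITE = re.compile(r"\[doc_id=([A-Za-z0-9_\-]+)\]")

def normalize_citations(chunk):
    """Rewrite [doc_id=X] to the bare [X] form the rest of the pipeline expects."""
    return _CITE.sub(r"[\1]", chunk)
```

Because this runs per streaming chunk, a tag split across two chunks would escape the regex, which is exactly the theoretical gap noted earlier.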

Why Prithvi runs offline (baked GeoJSON) while TTM runs live

Prithvi-EO 2.0 with TerraTorch needs GPU and minutes per HLS tile. Running it per-query on a CPU-basic Space is not viable. The 166-polygon GeoJSON was computed once on AMD MI300X, filtered (>30,000 sqft to drop noise, <1 kmΒ² to drop tidal artifacts), and committed. The runtime FSM does point-in-polygon (milliseconds). This is honest about what EO models earn their keep on: a one-time defensible event-level signal, not per-request inference. TTM r2 at 1.5 M params runs in milliseconds on CPU β€” no such tradeoff exists.
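
The millisecond runtime check is cheap enough to sketch in pure Python. This ray-casting version is illustrative; the real code may use a geometry library, and real polygons can have holes:

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting point-in-polygon test against one GeoJSON-style exterior
    ring (list of [lon, lat] pairs). Counts crossings of a ray cast east."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):  # edge straddles the point's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside
```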

Why citations_dense uses sentence scope, not character window

The original implementation used ~40 chars proximity between a number and its citation tag. This was fragile for normal English sentence structure ("The address has 11 flood-related complaints [nyc311] within 200 m"). The citation might be 60 chars from the number. Switching to sentence scope (.[\s)] split) eliminated the chronic 3/4 neighborhood-query failure mode. "Sentence scope" is also how human readers actually assign attribution β€” the citation at the end of the sentence covers the claim anywhere in that sentence.
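
A sketch of the sentence-scope check follows. The regexes are illustrative stand-ins for _NUM_RE and the citation pattern, not the actual patterns in app/mellea_validator.py; the negative lookbehind is the identifier-aware trick that keeps codes like BK73 from false-firing:

```python
import re

_SENT_SPLIT = re.compile(r"(?<=\.)[\s)]+")       # sentence boundary: period, then space/paren
_NUM = re.compile(r"(?<![\w\[])\d+(?:\.\d+)?")   # numbers not embedded in identifiers or tags
_CITE = re.compile(r"\[[a-z0-9_]+\]")            # bare [doc_id] citation tag

def uncited_numeric_sentences(paragraph):
    """Sentences containing a number but no citation anywhere in the sentence."""
    bad = []
    for sent in _SENT_SPLIT.split(paragraph):
        if _NUM.search(sent) and not _CITE.search(sent):
            bad.append(sent.strip())
    return bad
```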


9. What's next

From OPEN-ISSUES.md, CLAUDE.md polish targets, and code-level TODO comments, in priority order for the May 13 ASCE presentation:

  1. Demo-script dry run against live Space. Space sometimes sleeps after idle; cold start is 30–90 s. Pre-ping the Space before presenting. Verify the backend pill shows correct hardware.
  2. compare intent wiring. planner.py declares the compare intent (the NOT_IMPLEMENTED comment is misleading: the planner doesn't short-circuit compare, it just routes to single_address by default). To make the compare flow work end-to-end, web/main.py:api_agent_stream needs routing that runs i_addr.run twice in parallel, or a new compare.py intent module.
  3. FEMA NFHL layer. The scoring rubric has fema_1pct and fema_02pct weights but no FSM step or data layer. Adding the FEMA NFHL download and a step_fema_nfhl action would materially improve Regulatory sub-index accuracy for addresses in AE/VE zones that aren't in Sandy extent.
  4. NYCHA/DOE/DOH registers on Space. RIPRAP_NYCHA_REGISTERS=0 by default. Enabling on HF Space would add 3 more Keystone specialists to every single_address query but requires the 91 MB sandy GeoJSON pre-load to complete within Space startup time.
  5. Droplet redeploy verification. The services/riprap-models/Dockerfile was never tested end-to-end. The safetensors==0.8.0rc0 RC pin is the most likely failure point. Next droplet bring-up should test this first.
  6. Experiments OPEN-ISSUES.md items. All four issues are in experiments/ only (F821 numpy annotation in exp17, f-string Py 3.12+ syntax in exp18, B023 closure variable in exp05, F841 unused api in exp18). Won't affect production but clean up the codebase.
  7. Reranker integration. app/rag.py has a full _ensure_reranker() and RIPRAP_RERANKER_ENABLE flag for ibm-granite/granite-embedding-reranker-english-r2 cross-encoder. Off by default (no HF Space disk for the CrossEncoder model). Enabling on the AMD droplet path would improve Policy context quality at no latency cost.
  8. Historical replay / retrospective mode. Blocked at planner with not-implemented message. Substantial feature: would require snapshotting specialist output at query time or storing NOAA/311/FloodNet historical pull results.

10. Quick reference: files that matter

| Task | Open first |
| --- | --- |
| Add a new specialist | app/fsm.py (add @action + wire into build_app()), app/reconcile.py:build_documents() (add doc emission), app/intents/single_address.py (usually no change needed), web/sveltekit/src/ (add step label + source card) |
| Change the briefing structure / system prompt | app/reconcile.py:EXTRA_SYSTEM_PROMPT, then app/intents/neighborhood.py:EXTRA_SYSTEM_PROMPT for the neighborhood path; rebuild web/sveltekit if adding new section rendering |
| Tune the Mellea grounding checks | app/mellea_validator.py: _NUM_RE, _TRIVIAL_NUMS, _check_every_claim_cited(), _failing_sentences_for_citations() |
| Change which backend (vLLM vs Ollama) | app/llm.py env vars; no code change needed |
| Add a new intent | app/planner.py:INTENTS + SPECIALISTS entries, _required_specialists(), then a new app/intents/<intent>.py; wire in web/main.py:api_agent_stream and api_agent |
| Change the exposure tier scoring | app/score.py:REGULATORY/HYDROLOGICAL/EMPIRICAL dicts + TIER_BREAKPOINTS; update METHODOLOGY.md |
| Debug why a specialist fired wrong | scripts/probe_mellea.py --query "<address>" --runs 1; check step events in the SSE stream; look at final.mellea.requirements_failed |
| Rebuild the frontend | cd web/sveltekit && npm run build (new design-system UI); cd web/svelte && npm run build (legacy Svelte 5 custom elements to web/static/dist/riprap.js) |
| Run the full end-to-end test | .venv/bin/python scripts/probe_addresses.py |
| Rebuild the pre-computed registers | scripts/build_mta_entrances_register.py, scripts/build_nycha_register.py, scripts/build_schools_register.py |
| Rebuild Prithvi Ida polygons | scripts/run_prithvi_ida.py (needs GPU + TerraTorch) |
| Rebuild the pitch deck | cd slides && make pdf html pptx (needs marp-cli) |
| Add a question-type framing | app/framing.py:_PATTERNS + _DIRECTIVES |
| Understand why a doc was missing from the briefing | Check build_documents() in app/reconcile.py (each block has an explicit gate condition); also check trim_docs_to_plan() |
| Understand the SSE stream structure | web/main.py:api_agent_stream, the _STEP_TO_STONE dict, and the stone_start/stone_done wrapping logic |
| Deploy to HF Space | git push && git push huggingface main; monitor rebuild via `curl -sf "https://huggingface.co/api/spaces/lablab-ai-amd-developer-hackathon/riprap-nyc/runtime"` |
| Deploy to AMD droplet | scripts/deploy_droplet.sh <ip> <token>, then set Space env vars via huggingface-cli space variables, restart Space |