seriffic Claude Opus 4.7 (1M context) committed
Commit
abcf7cd
·
1 Parent(s): 86e2a29

feat: route all GPU-accelerable inference through MI300X (Phase 1+2 of full GPU)

The user's directive: "I want anything that can be GPU accelerated to
run on there. Otherwise, keep it on the CPU of wherever it's running."

Lands seven pieces in one commit; each can stand alone but they're
interlocked.

app/inference.py (new, 224 lines)
Router shim that mirrors app/llm.py's shape but for non-LLM models.
Exports: prithvi_pluvial(), terramind(), ttm_forecast(),
granite_embed(), gliner_extract(), healthcheck(), backend_info(),
plus the typed RemoteUnreachable exception caller modules catch
to fall back to local. Env-driven via RIPRAP_ML_BACKEND
(auto|remote|local) / RIPRAP_ML_BASE_URL / RIPRAP_ML_API_KEY,
same shape as RIPRAP_LLM_*.
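
A minimal sketch of calling the router directly (illustrative values:
the URL and key below are placeholders, and the env must be set before
app.inference is first imported, since the module reads the vars at
load time):

    import os
    os.environ["RIPRAP_ML_BASE_URL"] = "http://<droplet>:7860"  # placeholder
    os.environ["RIPRAP_ML_API_KEY"] = "<bearer>"                # placeholder

    from app import inference

    print(inference.backend_info())      # backend / base_url / reachable
    if inference.healthcheck():
        try:
            out = inference.granite_embed(["storm surge at the Battery"])
            if out.get("ok"):
                vectors = out["vectors"]  # 768-d float lists
        except inference.RemoteUnreachable:
            pass                          # caller's local fallback goes here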

services/riprap-models/ (new microservice)
FastAPI service that runs alongside vLLM on the AMD MI300X
droplet. One endpoint per model class:
    /v1/prithvi-pluvial   Prithvi-NYC-Pluvial v2 segmentation
    /v1/terramind         LULC / Buildings / Synthesis (LoRA)
    /v1/ttm-forecast      Granite TTM r2 (zero-shot + Battery
                          fine-tune + 311 + FloodNet)
    /v1/granite-embed     Granite Embedding 278M batch encode
    /v1/gliner-extract    GLiNER typed-entity extraction
    /healthz              reachability + warm-model list
Bearer auth same shape as vLLM. Lazy + cached model loads, ROCm
device binding via torch.cuda. Model loading code lifted from
the proven local paths (terratorch / peft / safetensors / tsfm
/ sentence-transformers / gliner). Designed to live in the
existing `terramind` Docker container on the droplet, which
already has every heavy dep installed.
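
For reference, a hand-rolled call against the service with httpx (the
same transport the router uses); the host, token, and sample text are
placeholders:

    import httpx

    headers = {"Authorization": "Bearer <token>"}              # placeholder
    with httpx.Client(base_url="http://<droplet>:7860",
                      headers=headers) as c:
        assert c.get("/healthz").status_code == 200  # /healthz needs no auth
        r = c.post("/v1/gliner-extract",
                   json={"text": "Flooding reported near the Gowanus Canal.",
                         "labels": ["location", "infrastructure"]})
        r.raise_for_status()
        for e in r.json()["entities"]:  # [{label, text, start, end, score}]
            print(e["label"], "->", e["text"])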

Deploy:
Code rsync'd into the terramind container at /workspace/riprap-models
earlier in this session and pip install ran clean. Dropping
`uvicorn main:app --host 0.0.0.0 --port 7860` inside the
container brings it up on the host's already-mapped port 7860.
Currently blocked: droplet 129.212.181.238 went unreachable
mid-deploy; resume the start command once SSH comes back.

Per-specialist wiring (try-remote-then-local):
    app/flood_layers/prithvi_live.py — Prithvi-NYC-Pluvial v2 (live)
    app/context/terramind_nyc.py     — TerraMind LULC + Buildings
    app/live/ttm_forecast.py         — TTM r2 zero-shot (Battery /
                                       311 / FloodNet variants share
                                       one inference function)
    app/live/ttm_battery_surge.py    — TTM r2 NYC fine-tune
    app/rag.py                       — Granite Embedding 278M
                                       (corpus encode + per-query)
    app/context/gliner_extract.py    — GLiNER typed extraction

Each module: try remote first, fall back to local on
RemoteUnreachable. Local _DEPS_OK gates only matter for the
fallback path now — the cpu-basic HF Space can run end-to-end
once the droplet service is live without baking transformers /
peft / terratorch / tsfm_public / sentence-transformers /
gliner into its image.
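
Condensed, the wiring every module above follows (the _DEPS_OK gate
and run_local body vary per specialist; this skeleton is
illustrative):

    def run_specialist(payload):
        try:
            from app import inference as _inf
            if _inf.remote_enabled():
                remote = _inf.gliner_extract(payload, ENTITY_LABELS)  # per-specialist entry point
                if remote.get("ok"):
                    return remote             # carries "compute": "remote · ..."
        except _inf.RemoteUnreachable:
            pass                              # droplet down/unconfigured -> local
        if not _DEPS_OK:                      # gate only matters on this path now
            return {"ok": False, "skipped": "deps unavailable"}
        return run_local(payload)             # in-process model load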

Result objects gain a `compute` field ("remote · cuda" / "local")
so the UI can surface where each specialist's GPU work landed.

The router fails open: with no env config, remote_enabled()=False and
every specialist takes its existing local path. Set RIPRAP_ML_BASE_URL
and the remote path activates without code changes.
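
Concretely, the gate is just (copied from app/inference.py, with
annotations added here):

    def remote_enabled() -> bool:
        if _BACKEND == "local":   # explicit opt-out wins
            return False
        if not _BASE_URL:         # the no-env default: everything stays local
            return False
        return True               # auto/remote with a URL: remote is attempted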

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

app/context/gliner_extract.py CHANGED
@@ -80,8 +80,30 @@ def _source_short(rag_doc_id: str) -> str:
 
 
 def extract_for_chunk(text: str, threshold: float = DEFAULT_THRESHOLD) -> list[Extraction]:
+    if not text:
+        return []
+
+    # v0.4.5 — try the MI300X service first. The remote handles its
+    # own GLiNER load; this lets cpu-basic surfaces run typed
+    # extraction without baking gliner into the image.
+    try:
+        from app import inference as _inf
+        if _inf.remote_enabled():
+            remote = _inf.gliner_extract(text, ENTITY_LABELS)
+            if remote.get("ok"):
+                return [
+                    Extraction(label=e["label"], text=e["text"],
+                               score=float(e.get("score", 0)))
+                    for e in remote.get("entities", [])
+                    if e.get("score", 0) >= threshold
+                ]
+    except _inf.RemoteUnreachable as e:
+        log.info("gliner: remote unreachable (%s); local fallback", e)
+    except Exception:
+        log.exception("gliner: remote call failed; local fallback")
+
     model = _ensure_model()
-    if model is None or not text:
+    if model is None:
         return []
     raw = model.predict_entities(text, ENTITY_LABELS, threshold=threshold)
     return [Extraction(label=r["label"], text=r["text"],
app/context/terramind_nyc.py CHANGED
@@ -293,11 +293,51 @@ def _summarize_buildings(pred, class_labels: list[str]) -> dict[str, Any]:
     }
 
 
+def _try_remote(adapter_name: str, modality_chips: dict) -> dict | None:
+    """v0.4.5 — POST to MI300X riprap-models if configured. Returns the
+    parsed result on success; None on RemoteUnreachable so the caller
+    falls through to the local terratorch path."""
+    try:
+        from app import inference as _inf
+        if not _inf.remote_enabled():
+            return None
+        s2 = modality_chips.get("S2L2A")
+        s1 = modality_chips.get("S1RTC")
+        dem = modality_chips.get("DEM")
+        # The router serializes torch tensors to base64 numpy float32 —
+        # the chip cache hands us [B, C, T, H, W]; keep that shape, the
+        # service rebuilds the temporal stack on its end.
+        result = _inf.terramind(adapter_name, s2, s1, dem)
+        if not result.get("ok"):
+            return None
+        result.setdefault("adapter", adapter_name)
+        result.setdefault("repo", ADAPTERS_REPO)
+        result["compute"] = f"remote · {result.get('device', 'gpu')}"
+        return result
+    except _inf.RemoteUnreachable as e:
+        log.info("terramind/%s: remote unreachable (%s); local fallback",
+                 adapter_name, e)
+        return None
+    except Exception:
+        log.exception("terramind/%s: remote call failed; local fallback",
+                      adapter_name)
+        return None
+
+
 def _run(adapter_name: str, modality_chips: dict, summarizer):
-    """Common boilerplate: gate, time, load, tiled predict, summarize."""
+    """Common boilerplate: gate, time, [remote attempt], load, tiled
+    predict, summarize."""
     if not ENABLE:
         return {"ok": False,
                 "skipped": "RIPRAP_TERRAMIND_NYC_ENABLE=0"}
+
+    # v0.4.5 — try remote first. The remote service has its own deps,
+    # so this path works even when local _DEPS_OK is False (the most
+    # common HF Spaces case until terratorch + peft are baked in).
+    remote = _try_remote(adapter_name, modality_chips or {})
+    if remote is not None:
+        return remote
+
     if not _DEPS_OK:
         return {"ok": False,
                 "skipped": f"deps unavailable on this deployment: "
@@ -315,6 +355,7 @@ def _run(adapter_name: str, modality_chips: dict, summarizer):
         result["elapsed_s"] = round(time.time() - t0, 2)
         result["adapter"] = adapter_name
         result["repo"] = ADAPTERS_REPO
+        result["compute"] = "local"
         return result
     except Exception as e:
         log.exception("terramind_nyc.%s failed", adapter_name)
app/flood_layers/prithvi_live.py CHANGED
@@ -350,6 +350,43 @@ def fetch(lat: float, lon: float, timeout_s: float = 60.0) -> dict[str, Any]:
         img, ref_da, epsg = _build_chip(item, lat, lon)
         if time.time() - t0 > timeout_s:
             return {"ok": False, "skipped": "chip build exceeded budget"}
+
+        # v0.4.5 — try the MI300X inference service first if configured.
+        # On RemoteUnreachable (service down / not configured / 5xx) fall
+        # through to the local terratorch path. The 4-band slice the
+        # service expects is the same shape the local path uses.
+        try:
+            from app import inference as _inf
+            if _inf.remote_enabled():
+                remote = _inf.prithvi_pluvial(
+                    img, scene_id=item.id,
+                    scene_datetime=str(item.datetime),
+                    cloud_cover=cc,
+                    timeout=timeout_s,
+                )
+                if remote.get("ok"):
+                    return {
+                        "ok": True,
+                        "item_id": item.id,
+                        "item_datetime": str(item.datetime),
+                        "cloud_cover": cc,
+                        "pct_water_full": remote.get("pct_water_full"),
+                        "pct_water_within_500m": remote.get("pct_water_within_500m"),
+                        # Service doesn't currently return polygonised GeoJSON
+                        # (transport size); the local fallback below produces
+                        # them. For now the remote path leaves polygons null
+                        # and the map renders the layer empty until the
+                        # service grows a polygonisation step.
+                        "polygons_geojson": None,
+                        "compute": f"remote · {remote.get('device', 'gpu')}",
+                        "elapsed_s": round(time.time() - t0, 2),
+                    }
+        except _inf.RemoteUnreachable as e:
+            log.info("prithvi_live: remote unreachable (%s); falling back to local", e)
+        except Exception:
+            log.exception("prithvi_live: remote call failed; falling back to local")
+
+        # Local fallback — the path that's been live since v0.4.4.
         model, run_model = _ensure_model()
         x = img[None, :, None, :, :]  # (1, 6, 1, H, W)
         pred_t = run_model(x, None, None, model.model, model.datamodule, IMG_SIZE)
@@ -361,7 +398,6 @@ def fetch(lat: float, lon: float, timeout_s: float = 60.0) -> dict[str, Any]:
         radius_px = CENTER_RADIUS_M / PIXEL_M
         circle = (yy - cy) ** 2 + (xx - cx) ** 2 <= radius_px ** 2
         pct_500 = float(100.0 * pred[circle].mean()) if circle.sum() else 0.0
-        # Polygonize the water mask into EPSG:4326 GeoJSON for the map.
         polygons_geojson = _polygonize_mask(pred, ref_da, epsg)
         return {
             "ok": True,
@@ -371,6 +407,7 @@ def fetch(lat: float, lon: float, timeout_s: float = 60.0) -> dict[str, Any]:
             "pct_water_full": pct_full,
             "pct_water_within_500m": pct_500,
             "polygons_geojson": polygons_geojson,
+            "compute": "local",
             "elapsed_s": round(time.time() - t0, 2),
         }
     except Exception as e:
app/inference.py ADDED
@@ -0,0 +1,224 @@
+"""Remote-vs-local ML inference router.
+
+Mirrors the call-surface shape of `app/llm.py` but for the non-LLM
+heavy models (Prithvi, TerraMind, TTM, Granite Embedding, GLiNER).
+
+The droplet runs a `riprap-models` FastAPI service alongside vLLM that
+exposes an OpenAI-style endpoint per model class. When configured the
+router POSTs the relevant payload there and returns the parsed response;
+on connection error / 5xx / timeout it surfaces a typed exception that
+caller modules catch and fall back to a local in-process model load.
+
+Backend selection (env):
+
+    RIPRAP_ML_BACKEND = "remote" | "local" | "auto"  (default: auto)
+        - remote: use only the droplet, raise if it errors
+        - local : never call the droplet, always use the
+                  in-process model
+        - auto  : try remote first, fall back to local if
+                  remote is unreachable / errors out;
+                  same semantics as app/llm.py
+    RIPRAP_ML_BASE_URL = http://129.212.181.238:8002  (no trailing slash)
+    RIPRAP_ML_API_KEY  = <bearer token>
+
+The router is *transport*-only — it does not own model bytes, weights,
+or framework imports. Each specialist that wants remote inference calls
+into the helpers below and provides its own local fallback. That keeps
+the dependency graph clean: the local code path keeps working when the
+RIPRAP_ML_* env is unset (e.g. on first-light dev or in unit tests).
+"""
+from __future__ import annotations
+
+import base64
+import io
+import logging
+import os
+from typing import Any, Iterable
+
+import httpx
+
+log = logging.getLogger("riprap.inference")
+
+_BACKEND = os.environ.get("RIPRAP_ML_BACKEND", "auto").lower()
+_BASE_URL = os.environ.get("RIPRAP_ML_BASE_URL", "").rstrip("/")
+_API_KEY = os.environ.get("RIPRAP_ML_API_KEY", "")
+_DEFAULT_TIMEOUT = float(os.environ.get("RIPRAP_ML_TIMEOUT_S", "60"))
+
+
+class RemoteUnreachable(RuntimeError):
+    """Raised when the remote inference service is unconfigured, down,
+    times out, or returns 5xx. Callers catch this to fall through to a
+    local model load. 4xx errors propagate as the generic exception so
+    a caller bug doesn't get masked by a "fallback to local" path."""
+
+
+def remote_enabled() -> bool:
+    """True iff the router is configured to attempt remote calls.
+    Returns False under explicit `local` mode or when the base URL is
+    empty (the auto-default with no env config)."""
+    if _BACKEND == "local":
+        return False
+    if not _BASE_URL:
+        return False
+    return True
+
+
+def _client(timeout: float | None = None) -> httpx.Client:
+    headers = {"User-Agent": "riprap-app/0.4.5"}
+    if _API_KEY:
+        headers["Authorization"] = f"Bearer {_API_KEY}"
+    return httpx.Client(
+        base_url=_BASE_URL,
+        headers=headers,
+        timeout=timeout if timeout is not None else _DEFAULT_TIMEOUT,
+    )
+
+
+def _post(path: str, payload: dict[str, Any], timeout: float | None = None) -> dict:
+    """POST {payload} as JSON to the remote service's `path`. Returns the
+    parsed JSON body. Raises RemoteUnreachable on transport errors;
+    raises HTTPStatusError on 4xx so caller bugs surface."""
+    if not remote_enabled():
+        raise RemoteUnreachable("remote ML backend not configured "
+                                "(RIPRAP_ML_BASE_URL empty or BACKEND=local)")
+    try:
+        with _client(timeout) as c:
+            r = c.post(path, json=payload)
+    except (httpx.ConnectError, httpx.ReadError, httpx.WriteError,
+            httpx.TimeoutException, httpx.RemoteProtocolError) as e:
+        raise RemoteUnreachable(f"{type(e).__name__}: {e}") from e
+    if r.status_code >= 500:
+        raise RemoteUnreachable(f"HTTP {r.status_code} from {path}: {r.text[:200]}")
+    r.raise_for_status()
+    return r.json()
+
+
+def _serialize_array(arr) -> str:
+    """numpy/torch tensor → base64-encoded float32 raw bytes for transport.
+    Each remote handler decodes to (shape, dtype=float32) and reconstructs.
+    Reasonable round-trip for chips up to a few MB; large rasters should
+    use compressed numpy-savez instead — TODO when a model needs > 8 MB."""
+    import numpy as np
+    np_arr = arr if isinstance(arr, np.ndarray) else _to_numpy(arr)
+    np_arr = np_arr.astype("float32", copy=False)
+    return base64.b64encode(np_arr.tobytes()).decode("ascii")
+
+
+def _to_numpy(t):
+    """Best-effort tensor → numpy. Accepts torch.Tensor or numpy already."""
+    try:
+        import torch
+        if isinstance(t, torch.Tensor):
+            return t.detach().cpu().numpy()
+    except ImportError:
+        pass
+    import numpy as np
+    return np.asarray(t)
+
+
+def _deserialize_array(b64: str, shape: list[int]):
+    """Inverse of _serialize_array — bytes → numpy float32 with given shape."""
+    import numpy as np
+    raw = base64.b64decode(b64)
+    return np.frombuffer(raw, dtype="float32").reshape(shape)
+
+
+# ---- Public router entry points -------------------------------------------
+
+def healthcheck(timeout: float = 3.0) -> bool:
+    """Quick reachability probe. True if the service responds 200 to GET
+    /healthz within `timeout` seconds. Used by /api/backend so the UI can
+    show whether the remote ML backend is currently live."""
+    if not remote_enabled():
+        return False
+    try:
+        with _client(timeout) as c:
+            r = c.get("/healthz")
+            return r.status_code == 200
+    except Exception:
+        return False
+
+
+def backend_info() -> dict[str, Any]:
+    """Snapshot for /api/backend — what the UI should advertise."""
+    return {
+        "backend": _BACKEND,
+        "base_url": _BASE_URL or None,
+        "remote_enabled": remote_enabled(),
+        "reachable": healthcheck() if remote_enabled() else False,
+    }
+
+
+def prithvi_pluvial(s2_chip, *, scene_id: str | None = None,
+                    scene_datetime: str | None = None,
+                    cloud_cover: float | None = None,
+                    timeout: float | None = None) -> dict[str, Any]:
+    """Remote forward pass through Prithvi-NYC-Pluvial v2.
+    Input: 6-band Sentinel-2 chip (numpy or torch, shape [6, H, W]).
+    Output: { ok, pct_water_within_500m, pct_water_full, scene_id, ... }.
+    Raises RemoteUnreachable if the service is down."""
+    arr = _to_numpy(s2_chip)
+    return _post("/v1/prithvi-pluvial", {
+        "s2": _serialize_array(arr),
+        "shape": list(arr.shape),
+        "scene_id": scene_id,
+        "scene_datetime": scene_datetime,
+        "cloud_cover": cloud_cover,
+    }, timeout=timeout)
+
+
+def terramind(adapter: str, s2l2a, s1rtc=None, dem=None, *,
+              timeout: float | None = None) -> dict[str, Any]:
+    """Remote forward through TerraMind-NYC-Adapters (LULC or Buildings)
+    or the v1 base (synthetic). `adapter` is one of: lulc, buildings,
+    synthesis. Each modality is a numpy array or None."""
+    payload: dict[str, Any] = {"adapter": adapter}
+    s2_np = _to_numpy(s2l2a)
+    payload["s2"] = _serialize_array(s2_np)
+    payload["s2_shape"] = list(s2_np.shape)
+    if s1rtc is not None:
+        s1_np = _to_numpy(s1rtc)
+        payload["s1"] = _serialize_array(s1_np)
+        payload["s1_shape"] = list(s1_np.shape)
+    if dem is not None:
+        dem_np = _to_numpy(dem)
+        payload["dem"] = _serialize_array(dem_np)
+        payload["dem_shape"] = list(dem_np.shape)
+    return _post("/v1/terramind", payload, timeout=timeout)
+
+
+def ttm_forecast(model: str, history: Iterable[float], *,
+                 context_length: int, prediction_length: int,
+                 cadence: str = "h",
+                 timeout: float | None = None) -> dict[str, Any]:
+    """Remote Granite TTM r2 forecast.
+    `model` is one of: zero_shot_battery, fine_tune_battery, weekly_311,
+    floodnet_recurrence — the service decides which checkpoint to use.
+    `history` is a 1-D iterable of floats (the time series); `cadence`
+    is for the service's labelling (h / d / w / 6m). Output shape is
+    `{ ok, forecast: [...], peak_index, peak_value }`."""
+    series = list(map(float, history))
+    return _post("/v1/ttm-forecast", {
+        "model": model,
+        "history": series,
+        "context_length": context_length,
+        "prediction_length": prediction_length,
+        "cadence": cadence,
+    }, timeout=timeout)
+
+
+def granite_embed(texts: list[str], *,
+                  timeout: float | None = None) -> dict[str, Any]:
+    """Remote Granite Embedding 278M batch encode.
+    Output: { ok, vectors: [[float, ...], ...] }. Vector dimension fixed
+    at 768 (granite-embedding-278m-multilingual)."""
+    return _post("/v1/granite-embed", {"texts": list(texts)}, timeout=timeout)
+
+
+def gliner_extract(text: str, labels: list[str], *,
+                   timeout: float | None = None) -> dict[str, Any]:
+    """Remote GLiNER typed-entity extraction.
+    Output: { ok, entities: [{label, text, start, end, score}, ...] }."""
+    return _post("/v1/gliner-extract", {
+        "text": text, "labels": list(labels),
+    }, timeout=timeout)
app/live/ttm_battery_surge.py CHANGED
@@ -230,10 +230,7 @@ def fetch(timeout_s: float = 60.0) -> dict[str, Any]:
     if not ENABLE:
         return {"available": False,
                 "reason": "RIPRAP_TTM_BATTERY_SURGE_ENABLE=0"}
-    if not _DEPS_OK:
-        return {"available": False,
-                "reason": f"deps unavailable on this deployment: "
-                          f"{_DEPS_MISSING}"}
+
     t0 = time.time()
     try:
         df = _fetch_battery_history(CONTEXT_LENGTH)
@@ -245,21 +242,51 @@ def fetch(timeout_s: float = 60.0) -> dict[str, Any]:
             return {"available": False,
                     "reason": "NOAA fetch exceeded budget"}
 
-        import torch
-        model = _ensure_model()
-        # [B=1, T=1024, C=1] tensor of metres surge residual.
         residuals = df["surge_residual_m"].to_numpy().astype("float32")
-        past = torch.from_numpy(residuals).unsqueeze(0).unsqueeze(-1)
-        if DEVICE == "cuda":
-            try:
-                if torch.cuda.is_available():
-                    past = past.cuda()
-            except Exception:
-                log.exception("ttm_battery_surge: cuda move failed")
-        with torch.no_grad():
-            out = model(past_values=past)
-        forecast = out.prediction_outputs.squeeze(-1).squeeze(0).cpu().numpy()
+
+        # v0.4.5 — try the MI300X service first. The remote handles its
+        # own model loading; if it's reachable we never need local
+        # tsfm_public, which lets the HF Space drop the granite-tsfm
+        # bake from the image.
+        forecast = None
+        compute = "local"
+        try:
+            from app import inference as _inf
+            if _inf.remote_enabled():
+                remote = _inf.ttm_forecast(
+                    "fine_tune_battery", residuals.tolist(),
+                    context_length=CONTEXT_LENGTH,
+                    prediction_length=PREDICTION_LENGTH,
+                    cadence="h",
+                    timeout=timeout_s,
+                )
+                if remote.get("ok"):
+                    import numpy as np
+                    forecast = np.asarray(remote["forecast"], dtype="float32")
+                    compute = f"remote · {remote.get('device', 'gpu')}"
+        except _inf.RemoteUnreachable as e:
+            log.info("ttm_battery_surge: remote unreachable (%s); local", e)
+
+        if forecast is None:
+            if not _DEPS_OK:
+                return {"available": False,
+                        "reason": f"deps unavailable on this deployment: "
+                                  f"{_DEPS_MISSING}"}
+            import torch
+            model = _ensure_model()
+            past = torch.from_numpy(residuals).unsqueeze(0).unsqueeze(-1)
+            if DEVICE == "cuda":
+                try:
+                    if torch.cuda.is_available():
+                        past = past.cuda()
+                except Exception:
+                    log.exception("ttm_battery_surge: cuda move failed")
+            with torch.no_grad():
+                out = model(past_values=past)
+            forecast = out.prediction_outputs.squeeze(-1).squeeze(0).cpu().numpy()
+
         result = _summarize(df, forecast)
+        result["compute"] = compute
         result["elapsed_s"] = round(time.time() - t0, 2)
         return result
     except Exception as e:
app/live/ttm_forecast.py CHANGED
@@ -180,16 +180,44 @@ def _residual_series(station_id: str,
 
 def _run_ttm(history: np.ndarray,
              context_length: int = CONTEXT_LENGTH,
-             prediction_length: int = PREDICTION_LENGTH) -> np.ndarray | None:
+             prediction_length: int = PREDICTION_LENGTH,
+             cadence: str = "h") -> np.ndarray | None:
     """Channel-wise standardize, run model, de-standardize. Returns a
-    `prediction_length`-step de-standardized forecast in input units."""
+    `prediction_length`-step de-standardized forecast in input units.
+
+    v0.4.5 — tries the MI300X riprap-models service first; falls back
+    to the local in-process model on RemoteUnreachable. The
+    standardize / de-standardize math is owned by THIS function so the
+    remote service stays a thin "given a series, give me a forecast"
+    contract.
+    """
+    mu = float(history.mean())
+    sigma = float(history.std() + 1e-6)
+    normed = (history - mu) / sigma
+
+    # Try remote first
+    try:
+        from app import inference as _inf
+        if _inf.remote_enabled():
+            remote = _inf.ttm_forecast(
+                "zero_shot_battery", normed.tolist(),
+                context_length=context_length,
+                prediction_length=prediction_length,
+                cadence=cadence,
+            )
+            if remote.get("ok"):
+                pred = np.asarray(remote["forecast"], dtype=np.float32)
+                return pred * sigma + mu
+    except _inf.RemoteUnreachable as e:
+        log.info("TTM zero-shot: remote unreachable (%s); local fallback", e)
+    except Exception:
+        log.exception("TTM zero-shot remote call failed; local fallback")
+
+    # Local fallback
     model = _load_model(context_length, prediction_length)
     if model is None:
        return None
    import torch
-    mu = float(history.mean())
-    sigma = float(history.std() + 1e-6)
-    normed = (history - mu) / sigma
     x = torch.from_numpy(normed.astype(np.float32))[None, :, None]
     try:
         with torch.no_grad():
app/rag.py CHANGED
@@ -132,15 +132,38 @@ def _ensure_index():
         _INDEX = {"chunks": [], "embs": None, "model": None}
         return _INDEX
 
-    from sentence_transformers import SentenceTransformer
-    log.info("rag: loading %s", EMBED_MODEL_NAME)
-    model = SentenceTransformer(EMBED_MODEL_NAME)
-
     texts = [c.text for c in chunks]
     log.info("rag: embedding %d chunks", len(texts))
-    embs = model.encode(texts, batch_size=32, show_progress_bar=False,
-                        convert_to_numpy=True, normalize_embeddings=True)
-    _INDEX = {"chunks": chunks, "embs": embs.astype("float32"), "model": model}
+
+    # v0.4.5 — try the MI300X service first. Avoids loading
+    # sentence-transformers + the granite-embedding weights on a
+    # cpu-basic surface (HF Space). Falls back to local on
+    # RemoteUnreachable so dev laptops keep working with no env.
+    embs = None
+    model = None
+    try:
+        from app import inference as _inf
+        if _inf.remote_enabled():
+            log.info("rag: encoding via remote MI300X")
+            remote = _inf.granite_embed(texts, timeout=120.0)
+            if remote.get("ok"):
+                embs = np.asarray(remote["vectors"], dtype="float32")
+                # Per-query encodes will also route through remote;
+                # `model` stays None and `retrieve()` checks for it.
+    except _inf.RemoteUnreachable as e:
+        log.info("rag: remote unreachable (%s); local fallback", e)
+    except Exception:
+        log.exception("rag: remote encode failed; local fallback")
+
+    if embs is None:
+        from sentence_transformers import SentenceTransformer
+        log.info("rag: loading %s (local fallback)", EMBED_MODEL_NAME)
+        model = SentenceTransformer(EMBED_MODEL_NAME)
+        embs = model.encode(texts, batch_size=32, show_progress_bar=False,
+                            convert_to_numpy=True, normalize_embeddings=True)
+    embs = embs.astype("float32")
+
+    _INDEX = {"chunks": chunks, "embs": embs, "model": model}
     log.info("rag: index ready (%s)", embs.shape)
     return _INDEX
 
@@ -173,8 +196,35 @@ def retrieve(query: str, k: int = 4, min_score: float = 0.30) -> list[dict]:
     idx = _ensure_index()
     if idx["embs"] is None or not idx["chunks"]:
         return []
-    qv = idx["model"].encode([query], convert_to_numpy=True,
-                             normalize_embeddings=True).astype("float32")
+
+    # v0.4.5 — encode query via remote when corpus was embedded remotely.
+    # `_ensure_index` leaves `model = None` when it took the remote
+    # path, so this branch handles both:
+    #   - model present → local SentenceTransformer.encode (fast, in-mem)
+    #   - model is None → POST to MI300X, fallback to a one-shot local
+    #     SentenceTransformer load if remote is down.
+    if idx["model"] is not None:
+        qv = idx["model"].encode([query], convert_to_numpy=True,
+                                 normalize_embeddings=True).astype("float32")
+    else:
+        qv = None
+        try:
+            from app import inference as _inf
+            if _inf.remote_enabled():
+                remote = _inf.granite_embed([query])
+                if remote.get("ok"):
+                    qv = np.asarray(remote["vectors"], dtype="float32")
+        except _inf.RemoteUnreachable as e:
+            log.info("rag: per-query encode remote unreachable (%s)", e)
+        if qv is None:
+            from sentence_transformers import SentenceTransformer
+            log.info("rag: cold-loading %s for per-query encode (remote down)",
+                     EMBED_MODEL_NAME)
+            local = SentenceTransformer(EMBED_MODEL_NAME)
+            qv = local.encode([query], convert_to_numpy=True,
+                              normalize_embeddings=True).astype("float32")
+            # Cache so subsequent queries don't re-load
+            idx["model"] = local
     sims = (idx["embs"] @ qv.T).ravel()
 
     reranker = _ensure_reranker()
services/riprap-models/README.md ADDED
@@ -0,0 +1,66 @@
+# Riprap Models — droplet inference service
+
+GPU inference microservice that runs alongside vLLM on the AMD MI300X
+droplet. Exposes one HTTP endpoint per model class consumed by the
+Riprap FastAPI app's specialists, so all GPU-accelerable forward
+passes (Prithvi-NYC-Pluvial, TerraMind LULC + Buildings, Granite TTM
+r2, Granite Embedding 278M, GLiNER) run on the MI300X regardless of
+which surface — laptop or HF Space — hosts the FastAPI process.
+
+## Service contract
+
+| Method | Path | Purpose |
+|---|---|---|
+| GET | `/healthz` | reachability probe + which models are warm |
+| POST | `/v1/prithvi-pluvial` | Prithvi-NYC-Pluvial v2 segmentation |
+| POST | `/v1/terramind` | TerraMind LULC / Buildings / Synthesis (adapter-dispatched) |
+| POST | `/v1/ttm-forecast` | Granite TTM r2 (zero-shot Battery, fine-tune Battery, weekly 311, FloodNet recurrence) |
+| POST | `/v1/granite-embed` | Granite Embedding 278M batch encode |
+| POST | `/v1/gliner-extract` | GLiNER typed-entity extraction |
+
+Auth: bearer token on every `/v1/*` route via `RIPRAP_MODELS_API_KEY`.
+Same shape as vLLM. `/healthz` is open so liveness probes don't need
+auth.
+
+## Deploy
+
+The droplet's existing `terramind` container already has
+`torch+ROCm 7.0`, `terratorch 1.2.7`, `granite-tsfm 0.3.6`,
+`transformers 4.57`, `peft`, `safetensors`, `fastapi`, `uvicorn`. The
+service code lands under `/workspace/riprap-models/`; only deltas
+need installing.
+
+```bash
+# Copy code (run from project root)
+ssh root@129.212.181.238 'mkdir -p /workspace/riprap-models'
+rsync -av --delete services/riprap-models/ \
+    root@129.212.181.238:/workspace/riprap-models/
+
+# Install deltas + start uvicorn inside the terramind container
+ssh root@129.212.181.238 bash <<'REMOTE'
+docker cp /workspace/riprap-models terramind:/workspace/
+docker exec -d -e RIPRAP_MODELS_API_KEY="$RIPRAP_MODELS_API_KEY" terramind \
+    bash -c "cd /workspace/riprap-models && \
+             pip install --no-cache-dir -r requirements.txt && \
+             uvicorn main:app --host 0.0.0.0 --port 7860 --log-level info \
+             > /workspace/riprap-models.log 2>&1"
+REMOTE
+```
+
+Service binds inside the container at `:7860`; the host port
+mapping was set when the `terramind` container was created
+(`docker run -p 7860:7860 ...`), so externally the service is at
+`http://129.212.181.238:7860`.
+
+## Local app config
+
+Set in either env or HF Space variables:
+
+```
+RIPRAP_ML_BACKEND  = remote
+RIPRAP_ML_BASE_URL = http://129.212.181.238:7860
+RIPRAP_ML_API_KEY  = <bearer>
+```
+
+`app/inference.py` posts to those endpoints; specialists fall back
+to local in-process model loads when the service is unreachable.
services/riprap-models/main.py ADDED
@@ -0,0 +1,561 @@
+"""Riprap Models — GPU inference microservice.
+
+Runs on the AMD MI300X droplet alongside vLLM, exposes one HTTP
+endpoint per model class consumed by the Riprap FastAPI app's
+specialists. The local app routes through this service when
+RIPRAP_ML_BACKEND=remote (or =auto with the service reachable),
+keeping all GPU-accelerable forward passes on the MI300X — Granite
+4.1 (LLM), Prithvi-NYC-Pluvial (segmentation), TerraMind LULC +
+Buildings + Synthesis (LoRA), Granite TTM r2 (forecasts), Granite
+Embedding 278M (RAG), and GLiNER (typed extraction).
+
+Authoritative bearer-token auth same as vLLM. Same env-var shape so
+the same secret can be reused across both services on a Space.
+
+Service contract (mirrors app/inference.py):
+
+    GET  /healthz            → {ok: true, models_loaded: [...]}
+    POST /v1/prithvi-pluvial → see _prithvi_pluvial below
+    POST /v1/terramind       → adapter dispatch (lulc/buildings/synth)
+    POST /v1/ttm-forecast    → model dispatch (zero_shot_battery, ...)
+    POST /v1/granite-embed   → batch text → 768-d vectors
+    POST /v1/gliner-extract  → text + labels → typed entities
+
+Model loading is lazy + cached per-process. The first call to a given
+model pays the cold-load cost (~5-30 s); subsequent calls reuse the
+in-memory instance. ROCm device binding goes through torch's CUDA
+shim — `cuda` is the ROCm device when running on a ROCm-built torch.
+"""
+from __future__ import annotations
+
+import base64
+import logging
+import os
+import threading
+import time
+from contextlib import asynccontextmanager
+from typing import Any
+
+import numpy as np
+from fastapi import Depends, FastAPI, HTTPException, Header
+from pydantic import BaseModel
+
+log = logging.getLogger("riprap.models")
+logging.basicConfig(
+    level=os.environ.get("RIPRAP_MODELS_LOG", "INFO").upper(),
+    format="%(asctime)s %(levelname)-5s %(name)s: %(message)s",
+)
+
+# Auth — same shape as vLLM. Set RIPRAP_MODELS_API_KEY in the
+# `docker run` env. When empty, the service runs unauthenticated
+# (only sane for localhost-only deployments).
+_AUTH_TOKEN = os.environ.get("RIPRAP_MODELS_API_KEY", "")
+
+# Device. ROCm-built torch reports CUDA-style symbols; "cuda" maps to
+# the first ROCm device on the MI300X.
+_DEVICE = os.environ.get("RIPRAP_MODELS_DEVICE", "cuda")
+
+
+def _require_auth(authorization: str | None = Header(default=None)) -> None:
+    if not _AUTH_TOKEN:
+        return
+    if not authorization or not authorization.startswith("Bearer "):
+        raise HTTPException(status_code=401, detail="Missing bearer token")
+    if authorization[7:].strip() != _AUTH_TOKEN:
+        raise HTTPException(status_code=401, detail="Invalid bearer token")
+
+
+# ---- Lazy model singletons --------------------------------------------------
+#
+# Each model has a `_load_<name>()` that returns the in-memory instance
+# (locking on a per-model threading.Lock so concurrent first-call
+# requests don't double-load). Callers grab via `_get_<name>()`.
+
+_LOCKS = {
+    "prithvi": threading.Lock(),
+    "terramind_lulc": threading.Lock(),
+    "terramind_buildings": threading.Lock(),
+    "terramind_synth": threading.Lock(),
+    "ttm": threading.Lock(),
+    "granite_embed": threading.Lock(),
+    "gliner": threading.Lock(),
+}
+_INSTANCES: dict[str, Any] = {}
+
+
+def _decode_array(b64: str, shape: list[int], dtype: str = "float32") -> np.ndarray:
+    raw = base64.b64decode(b64)
+    return np.frombuffer(raw, dtype=dtype).reshape(shape)
+
+
+def _to_device(t):
+    """Move a torch tensor to the configured device. No-op for CPU."""
+    if _DEVICE == "cpu":
+        return t
+    try:
+        import torch
+        if torch.cuda.is_available():
+            return t.to("cuda")
+    except Exception as e:
+        log.warning("device move skipped: %s", e)
+    return t
+
+
+# ---- Prithvi-NYC-Pluvial v2 -------------------------------------------------
+
+def _load_prithvi():
+    if "prithvi" in _INSTANCES:
+        return _INSTANCES["prithvi"]
+    with _LOCKS["prithvi"]:
+        if "prithvi" in _INSTANCES:
+            return _INSTANCES["prithvi"]
+        log.info("prithvi: cold load (msradam/Prithvi-EO-2.0-NYC-Pluvial)")
+        import importlib.util
+
+        from huggingface_hub import hf_hub_download
+        from terratorch.cli_tools import LightningInferenceModel
+
+        BASE_REPO = "ibm-nasa-geospatial/Prithvi-EO-2.0-300M-TL-Sen1Floods11"
+        V2_REPO = "msradam/Prithvi-EO-2.0-NYC-Pluvial"
+
+        # Use the IBM-NASA base config + v2 ckpt. Mirrors
+        # app/flood_layers/prithvi_live.py:_ensure_model().
+        base_config = hf_hub_download(BASE_REPO, "config.yaml")
+        inference_py = hf_hub_download(BASE_REPO, "inference.py")
+
+        v2_yaml = None
+        v2_ckpt = None
+        for name in ("prithvi_nyc_phase14.yaml", "config.yaml"):
+            try:
+                v2_yaml = hf_hub_download(V2_REPO, name); break
+            except Exception:
+                continue
+        for name in ("prithvi_nyc_pluvial_v2.ckpt", "best_val_loss.ckpt", "model.ckpt"):
+            try:
+                v2_ckpt = hf_hub_download(V2_REPO, name); break
+            except Exception:
+                continue
+        if v2_yaml and v2_ckpt:
+            log.info("prithvi: building from v2 yaml=%s ckpt=%s", v2_yaml, v2_ckpt)
+            m = LightningInferenceModel.from_config(v2_yaml, v2_ckpt)
+        else:
+            log.info("prithvi: v2 unavailable, falling back to base")
+            base_ckpt = hf_hub_download(
+                BASE_REPO, "Prithvi-EO-V2-300M-TL-Sen1Floods11.pt")
+            m = LightningInferenceModel.from_config(base_config, base_ckpt)
+        m.model.eval()
+        try:
+            import torch
+            if _DEVICE == "cuda" and torch.cuda.is_available():
+                m.model.cuda()
+        except Exception:
+            log.exception("prithvi: cuda move failed; staying on cpu")
+
+        spec = importlib.util.spec_from_file_location("_prithvi_inference",
+                                                      inference_py)
+        mod = importlib.util.module_from_spec(spec)
+        spec.loader.exec_module(mod)
+        _INSTANCES["prithvi"] = (m, mod.run_model)
+        log.info("prithvi: ready")
+    return _INSTANCES["prithvi"]
+
+
+class PrithviIn(BaseModel):
+    s2: str
+    shape: list[int]
+    scene_id: str | None = None
+    scene_datetime: str | None = None
+    cloud_cover: float | None = None
+
+
+def _prithvi_pluvial(payload: PrithviIn) -> dict[str, Any]:
+    t0 = time.time()
+    m, run_model = _load_prithvi()
+    chip = _decode_array(payload.s2, payload.shape, "float32")
+    # Sen1Floods11 expects [1, 6, 1, H, W]
+    if chip.ndim == 3:
+        chip = chip[None, :, None, :, :]
+    pred_t = run_model(chip, None, None, m.model, m.datamodule, chip.shape[-1])
+    pred = pred_t[0].cpu().numpy().astype("uint8")
+    pct_full = float(100.0 * pred.mean())
+    # Center-disk fraction (500 m at 10 m/px → 50 px radius from chip center).
+    h, w = pred.shape
+    yy, xx = np.indices(pred.shape)
+    cy, cx = h // 2, w // 2
+    dist = np.sqrt((yy - cy) ** 2 + (xx - cx) ** 2)
+    mask = dist <= min(50, min(h, w) // 4)
+    pct_500m = float(100.0 * pred[mask].mean()) if mask.any() else pct_full
+    return {
+        "ok": True,
+        "elapsed_s": round(time.time() - t0, 2),
+        "device": _DEVICE,
+        "pct_water_within_500m": round(pct_500m, 3),
+        "pct_water_full": round(pct_full, 3),
+        "scene_id": payload.scene_id,
+        "scene_datetime": payload.scene_datetime,
+        "cloud_cover": payload.cloud_cover,
+        "shape": [int(h), int(w)],
+    }
+
+
+# ---- TerraMind (lulc / buildings / synthesis) -------------------------------
+
+_TERRAMIND_REPO = "msradam/TerraMind-NYC-Adapters"
+_TERRAMIND_SPECS = {
+    "lulc": {"subdir": "lulc_nyc", "num_classes": 5,
+             "labels": ["Trees", "Cropland", "Built", "Bare", "Water"]},
+    "buildings": {"subdir": "buildings_nyc", "num_classes": 2,
+                  "labels": ["Background", "Building"]},
+}
+
+
+def _load_terramind(adapter: str):
+    key = f"terramind_{adapter}"
+    if key in _INSTANCES:
+        return _INSTANCES[key]
+    with _LOCKS.get(key, _LOCKS.get("terramind_lulc")):
+        if key in _INSTANCES:
+            return _INSTANCES[key]
+        log.info("terramind/%s: cold load", adapter)
+        from huggingface_hub import snapshot_download
+        from peft import LoraConfig, inject_adapter_in_model
+        from safetensors.torch import load_file
+        from terratorch.tasks import SemanticSegmentationTask
+
+        spec = _TERRAMIND_SPECS[adapter]
+        adapter_root = snapshot_download(
+            _TERRAMIND_REPO, allow_patterns=[f"{spec['subdir']}/*"])
+        task = SemanticSegmentationTask(
+            model_factory="EncoderDecoderFactory",
+            model_args=dict(
+                backbone="terramind_v1_base",
+                backbone_pretrained=True,
+                backbone_modalities=["S2L2A", "S1RTC", "DEM"],
+                backbone_use_temporal=True,
+                backbone_temporal_pooling="concat",
+                backbone_temporal_n_timestamps=4,
+                necks=[
+                    {"name": "SelectIndices", "indices": [2, 5, 8, 11]},
+                    {"name": "ReshapeTokensToImage", "remove_cls_token": False},
+                    {"name": "LearnedInterpolateToPyramidal"},
+                ],
+                decoder="UNetDecoder",
+                decoder_channels=[512, 256, 128, 64],
+                head_dropout=0.1,
+                num_classes=spec["num_classes"],
+            ),
+            loss="ce", lr=1e-4, freeze_backbone=False, freeze_decoder=False,
+        )
+        inject_adapter_in_model(LoraConfig(
+            r=16, lora_alpha=32, lora_dropout=0.05,
+            target_modules=["attn.qkv", "attn.proj"], bias="none",
+        ), task.model.encoder)
+        adapter_dir = f"{adapter_root}/{spec['subdir']}"
+        lora = load_file(f"{adapter_dir}/adapter_model.safetensors")
+        head = load_file(f"{adapter_dir}/decoder_head.safetensors")
+        task.model.encoder.load_state_dict(
+            {k.removeprefix("encoder."): v for k, v in lora.items()
+             if k.startswith("encoder.")}, strict=False)
+        for sub in ("decoder", "neck", "head", "aux_heads"):
+            ss = {k[len(sub) + 1:]: v for k, v in head.items()
+                  if k.startswith(sub + ".")}
+            if ss and hasattr(task.model, sub):
+                getattr(task.model, sub).load_state_dict(ss, strict=False)
+        try:
+            import torch
+            if _DEVICE == "cuda" and torch.cuda.is_available():
+                task = task.to("cuda")
+        except Exception:
+            log.exception("terramind: cuda move failed")
+        task.eval()
+        _INSTANCES[key] = task
+        log.info("terramind/%s: ready", adapter)
+    return task
+
+
+class TerramindIn(BaseModel):
+    adapter: str  # "lulc" | "buildings" | "synthesis"
+    s2: str
+    s2_shape: list[int]
+    s1: str | None = None
+    s1_shape: list[int] | None = None
+    dem: str | None = None
+    dem_shape: list[int] | None = None
+
+
+def _build_chip_tensor(np_arr, n_timesteps: int = 4):
+    import torch
+    t = torch.from_numpy(np_arr).float().unsqueeze(1)  # add T dim
+    if t.shape[1] == 1:
+        t = t.repeat(1, n_timesteps, 1, 1)
+    return t.unsqueeze(0)  # add batch
+
+
+def _terramind_inference(payload: TerramindIn) -> dict[str, Any]:
+    t0 = time.time()
+    if payload.adapter not in _TERRAMIND_SPECS:
+        raise HTTPException(status_code=400,
+                            detail=f"unknown adapter {payload.adapter!r}")
+    task = _load_terramind(payload.adapter)
+    spec = _TERRAMIND_SPECS[payload.adapter]
+
+    s2 = _decode_array(payload.s2, payload.s2_shape)
+    chips = {"S2L2A": _to_device(_build_chip_tensor(s2))}
+    if payload.s1 and payload.s1_shape:
+        s1 = _decode_array(payload.s1, payload.s1_shape)
+        chips["S1RTC"] = _to_device(_build_chip_tensor(s1))
+    if payload.dem and payload.dem_shape:
+        dem = _decode_array(payload.dem, payload.dem_shape)
+        chips["DEM"] = _to_device(_build_chip_tensor(dem))
+
+    import torch
+    from terratorch.tasks.tiled_inference import tiled_inference
+
+    def _forward(x, **_extra):
+        out = task.model(x)
+        return out.output if hasattr(out, "output") else out
+    with torch.no_grad():
+        logits = tiled_inference(
+            _forward, chips, out_channels=spec["num_classes"],
+            h_crop=224, w_crop=224, h_stride=128, w_stride=128,
+            average_patches=True, blend_overlaps=True, padding="reflect",
+        )
+    pred = logits.argmax(dim=1).squeeze(0).cpu().numpy().astype("uint8")
+    n = max(int(pred.size), 1)
+    fractions = {
+        spec["labels"][i]: round(100.0 * float((pred == i).sum()) / n, 2)
+        for i in range(spec["num_classes"])
+    }
+    fractions = {k: v for k, v in fractions.items() if v > 0}
+    dom_idx = int(max(range(spec["num_classes"]),
+                      key=lambda i: int((pred == i).sum())))
+    return {
+        "ok": True,
+        "adapter": payload.adapter,
+        "elapsed_s": round(time.time() - t0, 2),
+        "device": _DEVICE,
+        "shape": list(pred.shape),
+        "n_pixels": int(pred.size),
+        "class_fractions": fractions,
+        "dominant_class": spec["labels"][dom_idx],
+        "dominant_pct": fractions.get(spec["labels"][dom_idx], 0.0),
+        # Buildings-specific stat (NaN-safe; 0 when not the buildings adapter).
+        "pct_buildings": round(100.0 * float((pred == 1).sum()) / n, 2)
+        if payload.adapter == "buildings" else None,
+    }
+
+
+# ---- Granite TTM r2 ---------------------------------------------------------
+
+_TTM_MODELS = {
+    "zero_shot_battery": "ibm-granite/granite-timeseries-ttm-r2",
+    "fine_tune_battery": "msradam/Granite-TTM-r2-Battery-Surge",
+    "weekly_311": "ibm-granite/granite-timeseries-ttm-r2",
+    "floodnet_recurrence": "ibm-granite/granite-timeseries-ttm-r2",
+}
+
+
+def _load_ttm(model_key: str):
+    key = f"ttm:{model_key}"
+    if key in _INSTANCES:
+        return _INSTANCES[key]
+    with _LOCKS["ttm"]:
+        if key in _INSTANCES:
+            return _INSTANCES[key]
+        log.info("ttm/%s: cold load", model_key)
+        if model_key == "fine_tune_battery":
+            from huggingface_hub import snapshot_download
+            from tsfm_public import TinyTimeMixerForPrediction
+            local_dir = snapshot_download(_TTM_MODELS[model_key])
+            m = TinyTimeMixerForPrediction.from_pretrained(local_dir).eval()
+        else:
+            from tsfm_public.toolkit.get_model import get_model
+            # Caller passes (context_length, prediction_length) — for the
+            # zero-shot & 311 & FloodNet specialists we let the toolkit
+            # pick the best matching pretrained config. Cache one per
+            # model_key to avoid duplicate loads.
+            m = get_model(_TTM_MODELS[model_key],
+                          context_length=512, prediction_length=96).eval()
+        try:
+            import torch
+            if _DEVICE == "cuda" and torch.cuda.is_available():
+                m = m.to("cuda")
+        except Exception:
+            log.exception("ttm: cuda move failed")
+        _INSTANCES[key] = m
+        log.info("ttm/%s: ready", model_key)
+    return m
+
+
+class TtmIn(BaseModel):
+    model: str  # zero_shot_battery | fine_tune_battery | weekly_311 | floodnet_recurrence
+    history: list[float]
+    context_length: int
+    prediction_length: int
+    cadence: str = "h"
+
+
+def _ttm_forecast(payload: TtmIn) -> dict[str, Any]:
+    t0 = time.time()
+    if payload.model not in _TTM_MODELS:
+        raise HTTPException(status_code=400,
+                            detail=f"unknown model {payload.model!r}")
+    m = _load_ttm(payload.model)
+    import torch
+    series = np.array(payload.history, dtype="float32")
+    if len(series) < payload.context_length:
+        # Front-pad with the leading value so the model gets the right
+        # shape — caller-side fills are NaN-clean already, so this only
+        # extends a series whose history is shorter than context.
+        pad = np.full(payload.context_length - len(series), series[0]
+                      if len(series) else 0.0, dtype="float32")
+        series = np.concatenate([pad, series])
+    series = series[-payload.context_length:]
+    x = torch.from_numpy(series).float().unsqueeze(0).unsqueeze(-1)
+    x = _to_device(x)
+    with torch.no_grad():
+        out = m(past_values=x)
+    fc = out.prediction_outputs.squeeze(-1).squeeze(0).cpu().numpy()
+    peak_idx = int(np.argmax(np.abs(fc)))
+    return {
+        "ok": True,
+        "model": payload.model,
+        "elapsed_s": round(time.time() - t0, 2),
+        "device": _DEVICE,
+        "context_length": payload.context_length,
+        "prediction_length": payload.prediction_length,
+        "cadence": payload.cadence,
+        "forecast": [round(float(v), 6) for v in fc.tolist()],
+        "peak_index": peak_idx,
+        "peak_value": round(float(fc[peak_idx]), 6),
+    }
+
+
+# ---- Granite Embedding 278M -------------------------------------------------
+
+_EMBED_REPO = "ibm-granite/granite-embedding-278m-multilingual"
+
+
+def _load_embed():
+    if "granite_embed" in _INSTANCES:
+        return _INSTANCES["granite_embed"]
+    with _LOCKS["granite_embed"]:
+        if "granite_embed" in _INSTANCES:
+            return _INSTANCES["granite_embed"]
+        log.info("granite-embed: cold load")
+        from sentence_transformers import SentenceTransformer
+        m = SentenceTransformer(_EMBED_REPO,
+                                device="cuda" if _DEVICE == "cuda" else "cpu")
+        _INSTANCES["granite_embed"] = m
+        log.info("granite-embed: ready")
+    return m
+
+
+class EmbedIn(BaseModel):
+    texts: list[str]
+
+
+def _granite_embed(payload: EmbedIn) -> dict[str, Any]:
+    t0 = time.time()
+    m = _load_embed()
+    vecs = m.encode(payload.texts, normalize_embeddings=True,
+                    show_progress_bar=False)
+    return {
+        "ok": True,
+        "elapsed_s": round(time.time() - t0, 2),
+        "device": _DEVICE,
+        "n": len(payload.texts),
+        "dim": int(vecs.shape[-1]) if hasattr(vecs, "shape") else len(vecs[0]),
+        "vectors": [list(map(float, v)) for v in vecs],
+    }
+
+
+# ---- GLiNER ----------------------------------------------------------------
+
+_GLINER_REPO = "urchade/gliner_medium-v2.1"
+
+
+def _load_gliner():
+    if "gliner" in _INSTANCES:
+        return _INSTANCES["gliner"]
+    with _LOCKS["gliner"]:
+        if "gliner" in _INSTANCES:
+            return _INSTANCES["gliner"]
+        log.info("gliner: cold load")
+        from gliner import GLiNER
+        m = GLiNER.from_pretrained(_GLINER_REPO)
+        try:
+            import torch
+            if _DEVICE == "cuda" and torch.cuda.is_available():
+                m = m.to("cuda")
+        except Exception:
+            log.exception("gliner: cuda move failed")
+        _INSTANCES["gliner"] = m
+        log.info("gliner: ready")
+    return m
+
+
+class GlinerIn(BaseModel):
+    text: str
+    labels: list[str]
+
+
+def _gliner_extract(payload: GlinerIn) -> dict[str, Any]:
+    t0 = time.time()
+    m = _load_gliner()
+    ents = m.predict_entities(payload.text, payload.labels)
+    return {
+        "ok": True,
+        "elapsed_s": round(time.time() - t0, 2),
+        "device": _DEVICE,
+        "entities": [
+            {"label": e["label"], "text": e["text"],
+             "start": int(e.get("start", 0)), "end": int(e.get("end", 0)),
+             "score": float(e.get("score", 0))}
+            for e in ents
+        ],
+    }
+
+
+# ---- FastAPI app ------------------------------------------------------------
+
+@asynccontextmanager
+async def lifespan(_app: FastAPI):
+    log.info("riprap-models starting on device=%s auth=%s",
+             _DEVICE, "yes" if _AUTH_TOKEN else "no")
+    yield
+    log.info("riprap-models stopping")
+
+
+app = FastAPI(title="riprap-models", version="0.4.5", lifespan=lifespan)
+
+
+@app.get("/healthz")
+def healthz():
+    return {"ok": True, "device": _DEVICE,
+            "models_loaded": sorted(_INSTANCES.keys())}
+
+
+@app.post("/v1/prithvi-pluvial", dependencies=[Depends(_require_auth)])
+def prithvi_pluvial_route(payload: PrithviIn):
+    return _prithvi_pluvial(payload)
+
+
+@app.post("/v1/terramind", dependencies=[Depends(_require_auth)])
+def terramind_route(payload: TerramindIn):
+    return _terramind_inference(payload)
+
+
+@app.post("/v1/ttm-forecast", dependencies=[Depends(_require_auth)])
+def ttm_forecast_route(payload: TtmIn):
+    return _ttm_forecast(payload)
+
+
+@app.post("/v1/granite-embed", dependencies=[Depends(_require_auth)])
+def granite_embed_route(payload: EmbedIn):
+    return _granite_embed(payload)
+
+
+@app.post("/v1/gliner-extract", dependencies=[Depends(_require_auth)])
+def gliner_extract_route(payload: GlinerIn):
+    return _gliner_extract(payload)
services/riprap-models/requirements.txt ADDED
@@ -0,0 +1,12 @@
+# Riprap Models — droplet inference service.
+#
+# Most heavy deps (torch+ROCm, terratorch, granite-tsfm, transformers,
+# peft, safetensors, fastapi, uvicorn) are already in the `terramind`
+# container's image. This list is only the deltas the service needs
+# beyond that base — install with:
+#
+#   docker exec terramind pip install -r /workspace/riprap-models/requirements.txt
+fastapi-cli >= 0.0.5
+gliner >= 0.2.6
+sentence-transformers >= 5.0.0
+huggingface_hub >= 0.34