seriffic and Claude Opus 4.7 (1M context) committed
Commit fee1c30 · 1 parent: 7cb5930

fix: thread Mellea attempt index + diagnose riprap-models 500s


Two regressions surfaced during the May 9 demo. Fix both.

1. Mellea reroll text concatenated in the briefing prose
============================================================
The strict-streaming reconcile path on a Mellea reroll fired
token events for the new attempt without telling the frontend
the attempt index changed. Result: the SvelteKit briefing
buffer never reset between attempts, so attempt 2 + attempt 3
text appended onto the already-rendered attempt 1 text.

Root cause: app/fsm.py:step_reconcile installed a token
forwarder that explicitly dropped the attempt_idx
(`lambda d, _ai: token_cb(d)`), and
app/intents/single_address.py:_on_token took only `delta`. So
the SSE `token` events carried no `attempt` field, and
web/sveltekit/src/lib/client/agentStream.ts:onAttemptStart
never fired: `d.attempt` was always `undefined`, so
`d.attempt !== currentAttempt` never evaluated true.

Fix:
- fsm.py: forward (delta, attempt_idx) to token_cb. Probe the
  2-arg call first and fall back to the 1-arg call on TypeError so
  legacy callbacks in non-strict reconcilers keep working.
- single_address.py: _on_token(delta, attempt_idx=0) emits
`{kind: token, delta, attempt: attempt_idx + 1}` so the
client gets the 1-based attempt counter it already expects
(neighborhood.py + development_check.py already do this).
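The arity fallback can be sketched in isolation (a minimal illustration; `make_forwarder` and the lambda consumers are hypothetical stand-ins, not code from this commit):

```python
def make_forwarder(token_cb):
    """Forward (delta, attempt_idx) to token_cb, falling back to the
    legacy 1-arg call when the callback rejects the second argument."""
    def _fwd(delta: str, attempt_idx: int) -> None:
        if token_cb is None:
            return
        try:
            token_cb(delta, attempt_idx)   # new 2-arg shape
        except TypeError:
            token_cb(delta)                # legacy 1-arg shape
    return _fwd

# Legacy consumer: only sees the delta.
seen_legacy = []
make_forwarder(lambda d: seen_legacy.append(d))("hello", 2)

# New consumer: also sees the attempt index.
seen_new = []
make_forwarder(lambda d, ai: seen_new.append((d, ai)))("hello", 2)
```

One trade-off worth noting: a genuine TypeError raised *inside* a 2-arg callback would also trigger the fallback and invoke the callback twice; the probe accepts that in exchange for not breaking legacy call sites.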

2. terramind / prithvi-pluvial 500s with no diagnostic detail
============================================================
The riprap-models routes raised through to FastAPI's default
handler, returning the opaque body
`{"detail": "Internal Server Error"}`. The lablab UI's
inference._post then surfaced this as
`remote terramind/lulc unreachable: HTTP 500 from /v1/terramind`,
correct, but with no actionable detail about what had failed
inside the model service.

services/riprap-models/main.py:
- New _safe_route() wrapper: returns
`{"ok": False, "err": "<type>: <msg>", "stage": "<endpoint>"}`
with HTTP 200 instead of 500. The proxy on :7860 forwards
this body untouched so the FSM trace card now reads, e.g.,
`remote terramind/lulc non-ok: torch.cuda.OutOfMemoryError: ...`
instead of a generic Internal Server Error.
- Lifespan startup warms every heavy model (Prithvi, all three
TerraMind paths, GLiNER, Granite Embedding) before traffic is
accepted, so the first user query doesn't compete with
vLLM's CUDA-graph compile for memory bandwidth. Best-effort
per stage; failures are recorded into _LAST_ERR and do not
block startup.
- New GET /v1/diag (auth-required) snapshots loaded models,
CUDA memory state per device, and last-error per stage with
a 3-line traceback tail. Operators can hit it from outside
the Space without grepping container logs.
- /healthz also exposes last_errors.

prithvi_live.py: surfaced-error matcher now checks `err`,
`error`, and `skipped` so the new wrapped 200-with-body shape
propagates cleanly into the trace.
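The matcher change amounts to key coalescing; a minimal sketch (the function name is illustrative, not the actual helper in prithvi_live.py):

```python
def remote_error(remote: dict) -> str:
    # Prefer the new wrapped shape's "err", then the legacy "error"
    # and "skipped" keys, so both old and new riprap-models bodies
    # surface something readable in the trace.
    return (remote.get("err")
            or remote.get("error")
            or remote.get("skipped")
            or "unknown")
```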

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

app/flood_layers/prithvi_live.py CHANGED
@@ -445,9 +445,12 @@ def _fetch_inner(lat: float, lon: float, timeout_s: float) -> dict[str, Any]:
             "compute": f"remote · {remote.get('device', 'gpu')}",
             "elapsed_s": round(time.time() - t0, 2),
         }
+    err = (remote.get("err")
+           or remote.get("error")
+           or remote.get("skipped")
+           or "unknown")
     return {"ok": False,
-            "skipped": f"remote prithvi-pluvial non-ok: "
-                       f"{remote.get('error') or 'unknown'}",
+            "skipped": f"remote prithvi-pluvial non-ok: {err}",
             "elapsed_s": round(time.time() - t0, 2)}
 except _inf.RemoteUnreachable as e:
     log.info("prithvi_live: remote unreachable (%s)", e)
app/fsm.py CHANGED
@@ -1015,11 +1015,24 @@ def step_reconcile(state: State) -> State:
         query=_current_user_query() or state.get("query") or "",
         intent=_current_planner_intent() or "single_address",
     )
+    # Forward the (delta, attempt_idx) pair through. Older
+    # token_cb signatures were single-arg; we probe with the
+    # 2-arg call and fall back to 1-arg on TypeError, so
+    # single_address.py's old shape still works while new
+    # callbacks see the attempt index they need to clear the
+    # frontend buffer on a Mellea reroll.
+    def _fwd_token(delta: str, attempt_idx: int) -> None:
+        if token_cb is None:
+            return
+        try:
+            token_cb(delta, attempt_idx)
+        except TypeError:
+            token_cb(delta)
     mres = reconcile_strict_streaming(
         doc_msgs, framed_prompt,
         user_prompt="Write the cited paragraph now.",
         loop_budget=DEFAULT_LOOP_BUDGET,
-        on_token=(lambda d, _ai: token_cb(d)) if token_cb else None,
+        on_token=_fwd_token if token_cb else None,
         on_attempt_end=attempt_cb,
     )
     para = mres["paragraph"]
app/intents/single_address.py CHANGED
@@ -51,8 +51,15 @@ def run(plan, query: str, progress_q=None, strict: bool = False) -> dict:
     set_user_query(query)
     set_planner_intent(plan.intent)
     if progress_q is not None:
-        def _on_token(delta: str):
-            progress_q.put({"kind": "token", "delta": delta})
+        def _on_token(delta: str, attempt_idx: int = 0):
+            # `attempt_idx` is the 0-based Mellea reroll index. The
+            # SvelteKit client treats a change in this value as a
+            # signal to clear the live briefing buffer (per
+            # web/sveltekit/src/lib/client/agentStream.ts:onAttemptStart).
+            # We surface it as a 1-based attempt counter so the chip
+            # in the UI reads "attempt N" naturally.
+            progress_q.put({"kind": "token", "delta": delta,
+                            "attempt": attempt_idx + 1})
         def _on_mellea_attempt(attempt_idx, passed, failed):
             progress_q.put({"kind": "mellea_attempt",
                             "attempt": attempt_idx,
services/riprap-models/main.py CHANGED
@@ -707,43 +707,124 @@ def _gliner_extract(payload: GlinerIn) -> dict[str, Any]:
 
 # ---- FastAPI app ------------------------------------------------------------
 
+# Last error per route, kept on the in-memory map so /v1/diag can
+# expose it without forcing the operator to grep container logs.
+_LAST_ERR: dict[str, dict[str, Any]] = {}
+
+
+def _safe_route(stage: str, fn, payload):
+    """Wrap a route body so an uncaught exception becomes a structured
+    `{"ok": False, "err": "...", "stage": "..."}` JSON response with
+    HTTP 200 instead of FastAPI's opaque "Internal Server Error" body.
+
+    The proxy on :7860 forwards this body untouched, so the FSM
+    specialist surfaces the real reason in the trace card. Logs the
+    full traceback to stderr so operators can still root-cause from
+    the Space's runtime logs."""
+    try:
+        return fn(payload)
+    except HTTPException:
+        raise
+    except Exception as e:  # noqa: BLE001
+        import traceback
+        tb = traceback.format_exc()
+        log.error("route %s failed: %s\n%s", stage, e, tb)
+        info = {
+            "ok": False,
+            "err": f"{type(e).__name__}: {e}",
+            "stage": stage,
+            "ts": time.time(),
+        }
+        _LAST_ERR[stage] = {**info, "traceback_tail": tb.splitlines()[-3:]}
+        return info
+
+
 @asynccontextmanager
 async def lifespan(_app: FastAPI):
     log.info("riprap-models starting on device=%s auth=%s",
              _DEVICE, "yes" if _AUTH_TOKEN else "no")
+    # Pre-load the heavy models so the first user request doesn't
+    # collide with a cold-load on the same GPU as vLLM. Each warm
+    # is best-effort: a single model failing must not block the
+    # service from starting (others may still serve).
+    if os.environ.get("RIPRAP_MODELS_WARM_AT_STARTUP", "1").lower() in ("1", "true", "yes"):
+        for stage, fn in (
+            ("warm/prithvi", _load_prithvi),
+            ("warm/terramind_synthesis", _load_terramind_synthesis),
+            ("warm/terramind_lulc", lambda: _load_terramind("lulc")),
+            ("warm/terramind_buildings", lambda: _load_terramind("buildings")),
+            ("warm/embed", _load_embed),
+            ("warm/gliner", _load_gliner),
+        ):
+            try:
+                fn()
+                log.info("startup %s ok", stage)
+            except Exception as e:  # noqa: BLE001
+                log.exception("startup %s failed: %s", stage, e)
+                _LAST_ERR[stage] = {"ok": False,
+                                    "err": f"{type(e).__name__}: {e}",
+                                    "stage": stage}
     yield
     log.info("riprap-models stopping")
 
 
-app = FastAPI(title="riprap-models", version="0.4.5", lifespan=lifespan)
+app = FastAPI(title="riprap-models", version="0.5.1", lifespan=lifespan)
 
 
 @app.get("/healthz")
 def healthz():
     return {"ok": True, "device": _DEVICE,
-            "models_loaded": sorted(_INSTANCES.keys())}
+            "models_loaded": sorted(_INSTANCES.keys()),
+            "last_errors": _LAST_ERR}
+
+
+@app.get("/v1/diag", dependencies=[Depends(_require_auth)])
+def diag():
+    """Operator-only diagnostic snapshot — what's loaded, last
+    per-stage error (with a 3-line traceback tail), and CUDA
+    visibility. The proxy forwards this through the catch-all so
+    operators can hit it from outside the Space."""
+    cuda = {"available": False, "devices": []}
+    try:
+        import torch
+        cuda["available"] = bool(torch.cuda.is_available())
+        if cuda["available"]:
+            cuda["devices"] = [{
+                "name": torch.cuda.get_device_name(i),
+                "mem_total_mb": torch.cuda.get_device_properties(i).total_memory // (1024 * 1024),
+                "mem_alloc_mb": torch.cuda.memory_allocated(i) // (1024 * 1024),
+            } for i in range(torch.cuda.device_count())]
+    except Exception as e:  # noqa: BLE001
+        cuda["err"] = f"{type(e).__name__}: {e}"
+    return {
+        "device": _DEVICE,
+        "models_loaded": sorted(_INSTANCES.keys()),
+        "last_errors": _LAST_ERR,
+        "cuda": cuda,
+    }
 
 
 @app.post("/v1/prithvi-pluvial", dependencies=[Depends(_require_auth)])
 def prithvi_pluvial_route(payload: PrithviIn):
-    return _prithvi_pluvial(payload)
+    return _safe_route("prithvi-pluvial", _prithvi_pluvial, payload)
 
 
 @app.post("/v1/terramind", dependencies=[Depends(_require_auth)])
 def terramind_route(payload: TerramindIn):
-    return _terramind_inference(payload)
+    return _safe_route(f"terramind/{payload.adapter}",
+                       _terramind_inference, payload)
 
 
 @app.post("/v1/ttm-forecast", dependencies=[Depends(_require_auth)])
 def ttm_forecast_route(payload: TtmIn):
-    return _ttm_forecast(payload)
+    return _safe_route("ttm-forecast", _ttm_forecast, payload)
 
 
 @app.post("/v1/granite-embed", dependencies=[Depends(_require_auth)])
 def granite_embed_route(payload: EmbedIn):
-    return _granite_embed(payload)
+    return _safe_route("granite-embed", _granite_embed, payload)
 
 
 @app.post("/v1/gliner-extract", dependencies=[Depends(_require_auth)])
 def gliner_extract_route(payload: GlinerIn):
-    return _gliner_extract(payload)
+    return _safe_route("gliner-extract", _gliner_extract, payload)