fix(emissions): default hardware to NVIDIA L4
The MI300X droplet was retired 2026-05-06; both inference Spaces
(msradam/riprap-vllm for Granite 4.1 8B FP8 + msradam/riprap-inference
for Prithvi / TerraMind / TTM / GLiNER / Embedding) now run on NVIDIA
L4 (24 GB, Ada Lovelace, 72 W TGP). Updates:
- app/emissions.py: HARDWARE adds nvidia_l4 (~60 W sustained per the
L4 data sheet) and reorders so it's the first/canonical entry. The
MI300X entry stays for operators who redeploy to that hardware and
set RIPRAP_HARDWARE_LABEL=AMD MI300X explicitly.
- app/llm.py:_hardware_for: when RIPRAP_LLM_BASE_URL is set (any
remote vLLM/Ollama backend), default to nvidia_l4 instead of MI300X.
RIPRAP_HARDWARE_LABEL override matrix expanded for l4 / t4 / mi300x.
- app/inference.py:_post: record nvidia_l4 by default; honor MI300X /
T4 overrides via RIPRAP_HARDWARE_LABEL.
Net effect on the briefing: a typical query (~700 LLM tokens + a few
ML calls on L4) now reports ~50-80 mWh instead of the inflated
600 mWh that the MI300X profile produced.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- app/emissions.py +20 -4
- app/inference.py +14 -5
- app/llm.py +16 -5
|
@@ -30,24 +30,40 @@ from typing import Any
|
|
| 30 |
|
| 31 |
# (label, sustained_power_w, source)
|
| 32 |
HARDWARE: dict[str, tuple[str, float, str]] = {
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 33 |
"amd_mi300x": (
|
| 34 |
"AMD MI300X",
|
| 35 |
600.0,
|
| 36 |
"AMD Instinct MI300X data sheet (750 W TDP); ~600 W sustained "
|
| 37 |
-
"during vLLM generation
|
| 38 |
-
"
|
|
|
|
|
|
|
|
|
|
| 39 |
),
|
| 40 |
"nvidia_t4": (
|
| 41 |
"NVIDIA T4",
|
| 42 |
50.0,
|
| 43 |
"NVIDIA T4 data sheet (70 W max); ~50 W sustained during "
|
| 44 |
-
"transformer inference."
|
|
|
|
|
|
|
| 45 |
),
|
| 46 |
"apple_m": (
|
| 47 |
"Apple M-series",
|
| 48 |
20.0,
|
| 49 |
"ml.energy / community measurements: ~20 W package power "
|
| 50 |
-
"during Granite 4.1 q4_K_M inference on Apple M3/M4
|
|
|
|
| 51 |
),
|
| 52 |
"cpu_server": (
|
| 53 |
"x86 CPU",
|
|
|
|
| 30 |
|
| 31 |
# (label, sustained_power_w, source)
|
| 32 |
HARDWARE: dict[str, tuple[str, float, str]] = {
|
| 33 |
+
"nvidia_l4": (
|
| 34 |
+
"NVIDIA L4",
|
| 35 |
+
60.0,
|
| 36 |
+
"NVIDIA L4 Tensor Core GPU data sheet (72 W TGP, Ada Lovelace, "
|
| 37 |
+
"24 GB); ~60 W sustained during transformer inference. The "
|
| 38 |
+
"active backend for both Riprap inference Spaces — "
|
| 39 |
+
"msradam/riprap-vllm for Granite 4.1 8B FP8 (vLLM), and "
|
| 40 |
+
"msradam/riprap-inference for Prithvi-EO / TerraMind / "
|
| 41 |
+
"Granite TTM / GLiNER / Granite Embedding.",
|
| 42 |
+
),
|
| 43 |
"amd_mi300x": (
|
| 44 |
"AMD MI300X",
|
| 45 |
600.0,
|
| 46 |
"AMD Instinct MI300X data sheet (750 W TDP); ~600 W sustained "
|
| 47 |
+
"during vLLM generation. Selected only when an operator deploys "
|
| 48 |
+
"against an MI300X droplet and sets RIPRAP_HARDWARE_LABEL=AMD "
|
| 49 |
+
"MI300X explicitly. The hackathon submission used to run on "
|
| 50 |
+
"this hardware; the droplet was decommissioned 2026-05-06 and "
|
| 51 |
+
"inference now routes through L4 Spaces.",
|
| 52 |
),
|
| 53 |
"nvidia_t4": (
|
| 54 |
"NVIDIA T4",
|
| 55 |
50.0,
|
| 56 |
"NVIDIA T4 data sheet (70 W max); ~50 W sustained during "
|
| 57 |
+
"transformer inference. Used by the CPU-tier UI Spaces "
|
| 58 |
+
"(lablab + personal mirror) when a small inline LLM runs "
|
| 59 |
+
"alongside the FastAPI front-end.",
|
| 60 |
),
|
| 61 |
"apple_m": (
|
| 62 |
"Apple M-series",
|
| 63 |
20.0,
|
| 64 |
"ml.energy / community measurements: ~20 W package power "
|
| 65 |
+
"during Granite 4.1 q4_K_M inference on Apple M3/M4 (the "
|
| 66 |
+
"local-dev path, no remote backend configured).",
|
| 67 |
),
|
| 68 |
"cpu_server": (
|
| 69 |
"x86 CPU",
|
|
@@ -95,14 +95,23 @@ def _post(path: str, payload: dict[str, Any], timeout: float | None = None) -> d
|
|
| 95 |
raise RemoteUnreachable(f"HTTP {r.status_code} from {path}: {r.text[:200]}")
|
| 96 |
r.raise_for_status()
|
| 97 |
duration_s = time.monotonic() - t0
|
| 98 |
-
#
|
| 99 |
-
#
|
| 100 |
-
#
|
| 101 |
-
#
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 102 |
emissions.active().record_ml(
|
| 103 |
endpoint=path,
|
| 104 |
backend="riprap-models",
|
| 105 |
-
hardware=
|
| 106 |
duration_s=duration_s,
|
| 107 |
)
|
| 108 |
return r.json()
|
|
|
|
| 95 |
raise RemoteUnreachable(f"HTTP {r.status_code} from {path}: {r.text[:200]}")
|
| 96 |
r.raise_for_status()
|
| 97 |
duration_s = time.monotonic() - t0
|
| 98 |
+
# Remote ML service is msradam/riprap-inference (or the vLLM-co-
|
| 99 |
+
# hosting msradam/riprap-vllm) — both run on NVIDIA L4 HF Spaces.
|
| 100 |
+
# Operators can override via RIPRAP_HARDWARE_LABEL when targeting
|
| 101 |
+
# different hardware (e.g. an MI300X droplet). Local-fallback paths
|
| 102 |
+
# don't reach this function — they go straight to in-process model
|
| 103 |
+
# loads in the specialist module, which we don't track.
|
| 104 |
+
override = (os.environ.get("RIPRAP_HARDWARE_LABEL") or "").lower()
|
| 105 |
+
if "mi300x" in override or "amd" in override:
|
| 106 |
+
hw = "amd_mi300x"
|
| 107 |
+
elif "t4" in override:
|
| 108 |
+
hw = "nvidia_t4"
|
| 109 |
+
else:
|
| 110 |
+
hw = "nvidia_l4"
|
| 111 |
emissions.active().record_ml(
|
| 112 |
endpoint=path,
|
| 113 |
backend="riprap-models",
|
| 114 |
+
hardware=hw,
|
| 115 |
duration_s=duration_s,
|
| 116 |
)
|
| 117 |
return r.json()
|
|
@@ -237,17 +237,28 @@ def _hardware_for(engine: str) -> str:
|
|
| 237 |
"""Map the active LLM engine to an emissions.HARDWARE key.
|
| 238 |
|
| 239 |
Operator override via RIPRAP_HARDWARE_LABEL is honored where it
|
| 240 |
-
matches a known key (mi300x / t4 / apple / cpu); otherwise
|
| 241 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 242 |
override = (os.environ.get("RIPRAP_HARDWARE_LABEL") or "").lower()
|
| 243 |
if "mi300x" in override or "amd" in override:
|
| 244 |
return "amd_mi300x"
|
| 245 |
-
if "
|
|
|
|
|
|
|
| 246 |
return "nvidia_t4"
|
|
|
|
|
|
|
| 247 |
if "apple" in override or "m3" in override or "m4" in override:
|
| 248 |
return "apple_m"
|
| 249 |
-
if
|
| 250 |
-
|
|
|
|
| 251 |
if os.environ.get("SPACE_ID") or os.environ.get("HF_SPACE_ID"):
|
| 252 |
return "nvidia_t4"
|
| 253 |
return "apple_m"
|
|
|
|
| 237 |
"""Map the active LLM engine to an emissions.HARDWARE key.
|
| 238 |
|
| 239 |
Operator override via RIPRAP_HARDWARE_LABEL is honored where it
|
| 240 |
+
matches a known key (mi300x / l4 / t4 / apple / cpu); otherwise:
|
| 241 |
+
- Remote vLLM/Ollama (RIPRAP_LLM_BASE_URL set) → NVIDIA L4. Both
|
| 242 |
+
Riprap inference Spaces (msradam/riprap-vllm + msradam/
|
| 243 |
+
riprap-inference) run on L4. The MI300X droplet was retired
|
| 244 |
+
2026-05-06.
|
| 245 |
+
- On a CPU/T4-tier HF Space (UI Space with no remote backend) →
|
| 246 |
+
T4.
|
| 247 |
+
- Otherwise local dev → Apple M-series."""
|
| 248 |
override = (os.environ.get("RIPRAP_HARDWARE_LABEL") or "").lower()
|
| 249 |
if "mi300x" in override or "amd" in override:
|
| 250 |
return "amd_mi300x"
|
| 251 |
+
if "l4" in override:
|
| 252 |
+
return "nvidia_l4"
|
| 253 |
+
if "t4" in override:
|
| 254 |
return "nvidia_t4"
|
| 255 |
+
if "nvidia" in override:
|
| 256 |
+
return "nvidia_l4"
|
| 257 |
if "apple" in override or "m3" in override or "m4" in override:
|
| 258 |
return "apple_m"
|
| 259 |
+
if _VLLM_BASE:
|
| 260 |
+
# Any remote vLLM/Ollama backend currently lives on an L4 Space.
|
| 261 |
+
return "nvidia_l4"
|
| 262 |
if os.environ.get("SPACE_ID") or os.environ.get("HF_SPACE_ID"):
|
| 263 |
return "nvidia_t4"
|
| 264 |
return "apple_m"
|