seriffic committed on
Commit 131e277 · Parent: 43f0938

Switch to GPU Dockerfile + 8b reconciler for nvidia-t4-small


User upgraded the HF Space to nvidia-t4-small (HF Pro). Production
image variant from the spine HEAD:

- Base: nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04
- Ollama installer auto-detects GPU, dispatches CUDA build
- Granite 4.1:8b pulled at *runtime* (HF build sandbox can't fit
8B + EO toolchain alongside torch); ~2 min cold start, then
OLLAMA_KEEP_ALIVE=24h holds it resident
- 3b alias remapped to 8b via RIPRAP_OLLAMA_3B_TAG so the planner
+ reconciler both run on 8b (full quality, single warm model); the
alias resolution is sketched after this list
- Flash attention + KV cache q8_0 for ~2x throughput on T4
- Pre-warm 8b into VRAM in entrypoint so the first reconcile
doesn't pay the ~30s model-load tax
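
The alias resolution itself lives in the app's LLM client and is not part
of this diff; the sketch below shows the intended behavior only, and the
resolve_model_tag helper plus DEFAULT_TAGS mapping are hypothetical names,
not the app's actual code.

import os

# Hypothetical sketch: an env override such as RIPRAP_OLLAMA_3B_TAG=granite4.1:8b
# makes every planner ("3b") call land on the same warm 8b model.
DEFAULT_TAGS = {"3b": "granite4.1:3b", "8b": "granite4.1:8b"}

def resolve_model_tag(size: str) -> str:
    override = os.environ.get(f"RIPRAP_OLLAMA_{size.upper()}_TAG")
    return override or DEFAULT_TAGS[size]

# With the GPU Dockerfile's ENV in place:
#   resolve_model_tag("3b") -> "granite4.1:8b"  (planner)
#   resolve_model_tag("8b") -> "granite4.1:8b"  (reconciler)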

The EO toolchain (Phase 1 Prithvi live + Phase 4 TerraMind synthesis)
installs at runtime into $HOME/.eo-pkgs (the build sandbox couldn't fit
it). If the install fails, the lazy import in those specialists returns
'skipped' cleanly and the other 14 specialists run normally.
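
The 'skipped' path follows the lazy-import guard in
app/context/terramind_synthesis.py and app/flood_layers/prithvi_live.py.
A simplified stand-in (function name and return shape are illustrative,
not the real module API):

def run_terramind_synthesis(request):
    # Lazy import: terratorch only exists when the entrypoint's runtime
    # pip install into $HOME/.eo-pkgs succeeded on this deployment.
    try:
        import terratorch  # noqa: F401
    except ImportError:
        # Clean skip; the trace card and map legend surface this state.
        return {"status": "skipped",
                "reason": "deps unavailable on this deployment"}
    # The real TerraMind synthesis path would run here (omitted from this sketch).
    return {"status": "ok"}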

Inference drops from ~60-180s on cpu-basic to ~2-4s on t4-small
for the synthesis reconciler.

Files changed (2)
  1. Dockerfile +62 -41
  2. entrypoint.sh +88 -5
Dockerfile CHANGED
@@ -1,36 +1,32 @@
-# Riprap — Hugging Face Spaces (Docker SDK) deployment.
+# Riprap — Hugging Face Spaces deployment (Docker SDK, GPU).
 #
-# CPU-tuned variant for HF Spaces cpu-basic (free tier). The
-# nvidia-t4-small / MI300X variants live alongside as build args
-# to switch when the Space is upgraded.
+# Base: NVIDIA CUDA 12.4 runtime + cuDNN on Ubuntu 22.04. Ollama's
+# installer detects the GPU and pulls the CUDA-aware build automatically;
+# Granite 4.1:3b inference drops from ~60-180s on CPU Basic to ~2-4s on
+# nvidia-t4-small.
 #
 # Bakes:
-# - Python 3.12 + pip deps (~2.5 GB once torch is in)
-# - Ollama + granite4.1:3b model (~2 GB) — 3b only on cpu-basic.
-#   RIPRAP_OLLAMA_8B_TAG=granite4.1:3b aliases the 8b reconciler
-#   calls to 3b so the polished UI runs end-to-end without 8b's
-#   ~5 GB image cost. Quality drops vs 8b; speed lever is the
-#   vLLM-on-AMD-MI300X demo path (RIPRAP_LLM_PRIMARY=vllm).
+# - Python 3.10 (default on 22.04) + pip deps (~2.5 GB once torch is in)
+# - Ollama + granite4.1:3b model (~2 GB)
 # - All pre-computed fixtures in data/ + corpus/
 #
 # Runtime:
-# - Ollama daemon serves Granite 4.1:3b
+# - Ollama daemon serves Granite 4.1 via CUDA
 # - Granite Embedding 278M auto-downloads via sentence-transformers
-#   on first FastAPI startup (~280 MB) — cached to /home/user/.cache
-# - uvicorn FastAPI on port 7860 (HF default)
+#   on first FastAPI startup (~280 MB)
+# - uvicorn FastAPI on port 7860 (HF Spaces default)
 
-FROM python:3.12-slim AS base
+FROM nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04 AS base
 
-# OS deps for geo libs + curl/zstd for Ollama installer (which now ships
-# its tarball compressed with zstd and refuses to install if it's missing).
+# OS deps: Python 3.10 + geo libs + Ollama install dependencies.
+ENV DEBIAN_FRONTEND=noninteractive
 RUN apt-get update && apt-get install -y --no-install-recommends \
+    python3 python3-pip python3-venv python-is-python3 \
     curl ca-certificates zstd procps \
     gdal-bin libgdal-dev libgeos-dev libproj-dev \
     && rm -rf /var/lib/apt/lists/*
 
 # HF Spaces convention: run as a non-root "user" account at /home/user/app.
-# Granite Embedding cache lives in /home/user/.cache/huggingface — it
-# survives container restarts when persistent storage is mounted there.
 RUN useradd -m -u 1000 user
 ENV HOME=/home/user \
     PATH=/home/user/.local/bin:/usr/local/bin:/usr/bin:/bin \
@@ -39,38 +35,63 @@ ENV HOME=/home/user \
     OLLAMA_HOST=127.0.0.1:11434 \
     OLLAMA_NUM_PARALLEL=1 \
     OLLAMA_KEEP_ALIVE=24h \
-    RIPRAP_LLM_PRIMARY=ollama \
-    RIPRAP_OLLAMA_8B_TAG=granite4.1:3b \
-    RIPRAP_MELLEA_MAX_ATTEMPTS=2
+    OLLAMA_MAX_LOADED_MODELS=2 \
+    OLLAMA_FLASH_ATTENTION=1 \
+    OLLAMA_KV_CACHE_TYPE=q8_0 \
+    OLLAMA_DEBUG=1
 
-# Install Ollama (single-binary install)
+# Install Ollama. install.sh ships the cuda_v12 dispatcher libs
+# unconditionally; the GPU detection at the tail of the script only gates
+# host-driver install (a no-op inside a container). So this works fine
+# on a CPU builder for a GPU-attached runtime.
 RUN curl -fsSL https://ollama.com/install.sh | sh
 
 WORKDIR /home/user/app
 
-# Python deps (cache the layer)
+# Python deps. CUDA 12.x in base image lets pip pull cu124 torch wheels
+# automatically when sentence-transformers asks for torch.
 COPY --chown=user:user requirements.txt ./
 RUN pip install --no-cache-dir --upgrade pip && \
     pip install --no-cache-dir -r requirements.txt
 
-# Pull Granite 4.1:3b into the image. The official Ollama installer
-# stores models under /usr/share/ollama/.ollama by default; we point at a
-# user-writable location so the runtime container can also serve.
+# --- Earth-observation toolchain (Phase 1 Prithvi live + Phase 4
+# TerraMind synthesis) ---------------------------------------------------
 #
-# Pattern: start ollama serve in the background, poll its HTTP endpoint
-# until it answers, then pull the model. We do NOT kill the daemon at the
-# end — the RUN shell's exit reaps it automatically, and `pkill -f` would
-# match this RUN command line itself (SIGTERM propagates up, build exits 143).
-ENV OLLAMA_MODELS=/home/user/.ollama/models
-RUN mkdir -p $OLLAMA_MODELS && \
-    ollama serve > /tmp/ollama.log 2>&1 & \
-    for i in $(seq 1 60); do \
-      curl -sf http://127.0.0.1:11434/ > /dev/null 2>&1 && break; \
-      sleep 1; \
-    done && \
-    ollama list && \
-    ollama pull granite4.1:3b && \
-    ollama list
+# Tried four times to land terratorch on HF's Py3.10 image alongside
+# our pinned stack (transformers<5, hf_hub<1, granite-tsfm<0.3.4,
+# mellea<0.4). Each attempt failed at the same point — a `mkdir`
+# immediately after the --no-deps install — with no actionable error
+# in HF's build log. The failure pattern is consistent with build-
+# sandbox disk exhaustion; even a 4-package narrow install
+# (terratorch + einops + diffusers + timm with --no-deps) hits it.
+#
+# Accepting this: TerraMind synthesis + Prithvi-live remain
+# local-/AMD-only on this deployment. The lazy-import pattern in
+# app/context/terramind_synthesis.py + app/flood_layers/prithvi_live.py
+# returns clean `skipped: deps unavailable on this deployment` on HF;
+# the trace card and the map legend make that visible. The other 14
+# specialists run normally.
+#
+# Re-enable on a deployment with more build disk (Docker SDK on a
+# self-hosted machine, AMD droplet, etc.) by adding the EO --no-deps
+# install back here.
+
+# Pull both Granite 4.1 variants into the image:
+#   :3b — fast routing (planner) + live_now reconciler (short outputs)
+#   :8b — synthesis reconciler for single_address / neighborhood / dev_check
+# Both fit warm on the T4 with OLLAMA_MAX_LOADED_MODELS=2 (~10 GB total
+# VRAM out of 16). We start ollama in the background, poll its HTTP
+# endpoint, pull, and let the layer exit (Docker reaps the daemon —
+# don't pkill, it'll match this RUN's own cmdline and exit 143).
+ENV OLLAMA_MODELS=/home/user/.ollama/models \
+    RIPRAP_OLLAMA_3B_TAG=granite4.1:8b
+# Granite weights are pulled at *container start* (see entrypoint.sh)
+# instead of at build time. HF's build sandbox can't fit the EO
+# toolchain + Granite 8B (5GB) simultaneously, but the runtime
+# rootfs is larger and persists between container starts within an
+# image lifetime. Cold-start on first launch ~2 min for the 8B pull;
+# subsequent restarts are fast since Ollama's cache survives.
+RUN mkdir -p $OLLAMA_MODELS
 
 # App code + fixtures
 COPY --chown=user:user app/ ./app/
@@ -78,7 +99,7 @@ COPY --chown=user:user web/ ./web/
 COPY --chown=user:user scripts/ ./scripts/
 COPY --chown=user:user data/ ./data/
 COPY --chown=user:user corpus/ ./corpus/
-COPY --chown=user:user agent.py helios_nyc.py ./
+COPY --chown=user:user agent.py riprap.py ./
 COPY --chown=user:user entrypoint.sh ./
 RUN chmod +x ./entrypoint.sh
 
entrypoint.sh CHANGED
@@ -6,6 +6,59 @@
 # $HOME (which we own) instead.
 set -e
 
+# --- Earth-observation toolchain (Phase 1 + Phase 4) -------------------
+# Build-time install was blocked by HF's build-disk threshold (5
+# attempts; all failed at the same point). Runtime install in the
+# running container works around the build-sandbox limit — the
+# running container has more disk.
+#
+# Use `--target=$EO_DIR` instead of `--user`: explicit path that we
+# can prepend to PYTHONPATH ourselves, so the install location is
+# guaranteed visible regardless of HF Spaces' Python site-config.
+# The `--user` approach was failing silently because HF's Python
+# environment apparently bypasses the user-site discovery path.
+EO_DIR="$HOME/.eo-pkgs"
+EO_MARKER="$EO_DIR/.installed"
+if [ ! -f "$EO_MARKER" ]; then
+  echo "[entrypoint] EO toolchain not yet installed; running pip install (~2 min)..."
+  mkdir -p "$EO_DIR"
+  # Bisect: previous build (1cf59ee) added torchvision + 7 more deps
+  # at once and the whole install failed (eo_dir empty, no marker).
+  # Pip's resolver is all-or-nothing per RUN — one bad package fails
+  # everything. Revert to the known-good 4 + just torchvision (the
+  # one terratorch actually needs to import). Once this proves out,
+  # add Prithvi-live deps in a second RUN.
+  if pip install --no-cache-dir --no-deps --target="$EO_DIR" \
+      terratorch==1.1rc6 \
+      einops \
+      diffusers \
+      timm \
+      torchvision; then
+    echo "[entrypoint] pip install OK; verifying import..."
+    if PYTHONPATH="$EO_DIR:$PYTHONPATH" python -c "
+import terratorch
+from terratorch.registry import FULL_MODEL_REGISTRY
+import terratorch.models.backbones.terramind.model.terramind_register
+n = len([k for k in FULL_MODEL_REGISTRY if 'terramind' in k.lower()])
+assert n > 0, 'no terramind register entries'
+print(f'[entrypoint] terratorch ok, terramind register: {n} entries')
+"; then
+      touch "$EO_MARKER"
+      echo "[entrypoint] EO toolchain READY at $EO_DIR"
+    else
+      echo "[entrypoint] EO verify FAILED — TerraMind/Prithvi-live will skip"
+    fi
+  else
+    echo "[entrypoint] pip install FAILED — TerraMind/Prithvi-live will skip"
+  fi
+else
+  echo "[entrypoint] EO toolchain already installed at $EO_DIR (cached)"
+fi
+# Always export PYTHONPATH so uvicorn can find the install (no-op if
+# the install failed and the dir is empty — the lazy-import in the
+# specialists handles that case cleanly).
+export PYTHONPATH="$EO_DIR:$PYTHONPATH"
+
 # Stream Ollama's stdout+stderr to BOTH stdout (so it shows up in HF
 # Spaces runtime logs — needed to see GPU discovery output from
 # OLLAMA_DEBUG=1) AND a file (for the readiness fail-fast tail below).
@@ -34,14 +87,44 @@ if ! curl -sf http://127.0.0.1:11434/ > /dev/null 2>&1; then
   exit 1
 fi
 
-# Sanity check: Granite 4.1 model is present (baked in during build)
-if ! ollama list | grep -q "granite4.1:3b"; then
-  echo "[entrypoint] WARNING: granite4.1:3b not found; pulling now (slow!)..."
-  ollama pull granite4.1:3b || echo "[entrypoint] pull failed; reconciler will fail"
-fi
+# Granite 4.1:8b is pulled at runtime instead of baked into the image
+# — the EO toolchain (Phase 1 Prithvi + Phase 4 TerraMind) doesn't
+# fit alongside Granite weights in HF's build sandbox. First container
+# start does the pull (~2 min over the wire). Subsequent runtime
+# restarts within the same image lifetime reuse Ollama's cache so
+# this is a one-time per-image cost.
+#
+# 3b is also handled if present, but with RIPRAP_OLLAMA_3B_TAG=
+# granite4.1:8b set, the planner alias resolves to 8b too — so 8b
+# alone covers planner + reconciler.
+for model in "granite4.1:8b" "granite4.1:3b"; do
+  if ! ollama list | grep -q "$model"; then
+    if [ "$model" = "granite4.1:8b" ]; then
+      echo "[entrypoint] $model not found; pulling now (~5GB, ~2 min over the wire)..."
+      ollama pull "$model" || {
+        echo "[entrypoint] FATAL: pull failed for $model — reconciler will not work"
+        exit 1
+      }
+    else
+      # 3B is optional; if it's not there and the env override is set,
+      # the router will route the planner alias to 8B.
+      echo "[entrypoint] $model not found (optional — planner alias remapped to 8b via RIPRAP_OLLAMA_3B_TAG)"
+    fi
+  fi
+done
 
 ollama list
 
+# Pre-warm Granite 4.1:8b into VRAM so the first reconcile doesn't pay
+# the ~30s model-load tax. The empty prompt keeps it tiny; OLLAMA_KEEP_ALIVE
+# (24h) holds the weights resident through the demo.
+echo "[entrypoint] pre-warming granite4.1:8b into VRAM (one-shot)..."
+curl -s -X POST http://127.0.0.1:11434/api/generate \
+  -d '{"model":"granite4.1:8b","prompt":"hi","stream":false,"keep_alive":"24h","options":{"num_predict":1}}' \
+  -o /dev/null --max-time 120 \
+  && echo "[entrypoint] granite4.1:8b warm" \
+  || echo "[entrypoint] WARNING: 8b warmup failed (will load lazily)"
+
 # Log GPU visibility + Ollama lib layout so we can confirm CUDA dispatch
 # from the runtime logs (paired with OLLAMA_DEBUG=1 in the daemon).
 if command -v nvidia-smi > /dev/null 2>&1; then