seriffic committed on
Commit 84bb74d · 1 Parent(s): 131e277

Move CLAUDE.md to gitignored per-machine territory

CLAUDE.md was tracked because it predates the project going public.
Now that the rebuild is done and the repo lives on origin + HF, the
project-orientation content fits in ARCHITECTURE.md and the migration
plan in UPDATE_STONES.md. CLAUDE.md becomes a per-machine session
bootstrap for whichever Claude Code instance is working on the repo
locally — no reason to ship it to anyone cloning.

CLAUDE.local.md (already gitignored) is deleted in the same change;
its only durable role was the progressive-history rebuild rule, which
is complete.

Files changed (2)
  1. .gitignore +2 -1
  2. CLAUDE.md +0 -546
.gitignore CHANGED
@@ -10,7 +10,8 @@ node_modules/
 .ruff_cache/
 .pytest_cache/
 
-# Session-local Claude Code context (per-machine, not for the public repo)
+# Claude Code context (per-machine, not for the public repo)
+CLAUDE.md
 CLAUDE.local.md
 .claude/
 
CLAUDE.md DELETED
@@ -1,546 +0,0 @@
# Riprap — Claude Code orientation

Citation-grounded NYC flood-exposure briefings. Granite 4.1 via a LiteLLM Router (Ollama for local/T4, vLLM-on-ROCm for the AMD MI300X demo path), Mellea-validated reconciliation, vanilla JS + Svelte 5 custom elements, FastAPI on T4 (HF Spaces).
**AMD hackathon demo: May 4–10, 2026.**

`ARCHITECTURE.md` is the source of truth for *what the system does*. This file is for *how to work on it*.

---

## Critical constraints

- **HF Spaces base image is Python 3.10.** This pins:
  - `mellea<0.4` (0.4+ requires 3.11+) — no `find_citations` / `flag_hallucinated_content` intrinsics in production.
  - `transformers>=4.55,<5` + `huggingface_hub>=0.34,<1` — coexistence with `granite-tsfm 0.3.x` (which calls `transformers.utils.download_url`, removed in transformers 5.x).
  - Don't bump these without testing the full HF rebuild end-to-end.
  - Local venv is Python 3.12 — Mellea 0.4.x is installed there but its RAG intrinsics need a HuggingFace transformers backend (LoRA loading); they don't work over Ollama. Don't accidentally rely on them.

- **All LLM calls go through `app/llm.py`.** Never `import ollama` in new code. The shim exposes `chat(model, messages, options, stream, format)` with the same return shape as `ollama.chat`, and routes through a LiteLLM Router. Two backends are wired:
  - `RIPRAP_LLM_PRIMARY=ollama` (default) — local + HF Space path. Quant override: `RIPRAP_OLLAMA_8B_TAG=granite4.1:8b-q3_K_M` saves ~1 GB resident vs the default Q4_K_M.
  - `RIPRAP_LLM_PRIMARY=vllm` + `RIPRAP_LLM_BASE_URL` + `RIPRAP_LLM_API_KEY` — AMD MI300X demo path. Auto-fails over to Ollama if vLLM is unreachable. Same env vars work for local dev, HF Space → AMD, or AMD droplet → AMD self-host.

  An mlx-lm-backed third backend was prototyped (Apple-Silicon-native via `mlx_lm.server` with speculative decoding) but reverted — the install bumped torch internals in a way that broke `terratorch`'s Prithvi backbone with a `meta vs cpu` device mismatch. Stick with Ollama on local; switch to vLLM for the AMD demo. mlx-lm can be revisited once the EO toolchain isolates its torch state.

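  The routing contract matters more than the internals. A minimal sketch of what the fallback wiring *could* look like with LiteLLM's Router (alias names, provider prefixes and the `fallbacks` shape here are illustrative assumptions, not a transcript of `app/llm.py`):

  ```python
  import os
  from litellm import Router

  # Sketch only — the real model_list and aliases live in app/llm.py.
  router = Router(
      model_list=[
          {   # primary: OpenAI-compatible vLLM endpoint (AMD MI300X path)
              "model_name": "granite-4.1-8b",
              "litellm_params": {
                  "model": "hosted_vllm/granite-4.1-8b",
                  "api_base": os.environ.get("RIPRAP_LLM_BASE_URL"),
                  "api_key": os.environ.get("RIPRAP_LLM_API_KEY"),
              },
          },
          {   # fallback: local Ollama tag
              "model_name": "granite-4.1-8b-ollama",
              "litellm_params": {"model": "ollama_chat/granite4.1:8b"},
          },
      ],
      fallbacks=[{"granite-4.1-8b": ["granite-4.1-8b-ollama"]}],
  )

  resp = router.completion(
      model="granite-4.1-8b",
      messages=[{"role": "user", "content": "ping"}],
  )
  print(resp.choices[0].message.content)
  ```

  The point of the shape: callers name an alias and the Router owns which backend answers. That is exactly why the openai SDK must not be called directly.
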
- **Ollama and vLLM use different chat templates.** Ollama's Modelfile recognises `role: "document <doc_id>"` and bundles those into a `<documents>` block. The HF tokenizer chat template (used by vLLM) silently drops non-standard roles. `app/llm.py` papers over this: extracts document-role messages into `extra_body.documents` / `chat_template_kwargs.documents` for vLLM, while leaving them in `messages` for the Ollama fallback. It also normalizes vLLM's `[doc_id=X]` emissions back to `[X]` so Mellea checks and frontend chips see the same format from both paths.

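  Both halves of that papering-over are small. A sketch of the idea (helper names are invented, not the ones in `app/llm.py`):

  ```python
  import re

  def split_document_messages(messages: list[dict]) -> tuple[list[dict], list[dict]]:
      """Pull role='document <id>' messages out so vLLM's HF chat template
      can receive them via chat_template_kwargs.documents."""
      docs, rest = [], []
      for m in messages:
          if m["role"].startswith("document "):
              docs.append({"doc_id": m["role"].split(" ", 1)[1], "text": m["content"]})
          else:
              rest.append(m)
      return rest, docs

  def normalize_citations(text: str) -> str:
      """Map vLLM's [doc_id=X] emissions back to the [X] form Ollama produces."""
      return re.sub(r"\[doc_id=([^\]]*)\]", r"[\1]", text)
  ```
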
- **The vLLM deployment serves only the 8B.** One served-name per vLLM process and we don't run two containers. The planner alias (`granite-3b`) is mapped to the same served name as the reconciler (`granite-4.1-8b`) when primary=vllm. On Ollama, 3B and 8B are distinct. Override per-alias with `RIPRAP_LLM_VLLM_3B_NAME` / `RIPRAP_LLM_VLLM_8B_NAME` if you stand up a second vLLM.

- **No LoRA / aLoRA / Granite Citation LoRA in production.** Even with vLLM available, we don't load LoRAs at runtime — Mellea's Ollama backend raises `NotImplementedError` for activated LoRAs, and we deliberately keep the call path identical across backends. Hand-rolled `[doc_id]` regex + reroll is the citation discipline mechanism. See §6 of ARCHITECTURE.md.

- **Two committed JS bundles, two source dirs.** HF Spaces does not run Node, so we ship pre-built artefacts:
  - `web/sveltekit/build/` — **the new design-system UI** (SvelteKit + adapter-static, IBM Plex, four-tier glyphs, MapLibre). Sources in `web/sveltekit/src/`. Rebuild with `cd web/sveltekit && npm run build`. FastAPI serves it at `/`, `/q/sample`, `/q/<query>`.
  - `web/static/dist/riprap.js` — legacy custom-element bundle. Sources in `web/svelte/src/`. Rebuild with `cd web/svelte && npm run build`. FastAPI serves it at `/legacy`, `/single`, `/compare`, `/register/*` while the new UI is being filled in.

  Commit both build outputs after editing the corresponding sources.

- **Models baked into the Docker image.** Both `granite4.1:3b` and `granite4.1:8b` are pulled at build time (~10 GB), so HF rebuilds take ~10 min. `entrypoint.sh` pre-warms the 8b into VRAM after Ollama is up so the first reconcile doesn't pay a cold-load.

---

## Run / build / test

```bash
# Local server (default: routes to local Ollama)
cd /Users/amsrahman/riprap-nyc
.venv/bin/uvicorn web.main:app --host 127.0.0.1 --port 7860 --log-level info
# → http://127.0.0.1:7860/ (primary UI; agent.html is the canonical home)

# Local server pointed at AMD MI300X (vLLM primary, Ollama fallback)
RIPRAP_LLM_PRIMARY=vllm \
RIPRAP_LLM_BASE_URL=http://<droplet-ip>:8000/v1 \
RIPRAP_LLM_API_KEY=<token> \
.venv/bin/uvicorn web.main:app --host 127.0.0.1 --port 7860 --log-level info
# Pill in the top-right shows "AMD MI300X · Granite 4.1 / vLLM" when
# the primary is reachable; flips amber on Ollama fallback, red if
# everything is down. Backed by GET /api/backend.

# Frontend rebuilds (only when sources change)
cd web/sveltekit && npm run build   # writes web/sveltekit/build/ (new UI)
cd web/svelte && npm run build      # writes web/static/dist/riprap.js (legacy)

# Static checks (all should be clean)
.venv/bin/ruff check app/ web/ scripts/
.venv/bin/vulture app/ web/main.py --min-confidence 90
.venv/bin/radon cc app/ web/main.py -s -n C   # complexity hotspots

# Programmatic Mellea probe (server must be running)
.venv/bin/python scripts/probe_mellea.py --query "Hollis" --runs 5
# Outputs outputs/probe_*.csv with per-attempt pass/fail, paragraph,
# elapsed time, reroll count.

# Smoke-test the streaming endpoint directly
curl -sN "http://127.0.0.1:7860/api/agent/stream?q=Hollis" --max-time 120

# Local-tuning env knobs (independent of backend):
#   OLLAMA_KEEP_ALIVE=24h          keep granite4.1:8b resident across requests
#   OLLAMA_NUM_PARALLEL=1          stop Ollama loading a 2nd copy under contention
#   RIPRAP_MELLEA_MAX_ATTEMPTS=2   cap rejection-sampling rerolls (default 2 local, 3 remote)
#   RIPRAP_TRIM_DOCS=1             drop doc messages whose specialist isn't in plan (default on)
#   RIPRAP_OLLAMA_8B_TAG=granite4.1:8b-q3_K_M   ~1 GB lighter than default Q4_K_M
```

**Don't restart uvicorn while a model is mid-generation** — Ollama will keep the request alive but the FastAPI handler dies, leaving the user staring at a dead stream. Pre-flight: `pkill -f "uvicorn web.main:app"`.

---

## Deploy

Single command for both remotes:

```bash
git push && git push huggingface main
```

GitHub remote = `origin` (msradam/riprap-nyc). HF Space remote = `huggingface` (msradam/riprap-nyc on huggingface.co).

HF rebuild status:

```bash
curl -sf "https://huggingface.co/api/spaces/msradam/riprap-nyc/runtime" \
  | python3 -m json.tool
# stage: BUILDING | RUNNING_APP_STARTING | RUNNING
# sha: should match the latest local commit when RUNNING
```

Live URL: <https://msradam-riprap-nyc.hf.space>

---

## Repo map (high-signal files)

```
app/
  llm.py               LiteLLM Router shim. chat(model, messages, options,
                       stream, format) — drop-in for ollama.chat. Routes
                       to vLLM (AMD MI300X) when RIPRAP_LLM_PRIMARY=vllm,
                       with Ollama fallback. Extracts role="document <id>"
                       into extra_body.documents for vLLM's HF chat
                       template; normalizes [doc_id=X] -> [X]. backend_info()
                       powers the UI pill via web/main.py:/api/backend.
  fsm.py               Burr FSM. Threadlocal hooks: set_strict_mode,
                       set_token_callback, set_mellea_attempt_callback.
                       step_reconcile() routes to reconcile_strict_streaming
                       when strict mode is on.
  reconcile.py         EXTRA_SYSTEM_PROMPT (the 4-section skeleton + citation
                       discipline). build_documents() is the doc_id ride-along.
                       verify_paragraph() is the legacy non-strict guardrail.
  mellea_validator.py  reconcile_strict_streaming() — the streaming rejection
                       sampler with 4 grounding checks (numerics_grounded,
                       no_placeholder_tokens, citations_dense,
                       citations_resolve). Reroll feedback names the specific
                       failing sentences.
  planner.py           Granite 4.1:3b intent router → live_now / single_address
                       / neighborhood / development_check / compare.
  intents/             Per-intent orchestration. Each run() takes
                       (plan, query, progress_q, strict). Strict path uses
                       reconcile_strict_streaming via either threadlocal
                       (single_address, fsm-based) or direct call (neighborhood,
                       dev_check).
  rag.py               Granite Embedding 278M retrieval over corpus/*.pdf.
  flood_layers/        Sandy zone, DEP scenarios, Ida HWMs, Prithvi polygons.
  context/             Microtopo (HAND/TWI), 311, FloodNet, NOAA, NWS, DOB.
  live/ttm_forecast.py Granite TTM r2 surge residual nowcast.

web/
  main.py              FastAPI; SSE stream at /api/agent/stream emits
                       plan_token, plan, step, token, mellea_attempt,
                       final, error, done events. Also GET /api/backend
                       (live LLM-backend descriptor + reachability ping
                       for the pill). All other LLM traffic goes through
                       app/llm.py — don't add ollama.chat calls here.
  static/
    agent.html         Primary UI. Mounts <r-briefing>, <r-trace>,
                       <r-sources-footer> (Svelte custom elements).
    agent.js           EventSource client. setBriefingText() sets the
                       .text property on <r-briefing>; pushTraceStep()
                       calls .pushStep() on <r-trace>. Form binding is
                       BEFORE ensureMap() so a WebGL throw doesn't
                       strand the Ask button.
    dist/riprap.js     Built Svelte bundle (committed).
    components/        OLD Lit components — kept for reference but
                       not loaded by agent.html anymore.
  svelte/src/lib/      Svelte 5 sources. customElement: true globally
                       via vite.config.js.
    stores.js          highlightedDocId, citeIndex (writable). The
                       cross-component chip ↔ source-row highlight
                       reacts via these.

scripts/
  probe_mellea.py      Drives the SSE stream N times, dumps CSV.
  run_prithvi_ida.py   Offline Prithvi-EO 2.0 segmentation (one-shot).
  build_*_register.py  Bulk-mode register builders (offline).

corpus/                5 LFS-tracked NYC policy PDFs (NPCC4 etc).
data/                  LFS-tracked baked fixtures (Sandy, DEP, Prithvi
                       polygons, DEM/HAND/TWI rasters, registers).
```

---

## Project conventions

### Document message convention

Specialists emit data as chat messages with `role="document <doc_id>"`. Granite 4.1's Ollama template recognises this prefix and bundles them into a `<documents>` block + auto-injects IBM's grounded-generation system message. Don't reinvent — `app/reconcile.py:build_documents()` already wires it. `app/llm.py` additionally extracts the same messages into `chat_template_kwargs.documents` so vLLM's HF tokenizer template sees them too — both backends honour the same grounding contract from identical caller code.

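In caller terms the convention looks roughly like this (doc_ids, content and options are illustrative, not real specialist output):

```python
from app.llm import chat  # never `import ollama` directly

messages = [
    {"role": "document sandy", "content": "Parcel is inside the Sandy inundation zone."},
    {"role": "document nyc311", "content": "5 street-flooding complaints within 500 m since 2020."},
    {"role": "user", "content": "Summarize flood exposure for this address."},
]

# Same return shape as ollama.chat. On Ollama the document roles pass
# through to the Modelfile; on vLLM the shim extracts them first.
resp = chat("granite4.1:8b", messages, options={"temperature": 0}, stream=False)
print(resp["message"]["content"])
```
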
### The four Mellea grounding requirements

1. **`numerics_grounded`** — every non-trivial number in the output appears verbatim in a source document.
2. **`no_placeholder_tokens`** — output contains no leaked `[source]` / `<document>` template fragments.
3. **`citations_dense`** — every non-trivial number has a `[doc_id]` citation **somewhere in the same sentence**. Sentence scope, not a character window. Identifier codes (`QN1206`, BBL parcels, `B12`) are skipped via `\b` word-boundary regex so they don't get treated as numeric claims (see the sketch after this list).
4. **`citations_resolve`** — cited `doc_id`s ⊆ input `doc_id`s.

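A minimal sketch of the sentence-scoped check. Pattern names mirror the ones listed under "Tune the Mellea checks" below, but these bodies are simplified stand-ins, not the real `app/mellea_validator.py`:

```python
import re

_SENT_END = re.compile(r"(?<=[.!?])\s+")
_NUM_RE = re.compile(r"\b\d[\d,]*(?:\.\d+)?\b")   # \b keeps QN1206 / B12 out
_CITE_RE = re.compile(r"\[[A-Za-z0-9_]+\]")
_TRIVIAL_NUMS = {"0", "1", "2", "311", "911"}      # illustrative exemptions

def failing_sentences(paragraph: str) -> list[str]:
    """Sentences containing a non-trivial number but no [doc_id] citation."""
    bad = []
    for sentence in _SENT_END.split(paragraph):
        nums = [n for n in _NUM_RE.findall(sentence) if n not in _TRIVIAL_NUMS]
        if nums and not _CITE_RE.search(sentence):
            bad.append(sentence)
    return bad
```
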
If you change the regex or sentence boundary, **re-run the probe**:

```bash
.venv/bin/python scripts/probe_mellea.py --query "Hollis" --runs 5
.venv/bin/python scripts/probe_mellea.py --query "100 Gold St Manhattan" --runs 3
.venv/bin/python scripts/probe_mellea.py --query "what are they building in Gowanus and is it risky" --runs 3
```

### Threadlocal hooks in `app/fsm.py`

The FSM is sync code called from a threadpool executor. To plumb streaming callbacks without changing every action signature, we use threadlocals:

- `set_strict_mode(bool)` → `_current_strict_mode()` decides whether `step_reconcile` routes to Mellea or the legacy reconciler.
- `set_token_callback(fn)` → `_current_token_callback()` for streaming tokens out of the reconciler.
- `set_mellea_attempt_callback(fn)` → fires after each Mellea attempt with `(attempt_idx, passed, failed)`.

**Always reset in a `finally:`.** `app/intents/single_address.py:run()` is the canonical example.

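The shape of that discipline, sketched (callback bodies and the `None`-reset convention are assumptions; the real version is `app/intents/single_address.py:run()`):

```python
from app import fsm

def run(plan, query, progress_q, strict):
    fsm.set_strict_mode(strict)
    fsm.set_token_callback(lambda delta: progress_q.put(("token", delta)))
    fsm.set_mellea_attempt_callback(
        lambda attempt, passed, failed: progress_q.put(("mellea_attempt", attempt))
    )
    try:
        ...  # drive the FSM actions
    finally:
        # Threadpool threads are reused across requests — always clear.
        fsm.set_strict_mode(False)
        fsm.set_token_callback(None)
        fsm.set_mellea_attempt_callback(None)
```
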
### SSE event vocabulary (`/api/agent/stream`)

| event | payload | when |
|-------|---------|------|
| `hello` | `{query}` | connection open |
| `plan_token` | `{delta}` | each token of the planner JSON |
| `plan` | `{intent, targets, specialists, rationale}` | planner finished |
| `step` | `{step, ok, started_at, elapsed_s, result?, err?}` | each FSM action |
| `token` | `{delta, attempt?}` | each Granite reconcile token |
| `mellea_attempt` | `{attempt, passed, failed}` | end of each Mellea attempt |
| `final` | full result dict (`paragraph`, `mellea`, `audit`, `tier`, `score`, ...) | reconcile done |
| `error` | `{err}` | exception in the runner |
| `done` | `{}` | stream closing |

Frontend resets the briefing buffer when `token.attempt` changes (handles reroll cleanly).

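For scripting against the stream outside the browser, a bare-bones consumer looks something like this (requests-based sketch; the real frontend uses `EventSource` in `agent.js`):

```python
import json
import requests

url = "http://127.0.0.1:7860/api/agent/stream"
with requests.get(url, params={"q": "Hollis"}, stream=True, timeout=120) as r:
    event = None
    for raw in r.iter_lines(decode_unicode=True):
        if not raw:
            continue  # blank line terminates an SSE frame
        if raw.startswith("event:"):
            event = raw.split(":", 1)[1].strip()
        elif raw.startswith("data:"):
            payload = json.loads(raw.split(":", 1)[1])
            if event == "token":
                print(payload["delta"], end="", flush=True)
            elif event == "done":
                break
```
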
### Frontend property convention

Svelte custom elements take props via JS property setters:

```js
const el = document.getElementById("paragraph"); // <r-briefing>
await customElements.whenDefined("r-briefing");
el.sourceLabels = SOURCE_LABELS;
el.text = "...streaming markdown...";
```

`<r-trace>` exposes imperative methods on the host:

```js
el.pushStep({ step: "geocode", ok: true, elapsed_s: 0.3, result: {...} });
el.clear();
```

`<r-sources-footer>` reads `citeIndex` from the shared store; the Briefing populates it whenever its `bodyHtml` is computed.

---

## Decisions worth remembering

These are paths we explored and either chose or ruled out. Don't re-litigate them without new information.

- **Lit → Svelte (May 2026).** Three Lit components were live first (`web/static/components/`) but the user wanted a full Svelte rewrite. Migrated to Svelte 5 custom-element bundle (drop-in replacement — same tag names, same property API). The Lit files are still on disk for reference but not loaded.

- **Granite 4.x native inline citations are deprecated.** We investigated the `<|start_of_cite|>...<|end_of_cite|>` mode. The official Ollama template removed it for 4.x; `granite_common` ships no `granite4/` package; `granite-io` has no 4.x processor. 4.1 emits citation tokens only in an end-of-response list, never inline. IBM's expected 4.x citation path is a separate LoRA on granite-4.0-micro that produces post-hoc JSON — needs HF transformers, not Ollama. **Hand-rolled `[doc_id]` regex + reroll is the right pattern for our setup.**

- **Mellea 0.4 RAG intrinsics aren't reachable.** `find_citations`, `flag_hallucinated_content`, `check_context_relevance` all route through `GraniteCommonAdapter` → activated LoRA on the HF transformers backend. `mellea/backends/ollama.py:357-359` literally raises `NotImplementedError` for activated LoRAs. To use them we'd swap the serving layer, eat ~5 GB more RAM, lose Ollama's optimizations. Not worth it for the demo.

- **CARTO Voyager basemap (not Stadia).** Tried Stadia Alidade Smooth — looks great, but they 401 without an API key and domain allowlist. Voyager is auth-free, retina-tiled, more editorial than Positron.

- **Speculative streaming Mellea.** `reconcile_strict_streaming` streams every attempt's tokens to the user (visible at t≈30 s instead of after t≈95 s of validation silence). Inline banner shows reroll status. Felt latency drops dramatically even when total wall-clock is the same.

- **Sentence-scoped `citations_dense` + identifier-aware `\b` regex.** The combo killed the chronic 3/4 reroll loop on neighborhood queries. Hollis: was 3/4 with 2 rerolls every run; now 4/4 with ≤1 reroll. Don't tighten the regex back to a fixed-width window without re-running the probe across all three intent types.

- **LiteLLM Router for backend abstraction (May 2026).** Considered hand-rolling an OpenAI-vs-Ollama dispatch ourselves. LiteLLM's Router gives us model aliasing + fallback for free, and Mellea has a litellm backend if we ever need it. The shim is ~250 lines total (`app/llm.py`); the entire production code path stayed in the `ollama.chat`-shaped call signature. Don't replace this with the openai SDK directly — the failover behaviour is load-bearing.

- **Granite 4.1 is dense decoder-only.** Earlier confusion: the hybrid Mamba variants are in **Granite 4.0-H**, not 4.1. vLLM 0.17 serves 4.1 as a vanilla LLaMA-style model — no architecture risk, no special flags. If a future bump introduces a hybrid 4.x, re-verify vLLM compatibility before deploying.

- **vLLM HF chat template emits `[doc_id=X]`, Ollama Modelfile emits `[X]`.** The rest of Riprap (Mellea regex, frontend chip parser, citations footer) was written against `[X]`. `app/llm.py` runs a one-line regex normalize on every response and stream chunk. Don't remove it without changing every other consumer.

- **HF Space → AMD GPU as primary, T4 Ollama as fallback.** Considered using the HF Space's bundled Ollama as a remote inference server (proxy `/v1/chat/completions` from FastAPI to localhost:11434) so that local dev could use the T4. Rejected: T4 is slower than MI300X, surface area is bigger, and the AMD path already covers the "fast remote inference" use case. The proxy idea is recoverable in ~25 lines of FastAPI if we ever want it.

---

## Common tasks playbook

### Add a new specialist

1. Add a module under `app/context/` or `app/flood_layers/`.
2. Add an action in `app/fsm.py` (`step_yourname`) with `@action(reads=[...], writes=[...])` (skeleton sketched after this list).
3. Wire it into the FSM graph in the `Application.with_actions(...)` chain.
4. Add a doc message builder in `app/reconcile.py:build_documents()`.
5. Update `STEP_LABELS` in `web/static/agent.js` for the trace label.
6. Update `SOURCE_LABELS` / `SOURCE_URLS` / `SOURCE_VINTAGES` in `web/static/agent.js` for the chip + footer rendering.
7. Double-gate the new specialist: run the SSE probe against both `RIPRAP_LLM_PRIMARY=ollama` and `=vllm` and confirm the briefing cites the new doc_id with no Mellea regressions.

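A hypothetical skeleton for steps 1–2 (module, field and step names are invented; the `State.update` return convention follows Burr's function-based action API, so check the existing actions in `app/fsm.py` before copying):

```python
# app/context/sewersheds.py — step 1 (hypothetical specialist)
def lookup(lat: float, lon: float) -> dict:
    return {"sewershed": "NC-B", "cso_outfalls_within_500m": 2}

# app/fsm.py — step 2 (hypothetical Burr action)
from burr.core import State, action
from app.context import sewersheds

@action(reads=["geocode"], writes=["sewersheds"])
def step_sewersheds(state: State) -> State:
    g = state["geocode"]
    return state.update(sewersheds=sewersheds.lookup(g["lat"], g["lon"]))
```
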
### Prototype a new specialist (experimental)

For exploratory work that isn't yet ready to land in `app/`:

1. Scaffold `experiments/<NN>_<name>/` with its own `RESULTS.md`, smoke tests, and cached fixtures. Don't import from `app/` except `app.llm.chat` — keeps the experiment portable.
2. License-check the model: confirm Apache-2.0 / MIT / BSD on the actual `LICENSE` file in the model repo (not the HF metadata field — they sometimes disagree). Add a row to `experiments/shared/licenses.md`.
3. Validate against both `RIPRAP_LLM_PRIMARY=ollama` and `=vllm` before proposing integration. Specialist behaviour must be backend-independent — never branch on backend in specialist code.
4. Only after the experiment passes both gates and produces a demo-safe trace UI rendering, propose a PR-style summary for integration into `app/`.

### Change the briefing markdown structure

1. Edit `EXTRA_SYSTEM_PROMPT` in `app/reconcile.py`.
2. Edit `renderMarkdownPure()` in `web/svelte/src/lib/Briefing.svelte` if you add new block syntax.
3. Rebuild Svelte: `cd web/svelte && npm run build`.
4. Re-run the probe to confirm Mellea still passes.

### Tune the Mellea checks

`app/mellea_validator.py`:
- `_NUM_RE` — number recognition. Use `\b` boundaries to skip identifiers.
- `_TRIVIAL_NUMS` — set of numbers exempt from citation requirement (small integers, NYC service line numbers like 311/911).
- `_check_every_claim_cited()` — sentence-scoped; uses `_SENT_END` for boundaries.
- `_failing_sentences_for_citations()` — feeds the reroll feedback prompt with surgical corrections.

After any change here: probe across 3 intent types (above).

### Add a new Svelte component

1. Create `web/svelte/src/lib/MyComponent.svelte` with `<svelte:options customElement={{ tag: "r-mycomp", props: {...} }} />`.
2. Side-effect import it in `web/svelte/src/main.js`.
3. Mount `<r-mycomp>` in `agent.html`.
4. `cd web/svelte && npm run build`.
5. Commit `web/static/dist/riprap.js` and `riprap.js.map`.

---

## Known sharp edges

- **`build_documents` complexity (radon F=101).** It's a giant `if`/`elif` per specialist. Don't refactor pre-demo; touching it risks subtle regressions in doc message ordering, which Granite is sensitive to.

- **Static assets cache hard.** When iterating on Svelte or `agent.js`, the user must hard-reload (⌘⇧R). Cache-busting query strings are not in place.

- **Ollama keeps stale models loaded across rebuilds locally.** If you change a Modelfile or pull a new tag, restart `ollama serve` to be sure.

- **Burr FSM `iter_steps` mutates global state.** Don't run two concurrent `single_address` queries against the same uvicorn worker — strict-mode threadlocal makes it safer than it was, but there's no per-request isolation.

- **Mellea 0.3 vs 0.4 API differences.** Local venv has 0.4 (3.12), HF has 0.3 (3.10). `start_session`, `RejectionSamplingStrategy`, `MelleaSession.instruct(strategy, requirements, return_sampling_results)` are stable across both. Don't import anything from `mellea.stdlib.components.intrinsic.*` — that package only exists in 0.4 and won't import on HF.

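  Sticking to that stable surface looks roughly like this (requirement text is illustrative, and the `RejectionSamplingStrategy` import path is an assumption to verify against whichever Mellea version is in the venv):

  ```python
  from mellea import start_session
  from mellea.stdlib.sampling import RejectionSamplingStrategy  # path may differ by version

  m = start_session()  # defaults to a local Ollama-served Granite
  res = m.instruct(
      "Summarize the flood exposure in one paragraph.",
      requirements=["every non-trivial number carries a [doc_id] citation"],
      strategy=RejectionSamplingStrategy(loop_budget=2),
      return_sampling_results=True,
  )
  print(res.success, str(res.result))
  ```
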
- **HF Space sleeps after idle.** Free tier; first request after sleep is a 30–90 s cold start. Ping the space before a demo.

- **vLLM cold compile / first-call slowdown.** First few requests against a fresh `vllm serve` container can log surprisingly low throughput (single-digit tokens/s prompt, ~4 tokens/s gen on a MI300X) while ROCm kernels JIT-compile and the prefix cache warms. Subsequent requests are 30–50× faster. If a benchmark reads "vLLM is slow" on the first run, run it three more times before believing it.

- **Backend pill auto-detection.** `app/llm.py:_default_hardware_label` picks `AMD MI300X` when `RIPRAP_LLM_PRIMARY=vllm`, `NVIDIA T4` when `SPACE_ID` is set (HF Spaces injects this), `Local` otherwise. Override with `RIPRAP_HARDWARE_LABEL` / `RIPRAP_ENGINE_LABEL` if you bring up a different GPU.

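  The selection order, as a sketch (the override precedence is an assumption; confirm against `_default_hardware_label` itself):

  ```python
  import os

  def hardware_label() -> str:
      # Explicit override wins (RIPRAP_ENGINE_LABEL works the same way for
      # the engine half of the pill).
      if os.environ.get("RIPRAP_HARDWARE_LABEL"):
          return os.environ["RIPRAP_HARDWARE_LABEL"]
      if os.environ.get("RIPRAP_LLM_PRIMARY") == "vllm":
          return "AMD MI300X"
      if os.environ.get("SPACE_ID"):  # injected by HF Spaces
          return "NVIDIA T4"
      return "Local"
  ```
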
---

## Useful one-liners

```bash
# Tail the local server log
tail -f /tmp/riprap-local.log

# Inspect the live HF Space's deployed SHA + stage
curl -sf "https://huggingface.co/api/spaces/msradam/riprap-nyc/runtime" | python3 -m json.tool

# Confirm both remotes have the same HEAD
git log --oneline -1 && git ls-remote huggingface main | head -1

# Force-re-pull Granite weights locally if Ollama seems wrong
ollama rm granite4.1:8b && ollama pull granite4.1:8b

# What backend is the running server on? (live reachability + label)
curl -s http://127.0.0.1:7860/api/backend | python3 -m json.tool

# Bring up vLLM on a fresh AMD ROCm droplet (one-shot)
docker run -d --name vllm \
  --device=/dev/kfd --device=/dev/dri --group-add=video \
  --ipc=host --shm-size=16g -p 8000:8000 \
  -v /root/hf-cache:/root/.cache/huggingface \
  -e GLOO_SOCKET_IFNAME=eth0 -e VLLM_HOST_IP=127.0.0.1 \
  vllm/vllm-openai-rocm:v0.17.1 \
  --model ibm-granite/granite-4.1-8b \
  --host 0.0.0.0 --port 8000 --api-key "$TOKEN" \
  --max-model-len 8192 --served-model-name granite-4.1-8b
# Without GLOO_SOCKET_IFNAME, gloo fails to bind 0.0.0.0 and the
# engine core never initialises.

# Check what doc_ids the briefing should contain for an intent
.venv/bin/python -c "from app.reconcile import build_documents; \
  print([m['role'] for m in build_documents({'sandy':{'inside':True}, 'nyc311':{'n':5}})])"
```