mekosotto Claude Opus 4.7 (1M context) committed on
Commit 09dd9c3 · 1 Parent(s): d05fcf1

docs(spec): Day-7 final-5% design — drift, traceability, agents


Sealed architectural decisions:
- Drift state: in-process deque(maxlen=100) per worker + train-time
median/std on model._neurobridge_train_stats (joblib roundtrip-safe).
- LLM provider: OpenRouter via openai==1.51.0 SDK with deterministic
template fallback. NEUROBRIDGE_DISABLE_LLM=1 demo lifeline.

4 tasks: T1 drift, T2 MLflow badge, T3 LLM explainer + AI Assistant
tab, T4 close-out. Test growth: 165 → 175 green (+10).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs/superpowers/specs/2026-05-05-day7-drift-traceability-agents-design.md ADDED
@@ -0,0 +1,366 @@
# Day 7 — The Final 5% (Drift, Traceability & Agents) Design Spec

**Date:** 2026-05-05
**Status:** Approved by user; ready for `/superpowers:writing-plans`.
**Predecessor:** Day 6 (`docs/superpowers/plans/2026-05-04-day6-final-polish-demo-features.md`) — closed at SHA `d05fcf1`, 165 green tests.

---

## 1. Goal

Close the remaining 5% gap to a top-tier hackathon submission by hardening the two evaluation dimensions where Day 6 left visible weakness:

- **Adapt Over Time** (Slide 12, "Living Systems — Your Edge"): the system is currently static post-training. Add a drift-detection stub that compares trailing prediction confidence to the training distribution.
- **AI Lab Agents** (Slide 7, Track 1): we have no agent surface. Add a chat-style "Why?" endpoint that explains a BBB prediction in natural language using SHAP attributions and drift context.

Plus one system-quality polish (an MLflow traceability badge in the decision card) so jurors can audit *which* model produced a given prediction.

**Non-goals (YAGNI):**

- No retraining loop. Drift is *observed and reported*, not acted on.
- No conversational state, no multi-turn agent. One question → one rationale.
- No vector store or RAG corpus. The "context" is the prediction payload itself plus a small built-in literature primer.
- No new front-end framework. Stay on Streamlit; reuse Trust & Authority brand tokens.
- No new visualization library. Reuse Altair (already shipped).

---

## 2. Sealed Architectural Decisions

These are locked. The implementation plan must follow them as-is.

### 2.1 Drift state location

**Decision:** in-process `collections.deque(maxlen=100)` per FastAPI worker, plus train-time median/std baked into `model._neurobridge_train_stats: dict` (joblib-roundtrip-safe).

| Field | Type | Source |
|---|---|---|
| `model._neurobridge_train_stats["median"]` | `float` | `np.median(model.predict_proba(X_train).max(axis=1))` |
| `model._neurobridge_train_stats["std"]` | `float` | `np.std(model.predict_proba(X_train).max(axis=1))` |
| `model._neurobridge_train_stats["n_train"]` | `int` | `len(y_train)` |
| `WORKER_CONFIDENCE_DEQUE` | `deque[float]` (maxlen=100) | module-level singleton in `src/api/routes.py` |

**Z-score formula:** `drift_z = (rolling_median − train_median) / max(train_std, 1e-9)`

**Edge cases:**
- `len(deque) < 10` → `drift_z = None`, UI shows "Warming up (n/10)".
- `_neurobridge_train_stats` missing (legacy joblib) → `drift_z = None`, UI shows "Drift unavailable".

**Rejected alternatives:**
- Joblib sidecar file (B): I/O race risk, slower.
- SQLite (C): production-grade but +30 min setup overhead, not worth it for the demo budget.
- Train-time stats only without rolling window (D): kills the "trailing 100" narrative.

### 2.2 LLM provider

**Decision:** OpenRouter via the `openai==1.51.0` SDK, hybrid template fallback, kill-switch env gate.

| Setting | Value |
|---|---|
| Base URL | `https://openrouter.ai/api/v1` |
| Default model | `meta-llama/llama-3.2-3b-instruct:free` |
| Auth | `OPENROUTER_API_KEY` env var |
| Lifeline gate | `NEUROBRIDGE_DISABLE_LLM=1` → force template path |
| Timeout | 8 seconds (HTTP request) |
| Max tokens | 256 (response cap) |
| Temperature | 0.3 (deterministic-ish; jury demos must be predictable) |

**Fallback chain (in order):**

1. `NEUROBRIDGE_DISABLE_LLM=1` set → template
2. `OPENROUTER_API_KEY` not set → template
3. `openai` SDK raises `APIConnectionError` / `APITimeoutError` / `RateLimitError` → log warning, template
4. LLM returns empty / malformed response → log warning, template
5. Otherwise → LLM rationale

**Response contract (always populated):**

```python
{"rationale": str, "source": "llm" | "template", "model": str | None}
```

`source` makes the auditing story crisp: "this rationale came from the deterministic template" vs "this came from llama-3.2-3b". Jurors can verify reproducibility.

**Rejected alternatives:**
- Anthropic API (C): no key available.
- Local Ollama (B): demo-day install/load risk too high.
- Pure deterministic template (A): kills the "AI Lab Agents" narrative.
- Pure LLM with no fallback: demo-day network failure = total failure.

---

## 3. Component Design

### 3.1 Drift layer

**Files touched:**

- `src/models/bbb_model.py` — extend `train()` to compute and stash `_neurobridge_train_stats`.
- `src/api/schemas.py` — add `drift_z: float | None` and `rolling_n: int` to `BBBPredictResponse`.
- `src/api/routes.py` — module-level deque, helper `_compute_drift_z(model, confidence) -> tuple[float | None, int]`, wire into `predict_bbb`.
- `src/frontend/app.py` — add a drift line to `_render_prediction_card` between the calibration caption and the SHAP section.

**Boundary contracts:**

- `bbb_model._compute_train_stats(model, X_train, y_train) -> dict` (private helper, mirrors `_compute_calibration_bins` shape).
- `routes._compute_drift_z(model, confidence) -> tuple[float | None, int]` — returns `(drift_z, len_after_append)`. Side effect: appends to the module-level deque.
- The deque is module-level so it survives across requests but resets per worker restart. This is acceptable: drift is a *demo-day signal*, not a production audit trail.

**Streamlit rendering rule:**

```
if drift_z is None and rolling_n < 10:
    show "Drift: warming up ({rolling_n}/10)"
elif drift_z is None:
    show "Drift: unavailable (no train-time stats)"
else:
    show "Drift: trailing-100 median is {drift_z:+.2f}σ from training distribution"
```

### 3.2 MLflow traceability badge

**Files touched:**

- `src/api/schemas.py` — add `provenance: ModelProvenance | None` (new schema) to `BBBPredictResponse`.
- `src/api/routes.py` — read MLflow run metadata once at module load (cached); attach to every `/predict/bbb` response.
- `src/frontend/app.py` — render the badge in `_render_prediction_card` near the top of the card.

**`ModelProvenance` schema:**

```python
class ModelProvenance(BaseModel):
    mlflow_run_id: str | None = None
    model_version: str = "v1"      # bumped manually per train cycle
    train_date: str | None = None  # ISO 8601, from MLflow run start_time
    n_examples: int | None = None  # from model._neurobridge_train_stats["n_train"]
```

**Lookup logic (one-time per process startup, then cached):**

1. Try `mlflow.search_runs(experiment_names=["bbb_pipeline"], max_results=1, order_by=["start_time DESC"])`.
2. If found → populate `mlflow_run_id`, `train_date`.
3. If not found, or `NEUROBRIDGE_DISABLE_MLFLOW=1` → all fields stay None except the hardcoded `model_version="v1"`.
4. `n_examples` comes from `model._neurobridge_train_stats["n_train"]` (set in T1A).

The badge is purely informational — the API still works without MLflow, just shows "Provenance unavailable" in the UI.

### 3.3 LLM explainer

**New file:** `src/llm/explainer.py`

Public surface:

```python
def explain(payload: ExplainPayload) -> ExplainResult:
    """Return a natural-language rationale for a BBB prediction.

    Falls back to a deterministic template when the LLM is unavailable.
    Never raises — always returns a usable rationale.
    """
```

Where `ExplainPayload` is a typed dict with: `smiles`, `label_text`, `confidence`, `top_features` (list of `{feature, shap_value}`), `calibration` (optional), `drift_z` (optional).

**Internal structure:**

```
explain(payload)
├── _should_use_llm() → bool            # gates: env flag, key, etc.
├── _llm_explain(payload) → str | None  # OpenRouter call, returns None on any failure
├── _template_explain(payload) → str    # always-available deterministic path
└── compose ExplainResult with source/model fields
```

**Template (deterministic, jury-friendly):**

The template stitches together:
1. Sentence 1: "Predicted **{label_text}** with {confidence*100:.0f}% confidence."
2. Sentence 2 (if calibration): "Calibration: predictions in the ≥{threshold}% bin are correct {precision}% of the time on held-out data (n={support})."
3. Sentence 3: "Top SHAP attributions toward this label: {feat_1} (Δ{shap_1:+.3f}), {feat_2} (Δ{shap_2:+.3f}), {feat_3} (Δ{shap_3:+.3f})."
4. Sentence 4 (if drift_z): "Drift signal: trailing-100 confidence median is {drift_z:+.2f}σ from training distribution; {interpretation}."
   - interpretation: `|drift_z| < 1` → "within expected range"; `1 ≤ |drift_z| < 2` → "mild distribution shift"; `|drift_z| ≥ 2` → "significant shift, retrain recommended".

The template is auditable: every word is derived from numeric inputs. Useful for jurors who challenge "is this actually using the model output?".

**LLM prompt (single-shot, no system-message clutter):**

```
You are a clinical-ML explainer for a B2B blood-brain-barrier permeability tool.
Given the prediction details below, write a 2–4 sentence rationale a researcher
could paste into a paper. Use the SHAP attributions to justify the verdict.
Mention drift if abnormal. Avoid hedging; be specific about the numbers.

Prediction:
- SMILES: {smiles}
- Verdict: {label_text} ({confidence*100:.0f}% confident)
- Top SHAP features (positive = pushed toward verdict):
{top_features_bulleted}
- Drift z-score: {drift_z}

Respond with the rationale only, no preamble.
```

### 3.4 `POST /explain/bbb` route

**Files touched:**

- `src/api/schemas.py` — `BBBExplainRequest`, `BBBExplainResponse`.
- `src/api/routes.py` — register the new endpoint on a dedicated router (see the routing decision below).

**Routing decision:** a new `explain_router` with prefix `/explain` → final URL `POST /explain/bbb`. Mounted on the FastAPI app alongside the existing `router` (prefix `/pipeline`) and `predict_router` (prefix `/predict`). This mirrors the prediction surface symmetrically (`/predict/bbb` ↔ `/explain/bbb`) and leaves room for `/explain/eeg` and `/explain/mri` later without restructuring.

**Request:**

```python
class BBBExplainRequest(BaseModel):
    smiles: str
    label: int
    label_text: str
    confidence: float
    top_features: list[FeatureAttribution]
    calibration: CalibrationContext | None = None
    drift_z: float | None = None
```

**Response:**

```python
class BBBExplainResponse(BaseModel):
    rationale: str
    source: str               # "llm" | "template"
    model: str | None = None  # LLM model name when source="llm"
```

**Error cases:**

- Empty `top_features` → 400 (a real prediction always has SHAP attributions).
- Otherwise → 200 always (the explainer never raises; the template fallback ensures success).
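The status rule is a single guard. A framework-free sketch (the real handler is a FastAPI route on `explain_router`; `explain_fn` and `handle_explain_bbb` are hypothetical names standing in for §3.3's `explain` and the route body):

```python
def handle_explain_bbb(request: dict, explain_fn) -> tuple[int, dict]:
    """Status rule for POST /explain/bbb.

    `explain_fn` never raises (template fallback), so the only
    non-200 outcome is the empty-SHAP guard."""
    if not request.get("top_features"):
        return 400, {"detail": "top_features must be non-empty"}
    result = explain_fn(request)  # always returns a usable rationale
    return 200, {"rationale": result["rationale"],
                 "source": result["source"],
                 "model": result.get("model")}
```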

### 3.5 Streamlit "AI Assistant" tab

**File touched:** `src/frontend/app.py`

**Layout:**

```
┌──────────────────────────────────────────────────────────────┐
│ AI Assistant — explain the last BBB prediction               │
├──────────────────────────────────────────────────────────────┤
│ [Last prediction card preview: label, confidence, top-3 SHAP]│
│                                                              │
│ Pre-canned questions (st.selectbox):                         │
│   • Why was this molecule predicted as permeable?            │
│   • Which features pushed the verdict the most?              │
│   • Is the prediction trustworthy given drift?               │
│                                                              │
│ [Custom question text_input — optional]                      │
│                                                              │
│ [Ask the AI Assistant — primary button]                      │
├──────────────────────────────────────────────────────────────┤
│ Response card:                                               │
│   "{rationale}"                                              │
│   Source: {llm | template} · Model: {model or "—"}           │
└──────────────────────────────────────────────────────────────┘
```

**Question routing into the prompt:** the user's selected/typed question is **not** sent to the LLM as a separate field. It is *appended to the prompt as a "User question:" line* before the closing instruction. This keeps the response contract (`{rationale, source, model}`) identical regardless of question, and lets the deterministic template ignore the question entirely (the template always answers the meta-question "explain this prediction", which subsumes all three pre-canned questions). For a custom question that diverges far from the canned three, the LLM path will adapt; the template path will give the same generic SHAP-driven rationale. Acceptable trade-off for Day 7.

**Session state:**

- `st.session_state["last_bbb_prediction"]` — populated by `_render_prediction_card` after every successful BBB predict (stores the entire `/predict/bbb` response dict).
- `st.session_state["explain_history"]` — list of `(question, response)` tuples; rendered in reverse-chronological order.
- If `last_bbb_prediction` is None, show the empty state: "Run a BBB prediction first to enable the AI Assistant."

**No multi-turn conversation.** Each question is independent; history is visible but not fed back into subsequent prompts.

---

## 4. Test Plan

| Suite | New Tests | What they cover |
|---|---|---|
| `tests/models/test_bbb_model.py` | +2 | `_neurobridge_train_stats` attribute presence, joblib roundtrip |
| `tests/api/test_routes.py` (T1B) | +2 | `drift_z` and `rolling_n` in `/predict/bbb` body; deque rolls (101st predict drops 1st) |
| `tests/api/test_routes.py` (T2) | +1 | `provenance` field in `/predict/bbb` response (smoke — fields can be None) |
| `tests/llm/test_explainer.py` (new dir) | +4 | (a) template path returns deterministic rationale; (b) template includes top feature names; (c) template includes `label_text`; (d) `NEUROBRIDGE_DISABLE_LLM=1` forces template even with key set |
| `tests/api/test_routes.py` (T3B) | +1 | `POST /explain/bbb` 200 happy path with template source |
| **Total** | **+10** | **165 → 175 green** |

**LLM integration tests (env-gated, NOT counted in the 175):**

- `tests/llm/test_explainer_integration.py` — marked `@pytest.mark.llm_integration`; runs only when `RUN_LLM_TESTS=1` is set. Verifies a real OpenRouter round-trip. Default: skip.

**TDD discipline:** for T1A, T1B, T3A, and T3B: write the new tests, watch them fail (RED), then implement. T1C, T2, and T3C are UI-only or thin glue; they are covered by import smoke tests and the existing assertion extensions.

---

## 5. New Dependency

`openai==1.51.0` — added to `requirements.txt`. ~600 KB, minimal transitive dependencies (`httpx`, `pydantic`, `typing_extensions` — all already present). Pinned to 1.51.0 because that is a known-stable version with the OpenRouter-compatible client interface.

No other new pip deps. Streamlit, Altair, sklearn, RDKit, MNE, nibabel, and MLflow stay at their current pins.

---

## 6. Failure Modes & Lifelines

| Failure | Detection | Lifeline |
|---|---|---|
| OpenRouter rate-limit during demo | HTTP 429 from SDK | Auto-fallback to template; log warning |
| OpenRouter network outage | `APIConnectionError` | Auto-fallback to template |
| API key revoked / typo'd | HTTP 401 | Auto-fallback to template |
| Demo runner forgot key | `os.environ.get("OPENROUTER_API_KEY") is None` | Auto-fallback to template |
| User wants to force template (e.g., for reproducibility) | `NEUROBRIDGE_DISABLE_LLM=1` | Hard gate, never calls LLM |
| Drift deque accumulates noise across worker lifetime | n/a | Worker restart clears state; demo runner can `pkill -f uvicorn && uvicorn …` between dry-runs. Documented in README's "Day 7 — Demo Recipe". |
| `_neurobridge_train_stats` missing on legacy model | `getattr(model, ..., None) is None` | `drift_z=None`, UI hedge string |
| MLflow store unreachable | `mlflow.search_runs` raises | `provenance` fields all None; UI shows "Provenance unavailable" |

---

## 7. Risks & Mitigations

- **Risk:** Streamlit session-state quirks may lose `last_bbb_prediction` across reruns.
  **Mitigation:** Use `st.session_state` (persistent across reruns within a session). Test by clicking Predict → switching to the AI Assistant tab → verifying the last prediction is visible.

- **Risk:** The OpenRouter free model returns garbage for chemistry questions.
  **Mitigation:** Tightly scoped prompt (2–4 sentences, no preamble). Worst case the rationale is verbose but harmless; the source label tells jurors it came from the LLM, not the deterministic path.

- **Risk:** The new `openai` dep conflicts with the existing `httpx` pin.
  **Mitigation:** `openai==1.51.0` accepts `httpx>=0.23,<1.0`; we already pin `httpx==0.27.2`. Compatible. Verify with `pip check` after install.

- **Risk:** The reset-drift endpoint adds attack surface.
  **Mitigation:** It is a POST that clears in-process state on the API server; no auth needed for a hackathon demo, but it is documented as "demo only" in the OpenAPI description.

- **Risk:** The MLflow lookup at module load slows API startup.
  **Mitigation:** Wrap it in try/except; on any error, set `_PROVENANCE_CACHE = None` and continue. Lazy-evaluate per request only if the cache is None. Time-bound the lookup to 2 seconds.

---

## 8. Definition of Done

- ✅ `pytest -q` reports **175 passed**.
- ✅ `pytest -W error::UserWarning tests/` passes with `UserWarning`s promoted to errors.
- ✅ `pytest tests/llm/ -v` passes (template path, 4 tests).
- ✅ `streamlit run src/frontend/app.py` boots without ImportError; the AI Assistant tab is visible.
- ✅ `curl POST /predict/bbb {smiles: "CCO"}` returns a body with `drift_z`, `rolling_n`, and `provenance` keys.
- ✅ `curl POST /explain/bbb {…}` returns 200 with `{rationale, source: "template"}` when `NEUROBRIDGE_DISABLE_LLM=1`.
- ✅ With a real `OPENROUTER_API_KEY` set and the flag unset, `source: "llm"` and `model: "meta-llama/llama-3.2-3b-instruct:free"`.
- ✅ The Streamlit BBB decision card shows: confidence progress + calibration caption + drift line + MLflow badge + SHAP bars (in that order).
- ✅ The AI Assistant tab can ask "Why permeable?" and render a rationale (from either source).
- ✅ AGENTS.md §10 (Drift Surface) and §11 (LLM Explainer Surface) committed.
- ✅ README has a "Day 7 — Demo Recipe" section with two `curl` invocations.
- ✅ The final commit ledger has 5 commits: T1, T2, T3A, T3B+T3C, T4 (or finer granularity, but at least one commit per task boundary).

---

## 9. Out of Scope (deferred to "someday")

- Multi-turn conversation memory.
- Per-user drift profiles (currently shared across all clients of one worker).
- A retraining trigger when `|drift_z| > 2` for N consecutive predictions.
- Vector-store RAG over actual chemistry literature.
- LLM rationale streaming (Streamlit chat-style typewriter).
- Provenance signing / cryptographic audit trail.
- Drift-state persistence across worker restarts.

These are recognized but explicitly not Day 7. Doing any of them would either blow the time budget or shift the demo focus away from the four sealed tasks.