mekosotto Claude Opus 4.7 (1M context) committed on
Commit c4a01f0 · 1 Parent(s): 09dd9c3

docs(plan): Day-7 implementation plan — drift, traceability, agents

8 task-level checkpoints (T1A model stats, T1B API drift, T1C UI drift,
T2 MLflow badge, T3A LLM explainer, T3B /explain/bbb, T3C AI Assistant
tab, T4 close-out) → 165 → 175 green. TDD discipline (RED → GREEN) for
every test-bearing task. Self-review pass clean: spec coverage 100%,
no placeholders, type names consistent across tasks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs/superpowers/plans/2026-05-05-day7-drift-traceability-agents.md ADDED
@@ -0,0 +1,1832 @@
1
+ # Day 7 — The Final 5% (Drift, Traceability & Agents) Implementation Plan
2
+
3
+ > **For agentic workers:** REQUIRED SUB-SKILL: Use `superpowers:subagent-driven-development` (recommended) or `superpowers:executing-plans` to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
4
+
5
+ **Goal:** Close the "Adapt Over Time" gap and add a Track-1 "AI Lab Agents" surface (chat-style explainer) without breaking the 165-test green floor. Test target: **165 → 175 passed** (+10 new tests).
6
+
7
+ **Architecture:** Drift = train-time stats baked into `model._neurobridge_train_stats` + module-level `collections.deque(maxlen=100)` per FastAPI worker. LLM explainer = thin abstraction (`src/llm/explainer.py`) with deterministic-template fallback and OpenRouter (via `openai==1.51.0` SDK) hybrid. Hard kill-switch: `NEUROBRIDGE_DISABLE_LLM=1` forces template path. Spec source of truth: [docs/superpowers/specs/2026-05-05-day7-drift-traceability-agents-design.md](docs/superpowers/specs/2026-05-05-day7-drift-traceability-agents-design.md) (commit `09dd9c3`).
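A minimal sketch of how the kill-switch could gate backend selection. The helper name `select_explainer_backend` and the `OPENROUTER_API_KEY` variable are assumptions made for illustration; the plan only names `NEUROBRIDGE_DISABLE_LLM`.

```python
from __future__ import annotations

import os
from collections.abc import Mapping


def select_explainer_backend(env: Mapping[str, str] | None = None) -> str:
    """Return "template" or "openrouter" (hypothetical helper, not from the plan).

    NEUROBRIDGE_DISABLE_LLM=1 always wins. Without a key we also fall back
    to the deterministic template, so the caller always has a usable path.
    """
    env = os.environ if env is None else env
    if env.get("NEUROBRIDGE_DISABLE_LLM") == "1":
        return "template"
    # OPENROUTER_API_KEY is an assumed variable name, used for illustration only.
    if not env.get("OPENROUTER_API_KEY"):
        return "template"
    return "openrouter"
```

Passing the environment as an argument keeps the decision testable without mutating `os.environ`.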
8
+
9
+ **Tech Stack:** Python 3.12 · sklearn 1.5.1 (existing) · FastAPI + Pydantic (existing) · Streamlit + altair (existing) · MLflow 2.16.0 (existing) · **`openai==1.51.0` (NEW pip dep)**.
10
+
11
+ ---
12
+
13
+ ## File Structure
14
+
15
+ ```
+ src/
+ ├── models/
+ │   └── bbb_model.py          # MODIFY — T1A: stash _neurobridge_train_stats
+ ├── api/
+ │   ├── schemas.py            # MODIFY — T1B: drift_z + rolling_n; T2: ModelProvenance; T3B: BBBExplainRequest/Response
+ │   └── routes.py             # MODIFY — T1B: deque + drift helper; T2: provenance lookup; T3B: explain_router + /explain/bbb
+ ├── llm/                      # NEW dir
+ │   ├── __init__.py           # CREATE
+ │   └── explainer.py          # CREATE — T3A: explain() public API + template + openrouter
+ └── frontend/
+     └── app.py                # MODIFY — T1C: drift line; T2: provenance badge; T3C: AI Assistant tab
+
+ tests/
+ ├── models/
+ │   └── test_bbb_model.py     # MODIFY — T1A: TestTrainStatsMetadata (+2)
+ ├── api/
+ │   └── test_routes.py        # MODIFY — T1B: extend TestBBBPredictRoute (+2); T2: extend with provenance (+1); T3B: TestExplainBBBRoute (+1)
+ └── llm/                      # NEW dir
+     ├── __init__.py           # CREATE
+     └── test_explainer.py     # CREATE — T3A: TestTemplateExplain (+4)
+
+ requirements.txt              # MODIFY — add openai==1.51.0
+ AGENTS.md                     # MODIFY — T4: §10 Drift Surface, §11 LLM Explainer Surface
+ README.md                     # MODIFY — T4: Day 7 row + curl recipe
+ ```
41
+
42
+ **Test count growth:** 2 (T1A) + 2 (T1B drift) + 1 (T2 provenance) + 4 (T3A template) + 1 (T3B route) = **+10 → 175 passed**.
43
+
44
+ ---
45
+
46
+ ## Pre-Flight Verification
47
+
48
+ - [ ] **Step 0: Confirm clean baseline**
49
+
50
+ ```bash
51
+ cd /Users/mertgungor/Desktop/hackathon
52
+ source .venv312/bin/activate
53
+ git status # Expect: clean tree on main
54
+ git log --oneline -1 # Expect: 09dd9c3 docs(spec): Day-7 final-5% design …
55
+ pytest -q 2>&1 | tail -3 # Expect: 165 passed
56
+ ```
57
+
58
+ If any of these fail, STOP and resolve before proceeding.
59
+
60
+ ---
61
+
62
+ ## Task 1A — Train-Time Stats Metadata
63
+
64
+ **Why:** Drift z-score requires a frozen "training distribution" reference (median + std of the model's own confidence on the train set). We bake this into the joblib artifact alongside the existing `_neurobridge_calibration` and `_neurobridge_fp_cols` so it survives save/load.
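As a worked illustration of the reference statistics, the sketch below computes the median and population standard deviation of the per-row maximum probability using only the standard library. The function name `confidence_stats` and the toy probabilities are made up for this example; the real helper in Step 3 works on `predict_proba` output with NumPy.

```python
from __future__ import annotations

import statistics


def confidence_stats(proba_rows: list[list[float]]) -> dict[str, float | int]:
    """Median and population std of the per-row max probability."""
    if not proba_rows:
        return {"median": 0.0, "std": 0.0, "n_train": 0}
    confidence = [max(row) for row in proba_rows]
    return {
        "median": statistics.median(confidence),
        # pstdev is the population std (ddof=0), the same default as numpy's np.std.
        "std": statistics.pstdev(confidence),
        "n_train": len(confidence),
    }


# Toy predict_proba output for four training rows (made-up numbers).
toy_proba = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
stats = confidence_stats(toy_proba)
```

Here the confidences are 0.9, 0.8, 0.6 and 0.7, so the median is 0.75 and the population standard deviation is the square root of 0.0125, roughly 0.112.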
65
+
66
+ **Files:**
67
+ - Modify: `src/models/bbb_model.py`
68
+ - Modify: `tests/models/test_bbb_model.py`
69
+
70
+ ### Step 1: Write the 2 failing tests (RED)
71
+
72
+ - [ ] Append a new `TestTrainStatsMetadata` class at the end of `/Users/mertgungor/Desktop/hackathon/tests/models/test_bbb_model.py` (after `TestCalibrationMetadata`):
73
+
74
+ ```python
+ class TestTrainStatsMetadata:
+     """Day 7 — T1A: train()-time confidence distribution stash."""
+
+     def test_train_attaches_train_stats_attribute(self, trained_model_and_features):
+         model, _ = trained_model_and_features
+         assert hasattr(model, "_neurobridge_train_stats")
+         stats = model._neurobridge_train_stats
+         assert isinstance(stats, dict)
+         for key in ("median", "std", "n_train"):
+             assert key in stats, f"missing key {key!r} in train stats"
+         assert 0.0 <= stats["median"] <= 1.0
+         assert stats["std"] >= 0.0
+         assert stats["n_train"] >= 1
+
+     def test_train_stats_survives_save_load_roundtrip(
+         self, trained_model_and_features, tmp_path: Path,
+     ):
+         from src.models import bbb_model
+         model, _ = trained_model_and_features
+         path = tmp_path / "m.joblib"
+         bbb_model.save(model, path)
+         reloaded = bbb_model.load(path)
+         assert hasattr(reloaded, "_neurobridge_train_stats")
+         assert reloaded._neurobridge_train_stats == model._neurobridge_train_stats
+ ```
100
+
101
+ ### Step 2: Run the new tests — verify RED
102
+
103
+ - [ ] Run only these tests:
104
+
105
+ ```bash
106
+ pytest tests/models/test_bbb_model.py::TestTrainStatsMetadata -v
107
+ ```
108
+ Expected: **2 failed** with `AssertionError: hasattr(model, '_neurobridge_train_stats')` (or similar). If they pass, STOP — the attribute already exists somewhere unexpected.
109
+
110
+ ### Step 3: Implement `_compute_train_stats` and wire into `train()` (GREEN)
111
+
112
+ - [ ] Open `/Users/mertgungor/Desktop/hackathon/src/models/bbb_model.py`. Add this private helper immediately above `def train(`:
113
+
114
+ ```python
+ def _compute_train_stats(
+     model: RandomForestClassifier,
+     X_train: np.ndarray,
+ ) -> dict[str, float | int]:
+     """Compute median + std of the model's own confidence on the training set.
+
+     Used as the reference distribution for runtime drift detection. All values
+     are plain Python floats (plus an int count), so the dict is
+     joblib-roundtrip-safe and JSON-serializable.
+     """
+     if len(X_train) == 0:
+         return {"median": 0.0, "std": 0.0, "n_train": 0}
+     proba = model.predict_proba(X_train)
+     confidence = proba.max(axis=1)
+     return {
+         "median": float(np.median(confidence)),
+         "std": float(np.std(confidence)),
+         "n_train": int(len(X_train)),
+     }
+ ```
134
+
135
+ - [ ] In `train()`, immediately after the existing line `model._neurobridge_calibration = _compute_calibration_bins(model, X_test, y_test)`, add:
136
+
137
+ ```python
138
+ model._neurobridge_train_stats = _compute_train_stats(model, X_train)
139
+ ```
140
+
141
+ - [ ] Update the existing `logger.info(...)` line at the end of `train()` to also surface the train-stats summary:
142
+
143
+ Replace:
144
+ ```python
145
+ logger.info(
146
+ "Trained BBB classifier: n=%d, n_features=%d, classes=%s, "
147
+ "calibration_bins=%d",
148
+ len(y), X.shape[1], model.classes_.tolist(),
149
+ len(model._neurobridge_calibration),
150
+ )
151
+ ```
152
+ With:
153
+ ```python
154
+ logger.info(
155
+ "Trained BBB classifier: n=%d, n_features=%d, classes=%s, "
156
+ "calibration_bins=%d, train_confidence_median=%.3f",
157
+ len(y), X.shape[1], model.classes_.tolist(),
158
+ len(model._neurobridge_calibration),
159
+ model._neurobridge_train_stats["median"],
160
+ )
161
+ ```
162
+
163
+ ### Step 4: Run the new tests — verify GREEN
164
+
165
+ - [ ] Run:
166
+
167
+ ```bash
168
+ pytest tests/models/test_bbb_model.py::TestTrainStatsMetadata -v
169
+ ```
170
+ Expected: **2 passed**.
171
+
172
+ ### Step 5: Run the full suite — verify no regression
173
+
174
+ - [ ] Run:
175
+
176
+ ```bash
177
+ pytest -q 2>&1 | tail -3
178
+ ```
179
+ Expected: **167 passed** (165 + 2 new).
180
+
181
+ If any pre-existing test fails, the prime suspect is a model-equality assert that now fails because `_neurobridge_train_stats` was added. Read the failure; if it's a `model == reloaded_model` style check, update the assertion to `model._neurobridge_fp_cols == reloaded._neurobridge_fp_cols and model._neurobridge_calibration == reloaded._neurobridge_calibration and model._neurobridge_train_stats == reloaded._neurobridge_train_stats`. **Do not weaken assertions; expand them.**
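One way to expand such an assertion without repeating the attribute list in every test is a small comparison helper. This is a sketch only: the helper name `assert_metadata_roundtrip` is hypothetical, and the stand-in objects below (including the shape of the calibration entry) are made up so the example runs without a trained model.

```python
from types import SimpleNamespace

_METADATA_ATTRS = (
    "_neurobridge_fp_cols",
    "_neurobridge_calibration",
    "_neurobridge_train_stats",
)


def assert_metadata_roundtrip(original, reloaded) -> None:
    """Hypothetical test helper: compare each stashed attribute explicitly."""
    for name in _METADATA_ATTRS:
        assert hasattr(reloaded, name), f"reloaded model lost {name}"
        assert getattr(original, name) == getattr(reloaded, name), name


# Stand-in objects; a real test would pass the model before and after save/load.
before = SimpleNamespace(
    _neurobridge_fp_cols=["fp_0", "fp_1"],
    _neurobridge_calibration=[{"bin": 0, "accuracy": 0.9}],  # made-up shape
    _neurobridge_train_stats={"median": 0.8, "std": 0.1, "n_train": 4},
)
after = SimpleNamespace(**vars(before))
assert_metadata_roundtrip(before, after)
```

Adding a future metadata attribute then means extending `_METADATA_ATTRS` once, which keeps the assertion strict rather than weakened.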
182
+
183
+ ### Step 6: Commit T1A
184
+
185
+ - [ ] Run:
186
+
187
+ ```bash
188
+ git add src/models/bbb_model.py tests/models/test_bbb_model.py
189
+ git commit -m "$(cat <<'EOF'
190
+ feat(models): train-time confidence stats stashed on _neurobridge_train_stats
191
+
192
+ - _compute_train_stats() captures median, std, n_train of the model's
193
+ own predict_proba on X_train. Joblib-roundtrip-safe.
194
+ - train() persists stats alongside _neurobridge_fp_cols and
195
+ _neurobridge_calibration. INFO log line now surfaces the median.
196
+ - Foundation for Day-7 T1B drift z-score in /predict/bbb.
197
+ - 2 new tests (TestTrainStatsMetadata): attribute presence + roundtrip.
198
+
199
+ Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
200
+ EOF
201
+ )"
202
+ ```
203
+
204
+ ---
205
+
206
+ ## Task 1B — Drift z-score in /predict/bbb
207
+
208
+ **Why:** Surface "Adapt Over Time" to the jury. Each prediction's confidence is appended to a per-worker `deque(maxlen=100)`. When ≥10 samples are buffered, we compute a z-score against the train-time median. The number flows through the API response into the UI (T1C) and the LLM explainer (T3A).
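The warm-up rule and the z-score arithmetic can be sketched in isolation as follows. The reference values in `train_stats` are made up, and `push_and_score` is an illustrative name rather than the helper added in Step 4.

```python
from __future__ import annotations

import statistics
from collections import deque

MIN_SAMPLES = 10  # warm-up threshold from the plan
window: deque[float] = deque(maxlen=100)
train_stats = {"median": 0.80, "std": 0.10}  # made-up reference values


def push_and_score(confidence: float) -> float | None:
    """Append one confidence; return the z-score once the window is warm."""
    window.append(confidence)
    if len(window) < MIN_SAMPLES:
        return None
    rolling_median = statistics.median(window)
    return (rolling_median - train_stats["median"]) / max(train_stats["std"], 1e-9)


# Ten predictions, all at confidence 0.60.
scores = [push_and_score(0.60) for _ in range(10)]
```

The first nine calls return `None` (warming up). The tenth returns approximately -2, since the rolling median of 0.60 sits two train-time standard deviations below the reference median of 0.80.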
209
+
210
+ **Files:**
211
+ - Modify: `src/api/schemas.py`
212
+ - Modify: `src/api/routes.py`
213
+ - Modify: `tests/api/test_routes.py`
214
+
215
+ ### Step 1: Extend `BBBPredictResponse` schema
216
+
217
+ - [ ] Open `/Users/mertgungor/Desktop/hackathon/src/api/schemas.py`. Find the `BBBPredictResponse` class (it currently has `label`, `label_text`, `confidence`, `top_features`, `calibration`). Add two new optional fields:
218
+
219
+ ```python
+ class BBBPredictResponse(BaseModel):
+     """Decision-system payload: prediction + uncertainty + explanation + drift."""
+     label: int
+     label_text: str = Field(..., description="'permeable' or 'non-permeable'")
+     confidence: float
+     top_features: list[FeatureAttribution]
+     calibration: CalibrationContext | None = None
+     drift_z: float | None = Field(
+         None,
+         description=(
+             "Z-score of the trailing-100 confidence median against the "
+             "train-time median; None when warming up (<10 samples) or "
+             "when the model lacks _neurobridge_train_stats."
+         ),
+     )
+     rolling_n: int = Field(
+         0,
+         description=(
+             "Number of confidence samples currently buffered in the worker's "
+             "rolling window (max 100). Zero on a fresh worker."
+         ),
+     )
+ ```
243
+
244
+ ### Step 2: Write the 2 failing tests (RED)
245
+
246
+ - [ ] Open `/Users/mertgungor/Desktop/hackathon/tests/api/test_routes.py`. Find the `TestBBBPredictRoute` class. Add two NEW test methods inside that class (place them after the existing `test_returns_200_with_prediction_and_attributions`):
247
+
248
+ ```python
+     def test_predict_response_includes_drift_z_and_rolling_n(
+         self, _set_bbb_model_path,
+     ):
+         """T1B: drift_z and rolling_n keys must always appear in the body."""
+         # Reset deque before this test so rolling_n starts deterministic.
+         from src.api import routes
+         routes.WORKER_CONFIDENCE_DEQUE.clear()
+
+         resp = client.post("/predict/bbb", json={"smiles": "CCO", "top_k": 5})
+         assert resp.status_code == 200, resp.text
+         body = resp.json()
+         assert "drift_z" in body
+         assert "rolling_n" in body
+         # First request: buffer has 1 sample (just appended), so warming up.
+         assert body["rolling_n"] == 1
+         assert body["drift_z"] is None  # <10 samples = warming up
+
+     def test_predict_deque_rolls_at_100(self, _set_bbb_model_path):
+         """T1B: after 100 predictions, deque caps at maxlen=100 (rolls)."""
+         from src.api import routes
+         routes.WORKER_CONFIDENCE_DEQUE.clear()
+         # Fire 105 calls; final rolling_n must be 100, not 105.
+         last_body = None
+         for _ in range(105):
+             resp = client.post(
+                 "/predict/bbb", json={"smiles": "CCO", "top_k": 3},
+             )
+             assert resp.status_code == 200
+             last_body = resp.json()
+         assert last_body["rolling_n"] == 100
+         # By call 105, drift_z is computable (≥10 samples) — assert numeric.
+         assert isinstance(last_body["drift_z"], float)
281
+ ```
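The rolling behaviour that the second test relies on comes from `collections.deque` itself, independent of the API. A short standalone check:

```python
from collections import deque

window = deque(maxlen=100)
for i in range(105):
    window.append(i)

print(len(window))   # 100
print(window[0])     # 5: entries 0..4 were evicted from the left
print(window[-1])    # 104
```

Once a bounded deque is full, each `append` silently discards the oldest item from the opposite end, which is why `rolling_n` is expected to stop at 100.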
282
+
283
+ ### Step 3: Run the new tests — verify RED
284
+
285
+ - [ ] Run:
286
+
287
+ ```bash
288
+ pytest tests/api/test_routes.py::TestBBBPredictRoute::test_predict_response_includes_drift_z_and_rolling_n -v
289
+ pytest tests/api/test_routes.py::TestBBBPredictRoute::test_predict_deque_rolls_at_100 -v
290
+ ```
291
+ Expected: both **FAIL** — the deque doesn't exist yet (`AttributeError: module 'src.api.routes' has no attribute 'WORKER_CONFIDENCE_DEQUE'`).
292
+
293
+ ### Step 4: Implement deque + drift helper + wire into `predict_bbb` (GREEN)
294
+
295
+ - [ ] Open `/Users/mertgungor/Desktop/hackathon/src/api/routes.py`. Add `from collections import deque` to the imports (alphabetical order):
296
+
297
+ ```python
298
+ from collections import deque
299
+ ```
300
+
301
+ - [ ] Below the `_bbb_model_path()` helper (which follows the `_DEFAULT_BBB_MODEL_PATH = Path(...)` line), add the module-level deque + helper:
302
+
303
+ ```python
+ # Per-worker rolling window of recent prediction confidences.
+ # Cleared on worker restart; multi-worker setups have independent windows.
+ WORKER_CONFIDENCE_DEQUE: deque[float] = deque(maxlen=100)
+ _DRIFT_MIN_SAMPLES = 10
+
+
+ def _compute_drift_z(model, confidence: float) -> tuple[float | None, int]:
+     """Append `confidence` to the worker deque and compute the drift z-score.
+
+     Returns (drift_z, rolling_n). drift_z is None until both:
+       (1) the deque has at least `_DRIFT_MIN_SAMPLES` samples, AND
+       (2) the model has `_neurobridge_train_stats` attached.
+
+     z = (rolling_median - train_median) / max(train_std, 1e-9)
+     """
+     import statistics
+
+     WORKER_CONFIDENCE_DEQUE.append(float(confidence))
+     rolling_n = len(WORKER_CONFIDENCE_DEQUE)
+     stats = getattr(model, "_neurobridge_train_stats", None)
+     if rolling_n < _DRIFT_MIN_SAMPLES or stats is None:
+         return None, rolling_n
+     rolling_median = statistics.median(WORKER_CONFIDENCE_DEQUE)
+     train_median = float(stats["median"])
+     train_std = max(float(stats["std"]), 1e-9)
+     drift_z = (rolling_median - train_median) / train_std
+     return float(drift_z), rolling_n
+ ```
332
+
333
+ - [ ] In `predict_bbb()`, immediately before the `return BBBPredictResponse(...)` block, compute drift:
334
+
335
+ ```python
336
+ drift_z, rolling_n = _compute_drift_z(model, pred["confidence"])
337
+ ```
338
+
339
+ - [ ] Update the `return BBBPredictResponse(...)` to pass the new fields:
340
+
341
+ ```python
342
+ return BBBPredictResponse(
343
+ label=pred["label"],
344
+ label_text=label_text,
345
+ confidence=pred["confidence"],
346
+ top_features=[FeatureAttribution(**a) for a in attributions],
347
+ calibration=calibration,
348
+ drift_z=drift_z,
349
+ rolling_n=rolling_n,
350
+ )
351
+ ```
352
+
353
+ ### Step 5: Run the new tests — verify GREEN
354
+
355
+ - [ ] Run:
356
+
357
+ ```bash
358
+ pytest tests/api/test_routes.py::TestBBBPredictRoute -v
359
+ ```
360
+ Expected: **all `TestBBBPredictRoute` tests pass**, including the 2 new ones (currently 3 existing + 2 new = 5 in this class).
361
+
362
+ ### Step 6: Run the full suite — verify no regression
363
+
364
+ - [ ] Run:
365
+
366
+ ```bash
367
+ pytest -q 2>&1 | tail -3
368
+ ```
369
+ Expected: **169 passed** (167 + 2 new).
370
+
371
+ ### Step 7: Commit T1B
372
+
373
+ - [ ] Run:
374
+
375
+ ```bash
376
+ git add src/api/schemas.py src/api/routes.py tests/api/test_routes.py
377
+ git commit -m "$(cat <<'EOF'
378
+ feat(api): drift z-score in /predict/bbb response
379
+
380
+ - WORKER_CONFIDENCE_DEQUE: collections.deque(maxlen=100), per-worker
381
+ rolling window of confidences; drift_z computed against train-time
382
+ median when ≥10 samples buffered AND model has _neurobridge_train_stats.
383
+ - BBBPredictResponse gains drift_z (float | None) and rolling_n (int).
384
+ - 2 new tests: drift_z/rolling_n always present in body; deque rolls
385
+ at 100 after 105 predictions.
386
+
387
+ Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
388
+ EOF
389
+ )"
390
+ ```
391
+
392
+ ---
393
+
394
+ ## Task 1C — Streamlit Drift Metric Line
395
+
396
+ **Why:** Without a UI surface, the drift signal is invisible to the jury. Render a one-line caption between the calibration caption and the SHAP section in `_render_prediction_card`.
397
+
398
+ **Files:**
399
+ - Modify: `src/frontend/app.py`
400
+
401
+ No new tests — UI wiring covered by the existing 2 import-smoke tests. Frontend test floor stays at 2.
402
+
403
+ ### Step 1: Locate `_render_prediction_card`
404
+
405
+ - [ ] Open `/Users/mertgungor/Desktop/hackathon/src/frontend/app.py`. Find `_render_prediction_card(result)`. The function currently renders (in order): label badge → confidence progress → calibration caption → SHAP section. Drift goes between calibration and SHAP.
406
+
407
+ ### Step 2: Add the drift line block
408
+
409
+ - [ ] Inside `_render_prediction_card`, immediately AFTER the existing calibration caption block (the `if calibration is not None:` / `elif calibration is not None:` block) and BEFORE the SHAP section header (the `st.markdown("**Top {n_features} SHAP attributions**" …)` or equivalent), insert:
410
+
411
+ ```python
+     drift_z = result.get("drift_z")
+     rolling_n = result.get("rolling_n", 0)
+     if drift_z is None and rolling_n < 10:
+         st.caption(
+             f"📈 Drift: warming up ({rolling_n}/10 predictions buffered)."
+         )
+     elif drift_z is None:
+         st.caption(
+             "📈 Drift: unavailable (model lacks train-time confidence stats)."
+         )
+     else:
+         # Sign + magnitude: |z| < 1 in-band, 1–2 mild, >=2 significant.
+         if abs(drift_z) < 1.0:
+             tag = "within expected range"
+         elif abs(drift_z) < 2.0:
+             tag = "mild distribution shift"
+         else:
+             tag = "significant shift — retrain recommended"
+         st.caption(
+             f"📈 Drift: trailing-{rolling_n} confidence median is "
+             f"**{drift_z:+.2f}σ** from train-time distribution ({tag})."
+         )
434
+ ```
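Because T1C adds no tests, the threshold logic is only exercised through the UI. If a unit-testable seam is wanted later, the magnitude bands could be factored into a pure function along these lines. The name `drift_band` and its short return keys are hypothetical; the caption text itself stays in the Streamlit block.

```python
def drift_band(drift_z: float) -> str:
    """Classify a drift z-score by magnitude (hypothetical helper name).

    Thresholds follow the caption logic: |z| < 1 in-band, 1 to 2 mild,
    2 or more significant. The sign is ignored for banding.
    """
    magnitude = abs(drift_z)
    if magnitude < 1.0:
        return "in-band"
    if magnitude < 2.0:
        return "mild"
    return "significant"
```

A negative z (confidence lower than at train time) and a positive z of the same size land in the same band; only the formatted `{drift_z:+.2f}` in the caption carries the direction.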
435
+
436
+ ### Step 3: Persist the last prediction in session state
437
+
438
+ - [ ] Inside `_render_prediction_card`, at the very TOP of the function body (before any other call), add:
439
+
440
+ ```python
441
+ st.session_state["last_bbb_prediction"] = result
442
+ ```
443
+
444
+ This unlocks the AI Assistant tab in T3C — the tab can read `st.session_state["last_bbb_prediction"]` to populate its question form.
445
+
446
+ ### Step 4: Smoke test
447
+
448
+ - [ ] Verify import + Streamlit boot:
449
+
450
+ ```bash
451
+ pytest tests/frontend/ -v
452
+ ```
453
+ Expected: **2 passed**.
454
+
455
+ ```bash
456
+ streamlit run src/frontend/app.py --server.headless true --server.port 8530 &
457
+ STREAMLIT_PID=$!
458
+ sleep 6
459
+ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8530
460
+ kill $STREAMLIT_PID 2>/dev/null
461
+ sleep 1
462
+ ```
463
+ Expected: HTTP `200`.
464
+
465
+ ### Step 5: Full suite — verify no regression
466
+
467
+ - [ ] Run:
468
+
469
+ ```bash
470
+ pytest -q 2>&1 | tail -3
471
+ ```
472
+ Expected: **169 passed** (no count change from T1B; UI-only).
473
+
474
+ ### Step 6: Commit T1C
475
+
476
+ - [ ] Run:
477
+
478
+ ```bash
479
+ git add src/frontend/app.py
480
+ git commit -m "$(cat <<'EOF'
481
+ feat(frontend): drift metric line + last-prediction session state
482
+
483
+ - Renders one-line drift caption between the calibration caption and
484
+ the SHAP section. Three states: warming up (<10 samples), unavailable
485
+ (no train stats), drift z-score with magnitude tag (in-band / mild /
486
+ significant).
487
+ - Stashes /predict/bbb response in st.session_state["last_bbb_prediction"]
488
+ so the Day-7 T3C AI Assistant tab can pick it up.
489
+ - No backend / schema / test count changes.
490
+
491
+ Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
492
+ EOF
493
+ )"
494
+ ```
495
+
496
+ ---
497
+
498
+ ## Task 2 — MLflow Traceability Badge
499
+
500
+ **Why:** Spec §3.2. Jurors should be able to point at a decision card and ask "which exact training run produced this?". One smoke test on the API (the `provenance` field appears in the body), one badge in the UI.
501
+
502
+ **Files:**
503
+ - Modify: `src/api/schemas.py`
504
+ - Modify: `src/api/routes.py`
505
+ - Modify: `src/frontend/app.py`
506
+ - Modify: `tests/api/test_routes.py`
507
+
508
+ ### Step 1: Add `ModelProvenance` schema
509
+
510
+ - [ ] Open `/Users/mertgungor/Desktop/hackathon/src/api/schemas.py`. Find the line `class BBBPredictResponse(BaseModel):` and add this class IMMEDIATELY ABOVE it, so the type is in scope when `BBBPredictResponse` references it:
513
+
514
+ ```python
+ class ModelProvenance(BaseModel):
+     """Auditable provenance of the BBB model that produced a prediction."""
+     mlflow_run_id: str | None = Field(None, description="MLflow run id of the most recent training run, if any")
+     model_version: str = Field("v1", description="Manually-bumped model version label")
+     train_date: str | None = Field(None, description="ISO 8601 train timestamp from MLflow run start_time")
+     n_examples: int | None = Field(None, description="Training set size (from model._neurobridge_train_stats[\"n_train\"])")
+ ```
522
+
523
+ - [ ] Modify `BBBPredictResponse` to add a `provenance` field at the end:
524
+
525
+ ```python
526
+ provenance: ModelProvenance | None = Field(
527
+ None,
528
+ description="Auditing metadata (MLflow run id, train date, n_examples).",
529
+ )
530
+ ```
531
+
532
+ ### Step 2: Write the failing test (RED)
533
+
534
+ - [ ] Open `/Users/mertgungor/Desktop/hackathon/tests/api/test_routes.py`. Inside `TestBBBPredictRoute`, append:
535
+
536
+ ```python
+     def test_predict_response_includes_provenance(self, _set_bbb_model_path):
+         """T2: provenance field is present in body (fields may be None)."""
+         from src.api import routes
+         routes.WORKER_CONFIDENCE_DEQUE.clear()
+
+         resp = client.post("/predict/bbb", json={"smiles": "CCO", "top_k": 3})
+         assert resp.status_code == 200, resp.text
+         body = resp.json()
+         assert "provenance" in body
+         assert body["provenance"] is not None, "provenance should be populated even when MLflow is empty"
+         prov = body["provenance"]
+         assert "mlflow_run_id" in prov
+         assert "model_version" in prov
+         assert prov["model_version"] == "v1"  # default until bumped manually
+         assert "train_date" in prov
+         assert "n_examples" in prov
+         # n_examples comes from train_stats — must be a positive int for the test fixture
+         assert isinstance(prov["n_examples"], int) and prov["n_examples"] >= 1
+ ```
556
+
557
+ ### Step 3: Run the test — verify RED
558
+
559
+ - [ ] Run:
560
+
561
+ ```bash
562
+ pytest tests/api/test_routes.py::TestBBBPredictRoute::test_predict_response_includes_provenance -v
563
+ ```
564
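+ Expected: **FAIL** — `assert body["provenance"] is not None` fails. Once Step 1 adds the field, the key serializes as `null` (assuming the route does not exclude `None` fields from the response), but no route populates it yet.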
565
+
566
+ ### Step 4: Implement provenance lookup + cache (GREEN)
567
+
568
+ - [ ] Open `/Users/mertgungor/Desktop/hackathon/src/api/routes.py`. Add the schema import:
569
+
570
+ ```python
571
+ from src.api.schemas import (
572
+ BBBPredictRequest,
573
+ BBBPredictResponse,
574
+ BBBRequest,
575
+ CalibrationContext,
576
+ EEGRequest,
577
+ FeatureAttribution,
578
+ HarmonizationRow,
579
+ ModelProvenance, # NEW
580
+ MRIDiagnosticsRequest,
581
+ MRIDiagnosticsResponse,
582
+ MRIRequest,
583
+ PipelineResponse,
584
+ )
585
+ ```
586
+
587
+ - [ ] Below the `_compute_drift_z` helper, add a provenance lookup helper. The cache is module-level so MLflow is queried once per worker:
588
+
589
+ ```python
+ _PROVENANCE_CACHE: ModelProvenance | None = None
+ _MODEL_VERSION = "v1"  # bump manually per train cycle
+
+
+ def _build_provenance(model) -> ModelProvenance:
+     """Look up the most recent BBB MLflow run; build a ModelProvenance.
+
+     Cached at module level so we hit MLflow once per worker. Failures (no
+     runs found, MLflow unreachable, NEUROBRIDGE_DISABLE_MLFLOW=1) all
+     degrade to a partial ModelProvenance with mlflow_run_id=None — the
+     badge still renders, just without a run id.
+     """
+     global _PROVENANCE_CACHE
+     if _PROVENANCE_CACHE is not None:
+         # Refresh n_examples each call from the model (cheap lookup).
+         n_train = None
+         stats = getattr(model, "_neurobridge_train_stats", None)
+         if stats is not None:
+             n_train = int(stats.get("n_train", 0)) or None
+         return _PROVENANCE_CACHE.model_copy(update={"n_examples": n_train})
+
+     run_id: str | None = None
+     train_date: str | None = None
+     if os.environ.get("NEUROBRIDGE_DISABLE_MLFLOW") != "1":
+         try:
+             runs = mlflow.search_runs(
+                 experiment_names=["bbb_pipeline"],
+                 max_results=1,
+                 order_by=["start_time DESC"],
+             )
+             if len(runs):
+                 row = runs.iloc[0]
+                 run_id = str(row["run_id"])
+                 ts = row.get("start_time")
+                 if ts is not None:
+                     train_date = str(pd.Timestamp(ts).isoformat())
+         except Exception as e:  # broad: MLflow store unreachable, schema mismatch, etc.
+             logger.warning("MLflow provenance lookup failed: %s", e)
+
+     n_train = None
+     stats = getattr(model, "_neurobridge_train_stats", None)
+     if stats is not None:
+         n_train = int(stats.get("n_train", 0)) or None
+
+     _PROVENANCE_CACHE = ModelProvenance(
+         mlflow_run_id=run_id,
+         model_version=_MODEL_VERSION,
+         train_date=train_date,
+         n_examples=n_train,
+     )
+     return _PROVENANCE_CACHE
641
+ ```
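The caching pattern in the helper above can be reduced to a small standalone sketch. Here `lookup` stands in for the MLflow query, and the names `cached_provenance` and `_cache` are illustrative rather than part of the plan.

```python
from __future__ import annotations

from typing import Callable

_cache: dict[str, str | None] | None = None


def cached_provenance(lookup: Callable[[], str]) -> dict[str, str | None]:
    """Call `lookup` at most once per process; degrade to run id None on error."""
    global _cache
    if _cache is None:
        try:
            run_id = lookup()
        except Exception:  # broad on purpose, mirroring the helper above
            run_id = None
        # The result is stored whether or not the lookup succeeded.
        _cache = {"mlflow_run_id": run_id, "model_version": "v1"}
    return _cache
```

One consequence of this design is that a failed lookup is cached as well: a worker that starts while the tracking store is unreachable keeps reporting no run id until it restarts. The same holds for `_build_provenance`, which assigns `_PROVENANCE_CACHE` after the `try` block regardless of outcome.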
642
+
643
+ - [ ] In `predict_bbb()`, immediately after `drift_z, rolling_n = _compute_drift_z(...)`, add:
644
+
645
+ ```python
646
+ provenance = _build_provenance(model)
647
+ ```
648
+
649
+ - [ ] Update the `return BBBPredictResponse(...)` to pass `provenance=provenance`:
650
+
651
+ ```python
652
+ return BBBPredictResponse(
653
+ label=pred["label"],
654
+ label_text=label_text,
655
+ confidence=pred["confidence"],
656
+ top_features=[FeatureAttribution(**a) for a in attributions],
657
+ calibration=calibration,
658
+ drift_z=drift_z,
659
+ rolling_n=rolling_n,
660
+ provenance=provenance,
661
+ )
662
+ ```
663
+
664
+ ### Step 5: Run the test — verify GREEN
665
+
666
+ - [ ] Run:
667
+
668
+ ```bash
669
+ pytest tests/api/test_routes.py::TestBBBPredictRoute::test_predict_response_includes_provenance -v
670
+ ```
671
+ Expected: **PASS**.
672
+
673
+ ### Step 6: Render badge in Streamlit decision card
674
+
675
+ - [ ] Open `/Users/mertgungor/Desktop/hackathon/src/frontend/app.py`. In `_render_prediction_card`, immediately after the line `st.session_state["last_bbb_prediction"] = result` (added in T1C Step 3) and BEFORE the existing label badge, add:
676
+
677
+ ```python
678
+ provenance = result.get("provenance")
679
+ if provenance is not None:
680
+ run_id = provenance.get("mlflow_run_id")
681
+ run_label = run_id[:8] if run_id else "—"
682
+ train_date = provenance.get("train_date") or "—"
683
+ n_examples = provenance.get("n_examples")
684
+ n_label = f"n={n_examples}" if n_examples else "n=—"
685
+ st.caption(
686
+ f"🔎 MLflow run **{run_label}** · "
687
+ f"Model **{provenance.get('model_version', 'v1')}** · "
688
+ f"trained {train_date} · {n_label}"
689
+ )
690
+ ```
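The caption rules above are easier to check if they are pulled into a pure function. The following is a minimal sketch under that assumption: `format_provenance_badge` is a hypothetical helper name, not something this plan defines. It mirrors only the string-building rules from the snippet (first 8 characters of the run id, a dash placeholder for missing values) and leaves out the emoji prefix.

```python
from typing import Any, Mapping, Optional

PLACEHOLDER = "\u2014"  # the dash the Streamlit snippet shows for missing values


def format_provenance_badge(provenance: Optional[Mapping[str, Any]]) -> Optional[str]:
    """Return the one-line audit caption, or None when there is no provenance.

    Hypothetical helper: it mirrors the string rules of the snippet above so
    they can be unit-tested without importing Streamlit.
    """
    if provenance is None:
        return None
    run_id = provenance.get("mlflow_run_id")
    run_label = run_id[:8] if run_id else PLACEHOLDER
    train_date = provenance.get("train_date") or PLACEHOLDER
    n_examples = provenance.get("n_examples")
    n_label = f"n={n_examples}" if n_examples else f"n={PLACEHOLDER}"
    model_version = provenance.get("model_version", "v1")
    return (
        f"MLflow run **{run_label}** · Model **{model_version}** · "
        f"trained {train_date} · {n_label}"
    )
```

If this shape were adopted, the card would only need `badge = format_provenance_badge(result.get("provenance"))` followed by `st.caption(badge)` when `badge` is not `None`.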
691
+
692
+ ### Step 7: Full suite — verify no regression
693
+
694
+ - [ ] Run:
695
+
696
+ ```bash
697
+ pytest -q 2>&1 | tail -3
698
+ ```
699
+ Expected: **170 passed** (169 + 1 new).
700
+
701
+ ### Step 8: Streamlit smoke
702
+
703
+ - [ ] Run:
704
+
705
+ ```bash
706
+ streamlit run src/frontend/app.py --server.headless true --server.port 8531 &
707
+ STREAMLIT_PID=$!
708
+ sleep 6
709
+ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8531
710
+ kill $STREAMLIT_PID 2>/dev/null
711
+ sleep 1
712
+ ```
713
+ Expected: HTTP `200`.
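A fixed `sleep 6` can be too short on a cold machine. As an alternative, the sketch below polls the port until it answers. `wait_for_http` is a hypothetical helper, not part of the repo, and the default timeout and polling interval are assumptions.

```python
import http.client
import time
import urllib.request


def wait_for_http(url: str, timeout_s: float = 15.0, interval_s: float = 0.5) -> bool:
    """Poll `url` until it answers with HTTP 200 or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(url, timeout=2.0) as resp:
                if resp.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            pass  # server not up yet (connection refused, timeout, bad response)
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

For the smoke step above, `wait_for_http("http://localhost:8531")` would replace the `sleep 6` and `curl` pair. The Streamlit process still has to be started and killed separately.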
714
+
715
+ ### Step 9: Commit T2
716
+
717
+ - [ ] Run:
718
+
719
+ ```bash
720
+ git add src/api/schemas.py src/api/routes.py src/frontend/app.py tests/api/test_routes.py
721
+ git commit -m "$(cat <<'EOF'
722
+ feat(api+frontend): MLflow provenance badge in decision card
723
+
724
+ - ModelProvenance schema (mlflow_run_id, model_version, train_date,
725
+ n_examples). BBBPredictResponse.provenance is always populated; failed
726
+ MLflow lookup degrades to None fields without breaking the response.
727
+ - _build_provenance() module-level cache: one MLflow query per worker.
728
+ NEUROBRIDGE_DISABLE_MLFLOW=1 short-circuits to None fields. n_examples
729
+ pulled per-request from model._neurobridge_train_stats.
730
+ - Streamlit decision card renders a one-line audit badge above the
731
+ label: run id (first 8 chars), model version, train date, n_examples.
732
+ - 1 new test: provenance field present in /predict/bbb body with the
733
+ fixture model (n_examples ≥ 1 from train stats).
734
+
735
+ Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
736
+ EOF
737
+ )"
738
+ ```
739
+
740
+ ---
741
+
742
+ ## Task 3A — LLM Explainer (template + OpenRouter)
743
+
744
+ **Why:** This is the heart of the nod to Track-1 "AI Lab Agents": a small, self-contained module that ALWAYS returns a usable rationale, using a deterministic template for reproducibility and OpenRouter's `meta-llama/llama-3.2-3b-instruct:free` for the "real agent" demo. Spec §3.3.
745
+
746
+ **Files:**
747
+ - Modify: `requirements.txt`
748
+ - Create: `src/llm/__init__.py`
749
+ - Create: `src/llm/explainer.py`
750
+ - Create: `tests/llm/__init__.py`
751
+ - Create: `tests/llm/test_explainer.py`
752
+
753
+ ### Step 1: Add the new pip dep + install
754
+
755
+ - [ ] Open `/Users/mertgungor/Desktop/hackathon/requirements.txt` and check its ordering first (e.g. with `head`). If the file is alphabetical, insert `openai==1.51.0` in its alphabetical position (after `nibabel==…`); if it is grouped or has no clear ordering, append it at the end with a comment:
756
+
757
+ Append:
758
+ ```
759
+ openai==1.51.0 # OpenAI SDK, used as the OpenRouter client (Day-7 LLM explainer; deterministic-template fallback always available)
760
+ ```
761
+
762
+ - [ ] Install:
763
+
764
+ ```bash
765
+ pip install openai==1.51.0
766
+ pip check 2>&1 | tail -5
767
+ ```
768
+ Expected: `pip check` reports no incompatibilities. If a conflict appears (e.g. with `httpx==0.27.2`), STOP and resolve it before continuing; the spec assumes these pins are mutually compatible.
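To confirm the pins from inside the interpreter the tests will use, a small check along these lines may help. It is a sketch only: `check_pin` is a hypothetical helper, and the two pins listed are the ones this step mentions.

```python
from importlib import metadata


def check_pin(package: str, expected: str) -> str:
    """Compare the installed version of `package` with an expected pin.

    Returns "ok", "missing", or "mismatch (<installed>)".
    """
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return "missing"
    return "ok" if installed == expected else f"mismatch ({installed})"


if __name__ == "__main__":
    # Pins taken from this step; extend the dict with other pins you care about.
    for name, pin in {"openai": "1.51.0", "httpx": "0.27.2"}.items():
        print(f"{name}=={pin}: {check_pin(name, pin)}")
```

This complements `pip check` rather than replacing it: `pip check` validates declared dependency ranges, while this only compares exact versions.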
769
+
770
+ ### Step 2: Create `src/llm/__init__.py`
771
+
772
+ - [ ] Run:
773
+
774
+ ```bash
775
+ mkdir -p src/llm tests/llm
776
+ ```
777
+
778
+ - [ ] Create `/Users/mertgungor/Desktop/hackathon/src/llm/__init__.py` with this exact content:
779
+
780
+ ```python
781
+ """LLM-backed natural-language explainers (Day 7).
782
+
783
+ `explain()` is the ONLY public entry point. It guarantees a non-empty
784
+ rationale every call: tries OpenRouter when available, falls back to a
785
+ deterministic template otherwise. The deterministic path is the source
786
+ of truth for tests; the LLM path is gated behind env config.
787
+ """
788
+ from src.llm.explainer import ExplainPayload, ExplainResult, explain # noqa: F401
789
+ ```
790
+
791
+ ### Step 3: Write the 4 failing tests (RED)
792
+
793
+ - [ ] Create `/Users/mertgungor/Desktop/hackathon/tests/llm/__init__.py` (empty):
794
+
795
+ ```python
796
+ ```
797
+
798
+ - [ ] Create `/Users/mertgungor/Desktop/hackathon/tests/llm/test_explainer.py` with this exact content:
799
+
800
+ ```python
801
+ """Tests for src.llm.explainer.
802
+
803
+ The deterministic template path is exhaustively tested here. The LLM
804
+ path is exercised only by env-gated integration tests in
805
+ test_explainer_integration.py (NOT run in CI by default).
806
+ """
807
+ from __future__ import annotations
808
+
809
+ import os
810
+
811
+ import pytest
812
+
813
+ from src.llm.explainer import ExplainPayload, explain
814
+
815
+
816
+ def _payload(**overrides) -> ExplainPayload:
817
+ """Build a representative ExplainPayload; overrides win."""
818
+ base: ExplainPayload = {
819
+ "smiles": "CCO",
820
+ "label": 1,
821
+ "label_text": "permeable",
822
+ "confidence": 0.82,
823
+ "top_features": [
824
+ {"feature": "fp_341", "shap_value": 0.045},
825
+ {"feature": "fp_902", "shap_value": -0.031},
826
+ {"feature": "fp_77", "shap_value": 0.022},
827
+ ],
828
+ "calibration": {"threshold": 0.80, "precision": 0.92, "support": 18},
829
+ "drift_z": 0.42,
830
+ "user_question": "Why was this molecule predicted as permeable?",
831
+ }
832
+ base.update(overrides)
833
+ return base
834
+
835
+
836
+ class TestTemplateExplain:
837
+ """Day-7 T3A: deterministic-template path of the explainer."""
838
+
839
+ def test_template_path_is_deterministic(self, monkeypatch):
840
+ """Same input → byte-identical rationale string. No randomness."""
841
+ monkeypatch.setenv("NEUROBRIDGE_DISABLE_LLM", "1")
842
+ out_a = explain(_payload())
843
+ out_b = explain(_payload())
844
+ assert out_a["rationale"] == out_b["rationale"]
845
+ assert out_a["source"] == "template"
846
+ assert out_b["source"] == "template"
847
+ assert out_a["model"] is None
848
+
849
+ def test_template_includes_top_feature_names(self, monkeypatch):
850
+ """Rationale must mention the SHAP features so jurors see attribution."""
851
+ monkeypatch.setenv("NEUROBRIDGE_DISABLE_LLM", "1")
852
+ result = explain(_payload())
853
+ for feat in ("fp_341", "fp_902", "fp_77"):
854
+ assert feat in result["rationale"], (
855
+ f"expected feature {feat!r} in rationale, got {result['rationale']!r}"
856
+ )
857
+
858
+ def test_template_includes_label_text(self, monkeypatch):
859
+ """The verdict word ('permeable' / 'non-permeable') must appear."""
860
+ monkeypatch.setenv("NEUROBRIDGE_DISABLE_LLM", "1")
861
+ result = explain(_payload(label=0, label_text="non-permeable"))
862
+ assert "non-permeable" in result["rationale"]
863
+
864
+ def test_disable_flag_forces_template_even_with_key_set(self, monkeypatch):
865
+ """NEUROBRIDGE_DISABLE_LLM=1 wins over OPENROUTER_API_KEY presence."""
866
+ monkeypatch.setenv("NEUROBRIDGE_DISABLE_LLM", "1")
867
+ monkeypatch.setenv("OPENROUTER_API_KEY", "sk-fake-not-used")
868
+ result = explain(_payload())
869
+ assert result["source"] == "template"
870
+ assert result["model"] is None
871
+ ```
872
+
873
+ ### Step 4: Run the new tests — verify RED
874
+
875
+ - [ ] Run:
876
+
877
+ ```bash
878
+ pytest tests/llm/ -v
879
+ ```
880
+ Expected: collection fails with `ModuleNotFoundError: No module named 'src.llm.explainer'` (the file doesn't exist yet), so pytest reports a collection error rather than 4 individual failures. If the module somehow already exists, the run still fails because `explain` is not implemented.
881
+
882
+ ### Step 5: Implement `src/llm/explainer.py` (GREEN)
883
+
884
+ - [ ] Create `/Users/mertgungor/Desktop/hackathon/src/llm/explainer.py` with this exact content:
885
+
886
+ ```python
887
+ """Natural-language rationale for a single BBB prediction.
888
+
889
+ Public entry point: `explain(payload)`. Always returns a usable
890
+ ExplainResult — never raises. Tries OpenRouter first when a key is set
891
+ and the kill-switch is off; falls back to a deterministic template on
892
+ any failure (network, auth, rate limit, malformed response).
893
+
894
+ Test discipline: deterministic template path is the source of truth.
895
+ LLM path is env-gated and exercised by integration tests only.
896
+ """
897
+ from __future__ import annotations
898
+
899
+ import os
900
+ from typing import Any, TypedDict
901
+
902
+ from src.core.logger import get_logger
903
+
904
+ logger = get_logger(__name__)
905
+
906
+
907
+ class FeatureRow(TypedDict):
908
+ feature: str
909
+ shap_value: float
910
+
911
+
912
+ class CalibrationDict(TypedDict):
913
+ threshold: float
914
+ precision: float
915
+ support: int
916
+
917
+
918
+ class ExplainPayload(TypedDict, total=False):
919
+ smiles: str
920
+ label: int
921
+ label_text: str
922
+ confidence: float
923
+ top_features: list[FeatureRow]
924
+ calibration: CalibrationDict | None
925
+ drift_z: float | None
926
+ user_question: str
927
+
928
+
929
+ class ExplainResult(TypedDict):
930
+ rationale: str
931
+ source: str # "llm" | "template"
932
+ model: str | None # llm model name when source="llm", else None
933
+
934
+
935
+ _OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"
936
+ _DEFAULT_MODEL = "meta-llama/llama-3.2-3b-instruct:free"
937
+ _LLM_TIMEOUT_SECONDS = 8.0
938
+ _LLM_MAX_TOKENS = 256
939
+ _LLM_TEMPERATURE = 0.3
940
+
941
+
942
+ def _should_use_llm() -> bool:
943
+ """Gate: env kill-switch off AND key present."""
944
+ if os.environ.get("NEUROBRIDGE_DISABLE_LLM") == "1":
945
+ return False
946
+ if not os.environ.get("OPENROUTER_API_KEY"):
947
+ return False
948
+ return True
949
+
950
+
951
+ def _drift_interpretation(drift_z: float | None) -> str:
952
+ if drift_z is None:
953
+ return "drift unavailable"
954
+ mag = abs(drift_z)
955
+ if mag < 1.0:
956
+ return "within expected range"
957
+ if mag < 2.0:
958
+ return "mild distribution shift"
959
+ return "significant shift, retrain recommended"
960
+
961
+
962
+ def _template_explain(payload: ExplainPayload) -> str:
963
+ """Deterministic, jury-friendly rationale. Never raises."""
964
+ label_text = payload.get("label_text", "unknown")
965
+ confidence = float(payload.get("confidence", 0.0))
966
+ top_features = payload.get("top_features") or []
967
+
968
+ # Sentence 1
969
+ sentences = [
970
+ f"Predicted **{label_text}** with {confidence * 100:.0f}% confidence."
971
+ ]
972
+
973
+ # Sentence 2 (calibration, optional)
974
+ cal = payload.get("calibration")
975
+ if cal is not None:
976
+ thr_pct = float(cal["threshold"]) * 100
977
+ prec_pct = float(cal["precision"]) * 100
978
+ support = int(cal["support"])
979
+ if support > 0:
980
+ sentences.append(
981
+ f"Calibration: predictions in the ≥{thr_pct:.0f}% bin are "
982
+ f"correct {prec_pct:.0f}% of the time on held-out data "
983
+ f"(n={support})."
984
+ )
985
+
986
+ # Sentence 3 (top-3 SHAP features)
987
+ if top_features:
988
+ feat_strs = [
989
+ f"{row['feature']} (Δ{float(row['shap_value']):+.3f})"
990
+ for row in top_features[:3]
991
+ ]
992
+ sentences.append(
993
+ f"Top SHAP attributions toward this label: {', '.join(feat_strs)}."
994
+ )
995
+
996
+ # Sentence 4 (drift, optional)
997
+ drift_z = payload.get("drift_z")
998
+ if drift_z is not None:
999
+ interp = _drift_interpretation(drift_z)
1000
+ sentences.append(
1001
+ f"Drift signal: trailing-100 confidence median is "
1002
+ f"{float(drift_z):+.2f}σ from training distribution ({interp})."
1003
+ )
1004
+
1005
+ return " ".join(sentences)
1006
+
1007
+
1008
+ def _build_llm_prompt(payload: ExplainPayload) -> str:
1009
+ """Format the payload + user question into a single LLM prompt."""
1010
+ top_features = payload.get("top_features") or []
1011
+ top_lines = "\n".join(
1012
+ f" - {row['feature']}: Δ{float(row['shap_value']):+.3f}"
1013
+ for row in top_features[:5]
1014
+ ) or " - (none)"
1015
+ drift_z = payload.get("drift_z")
1016
+ drift_str = "n/a" if drift_z is None else f"{float(drift_z):+.2f}"
1017
+ user_q = payload.get("user_question") or (
1018
+ "Explain the prediction in 2-4 sentences."
1019
+ )
1020
+ return (
1021
+ "You are a clinical-ML explainer for a B2B blood-brain-barrier "
1022
+ "permeability tool. Given the prediction details below, write a "
1023
+ "2-4 sentence rationale a researcher could paste into a paper. "
1024
+ "Use the SHAP attributions to justify the verdict. Mention drift "
1025
+ "if abnormal. Avoid hedging; be specific about the numbers.\n\n"
1026
+ f"Prediction:\n"
1027
+ f"- SMILES: {payload.get('smiles', '?')}\n"
1028
+ f"- Verdict: {payload.get('label_text', '?')} "
1029
+ f"({float(payload.get('confidence', 0.0)) * 100:.0f}% confident)\n"
1030
+ f"- Top SHAP features (positive = pushed toward verdict):\n"
1031
+ f"{top_lines}\n"
1032
+ f"- Drift z-score: {drift_str}\n"
1033
+ f"\nUser question: {user_q}\n"
1034
+ f"\nRespond with the rationale only, no preamble."
1035
+ )
1036
+
1037
+
1038
+ def _llm_explain(payload: ExplainPayload) -> tuple[str, str] | None:
1039
+ """Try the OpenRouter chat completion. Return (rationale, model) or None."""
1040
+ try:
1041
+ # Local import — keeps this dep optional at module load time.
1042
+ from openai import OpenAI
1043
+ except ImportError as e:
1044
+ logger.warning("openai SDK not importable: %s", e)
1045
+ return None
1046
+
1047
+ api_key = os.environ.get("OPENROUTER_API_KEY")
1048
+ if not api_key:
1049
+ return None
1050
+
1051
+     prompt = _build_llm_prompt(payload)
+     try:
+         # Client construction stays inside the try so that a failure here
+         # (for example an incompatible SDK/httpx pairing) also degrades to
+         # the template instead of raising.
+         client = OpenAI(
+             base_url=_OPENROUTER_BASE_URL,
+             api_key=api_key,
+             timeout=_LLM_TIMEOUT_SECONDS,
+         )
+         completion = client.chat.completions.create(
+             model=_DEFAULT_MODEL,
+             messages=[{"role": "user", "content": prompt}],
+             max_tokens=_LLM_MAX_TOKENS,
+             temperature=_LLM_TEMPERATURE,
+         )
+     except Exception as e:  # broad: APITimeoutError, APIConnectionError, RateLimitError, ...
+         logger.warning("LLM call failed (%s); falling back to template.", type(e).__name__)
+         return None
1067
+
1068
+ try:
1069
+ text = completion.choices[0].message.content
1070
+ except (AttributeError, IndexError, TypeError) as e:
1071
+ logger.warning("LLM response malformed (%s); falling back to template.", e)
1072
+ return None
1073
+
1074
+ if not text or not text.strip():
1075
+ logger.warning("LLM returned empty rationale; falling back to template.")
1076
+ return None
1077
+
1078
+ return text.strip(), _DEFAULT_MODEL
1079
+
1080
+
1081
+ def explain(payload: ExplainPayload) -> ExplainResult:
1082
+ """Return a natural-language rationale for a BBB prediction.
1083
+
1084
+ Tries the LLM first when env-permitted; falls back to a deterministic
1085
+ template on any failure. Never raises.
1086
+ """
1087
+ if _should_use_llm():
1088
+ llm_out: Any = _llm_explain(payload)
1089
+ if llm_out is not None:
1090
+ rationale, model = llm_out
1091
+ return ExplainResult(rationale=rationale, source="llm", model=model)
1092
+ # else: fall through to template
1093
+ return ExplainResult(
1094
+ rationale=_template_explain(payload),
1095
+ source="template",
1096
+ model=None,
1097
+ )
1098
+ ```
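The four tests in Step 3 cover only the template path. The fallback contract itself (any LLM failure degrades to the template) can be illustrated with an injected callable, as in the self-contained sketch below. `explain_with_fallback` and `render_template` are hypothetical stand-ins, not the module's real functions, and the template here is deliberately simpler than the one above.

```python
from typing import Callable, Optional, Tuple, TypedDict


class Result(TypedDict):
    rationale: str
    source: str            # "llm" | "template"
    model: Optional[str]


LlmCall = Callable[[dict], Optional[Tuple[str, str]]]


def render_template(payload: dict) -> str:
    """Stand-in for the deterministic template (much simpler than the real one)."""
    label = payload.get("label_text", "unknown")
    confidence = float(payload.get("confidence", 0.0))
    return f"Predicted {label} with {confidence * 100:.0f}% confidence."


def explain_with_fallback(payload: dict, llm: Optional[LlmCall] = None) -> Result:
    """Try the injected LLM callable; fall back to the template on any failure."""
    if llm is not None:
        try:
            out = llm(payload)
        except Exception:
            out = None  # network/auth/rate-limit errors all land here
        if out is not None and out[0].strip():
            text, model = out
            return Result(rationale=text.strip(), source="llm", model=model)
    return Result(rationale=render_template(payload), source="template", model=None)
```

Injecting the callable keeps the failure cases testable without network access. The real module gets the same effect through its env gate and the `None` return of `_llm_explain`.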
1099
+
1100
+ ### Step 6: Run the new tests — verify GREEN
1101
+
1102
+ - [ ] Run:
1103
+
1104
+ ```bash
1105
+ pytest tests/llm/ -v
1106
+ ```
1107
+ Expected: **4 passed**.
1108
+
1109
+ ### Step 7: Full suite — verify no regression
1110
+
1111
+ - [ ] Run:
1112
+
1113
+ ```bash
1114
+ pytest -q 2>&1 | tail -3
1115
+ ```
1116
+ Expected: **174 passed** (170 + 4 new).
1117
+
1118
+ ### Step 8: UserWarning gate
1119
+
1120
+ - [ ] Verify the new `openai` import doesn't introduce sklearn-style UserWarnings:
1121
+
1122
+ ```bash
1123
+ pytest -W error::UserWarning tests/ 2>&1 | tail -3
1124
+ ```
1125
+ Expected: same count (174), 0 UserWarning errors.
1126
+
1127
+ ### Step 9: Commit T3A
1128
+
1129
+ - [ ] Run:
1130
+
1131
+ ```bash
1132
+ git add requirements.txt src/llm/ tests/llm/
1133
+ git commit -m "$(cat <<'EOF'
1134
+ feat(llm): explainer with OpenRouter path + deterministic template fallback
1135
+
1136
+ - New module src/llm/explainer.py — single public entry point
1137
+ explain(payload). Returns {rationale, source, model}. Never raises.
1138
+ - Deterministic template (4 sentences: verdict, calibration if any,
1139
+ top-3 SHAP, drift) is the source of truth for tests.
1140
+ - LLM path: OpenRouter chat completions via openai==1.51.0 SDK,
1141
+ model meta-llama/llama-3.2-3b-instruct:free, 8s timeout, 256 max
1142
+ tokens, temperature 0.3. Gated by OPENROUTER_API_KEY presence and
1143
+ NEUROBRIDGE_DISABLE_LLM=1 kill-switch.
1144
+ - Fallback chain: env-disabled → no key → SDK ImportError → API error
1145
+ → empty/malformed response → all degrade to template, log WARNING,
1146
+ source="template".
1147
+ - 4 new tests: deterministic, top features included, label text
1148
+ included, kill-switch overrides key.
1149
+ - New pip dep: openai==1.51.0 (~600KB, transitive deps already present).
1150
+
1151
+ Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1152
+ EOF
1153
+ )"
1154
+ ```
1155
+
1156
+ ---
1157
+
1158
+ ## Task 3B — POST /explain/bbb Route
1159
+
1160
+ **Why:** Wire the explainer into the API surface so the Streamlit AI Assistant tab (T3C) can call it. Spec §3.4: new `explain_router` with `/explain` prefix.
1161
+
1162
+ **Files:**
1163
+ - Modify: `src/api/schemas.py`
1164
+ - Modify: `src/api/routes.py`
1165
+ - Modify: `src/api/main.py` (or wherever the FastAPI app is assembled; verify in Step 1)
1166
+ - Modify: `tests/api/test_routes.py`
1167
+
1168
+ ### Step 1: Locate the FastAPI app + router registration
1169
+
1170
+ - [ ] Find where `router` and `predict_router` are mounted on the FastAPI app:
1171
+
1172
+ ```bash
1173
+ grep -rn "include_router" /Users/mertgungor/Desktop/hackathon/src/
1174
+ ```
1175
+ The output will point to a `main.py` or similar (likely `src/api/main.py`). Note the file path; we'll add `app.include_router(explain_router)` there.
1176
+
1177
+ ### Step 2: Add `BBBExplainRequest` and `BBBExplainResponse` schemas
1178
+
1179
+ - [ ] Open `/Users/mertgungor/Desktop/hackathon/src/api/schemas.py`. Append at the bottom of the file:
1180
+
1181
+ ```python
1182
+ class BBBExplainRequest(BaseModel):
1183
+ """Day-7 T3B: payload for POST /explain/bbb (chat-style explainer)."""
1184
+ smiles: str = Field(..., description="SMILES string of the molecule")
1185
+ label: int = Field(..., description="Predicted label (0 = non-permeable, 1 = permeable)")
1186
+ label_text: str = Field(..., description="'permeable' or 'non-permeable'")
1187
+ confidence: float = Field(..., ge=0.0, le=1.0)
1188
+ top_features: list[FeatureAttribution] = Field(
1189
+ ..., min_length=1,
1190
+ description="Non-empty list of SHAP attributions; an empty list returns 422.",
1191
+ )
1192
+ calibration: CalibrationContext | None = None
1193
+ drift_z: float | None = None
1194
+ user_question: str | None = Field(
1195
+ None,
1196
+ description="Optional question from the user; passed to the LLM prompt only.",
1197
+ )
1198
+
1199
+
1200
+ class BBBExplainResponse(BaseModel):
1201
+ """Day-7 T3B: response from POST /explain/bbb."""
1202
+ rationale: str = Field(..., description="2-4 sentence natural-language explanation")
1203
+ source: str = Field(..., description="'llm' or 'template'")
1204
+ model: str | None = Field(
1205
+ None,
1206
+ description="LLM model name when source='llm'; None when source='template'",
1207
+ )
1208
+ ```
1209
+
1210
+ ### Step 3: Write the failing test (RED)
1211
+
1212
+ - [ ] In `/Users/mertgungor/Desktop/hackathon/tests/api/test_routes.py`, append at the very bottom (after `TestMRIDiagnosticsRoute`):
1213
+
1214
+ ```python
1215
+ class TestExplainBBBRoute:
1216
+ """Day-7 T3B: POST /explain/bbb."""
1217
+
1218
+ def test_returns_200_with_template_source(self, monkeypatch):
1219
+ """Kill-switch on → /explain/bbb returns rationale with source=template."""
1220
+ monkeypatch.setenv("NEUROBRIDGE_DISABLE_LLM", "1")
1221
+ body = {
1222
+ "smiles": "CCO",
1223
+ "label": 1,
1224
+ "label_text": "permeable",
1225
+ "confidence": 0.82,
1226
+ "top_features": [
1227
+ {"feature": "fp_341", "shap_value": 0.045},
1228
+ {"feature": "fp_902", "shap_value": -0.031},
1229
+ {"feature": "fp_77", "shap_value": 0.022},
1230
+ ],
1231
+ "calibration": {"threshold": 0.80, "precision": 0.92, "support": 18},
1232
+ "drift_z": 0.42,
1233
+ "user_question": "Why permeable?",
1234
+ }
1235
+ resp = client.post("/explain/bbb", json=body)
1236
+ assert resp.status_code == 200, resp.text
1237
+ out = resp.json()
1238
+ assert out["source"] == "template"
1239
+ assert out["model"] is None
1240
+ # Template must mention all three features
1241
+ for feat in ("fp_341", "fp_902", "fp_77"):
1242
+ assert feat in out["rationale"]
1243
+ assert "permeable" in out["rationale"]
1244
+ ```
1245
+
1246
+ ### Step 4: Run the test — verify RED
1247
+
1248
+ - [ ] Run:
1249
+
1250
+ ```bash
1251
+ pytest tests/api/test_routes.py::TestExplainBBBRoute -v
1252
+ ```
1253
+ Expected: **FAIL with 404 Not Found** — `/explain/bbb` doesn't exist yet.
1254
+
1255
+ ### Step 5: Add the route + schema imports + router registration (GREEN)
1256
+
1257
+ - [ ] Open `/Users/mertgungor/Desktop/hackathon/src/api/routes.py`. Add the new schemas to the import block (alphabetical):
1258
+
1259
+ ```python
1260
+ from src.api.schemas import (
1261
+ BBBExplainRequest, # NEW
1262
+ BBBExplainResponse, # NEW
1263
+ BBBPredictRequest,
1264
+ BBBPredictResponse,
1265
+ BBBRequest,
1266
+ CalibrationContext,
1267
+ EEGRequest,
1268
+ FeatureAttribution,
1269
+ HarmonizationRow,
1270
+ ModelProvenance,
1271
+ MRIDiagnosticsRequest,
1272
+ MRIDiagnosticsResponse,
1273
+ MRIRequest,
1274
+ PipelineResponse,
1275
+ )
1276
+ ```
1277
+
1278
+ Add the explainer module import (alphabetical with other `src.*` imports):
1279
+
1280
+ ```python
1281
+ from src.llm import explainer as llm_explainer
1282
+ ```
1283
+
1284
+ Add a new router declaration immediately after the existing `predict_router` line (around line 38):
1285
+
1286
+ ```python
1287
+ explain_router = APIRouter(prefix="/explain")
1288
+ ```
1289
+
1290
+ Append the route at the end of the file:
1291
+
1292
+ ```python
1293
+ @explain_router.post("/bbb", response_model=BBBExplainResponse)
1294
+ def explain_bbb(req: BBBExplainRequest) -> BBBExplainResponse:
1295
+ """Natural-language rationale for a single BBB prediction.
1296
+
1297
+ Always returns 200 — the explainer is guaranteed to produce a
1298
+ rationale via deterministic-template fallback. Pydantic enforces
1299
+ a non-empty top_features list; an empty list returns 422 from
1300
+ FastAPI before this handler runs.
1301
+ """
1302
+ payload: llm_explainer.ExplainPayload = {
1303
+ "smiles": req.smiles,
1304
+ "label": req.label,
1305
+ "label_text": req.label_text,
1306
+ "confidence": req.confidence,
1307
+ "top_features": [
1308
+ {"feature": f.feature, "shap_value": f.shap_value}
1309
+ for f in req.top_features
1310
+ ],
1311
+ "calibration": (
1312
+ None
1313
+ if req.calibration is None
1314
+ else {
1315
+ "threshold": req.calibration.threshold,
1316
+ "precision": req.calibration.precision,
1317
+ "support": req.calibration.support,
1318
+ }
1319
+ ),
1320
+ "drift_z": req.drift_z,
1321
+ "user_question": req.user_question or "",
1322
+ }
1323
+ result = llm_explainer.explain(payload)
1324
+ return BBBExplainResponse(
1325
+ rationale=result["rationale"],
1326
+ source=result["source"],
1327
+ model=result["model"],
1328
+ )
1329
+ ```
1330
+
1331
+ - [ ] Open `src/api/main.py` (or whichever file Step 1 identified). Find where `app.include_router(predict_router)` is called. Immediately after that line, add:
1332
+
1333
+ ```python
1334
+ from src.api.routes import explain_router # if not already imported
1335
+ app.include_router(explain_router)
1336
+ ```
1337
+
1338
+ (If `predict_router` is imported as `from src.api.routes import predict_router`, add `explain_router` to that same import.)
1339
+
1340
+ ### Step 6: Run the test — verify GREEN
1341
+
1342
+ - [ ] Run:
1343
+
1344
+ ```bash
1345
+ pytest tests/api/test_routes.py::TestExplainBBBRoute -v
1346
+ ```
1347
+ Expected: **PASS**.
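To try the endpoint by hand against a running server, the request can be built from a prediction snapshot roughly as follows. This is a sketch: `build_explain_request` is a hypothetical helper, and the base URL `http://localhost:8000` is an assumption about how the API is served locally.

```python
import json
from urllib import request

# Assumption: the API is served locally on port 8000; adjust to your setup.
API_URL = "http://localhost:8000"


def build_explain_request(
    prediction: dict, question: str, base_url: str = API_URL
) -> request.Request:
    """Build (but do not send) a POST /explain/bbb request from a prediction snapshot."""
    body = {
        "smiles": prediction.get("smiles", ""),
        "label": prediction["label"],
        "label_text": prediction["label_text"],
        "confidence": prediction["confidence"],
        "top_features": prediction["top_features"],  # must be non-empty, else 422
        "calibration": prediction.get("calibration"),
        "drift_z": prediction.get("drift_z"),
        "user_question": question,
    }
    return request.Request(
        f"{base_url}/explain/bbb",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it is then `urllib.request.urlopen(req, timeout=10)`. With `NEUROBRIDGE_DISABLE_LLM=1` set on the server, the JSON reply should carry `source` equal to `"template"`.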
1348
+
1349
+ ### Step 7: Full suite — verify no regression
1350
+
1351
+ - [ ] Run:
1352
+
1353
+ ```bash
1354
+ pytest -q 2>&1 | tail -3
1355
+ ```
1356
+ Expected: **175 passed** (174 + 1 new).
1357
+
1358
+ ### Step 8: Commit T3B
1359
+
1360
+ - [ ] Run:
1361
+
1362
+ ```bash
1363
+ git add src/api/schemas.py src/api/routes.py src/api/main.py tests/api/test_routes.py
1364
+ git commit -m "$(cat <<'EOF'
1365
+ feat(api): POST /explain/bbb — natural-language rationale endpoint
1366
+
1367
+ - New explain_router with /explain prefix; symmetric with /predict/bbb
1368
+ and reserves /explain/eeg, /explain/mri for future expansion.
1369
+ - BBBExplainRequest carries the prediction snapshot + optional
1370
+ user_question. top_features is required and must be non-empty
1371
+ (Pydantic min_length=1 → 422 on empty).
1372
+ - BBBExplainResponse: {rationale, source, model}. Always 200 because
1373
+ the explainer's template fallback never raises.
1374
+ - 1 new test: 200 + source='template' under NEUROBRIDGE_DISABLE_LLM=1
1375
+ with full SHAP + calibration + drift payload.
1376
+
1377
+ Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1378
+ EOF
1379
+ )"
1380
+ ```
1381
+
1382
+ ---
1383
+
1384
+ ## Task 3C — Streamlit "AI Assistant" Tab
1385
+
1386
+ **Why:** Spec §3.5. Lets the jury type / pick a question and watch the system reason in natural language. Pulls the last `/predict/bbb` result from `st.session_state` (populated in T1C Step 3) and POSTs to `/explain/bbb`.
1387
+
1388
+ **Files:**
1389
+ - Modify: `src/frontend/app.py`
1390
+
1391
+ No new tests — covered by the 2 existing import-smoke tests.
1392
+
1393
+ ### Step 1: Locate the tab assembly
1394
+
1395
+ - [ ] Open `/Users/mertgungor/Desktop/hackathon/src/frontend/app.py`. Find the `main()` function. The current tabs are likely created via something like:
1396
+
1397
+ ```python
1398
+ tab_bbb, tab_eeg, tab_mri = st.tabs(["BBB", "EEG", "MRI"])
1399
+ ```
1400
+
1401
+ Note the exact line so we can extend it.
1402
+
1403
+ ### Step 2: Extend the tabs list
1404
+
1405
+ - [ ] Replace the existing tab declaration with:
1406
+
1407
+ ```python
1408
+ tab_bbb, tab_eeg, tab_mri, tab_assistant = st.tabs(
1409
+ ["BBB", "EEG", "MRI", "AI Assistant"]
1410
+ )
1411
+ ```
1412
+
1413
+ - [ ] Wherever the existing 3 tabs are rendered (`with tab_bbb: _render_bbb_tab()` etc.), append:
1414
+
1415
+ ```python
1416
+ with tab_assistant:
1417
+ _render_ai_assistant_tab()
1418
+ ```
1419
+
1420
+ ### Step 3: Add the helper function `_render_ai_assistant_tab`
1421
+
1422
+ - [ ] Add this new function above `main()` (near the other `_render_*_tab` helpers):
1423
+
1424
+ ```python
1425
+ def _render_ai_assistant_tab() -> None:
1426
+ """Day-7 T3C: chat-style explainer for the most recent BBB prediction."""
1427
+ _render_section(
1428
+ "AI Assistant",
1429
+ "Natural-language rationale (LLM or deterministic template)",
1430
+ "Pulls the most recent BBB prediction from this session and asks "
1431
+ "the explainer to justify it. Falls back to a deterministic, "
1432
+ "auditable template when no LLM is configured."
1433
+ )
1434
+
1435
+ last = st.session_state.get("last_bbb_prediction")
1436
+ if last is None:
1437
+ st.info(
1438
+ "Run a BBB prediction first (BBB tab → Predict button), "
1439
+ "then come back here to ask the assistant about it."
1440
+ )
1441
+ return
1442
+
1443
+ # Snapshot card so the user knows which prediction is being explained
1444
+ st.caption(
1445
+ f"Latest prediction: **{last['label_text']}** "
1446
+ f"({float(last['confidence']) * 100:.0f}% confident) · "
1447
+ f"Top SHAP: {', '.join(f['feature'] for f in last.get('top_features', [])[:3])}"
1448
+ )
1449
+
1450
+ PRESETS = [
1451
+ "Why was this molecule predicted as permeable?",
1452
+ "Which features pushed the verdict the most?",
1453
+ "Is this prediction trustworthy given the drift signal?",
1454
+ ]
1455
+ preset = st.selectbox("Preset question", options=PRESETS, key="ai_preset")
1456
+ custom = st.text_input(
1457
+ "Or type your own question (optional)",
1458
+ value="",
1459
+ key="ai_custom",
1460
+ help="Custom questions only affect the LLM path; the template gives a generic SHAP-driven rationale either way.",
1461
+ )
1462
+ question = custom.strip() or preset
1463
+
1464
+ if st.button("Ask the AI Assistant", type="primary", key="ai_ask"):
1465
+ with st.spinner("Composing rationale…"):
1466
+ try:
1467
+ body = {
1468
+ "smiles": last.get("smiles", ""),
1469
+ "label": last["label"],
1470
+ "label_text": last["label_text"],
1471
+ "confidence": last["confidence"],
1472
+ "top_features": last.get("top_features", []),
1473
+ "calibration": last.get("calibration"),
1474
+ "drift_z": last.get("drift_z"),
1475
+ "user_question": question,
1476
+ }
1477
+ # The /predict/bbb response payload doesn't include the
1478
+ # user-supplied SMILES (only label/confidence/etc.), so
1479
+ # pull it from the input widget for paper-trail accuracy.
1480
+ # Streamlit text inputs persist via st.session_state.
1481
+ if not body["smiles"]:
1482
+ body["smiles"] = st.session_state.get("bbb_smiles", "")
1483
+ resp = _post("/explain/bbb", body)
1484
+ except httpx.HTTPStatusError as e:
1485
+ st.error(
1486
+ f"Explainer failed (HTTP {e.response.status_code}): "
1487
+ f"{e.response.text}"
1488
+ )
1489
+ return
1490
+ except httpx.RequestError as e:
1491
+ st.error(f"Cannot reach FastAPI at {_API_URL}: {e!r}")
1492
+ return
1493
+
1494
+ history = st.session_state.setdefault("explain_history", [])
1495
+ history.insert(0, (question, resp))
1496
+
1497
+ # Render history (most recent first)
1498
+ history = st.session_state.get("explain_history", [])
1499
+ if history:
1500
+ st.markdown("### Conversation")
1501
+ for q, r in history[:10]: # cap at 10 most recent
1502
+ with st.container():
1503
+ st.markdown(f"**Q:** {q}")
1504
+ st.markdown(f"**A:** {r['rationale']}")
1505
+ source = r.get("source", "?")
1506
+ model = r.get("model") or "—"
1507
+ st.caption(f"Source: `{source}` · Model: `{model}`")
1508
+ st.divider()
1509
+ ```
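Two pieces of the tab's logic do not depend on Streamlit: choosing between the custom and preset question, and keeping a newest-first history. The sketch below isolates them so they can be tested directly. `resolve_question` and `push_history` are hypothetical names, and the plan itself keeps this logic inline.

```python
from typing import List, Tuple

HISTORY_CAP = 10  # same cap the tab applies when rendering


def resolve_question(custom: str, preset: str) -> str:
    """A non-blank custom question wins; otherwise use the selected preset."""
    return custom.strip() or preset


def push_history(
    history: List[Tuple[str, dict]], question: str, response: dict
) -> List[Tuple[str, dict]]:
    """Insert the newest exchange at the front and return the entries to render."""
    history.insert(0, (question, response))
    return history[:HISTORY_CAP]
```

As in the tab code, the stored list keeps growing and only the rendered slice is capped. Trimming the stored list as well (for example `del history[HISTORY_CAP:]`) would be a reasonable variation for long sessions.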
1510
+
1511
+ ### Step 4: Smoke test
1512
+
1513
+ - [ ] Run:
1514
+
1515
+ ```bash
1516
+ pytest tests/frontend/ -v
1517
+ ```
1518
+ Expected: **2 passed**.
1519
+
1520
+ ```bash
1521
+ streamlit run src/frontend/app.py --server.headless true --server.port 8532 &
1522
+ STREAMLIT_PID=$!
1523
+ sleep 6
1524
+ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8532
1525
+ kill $STREAMLIT_PID 2>/dev/null
1526
+ sleep 1
1527
+ ```
1528
+ Expected: HTTP `200`.
1529
+
1530
+ ### Step 5: Full suite — verify no regression
1531
+
1532
+ - [ ] Run:
1533
+
1534
+ ```bash
1535
+ pytest -q 2>&1 | tail -3
1536
+ ```
1537
+ Expected: **175 passed** (no count change — UI only).
1538
+
1539
+ ### Step 6: Commit T3C
1540
+
1541
+ - [ ] Run:
1542
+
1543
+ ```bash
1544
+ git add src/frontend/app.py
1545
+ git commit -m "$(cat <<'EOF'
1546
+ feat(frontend): AI Assistant tab — natural-language explainer
1547
+
1548
+ - New 4th tab in main(): BBB / EEG / MRI / AI Assistant.
1549
+ - _render_ai_assistant_tab pulls last_bbb_prediction from session
1550
+ state, shows a snapshot caption, lets the user pick from 3 preset
1551
+ questions or type a custom one, POSTs to /explain/bbb, and renders
1552
+ a reverse-chronological history (capped at 10).
1553
+ - Each history entry shows source (llm | template) and model so
1554
+ jurors can audit which path served each rationale.
1555
+ - Empty state when no prediction yet: explicit prompt to run BBB tab
1556
+ first.
1557
+ - No new tests; covered by 2 existing import-smoke tests.
1558
+
1559
+ Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1560
+ EOF
1561
+ )"
1562
+ ```
1563
+
1564
+ ---
1565
+
1566
+ ## Task 4 — Close-out: AGENTS.md + README + DoD
1567
+
1568
+ **Why:** Anchor the new contracts in `AGENTS.md`, give the demo runner a `curl` recipe in `README.md`, run the full Day-7 DoD.
1569
+
1570
+ **Files:**
1571
+ - Modify: `AGENTS.md`
1572
+ - Modify: `README.md`
1573
+
1574
+ ### Step 1: AGENTS.md — append §10 and §11
1575
+
1576
+ - [ ] Open `/Users/mertgungor/Desktop/hackathon/AGENTS.md`. Confirm the last section is currently §9 (Demo Features). Append at the end:
1577
+
1578
+ ```markdown
1579
+ ## 10. Drift Surface (Day 7)
1580
+
1581
+ Each predict route maintains a per-worker rolling window of recent
1582
+ prediction confidences (`collections.deque(maxlen=100)`). Train-time
1583
+ median + std are stashed on `model._neurobridge_train_stats` (joblib
1584
+ roundtrip-safe). The drift z-score is `(rolling_median − train_median) /
1585
+ max(train_std, 1e-9)`, computed only when the buffer holds ≥10 samples
1586
+ AND the model has the train-stats attribute. The `/predict/bbb`
1587
+ response carries `drift_z: float | None` and `rolling_n: int`. The UI
1588
+ renders a one-line caption with a magnitude tag (in-band, mild,
1589
+ significant). Worker restart clears the deque; this is acceptable for
1590
+ a demo: the window is transient by design, so there is no audit trail to keep.
1591
+
1592
+ ## 11. LLM Explainer Surface (Day 7)
1593
+
1594
+ `src/llm/explainer.py` is the single entry point for natural-language
1595
+ rationales. `explain(payload)` always returns `{rationale, source,
1596
+ model}`. The deterministic template path is the source of truth for
1597
+ tests; the LLM path is OpenRouter via the `openai==1.51.0` SDK using
1598
+ `meta-llama/llama-3.2-3b-instruct:free`. Two env knobs control the
1599
+ behavior:
1600
+
1601
+ - `OPENROUTER_API_KEY` — when absent, fallback to template.
1602
+ - `NEUROBRIDGE_DISABLE_LLM=1` — hard kill-switch; force template even
1603
+ if a key is set. Use this for demo days when you want fully
1604
+ deterministic, reproducible rationales.
1605
+
1606
+ The `POST /explain/bbb` endpoint mirrors this contract. Pydantic
1607
+ enforces a non-empty `top_features` list (422 on empty); every other
1608
+ failure mode degrades to template + WARNING log + `source="template"`.
1609
+ ```
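The §10 formula and its warm-up rule can be sketched in a few lines. This is an illustration only: `drift_z_from_window` is a hypothetical name (the plan's own helper is `_compute_drift_z`), and the train stats are passed as a plain dict with the `median` and `std` keys described in the plan.

```python
from __future__ import annotations

import statistics
from collections import deque
from typing import Optional

MIN_SAMPLES = 10  # warm-up threshold from §10
WINDOW_SIZE = 100  # deque maxlen from §10


def drift_z_from_window(
    window: deque, train_stats: Optional[dict]
) -> tuple[Optional[float], int]:
    """Return (drift_z, rolling_n); drift_z is None during warm-up or without stats."""
    n = len(window)
    if train_stats is None or n < MIN_SAMPLES:
        return None, n
    rolling_median = statistics.median(window)
    # Guard against a zero train-time std, as in the §10 formula.
    z = (rolling_median - train_stats["median"]) / max(train_stats["std"], 1e-9)
    return z, n


window: deque = deque(maxlen=WINDOW_SIZE)
stats = {"median": 0.8, "std": 0.05}
for _ in range(9):
    window.append(0.9)
warming_up = drift_z_from_window(window, stats)  # still below the threshold
window.append(0.9)
ready = drift_z_from_window(window, stats)  # tenth sample enables the z-score
```

With a rolling median of 0.9 against a train median of 0.8 and std of 0.05, the tenth sample yields a z-score of about 2.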
1610
+
1611
+ ### Step 2: README.md — add Day 7 row + curl recipe
1612
+
1613
+ - [ ] Open `/Users/mertgungor/Desktop/hackathon/README.md`. Find the day-by-day status table from Day 6 (it should have a row like `| Day 6 — Final Polish & Demo Features ... | ✅ Shipped — 165 tests green |`). Append a new row immediately below it:
1614
+
1615
+ ```markdown
1616
+ | Day 7 — Final 5% (Drift, Traceability & Agents) | ✅ Shipped — 175 tests green |
1617
+ ```
1618
+
1619
+ - [ ] Find the "Where to Look" / pointers section (Day 6's close-out added entries here). Append:
1620
+
1621
+ - `docs/superpowers/specs/2026-05-05-day7-drift-traceability-agents-design.md` (Day-7 design spec)
1622
+ - `docs/superpowers/plans/2026-05-05-day7-drift-traceability-agents.md` (Day-7 plan)
1623
+ - New surface: `POST /explain/bbb` — natural-language rationale (LLM + deterministic fallback)
1624
+ - New surface: `drift_z` / `rolling_n` / `provenance` fields in `POST /predict/bbb` response
1625
+
1626
+ - [ ] If a "Demo Recipe" section already exists, add the following under it; otherwise append it as a new section near the end:
1627
+
1628
+ ````markdown
1629
+ ## Day 7 — Demo Recipe
1630
+
1631
+ Pre-flight (one terminal):
1632
+
1633
+ ```bash
1634
+ # Start API with deterministic explainer (no LLM key needed)
1635
+ NEUROBRIDGE_DISABLE_LLM=1 BBB_MODEL_PATH=data/processed/bbb_model.joblib \
1636
+ uvicorn src.api.main:app --port 8000
1637
+ ```
1638
+
1639
+ Predict + explain (other terminal):
1640
+
1641
+ ```bash
1642
+ # 1) Predict — body now carries drift_z, rolling_n, provenance
1643
+ curl -s -X POST http://localhost:8000/predict/bbb \
1644
+ -H "Content-Type: application/json" \
1645
+ -d '{"smiles": "CCO", "top_k": 5}' | jq
1646
+
1647
+ # 2) Explain — feed the predict response back as the explain payload
1648
+ curl -s -X POST http://localhost:8000/explain/bbb \
1649
+ -H "Content-Type: application/json" \
1650
+ -d '{
1651
+ "smiles": "CCO",
1652
+ "label": 1,
1653
+ "label_text": "permeable",
1654
+ "confidence": 0.82,
1655
+ "top_features": [
1656
+ {"feature": "fp_341", "shap_value": 0.045},
1657
+ {"feature": "fp_902", "shap_value": -0.031}
1658
+ ],
1659
+ "drift_z": 0.42,
1660
+ "user_question": "Why permeable?"
1661
+ }' | jq
1662
+
1663
+ # 3) Same call with the LLM enabled: stop uvicorn in the API terminal, then run there:
1664
+ unset NEUROBRIDGE_DISABLE_LLM
1665
+ export OPENROUTER_API_KEY="sk-or-v1-…"
1666
+ # Restart uvicorn without NEUROBRIDGE_DISABLE_LLM=1, then repeat the curl above; expect "source": "llm" and a model name.
1667
+ ```
1668
+
1669
+ Streamlit demo: `streamlit run src/frontend/app.py` → BBB tab → Predict → AI Assistant tab → ask a preset question.
1670
+
1671
+ Drift demo: refresh the BBB tab and predict 10+ times in a row — the drift caption transitions from "warming up" to a numeric z-score.
1672
+ ````
1673
+
1674
+ ### Step 3: Run the full DoD verification
1675
+
1676
+ All of these must pass:
1677
+
1678
+ - [ ] **DoD-1: Full suite at 175.**
1679
+
1680
+ ```bash
1681
+ pytest -q 2>&1 | tail -3
1682
+ ```
1683
+ Expected: **175 passed**.
1684
+
1685
+ - [ ] **DoD-2: UserWarning gate clean.**
1686
+
1687
+ ```bash
1688
+ pytest -W error::UserWarning tests/ 2>&1 | tail -3
1689
+ ```
1690
+ Expected: 175 passed, 0 warnings escalated.
1691
+
1692
+ - [ ] **DoD-3: Streamlit boots.**
1693
+
1694
+ ```bash
1695
+ streamlit run src/frontend/app.py --server.headless true --server.port 8533 &
1696
+ STREAMLIT_PID=$!
1697
+ sleep 6
1698
+ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8533
1699
+ kill $STREAMLIT_PID 2>/dev/null
1700
+ sleep 1
1701
+ ```
1702
+ Expected: `200`.
1703
+
1704
+ - [ ] **DoD-4: Predict endpoint shape.**
1705
+
1706
+ Start the API in the background with the kill-switch on and a fresh deque:
1707
+
1708
+ ```bash
1709
+ NEUROBRIDGE_DISABLE_LLM=1 BBB_MODEL_PATH=data/processed/bbb_model.joblib \
1710
+ uvicorn src.api.main:app --port 8534 &
1711
+ UVICORN_PID=$!
1712
+ sleep 4
1713
+ curl -s -X POST http://localhost:8534/predict/bbb \
1714
+ -H "Content-Type: application/json" \
1715
+ -d '{"smiles": "CCO", "top_k": 3}' | python3 -c "
1716
+ import json, sys
1717
+ body = json.load(sys.stdin)
1718
+ required = {'label','label_text','confidence','top_features','calibration','drift_z','rolling_n','provenance'}
1719
+ missing = required - set(body.keys())
1720
+ print('missing keys:', missing if missing else 'NONE')
1721
+ "
1722
+ kill $UVICORN_PID 2>/dev/null
1723
+ sleep 1
1724
+ ```
1725
+ Expected: `missing keys: NONE`.
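The inline check above only reports missing keys. A slightly stricter variant could also check the §10 warm-up invariant (`drift_z` stays null until the window holds 10 samples). The sketch below assumes the key set from DoD-4; `check_predict_body` is a hypothetical helper, not something the repo is known to contain.

```python
from __future__ import annotations

REQUIRED_KEYS = {
    "label", "label_text", "confidence", "top_features",
    "calibration", "drift_z", "rolling_n", "provenance",
}
MIN_DRIFT_SAMPLES = 10  # warm-up threshold from AGENTS §10


def check_predict_body(body: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    missing = REQUIRED_KEYS - set(body)
    if missing:
        return [f"missing keys: {sorted(missing)}"]
    problems = []
    drift_z, rolling_n = body["drift_z"], body["rolling_n"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(rolling_n, bool) or not isinstance(rolling_n, int):
        problems.append("rolling_n must be an int")
    elif rolling_n < MIN_DRIFT_SAMPLES and drift_z is not None:
        problems.append("drift_z must be null while the window is warming up")
    if drift_z is not None and (
        isinstance(drift_z, bool) or not isinstance(drift_z, (int, float))
    ):
        problems.append("drift_z must be a number or null")
    return problems
```

Piping the `curl` output into a script that calls `check_predict_body(json.load(sys.stdin))` and exits non-zero on a non-empty list would let this check fail loudly instead of only printing.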
1726
+
1727
+ - [ ] **DoD-5: Explain endpoint deterministic path.**
1728
+
1729
+ ```bash
1730
+ NEUROBRIDGE_DISABLE_LLM=1 BBB_MODEL_PATH=data/processed/bbb_model.joblib \
1731
+ uvicorn src.api.main:app --port 8535 &
1732
+ UVICORN_PID=$!
1733
+ sleep 4
1734
+ curl -s -X POST http://localhost:8535/explain/bbb \
1735
+ -H "Content-Type: application/json" \
1736
+ -d '{
1737
+ "smiles": "CCO",
1738
+ "label": 1,
1739
+ "label_text": "permeable",
1740
+ "confidence": 0.82,
1741
+ "top_features": [{"feature":"fp_341","shap_value":0.045}],
1742
+ "drift_z": 0.42
1743
+ }' | python3 -c "
1744
+ import json, sys
1745
+ body = json.load(sys.stdin)
1746
+ assert body['source'] == 'template', f\"expected source=template, got {body['source']}\"
1747
+ assert body['model'] is None
1748
+ assert 'fp_341' in body['rationale']
1749
+ print('explain endpoint OK:', body['rationale'][:80], '…')
1750
+ "
1751
+ kill $UVICORN_PID 2>/dev/null
1752
+ sleep 1
1753
+ ```
1754
+ Expected: `explain endpoint OK: …` printed.
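DoD-5 exercises the template path of the two-path contract from AGENTS §11. The selection logic can be sketched as below. Everything here is illustrative: `should_use_llm`, `template_rationale`, and `explain_sketch` are hypothetical names, the template wording is invented, and the LLM call is injected as a plain callable rather than made through the OpenRouter SDK.

```python
from __future__ import annotations

import logging
from typing import Callable, Mapping, Optional

logger = logging.getLogger(__name__)
LLM_MODEL = "meta-llama/llama-3.2-3b-instruct:free"  # model id from AGENTS §11


def should_use_llm(env: Mapping[str, str]) -> bool:
    """Kill-switch wins; otherwise the LLM path needs a non-empty API key."""
    if env.get("NEUROBRIDGE_DISABLE_LLM") == "1":
        return False
    return bool(env.get("OPENROUTER_API_KEY"))


def template_rationale(payload: dict) -> str:
    """Deterministic text built only from the payload (wording is illustrative)."""
    names = ", ".join(feat["feature"] for feat in payload["top_features"])
    return (
        f"Predicted {payload['label_text']} with confidence "
        f"{payload['confidence']:.2f}; top features: {names}."
    )


def explain_sketch(
    payload: dict,
    env: Mapping[str, str],
    llm_call: Optional[Callable[[dict], str]] = None,
) -> dict:
    """Always return {rationale, source, model}; any LLM failure degrades to template."""
    if llm_call is not None and should_use_llm(env):
        try:
            return {"rationale": llm_call(payload), "source": "llm", "model": LLM_MODEL}
        except Exception:
            logger.warning("LLM path failed; falling back to template", exc_info=True)
    return {"rationale": template_rationale(payload), "source": "template", "model": None}
```

Passing the environment and the LLM call as arguments keeps the sketch testable without network access or real credentials.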
1755
+
1756
+ If ANY of DoD-1 through DoD-5 fails, STOP and report. Do NOT commit T4 with a failing DoD.
1757
+
1758
+ ### Step 4: Commit T4
1759
+
1760
+ - [ ] Run:
1761
+
1762
+ ```bash
1763
+ git add AGENTS.md README.md
1764
+ git commit -m "$(cat <<'EOF'
1765
+ docs: Day-7 close-out — AGENTS §10 drift + §11 LLM explainer + README recipe
1766
+
1767
+ - AGENTS §10 documents the per-worker deque, train-stats stash, and
1768
+ z-score formula. §11 documents the explainer's two-path contract,
1769
+ env knobs (OPENROUTER_API_KEY, NEUROBRIDGE_DISABLE_LLM=1), and the
1770
+ /explain/bbb endpoint shape.
1771
+ - README adds Day 7 to the status table (175 tests green), pointers
1772
+ to the Day-7 spec + plan + new surfaces, and a Demo Recipe section
1773
+ with curl invocations for both endpoints (template-only and LLM).
1774
+ - DoD-1 through DoD-5 all green: pytest 175, UserWarning gate clean,
1775
+ Streamlit boot 200, predict body shape, explain template path.
1776
+
1777
+ Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1778
+ EOF
1779
+ )"
1780
+ ```
1781
+
1782
+ ---
1783
+
1784
+ ## Definition of Done (Day 7)
1785
+
1786
+ | Check | Pass criterion |
1787
+ |---|---|
1788
+ | Full suite green | `pytest -q` reports **175 passed** |
1789
+ | UserWarning gate | `pytest -W error::UserWarning tests/` reports same count, 0 escalations |
1790
+ | Streamlit boots | `curl` against the app started by `streamlit run …` returns HTTP 200 |
1791
+ | `/predict/bbb` body shape | Includes `drift_z`, `rolling_n`, `provenance` keys |
1792
+ | `/explain/bbb` template path | Returns `source: "template"`, rationale contains top feature names |
1793
+ | `_neurobridge_train_stats` persists | `TestTrainStatsMetadata.test_train_stats_survives_save_load_roundtrip` |
1794
+ | Deque rolls at 100 | `TestBBBPredictRoute.test_predict_deque_rolls_at_100` |
1795
+ | AI Assistant tab renders | Streamlit boot + manual click verify |
1796
+ | MLflow badge appears in card | Streamlit boot + manual prediction verify |
1797
+ | AGENTS §10 + §11 committed | yes |
1798
+ | README has Day-7 row + curl recipe | yes |
1799
+ | 9 commits in Day-7 ledger | T1A, T1B, T1C, T2, T3A, T3B, T3C, T4, plus the spec commit `09dd9c3` |
1800
+
1801
+ When all rows are green: Day 7 is sealed. The hackathon submission is ready.
1802
+
1803
+ ---
1804
+
1805
+ ## Self-Review (Plan Author)
1806
+
1807
+ **Spec coverage:**
1808
+ - §1 Goal — covered by all 4 tasks.
1809
+ - §2.1 Drift state location (deque + train_stats) — T1A + T1B.
1810
+ - §2.2 LLM provider (OpenRouter, kill-switch) — T3A.
1811
+ - §3.1 Drift layer (model, schemas, routes, frontend) — T1A + T1B + T1C.
1812
+ - §3.2 MLflow traceability badge (schema, lookup, UI) — T2.
1813
+ - §3.3 LLM explainer module (template + OpenRouter + fallback chain) — T3A.
1814
+ - §3.4 `POST /explain/bbb` (explain_router, schemas, route) — T3B.
1815
+ - §3.5 Streamlit AI Assistant tab (session state, presets, history) — T3C.
1816
+ - §4 Test plan (+10 tests) — 2 (T1A) + 2 (T1B) + 1 (T2) + 4 (T3A) + 1 (T3B) = 10 ✅.
1817
+ - §5 New dep — T3A Step 1.
1818
+ - §6 Failure modes / lifelines — T2 Step 4 (`NEUROBRIDGE_DISABLE_MLFLOW`), T3A `_should_use_llm` + `_llm_explain` exception handler.
1819
+ - §8 DoD — T4 Step 3 (DoD-1 through DoD-5).
1820
+ - §9 Out of scope — explicitly NOT touched (no streaming, no retraining, no vector RAG, no provenance signing).
1821
+
1822
+ **Placeholder scan:** No `TBD`, `TODO`, `FIXME`, "implement later", "fill in details", or vague "add appropriate error handling" instructions remain. Every code step shows the actual code; every command shows the expected output.
1823
+
1824
+ **Type / name consistency:**
1825
+ - `model._neurobridge_train_stats` keys: `median`, `std`, `n_train` — used identically in T1A (set), T1B (`stats["median"]`, `stats["std"]`), T2 (`stats.get("n_train", 0)`). ✅
1826
+ - `WORKER_CONFIDENCE_DEQUE` — defined T1B Step 4, referenced in T1B tests Step 2. ✅
1827
+ - `_compute_drift_z(model, confidence) -> tuple[float | None, int]` — return shape used in T1B Step 4 implementation matches the test assertions in Step 2. ✅
1828
+ - `BBBPredictResponse` field additions: `drift_z` (T1B), `rolling_n` (T1B), `provenance` (T2). UI helper reads the same names in T1C / T2 Step 6. ✅
1829
+ - `ExplainResult` keys: `rationale`, `source`, `model` — used in T3A tests, T3A implementation, T3B route handler, T3B test, T3C UI. ✅
1830
+ - `explain_router` (prefix `/explain`) → `POST /explain/bbb` — declared T3B Step 5, mounted in same step, tested in T3B Step 3, called from UI in T3C Step 3. ✅
1831
+
1832
+ No issues found.