mekosotto Claude Opus 4.7 (1M context) committed on
Commit 62d4000 · 1 Parent(s): 53256ed

docs(plan): add Day-6 final-polish + demo-features plan

3 high-ROI features (Robustness/Interaction/Creativity):
1. BBB edge-case dropdown (frontend curates 5 robustness probes)
2. Calibration trust caption (precision-at-confidence bins from train/test split)
3. MRI ComBat KDE viz + site-gap KPIs (new /pipeline/mri/diagnostics endpoint)

+ 8 new tests, 158 → 165 target. No new pip deps (altair ships with streamlit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs/superpowers/plans/2026-05-04-day6-final-polish-demo-features.md ADDED
@@ -0,0 +1,1109 @@
+ # Day 6 — Final Polish & Demo Features Implementation Plan
+
+ > **For agentic workers:** REQUIRED SUB-SKILL: Use `superpowers:subagent-driven-development` (recommended) or `superpowers:executing-plans` to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+ **Goal:** Raise the jury-day **Robustness, Interaction, and Creativity** sub-scores (slide 14) by adding 3 high-ROI demo features while keeping all 158 existing tests green.
+
+ **Architecture:** Day 5's `/predict/bbb` already returns `confidence` + SHAP top-k. Day 6 adds: (1) a frontend-only "Test Edge Cases" dropdown that picks from a curated catalog of robustness probes and visualizes how the system *gracefully* handles them; (2) a calibration metadata layer: `train()` does an 80/20 stratified split, computes precision-at-confidence-threshold bins, and stashes them on `model._neurobridge_calibration`; the API includes the matching bin in the response and the UI renders a one-line trust caption; (3) a new `POST /pipeline/mri/diagnostics` endpoint that runs the MRI pipeline twice (pre-ComBat features + post-ComBat features) and returns a long-format JSON for the Streamlit MRI tab to render as side-by-side altair KDE plots colored by site, headlined with the 3290× site-gap reduction KPI.
+
+ **Tech Stack:** No new pip deps: altair 5.5.0 ships with Streamlit; sklearn already has `train_test_split`. Existing brand tokens (navy `#0F172A`, sky `#0369A1`, slate `#475569`, Plus Jakarta Sans) are reused. New UX patterns conform to the Day-4 Trust & Authority style.
+
+ ---
+
+ ## File Structure
+
+ ```
+ src/
+ ├── models/
+ │   └── bbb_model.py          # MODIFY — Task 2: train() returns model with calibration metadata
+ ├── api/
+ │   ├── schemas.py            # MODIFY — Task 2/3: CalibrationContext, MRIDiagnosticsResponse
+ │   └── routes.py             # MODIFY — Task 2/3: include calibration in /predict/bbb, new /pipeline/mri/diagnostics
+ ├── pipelines/
+ │   └── mri_pipeline.py       # MODIFY — Task 3: expose compute_harmonization_diagnostics()
+ └── frontend/
+     └── app.py                # MODIFY — Tasks 1/2/3: edge-case dropdown, trust caption, MRI KDE viz
+
+ tests/
+ ├── models/
+ │   └── test_bbb_model.py     # MODIFY — Task 2: TestCalibrationMetadata (3 tests)
+ ├── api/
+ │   └── test_routes.py        # MODIFY — Task 2/3: calibration assertion + TestMRIDiagnosticsRoute (3 tests)
+ ├── pipelines/
+ │   └── test_mri_pipeline.py  # MODIFY — Task 3: TestComputeHarmonizationDiagnostics (2 tests)
+ └── frontend/
+     └── test_app_import.py    # (unchanged — 2 import-smoke tests still cover the module)
+
+ AGENTS.md                     # MODIFY — Task 4: §8 Calibration sub-section + §9 Demo Features
+ README.md                     # MODIFY — Task 4
+ ```
+
+ **Test count target:** 158 + ~8 = **~166 tests green at end of Day 6**.
+
+ ---
+
+ ## Task 1: BBB Tab — "Test Edge Cases" dropdown (Robustness)
+
+ **Files:**
+ - Modify: `src/frontend/app.py`
+
+ **No backend changes** — this task purely curates inputs the existing `/predict/bbb` endpoint already handles correctly. The user value is that the dropdown turns implicit robustness into a visible demo artifact.
+
+ - [ ] **Step 1: Replace the bare `text_input` in `_render_bbb_tab` with a robustness-aware input flow**
+
+ In `/Users/mertgungor/Desktop/hackathon/src/frontend/app.py`, find `_render_bbb_tab()` (search for the existing `smiles = st.text_input("SMILES string", ...)` call) and replace the input section (the `st.text_input` + `st.slider` + `st.button` block) with this:
+
+ ```python
+     EDGE_CASES = {
+         "Custom input (default)": {
+             "smiles": "CCO",
+             "label": "Ethanol — small, drug-like, BBB-permeable",
+             "expectation": "High confidence, label = permeable",
+         },
+         "Invalid SMILES (parse-error path)": {
+             "smiles": "this_is_not_a_valid_molecule_at_all_!!",
+             "label": "Garbage string — should not parse",
+             "expectation": "API returns HTTP 400 with parse error; UI shows recoverable warning",
+         },
+         "Empty string (boundary)": {
+             "smiles": "",
+             "label": "Empty input — boundary condition",
+             "expectation": "Pydantic accepts empty; API returns 400 (RDKit cannot parse)",
+         },
+         "Massive OOD: cyclosporine-like macrocycle": {
+             "smiles": (
+                 "CC[C@H](C)[C@@H]1NC(=O)[C@H](CC(C)C)N(C)C(=O)[C@H](CC(C)C)N(C)C(=O)"
+                 "[C@@H]2CCCN2C(=O)[C@H](C(C)C)NC(=O)[C@H]([C@@H](C)CC)N(C)C(=O)"
+                 "[C@H](C)NC(=O)[C@H](C)NC(=O)[C@H](CC(C)C)N(C)C(=O)[C@@H](NC(=O)"
+                 "[C@H](CC(C)C)N(C)C(=O)CN(C)C1=O)C(C)C"
+             ),
+             "label": "Cyclosporine — 11-residue macrocycle (~1.2 kDa)",
+             "expectation": (
+                 "Far outside training distribution; model should hedge "
+                 "with low confidence (well-calibrated systems don't "
+                 "pretend to know)."
+             ),
+         },
+         "OOD: heavy halogenated aromatic": {
+             "smiles": "Fc1c(F)c(F)c(c(F)c1F)c2c(F)c(F)c(F)c(F)c2F",
+             "label": "Decafluorobiphenyl — extreme halogen density",
+             "expectation": "Rare scaffold; expect lowered confidence vs ethanol",
+         },
+     }
+
+     case_name = st.selectbox(
+         "Test Edge Cases",
+         options=list(EDGE_CASES.keys()),
+         index=0,
+         key="bbb_case",
+         help=(
+             "Pick a robustness probe. Each case demonstrates how the "
+             "system handles a real-world failure mode — invalid input, "
+             "out-of-distribution molecules, or boundary conditions."
+         ),
+     )
+     case = EDGE_CASES[case_name]
+     st.caption(f"**Probe:** {case['label']} · **Expected:** {case['expectation']}")
+
+     smiles = st.text_input(
+         "SMILES string",
+         value=case["smiles"],
+         key="bbb_smiles",
+         help="Examples: CCO (ethanol), CC(=O)Nc1ccc(O)cc1 (paracetamol)",
+     )
+     top_k = st.slider(
+         "SHAP features to display", min_value=3, max_value=10, value=5, key="bbb_topk",
+     )
+
+     if st.button("Predict BBB permeability", type="primary", key="bbb_predict"):
+         with st.spinner("Computing fingerprint, predicting, and explaining…"):
+             try:
+                 result = _post("/predict/bbb", {"smiles": smiles, "top_k": top_k})
+                 _render_prediction_card(result)
+                 st.toast("Prediction complete", icon="✅")
+             except httpx.HTTPStatusError as e:
+                 if e.response.status_code == 503:
+                     st.error(
+                         "Model artifact not loaded yet. Run "
+                         "`python -m src.models.bbb_model` to train it, "
+                         "then retry."
+                     )
+                 elif e.response.status_code == 400:
+                     # Robustness story: show the WARNING instead of an ERROR
+                     # — invalid input is a recoverable path, not a crash.
+                     st.warning(
+                         f"Robustness check passed: API rejected the input "
+                         f"with HTTP 400 (no crash). Detail: "
+                         f"{e.response.json().get('detail', e.response.text)}"
+                     )
+                 else:
+                     st.error(
+                         f"Prediction failed (HTTP {e.response.status_code}): "
+                         f"{e.response.text}"
+                     )
+             except httpx.RequestError as e:
+                 st.error(f"Cannot reach FastAPI at {_API_URL}: {e!r}")
+ ```
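
The status-code branching in the block above can also be isolated in a small pure function, so the robustness behaviour is checkable without a running Streamlit session. This is a minimal sketch under that assumption: `classify_status` and its three return labels are hypothetical names for illustration and are not part of the existing `app.py`.

```python
# Hypothetical helper (not in the original app.py): maps an HTTP status code
# returned by /predict/bbb to the UI severity used by the edge-case flow.
def classify_status(status_code: int) -> str:
    """Return 'success', 'warning', or 'error' for a given HTTP status."""
    if 200 <= status_code < 300:
        return "success"  # the prediction card is rendered
    if status_code == 400:
        return "warning"  # invalid input was rejected cleanly: recoverable
    return "error"  # 503 (model not loaded) and anything unexpected


if __name__ == "__main__":
    for code in (200, 400, 503, 500):
        print(code, classify_status(code))
```

Keeping the mapping separate from the `st.warning` / `st.error` calls would let the 400-versus-503 distinction be unit-tested, at the cost of one more indirection in the tab code.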
+
+ > **Note for the implementer**: read `_render_bbb_tab` first to confirm the current variable names and widget keys (`smiles`, `top_k`, `bbb_predict`). Substitute as needed if the file diverged.
+
+ - [ ] **Step 2: Run the existing frontend smoke tests to confirm import still works**
+
+ ```
+ cd /Users/mertgungor/Desktop/hackathon && source .venv312/bin/activate && pytest tests/frontend/ -v
+ ```
+ Expected: 2 passed (the import-smoke tests still verify the module loads).
+
+ - [ ] **Step 3: Run the full suite to confirm 158 still green**
+
+ ```
+ pytest -v 2>&1 | tail -3
+ ```
+ Expected: 158 passed (no test count change — UI-only addition).
+
+ - [ ] **Step 4: Manual smoke**
+
+ ```
+ streamlit run src/frontend/app.py --server.headless true &
+ sleep 5
+ curl -s http://localhost:8501 | head -3
+ pkill -f "streamlit run"
+ ```
+ Expected: the first lines of Streamlit's HTML page. `curl -s` does not print the status code; add `-o /dev/null -w '%{http_code}\n'` to confirm HTTP 200 explicitly.
+
+ - [ ] **Step 5: Commit**
+
+ ```bash
+ git add src/frontend/app.py
+ git commit -m "feat(frontend): edge-case dropdown for BBB robustness demo"
+ ```
+
+ ---
+
+ ## Task 2: Decision Card calibration trust caption (Interaction & Trust)
+
+ **Files:**
+ - Modify: `src/models/bbb_model.py` — `train()` computes calibration bins on a held-out 20% test split, stashes them on `model._neurobridge_calibration`
+ - Modify: `src/api/schemas.py` — add `CalibrationContext` schema; extend `BBBPredictResponse` with `calibration: CalibrationContext | None`
+ - Modify: `src/api/routes.py` — `predict_bbb` looks up the matching calibration bin and includes it in the response
+ - Modify: `tests/models/test_bbb_model.py` — `TestCalibrationMetadata` (3 tests)
+ - Modify: `tests/api/test_routes.py` — extend `TestBBBPredictRoute.test_returns_200_*` with calibration assertion
+ - Modify: `src/frontend/app.py` — `_render_prediction_card` shows trust caption from `result["calibration"]`
+
+ ### 2A — Calibration metadata in `train()`
+
+ - [ ] **Step 1: Append failing tests to `tests/models/test_bbb_model.py`**
+
+ Add a `TestCalibrationMetadata` class at the bottom of `tests/models/test_bbb_model.py`:
+
+ ```python
+ class TestCalibrationMetadata:
+     def test_train_attaches_calibration_attribute(self, trained_model_and_features):
+         model, _ = trained_model_and_features
+         assert hasattr(model, "_neurobridge_calibration")
+         bins = model._neurobridge_calibration
+         assert isinstance(bins, list)
+         # Always at least one bin (the lowest-threshold one)
+         assert len(bins) >= 1
+         for b in bins:
+             assert "threshold" in b
+             assert "precision" in b
+             assert "support" in b
+             assert 0.0 <= b["threshold"] <= 1.0
+             assert 0.0 <= b["precision"] <= 1.0
+             assert b["support"] >= 0
+
+     def test_calibration_thresholds_are_sorted_ascending(
+         self, trained_model_and_features,
+     ):
+         model, _ = trained_model_and_features
+         thresholds = [b["threshold"] for b in model._neurobridge_calibration]
+         assert thresholds == sorted(thresholds)
+
+     def test_calibration_survives_save_load_roundtrip(
+         self, trained_model_and_features, tmp_path: Path,
+     ):
+         model, _ = trained_model_and_features
+         artifact = tmp_path / "calibrated.joblib"
+         bbb_model.save(model, artifact)
+         reloaded = bbb_model.load(artifact)
+         assert hasattr(reloaded, "_neurobridge_calibration")
+         assert reloaded._neurobridge_calibration == model._neurobridge_calibration
+ ```
+
+ - [ ] **Step 2: Run tests to verify they fail**
+
+ ```
+ pytest tests/models/test_bbb_model.py::TestCalibrationMetadata -v
+ ```
+ Expected: 3 failures, because the `_neurobridge_calibration` attribute does not exist yet.
+
+ - [ ] **Step 3: Implement calibration in `train()`**
+
+ In `src/models/bbb_model.py`, add this near the top of the file (with other imports):
+
+ ```python
+ from sklearn.model_selection import train_test_split
+ ```
+
+ And add a private helper above `train()`:
+
+ ```python
+ _CALIBRATION_THRESHOLDS: tuple[float, ...] = (0.50, 0.60, 0.70, 0.75, 0.80, 0.90)
+
+
+ def _compute_calibration_bins(
+     model: RandomForestClassifier,
+     X_test: np.ndarray,
+     y_test: np.ndarray,
+ ) -> list[dict[str, float]]:
+     """Compute precision-at-confidence-threshold bins on a held-out test set.
+
+     For each threshold T in `_CALIBRATION_THRESHOLDS`, picks the predictions
+     whose max class probability >= T, computes precision and support, and
+     returns one bin per threshold. Bins with zero support are still emitted
+     (precision = 0.0, support = 0) so the API can always find a match.
+     """
+     if len(y_test) == 0:
+         return [
+             {"threshold": float(t), "precision": 0.0, "support": 0}
+             for t in _CALIBRATION_THRESHOLDS
+         ]
+     proba = model.predict_proba(X_test)
+     pred = model.predict(X_test)
+     confidence = proba.max(axis=1)
+     correct = (pred == y_test).astype(int)
+     bins: list[dict[str, float]] = []
+     for t in _CALIBRATION_THRESHOLDS:
+         mask = confidence >= t
+         support = int(mask.sum())
+         if support == 0:
+             precision = 0.0
+         else:
+             precision = float(correct[mask].mean())
+         bins.append({
+             "threshold": float(t), "precision": precision, "support": support,
+         })
+     return bins
+ ```
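
To make the bin semantics concrete, the sketch below recomputes the same precision-at-confidence-threshold idea in pure Python on six made-up predictions. The function name `toy_calibration_bins` and the sample data are assumptions for illustration only; they are not project code or project results.

```python
# Illustrative only: same thresholds as the plan, toy data invented here.
THRESHOLDS = (0.50, 0.60, 0.70, 0.75, 0.80, 0.90)


def toy_calibration_bins(confidences, correct, thresholds=THRESHOLDS):
    """For each threshold, keep predictions with confidence >= threshold and
    report the fraction that were correct (precision) and how many (support)."""
    bins = []
    for t in thresholds:
        kept = [ok for conf, ok in zip(confidences, correct) if conf >= t]
        support = len(kept)
        precision = sum(kept) / support if support else 0.0
        bins.append({"threshold": t, "precision": precision, "support": support})
    return bins


# Max class probability per prediction, and whether each prediction was right.
confidences = [0.55, 0.65, 0.72, 0.81, 0.93, 0.97]
correct = [0, 1, 0, 1, 1, 1]
toy_bins = toy_calibration_bins(confidences, correct)

if __name__ == "__main__":
    for b in toy_bins:
        print(b)
```

Because the bins are cumulative (every bin includes all predictions above its threshold), support can only shrink as the threshold rises. In this toy data five predictions clear 0.60 and four of them are correct, while only two clear 0.90.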
+
+ Then modify the existing `train()` function. Find the line `model.fit(X, y)` and replace the body around it:
+
+ ```python
+     X, y, fp_cols = _split_features_and_label(df, label_col)
+     # Stratified 80/20 split for honest calibration metrics. Falls back to
+     # train-on-all if the dataset is too tiny for a stratified split (test
+     # fixtures with 3-4 rows hit this branch).
+     try:
+         X_train, X_test, y_train, y_test = train_test_split(
+             X, y, test_size=0.2, random_state=random_state, stratify=y,
+         )
+     except ValueError:
+         # n_classes > n_samples_per_class for stratification: train on all,
+         # leave the test set empty (calibration bins will be zero-support).
+         X_train, X_test, y_train, y_test = X, np.empty((0, X.shape[1])), y, np.empty((0,))
+
+     model = RandomForestClassifier(
+         n_estimators=n_estimators,
+         random_state=random_state,
+         n_jobs=1,
+     )
+     model.fit(X_train, y_train)
+     # Stash the column names under a project-owned attribute so SHAP (Task 2)
+     # can map values back to fp_<bit> indices. Sklearn's own feature_names_in_
+     # is only set automatically when fit receives a DataFrame; setting it
+     # manually fires UserWarning on every predict call.
+     model._neurobridge_fp_cols = list(fp_cols)
+     model._neurobridge_calibration = _compute_calibration_bins(model, X_test, y_test)
+     logger.info(
+         "Trained BBB classifier: n=%d, n_features=%d, classes=%s, "
+         "calibration_bins=%d",
+         len(y), X.shape[1], model.classes_.tolist(),
+         len(model._neurobridge_calibration),
+     )
+     return model
+ ```
+
+ - [ ] **Step 4: Run tests, expect 3 passed**
+
+ ```
+ pytest tests/models/test_bbb_model.py::TestCalibrationMetadata -v
+ ```
+
+ - [ ] **Step 5: Run full model + suite (the existing `TestTrain` tests now run on a slightly smaller training set; verify they still pass)**
+
+ ```
+ pytest tests/models/ -v 2>&1 | tail -10
+ ```
+ Expected: 16 passed (13 prior + 3 new calibration tests).
+
+ ```
+ pytest -v 2>&1 | tail -3
+ ```
+ Expected: 161 passed (158 prior + 3 new).
+
+ - [ ] **Step 6: Commit Task 2A**
+
+ ```bash
+ git add src/models/bbb_model.py tests/models/test_bbb_model.py
+ git commit -m "feat(models): compute precision-at-confidence calibration bins"
+ ```
+
+ ### 2B — API + Pydantic schema
+
+ - [ ] **Step 7: Add `CalibrationContext` schema and extend `BBBPredictResponse`**
+
+ In `src/api/schemas.py`, append after `FeatureAttribution`:
+
+ ```python
+ class CalibrationContext(BaseModel):
+     """The calibration bin matching the prediction's confidence.
+
+     Lets the UI show statements like 'predictions ≥75% confident are
+     correct 92% of the time on the held-out test set'.
+     """
+     threshold: float = Field(..., description="Confidence threshold for this bin")
+     precision: float = Field(..., description="Test-set precision at this threshold")
+     support: int = Field(..., description="Test-set sample count at this threshold")
+ ```
+
+ Modify `BBBPredictResponse`: add a new optional field `calibration: CalibrationContext | None = None`. The full updated class:
+
+ ```python
+ class BBBPredictResponse(BaseModel):
+     """Decision-system payload: prediction + uncertainty + explanation."""
+     label: int
+     label_text: str = Field(..., description="'permeable' or 'non-permeable'")
+     confidence: float
+     top_features: list[FeatureAttribution]
+     calibration: CalibrationContext | None = Field(
+         None,
+         description=(
+             "The calibration bin matching this prediction's confidence; "
+             "None if the model lacks calibration metadata"
+         ),
+     )
+ ```
+
+ - [ ] **Step 8: Extend `predict_bbb` route to look up the matching bin**
+
+ In `src/api/routes.py`, find the existing `predict_bbb` function. Add a helper above it:
+
+ ```python
+ def _matching_calibration_bin(
+     model: object, confidence: float,
+ ) -> dict[str, float] | None:
+     """Return the highest-threshold calibration bin still ≤ `confidence`.
+
+     Returns None if the model has no calibration metadata.
+     """
+     bins = getattr(model, "_neurobridge_calibration", None)
+     if not bins:
+         return None
+     eligible = [b for b in bins if b["threshold"] <= confidence]
+     if not eligible:
+         return None
+     return max(eligible, key=lambda b: b["threshold"])
+ ```
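
The lookup rule (highest threshold that does not exceed the prediction's confidence) is easy to check on a hand-written bin list. The sketch below restates the rule as a standalone function; `pick_bin` and the three sample bins are hypothetical values chosen for illustration, not output from the real model.

```python
# Standalone restatement of the lookup rule with invented sample bins.
def pick_bin(bins, confidence):
    """Return the bin with the highest threshold <= confidence, or None."""
    eligible = [b for b in bins if b["threshold"] <= confidence]
    if not eligible:
        return None
    return max(eligible, key=lambda b: b["threshold"])


sample_bins = [
    {"threshold": 0.50, "precision": 0.70, "support": 40},
    {"threshold": 0.75, "precision": 0.85, "support": 22},
    {"threshold": 0.90, "precision": 0.95, "support": 9},
]

if __name__ == "__main__":
    print(pick_bin(sample_bins, 0.82))  # falls into the 0.75 bin
    print(pick_bin(sample_bins, 0.90))  # the boundary is inclusive
    print(pick_bin(sample_bins, 0.42))  # below every threshold
```

For a binary classifier the max class probability is never below 0.5, so with a lowest threshold of 0.50 the `None` branch should only be reachable when calibration metadata is missing or the threshold list is changed.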
+
+ Modify the response construction at the end of `predict_bbb`:
+
+ ```python
+     label_text = "permeable" if pred["label"] == 1 else "non-permeable"
+     cal_bin = _matching_calibration_bin(model, pred["confidence"])
+     return BBBPredictResponse(
+         label=pred["label"],
+         label_text=label_text,
+         confidence=pred["confidence"],
+         top_features=[FeatureAttribution(**a) for a in attributions],
+         calibration=(
+             CalibrationContext(**cal_bin) if cal_bin is not None else None
+         ),
+     )
+ ```
+
+ Update the imports in `src/api/routes.py` to include `CalibrationContext`:
+
+ ```python
+ from src.api.schemas import (
+     BBBPredictRequest,
+     BBBPredictResponse,
+     BBBRequest,
+     CalibrationContext,
+     EEGRequest,
+     FeatureAttribution,
+     MRIRequest,
+     PipelineResponse,
+ )
+ ```
+
+ - [ ] **Step 9: Extend the existing 200-path test in `tests/api/test_routes.py`**
+
+ Find `TestBBBPredictRoute.test_returns_200_with_prediction_and_attributions` and add these assertions before the closing of the `try:` block:
+
+ ```python
+             assert "calibration" in body
+             if body["calibration"] is not None:
+                 cal = body["calibration"]
+                 assert "threshold" in cal and 0.0 <= cal["threshold"] <= 1.0
+                 assert "precision" in cal and 0.0 <= cal["precision"] <= 1.0
+                 assert "support" in cal and cal["support"] >= 0
+ ```
+
+ - [ ] **Step 10: Run tests**
+
+ ```
+ pytest tests/api/test_routes.py::TestBBBPredictRoute -v
+ ```
+ Expected: 3 passed.
+
+ ```
+ pytest -v 2>&1 | tail -3
+ ```
+ Expected: 161 passed (no count change — extended existing test).
+
+ - [ ] **Step 11: Commit Task 2B**
+
+ ```bash
+ git add src/api/schemas.py src/api/routes.py tests/api/test_routes.py
+ git commit -m "feat(api): include calibration context in /predict/bbb response"
+ ```
+
+ ### 2C — Streamlit trust caption
+
+ - [ ] **Step 12: Update `_render_prediction_card` in `src/frontend/app.py`**
+
+ Find `_render_prediction_card` and append the trust caption logic AFTER the existing `st.progress(float(result["confidence"]))` line and BEFORE the SHAP section (the `st.markdown` for "Top N SHAP attributions"):
+
+ ```python
+     cal = result.get("calibration")
+     if cal is not None and cal["support"] > 0:
+         threshold_pct = cal["threshold"] * 100
+         precision_pct = cal["precision"] * 100
+         st.caption(
+             f"📊 **Calibration context:** on the held-out test set, "
+             f"predictions with ≥{threshold_pct:.0f}% confidence are correct "
+             f"**{precision_pct:.0f}%** of the time (n={cal['support']})."
+         )
+     elif cal is not None:
+         # Bin matched but support=0 (tiny test fixtures). Still useful as honesty.
+         threshold_pct = cal["threshold"] * 100
+         st.caption(
+             f"📊 **Calibration context:** test-set support at "
+             f"≥{threshold_pct:.0f}% confidence is too small to estimate "
+             f"precision (n=0). Train on a larger dataset to see precision."
+         )
+ ```
+
+ - [ ] **Step 13: Run frontend smoke**
+
+ ```
+ pytest tests/frontend/ -v
+ ```
+ Expected: 2 passed.
+
+ - [ ] **Step 14: Commit Task 2C**
+
+ ```bash
+ git add src/frontend/app.py
+ git commit -m "feat(frontend): trust caption with calibration context in BBB card"
+ ```
+
+ ---
+
+ ## Task 3: MRI ComBat histogram visualization (Creativity & Problem Depth)
+
+ **Files:**
+ - Modify: `src/pipelines/mri_pipeline.py` — add `compute_harmonization_diagnostics(input_dir, sites_csv, ...)` returning a long-format DataFrame
+ - Modify: `src/api/schemas.py` — `MRIDiagnosticsRequest`, `MRIDiagnosticsResponse`
+ - Modify: `src/api/routes.py` — `POST /pipeline/mri/diagnostics` endpoint
+ - Modify: `tests/pipelines/test_mri_pipeline.py` — `TestComputeHarmonizationDiagnostics` (2 tests)
+ - Modify: `tests/api/test_routes.py` — `TestMRIDiagnosticsRoute` (2 tests)
+ - Modify: `src/frontend/app.py` — replace bare MRI tab with diagnostics-driven KDE viz + site-gap KPI
+
+ ### 3A — `compute_harmonization_diagnostics` function
+
+ - [ ] **Step 1: Append failing tests to `tests/pipelines/test_mri_pipeline.py`**
+
+ Append at the bottom:
+
+ ```python
+ class TestComputeHarmonizationDiagnostics:
+     def test_returns_long_format_with_pre_and_post_states(self, tmp_path: Path):
+         from tests.fixtures.build_mri_fixture import build as build_mri
+         from src.pipelines.mri_pipeline import compute_harmonization_diagnostics
+
+         fixture_dir = build_mri(out_dir=tmp_path / "mri")
+         diagnostics = compute_harmonization_diagnostics(
+             input_dir=fixture_dir,
+             sites_csv=fixture_dir / "sites.csv",
+         )
+         assert "feature_value" in diagnostics.columns
+         assert "site" in diagnostics.columns
+         assert "harmonization_state" in diagnostics.columns
+         assert "feature" in diagnostics.columns
+         states = set(diagnostics["harmonization_state"].unique())
+         assert states == {"Pre-ComBat", "Post-ComBat"}
+
+     def test_post_combat_site_gap_is_smaller_than_pre(self, tmp_path: Path):
+         """Day-3 demonstrated 5.0 → 0.0015 gap reduction. This regression
+         test pins the property: post-ComBat per-site means MUST be closer
+         together than pre-ComBat per-site means."""
+         from tests.fixtures.build_mri_fixture import build as build_mri
+         from src.pipelines.mri_pipeline import compute_harmonization_diagnostics
+
+         fixture_dir = build_mri(out_dir=tmp_path / "mri")
+         diagnostics = compute_harmonization_diagnostics(
+             input_dir=fixture_dir,
+             sites_csv=fixture_dir / "sites.csv",
+         )
+         pre = diagnostics[diagnostics["harmonization_state"] == "Pre-ComBat"]
+         post = diagnostics[diagnostics["harmonization_state"] == "Post-ComBat"]
+         # Compute site-gap as range of per-site means on the first feature
+         feat = diagnostics["feature"].iloc[0]
+         pre_gap = pre[pre["feature"] == feat].groupby("site")["feature_value"].mean().agg(lambda s: s.max() - s.min())
+         post_gap = post[post["feature"] == feat].groupby("site")["feature_value"].mean().agg(lambda s: s.max() - s.min())
+         assert post_gap < pre_gap, (
+             f"Expected post-gap < pre-gap, got pre={pre_gap}, post={post_gap}"
+         )
+ ```
+
+ - [ ] **Step 2: Run failing tests**
+
+ ```
+ pytest tests/pipelines/test_mri_pipeline.py::TestComputeHarmonizationDiagnostics -v
+ ```
+ Expected: ImportError — `compute_harmonization_diagnostics` does not exist.
+
+ - [ ] **Step 3: Implement `compute_harmonization_diagnostics`**
+
+ In `src/pipelines/mri_pipeline.py`, append at the end of the file (after `run_pipeline`):
+
+ ```python
+ def compute_harmonization_diagnostics(
+     input_dir: Path,
+     sites_csv: Path | None = None,
+     intensity_threshold: float | None = None,
+     n_roi_axes: tuple[int, int, int] = DEFAULT_N_ROI_AXES,
+ ) -> pd.DataFrame:
+     """Run the MRI pipeline twice — pre-ComBat features and post-ComBat —
+     and return a long-format DataFrame ready for visualization.
+
+     Output columns: ``subject_id``, ``site``, ``feature``, ``feature_value``,
+     ``harmonization_state`` ('Pre-ComBat' or 'Post-ComBat').
+
+     Used by the FastAPI ``/pipeline/mri/diagnostics`` endpoint to feed the
+     Streamlit MRI tab's KDE / histogram comparison plot.
+
+     Raises:
+         FileNotFoundError: if ``input_dir`` does not exist.
+         KeyError: if any subject is missing a site assignment.
+     """
+     input_dir = Path(input_dir)
+     if not input_dir.exists():
+         raise FileNotFoundError(f"MRI input directory not found: {input_dir}")
+     sites_csv = Path(sites_csv) if sites_csv is not None else input_dir / "sites.csv"
+     sites_df = pd.read_csv(sites_csv)
+
+     feature_cols = [
+         f"feat_roi{i}_{stat}"
+         for i in range(int(np.prod(n_roi_axes)))
+         for stat in ROI_STATS
+     ]
+
+     rows: list[dict[str, object]] = []
+     for nifti_path in sorted(input_dir.glob("*.nii*")):
+         subject_id = nifti_path.stem.replace(".nii", "")
+         volume = nib.load(nifti_path).get_fdata()
+         if not is_valid_volume(volume):
+             continue
+         mask = mask_brain(volume, intensity_threshold=intensity_threshold)
+         feats = extract_features_from_volume(
+             volume, mask, n_roi_axes=n_roi_axes,
+         )
+         row: dict[str, object] = {"subject_id": subject_id}
+         row.update(feats)
+         rows.append(row)
+
+     if not rows:
+         return pd.DataFrame(columns=[
+             "subject_id", "site", "feature", "feature_value", "harmonization_state",
+         ])
+
+     raw_features = pd.DataFrame(rows).merge(sites_df, on="subject_id", how="left")
+     if raw_features["site"].isna().any():
+         missing = raw_features.loc[raw_features["site"].isna(), "subject_id"].tolist()
+         raise KeyError(
+             f"sites_csv missing site assignment for subjects: {missing}"
+         )
+
+     # Post-ComBat: variance-aware harmonization. Reuses the same logic as
+     # run_pipeline so diagnostics reflect production behavior exactly.
+     col_std = raw_features[feature_cols].std()
+     var_feature_cols = [
+         c for c in feature_cols if col_std[c] > _MIN_VAR_THRESHOLD
+     ]
+     zero_var_cols = [
+         c for c in feature_cols if col_std[c] <= _MIN_VAR_THRESHOLD
+     ]
+     if not var_feature_cols:
+         harmonized = raw_features[feature_cols].copy()
+     else:
+         harmonized = harmonize_combat(
+             raw_features, raw_features["site"], var_feature_cols,
+         )
+         for c in zero_var_cols:
+             harmonized[c] = raw_features[c].to_numpy()
+         harmonized = harmonized[feature_cols]
+     post_features = pd.concat(
+         [raw_features[["subject_id", "site"]].reset_index(drop=True),
+          harmonized.reset_index(drop=True)],
+         axis=1,
+     )
+
+     long_pre = raw_features.melt(
+         id_vars=["subject_id", "site"], value_vars=feature_cols,
+         var_name="feature", value_name="feature_value",
+     )
+     long_pre["harmonization_state"] = "Pre-ComBat"
+     long_post = post_features.melt(
+         id_vars=["subject_id", "site"], value_vars=feature_cols,
+         var_name="feature", value_name="feature_value",
+     )
+     long_post["harmonization_state"] = "Post-ComBat"
+     return pd.concat([long_pre, long_post], ignore_index=True)
+ ```
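
The long-format output is what makes the site-gap KPI a simple group-and-range computation. As a minimal sketch, the pure-Python function below computes the range of per-site means for one feature and one harmonization state from rows shaped like the function's output. The helper name `site_gap` and every number in `toy_rows` are invented for illustration and do not come from the MRI fixture.

```python
from collections import defaultdict
from statistics import mean


def site_gap(rows, feature, state):
    """Range (max - min) of per-site means for one feature and one state."""
    per_site = defaultdict(list)
    for r in rows:
        if r["feature"] == feature and r["harmonization_state"] == state:
            per_site[r["site"]].append(r["feature_value"])
    site_means = [mean(values) for values in per_site.values()]
    return max(site_means) - min(site_means) if site_means else 0.0


def _row(subject, site, value, state):
    # Builds one long-format record with the plan's column names.
    return {"subject_id": subject, "site": site, "feature": "feat_roi0_mean",
            "feature_value": value, "harmonization_state": state}


# Two sites, two subjects each; values are made up for the example.
toy_rows = [
    _row("s1", "A", 10.0, "Pre-ComBat"), _row("s2", "A", 12.0, "Pre-ComBat"),
    _row("s3", "B", 15.0, "Pre-ComBat"), _row("s4", "B", 17.0, "Pre-ComBat"),
    _row("s1", "A", 13.0, "Post-ComBat"), _row("s2", "A", 14.0, "Post-ComBat"),
    _row("s3", "B", 13.5, "Post-ComBat"), _row("s4", "B", 14.5, "Post-ComBat"),
]

pre_gap = site_gap(toy_rows, "feat_roi0_mean", "Pre-ComBat")
post_gap = site_gap(toy_rows, "feat_roi0_mean", "Post-ComBat")

if __name__ == "__main__":
    print(pre_gap, post_gap)
```

In this toy data the per-site means move from 11.0 and 16.0 before harmonization to 13.5 and 14.0 after it, which is the property the regression test pins (`post_gap < pre_gap`).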
+
+ - [ ] **Step 4: Run tests**
+
+ ```
+ pytest tests/pipelines/test_mri_pipeline.py::TestComputeHarmonizationDiagnostics -v
+ ```
+ Expected: 2 passed.
+
+ - [ ] **Step 5: Run full MRI tests + suite**
+
+ ```
+ pytest tests/pipelines/test_mri_pipeline.py -v 2>&1 | tail -5
+ ```
+ Expected: 42 passed (40 prior + 2 new).
+
+ ```
+ pytest -v 2>&1 | tail -3
+ ```
+ Expected: 163 passed.
+
+ - [ ] **Step 6: Commit Task 3A**
+
+ ```bash
+ git add src/pipelines/mri_pipeline.py tests/pipelines/test_mri_pipeline.py
+ git commit -m "feat(mri): compute_harmonization_diagnostics for pre/post comparison"
+ ```
+
+ ### 3B — API endpoint
+
+ - [ ] **Step 7: Add Pydantic schemas to `src/api/schemas.py`**
+
+ Append at the bottom:
+
+ ```python
+ class MRIDiagnosticsRequest(BaseModel):
+     """Request body for /pipeline/mri/diagnostics — same as MRIRequest minus output_path."""
+     input_dir: str = Field(..., description="Directory of .nii.gz files")
+     sites_csv: str = Field(..., description="CSV mapping subject_id → site")
+
+
+ class HarmonizationRow(BaseModel):
+     subject_id: str
+     site: str
+     feature: str
+     feature_value: float
+     harmonization_state: str
+
+
+ class MRIDiagnosticsResponse(BaseModel):
+     """Long-format pre/post ComBat data for visualization."""
+     rows: list[HarmonizationRow]
+     site_gap_pre: float = Field(..., description="Range of per-site means before ComBat")
+     site_gap_post: float = Field(..., description="Range of per-site means after ComBat")
+     reduction_factor: float = Field(..., description="site_gap_pre / max(site_gap_post, eps)")
+ ```
731
+
+ - [ ] **Step 8: Add the route to `src/api/routes.py`**
+
+ Update the schema imports to include the 3 new types:
+
+ ```python
+ from src.api.schemas import (
+     BBBPredictRequest,
+     BBBPredictResponse,
+     BBBRequest,
+     CalibrationContext,
+     EEGRequest,
+     FeatureAttribution,
+     HarmonizationRow,
+     MRIDiagnosticsRequest,
+     MRIDiagnosticsResponse,
+     MRIRequest,
+     PipelineResponse,
+ )
+ ```
+
+ Add the route at the end of the file:
+
+ ```python
+ @router.post("/mri/diagnostics", response_model=MRIDiagnosticsResponse)
+ def mri_diagnostics(req: MRIDiagnosticsRequest) -> MRIDiagnosticsResponse:
+     """Run the MRI pipeline twice and return pre/post ComBat data + site-gap KPIs."""
+     input_dir = Path(req.input_dir)
+     sites_csv = Path(req.sites_csv)
+     try:
+         df = mri_pipeline.compute_harmonization_diagnostics(
+             input_dir=input_dir, sites_csv=sites_csv,
+         )
+     except FileNotFoundError as e:
+         # Missing input directory or sites CSV
+         raise HTTPException(status_code=404, detail=str(e)) from e
+     except KeyError as e:
+         # Required column missing from the sites CSV
+         raise HTTPException(status_code=400, detail=str(e)) from e
+
+     if df.empty:
+         return MRIDiagnosticsResponse(
+             rows=[], site_gap_pre=0.0, site_gap_post=0.0, reduction_factor=0.0,
+         )
+
+     # Site-gap KPI: range (max - min) of the per-site means of the first feature
+     feat = df["feature"].iloc[0]
+     feat_df = df[df["feature"] == feat]
+     pre_means = feat_df[feat_df["harmonization_state"] == "Pre-ComBat"].groupby(
+         "site"
+     )["feature_value"].mean()
+     post_means = feat_df[feat_df["harmonization_state"] == "Post-ComBat"].groupby(
+         "site"
+     )["feature_value"].mean()
+     site_gap_pre = float(pre_means.max() - pre_means.min())
+     site_gap_post = float(post_means.max() - post_means.min())
+     eps = 1e-9  # guards against division by zero when ComBat closes the gap entirely
+     reduction_factor = site_gap_pre / max(site_gap_post, eps)
+
+     rows = [
+         HarmonizationRow(**rec) for rec in df.to_dict(orient="records")
+     ]
+     return MRIDiagnosticsResponse(
+         rows=rows,
+         site_gap_pre=site_gap_pre,
+         site_gap_post=site_gap_post,
+         reduction_factor=reduction_factor,
+     )
+ ```
+
+ > **Note**: this route belongs on `router` (the `/pipeline/...` router), NOT on `predict_router`. Verify that `router = APIRouter(prefix="/pipeline")` already exists at the top of `routes.py`.
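As a worked example of the site-gap KPI computed in the route above, the following standard-library sketch reproduces the arithmetic on toy long-format rows: the gap is the range of per-site means, and the reduction factor is `site_gap_pre / max(site_gap_post, eps)`. The helper names and the toy numbers are assumptions for illustration, and returning `0.0` when no rows match is a choice of this sketch, not something the route does.

```python
from collections import defaultdict
from statistics import mean

EPS = 1e-9  # same guard value as the route above


def site_gap(rows, feature, state):
    """Range (max - min) of per-site means for one feature and one state."""
    by_site = defaultdict(list)
    for row in rows:
        if row["feature"] == feature and row["harmonization_state"] == state:
            by_site[row["site"]].append(row["feature_value"])
    if not by_site:
        return 0.0  # assumption of this sketch: no matching rows means no gap
    site_means = [mean(values) for values in by_site.values()]
    return max(site_means) - min(site_means)


def reduction_factor(gap_pre, gap_post, eps=EPS):
    """Pre-gap divided by post-gap, guarded against a zero post-gap."""
    return gap_pre / max(gap_post, eps)


def toy_row(site, value, state):
    # "roi_mean" is a hypothetical feature name.
    return {"site": site, "feature": "roi_mean",
            "feature_value": value, "harmonization_state": state}


rows = [
    toy_row("A", 1.0, "Pre-ComBat"), toy_row("A", 3.0, "Pre-ComBat"),
    toy_row("B", 5.0, "Pre-ComBat"), toy_row("B", 7.0, "Pre-ComBat"),
    toy_row("A", 3.5, "Post-ComBat"), toy_row("A", 4.5, "Post-ComBat"),
    toy_row("B", 4.0, "Post-ComBat"), toy_row("B", 5.0, "Post-ComBat"),
]
gap_pre = site_gap(rows, "roi_mean", "Pre-ComBat")    # site means 2.0 and 6.0
gap_post = site_gap(rows, "roi_mean", "Post-ComBat")  # site means 4.0 and 4.5
print(gap_pre, gap_post, reduction_factor(gap_pre, gap_post))
```

With these toy values the per-site means move from 2.0 and 6.0 to 4.0 and 4.5, so the gap shrinks from 4.0 to 0.5 and the reduction factor is 8.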
+
+ - [ ] **Step 9: Append failing route tests to `tests/api/test_routes.py`**
+
+ ```python
+ class TestMRIDiagnosticsRoute:
+     def test_returns_200_with_pre_and_post_data(self, tmp_path: Path):
+         from tests.fixtures.build_mri_fixture import build as build_mri
+         fixture_dir = build_mri(out_dir=tmp_path / "mri")
+         resp = client.post(
+             "/pipeline/mri/diagnostics",
+             json={
+                 "input_dir": str(fixture_dir),
+                 "sites_csv": str(fixture_dir / "sites.csv"),
+             },
+         )
+         assert resp.status_code == 200
+         body = resp.json()
+         assert len(body["rows"]) > 0
+         assert body["site_gap_pre"] >= 0.0
+         assert body["site_gap_post"] >= 0.0
+         # Reduction factor is the headline KPI
+         assert body["reduction_factor"] >= 1.0  # ComBat must reduce, not amplify
+         states = {r["harmonization_state"] for r in body["rows"]}
+         assert states == {"Pre-ComBat", "Post-ComBat"}
+
+     def test_returns_404_when_input_dir_missing(self, tmp_path: Path):
+         resp = client.post(
+             "/pipeline/mri/diagnostics",
+             json={
+                 "input_dir": str(tmp_path / "does_not_exist"),
+                 "sites_csv": str(tmp_path / "sites.csv"),
+             },
+         )
+         assert resp.status_code == 404
+ ```
+
+ - [ ] **Step 10: Run tests**
+
+ ```
+ pytest tests/api/test_routes.py::TestMRIDiagnosticsRoute -v
+ ```
+ Expected: 2 passed.
+
+ ```
+ pytest -v 2>&1 | tail -3
+ ```
+ Expected: 165 passed (163 + 2).
+
+ - [ ] **Step 11: Commit Task 3B**
+
+ ```bash
+ git add src/api/schemas.py src/api/routes.py tests/api/test_routes.py
+ git commit -m "feat(api): POST /pipeline/mri/diagnostics for pre/post ComBat KPIs"
+ ```
+
+ ### 3C — Streamlit MRI tab visualization
+
+ - [ ] **Step 12: Replace `_render_mri_tab` in `src/frontend/app.py`**
+
+ Find the existing `_render_mri_tab` function (which currently does a basic POST to `/pipeline/mri`) and replace it entirely with:
+
+ ```python
+ def _render_mri_tab() -> None:
+     _render_section(
+         "IMAGE — MRI",
+         "Multi-site harmonization via ComBat",
+         "Loads NIfTI volumes, masks brain tissue, computes per-ROI summary "
+         "statistics, then harmonizes across acquisition sites with neuroHarmonize "
+         "to remove scanner-driven domain shift. The diagnostic plot below "
+         "compares per-site feature distributions before and after harmonization."
+     )
+     mri_dir = st.text_input(
+         "Input NIfTI directory", "tests/fixtures/mri_sample", key="mri_dir",
+         help="Path to a directory of .nii(.gz) files + sites.csv",
+     )
+     sites_csv = st.text_input(
+         "Sites CSV", "tests/fixtures/mri_sample/sites.csv", key="mri_sites",
+     )
+
+     if st.button("Run ComBat diagnostics", type="primary", key="mri_diag"):
+         with st.spinner("Running pre + post ComBat (×2 the work)…"):
+             try:
+                 result = _post(
+                     "/pipeline/mri/diagnostics",
+                     {"input_dir": mri_dir, "sites_csv": sites_csv},
+                 )
+                 _render_combat_diagnostics(result)
+                 st.toast("Diagnostics complete", icon="✅")
+             except httpx.HTTPStatusError as e:
+                 st.error(
+                     f"Diagnostics failed (HTTP {e.response.status_code}): "
+                     f"{e.response.text}"
+                 )
+             except httpx.RequestError as e:
+                 st.error(f"Cannot reach FastAPI at {_API_URL}: {e!r}")
+ ```
+
+ - [ ] **Step 13: Add `_render_combat_diagnostics` helper above `main()` in `src/frontend/app.py`**
+
+ ```python
+ def _render_combat_diagnostics(result: dict) -> None:
+     """Render the Pre/Post-ComBat KDE comparison + site-gap KPI strip."""
+     import altair as alt
+     import pandas as pd
+
+     rows = result.get("rows", [])
+     if not rows:
+         st.info(
+             "No data returned. Check that the input directory contains "
+             ".nii(.gz) files and a sites.csv with subject_id/site columns."
+         )
+         return
+
+     cols = st.columns(3)
+     cols[0].metric("Site-gap (Pre-ComBat)", f"{result['site_gap_pre']:.4f}")
+     cols[1].metric("Site-gap (Post-ComBat)", f"{result['site_gap_post']:.4f}")
+     cols[2].metric(
+         "Reduction factor",
+         f"{result['reduction_factor']:.0f}×",
+         help=(
+             "Pre-gap / Post-gap. A 100× reduction means ComBat "
+             "removed two orders of magnitude of site-driven domain shift."
+         ),
+     )
+
+     df = pd.DataFrame(rows)
+     # Pin the chart to the first feature (most recognizable for the audience).
+     feat = df["feature"].iloc[0]
+     feat_df = df[df["feature"] == feat]
+
+     # Layered KDE: x = feature_value, color = site, faceted by harmonization_state.
+     chart = (
+         alt.Chart(feat_df)
+         .transform_density(
+             density="feature_value",
+             groupby=["site", "harmonization_state"],
+             as_=["feature_value", "density"],
+         )
+         .mark_area(opacity=0.55)
+         .encode(
+             x=alt.X("feature_value:Q", title=f"{feat} (intensity)"),
+             y=alt.Y("density:Q", title="Density"),
+             color=alt.Color(
+                 "site:N",
+                 title="Site",
+                 scale=alt.Scale(scheme="tableau10"),
+             ),
+             tooltip=[
+                 alt.Tooltip("site:N"),
+                 alt.Tooltip("feature_value:Q", format=".4f"),
+                 alt.Tooltip("density:Q", format=".3f"),
+             ],
+         )
+         .properties(width=380, height=260)
+         .facet(
+             column=alt.Column(
+                 "harmonization_state:N",
+                 title=None,
+                 sort=["Pre-ComBat", "Post-ComBat"],
+                 header=alt.Header(labelFontSize=13, labelFontWeight="bold"),
+             )
+         )
+         .resolve_scale(x="shared", y="shared")
+     )
+     st.altair_chart(chart, use_container_width=True)
+
+     st.caption(
+         f"Per-site density of `{feat}` before and after ComBat. Each "
+         f"colored region is one acquisition site. **Convergence of the "
+         f"colored regions in the Post-ComBat panel is the visual proof "
+         f"of harmonization** — the same property the {result['reduction_factor']:.0f}× "
+         f"site-gap reduction quantifies."
+     )
+ ```
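For readers unfamiliar with what a density transform produces, here is a minimal standard-library sketch of a Gaussian kernel density estimate, the kind of curve the chart above draws per site. It is not Altair's implementation: the bandwidth is passed explicitly here (an assumption of this sketch), whereas the chart code above leaves bandwidth selection to the library, and the sample values are made up.

```python
import math


def linspace(lo, hi, num):
    """Evenly spaced grid of `num` points from lo to hi inclusive."""
    return [lo + (hi - lo) * i / (num - 1) for i in range(num)]


def gaussian_kde(samples, bandwidth, grid):
    """Gaussian KDE: average of normal kernels centered on each sample."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        for x in grid
    ]


# Hypothetical feature values for one site.
samples = [0.9, 1.0, 1.1]
grid = linspace(0.0, 2.0, 201)
density = gaussian_kde(samples, bandwidth=0.1, grid=grid)

peak_x = grid[density.index(max(density))]
print(f"density peaks near x = {peak_x:.2f}")
```

When two sites' curves overlap after harmonization, their samples occupy the same range of `feature_value`, which is what the Post-ComBat panel is meant to show.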
+
+ - [ ] **Step 14: Run frontend smoke**
+
+ ```
+ pytest tests/frontend/ -v
+ ```
+ Expected: 2 passed.
+
+ ```
+ pytest -v 2>&1 | tail -3
+ ```
+ Expected: 165 passed (no new tests added at this step).
+
+ - [ ] **Step 15: Manual smoke**
+
+ ```
+ streamlit run src/frontend/app.py --server.headless true &
+ sleep 5
+ curl -s http://localhost:8501 | head -3
+ pkill -f "streamlit run"
+ ```
+ Expected: the first lines of Streamlit's HTML page, which confirms the server is up (`curl -s` prints the response body, not the status code).
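The fixed `sleep 5` in the smoke check above can be flaky on a slow machine. One possible alternative, sketched here with only the standard library, is to poll the URL until it answers with HTTP 200 or a deadline passes. The function name `wait_for_http_ok` and its default timings are assumptions of this sketch, not part of the plan.

```python
import time
import urllib.error
import urllib.request


def wait_for_http_ok(url, timeout_s=30.0, interval_s=0.5):
    """Poll `url` until it returns HTTP 200; return False once the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2.0) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not accepting connections yet, or it returned an error status
        time.sleep(interval_s)
    return False


if __name__ == "__main__":
    # Port 8501 matches the curl command in the step above.
    ready = wait_for_http_ok("http://localhost:8501", timeout_s=30.0)
    print("Streamlit is up" if ready else "Streamlit did not come up in time")
```

Run it after launching Streamlit in the background; it returns as soon as the page is served instead of always waiting the full five seconds.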
+
+ - [ ] **Step 16: Commit Task 3C**
+
+ ```bash
+ git add src/frontend/app.py
+ git commit -m "feat(frontend): MRI tab — Pre/Post ComBat KDE + site-gap KPI"
+ ```
+
+ ---
+
+ ## Task 4: Final close-out — AGENTS.md + README + DoD
+
+ **Files:**
+ - Modify: `AGENTS.md` — §8 sub-section on calibration metadata; new §9 on Day-6 demo features
+ - Modify: `README.md` — Day 6 row in status table
+
+ - [ ] **Step 1: Update AGENTS.md**
+
+ Add a sub-section to §8 right after the uniform-surface bullets:
+
+ ```markdown
+ **Calibration metadata** (Day 6): `train()` does an 80/20 stratified split,
+ computes precision-at-confidence-threshold bins on the held-out test set,
+ and stashes them on `model._neurobridge_calibration: list[dict]` (sorted
+ ascending by threshold). The API includes the bin matching each
+ prediction's confidence in `BBBPredictResponse.calibration`. UI uses this
+ to render an honest trust caption ("≥75% confident → 92% precision, n=18").
+ For tiny test fixtures where stratified split fails, calibration falls
+ back to zero-support bins so the API contract is always populated.
+ ```
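The calibration paragraph above can be illustrated with a small standard-library sketch of precision-at-confidence-threshold bins and the lookup of the bin that matches a prediction. Everything specific here is an assumption for illustration: the dict keys (`threshold`, `precision`, `support`), the threshold values, the toy data, and the reading of "precision" as the fraction of correct predictions at or above a threshold. The actual `train()` implementation may differ.

```python
def precision_at_thresholds(confidences, correct, thresholds=(0.5, 0.75, 0.9)):
    """One bin per threshold, sorted ascending, with precision and support."""
    bins = []
    for threshold in sorted(thresholds):
        hits = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
        support = len(hits)
        # Zero-support bins keep the structure populated, mirroring the fallback above.
        precision = sum(hits) / support if support else 0.0
        bins.append(
            {"threshold": threshold, "precision": precision, "support": support}
        )
    return bins


def matching_bin(bins, confidence):
    """Highest-threshold bin that the confidence reaches, or None if below all."""
    match = None
    for entry in bins:  # bins are sorted ascending by threshold
        if confidence >= entry["threshold"]:
            match = entry
    return match


# Hypothetical held-out predictions: confidence and whether each was correct.
confidences = [0.55, 0.6, 0.8, 0.8, 0.95, 0.95]
correct = [True, False, True, False, True, True]

bins = precision_at_thresholds(confidences, correct)
chosen = matching_bin(bins, 0.8)
print(
    f"≥{chosen['threshold']:.0%} confident → "
    f"{chosen['precision']:.0%} precision, n={chosen['support']}"
)
```

The final line formats the chosen bin in the same style as the trust caption quoted above, using the toy data rather than real model results.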
+
+ After §8, append a new §9:
+
+ ```markdown
+ ## 9. Demo Features (Day 6)
+
+ The frontend includes three jury-day demo amplifiers that don't change
+ the core contract:
+
+ - **Edge-case dropdown** (BBB tab): a curated catalog of 5 robustness
+   probes, including invalid SMILES, empty input, an OOD macrocycle
+   (cyclosporine-like), and a heavy halogenated aromatic. Each has a
+   stated expectation; the UI visualizes graceful failure (HTTP 400 →
+   recoverable warning, never a crash).
+ - **Calibration trust caption** (BBB decision card): renders the
+   precision-at-confidence-threshold from `BBBPredictResponse.calibration`.
+   Demonstrates that the system knows what it doesn't know.
+ - **MRI ComBat diagnostics** (MRI tab): `POST /pipeline/mri/diagnostics`
+   runs the pipeline twice (pre + post ComBat) and returns long-format
+   data + site-gap KPIs (Pre, Post, Reduction factor). The UI renders
+   a faceted altair density plot: visual proof that ComBat removes
+   site-driven domain shift.
+ ```
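The "graceful failure" behavior described for the edge-case dropdown (HTTP 400 shown as a recoverable warning rather than a crash) could be organized around a small status-to-severity mapping. The sketch below is hypothetical: the function name, the severity labels, and the choice to treat every other non-2xx status as an error are assumptions for illustration, not the frontend's actual code.

```python
def classify_response(status_code):
    """Map an HTTP status code to a UI severity level (hypothetical labels)."""
    if 200 <= status_code < 300:
        return "success"
    if status_code == 400:
        # Bad input such as an invalid or empty SMILES string: the user can fix it.
        return "warning"
    return "error"


for code in (200, 400, 500):
    print(code, classify_response(code))
```

A UI layer could then pick the matching widget for each level, for instance a warning box for `"warning"` and an error box for `"error"`.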
+
+ - [ ] **Step 2: Update README.md**
+
+ Add Day 6 to the status table:
+
+ ```markdown
+ | Day 6 — Final Polish & Demo Features (Edge cases + Calibration + ComBat viz) | ✅ Shipped — 165 tests green |
+ ```
+
+ Add to "Where to Look":
+ - `docs/superpowers/plans/2026-05-04-day6-final-polish-demo-features.md` (Day-6 plan)
+ - New surface: `POST /pipeline/mri/diagnostics`
+
+ - [ ] **Step 3: DoD verification**
+
+ Run all 4 checks:
+
+ ```
+ pytest -v 2>&1 | tail -3
+ ```
+ Expected: **165 passed**.
+
+ ```
+ pytest -W error::UserWarning tests/models/ tests/api/ tests/pipelines/ tests/frontend/ 2>&1 | tail -3
+ ```
+ Expected: same count, zero UserWarning errors.
+
+ ```
+ streamlit run src/frontend/app.py --server.headless true &
+ sleep 5
+ curl -s http://localhost:8501 | head -3
+ pkill -f "streamlit run"
+ ```
+ Expected: HTML response.
+
+ If `data/raw/bbbp.csv` exists, do a full E2E retrain → predict → diagnostics smoke. Otherwise skip (covered by tests).
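The second check above relies on warnings being promoted to errors. As a small illustration of that mechanism, the standard-library sketch below applies an equivalent filter with the `warnings` module and shows that a function emitting a `UserWarning` then raises instead of warning silently. The helper `emits_user_warning` and the two toy functions are hypothetical.

```python
import warnings


def emits_user_warning(func):
    """Return True if calling `func` triggers a UserWarning."""
    with warnings.catch_warnings():
        # Inside this block, UserWarning is raised as an exception.
        warnings.simplefilter("error", UserWarning)
        try:
            func()
        except UserWarning:
            return True
    return False


def noisy():
    warnings.warn("feature names mismatch", UserWarning)


def quiet():
    return 42


print(emits_user_warning(noisy))
print(emits_user_warning(quiet))
```

Under the error filter, any test that triggers a `UserWarning` fails loudly, which is why the check expects the same pass count as the plain run.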
+
+ - [ ] **Step 4: Commit close-out**
+
+ ```bash
+ git add AGENTS.md README.md
+ git commit -m "docs: Day-6 close-out — AGENTS §8 calibration + §9 demo features"
+ ```
+
+ ---
+
+ ## Definition of Done (Day 6)
+
+ | Check | Pass criterion |
+ |---|---|
+ | Full suite green | `pytest -v` reports 165 passed |
+ | Edge-case dropdown lists 5 cases (incl. invalid + OOD) | manual / Streamlit run |
+ | Calibration metadata persists across save/load | `TestCalibrationMetadata.test_calibration_survives_save_load_roundtrip` |
+ | `BBBPredictResponse.calibration` is populated for predictions ≥0.5 confidence | `TestBBBPredictRoute.test_returns_200_*` calibration assertions |
+ | `POST /pipeline/mri/diagnostics` returns site-gap KPIs + Pre/Post rows | `TestMRIDiagnosticsRoute.test_returns_200_*` |
+ | `compute_harmonization_diagnostics` regression-tests post < pre | `TestComputeHarmonizationDiagnostics.test_post_combat_site_gap_is_smaller_than_pre` |
+ | Streamlit MRI tab renders altair faceted density plot | manual |
+ | 158 prior tests still green | yes |
+ | AGENTS.md §8 documents calibration; §9 documents demo features | yes |
+
+ When all rows are green, Day 6 is sealed and the jury demo is ready.