docs(spec): Day-7 final-5% design — drift, traceability, agents
Sealed architectural decisions:
- Drift state: in-process deque(maxlen=100) per worker + train-time
median/std on model._neurobridge_train_stats (joblib roundtrip-safe).
- LLM provider: OpenRouter via openai==1.51.0 SDK with deterministic
template fallback. NEUROBRIDGE_DISABLE_LLM=1 demo lifeline.
4 tasks: T1 drift, T2 MLflow badge, T3 LLM explainer + AI Assistant
tab, T4 close-out. Test growth: 165 → 175 green (+10).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
docs/superpowers/specs/2026-05-05-day7-drift-traceability-agents-design.md
ADDED
# Day 7 — The Final 5% (Drift, Traceability & Agents) Design Spec

**Date:** 2026-05-05
**Status:** Approved by user; ready for `/superpowers:writing-plans`.
**Predecessor:** Day 6 (`docs/superpowers/plans/2026-05-04-day6-final-polish-demo-features.md`) — closed at SHA `d05fcf1`, 165 green tests.

---

## 1. Goal

Close the remaining 5% gap to a top-tier hackathon submission by hardening the two evaluation dimensions where Day 6 left visible weakness:

- **Adapt Over Time** (Slide 12, "Living Systems — Your Edge"): the system is currently static post-training. Add a drift-detection stub that compares trailing prediction confidence to the training distribution.
- **AI Lab Agents** (Slide 7, Track 1): we have no agent surface. Add a chat-style "Why?" endpoint that explains a BBB prediction in natural language using SHAP attributions and drift context.

Plus one System-Quality polish (an MLflow traceability badge in the decision card) so jurors can audit *which* model produced a given prediction.

**Non-goals (YAGNI):**

- No retraining loop. Drift is *observed and reported*, not acted on.
- No conversational state, no multi-turn agent. One question → one rationale.
- No vector store or RAG corpus. The "context" is the prediction payload itself plus a small built-in literature primer.
- No new front-end framework. Stay on Streamlit; reuse Trust & Authority brand tokens.
- No new visualization library. Reuse Altair (already shipped).

---

## 2. Sealed Architectural Decisions

These are locked. The implementation plan must follow them as-is.

### 2.1 Drift state location

**Decision:** an in-process `collections.deque(maxlen=100)` per FastAPI worker, plus train-time median/std baked into `model._neurobridge_train_stats: dict` (joblib-roundtrip-safe).

| Field | Type | Source |
|---|---|---|
| `model._neurobridge_train_stats["median"]` | `float` | `np.median(model.predict_proba(X_train).max(axis=1))` |
| `model._neurobridge_train_stats["std"]` | `float` | `np.std(model.predict_proba(X_train).max(axis=1))` |
| `model._neurobridge_train_stats["n_train"]` | `int` | `len(y_train)` |
| `WORKER_CONFIDENCE_DEQUE` | `deque[float]` (maxlen=100) | module-level singleton in `src/api/routes.py` |

**Z-score formula:** `drift_z = (rolling_median − train_median) / max(train_std, 1e-9)`

**Edge cases:**

- `len(deque) < 10` → `drift_z = None`, UI shows "Warming up (n/10)".
- `_neurobridge_train_stats` missing (legacy joblib) → `drift_z = None`, UI shows "Drift unavailable".

**Rejected alternatives:**

- Joblib sidecar file (B): I/O race risk, slower.
- SQLite (C): production-grade, but +30 min of setup overhead; not worth it for the demo budget.
- Train-time stats only, with no rolling window (D): kills the "trailing 100" narrative.

### 2.2 LLM provider

**Decision:** OpenRouter via the `openai==1.51.0` SDK, hybrid template fallback, kill-switch env gate.

| Setting | Value |
|---|---|
| Base URL | `https://openrouter.ai/api/v1` |
| Default model | `meta-llama/llama-3.2-3b-instruct:free` |
| Auth | `OPENROUTER_API_KEY` env var |
| Lifeline gate | `NEUROBRIDGE_DISABLE_LLM=1` → force template path |
| Timeout | 8 seconds (HTTP request) |
| Max tokens | 256 (response cap) |
| Temperature | 0.3 (deterministic-ish; jury demos must be predictable) |

**Fallback chain (in order):**

1. `NEUROBRIDGE_DISABLE_LLM=1` set → template
2. `OPENROUTER_API_KEY` not set → template
3. `openai` SDK raises `APIConnectionError` / `APITimeoutError` / `RateLimitError` → log warning, template
4. LLM returns an empty / malformed response → log warning, template
5. Otherwise → LLM rationale

**Response contract (always populated):**

```python
{"rationale": str, "source": "llm" | "template", "model": str | None}
```

`source` makes the auditing story crisp: "this rationale came from the deterministic template" vs. "this came from llama-3.2-3b". Jurors can verify reproducibility.

**Rejected alternatives:**

- Anthropic API (C): no key available.
- Local Ollama (B): demo-day install/load risk too high.
- Pure deterministic template (A): kills the "AI Lab Agents" narrative.
- Pure LLM with no fallback: demo-day network failure = total failure.

---

## 3. Component Design

### 3.1 Drift layer

**Files touched:**

- `src/models/bbb_model.py` — extend `train()` to compute and stash `_neurobridge_train_stats`.
- `src/api/schemas.py` — add `drift_z: float | None` and `rolling_n: int` to `BBBPredictResponse`.
- `src/api/routes.py` — module-level deque, helper `_compute_drift_z(model, confidence) -> tuple[float | None, int]`, wired into `predict_bbb`.
- `src/frontend/app.py` — add a drift line to `_render_prediction_card` between the calibration caption and the SHAP section.

**Boundary contracts:**

- `bbb_model._compute_train_stats(model, X_train, y_train) -> dict` (private helper; mirrors the shape of `_compute_calibration_bins`).
- `routes._compute_drift_z(model, confidence) -> tuple[float | None, int]` — returns `(drift_z, len_after_append)`. Side effect: appends to the module-level deque.
- The deque is module-level so it survives across requests but resets on worker restart. This is acceptable: drift is a *demo-day signal*, not a production audit trail.

**Streamlit rendering rule:**

```
if drift_z is None and rolling_n < 10:
    show "Drift: warming up ({rolling_n}/10)"
elif drift_z is None:
    show "Drift: unavailable (no train-time stats)"
else:
    show "Drift: trailing-100 median is {drift_z:+.2f}σ from training distribution"
```

### 3.2 MLflow traceability badge

**Files touched:**

- `src/api/schemas.py` — add `provenance: ModelProvenance | None` (new schema) to `BBBPredictResponse`.
- `src/api/routes.py` — read MLflow run metadata once at module load (cached); attach it to every `/predict/bbb` response.
- `src/frontend/app.py` — render the badge in `_render_prediction_card`, near the top of the card.

**`ModelProvenance` schema:**

```python
class ModelProvenance(BaseModel):
    mlflow_run_id: str | None = None
    model_version: str = "v1"      # bumped manually per train cycle
    train_date: str | None = None  # ISO 8601, from MLflow run start_time
    n_examples: int | None = None  # from model._neurobridge_train_stats["n_train"]
```

**Lookup logic (one-time per process startup, then cached):**

1. Try `mlflow.search_runs(experiment_names=["bbb_pipeline"], max_results=1, order_by=["start_time DESC"])`.
2. If found → populate `mlflow_run_id` and `train_date`.
3. If not found, or `NEUROBRIDGE_DISABLE_MLFLOW=1` → all fields stay `None` except the hardcoded `model_version="v1"`.
4. `n_examples` comes from `model._neurobridge_train_stats["n_train"]` (set in T1A).

The badge is purely informational — the API still works without MLflow; the UI just shows "Provenance unavailable".

### 3.3 LLM explainer

**New file:** `src/llm/explainer.py`

Public surface:

```python
def explain(payload: ExplainPayload) -> ExplainResult:
    """Return a natural-language rationale for a BBB prediction.

    Falls back to a deterministic template when the LLM is unavailable.
    Never raises — always returns a usable rationale.
    """
```

`ExplainPayload` is a typed dict with: `smiles`, `label_text`, `confidence`, `top_features` (list of `{feature, shap_value}`), `calibration` (optional), and `drift_z` (optional).

**Internal structure:**

```
explain(payload)
├── _should_use_llm() → bool              # gates: env flag, key, etc.
├── _llm_explain(payload) → str | None    # OpenRouter call; returns None on any failure
├── _template_explain(payload) → str      # always-available deterministic path
└── compose ExplainResult with source/model fields
```

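The composition step can be sketched like this. Hedged: `_template_explain` is reduced to a one-line stand-in, and the LLM call and its gates are injected as parameters rather than read from the environment:

```python
DEFAULT_MODEL = "meta-llama/llama-3.2-3b-instruct:free"

def _template_explain(payload):
    # Stand-in for the full deterministic template described below.
    return f"Predicted {payload['label_text']}."

def explain(payload, llm_call=None, disabled=False, has_key=False):
    """Never raises: any LLM failure falls through to the template."""
    if not disabled and has_key and llm_call is not None:
        try:
            text = llm_call(payload)
            if text and text.strip():
                return {"rationale": text.strip(), "source": "llm", "model": DEFAULT_MODEL}
        except Exception:
            pass  # connection/timeout/rate-limit → template, per the fallback chain
    return {"rationale": _template_explain(payload), "source": "template", "model": None}
```

Note how the empty/malformed-response case (step 4 of the fallback chain) and the exception case both land on the same template return.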
**Template (deterministic, jury-friendly):**

The template stitches together:

1. Sentence 1: "Predicted **{label_text}** with {confidence*100:.0f}% confidence."
2. Sentence 2 (if calibration): "Calibration: predictions in the ≥{threshold}% bin are correct {precision}% of the time on held-out data (n={support})."
3. Sentence 3: "Top SHAP attributions toward this label: {feat_1} (Δ{shap_1:+.3f}), {feat_2} (Δ{shap_2:+.3f}), {feat_3} (Δ{shap_3:+.3f})."
4. Sentence 4 (if drift_z): "Drift signal: trailing-100 confidence median is {drift_z:+.2f}σ from training distribution; {interpretation}."
   - interpretation: `|drift_z| < 1` → "within expected range"; `1 ≤ |drift_z| < 2` → "mild distribution shift"; `|drift_z| ≥ 2` → "significant shift, retrain recommended".

The template is auditable: every word is derived from numeric inputs. That matters for jurors who challenge "is this actually using the model output?".

**LLM prompt (single-shot, no system-message clutter):**

```
You are a clinical-ML explainer for a B2B blood-brain-barrier permeability tool.
Given the prediction details below, write a 2–4 sentence rationale a researcher
could paste into a paper. Use the SHAP attributions to justify the verdict.
Mention drift if abnormal. Avoid hedging; be specific about the numbers.

Prediction:
- SMILES: {smiles}
- Verdict: {label_text} ({confidence*100:.0f}% confident)
- Top SHAP features (positive = pushed toward verdict):
{top_features_bulleted}
- Drift z-score: {drift_z}

Respond with the rationale only, no preamble.
```

### 3.4 `POST /explain/bbb` route

**Files touched:**

- `src/api/schemas.py` — `BBBExplainRequest`, `BBBExplainResponse`.
- `src/api/routes.py` — register the endpoint on a new router (see the routing decision below).

**Routing decision:** a new `explain_router` with prefix `/explain` → final URL `POST /explain/bbb`. Mounted on the FastAPI app alongside the existing `router` (prefix `/pipeline`) and `predict_router` (prefix `/predict`). This mirrors the prediction surface symmetrically (`/predict/bbb` ↔ `/explain/bbb`) and leaves room for `/explain/eeg` and `/explain/mri` later without restructuring.

**Request:**

```python
class BBBExplainRequest(BaseModel):
    smiles: str
    label: int
    label_text: str
    confidence: float
    top_features: list[FeatureAttribution]
    calibration: CalibrationContext | None = None
    drift_z: float | None = None
```

**Response:**

```python
class BBBExplainResponse(BaseModel):
    rationale: str
    source: str               # "llm" | "template"
    model: str | None = None  # LLM model name when source="llm"
```

**Error cases:**

- Empty `top_features` → 400 (a real prediction always has SHAP attributions).
- Otherwise → 200, always (the explainer never raises; the template fallback guarantees success).

### 3.5 Streamlit "AI Assistant" tab

**File touched:** `src/frontend/app.py`

**Layout:**

```
┌──────────────────────────────────────────────────────────────┐
│ AI Assistant — explain the last BBB prediction               │
├──────────────────────────────────────────────────────────────┤
│ [Last prediction card preview: label, confidence, top-3 SHAP]│
│                                                              │
│ Pre-canned questions (st.selectbox):                         │
│   • Why was this molecule predicted as permeable?            │
│   • Which features pushed the verdict the most?              │
│   • Is the prediction trustworthy given drift?               │
│                                                              │
│ [Custom question text_input — optional]                      │
│                                                              │
│ [Ask the AI Assistant — primary button]                      │
├──────────────────────────────────────────────────────────────┤
│ Response card:                                               │
│   "{rationale}"                                              │
│   Source: {llm | template} · Model: {model or "—"}           │
└──────────────────────────────────────────────────────────────┘
```

**Question routing into the prompt:** the user's selected or typed question is **not** sent to the LLM as a separate field. It is appended to the prompt as a "User question:" line before the closing instruction. This keeps the response contract (`{rationale, source, model}`) identical regardless of question, and lets the deterministic template ignore the question entirely: the template always answers the meta-question "explain this prediction", which subsumes all three pre-canned questions. For a custom question that diverges far from the canned three, the LLM path will adapt; the template path will give the same generic SHAP-driven rationale. An acceptable trade-off for Day 7.

**Session state:**

- `st.session_state["last_bbb_prediction"]` — populated by `_render_prediction_card` after every successful BBB predict (stores the entire `/predict/bbb` response dict).
- `st.session_state["explain_history"]` — list of `(question, response)` tuples; rendered in reverse-chronological order.
- If `last_bbb_prediction` is `None`, show the empty state: "Run a BBB prediction first to enable the AI Assistant."

**No multi-turn conversation.** Each question is independent; history is visible but not fed back into subsequent prompts.

---

## 4. Test Plan

| Suite | New Tests | What they cover |
|---|---|---|
| `tests/models/test_bbb_model.py` | +2 | `_neurobridge_train_stats` attribute presence; joblib roundtrip |
| `tests/api/test_routes.py` (T1B) | +2 | `drift_z` and `rolling_n` in the `/predict/bbb` body; deque rolls (the 101st predict drops the 1st) |
| `tests/api/test_routes.py` (T2) | +1 | `provenance` field in the `/predict/bbb` response (smoke — fields can be None) |
| `tests/llm/test_explainer.py` (new dir) | +4 | (a) template path returns a deterministic rationale; (b) template includes top feature names; (c) template includes `label_text`; (d) `NEUROBRIDGE_DISABLE_LLM=1` forces the template even with a key set |
| `tests/api/test_routes.py` (T3B) | +1 | `POST /explain/bbb` 200 happy path with template source |
| **Total** | **+10** | **165 → 175 green** |

**LLM integration tests (env-gated, NOT counted in the 175):**

- `tests/llm/test_explainer_integration.py` — marked `@pytest.mark.llm_integration`; runs only when `RUN_LLM_TESTS=1` is set. Verifies a real OpenRouter round-trip. Skipped by default.

**TDD discipline:** for T1A, T1B, T3A, and T3B, write the new tests, watch them fail (RED), then implement. T1C, T2, and T3C are UI-only or thin glue; they are covered by import smoke tests and the existing assertion extensions.

---

## 5. New Dependency

`openai==1.51.0` — added to `requirements.txt`. ~600 KB, with minimal transitive dependencies (`httpx`, `pydantic`, `typing_extensions` — all already present). Pinned to 1.51.0 because that is a known-stable version with the OpenRouter-compatible client interface.

No other new pip deps. Streamlit, Altair, scikit-learn, RDKit, MNE, nibabel, and MLflow stay at their current pins.

---

## 6. Failure Modes & Lifelines

| Failure | Detection | Lifeline |
|---|---|---|
| OpenRouter rate-limit during demo | HTTP 429 from the SDK | Auto-fallback to template; log warning |
| OpenRouter network outage | `APIConnectionError` | Auto-fallback to template |
| API key revoked / typo'd | HTTP 401 | Auto-fallback to template |
| Demo runner forgot the key | `os.environ.get("OPENROUTER_API_KEY") is None` | Auto-fallback to template |
| User wants to force the template (e.g., for reproducibility) | `NEUROBRIDGE_DISABLE_LLM=1` | Hard gate; never calls the LLM |
| Drift deque accumulates noise across worker lifetime | n/a | Worker restart clears state; the demo runner can `pkill -f uvicorn && uvicorn …` between dry-runs. Documented in the README's "Day 7 — Demo Recipe". |
| `_neurobridge_train_stats` missing on a legacy model | `getattr(model, ..., None) is None` | `drift_z=None`, UI hedge string |
| MLflow store unreachable | `mlflow.search_runs` raises | `provenance` fields all None; UI shows "Provenance unavailable" |

---

## 7. Risks & Mitigations

- **Risk:** Streamlit session-state quirks may lose `last_bbb_prediction` across reruns.
  **Mitigation:** Use `st.session_state` (persistent across reruns within a session). Test by clicking Predict → switching to the AI Assistant tab → verifying the last prediction is visible.

- **Risk:** The OpenRouter free model returns garbage for chemistry questions.
  **Mitigation:** A tightly scoped prompt (2–4 sentences, no preamble). Worst case, the rationale is verbose but harmless; the source label tells jurors it came from the LLM, not the deterministic path.

- **Risk:** The new `openai` dep conflicts with the existing `httpx`.
  **Mitigation:** `openai==1.51.0` requires `httpx>=0.23,<1.0`; we already pin `httpx==0.27.2`. Compatible. Verify with `pip check` after install.

- **Risk:** The reset-drift endpoint adds attack surface.
  **Mitigation:** It is a POST that clears in-process state on the API server; no auth is needed for a hackathon demo, but it is documented as "demo only" in the OpenAPI description.

- **Risk:** The MLflow lookup at module load slows API startup.
  **Mitigation:** Wrap it in try/except; on any error, set `_PROVENANCE_CACHE = None` and continue. Lazy-evaluate per request only if the cache is None. Time-bound the lookup to 2 seconds.

---

## 8. Definition of Done

- ✅ `pytest -q` reports **175 passed**.
- ✅ `pytest -W error::UserWarning tests/` reports zero warnings-as-errors.
- ✅ `pytest tests/llm/ -v` passes (template path, 4 tests).
- ✅ `streamlit run src/frontend/app.py` boots without ImportError; the AI Assistant tab is visible.
- ✅ `curl POST /predict/bbb {smiles: "CCO"}` returns a body with `drift_z`, `rolling_n`, and `provenance` keys.
- ✅ `curl POST /explain/bbb {…}` returns 200 with `{rationale, source: "template"}` when `NEUROBRIDGE_DISABLE_LLM=1`.
- ✅ With a real `OPENROUTER_API_KEY` set and the flag unset, `source: "llm"` and `model: "meta-llama/llama-3.2-3b-instruct:free"`.
- ✅ The Streamlit BBB decision card shows: confidence progress + calibration caption + drift line + MLflow badge + SHAP bars (in that order).
- ✅ The AI Assistant tab can ask "Why permeable?" and render a rationale (from either source).
- ✅ AGENTS.md §10 (Drift Surface) and §11 (LLM Explainer Surface) are committed.
- ✅ The README has a "Day 7 — Demo Recipe" section with two `curl` invocations.
- ✅ The final commit ledger has 5 commits: T1, T2, T3A, T3B+T3C, T4 (or finer granularity, but at least one commit per task boundary).

---

## 9. Out of Scope (deferred to "someday")

- Multi-turn conversation memory.
- Per-user drift profiles (currently shared across all clients of one worker).
- A retraining trigger when `|drift_z| > 2` for N consecutive predictions.
- Vector-store RAG over actual chemistry literature.
- LLM rationale streaming (Streamlit chat-style typewriter).
- Provenance signing / cryptographic audit trail.
- Drift-state persistence across worker restarts.

These are recognized but explicitly not Day 7. Doing any of them would either blow the time budget or shift the demo focus away from the four sealed tasks.