mekosotto Claude Opus 4.7 (1M context) committed
Commit 26adc32 · 1 Parent(s): 87845ef

chore: add real-llm-rationale plan + ignore .worktrees/

Plan covers dropping the template-only fallback in favor of real
OpenRouter LLM calls, with template kept as outage fallback only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

.gitignore CHANGED
@@ -28,6 +28,7 @@ mlartifacts/
 
 # Claude Code / agent tooling
 .sixth/
+.worktrees/
 
 # IDE
 .idea/
docs/superpowers/plans/2026-05-02-real-llm-rationale.md ADDED
@@ -0,0 +1,634 @@

# Real-LLM Rationale (drop the template-only fallback) Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Make `POST /explain/{bbb,eeg,mri}` return `source="llm"` end-to-end against OpenRouter, instead of the deterministic template — without removing the template (it stays as a true outage fallback, not the everyday path).

**Architecture:** The explainer at [src/llm/explainer.py](src/llm/explainer.py) already has both paths. The LLM path is silently failing because (a) the configured `OPENROUTER_API_KEY` returns **401 on every model** today, (b) the `_DEFAULT_FREE_MODEL_CHAIN` lists a mix of speculative IDs that may not exist on OpenRouter, and (c) 401 (auth) is not classified as a fatal, non-recoverable error — it is swept into the generic "fall back to template" branch, so a tester would never know auth was the cause. We fix auth → trim the chain to verified-live free models → add an explicit 401/400 short-circuit → add one network-gated integration test that proves an end-to-end LLM call returns `source="llm"`.
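
For orientation, here is the response contract the plan drives toward, as a minimal sketch (field names come from the assertions in Tasks 2 and 5; the payload dict mirrors the Task 6 request body and the model id is illustrative):

```python
# Illustrative only, not project code. Shapes inferred from this plan's tests.
from src.llm import explainer as ex

payload = {
    "smiles": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
    "label": 1,
    "label_text": "permeable",
    "confidence": 0.98,
    "top_features": [{"feature": "fp_1822", "shap_value": 0.0796}],
}
result = ex.explain(payload, modality="bbb")
# Everyday path once this plan lands:
#   {"source": "llm", "model": "<some-id>:free", "rationale": "free-form prose ..."}
# Outage fallback (bad key, chain exhausted, or NEUROBRIDGE_DISABLE_LLM=1):
#   {"source": "template", "model": None, "rationale": "deterministic sentence ..."}
```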

**Tech Stack:** Python 3.12, `openai==1.51.0` (OpenRouter-compatible), pytest 8.x, FastAPI 0.115, Streamlit 1.x.

---

## Pre-flight

- This plan modifies a single module ([src/llm/explainer.py](src/llm/explainer.py)) plus its test file ([tests/llm/test_explainer.py](tests/llm/test_explainer.py)) and adds a one-shot diagnostic script. Blast radius is small; a feature branch in the current working tree is sufficient. Worktree isolation (`superpowers:using-git-worktrees`) is **not required** unless the engineer wants to keep `main` clean while polling OpenRouter for live model IDs.
- The user has indicated `.env` already holds an `OPENROUTER_API_KEY`. Task 1 verifies whether that key still works against any model. If every probe returns 401, the engineer must reach the user before proceeding past Task 1 — code changes can't fix an unauthorized key.
- Test discipline: the deterministic template path is the **source of truth** for unit tests (per the existing module docstring). The LLM path is exercised by **one** opt-in network-gated test that auto-skips when `OPENROUTER_API_KEY` is missing. Do not mock the OpenAI SDK at the unit-test layer for the new integration test — that defeats its purpose.

---

## File Structure

| File | Status | Responsibility |
|---|---|---|
| [src/llm/explainer.py](src/llm/explainer.py) | Modify | Trim model chain; classify 401/400 explicitly; surface auth-failure log at WARNING with actionable hint |
| [tests/llm/test_explainer.py](tests/llm/test_explainer.py) | Modify | Add unit tests for new 401/400 classifier + one network-gated end-to-end LLM test |
| `scripts/diagnose_openrouter.py` | Create | One-shot probe that lists which free model IDs respond OK vs 401/404 — used in Task 1 and again in Task 4 to re-confirm the chain |

---

### Task 1: Diagnose the live OpenRouter free-tier surface

**Files:**
- Create: `scripts/diagnose_openrouter.py`

This task produces the empirical evidence later tasks depend on. **Do not skip** — the current chain in [src/llm/explainer.py:62-73](src/llm/explainer.py#L62-L73) lists IDs like `inclusionai/ling-2.6-1t:free` that may never have existed on OpenRouter. We need the real list before we can trim.

- [ ] **Step 1: Create the diagnostic script**

```python
# scripts/diagnose_openrouter.py
"""Probe OpenRouter for which free-tier model IDs are reachable today.

Reads OPENROUTER_API_KEY from .env (or process env). Issues a single
8-token chat completion against a candidate list and prints one line per
model: status (OK / HTTP-code / exception name) + a 30-char preview of
the response when OK.

Use:
    python scripts/diagnose_openrouter.py

Or to probe a custom list:
    python scripts/diagnose_openrouter.py google/gemma-2-9b-it:free meta-llama/llama-3.2-3b-instruct:free
"""
from __future__ import annotations

import os
import sys
from pathlib import Path

# Manually parse .env without python-dotenv (some envs choke on its
# frame-introspection in heredocs / non-stack-rooted callers).
_env_path = Path(__file__).resolve().parent.parent / ".env"
if _env_path.exists():
    for raw in _env_path.read_text().splitlines():
        s = raw.strip()
        if not s or s.startswith("#") or "=" not in s:
            continue
        k, v = s.split("=", 1)
        # tolerate KEY="value" / KEY='value' styles
        os.environ.setdefault(k.strip(), v.strip().strip("'\""))

if not os.environ.get("OPENROUTER_API_KEY"):
    sys.exit("OPENROUTER_API_KEY not set (looked in env and .env)")

# Candidate list: well-known stable free-tier IDs as of 2026-Q2.
# Update by replacing this list — the script is a probe, not a config source.
DEFAULT_CANDIDATES = [
    "google/gemma-2-9b-it:free",
    "google/gemini-2.0-flash-exp:free",
    "meta-llama/llama-3.2-3b-instruct:free",
    "meta-llama/llama-3.3-70b-instruct:free",
    "mistralai/mistral-7b-instruct:free",
    "qwen/qwen-2.5-72b-instruct:free",
    "deepseek/deepseek-r1:free",
    "deepseek/deepseek-chat:free",
    "nousresearch/hermes-3-llama-3.1-405b:free",
    "microsoft/phi-3-mini-128k-instruct:free",
]

candidates = sys.argv[1:] or DEFAULT_CANDIDATES

from openai import (  # noqa: E402 (after env load)
    OpenAI, APIStatusError, APIConnectionError, RateLimitError, APITimeoutError,
)

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
    timeout=15.0,
)

for m in candidates:
    try:
        c = client.chat.completions.create(
            model=m,
            messages=[{"role": "user", "content": "Reply with the single word OK."}],
            max_tokens=8,
            temperature=0,
        )
        text = (c.choices[0].message.content or "").strip()
        print(f"  OK   {m} → {text[:30]!r}")
    except RateLimitError:
        # Must precede APIStatusError: RateLimitError subclasses it, so the
        # generic handler below would otherwise swallow every 429.
        print(f"  429  {m} (rate-limited)")
    except APIStatusError as e:
        code = getattr(e, "status_code", "?")
        print(f"  {code:<5}{m}")
    except (APIConnectionError, APITimeoutError) as e:
        print(f"  CONN {m} ({type(e).__name__})")
    except Exception as e:  # a probe should report, never crash mid-list
        print(f"  ERR  {m} ({type(e).__name__}: {e})")
```

- [ ] **Step 2: Run the diagnostic**

```bash
python scripts/diagnose_openrouter.py
```

Expected output: one line per candidate, status code + preview.

- [ ] **Step 3: Branch on the result**

  - If **at least one** model returns `OK ... → 'OK'` (or any non-empty text):
    - Record the OK model IDs — they become the new chain in Task 4.
    - Continue to Task 2.
  - If **every** line shows `401`:
    - The API key is unauthorized. **Stop and reach the user.** Likely causes: key revoked, wrong account, missing free-tier opt-in at https://openrouter.ai/settings/privacy (some free models require enabling "free model training" data sharing). Do not edit code — the chain doesn't matter while auth is broken.
  - If lines show a mix of 401 and 404:
    - 401 = auth failure (still blocking). Same as above.
  - If **every** line shows `404`:
    - The chain candidates are all retired. Replace `DEFAULT_CANDIDATES` with fresh IDs from `curl https://openrouter.ai/api/v1/models | jq -r '.data[]|select(.pricing.prompt=="0")|.id'` and re-run (a Python equivalent of that one-liner is sketched below).
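
If you would rather stay in Python for that refresh, a rough stdlib equivalent of the `jq` one-liner (same endpoint, same filter; it assumes the `{"data": [{"id": ..., "pricing": {"prompt": "0"}}]}` response shape that the jq expression above already relies on):

```python
# Hedged sketch: mirror of the curl|jq free-model filter, stdlib only.
import json
import urllib.request

with urllib.request.urlopen("https://openrouter.ai/api/v1/models", timeout=15) as resp:
    models = json.load(resp)["data"]

# Print every model whose prompt price is the string "0" (free tier).
for m in models:
    if m.get("pricing", {}).get("prompt") == "0":
        print(m["id"])
```
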
- [ ] **Step 4: Commit the diagnostic script**

```bash
git add scripts/diagnose_openrouter.py
git commit -m "chore(llm): one-shot OpenRouter free-tier reachability probe"
```

---

### Task 2: Lock current behavior with a unit test for the new 401 classifier

**Files:**
- Modify: `tests/llm/test_explainer.py` — add one failing test before Task 3 changes the production code

This is TDD discipline: write the test that proves the new behavior **before** writing the code. The test asserts that an unauthorized response (401) classifies as fatal-no-retry — `_llm_explain` returns `None` immediately after one model attempt instead of trying every model in the chain.

- [ ] **Step 1: Read the existing test file structure**

```bash
sed -n '1,40p' tests/llm/test_explainer.py
```

Expected: confirm the file uses pytest classes + `monkeypatch.setenv`, and that an `_FIXTURE_PAYLOAD_BBB` (or similar) is defined.

- [ ] **Step 2: Write the failing test**

Append the following at the bottom of [tests/llm/test_explainer.py](tests/llm/test_explainer.py). The fixture name in the existing file is `_FIXTURE_PAYLOAD_BBB` — confirm by grep before pasting; if it differs, swap to whatever the file already exports.

> **Monkeypatch target subtlety:** `src/llm/explainer.py` does `from openai import OpenAI` **inside** `_llm_explain` (the import is local to the function), so `monkeypatch.setattr(ex, "OpenAI", ...)` would silently no-op (the module-level attribute doesn't exist and the function rebinds locally each call). We must patch on the `openai` module itself: `monkeypatch.setattr("openai.OpenAI", factory)`. The local `from openai import OpenAI` then resolves to our stub.

```python
class TestAuthFailureShortCircuits:
    """A 401 from OpenRouter means the key is unauthorized — every model
    in the chain will fail the same way, so we must short-circuit instead
    of burning the full chain on every request."""

    def test_401_short_circuits_to_template_after_one_attempt(self, monkeypatch):
        from src.llm import explainer as ex
        from openai import APIStatusError
        import httpx

        monkeypatch.delenv("NEUROBRIDGE_DISABLE_LLM", raising=False)
        monkeypatch.setenv("OPENROUTER_API_KEY", "sk-or-v1-deliberately-bad")

        attempts: list[str] = []

        def _raise_401(**kwargs):
            attempts.append(kwargs["model"])
            req = httpx.Request("POST", "https://openrouter.ai/api/v1/chat/completions")
            resp = httpx.Response(status_code=401, request=req)
            raise APIStatusError(message="No auth credentials found", response=resp, body={})

        class _StubCompletions:
            create = staticmethod(_raise_401)

        class _StubChat:
            completions = _StubCompletions()

        class _StubClient:
            chat = _StubChat()

            def __init__(self, **kwargs):
                pass

        # Must patch on the `openai` module — the explainer does
        # `from openai import OpenAI` *inside* the function (see
        # src/llm/explainer.py:269-275), so any module-level attribute
        # on `src.llm.explainer` would be a no-op.
        monkeypatch.setattr("openai.OpenAI", _StubClient)

        out = ex._llm_explain(_FIXTURE_PAYLOAD_BBB, modality="bbb")

        assert out is None, "401 must surface as a None return (caller falls back to template)"
        assert len(attempts) == 1, f"401 must short-circuit; tried {len(attempts)} models: {attempts}"

    def test_explain_returns_template_source_on_401(self, monkeypatch):
        from src.llm import explainer as ex
        from openai import APIStatusError
        import httpx

        monkeypatch.delenv("NEUROBRIDGE_DISABLE_LLM", raising=False)
        monkeypatch.setenv("OPENROUTER_API_KEY", "sk-or-v1-deliberately-bad")

        def _raise_401(**kwargs):
            req = httpx.Request("POST", "https://openrouter.ai/api/v1/chat/completions")
            raise APIStatusError(
                message="auth",
                response=httpx.Response(401, request=req),
                body={},
            )

        class _Comp:
            create = staticmethod(_raise_401)

        class _Chat:
            completions = _Comp()

        class _Client:
            chat = _Chat()

            def __init__(self, **kwargs):
                pass

        monkeypatch.setattr("openai.OpenAI", _Client)

        result = ex.explain(_FIXTURE_PAYLOAD_BBB, modality="bbb")

        assert result["source"] == "template"
        assert result["model"] is None
        assert result["rationale"], "rationale must never be empty"

    def test_400_advances_to_next_model_instead_of_short_circuiting(self, monkeypatch):
        """A 400 from one model is a prompt-shape mismatch with THAT model
        (some models reject system roles, etc.) — try the next, don't give up."""
        from src.llm import explainer as ex
        from openai import APIStatusError
        import httpx

        monkeypatch.delenv("NEUROBRIDGE_DISABLE_LLM", raising=False)
        monkeypatch.setenv("OPENROUTER_API_KEY", "sk-or-v1-anything")

        attempts: list[str] = []
        # Force a known multi-model chain so we can count attempts deterministically
        monkeypatch.setenv("OPENROUTER_FREE_MODELS", "model-a:free,model-b:free,model-c:free")

        def _raise_400(**kwargs):
            attempts.append(kwargs["model"])
            req = httpx.Request("POST", "https://openrouter.ai/api/v1/chat/completions")
            raise APIStatusError(
                message="bad request",
                response=httpx.Response(400, request=req),
                body={},
            )

        class _Comp:
            create = staticmethod(_raise_400)

        class _Chat:
            completions = _Comp()

        class _Client:
            chat = _Chat()

            def __init__(self, **kwargs):
                pass

        monkeypatch.setattr("openai.OpenAI", _Client)

        out = ex._llm_explain(_FIXTURE_PAYLOAD_BBB, modality="bbb")

        assert out is None, "all models 400'd → must return None for template fallback"
        assert attempts == ["model-a:free", "model-b:free", "model-c:free"], (
            f"400 must advance to next model; got attempts={attempts}"
        )
```
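
A standalone repro of that monkeypatch-target rule, runnable in isolation (plain Python import semantics; not part of the code to append):

```python
# A function-local `from openai import OpenAI` re-reads the attribute off the
# `openai` module on every call, so patching `openai.OpenAI` is exactly what
# the local import sees. (monkeypatch restores the original after each test.)
import openai


def make_client():
    from openai import OpenAI  # resolved at call time, not at import time
    return OpenAI


class _Stub:
    pass


openai.OpenAI = _Stub          # equivalent of monkeypatch.setattr("openai.OpenAI", _Stub)
assert make_client() is _Stub  # the local import now sees the stub
```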

- [ ] **Step 3: Run the new tests — at least one MUST fail**

```bash
pytest tests/llm/test_explainer.py::TestAuthFailureShortCircuits -v
```

Expected:
- `test_400_advances_to_next_model_instead_of_short_circuiting` → **FAIL** (current code at [src/llm/explainer.py:303-310](src/llm/explainer.py#L303-L310) treats 400 as fatal and returns `None` after the first model, so `attempts` will equal `["model-a:free"]`, not the full chain).
- The two 401 tests may pass by accident with current code (the catch-all `return None` already short-circuits on any unclassified status). They stay as regression guards; Task 3 replaces the catch-all with an explicit 401 classification and the actionable WARNING these tests protect.

This is the correct TDD red: at least one test fails on a behavior we are about to implement.

- [ ] **Step 4: Commit the failing test**

```bash
git add tests/llm/test_explainer.py
git commit -m "test(llm): pin 401 short-circuit + 400 try-next-model behavior (red)"
```

---

### Task 3: Add explicit 401/400 classification with actionable WARNING

**Files:**
- Modify: `src/llm/explainer.py:303-317`

The current code lumps "real auth failure" with "transient model error" in one branch. We split them so logs make the diagnosis obvious.

- [ ] **Step 1: Re-read the current except block to make the edit precise**

```bash
sed -n '292,320p' src/llm/explainer.py
```

- [ ] **Step 2: Replace the `APIStatusError` block with explicit classification**

Apply this edit to [src/llm/explainer.py](src/llm/explainer.py). Match the existing `except APIStatusError as e:` block (currently at line 303) exactly:

```python
except APIStatusError as e:
    status = getattr(e, "status_code", None)
    # 401 = unauthorized — the key is bad, no model in this chain
    # will succeed. Surface a loud, actionable hint and bail.
    if status == 401:
        logger.warning(
            "OpenRouter 401 unauthorized on %s. The OPENROUTER_API_KEY "
            "is rejected — verify it is current at "
            "https://openrouter.ai/keys and that free-model data-sharing "
            "is enabled at https://openrouter.ai/settings/privacy. "
            "Falling back to deterministic template.",
            model,
        )
        return None
    # 400 = malformed prompt for this specific model (e.g. it
    # rejected our system role). Skip this model, try the next.
    if status == 400:
        logger.info(
            "OpenRouter 400 on %s (likely prompt-shape mismatch); "
            "advancing to next free model.", model,
        )
        continue
    # 402 credits / 403 access / 404 retired-id / 429 rate-limit / 5xx
    # upstream → next model. (429 is listed defensively: RateLimitError
    # subclasses APIStatusError, so it lands here if no dedicated
    # handler catches it first.)
    if status in (402, 403, 404, 429) or (status is not None and 500 <= status < 600):
        logger.info("OpenRouter %s on %s; advancing to next free model.", status, model)
        continue
    logger.warning("LLM call failed on %s (%s); falling back to template.", model, e)
    return None
```

- [ ] **Step 3: Run the new tests — they MUST pass now**

```bash
pytest tests/llm/test_explainer.py::TestAuthFailureShortCircuits -v
```

Expected: all three PASS.

- [ ] **Step 4: Run the full LLM-explainer test suite to confirm no regressions**

```bash
pytest tests/llm/ -v
```

Expected: all template-path tests still pass (they should — they're env-gated to `NEUROBRIDGE_DISABLE_LLM=1`, untouched).

- [ ] **Step 5: Commit**

```bash
git add src/llm/explainer.py
git commit -m "feat(llm): classify 401 as fatal+actionable, 400 as skip-this-model"
```

---

### Task 4: Refresh `_DEFAULT_FREE_MODEL_CHAIN` with verified-live IDs

**Files:**
- Modify: `src/llm/explainer.py:62-73`

Use the OK list from Task 1's diagnostic. The chain should be ordered **smartest → smallest** so the best model is tried first; quota-exhausted models advance to the next.

- [ ] **Step 1: Re-run the diagnostic to confirm the chain is still live**

```bash
python scripts/diagnose_openrouter.py
```

Expected: at least 3 lines marked `OK`. Capture them.

- [ ] **Step 2: Replace the chain in [src/llm/explainer.py:62-73](src/llm/explainer.py#L62-L73)**

The exact replacement depends on Task 1's results. Example assuming Step 1 confirms `gemma-2-9b-it`, `llama-3.3-70b-instruct`, `mistral-7b-instruct`, `llama-3.2-3b-instruct` are OK:

```python
# Free-tier fallback chain, smartest → smallest. When a model returns 429
# (rate-limit / daily-quota exhausted), 402 (credits), 404 (id retired) or
# 5xx (upstream), we advance to the next model. Network/timeout errors fall
# straight to the deterministic template — switching models won't help.
# Override at runtime via OPENROUTER_FREE_MODELS (comma-separated). Model
# availability on OpenRouter churns; verify with scripts/diagnose_openrouter.py.
_DEFAULT_FREE_MODEL_CHAIN: tuple[str, ...] = (
    "meta-llama/llama-3.3-70b-instruct:free",  # 70B reasoning-capable
    "google/gemma-2-9b-it:free",               # 9B instruct, fast
    "mistralai/mistral-7b-instruct:free",      # 7B last-resort
    "meta-llama/llama-3.2-3b-instruct:free",   # 3B emergency
)
```

If Task 1 returned different OK IDs, substitute them; preserve the smartest-first ordering (a quick way to trial a candidate ordering without editing the module is sketched below).

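The tests in Task 2 already rely on the `OPENROUTER_FREE_MODELS` env override, so a candidate ordering can be trialed per-process before committing it to the module. A plausible sketch of the resolution that override implies (the helper name is hypothetical, not the module's actual function; only the env var name and the comma-separated format come from this plan):

```python
import os


def _resolve_chain(default: tuple[str, ...]) -> tuple[str, ...]:
    """Hypothetical illustration: env override wins, else the hard-coded chain."""
    raw = os.environ.get("OPENROUTER_FREE_MODELS", "")
    override = tuple(part.strip() for part in raw.split(",") if part.strip())
    return override or default
```
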
- [ ] **Step 3: Re-run the unit suite — must still pass**

```bash
pytest tests/llm/ -v
```

Expected: all green. The chain change is config-only (no test asserts specific model IDs).

- [ ] **Step 4: Commit**

```bash
git add src/llm/explainer.py
git commit -m "feat(llm): refresh free-tier chain with verified-live OpenRouter IDs"
```

---

### Task 5: Add one network-gated end-to-end LLM integration test

**Files:**
- Modify: `tests/llm/test_explainer.py` — append a new class

The unit suite proves classifier behavior with mocked errors. This test proves the **real** path: with a working key, `explain()` returns `source="llm"` and a non-empty rationale. It auto-skips when the key is missing so CI without secrets stays green.

- [ ] **Step 1: Append the integration test**

Add at the bottom of [tests/llm/test_explainer.py](tests/llm/test_explainer.py):

```python
import os as _os

import pytest as _pytest


@_pytest.mark.skipif(
    not _os.environ.get("OPENROUTER_API_KEY"),
    reason="OPENROUTER_API_KEY not set — skipping live LLM integration test",
)
@_pytest.mark.skipif(
    _os.environ.get("NEUROBRIDGE_DISABLE_LLM") == "1",
    reason="NEUROBRIDGE_DISABLE_LLM=1 — skipping live LLM integration test",
)
class TestLiveOpenRouterLLM:
    """End-to-end: hit a real OpenRouter free-tier model and assert
    `explain()` returns source='llm' with non-empty content. Skipped
    when no key is set or the kill-switch is on."""

    def test_bbb_explain_returns_llm_source_with_real_key(self):
        from src.llm import explainer as ex

        result = ex.explain(_FIXTURE_PAYLOAD_BBB, modality="bbb")

        # If every model in the chain is rate-limited or unreachable RIGHT NOW
        # the result will fall back to template — that's a flaky-network
        # condition, not a code bug. Surface it as a skip with an explanatory
        # message instead of a hard failure.
        if result["source"] == "template":
            _pytest.skip(
                "All free models in the chain were rate-limited or unreachable "
                "at test time. Re-run later or run scripts/diagnose_openrouter.py."
            )

        assert result["source"] == "llm"
        assert result["model"] is not None and result["model"].endswith(":free")
        assert result["rationale"].strip(), "LLM returned empty rationale"
        # Sanity: the rationale should mention SOMETHING about the prediction.
        # We do not assert on exact model wording (non-deterministic), but
        # we do assert it isn't a generic refusal/safety-filter response.
        lowered = result["rationale"].lower()
        assert not lowered.startswith("i cannot"), f"LLM refused: {result['rationale']!r}"
```

- [ ] **Step 2: Run the integration test**

```bash
pytest tests/llm/test_explainer.py::TestLiveOpenRouterLLM -v -s
```

Expected (with a working key, post-Task 1 fix): PASS, with `-s` showing any OpenRouter WARNING/INFO log lines.

If it skips with "rate-limited or unreachable": wait 60s and retry. If it skips with "OPENROUTER_API_KEY not set": the key is not visible to pytest — export it (or load `.env`) and re-run; if it then fails with 401, Task 1's auth issue is unresolved — go back to Task 1 Step 3.

- [ ] **Step 3: Run the FULL test suite to confirm 188 → 191 (or higher)**

```bash
pytest -q --tb=line
```

Expected: previous count + 3 new passing unit tests + 1 new (passing or skipping) integration test. **Zero failures.**

- [ ] **Step 4: Commit**

```bash
git add tests/llm/test_explainer.py
git commit -m "test(llm): add network-gated end-to-end OpenRouter integration test"
```

---

### Task 6: End-to-end live verification through FastAPI + Streamlit

**Files:** none (verification only)

Confirm the wiring works the same way the user's UI smoke-test did, but with LLM **enabled**.

- [ ] **Step 1: Start FastAPI WITHOUT the kill-switch**

```bash
NEUROBRIDGE_DISABLE_MLFLOW=1 \
uvicorn src.api.main:app --host 127.0.0.1 --port 8000 --log-level info &
sleep 4
curl -s http://127.0.0.1:8000/health | python -m json.tool
```

Expected: `{"status":"ok","pipelines":["bbb","eeg","mri"]}`. **Note the absence of `NEUROBRIDGE_DISABLE_LLM`** — that's the whole point.

- [ ] **Step 2: Hit /explain/bbb with a real prediction payload**

```bash
curl -s -X POST http://127.0.0.1:8000/explain/bbb \
  -H 'Content-Type: application/json' \
  -d '{
    "smiles": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
    "label": 1,
    "label_text": "permeable",
    "confidence": 0.98,
    "top_features": [
      {"feature":"fp_1822","shap_value":0.0796},
      {"feature":"fp_1224","shap_value":0.0637},
      {"feature":"fp_1323","shap_value":0.0570}
    ]
  }' | python -m json.tool
```

Expected JSON: `"source": "llm"`, `"model": "<one of the chain ids>"`, `"rationale": "<2-4 free-form sentences mentioning caffeine / permeability / SHAP>"`. **Not** `"source": "template"`.

If `"source": "template"`: check the uvicorn log for the WARNING line added in Task 3 — it will tell you whether the cause was 401 (key issue), all-models-exhausted (quota/network), or something else.

- [ ] **Step 3: Hit /explain/eeg and /explain/mri**

```bash
curl -s -X POST http://127.0.0.1:8000/explain/eeg \
  -H 'Content-Type: application/json' \
  -d '{"rows": 62, "columns": 640, "duration_sec": 1.86, "mlflow_run_id": "test"}' \
  | python -m json.tool

curl -s -X POST http://127.0.0.1:8000/explain/mri \
  -H 'Content-Type: application/json' \
  -d '{"site_gap_pre": 8975.3, "site_gap_post": 3057.6, "reduction_factor": 3, "n_subjects": 6}' \
  | python -m json.tool
```

Expected: both return `"source": "llm"` with modality-appropriate prose.

- [ ] **Step 4: Start Streamlit and load the UI**

```bash
NEUROBRIDGE_API_URL=http://127.0.0.1:8000 \
NEUROBRIDGE_DISABLE_MLFLOW=1 \
streamlit run src/frontend/app.py --server.port 8501 \
  --server.headless true --browser.gatherUsageStats false &
sleep 5
curl -s -o /dev/null -w "HTTP %{http_code}\n" http://127.0.0.1:8501/
```

Expected: HTTP 200.

- [ ] **Step 5: Manually verify the UI status badge flipped**

Open http://127.0.0.1:8501 in a browser. The Molecule (BBB) tab header should show `explainer · llm online` (green dot), **not** `explainer · template only` (muted). The status-line render is at [src/frontend/app.py:961-977](src/frontend/app.py#L961-L977) and depends on `_LLM_DISABLED`, which reads `NEUROBRIDGE_DISABLE_LLM` at import time — since we did not set it, it should be False.
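
That gate reduces to roughly the following (illustrative; the real render is at the `app.py` lines cited above). It also explains why a leaked env var needs a process restart, not a page refresh:

```python
import os

# Read once, at import: changing NEUROBRIDGE_DISABLE_LLM after Streamlit
# starts has no effect until the process restarts. Badge strings as quoted
# in this plan.
_LLM_DISABLED = os.environ.get("NEUROBRIDGE_DISABLE_LLM") == "1"

badge = "explainer · template only" if _LLM_DISABLED else "explainer · llm online"
```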

Then: predict a SMILES (e.g. caffeine `CN1C=NC2=C1C(=O)N(C(=O)N2C)C`), click the AI Assistant tab, and generate a rationale. The rationale text should be free-form prose (not the templated "Predicted **X** with N% confidence." sentence pattern). The AI Assistant tab status indicator at [src/frontend/app.py:1056-1062](src/frontend/app.py#L1056-L1062) should also read `llm · online`.

If the badge still says `template only`: the env var leaked from a parent shell. `unset NEUROBRIDGE_DISABLE_LLM` and restart Streamlit.

- [ ] **Step 6: Tear down**

```bash
pkill -f "uvicorn src.api.main"
pkill -f "streamlit run src/frontend"
sleep 2
lsof -iTCP:8000 -sTCP:LISTEN 2>/dev/null
lsof -iTCP:8501 -sTCP:LISTEN 2>/dev/null
echo "(both empty = down)"
```

- [ ] **Step 7: No commit (verification-only task)**

If Step 2 or Step 5 surfaced any issue, fix it in the relevant earlier task and re-run from Step 1. Do not paper over a `source: "template"` response with a follow-up commit — root-cause it.

---

## Self-Review Checklist (run before declaring done)

- [ ] `pytest -q` reports the previous baseline + 3 new passing unit tests + 1 new passing-or-skipping integration test, zero failures.
- [ ] `python scripts/diagnose_openrouter.py` lists ≥1 OK model among the IDs hard-coded in `_DEFAULT_FREE_MODEL_CHAIN`.
- [ ] `curl /explain/bbb` with a real payload returns `"source": "llm"`.
- [ ] Streamlit BBB tab badge shows `explainer · llm online`, AI Assistant tab badge shows `llm · online`.
- [ ] Module docstring at [src/llm/explainer.py:1-10](src/llm/explainer.py#L1-L10) is still accurate (template = source of truth for unit tests, LLM = primary path in production).
- [ ] `NEUROBRIDGE_DISABLE_LLM=1` still forces template (existing test `test_disable_flag_forces_template_even_with_key_set` still passes — kill-switch preserved).

---

## Out of Scope (explicit non-goals)

- **Removing the template entirely.** The template stays as the outage fallback. The user said "remove from template", not "remove the template" — and even if they meant the latter, removing the template would mean a network blip = HTTP 500 from `/explain/*`, which the system-reliability shape of the project explicitly avoids (see [src/llm/explainer.py:1-10](src/llm/explainer.py#L1-L10) and the existing `test_disable_flag_forces_template_even_with_key_set` test).
- **Switching to a paid model / different provider.** The free-tier story is part of the hackathon narrative ("public-deployable on HF Spaces with one push"). Anthropic / OpenAI direct integration is a separate plan.
- **Streaming responses.** OpenRouter supports SSE streaming, but neither the current API contract (`BBBExplainResponse` is a single string) nor the Streamlit UI asks for it.
- **Caching identical (payload, model) pairs.** Could halve latency for repeat clicks but adds a cache-invalidation surface; defer until a user actually complains about latency.