# AGENTS.md — NeuroBridge Enterprise Pipeline

> Read this file at the start of every session. It is the contract every agent
> (human or LLM) operates under in this repository.

## 1. Project Vision

**NeuroBridge Enterprise** is a B2B SaaS platform that solves three structural
problems in real-world clinical/biomedical ML pipelines:

1. **Data Drift** between hospitals and acquisition sites (multi-center MRI).
2. **Missing Modalities** (a patient may have MRI but no EEG, or vice versa).
3. **Artifacts** in raw biosignals (eye blinks, line noise, motion in EEG).

The platform exposes three production pipelines behind a single FastAPI surface:

| Modality | Pipeline | Core Technique |
|---|---|---|
| Image (MRI / fMRI) | `src/pipelines/mri_pipeline.py` | ComBat Harmonization for site-level domain shift |
| Signal (EEG) | `src/pipelines/eeg_pipeline.py` | MNE-Python + ICA for artifact removal |
| Tabular (BBB / molecules) | `src/pipelines/bbb_pipeline.py` | RDKit Morgan fingerprints from SMILES |

All experiment runs are tracked in **MLflow**. All services ship as **Docker** images.

## 2. Directory Layout (load-bearing — do not violate)

```
.
├── AGENTS.md                 # This file
├── README.md
├── requirements.txt
├── pytest.ini
├── conftest.py               # Repo-wide pytest fixtures (autouse: pins MLFLOW_TRACKING_URI to tmp dir for test isolation)
├── Dockerfile                # Production image (FastAPI + pipelines)
├── docker-compose.yml        # api + mlflow services for local stack
├── .dockerignore
├── .streamlit/
│   └── config.toml           # Streamlit theme tokens
├── data/
│   ├── raw/                  # Untouched source data. NEVER train on this directly.
│   └── processed/            # Pipeline output as Parquet (preserves dtypes; overwritten each run; see §4).
├── src/
│   ├── api/                  # FastAPI surface
│   │   ├── main.py           # App factory + /health
│   │   ├── routes.py         # POST /pipeline/{bbb,eeg,mri} dispatch
│   │   └── schemas.py        # Shared Pydantic request/response models
│   ├── core/                 # Cross-cutting utilities
│   │   ├── logger.py         # Structured logger (mandatory in every pipeline)
│   │   ├── determinism.py    # Thread-pin env vars (OMP/OPENBLAS/MKL/pyarrow)
│   │   ├── storage.py        # Parquet read/write helpers (snappy, single-threaded, deterministic)
│   │   └── tracking.py       # MLflow `track_pipeline_run` context manager (see §7)
│   ├── pipelines/            # One file per modality. Pure functions + a `run_pipeline()` entry.
│   ├── models/               # Downstream decision-layer models
│   │   ├── bbb_model.py      # BBB-permeability classifier + SHAP explainer + trainer CLI
│   │   └── mri_model.py      # Volumetric MRI ONNX inference surface (external training)
│   ├── llm/                  # Natural-language explainers (template + OpenRouter fallback)
│   ├── rag/                  # Fastembed + FAISS retrieval layer
│   ├── agents/               # Tool registry + guarded OpenRouter orchestrator
│   └── frontend/
│       └── app.py            # Streamlit dashboard
└── tests/
    ├── core/
    ├── api/
    ├── frontend/
    ├── pipelines/            # incl. test_cross_pipeline_smoke.py for integration coverage
    └── fixtures/             # Tiny synthetic data files used by tests (NOT a Python package — no __init__.py)
```

**Rules:**
- New modality → new file under `src/pipelines/`. No mixing modalities in one file.
- Anything imported by 2+ pipelines → `src/core/`.
- Pipeline code (`src/pipelines/`, `src/core/`) must not read from or write to any path outside `data/`. Test code may read `tests/fixtures/`. The `data/` boundary is the storage contract for production data.
- `tests/fixtures/` holds CSV / numpy / DICOM blobs — do **not** add an `__init__.py` there.

## 3. Coding Standards

- **Python 3.10–3.12** (the pinned native-extension dependencies do not yet ship cp313+ wheels). Use `from __future__ import annotations` when needed for forward refs.
- **Type hints are mandatory** on every public function/method (parameters and return).
- **Modular structure.** One responsibility per function. If a function exceeds ~40 lines or 3 levels of nesting, split it.
- **TDD is the default workflow.** Write the failing test first, watch it fail, then implement. Tests live in `tests/` mirroring `src/`.
- **Logging is mandatory** for every pipeline. Use `src.core.logger.get_logger(__name__)`. No `print()` in `src/`.
- **Docstrings** on every public function — one-line summary + Args/Returns when non-trivial.
- **No hard-coded paths in business logic.** Pass paths as arguments to `run_pipeline(input_path, output_path)`.
- **Format & lint:** keep imports sorted; prefer `pathlib.Path` over `os.path`.
- **Commits are small and frequent.** Each green test → commit.

## 4. Data Readiness Principles

> **The Golden Rule: never train a model directly on raw data. Raw data must always pass through a pipeline first.**

Every modality pipeline MUST guarantee, before writing to `data/processed/`:

1. **Schema validity** — required columns present, expected dtypes.
2. **Domain validity** — invalid records (e.g. unparseable SMILES, NaN-only EEG epochs, corrupted DICOMs) are **logged with their identifier and dropped**, never silently coerced.
3. **Determinism** — given the same `data/raw/` input, the pipeline produces byte-identical `data/processed/` output. No wall-clock, no random seeds without explicit seeding.
4. **Traceability** — log row count in, row count out, and percentage dropped at INFO level.
5. **Idempotence** — re-running the pipeline overwrites `data/processed/` cleanly; no append, no partial writes.

**Determinism environment**: byte-identical output requires deterministic
floating-point reductions. Each pipeline module sets `OMP_NUM_THREADS=1`,
`OPENBLAS_NUM_THREADS=1`, `MKL_NUM_THREADS=1`, and pins pyarrow to
single-threaded mode at import time. CI runners and developer machines do
not need to set these manually — the pipeline modules handle it — but
overriding them in the environment will break Determinism rule 3.
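
A minimal sketch of such an import-time pin, assuming `setdefault` semantics
(the layout of the real `src/core/determinism.py` may differ):

```python
import os

# Pin BLAS thread pools before numpy/scipy are imported anywhere downstream.
# setdefault keeps an explicit operator override visible — which is exactly
# how overriding these in the environment can break Determinism rule 3.
for _var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ.setdefault(_var, "1")

import pyarrow as pa

pa.set_cpu_count(1)        # single-threaded compute
pa.set_io_thread_count(1)  # single-threaded I/O
```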

**ComBat determinism boundary**: the MRI pipeline's `harmonize_combat` wraps
`neuroHarmonize.harmonizationLearn` and rounds its output to 14 decimal places (`np.round(arr, 14)`).
This is a defensive measure: with the thread-pinning above, harmonization is
already bit-identical, but the rounding guarantees byte-identity even when
the env-pin discipline is bypassed (e.g. a sub-process that re-exports a
thread count). It discards ~5 trailing-mantissa bits of float64 — well below
ComBat's biological effect-size precision floor.

A model training script is allowed to import from `data/processed/` only. If a
training script references `data/raw/` directly, that is a bug and must be
refactored into a pipeline.

## 5. How to Add a New Pipeline (checklist)

1. Add `tests/pipelines/test_<name>_pipeline.py` with the failing tests first.
2. Create `src/pipelines/<name>_pipeline.py` exposing `run_pipeline(input_path: Path, output_path: Path) -> None`.
3. Use `get_logger(__name__)` for all status output (per §3).
4. Apply the §4 Data Readiness contract: validate + drop invalid records with a logged WARNING (identifier + count), log row count in/out/dropped at INFO, write deterministically, and overwrite (do not append) on re-run.
5. Write deterministic output to `output_path`.
6. Document any new dependency in `requirements.txt` (pinned).
7. Add a one-line entry to this file's pipeline table.
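
A skeletal module satisfying steps 2-5 (column names and the validity rule
are illustrative — only the `run_pipeline` signature is the contract):

```python
from pathlib import Path

import pandas as pd

from src.core.logger import get_logger

logger = get_logger(__name__)


def run_pipeline(input_path: Path, output_path: Path) -> None:
    """Validate raw records, drop invalid ones, write deterministic Parquet."""
    df = pd.read_csv(input_path)
    rows_in = len(df)

    valid = df.dropna(subset=["id"])  # stand-in for a real domain-validity check
    dropped = rows_in - len(valid)
    if dropped:
        logger.warning("dropped %d invalid records", dropped)
    logger.info("rows_in=%d rows_out=%d dropped_pct=%.1f",
                rows_in, len(valid), 100 * dropped / max(rows_in, 1))

    # Overwrite, never append (§4 Idempotence); snappy Parquet (§6).
    valid.to_parquet(output_path, engine="pyarrow", compression="snappy")
```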

## 6. Storage Format Convention

All `data/processed/` outputs MUST be **Parquet** (`pyarrow` engine, `compression="snappy"`):
- Preserves dtypes (uint8 fingerprints stay uint8; float64 EEG features stay float64) — CSV silently widens numeric columns and is unsuitable for the high-dimensional float arrays produced by the EEG and MRI pipelines.
- Byte-deterministic with fixed compression and single-threaded writes (satisfies §4 Determinism).
- Read with `pd.read_parquet(path)`; no dtype hints required.

The raw `data/raw/` inputs may be in any vendor-supplied format (CSV for BBBP, EDF/FIF for EEG, NIfTI for MRI).
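
A quick illustration of why Parquet (and not CSV) is the contract — dtypes
survive the round trip (the file name here is a throwaway):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"fp": np.zeros(4, dtype="uint8"), "feat": np.ones(4)})
df.to_parquet("demo.parquet", engine="pyarrow", compression="snappy")

back = pd.read_parquet("demo.parquet")
assert back["fp"].dtype == "uint8"      # preserved exactly
assert back["feat"].dtype == "float64"  # a CSV round trip would widen fp to int64
```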

## 7. Experiment Tracking

Every `run_pipeline()` invocation logs to MLflow via `src.core.tracking.track_pipeline_run`:

- **Experiment names** match the pipeline module: `bbb_pipeline`, `eeg_pipeline`, `mri_pipeline`.
- **Params**: input/output paths and pipeline hyperparameters (e.g. BBB `n_bits` / `radius`, EEG `epoch_duration_s` / `random_state`, MRI `intensity_threshold` / `n_roi_axes`).
- **Metrics**: row counts (`rows_in`, `rows_out`, `rows_dropped` — or modality equivalent like `subjects_in/out/dropped`) and `duration_sec`.
- **Artifact**: the produced Parquet at `data/processed/<modality>_features.parquet`.

The tracking URI is read from `MLFLOW_TRACKING_URI` (defaults to `./mlruns/` when unset).
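
Sketch of the expected call pattern, assuming a context-manager signature
along these lines (the real `track_pipeline_run` arguments may differ):

```python
import mlflow

from src.core.tracking import track_pipeline_run

with track_pipeline_run("bbb_pipeline", params={"n_bits": 2048, "radius": 2}) as run:
    ...  # pipeline body: validate, transform, write Parquet
    if run is not None:  # None when tracking is disabled (see below)
        mlflow.log_metrics({"rows_in": 100, "rows_out": 98, "rows_dropped": 2})
```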

**Live-demo lifeline**: set `NEUROBRIDGE_DISABLE_MLFLOW=1` to skip tracking entirely — the helper yields `None` and emits no MLflow calls. Use this when the tracking server is unreachable (offline demo, network outage, or CI without an MLflow service). Pipelines complete normally; only the run metadata is lost.

The repo-wide `conftest.py` autouse fixture pins `MLFLOW_TRACKING_URI` to a tmp directory for the test session, so the production `mlruns/` directory is never written by the test suite. Tests that interact with MLflow (in `tests/core/test_tracking.py` and the per-pipeline `Test<Modality>PipelineMLflow` classes) all share this isolated store.

## 8. Decision Layer (Downstream Models)

Pipelines produce features (`data/processed/<modality>_features.parquet`).
Downstream models live in `src/models/` and either consume those processed
features or apply a deterministic model-local preprocessing contract:

| Model | File | Output | Endpoint |
|---|---|---|---|
| BBB permeability | `src/models/bbb_model.py` | `data/processed/bbb_model.joblib` | `POST /predict/bbb` |
| MRI image classifier | `src/models/mri_model.py` | `data/processed/mri_model.onnx` | `POST /predict/mri` |

In-repo trainable downstream model modules expose a uniform surface:
- `train(df, label_col, ...)` → fitted classifier
- `save(model, path)` / `load(path)` → joblib artifact I/O
- `predict_with_proba(model, smiles)` → `{label, confidence}` (confidence is the max-class probability)
- `explain_prediction(model, smiles, top_k)` → SHAP top-k attributions sorted by `|shap_value|` descending

MRI deep-learning exception: training happens outside this repo and exports
ONNX, so `mri_model` does not expose `train()` or SHAP. At runtime the API
loads the ONNX artifact with `mri_model.load()`, preprocesses one NIfTI via the
same deterministic resize + z-score contract used during training
(`preprocess_nifti()`), then returns class probabilities via `predict_nifti()`.

The API loads model artifacts at request time. If an artifact is missing,
the endpoint returns **HTTP 503** with a remediation hint instead of failing
process startup. BBB points at the trainer CLI (`python -m src.models.bbb_model`);
MRI points at the external ONNX export path.
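
The request-time guard reduces to a sketch like this (the default path and
message wording are illustrative):

```python
import os
from pathlib import Path

import joblib
from fastapi import HTTPException


def _load_bbb_model():
    path = Path(os.environ.get("BBB_MODEL_PATH", "data/processed/bbb_model.joblib"))
    if not path.exists():
        raise HTTPException(
            status_code=503,
            detail="BBB model missing — train it: python -m src.models.bbb_model",
        )
    return joblib.load(path)
```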

**Determinism**: all in-repo classifiers are seeded (`random_state=42`
default), `n_jobs=1` (no tree-parallelism races). Re-running the BBB trainer
on the same Parquet produces identical predictions. MRI ONNX determinism is
bounded by the exported model plus the fixed runtime preprocessing contract.

**Override `BBB_MODEL_PATH`** env var to point the API at a non-default
artifact location (used by tests for tmp_path isolation).

**Override `MRI_MODEL_PATH`** env var to point the API at a non-default ONNX
artifact location. If the ONNX artifact is missing, `POST /predict/mri`
returns **HTTP 503** with a remediation hint.

**Calibration metadata** (Day 6): `train()` does an 80/20 stratified split,
computes precision-at-confidence-threshold bins on the held-out test set,
and stashes them on `model._neurobridge_calibration: list[dict]` (sorted
ascending by threshold). The API includes the bin matching each
prediction's confidence in `BBBPredictResponse.calibration`. UI uses this
to render an honest trust caption ("≥75% confident → 92% precision, n=18").
For tiny test fixtures where stratified split fails, calibration falls
back to zero-support bins so the API contract is always populated.
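
Assuming each bin is a dict with `threshold`, `precision`, and `n` keys (the
exact key names are not pinned down here), matching a prediction to its bin
is a walk over the ascending-sorted list:

```python
def matching_bin(calibration: list[dict], confidence: float) -> dict:
    """Return the bin with the highest threshold <= confidence."""
    chosen = calibration[0]  # bins are sorted ascending by threshold
    for b in calibration:
        if confidence >= b["threshold"]:
            chosen = b
    return chosen
```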

## 9. Demo Features (Day 6)

The frontend includes three jury-day demo amplifiers that don't change
the core contract:

- **Edge-case dropdown** (BBB tab): a curated catalog of 5 robustness
  probes, including invalid SMILES, empty input, an OOD macrocycle
  (cyclosporine-like), and a heavily halogenated aromatic. Each has a
  stated expectation; the UI visualizes graceful failure (HTTP 400 →
  recoverable warning, never a crash).
- **Calibration trust caption** (BBB decision card): renders the
  precision-at-confidence-threshold from `BBBPredictResponse.calibration`.
  Demonstrates that the system knows what it doesn't know.
- **MRI ComBat diagnostics** (MRI tab): `POST /pipeline/mri/diagnostics`
  runs the pipeline twice (pre + post ComBat) and returns long-format
  data + site-gap KPIs (Pre, Post, Reduction factor). The UI renders
  a faceted altair density plot — visual proof that ComBat removes
  site-driven domain shift.

## 10. Drift Surface (Day 7)

Each predict route maintains a per-worker rolling window of recent
prediction confidences (`collections.deque(maxlen=100)`). Train-time
median + std are stashed on `model._neurobridge_train_stats` (joblib
roundtrip-safe). The drift z-score is `(rolling_median − train_median) /
max(train_std, 1e-9)`, computed only when the buffer holds ≥10 samples
AND the model has the train-stats attribute. The `/predict/bbb`
response carries `drift_z: float | None` and `rolling_n: int`. The UI
renders a one-line caption with a magnitude tag (in-band, mild,
significant). Worker restart clears the deque; this is acceptable for
demo and removes the audit-trail concern.
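
A compact sketch of that computation, assuming the train stats are stored as
a dict with `median` and `std` keys (the real attribute layout may differ):

```python
import statistics
from collections import deque

_ROLLING: deque[float] = deque(maxlen=100)  # per-worker confidence window


def drift_z(model, new_confidence: float) -> float | None:
    _ROLLING.append(new_confidence)
    stats = getattr(model, "_neurobridge_train_stats", None)
    if stats is None or len(_ROLLING) < 10:
        return None  # not enough evidence to report drift
    rolling_median = statistics.median(_ROLLING)
    return (rolling_median - stats["median"]) / max(stats["std"], 1e-9)
```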

## 11. LLM Explainer Surface (Day 7 + 9)

`src/llm/explainer.py` is the single entry point for natural-language
rationales. `explain(payload)` always returns `{rationale, source,
model}`. The deterministic template path is the source of truth for
tests; the LLM path is OpenRouter via the `openai==1.51.0` SDK and
walks a **smartest → smallest free-tier fallback chain**
(`_DEFAULT_FREE_MODEL_CHAIN`, 10 ids — head: `inclusionai/ling-2.6-1t:free`).
The chain is overridable at runtime via `OPENROUTER_FREE_MODELS`
(comma-separated). Status-code classification:

- `401` → key is bad → bail to template + actionable WARNING (rotate at
  https://openrouter.ai/keys, enable free-model data-sharing at
  https://openrouter.ai/settings/privacy).
- `400` → prompt-shape mismatch on this model → advance to next.
- `402 / 403 / 404 / 429 / 5xx` → advance to next.
- Network/timeout → bail to template (switching models won't help).
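
In loop form the classification amounts to the following sketch (exception
classes are from the `openai` SDK; the real `explainer.py` surely differs in
detail — `None` here means "fall back to the template"):

```python
import openai


def _llm_rationale(client: openai.OpenAI, chain: list[str], prompt: str) -> str | None:
    for model_id in chain:  # smartest → smallest
        try:
            resp = client.chat.completions.create(
                model=model_id,
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.choices[0].message.content
        except openai.APIConnectionError:
            return None  # network/timeout — switching models won't help
        except openai.APIStatusError as err:
            if err.status_code == 401:
                return None  # bad key — bail and log the rotation hint
            continue  # 400/402/403/404/429/5xx — advance to the next model
    return None  # chain exhausted
```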

Two env knobs control the gate:

- `OPENROUTER_API_KEY` — when absent, fallback to template.
- `NEUROBRIDGE_DISABLE_LLM=1` — hard kill-switch; force template even
  if a key is set. Use this for demo days when you want fully
  deterministic, reproducible rationales.

**Prompt design** (`_build_llm_prompt`): two intent modes. When the
caller supplies `user_question`, the model is instructed to
language-match (Turkish question → Turkish answer), answer the
question directly (not a canned paper-style summary), and respond
conversationally to off-topic / greeting questions. When no
`user_question` is supplied, the prompt falls back to the original 2-4
sentence paper-style rationale.

The `POST /explain/bbb` endpoint mirrors this contract. Pydantic
enforces a non-empty `top_features` list (422 on empty); every other
failure mode degrades to template + WARNING log + `source="template"`.

**Diagnostics**: `GET /diag/openrouter` (`src/api/main.py`) returns
key-presence (length + 12-char prefix only), kill-switch state, chain
length, first model id, and the result of an 8-token probe call
against that model. Surfaced in Streamlit as the sidebar "🔧 Diagnose
LLM" button. Use it when the deployed Space shows `source="template"`
unexpectedly — the most common causes are a missing/misnamed
`OPENROUTER_API_KEY` Space secret or a revoked key.

## 12. Multi-Modal Explainer (Day 8)

`src/llm/explainer.py` exposes `explain(payload, modality)` where
`modality ∈ {"bbb", "eeg", "mri"}`. Each modality has its own
deterministic template (`_template_explain_bbb / _eeg / _mri`) and
its own LLM prompt header. Unknown modality strings degrade to the
BBB template with a warning log; the function never raises. The
hybrid OpenRouter fallback contract from §11 applies uniformly.
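
The never-raises dispatch can be pictured as a dict lookup with a logged BBB
fallback (the template functions are stubbed here; the wiring is a sketch):

```python
import logging

logger = logging.getLogger(__name__)


def _template_explain_bbb(payload: dict) -> str: ...  # per-modality templates
def _template_explain_eeg(payload: dict) -> str: ...
def _template_explain_mri(payload: dict) -> str: ...


_TEMPLATES = {
    "bbb": _template_explain_bbb,
    "eeg": _template_explain_eeg,
    "mri": _template_explain_mri,
}


def _template_for(modality: str):
    fn = _TEMPLATES.get(modality)
    if fn is None:  # unknown modality: degrade, never raise
        logger.warning("unknown modality %r — using bbb template", modality)
        fn = _TEMPLATES["bbb"]
    return fn
```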

The API exposes three matching endpoints — `POST /explain/{bbb,eeg,mri}` —
each on the `explain_router` (`/explain` prefix). Streamlit surfaces
the BBB version in the AI Assistant tab and the EEG/MRI versions as
inline expanders inside their respective pipeline tabs.

## 13. Experiments Surface (Day 8)

`GET /experiments/runs` returns up to 50 most recent MLflow runs
across the bbb/eeg/mri experiments, flattened into a list of
`MLflowRunSummary` (run_id, experiment_name, start_time, status,
metrics, params). `POST /experiments/diff {run_id_a, run_id_b}`
returns a side-by-side metric+param diff (`RunDiffRow`).

When `NEUROBRIDGE_DISABLE_MLFLOW=1`, both endpoints return empty
responses without raising — useful for deployments where there is no
writable `mlruns/` tree or the tracking server is unavailable. Unknown
run ids → 404.

The Streamlit "Experiments" tab is the user-facing surface. Cached
in session state with an explicit Refresh button.

## 14. Deploy Surface (Day 8)

`Dockerfile.hf` is the Hugging Face Spaces image. Single container,
two processes (FastAPI :8000 + Streamlit :7860) launched via
`supervisord.conf`. Build-time `RUN python -m src.models.bbb_model`
bakes the BBB model artifact into the image so the first `/predict/bbb`
call is instant on cold start. Build-time RAG ingest creates
`data/processed/faiss_index/`.

`docker-entrypoint.sh` is the runtime guard for local Docker/Compose demos:
when a mounted `./data` volume hides image-built artifacts, it seeds fixture
raw data, rebuilds missing BBB features/model artifacts, and rebuilds the
FAISS index before starting supervisord. It does not bake
`NEUROBRIDGE_DISABLE_MLFLOW=1` into the image; operators may set that env at
runtime if their tracking service is unavailable.

Default environment: `DEPLOY_ENV=hf_spaces`. The LLM kill-switch is **not**
set — deployed Spaces use the real OpenRouter free-tier chain (§11) when
`OPENROUTER_API_KEY` is configured in the Space's Secrets panel. Set
`NEUROBRIDGE_DISABLE_LLM=1` only when you want to force the deterministic
template path for a fully-reproducible demo.

The README's YAML front-matter declares the Space metadata
(SDK=docker, port=7860, app_file=src/frontend/app.py).

## 15. Orchestrator Agent Surface

`src/agents/orchestrator.py` exposes a single-agent function-calling
loop over the openai SDK (no LangChain / framework dep). The API enables
the guarded workflow mode: if the LLM skips or mis-shapes a required tool
call, deterministic routing in `src/agents/routing.py` falls back to exactly
one pipeline tool, then exactly one retrieval tool, then final synthesis.
The agent holds 4 tools, defined in `src/agents/tools.py`:

- `run_bbb_pipeline(smiles, top_k)` — wraps `POST /predict/bbb`
- `run_eeg_pipeline(input_path)` — wraps `POST /pipeline/eeg`
- `run_mri_pipeline(input_dir, sites_csv=None)` — wraps `POST /pipeline/mri`
  and defaults `sites_csv` to `<input_dir>/sites.csv`
- `retrieve_context(query, k)` — wraps `src/rag/retrieve.py`
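
For reference, a tool in the registry is declared with the standard OpenAI
function-calling schema — the shape below is that standard, but the
descriptions are paraphrased, not copied from `tools.py`:

```python
RETRIEVE_CONTEXT_TOOL = {
    "type": "function",
    "function": {
        "name": "retrieve_context",
        "description": "Fetch the top-k knowledge-base chunks for a query.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Focused retrieval query."},
                "k": {"type": "integer", "description": "Number of chunks to return."},
            },
            "required": ["query"],
        },
    },
}
```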

The system prompt (`src/agents/prompts.py:ORCHESTRATOR_SYSTEM_PROMPT`)
describes the workflow: pick exactly one pipeline → run it → formulate a
focused retrieval query → call retrieve_context → synthesize a 3-5 sentence
response that cites at least one chunk. The API-side workflow guard enforces
that order in code; the prompt is guidance, not the only control plane.
Language of the final response is mirrored from the user's question.

`POST /agent/run` is the public surface. It accepts `user_input`,
optional `user_question`, and optional MRI `sites_csv`. Default model is
`google/gemini-2.0-flash-exp:free` on OpenRouter (function-calling support
verified). Override via `NEUROBRIDGE_AGENT_MODEL` env var. Returns 503 when
`OPENROUTER_API_KEY` is unset.

Diagnostics: `GET /diag/agent` returns key presence, configured model,
RAG index status (chunk count), and the registered tool names.

## 16. RAG Surface

`src/rag/` is the retrieval layer. Stack: `fastembed`
(`BAAI/bge-small-en-v1.5`, 384-dim, ONNX, no torch dep) for
embeddings + `faiss-cpu` (`IndexFlatIP` after L2-norm = cosine) for
vector search.
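
End to end, the retrieval math (L2-normalize, then inner product = cosine)
looks roughly like this sketch — the document strings are placeholders:

```python
import faiss
import numpy as np
from fastembed import TextEmbedding

_model = TextEmbedding("BAAI/bge-small-en-v1.5")  # 384-dim, ONNX, no torch


def _embed(texts: list[str]) -> np.ndarray:
    vecs = np.array(list(_model.embed(texts)), dtype="float32")
    faiss.normalize_L2(vecs)  # unit vectors: inner product == cosine similarity
    return vecs


index = faiss.IndexFlatIP(384)
index.add(_embed(["ComBat removes site effects from multi-center MRI."]))
scores, ids = index.search(_embed(["what does ComBat do?"]), 1)
```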

Knowledge base lives at `data/knowledge_base/` (gitignored;
user-supplied `.md` / `.txt` / `.pdf`). Build the FAISS index with:

    python -m src.rag.ingest [<input_dir> [<output_dir>]]

Defaults: input=`data/knowledge_base/`, output=`data/processed/faiss_index/`.
The Dockerfile runs this at build time so deployed Spaces start with
a populated index. `docker-entrypoint.sh` also rebuilds the index at
startup when a mounted `data/` volume hides the image-built artifacts.
Empty KB → empty index → `retrieve_context` returns 0 chunks; the agent
surfaces this and answers from the pipeline result alone.

`tests/fixtures/kb_sample/` ships 3 seed markdown files (Lipinski,
ComBat, MNE+ICA) — these double as test fixtures and as the demo
seed if no user-supplied PDFs are added.