# AGENTS.md — NeuroBridge Enterprise Pipeline

> Read this file at the start of every session. It is the contract every agent
> (human or LLM) operates under in this repository.

## 1. Project Vision

**NeuroBridge Enterprise** is a B2B SaaS platform that solves three structural problems in real-world clinical/biomedical ML pipelines:

1. **Data Drift** between hospitals and acquisition sites (multi-center MRI).
2. **Missing Modalities** (a patient may have MRI but no EEG, or vice versa).
3. **Artifacts** in raw biosignals (eye blinks, line noise, motion in EEG).

The platform exposes three production pipelines behind a single FastAPI surface:

| Modality | Pipeline | Core Technique |
|---|---|---|
| Image (MRI / fMRI) | `src/pipelines/mri_pipeline.py` | ComBat Harmonization for site-level domain shift |
| Signal (EEG) | `src/pipelines/eeg_pipeline.py` | MNE-Python + ICA for artifact removal |
| Tabular (BBB / molecules) | `src/pipelines/bbb_pipeline.py` | RDKit Morgan fingerprints from SMILES |

All experiment runs are tracked in **MLflow**. All services ship as **Docker** images.

## 2. Directory Layout (load-bearing — do not violate)

```
.
├── AGENTS.md            # This file
├── README.md
├── requirements.txt
├── pytest.ini
├── conftest.py          # Repo-wide pytest fixtures (autouse: pins MLFLOW_TRACKING_URI to tmp dir for test isolation)
├── Dockerfile           # Production image (FastAPI + pipelines)
├── docker-compose.yml   # api + mlflow services for local stack
├── .dockerignore
├── .streamlit/
│   └── config.toml      # Streamlit theme tokens
├── data/
│   ├── raw/             # Untouched source data. NEVER train on this directly.
│   └── processed/       # Pipeline output as Parquet (preserves dtypes; overwritten each run; see §4).
├── src/
│   ├── api/                 # FastAPI surface
│   │   ├── main.py          # App factory + /health
│   │   ├── routes.py        # POST /pipeline/{bbb,eeg,mri} dispatch
│   │   └── schemas.py       # Shared Pydantic request/response models
│   ├── core/                # Cross-cutting utilities
│   │   ├── logger.py        # Structured logger (mandatory in every pipeline)
│   │   ├── determinism.py   # Thread-pin env vars (OMP/OPENBLAS/MKL/pyarrow)
│   │   ├── storage.py       # Parquet read/write helpers (snappy, single-threaded, deterministic)
│   │   └── tracking.py      # MLflow `track_pipeline_run` context manager (see §7)
│   ├── pipelines/           # One file per modality. Pure functions + a `run_pipeline()` entry.
│   ├── models/              # Downstream decision-layer models
│   │   ├── bbb_model.py     # BBB-permeability classifier + SHAP explainer + trainer CLI
│   │   └── mri_model.py     # Volumetric MRI ONNX inference surface (external training)
│   ├── llm/                 # Natural-language explainers (template + OpenRouter fallback)
│   ├── rag/                 # Fastembed + FAISS retrieval layer
│   ├── agents/              # Tool registry + guarded OpenRouter orchestrator
│   └── frontend/
│       └── app.py           # Streamlit dashboard
└── tests/
    ├── core/
    ├── api/
    ├── frontend/
    ├── pipelines/           # incl. test_cross_pipeline_smoke.py for integration coverage
    └── fixtures/            # Tiny synthetic data files used by tests (NOT a Python package — no __init__.py)
```

**Rules:**

- New modality → new file under `src/pipelines/`. No mixing modalities in one file.
- Anything imported by 2+ pipelines → `src/core/`.
- Pipeline code (`src/pipelines/`, `src/core/`) must not read from or write to any path outside `data/`. Test code may read `tests/fixtures/`. The `data/` boundary is the storage contract for production data.
- `tests/fixtures/` holds CSV / numpy / DICOM blobs — do **not** add an `__init__.py` there.

## 3. Coding Standards

- **Python 3.10–3.12** (the pinned native-extension dependencies do not yet ship cp313+ wheels). Use `from __future__ import annotations` when needed for forward refs.
- **Type hints are mandatory** on every public function/method (parameters and return).
- **Modular structure.** One responsibility per function. If a function exceeds ~40 lines or 3 levels of nesting, split it.
- **TDD is the default workflow.** Write the failing test first, watch it fail, then implement. Tests live in `tests/` mirroring `src/`.
- **Logging is mandatory** for every pipeline. Use `src.core.logger.get_logger(__name__)`. No `print()` in `src/`.
- **Docstrings** on every public function — one-line summary + Args/Returns when non-trivial.
- **No hard-coded paths in business logic.** Pass paths as arguments to `run_pipeline(input_path, output_path)`.
- **Format & lint:** keep imports sorted; prefer `pathlib.Path` over `os.path`.
- **Commits are small and frequent.** Each green test → commit.

## 4. Data Readiness Principles

> **The Golden Rule: never train a model directly on raw data. Raw data must always pass through a pipeline first.**

Every modality pipeline MUST guarantee, before writing to `data/processed/`:

1. **Schema validity** — required columns present, expected dtypes.
2. **Domain validity** — invalid records (e.g. unparseable SMILES, NaN-only EEG epochs, corrupted DICOMs) are **logged with their identifier and dropped**, never silently coerced.
3. **Determinism** — given the same `data/raw/` input, the pipeline produces byte-identical `data/processed/` output. No wall-clock, no random seeds without explicit seeding.
4. **Traceability** — log row count in, row count out, and percentage dropped at INFO level.
5. **Idempotence** — re-running the pipeline overwrites `data/processed/` cleanly; no append, no partial writes.

**Determinism environment**: byte-identical output requires deterministic floating-point reductions. Each pipeline module sets `OMP_NUM_THREADS=1`, `OPENBLAS_NUM_THREADS=1`, `MKL_NUM_THREADS=1`, and pins pyarrow to single-threaded mode at import time.
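A minimal sketch of this import-time pinning (the exact contents of `src/core/determinism.py` are assumed, not copied — only the env var names and the single-threaded intent come from this file):

```python
import os

# Pin BLAS thread pools before numpy/pyarrow are imported, so floating-point
# reductions run single-threaded and stay byte-deterministic (§4 rule 3).
# setdefault keeps any value the environment already exported.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ.setdefault(var, "1")

# pyarrow pinning would follow the same pattern (call assumed, shown inert
# here to keep the sketch dependency-free):
# import pyarrow as pa
# pa.set_cpu_count(1)
```

Because this runs at module import time, any code path that imports a pipeline module inherits the pinned environment automatically.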
CI runners and developer machines do not need to set these manually — the pipeline modules handle it — but overriding them in the environment will break Determinism rule 3.

**ComBat determinism boundary**: the MRI pipeline's `harmonize_combat` wraps `neuroHarmonize.harmonizationLearn` and rounds its output to 14 decimal places with `np.round`. This is a defensive measure: with the thread-pinning above, harmonization is already bit-identical, but the rounding guarantees byte-identity even when the env-pin discipline is bypassed (e.g. a sub-process that re-exports a thread count). It discards ~5 trailing-mantissa bits of float64 — well below ComBat's biological effect-size precision floor.

A model training script is allowed to import from `data/processed/` only. If a training script references `data/raw/` directly, that is a bug and must be refactored into a pipeline.

## 5. How to Add a New Pipeline (checklist)

1. Add `tests/pipelines/test__pipeline.py` with the failing tests first.
2. Create `src/pipelines/_pipeline.py` exposing `run_pipeline(input_path: Path, output_path: Path) -> None`.
3. Use `get_logger(__name__)` for all status output (per §3).
4. Apply the §4 Data Readiness contract: validate + drop invalid records with a logged WARNING (identifier + count), log row count in/out/dropped at INFO, write deterministically, and overwrite (do not append) on re-run.
5. Write deterministic output to `output_path`.
6. Document any new dependency in `requirements.txt` (pinned).
7. Add a one-line entry to this file's pipeline table.

## 6. Storage Format Convention

All `data/processed/` outputs MUST be **Parquet** (`pyarrow` engine, `compression="snappy"`):

- Preserves dtypes (uint8 fingerprints stay uint8; float64 EEG features stay float64) — CSV silently widens numeric columns and is unsuitable for the high-dimensional float arrays produced by the EEG and MRI pipelines.
- Byte-deterministic with fixed compression and single-threaded writes (satisfies §4 Determinism).
- Read with `pd.read_parquet(path)`; no dtype hints required.

The raw `data/raw/` inputs may be in any vendor-supplied format (CSV for BBBP, EDF/FIF for EEG, NIfTI for MRI).

## 7. Experiment Tracking

Every `run_pipeline()` invocation logs to MLflow via `src.core.tracking.track_pipeline_run`:

- **Experiment names** match the pipeline module: `bbb_pipeline`, `eeg_pipeline`, `mri_pipeline`.
- **Params**: input/output paths and pipeline hyperparameters (e.g. BBB `n_bits` / `radius`, EEG `epoch_duration_s` / `random_state`, MRI `intensity_threshold` / `n_roi_axes`).
- **Metrics**: row counts (`rows_in`, `rows_out`, `rows_dropped` — or a modality equivalent like `subjects_in/out/dropped`) and `duration_sec`.
- **Artifact**: the produced Parquet at `data/processed/_features.parquet`.

The tracking URI is read from `MLFLOW_TRACKING_URI` (defaults to `./mlruns/` when unset).

**Live-demo lifeline**: set `NEUROBRIDGE_DISABLE_MLFLOW=1` to skip tracking entirely — the helper yields `None` and emits no MLflow calls. Use this when the tracking server is unreachable (offline demo, network outage, or CI without an MLflow service). Pipelines complete normally; only the run metadata is lost.

The repo-wide `conftest.py` autouse fixture pins `MLFLOW_TRACKING_URI` to a tmp directory for the test session, so the production `mlruns/` directory is never written by the test suite. Tests that interact with MLflow (in `tests/core/test_tracking.py` and the per-pipeline `TestPipelineMLflow` classes) all share this isolated store.

## 8. Decision Layer (Downstream Models)

Pipelines produce features (`data/processed/_features.parquet`).
Downstream models live in `src/models/` and consume processed features or a deterministic model-local preprocessing contract:

| Model | File | Output | Endpoint |
|---|---|---|---|
| BBB permeability | `src/models/bbb_model.py` | `data/processed/bbb_model.joblib` | `POST /predict/bbb` |
| MRI image classifier | `src/models/mri_model.py` | `data/processed/mri_model.onnx` | `POST /predict/mri` |

In-repo trainable downstream model modules expose a uniform surface:

- `train(df, label_col, ...)` → fitted classifier
- `save(model, path)` / `load(path)` → joblib artifact I/O
- `predict_with_proba(model, smiles)` → `{label, confidence}` (confidence is the max-class probability)
- `explain_prediction(model, smiles, top_k)` → SHAP top-k attributions sorted by `|shap_value|` descending

MRI DL exception: training happens outside this repo and exports ONNX, so `mri_model` does not expose `train()` or SHAP. Runtime loads the ONNX artifact with `mri_model.load()`, preprocesses one NIfTI via the same deterministic resize + z-score contract used during training (`preprocess_nifti()`), then returns class probabilities via `predict_nifti()`.

The API loads model artifacts at request time. If an artifact is missing, the endpoint returns **HTTP 503** with a remediation hint instead of failing process startup. The BBB hint points at the trainer CLI (`python -m src.models.bbb_model`); the MRI hint points at the external ONNX export path.

**Determinism**: all in-repo classifiers are seeded (`random_state=42` default) and run with `n_jobs=1` (no tree-parallelism races). Re-running the BBB trainer on the same Parquet produces identical predictions. MRI ONNX determinism is bounded by the exported model plus the fixed runtime preprocessing contract.

**Override `BBB_MODEL_PATH`** env var to point the API at a non-default artifact location (used by tests for tmp_path isolation). **Override `MRI_MODEL_PATH`** env var to point the API at a non-default ONNX artifact location.
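A sketch of how such an env override can be resolved. The env var name and the default artifact path come from this section; the helper name `resolve_bbb_model_path` is illustrative, not the repo's actual API:

```python
import os
from pathlib import Path

# Default artifact location from the table above.
DEFAULT_BBB_MODEL = Path("data/processed/bbb_model.joblib")

def resolve_bbb_model_path() -> Path:
    """Resolve the BBB model artifact path, honouring the BBB_MODEL_PATH override."""
    return Path(os.environ.get("BBB_MODEL_PATH", str(DEFAULT_BBB_MODEL)))
```

Tests point `BBB_MODEL_PATH` at a `tmp_path` artifact; production leaves it unset and gets the default.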
If the ONNX artifact is missing, `POST /predict/mri` returns **HTTP 503** with a remediation hint.

**Calibration metadata** (Day 6): `train()` does an 80/20 stratified split, computes precision-at-confidence-threshold bins on the held-out test set, and stashes them on `model._neurobridge_calibration: list[dict]` (sorted ascending by threshold). The API includes the bin matching each prediction's confidence in `BBBPredictResponse.calibration`. The UI uses this to render an honest trust caption ("≥75% confident → 92% precision, n=18"). For tiny test fixtures where the stratified split fails, calibration falls back to zero-support bins so the API contract is always populated.

## 9. Demo Features (Day 6)

The frontend includes three jury-day demo amplifiers that don't change the core contract:

- **Edge-case dropdown** (BBB tab): a curated catalog of 5 robustness probes, including invalid SMILES, empty input, an OOD macrocycle (cyclosporine-like), and a heavy halogenated aromatic. Each has a stated expectation; the UI visualizes graceful failure (HTTP 400 → recoverable warning, never a crash).
- **Calibration trust caption** (BBB decision card): renders the precision-at-confidence-threshold from `BBBPredictResponse.calibration`. Demonstrates that the system knows what it doesn't know.
- **MRI ComBat diagnostics** (MRI tab): `POST /pipeline/mri/diagnostics` runs the pipeline twice (pre + post ComBat) and returns long-format data + site-gap KPIs (Pre, Post, Reduction factor). The UI renders a faceted Altair density plot — visual proof that ComBat removes site-driven domain shift.

## 10. Drift Surface (Day 7)

Each predict route maintains a per-worker rolling window of recent prediction confidences (`collections.deque(maxlen=100)`). Train-time median + std are stashed on `model._neurobridge_train_stats` (joblib roundtrip-safe). The drift z-score is `(rolling_median − train_median) / max(train_std, 1e-9)`, computed only when the buffer holds ≥10 samples AND the model has the train-stats attribute.
The `/predict/bbb` response carries `drift_z: float | None` and `rolling_n: int`. The UI renders a one-line caption with a magnitude tag (in-band, mild, significant). Worker restart clears the deque; this is acceptable for a demo and removes the audit-trail concern.

## 11. LLM Explainer Surface (Day 7 + 9)

`src/llm/explainer.py` is the single entry point for natural-language rationales. `explain(payload)` always returns `{rationale, source, model}`. The deterministic template path is the source of truth for tests; the LLM path is OpenRouter via the `openai==1.51.0` SDK and walks a **smartest → smallest free-tier fallback chain** (`_DEFAULT_FREE_MODEL_CHAIN`, 10 ids — head: `inclusionai/ling-2.6-1t:free`). The chain is overridable at runtime via `OPENROUTER_FREE_MODELS` (comma-separated).

Status-code classification:

- `401` → key is bad → bail to template + actionable WARNING (rotate at https://openrouter.ai/keys, enable free-model data-sharing at https://openrouter.ai/settings/privacy).
- `400` → prompt-shape mismatch on this model → advance to the next model.
- `402 / 403 / 404 / 429 / 5xx` → advance to the next model.
- Network/timeout → bail to template (switching models won't help).

Two env knobs control the gate:

- `OPENROUTER_API_KEY` — when absent, fall back to template.
- `NEUROBRIDGE_DISABLE_LLM=1` — hard kill-switch; forces template even if a key is set. Use this for demo days when you want fully deterministic, reproducible rationales.

**Prompt design** (`_build_llm_prompt`): two intent modes. When the caller supplies `user_question`, the model is instructed to language-match (Turkish question → Turkish answer), answer the question directly (not a canned paper-style summary), and respond conversationally to off-topic / greeting questions. When no `user_question` is supplied, it falls back to the original 2-4 sentence paper-style rationale.

The `POST /explain/bbb` endpoint mirrors this contract.
Pydantic enforces a non-empty `top_features` list (422 on empty); every other failure mode degrades to template + WARNING log + `source="template"`.

**Diagnostics**: `GET /diag/openrouter` (`src/api/main.py`) returns key presence (length + 12-char prefix only), kill-switch state, chain length, first model id, and the result of an 8-token probe call against that model. Surfaced in Streamlit as the sidebar "🔧 Diagnose LLM" button. Use it when the deployed Space shows `source="template"` unexpectedly — the most common causes are a missing/misnamed `OPENROUTER_API_KEY` Space secret or a revoked key.

## 12. Multi-Modal Explainer (Day 8)

`src/llm/explainer.py` exposes `explain(payload, modality)` where `modality ∈ {"bbb", "eeg", "mri"}`. Each modality has its own deterministic template (`_template_explain_bbb / _eeg / _mri`) and its own LLM prompt header. Unknown modality strings degrade to the BBB template with a warning log; the function never raises. The hybrid OpenRouter fallback contract from §11 applies uniformly.

The API exposes three matching endpoints — `POST /explain/{bbb,eeg,mri}` — each on the `explain_router` (`/explain` prefix). Streamlit surfaces the BBB version in the AI Assistant tab and the EEG/MRI versions as inline expanders inside their respective pipeline tabs.

## 13. Experiments Surface (Day 8)

`GET /experiments/runs` returns up to the 50 most recent MLflow runs across the bbb/eeg/mri experiments, flattened into a list of `MLflowRunSummary` (run_id, experiment_name, start_time, status, metrics, params). `POST /experiments/diff {run_id_a, run_id_b}` returns a side-by-side metric + param diff (`RunDiffRow`).

When `NEUROBRIDGE_DISABLE_MLFLOW=1`, both endpoints return empty responses without raising — useful for deployments where there is no writable `mlruns/` tree or the tracking server is unavailable. Unknown run ids → 404.

The Streamlit "Experiments" tab is the user-facing surface. Cached in session state with an explicit Refresh button.

## 14. Deploy Surface (Day 8)

`Dockerfile.hf` is the Hugging Face Spaces image. Single container, two processes (FastAPI :8000 + Streamlit :7860) launched via `supervisord.conf`. Build-time `RUN python -m src.models.bbb_model` bakes the BBB model artifact into the image so the first `/predict/bbb` call is instant on cold start. Build-time RAG ingest creates `data/processed/faiss_index/`.

`docker-entrypoint.sh` is the runtime guard for local Docker/Compose demos: when a mounted `./data` volume hides image-built artifacts, it seeds fixture raw data, rebuilds missing BBB features/model artifacts, and rebuilds the FAISS index before starting supervisord. It does not bake `NEUROBRIDGE_DISABLE_MLFLOW=1` into the image; operators may set that env at runtime if their tracking service is unavailable.

Default environment: `DEPLOY_ENV=hf_spaces`. The LLM kill-switch is **not** set — deployed Spaces use the real OpenRouter free-tier chain (§11) when `OPENROUTER_API_KEY` is configured in the Space's Secrets panel. Set `NEUROBRIDGE_DISABLE_LLM=1` only when you want to force the deterministic template path for a fully reproducible demo.

The README's YAML front-matter declares the Space metadata (SDK=docker, port=7860, app_file=src/frontend/app.py).

## 15. Orchestrator Agent Surface

`src/agents/orchestrator.py` exposes a single-agent function-calling loop over the openai SDK (no LangChain / framework dep). The API enables the guarded workflow mode: if the LLM skips or mis-shapes a required tool call, deterministic routing in `src/agents/routing.py` falls back to exactly one pipeline tool, then exactly one retrieval tool, then final synthesis.
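A minimal sketch of that ordering guarantee, assuming the tool names listed below; the function name, the fallback choice of `run_bbb_pipeline`, and the coercion logic are illustrative, not the actual `src/agents/routing.py` behavior:

```python
PIPELINE_TOOLS = {"run_bbb_pipeline", "run_eeg_pipeline", "run_mri_pipeline"}

def guard_workflow(requested: list[str]) -> list[str]:
    """Coerce an arbitrary LLM tool-call sequence into: one pipeline, then one retrieval.

    If the LLM skipped the pipeline call entirely, a deterministic default is
    substituted (which default the repo picks is routing logic, assumed here).
    """
    pipeline = next((t for t in requested if t in PIPELINE_TOOLS),
                    "run_bbb_pipeline")      # illustrative deterministic fallback
    return [pipeline, "retrieve_context"]    # final synthesis follows the tool phase
```

The point of the guard is that the enforced order is code, not prompt text: whatever the model emits, exactly one pipeline tool runs before exactly one retrieval tool.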
The agent holds 4 tools, defined in `src/agents/tools.py`:

- `run_bbb_pipeline(smiles, top_k)` — wraps `POST /predict/bbb`
- `run_eeg_pipeline(input_path)` — wraps `POST /pipeline/eeg`
- `run_mri_pipeline(input_dir, sites_csv=None)` — wraps `POST /pipeline/mri` and defaults `sites_csv` to `/sites.csv`
- `retrieve_context(query, k)` — wraps `src/rag/retrieve.py`

The system prompt (`src/agents/prompts.py:ORCHESTRATOR_SYSTEM_PROMPT`) describes the workflow: pick exactly one pipeline → run it → formulate a focused retrieval query → call `retrieve_context` → synthesize a 3-5 sentence response that cites at least one chunk. The API-side workflow guard enforces that order in code; the prompt is guidance, not the only control plane. The final response mirrors the language of the user's question.

`POST /agent/run` is the public surface. It accepts `user_input`, an optional `user_question`, and an optional MRI `sites_csv`. The default model is `google/gemini-2.0-flash-exp:free` on OpenRouter (function-calling support verified). Override via the `NEUROBRIDGE_AGENT_MODEL` env var. Returns 503 when `OPENROUTER_API_KEY` is unset.

Diagnostics: `GET /diag/agent` returns key presence, the configured model, RAG index status (chunk count), and the registered tool names.

## 16. RAG Surface

`src/rag/` is the retrieval layer. Stack: `fastembed` (`BAAI/bge-small-en-v1.5`, 384-dim, ONNX, no torch dep) for embeddings + `faiss-cpu` (`IndexFlatIP` after L2-norm = cosine) for vector search. The knowledge base lives at `data/knowledge_base/` (gitignored; user-supplied `.md` / `.txt` / `.pdf`).

Build the FAISS index with `python -m src.rag.ingest [<input_dir> [<output_dir>]]`. Defaults: input=`data/knowledge_base/`, output=`data/processed/faiss_index/`. The Dockerfile runs this at build time so deployed Spaces start with a populated index. `docker-entrypoint.sh` also rebuilds the index at startup when a mounted `data/` volume hides the image-built artifacts.
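The "`IndexFlatIP` after L2-norm = cosine" identity is why ingest normalizes every embedding before adding it to the index. A dependency-free sketch of the identity (pure Python, no faiss or numpy; the helper names are illustrative):

```python
import math

def l2_normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit L2 length (zero vectors pass through unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def inner_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

a, b = [3.0, 4.0], [4.0, 3.0]
# Inner product of the L2-normalized vectors IS the cosine similarity,
# so a plain inner-product index ranks by cosine once vectors are normalized.
cos = inner_product(l2_normalize(a), l2_normalize(b))   # 24/25 = 0.96
```

The same normalization must be applied to the query embedding at search time, or the ranking silently stops being cosine.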
Empty KB → empty index → `retrieve_context` returns 0 chunks; the agent surfaces this and answers from the pipeline result alone.

`tests/fixtures/kb_sample/` ships 3 seed markdown files (Lipinski, ComBat, MNE+ICA) — these double as test fixtures and as the demo seed if no user-supplied PDFs are added.
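The empty-index degradation path can be sketched as follows. The real `src/rag/retrieve.py` and agent signatures are not shown here; these names, the chunk-list representation, and the truncation stand-in for ranked search are all illustrative:

```python
def retrieve_context(query: str, k: int, index_chunks: list[str]) -> list[str]:
    """Return up to k chunks; an empty index yields zero chunks, never an error."""
    if not index_chunks:              # empty KB → empty index → 0 chunks
        return []
    return index_chunks[:k]           # real impl ranks by cosine; truncation stands in

def synthesize(pipeline_result: str, chunks: list[str]) -> str:
    """Agent-side degradation: with no chunks, answer from the pipeline result alone."""
    if not chunks:
        return f"{pipeline_result} (no knowledge-base context available)"
    return f"{pipeline_result} — grounded in {len(chunks)} retrieved chunk(s)"
```

The contract being illustrated is the important part: retrieval failure is a valid empty result, and the agent must still produce an answer from the pipeline output.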