# AGENTS.md — NeuroBridge Enterprise Pipeline
> Read this file at the start of every session. It is the contract every agent
> (human or LLM) operates under in this repository.
## 1. Project Vision
**NeuroBridge Enterprise** is a B2B SaaS platform that solves three structural
problems in real-world clinical/biomedical ML pipelines:
1. **Data Drift** between hospitals and acquisition sites (multi-center MRI).
2. **Missing Modalities** (a patient may have MRI but no EEG, or vice versa).
3. **Artifacts** in raw biosignals (eye blinks, line noise, motion in EEG).
The platform exposes three production pipelines behind a single FastAPI surface:
| Modality | Pipeline | Core Technique |
|---|---|---|
| Image (MRI / fMRI) | `src/pipelines/mri_pipeline.py` | ComBat Harmonization for site-level domain shift |
| Signal (EEG) | `src/pipelines/eeg_pipeline.py` | MNE-Python + ICA for artifact removal |
| Tabular (BBB / molecules) | `src/pipelines/bbb_pipeline.py` | RDKit Morgan fingerprints from SMILES |
All experiment runs are tracked in **MLflow**. All services ship as **Docker** images.
## 2. Directory Layout (load-bearing — do not violate)
```
.
├── AGENTS.md # This file
├── README.md
├── requirements.txt
├── pytest.ini
├── conftest.py # Repo-wide pytest fixtures (autouse: pins MLFLOW_TRACKING_URI to tmp dir for test isolation)
├── Dockerfile # Production image (FastAPI + pipelines)
├── docker-compose.yml # api + mlflow services for local stack
├── .dockerignore
├── .streamlit/
│ └── config.toml # Streamlit theme tokens
├── data/
│ ├── raw/ # Untouched source data. NEVER train on this directly.
│ └── processed/ # Pipeline output as Parquet (preserves dtypes; overwritten each run; see §4).
├── src/
│ ├── api/ # FastAPI surface
│ │ ├── main.py # App factory + /health
│ │ ├── routes.py # POST /pipeline/{bbb,eeg,mri} dispatch
│ │ └── schemas.py # Shared Pydantic request/response models
│ ├── core/ # Cross-cutting utilities
│ │ ├── logger.py # Structured logger (mandatory in every pipeline)
│ │ ├── determinism.py # Thread-pin env vars (OMP/OPENBLAS/MKL/pyarrow)
│ │ ├── storage.py # Parquet read/write helpers (snappy, single-threaded, deterministic)
│ │ └── tracking.py # MLflow `track_pipeline_run` context manager (see §7)
│ ├── pipelines/ # One file per modality. Pure functions + a `run_pipeline()` entry.
│ ├── models/ # Downstream decision-layer models
│ │ ├── bbb_model.py # BBB-permeability classifier + SHAP explainer + trainer CLI
│ │ └── mri_model.py # Volumetric MRI ONNX inference surface (external training)
│ ├── llm/ # Natural-language explainers (template + OpenRouter fallback)
│ ├── rag/ # Fastembed + FAISS retrieval layer
│ ├── agents/ # Tool registry + guarded OpenRouter orchestrator
│ └── frontend/
│ └── app.py # Streamlit dashboard
└── tests/
├── core/
├── api/
├── frontend/
├── pipelines/ # incl. test_cross_pipeline_smoke.py for integration coverage
└── fixtures/ # Tiny synthetic data files used by tests (NOT a Python package — no __init__.py)
```
**Rules:**
- New modality → new file under `src/pipelines/`. No mixing modalities in one file.
- Anything imported by 2+ pipelines → `src/core/`.
- Pipeline code (`src/pipelines/`, `src/core/`) must not read from or write to any path outside `data/`. Test code may read `tests/fixtures/`. The `data/` boundary is the storage contract for production data.
- `tests/fixtures/` holds CSV / numpy / DICOM blobs — do **not** add an `__init__.py` there.
## 3. Coding Standards
- **Python 3.10–3.12** (the pinned native-extension dependencies do not yet ship cp313+ wheels). Use `from __future__ import annotations` when needed for forward refs.
- **Type hints are mandatory** on every public function/method (parameters and return).
- **Modular structure.** One responsibility per function. If a function exceeds ~40 lines or 3 levels of nesting, split it.
- **TDD is the default workflow.** Write the failing test first, watch it fail, then implement. Tests live in `tests/` mirroring `src/`.
- **Logging is mandatory** for every pipeline. Use `src.core.logger.get_logger(__name__)`. No `print()` in `src/`.
- **Docstrings** on every public function — one-line summary + Args/Returns when non-trivial.
- **No hard-coded paths in business logic.** Pass paths as arguments to `run_pipeline(input_path, output_path)`.
- **Format & lint:** keep imports sorted; prefer `pathlib.Path` over `os.path`.
- **Commits are small and frequent.** Each green test → commit.
## 4. Data Readiness Principles
> **The Golden Rule: never train a model directly on raw data. Raw data must always pass through a pipeline first.**
Every modality pipeline MUST guarantee, before writing to `data/processed/`:
1. **Schema validity** — required columns present, expected dtypes.
2. **Domain validity** — invalid records (e.g. unparseable SMILES, NaN-only EEG epochs, corrupted DICOMs) are **logged with their identifier and dropped**, never silently coerced.
3. **Determinism** — given the same `data/raw/` input, the pipeline produces byte-identical `data/processed/` output. No wall-clock, no random seeds without explicit seeding.
4. **Traceability** — log row count in, row count out, and percentage dropped at INFO level.
5. **Idempotence** — re-running the pipeline overwrites `data/processed/` cleanly; no append, no partial writes.
**Determinism environment**: byte-identical output requires deterministic
floating-point reductions. Each pipeline module sets `OMP_NUM_THREADS=1`,
`OPENBLAS_NUM_THREADS=1`, `MKL_NUM_THREADS=1`, and pins pyarrow to
single-threaded mode at import time. CI runners and developer machines do
not need to set these manually — the pipeline modules handle it — but
overriding them in the environment will break Determinism rule 3.
**ComBat determinism boundary**: the MRI pipeline's `harmonize_combat` wraps
`neuroHarmonize.harmonizationLearn` and rounds its output to 14 decimal
places via `np.round`. This is a defensive measure: with the thread-pinning
above, harmonization is already bit-identical, but the rounding guarantees
byte-identity even when the env-pin discipline is bypassed (e.g. a
sub-process that re-exports a thread count). It discards ~5
trailing-mantissa bits of float64 — well below ComBat's biological
effect-size precision floor.
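A two-line illustration of why the defensive rounding restores byte-identity: sub-ULP-scale noise from non-deterministic float reductions collapses to the same value at 14 decimal places.

```python
# Illustrative only: rounding to 14 decimal places collapses tiny
# float-reduction noise into byte-identical values.
import numpy as np

a = 0.1 + 0.2  # classic float noise: 0.30000000000000004
b = 0.3
assert a != b                              # raw values differ
assert np.round(a, 14) == np.round(b, 14)  # rounded values are identical
```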
A model training script is allowed to import from `data/processed/` only. If a
training script references `data/raw/` directly, that is a bug and must be
refactored into a pipeline.
## 5. How to Add a New Pipeline (checklist)
1. Add `tests/pipelines/test_<name>_pipeline.py` with the failing tests first.
2. Create `src/pipelines/<name>_pipeline.py` exposing `run_pipeline(input_path: Path, output_path: Path) -> None`.
3. Use `get_logger(__name__)` for all status output (per §3).
4. Apply the §4 Data Readiness contract: validate + drop invalid records with a logged WARNING (identifier + count), log row count in/out/dropped at INFO, write deterministically, and overwrite (do not append) on re-run.
5. Write deterministic output to `output_path`.
6. Document any new dependency in `requirements.txt` (pinned).
7. Add a one-line entry to this file's pipeline table.
## 6. Storage Format Convention
All `data/processed/` outputs MUST be **Parquet** (`pyarrow` engine, `compression="snappy"`):
- Preserves dtypes (uint8 fingerprints stay uint8; float64 EEG features stay float64) — CSV silently widens numeric columns and is unsuitable for the high-dimensional float arrays produced by the EEG and MRI pipelines.
- Byte-deterministic with fixed compression and single-threaded writes (satisfies §4 Determinism).
- Read with `pd.read_parquet(path)`; no dtype hints required.
The raw `data/raw/` inputs may be in any vendor-supplied format (CSV for BBBP, EDF/FIF for EEG, NIfTI for MRI).
## 7. Experiment Tracking
Every `run_pipeline()` invocation logs to MLflow via `src.core.tracking.track_pipeline_run`:
- **Experiment names** match the pipeline module: `bbb_pipeline`, `eeg_pipeline`, `mri_pipeline`.
- **Params**: input/output paths and pipeline hyperparameters (e.g. BBB `n_bits` / `radius`, EEG `epoch_duration_s` / `random_state`, MRI `intensity_threshold` / `n_roi_axes`).
- **Metrics**: row counts (`rows_in`, `rows_out`, `rows_dropped` — or modality equivalent like `subjects_in/out/dropped`) and `duration_sec`.
- **Artifact**: the produced Parquet at `data/processed/<modality>_features.parquet`.
The tracking URI is read from `MLFLOW_TRACKING_URI` (defaults to `./mlruns/` when unset).
**Live-demo lifeline**: set `NEUROBRIDGE_DISABLE_MLFLOW=1` to skip tracking entirely — the helper yields `None` and emits no MLflow calls. Use this when the tracking server is unreachable (offline demo, network outage, or CI without an MLflow service). Pipelines complete normally; only the run metadata is lost.
The repo-wide `conftest.py` autouse fixture pins `MLFLOW_TRACKING_URI` to a tmp directory for the test session, so the production `mlruns/` directory is never written by the test suite. Tests that interact with MLflow (in `tests/core/test_tracking.py` and the per-pipeline `Test<Modality>PipelineMLflow` classes) all share this isolated store.
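The exact signature of `track_pipeline_run` is not reproduced here; the stand-in below only mirrors the documented shape of the contract (experiment name, params up front, metrics including `duration_sec` recorded on exit):

```python
# Stand-in with the same shape as src.core.tracking.track_pipeline_run;
# the real helper logs to MLflow, this sketch just collects a dict.
import time
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def track_pipeline_run(experiment: str, params: dict) -> Iterator[dict]:
    run = {"experiment": experiment, "params": params, "metrics": {}}
    start = time.perf_counter()
    try:
        yield run
    finally:
        # duration_sec is always recorded, even on failure
        run["metrics"]["duration_sec"] = time.perf_counter() - start


with track_pipeline_run("bbb_pipeline", {"n_bits": 2048, "radius": 2}) as run:
    run["metrics"]["rows_in"] = 100
    run["metrics"]["rows_out"] = 97
```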
## 8. Decision Layer (Downstream Models)
Pipelines produce features (`data/processed/<modality>_features.parquet`).
Downstream models live in `src/models/` and consume processed features or a
deterministic model-local preprocessing contract:
| Model | File | Output | Endpoint |
|---|---|---|---|
| BBB permeability | `src/models/bbb_model.py` | `data/processed/bbb_model.joblib` | `POST /predict/bbb` |
| MRI image classifier | `src/models/mri_model.py` | `data/processed/mri_model.onnx` | `POST /predict/mri` |
In-repo trainable downstream model modules expose a uniform surface:
- `train(df, label_col, ...)` → fitted classifier
- `save(model, path)` / `load(path)` → joblib artifact I/O
- `predict_with_proba(model, smiles)` → `{label, confidence}` (confidence is the max-class probability)
- `explain_prediction(model, smiles, top_k)` → SHAP top-k attributions sorted by `|shap_value|` descending
MRI DL exception: training happens outside this repo and exports ONNX, so it
does not expose `train()` or SHAP. Runtime
loads the ONNX artifact with `mri_model.load()`, preprocesses one NIfTI via the
same deterministic resize + z-score contract used during training
(`preprocess_nifti()`), then returns class probabilities via `predict_nifti()`.
The API loads model artifacts at request time. If an artifact is missing,
the endpoint returns **HTTP 503** with a remediation hint instead of failing
process startup. BBB points at the trainer CLI (`python -m src.models.bbb_model`);
MRI points at the external ONNX export path.
**Determinism**: all in-repo classifiers are seeded (`random_state=42`
default), `n_jobs=1` (no tree-parallelism races). Re-running the BBB trainer
on the same Parquet produces identical predictions. MRI ONNX determinism is
bounded by the exported model plus the fixed runtime preprocessing contract.
**Override `BBB_MODEL_PATH`** env var to point the API at a non-default
artifact location (used by tests for tmp_path isolation).
**Override `MRI_MODEL_PATH`** env var to point the API at a non-default ONNX
artifact location. If the ONNX artifact is missing, `POST /predict/mri`
returns **HTTP 503** with a remediation hint.
**Calibration metadata** (Day 6): `train()` does an 80/20 stratified split,
computes precision-at-confidence-threshold bins on the held-out test set,
and stashes them on `model._neurobridge_calibration: list[dict]` (sorted
ascending by threshold). The API includes the bin matching each
prediction's confidence in `BBBPredictResponse.calibration`. UI uses this
to render an honest trust caption ("≥75% confident → 92% precision, n=18").
For tiny test fixtures where stratified split fails, calibration falls
back to zero-support bins so the API contract is always populated.
## 9. Demo Features (Day 6)
The frontend includes three jury-day demo amplifiers that don't change
the core contract:
- **Edge-case dropdown** (BBB tab): a curated catalog of 5 robustness
probes, e.g. invalid SMILES, empty input, an OOD macrocycle
(cyclosporine-like), and a heavy halogenated aromatic. Each has a
stated expectation; the UI
visualizes graceful failure (HTTP 400 → recoverable warning, never
a crash).
- **Calibration trust caption** (BBB decision card): renders the
precision-at-confidence-threshold from `BBBPredictResponse.calibration`.
Demonstrates that the system knows what it doesn't know.
- **MRI ComBat diagnostics** (MRI tab): `POST /pipeline/mri/diagnostics`
runs the pipeline twice (pre + post ComBat) and returns long-format
data + site-gap KPIs (Pre, Post, Reduction factor). The UI renders
a faceted altair density plot — visual proof that ComBat removes
site-driven domain shift.
## 10. Drift Surface (Day 7)
Each predict route maintains a per-worker rolling window of recent
prediction confidences (`collections.deque(maxlen=100)`). Train-time
median + std are stashed on `model._neurobridge_train_stats` (joblib
roundtrip-safe). The drift z-score is `(rolling_median − train_median) /
max(train_std, 1e-9)`, computed only when the buffer holds ≥10 samples
AND the model has the train-stats attribute. The `/predict/bbb`
response carries `drift_z: float | None` and `rolling_n: int`. The UI
renders a one-line caption with a magnitude tag (in-band, mild,
significant). Worker restart clears the deque; this is acceptable for
demo and removes the audit-trail concern.
## 11. LLM Explainer Surface (Day 7 + 9)
`src/llm/explainer.py` is the single entry point for natural-language
rationales. `explain(payload)` always returns `{rationale, source,
model}`. The deterministic template path is the source of truth for
tests; the LLM path is OpenRouter via the `openai==1.51.0` SDK and
walks a **smartest → smallest free-tier fallback chain**
(`_DEFAULT_FREE_MODEL_CHAIN`, 10 ids — head: `inclusionai/ling-2.6-1t:free`).
The chain is overridable at runtime via `OPENROUTER_FREE_MODELS`
(comma-separated). Status-code classification:
- `401` → key is bad → bail to template + actionable WARNING (rotate at
https://openrouter.ai/keys, enable free-model data-sharing at
https://openrouter.ai/settings/privacy).
- `400` → prompt-shape mismatch on this model → advance to next.
- `402 / 403 / 404 / 429 / 5xx` → advance to next.
- Network/timeout → bail to template (switching models won't help).
Two env knobs control the gate:
- `OPENROUTER_API_KEY` — when absent, fallback to template.
- `NEUROBRIDGE_DISABLE_LLM=1` — hard kill-switch; force template even
if a key is set. Use this for demo days when you want fully
deterministic, reproducible rationales.
**Prompt design** (`_build_llm_prompt`): two intent modes. When the
caller supplies `user_question`, the model is instructed to
language-match (Turkish question → Turkish answer), answer the
question directly (not a canned paper-style summary), and respond
conversationally to off-topic / greeting questions. When no
`user_question` is supplied, the prompt falls back to the original
2-4 sentence paper-style rationale.
The `POST /explain/bbb` endpoint mirrors this contract. Pydantic
enforces a non-empty `top_features` list (422 on empty); every other
failure mode degrades to template + WARNING log + `source="template"`.
**Diagnostics**: `GET /diag/openrouter` (`src/api/main.py`) returns
key-presence (length + 12-char prefix only), kill-switch state, chain
length, first model id, and the result of an 8-token probe call
against that model. Surfaced in Streamlit as the sidebar "🔧 Diagnose
LLM" button. Use it when the deployed Space shows `source="template"`
unexpectedly — the most common causes are a missing/misnamed
`OPENROUTER_API_KEY` Space secret or a revoked key.
## 12. Multi-Modal Explainer (Day 8)
`src/llm/explainer.py` exposes `explain(payload, modality)` where
`modality ∈ {"bbb", "eeg", "mri"}`. Each modality has its own
deterministic template (`_template_explain_bbb / _eeg / _mri`) and
its own LLM prompt header. Unknown modality strings degrade to the
BBB template with a warning log; the function never raises. The
hybrid OpenRouter fallback contract from §11 applies uniformly.
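A hedged sketch of the dispatch contract — template bodies and the internal dict are placeholders; the never-raises degradation for unknown modalities follows the text above:

```python
# Sketch of the §12 dispatch: unknown modalities degrade to the BBB
# template with a warning, and explain() never raises.
import logging

logger = logging.getLogger(__name__)


def _template_explain_bbb(payload: dict) -> str:
    return f"BBB rationale for {payload.get('smiles', '?')}"


_TEMPLATES = {
    "bbb": _template_explain_bbb,
    "eeg": lambda p: "EEG rationale",   # placeholder bodies
    "mri": lambda p: "MRI rationale",
}


def explain(payload: dict, modality: str = "bbb") -> dict:
    template = _TEMPLATES.get(modality)
    if template is None:
        logger.warning("unknown modality %r, degrading to bbb template", modality)
        template = _template_explain_bbb
    return {"rationale": template(payload), "source": "template", "model": None}


assert explain({}, "sonar")["source"] == "template"  # unknown → bbb, no raise
```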
The API exposes three matching endpoints — `POST /explain/{bbb,eeg,mri}`
each on the `explain_router` (`/explain` prefix). Streamlit surfaces
the BBB version in the AI Assistant tab and the EEG/MRI versions as
inline expanders inside their respective pipeline tabs.
## 13. Experiments Surface (Day 8)
`GET /experiments/runs` returns up to 50 most recent MLflow runs
across the bbb/eeg/mri experiments, flattened into a list of
`MLflowRunSummary` (run_id, experiment_name, start_time, status,
metrics, params). `POST /experiments/diff {run_id_a, run_id_b}`
returns a side-by-side metric+param diff (`RunDiffRow`).
When `NEUROBRIDGE_DISABLE_MLFLOW=1`, both endpoints return empty
responses without raising — useful for deployments where there is no
writable `mlruns/` tree or the tracking server is unavailable. Unknown
run ids → 404.
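The diff itself is a straightforward key-by-key comparison; `RunDiffRow` is modeled as a plain dict in this illustrative sketch:

```python
# Illustrative side-by-side diff over two runs' flattened metric+param maps.
def diff_runs(run_a: dict, run_b: dict) -> list[dict]:
    keys = sorted(set(run_a) | set(run_b))
    return [
        {"key": k, "a": run_a.get(k), "b": run_b.get(k),
         "changed": run_a.get(k) != run_b.get(k)}
        for k in keys
    ]


rows = diff_runs({"rows_in": 100, "n_bits": 2048},
                 {"rows_in": 100, "n_bits": 1024})
assert [r["key"] for r in rows if r["changed"]] == ["n_bits"]
```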
The Streamlit "Experiments" tab is the user-facing surface. Cached
in session state with an explicit Refresh button.
## 14. Deploy Surface (Day 8)
`Dockerfile.hf` is the Hugging Face Spaces image. Single container,
two processes (FastAPI :8000 + Streamlit :7860) launched via
`supervisord.conf`. Build-time `RUN python -m src.models.bbb_model`
bakes the BBB model artifact into the image so the first `/predict/bbb`
call is instant on cold start. Build-time RAG ingest creates
`data/processed/faiss_index/`.
`docker-entrypoint.sh` is the runtime guard for local Docker/Compose demos:
when a mounted `./data` volume hides image-built artifacts, it seeds fixture
raw data, rebuilds missing BBB features/model artifacts, and rebuilds the
FAISS index before starting supervisord. It does not bake
`NEUROBRIDGE_DISABLE_MLFLOW=1` into the image; operators may set that env at
runtime if their tracking service is unavailable.
Default environment: `DEPLOY_ENV=hf_spaces`. The LLM kill-switch is **not**
set — deployed Spaces use the real OpenRouter free-tier chain (§11) when
`OPENROUTER_API_KEY` is configured in the Space's Secrets panel. Set
`NEUROBRIDGE_DISABLE_LLM=1` only when you want to force the deterministic
template path for a fully-reproducible demo.
The README's YAML front-matter declares the Space metadata
(SDK=docker, port=7860, app_file=src/frontend/app.py).
## 15. Orchestrator Agent Surface
`src/agents/orchestrator.py` exposes a single-agent function-calling
loop over the openai SDK (no LangChain / framework dep). The API enables
the guarded workflow mode: if the LLM skips or mis-shapes a required tool
call, deterministic routing in `src/agents/routing.py` falls back to exactly
one pipeline tool, then exactly one retrieval tool, then final synthesis.
The agent holds 4 tools, defined in `src/agents/tools.py`:
- `run_bbb_pipeline(smiles, top_k)` — wraps `POST /predict/bbb`
- `run_eeg_pipeline(input_path)` — wraps `POST /pipeline/eeg`
- `run_mri_pipeline(input_dir, sites_csv=None)` — wraps `POST /pipeline/mri`
and defaults `sites_csv` to `<input_dir>/sites.csv`
- `retrieve_context(query, k)` — wraps `src/rag/retrieve.py`
The system prompt (`src/agents/prompts.py:ORCHESTRATOR_SYSTEM_PROMPT`)
describes the workflow: pick exactly one pipeline → run it → formulate a
focused retrieval query → call retrieve_context → synthesize a 3-5 sentence
response that cites at least one chunk. The API-side workflow guard enforces
that order in code; the prompt is guidance, not the only control plane.
Language of the final response is mirrored from the user's question.
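The code-side guard can be sketched as follows — `enforce_workflow` is an illustrative stand-in for the deterministic routing in `src/agents/routing.py`, not its actual implementation:

```python
# Sketch of the workflow guard: whatever the LLM emitted, the executed
# sequence is exactly one pipeline tool then exactly one retrieval tool.
PIPELINE_TOOLS = {"run_bbb_pipeline", "run_eeg_pipeline", "run_mri_pipeline"}


def enforce_workflow(llm_calls: list[str], default_pipeline: str) -> list[str]:
    """Keep the LLM's pipeline choice if valid, else fall back to the default."""
    pipeline = next((c for c in llm_calls if c in PIPELINE_TOOLS), default_pipeline)
    return [pipeline, "retrieve_context"]


# LLM skipped tools entirely → deterministic default sequence
assert enforce_workflow([], "run_bbb_pipeline") == [
    "run_bbb_pipeline", "retrieve_context"]
# LLM mis-ordered its calls → guard re-imposes pipeline-then-retrieval
assert enforce_workflow(["retrieve_context", "run_eeg_pipeline"],
                        "run_bbb_pipeline") == [
    "run_eeg_pipeline", "retrieve_context"]
```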
`POST /agent/run` is the public surface. It accepts `user_input`,
optional `user_question`, and optional MRI `sites_csv`. Default model is
`google/gemini-2.0-flash-exp:free` on OpenRouter (function-calling support
verified). Override via `NEUROBRIDGE_AGENT_MODEL` env var. Returns 503 when
`OPENROUTER_API_KEY` is unset.
Diagnostics: `GET /diag/agent` returns key presence, configured model,
RAG index status (chunk count), and the registered tool names.
## 16. RAG Surface
`src/rag/` is the retrieval layer. Stack: `fastembed`
(`BAAI/bge-small-en-v1.5`, 384-dim, ONNX, no torch dep) for
embeddings + `faiss-cpu` (`IndexFlatIP` after L2-norm = cosine) for
vector search.
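Why `IndexFlatIP` after L2-normalization equals cosine search: on unit vectors the inner product *is* the cosine. A NumPy-only illustration (no faiss needed to see the equivalence):

```python
# Inner product on L2-normalized vectors == cosine similarity.
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=384), rng.normal(size=384)  # bge-small-en is 384-dim

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
a_unit, b_unit = a / np.linalg.norm(a), b / np.linalg.norm(b)
assert np.isclose(a_unit @ b_unit, cosine)  # what IndexFlatIP computes
```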
Knowledge base lives at `data/knowledge_base/` (gitignored;
user-supplied `.md` / `.txt` / `.pdf`). Build the FAISS index with:
```
python -m src.rag.ingest [<input_dir> [<output_dir>]]
```
Defaults: input=`data/knowledge_base/`, output=`data/processed/faiss_index/`.
The Dockerfile runs this at build time so deployed Spaces start with
a populated index. `docker-entrypoint.sh` also rebuilds the index at
startup when a mounted `data/` volume hides the image-built artifacts.
Empty KB → empty index → `retrieve_context` returns 0 chunks; the agent
surfaces this and answers from the pipeline result alone.
`tests/fixtures/kb_sample/` ships 3 seed markdown files (Lipinski,
ComBat, MNE+ICA) — these double as test fixtures and as the demo
seed if no user-supplied PDFs are added.