AGENTS.md — NeuroBridge Enterprise Pipeline
Read this file at the start of every session. It is the contract every agent (human or LLM) operates under in this repository.
1. Project Vision
NeuroBridge Enterprise is a B2B SaaS platform that solves three structural problems in real-world clinical/biomedical ML pipelines:
- Data Drift between hospitals and acquisition sites (multi-center MRI).
- Missing Modalities (a patient may have MRI but no EEG, or vice versa).
- Artifacts in raw biosignals (eye blinks, line noise, motion in EEG).
The platform exposes three production pipelines behind a single FastAPI surface:
| Modality | Pipeline | Core Technique |
|---|---|---|
| Image (MRI / fMRI) | `src/pipelines/mri_pipeline.py` | ComBat Harmonization for site-level domain shift |
| Signal (EEG) | `src/pipelines/eeg_pipeline.py` | MNE-Python + ICA for artifact removal |
| Tabular (BBB / molecules) | `src/pipelines/bbb_pipeline.py` | RDKit Morgan fingerprints from SMILES |
All experiment runs are tracked in MLflow. All services ship as Docker images.
2. Directory Layout (load-bearing — do not violate)
.
├── AGENTS.md # This file
├── requirements.txt
├── pytest.ini
├── data/
│ ├── raw/ # Untouched source data. NEVER train on this directly.
│ └── processed/ # Pipeline output as Parquet (preserves dtypes, see §6; overwritten each run, see §4).
├── src/
│ ├── api/ # FastAPI routers, request/response schemas
│ ├── pipelines/ # One file per modality. Pure functions + a `run_pipeline()` entry.
│ └── core/ # Cross-cutting utilities: logging, config (MLflow helpers planned)
└── tests/
  ├── core/
  ├── pipelines/
  └── fixtures/ # Tiny synthetic data files used by tests (NOT a Python package; no __init__.py)
Rules:
- New modality → new file under `src/pipelines/`. No mixing modalities in one file.
- Anything imported by 2+ pipelines → `src/core/`.
- Pipeline code (`src/pipelines/`, `src/core/`) must not read from or write to any path outside `data/`. Test code may read `tests/fixtures/`. The `data/` boundary is the storage contract for production data.
- `tests/fixtures/` holds CSV / numpy / DICOM blobs; do not add an `__init__.py` there.
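The `data/` boundary rule can be checked mechanically. The following is a minimal sketch, not existing repository code: `is_within` and `assert_within_data` are hypothetical helper names, and the caller is assumed to pass the `data/` root explicitly.

```python
from pathlib import Path


def is_within(path: Path, root: Path) -> bool:
    """Return True if `path` resolves to a location inside `root`."""
    # resolve() collapses ".." segments and symlinks, so a path such as
    # data/../secrets.txt is correctly treated as outside the root.
    return path.resolve().is_relative_to(root.resolve())


def assert_within_data(path: Path, data_root: Path) -> Path:
    """Return the resolved path, or raise ValueError if it escapes `data_root`."""
    if not is_within(path, data_root):
        raise ValueError(f"{path} is outside the storage boundary {data_root}")
    return path.resolve()
```

A pipeline could call `assert_within_data(output_path, Path("data"))` before any write; whether such a guard belongs in `src/core/` is a design decision this sketch does not settle.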
3. Coding Standards
- Python 3.10–3.12 (the pinned native-extension dependencies do not yet ship cp313+ wheels). Use `from __future__ import annotations` when needed for forward refs.
- Type hints are mandatory on every public function/method (parameters and return).
- Modular structure. One responsibility per function. If a function exceeds ~40 lines or 3 levels of nesting, split it.
- TDD is the default workflow. Write the failing test first, watch it fail, then implement. Tests live in `tests/` mirroring `src/`.
- Logging is mandatory for every pipeline. Use `src.core.logger.get_logger(__name__)`. No `print()` in `src/`.
- Docstrings on every public function: one-line summary + Args/Returns when non-trivial.
- No hard-coded paths in business logic. Pass paths as arguments to `run_pipeline(input_path, output_path)`.
- Format & lint: keep imports sorted; prefer `pathlib.Path` over `os.path`.
- Commits are small and frequent. Each green test → commit.
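As an illustration of these standards applied together, here is a small sketch of a typed, documented, logged function. The `get_logger` defined here is only a local stand-in for `src.core.logger.get_logger`, whose real implementation is not shown in this file, and `count_nonempty_lines` is a hypothetical example function.

```python
import logging
from pathlib import Path


def get_logger(name: str) -> logging.Logger:
    """Stand-in for src.core.logger.get_logger (assumed behavior, not the real helper)."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


logger = get_logger(__name__)


def count_nonempty_lines(input_path: Path) -> int:
    """Count the non-blank lines in a text file.

    Args:
        input_path: File to read; supplied by the caller, never hard-coded.

    Returns:
        Number of lines containing at least one non-whitespace character.
    """
    with input_path.open(encoding="utf-8") as handle:
        count = sum(1 for line in handle if line.strip())
    logger.info("Read %d non-empty lines from %s", count, input_path)
    return count
```

The function takes its path as an argument, uses `pathlib.Path`, and reports through the logger instead of `print()`.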
4. Data Readiness Principles
The Golden Rule: never train a model directly on raw data. Raw data must always pass through a pipeline first.
Every modality pipeline MUST guarantee, before writing to data/processed/:
- Schema validity: required columns present, expected dtypes.
- Domain validity: invalid records (e.g. unparseable SMILES, NaN-only EEG epochs, corrupted DICOMs) are logged with their identifier and dropped, never silently coerced.
- Determinism: given the same `data/raw/` input, the pipeline produces byte-identical `data/processed/` output. No wall-clock values in the output, and no randomness without an explicit seed.
- Traceability: log row count in, row count out, and percentage dropped at INFO level.
- Idempotence: re-running the pipeline overwrites `data/processed/` cleanly; no append, no partial writes.
A model training script may read from data/processed/ only. If a
training script references data/raw/ directly, that is a bug, and the
raw-data handling must be refactored into a pipeline.
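The domain-validity and traceability guarantees above might be sketched as a single filtering step. This is an illustration under assumptions: `drop_invalid`, the `Record` alias, and the `id` key are hypothetical names, and the module-level logger stands in for `get_logger(__name__)`.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger(__name__)  # stand-in for get_logger(__name__)

Record = dict[str, object]  # hypothetical record shape for this sketch


def drop_invalid(
    records: Iterable[Record],
    is_valid: Callable[[Record], bool],
    id_key: str = "id",
) -> list[Record]:
    """Keep valid records in input order, logging every drop and the totals.

    Args:
        records: Input records, one dict per row.
        is_valid: Domain check; returns False for records that must be dropped.
        id_key: Key holding the record identifier used in the drop log.

    Returns:
        The valid records, in their original order.
    """
    kept: list[Record] = []
    n_in = 0
    for record in records:
        n_in += 1
        if is_valid(record):
            kept.append(record)
        else:
            # Dropped, never coerced: the identifier goes to the log.
            logger.warning("Dropping invalid record %s", record.get(id_key, "<unknown>"))
    n_dropped = n_in - len(kept)
    pct_dropped = 100.0 * n_dropped / n_in if n_in else 0.0
    logger.info("rows in=%d out=%d dropped=%.1f%%", n_in, len(kept), pct_dropped)
    return kept
```

For example, `drop_invalid(rows, lambda r: bool(r["smiles"]))` would drop rows with an empty SMILES string; a real check would parse the SMILES rather than test for emptiness. Preserving input order keeps the step deterministic.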
5. How to Add a New Pipeline (checklist)
- Add `tests/pipelines/test_<name>_pipeline.py` with the failing tests first.
- Create `src/pipelines/<name>_pipeline.py` exposing `run_pipeline(input_path: Path, output_path: Path) -> None`.
- Use `get_logger(__name__)` for all status output (per §3).
- Apply the §4 Data Readiness contract: validate + drop invalid records with a logged WARNING (identifier + count), log row count in/out/dropped at INFO, write deterministically, and overwrite (do not append) on re-run.
- Write deterministic output to `output_path`.
- Document any new dependency in `requirements.txt` (pinned).
- Add a one-line entry to this file's pipeline table.
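The checklist can be pictured with a skeleton like the one below. It is a sketch under stated assumptions, not a template from this repository: the "pipeline" only keeps non-blank text lines, `_atomic_write` is a hypothetical helper, the logger stands in for `get_logger(__name__)`, and the plain-bytes output is a dependency-free placeholder for the Parquet output a real pipeline would write.

```python
import logging
import os
import tempfile
from pathlib import Path

logger = logging.getLogger(__name__)  # stand-in for get_logger(__name__)


def _atomic_write(output_path: Path, payload: bytes) -> None:
    """Write to a temp file beside the target, then swap it into place."""
    output_path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=output_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
        # os.replace overwrites the target in one step: no append, no partial file.
        os.replace(tmp_name, output_path)
    except BaseException:
        Path(tmp_name).unlink(missing_ok=True)  # clean up the temp file on failure
        raise


def run_pipeline(input_path: Path, output_path: Path) -> None:
    """Hypothetical example pipeline: keep the non-blank lines of a text file.

    Args:
        input_path: Raw input file.
        output_path: Destination file; overwritten on every run.
    """
    lines = input_path.read_text(encoding="utf-8").splitlines()
    kept = [line.strip() for line in lines if line.strip()]
    dropped = len(lines) - len(kept)
    if dropped:
        logger.warning("Dropped %d blank records", dropped)
    pct_dropped = 100.0 * dropped / len(lines) if lines else 0.0
    logger.info("rows in=%d out=%d dropped=%.1f%%", len(lines), len(kept), pct_dropped)
    # Placeholder serialization; a real pipeline writes Parquet here.
    _atomic_write(output_path, "\n".join(kept).encode("utf-8"))
```

A test written first for this skeleton would run `run_pipeline` twice on the same fixture and assert that the output bytes are identical and that no temporary files remain next to the output.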
6. Storage Format Convention
All data/processed/ outputs MUST be Parquet (pyarrow engine, compression="snappy"):
- Preserves dtypes (uint8 fingerprints stay uint8; float32 EEG features stay float32) — CSV silently widens numeric columns and is unsuitable for the high-dimensional float arrays produced by the EEG and MRI pipelines.
- Byte-deterministic with fixed compression and single-threaded writes (satisfies §4 Determinism).
- Read with `pd.read_parquet(path)`; no dtype hints required.
The raw data/raw/ inputs may be in any vendor-supplied format (CSV for BBBP, EDF/FIF for EEG, NIfTI for MRI).