Commit 29e929f by mekosotto · 1 Parent(s): 938399b

docs: add AGENTS.md with vision, layout, standards, data readiness rules

Files changed (1):
  1. AGENTS.md +87 -0

AGENTS.md ADDED
@@ -0,0 +1,87 @@
# AGENTS.md — NeuroBridge Enterprise Pipeline

> Read this file at the start of every session. It is the contract every agent
> (human or LLM) operates under in this repository.

## 1. Project Vision

**NeuroBridge Enterprise** is a B2B SaaS platform that solves three structural
problems in real-world clinical/biomedical ML pipelines:

1. **Data Drift** between hospitals and acquisition sites (multi-center MRI).
2. **Missing Modalities** (a patient may have MRI but no EEG, or vice versa).
3. **Artifacts** in raw biosignals (eye blinks, line noise, motion in EEG).

The platform exposes three production pipelines behind a single FastAPI surface:

| Modality | Pipeline | Core Technique |
|---|---|---|
| Image (MRI / fMRI) | `src/pipelines/mri_pipeline.py` | ComBat Harmonization for site-level domain shift |
| Signal (EEG) | `src/pipelines/eeg_pipeline.py` | MNE-Python + ICA for artifact removal |
| Tabular (BBB / molecules) | `src/pipelines/bbb_pipeline.py` | RDKit Morgan fingerprints from SMILES |

All experiment runs are tracked in **MLflow**. All services ship as **Docker** images.
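The "single surface" routing can be sketched in plain Python. Everything below is an illustrative assumption — the `PIPELINES` registry, the `register` decorator, and `dispatch` do not appear in the repo; each real entry would be the `run_pipeline` imported from its `src/pipelines/<name>_pipeline.py` module, and the FastAPI routers would call `dispatch`:

```python
from pathlib import Path
from typing import Callable, Dict

# A pipeline entry point takes an input path and an output path (see section 5).
PipelineFn = Callable[[Path, Path], None]

# Hypothetical registry mapping modality name -> pipeline entry point.
PIPELINES: Dict[str, PipelineFn] = {}

def register(modality: str) -> Callable[[PipelineFn], PipelineFn]:
    """Decorator that registers a pipeline under its modality name."""
    def wrap(fn: PipelineFn) -> PipelineFn:
        PIPELINES[modality] = fn
        return fn
    return wrap

@register("mri")
def run_mri(input_path: Path, output_path: Path) -> None:
    ...  # placeholder; the real module applies ComBat harmonization

def dispatch(modality: str, input_path: Path, output_path: Path) -> None:
    """Single entry point a FastAPI route handler would call."""
    if modality not in PIPELINES:
        raise ValueError(f"unknown modality: {modality}")
    PIPELINES[modality](input_path, output_path)
```

Keeping the API layer to a thin dispatch like this is what lets each modality live in its own file, per the layout rules below.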
## 2. Directory Layout (load-bearing — do not violate)

```
.
├── AGENTS.md            # This file
├── requirements.txt
├── pytest.ini
├── data/
│   ├── raw/             # Untouched source data. NEVER train on this directly.
│   └── processed/       # Pipeline output. Model-ready. Versioned outputs.
├── src/
│   ├── api/             # FastAPI routers, request/response schemas
│   ├── pipelines/       # One file per modality. Pure functions + a `run_pipeline()` entry.
│   └── core/            # Cross-cutting utilities: logging, config, MLflow helpers
└── tests/
    ├── core/
    ├── pipelines/
    └── fixtures/        # Tiny synthetic data files used by tests (NOT a Python package — no __init__.py)
```

**Rules:**
- New modality → new file under `src/pipelines/`. No mixing modalities in one file.
- Anything imported by 2+ pipelines → `src/core/`.
- Never read from or write to paths outside `data/`. The `data/` boundary is the storage contract.
- `tests/fixtures/` holds CSV / numpy / DICOM blobs — do **not** add an `__init__.py` there.
## 3. Coding Standards

- **Python 3.10+.** Use `from __future__ import annotations` when needed for forward refs.
- **Type hints are mandatory** on every public function/method (parameters and return).
- **Modular structure.** One responsibility per function. If a function exceeds ~40 lines or 3 levels of nesting, split it.
- **TDD is the default workflow.** Write the failing test first, watch it fail, then implement. Tests live in `tests/` mirroring `src/`.
- **Logging is mandatory** for every pipeline. Use `src.core.logger.get_logger(__name__)`. No `print()` in `src/`.
- **Docstrings** on every public function — one-line summary + Args/Returns when non-trivial.
- **No hard-coded paths in business logic.** Pass paths as arguments to `run_pipeline(input_path, output_path)`.
- **Format & lint:** keep imports sorted; prefer `pathlib.Path` over `os.path`.
- **Commits are small and frequent.** Each green test → commit.
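The internals of `src.core.logger.get_logger` are not shown here; one plausible minimal implementation (an assumption, not the actual module) that is safe to call repeatedly without stacking duplicate handlers:

```python
import logging
import sys

def get_logger(name: str) -> logging.Logger:
    """Return a configured logger; repeated calls reuse the same handler."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # configure only once per logger name
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```

Calling it with `__name__` gives each pipeline module its own named logger, which is what makes the per-pipeline traceability in section 4 attributable.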
## 4. Data Readiness Principles

> **The Golden Rule: never train a model directly on raw data. Raw data must always pass through a pipeline first.**

Every modality pipeline MUST guarantee, before writing to `data/processed/`:

1. **Schema validity** — required columns present, expected dtypes.
2. **Domain validity** — invalid records (e.g. unparseable SMILES, NaN-only EEG epochs, corrupted DICOMs) are **logged with their identifier and dropped**, never silently coerced.
3. **Determinism** — given the same `data/raw/` input, the pipeline produces byte-identical `data/processed/` output. No wall-clock timestamps, no randomness without an explicit seed.
4. **Traceability** — log row count in, row count out, and percentage dropped at INFO level.
5. **Idempotence** — re-running the pipeline overwrites `data/processed/` cleanly; no append, no partial writes.

A model training script may read from `data/processed/` only. If a training
script references `data/raw/` directly, that is a bug and must be refactored
into a pipeline.
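Principles 2, 4, and 5 can be sketched for the tabular case. The two-column schema (`sample_id`, `smiles`) and the use of the stdlib `csv` module are illustrative assumptions — the real BBB pipeline would validate SMILES with RDKit rather than just checking for empty fields:

```python
import csv
import logging
from pathlib import Path

logger = logging.getLogger(__name__)  # stand-in for src.core.logger.get_logger

REQUIRED = ("sample_id", "smiles")  # hypothetical schema for illustration

def run_pipeline(input_path: Path, output_path: Path) -> None:
    """Read raw rows, drop invalid ones with a logged warning, write clean output."""
    with input_path.open(newline="") as f:
        rows = list(csv.DictReader(f))
    n_in = len(rows)

    kept = []
    for row in rows:
        if any(not row.get(col) for col in REQUIRED):
            # Principle 2: log the identifier and drop — never silently coerce.
            logger.warning("dropping invalid record: %s", row.get("sample_id", "<no id>"))
            continue
        kept.append(row)

    # Principle 4: traceability — counts in/out and percentage dropped at INFO.
    dropped_pct = 100.0 * (n_in - len(kept)) / n_in if n_in else 0.0
    logger.info("rows in=%d out=%d dropped=%.1f%%", n_in, len(kept), dropped_pct)

    # Principle 5: mode "w" truncates, so re-runs overwrite cleanly (no append).
    with output_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(REQUIRED))
        writer.writeheader()
        writer.writerows({c: r[c] for c in REQUIRED} for r in kept)
```

Because the transform uses no wall-clock values and no unseeded randomness, re-running it on the same input yields byte-identical output, which is principle 3.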
## 5. How to Add a New Pipeline (checklist)

1. Add `tests/pipelines/test_<name>_pipeline.py` with the failing tests first.
2. Create `src/pipelines/<name>_pipeline.py` exposing `run_pipeline(input_path: Path, output_path: Path) -> None`.
3. Use `get_logger(__name__)` for all status output.
4. Validate inputs and drop invalid rows with a logged warning.
5. Write deterministic output to `output_path`.
6. Document any new dependency in `requirements.txt` (pinned).
7. Add a one-line entry to this file's pipeline table.
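Step 1 means the test exists before the module does. A hedged sketch of what such a test could look like, using pytest's `tmp_path` fixture; the inline `run_pipeline` is a stand-in for the `src.pipelines.<name>_pipeline` module the test would normally import (that import failing is the expected first red):

```python
from pathlib import Path

def run_pipeline(input_path: Path, output_path: Path) -> None:
    """Stand-in for src.pipelines.<name>_pipeline.run_pipeline (assumed API)."""
    output_path.write_text(input_path.read_text().strip().lower())

def test_run_pipeline_is_deterministic_and_idempotent(tmp_path: Path) -> None:
    src = tmp_path / "raw.txt"
    src.write_text("A\nB\n")
    out = tmp_path / "processed.txt"

    run_pipeline(src, out)
    first = out.read_bytes()

    run_pipeline(src, out)  # re-run must overwrite cleanly, not append
    assert out.read_bytes() == first
```

Determinism and idempotence (section 4, principles 3 and 5) are cheap to assert this way on every pipeline, so they make a good default first test for step 1.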