mekosotto Claude Sonnet 4.6 committed on
Commit
b9e6d2f
·
1 Parent(s): b18a079

docs: mark MRI pipeline shipped; add Day-3 smoke run + function reference

Files changed (2)
  1. AGENTS.md +9 -1
  2. README.md +24 -4
AGENTS.md CHANGED
@@ -16,7 +16,7 @@ The platform exposes three production pipelines behind a single FastAPI surface:
16
 
17
  | Modality | Pipeline | Core Technique |
18
  |---|---|---|
19
- | Image (MRI / fMRI) | `src/pipelines/mri_pipeline.py` *(planned, Day 3)* | ComBat Harmonization for site-level domain shift |
20
  | Signal (EEG) | `src/pipelines/eeg_pipeline.py` | MNE-Python + ICA for artifact removal |
21
  | Tabular (BBB / molecules) | `src/pipelines/bbb_pipeline.py` | RDKit Morgan fingerprints from SMILES |
22
 
@@ -79,6 +79,14 @@ single-threaded mode at import time. CI runners and developer machines do
79
  not need to set these manually — the pipeline modules handle it — but
80
  overriding them in the environment will break Determinism rule 3.
81
 
82
  A model training script is allowed to import from `data/processed/` only. If a
83
  training script references `data/raw/` directly, that is a bug and must be
84
  refactored into a pipeline.
 
16
 
17
  | Modality | Pipeline | Core Technique |
18
  |---|---|---|
19
+ | Image (MRI / fMRI) | `src/pipelines/mri_pipeline.py` | ComBat Harmonization for site-level domain shift |
20
  | Signal (EEG) | `src/pipelines/eeg_pipeline.py` | MNE-Python + ICA for artifact removal |
21
  | Tabular (BBB / molecules) | `src/pipelines/bbb_pipeline.py` | RDKit Morgan fingerprints from SMILES |
22
 
 
79
  not need to set these manually — the pipeline modules handle it — but
80
  overriding them in the environment will break Determinism rule 3.
81
 
82
+ **ComBat determinism boundary**: the MRI pipeline's `harmonize_combat` wraps
83
+ `neuroHarmonize.harmonizationLearn` and rounds its output to 14 decimals with `np.round`.
84
+ This is a defensive measure: with the thread-pinning above, harmonization is
85
+ already bit-identical, but the rounding guarantees byte-identity even when
86
+ the env-pin discipline is bypassed (e.g. a sub-process that re-exports a
87
+ thread count). For values of order 1 it discards ~5 trailing mantissa bits of float64, well below
88
+ ComBat's biological effect-size precision floor.
89
+
90
  A model training script is allowed to import from `data/processed/` only. If a
91
  training script references `data/raw/` directly, that is a bug and must be
92
  refactored into a pipeline.
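
The determinism boundary described in the added paragraph can be illustrated with a minimal sketch. This is not the pipeline's `harmonize_combat`: `round_for_determinism` is a hypothetical helper, and the two one-element arrays only stand in for harmonization outputs that differ by last-bit floating-point noise.

```python
import numpy as np


def round_for_determinism(values, decimals=14):
    """Round to a fixed number of decimals so last-bit noise is absorbed.

    Hypothetical helper; the real pipeline applies its rounding inside
    `harmonize_combat`.
    """
    return np.round(np.asarray(values, dtype=np.float64), decimals)


# Two results that agree mathematically but differ in the last mantissa bit,
# as can happen when a floating-point reduction runs in a different order.
run_a = np.array([0.1 + 0.2])
run_b = np.array([0.3])

raw_identical = run_a.tobytes() == run_b.tobytes()
rounded_identical = (
    round_for_determinism(run_a).tobytes()
    == round_for_determinism(run_b).tobytes()
)
print(raw_identical, rounded_identical)
```

Here the raw bytes differ while the rounded bytes match. Two values that straddle a rounding boundary can still round to different results, so a step like this narrows the gap left by thread-count drift rather than replacing the thread pinning.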
README.md CHANGED
@@ -12,7 +12,7 @@ and Docker shipping.
12
  |-----|----------|----------|--------|
13
  | 1 | Tabular (BBB / molecules) | [`bbb_pipeline.py`](src/pipelines/bbb_pipeline.py) | Shipped — 30 tests green |
14
  | 2 | Signal (EEG) | [`eeg_pipeline.py`](src/pipelines/eeg_pipeline.py) | Shipped — 67 tests green |
15
- | 3 | Image (MRI / fMRI) | [`mri_pipeline.py`](src/pipelines/mri_pipeline.py) | Planned (ComBat harmonization) |
16
 
17
  ## Quick Start
18
 
@@ -23,7 +23,7 @@ and Docker shipping.
23
  # 1. Create venv and install
24
  python3.12 -m venv .venv312 && source .venv312/bin/activate && pip install -r requirements.txt
25
 
26
- # 2. Verify — expect 67 passed
27
  pytest -v
28
 
29
  # 3. Smoke run with the bundled 6-row fixture
@@ -45,6 +45,15 @@ python -m src.pipelines.eeg_pipeline
45
 
46
  Result lives at `data/processed/eeg_features.parquet`.
47
 
48
  > **Real BBBP data:** not bundled (gitignored). Download from
49
  > [Kaggle](https://www.kaggle.com/datasets/priyanagda/bbbp) or
50
  > [MoleculeNet](https://moleculenet.org/datasets-1); place as `data/raw/bbbp.csv`.
@@ -99,6 +108,18 @@ determinism, traceability, idempotence).
99
 
100
  The pipeline is seeded (`random_state=97`) and produces byte-identical Parquet output for the same input — satisfying the §4 Determinism contract. Output is float64, preserved through the Parquet round-trip.
101
 
102
  ## Storage Format
103
 
104
  Pipeline outputs are written as Parquet files using the `pyarrow` engine with snappy
@@ -116,8 +137,7 @@ finishes in under 2 seconds on a 2024 laptop.
116
  ## Roadmap
117
 
118
  - **Day 2 (shipped):** `eeg_pipeline.py` — bandpass + MNE ICA artifact removal + PSD + statistical features → Parquet.
119
- - **Day 3:** `mri_pipeline.py` — load NIfTI volumes, ComBat harmonization
120
- (`neuroharmonize`) for site-level domain shift, write features to Parquet.
121
  - **Day 4+:** FastAPI surface in `src/api/`, MLflow experiment tracking, Docker images,
122
  CI.
123
 
 
12
  |-----|----------|----------|--------|
13
  | 1 | Tabular (BBB / molecules) | [`bbb_pipeline.py`](src/pipelines/bbb_pipeline.py) | Shipped — 30 tests green |
14
  | 2 | Signal (EEG) | [`eeg_pipeline.py`](src/pipelines/eeg_pipeline.py) | Shipped — 67 tests green |
15
+ | 3 | Image (MRI / fMRI) | [`mri_pipeline.py`](src/pipelines/mri_pipeline.py) | Shipped — 106 tests green |
16
 
17
  ## Quick Start
18
 
 
23
  # 1. Create venv and install
24
  python3.12 -m venv .venv312 && source .venv312/bin/activate && pip install -r requirements.txt
25
 
26
+ # 2. Verify — expect 106 passed
27
  pytest -v
28
 
29
  # 3. Smoke run with the bundled 6-row fixture
 
45
 
46
  Result lives at `data/processed/eeg_features.parquet`.
47
 
48
+ ```bash
49
+ # Smoke-test the MRI pipeline with the bundled fixture (6 subjects × 2 sites)
50
+ mkdir -p data/raw/mri
51
+ cp tests/fixtures/mri_sample/* data/raw/mri/
52
+ python -m src.pipelines.mri_pipeline
53
+ ```
54
+
55
+ Result lives at `data/processed/mri_features.parquet` (48 ROI features per subject, ComBat-harmonized across sites).
56
+
57
  > **Real BBBP data:** not bundled (gitignored). Download from
58
  > [Kaggle](https://www.kaggle.com/datasets/priyanagda/bbbp) or
59
  > [MoleculeNet](https://moleculenet.org/datasets-1); place as `data/raw/bbbp.csv`.
 
108
 
109
  The pipeline is seeded (`random_state=97`) and produces byte-identical Parquet output for the same input — satisfying the §4 Determinism contract. Output is float64, preserved through the Parquet round-trip.
110
 
111
+ ## MRI Pipeline (Day 3)
112
+
113
+ | Function | Purpose |
114
+ |---|---|
115
+ | `is_valid_volume(volume)` | Returns True iff input is a finite, numeric, non-empty 3-D ndarray. Rejects NaN/inf, non-numeric dtypes, lists/scalars. |
116
+ | `mask_brain(volume, intensity_threshold)` | Two-step brain mask: an intensity threshold (default: the volume mean) followed by a 6-connectivity morphological opening that drops isolated noise voxels. Logs a WARNING if the mask is empty. |
117
+ | `extract_features_from_volume(volume, mask, n_roi_axes)` | Partitions the masked volume into `prod(n_roi_axes)` axis-aligned blocks (default 2×2×2 = 8 octants) and emits 6 stats per ROI: mean, std, p10, p50, p90, voxel_count. Empty ROIs yield 0.0 (never NaN). Stat names come from a single source of truth, `_ROI_STATS_FUNCS`. |
118
+ | `harmonize_combat(features, sites, feature_cols)` | Wraps `neuroHarmonize.harmonizationLearn` and rounds the result to 14 decimals with `np.round` as a defensive determinism boundary. Removes site-level domain shift on the named columns. Raises if there are fewer than 2 sites, if `feature_cols` is empty, or if the row and site counts differ. |
119
+ | `run_pipeline(input_dir, sites_csv, output_path, ...)` | End-to-end orchestrator: NIfTI directory → ComBat-harmonized Parquet. Drops invalid volumes with a logged WARNING. Splits feature columns on a variance floor (`_MIN_VAR_THRESHOLD = 1e-8`); near-constant columns bypass ComBat to avoid NaN. Idempotent; raises if the input is missing or `output_path` is a directory. |
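
Two of the guards in the table, the volume validity check and the variance-floor split, can be sketched as follows. This is a minimal illustration under stated assumptions, not the shipped code: `is_valid_volume_sketch` and `split_by_variance` are hypothetical names, and whether the real comparison against the floor is strict is not stated in the source.

```python
import numpy as np

MIN_VAR_THRESHOLD = 1e-8  # mirrors the `_MIN_VAR_THRESHOLD` value quoted above


def is_valid_volume_sketch(volume):
    """Return True iff `volume` is a finite, numeric, non-empty 3-D ndarray."""
    if not isinstance(volume, np.ndarray):
        return False  # lists and scalars are rejected
    if volume.ndim != 3 or volume.size == 0:
        return False
    if not np.issubdtype(volume.dtype, np.number):
        return False  # e.g. object or string arrays
    return bool(np.isfinite(volume).all())


def split_by_variance(features, feature_cols, threshold=MIN_VAR_THRESHOLD):
    """Split column names into (varying, near_constant) by column variance.

    `features` is a 2-D array whose columns line up with `feature_cols`.
    Assumption: columns at or below the threshold count as near-constant.
    """
    variances = np.var(np.asarray(features, dtype=np.float64), axis=0)
    varying = [c for c, v in zip(feature_cols, variances) if v > threshold]
    near_constant = [c for c, v in zip(feature_cols, variances) if v <= threshold]
    return varying, near_constant


# Column "feat_a" varies across subjects; "feat_b" is constant.
features = np.array([[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]])
varying, near_constant = split_by_variance(features, ["feat_a", "feat_b"])
print(varying, near_constant)
```

In this sketch only the `varying` columns would be passed on to harmonization; the near-constant ones would be written through unchanged.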
120
+
121
+ Output schema: one row per surviving subject, with columns `subject_id`, `site`, and `feat_roi{i}_<stat>` (8 ROIs × 6 stats = 48 features). All `feat_*` columns are float64, preserved through the Parquet round-trip.
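
The 8 ROIs × 6 stats = 48 feature layout can be made concrete with a small sketch. It is an illustration only: `extract_roi_features` and `ROI_STATS_FUNCS` are hypothetical stand-ins for `extract_features_from_volume` and `_ROI_STATS_FUNCS`, and the zero-based ROI index, the integer block boundaries, and the population standard deviation are assumptions.

```python
import numpy as np

# Stat name -> function over the masked voxels of one ROI.
# Assumed layout; the real `_ROI_STATS_FUNCS` may differ.
ROI_STATS_FUNCS = {
    "mean": np.mean,
    "std": np.std,
    "p10": lambda v: np.percentile(v, 10),
    "p50": lambda v: np.percentile(v, 50),
    "p90": lambda v: np.percentile(v, 90),
    "voxel_count": lambda v: float(v.size),
}


def extract_roi_features(volume, mask, n_roi_axes=(2, 2, 2)):
    """Split the volume into axis-aligned blocks and summarize masked voxels."""
    # Integer block boundaries per axis, e.g. [0, 2, 4] for length 4, 2 blocks.
    edges = [
        [(size * p) // parts for p in range(parts + 1)]
        for size, parts in zip(volume.shape, n_roi_axes)
    ]
    features = {}
    for roi, idx in enumerate(np.ndindex(*n_roi_axes)):
        block = tuple(slice(e[a], e[a + 1]) for e, a in zip(edges, idx))
        voxels = volume[block][mask[block]]
        for name, func in ROI_STATS_FUNCS.items():
            # Empty ROIs map to 0.0 instead of NaN.
            value = float(func(voxels)) if voxels.size else 0.0
            features[f"feat_roi{roi}_{name}"] = value
    return features


volume = np.arange(64, dtype=np.float64).reshape(4, 4, 4)
mask = np.ones_like(volume, dtype=bool)
features = extract_roi_features(volume, mask)
print(len(features))
```

With the default 2×2×2 partition the returned dict has 48 keys, one per `feat_roi{i}_<stat>` column, and an all-false mask yields 0.0 for every key.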
122
+
123
  ## Storage Format
124
 
125
  Pipeline outputs are written as Parquet files using the `pyarrow` engine with snappy
 
137
  ## Roadmap
138
 
139
  - **Day 2 (shipped):** `eeg_pipeline.py` — bandpass + MNE ICA artifact removal + PSD + statistical features → Parquet.
140
+ - **Day 3 (shipped):** `mri_pipeline.py` — NIfTI volume loading, brain masking, ROI feature extraction, ComBat harmonization (`neuroHarmonize`) for site-level domain shift → Parquet (48 features, 106 tests green).
 
141
  - **Day 4+:** FastAPI surface in `src/api/`, MLflow experiment tracking, Docker images,
142
  CI.
143