docs: mark MRI pipeline shipped; add Day-3 smoke run + function reference
AGENTS.md

````diff
@@ -16,7 +16,7 @@ The platform exposes three production pipelines behind a single FastAPI surface:
 
 | Modality | Pipeline | Core Technique |
 |---|---|---|
-| Image (MRI / fMRI) | `src/pipelines/mri_pipeline.py` |
+| Image (MRI / fMRI) | `src/pipelines/mri_pipeline.py` | ComBat harmonization for site-level domain shift |
 | Signal (EEG) | `src/pipelines/eeg_pipeline.py` | MNE-Python + ICA for artifact removal |
 | Tabular (BBB / molecules) | `src/pipelines/bbb_pipeline.py` | RDKit Morgan fingerprints from SMILES |
 
@@ -79,6 +79,14 @@ single-threaded mode at import time. CI runners and developer machines do
 not need to set these manually — the pipeline modules handle it — but
 overriding them in the environment will break Determinism rule 3.
 
+**ComBat determinism boundary**: the MRI pipeline's `harmonize_combat` wraps
+`neuroHarmonize.harmonizationLearn` and rounds its output to 14 decimal places
+with `np.round`. This is a defensive measure: with the thread-pinning above,
+harmonization is already bit-identical, but the rounding guarantees
+byte-identity even when the env-pin discipline is bypassed (e.g. a sub-process
+that re-exports a thread count). It discards ~5 trailing-mantissa bits of
+float64 — well below ComBat's biological effect-size precision floor.
+
 A model training script is allowed to import from `data/processed/` only. If a
 training script references `data/raw/` directly, that is a bug and must be
 refactored into a pipeline.
````
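The rounding boundary described above can be illustrated with plain NumPy. This is a minimal sketch of the idea only: `round_for_determinism` is a hypothetical name for illustration, not the repo's `harmonize_combat`.

```python
import numpy as np

def round_for_determinism(features: np.ndarray, decimals: int = 14) -> np.ndarray:
    """Discard trailing-mantissa noise so two runs that differ only by
    sub-1e-14 float jitter serialize to byte-identical buffers.
    Hypothetical helper mirroring the np.round step described in AGENTS.md."""
    return np.round(features, decimals)

# Two mathematically equal values whose raw float64 bytes differ:
a = np.array([0.1 + 0.2])   # 0.30000000000000004
b = np.array([0.3])
print(a.tobytes() == b.tobytes())  # raw bytes differ -> False
print(round_for_determinism(a).tobytes() == round_for_determinism(b).tobytes())  # True
```

After rounding, both arrays hold the identical float64 bit pattern, so any byte-level artifact comparison (e.g. hashing a Parquet file) sees them as equal.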
README.md

````diff
@@ -12,7 +12,7 @@ and Docker shipping.
 |-----|----------|----------|--------|
 | 1 | Tabular (BBB / molecules) | [`bbb_pipeline.py`](src/pipelines/bbb_pipeline.py) | Shipped — 30 tests green |
 | 2 | Signal (EEG) | [`eeg_pipeline.py`](src/pipelines/eeg_pipeline.py) | Shipped — 67 tests green |
-| 3 | Image (MRI / fMRI) | [`mri_pipeline.py`](src/pipelines/mri_pipeline.py) |
+| 3 | Image (MRI / fMRI) | [`mri_pipeline.py`](src/pipelines/mri_pipeline.py) | Shipped — 106 tests green |
 
 ## Quick Start
 
@@ -23,7 +23,7 @@ and Docker shipping.
 # 1. Create venv and install
 python3.12 -m venv .venv312 && source .venv312/bin/activate && pip install -r requirements.txt
 
-# 2. Verify — expect
+# 2. Verify — expect 106 passed
 pytest -v
 
 # 3. Smoke run with the bundled 6-row fixture
@@ -45,6 +45,15 @@ python -m src.pipelines.eeg_pipeline
 
 Result lives at `data/processed/eeg_features.parquet`.
 
+```bash
+# Smoke-test the MRI pipeline with the bundled fixture (6 subjects × 2 sites)
+mkdir -p data/raw/mri
+cp tests/fixtures/mri_sample/* data/raw/mri/
+python -m src.pipelines.mri_pipeline
+```
+
+Result lives at `data/processed/mri_features.parquet` (48 ROI features per subject, ComBat-harmonized across sites).
+
 > **Real BBBP data:** not bundled (gitignored). Download from
 > [Kaggle](https://www.kaggle.com/datasets/priyanagda/bbbp) or
 > [MoleculeNet](https://moleculenet.org/datasets-1); place as `data/raw/bbbp.csv`.
@@ -99,6 +108,18 @@ determinism, traceability, idempotence).
 
 The pipeline is seeded (`random_state=97`) and produces byte-identical Parquet output for the same input — satisfying the §4 Determinism contract. Output is float64, preserved through the Parquet round-trip.
 
+## MRI Pipeline (Day 3)
+
+| Function | Purpose |
+|---|---|
+| `is_valid_volume(volume)` | Returns True iff the input is a finite, numeric, non-empty 3-D ndarray. Rejects NaN/inf, non-numeric dtypes, lists, and scalars. |
+| `mask_brain(volume, intensity_threshold)` | Two-step brain mask: intensity threshold (default = volume mean) plus 6-connectivity morphological opening to drop isolated noise voxels. Logs a WARNING if the mask is empty. |
+| `extract_features_from_volume(volume, mask, n_roi_axes)` | Partitions the masked volume into `prod(n_roi_axes)` axis-aligned octants (default 2×2×2 = 8) and emits 6 stats per ROI: mean / std / p10 / p50 / p90 / voxel_count. Empty ROIs yield 0.0, never NaN. Single source of truth via `_ROI_STATS_FUNCS`. |
+| `harmonize_combat(features, sites, feature_cols)` | Wraps `neuroHarmonize.harmonizationLearn` and rounds the result to 14 decimal places (`np.round`) as a defensive determinism boundary. Removes site-level domain shift on the named columns. Raises if there are fewer than 2 sites, `feature_cols` is empty, or row and site counts mismatch. |
+| `run_pipeline(input_dir, sites_csv, output_path, ...)` | End-to-end orchestrator: NIfTI directory → ComBat-harmonized Parquet. Drops invalid volumes with a logged WARNING. Splits feature columns on a `_MIN_VAR_THRESHOLD = 1e-8` variance floor (constant columns bypass ComBat to avoid NaN). Idempotent; raises on missing input or a directory output path. |
+
+Output schema: one row per surviving subject with columns `subject_id, site, feat_roi{i}_<stat>` (8 ROIs × 6 stats = 48 features). All `feat_*` columns are float64, preserved through the Parquet round-trip.
+
 ## Storage Format
 
 Pipeline outputs are written as Parquet files using the `pyarrow` engine with snappy
@@ -116,8 +137,7 @@ finishes in under 2 seconds on a 2024 laptop.
 ## Roadmap
 
 - **Day 2 (shipped):** `eeg_pipeline.py` — bandpass + MNE ICA artifact removal + PSD + statistical features → Parquet.
-- **Day 3:** `mri_pipeline.py` —
-  (`neuroharmonize`) for site-level domain shift, write features to Parquet.
+- **Day 3 (shipped):** `mri_pipeline.py` — NIfTI volume loading, brain masking, ROI feature extraction, ComBat harmonization (`neuroHarmonize`) for site-level domain shift → Parquet (48 features, 106 tests green).
 - **Day 4+:** FastAPI surface in `src/api/`, MLflow experiment tracking, Docker images,
   CI.
````
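The variance-floor split that `run_pipeline` is documented to perform (constant columns bypass ComBat so harmonization never emits NaN) can be sketched in a few lines. This is an illustrative sketch, not the repo's implementation; `split_feature_cols` is a hypothetical name, and only the `1e-8` threshold value comes from the docs.

```python
import numpy as np

# Name mirrors the _MIN_VAR_THRESHOLD mentioned in the README; the split
# logic below is a sketch of the documented behaviour, not the repo's code.
MIN_VAR_THRESHOLD = 1e-8

def split_feature_cols(features: dict[str, np.ndarray]) -> tuple[list[str], list[str]]:
    """Return (harmonize, passthrough): columns with enough variance to feed
    ComBat vs. near-constant columns routed around it unchanged."""
    harmonize, passthrough = [], []
    for name, values in features.items():
        target = harmonize if np.var(values) > MIN_VAR_THRESHOLD else passthrough
        target.append(name)
    return harmonize, passthrough

features = {
    "feat_roi0_mean": np.array([0.4, 0.9, 0.7, 0.2]),        # varies across subjects
    "feat_roi7_voxel_count": np.array([0.0, 0.0, 0.0, 0.0]), # constant -> bypass ComBat
}
print(split_feature_cols(features))
# (['feat_roi0_mean'], ['feat_roi7_voxel_count'])
```

Routing zero-variance columns past harmonization is the standard guard here: ComBat standardizes each feature by its variance, so a constant column would divide by zero and poison the output with NaN.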