Commit 4335f6a (parent b9e6d2f), committed by mekosotto

docs(plan): add Day-3 MRI ComBat pipeline plan

docs/superpowers/plans/2026-05-01-day3-mri-combat-pipeline.md (added)
# NeuroBridge Day 3 — MRI ComBat Pipeline Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Insider One Hackathon Day 3 — ship a deterministic MRI feature pipeline that reads multi-site NIfTI volumes, masks the brain, applies ComBat harmonization to remove site-level domain shift, and writes ROI statistic features as Parquet.

**Architecture:** Modular `src/pipelines/mri_pipeline.py` mirroring Day 1/2's public-function template: a 3-D volume validity primitive (`is_valid_volume`), a brain masking transformer (`mask_brain`), a site-harmonization step (`harmonize_combat`), a feature-emitting layer (`extract_features_from_volume`), and an I/O orchestrator (`run_pipeline`). All logging goes through `src.core.logger.get_logger`. Output is float64 Parquet per AGENTS.md §6. Tests use deterministic synthetic 3-D NIfTI fixtures generated by `build_mri_fixture.py` (seed=42, 8×8×8 voxels × 6 subjects across 2 simulated sites with a deliberate site-effect bias).

**Tech Stack:** Python 3.10–3.12, `nibabel==5.2.1`, `neuroharmonize==2.4.5`, NumPy, SciPy (`scipy.ndimage` for morphological mask cleanup), Pandas, PyArrow, Pytest.

---

## File Structure

| Path | Responsibility |
|---|---|
| `src/pipelines/mri_pipeline.py` | Public API (`is_valid_volume`, `mask_brain`, `harmonize_combat`, `extract_features_from_volume`, `run_pipeline`) + `DEFAULT_INPUT` / `DEFAULT_OUTPUT` + `__main__` CLI. |
| `tests/pipelines/test_mri_pipeline.py` | Unit + integration tests; one class per public function. |
| `tests/fixtures/build_mri_fixture.py` | Standalone script that regenerates `mri_sample/` (6 NIfTI volumes + a `sites.csv` covariate sheet) deterministically from seed=42. Committed alongside the artifacts. |
| `tests/fixtures/mri_sample/` | 6 deterministic synthetic NIfTI volumes (`subject_{i}.nii.gz`) split across 2 sites, plus `sites.csv` with `subject_id,site` columns. |
| `AGENTS.md` | Update §1 pipeline-table MRI row to remove "(planned, Day 3)" suffix and link the now-shipped file. |
| `README.md` | Update Status table MRI row to "Shipped — N tests green"; add MRI smoke-run block + function-reference table; mark Day 3 done in roadmap. |

`mri_pipeline.py` is expected to land at ~250–300 lines after Task 7. We do not split into submodules at this stage.

---

## Public API contract (defined here so tasks reference one source of truth)

```python
# Default ROI partition: split a (D, H, W) volume into 2×2×2 = 8 octant ROIs.
# Octant index follows binary (z, y, x) ordering: 0..7.
DEFAULT_N_ROI_AXES: tuple[int, int, int] = (2, 2, 2)
ROI_STATS: tuple[str, ...] = ("mean", "std", "p10", "p50", "p90", "voxel_count")

def is_valid_volume(volume: np.ndarray | None) -> bool: ...
def mask_brain(
    volume: np.ndarray,
    intensity_threshold: float | None = None,
) -> np.ndarray: ...
def harmonize_combat(
    features: pd.DataFrame,
    sites: pd.Series,
    feature_cols: list[str],
) -> pd.DataFrame: ...
def extract_features_from_volume(
    volume: np.ndarray,
    mask: np.ndarray,
    n_roi_axes: tuple[int, int, int] = DEFAULT_N_ROI_AXES,
) -> dict[str, float]: ...
def run_pipeline(
    input_dir: Path = DEFAULT_INPUT,
    sites_csv: Path | None = None,
    output_path: Path = DEFAULT_OUTPUT,
    intensity_threshold: float | None = None,
    n_roi_axes: tuple[int, int, int] = DEFAULT_N_ROI_AXES,
) -> None: ...
```

Per-volume feature vector layout: `n_roi * len(ROI_STATS)` floats, where `n_roi = product(n_roi_axes)` (default 8). Column names: `feat_roi{i}_<stat>` for `i in 0..n_roi-1` and `<stat>` in ROI_STATS. Output DataFrame schema: one row per subject, columns `subject_id, site, feat_*`.
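
As a quick check of that layout, the full column list can be derived from the two contract constants alone (a standalone sketch, not part of the module; `feature_columns` is a hypothetical helper name):

```python
import numpy as np

# Mirror the contract constants defined above.
DEFAULT_N_ROI_AXES = (2, 2, 2)
ROI_STATS = ("mean", "std", "p10", "p50", "p90", "voxel_count")


def feature_columns(n_roi_axes=DEFAULT_N_ROI_AXES) -> list[str]:
    """Enumerate the per-volume feature column names in contract order."""
    n_roi = int(np.prod(n_roi_axes))
    return [f"feat_roi{i}_{stat}" for i in range(n_roi) for stat in ROI_STATS]


cols = feature_columns()
assert len(cols) == 48  # 8 ROIs × 6 stats
assert cols[0] == "feat_roi0_mean"
assert cols[-1] == "feat_roi7_voxel_count"
```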

`harmonize_combat` calls `neuroHarmonize.harmonizationLearn`, which uses an EM-style fit. ComBat is deterministic given the same input + covariates (no internal RNG), so byte-determinism holds without seeding.

---

## Task 1: MRI test fixture (deterministic synthetic NIfTI volumes)

**Files:**
- Create: `tests/fixtures/build_mri_fixture.py`
- Create: `tests/fixtures/mri_sample/subject_0.nii.gz` ... `subject_5.nii.gz`
- Create: `tests/fixtures/mri_sample/sites.csv`

- [ ] **Step 1: Write the fixture-builder script**

Create `/Users/mertgungor/Desktop/hackathon/tests/fixtures/build_mri_fixture.py`:
```python
"""Generate deterministic synthetic MRI fixtures for the Day-3 pipeline tests.

Six 8×8×8 NIfTI volumes split across two simulated sites. Each volume is a
spherical "brain" with isotropic Gaussian noise plus a per-site additive bias
that ComBat is expected to remove. The fixture is committed alongside this
script so test runs are reproducible without re-running.

Sites:
- Site A: subject_0, subject_1, subject_2 (bias = +0.0 a.u.)
- Site B: subject_3, subject_4, subject_5 (bias = +5.0 a.u.)

NOTE: byte-determinism of the .nii.gz output is coupled to nibabel==5.2.1
(pinned in requirements.txt) and a fixed nibabel.Nifti1Image header. If the
nibabel pin is upgraded, re-run this script and commit the rebuilt artifacts
alongside the dependency bump.
"""
from __future__ import annotations

import csv
from pathlib import Path

import nibabel as nib
import numpy as np


SITE_A_BIAS = 0.0
SITE_B_BIAS = 5.0
VOLUME_SHAPE = (8, 8, 8)
SUBJECTS = (
    ("subject_0", "A"),
    ("subject_1", "A"),
    ("subject_2", "A"),
    ("subject_3", "B"),
    ("subject_4", "B"),
    ("subject_5", "B"),
)


def _spherical_brain(rng: np.random.Generator, bias: float) -> np.ndarray:
    """Build an 8×8×8 volume: spherical brain (radius 3) + noise + site bias."""
    d, h, w = VOLUME_SHAPE
    z, y, x = np.indices((d, h, w))
    cz, cy, cx = (d - 1) / 2.0, (h - 1) / 2.0, (w - 1) / 2.0
    radius2 = (z - cz) ** 2 + (y - cy) ** 2 + (x - cx) ** 2
    brain_mask = radius2 <= 3.0**2
    # Brain intensity ~10 a.u., background ~0.1 a.u. (so default threshold splits cleanly).
    volume = np.where(brain_mask, 10.0, 0.1).astype(np.float64)
    volume += rng.standard_normal(VOLUME_SHAPE) * 0.5
    volume[brain_mask] += bias
    return volume


def build(out_dir: Path | None = None) -> Path:
    out = out_dir if out_dir is not None else Path(__file__).parent / "mri_sample"
    out.mkdir(parents=True, exist_ok=True)

    rng = np.random.default_rng(seed=42)
    affine = np.eye(4)

    sites_rows: list[tuple[str, str]] = []
    for subject_id, site in SUBJECTS:
        bias = SITE_A_BIAS if site == "A" else SITE_B_BIAS
        volume = _spherical_brain(rng, bias=bias)
        img = nib.Nifti1Image(volume, affine=affine)
        nib.save(img, out / f"{subject_id}.nii.gz")
        sites_rows.append((subject_id, site))

    with (out / "sites.csv").open("w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["subject_id", "site"])
        writer.writerows(sites_rows)
    return out


if __name__ == "__main__":
    p = build()
    print(f"Wrote MRI fixture to {p}")
```

- [ ] **Step 2: Run the script to generate the artifacts**

```bash
cd /Users/mertgungor/Desktop/hackathon
source .venv312/bin/activate
python tests/fixtures/build_mri_fixture.py
```
Expected stdout: `Wrote MRI fixture to .../tests/fixtures/mri_sample`. Six `.nii.gz` files (each ~1.8 KB) and `sites.csv` are created.

- [ ] **Step 3: Sanity-check the fixture**

```bash
python -c "
import nibabel as nib
import numpy as np
from pathlib import Path

base = Path('tests/fixtures/mri_sample')
img0 = nib.load(base / 'subject_0.nii.gz')
img3 = nib.load(base / 'subject_3.nii.gz')
v0 = img0.get_fdata()
v3 = img3.get_fdata()
print('shape:', v0.shape)
print('site A mean (subject_0 brain voxels):', round(float(v0[v0 > 5].mean()), 2))
print('site B mean (subject_3 brain voxels):', round(float(v3[v3 > 5].mean()), 2))
print('sites.csv exists:', (base / 'sites.csv').exists())
"
```
Expected:
```
shape: (8, 8, 8)
site A mean (subject_0 brain voxels): ~10.0
site B mean (subject_3 brain voxels): ~15.0 # 10 + bias 5
sites.csv exists: True
```

- [ ] **Step 4: Verify byte-determinism**

```bash
md5_before=$(md5 -q tests/fixtures/mri_sample/subject_0.nii.gz 2>/dev/null || md5sum tests/fixtures/mri_sample/subject_0.nii.gz | awk '{print $1}')
python tests/fixtures/build_mri_fixture.py
md5_after=$(md5 -q tests/fixtures/mri_sample/subject_0.nii.gz 2>/dev/null || md5sum tests/fixtures/mri_sample/subject_0.nii.gz | awk '{print $1}')
echo "before: $md5_before"
echo "after: $md5_after"
```
Expected: matching MD5s. (If they differ because a nibabel/gzip header timestamp drifted, fall back to data-equality: `assert np.array_equal(nib.load(a).get_fdata(), nib.load(b).get_fdata())`.)

- [ ] **Step 5: Commit**

```bash
git add tests/fixtures/build_mri_fixture.py tests/fixtures/mri_sample/
git commit -m "test(mri): add deterministic synthetic NIfTI fixture (6 subjects × 2 sites)"
```

---

## Task 2: `is_valid_volume` (TDD)

**Files:**
- Create: `tests/pipelines/test_mri_pipeline.py`
- Create: `src/pipelines/mri_pipeline.py`

- [ ] **Step 1: Write the failing tests**

Create `/Users/mertgungor/Desktop/hackathon/tests/pipelines/test_mri_pipeline.py`:
```python
"""Unit + integration tests for the MRI ComBat pipeline."""
from __future__ import annotations

from pathlib import Path

import numpy as np
import pytest

from src.pipelines.mri_pipeline import is_valid_volume


FIXTURE_DIR = Path(__file__).parent.parent / "fixtures" / "mri_sample"


class TestIsValidVolume:
    def test_accepts_3d_finite_array(self) -> None:
        vol = np.zeros((8, 8, 8), dtype=np.float64)
        assert is_valid_volume(vol) is True

    def test_rejects_wrong_dimension(self) -> None:
        assert is_valid_volume(np.zeros((8, 8))) is False
        assert is_valid_volume(np.zeros((8, 8, 8, 2))) is False

    def test_rejects_nan(self) -> None:
        vol = np.zeros((8, 8, 8))
        vol[0, 0, 0] = np.nan
        assert is_valid_volume(vol) is False

    def test_rejects_inf(self) -> None:
        vol = np.zeros((8, 8, 8))
        vol[1, 1, 1] = np.inf
        assert is_valid_volume(vol) is False
        vol[1, 1, 1] = -np.inf
        assert is_valid_volume(vol) is False

    def test_rejects_empty(self) -> None:
        assert is_valid_volume(np.zeros((0, 8, 8))) is False
        assert is_valid_volume(np.zeros((8, 0, 8))) is False
        assert is_valid_volume(np.zeros((8, 8, 0))) is False

    def test_rejects_non_numeric_dtype(self) -> None:
        vol = np.array([[["a", "b"], ["c", "d"]]])
        assert is_valid_volume(vol) is False

    def test_rejects_non_array(self) -> None:
        assert is_valid_volume([[[1, 2]], [[3, 4]]]) is False
        assert is_valid_volume(None) is False
```

- [ ] **Step 2: Run tests; they MUST fail**

```bash
cd /Users/mertgungor/Desktop/hackathon
source .venv312/bin/activate
pytest tests/pipelines/test_mri_pipeline.py -v
```
Expected: collection failure with `ModuleNotFoundError: No module named 'src.pipelines.mri_pipeline'`.

- [ ] **Step 3: Implement the module**

Create `/Users/mertgungor/Desktop/hackathon/src/pipelines/mri_pipeline.py`:
```python
"""MRI (magnetic resonance imaging) pipeline.

Loads NIfTI volumes (`.nii` / `.nii.gz`), applies a brain mask, harmonizes
across sites with ComBat (`neuroHarmonize`), and writes per-subject ROI
statistics as a model-ready Parquet at `data/processed/mri_features.parquet`.

Follows the Data Readiness contract in AGENTS.md §4 and the Parquet storage
convention in §6: schema validity, domain validity (drop NaN/inf volumes
with a logged WARNING), determinism (ComBat is RNG-free given fixed input),
traceability (in/out/dropped counts at INFO), and idempotent overwrite.
"""
from __future__ import annotations

import os

# Pin BLAS / OpenMP to single-threaded mode so byte-determinism (AGENTS.md §4
# rule 3) holds across hardware. These env vars are read when the numeric
# libraries load, so they MUST be set before the numpy/pyarrow imports below.
# Without this, multi-threaded floating-point reductions can reorder and
# produce non-bit-identical output.
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")

import numpy as np
import pyarrow as pa

from src.core.logger import get_logger

logger = get_logger(__name__)

# pyarrow exposes runtime thread-count setters, so it can be pinned after import.
pa.set_cpu_count(1)
pa.set_io_thread_count(1)


def is_valid_volume(volume: np.ndarray | None) -> bool:
    """Return True iff `volume` is a non-empty 3-D numeric array with no NaN/inf.

    Used to drop corrupted volumes before masking + feature extraction.
    Defensive against the full set of garbage we expect from real archives:
    lists, None, NaN/inf samples, zero-sized arrays, string-dtype arrays.
    """
    if not isinstance(volume, np.ndarray):
        return False
    if volume.ndim != 3:
        return False
    if volume.size == 0:
        return False
    if not np.issubdtype(volume.dtype, np.number):
        return False
    if not np.all(np.isfinite(volume)):
        return False
    return True
```

- [ ] **Step 4: Run tests to verify they pass**

```bash
pytest tests/pipelines/test_mri_pipeline.py -v
pytest -v
```
Expected: 7 PASS in `TestIsValidVolume`. Total suite: **74 PASS** (67 prior + 7 new).

- [ ] **Step 5: Commit**

```bash
git add tests/pipelines/test_mri_pipeline.py src/pipelines/mri_pipeline.py
git commit -m "feat(mri): add is_valid_volume guard for NaN/inf/shape/dtype on 3-D arrays"
```

---

## Task 3: `mask_brain` — intensity threshold + morphological cleanup (TDD)

**Files:**
- Modify: `tests/pipelines/test_mri_pipeline.py`
- Modify: `src/pipelines/mri_pipeline.py`

- [ ] **Step 1: Append the failing tests**

Update the merged import tuple at the top of `test_mri_pipeline.py`:
```python
from src.pipelines.mri_pipeline import (
    is_valid_volume,
    mask_brain,
)
```

Add an `import nibabel as nib` line to the third-party block of imports (alphabetically before `numpy`).

Append:
```python


class TestMaskBrain:
    def _load_subject(self, sid: str) -> np.ndarray:
        return nib.load(FIXTURE_DIR / f"{sid}.nii.gz").get_fdata()

    def test_returns_bool_mask_of_same_shape(self) -> None:
        vol = self._load_subject("subject_0")
        mask = mask_brain(vol)
        assert isinstance(mask, np.ndarray)
        assert mask.dtype == bool
        assert mask.shape == vol.shape

    def test_mask_separates_brain_from_background(self) -> None:
        """Default threshold should keep the spherical-brain center voxels in."""
        vol = self._load_subject("subject_0")
        mask = mask_brain(vol)
        # The fixture's brain region (radius 3 around center) intensity is ~10;
        # background is ~0.1. Some brain voxels MUST survive the mask.
        assert mask.sum() > 0
        # The center voxel (always brain) MUST be in the mask.
        center = tuple(s // 2 for s in vol.shape)
        assert mask[center]

    def test_mask_drops_low_intensity_background(self) -> None:
        """Voxels with intensity well below the brain core must be excluded."""
        vol = self._load_subject("subject_0")
        mask = mask_brain(vol, intensity_threshold=5.0)
        # Background voxels (intensity ~0.1) must NOT be in the mask.
        bg_voxel = (0, 0, 0)
        assert mask[bg_voxel] == False  # noqa: E712

    def test_explicit_threshold_overrides_default(self) -> None:
        vol = self._load_subject("subject_0")
        # A very high threshold should produce far fewer mask voxels.
        mask_default = mask_brain(vol)
        mask_strict = mask_brain(vol, intensity_threshold=100.0)
        assert mask_strict.sum() < mask_default.sum()

    def test_does_not_mutate_input(self) -> None:
        vol = self._load_subject("subject_0")
        original = vol.copy()
        _ = mask_brain(vol)
        np.testing.assert_array_equal(vol, original)

    def test_morphological_cleanup_removes_isolated_voxels(self) -> None:
        """A single bright voxel surrounded by background must be removed by the
        opening-style morphological cleanup."""
        vol = np.zeros((8, 8, 8), dtype=np.float64)
        vol[4, 4, 4] = 100.0
        mask = mask_brain(vol, intensity_threshold=50.0)
        # Without cleanup, the single voxel would survive. With morphological
        # opening, it must be removed.
        assert mask.sum() == 0
```

- [ ] **Step 2: Run tests; they MUST fail**

```bash
pytest tests/pipelines/test_mri_pipeline.py::TestMaskBrain -v
```
Expected: 6 FAILS with `cannot import name 'mask_brain'`.

- [ ] **Step 3: Implement `mask_brain`**

Add `import nibabel as nib` and `from scipy import ndimage as scipy_ndimage` to the third-party block of `src/pipelines/mri_pipeline.py`. Final imports (thread-pinning comments elided):
```python
from __future__ import annotations

import os

os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")

import nibabel as nib
import numpy as np
import pyarrow as pa
from scipy import ndimage as scipy_ndimage

from src.core.logger import get_logger
```

Append at the END of `src/pipelines/mri_pipeline.py`:
```python


def mask_brain(
    volume: np.ndarray,
    intensity_threshold: float | None = None,
) -> np.ndarray:
    """Build a brain mask from a 3-D MRI volume.

    Two-step pipeline:
      1. Intensity threshold: keep voxels above `intensity_threshold`. When
         `None`, use the volume's mean as a robust auto-threshold (works on
         the synthetic fixture where brain ≫ background; for real data the
         caller should pass an Otsu or BET-derived threshold explicitly).
      2. Morphological opening (`scipy.ndimage.binary_opening`) to remove
         isolated noise voxels and disconnected fragments.

    Args:
        volume: 3-D numeric `np.ndarray` (must satisfy `is_valid_volume`).
        intensity_threshold: Voxel-intensity floor. `None` → use `volume.mean()`.

    Returns:
        A boolean `np.ndarray` of the same shape as `volume`. True = brain.
    """
    if intensity_threshold is None:
        intensity_threshold = float(volume.mean())

    raw = volume > intensity_threshold
    cleaned = scipy_ndimage.binary_opening(raw, iterations=1)
    return cleaned.astype(bool)
```
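
The cleanup behavior that the last test pins down can be reproduced in isolation with `scipy.ndimage` (a standalone sketch, not project code):

```python
import numpy as np
from scipy import ndimage

# One bright voxel in an otherwise empty 8×8×8 volume.
vol = np.zeros((8, 8, 8))
vol[4, 4, 4] = 100.0

raw = vol > 50.0                                    # thresholding alone keeps the speck
cleaned = ndimage.binary_opening(raw, iterations=1)  # erosion then dilation

assert raw.sum() == 1      # the isolated voxel survives the threshold...
assert cleaned.sum() == 0  # ...but morphological opening removes it
```

Opening erodes first, so any region smaller than the structuring element (a 3×3×3 cross by default) vanishes entirely; larger connected regions survive with their bulk intact.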

- [ ] **Step 4: Run tests to verify they pass**

```bash
pytest tests/pipelines/test_mri_pipeline.py -v
```
Expected: 13 PASS in the MRI file (7 valid_volume + 6 mask). Total: **80 PASS** (67 prior + 13).

- [ ] **Step 5: Commit**

```bash
git add tests/pipelines/test_mri_pipeline.py src/pipelines/mri_pipeline.py
git commit -m "feat(mri): add mask_brain (intensity threshold + morphological opening)"
```

---

## Task 4: `extract_features_from_volume` — ROI octant statistics (TDD)

**Files:**
- Modify: `tests/pipelines/test_mri_pipeline.py`
- Modify: `src/pipelines/mri_pipeline.py`

- [ ] **Step 1: Append the failing tests**

Extend the merged import tuple to include `extract_features_from_volume`, `DEFAULT_N_ROI_AXES`, and `ROI_STATS`:
```python
from src.pipelines.mri_pipeline import (
    DEFAULT_N_ROI_AXES,
    ROI_STATS,
    extract_features_from_volume,
    is_valid_volume,
    mask_brain,
)
```

Append:
```python


class TestExtractFeaturesFromVolume:
    def _load_subject(self, sid: str) -> np.ndarray:
        return nib.load(FIXTURE_DIR / f"{sid}.nii.gz").get_fdata()

    def test_returns_dict_with_correct_keys(self) -> None:
        vol = self._load_subject("subject_0")
        mask = mask_brain(vol)
        feats = extract_features_from_volume(vol, mask)
        n_roi = int(np.prod(DEFAULT_N_ROI_AXES))
        expected = {
            f"feat_roi{i}_{stat}"
            for i in range(n_roi)
            for stat in ROI_STATS
        }
        assert set(feats.keys()) == expected

    def test_feature_count_matches_contract(self) -> None:
        vol = self._load_subject("subject_0")
        mask = mask_brain(vol)
        feats = extract_features_from_volume(vol, mask)
        n_roi = int(np.prod(DEFAULT_N_ROI_AXES))
        assert len(feats) == n_roi * len(ROI_STATS)

    def test_all_features_finite_float(self) -> None:
        vol = self._load_subject("subject_0")
        mask = mask_brain(vol)
        feats = extract_features_from_volume(vol, mask)
        for k, v in feats.items():
            assert isinstance(v, float), f"{k}: {type(v).__name__}"
            assert np.isfinite(v), f"{k}: {v}"

    def test_voxel_count_is_integer_valued(self) -> None:
        vol = self._load_subject("subject_0")
        mask = mask_brain(vol)
        feats = extract_features_from_volume(vol, mask)
        for k, v in feats.items():
            if k.endswith("_voxel_count"):
                # voxel_count is stored as float for column-uniformity, but
                # must be a whole number.
                assert v == float(int(v))

    def test_empty_mask_yields_zero_features(self) -> None:
        """If a volume has zero brain voxels (mask all False), every stat
        must default to 0.0 — not NaN — to preserve the no-NaN Parquet contract."""
        vol = self._load_subject("subject_0")
        empty_mask = np.zeros_like(vol, dtype=bool)
        feats = extract_features_from_volume(vol, empty_mask)
        for k, v in feats.items():
            assert v == 0.0, f"{k}: {v}"

    def test_deterministic_for_same_input(self) -> None:
        vol = self._load_subject("subject_0")
        mask = mask_brain(vol)
        a = extract_features_from_volume(vol, mask)
        b = extract_features_from_volume(vol, mask)
        assert a == b
```

- [ ] **Step 2: Run tests; they MUST fail**

```bash
pytest tests/pipelines/test_mri_pipeline.py::TestExtractFeaturesFromVolume -v
```
Expected: 6 FAILS at collection; the import error names the first missing name in the tuple: `cannot import name 'DEFAULT_N_ROI_AXES'`.

- [ ] **Step 3: Implement `extract_features_from_volume`**

Append at the END of `src/pipelines/mri_pipeline.py`:
```python


# Default ROI partition: split a (D, H, W) volume into 2×2×2 = 8 octant ROIs.
# Octant index follows binary (z, y, x) ordering: 0..7.
DEFAULT_N_ROI_AXES: tuple[int, int, int] = (2, 2, 2)
ROI_STATS: tuple[str, ...] = ("mean", "std", "p10", "p50", "p90", "voxel_count")


def _roi_slices(
    shape: tuple[int, int, int],
    n_roi_axes: tuple[int, int, int],
) -> list[tuple[slice, slice, slice]]:
    """Generate the ROI slice list in deterministic (z, y, x) octant order."""
    nz, ny, nx = n_roi_axes
    dz, dy, dx = shape
    bins_z = np.array_split(np.arange(dz), nz)
    bins_y = np.array_split(np.arange(dy), ny)
    bins_x = np.array_split(np.arange(dx), nx)
    out: list[tuple[slice, slice, slice]] = []
    for bz in bins_z:
        for by in bins_y:
            for bx in bins_x:
                out.append((
                    slice(bz[0], bz[-1] + 1),
                    slice(by[0], by[-1] + 1),
                    slice(bx[0], bx[-1] + 1),
                ))
    return out


def _roi_stats_for(values: np.ndarray) -> dict[str, float]:
    """Compute the 6 ROI stats. Empty array → all 0.0 (no-NaN contract)."""
    if values.size == 0:
        return {stat: 0.0 for stat in ROI_STATS}
    return {
        "mean": float(values.mean()),
        "std": float(values.std()),
        "p10": float(np.percentile(values, 10)),
        "p50": float(np.percentile(values, 50)),
        "p90": float(np.percentile(values, 90)),
        "voxel_count": float(values.size),
    }


def extract_features_from_volume(
    volume: np.ndarray,
    mask: np.ndarray,
    n_roi_axes: tuple[int, int, int] = DEFAULT_N_ROI_AXES,
) -> dict[str, float]:
    """Compute per-ROI summary statistics from a masked volume.

    The volume is partitioned into ``prod(n_roi_axes)`` axis-aligned octants
    in deterministic (z, y, x) order. For each ROI, intensity values from
    voxels where `mask` is True are summarized via mean / std / 10th, 50th,
    90th percentile / voxel count. Empty ROIs (no mask voxels) report all
    zeros so the resulting Parquet has no NaN values.

    Args:
        volume: 3-D numeric `np.ndarray` (already validated).
        mask: Boolean `np.ndarray` of the same shape (from `mask_brain`).
        n_roi_axes: ROI grid along (z, y, x). Default `(2, 2, 2)` → 8 ROIs.

    Returns:
        Flat dict `{"feat_roi{i}_{stat}": float}` of length
        ``prod(n_roi_axes) * len(ROI_STATS)``.
    """
    feats: dict[str, float] = {}
    slices = _roi_slices(volume.shape, n_roi_axes)
    for i, sl in enumerate(slices):
        roi_values = volume[sl][mask[sl]]
        stats = _roi_stats_for(roi_values)
        for stat_name, stat_val in stats.items():
            feats[f"feat_roi{i}_{stat_name}"] = stat_val
    return feats
```
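
The octant partition can also be verified in isolation: `np.array_split` over each default 8-voxel axis yields two 4-voxel bins, so the grid is 8 blocks of 4×4×4 (a standalone sketch mirroring the `_roi_slices` logic, not project code):

```python
import numpy as np

shape, n_roi_axes = (8, 8, 8), (2, 2, 2)

# Per-axis index bins, one list of bins per axis.
bins = [np.array_split(np.arange(d), n) for d, n in zip(shape, n_roi_axes)]

# Nest in (z, y, x) order, exactly the deterministic octant ordering 0..7.
slices = [
    (
        slice(int(bz[0]), int(bz[-1]) + 1),
        slice(int(by[0]), int(by[-1]) + 1),
        slice(int(bx[0]), int(bx[-1]) + 1),
    )
    for bz in bins[0]
    for by in bins[1]
    for bx in bins[2]
]

assert len(slices) == 8  # 2×2×2 octants
assert slices[0] == (slice(0, 4), slice(0, 4), slice(0, 4))   # ROI 0: low-z, low-y, low-x
assert slices[-1] == (slice(4, 8), slice(4, 8), slice(4, 8))  # ROI 7: high-z, high-y, high-x
```

Because `np.array_split` tolerates axes that do not divide evenly, the same scheme degrades gracefully on odd-sized volumes (bins differ by at most one voxel).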
668
+
669
+ - [ ] **Step 4: Run tests to verify they pass**
670
+
671
+ ```bash
672
+ pytest tests/pipelines/test_mri_pipeline.py -v
673
+ ```
674
+ Expected: 19 PASS (7 valid + 6 mask + 6 features). Total: **86 PASS**.
675
+
676
+ - [ ] **Step 5: Commit**
677
+
678
+ ```bash
679
+ git add tests/pipelines/test_mri_pipeline.py src/pipelines/mri_pipeline.py
680
+ git commit -m "feat(mri): add extract_features_from_volume (8 ROI octants × 6 stats)"
681
+ ```
682
+
683
+ ---
684
+
685
+ ## Task 5: `harmonize_combat` — site-effect removal (TDD)
686
+
687
+ **Files:**
688
+ - Modify: `tests/pipelines/test_mri_pipeline.py`
689
+ - Modify: `src/pipelines/mri_pipeline.py`
690
+
691
- [ ] **Step 1: Append the failing tests**

Extend the merged import tuple to include `harmonize_combat`:

```python
from src.pipelines.mri_pipeline import (
    DEFAULT_N_ROI_AXES,
    ROI_STATS,
    extract_features_from_volume,
    harmonize_combat,
    is_valid_volume,
    mask_brain,
)
```

Add `import pandas as pd` to the third-party imports of the test file (alphabetical: between numpy and pytest).

Append:

```python


class TestHarmonizeCombat:
    def _build_two_site_features(self) -> tuple[pd.DataFrame, pd.Series, list[str]]:
        """Synthesize a 6-row × 4-feature table with a clear site bias."""
        rng = np.random.default_rng(seed=42)
        feature_cols = ["feat_roi0_mean", "feat_roi1_mean", "feat_roi2_mean", "feat_roi3_mean"]
        # Site A baseline: mean ~0; Site B baseline: mean ~5 (the bias to remove).
        site_a = rng.normal(loc=0.0, scale=1.0, size=(3, 4))
        site_b = rng.normal(loc=5.0, scale=1.0, size=(3, 4))
        df = pd.DataFrame(
            np.vstack([site_a, site_b]),
            columns=feature_cols,
        )
        sites = pd.Series(["A", "A", "A", "B", "B", "B"], name="site")
        return df, sites, feature_cols

    def test_returns_dataframe_same_shape_and_columns(self) -> None:
        df, sites, feature_cols = self._build_two_site_features()
        out = harmonize_combat(df, sites, feature_cols)
        assert isinstance(out, pd.DataFrame)
        assert out.shape == df.shape
        assert list(out.columns) == feature_cols

    def test_reduces_site_mean_difference(self) -> None:
        """ComBat must shrink the per-site mean gap on every harmonized column."""
        df, sites, feature_cols = self._build_two_site_features()
        gap_before = (
            df.loc[sites == "B", feature_cols].mean()
            - df.loc[sites == "A", feature_cols].mean()
        ).abs()

        out = harmonize_combat(df, sites, feature_cols)
        gap_after = (
            out.loc[sites == "B", feature_cols].mean()
            - out.loc[sites == "A", feature_cols].mean()
        ).abs()

        # Every column's site gap must shrink (ComBat aligns site means).
        assert (gap_after < gap_before).all(), (
            f"gap_before={gap_before.tolist()} gap_after={gap_after.tolist()}"
        )

    def test_output_dtype_float64(self) -> None:
        df, sites, feature_cols = self._build_two_site_features()
        out = harmonize_combat(df, sites, feature_cols)
        for c in feature_cols:
            assert out[c].dtype == np.float64, f"{c} → {out[c].dtype}"

    def test_no_nan_in_output(self) -> None:
        df, sites, feature_cols = self._build_two_site_features()
        out = harmonize_combat(df, sites, feature_cols)
        assert out[feature_cols].notna().all().all()
        assert np.isfinite(out[feature_cols].to_numpy()).all()

    def test_deterministic(self) -> None:
        df, sites, feature_cols = self._build_two_site_features()
        a = harmonize_combat(df, sites, feature_cols)
        b = harmonize_combat(df.copy(), sites.copy(), list(feature_cols))
        np.testing.assert_array_equal(a.to_numpy(), b.to_numpy())

    def test_raises_on_single_site(self) -> None:
        """ComBat needs at least 2 sites; a single-site dataset is malformed."""
        df, _, feature_cols = self._build_two_site_features()
        sites_one = pd.Series(["A"] * len(df), name="site")
        with pytest.raises(ValueError, match="at least 2 sites"):
            harmonize_combat(df, sites_one, feature_cols)
```

- [ ] **Step 2: Run tests; they MUST fail**

```bash
pytest tests/pipelines/test_mri_pipeline.py::TestHarmonizeCombat -v
```

Expected: 6 FAILS with `cannot import name 'harmonize_combat'`.

- [ ] **Step 3: Implement `harmonize_combat`**

Add `import pandas as pd` to the third-party imports of `src/pipelines/mri_pipeline.py` (alphabetical: between numpy and pyarrow).

Append at the END of `src/pipelines/mri_pipeline.py`:

```python


def harmonize_combat(
    features: pd.DataFrame,
    sites: pd.Series,
    feature_cols: list[str],
) -> pd.DataFrame:
    """Apply ComBat harmonization across sites to remove site-level domain shift.

    Wraps `neuroHarmonize.harmonizationLearn`, which fits a parametric ComBat
    model (no internal RNG → byte-deterministic given fixed input). Only
    `feature_cols` are harmonized; other columns in `features` (e.g.
    metadata) are not touched by this function — callers should join after.

    Args:
        features: DataFrame with at least the columns listed in `feature_cols`.
        sites: Site label per row (length must match `len(features)`).
        feature_cols: Names of the columns to harmonize.

    Returns:
        A new DataFrame of identical shape & column order to
        `features[feature_cols]`, with ComBat-harmonized values.

    Raises:
        ValueError: if fewer than 2 distinct sites are present.
    """
    from neuroHarmonize import harmonizationLearn

    if sites.nunique() < 2:
        raise ValueError(
            f"ComBat requires at least 2 sites; got {sites.nunique()} "
            f"({sites.unique().tolist()})"
        )

    matrix = features[feature_cols].to_numpy(dtype=np.float64)
    covars = pd.DataFrame({"SITE": sites.to_numpy()})

    _, harmonized = harmonizationLearn(matrix, covars)
    out = pd.DataFrame(
        np.asarray(harmonized, dtype=np.float64),
        columns=list(feature_cols),
        index=features.index,
    )
    logger.info(
        "ComBat harmonized %d rows × %d features across %d sites",
        len(out), len(feature_cols), sites.nunique(),
    )
    return out
```
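The property `test_reduces_site_mean_difference` asserts can be illustrated without `neuroHarmonize` installed. A toy stand-in for ComBat's additive site term (location alignment only — real ComBat additionally pools variance estimates via empirical Bayes, so this is a sketch of the tested behavior, not the library):

```python
import numpy as np
import pandas as pd

def align_site_means(features: pd.DataFrame, sites: pd.Series) -> pd.DataFrame:
    """Re-center each site's per-column mean onto the grand mean.

    A simplified analogue of ComBat's additive site effect; illustration
    only, not a substitute for harmonizationLearn.
    """
    grand_mean = features.mean()
    site_means = features.groupby(sites).transform("mean")
    return features - site_means + grand_mean

rng = np.random.default_rng(seed=42)
df = pd.DataFrame(
    np.vstack([rng.normal(0.0, 1.0, (3, 2)), rng.normal(5.0, 1.0, (3, 2))]),
    columns=["feat_roi0_mean", "feat_roi1_mean"],
)
sites = pd.Series(["A"] * 3 + ["B"] * 3)

out = align_site_means(df, sites)
gap = (out[sites == "B"].mean() - out[sites == "A"].mean()).abs()
print(gap.max())  # site means now coincide (gap ~ 0 up to float error)
```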

- [ ] **Step 4: Run tests to verify they pass**

```bash
pytest tests/pipelines/test_mri_pipeline.py -v
```

Expected: 25 PASS (7 valid + 6 mask + 6 features + 6 combat). Total: **92 PASS**.

- [ ] **Step 5: Commit**

```bash
git add tests/pipelines/test_mri_pipeline.py src/pipelines/mri_pipeline.py
git commit -m "feat(mri): add harmonize_combat wrapper around neuroHarmonize.harmonizationLearn"
```

---

## Task 6: `run_pipeline` orchestrator + CLI (TDD)

**Files:**
- Modify: `tests/pipelines/test_mri_pipeline.py`
- Modify: `src/pipelines/mri_pipeline.py`

- [ ] **Step 1: Append the failing tests**

Extend the merged import tuple to include `run_pipeline`. Add `import shutil` to the stdlib block at the top of the test file (alphabetical: between `from pathlib import Path` and the third-party block).

Append:

```python


class TestRunPipeline:
    def _stage_inputs(self, tmp_path: Path) -> tuple[Path, Path, Path]:
        """Copy the committed MRI fixture into a tmp_path layout."""
        raw_dir = tmp_path / "data" / "raw" / "mri"
        proc_dir = tmp_path / "data" / "processed"
        raw_dir.mkdir(parents=True)
        proc_dir.mkdir(parents=True)
        for src in FIXTURE_DIR.iterdir():
            shutil.copy(src, raw_dir / src.name)
        sites_csv = raw_dir / "sites.csv"
        output_path = proc_dir / "mri_features.parquet"
        return raw_dir, sites_csv, output_path

    def test_end_to_end_writes_processed_parquet(self, tmp_path: Path) -> None:
        raw_dir, sites_csv, output_path = self._stage_inputs(tmp_path)
        run_pipeline(
            input_dir=raw_dir, sites_csv=sites_csv, output_path=output_path,
        )
        assert output_path.exists()
        df = pd.read_parquet(output_path)
        assert len(df) == 6  # 6 subjects in the fixture
        assert "subject_id" in df.columns
        assert "site" in df.columns
        assert any(c.startswith("feat_roi") for c in df.columns)

    def test_run_pipeline_preserves_float64_for_features(self, tmp_path: Path) -> None:
        raw_dir, sites_csv, output_path = self._stage_inputs(tmp_path)
        run_pipeline(
            input_dir=raw_dir, sites_csv=sites_csv, output_path=output_path,
        )
        df = pd.read_parquet(output_path)
        feat_cols = [c for c in df.columns if c.startswith("feat_")]
        for c in feat_cols:
            assert df[c].dtype == np.float64, f"{c} widened to {df[c].dtype}"

    def test_run_pipeline_is_idempotent(self, tmp_path: Path) -> None:
        raw_dir, sites_csv, output_path = self._stage_inputs(tmp_path)
        run_pipeline(
            input_dir=raw_dir, sites_csv=sites_csv, output_path=output_path,
        )
        first = output_path.read_bytes()
        run_pipeline(
            input_dir=raw_dir, sites_csv=sites_csv, output_path=output_path,
        )
        second = output_path.read_bytes()
        assert first == second, "MRI pipeline output must be byte-deterministic"

    def test_run_pipeline_reduces_site_gap(self, tmp_path: Path) -> None:
        """End-to-end: ComBat must shrink the per-site mean gap in feat_roi0_mean."""
        raw_dir, sites_csv, output_path = self._stage_inputs(tmp_path)
        run_pipeline(
            input_dir=raw_dir, sites_csv=sites_csv, output_path=output_path,
        )
        df = pd.read_parquet(output_path)
        # Brain mean before harmonization differs by ~5 between sites.
        # After ComBat, the per-site mean of feat_roi0_mean must be much closer.
        site_means = df.groupby("site")["feat_roi0_mean"].mean()
        gap = abs(site_means["B"] - site_means["A"])
        assert gap < 1.0, f"site gap after ComBat: {gap}"

    def test_run_pipeline_raises_when_input_missing(self, tmp_path: Path) -> None:
        with pytest.raises(FileNotFoundError, match="MRI input directory not found"):
            run_pipeline(
                input_dir=tmp_path / "nope",
                sites_csv=tmp_path / "sites.csv",
                output_path=tmp_path / "out.parquet",
            )

    def test_run_pipeline_rejects_directory_as_output(self, tmp_path: Path) -> None:
        raw_dir, sites_csv, _ = self._stage_inputs(tmp_path)
        bad_output = tmp_path / "out_dir"
        bad_output.mkdir()
        with pytest.raises(IsADirectoryError, match="must be a file"):
            run_pipeline(
                input_dir=raw_dir, sites_csv=sites_csv, output_path=bad_output,
            )

    def test_run_pipeline_drops_invalid_volumes(self, tmp_path: Path) -> None:
        """A NaN-containing volume must be logged + dropped, not silently included."""
        raw_dir, sites_csv, output_path = self._stage_inputs(tmp_path)
        # Corrupt subject_5 to contain NaN. Re-save in place.
        bad = nib.load(raw_dir / "subject_5.nii.gz").get_fdata()
        bad[0, 0, 0] = np.nan
        nib.save(nib.Nifti1Image(bad, affine=np.eye(4)), raw_dir / "subject_5.nii.gz")

        run_pipeline(
            input_dir=raw_dir, sites_csv=sites_csv, output_path=output_path,
        )
        df = pd.read_parquet(output_path)
        # 5 surviving valid subjects (subject_5 dropped).
        assert len(df) == 5
        assert "subject_5" not in df["subject_id"].tolist()
```

- [ ] **Step 2: Run tests; they MUST fail**

```bash
pytest tests/pipelines/test_mri_pipeline.py::TestRunPipeline -v
```

Expected: 7 FAILS with `cannot import name 'run_pipeline'`.

- [ ] **Step 3: Implement `run_pipeline` + CLI**

Add `from pathlib import Path` to the stdlib imports of `src/pipelines/mri_pipeline.py`. Final stdlib block:

```python
from __future__ import annotations

import os
from pathlib import Path
```

Append at the END of `src/pipelines/mri_pipeline.py`:

```python


# Default I/O paths for the MRI pipeline. Override via run_pipeline() args.
DEFAULT_INPUT = Path("data/raw/mri")
DEFAULT_OUTPUT = Path("data/processed/mri_features.parquet")


def _list_nifti_volumes(input_dir: Path) -> list[Path]:
    """Return a sorted list of .nii / .nii.gz files in `input_dir`."""
    return sorted(
        p for p in input_dir.iterdir()
        if p.suffix == ".nii" or p.name.endswith(".nii.gz")
    )


def run_pipeline(
    input_dir: Path = DEFAULT_INPUT,
    sites_csv: Path | None = None,
    output_path: Path = DEFAULT_OUTPUT,
    intensity_threshold: float | None = None,
    n_roi_axes: tuple[int, int, int] = DEFAULT_N_ROI_AXES,
) -> None:
    """Run the MRI pipeline end-to-end: NIfTI directory → harmonized Parquet.

    For each `subject_id.nii(.gz)` in `input_dir`, validates the volume,
    masks the brain, computes per-ROI statistics, then harmonizes across
    sites (column "site" of `sites_csv`, joined on "subject_id") via ComBat.
    Output is float64 Parquet at `output_path`.

    Args:
        input_dir: Directory containing one NIfTI per subject and a
            `sites.csv` (or `sites_csv` override) with columns
            `subject_id, site`.
        sites_csv: Path to the site-covariates CSV. If `None`, defaults to
            `input_dir / "sites.csv"`.
        output_path: Where to write the processed feature Parquet file.
        intensity_threshold: Brain-mask intensity floor. `None` → per-volume
            mean (see `mask_brain`).
        n_roi_axes: ROI grid (z, y, x).

    Raises:
        FileNotFoundError: if `input_dir` or `sites_csv` does not exist.
        IsADirectoryError: if `output_path` resolves to an existing directory.
        KeyError: if `sites_csv` lacks a site assignment for any subject.
    """
    input_dir = Path(input_dir)
    output_path = Path(output_path)
    if not input_dir.exists():
        raise FileNotFoundError(f"MRI input directory not found: {input_dir}")
    sites_csv = Path(sites_csv) if sites_csv is not None else input_dir / "sites.csv"
    if not sites_csv.exists():
        raise FileNotFoundError(f"sites_csv not found: {sites_csv}")

    logger.info("Reading MRI volumes from %s", input_dir)
    nifti_paths = _list_nifti_volumes(input_dir)
    sites_df = pd.read_csv(sites_csv)

    rows: list[dict[str, float | str]] = []
    invalid_subject_ids: list[str] = []
    for path in nifti_paths:
        subject_id = path.name.removesuffix(".nii.gz").removesuffix(".nii")
        volume = nib.load(path).get_fdata()
        if not is_valid_volume(volume):
            invalid_subject_ids.append(subject_id)
            continue
        mask = mask_brain(volume, intensity_threshold=intensity_threshold)
        feats = extract_features_from_volume(volume, mask, n_roi_axes=n_roi_axes)
        feats["subject_id"] = subject_id
        rows.append(feats)

    n_total = len(nifti_paths)
    n_dropped = len(invalid_subject_ids)
    if n_dropped:
        display = invalid_subject_ids[:10]
        suffix = f"... (+{n_dropped - 10} more)" if n_dropped > 10 else ""
        logger.warning(
            "Dropping %d/%d volumes with invalid samples (subjects=%s%s)",
            n_dropped, n_total, display, suffix,
        )

    feature_cols = [
        f"feat_roi{i}_{stat}"
        for i in range(int(np.prod(n_roi_axes)))
        for stat in ROI_STATS
    ]

    if not rows:
        logger.info(
            "Feature extraction complete: in=%d, out=0, dropped=%d (%.2f%%)",
            n_total, n_dropped, 100.0 * n_dropped / max(n_total, 1),
        )
        empty = pd.DataFrame(
            columns=["subject_id", "site", *feature_cols]
        ).astype({c: np.float64 for c in feature_cols})
        output_path.parent.mkdir(parents=True, exist_ok=True)
        if output_path.is_dir():
            raise IsADirectoryError(
                f"output_path must be a file, got a directory: {output_path}"
            )
        empty.to_parquet(
            output_path, index=False, engine="pyarrow", compression="snappy",
        )
        return

    raw_features = pd.DataFrame(rows)
    raw_features = raw_features.merge(sites_df, on="subject_id", how="left")
    if raw_features["site"].isna().any():
        missing = raw_features.loc[raw_features["site"].isna(), "subject_id"].tolist()
        raise KeyError(
            f"sites_csv missing site assignment for subjects: {missing}"
        )

    harmonized = harmonize_combat(
        raw_features, raw_features["site"], feature_cols,
    )
    final = pd.concat(
        [raw_features[["subject_id", "site"]].reset_index(drop=True),
         harmonized.reset_index(drop=True)],
        axis=1,
    )

    output_path.parent.mkdir(parents=True, exist_ok=True)
    if output_path.is_dir():
        raise IsADirectoryError(
            f"output_path must be a file, got a directory: {output_path}"
        )
    # Parquet preserves dtypes (float64 features stay float64) and is
    # byte-deterministic with single-threaded snappy. AGENTS.md §6.
    final.to_parquet(
        output_path, index=False, engine="pyarrow", compression="snappy",
    )
    logger.info(
        "Feature extraction complete: in=%d, out=%d, dropped=%d (%.2f%%)",
        n_total, len(final), n_dropped, 100.0 * n_dropped / max(n_total, 1),
    )
    logger.info(
        "Wrote processed features to %s (rows=%d, cols=%d)",
        output_path, len(final), final.shape[1],
    )


if __name__ == "__main__":
    # Day-3 CLI entrypoint — runs with default paths against `data/raw/mri/`.
    # Expects `data/raw/mri/sites.csv` with columns `subject_id, site`.
    # Argument parsing (argparse / click) will land in a later task.
    # python -m src.pipelines.mri_pipeline
    run_pipeline()
```
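The filename handling in the subject loop can be exercised in isolation with the stdlib alone — note that `.nii.gz` must be stripped before `.nii`, since `Path.suffix` alone would mis-split the double extension:

```python
import tempfile
from pathlib import Path

# Stand-in directory mimicking data/raw/mri/: two volumes plus non-NIfTI files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("subject_1.nii.gz", "subject_0.nii", "sites.csv", "notes.txt"):
        (root / name).touch()

    # Same selection predicate as _list_nifti_volumes above.
    volumes = sorted(
        p for p in root.iterdir()
        if p.suffix == ".nii" or p.name.endswith(".nii.gz")
    )
    # Same subject_id derivation: strip .nii.gz first, then .nii.
    subject_ids = [
        p.name.removesuffix(".nii.gz").removesuffix(".nii") for p in volumes
    ]

print(subject_ids)  # ['subject_0', 'subject_1'] — sites.csv and notes.txt excluded
```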

- [ ] **Step 4: Run all tests**

```bash
pytest -v
```

Expected: **99 PASS** (67 prior + 32 MRI: 7 valid + 6 mask + 6 features + 6 combat + 7 run_pipeline).

- [ ] **Step 5: Commit**

```bash
git add tests/pipelines/test_mri_pipeline.py src/pipelines/mri_pipeline.py
git commit -m "feat(mri): add run_pipeline orchestrator + CLI (NIfTI dir → ComBat Parquet)"
```

---

## Task 7: AGENTS.md + README updates

**Files:**
- Modify: `AGENTS.md`
- Modify: `README.md`

- [ ] **Step 1: Update AGENTS.md §1 pipeline table**

Find the existing line:
```
| Image (MRI / fMRI) | `src/pipelines/mri_pipeline.py` *(planned, Day 3)* | ComBat Harmonization for site-level domain shift |
```
Replace with:
```
| Image (MRI / fMRI) | `src/pipelines/mri_pipeline.py` | ComBat Harmonization for site-level domain shift |
```

- [ ] **Step 2: Update README.md Status table**

Find:
```
| 3 | Image (MRI / fMRI) | `src/pipelines/mri_pipeline.py` | Planned (ComBat harmonization) |
```
Replace with:
```
| 3 | Image (MRI / fMRI) | `src/pipelines/mri_pipeline.py` | Shipped — 99 tests green |
```

Bump any test-count references elsewhere in the README from 67 to 99.

- [ ] **Step 3: Add MRI Pipeline section to README.md**

Directly after the EEG Pipeline (Day 2) section, append:

```markdown
## MRI Pipeline (Day 3)

| Function | Purpose |
|---|---|
| `is_valid_volume(volume)` | Returns True iff input is a finite, numeric, non-empty 3-D ndarray. Rejects NaN/inf, non-numeric dtypes, lists/scalars. |
| `mask_brain(volume, intensity_threshold)` | Two-step brain mask: intensity threshold (default = volume mean) + morphological opening to drop isolated noise voxels. Non-mutating. |
| `extract_features_from_volume(volume, mask, n_roi_axes)` | Partitions the masked volume into `prod(n_roi_axes)` axis-aligned octants (default 2×2×2 = 8) and emits 6 stats per ROI: mean / std / p10 / p50 / p90 / voxel_count. Empty ROIs report 0.0 (no NaN survives). |
| `harmonize_combat(features, sites, feature_cols)` | Wraps `neuroHarmonize.harmonizationLearn` to remove per-site additive bias on the named columns. RNG-free → byte-deterministic. Raises if fewer than 2 sites. |
| `run_pipeline(input_dir, sites_csv, output_path, ...)` | End-to-end NIfTI directory → ComBat-harmonized Parquet orchestrator. Drops invalid volumes with a logged WARNING. Idempotent; raises on missing input or directory output. |

Output schema: one row per surviving subject with columns `subject_id, site, feat_roi{i}_<stat>`. All `feat_*` columns are float64 (preserved through the Parquet round-trip).
```
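The schema described above is fixed-width. A quick sanity sketch that reconstructs the implied 48 feature-column names, mirroring the comprehension used in `run_pipeline`:

```python
import numpy as np

# Constants as defined in src/pipelines/mri_pipeline.py per this plan.
ROI_STATS = ("mean", "std", "p10", "p50", "p90", "voxel_count")
n_roi_axes = (2, 2, 2)  # DEFAULT_N_ROI_AXES → 8 octants

feature_cols = [
    f"feat_roi{i}_{stat}"
    for i in range((int(np.prod(n_roi_axes))))
    for stat in ROI_STATS
]

print(len(feature_cols))                      # 48 (8 ROIs × 6 stats)
print(feature_cols[0], feature_cols[-1])      # feat_roi0_mean feat_roi7_voxel_count
```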

- [ ] **Step 4: Add MRI smoke-run to README Quick Start**

After the EEG smoke-run block, append:

```bash
# Smoke-test the MRI pipeline with the bundled fixture (6 subjects × 2 sites)
mkdir -p data/raw/mri
cp -r tests/fixtures/mri_sample/* data/raw/mri/
python -m src.pipelines.mri_pipeline
```

Then note below the block that the result lands at `data/processed/mri_features.parquet`.

- [ ] **Step 5: Update README Roadmap**

Find the Day-3 bullet and convert it to the past-tense "shipped" form the section's other Day entries already use.

- [ ] **Step 6: Commit**

```bash
git add AGENTS.md README.md
git commit -m "docs: mark MRI pipeline shipped; add Day-3 smoke run + function reference"
```

---

## Task 8: DoD verification + smoke run

**Files:** none modified (verification only).

- [ ] **Step 1: Full test suite green**

```bash
cd /Users/mertgungor/Desktop/hackathon
source .venv312/bin/activate
pytest -v --tb=short
```

Required: **99 passed**, 0 failed, 0 skipped, 0 warnings.

- [ ] **Step 2: CLI smoke run + idempotency**

```bash
mkdir -p data/raw/mri
cp -r tests/fixtures/mri_sample/* data/raw/mri/
rm -f data/processed/mri_features.parquet

python -m src.pipelines.mri_pipeline
md5_run1=$(md5 -q data/processed/mri_features.parquet 2>/dev/null || md5sum data/processed/mri_features.parquet | awk '{print $1}')

python -m src.pipelines.mri_pipeline
md5_run2=$(md5 -q data/processed/mri_features.parquet 2>/dev/null || md5sum data/processed/mri_features.parquet | awk '{print $1}')

echo "MD5 run1: $md5_run1"
echo "MD5 run2: $md5_run2"
```

Required: matching MD5s.
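The md5/md5sum fallback above is macOS/Linux portability glue; the same check can be done portably in pure Python with `hashlib`. A sketch on stand-in files — in a real run, point `artifact` at `data/processed/mri_features.parquet` and re-hash between pipeline invocations:

```python
import hashlib
import tempfile
from pathlib import Path

def digest(path: Path) -> str:
    """SHA-256 of the file's bytes — any byte difference changes the hex digest."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "mri_features.parquet"  # stand-in for the real output
    artifact.write_bytes(b"run-1 bytes")
    run1 = digest(artifact)
    artifact.write_bytes(b"run-1 bytes")  # a byte-deterministic second run
    run2 = digest(artifact)

print(run1 == run2)  # True — identical bytes hash identically
```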

- [ ] **Step 3: Verify schema + ComBat effect**

```bash
python -c "
import pandas as pd
import numpy as np
df = pd.read_parquet('data/processed/mri_features.parquet')
print('rows:', len(df))
print('cols:', df.shape[1])
print('subject_id present:', 'subject_id' in df.columns)
print('site present:', 'site' in df.columns)
print('feat_roi*:', sum(c.startswith('feat_roi') for c in df.columns))
print('all feats float64:', all(df[c].dtype.name == 'float64' for c in df.columns if c.startswith('feat_')))
print('any NaN/inf:', df.isna().any().any() or not np.isfinite(df.select_dtypes(include=[np.number]).to_numpy()).all())
gap = abs(df.groupby('site')['feat_roi0_mean'].mean().diff().iloc[-1])
print(f'site gap on feat_roi0_mean: {gap:.3f} (must be < 1.0 after ComBat)')
"
```

Required:
- rows = 6
- 48 `feat_roi*` columns (8 ROIs × 6 stats)
- all features float64
- no NaN/inf
- site gap < 1.0 (vs. ~5.0 before harmonization)

- [ ] **Step 4: Confirm gitignore covers MRI artifacts**

```bash
git check-ignore -v data/raw/mri/subject_0.nii.gz data/processed/mri_features.parquet
git status
```

Both paths must be ignored and the working tree clean.

- [ ] **Step 5: Day-1 + Day-2 regression**

```bash
# BBB still works
cp tests/fixtures/bbbp_sample.csv data/raw/bbbp.csv
python -m src.pipelines.bbb_pipeline 2>&1 | tail -1

# EEG still works
cp tests/fixtures/eeg_sample.fif data/raw/eeg.fif
python -m src.pipelines.eeg_pipeline 2>&1 | tail -1

ls -lh data/processed/
```

Required: all three Parquet files (`bbbp_features.parquet`, `eeg_features.parquet`, `mri_features.parquet`) present.

---

## Day-3 Definition of Done

- [ ] `src/pipelines/mri_pipeline.py` exposes `is_valid_volume`, `mask_brain`, `harmonize_combat`, `extract_features_from_volume`, `run_pipeline`, plus `DEFAULT_N_ROI_AXES`, `ROI_STATS`, `DEFAULT_INPUT`, `DEFAULT_OUTPUT`.
- [ ] `python -m src.pipelines.mri_pipeline` against `data/raw/mri/` (with `sites.csv`) produces a deterministic Parquet at `data/processed/mri_features.parquet`.
- [ ] Invalid volumes (NaN/inf) logged with their subject ids and dropped (Data Readiness §4 rule 2).
- [ ] ComBat shrinks the per-site mean gap on harmonized columns by >5× (verified by `test_run_pipeline_reduces_site_gap` + DoD §3).
- [ ] Same input → byte-identical Parquet across runs (rule 3).
- [ ] Per-volume schema: `feat_roi{i}_<stat>` for `i in 0..7`, `<stat>` in `(mean, std, p10, p50, p90, voxel_count)`.
- [ ] Float64 dtype preserved through the Parquet round-trip.
- [ ] Test suite: **99 passing**, 0 failures, 0 warnings.
- [ ] BBB (Day 1) and EEG (Day 2) regressions pass.
- [ ] AGENTS.md §1 MRI row no longer says "(planned, Day 3)".
- [ ] README Status table marks MRI shipped; MRI Pipeline reference section + Quick Start smoke run added.
- [ ] At least 7 atomic Day-3 commits (1 fixture + 5 TDD features + 1 docs + verification).