docs: mark EEG pipeline shipped; add Day-2 smoke run + function reference
README.md
CHANGED
@@ -11,7 +11,7 @@ and Docker shipping.
 | Day | Modality | Pipeline | Status |
 |-----|----------|----------|--------|
 | 1 | Tabular (BBB / molecules) | [`bbb_pipeline.py`](src/pipelines/bbb_pipeline.py) | Shipped — 30 tests green |
-| 2 | Signal (EEG) | [`eeg_pipeline.py`](src/pipelines/eeg_pipeline.py) |
+| 2 | Signal (EEG) | [`eeg_pipeline.py`](src/pipelines/eeg_pipeline.py) | Shipped — 67 tests green |
 | 3 | Image (MRI / fMRI) | [`mri_pipeline.py`](src/pipelines/mri_pipeline.py) | Planned (ComBat harmonization) |
 
 ## Quick Start
@@ -23,7 +23,7 @@
 # 1. Create venv and install
 python3.12 -m venv .venv312 && source .venv312/bin/activate && pip install -r requirements.txt
 
-# 2. Verify — expect
+# 2. Verify — expect 67 passed
 pytest -v
 
 # 3. Smoke run with the bundled 6-row fixture
@@ -34,6 +34,17 @@ python -m src.pipelines.bbb_pipeline
 python -c "import pandas as pd; df = pd.read_parquet('data/processed/bbbp_features.parquet'); print(df.shape, df.dtypes.head())"
 ```
 
+Result lives at `data/processed/bbbp_features.parquet`.
+
+```bash
+# Smoke-test the EEG pipeline with the bundled fixture (5 ch synthetic .fif)
+mkdir -p data/raw
+cp tests/fixtures/eeg_sample.fif data/raw/eeg.fif
+python -m src.pipelines.eeg_pipeline
+```
+
+Result lives at `data/processed/eeg_features.parquet`.
+
 > **Real BBBP data:** not bundled (gitignored). Download from
 > [Kaggle](https://www.kaggle.com/datasets/priyanagda/bbbp) or
 > [MoleculeNet](https://moleculenet.org/datasets-1); place as `data/raw/bbbp.csv`.
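The `python -c` check above prints the feature table's shape and dtypes; for the EEG output the commit documents a deterministic column scheme (`feat_<channel>_psd_<band>` followed by `feat_<channel>_<stat>`). A quick sketch of that layout, using hypothetical channel names (the bundled fixture's actual names may differ):

```python
# Hypothetical channel list -- the fixture's real channel names may differ.
channels = ["Fp1", "Fp2", "Cz", "Pz", "Oz"]
bands = ["delta", "theta", "alpha", "beta", "gamma"]
stats = ["mean", "std", "var", "skew", "kurtosis"]

# Deterministic column order: PSD band powers first, then statistical
# moments, grouped per channel -- same input, same columns, every run.
columns = [f"feat_{ch}_psd_{band}" for ch in channels for band in bands]
columns += [f"feat_{ch}_{stat}" for ch in channels for stat in stats]

print(len(columns), columns[0], columns[-1])
```

With 5 channels this yields 50 feature columns (5 × 5 band powers plus 5 × 5 moments); the exact grouping order is an assumption here, not taken from the shipped code.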
@@ -74,6 +85,19 @@ All four functions log via `src.core.logger.get_logger(__name__)` per AGENTS.md
 satisfy the §4 Data Readiness contract (5 invariants: schema validity, domain validity,
 determinism, traceability, idempotence).
 
+## EEG Pipeline (Day 2)
+
+| Function | Purpose |
+|---|---|
+| `is_valid_epoch(epoch)` | Returns True iff the input is a finite, numeric, non-empty 2-D array. Rejects NaN/inf, non-numeric dtypes, lists/scalars. |
+| `bandpass_filter(raw, l_freq, h_freq)` | Non-mutating MNE bandpass (default 1–40 Hz). Raises ValueError on inverted frequency range. |
+| `remove_artifacts_with_ica(raw, eog_ch_name, n_components, random_state)` | Seeded ICA + correlation-based EOG component rejection. Skips gracefully (no-op + WARNING) on missing/typo EOG channel or NaN-contaminated data. |
+| `compute_features_from_epoch(epoch, sfreq)` | Per-channel PSD bands (delta/theta/alpha/beta/gamma) + 5 statistical moments (mean/std/var/skew/kurtosis). Constant-channel safe (NaN-cleaned). |
+| `extract_features_from_recording(raw, epoch_duration_s, eog_ch_name, n_components, random_state)` | Chains filter → ICA → epoching → feature extraction. Drops invalid epochs (logged WARNING with truncated index list). Returns 2-D `pd.DataFrame` with deterministic `feat_<channel>_psd_<band>` and `feat_<channel>_<stat>` columns. |
+| `run_pipeline(input_path, output_path, ...)` | End-to-end FIF/EDF → Parquet orchestrator. Idempotent; raises on missing input or directory output. |
+
+The pipeline is seeded (`random_state=97`) and produces byte-identical Parquet output for the same input — satisfying the §4 Determinism contract. Output is float64, preserved through the Parquet round-trip.
+
 ## Storage Format
 
 Pipeline outputs are written as Parquet files using the `pyarrow` engine with snappy
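The `is_valid_epoch` gate documented in the commit's function reference can be pictured with a short sketch. This is an illustrative reimplementation of the described behaviour (finite, numeric, non-empty, 2-D), not the shipped code:

```python
import numpy as np

def is_valid_epoch(epoch) -> bool:
    """Accept only a finite, numeric, non-empty 2-D array (channels x samples)."""
    if not isinstance(epoch, np.ndarray):          # rejects lists and scalars
        return False
    if epoch.ndim != 2 or epoch.size == 0:         # must be non-empty 2-D
        return False
    if not np.issubdtype(epoch.dtype, np.number):  # rejects object/str dtypes
        return False
    return bool(np.isfinite(epoch).all())          # rejects NaN and inf

print(is_valid_epoch(np.zeros((5, 256))))          # a clean 5-channel epoch passes
print(is_valid_epoch(np.array([[1.0, np.nan]])))   # NaN contamination is rejected
```

Ordering the cheap shape/type checks before the `isfinite` scan keeps the common rejection paths allocation-free.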
@@ -90,8 +114,7 @@ finishes in under 2 seconds on a 2024 laptop.
 
 ## Roadmap
 
-- **Day 2:** `eeg_pipeline.py` —
-`float32` features to Parquet.
+- **Day 2 (shipped):** `eeg_pipeline.py` — bandpass + MNE ICA artifact removal + PSD + statistical features → Parquet.
 - **Day 3:** `mri_pipeline.py` — load NIfTI volumes, ComBat harmonization
 (`neuroharmonize`) for site-level domain shift, write features to Parquet.
 - **Day 4+:** FastAPI surface in `src/api/`, MLflow experiment tracking, Docker images,
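The PSD step named in the Day-2 entry (and in `compute_features_from_epoch`) amounts to integrating a power spectrum over the five canonical bands. A minimal sketch using Welch's method with default settings, which the shipped pipeline may not share:

```python
import numpy as np
from scipy.signal import welch

# Canonical EEG bands in Hz, matching the delta..gamma set in the README.
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 40)}

def band_powers(channel: np.ndarray, sfreq: float) -> dict:
    """Approximate per-band power for one channel via a Welch PSD."""
    freqs, psd = welch(channel, fs=sfreq, nperseg=min(256, channel.size))
    df = freqs[1] - freqs[0]
    # Rectangle-rule integral of the PSD over each band's frequency range.
    return {name: float(psd[(freqs >= lo) & (freqs < hi)].sum() * df)
            for name, (lo, hi) in BANDS.items()}

# A pure 10 Hz sine should concentrate its power in the alpha band (8-13 Hz).
sfreq = 256.0
t = np.arange(0, 4, 1 / sfreq)
powers = band_powers(np.sin(2 * np.pi * 10 * t), sfreq)
print(max(powers, key=powers.get))
```

The real feature extractor also appends the five statistical moments per channel; this sketch covers only the spectral half.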