mekosotto (Claude Sonnet 4.6) committed
Commit ff35cee · 1 parent: ea055f0

docs: mark EEG pipeline shipped; add Day-2 smoke run + function reference

Files changed (1): README.md (+27 −4)
README.md CHANGED
@@ -11,7 +11,7 @@ and Docker shipping.
 | Day | Modality | Pipeline | Status |
 |-----|----------|----------|--------|
 | 1 | Tabular (BBB / molecules) | [`bbb_pipeline.py`](src/pipelines/bbb_pipeline.py) | Shipped — 30 tests green |
-| 2 | Signal (EEG) | [`eeg_pipeline.py`](src/pipelines/eeg_pipeline.py) | Planned (MNE-Python + ICA) |
 | 3 | Image (MRI / fMRI) | [`mri_pipeline.py`](src/pipelines/mri_pipeline.py) | Planned (ComBat harmonization) |

 ## Quick Start
@@ -23,7 +23,7 @@ and Docker shipping.
 # 1. Create venv and install
 python3.12 -m venv .venv312 && source .venv312/bin/activate && pip install -r requirements.txt

-# 2. Verify — expect 30 passed
 pytest -v

 # 3. Smoke run with the bundled 6-row fixture
@@ -34,6 +34,17 @@ python -m src.pipelines.bbb_pipeline
 python -c "import pandas as pd; df = pd.read_parquet('data/processed/bbbp_features.parquet'); print(df.shape, df.dtypes.head())"
 ```

 > **Real BBBP data:** not bundled (gitignored). Download from
 > [Kaggle](https://www.kaggle.com/datasets/priyanagda/bbbp) or
 > [MoleculeNet](https://moleculenet.org/datasets-1); place as `data/raw/bbbp.csv`.
@@ -74,6 +85,19 @@ All four functions log via `src.core.logger.get_logger(__name__)` per AGENTS.md
 satisfy the §4 Data Readiness contract (5 invariants: schema validity, domain validity,
 determinism, traceability, idempotence).

 ## Storage Format

 Pipeline outputs are written as Parquet files using the `pyarrow` engine with snappy
@@ -90,8 +114,7 @@ finishes in under 2 seconds on a 2024 laptop.

 ## Roadmap

-- **Day 2:** `eeg_pipeline.py` — load EDF/FIF, MNE-Python ICA artifact removal, write
-  `float32` features to Parquet.
 - **Day 3:** `mri_pipeline.py` — load NIfTI volumes, ComBat harmonization
   (`neuroharmonize`) for site-level domain shift, write features to Parquet.
 - **Day 4+:** FastAPI surface in `src/api/`, MLflow experiment tracking, Docker images,
 
 | Day | Modality | Pipeline | Status |
 |-----|----------|----------|--------|
 | 1 | Tabular (BBB / molecules) | [`bbb_pipeline.py`](src/pipelines/bbb_pipeline.py) | Shipped — 30 tests green |
+| 2 | Signal (EEG) | [`eeg_pipeline.py`](src/pipelines/eeg_pipeline.py) | Shipped — 67 tests green |
 | 3 | Image (MRI / fMRI) | [`mri_pipeline.py`](src/pipelines/mri_pipeline.py) | Planned (ComBat harmonization) |

 ## Quick Start
 
 # 1. Create venv and install
 python3.12 -m venv .venv312 && source .venv312/bin/activate && pip install -r requirements.txt

+# 2. Verify — expect 67 passed
 pytest -v

 # 3. Smoke run with the bundled 6-row fixture
 
 python -c "import pandas as pd; df = pd.read_parquet('data/processed/bbbp_features.parquet'); print(df.shape, df.dtypes.head())"
 ```

+Result lives at `data/processed/bbbp_features.parquet`.
+
+```bash
+# Smoke-test the EEG pipeline with the bundled fixture (5-channel synthetic .fif)
+mkdir -p data/raw
+cp tests/fixtures/eeg_sample.fif data/raw/eeg.fif
+python -m src.pipelines.eeg_pipeline
+```
+
+Result lives at `data/processed/eeg_features.parquet`.
+
 > **Real BBBP data:** not bundled (gitignored). Download from
 > [Kaggle](https://www.kaggle.com/datasets/priyanagda/bbbp) or
 > [MoleculeNet](https://moleculenet.org/datasets-1); place as `data/raw/bbbp.csv`.
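
The EEG smoke run writes its features using the column scheme documented in the function reference: `feat_<channel>_psd_<band>` for band powers and `feat_<channel>_<stat>` for statistical moments. A minimal sketch of that scheme (the channel labels `Fp1`/`Fp2` are illustrative, not taken from the fixture):

```python
# Sketch of the documented feature-column scheme; channel labels are hypothetical.
channels = ["Fp1", "Fp2"]
bands = ["delta", "theta", "alpha", "beta", "gamma"]
stats = ["mean", "std", "var", "skew", "kurtosis"]

cols = [f"feat_{ch}_psd_{b}" for ch in channels for b in bands]
cols += [f"feat_{ch}_{s}" for ch in channels for s in stats]

print(len(cols))            # 2 channels x (5 bands + 5 stats) = 20
print(cols[0], cols[-1])    # feat_Fp1_psd_delta feat_Fp2_kurtosis
```

By the same arithmetic, the 5-channel fixture would yield 50 feature columns per epoch row.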
 
 satisfy the §4 Data Readiness contract (5 invariants: schema validity, domain validity,
 determinism, traceability, idempotence).

+## EEG Pipeline (Day 2)
+
+| Function | Purpose |
+|---|---|
+| `is_valid_epoch(epoch)` | Returns True iff the input is a finite, numeric, non-empty 2-D array. Rejects NaN/inf, non-numeric dtypes, lists/scalars. |
+| `bandpass_filter(raw, l_freq, h_freq)` | Non-mutating MNE bandpass (default 1–40 Hz). Raises ValueError on an inverted frequency range. |
+| `remove_artifacts_with_ica(raw, eog_ch_name, n_components, random_state)` | Seeded ICA + correlation-based EOG component rejection. Skips gracefully (no-op + WARNING) on a missing/typo EOG channel or NaN-contaminated data. |
+| `compute_features_from_epoch(epoch, sfreq)` | Per-channel PSD bands (delta/theta/alpha/beta/gamma) + 5 statistical moments (mean/std/var/skew/kurtosis). Constant-channel safe (NaN-cleaned). |
+| `extract_features_from_recording(raw, epoch_duration_s, eog_ch_name, n_components, random_state)` | Chains filter → ICA → epoching → feature extraction. Drops invalid epochs (logged WARNING with a truncated index list). Returns a 2-D `pd.DataFrame` with deterministic `feat_<channel>_psd_<band>` and `feat_<channel>_<stat>` columns. |
+| `run_pipeline(input_path, output_path, ...)` | End-to-end FIF/EDF → Parquet orchestrator. Idempotent; raises on missing input or a directory output path. |
+
+The pipeline is seeded (`random_state=97`) and produces byte-identical Parquet output for the same input — satisfying the §4 Determinism contract. Output is float64, preserved through the Parquet round-trip.
+
 ## Storage Format

 Pipeline outputs are written as Parquet files using the `pyarrow` engine with snappy
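
The per-band PSD features described for `compute_features_from_epoch` amount to summing spectral power over the five canonical EEG bands. A dependency-free sketch of that idea (not the repo's implementation, which goes through MNE; band edges follow the 1–40 Hz range stated above):

```python
import math

def band_powers(signal, sfreq, bands):
    """Naive DFT-based per-band power; illustrative only, not the repo's code."""
    n = len(signal)
    powers = {name: 0.0 for name in bands}
    for k in range(1, n // 2):  # skip DC
        freq = k * sfreq / n
        re = sum(signal[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(signal[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        power = (re * re + im * im) / n
        for name, (lo, hi) in bands.items():
            if lo <= freq < hi:
                powers[name] += power
    return powers

bands = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 40)}
sfreq, n = 128.0, 256
alpha_wave = [math.sin(2 * math.pi * 10 * t / sfreq) for t in range(n)]  # 10 Hz sine
p = band_powers(alpha_wave, sfreq, bands)
print(max(p, key=p.get))  # -> alpha (10 Hz falls in the 8-13 Hz band)
```

A pure 10 Hz signal concentrates its power in the alpha band, which is the kind of spectral summary each `feat_<channel>_psd_<band>` column carries.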
 

 ## Roadmap

+- **Day 2 (shipped):** `eeg_pipeline.py` — bandpass + MNE ICA artifact removal + PSD + statistical features → Parquet.
 - **Day 3:** `mri_pipeline.py` — load NIfTI volumes, ComBat harmonization
   (`neuroharmonize`) for site-level domain shift, write features to Parquet.
 - **Day 4+:** FastAPI surface in `src/api/`, MLflow experiment tracking, Docker images,
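
The byte-identical-output claim added in this commit rests on fixing every random seed. A toy illustration of that contract (`random_state=97` mirrors the README; the function itself is hypothetical, not from the repo):

```python
import random

def ica_like_step(random_state=97, n=4):
    # Hypothetical stand-in for any seeded stochastic step (e.g. ICA init):
    # the same seed must always yield the same numbers.
    rng = random.Random(random_state)
    return [round(rng.random(), 6) for _ in range(n)]

assert ica_like_step() == ica_like_step()  # identical across reruns
print("deterministic reruns:", ica_like_step() == ica_like_step())
```

When every stochastic step is seeded this way, rerunning the pipeline on the same input reproduces the same feature values, and hence the same Parquet bytes.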