mekosotto committed on
Commit
dc611c3
·
1 Parent(s): eccd68f

docs: add Day 2 EEG MNE+ICA pipeline plan

Add a comprehensive implementation plan for NeuroBridge Day 2: an EEG pipeline using MNE + ICA. The new markdown (docs/superpowers/plans/2026-04-30-day2-eeg-mne-ica-pipeline.md) specifies goals, architecture, public API (is_valid_epoch, bandpass_filter, remove_artifacts_with_ica, compute_features_from_epoch, extract_features_from_recording, run_pipeline), tech stack, file layout, TDD tasks (fixture, unit tests, feature implementation, orchestrator/CLI), expected behavior, logging, determinism requirements, Parquet output schema, and a Definition of Done checklist. The document provides step-by-step tasks and expected test outcomes to guide development and verification.

docs/superpowers/plans/2026-04-30-day2-eeg-mne-ica-pipeline.md ADDED
@@ -0,0 +1,1281 @@
+ # NeuroBridge Day 2 — EEG MNE+ICA Pipeline Implementation Plan
+
+ > **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+ **Goal:** Insider One Hackathon Day 2 — ship a production-grade EEG pipeline that loads raw recordings, bandpass-filters, removes EOG artifacts via ICA, slices into epochs, computes per-band PSD + statistical features, flattens to a 2D table, and persists as Parquet.
+
+ **Architecture:** Modular `src/pipelines/eeg_pipeline.py` mirroring Day 1's BBB four-public-function pattern: a small validity primitive (`is_valid_epoch`), three pure transformers (`bandpass_filter`, `remove_artifacts_with_ica`, `compute_features_from_epoch`), one DataFrame-emitting layer (`extract_features_from_recording`), and one I/O orchestrator (`run_pipeline`). All logging goes through `src.core.logger.get_logger`. Output is Parquet per AGENTS.md §6. Tests use a deterministic synthetic `mne.io.RawArray` fixture so the suite stays under 5 s on a laptop.
+
+ **Tech Stack:** Python 3.10–3.12, `mne==1.7.1`, NumPy, SciPy (`scipy.signal.welch`, `scipy.stats.skew`, `scipy.stats.kurtosis`), scikit-learn (required by MNE's `fastica` ICA method), Pandas, PyArrow, Pytest.
+
+ ---
+
+ ## File Structure
+
+ | Path | Responsibility |
+ |---|---|
+ | `src/pipelines/eeg_pipeline.py` | Public API (`is_valid_epoch`, `bandpass_filter`, `remove_artifacts_with_ica`, `compute_features_from_epoch`, `extract_features_from_recording`, `run_pipeline`) + `DEFAULT_INPUT` / `DEFAULT_OUTPUT` + `__main__` CLI. |
+ | `tests/pipelines/test_eeg_pipeline.py` | Unit + integration tests; one class per public function. |
+ | `tests/fixtures/eeg_sample.fif` | Deterministic synthetic Raw (5 ch, 256 Hz, 10 s) with seeded sine signals + EOG-like blinks; built once on disk via a tiny build script. |
+ | `tests/fixtures/build_eeg_fixture.py` | Standalone script that regenerates `eeg_sample.fif` from a fixed seed; committed alongside the .fif so anyone can reproduce. |
+ | `AGENTS.md` | Update §1 pipeline table row for EEG to "Shipped". |
+ | `README.md` | Update Status table EEG row to "Shipped"; bump test count. |
+
+ The `eeg_pipeline.py` module is expected to land at ~250–280 lines after Task 7. We do not split into submodules at this stage — Day 1's BBB pattern works well at this size.
+
+ ---
+
+ ## Public API contract (defined here so tasks reference one source of truth)
+
+ ```python
+ EEG_BANDS: dict[str, tuple[float, float]] = {
+     "delta": (1.0, 4.0),
+     "theta": (4.0, 8.0),
+     "alpha": (8.0, 13.0),
+     "beta": (13.0, 30.0),
+     "gamma": (30.0, 40.0),
+ }
+ STATS: tuple[str, ...] = ("mean", "std", "var", "skew", "kurtosis")
+
+ def is_valid_epoch(epoch: object) -> bool: ...
+ def bandpass_filter(raw: mne.io.BaseRaw, l_freq: float = 1.0, h_freq: float = 40.0) -> mne.io.BaseRaw: ...
+ def remove_artifacts_with_ica(
+     raw: mne.io.BaseRaw,
+     eog_ch_name: str | None = None,
+     n_components: int = 15,
+     random_state: int = 97,
+ ) -> mne.io.BaseRaw: ...
+ def compute_features_from_epoch(epoch: np.ndarray, sfreq: float) -> np.ndarray: ...
+ def extract_features_from_recording(
+     raw: mne.io.BaseRaw,
+     epoch_duration_s: float = 2.0,
+     eog_ch_name: str | None = None,
+     n_components: int = 15,
+     random_state: int = 97,
+ ) -> pd.DataFrame: ...
+ def run_pipeline(
+     input_path: Path = DEFAULT_INPUT,
+     output_path: Path = DEFAULT_OUTPUT,
+     epoch_duration_s: float = 2.0,
+     eog_ch_name: str | None = None,
+     n_components: int = 15,
+     random_state: int = 97,
+ ) -> None: ...
+ ```
+
+ Per-epoch feature vector shape: `(n_channels * (len(EEG_BANDS) + len(STATS)),)` = `n_channels * 10` floats. For 4 EEG channels → 40 features per epoch. Column names: `feat_<channel_name>_psd_<band>` and `feat_<channel_name>_<stat>`. Features are laid out channel by channel in the recording's channel order (bands in `EEG_BANDS` order, then stats in `STATS` order), so the layout is deterministic given a fixed channel order; it is not alphabetical.
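
The naming scheme above can be sketched as a small generator. This is a minimal illustration, not the plan's implementation: `feature_column_names` is a hypothetical helper, and the band and stat names are copied from the contract as plain tuples.

```python
EEG_BANDS = ("delta", "theta", "alpha", "beta", "gamma")
STATS = ("mean", "std", "var", "skew", "kurtosis")


def feature_column_names(ch_names):
    """Hypothetical helper: names in the same order as the per-epoch vector."""
    names = []
    for ch in ch_names:
        # Per channel: five PSD band columns first, then five statistics.
        names.extend(f"feat_{ch}_psd_{band}" for band in EEG_BANDS)
        names.extend(f"feat_{ch}_{stat}" for stat in STATS)
    return names


columns = feature_column_names(["Cz", "Pz", "O1", "O2"])
print(len(columns))   # 40
print(columns[:2])    # ['feat_Cz_psd_delta', 'feat_Cz_psd_theta']
```

Deriving the names from the same two sequences that drive feature computation keeps the column order and the vector order from drifting apart.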
+
+ ---
+
+ ## Task 1: EEG Test Fixture (deterministic synthetic .fif)
+
+ **Files:**
+ - Create: `tests/fixtures/build_eeg_fixture.py`
+ - Create: `tests/fixtures/eeg_sample.fif` (regenerated by the script above)
+
+ - [ ] **Step 1: Write the fixture-builder script**
+
+ Create `/Users/mertgungor/Desktop/hackathon/tests/fixtures/build_eeg_fixture.py`:
+ ```python
+ """Generate a deterministic synthetic MNE Raw fixture for EEG pipeline tests.
+
+ The fixture is committed to the repo alongside this script so test runs are
+ reproducible without re-running the script. Re-run only if the contract changes.
+
+ Channels: 4 EEG (Cz, Pz, O1, O2) + 1 EOG (EOG061).
+ Sampling rate: 256 Hz. Duration: 10 s.
+ Synthetic content: a 10 Hz alpha sine on each EEG channel, plus a 1.5 Hz EOG
+ "blink" injected on EOG061 and bleed-through on the frontal-most EEG channel
+ (Cz) so ICA has something to detect.
+ """
+ from __future__ import annotations
+
+ from pathlib import Path
+
+ import mne
+ import numpy as np
+
+
+ def build() -> Path:
+     rng = np.random.default_rng(seed=42)
+     sfreq = 256.0
+     duration_s = 10.0
+     n_samples = int(sfreq * duration_s)
+     t = np.arange(n_samples) / sfreq
+
+     # Base alpha (10 Hz) + small white noise on every EEG channel.
+     eeg_alpha = np.sin(2 * np.pi * 10.0 * t)
+     eeg_noise = rng.standard_normal((4, n_samples)) * 1e-6
+     eeg = (eeg_alpha[None, :] * 1e-5) + eeg_noise
+
+     # EOG blink: low-frequency square-ish pulse train at ~1.5 Hz.
+     eog_pulse = (np.sin(2 * np.pi * 1.5 * t) > 0.95).astype(float) * 1e-4
+
+     # Bleed EOG into Cz (channel 0) so ICA finds an EOG-correlated component.
+     eeg[0] += 0.3 * eog_pulse
+
+     data = np.vstack([eeg, eog_pulse[None, :]])  # shape: (5, n_samples)
+
+     info = mne.create_info(
+         ch_names=["Cz", "Pz", "O1", "O2", "EOG061"],
+         sfreq=sfreq,
+         ch_types=["eeg", "eeg", "eeg", "eeg", "eog"],
+     )
+     raw = mne.io.RawArray(data, info, verbose="ERROR")
+
+     out = Path(__file__).parent / "eeg_sample.fif"
+     raw.save(out, overwrite=True, verbose="ERROR")
+     return out
+
+
+ if __name__ == "__main__":
+     p = build()
+     print(f"Wrote {p}")
+ ```
+
+ - [ ] **Step 2: Run the script to generate the .fif**
+
+ ```bash
+ cd /Users/mertgungor/Desktop/hackathon
+ source .venv312/bin/activate
+ python tests/fixtures/build_eeg_fixture.py
+ ```
+ Expected: prints `Wrote .../tests/fixtures/eeg_sample.fif`. File size ~50 KB.
+
+ - [ ] **Step 3: Sanity-check the fixture**
+
+ ```bash
+ python -c "
+ import mne
+ raw = mne.io.read_raw_fif('tests/fixtures/eeg_sample.fif', preload=True, verbose='ERROR')
+ print('ch_names:', raw.ch_names)
+ print('sfreq:', raw.info['sfreq'])
+ print('n_times:', raw.n_times)
+ print('eog channel present:', 'EOG061' in raw.ch_names)
+ "
+ ```
+ Expected:
+ ```
+ ch_names: ['Cz', 'Pz', 'O1', 'O2', 'EOG061']
+ sfreq: 256.0
+ n_times: 2560
+ eog channel present: True
+ ```
+
+ - [ ] **Step 4: Commit**
+
+ ```bash
+ git add tests/fixtures/build_eeg_fixture.py tests/fixtures/eeg_sample.fif
+ git commit -m "test(eeg): add deterministic synthetic Raw fixture (5 ch, 256 Hz, 10 s)"
+ ```
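
To get a feel for how much "blink" the fixture contains, the pulse train can be inspected with NumPy alone. This sketch reuses the thresholded-sine expression from the builder script; the variable names are illustrative and nothing here touches MNE or the `.fif` file.

```python
import numpy as np

sfreq = 256.0
t = np.arange(int(sfreq * 10.0)) / sfreq
# Same expression as the fixture: True only near the crest of a 1.5 Hz sine.
blink = np.sin(2 * np.pi * 1.5 * t) > 0.95

# A rising edge (False -> True) marks the first sample of each pulse.
rising_edges = np.flatnonzero(np.diff(blink.astype(int)) == 1)
n_blinks = len(rising_edges)
duty_cycle = float(blink.mean())

print(n_blinks)               # 15 pulses in 10 s
print(round(duty_cycle, 3))   # roughly 0.1, i.e. about 10 % of samples
```

So the EOG channel is active for roughly a tenth of the recording, split across 15 short pulses.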
+
+ ---
+
+ ## Task 2: `is_valid_epoch` (TDD)
+
+ **Files:**
+ - Create: `tests/pipelines/test_eeg_pipeline.py` (new)
+ - Create: `src/pipelines/eeg_pipeline.py` (new)
+
+ - [ ] **Step 1: Write the failing tests**
+
+ Create `/Users/mertgungor/Desktop/hackathon/tests/pipelines/test_eeg_pipeline.py`:
+ ```python
+ """Unit + integration tests for the EEG pipeline."""
+ from __future__ import annotations
+
+ from pathlib import Path
+
+ import numpy as np
+ import pytest
+
+ from src.pipelines.eeg_pipeline import is_valid_epoch
+
+
+ FIXTURE = Path(__file__).parent.parent / "fixtures" / "eeg_sample.fif"
+
+
+ class TestIsValidEpoch:
+     def test_accepts_2d_finite_array(self) -> None:
+         epoch = np.zeros((4, 256), dtype=np.float64)
+         assert is_valid_epoch(epoch) is True
+
+     def test_rejects_wrong_dimension(self) -> None:
+         assert is_valid_epoch(np.zeros((4,))) is False
+         assert is_valid_epoch(np.zeros((4, 256, 2))) is False
+
+     def test_rejects_nan(self) -> None:
+         epoch = np.zeros((4, 256))
+         epoch[0, 10] = np.nan
+         assert is_valid_epoch(epoch) is False
+
+     def test_rejects_inf(self) -> None:
+         epoch = np.zeros((4, 256))
+         epoch[1, 5] = np.inf
+         assert is_valid_epoch(epoch) is False
+
+     def test_rejects_empty(self) -> None:
+         assert is_valid_epoch(np.zeros((0, 256))) is False
+         assert is_valid_epoch(np.zeros((4, 0))) is False
+
+     def test_rejects_non_array(self) -> None:
+         assert is_valid_epoch([[1, 2, 3]]) is False
+         assert is_valid_epoch(None) is False
+ ```
+
+ - [ ] **Step 2: Run tests to verify they fail**
+
+ ```bash
+ pytest tests/pipelines/test_eeg_pipeline.py -v
+ ```
+ Expected: collection failure on `from src.pipelines.eeg_pipeline import is_valid_epoch` → `ModuleNotFoundError`.
+
+ - [ ] **Step 3: Write the implementation**
+
+ Create `/Users/mertgungor/Desktop/hackathon/src/pipelines/eeg_pipeline.py`:
+ ```python
+ """EEG (electroencephalography) pipeline.
+
+ Loads raw recordings (FIF/EDF), bandpass-filters, removes EOG artifacts via
+ ICA, slices into fixed-duration epochs, computes per-band PSD + statistical
+ features, flattens to a 2D table, and writes a model-ready Parquet at
+ `data/processed/eeg_features.parquet`.
+
+ Follows the Data Readiness contract in AGENTS.md §4 and the Parquet storage
+ convention in §6: schema validity, domain validity (drop NaN/inf epochs with
+ a logged WARNING), determinism (seeded ICA + sklearn RNG), traceability
+ (in/out/dropped counts at INFO), and idempotent overwrite output.
+ """
+ from __future__ import annotations
+
+ import numpy as np
+
+ from src.core.logger import get_logger
+
+ logger = get_logger(__name__)
+
+
+ def is_valid_epoch(epoch: object) -> bool:
+     """Return True iff `epoch` is a non-empty 2-D NumPy array with no NaN/inf.
+
+     Used to drop corrupted segments before feature extraction. Defensive
+     against the full set of garbage we expect from real recordings: lists,
+     None, NaN/inf samples, zero-sized arrays. The dtype itself is not
+     checked; the array is assumed to be numeric.
+     """
+     if not isinstance(epoch, np.ndarray):
+         return False
+     if epoch.ndim != 2:
+         return False
+     if epoch.size == 0:
+         return False
+     if not np.all(np.isfinite(epoch)):
+         return False
+     return True
+ ```
+
+ - [ ] **Step 4: Run tests to verify they pass**
+
+ ```bash
+ pytest tests/pipelines/test_eeg_pipeline.py -v
+ ```
+ Expected: **6 PASS** in `TestIsValidEpoch`. Total suite: 36 (30 prior + 6).
+
+ - [ ] **Step 5: Commit**
+
+ ```bash
+ git add tests/pipelines/test_eeg_pipeline.py src/pipelines/eeg_pipeline.py
+ git commit -m "feat(eeg): add is_valid_epoch guard for NaN/inf/shape/dtype"
+ ```
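
As a rough illustration of where this guard sits, the sketch below cuts a plain array into fixed-length windows and drops the corrupted one. `slice_epochs` is a hypothetical helper; it assumes non-overlapping windows with any trailing remainder discarded, and it does not model the filtering or ICA stages.

```python
import numpy as np


def is_valid_epoch(epoch):
    # Same checks as the plan's guard, repeated so this sketch runs on its own.
    return (
        isinstance(epoch, np.ndarray)
        and epoch.ndim == 2
        and epoch.size > 0
        and bool(np.all(np.isfinite(epoch)))
    )


def slice_epochs(data, sfreq, epoch_duration_s):
    """Hypothetical helper: non-overlapping windows; a trailing remainder is dropped."""
    step = int(sfreq * epoch_duration_s)
    n_epochs = data.shape[1] // step
    return [data[:, i * step:(i + 1) * step] for i in range(n_epochs)]


data = np.zeros((4, 2560))   # 4 channels, 10 s at 256 Hz
data[0, -10] = np.nan        # corrupt one sample in the last 2 s window
epochs = slice_epochs(data, sfreq=256.0, epoch_duration_s=2.0)
kept = [e for e in epochs if is_valid_epoch(e)]
print(len(epochs), len(kept))   # 5 4
```

One bad sample invalidates only the window that contains it, which matches the row counts the recording-level tests expect (5 epochs, 4 after one drop).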
+
+ ---
+
+ ## Task 3: `bandpass_filter` (TDD)
+
+ **Files:**
+ - Modify: `tests/pipelines/test_eeg_pipeline.py`
+ - Modify: `src/pipelines/eeg_pipeline.py`
+
+ - [ ] **Step 1: Append the failing tests**
+
+ Update the merged import at the top of `tests/pipelines/test_eeg_pipeline.py`. Replace:
+ ```python
+ from src.pipelines.eeg_pipeline import is_valid_epoch
+ ```
+ with:
+ ```python
+ import mne
+
+ from src.pipelines.eeg_pipeline import (
+     bandpass_filter,
+     is_valid_epoch,
+ )
+ ```
+
+ Append at the end of `tests/pipelines/test_eeg_pipeline.py`:
+ ```python
+
+
+ class TestBandpassFilter:
+     def _load(self) -> mne.io.BaseRaw:
+         return mne.io.read_raw_fif(FIXTURE, preload=True, verbose="ERROR")
+
+     def test_returns_raw_instance(self) -> None:
+         raw = self._load()
+         out = bandpass_filter(raw, l_freq=1.0, h_freq=40.0)
+         assert isinstance(out, mne.io.BaseRaw)
+
+     def test_preserves_shape(self) -> None:
+         raw = self._load()
+         n_ch_before, n_t_before = raw.get_data().shape
+         out = bandpass_filter(raw, l_freq=1.0, h_freq=40.0)
+         assert out.get_data().shape == (n_ch_before, n_t_before)
+
+     def test_attenuates_dc_component(self) -> None:
+         """A bandpass with l_freq=1.0 must remove a DC offset."""
+         raw = self._load()
+         # Inject a large DC offset on every channel.
+         data = raw.get_data() + 1e-3
+         raw_dc = mne.io.RawArray(data, raw.info, verbose="ERROR")
+         out = bandpass_filter(raw_dc, l_freq=1.0, h_freq=40.0)
+         # Mean on each channel should be near zero (much smaller than 1e-3).
+         assert np.all(np.abs(out.get_data().mean(axis=1)) < 1e-4)
+
+     def test_does_not_mutate_input(self) -> None:
+         raw = self._load()
+         original_mean = raw.get_data().mean()
+         _ = bandpass_filter(raw, l_freq=1.0, h_freq=40.0)
+         assert raw.get_data().mean() == pytest.approx(original_mean, rel=1e-12)
+ ```
+
+ - [ ] **Step 2: Run tests; they MUST fail**
+
+ ```bash
+ pytest tests/pipelines/test_eeg_pipeline.py::TestBandpassFilter -v
+ ```
+ Expected: a collection error (`ImportError: cannot import name 'bandpass_filter'`), so none of the 4 new tests run yet.
+
+ - [ ] **Step 3: Implement `bandpass_filter`**
+
+ Append to `/Users/mertgungor/Desktop/hackathon/src/pipelines/eeg_pipeline.py`:
+ ```python
+ import mne
+
+
+ def bandpass_filter(
+     raw: mne.io.BaseRaw,
+     l_freq: float = 1.0,
+     h_freq: float = 40.0,
+ ) -> mne.io.BaseRaw:
+     """Apply a non-mutating bandpass filter to an MNE Raw.
+
+     Default 1–40 Hz removes drift below 1 Hz and high-frequency noise / line
+     artifacts above 40 Hz. Returns a copy; the input `raw` is unchanged.
+
+     Args:
+         raw: Loaded `mne.io.BaseRaw` (call `.load_data()` first if from disk).
+         l_freq: Low-cut frequency in Hz.
+         h_freq: High-cut frequency in Hz.
+
+     Returns:
+         A filtered copy of `raw`.
+     """
+     out = raw.copy()
+     out.filter(l_freq=l_freq, h_freq=h_freq, picks="all", verbose="ERROR")
+     logger.info("Bandpass filter applied: %.1f-%.1f Hz", l_freq, h_freq)
+     return out
+ ```
+
+ - [ ] **Step 4: Run tests to verify they pass**
+
+ ```bash
+ pytest tests/pipelines/test_eeg_pipeline.py -v
+ ```
+ Expected: 10 PASS (6 prior EEG + 4 bandpass).
+
+ - [ ] **Step 5: Commit**
+
+ ```bash
+ git add tests/pipelines/test_eeg_pipeline.py src/pipelines/eeg_pipeline.py
+ git commit -m "feat(eeg): add non-mutating bandpass_filter (default 1-40 Hz)"
+ ```
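
Why a 1 Hz low cut removes a DC offset can be seen with a deliberately crude stand-in for the real filter. The sketch below zeroes FFT bins outside 1-40 Hz (a brick-wall mask); this is an assumption made for illustration only and is not how `raw.filter` works, which designs a proper filter with a transition band.

```python
import numpy as np

sfreq = 256.0
t = np.arange(int(sfreq * 10.0)) / sfreq
x = 1e-5 * np.sin(2 * np.pi * 10.0 * t) + 1e-3   # 10 Hz alpha sine + DC offset

spectrum = np.fft.rfft(x)
freqs = np.fft.rfftfreq(x.size, d=1.0 / sfreq)
# Brick-wall mask: drop every bin outside the 1-40 Hz passband, including 0 Hz.
spectrum[(freqs < 1.0) | (freqs > 40.0)] = 0.0
y = np.fft.irfft(spectrum, n=x.size)

print(abs(x.mean()))   # about 1e-3: the injected offset
print(abs(y.mean()))   # effectively zero: the 0 Hz bin is gone
print(y.std())         # the 10 Hz component survives
```

The offset lives entirely in the 0 Hz bin, so any filter that rejects frequencies below 1 Hz removes it while leaving the 10 Hz alpha component in place. That is the property `test_attenuates_dc_component` checks.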
+
+ ---
+
+ ## Task 4: `remove_artifacts_with_ica` (TDD)
+
+ **Files:**
+ - Modify: `tests/pipelines/test_eeg_pipeline.py`
+ - Modify: `src/pipelines/eeg_pipeline.py`
+
+ - [ ] **Step 1: Append the failing tests**
+
+ Extend the merged test import tuple:
+ ```python
+ from src.pipelines.eeg_pipeline import (
+     bandpass_filter,
+     is_valid_epoch,
+     remove_artifacts_with_ica,
+ )
+ ```
+
+ Append:
+ ```python
+
+
+ class TestRemoveArtifactsWithIca:
+     def _load(self) -> mne.io.BaseRaw:
+         return mne.io.read_raw_fif(FIXTURE, preload=True, verbose="ERROR")
+
+     def test_returns_raw_instance(self) -> None:
+         raw = bandpass_filter(self._load(), l_freq=1.0, h_freq=40.0)
+         out = remove_artifacts_with_ica(
+             raw, eog_ch_name="EOG061", n_components=4, random_state=97,
+         )
+         assert isinstance(out, mne.io.BaseRaw)
+
+     def test_preserves_shape(self) -> None:
+         raw = bandpass_filter(self._load(), l_freq=1.0, h_freq=40.0)
+         before = raw.get_data().shape
+         out = remove_artifacts_with_ica(
+             raw, eog_ch_name="EOG061", n_components=4, random_state=97,
+         )
+         assert out.get_data().shape == before
+
+     def test_reduces_eog_correlation_on_frontal_channel(self) -> None:
+         """ICA must reduce correlation between EOG and Cz (the bleed channel)."""
+         raw = bandpass_filter(self._load(), l_freq=1.0, h_freq=40.0)
+         before = raw.get_data()
+         cz_idx = raw.ch_names.index("Cz")
+         eog_idx = raw.ch_names.index("EOG061")
+         corr_before = abs(np.corrcoef(before[cz_idx], before[eog_idx])[0, 1])
+
+         out = remove_artifacts_with_ica(
+             raw, eog_ch_name="EOG061", n_components=4, random_state=97,
+         )
+         after = out.get_data()
+         corr_after = abs(np.corrcoef(after[cz_idx], after[eog_idx])[0, 1])
+         # Allow for noise — but the dominant EOG bleed must be reduced.
+         assert corr_after < corr_before
+
+     def test_no_eog_channel_is_a_noop(self) -> None:
+         """Without an EOG reference, ICA can't auto-reject — should pass through."""
+         raw = bandpass_filter(self._load(), l_freq=1.0, h_freq=40.0)
+         out = remove_artifacts_with_ica(
+             raw, eog_ch_name=None, n_components=4, random_state=97,
+         )
+         # Identical shape; data approximately equal (no rejection happened).
+         assert out.get_data().shape == raw.get_data().shape
+         np.testing.assert_allclose(
+             out.get_data(), raw.get_data(), rtol=1e-6, atol=1e-12
+         )
+
+     def test_is_deterministic_with_seed(self) -> None:
+         raw = bandpass_filter(self._load(), l_freq=1.0, h_freq=40.0)
+         a = remove_artifacts_with_ica(
+             raw, eog_ch_name="EOG061", n_components=4, random_state=97,
+         )
+         b = remove_artifacts_with_ica(
+             raw, eog_ch_name="EOG061", n_components=4, random_state=97,
+         )
+         np.testing.assert_allclose(a.get_data(), b.get_data(), rtol=1e-12, atol=1e-15)
+ ```
+
+ - [ ] **Step 2: Run tests; they MUST fail**
+
+ ```bash
+ pytest tests/pipelines/test_eeg_pipeline.py::TestRemoveArtifactsWithIca -v
+ ```
+ Expected: a collection error (`ImportError: cannot import name 'remove_artifacts_with_ica'`), so none of the 5 new tests run yet.
+
+ - [ ] **Step 3: Implement `remove_artifacts_with_ica`**
+
+ Append to `src/pipelines/eeg_pipeline.py`:
+ ```python
+ from mne.preprocessing import ICA
+
+
+ def remove_artifacts_with_ica(
+     raw: mne.io.BaseRaw,
+     eog_ch_name: str | None = None,
+     n_components: int = 15,
+     random_state: int = 97,
+ ) -> mne.io.BaseRaw:
+     """Remove EOG-like artifacts using MNE's ICA + EOG correlation.
+
+     Fits an ICA decomposition on `raw`, finds components whose time courses
+     correlate with the named EOG channel via `find_bads_eog`, marks them as
+     "bad" and reconstructs the signal without them. Returns a copy; the
+     input `raw` is unchanged.
+
+     If `eog_ch_name` is None or names a channel that is not in `raw`, ICA
+     is skipped and an unmodified copy is returned. If ICA runs but no bad
+     components are found, nothing is excluded. This keeps the function
+     safe to call on recordings without an EOG reference.
+
+     Args:
+         raw: Loaded, ideally bandpass-filtered, `mne.io.BaseRaw`.
+         eog_ch_name: Name of the EOG channel for correlation-based detection.
+             None disables auto-rejection.
+         n_components: Requested number of ICA components. Capped in this
+             function at `n_eeg - 1` (minimum 1) so the fit stays within the
+             number of EEG channels on small recordings.
+         random_state: Seed for ICA's underlying solver. Required for §4
+             Determinism.
+
+     Returns:
+         A copy of `raw` with EOG-correlated ICA components removed.
+     """
+     out = raw.copy()
+     if eog_ch_name is None or eog_ch_name not in out.ch_names:
+         logger.info("ICA skipped: no EOG channel reference provided")
+         return out
+
+     # Cap n_components below the EEG channel count to avoid solver
+     # complaints on small synthetic fixtures.
+     n_eeg = len(mne.pick_types(out.info, eeg=True, meg=False))
+     safe_n = min(n_components, max(n_eeg - 1, 1))
+
+     ica = ICA(
+         n_components=safe_n,
+         random_state=random_state,
+         max_iter="auto",
+         method="fastica",
+         verbose="ERROR",
+     )
+     ica.fit(out, picks="eeg", verbose="ERROR")
+     bad_idx, _ = ica.find_bads_eog(out, ch_name=eog_ch_name, verbose="ERROR")
+     ica.exclude = list(bad_idx)
+     logger.info(
+         "ICA fit: n_components=%d, EOG-correlated rejected=%d",
+         safe_n, len(ica.exclude),
+     )
+     ica.apply(out, verbose="ERROR")
+     return out
+ ```
+
+ - [ ] **Step 4: Run tests to verify they pass**
+
+ ```bash
+ pytest tests/pipelines/test_eeg_pipeline.py -v
+ ```
+ Expected: 15 PASS (10 prior + 5 ICA).
+
+ - [ ] **Step 5: Commit**
+
+ ```bash
+ git add tests/pipelines/test_eeg_pipeline.py src/pipelines/eeg_pipeline.py
+ git commit -m "feat(eeg): add remove_artifacts_with_ica with EOG correlation rejection"
+ ```
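
The component cap is easy to misread, so here is the `safe_n` expression on its own. `cap_components` is a hypothetical name used only for this sketch; the arithmetic is copied from the implementation above.

```python
def cap_components(n_components: int, n_eeg: int) -> int:
    """Hypothetical helper mirroring the `safe_n` expression in the plan."""
    return min(n_components, max(n_eeg - 1, 1))


# (requested n_components, number of EEG channels)
cases = [(15, 4), (4, 4), (2, 4), (15, 1)]
results = [cap_components(requested, n_eeg) for requested, n_eeg in cases]
for (requested, n_eeg), used in zip(cases, results):
    print(f"requested={requested}, n_eeg={n_eeg} -> fitted with {used}")
```

With the 4-EEG-channel fixture, both the default `n_components=15` and the tests' `n_components=4` end up fitting 3 components, so the logged `n_components` value will read 3 rather than the requested number.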
+
+ ---
+
+ ## Task 5: `compute_features_from_epoch` (TDD)
+
+ **Files:**
+ - Modify: `tests/pipelines/test_eeg_pipeline.py`
+ - Modify: `src/pipelines/eeg_pipeline.py`
+
+ - [ ] **Step 1: Append the failing tests**
+
+ Extend the merged test import tuple:
+ ```python
+ from src.pipelines.eeg_pipeline import (
+     bandpass_filter,
+     compute_features_from_epoch,
+     is_valid_epoch,
+     remove_artifacts_with_ica,
+ )
+ ```
+ Also add the band/stat constants used by the assertions at the top of the test file, after the existing imports:
+ ```python
+ EEG_BANDS = ("delta", "theta", "alpha", "beta", "gamma")
+ STATS = ("mean", "std", "var", "skew", "kurtosis")
+ ```
+
+ Append:
+ ```python
+
+
+ class TestComputeFeaturesFromEpoch:
+     def test_returns_1d_float_array(self) -> None:
+         epoch = np.random.default_rng(0).standard_normal((4, 256))
+         out = compute_features_from_epoch(epoch, sfreq=256.0)
+         assert isinstance(out, np.ndarray)
+         assert out.ndim == 1
+         assert out.dtype == np.float64
+
+     def test_feature_count_matches_contract(self) -> None:
+         """Each channel contributes len(EEG_BANDS) PSD features + len(STATS) stats."""
+         n_channels = 4
+         epoch = np.random.default_rng(0).standard_normal((n_channels, 256))
+         out = compute_features_from_epoch(epoch, sfreq=256.0)
+         expected = n_channels * (len(EEG_BANDS) + len(STATS))
+         assert out.shape == (expected,)
+
+     def test_alpha_band_dominates_for_alpha_signal(self) -> None:
+         """Pure 10 Hz sine on 1 channel should put most PSD power in alpha (8-13 Hz)."""
+         sfreq = 256.0
+         t = np.arange(int(sfreq * 2.0)) / sfreq
+         signal = np.sin(2 * np.pi * 10.0 * t)[None, :]  # (1, n_samples)
+         out = compute_features_from_epoch(signal, sfreq=sfreq)
+         # Layout for n_channels=1: [psd_delta, psd_theta, psd_alpha, psd_beta, psd_gamma, mean, std, var, skew, kurtosis]
+         psd_block = out[: len(EEG_BANDS)]
+         alpha_idx = EEG_BANDS.index("alpha")
+         assert psd_block[alpha_idx] == psd_block.max()
+
+     def test_finite_output(self) -> None:
+         epoch = np.random.default_rng(0).standard_normal((4, 256))
+         out = compute_features_from_epoch(epoch, sfreq=256.0)
+         assert np.all(np.isfinite(out))
+
+     def test_deterministic_for_same_input(self) -> None:
+         epoch = np.random.default_rng(0).standard_normal((4, 256))
+         a = compute_features_from_epoch(epoch, sfreq=256.0)
+         b = compute_features_from_epoch(epoch, sfreq=256.0)
+         np.testing.assert_array_equal(a, b)
+ ```
+
+ - [ ] **Step 2: Run tests; they MUST fail**
+
+ ```bash
+ pytest tests/pipelines/test_eeg_pipeline.py::TestComputeFeaturesFromEpoch -v
+ ```
+ Expected: a collection error (`ImportError: cannot import name 'compute_features_from_epoch'`), so none of the 5 new tests run yet.
+
+ - [ ] **Step 3: Implement features**
+
+ Append to `src/pipelines/eeg_pipeline.py`:
+ ```python
+ from scipy import signal as scipy_signal
+ from scipy import stats as scipy_stats
+
+
+ EEG_BANDS: dict[str, tuple[float, float]] = {
+     "delta": (1.0, 4.0),
+     "theta": (4.0, 8.0),
+     "alpha": (8.0, 13.0),
+     "beta": (13.0, 30.0),
+     "gamma": (30.0, 40.0),
+ }
+ STATS: tuple[str, ...] = ("mean", "std", "var", "skew", "kurtosis")
+
+
+ def _band_power(freqs: np.ndarray, psd: np.ndarray, lo: float, hi: float) -> float:
+     """Mean PSD value within the [lo, hi) frequency band."""
+     mask = (freqs >= lo) & (freqs < hi)
+     if not mask.any():
+         return 0.0
+     return float(psd[mask].mean())
+
+
+ def compute_features_from_epoch(epoch: np.ndarray, sfreq: float) -> np.ndarray:
+     """Compute PSD-band + statistical features for one epoch.
+
+     Per channel, the feature block is:
+         [psd_delta, psd_theta, psd_alpha, psd_beta, psd_gamma,
+          mean, std, var, skew, kurtosis]
+     Channels are stacked in their input order. The resulting 1-D vector has
+     length `n_channels * (len(EEG_BANDS) + len(STATS))`.
+
+     PSD is computed with Welch's method (`scipy.signal.welch`) at the
+     epoch's sample rate. Skewness and kurtosis use the `scipy.stats`
+     defaults (`bias=True`, i.e. no bias correction; kurtosis is the
+     Fisher/excess definition).
+
+     Args:
+         epoch: A 2-D array shape (n_channels, n_samples).
+         sfreq: Sampling rate in Hz.
+
+     Returns:
+         A 1-D `np.ndarray` of dtype float64.
+     """
+     n_channels, n_samples = epoch.shape
+     nperseg = min(256, n_samples)
+     feats: list[float] = []
+     for ch in range(n_channels):
+         x = epoch[ch]
+         freqs, psd = scipy_signal.welch(x, fs=sfreq, nperseg=nperseg)
+         for _band, (lo, hi) in EEG_BANDS.items():
+             feats.append(_band_power(freqs, psd, lo, hi))
+         feats.append(float(np.mean(x)))
+         feats.append(float(np.std(x)))
+         feats.append(float(np.var(x)))
+         feats.append(float(scipy_stats.skew(x)))
+         feats.append(float(scipy_stats.kurtosis(x)))
+     return np.asarray(feats, dtype=np.float64)
+ ```
+
+ - [ ] **Step 4: Run tests to verify they pass**
+
+ ```bash
+ pytest tests/pipelines/test_eeg_pipeline.py -v
+ ```
+ Expected: 20 PASS (15 prior + 5 features).
+
+ - [ ] **Step 5: Commit**
+
+ ```bash
+ git add tests/pipelines/test_eeg_pipeline.py src/pipelines/eeg_pipeline.py
+ git commit -m "feat(eeg): add compute_features_from_epoch (PSD bands + 5 statistics)"
+ ```
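
The band-masking step can be tried without SciPy. The sketch below swaps Welch's method for a single raw periodogram computed with `numpy.fft` (an assumption made to keep the example dependency-free, not what the plan implements) and then applies the same `[lo, hi)` mean-in-band rule as `_band_power`.

```python
import numpy as np

EEG_BANDS = {
    "delta": (1.0, 4.0),
    "theta": (4.0, 8.0),
    "alpha": (8.0, 13.0),
    "beta": (13.0, 30.0),
    "gamma": (30.0, 40.0),
}

sfreq = 256.0
t = np.arange(int(sfreq * 2.0)) / sfreq
x = np.sin(2 * np.pi * 10.0 * t)   # pure 10 Hz sine, 2 s long

# Raw periodogram: squared FFT magnitude, no segment averaging or windowing.
freqs = np.fft.rfftfreq(x.size, d=1.0 / sfreq)
power = np.abs(np.fft.rfft(x)) ** 2 / (sfreq * x.size)

# Same rule as `_band_power`: mean of the bins with lo <= f < hi.
band_power = {
    name: float(power[(freqs >= lo) & (freqs < hi)].mean())
    for name, (lo, hi) in EEG_BANDS.items()
}
dominant = max(band_power, key=band_power.get)
print(dominant)   # alpha
```

Because the bands are half-open, a boundary frequency such as 8 Hz is counted in alpha only, never in both theta and alpha.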
718
+
719
+ ---
720
+
721
+ ## Task 6: `extract_features_from_recording` (TDD — flatten to 2D table)
722
+
723
+ **Files:**
724
+ - Modify: `tests/pipelines/test_eeg_pipeline.py`
725
+ - Modify: `src/pipelines/eeg_pipeline.py`
726
+
727
+ - [ ] **Step 1: Append the failing tests**
728
+
729
+ Extend the merged test import tuple:
730
+ ```python
731
+ from src.pipelines.eeg_pipeline import (
732
+ bandpass_filter,
733
+ compute_features_from_epoch,
734
+ extract_features_from_recording,
735
+ is_valid_epoch,
736
+ remove_artifacts_with_ica,
737
+ )
738
+ ```
739
+ Also add `import pandas as pd` at the top of the test file (in the third-party block, alphabetical: numpy → pandas → pytest).
740
+
741
+ Append:
742
+ ```python
743
+
744
+
745
+ class TestExtractFeaturesFromRecording:
+     def _load(self) -> mne.io.BaseRaw:
+         return mne.io.read_raw_fif(FIXTURE, preload=True, verbose="ERROR")
+
+     def test_returns_dataframe(self) -> None:
+         raw = self._load()
+         df = extract_features_from_recording(
+             raw, epoch_duration_s=2.0, eog_ch_name="EOG061",
+             n_components=4, random_state=97,
+         )
+         assert isinstance(df, pd.DataFrame)
+
+     def test_row_count_matches_epochs(self) -> None:
+         """10 s recording / 2 s epoch = 5 epochs."""
+         raw = self._load()
+         df = extract_features_from_recording(
+             raw, epoch_duration_s=2.0, eog_ch_name="EOG061",
+             n_components=4, random_state=97,
+         )
+         assert len(df) == 5
+
+     def test_column_naming_is_deterministic_and_explicit(self) -> None:
+         raw = self._load()
+         df = extract_features_from_recording(
+             raw, epoch_duration_s=2.0, eog_ch_name="EOG061",
+             n_components=4, random_state=97,
+         )
+         # 4 EEG channels: Cz, Pz, O1, O2 (EOG channel is excluded from features).
+         for ch in ("Cz", "Pz", "O1", "O2"):
+             for band in EEG_BANDS:
+                 assert f"feat_{ch}_psd_{band}" in df.columns
+             for stat in STATS:
+                 assert f"feat_{ch}_{stat}" in df.columns
+
+     def test_no_feat_for_eog_channel(self) -> None:
+         raw = self._load()
+         df = extract_features_from_recording(
+             raw, epoch_duration_s=2.0, eog_ch_name="EOG061",
+             n_components=4, random_state=97,
+         )
+         assert not any("EOG061" in c for c in df.columns)
+
+     def test_all_features_finite_float64(self) -> None:
+         raw = self._load()
+         df = extract_features_from_recording(
+             raw, epoch_duration_s=2.0, eog_ch_name="EOG061",
+             n_components=4, random_state=97,
+         )
+         feat_cols = [c for c in df.columns if c.startswith("feat_")]
+         assert all(df[c].dtype == np.float64 for c in feat_cols)
+         assert df[feat_cols].notna().all().all()
+         assert np.isfinite(df[feat_cols].to_numpy()).all()
+
+     def test_drops_invalid_epochs_with_warning(self, caplog) -> None:
+         """If an epoch contains NaN, it is logged and dropped."""
+         raw = self._load()
+         # Inject a NaN into the last 2-second window so that exactly one epoch
+         # fails `is_valid_epoch`.
+         data = raw.get_data().copy()
+         data[0, -10] = np.nan
+         bad_raw = mne.io.RawArray(data, raw.info, verbose="ERROR")
+         df = extract_features_from_recording(
+             bad_raw, epoch_duration_s=2.0, eog_ch_name="EOG061",
+             n_components=4, random_state=97,
+         )
+         # 5 epochs minus 1 dropped = 4
+         assert len(df) == 4
812
+ ```
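The last test above requests `caplog` but asserts only the row count. The following is a minimal sketch of how the warning text itself could be checked, written against the standard `logging` module so that it runs without pytest or MNE. `warn_dropped`, `ListHandler`, and the `eeg_sketch` logger name are hypothetical; the message format mirrors the one used by the implementation in Step 3.

```python
import logging

logger = logging.getLogger("eeg_sketch")  # hypothetical logger name
logger.setLevel(logging.WARNING)
logger.propagate = False  # keep the demo records away from the root logger


class ListHandler(logging.Handler):
    """Collects formatted messages in memory, much like pytest's caplog."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


def warn_dropped(invalid_indices: list[int], n_total: int, max_shown: int = 10) -> None:
    """Emit one WARNING listing at most `max_shown` dropped epoch indices."""
    n_dropped = len(invalid_indices)
    if not n_dropped:
        return
    suffix = f"... (+{n_dropped - max_shown} more)" if n_dropped > max_shown else ""
    logger.warning(
        "Dropping %d/%d epochs with invalid samples (indices=%s%s)",
        n_dropped, n_total, invalid_indices[:max_shown], suffix,
    )


handler = ListHandler()
logger.addHandler(handler)
warn_dropped([], n_total=5)                 # nothing dropped: no record emitted
warn_dropped([4], n_total=5)                # the single-NaN case from the test
warn_dropped(list(range(12)), n_total=20)   # more than 10 indices: list is truncated
messages = handler.messages
```

Inside the pytest test, an equivalent check could be `assert "Dropping 1/5" in caplog.text`, assuming the pipeline's logger propagates to the root logger.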
813
+
814
+ - [ ] **Step 2: Run tests; they MUST fail**
815
+
816
+ ```bash
817
+ pytest tests/pipelines/test_eeg_pipeline.py::TestExtractFeaturesFromRecording -v
818
+ ```
819
+ Expected: a collection error with `ImportError: cannot import name 'extract_features_from_recording'`. The import sits at module level, so none of the 6 new tests can run yet.
820
+
821
+ - [ ] **Step 3: Implement the recording-level extractor**
822
+
823
+ Add `import pandas as pd` to the third-party imports block at the top of `src/pipelines/eeg_pipeline.py` (alphabetical: numpy → pandas → others).
824
+
825
+ Append:
826
+ ```python
827
+ def _build_feature_columns(eeg_ch_names: list[str]) -> list[str]:
+     """Generate the deterministic, channel-major column ordering (channels keep input order)."""
+     cols: list[str] = []
+     for ch in eeg_ch_names:
+         for band in EEG_BANDS:
+             cols.append(f"feat_{ch}_psd_{band}")
+         for stat in STATS:
+             cols.append(f"feat_{ch}_{stat}")
+     return cols
+
+
+ def extract_features_from_recording(
+     raw: mne.io.BaseRaw,
+     epoch_duration_s: float = 2.0,
+     eog_ch_name: str | None = None,
+     n_components: int = 15,
+     random_state: int = 97,
+ ) -> pd.DataFrame:
+     """Run the EEG pipeline on a Raw and return a 2-D feature DataFrame.
+
+     Steps:
+         1. Bandpass filter (1-40 Hz).
+         2. ICA-based EOG artifact rejection (skipped if `eog_ch_name` is None).
+         3. Slice into fixed-duration epochs; a trailing partial epoch is discarded.
+         4. Drop any epoch with NaN/inf samples (logged WARNING).
+         5. Compute features per epoch and stack into a DataFrame whose columns
+            are `feat_<channel>_psd_<band>` and `feat_<channel>_<stat>`.
+
+     Args:
+         raw: Preloaded `mne.io.BaseRaw` (read with `preload=True` or after `.load_data()`).
+         epoch_duration_s: Length of each epoch in seconds.
+         eog_ch_name: Name of EOG reference channel for ICA. None disables ICA.
+         n_components: Cap on ICA components.
+         random_state: Seed for ICA's solver (determinism).
+
+     Returns:
+         A `pd.DataFrame` with one row per valid epoch and `n_eeg_channels *
+         (len(EEG_BANDS) + len(STATS))` `feat_*` columns.
+     """
+     filtered = bandpass_filter(raw, l_freq=1.0, h_freq=40.0)
+     cleaned = remove_artifacts_with_ica(
+         filtered,
+         eog_ch_name=eog_ch_name,
+         n_components=n_components,
+         random_state=random_state,
+     )
+
+     sfreq = float(cleaned.info["sfreq"])
+     n_samples_per_epoch = int(round(epoch_duration_s * sfreq))
+     eeg_picks = mne.pick_types(cleaned.info, eeg=True, meg=False, eog=False)
+     eeg_names = [cleaned.ch_names[i] for i in eeg_picks]
+     data = cleaned.get_data(picks=eeg_picks)  # shape (n_eeg, n_times)
+     n_eeg, n_times = data.shape
+     n_total_epochs = n_times // n_samples_per_epoch
+
+     feature_cols = _build_feature_columns(eeg_names)
+     rows: list[np.ndarray] = []
+     invalid_indices: list[int] = []
+     for ep in range(n_total_epochs):
+         start = ep * n_samples_per_epoch
+         end = start + n_samples_per_epoch
+         epoch = data[:, start:end]
+         if not is_valid_epoch(epoch):
+             invalid_indices.append(ep)
+             continue
+         rows.append(compute_features_from_epoch(epoch, sfreq=sfreq))
+
+     n_dropped = len(invalid_indices)
+     if n_dropped:
+         display = invalid_indices[:10]
+         suffix = (
+             f"... (+{n_dropped - 10} more)" if n_dropped > 10 else ""
+         )
+         logger.warning(
+             "Dropping %d/%d epochs with invalid samples (indices=%s%s)",
+             n_dropped, n_total_epochs, display, suffix,
+         )
+
+     if not rows:
+         logger.info(
+             "Feature extraction complete: in=%d, out=0, dropped=%d (%.2f%%)",
+             n_total_epochs, n_dropped,
+             100.0 * n_dropped / max(n_total_epochs, 1),
+         )
+         return pd.DataFrame(columns=feature_cols).astype(np.float64)
+
+     matrix = np.vstack(rows)
+     out = pd.DataFrame(matrix, columns=feature_cols, dtype=np.float64)
+     logger.info(
+         "Feature extraction complete: in=%d, out=%d, dropped=%d (%.2f%%)",
+         n_total_epochs, len(out), n_dropped,
+         100.0 * n_dropped / max(n_total_epochs, 1),
+     )
+     return out
921
+ ```
922
+
923
+ - [ ] **Step 4: Run tests to verify they pass**
924
+
925
+ ```bash
926
+ pytest tests/pipelines/test_eeg_pipeline.py -v
927
+ ```
928
+ Expected: 26 PASS (20 prior + 6 recording).
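The slicing and flattening logic in Step 3 can be tried without MNE. This is a minimal sketch, assuming synthetic data in place of the filtered recording: `BANDS`, `STATS`, `build_columns`, and `toy_features` are hypothetical stand-ins (the real `EEG_BANDS`, `STATS`, and `compute_features_from_epoch` come from the earlier tasks), and the PSD columns are filled with zeros. It shows that floor division discards a trailing partial epoch, that a NaN invalidates only the epoch containing it, and that the empty-result path still yields `float64` columns.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: the real EEG_BANDS and STATS are defined in eeg_pipeline.py.
BANDS = ("delta", "theta", "alpha", "beta", "gamma")
STATS = ("mean", "std", "min", "max", "ptp")


def build_columns(ch_names: list[str]) -> list[str]:
    """Channel-major names: all bands, then all stats, for each channel in turn."""
    cols: list[str] = []
    for ch in ch_names:
        cols += [f"feat_{ch}_psd_{band}" for band in BANDS]
        cols += [f"feat_{ch}_{stat}" for stat in STATS]
    return cols


def toy_features(epoch: np.ndarray) -> np.ndarray:
    """Placeholder for compute_features_from_epoch: zeros for PSD, real stats."""
    per_channel = [
        np.concatenate([
            np.zeros(len(BANDS)),
            [ch.mean(), ch.std(), ch.min(), ch.max(), np.ptp(ch)],
        ])
        for ch in epoch
    ]
    return np.concatenate(per_channel)


sfreq, epoch_duration_s = 100.0, 2.0
ch_names = ["Cz", "Pz", "O1", "O2"]
rng = np.random.default_rng(0)
data = rng.standard_normal((len(ch_names), 1050))  # 10.5 s: the last 0.5 s is a partial epoch
data[1, 450] = np.nan                               # falls inside epoch index 2

n_per_epoch = int(round(epoch_duration_s * sfreq))  # 200 samples per epoch
n_total = data.shape[1] // n_per_epoch              # floor division drops the partial tail
rows, dropped = [], []
for ep in range(n_total):
    epoch = data[:, ep * n_per_epoch:(ep + 1) * n_per_epoch]
    if not np.isfinite(epoch).all():
        dropped.append(ep)
        continue
    rows.append(toy_features(epoch))

features = pd.DataFrame(np.vstack(rows), columns=build_columns(ch_names), dtype=np.float64)
# Same construction as the "no valid rows" branch: zero rows, float64 columns.
empty = pd.DataFrame(columns=build_columns(ch_names)).astype(np.float64)
```

Because the NaN here is injected after any filtering would have happened, it stays confined to one epoch; in the real pipeline the bandpass filter and ICA run before slicing.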
929
+
930
+ - [ ] **Step 5: Commit**
931
+
932
+ ```bash
933
+ git add tests/pipelines/test_eeg_pipeline.py src/pipelines/eeg_pipeline.py
934
+ git commit -m "feat(eeg): flatten 3D epochs into deterministic 2D feat_<ch>_<band|stat> table"
935
+ ```
936
+
937
+ ---
938
+
939
+ ## Task 7: `run_pipeline` orchestrator + CLI (TDD)
940
+
941
+ **Files:**
942
+ - Modify: `tests/pipelines/test_eeg_pipeline.py`
943
+ - Modify: `src/pipelines/eeg_pipeline.py`
944
+
945
+ - [ ] **Step 1: Append the failing tests**
946
+
947
+ Extend the merged test import tuple to include `run_pipeline`. Add `import shutil` to the stdlib block at the top of the test file.
948
+
949
+ Append:
950
+ ```python
951
+
952
+
953
+ class TestRunPipeline:
+     def test_end_to_end_writes_processed_parquet(self, tmp_path: Path) -> None:
+         raw_dir = tmp_path / "data" / "raw"
+         proc_dir = tmp_path / "data" / "processed"
+         raw_dir.mkdir(parents=True)
+         proc_dir.mkdir(parents=True)
+         input_path = raw_dir / "rec.fif"
+         output_path = proc_dir / "eeg_features.parquet"
+         shutil.copy(FIXTURE, input_path)
+
+         run_pipeline(
+             input_path=input_path, output_path=output_path,
+             epoch_duration_s=2.0, eog_ch_name="EOG061",
+             n_components=4, random_state=97,
+         )
+
+         assert output_path.exists()
+         df = pd.read_parquet(output_path)
+         assert len(df) == 5
+         assert all(c.startswith("feat_") for c in df.columns)
+
+     def test_run_pipeline_preserves_float64_dtype(self, tmp_path: Path) -> None:
+         raw_dir = tmp_path / "data" / "raw"
+         proc_dir = tmp_path / "data" / "processed"
+         raw_dir.mkdir(parents=True)
+         proc_dir.mkdir(parents=True)
+         input_path = raw_dir / "rec.fif"
+         output_path = proc_dir / "eeg_features.parquet"
+         shutil.copy(FIXTURE, input_path)
+
+         run_pipeline(
+             input_path=input_path, output_path=output_path,
+             epoch_duration_s=2.0, eog_ch_name="EOG061",
+             n_components=4, random_state=97,
+         )
+         df = pd.read_parquet(output_path)
+         for col in df.columns:
+             assert df[col].dtype == np.float64, f"{col} widened to {df[col].dtype}"
+
+     def test_run_pipeline_is_idempotent(self, tmp_path: Path) -> None:
+         raw_dir = tmp_path / "data" / "raw"
+         proc_dir = tmp_path / "data" / "processed"
+         raw_dir.mkdir(parents=True)
+         proc_dir.mkdir(parents=True)
+         input_path = raw_dir / "rec.fif"
+         output_path = proc_dir / "eeg_features.parquet"
+         shutil.copy(FIXTURE, input_path)
+
+         run_pipeline(
+             input_path=input_path, output_path=output_path,
+             epoch_duration_s=2.0, eog_ch_name="EOG061",
+             n_components=4, random_state=97,
+         )
+         first = output_path.read_bytes()
+         run_pipeline(
+             input_path=input_path, output_path=output_path,
+             epoch_duration_s=2.0, eog_ch_name="EOG061",
+             n_components=4, random_state=97,
+         )
+         second = output_path.read_bytes()
+         assert first == second, "EEG pipeline output must be byte-deterministic"
+
+     def test_run_pipeline_raises_when_input_missing(self, tmp_path: Path) -> None:
+         with pytest.raises(FileNotFoundError):
+             run_pipeline(
+                 input_path=tmp_path / "nope.fif",
+                 output_path=tmp_path / "out.parquet",
+             )
+
+     def test_run_pipeline_rejects_directory_as_output(self, tmp_path: Path) -> None:
+         raw_dir = tmp_path / "data" / "raw"
+         raw_dir.mkdir(parents=True)
+         input_path = raw_dir / "rec.fif"
+         shutil.copy(FIXTURE, input_path)
+         bad_output = tmp_path / "out_dir"
+         bad_output.mkdir()
+         with pytest.raises(IsADirectoryError, match="must be a file"):
+             run_pipeline(
+                 input_path=input_path, output_path=bad_output,
+                 epoch_duration_s=2.0, eog_ch_name="EOG061",
+                 n_components=4, random_state=97,
+             )
1035
+ ```
1036
+
1037
+ - [ ] **Step 2: Run tests; they MUST fail**
1038
+
1039
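+ Run `pytest tests/pipelines/test_eeg_pipeline.py::TestRunPipeline -v`. Expected: a collection error with `ImportError: cannot import name 'run_pipeline'`. The import sits at module level, so none of the 5 new tests can run yet.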
1040
+
1041
+ - [ ] **Step 3: Implement the orchestrator + CLI**
1042
+
1043
+ Add `from pathlib import Path` to the stdlib imports block at the top of `src/pipelines/eeg_pipeline.py`.
1044
+
1045
+ Append at the END of the file:
1046
+ ```python
1047
+
1048
+
1049
+ # Default I/O paths for the EEG pipeline. Override via run_pipeline() args.
+ DEFAULT_INPUT = Path("data/raw/eeg.fif")
+ DEFAULT_OUTPUT = Path("data/processed/eeg_features.parquet")
+
+
+ def run_pipeline(
+     input_path: Path = DEFAULT_INPUT,
+     output_path: Path = DEFAULT_OUTPUT,
+     epoch_duration_s: float = 2.0,
+     eog_ch_name: str | None = None,
+     n_components: int = 15,
+     random_state: int = 97,
+ ) -> None:
+     """Run the EEG pipeline end-to-end: raw FIF/EDF → processed feature Parquet.
+
+     Reads `input_path` via MNE, applies bandpass + ICA + epoching + feature
+     extraction, then writes a model-ready Parquet at `output_path` (preserves
+     float64 dtype; satisfies AGENTS.md §6).
+
+     Args:
+         input_path: Path to the raw recording (.fif or .edf).
+         output_path: Where to write the processed feature Parquet file.
+             Parent directory is created if missing.
+         epoch_duration_s: Length of each fixed-duration epoch (seconds).
+         eog_ch_name: Name of the EOG channel for ICA-based artifact rejection.
+             None disables ICA.
+         n_components: Cap on ICA components.
+         random_state: Seed for ICA's solver. Required for §4 Determinism.
+
+     Raises:
+         FileNotFoundError: if `input_path` does not exist.
+         IsADirectoryError: if `output_path` resolves to an existing directory.
+     """
+     input_path = Path(input_path)
+     output_path = Path(output_path)
+     if not input_path.exists():
+         raise FileNotFoundError(f"Raw EEG file not found: {input_path}")
+
+     logger.info("Reading raw EEG from %s", input_path)
+     if input_path.suffix.lower() == ".edf":
+         raw = mne.io.read_raw_edf(input_path, preload=True, verbose="ERROR")
+     else:
+         raw = mne.io.read_raw_fif(input_path, preload=True, verbose="ERROR")
+     logger.info(
+         "Loaded %d channels, sfreq=%.1f Hz, n_times=%d",
+         len(raw.ch_names), raw.info["sfreq"], raw.n_times,
+     )
+
+     features = extract_features_from_recording(
+         raw,
+         epoch_duration_s=epoch_duration_s,
+         eog_ch_name=eog_ch_name,
+         n_components=n_components,
+         random_state=random_state,
+     )
+
+     output_path.parent.mkdir(parents=True, exist_ok=True)
+     if output_path.is_dir():
+         raise IsADirectoryError(
+             f"output_path must be a file, got a directory: {output_path}"
+         )
+     # Parquet preserves dtypes (float64 features stay float64) and is
+     # byte-deterministic with single-threaded snappy. AGENTS.md §6.
+     features.to_parquet(
+         output_path, index=False, engine="pyarrow", compression="snappy",
+     )
+     logger.info(
+         "Wrote processed features to %s (rows=%d, cols=%d)",
+         output_path, len(features), features.shape[1],
+     )
+
+
+ if __name__ == "__main__":
+     # Day-2 CLI entrypoint: runs with default paths against `data/raw/eeg.fif`.
+     # Argument parsing (argparse / click) will land in a later task.
+     #     python -m src.pipelines.eeg_pipeline
+     run_pipeline()
1126
+ ```
1127
+
1128
+ - [ ] **Step 4: Run tests; full suite green**
1129
+
1130
+ ```bash
1131
+ pytest -v
1132
+ ```
1133
+ Expected: **61 PASS** (30 from Day 1 + 31 EEG: 6 valid_epoch + 4 bandpass + 5 ICA + 5 features + 6 recording + 5 run_pipeline).
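The two error paths covered by `test_run_pipeline_raises_when_input_missing` and `test_run_pipeline_rejects_directory_as_output` can be exercised in isolation. This is a minimal sketch using only the standard library; `check_io_paths` is a hypothetical helper, not part of the planned API. Unlike `run_pipeline` above, which checks the output path only after feature extraction, the sketch validates both paths before doing any work; that ordering is this example's own choice.

```python
import tempfile
from pathlib import Path


def check_io_paths(input_path, output_path):
    """Hypothetical helper mirroring run_pipeline's two path guards."""
    input_path, output_path = Path(input_path), Path(output_path)
    if not input_path.exists():
        raise FileNotFoundError(f"Raw EEG file not found: {input_path}")
    if output_path.is_dir():
        raise IsADirectoryError(
            f"output_path must be a file, got a directory: {output_path}"
        )
    # Only create the parent once both guards have passed.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    return input_path, output_path


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    src = tmp / "rec.fif"
    src.write_bytes(b"")          # an empty placeholder is enough for the path checks
    out_dir = tmp / "out_dir"
    out_dir.mkdir()

    errors = []
    for inp, out in [(tmp / "nope.fif", tmp / "out.parquet"), (src, out_dir)]:
        try:
            check_io_paths(inp, out)
        except (FileNotFoundError, IsADirectoryError) as exc:
            errors.append(type(exc).__name__)

    _, ok_out = check_io_paths(src, tmp / "data" / "processed" / "eeg_features.parquet")
    parent_created = ok_out.parent.is_dir()
```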
1134
+
1135
+ - [ ] **Step 5: Commit**
1136
+
1137
+ ```bash
1138
+ git add tests/pipelines/test_eeg_pipeline.py src/pipelines/eeg_pipeline.py
1139
+ git commit -m "feat(eeg): add run_pipeline orchestrator + CLI (FIF/EDF → Parquet)"
1140
+ ```
1141
+
1142
+ ---
1143
+
1144
+ ## Task 8: AGENTS.md + README updates
1145
+
1146
+ **Files:**
1147
+ - Modify: `AGENTS.md`
1148
+ - Modify: `README.md`
1149
+
1150
+ - [ ] **Step 1: Confirm AGENTS.md §1 pipeline table**
1151
+
1152
+ In `/Users/mertgungor/Desktop/hackathon/AGENTS.md`, find the pipeline table:
1153
+ ```
1154
+ | Image (MRI / fMRI) | `src/pipelines/mri_pipeline.py` | ComBat Harmonization for site-level domain shift |
1155
+ | Signal (EEG) | `src/pipelines/eeg_pipeline.py` | MNE-Python + ICA for artifact removal |
1156
+ | Tabular (BBB / molecules) | `src/pipelines/bbb_pipeline.py` | RDKit Morgan fingerprints from SMILES |
1157
+ ```
1158
+ No change is needed here: the EEG row already exists.
1159
+
1160
+ - [ ] **Step 2: Update README Status table**
1161
+
1162
+ In `/Users/mertgungor/Desktop/hackathon/README.md`, find:
1163
+ ```
1164
+ | 2 | Signal (EEG) | `src/pipelines/eeg_pipeline.py` | Planned (MNE-Python + ICA) |
1165
+ ```
1166
+ Replace with:
1167
+ ```
1168
+ | 2 | Signal (EEG) | `src/pipelines/eeg_pipeline.py` | Shipped — 61 tests green |
1169
+ ```
1170
+ Also update the Day-1 row's count if needed (it should still read "Shipped — 30 tests green" since Day 1 tests didn't change). And update any Quick Start `pytest -v` expected count from 30 to 61.
1171
+
1172
+ - [ ] **Step 3: Add EEG smoke-run line to README Quick Start**
1173
+
1174
+ In the Quick Start section, after the BBB smoke-run line, append:
1175
+ ```bash
1176
+ # Smoke-test the EEG pipeline with the bundled fixture
1177
+ mkdir -p data/raw
1178
+ cp tests/fixtures/eeg_sample.fif data/raw/eeg.fif
1179
+ python -m src.pipelines.eeg_pipeline
1180
+ ```
1181
+ And below it: "Result lives at `data/processed/eeg_features.parquet`."
1182
+
1183
+ - [ ] **Step 4: Commit**
1184
+
1185
+ ```bash
1186
+ git add AGENTS.md README.md
1187
+ git commit -m "docs: mark EEG pipeline shipped; bump test count to 61"
1188
+ ```
1189
+
1190
+ ---
1191
+
1192
+ ## Task 9: DoD verification + smoke run
1193
+
1194
+ **Files:** none modified (verification only)
1195
+
1196
+ - [ ] **Step 1: Full test suite green**
1197
+
1198
+ ```bash
1199
+ cd /Users/mertgungor/Desktop/hackathon
1200
+ source .venv312/bin/activate
1201
+ pytest -v --tb=short
1202
+ ```
1203
+ Required: **61 passed**, 0 failed, 0 skipped, 0 warnings.
1204
+
1205
+ - [ ] **Step 2: CLI smoke run (real-world flow)**
1206
+
1207
+ ```bash
1208
+ mkdir -p data/raw
1209
+ cp tests/fixtures/eeg_sample.fif data/raw/eeg.fif
1210
+ rm -f data/processed/eeg_features.parquet
1211
+
1212
+ python -c "
1213
+ from pathlib import Path
1214
+ from src.pipelines.eeg_pipeline import run_pipeline
1215
+ run_pipeline(
1216
+ input_path=Path('data/raw/eeg.fif'),
1217
+ output_path=Path('data/processed/eeg_features.parquet'),
1218
+ epoch_duration_s=2.0, eog_ch_name='EOG061',
1219
+ n_components=4, random_state=97,
1220
+ )
1221
+ "
1222
+ md5_run1=$(md5 -q data/processed/eeg_features.parquet 2>/dev/null || md5sum data/processed/eeg_features.parquet | awk '{print $1}')
1223
+ echo "MD5 run1: $md5_run1"
1224
+
1225
+ python -c "
1226
+ from pathlib import Path
1227
+ from src.pipelines.eeg_pipeline import run_pipeline
1228
+ run_pipeline(
1229
+ input_path=Path('data/raw/eeg.fif'),
1230
+ output_path=Path('data/processed/eeg_features.parquet'),
1231
+ epoch_duration_s=2.0, eog_ch_name='EOG061',
1232
+ n_components=4, random_state=97,
1233
+ )
1234
+ "
1235
+ md5_run2=$(md5 -q data/processed/eeg_features.parquet 2>/dev/null || md5sum data/processed/eeg_features.parquet | awk '{print $1}')
1236
+ echo "MD5 run2: $md5_run2"
1237
+ ```
1238
+ Required: `md5_run1 == md5_run2` (Determinism).
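The shell snippet above falls back between `md5 -q` (macOS) and `md5sum` (Linux). A portable alternative is to hash the file from Python. The sketch below is one way to do that; `file_digest` is a hypothetical helper, SHA-256 is used in place of MD5 (either works for an equality check), and the bytes written are stand-ins rather than real Parquet output.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "eeg_features.parquet"  # stand-in bytes, not real Parquet
    out.write_bytes(b"abc")
    run1 = file_digest(out)
    out.write_bytes(b"abc")   # a "second run" that rewrites identical bytes
    run2 = file_digest(out)
    out.write_bytes(b"abd")   # any single-byte difference changes the digest
    run3 = file_digest(out)
```

For the real check, call `file_digest("data/processed/eeg_features.parquet")` after each `run_pipeline` invocation and compare the two strings.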
1239
+
1240
+ - [ ] **Step 3: Verify schema**
1241
+
1242
+ ```bash
1243
+ python -c "
1244
+ import pandas as pd
1245
+ df = pd.read_parquet('data/processed/eeg_features.parquet')
1246
+ print('rows:', len(df))
1247
+ print('cols:', df.shape[1])
1248
+ print('feat_*:', sum(c.startswith('feat_') for c in df.columns))
1249
+ print('any EOG col:', any('EOG' in c for c in df.columns))
1250
+ print('all float64:', all(df[c].dtype.name == 'float64' for c in df.columns))
1251
+ print('first 4 cols:', list(df.columns)[:4])
1252
+ "
1253
+ ```
1254
+ Required:
1255
+ - rows = 5
1256
+ - feat_* count = 4 channels × (5 bands + 5 stats) = 40
1257
+ - "any EOG col" = False
1258
+ - "all float64" = True
1259
+ - first columns must follow `feat_<channel>_psd_<band>` / `feat_<channel>_<stat>` pattern
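The requirements listed above can also be checked programmatically rather than by reading the printed output. This is a minimal sketch: `expected_columns` and `schema_problems` are hypothetical helpers, and the band and statistic names are placeholders to be replaced with the real `EEG_BANDS` and `STATS` from `src/pipelines/eeg_pipeline.py`.

```python
import numpy as np
import pandas as pd

# Placeholder names; substitute the real EEG_BANDS / STATS from eeg_pipeline.py.
BANDS = ("delta", "theta", "alpha", "beta", "gamma")
STATS = ("mean", "std", "min", "max", "ptp")


def expected_columns(channels, bands=BANDS, stats=STATS) -> list[str]:
    """Channel-major: every band column, then every stat column, per channel."""
    cols: list[str] = []
    for ch in channels:
        cols += [f"feat_{ch}_psd_{b}" for b in bands]
        cols += [f"feat_{ch}_{s}" for s in stats]
    return cols


def schema_problems(df: pd.DataFrame, channels) -> list[str]:
    """Return human-readable schema violations; an empty list means the frame is OK."""
    expected = expected_columns(channels)
    problems: list[str] = []
    if list(df.columns) != expected:
        missing = sorted(set(expected) - set(df.columns))
        extra = sorted(set(df.columns) - set(expected))
        problems.append(f"column mismatch or wrong order: missing={missing}, extra={extra}")
    non_float = [c for c in df.columns if df[c].dtype != np.float64]
    if non_float:
        problems.append(f"non-float64 columns: {non_float}")
    numeric = df.select_dtypes(include="number").to_numpy()
    if not np.isfinite(numeric).all():
        problems.append("non-finite values present")
    return problems


channels = ["Cz", "Pz", "O1", "O2"]
good = pd.DataFrame(np.ones((5, 40)), columns=expected_columns(channels))
bad = good.copy()
bad["feat_EOG061_mean"] = 1  # an extra EOG column holding integers

good_problems = schema_problems(good, channels)
bad_problems = schema_problems(bad, channels)
```

Against the real output, `schema_problems(pd.read_parquet("data/processed/eeg_features.parquet"), ["Cz", "Pz", "O1", "O2"])` should return an empty list once the placeholder names match the pipeline's constants.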
1260
+
1261
+ - [ ] **Step 4: Verify data is gitignored**
1262
+
1263
+ ```bash
1264
+ git check-ignore -v data/raw/eeg.fif data/processed/eeg_features.parquet
1265
+ git status
1266
+ ```
1267
+ Expected: both ignored, working tree clean.
1268
+
1269
+ ---
1270
+
1271
+ ## Day-2 Definition of Done
1272
+
1273
+ - [ ] `src/pipelines/eeg_pipeline.py` exposes `is_valid_epoch`, `bandpass_filter`, `remove_artifacts_with_ica`, `compute_features_from_epoch`, `extract_features_from_recording`, `run_pipeline`, plus `EEG_BANDS`, `STATS`, `DEFAULT_INPUT`, `DEFAULT_OUTPUT`.
1274
+ - [ ] `python -m src.pipelines.eeg_pipeline` against `data/raw/eeg.fif` produces a deterministic Parquet at `data/processed/eeg_features.parquet`.
1275
+ - [ ] Invalid epochs (NaN/inf) are logged with their indices and dropped (Data Readiness §4 rule 2).
1276
+ - [ ] ICA is seeded; same input → byte-identical output (rule 3 + 5).
1277
+ - [ ] Row count in / out / dropped logged at INFO (rule 4).
1278
+ - [ ] Per-epoch feature schema is `feat_<channel>_psd_<band>` and `feat_<channel>_<stat>` for every EEG channel.
1279
+ - [ ] Parquet output preserves `float64` dtype across the round-trip.
1280
+ - [ ] Test suite: **61 passing**, 0 failures, 0 warnings.
1281
+ - [ ] At least 9 atomic commits across Day 2 (1 fixture + 6 TDD features + 1 doc update + 1 close-out).