refactor: pin single-threaded determinism env; close Day-2 doc/typo gaps
- README: float32→float64, "will produce"→"produces" (I1)
- README: add eeg_pipeline.py to repository layout tree (I2)
- README: add Day-2 plan and EEG test file to Where to Look (M1)
- bbb_pipeline + eeg_pipeline: import os/pyarrow at top, set OMP/OPENBLAS/MKL=1 and pa thread counts at module level after logger (I3a)
- AGENTS.md §4: document Determinism environment paragraph (I3b)
- AGENTS.md §1: mark mri_pipeline.py as (planned, Day 3) (I4)
- pytest.ini: add markers block with slow: marker (M2)
- Day-1 plan: no float32 references found — no-op (M3)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- AGENTS.md +8 -1
- README.md +5 -2
- pytest.ini +2 -0
- src/pipelines/bbb_pipeline.py +11 -0
- src/pipelines/eeg_pipeline.py +11 -0
AGENTS.md
CHANGED
@@ -16,7 +16,7 @@ The platform exposes three production pipelines behind a single FastAPI surface:
 
 | Modality | Pipeline | Core Technique |
 |---|---|---|
-| Image (MRI / fMRI) | `src/pipelines/mri_pipeline.py` | ComBat Harmonization for site-level domain shift |
+| Image (MRI / fMRI) | `src/pipelines/mri_pipeline.py` *(planned, Day 3)* | ComBat Harmonization for site-level domain shift |
 | Signal (EEG) | `src/pipelines/eeg_pipeline.py` | MNE-Python + ICA for artifact removal |
 | Tabular (BBB / molecules) | `src/pipelines/bbb_pipeline.py` | RDKit Morgan fingerprints from SMILES |
 
@@ -72,6 +72,13 @@ Every modality pipeline MUST guarantee, before writing to `data/processed/`:
 4. **Traceability** — log row count in, row count out, and percentage dropped at INFO level.
 5. **Idempotence** — re-running the pipeline overwrites `data/processed/` cleanly; no append, no partial writes.
 
+**Determinism environment**: byte-identical output requires deterministic
+floating-point reductions. Each pipeline module sets `OMP_NUM_THREADS=1`,
+`OPENBLAS_NUM_THREADS=1`, `MKL_NUM_THREADS=1`, and pins pyarrow to
+single-threaded mode at import time. CI runners and developer machines do
+not need to set these manually — the pipeline modules handle it — but
+overriding them in the environment will break Determinism rule 3.
+
 A model training script is allowed to import from `data/processed/` only. If a
 training script references `data/raw/` directly, that is a bug and must be
 refactored into a pipeline.
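
Note on the Determinism environment paragraph: the pipeline modules use `os.environ.setdefault`, which only fills in variables that are not already set, so a value exported by the caller survives. The sketch below illustrates that behaviour. `THREAD_VARS`, `pin_single_threaded`, and `unpinned_thread_vars` are hypothetical names for illustration, not functions in the repository. As an assumption rather than a finding about this codebase: native BLAS and OpenMP runtimes commonly read these variables when they are first loaded, so running such a helper before `import numpy` is likely the safer order.

```python
import os
from typing import Dict, List, MutableMapping

# Hypothetical names for illustration; the repository sets these inline.
THREAD_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def pin_single_threaded(env: MutableMapping[str, str] = os.environ) -> Dict[str, str]:
    """Set each thread-count variable to "1" unless it is already set."""
    for name in THREAD_VARS:
        # setdefault leaves an existing value untouched.
        env.setdefault(name, "1")
    return {name: env[name] for name in THREAD_VARS}


def unpinned_thread_vars(env: MutableMapping[str, str] = os.environ) -> List[str]:
    """Return the variables that are unset or set to something other than "1"."""
    return [name for name in THREAD_VARS if env.get(name) != "1"]


if __name__ == "__main__":
    demo_env = {"OMP_NUM_THREADS": "4"}  # simulated pre-existing override
    print(pin_single_threaded(demo_env))
    print(unpinned_thread_vars(demo_env))
```

A check like `unpinned_thread_vars` could be logged at WARNING level so that an environment override which threatens Determinism rule 3 is at least visible in the run log.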
README.md
CHANGED
@@ -65,7 +65,8 @@ Result lives at `data/processed/eeg_features.parquet`.
 ├── src/
 │ ├── core/logger.py # Shared structured logger (mandatory in every pipeline)
 │ ├── pipelines/
-│ │
+│ │ ├── bbb_pipeline.py # Day-1 pipeline (4 public funcs + CLI entry)
+│ │ └── eeg_pipeline.py # Day-2 pipeline (6 public funcs + CLI entry)
 │ └── api/ # FastAPI surface (placeholder until Day 4+)
 └── tests/
 ├── core/, pipelines/ # Mirror src/ structure
@@ -103,7 +104,7 @@ The pipeline is seeded (`random_state=97`) and produces byte-identical Parquet o
 Pipeline outputs are written as Parquet files using the `pyarrow` engine with snappy
 compression. This preserves dtypes (`uint8` fingerprint columns stay `uint8` instead of
 widening to `int64` as CSV would do) and yields ~10× smaller files than CSV — material
-for the `
+for the `float64` EEG features Day 2 produces. See AGENTS.md §6.
 
 ## Testing & TDD
 
@@ -124,5 +125,7 @@ finishes in under 2 seconds on a 2024 laptop.
 
 - **Project rules (mandatory reading for any agent):** [`AGENTS.md`](AGENTS.md)
 - **Day-1 plan (full TDD task breakdown):** [`docs/superpowers/plans/2026-04-29-neurobridge-day1-bootstrap-bbb-pipeline.md`](docs/superpowers/plans/2026-04-29-neurobridge-day1-bootstrap-bbb-pipeline.md)
+- **Day-2 plan (full TDD task breakdown):** [`docs/superpowers/plans/2026-04-29-neurobridge-day2-eeg-pipeline.md`](docs/superpowers/plans/2026-04-29-neurobridge-day2-eeg-pipeline.md)
 - **Logger contract:** [`src/core/logger.py`](src/core/logger.py) + [`tests/core/test_logger.py`](tests/core/test_logger.py)
 - **BBB pipeline:** [`src/pipelines/bbb_pipeline.py`](src/pipelines/bbb_pipeline.py) + [`tests/pipelines/test_bbb_pipeline.py`](tests/pipelines/test_bbb_pipeline.py)
+- **EEG pipeline:** [`src/pipelines/eeg_pipeline.py`](src/pipelines/eeg_pipeline.py) + [`tests/pipelines/test_eeg_pipeline.py`](tests/pipelines/test_eeg_pipeline.py)
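
Note on the byte-identical Parquet claim in the README: one way to check it is to run a pipeline twice and compare digests of the two output files. The sketch below uses only the standard library. `file_sha256` and `outputs_identical` are hypothetical helpers, and the repository's own tests may verify this differently.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def outputs_identical(first: Path, second: Path) -> bool:
    """True when both files contain exactly the same bytes."""
    return file_sha256(first) == file_sha256(second)
```

In a test, the two paths would be the outputs of two separate pipeline runs on the same input, for example written into two temporary directories.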
pytest.ini
CHANGED
@@ -2,3 +2,5 @@
 testpaths = tests
 pythonpath = .
 addopts = -v --tb=short
+markers =
+    slow: marks tests as slow (deselect with '-m "not slow"')
src/pipelines/bbb_pipeline.py
CHANGED
@@ -11,10 +11,12 @@ traceability (row count in / out / dropped), and idempotent output.
 from __future__ import annotations
 
 import math
+import os
 from pathlib import Path
 
 import numpy as np
 import pandas as pd
+import pyarrow as pa
 from rdkit import Chem, RDLogger
 from rdkit.Chem import AllChem
 from rdkit.DataStructs import ConvertToNumpyArray
@@ -23,6 +25,15 @@ from src.core.logger import get_logger
 
 logger = get_logger(__name__)
 
+# Pin BLAS / OpenMP / pyarrow to single-threaded mode so byte-determinism
+# (AGENTS.md §4 rule 3) holds across hardware. Without this, multi-threaded
+# floating-point reductions can reorder and produce non-bit-identical output.
+os.environ.setdefault("OMP_NUM_THREADS", "1")
+os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")
+os.environ.setdefault("MKL_NUM_THREADS", "1")
+pa.set_cpu_count(1)
+pa.set_io_thread_count(1)
+
 # Suppress RDKit's noisy C++-level warning stream; we surface our own
 # structured warnings via the project logger when a SMILES fails to parse.
 #
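
Note on the comment about reordered reductions: floating-point addition is not associative, so summing the same values in a different order can change the last bits of the result. The sketch below shows the effect with three values in plain Python. It illustrates the arithmetic only and does not reproduce how BLAS or pyarrow split work across threads. `sum_in_order` is a hypothetical helper.

```python
import struct


def sum_in_order(values):
    """Accumulate left to right, the way a single thread would."""
    total = 0.0
    for value in values:
        total += value
    return total


forward = sum_in_order([0.1, 0.2, 0.3])
backward = sum_in_order([0.3, 0.2, 0.1])

# Compare the raw IEEE-754 bytes, which is what a byte-identical file depends on.
forward_bytes = struct.pack("<d", forward)
backward_bytes = struct.pack("<d", backward)

if __name__ == "__main__":
    print(forward, backward, forward_bytes == backward_bytes)
```

The two sums agree to about sixteen significant digits but differ in their byte representation, which is enough to change a serialized output file.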
src/pipelines/eeg_pipeline.py
CHANGED
@@ -12,11 +12,13 @@ a logged WARNING), determinism (seeded ICA + sklearn RNG), traceability
 """
 from __future__ import annotations
 
+import os
 from pathlib import Path
 
 import mne
 import numpy as np
 import pandas as pd
+import pyarrow as pa
 from mne.preprocessing import ICA
 from scipy import signal as scipy_signal
 from scipy import stats as scipy_stats
@@ -25,6 +27,15 @@ from src.core.logger import get_logger
 
 logger = get_logger(__name__)
 
+# Pin BLAS / OpenMP / pyarrow to single-threaded mode so byte-determinism
+# (AGENTS.md §4 rule 3) holds across hardware. Without this, multi-threaded
+# floating-point reductions can reorder and produce non-bit-identical output.
+os.environ.setdefault("OMP_NUM_THREADS", "1")
+os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")
+os.environ.setdefault("MKL_NUM_THREADS", "1")
+pa.set_cpu_count(1)
+pa.set_io_thread_count(1)
+
 # Pearson-correlation threshold for EOG-component rejection in ICA.
 # Real-world EOG components typically score 0.8-0.95 against the EOG channel;
 # 0.9 is a conservative floor that avoids false positives at the cost of