# NeuroBridge Enterprise Pipeline

NeuroBridge Enterprise targets three chronic failure modes in clinical ML: data drift
across acquisition sites, missing modalities, and signal/image artifacts. It is built
around three specialist preprocessing pipelines (MRI ComBat harmonization, EEG MNE+ICA,
and BBB molecular featurization with RDKit) that will sit behind a single FastAPI surface
with MLflow tracking and Docker packaging. Only the BBB pipeline has shipped so far; the
Status table lists the rest.

## Status

| Day | Modality | Pipeline | Status |
|-----|----------|----------|--------|
| 1 | Tabular (BBB / molecules) | [`bbb_pipeline.py`](src/pipelines/bbb_pipeline.py) | Shipped — 30 tests green |
| 2 | Signal (EEG) | [`eeg_pipeline.py`](src/pipelines/eeg_pipeline.py) | Planned (MNE-Python + ICA) |
| 3 | Image (MRI / fMRI) | [`mri_pipeline.py`](src/pipelines/mri_pipeline.py) | Planned (ComBat harmonization) |

## Quick Start

**Prerequisite:** Python 3.10–3.12. The package versions pinned in `requirements.txt`
have no cp313+ wheels, so `.python-version` pins the interpreter to 3.12.

```bash
# 1. Create venv and install
python3.12 -m venv .venv312 && source .venv312/bin/activate && pip install -r requirements.txt

# 2. Verify — expect 30 passed
pytest -v

# 3. Smoke run with the bundled 6-row fixture
mkdir -p data/raw && cp tests/fixtures/bbbp_sample.csv data/raw/bbbp.csv
python -m src.pipelines.bbb_pipeline

# 4. Inspect the output at data/processed/bbbp_features.parquet
python -c "import pandas as pd; df = pd.read_parquet('data/processed/bbbp_features.parquet'); print(df.shape, df.dtypes.head())"
```

> **Real BBBP data:** not bundled (gitignored). Download from
> [Kaggle](https://www.kaggle.com/datasets/priyanagda/bbbp) or
> [MoleculeNet](https://moleculenet.org/datasets-1); place as `data/raw/bbbp.csv`.

## Repository Layout

```text
.
├── AGENTS.md                 # Project contract (vision, layout, code & data rules) — read first
├── README.md                 # this file
├── requirements.txt          # Pinned deps; Python 3.10–3.12 only
├── .python-version           # 3.12
├── pytest.ini
├── data/
│   ├── raw/                  # vendor inputs (CSV / EDF / NIfTI); gitignored
│   └── processed/            # Parquet outputs from pipelines; gitignored
├── docs/superpowers/plans/   # Per-day implementation plans
├── src/
│   ├── core/logger.py        # Shared structured logger (mandatory in every pipeline)
│   ├── pipelines/
│   │   └── bbb_pipeline.py   # Day-1 pipeline (4 public funcs + CLI entry)
│   └── api/                  # FastAPI surface (placeholder until Day 4+)
└── tests/
    ├── core/, pipelines/     # Mirror src/ structure
    └── fixtures/             # bbbp_sample.csv (6 rows for smoke tests)
```

## BBB Pipeline (Day 1)

| Function | Purpose |
|----------|---------|
| `is_valid_smiles(smiles)` | Returns `True` iff the input is a non-empty SMILES that RDKit can parse. Handles `None`, `NaN`, and garbage strings. |
| `compute_morgan_fingerprint(smiles, n_bits, radius)` | Returns a `(n_bits,)` `uint8` numpy array using the modern `MorganGenerator` API. |
| `extract_features_from_dataframe(df, smiles_col, n_bits, radius)` | Drops invalid rows (logged WARNING with truncated index list), expands fingerprints into `fp_0..fp_{n-1}` columns, preserves metadata. Returns a model-ready `pd.DataFrame`. |
| `run_pipeline(input_path, output_path, smiles_col, n_bits, radius)` | End-to-end CSV → Parquet orchestrator. Idempotent; raises on missing input or directory output. |

All four functions log via `src.core.logger.get_logger(__name__)` per AGENTS.md §3 and
satisfy the §4 Data Readiness contract (5 invariants: schema validity, domain validity,
determinism, traceability, idempotence).

## Storage Format

Pipeline outputs are written as Parquet files using the `pyarrow` engine with snappy
compression. This preserves dtypes (`uint8` fingerprint columns stay `uint8` instead of
widening to `int64` as CSV would do) and yields ~10× smaller files than CSV — material
for the `float32` EEG features Day 2 will produce. See AGENTS.md §6.

## Testing & TDD

All four BBB functions and the shared logger were built TDD-first (RED → GREEN →
REFACTOR). Each task ended in a green commit; review-and-fix loops landed as separate
commits with `fix:` / `refactor:` prefixes. Run `pytest -v` at any time — the full suite
finishes in under 2 seconds on a 2024 laptop.

## Roadmap

- **Day 2:** `eeg_pipeline.py` — load EDF/FIF, MNE-Python ICA artifact removal, write
  `float32` features to Parquet.
- **Day 3:** `mri_pipeline.py` — load NIfTI volumes, ComBat harmonization
  (`neuroharmonize`) for site-level domain shift, write features to Parquet.
- **Day 4+:** FastAPI surface in `src/api/`, MLflow experiment tracking, Docker images,
  CI.

## Where to Look

- **Project rules (mandatory reading for any agent):** [`AGENTS.md`](AGENTS.md)
- **Day-1 plan (full TDD task breakdown):** [`docs/superpowers/plans/2026-04-29-neurobridge-day1-bootstrap-bbb-pipeline.md`](docs/superpowers/plans/2026-04-29-neurobridge-day1-bootstrap-bbb-pipeline.md)
- **Logger contract:** [`src/core/logger.py`](src/core/logger.py) + [`tests/core/test_logger.py`](tests/core/test_logger.py)
- **BBB pipeline:** [`src/pipelines/bbb_pipeline.py`](src/pipelines/bbb_pipeline.py) + [`tests/pipelines/test_bbb_pipeline.py`](tests/pipelines/test_bbb_pipeline.py)