mekosotto (Claude Sonnet 4.6) committed a13e268 (1 parent: 915880e): docs: add README with quick start, status, and Day-2 onboarding map

# NeuroBridge Enterprise Pipeline

NeuroBridge Enterprise tackles the three chronic failure modes in clinical ML — data drift
across acquisition sites, missing modalities, and signal/image artifacts — by running
three specialist preprocessing pipelines (MRI ComBat harmonization, EEG MNE+ICA, and BBB
molecular featurization with RDKit) behind a single FastAPI surface with MLflow tracking
and Docker shipping.

## Status

| Day | Modality | Pipeline | Status |
|-----|----------|----------|--------|
| 1 | Tabular (BBB / molecules) | [`bbb_pipeline.py`](src/pipelines/bbb_pipeline.py) | Shipped — 30 tests green |
| 2 | Signal (EEG) | [`eeg_pipeline.py`](src/pipelines/eeg_pipeline.py) | Planned (MNE-Python + ICA) |
| 3 | Image (MRI / fMRI) | [`mri_pipeline.py`](src/pipelines/mri_pipeline.py) | Planned (ComBat harmonization) |

## Quick Start

**Prerequisite:** Python 3.10–3.12. The pinned `requirements.txt` has no cp313+ wheels;
`.python-version` pins to 3.12.

```bash
# 1. Create venv and install
python3.12 -m venv .venv312 && source .venv312/bin/activate && pip install -r requirements.txt

# 2. Verify — expect 30 passed
pytest -v

# 3. Smoke run with the bundled 6-row fixture
mkdir -p data/raw && cp tests/fixtures/bbbp_sample.csv data/raw/bbbp.csv
python -m src.pipelines.bbb_pipeline

# 4. Inspect the output at data/processed/bbbp_features.parquet
python -c "import pandas as pd; df = pd.read_parquet('data/processed/bbbp_features.parquet'); print(df.shape, df.dtypes.head())"
```

> **Real BBBP data:** not bundled (gitignored). Download from
> [Kaggle](https://www.kaggle.com/datasets/priyanagda/bbbp) or
> [MoleculeNet](https://moleculenet.org/datasets-1); place as `data/raw/bbbp.csv`.

## Repository Layout

```text
.
├── AGENTS.md                 # Project contract (vision, layout, code & data rules) — read first
├── README.md                 # this file
├── requirements.txt          # Pinned deps; Python 3.10–3.12 only
├── .python-version           # 3.12
├── pytest.ini
├── data/
│   ├── raw/                  # vendor inputs (CSV / EDF / NIfTI); gitignored
│   └── processed/            # Parquet outputs from pipelines; gitignored
├── docs/superpowers/plans/   # Per-day implementation plans
├── src/
│   ├── core/logger.py        # Shared structured logger (mandatory in every pipeline)
│   ├── pipelines/
│   │   └── bbb_pipeline.py   # Day-1 pipeline (4 public funcs + CLI entry)
│   └── api/                  # FastAPI surface (placeholder until Day 4+)
└── tests/
    ├── core/, pipelines/     # Mirror src/ structure
    └── fixtures/             # bbbp_sample.csv (6 rows for smoke tests)
```


## BBB Pipeline (Day 1)

| Function | Purpose |
|----------|---------|
| `is_valid_smiles(smiles)` | Returns `True` iff the input is a non-empty SMILES that RDKit can parse. Handles `None`, `NaN`, and garbage strings. |
| `compute_morgan_fingerprint(smiles, n_bits, radius)` | Returns a `(n_bits,)` `uint8` numpy array using the modern `MorganGenerator` API. |
| `extract_features_from_dataframe(df, smiles_col, n_bits, radius)` | Drops invalid rows (logged WARNING with truncated index list), expands fingerprints into `fp_0..fp_{n-1}` columns, preserves metadata. Returns a model-ready `pd.DataFrame`. |
| `run_pipeline(input_path, output_path, smiles_col, n_bits, radius)` | End-to-end CSV → Parquet orchestrator. Idempotent; raises on missing input or directory output. |
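
The `run_pipeline` guard rails can be sketched in pure stdlib terms. This is a hypothetical simplification (the real orchestrator reads CSV with pandas and writes Parquet via `pyarrow`), shown only to illustrate the missing-input, directory-output, and idempotence behavior:

```python
from pathlib import Path

def run_pipeline_sketch(input_path: str, output_path: str) -> Path:
    """Stdlib stand-in for run_pipeline's contract:

    - raises FileNotFoundError if the input is missing
    - raises IsADirectoryError if the output path is a directory
    - idempotent: re-running produces the same output file
    """
    src, dst = Path(input_path), Path(output_path)
    if not src.is_file():
        raise FileNotFoundError(f"input not found: {src}")
    if dst.is_dir():
        raise IsADirectoryError(f"output path is a directory: {dst}")
    dst.parent.mkdir(parents=True, exist_ok=True)
    # Stand-in for "featurize and write Parquet": a deterministic byte copy.
    dst.write_bytes(src.read_bytes())
    return dst
```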

All four functions log via `src.core.logger.get_logger(__name__)` per AGENTS.md §3 and
satisfy the §4 Data Readiness contract (5 invariants: schema validity, domain validity,
determinism, traceability, idempotence).
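
To illustrate the shape of the data only (not the chemistry): a toy, pure-Python stand-in for the fingerprint step. Real fingerprints come from RDKit's `MorganGenerator`; this sketch merely hashes substrings into a fixed-length 0/1 vector and expands it into `fp_0..fp_{n-1}` keys, the way `extract_features_from_dataframe` expands columns:

```python
import hashlib

def toy_bit_fingerprint(smiles: str, n_bits: int = 16) -> list:
    """Toy stand-in for compute_morgan_fingerprint (NOT RDKit).

    Hashes every 1-3 character substring (a crude proxy for circular
    substructures) into a fixed-length 0/1 vector of shape (n_bits,).
    """
    bits = [0] * n_bits
    for size in (1, 2, 3):                       # substring "radii"
        for i in range(len(smiles) - size + 1):
            fragment = smiles[i:i + size]
            h = int(hashlib.md5(fragment.encode()).hexdigest(), 16)
            bits[h % n_bits] = 1                 # fold the hash into n_bits buckets
    return bits

fp = toy_bit_fingerprint("CCO")                  # ethanol
row = {f"fp_{i}": b for i, b in enumerate(fp)}   # the fp_0..fp_{n-1} expansion
```

The real pipeline does the same expansion per row, with `n_bits` typically far larger and each column stored as `uint8`.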

## Storage Format

Pipeline outputs are written as Parquet files using the `pyarrow` engine with snappy
compression. This preserves dtypes (`uint8` fingerprint columns stay `uint8` instead of
widening to `int64` as a CSV round-trip would force) and yields roughly 10× smaller files
than CSV — a saving that matters even more for the `float32` EEG features Day 2 will
produce. See AGENTS.md §6.
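
A back-of-the-envelope stdlib illustration of why dtype preservation matters (numbers below are synthetic; in practice the ~10× figure also comes from snappy compression on near-constant columns):

```python
import array

# 2,048-bit fingerprints for 100 molecules, stored three ways.
n_mols, n_bits = 100, 2_048
values = [(i + j) % 2 for i in range(n_mols) for j in range(n_bits)]  # dummy 0/1 data

# CSV: every 0/1 cell costs two bytes of text ("0," or "1,")...
csv_bytes = sum(len(f"{v},") for v in values)

# ...and the dtype is lost, so readers typically widen the column
# back to int64 (8 bytes per value).
int64_bytes = len(array.array("q", values).tobytes())

# Columnar binary with preserved dtype (what Parquet stores,
# before compression): one byte per uint8 value.
uint8_bytes = len(array.array("B", values).tobytes())
```

So even before compression, preserved `uint8` columns are 2× smaller than the CSV text and 8× smaller than the widened `int64` representation.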
83
+
84
+ ## Testing & TDD
85
+
86
+ All four BBB functions and the shared logger were built TDD-first (RED → GREEN →
87
+ REFACTOR). Each task ended in a green commit; review-and-fix loops landed as separate
88
+ commits with `fix:` / `refactor:` prefixes. Run `pytest -v` at any time — the full suite
89
+ finishes in under 2 seconds on a 2024 laptop.

## Roadmap

- **Day 2:** `eeg_pipeline.py` — load EDF/FIF, MNE-Python ICA artifact removal, write
  `float32` features to Parquet.
- **Day 3:** `mri_pipeline.py` — load NIfTI volumes, ComBat harmonization
  (`neuroharmonize`) for site-level domain shift, write features to Parquet.
- **Day 4+:** FastAPI surface in `src/api/`, MLflow experiment tracking, Docker images,
  CI.

## Where to Look

- **Project rules (mandatory reading for any agent):** [`AGENTS.md`](AGENTS.md)
- **Day-1 plan (full TDD task breakdown):** [`docs/superpowers/plans/2026-04-29-neurobridge-day1-bootstrap-bbb-pipeline.md`](docs/superpowers/plans/2026-04-29-neurobridge-day1-bootstrap-bbb-pipeline.md)
- **Logger contract:** [`src/core/logger.py`](src/core/logger.py) + [`tests/core/test_logger.py`](tests/core/test_logger.py)
- **BBB pipeline:** [`src/pipelines/bbb_pipeline.py`](src/pipelines/bbb_pipeline.py) + [`tests/pipelines/test_bbb_pipeline.py`](tests/pipelines/test_bbb_pipeline.py)