docs: add README with quick start, status, and Day-2 onboarding map
README.md
ADDED
@@ -0,0 +1,105 @@
# NeuroBridge Enterprise Pipeline

NeuroBridge Enterprise targets three chronic failure modes in clinical ML: data drift
across acquisition sites, missing modalities, and signal/image artifacts. It addresses
them with three specialist preprocessing pipelines (MRI ComBat harmonization, EEG MNE+ICA,
and blood-brain barrier (BBB) molecular featurization with RDKit) behind a single FastAPI
surface, with MLflow tracking and Docker shipping. Only the BBB pipeline has shipped so
far; the Status table lists what is still planned.

## Status

| Day | Modality | Pipeline | Status |
|-----|----------|----------|--------|
| 1 | Tabular (BBB / molecules) | [`bbb_pipeline.py`](src/pipelines/bbb_pipeline.py) | Shipped (30 tests green) |
| 2 | Signal (EEG) | [`eeg_pipeline.py`](src/pipelines/eeg_pipeline.py) | Planned (MNE-Python + ICA) |
| 3 | Image (MRI / fMRI) | [`mri_pipeline.py`](src/pipelines/mri_pipeline.py) | Planned (ComBat harmonization) |

## Quick Start

**Prerequisite:** Python 3.10–3.12. The versions pinned in `requirements.txt` have no
cp313+ wheels; `.python-version` pins the interpreter to 3.12.

```bash
# 1. Create a virtual environment and install dependencies
python3.12 -m venv .venv312 && source .venv312/bin/activate && pip install -r requirements.txt

# 2. Verify (expect 30 passed)
pytest -v

# 3. Smoke run with the bundled 6-row fixture
#    (cp overwrites any existing data/raw/bbbp.csv)
mkdir -p data/raw && cp tests/fixtures/bbbp_sample.csv data/raw/bbbp.csv
python -m src.pipelines.bbb_pipeline

# 4. Inspect the output at data/processed/bbbp_features.parquet
python -c "import pandas as pd; df = pd.read_parquet('data/processed/bbbp_features.parquet'); print(df.shape, df.dtypes.head())"
```

> **Real BBBP data:** not bundled (gitignored). Download it from
> [Kaggle](https://www.kaggle.com/datasets/priyanagda/bbbp) or
> [MoleculeNet](https://moleculenet.org/datasets-1) and save it as `data/raw/bbbp.csv`.
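
The Python 3.10–3.12 prerequisite can also be checked programmatically before installing anything. The snippet below is a minimal sketch and not part of the repository: `is_supported` is a hypothetical helper, and its bounds simply restate the supported range given in this section.

```python
import sys

# Inclusive bounds restating the prerequisite (assumption: minor versions only).
MIN_VERSION = (3, 10)
MAX_VERSION = (3, 12)


def is_supported(version: tuple[int, int]) -> bool:
    """Return True when (major, minor) lies inside the supported range."""
    return MIN_VERSION <= version <= MAX_VERSION


current = sys.version_info[:2]
state = "supported" if is_supported(current) else "unsupported"
print(f"Python {current[0]}.{current[1]} is {state} (need 3.10 to 3.12)")
```

A setup script could turn the `unsupported` case into a non-zero exit so that the install step never runs on an interpreter without matching wheels.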

## Repository Layout

```text
.
├── AGENTS.md                # Project contract (vision, layout, code & data rules); read first
├── README.md                # This file
├── requirements.txt         # Pinned deps; Python 3.10–3.12 only
├── .python-version          # 3.12
├── pytest.ini
├── data/
│   ├── raw/                 # Vendor inputs (CSV / EDF / NIfTI); gitignored
│   └── processed/           # Parquet outputs from pipelines; gitignored
├── docs/superpowers/plans/  # Per-day implementation plans
├── src/
│   ├── core/logger.py       # Shared structured logger (mandatory in every pipeline)
│   ├── pipelines/
│   │   └── bbb_pipeline.py  # Day-1 pipeline (4 public funcs + CLI entry)
│   └── api/                 # FastAPI surface (placeholder until Day 4+)
└── tests/
    ├── core/, pipelines/    # Mirror src/ structure
    └── fixtures/            # bbbp_sample.csv (6 rows for smoke tests)
```
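
The "tests mirror `src/`" convention in the tree can be expressed as a small path mapping. This is an illustrative sketch only: `mirror_test_path` is a hypothetical helper that does not exist in the repository, and the `test_` filename prefix is inferred from the module and test paths listed in this README.

```python
from pathlib import PurePosixPath


def mirror_test_path(source: str) -> str:
    """Map src/<pkg>/<module>.py to tests/<pkg>/test_<module>.py."""
    path = PurePosixPath(source)
    if path.parts[:1] != ("src",) or path.suffix != ".py":
        raise ValueError(f"not a Python module under src/: {source}")
    # Keep the intermediate package directories, prefix the filename.
    return str(PurePosixPath("tests", *path.parts[1:-1], f"test_{path.name}"))


print(mirror_test_path("src/pipelines/bbb_pipeline.py"))
# tests/pipelines/test_bbb_pipeline.py
```

The same mapping gives `tests/core/test_logger.py` for `src/core/logger.py`, which is where the logger contract tests live.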

## BBB Pipeline (Day 1)

| Function | Purpose |
|----------|---------|
| `is_valid_smiles(smiles)` | Returns `True` if and only if the input is a non-empty SMILES string that RDKit can parse. `None`, `NaN`, and garbage strings are handled and yield `False`. |
| `compute_morgan_fingerprint(smiles, n_bits, radius)` | Returns a `uint8` NumPy array of shape `(n_bits,)`, built with the modern `MorganGenerator` API. |
| `extract_features_from_dataframe(df, smiles_col, n_bits, radius)` | Drops invalid rows (logged at WARNING level with a truncated index list), expands fingerprints into columns `fp_0` through `fp_{n_bits-1}`, and preserves metadata columns. Returns a model-ready `pd.DataFrame`. |
| `run_pipeline(input_path, output_path, smiles_col, n_bits, radius)` | End-to-end CSV → Parquet orchestrator. Idempotent; raises if the input file is missing or the output path is a directory. |

All four functions log via `src.core.logger.get_logger(__name__)` per AGENTS.md §3 and
satisfy the §4 Data Readiness contract, which has five invariants: schema validity,
domain validity, determinism, traceability, and idempotence.
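
The logging and failure behaviour described above can be sketched with the standard library alone. Everything here is an assumption made for illustration: the real `get_logger` lives in `src/core/logger.py` and may differ, the JSON line format is a guess at what "structured" means, and `format_dropped` and `check_io_paths` are hypothetical helper names that do not appear in the pipeline.

```python
import json
import logging
from pathlib import Path


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line (assumed format)."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def get_logger(name: str) -> logging.Logger:
    """Hypothetical stand-in for src.core.logger.get_logger."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # repeated calls must not stack handlers
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
        logger.propagate = False
    return logger


def format_dropped(indices: list[int], limit: int = 10) -> str:
    """Truncate a long index list so a WARNING line stays readable."""
    shown = ", ".join(str(i) for i in indices[:limit])
    extra = len(indices) - limit
    return f"[{shown}, ... +{extra} more]" if extra > 0 else f"[{shown}]"


def check_io_paths(input_path: str, output_path: str) -> tuple[Path, Path]:
    """Fail early on the two error cases listed for run_pipeline."""
    src, dst = Path(input_path), Path(output_path)
    if not src.is_file():
        raise FileNotFoundError(f"input not found: {src}")
    if dst.is_dir():
        raise IsADirectoryError(f"output path is a directory: {dst}")
    return src, dst
```

A pipeline function would then call something like `get_logger(__name__).warning("dropped rows %s", format_dropped(bad_rows))`, where `bad_rows` is whatever list of row indices failed validation.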

## Storage Format

Pipeline outputs are written as Parquet files using the `pyarrow` engine with snappy
compression. Parquet preserves dtypes: `uint8` fingerprint columns stay `uint8`, whereas
a CSV round trip through pandas would widen them to `int64`. It also yields ~10× smaller
files than CSV, which will matter for the `float32` EEG features that Day 2 will produce.
See AGENTS.md §6.
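
The dtype argument can be made concrete with a little arithmetic on the in-memory size of a fingerprint matrix once it is loaded back. The row count and bit width below are hypothetical values chosen for illustration, not figures from the project, and the calculation covers loaded memory only, not on-disk file size.

```python
from array import array

# Hypothetical sizes; the real values depend on the dataset and on the
# n_bits argument passed to the pipeline.
N_ROWS = 2_000
N_BITS = 2_048

UINT8_BYTES = array("B").itemsize  # unsigned 8-bit integer: 1 byte
INT64_BYTES = array("q").itemsize  # signed 64-bit integer: 8 bytes


def matrix_bytes(n_rows: int, n_cols: int, itemsize: int) -> int:
    """Raw size in bytes of a dense n_rows x n_cols matrix."""
    return n_rows * n_cols * itemsize


narrow = matrix_bytes(N_ROWS, N_BITS, UINT8_BYTES)
wide = matrix_bytes(N_ROWS, N_BITS, INT64_BYTES)
print(narrow, wide, wide // narrow)
# 4096000 32768000 8
```

Under these assumptions the widened matrix takes eight times the memory of the `uint8` one, which is the cost that dtype-preserving storage avoids on every load.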

## Testing & TDD

All four BBB functions and the shared logger were built test-first (RED → GREEN →
REFACTOR). Each task ended in a green commit; review-and-fix loops landed as separate
commits with `fix:` / `refactor:` prefixes. Run `pytest -v` at any time; the full suite
finishes in under 2 seconds on a 2024 laptop.

## Roadmap

- **Day 2:** `eeg_pipeline.py`: load EDF/FIF, remove artifacts with MNE-Python ICA, write
  `float32` features to Parquet.
- **Day 3:** `mri_pipeline.py`: load NIfTI volumes, apply ComBat harmonization
  (`neuroHarmonize`) to correct site-level domain shift, write features to Parquet.
- **Day 4+:** FastAPI surface in `src/api/`, MLflow experiment tracking, Docker images,
  CI.

## Where to Look

- **Project rules (mandatory reading for any agent):** [`AGENTS.md`](AGENTS.md)
- **Day-1 plan (full TDD task breakdown):** [`docs/superpowers/plans/2026-04-29-neurobridge-day1-bootstrap-bbb-pipeline.md`](docs/superpowers/plans/2026-04-29-neurobridge-day1-bootstrap-bbb-pipeline.md)
- **Logger contract:** [`src/core/logger.py`](src/core/logger.py) + [`tests/core/test_logger.py`](tests/core/test_logger.py)
- **BBB pipeline:** [`src/pipelines/bbb_pipeline.py`](src/pipelines/bbb_pipeline.py) + [`tests/pipelines/test_bbb_pipeline.py`](tests/pipelines/test_bbb_pipeline.py)