
NeuroBridge Enterprise Pipeline

NeuroBridge Enterprise targets three chronic failure modes in clinical ML: data drift across acquisition sites, missing modalities, and signal/image artifacts. It runs three specialist preprocessing pipelines (MRI ComBat harmonization, EEG MNE+ICA, and BBB molecular featurization with RDKit) behind a single FastAPI surface, with MLflow tracking and Docker shipping.

Status

| Day | Modality | Pipeline | Status |
|-----|----------|----------|--------|
| 1 | Tabular (BBB / molecules) | bbb_pipeline.py | Shipped (30 tests green) |
| 2 | Signal (EEG) | eeg_pipeline.py | Planned (MNE-Python + ICA) |
| 3 | Image (MRI / fMRI) | mri_pipeline.py | Planned (ComBat harmonization) |

Quick Start

Prerequisite: Python 3.10–3.12. The dependencies pinned in requirements.txt have no cp313+ wheels (Python 3.13 and later), and .python-version pins the interpreter to 3.12.

```bash
# 1. Create venv and install
python3.12 -m venv .venv312 && source .venv312/bin/activate && pip install -r requirements.txt

# 2. Verify: expect 30 passed
pytest -v

# 3. Smoke run with the bundled 6-row fixture
mkdir -p data/raw && cp tests/fixtures/bbbp_sample.csv data/raw/bbbp.csv
python -m src.pipelines.bbb_pipeline

# 4. Inspect the output at data/processed/bbbp_features.parquet
python -c "import pandas as pd; df = pd.read_parquet('data/processed/bbbp_features.parquet'); print(df.shape, df.dtypes.head())"
```

Real BBBP data: not bundled (gitignored). Download from Kaggle or MoleculeNet; place as data/raw/bbbp.csv.

Repository Layout

.
├── AGENTS.md                 # Project contract (vision, layout, code & data rules) — read first
├── README.md                 # this file
├── requirements.txt          # Pinned deps; Python 3.10–3.12 only
├── .python-version           # 3.12
├── pytest.ini
├── data/
│   ├── raw/                  # vendor inputs (CSV / EDF / NIfTI); gitignored
│   └── processed/            # Parquet outputs from pipelines; gitignored
├── docs/superpowers/plans/   # Per-day implementation plans
├── src/
│   ├── core/logger.py        # Shared structured logger (mandatory in every pipeline)
│   ├── pipelines/
│   │   └── bbb_pipeline.py   # Day-1 pipeline (4 public funcs + CLI entry)
│   └── api/                  # FastAPI surface (placeholder until Day 4+)
└── tests/
    ├── core/, pipelines/     # Mirror src/ structure
    └── fixtures/             # bbbp_sample.csv (6 rows for smoke tests)

BBB Pipeline (Day 1)

| Function | Purpose |
|----------|---------|
| is_valid_smiles(smiles) | Returns True iff the input is a non-empty SMILES that RDKit can parse. Handles None, NaN, and garbage strings. |
| compute_morgan_fingerprint(smiles, n_bits, radius) | Returns a (n_bits,) uint8 numpy array using the modern MorganGenerator API. |
| extract_features_from_dataframe(df, smiles_col, n_bits, radius) | Drops invalid rows (logged WARNING with truncated index list), expands fingerprints into fp_0..fp_{n-1} columns, preserves metadata. Returns a model-ready pd.DataFrame. |
| run_pipeline(input_path, output_path, smiles_col, n_bits, radius) | End-to-end CSV → Parquet orchestrator. Idempotent; raises on missing input or directory output. |
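
The first two functions in the table can be sketched roughly as follows. This is a minimal illustration, not the code in bbb_pipeline.py: the default values n_bits=2048 and radius=2, the ValueError raised on unparseable input, and the order of the validation checks are all assumptions. The sketch uses RDKit's rdFingerprintGenerator.GetMorganGenerator, the generator-based API that the table calls MorganGenerator.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator


def is_valid_smiles(smiles) -> bool:
    """Return True only for a non-empty string that RDKit can parse."""
    # None, NaN (a float), and any other non-string input are rejected here.
    if not isinstance(smiles, str) or not smiles.strip():
        return False
    # MolFromSmiles returns None when parsing fails.
    return Chem.MolFromSmiles(smiles) is not None


def compute_morgan_fingerprint(
    smiles: str, n_bits: int = 2048, radius: int = 2  # defaults are assumptions
) -> np.ndarray:
    """Return a (n_bits,) uint8 Morgan bit vector for one SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # Assumed behavior: the real function may handle this differently.
        raise ValueError(f"Invalid SMILES: {smiles!r}")
    generator = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=n_bits)
    # Cast explicitly so the dtype contract does not depend on RDKit's default.
    return generator.GetFingerprintAsNumPy(mol).astype(np.uint8)
```

Called as compute_morgan_fingerprint("CCO", n_bits=1024), the sketch returns a length-1024 uint8 vector, and repeated calls on the same SMILES give identical bits.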

All four functions log via src.core.logger.get_logger(__name__) per AGENTS.md §3 and satisfy the §4 Data Readiness contract (5 invariants: schema validity, domain validity, determinism, traceability, idempotence).
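
As a rough illustration of the logging rule and of the path checks that run_pipeline performs, the sketch below uses only the standard library. The get_logger body, the log format, and the check_paths helper (including its exception types) are hypothetical stand-ins; the real implementations live in src/core/logger.py and bbb_pipeline.py and may differ.

```python
import logging
import sys
from pathlib import Path


def get_logger(name: str) -> logging.Logger:
    """Hypothetical stand-in for src.core.logger.get_logger."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # attach the handler once, even on repeated calls
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


log = get_logger(__name__)


def check_paths(input_path, output_path) -> tuple[Path, Path]:
    """Hypothetical helper: validate paths before any work is done."""
    src, dst = Path(input_path), Path(output_path)
    if not src.is_file():
        raise FileNotFoundError(f"Input not found: {src}")
    if dst.is_dir():
        raise IsADirectoryError(f"Output path is a directory: {dst}")
    # Creating the parent directory is safe to repeat, so re-running the
    # pipeline with the same arguments does not fail on the second run.
    dst.parent.mkdir(parents=True, exist_ok=True)
    log.info("input=%s output=%s", src, dst)
    return src, dst
```

Failing before any featurization starts keeps a bad invocation from leaving partial output behind.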

Storage Format

Pipeline outputs are written as Parquet files using the pyarrow engine with snappy compression. Parquet stores each column's dtype, so uint8 fingerprint columns stay uint8 instead of being re-inferred as int64 when read back from CSV. It also yields ~10× smaller files than CSV, which matters for the float32 EEG features Day 2 will produce. See AGENTS.md §6.

Testing & TDD

All four BBB functions and the shared logger were built TDD-first (RED → GREEN → REFACTOR). Each task ended in a green commit; review-and-fix loops landed as separate commits with fix: / refactor: prefixes. Run pytest -v at any time; the full suite finishes in under 2 seconds on a 2024 laptop.

Roadmap

  • Day 2: eeg_pipeline.py. Load EDF/FIF, remove artifacts with MNE-Python ICA, write float32 features to Parquet.
  • Day 3: mri_pipeline.py. Load NIfTI volumes, apply ComBat harmonization (neuroHarmonize) to correct site-level domain shift, write features to Parquet.
  • Day 4+: FastAPI surface in src/api/, MLflow experiment tracking, Docker images, CI.

Where to Look