	# NeuroBridge Enterprise Pipeline

	NeuroBridge Enterprise targets three chronic failure modes in clinical ML: data drift
	across acquisition sites, missing modalities, and signal/image artifacts. It addresses
	them with three specialist preprocessing pipelines (MRI ComBat harmonization, EEG
	artifact removal with MNE-Python ICA, and BBB molecular featurization with RDKit) that
	are planned to sit behind a single FastAPI surface, with MLflow tracking and Docker
	packaging. Only the BBB pipeline has shipped so far; see Status.

	## Status

	| Day | Modality | Pipeline | Status |
	|-----|----------|----------|--------|
	| 1 | Tabular (BBB / molecules) | [`bbb_pipeline.py`](src/pipelines/bbb_pipeline.py) | Shipped (30 tests green) |
	| 2 | Signal (EEG) | [`eeg_pipeline.py`](src/pipelines/eeg_pipeline.py) | Planned (MNE-Python + ICA) |
	| 3 | Image (MRI / fMRI) | [`mri_pipeline.py`](src/pipelines/mri_pipeline.py) | Planned (ComBat harmonization) |

	## Quick Start

	Prerequisite: Python 3.10–3.12. The packages pinned in `requirements.txt` have no
	cp313+ wheels, so Python 3.13 and later are not supported; `.python-version` pins 3.12.

	```bash
	# 1. Create venv and install
	python3.12 -m venv .venv312 && source .venv312/bin/activate && pip install -r requirements.txt

	# 2. Verify — expect 30 passed
	pytest -v

	# 3. Smoke run with the bundled 6-row fixture
	mkdir -p data/raw && cp tests/fixtures/bbbp_sample.csv data/raw/bbbp.csv
	python -m src.pipelines.bbb_pipeline

	# 4. Inspect the output at data/processed/bbbp_features.parquet
	python -c "import pandas as pd; df = pd.read_parquet('data/processed/bbbp_features.parquet'); print(df.shape, df.dtypes.head())"
	```

	> Real BBBP data: not bundled (gitignored). Download from
	> [Kaggle](https://www.kaggle.com/datasets/priyanagda/bbbp) or
	> [MoleculeNet](https://moleculenet.org/datasets-1); place as `data/raw/bbbp.csv`.

	## Repository Layout

	```text
	.
	├── AGENTS.md                 # Project contract (vision, layout, code & data rules); read first
	├── README.md                 # This file
	├── requirements.txt          # Pinned deps; Python 3.10–3.12 only
	├── .python-version           # 3.12
	├── pytest.ini
	├── data/
	│   ├── raw/                  # Vendor inputs (CSV / EDF / NIfTI); gitignored
	│   └── processed/            # Parquet outputs from pipelines; gitignored
	├── docs/superpowers/plans/   # Per-day implementation plans
	├── src/
	│   ├── core/logger.py        # Shared structured logger (mandatory in every pipeline)
	│   ├── pipelines/
	│   │   └── bbb_pipeline.py   # Day-1 pipeline (4 public funcs + CLI entry)
	│   └── api/                  # FastAPI surface (placeholder until Day 4+)
	└── tests/
	    ├── core/, pipelines/     # Mirror src/ structure
	    └── fixtures/             # bbbp_sample.csv (6 rows for smoke tests)
	```

	## BBB Pipeline (Day 1)

	| Function | Purpose |
	|----------|---------|
	| `is_valid_smiles(smiles)` | Returns `True` iff the input is a non-empty SMILES string that RDKit can parse. Handles `None`, `NaN`, and garbage strings. |
	| `compute_morgan_fingerprint(smiles, n_bits, radius)` | Returns a `uint8` numpy array of shape `(n_bits,)`, built with the modern `MorganGenerator` API. |
	| `extract_features_from_dataframe(df, smiles_col, n_bits, radius)` | Drops rows with invalid SMILES and logs a WARNING with a truncated list of their indices, expands fingerprints into columns `fp_0..fp_{n_bits-1}`, and preserves metadata columns. Returns a model-ready `pd.DataFrame`. |
	| `run_pipeline(input_path, output_path, smiles_col, n_bits, radius)` | End-to-end CSV → Parquet orchestrator. Idempotent; raises if the input file is missing or the output path is a directory. |
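
The first two rows of the table can be sketched as follows. This is a minimal illustration built on RDKit's `rdFingerprintGenerator` module, not the code in `bbb_pipeline.py`. The defaults `n_bits=2048` and `radius=2`, and the choice to raise `ValueError` on unparseable input, are assumptions; the source does not state them.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator


def is_valid_smiles(smiles) -> bool:
    """Return True only for a non-empty string that RDKit can parse."""
    # None and NaN are not strings, so they are rejected here.
    if not isinstance(smiles, str) or not smiles.strip():
        return False
    # MolFromSmiles returns None (rather than raising) on a parse failure.
    return Chem.MolFromSmiles(smiles) is not None


def compute_morgan_fingerprint(
    smiles: str, n_bits: int = 2048, radius: int = 2
) -> np.ndarray:
    """Return a (n_bits,) uint8 array of 0/1 Morgan fingerprint bits."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # Assumption: the real function may handle bad input differently.
        raise ValueError(f"Unparseable SMILES: {smiles!r}")
    generator = rdFingerprintGenerator.GetMorganGenerator(
        radius=radius, fpSize=n_bits
    )
    # Cast explicitly so the dtype does not depend on the RDKit version.
    return np.asarray(generator.GetFingerprintAsNumPy(mol), dtype=np.uint8)
```

Checking validity before fingerprinting keeps the fingerprint function free of `None`/`NaN` handling, which is why the DataFrame-level function can drop bad rows first and fingerprint only the rest.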

	All four functions log via `src.core.logger.get_logger(__name__)` per AGENTS.md §3 and
	satisfy the §4 Data Readiness contract (5 invariants: schema validity, domain validity,
	determinism, traceability, idempotence).
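
AGENTS.md §3 and `src/core/logger.py` are not reproduced in this README, so the following is only a sketch of what a shared structured logger with a `get_logger(name)` entry point could look like, using the standard `logging` module. The `JsonFormatter` class and the JSON field names are assumptions.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter: one JSON object per log record."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def get_logger(name: str) -> logging.Logger:
    """Return a named logger with exactly one JSON handler attached."""
    logger = logging.getLogger(name)
    # Attach the handler only once so repeated calls do not duplicate output.
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
        # Do not forward to the root logger, which would print the record twice.
        logger.propagate = False
    return logger
```

A pipeline module would then call `log = get_logger(__name__)` once at import time and use `log.warning("Dropped %d invalid rows", n)` for the invalid-row report described in the table above.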

	## Storage Format

	Pipeline outputs are written as Parquet files using the `pyarrow` engine with snappy
	compression. Parquet preserves dtypes: `uint8` fingerprint columns stay `uint8`, whereas
	a CSV round trip through pandas reads them back as `int64`. It also yields ~10× smaller
	files than CSV, which will matter for the `float32` EEG features Day 2 will produce.
	See AGENTS.md §6.

	## Testing & TDD

	All four BBB functions and the shared logger were built TDD-first (RED → GREEN →
	REFACTOR). Each task ended in a green commit; review-and-fix loops landed as separate
	commits with `fix:` / `refactor:` prefixes. Run `pytest -v` at any time — the full suite
	finishes in under 2 seconds on a 2024 laptop.
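
As an illustration of the kind of test a RED step starts from, here is a sketch covering the "raises on missing input or directory output" contract of `run_pipeline`. `check_paths` is a hypothetical stand-in written for this example, not a function from the repository, and the exception types are assumptions.

```python
from pathlib import Path

import pytest


def check_paths(input_path: Path, output_path: Path) -> None:
    """Hypothetical stand-in for the path checks described for run_pipeline."""
    if not input_path.is_file():
        # Assumption: the exception types are illustrative, not from the repo.
        raise FileNotFoundError(f"Input not found: {input_path}")
    if output_path.is_dir():
        raise IsADirectoryError(f"Output path is a directory: {output_path}")


def test_missing_input_raises(tmp_path: Path) -> None:
    # Nothing has been written to tmp_path yet, so the input does not exist.
    with pytest.raises(FileNotFoundError):
        check_paths(tmp_path / "bbbp.csv", tmp_path / "out.parquet")


def test_directory_output_raises(tmp_path: Path) -> None:
    source = tmp_path / "bbbp.csv"
    source.write_text("smiles\nCCO\n")
    # Passing the directory itself as the output path must be rejected.
    with pytest.raises(IsADirectoryError):
        check_paths(source, tmp_path)


def test_valid_paths_pass(tmp_path: Path) -> None:
    source = tmp_path / "bbbp.csv"
    source.write_text("smiles\nCCO\n")
    assert check_paths(source, tmp_path / "out.parquet") is None
```

`tmp_path` is pytest's built-in per-test temporary directory fixture, so the tests leave nothing behind in `data/`.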

	## Roadmap

	- Day 2: `eeg_pipeline.py`: load EDF/FIF, MNE-Python ICA artifact removal, write
	  `float32` features to Parquet.
	- Day 3: `mri_pipeline.py`: load NIfTI volumes, ComBat harmonization
	  (`neuroharmonize`) for site-level domain shift, write features to Parquet.
	- Day 4+: FastAPI surface in `src/api/`, MLflow experiment tracking, Docker images,
	  CI.

	## Where to Look

	- Project rules (mandatory reading for any agent): [`AGENTS.md`](AGENTS.md)
	- Day-1 plan (full TDD task breakdown): [`docs/superpowers/plans/2026-04-29-neurobridge-day1-bootstrap-bbb-pipeline.md`](docs/superpowers/plans/2026-04-29-neurobridge-day1-bootstrap-bbb-pipeline.md)
	- Logger contract: [`src/core/logger.py`](src/core/logger.py) + [`tests/core/test_logger.py`](tests/core/test_logger.py)
	- BBB pipeline: [`src/pipelines/bbb_pipeline.py`](src/pipelines/bbb_pipeline.py) + [`tests/pipelines/test_bbb_pipeline.py`](tests/pipelines/test_bbb_pipeline.py)