Spaces: mekosotto/hackathon

mekosotto committed 7 days ago

Commit d3d1ac7

1 Parent(s): ef4cf4a

docs: Day-4 close-out — AGENTS §7 tracking, README MLOps surface

Files changed (2)

AGENTS.md +36 -3
README.md +21 -2

AGENTS.md CHANGED

@@ -27,18 +27,36 @@ All experiment runs are tracked in **MLflow**. All services ship as **Docker** i
 ```
 .
 ├── AGENTS.md                 # This file
+├── README.md
 ├── requirements.txt
 ├── pytest.ini
+├── conftest.py               # Repo-wide pytest fixtures (autouse: pins MLFLOW_TRACKING_URI to tmp dir for test isolation)
+├── Dockerfile                # Production image (FastAPI + pipelines)
+├── docker-compose.yml        # api + mlflow services for local stack
+├── .dockerignore
+├── .streamlit/
+│   └── config.toml           # Streamlit theme tokens
 ├── data/
 │   ├── raw/                  # Untouched source data. NEVER train on this directly.
 │   └── processed/            # Pipeline output as Parquet (preserves dtypes; overwritten each run; see §4).
 ├── src/
-│   ├── api/                  # FastAPI routers, request/response schemas
+│   ├── api/                  # FastAPI surface
+│   │   ├── main.py           # App factory + /health
+│   │   ├── routes.py         # POST /pipeline/{bbb,eeg,mri} dispatch
+│   │   └── schemas.py        # Shared Pydantic request/response models
+│   ├── core/                 # Cross-cutting utilities
+│   │   ├── logger.py         # Structured logger (mandatory in every pipeline)
+│   │   ├── determinism.py    # Thread-pin env vars (OMP/OPENBLAS/MKL/pyarrow)
+│   │   ├── storage.py        # Parquet read/write helpers (snappy, single-threaded, deterministic)
+│   │   └── tracking.py       # MLflow `track_pipeline_run` context manager (see §7)
 │   ├── pipelines/            # One file per modality. Pure functions + a `run_pipeline()` entry.
-│   └── core/                 # Cross-cutting utilities: logging, config (MLflow helpers planned)
+│   └── frontend/
+│       └── app.py            # Streamlit dashboard (3 tabs, one per modality)
 └── tests/
     ├── core/
-    ├── pipelines/
+    ├── api/
+    ├── frontend/
+    ├── pipelines/            # incl. test_cross_pipeline_smoke.py for integration coverage
     └── fixtures/             # Tiny synthetic data files used by tests (NOT a Python package — no __init__.py)
 ```
@@ -109,3 +127,18 @@ All `data/processed/` outputs MUST be **Parquet** (`pyarrow` engine, `compressio
 - Read with `pd.read_parquet(path)`; no dtype hints required.
 The raw `data/raw/` inputs may be in any vendor-supplied format (CSV for BBBP, EDF/FIF for EEG, NIfTI for MRI).
+## 7. Experiment Tracking
+Every `run_pipeline()` invocation logs to MLflow via `src.core.tracking.track_pipeline_run`:
+- **Experiment names** match the pipeline module: `bbb_pipeline`, `eeg_pipeline`, `mri_pipeline`.
+- **Params**: input/output paths and pipeline hyperparameters (e.g. BBB `n_bits` / `radius`, EEG `epoch_duration_s` / `random_state`, MRI `intensity_threshold` / `n_roi_axes`).
+- **Metrics**: row counts (`rows_in`, `rows_out`, `rows_dropped` — or modality equivalent like `subjects_in/out/dropped`) and `duration_sec`.
+- **Artifact**: the produced Parquet at `data/processed/<modality>_features.parquet`.
+The tracking URI is read from `MLFLOW_TRACKING_URI` (defaults to `./mlruns/` when unset).
+**Live-demo lifeline**: set `NEUROBRIDGE_DISABLE_MLFLOW=1` to skip tracking entirely — the helper yields `None` and emits no MLflow calls. Use this when the tracking server is unreachable (offline demo, network outage, or CI without an MLflow service). Pipelines complete normally; only the run metadata is lost.
+The repo-wide `conftest.py` autouse fixture pins `MLFLOW_TRACKING_URI` to a tmp directory for the test session, so the production `mlruns/` directory is never written by the test suite. Tests that interact with MLflow (in `tests/core/test_tracking.py` and the per-pipeline `Test<Modality>PipelineMLflow` classes) all share this isolated store.
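
Reviewer note on §7: the commit describes `track_pipeline_run` but does not show it. The block below is a minimal sketch of how such a context manager could behave, not the code in `src/core/tracking.py`. Its signature, the `run_pipeline` body, and the `n_bits` / `radius` values are assumptions made for illustration. It assumes only the exact value `1` turns tracking off. MLflow itself honors `MLFLOW_TRACKING_URI`, so the sketch never sets a URI.

```python
import os
import time
from contextlib import contextmanager

# Kill-switch variable named in AGENTS §7.
DISABLE_ENV = "NEUROBRIDGE_DISABLE_MLFLOW"


def tracking_disabled() -> bool:
    # Assumption: only the exact value "1" disables tracking.
    return os.environ.get(DISABLE_ENV) == "1"


@contextmanager
def track_pipeline_run(experiment, params):
    """Illustrative stand-in for the helper described in §7 (hypothetical signature)."""
    if tracking_disabled():
        # Disabled path: yield None and make no MLflow calls.
        yield None
        return

    import mlflow  # lazy import, so the disabled path needs no MLflow install

    mlflow.set_experiment(experiment)  # e.g. "bbb_pipeline"
    started = time.perf_counter()
    with mlflow.start_run() as run:
        mlflow.log_params(params)
        try:
            yield run
        finally:
            # Logged even when the pipeline body raises.
            mlflow.log_metric("duration_sec", time.perf_counter() - started)


def run_pipeline(n_bits=2048, radius=2):
    # Hypothetical pipeline body; the real pipelines also log row counts
    # and the produced Parquet file as an artifact.
    params = {"n_bits": n_bits, "radius": radius}
    with track_pipeline_run("bbb_pipeline", params) as run:
        rows_out = 3  # placeholder for real work
        if run is not None:
            import mlflow

            mlflow.log_metric("rows_out", rows_out)
    return rows_out
```

With `NEUROBRIDGE_DISABLE_MLFLOW=1` exported, `run_pipeline()` in this sketch completes without importing MLflow at all, which matches the "pipelines complete normally; only the run metadata is lost" behavior the section describes.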

README.md CHANGED

@@ -13,6 +13,7 @@ and Docker shipping.
 | 1 | Tabular (BBB / molecules) | [`bbb_pipeline.py`](src/pipelines/bbb_pipeline.py) | Shipped — 30 tests green |
 | 2 | Signal (EEG) | [`eeg_pipeline.py`](src/pipelines/eeg_pipeline.py) | Shipped — 67 tests green |
 | 3 | Image (MRI / fMRI) | [`mri_pipeline.py`](src/pipelines/mri_pipeline.py) | Shipped — 106 tests green |
+| 4 | API + MLOps + Frontend | FastAPI + MLflow + Streamlit + Docker | Shipped — 142 tests green |
 ## Quick Start
@@ -58,6 +59,19 @@ Result lives at `data/processed/mri_features.parquet` (48 ROI features per subje
 > [Kaggle](https://www.kaggle.com/datasets/priyanagda/bbbp) or
 > [MoleculeNet](https://moleculenet.org/datasets-1); place as `data/raw/bbbp.csv`.
+### Run the full stack with Docker
+```bash
+docker compose up
+```
+Then browse to:
+- **FastAPI Swagger** — <http://localhost:8000/docs>
+- **Streamlit dashboard** — `streamlit run src/frontend/app.py` (port 8501; not in compose by default)
+- **MLflow UI** — <http://localhost:5000>
+Live-demo robustness: if the MLflow service is unreachable, set `NEUROBRIDGE_DISABLE_MLFLOW=1` to make the pipelines run without tracking.
 ## Repository Layout
 ```text
@@ -139,8 +153,7 @@ finishes in under 4 seconds on a 2024 laptop.
 - **Day 2 (shipped):** `eeg_pipeline.py` — bandpass + MNE ICA artifact removal + PSD + statistical features → Parquet.
 - **Day 3 (shipped):** `mri_pipeline.py` — NIfTI volume loading, brain masking, ROI feature extraction, ComBat harmonization (`neuroHarmonize`) for site-level domain shift → Parquet (48 features, 106 tests green).
-- **Day 4+:** FastAPI surface in `src/api/`, MLflow experiment tracking, Docker images,
-  CI.
+- **Day 4 (shipped):** FastAPI surface in `src/api/` (POST `/pipeline/{bbb,eeg,mri}` + `/health`), MLflow experiment tracking via `src.core.tracking` (see AGENTS.md §7), Streamlit dashboard at `src/frontend/app.py`, and Docker / `docker-compose.yml` for the api + MLflow stack — 142 tests green.
 ## Where to Look
@@ -152,3 +165,9 @@ finishes in under 4 seconds on a 2024 laptop.
 - **EEG pipeline:** [`src/pipelines/eeg_pipeline.py`](src/pipelines/eeg_pipeline.py) + [`tests/pipelines/test_eeg_pipeline.py`](tests/pipelines/test_eeg_pipeline.py)
 - **Day-3 plan (full TDD task breakdown):** [`docs/superpowers/plans/2026-05-01-day3-mri-combat-pipeline.md`](docs/superpowers/plans/2026-05-01-day3-mri-combat-pipeline.md)
 - **MRI pipeline:** [`src/pipelines/mri_pipeline.py`](src/pipelines/mri_pipeline.py) + [`tests/pipelines/test_mri_pipeline.py`](tests/pipelines/test_mri_pipeline.py)
+- **Day-4 plan (full TDD task breakdown):** [`docs/superpowers/plans/2026-05-02-day4-api-mlops-frontend.md`](docs/superpowers/plans/2026-05-02-day4-api-mlops-frontend.md)
+- **Shared core helpers:** [`src/core/determinism.py`](src/core/determinism.py), [`src/core/storage.py`](src/core/storage.py), [`src/core/tracking.py`](src/core/tracking.py)
+- **FastAPI surface:** [`src/api/main.py`](src/api/main.py), [`src/api/routes.py`](src/api/routes.py), [`src/api/schemas.py`](src/api/schemas.py)
+- **Streamlit dashboard:** [`src/frontend/app.py`](src/frontend/app.py)
+- **Container stack:** [`Dockerfile`](Dockerfile), [`docker-compose.yml`](docker-compose.yml)
+- **Day-4 tests:** [`tests/api/`](tests/api/), [`tests/frontend/`](tests/frontend/), [`tests/pipelines/test_cross_pipeline_smoke.py`](tests/pipelines/test_cross_pipeline_smoke.py)