Spaces: mekosotto/hackathon

mekosotto committed 7 days ago

Commit

53256ed

1 Parent(s): b6f1745

docs: Day-5 close-out — AGENTS §8 decision layer + trainer CLI

Files changed (3)

AGENTS.md +30 -0
README.md +19 -0
src/models/bbb_model.py +26 -0

AGENTS.md CHANGED

@@ -50,6 +50,8 @@ All experiment runs are tracked in **MLflow**. All services ship as **Docker** i
 │   │   ├── storage.py        # Parquet read/write helpers (snappy, single-threaded, deterministic)
 │   │   └── tracking.py       # MLflow `track_pipeline_run` context manager (see §7)
 │   ├── pipelines/            # One file per modality. Pure functions + a `run_pipeline()` entry.
+│   ├── models/               # Downstream decision-layer models (consume processed features)
+│   │   └── bbb_model.py      # BBB-permeability classifier + SHAP explainer + trainer CLI
 │   └── frontend/
 │       └── app.py            # Streamlit dashboard (3 tabs, one per modality)
 └── tests/
@@ -142,3 +144,31 @@ The tracking URI is read from `MLFLOW_TRACKING_URI` (defaults to `./mlruns/` whe
 **Live-demo lifeline**: set `NEUROBRIDGE_DISABLE_MLFLOW=1` to skip tracking entirely — the helper yields `None` and emits no MLflow calls. Use this when the tracking server is unreachable (offline demo, network outage, or CI without an MLflow service). Pipelines complete normally; only the run metadata is lost.
 The repo-wide `conftest.py` autouse fixture pins `MLFLOW_TRACKING_URI` to a tmp directory for the test session, so the production `mlruns/` directory is never written by the test suite. Tests that interact with MLflow (in `tests/core/test_tracking.py` and the per-pipeline `Test<Modality>PipelineMLflow` classes) all share this isolated store.
+## 8. Decision Layer (Downstream Models)
+Pipelines produce features (`data/processed/<modality>_features.parquet`).
+Downstream models live in `src/models/` and consume those features:
+| Model | File | Output | Endpoint |
+|---|---|---|---|
+| BBB permeability | `src/models/bbb_model.py` | `data/processed/bbb_model.joblib` | `POST /predict/bbb` |
+Each downstream model module exposes a uniform surface:
+- `train(df, label_col, ...)` → fitted classifier
+- `save(model, path)` / `load(path)` → joblib artifact I/O
+- `predict_with_proba(model, smiles)` → `{label, confidence}` (confidence is the max-class probability)
+- `explain_prediction(model, smiles, top_k)` → SHAP top-k attributions sorted by `|shap_value|` descending
+The API loads the joblib artifact at request time. If the artifact is
+missing, the endpoint returns **HTTP 503** with a remediation hint pointing
+at the trainer CLI (`python -m src.models.<name>`). This keeps the API
+process startup fast and lets operators retrain without redeploying — the
+Day-5 analog of Day-4's `NEUROBRIDGE_DISABLE_MLFLOW` lifeline.
+**Determinism**: all classifiers are seeded (`random_state=42` by default) and
+run with `n_jobs=1` (no tree-parallelism races). Re-running the trainer on the
+same Parquet produces identical predictions.
+**Override**: set the `BBB_MODEL_PATH` env var to point the API at a non-default
+artifact location (used by tests for tmp_path isolation).
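
The request-time loading rule in §8 can be sketched without any web framework. This is a minimal illustration under stated assumptions, not the code in `src/api/`: the helper names `resolve_model_path` and `predict_or_503` are hypothetical, the loader and predictor are passed in as callables, and the response is modeled as a plain `(status, body)` tuple.

```python
import os
from pathlib import Path
from typing import Any, Callable

# Default artifact location, as named in AGENTS.md §8.
DEFAULT_MODEL_PATH = Path("data/processed/bbb_model.joblib")


def resolve_model_path() -> Path:
    """Return the path from BBB_MODEL_PATH if set, else the default."""
    return Path(os.environ.get("BBB_MODEL_PATH", str(DEFAULT_MODEL_PATH)))


def predict_or_503(
    smiles: str,
    load: Callable[[Path], Any],
    predict: Callable[[Any, str], dict],
) -> tuple[int, dict]:
    """Hypothetical handler body: resolve and load the artifact per request."""
    path = resolve_model_path()
    if not path.exists():
        # Missing artifact: report 503 with a hint pointing at the trainer CLI.
        hint = f"Model artifact not found at {path}. Run `python -m src.models.bbb_model` first."
        return 503, {"detail": hint}
    model = load(path)
    return 200, predict(model, smiles)
```

Because the path is resolved on every call, retraining or repointing `BBB_MODEL_PATH` takes effect on the next request without restarting the process, which is the property the section describes.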

README.md CHANGED

@@ -14,6 +14,7 @@ and Docker shipping.
 | 2 | Signal (EEG) | [`eeg_pipeline.py`](src/pipelines/eeg_pipeline.py) | Shipped — 67 tests green |
 | 3 | Image (MRI / fMRI) | [`mri_pipeline.py`](src/pipelines/mri_pipeline.py) | Shipped — 106 tests green |
 | 4 | API + MLOps + Frontend | FastAPI + MLflow + Streamlit + Docker | Shipped — 142 tests green |
+| 5 | Decision Layer (Model + XAI + Interactive UI) | [`bbb_model.py`](src/models/bbb_model.py) — RandomForest + SHAP + `POST /predict/bbb` | Shipped — 158 tests green |
 ## Quick Start
@@ -59,6 +60,21 @@ Result lives at `data/processed/mri_features.parquet` (48 ROI features per subje
 > [Kaggle](https://www.kaggle.com/datasets/priyanagda/bbbp) or
 > [MoleculeNet](https://moleculenet.org/datasets-1); place as `data/raw/bbbp.csv`.
+### Train the downstream BBB model (one-time)
+```bash
+python -m src.pipelines.bbb_pipeline   # produces data/processed/bbbp_features.parquet
+python -m src.models.bbb_model          # produces data/processed/bbb_model.joblib
+```
+Then `POST /predict/bbb` (and the Streamlit BBB tab) become live. Try:
+```bash
+curl -s -X POST http://localhost:8000/predict/bbb \
+  -H 'Content-Type: application/json' \
+  -d '{"smiles": "CCO", "top_k": 5}' | python3 -m json.tool
+```
 ### Run the full stack with Docker
 ```bash
@@ -154,6 +170,7 @@ finishes in under 4 seconds on a 2024 laptop.
 - **Day 2 (shipped):** `eeg_pipeline.py` — bandpass + MNE ICA artifact removal + PSD + statistical features → Parquet.
 - **Day 3 (shipped):** `mri_pipeline.py` — NIfTI volume loading, brain masking, ROI feature extraction, ComBat harmonization (`neuroHarmonize`) for site-level domain shift → Parquet (48 features, 106 tests green).
 - **Day 4 (shipped):** FastAPI surface in `src/api/` (POST `/pipeline/{bbb,eeg,mri}` + `/health`), MLflow experiment tracking via `src.core.tracking` (see AGENTS.md §7), Streamlit dashboard at `src/frontend/app.py`, and Docker / `docker-compose.yml` for the api + MLflow stack — 142 tests green.
+- **Day 5 (shipped):** Decision layer in `src/models/bbb_model.py` — RandomForest BBB classifier on Morgan fingerprints, SHAP top-k explanations, `POST /predict/bbb` endpoint, interactive Streamlit BBB tab with SMILES input + decision card + SHAP bar chart, and trainer CLI (`python -m src.models.bbb_model`). See AGENTS.md §8 — 158 tests green.
 ## Where to Look
@@ -171,3 +188,5 @@ finishes in under 4 seconds on a 2024 laptop.
 - **Streamlit dashboard:** [`src/frontend/app.py`](src/frontend/app.py)
 - **Container stack:** [`Dockerfile`](Dockerfile), [`docker-compose.yml`](docker-compose.yml)
 - **Day-4 tests:** [`tests/api/`](tests/api/), [`tests/frontend/`](tests/frontend/), [`tests/pipelines/test_cross_pipeline_smoke.py`](tests/pipelines/test_cross_pipeline_smoke.py)
+- **Day-5 plan (full TDD task breakdown):** [`docs/superpowers/plans/2026-05-03-day5-downstream-model-xai-interactive.md`](docs/superpowers/plans/2026-05-03-day5-downstream-model-xai-interactive.md)
+- **BBB downstream model (classifier + SHAP explainer + trainer CLI):** [`src/models/bbb_model.py`](src/models/bbb_model.py) + [`tests/models/test_bbb_model.py`](tests/models/test_bbb_model.py) (12 tests)
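
For callers who prefer Python over `curl`, the README's `POST /predict/bbb` example could be reproduced with the standard library alone. This is a sketch, assuming the same host, port, and JSON body as the `curl` command in the diff; the function names `build_predict_request` and `predict_bbb` are hypothetical and do not exist in the repository.

```python
import json
import urllib.request


def build_predict_request(
    smiles: str,
    top_k: int = 5,
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build the same POST request that the README's curl example sends."""
    body = json.dumps({"smiles": smiles, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict/bbb",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict_bbb(smiles: str, top_k: int = 5, base_url: str = "http://localhost:8000") -> dict:
    """Send the request and decode the JSON response (needs a running API)."""
    request = build_predict_request(smiles, top_k, base_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage, with the API running locally:
#   print(json.dumps(predict_bbb("CCO"), indent=2))
```

Note that `urllib.request.urlopen` raises `urllib.error.HTTPError` for non-2xx responses, so the HTTP 503 returned when the model artifact is missing would surface as an exception rather than a return value.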

src/models/bbb_model.py CHANGED

@@ -205,3 +205,29 @@ def explain_prediction(
         {"feature": str(name), "shap_value": float(value)}
         for name, value in pairs[:top_k]
     ]
+DEFAULT_FEATURES_PATH = Path("data/processed/bbbp_features.parquet")
+DEFAULT_MODEL_PATH = Path("data/processed/bbb_model.joblib")
+def main() -> None:
+    """Train and persist the production BBB model from the Day-4 features Parquet.
+    Reads from `DEFAULT_FEATURES_PATH`, trains with default hyperparameters,
+    and writes the artifact to `DEFAULT_MODEL_PATH`. Re-runs are idempotent
+    (same random_state).
+    """
+    if not DEFAULT_FEATURES_PATH.exists():
+        raise FileNotFoundError(
+            f"Features Parquet not found at {DEFAULT_FEATURES_PATH}. "
+            f"Run `python -m src.pipelines.bbb_pipeline` first."
+        )
+    df = pd.read_parquet(DEFAULT_FEATURES_PATH)
+    model = train(df, label_col="p_np")
+    save(model, DEFAULT_MODEL_PATH)
+    logger.info("BBB model artifact ready at %s", DEFAULT_MODEL_PATH)
+if __name__ == "__main__":
+    main()
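
The context lines at the top of this hunk show only the tail of `explain_prediction`, which slices `pairs[:top_k]`. How `pairs` is built is not visible in the diff. A minimal sketch of the ordering AGENTS.md §8 describes (top-k attributions sorted by `|shap_value|` descending) might look like the following; the helper name `top_k_attributions` and the sample feature names are hypothetical.

```python
from typing import Sequence


def top_k_attributions(
    names: Sequence[str],
    values: Sequence[float],
    top_k: int,
) -> list[dict]:
    """Rank (feature, shap_value) pairs by absolute value, largest first."""
    # Assumption: one SHAP value per feature name, already computed elsewhere.
    pairs = sorted(zip(names, values), key=lambda pair: abs(pair[1]), reverse=True)
    return [
        {"feature": str(name), "shap_value": float(value)}
        for name, value in pairs[:top_k]
    ]


# Example with made-up fingerprint bit names and values:
ranked = top_k_attributions(["bit_12", "bit_87", "bit_301"], [0.1, -0.5, 0.3], top_k=2)
# ranked == [{"feature": "bit_87", "shap_value": -0.5},
#            {"feature": "bit_301", "shap_value": 0.3}]
```

Sorting on the absolute value keeps strongly negative attributions in the top-k, while the sign is preserved in the output so a consumer such as a bar chart can still show direction.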