mekosotto Claude Sonnet 4.6 committed on
Commit
938399b
·
1 Parent(s): 043ea3a

chore: track planning docs and ignore .sixth/ tooling dir

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

.gitignore CHANGED
@@ -21,6 +21,9 @@ data/processed/*
  mlruns/
  mlartifacts/
  
+ # Claude Code / agent tooling
+ .sixth/
+
  # IDE
  .idea/
  .vscode/
docs/superpowers/plans/2026-04-29-neurobridge-day1-bootstrap-bbb-pipeline.md ADDED
@@ -0,0 +1,966 @@
1
+ # NeuroBridge Day 1 — Bootstrap & BBB Pipeline Implementation Plan
2
+
3
+ > **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
4
+
5
+ **Goal:** Insider One Hackathon Day 1 — bootstrap the NeuroBridge Enterprise repo (governance + dependencies) and ship the first working pipeline (BBBP / RDKit Morgan fingerprints) end-to-end with TDD.
6
+
7
+ **Architecture:** Modular `src/` layout with three sibling pipeline packages (image / signal / tabular). Day 1 lands the **tabular (BBB)** pipeline only. A shared `src/core/logger.py` standardizes structured logging across pipelines. RDKit is used for SMILES parsing and Morgan fingerprint generation; invalid SMILES are logged and dropped at the validation layer (Data Readiness gate). The pipeline reads `data/raw/bbbp.csv` and writes a model-ready `data/processed/bbbp_features.csv`.
8
+
9
+ **Tech Stack:** Python 3.10+, FastAPI, Uvicorn, Pandas, NumPy, Scikit-learn, RDKit, MNE-Python, MLflow, Pytest, Docker.
10
+
11
+ ---
12
+
13
+ ## File Structure
14
+
15
+ Files created in this plan:
16
+
17
+ | Path | Responsibility |
18
+ |---|---|
19
+ | `AGENTS.md` | Agent-facing rulebook: vision, dir layout, coding standards, Data Readiness principles. |
20
+ | `requirements.txt` | Pinned Python deps for all 3 pipelines + API + tracking. |
21
+ | `.gitignore` | Standard Python + data/processed + MLflow artifacts ignore. |
22
+ | `pytest.ini` | Pytest config (rootdir, testpaths). |
23
+ | `src/__init__.py` | Mark `src` as a package. |
24
+ | `src/core/__init__.py` | Core/shared utilities package. |
25
+ | `src/core/logger.py` | `get_logger(name)` — structured stdout logger reused by all pipelines. |
26
+ | `src/pipelines/__init__.py` | Pipelines package. |
27
+ | `src/pipelines/bbb_pipeline.py` | BBBP SMILES → Morgan FP feature extractor + I/O orchestrator. |
28
+ | `src/api/__init__.py` | FastAPI package placeholder (filled later in week). |
29
+ | `tests/__init__.py` | Tests root. |
30
+ | `tests/core/__init__.py` | Core tests package. |
31
+ | `tests/core/test_logger.py` | Logger unit tests. |
32
+ | `tests/pipelines/__init__.py` | Pipeline tests package. |
33
+ | `tests/pipelines/test_bbb_pipeline.py` | BBB pipeline unit + integration tests. |
34
+ | `tests/fixtures/bbbp_sample.csv` | Tiny BBBP fixture (mix of valid + invalid SMILES). |
35
+ | `data/raw/.gitkeep` | Keep raw data folder under git, real CSVs ignored. |
36
+ | `data/processed/.gitkeep` | Keep processed folder under git. |
37
+
38
+ ---
39
+
40
+ ## Task 1: Project Skeleton & Git Bootstrap
41
+
42
+ **Files:**
43
+ - Create: `.gitignore`
44
+ - Create: `pytest.ini`
45
+ - Create: `data/raw/.gitkeep`
46
+ - Create: `data/processed/.gitkeep`
47
+ - Create: `src/__init__.py`
48
+ - Create: `src/core/__init__.py`
49
+ - Create: `src/pipelines/__init__.py`
50
+ - Create: `src/api/__init__.py`
51
+ - Create: `tests/__init__.py`
52
+ - Create: `tests/core/__init__.py`
53
+ - Create: `tests/pipelines/__init__.py`
54
+ - Create: `tests/fixtures/` (folder)
55
+
56
+ - [ ] **Step 1: Create directory skeleton**
57
+
58
+ Run:
59
+ ```bash
60
+ cd /Users/mertgungor/Desktop/hackathon
61
+ mkdir -p data/raw data/processed \
62
+ src/core src/pipelines src/api \
63
+ tests/core tests/pipelines tests/fixtures
64
+ ```
65
+
66
+ - [ ] **Step 2: Create empty package markers**
67
+
68
+ Run:
69
+ ```bash
70
+ touch src/__init__.py src/core/__init__.py src/pipelines/__init__.py src/api/__init__.py \
71
+ tests/__init__.py tests/core/__init__.py tests/pipelines/__init__.py \
72
+ data/raw/.gitkeep data/processed/.gitkeep
73
+ ```
74
+
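Steps 1 and 2 can also be reproduced from Python, which may be convenient where `mkdir -p` and `touch` are unavailable. The following is a minimal sketch, not part of the plan: the helper name `create_skeleton` and its `root` parameter are hypothetical, while the directory and file names are copied from the two steps above.

```python
from pathlib import Path

# Directory and marker names are copied from Steps 1 and 2.
DIRS = [
    "data/raw", "data/processed",
    "src/core", "src/pipelines", "src/api",
    "tests/core", "tests/pipelines", "tests/fixtures",
]
MARKERS = [
    "src/__init__.py", "src/core/__init__.py",
    "src/pipelines/__init__.py", "src/api/__init__.py",
    "tests/__init__.py", "tests/core/__init__.py",
    "tests/pipelines/__init__.py",
    "data/raw/.gitkeep", "data/processed/.gitkeep",
]


def create_skeleton(root: Path) -> list[Path]:
    """Create the directory tree and empty marker files under `root`.

    Hypothetical helper; safe to call repeatedly because existing
    directories and files are left in place.
    """
    for rel in DIRS:
        (root / rel).mkdir(parents=True, exist_ok=True)
    created = []
    for rel in MARKERS:
        path = root / rel
        path.touch(exist_ok=True)
        created.append(path)
    return created
```

Calling `create_skeleton(Path("."))` from the project root would give the same tree as the shell commands.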
75
+ - [ ] **Step 3: Write `.gitignore`**
76
+
77
+ Create `.gitignore`:
78
+ ```gitignore
79
+ # Byte-compiled / cache
80
+ __pycache__/
81
+ *.py[cod]
82
+ *.egg-info/
83
+ .pytest_cache/
84
+ .mypy_cache/
85
+ .ruff_cache/
86
+
87
+ # Virtual envs
88
+ .venv/
89
+ venv/
90
+ env/
91
+
92
+ # Data — only keep folder structure, never raw payloads
93
+ data/raw/*
94
+ !data/raw/.gitkeep
95
+ data/processed/*
96
+ !data/processed/.gitkeep
97
+
98
+ # MLflow / experiment tracking
99
+ mlruns/
100
+ mlartifacts/
101
+
102
+ # IDE
103
+ .idea/
104
+ .vscode/
105
+ .DS_Store
106
+ ```
107
+
108
+ - [ ] **Step 4: Write `pytest.ini`**
109
+
110
+ Create `pytest.ini`:
111
+ ```ini
112
+ [pytest]
113
+ testpaths = tests
114
+ pythonpath = .
115
+ addopts = -v --tb=short
116
+ ```
117
+
118
+ - [ ] **Step 5: Initialize git and commit skeleton**
119
+
120
+ Run:
121
+ ```bash
122
+ cd /Users/mertgungor/Desktop/hackathon
123
+ git init -b main
124
+ git add .gitignore pytest.ini data/ src/ tests/
125
+ git commit -m "chore: bootstrap NeuroBridge project skeleton"
126
+ ```
127
+
128
+ Expected: a single commit with the skeleton tree.
129
+
130
+ ---
131
+
132
+ ## Task 2: AGENTS.md — Project Rulebook
133
+
134
+ **Files:**
135
+ - Create: `AGENTS.md`
136
+
137
+ - [ ] **Step 1: Write `AGENTS.md`**
138
+
139
+ Create `AGENTS.md`:
140
+ ````markdown
141
+ # AGENTS.md — NeuroBridge Enterprise Pipeline
142
+
143
+ > Read this file at the start of every session. It is the contract every agent
144
+ > (human or LLM) operates under in this repository.
145
+
146
+ ## 1. Project Vision
147
+
148
+ **NeuroBridge Enterprise** is a B2B SaaS platform that solves three structural
149
+ problems in real-world clinical/biomedical ML pipelines:
150
+
151
+ 1. **Data Drift** between hospitals and acquisition sites (multi-center MRI).
152
+ 2. **Missing Modalities** (a patient may have MRI but no EEG, or vice versa).
153
+ 3. **Artifacts** in raw biosignals (eye blinks, line noise, motion in EEG).
154
+
155
+ The platform exposes three production pipelines behind a single FastAPI surface:
156
+
157
+ | Modality | Pipeline | Core Technique |
158
+ |---|---|---|
159
+ | Image (MRI / fMRI) | `src/pipelines/mri_pipeline.py` | ComBat Harmonization for site-level domain shift |
160
+ | Signal (EEG) | `src/pipelines/eeg_pipeline.py` | MNE-Python + ICA for artifact removal |
161
+ | Tabular (BBB / molecules) | `src/pipelines/bbb_pipeline.py` | RDKit Morgan fingerprints from SMILES |
162
+
163
+ All experiment runs are tracked in **MLflow**. All services ship as **Docker** images.
164
+
165
+ ## 2. Directory Layout (load-bearing — do not violate)
166
+
167
+ ```
168
+ .
169
+ ├── AGENTS.md # This file
170
+ ├── requirements.txt
171
+ ├── pytest.ini
172
+ ├── data/
173
+ │ ├── raw/ # Untouched source data. NEVER train on this directly.
174
+ │ └── processed/ # Pipeline output. Model-ready. Versioned outputs.
175
+ ├── src/
176
+ │ ├── api/ # FastAPI routers, request/response schemas
177
+ │ ├── pipelines/ # One file per modality. Pure functions + a `run_pipeline()` entry.
178
+ │ └── core/ # Cross-cutting utilities: logging, config, MLflow helpers
179
+ └── tests/
+     ├── core/
+     ├── pipelines/
+     └── fixtures/ # Tiny synthetic data files used by tests
183
+ ```
184
+
185
+ **Rules:**
186
+ - New modality → new file under `src/pipelines/`. No mixing modalities in one file.
187
+ - Anything imported by 2+ pipelines → `src/core/`.
188
+ - Never read from or write to paths outside `data/`. The `data/` boundary is the storage contract.
189
+
190
+ ## 3. Coding Standards
191
+
192
+ - **Python 3.10+.** Use `from __future__ import annotations` when needed for forward refs.
193
+ - **Type hints are mandatory** on every public function/method (parameters and return).
194
+ - **Modular structure.** One responsibility per function. If a function exceeds ~40 lines or 3 levels of nesting, split it.
195
+ - **TDD is the default workflow.** Write the failing test first, watch it fail, then implement. Tests live in `tests/` mirroring `src/`.
196
+ - **Logging is mandatory** for every pipeline. Use `src.core.logger.get_logger(__name__)`. No `print()` in `src/`.
197
+ - **Docstrings** on every public function — one-line summary + Args/Returns when non-trivial.
198
+ - **No hard-coded paths in business logic.** Pass paths as arguments to `run_pipeline(input_path, output_path)`.
199
+ - **Format & lint:** keep imports sorted; prefer `pathlib.Path` over `os.path`.
200
+ - **Commits are small and frequent.** Each green test → commit.
201
+
202
+ ## 4. Data Readiness Principles
203
+
204
+ > **The Golden Rule: never train a model directly on raw data. Raw data must always pass through a pipeline first.**
205
+
206
+ Every modality pipeline MUST guarantee, before writing to `data/processed/`:
207
+
208
+ 1. **Schema validity** — required columns present, expected dtypes.
209
+ 2. **Domain validity** — invalid records (e.g. unparseable SMILES, NaN-only EEG epochs, corrupted DICOMs) are **logged with their identifier and dropped**, never silently coerced.
210
+ 3. **Determinism** — given the same `data/raw/` input, the pipeline produces byte-identical `data/processed/` output. No wall-clock, no random seeds without explicit seeding.
211
+ 4. **Traceability** — log row count in, row count out, and percentage dropped at INFO level.
212
+ 5. **Idempotence** — re-running the pipeline overwrites `data/processed/` cleanly; no append, no partial writes.
213
+
214
+ A model training script may read from `data/processed/` only. If a
215
+ training script references `data/raw/` directly, that is a bug and must be
216
+ refactored into a pipeline.
217
+
218
+ ## 5. How to Add a New Pipeline (checklist)
219
+
220
+ 1. Add `tests/pipelines/test_<name>_pipeline.py` with the failing tests first.
221
+ 2. Create `src/pipelines/<name>_pipeline.py` exposing `run_pipeline(input_path: Path, output_path: Path) -> None`.
222
+ 3. Use `get_logger(__name__)` for all status output.
223
+ 4. Validate inputs and drop invalid rows with a logged warning.
224
+ 5. Write deterministic output to `output_path`.
225
+ 6. Document any new dependency in `requirements.txt` (pinned).
226
+ 7. Add a one-line entry to this file's pipeline table.
227
+ ````
228
+
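Principle 5 (Idempotence) in the `AGENTS.md` content above asks for "no append, no partial writes" but does not say how to get there. One common approach, shown here only as a sketch and not as the plan's implementation, is to write to a temporary file in the destination directory and then swap it into place with `os.replace`. The helper name `write_text_atomically` is hypothetical.

```python
import os
import tempfile
from pathlib import Path


def write_text_atomically(output_path: Path, text: str) -> None:
    """Replace `output_path` with `text` without exposing a half-written file.

    Hypothetical helper, not part of the plan.
    """
    output_path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=output_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
            handle.write(text)
        os.replace(tmp_name, output_path)  # overwrites any previous output
    except BaseException:
        # Remove the temp file if writing or replacing failed.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

A pipeline could render its CSV to a string (for example with `DataFrame.to_csv()` called without a path) and hand the result to a helper like this, so a crash mid-write leaves the previous output untouched.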
229
+ - [ ] **Step 2: Commit**
230
+
231
+ ```bash
232
+ git add AGENTS.md
233
+ git commit -m "docs: add AGENTS.md with vision, layout, standards, data readiness rules"
234
+ ```
235
+
236
+ ---
237
+
238
+ ## Task 3: requirements.txt
239
+
240
+ **Files:**
241
+ - Create: `requirements.txt`
242
+
243
+ - [ ] **Step 1: Write `requirements.txt`**
244
+
245
+ Create `requirements.txt`:
246
+ ```text
247
+ # --- Web / API layer ---
248
+ fastapi==0.115.0
249
+ uvicorn[standard]==0.30.6
250
+ pydantic==2.9.2
251
+
252
+ # --- Core data stack ---
253
+ numpy==1.26.4
254
+ pandas==2.2.2
255
+ scipy==1.13.1
256
+ scikit-learn==1.5.1
257
+
258
+ # --- Modality: tabular / molecules (BBB pipeline) ---
259
+ rdkit==2024.3.5
260
+
261
+ # --- Modality: signal (EEG pipeline) ---
262
+ mne==1.7.1
263
+
264
+ # --- Modality: image (MRI pipeline) ---
265
+ nibabel==5.2.1
266
+ neuroharmonize==2.4.5 # ComBat harmonization wrapper
267
+
268
+ # --- Experiment tracking ---
269
+ mlflow==2.16.0
270
+
271
+ # --- Tooling / tests ---
272
+ pytest==8.3.3
273
+ pytest-cov==5.0.0
274
+ httpx==0.27.2 # FastAPI test client
275
+ ```
276
+
277
+ - [ ] **Step 2: Commit**
278
+
279
+ ```bash
280
+ git add requirements.txt
281
+ git commit -m "chore: pin runtime + dev dependencies for all three modalities"
282
+ ```
283
+
284
+ > **Note for engineer:** dependency installation (creating a venv, `pip install -r requirements.txt`) is delegated to the human / CI. The plan does not assume a venv is active. Subsequent tasks rely on `rdkit`, `pytest`, etc. being importable; if the environment is not yet set up, set it up before Task 4.
285
+
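Since later tasks assume the pinned packages are importable, a small preflight check can catch a missing or mismatched install before Task 4. The sketch below is an assumption-laden illustration: `parse_pins` and `find_mismatches` are hypothetical helpers, only simple `name==version` lines are handled, and versions are compared as plain strings.

```python
from importlib import metadata


def parse_pins(requirements_text: str) -> dict[str, str]:
    """Extract `name==version` pins, ignoring comments, blank lines, and extras."""
    pins = {}
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if not line or "==" not in line:
            continue
        name, version = line.split("==", 1)
        name = name.split("[", 1)[0].strip()  # "uvicorn[standard]" -> "uvicorn"
        pins[name] = version.strip()
    return pins


def find_mismatches(pins: dict[str, str]) -> dict[str, str]:
    """Return {package: problem} for pins that are missing or at another version."""
    problems = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if installed != wanted:
            problems[name] = f"installed {installed}, pinned {wanted}"
    return problems
```

Running `find_mismatches(parse_pins(Path("requirements.txt").read_text()))` (with `pathlib.Path` imported) would return an empty dict when the environment matches the pins.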
286
+ ---
287
+
288
+ ## Task 4: Shared Logger (`src/core/logger.py`) — TDD
289
+
290
+ **Files:**
291
+ - Create: `tests/core/test_logger.py`
292
+ - Create: `src/core/logger.py`
293
+
294
+ - [ ] **Step 1: Write the failing tests**
295
+
296
+ Create `tests/core/test_logger.py`:
297
+ ```python
+ """Unit tests for the shared structured logger."""
+ from __future__ import annotations
+
+ import logging
+
+ from src.core.logger import get_logger
+
+
+ def test_get_logger_returns_logger_instance() -> None:
+     logger = get_logger("neurobridge.test")
+     assert isinstance(logger, logging.Logger)
+     assert logger.name == "neurobridge.test"
+
+
+ def test_get_logger_attaches_single_handler() -> None:
+     """Repeated calls must not duplicate handlers (idempotence)."""
+     name = "neurobridge.idempotent"
+     first = get_logger(name)
+     second = get_logger(name)
+     assert first is second
+     assert len(first.handlers) == 1
+
+
+ def test_get_logger_default_level_is_info() -> None:
+     logger = get_logger("neurobridge.level_check")
+     assert logger.level == logging.INFO
+
+
+ def test_get_logger_emits_formatted_record(caplog, monkeypatch) -> None:
+     logger = get_logger("neurobridge.emit")
+     # get_logger() sets propagate=False, so caplog's handler on the root
+     # logger would never see the record; re-enable propagation for this test.
+     monkeypatch.setattr(logger, "propagate", True)
+     with caplog.at_level(logging.INFO, logger="neurobridge.emit"):
+         logger.info("hello-world")
+     assert any("hello-world" in record.message for record in caplog.records)
+ ```
332
+
333
+ - [ ] **Step 2: Run tests to verify they fail**
334
+
335
+ Run: `pytest tests/core/test_logger.py -v`
336
+
337
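+ Expected: collection fails with `ModuleNotFoundError: No module named 'src.core.logger'`. Because the import is at module level, pytest reports one collection error, not 4 separate test failures.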
338
+
339
+ - [ ] **Step 3: Implement the logger**
340
+
341
+ Create `src/core/logger.py`:
342
+ ```python
+ """Shared structured logger for NeuroBridge pipelines.
+
+ All modules in `src/` must obtain their logger via `get_logger(__name__)`
+ instead of using `print()`. This guarantees consistent format and INFO-level
+ traceability across pipelines (per AGENTS.md §4).
+ """
+ from __future__ import annotations
+
+ import logging
+ import sys
+
+ _LOG_FORMAT = "%(asctime)s | %(levelname)-7s | %(name)s | %(message)s"
+ _DATE_FORMAT = "%Y-%m-%dT%H:%M:%S"
+
+
+ def get_logger(name: str, level: int = logging.INFO) -> logging.Logger:
+     """Return a process-wide singleton logger for the given name.
+
+     Idempotent: repeated calls with the same name return the same Logger
+     instance and never stack duplicate handlers.
+
+     Args:
+         name: Dotted logger name, conventionally `__name__`.
+         level: Logging level (default `logging.INFO`).
+
+     Returns:
+         Configured `logging.Logger` writing to stdout.
+     """
+     logger = logging.getLogger(name)
+     if logger.handlers:
+         return logger
+
+     handler = logging.StreamHandler(stream=sys.stdout)
+     handler.setFormatter(logging.Formatter(_LOG_FORMAT, datefmt=_DATE_FORMAT))
+     logger.addHandler(handler)
+     logger.setLevel(level)
+     logger.propagate = False
+     return logger
+ ```
382
+
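The early return on `logger.handlers` is what keeps `get_logger` idempotent. To illustrate why it matters, the sketch below compares a guarded and an unguarded attach function writing to an in-memory stream. All three function names are hypothetical and exist only for this demonstration.

```python
import io
import logging


def attach_unguarded(name: str, stream: io.StringIO) -> logging.Logger:
    """Naive version: adds a new handler on every call."""
    logger = logging.getLogger(name)
    logger.addHandler(logging.StreamHandler(stream))
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def attach_guarded(name: str, stream: io.StringIO) -> logging.Logger:
    """Guarded version: mirrors the early return used by get_logger()."""
    logger = logging.getLogger(name)
    if logger.handlers:
        return logger
    logger.addHandler(logging.StreamHandler(stream))
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def count_lines_after_two_calls(attach, name: str) -> int:
    """Call `attach` twice, log one message, and count the lines written."""
    stream = io.StringIO()
    attach(name, stream)
    logger = attach(name, stream)
    logger.info("hello")
    return len(stream.getvalue().splitlines())
```

With the unguarded variant the single `info` call is written once per stacked handler, which is the duplication that `test_get_logger_attaches_single_handler` is meant to rule out.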
383
+ - [ ] **Step 4: Run tests to verify they pass**
384
+
385
+ Run: `pytest tests/core/test_logger.py -v`
386
+
387
+ Expected: 4 PASS.
388
+
389
+ - [ ] **Step 5: Commit**
390
+
391
+ ```bash
392
+ git add src/core/logger.py tests/core/test_logger.py
393
+ git commit -m "feat(core): add shared structured logger with idempotent handler attach"
394
+ ```
395
+
396
+ ---
397
+
398
+ ## Task 5: BBB Pipeline — Test Fixture & SMILES Validation (TDD)
399
+
400
+ **Files:**
401
+ - Create: `tests/fixtures/bbbp_sample.csv`
402
+ - Create: `tests/pipelines/test_bbb_pipeline.py`
403
+ - Create: `src/pipelines/bbb_pipeline.py`
404
+
405
+ - [ ] **Step 1: Create the test fixture CSV**
406
+
407
+ Create `tests/fixtures/bbbp_sample.csv` (matches Kaggle BBBP schema: `num,name,p_np,smiles`):
408
+ ```csv
409
+ num,name,p_np,smiles
410
+ 1,Propanol,1,CCCO
411
+ 2,Benzene,1,c1ccccc1
412
+ 3,Aspirin,1,CC(=O)OC1=CC=CC=C1C(=O)O
413
+ 4,InvalidMol,0,this_is_not_a_smiles
414
+ 5,Caffeine,1,CN1C=NC2=C1C(=O)N(C(=O)N2C)C
415
+ 6,EmptyMol,0,
416
+ ```
417
+
418
+ Two rows are invalid by design: row 4 (garbage string) and row 6 (empty). Both must be filtered out by the pipeline.
419
+
420
+ - [ ] **Step 2: Write the failing test for `is_valid_smiles`**
421
+
422
+ Create `tests/pipelines/test_bbb_pipeline.py`:
423
+ ```python
+ """Unit + integration tests for the BBB (SMILES → Morgan FP) pipeline."""
+ from __future__ import annotations
+
+ from pathlib import Path
+
+ import pandas as pd
+ import pytest
+
+ from src.pipelines.bbb_pipeline import is_valid_smiles
+
+
+ FIXTURE = Path(__file__).parent.parent / "fixtures" / "bbbp_sample.csv"
+
+
+ class TestIsValidSmiles:
+     def test_accepts_simple_alcohol(self) -> None:
+         assert is_valid_smiles("CCCO") is True
+
+     def test_accepts_aromatic_ring(self) -> None:
+         assert is_valid_smiles("c1ccccc1") is True
+
+     def test_rejects_garbage_string(self) -> None:
+         assert is_valid_smiles("this_is_not_a_smiles") is False
+
+     def test_rejects_empty_string(self) -> None:
+         assert is_valid_smiles("") is False
+
+     def test_rejects_none(self) -> None:
+         assert is_valid_smiles(None) is False  # type: ignore[arg-type]
+
+     def test_rejects_nan(self) -> None:
+         import math
+         assert is_valid_smiles(math.nan) is False  # type: ignore[arg-type]
+ ```
458
+
459
+ - [ ] **Step 3: Run tests to verify they fail**
460
+
461
+ Run: `pytest tests/pipelines/test_bbb_pipeline.py -v`
462
+
463
+ Expected: collection fails with `ModuleNotFoundError: No module named 'src.pipelines.bbb_pipeline'` (pytest reports it as a single collection error).
464
+
465
+ - [ ] **Step 4: Implement `is_valid_smiles`**
466
+
467
+ Create `src/pipelines/bbb_pipeline.py`:
468
+ ```python
+ """BBB (Blood-Brain Barrier) molecule pipeline.
+
+ Reads the Kaggle BBBP dataset (SMILES strings + binary penetration label),
+ filters chemically invalid SMILES, computes Morgan circular fingerprints with
+ RDKit, and writes a model-ready feature table to `data/processed/`.
+
+ This module follows the Data Readiness contract in AGENTS.md §4:
+ schema validity, domain validity (drop invalid SMILES), determinism,
+ traceability (row count in / out / dropped), and idempotent output.
+ """
+ from __future__ import annotations
+
+ import math
+ from typing import Any
+
+ from rdkit import Chem, RDLogger
+
+ from src.core.logger import get_logger
+
+ logger = get_logger(__name__)
+
+ # Suppress RDKit's noisy C++-level warning stream; we surface our own
+ # structured warnings via the project logger when a SMILES fails to parse.
+ RDLogger.DisableLog("rdApp.*")
+
+
+ def is_valid_smiles(smiles: Any) -> bool:
+     """Return True iff `smiles` is a non-empty string parseable by RDKit.
+
+     Handles the full set of garbage we expect from real CSVs:
+     None, NaN floats, empty strings, and unparseable text.
+     """
+     if smiles is None:
+         return False
+     if isinstance(smiles, float) and math.isnan(smiles):
+         return False
+     if not isinstance(smiles, str) or not smiles.strip():
+         return False
+     return Chem.MolFromSmiles(smiles) is not None
+ ```
509
+
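The three guards that run before `Chem.MolFromSmiles` can be exercised without RDKit, which helps show what they do and do not catch. The sketch below is illustrative only: `candidate_smiles` is a hypothetical helper that reproduces just the cheap type and blank checks, so a string it lets through may still be chemically invalid.

```python
import math
from typing import Any, Optional


def candidate_smiles(value: Any) -> Optional[str]:
    """Return `value` if it is a non-blank string, else None.

    Hypothetical helper: covers only the guards that precede the RDKit
    parse, not chemical validity.
    """
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    if not isinstance(value, str) or not value.strip():
        return None
    return value


# None, NaN, empty, blank, and non-string values are rejected up front.
samples = [None, math.nan, "", "   ", 42, "CCCO", "this_is_not_a_smiles"]
survivors = [s for s in samples if candidate_smiles(s) is not None]
```

The garbage string from fixture row 4 survives these checks, which is why the final `Chem.MolFromSmiles(...) is not None` test in `is_valid_smiles` is still required.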
510
+ - [ ] **Step 5: Run tests to verify they pass**
511
+
512
+ Run: `pytest tests/pipelines/test_bbb_pipeline.py -v`
513
+
514
+ Expected: 6 PASS in `TestIsValidSmiles`.
515
+
516
+ - [ ] **Step 6: Commit**
517
+
518
+ ```bash
519
+ git add tests/fixtures/bbbp_sample.csv tests/pipelines/test_bbb_pipeline.py src/pipelines/bbb_pipeline.py
520
+ git commit -m "feat(bbb): add SMILES validity guard with RDKit + test fixture"
521
+ ```
522
+
523
+ ---
524
+
525
+ ## Task 6: BBB Pipeline — Morgan Fingerprint Extraction (TDD)
526
+
527
+ **Files:**
528
+ - Modify: `tests/pipelines/test_bbb_pipeline.py`
529
+ - Modify: `src/pipelines/bbb_pipeline.py`
530
+
531
+ - [ ] **Step 1: Write the failing test for `compute_morgan_fingerprint`**
532
+
533
+ Append to `tests/pipelines/test_bbb_pipeline.py`:
534
+ ```python
+ import numpy as np
+
+ from src.pipelines.bbb_pipeline import compute_morgan_fingerprint
+
+
+ class TestComputeMorganFingerprint:
+     def test_returns_numpy_array_of_correct_length(self) -> None:
+         fp = compute_morgan_fingerprint("CCCO", n_bits=2048, radius=2)
+         assert isinstance(fp, np.ndarray)
+         assert fp.shape == (2048,)
+         assert fp.dtype == np.uint8
+
+     def test_only_zero_or_one(self) -> None:
+         fp = compute_morgan_fingerprint("c1ccccc1", n_bits=1024, radius=2)
+         assert set(np.unique(fp).tolist()).issubset({0, 1})
+
+     def test_different_molecules_yield_different_fingerprints(self) -> None:
+         fp_a = compute_morgan_fingerprint("CCCO", n_bits=2048, radius=2)
+         fp_b = compute_morgan_fingerprint("c1ccccc1", n_bits=2048, radius=2)
+         assert not np.array_equal(fp_a, fp_b)
+
+     def test_invalid_smiles_raises_value_error(self) -> None:
+         with pytest.raises(ValueError, match="invalid SMILES"):
+             compute_morgan_fingerprint("not_a_smiles", n_bits=2048, radius=2)
+ ```
560
+
561
+ - [ ] **Step 2: Run tests to verify they fail**
562
+
563
+ Run: `pytest tests/pipelines/test_bbb_pipeline.py::TestComputeMorganFingerprint -v`
564
+
565
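+ Expected: collection fails with `ImportError: cannot import name 'compute_morgan_fingerprint'`. The import is at module level, so pytest reports one collection error for the file rather than 4 separate failures.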
566
+
567
+ - [ ] **Step 3: Implement `compute_morgan_fingerprint`**
568
+
569
+ Append to `src/pipelines/bbb_pipeline.py`:
570
+ ```python
+ import numpy as np
+ from rdkit.Chem import AllChem
+ from rdkit.DataStructs import ConvertToNumpyArray
+
+
+ def compute_morgan_fingerprint(
+     smiles: str,
+     n_bits: int = 2048,
+     radius: int = 2,
+ ) -> np.ndarray:
+     """Compute the Morgan (ECFP-like) circular fingerprint for a SMILES.
+
+     Args:
+         smiles: A SMILES string already known to be valid. Pass through
+             `is_valid_smiles` first if the source is untrusted.
+         n_bits: Length of the bit vector. 2048 is the de-facto default
+             for downstream scikit-learn classifiers.
+         radius: Morgan radius (2 ≈ ECFP4).
+
+     Returns:
+         A 1-D `np.ndarray` of length `n_bits` and dtype `uint8`, where
+         each element is 0 or 1.
+
+     Raises:
+         ValueError: if `smiles` cannot be parsed by RDKit.
+     """
+     mol = Chem.MolFromSmiles(smiles)
+     if mol is None:
+         raise ValueError(f"invalid SMILES: {smiles!r}")
+
+     bit_vect = AllChem.GetMorganFingerprintAsBitVect(mol, radius=radius, nBits=n_bits)
+     arr = np.zeros((n_bits,), dtype=np.uint8)
+     # RDKit ships a fast C++ writer into a preallocated numpy buffer.
+     ConvertToNumpyArray(bit_vect, arr)
+     return arr
+ ```
607
+
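To make the role of `n_bits` more concrete: a hashed fingerprint maps an open-ended set of substructure identifiers onto a fixed number of bit positions, so distinct identifiers can land on the same bit. The toy below folds arbitrary string labels with SHA-256. It is not RDKit's Morgan algorithm; the function name `fold_features` and the feature labels are made up for illustration.

```python
import hashlib


def fold_features(features: list[str], n_bits: int = 2048) -> list[int]:
    """Set one bit per feature at position hash(feature) mod n_bits.

    Toy illustration of folding into a fixed-length bit vector; the real
    pipeline relies on RDKit for atom environments and hashing.
    """
    if n_bits <= 0:
        raise ValueError("n_bits must be positive")
    bits = [0] * n_bits
    for feature in features:
        digest = hashlib.sha256(feature.encode("utf-8")).digest()
        position = int.from_bytes(digest[:8], "big") % n_bits
        bits[position] = 1
    return bits


# Made-up labels standing in for atom environments.
labels = ["C-C", "C-O", "O-H"]
wide = fold_features(labels, n_bits=2048)
narrow = fold_features(labels, n_bits=2)  # three labels cannot fit in two bits without a collision
```

Smaller vectors trade information for size, which is presumably why the plan's tests use 64 or 128 bits for speed while the default stays at 2048.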
608
+ - [ ] **Step 4: Run tests to verify they pass**
609
+
610
+ Run: `pytest tests/pipelines/test_bbb_pipeline.py -v`
611
+
612
+ Expected: all tests so far PASS (6 from Task 5 + 4 new).
613
+
614
+ - [ ] **Step 5: Commit**
615
+
616
+ ```bash
617
+ git add tests/pipelines/test_bbb_pipeline.py src/pipelines/bbb_pipeline.py
618
+ git commit -m "feat(bbb): add Morgan fingerprint extraction with shape/dtype guarantees"
619
+ ```
620
+
621
+ ---
622
+
623
+ ## Task 7: BBB Pipeline — DataFrame Feature Extraction (TDD)
624
+
625
+ **Files:**
626
+ - Modify: `tests/pipelines/test_bbb_pipeline.py`
627
+ - Modify: `src/pipelines/bbb_pipeline.py`
628
+
629
+ - [ ] **Step 1: Write the failing test for `extract_features_from_dataframe`**
630
+
631
+ Append to `tests/pipelines/test_bbb_pipeline.py`:
632
+ ```python
+ from src.pipelines.bbb_pipeline import extract_features_from_dataframe
+
+
+ class TestExtractFeaturesFromDataFrame:
+     def test_filters_invalid_smiles(self) -> None:
+         raw = pd.read_csv(FIXTURE)
+         # Sanity: fixture contains 6 rows total, 2 are invalid by construction.
+         assert len(raw) == 6
+
+         features = extract_features_from_dataframe(raw, smiles_col="smiles", n_bits=128, radius=2)
+
+         # Only the 4 chemically valid rows should remain.
+         assert len(features) == 4
+
+     def test_preserves_label_column(self) -> None:
+         raw = pd.read_csv(FIXTURE)
+         features = extract_features_from_dataframe(raw, smiles_col="smiles", n_bits=128, radius=2)
+         assert "p_np" in features.columns
+
+     def test_expands_fingerprint_into_named_columns(self) -> None:
+         raw = pd.read_csv(FIXTURE)
+         features = extract_features_from_dataframe(raw, smiles_col="smiles", n_bits=128, radius=2)
+         fp_cols = [c for c in features.columns if c.startswith("fp_")]
+         assert len(fp_cols) == 128
+         # All FP columns must be 0/1 integers.
+         assert features[fp_cols].isin([0, 1]).all().all()
+
+     def test_drops_smiles_string_after_expansion(self) -> None:
+         """Once expanded to bits, the original SMILES string adds no signal."""
+         raw = pd.read_csv(FIXTURE)
+         features = extract_features_from_dataframe(raw, smiles_col="smiles", n_bits=128, radius=2)
+         assert "smiles" not in features.columns
+
+     def test_resets_index(self) -> None:
+         raw = pd.read_csv(FIXTURE)
+         features = extract_features_from_dataframe(raw, smiles_col="smiles", n_bits=128, radius=2)
+         assert list(features.index) == list(range(len(features)))
+ ```
671
+
672
+ - [ ] **Step 2: Run tests to verify they fail**
673
+
674
+ Run: `pytest tests/pipelines/test_bbb_pipeline.py::TestExtractFeaturesFromDataFrame -v`
675
+
676
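+ Expected: collection fails with `ImportError: cannot import name 'extract_features_from_dataframe'`. The import is at module level, so pytest reports one collection error for the file rather than 5 separate failures.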
677
+
678
+ - [ ] **Step 3: Implement `extract_features_from_dataframe`**
679
+
680
+ Append to `src/pipelines/bbb_pipeline.py`:
681
+ ```python
+ import pandas as pd
+
+
+ def extract_features_from_dataframe(
+     df: pd.DataFrame,
+     smiles_col: str = "smiles",
+     n_bits: int = 2048,
+     radius: int = 2,
+ ) -> pd.DataFrame:
+     """Convert a DataFrame of (SMILES + metadata) into model-ready features.
+
+     Steps:
+         1. Validate every SMILES with `is_valid_smiles`. Invalid rows are
+            logged at WARNING with their original index and dropped.
+         2. Compute the Morgan fingerprint for each remaining SMILES.
+         3. Expand the bit vector into `n_bits` integer columns named
+            `fp_0 ... fp_{n_bits - 1}` and concatenate with the surviving
+            non-SMILES metadata.
+
+     Args:
+         df: Raw DataFrame; must contain `smiles_col`.
+         smiles_col: Name of the SMILES column (default `"smiles"`).
+         n_bits: Fingerprint length.
+         radius: Morgan radius.
+
+     Returns:
+         A new DataFrame with the SMILES column dropped and `n_bits` new
+         `fp_*` columns appended. Index is reset to 0..N-1.
+
+     Raises:
+         KeyError: if `smiles_col` is missing from `df`.
+     """
+     if smiles_col not in df.columns:
+         raise KeyError(f"DataFrame is missing required column {smiles_col!r}")
+
+     n_total = len(df)
+     valid_mask = df[smiles_col].apply(is_valid_smiles)
+     n_invalid = int((~valid_mask).sum())
+
+     if n_invalid:
+         invalid_indices = df.index[~valid_mask].tolist()
+         logger.warning(
+             "Dropping %d/%d rows with invalid SMILES (indices=%s)",
+             n_invalid, n_total, invalid_indices,
+         )
+
+     valid_df = df.loc[valid_mask].reset_index(drop=True)
+
+     fingerprints = np.stack(
+         [
+             compute_morgan_fingerprint(s, n_bits=n_bits, radius=radius)
+             for s in valid_df[smiles_col].tolist()
+         ],
+         axis=0,
+     )
+     fp_columns = [f"fp_{i}" for i in range(n_bits)]
+     fp_df = pd.DataFrame(fingerprints, columns=fp_columns, dtype=np.uint8)
+
+     metadata = valid_df.drop(columns=[smiles_col]).reset_index(drop=True)
+     out = pd.concat([metadata, fp_df], axis=1)
+
+     logger.info(
+         "Feature extraction complete: in=%d, out=%d, dropped=%d (%.2f%%)",
+         n_total, len(out), n_invalid, 100.0 * n_invalid / max(n_total, 1),
+     )
+     return out
+ ```
749
+
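The traceability part of this step (rows in, rows out, percentage dropped) does not depend on pandas or RDKit. The sketch below restates just that accounting with the standard library, under stated assumptions: `split_rows` is a hypothetical helper, and the lambda is a crude stand-in predicate that only rejects blanks and underscores, not a substitute for `is_valid_smiles`.

```python
import csv
import io
import logging
from typing import Callable

log = logging.getLogger("drop_accounting_sketch")


def split_rows(
    csv_text: str,
    column: str,
    is_valid: Callable[[str], bool],
) -> tuple[list[dict], list[int]]:
    """Return (kept_rows, dropped_positions) and log in/out/dropped counts."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if rows and column not in rows[0]:
        raise KeyError(f"missing required column {column!r}")
    kept, dropped = [], []
    for position, row in enumerate(rows):
        if is_valid(row[column]):
            kept.append(row)
        else:
            dropped.append(position)
    if dropped:
        log.warning("Dropping %d/%d rows (positions=%s)", len(dropped), len(rows), dropped)
    log.info(
        "in=%d, out=%d, dropped=%d (%.2f%%)",
        len(rows), len(kept), len(dropped), 100.0 * len(dropped) / max(len(rows), 1),
    )
    return kept, dropped


# Three rows taken from the fixture: one valid, one garbage, one empty.
FIXTURE_TEXT = """num,name,p_np,smiles
1,Propanol,1,CCCO
4,InvalidMol,0,this_is_not_a_smiles
6,EmptyMol,0,
"""
kept, dropped = split_rows(FIXTURE_TEXT, "smiles", lambda s: bool(s.strip()) and "_" not in s)
```

One difference worth noting: this version returns an empty list when every row is invalid, whereas `extract_features_from_dataframe` as written would pass an empty list to `np.stack`, which raises `ValueError`. A guard for the all-invalid case may be worth adding if such inputs are possible.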
750
+ - [ ] **Step 4: Run all tests to verify they pass**
751
+
752
+ Run: `pytest tests/pipelines/test_bbb_pipeline.py -v`
753
+
754
+ Expected: all tests so far PASS (6 + 4 + 5 = 15).
755
+
756
+ - [ ] **Step 5: Commit**
757
+
758
+ ```bash
759
+ git add tests/pipelines/test_bbb_pipeline.py src/pipelines/bbb_pipeline.py
760
+ git commit -m "feat(bbb): expand SMILES → Morgan FP into model-ready DataFrame with drop-count logging"
761
+ ```
762
+
763
+ ---
764
+
765
+ ## Task 8: BBB Pipeline — `run_pipeline` Orchestrator + CLI (TDD)
766
+
767
+ **Files:**
768
+ - Modify: `tests/pipelines/test_bbb_pipeline.py`
769
+ - Modify: `src/pipelines/bbb_pipeline.py`
770
+
771
+ - [ ] **Step 1: Write the failing integration test for `run_pipeline`**
772
+
773
+ Append to `tests/pipelines/test_bbb_pipeline.py`:
774
+ ```python
775
+ import shutil
776
+
777
+ from src.pipelines.bbb_pipeline import run_pipeline
778
+
779
+
780
+ class TestRunPipeline:
781
+ def test_end_to_end_writes_processed_csv(self, tmp_path: Path) -> None:
782
+ # Arrange: copy fixture into a synthetic raw layout.
783
+ raw_dir = tmp_path / "data" / "raw"
784
+ proc_dir = tmp_path / "data" / "processed"
785
+ raw_dir.mkdir(parents=True)
786
+ proc_dir.mkdir(parents=True)
787
+ input_path = raw_dir / "bbbp.csv"
788
+ output_path = proc_dir / "bbbp_features.csv"
789
+ shutil.copy(FIXTURE, input_path)
790
+
791
+ # Act
792
+ run_pipeline(input_path=input_path, output_path=output_path, n_bits=128, radius=2)
793
+
794
+ # Assert: file exists
795
+ assert output_path.exists(), "pipeline must write processed CSV"
796
+
797
+ # Assert: content is correct
798
+ out = pd.read_csv(output_path)
799
+ assert len(out) == 4 # 6 raw - 2 invalid
800
+ assert "p_np" in out.columns
801
+ assert sum(c.startswith("fp_") for c in out.columns) == 128
802
+ assert "smiles" not in out.columns
803
+
804
+ def test_run_pipeline_is_idempotent(self, tmp_path: Path) -> None:
805
+ raw_dir = tmp_path / "data" / "raw"
806
+ proc_dir = tmp_path / "data" / "processed"
807
+ raw_dir.mkdir(parents=True)
808
+ proc_dir.mkdir(parents=True)
809
+ input_path = raw_dir / "bbbp.csv"
810
+ output_path = proc_dir / "bbbp_features.csv"
811
+ shutil.copy(FIXTURE, input_path)
812
+
813
+ run_pipeline(input_path=input_path, output_path=output_path, n_bits=64, radius=2)
814
+ first_bytes = output_path.read_bytes()
815
+ run_pipeline(input_path=input_path, output_path=output_path, n_bits=64, radius=2)
816
+ second_bytes = output_path.read_bytes()
817
+
818
+ assert first_bytes == second_bytes, "pipeline output must be byte-deterministic"
819
+
820
+ def test_run_pipeline_raises_when_input_missing(self, tmp_path: Path) -> None:
821
+ with pytest.raises(FileNotFoundError):
822
+ run_pipeline(
823
+ input_path=tmp_path / "nope.csv",
824
+ output_path=tmp_path / "out.csv",
825
+ )
826
+ ```
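The idempotency test compares raw bytes, which is fine for the small fixture. For a full-size output it may be more convenient to compare digests between runs. A minimal sketch using only the standard library; `file_sha256` is a hypothetical helper and is not part of the pipeline or its tests:

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops as soon as read() returns b"" (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two runs with identical inputs and parameters should yield the same digest for `bbbp_features.csv`; a mismatch would point at a non-deterministic step.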
827
+
828
+ - [ ] **Step 2: Run tests to verify they fail**
829
+
830
+ Run: `pytest tests/pipelines/test_bbb_pipeline.py::TestRunPipeline -v`
831
+
832
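+ Expected: collection fails with `ImportError: cannot import name 'run_pipeline'`. Because the import sits at module level, pytest reports a collection error and none of the 3 new tests run.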
833
+
834
+ - [ ] **Step 3: Implement `run_pipeline` and CLI entrypoint**
835
+
836
+ Append to `src/pipelines/bbb_pipeline.py`:
837
+ ```python
838
+ from pathlib import Path
839
+
840
+ DEFAULT_INPUT = Path("data/raw/bbbp.csv")
841
+ DEFAULT_OUTPUT = Path("data/processed/bbbp_features.csv")
842
+
843
+
844
+ def run_pipeline(
845
+ input_path: Path = DEFAULT_INPUT,
846
+ output_path: Path = DEFAULT_OUTPUT,
847
+ smiles_col: str = "smiles",
848
+ n_bits: int = 2048,
849
+ radius: int = 2,
850
+ ) -> None:
851
+ """Run the BBB pipeline end-to-end: raw CSV → processed feature CSV.
852
+
853
+ Reads the Kaggle BBBP CSV at `input_path`, validates and converts
854
+ SMILES into Morgan fingerprints, and writes the model-ready table
855
+ to `output_path`. Output is overwritten on every run (idempotent).
856
+
857
+ Args:
858
+ input_path: Path to the raw BBBP CSV (must include `smiles_col`).
859
+ output_path: Where to write the processed feature CSV. Parent
860
+ directory is created if missing.
861
+ smiles_col: SMILES column name in the raw CSV.
862
+ n_bits: Morgan fingerprint length.
863
+ radius: Morgan radius.
864
+
865
+ Raises:
866
+ FileNotFoundError: if `input_path` does not exist.
867
+ KeyError: if `smiles_col` is missing from the CSV.
868
+ """
869
+ input_path = Path(input_path)
870
+ output_path = Path(output_path)
871
+
872
+ if not input_path.exists():
873
+ raise FileNotFoundError(f"Raw BBBP file not found: {input_path}")
874
+
875
+ logger.info("Reading raw BBBP from %s", input_path)
876
+ df = pd.read_csv(input_path)
877
+ logger.info("Loaded %d rows, columns=%s", len(df), list(df.columns))
878
+
879
+ features = extract_features_from_dataframe(
880
+ df, smiles_col=smiles_col, n_bits=n_bits, radius=radius,
881
+ )
882
+
883
+ output_path.parent.mkdir(parents=True, exist_ok=True)
884
+ features.to_csv(output_path, index=False)
885
+ logger.info("Wrote processed features to %s (rows=%d, cols=%d)",
886
+ output_path, len(features), features.shape[1])
887
+
888
+
889
+ if __name__ == "__main__":
890
+ # Production-ready CLI entrypoint:
891
+ # python -m src.pipelines.bbb_pipeline
892
+ run_pipeline()
893
+ ```
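The entrypoint above always runs with the defaults. If overriding paths or fingerprint settings from the shell becomes necessary, a thin `argparse` layer is one option. The sketch below is an assumption and not part of this plan: `build_parser`, `parse_cli`, and the flag names are hypothetical, and only the defaults mirror the `run_pipeline` signature.

```python
import argparse
from pathlib import Path
from typing import Optional, Sequence

# Defaults copied from the run_pipeline signature in this plan.
DEFAULT_INPUT = Path("data/raw/bbbp.csv")
DEFAULT_OUTPUT = Path("data/processed/bbbp_features.csv")


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI parser; the flag names are illustrative only."""
    parser = argparse.ArgumentParser(description="BBB feature pipeline")
    parser.add_argument("--input-path", type=Path, default=DEFAULT_INPUT)
    parser.add_argument("--output-path", type=Path, default=DEFAULT_OUTPUT)
    parser.add_argument("--smiles-col", default="smiles")
    parser.add_argument("--n-bits", type=int, default=2048)
    parser.add_argument("--radius", type=int, default=2)
    return parser


def parse_cli(argv: Optional[Sequence[str]] = None) -> dict:
    """Parse argv into keyword arguments named like run_pipeline's parameters."""
    return vars(build_parser().parse_args(argv))
```

With this in place, the `__main__` guard could call `run_pipeline(**parse_cli())`, and a bare `python -m src.pipelines.bbb_pipeline` would still behave as before.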
894
+
895
+ - [ ] **Step 4: Run the full test suite to verify everything passes**
896
+
897
+ Run: `pytest -v`
898
+
899
+ Expected: 22 PASS (4 logger + 18 BBB: 6 SMILES validity + 4 Morgan FP + 5 DataFrame + 3 run_pipeline).
900
+
901
+ - [ ] **Step 5: Commit**
902
+
903
+ ```bash
904
+ git add tests/pipelines/test_bbb_pipeline.py src/pipelines/bbb_pipeline.py
905
+ git commit -m "feat(bbb): add run_pipeline orchestrator + CLI entrypoint with idempotent writes"
906
+ ```
907
+
908
+ ---
909
+
910
+ ## Task 9: Final Wiring & Day-1 Acceptance Check
911
+
912
+ **Files:** none modified (verification + docs only)
913
+
914
+ - [ ] **Step 1: Run the full suite one last time**
915
+
916
+ Run: `pytest -v --tb=short`
917
+
918
+ Expected: **22 passed** with no warnings (RDKit deprecation notices are already silenced via `RDLogger.DisableLog`).
919
+
920
+ - [ ] **Step 2: Confirm the CLI works against a real (or sample) BBBP file**
921
+
922
+ If a real Kaggle BBBP dump is available, place it at `data/raw/bbbp.csv`. Otherwise copy the fixture for a smoke run:
923
+ ```bash
924
+ cp tests/fixtures/bbbp_sample.csv data/raw/bbbp.csv
925
+ python -m src.pipelines.bbb_pipeline
926
+ ```
927
+
928
+ Expected stdout (timestamps will differ):
929
+ ```
930
+ ... | INFO | src.pipelines.bbb_pipeline | Reading raw BBBP from data/raw/bbbp.csv
931
+ ... | INFO | src.pipelines.bbb_pipeline | Loaded 6 rows, columns=['num', 'name', 'p_np', 'smiles']
932
+ ... | WARNING | src.pipelines.bbb_pipeline | Dropping 2/6 rows with invalid SMILES (indices=[3, 5])
933
+ ... | INFO | src.pipelines.bbb_pipeline | Feature extraction complete: in=6, out=4, dropped=2 (33.33%)
934
+ ... | INFO | src.pipelines.bbb_pipeline | Wrote processed features to data/processed/bbbp_features.csv (rows=4, cols=2051)
935
+ ```
936
+
937
+ And confirm the output:
938
+ ```bash
939
+ ls -lh data/processed/bbbp_features.csv
940
+ head -1 data/processed/bbbp_features.csv | tr ',' '\n' | head -5
941
+ ```
942
+
943
+ Expected: the file exists and the five printed header fields are `num`, `name`, `p_np`, `fp_0`, `fp_1`.
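The same header check can be scripted, which also copes with quoted commas that `tr` would split on. A minimal sketch using only the standard library; `summarize_header` is a hypothetical helper, and the `fp_` prefix is taken from the column names used in the tests above:

```python
import csv
from pathlib import Path


def summarize_header(csv_path: Path, fp_prefix: str = "fp_") -> dict:
    """Read only the header row and split it into metadata vs fingerprint columns."""
    with open(csv_path, newline="") as handle:
        header = next(csv.reader(handle))
    fp_cols = [col for col in header if col.startswith(fp_prefix)]
    meta_cols = [col for col in header if not col.startswith(fp_prefix)]
    return {
        "metadata": meta_cols,
        "n_fingerprint_bits": len(fp_cols),
        "n_columns": len(header),
    }
```

For a default run against the fixture, `metadata` should come back as `['num', 'name', 'p_np']` and `n_fingerprint_bits` as 2048.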
944
+
945
+ - [ ] **Step 3: Confirm nothing is left to commit (sample raw data stays local for the next agent's smoke test)**
946
+
947
+ If you copied the fixture into `data/raw/bbbp.csv`, **do not commit it** (gitignored by design). Just leave it on disk for local runs. Confirm git is clean:
948
+
949
+ ```bash
950
+ git status
951
+ ```
952
+
953
+ Expected: `nothing to commit, working tree clean` (data files ignored).
954
+
955
+ ---
956
+
957
+ ## Day-1 Definition of Done
958
+
959
+ - [ ] `AGENTS.md` lives at the repo root and documents vision, layout, standards, and the Data Readiness contract.
960
+ - [ ] `requirements.txt` pins all deps for the three modalities + FastAPI + MLflow + tests.
961
+ - [ ] `src/core/logger.py` exposes `get_logger()` with idempotent handler attachment.
962
+ - [ ] `src/pipelines/bbb_pipeline.py` exposes `is_valid_smiles`, `compute_morgan_fingerprint`, `extract_features_from_dataframe`, and `run_pipeline`.
963
+ - [ ] Invalid SMILES are **logged with their indices** and dropped (Data Readiness §2).
964
+ - [ ] `pytest -v` is green with **22 tests** (4 logger + 18 BBB).
965
+ - [ ] Running `python -m src.pipelines.bbb_pipeline` against `data/raw/bbbp.csv` produces a deterministic `data/processed/bbbp_features.csv`.
966
+ - [ ] Each task above ended in its own commit; `git log --oneline` shows ≥ 8 atomic commits for the day.
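
In the checklist above, "idempotent handler attachment" means repeated `get_logger()` calls must not stack duplicate handlers, which would print every log line more than once. The sketch below only illustrates that idea under assumptions: the real `src/core/logger.py` is defined earlier in this plan and may differ in signature and details, and the format string is inferred from the `... | INFO | name | message` lines in Task 9.

```python
import logging
import sys

# Assumed format, inferred from the expected stdout shown in Task 9.
_FORMAT = "%(asctime)s | %(levelname)s | %(name)s | %(message)s"


def get_logger(name: str, level: int = logging.INFO) -> logging.Logger:
    """Return a named logger, attaching a stdout handler only on first use."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # guard: a second call must not add another handler
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(logging.Formatter(_FORMAT))
        logger.addHandler(handler)
    logger.propagate = False  # avoid a second emission through the root logger
    return logger
```

Calling `get_logger("src.pipelines.bbb_pipeline")` twice returns the same object with exactly one handler, which is the property the logger tests are expected to pin down.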