docs: tighten AGENTS.md — clarify Data Readiness checklist and storage boundary
AGENTS.md
```diff
@@ -31,11 +31,11 @@ All experiment runs are tracked in **MLflow**. All services ship as **Docker** i
 ├── pytest.ini
 ├── data/
 │   ├── raw/          # Untouched source data. NEVER train on this directly.
-│   └── processed/    # Pipeline output. Model-ready
+│   └── processed/    # Pipeline output. Model-ready outputs (overwritten on each run; see §4).
 ├── src/
 │   ├── api/          # FastAPI routers, request/response schemas
 │   ├── pipelines/    # One file per modality. Pure functions + a `run_pipeline()` entry.
-│   └── core/         # Cross-cutting utilities: logging, config
+│   └── core/         # Cross-cutting utilities: logging, config (MLflow helpers planned)
 └── tests/
     ├── core/
     ├── pipelines/
@@ -45,7 +45,7 @@ All experiment runs are tracked in **MLflow**. All services ship as **Docker** i
 **Rules:**
 - New modality → new file under `src/pipelines/`. No mixing modalities in one file.
 - Anything imported by 2+ pipelines → `src/core/`.
--
+- Pipeline code (`src/pipelines/`, `src/core/`) must not read from or write to any path outside `data/`. Test code may read `tests/fixtures/`. The `data/` boundary is the storage contract for production data.
 - `tests/fixtures/` holds CSV / numpy / DICOM blobs — do **not** add an `__init__.py` there.
 
 ## 3. Coding Standards
@@ -80,8 +80,8 @@ refactored into a pipeline.
 
 1. Add `tests/pipelines/test_<name>_pipeline.py` with the failing tests first.
 2. Create `src/pipelines/<name>_pipeline.py` exposing `run_pipeline(input_path: Path, output_path: Path) -> None`.
-3. Use `get_logger(__name__)` for all status output.
-4.
+3. Use `get_logger(__name__)` for all status output (per §3).
+4. Apply the §4 Data Readiness contract: validate + drop invalid records with a logged WARNING (identifier + count), log row count in/out/dropped at INFO, write deterministically, and overwrite (do not append) on re-run.
 5. Write deterministic output to `output_path`.
 6. Document any new dependency in `requirements.txt` (pinned).
 7. Add a one-line entry to this file's pipeline table.
```
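For reviewers, a minimal sketch of a pipeline that satisfies the revised checklist and the §4 Data Readiness contract. The modality name (`tabular`), the CSV format, the validation rule, and the `src.core.logging` import path are illustrative assumptions; only the `run_pipeline(input_path, output_path)` signature and `get_logger(__name__)` come from AGENTS.md.

```python
# Hypothetical src/pipelines/tabular_pipeline.py (all names illustrative only).
import csv
from pathlib import Path

from src.core.logging import get_logger  # assumed module path for get_logger

logger = get_logger(__name__)


def _is_valid(row: dict) -> bool:
    """Assumed validation rule: non-empty id and a numeric value field."""
    try:
        float(row["value"])
        return bool(row["id"])
    except (KeyError, ValueError):
        return False


def run_pipeline(input_path: Path, output_path: Path) -> None:
    with input_path.open(newline="") as f:
        rows = list(csv.DictReader(f))

    kept = [r for r in rows if _is_valid(r)]
    dropped = [r for r in rows if not _is_valid(r)]

    # §4: drop invalid records with a WARNING carrying identifiers + count.
    if dropped:
        logger.warning(
            "dropping %d invalid record(s): %s",
            len(dropped),
            [r.get("id", "<missing id>") for r in dropped],
        )
    # §4: row counts in/out/dropped at INFO.
    logger.info("rows in=%d out=%d dropped=%d", len(rows), len(kept), len(dropped))

    # §4: deterministic output (stable sort, fixed column order); mode "w"
    # overwrites any previous run instead of appending.
    kept.sort(key=lambda r: r["id"])
    with output_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "value"], extrasaction="ignore")
        writer.writeheader()
        writer.writerows(kept)
```

Per the new storage-boundary rule, a production invocation stays inside `data/`, e.g. `run_pipeline(Path("data/raw/tabular.csv"), Path("data/processed/tabular.csv"))`, while tests feed it a file from `tests/fixtures/` and write to a temporary output path.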
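A matching sketch of the step-1 test file, written before the pipeline exists so it fails first. The fixture name, its contents, and the CSV output shape are assumptions for illustration.

```python
# Hypothetical tests/pipelines/test_tabular_pipeline.py (fixture name assumed).
from pathlib import Path

from src.pipelines.tabular_pipeline import run_pipeline

FIXTURE = Path("tests/fixtures/tabular_small.csv")  # assumed fixture blob


def test_rerun_overwrites_instead_of_appending(tmp_path: Path) -> None:
    out = tmp_path / "processed.csv"
    run_pipeline(FIXTURE, out)
    first = out.read_bytes()

    run_pipeline(FIXTURE, out)  # §4: re-running must overwrite, not append
    assert out.read_bytes() == first


def test_invalid_records_are_dropped(tmp_path: Path) -> None:
    # Assumes the fixture mixes valid and invalid rows and output is id,value CSV.
    out = tmp_path / "processed.csv"
    run_pipeline(FIXTURE, out)
    rows = out.read_text().splitlines()[1:]  # skip header
    assert all(r.split(",")[0] for r in rows)  # every kept row has an id
```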