mekosotto committed on
Commit b06b105 · 1 Parent(s): 29e929f

docs: tighten AGENTS.md — clarify Data Readiness checklist and storage boundary

  1. AGENTS.md +5 -5
AGENTS.md CHANGED
@@ -31,11 +31,11 @@ All experiment runs are tracked in **MLflow**. All services ship as **Docker** i
  ├── pytest.ini
  ├── data/
  │ ├── raw/ # Untouched source data. NEVER train on this directly.
- │ └── processed/ # Pipeline output. Model-ready. Versioned outputs.
+ │ └── processed/ # Pipeline output. Model-ready outputs (overwritten on each run; see §4).
  ├── src/
  │ ├── api/ # FastAPI routers, request/response schemas
  │ ├── pipelines/ # One file per modality. Pure functions + a `run_pipeline()` entry.
- │ └── core/ # Cross-cutting utilities: logging, config, MLflow helpers
+ │ └── core/ # Cross-cutting utilities: logging, config (MLflow helpers planned)
  └── tests/
  ├── core/
  ├── pipelines/
@@ -45,7 +45,7 @@ All experiment runs are tracked in **MLflow**. All services ship as **Docker** i
  **Rules:**
  - New modality → new file under `src/pipelines/`. No mixing modalities in one file.
  - Anything imported by 2+ pipelines → `src/core/`.
- - Never read from or write to paths outside `data/`. The `data/` boundary is the storage contract.
+ - Pipeline code (`src/pipelines/`, `src/core/`) must not read from or write to any path outside `data/`. Test code may read `tests/fixtures/`. The `data/` boundary is the storage contract for production data.
  - `tests/fixtures/` holds CSV / numpy / DICOM blobs — do **not** add an `__init__.py` there.
 
  ## 3. Coding Standards
@@ -80,8 +80,8 @@ refactored into a pipeline.
 
  1. Add `tests/pipelines/test_<name>_pipeline.py` with the failing tests first.
  2. Create `src/pipelines/<name>_pipeline.py` exposing `run_pipeline(input_path: Path, output_path: Path) -> None`.
- 3. Use `get_logger(__name__)` for all status output.
- 4. Validate inputs and drop invalid rows with a logged warning.
+ 3. Use `get_logger(__name__)` for all status output (per §3).
+ 4. Apply the §4 Data Readiness contract: validate + drop invalid records with a logged WARNING (identifier + count), log row count in/out/dropped at INFO, write deterministically, and overwrite (do not append) on re-run.
  5. Write deterministic output to `output_path`.
  6. Document any new dependency in `requirements.txt` (pinned).
  7. Add a one-line entry to this file's pipeline table.
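
Reviewer note: the revised steps 3 to 5 could look roughly like the sketch below. It is only an illustration under stated assumptions, not the repository's implementation. The CSV input, the `id` / `value` columns, and the `is_valid` helper are hypothetical. `get_logger` is a local stand-in for the project helper that the checklist names, whose real definition is not shown in this diff.

```python
import csv
import logging
from pathlib import Path


def get_logger(name: str) -> logging.Logger:
    # Stand-in for the project's get_logger helper (assumed; not shown in the diff).
    return logging.getLogger(name)


logger = get_logger(__name__)

# Hypothetical schema, used only for this illustration.
FIELDS = ("id", "value")


def is_valid(row: dict) -> bool:
    """Valid means: a non-empty id and a value that parses as a float."""
    if not row.get("id"):
        return False
    try:
        float(row.get("value", ""))
    except (TypeError, ValueError):
        return False
    return True


def run_pipeline(input_path: Path, output_path: Path) -> None:
    with input_path.open(newline="") as f:
        rows = list(csv.DictReader(f))

    kept = [r for r in rows if is_valid(r)]
    dropped = [r for r in rows if not is_valid(r)]

    # WARNING carries both the identifiers and the count of dropped records.
    if dropped:
        ids = [r.get("id") or "<missing>" for r in dropped]
        logger.warning("Dropped %d invalid record(s): %s", len(dropped), ids)

    # INFO carries the row counts in / out / dropped.
    logger.info("rows in=%d out=%d dropped=%d", len(rows), len(kept), len(dropped))

    # Sorting makes the output order independent of the input order.
    kept.sort(key=lambda r: r["id"])

    # Mode "w" truncates the file, so a re-run overwrites instead of appending.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(kept)
```

Running it twice on the same input should produce byte-identical output, which is one simple way a test under `tests/pipelines/` could check the "deterministic" and "overwrite" parts of the contract.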