Commit 9a0ecd1 · committed by ehsaaniqbal · unverified · 0 Parent(s):
.dockerignore ADDED
@@ -0,0 +1,15 @@
+ .venv
+ .git
+ .gitignore
+ .env
+ __pycache__/
+ *.pyc
+ *.pyo
+ *.pyd
+ *.pyw
+ *.pyz
+ *.pywz
+ *.pyzw
+ *.pyzwz
+
+
.gitignore ADDED
@@ -0,0 +1,10 @@
+ .DS_Store
+ __pycache__/
+ *.pyc
+ .pytest_cache/
+ .venv/
+ *.egg-info/
+ outputs/
+ batch_runs/
+ .env
+ .env.*
Dockerfile ADDED
@@ -0,0 +1,49 @@
+ ARG BASE_IMAGE=ghcr.io/meta-pytorch/openenv-base:latest
+ FROM ${BASE_IMAGE} AS builder
+
+ WORKDIR /app
+
+ RUN apt-get update && \
+     apt-get install -y --no-install-recommends git && \
+     rm -rf /var/lib/apt/lists/*
+
+ ARG BUILD_MODE=in-repo
+ ARG ENV_NAME=invoiceops_env
+
+ COPY . /app/env
+ WORKDIR /app/env
+
+ RUN if ! command -v uv >/dev/null 2>&1; then \
+         curl -LsSf https://astral.sh/uv/install.sh | sh && \
+         mv /root/.local/bin/uv /usr/local/bin/uv && \
+         mv /root/.local/bin/uvx /usr/local/bin/uvx; \
+     fi
+
+ RUN --mount=type=cache,target=/root/.cache/uv \
+     if [ -f uv.lock ]; then \
+         uv sync --frozen --no-install-project --no-editable; \
+     else \
+         uv sync --no-install-project --no-editable; \
+     fi
+
+ RUN --mount=type=cache,target=/root/.cache/uv \
+     if [ -f uv.lock ]; then \
+         uv sync --frozen --no-editable; \
+     else \
+         uv sync --no-editable; \
+     fi
+
+ FROM ${BASE_IMAGE}
+
+ WORKDIR /app
+
+ COPY --from=builder /app/env/.venv /app/.venv
+ COPY --from=builder /app/env /app/env
+
+ ENV PATH="/app/.venv/bin:$PATH"
+ ENV PYTHONPATH="/app/env:$PYTHONPATH"
+
+ HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
+     CMD curl -f http://localhost:8000/health || exit 1
+
+ CMD ["sh", "-c", "cd /app/env && uvicorn server.app:app --host 0.0.0.0 --port 8000"]
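The `HEALTHCHECK` line in the Dockerfile sets `--retries=3`, which in Docker means a container is marked unhealthy only after that many consecutive failed probes. The following is a rough sketch of that counting rule, not Docker's implementation. It deliberately ignores `--interval`, `--timeout`, and the `--start-period` grace window, and the function name `health_status` is made up for illustration.

```python
from __future__ import annotations

from typing import Iterable


def health_status(probe_results: Iterable[bool], retries: int = 3) -> str:
    """Return the final health status after a sequence of probe outcomes.

    Simplified model (assumption): timing and the start-period grace
    window are ignored; only consecutive failures are counted.
    """
    status = "starting"
    consecutive_failures = 0
    for ok in probe_results:
        if ok:
            # Any successful probe resets the failure streak.
            consecutive_failures = 0
            status = "healthy"
        else:
            consecutive_failures += 1
            if consecutive_failures >= retries:
                status = "unhealthy"
    return status


if __name__ == "__main__":
    print(health_status([True, False, False]))
    print(health_status([False, False, False]))
```

Under this model, two failed probes after a success leave the status unchanged, while three failures in a row flip it.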
README.md ADDED
@@ -0,0 +1,185 @@
+ ---
+ title: InvoiceOps Environment Server
+ emoji: 📄
+ colorFrom: yellow
+ colorTo: gray
+ sdk: docker
+ pinned: false
+ app_port: 8000
+ base_path: /web
+ tags:
+ - openenv
+ - finance
+ - accounts-payable
+ - invoices
+ ---
+
+ # InvoiceOps Environment
+
+ Submitted by team: `Markov`
+
+ `InvoiceOps` is a deterministic OpenEnv environment for [accounts payable (AP)](https://en.wikipedia.org/wiki/Accounts_payable) invoice exception handling. Each episode is one invoice case. The agent inspects surfaced exceptions, opens typed supporting artifacts, optionally runs duplicate checks, writes structured notes, saves line and header resolutions, and submits the case for deterministic grading.
+
+ In real AP operations, this is the core decision problem: determine whether an invoice can be paid now, partially released, or routed for further review based on invoices, POs, receipts, approval status, and policy evidence.
+
+ The workflow is loosely modeled on real enterprise AP controls used in systems such as [Microsoft Dynamics 365 Accounts payable](https://learn.microsoft.com/en-us/dynamics365/finance/accounts-payable/accounts-payable), including invoice review and approval, invoice matching, workflow routing, and partial payment handling.
+
+ This environment is intentionally small and CPU-friendly, but it still measures real AP judgment:
+
+ - evidence gathering before payment decisions
+ - line-level vs header-level separation
+ - duplicate-review strategy selection
+ - receipt support judgment
+ - partial release vs full hold
+ - chronology-aware exception handling
+ - routing to the correct follow-up owner when payment is not safe
+
+ ## Public Benchmark
+
+ The public benchmark has four tasks. `easy` is a warm-up. `medium` and `medium_plus` test distinct mid-tier capabilities. `hard` is the composition case.
+
+ | Task | Core burden | Best outcome |
+ | ------------- | --------------------------------------------------------------------------------- | ------------------------------------------------------ |
+ | `easy` | Start a missing approval workflow for a non-PO invoice | Hold and route to `requester` |
+ | `medium` | Clear a duplicate exception using the correct evidence path | Approve both lines and release payment |
+ | `medium_plus` | Combine duplicate clearance with mixed line outcomes | Approve `L1`, hold `L2`, release approved lines |
+ | `hard` | Combine duplicate review, invoice arithmetic, receipt chronology, and a tax block | Approve `L1` and `L3`, hold `L2`, hold header to `tax` |
+
+ ### Task Details
+
+ #### `easy`
+
+ Non-PO invoice with no initiated approval workflow. The invoice amount is within requester authority, so the correct action is to hold and route to `requester`.
+
+ #### `medium`
+
+ PO-backed invoice with a possible duplicate flag. The decisive evidence appears only after the normalized invoice number duplicate search. Approving safely requires the right duplicate path plus PO and receipt review.
+
+ #### `medium_plus`
+
+ PO-backed invoice with a possible duplicate flag and one short-received line above the de minimis threshold. The agent must clear the duplicate, separate line outcomes correctly, and use `release_approved_lines` instead of a blanket hold.
+
+ #### `hard`
+
+ Project invoice with interacting burdens: duplicate review, de minimis invoice arithmetic on `L1`, chronology-sensitive receipt support on `L2`, and a tax header block that routes to `tax`.
+
+ ## Action Space
+
+ `InvoiceOpsAction` is a typed action model with these actions:
+
+ - `open_artifact`
+ - `inspect_exception`
+ - `run_duplicate_check`
+ - `add_note`
+ - `set_line_resolution`
+ - `set_header_resolution`
+ - `submit_case`
+
+ ## Observation Space
+
+ `InvoiceOpsObservation` includes:
+
+ - queue-level case summary
+ - available artifacts
+ - most recently opened artifact
+ - exception stubs and inspected exception details
+ - duplicate candidates surfaced by the chosen strategy
+ - saved notes
+ - draft line and header resolutions
+ - progress counters
+ - final deterministic submission report after submit
+
+ ## Scoring
+
+ The reward function provides dense trajectory signal for useful work such as first-time artifact opens, exception inspection, duplicate checks, notes, and valid saved resolutions. It penalizes invalid or redundant actions and inefficient trajectories.
+
+ Final grading is deterministic and two-stage:
+
+ 1. Assign a `decision_band`: `best`, `safe_suboptimal`, `wrong`, or `unsafe`.
+ 2. Score within that band using core decision quality, timely evidence, structured documentation coverage, and efficiency.
+
+ Important grading rule: best outcomes require the agent to uncover the required evidence before saving the decision. Conservative holds can still earn `safe_suboptimal` when the observed evidence justifies caution.
+
+ ## Design Choices
+
+ This benchmark was iterated on, not created in one pass. We tried weaker task and grader shapes first, then removed designs that were easy to game or that clustered strong models for the wrong reasons.
+
+ Key anti-gaming choices:
+
+ - no pre-opened artifacts, auto-inspected exceptions, or auto-run duplicate checks
+ - no hidden scenario-specific solver logic in the environment or grader
+ - no prose grading; scores depend on typed actions, saved resolutions, observed evidence, and timing
+ - fallback runs are zeroed in baseline mean scoring
+ - conservative blanket holds are capped in `safe_suboptimal`; they do not earn `best`
+
+ Main lessons from iteration:
+
+ - making partial credit harsher did not improve the benchmark; harder tasks had to require better evidence use and better judgment
+ - gating on restated citation strings was too brittle; grading now depends on evidence actually uncovered before the decision was saved
+
+ ## Local Setup
+
+ ```bash
+ cd invoiceops_env
+ uv sync --extra dev
+ uv run pytest -q
+ uv run server --port 8000
+ ```
+
+ Run validation from the environment root:
+
+ ```bash
+ openenv validate .
+ openenv validate --url http://localhost:8000
+ ```
+
+ If `openenv` is not installed in the current environment:
+
+ ```bash
+ uvx --from openenv-core openenv validate .
+ uvx --from openenv-core openenv validate --url http://localhost:8000
+ ```
+
+ ## Baseline
+
+ The root [inference.py](./inference.py) script is the reproducible baseline.
+
+ - OpenAI Python client
+ - default `API_BASE_URL`: `https://router.huggingface.co/v1`
+ - default `MODEL_NAME`: `zai-org/GLM-5.1`
+ - fallback tasks are zeroed in `mean_score` by default while raw environment scores are still preserved
+ - run artifacts are written under `outputs/evals/`
+
+ Verified baseline on the current public benchmark:
+
+ - model: `zai-org/GLM-5.1`
+ - mean score: `0.6149`
+ - task scores: `easy 0.9862`, `medium 0.9628`, `medium_plus 0.3130`, `hard 0.1975`
+
+ Run it with:
+
+ ```bash
+ cd invoiceops_env
+ HF_TOKEN=... \
+ API_BASE_URL=https://router.huggingface.co/v1 \
+ uv run python inference.py
+ ```
+
+ Optional environment variables:
+
+ - `HF_TOKEN`
+ - `API_BASE_URL`
+ - `MODEL_NAME`
+ - `ENV_URL`
+ - `EVAL_RUN_NAME`
+ - `MAX_TOKENS`
+ - `RETRY_MAX_TOKENS`
+ - `STRICT_BASELINE_SCORING`
+
+ ## Docker
+
+ ```bash
+ cd invoiceops_env
+ docker build -t invoiceops-env:latest .
+ docker run -p 8000:8000 invoiceops-env:latest
+ ```
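The baseline numbers in the README can be checked by hand: the unweighted average of the four listed task scores is 0.614875, which agrees with the reported mean of `0.6149`. The sketch below reproduces that arithmetic and illustrates the "fallback tasks are zeroed" rule. It assumes the mean is a plain average over the four tasks and that a fallback task simply contributes 0.0. The helper `mean_score` and its `fallback_tasks` parameter are hypothetical and are not taken from `inference.py`.

```python
from __future__ import annotations

# Per-task scores copied from the README's verified baseline (zai-org/GLM-5.1).
BASELINE = {"easy": 0.9862, "medium": 0.9628, "medium_plus": 0.3130, "hard": 0.1975}


def mean_score(
    scores: dict[str, float],
    fallback_tasks: frozenset[str] = frozenset(),
) -> float:
    """Unweighted mean; tasks listed in fallback_tasks count as 0.0 (assumed rule)."""
    counted = [0.0 if task in fallback_tasks else score for task, score in scores.items()]
    return sum(counted) / len(counted)


# No fallbacks: this is the figure the README reports, before rounding.
raw = mean_score(BASELINE)

# Hypothetical case: pretend the `hard` run fell back, so it is zeroed.
strict = mean_score(BASELINE, frozenset({"hard"}))

if __name__ == "__main__":
    print(f"raw mean    = {raw:.6f}")
    print(f"strict mean = {strict:.6f}")
```

Zeroing a fallback task can only lower the mean, which is why the README lists it among the anti-gaming choices.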
__init__.py ADDED
@@ -0,0 +1,17 @@
+ """InvoiceOps environment package."""
+
+ from invoiceops_env.client import InvoiceOpsEnv
+ from invoiceops_env.models import (
+     InvoiceOpsAction,
+     InvoiceOpsObservation,
+     InvoiceOpsState,
+     TaskId,
+ )
+
+ __all__ = [
+     "InvoiceOpsAction",
+     "InvoiceOpsEnv",
+     "InvoiceOpsObservation",
+     "InvoiceOpsState",
+     "TaskId",
+ ]
batch ADDED
@@ -0,0 +1,536 @@
+ #!/usr/bin/env python3
+
+ from __future__ import annotations
+
+ import argparse
+ import concurrent.futures
+ import csv
+ import json
+ import os
+ import re
+ import subprocess
+ import sys
+ import time
+ from dataclasses import dataclass
+ from datetime import datetime, timezone
+ from pathlib import Path
+ from typing import Any
+ from urllib.error import URLError
+ from urllib.request import urlopen
+
+ ROOT = Path(__file__).resolve().parent
+ DEFAULT_API_BASE_URL = "https://router.huggingface.co/v1"
+ DEFAULT_MODELS = [
+     "zai-org/GLM-5.1",
+     "openai/gpt-oss-120b",
+     "MiniMaxAI/MiniMax-M2.5",
+     "moonshotai/Kimi-K2.5",
+     # "google/gemma-4-31B-it",
+ ]
+ TASK_COLUMNS = ["easy", "medium", "medium_plus", "hard"]
+
+
+ @dataclass
+ class ServerHandle:
+     port: int
+     process: subprocess.Popen[str] | None
+     log_path: Path
+     log_handle: Any | None
+     reused: bool = False
+
+     @property
+     def base_url(self) -> str:
+         return f"http://localhost:{self.port}"
+
+
+ def parse_args() -> argparse.Namespace:
+     parser = argparse.ArgumentParser(
+         description="Run a batch of HF models against the local InvoiceOps environment."
+     )
+     parser.add_argument("--models", nargs="+", help="Override the default model list.")
+     parser.add_argument(
+         "--models-file",
+         help="Optional text file with one HF model id per line.",
+     )
+     parser.add_argument(
+         "--port",
+         type=int,
+         default=8000,
+         help="Port for the local InvoiceOps server.",
+     )
+     parser.add_argument(
+         "--sync",
+         action="store_true",
+         help="Run `uv sync --extra dev` before starting.",
+     )
+     parser.add_argument(
+         "--validate",
+         action="store_true",
+         help="Run `openenv validate --url` before the sweep.",
+     )
+     parser.add_argument(
+         "--verbose",
+         action="store_true",
+         help="Echo inference stderr while runs complete.",
+     )
+     parser.add_argument(
+         "--dry-run",
+         action="store_true",
+         help="Print the planned configuration without starting servers or calling models.",
+     )
+     parser.add_argument(
+         "--jobs",
+         type=int,
+         default=1,
+         help="Number of concurrent model runs.",
+     )
+     parser.add_argument(
+         "--reuse-running-server",
+         action="store_true",
+         help="Reuse an already-running server on the target port instead of failing fast.",
+     )
+     return parser.parse_args()
+
+
+ def slugify(value: str) -> str:
+     slug = re.sub(r"[^A-Za-z0-9._-]+", "-", value.strip())
+     slug = slug.strip("-._")
+     return slug or "value"
+
+
+ def load_models(args: argparse.Namespace) -> list[str]:
+     if args.models:
+         return args.models
+     if args.models_file:
+         path = Path(args.models_file).expanduser().resolve()
+         models = [
+             line.strip()
+             for line in path.read_text(encoding="utf-8").splitlines()
+             if line.strip() and not line.strip().startswith("#")
+         ]
+         if not models:
+             raise RuntimeError(f"No models found in {path}")
+         return models
+     return DEFAULT_MODELS
+
+
+ def is_healthy(base_url: str, timeout_s: float = 1.0) -> bool:
+     try:
+         with urlopen(f"{base_url}/health", timeout=timeout_s) as response:
+             return response.status == 200
+     except URLError:
+         return False
+     except Exception:
+         return False
+
+
+ def wait_for_health(base_url: str, timeout_s: float = 20.0) -> bool:
+     start = time.time()
+     while time.time() - start < timeout_s:
+         if is_healthy(base_url, timeout_s=1.0):
+             return True
+         time.sleep(0.5)
+     return False
+
+
+ def start_server(
+     port: int,
+     batch_dir: Path,
+     *,
+     reuse_running_server: bool,
+ ) -> ServerHandle:
+     base_url = f"http://localhost:{port}"
+     if is_healthy(base_url):
+         if not reuse_running_server:
+             raise RuntimeError(
+                 "A healthy server is already running at "
+                 f"{base_url}. Stop it first or rerun with --reuse-running-server."
+             )
+         print(f"[batch] reusing running invoiceops_env at {base_url}", file=sys.stderr)
+         return ServerHandle(
+             port=port,
+             process=None,
+             log_path=batch_dir / "logs" / "invoiceops_env__server.log",
+             log_handle=None,
+             reused=True,
+         )
+
+     log_path = batch_dir / "logs" / "invoiceops_env__server.log"
+     log_handle = log_path.open("w", encoding="utf-8")
+     process = subprocess.Popen(
+         ["uv", "run", "server", "--port", str(port)],
+         cwd=ROOT,
+         stdout=log_handle,
+         stderr=subprocess.STDOUT,
+         text=True,
+     )
+     if not wait_for_health(base_url):
+         process.terminate()
+         try:
+             process.wait(timeout=5)
+         except subprocess.TimeoutExpired:
+             process.kill()
+         log_handle.close()
+         tail = log_path.read_text(encoding="utf-8", errors="replace")[-4000:]
+         raise RuntimeError(f"Failed to start invoiceops_env.\n{tail}")
+
+     print(f"[batch] started invoiceops_env at {base_url}", file=sys.stderr)
+     return ServerHandle(
+         port=port,
+         process=process,
+         log_path=log_path,
+         log_handle=log_handle,
+         reused=False,
+     )
+
+
+ def stop_server(handle: ServerHandle) -> None:
+     if handle.process is not None:
+         handle.process.terminate()
+         try:
+             handle.process.wait(timeout=5)
+         except subprocess.TimeoutExpired:
+             handle.process.kill()
+             handle.process.wait(timeout=5)
+     if handle.log_handle is not None:
+         handle.log_handle.close()
+
+
+ def validate_server(handle: ServerHandle) -> None:
+     subprocess.run(
+         [
+             "uvx",
+             "--from",
+             "openenv-core",
+             "openenv",
+             "validate",
+             "--url",
+             handle.base_url,
+         ],
+         cwd=ROOT,
+         check=True,
+     )
+
+
+ def parse_output_path(stderr_text: str) -> Path | None:
+     for line in reversed(stderr_text.splitlines()):
+         if line.startswith("wrote="):
+             return Path(line.split("=", 1)[1].strip())
+     return None
+
+
+ def make_batch_dir() -> Path:
+     batch_id = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
+     batch_dir = ROOT / "batch_runs" / batch_id
+     (batch_dir / "logs").mkdir(parents=True, exist_ok=True)
+     return batch_dir
+
+
+ def _collect_request_errors(node: Any) -> list[str]:
+     errors: list[str] = []
+     if isinstance(node, dict):
+         if node.get("failure_reason") == "request_error":
+             message = node.get("error_message")
+             if isinstance(message, str) and message.strip():
+                 errors.append(message.strip())
+         for value in node.values():
+             errors.extend(_collect_request_errors(value))
+     elif isinstance(node, list):
+         for value in node:
+             errors.extend(_collect_request_errors(value))
+     return errors
+
+
+ def classify_status(
+     *,
+     returncode: int,
+     payload: dict[str, Any] | None,
+     request_errors: list[str],
+ ) -> str:
+     if returncode != 0 or payload is None:
+         return "failed"
+     if not request_errors:
+         return "ok"
+     joined = "\n".join(request_errors).lower()
+     if "model_not_supported" in joined or "not a chat model" in joined:
+         return "invalid_model"
+     if "depleted your monthly included credits" in joined:
+         return "provider_credit_error"
+     return "request_error"
+
+
+ def extract_scores(payload: dict[str, Any]) -> tuple[dict[str, float], int, int]:
+     results = payload.get("results") or []
+     scores: dict[str, float] = {}
+     fallback_count = 0
+     parse_failure_count = 0
+     for result in results:
+         task_id = result.get("task_id")
+         score = result.get("score")
+         if isinstance(task_id, str) and isinstance(score, (int, float)):
+             scores[task_id] = float(score)
+         if result.get("used_fallback") is True:
+             fallback_count += 1
+         if result.get("decision_parsed") is False:
+             parse_failure_count += 1
+     return scores, fallback_count, parse_failure_count
+
+
+ def run_inference(
+     handle: ServerHandle,
+     *,
+     model_name: str,
+     hf_token: str,
+     api_base_url: str,
+     batch_name: str,
+     logs_dir: Path,
+     verbose: bool,
+ ) -> dict[str, Any]:
+     model_slug = slugify(model_name)
+     stdout_path = logs_dir / f"invoiceops_env__{model_slug}.stdout.log"
+     stderr_path = logs_dir / f"invoiceops_env__{model_slug}.stderr.log"
+
+     env = os.environ.copy()
+     env.update(
+         {
+             "HF_TOKEN": hf_token,
+             "API_BASE_URL": api_base_url,
+             "MODEL_NAME": model_name,
+             "ENV_URL": handle.base_url,
+             "EVAL_RUN_NAME": batch_name,
+         }
+     )
+
+     started_at = time.time()
+     result = subprocess.run(
+         ["uv", "run", "python", "inference.py"],
+         cwd=ROOT,
+         env=env,
+         capture_output=True,
+         text=True,
+         check=False,
+     )
+     duration_s = round(time.time() - started_at, 2)
+     stdout_path.write_text(result.stdout, encoding="utf-8")
+     stderr_path.write_text(result.stderr, encoding="utf-8")
+
+     if verbose and result.stderr.strip():
+         sys.stderr.write(result.stderr)
+         if not result.stderr.endswith("\n"):
+             sys.stderr.write("\n")
+
+     output_path = parse_output_path(result.stderr)
+     payload: dict[str, Any] | None = None
+     if output_path is not None and output_path.exists():
+         payload = json.loads(output_path.read_text(encoding="utf-8"))
+
+     scores: dict[str, float] = {}
+     fallback_count = 0
+     parse_failure_count = 0
+     mean_score = None
+     request_errors: list[str] = []
+     if payload is not None:
+         if isinstance(payload.get("mean_score"), (int, float)):
+             mean_score = float(payload["mean_score"])
+         elif isinstance(payload.get("raw_mean_score"), (int, float)):
+             mean_score = float(payload["raw_mean_score"])
+         scores, fallback_count, parse_failure_count = extract_scores(payload)
+         request_errors = _collect_request_errors(payload)
+
+     status = classify_status(
+         returncode=result.returncode,
+         payload=payload,
+         request_errors=request_errors,
+     )
+     return {
+         "model": model_name,
+         "status": status,
+         "returncode": result.returncode,
+         "duration_s": duration_s,
+         "mean_score": mean_score,
+         "fallback_count": fallback_count,
+         "parse_failure_count": parse_failure_count,
+         "request_error_count": len(request_errors),
+         "first_request_error": request_errors[0] if request_errors else "",
+         "output_json": str(output_path) if output_path is not None else "",
+         "stdout_log": str(stdout_path),
+         "stderr_log": str(stderr_path),
+         **{task_id: scores.get(task_id) for task_id in TASK_COLUMNS},
+     }
+
+
+ def print_summary(rows: list[dict[str, Any]]) -> None:
+     headers = [
+         "model",
+         "mean",
+         *TASK_COLUMNS,
+         "fallbacks",
+         "parse_fail",
+         "req_err",
+         "status",
+         "sec",
+     ]
+     widths = {header: len(header) for header in headers}
+     rendered_rows: list[dict[str, str]] = []
+
+     for row in rows:
+         rendered = {
+             "model": row["model"],
+             "mean": "-" if row["mean_score"] is None else f"{row['mean_score']:.4f}",
+             "fallbacks": str(row["fallback_count"]),
+             "parse_fail": str(row["parse_failure_count"]),
+             "req_err": str(row["request_error_count"]),
+             "status": row["status"],
+             "sec": f"{row['duration_s']:.1f}",
+         }
+         rendered.update(
+             {
+                 task_id: "-" if row.get(task_id) is None else f"{row[task_id]:.4f}"
+                 for task_id in TASK_COLUMNS
+             }
+         )
+         rendered_rows.append(rendered)
+         for key, value in rendered.items():
+             widths[key] = max(widths[key], len(value))
+
+     print(" ".join(header.ljust(widths[header]) for header in headers))
+     print(" ".join("-" * widths[header] for header in headers))
+     for row in rendered_rows:
+         print(" ".join(row[header].ljust(widths[header]) for header in headers))
+
+
+ def write_summary_files(
+     batch_dir: Path, rows: list[dict[str, Any]]
+ ) -> tuple[Path, Path]:
+     csv_path = batch_dir / "summary.csv"
+     json_path = batch_dir / "summary.json"
+     fieldnames = [
+         "model",
+         "mean_score",
+         *TASK_COLUMNS,
+         "fallback_count",
+         "parse_failure_count",
+         "request_error_count",
+         "status",
+         "duration_s",
+         "returncode",
+         "first_request_error",
+         "output_json",
+         "stdout_log",
+         "stderr_log",
+     ]
+     with csv_path.open("w", encoding="utf-8", newline="") as handle:
+         writer = csv.DictWriter(handle, fieldnames=fieldnames if rows else ["model"])
+         writer.writeheader()
+         writer.writerows(rows)
+     json_path.write_text(json.dumps(rows, indent=2), encoding="utf-8")
+     return csv_path, json_path
+
+
+ def main() -> int:
+     args = parse_args()
+     if args.jobs < 1:
+         raise RuntimeError("--jobs must be at least 1.")
+     models = load_models(args)
+     api_base_url = os.getenv("API_BASE_URL", DEFAULT_API_BASE_URL)
+     hf_token = os.getenv("HF_TOKEN")
+
+     if not hf_token and not args.dry_run:
+         raise RuntimeError("Set HF_TOKEN in the shell before running ./batch.")
+
+     if args.dry_run:
+         print("Dry run only.")
+         print(f"API_BASE_URL={api_base_url}")
+         print(f"models={','.join(models)}")
+         print(f"jobs={args.jobs}")
+         print(f"invoiceops_env -> http://localhost:{args.port}")
+         return 0
+
+     batch_dir = make_batch_dir()
+     batch_name = batch_dir.name
+     logs_dir = batch_dir / "logs"
+     rows: list[dict[str, Any]] = []
+     handle: ServerHandle | None = None
+
+     try:
+         if args.sync:
+             subprocess.run(["uv", "sync", "--extra", "dev"], cwd=ROOT, check=True)
+
+         handle = start_server(
+             args.port,
+             batch_dir,
+             reuse_running_server=args.reuse_running_server,
+         )
+         if args.validate:
+             validate_server(handle)
+
+         print(f"[batch] batch={batch_name}", file=sys.stderr)
+         print(f"[batch] api_base_url={api_base_url}", file=sys.stderr)
+
+         if args.jobs == 1:
+             for model_name in models:
+                 print(
+                     f"[batch] running invoiceops_env :: {model_name}", file=sys.stderr
+                 )
+                 row = run_inference(
+                     handle,
+                     model_name=model_name,
+                     hf_token=hf_token,
+                     api_base_url=api_base_url,
+                     batch_name=batch_name,
+                     logs_dir=logs_dir,
+                     verbose=args.verbose,
+                 )
+                 rows.append(row)
+                 mean_display = (
+                     "-" if row["mean_score"] is None else f"{row['mean_score']:.4f}"
+                 )
+                 print(
+                     f"[batch] result invoiceops_env :: {model_name} mean={mean_display} status={row['status']}",
+                     file=sys.stderr,
+                 )
+         else:
+             with concurrent.futures.ThreadPoolExecutor(
+                 max_workers=args.jobs
+             ) as executor:
+                 futures = {
+                     executor.submit(
+                         run_inference,
+                         handle,
+                         model_name=model_name,
+                         hf_token=hf_token,
+                         api_base_url=api_base_url,
+                         batch_name=batch_name,
+                         logs_dir=logs_dir,
+                         verbose=args.verbose,
+                     ): model_name
+                     for model_name in models
+                 }
+                 for future in concurrent.futures.as_completed(futures):
+                     model_name = futures[future]
+                     row = future.result()
+                     rows.append(row)
+                     mean_display = (
+                         "-" if row["mean_score"] is None else f"{row['mean_score']:.4f}"
+                     )
+                     print(
+                         f"[batch] result invoiceops_env :: {model_name} mean={mean_display} status={row['status']}",
+                         file=sys.stderr,
+                     )
+
+         order = {model: index for index, model in enumerate(models)}
+         rows.sort(key=lambda row: order[row["model"]])
+
+         csv_path, json_path = write_summary_files(batch_dir, rows)
+         print_summary(rows)
+         print(f"\nsummary_csv={csv_path}")
+         print(f"summary_json={json_path}")
+         print(f"logs_dir={logs_dir}")
+         return 0
+     finally:
+         if handle is not None:
+             stop_server(handle)
+
+
+ if __name__ == "__main__":
+     raise SystemExit(main())
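To make the batch script's log naming and status classification concrete, the standalone sketch below copies `slugify`, `parse_output_path`, and `classify_status` from the `batch` diff and runs them on sample inputs. The sample stderr text and the error messages are invented for illustration. Only the `wrote=` prefix comes from `parse_output_path` itself; nothing here confirms what `inference.py` actually prints.

```python
from __future__ import annotations

import re
from pathlib import Path
from typing import Any


def slugify(value: str) -> str:
    # Collapse every run of characters outside [A-Za-z0-9._-] into one "-".
    slug = re.sub(r"[^A-Za-z0-9._-]+", "-", value.strip())
    slug = slug.strip("-._")
    return slug or "value"


def parse_output_path(stderr_text: str) -> Path | None:
    # Scan from the end so the last "wrote=" marker wins.
    for line in reversed(stderr_text.splitlines()):
        if line.startswith("wrote="):
            return Path(line.split("=", 1)[1].strip())
    return None


def classify_status(
    *,
    returncode: int,
    payload: dict[str, Any] | None,
    request_errors: list[str],
) -> str:
    if returncode != 0 or payload is None:
        return "failed"
    if not request_errors:
        return "ok"
    joined = "\n".join(request_errors).lower()
    if "model_not_supported" in joined or "not a chat model" in joined:
        return "invalid_model"
    if "depleted your monthly included credits" in joined:
        return "provider_credit_error"
    return "request_error"


# Model ids contain "/", which is not safe in a log file name.
slug = slugify("zai-org/GLM-5.1")

# Hypothetical stderr: two markers, so the later one is returned.
sample_stderr = "step=1\nwrote=outputs/evals/old.json\nwrote=outputs/evals/new.json\n"
output_path = parse_output_path(sample_stderr)

# A clean exit with a made-up provider error is classified by its message.
status = classify_status(
    returncode=0,
    payload={"results": []},
    request_errors=["Error: MODEL_NOT_SUPPORTED by provider"],
)

if __name__ == "__main__":
    print(slug, output_path, status)
```

Note the ordering in `classify_status`: a non-zero exit code or a missing payload yields `failed` before any error message is inspected, so the more specific labels only appear for runs that exited cleanly and wrote output.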
client.py ADDED
@@ -0,0 +1,39 @@
+ """Client for the InvoiceOps environment."""
+
+ from __future__ import annotations
+
+ from typing import Any
+
+ from openenv.core import EnvClient
+ from openenv.core.client_types import StepResult
+
+ from invoiceops_env.models import (
+     InvoiceOpsAction,
+     InvoiceOpsObservation,
+     InvoiceOpsState,
+ )
+
+
+ class InvoiceOpsEnv(EnvClient[InvoiceOpsAction, InvoiceOpsObservation, InvoiceOpsState]):
+     """WebSocket client for persistent InvoiceOps sessions."""
+
+     def _step_payload(self, action: InvoiceOpsAction) -> dict[str, Any]:
+         return action.model_dump(exclude_none=True)
+
+     def _parse_result(self, payload: dict[str, Any]) -> StepResult[InvoiceOpsObservation]:
+         obs_data = payload.get("observation", {})
+         observation = InvoiceOpsObservation.model_validate(
+             {
+                 **obs_data,
+                 "done": payload.get("done", False),
+                 "reward": payload.get("reward"),
+             }
+         )
+         return StepResult(
+             observation=observation,
+             reward=payload.get("reward"),
+             done=payload.get("done", False),
+         )
+
+     def _parse_state(self, payload: dict[str, Any]) -> InvoiceOpsState:
+         return InvoiceOpsState.model_validate(payload)
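The merge in `_parse_result` is easy to miss: the envelope-level `done` and `reward` are written over whatever the nested `observation` dict carries, so the observation and the step result always agree. The dependency-free sketch below shows that behaviour with a plain dataclass standing in for the pydantic `InvoiceOpsObservation` model. The `Observation` class, its `case_id` field, and the sample payload are hypothetical, and the payload shape is inferred from the method body rather than from OpenEnv documentation.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class Observation:
    """Hypothetical stand-in for InvoiceOpsObservation (not the real model)."""

    case_id: str = ""
    done: bool = False
    reward: float | None = None


def parse_result(payload: dict[str, Any]) -> tuple[Observation, float | None, bool]:
    obs_data = payload.get("observation", {})
    # Later keys win in a dict literal, so the envelope values override
    # any "done" or "reward" already present inside the observation.
    merged = {
        **obs_data,
        "done": payload.get("done", False),
        "reward": payload.get("reward"),
    }
    return Observation(**merged), payload.get("reward"), payload.get("done", False)


# Hypothetical server reply: the nested observation says done=False,
# but the envelope says done=True, and the envelope wins.
observation, reward, done = parse_result(
    {
        "observation": {"case_id": "CASE-EASY-001", "done": False},
        "reward": 0.25,
        "done": True,
    }
)

if __name__ == "__main__":
    print(observation, reward, done)
```

A payload with no `observation`, `done`, or `reward` keys falls back to an empty observation, `done=False`, and `reward=None`, mirroring the defaults used in the client.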
data/scenarios/easy.json ADDED
@@ -0,0 +1,167 @@
+ {
+   "scenario_id": "easy",
+   "task_id": "easy",
+   "case_id": "CASE-EASY-001",
+   "title": "Non-PO invoice with unstarted approval workflow",
+   "description": "Review a non-PO services invoice with an open approval control. Determine the correct requester-tier routing and payment recommendation from the workflow state and policy.",
+   "step_limit": 10,
+   "queue_card": {
+     "case_id": "CASE-EASY-001",
+     "vendor_name": "Orion Advisory Partners",
+     "vendor_id": "V-741",
+     "invoice_number": "OA-4401",
+     "invoice_date": "2026-03-20",
+     "invoice_total": 8500.0,
+     "currency": "USD",
+     "po_number": null,
+     "risk_flags": ["non_po_invoice", "missing_approval"],
+     "summary": "Non-PO services invoice with an open approval control."
+   },
+   "artifacts": [
+     {
+       "artifact_id": "art-invoice",
+       "artifact_type": "invoice_packet",
+       "title": "Invoice packet OA-4401",
+       "summary": "Single-line non-PO advisory invoice.",
+       "fields": [
+         {"label": "Vendor", "value": "Orion Advisory Partners"},
+         {"label": "Invoice number", "value": "OA-4401"},
+         {"label": "Invoice date", "value": "2026-03-20"},
+         {"label": "Invoice type", "value": "Non-PO services"},
+         {"label": "Cost center", "value": "EXEC-110"},
+         {"label": "Gross total", "value": "8500.00 USD"}
+       ],
+       "line_items": [
+         {
+           "line_id": "L1",
+           "description": "Q1 strategic advisory engagement package",
+           "quantity": 1.0,
+           "unit_price": 8500.0,
+           "amount": 8500.0,
+           "status": "invoiced",
+           "notes": "Cost center EXEC-110"
+         }
+       ],
+       "events": [
+         {
+           "event_id": "evt-received",
+           "event_type": "invoice_received",
+           "event_date": "2026-03-21",
+           "description": "Invoice packet received through AP inbox",
+           "quantity": null,
+           "amount": 8500.0,
+           "status": "queued"
+         }
+       ],
+       "related_refs": ["art-approval", "art-policy", "EX-NONPO-APPROVAL"]
+     },
+     {
+       "artifact_id": "art-approval",
+       "artifact_type": "approval_artifact",
+       "title": "Approval trail for OA-4401",
+       "summary": "Approval workflow has not been started.",
+       "fields": [
+         {"label": "Workflow status", "value": "Not initiated"},
+         {"label": "Requester", "value": "Jordan Kim"},
+         {"label": "Requester authority", "value": "Up to 10000.00 USD"},
+         {"label": "Submitted for approval", "value": "No"}
+       ],
+       "line_items": [],
+       "events": [],
+       "related_refs": ["art-policy", "EX-NONPO-APPROVAL"]
+     },
+     {
+       "artifact_id": "art-vendor",
+       "artifact_type": "vendor_master",
+       "title": "Vendor master: Orion Advisory Partners",
+       "summary": "Active professional-services vendor.",
+       "fields": [
+         {"label": "Vendor ID", "value": "V-741"},
+         {"label": "Payment terms", "value": "Net 30"},
+         {"label": "Vendor status", "value": "Active"},
+         {"label": "Blanket PO authorization", "value": "None"}
+       ],
+       "line_items": [],
+       "events": [],
+       "related_refs": ["art-invoice"]
+     },
+     {
+       "artifact_id": "art-policy",
+       "artifact_type": "policy_card",
+       "title": "AP policy card",
+       "summary": "Non-PO authorization and routing thresholds.",
+       "fields": [
+         {"label": "Non-PO authorization rule", "value": "Non-PO invoices require completed authorization before any payment release."},
+         {"label": "Requester authority limit", "value": "Business requester may authorize non-PO spend up to 10000.00 USD."},
+         {"label": "AP Manager authority", "value": "Amounts above requester authority require AP Manager authorization before release."},
+         {"label": "Unstarted workflow handling", "value": "If no approval workflow exists, place the invoice on hold and route it to the requester to initiate approval when the amount is within requester authority."}
+       ],
+       "line_items": [],
+       "events": [],
+       "related_refs": ["EX-NONPO-APPROVAL"]
+     }
+   ],
+   "exceptions": [
+     {
+       "exception_id": "EX-NONPO-APPROVAL",
+       "exception_type": "non_po_missing_approval",
108
+ "severity": "high",
109
+ "headline": "Authorization control is open for this non-PO invoice",
110
+ "impacted_line_ids": ["L1"],
111
+ "short_description": "The invoice has not completed required authorization.",
112
+ "fields": [
113
+ {"label": "Workflow status", "value": "Not initiated"},
114
+ {"label": "Invoice total", "value": "8500.00 USD"},
115
+ {"label": "Authorization status", "value": "Incomplete"}
116
+ ],
117
+ "reviewer_guidance": "Review the approval trail and policy before deciding."
118
+ }
119
+ ],
120
+ "duplicate_candidates": [],
121
+ "hidden_truth": {
122
+ "line_expectations": {
123
+ "L1": {
124
+ "amount": 8500.0,
125
+ "score_map": {
126
+ "hold": 1.0,
127
+ "escalate": 0.65,
128
+ "reject": 0.15,
129
+ "approve": 0.0
130
+ },
131
+ "accepted_reason_sets": [
132
+ ["non_po_approval_missing"]
133
+ ],
134
+ "accepted_routes": ["requester"],
135
+ "gating_refs": [],
136
+ "decisive_refs": ["art-invoice", "art-approval", "art-policy", "EX-NONPO-APPROVAL"],
137
+ "unsafe_approve": true
138
+ }
139
+ },
140
+ "header_expectation": {
141
+ "score_map": {
142
+ "hold_full_invoice": 1.0,
143
+ "escalate_case": 0.75,
144
+ "reject_full_invoice": 0.15,
145
+ "release_approved_lines": 0.0
146
+ },
147
+ "accepted_reason_sets": [
148
+ ["non_po_approval_missing"]
149
+ ],
150
+ "accepted_routes": ["requester"],
151
+ "gating_refs": [],
152
+ "decisive_refs": ["art-approval", "art-policy", "EX-NONPO-APPROVAL"],
153
+ "unsafe_recommendations": ["release_approved_lines"],
154
+ "overconservative_recommendations": []
155
+ },
156
+ "note_expectations": [
157
+ {
158
+ "issue_id": "non_po_workflow_missing",
159
+ "accepted_reason_sets": [
160
+ ["non_po_approval_missing"]
161
+ ],
162
+ "decisive_refs": ["art-approval", "art-policy"]
163
+ }
164
+ ],
165
+ "efficient_step_target": 8
166
+ }
167
+ }
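The easy scenario's policy card and `hidden_truth` block can be read as a small decision rule plus a score lookup. The sketch below is one possible reading, not the environment's actual grader: the function names, the `ap_manager` route label, and the choice to hold (rather than escalate) above the requester limit are assumptions made for illustration. Only the 10000.00 USD limit, the `requester` route, and the score values come from the scenario file.

```python
from decimal import Decimal


def route_unstarted_non_po(invoice_total: Decimal, requester_limit: Decimal) -> dict:
    """Apply the policy card to a non-PO invoice with no approval workflow started."""
    if invoice_total <= requester_limit:
        # Policy card: hold and route to the requester to initiate approval.
        return {"decision": "hold", "route": "requester"}
    # Assumption: above requester authority the invoice is still held, and
    # "ap_manager" is a hypothetical route label not present in the scenario.
    return {"decision": "hold", "route": "ap_manager"}


def score_decision(score_map: dict, decision: str) -> float:
    """Look up partial credit for a decision; unknown decisions score 0.0 (assumed)."""
    return score_map.get(decision, 0.0)


# Values copied from CASE-EASY-001.
line_score_map = {"hold": 1.0, "escalate": 0.65, "reject": 0.15, "approve": 0.0}
result = route_unstarted_non_po(Decimal("8500.00"), Decimal("10000.00"))
line_score = score_decision(line_score_map, result["decision"])
```

With the 8500.00 USD invoice this yields a hold routed to the requester, which is the combination the scenario scores at 1.0 for both the line and the header.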
data/scenarios/hard.json ADDED
@@ -0,0 +1,537 @@
1
+ {
2
+ "scenario_id": "hard",
3
+ "task_id": "hard",
4
+ "case_id": "CASE-HARD-001",
5
+ "title": "Project invoice with mixed support, duplicate review, and tax block",
6
+ "description": "Review a project equipment invoice with duplicate, receiving, and tax controls. Resolve each line and the case header from the evidence you gather.",
7
+ "step_limit": 24,
8
+ "queue_card": {
9
+ "case_id": "CASE-HARD-001",
10
+ "vendor_name": "Northshore Controls",
11
+ "vendor_id": "V-229",
12
+ "invoice_number": "NC-8831/2",
13
+ "invoice_date": "2026-03-29",
14
+ "invoice_total": 12936.3,
15
+ "currency": "USD",
16
+ "po_number": "PO-44019",
17
+ "risk_flags": [
18
+ "po_invoice",
19
+ "possible_duplicate",
20
+ "receipt_variance",
21
+ "partial_receipt",
22
+ "tax_variance"
23
+ ],
24
+ "summary": "Project equipment invoice with duplicate, receipt, and tax controls open."
25
+ },
26
+ "artifacts": [
27
+ {
28
+ "artifact_id": "art-invoice",
29
+ "artifact_type": "invoice_packet",
30
+ "title": "Invoice packet NC-8831/2",
31
+ "summary": "Three-line project equipment invoice with billed sales tax and a slash-suffixed invoice number.",
32
+ "fields": [
33
+ {"label": "Vendor", "value": "Northshore Controls"},
34
+ {"label": "Invoice number", "value": "NC-8831/2"},
35
+ {"label": "Invoice date", "value": "2026-03-29"},
36
+ {"label": "PO number", "value": "PO-44019"},
37
+ {"label": "Project code", "value": "GREEN-440"},
38
+ {"label": "Subtotal", "value": "12090.00 USD"},
39
+ {"label": "Tax", "value": "846.30 USD"},
40
+ {"label": "Gross total", "value": "12936.30 USD"}
41
+ ],
42
+ "line_items": [
43
+ {
44
+ "line_id": "L1",
45
+ "description": "Calibration harness assemblies",
46
+ "quantity": 42.0,
47
+ "unit_price": 95.0,
48
+ "amount": 3990.0,
49
+ "status": "invoiced",
50
+ "notes": "PO line 10"
51
+ },
52
+ {
53
+ "line_id": "L2",
54
+ "description": "Field junction boxes",
55
+ "quantity": 18.0,
56
+ "unit_price": 190.0,
57
+ "amount": 3420.0,
58
+ "status": "invoiced",
59
+ "notes": "PO line 20"
60
+ },
61
+ {
62
+ "line_id": "L3",
63
+ "description": "Sensor mounting rails",
64
+ "quantity": 36.0,
65
+ "unit_price": 130.0,
66
+ "amount": 4680.0,
67
+ "status": "invoiced",
68
+ "notes": "PO line 30"
69
+ }
70
+ ],
71
+ "events": [
72
+ {
73
+ "event_id": "evt-received",
74
+ "event_type": "invoice_received",
75
+ "event_date": "2026-03-30",
76
+ "description": "AP queue received invoice packet through EDI",
77
+ "quantity": null,
78
+ "amount": 12936.3,
79
+ "status": "queued"
80
+ }
81
+ ],
82
+ "related_refs": [
83
+ "art-po",
84
+ "art-receipts",
85
+ "art-history",
86
+ "art-vendor",
87
+ "art-policy",
88
+ "EX-POSSIBLE-DUP",
89
+ "EX-RECEIPT-L1",
90
+ "EX-RECEIPT-L2",
91
+ "EX-TAX-001"
92
+ ]
93
+ },
94
+ {
95
+ "artifact_id": "art-po",
96
+ "artifact_type": "purchase_order",
97
+ "title": "PO-44019",
98
+ "summary": "Project equipment order for GREEN-440.",
99
+ "fields": [
100
+ {"label": "Buyer", "value": "Project Procurement"},
101
+ {"label": "PO number", "value": "PO-44019"},
102
+ {"label": "Supplier", "value": "Northshore Controls"},
103
+ {"label": "Project code", "value": "GREEN-440"},
104
+ {"label": "Tax handling", "value": "GREEN-440 exemption certificate applies; consult AP tax rules when billed tax appears."}
105
+ ],
106
+ "line_items": [
107
+ {
108
+ "line_id": "L1",
109
+ "description": "Calibration harness assemblies",
110
+ "quantity": 42.0,
111
+ "unit_price": 95.0,
112
+ "amount": 3990.0,
113
+ "status": "ordered",
114
+ "notes": "PO line 10"
115
+ },
116
+ {
117
+ "line_id": "L2",
118
+ "description": "Field junction boxes",
119
+ "quantity": 18.0,
120
+ "unit_price": 190.0,
121
+ "amount": 3420.0,
122
+ "status": "ordered",
123
+ "notes": "PO line 20"
124
+ },
125
+ {
126
+ "line_id": "L3",
127
+ "description": "Sensor mounting rails",
128
+ "quantity": 36.0,
129
+ "unit_price": 130.0,
130
+ "amount": 4680.0,
131
+ "status": "ordered",
132
+ "notes": "PO line 30"
133
+ }
134
+ ],
135
+ "events": [],
136
+ "related_refs": ["art-invoice", "art-receipts", "art-vendor"]
137
+ },
138
+ {
139
+ "artifact_id": "art-receipts",
140
+ "artifact_type": "receipt_log",
141
+ "title": "Receipt log for PO-44019",
142
+ "summary": "One line is short, one line is under later receiving review, and one line is fully received.",
143
+ "fields": [
144
+ {"label": "Receiving site", "value": "GREEN-440 project warehouse"},
145
+ {"label": "Last receipt update", "value": "2026-03-30"},
146
+ {"label": "Open receipt issue", "value": "PO line 20 needs receiving follow-up before support is final"}
147
+ ],
148
+ "line_items": [
149
+ {
150
+ "line_id": "L1",
151
+ "description": "Calibration harness assemblies",
152
+ "quantity": 41.0,
153
+ "unit_price": null,
154
+ "amount": null,
155
+ "status": "short_received",
156
+ "notes": "41 of 42 units posted on 2026-03-27"
157
+ },
158
+ {
159
+ "line_id": "L2",
160
+ "description": "Field junction boxes",
161
+ "quantity": 18.0,
162
+ "unit_price": null,
163
+ "amount": null,
164
+ "status": "received_under_review",
165
+ "notes": "Initial receipt posted on 2026-03-26; see receiving history for current support"
166
+ },
167
+ {
168
+ "line_id": "L3",
169
+ "description": "Sensor mounting rails",
170
+ "quantity": 36.0,
171
+ "unit_price": null,
172
+ "amount": null,
173
+ "status": "fully_received",
174
+ "notes": "Received in full on 2026-03-28"
175
+ }
176
+ ],
177
+ "events": [
178
+ {
179
+ "event_id": "evt-rcv-l1",
180
+ "event_type": "goods_receipt",
181
+ "event_date": "2026-03-27",
182
+ "description": "Received 41 calibration harness assemblies",
183
+ "quantity": 41.0,
184
+ "amount": null,
185
+ "status": "posted"
186
+ },
187
+ {
188
+ "event_id": "evt-rcv-l2-initial",
189
+ "event_type": "goods_receipt",
190
+ "event_date": "2026-03-26",
191
+ "description": "Received 18 field junction boxes",
192
+ "quantity": 18.0,
193
+ "amount": null,
194
+ "status": "initially_posted"
195
+ },
196
+ {
197
+ "event_id": "evt-rcv-l2-review",
198
+ "event_type": "receiving_review",
199
+ "event_date": "2026-03-30",
200
+ "description": "Receiving posted a follow-up control update for 18 field junction boxes after damage inspection",
201
+ "quantity": null,
202
+ "amount": null,
203
+ "status": "review_open"
204
+ },
205
+ {
206
+ "event_id": "evt-rcv-l3",
207
+ "event_type": "goods_receipt",
208
+ "event_date": "2026-03-28",
209
+ "description": "Received 36 sensor mounting rails",
210
+ "quantity": 36.0,
211
+ "amount": null,
212
+ "status": "posted"
213
+ }
214
+ ],
215
+ "related_refs": ["art-po", "art-history"]
216
+ },
217
+ {
218
+ "artifact_id": "art-history",
219
+ "artifact_type": "invoice_history",
220
+ "title": "Receiving and invoice history for NC-8831/2",
221
+ "summary": "History shows a reversed prior AP import and an open later receiving hold on PO line 20.",
222
+ "fields": [
223
+ {"label": "Prior AP duplicate status", "value": "Same normalized invoice number was reversed before payment after EDI retry"},
224
+ {"label": "Latest receiving control", "value": "Damage hold case RCV-1187 is open on PO line 20 as of 2026-03-30"},
225
+ {"label": "Replacement ETA", "value": "Pending vendor reship confirmation"}
226
+ ],
227
+ "line_items": [],
228
+ "events": [
229
+ {
230
+ "event_id": "evt-dup-reversal",
231
+ "event_type": "invoice_reversal",
232
+ "event_date": "2026-03-18",
233
+ "description": "Prior AP record NC88312 reversed before payment after duplicate EDI import",
234
+ "quantity": null,
235
+ "amount": 12936.3,
236
+ "status": "closed"
237
+ },
238
+ {
239
+ "event_id": "evt-receiving-hold",
240
+ "event_type": "receiving_hold",
241
+ "event_date": "2026-03-30",
242
+ "description": "Receiving opened a damage hold on PO line 20 after inspection; replacement disposition pending",
243
+ "quantity": 18.0,
244
+ "amount": 3420.0,
245
+ "status": "open"
246
+ }
247
+ ],
248
+ "related_refs": ["EX-POSSIBLE-DUP", "EX-RECEIPT-L2", "art-receipts"]
249
+ },
250
+ {
251
+ "artifact_id": "art-vendor",
252
+ "artifact_type": "vendor_master",
253
+ "title": "Vendor master: Northshore Controls",
254
+ "summary": "Active vendor with an active GREEN-440 project exemption profile.",
255
+ "fields": [
256
+ {"label": "Vendor ID", "value": "V-229"},
257
+ {"label": "Vendor status", "value": "Active"},
258
+ {"label": "Project exemption profile", "value": "GREEN-440 exemption certificate EX-118 active through 2026-12-31"},
259
+ {"label": "Tax note", "value": "Exemption certificate is on file for GREEN-440; consult AP tax handling rules when billed tax appears"}
260
+ ],
261
+ "line_items": [],
262
+ "events": [],
263
+ "related_refs": ["art-invoice", "art-po", "EX-TAX-001"]
264
+ },
265
+ {
266
+ "artifact_id": "art-policy",
267
+ "artifact_type": "policy_card",
268
+ "title": "AP policy card",
269
+ "summary": "Duplicate, receipt, chronology, and tax handling rules for project invoices.",
270
+ "fields": [
271
+ {"label": "Duplicate review rule", "value": "When possible_duplicate is flagged on format variants such as slash or punctuation changes, review a normalized invoice number match before relying on heuristic amount/date similarity."},
272
+ {"label": "Reversed duplicate rule", "value": "A prior AP record reversed or voided before payment is not a payment block."},
273
+ {"label": "De minimis receipt shortage", "value": "A line may release when unsupported amount is 150.00 USD or less and no later receiving reversal or hold remains."},
274
+ {"label": "Receipt chronology", "value": "The latest receiving control event supersedes earlier posted receipt support. If current support is still under review, route the line to Receiving instead of approving it."},
275
+ {"label": "Tax dispute rule", "value": "If billed tax conflicts with active exempt project status, hold the full invoice and route the case to tax even when supported goods lines are approved."}
276
+ ],
277
+ "line_items": [],
278
+ "events": [],
279
+ "related_refs": ["EX-POSSIBLE-DUP", "EX-RECEIPT-L1", "EX-RECEIPT-L2", "EX-TAX-001"]
280
+ }
281
+ ],
282
+ "exceptions": [
283
+ {
284
+ "exception_id": "EX-POSSIBLE-DUP",
285
+ "exception_type": "possible_duplicate",
286
+ "severity": "high",
287
+ "headline": "Duplicate control is open for this invoice",
288
+ "impacted_line_ids": ["L1", "L2", "L3"],
289
+ "short_description": "A prior AP record may overlap with this invoice.",
290
+ "fields": [
291
+ {"label": "Invoice number", "value": "NC-8831/2"},
292
+ {"label": "Vendor", "value": "Northshore Controls"},
293
+ {"label": "Control status", "value": "Duplicate review required before release"}
294
+ ],
295
+ "reviewer_guidance": "Run the relevant duplicate search and review candidate status before deciding."
296
+ },
297
+ {
298
+ "exception_id": "EX-RECEIPT-L1",
299
+ "exception_type": "receipt_quantity_variance",
300
+ "severity": "medium",
301
+ "headline": "Receipt support is short on L1",
302
+ "impacted_line_ids": ["L1"],
303
+ "short_description": "Received quantity on L1 is below the invoiced quantity.",
304
+ "fields": [
305
+ {"label": "Invoice quantity", "value": "42"},
306
+ {"label": "Received quantity", "value": "41"},
307
+ {"label": "Short quantity", "value": "1"}
308
+ ],
309
+ "reviewer_guidance": "Review the invoice unit rate, receipt details, and shortage rule before deciding."
310
+ },
311
+ {
312
+ "exception_id": "EX-RECEIPT-L2",
313
+ "exception_type": "receipt_quantity_variance",
314
+ "severity": "high",
315
+ "headline": "Receipt support changed after initial posting on L2",
316
+ "impacted_line_ids": ["L2"],
317
+ "short_description": "The initial receipt support on L2 may no longer be current.",
318
+ "fields": [
319
+ {"label": "Invoice quantity", "value": "18"},
320
+ {"label": "Initial posted receipt", "value": "18 units on 2026-03-26"},
321
+ {"label": "Latest control update", "value": "Receiving review posted on 2026-03-30"}
322
+ ],
323
+ "reviewer_guidance": "Review the receipt log and receiving history before deciding whether support is still current."
324
+ },
325
+ {
326
+ "exception_id": "EX-TAX-001",
327
+ "exception_type": "tax_variance",
328
+ "severity": "high",
329
+ "headline": "Tax control is open for this invoice",
330
+ "impacted_line_ids": ["L1", "L2", "L3"],
331
+ "short_description": "Billed tax may conflict with the expected project tax treatment.",
332
+ "fields": [
333
+ {"label": "Project code", "value": "GREEN-440"},
334
+ {"label": "Invoice taxable basis", "value": "12090.00 USD"},
335
+ {"label": "Billed tax", "value": "846.30 USD"},
336
+ {"label": "Jurisdiction", "value": "Washington"}
337
+ ],
338
+ "reviewer_guidance": "Review the vendor tax profile and policy before deciding what can be released."
339
+ }
340
+ ],
341
+ "duplicate_candidates": [
342
+ {
343
+ "candidate_id": "CAND-NORM-01",
344
+ "vendor_name": "Northshore Controls",
345
+ "invoice_number": "NC88312",
346
+ "invoice_date": "2026-03-18",
347
+ "gross_amount": 12936.3,
348
+ "status": "reversed on 2026-03-18 before payment after EDI retry; closed",
349
+ "match_basis": "Normalized invoice number + vendor + gross amount",
350
+ "overlap_summary": "Same normalized invoice number. Prior AP record was reversed before payment.",
351
+ "supported_match_strategies": ["normalized_invoice_no"]
352
+ },
353
+ {
354
+ "candidate_id": "CAND-AMT-02",
355
+ "vendor_name": "Northshore Controls",
356
+ "invoice_number": "NC-8807",
357
+ "invoice_date": "2026-03-29",
358
+ "gross_amount": 12936.3,
359
+ "status": "open",
360
+ "match_basis": "Vendor + gross amount + nearby invoice date",
361
+ "overlap_summary": "Same amount and nearby date, but invoice number and project context differ.",
362
+ "supported_match_strategies": ["vendor_amount_date"]
363
+ }
364
+ ],
365
+ "hidden_truth": {
366
+ "line_expectations": {
367
+ "L1": {
368
+ "amount": 3990.0,
369
+ "score_map": {
370
+ "approve": 1.0,
371
+ "hold": 0.25,
372
+ "escalate": 0.1,
373
+ "reject": 0.0
374
+ },
375
+ "accepted_reason_sets": [
376
+ ["partial_receipt_pending", "safe_to_pay"],
377
+ ["safe_to_pay"],
378
+ ["partial_receipt_pending"]
379
+ ],
380
+ "accepted_routes": [],
381
+ "gating_refs": [
382
+ "art-invoice",
383
+ "art-receipts",
384
+ "duplicate_check:normalized_invoice_no",
385
+ "CAND-NORM-01",
386
+ "EX-RECEIPT-L1",
387
+ "art-policy"
388
+ ],
389
+ "safe_gating_refs": [
390
+ "art-receipts",
391
+ "duplicate_check:normalized_invoice_no",
392
+ "CAND-NORM-01",
393
+ "EX-RECEIPT-L1"
394
+ ],
395
+ "decisive_refs": [
396
+ "art-invoice",
397
+ "art-receipts",
398
+ "EX-RECEIPT-L1",
399
+ "art-policy",
400
+ "duplicate_check:normalized_invoice_no",
401
+ "CAND-NORM-01"
402
+ ],
403
+ "unsafe_approve": false
404
+ },
405
+ "L2": {
406
+ "amount": 3420.0,
407
+ "score_map": {
408
+ "hold": 1.0,
409
+ "escalate": 0.45,
410
+ "reject": 0.1,
411
+ "approve": 0.0
412
+ },
413
+ "accepted_reason_sets": [
414
+ ["receipt_not_confirmed"],
415
+ ["receipt_not_confirmed", "partial_receipt_pending"]
416
+ ],
417
+ "accepted_routes": ["receiving"],
418
+ "gating_refs": [
419
+ "art-receipts",
420
+ "art-history",
421
+ "EX-RECEIPT-L2",
422
+ "art-policy"
423
+ ],
424
+ "safe_gating_refs": [
425
+ "art-receipts",
426
+ "EX-RECEIPT-L2"
427
+ ],
428
+ "decisive_refs": [
429
+ "art-receipts",
430
+ "art-history",
431
+ "EX-RECEIPT-L2",
432
+ "art-policy"
433
+ ],
434
+ "unsafe_approve": true
435
+ },
436
+ "L3": {
437
+ "amount": 4680.0,
438
+ "score_map": {
439
+ "approve": 1.0,
440
+ "hold": 0.1,
441
+ "escalate": 0.05,
442
+ "reject": 0.0
443
+ },
444
+ "accepted_reason_sets": [
445
+ ["matched_to_po_and_receipt", "safe_to_pay"],
446
+ ["safe_to_pay"],
447
+ ["possible_duplicate_review"]
448
+ ],
449
+ "accepted_routes": [],
450
+ "gating_refs": [
451
+ "art-invoice",
452
+ "art-receipts",
453
+ "duplicate_check:normalized_invoice_no",
454
+ "CAND-NORM-01"
455
+ ],
456
+ "safe_gating_refs": [
457
+ "art-receipts",
458
+ "duplicate_check:normalized_invoice_no",
459
+ "CAND-NORM-01"
460
+ ],
461
+ "decisive_refs": [
462
+ "art-invoice",
463
+ "art-receipts",
464
+ "duplicate_check:normalized_invoice_no",
465
+ "CAND-NORM-01"
466
+ ],
467
+ "unsafe_approve": false
468
+ }
469
+ },
470
+ "header_expectation": {
471
+ "score_map": {
472
+ "hold_full_invoice": 1.0,
473
+ "escalate_case": 0.55,
474
+ "reject_full_invoice": 0.05,
475
+ "release_approved_lines": 0.0
476
+ },
477
+ "accepted_reason_sets": [
478
+ ["tax_amount_mismatch"],
479
+ ["tax_amount_mismatch", "receipt_not_confirmed"],
480
+ ["tax_amount_mismatch", "safe_to_pay"]
481
+ ],
482
+ "accepted_routes": ["tax"],
483
+ "gating_refs": [
484
+ "art-vendor",
485
+ "art-policy",
486
+ "EX-TAX-001"
487
+ ],
488
+ "safe_gating_refs": [
489
+ "art-vendor",
490
+ "art-policy",
491
+ "EX-TAX-001"
492
+ ],
493
+ "decisive_refs": [
494
+ "art-vendor",
495
+ "art-policy",
496
+ "EX-TAX-001"
497
+ ],
498
+ "unsafe_recommendations": ["release_approved_lines"],
499
+ "overconservative_recommendations": ["escalate_case"]
500
+ },
501
+ "note_expectations": [
502
+ {
503
+ "issue_id": "duplicate_cleared",
504
+ "accepted_reason_sets": [
505
+ ["possible_duplicate_review", "safe_to_pay"],
506
+ ["possible_duplicate_review"]
507
+ ],
508
+ "decisive_refs": [
509
+ "duplicate_check:normalized_invoice_no",
510
+ "CAND-NORM-01"
511
+ ]
512
+ },
513
+ {
514
+ "issue_id": "receipt_reversal_hold",
515
+ "accepted_reason_sets": [
516
+ ["receipt_not_confirmed"]
517
+ ],
518
+ "decisive_refs": [
519
+ "art-history",
520
+ "EX-RECEIPT-L2"
521
+ ]
522
+ },
523
+ {
524
+ "issue_id": "tax_hold",
525
+ "accepted_reason_sets": [
526
+ ["tax_amount_mismatch"]
527
+ ],
528
+ "decisive_refs": [
529
+ "art-vendor",
530
+ "art-policy",
531
+ "EX-TAX-001"
532
+ ]
533
+ }
534
+ ],
535
+ "efficient_step_target": 18
536
+ }
537
+ }
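Two rules in the hard scenario's policy card reduce to short calculations: normalizing a slash-suffixed invoice number before the duplicate search, and testing a receipt shortage against the 150.00 USD de minimis threshold. The following is a minimal sketch under stated assumptions. The helper names are hypothetical, and "normalize" is assumed to mean stripping every non-alphanumeric character, which is consistent with `NC-8831/2` matching candidate `NC88312` but is not spelled out in the file.

```python
import re
from decimal import Decimal

DE_MINIMIS_LIMIT = Decimal("150.00")  # from the policy card


def normalize_invoice_number(raw: str) -> str:
    """Assumed normalization: drop punctuation and whitespace, upper-case the rest."""
    return re.sub(r"[^A-Za-z0-9]", "", raw).upper()


def unsupported_amount(invoiced_qty: Decimal, received_qty: Decimal,
                       unit_price: Decimal) -> Decimal:
    """Value of invoiced units that have no receipt support."""
    short_qty = max(invoiced_qty - received_qty, Decimal("0"))
    return short_qty * unit_price


def line_may_release(invoiced_qty: Decimal, received_qty: Decimal,
                     unit_price: Decimal, later_hold_open: bool) -> bool:
    """De minimis rule: small shortage and no later receiving reversal or hold."""
    if later_hold_open:
        return False
    return unsupported_amount(invoiced_qty, received_qty, unit_price) <= DE_MINIMIS_LIMIT


# Values copied from CASE-HARD-001.
normalized = normalize_invoice_number("NC-8831/2")
l1_short = unsupported_amount(Decimal("42"), Decimal("41"), Decimal("95.00"))
l1_ok = line_may_release(Decimal("42"), Decimal("41"), Decimal("95.00"), later_hold_open=False)
l2_ok = line_may_release(Decimal("18"), Decimal("18"), Decimal("190.00"), later_hold_open=True)
```

L1 is short by one unit at 95.00 USD, which is under the threshold, so it may release. L2 has full initial receipt support but an open damage hold, so the chronology rule keeps it from releasing. Neither result overrides the tax dispute rule, which still holds the full invoice at the header level.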
data/scenarios/medium.json ADDED
@@ -0,0 +1,313 @@
1
+ {
2
+ "scenario_id": "medium",
3
+ "task_id": "medium",
4
+ "case_id": "CASE-MEDIUM-001",
5
+ "title": "PO invoice with a duplicate control that clears only after number-based review",
6
+ "description": "Review a PO-backed goods invoice with a possible duplicate flag. The correct action depends on choosing the right duplicate search, interpreting the surfaced candidates, and then deciding whether the invoice can release.",
7
+ "step_limit": 12,
8
+ "queue_card": {
9
+ "case_id": "CASE-MEDIUM-001",
10
+ "vendor_name": "TechLink Solutions",
11
+ "vendor_id": "V-315",
12
+ "invoice_number": "TL-9205/A",
13
+ "invoice_date": "2026-03-22",
14
+ "invoice_total": 3800.0,
15
+ "currency": "USD",
16
+ "po_number": "PO-29034",
17
+ "risk_flags": ["po_invoice", "possible_duplicate"],
18
+ "summary": "PO-backed goods invoice with an open duplicate control."
19
+ },
20
+ "artifacts": [
21
+ {
22
+ "artifact_id": "art-invoice",
23
+ "artifact_type": "invoice_packet",
24
+ "title": "Invoice packet TL-9205/A",
25
+ "summary": "Two-line goods invoice with PO reference.",
26
+ "fields": [
27
+ {"label": "Vendor", "value": "TechLink Solutions"},
28
+ {"label": "Invoice number", "value": "TL-9205/A"},
29
+ {"label": "Invoice date", "value": "2026-03-22"},
30
+ {"label": "PO number", "value": "PO-29034"},
31
+ {"label": "Payment terms", "value": "Net 30"},
32
+ {"label": "Gross total", "value": "3800.00 USD"}
33
+ ],
34
+ "line_items": [
35
+ {
36
+ "line_id": "L1",
37
+ "description": "Server rack mounting components",
38
+ "quantity": 6.0,
39
+ "unit_price": 350.0,
40
+ "amount": 2100.0,
41
+ "status": "invoiced",
42
+ "notes": "PO line 10"
43
+ },
44
+ {
45
+ "line_id": "L2",
46
+ "description": "Cable management kit",
47
+ "quantity": 4.0,
48
+ "unit_price": 425.0,
49
+ "amount": 1700.0,
50
+ "status": "invoiced",
51
+ "notes": "PO line 20"
52
+ }
53
+ ],
54
+ "events": [
55
+ {
56
+ "event_id": "evt-received",
57
+ "event_type": "invoice_received",
58
+ "event_date": "2026-03-23",
59
+ "description": "Invoice packet received through EDI channel",
60
+ "quantity": null,
61
+ "amount": 3800.0,
62
+ "status": "queued"
63
+ }
64
+ ],
65
+ "related_refs": ["art-po", "art-receipts", "EX-POSSIBLE-DUP"]
66
+ },
67
+ {
68
+ "artifact_id": "art-po",
69
+ "artifact_type": "purchase_order",
70
+ "title": "PO-29034",
71
+ "summary": "Purchase order for IT infrastructure components.",
72
+ "fields": [
73
+ {"label": "Buyer", "value": "IT Procurement"},
74
+ {"label": "PO number", "value": "PO-29034"},
75
+ {"label": "Supplier", "value": "TechLink Solutions"},
76
+ {"label": "Payment terms", "value": "Net 30"}
77
+ ],
78
+ "line_items": [
79
+ {
80
+ "line_id": "L1",
81
+ "description": "Server rack mounting components",
82
+ "quantity": 6.0,
83
+ "unit_price": 350.0,
84
+ "amount": 2100.0,
85
+ "status": "ordered",
86
+ "notes": "PO line 10"
87
+ },
88
+ {
89
+ "line_id": "L2",
90
+ "description": "Cable management kit",
91
+ "quantity": 4.0,
92
+ "unit_price": 425.0,
93
+ "amount": 1700.0,
94
+ "status": "ordered",
95
+ "notes": "PO line 20"
96
+ }
97
+ ],
98
+ "events": [],
99
+ "related_refs": ["art-invoice", "art-receipts"]
100
+ },
101
+ {
102
+ "artifact_id": "art-receipts",
103
+ "artifact_type": "receipt_log",
104
+ "title": "Receipt log for PO-29034",
105
+ "summary": "Both lines are fully received.",
106
+ "fields": [
107
+ {"label": "Receiving site", "value": "Central IT warehouse"},
108
+ {"label": "Last receipt update", "value": "2026-03-21"},
109
+ {"label": "Open receipt issue", "value": "None"}
110
+ ],
111
+ "line_items": [
112
+ {
113
+ "line_id": "L1",
114
+ "description": "Server rack mounting components",
115
+ "quantity": 6.0,
116
+ "unit_price": null,
117
+ "amount": 2100.0,
118
+ "status": "fully_received",
119
+ "notes": "Received in full on 2026-03-20"
120
+ },
121
+ {
122
+ "line_id": "L2",
123
+ "description": "Cable management kit",
124
+ "quantity": 4.0,
125
+ "unit_price": null,
126
+ "amount": 1700.0,
127
+ "status": "fully_received",
128
+ "notes": "Received in full on 2026-03-21"
129
+ }
130
+ ],
131
+ "events": [
132
+ {
133
+ "event_id": "evt-rcv-l1",
134
+ "event_type": "goods_receipt",
135
+ "event_date": "2026-03-20",
136
+ "description": "Received 6 server rack mounting components",
137
+ "quantity": 6.0,
138
+ "amount": 2100.0,
139
+ "status": "posted"
140
+ },
141
+ {
142
+ "event_id": "evt-rcv-l2",
143
+ "event_type": "goods_receipt",
144
+ "event_date": "2026-03-21",
145
+ "description": "Received 4 cable management kits",
146
+ "quantity": 4.0,
147
+ "amount": 1700.0,
148
+ "status": "posted"
149
+ }
150
+ ],
151
+ "related_refs": ["art-po"]
152
+ },
153
+ {
154
+ "artifact_id": "art-vendor",
155
+ "artifact_type": "vendor_master",
156
+ "title": "Vendor master: TechLink Solutions",
157
+ "summary": "Active vendor with no payment hold.",
158
+ "fields": [
159
+ {"label": "Vendor ID", "value": "V-315"},
160
+ {"label": "Payment terms", "value": "Net 30"},
161
+ {"label": "Vendor status", "value": "Active"},
162
+ {"label": "Hold status", "value": "No vendor hold"}
163
+ ],
164
+ "line_items": [],
165
+ "events": [],
166
+ "related_refs": ["art-invoice"]
167
+ },
168
+ {
169
+ "artifact_id": "art-policy",
170
+ "artifact_type": "policy_card",
171
+ "title": "AP policy card",
172
+ "summary": "Duplicate-review precedence rules.",
173
+ "fields": [
174
+ {"label": "Duplicate review rule", "value": "When possible_duplicate is flagged, review a number-based duplicate search before relying on heuristic amount/date similarity."},
175
+ {"label": "Reversed prior record", "value": "A reversed or voided prior record with the same vendor and invoice number is not a payment block."},
176
+ {"label": "Heuristic amount/date hit", "value": "A same-amount same-date hit with a different invoice number is informational only unless other evidence shows true duplicate billing."}
177
+ ],
178
+ "line_items": [],
179
+ "events": [],
180
+ "related_refs": ["EX-POSSIBLE-DUP"]
181
+ }
182
+ ],
183
+ "exceptions": [
184
+ {
185
+ "exception_id": "EX-POSSIBLE-DUP",
186
+ "exception_type": "possible_duplicate",
187
+ "severity": "high",
188
+ "headline": "Duplicate control is open for this invoice",
189
+ "impacted_line_ids": ["L1", "L2"],
190
+ "short_description": "A prior AP record may overlap with this invoice.",
191
+ "fields": [
192
+ {"label": "Invoice number", "value": "TL-9205/A"},
193
+ {"label": "Vendor", "value": "TechLink Solutions"},
194
+ {"label": "Control status", "value": "Duplicate review required before release"}
195
+ ],
196
+ "reviewer_guidance": "Run the relevant duplicate search and review surfaced candidates before deciding."
197
+ }
198
+ ],
199
+ "duplicate_candidates": [
200
+ {
201
+ "candidate_id": "CAND-NORM-01",
202
+ "vendor_name": "TechLink Solutions",
203
+ "invoice_number": "TL9205A",
204
+ "invoice_date": "2026-03-10",
205
+ "gross_amount": 3800.0,
206
+ "status": "reversed on 2026-03-11 after import duplicate; closed",
207
+ "match_basis": "Normalized invoice number + vendor + gross amount",
208
+ "overlap_summary": "Same normalized invoice number. Prior record was reversed before payment.",
209
+ "supported_match_strategies": [
210
+ "normalized_invoice_no"
211
+ ]
212
+ },
213
+ {
214
+ "candidate_id": "CAND-AMT-02",
215
+ "vendor_name": "TechLink Solutions",
216
+ "invoice_number": "TL-9188",
217
+ "invoice_date": "2026-03-22",
218
+ "gross_amount": 3800.0,
219
+ "status": "open",
220
+ "match_basis": "Vendor + gross amount + nearby invoice date",
221
+ "overlap_summary": "Same amount and nearby date, but invoice number differs.",
222
+ "supported_match_strategies": [
223
+ "vendor_amount_date"
224
+ ]
225
+ }
226
+ ],
227
+ "hidden_truth": {
228
+ "line_expectations": {
229
+ "L1": {
230
+ "amount": 2100.0,
231
+ "score_map": {
232
+ "approve": 1.0,
233
+ "hold": 0.3,
234
+ "escalate": 0.1,
235
+ "reject": 0.0
236
+ },
237
+ "accepted_reason_sets": [
238
+ ["matched_to_po_and_receipt", "safe_to_pay"],
239
+ ["safe_to_pay", "possible_duplicate_review"],
240
+ ["safe_to_pay"]
241
+ ],
242
+ "accepted_routes": [],
243
+ "gating_refs": [
244
+ "art-po",
245
+ "art-receipts",
246
+ "EX-POSSIBLE-DUP",
247
+ "duplicate_check:normalized_invoice_no",
248
+ "CAND-NORM-01"
249
+ ],
250
+ "decisive_refs": ["art-po", "art-receipts", "EX-POSSIBLE-DUP", "duplicate_check:normalized_invoice_no", "CAND-NORM-01"],
251
+ "unsafe_approve": false
252
+ },
253
+ "L2": {
254
+ "amount": 1700.0,
255
+ "score_map": {
256
+ "approve": 1.0,
257
+ "hold": 0.3,
258
+ "escalate": 0.1,
259
+ "reject": 0.0
260
+ },
261
+ "accepted_reason_sets": [
262
+ ["matched_to_po_and_receipt", "safe_to_pay"],
263
+ ["safe_to_pay", "possible_duplicate_review"],
264
+ ["safe_to_pay"]
265
+ ],
266
+ "accepted_routes": [],
267
+ "gating_refs": [
268
+ "art-po",
269
+ "art-receipts",
270
+ "EX-POSSIBLE-DUP",
271
+ "duplicate_check:normalized_invoice_no",
272
+ "CAND-NORM-01"
273
+ ],
274
+ "decisive_refs": ["art-po", "art-receipts", "EX-POSSIBLE-DUP", "duplicate_check:normalized_invoice_no", "CAND-NORM-01"],
275
+ "unsafe_approve": false
276
+ }
277
+ },
278
+ "header_expectation": {
279
+ "score_map": {
280
+ "release_approved_lines": 1.0,
281
+ "hold_full_invoice": 0.35,
282
+ "escalate_case": 0.15,
283
+ "reject_full_invoice": 0.0
284
+ },
285
+ "accepted_reason_sets": [
286
+ ["safe_to_pay", "possible_duplicate_review"],
287
+ ["safe_to_pay"]
288
+ ],
289
+ "accepted_routes": [],
290
+ "gating_refs": [
291
+ "art-po",
292
+ "art-receipts",
293
+ "EX-POSSIBLE-DUP",
294
+ "duplicate_check:normalized_invoice_no",
295
+ "CAND-NORM-01"
296
+ ],
297
+ "decisive_refs": ["art-po", "art-receipts", "EX-POSSIBLE-DUP", "duplicate_check:normalized_invoice_no", "CAND-NORM-01"],
298
+ "unsafe_recommendations": [],
299
+ "overconservative_recommendations": ["hold_full_invoice", "escalate_case"]
300
+ },
301
+ "note_expectations": [
302
+ {
303
+ "issue_id": "duplicate_cleared",
304
+ "accepted_reason_sets": [
305
+ ["possible_duplicate_review", "safe_to_pay"],
306
+ ["possible_duplicate_review"]
307
+ ],
308
+ "decisive_refs": ["duplicate_check:normalized_invoice_no", "CAND-NORM-01"]
309
+ }
310
+ ],
311
+ "efficient_step_target": 9
312
+ }
313
+ }
data/scenarios/medium_plus.json ADDED
@@ -0,0 +1,374 @@
1
+ {
2
+ "scenario_id": "medium_plus",
3
+ "task_id": "medium_plus",
4
+ "case_id": "CASE-MEDIUMPLUS-001",
5
+ "title": "PO invoice with one receipt-blocked line after duplicate clearance",
6
+ "description": "Review a PO-backed goods invoice with a possible duplicate flag and a short-received line. The correct action depends on clearing the duplicate, judging the unsupported amount on the short line, and choosing partial release instead of a full invoice hold.",
7
+ "step_limit": 17,
8
+ "queue_card": {
9
+ "case_id": "CASE-MEDIUMPLUS-001",
10
+ "vendor_name": "Apex Facility Systems",
11
+ "vendor_id": "V-411",
12
+ "invoice_number": "AFS-7719/B",
13
+ "invoice_date": "2026-03-24",
14
+ "invoice_total": 3050.0,
15
+ "currency": "USD",
16
+ "po_number": "PO-55312",
17
+ "risk_flags": [
18
+ "po_invoice",
19
+ "possible_duplicate",
20
+ "receipt_variance",
21
+ "partial_receipt"
22
+ ],
23
+ "summary": "PO-backed goods invoice with duplicate review open and one receipt support issue."
24
+ },
25
+ "artifacts": [
26
+ {
27
+ "artifact_id": "art-invoice",
28
+ "artifact_type": "invoice_packet",
29
+ "title": "Invoice packet AFS-7719/B",
30
+ "summary": "Two-line goods invoice with one high-rate short line.",
31
+ "fields": [
32
+ {"label": "Vendor", "value": "Apex Facility Systems"},
33
+ {"label": "Invoice number", "value": "AFS-7719/B"},
34
+ {"label": "Invoice date", "value": "2026-03-24"},
35
+ {"label": "PO number", "value": "PO-55312"},
36
+ {"label": "Payment terms", "value": "Net 30"},
37
+ {"label": "Gross total", "value": "3050.00 USD"}
38
+ ],
39
+ "line_items": [
40
+ {
41
+ "line_id": "L1",
42
+ "description": "Relay control modules",
43
+ "quantity": 10.0,
44
+ "unit_price": 185.0,
45
+ "amount": 1850.0,
46
+ "status": "invoiced",
47
+ "notes": "PO line 10"
48
+ },
49
+ {
50
+ "line_id": "L2",
51
+ "description": "Backup power supply units",
52
+ "quantity": 5.0,
53
+ "unit_price": 240.0,
54
+ "amount": 1200.0,
55
+ "status": "invoiced",
56
+ "notes": "PO line 20"
57
+ }
58
+ ],
59
+ "events": [
60
+ {
61
+ "event_id": "evt-received",
62
+ "event_type": "invoice_received",
63
+ "event_date": "2026-03-25",
64
+ "description": "Invoice packet received through EDI channel",
65
+ "quantity": null,
66
+ "amount": 3050.0,
67
+ "status": "queued"
68
+ }
69
+ ],
70
+ "related_refs": [
71
+ "art-po",
72
+ "art-receipts",
73
+ "art-policy",
74
+ "EX-POSSIBLE-DUP",
75
+ "EX-RECEIPT-L2"
76
+ ]
77
+ },
78
+ {
79
+ "artifact_id": "art-po",
80
+ "artifact_type": "purchase_order",
81
+ "title": "PO-55312",
82
+ "summary": "Purchase order for facility control hardware.",
83
+ "fields": [
84
+ {"label": "Buyer", "value": "Facilities Procurement"},
85
+ {"label": "PO number", "value": "PO-55312"},
86
+ {"label": "Supplier", "value": "Apex Facility Systems"},
87
+ {"label": "Payment terms", "value": "Net 30"}
88
+ ],
89
+ "line_items": [
90
+ {
91
+ "line_id": "L1",
92
+ "description": "Relay control modules",
93
+ "quantity": 10.0,
94
+ "unit_price": 185.0,
95
+ "amount": 1850.0,
96
+ "status": "ordered",
97
+ "notes": "PO line 10"
98
+ },
99
+ {
100
+ "line_id": "L2",
101
+ "description": "Backup power supply units",
102
+ "quantity": 5.0,
103
+ "unit_price": 240.0,
104
+ "amount": 1200.0,
105
+ "status": "ordered",
106
+ "notes": "PO line 20"
107
+ }
108
+ ],
109
+ "events": [],
110
+ "related_refs": ["art-invoice", "art-receipts"]
111
+ },
112
+ {
113
+ "artifact_id": "art-receipts",
114
+ "artifact_type": "receipt_log",
115
+ "title": "Receipt log for PO-55312",
116
+ "summary": "One line is fully received and one line remains one unit short.",
117
+ "fields": [
118
+ {"label": "Receiving site", "value": "South regional warehouse"},
119
+ {"label": "Last receipt update", "value": "2026-03-23"},
120
+ {"label": "Open receipt issue", "value": "PO line 20 remains one unit short"}
121
+ ],
122
+ "line_items": [
123
+ {
124
+ "line_id": "L1",
125
+ "description": "Relay control modules",
126
+ "quantity": 10.0,
127
+ "unit_price": null,
128
+ "amount": 1850.0,
129
+ "status": "fully_received",
130
+ "notes": "Received in full on 2026-03-22"
131
+ },
132
+ {
133
+ "line_id": "L2",
134
+ "description": "Backup power supply units",
135
+ "quantity": 4.0,
136
+ "unit_price": null,
137
+ "amount": null,
138
+ "status": "short_received",
139
+ "notes": "4 of 5 units posted on 2026-03-23; remaining unit not yet received"
140
+ }
141
+ ],
142
+ "events": [
143
+ {
144
+ "event_id": "evt-rcv-l1",
145
+ "event_type": "goods_receipt",
146
+ "event_date": "2026-03-22",
147
+ "description": "Received 10 relay control modules",
148
+ "quantity": 10.0,
149
+ "amount": 1850.0,
150
+ "status": "posted"
151
+ },
152
+ {
153
+ "event_id": "evt-rcv-l2",
154
+ "event_type": "goods_receipt",
155
+ "event_date": "2026-03-23",
156
+ "description": "Received 4 backup power supply units",
157
+ "quantity": 4.0,
158
+ "amount": null,
159
+ "status": "posted"
160
+ }
161
+ ],
162
+ "related_refs": ["art-po", "art-invoice", "EX-RECEIPT-L2"]
163
+ },
164
+ {
165
+ "artifact_id": "art-policy",
166
+ "artifact_type": "policy_card",
167
+ "title": "AP policy card",
168
+ "summary": "Duplicate, de minimis receipt shortage, and partial release rules.",
169
+ "fields": [
170
+ {"label": "Duplicate review rule", "value": "When possible_duplicate is flagged, review a normalized invoice number match before relying on heuristic amount/date similarity."},
171
+ {"label": "Reversed prior record", "value": "A reversed or voided prior record with the same vendor and invoice number is not a payment block."},
172
+ {"label": "De minimis receipt shortage", "value": "A line may release only when unsupported amount is 150.00 USD or less. If unsupported amount exceeds that threshold, hold the line to Receiving."},
173
+ {"label": "Partial release rule", "value": "If only specific lines remain unsupported and no case-level blocker exists, hold the affected lines and release the approved lines instead of holding the full invoice."}
174
+ ],
175
+ "line_items": [],
176
+ "events": [],
177
+ "related_refs": ["EX-POSSIBLE-DUP", "EX-RECEIPT-L2"]
178
+ }
179
+ ],
180
+ "exceptions": [
181
+ {
182
+ "exception_id": "EX-POSSIBLE-DUP",
183
+ "exception_type": "possible_duplicate",
184
+ "severity": "high",
185
+ "headline": "Duplicate control is open for this invoice",
186
+ "impacted_line_ids": ["L1", "L2"],
187
+ "short_description": "A prior AP record may overlap with this invoice.",
188
+ "fields": [
189
+ {"label": "Invoice number", "value": "AFS-7719/B"},
190
+ {"label": "Vendor", "value": "Apex Facility Systems"},
191
+ {"label": "Control status", "value": "Duplicate review required before release"}
192
+ ],
193
+ "reviewer_guidance": "Run the relevant duplicate search and review surfaced candidates before deciding."
194
+ },
195
+ {
196
+ "exception_id": "EX-RECEIPT-L2",
197
+ "exception_type": "receipt_quantity_variance",
198
+ "severity": "high",
199
+ "headline": "Receipt support is short on L2",
200
+ "impacted_line_ids": ["L2"],
201
+ "short_description": "Received quantity on L2 is below the invoiced quantity.",
202
+ "fields": [
203
+ {"label": "Invoice quantity", "value": "5"},
204
+ {"label": "Received quantity", "value": "4"},
205
+ {"label": "Short quantity", "value": "1"}
206
+ ],
207
+ "reviewer_guidance": "Review the invoiced unit rate, receipt log, and shortage rule before deciding whether L2 can release."
208
+ }
209
+ ],
210
+ "duplicate_candidates": [
211
+ {
212
+ "candidate_id": "CAND-NORM-01",
213
+ "vendor_name": "Apex Facility Systems",
214
+ "invoice_number": "AFS7719B",
215
+ "invoice_date": "2026-03-13",
216
+ "gross_amount": 3050.0,
217
+ "status": "reversed on 2026-03-14 after import duplicate; closed",
218
+ "match_basis": "Normalized invoice number + vendor + gross amount",
219
+ "overlap_summary": "Same normalized invoice number. Prior record was reversed before payment.",
220
+ "supported_match_strategies": [
221
+ "normalized_invoice_no"
222
+ ]
223
+ },
224
+ {
225
+ "candidate_id": "CAND-AMT-02",
226
+ "vendor_name": "Apex Facility Systems",
227
+ "invoice_number": "AFS-7688",
228
+ "invoice_date": "2026-03-24",
229
+ "gross_amount": 3050.0,
230
+ "status": "open",
231
+ "match_basis": "Vendor + gross amount + nearby invoice date",
232
+ "overlap_summary": "Same amount and nearby date, but invoice number differs.",
233
+ "supported_match_strategies": [
234
+ "vendor_amount_date"
235
+ ]
236
+ }
237
+ ],
238
+ "hidden_truth": {
239
+ "line_expectations": {
240
+ "L1": {
241
+ "amount": 1850.0,
242
+ "score_map": {
243
+ "approve": 1.0,
244
+ "hold": 0.25,
245
+ "escalate": 0.1,
246
+ "reject": 0.0
247
+ },
248
+ "accepted_reason_sets": [
249
+ ["matched_to_po_and_receipt", "safe_to_pay"],
250
+ ["safe_to_pay"],
251
+ ["possible_duplicate_review", "safe_to_pay"]
252
+ ],
253
+ "accepted_routes": [],
254
+ "gating_refs": [
255
+ "art-po",
256
+ "art-receipts",
257
+ "EX-POSSIBLE-DUP",
258
+ "duplicate_check:normalized_invoice_no",
259
+ "CAND-NORM-01"
260
+ ],
261
+ "safe_gating_refs": [
262
+ "art-po",
263
+ "art-receipts",
264
+ "EX-POSSIBLE-DUP",
265
+ "duplicate_check:normalized_invoice_no",
266
+ "CAND-NORM-01"
267
+ ],
268
+ "decisive_refs": [
269
+ "art-po",
270
+ "art-receipts",
271
+ "EX-POSSIBLE-DUP",
272
+ "duplicate_check:normalized_invoice_no",
273
+ "CAND-NORM-01"
274
+ ],
275
+ "unsafe_approve": false
276
+ },
277
+ "L2": {
278
+ "amount": 1200.0,
279
+ "score_map": {
280
+ "hold": 1.0,
281
+ "escalate": 0.45,
282
+ "reject": 0.1,
283
+ "approve": 0.0
284
+ },
285
+ "accepted_reason_sets": [
286
+ ["receipt_not_confirmed"],
287
+ ["receipt_not_confirmed", "partial_receipt_pending"],
288
+ ["partial_receipt_pending"]
289
+ ],
290
+ "accepted_routes": ["receiving"],
291
+ "gating_refs": [
292
+ "art-invoice",
293
+ "art-receipts",
294
+ "EX-RECEIPT-L2",
295
+ "art-policy"
296
+ ],
297
+ "safe_gating_refs": [
298
+ "art-receipts",
299
+ "EX-RECEIPT-L2"
300
+ ],
301
+ "decisive_refs": [
302
+ "art-invoice",
303
+ "art-receipts",
304
+ "EX-RECEIPT-L2",
305
+ "art-policy"
306
+ ],
307
+ "unsafe_approve": true
308
+ }
309
+ },
310
+ "header_expectation": {
311
+ "score_map": {
312
+ "release_approved_lines": 1.0,
313
+ "hold_full_invoice": 0.55,
314
+ "escalate_case": 0.35,
315
+ "reject_full_invoice": 0.0
316
+ },
317
+ "accepted_reason_sets": [
318
+ ["possible_duplicate_review", "receipt_not_confirmed", "safe_to_pay"],
319
+ ["possible_duplicate_review", "partial_receipt_pending", "safe_to_pay"],
320
+ ["receipt_not_confirmed", "safe_to_pay"]
321
+ ],
322
+ "accepted_routes": [],
323
+ "gating_refs": [
324
+ "art-policy",
325
+ "art-receipts",
326
+ "EX-RECEIPT-L2",
327
+ "duplicate_check:normalized_invoice_no",
328
+ "CAND-NORM-01"
329
+ ],
330
+ "safe_gating_refs": [
331
+ "art-receipts",
332
+ "EX-RECEIPT-L2",
333
+ "duplicate_check:normalized_invoice_no",
334
+ "CAND-NORM-01"
335
+ ],
336
+ "decisive_refs": [
337
+ "art-policy",
338
+ "art-receipts",
339
+ "EX-RECEIPT-L2",
340
+ "duplicate_check:normalized_invoice_no",
341
+ "CAND-NORM-01"
342
+ ],
343
+ "unsafe_recommendations": [],
344
+ "overconservative_recommendations": ["hold_full_invoice", "escalate_case"]
345
+ },
346
+ "note_expectations": [
347
+ {
348
+ "issue_id": "duplicate_cleared",
349
+ "accepted_reason_sets": [
350
+ ["possible_duplicate_review", "safe_to_pay"],
351
+ ["possible_duplicate_review"]
352
+ ],
353
+ "decisive_refs": [
354
+ "duplicate_check:normalized_invoice_no",
355
+ "CAND-NORM-01"
356
+ ]
357
+ },
358
+ {
359
+ "issue_id": "receipt_short_hold",
360
+ "accepted_reason_sets": [
361
+ ["receipt_not_confirmed"],
362
+ ["partial_receipt_pending"]
363
+ ],
364
+ "decisive_refs": [
365
+ "art-invoice",
366
+ "art-receipts",
367
+ "art-policy",
368
+ "EX-RECEIPT-L2"
369
+ ]
370
+ }
371
+ ],
372
+ "efficient_step_target": 13
373
+ }
374
+ }
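The hold on L2 in this scenario follows from the policy card's de minimis rule. A minimal sketch of that arithmetic is below, assuming the unsupported amount is the short quantity multiplied by the invoiced unit price; the function names are hypothetical and are not taken from the repository's grader.

```python
from decimal import Decimal

# Threshold quoted on the scenario's policy card ("150.00 USD or less").
DE_MINIMIS_THRESHOLD = Decimal("150.00")


def unsupported_amount(invoiced_qty: Decimal, received_qty: Decimal, unit_price: Decimal) -> Decimal:
    """Amount invoiced without receipt support (assumed: short quantity x unit price)."""
    short_qty = max(invoiced_qty - received_qty, Decimal("0"))
    return short_qty * unit_price


def line_decision(invoiced_qty: Decimal, received_qty: Decimal, unit_price: Decimal) -> str:
    """Return "approve" when the shortage is within the threshold, otherwise "hold"."""
    amount = unsupported_amount(invoiced_qty, received_qty, unit_price)
    return "approve" if amount <= DE_MINIMIS_THRESHOLD else "hold"


# L1: 10 invoiced, 10 received at 185.00, so nothing is unsupported.
l1 = line_decision(Decimal("10"), Decimal("10"), Decimal("185.00"))
# L2: 5 invoiced, 4 received at 240.00, so 240.00 is unsupported and exceeds 150.00.
l2 = line_decision(Decimal("5"), Decimal("4"), Decimal("240.00"))
print(l1, l2)
```

With the scenario's figures, L1 has no unsupported amount and L2 has 240.00 USD unsupported, which is why the expected line decisions are approve and hold.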
docs/rules.md ADDED
@@ -0,0 +1,560 @@
1
+ ## Round 1 — Problem Statement
2
+
3
+ ### The Task
4
+ Build a complete, real-world **OpenEnv** environment that an AI agent can learn from through the standard `step()` / `reset()` / `state()` API.
5
+
6
+ ### Key Requirements at a Glance
7
+ * **Real-world Focus:** Must simulate a real-world task (not games or toys).
8
+ * **Full Spec:** Implement the full OpenEnv spec: typed models, `step()`/`reset()`/`state()`, and `openenv.yaml`.
9
+ * **Tasks:** Minimum 3 tasks with agent graders (easy → medium → hard, scores 0.0–1.0).
10
+ * **Rewards:** Meaningful reward function with partial progress signals.
11
+ * **Baselines:** Baseline inference script with reproducible scores.
12
+ * **Deployment:** Deploy to Hugging Face Spaces + working Dockerfile.
13
+ * **Docs:** README with environment description, action/observation spaces, and setup instructions.
14
+
15
+ ---
16
+
17
+ ### Functional Requirements
18
+
19
+ #### 1. Real-world task simulation
20
+ The environment must simulate a task humans actually do.
21
+ * **Examples:** Email triage, code review, data cleaning, scheduling, customer support, content moderation.
22
+
23
+ #### 2. OpenEnv spec compliance
24
+ Implement the full OpenEnv interface:
25
+ * **Typed Models:** Observation, Action, and Reward Pydantic models.
26
+ * **Methods:** `step(action)` → returns observation, reward, done, info; `reset()` → returns initial observation; `state()` → returns current state.
27
+ * **Metadata:** `openenv.yaml` with metadata, tested via `openenv validate`.
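The interface described above can be sketched roughly as follows. This is an illustration only: `ToyEnv`, its fields, and the `submit` command are hypothetical, and the class does not derive from any `openenv-core` base class, whose actual API may differ.

```python
from typing import Any, Dict, Tuple

from pydantic import BaseModel


class Observation(BaseModel):
    text: str
    steps_taken: int = 0


class Action(BaseModel):
    command: str


class Reward(BaseModel):
    value: float


class ToyEnv:
    """Hypothetical environment exposing reset(), step(), and state()."""

    def __init__(self, step_limit: int = 3) -> None:
        self.step_limit = step_limit
        self._steps = 0
        self._done = False

    def reset(self) -> Observation:
        # Clear all episode state so every episode starts clean.
        self._steps = 0
        self._done = False
        return Observation(text="ready", steps_taken=0)

    def step(self, action: Action) -> Tuple[Observation, Reward, bool, Dict[str, Any]]:
        if self._done:
            raise RuntimeError("episode finished; call reset() first")
        self._steps += 1
        finished = action.command == "submit"
        # The episode ends on submit or when the step budget is used up.
        self._done = finished or self._steps >= self.step_limit
        reward = Reward(value=1.0 if finished else 0.0)
        obs = Observation(text=f"ran {action.command}", steps_taken=self._steps)
        return obs, reward, self._done, {"step_limit": self.step_limit}

    def state(self) -> Dict[str, Any]:
        return {"steps_taken": self._steps, "done": self._done}
```

A caller would invoke `reset()` once per episode and then call `step()` until `done` is true.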
28
+
29
+ #### 3. Minimum 3 tasks with agent graders
30
+ Each task defines a concrete objective with a programmatic grader (0.0–1.0).
31
+ * **Progression:** Easy → Medium → Hard.
32
+ * **Criteria:** Graders must have clear, deterministic success/failure criteria.
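One way to meet the determinism requirement is a pure function over fixed expectations. The sketch below assumes per-line `score_map` and `amount` fields shaped like those in this commit's scenario files; weighting each line by its amount is an assumption made for illustration, not the repository's actual grading rule.

```python
from typing import Any, Mapping


def grade_lines(decisions: Mapping[str, str], expectations: Mapping[str, Mapping[str, Any]]) -> float:
    """Amount-weighted average of per-line scores, clamped to [0.0, 1.0]."""
    total_amount = sum(exp["amount"] for exp in expectations.values())
    if total_amount <= 0:
        return 0.0
    weighted = 0.0
    for line_id, exp in expectations.items():
        decision = decisions.get(line_id)
        # A missing or unknown decision scores 0.0 for that line.
        weighted += exp["amount"] * exp["score_map"].get(decision, 0.0)
    return min(max(weighted / total_amount, 0.0), 1.0)


# Expectations copied from the medium_plus scenario's line score maps.
expectations = {
    "L1": {"amount": 1850.0, "score_map": {"approve": 1.0, "hold": 0.25, "escalate": 0.1, "reject": 0.0}},
    "L2": {"amount": 1200.0, "score_map": {"hold": 1.0, "escalate": 0.45, "reject": 0.1, "approve": 0.0}},
}

best = grade_lines({"L1": "approve", "L2": "hold"}, expectations)
approve_all = grade_lines({"L1": "approve", "L2": "approve"}, expectations)
print(best, approve_all)
```

Because the function has no randomness and no external state, the same decisions always produce the same score.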
33
+
34
+ #### 4. Meaningful reward function
35
+ * Provides signal over the full trajectory (not just binary end-of-episode).
36
+ * Rewards partial progress toward task completion.
37
+ * Penalizes clearly undesirable behavior (e.g., infinite loops, destructive actions).
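A small sketch of per-step shaping that fits these three points is shown below. The bonus and penalty values and the notion of "relevant" references are assumptions chosen for illustration.

```python
from typing import Set


def step_reward(
    action_ref: str,
    seen: Set[str],
    relevant: Set[str],
    new_bonus: float = 0.05,
    repeat_penalty: float = 0.02,
) -> float:
    """Reward first-time inspection of relevant evidence and penalize repeats."""
    if action_ref in seen:
        # Repeating an action is the loop-like behavior we want to discourage.
        return -repeat_penalty
    seen.add(action_ref)
    # Partial progress: a small bonus for each relevant item inspected once.
    return new_bonus if action_ref in relevant else 0.0


seen: Set[str] = set()
relevant = {"art-po", "art-receipts"}
trajectory = ["art-po", "art-po", "art-other", "art-receipts"]
rewards = [step_reward(ref, seen, relevant) for ref in trajectory]
print(rewards)
```

The final grader score would still decide the bulk of the episode reward; shaping terms like these only give the agent a signal before the episode ends.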
38
+
39
+ #### 5. Baseline inference script
40
+ * Uses the OpenAI API client to run a model against the environment.
41
+ * Produces a reproducible baseline score on all public tasks.
42
+
43
+ ---
44
+
45
+ ### Non-Functional Requirements
46
+ * **Hugging Face Spaces:** Environment must run as a containerized HF Space tagged with `openenv`.
47
+ * **Containerization:** Must include a working `Dockerfile` that starts cleanly.
48
+ * **Documentation:** README must include environment description, action/observation space definitions, task descriptions, and setup instructions.
49
+
50
+ ---
51
+
52
+ ### Scoring Rubric
53
+
54
+ | Parameter | Weight | Description |
55
+ | :--------------------------------- | :----- | :------------------------------------------------------------------------ |
56
+ | **Real-world utility** | 30% | Does the environment model a genuine task useful for training/evaluation? |
57
+ | **Task & grader quality** | 25% | Well-defined objectives? Fair measurement? Difficulty progression? |
58
+ | **Environment design** | 20% | Clean state management, sensible spaces, good reward shaping. |
59
+ | **Code quality & spec compliance** | 15% | Follows OpenEnv spec, clean project structure, working Dockerfile. |
60
+ | **Creativity & novelty** | 10% | Novel problem domain or interesting mechanics. |
61
+
62
+ ### Scoring Breakdown
63
+ **Real-world utility (30%)**
64
+ * **0–5:** Toy/artificial problem with no practical application
65
+ * **6–15:** Valid domain but shallow modeling of the real task
66
+ * **16–25:** Good domain modeling, would be useful for agent evaluation
67
+ * **26–30:** Excellent — fills a real gap, immediate value for the RL/agent community
68
+
69
+ **Task & grader quality (25%)**
70
+ * 3+ tasks with difficulty range?
71
+ * Graders produce scores between 0.0–1.0?
72
+ * Graders deterministic and reproducible?
73
+ * Hard task genuinely challenges frontier models?
74
+
75
+ **Environment design (20%)**
76
+ * `reset()` produces clean state?
77
+ * Action/observation types well-designed and documented?
78
+ * Reward function provides useful varying signal (not just sparse)?
79
+ * Episode boundaries sensible?
80
+
81
+ **Code quality & spec compliance (15%)**
82
+ * `openenv validate` passes?
83
+ * `docker build && docker run` works?
84
+ * HF Space deploys and responds?
85
+ * Baseline script runs and reproduces scores?
86
+
87
+ **Creativity & novelty (10%)**
88
+ * Domain we haven’t seen in OpenEnv before?
89
+ * Reward design has interesting properties?
90
+ * Clever mechanics that make the environment engaging?
91
+
92
+ ---
93
+
94
+ ### How judging works
95
+
96
+ **Phase 1: Automated Validation**
97
+
98
+ Pass/fail gate: HF Space deploys, OpenEnv spec compliance, Dockerfile builds, baseline reproduces, 3+ tasks with graders.
99
+
100
+ **Phase 2: Agentic Evaluation**
101
+
102
+ Scored: baseline agent re-run, standard Open LLM agent (e.g. Nemotron 3 Super) run against all environments, score variance check.
103
+
104
+ **Phase 3: Human Review**
105
+
106
+ Top submissions reviewed by Meta and Hugging Face engineers for real-world utility, creativity, and exploit checks.
107
+
108
+ **Disqualification Criteria**
109
+
110
+ * Environment does not deploy or respond
111
+
112
+ * Plagiarized or trivially modified existing environments
113
+
114
+ * Graders that always return the same score
115
+
116
+ ---
117
+
118
+ ### Pre-Submission Checklist
119
+ **CRITICAL: All checks must pass during automated validation or you will be disqualified.**
120
+
121
+ * **[ ] HF Space Deploys:** An automated ping to your Space URL must return **HTTP 200**, and the Space must respond successfully to a `/reset` call.
122
+ * **[ ] OpenEnv Spec Compliance:** Your environment must pass validation for `openenv.yaml`, typed Pydantic models, and the required `step()`, `reset()`, and `state()` endpoints.
123
+ * **[ ] Dockerfile Builds:** The automated system will run a `docker build` on your submitted repository; it must complete successfully.
124
+ * **[ ] Baseline Reproduces:** The system will execute your `inference.py`. It must run without errors and produce scores for all tasks.
125
+ * **[ ] 3+ Tasks with Graders:** The system will enumerate your tasks and run every grader to verify that scores fall strictly within the **0.0–1.0** range.
126
+
127
+ ### Mandatory Configuration
128
+ Before submitting, ensure the following variables are defined in your environment configuration:
129
+
130
+ * `API_BASE_URL`: The API endpoint for the LLM.
131
+
132
+ * `MODEL_NAME`: The model identifier to use for inference.
133
+
134
+ * `HF_TOKEN`: Your Hugging Face / API key.
135
+
136
+ The inference script must be named `inference.py` and placed in the root directory of the project.
137
+
138
+ Participants must use the OpenAI Client for all LLM calls, configured with the variables above.
139
+
140
+ Participants must emit structured stdout logs strictly following the `[START]`, `[STEP]`, and `[END]` format defined in the sample `inference.py` provided below. Any deviation in field names, ordering, or formatting will result in incorrect evaluation scoring. Refer to the Sample Inference Script for the complete format specification and examples.
141
+
142
+ ### Infrastructure & Runtime Restrictions
143
+ To ensure fair and stable evaluation, your submission must adhere to these limits:
144
+
145
+ * **Inference Time:** The total runtime of the `inference.py` script must be **less than 20 minutes**.
146
+ * **Hardware Constraints:** Your environment and inference script must be able to run on a machine with:
147
+ * **vCPU:** 2
148
+ * **Memory:** 8GB RAM
149
+
150
+ ### Validator
151
+ It is highly recommended that you **run the pre-submission validation script** locally (provided in the "Pre-Validation Script" section) before final submission to catch any Docker or spec errors early.
152
+
153
+ ---
154
+
155
+ ### FAQs
156
+
157
+ **How are submissions evaluated?**
158
+ Submissions are evaluated based on runtime correctness (runs without errors), interface compliance (follows OpenEnv standard), task design (clear and realistic), and grading logic (meaningful reward system).
159
+
160
+ **What framework must be used?**
161
+ Participants must use the **OpenEnv** framework. For LLM calls within the inference script, the **OpenAI Client** is mandatory.
162
+
163
+ **What do I need to submit?**
164
+ You must submit the URL to your containerized Hugging Face Space. Ensure your repository includes the `openenv.yaml` file, a working `Dockerfile`, and an `inference.py` script in the root directory.
165
+
166
+ **Where can I get help?**
167
+ You can join the [Discord Community](https://discord.gg/Dedhy5pkWD) for mentor access and announcements, or email the support team at `help_openenvhackathon@scaler.com`.
168
+
169
+ ---
170
+
171
+ ### Inference Script Example (`inference.py`)
172
+
173
+ ```python
174
+ """
175
+ Inference Script Example
176
+ ===================================
177
+ MANDATORY
178
+ - Before submitting, ensure the following variables are defined in your environment configuration:
179
+ API_BASE_URL The API endpoint for the LLM.
180
+ MODEL_NAME The model identifier to use for inference.
181
+ HF_TOKEN Your Hugging Face / API key.
182
+ LOCAL_IMAGE_NAME The name of the local image to use for the environment if you are using from_docker_image()
183
+ method
184
+
185
+ - Defaults are set only for API_BASE_URL and MODEL_NAME
186
+ (and should reflect your active inference setup):
187
+ API_BASE_URL = os.getenv("API_BASE_URL", "<your-active-endpoint>")
188
+ MODEL_NAME = os.getenv("MODEL_NAME", "<your-active-model>")
189
+
190
+ - The inference script must be named `inference.py` and placed in the root directory of the project
191
+ - Participants must use OpenAI Client for all LLM calls using above variables
192
+
193
+ STDOUT FORMAT
194
+ - The script must emit exactly three line types to stdout, in this order:
195
+
196
+ [START] task=<task_name> env=<benchmark> model=<model_name>
197
+ [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
198
+ [END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
199
+
200
+ Rules:
201
+ - One [START] line at episode begin.
202
+ - One [STEP] line per step, immediately after env.step() returns.
203
+ - One [END] line after env.close(), always emitted (even on exception).
204
+ - reward and rewards are formatted to 2 decimal places.
205
+ - done and success are lowercase booleans: true or false.
206
+ - error is the raw last_action_error string, or null if none.
207
+ - All fields on a single line with no newlines within a line.
208
+ - Each task should return a score in [0, 1].
209
+
210
+ Example:
211
+ [START] task=click-test env=miniwob model=Qwen3-VL-30B
212
+ [STEP] step=1 action=click('123') reward=0.00 done=false error=null
213
+ [STEP] step=2 action=fill('456','text') reward=0.00 done=false error=null
214
+ [STEP] step=3 action=click('789') reward=1.00 done=true error=null
215
+ [END] success=true steps=3 score=1.00 rewards=0.00,0.00,1.00
216
+ """
217
+
218
+ import asyncio
219
+ import os
220
+ import textwrap
221
+ from typing import List, Optional
222
+
223
+ from openai import OpenAI
224
+
225
+ from my_env_v4 import MyEnvV4Action, MyEnvV4Env
226
+ IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME") or os.getenv("IMAGE_NAME")  # Only needed when using from_docker_image()
227
+ API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
228
+
229
+ API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
230
+ MODEL_NAME = os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-72B-Instruct"
231
+ TASK_NAME = os.getenv("MY_ENV_V4_TASK", "echo")
232
+ BENCHMARK = os.getenv("MY_ENV_V4_BENCHMARK", "my_env_v4")
233
+ MAX_STEPS = 8
234
+ TEMPERATURE = 0.7
235
+ MAX_TOKENS = 150
236
+ SUCCESS_SCORE_THRESHOLD = 0.1 # normalized score in [0, 1]
237
+
238
+ # Max possible reward: each token contributes 0.1, across all steps
239
+ _MAX_REWARD_PER_STEP = MAX_TOKENS * 0.1
240
+ MAX_TOTAL_REWARD = MAX_STEPS * _MAX_REWARD_PER_STEP
241
+
242
+ SYSTEM_PROMPT = textwrap.dedent(
243
+ """
244
+ You are interacting with a simple echo environment.
245
+ Each turn you must send a message. The environment will echo it back.
246
+ Reward is proportional to message length: reward = len(message) * 0.1
247
+ Your goal is to maximize total reward by sending meaningful, substantive messages.
248
+ Reply with exactly one message string — no quotes, no prefixes, just the message text.
249
+ """
250
+ ).strip()
251
+
252
+
253
+ def log_start(task: str, env: str, model: str) -> None:
254
+ print(f"[START] task={task} env={env} model={model}", flush=True)
255
+
256
+
257
+ def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
258
+ error_val = error if error else "null"
259
+ done_val = str(done).lower()
260
+ print(
261
+ f"[STEP] step={step} action={action} reward={reward:.2f} done={done_val} error={error_val}",
262
+ flush=True,
263
+ )
264
+
265
+
266
+ def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
267
+ rewards_str = ",".join(f"{r:.2f}" for r in rewards)
268
+ print(f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}", flush=True)
269
+
270
+
271
+ def build_user_prompt(step: int, last_echoed: str, last_reward: float, history: List[str]) -> str:
272
+ history_block = "\n".join(history[-4:]) if history else "None"
273
+ return textwrap.dedent(
274
+ f"""
275
+ Step: {step}
276
+ Last echoed message: {last_echoed!r}
277
+ Last reward: {last_reward:.2f}
278
+ Previous steps:
279
+ {history_block}
280
+ Send your next message.
281
+ """
282
+ ).strip()
283
+
284
+
285
+ def get_model_message(client: OpenAI, step: int, last_echoed: str, last_reward: float, history: List[str]) -> str:
286
+ user_prompt = build_user_prompt(step, last_echoed, last_reward, history)
287
+ try:
288
+ completion = client.chat.completions.create(
289
+ model=MODEL_NAME,
290
+ messages=[
291
+ {"role": "system", "content": SYSTEM_PROMPT},
292
+ {"role": "user", "content": user_prompt},
293
+ ],
294
+ temperature=TEMPERATURE,
295
+ max_tokens=MAX_TOKENS,
296
+ stream=False,
297
+ )
298
+ text = (completion.choices[0].message.content or "").strip()
299
+ return text if text else "hello"
300
+ except Exception as exc:
301
+ print(f"[DEBUG] Model request failed: {exc}", flush=True)
302
+ return "hello"
303
+
304
+
305
+ async def main() -> None:
306
+ client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
307
+
308
+ env = await MyEnvV4Env.from_docker_image(IMAGE_NAME)
309
+
310
+ history: List[str] = []
311
+ rewards: List[float] = []
312
+ steps_taken = 0
313
+ score = 0.0
314
+ success = False
315
+
316
+ log_start(task=TASK_NAME, env=BENCHMARK, model=MODEL_NAME)
317
+
318
+ try:
319
+ result = await env.reset() # OpenENV.reset()
320
+ last_echoed = result.observation.echoed_message
321
+ last_reward = 0.0
322
+
323
+ for step in range(1, MAX_STEPS + 1):
324
+ if result.done:
325
+ break
326
+
327
+ message = get_model_message(client, step, last_echoed, last_reward, history)
328
+
329
+ result = await env.step(MyEnvV4Action(message=message))
330
+ obs = result.observation
331
+
332
+ reward = result.reward or 0.0
333
+ done = result.done
334
+ error = None
335
+
336
+ rewards.append(reward)
337
+ steps_taken = step
338
+ last_echoed = obs.echoed_message
339
+ last_reward = reward
340
+
341
+ log_step(step=step, action=message, reward=reward, done=done, error=error)
342
+
343
+ history.append(f"Step {step}: {message!r} -> reward {reward:+.2f}")
344
+
345
+ if done:
346
+ break
347
+
348
+ score = sum(rewards) / MAX_TOTAL_REWARD if MAX_TOTAL_REWARD > 0 else 0.0
349
+ score = min(max(score, 0.0), 1.0) # clamp to [0, 1]
350
+ success = score >= SUCCESS_SCORE_THRESHOLD
351
+
352
+ finally:
353
+ try:
354
+ await env.close()
355
+ except Exception as e:
356
+ print(f"[DEBUG] env.close() error (container cleanup): {e}", flush=True)
357
+ log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
358
+
359
+
360
+ if __name__ == "__main__":
361
+ asyncio.run(main())
362
+ ```
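Because any deviation from the stdout format affects scoring, it can help to check captured output before submitting. The validator below is a rough sketch built only from the format rules in the docstring above; the regular expressions and the `validate_log` helper are assumptions, not the official evaluator.

```python
import re
from typing import List

START_RE = re.compile(r"^\[START\] task=(\S+) env=(\S+) model=(\S+)$")
STEP_RE = re.compile(
    r"^\[STEP\] step=(\d+) action=(.*) reward=(-?\d+\.\d{2}) done=(true|false) error=(.*)$"
)
END_RE = re.compile(
    r"^\[END\] success=(true|false) steps=(\d+) score=(\d+(?:\.\d+)?) "
    r"rewards=((?:-?\d+\.\d{2})(?:,-?\d+\.\d{2})*)?$"
)


def validate_log(lines: List[str]) -> List[str]:
    """Return a list of format problems; an empty list means the log looks valid."""
    errors: List[str] = []
    if not lines or not START_RE.match(lines[0]):
        errors.append("first line must be a [START] line")
    if len(lines) < 2 or not END_RE.match(lines[-1]):
        errors.append("last line must be an [END] line")
        return errors
    step_lines = lines[1:-1]
    for expected, line in enumerate(step_lines, start=1):
        match = STEP_RE.match(line)
        if match is None:
            errors.append(f"malformed [STEP] line: {line!r}")
        elif int(match.group(1)) != expected:
            errors.append(f"expected step={expected}, got step={match.group(1)}")
    end = END_RE.match(lines[-1])
    rewards = end.group(4).split(",") if end.group(4) else []
    if int(end.group(2)) != len(step_lines):
        errors.append("[END] steps does not match the number of [STEP] lines")
    if len(rewards) != len(step_lines):
        errors.append("[END] rewards count does not match the number of [STEP] lines")
    return errors


sample = [
    "[START] task=click-test env=miniwob model=Qwen3-VL-30B",
    "[STEP] step=1 action=click('123') reward=0.00 done=false error=null",
    "[STEP] step=2 action=fill('456','text') reward=0.00 done=false error=null",
    "[STEP] step=3 action=click('789') reward=1.00 done=true error=null",
    "[END] success=true steps=3 score=1.00 rewards=0.00,0.00,1.00",
]
print(validate_log(sample))
```

The sketch assumes step numbers start at 1 and that any `[DEBUG]` lines, which the sample script also prints to stdout, are filtered out before validation.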
363
+
364
+ ---
365
+
366
+ ### Pre-Validation Script
367
+
368
+ ```bash
+ #!/usr/bin/env bash
+ #
+ # validate-submission.sh - OpenEnv Submission Validator
+ #
+ # Checks that your HF Space is live, Docker image builds, and openenv validate passes.
+ #
+ # Prerequisites:
+ #   - Docker: https://docs.docker.com/get-docker/
+ #   - openenv-core: pip install openenv-core
+ #   - curl (usually pre-installed)
+ #
+ # Run:
+ #   curl -fsSL https://raw.githubusercontent.com/<owner>/<repo>/main/scripts/validate-submission.sh | bash -s -- <ping_url> [repo_dir]
+ #
+ # Or download and run locally:
+ #   chmod +x validate-submission.sh
+ #   ./validate-submission.sh <ping_url> [repo_dir]
+ #
+ # Arguments:
+ #   ping_url Your HuggingFace Space URL (e.g. https://your-space.hf.space)
+ #   repo_dir Path to your repo (default: current directory)
+ #
+ # Examples:
+ #   ./validate-submission.sh https://my-team.hf.space
+ #   ./validate-submission.sh https://my-team.hf.space ./my-repo
+ #
+
+ set -uo pipefail
+
+ DOCKER_BUILD_TIMEOUT=600
+ if [ -t 1 ]; then
+   RED='\033[0;31m'
+   GREEN='\033[0;32m'
+   YELLOW='\033[1;33m'
+   BOLD='\033[1m'
+   NC='\033[0m'
+ else
+   RED='' GREEN='' YELLOW='' BOLD='' NC=''
+ fi
+
+ run_with_timeout() {
+   local secs="$1"; shift
+   if command -v timeout &>/dev/null; then
+     timeout "$secs" "$@"
+   elif command -v gtimeout &>/dev/null; then
+     gtimeout "$secs" "$@"
+   else
+     "$@" &
+     local pid=$!
+     ( sleep "$secs" && kill "$pid" 2>/dev/null ) &
+     local watcher=$!
+     wait "$pid" 2>/dev/null
+     local rc=$?
+     kill "$watcher" 2>/dev/null
+     wait "$watcher" 2>/dev/null
+     return $rc
+   fi
+ }
+
+ portable_mktemp() {
+   local prefix="${1:-validate}"
+   mktemp "${TMPDIR:-/tmp}/${prefix}-XXXXXX" 2>/dev/null || mktemp
+ }
+
+ CLEANUP_FILES=()
+ cleanup() { rm -f "${CLEANUP_FILES[@]+"${CLEANUP_FILES[@]}"}"; }
+ trap cleanup EXIT
+
+ PING_URL="${1:-}"
+ REPO_DIR="${2:-.}"
+
+ if [ -z "$PING_URL" ]; then
+   printf "Usage: %s <ping_url> [repo_dir]\n" "$0"
+   printf "\n"
+   printf " ping_url Your HuggingFace Space URL (e.g. https://your-space.hf.space)\n"
+   printf " repo_dir Path to your repo (default: current directory)\n"
+   exit 1
+ fi
+
+ if ! REPO_DIR="$(cd "$REPO_DIR" 2>/dev/null && pwd)"; then
+   printf "Error: directory '%s' not found\n" "${2:-.}"
+   exit 1
+ fi
+ PING_URL="${PING_URL%/}"
+ export PING_URL
+ PASS=0
+
+ log() { printf "[%s] %b\n" "$(date -u +%H:%M:%S)" "$*"; }
+ pass() { log "${GREEN}PASSED${NC} -- $1"; PASS=$((PASS + 1)); }
+ fail() { log "${RED}FAILED${NC} -- $1"; }
+ hint() { printf " ${YELLOW}Hint:${NC} %b\n" "$1"; }
+ stop_at() {
+   printf "\n"
+   printf "${RED}${BOLD}Validation stopped at %s.${NC} Fix the above before continuing.\n" "$1"
+   exit 1
+ }
+
+ printf "\n"
+ printf "${BOLD}========================================${NC}\n"
+ printf "${BOLD} OpenEnv Submission Validator${NC}\n"
+ printf "${BOLD}========================================${NC}\n"
+ log "Repo: $REPO_DIR"
+ log "Ping URL: $PING_URL"
+ printf "\n"
+
+ log "${BOLD}Step 1/3: Pinging HF Space${NC} ($PING_URL/reset) ..."
+
+ CURL_OUTPUT=$(portable_mktemp "validate-curl")
+ CLEANUP_FILES+=("$CURL_OUTPUT")
+ # curl's -w already prints 000 when no response arrives, so set the sentinel
+ # by assignment rather than appending a second "000" to the captured output.
+ HTTP_CODE=$(curl -s -o "$CURL_OUTPUT" -w "%{http_code}" -X POST \
+   -H "Content-Type: application/json" -d '{}' \
+   "$PING_URL/reset" --max-time 30 2>"$CURL_OUTPUT") || HTTP_CODE="000"
+
+ if [ "$HTTP_CODE" = "200" ]; then
+   pass "HF Space is live and responds to /reset"
+ elif [ "$HTTP_CODE" = "000" ]; then
+   fail "HF Space not reachable (connection failed or timed out)"
+   hint "Check your network connection and that the Space is running."
+   hint "Try: curl -s -o /dev/null -w '%{http_code}' -X POST $PING_URL/reset"
+   stop_at "Step 1"
+ else
+   fail "HF Space /reset returned HTTP $HTTP_CODE (expected 200)"
+   hint "Make sure your Space is running and the URL is correct."
+   hint "Try opening $PING_URL in your browser first."
+   stop_at "Step 1"
+ fi
+
+ log "${BOLD}Step 2/3: Running docker build${NC} ..."
+
+ if ! command -v docker &>/dev/null; then
+   fail "docker command not found"
+   hint "Install Docker: https://docs.docker.com/get-docker/"
+   stop_at "Step 2"
+ fi
+
+ if [ -f "$REPO_DIR/Dockerfile" ]; then
+   DOCKER_CONTEXT="$REPO_DIR"
+ elif [ -f "$REPO_DIR/server/Dockerfile" ]; then
+   DOCKER_CONTEXT="$REPO_DIR/server"
+ else
+   fail "No Dockerfile found in repo root or server/ directory"
+   stop_at "Step 2"
+ fi
+
+ log " Found Dockerfile in $DOCKER_CONTEXT"
+
+ BUILD_OK=false
+ BUILD_OUTPUT=$(run_with_timeout "$DOCKER_BUILD_TIMEOUT" docker build "$DOCKER_CONTEXT" 2>&1) && BUILD_OK=true
+
+ if [ "$BUILD_OK" = true ]; then
+   pass "Docker build succeeded"
+ else
+   fail "Docker build failed (timeout=${DOCKER_BUILD_TIMEOUT}s)"
+   printf "%s\n" "$BUILD_OUTPUT" | tail -20
+   stop_at "Step 2"
+ fi
+
+ log "${BOLD}Step 3/3: Running openenv validate${NC} ..."
+
+ if ! command -v openenv &>/dev/null; then
+   fail "openenv command not found"
+   hint "Install it: pip install openenv-core"
+   stop_at "Step 3"
+ fi
+
+ VALIDATE_OK=false
+ VALIDATE_OUTPUT=$(cd "$REPO_DIR" && openenv validate 2>&1) && VALIDATE_OK=true
+
+ if [ "$VALIDATE_OK" = true ]; then
+   pass "openenv validate passed"
+   [ -n "$VALIDATE_OUTPUT" ] && log " $VALIDATE_OUTPUT"
+ else
+   fail "openenv validate failed"
+   printf "%s\n" "$VALIDATE_OUTPUT"
+   stop_at "Step 3"
+ fi
+
+ printf "\n"
+ printf "${BOLD}========================================${NC}\n"
+ printf "${GREEN}${BOLD} All 3/3 checks passed!${NC}\n"
+ printf "${GREEN}${BOLD} Your submission is ready to submit.${NC}\n"
+ printf "${BOLD}========================================${NC}\n"
+ printf "\n"
+
+ exit 0
+ ```
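Step 1 of the validator can also be reproduced from Python when debugging a Space that fails the ping. The following is a minimal sketch using only the standard library; the function names `classify_ping` and `ping_reset` are illustrative and not part of the validator. It keeps the same three outcomes the script reports and reuses `000` as the sentinel for "no response received".

```python
import json
import urllib.error
import urllib.request


def classify_ping(http_code: str) -> str:
    """Map an HTTP status string to the three outcomes the script reports."""
    if http_code == "200":
        return "live"
    if http_code == "000":
        return "unreachable"
    return "unexpected_status"


def ping_reset(base_url: str, timeout: float = 30.0) -> str:
    """POST an empty JSON body to <base_url>/reset and return the status as text."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/reset",
        data=json.dumps({}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return str(response.status)
    except urllib.error.HTTPError as exc:
        return str(exc.code)  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return "000"  # connection failed or timed out


if __name__ == "__main__":
    # Placeholder URL taken from the script's own usage text.
    print(classify_ping(ping_reset("https://your-space.hf.space")))
```

`HTTPError` is caught before `URLError` because it is a subclass of it; reversing the order would report every error status as unreachable.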
+
+ ---
+
+ REQUIREMENTS:
+ - Must use models available on Hugging Face only
+ - Use the openenv CLI to stay compliant
inference.py ADDED
@@ -0,0 +1,1067 @@
+ """Reproducible baseline for InvoiceOps."""
+
+ from __future__ import annotations
+
+ import json
+ import os
+ import re
+ import sys
+ from dataclasses import dataclass, field
+ from datetime import datetime, timezone
+ from pathlib import Path
+ from typing import Any, Callable, TypeVar
+
+ from openai import OpenAI
+
+ from invoiceops_env import InvoiceOpsAction, InvoiceOpsEnv
+ from invoiceops_env.models import (
+     ActionType,
+     Disposition,
+     DuplicateCandidate,
+     DuplicateMatchStrategy,
+     ExceptionDetail,
+     InvoiceOpsObservation,
+     NoteType,
+     PaymentRecommendation,
+     QueueCard,
+     ReasonCode,
+     RouteTarget,
+     TaskId,
+ )
+
+ ENV_URL = os.getenv("ENV_URL", "http://localhost:8000")
+ DEFAULT_HF_ROUTER_BASE_URL = "https://router.huggingface.co/v1"
+ API_BASE_URL = os.getenv("API_BASE_URL", DEFAULT_HF_ROUTER_BASE_URL)
+ MODEL_NAME = os.getenv("MODEL_NAME", "zai-org/GLM-5.1")
+ TEMPERATURE = 0.0
+ MAX_TOKENS = int(os.getenv("MAX_TOKENS", "3000"))
+ RETRY_MAX_TOKENS = max(MAX_TOKENS, int(os.getenv("RETRY_MAX_TOKENS", "5000")))
+ MAX_MODEL_ATTEMPTS = 2
+ BENCHMARK = "invoiceops_env"
+ OUTPUT_DIR = Path(__file__).resolve().parent / "outputs" / "evals"
+ EVAL_RUN_NAME = os.getenv("EVAL_RUN_NAME")
+ TASKS = [
+     TaskId.EASY,
+     TaskId.MEDIUM,
+     TaskId.MEDIUM_PLUS,
+     TaskId.HARD,
+ ]
+ HEADER_DISPOSITION_MAP: dict[Disposition, PaymentRecommendation] = {
+     Disposition.APPROVE: PaymentRecommendation.RELEASE_APPROVED_LINES,
+     Disposition.HOLD: PaymentRecommendation.HOLD_FULL_INVOICE,
+     Disposition.REJECT: PaymentRecommendation.REJECT_FULL_INVOICE,
+     Disposition.ESCALATE: PaymentRecommendation.ESCALATE_CASE,
+ }
+ ParsedModelOutput = TypeVar("ParsedModelOutput")
+
+
+ def _env_flag(name: str, default: bool) -> bool:
+     raw_value = os.getenv(name)
+     if raw_value is None:
+         return default
+     return raw_value.strip().lower() not in {"0", "false", "no", "off", ""}
+
+
+ def strict_task_score(raw_score: float, *, used_fallback: bool) -> float:
+     if used_fallback and _env_flag("STRICT_BASELINE_SCORING", True):
+         return 0.0
+     return raw_score
+
+
+ @dataclass
+ class EpisodeTrace:
+     rewards: list[float] = field(default_factory=list)
+     steps_taken: int = 0
+
+
+ @dataclass
+ class ObservationMemory:
+     opened_artifacts: dict[str, Any] = field(default_factory=dict)
+     inspected_exceptions: dict[str, ExceptionDetail] = field(default_factory=dict)
+     duplicate_candidates: list[DuplicateCandidate] = field(default_factory=list)
+
+
+ def resolve_api_key() -> tuple[str | None, str | None]:
+     token = os.getenv("HF_TOKEN")
+     return (token, "HF_TOKEN") if token else (None, None)
+
+
+ def _slugify(value: str) -> str:
+     slug = re.sub(r"[^A-Za-z0-9._-]+", "-", value.strip())
+     slug = slug.strip("-._")
+     return slug or "run"
+
+
+ def build_output_path(model_name: str) -> tuple[str, Path]:
+     timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
+     run_id = _slugify(EVAL_RUN_NAME) if EVAL_RUN_NAME else timestamp
+     model_slug = _slugify(model_name)
+     OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
+
+     candidate = OUTPUT_DIR / f"{run_id}__{model_slug}.json"
+     suffix = 2
+     while candidate.exists():
+         candidate = OUTPUT_DIR / f"{run_id}__{model_slug}__{suffix}.json"
+         suffix += 1
+     return run_id, candidate
+
+
+ def _sanitize_log_value(value: str | None) -> str:
+     if not value:
+         return "null"
+     return value.replace("\n", " ").strip() or "null"
+
+
+ def format_action_for_log(action: InvoiceOpsAction) -> str:
+     return json.dumps(
+         action.model_dump(mode="json", exclude_none=True),
+         separators=(",", ":"),
+         sort_keys=True,
+     )
+
+
+ def _extract_step_error(
+     observation: InvoiceOpsObservation | None,
+     *,
+     previous_invalid_actions: int,
+ ) -> str | None:
+     if observation is None:
+         return None
+     if observation.progress.invalid_actions > previous_invalid_actions:
+         return observation.message or None
+     return None
+
+
+ def log_start(task: str, env: str, model: str) -> None:
+     print(f"[START] task={task} env={env} model={model}", flush=True)
+
+
+ def log_step(
+     step: int, action: str, reward: float, done: bool, error: str | None
+ ) -> None:
+     print(
+         f"[STEP] step={step} action={action} reward={reward:.2f} "
+         f"done={str(done).lower()} error={_sanitize_log_value(error)}",
+         flush=True,
+     )
+
+
+ def log_end(success: bool, steps: int, score: float, rewards: list[float]) -> None:
+     rewards_str = ",".join(f"{reward:.2f}" for reward in rewards)
+     print(
+         f"[END] success={str(success).lower()} steps={steps} "
+         f"score={score:.3f} rewards={rewards_str}",
+         flush=True,
+     )
+
+
+ def _safe_json_load(text: str) -> dict[str, Any] | None:
+     text = text.strip()
+     if not text:
+         return None
+
+     text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL | re.IGNORECASE)
+     text = re.sub(
+         r"<reasoning>.*?</reasoning>",
+         "",
+         text,
+         flags=re.DOTALL | re.IGNORECASE,
+     )
+
+     if text.startswith("```"):
+         text = re.sub(r"^```(?:json)?\s*", "", text)
+         text = re.sub(r"\s*```$", "", text)
+
+     try:
+         payload = json.loads(text)
+     except json.JSONDecodeError:
+         match = re.search(r"\{.*\}", text, re.DOTALL)
+         if not match:
+             return None
+         try:
+             payload = json.loads(match.group(0))
+         except json.JSONDecodeError:
+             return None
+
+     return payload if isinstance(payload, dict) else None
+
+
+ def _normalize_completion_content(raw_content: Any) -> str:
+     if raw_content is None:
+         return ""
+     if isinstance(raw_content, str):
+         return raw_content
+     if isinstance(raw_content, list):
+         parts: list[str] = []
+         for item in raw_content:
+             if isinstance(item, dict):
+                 text = item.get("text")
+                 if isinstance(text, str):
+                     parts.append(text)
+                 continue
+             text = getattr(item, "text", None)
+             if isinstance(text, str):
+                 parts.append(text)
+         return "\n".join(part for part in parts if part)
+     return str(raw_content)
+
+
+ def _attempt_trace(
+     *,
+     completion: Any | None = None,
+     content: str = "",
+     payload: dict[str, Any] | None = None,
+     parsed_ok: bool = False,
+     failure_reason: str | None = None,
+     error: Exception | None = None,
+ ) -> dict[str, Any]:
+     trace: dict[str, Any] = {
+         "content": content,
+         "content_empty": not bool(content.strip()),
+         "json_detected": payload is not None,
+         "validation_passed": parsed_ok,
+         "failure_reason": failure_reason,
+     }
+
+     if error is not None:
+         trace["error_type"] = error.__class__.__name__
+         trace["error_message"] = str(error)
+
+     if completion is None:
+         return trace
+
+     trace["response_id"] = getattr(completion, "id", None)
+     choices = getattr(completion, "choices", None) or []
+     if choices:
+         choice = choices[0]
+         trace["finish_reason"] = getattr(choice, "finish_reason", None)
+         message = getattr(choice, "message", None)
+         if message is not None:
+             if hasattr(message, "model_dump"):
+                 trace["raw_message"] = message.model_dump(
+                     mode="json", exclude_none=True
+                 )
+             else:
+                 trace["raw_message"] = str(message)
+
+     usage = getattr(completion, "usage", None)
+     if usage is not None and hasattr(usage, "model_dump"):
+         trace["usage"] = usage.model_dump(mode="json", exclude_none=True)
+
+     return trace
+
+
+ def _query_model_json(
+     openai_client: OpenAI,
+     *,
+     system_prompt: str,
+     user_prompt: str,
+     validator: Callable[[dict[str, Any] | None], ParsedModelOutput | None],
+     retry_feedback: str,
+ ) -> tuple[ParsedModelOutput | None, list[dict[str, Any]]]:
+     messages = [
+         {"role": "system", "content": system_prompt},
+         {"role": "user", "content": user_prompt},
+     ]
+     attempts: list[dict[str, Any]] = []
+
+     for attempt in range(MAX_MODEL_ATTEMPTS):
+         expand_token_budget = bool(
+             attempts and attempts[-1].get("finish_reason") == "length"
+         )
+         try:
+             completion = openai_client.chat.completions.create(
+                 model=MODEL_NAME,
+                 messages=messages,
+                 temperature=TEMPERATURE,
+                 response_format={"type": "json_object"},
+                 max_tokens=(RETRY_MAX_TOKENS if expand_token_budget else MAX_TOKENS),
+             )
+         except Exception as exc:
+             attempts.append(
+                 _attempt_trace(
+                     failure_reason="request_error",
+                     error=exc,
+                 )
+             )
+             if attempt == MAX_MODEL_ATTEMPTS - 1:
+                 break
+             messages.append(
+                 {
+                     "role": "user",
+                     "content": (
+                         "The previous request failed before a usable response was returned. "
+                         f"{retry_feedback} Reply with JSON only and no prose."
+                     ),
+                 }
+             )
+             continue
+
+         choices = getattr(completion, "choices", None) or []
+         if not choices:
+             attempts.append(
+                 _attempt_trace(
+                     completion=completion,
+                     failure_reason="no_choices",
+                 )
+             )
+             if attempt == MAX_MODEL_ATTEMPTS - 1:
+                 break
+             messages.append(
+                 {
+                     "role": "user",
+                     "content": (
+                         "The previous reply did not contain any choices. "
+                         f"{retry_feedback} Reply with JSON only and no prose."
+                     ),
+                 }
+             )
+             continue
+
+         message = choices[0].message
+         content = _normalize_completion_content(getattr(message, "content", None))
+         payload = _safe_json_load(content)
+         parsed = validator(payload)
+         if parsed is not None:
+             attempts.append(
+                 _attempt_trace(
+                     completion=completion,
+                     content=content,
+                     payload=payload,
+                     parsed_ok=True,
+                 )
+             )
+             return parsed, attempts
+
+         if not content.strip():
+             failure_reason = "empty_content"
+         elif payload is None:
+             failure_reason = "json_not_found"
+         else:
+             failure_reason = "schema_validation_failed"
+
+         attempts.append(
+             _attempt_trace(
+                 completion=completion,
+                 content=content,
+                 payload=payload,
+                 parsed_ok=False,
+                 failure_reason=failure_reason,
+             )
+         )
+
+         if attempt == MAX_MODEL_ATTEMPTS - 1:
+             break
+
+         messages.extend(
+             [
+                 {"role": "assistant", "content": content or "<empty_response>"},
+                 {
+                     "role": "user",
+                     "content": (
+                         "Your previous reply could not be used. "
+                         f"{retry_feedback} Reply with JSON only and no prose."
+                     ),
+                 },
+             ]
+         )
+
+     return None, attempts
+
+
+ def _coerce_reason_codes(values: Any) -> list[ReasonCode]:
+     if isinstance(values, str):
+         raw_values = [values]
+     elif isinstance(values, list):
+         raw_values = values
+     else:
+         return []
+
+     codes: list[ReasonCode] = []
+     for value in raw_values:
+         if not isinstance(value, str):
+             continue
+         try:
+             code = ReasonCode(value)
+         except ValueError:
+             continue
+         if code not in codes:
+             codes.append(code)
+     return codes
+
+
+ def _coerce_string_list(values: Any) -> list[str]:
+     if isinstance(values, str):
+         raw_values = [values]
+     elif isinstance(values, list):
+         raw_values = values
+     else:
+         return []
+
+     refs: list[str] = []
+     for value in raw_values:
+         if not isinstance(value, str):
+             continue
+         ref = value.strip()
+         if not ref or ref in refs:
+             continue
+         refs.append(ref)
+     return refs
+
+
+ def _coerce_action_type(value: Any) -> ActionType | None:
+     if not isinstance(value, str):
+         return None
+     try:
+         return ActionType(value)
+     except ValueError:
+         return None
+
+
+ def _coerce_match_strategy(value: Any) -> DuplicateMatchStrategy | None:
+     if not isinstance(value, str):
+         return None
+     normalized = value.strip().lower()
+     aliases = {
+         "exact_invoice_no": DuplicateMatchStrategy.EXACT_INVOICE_NUMBER,
+         "exact_invoice_number": DuplicateMatchStrategy.EXACT_INVOICE_NUMBER,
+         "invoice_number_exact": DuplicateMatchStrategy.EXACT_INVOICE_NUMBER,
+         "normalized_invoice_no": DuplicateMatchStrategy.NORMALIZED_INVOICE_NUMBER,
+         "normalized_invoice_number": DuplicateMatchStrategy.NORMALIZED_INVOICE_NUMBER,
+         "normalized_invoice": DuplicateMatchStrategy.NORMALIZED_INVOICE_NUMBER,
+         "vendor_amount_date": DuplicateMatchStrategy.VENDOR_AMOUNT_DATE,
+         "vendor_amount": DuplicateMatchStrategy.VENDOR_AMOUNT_DATE,
+         "vendor_invoice_amount": DuplicateMatchStrategy.VENDOR_AMOUNT_DATE,
+         "exact_vendor_invoice_amount": DuplicateMatchStrategy.VENDOR_AMOUNT_DATE,
+         "vendor_amount_and_date": DuplicateMatchStrategy.VENDOR_AMOUNT_DATE,
+     }
+     strategy = aliases.get(normalized)
+     if strategy is not None:
+         return strategy
+     try:
+         return DuplicateMatchStrategy(value)
+     except ValueError:
+         return None
+
+
+ def _coerce_note_type(value: Any) -> NoteType | None:
+     if not isinstance(value, str):
+         return None
+     try:
+         return NoteType(value)
+     except ValueError:
+         return None
+
+
+ def _coerce_route(value: Any) -> RouteTarget | None:
+     if not isinstance(value, str):
+         return None
+     try:
+         return RouteTarget(value)
+     except ValueError:
+         return None
+
+
+ def _coerce_disposition(value: Any) -> Disposition | None:
+     if not isinstance(value, str):
+         return None
+     try:
+         return Disposition(value)
+     except ValueError:
+         return None
+
+
+ def _coerce_payment_recommendation(
+     raw_header: dict[str, Any] | str | None,
+ ) -> PaymentRecommendation | None:
+     if isinstance(raw_header, str):
+         try:
+             return PaymentRecommendation(raw_header)
+         except ValueError:
+             return None
+
+     if not isinstance(raw_header, dict):
+         return None
+
+     for key in ("payment_recommendation", "header_recommendation", "recommendation"):
+         raw_value = raw_header.get(key)
+         if not isinstance(raw_value, str):
+             continue
+         try:
+             return PaymentRecommendation(raw_value)
+         except ValueError:
+             continue
+
+     disposition = _coerce_disposition(
+         raw_header.get("disposition") or raw_header.get("decision")
+     )
+     if disposition is None:
+         return None
+     return HEADER_DISPOSITION_MAP.get(disposition)
+
+
+ def _extract_action_payload(payload: dict[str, Any] | None) -> dict[str, Any] | None:
+     if payload is None:
+         return None
+
+     if isinstance(payload.get("action"), dict):
+         raw_action = dict(payload["action"])
+         if "action_type" not in raw_action and isinstance(
+             payload.get("action_type"), str
+         ):
+             raw_action["action_type"] = payload["action_type"]
+         return raw_action
+
+     if isinstance(payload.get("args"), dict) and isinstance(payload.get("action"), str):
+         raw_action = dict(payload["args"])
+         raw_action.setdefault("action_type", payload["action"])
+         return raw_action
+
+     if isinstance(payload.get("arguments"), dict) and isinstance(
+         payload.get("action"), str
+     ):
+         raw_action = dict(payload["arguments"])
+         raw_action.setdefault("action_type", payload["action"])
+         return raw_action
+
+     return dict(payload)
+
+
+ def _parse_action_payload(payload: dict[str, Any] | None) -> InvoiceOpsAction | None:
+     raw_action = _extract_action_payload(payload)
+     if raw_action is None:
+         return None
+
+     action_type = _coerce_action_type(
+         raw_action.get("action_type")
+         or raw_action.get("action")
+         or raw_action.get("type")
+         or raw_action.get("kind")
+         or raw_action.get("name")
+     )
+     if action_type is None:
+         return None
+
+     action_kwargs: dict[str, Any] = {
+         "action_type": action_type,
+     }
+
+     if action_type is ActionType.OPEN_ARTIFACT:
+         action_kwargs["artifact_id"] = (
+             raw_action.get("artifact_id")
+             or raw_action.get("artifact")
+             or raw_action.get("id")
+         )
+     elif action_type is ActionType.INSPECT_EXCEPTION:
+         action_kwargs["exception_id"] = (
+             raw_action.get("exception_id")
+             or raw_action.get("exception")
+             or raw_action.get("id")
+         )
+     elif action_type is ActionType.RUN_DUPLICATE_CHECK:
+         match_strategy = raw_action.get("match_strategy") or raw_action.get("strategy")
+         action_kwargs["match_strategy"] = _coerce_match_strategy(match_strategy)
+         if action_kwargs["match_strategy"] is None:
+             return None
+     elif action_type is ActionType.ADD_NOTE:
+         action_kwargs["note_type"] = _coerce_note_type(
+             raw_action.get("note_type") or raw_action.get("note_kind")
+         )
+         action_kwargs["reason_codes"] = _coerce_reason_codes(
+             raw_action.get("reason_codes") or raw_action.get("reason_code")
+         )
+         action_kwargs["evidence_refs"] = _coerce_string_list(
+             raw_action.get("evidence_refs")
+             or raw_action.get("evidence_ref")
+             or raw_action.get("refs")
+         )
+         action_kwargs["text"] = raw_action.get("text")
+     elif action_type is ActionType.SET_LINE_RESOLUTION:
+         action_kwargs["line_id"] = raw_action.get("line_id") or raw_action.get("line")
+         action_kwargs["disposition"] = _coerce_disposition(
+             raw_action.get("disposition") or raw_action.get("decision")
+         )
+         action_kwargs["reason_codes"] = _coerce_reason_codes(
+             raw_action.get("reason_codes") or raw_action.get("reason_code")
+         )
+         action_kwargs["evidence_refs"] = _coerce_string_list(
+             raw_action.get("evidence_refs")
+             or raw_action.get("evidence_ref")
+             or raw_action.get("refs")
+         )
+         action_kwargs["route_to"] = _coerce_route(
+             raw_action.get("route_to")
+             or raw_action.get("route")
+             or raw_action.get("escalation_target")
+         )
+     elif action_type is ActionType.SET_HEADER_RESOLUTION:
+         action_kwargs["payment_recommendation"] = _coerce_payment_recommendation(
+             raw_action
+         )
+         action_kwargs["reason_codes"] = _coerce_reason_codes(
+             raw_action.get("reason_codes") or raw_action.get("reason_code")
+         )
+         action_kwargs["evidence_refs"] = _coerce_string_list(
+             raw_action.get("evidence_refs")
+             or raw_action.get("evidence_ref")
+             or raw_action.get("refs")
+         )
+         action_kwargs["route_to"] = _coerce_route(
+             raw_action.get("route_to")
+             or raw_action.get("route")
+             or raw_action.get("escalation_target")
+         )
+     elif action_type is ActionType.SUBMIT_CASE:
+         action_kwargs["note_ids"] = _coerce_string_list(raw_action.get("note_ids"))
+         action_kwargs["line_resolution_ids"] = _coerce_string_list(
+             raw_action.get("line_resolution_ids")
+         )
+         header_resolution_id = raw_action.get("header_resolution_id")
+         if isinstance(header_resolution_id, str):
+             action_kwargs["header_resolution_id"] = header_resolution_id.strip()
+
+     try:
+         return InvoiceOpsAction(**action_kwargs)
+     except Exception:
+         return None
+
+
+ def build_case_snapshot(
+     queue_card: QueueCard,
+     opened_artifacts: dict[str, Any],
+     inspected_exceptions: dict[str, ExceptionDetail],
+     duplicate_candidates: list[DuplicateCandidate],
+ ) -> dict[str, Any]:
+     def compact_text(value: str, *, limit: int = 180) -> str:
+         normalized = re.sub(r"\s+", " ", value.strip())
+         if len(normalized) <= limit:
+             return normalized
+         return f"{normalized[: limit - 3].rstrip()}..."
+
+     def compact_fields(fields: list[Any], *, limit: int = 10) -> dict[str, str]:
+         compact: dict[str, str] = {}
+         for field in fields[:limit]:
+             label = field.label.strip()
+             value = field.value.strip()
+             if not label or not value:
+                 continue
+             compact[label] = compact_text(value, limit=120)
+         return compact
+
+     def compact_line_items(
+         line_items: list[Any], *, limit: int = 6
+     ) -> list[dict[str, Any]]:
+         compact_items: list[dict[str, Any]] = []
+         for item in line_items[:limit]:
+             compact_item: dict[str, Any] = {
+                 "line_id": item.line_id,
+                 "description": compact_text(item.description, limit=100),
+                 "amount": item.amount,
+             }
+             if item.quantity is not None:
+                 compact_item["quantity"] = item.quantity
+             if item.unit_price is not None:
+                 compact_item["unit_price"] = item.unit_price
+             if item.status:
+                 compact_item["status"] = compact_text(item.status, limit=60)
+             if item.notes:
+                 compact_item["notes"] = compact_text(item.notes, limit=100)
+             compact_items.append(compact_item)
+         return compact_items
+
+     def compact_events(events: list[Any], *, limit: int = 8) -> list[dict[str, Any]]:
+         compact_events_list: list[dict[str, Any]] = []
+         for event in events[:limit]:
+             compact_event: dict[str, Any] = {
+                 "type": event.event_type,
+                 "date": event.event_date,
+                 "description": compact_text(event.description, limit=120),
+             }
+             if event.quantity is not None:
+                 compact_event["quantity"] = event.quantity
+             if event.amount is not None:
+                 compact_event["amount"] = event.amount
+             if event.status:
+                 compact_event["status"] = compact_text(event.status, limit=60)
+             compact_events_list.append(compact_event)
+         return compact_events_list
+
+     def compact_artifact(artifact: Any) -> dict[str, Any]:
+         compact_artifact_view: dict[str, Any] = {
+             "title": artifact.title,
+         }
+         if artifact.summary:
+             compact_artifact_view["summary"] = compact_text(artifact.summary)
+         fields = compact_fields(artifact.fields)
+         if fields:
+             compact_artifact_view["fields"] = fields
+         line_items = compact_line_items(artifact.line_items)
+         if line_items:
+             compact_artifact_view["line_items"] = line_items
+         events = compact_events(artifact.events)
+         if events:
+             compact_artifact_view["events"] = events
+         return compact_artifact_view
+
+     def compact_exception(exception: ExceptionDetail) -> dict[str, Any]:
+         compact_exception_view: dict[str, Any] = {
+             "type": exception.exception_type.value,
+             "severity": exception.severity.value,
+             "headline": compact_text(exception.headline, limit=120),
+         }
+         if exception.impacted_line_ids:
+             compact_exception_view["impacted_line_ids"] = exception.impacted_line_ids
+         if exception.short_description:
+             compact_exception_view["summary"] = compact_text(
+                 exception.short_description,
+                 limit=140,
+             )
+         fields = compact_fields(exception.fields, limit=8)
+         if fields:
+             compact_exception_view["facts"] = fields
+         if exception.reviewer_guidance:
+             compact_exception_view["guidance"] = compact_text(
+                 exception.reviewer_guidance,
+                 limit=160,
+             )
+         return compact_exception_view
+
+     def compact_duplicate(candidate: DuplicateCandidate) -> dict[str, Any]:
+         return {
+             "candidate_id": candidate.candidate_id,
+             "invoice_number": candidate.invoice_number,
+             "invoice_date": candidate.invoice_date,
+             "gross_amount": candidate.gross_amount,
+             "status": candidate.status,
+             "match_basis": compact_text(candidate.match_basis, limit=80),
+             "overlap_summary": compact_text(candidate.overlap_summary, limit=140),
+         }
+
+     return {
+         "queue_card": {
+             "vendor_name": queue_card.vendor_name,
+             "vendor_id": queue_card.vendor_id,
+             "invoice_number": queue_card.invoice_number,
+             "invoice_date": queue_card.invoice_date,
+             "invoice_total": queue_card.invoice_total,
+             "currency": queue_card.currency,
+             "po_number": queue_card.po_number,
+             "risk_flags": [flag.value for flag in queue_card.risk_flags],
+             "summary": compact_text(queue_card.summary, limit=160),
+         },
+         "artifacts": {
+             artifact.artifact_type.value: compact_artifact(artifact)
+             for artifact in opened_artifacts.values()
+         },
+         "exceptions": [
+             compact_exception(exception) for exception in inspected_exceptions.values()
+         ],
+         "duplicate_candidates": [
+             compact_duplicate(candidate) for candidate in duplicate_candidates
+         ],
+     }
+
+
+ def update_memory(
+     memory: ObservationMemory,
+     observation: InvoiceOpsObservation,
+ ) -> None:
+     if observation.opened_artifact is not None:
+         memory.opened_artifacts[observation.opened_artifact.artifact_id] = (
+             observation.opened_artifact
+         )
+     if observation.inspected_exception is not None:
+         memory.inspected_exceptions[observation.inspected_exception.exception_id] = (
+             observation.inspected_exception
+         )
+     if observation.duplicate_candidates:
+         memory.duplicate_candidates = observation.duplicate_candidates
+
+
+ def build_observation_snapshot(
+     observation: InvoiceOpsObservation,
+     memory: ObservationMemory,
+ ) -> dict[str, Any]:
+     queue_card = observation.queue_card
+     assert queue_card is not None
+
+     base_snapshot = build_case_snapshot(
+         queue_card,
+         memory.opened_artifacts,
+         memory.inspected_exceptions,
+         memory.duplicate_candidates,
+     )
+     base_snapshot["message"] = observation.message
+     base_snapshot["progress"] = observation.progress.model_dump(mode="json")
+     base_snapshot["known_refs"] = observation.known_refs
+     base_snapshot["available_artifacts"] = [
+         artifact.model_dump(mode="json") for artifact in observation.available_artifacts
+     ]
+     base_snapshot["visible_exceptions"] = [
+         exception.model_dump(mode="json")
+         for exception in observation.visible_exceptions
+     ]
+     base_snapshot["current_focus"] = {
+         "opened_artifact_id": (
+             observation.opened_artifact.artifact_id
+             if observation.opened_artifact is not None
+             else None
+         ),
+         "inspected_exception_id": (
+             observation.inspected_exception.exception_id
+             if observation.inspected_exception is not None
+             else None
+         ),
+     }
+     base_snapshot["draft_state"] = {
+         "line_resolutions": [
+             line_resolution.model_dump(mode="json")
+             for line_resolution in observation.draft_line_resolutions
+         ],
+         "header_resolution": (
+             observation.draft_header_resolution.model_dump(mode="json")
+             if observation.draft_header_resolution is not None
+             else None
+         ),
+         "notes": [note.model_dump(mode="json") for note in observation.draft_notes],
+     }
+     return base_snapshot
+
+
+ def build_action_prompt(
+     observation: InvoiceOpsObservation,
+     memory: ObservationMemory,
+ ) -> str:
+     snapshot = build_observation_snapshot(observation, memory)
+     return (
+         "You are controlling an AP invoice exception environment one action at a time.\n"
+         "Return exactly one JSON object for the single best next action. No prose. No markdown. No multi-action plans.\n"
+         "Do not assume you have seen artifacts or exception details that are not in the observation snapshot.\n"
+         "Use open_artifact, inspect_exception, and run_duplicate_check to gather evidence before deciding.\n"
+         "Only use evidence_refs from known_refs. Invalid refs will be penalized.\n"
+         "Only add notes or resolutions when you have enough visible evidence to support them.\n"
+         "route_to means the next owner or follow-up queue for the action. Use it whenever another queue must act, including hold actions that still need follow-up.\n"
+         "Line resolutions describe content/payment readiness for each line. Header resolution describes whether any payment can be released now.\n"
+         "A real case-level blocker can justify hold_full_invoice or escalate_case even when some lines are approved.\n"
+         "Submit only when the current draft state is coherent or when no better action remains.\n\n"
+         f"Allowed action_type values: {[action.value for action in ActionType]}\n"
+         f"Allowed match_strategy values: {[strategy.value for strategy in DuplicateMatchStrategy]}\n"
+         f"Allowed disposition values: {[disposition.value for disposition in Disposition]}\n"
+         f"Allowed payment_recommendation values: {[recommendation.value for recommendation in PaymentRecommendation]}\n"
+         f"Allowed route_to values: {[route.value for route in RouteTarget]}\n"
+         f"Allowed note_type values: {[note_type.value for note_type in NoteType]}\n"
+         f"Allowed reason_codes values: {[reason.value for reason in ReasonCode]}\n"
+         "Action JSON templates (replace angle-bracket placeholders with real values from the observation; omit optional fields when unused):\n"
+         '{"action_type":"open_artifact","artifact_id":"<artifact_id>"}\n'
+         '{"action_type":"inspect_exception","exception_id":"<exception_id>"}\n'
+         '{"action_type":"run_duplicate_check","match_strategy":"normalized_invoice_no"}\n'
+         '{"action_type":"set_line_resolution","line_id":"<line_id>","disposition":"<disposition>","reason_codes":["<reason_code>"],"evidence_refs":["<known_ref>"],"route_to":"<optional_route_target>"}\n'
+         '{"action_type":"set_header_resolution","payment_recommendation":"<payment_recommendation>","reason_codes":["<reason_code>"],"evidence_refs":["<known_ref>"],"route_to":"<optional_route_target>"}\n'
+         '{"action_type":"add_note","note_type":"<note_type>","reason_codes":["<reason_code>"],"evidence_refs":["<known_ref>"],"text":"<brief_handoff_note>"}\n'
+         '{"action_type":"submit_case"}\n\n'
+         f"Observation snapshot:\n{json.dumps(snapshot, indent=2)}"
+     )
+
+
+ def request_action_from_model(
+     openai_client: OpenAI,
+     *,
+     observation: InvoiceOpsObservation,
+     memory: ObservationMemory,
+ ) -> tuple[InvoiceOpsAction | None, list[dict[str, Any]]]:
+     return _query_model_json(
+         openai_client,
+         system_prompt=(
+             "You are a deterministic AP invoice reviewer acting in an environment. "
+             "Return exactly one valid JSON action and nothing else."
+         ),
+         user_prompt=build_action_prompt(observation, memory),
+         validator=_parse_action_payload,
+         retry_feedback=(
+             "Return exactly one action object with action_type and only the fields required for that action. "
+             'Examples: {"action_type":"open_artifact","artifact_id":"art-invoice"} '
+             'or {"action_type":"submit_case"}. '
+             "Do not output a plan or multiple actions."
+         ),
+     )
+
+
+ def run_task(
+     env: Any,
+     openai_client: OpenAI,
+     task_id: TaskId,
+     trace: EpisodeTrace,
+ ) -> dict[str, Any]:
+     try:
+         reset_result = env.reset(task_id=task_id.value)
+         observation = reset_result.observation
+         initial_queue_card = observation.queue_card
+         memory = ObservationMemory()
+         update_memory(memory, observation)
+
+         model_attempts: list[dict[str, Any]] = []
+         action_history: list[dict[str, Any]] = []
+         used_fallback = False
+         decision_parsed = True
+         failure_reason: str | None = None
+
+         while not observation.done:
+             action, attempts = request_action_from_model(
+                 openai_client,
+                 observation=observation,
+                 memory=memory,
+             )
+             model_attempts.append(
+                 {
+                     "turn_index": len(model_attempts) + 1,
+                     "attempts": attempts,
+                 }
+             )
+
+             if action is None:
+                 used_fallback = True
+                 decision_parsed = False
+                 failure_reason = (
+                     attempts[-1]["failure_reason"] if attempts else "no_attempt"
+                 )
+                 action = InvoiceOpsAction(action_type=ActionType.SUBMIT_CASE)
+                 model_attempts[-1]["fallback_action"] = action.model_dump(
+                     mode="json",
+                     exclude_none=True,
+                 )
+
+             previous_invalid_actions = observation.progress.invalid_actions
+             result = env.step(action)
+             reward = float(result.reward or 0.0)
+             trace.steps_taken += 1
+             trace.rewards.append(reward)
+             log_step(
+                 trace.steps_taken,
+                 format_action_for_log(action),
+                 reward,
+                 bool(result.done),
+                 _extract_step_error(
+                     result.observation,
+                     previous_invalid_actions=previous_invalid_actions,
+                 ),
+             )
+             action_history.append(
+                 {
+                     "step": trace.steps_taken,
+                     "action": action.model_dump(mode="json", exclude_none=True),
+                     "reward": reward,
+                     "done": bool(result.done),
+                     "message": result.observation.message,
+                 }
+             )
+             observation = result.observation
+             update_memory(memory, observation)
+
+         raw_score = float(observation.episode_score or 0.0)
+         score = strict_task_score(raw_score, used_fallback=used_fallback)
+         return {
+             "task_id": task_id.value,
+             "queue_card": (
+                 initial_queue_card.model_dump(mode="json")
+                 if initial_queue_card is not None
+                 else None
+             ),
+             "decision_parsed": decision_parsed,
+             "used_fallback": used_fallback,
+             "failure_reason": failure_reason,
+             "parsed_line_count": len(observation.draft_line_resolutions),
+             "parsed_header_resolution": observation.draft_header_resolution is not None,
+             "model_attempts": model_attempts,
+             "action_history": action_history,
+             "raw_score": raw_score,
+             "score": score,
+             "steps_used": trace.steps_taken,
+             "reward_trace": trace.rewards,
+             "submission_report": (
+                 observation.submission_report.model_dump(mode="json")
+                 if observation.submission_report is not None
+                 else None
+             ),
+             "error": None,
+         }
+     except Exception as exc:
+         return {
+             "task_id": task_id.value,
+             "queue_card": None,
+             "decision_parsed": False,
+             "used_fallback": False,
+             "failure_reason": "task_execution_error",
+             "parsed_line_count": 0,
+             "parsed_header_resolution": False,
+             "model_attempts": [],
+             "action_history": [],
+             "raw_score": 0.0,
+             "score": 0.0,
+             "steps_used": trace.steps_taken,
+             "reward_trace": trace.rewards,
+             "submission_report": None,
+             "error": str(exc),
+         }
+
+
+ def main() -> None:
+     api_key, api_key_source = resolve_api_key()
+     api_base_url = API_BASE_URL
+
+     if not api_key:
+         raise RuntimeError("Set HF_TOKEN before running inference.py.")
+
+     openai_client = OpenAI(api_key=api_key, base_url=api_base_url)
+
+     run_id, output_path = build_output_path(MODEL_NAME)
+     results: list[dict[str, Any]] = []
+
+     for task_id in TASKS:
+         trace = EpisodeTrace()
+         log_start(task=task_id.value, env=BENCHMARK, model=MODEL_NAME)
+         task_result: dict[str, Any] | None = None
+         try:
+             with InvoiceOpsEnv(base_url=ENV_URL).sync() as env:
+                 task_result = run_task(env, openai_client, task_id, trace)
+         finally:
+             score = float(task_result["score"]) if task_result is not None else 0.0
+             success = task_result is not None and task_result.get("error") is None
+             log_end(
+                 success=success,
+                 steps=trace.steps_taken,
+                 score=score,
+                 rewards=trace.rewards,
+             )
+
+         assert task_result is not None
+         results.append(task_result)
+         sys.stderr.write(
+             f"{task_id.value}: score={task_result['score']:.4f} "
+             f"raw_score={task_result.get('raw_score', task_result['score']):.4f} "
+             f"fallback={str(task_result['used_fallback']).lower()}\n"
+         )
+
+     mean_score = sum(result["score"] for result in results) / len(results)
+     raw_mean_score = sum(
+         result.get("raw_score", result["score"]) for result in results
+     ) / len(results)
+     payload = {
+         "run_id": run_id,
+         "model_name": MODEL_NAME,
+         "env_url": ENV_URL,
+         "api_base_url": api_base_url,
+         "api_key_source": api_key_source,
+         "raw_mean_score": round(raw_mean_score, 4),
+         "mean_score": round(mean_score, 4),
+         "strict_baseline_scoring": _env_flag("STRICT_BASELINE_SCORING", True),
+         "results": results,
+     }
+     output_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
+     sys.stderr.write(
+         f"mean_score={mean_score:.4f} raw_mean_score={raw_mean_score:.4f}\n"
+     )
+     sys.stderr.write(f"wrote={output_path}\n")
+
+
+ if __name__ == "__main__":
+     main()
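The action templates listed in `build_action_prompt` imply a small per-action contract: each `action_type` has a set of keys the model must supply. The sketch below shows one way such a reply could be pre-checked with only the standard library. It is illustrative and is not the repository's `_parse_action_payload`, which is not shown in this part of the diff; the `REQUIRED_FIELDS` table, the `check_action_payload` name, and the failure-reason strings are all assumptions made for this example.

```python
from __future__ import annotations

import json

# Required keys per action_type, read off the JSON templates in the prompt.
# Hypothetical table for illustration; optional fields such as route_to are omitted.
REQUIRED_FIELDS: dict[str, set[str]] = {
    "open_artifact": {"artifact_id"},
    "inspect_exception": {"exception_id"},
    "run_duplicate_check": {"match_strategy"},
    "set_line_resolution": {"line_id", "disposition", "reason_codes", "evidence_refs"},
    "set_header_resolution": {"payment_recommendation", "reason_codes", "evidence_refs"},
    "add_note": {"note_type", "reason_codes", "evidence_refs", "text"},
    "submit_case": set(),
}


def check_action_payload(raw_text: str) -> tuple[dict | None, str | None]:
    """Return (payload, None) when the reply looks valid, else (None, reason)."""
    try:
        payload = json.loads(raw_text)
    except json.JSONDecodeError:
        return None, "invalid_json"
    if not isinstance(payload, dict):
        return None, "not_an_object"
    action_type = payload.get("action_type")
    if not isinstance(action_type, str) or action_type not in REQUIRED_FIELDS:
        return None, "unknown_action_type"
    # Report every missing key at once so retry feedback can be specific.
    missing = REQUIRED_FIELDS[action_type] - set(payload)
    if missing:
        return None, "missing_fields:" + ",".join(sorted(missing))
    return payload, None
```

A check like this only covers shape. Enum membership and the `extra="forbid"` rule would still be enforced when the payload is loaded into `InvoiceOpsAction`.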
models.py ADDED
@@ -0,0 +1,583 @@
+ """Typed models for the InvoiceOps environment."""
+
+ from __future__ import annotations
+
+ from enum import Enum
+
+ from openenv.core.env_server.types import Action, Observation, State
+ from pydantic import BaseModel, ConfigDict, Field, model_validator
+
+
+ class Model(BaseModel):
+     model_config = ConfigDict(
+         extra="forbid",
+         validate_assignment=True,
+         arbitrary_types_allowed=True,
+     )
+
+
+ class TaskId(str, Enum):
+     EASY = "easy"
+     MEDIUM = "medium"
+     MEDIUM_PLUS = "medium_plus"
+     HARD = "hard"
+
+
+ class ActionType(str, Enum):
+     OPEN_ARTIFACT = "open_artifact"
+     INSPECT_EXCEPTION = "inspect_exception"
+     RUN_DUPLICATE_CHECK = "run_duplicate_check"
+     ADD_NOTE = "add_note"
+     SET_LINE_RESOLUTION = "set_line_resolution"
+     SET_HEADER_RESOLUTION = "set_header_resolution"
+     SUBMIT_CASE = "submit_case"
+
+
+ class ArtifactType(str, Enum):
+     INVOICE_PACKET = "invoice_packet"
+     PURCHASE_ORDER = "purchase_order"
+     RECEIPT_LOG = "receipt_log"
+     VENDOR_MASTER = "vendor_master"
+     POLICY_CARD = "policy_card"
+     APPROVAL_ARTIFACT = "approval_artifact"
+     INVOICE_HISTORY = "invoice_history"
+
+
+ class ExceptionType(str, Enum):
+     RECEIPT_QUANTITY_VARIANCE = "receipt_quantity_variance"
+     NON_PO_MISSING_APPROVAL = "non_po_missing_approval"
+     POSSIBLE_DUPLICATE = "possible_duplicate"
+     PRICE_VARIANCE = "price_variance"
+     CUMULATIVE_BILLING_VARIANCE = "cumulative_billing_variance"
+     TAX_VARIANCE = "tax_variance"
+     PAYMENT_TERMS_MISMATCH = "payment_terms_mismatch"
+
+
+ class Severity(str, Enum):
+     LOW = "low"
+     MEDIUM = "medium"
+     HIGH = "high"
+     CRITICAL = "critical"
+
+
+ class DuplicateMatchStrategy(str, Enum):
+     EXACT_INVOICE_NUMBER = "exact_invoice_no"
+     NORMALIZED_INVOICE_NUMBER = "normalized_invoice_no"
+     VENDOR_AMOUNT_DATE = "vendor_amount_date"
+
+
+ class NoteType(str, Enum):
+     ISSUE_SUMMARY = "issue_summary"
+     ESCALATION_REQUEST = "escalation_request"
+     REVIEW_SUMMARY = "review_summary"
+
+
+ class Disposition(str, Enum):
+     APPROVE = "approve"
+     HOLD = "hold"
+     REJECT = "reject"
+     ESCALATE = "escalate"
+
+
+ class PaymentRecommendation(str, Enum):
+     RELEASE_APPROVED_LINES = "release_approved_lines"
+     HOLD_FULL_INVOICE = "hold_full_invoice"
+     REJECT_FULL_INVOICE = "reject_full_invoice"
+     ESCALATE_CASE = "escalate_case"
+
+
+ class DecisionBand(str, Enum):
+     BEST = "best"
+     SAFE_SUBOPTIMAL = "safe_suboptimal"
+     WRONG = "wrong"
+     UNSAFE = "unsafe"
+
+
+ class RouteTarget(str, Enum):
+     RECEIVING = "receiving"
+     REQUESTER = "requester"
+     PROCUREMENT = "procurement"
+     TAX = "tax"
+     AP_MANAGER = "ap_manager"
+
+
+ class RiskFlag(str, Enum):
+     PO_INVOICE = "po_invoice"
+     RECEIPT_VARIANCE = "receipt_variance"
+     PARTIAL_RECEIPT = "partial_receipt"
+     PRICE_VARIANCE = "price_variance"
+     NON_PO_INVOICE = "non_po_invoice"
+     MISSING_APPROVAL = "missing_approval"
+     POSSIBLE_DUPLICATE = "possible_duplicate"
+     CUMULATIVE_BILLING_RISK = "cumulative_billing_risk"
+     TAX_VARIANCE = "tax_variance"
+     TERMS_MISMATCH = "terms_mismatch"
+
+
+ class ReasonCode(str, Enum):
+     MATCHED_TO_PO_AND_RECEIPT = "matched_to_po_and_receipt"
+     RECEIPT_NOT_CONFIRMED = "receipt_not_confirmed"
+     PARTIAL_RECEIPT_PENDING = "partial_receipt_pending"
+     PRICE_EXCEEDS_PO_RATE = "price_exceeds_po_rate"
+     NON_PO_APPROVAL_MISSING = "non_po_approval_missing"
+     POSSIBLE_DUPLICATE_REVIEW = "possible_duplicate_review"
+     CUMULATIVE_BILLING_EXCEEDS_PO = "cumulative_billing_exceeds_po"
+     TAX_AMOUNT_MISMATCH = "tax_amount_mismatch"
+     PAYMENT_TERMS_MISMATCH = "payment_terms_mismatch"
+     SAFE_TO_PAY = "safe_to_pay"
+     ESCALATE_FOR_MANUAL_REVIEW = "escalate_for_manual_review"
+
+
+ class ArtifactField(Model):
+     label: str = Field(..., description="Field label shown to the reviewer")
+     value: str = Field(..., description="Rendered field value")
+
+
+ class ArtifactLineItem(Model):
+     line_id: str = Field(..., description="Stable line identifier")
+     description: str = Field(..., description="Line description")
+     quantity: float | None = Field(default=None, description="Line quantity")
+     unit_price: float | None = Field(default=None, description="Unit price")
+     amount: float | None = Field(
+         default=None,
+         description="Extended amount for the line when the artifact exposes it",
+     )
+     status: str = Field(default="", description="Operational status")
+     notes: str = Field(default="", description="Short line note")
+
+
+ class ArtifactEvent(Model):
+     event_id: str = Field(..., description="Stable event identifier")
+     event_type: str = Field(..., description="Event type label")
+     event_date: str = Field(..., description="Event date in ISO format")
+     description: str = Field(..., description="Human readable event description")
+     quantity: float | None = Field(default=None, description="Event quantity")
+     amount: float | None = Field(default=None, description="Event amount")
+     status: str = Field(default="", description="Event status")
+
+
+ class ArtifactReference(Model):
+     artifact_id: str = Field(..., description="Artifact identifier")
+     artifact_type: ArtifactType = Field(..., description="Artifact type")
+     title: str = Field(..., description="Artifact title shown in the UI")
+
+
+ class ArtifactView(ArtifactReference):
+     summary: str = Field(default="", description="Short artifact summary")
+     fields: list[ArtifactField] = Field(
+         default_factory=list,
+         description="Structured key-value pairs exposed by the artifact",
+     )
+     line_items: list[ArtifactLineItem] = Field(
+         default_factory=list,
+         description="Line items exposed by the artifact",
+     )
+     events: list[ArtifactEvent] = Field(
+         default_factory=list,
+         description="Timeline or ledger events exposed by the artifact",
+     )
+     related_refs: list[str] = Field(
+         default_factory=list,
+         description="Related artifact or issue identifiers",
+     )
+
+
+ class QueueCard(Model):
+     case_id: str = Field(..., description="Stable case identifier")
+     vendor_name: str = Field(..., description="Vendor display name")
+     vendor_id: str = Field(..., description="Vendor identifier")
+     invoice_number: str = Field(..., description="Invoice number")
+     invoice_date: str = Field(..., description="Invoice date in ISO format")
+     invoice_total: float = Field(..., description="Gross invoice total")
+     currency: str = Field(..., description="Invoice currency")
+     po_number: str | None = Field(default=None, description="PO number when present")
+     risk_flags: list[RiskFlag] = Field(
+         default_factory=list,
+         description="Compact risk hints visible from the queue",
+     )
+     summary: str = Field(default="", description="Short queue summary")
+
+
+ class ExceptionSummary(Model):
+     exception_id: str = Field(..., description="Stable exception identifier")
+     exception_type: ExceptionType = Field(..., description="Exception category")
+     severity: Severity = Field(..., description="Exception severity")
+     headline: str = Field(..., description="Queue-visible exception stub headline")
+     impacted_line_ids: list[str] = Field(
+         default_factory=list,
+         description="Invoice lines directly impacted by the exception",
+     )
+     short_description: str = Field(
+         default="",
+         description="Queue-safe hint shown before inspection",
+     )
+
+
+ class ExceptionDetail(ExceptionSummary):
+     fields: list[ArtifactField] = Field(
+         default_factory=list,
+         description="Structured exception facts shown after inspection",
+     )
+     reviewer_guidance: str = Field(
+         default="",
+         description="Short workflow guidance exposed after inspection",
+     )
+
+
+ class DuplicateCandidate(Model):
+     candidate_id: str = Field(..., description="Ledger invoice identifier")
+     vendor_name: str = Field(..., description="Vendor display name")
+     invoice_number: str = Field(..., description="Prior or pending invoice number")
+     invoice_date: str = Field(..., description="Candidate invoice date")
+     gross_amount: float = Field(..., description="Candidate gross amount")
+     status: str = Field(..., description="Current ledger or workflow status")
+     match_basis: str = Field(..., description="Why the invoice was matched")
+     overlap_summary: str = Field(..., description="Human readable overlap summary")
+     supported_match_strategies: list[DuplicateMatchStrategy] = Field(
+         default_factory=list,
+         description="Match strategies that surface this candidate",
+     )
+
+
+ class CaseNote(Model):
+     note_id: str = Field(..., description="Stable note identifier")
+     note_type: NoteType = Field(..., description="Workflow note category")
+     reason_codes: list[ReasonCode] = Field(
+         default_factory=list,
+         description="Structured reason codes captured in the note",
+     )
+     evidence_refs: list[str] = Field(
+         default_factory=list,
+         description="Artifact or exception references cited in the note",
+     )
+     text: str = Field(
+         ...,
+         description="Free-form note text retained for auditability, not prose-quality scoring",
+     )
+     saved_at_step: int = Field(..., ge=0, description="Step where the note was saved")
+
+
+ class LineResolution(Model):
+     resolution_id: str = Field(..., description="Stable resolution identifier")
+     line_id: str = Field(..., description="Invoice line identifier")
+     disposition: Disposition = Field(..., description="Line disposition")
+     reason_codes: list[ReasonCode] = Field(
+         default_factory=list,
+         description="Structured reason codes supporting the line disposition",
+     )
+     evidence_refs: list[str] = Field(
+         default_factory=list,
+         description="Artifact or exception references cited by the reviewer",
+     )
+     route_to: RouteTarget | None = Field(
+         default=None,
+         description="Next owner or follow-up queue for the line when another team must act",
+     )
+     saved_at_step: int = Field(
+         ...,
+         ge=0,
+         description="Step where the line disposition was saved",
+     )
+
+
+ class HeaderResolution(Model):
+     resolution_id: str = Field(..., description="Stable header resolution identifier")
+     payment_recommendation: PaymentRecommendation = Field(
+         ...,
+         description=(
+             "Header-level payment recommendation governing whether any payment can "
+             "be released now, including case-level blockers that may override "
+             "otherwise approved lines"
+         ),
+     )
+     reason_codes: list[ReasonCode] = Field(
+         default_factory=list,
+         description="Structured reason codes for the header recommendation",
+     )
+     evidence_refs: list[str] = Field(
+         default_factory=list,
+         description="Artifact or exception references cited by the reviewer",
+     )
+     route_to: RouteTarget | None = Field(
+         default=None,
+         description="Next owner or follow-up queue for the case when another team must act",
+     )
+     saved_at_step: int = Field(
+         ...,
+         ge=0,
+         description="Step where the header recommendation was saved",
+     )
+
+
+ class LineScoreReport(Model):
+     line_id: str
+     line_score: float
+     disposition_score: float
+     reason_score: float
+     route_score: float
+     evidence_score: float
+     accepted_dispositions: list[Disposition] = Field(default_factory=list)
+
+
+ class HeaderScoreReport(Model):
+     header_score: float
+     recommendation_score: float
+     reason_score: float
+     route_score: float
+     evidence_score: float
+     accepted_recommendations: list[PaymentRecommendation] = Field(default_factory=list)
+
+
+ class IssueNoteReport(Model):
+     issue_id: str
+     note_score: float
+     reason_score: float
+     evidence_score: float
+
+
+ class SubmissionReport(Model):
+     decision_band: DecisionBand
+     total_score: float = Field(..., ge=0.0, le=1.0)
+     core_decision_score: float = Field(..., ge=0.0, le=1.0)
+     reason_quality_score: float = Field(..., ge=0.0, le=1.0)
+     auxiliary_score: float = Field(..., ge=0.0, le=1.0)
+     resolution_score: float = Field(..., ge=0.0, le=1.0)
+     evidence_score: float = Field(..., ge=0.0, le=1.0)
+     documentation_score: float = Field(..., ge=0.0, le=1.0)
+     efficiency_score: float = Field(..., ge=0.0, le=1.0)
+     safety_cap_applied: float | None = Field(
+         default=None,
+         description="Cap value applied because the action set was unsafe",
+     )
+     unsafe_findings: list[str] = Field(
+         default_factory=list,
+         description="Unsafe findings surfaced by the grader",
+     )
+     line_reports: list[LineScoreReport] = Field(default_factory=list)
+     header_report: HeaderScoreReport | None = None
+     note_reports: list[IssueNoteReport] = Field(default_factory=list)
+
+
+ class Progress(Model):
+     steps_used: int = Field(..., ge=0, description="Steps used in the episode")
+     steps_remaining: int = Field(..., ge=0, description="Steps remaining")
+     opened_artifacts: int = Field(..., ge=0, description="Unique artifacts opened")
+     inspected_exceptions: int = Field(
+         ...,
+         ge=0,
+         description="Unique exceptions inspected",
+     )
+     notes_count: int = Field(..., ge=0, description="Saved notes")
+     line_resolutions: int = Field(..., ge=0, description="Saved line resolutions")
+     duplicate_checks_run: int = Field(
+         ...,
+         ge=0,
+         description="Duplicate check actions executed",
+     )
+     invalid_actions: int = Field(..., ge=0, description="Invalid actions taken")
+     redundant_actions: int = Field(..., ge=0, description="Redundant actions taken")
+     submitted: bool = Field(default=False, description="Whether the case is submitted")
+
+
+ class InvoiceOpsObservation(Observation):
+     message: str = Field(default="", description="Short environment message")
+     task_id: TaskId | None = Field(default=None, description="Task bucket for the case")
+     scenario_id: str | None = Field(default=None, description="Scenario identifier")
+     title: str = Field(default="", description="Case title")
+     description: str = Field(default="", description="Case description")
+     queue_card: QueueCard | None = Field(
+         default=None,
+         description="Queue-level summary of the current invoice case",
+     )
+     available_artifacts: list[ArtifactReference] = Field(
+         default_factory=list,
+         description="Artifacts currently available to the reviewer",
+     )
+     opened_artifact: ArtifactView | None = Field(
+         default=None,
+         description="Most recently opened artifact",
+     )
+     visible_exceptions: list[ExceptionSummary] = Field(
+         default_factory=list,
+         description="Exception stubs visible from the queue before detailed inspection",
+     )
+     inspected_exception: ExceptionDetail | None = Field(
+         default=None,
+         description="Full detail of the most recently inspected exception",
+     )
+     duplicate_candidates: list[DuplicateCandidate] = Field(
+         default_factory=list,
+         description="Candidates surfaced by duplicate search",
+     )
+     draft_notes: list[CaseNote] = Field(
+         default_factory=list,
+         description="Saved case notes",
+     )
+     draft_line_resolutions: list[LineResolution] = Field(
+         default_factory=list,
+         description="Draft line resolutions saved so far",
+     )
+     draft_header_resolution: HeaderResolution | None = Field(
+         default=None,
+         description="Draft header recommendation if saved",
+     )
+     submission_report: SubmissionReport | None = Field(
+         default=None,
+         description="Deterministic grading report after submission",
+     )
+     progress: Progress = Field(
+         default_factory=lambda: Progress(
+             steps_used=0,
+             steps_remaining=0,
+             opened_artifacts=0,
+             inspected_exceptions=0,
+             notes_count=0,
+             line_resolutions=0,
+             duplicate_checks_run=0,
+             invalid_actions=0,
+             redundant_actions=0,
+             submitted=False,
+         ),
+         description="Episode progress counters",
+     )
+     known_refs: list[str] = Field(
+         default_factory=list,
+         description="Evidence refs that can be cited safely in notes or resolutions",
+     )
+     episode_score: float | None = Field(
+         default=None,
+         description="Final episode score when the case is done",
+     )
+
+
+ class InvoiceOpsState(State):
+     task_id: TaskId | None = Field(default=None, description="Task bucket")
+     scenario_id: str | None = Field(default=None, description="Scenario identifier")
+     case_id: str | None = Field(default=None, description="Case identifier")
+     current_artifact_id: str | None = Field(
+         default=None,
+         description="Most recently opened artifact",
+     )
+     submitted: bool = Field(default=False, description="Whether the case is submitted")
+     step_limit: int = Field(default=0, ge=0, description="Episode step budget")
+     duplicate_checks_run: int = Field(
+         default=0,
+         ge=0,
+         description="Number of duplicate checks executed",
+     )
+     invalid_actions: int = Field(
+         default=0,
+         ge=0,
+         description="Number of invalid actions taken",
+     )
+     redundant_actions: int = Field(
+         default=0,
+         ge=0,
+         description="Number of redundant actions taken",
+     )
+
+
+ class InvoiceOpsAction(Action):
+     action_type: ActionType = Field(..., description="Action to execute")
+     artifact_id: str | None = Field(default=None, description="Artifact to open")
+     exception_id: str | None = Field(default=None, description="Exception to inspect")
+     match_strategy: DuplicateMatchStrategy | None = Field(
+         default=None,
+         description="Duplicate search strategy to run",
+     )
+     note_type: NoteType | None = Field(default=None, description="Case note type")
+     reason_codes: list[ReasonCode] = Field(
+         default_factory=list,
+         description="Structured reason codes carried by the action",
+     )
+     evidence_refs: list[str] = Field(
+         default_factory=list,
+         description="Artifact or exception refs supporting the action",
+     )
+     text: str | None = Field(default=None, description="Free-form note text")
+     line_id: str | None = Field(default=None, description="Invoice line identifier")
+     disposition: Disposition | None = Field(default=None, description="Line outcome")
+     payment_recommendation: PaymentRecommendation | None = Field(
+         default=None,
+         description="Header-level payment recommendation",
+     )
+     route_to: RouteTarget | None = Field(
+         default=None,
+         description="Next owner or follow-up queue for the action, when applicable",
+     )
+     note_ids: list[str] = Field(
+         default_factory=list,
+         description="Optional note identifiers to submit",
+     )
+     line_resolution_ids: list[str] = Field(
+         default_factory=list,
+         description="Optional line resolution identifiers to submit",
+     )
+     header_resolution_id: str | None = Field(
+         default=None,
+         description="Optional header resolution identifier to submit",
+     )
+
+     @model_validator(mode="after")
+     def validate_action_fields(self) -> "InvoiceOpsAction":
+         action_type = self.action_type
+
+         if action_type is ActionType.OPEN_ARTIFACT:
+             if not self.artifact_id:
+                 raise ValueError("artifact_id is required for open_artifact")
+             return self
+
+         if action_type is ActionType.INSPECT_EXCEPTION:
+             if not self.exception_id:
+                 raise ValueError("exception_id is required for inspect_exception")
+             return self
+
+         if action_type is ActionType.RUN_DUPLICATE_CHECK:
+             if self.match_strategy is None:
+                 raise ValueError("match_strategy is required for run_duplicate_check")
+             return self
+
+         if action_type is ActionType.ADD_NOTE:
+             if self.note_type is None:
+                 raise ValueError("note_type is required for add_note")
+             if not self.reason_codes:
+                 raise ValueError("reason_codes are required for add_note")
+             if not self.evidence_refs:
+                 raise ValueError("evidence_refs are required for add_note")
+             if not self.text or not self.text.strip():
+                 raise ValueError("text is required for add_note")
+             return self
+
+         if action_type is ActionType.SET_LINE_RESOLUTION:
+             if not self.line_id:
+                 raise ValueError("line_id is required for set_line_resolution")
+             if self.disposition is None:
+                 raise ValueError("disposition is required for set_line_resolution")
+             if not self.reason_codes:
+                 raise ValueError("reason_codes are required for set_line_resolution")
+             if not self.evidence_refs:
+                 raise ValueError("evidence_refs are required for set_line_resolution")
+             if self.disposition is Disposition.ESCALATE and self.route_to is None:
+                 raise ValueError("route_to is required when escalating a line")
+             return self
+
+         if action_type is ActionType.SET_HEADER_RESOLUTION:
+             if self.payment_recommendation is None:
+                 raise ValueError(
+                     "payment_recommendation is required for set_header_resolution"
+                 )
+             if not self.reason_codes:
+                 raise ValueError("reason_codes are required for set_header_resolution")
+             if not self.evidence_refs:
+                 raise ValueError("evidence_refs are required for set_header_resolution")
+             if (
+                 self.payment_recommendation is PaymentRecommendation.ESCALATE_CASE
+                 and self.route_to is None
+             ):
+                 raise ValueError("route_to is required when escalating the case")
+             return self
+
+         if action_type is ActionType.SUBMIT_CASE:
+             return self
+
+         raise ValueError(f"Unsupported action_type: {action_type}")
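The required-field rules enforced by `validate_action_fields` can be read as a lookup table. The block below is a minimal standard-library sketch of that idea, not the repository's code: `REQUIRED_FIELDS` and `missing_fields` are hypothetical names, the action-type strings are assumed from the validator's error messages (`submit_case` is a guess at the value of `ActionType.SUBMIT_CASE`), and the conditional `route_to` rule for escalations is left out.

```python
# Hypothetical table: which fields each action type must carry.
# Keys are assumed string values of ActionType, not confirmed by the source.
REQUIRED_FIELDS: dict[str, tuple[str, ...]] = {
    "open_artifact": ("artifact_id",),
    "inspect_exception": ("exception_id",),
    "run_duplicate_check": ("match_strategy",),
    "add_note": ("note_type", "reason_codes", "evidence_refs", "text"),
    "set_line_resolution": ("line_id", "disposition", "reason_codes", "evidence_refs"),
    "set_header_resolution": ("payment_recommendation", "reason_codes", "evidence_refs"),
    "submit_case": (),
}


def missing_fields(action: dict) -> list[str]:
    """Return the required fields that are absent or empty for this action."""
    action_type = action.get("action_type")
    if action_type not in REQUIRED_FIELDS:
        raise ValueError(f"Unsupported action_type: {action_type}")
    missing = []
    for name in REQUIRED_FIELDS[action_type]:
        value = action.get(name)
        # None, "" and [] count as missing; note text must also be non-blank.
        if not value or (name == "text" and not value.strip()):
            missing.append(name)
    return missing
```

A table like this keeps the rules in one place, at the cost of the per-field error messages the explicit validator provides.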
openenv.yaml ADDED
@@ -0,0 +1,7 @@
+ spec_version: 1
+ name: invoiceops_env
+ type: space
+ runtime: fastapi
+ app: server.app:app
+ port: 8000
+
pyproject.toml ADDED
@@ -0,0 +1,34 @@
+ [build-system]
+ requires = ["setuptools>=45", "wheel"]
+ build-backend = "setuptools.build_meta"
+
+ [project]
+ name = "openenv-invoiceops_env"
+ version = "0.1.0"
+ description = "AP invoice exception handling environment for OpenEnv"
+ readme = "README.md"
+ requires-python = ">=3.10"
+ dependencies = [
+     "openenv-core[core]>=0.2.2,<0.3",
+     "fastapi>=0.115.0",
+     "pydantic>=2.0.0",
+     "uvicorn[standard]>=0.24.0",
+     "openai>=2.7.2",
+ ]
+
+ [project.optional-dependencies]
+ dev = [
+     "pytest>=8.0.0",
+     "pytest-cov>=4.0.0",
+ ]
+
+ [project.scripts]
+ server = "invoiceops_env.server.app:main"
+
+ [tool.setuptools]
+ include-package-data = true
+ packages = ["invoiceops_env", "invoiceops_env.server"]
+ package-dir = { "invoiceops_env" = ".", "invoiceops_env.server" = "server" }
+
+ [tool.setuptools.package-data]
+ invoiceops_env = ["data/**/*.json", "*.yaml", "*.md"]
server/__init__.py ADDED
@@ -0,0 +1,11 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """InvoiceOps environment server components."""
+
+ from invoiceops_env.server.invoiceops_env_environment import InvoiceOpsEnvironment
+
+ __all__ = ["InvoiceOpsEnvironment"]
server/app.py ADDED
@@ -0,0 +1,50 @@
+ """FastAPI entrypoint for InvoiceOps."""
+
+ try:
+     from openenv.core.env_server.http_server import create_app
+ except Exception as e:  # pragma: no cover
+     raise ImportError(
+         "openenv is required for the web interface. Install dependencies with '\n uv sync\n'"
+     ) from e
+
+ from invoiceops_env.models import InvoiceOpsAction, InvoiceOpsObservation
+ from invoiceops_env.server.invoiceops_env_environment import InvoiceOpsEnvironment
+
+
+ app = create_app(
+     InvoiceOpsEnvironment,
+     InvoiceOpsAction,
+     InvoiceOpsObservation,
+     env_name="invoiceops_env",
+     max_concurrent_envs=4,
+ )
+
+
+ def _resolve_cli_args(
+     default_host: str = "0.0.0.0",
+     default_port: int = 8000,
+ ) -> tuple[str, int]:
+     import argparse
+
+     parser = argparse.ArgumentParser(add_help=False)
+     parser.add_argument("--host", default=default_host)
+     parser.add_argument("--port", type=int, default=default_port)
+     args, _ = parser.parse_known_args()
+     return args.host, args.port
+
+
+ def main(host: str | None = None, port: int | None = None) -> None:
+     """Run the server directly via ``uv run --project . server``."""
+     import uvicorn
+
+     if host is None and port is None:
+         host, port = _resolve_cli_args()
+     else:
+         host = host or "0.0.0.0"
+         port = port or 8000
+
+     uvicorn.run(app, host=host, port=port)
+
+
+ if __name__ == "__main__":
+     main()
server/fixtures.py ADDED
@@ -0,0 +1,32 @@
+ """Fixture constants for InvoiceOps."""
+
+ from __future__ import annotations
+
+ from pathlib import Path
+
+ from invoiceops_env.models import TaskId
+
+
+ PACKAGE_ROOT = Path(__file__).resolve().parents[1]
+ DATA_DIR = PACKAGE_ROOT / "data"
+ SCENARIOS_DIR = DATA_DIR / "scenarios"
+
+ ENV_DESCRIPTION = (
+     "Document-centric AP invoice exception handling environment with deterministic "
+     "grading for non-PO routing, duplicate-evidence review, partial-release "
+     "judgment, chronology-aware exception handling, and safe payment-release "
+     "decisions."
+ )
+
+ SCENARIOS_BY_TASK: dict[TaskId, tuple[str, ...]] = {
+     TaskId.EASY: ("easy",),
+     TaskId.MEDIUM: ("medium",),
+     TaskId.MEDIUM_PLUS: ("medium_plus",),
+     TaskId.HARD: ("hard",),
+ }
+
+ DEFAULT_SCENARIOS: dict[TaskId, str] = {
+     task: scenario_ids[0] for task, scenario_ids in SCENARIOS_BY_TASK.items()
+ }
+
+ DUPLICATE_CHECK_REF_PREFIX = "duplicate_check:"
server/grader.py ADDED
@@ -0,0 +1,712 @@
+ """Deterministic grading for InvoiceOps cases."""
+
+ from __future__ import annotations
+
+ from dataclasses import dataclass
+
+ from invoiceops_env.models import (
+     CaseNote,
+     DecisionBand,
+     Disposition,
+     HeaderResolution,
+     HeaderScoreReport,
+     IssueNoteReport,
+     LineResolution,
+     LineScoreReport,
+     PaymentRecommendation,
+     ReasonCode,
+     SubmissionReport,
+ )
+ from invoiceops_env.server.scenario_loader import (
+     HeaderExpectation,
+     NoteExpectation,
+     ResolutionExpectation,
+     ScenarioFixture,
+ )
+
+
+ LINE_DISPOSITION_WEIGHT = 0.55
+ LINE_REASON_WEIGHT = 0.15
+ LINE_ROUTE_WEIGHT = 0.30
+
+ HEADER_RECOMMENDATION_WEIGHT = 0.55
+ HEADER_REASON_WEIGHT = 0.15
+ HEADER_ROUTE_WEIGHT = 0.30
+
+ NOTE_REASON_WEIGHT = 0.65
+ NOTE_EVIDENCE_WEIGHT = 0.35
+
+ AUX_REASON_QUALITY_WEIGHT = 0.25
+ AUX_EVIDENCE_WEIGHT = 0.45
+ AUX_DOCUMENTATION_WEIGHT = 0.20
+ AUX_EFFICIENCY_WEIGHT = 0.10
+
+ BAND_CORE_WEIGHT = 0.60
+ BAND_AUXILIARY_WEIGHT = 0.40
+
+ BAND_RANGES: dict[DecisionBand, tuple[float, float]] = {
+     DecisionBand.BEST: (0.80, 1.00),
+     DecisionBand.SAFE_SUBOPTIMAL: (0.50, 0.79),
+     DecisionBand.WRONG: (0.05, 0.45),
+     DecisionBand.UNSAFE: (0.00, 0.15),
+ }
+
+ HEADER_TO_LINE_DISPOSITION: dict[PaymentRecommendation, Disposition] = {
+     PaymentRecommendation.RELEASE_APPROVED_LINES: Disposition.APPROVE,
+     PaymentRecommendation.HOLD_FULL_INVOICE: Disposition.HOLD,
+     PaymentRecommendation.REJECT_FULL_INVOICE: Disposition.REJECT,
+     PaymentRecommendation.ESCALATE_CASE: Disposition.ESCALATE,
+ }
+
+ CONSERVATIVE_LINE_DISPOSITIONS = {
+     Disposition.HOLD,
+     Disposition.ESCALATE,
+     Disposition.REJECT,
+ }
+
+ CONSERVATIVE_HEADER_RECOMMENDATIONS = {
+     PaymentRecommendation.HOLD_FULL_INVOICE,
+     PaymentRecommendation.ESCALATE_CASE,
+     PaymentRecommendation.REJECT_FULL_INVOICE,
+ }
+
+
+ @dataclass(frozen=True)
+ class ReviewTrace:
+     ref_steps: dict[str, int]
+     steps_used: int
+     invalid_actions: int = 0
+     redundant_actions: int = 0
+
+
+ def _f1(predicted: set[str], expected: set[str]) -> float:
+     if not predicted and not expected:
+         return 1.0
+     if not predicted or not expected:
+         return 0.0
+
+     true_positives = len(predicted & expected)
+     precision = true_positives / len(predicted)
+     recall = true_positives / len(expected)
+     if precision + recall == 0:
+         return 0.0
+     return (2 * precision * recall) / (precision + recall)
+
+
+ def _reason_score(
+     reason_codes: list[ReasonCode] | None,
+     accepted_reason_sets: list[list[str]],
+ ) -> float:
+     if not accepted_reason_sets:
+         return 1.0
+     if not reason_codes:
+         return 0.0
+
+     predicted = {reason.value for reason in reason_codes}
+     return max(_f1(predicted, set(expected)) for expected in accepted_reason_sets)
+
+
+ def _route_score(route_to_value: str | None, accepted_routes: list[str]) -> float:
+     if not accepted_routes:
+         return 1.0
+     if route_to_value is None:
+         return 0.0
+     return 1.0 if route_to_value in accepted_routes else 0.0
+
+
+ def _normalized_weighted_score(components: list[tuple[float, float]]) -> float:
+     active_weight = sum(weight for _, weight in components if weight > 0.0)
+     if active_weight <= 0.0:
+         return 0.0
+     return sum(score * weight for score, weight in components if weight > 0.0) / active_weight
+
+
+ def _timely_refs(
+     cited_refs: list[str] | None,
+     ref_steps: dict[str, int],
+     saved_at_step: int | None,
+ ) -> set[str]:
+     if saved_at_step is None or not cited_refs:
+         return set()
+
+     timely_refs: set[str] = set()
+     for ref in cited_refs:
+         ref_step = ref_steps.get(ref)
+         if ref_step is not None and ref_step < saved_at_step:
+             timely_refs.add(ref)
+     return timely_refs
+
+
+ def _observed_refs_before_step(
+     ref_steps: dict[str, int],
+     saved_at_step: int | None,
+ ) -> set[str]:
+     if saved_at_step is None:
+         return set()
+     return {
+         ref
+         for ref, ref_step in ref_steps.items()
+         if ref_step < saved_at_step
+     }
+
+
+ def _evidence_score(
+     cited_refs: list[str] | None,
+     decisive_refs: list[str],
+     ref_steps: dict[str, int],
+     saved_at_step: int | None,
+ ) -> float:
+     if not decisive_refs:
+         return 1.0
+
+     timely_cited_refs = _timely_refs(cited_refs, ref_steps, saved_at_step)
+     if not timely_cited_refs:
+         return 0.0
+     return _f1(timely_cited_refs, set(decisive_refs))
+
+
+ def _gating_refs_satisfied(
+     gating_refs: list[str],
+     ref_steps: dict[str, int],
+     saved_at_step: int | None,
+ ) -> bool:
+     if not gating_refs:
+         return True
+     # Band gating depends on what the agent had already uncovered in time,
+     # not on whether every observed ref was restated in the saved action.
+     observed_refs = _observed_refs_before_step(ref_steps, saved_at_step)
+     return set(gating_refs).issubset(observed_refs)
+
+
+ def _accepted_dispositions(score_map: dict[str, float]) -> list[Disposition]:
+     return [
+         disposition
+         for disposition in Disposition
+         if score_map.get(disposition.value, 0.0) > 0.0
+     ]
+
+
+ def _accepted_recommendations(
+     score_map: dict[str, float],
+ ) -> list[PaymentRecommendation]:
+     return [
+         recommendation
+         for recommendation in PaymentRecommendation
+         if score_map.get(recommendation.value, 0.0) > 0.0
+     ]
+
+
+ def _max_positive_score(score_map: dict[str, float]) -> float:
+     positive_scores = [score for score in score_map.values() if score > 0.0]
+     return max(positive_scores) if positive_scores else 0.0
+
+
+ def _max_suboptimal_positive_score(score_map: dict[str, float]) -> float:
+     best_score = _max_positive_score(score_map)
+     suboptimal_positive_scores = [
+         score for score in score_map.values() if 0.0 < score < best_score
+     ]
+     return max(suboptimal_positive_scores) if suboptimal_positive_scores else 0.0
+
+
+ def _core_line_score(disposition_score: float, route_score: float, has_route: bool) -> float:
+     return _normalized_weighted_score(
+         [
+             (disposition_score, 0.70),
+             (route_score, 0.30 if has_route else 0.0),
+         ]
+     )
+
+
+ def _core_header_score(
+     recommendation_score: float,
+     route_score: float,
+     has_route: bool,
+ ) -> float:
+     return _normalized_weighted_score(
+         [
+             (recommendation_score, 0.70),
+             (route_score, 0.30 if has_route else 0.0),
+         ]
+     )
+
+
+ def _grade_note_expectation(
+     expectation: NoteExpectation,
+     notes: list[CaseNote],
+     trace: ReviewTrace,
+ ) -> IssueNoteReport:
+     best_note_score = 0.0
+     best_reason = 0.0
+     best_evidence = 0.0
+     for note in notes:
+         reason_score = _reason_score(note.reason_codes, expectation.accepted_reason_sets)
+         evidence_score = _evidence_score(
+             note.evidence_refs,
+             expectation.decisive_refs,
+             trace.ref_steps,
+             note.saved_at_step,
+         )
+         note_score = (
+             (NOTE_REASON_WEIGHT * reason_score)
+             + (NOTE_EVIDENCE_WEIGHT * evidence_score)
+         )
+         if note_score > best_note_score:
+             best_note_score = note_score
+             best_reason = reason_score
+             best_evidence = evidence_score
+
+     return IssueNoteReport(
+         issue_id=expectation.issue_id,
+         note_score=round(best_note_score, 4),
+         reason_score=round(best_reason, 4),
+         evidence_score=round(best_evidence, 4),
+     )
+
+
+ def _grade_header(
+     expectation: HeaderExpectation,
+     header_resolution: HeaderResolution | None,
+     ref_steps: dict[str, int],
+ ) -> HeaderScoreReport:
+     recommendation_value = (
+         header_resolution.payment_recommendation.value
+         if header_resolution is not None
+         else None
+     )
+     recommendation_score = (
+         expectation.score_map.get(recommendation_value or "", 0.0)
+         if header_resolution is not None
+         else 0.0
+     )
+     reason_score = _reason_score(
+         header_resolution.reason_codes if header_resolution is not None else None,
+         expectation.accepted_reason_sets,
+     )
+     route_score = _route_score(
+         (
+             header_resolution.route_to.value
+             if header_resolution is not None and header_resolution.route_to is not None
+             else None
+         ),
+         expectation.accepted_routes,
+     )
+     evidence_score = _evidence_score(
+         header_resolution.evidence_refs if header_resolution is not None else None,
+         expectation.decisive_refs,
+         ref_steps,
+         header_resolution.saved_at_step if header_resolution is not None else None,
+     )
+     header_score = _normalized_weighted_score(
+         [
+             (recommendation_score, HEADER_RECOMMENDATION_WEIGHT),
+             (
+                 reason_score,
+                 HEADER_REASON_WEIGHT if expectation.accepted_reason_sets else 0.0,
+             ),
+             (route_score, HEADER_ROUTE_WEIGHT if expectation.accepted_routes else 0.0),
+         ]
+     )
+     return HeaderScoreReport(
+         header_score=round(header_score, 4),
+         recommendation_score=round(recommendation_score, 4),
+         reason_score=round(reason_score, 4),
+         route_score=round(route_score, 4),
+         evidence_score=round(evidence_score, 4),
+         accepted_recommendations=_accepted_recommendations(expectation.score_map),
+     )
+
+
+ def _mirrored_single_line_resolution(
+     scenario: ScenarioFixture,
+     line_resolutions: dict[str, LineResolution],
+     header_resolution: HeaderResolution | None,
+ ) -> dict[str, LineResolution]:
+     # Single-line warm-up cases should not crater solely because the agent saved
+     # the correct header decision but omitted the redundant line decision.
+     if header_resolution is None or len(scenario.hidden_truth.line_expectations) != 1:
+         return line_resolutions
+
+     line_id, expectation = next(iter(scenario.hidden_truth.line_expectations.items()))
+     if line_id in line_resolutions:
+         return line_resolutions
+
+     route_value = (
+         header_resolution.route_to.value
+         if header_resolution.route_to is not None
+         else None
+     )
+     if expectation.accepted_routes and route_value not in expectation.accepted_routes:
+         return line_resolutions
+
+     mirrored_resolution = LineResolution(
+         resolution_id=f"{header_resolution.resolution_id}-mirror-line",
+         line_id=line_id,
+         disposition=HEADER_TO_LINE_DISPOSITION[header_resolution.payment_recommendation],
+         reason_codes=list(header_resolution.reason_codes),
+         evidence_refs=list(header_resolution.evidence_refs),
+         route_to=header_resolution.route_to,
+         saved_at_step=header_resolution.saved_at_step,
+     )
+     return {**line_resolutions, line_id: mirrored_resolution}
+
+
+ def _line_is_best(
+     expectation: ResolutionExpectation,
+     disposition_score: float,
+     route_score: float,
+     gating_ok: bool,
+ ) -> bool:
+     if not gating_ok:
+         return False
+     if disposition_score <= 0.0:
+         return False
+     if expectation.accepted_routes and route_score <= 0.0:
+         return False
+     return disposition_score >= _max_positive_score(expectation.score_map)
+
+
+ def _line_is_safe(
+     expectation: ResolutionExpectation,
+     resolution: LineResolution | None,
+     disposition_score: float,
+     route_score: float,
+     best_gating_ok: bool,
+     safe_gating_ok: bool,
+ ) -> bool:
+     if resolution is None:
+         return False
+     if disposition_score <= 0.0:
+         return False
+     if expectation.accepted_routes and route_score <= 0.0:
+         return False
+     if best_gating_ok:
+         return True
+     return (
+         safe_gating_ok
+         and resolution.disposition in CONSERVATIVE_LINE_DISPOSITIONS
+     )
+
+
+ def _header_is_best(
+     expectation: HeaderExpectation,
+     recommendation_score: float,
+     route_score: float,
+     gating_ok: bool,
+ ) -> bool:
+     if not gating_ok:
+         return False
+     if recommendation_score <= 0.0:
+         return False
+     if expectation.accepted_routes and route_score <= 0.0:
+         return False
+     return recommendation_score >= _max_positive_score(expectation.score_map)
+
+
+ def _header_is_safe(
+     expectation: HeaderExpectation,
+     header_resolution: HeaderResolution | None,
+     recommendation_score: float,
+     route_score: float,
+     best_gating_ok: bool,
+     safe_gating_ok: bool,
+ ) -> bool:
+     if header_resolution is None:
+         return False
+     if recommendation_score <= 0.0:
+         return False
+     if expectation.accepted_routes and route_score <= 0.0:
+         return False
+     if best_gating_ok:
+         return True
+     return (
+         safe_gating_ok
+         and header_resolution.payment_recommendation
+         in CONSERVATIVE_HEADER_RECOMMENDATIONS
+     )
+
+
+ def grade_case(
+     scenario: ScenarioFixture,
+     line_resolutions: dict[str, LineResolution],
+     header_resolution: HeaderResolution | None,
+     notes: dict[str, CaseNote],
+     trace: ReviewTrace,
+ ) -> SubmissionReport:
+     line_resolutions = _mirrored_single_line_resolution(
+         scenario,
+         line_resolutions,
+         header_resolution,
+     )
+
+     line_reports: list[LineScoreReport] = []
+     weighted_line_resolution = 0.0
+     weighted_line_core = 0.0
+     weighted_line_reason = 0.0
+     weighted_line_evidence = 0.0
+     total_amount = sum(
+         expectation.amount
+         for expectation in scenario.hidden_truth.line_expectations.values()
+     )
+     total_amount = total_amount or 1.0
+
+     unsafe_findings: list[str] = []
+     all_lines_best = True
+     all_lines_safe = True
+
+     for line_id, expectation in scenario.hidden_truth.line_expectations.items():
+         resolution = line_resolutions.get(line_id)
+         disposition_value = resolution.disposition.value if resolution is not None else ""
+         disposition_score = expectation.score_map.get(disposition_value, 0.0)
+         reason_score = _reason_score(
+             resolution.reason_codes if resolution is not None else None,
+             expectation.accepted_reason_sets,
+         )
+         route_score = _route_score(
+             (
+                 resolution.route_to.value
+                 if resolution is not None and resolution.route_to is not None
+                 else None
+             ),
+             expectation.accepted_routes,
+         )
+         evidence_score = _evidence_score(
+             resolution.evidence_refs if resolution is not None else None,
+             expectation.decisive_refs,
+             trace.ref_steps,
+             resolution.saved_at_step if resolution is not None else None,
+         )
+         best_gating_ok = _gating_refs_satisfied(
+             expectation.gating_refs,
+             trace.ref_steps,
+             resolution.saved_at_step if resolution is not None else None,
+         )
+         safe_gating_ok = _gating_refs_satisfied(
+             expectation.safe_gating_refs or expectation.gating_refs,
+             trace.ref_steps,
+             resolution.saved_at_step if resolution is not None else None,
+         )
+
+         line_score = _normalized_weighted_score(
+             [
+                 (disposition_score, LINE_DISPOSITION_WEIGHT),
+                 (
+                     reason_score,
+                     LINE_REASON_WEIGHT if expectation.accepted_reason_sets else 0.0,
+                 ),
+                 (route_score, LINE_ROUTE_WEIGHT if expectation.accepted_routes else 0.0),
+             ]
+         )
+         core_score = _core_line_score(
+             disposition_score,
+             route_score,
+             bool(expectation.accepted_routes),
+         )
+         effective_core_score = 0.0
+         # Best credit needs the full best gating refs. Conservative actions can
+         # still earn capped core credit when the scenario defines safe gating.
+         if best_gating_ok:
+             effective_core_score = core_score
+         elif (
+             resolution is not None
+             and safe_gating_ok
+             and disposition_score > 0.0
+             and resolution.disposition in CONSERVATIVE_LINE_DISPOSITIONS
+         ):
+             capped_disposition_score = min(
+                 disposition_score,
+                 _max_suboptimal_positive_score(expectation.score_map),
+             )
+             if capped_disposition_score > 0.0:
+                 effective_core_score = _core_line_score(
+                     capped_disposition_score,
+                     route_score,
+                     bool(expectation.accepted_routes),
+                 )
+         weight = expectation.amount / total_amount
+         weighted_line_resolution += line_score * weight
+         weighted_line_core += effective_core_score * weight
+         weighted_line_reason += reason_score * weight
+         weighted_line_evidence += evidence_score * weight
+         line_reports.append(
+             LineScoreReport(
+                 line_id=line_id,
+                 line_score=round(line_score, 4),
+                 disposition_score=round(disposition_score, 4),
+                 reason_score=round(reason_score, 4),
+                 route_score=round(route_score, 4),
+                 evidence_score=round(evidence_score, 4),
+                 accepted_dispositions=_accepted_dispositions(expectation.score_map),
+             )
+         )
+
+         all_lines_best = all_lines_best and _line_is_best(
+             expectation,
+             disposition_score,
+             route_score,
+             best_gating_ok,
+         )
+         all_lines_safe = all_lines_safe and _line_is_safe(
+             expectation,
+             resolution,
+             disposition_score,
+             route_score,
+             best_gating_ok,
+             safe_gating_ok,
+         )
+
+         if (
+             expectation.unsafe_approve
+             and resolution is not None
+             and resolution.disposition is Disposition.APPROVE
+         ):
+             unsafe_findings.append(f"unsafe approval on line {line_id}")
+
+     header_report = _grade_header(
+         scenario.hidden_truth.header_expectation,
+         header_resolution,
+         trace.ref_steps,
+     )
+     header_best_gating_ok = _gating_refs_satisfied(
+         scenario.hidden_truth.header_expectation.gating_refs,
+         trace.ref_steps,
+         header_resolution.saved_at_step if header_resolution is not None else None,
+     )
+     header_safe_gating_ok = _gating_refs_satisfied(
+         scenario.hidden_truth.header_expectation.safe_gating_refs
+         or scenario.hidden_truth.header_expectation.gating_refs,
+         trace.ref_steps,
+         header_resolution.saved_at_step if header_resolution is not None else None,
+     )
+
+     header_core_score = _core_header_score(
+         header_report.recommendation_score,
+         header_report.route_score,
+         bool(scenario.hidden_truth.header_expectation.accepted_routes),
+     )
+     if header_best_gating_ok:
+         pass
+     elif (
+         header_resolution is not None
+         and header_safe_gating_ok
+         and header_report.recommendation_score > 0.0
+         and header_resolution.payment_recommendation
+         in CONSERVATIVE_HEADER_RECOMMENDATIONS
+     ):
+         capped_recommendation_score = min(
+             header_report.recommendation_score,
+             _max_suboptimal_positive_score(
+                 scenario.hidden_truth.header_expectation.score_map
+             ),
+         )
+         if capped_recommendation_score > 0.0:
+             header_core_score = _core_header_score(
+                 capped_recommendation_score,
+                 header_report.route_score,
+                 bool(scenario.hidden_truth.header_expectation.accepted_routes),
+             )
+         else:
+             header_core_score = 0.0
+     else:
+         header_core_score = 0.0
+     reason_quality_score = (0.80 * weighted_line_reason) + (
+         0.20 * header_report.reason_score
+     )
+     resolution_score = (0.80 * weighted_line_resolution) + (
+         0.20 * header_report.header_score
+     )
+     core_decision_score = (0.80 * weighted_line_core) + (0.20 * header_core_score)
+     evidence_score = (0.80 * weighted_line_evidence) + (
+         0.20 * header_report.evidence_score
+     )
+
+     note_reports = [
+         _grade_note_expectation(expectation, list(notes.values()), trace)
+         for expectation in scenario.hidden_truth.note_expectations
+     ]
+     if note_reports:
+         documentation_score = sum(
+             report.note_score for report in note_reports
+         ) / len(note_reports)
+     else:
+         documentation_score = 1.0
+
+     extra_steps = max(
+         0,
+         trace.steps_used - scenario.hidden_truth.efficient_step_target,
+     )
+     efficiency_score = max(
+         0.0,
+         1.0
+         - (0.08 * extra_steps)
+         - (0.25 * trace.invalid_actions)
+         - (0.08 * trace.redundant_actions),
+     )
+
+     if header_resolution is not None:
+         header_value = header_resolution.payment_recommendation.value
+         if header_value in scenario.hidden_truth.header_expectation.unsafe_recommendations:
+             unsafe_findings.append(f"unsafe header recommendation {header_value}")
+
+     # Stage 1 assigns the decision band from the gated essential decisions.
+     # Stage 2 later scores within that band using evidence, notes, and efficiency.
+     if unsafe_findings:
+         decision_band = DecisionBand.UNSAFE
+     elif all_lines_best and _header_is_best(
+         scenario.hidden_truth.header_expectation,
+         header_report.recommendation_score,
+         header_report.route_score,
+         header_best_gating_ok,
+     ):
+         decision_band = DecisionBand.BEST
+     elif all_lines_safe and _header_is_safe(
+         scenario.hidden_truth.header_expectation,
+         header_resolution,
+         header_report.recommendation_score,
+         header_report.route_score,
+         header_best_gating_ok,
+         header_safe_gating_ok,
+     ):
+         decision_band = DecisionBand.SAFE_SUBOPTIMAL
+     else:
+         decision_band = DecisionBand.WRONG
+
+     auxiliary_score = _normalized_weighted_score(
+         [
+             (reason_quality_score, AUX_REASON_QUALITY_WEIGHT),
+             (evidence_score, AUX_EVIDENCE_WEIGHT),
+             (documentation_score, AUX_DOCUMENTATION_WEIGHT),
+             (efficiency_score, AUX_EFFICIENCY_WEIGHT),
+         ]
+     )
+
+     band_progress = _normalized_weighted_score(
+         [
+             (core_decision_score, BAND_CORE_WEIGHT),
+             (auxiliary_score, BAND_AUXILIARY_WEIGHT),
+         ]
+     )
+     band_floor, band_ceiling = BAND_RANGES[decision_band]
+     total_score = band_floor + ((band_ceiling - band_floor) * band_progress)
+     total_score = max(0.0, min(1.0, total_score))
+
+     return SubmissionReport(
+         decision_band=decision_band,
+         total_score=round(total_score, 4),
+         core_decision_score=round(core_decision_score, 4),
+         reason_quality_score=round(reason_quality_score, 4),
+         auxiliary_score=round(auxiliary_score, 4),
+         resolution_score=round(resolution_score, 4),
+         evidence_score=round(evidence_score, 4),
+         documentation_score=round(documentation_score, 4),
+         efficiency_score=round(efficiency_score, 4),
+         safety_cap_applied=(
+             round(BAND_RANGES[DecisionBand.UNSAFE][1], 4)
+             if decision_band is DecisionBand.UNSAFE
+             else None
+         ),
+         unsafe_findings=unsafe_findings,
+         line_reports=line_reports,
+         header_report=header_report,
+         note_reports=note_reports,
+     )
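The two-stage scoring at the end of `grade_case` can be sketched in isolation: the decision band fixes a floor and a ceiling, and a weighted progress value interpolates between them. The sketch below is a minimal illustration, not the repository's implementation. The `BAND_RANGES` values, the default core/auxiliary weights, and the body of `normalized_weighted_score` are assumptions made for the example (the real constants and helper are not visible in this diff); only the efficiency penalties (0.08, 0.25, 0.08) are copied from the code above.

```python
from enum import Enum


class DecisionBand(Enum):
    UNSAFE = "unsafe"
    WRONG = "wrong"
    SAFE_SUBOPTIMAL = "safe_suboptimal"
    BEST = "best"


# Hypothetical (floor, ceiling) pairs; the real BAND_RANGES are not shown in this diff.
BAND_RANGES = {
    DecisionBand.UNSAFE: (0.0, 0.15),
    DecisionBand.WRONG: (0.15, 0.40),
    DecisionBand.SAFE_SUBOPTIMAL: (0.40, 0.75),
    DecisionBand.BEST: (0.75, 1.0),
}


def normalized_weighted_score(pairs):
    """Weighted mean of (score, weight) pairs; assumed to return 0.0 for zero total weight."""
    total_weight = sum(weight for _, weight in pairs)
    if total_weight <= 0:
        return 0.0
    return sum(score * weight for score, weight in pairs) / total_weight


def efficiency_score(extra_steps, invalid_actions, redundant_actions):
    """Per-item penalties taken from the diff, clamped at zero."""
    return max(
        0.0,
        1.0
        - (0.08 * extra_steps)
        - (0.25 * invalid_actions)
        - (0.08 * redundant_actions),
    )


def total_score(band, core, auxiliary, core_weight=0.7, aux_weight=0.3):
    """Interpolate inside the band; the weights here are illustrative assumptions."""
    progress = normalized_weighted_score(
        [(core, core_weight), (auxiliary, aux_weight)]
    )
    floor, ceiling = BAND_RANGES[band]
    return max(0.0, min(1.0, floor + ((ceiling - floor) * progress)))
```

With this shape, an unsafe submission can never score above the ceiling of the `UNSAFE` band regardless of how good its evidence or notes are, which matches the `safety_cap_applied` field reported above.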
server/invoiceops_env_environment.py ADDED
@@ -0,0 +1,491 @@
+ """InvoiceOps environment implementation."""
+
+ from __future__ import annotations
+
+ from uuid import uuid4
+
+ from openenv.core.env_server.interfaces import Environment
+ from openenv.core.env_server.types import EnvironmentMetadata
+
+ from invoiceops_env.models import (
+     ActionType,
+     CaseNote,
+     Disposition,
+     DuplicateCandidate,
+     HeaderResolution,
+     InvoiceOpsAction,
+     InvoiceOpsObservation,
+     InvoiceOpsState,
+     LineResolution,
+     PaymentRecommendation,
+     Progress,
+ )
+ from invoiceops_env.server.fixtures import ENV_DESCRIPTION
+ from invoiceops_env.server.fixtures import DUPLICATE_CHECK_REF_PREFIX
+ from invoiceops_env.server.grader import ReviewTrace, grade_case
+ from invoiceops_env.server.reward_engine import DEFAULT_REWARD_CONFIG
+ from invoiceops_env.server.scenario_loader import (
+     ScenarioFixture,
+     artifact_lookup,
+     artifact_references,
+     exception_lookup,
+     exception_summaries,
+     line_ids_for_scenario,
+     load_scenario,
+ )
+
+
+ class InvoiceOpsEnvironment(
+     Environment[InvoiceOpsAction, InvoiceOpsObservation, InvoiceOpsState]
+ ):
+     """Accounts-payable invoice exception handling environment."""
+
+     SUPPORTS_CONCURRENT_SESSIONS: bool = True
+
+     def __init__(self) -> None:
+         super().__init__()
+         self._reward_config = DEFAULT_REWARD_CONFIG
+         self._scenario: ScenarioFixture | None = None
+         self._artifact_map = {}
+         self._exception_map = {}
+         self._state = InvoiceOpsState(episode_id=str(uuid4()), step_count=0)
+         self._opened_artifact_ids: set[str] = set()
+         self._inspected_exception_ids: set[str] = set()
+         self._duplicate_checks_run: set[str] = set()
+         self._duplicate_candidates: list[DuplicateCandidate] = []
+         self._notes: dict[str, CaseNote] = {}
+         self._line_resolutions: dict[str, LineResolution] = {}
+         self._header_resolution: HeaderResolution | None = None
+         self._ref_steps: dict[str, int] = {}
+         self._invalid_actions = 0
+         self._redundant_actions = 0
+         self._submitted = False
+         self._submission_report = None
+         self._current_artifact_id: str | None = None
+         self._current_exception_id: str | None = None
+
+     def get_metadata(self) -> EnvironmentMetadata:
+         return EnvironmentMetadata(
+             name="InvoiceOpsEnvironment",
+             description=ENV_DESCRIPTION,
+             version="0.1.0",
+         )
+
+     def reset(
+         self,
+         seed: int | None = None,
+         episode_id: str | None = None,
+         task_id: str | None = None,
+         scenario_id: str | None = None,
+         **kwargs: object,
+     ) -> InvoiceOpsObservation:
+         del seed, kwargs
+
+         self._scenario = load_scenario(task_id=task_id, scenario_id=scenario_id)
+         self._artifact_map = artifact_lookup(self._scenario)
+         self._exception_map = exception_lookup(self._scenario)
+         self._state = InvoiceOpsState(
+             episode_id=episode_id or str(uuid4()),
+             step_count=0,
+             task_id=self._scenario.task_id,
+             scenario_id=self._scenario.scenario_id,
+             case_id=self._scenario.case_id,
+             current_artifact_id=None,
+             submitted=False,
+             step_limit=self._scenario.step_limit,
+             duplicate_checks_run=0,
+             invalid_actions=0,
+             redundant_actions=0,
+         )
+         self._opened_artifact_ids = set()
+         self._inspected_exception_ids = set()
+         self._duplicate_checks_run = set()
+         self._duplicate_candidates = []
+         self._notes = {}
+         self._line_resolutions = {}
+         self._header_resolution = None
+         self._ref_steps = {}
+         self._invalid_actions = 0
+         self._redundant_actions = 0
+         self._submitted = False
+         self._submission_report = None
+         self._current_artifact_id = None
+         self._current_exception_id = None
+
+         return self._build_observation(
+             message=f"{self._scenario.title} ready.",
+             reward=0.0,
+             done=False,
+         )
+
+     def step(
+         self,
+         action: InvoiceOpsAction,
+         timeout_s: float | None = None,
+         **kwargs: object,
+     ) -> InvoiceOpsObservation:
+         del timeout_s, kwargs
+         if self._scenario is None:
+             raise RuntimeError("reset() must be called before step()")
+
+         if self._submitted:
+             return self._invalid_observation(
+                 "Case already submitted.",
+                 self._reward_config.invalid_action_penalty,
+                 done=True,
+             )
+
+         self._state.step_count += 1
+         reward = self._reward_config.step_cost
+         done = False
+         message = "Action processed."
+
+         match action.action_type:
+             case ActionType.OPEN_ARTIFACT:
+                 artifact_id = action.artifact_id or ""
+                 artifact = self._artifact_map.get(artifact_id)
+                 if artifact is None:
+                     return self._invalid_observation(
+                         f"Unknown artifact_id: {action.artifact_id}",
+                         reward + self._reward_config.invalid_action_penalty,
+                     )
+                 self._current_artifact_id = artifact_id
+                 self._state.current_artifact_id = artifact_id
+                 if artifact_id in self._opened_artifact_ids:
+                     self._redundant_actions += 1
+                     self._state.redundant_actions = self._redundant_actions
+                     reward += self._reward_config.redundant_open_penalty
+                     message = f"Artifact {artifact_id} was already opened."
+                 else:
+                     self._opened_artifact_ids.add(artifact_id)
+                     self._ref_steps[artifact_id] = self._state.step_count
+                     reward += self._reward_config.first_open_artifact
+                     message = f"Opened artifact {artifact_id}."
+
+             case ActionType.INSPECT_EXCEPTION:
+                 exception_id = action.exception_id or ""
+                 exception = self._exception_map.get(exception_id)
+                 if exception is None:
+                     return self._invalid_observation(
+                         f"Unknown exception_id: {action.exception_id}",
+                         reward + self._reward_config.invalid_action_penalty,
+                     )
+                 self._current_exception_id = exception_id
+                 if exception_id not in self._inspected_exception_ids:
+                     self._inspected_exception_ids.add(exception_id)
+                     self._ref_steps[exception_id] = self._state.step_count
+                     reward += self._reward_config.inspect_exception
+                 message = f"Inspected exception {exception_id}."
+
+             case ActionType.RUN_DUPLICATE_CHECK:
+                 strategy = action.match_strategy
+                 assert strategy is not None
+                 if strategy.value in self._duplicate_checks_run:
+                     self._redundant_actions += 1
+                     self._state.redundant_actions = self._redundant_actions
+                     reward += self._reward_config.redundant_duplicate_penalty
+                     message = f"Duplicate check {strategy.value} was already run."
+                 else:
+                     self._duplicate_checks_run.add(strategy.value)
+                     self._state.duplicate_checks_run = len(self._duplicate_checks_run)
+                     self._ref_steps[
+                         f"{DUPLICATE_CHECK_REF_PREFIX}{strategy.value}"
+                     ] = self._state.step_count
+                     reward += self._reward_config.run_duplicate_check
+                     self._duplicate_candidates = [
+                         candidate
+                         for candidate in self._scenario.duplicate_candidates
+                         if strategy in candidate.supported_match_strategies
+                     ]
+                     for candidate in self._duplicate_candidates:
+                         self._ref_steps.setdefault(
+                             candidate.candidate_id,
+                             self._state.step_count,
+                         )
+                     message = (
+                         f"Duplicate search completed with {len(self._duplicate_candidates)} "
+                         f"candidate(s)."
+                     )
+
+             case ActionType.ADD_NOTE:
+                 invalid_ref = self._first_invalid_ref(action.evidence_refs)
+                 if invalid_ref is not None:
+                     return self._invalid_observation(
+                         f"Unknown evidence ref: {invalid_ref}",
+                         reward + self._reward_config.invalid_action_penalty,
+                     )
+                 note_id = f"N-{len(self._notes) + 1:02d}"
+                 self._notes[note_id] = CaseNote(
+                     note_id=note_id,
+                     note_type=action.note_type,
+                     reason_codes=action.reason_codes,
+                     evidence_refs=action.evidence_refs,
+                     text=(action.text or "").strip(),
+                     saved_at_step=self._state.step_count,
+                 )
+                 reward += self._reward_config.valid_note
+                 message = f"Saved note {note_id}."
+
+             case ActionType.SET_LINE_RESOLUTION:
+                 line_id = action.line_id or ""
+                 if line_id not in line_ids_for_scenario(self._scenario):
+                     return self._invalid_observation(
+                         f"Unknown line_id: {action.line_id}",
+                         reward + self._reward_config.invalid_action_penalty,
+                     )
+                 invalid_ref = self._first_invalid_ref(action.evidence_refs)
+                 if invalid_ref is not None:
+                     return self._invalid_observation(
+                         f"Unknown evidence ref: {invalid_ref}",
+                         reward + self._reward_config.invalid_action_penalty,
+                     )
+                 resolution_id = f"LR-{line_id}"
+                 is_revision = line_id in self._line_resolutions
+                 self._line_resolutions[line_id] = LineResolution(
+                     resolution_id=resolution_id,
+                     line_id=line_id,
+                     disposition=action.disposition,
+                     reason_codes=action.reason_codes,
+                     evidence_refs=action.evidence_refs,
+                     route_to=action.route_to,
+                     saved_at_step=self._state.step_count,
+                 )
+                 reward += (
+                     self._reward_config.revision_penalty
+                     if is_revision
+                     else self._reward_config.valid_line_resolution
+                 )
+                 if is_revision:
+                     self._redundant_actions += 1
+                     self._state.redundant_actions = self._redundant_actions
+                 message = f"Saved line resolution for {line_id}."
+
+             case ActionType.SET_HEADER_RESOLUTION:
+                 invalid_ref = self._first_invalid_ref(action.evidence_refs)
+                 if invalid_ref is not None:
+                     return self._invalid_observation(
+                         f"Unknown evidence ref: {invalid_ref}",
+                         reward + self._reward_config.invalid_action_penalty,
+                     )
+                 is_revision = self._header_resolution is not None
+                 self._header_resolution = HeaderResolution(
+                     resolution_id="HR-001",
+                     payment_recommendation=action.payment_recommendation,
+                     reason_codes=action.reason_codes,
+                     evidence_refs=action.evidence_refs,
+                     route_to=action.route_to,
+                     saved_at_step=self._state.step_count,
+                 )
+                 reward += (
+                     self._reward_config.revision_penalty
+                     if is_revision
+                     else self._reward_config.valid_header_resolution
+                 )
+                 if is_revision:
+                     self._redundant_actions += 1
+                     self._state.redundant_actions = self._redundant_actions
+                 message = "Saved header recommendation."
+
+             case ActionType.SUBMIT_CASE:
+                 invalid_submission = self._validate_submission_refs(
+                     action.note_ids,
+                     action.line_resolution_ids,
+                     action.header_resolution_id,
+                 )
+                 if invalid_submission is not None:
+                     return self._invalid_observation(
+                         invalid_submission,
+                         reward + self._reward_config.invalid_action_penalty,
+                     )
+                 consistency_error = self._validate_submission_consistency()
+                 if consistency_error is not None:
+                     return self._invalid_observation(
+                         consistency_error,
+                         reward + self._reward_config.invalid_action_penalty,
+                     )
+                 self._submission_report = grade_case(
+                     self._scenario,
+                     self._line_resolutions,
+                     self._header_resolution,
+                     self._notes,
+                     ReviewTrace(
+                         ref_steps=self._ref_steps,
+                         steps_used=self._state.step_count,
+                         invalid_actions=self._invalid_actions,
+                         redundant_actions=self._redundant_actions,
+                     ),
+                 )
+                 self._submitted = True
+                 self._state.submitted = True
+                 reward = self._submission_report.total_score
+                 done = True
+                 message = (
+                     f"Case submitted with score {self._submission_report.total_score:.4f}."
+                 )
+
+         if not done and self._state.step_count >= self._state.step_limit:
+             self._submission_report = grade_case(
+                 self._scenario,
+                 self._line_resolutions,
+                 self._header_resolution,
+                 self._notes,
+                 ReviewTrace(
+                     ref_steps=self._ref_steps,
+                     steps_used=self._state.step_count,
+                     invalid_actions=self._invalid_actions,
+                     redundant_actions=self._redundant_actions,
+                 ),
+             )
+             self._submitted = True
+             self._state.submitted = True
+             reward = self._submission_report.total_score
+             done = True
+             message = (
+                 "Step budget exhausted. "
+                 f"Auto-submitted with score {self._submission_report.total_score:.4f}."
+             )
+
+         return self._build_observation(message=message, reward=reward, done=done)
+
+     def _validate_submission_refs(
+         self,
+         note_ids: list[str],
+         line_resolution_ids: list[str],
+         header_resolution_id: str | None,
+     ) -> str | None:
+         if note_ids:
+             missing = [note_id for note_id in note_ids if note_id not in self._notes]
+             if missing:
+                 return f"Unknown note_ids in submit_case: {missing}"
+         if line_resolution_ids:
+             known = {resolution.resolution_id for resolution in self._line_resolutions.values()}
+             missing = [resolution_id for resolution_id in line_resolution_ids if resolution_id not in known]
+             if missing:
+                 return f"Unknown line_resolution_ids in submit_case: {missing}"
+         if header_resolution_id is not None:
+             if self._header_resolution is None or self._header_resolution.resolution_id != header_resolution_id:
+                 return f"Unknown header_resolution_id in submit_case: {header_resolution_id}"
+         return None
+
+     def _validate_submission_consistency(self) -> str | None:
+         approved_line_ids = sorted(
+             line_id
+             for line_id, resolution in self._line_resolutions.items()
+             if resolution.disposition is Disposition.APPROVE
+         )
+         escalated_without_route = sorted(
+             line_id
+             for line_id, resolution in self._line_resolutions.items()
+             if resolution.disposition is Disposition.ESCALATE and resolution.route_to is None
+         )
+         if escalated_without_route:
+             return f"Escalated lines require route_to: {escalated_without_route}"
+
+         header_resolution = self._header_resolution
+         if header_resolution is None:
+             return None
+
+         recommendation = header_resolution.payment_recommendation
+         if (
+             recommendation is PaymentRecommendation.ESCALATE_CASE
+             and header_resolution.route_to is None
+         ):
+             return "escalate_case requires route_to."
+         if (
+             recommendation is PaymentRecommendation.RELEASE_APPROVED_LINES
+             and not approved_line_ids
+         ):
+             return "release_approved_lines requires at least one approved line."
+         if (
+             recommendation is PaymentRecommendation.REJECT_FULL_INVOICE
+             and approved_line_ids
+         ):
+             return (
+                 f"{recommendation.value} is inconsistent with approved lines: "
+                 f"{approved_line_ids}"
+             )
+         return None
+
+     def _first_invalid_ref(self, evidence_refs: list[str]) -> str | None:
+         known_refs = set(self._ref_steps)
+         for ref in evidence_refs:
+             if ref not in known_refs:
+                 return ref
+         return None
+
+     def _invalid_observation(
+         self,
+         message: str,
+         reward: float,
+         done: bool = False,
+     ) -> InvoiceOpsObservation:
+         self._invalid_actions += 1
+         self._state.invalid_actions = self._invalid_actions
+         return self._build_observation(message=message, reward=reward, done=done)
+
+     def _build_observation(
+         self,
+         *,
+         message: str,
+         reward: float,
+         done: bool,
+     ) -> InvoiceOpsObservation:
+         scenario = self._scenario
+         if scenario is None:
+             raise RuntimeError("Scenario is not loaded")
+
+         steps_remaining = max(0, self._state.step_limit - self._state.step_count)
+         progress = Progress(
+             steps_used=self._state.step_count,
+             steps_remaining=steps_remaining,
+             opened_artifacts=len(self._opened_artifact_ids),
+             inspected_exceptions=len(self._inspected_exception_ids),
+             notes_count=len(self._notes),
+             line_resolutions=len(self._line_resolutions),
+             duplicate_checks_run=len(self._duplicate_checks_run),
+             invalid_actions=self._invalid_actions,
+             redundant_actions=self._redundant_actions,
+             submitted=self._submitted,
+         )
+         opened_artifact = (
+             self._artifact_map[self._current_artifact_id]
+             if self._current_artifact_id is not None
+             else None
+         )
+         inspected_exception = (
+             self._exception_map[self._current_exception_id]
+             if self._current_exception_id is not None
+             else None
+         )
+         return InvoiceOpsObservation(
+             message=message,
+             task_id=scenario.task_id,
+             scenario_id=scenario.scenario_id,
+             title=scenario.title,
+             description=scenario.description,
+             queue_card=scenario.queue_card,
+             available_artifacts=artifact_references(scenario),
+             opened_artifact=opened_artifact,
+             visible_exceptions=exception_summaries(scenario),
+             inspected_exception=inspected_exception,
+             duplicate_candidates=self._duplicate_candidates,
+             draft_notes=list(self._notes.values()),
+             draft_line_resolutions=list(self._line_resolutions.values()),
+             draft_header_resolution=self._header_resolution,
+             submission_report=self._submission_report,
+             progress=progress,
+             known_refs=sorted(self._ref_steps),
+             episode_score=(
+                 self._submission_report.total_score if self._submission_report else None
+             ),
+             done=done,
+             reward=reward,
+             metadata={
+                 "case_id": scenario.case_id,
+                 "task_id": scenario.task_id.value,
+             },
+         )
+
+     @property
+     def state(self) -> InvoiceOpsState:
+         return self._state
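The pre-submission rules in `_validate_submission_consistency` can be exercised without the rest of the environment. The sketch below is a simplified stand-in, assuming plain tuples instead of the `LineResolution` and `HeaderResolution` models. The enum member values, the `HOLD` disposition, and the `"ap_manager"` route are hypothetical and exist only to make the example run; the rule order and error messages follow the method above.

```python
from enum import Enum


class Disposition(Enum):
    APPROVE = "approve"
    HOLD = "hold"  # hypothetical extra member for illustration
    ESCALATE = "escalate"


class PaymentRecommendation(Enum):
    RELEASE_APPROVED_LINES = "release_approved_lines"
    REJECT_FULL_INVOICE = "reject_full_invoice"
    ESCALATE_CASE = "escalate_case"


def consistency_error(lines, header):
    """Return the first rule violation as a message, or None if the draft is consistent.

    lines:  {line_id: (Disposition, route_to or None)}
    header: (PaymentRecommendation, route_to or None), or None if no header was set
    """
    approved = sorted(
        line_id for line_id, (disp, _) in lines.items() if disp is Disposition.APPROVE
    )
    unrouted = sorted(
        line_id
        for line_id, (disp, route) in lines.items()
        if disp is Disposition.ESCALATE and route is None
    )
    if unrouted:
        return f"Escalated lines require route_to: {unrouted}"
    if header is None:
        return None

    recommendation, route_to = header
    if recommendation is PaymentRecommendation.ESCALATE_CASE and route_to is None:
        return "escalate_case requires route_to."
    if recommendation is PaymentRecommendation.RELEASE_APPROVED_LINES and not approved:
        return "release_approved_lines requires at least one approved line."
    if recommendation is PaymentRecommendation.REJECT_FULL_INVOICE and approved:
        return f"{recommendation.value} is inconsistent with approved lines: {approved}"
    return None


draft_lines = {"L1": (Disposition.APPROVE, None), "L2": (Disposition.ESCALATE, "ap_manager")}
print(consistency_error(draft_lines, (PaymentRecommendation.REJECT_FULL_INVOICE, None)))
```

Returning a message rather than raising keeps the check compatible with the environment's pattern of turning validation failures into penalized observations instead of exceptions.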
server/reward_engine.py ADDED
@@ -0,0 +1,23 @@
+ """Dense reward shaping for InvoiceOps."""
+
+ from __future__ import annotations
+
+ from dataclasses import dataclass
+
+
+ @dataclass(frozen=True)
+ class RewardConfig:
+     step_cost: float = -0.01
+     first_open_artifact: float = 0.02
+     inspect_exception: float = 0.03
+     run_duplicate_check: float = 0.03
+     valid_note: float = 0.03
+     valid_line_resolution: float = 0.04
+     valid_header_resolution: float = 0.05
+     invalid_action_penalty: float = -0.05
+     redundant_open_penalty: float = -0.03
+     revision_penalty: float = -0.02
+     redundant_duplicate_penalty: float = -0.03
+
+
+ DEFAULT_REWARD_CONFIG = RewardConfig()
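As a rough illustration of how these values combine in `step()`: every non-terminal action starts from `step_cost` and then adds one bonus or penalty, while a submission replaces the shaped reward with the graded `total_score`. The sketch below restates the dataclass so it runs standalone; the `shaped_reward` helper is hypothetical and not part of the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardConfig:
    # Field values copied from the diff above.
    step_cost: float = -0.01
    first_open_artifact: float = 0.02
    inspect_exception: float = 0.03
    run_duplicate_check: float = 0.03
    valid_note: float = 0.03
    valid_line_resolution: float = 0.04
    valid_header_resolution: float = 0.05
    invalid_action_penalty: float = -0.05
    redundant_open_penalty: float = -0.03
    revision_penalty: float = -0.02
    redundant_duplicate_penalty: float = -0.03


def shaped_reward(config: RewardConfig, component: str) -> float:
    """Hypothetical helper: per-step reward = step cost + one named component."""
    return config.step_cost + getattr(config, component)


config = RewardConfig()
for name in ("first_open_artifact", "redundant_open_penalty", "invalid_action_penalty"):
    print(f"{name}: {shaped_reward(config, name):+.2f}")
```

Under these defaults a first-time artifact open nets a small positive reward, while a repeated open or an unknown ID is net negative, so blind exploration does not pay for itself.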
server/scenario_loader.py ADDED
@@ -0,0 +1,186 @@
+ """Scenario loading helpers for InvoiceOps."""
+
+ from __future__ import annotations
+
+ import json
+ from pathlib import Path
+
+ from pydantic import Field
+
+ from invoiceops_env.models import (
+     ArtifactReference,
+     ArtifactView,
+     DuplicateCandidate,
+     ExceptionDetail,
+     ExceptionSummary,
+     ExceptionType,
+     QueueCard,
+     RouteTarget,
+     TaskId,
+ )
+ from invoiceops_env.models import Model as BaseModel
+ from invoiceops_env.server.fixtures import (
+     DEFAULT_SCENARIOS,
+     SCENARIOS_DIR,
+ )
+
+
+ class ResolutionExpectation(BaseModel):
+     amount: float = Field(..., ge=0.0)
+     score_map: dict[str, float] = Field(default_factory=dict)
+     accepted_reason_sets: list[list[str]] = Field(default_factory=list)
+     accepted_routes: list[str] = Field(default_factory=list)
+     gating_refs: list[str] = Field(default_factory=list)
+     safe_gating_refs: list[str] = Field(default_factory=list)
+     decisive_refs: list[str] = Field(default_factory=list)
+     unsafe_approve: bool = Field(default=False)
+
+
+ class HeaderExpectation(BaseModel):
+     score_map: dict[str, float] = Field(default_factory=dict)
+     accepted_reason_sets: list[list[str]] = Field(default_factory=list)
+     accepted_routes: list[str] = Field(default_factory=list)
+     gating_refs: list[str] = Field(default_factory=list)
+     safe_gating_refs: list[str] = Field(default_factory=list)
+     decisive_refs: list[str] = Field(default_factory=list)
+     unsafe_recommendations: list[str] = Field(default_factory=list)
+     overconservative_recommendations: list[str] = Field(default_factory=list)
+
+
+ class NoteExpectation(BaseModel):
+     issue_id: str
+     accepted_reason_sets: list[list[str]] = Field(default_factory=list)
+     decisive_refs: list[str] = Field(default_factory=list)
+
+
+ class HiddenTruth(BaseModel):
+     line_expectations: dict[str, ResolutionExpectation] = Field(default_factory=dict)
+     header_expectation: HeaderExpectation
+     note_expectations: list[NoteExpectation] = Field(default_factory=list)
+     efficient_step_target: int = Field(default=0, ge=0)
+
+
+ class ScenarioFixture(BaseModel):
+     scenario_id: str
+     task_id: TaskId
+     case_id: str
+     title: str
+     description: str
+     step_limit: int = Field(..., ge=1)
+     queue_card: QueueCard
+     artifacts: list[ArtifactView] = Field(default_factory=list)
+     exceptions: list[ExceptionDetail] = Field(default_factory=list)
+     duplicate_candidates: list[DuplicateCandidate] = Field(default_factory=list)
+     hidden_truth: HiddenTruth
+
+
+ def _scenario_path_for_id(scenario_id: str) -> Path:
+     return SCENARIOS_DIR / f"{scenario_id}.json"
+
+
+ def load_scenario(
+     task_id: TaskId | str | None = None,
+     scenario_id: str | None = None,
+ ) -> ScenarioFixture:
+     if scenario_id is None:
+         task = TaskId(task_id or TaskId.EASY)
+         scenario_id = DEFAULT_SCENARIOS[task]
+
+     scenario_path = _scenario_path_for_id(scenario_id)
+     if not scenario_path.exists():
+         raise ValueError(f"Unknown scenario_id: {scenario_id}")
+
+     with scenario_path.open("r", encoding="utf-8") as handle:
+         payload = json.load(handle)
+
+     scenario = ScenarioFixture.model_validate(payload)
+     if task_id is not None and scenario.task_id is not TaskId(task_id):
+         raise ValueError(
+             f"Scenario '{scenario_id}' belongs to task '{scenario.task_id.value}', "
+             f"not '{TaskId(task_id).value}'"
+         )
+     return scenario
+
+
+ QUEUE_SAFE_EXCEPTION_HEADLINES: dict[ExceptionType, str] = {
+     ExceptionType.RECEIPT_QUANTITY_VARIANCE: "Receipt variance requires review",
+     ExceptionType.NON_PO_MISSING_APPROVAL: "Non-PO approval exception requires review",
+     ExceptionType.POSSIBLE_DUPLICATE: "Potential duplicate invoice requires review",
+     ExceptionType.PRICE_VARIANCE: "Price variance requires review",
+     ExceptionType.CUMULATIVE_BILLING_VARIANCE: "Cumulative billing exception requires review",
+     ExceptionType.TAX_VARIANCE: "Tax exception requires review",
+     ExceptionType.PAYMENT_TERMS_MISMATCH: "Payment terms exception requires review",
+ }
+
+ QUEUE_SAFE_EXCEPTION_HINTS: dict[ExceptionType, str] = {
+     ExceptionType.RECEIPT_QUANTITY_VARIANCE: (
+         "Inspect this exception for receipt support and quantity details."
+     ),
+     ExceptionType.NON_PO_MISSING_APPROVAL: (
+         "Inspect this exception for workflow status and approval details."
+     ),
+     ExceptionType.POSSIBLE_DUPLICATE: (
+         "Inspect this exception for duplicate-match details before deciding."
+     ),
+     ExceptionType.PRICE_VARIANCE: (
+         "Inspect this exception for invoice-vs-PO price details."
+     ),
+     ExceptionType.CUMULATIVE_BILLING_VARIANCE: (
+         "Inspect this exception for history-aware billing facts."
+     ),
+     ExceptionType.TAX_VARIANCE: "Inspect this exception for tax calculation details.",
+     ExceptionType.PAYMENT_TERMS_MISMATCH: (
+         "Inspect this exception for payment-terms comparison details."
+     ),
+ }
+
+
+ def artifact_lookup(scenario: ScenarioFixture) -> dict[str, ArtifactView]:
+     return {artifact.artifact_id: artifact for artifact in scenario.artifacts}
+
+
+ def artifact_references(scenario: ScenarioFixture) -> list[ArtifactReference]:
+     return [
+         ArtifactReference(
+             artifact_id=artifact.artifact_id,
+             artifact_type=artifact.artifact_type,
+             title=artifact.title,
+         )
+         for artifact in scenario.artifacts
+     ]
+
+
+ def exception_lookup(scenario: ScenarioFixture) -> dict[str, ExceptionDetail]:
+     return {exception.exception_id: exception for exception in scenario.exceptions}
+
+
+ def exception_summaries(scenario: ScenarioFixture) -> list[ExceptionSummary]:
+     return [
+         ExceptionSummary(
+             exception_id=exception.exception_id,
+             exception_type=exception.exception_type,
+             severity=exception.severity,
+             headline=QUEUE_SAFE_EXCEPTION_HEADLINES.get(
+                 exception.exception_type,
+                 "Exception requires review",
+             ),
+             impacted_line_ids=exception.impacted_line_ids,
+             short_description=QUEUE_SAFE_EXCEPTION_HINTS.get(
+                 exception.exception_type,
+                 "Inspect this exception for detailed facts before deciding.",
+             ),
+         )
+         for exception in scenario.exceptions
+     ]
+
+
+ def line_ids_for_scenario(scenario: ScenarioFixture) -> set[str]:
+     line_ids: set[str] = set(scenario.hidden_truth.line_expectations.keys())
+     for artifact in scenario.artifacts:
+         for line_item in artifact.line_items:
+             line_ids.add(line_item.line_id)
+     return line_ids
+
+
+ def route_target_values() -> set[str]:
+     return {route.value for route in RouteTarget}
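The lookup-then-validate flow in `load_scenario` can be approximated with the standard library alone. The sketch below is a simplified stand-in, assuming a plain `dict` payload instead of the pydantic `ScenarioFixture` and string task IDs instead of the `TaskId` enum. The function name `load_scenario_payload`, its `scenarios_dir` parameter, and the `demo_easy` scenario ID are hypothetical.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def load_scenario_payload(
    scenarios_dir: Path,
    scenario_id: str,
    task_id: str | None = None,
) -> dict:
    """Resolve <scenarios_dir>/<scenario_id>.json and check it belongs to task_id."""
    scenario_path = scenarios_dir / f"{scenario_id}.json"
    if not scenario_path.exists():
        raise ValueError(f"Unknown scenario_id: {scenario_id}")

    with scenario_path.open("r", encoding="utf-8") as handle:
        payload = json.load(handle)

    # The real loader validates with pydantic; here only the task match is checked.
    if task_id is not None and payload.get("task_id") != task_id:
        raise ValueError(
            f"Scenario '{scenario_id}' belongs to task '{payload.get('task_id')}', "
            f"not '{task_id}'"
        )
    return payload


# Usage with a throwaway fixture directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    fixture = {"scenario_id": "demo_easy", "task_id": "easy"}
    (root / "demo_easy.json").write_text(json.dumps(fixture), encoding="utf-8")
    print(load_scenario_payload(root, "demo_easy", task_id="easy")["scenario_id"])
```

Raising `ValueError` for both an unknown ID and a task mismatch mirrors the loader above, so callers can handle either failure the same way.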
summarize_eval.py ADDED
@@ -0,0 +1,157 @@
+ #!/usr/bin/env python3
+ from __future__ import annotations
+
+ import argparse
+ import json
+ from pathlib import Path
+ from statistics import mean
+ from typing import Any
+
+ TASK_ORDER = ("easy", "medium", "medium_plus", "hard")
+
+
+ def find_latest_eval() -> Path:
+     candidates = sorted(Path("outputs/evals").glob("*.json"))
+     if not candidates:
+         raise FileNotFoundError("No eval JSON files found under outputs/evals/.")
+     return candidates[-1]
+
+
+ def parse_args() -> argparse.Namespace:
+     parser = argparse.ArgumentParser(
+         description="Print a compact summary for an InvoiceOps eval JSON artifact."
+     )
+     parser.add_argument(
+         "paths",
+         nargs="*",
+         help="Optional eval JSON paths. Defaults to the latest file under outputs/evals/.",
+     )
+     return parser.parse_args()
+
+
+ def _safe_mean(values: list[float]) -> float | None:
+     return round(mean(values), 4) if values else None
+
+
+ def _request_error_count(result: dict[str, Any]) -> int:
+     attempts = result.get("model_attempts") or []
+     return sum(
+         1
+         for attempt in attempts
+         if isinstance(attempt, dict) and attempt.get("request_error")
+     )
+
+
+ def summarize_eval(path: Path) -> dict[str, Any]:
+     payload = json.loads(path.read_text(encoding="utf-8"))
+     results = payload.get("results") or []
+
+     task_scores: dict[str, float] = {}
+     resolution_scores: list[float] = []
+     evidence_scores: list[float] = []
+     documentation_scores: list[float] = []
+     efficiency_scores: list[float] = []
+     steps: list[float] = []
+     reward_lengths: list[float] = []
+     fallback_count = 0
+     parse_failure_count = 0
+     request_error_count = 0
+
+     for result in results:
+         task_id = result.get("task_id")
+         score = result.get("score")
+         if isinstance(task_id, str) and isinstance(score, (int, float)):
+             task_scores[task_id] = round(float(score), 4)
+
+         if result.get("used_fallback") is True:
+             fallback_count += 1
+         if result.get("decision_parsed") is False:
+             parse_failure_count += 1
+         request_error_count += _request_error_count(result)
+
+         if isinstance(result.get("steps_used"), (int, float)):
+             steps.append(float(result["steps_used"]))
+         reward_trace = result.get("reward_trace")
+         if isinstance(reward_trace, list):
+             reward_lengths.append(float(len(reward_trace)))
+
+         report = result.get("submission_report")
+         if not isinstance(report, dict):
+             continue
+         for source, bucket in (
+             ("resolution_score", resolution_scores),
+             ("evidence_score", evidence_scores),
+             ("documentation_score", documentation_scores),
+             ("efficiency_score", efficiency_scores),
+         ):
+             value = report.get(source)
+             if isinstance(value, (int, float)):
+                 bucket.append(float(value))
+
+     return {
+         "path": str(path),
+         "run_id": payload.get("run_id"),
+         "model_name": payload.get("model_name"),
+         "mean_score": payload.get("mean_score"),
+         "raw_mean_score": payload.get("raw_mean_score"),
+         "strict_baseline_scoring": payload.get("strict_baseline_scoring"),
+         "task_scores": task_scores,
+         "fallback_count": fallback_count,
+         "parse_failure_count": parse_failure_count,
+         "request_error_count": request_error_count,
+         "avg_resolution_score": _safe_mean(resolution_scores),
+         "avg_evidence_score": _safe_mean(evidence_scores),
+         "avg_documentation_score": _safe_mean(documentation_scores),
+         "avg_efficiency_score": _safe_mean(efficiency_scores),
+         "avg_steps_used": _safe_mean(steps),
+         "avg_reward_trace_len": _safe_mean(reward_lengths),
+     }
+
+
111
+ def print_summary(summary: dict[str, Any]) -> None:
112
+ print(f"path: {summary['path']}")
113
+ print(f"run_id: {summary['run_id']}")
114
+ print(f"model: {summary['model_name']}")
115
+ print(
116
+ "mean_score: "
117
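+ f"{'-' if summary['mean_score'] is None else format(summary['mean_score'], '.4f')} "
118
+ f"(raw_mean_score={'-' if summary['raw_mean_score'] is None else format(summary['raw_mean_score'], '.4f')}, "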
119
+ f"strict_baseline_scoring={summary['strict_baseline_scoring']})"
120
+ )
121
+
122
+ print("tasks:")
123
+ for task_id in TASK_ORDER:
124
+ score = summary["task_scores"].get(task_id)
125
+ rendered = "-" if score is None else f"{score:.4f}"
126
+ print(f" {task_id}: {rendered}")
127
+
128
+ print("components:")
129
+ for label in (
130
+ "avg_resolution_score",
131
+ "avg_evidence_score",
132
+ "avg_documentation_score",
133
+ "avg_efficiency_score",
134
+ "avg_steps_used",
135
+ "avg_reward_trace_len",
136
+ ):
137
+ value = summary[label]
138
+ rendered = "-" if value is None else f"{value:.4f}"
139
+ print(f" {label}: {rendered}")
140
+
141
+ print("health:")
142
+ print(f" fallbacks: {summary['fallback_count']}")
143
+ print(f" parse_failures: {summary['parse_failure_count']}")
144
+ print(f" request_errors: {summary['request_error_count']}")
145
+
146
+
147
+ def main() -> None:
148
+ args = parse_args()
149
+ paths = [Path(value) for value in args.paths] if args.paths else [find_latest_eval()]
150
+ for index, path in enumerate(paths):
151
+ if index:
152
+ print()
153
+ print_summary(summarize_eval(path))
154
+
155
+
156
+ if __name__ == "__main__":
157
+ main()
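The summary above calls a `_safe_mean` helper that is defined outside this excerpt. The following is a minimal sketch of how such a helper could behave, assuming it returns `None` for an empty list (the `"-" if value is None` branch in `print_summary` suggests this). The names `safe_mean` and `render`, and the rounding to four places, are illustrative assumptions rather than the repository's actual code.

```python
from statistics import fmean
from typing import Optional


def safe_mean(values: list[float]) -> Optional[float]:
    """Return the rounded mean, or None when there is nothing to average."""
    if not values:
        return None
    return round(fmean(values), 4)


def render(value: Optional[float]) -> str:
    """Format a metric like the components block does: '-' when missing."""
    return "-" if value is None else f"{value:.4f}"


if __name__ == "__main__":
    print(render(safe_mean([0.5, 1.0])))  # 0.7500
    print(render(safe_mean([])))          # -
```

Routing every numeric field through one renderer of this kind keeps a missing metric from raising a `TypeError` when it is formatted with `:.4f`.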
tests/conftest.py ADDED
@@ -0,0 +1,13 @@
1
+ import sys
2
+ from pathlib import Path
3
+
4
+
5
+ ROOT = Path(__file__).resolve().parents[1]
6
+ PARENT = ROOT.parent
7
+
8
+ if str(PARENT) not in sys.path:
9
+ sys.path.insert(0, str(PARENT))
10
+
11
+ for module_name in list(sys.modules):
12
+ if module_name == "invoiceops_env" or module_name.startswith("invoiceops_env."):
13
+ sys.modules.pop(module_name, None)
tests/test_baseline_smoke.py ADDED
@@ -0,0 +1,343 @@
1
+ from invoiceops_env.inference import (
2
+ API_BASE_URL,
3
+ ObservationMemory,
4
+ TASKS,
5
+ _parse_action_payload,
6
+ _safe_json_load,
7
+ build_action_prompt,
8
+ build_observation_snapshot,
9
+ resolve_api_key,
10
+ strict_task_score,
11
+ update_memory,
12
+ )
13
+ from invoiceops_env.models import (
14
+ ActionType,
15
+ Disposition,
16
+ DuplicateMatchStrategy,
17
+ InvoiceOpsAction,
18
+ PaymentRecommendation,
19
+ ReasonCode,
20
+ )
21
+ from invoiceops_env.server.invoiceops_env_environment import InvoiceOpsEnvironment
22
+
23
+
24
+ def test_parse_action_payload_salvages_common_shapes() -> None:
25
+ payload = {
26
+ "action": "set_line_resolution",
27
+ "args": {
28
+ "line": "L1",
29
+ "decision": "hold",
30
+ "reason_code": "receipt_not_confirmed",
31
+ "refs": ["art-history", "EX-RECEIPT-L2"],
32
+ "route": "receiving",
33
+ },
34
+ }
35
+
36
+ action = _parse_action_payload(payload)
37
+
38
+ assert action is not None
39
+ assert action.action_type is ActionType.SET_LINE_RESOLUTION
40
+ assert action.line_id == "L1"
41
+ assert action.disposition is Disposition.HOLD
42
+ assert action.reason_codes == [ReasonCode.RECEIPT_NOT_CONFIRMED]
43
+ assert action.evidence_refs == ["art-history", "EX-RECEIPT-L2"]
44
+
45
+
46
+ def test_parse_action_payload_rejects_missing_required_fields() -> None:
47
+ payload = {
48
+ "action_type": "set_header_resolution",
49
+ "payment_recommendation": "hold_full_invoice",
50
+ "reason_codes": ["receipt_not_confirmed"],
51
+ }
52
+
53
+ assert _parse_action_payload(payload) is None
54
+
55
+
56
+ def test_parse_action_payload_accepts_submit_case() -> None:
57
+ action = _parse_action_payload({"action_type": "submit_case"})
58
+
59
+ assert action is not None
60
+ assert action.action_type is ActionType.SUBMIT_CASE
61
+
62
+
63
+ def test_safe_json_load_strips_think_blocks() -> None:
64
+ payload = _safe_json_load(
65
+ '<think>reasoning here</think>{"action_type":"submit_case"}'
66
+ )
67
+
68
+ assert payload == {"action_type": "submit_case"}
69
+
70
+
71
+ def test_initial_snapshot_does_not_preload_case_details() -> None:
72
+ env = InvoiceOpsEnvironment()
73
+ observation = env.reset(task_id="hard")
74
+ memory = ObservationMemory()
75
+
76
+ snapshot = build_observation_snapshot(observation, memory)
77
+
78
+ assert snapshot["artifacts"] == {}
79
+ assert snapshot["exceptions"] == []
80
+ assert snapshot["duplicate_candidates"] == []
81
+ assert snapshot["known_refs"] == []
82
+
83
+
84
+ def test_snapshot_only_contains_explicitly_observed_details() -> None:
85
+ env = InvoiceOpsEnvironment()
86
+ observation = env.reset(task_id="medium")
87
+ memory = ObservationMemory()
88
+ update_memory(memory, observation)
89
+
90
+ observation = env.step(
91
+ InvoiceOpsAction(
92
+ action_type=ActionType.OPEN_ARTIFACT,
93
+ artifact_id="art-invoice",
94
+ )
95
+ )
96
+ update_memory(memory, observation)
97
+
98
+ first_exception = observation.visible_exceptions[0]
99
+ observation = env.step(
100
+ InvoiceOpsAction(
101
+ action_type=ActionType.INSPECT_EXCEPTION,
102
+ exception_id=first_exception.exception_id,
103
+ )
104
+ )
105
+ update_memory(memory, observation)
106
+
107
+ snapshot = build_observation_snapshot(observation, memory)
108
+
109
+ assert set(snapshot["artifacts"]) == {"invoice_packet"}
110
+ assert [exception["type"] for exception in snapshot["exceptions"]] == [
111
+ "possible_duplicate"
112
+ ]
113
+ assert observation.known_refs == ["EX-POSSIBLE-DUP", "art-invoice"]
114
+
115
+
116
+ def test_visible_exception_stubs_hide_detailed_facts_until_inspection() -> None:
117
+ env = InvoiceOpsEnvironment()
118
+ observation = env.reset(task_id="medium")
119
+
120
+ stub = observation.visible_exceptions[0]
121
+ assert stub.headline == "Potential duplicate invoice requires review"
122
+ assert stub.short_description == (
123
+ "Inspect this exception for duplicate-match details before deciding."
124
+ )
125
+
126
+ inspected = env.step(
127
+ InvoiceOpsAction(
128
+ action_type=ActionType.INSPECT_EXCEPTION,
129
+ exception_id=stub.exception_id,
130
+ )
131
+ ).inspected_exception
132
+
133
+ assert inspected is not None
134
+ assert inspected.headline == "Duplicate control is open for this invoice"
135
+ assert any(
136
+ field.label == "Invoice number" and field.value == "TL-9205/A"
137
+ for field in inspected.fields
138
+ )
139
+
140
+
141
+ def test_hard_exceptions_do_not_expose_derived_answer_fields() -> None:
142
+ env = InvoiceOpsEnvironment()
143
+ env.reset(task_id="hard")
144
+
145
+ l1 = env.step(
146
+ InvoiceOpsAction(
147
+ action_type=ActionType.INSPECT_EXCEPTION,
148
+ exception_id="EX-RECEIPT-L1",
149
+ )
150
+ ).inspected_exception
151
+ l2 = env.step(
152
+ InvoiceOpsAction(
153
+ action_type=ActionType.INSPECT_EXCEPTION,
154
+ exception_id="EX-RECEIPT-L2",
155
+ )
156
+ ).inspected_exception
157
+
158
+ assert l1 is not None
159
+ assert l2 is not None
160
+ assert {field.label for field in l1.fields} == {
161
+ "Invoice quantity",
162
+ "Received quantity",
163
+ "Short quantity",
164
+ }
165
+ assert {field.label for field in l2.fields} == {
166
+ "Invoice quantity",
167
+ "Initial posted receipt",
168
+ "Latest control update",
169
+ }
170
+
171
+
172
+ def test_hard_receipt_log_points_to_history_instead_of_reversal_answer() -> None:
173
+ env = InvoiceOpsEnvironment()
174
+ observation = env.reset(task_id="hard")
175
+ observation = env.step(
176
+ InvoiceOpsAction(
177
+ action_type=ActionType.OPEN_ARTIFACT,
178
+ artifact_id="art-receipts",
179
+ )
180
+ )
181
+
182
+ opened = observation.opened_artifact
183
+ assert opened is not None
184
+ l2 = next(item for item in opened.line_items if item.line_id == "L2")
185
+
186
+ assert l2.status == "received_under_review"
187
+ assert "history" in l2.notes.lower()
188
+ assert "reversed" not in l2.notes.lower()
189
+
190
+
191
+ def test_medium_plus_exception_does_not_expose_unsupported_amount() -> None:
192
+ env = InvoiceOpsEnvironment()
193
+ env.reset(task_id="medium_plus")
194
+
195
+ inspected = env.step(
196
+ InvoiceOpsAction(
197
+ action_type=ActionType.INSPECT_EXCEPTION,
198
+ exception_id="EX-RECEIPT-L2",
199
+ )
200
+ ).inspected_exception
201
+
202
+ assert inspected is not None
203
+ assert {field.label for field in inspected.fields} == {
204
+ "Invoice quantity",
205
+ "Received quantity",
206
+ "Short quantity",
207
+ }
208
+
209
+
210
+ def test_action_prompt_describes_single_step_agent_loop() -> None:
211
+ env = InvoiceOpsEnvironment()
212
+ observation = env.reset(task_id="medium")
213
+ prompt = build_action_prompt(observation, ObservationMemory())
214
+
215
+ assert "Return exactly one JSON object for the single best next action." in prompt
216
+ assert "Use open_artifact, inspect_exception, and run_duplicate_check" in prompt
217
+ assert "next owner or follow-up queue" in prompt
218
+ assert "A real case-level blocker can justify hold_full_invoice" in prompt
219
+ assert "Allowed match_strategy values" in prompt
220
+ assert "Action JSON templates" in prompt
221
+ assert "<artifact_id>" in prompt
222
+ assert '"match_strategy":"normalized_invoice_no"' in prompt
223
+ assert '"action_type":"submit_case"' in prompt
224
+ assert "If any line is approved, use release_approved_lines" not in prompt
225
+
226
+
227
+ def test_hf_router_configuration_requires_hf_token(monkeypatch) -> None:
228
+ monkeypatch.setenv("HF_TOKEN", "hf-secret")
229
+ monkeypatch.delenv("OPENAI_API_KEY", raising=False)
230
+ monkeypatch.delenv("API_KEY", raising=False)
231
+ monkeypatch.delenv("OPENROUTER_API_KEY", raising=False)
232
+ monkeypatch.delenv("API_BASE_URL", raising=False)
233
+
234
+ api_key, source = resolve_api_key()
235
+
236
+ assert api_key == "hf-secret"
237
+ assert source == "HF_TOKEN"
238
+ assert API_BASE_URL == "https://router.huggingface.co/v1"
239
+
240
+
241
+ def test_invoice_baseline_defaults_to_strict_scoring(monkeypatch) -> None:
242
+ monkeypatch.delenv("STRICT_BASELINE_SCORING", raising=False)
243
+
244
+ assert strict_task_score(0.2136, used_fallback=True) == 0.0
245
+ assert strict_task_score(0.2136, used_fallback=False) == 0.2136
246
+
247
+
248
+ def test_public_task_loop_uses_four_task_progression() -> None:
249
+ assert [task.value for task in TASKS] == [
250
+ "easy",
251
+ "medium",
252
+ "medium_plus",
253
+ "hard",
254
+ ]
255
+
256
+
257
+ def test_duplicate_check_exposes_strategy_and_candidate_refs() -> None:
258
+ env = InvoiceOpsEnvironment()
259
+ observation = env.reset(task_id="medium")
260
+ memory = ObservationMemory()
261
+ update_memory(memory, observation)
262
+
263
+ observation = env.step(
264
+ InvoiceOpsAction(
265
+ action_type=ActionType.RUN_DUPLICATE_CHECK,
266
+ match_strategy=DuplicateMatchStrategy.NORMALIZED_INVOICE_NUMBER,
267
+ )
268
+ )
269
+ update_memory(memory, observation)
270
+ snapshot = build_observation_snapshot(observation, memory)
271
+
272
+ assert "duplicate_check:normalized_invoice_no" in observation.known_refs
273
+ assert "CAND-NORM-01" in observation.known_refs
274
+ assert snapshot["duplicate_candidates"] == [
275
+ {
276
+ "candidate_id": "CAND-NORM-01",
277
+ "invoice_number": "TL9205A",
278
+ "invoice_date": "2026-03-10",
279
+ "gross_amount": 3800.0,
280
+ "status": "reversed on 2026-03-11 after import duplicate; closed",
281
+ "match_basis": "Normalized invoice number + vendor + gross amount",
282
+ "overlap_summary": "Same normalized invoice number. Prior record was reversed before payment.",
283
+ }
284
+ ]
285
+
286
+
287
+ def test_duplicate_check_action_parses() -> None:
288
+ action = _parse_action_payload(
289
+ {
290
+ "action_type": "run_duplicate_check",
291
+ "match_strategy": DuplicateMatchStrategy.NORMALIZED_INVOICE_NUMBER.value,
292
+ }
293
+ )
294
+
295
+ assert action is not None
296
+ assert action.action_type is ActionType.RUN_DUPLICATE_CHECK
297
+ assert (
298
+ action.match_strategy
299
+ is DuplicateMatchStrategy.NORMALIZED_INVOICE_NUMBER
300
+ )
301
+
302
+
303
+ def test_duplicate_check_action_parses_common_aliases() -> None:
304
+ action = _parse_action_payload(
305
+ {
306
+ "action_type": "run_duplicate_check",
307
+ "match_strategy": "vendor_invoice_amount",
308
+ }
309
+ )
310
+
311
+ assert action is not None
312
+ assert action.action_type is ActionType.RUN_DUPLICATE_CHECK
313
+ assert action.match_strategy is DuplicateMatchStrategy.VENDOR_AMOUNT_DATE
314
+
315
+
316
+ def test_header_resolution_action_parses_common_aliases() -> None:
317
+ action = _parse_action_payload(
318
+ {
319
+ "action_type": "set_header_resolution",
320
+ "recommendation": PaymentRecommendation.HOLD_FULL_INVOICE.value,
321
+ "reason_code": [
322
+ ReasonCode.NON_PO_APPROVAL_MISSING.value,
323
+ ReasonCode.POSSIBLE_DUPLICATE_REVIEW.value,
324
+ ],
325
+ "refs": ["art-approval", "duplicate_check:normalized_invoice_no"],
326
+ "route": "requester",
327
+ }
328
+ )
329
+
330
+ assert action is not None
331
+ assert action.action_type is ActionType.SET_HEADER_RESOLUTION
332
+ assert (
333
+ action.payment_recommendation
334
+ is PaymentRecommendation.HOLD_FULL_INVOICE
335
+ )
336
+ assert action.reason_codes == [
337
+ ReasonCode.NON_PO_APPROVAL_MISSING,
338
+ ReasonCode.POSSIBLE_DUPLICATE_REVIEW,
339
+ ]
340
+ assert action.evidence_refs == [
341
+ "art-approval",
342
+ "duplicate_check:normalized_invoice_no",
343
+ ]
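Several tests above expect `_parse_action_payload` to salvage loosely shaped model output: a nested `args` dict, short key names such as `refs` or `route`, and a single `reason_code` instead of a list. The parser itself is not part of this excerpt, so the sketch below only illustrates one way such normalization could work. The alias table, the `route` to `route_to` mapping, and the function name `normalize_payload` are assumptions inferred from the test payloads.

```python
from typing import Any

# Hypothetical alias table inferred from the test payloads; the real parser may differ.
FIELD_ALIASES = {
    "action": "action_type",
    "line": "line_id",
    "decision": "disposition",
    "recommendation": "payment_recommendation",
    "reason_code": "reason_codes",
    "refs": "evidence_refs",
    "route": "route_to",
}
LIST_FIELDS = {"reason_codes", "evidence_refs"}


def normalize_payload(payload: dict[str, Any]) -> dict[str, Any]:
    """Flatten a nested "args" dict and map alias keys to canonical names."""
    flat = {key: value for key, value in payload.items() if key != "args"}
    nested = payload.get("args")
    if isinstance(nested, dict):
        flat.update(nested)

    normalized: dict[str, Any] = {}
    for key, value in flat.items():
        canonical = FIELD_ALIASES.get(key, key)
        if canonical in LIST_FIELDS and not isinstance(value, list):
            value = [value]  # a lone reason code becomes a one-item list
        normalized[canonical] = value
    return normalized
```

A real parser would still have to validate the normalized dict against the action model and return `None` when required fields are missing, as `test_parse_action_payload_rejects_missing_required_fields` expects.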
tests/test_env_flow.py ADDED
@@ -0,0 +1,640 @@
1
+ """End-to-end environment flow tests for the 4-task InvoiceOps ladder."""
2
+
3
+ from invoiceops_env.models import (
4
+ ActionType,
5
+ Disposition,
6
+ InvoiceOpsAction,
7
+ NoteType,
8
+ PaymentRecommendation,
9
+ ReasonCode,
10
+ )
11
+ from invoiceops_env.server.invoiceops_env_environment import InvoiceOpsEnvironment
12
+ from invoiceops_env.server.scenario_loader import load_scenario
13
+
14
+
15
+ def _run_easy_perfect_case() -> float:
16
+ env = InvoiceOpsEnvironment()
17
+ env.reset(task_id="easy")
18
+
19
+ env.step(
20
+ InvoiceOpsAction(action_type=ActionType.OPEN_ARTIFACT, artifact_id="art-invoice")
21
+ )
22
+ env.step(
23
+ InvoiceOpsAction(
24
+ action_type=ActionType.OPEN_ARTIFACT, artifact_id="art-approval"
25
+ )
26
+ )
27
+ env.step(
28
+ InvoiceOpsAction(
29
+ action_type=ActionType.INSPECT_EXCEPTION,
30
+ exception_id="EX-NONPO-APPROVAL",
31
+ )
32
+ )
33
+ env.step(
34
+ InvoiceOpsAction(action_type=ActionType.OPEN_ARTIFACT, artifact_id="art-policy")
35
+ )
36
+ env.step(
37
+ InvoiceOpsAction(
38
+ action_type=ActionType.ADD_NOTE,
39
+ note_type=NoteType.ISSUE_SUMMARY,
40
+ reason_codes=[ReasonCode.NON_PO_APPROVAL_MISSING],
41
+ evidence_refs=["art-approval", "art-policy"],
42
+ text="Approval workflow is not initiated and the requester must start approval before payment can release.",
43
+ )
44
+ )
45
+ env.step(
46
+ InvoiceOpsAction(
47
+ action_type=ActionType.SET_LINE_RESOLUTION,
48
+ line_id="L1",
49
+ disposition=Disposition.HOLD,
50
+ reason_codes=[ReasonCode.NON_PO_APPROVAL_MISSING],
51
+ evidence_refs=[
52
+ "art-invoice",
53
+ "art-approval",
54
+ "art-policy",
55
+ "EX-NONPO-APPROVAL",
56
+ ],
57
+ route_to="requester",
58
+ )
59
+ )
60
+ env.step(
61
+ InvoiceOpsAction(
62
+ action_type=ActionType.SET_HEADER_RESOLUTION,
63
+ payment_recommendation=PaymentRecommendation.HOLD_FULL_INVOICE,
64
+ reason_codes=[ReasonCode.NON_PO_APPROVAL_MISSING],
65
+ evidence_refs=[
66
+ "art-invoice",
67
+ "art-approval",
68
+ "art-policy",
69
+ "EX-NONPO-APPROVAL",
70
+ ],
71
+ route_to="requester",
72
+ )
73
+ )
74
+ result = env.step(InvoiceOpsAction(action_type=ActionType.SUBMIT_CASE))
75
+ assert result.done is True
76
+ return float(result.episode_score or 0.0)
77
+
78
+
79
+ def _run_medium_perfect_case() -> float:
80
+ env = InvoiceOpsEnvironment()
81
+ env.reset(task_id="medium")
82
+
83
+ env.step(InvoiceOpsAction(action_type=ActionType.OPEN_ARTIFACT, artifact_id="art-po"))
84
+ env.step(
85
+ InvoiceOpsAction(
86
+ action_type=ActionType.OPEN_ARTIFACT, artifact_id="art-receipts"
87
+ )
88
+ )
89
+ env.step(
90
+ InvoiceOpsAction(
91
+ action_type=ActionType.INSPECT_EXCEPTION,
92
+ exception_id="EX-POSSIBLE-DUP",
93
+ )
94
+ )
95
+ env.step(
96
+ InvoiceOpsAction(
97
+ action_type=ActionType.RUN_DUPLICATE_CHECK,
98
+ match_strategy="normalized_invoice_no",
99
+ )
100
+ )
101
+ env.step(
102
+ InvoiceOpsAction(
103
+ action_type=ActionType.ADD_NOTE,
104
+ note_type=NoteType.REVIEW_SUMMARY,
105
+ reason_codes=[
106
+ ReasonCode.POSSIBLE_DUPLICATE_REVIEW,
107
+ ReasonCode.SAFE_TO_PAY,
108
+ ],
109
+ evidence_refs=["duplicate_check:normalized_invoice_no", "CAND-NORM-01"],
110
+ text="The normalized duplicate hit is a reversed prior record, so the invoice can release.",
111
+ )
112
+ )
113
+ env.step(
114
+ InvoiceOpsAction(
115
+ action_type=ActionType.SET_LINE_RESOLUTION,
116
+ line_id="L1",
117
+ disposition=Disposition.APPROVE,
118
+ reason_codes=[
119
+ ReasonCode.MATCHED_TO_PO_AND_RECEIPT,
120
+ ReasonCode.SAFE_TO_PAY,
121
+ ],
122
+ evidence_refs=[
123
+ "art-po",
124
+ "art-receipts",
125
+ "EX-POSSIBLE-DUP",
126
+ "duplicate_check:normalized_invoice_no",
127
+ "CAND-NORM-01",
128
+ ],
129
+ )
130
+ )
131
+ env.step(
132
+ InvoiceOpsAction(
133
+ action_type=ActionType.SET_LINE_RESOLUTION,
134
+ line_id="L2",
135
+ disposition=Disposition.APPROVE,
136
+ reason_codes=[
137
+ ReasonCode.MATCHED_TO_PO_AND_RECEIPT,
138
+ ReasonCode.SAFE_TO_PAY,
139
+ ],
140
+ evidence_refs=[
141
+ "art-po",
142
+ "art-receipts",
143
+ "EX-POSSIBLE-DUP",
144
+ "duplicate_check:normalized_invoice_no",
145
+ "CAND-NORM-01",
146
+ ],
147
+ )
148
+ )
149
+ env.step(
150
+ InvoiceOpsAction(
151
+ action_type=ActionType.SET_HEADER_RESOLUTION,
152
+ payment_recommendation=PaymentRecommendation.RELEASE_APPROVED_LINES,
153
+ reason_codes=[
154
+ ReasonCode.POSSIBLE_DUPLICATE_REVIEW,
155
+ ReasonCode.SAFE_TO_PAY,
156
+ ],
157
+ evidence_refs=[
158
+ "art-po",
159
+ "art-receipts",
160
+ "EX-POSSIBLE-DUP",
161
+ "duplicate_check:normalized_invoice_no",
162
+ "CAND-NORM-01",
163
+ ],
164
+ )
165
+ )
166
+ result = env.step(InvoiceOpsAction(action_type=ActionType.SUBMIT_CASE))
167
+ assert result.done is True
168
+ return float(result.episode_score or 0.0)
169
+
170
+
171
+ def _run_medium_plus_perfect_case() -> float:
172
+ env = InvoiceOpsEnvironment()
173
+ env.reset(task_id="medium_plus")
174
+
175
+ env.step(
176
+ InvoiceOpsAction(action_type=ActionType.OPEN_ARTIFACT, artifact_id="art-invoice")
177
+ )
178
+ env.step(InvoiceOpsAction(action_type=ActionType.OPEN_ARTIFACT, artifact_id="art-po"))
179
+ env.step(
180
+ InvoiceOpsAction(
181
+ action_type=ActionType.OPEN_ARTIFACT, artifact_id="art-receipts"
182
+ )
183
+ )
184
+ env.step(
185
+ InvoiceOpsAction(
186
+ action_type=ActionType.INSPECT_EXCEPTION,
187
+ exception_id="EX-POSSIBLE-DUP",
188
+ )
189
+ )
190
+ env.step(
191
+ InvoiceOpsAction(
192
+ action_type=ActionType.RUN_DUPLICATE_CHECK,
193
+ match_strategy="normalized_invoice_no",
194
+ )
195
+ )
196
+ env.step(
197
+ InvoiceOpsAction(
198
+ action_type=ActionType.INSPECT_EXCEPTION,
199
+ exception_id="EX-RECEIPT-L2",
200
+ )
201
+ )
202
+ env.step(
203
+ InvoiceOpsAction(action_type=ActionType.OPEN_ARTIFACT, artifact_id="art-policy")
204
+ )
205
+ env.step(
206
+ InvoiceOpsAction(
207
+ action_type=ActionType.ADD_NOTE,
208
+ note_type=NoteType.REVIEW_SUMMARY,
209
+ reason_codes=[
210
+ ReasonCode.POSSIBLE_DUPLICATE_REVIEW,
211
+ ReasonCode.SAFE_TO_PAY,
212
+ ],
213
+ evidence_refs=["duplicate_check:normalized_invoice_no", "CAND-NORM-01"],
214
+ text="The normalized duplicate hit is a reversed prior record, so duplicate review is cleared.",
215
+ )
216
+ )
217
+ env.step(
218
+ InvoiceOpsAction(
219
+ action_type=ActionType.ADD_NOTE,
220
+ note_type=NoteType.ISSUE_SUMMARY,
221
+ reason_codes=[ReasonCode.RECEIPT_NOT_CONFIRMED],
222
+ evidence_refs=[
223
+ "art-invoice",
224
+ "art-receipts",
225
+ "art-policy",
226
+ "EX-RECEIPT-L2",
227
+ ],
228
+ text="L2 remains blocked because the unsupported amount exceeds the de minimis receipt threshold.",
229
+ )
230
+ )
231
+ env.step(
232
+ InvoiceOpsAction(
233
+ action_type=ActionType.SET_LINE_RESOLUTION,
234
+ line_id="L1",
235
+ disposition=Disposition.APPROVE,
236
+ reason_codes=[
237
+ ReasonCode.MATCHED_TO_PO_AND_RECEIPT,
238
+ ReasonCode.SAFE_TO_PAY,
239
+ ],
240
+ evidence_refs=[
241
+ "art-po",
242
+ "art-receipts",
243
+ "EX-POSSIBLE-DUP",
244
+ "duplicate_check:normalized_invoice_no",
245
+ "CAND-NORM-01",
246
+ ],
247
+ )
248
+ )
249
+ env.step(
250
+ InvoiceOpsAction(
251
+ action_type=ActionType.SET_LINE_RESOLUTION,
252
+ line_id="L2",
253
+ disposition=Disposition.HOLD,
254
+ reason_codes=[ReasonCode.RECEIPT_NOT_CONFIRMED],
255
+ evidence_refs=[
256
+ "art-invoice",
257
+ "art-receipts",
258
+ "art-policy",
259
+ "EX-RECEIPT-L2",
260
+ ],
261
+ route_to="receiving",
262
+ )
263
+ )
264
+ env.step(
265
+ InvoiceOpsAction(
266
+ action_type=ActionType.SET_HEADER_RESOLUTION,
267
+ payment_recommendation=PaymentRecommendation.RELEASE_APPROVED_LINES,
268
+ reason_codes=[
269
+ ReasonCode.POSSIBLE_DUPLICATE_REVIEW,
270
+ ReasonCode.RECEIPT_NOT_CONFIRMED,
271
+ ReasonCode.SAFE_TO_PAY,
272
+ ],
273
+ evidence_refs=[
274
+ "art-policy",
275
+ "art-receipts",
276
+ "EX-RECEIPT-L2",
277
+ "duplicate_check:normalized_invoice_no",
278
+ "CAND-NORM-01",
279
+ ],
280
+ )
281
+ )
282
+ result = env.step(InvoiceOpsAction(action_type=ActionType.SUBMIT_CASE))
283
+ assert result.done is True
284
+ return float(result.episode_score or 0.0)
285
+
286
+
287
+ def _run_hard_perfect_case() -> float:
288
+ env = InvoiceOpsEnvironment()
289
+ env.reset(task_id="hard")
290
+
291
+ env.step(
292
+ InvoiceOpsAction(action_type=ActionType.OPEN_ARTIFACT, artifact_id="art-invoice")
293
+ )
294
+ env.step(
295
+ InvoiceOpsAction(
296
+ action_type=ActionType.INSPECT_EXCEPTION,
297
+ exception_id="EX-POSSIBLE-DUP",
298
+ )
299
+ )
300
+ env.step(
301
+ InvoiceOpsAction(
302
+ action_type=ActionType.RUN_DUPLICATE_CHECK,
303
+ match_strategy="normalized_invoice_no",
304
+ )
305
+ )
306
+ env.step(
307
+ InvoiceOpsAction(
308
+ action_type=ActionType.OPEN_ARTIFACT, artifact_id="art-receipts"
309
+ )
310
+ )
311
+ env.step(
312
+ InvoiceOpsAction(
313
+ action_type=ActionType.INSPECT_EXCEPTION,
314
+ exception_id="EX-RECEIPT-L1",
315
+ )
316
+ )
317
+ env.step(
318
+ InvoiceOpsAction(
319
+ action_type=ActionType.INSPECT_EXCEPTION,
320
+ exception_id="EX-RECEIPT-L2",
321
+ )
322
+ )
323
+ env.step(
324
+ InvoiceOpsAction(action_type=ActionType.OPEN_ARTIFACT, artifact_id="art-history")
325
+ )
326
+ env.step(
327
+ InvoiceOpsAction(action_type=ActionType.OPEN_ARTIFACT, artifact_id="art-vendor")
328
+ )
329
+ env.step(
330
+ InvoiceOpsAction(
331
+ action_type=ActionType.INSPECT_EXCEPTION,
332
+ exception_id="EX-TAX-001",
333
+ )
334
+ )
335
+ env.step(
336
+ InvoiceOpsAction(action_type=ActionType.OPEN_ARTIFACT, artifact_id="art-policy")
337
+ )
338
+ env.step(
339
+ InvoiceOpsAction(
340
+ action_type=ActionType.ADD_NOTE,
341
+ note_type=NoteType.REVIEW_SUMMARY,
342
+ reason_codes=[
343
+ ReasonCode.POSSIBLE_DUPLICATE_REVIEW,
344
+ ReasonCode.SAFE_TO_PAY,
345
+ ],
346
+ evidence_refs=["duplicate_check:normalized_invoice_no", "CAND-NORM-01"],
347
+ text="The normalized duplicate hit is a reversed prior record, so the duplicate control is cleared.",
348
+ )
349
+ )
350
+ env.step(
351
+ InvoiceOpsAction(
352
+ action_type=ActionType.ADD_NOTE,
353
+ note_type=NoteType.ISSUE_SUMMARY,
354
+ reason_codes=[ReasonCode.RECEIPT_NOT_CONFIRMED],
355
+ evidence_refs=["art-history", "EX-RECEIPT-L2"],
356
+ text="L2 remains blocked because the latest receiving history shows an open damage hold.",
357
+ )
358
+ )
359
+ env.step(
360
+ InvoiceOpsAction(
361
+ action_type=ActionType.ADD_NOTE,
362
+ note_type=NoteType.ESCALATION_REQUEST,
363
+ reason_codes=[ReasonCode.TAX_AMOUNT_MISMATCH],
364
+ evidence_refs=["art-vendor", "art-policy", "EX-TAX-001"],
365
+ text="The project is tax exempt, so payment must remain blocked pending Tax Ops review.",
366
+ )
367
+ )
368
+ env.step(
369
+ InvoiceOpsAction(
370
+ action_type=ActionType.SET_LINE_RESOLUTION,
371
+ line_id="L1",
372
+ disposition=Disposition.APPROVE,
373
+ reason_codes=[
374
+ ReasonCode.PARTIAL_RECEIPT_PENDING,
375
+ ReasonCode.SAFE_TO_PAY,
376
+ ],
377
+ evidence_refs=[
378
+ "art-invoice",
379
+ "art-receipts",
380
+ "EX-RECEIPT-L1",
381
+ "art-policy",
382
+ "duplicate_check:normalized_invoice_no",
383
+ "CAND-NORM-01",
384
+ ],
385
+ )
386
+ )
387
+ env.step(
388
+ InvoiceOpsAction(
389
+ action_type=ActionType.SET_LINE_RESOLUTION,
390
+ line_id="L2",
391
+ disposition=Disposition.HOLD,
392
+ reason_codes=[ReasonCode.RECEIPT_NOT_CONFIRMED],
393
+ evidence_refs=[
394
+ "art-receipts",
395
+ "art-history",
396
+ "EX-RECEIPT-L2",
397
+ "art-policy",
398
+ ],
399
+ route_to="receiving",
400
+ )
401
+ )
402
+ env.step(
403
+ InvoiceOpsAction(
404
+ action_type=ActionType.SET_LINE_RESOLUTION,
405
+ line_id="L3",
406
+ disposition=Disposition.APPROVE,
407
+ reason_codes=[
408
+ ReasonCode.MATCHED_TO_PO_AND_RECEIPT,
409
+ ReasonCode.SAFE_TO_PAY,
410
+ ],
411
+ evidence_refs=[
412
+ "art-invoice",
413
+ "art-receipts",
414
+ "duplicate_check:normalized_invoice_no",
415
+ "CAND-NORM-01",
416
+ ],
417
+ )
418
+ )
419
+ env.step(
420
+ InvoiceOpsAction(
421
+ action_type=ActionType.SET_HEADER_RESOLUTION,
422
+ payment_recommendation=PaymentRecommendation.HOLD_FULL_INVOICE,
423
+ reason_codes=[ReasonCode.TAX_AMOUNT_MISMATCH],
424
+ evidence_refs=[
425
+ "art-invoice",
426
+ "art-vendor",
427
+ "art-policy",
428
+ "EX-TAX-001",
429
+ ],
430
+ route_to="tax",
431
+ )
432
+ )
433
+ result = env.step(InvoiceOpsAction(action_type=ActionType.SUBMIT_CASE))
434
+ assert result.done is True
435
+ return float(result.episode_score or 0.0)
436
+
437
+
438
+ def test_perfect_cases_score_near_one() -> None:
439
+ assert _run_easy_perfect_case() >= 0.99
440
+ assert _run_medium_perfect_case() >= 0.99
441
+ assert _run_medium_plus_perfect_case() >= 0.99
442
+ assert _run_hard_perfect_case() >= 0.99
443
+
444
+
445
+ def test_tax_hold_can_coexist_with_approved_lines() -> None:
446
+ env = InvoiceOpsEnvironment()
447
+ env.reset(task_id="hard")
448
+
449
+ env.step(
450
+ InvoiceOpsAction(action_type=ActionType.OPEN_ARTIFACT, artifact_id="art-invoice")
451
+ )
452
+ env.step(
453
+ InvoiceOpsAction(
454
+ action_type=ActionType.RUN_DUPLICATE_CHECK,
455
+ match_strategy="normalized_invoice_no",
456
+ )
457
+ )
458
+ env.step(
459
+ InvoiceOpsAction(
460
+ action_type=ActionType.OPEN_ARTIFACT, artifact_id="art-receipts"
461
+ )
462
+ )
463
+ env.step(
464
+ InvoiceOpsAction(action_type=ActionType.OPEN_ARTIFACT, artifact_id="art-vendor")
465
+ )
466
+ env.step(
467
+ InvoiceOpsAction(
468
+ action_type=ActionType.INSPECT_EXCEPTION,
469
+ exception_id="EX-RECEIPT-L2",
470
+ )
471
+ )
472
+ env.step(
473
+ InvoiceOpsAction(
474
+ action_type=ActionType.INSPECT_EXCEPTION,
475
+ exception_id="EX-TAX-001",
476
+ )
477
+ )
478
+ env.step(
479
+ InvoiceOpsAction(action_type=ActionType.OPEN_ARTIFACT, artifact_id="art-history")
480
+ )
481
+ env.step(
482
+ InvoiceOpsAction(action_type=ActionType.OPEN_ARTIFACT, artifact_id="art-policy")
483
+ )
484
+ env.step(
485
+ InvoiceOpsAction(
486
+ action_type=ActionType.SET_LINE_RESOLUTION,
487
+ line_id="L1",
488
+ disposition=Disposition.APPROVE,
489
+ reason_codes=[ReasonCode.SAFE_TO_PAY],
490
+ evidence_refs=[
491
+ "art-invoice",
492
+ "art-receipts",
493
+ "duplicate_check:normalized_invoice_no",
494
+ "CAND-NORM-01",
495
+ "EX-RECEIPT-L1",
496
+ ],
497
+ )
498
+ )
499
+ env.step(
500
+ InvoiceOpsAction(
501
+ action_type=ActionType.SET_LINE_RESOLUTION,
502
+ line_id="L2",
503
+ disposition=Disposition.HOLD,
504
+ reason_codes=[ReasonCode.RECEIPT_NOT_CONFIRMED],
505
+ evidence_refs=["art-history", "EX-RECEIPT-L2"],
506
+ route_to="receiving",
507
+ )
508
+ )
509
+ env.step(
510
+ InvoiceOpsAction(
511
+ action_type=ActionType.SET_LINE_RESOLUTION,
512
+ line_id="L3",
513
+ disposition=Disposition.APPROVE,
514
+ reason_codes=[ReasonCode.SAFE_TO_PAY],
515
+ evidence_refs=[
516
+ "art-invoice",
517
+ "art-receipts",
518
+ "duplicate_check:normalized_invoice_no",
519
+ "CAND-NORM-01",
520
+ ],
521
+ )
522
+ )
523
+
524
+ result = env.step(
525
+ InvoiceOpsAction(
526
+ action_type=ActionType.SET_HEADER_RESOLUTION,
527
+ payment_recommendation=PaymentRecommendation.HOLD_FULL_INVOICE,
528
+ reason_codes=[ReasonCode.TAX_AMOUNT_MISMATCH],
529
+ evidence_refs=[
530
+ "art-invoice",
531
+ "art-vendor",
532
+ "art-policy",
533
+ "EX-TAX-001",
534
+ ],
535
+ route_to="tax",
536
+ )
537
+ )
538
+
539
+ assert result.done is False
540
+ assert result.message == "Saved header recommendation."
541
+
542
+
+def test_release_approved_lines_can_coexist_with_held_lines() -> None:
+    env = InvoiceOpsEnvironment()
+    env.reset(task_id="medium_plus")
+
+    env.step(
+        InvoiceOpsAction(action_type=ActionType.OPEN_ARTIFACT, artifact_id="art-po")
+    )
+    env.step(
+        InvoiceOpsAction(
+            action_type=ActionType.OPEN_ARTIFACT, artifact_id="art-receipts"
+        )
+    )
+    env.step(
+        InvoiceOpsAction(
+            action_type=ActionType.RUN_DUPLICATE_CHECK,
+            match_strategy="normalized_invoice_no",
+        )
+    )
+    env.step(
+        InvoiceOpsAction(
+            action_type=ActionType.INSPECT_EXCEPTION,
+            exception_id="EX-RECEIPT-L2",
+        )
+    )
+    env.step(
+        InvoiceOpsAction(
+            action_type=ActionType.SET_LINE_RESOLUTION,
+            line_id="L1",
+            disposition=Disposition.APPROVE,
+            reason_codes=[ReasonCode.SAFE_TO_PAY],
+            evidence_refs=[
+                "art-po",
+                "art-receipts",
+                "duplicate_check:normalized_invoice_no",
+                "CAND-NORM-01",
+            ],
+        )
+    )
+    env.step(
+        InvoiceOpsAction(
+            action_type=ActionType.SET_LINE_RESOLUTION,
+            line_id="L2",
+            disposition=Disposition.HOLD,
+            reason_codes=[ReasonCode.RECEIPT_NOT_CONFIRMED],
+            evidence_refs=["art-receipts", "EX-RECEIPT-L2"],
+            route_to="receiving",
+        )
+    )
+
+    result = env.step(
+        InvoiceOpsAction(
+            action_type=ActionType.SET_HEADER_RESOLUTION,
+            payment_recommendation=PaymentRecommendation.RELEASE_APPROVED_LINES,
+            reason_codes=[ReasonCode.SAFE_TO_PAY],
+            evidence_refs=[
+                "art-receipts",
+                "duplicate_check:normalized_invoice_no",
+                "CAND-NORM-01",
+            ],
+        )
+    )
+
+    assert result.done is False
+    assert result.message == "Saved header recommendation."
+
+
+def test_release_approved_lines_without_approved_lines_is_invalid() -> None:
+    env = InvoiceOpsEnvironment()
+    env.reset(task_id="easy")
+
+    env.step(
+        InvoiceOpsAction(action_type=ActionType.OPEN_ARTIFACT, artifact_id="art-policy")
+    )
+    result = env.step(
+        InvoiceOpsAction(
+            action_type=ActionType.SET_HEADER_RESOLUTION,
+            payment_recommendation=PaymentRecommendation.RELEASE_APPROVED_LINES,
+            reason_codes=[ReasonCode.SAFE_TO_PAY],
+            evidence_refs=["art-policy"],
+        )
+    )
+    assert result.done is False
+
+    submit_result = env.step(InvoiceOpsAction(action_type=ActionType.SUBMIT_CASE))
+    assert submit_result.done is False
+    assert "requires at least one approved line" in submit_result.message
+
+
+def test_hard_budget_has_recovery_slack() -> None:
+    scenario = load_scenario(task_id="hard")
+
+    assert scenario.step_limit - scenario.hidden_truth.efficient_step_target >= 5
+
+
+def test_medium_plus_budget_has_recovery_slack() -> None:
+    scenario = load_scenario(task_id="medium_plus")
+
+    assert scenario.step_limit - scenario.hidden_truth.efficient_step_target >= 4
tests/test_grader.py ADDED
@@ -0,0 +1,1029 @@
+"""Grader discrimination tests for the 4-task InvoiceOps benchmark."""
+
+from invoiceops_env.models import (
+    DecisionBand,
+    Disposition,
+    HeaderResolution,
+    LineResolution,
+    PaymentRecommendation,
+    ReasonCode,
+    RouteTarget,
+)
+from invoiceops_env.server.grader import ReviewTrace, grade_case
+from invoiceops_env.server.scenario_loader import load_scenario
+
+
+def test_easy_single_line_header_fallback_rewards_correct_route() -> None:
+    scenario = load_scenario(task_id="easy")
+    trace = ReviewTrace(
+        ref_steps={
+            "art-invoice": 1,
+            "art-approval": 2,
+            "EX-NONPO-APPROVAL": 3,
+            "art-policy": 4,
+        },
+        steps_used=6,
+    )
+
+    correct_header_only = grade_case(
+        scenario,
+        line_resolutions={},
+        header_resolution=HeaderResolution(
+            resolution_id="HR-001",
+            payment_recommendation=PaymentRecommendation.HOLD_FULL_INVOICE,
+            reason_codes=[ReasonCode.NON_PO_APPROVAL_MISSING],
+            evidence_refs=[
+                "art-invoice",
+                "art-approval",
+                "art-policy",
+                "EX-NONPO-APPROVAL",
+            ],
+            route_to=RouteTarget.REQUESTER,
+            saved_at_step=5,
+        ),
+        notes={},
+        trace=trace,
+    )
+
+    wrong_route_header_only = grade_case(
+        scenario,
+        line_resolutions={},
+        header_resolution=HeaderResolution(
+            resolution_id="HR-001",
+            payment_recommendation=PaymentRecommendation.HOLD_FULL_INVOICE,
+            reason_codes=[ReasonCode.NON_PO_APPROVAL_MISSING],
+            evidence_refs=[
+                "art-invoice",
+                "art-approval",
+                "art-policy",
+                "EX-NONPO-APPROVAL",
+            ],
+            route_to=RouteTarget.AP_MANAGER,
+            saved_at_step=5,
+        ),
+        notes={},
+        trace=trace,
+    )
+
+    assert correct_header_only.decision_band is DecisionBand.BEST
+    assert wrong_route_header_only.decision_band is DecisionBand.WRONG
+    assert correct_header_only.total_score > 0.95
+    assert wrong_route_header_only.total_score < 0.30
+
+
+def test_medium_duplicate_evidence_creates_best_safe_and_wrong_bands() -> None:
+    scenario = load_scenario(task_id="medium")
+
+    best_trace = ReviewTrace(
+        ref_steps={
+            "art-po": 1,
+            "art-receipts": 2,
+            "EX-POSSIBLE-DUP": 3,
+            "duplicate_check:normalized_invoice_no": 4,
+            "CAND-NORM-01": 4,
+        },
+        steps_used=8,
+    )
+    best_report = grade_case(
+        scenario,
+        line_resolutions={
+            "L1": LineResolution(
+                resolution_id="LR-L1",
+                line_id="L1",
+                disposition=Disposition.APPROVE,
+                reason_codes=[
+                    ReasonCode.MATCHED_TO_PO_AND_RECEIPT,
+                    ReasonCode.SAFE_TO_PAY,
+                ],
+                evidence_refs=[
+                    "art-po",
+                    "art-receipts",
+                    "EX-POSSIBLE-DUP",
+                    "duplicate_check:normalized_invoice_no",
+                    "CAND-NORM-01",
+                ],
+                route_to=None,
+                saved_at_step=5,
+            ),
+            "L2": LineResolution(
+                resolution_id="LR-L2",
+                line_id="L2",
+                disposition=Disposition.APPROVE,
+                reason_codes=[
+                    ReasonCode.MATCHED_TO_PO_AND_RECEIPT,
+                    ReasonCode.SAFE_TO_PAY,
+                ],
+                evidence_refs=[
+                    "art-po",
+                    "art-receipts",
+                    "EX-POSSIBLE-DUP",
+                    "duplicate_check:normalized_invoice_no",
+                    "CAND-NORM-01",
+                ],
+                route_to=None,
+                saved_at_step=6,
+            ),
+        },
+        header_resolution=HeaderResolution(
+            resolution_id="HR-001",
+            payment_recommendation=PaymentRecommendation.RELEASE_APPROVED_LINES,
+            reason_codes=[
+                ReasonCode.POSSIBLE_DUPLICATE_REVIEW,
+                ReasonCode.SAFE_TO_PAY,
+            ],
+            evidence_refs=[
+                "art-po",
+                "art-receipts",
+                "EX-POSSIBLE-DUP",
+                "duplicate_check:normalized_invoice_no",
+                "CAND-NORM-01",
+            ],
+            route_to=None,
+            saved_at_step=7,
+        ),
+        notes={},
+        trace=best_trace,
+    )
+
+    safe_hold_report = grade_case(
+        scenario,
+        line_resolutions={
+            "L1": LineResolution(
+                resolution_id="LR-L1",
+                line_id="L1",
+                disposition=Disposition.HOLD,
+                reason_codes=[ReasonCode.POSSIBLE_DUPLICATE_REVIEW],
+                evidence_refs=[
+                    "EX-POSSIBLE-DUP",
+                    "duplicate_check:normalized_invoice_no",
+                    "CAND-NORM-01",
+                ],
+                route_to=None,
+                saved_at_step=5,
+            ),
+            "L2": LineResolution(
+                resolution_id="LR-L2",
+                line_id="L2",
+                disposition=Disposition.HOLD,
+                reason_codes=[ReasonCode.POSSIBLE_DUPLICATE_REVIEW],
+                evidence_refs=[
+                    "EX-POSSIBLE-DUP",
+                    "duplicate_check:normalized_invoice_no",
+                    "CAND-NORM-01",
+                ],
+                route_to=None,
+                saved_at_step=6,
+            ),
+        },
+        header_resolution=HeaderResolution(
+            resolution_id="HR-001",
+            payment_recommendation=PaymentRecommendation.HOLD_FULL_INVOICE,
+            reason_codes=[ReasonCode.POSSIBLE_DUPLICATE_REVIEW],
+            evidence_refs=[
+                "EX-POSSIBLE-DUP",
+                "duplicate_check:normalized_invoice_no",
+                "CAND-NORM-01",
+            ],
+            route_to=None,
+            saved_at_step=7,
+        ),
+        notes={},
+        trace=best_trace,
+    )
+
+    wrong_heuristic_hold = grade_case(
+        scenario,
+        line_resolutions={
+            "L1": LineResolution(
+                resolution_id="LR-L1",
+                line_id="L1",
+                disposition=Disposition.HOLD,
+                reason_codes=[ReasonCode.POSSIBLE_DUPLICATE_REVIEW],
+                evidence_refs=[
+                    "EX-POSSIBLE-DUP",
+                    "duplicate_check:vendor_amount_date",
+                    "CAND-AMT-02",
+                ],
+                route_to=None,
+                saved_at_step=5,
+            ),
+            "L2": LineResolution(
+                resolution_id="LR-L2",
+                line_id="L2",
+                disposition=Disposition.HOLD,
+                reason_codes=[ReasonCode.POSSIBLE_DUPLICATE_REVIEW],
+                evidence_refs=[
+                    "EX-POSSIBLE-DUP",
+                    "duplicate_check:vendor_amount_date",
+                    "CAND-AMT-02",
+                ],
+                route_to=None,
+                saved_at_step=6,
+            ),
+        },
+        header_resolution=HeaderResolution(
+            resolution_id="HR-001",
+            payment_recommendation=PaymentRecommendation.HOLD_FULL_INVOICE,
+            reason_codes=[ReasonCode.POSSIBLE_DUPLICATE_REVIEW],
+            evidence_refs=[
+                "EX-POSSIBLE-DUP",
+                "duplicate_check:vendor_amount_date",
+                "CAND-AMT-02",
+            ],
+            route_to=None,
+            saved_at_step=7,
+        ),
+        notes={},
+        trace=ReviewTrace(
+            ref_steps={
+                "art-po": 1,
+                "art-receipts": 2,
+                "EX-POSSIBLE-DUP": 3,
+                "duplicate_check:vendor_amount_date": 4,
+                "CAND-AMT-02": 4,
+            },
+            steps_used=8,
+        ),
+    )
+
+    assert best_report.decision_band is DecisionBand.BEST
+    assert safe_hold_report.decision_band is DecisionBand.SAFE_SUBOPTIMAL
+    assert wrong_heuristic_hold.decision_band is DecisionBand.WRONG
+    assert best_report.total_score > safe_hold_report.total_score > wrong_heuristic_hold.total_score
+    assert best_report.total_score > 0.95
+    assert 0.55 < safe_hold_report.total_score < 0.75
+    assert wrong_heuristic_hold.total_score < 0.30
+
+
+def test_medium_approval_without_duplicate_clearance_is_wrong() -> None:
+    scenario = load_scenario(task_id="medium")
+    trace = ReviewTrace(
+        ref_steps={
+            "art-po": 1,
+            "art-receipts": 2,
+            "EX-POSSIBLE-DUP": 3,
+        },
+        steps_used=7,
+    )
+
+    report = grade_case(
+        scenario,
+        line_resolutions={
+            "L1": LineResolution(
+                resolution_id="LR-L1",
+                line_id="L1",
+                disposition=Disposition.APPROVE,
+                reason_codes=[ReasonCode.SAFE_TO_PAY],
+                evidence_refs=["art-po", "art-receipts", "EX-POSSIBLE-DUP"],
+                route_to=None,
+                saved_at_step=4,
+            ),
+            "L2": LineResolution(
+                resolution_id="LR-L2",
+                line_id="L2",
+                disposition=Disposition.APPROVE,
+                reason_codes=[ReasonCode.SAFE_TO_PAY],
+                evidence_refs=["art-po", "art-receipts", "EX-POSSIBLE-DUP"],
+                route_to=None,
+                saved_at_step=5,
+            ),
+        },
+        header_resolution=HeaderResolution(
+            resolution_id="HR-001",
+            payment_recommendation=PaymentRecommendation.RELEASE_APPROVED_LINES,
+            reason_codes=[ReasonCode.SAFE_TO_PAY],
+            evidence_refs=["art-po", "art-receipts", "EX-POSSIBLE-DUP"],
+            route_to=None,
+            saved_at_step=6,
+        ),
+        notes={},
+        trace=trace,
+    )
+
+    assert report.decision_band is DecisionBand.WRONG
+    assert report.total_score < 0.30
+
+
+def test_medium_observed_evidence_counts_even_if_final_refs_are_sparse() -> None:
+    scenario = load_scenario(task_id="medium")
+    trace = ReviewTrace(
+        ref_steps={
+            "art-po": 1,
+            "art-receipts": 2,
+            "EX-POSSIBLE-DUP": 3,
+            "duplicate_check:normalized_invoice_no": 4,
+            "CAND-NORM-01": 4,
+        },
+        steps_used=8,
+    )
+
+    report = grade_case(
+        scenario,
+        line_resolutions={
+            "L1": LineResolution(
+                resolution_id="LR-L1",
+                line_id="L1",
+                disposition=Disposition.APPROVE,
+                reason_codes=[
+                    ReasonCode.MATCHED_TO_PO_AND_RECEIPT,
+                    ReasonCode.SAFE_TO_PAY,
+                ],
+                evidence_refs=["art-po", "CAND-NORM-01"],
+                route_to=None,
+                saved_at_step=5,
+            ),
+            "L2": LineResolution(
+                resolution_id="LR-L2",
+                line_id="L2",
+                disposition=Disposition.APPROVE,
+                reason_codes=[
+                    ReasonCode.MATCHED_TO_PO_AND_RECEIPT,
+                    ReasonCode.SAFE_TO_PAY,
+                ],
+                evidence_refs=["art-po"],
+                route_to=None,
+                saved_at_step=6,
+            ),
+        },
+        header_resolution=HeaderResolution(
+            resolution_id="HR-001",
+            payment_recommendation=PaymentRecommendation.RELEASE_APPROVED_LINES,
+            reason_codes=[
+                ReasonCode.MATCHED_TO_PO_AND_RECEIPT,
+                ReasonCode.SAFE_TO_PAY,
+            ],
+            evidence_refs=["art-po", "CAND-NORM-01"],
+            route_to=None,
+            saved_at_step=7,
+        ),
+        notes={},
+        trace=trace,
+    )
+
+    assert report.decision_band is DecisionBand.BEST
+    assert 0.80 < report.total_score < 0.98
+    assert report.evidence_score < 0.70
+
+
+def test_medium_duplicate_hit_without_po_and_receipts_stays_wrong() -> None:
+    scenario = load_scenario(task_id="medium")
+    trace = ReviewTrace(
+        ref_steps={
+            "EX-POSSIBLE-DUP": 1,
+            "duplicate_check:normalized_invoice_no": 2,
+            "CAND-NORM-01": 2,
+        },
+        steps_used=6,
+    )
+
+    report = grade_case(
+        scenario,
+        line_resolutions={
+            "L1": LineResolution(
+                resolution_id="LR-L1",
+                line_id="L1",
+                disposition=Disposition.APPROVE,
+                reason_codes=[ReasonCode.SAFE_TO_PAY],
+                evidence_refs=["CAND-NORM-01"],
+                route_to=None,
+                saved_at_step=3,
+            ),
+            "L2": LineResolution(
+                resolution_id="LR-L2",
+                line_id="L2",
+                disposition=Disposition.APPROVE,
+                reason_codes=[ReasonCode.SAFE_TO_PAY],
+                evidence_refs=["CAND-NORM-01"],
+                route_to=None,
+                saved_at_step=4,
+            ),
+        },
+        header_resolution=HeaderResolution(
+            resolution_id="HR-001",
+            payment_recommendation=PaymentRecommendation.RELEASE_APPROVED_LINES,
+            reason_codes=[ReasonCode.SAFE_TO_PAY],
+            evidence_refs=["CAND-NORM-01"],
+            route_to=None,
+            saved_at_step=5,
+        ),
+        notes={},
+        trace=trace,
+    )
+
+    assert report.decision_band is DecisionBand.WRONG
+    assert report.total_score < 0.30
+
+
+def test_medium_plus_partial_release_creates_best_safe_wrong_and_unsafe_bands() -> None:
+    scenario = load_scenario(task_id="medium_plus")
+    full_trace = ReviewTrace(
+        ref_steps={
+            "art-invoice": 1,
+            "art-po": 2,
+            "art-receipts": 3,
+            "EX-POSSIBLE-DUP": 4,
+            "duplicate_check:normalized_invoice_no": 5,
+            "CAND-NORM-01": 5,
+            "EX-RECEIPT-L2": 6,
+            "art-policy": 7,
+        },
+        steps_used=12,
+    )
+    conservative_trace = ReviewTrace(
+        ref_steps={
+            "art-invoice": 1,
+            "art-po": 2,
+            "art-receipts": 3,
+            "EX-POSSIBLE-DUP": 4,
+            "duplicate_check:normalized_invoice_no": 5,
+            "CAND-NORM-01": 5,
+            "EX-RECEIPT-L2": 6,
+        },
+        steps_used=11,
+    )
+
+    best_report = grade_case(
+        scenario,
+        line_resolutions={
+            "L1": LineResolution(
+                resolution_id="LR-L1",
+                line_id="L1",
+                disposition=Disposition.APPROVE,
+                reason_codes=[
+                    ReasonCode.MATCHED_TO_PO_AND_RECEIPT,
+                    ReasonCode.SAFE_TO_PAY,
+                ],
+                evidence_refs=[
+                    "art-po",
+                    "art-receipts",
+                    "EX-POSSIBLE-DUP",
+                    "duplicate_check:normalized_invoice_no",
+                    "CAND-NORM-01",
+                ],
+                route_to=None,
+                saved_at_step=8,
+            ),
+            "L2": LineResolution(
+                resolution_id="LR-L2",
+                line_id="L2",
+                disposition=Disposition.HOLD,
+                reason_codes=[ReasonCode.RECEIPT_NOT_CONFIRMED],
+                evidence_refs=[
+                    "art-invoice",
+                    "art-receipts",
+                    "EX-RECEIPT-L2",
+                    "art-policy",
+                ],
+                route_to=RouteTarget.RECEIVING,
+                saved_at_step=9,
+            ),
+        },
+        header_resolution=HeaderResolution(
+            resolution_id="HR-001",
+            payment_recommendation=PaymentRecommendation.RELEASE_APPROVED_LINES,
+            reason_codes=[
+                ReasonCode.POSSIBLE_DUPLICATE_REVIEW,
+                ReasonCode.RECEIPT_NOT_CONFIRMED,
+                ReasonCode.SAFE_TO_PAY,
+            ],
+            evidence_refs=[
+                "art-policy",
+                "art-receipts",
+                "EX-RECEIPT-L2",
+                "duplicate_check:normalized_invoice_no",
+                "CAND-NORM-01",
+            ],
+            route_to=None,
+            saved_at_step=10,
+        ),
+        notes={},
+        trace=full_trace,
+    )
+
+    safe_report = grade_case(
+        scenario,
+        line_resolutions={
+            "L1": LineResolution(
+                resolution_id="LR-L1",
+                line_id="L1",
+                disposition=Disposition.HOLD,
+                reason_codes=[ReasonCode.POSSIBLE_DUPLICATE_REVIEW],
+                evidence_refs=[
+                    "art-po",
+                    "art-receipts",
+                    "EX-POSSIBLE-DUP",
+                    "duplicate_check:normalized_invoice_no",
+                    "CAND-NORM-01",
+                ],
+                route_to=None,
+                saved_at_step=7,
+            ),
+            "L2": LineResolution(
+                resolution_id="LR-L2",
+                line_id="L2",
+                disposition=Disposition.HOLD,
+                reason_codes=[ReasonCode.RECEIPT_NOT_CONFIRMED],
+                evidence_refs=["art-receipts", "EX-RECEIPT-L2"],
+                route_to=RouteTarget.RECEIVING,
+                saved_at_step=8,
+            ),
+        },
+        header_resolution=HeaderResolution(
+            resolution_id="HR-001",
+            payment_recommendation=PaymentRecommendation.HOLD_FULL_INVOICE,
+            reason_codes=[
+                ReasonCode.POSSIBLE_DUPLICATE_REVIEW,
+                ReasonCode.RECEIPT_NOT_CONFIRMED,
+            ],
+            evidence_refs=[
+                "art-receipts",
+                "EX-RECEIPT-L2",
+                "duplicate_check:normalized_invoice_no",
+                "CAND-NORM-01",
+            ],
+            route_to=None,
+            saved_at_step=9,
+        ),
+        notes={},
+        trace=conservative_trace,
+    )
+
+    wrong_report = grade_case(
+        scenario,
+        line_resolutions={
+            "L1": LineResolution(
+                resolution_id="LR-L1",
+                line_id="L1",
+                disposition=Disposition.APPROVE,
+                reason_codes=[
+                    ReasonCode.MATCHED_TO_PO_AND_RECEIPT,
+                    ReasonCode.SAFE_TO_PAY,
+                ],
+                evidence_refs=[
+                    "art-po",
+                    "art-receipts",
+                    "EX-POSSIBLE-DUP",
+                    "duplicate_check:normalized_invoice_no",
+                    "CAND-NORM-01",
+                ],
+                route_to=None,
+                saved_at_step=7,
+            ),
+            "L2": LineResolution(
+                resolution_id="LR-L2",
+                line_id="L2",
+                disposition=Disposition.HOLD,
+                reason_codes=[ReasonCode.RECEIPT_NOT_CONFIRMED],
+                evidence_refs=["art-invoice", "art-receipts", "EX-RECEIPT-L2"],
+                route_to=RouteTarget.RECEIVING,
+                saved_at_step=8,
+            ),
+        },
+        header_resolution=HeaderResolution(
+            resolution_id="HR-001",
+            payment_recommendation=PaymentRecommendation.RELEASE_APPROVED_LINES,
+            reason_codes=[
+                ReasonCode.POSSIBLE_DUPLICATE_REVIEW,
+                ReasonCode.RECEIPT_NOT_CONFIRMED,
+                ReasonCode.SAFE_TO_PAY,
+            ],
+            evidence_refs=[
+                "art-receipts",
+                "EX-RECEIPT-L2",
+                "duplicate_check:normalized_invoice_no",
+                "CAND-NORM-01",
+            ],
+            route_to=None,
+            saved_at_step=9,
+        ),
+        notes={},
+        trace=conservative_trace,
+    )
+
+    unsafe_report = grade_case(
+        scenario,
+        line_resolutions={
+            "L1": LineResolution(
+                resolution_id="LR-L1",
+                line_id="L1",
+                disposition=Disposition.APPROVE,
+                reason_codes=[
+                    ReasonCode.MATCHED_TO_PO_AND_RECEIPT,
+                    ReasonCode.SAFE_TO_PAY,
+                ],
+                evidence_refs=[
+                    "art-po",
+                    "art-receipts",
+                    "EX-POSSIBLE-DUP",
+                    "duplicate_check:normalized_invoice_no",
+                    "CAND-NORM-01",
+                ],
+                route_to=None,
+                saved_at_step=8,
+            ),
+            "L2": LineResolution(
+                resolution_id="LR-L2",
+                line_id="L2",
+                disposition=Disposition.APPROVE,
+                reason_codes=[ReasonCode.SAFE_TO_PAY],
+                evidence_refs=[
+                    "art-invoice",
+                    "art-receipts",
+                    "EX-RECEIPT-L2",
+                    "art-policy",
+                ],
+                route_to=None,
+                saved_at_step=9,
+            ),
+        },
+        header_resolution=HeaderResolution(
+            resolution_id="HR-001",
+            payment_recommendation=PaymentRecommendation.RELEASE_APPROVED_LINES,
+            reason_codes=[ReasonCode.SAFE_TO_PAY],
+            evidence_refs=[
+                "art-policy",
+                "art-receipts",
+                "EX-RECEIPT-L2",
+                "duplicate_check:normalized_invoice_no",
+                "CAND-NORM-01",
+            ],
+            route_to=None,
+            saved_at_step=10,
+        ),
+        notes={},
+        trace=full_trace,
+    )
+
+    assert best_report.decision_band is DecisionBand.BEST
+    assert safe_report.decision_band is DecisionBand.SAFE_SUBOPTIMAL
+    assert wrong_report.decision_band is DecisionBand.WRONG
+    assert unsafe_report.decision_band is DecisionBand.UNSAFE
+    assert (
+        best_report.total_score
+        > safe_report.total_score
+        > wrong_report.total_score
+        > unsafe_report.total_score
+    )
+    assert best_report.total_score > 0.95
+    assert 0.55 < safe_report.total_score < 0.80
+    assert wrong_report.total_score < 0.45
+    assert unsafe_report.total_score < 0.15
+
+
+def test_hard_composition_rewards_mixed_judgment_and_penalizes_templates() -> None:
+    scenario = load_scenario(task_id="hard")
+    full_trace = ReviewTrace(
+        ref_steps={
+            "art-invoice": 1,
+            "EX-POSSIBLE-DUP": 2,
+            "duplicate_check:normalized_invoice_no": 3,
+            "CAND-NORM-01": 3,
+            "art-receipts": 4,
+            "EX-RECEIPT-L1": 5,
+            "EX-RECEIPT-L2": 6,
+            "art-history": 7,
+            "art-vendor": 8,
+            "EX-TAX-001": 9,
+            "art-policy": 10,
+        },
+        steps_used=17,
+    )
+
+    best_report = grade_case(
+        scenario,
+        line_resolutions={
+            "L1": LineResolution(
+                resolution_id="LR-L1",
+                line_id="L1",
+                disposition=Disposition.APPROVE,
+                reason_codes=[
+                    ReasonCode.PARTIAL_RECEIPT_PENDING,
+                    ReasonCode.SAFE_TO_PAY,
+                ],
+                evidence_refs=[
+                    "art-invoice",
+                    "art-receipts",
+                    "EX-RECEIPT-L1",
+                    "art-policy",
+                    "duplicate_check:normalized_invoice_no",
+                    "CAND-NORM-01",
+                ],
+                route_to=None,
+                saved_at_step=12,
+            ),
+            "L2": LineResolution(
+                resolution_id="LR-L2",
+                line_id="L2",
+                disposition=Disposition.HOLD,
+                reason_codes=[ReasonCode.RECEIPT_NOT_CONFIRMED],
+                evidence_refs=[
+                    "art-receipts",
+                    "art-history",
+                    "EX-RECEIPT-L2",
+                    "art-policy",
+                ],
+                route_to=RouteTarget.RECEIVING,
+                saved_at_step=13,
+            ),
+            "L3": LineResolution(
+                resolution_id="LR-L3",
+                line_id="L3",
+                disposition=Disposition.APPROVE,
+                reason_codes=[
+                    ReasonCode.MATCHED_TO_PO_AND_RECEIPT,
+                    ReasonCode.SAFE_TO_PAY,
+                ],
+                evidence_refs=[
+                    "art-invoice",
+                    "art-receipts",
+                    "duplicate_check:normalized_invoice_no",
+                    "CAND-NORM-01",
+                ],
+                route_to=None,
+                saved_at_step=14,
+            ),
+        },
+        header_resolution=HeaderResolution(
+            resolution_id="HR-001",
+            payment_recommendation=PaymentRecommendation.HOLD_FULL_INVOICE,
+            reason_codes=[ReasonCode.TAX_AMOUNT_MISMATCH],
+            evidence_refs=[
+                "art-invoice",
+                "art-vendor",
+                "art-policy",
+                "EX-TAX-001",
+            ],
+            route_to=RouteTarget.TAX,
+            saved_at_step=15,
+        ),
+        notes={},
+        trace=full_trace,
+    )
+
+    blanket_hold = grade_case(
+        scenario,
+        line_resolutions={
+            "L1": LineResolution(
+                resolution_id="LR-L1",
+                line_id="L1",
+                disposition=Disposition.HOLD,
+                reason_codes=[ReasonCode.PARTIAL_RECEIPT_PENDING],
+                evidence_refs=[
+                    "duplicate_check:normalized_invoice_no",
+                    "CAND-NORM-01",
+                    "EX-RECEIPT-L1",
+                    "art-policy",
+                ],
+                route_to=None,
+                saved_at_step=12,
+            ),
+            "L2": LineResolution(
+                resolution_id="LR-L2",
+                line_id="L2",
+                disposition=Disposition.HOLD,
+                reason_codes=[ReasonCode.RECEIPT_NOT_CONFIRMED],
+                evidence_refs=["art-history", "EX-RECEIPT-L2"],
+                route_to=RouteTarget.RECEIVING,
+                saved_at_step=13,
+            ),
+            "L3": LineResolution(
+                resolution_id="LR-L3",
+                line_id="L3",
+                disposition=Disposition.HOLD,
+                reason_codes=[ReasonCode.POSSIBLE_DUPLICATE_REVIEW],
+                evidence_refs=[
+                    "duplicate_check:normalized_invoice_no",
+                    "CAND-NORM-01",
+                ],
+                route_to=None,
+                saved_at_step=14,
+            ),
+        },
+        header_resolution=HeaderResolution(
+            resolution_id="HR-001",
+            payment_recommendation=PaymentRecommendation.HOLD_FULL_INVOICE,
+            reason_codes=[ReasonCode.TAX_AMOUNT_MISMATCH],
+            evidence_refs=["art-vendor", "art-policy", "EX-TAX-001"],
+            route_to=RouteTarget.TAX,
+            saved_at_step=15,
+        ),
+        notes={},
+        trace=full_trace,
+    )
+
+    unsafe_release = grade_case(
+        scenario,
+        line_resolutions={
+            "L1": LineResolution(
+                resolution_id="LR-L1",
+                line_id="L1",
+                disposition=Disposition.APPROVE,
+                reason_codes=[
+                    ReasonCode.PARTIAL_RECEIPT_PENDING,
+                    ReasonCode.SAFE_TO_PAY,
+                ],
+                evidence_refs=[
+                    "art-invoice",
+                    "art-receipts",
+                    "EX-RECEIPT-L1",
+                    "art-policy",
+                    "duplicate_check:normalized_invoice_no",
+                    "CAND-NORM-01",
+                ],
+                route_to=None,
+                saved_at_step=12,
+            ),
+            "L2": LineResolution(
+                resolution_id="LR-L2",
+                line_id="L2",
+                disposition=Disposition.HOLD,
+                reason_codes=[ReasonCode.RECEIPT_NOT_CONFIRMED],
+                evidence_refs=[
+                    "art-receipts",
+                    "art-history",
+                    "EX-RECEIPT-L2",
+                    "art-policy",
+                ],
+                route_to=RouteTarget.RECEIVING,
+                saved_at_step=13,
+            ),
+            "L3": LineResolution(
+                resolution_id="LR-L3",
+                line_id="L3",
+                disposition=Disposition.APPROVE,
+                reason_codes=[
+                    ReasonCode.MATCHED_TO_PO_AND_RECEIPT,
+                    ReasonCode.SAFE_TO_PAY,
+                ],
+                evidence_refs=[
+                    "art-invoice",
+                    "art-receipts",
+                    "duplicate_check:normalized_invoice_no",
+                    "CAND-NORM-01",
+                ],
+                route_to=None,
+                saved_at_step=14,
+            ),
+        },
+        header_resolution=HeaderResolution(
+            resolution_id="HR-001",
+            payment_recommendation=PaymentRecommendation.RELEASE_APPROVED_LINES,
+            reason_codes=[ReasonCode.SAFE_TO_PAY],
+            evidence_refs=[
+                "art-po",
+                "art-receipts",
+                "duplicate_check:normalized_invoice_no",
+                "CAND-NORM-01",
+            ],
+            route_to=None,
+            saved_at_step=15,
+        ),
+        notes={},
+        trace=full_trace,
+    )
+
+    assert best_report.decision_band is DecisionBand.BEST
+    assert blanket_hold.decision_band is DecisionBand.SAFE_SUBOPTIMAL
+    assert unsafe_release.decision_band is DecisionBand.UNSAFE
+    assert best_report.total_score > blanket_hold.total_score > unsafe_release.total_score
+    assert best_report.total_score > 0.95
+    assert 0.55 < blanket_hold.total_score < 0.75
+    assert unsafe_release.total_score <= 0.15
+
+
+def test_hard_conservative_partial_evidence_scores_safe_suboptimal() -> None:
+    scenario = load_scenario(task_id="hard")
+    partial_trace = ReviewTrace(
+        ref_steps={
+            "EX-POSSIBLE-DUP": 1,
+            "duplicate_check:normalized_invoice_no": 2,
+            "CAND-NORM-01": 2,
+            "EX-RECEIPT-L1": 3,
+            "EX-RECEIPT-L2": 4,
+            "art-receipts": 5,
+            "art-vendor": 6,
+            "art-policy": 7,
+            "EX-TAX-001": 8,
+        },
+        steps_used=12,
+    )
+
+    report = grade_case(
+        scenario,
+        line_resolutions={
+            "L1": LineResolution(
+                resolution_id="LR-L1",
+                line_id="L1",
+                disposition=Disposition.HOLD,
+                reason_codes=[ReasonCode.PARTIAL_RECEIPT_PENDING],
+                evidence_refs=[
+                    "art-receipts",
+                    "duplicate_check:normalized_invoice_no",
+                    "CAND-NORM-01",
+                    "EX-RECEIPT-L1",
+                ],
+                route_to=None,
+                saved_at_step=8,
+            ),
+            "L2": LineResolution(
+                resolution_id="LR-L2",
+                line_id="L2",
+                disposition=Disposition.HOLD,
+                reason_codes=[ReasonCode.RECEIPT_NOT_CONFIRMED],
+                evidence_refs=["art-receipts", "EX-RECEIPT-L2"],
+                route_to=RouteTarget.RECEIVING,
+                saved_at_step=9,
+            ),
+            "L3": LineResolution(
+                resolution_id="LR-L3",
+                line_id="L3",
+                disposition=Disposition.HOLD,
+                reason_codes=[ReasonCode.POSSIBLE_DUPLICATE_REVIEW],
+                evidence_refs=[
+                    "art-receipts",
+                    "duplicate_check:normalized_invoice_no",
+                    "CAND-NORM-01",
+                ],
+                route_to=None,
+                saved_at_step=10,
+            ),
+        },
+        header_resolution=HeaderResolution(
+            resolution_id="HR-001",
+            payment_recommendation=PaymentRecommendation.HOLD_FULL_INVOICE,
+            reason_codes=[ReasonCode.TAX_AMOUNT_MISMATCH],
+            evidence_refs=["EX-TAX-001", "art-vendor", "art-policy"],
+            route_to=RouteTarget.TAX,
+            saved_at_step=11,
+        ),
+        notes={},
+        trace=partial_trace,
+    )
+
+    assert report.decision_band is DecisionBand.SAFE_SUBOPTIMAL
+    assert 0.55 < report.total_score < 0.80
+
+
+def test_hard_shortcut_approvals_without_invoice_and_history_stay_wrong() -> None:
+    scenario = load_scenario(task_id="hard")
+    shortcut_trace = ReviewTrace(
+        ref_steps={
+            "EX-POSSIBLE-DUP": 1,
+            "duplicate_check:normalized_invoice_no": 2,
+            "CAND-NORM-01": 2,
+            "EX-RECEIPT-L1": 3,
+            "EX-RECEIPT-L2": 4,
+            "art-receipts": 5,
+            "art-vendor": 6,
+            "art-policy": 7,
+            "EX-TAX-001": 8,
+        },
+        steps_used=12,
+    )
+
+    report = grade_case(
+        scenario,
+        line_resolutions={
+            "L1": LineResolution(
+                resolution_id="LR-L1",
+                line_id="L1",
+                disposition=Disposition.APPROVE,
+                reason_codes=[ReasonCode.PARTIAL_RECEIPT_PENDING],
+                evidence_refs=["art-receipts", "EX-RECEIPT-L1"],
+                route_to=None,
+                saved_at_step=8,
+            ),
+            "L2": LineResolution(
+                resolution_id="LR-L2",
+                line_id="L2",
+                disposition=Disposition.HOLD,
+                reason_codes=[ReasonCode.RECEIPT_NOT_CONFIRMED],
+                evidence_refs=["art-receipts", "EX-RECEIPT-L2"],
+                route_to=RouteTarget.RECEIVING,
+                saved_at_step=9,
+            ),
+            "L3": LineResolution(
+                resolution_id="LR-L3",
+                line_id="L3",
+                disposition=Disposition.APPROVE,
+                reason_codes=[ReasonCode.MATCHED_TO_PO_AND_RECEIPT],
+                evidence_refs=["art-receipts"],
+                route_to=None,
+                saved_at_step=10,
+            ),
+        },
+        header_resolution=HeaderResolution(
+            resolution_id="HR-001",
+            payment_recommendation=PaymentRecommendation.HOLD_FULL_INVOICE,
+            reason_codes=[ReasonCode.TAX_AMOUNT_MISMATCH],
+            evidence_refs=["EX-TAX-001", "art-vendor", "art-policy"],
+            route_to=RouteTarget.TAX,
+            saved_at_step=11,
+        ),
+        notes={},
+        trace=shortcut_trace,
+    )
+
+    assert report.decision_band is DecisionBand.WRONG
+    assert report.core_decision_score < 0.35
+    assert report.total_score < 0.35
tests/test_validation_smoke.py ADDED
@@ -0,0 +1,28 @@
+import subprocess
+from pathlib import Path
+
+
+ROOT = Path(__file__).resolve().parents[1]
+OPENENV_PROJECT = ROOT.parent / "OpenEnv"
+if not OPENENV_PROJECT.exists():
+    OPENENV_PROJECT = ROOT.parent / "markov" / "OpenEnv"
+
+
+def test_openenv_validate_passes() -> None:
+    result = subprocess.run(
+        [
+            "uv",
+            "run",
+            "--project",
+            str(OPENENV_PROJECT),
+            "openenv",
+            "validate",
+            str(ROOT),
+        ],
+        check=False,
+        capture_output=True,
+        text=True,
+    )
+
+    assert result.returncode == 0, result.stdout + result.stderr
+    assert "[OK]" in result.stdout
uv.lock ADDED
The diff for this file is too large to render. See raw diff