# Invoice Exception Handler – OpenEnv
> An AI agent learning environment that simulates accounts payable exception handling.
> The agent acts as an AP analyst: receives flagged invoices, investigates root causes,
> makes decisions, and closes cases. Built for the OpenEnv hackathon.
[OpenEnv](https://github.com/openenv/openenv) · [Python](https://python.org) · [Hugging Face Spaces](https://huggingface.co/spaces)
---
## For Agents Building This Project
This README is the single source of truth for building the entire project from scratch.
Read every section before writing any code. Do not skip sections. Do not guess.
### Ground Rules
1. **Write code like a human wrote it.** Use real variable names, not `x` or `tmp`. Add comments where
the logic is non-obvious. Leave one blank line between logical blocks inside functions. Use 4-space
indentation everywhere. Python files get a module docstring at the top explaining what the file does.
2. **Create a new Git repo and push after every major milestone.** A milestone is: models done,
tasks done, environment done, API done, inference done, app done. Not after every file.
3. **Record every change in `documents/CHANGELOG.md`.** Use the format in the changelog section below.
Before pushing, append to the changelog what changed and why.
4. **If something in this README conflicts with the competition spec, the competition spec wins.**
The competition spec is in the document the user shared. Key points: `inference.py` must use the
OpenAI client. `[START]` `[STEP]` `[END]` format must be exact. `/reset` must return 200.
5. **Test before pushing.** Run `python -c "from env import InvoiceExceptionEnv"` to check imports.
Run `python inference.py` with a dummy API key to check the log format. Run `docker build .` to
check the Dockerfile before claiming it works.
---
## What This Environment Does
Every company that buys goods or services deals with invoice exceptions – mismatches between
what was ordered, what arrived, and what was invoiced. Currently these are handled by accounts
payable analysts who manually compare documents and decide: approve, reject, hold, or escalate.
This environment puts an AI agent in that analyst's chair. The agent receives:
- A **Purchase Order** (what was agreed to)
- An **Invoice** (what the supplier is claiming)
- A **Goods Receipt Note** (what actually arrived)
- A **Supplier Master** (the verified supplier record)
- An **Exception Flag** (why the system flagged this invoice)
The agent investigates, runs checks, queries people, makes a decision, and closes the case.
Every action has realistic consequences including financial, compliance, and fraud implications.
---
## Repository Structure
Build the project with exactly this structure. Do not add extra directories. Do not rename files.
```
invoice-exception-handler/
│
├── README.md              ← this file
├── openenv.yaml           ← OpenEnv spec, must pass openenv validate
├── Dockerfile             ← single-stage Python 3.11-slim
├── requirements.txt       ← pinned versions
├── inference.py           ← competition script, MUST be named this
├── app.py                 ← Gradio + FastAPI, entry point for HF Spaces
│
├── env/
│   ├── __init__.py        ← exports InvoiceExceptionEnv, Action, ALL_TASKS
│   ├── models.py          ← all Pydantic models (Action, EnvironmentState, etc.)
│   ├── environment.py     ← InvoiceExceptionEnv class
│   └── tasks.py           ← 3 task classes, EpisodeData, graders
│
└── documents/
    ├── PRD-001-product-requirements.md
    ├── CHANGELOG.md
    ├── ARCHITECTURE.md
    └── BASELINE-SCORES.md
```
---
## Step-by-Step Build Order
Follow this order exactly. Do not jump ahead.
```
Step 1  – Create the repo
Step 2  – Write requirements.txt
Step 3  – Write env/models.py
Step 4  – Write env/tasks.py
Step 5  – Write env/environment.py
Step 6  – Write env/__init__.py
Step 7  – Smoke test the environment (run a quick script)
Step 8  – Write openenv.yaml
Step 9  – Write inference.py
Step 10 – Write app.py
Step 11 – Write Dockerfile
Step 12 – Full end-to-end test
Step 13 – Write documents/
Step 14 – Push and verify
```
---
## Step 1 – Create the Repo
```bash
# Create the project directory
mkdir invoice-exception-handler
cd invoice-exception-handler
# Initialise git
git init
git checkout -b main
# Create the directory structure
mkdir -p env documents
# Create empty placeholder files so git tracks the structure
touch env/__init__.py
touch documents/.gitkeep
# First commit – skeleton only
git add .
git commit -m "init: project skeleton"
# Create the repo on GitHub/HF and push
# Replace with your actual remote
git remote add origin https://github.com/YOUR_USERNAME/invoice-exception-handler.git
git push -u origin main
```
---
## Step 2 – requirements.txt
Pin every version. Do not use `>=` ranges – the validator builds in a clean environment and
range mismatches cause mysterious failures.
```
pydantic==2.7.1
fastapi==0.111.0
uvicorn==0.29.0
gradio==4.36.1
openai==1.35.3
pyyaml==6.0.1
httpx==0.27.0
python-multipart==0.0.9
```
---
## Step 3 – env/models.py
This file defines every typed object in the system. Write it before any other Python code.
Nothing is untyped. Every field has a type annotation.
### What goes in models.py
**Enumerations:**
- `ActionType` – the 9 action types an agent can take (string enum)
- `DecisionType` – approve / reject / hold / partial_approve (string enum)
- `CaseStatus` – open / in_review / decided / routed / closed (string enum)
**Document models** (read-only context given to the agent):
- `LineItem` – one line on an invoice or PO (description, quantity, unit_price, total, tax_rate)
- `PurchaseOrder` – what was agreed to be purchased
- `Invoice` – what the supplier is claiming
- `GoodsReceiptNote` – what actually arrived at the warehouse
- `SupplierMaster` – the verified, registered supplier record
- `ExceptionFlag` – why the system flagged this invoice (flag_code, description, auto_hold)
**Action model:**
- `Action` – has a `type: ActionType` and `params: Dict[str, Any]`
- Add classmethod constructors for each action type so callers can do `Action.run_check("tolerance_rule")`
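The classmethod-constructor pattern can be sketched like this. A stdlib dataclass and a subset of the 9 action types are used here so the snippet stands alone; the real `models.py` uses Pydantic `BaseModel`:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict

class ActionType(str, Enum):
    INSPECT_FIELD = "inspect_field"
    RUN_CHECK = "run_check"
    MAKE_DECISION = "make_decision"

@dataclass
class Action:
    type: ActionType
    params: Dict[str, Any] = field(default_factory=dict)

    # One named constructor per action type keeps agent call sites readable
    @classmethod
    def run_check(cls, check_name: str) -> "Action":
        return cls(ActionType.RUN_CHECK, {"check_name": check_name})

    @classmethod
    def inspect_field(cls, document: str, field_name: str) -> "Action":
        return cls(ActionType.INSPECT_FIELD, {"document": document, "field": field_name})

    @classmethod
    def make_decision(cls, decision: str, reason: str) -> "Action":
        return cls(ActionType.MAKE_DECISION, {"decision": decision, "reason": reason})
```

Callers then write `Action.run_check("tolerance_rule")` instead of building the params dict by hand.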
**Result models:**
- `InspectionResult` – what came back from inspect_field (document, field, value, note, timestamp)
- `CheckResult` – what came back from run_check or cross_check (check_name, passed, detail, timestamp)
- `QueryResult` – what came back from a query (target, question, response, channel, timestamp)
**State models:**
- `EnvironmentState` – the full observable state returned by reset() and step()
- `StepResult` – what step() returns: (observation, reward, done, info)
### EnvironmentState fields
The EnvironmentState must include:
- `task_id: str`
- `step_number: int`
- `case_status: CaseStatus`
- All 5 documents (purchase_order, invoice, grn, supplier_master, exception_flag)
- Agent history: `inspections`, `checks_run`, `queries`, `rules_applied`
- Decision state: `decision`, `decision_reason`, `routed_to`, `case_closed`, `close_summary`
- Action hints: `available_actions`, `available_checks`, `available_rules`, `knowledge_base`
- `cumulative_reward: float`
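A standalone sketch of that state contract, with a stdlib dataclass and a NamedTuple standing in for the Pydantic models; the document fields are typed `Any` here for brevity:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, NamedTuple, Optional

@dataclass
class EnvironmentState:
    task_id: str
    step_number: int
    case_status: str                 # a CaseStatus value
    purchase_order: Any              # PurchaseOrder
    invoice: Any                     # Invoice
    grn: Any                         # GoodsReceiptNote
    supplier_master: Any             # SupplierMaster
    exception_flag: Any              # ExceptionFlag
    # Agent history
    inspections: List[Any] = field(default_factory=list)
    checks_run: List[Any] = field(default_factory=list)
    queries: List[Any] = field(default_factory=list)
    rules_applied: List[str] = field(default_factory=list)
    # Decision state
    decision: Optional[str] = None
    decision_reason: Optional[str] = None
    routed_to: List[str] = field(default_factory=list)
    case_closed: bool = False
    close_summary: Optional[str] = None
    # Action hints
    available_actions: List[str] = field(default_factory=list)
    available_checks: List[str] = field(default_factory=list)
    available_rules: List[str] = field(default_factory=list)
    knowledge_base: List[str] = field(default_factory=list)
    cumulative_reward: float = 0.0

class StepResult(NamedTuple):
    observation: EnvironmentState
    reward: float
    done: bool
    info: Dict[str, Any]
```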
### Writing style for models.py
```python
"""
Typed models for the Invoice Exception Handler OpenEnv environment.
Every object the agent sees or produces is defined here as a Pydantic model.
This is the single source of truth for the data contract between the
environment simulation and the agent.
"""
from __future__ import annotations
import time
from enum import Enum
from typing import Any, Dict, List, Optional
from pydantic import BaseModel, Field
class ActionType(str, Enum):
INSPECT_FIELD = "inspect_field"
CROSS_CHECK = "cross_check"
# ... etc
```
Do not put business logic in models.py. Just data shapes.
---
## Step 4 – env/tasks.py
This is the biggest file. It defines what happens when the agent takes each action –
the simulated responses, the rewards, and the grading logic.
### EpisodeData class
A plain Python class (not Pydantic) that tracks everything the agent has done in one episode.
```python
class EpisodeData:
    """Tracks the full history of one episode for grading and state building."""

    def __init__(self):
        self.inspections: List[InspectionResult] = []
        self.checks: List[CheckResult] = []
        self.queries: List[QueryResult] = []
        self.rules_applied: List[str] = []
        self.decision: Optional[str] = None
        self.decision_reason: Optional[str] = None
        self.routed_to: List[str] = []
        self.closed: bool = False
        self.close_summary: Optional[str] = None
        self.step_count: int = 0
        self.cumulative_reward: float = 0.0

    def has_inspected(self, doc: str, field: str) -> bool:
        """Check if we already looked at this field in this document."""
        return any(i.document == doc and i.field == field for i in self.inspections)

    def has_checked(self, name: str) -> bool:
        """Check if this validation check has already been run."""
        return any(c.check_name == name for c in self.checks)

    def has_queried(self, target: str) -> bool:
        """Check if we already queried this person or department."""
        return any(q.target == target for q in self.queries)
```
### BaseTask class
Abstract base that all three tasks inherit from. Every method raises `NotImplementedError`.
```python
class BaseTask:
    task_id: str = "base"
    max_steps: int = 20
    difficulty: str = "easy"

    # Document factories – return fresh objects each time (no shared state)
    def get_purchase_order(self) -> PurchaseOrder: raise NotImplementedError
    def get_invoice(self) -> Invoice: raise NotImplementedError
    def get_grn(self) -> GoodsReceiptNote: raise NotImplementedError
    def get_supplier_master(self) -> SupplierMaster: raise NotImplementedError
    def get_exception_flag(self) -> ExceptionFlag: raise NotImplementedError

    # Simulators – each returns (result_object, reward_delta)
    def simulate_inspect(self, document: str, field: str) -> Tuple[InspectionResult, float]: ...
    def simulate_cross_check(self, field: str, doc_a: str, doc_b: str) -> Tuple[CheckResult, float]: ...
    def simulate_run_check(self, check_name: str) -> Tuple[CheckResult, float]: ...
    def simulate_query_supplier(self, question: str, channel: str) -> Tuple[QueryResult, float]: ...
    def simulate_query_internal(self, department: str, question: str) -> Tuple[QueryResult, float]: ...
    def simulate_apply_rule(self, rule_id: str) -> Tuple[str, float]: ...
    def simulate_make_decision(self, decision: str, reason: str, ep: EpisodeData) -> float: ...
    def simulate_route_to(self, team: str, notes: str, ep: EpisodeData) -> float: ...
    def simulate_close(self, summary: str, ep: EpisodeData) -> float: ...
    def grade(self, ep: EpisodeData) -> Dict[str, float]: ...

    # These are properties, not methods
    @property
    def available_checks(self) -> List[str]: return []

    @property
    def available_rules(self) -> List[str]: return []

    @property
    def knowledge_base(self) -> List[str]: return []
```
### The Three Tasks
#### Task 1: PriceVarianceTask (task1_price_variance)
**The scenario:** An office stationery supplier sends an invoice that's 3.08% above the PO.
Company policy allows ±2% automatic approval. Above that needs manual exception approval.
The supplier did communicate the price increase but procurement never updated the PO.
**task_id:** `"task1_price_variance"`
**max_steps:** `18`
**difficulty:** `"easy"`
**The documents:**
PO (PO-2024-1041): 3 stationery line items totalling ₹50,000
- A4 Paper 100 reams @ ₹220 = ₹22,000
- Ballpoint Pens 20 boxes @ ₹450 = ₹9,000
- Staplers 10 units @ ₹1,900 = ₹19,000
Invoice (INV-ON-8821): Same items, same quantities, but 2 items have higher unit prices
- A4 Paper @ ₹231 (+₹11, +5.0%)
- Ballpoint Pens @ ₹472 (+₹22, +4.9%)
- Staplers unchanged @ ₹1,900
- Subtotal: ₹51,540 (+₹1,540, +3.08%)
- 18% GST applied correctly: ₹9,277.20
- Total: ₹60,817.20
GRN (GRN-2024-0892): All items fully received, no pending, no rejected.
Supplier Master (SUP-0441 β OfficeNeed Supplies): Bank account and GSTIN both match invoice exactly. No fraud signals.
Exception Flag: `PRICE_MISMATCH` – "Invoice total ₹51,540 exceeds PO ₹50,000 by ₹1,540 (3.08%). Above auto-approval threshold."
**Knowledge base entries:**
- POL-001: Price variance ≤ ±2% may be auto-approved. Above 2% requires exception approval.
- POL-002: Exception approval requires confirmation from originating department.
- POL-003: Any approved invoice with a price change must be followed by a PO amendment request.
- POL-004: Bank account on invoice must match supplier master.
**Simulator logic:**
`simulate_inspect`: Return meaningful values for invoice line_items (+0.10), invoice total_amount (+0.08), po line_items (+0.06), grn items_received (+0.05). Return +0.01 for unknown fields.
`simulate_cross_check`: The key cross-checks are:
- `(unit_price, invoice, po)` → finds Paper and Pen mismatch, reward +0.12
- `(total_amount, invoice, po)` → confirms 3.08% variance, reward +0.10
- `(bank_account, invoice, supplier_master)` → match (no fraud), reward +0.03
- `(gstin, invoice, supplier_master)` → match, reward +0.02
- `(quantity, invoice, grn)` → match (full delivery), reward +0.04
`simulate_run_check`:
- `"tolerance_rule"` β 3.08% > 2%, FAILS, reward +0.14 (most important check)
- `"grn_match"` β PASSES (all received), reward +0.06
- `"duplicate_detection"` β PASSES (not a dup), reward +0.02
- `"bank_account_verification"` β PASSES, reward +0.02
- `"gst_verification"` β PASSES, reward +0.02
- `"po_match"` β FAILS on price, reward +0.08
`simulate_query_supplier`: Returns email from supplier explaining raw material price increase communicated to Arjun Mehta at procurement on Feb 20. Reward +0.10.
`simulate_query_internal`:
- `"procurement"` β Arjun Mehta confirms verbal approval, says he'll raise PO amendment. Reward +0.12.
- Others β generic responses, reward +0.03.
`simulate_apply_rule`:
- `"tolerance_2pct_auto_approve"` β BLOCKED (3.08% > 2%), reward β0.05
- `"tolerance_exception_approval"` β APPLIED, reward +0.10
- `"rejection_with_reason"` β APPLIED but wrong, reward β0.08
- `"partial_approval"` β not applicable here, reward β0.05
`simulate_make_decision`:
- `"approve"` with tolerance check + procurement query: reward +0.25
- `"approve"` with tolerance check only: reward +0.18
- `"approve"` with nothing checked: reward +0.05 (bad approval, should have verified)
- `"reject"`: reward β0.10 (wrong decision, delay supplier)
- `"hold"`: reward +0.08
`simulate_route_to`:
- `"procurement"` β reward +0.12 (correct β PO amendment needed)
- `"finance"` β reward +0.03
- `"legal"` β reward β0.05 (overkill for a price variance)
`simulate_close`: reward +0.12 if approved + tolerance checked + procurement routed, else +0.06, else 0.
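The run_check simulator for this task is essentially a lookup table of (passed, detail, reward_delta). A standalone sketch, where the `detail` strings are illustrative wording rather than the exact environment output:

```python
from dataclasses import dataclass

@dataclass
class CheckResult:
    check_name: str
    passed: bool
    detail: str

# (passed, detail, reward_delta) per the reward table above; detail text is illustrative
TASK1_CHECK_OUTCOMES = {
    "tolerance_rule": (False, "Variance 3.08% exceeds the 2% auto-approval threshold.", 0.14),
    "grn_match": (True, "All PO lines fully received per GRN-2024-0892.", 0.06),
    "duplicate_detection": (True, "No matching invoice in payment history.", 0.02),
    "bank_account_verification": (True, "Invoice bank account matches supplier master.", 0.02),
    "gst_verification": (True, "GSTIN matches supplier master.", 0.02),
    "po_match": (False, "Unit prices on 2 of 3 lines differ from PO-2024-1041.", 0.08),
}

def simulate_run_check(check_name: str):
    """Return (CheckResult, reward_delta) for one validation check."""
    passed, detail, reward = TASK1_CHECK_OUTCOMES.get(
        check_name, (True, "Unknown check; nothing to report.", 0.0))
    return CheckResult(check_name, passed, detail), reward
```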
**Grader (`grade` method):**
```python
def grade(self, ep: EpisodeData) -> Dict[str, float]:
    checks_run = {c.check_name for c in ep.checks}
    queries_to = {q.target for q in ep.queries}

    # Did the agent correctly diagnose?
    d = 0.0
    if any("unit_price" in c.check_name or "total" in c.check_name
           for c in ep.checks):
        d += 0.12
    if "tolerance_rule" in checks_run:
        d += 0.14
    if "grn_match" in checks_run:
        d += 0.06

    # Did the agent investigate properly?
    i = 0.0
    if "supplier" in queries_to:
        i += 0.10
    if "procurement" in queries_to:
        i += 0.12
    if "tolerance_exception_approval" in ep.rules_applied:
        i += 0.08

    # Correct decision?
    dec = 0.0
    if ep.decision == "approve": dec += 0.18
    elif ep.decision == "hold": dec += 0.06
    elif ep.decision == "reject": dec -= 0.10

    # Correct routing?
    route = 0.12 if "procurement" in ep.routed_to else 0.0

    # Closed cleanly?
    closure = 0.08 if ep.closed else 0.0

    # Efficiency bonus – penalise extra steps
    eff = max(0.0, 0.06 - 0.004 * max(0, ep.step_count - 9))

    total = d + i + dec + route + closure + eff
    return {
        "score": round(max(0.0, min(1.0, total)), 4),
        "diagnosis_score": round(d, 4),
        "investigation_score": round(i, 4),
        "decision_score": round(dec, 4),
        "routing_score": round(route, 4),
        "closure_score": round(closure, 4),
        "efficiency_score": round(eff, 4),
    }
```
---
#### Task 2: DuplicateTaxErrorTask (task2_duplicate_tax)
**The scenario:** Logistics supplier submits INV-2024-891 for transport services. System flags
it as a possible duplicate. Turns out it IS a duplicate of INV-2024-819 – the numbers differ
by digit transposition (891 vs 819). That original invoice was already paid. BUT: the original
invoice applied 15% GST when the correct rate is 18%. The company overpaid ₹3,240 in tax.
The new invoice has the correct rate. So it's both a duplicate AND a legitimate correction.
**task_id:** `"task2_duplicate_tax"`
**max_steps:** `20`
**difficulty:** `"medium"`
**The documents:**
PO (PO-2024-0778): Logistics services
- Mumbai-Pune Transport 20 trips @ ₹4,500 = ₹90,000
- Warehousing charges Feb 2024 @ ₹18,000 = ₹18,000
- Total: ₹1,08,000, Net-15 terms
Invoice (INV-2024-891): Same services, same amounts – correct on the face of it
- Subtotal: ₹1,08,000
- GST 18%: ₹19,440 – this is CORRECT
- Total: ₹1,27,440
GRN (GRN-2024-0740): Services confirmed complete (transport + warehousing).
Supplier Master (SUP-0229 β FastMove Logistics): Bank and GSTIN match invoice. No fraud signals.
Exception Flag: `POSSIBLE_DUPLICATE` – "Invoice INV-2024-891 closely matches previously processed invoice."
**Hidden state (not in documents, revealed by checks):**
- INV-2024-819 was paid 12 days ago for ₹1,24,200
- INV-2024-819 applied 15% GST = ₹16,200 (wrong rate)
- Correct 18% GST = ₹19,440
- Company overpaid: ₹3,240
**Key checks and what they reveal:**
`run_check("duplicate_detection")` β FAILS β finds INV-2024-819 paid 12 days ago, reward +0.18
`run_check("tax_calculation_verify")` β FAILS β discovers the 15% error on original, reveals βΉ3,240 delta, reward +0.16
`cross_check(invoice_number, invoice, payment_history)` β finds digit transposition, reward +0.15
`cross_check(tax_amount, invoice, payment_history)` β confirms βΉ3,240 delta, reward +0.14
`query_internal("finance")` β confirms overpayment on original, reward +0.12
`query_supplier` β supplier confirms they know and wants partial approval for the delta, reward +0.10
`apply_rule("partial_approval")` β correct pathway, reward +0.12
`apply_rule("credit_note_request")` β supplier must issue credit note for the balance, reward +0.10
**Decision logic:**
`simulate_make_decision`:
- `"partial_approve"` with dup + tax found: reward +0.28 β optimal
- `"partial_approve"` with dup only: reward +0.14 β incomplete
- `"reject"` with dup found: reward +0.08 β catches dup, misses correction
- `"approve"` (pays full duplicate): reward β0.15 β bad
**Grader weights:**
- diagnosis_score: up to 0.30 (dup found +0.16, tax error found +0.14)
- investigation_score: up to 0.32 (finance queried, supplier queried, rules applied)
- decision_score: up to 0.20 (partial_approve = 0.20, reject = 0.05, approve = −0.15)
- routing_score: up to 0.08
- closure_score: up to 0.06
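Those weights translate into a grader shaped like Task 1's. A sketch of the core scoring only (routing, closure, and efficiency terms omitted; the exact split of the investigation budget across its three components is an assumption):

```python
def grade_task2_core(failed_checks, query_targets, rules_applied, decision):
    """Core Task 2 scoring: diagnosis + investigation + decision.

    failed_checks: set of failed check names, query_targets: set of query targets.
    """
    # Diagnosis: did the agent find both the duplicate and the tax error?
    diagnosis = (0.16 if "duplicate_detection" in failed_checks else 0.0) \
        + (0.14 if "tax_calculation_verify" in failed_checks else 0.0)

    # Investigation: queried the right people and applied the right rule
    # (assumed split: finance 0.12, supplier 0.10, partial_approval rule 0.10)
    investigation = (0.12 if "finance" in query_targets else 0.0) \
        + (0.10 if "supplier" in query_targets else 0.0) \
        + (0.10 if "partial_approval" in rules_applied else 0.0)

    # Decision weights from the table above
    decision_score = {"partial_approve": 0.20, "reject": 0.05,
                      "approve": -0.15}.get(decision, 0.0)

    total = diagnosis + investigation + decision_score
    return round(max(0.0, min(1.0, total)), 4)
```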
---
#### Task 3: CompoundFraudTask (task3_compound_fraud)
**The scenario:** IT supplier submits a ₹8,47,500 invoice for 15 laptops. The system flags a bank
account change, but there are FOUR simultaneous fraud signals, all of which the agent must find.
**task_id:** `"task3_compound_fraud"`
**max_steps:** `25`
**difficulty:** `"hard"`
**The four signals:**
1. **Bank account fraud (Signal 1):** Invoice has a different bank account than the supplier
master. The change request came from `techcore-solutions.com`. The registered domain is
`techcore-solutions.in`. Classic Business Email Compromise (BEC) attack.
2. **GSTIN fraud (Signal 2):** The GST number on the invoice (`07AABCT9999X1Z8`) belongs to
"TechCore Trading Pvt Ltd" β a completely different entity in Delhi. Supplier master shows
`07AABCT1234Y1Z5` for "TechCore Solutions."
3. **Quantity mismatch (Signal 3):** Invoice claims 15 laptops. GRN shows only 13 received.
2 units are still marked as pending.
4. **Price inflation (Signal 4):** ₹56,500/unit on invoice vs ₹52,000/unit on PO. That's
8.65% above the agreed price. No price revision was ever approved.
**Bonus signals (smaller, still notable):**
- Invoice is dated a Sunday (2024-03-10) – unusual for B2B
- PO was raised Friday March 8 – 2-day turnaround is suspiciously fast for IT equipment
**The critical trap β channel selection:**
`simulate_query_supplier(question, channel="email")` →
Returns fraudster's response urging payment to the new account. Reward: **−0.15**.
`simulate_query_supplier(question, channel="phone")` →
The real TechCore Solutions confirms they sent no bank change request. Confirms fraud. Reward: **+0.15**.
This tests whether the agent follows POL-009 ("bank account change must be verified via
registered phone number – NEVER via email") which is in the knowledge base.
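A minimal sketch of that trap; the response text is illustrative, the channel logic and reward signs follow the spec above:

```python
def simulate_query_supplier(question: str, channel: str = "email"):
    """Task 3 channel trap: email reaches the fraudster, phone reaches the real supplier."""
    if channel == "phone":
        # The registered phone number from the supplier master -> the real TechCore Solutions
        response = ("We never requested a bank account change. "
                    "Please treat that invoice as fraudulent.")
        return response, 0.15

    # Any written channel goes to the lookalike domain controlled by the attacker
    response = "Kindly remit to the updated account immediately to avoid service disruption."
    return response, -0.15
```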
**Available checks and rewards:**
```python
"bank_account_verification" β FAILS, finds lookalike domain, reward +0.18
"gst_verification" β FAILS, GST belongs to different entity, reward +0.18
"grn_match" β FAILS, 13 vs 15 received, reward +0.14
"email_domain_verification" β FAILS, lookalike domain confirmed, reward +0.16
"invoice_date_validation" β FAILS, Sunday flag, reward +0.08
"quantity_check" β FAILS, quantity inflated, reward +0.12
"price_check" β FAILS, 8.65% above PO, reward +0.10
"duplicate_detection" β PASSES (not a dup), reward +0.02
"po_match" β FAILS (GST + qty + price all wrong), reward +0.08
```
**Decision logic:**
`simulate_make_decision`:
- `"reject"` β reward = 0.10 + 0.05 Γ (number of signals found) β max ~0.30
- `"approve"` β reward β0.40 (catastrophic β approved fraud)
- `"partial_approve"` β reward β0.20 (you can't partially approve fraud)
- `"hold"` β reward = 0.08 + 0.03 Γ signals found β acceptable but not optimal
**Route logic:**
```python
"legal" β reward +0.14 # must escalate to legal
"security" β reward +0.12 # BEC attack needs security investigation
"finance" β reward +0.08 # finance needs to block payment
"procurement" β reward +0.06
```
**Grader β the signal detection scoring:**
```python
def grade(self, ep: EpisodeData) -> Dict[str, float]:
    checks_seen = {c.check_name for c in ep.checks}
    bank_found = "bank_account_verification" in checks_seen
    gst_found = "gst_verification" in checks_seen
    qty_found = "grn_match" in checks_seen
    domain_found = "email_domain_verification" in checks_seen
    price_found = "price_check" in checks_seen

    # Diagnosis – finding all signals is the whole point
    d = (0.12 if bank_found else 0) + (0.12 if gst_found else 0) \
        + (0.10 if qty_found else 0) + (0.10 if domain_found else 0) \
        + (0.06 if price_found else 0)

    # Investigation – reward for using phone not email
    i = 0.0
    for q in ep.queries:
        if q.target == "supplier" and q.channel not in ("email", "mail"):
            i += 0.10   # correct channel
        elif q.target == "supplier" and q.channel in ("email", "mail"):
            i -= 0.15   # contacting fraudster
    if "legal" in {q.target for q in ep.queries}: i += 0.06
    if "security" in {q.target for q in ep.queries}: i += 0.06

    # Decision
    signals = sum([bank_found, gst_found, qty_found, domain_found])
    dec = 0.0
    if ep.decision == "reject":
        dec = 0.08 + 0.03 * signals
    elif ep.decision == "approve":
        dec = -0.35
    elif ep.decision == "partial_approve":
        dec = -0.15
    elif ep.decision == "hold":
        dec = 0.06

    # Routing
    routes = set(ep.routed_to)
    route = (0.10 if "legal" in routes else 0) \
        + (0.06 if "security" in routes else 0) \
        + (0.04 if "finance" in routes else 0)

    closure = 0.06 if (ep.closed and ep.decision == "reject") else 0.0
    eff = max(0.0, 0.04 - 0.002 * max(0, ep.step_count - 12))

    total = d + i + dec + route + closure + eff
    return {
        "score": round(max(0.0, min(1.0, total)), 4),
        "signals_found": sum([bank_found, gst_found, qty_found, domain_found, price_found]),
        "diagnosis_score": round(d, 4),
        "investigation_score": round(i, 4),
        "decision_score": round(dec, 4),
        "routing_score": round(route, 4),
        "closure_score": round(closure, 4),
        "efficiency_score": round(eff, 4),
    }
```
### Task Registry
At the bottom of tasks.py:
```python
TASK_REGISTRY: Dict[str, type] = {
    "task1_price_variance": PriceVarianceTask,
    "task2_duplicate_tax": DuplicateTaxErrorTask,
    "task3_compound_fraud": CompoundFraudTask,
}

ALL_TASKS = list(TASK_REGISTRY.keys())


def make_task(task_id: str) -> BaseTask:
    cls = TASK_REGISTRY.get(task_id)
    if cls is None:
        raise ValueError(f"Unknown task '{task_id}'. Available: {ALL_TASKS}")
    return cls()
```
---
## Step 5 – env/environment.py
This is the `InvoiceExceptionEnv` class. It is the only thing external code needs to import.
```python
class InvoiceExceptionEnv:
    """
    OpenEnv-compatible Invoice Exception Handler environment.

    Usage:
        env = InvoiceExceptionEnv(seed=42)
        obs = env.reset("task1_price_variance")
        result = env.step(Action.run_check("tolerance_rule"))
        scores = env.grade()
    """
```
### Constructor
Takes an optional `seed: Optional[int] = None` for reproducibility.
Initialises `self._rng = random.Random(seed)`.
Initialises `self._task`, `self._ep`, `self._state`, `self._done` all to None/False.
### reset(task_id)
```python
def reset(self, task_id: Optional[str] = None) -> EnvironmentState:
    """
    Start a new episode. If task_id is None, picks one at random.
    Returns the initial EnvironmentState showing all documents and available actions.
    """
```
1. Pick task (random if None)
2. Create `EpisodeData()`
3. Set `self._done = False`
4. Call `self._build_state()` and store result
5. Return the state
### step(action)
```python
def step(self, action: Union[Action, Dict[str, Any]]) -> StepResult:
    """
    Execute one action. Returns observation, reward, done flag, and info dict.
    Raises RuntimeError if called before reset() or after the episode is done.
    """
```
1. Validate we're in an active episode
2. Convert dict to Action if needed
3. Call `self._dispatch(action)` to get (reward, info)
4. Increment step count
5. Check SLA (step count vs max_steps)
6. Check done condition (closed or SLA breach)
7. Rebuild state
8. Return StepResult
### state()
Non-destructive. Just returns `self._state`. Raises RuntimeError if not initialised.
### grade()
Calls `self._task.grade(self._ep)` and returns the dict.
### _dispatch(action)
The routing function. A single if/elif chain for each ActionType.
For each action:
1. Call the appropriate task simulator
2. Update EpisodeData
3. Return (reward, info dict)
Handle repeated actions (inspecting the same field or running the same check twice) with a small -0.02 to -0.05 penalty and return early.
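The repeat-penalty check can be a small guard at the top of the dispatcher; a sketch under assumed names (`repeat_penalty` and the tuple key are illustrative, not the real implementation):

```python
def repeat_penalty(seen: set, action_key: tuple):
    """Return a penalty reward if this exact action was already taken, else None.

    `action_key` is the action type plus its identifying params,
    e.g. ("inspect_field", "invoice", "bank_account").
    """
    if action_key in seen:
        return -0.03  # inside the -0.02 to -0.05 band; tune per action type
    seen.add(action_key)
    return None
```

In `_dispatch`, a non-None return means: record the penalty and return early without re-running the simulator.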
### _build_state()
Constructs an `EnvironmentState` from the current `_task` and `_ep`. Called after every step.
Also determines the current `CaseStatus` based on episode data.
### action_space_sample()
Returns a random valid action (for random baseline agents). Uses `self._rng` for reproducibility.
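A minimal sampler in this spirit (the action subset is illustrative — the real method should cover all nine action types):

```python
import random

def action_space_sample(rng: random.Random, checks: list, rules: list) -> dict:
    """Draw a random valid action dict; a sketch of the random-baseline sampler."""
    choice = rng.choice(["run_check", "apply_rule", "make_decision", "close_case"])
    if choice == "run_check":
        return {"type": "run_check", "params": {"check_name": rng.choice(checks)}}
    if choice == "apply_rule":
        return {"type": "apply_rule", "params": {"rule_id": rng.choice(rules)}}
    if choice == "make_decision":
        return {"type": "make_decision",
                "params": {"decision": rng.choice(["approve", "reject", "hold"]),
                           "reason": "random baseline"}}
    return {"type": "close_case", "params": {"summary": "random baseline"}}
```

Because the sampler only draws from `self._rng`, two environments built with the same seed produce identical baseline runs.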
---
## Step 6 – env/__init__.py
```python
from .environment import InvoiceExceptionEnv
from .models import Action, ActionType, EnvironmentState, StepResult
from .tasks import ALL_TASKS, make_task

__all__ = [
    "InvoiceExceptionEnv",
    "Action",
    "ActionType",
    "EnvironmentState",
    "StepResult",
    "ALL_TASKS",
    "make_task",
]
```
---
## Step 7 – Smoke Test Before Continuing
Before writing openenv.yaml or inference.py, verify the environment works.
```python
# test_smoke.py – run this, do not commit it
from env import InvoiceExceptionEnv, Action, ALL_TASKS

print("Tasks:", ALL_TASKS)
env = InvoiceExceptionEnv(seed=42)

for task_id in ALL_TASKS:
    obs = env.reset(task_id)
    print(f"\n--- {task_id} ---")
    print("Ticket:", obs.exception_flag.flag_description[:80])

    # Take a few actions
    r1 = env.step(Action.run_check(obs.available_checks[0]))
    print(f"Step 1 reward: {r1.reward}")
    r2 = env.step(Action.make_decision("approve", "test"))
    print(f"Step 2 reward: {r2.reward}")
    r3 = env.step(Action.close_case("closed"))
    print(f"Step 3 reward: {r3.reward}, done: {r3.done}")

    scores = env.grade()
    print(f"Grade: {scores['score']}")

print("\nSmoke test passed.")
```
All three tasks must complete without errors. Scores must be in [0.0, 1.0].
---
## Step 8 – openenv.yaml
This file must pass `openenv validate`. Write it carefully.
```yaml
# openenv.yaml
name: Invoice Exception Handler
version: "1.0.0"
description: |
  An agent learning environment simulating accounts payable exception handling.
  The agent acts as an AP analyst: investigates flagged invoices, applies business
  rules, detects fraud signals, makes decisions, and closes cases with an audit trail.
authors:
  - name: Your Name
    email: your@email.com
license: MIT

tasks:
  - id: task1_price_variance
    name: Price Variance Exception
    difficulty: easy
    description: |
      Office stationery invoice arrives 3.08% above PO. Company tolerance policy
      allows ±2% auto-approval. Agent must detect the variance, verify through
      the tolerance rule, confirm verbal approval with procurement, and approve
      with a PO amendment request.
    max_steps: 18
    optimal_score: 1.0
    min_passing_score: 0.60
  - id: task2_duplicate_tax
    name: Duplicate Invoice with Tax Error
    difficulty: medium
    description: |
      Logistics supplier submits INV-2024-891, a duplicate of paid INV-2024-819
      (digit transposition: 891 vs 819). Original invoice had wrong GST rate
      (15% vs correct 18%), so the company overpaid ₹3,240. New invoice has the
      correct rate. Agent must detect the duplicate, identify the tax error in
      the original, and partially approve only the ₹3,240 tax correction.
    max_steps: 20
    optimal_score: 1.0
    min_passing_score: 0.50
  - id: task3_compound_fraud
    name: Compound Fraud Signals
    difficulty: hard
    description: |
      IT equipment supplier invoice with four simultaneous fraud signals: bank
      account changed via BEC attack (lookalike email domain), GSTIN belongs to
      a different entity, 2 of 15 laptops not yet received, and unit price 8.65%
      above PO. Agent must find all signals, use the correct communication channel
      (phone, not email, which would contact the fraudster), and escalate to legal
      and security.
    max_steps: 25
    optimal_score: 1.0
    min_passing_score: 0.40

observation_space:
  type: object
  description: EnvironmentState Pydantic model
  fields:
    task_id: {type: string}
    step_number: {type: integer}
    case_status: {type: string, enum: [open, in_review, decided, routed, closed]}
    purchase_order: {type: object, description: "PO with line items and terms"}
    invoice: {type: object, description: "Supplier invoice with line items and tax"}
    grn: {type: object, description: "Goods receipt – what actually arrived"}
    supplier_master: {type: object, description: "Verified supplier record"}
    exception_flag: {type: object, description: "Why the system flagged this invoice"}
    inspections: {type: array, description: "Fields the agent has inspected"}
    checks_run: {type: array, description: "Validation checks completed"}
    queries: {type: array, description: "Internal and supplier queries"}
    rules_applied: {type: array, description: "Business rules applied"}
    decision: {type: string, nullable: true}
    routed_to: {type: array}
    available_actions: {type: array}
    available_checks: {type: array}
    available_rules: {type: array}
    knowledge_base: {type: array}
    cumulative_reward: {type: number}

action_space:
  type: object
  description: Action with type and params
  actions:
    inspect_field:
      params: {document: string, field: string}
    cross_check:
      params: {field: string, doc_a: string, doc_b: string}
    run_check:
      params: {check_name: string}
    query_supplier:
      params: {question: string, channel: string}
    query_internal:
      params: {department: string, question: string}
    apply_rule:
      params: {rule_id: string}
    make_decision:
      params: {decision: string, reason: string}
    route_to:
      params: {team: string, notes: string}
    close_case:
      params: {summary: string}

reward:
  range: [-1.0, 1.0]
  description: |
    Shaped reward at every step. Relevant inspections: +0.01 to +0.14.
    Diagnostics revealing issues: +0.08 to +0.18. Correct fixes: +0.08 to +0.30.
    Wrong decision on fraud: -0.15 to -0.40. Repeat actions: -0.02 to -0.05.
    SLA breach: -0.10.

grading:
  method: task_grader
  scores:
    - score  # 0.0–1.0 overall
    - diagnosis_score
    - investigation_score
    - decision_score
    - routing_score
    - closure_score
    - efficiency_score

api:
  reset:
    signature: "reset(task_id: str | None = None) -> EnvironmentState"
  step:
    signature: "step(action: Action | dict) -> StepResult"
  state:
    signature: "state() -> EnvironmentState"
  grade:
    signature: "grade() -> Dict[str, float]"

http_endpoints:
  - path: /reset
    method: POST
    description: Reset environment, returns EnvironmentState JSON
  - path: /step
    method: POST
    description: Execute action, returns StepResult JSON
  - path: /state
    method: GET
    description: Current state, returns EnvironmentState JSON
  - path: /grade
    method: POST
    description: Grade current episode
  - path: /health
    method: GET
    description: Health check

dependencies:
  python: ">=3.11"
  packages:
    - pydantic==2.7.1
    - fastapi==0.111.0
    - uvicorn==0.29.0
    - gradio==4.36.1
    - openai==1.35.3
    - pyyaml==6.0.1

docker:
  port: 7860
  health_check: /health
```
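Before running `openenv validate`, a quick local parse catches YAML indentation mistakes early. A sketch (`check_spec` is a hypothetical helper; the required-section list mirrors the spec above):

```python
import yaml

# Top-level sections expected in openenv.yaml (mirrors the spec in this guide)
REQUIRED_TOP_LEVEL = (
    "name", "version", "description", "license", "tasks",
    "observation_space", "action_space", "reward", "grading",
    "api", "http_endpoints", "dependencies", "docker",
)

def check_spec(path: str = "openenv.yaml") -> dict:
    """Parse the spec and fail loudly if any top-level section is missing."""
    with open(path) as f:
        spec = yaml.safe_load(f)
    missing = [k for k in REQUIRED_TOP_LEVEL if k not in spec]
    if missing:
        raise ValueError(f"openenv.yaml is missing sections: {missing}")
    return spec
```

Run it once before every push; a spec that fails `yaml.safe_load` will also fail the validator.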
---
## Step 9 – inference.py
This is the most critical file for the hackathon validator. Get the format exactly right.
### Required env vars
```python
API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY", "")
```
### Required stdout format
Every line to stdout must be exactly:
```
[START] task=<task_id> env=invoice-exception-handler model=<model_name>
[STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
[END] success=<true|false> steps=<n> score=<0.000> rewards=<r1,r2,...>
```
Rules (do not deviate):
- One `[START]` line at episode begin
- One `[STEP]` line per step, immediately after `env.step()` returns
- One `[END]` line after the episode, always emitted even on exception
- `reward` and all values in `rewards` formatted to exactly 2 decimal places
- `score` formatted to exactly 3 decimal places
- `done` and `success` are lowercase: `true` or `false`
- `error` is the error message string, or exactly `null` if none
- No newlines within a single line
- `flush=True` on every print so the validator sees output in real time
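A small formatting helper makes these rules hard to violate, since the live print and any unit test share one code path; a sketch (the helper name is not part of the spec):

```python
def step_line(step: int, action_str: str, reward: float, done: bool,
              error) -> str:
    """Format one [STEP] log line exactly as the validator expects:
    reward to 2 decimals, lowercase booleans, literal 'null' for no error."""
    return (f"[STEP] step={step} action={action_str} "
            f"reward={reward:.2f} done={str(done).lower()} "
            f"error={error if error is not None else 'null'}")
```

Then every step is just `print(step_line(...), flush=True)`.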
### System prompt for the LLM
Write a clear system prompt that tells the model:
- It is an AP analyst handling a flagged invoice
- It has a structured action space (list all 9 action types)
- It must respond in JSON: `{"type": "...", "params": {...}}`
- It should investigate before deciding
- Never approve without checking, never contact supplier by email if fraud is suspected
- Available documents: PO, Invoice, GRN, Supplier Master, Exception Flag
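A sketch of such a system prompt (the wording is illustrative, not prescribed by the spec):

```python
SYSTEM_PROMPT = """You are an accounts payable (AP) analyst handling a flagged invoice.

You act through a structured action space. The 9 action types are:
inspect_field, cross_check, run_check, query_supplier, query_internal,
apply_rule, make_decision, route_to, close_case.

Respond ONLY with JSON in this form: {"type": "...", "params": {...}}

Rules:
- Investigate before deciding; never approve without running checks.
- If fraud is suspected, never contact the supplier by email.
- Available documents: PO, Invoice, GRN, Supplier Master, Exception Flag.
"""
```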
### User prompt per step
Include in the user prompt:
- Current step number and max steps
- The exception flag (what was flagged and why)
- Available checks (list them)
- Available rules (list them)
- Knowledge base entries (the policy list)
- What has been done so far (checks run, queries made, inspections done)
- Current cumulative reward
- Ask for next action as JSON
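A `build_prompt` covering those items could look like the following sketch (field names follow the observation space reference; the exact wording is up to you):

```python
def build_prompt(obs, step: int, max_steps: int, history: list) -> str:
    """Assemble the per-step user prompt from the current observation."""
    lines = [
        f"Step {step}/{max_steps}.",
        f"Exception flag: {obs.exception_flag.flag_description}",
        f"Available checks: {', '.join(obs.available_checks)}",
        f"Available rules: {', '.join(obs.available_rules)}",
        "Knowledge base:",
        *[f"- {entry}" for entry in obs.knowledge_base],
        f"Checks already run: {obs.checks_run}",
        f"Cumulative reward: {obs.cumulative_reward:+.2f}",
    ]
    if history:
        lines += ["Recent actions:", *history[-5:]]
    lines.append('Respond with the next action as JSON: {"type": "...", "params": {...}}')
    return "\n".join(lines)
```

Keeping only the last five history entries bounds the prompt size as episodes grow.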
### Parsing LLM output
```python
import json
import re

def parse_action(raw_text: str) -> dict:
    """
    Parse the model's response into an action dict.
    Handles markdown code fences, extra whitespace, and minor formatting errors.
    Falls back to run_check(po_match) if parsing fails.
    """
    text = raw_text.strip()
    # Remove ```json or ``` fences if present
    if text.startswith("```"):
        lines = text.split("\n")
        text = "\n".join(lines[1:-1] if lines[-1] == "```" else lines[1:])
    try:
        return json.loads(text.strip())
    except json.JSONDecodeError:
        # Try to find JSON embedded in surrounding prose
        match = re.search(r'\{.*\}', text, re.DOTALL)
        if match:
            try:
                return json.loads(match.group())
            except json.JSONDecodeError:
                pass
    # Safe fallback
    return {"type": "run_check", "params": {"check_name": "po_match"}}
```
### Overall structure
```python
def run_task(client, env, task_id, max_steps=20):
    """Run one task episode and return (steps_taken, score, rewards)."""
    rewards = []
    print(f"[START] task={task_id} env=invoice-exception-handler model={MODEL_NAME}", flush=True)

    obs = env.reset(task_id)
    history = []

    for step in range(1, max_steps + 1):
        # Build prompt from observation
        user_prompt = build_prompt(obs, step, max_steps, history)

        # Call LLM
        raw = call_llm(client, user_prompt)
        action_dict = parse_action(raw)

        # Execute
        try:
            result = env.step(action_dict)
            reward = result.reward
            done = result.done
            error = None
        except Exception as e:
            reward = 0.0
            done = False
            error = str(e)
            result = None

        rewards.append(reward)
        action_str = json.dumps(action_dict)
        print(
            f"[STEP] step={step} action={action_str} "
            f"reward={reward:.2f} done={str(done).lower()} "
            f"error={error or 'null'}",
            flush=True
        )
        history.append(f"Step {step}: {action_str} -> reward {reward:+.2f}")

        if result:
            obs = result.observation
        if done:
            break

    score = env.grade()["score"]
    success = score >= 0.5
    steps_taken = min(step, max_steps)
    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
    print(
        f"[END] success={str(success).lower()} steps={steps_taken} "
        f"score={score:.3f} rewards={rewards_str}",
        flush=True
    )
    return steps_taken, score, rewards

def main():
    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
    env = InvoiceExceptionEnv(seed=42)
    for task_id in ALL_TASKS:
        run_task(client, env, task_id)

if __name__ == "__main__":
    main()
```
---
## Step 10 – app.py
The app.py serves two purposes:
1. Provides the FastAPI HTTP endpoints that the validator pings (`POST /reset` must return 200)
2. Provides a Gradio UI for interactive exploration on HF Spaces
### Architecture
Run both FastAPI and Gradio in the same process on port 7860.
Use `gr.mount_gradio_app` to mount Gradio on FastAPI, or run Gradio alongside FastAPI.
The cleanest approach:
```python
import gradio as gr
from fastapi import FastAPI
from fastapi.responses import JSONResponse
import uvicorn

from env import InvoiceExceptionEnv, ALL_TASKS

app = FastAPI(title="Invoice Exception Handler OpenEnv")
env = InvoiceExceptionEnv(seed=42)  # shared environment instance

@app.post("/reset")
async def http_reset(body: dict | None = None):
    task_id = (body or {}).get("task_id")
    obs = env.reset(task_id)
    return JSONResponse(obs.model_dump(mode="json"))

@app.post("/step")
async def http_step(body: dict):
    result = env.step(body)
    return JSONResponse(result.model_dump(mode="json"))

@app.get("/state")
async def http_state():
    return JSONResponse(env.state().model_dump(mode="json"))

@app.post("/grade")
async def http_grade():
    return JSONResponse(env.grade())

@app.get("/tasks")
async def http_tasks():
    return JSONResponse(ALL_TASKS)

@app.get("/health")
async def health():
    return JSONResponse({"status": "ok", "version": "1.0.0"})

# Mount Gradio at the root path; the FastAPI routes registered above take precedence
gradio_app = build_gradio_ui()
app = gr.mount_gradio_app(app, gradio_app, path="/")

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=7860)
```
### Gradio UI β what to build
Keep the UI simple and functional. Three tabs:
**Tab 1: Manual Play**
- Dropdown to select task (labels: "Task 1 β Price Variance (Easy)", etc.)
- Reset button
- Shows the exception flag, the key document fields, and available actions
- Dropdown or textbox to compose and submit an action
- Shows reward, cumulative reward, and status after each step
- Shows grade breakdown when episode ends
**Tab 2: Agent Demo**
- Select task
- Shows a hardcoded optimal action sequence running step by step
- Good for demonstrating the environment to judges who won't run code
**Tab 3: API Reference**
- Code examples for each action type
- Reward table
- Grader score breakdown explanation
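The Tab 2 agent demo needs a hardcoded optimal sequence per task. A sketch for Task 1, derived from its description (the exact optimal ordering is illustrative, not the graded solution):

```python
# Hypothetical optimal sequences for the Agent Demo tab.
# Task 1: detect the variance, confirm verbal approval with procurement,
# approve with a PO amendment request, then close with an audit summary.
OPTIMAL_SEQUENCES = {
    "task1_price_variance": [
        {"type": "run_check", "params": {"check_name": "tolerance_rule"}},
        {"type": "query_internal",
         "params": {"department": "procurement",
                    "question": "Was the 3.08% price increase verbally approved?"}},
        {"type": "make_decision",
         "params": {"decision": "approve",
                    "reason": "Verbal approval confirmed; PO amendment requested"}},
        {"type": "close_case",
         "params": {"summary": "Approved with PO amendment request"}},
    ],
}
```

The demo tab then just replays the list through `env.step(...)` one entry per click or timer tick.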
---
## Step 11 – Dockerfile
```dockerfile
FROM python:3.11-slim
# Install system dependencies
RUN apt-get update \
    && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*
# Create non-root user (required by HF Spaces)
RUN useradd -m -u 1000 appuser
WORKDIR /app
# Copy and install dependencies first (layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY --chown=appuser:appuser . .
USER appuser
EXPOSE 7860
# Health check – pings the /health endpoint
HEALTHCHECK --interval=30s --timeout=10s --start-period=20s --retries=3 \
    CMD curl -f http://localhost:7860/health || exit 1
ENV PYTHONUNBUFFERED=1
ENV GRADIO_SERVER_NAME=0.0.0.0
ENV GRADIO_SERVER_PORT=7860
CMD ["python", "app.py"]
```
---
## Step 12 – End-to-End Test Checklist
Before pushing, check every item in this list.
```bash
# 1. Imports work
python -c "from env import InvoiceExceptionEnv, Action, ALL_TASKS; print('OK')"

# 2. All three tasks complete without errors
python -c "
from env import InvoiceExceptionEnv, Action, ALL_TASKS
env = InvoiceExceptionEnv(seed=42)
for t in ALL_TASKS:
    obs = env.reset(t)
    result = env.step(Action.run_check(obs.available_checks[0]))
    result = env.step(Action.make_decision('reject', 'test'))
    result = env.step(Action.close_case('test'))
    score = env.grade()['score']
    assert 0.0 <= score <= 1.0, f'Score out of range: {score}'
    print(f'{t}: {score}')
print('All tasks OK')
"

# 3. Graders are deterministic
python -c "
from env import InvoiceExceptionEnv, Action
env1 = InvoiceExceptionEnv(seed=42)
env2 = InvoiceExceptionEnv(seed=42)
obs1 = env1.reset('task1_price_variance')
obs2 = env2.reset('task1_price_variance')
env1.step(Action.run_check('tolerance_rule'))
env2.step(Action.run_check('tolerance_rule'))
env1.step(Action.make_decision('approve', 'test'))
env2.step(Action.make_decision('approve', 'test'))
env1.step(Action.close_case('done'))
env2.step(Action.close_case('done'))
s1 = env1.grade()['score']
s2 = env2.grade()['score']
assert s1 == s2, f'Non-deterministic: {s1} vs {s2}'
print(f'Deterministic: {s1}')
"

# 4. inference.py log format (fake credentials: the run fails at the API call,
#    but the [START] line must still print before it does)
API_BASE_URL=https://api.example.com HF_TOKEN=fake MODEL_NAME=test \
    python inference.py 2>/dev/null | head -n 3 || true
# Manually verify the [START] line printed correctly

# 5. Docker builds
docker build -t invoice-env-test .

# 6. Docker runs and /health returns 200
docker run -d -p 7860:7860 --name test-env invoice-env-test
sleep 15
curl -f http://localhost:7860/health
curl -s -X POST http://localhost:7860/reset -H "Content-Type: application/json" -d '{}'
docker stop test-env && docker rm test-env

# 7. openenv validate (if openenv-core is installed)
pip install openenv-core
openenv validate
```
---
## Step 13 – documents/ Folder
Create these four files. Keep them updated as the project evolves.
### documents/CHANGELOG.md
```markdown
# Changelog
All changes to the Invoice Exception Handler environment are recorded here.
Format: Date | Version | What changed | Why
---
## [1.0.0] – 2025-01-20
### Added
- Initial implementation of InvoiceExceptionEnv with full OpenEnv API
- Three tasks: task1_price_variance, task2_duplicate_tax, task3_compound_fraud
- Pydantic v2 typed models for all environment objects
- FastAPI HTTP endpoints for HF Spaces validation
- Gradio UI for interactive exploration
- inference.py using OpenAI client with [START][STEP][END] log format
- openenv.yaml spec file
- Dockerfile for HF Spaces deployment
### Design decisions
- Used pure Python simulation (no external databases) for portability and determinism
- Compound fraud task has four signals to prevent simple greedy agents from scoring well
- Channel selection in Task 3 (phone vs email) tests policy knowledge, not just anomaly detection
- Grader uses sub-scores to allow partial credit for partial solutions
```
### documents/ARCHITECTURE.md
Document the system architecture. Include:
- A text diagram of how the components connect
- Why FastAPI and Gradio in the same process (HF Spaces constraint)
- Why Pydantic v2 (spec requirement, validation)
- How EpisodeData separates mutable state from immutable document context
- Why tasks are separate classes (easy to extend)
### documents/BASELINE-SCORES.md
Record the reproducible baseline scores. Run them yourself and copy the output here.
```markdown
# Baseline Scores
Recorded on: 2025-01-20
Seed: 42
Machine: 2 vCPU, 8GB RAM
## Random Agent (action_space_sample())
| Task | Score | Steps |
|------|-------|-------|
| task1_price_variance | ~0.18 | 18 (SLA breach) |
| task2_duplicate_tax | ~0.12 | 20 (SLA breach) |
| task3_compound_fraud | ~0.08 | 25 (SLA breach) |
| **Average** | **~0.13** | |
## Optimal Agent (hardcoded correct actions)
| Task | Score | Steps |
|------|-------|-------|
| task1_price_variance | ~0.98 | 9 |
| task2_duplicate_tax | ~0.95 | 10 |
| task3_compound_fraud | ~0.92 | 14 |
| **Average** | **~0.95** | |
```
---
## Step 14 – Push and Verify
```bash
# Final commit
git add .
git commit -m "feat: complete invoice exception handler v1.0.0
- 3 tasks with deterministic graders (easy/medium/hard)
- Full OpenEnv API: reset/step/state/grade
- FastAPI HTTP endpoints for validator (/reset, /step, /state, /health)
- Gradio UI for HF Spaces
- inference.py with OpenAI client and [START][STEP][END] format
- openenv.yaml spec
- Dockerfile for HF Spaces deployment
- documents/ folder with PRD, changelog, architecture, baseline scores"
git push origin main
# Deploy to HF Spaces (if not using git-based deployment)
# The Dockerfile and app.py handle this automatically when pushed to HF
```
---
## Action Space Reference
| Action Type | Required Params | Description |
|---|---|---|
| `inspect_field` | `document, field` | Look at a specific field in a document |
| `cross_check` | `field, doc_a, doc_b` | Compare a field between two documents |
| `run_check` | `check_name` | Run a named validation check |
| `query_supplier` | `question, channel` | Ask the supplier something (channel: phone or email) |
| `query_internal` | `department, question` | Ask an internal team |
| `apply_rule` | `rule_id` | Apply a business policy rule |
| `make_decision` | `decision, reason` | approve / reject / hold / partial_approve |
| `route_to` | `team, notes` | Escalate to a team |
| `close_case` | `summary` | Close with an audit trail summary |
---
## Observation Space Reference
| Field | Type | Description |
|---|---|---|
| `task_id` | str | Which task is running |
| `step_number` | int | Current step |
| `case_status` | str | open / in_review / decided / routed / closed |
| `purchase_order` | PurchaseOrder | What was agreed to be purchased |
| `invoice` | Invoice | What the supplier is claiming |
| `grn` | GoodsReceiptNote | What actually arrived |
| `supplier_master` | SupplierMaster | Verified supplier record |
| `exception_flag` | ExceptionFlag | Why this invoice was flagged |
| `inspections` | List | Fields already inspected |
| `checks_run` | List | Validation checks already run |
| `queries` | List | Queries made and responses |
| `rules_applied` | List | Business rules applied |
| `decision` | str? | Current decision if made |
| `routed_to` | List | Teams this case has been escalated to |
| `available_actions` | List | All 9 action types |
| `available_checks` | List | Check names valid for this task |
| `available_rules` | List | Rule IDs valid for this task |
| `knowledge_base` | List | Policy entries relevant to this task |
| `cumulative_reward` | float | Sum of all rewards so far |
---
## Reward Reference
| Event | Reward |
|---|---|
| Inspecting a key field that reveals an anomaly | +0.08 to +0.14 |
| Inspecting a routine field | +0.01 to +0.06 |
| Cross-check that finds a mismatch | +0.12 to +0.15 |
| Running a check that finds an issue | +0.08 to +0.18 |
| Querying the right person | +0.04 to +0.12 |
| Contacting supplier via wrong channel (Task 3) | -0.15 |
| Applying the correct business rule | +0.08 to +0.12 |
| Applying the wrong rule | -0.05 to -0.10 |
| Correct decision (approve/reject/partial) | +0.18 to +0.28 |
| Approving a fraudulent invoice | -0.35 to -0.40 |
| Wrong rejection (task1) | -0.10 |
| Routing to the right team | +0.06 to +0.14 |
| Clean case closure | +0.06 to +0.12 |
| Repeat action | -0.02 to -0.05 |
| SLA breach (exceed max_steps) | -0.10 |
---
## Expected Baseline Scores
These are the scores you should see when running `inference.py` with a good LLM.
| Task | Difficulty | Random Agent | Rule Agent | LLM Agent (Qwen-72B) |
|---|---|---|---|---|
| task1_price_variance | Easy | ~0.18 | ~0.85 | ~0.80 |
| task2_duplicate_tax | Medium | ~0.12 | ~0.72 | ~0.68 |
| task3_compound_fraud | Hard | ~0.08 | ~0.55 | ~0.45 |
The hard task should be genuinely hard for LLMs: a score of 0.45 is expected, not a failure.
---
## Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| `API_BASE_URL` | Yes | `https://router.huggingface.co/v1` | LLM endpoint |
| `MODEL_NAME` | Yes | `Qwen/Qwen2.5-72B-Instruct` | Model to use |
| `HF_TOKEN` | Yes | – | API key for the LLM endpoint |
| `ANTHROPIC_API_KEY` | No | – | Only if using Anthropic models directly |
---
## Setup Instructions
### Local Development
```bash
# Clone the repo
git clone https://github.com/YOUR_USERNAME/invoice-exception-handler.git
cd invoice-exception-handler
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Run the app locally
python app.py
# Visit http://localhost:7860
```
### Run Inference
```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="your-token-here"
python inference.py
```
### Docker
```bash
docker build -t invoice-exception-handler .
docker run -p 7860:7860 \
-e API_BASE_URL="https://router.huggingface.co/v1" \
-e MODEL_NAME="Qwen/Qwen2.5-72B-Instruct" \
-e HF_TOKEN="your-token-here" \
invoice-exception-handler
```
### HF Spaces Deployment
1. Create a new Space with the Docker SDK (the Space builds from the repo's Dockerfile)
2. Push this repository to it
3. Add secrets in Space settings: `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN`
4. The Space will build and deploy automatically from the Dockerfile
### Validate Submission
```bash
# Install validator
pip install openenv-core
# Validate the spec
openenv validate
# Run the full submission validator script
chmod +x scripts/validate-submission.sh
./scripts/validate-submission.sh https://your-space.hf.space .
```
---
## Common Mistakes to Avoid
1. **Don't misname `inference.py`.** The validator looks for exactly `inference.py` in the repo root.
2. **Don't use the Anthropic SDK in inference.py.** The spec requires the OpenAI client. Use `from openai import OpenAI`.
3. **Don't forget `flush=True` on print statements.** The validator reads stdout line by line. Without flush, logs may not appear.
4. **Don't let the Gradio UI crash the FastAPI server.** If the UI has an error, it should fail gracefully, not bring down `/reset`.
5. **Don't hardcode the model name.** Always read from `os.getenv("MODEL_NAME")`.
6. **Don't put business logic in models.py.** That file is just data shapes.
7. **Don't mutate documents during a step.** The documents (PO, Invoice, GRN) are fixed for the duration of an episode. Only EpisodeData changes.
8. **Don't forget to test determinism.** Same seed + same actions must = same score. Run the determinism test.
9. **Don't skip the docker build test.** The validator builds your Docker image. If it doesn't build, you're disqualified.
10. **Don't forget the changelog.** Update `documents/CHANGELOG.md` before every push.
---
## License
MIT License. See LICENSE file.