# Invoice Exception Handler β€” OpenEnv

> An AI agent learning environment that simulates accounts payable exception handling.
> The agent acts as an AP analyst: receives flagged invoices, investigates root causes,
> makes decisions, and closes cases. Built for the OpenEnv hackathon.

[![OpenEnv](https://img.shields.io/badge/OpenEnv-1.0.0-blue)](https://github.com/openenv/openenv)
[![Python](https://img.shields.io/badge/Python-3.11+-green)](https://python.org)
[![HF Space](https://img.shields.io/badge/HF%20Space-Live-yellow)](https://huggingface.co/spaces)

---

## For Agents Building This Project

This README is the single source of truth for building the entire project from scratch.
Read every section before writing any code. Do not skip sections. Do not guess.

### Ground Rules

1. **Write code like a human wrote it.** Use real variable names, not `x` or `tmp`. Add comments where
   the logic is non-obvious. Leave one blank line between logical blocks inside functions. Use 4-space
   indentation everywhere. Python files get a module docstring at the top explaining what the file does.

2. **Create a new Git repo and push after every major milestone.** A milestone is: models done,
   tasks done, environment done, API done, inference done, app done. Not after every file.

3. **Record every change in `documents/CHANGELOG.md`.** Use the format in the changelog section below.
   Before pushing, append to the changelog what changed and why.

4. **If something in this README conflicts with the competition spec, the competition spec wins.**
   The competition spec is in the document the user shared. Key points: `inference.py` must use the
   OpenAI client. `[START]` `[STEP]` `[END]` format must be exact. `/reset` must return 200.

5. **Test before pushing.** Run `python -c "from env import InvoiceExceptionEnv"` to check imports.
   Run `python inference.py` with a dummy API key to check the log format. Run `docker build .` to
   check the Dockerfile before claiming it works.

---

## What This Environment Does

Every company that buys goods or services deals with invoice exceptions β€” mismatches between
what was ordered, what arrived, and what was invoiced. Currently these are handled by accounts
payable analysts who manually compare documents and decide: approve, reject, hold, or escalate.

This environment puts an AI agent in that analyst's chair. The agent receives:
- A **Purchase Order** (what was agreed to)
- An **Invoice** (what the supplier is claiming)
- A **Goods Receipt Note** (what actually arrived)
- A **Supplier Master** (the verified supplier record)
- An **Exception Flag** (why the system flagged this invoice)

The agent investigates, runs checks, queries people, makes a decision, and closes the case.
Every action has realistic consequences: financial, compliance, and fraud implications.

---

## Repository Structure

Build the project with exactly this structure. Do not add extra directories. Do not rename files.

```
invoice-exception-handler/
β”‚
β”œβ”€β”€ README.md                          ← this file
β”œβ”€β”€ openenv.yaml                       ← OpenEnv spec, must pass openenv validate
β”œβ”€β”€ Dockerfile                         ← single-stage Python 3.11-slim
β”œβ”€β”€ requirements.txt                   ← pinned versions
β”œβ”€β”€ inference.py                       ← competition script, MUST be named this
β”œβ”€β”€ app.py                             ← Gradio + FastAPI, entry point for HF Spaces
β”‚
β”œβ”€β”€ env/
β”‚   β”œβ”€β”€ __init__.py                    ← exports InvoiceExceptionEnv, Action, ALL_TASKS
β”‚   β”œβ”€β”€ models.py                      ← all Pydantic models (Action, EnvironmentState, etc.)
β”‚   β”œβ”€β”€ environment.py                 ← InvoiceExceptionEnv class
β”‚   └── tasks.py                       ← 3 task classes, EpisodeData, graders
β”‚
└── documents/
    β”œβ”€β”€ PRD-001-product-requirements.md
    β”œβ”€β”€ CHANGELOG.md
    β”œβ”€β”€ ARCHITECTURE.md
    └── BASELINE-SCORES.md
```

---

## Step-by-Step Build Order

Follow this order exactly. Do not jump ahead.

```
Step 1 β†’ Create the repo
Step 2 β†’ Write requirements.txt
Step 3 β†’ Write env/models.py
Step 4 β†’ Write env/tasks.py
Step 5 β†’ Write env/environment.py
Step 6 β†’ Write env/__init__.py
Step 7 β†’ Smoke test the environment (run a quick script)
Step 8 β†’ Write openenv.yaml
Step 9 β†’ Write inference.py
Step 10 β†’ Write app.py
Step 11 β†’ Write Dockerfile
Step 12 β†’ Full end-to-end test
Step 13 β†’ Write documents/
Step 14 β†’ Push and verify
```

---

## Step 1 β€” Create the Repo

```bash
# Create the project directory
mkdir invoice-exception-handler
cd invoice-exception-handler

# Initialise git
git init
git checkout -b main

# Create the directory structure
mkdir -p env documents

# Create empty placeholder files so git tracks the structure
touch env/__init__.py
touch documents/.gitkeep

# First commit β€” skeleton only
git add .
git commit -m "init: project skeleton"

# Create the repo on GitHub/HF and push
# Replace with your actual remote
git remote add origin https://github.com/YOUR_USERNAME/invoice-exception-handler.git
git push -u origin main
```

---

## Step 2 β€” requirements.txt

Pin every version. Do not use `>=` ranges β€” the validator builds in a clean environment and
range mismatches cause mysterious failures.

```
pydantic==2.7.1
fastapi==0.111.0
uvicorn==0.29.0
gradio==4.36.1
openai==1.35.3
pyyaml==6.0.1
httpx==0.27.0
python-multipart==0.0.9
```

---

## Step 3 β€” env/models.py

This file defines every typed object in the system. Write it before any other Python code.
Nothing is untyped. Every field has a type annotation.

### What goes in models.py

**Enumerations:**
- `ActionType` β€” the 9 action types an agent can take (string enum)
- `DecisionType` β€” approve / reject / hold / partial_approve (string enum)
- `CaseStatus` β€” open / in_review / decided / routed / closed (string enum)

**Document models** (read-only context given to the agent):
- `LineItem` β€” one line on an invoice or PO (description, quantity, unit_price, total, tax_rate)
- `PurchaseOrder` β€” what was agreed to be purchased
- `Invoice` β€” what the supplier is claiming
- `GoodsReceiptNote` β€” what actually arrived at the warehouse
- `SupplierMaster` β€” the verified, registered supplier record
- `ExceptionFlag` β€” why the system flagged this invoice (flag_code, description, auto_hold)

**Action model:**
- `Action` β€” has a `type: ActionType` and `params: Dict[str, Any]`
- Add classmethod constructors for each action type so callers can do `Action.run_check("tolerance_rule")`

**Result models:**
- `InspectionResult` β€” what came back from inspect_field (document, field, value, note, timestamp)
- `CheckResult` β€” what came back from run_check or cross_check (check_name, passed, detail, timestamp)
- `QueryResult` β€” what came back from a query (target, question, response, channel, timestamp)

**State models:**
- `EnvironmentState` β€” the full observable state returned by reset() and step()
- `StepResult` β€” what step() returns: (observation, reward, done, info)

### EnvironmentState fields

The EnvironmentState must include:
- `task_id: str`
- `step_number: int`
- `case_status: CaseStatus`
- All 5 documents (purchase_order, invoice, grn, supplier_master, exception_flag)
- Agent history: `inspections`, `checks_run`, `queries`, `rules_applied`
- Decision state: `decision`, `decision_reason`, `routed_to`, `case_closed`, `close_summary`
- Action hints: `available_actions`, `available_checks`, `available_rules`, `knowledge_base`
- `cumulative_reward: float`

### Writing style for models.py

```python
"""
Typed models for the Invoice Exception Handler OpenEnv environment.

Every object the agent sees or produces is defined here as a Pydantic model.
This is the single source of truth for the data contract between the
environment simulation and the agent.
"""
from __future__ import annotations

import time
from enum import Enum
from typing import Any, Dict, List, Optional

from pydantic import BaseModel, Field


class ActionType(str, Enum):
    INSPECT_FIELD  = "inspect_field"
    CROSS_CHECK    = "cross_check"
    # ... etc
```

Do not put business logic in models.py. Just data shapes.
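
For reference, a minimal sketch of the `Action` model with two of the classmethod
constructors. The other seven follow the same pattern; the `ActionType.RUN_CHECK` and
`ActionType.MAKE_DECISION` member names are assumed to match the naming shown above.

```python
class Action(BaseModel):
    type: ActionType
    params: Dict[str, Any] = Field(default_factory=dict)

    @classmethod
    def run_check(cls, check_name: str) -> "Action":
        return cls(type=ActionType.RUN_CHECK, params={"check_name": check_name})

    @classmethod
    def make_decision(cls, decision: str, reason: str) -> "Action":
        return cls(type=ActionType.MAKE_DECISION,
                   params={"decision": decision, "reason": reason})
```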

---

## Step 4 β€” env/tasks.py

This is the biggest file. It defines what happens when the agent takes each action β€”
the simulated responses, the rewards, and the grading logic.

### EpisodeData class

A plain Python class (not Pydantic) that tracks everything the agent has done in one episode.

```python
from typing import List, Optional

from .models import CheckResult, InspectionResult, QueryResult


class EpisodeData:
    """Tracks the full history of one episode for grading and state building."""
    
    def __init__(self):
        self.inspections: List[InspectionResult] = []
        self.checks: List[CheckResult] = []
        self.queries: List[QueryResult] = []
        self.rules_applied: List[str] = []
        self.decision: Optional[str] = None
        self.decision_reason: Optional[str] = None
        self.routed_to: List[str] = []
        self.closed: bool = False
        self.close_summary: Optional[str] = None
        self.step_count: int = 0
        self.cumulative_reward: float = 0.0

    def has_inspected(self, doc: str, field: str) -> bool:
        """Check if we already looked at this field in this document."""
        return any(i.document == doc and i.field == field for i in self.inspections)

    def has_checked(self, name: str) -> bool:
        """Check if this validation check has already been run."""
        return any(c.check_name == name for c in self.checks)

    def has_queried(self, target: str) -> bool:
        """Check if we already queried this person or department."""
        return any(q.target == target for q in self.queries)
```

### BaseTask class

Abstract base that all three tasks inherit from. Concrete tasks override every method; in the base class the document factories raise `NotImplementedError` and the simulator bodies are elided below.

```python
class BaseTask:
    task_id: str = "base"
    max_steps: int = 20
    difficulty: str = "easy"
    
    # Document factories β€” return fresh objects each time (no shared state)
    def get_purchase_order(self) -> PurchaseOrder: raise NotImplementedError
    def get_invoice(self) -> Invoice: raise NotImplementedError
    def get_grn(self) -> GoodsReceiptNote: raise NotImplementedError
    def get_supplier_master(self) -> SupplierMaster: raise NotImplementedError
    def get_exception_flag(self) -> ExceptionFlag: raise NotImplementedError
    
    # Simulators β€” each returns (result_object, reward_delta)
    def simulate_inspect(self, document: str, field: str) -> Tuple[InspectionResult, float]: ...
    def simulate_cross_check(self, field: str, doc_a: str, doc_b: str) -> Tuple[CheckResult, float]: ...
    def simulate_run_check(self, check_name: str) -> Tuple[CheckResult, float]: ...
    def simulate_query_supplier(self, question: str, channel: str) -> Tuple[QueryResult, float]: ...
    def simulate_query_internal(self, department: str, question: str) -> Tuple[QueryResult, float]: ...
    def simulate_apply_rule(self, rule_id: str) -> Tuple[str, float]: ...
    def simulate_make_decision(self, decision: str, reason: str, ep: EpisodeData) -> float: ...
    def simulate_route_to(self, team: str, notes: str, ep: EpisodeData) -> float: ...
    def simulate_close(self, summary: str, ep: EpisodeData) -> float: ...
    def grade(self, ep: EpisodeData) -> Dict[str, float]: ...
    
    # These are properties, not methods
    @property
    def available_checks(self) -> List[str]: return []
    
    @property
    def available_rules(self) -> List[str]: return []
    
    @property
    def knowledge_base(self) -> List[str]: return []
```

### The Three Tasks

#### Task 1: PriceVarianceTask (task1_price_variance)

**The scenario:** An office stationery supplier sends an invoice that's 3.08% above the PO.
Company policy allows Β±2% automatic approval. Above that needs manual exception approval.
The supplier did communicate the price increase but procurement never updated the PO.

**task_id:** `"task1_price_variance"`  
**max_steps:** `18`  
**difficulty:** `"easy"`

**The documents:**

PO (PO-2024-1041): 3 stationery line items totalling β‚Ή50,000
- A4 Paper 100 reams @ β‚Ή220 = β‚Ή22,000
- Ballpoint Pens 20 boxes @ β‚Ή450 = β‚Ή9,000
- Staplers 10 units @ β‚Ή1,900 = β‚Ή19,000

Invoice (INV-ON-8821): Same items, same quantities, but 2 items have higher unit prices
- A4 Paper @ β‚Ή231 (+β‚Ή11, +5.0%)
- Ballpoint Pens @ β‚Ή472 (+β‚Ή22, +4.9%)
- Staplers unchanged @ β‚Ή1,900
- Subtotal: β‚Ή51,540 (+β‚Ή1,540, +3.08%)
- 18% GST applied correctly: β‚Ή9,277.20
- Total: β‚Ή60,817.20

GRN (GRN-2024-0892): All items fully received, no pending, no rejected.

Supplier Master (SUP-0441 β€” OfficeNeed Supplies): Bank account and GSTIN both match invoice exactly. No fraud signals.

Exception Flag: `PRICE_MISMATCH` β€” "Invoice total β‚Ή51,540 exceeds PO β‚Ή50,000 by β‚Ή1,540 (3.08%). Above auto-approval threshold."

**Knowledge base entries:**
- POL-001: Price variance ≀±2% may be auto-approved. Above 2% requires exception approval.
- POL-002: Exception approval requires confirmation from originating department.
- POL-003: Any approved invoice with a price change must be followed by a PO amendment request.
- POL-004: Bank account on invoice must match supplier master.

**Simulator logic:**

`simulate_inspect`: Return meaningful values for invoice line_items (+0.10), invoice total_amount (+0.08), po line_items (+0.06), grn items_received (+0.05). Return +0.01 for unknown fields.

`simulate_cross_check`: The key cross-checks are:
- `(unit_price, invoice, po)` β†’ finds Paper and Pen mismatch, reward +0.12
- `(total_amount, invoice, po)` β†’ confirms 3.08% variance, reward +0.10
- `(bank_account, invoice, supplier_master)` β†’ match (no fraud), reward +0.03
- `(gstin, invoice, supplier_master)` β†’ match, reward +0.02
- `(quantity, invoice, grn)` β†’ match (full delivery), reward +0.04

`simulate_run_check`:
- `"tolerance_rule"` β†’ 3.08% > 2%, FAILS, reward +0.14 (most important check)
- `"grn_match"` β†’ PASSES (all received), reward +0.06
- `"duplicate_detection"` β†’ PASSES (not a dup), reward +0.02
- `"bank_account_verification"` β†’ PASSES, reward +0.02
- `"gst_verification"` β†’ PASSES, reward +0.02
- `"po_match"` β†’ FAILS on price, reward +0.08

`simulate_query_supplier`: Returns email from supplier explaining raw material price increase communicated to Arjun Mehta at procurement on Feb 20. Reward +0.10.

`simulate_query_internal`:
- `"procurement"` β†’ Arjun Mehta confirms verbal approval, says he'll raise PO amendment. Reward +0.12.
- Others β†’ generic responses, reward +0.03.

`simulate_apply_rule`:
- `"tolerance_2pct_auto_approve"` β†’ BLOCKED (3.08% > 2%), reward βˆ’0.05
- `"tolerance_exception_approval"` β†’ APPLIED, reward +0.10
- `"rejection_with_reason"` β†’ APPLIED but wrong, reward βˆ’0.08
- `"partial_approval"` β†’ not applicable here, reward βˆ’0.05

`simulate_make_decision`:
- `"approve"` with tolerance check + procurement query: reward +0.25
- `"approve"` with tolerance check only: reward +0.18
- `"approve"` with nothing checked: reward +0.05 (bad approval, should have verified)
- `"reject"`: reward βˆ’0.10 (wrong decision, delay supplier)
- `"hold"`: reward +0.08

`simulate_route_to`:
- `"procurement"` β†’ reward +0.12 (correct β€” PO amendment needed)
- `"finance"` β†’ reward +0.03
- `"legal"` β†’ reward βˆ’0.05 (overkill for a price variance)

`simulate_close`: reward +0.12 if approved + tolerance checked + procurement routed, else +0.06, else 0.
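
As a concrete reference, a minimal sketch of this task's `simulate_run_check`. It
assumes the `CheckResult` fields from Step 3, `Tuple` imported from `typing`, and
`time` from the standard library; the rewards mirror the list above, and the
fallback reward for unknown checks is an assumption.

```python
def simulate_run_check(self, check_name: str) -> Tuple[CheckResult, float]:
    # check_name -> (passed, detail, reward)
    outcomes = {
        "tolerance_rule": (False, "Variance 3.08% exceeds the ±2% auto-approval band.", 0.14),
        "grn_match": (True, "All PO quantities fully received per GRN-2024-0892.", 0.06),
        "duplicate_detection": (True, "No prior invoice matches INV-ON-8821.", 0.02),
        "bank_account_verification": (True, "Invoice bank account matches supplier master.", 0.02),
        "gst_verification": (True, "GSTIN matches supplier master.", 0.02),
        "po_match": (False, "Unit prices on 2 of 3 lines exceed PO-2024-1041.", 0.08),
    }
    passed, detail, reward = outcomes.get(
        check_name,
        (True, f"Check '{check_name}' ran with no findings.", 0.01),  # assumed fallback
    )
    result = CheckResult(check_name=check_name, passed=passed,
                         detail=detail, timestamp=time.time())
    return result, reward
```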

**Grader (`grade` method):**
```python
def grade(self, ep: EpisodeData) -> Dict[str, float]:
    checks_run = {c.check_name for c in ep.checks}
    queries_to = {q.target for q in ep.queries}
    
    # Did the agent correctly diagnose?
    d = 0.0
    if any("unit_price" in c.check_name or "total" in c.check_name 
           for c in ep.checks): 
        d += 0.12
    if "tolerance_rule" in checks_run: 
        d += 0.14
    if "grn_match" in checks_run: 
        d += 0.06
    
    # Did the agent investigate properly?
    i = 0.0
    if "supplier" in queries_to: 
        i += 0.10
    if "procurement" in queries_to: 
        i += 0.12
    if "tolerance_exception_approval" in ep.rules_applied: 
        i += 0.08
    
    # Correct decision?
    dec = 0.0
    if ep.decision == "approve":   dec += 0.18
    elif ep.decision == "hold":    dec += 0.06
    elif ep.decision == "reject":  dec -= 0.10
    
    # Correct routing?
    route = 0.12 if "procurement" in ep.routed_to else 0.0
    
    # Closed cleanly?
    closure = 0.08 if ep.closed else 0.0
    
    # Efficiency bonus β€” penalise extra steps
    eff = max(0.0, 0.06 - 0.004 * max(0, ep.step_count - 9))
    
    total = d + i + dec + route + closure + eff
    return {
        "score": round(max(0.0, min(1.0, total)), 4),
        "diagnosis_score": round(d, 4),
        "investigation_score": round(i, 4),
        "decision_score": round(dec, 4),
        "routing_score": round(route, 4),
        "closure_score": round(closure, 4),
        "efficiency_score": round(eff, 4),
    }
```

---

#### Task 2: DuplicateTaxErrorTask (task2_duplicate_tax)

**The scenario:** Logistics supplier submits INV-2024-891 for transport services. System flags
it as a possible duplicate. Turns out it IS a duplicate of INV-2024-819 β€” the numbers differ
by digit transposition (891 vs 819). That original invoice was already paid. BUT: the original
invoice applied 15% GST when the correct rate is 18%. The company overpaid β‚Ή3,240 in tax.
The new invoice has the correct rate. So it's both a duplicate AND a legitimate correction.

**task_id:** `"task2_duplicate_tax"`  
**max_steps:** `20`  
**difficulty:** `"medium"`

**The documents:**

PO (PO-2024-0778): Logistics services
- Mumbai-Pune Transport 20 trips @ β‚Ή4,500 = β‚Ή90,000
- Warehousing charges Feb 2024 @ β‚Ή18,000 = β‚Ή18,000
- Total: β‚Ή1,08,000, Net-15 terms

Invoice (INV-2024-891): Same services, same amounts β€” correct on the face of it
- Subtotal: β‚Ή1,08,000
- GST 18%: β‚Ή19,440 ← this is CORRECT
- Total: β‚Ή1,27,440

GRN (GRN-2024-0740): Services confirmed complete (transport + warehousing).

Supplier Master (SUP-0229 β€” FastMove Logistics): Bank and GSTIN match invoice. No fraud signals.

Exception Flag: `POSSIBLE_DUPLICATE` β€” "Invoice INV-2024-891 closely matches previously processed invoice."

**Hidden state (not in documents, revealed by checks):**
- INV-2024-819 was paid 12 days ago for β‚Ή1,24,200
- INV-2024-819 applied 15% GST = β‚Ή16,200 (wrong rate)
- Correct 18% GST = β‚Ή19,440
- Company overpaid: β‚Ή3,240

**Key checks and what they reveal:**

`run_check("duplicate_detection")` β†’ FAILS β†’ finds INV-2024-819 paid 12 days ago, reward +0.18

`run_check("tax_calculation_verify")` β†’ FAILS β†’ discovers the 15% error on original, reveals β‚Ή3,240 delta, reward +0.16

`cross_check(invoice_number, invoice, payment_history)` β†’ finds digit transposition, reward +0.15

`cross_check(tax_amount, invoice, payment_history)` β†’ confirms β‚Ή3,240 delta, reward +0.14

`query_internal("finance")` β†’ confirms overpayment on original, reward +0.12

`query_supplier` β†’ supplier confirms they know and wants partial approval for the delta, reward +0.10

`apply_rule("partial_approval")` β†’ correct pathway, reward +0.12

`apply_rule("credit_note_request")` β†’ supplier must issue credit note for the balance, reward +0.10

**Decision logic:**

`simulate_make_decision`:
- `"partial_approve"` with dup + tax found: reward +0.28 ← optimal
- `"partial_approve"` with dup only: reward +0.14 ← incomplete
- `"reject"` with dup found: reward +0.08 ← catches dup, misses correction
- `"approve"` (pays full duplicate): reward βˆ’0.15 ← bad

**Grader weights:**
- diagnosis_score: up to 0.30 (dup found +0.16, tax error found +0.14)
- investigation_score: up to 0.32 (finance queried, supplier queried, rules applied)
- decision_score: up to 0.20 (partial_approve = 0.20, reject = 0.05, approve = βˆ’0.15)
- routing_score: up to 0.08
- closure_score: up to 0.06

---

#### Task 3: CompoundFraudTask (task3_compound_fraud)

**The scenario:** IT supplier submits an ₹8,47,500 invoice for 15 laptops. The system flags a bank
account change, but there are FOUR simultaneous fraud signals, and the agent must find all of them.

**task_id:** `"task3_compound_fraud"`  
**max_steps:** `25`  
**difficulty:** `"hard"`

**The four signals:**

1. **Bank account fraud (Signal 1):** Invoice has a different bank account than the supplier
   master. The change request came from `techcore-solutions.com`. The registered domain is
   `techcore-solutions.in`. Classic Business Email Compromise (BEC) attack.

2. **GSTIN fraud (Signal 2):** The GST number on the invoice (`07AABCT9999X1Z8`) belongs to
   "TechCore Trading Pvt Ltd" β€” a completely different entity in Delhi. Supplier master shows
   `07AABCT1234Y1Z5` for "TechCore Solutions."

3. **Quantity mismatch (Signal 3):** Invoice claims 15 laptops. GRN shows only 13 received.
   2 units are still marked as pending.

4. **Price inflation (Signal 4):** β‚Ή56,500/unit on invoice vs β‚Ή52,000/unit on PO. That's
   8.65% above the agreed price. No price revision was ever approved.

**Bonus signals (smaller, still notable):**
- Invoice is dated a Sunday (2024-03-10) β€” unusual for B2B
- PO was raised Friday March 8 β€” 2-day turnaround is suspiciously fast for IT equipment

**The critical trap β€” channel selection:**

`simulate_query_supplier(question, channel="email")` β†’
Returns fraudster's response urging payment to the new account. Reward: **βˆ’0.15**.

`simulate_query_supplier(question, channel="phone")` β†’
The real TechCore Solutions confirms they sent no bank change request. Confirms fraud. Reward: **+0.15**.

This tests whether the agent follows POL-009 ("bank account change must be verified via
registered phone number β€” NEVER via email") which is in the knowledge base.
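
A minimal sketch of the trap, assuming the `QueryResult` fields from Step 3.
Treating every non-phone channel as email-like is an illustrative simplification.

```python
def simulate_query_supplier(self, question: str, channel: str) -> Tuple[QueryResult, float]:
    if channel == "phone":
        # The real supplier, reached on the registered number, confirms the fraud
        response = ("TechCore Solutions: we never requested a bank account change. "
                    "Please treat this invoice as fraudulent.")
        reward = 0.15
    else:
        # Email replies go to the lookalike domain, i.e. the fraudster
        response = ("'TechCore' (techcore-solutions.com): the new account is correct, "
                    "please process payment urgently.")
        reward = -0.15
    result = QueryResult(target="supplier", question=question, response=response,
                         channel=channel, timestamp=time.time())
    return result, reward
```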

**Available checks and rewards:**

```python
"bank_account_verification"  β†’ FAILS, finds lookalike domain, reward +0.18
"gst_verification"           β†’ FAILS, GST belongs to different entity, reward +0.18
"grn_match"                  β†’ FAILS, 13 vs 15 received, reward +0.14
"email_domain_verification"  β†’ FAILS, lookalike domain confirmed, reward +0.16
"invoice_date_validation"    β†’ FAILS, Sunday flag, reward +0.08
"quantity_check"             β†’ FAILS, quantity inflated, reward +0.12
"price_check"                β†’ FAILS, 8.65% above PO, reward +0.10
"duplicate_detection"        β†’ PASSES (not a dup), reward +0.02
"po_match"                   β†’ FAILS (GST + qty + price all wrong), reward +0.08
```

**Decision logic:**

`simulate_make_decision`:
- `"reject"` β†’ reward = 0.10 + 0.05 Γ— (number of signals found) β†’ max ~0.30
- `"approve"` β†’ reward βˆ’0.40 (catastrophic β€” approved fraud)
- `"partial_approve"` β†’ reward βˆ’0.20 (you can't partially approve fraud)
- `"hold"` β†’ reward = 0.08 + 0.03 Γ— signals found β†’ acceptable but not optimal

**Route logic:**

```python
"legal"     β†’ reward +0.14  # must escalate to legal
"security"  β†’ reward +0.12  # BEC attack needs security investigation
"finance"   β†’ reward +0.08  # finance needs to block payment
"procurement" β†’ reward +0.06
```

**Grader β€” the signal detection scoring:**

```python
def grade(self, ep: EpisodeData) -> Dict[str, float]:
    # Only checks that FAILED count as signals found
    failed = {c.check_name for c in ep.checks if not c.passed}

    bank_found   = "bank_account_verification" in failed
    gst_found    = "gst_verification" in failed
    qty_found    = "grn_match" in failed
    domain_found = "email_domain_verification" in failed
    price_found  = "price_check" in failed
    
    # Diagnosis β€” finding all signals is the whole point
    d = (0.12 if bank_found else 0) + (0.12 if gst_found else 0) \
      + (0.10 if qty_found else 0)  + (0.10 if domain_found else 0) \
      + (0.06 if price_found else 0)
    
    # Investigation β€” reward for using phone not email
    i = 0.0
    for q in ep.queries:
        if q.target == "supplier" and q.channel not in ("email", "mail"):
            i += 0.10   # correct channel
        elif q.target == "supplier" and q.channel in ("email", "mail"):
            i -= 0.15   # contacting fraudster
    if "legal" in {q.target for q in ep.queries}:    i += 0.06
    if "security" in {q.target for q in ep.queries}: i += 0.06
    
    # Decision
    signals = sum([bank_found, gst_found, qty_found, domain_found])
    dec = 0.0
    if ep.decision == "reject":
        dec = 0.08 + 0.03 * signals
    elif ep.decision == "approve":
        dec = -0.35
    elif ep.decision == "partial_approve":
        dec = -0.15
    elif ep.decision == "hold":
        dec = 0.06
    
    # Routing
    routes = set(ep.routed_to)
    route = (0.10 if "legal" in routes else 0) \
          + (0.06 if "security" in routes else 0) \
          + (0.04 if "finance" in routes else 0)
    
    closure = 0.06 if (ep.closed and ep.decision == "reject") else 0.0
    eff = max(0.0, 0.04 - 0.002 * max(0, ep.step_count - 12))
    
    total = d + i + dec + route + closure + eff
    return {
        "score": round(max(0.0, min(1.0, total)), 4),
        "signals_found": sum([bank_found, gst_found, qty_found, domain_found, price_found]),
        "diagnosis_score": round(d, 4),
        "investigation_score": round(i, 4),
        "decision_score": round(dec, 4),
        "routing_score": round(route, 4),
        "closure_score": round(closure, 4),
        "efficiency_score": round(eff, 4),
    }
```

### Task Registry

At the bottom of tasks.py:

```python
TASK_REGISTRY: Dict[str, type] = {
    "task1_price_variance": PriceVarianceTask,
    "task2_duplicate_tax":  DuplicateTaxErrorTask,
    "task3_compound_fraud": CompoundFraudTask,
}

ALL_TASKS = list(TASK_REGISTRY.keys())

def make_task(task_id: str) -> BaseTask:
    cls = TASK_REGISTRY.get(task_id)
    if cls is None:
        raise ValueError(f"Unknown task '{task_id}'. Available: {ALL_TASKS}")
    return cls()
```

---

## Step 5 β€” env/environment.py

This is the `InvoiceExceptionEnv` class. It is the only thing external code needs to import.

```python
class InvoiceExceptionEnv:
    """
    OpenEnv-compatible Invoice Exception Handler environment.
    
    Usage:
        env = InvoiceExceptionEnv(seed=42)
        obs = env.reset("task1_price_variance")
        result = env.step(Action.run_check("tolerance_rule"))
        scores = env.grade()
    """
```

### Constructor

Takes an optional `seed: Optional[int] = None` for reproducibility.
Initialises `self._rng = random.Random(seed)`.
Initialises `self._task`, `self._ep`, `self._state`, `self._done` all to None/False.

### reset(task_id)

```python
def reset(self, task_id: Optional[str] = None) -> EnvironmentState:
    """
    Start a new episode. If task_id is None, picks one at random.
    Returns the initial EnvironmentState showing all documents and available actions.
    """
```

1. Pick task (random if None)
2. Create `EpisodeData()`
3. Set `self._done = False`
4. Call `self._build_state()` and store result
5. Return the state

### step(action)

```python
def step(self, action: Union[Action, Dict[str, Any]]) -> StepResult:
    """
    Execute one action. Returns observation, reward, done flag, and info dict.
    Raises RuntimeError if called before reset() or after the episode is done.
    """
```

1. Validate we're in an active episode
2. Convert dict to Action if needed
3. Call `self._dispatch(action)` β†’ gets (reward, info)
4. Increment step count
5. Check SLA (step count vs max_steps)
6. Check done condition (closed or SLA breach)
7. Rebuild state
8. Return StepResult
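
A condensed sketch of `step()` under the assumptions above. The −0.10 SLA penalty
comes from the reward section of the Step 8 spec.

```python
def step(self, action: Union[Action, Dict[str, Any]]) -> StepResult:
    if self._task is None or self._done:
        raise RuntimeError("step() requires an active episode; call reset() first")

    if isinstance(action, dict):
        action = Action(**action)

    reward, info = self._dispatch(action)
    self._ep.step_count += 1

    # SLA: running past max_steps without closing ends the episode with a penalty
    if self._ep.step_count >= self._task.max_steps and not self._ep.closed:
        reward -= 0.10
        info["sla_breach"] = True

    self._ep.cumulative_reward += reward
    self._done = self._ep.closed or self._ep.step_count >= self._task.max_steps
    self._state = self._build_state()

    return StepResult(observation=self._state, reward=reward,
                      done=self._done, info=info)
```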

### state()

Non-destructive. Just returns `self._state`. Raises RuntimeError if not initialised.

### grade()

Calls `self._task.grade(self._ep)` and returns the dict.

### _dispatch(action)

The routing function. A single if/elif chain for each ActionType.

For each action:
1. Call the appropriate task simulator
2. Update EpisodeData
3. Return (reward, info dict)

Handle repeated actions (inspect same field twice, check same thing twice) with a small βˆ’0.02 to βˆ’0.05 penalty and return early.
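
A condensed sketch of `_dispatch` covering two of the nine branches; the rest
follow the same pattern. The −0.03 repeat penalty and the `ActionType` member
names are illustrative.

```python
def _dispatch(self, action: Action) -> Tuple[float, Dict[str, Any]]:
    params = action.params

    if action.type == ActionType.RUN_CHECK:
        if self._ep.has_checked(params["check_name"]):
            return -0.03, {"note": "check already run"}
        result, reward = self._task.simulate_run_check(params["check_name"])
        self._ep.checks.append(result)
        return reward, {"result": result.model_dump()}

    if action.type == ActionType.CLOSE_CASE:
        reward = self._task.simulate_close(params.get("summary", ""), self._ep)
        self._ep.closed = True
        self._ep.close_summary = params.get("summary")
        return reward, {"closed": True}

    # ... remaining ActionType branches
    raise ValueError(f"Unhandled action type: {action.type}")
```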

### _build_state()

Constructs an `EnvironmentState` from the current `_task` and `_ep`. Called after every step.
Also determines the current `CaseStatus` based on episode data.

### action_space_sample()

Returns a random valid action (for random baseline agents). Uses `self._rng` for reproducibility.

---

## Step 6 β€” env/__init__.py

```python
from .environment import InvoiceExceptionEnv
from .models import Action, ActionType, EnvironmentState, StepResult
from .tasks import ALL_TASKS, make_task

__all__ = [
    "InvoiceExceptionEnv",
    "Action",
    "ActionType",
    "EnvironmentState",
    "StepResult",
    "ALL_TASKS",
    "make_task",
]
```

---

## Step 7 β€” Smoke Test Before Continuing

Before writing openenv.yaml or inference.py, verify the environment works.

```python
# test_smoke.py β€” run this, do not commit it
from env import InvoiceExceptionEnv, Action, ALL_TASKS

print("Tasks:", ALL_TASKS)

env = InvoiceExceptionEnv(seed=42)

for task_id in ALL_TASKS:
    obs = env.reset(task_id)
    print(f"\n--- {task_id} ---")
    print("Ticket:", obs.exception_flag.flag_description[:80])
    
    # Take a few actions
    r1 = env.step(Action.run_check(obs.available_checks[0]))
    print(f"Step 1 reward: {r1.reward}")
    
    r2 = env.step(Action.make_decision("approve", "test"))
    print(f"Step 2 reward: {r2.reward}")
    
    r3 = env.step(Action.close_case("closed"))
    print(f"Step 3 reward: {r3.reward}, done: {r3.done}")
    
    scores = env.grade()
    print(f"Grade: {scores['score']}")

print("\nSmoke test passed.")
```

All three tasks must complete without errors. Scores must be in [0.0, 1.0].

---

## Step 8 β€” openenv.yaml

This file must pass `openenv validate`. Write it carefully.

```yaml
# openenv.yaml
name: Invoice Exception Handler
version: "1.0.0"
description: |
  An agent learning environment simulating accounts payable exception handling.
  The agent acts as an AP analyst: investigates flagged invoices, applies business
  rules, detects fraud signals, makes decisions, and closes cases with an audit trail.

authors:
  - name: Your Name
    email: your@email.com

license: MIT

tasks:
  - id: task1_price_variance
    name: Price Variance Exception
    difficulty: easy
    description: |
      Office stationery invoice arrives 3.08% above PO. Company tolerance policy
      allows Β±2% auto-approval. Agent must detect the variance, verify through
      the tolerance rule, confirm verbal approval with procurement, and approve
      with a PO amendment request.
    max_steps: 18
    optimal_score: 1.0
    min_passing_score: 0.60

  - id: task2_duplicate_tax
    name: Duplicate Invoice with Tax Error
    difficulty: medium
    description: |
      Logistics supplier submits INV-2024-891, a duplicate of paid INV-2024-819
      (digit transposition: 891 vs 819). Original invoice had wrong GST rate (15%
      vs correct 18%) β€” company overpaid β‚Ή3,240. New invoice has correct rate.
      Agent must detect the duplicate, identify the tax error in the original,
      and partially approve only the β‚Ή3,240 tax correction.
    max_steps: 20
    optimal_score: 1.0
    min_passing_score: 0.50

  - id: task3_compound_fraud
    name: Compound Fraud Signals
    difficulty: hard
    description: |
      IT equipment supplier invoice with four simultaneous fraud signals: bank
      account changed via BEC attack (lookalike email domain), GSTIN belongs to
      a different entity, 2 of 15 laptops not yet received, and unit price 8.65%
      above PO. Agent must find all signals, use the correct communication channel
      (phone, not email β€” which would contact the fraudster), and escalate to legal
      and security.
    max_steps: 25
    optimal_score: 1.0
    min_passing_score: 0.40

observation_space:
  type: object
  description: EnvironmentState Pydantic model
  fields:
    task_id:             {type: string}
    step_number:         {type: integer}
    case_status:         {type: string, enum: [open, in_review, decided, routed, closed]}
    purchase_order:      {type: object, description: "PO with line items and terms"}
    invoice:             {type: object, description: "Supplier invoice with line items and tax"}
    grn:                 {type: object, description: "Goods receipt β€” what actually arrived"}
    supplier_master:     {type: object, description: "Verified supplier record"}
    exception_flag:      {type: object, description: "Why the system flagged this invoice"}
    inspections:         {type: array, description: "Fields the agent has inspected"}
    checks_run:          {type: array, description: "Validation checks completed"}
    queries:             {type: array, description: "Internal and supplier queries"}
    rules_applied:       {type: array, description: "Business rules applied"}
    decision:            {type: string, nullable: true}
    routed_to:           {type: array}
    available_actions:   {type: array}
    available_checks:    {type: array}
    available_rules:     {type: array}
    knowledge_base:      {type: array}
    cumulative_reward:   {type: number}

action_space:
  type: object
  description: Action with type and params
  actions:
    inspect_field:
      params: {document: string, field: string}
    cross_check:
      params: {field: string, doc_a: string, doc_b: string}
    run_check:
      params: {check_name: string}
    query_supplier:
      params: {question: string, channel: string}
    query_internal:
      params: {department: string, question: string}
    apply_rule:
      params: {rule_id: string}
    make_decision:
      params: {decision: string, reason: string}
    route_to:
      params: {team: string, notes: string}
    close_case:
      params: {summary: string}

reward:
  range: [-1.0, 1.0]
  description: |
    Shaped reward at every step. Relevant inspections: +0.01 to +0.14.
    Diagnostics revealing issues: +0.08 to +0.18. Correct fixes: +0.08 to +0.30.
    Wrong decision on fraud: -0.15 to -0.40. Repeat actions: -0.02 to -0.05.
    SLA breach: -0.10.

grading:
  method: task_grader
  scores:
    - score           # 0.0–1.0 overall
    - diagnosis_score
    - investigation_score
    - decision_score
    - routing_score
    - closure_score
    - efficiency_score

api:
  reset:
    signature: "reset(task_id: str | None = None) -> EnvironmentState"
  step:
    signature: "step(action: Action | dict) -> StepResult"
  state:
    signature: "state() -> EnvironmentState"
  grade:
    signature: "grade() -> Dict[str, float]"

http_endpoints:
  - path: /reset
    method: POST
    description: Reset environment, returns EnvironmentState JSON
  - path: /step
    method: POST
    description: Execute action, returns StepResult JSON
  - path: /state
    method: GET
    description: Current state, returns EnvironmentState JSON
  - path: /grade
    method: POST
    description: Grade current episode
  - path: /health
    method: GET
    description: Health check

dependencies:
  python: ">=3.11"
  packages:
    - pydantic==2.7.1
    - fastapi==0.111.0
    - uvicorn==0.29.0
    - gradio==4.36.1
    - openai==1.35.3
    - pyyaml==6.0.1

docker:
  port: 7860
  health_check: /health
```

---

## Step 9 β€” inference.py

This is the most critical file for the hackathon validator. Get the format exactly right.

### Required env vars

```python
API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
MODEL_NAME   = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
API_KEY      = os.getenv("HF_TOKEN") or os.getenv("API_KEY", "")
```

### Required stdout format

Every line to stdout must be exactly:
```
[START] task=<task_id> env=invoice-exception-handler model=<model_name>
[STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
[END] success=<true|false> steps=<n> score=<0.000> rewards=<r1,r2,...>
```

Rules (do not deviate):
- One `[START]` line at episode begin
- One `[STEP]` line per step, immediately after `env.step()` returns
- One `[END]` line after the episode, always emitted even on exception
- `reward` and all values in `rewards` formatted to exactly 2 decimal places
- `score` formatted to exactly 3 decimal places
- `done` and `success` are lowercase: `true` or `false`
- `error` is the error message string, or exactly `null` if none
- No newlines within a single line
- `flush=True` on every print so the validator sees output in real time
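
For concreteness, an abbreviated three-step episode on Task 1 would log like this
(rewards taken from the Task 1 tables; this particular lazy run scores below 0.5):

```
[START] task=task1_price_variance env=invoice-exception-handler model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action={"type": "run_check", "params": {"check_name": "tolerance_rule"}} reward=0.14 done=false error=null
[STEP] step=2 action={"type": "make_decision", "params": {"decision": "approve", "reason": "variance confirmed"}} reward=0.18 done=false error=null
[STEP] step=3 action={"type": "close_case", "params": {"summary": "approved"}} reward=0.06 done=true error=null
[END] success=false steps=3 score=0.460 rewards=0.14,0.18,0.06
```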

### System prompt for the LLM

Write a clear system prompt that tells the model:
- It is an AP analyst handling a flagged invoice
- It has a structured action space (list all 9 action types)
- It must respond in JSON: `{"type": "...", "params": {...}}`
- It should investigate before deciding
- Never approve without checking, never contact supplier by email if fraud is suspected
- Available documents: PO, Invoice, GRN, Supplier Master, Exception Flag

### User prompt per step

Include in the user prompt:
- Current step number and max steps
- The exception flag (what was flagged and why)
- Available checks (list them)
- Available rules (list them)
- Knowledge base entries (the policy list)
- What has been done so far (checks run, queries made, inspections done)
- Current cumulative reward
- Ask for next action as JSON
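
A hedged sketch of `build_prompt` assembling those pieces, assuming the
`EnvironmentState` field names from Step 3 (the `history` list is the one built
in the `run_task` skeleton below):

```python
def build_prompt(obs, step: int, max_steps: int, history: list) -> str:
    policies = "\n".join(f"- {entry}" for entry in obs.knowledge_base)
    done_so_far = "\n".join(history[-8:]) if history else "(nothing yet)"

    return (
        f"Step {step}/{max_steps}. Case status: {obs.case_status.value}.\n"
        f"Exception flag: {obs.exception_flag.description}\n"
        f"Available checks: {', '.join(obs.available_checks)}\n"
        f"Available rules: {', '.join(obs.available_rules)}\n"
        f"Policies:\n{policies}\n"
        f"Done so far:\n{done_so_far}\n"
        f"Cumulative reward: {obs.cumulative_reward:.2f}\n"
        'Reply with ONLY the next action as JSON: {"type": "...", "params": {...}}'
    )
```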

### Parsing LLM output

```python
def parse_action(raw_text: str) -> dict:
    """
    Parse the model's response into an action dict.
    Handles markdown code fences, extra whitespace, and minor formatting errors.
    Falls back to run_check(po_match) if parsing fails.
    """
    text = raw_text.strip()
    # Remove ```json or ``` fences if present
    if text.startswith("```"):
        lines = text.split("\n")
        text = "\n".join(lines[1:-1] if lines[-1] == "```" else lines[1:])
    try:
        return json.loads(text.strip())
    except json.JSONDecodeError:
        # Try to find a JSON object inside surrounding prose
        # (json and re are imported at module top)
        match = re.search(r'\{.*\}', text, re.DOTALL)
        if match:
            try:
                return json.loads(match.group())
            except json.JSONDecodeError:
                pass
    # Safe fallback
    return {"type": "run_check", "params": {"check_name": "po_match"}}
```
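
And a matching sketch of `call_llm`, using the standard OpenAI chat completions
call; `SYSTEM_PROMPT` is the system prompt described above.

```python
def call_llm(client, user_prompt: str) -> str:
    response = client.chat.completions.create(
        model=MODEL_NAME,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0.2,  # low temperature keeps the action JSON stable
        max_tokens=300,
    )
    return response.choices[0].message.content
```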

### Overall structure

```python
def run_task(client, env, task_id, max_steps=20):
    """Run one task episode and return (steps_taken, score, rewards)."""
    rewards = []
    
    print(f"[START] task={task_id} env=invoice-exception-handler model={MODEL_NAME}", flush=True)
    
    obs = env.reset(task_id)
    history = []
    
    for step in range(1, max_steps + 1):
        # Build prompt from observation
        user_prompt = build_prompt(obs, step, max_steps, history)
        
        # Call LLM
        raw = call_llm(client, user_prompt)
        action_dict = parse_action(raw)
        
        # Execute
        try:
            result = env.step(action_dict)
            reward = result.reward
            done = result.done
            error = None
        except Exception as e:
            reward = 0.0
            done = False
            error = str(e)
            result = None
        
        rewards.append(reward)
        action_str = json.dumps(action_dict)
        
        print(
            f"[STEP] step={step} action={action_str} "
            f"reward={reward:.2f} done={str(done).lower()} "
            f"error={error or 'null'}",
            flush=True
        )
        
        history.append(f"Step {step}: {action_str} β†’ reward {reward:+.2f}")
        
        if result:
            obs = result.observation
        
        if done:
            break
    
    score = env.grade()["score"]
    success = score >= 0.5
    steps_taken = min(step, max_steps)
    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
    
    print(
        f"[END] success={str(success).lower()} steps={steps_taken} "
        f"score={score:.3f} rewards={rewards_str}",
        flush=True
    )
    
    return steps_taken, score, rewards


def main():
    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
    env = InvoiceExceptionEnv(seed=42)
    
    for task_id in ALL_TASKS:
        run_task(client, env, task_id)


if __name__ == "__main__":
    main()
```

---

## Step 10 β€” app.py

The app.py serves two purposes:
1. Provides the FastAPI HTTP endpoints that the validator pings (`POST /reset` must return 200)
2. Provides a Gradio UI for interactive exploration on HF Spaces

### Architecture

Run both FastAPI and Gradio in the same process on port 7860.
Use `gr.mount_gradio_app` to mount Gradio on FastAPI, or run Gradio alongside FastAPI.

The cleanest approach:

```python
import gradio as gr
from fastapi import FastAPI
from fastapi.responses import JSONResponse
import uvicorn

app = FastAPI(title="Invoice Exception Handler OpenEnv")
env = InvoiceExceptionEnv(seed=42)  # shared environment instance

@app.post("/reset")
async def http_reset(body: dict | None = None):
    # Accept an empty POST body and avoid a mutable default argument
    task_id = (body or {}).get("task_id")
    obs = env.reset(task_id)
    return JSONResponse(obs.model_dump(mode="json"))

@app.post("/step")
async def http_step(body: dict):
    result = env.step(body)
    return JSONResponse(result.model_dump(mode="json"))

@app.get("/state")
async def http_state():
    return JSONResponse(env.state().model_dump(mode="json"))

@app.post("/grade")
async def http_grade():
    return JSONResponse(env.grade())

@app.get("/tasks")
async def http_tasks():
    return JSONResponse(ALL_TASKS)

@app.get("/health")
async def health():
    return JSONResponse({"status": "ok", "version": "1.0.0"})

# Mount the Gradio UI at the root path; the JSON routes registered above match first
gradio_app = build_gradio_ui()
app = gr.mount_gradio_app(app, gradio_app, path="/")

if __name__ == "__main__":
    # The Dockerfile runs `python app.py`, so start the server here
    uvicorn.run(app, host="0.0.0.0", port=7860)
```

### Gradio UI β€” what to build

Keep the UI simple and functional. Three tabs:

**Tab 1: Manual Play**
- Dropdown to select task (labels: "Task 1 β€” Price Variance (Easy)", etc.)
- Reset button
- Shows the exception flag, the key document fields, and available actions
- Dropdown or textbox to compose and submit an action
- Shows reward, cumulative reward, and status after each step
- Shows grade breakdown when episode ends

**Tab 2: Agent Demo**
- Select task
- Shows a hardcoded optimal action sequence running step by step
- Good for demonstrating the environment to judges who won't run code

**Tab 3: API Reference**
- Code examples for each action type
- Reward table
- Grader score breakdown explanation

---

## Step 11 β€” Dockerfile

```dockerfile
FROM python:3.11-slim

# Install system dependencies
RUN apt-get update \
    && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*

# Create non-root user (required by HF Spaces)
RUN useradd -m -u 1000 appuser

WORKDIR /app

# Copy and install dependencies first (layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY --chown=appuser:appuser . .

USER appuser

EXPOSE 7860

# Health check β€” pings the /health endpoint
HEALTHCHECK --interval=30s --timeout=10s --start-period=20s --retries=3 \
    CMD curl -f http://localhost:7860/health || exit 1

ENV PYTHONUNBUFFERED=1
ENV GRADIO_SERVER_NAME=0.0.0.0
ENV GRADIO_SERVER_PORT=7860

CMD ["python", "app.py"]
```

---

## Step 12 β€” End-to-End Test Checklist

Before pushing, check every item in this list.

```bash
# 1. Imports work
python -c "from env import InvoiceExceptionEnv, Action, ALL_TASKS; print('OK')"

# 2. All three tasks complete without errors
python -c "
from env import InvoiceExceptionEnv, Action, ALL_TASKS
env = InvoiceExceptionEnv(seed=42)
for t in ALL_TASKS:
    obs = env.reset(t)
    result = env.step(Action.run_check(obs.available_checks[0]))
    result = env.step(Action.make_decision('reject', 'test'))
    result = env.step(Action.close_case('test'))
    score = env.grade()['score']
    assert 0.0 <= score <= 1.0, f'Score out of range: {score}'
    print(f'{t}: {score}')
print('All tasks OK')
"

# 3. Graders are deterministic
python -c "
from env import InvoiceExceptionEnv, Action
env1 = InvoiceExceptionEnv(seed=42)
env2 = InvoiceExceptionEnv(seed=42)
obs1 = env1.reset('task1_price_variance')
obs2 = env2.reset('task1_price_variance')
env1.step(Action.run_check('tolerance_rule'))
env2.step(Action.run_check('tolerance_rule'))
env1.step(Action.make_decision('approve', 'test'))
env2.step(Action.make_decision('approve', 'test'))
env1.step(Action.close_case('done'))
env2.step(Action.close_case('done'))
s1 = env1.grade()['score']
s2 = env2.grade()['score']
assert s1 == s2, f'Non-deterministic: {s1} vs {s2}'
print(f'Deterministic: {s1}')
"

# 4. inference.py log format (with a fake API key)
# The API call will fail, but the [START] line must print first in the exact format
API_BASE_URL=https://api.example.com HF_TOKEN=fake MODEL_NAME=test python inference.py || true
# Manually verify the [START] line printed correctly

# 5. Docker builds
docker build -t invoice-env-test .

# 6. Docker runs and /health returns 200
docker run -d -p 7860:7860 --name test-env invoice-env-test
sleep 15
curl -f http://localhost:7860/health
curl -s -X POST http://localhost:7860/reset -H "Content-Type: application/json" -d '{}'
docker stop test-env && docker rm test-env

# 7. openenv validate (if openenv-core is installed)
pip install openenv-core
openenv validate
```

---

## Step 13 β€” documents/ Folder

Create all four files listed in the repository structure and keep them updated as the project evolves. `PRD-001-product-requirements.md` records the product requirements (problem statement, the three tasks, action space, and reward design); the other three are detailed below.

### documents/CHANGELOG.md

```markdown
# Changelog

All changes to the Invoice Exception Handler environment are recorded here.
Format: Date | Version | What changed | Why

---

## [1.0.0] β€” 2025-01-20

### Added
- Initial implementation of InvoiceExceptionEnv with full OpenEnv API
- Three tasks: task1_price_variance, task2_duplicate_tax, task3_compound_fraud
- Pydantic v2 typed models for all environment objects
- FastAPI HTTP endpoints for HF Spaces validation
- Gradio UI for interactive exploration
- inference.py using OpenAI client with [START][STEP][END] log format
- openenv.yaml spec file
- Dockerfile for HF Spaces deployment

### Design decisions
- Used pure Python simulation (no external databases) for portability and determinism
- Compound fraud task has four signals to prevent simple greedy agents from scoring well
- Channel selection in Task 3 (phone vs email) tests policy knowledge, not just anomaly detection
- Grader uses sub-scores to allow partial credit for partial solutions
```

### documents/ARCHITECTURE.md

Document the system architecture. Include:
- A text diagram of how the components connect (a sketch follows this list)
- Why FastAPI and Gradio in the same process (HF Spaces constraint)
- Why Pydantic v2 (spec requirement, validation)
- How EpisodeData separates mutable state from immutable document context
- Why tasks are separate classes (easy to extend)
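
For reference, a minimal sketch of what the diagram portion might look like, using the module and class names this guide has used so far (adjust to match your actual layout):

```text
HF Spaces container (single process)

  app.py
   |- FastAPI routes: /reset /step /state /health
   '- Gradio UI (served from the same process, per the HF Spaces constraint)
         |
         v
  env.py: InvoiceExceptionEnv
   |- task classes: task1_price_variance / task2_duplicate_tax / task3_compound_fraud
   |- EpisodeData: mutable per-episode state (checks run, queries, decision)
   '- models.py: Pydantic v2 shapes (PurchaseOrder, Invoice, GoodsReceiptNote, SupplierMaster)
```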

### documents/BASELINE-SCORES.md

Record reproducible baseline scores: run the agents yourself and copy the output here.

```markdown
# Baseline Scores

Recorded on: 2025-01-20
Seed: 42
Machine: 2 vCPU, 8GB RAM

## Random Agent (action_space_sample())

| Task | Score | Steps |
|------|-------|-------|
| task1_price_variance | ~0.18 | 18 (SLA breach) |
| task2_duplicate_tax  | ~0.12 | 20 (SLA breach) |
| task3_compound_fraud | ~0.08 | 25 (SLA breach) |
| **Average** | **~0.13** | |

## Optimal Agent (hardcoded correct actions)

| Task | Score | Steps |
|------|-------|-------|
| task1_price_variance | ~0.98 | 9  |
| task2_duplicate_tax  | ~0.95 | 10 |
| task3_compound_fraud | ~0.92 | 14 |
| **Average** | **~0.95** | |
```
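
To regenerate the random-agent rows, something along these lines works. It uses only calls confirmed in Step 12, except `action_space_sample()`, whose existence and spelling are taken from the table heading above, and the assumption that stepping a finished episode either no-ops or raises:

```python
# Hypothetical regeneration of the random-agent baseline rows.
from env import InvoiceExceptionEnv, ALL_TASKS

env = InvoiceExceptionEnv(seed=42)
scores = {}
for task in ALL_TASKS:
    obs = env.reset(task)
    for _ in range(30):  # generous cap; SLA breaches are penalized by the grader
        try:
            env.step(env.action_space_sample())  # assumed API, per the table heading
        except Exception:
            break  # episode already terminated
    scores[task] = env.grade()["score"]
    print(f"{task}: {scores[task]:.2f}")

print(f"Average: {sum(scores.values()) / len(scores):.2f}")
```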

---

## Step 14 β€” Push and Verify

```bash
# Final commit
git add .
git commit -m "feat: complete invoice exception handler v1.0.0

- 3 tasks with deterministic graders (easy/medium/hard)
- Full OpenEnv API: reset/step/state/grade
- FastAPI HTTP endpoints for validator (/reset, /step, /state, /health)
- Gradio UI for HF Spaces
- inference.py with OpenAI client and [START][STEP][END] format
- openenv.yaml spec
- Dockerfile for HF Spaces deployment
- documents/ folder with PRD, changelog, architecture, baseline scores"

git push origin main

# Deploy to HF Spaces: if the Space itself is your git remote, the push above
# is the deployment. The Dockerfile and app.py are picked up automatically.
```

---

## Action Space Reference

| Action Type | Required Params | Description |
|---|---|---|
| `inspect_field` | `document, field` | Look at a specific field in a document |
| `cross_check` | `field, doc_a, doc_b` | Compare a field between two documents |
| `run_check` | `check_name` | Run a named validation check |
| `query_supplier` | `question, channel` | Ask the supplier something (channel: phone or email) |
| `query_internal` | `department, question` | Ask an internal team |
| `apply_rule` | `rule_id` | Apply a business policy rule |
| `make_decision` | `decision, reason` | approve / reject / hold / partial_approve |
| `route_to` | `team, notes` | Escalate to a team |
| `close_case` | `summary` | Close with an audit trail summary |
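
Only `run_check`, `make_decision`, and `close_case` appear verbatim in the Step 12 tests; as a working hypothesis, the other constructors mirror the table rows one-to-one, taking keyword arguments named after the required params. A sketch under that assumption:

```python
from env import InvoiceExceptionEnv, Action

env = InvoiceExceptionEnv(seed=42)
obs = env.reset("task3_compound_fraud")

env.step(Action.inspect_field(document="invoice", field="bank_account"))  # assumed constructor
env.step(Action.cross_check(field="unit_price", doc_a="purchase_order", doc_b="invoice"))  # assumed
env.step(Action.run_check(obs.available_checks[0]))  # confirmed in Step 12
env.step(Action.query_supplier(question="Confirm the new bank details", channel="phone"))  # assumed
env.step(Action.make_decision("reject", "multiple fraud signals"))  # confirmed
env.step(Action.route_to(team="fraud_review", notes="compound fraud signals"))  # assumed
env.step(Action.close_case("rejected; routed to fraud review"))  # confirmed

print(env.grade()["score"])
```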

---

## Observation Space Reference

| Field | Type | Description |
|---|---|---|
| `task_id` | str | Which task is running |
| `step_number` | int | Current step |
| `case_status` | str | open / in_review / decided / routed / closed |
| `purchase_order` | PurchaseOrder | What was agreed to be purchased |
| `invoice` | Invoice | What the supplier is claiming |
| `grn` | GoodsReceiptNote | What actually arrived |
| `supplier_master` | SupplierMaster | Verified supplier record |
| `exception_flag` | ExceptionFlag | Why this invoice was flagged |
| `inspections` | List | Fields already inspected |
| `checks_run` | List | Validation checks already run |
| `queries` | List | Queries made and responses |
| `rules_applied` | List | Business rules applied |
| `decision` | Optional[str] | Current decision, if made |
| `routed_to` | List | Teams this case has been escalated to |
| `available_actions` | List | All 9 action types |
| `available_checks` | List | Check names valid for this task |
| `available_rules` | List | Rule IDs valid for this task |
| `knowledge_base` | List | Policy entries relevant to this task |
| `cumulative_reward` | float | Sum of all rewards so far |

---

## Reward Reference

| Event | Reward |
|---|---|
| Inspecting a key field that reveals an anomaly | +0.08 to +0.14 |
| Inspecting a routine field | +0.01 to +0.06 |
| Cross-check that finds a mismatch | +0.12 to +0.15 |
| Running a check that finds an issue | +0.08 to +0.18 |
| Querying the right person | +0.04 to +0.12 |
| Contacting supplier via wrong channel (Task 3) | βˆ’0.15 |
| Applying the correct business rule | +0.08 to +0.12 |
| Applying the wrong rule | βˆ’0.05 to βˆ’0.10 |
| Correct decision (approve/reject/partial) | +0.18 to +0.28 |
| Approving a fraudulent invoice | βˆ’0.35 to βˆ’0.40 |
| Wrong rejection (task1) | βˆ’0.10 |
| Routing to the right team | +0.06 to +0.14 |
| Clean case closure | +0.06 to +0.12 |
| Repeat action | βˆ’0.02 to βˆ’0.05 |
| SLA breach (exceed max_steps) | βˆ’0.10 |

---

## Expected Baseline Scores

These are the approximate scores to expect from each agent class; the LLM column is what `inference.py` should produce with a capable model.

| Task | Difficulty | Random Agent | Rule Agent | LLM Agent (Qwen-72B) |
|---|---|---|---|---|
| task1_price_variance | Easy | ~0.18 | ~0.85 | ~0.80 |
| task2_duplicate_tax | Medium | ~0.12 | ~0.72 | ~0.68 |
| task3_compound_fraud | Hard | ~0.08 | ~0.55 | ~0.45 |

The hard task should be genuinely hard for LLMs β€” a score of 0.45 is expected, not a failure.

---

## Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| `API_BASE_URL` | No | `https://router.huggingface.co/v1` | LLM endpoint |
| `MODEL_NAME` | No | `Qwen/Qwen2.5-72B-Instruct` | Model to use |
| `HF_TOKEN` | Yes | β€” | API key for the LLM endpoint |
| `ANTHROPIC_API_KEY` | No | β€” | Only if using Anthropic models directly |
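
In code, these variables map onto the OpenAI-client setup the spec requires. A minimal sketch of the top of `inference.py`, with fallback defaults mirroring the table:

```python
import os
from openai import OpenAI  # the spec requires the OpenAI client, not the Anthropic SDK

# Defaults mirror the table above; HF_TOKEN has no default and must be set.
client = OpenAI(
    base_url=os.getenv("API_BASE_URL", "https://router.huggingface.co/v1"),
    api_key=os.environ["HF_TOKEN"],
)
MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")  # never hardcode

response = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[{"role": "user", "content": "ping"}],
)
```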

---

## Setup Instructions

### Local Development

```bash
# Clone the repo
git clone https://github.com/YOUR_USERNAME/invoice-exception-handler.git
cd invoice-exception-handler

# Create virtual environment
python -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run the app locally
python app.py
# Visit http://localhost:7860
```

### Run Inference

```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="your-token-here"

python inference.py
```

### Docker

```bash
docker build -t invoice-exception-handler .
docker run -p 7860:7860 \
  -e API_BASE_URL="https://router.huggingface.co/v1" \
  -e MODEL_NAME="Qwen/Qwen2.5-72B-Instruct" \
  -e HF_TOKEN="your-token-here" \
  invoice-exception-handler
```

### HF Spaces Deployment

1. Create a new Space with the Docker SDK (a Gradio-SDK Space would ignore the Dockerfile; the Gradio UI still runs inside the container)
2. Push this repository to it
3. Add secrets in Space settings: `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN`
4. The Space will build and deploy automatically from the Dockerfile

### Validate Submission

```bash
# Install validator
pip install openenv-core

# Validate the spec
openenv validate

# Run the full submission validator script
chmod +x scripts/validate-submission.sh
./scripts/validate-submission.sh https://your-space.hf.space .
```

---

## Common Mistakes to Avoid

1. **Don't misname `inference.py`.** The validator looks for exactly `inference.py` in the repository root.

2. **Don't use the Anthropic SDK in inference.py.** The spec requires the OpenAI client. Use `from openai import OpenAI`.

3. **Don't forget `flush=True` on print statements.** The validator reads stdout line by line; without flushing, log lines may never appear (see the sketch after this list).

4. **Don't let the Gradio UI crash the FastAPI server.** If the UI has an error, it should fail gracefully, not bring down `/reset`.

5. **Don't hardcode the model name.** Always read from `os.getenv("MODEL_NAME")`.

6. **Don't put business logic in models.py.** That file is just data shapes.

7. **Don't mutate documents during a step.** The documents (PO, Invoice, GRN) are fixed for the duration of an episode. Only EpisodeData changes.

8. **Don't forget to test determinism.** Same seed + same actions must = same score. Run the determinism test.

9. **Don't skip the docker build test.** The validator builds your Docker image. If it doesn't build, you're disqualified.

10. **Don't forget the changelog.** Update `documents/CHANGELOG.md` before every push.
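
For mistake 3, the fix is simply to flush every structured log line. A sketch of the `[START][STEP][END]` pattern, where the payload of each line is whatever your `inference.py` already emits:

```python
def log(tag: str, message: str) -> None:
    # flush=True so the validator, which reads stdout line by line,
    # sees each line the moment it is written
    print(f"[{tag}] {message}", flush=True)

log("START", "task1_price_variance")
for step in range(3):
    log("STEP", f"{step}: run_check tolerance_rule")
log("END", "score=0.98")
```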

---

## License

MIT License. See LICENSE file.