# Product Requirements Document
## Invoice Exception Handler — OpenEnv Agent Learning Environment

**Document ID:** PRD-001  
**Version:** 1.0.0  
**Status:** Final  
**Author:** [Your Name]  
**Last Updated:** 2025-01-20  
**Classification:** Internal / Hackathon Submission

---

## Table of Contents

1. [Executive Summary](#1-executive-summary)
2. [Problem Statement](#2-problem-statement)
3. [Product Vision](#3-product-vision)
4. [Stakeholders](#4-stakeholders)
5. [Functional Requirements](#5-functional-requirements)
6. [Non-Functional Requirements](#6-non-functional-requirements)
7. [System Architecture](#7-system-architecture)
8. [Task Specifications](#8-task-specifications)
9. [Reward Design](#9-reward-design)
10. [Evaluation Criteria](#10-evaluation-criteria)
11. [API Contract](#11-api-contract)
12. [File Structure](#12-file-structure)
13. [Out of Scope](#13-out-of-scope)
14. [Change Log](#14-change-log)

---

## 1. Executive Summary

The Invoice Exception Handler is a real-world agent learning environment built for the OpenEnv standard. It simulates the accounts payable (AP) exception-handling workflow that nearly every business runs daily — the process of investigating flagged invoices before payment is approved.

The environment places an AI agent in the role of an AP analyst. The agent receives a document packet (Purchase Order, Invoice, Goods Receipt Note, Supplier Master), reads an exception flag, and must investigate the root cause, make a decision, route the case to the right team, and close it cleanly. Every action has realistic financial and compliance consequences.

The environment ships with three tasks of increasing difficulty — price variance (easy), duplicate with hidden tax error (medium), and compound fraud with four simultaneous signals (hard).

---

## 2. Problem Statement

### 2.1 The Real-World Pain

Every company that buys goods or services from suppliers receives invoices. Typically 5–15% of all invoices have exceptions — discrepancies between what was ordered (PO), what was received (GRN), and what was invoiced. These exceptions are currently handled by accounts payable clerks who manually:

1. Pull the original Purchase Order
2. Compare it line by line against the invoice
3. Check the Goods Receipt Note
4. Run validation checks
5. Query internal teams or the supplier
6. Make a decision (approve / reject / hold / partial approve)
7. Route the case and document everything

At a mid-size company this is 2–4 hours of analyst time per day. At enterprise scale it is entire departments. The AP automation market that has grown up around this pain exceeds $3 billion annually.

### 2.2 The AI Gap

No existing OpenEnv benchmark tests an agent's ability to:
- Reason across multiple documents simultaneously
- Apply business rules with thresholds and exceptions
- Detect fraud signals that require cross-referencing
- Make nuanced decisions (partial approve, hold, escalate)
- Know *not* to contact a supplier via a potentially compromised channel

This gap means agents trained on existing benchmarks cannot be evaluated or trained on one of the most common finance workflows in enterprise software.

### 2.3 What This Environment Fixes

The Invoice Exception Handler provides:
- A clean, typed, deterministic simulation of AP exception handling
- Three tasks that test a progression of reasoning: threshold logic → duplicate detection → multi-signal fraud
- Shaped rewards that signal progress at every step, not just at episode end
- A fully deployable environment that conforms to the OpenEnv spec

---

## 3. Product Vision

> An agent that scores well in this environment is demonstrably better at AP exception handling than the average accounts payable clerk — and is ready to be deployed in real enterprise finance workflows.

The environment is designed so that:
- The reward signal is meaningful enough to actually train agents on, not just evaluate them
- The hard task (compound fraud) remains genuinely difficult for frontier models
- Every score between 0.0 and 1.0 reflects a real quality difference in agent behavior

---

## 4. Stakeholders

| Stakeholder | Role | Interest |
|---|---|---|
| Hackathon Judges (Meta, HF engineers) | Evaluators | Real-world utility, code quality, creativity |
| OpenEnv Automated Validator | Gatekeeper | Spec compliance, deployment health |
| AI Researchers | Primary users post-submission | Training and evaluating AP agents |
| Enterprise Software Companies | Secondary users | Evaluating models for AP automation products |

---

## 5. Functional Requirements

### 5.1 Core Environment API

| Requirement | Priority | Detail |
|---|---|---|
| FR-001 | MUST | `env.reset(task_id)` returns a clean `EnvironmentState` |
| FR-002 | MUST | `env.step(action)` returns `StepResult(observation, reward, done, info)` |
| FR-003 | MUST | `env.state()` returns current state without advancing episode |
| FR-004 | MUST | `env.grade()` returns a score dict with overall score 0.0–1.0 |
| FR-005 | MUST | All models are typed Pydantic v2 with no untyped fields |
| FR-006 | MUST | `openenv.yaml` passes `openenv validate` |
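The return types that FR-001 through FR-005 describe can be pictured with a minimal stdlib sketch. The real project mandates Pydantic v2 models; plain dataclasses stand in here, and any field beyond the table (`step_count`, `history`) is an assumption for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class EnvironmentState:
    """Stand-in for the Pydantic v2 observation model (FR-001, FR-005)."""
    task_id: str
    step_count: int = 0
    history: list = field(default_factory=list)

@dataclass
class StepResult:
    """Return type of env.step() per FR-002: observation, reward, done, info."""
    observation: EnvironmentState
    reward: float
    done: bool
    info: dict

state = EnvironmentState(task_id="task1_price_variance")
result = StepResult(observation=state, reward=0.12, done=False, info={})
```

In the real codebase these would subclass `pydantic.BaseModel` so that FR-005 (no untyped fields) is enforced at construction time.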

### 5.2 HTTP Endpoints (for HF Spaces validator)

| Requirement | Priority | Detail |
|---|---|---|
| FR-007 | MUST | `POST /reset` returns HTTP 200 with JSON observation |
| FR-008 | MUST | `POST /step` returns HTTP 200 with JSON StepResult |
| FR-009 | MUST | `GET /state` returns HTTP 200 with JSON EnvironmentState |
| FR-010 | MUST | `GET /health` returns HTTP 200 `{"status": "ok"}` |
| FR-011 | SHOULD | `GET /` returns HTML documentation page |

### 5.3 Task Requirements

| Requirement | Priority | Detail |
|---|---|---|
| FR-012 | MUST | Minimum 3 tasks with distinct scenarios |
| FR-013 | MUST | Tasks range easy → medium → hard |
| FR-014 | MUST | Each task has a deterministic grader returning 0.0–1.0 |
| FR-015 | MUST | Graders have sub-scores (diagnosis, investigation, decision, routing, closure, efficiency) |
| FR-016 | MUST | Hard task must not be solvable by simple heuristics |

### 5.4 Reward Function

| Requirement | Priority | Detail |
|---|---|---|
| FR-017 | MUST | Reward is shaped across the full trajectory |
| FR-018 | MUST | Dangerous actions (approving fraud) produce large negative rewards |
| FR-019 | MUST | Repeating already-completed actions penalised lightly |
| FR-020 | MUST | Exceeding step budget penalised (SLA concept) |
| FR-021 | SHOULD | Efficiency bonus for completing faster than optimal |

### 5.5 Inference Script

| Requirement | Priority | Detail |
|---|---|---|
| FR-022 | MUST | Script named exactly `inference.py` in root directory |
| FR-023 | MUST | Uses OpenAI client (not Anthropic SDK) |
| FR-024 | MUST | Reads `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN` from environment |
| FR-025 | MUST | Emits `[START]`, `[STEP]`, `[END]` lines to stdout exactly as specified |
| FR-026 | MUST | Completes all 3 tasks in under 20 minutes on 2 vCPU / 8 GB RAM |
| FR-027 | MUST | Produces reproducible scores with the same seed |
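The stdout protocol in FR-025 is easiest to keep correct when the three line formats live in one place. A sketch of such helpers — the JSON payloads after each tag are assumptions here, not the official format, so the real `inference.py` must follow the spec verbatim:

```python
import json
import os

def log_start(task_id: str) -> str:
    # Hypothetical payload; the competition spec defines the real one (FR-025).
    return f"[START] {json.dumps({'task_id': task_id})}"

def log_step(step: int, reward: float, done: bool) -> str:
    return f"[STEP] {json.dumps({'step': step, 'reward': reward, 'done': done})}"

def log_end(task_id: str, score: float) -> str:
    return f"[END] {json.dumps({'task_id': task_id, 'score': score})}"

# FR-024 / NFR-011: configuration comes from the environment, never hardcoded.
API_BASE_URL = os.environ.get("API_BASE_URL", "")
MODEL_NAME = os.environ.get("MODEL_NAME", "")
HF_TOKEN = os.environ.get("HF_TOKEN", "")
```

Centralising the formatting also makes FR-027 (reproducible runs) easier to verify, since logs from two runs can be diffed line by line.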

### 5.6 Deployment

| Requirement | Priority | Detail |
|---|---|---|
| FR-028 | MUST | Dockerfile builds cleanly; container needs no internet access at runtime |
| FR-029 | MUST | Container starts and serves on port 7860 |
| FR-030 | MUST | HF Spaces `POST /reset` returns 200 |
| FR-031 | MUST | README documents setup, action space, observation space, tasks, baseline scores |

---

## 6. Non-Functional Requirements

| ID | Category | Requirement |
|---|---|---|
| NFR-001 | Performance | `reset()` completes in < 100ms |
| NFR-002 | Performance | `step()` completes in < 50ms |
| NFR-003 | Performance | Full 3-task inference run completes in < 20 minutes |
| NFR-004 | Resource | Runs on 2 vCPU, 8 GB RAM — no GPU required |
| NFR-005 | Correctness | Grader output is deterministic — same actions = same score |
| NFR-006 | Correctness | Reward values are deterministic — no randomness in simulation |
| NFR-007 | Code quality | No bare `except:` blocks — all exceptions typed |
| NFR-008 | Code quality | All functions have docstrings |
| NFR-009 | Code quality | Type hints on all function signatures |
| NFR-010 | Portability | Zero OS-specific code — runs on Linux (Docker) |
| NFR-011 | Security | No hardcoded credentials anywhere in code |
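NFR-005 and NFR-006 can be enforced with a replay gate: run the same action sequence on two fresh environments and require identical rewards and grade. A sketch of that gate, with a stub environment standing in for the real `InvoiceExceptionEnv` so the snippet is runnable:

```python
from collections import namedtuple

# Stand-in result type; the real env returns the typed StepResult model.
FakeStepResult = namedtuple("FakeStepResult", "observation reward done info")

class StubEnv:
    """Deterministic stand-in for InvoiceExceptionEnv (illustration only)."""
    def __init__(self):
        self._rewards = []

    def reset(self, task_id):
        self._rewards = []
        return {"task_id": task_id}

    def step(self, action):
        reward = round(0.01 * len(str(action)), 4)
        self._rewards.append(reward)
        return FakeStepResult({}, reward, False, {})

    def grade(self):
        return {"score": min(1.0, round(sum(self._rewards), 4))}

def assert_deterministic(make_env, actions, task_id="task1_price_variance"):
    """NFR-005/006 gate: identical replays must give identical rewards and grade."""
    def rollout():
        env = make_env()
        env.reset(task_id)
        rewards = [env.step(a).reward for a in actions]
        return rewards, env.grade()["score"]
    assert rollout() == rollout(), "environment is not deterministic"

assert_deterministic(StubEnv, ["run_check(po_match)", "close_case(summary)"])
```

Running this gate in CI with the real environment and each task's optimal path catches accidental nondeterminism (e.g. an unseeded `random` call) before submission.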

---

## 7. System Architecture

```
┌───────────────────────────────────────────────────────────┐
│                HF Space / Docker Container                │
│                                                           │
│  ┌──────────────┐    ┌──────────────────────────────────┐ │
│  │  Gradio UI   │    │         FastAPI Server           │ │
│  │  (port 7860) │    │  POST /reset   GET /state        │ │
│  │              │    │  POST /step    GET /health       │ │
│  └──────┬───────┘    └───────────────┬──────────────────┘ │
│         │                            │                    │
│         └────────────┬───────────────┘                    │
│                      │                                    │
│           ┌──────────▼──────────────┐                     │
│           │   InvoiceExceptionEnv   │                     │
│           │  reset() step() state() │                     │
│           │  grade()                │                     │
│           └──────────┬──────────────┘                     │
│                      │                                    │
│           ┌──────────▼──────────────┐                     │
│           │      Task Registry      │                     │
│           │  task1_price_variance   │                     │
│           │  task2_duplicate_tax    │                     │
│           │  task3_compound_fraud   │                     │
│           └─────────────────────────┘                     │
└───────────────────────────────────────────────────────────┘

┌───────────────────────────────────────────────────────────┐
│                   inference.py (agent)                    │
│                                                           │
│  OpenAI Client → env.reset() → loop {                     │
│    action = LLM(observation_json)                         │
│    result = env.step(action)                              │
│    log [STEP]                                             │
│  } → log [END]                                            │
└───────────────────────────────────────────────────────────┘
```

### 7.1 Data Flow

```
Episode start
    │
    ▼
reset(task_id) ──► builds DocumentPacket + EpisodeData ──► EnvironmentState
    │
    ▼
step(action) ──► dispatch to task simulator ──► (reward, info)
    │                                               │
    ▼                                               ▼
EpisodeData updated ◄─────────────────────── append to history
    │
    ▼
new EnvironmentState built ──► StepResult(obs, reward, done, info)
    │
    ▼
grade() ──► EpisodeData ──► grader logic ──► Dict[str, float]
```
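The flow above can be condensed into a skeleton `step()`. Everything beyond the PRD's own names (`simulate`, `build_state`, the `MiniEnv` class, the hardcoded rewards) is assumed here purely for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class StepResult:
    observation: dict
    reward: float
    done: bool
    info: dict

@dataclass
class MiniEnv:
    """Skeleton of the data flow: dispatch -> history -> new state -> StepResult."""
    max_steps: int = 20
    history: list = field(default_factory=list)

    def simulate(self, action):
        # Stand-in for the per-task simulator dispatch (rewards are made up).
        return (0.1 if action["type"] == "run_check" else 0.0), {}

    def step(self, action):
        reward, info = self.simulate(action)              # dispatch to task simulator
        self.history.append((action, reward))             # append to history
        done = action["type"] == "close_case" or len(self.history) >= self.max_steps
        observation = {"steps_taken": len(self.history)}  # new EnvironmentState
        return StepResult(observation, reward, done, info)
```

The real environment would rebuild a full typed `EnvironmentState` each step; the dict here only marks where that happens in the flow.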

---

## 8. Task Specifications

### 8.1 Task 1 β€” Price Variance Exception (Easy)

**Scenario:** Office stationery invoice arrives 3.08% above the PO amount. Company tolerance policy is ±2% for auto-approval. The supplier cites a raw-material price increase that the procurement team approved verbally (and confirmed by email) but never formalised in the PO.

**What makes it easy:** Single root cause, all signals are benign (no fraud), the fix is straightforward (confirm with procurement, approve with PO amendment).

**Optimal path (10 steps):**
```
run_check(po_match)
run_check(tolerance_rule)              ← finds 3.08% > 2%
cross_check(unit_price, invoice, po)   ← finds two mismatched lines
run_check(grn_match)                   ← confirms delivery complete
query_supplier(reason for increase)    ← gets email confirmation
query_internal(procurement, confirm?)  ← procurement confirms verbal approval
apply_rule(tolerance_exception_approval)
make_decision(approve, reason)
route_to(procurement, raise PO amendment)
close_case(summary)
```

**Pitfalls:**
- Rejecting without querying supplier → wrong decision, score capped at ~0.35
- Approving without checking tolerance rule → policy violation, −0.15
- Disabling fraud checks that aren't needed → wasted steps

**Grader weights:**
| Sub-score | Max | Key signals |
|---|---|---|
| Diagnosis | 0.32 | tolerance_rule check, price mismatch found |
| Investigation | 0.30 | supplier queried, procurement confirmed |
| Decision | 0.18 | correct approve decision |
| Routing | 0.12 | PO amendment sent to procurement |
| Closure | 0.08 | case closed with summary |
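Each grader table combines sub-scores the same way: every row contributes its achieved fraction times its max weight, and the overall score is the sum. A sketch of that aggregation using Task 1's weights (the key names and clamping behaviour are assumptions about the grader's internals):

```python
# Task 1 weights from the table above; they sum to 1.00 by design.
WEIGHTS = {
    "diagnosis": 0.32,
    "investigation": 0.30,
    "decision": 0.18,
    "routing": 0.12,
    "closure": 0.08,
}

def grade(fractions: dict) -> dict:
    """fractions: sub-score name -> achieved fraction in [0, 1]."""
    out = {}
    for name, weight in WEIGHTS.items():
        achieved = min(max(fractions.get(name, 0.0), 0.0), 1.0)  # clamp to [0, 1]
        out[f"{name}_score"] = round(weight * achieved, 4)
    out["score"] = round(sum(out.values()), 4)
    return out
```

A perfect episode yields `score == 1.0`, and because each component is clamped, no action sequence can push the overall score outside FR-014's 0.0–1.0 range.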

---

### 8.2 Task 2 β€” Duplicate Invoice with Hidden Tax Error (Medium)

**Scenario:** Logistics supplier submits INV-2024-891. The system flags it as a possible duplicate of INV-2024-819 (already paid). The invoice numbers differ by a digit transposition (8-9-1 vs 8-1-9). However: the original invoice applied 15% GST (wrong rate); the correct rate is 18%. The company therefore underpaid ₹3,240 in tax on the original invoice. The new invoice applies the correct rate. So it is simultaneously a duplicate AND a legitimate correction.

**What makes it medium:** The agent must not just detect the duplicate and reject — it must also detect the tax error in the *original* paid invoice and partially approve the correction delta (₹3,240). A simple "reject all duplicates" rule misses this and loses significant score.

**Optimal path (11 steps):**
```
run_check(duplicate_detection)         ← finds INV-2024-819
inspect_field(invoice, invoice_number) ← spots digit transposition
run_check(tax_calculation_verify)      ← finds 15% vs 18% on original
cross_check(tax_amount, invoice, payment_history) ← confirms ₹3,240 delta
query_internal(finance, confirm underpayment?)
query_supplier(clarify relationship between invoices)
apply_rule(partial_approval)
apply_rule(credit_note_request)
make_decision(partial_approve, reason)
route_to(finance, process ₹3,240 correction)
close_case(summary)
```

**Pitfalls:**
- Full rejection (catches duplicate, misses correction): score ~0.35
- Full approval (pays full duplicate): score ~0.0, with a −0.15 reward penalty
- Partial approve without credit note: score ~0.60

**Grader weights:**
| Sub-score | Max | Key signals |
|---|---|---|
| Diagnosis | 0.30 | duplicate found, tax error found |
| Investigation | 0.32 | finance queried, supplier queried, rules applied |
| Decision | 0.20 | partial_approve correct |
| Routing | 0.08 | finance routed for correction |
| Closure | 0.06 | case closed |

---

### 8.3 Task 3 β€” Compound Fraud Signals (Hard)

**Scenario:** IT equipment supplier submits a ₹8,47,500 invoice for 15 laptops. The system flags a bank account change. But there are four simultaneous fraud signals:

1. **Bank account fraud:** Invoice has a different bank account than supplier master. The change request came from a lookalike email domain (techcore-solutions.com vs registered techcore-solutions.in) — a classic Business Email Compromise (BEC) attack.
2. **GSTIN fraud:** The GST number on the invoice belongs to a *different company* — "TechCore Trading Pvt Ltd" in Delhi, not "TechCore Solutions."
3. **Quantity mismatch:** Invoice claims 15 laptops. GRN confirms only 13 received. 2 are still in transit.
4. **Price inflation:** ₹56,500/unit invoiced vs ₹52,000/unit on PO — 8.65% above PO with no approved revision.

**Critical trap:** If the agent queries the supplier via email, it contacts the fraudster, who will try to convince the agent to approve. The agent must use the phone/registered channel instead.

**What makes it hard:** The agent must hold four independent fraud signals simultaneously, not stop after finding one, follow the correct communication-channel policy, and route to both legal and security (not just finance).

**Optimal path (17 steps):**
```
inspect_field(invoice, bank_account)            ← sees mismatch
cross_check(bank_account, invoice, supplier_master)
run_check(bank_account_verification)            ← finds lookalike domain
run_check(email_domain_verification)
inspect_field(invoice, supplier_gstin)
run_check(gst_verification)                     ← finds GST belongs to different entity
cross_check(gstin, invoice, supplier_master)
inspect_field(grn, items_received)
run_check(grn_match)                            ← 13 vs 15
run_check(price_check)                          ← 8.65% above PO
query_supplier(confirm details, channel=phone)  ← supplier confirms fraud
query_internal(security, investigate BEC)
apply_rule(fraud_hold)
make_decision(reject, all fraud signals documented)
route_to(legal, initiate supplier audit)
route_to(security, BEC investigation)
close_case(fraud report summary)
```

**Critical pitfall — contacting via email:** −0.15 reward, and the agent receives the fraudster's response trying to get payment approved. The grader penalises this heavily.

**Grader weights:**
| Sub-score | Max | Key signals |
|---|---|---|
| Diagnosis | 0.50 | bank fraud, GST fraud, quantity mismatch, domain lookalike, price inflation |
| Investigation | 0.14 | phone contact (not email), security queried, legal queried |
| Decision | 0.18 | reject with all signals documented |
| Routing | 0.12 | legal + security routed |
| Closure | 0.06 | case closed with fraud report |

**Scoring thresholds:**
- Find 1 signal: ~0.20
- Find 2 signals: ~0.40
- Find 3 signals: ~0.60
- Find all 4 + correct routing: ~0.90+

---

## 9. Reward Design

### 9.1 Philosophy

The reward function is designed around three principles:

**Principle 1: Every informative action gets signal.** Agents should learn that investigating is always better than guessing. Each relevant inspection, check, or query returns a positive reward proportional to how diagnostic that action is.

**Principle 2: Dangerous actions get crushed.** Approving a fraudulent invoice, disabling security controls, or contacting a supplier via a compromised channel are not mistakes — they are catastrophic errors. These must receive large negative rewards so agents learn to avoid them unconditionally.

**Principle 3: The grader is the ground truth, the shaped reward is the training signal.** The episode reward is shaped to help agents learn. The grader score at the end is what actually measures quality.

### 9.2 Reward Table

| Action | Reward Range | Notes |
|---|---|---|
| `inspect_field` (relevant) | +0.01 to +0.14 | Higher for fields that reveal anomalies |
| `inspect_field` (irrelevant) | +0.01 | Still small positive — exploration is fine |
| `cross_check` (finds mismatch) | +0.12 to +0.15 | Diagnosis reward |
| `cross_check` (no mismatch) | +0.02 | Confirms a clean field |
| `run_check` (finds issue) | +0.08 to +0.18 | Higher for more diagnostic checks |
| `run_check` (clean) | +0.01 to +0.06 | Clean checks still confirm facts |
| `query_supplier` (phone) | +0.10 to +0.15 | Correct channel |
| `query_supplier` (email, fraud task) | −0.15 | Contacts fraudster |
| `query_internal` (key dept) | +0.04 to +0.12 | Higher for departments that add critical info |
| `apply_rule` (correct rule) | +0.08 to +0.12 | Applying the right policy pathway |
| `apply_rule` (wrong rule) | −0.05 to −0.10 | Misapplying policy |
| `make_decision` (correct) | +0.18 to +0.28 | Correct decision based on evidence |
| `make_decision` (wrong) | −0.10 to −0.40 | Severity scales with how wrong |
| `route_to` (correct team) | +0.06 to +0.14 | Right escalation path |
| `close_case` (complete) | +0.06 to +0.12 | Depends on decision quality |
| Repeat action | −0.02 to −0.05 | Light penalty, not catastrophic |
| SLA breach (exceed max steps) | −0.10 | One-time penalty at end |
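The table implies a simple dispatch from action outcome to reward. An illustrative sketch, where the specific values within each range (and the assumption that the email penalty applies only on the fraud task) are choices made here for the example, not the environment's actual constants:

```python
def step_reward(action_type: str, found_issue: bool = False,
                is_repeat: bool = False, channel: str = None,
                fraud_task: bool = False) -> float:
    """Shaped-reward sketch mirroring the table; exact values are assumed."""
    if is_repeat:
        return -0.03  # light repeat penalty (FR-019)
    if action_type == "query_supplier" and channel == "email" and fraud_task:
        return -0.15  # compromised channel on the fraud task
    table = {
        # action_type: (reward when an issue is found, reward when clean)
        "inspect_field": (0.07, 0.01),
        "cross_check":   (0.13, 0.02),
        "run_check":     (0.13, 0.03),
    }
    if action_type in table:
        hit, clean = table[action_type]
        return hit if found_issue else clean
    return 0.0
```

The key property to preserve in any real implementation is the ordering: diagnostic hits pay more than clean checks, clean checks pay more than nothing, and catastrophic actions pay heavily negative.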

### 9.3 Episode Score vs Cumulative Reward

These are different numbers:
- **Cumulative reward** is the sum of step rewards. It is used as a training signal.
- **Episode score** (from `grade()`) is the holistic quality assessment. It is what the hackathon evaluates.

Agents should be optimised on the grade score, not the cumulative reward alone.

---

## 10. Evaluation Criteria

### 10.1 Hackathon Scoring

| Criterion | Weight | What judges look for |
|---|---|---|
| Real-world utility | 30% | Would an enterprise actually use this? Does it model the task faithfully? |
| Task & grader quality | 25% | Clear objectives, accurate grading, genuine difficulty progression, frontier models challenged |
| Environment design | 20% | Clean state management, good action/observation spaces, shaped reward, sensible episode boundaries |
| Code quality & spec compliance | 15% | OpenEnv spec passes, Dockerfile works, baseline reproduces, typed models |
| Creativity & novelty | 10% | Novel domain, interesting mechanics, original reward design |

### 10.2 Automated Gates (must all pass)

1. HF Space deploys — `POST /reset` returns 200
2. `openenv validate` passes
3. `docker build` succeeds
4. `python inference.py` runs without error, produces scores
5. All 3 tasks enumerated, grader scores verified in [0.0, 1.0]

### 10.3 Phase 2 β€” Agentic Evaluation

The hackathon will run a standard open LLM agent (e.g. Nemotron 3 Super) against the environment. The environment must:
- Not be trivially solvable by a greedy agent
- Produce score variance across tasks (not all the same)
- Penalise clearly suboptimal behaviour

### 10.4 Disqualifiers

- Environment does not deploy or respond to `/reset`
- Graders that always return the same score regardless of actions
- `inference.py` not in root, or not using OpenAI client
- No baseline scores produced
- Plagiarised environment

---

## 11. API Contract

### 11.1 Environment Python API

```python
env = InvoiceExceptionEnv(seed=42)

# Reset — returns EnvironmentState
obs: EnvironmentState = env.reset("task1_price_variance")

# Step — returns StepResult
result: StepResult = env.step(Action.run_check("tolerance_rule"))
# result.observation  →  EnvironmentState
# result.reward       →  float
# result.done         →  bool
# result.info         →  dict

# State — non-destructive peek
obs: EnvironmentState = env.state()

# Grade — run grader on episode
scores: dict = env.grade()
# scores["score"]           → 0.0–1.0 overall
# scores["diagnosis_score"] → float
# scores["decision_score"]  → float
# ...
```

### 11.2 HTTP API

```
POST /reset
Body: {"task_id": "task1_price_variance"}   (optional — random if omitted)
Response: 200 EnvironmentState JSON

POST /step
Body: {"type": "run_check", "params": {"check_name": "tolerance_rule"}}
Response: 200 StepResult JSON

GET /state
Response: 200 EnvironmentState JSON

POST /grade
Response: 200 {"score": 0.85, "diagnosis_score": ...}

GET /tasks
Response: 200 ["task1_price_variance", "task2_duplicate_tax", "task3_compound_fraud"]

GET /health
Response: 200 {"status": "ok", "version": "1.0.0"}
```

### 11.3 Action Schema

```json
{
  "type": "run_check",
  "params": {"check_name": "tolerance_rule"}
}

{
  "type": "inspect_field",
  "params": {"document": "invoice", "field": "bank_account"}
}

{
  "type": "cross_check",
  "params": {"field": "unit_price", "doc_a": "invoice", "doc_b": "po"}
}

{
  "type": "query_supplier",
  "params": {"question": "Why does your bank account differ?", "channel": "phone"}
}

{
  "type": "query_internal",
  "params": {"department": "procurement", "question": "Did you approve this price?"}
}

{
  "type": "apply_rule",
  "params": {"rule_id": "tolerance_exception_approval"}
}

{
  "type": "make_decision",
  "params": {"decision": "approve", "reason": "Verbal approval confirmed by procurement."}
}

{
  "type": "route_to",
  "params": {"team": "procurement", "notes": "Please raise PO amendment for the price variance."}
}

{
  "type": "close_case",
  "params": {"summary": "Invoice approved. PO amendment requested. Case closed."}
}
```
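Since every action is a `type` plus a `params` dict, a validator only needs a table of required keys per type. A stdlib sketch derived from the examples above (the real environment validates with typed Pydantic v2 models; this dict-based check is illustrative only):

```python
# Required params per action type, read off the schema examples above.
ACTION_PARAMS = {
    "run_check": {"check_name"},
    "inspect_field": {"document", "field"},
    "cross_check": {"field", "doc_a", "doc_b"},
    "query_supplier": {"question", "channel"},
    "query_internal": {"department", "question"},
    "apply_rule": {"rule_id"},
    "make_decision": {"decision", "reason"},
    "route_to": {"team", "notes"},
    "close_case": {"summary"},
}

def validate_action(action: dict) -> None:
    """Raise ValueError if the action does not match the schema."""
    action_type = action.get("type")
    if action_type not in ACTION_PARAMS:
        raise ValueError(f"unknown action type: {action_type!r}")
    missing = ACTION_PARAMS[action_type] - set(action.get("params", {}))
    if missing:
        raise ValueError(f"{action_type} missing params: {sorted(missing)}")
```

In the Pydantic version this table would become a discriminated union on `type`, which gives the same rejection behaviour plus per-field type checking for free.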

---

## 12. File Structure

```
invoice-exception-handler/
│
├── README.md                    # Full setup + usage guide
├── openenv.yaml                 # OpenEnv spec (must pass openenv validate)
├── Dockerfile                   # Single-stage Python 3.11-slim build
├── requirements.txt             # Pinned dependencies
├── inference.py                 # Competition inference script (MUST be here)
├── app.py                       # Gradio + FastAPI entrypoint for HF Spaces
│
├── env/
│   ├── __init__.py
│   ├── models.py                # All Pydantic typed models
│   ├── environment.py           # InvoiceExceptionEnv class
│   └── tasks.py                 # 3 task classes + graders + EpisodeData
│
└── documents/
    ├── PRD-001-product-requirements.md    # This document
    ├── CHANGELOG.md                       # Every code change recorded
    ├── ARCHITECTURE.md                    # System diagram + decisions
    └── BASELINE-SCORES.md                 # Reproducible benchmark results
```

---

## 13. Out of Scope

The following are explicitly not part of v1.0:

- Real database connectivity (the environment is fully simulated)
- Multi-agent scenarios (one agent per episode)
- Partial observability (agent sees all documents from the start)
- User interface for human play (nice-to-have but not required for submission)
- Real supplier APIs (simulation only)
- Currency other than INR (can be extended in v1.1)
- Tasks beyond 3 (can be extended)

---

## 14. Change Log

| Version | Date | Author | Change |
|---|---|---|---|
| 0.1.0 | 2025-01-18 | [Author] | Initial draft — problem definition and task sketches |
| 0.2.0 | 2025-01-19 | [Author] | Added reward design section, API contract, file structure |
| 1.0.0 | 2025-01-20 | [Author] | Final version — all sections complete, ready for implementation |