Product Requirements Document
Invoice Exception Handler – OpenEnv Agent Learning Environment
Document ID: PRD-001
Version: 1.0.0
Status: Final
Author: [Your Name]
Last Updated: 2025-01-20
Classification: Internal / Hackathon Submission
Table of Contents
- Executive Summary
- Problem Statement
- Product Vision
- Stakeholders
- Functional Requirements
- Non-Functional Requirements
- System Architecture
- Task Specifications
- Reward Design
- Evaluation Criteria
- API Contract
- File Structure
- Out of Scope
- Change Log
1. Executive Summary
The Invoice Exception Handler is a real-world agent learning environment built for the OpenEnv standard. It simulates the accounts payable (AP) exception-handling workflow that nearly every business runs daily: the process of investigating flagged invoices before payment is approved.
The environment places an AI agent in the role of an AP analyst. The agent receives a document packet (Purchase Order, Invoice, Goods Receipt Note, Supplier Master), reads an exception flag, and must investigate the root cause, make a decision, route the case to the right team, and close it cleanly. Every action has realistic financial and compliance consequences.
The environment ships with three tasks of increasing difficulty: price variance (easy), duplicate with a hidden tax error (medium), and compound fraud with four simultaneous signals (hard).
2. Problem Statement
2.1 The Real-World Pain
Every company that buys goods or services from suppliers receives invoices. Typically 5–15% of all invoices have exceptions: discrepancies between what was ordered (PO), what was received (GRN), and what was invoiced. These exceptions are currently handled by accounts payable clerks who manually:
- Pull the original Purchase Order
- Compare it line by line against the invoice
- Check the Goods Receipt Note
- Run validation checks
- Query internal teams or the supplier
- Make a decision (approve / reject / hold / partial approve)
- Route the case and document everything
At a mid-size company this is 2–4 hours of analyst time per day; at enterprise scale it is entire departments. The AP automation market addressing this problem exceeds $3 billion annually.
2.2 The AI Gap
No existing OpenEnv benchmark tests an agent's ability to:
- Reason across multiple documents simultaneously
- Apply business rules with thresholds and exceptions
- Detect fraud signals that require cross-referencing
- Make nuanced decisions (partial approve, hold, escalate)
- Know not to contact a supplier via a potentially compromised channel
This gap means agents trained on existing benchmarks cannot be evaluated or trained on one of the most common finance workflows in enterprise software.
2.3 What This Environment Fixes
The Invoice Exception Handler provides:
- A clean, typed, deterministic simulation of AP exception handling
- Three tasks that test a progression of reasoning: threshold logic → duplicate detection → multi-signal fraud
- Shaped rewards that signal progress at every step, not just at episode end
- A fully deployable environment that conforms to the OpenEnv spec
3. Product Vision
An agent that scores well in this environment is demonstrably better at AP exception handling than the average accounts payable clerk, and is ready to be deployed in real enterprise finance workflows.
The environment is designed so that:
- The reward signal is meaningful enough to actually train agents on, not just evaluate them
- The hard task (compound fraud) remains genuinely difficult for frontier models
- Every score between 0.0 and 1.0 reflects a real quality difference in agent behavior
4. Stakeholders
| Stakeholder | Role | Interest |
|---|---|---|
| Hackathon Judges (Meta, HF engineers) | Evaluators | Real-world utility, code quality, creativity |
| OpenEnv Automated Validator | Gatekeeper | Spec compliance, deployment health |
| AI Researchers | Primary users post-submission | Training and evaluating AP agents |
| Enterprise Software Companies | Secondary users | Evaluating models for AP automation products |
5. Functional Requirements
5.1 Core Environment API
| Requirement | Priority | Detail |
|---|---|---|
| FR-001 | MUST | env.reset(task_id) returns a clean EnvironmentState |
| FR-002 | MUST | env.step(action) returns StepResult(observation, reward, done, info) |
| FR-003 | MUST | env.state() returns current state without advancing episode |
| FR-004 | MUST | env.grade() returns a score dict with overall score 0.0–1.0 |
| FR-005 | MUST | All models are typed Pydantic v2 with no untyped fields |
| FR-006 | MUST | openenv.yaml passes openenv validate |
5.2 HTTP Endpoints (for HF Spaces validator)
| Requirement | Priority | Detail |
|---|---|---|
| FR-007 | MUST | POST /reset returns HTTP 200 with JSON observation |
| FR-008 | MUST | POST /step returns HTTP 200 with JSON StepResult |
| FR-009 | MUST | GET /state returns HTTP 200 with JSON EnvironmentState |
| FR-010 | MUST | GET /health returns HTTP 200 {"status": "ok"} |
| FR-011 | SHOULD | GET / returns HTML documentation page |
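The /health contract (FR-010) is simple enough to sketch end to end. The real server uses FastAPI behind Gradio (see the architecture section); the sketch below implements only GET /health with the standard library so it runs with no dependencies. The handler name and `serve` helper are illustrative, not part of the actual codebase.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Minimal handler implementing only the GET /health contract (FR-010)."""

    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):
        # Silence the default per-request stderr logging.
        pass


def serve(port: int = 7860) -> HTTPServer:
    """Bind the server; the default port 7860 matches FR-029."""
    return HTTPServer(("127.0.0.1", port), HealthHandler)
```

The validator's health probe then reduces to `curl http://localhost:7860/health` returning `{"status": "ok"}`.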
5.3 Task Requirements
| Requirement | Priority | Detail |
|---|---|---|
| FR-012 | MUST | Minimum 3 tasks with distinct scenarios |
| FR-013 | MUST | Tasks range easy → medium → hard |
| FR-014 | MUST | Each task has a deterministic grader returning 0.0–1.0 |
| FR-015 | MUST | Graders have sub-scores (diagnosis, investigation, decision, routing, closure, efficiency) |
| FR-016 | MUST | Hard task must not be solvable by simple heuristics |
5.4 Reward Function
| Requirement | Priority | Detail |
|---|---|---|
| FR-017 | MUST | Reward is shaped across the full trajectory |
| FR-018 | MUST | Dangerous actions (approving fraud) produce large negative rewards |
| FR-019 | MUST | Repeating already-completed actions penalised lightly |
| FR-020 | MUST | Exceeding step budget penalised (SLA concept) |
| FR-021 | SHOULD | Efficiency bonus for completing faster than optimal |
5.5 Inference Script
| Requirement | Priority | Detail |
|---|---|---|
| FR-022 | MUST | Script named exactly inference.py in root directory |
| FR-023 | MUST | Uses OpenAI client (not Anthropic SDK) |
| FR-024 | MUST | Reads API_BASE_URL, MODEL_NAME, HF_TOKEN from environment |
| FR-025 | MUST | Emits [START], [STEP], [END] lines to stdout exactly as spec |
| FR-026 | MUST | Completes all 3 tasks in under 20 minutes on 2 vCPU / 8 GB RAM |
| FR-027 | MUST | Produces reproducible scores with the same seed |
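The control loop FR-022 through FR-027 describe can be sketched as follows. This is a shape sketch only: `choose_action` stands in for the OpenAI chat-completion call required by FR-023 (a real implementation reads API_BASE_URL, MODEL_NAME and HF_TOKEN per FR-024), and the exact [START]/[STEP]/[END] line formats shown here are placeholders — the spec's required format takes precedence.

```python
import json


def choose_action(observation: dict) -> dict:
    """Placeholder for the OpenAI chat-completion call (FR-023).

    A real implementation prompts the model with the observation JSON
    and parses an action payload out of the reply.
    """
    return {"type": "run_check", "params": {"check_name": "po_match"}}


def run_task(env, task_id: str, max_steps: int = 25) -> float:
    """Drive one episode and emit the log lines FR-025 requires.

    The log-line formats below are illustrative, not the spec's exact strings.
    """
    print(f"[START] {task_id}")
    obs = env.reset(task_id)
    for step in range(max_steps):
        action = choose_action(obs)
        result = env.step(action)
        print(f"[STEP] {step} {json.dumps(action)} reward={result.reward:+.3f}")
        obs = result.observation
        if result.done:
            break
    score = env.grade()["score"]
    print(f"[END] {task_id} score={score:.3f}")
    return score
```

Running this once per task, with a fixed seed in the env constructor, gives the reproducible scores FR-027 demands.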
5.6 Deployment
| Requirement | Priority | Detail |
|---|---|---|
| FR-028 | MUST | Dockerfile builds cleanly without internet access at run time |
| FR-029 | MUST | Container starts and serves on port 7860 |
| FR-030 | MUST | HF Spaces POST /reset returns 200 |
| FR-031 | MUST | README documents setup, action space, observation space, tasks, baseline scores |
6. Non-Functional Requirements
| ID | Category | Requirement |
|---|---|---|
| NFR-001 | Performance | reset() completes in < 100ms |
| NFR-002 | Performance | step() completes in < 50ms |
| NFR-003 | Performance | Full 3-task inference run completes in < 20 minutes |
| NFR-004 | Resource | Runs on 2 vCPU, 8 GB RAM; no GPU required |
| NFR-005 | Correctness | Grader output is deterministic: same actions = same score |
| NFR-006 | Correctness | Reward values are deterministic: no randomness in simulation |
| NFR-007 | Code quality | No bare except: blocks; all exceptions typed |
| NFR-008 | Code quality | All functions have docstrings |
| NFR-009 | Code quality | Type hints on all function signatures |
| NFR-010 | Portability | Zero OS-specific code; runs on Linux (Docker) |
| NFR-011 | Security | No hardcoded credentials anywhere in code |
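NFR-005/006 (determinism) are cheap to enforce with a regression test: replay the same action sequence twice and require byte-identical rewards and totals. The sketch below shows the shape of such a test against a stand-in environment; a real test would construct `InvoiceExceptionEnv(seed=...)` instead of the stub.

```python
# Determinism regression check (NFR-005/006): identical action sequences
# must yield identical rewards. _StubEnv is a hypothetical stand-in for
# InvoiceExceptionEnv; its reward values are arbitrary.

class _StubEnv:
    REWARDS = {"run_check": 0.12, "inspect_field": 0.05, "close_case": 0.08}

    def __init__(self, seed: int):
        self.seed = seed      # the seed fixes any scenario generation
        self.total = 0.0

    def step(self, action_type: str) -> float:
        # Fixed lookup: no randomness anywhere in the simulation.
        r = self.REWARDS.get(action_type, 0.01)
        self.total += r
        return r


def replay(seed: int, actions: list[str]) -> tuple[list[float], float]:
    """Run one episode from scratch and return (per-step rewards, total)."""
    env = _StubEnv(seed)
    rewards = [env.step(a) for a in actions]
    return rewards, env.total


actions = ["run_check", "inspect_field", "close_case"]
assert replay(42, actions) == replay(42, actions)  # same seed + actions -> same result
```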
7. System Architecture
```
+----------------------------------------------------------------+
|                  HF Space / Docker Container                   |
|                                                                |
|  +---------------+      +----------------------------------+   |
|  |   Gradio UI   |      |  FastAPI Server                  |   |
|  |  (port 7860)  |      |  POST /reset     GET /state      |   |
|  |               |      |  POST /step      GET /health     |   |
|  +-------+-------+      +---------------+------------------+   |
|          |                              |                      |
|          +------------+-----------------+                      |
|                       |                                        |
|          +------------v---------------+                        |
|          |    InvoiceExceptionEnv     |                        |
|          |  reset() step() state()    |                        |
|          |  grade()                   |                        |
|          +------------+---------------+                        |
|                       |                                        |
|          +------------v---------------+                        |
|          |       Task Registry        |                        |
|          |  task1_price_variance      |                        |
|          |  task2_duplicate_tax       |                        |
|          |  task3_compound_fraud      |                        |
|          +----------------------------+                        |
+----------------------------------------------------------------+

+----------------------------------------------------------------+
|                     inference.py (agent)                       |
|                                                                |
|  OpenAI Client -> env.reset() -> loop {                        |
|      action = LLM(observation_json)                            |
|      result = env.step(action)                                 |
|      log [STEP]                                                |
|  } -> log [END]                                                |
+----------------------------------------------------------------+
```
7.1 Data Flow
```
Episode start
      |
      v
reset(task_id) --> builds DocumentPacket + EpisodeData --> EnvironmentState
      |
      v
step(action) --> dispatch to task simulator --> (reward, info)
      |                                            |
      v                                            v
EpisodeData updated ------------------------ append to history
      |
      v
new EnvironmentState built --> StepResult(obs, reward, done, info)
      |
      v
grade() --> EpisodeData --> grader logic --> Dict[str, float]
```
8. Task Specifications
8.1 Task 1 β Price Variance Exception (Easy)
Scenario: An office-stationery invoice arrives 3.08% above the PO amount. Company tolerance policy is ±2% for auto-approval. The supplier cites a verbal approval from the procurement team, captured only in an email explaining a raw-material price increase that was never formalised in the PO.
What makes it easy: Single root cause, all signals are benign (no fraud), the fix is straightforward (confirm with procurement, approve with PO amendment).
Optimal path (10 steps):
run_check(po_match)
run_check(tolerance_rule) → finds 3.08% > 2%
cross_check(unit_price, invoice, po) → finds two mismatched lines
run_check(grn_match) → confirms delivery complete
query_supplier(reason for increase) → gets email confirmation
query_internal(procurement, confirm?) → procurement confirms verbal approval
apply_rule(tolerance_exception_approval)
make_decision(approve, reason)
route_to(procurement, raise PO amendment)
close_case(summary)
Pitfalls:
- Rejecting without querying the supplier → wrong decision, score capped at ~0.35
- Approving without checking the tolerance rule → policy violation, −0.15
- Running fraud checks that aren't needed → wasted steps
Grader weights:
| Sub-score | Max | Key signals |
|---|---|---|
| Diagnosis | 0.32 | tolerance_rule check, price mismatch found |
| Investigation | 0.30 | supplier queried, procurement confirmed |
| Decision | 0.18 | correct approve decision |
| Routing | 0.12 | PO amendment sent to procurement |
| Closure | 0.08 | case closed with summary |
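Under the action schema in section 11.3, the optimal path above serialises to a payload list like the one below. The free-text params (questions, reasons, notes) are illustrative placeholders, not the grader's expected strings:

```python
# Task 1 optimal path expressed as action payloads (schema from section 11.3).
# Question/reason/note strings are illustrative, not graded verbatim.
OPTIMAL_PATH_TASK1 = [
    {"type": "run_check",      "params": {"check_name": "po_match"}},
    {"type": "run_check",      "params": {"check_name": "tolerance_rule"}},
    {"type": "cross_check",    "params": {"field": "unit_price", "doc_a": "invoice", "doc_b": "po"}},
    {"type": "run_check",      "params": {"check_name": "grn_match"}},
    {"type": "query_supplier", "params": {"question": "What is the reason for the price increase?", "channel": "email"}},
    {"type": "query_internal", "params": {"department": "procurement", "question": "Did you approve this price?"}},
    {"type": "apply_rule",     "params": {"rule_id": "tolerance_exception_approval"}},
    {"type": "make_decision",  "params": {"decision": "approve", "reason": "Verbal approval confirmed by procurement."}},
    {"type": "route_to",       "params": {"team": "procurement", "notes": "Raise PO amendment for the price variance."}},
    {"type": "close_case",     "params": {"summary": "Approved with PO amendment requested."}},
]
```

Feeding each entry to env.step() in order is the reference trajectory the Task 1 grader scores highest.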
8.2 Task 2 β Duplicate Invoice with Hidden Tax Error (Medium)
Scenario: A logistics supplier submits INV-2024-891. The system flags it as a possible duplicate of INV-2024-819 (already paid). The invoice numbers differ by a digit transposition (8-9-1 vs 8-1-9). However: the original invoice applied 15% GST (the wrong rate); the correct rate is 18%. The company overpaid ₹3,240 in tax on the original invoice. The new invoice has the correct rate. So it is simultaneously a duplicate AND a legitimate correction.
What makes it medium: The agent must not just detect the duplicate and reject; it must also detect the tax error in the original paid invoice and partially approve the correction delta (₹3,240). A simple "reject all duplicates" rule misses this and loses significant score.
Optimal path (11 steps):
run_check(duplicate_detection) → finds INV-2024-819
inspect_field(invoice, invoice_number) → spots digit transposition
run_check(tax_calculation_verify) → finds 15% vs 18% on original
cross_check(tax_amount, invoice, payment_history) → confirms ₹3,240 delta
query_internal(finance, confirm overpayment?)
query_supplier(clarify relationship between invoices)
apply_rule(partial_approval)
apply_rule(credit_note_request)
make_decision(partial_approve, reason)
route_to(finance, process ₹3,240 correction)
close_case(summary)
Pitfalls:
- Full rejection (catches duplicate, misses correction): score ~0.35
- Full approval (pays full duplicate): score −0.15
- Partial approve without credit note: score ~0.60
Grader weights:
| Sub-score | Max | Key signals |
|---|---|---|
| Diagnosis | 0.30 | duplicate found, tax error found |
| Investigation | 0.32 | finance queried, supplier queried, rules applied |
| Decision | 0.20 | partial_approve correct |
| Routing | 0.08 | finance routed for correction |
| Closure | 0.06 | case closed |
8.3 Task 3 β Compound Fraud Signals (Hard)
Scenario: An IT equipment supplier submits a ₹8,47,500 invoice for 15 laptops. The system flags a bank account change, but there are four simultaneous fraud signals:
- Bank account fraud: The invoice lists a different bank account than the supplier master. The change request came from a lookalike email domain (techcore-solutions.com vs the registered techcore-solutions.in), a classic Business Email Compromise (BEC) attack.
- GSTIN fraud: The GST number on the invoice belongs to a different company β "TechCore Trading Pvt Ltd" in Delhi, not "TechCore Solutions."
- Quantity mismatch: Invoice claims 15 laptops. GRN confirms only 13 received. 2 are still in transit.
- Price inflation: ₹56,500/unit invoiced vs ₹52,000/unit on PO, 8.65% above PO with no approved revision.
Critical trap: If the agent queries the supplier via email, it contacts the fraudster, who will try to convince it to approve. The agent must use the phone/registered channel.
What makes it hard: The agent must hold four independent fraud signals in mind simultaneously, not stop after finding one, apply the correct communication-channel policy, and route to both legal and security (not just finance).
Optimal path (17 steps):
inspect_field(invoice, bank_account) → sees mismatch
cross_check(bank_account, invoice, supplier_master)
run_check(bank_account_verification) → finds lookalike domain
run_check(email_domain_verification)
inspect_field(invoice, supplier_gstin)
run_check(gst_verification) → finds GST belongs to a different entity
cross_check(gstin, invoice, supplier_master)
inspect_field(grn, items_received)
run_check(grn_match) → 13 vs 15
run_check(price_check) → 8.65% above PO
query_supplier(confirm details, channel=phone) → supplier confirms fraud
query_internal(security, investigate BEC)
apply_rule(fraud_hold)
make_decision(reject, all fraud signals documented)
route_to(legal, initiate supplier audit)
route_to(security, BEC investigation)
close_case(fraud report summary)
Critical pitfall (contacting via email): −0.15 reward, and the agent receives the fraudster's response trying to get the payment approved. Scoring penalises this heavily.
Grader weights:
| Sub-score | Max | Key signals |
|---|---|---|
| Diagnosis | 0.50 | bank fraud, GST fraud, quantity mismatch, domain lookalike, price inflation |
| Investigation | 0.20 | phone contact (not email), security queried, legal queried |
| Decision | 0.20 | reject with all signals documented |
| Routing | 0.20 | legal + security routed |
| Closure | 0.06 | case closed with fraud report |
Scoring thresholds:
- Find 1 signal: ~0.20
- Find 2 signals: ~0.40
- Find 3 signals: ~0.60
- Find all 4 + correct routing: ~0.90+
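One plausible reading of these thresholds (an assumption, not the grader's actual code) is a diagnosis component that scales linearly with the number of fraud signals found, capped at the 0.50 diagnosis weight, with investigation/decision/routing credit added on top:

```python
def diagnosis_component(signals_found: int, total_signals: int = 4,
                        weight: float = 0.50) -> float:
    """Hypothetical diagnosis sub-score: linear in signals found,
    capped at the 0.50 diagnosis weight from the Task 3 grader table."""
    found = max(0, min(signals_found, total_signals))
    return weight * found / total_signals


# All four signals max out the diagnosis weight; the remaining ~0.40+
# in the published thresholds comes from the other sub-scores.
assert diagnosis_component(4) == 0.50
assert diagnosis_component(2) == 0.25
```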
9. Reward Design
9.1 Philosophy
The reward function is designed around three principles:
Principle 1: Every informative action gets signal. Agents should learn that investigating is always better than guessing. Each relevant inspection, check, or query returns a positive reward proportional to how diagnostic that action is.
Principle 2: Dangerous actions get crushed. Approving a fraudulent invoice, disabling security controls, or contacting a supplier via a compromised channel are not ordinary mistakes; they are catastrophic errors. These must receive large negative rewards so agents learn to avoid them unconditionally.
Principle 3: The grader is the ground truth, the shaped reward is the training signal. The episode reward is shaped to help agents learn. The grader score at the end is what actually measures quality.
9.2 Reward Table
| Action | Reward Range | Notes |
|---|---|---|
| inspect_field (relevant) | +0.01 to +0.14 | Higher for fields that reveal anomalies |
| inspect_field (irrelevant) | +0.01 | Still small positive; exploration is fine |
| cross_check (finds mismatch) | +0.12 to +0.15 | Diagnosis reward |
| cross_check (no mismatch) | +0.02 | Confirms a clean field |
| run_check (finds issue) | +0.08 to +0.18 | Higher for more diagnostic checks |
| run_check (clean) | +0.01 to +0.06 | Clean checks still confirm facts |
| query_supplier (phone) | +0.10 to +0.15 | Correct channel |
| query_supplier (email, fraud task) | −0.15 | Contacts fraudster |
| query_internal (key dept) | +0.04 to +0.12 | Higher for departments that add critical info |
| apply_rule (correct rule) | +0.08 to +0.12 | Applying the right policy pathway |
| apply_rule (wrong rule) | −0.05 to −0.10 | Misapplying policy |
| make_decision (correct) | +0.18 to +0.28 | Correct decision based on evidence |
| make_decision (wrong) | −0.10 to −0.40 | Severity scales with how wrong |
| route_to (correct team) | +0.06 to +0.14 | Right escalation path |
| close_case (complete) | +0.06 to +0.12 | Depends on decision quality |
| Repeat action | −0.02 to −0.05 | Light penalty, not catastrophic |
| SLA breach (exceed max steps) | −0.10 | One-time penalty at end |
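In code, a table like this typically becomes a lookup keyed on (action type, outcome). The sketch below collapses each range to a single representative value; a real implementation would scale within the range by how diagnostic the specific action was. Keys and values here are illustrative, not the environment's actual constants.

```python
# Simplified reward lookup: (action_type, outcome) -> reward.
# Range endpoints from the table are collapsed to representative values.
REWARD_TABLE: dict[tuple[str, str], float] = {
    ("inspect_field",  "relevant"):     0.10,
    ("inspect_field",  "irrelevant"):   0.01,
    ("cross_check",    "mismatch"):     0.14,
    ("cross_check",    "clean"):        0.02,
    ("run_check",      "finds_issue"):  0.15,
    ("run_check",      "clean"):        0.03,
    ("query_supplier", "phone"):        0.12,
    ("query_supplier", "email_fraud"): -0.15,
    ("make_decision",  "correct"):      0.25,
    ("make_decision",  "wrong"):       -0.30,
}

REPEAT_PENALTY = -0.03   # light, per the table
SLA_PENALTY = -0.10      # one-time, applied at episode end


def step_reward(action_type: str, outcome: str, repeated: bool = False) -> float:
    """Reward for one step: base lookup plus the light repeat penalty."""
    base = REWARD_TABLE.get((action_type, outcome), 0.0)
    return base + (REPEAT_PENALTY if repeated else 0.0)
```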
9.3 Episode Score vs Cumulative Reward
These are different numbers:
- Cumulative reward is the sum of step rewards. It is used as a training signal.
- Episode score (from grade()) is the holistic quality assessment. It is what the hackathon evaluates.
Agents should be optimised on the grade score, not the cumulative reward alone.
10. Evaluation Criteria
10.1 Hackathon Scoring
| Criterion | Weight | What judges look for |
|---|---|---|
| Real-world utility | 30% | Would an enterprise actually use this? Does it model the task faithfully? |
| Task & grader quality | 25% | Clear objectives, accurate grading, genuine difficulty progression, frontier models challenged |
| Environment design | 20% | Clean state management, good action/observation spaces, shaped reward, sensible episode boundaries |
| Code quality & spec compliance | 15% | OpenEnv spec passes, Dockerfile works, baseline reproduces, typed models |
| Creativity & novelty | 10% | Novel domain, interesting mechanics, original reward design |
10.2 Automated Gates (must all pass)
- HF Space deploys
- POST /reset returns 200
- openenv validate passes
- docker build succeeds
- python inference.py runs without error and produces scores
- All 3 tasks enumerated, grader scores verified in [0.0, 1.0]
10.3 Phase 2 β Agentic Evaluation
The hackathon will run a standard open LLM agent (e.g. Nemotron 3 Super) against the environment. The environment must:
- Not be trivially solvable by a greedy agent
- Produce score variance across tasks (not all the same)
- Penalise clearly suboptimal behaviour
10.4 Disqualifiers
- Environment does not deploy or respond to /reset
- Graders that always return the same score regardless of actions
- inference.py not in root, or not using the OpenAI client
- No baseline scores produced
- Plagiarised environment
11. API Contract
11.1 Environment Python API
```python
env = InvoiceExceptionEnv(seed=42)

# Reset -- returns EnvironmentState
obs: EnvironmentState = env.reset("task1_price_variance")

# Step -- returns StepResult
result: StepResult = env.step(Action.run_check("tolerance_rule"))
# result.observation -> EnvironmentState
# result.reward      -> float
# result.done        -> bool
# result.info        -> dict

# State -- non-destructive peek
obs: EnvironmentState = env.state()

# Grade -- run grader on episode
scores: dict = env.grade()
# scores["score"]           -> 0.0-1.0 overall
# scores["diagnosis_score"] -> float
# scores["decision_score"]  -> float
# ...
```
11.2 HTTP API
```
POST /reset
  Body:     {"task_id": "task1_price_variance"}   (optional; random task if omitted)
  Response: 200 EnvironmentState JSON

POST /step
  Body:     {"type": "run_check", "params": {"check_name": "tolerance_rule"}}
  Response: 200 StepResult JSON

GET /state
  Response: 200 EnvironmentState JSON

POST /grade
  Response: 200 {"score": 0.85, "diagnosis_score": ...}

GET /tasks
  Response: 200 ["task1_price_variance", "task2_duplicate_tax", "task3_compound_fraud"]

GET /health
  Response: 200 {"status": "ok", "version": "1.0.0"}
```
11.3 Action Schema
```json
{
  "type": "run_check",
  "params": {"check_name": "tolerance_rule"}
}

{
  "type": "inspect_field",
  "params": {"document": "invoice", "field": "bank_account"}
}

{
  "type": "cross_check",
  "params": {"field": "unit_price", "doc_a": "invoice", "doc_b": "po"}
}

{
  "type": "query_supplier",
  "params": {"question": "Why does your bank account differ?", "channel": "phone"}
}

{
  "type": "query_internal",
  "params": {"department": "procurement", "question": "Did you approve this price?"}
}

{
  "type": "apply_rule",
  "params": {"rule_id": "tolerance_exception_approval"}
}

{
  "type": "make_decision",
  "params": {"decision": "approve", "reason": "Verbal approval confirmed by procurement."}
}

{
  "type": "route_to",
  "params": {"team": "procurement", "notes": "Please raise PO amendment for the price variance."}
}

{
  "type": "close_case",
  "params": {"summary": "Invoice approved. PO amendment requested. Case closed."}
}
```
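FR-005 requires these payloads to be typed Pydantic v2 models in the real environment. To keep this sketch dependency-free it uses a dataclass with the same validation intent; field names come from the schemas above, and the class itself is illustrative, not the codebase's `Action` model.

```python
from dataclasses import dataclass, field
from typing import Any

# The nine action types enumerated in the schemas above.
ACTION_TYPES = {
    "run_check", "inspect_field", "cross_check", "query_supplier",
    "query_internal", "apply_rule", "make_decision", "route_to", "close_case",
}


@dataclass
class Action:
    """Illustrative action container. The environment itself uses typed
    Pydantic v2 models (FR-005); a dataclass keeps this example portable."""
    type: str
    params: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.type not in ACTION_TYPES:
            raise ValueError(f"unknown action type: {self.type!r}")


# Example: the tolerance check from the first schema.
check = Action(type="run_check", params={"check_name": "tolerance_rule"})
```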
12. File Structure
```
invoice-exception-handler/
│
├── README.md            # Full setup + usage guide
├── openenv.yaml         # OpenEnv spec (must pass openenv validate)
├── Dockerfile           # Single-stage Python 3.11-slim build
├── requirements.txt     # Pinned dependencies
├── inference.py         # Competition inference script (MUST be here)
├── app.py               # Gradio + FastAPI entrypoint for HF Spaces
│
├── env/
│   ├── __init__.py
│   ├── models.py        # All Pydantic typed models
│   ├── environment.py   # InvoiceExceptionEnv class
│   └── tasks.py         # 3 task classes + graders + EpisodeData
│
└── documents/
    ├── PRD-001-product-requirements.md   # This document
    ├── CHANGELOG.md                      # Every code change recorded
    ├── ARCHITECTURE.md                   # System diagram + decisions
    └── BASELINE-SCORES.md                # Reproducible benchmark results
```
13. Out of Scope
The following are explicitly not part of v1.0:
- Real database connectivity (the environment is fully simulated)
- Multi-agent scenarios (one agent per episode)
- Partial observability (agent sees all documents from the start)
- User interface for human play (nice-to-have but not required for submission)
- Real supplier APIs (simulation only)
- Currency other than INR (can be extended in v1.1)
- Tasks beyond 3 (can be extended)
14. Change Log
| Version | Date | Author | Change |
|---|---|---|---|
| 0.1.0 | 2025-01-18 | [Author] | Initial draft: problem definition and task sketches |
| 0.2.0 | 2025-01-19 | [Author] | Added reward design section, API contract, file structure |
| 1.0.0 | 2025-01-20 | [Author] | Final version: all sections complete, ready for implementation |