Product Requirements Document
Invoice Exception Handler – OpenEnv Agent Learning Environment
Document ID: PRD-001
Version: 1.0.0
Status: Final
Author: [Your Name]
Last Updated: 2025-01-20
Classification: Internal / Hackathon Submission
Table of Contents
- Executive Summary
- Problem Statement
- Product Vision
- Stakeholders
- Functional Requirements
- Non-Functional Requirements
- System Architecture
- Task Specifications
- Reward Design
- Evaluation Criteria
- API Contract
- File Structure
- Out of Scope
- Change Log
1. Executive Summary
The Invoice Exception Handler is a real-world agent learning environment built for the OpenEnv standard. It simulates the accounts payable (AP) exception-handling workflow that nearly every business runs daily: the process of investigating flagged invoices before payment is approved.
The environment places an AI agent in the role of an AP analyst. The agent receives a document packet (Purchase Order, Invoice, Goods Receipt Note, Supplier Master), reads an exception flag, and must investigate the root cause, make a decision, route the case to the right team, and close it cleanly. Every action has realistic financial and compliance consequences.
The environment ships with three tasks of increasing difficulty: price variance (easy), duplicate with a hidden tax error (medium), and compound fraud with four simultaneous signals (hard).
2. Problem Statement
2.1 The Real-World Pain
Every company that buys goods or services from suppliers receives invoices. Typically 5–15% of all invoices have exceptions: discrepancies between what was ordered (PO), what was received (GRN), and what was invoiced. These exceptions are currently handled by accounts payable clerks who manually:
- Pull the original Purchase Order
- Compare it line by line against the invoice
- Check the Goods Receipt Note
- Run validation checks
- Query internal teams or the supplier
- Make a decision (approve / reject / hold / partial approve)
- Route the case and document everything
At a mid-size company this is 2–4 hours of analyst time per day; at enterprise scale it is entire departments. The AP automation market addressing this problem exceeds $3 billion annually.
2.2 The AI Gap
No existing OpenEnv benchmark tests an agent's ability to:
- Reason across multiple documents simultaneously
- Apply business rules with thresholds and exceptions
- Detect fraud signals that require cross-referencing
- Make nuanced decisions (partial approve, hold, escalate)
- Know not to contact a supplier via a potentially compromised channel
This gap means agents trained on existing benchmarks cannot be evaluated or trained on one of the most common finance workflows in enterprise software.
2.3 What This Environment Fixes
The Invoice Exception Handler provides:
- A clean, typed, deterministic simulation of AP exception handling
- Three tasks that test a progression of reasoning: threshold logic → duplicate detection → multi-signal fraud
- Shaped rewards that signal progress at every step, not just at episode end
- A fully deployable environment that conforms to the OpenEnv spec
3. Product Vision
An agent that scores well in this environment is demonstrably better at AP exception handling than the average accounts payable clerk, and is ready to be deployed in real enterprise finance workflows.
The environment is designed so that:
- The reward signal is meaningful enough to actually train agents on, not just evaluate them
- The hard task (compound fraud) remains genuinely difficult for frontier models
- Every score between 0.0 and 1.0 reflects a real quality difference in agent behavior
4. Stakeholders
| Stakeholder | Role | Interest |
|---|---|---|
| Hackathon Judges (Meta, HF engineers) | Evaluators | Real-world utility, code quality, creativity |
| OpenEnv Automated Validator | Gatekeeper | Spec compliance, deployment health |
| AI Researchers | Primary users post-submission | Training and evaluating AP agents |
| Enterprise Software Companies | Secondary users | Evaluating models for AP automation products |
5. Functional Requirements
5.1 Core Environment API
| Requirement | Priority | Detail |
|---|---|---|
| FR-001 | MUST | env.reset(task_id) returns a clean EnvironmentState |
| FR-002 | MUST | env.step(action) returns StepResult(observation, reward, done, info) |
| FR-003 | MUST | env.state() returns current state without advancing episode |
| FR-004 | MUST | env.grade() returns a score dict with overall score 0.0–1.0 |
| FR-005 | MUST | All models are typed Pydantic v2 with no untyped fields |
| FR-006 | MUST | openenv.yaml passes openenv validate |
5.2 HTTP Endpoints (for HF Spaces validator)
| Requirement | Priority | Detail |
|---|---|---|
| FR-007 | MUST | POST /reset returns HTTP 200 with JSON observation |
| FR-008 | MUST | POST /step returns HTTP 200 with JSON StepResult |
| FR-009 | MUST | GET /state returns HTTP 200 with JSON EnvironmentState |
| FR-010 | MUST | GET /health returns HTTP 200 {"status": "ok"} |
| FR-011 | SHOULD | GET / returns HTML documentation page |
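The /health contract (FR-010) is simple enough to sketch end to end. The real server uses FastAPI behind Gradio (see the architecture section); the sketch below implements only GET /health with the standard library so it runs with no dependencies. The handler name and `serve` helper are illustrative, not part of the actual codebase.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Minimal handler implementing only the GET /health contract (FR-010)."""

    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):
        # Silence the default per-request stderr logging.
        pass


def serve(port: int = 7860) -> HTTPServer:
    """Bind the server; the default port 7860 matches FR-029."""
    return HTTPServer(("127.0.0.1", port), HealthHandler)
```

The validator's health probe then reduces to `curl http://localhost:7860/health` returning `{"status": "ok"}`.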
5.3 Task Requirements
| Requirement | Priority | Detail |
|---|---|---|
| FR-012 | MUST | Minimum 3 tasks with distinct scenarios |
| FR-013 | MUST | Tasks range easy → medium → hard |
| FR-014 | MUST | Each task has a deterministic grader returning 0.0–1.0 |
| FR-015 | MUST | Graders have sub-scores (diagnosis, investigation, decision, routing, closure, efficiency) |
| FR-016 | MUST | Hard task must not be solvable by simple heuristics |
5.4 Reward Function
| Requirement | Priority | Detail |
|---|---|---|
| FR-017 | MUST | Reward is shaped across the full trajectory |
| FR-018 | MUST | Dangerous actions (approving fraud) produce large negative rewards |
| FR-019 | MUST | Repeating already-completed actions penalised lightly |
| FR-020 | MUST | Exceeding step budget penalised (SLA concept) |
| FR-021 | SHOULD | Efficiency bonus for completing faster than optimal |
5.5 Inference Script
| Requirement | Priority | Detail |
|---|---|---|
| FR-022 | MUST | Script named exactly inference.py in root directory |
| FR-023 | MUST | Uses OpenAI client (not Anthropic SDK) |
| FR-024 | MUST | Reads API_BASE_URL, MODEL_NAME, HF_TOKEN from environment |
| FR-025 | MUST | Emits [START], [STEP], [END] lines to stdout exactly as spec |
| FR-026 | MUST | Completes all 3 tasks in under 20 minutes on 2 vCPU / 8 GB RAM |
| FR-027 | MUST | Produces reproducible scores with the same seed |
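The control loop FR-022 through FR-027 describe can be sketched as follows. This is a shape sketch only: `choose_action` stands in for the OpenAI chat-completion call required by FR-023 (a real implementation reads API_BASE_URL, MODEL_NAME and HF_TOKEN per FR-024), and the exact [START]/[STEP]/[END] line formats shown here are placeholders — the spec's required format takes precedence.

```python
import json


def choose_action(observation: dict) -> dict:
    """Placeholder for the OpenAI chat-completion call (FR-023).

    A real implementation prompts the model with the observation JSON
    and parses an action payload out of the reply.
    """
    return {"type": "run_check", "params": {"check_name": "po_match"}}


def run_task(env, task_id: str, max_steps: int = 25) -> float:
    """Drive one episode and emit the log lines FR-025 requires.

    The log-line formats below are illustrative, not the spec's exact strings.
    """
    print(f"[START] {task_id}")
    obs = env.reset(task_id)
    for step in range(max_steps):
        action = choose_action(obs)
        result = env.step(action)
        print(f"[STEP] {step} {json.dumps(action)} reward={result.reward:+.3f}")
        obs = result.observation
        if result.done:
            break
    score = env.grade()["score"]
    print(f"[END] {task_id} score={score:.3f}")
    return score
```

Running this once per task, with a fixed seed in the env constructor, gives the reproducible scores FR-027 demands.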
5.6 Deployment
| Requirement | Priority | Detail |
|---|---|---|
| FR-028 | MUST | Dockerfile builds cleanly without internet access at run time |
| FR-029 | MUST | Container starts and serves on port 7860 |
| FR-030 | MUST | HF Spaces POST /reset returns 200 |
| FR-031 | MUST | README documents setup, action space, observation space, tasks, baseline scores |
6. Non-Functional Requirements
| ID | Category | Requirement |
|---|---|---|
| NFR-001 | Performance | reset() completes in < 100ms |
| NFR-002 | Performance | step() completes in < 50ms |
| NFR-003 | Performance | Full 3-task inference run completes in < 20 minutes |
| NFR-004 | Resource | Runs on 2 vCPU, 8 GB RAM; no GPU required |
| NFR-005 | Correctness | Grader output is deterministic: same actions = same score |
| NFR-006 | Correctness | Reward values are deterministic: no randomness in simulation |
| NFR-007 | Code quality | No bare except: blocks; all exceptions typed |
| NFR-008 | Code quality | All functions have docstrings |
| NFR-009 | Code quality | Type hints on all function signatures |
| NFR-010 | Portability | Zero OS-specific code; runs on Linux (Docker) |
| NFR-011 | Security | No hardcoded credentials anywhere in code |
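NFR-005/006 (determinism) are cheap to enforce with a regression test: replay the same action sequence twice and require byte-identical rewards and totals. The sketch below shows the shape of such a test against a stand-in environment; a real test would construct `InvoiceExceptionEnv(seed=...)` instead of the stub.

```python
# Determinism regression check (NFR-005/006): identical action sequences
# must yield identical rewards. _StubEnv is a hypothetical stand-in for
# InvoiceExceptionEnv; its reward values are arbitrary.

class _StubEnv:
    REWARDS = {"run_check": 0.12, "inspect_field": 0.05, "close_case": 0.08}

    def __init__(self, seed: int):
        self.seed = seed      # the seed fixes any scenario generation
        self.total = 0.0

    def step(self, action_type: str) -> float:
        # Fixed lookup: no randomness anywhere in the simulation.
        r = self.REWARDS.get(action_type, 0.01)
        self.total += r
        return r


def replay(seed: int, actions: list[str]) -> tuple[list[float], float]:
    """Run one episode from scratch and return (per-step rewards, total)."""
    env = _StubEnv(seed)
    rewards = [env.step(a) for a in actions]
    return rewards, env.total


actions = ["run_check", "inspect_field", "close_case"]
assert replay(42, actions) == replay(42, actions)  # same seed + actions -> same result
```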
7. System Architecture
```
+----------------------------------------------------------------+
|                  HF Space / Docker Container                   |
|                                                                |
|  +---------------+      +----------------------------------+   |
|  |   Gradio UI   |      |  FastAPI Server                  |   |
|  |  (port 7860)  |      |  POST /reset     GET /state      |   |
|  |               |      |  POST /step      GET /health     |   |
|  +-------+-------+      +---------------+------------------+   |
|          |                              |                      |
|          +------------+-----------------+                      |
|                       |                                        |
|          +------------v---------------+                        |
|          |    InvoiceExceptionEnv     |                        |
|          |  reset() step() state()    |                        |
|          |  grade()                   |                        |
|          +------------+---------------+                        |
|                       |                                        |
|          +------------v---------------+                        |
|          |       Task Registry        |                        |
|          |  task1_price_variance      |                        |
|          |  task2_duplicate_tax       |                        |
|          |  task3_compound_fraud      |                        |
|          +----------------------------+                        |
+----------------------------------------------------------------+

+----------------------------------------------------------------+
|                     inference.py (agent)                       |
|                                                                |
|  OpenAI Client -> env.reset() -> loop {                        |
|      action = LLM(observation_json)                            |
|      result = env.step(action)                                 |
|      log [STEP]                                                |
|  } -> log [END]                                                |
+----------------------------------------------------------------+
```
7.1 Data Flow
```
Episode start
      |
      v
reset(task_id) --> builds DocumentPacket + EpisodeData --> EnvironmentState
      |
      v
step(action) --> dispatch to task simulator --> (reward, info)
      |                                            |
      v                                            v
EpisodeData updated ------------------------ append to history
      |
      v
new EnvironmentState built --> StepResult(obs, reward, done, info)
      |
      v
grade() --> EpisodeData --> grader logic --> Dict[str, float]
```
8. Task Specifications
8.1 Task 1 β Price Variance Exception (Easy)
Scenario: An office-stationery invoice arrives 3.08% above the PO amount. Company tolerance policy is ±2% for auto-approval. The supplier cites a verbal approval from the procurement team, captured only in an email explaining a raw-material price increase that was never formalised in the PO.
What makes it easy: Single root cause, all signals are benign (no fraud), the fix is straightforward (confirm with procurement, approve with PO amendment).
Optimal path (10 steps):
run_check(po_match)
run_check(tolerance_rule) → finds 3.08% > 2%
cross_check(unit_price, invoice, po) → finds two mismatched lines
run_check(grn_match) → confirms delivery complete
query_supplier(reason for increase) → gets email confirmation
query_internal(procurement, confirm?) → procurement confirms verbal approval
apply_rule(tolerance_exception_approval)
make_decision(approve, reason)
route_to(procurement, raise PO amendment)
close_case(summary)
Pitfalls:
- Rejecting without querying the supplier → wrong decision, score capped at ~0.35
- Approving without checking the tolerance rule → policy violation, −0.15
- Running fraud checks that aren't needed → wasted steps
Grader weights:
| Sub-score | Max | Key signals |
|---|---|---|
| Diagnosis | 0.32 | tolerance_rule check, price mismatch found |
| Investigation | 0.30 | supplier queried, procurement confirmed |
| Decision | 0.18 | correct approve decision |
| Routing | 0.12 | PO amendment sent to procurement |
| Closure | 0.08 | case closed with summary |
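Under the action schema in section 11.3, the optimal path above serialises to a payload list like the one below. The free-text params (questions, reasons, notes) are illustrative placeholders, not the grader's expected strings:

```python
# Task 1 optimal path expressed as action payloads (schema from section 11.3).
# Question/reason/note strings are illustrative, not graded verbatim.
OPTIMAL_PATH_TASK1 = [
    {"type": "run_check",      "params": {"check_name": "po_match"}},
    {"type": "run_check",      "params": {"check_name": "tolerance_rule"}},
    {"type": "cross_check",    "params": {"field": "unit_price", "doc_a": "invoice", "doc_b": "po"}},
    {"type": "run_check",      "params": {"check_name": "grn_match"}},
    {"type": "query_supplier", "params": {"question": "What is the reason for the price increase?", "channel": "email"}},
    {"type": "query_internal", "params": {"department": "procurement", "question": "Did you approve this price?"}},
    {"type": "apply_rule",     "params": {"rule_id": "tolerance_exception_approval"}},
    {"type": "make_decision",  "params": {"decision": "approve", "reason": "Verbal approval confirmed by procurement."}},
    {"type": "route_to",       "params": {"team": "procurement", "notes": "Raise PO amendment for the price variance."}},
    {"type": "close_case",     "params": {"summary": "Approved with PO amendment requested."}},
]
```

Feeding each entry to env.step() in order is the reference trajectory the Task 1 grader scores highest.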
8.2 Task 2 β Duplicate Invoice with Hidden Tax Error (Medium)
Scenario: A logistics supplier submits INV-2024-891. The system flags it as a possible duplicate of INV-2024-819 (already paid). The invoice numbers differ by a digit transposition (8-9-1 vs 8-1-9). However: the original invoice applied 15% GST (the wrong rate); the correct rate is 18%. The company overpaid ₹3,240 in tax on the original invoice. The new invoice has the correct rate. So it is simultaneously a duplicate AND a legitimate correction.
What makes it medium: The agent must not just detect the duplicate and reject; it must also detect the tax error in the original paid invoice and partially approve the correction delta (₹3,240). A simple "reject all duplicates" rule misses this and loses significant score.
Optimal path (11 steps):
run_check(duplicate_detection) → finds INV-2024-819
inspect_field(invoice, invoice_number) → spots digit transposition
run_check(tax_calculation_verify) → finds 15% vs 18% on original
cross_check(tax_amount, invoice, payment_history) → confirms ₹3,240 delta
query_internal(finance, confirm overpayment?)
query_supplier(clarify relationship between invoices)
apply_rule(partial_approval)
apply_rule(credit_note_request)
make_decision(partial_approve, reason)
route_to(finance, process ₹3,240 correction)
close_case(summary)
Pitfalls:
- Full rejection (catches duplicate, misses correction): score ~0.35
- Full approval (pays full duplicate): score −0.15
- Partial approve without credit note: score ~0.60
Grader weights:
| Sub-score | Max | Key signals |
|---|---|---|
| Diagnosis | 0.30 | duplicate found, tax error found |
| Investigation | 0.32 | finance queried, supplier queried, rules applied |
| Decision | 0.20 | partial_approve correct |
| Routing | 0.08 | finance routed for correction |
| Closure | 0.06 | case closed |
8.3 Task 3 β Compound Fraud Signals (Hard)
Scenario: An IT equipment supplier submits a ₹8,47,500 invoice for 15 laptops. The system flags a bank account change, but there are four simultaneous fraud signals:
- Bank account fraud: The invoice lists a different bank account than the supplier master. The change request came from a lookalike email domain (techcore-solutions.com vs the registered techcore-solutions.in), a classic Business Email Compromise (BEC) attack.
- GSTIN fraud: The GST number on the invoice belongs to a different company β "TechCore Trading Pvt Ltd" in Delhi, not "TechCore Solutions."
- Quantity mismatch: Invoice claims 15 laptops. GRN confirms only 13 received. 2 are still in transit.
- Price inflation: ₹56,500/unit invoiced vs ₹52,000/unit on PO, 8.65% above PO with no approved revision.
Critical trap: If the agent queries the supplier via email, it contacts the fraudster, who will try to convince it to approve. The agent must use the phone/registered channel.
What makes it hard: The agent must hold four independent fraud signals in mind simultaneously, not stop after finding one, apply the correct communication-channel policy, and route to both legal and security (not just finance).
Optimal path (17 steps):
inspect_field(invoice, bank_account) → sees mismatch
cross_check(bank_account, invoice, supplier_master)
run_check(bank_account_verification) → finds lookalike domain
run_check(email_domain_verification)
inspect_field(invoice, supplier_gstin)
run_check(gst_verification) → finds GST belongs to a different entity
cross_check(gstin, invoice, supplier_master)
inspect_field(grn, items_received)
run_check(grn_match) → 13 vs 15
run_check(price_check) → 8.65% above PO
query_supplier(confirm details, channel=phone) → supplier confirms fraud
query_internal(security, investigate BEC)
apply_rule(fraud_hold)
make_decision(reject, all fraud signals documented)
route_to(legal, initiate supplier audit)
route_to(security, BEC investigation)
close_case(fraud report summary)
Critical pitfall (contacting via email): −0.15 reward, and the agent receives the fraudster's response trying to get the payment approved. Scoring penalises this heavily.
Grader weights:
| Sub-score | Max | Key signals |
|---|---|---|
| Diagnosis | 0.50 | bank fraud, GST fraud, quantity mismatch, domain lookalike, price inflation |
| Investigation | 0.20 | phone contact (not email), security queried, legal queried |
| Decision | 0.20 | reject with all signals documented |
| Routing | 0.20 | legal + security routed |
| Closure | 0.06 | case closed with fraud report |
Scoring thresholds:
- Find 1 signal: ~0.20
- Find 2 signals: ~0.40
- Find 3 signals: ~0.60
- Find all 4 + correct routing: ~0.90+
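One plausible reading of these thresholds (an assumption, not the grader's actual code) is a diagnosis component that scales linearly with the number of fraud signals found, capped at the 0.50 diagnosis weight, with investigation/decision/routing credit added on top:

```python
def diagnosis_component(signals_found: int, total_signals: int = 4,
                        weight: float = 0.50) -> float:
    """Hypothetical diagnosis sub-score: linear in signals found,
    capped at the 0.50 diagnosis weight from the Task 3 grader table."""
    found = max(0, min(signals_found, total_signals))
    return weight * found / total_signals


# All four signals max out the diagnosis weight; the remaining ~0.40+
# in the published thresholds comes from the other sub-scores.
assert diagnosis_component(4) == 0.50
assert diagnosis_component(2) == 0.25
```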
9. Reward Design
9.1 Philosophy
The reward function is designed around three principles:
Principle 1: Every informative action gets signal. Agents should learn that investigating is always better than guessing. Each relevant inspection, check, or query returns a positive reward proportional to how diagnostic that action is.
Principle 2: Dangerous actions get crushed. Approving a fraudulent invoice, disabling security controls, or contacting a supplier via a compromised channel are not ordinary mistakes; they are catastrophic errors. These must receive large negative rewards so agents learn to avoid them unconditionally.
Principle 3: The grader is the ground truth, the shaped reward is the training signal. The episode reward is shaped to help agents learn. The grader score at the end is what actually measures quality.
9.2 Reward Table
| Action | Reward Range | Notes |
|---|---|---|
| inspect_field (relevant) | +0.01 to +0.14 | Higher for fields that reveal anomalies |
| inspect_field (irrelevant) | +0.01 | Still small positive; exploration is fine |
| cross_check (finds mismatch) | +0.12 to +0.15 | Diagnosis reward |
| cross_check (no mismatch) | +0.02 | Confirms a clean field |
| run_check (finds issue) | +0.08 to +0.18 | Higher for more diagnostic checks |
| run_check (clean) | +0.01 to +0.06 | Clean checks still confirm facts |
| query_supplier (phone) | +0.10 to +0.15 | Correct channel |
| query_supplier (email, fraud task) | −0.15 | Contacts fraudster |
| query_internal (key dept) | +0.04 to +0.12 | Higher for departments that add critical info |
| apply_rule (correct rule) | +0.08 to +0.12 | Applying the right policy pathway |
| apply_rule (wrong rule) | −0.05 to −0.10 | Misapplying policy |
| make_decision (correct) | +0.18 to +0.28 | Correct decision based on evidence |
| make_decision (wrong) | −0.10 to −0.40 | Severity scales with how wrong |
| route_to (correct team) | +0.06 to +0.14 | Right escalation path |
| close_case (complete) | +0.06 to +0.12 | Depends on decision quality |
| Repeat action | −0.02 to −0.05 | Light penalty, not catastrophic |
| SLA breach (exceed max steps) | −0.10 | One-time penalty at end |
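In code, a table like this typically becomes a lookup keyed on (action type, outcome). The sketch below collapses each range to a single representative value; a real implementation would scale within the range by how diagnostic the specific action was. Keys and values here are illustrative, not the environment's actual constants.

```python
# Simplified reward lookup: (action_type, outcome) -> reward.
# Range endpoints from the table are collapsed to representative values.
REWARD_TABLE: dict[tuple[str, str], float] = {
    ("inspect_field",  "relevant"):     0.10,
    ("inspect_field",  "irrelevant"):   0.01,
    ("cross_check",    "mismatch"):     0.14,
    ("cross_check",    "clean"):        0.02,
    ("run_check",      "finds_issue"):  0.15,
    ("run_check",      "clean"):        0.03,
    ("query_supplier", "phone"):        0.12,
    ("query_supplier", "email_fraud"): -0.15,
    ("make_decision",  "correct"):      0.25,
    ("make_decision",  "wrong"):       -0.30,
}

REPEAT_PENALTY = -0.03   # light, per the table
SLA_PENALTY = -0.10      # one-time, applied at episode end


def step_reward(action_type: str, outcome: str, repeated: bool = False) -> float:
    """Reward for one step: base lookup plus the light repeat penalty."""
    base = REWARD_TABLE.get((action_type, outcome), 0.0)
    return base + (REPEAT_PENALTY if repeated else 0.0)
```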
9.3 Episode Score vs Cumulative Reward
These are different numbers:
- Cumulative reward is the sum of step rewards. It is used as a training signal.
- Episode score (from grade()) is the holistic quality assessment. It is what the hackathon evaluates.
Agents should be optimised on the grade score, not the cumulative reward alone.
10. Evaluation Criteria
10.1 Hackathon Scoring
| Criterion | Weight | What judges look for |
|---|---|---|
| Real-world utility | 30% | Would an enterprise actually use this? Does it model the task faithfully? |
| Task & grader quality | 25% | Clear objectives, accurate grading, genuine difficulty progression, frontier models challenged |
| Environment design | 20% | Clean state management, good action/observation spaces, shaped reward, sensible episode boundaries |
| Code quality & spec compliance | 15% | OpenEnv spec passes, Dockerfile works, baseline reproduces, typed models |
| Creativity & novelty | 10% | Novel domain, interesting mechanics, original reward design |
10.2 Automated Gates (must all pass)
- HF Space deploys
- POST /reset returns 200
- openenv validate passes
- docker build succeeds
- python inference.py runs without error and produces scores
- All 3 tasks enumerated, grader scores verified in [0.0, 1.0]
10.3 Phase 2 β Agentic Evaluation
The hackathon will run a standard open LLM agent (e.g. Nemotron 3 Super) against the environment. The environment must:
- Not be trivially solvable by a greedy agent
- Produce score variance across tasks (not all the same)
- Penalise clearly suboptimal behaviour
10.4 Disqualifiers
- Environment does not deploy or respond to /reset
- Graders that always return the same score regardless of actions
- inference.py not in root, or not using the OpenAI client
- No baseline scores produced
- Plagiarised environment
11. API Contract
11.1 Environment Python API
```python
env = InvoiceExceptionEnv(seed=42)

# Reset -- returns EnvironmentState
obs: EnvironmentState = env.reset("task1_price_variance")

# Step -- returns StepResult
result: StepResult = env.step(Action.run_check("tolerance_rule"))
# result.observation -> EnvironmentState
# result.reward      -> float
# result.done        -> bool
# result.info        -> dict

# State -- non-destructive peek
obs: EnvironmentState = env.state()

# Grade -- run grader on episode
scores: dict = env.grade()
# scores["score"]           -> 0.0-1.0 overall
# scores["diagnosis_score"] -> float
# scores["decision_score"]  -> float
# ...
```
11.2 HTTP API
```
POST /reset
  Body:     {"task_id": "task1_price_variance"}   (optional; random task if omitted)
  Response: 200 EnvironmentState JSON

POST /step
  Body:     {"type": "run_check", "params": {"check_name": "tolerance_rule"}}
  Response: 200 StepResult JSON

GET /state
  Response: 200 EnvironmentState JSON

POST /grade
  Response: 200 {"score": 0.85, "diagnosis_score": ...}

GET /tasks
  Response: 200 ["task1_price_variance", "task2_duplicate_tax", "task3_compound_fraud"]

GET /health
  Response: 200 {"status": "ok", "version": "1.0.0"}
```
11.3 Action Schema
```json
{
  "type": "run_check",
  "params": {"check_name": "tolerance_rule"}
}

{
  "type": "inspect_field",
  "params": {"document": "invoice", "field": "bank_account"}
}

{
  "type": "cross_check",
  "params": {"field": "unit_price", "doc_a": "invoice", "doc_b": "po"}
}

{
  "type": "query_supplier",
  "params": {"question": "Why does your bank account differ?", "channel": "phone"}
}

{
  "type": "query_internal",
  "params": {"department": "procurement", "question": "Did you approve this price?"}
}

{
  "type": "apply_rule",
  "params": {"rule_id": "tolerance_exception_approval"}
}

{
  "type": "make_decision",
  "params": {"decision": "approve", "reason": "Verbal approval confirmed by procurement."}
}

{
  "type": "route_to",
  "params": {"team": "procurement", "notes": "Please raise PO amendment for the price variance."}
}

{
  "type": "close_case",
  "params": {"summary": "Invoice approved. PO amendment requested. Case closed."}
}
```
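FR-005 requires these payloads to be typed Pydantic v2 models in the real environment. To keep this sketch dependency-free it uses a dataclass with the same validation intent; field names come from the schemas above, and the class itself is illustrative, not the codebase's `Action` model.

```python
from dataclasses import dataclass, field
from typing import Any

# The nine action types enumerated in the schemas above.
ACTION_TYPES = {
    "run_check", "inspect_field", "cross_check", "query_supplier",
    "query_internal", "apply_rule", "make_decision", "route_to", "close_case",
}


@dataclass
class Action:
    """Illustrative action container. The environment itself uses typed
    Pydantic v2 models (FR-005); a dataclass keeps this example portable."""
    type: str
    params: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.type not in ACTION_TYPES:
            raise ValueError(f"unknown action type: {self.type!r}")


# Example: the tolerance check from the first schema.
check = Action(type="run_check", params={"check_name": "tolerance_rule"})
```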
12. File Structure
```
invoice-exception-handler/
│
├── README.md            # Full setup + usage guide
├── openenv.yaml         # OpenEnv spec (must pass openenv validate)
├── Dockerfile           # Single-stage Python 3.11-slim build
├── requirements.txt     # Pinned dependencies
├── inference.py         # Competition inference script (MUST be here)
├── app.py               # Gradio + FastAPI entrypoint for HF Spaces
│
├── env/
│   ├── __init__.py
│   ├── models.py        # All Pydantic typed models
│   ├── environment.py   # InvoiceExceptionEnv class
│   └── tasks.py         # 3 task classes + graders + EpisodeData
│
└── documents/
    ├── PRD-001-product-requirements.md   # This document
    ├── CHANGELOG.md                      # Every code change recorded
    ├── ARCHITECTURE.md                   # System diagram + decisions
    └── BASELINE-SCORES.md                # Reproducible benchmark results
```
13. Out of Scope
The following are explicitly not part of v1.0:
- Real database connectivity (the environment is fully simulated)
- Multi-agent scenarios (one agent per episode)
- Partial observability (agent sees all documents from the start)
- User interface for human play (nice-to-have but not required for submission)
- Real supplier APIs (simulation only)
- Currency other than INR (can be extended in v1.1)
- Tasks beyond 3 (can be extended)
14. Change Log
| Version | Date | Author | Change |
|---|---|---|---|
| 0.1.0 | 2025-01-18 | [Author] | Initial draft: problem definition and task sketches |
| 0.2.0 | 2025-01-19 | [Author] | Added reward design section, API contract, file structure |
| 1.0.0 | 2025-01-20 | [Author] | Final version: all sections complete, ready for implementation |