# Product Requirements Document
## Invoice Exception Handler — OpenEnv Agent Learning Environment

**Document ID:** PRD-001  
**Version:** 1.0.0  
**Status:** Final  
**Author:** [Your Name]  
**Last Updated:** 2025-01-20  
**Classification:** Internal / Hackathon Submission

---

## Table of Contents

1. [Executive Summary](#1-executive-summary)
2. [Problem Statement](#2-problem-statement)
3. [Product Vision](#3-product-vision)
4. [Stakeholders](#4-stakeholders)
5. [Functional Requirements](#5-functional-requirements)
6. [Non-Functional Requirements](#6-non-functional-requirements)
7. [System Architecture](#7-system-architecture)
8. [Task Specifications](#8-task-specifications)
9. [Reward Design](#9-reward-design)
10. [Evaluation Criteria](#10-evaluation-criteria)
11. [API Contract](#11-api-contract)
12. [File Structure](#12-file-structure)
13. [Out of Scope](#13-out-of-scope)
14. [Change Log](#14-change-log)

---

## 1. Executive Summary

The Invoice Exception Handler is a real-world agent learning environment built for the OpenEnv standard. It simulates the accounts payable (AP) exception-handling workflow that nearly every business runs daily — the process of investigating flagged invoices before payment is approved.

The environment places an AI agent in the role of an AP analyst. The agent receives a document packet (Purchase Order, Invoice, Goods Receipt Note, Supplier Master), reads an exception flag, and must investigate the root cause, make a decision, route the case to the right team, and close it cleanly. Every action has realistic financial and compliance consequences.

The environment ships with three tasks of increasing difficulty — price variance (easy), duplicate with hidden tax error (medium), and compound fraud with four simultaneous signals (hard).

---

## 2. Problem Statement

### 2.1 The Real-World Pain

Every company that buys goods or services from suppliers receives invoices. Typically 5–15% of all invoices have exceptions — discrepancies between what was ordered (PO), what was received (GRN), and what was invoiced. These exceptions are currently handled by accounts payable clerks who manually:

1. Pull the original Purchase Order
2. Compare it line by line against the invoice
3. Check the Goods Receipt Note
4. Run validation checks
5. Query internal teams or the supplier
6. Make a decision (approve / reject / hold / partial approve)
7. Route the case and document everything

At a mid-size company this is 2–4 hours of analyst time per day. At enterprise scale it is entire departments. The AP automation market that has grown up around this pain exceeds $3 billion annually.

### 2.2 The AI Gap

No existing OpenEnv benchmark tests an agent's ability to:
- Reason across multiple documents simultaneously
- Apply business rules with thresholds and exceptions
- Detect fraud signals that require cross-referencing
- Make nuanced decisions (partial approve, hold, escalate)
- Know *not* to contact a supplier via a potentially compromised channel

This gap means agents trained on existing benchmarks cannot be evaluated or trained on one of the most common finance workflows in enterprise software.

### 2.3 What This Environment Fixes

The Invoice Exception Handler provides:
- A clean, typed, deterministic simulation of AP exception handling
- Three tasks that test a progression of reasoning: threshold logic → duplicate detection → multi-signal fraud
- Shaped rewards that signal progress at every step, not just at episode end
- A fully deployable environment that conforms to the OpenEnv spec

---

## 3. Product Vision

> An agent that scores well in this environment is demonstrably better at AP exception handling than the average accounts payable clerk — and is ready to be deployed in real enterprise finance workflows.

The environment is designed so that:
- The reward signal is meaningful enough to actually train agents on, not just evaluate them
- The hard task (compound fraud) remains genuinely difficult for frontier models
- Every score between 0.0 and 1.0 reflects a real quality difference in agent behavior

---

## 4. Stakeholders

| Stakeholder | Role | Interest |
|---|---|---|
| Hackathon Judges (Meta, HF engineers) | Evaluators | Real-world utility, code quality, creativity |
| OpenEnv Automated Validator | Gatekeeper | Spec compliance, deployment health |
| AI Researchers | Primary users post-submission | Training and evaluating AP agents |
| Enterprise Software Companies | Secondary users | Evaluating models for AP automation products |

---

## 5. Functional Requirements

### 5.1 Core Environment API

| Requirement | Priority | Detail |
|---|---|---|
| FR-001 | MUST | `env.reset(task_id)` returns a clean `EnvironmentState` |
| FR-002 | MUST | `env.step(action)` returns `StepResult(observation, reward, done, info)` |
| FR-003 | MUST | `env.state()` returns current state without advancing episode |
| FR-004 | MUST | `env.grade()` returns a score dict with overall score 0.0–1.0 |
| FR-005 | MUST | All models are typed Pydantic v2 with no untyped fields |
| FR-006 | MUST | `openenv.yaml` passes `openenv validate` |
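The return types that FR-001 through FR-005 describe can be pictured with a minimal stdlib sketch. The real project mandates Pydantic v2 models; plain dataclasses stand in here, and any field beyond the table (`step_count`, `history`) is an assumption for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class EnvironmentState:
    """Stand-in for the Pydantic v2 observation model (FR-001, FR-005)."""
    task_id: str
    step_count: int = 0
    history: list = field(default_factory=list)

@dataclass
class StepResult:
    """Return type of env.step() per FR-002: observation, reward, done, info."""
    observation: EnvironmentState
    reward: float
    done: bool
    info: dict

state = EnvironmentState(task_id="task1_price_variance")
result = StepResult(observation=state, reward=0.12, done=False, info={})
```

In the real codebase these would subclass `pydantic.BaseModel` so that FR-005 (no untyped fields) is enforced at construction time.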

### 5.2 HTTP Endpoints (for HF Spaces validator)

| Requirement | Priority | Detail |
|---|---|---|
| FR-007 | MUST | `POST /reset` returns HTTP 200 with JSON observation |
| FR-008 | MUST | `POST /step` returns HTTP 200 with JSON StepResult |
| FR-009 | MUST | `GET /state` returns HTTP 200 with JSON EnvironmentState |
| FR-010 | MUST | `GET /health` returns HTTP 200 `{"status": "ok"}` |
| FR-011 | SHOULD | `GET /` returns HTML documentation page |

### 5.3 Task Requirements

| Requirement | Priority | Detail |
|---|---|---|
| FR-012 | MUST | Minimum 3 tasks with distinct scenarios |
| FR-013 | MUST | Tasks range easy → medium → hard |
| FR-014 | MUST | Each task has a deterministic grader returning 0.0–1.0 |
| FR-015 | MUST | Graders have sub-scores (diagnosis, investigation, decision, routing, closure, efficiency) |
| FR-016 | MUST | Hard task must not be solvable by simple heuristics |

### 5.4 Reward Function

| Requirement | Priority | Detail |
|---|---|---|
| FR-017 | MUST | Reward is shaped across the full trajectory |
| FR-018 | MUST | Dangerous actions (approving fraud) produce large negative rewards |
| FR-019 | MUST | Repeating already-completed actions penalised lightly |
| FR-020 | MUST | Exceeding step budget penalised (SLA concept) |
| FR-021 | SHOULD | Efficiency bonus for completing faster than optimal |

### 5.5 Inference Script

| Requirement | Priority | Detail |
|---|---|---|
| FR-022 | MUST | Script named exactly `inference.py` in root directory |
| FR-023 | MUST | Uses OpenAI client (not Anthropic SDK) |
| FR-024 | MUST | Reads `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN` from environment |
| FR-025 | MUST | Emits `[START]`, `[STEP]`, `[END]` lines to stdout exactly as specified |
| FR-026 | MUST | Completes all 3 tasks in under 20 minutes on 2 vCPU / 8 GB RAM |
| FR-027 | MUST | Produces reproducible scores with the same seed |
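The stdout protocol in FR-025 is easiest to keep correct when the three line formats live in one place. A sketch of such helpers — the JSON payloads after each tag are assumptions here, not the official format, so the real `inference.py` must follow the spec verbatim:

```python
import json
import os

def log_start(task_id: str) -> str:
    # Hypothetical payload; the competition spec defines the real one (FR-025).
    return f"[START] {json.dumps({'task_id': task_id})}"

def log_step(step: int, reward: float, done: bool) -> str:
    return f"[STEP] {json.dumps({'step': step, 'reward': reward, 'done': done})}"

def log_end(task_id: str, score: float) -> str:
    return f"[END] {json.dumps({'task_id': task_id, 'score': score})}"

# FR-024 / NFR-011: configuration comes from the environment, never hardcoded.
API_BASE_URL = os.environ.get("API_BASE_URL", "")
MODEL_NAME = os.environ.get("MODEL_NAME", "")
HF_TOKEN = os.environ.get("HF_TOKEN", "")
```

Centralising the formatting also makes FR-027 (reproducible runs) easier to verify, since logs from two runs can be diffed line by line.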

### 5.6 Deployment

| Requirement | Priority | Detail |
|---|---|---|
| FR-028 | MUST | Dockerfile builds cleanly; container needs no internet access at runtime |
| FR-029 | MUST | Container starts and serves on port 7860 |
| FR-030 | MUST | HF Spaces `POST /reset` returns 200 |
| FR-031 | MUST | README documents setup, action space, observation space, tasks, baseline scores |

---

## 6. Non-Functional Requirements

| ID | Category | Requirement |
|---|---|---|
| NFR-001 | Performance | `reset()` completes in < 100ms |
| NFR-002 | Performance | `step()` completes in < 50ms |
| NFR-003 | Performance | Full 3-task inference run completes in < 20 minutes |
| NFR-004 | Resource | Runs on 2 vCPU, 8 GB RAM — no GPU required |
| NFR-005 | Correctness | Grader output is deterministic — same actions = same score |
| NFR-006 | Correctness | Reward values are deterministic — no randomness in simulation |
| NFR-007 | Code quality | No bare `except:` blocks — all exceptions typed |
| NFR-008 | Code quality | All functions have docstrings |
| NFR-009 | Code quality | Type hints on all function signatures |
| NFR-010 | Portability | Zero OS-specific code — runs on Linux (Docker) |
| NFR-011 | Security | No hardcoded credentials anywhere in code |
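NFR-005 and NFR-006 can be enforced with a replay gate: run the same action sequence on two fresh environments and require identical rewards and grade. A sketch of that gate, with a stub environment standing in for the real `InvoiceExceptionEnv` so the snippet is runnable:

```python
from collections import namedtuple

# Stand-in result type; the real env returns the typed StepResult model.
FakeStepResult = namedtuple("FakeStepResult", "observation reward done info")

class StubEnv:
    """Deterministic stand-in for InvoiceExceptionEnv (illustration only)."""
    def __init__(self):
        self._rewards = []

    def reset(self, task_id):
        self._rewards = []
        return {"task_id": task_id}

    def step(self, action):
        reward = round(0.01 * len(str(action)), 4)
        self._rewards.append(reward)
        return FakeStepResult({}, reward, False, {})

    def grade(self):
        return {"score": min(1.0, round(sum(self._rewards), 4))}

def assert_deterministic(make_env, actions, task_id="task1_price_variance"):
    """NFR-005/006 gate: identical replays must give identical rewards and grade."""
    def rollout():
        env = make_env()
        env.reset(task_id)
        rewards = [env.step(a).reward for a in actions]
        return rewards, env.grade()["score"]
    assert rollout() == rollout(), "environment is not deterministic"

assert_deterministic(StubEnv, ["run_check(po_match)", "close_case(summary)"])
```

Running this gate in CI with the real environment and each task's optimal path catches accidental nondeterminism (e.g. an unseeded `random` call) before submission.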

---

## 7. System Architecture

```
┌───────────────────────────────────────────────────────────┐
│                HF Space / Docker Container                │
│                                                           │
│  ┌──────────────┐    ┌──────────────────────────────────┐ │
│  │  Gradio UI   │    │         FastAPI Server           │ │
│  │  (port 7860) │    │  POST /reset   GET /state        │ │
│  │              │    │  POST /step    GET /health       │ │
│  └──────┬───────┘    └───────────────┬──────────────────┘ │
│         │                            │                    │
│         └────────────┬───────────────┘                    │
│                      │                                    │
│           ┌──────────▼──────────────┐                     │
│           │   InvoiceExceptionEnv   │                     │
│           │  reset() step() state() │                     │
│           │  grade()                │                     │
│           └──────────┬──────────────┘                     │
│                      │                                    │
│           ┌──────────▼──────────────┐                     │
│           │      Task Registry      │                     │
│           │  task1_price_variance   │                     │
│           │  task2_duplicate_tax    │                     │
│           │  task3_compound_fraud   │                     │
│           └─────────────────────────┘                     │
└───────────────────────────────────────────────────────────┘

┌───────────────────────────────────────────────────────────┐
│                   inference.py (agent)                    │
│                                                           │
│  OpenAI Client → env.reset() → loop {                     │
│    action = LLM(observation_json)                         │
│    result = env.step(action)                              │
│    log [STEP]                                             │
│  } → log [END]                                            │
└───────────────────────────────────────────────────────────┘
```

### 7.1 Data Flow

```
Episode start
    │
    ▼
reset(task_id) ──► builds DocumentPacket + EpisodeData ──► EnvironmentState
    │
    ▼
step(action) ──► dispatch to task simulator ──► (reward, info)
    │                                               │
    ▼                                               ▼
EpisodeData updated ◄─────────────────────── append to history
    │
    ▼
new EnvironmentState built ──► StepResult(obs, reward, done, info)
    │
    ▼
grade() ──► EpisodeData ──► grader logic ──► Dict[str, float]
```
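The flow above can be condensed into a skeleton `step()`. Everything beyond the PRD's own names (`simulate`, `build_state`, the `MiniEnv` class, the hardcoded rewards) is assumed here purely for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class StepResult:
    observation: dict
    reward: float
    done: bool
    info: dict

@dataclass
class MiniEnv:
    """Skeleton of the data flow: dispatch -> history -> new state -> StepResult."""
    max_steps: int = 20
    history: list = field(default_factory=list)

    def simulate(self, action):
        # Stand-in for the per-task simulator dispatch (rewards are made up).
        return (0.1 if action["type"] == "run_check" else 0.0), {}

    def step(self, action):
        reward, info = self.simulate(action)              # dispatch to task simulator
        self.history.append((action, reward))             # append to history
        done = action["type"] == "close_case" or len(self.history) >= self.max_steps
        observation = {"steps_taken": len(self.history)}  # new EnvironmentState
        return StepResult(observation, reward, done, info)
```

The real environment would rebuild a full typed `EnvironmentState` each step; the dict here only marks where that happens in the flow.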

---

## 8. Task Specifications

### 8.1 Task 1 β€” Price Variance Exception (Easy)

**Scenario:** Office stationery invoice arrives 3.08% above the PO amount. Company tolerance policy is ±2% for auto-approval. The supplier cites a raw-material price increase that the procurement team approved verbally (and confirmed by email) but never formalised in the PO.

**What makes it easy:** Single root cause, all signals are benign (no fraud), the fix is straightforward (confirm with procurement, approve with PO amendment).

**Optimal path (10 steps):**
```
run_check(po_match)
run_check(tolerance_rule)              ← finds 3.08% > 2%
cross_check(unit_price, invoice, po)   ← finds two mismatched lines
run_check(grn_match)                   ← confirms delivery complete
query_supplier(reason for increase)    ← gets email confirmation
query_internal(procurement, confirm?)  ← procurement confirms verbal approval
apply_rule(tolerance_exception_approval)
make_decision(approve, reason)
route_to(procurement, raise PO amendment)
close_case(summary)
```

**Pitfalls:**
- Rejecting without querying supplier → wrong decision, score capped at ~0.35
- Approving without checking tolerance rule → policy violation, −0.15
- Disabling fraud checks that aren't needed → wasted steps

**Grader weights:**
| Sub-score | Max | Key signals |
|---|---|---|
| Diagnosis | 0.32 | tolerance_rule check, price mismatch found |
| Investigation | 0.30 | supplier queried, procurement confirmed |
| Decision | 0.18 | correct approve decision |
| Routing | 0.12 | PO amendment sent to procurement |
| Closure | 0.08 | case closed with summary |
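Each grader table combines sub-scores the same way: every row contributes its achieved fraction times its max weight, and the overall score is the sum. A sketch of that aggregation using Task 1's weights (the key names and clamping behaviour are assumptions about the grader's internals):

```python
# Task 1 weights from the table above; they sum to 1.00 by design.
WEIGHTS = {
    "diagnosis": 0.32,
    "investigation": 0.30,
    "decision": 0.18,
    "routing": 0.12,
    "closure": 0.08,
}

def grade(fractions: dict) -> dict:
    """fractions: sub-score name -> achieved fraction in [0, 1]."""
    out = {}
    for name, weight in WEIGHTS.items():
        achieved = min(max(fractions.get(name, 0.0), 0.0), 1.0)  # clamp to [0, 1]
        out[f"{name}_score"] = round(weight * achieved, 4)
    out["score"] = round(sum(out.values()), 4)
    return out
```

A perfect episode yields `score == 1.0`, and because each component is clamped, no action sequence can push the overall score outside FR-014's 0.0–1.0 range.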

---

### 8.2 Task 2 β€” Duplicate Invoice with Hidden Tax Error (Medium)

**Scenario:** Logistics supplier submits INV-2024-891. The system flags it as a possible duplicate of INV-2024-819 (already paid). The invoice numbers differ by a digit transposition (8-9-1 vs 8-1-9). However: the original invoice applied 15% GST (wrong rate); the correct rate is 18%. The company therefore underpaid ₹3,240 in tax on the original invoice. The new invoice applies the correct rate. So it is simultaneously a duplicate AND a legitimate correction.

**What makes it medium:** The agent must not just detect the duplicate and reject — it must also detect the tax error in the *original* paid invoice and partially approve the correction delta (₹3,240). A simple "reject all duplicates" rule misses this and loses significant score.

**Optimal path (11 steps):**
```
run_check(duplicate_detection)         ← finds INV-2024-819
inspect_field(invoice, invoice_number) ← spots digit transposition
run_check(tax_calculation_verify)      ← finds 15% vs 18% on original
cross_check(tax_amount, invoice, payment_history) ← confirms ₹3,240 delta
query_internal(finance, confirm underpayment?)
query_supplier(clarify relationship between invoices)
apply_rule(partial_approval)
apply_rule(credit_note_request)
make_decision(partial_approve, reason)
route_to(finance, process ₹3,240 correction)
close_case(summary)
```

**Pitfalls:**
- Full rejection (catches duplicate, misses correction): score ~0.35
- Full approval (pays full duplicate): score ~0.0, with a −0.15 reward penalty
- Partial approve without credit note: score ~0.60

**Grader weights:**
| Sub-score | Max | Key signals |
|---|---|---|
| Diagnosis | 0.30 | duplicate found, tax error found |
| Investigation | 0.32 | finance queried, supplier queried, rules applied |
| Decision | 0.20 | partial_approve correct |
| Routing | 0.08 | finance routed for correction |
| Closure | 0.06 | case closed |

---

### 8.3 Task 3 β€” Compound Fraud Signals (Hard)

**Scenario:** IT equipment supplier submits a ₹8,47,500 invoice for 15 laptops. The system flags a bank account change. But there are four simultaneous fraud signals:

1. **Bank account fraud:** Invoice has a different bank account than supplier master. The change request came from a lookalike email domain (techcore-solutions.com vs registered techcore-solutions.in) — a classic Business Email Compromise (BEC) attack.
2. **GSTIN fraud:** The GST number on the invoice belongs to a *different company* — "TechCore Trading Pvt Ltd" in Delhi, not "TechCore Solutions."
3. **Quantity mismatch:** Invoice claims 15 laptops. GRN confirms only 13 received. 2 are still in transit.
4. **Price inflation:** ₹56,500/unit invoiced vs ₹52,000/unit on PO — 8.65% above PO with no approved revision.

**Critical trap:** If the agent queries the supplier via email, it contacts the fraudster, who will try to convince the agent to approve. The agent must use the phone/registered channel instead.

**What makes it hard:** The agent must hold four independent fraud signals simultaneously, not stop after finding one, follow the correct communication-channel policy, and route to both legal and security (not just finance).

**Optimal path (17 steps):**
```
inspect_field(invoice, bank_account)            ← sees mismatch
cross_check(bank_account, invoice, supplier_master)
run_check(bank_account_verification)            ← finds lookalike domain
run_check(email_domain_verification)
inspect_field(invoice, supplier_gstin)
run_check(gst_verification)                     ← finds GST belongs to different entity
cross_check(gstin, invoice, supplier_master)
inspect_field(grn, items_received)
run_check(grn_match)                            ← 13 vs 15
run_check(price_check)                          ← 8.65% above PO
query_supplier(confirm details, channel=phone)  ← supplier confirms fraud
query_internal(security, investigate BEC)
apply_rule(fraud_hold)
make_decision(reject, all fraud signals documented)
route_to(legal, initiate supplier audit)
route_to(security, BEC investigation)
close_case(fraud report summary)
```

**Critical pitfall — contacting via email:** −0.15 reward, and the agent receives the fraudster's response trying to get payment approved. The grader penalises this heavily.

**Grader weights:**
| Sub-score | Max | Key signals |
|---|---|---|
| Diagnosis | 0.50 | bank fraud, GST fraud, quantity mismatch, domain lookalike, price inflation |
| Investigation | 0.14 | phone contact (not email), security queried, legal queried |
| Decision | 0.18 | reject with all signals documented |
| Routing | 0.12 | legal + security routed |
| Closure | 0.06 | case closed with fraud report |

**Scoring thresholds:**
- Find 1 signal: ~0.20
- Find 2 signals: ~0.40
- Find 3 signals: ~0.60
- Find all 4 + correct routing: ~0.90+

---

## 9. Reward Design

### 9.1 Philosophy

The reward function is designed around three principles:

**Principle 1: Every informative action gets signal.** Agents should learn that investigating is always better than guessing. Each relevant inspection, check, or query returns a positive reward proportional to how diagnostic that action is.

**Principle 2: Dangerous actions get crushed.** Approving a fraudulent invoice, disabling security controls, or contacting a supplier via a compromised channel are not mistakes — they are catastrophic errors. These must receive large negative rewards so agents learn to avoid them unconditionally.

**Principle 3: The grader is the ground truth, the shaped reward is the training signal.** The episode reward is shaped to help agents learn. The grader score at the end is what actually measures quality.

### 9.2 Reward Table

| Action | Reward Range | Notes |
|---|---|---|
| `inspect_field` (relevant) | +0.01 to +0.14 | Higher for fields that reveal anomalies |
| `inspect_field` (irrelevant) | +0.01 | Still small positive — exploration is fine |
| `cross_check` (finds mismatch) | +0.12 to +0.15 | Diagnosis reward |
| `cross_check` (no mismatch) | +0.02 | Confirms a clean field |
| `run_check` (finds issue) | +0.08 to +0.18 | Higher for more diagnostic checks |
| `run_check` (clean) | +0.01 to +0.06 | Clean checks still confirm facts |
| `query_supplier` (phone) | +0.10 to +0.15 | Correct channel |
| `query_supplier` (email, fraud task) | −0.15 | Contacts fraudster |
| `query_internal` (key dept) | +0.04 to +0.12 | Higher for departments that add critical info |
| `apply_rule` (correct rule) | +0.08 to +0.12 | Applying the right policy pathway |
| `apply_rule` (wrong rule) | −0.05 to −0.10 | Misapplying policy |
| `make_decision` (correct) | +0.18 to +0.28 | Correct decision based on evidence |
| `make_decision` (wrong) | −0.10 to −0.40 | Severity scales with how wrong |
| `route_to` (correct team) | +0.06 to +0.14 | Right escalation path |
| `close_case` (complete) | +0.06 to +0.12 | Depends on decision quality |
| Repeat action | −0.02 to −0.05 | Light penalty, not catastrophic |
| SLA breach (exceed max steps) | −0.10 | One-time penalty at end |
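The table implies a simple dispatch from action outcome to reward. An illustrative sketch, where the specific values within each range (and the assumption that the email penalty applies only on the fraud task) are choices made here for the example, not the environment's actual constants:

```python
def step_reward(action_type: str, found_issue: bool = False,
                is_repeat: bool = False, channel: str = None,
                fraud_task: bool = False) -> float:
    """Shaped-reward sketch mirroring the table; exact values are assumed."""
    if is_repeat:
        return -0.03  # light repeat penalty (FR-019)
    if action_type == "query_supplier" and channel == "email" and fraud_task:
        return -0.15  # compromised channel on the fraud task
    table = {
        # action_type: (reward when an issue is found, reward when clean)
        "inspect_field": (0.07, 0.01),
        "cross_check":   (0.13, 0.02),
        "run_check":     (0.13, 0.03),
    }
    if action_type in table:
        hit, clean = table[action_type]
        return hit if found_issue else clean
    return 0.0
```

The key property to preserve in any real implementation is the ordering: diagnostic hits pay more than clean checks, clean checks pay more than nothing, and catastrophic actions pay heavily negative.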

### 9.3 Episode Score vs Cumulative Reward

These are different numbers:
- **Cumulative reward** is the sum of step rewards. It is used as a training signal.
- **Episode score** (from `grade()`) is the holistic quality assessment. It is what the hackathon evaluates.

Agents should be optimised on the grade score, not the cumulative reward alone.

---

## 10. Evaluation Criteria

### 10.1 Hackathon Scoring

| Criterion | Weight | What judges look for |
|---|---|---|
| Real-world utility | 30% | Would an enterprise actually use this? Does it model the task faithfully? |
| Task & grader quality | 25% | Clear objectives, accurate grading, genuine difficulty progression, frontier models challenged |
| Environment design | 20% | Clean state management, good action/observation spaces, shaped reward, sensible episode boundaries |
| Code quality & spec compliance | 15% | OpenEnv spec passes, Dockerfile works, baseline reproduces, typed models |
| Creativity & novelty | 10% | Novel domain, interesting mechanics, original reward design |

### 10.2 Automated Gates (must all pass)

1. HF Space deploys — `POST /reset` returns 200
2. `openenv validate` passes
3. `docker build` succeeds
4. `python inference.py` runs without error, produces scores
5. All 3 tasks enumerated, grader scores verified in [0.0, 1.0]

### 10.3 Phase 2 β€” Agentic Evaluation

The hackathon will run a standard open LLM agent (e.g. Nemotron 3 Super) against the environment. The environment must:
- Not be trivially solvable by a greedy agent
- Produce score variance across tasks (not all the same)
- Penalise clearly suboptimal behaviour

### 10.4 Disqualifiers

- Environment does not deploy or respond to `/reset`
- Graders that always return the same score regardless of actions
- `inference.py` not in root, or not using OpenAI client
- No baseline scores produced
- Plagiarised environment

---

## 11. API Contract

### 11.1 Environment Python API

```python
env = InvoiceExceptionEnv(seed=42)

# Reset — returns EnvironmentState
obs: EnvironmentState = env.reset("task1_price_variance")

# Step — returns StepResult
result: StepResult = env.step(Action.run_check("tolerance_rule"))
# result.observation  →  EnvironmentState
# result.reward       →  float
# result.done         →  bool
# result.info         →  dict

# State — non-destructive peek
obs: EnvironmentState = env.state()

# Grade — run grader on episode
scores: dict = env.grade()
# scores["score"]           → 0.0–1.0 overall
# scores["diagnosis_score"] → float
# scores["decision_score"]  → float
# ...
```

### 11.2 HTTP API

```
POST /reset
Body: {"task_id": "task1_price_variance"}   (optional — random if omitted)
Response: 200 EnvironmentState JSON

POST /step
Body: {"type": "run_check", "params": {"check_name": "tolerance_rule"}}
Response: 200 StepResult JSON

GET /state
Response: 200 EnvironmentState JSON

POST /grade
Response: 200 {"score": 0.85, "diagnosis_score": ...}

GET /tasks
Response: 200 ["task1_price_variance", "task2_duplicate_tax", "task3_compound_fraud"]

GET /health
Response: 200 {"status": "ok", "version": "1.0.0"}
```

### 11.3 Action Schema

```json
{
  "type": "run_check",
  "params": {"check_name": "tolerance_rule"}
}

{
  "type": "inspect_field",
  "params": {"document": "invoice", "field": "bank_account"}
}

{
  "type": "cross_check",
  "params": {"field": "unit_price", "doc_a": "invoice", "doc_b": "po"}
}

{
  "type": "query_supplier",
  "params": {"question": "Why does your bank account differ?", "channel": "phone"}
}

{
  "type": "query_internal",
  "params": {"department": "procurement", "question": "Did you approve this price?"}
}

{
  "type": "apply_rule",
  "params": {"rule_id": "tolerance_exception_approval"}
}

{
  "type": "make_decision",
  "params": {"decision": "approve", "reason": "Verbal approval confirmed by procurement."}
}

{
  "type": "route_to",
  "params": {"team": "procurement", "notes": "Please raise PO amendment for the price variance."}
}

{
  "type": "close_case",
  "params": {"summary": "Invoice approved. PO amendment requested. Case closed."}
}
```
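Since every action is a `type` plus a `params` dict, a validator only needs a table of required keys per type. A stdlib sketch derived from the examples above (the real environment validates with typed Pydantic v2 models; this dict-based check is illustrative only):

```python
# Required params per action type, read off the schema examples above.
ACTION_PARAMS = {
    "run_check": {"check_name"},
    "inspect_field": {"document", "field"},
    "cross_check": {"field", "doc_a", "doc_b"},
    "query_supplier": {"question", "channel"},
    "query_internal": {"department", "question"},
    "apply_rule": {"rule_id"},
    "make_decision": {"decision", "reason"},
    "route_to": {"team", "notes"},
    "close_case": {"summary"},
}

def validate_action(action: dict) -> None:
    """Raise ValueError if the action does not match the schema."""
    action_type = action.get("type")
    if action_type not in ACTION_PARAMS:
        raise ValueError(f"unknown action type: {action_type!r}")
    missing = ACTION_PARAMS[action_type] - set(action.get("params", {}))
    if missing:
        raise ValueError(f"{action_type} missing params: {sorted(missing)}")
```

In the Pydantic version this table would become a discriminated union on `type`, which gives the same rejection behaviour plus per-field type checking for free.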

---

## 12. File Structure

```
invoice-exception-handler/
│
├── README.md                    # Full setup + usage guide
├── openenv.yaml                 # OpenEnv spec (must pass openenv validate)
├── Dockerfile                   # Single-stage Python 3.11-slim build
├── requirements.txt             # Pinned dependencies
├── inference.py                 # Competition inference script (MUST be here)
├── app.py                       # Gradio + FastAPI entrypoint for HF Spaces
│
├── env/
│   ├── __init__.py
│   ├── models.py                # All Pydantic typed models
│   ├── environment.py           # InvoiceExceptionEnv class
│   └── tasks.py                 # 3 task classes + graders + EpisodeData
│
└── documents/
    ├── PRD-001-product-requirements.md    # This document
    ├── CHANGELOG.md                       # Every code change recorded
    ├── ARCHITECTURE.md                    # System diagram + decisions
    └── BASELINE-SCORES.md                 # Reproducible benchmark results
```

---

## 13. Out of Scope

The following are explicitly not part of v1.0:

- Real database connectivity (the environment is fully simulated)
- Multi-agent scenarios (one agent per episode)
- Partial observability (agent sees all documents from the start)
- User interface for human play (nice-to-have but not required for submission)
- Real supplier APIs (simulation only)
- Currency other than INR (can be extended in v1.1)
- Tasks beyond 3 (can be extended)

---

## 14. Change Log

| Version | Date | Author | Change |
|---|---|---|---|
| 0.1.0 | 2025-01-18 | [Author] | Initial draft — problem definition and task sketches |
| 0.2.0 | 2025-01-19 | [Author] | Added reward design section, API contract, file structure |
| 1.0.0 | 2025-01-20 | [Author] | Final version — all sections complete, ready for implementation |