
Product Requirements Document

Invoice Exception Handler — OpenEnv Agent Learning Environment

Document ID: PRD-001
Version: 1.0.0
Status: Final
Author: [Your Name]
Last Updated: 2025-01-20
Classification: Internal / Hackathon Submission


Table of Contents

  1. Executive Summary
  2. Problem Statement
  3. Product Vision
  4. Stakeholders
  5. Functional Requirements
  6. Non-Functional Requirements
  7. System Architecture
  8. Task Specifications
  9. Reward Design
  10. Evaluation Criteria
  11. API Contract
  12. File Structure
  13. Out of Scope
  14. Change Log

1. Executive Summary

The Invoice Exception Handler is a real-world agent learning environment built for the OpenEnv standard. It simulates the accounts payable (AP) exception-handling workflow that nearly every business runs daily: investigating flagged invoices before payment is approved.

The environment places an AI agent in the role of an AP analyst. The agent receives a document packet (Purchase Order, Invoice, Goods Receipt Note, Supplier Master), reads an exception flag, and must investigate the root cause, make a decision, route the case to the right team, and close it cleanly. Every action has realistic financial and compliance consequences.

The environment ships with three tasks of increasing difficulty — price variance (easy), duplicate with hidden tax error (medium), and compound fraud with four simultaneous signals (hard).


2. Problem Statement

2.1 The Real-World Pain

Every company that buys goods or services from suppliers receives invoices. Typically 5–15% of all invoices have exceptions — discrepancies between what was ordered (PO), what was received (GRN), and what was invoiced. These exceptions are currently handled by accounts payable clerks who manually:

  1. Pull the original Purchase Order
  2. Compare it line by line against the invoice
  3. Check the Goods Receipt Note
  4. Run validation checks
  5. Query internal teams or the supplier
  6. Make a decision (approve / reject / hold / partial approve)
  7. Route the case and document everything

At a mid-size company this is 2–4 hours of analyst time per day; at enterprise scale it occupies entire departments. The AP automation market built around this pain exceeds $3 billion annually.

2.2 The AI Gap

No existing OpenEnv benchmark tests an agent's ability to:

  • Reason across multiple documents simultaneously
  • Apply business rules with thresholds and exceptions
  • Detect fraud signals that require cross-referencing
  • Make nuanced decisions (partial approve, hold, escalate)
  • Know not to contact a supplier via a potentially compromised channel

This gap means agents trained on existing benchmarks cannot be evaluated or trained on one of the most common finance workflows in enterprise software.

2.3 What This Environment Fixes

The Invoice Exception Handler provides:

  • A clean, typed, deterministic simulation of AP exception handling
  • Three tasks that test a progression of reasoning: threshold logic → duplicate detection → multi-signal fraud
  • Shaped rewards that signal progress at every step, not just at episode end
  • A fully deployable environment that conforms to the OpenEnv spec

3. Product Vision

An agent that scores well in this environment is demonstrably better at AP exception handling than the average accounts payable clerk — and is ready to be deployed in real enterprise finance workflows.

The environment is designed so that:

  • The reward signal is meaningful enough to actually train agents on, not just evaluate them
  • The hard task (compound fraud) remains genuinely difficult for frontier models
  • Every score between 0.0 and 1.0 reflects a real quality difference in agent behavior

4. Stakeholders

| Stakeholder | Role | Interest |
|---|---|---|
| Hackathon judges (Meta, HF engineers) | Evaluators | Real-world utility, code quality, creativity |
| OpenEnv automated validator | Gatekeeper | Spec compliance, deployment health |
| AI researchers | Primary users post-submission | Training and evaluating AP agents |
| Enterprise software companies | Secondary users | Evaluating models for AP automation products |

5. Functional Requirements

5.1 Core Environment API

| Requirement | Priority | Detail |
|---|---|---|
| FR-001 | MUST | env.reset(task_id) returns a clean EnvironmentState |
| FR-002 | MUST | env.step(action) returns StepResult(observation, reward, done, info) |
| FR-003 | MUST | env.state() returns the current state without advancing the episode |
| FR-004 | MUST | env.grade() returns a score dict with an overall score in 0.0–1.0 |
| FR-005 | MUST | All models are typed Pydantic v2 with no untyped fields |
| FR-006 | MUST | openenv.yaml passes openenv validate |
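A minimal sketch of this contract (field sets here are illustrative; the real models are typed Pydantic v2 per FR-005):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

@dataclass
class EnvironmentState:
    """Observation returned by reset()/step() (illustrative fields)."""
    task_id: str
    step_count: int = 0
    last_result: Optional[str] = None

@dataclass
class StepResult:
    """Return type of step(), per FR-002."""
    observation: EnvironmentState
    reward: float
    done: bool
    info: Dict[str, Any] = field(default_factory=dict)

class InvoiceExceptionEnv:
    """Skeleton of the environment contract (FR-001 to FR-004)."""

    def __init__(self, seed: int = 42) -> None:
        self.seed = seed
        self._state: Optional[EnvironmentState] = None

    def reset(self, task_id: str) -> EnvironmentState:
        """FR-001: start a clean episode for the given task."""
        self._state = EnvironmentState(task_id=task_id)
        return self._state

    def step(self, action: Dict[str, Any]) -> StepResult:
        """FR-002: advance the episode by one action."""
        assert self._state is not None, "call reset() first"
        self._state.step_count += 1
        return StepResult(self._state, reward=0.0, done=False)

    def state(self) -> EnvironmentState:
        """FR-003: non-destructive peek at the current state."""
        assert self._state is not None
        return self._state

    def grade(self) -> Dict[str, float]:
        """FR-004: overall score in [0.0, 1.0] plus sub-scores."""
        return {"score": 0.0}
```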

5.2 HTTP Endpoints (for HF Spaces validator)

| Requirement | Priority | Detail |
|---|---|---|
| FR-007 | MUST | POST /reset returns HTTP 200 with a JSON observation |
| FR-008 | MUST | POST /step returns HTTP 200 with a JSON StepResult |
| FR-009 | MUST | GET /state returns HTTP 200 with a JSON EnvironmentState |
| FR-010 | MUST | GET /health returns HTTP 200 {"status": "ok"} |
| FR-011 | SHOULD | GET / returns an HTML documentation page |

5.3 Task Requirements

| Requirement | Priority | Detail |
|---|---|---|
| FR-012 | MUST | Minimum 3 tasks with distinct scenarios |
| FR-013 | MUST | Tasks range easy → medium → hard |
| FR-014 | MUST | Each task has a deterministic grader returning 0.0–1.0 |
| FR-015 | MUST | Graders report sub-scores (diagnosis, investigation, decision, routing, closure, efficiency) |
| FR-016 | MUST | The hard task must not be solvable by simple heuristics |

5.4 Reward Function

| Requirement | Priority | Detail |
|---|---|---|
| FR-017 | MUST | Reward is shaped across the full trajectory |
| FR-018 | MUST | Dangerous actions (e.g. approving fraud) produce large negative rewards |
| FR-019 | MUST | Repeating already-completed actions is penalised lightly |
| FR-020 | MUST | Exceeding the step budget is penalised (SLA concept) |
| FR-021 | SHOULD | Efficiency bonus for completing faster than the optimal path |

5.5 Inference Script

| Requirement | Priority | Detail |
|---|---|---|
| FR-022 | MUST | Script named exactly inference.py in the root directory |
| FR-023 | MUST | Uses the OpenAI client (not the Anthropic SDK) |
| FR-024 | MUST | Reads API_BASE_URL, MODEL_NAME, HF_TOKEN from the environment |
| FR-025 | MUST | Emits [START], [STEP], [END] lines to stdout exactly as specified |
| FR-026 | MUST | Completes all 3 tasks in under 20 minutes on 2 vCPU / 8 GB RAM |
| FR-027 | MUST | Produces reproducible scores with the same seed |

5.6 Deployment

| Requirement | Priority | Detail |
|---|---|---|
| FR-028 | MUST | Dockerfile builds cleanly; no internet access required at run time |
| FR-029 | MUST | Container starts and serves on port 7860 |
| FR-030 | MUST | On HF Spaces, POST /reset returns 200 |
| FR-031 | MUST | README documents setup, action space, observation space, tasks, and baseline scores |

6. Non-Functional Requirements

| ID | Category | Requirement |
|---|---|---|
| NFR-001 | Performance | reset() completes in < 100 ms |
| NFR-002 | Performance | step() completes in < 50 ms |
| NFR-003 | Performance | Full 3-task inference run completes in < 20 minutes |
| NFR-004 | Resource | Runs on 2 vCPU, 8 GB RAM — no GPU required |
| NFR-005 | Correctness | Grader output is deterministic — same actions, same score |
| NFR-006 | Correctness | Reward values are deterministic — no randomness in the simulation |
| NFR-007 | Code quality | No bare except: blocks — all exceptions typed |
| NFR-008 | Code quality | All functions have docstrings |
| NFR-009 | Code quality | Type hints on all function signatures |
| NFR-010 | Portability | No OS-specific code — runs on Linux (Docker) |
| NFR-011 | Security | No hardcoded credentials anywhere in the code |

7. System Architecture

┌─────────────────────────────────────────────────────────────┐
│                  HF Space / Docker Container                │
│                                                             │
│  ┌──────────────┐    ┌──────────────────────────────────┐   │
│  │  Gradio UI   │    │         FastAPI Server           │   │
│  │  (port 7860) │    │  POST /reset  GET /state         │   │
│  │              │    │  POST /step   GET /health        │   │
│  └──────┬───────┘    └───────────────┬──────────────────┘   │
│         │                            │                      │
│         └───────────┬────────────────┘                      │
│                     │                                       │
│          ┌──────────▼──────────────┐                        │
│          │   InvoiceExceptionEnv   │                        │
│          │  reset() step() state() │                        │
│          │  grade()                │                        │
│          └──────────┬──────────────┘                        │
│                     │                                       │
│          ┌──────────▼──────────────┐                        │
│          │      Task Registry      │                        │
│          │  task1_price_variance   │                        │
│          │  task2_duplicate_tax    │                        │
│          │  task3_compound_fraud   │                        │
│          └─────────────────────────┘                        │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                    inference.py (agent)                     │
│                                                             │
│  OpenAI Client → env.reset() → loop {                       │
│    action = LLM(observation_json)                           │
│    result = env.step(action)                                │
│    log [STEP]                                               │
│  } → log [END]                                              │
└─────────────────────────────────────────────────────────────┘

7.1 Data Flow

Episode start
    │
    ▼
reset(task_id) ──► builds DocumentPacket + EpisodeData ──► EnvironmentState
    │
    ▼
step(action) ──► dispatch to task simulator ──► (reward, info)
    │                                               │
    ▼                                               ▼
EpisodeData updated ◄──────────────────── append to history
    │
    ▼
new EnvironmentState built ──► StepResult(obs, reward, done, info)
    │
    ▼
grade() ──► EpisodeData ──► grader logic ──► Dict[str, float]
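The EpisodeData record that this flow appends to can be sketched as follows (field names are illustrative, not the final model):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class EpisodeData:
    """Mutable episode record; graders read this after close_case."""
    task_id: str
    history: List[Dict[str, Any]] = field(default_factory=list)
    cumulative_reward: float = 0.0
    done: bool = False

    def record(self, action: Dict[str, Any], reward: float,
               info: Dict[str, Any]) -> None:
        """Append one step, mirroring the 'append to history' stage."""
        self.history.append({"action": action, "reward": reward, "info": info})
        self.cumulative_reward += reward
```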

8. Task Specifications

8.1 Task 1 β€” Price Variance Exception (Easy)

Scenario: An office stationery invoice arrives 3.08% above the PO amount. Company tolerance policy is ±2% for auto-approval. The supplier holds an email from the procurement team recording a verbal approval of a raw-material price increase that was never formalised in the PO.

What makes it easy: Single root cause, all signals are benign (no fraud), the fix is straightforward (confirm with procurement, approve with PO amendment).

Optimal path (10 steps):

run_check(po_match)
run_check(tolerance_rule)              ← finds 3.08% > 2%
cross_check(unit_price, invoice, po)   ← finds two mismatched lines
run_check(grn_match)                   ← confirms delivery complete
query_supplier(reason for increase)    ← gets email confirmation
query_internal(procurement, confirm?)  ← procurement confirms verbal approval
apply_rule(tolerance_exception_approval)
make_decision(approve, reason)
route_to(procurement, raise PO amendment)
close_case(summary)

Pitfalls:

  • Rejecting without querying the supplier → wrong decision, score capped at ~0.35
  • Approving without checking the tolerance rule → policy violation, −0.15
  • Disabling fraud checks that aren't needed → wasted steps

Grader weights:

| Sub-score | Max | Key signals |
|---|---|---|
| Diagnosis | 0.32 | tolerance_rule check, price mismatch found |
| Investigation | 0.30 | supplier queried, procurement confirmed |
| Decision | 0.18 | correct approve decision |
| Routing | 0.12 | PO amendment sent to procurement |
| Closure | 0.08 | case closed with summary |
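The weights above can be applied as a simple weighted sum of sub-scores. A sketch with illustrative boolean signals (the real grader reads EpisodeData; the exact signal names are assumptions):

```python
def grade_task1(episode: dict) -> dict:
    """Aggregate Task 1 sub-scores with the PRD weights.

    `episode` is a flat dict of evidence flags; signal names here are
    illustrative placeholders for fields derived from EpisodeData.
    """
    weights = {
        "diagnosis": 0.32, "investigation": 0.30, "decision": 0.18,
        "routing": 0.12, "closure": 0.08,
    }
    sub = {
        # Each sub-score is a fraction in [0, 1] of its key signals hit.
        "diagnosis": 0.5 * episode.get("tolerance_checked", False)
                   + 0.5 * episode.get("price_mismatch_found", False),
        "investigation": 0.5 * episode.get("supplier_queried", False)
                       + 0.5 * episode.get("procurement_confirmed", False),
        "decision": 1.0 if episode.get("decision") == "approve" else 0.0,
        "routing": 1.0 if episode.get("routed_to") == "procurement" else 0.0,
        "closure": 1.0 if episode.get("closed", False) else 0.0,
    }
    scores = {f"{k}_score": weights[k] * sub[k] for k in weights}
    scores["score"] = round(sum(scores.values()), 4)
    return scores
```

A perfect episode sums the column maxima to 1.0; missing signals subtract their weighted share.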

8.2 Task 2 β€” Duplicate Invoice with Hidden Tax Error (Medium)

Scenario: A logistics supplier submits INV-2024-891. The system flags it as a possible duplicate of INV-2024-819 (already paid). The invoice numbers differ by a digit transposition (8-9-1 vs 8-1-9). However, the original invoice applied 15% GST where the correct rate is 18%, so the company underpaid ₹3,240 in tax on the original invoice. The new invoice carries the correct rate. It is therefore simultaneously a duplicate AND a legitimate correction.

What makes it medium: The agent must not simply detect the duplicate and reject — it must also detect the tax error on the original paid invoice and partially approve the correction delta (₹3,240). A blanket "reject all duplicates" rule misses this and loses significant score.
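The ₹3,240 delta is just the 3-percentage-point rate gap applied to the taxable base. A quick check, where the ₹1,08,000 base is inferred from the delta rather than stated in the scenario:

```python
def tax_correction_delta(taxable_base: float, wrong_rate: float,
                         correct_rate: float) -> float:
    """Tax still owed (positive) or over-collected (negative) versus
    the correct rate, rounded to the paisa."""
    return round(taxable_base * (correct_rate - wrong_rate), 2)

# 15% GST applied instead of the correct 18%, on an assumed
# taxable base of ₹1,08,000:
delta = tax_correction_delta(108_000, 0.15, 0.18)  # ₹3,240 still owed
```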

Optimal path (11 steps):

run_check(duplicate_detection)         ← finds INV-2024-819
inspect_field(invoice, invoice_number) ← spots digit transposition
run_check(tax_calculation_verify)      ← finds 15% vs 18% on original
cross_check(tax_amount, invoice, payment_history) ← confirms ₹3,240 delta
query_internal(finance, confirm overpayment?)
query_supplier(clarify relationship between invoices)
apply_rule(partial_approval)
apply_rule(credit_note_request)
make_decision(partial_approve, reason)
route_to(finance, process ₹3,240 correction)
close_case(summary)

Pitfalls:

  • Full rejection (catches duplicate, misses correction): score ~0.35
  • Full approval (pays full duplicate): score −0.15
  • Partial approve without credit note: score ~0.60

Grader weights:

| Sub-score | Max | Key signals |
|---|---|---|
| Diagnosis | 0.30 | duplicate found, tax error found |
| Investigation | 0.32 | finance queried, supplier queried, rules applied |
| Decision | 0.20 | partial_approve correct |
| Routing | 0.08 | finance routed for correction |
| Closure | 0.06 | case closed |

8.3 Task 3 β€” Compound Fraud Signals (Hard)

Scenario: An IT equipment supplier submits an ₹8,47,500 invoice for 15 laptops. The system flags a bank account change, but there are in fact four simultaneous fraud signals:

  1. Bank account fraud: The invoice carries a different bank account than the supplier master. The change request came from a lookalike email domain (techcore-solutions.com vs the registered techcore-solutions.in) — a classic Business Email Compromise (BEC) attack.
  2. GSTIN fraud: The GST number on the invoice belongs to a different company β€” "TechCore Trading Pvt Ltd" in Delhi, not "TechCore Solutions."
  3. Quantity mismatch: Invoice claims 15 laptops. GRN confirms only 13 received. 2 are still in transit.
  4. Price inflation: ₹56,500/unit invoiced vs ₹52,000/unit on the PO — 8.65% above PO with no approved revision.

Critical trap: If the agent queries the supplier via email, it contacts the fraudster, who will try to talk it into approving. The agent must use the phone/registered channel.

What makes it hard: The agent must hold four independent fraud signals simultaneously (not stop after finding the first), follow the correct communication-channel policy, and route to both legal and security (not just finance).

Optimal path (17 steps):

inspect_field(invoice, bank_account)            ← sees mismatch
cross_check(bank_account, invoice, supplier_master)
run_check(bank_account_verification)            ← finds lookalike domain
run_check(email_domain_verification)
inspect_field(invoice, supplier_gstin)
run_check(gst_verification)                     ← finds GST belongs to different entity
cross_check(gstin, invoice, supplier_master)
inspect_field(grn, items_received)
run_check(grn_match)                            ← 13 vs 15
run_check(price_check)                          ← 8.65% above PO
query_supplier(confirm details, channel=phone)  ← supplier confirms fraud
query_internal(security, investigate BEC)
apply_rule(fraud_hold)
make_decision(reject, all fraud signals documented)
route_to(legal, initiate supplier audit)
route_to(security, BEC investigation)
close_case(fraud report summary)

Critical pitfall — contacting via email: −0.15 reward, and the agent receives the fraudster's response pushing for payment approval. The grader penalises this heavily.

Grader weights:

| Sub-score | Max | Key signals |
|---|---|---|
| Diagnosis | 0.50 | bank fraud, GST fraud, quantity mismatch, domain lookalike, price inflation |
| Investigation | 0.20 | phone contact (not email), security queried, legal queried |
| Decision | 0.20 | reject with all signals documented |
| Routing | 0.20 | legal + security routed |
| Closure | 0.06 | case closed with fraud report |

Scoring thresholds:

  • Find 1 signal: ~0.20
  • Find 2 signals: ~0.40
  • Find 3 signals: ~0.60
  • Find all 4 + correct routing: ~0.90+
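The thresholds correspond roughly to a per-signal diagnosis credit plus a routing bonus. An illustrative mapping (the signal names and the 0.20/0.10 split are assumptions chosen to reproduce the thresholds above):

```python
FRAUD_SIGNALS = [
    "bank_account_mismatch", "gstin_mismatch",
    "quantity_mismatch", "price_inflation",
]

def task3_threshold_score(found: set, routed_correctly: bool) -> float:
    """Approximate Task 3 thresholds: ~0.20 per fraud signal found,
    plus routing credit once all four signals are identified."""
    n = len(found & set(FRAUD_SIGNALS))
    score = 0.20 * n
    if n == len(FRAUD_SIGNALS) and routed_correctly:
        score += 0.10  # lifts a full diagnosis from 0.80 to ~0.90+
    return round(min(score, 1.0), 2)
```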

9. Reward Design

9.1 Philosophy

The reward function is designed around three principles:

Principle 1: Every informative action gets signal. Agents should learn that investigating is always better than guessing. Each relevant inspection, check, or query returns a positive reward proportional to how diagnostic that action is.

Principle 2: Dangerous actions get crushed. Approving a fraudulent invoice, disabling security controls, or contacting a supplier via a compromised channel are not mistakes β€” they are catastrophic errors. These must receive large negative rewards so agents learn to avoid them unconditionally.

Principle 3: The grader is the ground truth, the shaped reward is the training signal. The episode reward is shaped to help agents learn. The grader score at the end is what actually measures quality.

9.2 Reward Table

| Action | Reward range | Notes |
|---|---|---|
| inspect_field (relevant) | +0.01 to +0.14 | Higher for fields that reveal anomalies |
| inspect_field (irrelevant) | +0.01 | Still a small positive — exploration is fine |
| cross_check (finds mismatch) | +0.12 to +0.15 | Diagnosis reward |
| cross_check (no mismatch) | +0.02 | Confirms a clean field |
| run_check (finds issue) | +0.08 to +0.18 | Higher for more diagnostic checks |
| run_check (clean) | +0.01 to +0.06 | Clean checks still confirm facts |
| query_supplier (phone) | +0.10 to +0.15 | Correct channel |
| query_supplier (email, fraud task) | −0.15 | Contacts the fraudster |
| query_internal (key dept) | +0.04 to +0.12 | Higher for departments that add critical info |
| apply_rule (correct rule) | +0.08 to +0.12 | Applying the right policy pathway |
| apply_rule (wrong rule) | −0.05 to −0.10 | Misapplying policy |
| make_decision (correct) | +0.18 to +0.28 | Correct decision based on evidence |
| make_decision (wrong) | −0.10 to −0.40 | Severity scales with how wrong |
| route_to (correct team) | +0.06 to +0.14 | Right escalation path |
| close_case (complete) | +0.06 to +0.12 | Depends on decision quality |
| Repeat action | −0.02 to −0.05 | Light penalty, not catastrophic |
| SLA breach (exceed max steps) | −0.10 | One-time penalty at end |
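One way to realise this table is a lookup keyed by (action type, outcome), with the repeat penalty (FR-019) checked first. A sketch with a few rows, using single values picked from within the table's ranges:

```python
# (action_type, outcome) -> reward; a handful of rows from the table,
# with single illustrative values chosen from within each range.
REWARDS = {
    ("run_check", "finds_issue"): 0.18,
    ("run_check", "clean"): 0.04,
    ("query_supplier", "phone"): 0.12,
    ("query_supplier", "email_fraud"): -0.15,
    ("make_decision", "correct"): 0.25,
    ("make_decision", "wrong"): -0.40,
}
REPEAT_PENALTY = -0.03  # light, per FR-019

def shaped_reward(action_key: tuple, history: set) -> float:
    """Return the shaped step reward; repeats get a light penalty."""
    if action_key in history:
        return REPEAT_PENALTY
    history.add(action_key)
    # Unknown-but-harmless actions keep a small positive exploration signal.
    return REWARDS.get(action_key, 0.01)
```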

9.3 Episode Score vs Cumulative Reward

These are different numbers:

  • Cumulative reward is the sum of step rewards. It is used as a training signal.
  • Episode score (from grade()) is the holistic quality assessment. It is what the hackathon evaluates.

Agents should be optimised on the grade score, not the cumulative reward alone.


10. Evaluation Criteria

10.1 Hackathon Scoring

| Criterion | Weight | What judges look for |
|---|---|---|
| Real-world utility | 30% | Would an enterprise actually use this? Does it model the task faithfully? |
| Task & grader quality | 25% | Clear objectives, accurate grading, genuine difficulty progression, frontier models challenged |
| Environment design | 20% | Clean state management, good action/observation spaces, shaped reward, sensible episode boundaries |
| Code quality & spec compliance | 15% | OpenEnv spec passes, Dockerfile works, baseline reproduces, typed models |
| Creativity & novelty | 10% | Novel domain, interesting mechanics, original reward design |

10.2 Automated Gates (must all pass)

  1. HF Space deploys — POST /reset returns 200
  2. openenv validate passes
  3. docker build succeeds
  4. python inference.py runs without error, produces scores
  5. All 3 tasks enumerated, grader scores verified in [0.0, 1.0]

10.3 Phase 2 β€” Agentic Evaluation

The hackathon will run a standard open LLM agent (e.g. Nemotron 3 Super) against the environment. The environment must:

  • Not be trivially solvable by a greedy agent
  • Produce score variance across tasks (not all the same)
  • Penalise clearly suboptimal behaviour

10.4 Disqualifiers

  • Environment does not deploy or respond to /reset
  • Graders that always return the same score regardless of actions
  • inference.py not in root, or not using OpenAI client
  • No baseline scores produced
  • Plagiarised environment

11. API Contract

11.1 Environment Python API

env = InvoiceExceptionEnv(seed=42)

# Reset — returns EnvironmentState
obs: EnvironmentState = env.reset("task1_price_variance")

# Step — returns StepResult
result: StepResult = env.step(Action.run_check("tolerance_rule"))
# result.observation  →  EnvironmentState
# result.reward       →  float
# result.done         →  bool
# result.info         →  dict

# State — non-destructive peek
obs: EnvironmentState = env.state()

# Grade — run grader on episode
scores: dict = env.grade()
# scores["score"]           → 0.0–1.0 overall
# scores["diagnosis_score"] → float
# scores["decision_score"]  → float
# ...

11.2 HTTP API

POST /reset
Body: {"task_id": "task1_price_variance"}   (optional — random if omitted)
Response: 200 EnvironmentState JSON

POST /step
Body: {"type": "run_check", "params": {"check_name": "tolerance_rule"}}
Response: 200 StepResult JSON

GET /state
Response: 200 EnvironmentState JSON

POST /grade
Response: 200 {"score": 0.85, "diagnosis_score": ...}

GET /tasks
Response: 200 ["task1_price_variance", "task2_duplicate_tax", "task3_compound_fraud"]

GET /health
Response: 200 {"status": "ok", "version": "1.0.0"}
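A stdlib-only client sketch against these endpoints (the base URL is a placeholder; a real client might use requests or an OpenEnv client library instead):

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:7860"  # placeholder; use the deployed Space URL

def encode_action(action_type: str, **params) -> dict:
    """Build a /step body matching the action schema in section 11.3."""
    return {"type": action_type, "params": params}

def post(path: str, payload: Optional[dict] = None) -> dict:
    """POST JSON to the environment server and decode the JSON reply."""
    data = json.dumps(payload or {}).encode("utf-8")
    req = urllib.request.Request(
        BASE_URL + path, data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example calls (require a running server):
# obs = post("/reset", {"task_id": "task1_price_variance"})
# result = post("/step", encode_action("run_check", check_name="tolerance_rule"))
```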

11.3 Action Schema

{
  "type": "run_check",
  "params": {"check_name": "tolerance_rule"}
}

{
  "type": "inspect_field",
  "params": {"document": "invoice", "field": "bank_account"}
}

{
  "type": "cross_check",
  "params": {"field": "unit_price", "doc_a": "invoice", "doc_b": "po"}
}

{
  "type": "query_supplier",
  "params": {"question": "Why does your bank account differ?", "channel": "phone"}
}

{
  "type": "query_internal",
  "params": {"department": "procurement", "question": "Did you approve this price?"}
}

{
  "type": "apply_rule",
  "params": {"rule_id": "tolerance_exception_approval"}
}

{
  "type": "make_decision",
  "params": {"decision": "approve", "reason": "Verbal approval confirmed by procurement."}
}

{
  "type": "route_to",
  "params": {"team": "procurement", "notes": "Please raise PO amendment for the price variance."}
}

{
  "type": "close_case",
  "params": {"summary": "Invoice approved. PO amendment requested. Case closed."}
}
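Before dispatching, the server can validate payloads against this schema. A stdlib sketch (the real environment uses typed Pydantic v2 models per FR-005; the required-parameter sets are inferred from the examples above, and treating "channel" as optional is an assumption):

```python
# Required params per action type, inferred from the schema examples.
REQUIRED_PARAMS = {
    "run_check": {"check_name"},
    "inspect_field": {"document", "field"},
    "cross_check": {"field", "doc_a", "doc_b"},
    "query_supplier": {"question"},          # "channel" assumed optional
    "query_internal": {"department", "question"},
    "apply_rule": {"rule_id"},
    "make_decision": {"decision", "reason"},
    "route_to": {"team", "notes"},
    "close_case": {"summary"},
}

def validate_action(action: dict) -> None:
    """Raise ValueError if the payload doesn't match the schema above."""
    a_type = action.get("type")
    if a_type not in REQUIRED_PARAMS:
        raise ValueError(f"unknown action type: {a_type!r}")
    missing = REQUIRED_PARAMS[a_type] - set(action.get("params", {}))
    if missing:
        raise ValueError(f"{a_type}: missing params {sorted(missing)}")
```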

12. File Structure

invoice-exception-handler/
│
├── README.md                    # Full setup + usage guide
├── openenv.yaml                 # OpenEnv spec (must pass openenv validate)
├── Dockerfile                   # Single-stage Python 3.11-slim build
├── requirements.txt             # Pinned dependencies
├── inference.py                 # Competition inference script (MUST be here)
├── app.py                       # Gradio + FastAPI entrypoint for HF Spaces
│
├── env/
│   ├── __init__.py
│   ├── models.py                # All Pydantic typed models
│   ├── environment.py           # InvoiceExceptionEnv class
│   └── tasks.py                 # 3 task classes + graders + EpisodeData
│
└── documents/
    ├── PRD-001-product-requirements.md    # This document
    ├── CHANGELOG.md                       # Every code change recorded
    ├── ARCHITECTURE.md                    # System diagram + decisions
    └── BASELINE-SCORES.md                 # Reproducible benchmark results

13. Out of Scope

The following are explicitly not part of v1.0:

  • Real database connectivity (the environment is fully simulated)
  • Multi-agent scenarios (one agent per episode)
  • Partial observability (agent sees all documents from the start)
  • User interface for human play (nice-to-have but not required for submission)
  • Real supplier APIs (simulation only)
  • Currency other than INR (can be extended in v1.1)
  • Tasks beyond 3 (can be extended)

14. Change Log

| Version | Date | Author | Change |
|---|---|---|---|
| 0.1.0 | 2025-01-18 | [Author] | Initial draft — problem definition and task sketches |
| 0.2.0 | 2025-01-19 | [Author] | Added reward design section, API contract, file structure |
| 1.0.0 | 2025-01-20 | [Author] | Final version — all sections complete, ready for implementation |