# Invoice Exception Handler - OpenEnv

> An AI agent learning environment that simulates accounts payable exception handling.
> The agent acts as an AP analyst: receives flagged invoices, investigates root causes,
> makes decisions, and closes cases. Built for the OpenEnv hackathon.

[OpenEnv](https://github.com/openenv/openenv) · [Python](https://python.org) · [Hugging Face Spaces](https://huggingface.co/spaces)

---
## For Agents Building This Project

This README is the single source of truth for building the entire project from scratch.
Read every section before writing any code. Do not skip sections. Do not guess.

### Ground Rules

1. **Write code like a human wrote it.** Use real variable names, not `x` or `tmp`. Add comments where
   the logic is non-obvious. Leave one blank line between logical blocks inside functions. Use 4-space
   indentation everywhere. Python files get a module docstring at the top explaining what the file does.
2. **Create a new Git repo and push after every major milestone.** A milestone is: models done,
   tasks done, environment done, API done, inference done, app done. Not after every file.
3. **Record every change in `documents/CHANGELOG.md`.** Use the format in the changelog section below.
   Before pushing, append to the changelog what changed and why.
4. **If something in this README conflicts with the competition spec, the competition spec wins.**
   The competition spec is in the document the user shared. Key points: `inference.py` must use the
   OpenAI client. The `[START]` `[STEP]` `[END]` log format must be exact. `/reset` must return 200.
5. **Test before pushing.** Run `python -c "from env import InvoiceExceptionEnv"` to check imports.
   Run `python inference.py` with a dummy API key to check the log format. Run `docker build .` to
   check the Dockerfile before claiming it works.

---
## What This Environment Does

Every company that buys goods or services deals with invoice exceptions - mismatches between
what was ordered, what arrived, and what was invoiced. Today these are handled by accounts
payable analysts who manually compare documents and decide: approve, reject, hold, or escalate.

This environment puts an AI agent in that analyst's chair. The agent receives:

- A **Purchase Order** (what was agreed to)
- An **Invoice** (what the supplier is claiming)
- A **Goods Receipt Note** (what actually arrived)
- A **Supplier Master** (the verified supplier record)
- An **Exception Flag** (why the system flagged this invoice)

The agent investigates, runs checks, queries people, makes a decision, and closes the case.
Every action has realistic consequences, including financial, compliance, and fraud implications.

---
## Repository Structure

Build the project with exactly this structure. Do not add extra directories. Do not rename files.

```
invoice-exception-handler/
│
├── README.md            - this file
├── openenv.yaml         - OpenEnv spec, must pass openenv validate
├── Dockerfile           - single-stage Python 3.11-slim
├── requirements.txt     - pinned versions
├── inference.py         - competition script, MUST be named this
├── app.py               - Gradio + FastAPI, entry point for HF Spaces
│
├── env/
│   ├── __init__.py      - exports InvoiceExceptionEnv, Action, ALL_TASKS
│   ├── models.py        - all Pydantic models (Action, EnvironmentState, etc.)
│   ├── environment.py   - InvoiceExceptionEnv class
│   └── tasks.py         - 3 task classes, EpisodeData, graders
│
└── documents/
    ├── PRD-001-product-requirements.md
    ├── CHANGELOG.md
    ├── ARCHITECTURE.md
    └── BASELINE-SCORES.md
```

---
## Step-by-Step Build Order

Follow this order exactly. Do not jump ahead.

```
Step 1  - Create the repo
Step 2  - Write requirements.txt
Step 3  - Write env/models.py
Step 4  - Write env/tasks.py
Step 5  - Write env/environment.py
Step 6  - Write env/__init__.py
Step 7  - Smoke test the environment (run a quick script)
Step 8  - Write openenv.yaml
Step 9  - Write inference.py
Step 10 - Write app.py
Step 11 - Write Dockerfile
Step 12 - Full end-to-end test
Step 13 - Write documents/
Step 14 - Push and verify
```

---
## Step 1 - Create the Repo

```bash
# Create the project directory
mkdir invoice-exception-handler
cd invoice-exception-handler

# Initialise git
git init
git checkout -b main

# Create the directory structure
mkdir -p env documents

# Create empty placeholder files so git tracks the structure
touch env/__init__.py
touch documents/.gitkeep

# First commit - skeleton only
git add .
git commit -m "init: project skeleton"

# Create the repo on GitHub/HF and push
# Replace with your actual remote
git remote add origin https://github.com/YOUR_USERNAME/invoice-exception-handler.git
git push -u origin main
```

---
## Step 2 - requirements.txt

Pin every version. Do not use `>=` ranges - the validator builds in a clean environment, and
range mismatches cause mysterious failures.

```
pydantic==2.7.1
fastapi==0.111.0
uvicorn==0.29.0
gradio==4.36.1
openai==1.35.3
pyyaml==6.0.1
httpx==0.27.0
python-multipart==0.0.9
```

---
## Step 3 - env/models.py

This file defines every typed object in the system. Write it before any other Python code.
Nothing is untyped. Every field has a type annotation.

### What goes in models.py

**Enumerations:**

- `ActionType` - the 9 action types an agent can take (string enum)
- `DecisionType` - approve / reject / hold / partial_approve (string enum)
- `CaseStatus` - open / in_review / decided / routed / closed (string enum)

**Document models** (read-only context given to the agent):

- `LineItem` - one line on an invoice or PO (description, quantity, unit_price, total, tax_rate)
- `PurchaseOrder` - what was agreed to be purchased
- `Invoice` - what the supplier is claiming
- `GoodsReceiptNote` - what actually arrived at the warehouse
- `SupplierMaster` - the verified, registered supplier record
- `ExceptionFlag` - why the system flagged this invoice (flag_code, description, auto_hold)

**Action model:**

- `Action` - has a `type: ActionType` and `params: Dict[str, Any]`
- Add classmethod constructors for each action type so callers can write `Action.run_check("tolerance_rule")`

**Result models:**

- `InspectionResult` - what came back from inspect_field (document, field, value, note, timestamp)
- `CheckResult` - what came back from run_check or cross_check (check_name, passed, detail, timestamp)
- `QueryResult` - what came back from a query (target, question, response, channel, timestamp)

**State models:**

- `EnvironmentState` - the full observable state returned by reset() and step()
- `StepResult` - what step() returns: (observation, reward, done, info)

### EnvironmentState fields

The EnvironmentState must include:

- `task_id: str`
- `step_number: int`
- `case_status: CaseStatus`
- All 5 documents (purchase_order, invoice, grn, supplier_master, exception_flag)
- Agent history: `inspections`, `checks_run`, `queries`, `rules_applied`
- Decision state: `decision`, `decision_reason`, `routed_to`, `case_closed`, `close_summary`
- Action hints: `available_actions`, `available_checks`, `available_rules`, `knowledge_base`
- `cumulative_reward: float`
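As a minimal sketch, the field list above might translate to a Pydantic model like this (the five document sub-models are stubbed as `Any` here for brevity; the real file uses the typed document models):

```python
from typing import Any, List, Optional
from pydantic import BaseModel, Field


class EnvironmentState(BaseModel):
    # Identity and progress
    task_id: str
    step_number: int = 0
    case_status: str = "open"

    # The five read-only documents (typed models in the real file)
    purchase_order: Any = None
    invoice: Any = None
    grn: Any = None
    supplier_master: Any = None
    exception_flag: Any = None

    # Agent history
    inspections: List[Any] = Field(default_factory=list)
    checks_run: List[Any] = Field(default_factory=list)
    queries: List[Any] = Field(default_factory=list)
    rules_applied: List[str] = Field(default_factory=list)

    # Decision state
    decision: Optional[str] = None
    decision_reason: Optional[str] = None
    routed_to: List[str] = Field(default_factory=list)
    case_closed: bool = False
    close_summary: Optional[str] = None

    # Action hints for the agent
    available_actions: List[str] = Field(default_factory=list)
    available_checks: List[str] = Field(default_factory=list)
    available_rules: List[str] = Field(default_factory=list)
    knowledge_base: List[str] = Field(default_factory=list)

    cumulative_reward: float = 0.0
```

The defaults shown here are assumptions; what matters is that every field from the list is present and typed.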
### Writing style for models.py

```python
"""
Typed models for the Invoice Exception Handler OpenEnv environment.

Every object the agent sees or produces is defined here as a Pydantic model.
This is the single source of truth for the data contract between the
environment simulation and the agent.
"""

from __future__ import annotations

import time
from enum import Enum
from typing import Any, Dict, List, Optional

from pydantic import BaseModel, Field


class ActionType(str, Enum):
    INSPECT_FIELD = "inspect_field"
    CROSS_CHECK = "cross_check"
    # ... etc
```

Do not put business logic in models.py. Just data shapes.
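The classmethod-constructor pattern mentioned above can be sketched like this (only three of the nine action types are shown; the rest follow the same shape):

```python
from enum import Enum
from typing import Any, Dict
from pydantic import BaseModel, Field


class ActionType(str, Enum):
    INSPECT_FIELD = "inspect_field"
    RUN_CHECK = "run_check"
    MAKE_DECISION = "make_decision"


class Action(BaseModel):
    type: ActionType
    params: Dict[str, Any] = Field(default_factory=dict)

    @classmethod
    def inspect_field(cls, document: str, field: str) -> "Action":
        return cls(type=ActionType.INSPECT_FIELD,
                   params={"document": document, "field": field})

    @classmethod
    def run_check(cls, check_name: str) -> "Action":
        return cls(type=ActionType.RUN_CHECK, params={"check_name": check_name})

    @classmethod
    def make_decision(cls, decision: str, reason: str) -> "Action":
        return cls(type=ActionType.MAKE_DECISION,
                   params={"decision": decision, "reason": reason})
```

This keeps agent code readable: `Action.run_check("tolerance_rule")` instead of hand-building a params dict.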
---

## Step 4 - env/tasks.py

This is the biggest file. It defines what happens when the agent takes each action -
the simulated responses, the rewards, and the grading logic.

### EpisodeData class

A plain Python class (not Pydantic) that tracks everything the agent has done in one episode.

```python
class EpisodeData:
    """Tracks the full history of one episode for grading and state building."""

    def __init__(self):
        self.inspections: List[InspectionResult] = []
        self.checks: List[CheckResult] = []
        self.queries: List[QueryResult] = []
        self.rules_applied: List[str] = []
        self.decision: Optional[str] = None
        self.decision_reason: Optional[str] = None
        self.routed_to: List[str] = []
        self.closed: bool = False
        self.close_summary: Optional[str] = None
        self.step_count: int = 0
        self.cumulative_reward: float = 0.0

    def has_inspected(self, doc: str, field: str) -> bool:
        """Check if we already looked at this field in this document."""
        return any(i.document == doc and i.field == field for i in self.inspections)

    def has_checked(self, name: str) -> bool:
        """Check if this validation check has already been run."""
        return any(c.check_name == name for c in self.checks)

    def has_queried(self, target: str) -> bool:
        """Check if we already queried this person or department."""
        return any(q.target == target for q in self.queries)
```
### BaseTask class

Abstract base that all three tasks inherit from. The document factories raise `NotImplementedError`; subclasses must override every method.

```python
class BaseTask:
    task_id: str = "base"
    max_steps: int = 20
    difficulty: str = "easy"

    # Document factories - return fresh objects each time (no shared state)
    def get_purchase_order(self) -> PurchaseOrder: raise NotImplementedError
    def get_invoice(self) -> Invoice: raise NotImplementedError
    def get_grn(self) -> GoodsReceiptNote: raise NotImplementedError
    def get_supplier_master(self) -> SupplierMaster: raise NotImplementedError
    def get_exception_flag(self) -> ExceptionFlag: raise NotImplementedError

    # Simulators - each returns (result_object, reward_delta)
    def simulate_inspect(self, document: str, field: str) -> Tuple[InspectionResult, float]: ...
    def simulate_cross_check(self, field: str, doc_a: str, doc_b: str) -> Tuple[CheckResult, float]: ...
    def simulate_run_check(self, check_name: str) -> Tuple[CheckResult, float]: ...
    def simulate_query_supplier(self, question: str, channel: str) -> Tuple[QueryResult, float]: ...
    def simulate_query_internal(self, department: str, question: str) -> Tuple[QueryResult, float]: ...
    def simulate_apply_rule(self, rule_id: str) -> Tuple[str, float]: ...
    def simulate_make_decision(self, decision: str, reason: str, ep: EpisodeData) -> float: ...
    def simulate_route_to(self, team: str, notes: str, ep: EpisodeData) -> float: ...
    def simulate_close(self, summary: str, ep: EpisodeData) -> float: ...
    def grade(self, ep: EpisodeData) -> Dict[str, float]: ...

    # These are properties, not methods
    @property
    def available_checks(self) -> List[str]: return []

    @property
    def available_rules(self) -> List[str]: return []

    @property
    def knowledge_base(self) -> List[str]: return []
```
### The Three Tasks

#### Task 1: PriceVarianceTask (task1_price_variance)

**The scenario:** An office stationery supplier sends an invoice that is 3.08% above the PO.
Company policy allows ±2% automatic approval; anything above that needs manual exception approval.
The supplier did communicate the price increase, but procurement never updated the PO.

**task_id:** `"task1_price_variance"`
**max_steps:** `18`
**difficulty:** `"easy"`

**The documents:**

PO (PO-2024-1041): 3 stationery line items totalling ₹50,000

- A4 Paper 100 reams @ ₹220 = ₹22,000
- Ballpoint Pens 20 boxes @ ₹450 = ₹9,000
- Staplers 10 units @ ₹1,900 = ₹19,000

Invoice (INV-ON-8821): Same items, same quantities, but 2 items have higher unit prices

- A4 Paper @ ₹231 (+₹11, +5.0%)
- Ballpoint Pens @ ₹472 (+₹22, +4.9%)
- Staplers unchanged @ ₹1,900
- Subtotal: ₹51,540 (+₹1,540, +3.08%)
- 18% GST applied correctly: ₹9,277.20
- Total: ₹60,817.20

GRN (GRN-2024-0892): All items fully received, no pending, no rejected.

Supplier Master (SUP-0441 - OfficeNeed Supplies): Bank account and GSTIN both match the invoice exactly. No fraud signals.

Exception Flag: `PRICE_MISMATCH` - "Invoice total ₹51,540 exceeds PO ₹50,000 by ₹1,540 (3.08%). Above auto-approval threshold."
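A document factory for this task might look like the sketch below (a fresh list per call, per the BaseTask requirement; `LineItem` is redeclared here only to keep the sketch self-contained, and the helper name is illustrative):

```python
from typing import List
from pydantic import BaseModel


class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float
    tax_rate: float


def build_task1_invoice_lines() -> List[LineItem]:
    """Invoice INV-ON-8821 line items: two prices raised, staplers unchanged."""
    return [
        LineItem(description="A4 Paper", quantity=100, unit_price=231.0,
                 total=23100.0, tax_rate=0.18),
        LineItem(description="Ballpoint Pens", quantity=20, unit_price=472.0,
                 total=9440.0, tax_rate=0.18),
        LineItem(description="Staplers", quantity=10, unit_price=1900.0,
                 total=19000.0, tax_rate=0.18),
    ]


# 23100 + 9440 + 19000 = 51540, i.e. +3.08% over the ₹50,000 PO
subtotal = sum(item.total for item in build_task1_invoice_lines())
```

Returning a new list on every call means one episode can never mutate the documents seen by the next.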
**Knowledge base entries:**

- POL-001: Price variance ≤ ±2% may be auto-approved. Above 2% requires exception approval.
- POL-002: Exception approval requires confirmation from the originating department.
- POL-003: Any approved invoice with a price change must be followed by a PO amendment request.
- POL-004: Bank account on invoice must match supplier master.

**Simulator logic:**

`simulate_inspect`: Return meaningful values for invoice line_items (+0.10), invoice total_amount (+0.08), po line_items (+0.06), grn items_received (+0.05). Return +0.01 for unknown fields.

`simulate_cross_check`: The key cross-checks are:

- `(unit_price, invoice, po)` → finds the Paper and Pen mismatch, reward +0.12
- `(total_amount, invoice, po)` → confirms the 3.08% variance, reward +0.10
- `(bank_account, invoice, supplier_master)` → match (no fraud), reward +0.03
- `(gstin, invoice, supplier_master)` → match, reward +0.02
- `(quantity, invoice, grn)` → match (full delivery), reward +0.04

`simulate_run_check`:

- `"tolerance_rule"` → 3.08% > 2%, FAILS, reward +0.14 (most important check)
- `"grn_match"` → PASSES (all received), reward +0.06
- `"duplicate_detection"` → PASSES (not a dup), reward +0.02
- `"bank_account_verification"` → PASSES, reward +0.02
- `"gst_verification"` → PASSES, reward +0.02
- `"po_match"` → FAILS on price, reward +0.08

`simulate_query_supplier`: Returns an email from the supplier explaining a raw-material price increase communicated to Arjun Mehta in procurement on Feb 20. Reward +0.10.

`simulate_query_internal`:

- `"procurement"` → Arjun Mehta confirms verbal approval, says he'll raise a PO amendment. Reward +0.12.
- Others → generic responses, reward +0.03.

`simulate_apply_rule`:

- `"tolerance_2pct_auto_approve"` → BLOCKED (3.08% > 2%), reward -0.05
- `"tolerance_exception_approval"` → APPLIED, reward +0.10
- `"rejection_with_reason"` → APPLIED but wrong, reward -0.08
- `"partial_approval"` → not applicable here, reward -0.05

`simulate_make_decision`:

- `"approve"` with tolerance check + procurement query: reward +0.25
- `"approve"` with tolerance check only: reward +0.18
- `"approve"` with nothing checked: reward +0.05 (bad approval - should have verified)
- `"reject"`: reward -0.10 (wrong decision, delays the supplier)
- `"hold"`: reward +0.08

`simulate_route_to`:

- `"procurement"` → reward +0.12 (correct - a PO amendment is needed)
- `"finance"` → reward +0.03
- `"legal"` → reward -0.05 (overkill for a price variance)

`simulate_close`: reward +0.12 if approved + tolerance checked + procurement routed; +0.06 if only partially satisfied; else 0.
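One simple way to implement `simulate_run_check` for this task is a lookup table keyed by check name, as in this sketch (the real method returns `CheckResult` objects; the detail strings here are illustrative):

```python
from typing import Tuple

# (passed, detail, reward_delta) per check, mirroring the table above
TASK1_CHECK_TABLE = {
    "tolerance_rule": (False, "3.08% variance exceeds the 2% auto-approval band", 0.14),
    "grn_match": (True, "All items fully received", 0.06),
    "duplicate_detection": (True, "No matching prior invoice found", 0.02),
    "bank_account_verification": (True, "Bank account matches supplier master", 0.02),
    "gst_verification": (True, "GSTIN matches supplier master", 0.02),
    "po_match": (False, "Unit prices differ from PO on 2 of 3 lines", 0.08),
}


def simulate_run_check(check_name: str) -> Tuple[bool, str, float]:
    """Return (passed, detail, reward_delta) for a named check."""
    # Unknown checks pass trivially and earn only the +0.01 exploration crumb
    return TASK1_CHECK_TABLE.get(
        check_name, (True, "Unknown check - nothing found", 0.01))
```

Keeping the rewards in one table makes it easy to audit that the most diagnostic check (`tolerance_rule`) pays the most.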
**Grader (`grade` method):**

```python
def grade(self, ep: EpisodeData) -> Dict[str, float]:
    checks_run = {c.check_name for c in ep.checks}
    queries_to = {q.target for q in ep.queries}

    # Did the agent correctly diagnose?
    d = 0.0
    if any("unit_price" in c.check_name or "total" in c.check_name
           for c in ep.checks):
        d += 0.12
    if "tolerance_rule" in checks_run:
        d += 0.14
    if "grn_match" in checks_run:
        d += 0.06

    # Did the agent investigate properly?
    i = 0.0
    if "supplier" in queries_to:
        i += 0.10
    if "procurement" in queries_to:
        i += 0.12
    if "tolerance_exception_approval" in ep.rules_applied:
        i += 0.08

    # Correct decision?
    dec = 0.0
    if ep.decision == "approve":
        dec += 0.18
    elif ep.decision == "hold":
        dec += 0.06
    elif ep.decision == "reject":
        dec -= 0.10

    # Correct routing?
    route = 0.12 if "procurement" in ep.routed_to else 0.0

    # Closed cleanly?
    closure = 0.08 if ep.closed else 0.0

    # Efficiency bonus - penalise extra steps
    eff = max(0.0, 0.06 - 0.004 * max(0, ep.step_count - 9))

    total = d + i + dec + route + closure + eff
    return {
        "score": round(max(0.0, min(1.0, total)), 4),
        "diagnosis_score": round(d, 4),
        "investigation_score": round(i, 4),
        "decision_score": round(dec, 4),
        "routing_score": round(route, 4),
        "closure_score": round(closure, 4),
        "efficiency_score": round(eff, 4),
    }
```

---
#### Task 2: DuplicateTaxErrorTask (task2_duplicate_tax)

**The scenario:** A logistics supplier submits INV-2024-891 for transport services. The system flags
it as a possible duplicate. It turns out it IS a duplicate of INV-2024-819 - the numbers differ
by a digit transposition (891 vs 819). That original invoice was already paid. BUT: the original
invoice applied 15% GST when the correct rate is 18%, so the company underpaid ₹3,240 in tax.
The new invoice has the correct rate. So it's both a duplicate AND a legitimate correction.

**task_id:** `"task2_duplicate_tax"`
**max_steps:** `20`
**difficulty:** `"medium"`

**The documents:**

PO (PO-2024-0778): Logistics services

- Mumbai-Pune Transport 20 trips @ ₹4,500 = ₹90,000
- Warehousing charges Feb 2024 @ ₹18,000 = ₹18,000
- Total: ₹1,08,000, Net-15 terms

Invoice (INV-2024-891): Same services, same amounts - correct on the face of it

- Subtotal: ₹1,08,000
- GST 18%: ₹19,440 → this is CORRECT
- Total: ₹1,27,440

GRN (GRN-2024-0740): Services confirmed complete (transport + warehousing).

Supplier Master (SUP-0229 - FastMove Logistics): Bank and GSTIN match the invoice. No fraud signals.

Exception Flag: `POSSIBLE_DUPLICATE` - "Invoice INV-2024-891 closely matches a previously processed invoice."

**Hidden state (not in documents, revealed by checks):**

- INV-2024-819 was paid 12 days ago for ₹1,24,200
- INV-2024-819 applied 15% GST = ₹16,200 (wrong rate)
- Correct 18% GST = ₹19,440
- Tax shortfall on the original payment: ₹3,240
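The hidden-state arithmetic can be checked directly. This is a standalone calculation (integer rupees, not the tasks.py implementation):

```python
# Hidden payment-history facts for Task 2, surfaced only through checks
ORIGINAL_INVOICE = "INV-2024-819"  # paid 12 days before the new invoice arrived
SUBTOTAL = 108_000

paid_gst = SUBTOTAL * 15 // 100     # wrong 15% rate on the paid invoice
correct_gst = SUBTOTAL * 18 // 100  # 18% rate carried by the new invoice
paid_total = SUBTOTAL + paid_gst

# The delta the partial approval should cover
tax_delta = correct_gst - paid_gst
```

Working through it: ₹16,200 was paid as GST, ₹19,440 was due, and the paid total of ₹1,24,200 matches the hidden record above.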
**Key checks and what they reveal:**

- `run_check("duplicate_detection")` → FAILS → finds INV-2024-819 paid 12 days ago, reward +0.18
- `run_check("tax_calculation_verify")` → FAILS → discovers the 15% error on the original, reveals the ₹3,240 delta, reward +0.16
- `cross_check(invoice_number, invoice, payment_history)` → finds the digit transposition, reward +0.15
- `cross_check(tax_amount, invoice, payment_history)` → confirms the ₹3,240 delta, reward +0.14
- `query_internal("finance")` → confirms the tax error on the original payment, reward +0.12
- `query_supplier` → supplier confirms they know and wants partial approval for the delta, reward +0.10
- `apply_rule("partial_approval")` → correct pathway, reward +0.12
- `apply_rule("credit_note_request")` → supplier must issue a credit note for the balance, reward +0.10

**Decision logic:**

`simulate_make_decision`:

- `"partial_approve"` with dup + tax error found: reward +0.28 → optimal
- `"partial_approve"` with dup only: reward +0.14 → incomplete
- `"reject"` with dup found: reward +0.08 → catches the dup, misses the correction
- `"approve"` (pays the full duplicate): reward -0.15 → bad

**Grader weights:**

- diagnosis_score: up to 0.30 (dup found +0.16, tax error found +0.14)
- investigation_score: up to 0.32 (finance queried, supplier queried, rules applied)
- decision_score: up to 0.20 (partial_approve = 0.20, reject = 0.05, approve = -0.15)
- routing_score: up to 0.08
- closure_score: up to 0.06
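The decision table above maps naturally to a small conditional. A sketch (only the four listed combinations are grounded in the spec; any other combination falls through to 0.0, which is an assumption):

```python
def task2_decision_reward(decision: str, found_dup: bool, found_tax: bool) -> float:
    """Reward for make_decision in Task 2, per the table above."""
    if decision == "partial_approve" and found_dup and found_tax:
        return 0.28   # optimal: pay only the tax correction
    if decision == "partial_approve" and found_dup:
        return 0.14   # right instinct, incomplete diagnosis
    if decision == "reject" and found_dup:
        return 0.08   # catches the duplicate, misses the correction
    if decision == "approve":
        return -0.15  # pays the full duplicate
    return 0.0        # combinations not covered by the spec (assumption)
```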
---

#### Task 3: CompoundFraudTask (task3_compound_fraud)

**The scenario:** An IT supplier submits a ₹8,47,500 invoice for 15 laptops. The system flags a bank
account change. But there are FOUR simultaneous fraud signals, and the agent must find all of them.

**task_id:** `"task3_compound_fraud"`
**max_steps:** `25`
**difficulty:** `"hard"`

**The four signals:**

1. **Bank account fraud (Signal 1):** The invoice has a different bank account than the supplier
   master. The change request came from `techcore-solutions.com`. The registered domain is
   `techcore-solutions.in`. A classic Business Email Compromise (BEC) attack.
2. **GSTIN fraud (Signal 2):** The GST number on the invoice (`07AABCT9999X1Z8`) belongs to
   "TechCore Trading Pvt Ltd" - a completely different entity in Delhi. The supplier master shows
   `07AABCT1234Y1Z5` for "TechCore Solutions."
3. **Quantity mismatch (Signal 3):** The invoice claims 15 laptops. The GRN shows only 13 received;
   2 units are still marked as pending.
4. **Price inflation (Signal 4):** ₹56,500/unit on the invoice vs ₹52,000/unit on the PO. That's
   8.65% above the agreed price. No price revision was ever approved.

**Bonus signals (smaller, still notable):**

- The invoice is dated a Sunday (2024-03-10) - unusual for B2B
- The PO was raised Friday March 8 - a 2-day turnaround is suspiciously fast for IT equipment

**The critical trap - channel selection:**

`simulate_query_supplier(question, channel="email")` →
Returns the fraudster's response urging payment to the new account. Reward: **-0.15**.

`simulate_query_supplier(question, channel="phone")` →
The real TechCore Solutions confirms they sent no bank change request. Confirms the fraud. Reward: **+0.15**.

This tests whether the agent follows POL-009 ("bank account change must be verified via
registered phone number - NEVER via email"), which is in the knowledge base.
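The trap can be implemented by branching on the channel, as in this sketch (response strings are illustrative; the real simulator returns a `QueryResult`):

```python
from typing import Tuple


def simulate_query_supplier(question: str, channel: str = "email") -> Tuple[str, float]:
    """The channel decides who answers: email reaches the fraudster, phone the real supplier."""
    if channel == "phone":
        # Registered phone number from the supplier master -> the real company
        response = ("TechCore Solutions (registered number): we never requested "
                    "a bank account change. Treat that request as fraudulent.")
        return response, 0.15

    # Email (or anything else) goes to the lookalike domain
    response = ("Reply from techcore-solutions.com: please process payment "
                "urgently to the updated account.")
    return response, -0.15
```

The asymmetry is the point: the agent only gets the truthful answer by following POL-009's channel rule.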
**Available checks and rewards:**

```
"bank_account_verification"  → FAILS, finds lookalike domain, reward +0.18
"gst_verification"           → FAILS, GST belongs to different entity, reward +0.18
"grn_match"                  → FAILS, 13 vs 15 received, reward +0.14
"email_domain_verification"  → FAILS, lookalike domain confirmed, reward +0.16
"invoice_date_validation"    → FAILS, Sunday flag, reward +0.08
"quantity_check"             → FAILS, quantity inflated, reward +0.12
"price_check"                → FAILS, 8.65% above PO, reward +0.10
"duplicate_detection"        → PASSES (not a dup), reward +0.02
"po_match"                   → FAILS (GST + qty + price all wrong), reward +0.08
```

**Decision logic:**

`simulate_make_decision`:

- `"reject"` → reward = 0.10 + 0.05 × (number of signals found) → max ~0.30
- `"approve"` → reward -0.40 (catastrophic - approved fraud)
- `"partial_approve"` → reward -0.20 (you can't partially approve fraud)
- `"hold"` → reward = 0.08 + 0.03 × signals found → acceptable but not optimal

**Route logic:**

```
"legal"        → reward +0.14   # must escalate to legal
"security"     → reward +0.12   # BEC attack needs security investigation
"finance"      → reward +0.08   # finance needs to block payment
"procurement"  → reward +0.06
```
**Grader - the signal detection scoring:**

```python
def grade(self, ep: EpisodeData) -> Dict[str, float]:
    checks_run = {c.check_name for c in ep.checks}
    bank_found = "bank_account_verification" in checks_run
    gst_found = "gst_verification" in checks_run
    qty_found = "grn_match" in checks_run
    domain_found = "email_domain_verification" in checks_run
    price_found = "price_check" in checks_run

    # Diagnosis - finding all the signals is the whole point
    d = (0.12 if bank_found else 0) + (0.12 if gst_found else 0) \
        + (0.10 if qty_found else 0) + (0.10 if domain_found else 0) \
        + (0.06 if price_found else 0)

    # Investigation - reward for using phone, not email
    i = 0.0
    for q in ep.queries:
        if q.target == "supplier" and q.channel not in ("email", "mail"):
            i += 0.10   # correct channel
        elif q.target == "supplier" and q.channel in ("email", "mail"):
            i -= 0.15   # contacting the fraudster

    queried = {q.target for q in ep.queries}
    if "legal" in queried:
        i += 0.06
    if "security" in queried:
        i += 0.06

    # Decision
    signals = sum([bank_found, gst_found, qty_found, domain_found])
    dec = 0.0
    if ep.decision == "reject":
        dec = 0.08 + 0.03 * signals
    elif ep.decision == "approve":
        dec = -0.35
    elif ep.decision == "partial_approve":
        dec = -0.15
    elif ep.decision == "hold":
        dec = 0.06

    # Routing
    routes = set(ep.routed_to)
    route = (0.10 if "legal" in routes else 0) \
        + (0.06 if "security" in routes else 0) \
        + (0.04 if "finance" in routes else 0)

    closure = 0.06 if (ep.closed and ep.decision == "reject") else 0.0
    eff = max(0.0, 0.04 - 0.002 * max(0, ep.step_count - 12))

    total = d + i + dec + route + closure + eff
    return {
        "score": round(max(0.0, min(1.0, total)), 4),
        "signals_found": sum([bank_found, gst_found, qty_found, domain_found, price_found]),
        "diagnosis_score": round(d, 4),
        "investigation_score": round(i, 4),
        "decision_score": round(dec, 4),
        "routing_score": round(route, 4),
        "closure_score": round(closure, 4),
        "efficiency_score": round(eff, 4),
    }
```
### Task Registry

At the bottom of tasks.py:

```python
TASK_REGISTRY: Dict[str, type] = {
    "task1_price_variance": PriceVarianceTask,
    "task2_duplicate_tax": DuplicateTaxErrorTask,
    "task3_compound_fraud": CompoundFraudTask,
}

ALL_TASKS = list(TASK_REGISTRY.keys())


def make_task(task_id: str) -> BaseTask:
    cls = TASK_REGISTRY.get(task_id)
    if cls is None:
        raise ValueError(f"Unknown task '{task_id}'. Available: {ALL_TASKS}")
    return cls()
```

---
## Step 5 - env/environment.py

This is the `InvoiceExceptionEnv` class. It is the only thing external code needs to import.

```python
class InvoiceExceptionEnv:
    """
    OpenEnv-compatible Invoice Exception Handler environment.

    Usage:
        env = InvoiceExceptionEnv(seed=42)
        obs = env.reset("task1_price_variance")
        result = env.step(Action.run_check("tolerance_rule"))
        scores = env.grade()
    """
```

### Constructor

Takes an optional `seed: Optional[int] = None` for reproducibility.
Initialises `self._rng = random.Random(seed)`.
Initialises `self._task`, `self._ep`, `self._state` to None and `self._done` to False.

### reset(task_id)

```python
def reset(self, task_id: Optional[str] = None) -> EnvironmentState:
    """
    Start a new episode. If task_id is None, picks one at random.
    Returns the initial EnvironmentState showing all documents and available actions.
    """
```

1. Pick the task (random if None)
2. Create `EpisodeData()`
3. Set `self._done = False`
4. Call `self._build_state()` and store the result
5. Return the state

### step(action)

```python
def step(self, action: Union[Action, Dict[str, Any]]) -> StepResult:
    """
    Execute one action. Returns observation, reward, done flag, and info dict.
    Raises RuntimeError if called before reset() or after the episode is done.
    """
```

1. Validate that we're in an active episode
2. Convert a dict to an Action if needed
3. Call `self._dispatch(action)` → gets (reward, info)
4. Increment the step count
5. Check the SLA (step count vs max_steps)
6. Check the done condition (closed or SLA breach)
7. Rebuild the state
8. Return a StepResult
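The step/done bookkeeping in the numbered list can be sketched in isolation (a toy class, not the real environment; the dispatch reward arrives as a parameter here because the simulators live in the tasks):

```python
class StepFlowDemo:
    """Toy model of InvoiceExceptionEnv.step() lifecycle bookkeeping."""

    def __init__(self, max_steps: int = 18):
        self.max_steps = max_steps
        self.step_count = 0
        self.closed = False
        self.done = False

    def step(self, reward_from_dispatch: float, closes_case: bool):
        # 1. Validate we're in an active episode
        if self.done:
            raise RuntimeError("Episode finished - call reset() first")

        # 3-4. dispatch already produced a reward; count the step
        self.step_count += 1
        if closes_case:
            self.closed = True

        # 5-6. An explicit close or an SLA breach ends the episode
        sla_breached = self.step_count >= self.max_steps
        self.done = self.closed or sla_breached
        return reward_from_dispatch, self.done
```

This separation (dispatch computes rewards, step owns the episode lifecycle) keeps the done logic in exactly one place.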
### state()

Non-destructive. Just returns `self._state`. Raises RuntimeError if not initialised.

### grade()

Calls `self._task.grade(self._ep)` and returns the dict.

### _dispatch(action)

The routing function: a single if/elif chain over the ActionTypes.

For each action:

1. Call the appropriate task simulator
2. Update EpisodeData
3. Return (reward, info dict)

Handle repeated actions (inspecting the same field twice, running the same check twice) with a small -0.02 to -0.05 penalty and return early.
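A stripped-down sketch of the dispatch pattern, including the repeat penalty (the function name, the flat 0.10 reward, and the set-based dedup are illustrative; the real method consults the task simulators and `EpisodeData`):

```python
from typing import Any, Dict, Set, Tuple


def dispatch_demo(action_type: str, params: Dict[str, Any],
                  seen_checks: Set[str]) -> Tuple[float, Dict[str, Any]]:
    """Toy dispatch: route by action type, penalise repeated checks."""
    if action_type == "run_check":
        name = params["check_name"]
        if name in seen_checks:
            # Repeating a check earns a small penalty instead of its reward
            return -0.02, {"note": "check already run"}
        seen_checks.add(name)
        return 0.10, {"note": f"ran {name}"}  # real reward is task-specific
    elif action_type == "close_case":
        return 0.05, {"note": "case closed"}
    return 0.0, {"note": "action type not handled in this sketch"}
```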
### _build_state()

Constructs an `EnvironmentState` from the current `_task` and `_ep`. Called after every step.
Also determines the current `CaseStatus` based on the episode data.

### action_space_sample()

Returns a random valid action (for random baseline agents). Uses `self._rng` for reproducibility.

---
## Step 6 - env/__init__.py

```python
from .environment import InvoiceExceptionEnv
from .models import Action, ActionType, EnvironmentState, StepResult
from .tasks import ALL_TASKS, make_task

__all__ = [
    "InvoiceExceptionEnv",
    "Action",
    "ActionType",
    "EnvironmentState",
    "StepResult",
    "ALL_TASKS",
    "make_task",
]
```

---
## Step 7 - Smoke Test Before Continuing

Before writing openenv.yaml or inference.py, verify that the environment works.

```python
# test_smoke.py - run this, do not commit it
from env import InvoiceExceptionEnv, Action, ALL_TASKS

print("Tasks:", ALL_TASKS)

env = InvoiceExceptionEnv(seed=42)
for task_id in ALL_TASKS:
    obs = env.reset(task_id)
    print(f"\n--- {task_id} ---")
    print("Ticket:", obs.exception_flag.description[:80])

    # Take a few actions
    r1 = env.step(Action.run_check(obs.available_checks[0]))
    print(f"Step 1 reward: {r1.reward}")

    r2 = env.step(Action.make_decision("approve", "test"))
    print(f"Step 2 reward: {r2.reward}")

    r3 = env.step(Action.close_case("closed"))
    print(f"Step 3 reward: {r3.reward}, done: {r3.done}")

    scores = env.grade()
    print(f"Grade: {scores['score']}")

print("\nSmoke test passed.")
```

All three tasks must complete without errors. Scores must be in [0.0, 1.0].

---
| ## Step 8 β openenv.yaml | |
| This file must pass `openenv validate`. Write it carefully. | |
```yaml
# openenv.yaml
name: Invoice Exception Handler
version: "1.0.0"
description: |
  An agent learning environment simulating accounts payable exception handling.
  The agent acts as an AP analyst: investigates flagged invoices, applies business
  rules, detects fraud signals, makes decisions, and closes cases with an audit trail.
authors:
  - name: Your Name
    email: your@email.com
license: MIT

tasks:
  - id: task1_price_variance
    name: Price Variance Exception
    difficulty: easy
    description: |
      Office stationery invoice arrives 3.08% above PO. Company tolerance policy
      allows ±2% auto-approval. Agent must detect the variance, verify through
      the tolerance rule, confirm verbal approval with procurement, and approve
      with a PO amendment request.
    max_steps: 18
    optimal_score: 1.0
    min_passing_score: 0.60
  - id: task2_duplicate_tax
    name: Duplicate Invoice with Tax Error
    difficulty: medium
    description: |
      Logistics supplier submits INV-2024-891, a duplicate of paid INV-2024-819
      (digit transposition: 891 vs 819). Original invoice had wrong GST rate (15%
      vs correct 18%), so the company overpaid ₹3,240. New invoice has correct rate.
      Agent must detect the duplicate, identify the tax error in the original,
      and partially approve only the ₹3,240 tax correction.
    max_steps: 20
    optimal_score: 1.0
    min_passing_score: 0.50
  - id: task3_compound_fraud
    name: Compound Fraud Signals
    difficulty: hard
    description: |
      IT equipment supplier invoice with four simultaneous fraud signals: bank
      account changed via BEC attack (lookalike email domain), GSTIN belongs to
      a different entity, 2 of 15 laptops not yet received, and unit price 8.65%
      above PO. Agent must find all signals, use the correct communication channel
      (phone, not email, which would contact the fraudster), and escalate to legal
      and security.
    max_steps: 25
    optimal_score: 1.0
    min_passing_score: 0.40

observation_space:
  type: object
  description: EnvironmentState Pydantic model
  fields:
    task_id: {type: string}
    step_number: {type: integer}
    case_status: {type: string, enum: [open, in_review, decided, routed, closed]}
    purchase_order: {type: object, description: "PO with line items and terms"}
    invoice: {type: object, description: "Supplier invoice with line items and tax"}
    grn: {type: object, description: "Goods receipt - what actually arrived"}
    supplier_master: {type: object, description: "Verified supplier record"}
    exception_flag: {type: object, description: "Why the system flagged this invoice"}
    inspections: {type: array, description: "Fields the agent has inspected"}
    checks_run: {type: array, description: "Validation checks completed"}
    queries: {type: array, description: "Internal and supplier queries"}
    rules_applied: {type: array, description: "Business rules applied"}
    decision: {type: string, nullable: true}
    routed_to: {type: array}
    available_actions: {type: array}
    available_checks: {type: array}
    available_rules: {type: array}
    knowledge_base: {type: array}
    cumulative_reward: {type: number}

action_space:
  type: object
  description: Action with type and params
  actions:
    inspect_field:
      params: {document: string, field: string}
    cross_check:
      params: {field: string, doc_a: string, doc_b: string}
    run_check:
      params: {check_name: string}
    query_supplier:
      params: {question: string, channel: string}
    query_internal:
      params: {department: string, question: string}
    apply_rule:
      params: {rule_id: string}
    make_decision:
      params: {decision: string, reason: string}
    route_to:
      params: {team: string, notes: string}
    close_case:
      params: {summary: string}

reward:
  range: [-1.0, 1.0]
  description: |
    Shaped reward at every step. Relevant inspections: +0.01 to +0.14.
    Diagnostics revealing issues: +0.08 to +0.18. Correct fixes: +0.08 to +0.30.
    Wrong decision on fraud: -0.15 to -0.40. Repeat actions: -0.02 to -0.05.
    SLA breach: -0.10.

grading:
  method: task_grader
  scores:
    - score  # 0.0-1.0 overall
    - diagnosis_score
    - investigation_score
    - decision_score
    - routing_score
    - closure_score
    - efficiency_score

api:
  reset:
    signature: "reset(task_id: str | None = None) -> EnvironmentState"
  step:
    signature: "step(action: Action | dict) -> StepResult"
  state:
    signature: "state() -> EnvironmentState"
  grade:
    signature: "grade() -> Dict[str, float]"

http_endpoints:
  - path: /reset
    method: POST
    description: Reset environment, returns EnvironmentState JSON
  - path: /step
    method: POST
    description: Execute action, returns StepResult JSON
  - path: /state
    method: GET
    description: Current state, returns EnvironmentState JSON
  - path: /grade
    method: POST
    description: Grade current episode
  - path: /health
    method: GET
    description: Health check

dependencies:
  python: ">=3.11"
  packages:
    - pydantic==2.7.1
    - fastapi==0.111.0
    - uvicorn==0.29.0
    - gradio==4.36.1
    - openai==1.35.3
    - pyyaml==6.0.1

docker:
  port: 7860
  health_check: /health
```
---

## Step 9 – inference.py

This is the most critical file for the hackathon validator. Get the format exactly right.

### Required env vars

```python
import os

API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY", "")
```
### Required stdout format

Every line to stdout must be exactly:

```
[START] task=<task_id> env=invoice-exception-handler model=<model_name>
[STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
[END] success=<true|false> steps=<n> score=<0.000> rewards=<r1,r2,...>
```

Rules (do not deviate):

- One `[START]` line at episode begin
- One `[STEP]` line per step, immediately after `env.step()` returns
- One `[END]` line after the episode, always emitted even on exception
- `reward` and all values in `rewards` formatted to exactly 2 decimal places
- `score` formatted to exactly 3 decimal places
- `done` and `success` are lowercase: `true` or `false`
- `error` is the error message string, or exactly `null` if none
- No newlines within a single line
- `flush=True` on every print so the validator sees output in real time
### System prompt for the LLM

Write a clear system prompt that tells the model:

- It is an AP analyst handling a flagged invoice
- It has a structured action space (list all 9 action types)
- It must respond in JSON: `{"type": "...", "params": {...}}`
- It should investigate before deciding
- Never approve without checking, never contact supplier by email if fraud is suspected
- Available documents: PO, Invoice, GRN, Supplier Master, Exception Flag
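One possible wording that covers all six points (the exact phrasing is up to you; only the JSON response contract is load-bearing):

```python
SYSTEM_PROMPT = """\
You are an accounts payable (AP) analyst handling a flagged invoice exception.

You act through a structured action space with 9 action types:
inspect_field, cross_check, run_check, query_supplier, query_internal,
apply_rule, make_decision, route_to, close_case.

Available documents: PO, Invoice, GRN, Supplier Master, Exception Flag.

Rules:
- Investigate before deciding: inspect fields and run checks first.
- Never approve an invoice without at least one validation check.
- If fraud is suspected, never contact the supplier by email; use phone.

Respond with ONLY a JSON object, no prose:
{"type": "<action_type>", "params": {...}}
"""
```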
### User prompt per step

Include in the user prompt:

- Current step number and max steps
- The exception flag (what was flagged and why)
- Available checks (list them)
- Available rules (list them)
- Knowledge base entries (the policy list)
- What has been done so far (checks run, queries made, inspections done)
- Current cumulative reward
- Ask for next action as JSON
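A minimal `build_prompt` covering those items might look like this. It is a sketch only: it assumes the observation exposes the fields from the observation table as attributes, and it summarizes history via the strings run_task accumulates rather than re-listing checks and queries individually.

```python
def build_prompt(obs, step, max_steps, history):
    """Assemble the per-step user prompt from the current observation."""
    lines = [
        f"Step {step} of {max_steps}.",
        f"Exception flag: {obs.exception_flag.flag_description}",
        "Available checks: " + ", ".join(obs.available_checks),
        "Available rules: " + ", ".join(obs.available_rules),
        "Policies:",
    ]
    lines += [f"- {entry}" for entry in obs.knowledge_base]

    if history:
        lines.append("Actions so far:")
        lines += history[-10:]  # keep the prompt bounded on long episodes

    lines.append(f"Cumulative reward: {obs.cumulative_reward:+.2f}")
    lines.append('Reply with the next action as JSON: {"type": "...", "params": {...}}')
    return "\n".join(lines)
```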
### Parsing LLM output

```python
import json
import re


def parse_action(raw_text: str) -> dict:
    """
    Parse the model's response into an action dict.
    Handles markdown code fences, extra whitespace, and minor formatting errors.
    Falls back to run_check(po_match) if parsing fails.
    """
    text = raw_text.strip()

    # Remove ```json or ``` fences if present
    if text.startswith("```"):
        lines = text.split("\n")
        text = "\n".join(lines[1:-1] if lines[-1] == "```" else lines[1:])

    try:
        return json.loads(text.strip())
    except json.JSONDecodeError:
        # Try to find JSON within the text
        match = re.search(r'\{.*\}', text, re.DOTALL)
        if match:
            try:
                return json.loads(match.group())
            except json.JSONDecodeError:
                pass

    # Safe fallback
    return {"type": "run_check", "params": {"check_name": "po_match"}}
```
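The run_task loop below also needs a `call_llm` helper, which the spec does not spell out. A minimal version could look like this; the system prompt default and the temperature/max_tokens values are assumptions, not requirements:

```python
import os

MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")


def call_llm(client, user_prompt: str, system_prompt: str = "You are an AP analyst.") -> str:
    """One chat completion via the OpenAI client; returns the raw model text."""
    response = client.chat.completions.create(
        model=MODEL_NAME,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0.2,   # low temperature keeps the JSON output stable
        max_tokens=300,
    )
    return response.choices[0].message.content
```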
### Overall structure

```python
import json

from openai import OpenAI

from env import InvoiceExceptionEnv, ALL_TASKS


def run_task(client, env, task_id, max_steps=20):
    """Run one task episode and return (steps_taken, score, rewards)."""
    rewards = []
    print(f"[START] task={task_id} env=invoice-exception-handler model={MODEL_NAME}", flush=True)

    obs = env.reset(task_id)
    history = []

    for step in range(1, max_steps + 1):
        # Build prompt from observation
        user_prompt = build_prompt(obs, step, max_steps, history)

        # Call LLM
        raw = call_llm(client, user_prompt)
        action_dict = parse_action(raw)

        # Execute
        try:
            result = env.step(action_dict)
            reward = result.reward
            done = result.done
            error = None
        except Exception as e:
            reward = 0.0
            done = False
            error = str(e)
            result = None

        rewards.append(reward)
        action_str = json.dumps(action_dict)
        print(
            f"[STEP] step={step} action={action_str} "
            f"reward={reward:.2f} done={str(done).lower()} "
            f"error={error or 'null'}",
            flush=True
        )
        history.append(f"Step {step}: {action_str} -> reward {reward:+.2f}")

        if result:
            obs = result.observation
        if done:
            break

    score = env.grade()["score"]
    success = score >= 0.5  # simple global threshold; stricter per-task thresholds live in openenv.yaml
    steps_taken = min(step, max_steps)
    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
    print(
        f"[END] success={str(success).lower()} steps={steps_taken} "
        f"score={score:.3f} rewards={rewards_str}",
        flush=True
    )
    return steps_taken, score, rewards


def main():
    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
    env = InvoiceExceptionEnv(seed=42)
    for task_id in ALL_TASKS:
        run_task(client, env, task_id)


if __name__ == "__main__":
    main()
```
---

## Step 10 – app.py

The app.py serves two purposes:

1. Provides the FastAPI HTTP endpoints that the validator pings (`POST /reset` must return 200)
2. Provides a Gradio UI for interactive exploration on HF Spaces

### Architecture

Run both FastAPI and Gradio in the same process on port 7860.
Use `gr.mount_gradio_app` to mount Gradio on FastAPI, or run Gradio alongside FastAPI.
The cleanest approach:
```python
import gradio as gr
import uvicorn
from fastapi import FastAPI
from fastapi.responses import JSONResponse

from env import InvoiceExceptionEnv, ALL_TASKS

app = FastAPI(title="Invoice Exception Handler OpenEnv")
env = InvoiceExceptionEnv(seed=42)  # shared environment instance


@app.post("/reset")
async def http_reset(body: dict | None = None):
    task_id = (body or {}).get("task_id")
    obs = env.reset(task_id)
    return JSONResponse(obs.model_dump(mode="json"))


@app.post("/step")
async def http_step(body: dict):
    result = env.step(body)
    return JSONResponse(result.model_dump(mode="json"))


@app.get("/state")
async def http_state():
    return JSONResponse(env.state().model_dump(mode="json"))


@app.post("/grade")
async def http_grade():
    return JSONResponse(env.grade())


@app.get("/tasks")
async def http_tasks():
    return JSONResponse(ALL_TASKS)


@app.get("/health")
async def health():
    return JSONResponse({"status": "ok", "version": "1.0.0"})


# Mount Gradio at the root path; the API routes registered above keep precedence.
# build_gradio_ui() is described in the next section.
gradio_app = build_gradio_ui()
app = gr.mount_gradio_app(app, gradio_app, path="/")

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=7860)
```
### Gradio UI – what to build

Keep the UI simple and functional. Three tabs:

**Tab 1: Manual Play**

- Dropdown to select task (labels: "Task 1 - Price Variance (Easy)", etc.)
- Reset button
- Shows the exception flag, the key document fields, and available actions
- Dropdown or textbox to compose and submit an action
- Shows reward, cumulative reward, and status after each step
- Shows grade breakdown when episode ends

**Tab 2: Agent Demo**

- Select task
- Shows a hardcoded optimal action sequence running step by step
- Good for demonstrating the environment to judges who won't run code

**Tab 3: API Reference**

- Code examples for each action type
- Reward table
- Grader score breakdown explanation

---
## Step 11 – Dockerfile

```dockerfile
FROM python:3.11-slim

# Install system dependencies
RUN apt-get update \
    && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*

# Create non-root user (required by HF Spaces)
RUN useradd -m -u 1000 appuser
WORKDIR /app

# Copy and install dependencies first (layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY --chown=appuser:appuser . .
USER appuser

EXPOSE 7860

# Health check: pings the /health endpoint
HEALTHCHECK --interval=30s --timeout=10s --start-period=20s --retries=3 \
    CMD curl -f http://localhost:7860/health || exit 1

ENV PYTHONUNBUFFERED=1
ENV GRADIO_SERVER_NAME=0.0.0.0
ENV GRADIO_SERVER_PORT=7860

CMD ["python", "app.py"]
```
---

## Step 12 – End-to-End Test Checklist

Before pushing, check every item in this list.

```bash
# 1. Imports work
python -c "from env import InvoiceExceptionEnv, Action, ALL_TASKS; print('OK')"

# 2. All three tasks complete without errors
python -c "
from env import InvoiceExceptionEnv, Action, ALL_TASKS
env = InvoiceExceptionEnv(seed=42)
for t in ALL_TASKS:
    obs = env.reset(t)
    result = env.step(Action.run_check(obs.available_checks[0]))
    result = env.step(Action.make_decision('reject', 'test'))
    result = env.step(Action.close_case('test'))
    score = env.grade()['score']
    assert 0.0 <= score <= 1.0, f'Score out of range: {score}'
    print(f'{t}: {score}')
print('All tasks OK')
"

# 3. Graders are deterministic
python -c "
from env import InvoiceExceptionEnv, Action
env1 = InvoiceExceptionEnv(seed=42)
env2 = InvoiceExceptionEnv(seed=42)
obs1 = env1.reset('task1_price_variance')
obs2 = env2.reset('task1_price_variance')
env1.step(Action.run_check('tolerance_rule'))
env2.step(Action.run_check('tolerance_rule'))
env1.step(Action.make_decision('approve', 'test'))
env2.step(Action.make_decision('approve', 'test'))
env1.step(Action.close_case('done'))
env2.step(Action.close_case('done'))
s1 = env1.grade()['score']
s2 = env2.grade()['score']
assert s1 == s2, f'Non-deterministic: {s1} vs {s2}'
print(f'Deterministic: {s1}')
"

# 4. inference.py log format (with a fake API key)
# The API call will fail, but the [START] line must still print first
API_BASE_URL=https://api.example.com HF_TOKEN=fake MODEL_NAME=test \
    python inference.py 2>/dev/null | head -n 1 | grep -q '^\[START\]' \
    && echo 'START line OK'

# 5. Docker builds
docker build -t invoice-env-test .

# 6. Docker runs and /health returns 200
docker run -d -p 7860:7860 --name test-env invoice-env-test
sleep 15
curl -f http://localhost:7860/health
curl -s -X POST http://localhost:7860/reset -H "Content-Type: application/json" -d '{}'
docker stop test-env && docker rm test-env

# 7. openenv validate (if openenv-core is installed)
pip install openenv-core
openenv validate
```
---

## Step 13 – documents/ Folder

Create these four files. Keep them updated as the project evolves.

### documents/CHANGELOG.md

```markdown
# Changelog

All changes to the Invoice Exception Handler environment are recorded here.
Format: Date | Version | What changed | Why

---

## [1.0.0] - 2025-01-20

### Added

- Initial implementation of InvoiceExceptionEnv with full OpenEnv API
- Three tasks: task1_price_variance, task2_duplicate_tax, task3_compound_fraud
- Pydantic v2 typed models for all environment objects
- FastAPI HTTP endpoints for HF Spaces validation
- Gradio UI for interactive exploration
- inference.py using OpenAI client with [START][STEP][END] log format
- openenv.yaml spec file
- Dockerfile for HF Spaces deployment

### Design decisions

- Used pure Python simulation (no external databases) for portability and determinism
- Compound fraud task has four signals to prevent simple greedy agents from scoring well
- Channel selection in Task 3 (phone vs email) tests policy knowledge, not just anomaly detection
- Grader uses sub-scores to allow partial credit for partial solutions
```
### documents/ARCHITECTURE.md

Document the system architecture. Include:

- A text diagram of how the components connect
- Why FastAPI and Gradio in the same process (HF Spaces constraint)
- Why Pydantic v2 (spec requirement, validation)
- How EpisodeData separates mutable state from immutable document context
- Why tasks are separate classes (easy to extend)
### documents/BASELINE-SCORES.md

Record the reproducible baseline scores. Run them yourself and copy the output here.

```markdown
# Baseline Scores

Recorded on: 2025-01-20
Seed: 42
Machine: 2 vCPU, 8GB RAM

## Random Agent (action_space_sample())

| Task | Score | Steps |
|------|-------|-------|
| task1_price_variance | ~0.18 | 18 (SLA breach) |
| task2_duplicate_tax | ~0.12 | 20 (SLA breach) |
| task3_compound_fraud | ~0.08 | 25 (SLA breach) |
| **Average** | **~0.13** | |

## Optimal Agent (hardcoded correct actions)

| Task | Score | Steps |
|------|-------|-------|
| task1_price_variance | ~0.98 | 9 |
| task2_duplicate_tax | ~0.95 | 10 |
| task3_compound_fraud | ~0.92 | 14 |
| **Average** | **~0.95** | |
```
---

## Step 14 – Push and Verify

```bash
# Final commit
git add .
git commit -m "feat: complete invoice exception handler v1.0.0

- 3 tasks with deterministic graders (easy/medium/hard)
- Full OpenEnv API: reset/step/state/grade
- FastAPI HTTP endpoints for validator (/reset, /step, /state, /health)
- Gradio UI for HF Spaces
- inference.py with OpenAI client and [START][STEP][END] format
- openenv.yaml spec
- Dockerfile for HF Spaces deployment
- documents/ folder with PRD, changelog, architecture, baseline scores"
git push origin main

# Deploy to HF Spaces (if not using git-based deployment)
# The Dockerfile and app.py handle this automatically when pushed to HF
```
---

## Action Space Reference

| Action Type | Required Params | Description |
|---|---|---|
| `inspect_field` | `document, field` | Look at a specific field in a document |
| `cross_check` | `field, doc_a, doc_b` | Compare a field between two documents |
| `run_check` | `check_name` | Run a named validation check |
| `query_supplier` | `question, channel` | Ask the supplier something (channel: phone or email) |
| `query_internal` | `department, question` | Ask an internal team |
| `apply_rule` | `rule_id` | Apply a business policy rule |
| `make_decision` | `decision, reason` | approve / reject / hold / partial_approve |
| `route_to` | `team, notes` | Escalate to a team |
| `close_case` | `summary` | Close with an audit trail summary |
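Over HTTP, each action goes to `POST /step` as a JSON body with `type` and `params` keys matching the table above. The field values in these examples are illustrative only:

```python
import json

# Illustrative payloads for POST /step; param names follow the action table
example_actions = [
    {"type": "inspect_field", "params": {"document": "invoice", "field": "bank_account"}},
    {"type": "cross_check", "params": {"field": "unit_price", "doc_a": "invoice", "doc_b": "purchase_order"}},
    {"type": "query_supplier", "params": {"question": "Please confirm the bank account change.", "channel": "phone"}},
    {"type": "make_decision", "params": {"decision": "hold", "reason": "fraud signals under review"}},
]

for action in example_actions:
    print(json.dumps(action))
```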
---

## Observation Space Reference

| Field | Type | Description |
|---|---|---|
| `task_id` | str | Which task is running |
| `step_number` | int | Current step |
| `case_status` | str | open / in_review / decided / routed / closed |
| `purchase_order` | PurchaseOrder | What was agreed to be purchased |
| `invoice` | Invoice | What the supplier is claiming |
| `grn` | GoodsReceiptNote | What actually arrived |
| `supplier_master` | SupplierMaster | Verified supplier record |
| `exception_flag` | ExceptionFlag | Why this invoice was flagged |
| `inspections` | List | Fields already inspected |
| `checks_run` | List | Validation checks already run |
| `queries` | List | Queries made and responses |
| `rules_applied` | List | Business rules applied |
| `decision` | str? | Current decision if made |
| `routed_to` | List | Teams this case has been escalated to |
| `available_actions` | List | All 9 action types |
| `available_checks` | List | Check names valid for this task |
| `available_rules` | List | Rule IDs valid for this task |
| `knowledge_base` | List | Policy entries relevant to this task |
| `cumulative_reward` | float | Sum of all rewards so far |
---

## Reward Reference

| Event | Reward |
|---|---|
| Inspecting a key field that reveals an anomaly | +0.08 to +0.14 |
| Inspecting a routine field | +0.01 to +0.06 |
| Cross-check that finds a mismatch | +0.12 to +0.15 |
| Running a check that finds an issue | +0.08 to +0.18 |
| Querying the right person | +0.04 to +0.12 |
| Contacting supplier via wrong channel (Task 3) | -0.15 |
| Applying the correct business rule | +0.08 to +0.12 |
| Applying the wrong rule | -0.05 to -0.10 |
| Correct decision (approve/reject/partial) | +0.18 to +0.28 |
| Approving a fraudulent invoice | -0.35 to -0.40 |
| Wrong rejection (task1) | -0.10 |
| Routing to the right team | +0.06 to +0.14 |
| Clean case closure | +0.06 to +0.12 |
| Repeat action | -0.02 to -0.05 |
| SLA breach (exceed max_steps) | -0.10 |
---

## Expected Baseline Scores

These are the scores you should see when running `inference.py` with a good LLM.

| Task | Difficulty | Random Agent | Rule Agent | LLM Agent (Qwen-72B) |
|---|---|---|---|---|
| task1_price_variance | Easy | ~0.18 | ~0.85 | ~0.80 |
| task2_duplicate_tax | Medium | ~0.12 | ~0.72 | ~0.68 |
| task3_compound_fraud | Hard | ~0.08 | ~0.55 | ~0.45 |

The hard task should be genuinely hard for LLMs: a score of 0.45 is expected, not a failure.
---

## Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| `API_BASE_URL` | Yes | `https://router.huggingface.co/v1` | LLM endpoint |
| `MODEL_NAME` | Yes | `Qwen/Qwen2.5-72B-Instruct` | Model to use |
| `HF_TOKEN` | Yes | – | API key for the LLM endpoint |
| `ANTHROPIC_API_KEY` | No | – | Only if using Anthropic models directly |
---

## Setup Instructions

### Local Development

```bash
# Clone the repo
git clone https://github.com/YOUR_USERNAME/invoice-exception-handler.git
cd invoice-exception-handler

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run the app locally
python app.py
# Visit http://localhost:7860
```
### Run Inference

```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="your-token-here"
python inference.py
```

### Docker

```bash
docker build -t invoice-exception-handler .
docker run -p 7860:7860 \
    -e API_BASE_URL="https://router.huggingface.co/v1" \
    -e MODEL_NAME="Qwen/Qwen2.5-72B-Instruct" \
    -e HF_TOKEN="your-token-here" \
    invoice-exception-handler
```
### HF Spaces Deployment

1. Create a new Space with the Docker SDK (not the Gradio SDK; the Space builds from the Dockerfile)
2. Push this repository to it
3. Add secrets in Space settings: `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN`
4. The Space will build and deploy automatically from the Dockerfile

### Validate Submission

```bash
# Install validator
pip install openenv-core

# Validate the spec
openenv validate

# Run the full submission validator script
chmod +x scripts/validate-submission.sh
./scripts/validate-submission.sh https://your-space.hf.space .
```
---

## Common Mistakes to Avoid

1. **Don't rename `inference.py`.** The validator looks for exactly `inference.py` in the repo root.
2. **Don't use the Anthropic SDK in inference.py.** The spec requires the OpenAI client. Use `from openai import OpenAI`.
3. **Don't forget `flush=True` on print statements.** The validator reads stdout line by line. Without flush, logs may not appear.
4. **Don't let the Gradio UI crash the FastAPI server.** If the UI has an error, it should fail gracefully, not bring down `/reset`.
5. **Don't hardcode the model name.** Always read it from `os.getenv("MODEL_NAME")`.
6. **Don't put business logic in models.py.** That file is just data shapes.
7. **Don't mutate documents during a step.** The documents (PO, Invoice, GRN) are fixed for the duration of an episode. Only EpisodeData changes.
8. **Don't forget to test determinism.** The same seed plus the same actions must give the same score. Run the determinism test.
9. **Don't skip the docker build test.** The validator builds your Docker image. If it doesn't build, you're disqualified.
10. **Don't forget the changelog.** Update `documents/CHANGELOG.md` before every push.
---

## License

MIT License. See the LICENSE file.