# Invoice Exception Handler - OpenEnv

> An AI agent learning environment that simulates accounts payable exception handling.
> The agent acts as an AP analyst: receives flagged invoices, investigates root causes,
> makes decisions, and closes cases. Built for the OpenEnv hackathon.

[OpenEnv](https://github.com/openenv/openenv) · [Python](https://python.org) · [Hugging Face Spaces](https://huggingface.co/spaces)

---
## For Agents Building This Project

This README is the single source of truth for building the entire project from scratch.
Read every section before writing any code. Do not skip sections. Do not guess.

### Ground Rules

1. **Write code like a human wrote it.** Use real variable names, not `x` or `tmp`. Add comments where
   the logic is non-obvious. Leave one blank line between logical blocks inside functions. Use 4-space
   indentation everywhere. Python files get a module docstring at the top explaining what the file does.
2. **Create a new Git repo and push after every major milestone.** A milestone is: models done,
   tasks done, environment done, API done, inference done, app done. Not after every file.
3. **Record every change in `documents/CHANGELOG.md`.** Use the format in the changelog section below.
   Before pushing, append to the changelog what changed and why.
4. **If something in this README conflicts with the competition spec, the competition spec wins.**
   The competition spec is in the document the user shared. Key points: `inference.py` must use the
   OpenAI client. The `[START]` `[STEP]` `[END]` log format must be exact. `/reset` must return 200.
5. **Test before pushing.** Run `python -c "from env import InvoiceExceptionEnv"` to check imports.
   Run `python inference.py` with a dummy API key to check the log format. Run `docker build .` to
   check the Dockerfile before claiming it works.

---
## What This Environment Does

Every company that buys goods or services deals with invoice exceptions - mismatches between
what was ordered, what arrived, and what was invoiced. Today these are handled by accounts
payable analysts who manually compare documents and decide: approve, reject, hold, or escalate.

This environment puts an AI agent in that analyst's chair. The agent receives:

- A **Purchase Order** (what was agreed to)
- An **Invoice** (what the supplier is claiming)
- A **Goods Receipt Note** (what actually arrived)
- A **Supplier Master** (the verified supplier record)
- An **Exception Flag** (why the system flagged this invoice)

The agent investigates, runs checks, queries people, makes a decision, and closes the case.
Every action has realistic consequences, including financial, compliance, and fraud implications.

---
## Repository Structure

Build the project with exactly this structure. Do not add extra directories. Do not rename files.

```
invoice-exception-handler/
│
├── README.md            - this file
├── openenv.yaml         - OpenEnv spec, must pass openenv validate
├── Dockerfile           - single-stage Python 3.11-slim
├── requirements.txt     - pinned versions
├── inference.py         - competition script, MUST be named this
├── app.py               - Gradio + FastAPI, entry point for HF Spaces
│
├── env/
│   ├── __init__.py      - exports InvoiceExceptionEnv, Action, ALL_TASKS
│   ├── models.py        - all Pydantic models (Action, EnvironmentState, etc.)
│   ├── environment.py   - InvoiceExceptionEnv class
│   └── tasks.py         - 3 task classes, EpisodeData, graders
│
└── documents/
    ├── PRD-001-product-requirements.md
    ├── CHANGELOG.md
    ├── ARCHITECTURE.md
    └── BASELINE-SCORES.md
```

---
## Step-by-Step Build Order

Follow this order exactly. Do not jump ahead.

```
Step 1  - Create the repo
Step 2  - Write requirements.txt
Step 3  - Write env/models.py
Step 4  - Write env/tasks.py
Step 5  - Write env/environment.py
Step 6  - Write env/__init__.py
Step 7  - Smoke test the environment (run a quick script)
Step 8  - Write openenv.yaml
Step 9  - Write inference.py
Step 10 - Write app.py
Step 11 - Write Dockerfile
Step 12 - Full end-to-end test
Step 13 - Write documents/
Step 14 - Push and verify
```

---
## Step 1 - Create the Repo

```bash
# Create the project directory
mkdir invoice-exception-handler
cd invoice-exception-handler

# Initialise git
git init
git checkout -b main

# Create the directory structure
mkdir -p env documents

# Create empty placeholder files so git tracks the structure
touch env/__init__.py
touch documents/.gitkeep

# First commit - skeleton only
git add .
git commit -m "init: project skeleton"

# Create the repo on GitHub/HF and push
# Replace with your actual remote
git remote add origin https://github.com/YOUR_USERNAME/invoice-exception-handler.git
git push -u origin main
```

---
## Step 2 - requirements.txt

Pin every version. Do not use `>=` ranges - the validator builds in a clean environment, and
range mismatches cause mysterious failures.

```
pydantic==2.7.1
fastapi==0.111.0
uvicorn==0.29.0
gradio==4.36.1
openai==1.35.3
pyyaml==6.0.1
httpx==0.27.0
python-multipart==0.0.9
```

---
## Step 3 - env/models.py

This file defines every typed object in the system. Write it before any other Python code.
Nothing is untyped. Every field has a type annotation.

### What goes in models.py

**Enumerations:**

- `ActionType` - the 9 action types an agent can take (string enum)
- `DecisionType` - approve / reject / hold / partial_approve (string enum)
- `CaseStatus` - open / in_review / decided / routed / closed (string enum)

**Document models** (read-only context given to the agent):

- `LineItem` - one line on an invoice or PO (description, quantity, unit_price, total, tax_rate)
- `PurchaseOrder` - what was agreed to be purchased
- `Invoice` - what the supplier is claiming
- `GoodsReceiptNote` - what actually arrived at the warehouse
- `SupplierMaster` - the verified, registered supplier record
- `ExceptionFlag` - why the system flagged this invoice (flag_code, description, auto_hold)

**Action model:**

- `Action` - has a `type: ActionType` and `params: Dict[str, Any]`
- Add classmethod constructors for each action type so callers can write `Action.run_check("tolerance_rule")`

**Result models:**

- `InspectionResult` - what came back from inspect_field (document, field, value, note, timestamp)
- `CheckResult` - what came back from run_check or cross_check (check_name, passed, detail, timestamp)
- `QueryResult` - what came back from a query (target, question, response, channel, timestamp)

**State models:**

- `EnvironmentState` - the full observable state returned by reset() and step()
- `StepResult` - what step() returns: (observation, reward, done, info)

### EnvironmentState fields

The EnvironmentState must include:

- `task_id: str`
- `step_number: int`
- `case_status: CaseStatus`
- All 5 documents (purchase_order, invoice, grn, supplier_master, exception_flag)
- Agent history: `inspections`, `checks_run`, `queries`, `rules_applied`
- Decision state: `decision`, `decision_reason`, `routed_to`, `case_closed`, `close_summary`
- Action hints: `available_actions`, `available_checks`, `available_rules`, `knowledge_base`
- `cumulative_reward: float`
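As a minimal sketch, the field list above might translate to a Pydantic model like this (the five document sub-models are stubbed as `Any` here for brevity; the real file uses the typed document models):

```python
from typing import Any, List, Optional
from pydantic import BaseModel, Field


class EnvironmentState(BaseModel):
    # Identity and progress
    task_id: str
    step_number: int = 0
    case_status: str = "open"

    # The five read-only documents (typed models in the real file)
    purchase_order: Any = None
    invoice: Any = None
    grn: Any = None
    supplier_master: Any = None
    exception_flag: Any = None

    # Agent history
    inspections: List[Any] = Field(default_factory=list)
    checks_run: List[Any] = Field(default_factory=list)
    queries: List[Any] = Field(default_factory=list)
    rules_applied: List[str] = Field(default_factory=list)

    # Decision state
    decision: Optional[str] = None
    decision_reason: Optional[str] = None
    routed_to: List[str] = Field(default_factory=list)
    case_closed: bool = False
    close_summary: Optional[str] = None

    # Action hints for the agent
    available_actions: List[str] = Field(default_factory=list)
    available_checks: List[str] = Field(default_factory=list)
    available_rules: List[str] = Field(default_factory=list)
    knowledge_base: List[str] = Field(default_factory=list)

    cumulative_reward: float = 0.0
```

The defaults shown here are assumptions; what matters is that every field from the list is present and typed.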
### Writing style for models.py

```python
"""
Typed models for the Invoice Exception Handler OpenEnv environment.

Every object the agent sees or produces is defined here as a Pydantic model.
This is the single source of truth for the data contract between the
environment simulation and the agent.
"""

from __future__ import annotations

import time
from enum import Enum
from typing import Any, Dict, List, Optional

from pydantic import BaseModel, Field


class ActionType(str, Enum):
    INSPECT_FIELD = "inspect_field"
    CROSS_CHECK = "cross_check"
    # ... etc
```

Do not put business logic in models.py. Just data shapes.
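The classmethod-constructor pattern mentioned above can be sketched like this (only three of the nine action types are shown; the rest follow the same shape):

```python
from enum import Enum
from typing import Any, Dict
from pydantic import BaseModel, Field


class ActionType(str, Enum):
    INSPECT_FIELD = "inspect_field"
    RUN_CHECK = "run_check"
    MAKE_DECISION = "make_decision"


class Action(BaseModel):
    type: ActionType
    params: Dict[str, Any] = Field(default_factory=dict)

    @classmethod
    def inspect_field(cls, document: str, field: str) -> "Action":
        return cls(type=ActionType.INSPECT_FIELD,
                   params={"document": document, "field": field})

    @classmethod
    def run_check(cls, check_name: str) -> "Action":
        return cls(type=ActionType.RUN_CHECK, params={"check_name": check_name})

    @classmethod
    def make_decision(cls, decision: str, reason: str) -> "Action":
        return cls(type=ActionType.MAKE_DECISION,
                   params={"decision": decision, "reason": reason})
```

This keeps agent code readable: `Action.run_check("tolerance_rule")` instead of hand-building a params dict.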
---

## Step 4 - env/tasks.py

This is the biggest file. It defines what happens when the agent takes each action -
the simulated responses, the rewards, and the grading logic.

### EpisodeData class

A plain Python class (not Pydantic) that tracks everything the agent has done in one episode.

```python
class EpisodeData:
    """Tracks the full history of one episode for grading and state building."""

    def __init__(self):
        self.inspections: List[InspectionResult] = []
        self.checks: List[CheckResult] = []
        self.queries: List[QueryResult] = []
        self.rules_applied: List[str] = []
        self.decision: Optional[str] = None
        self.decision_reason: Optional[str] = None
        self.routed_to: List[str] = []
        self.closed: bool = False
        self.close_summary: Optional[str] = None
        self.step_count: int = 0
        self.cumulative_reward: float = 0.0

    def has_inspected(self, doc: str, field: str) -> bool:
        """Check if we already looked at this field in this document."""
        return any(i.document == doc and i.field == field for i in self.inspections)

    def has_checked(self, name: str) -> bool:
        """Check if this validation check has already been run."""
        return any(c.check_name == name for c in self.checks)

    def has_queried(self, target: str) -> bool:
        """Check if we already queried this person or department."""
        return any(q.target == target for q in self.queries)
```
### BaseTask class

Abstract base that all three tasks inherit from. The document factories raise `NotImplementedError`; subclasses must override every method.

```python
class BaseTask:
    task_id: str = "base"
    max_steps: int = 20
    difficulty: str = "easy"

    # Document factories - return fresh objects each time (no shared state)
    def get_purchase_order(self) -> PurchaseOrder: raise NotImplementedError
    def get_invoice(self) -> Invoice: raise NotImplementedError
    def get_grn(self) -> GoodsReceiptNote: raise NotImplementedError
    def get_supplier_master(self) -> SupplierMaster: raise NotImplementedError
    def get_exception_flag(self) -> ExceptionFlag: raise NotImplementedError

    # Simulators - each returns (result_object, reward_delta)
    def simulate_inspect(self, document: str, field: str) -> Tuple[InspectionResult, float]: ...
    def simulate_cross_check(self, field: str, doc_a: str, doc_b: str) -> Tuple[CheckResult, float]: ...
    def simulate_run_check(self, check_name: str) -> Tuple[CheckResult, float]: ...
    def simulate_query_supplier(self, question: str, channel: str) -> Tuple[QueryResult, float]: ...
    def simulate_query_internal(self, department: str, question: str) -> Tuple[QueryResult, float]: ...
    def simulate_apply_rule(self, rule_id: str) -> Tuple[str, float]: ...
    def simulate_make_decision(self, decision: str, reason: str, ep: EpisodeData) -> float: ...
    def simulate_route_to(self, team: str, notes: str, ep: EpisodeData) -> float: ...
    def simulate_close(self, summary: str, ep: EpisodeData) -> float: ...
    def grade(self, ep: EpisodeData) -> Dict[str, float]: ...

    # These are properties, not methods
    @property
    def available_checks(self) -> List[str]: return []

    @property
    def available_rules(self) -> List[str]: return []

    @property
    def knowledge_base(self) -> List[str]: return []
```
### The Three Tasks

#### Task 1: PriceVarianceTask (task1_price_variance)

**The scenario:** An office stationery supplier sends an invoice that is 3.08% above the PO.
Company policy allows ±2% automatic approval; anything above that needs manual exception approval.
The supplier did communicate the price increase, but procurement never updated the PO.

**task_id:** `"task1_price_variance"`
**max_steps:** `18`
**difficulty:** `"easy"`

**The documents:**

PO (PO-2024-1041): 3 stationery line items totalling ₹50,000

- A4 Paper 100 reams @ ₹220 = ₹22,000
- Ballpoint Pens 20 boxes @ ₹450 = ₹9,000
- Staplers 10 units @ ₹1,900 = ₹19,000

Invoice (INV-ON-8821): Same items, same quantities, but 2 items have higher unit prices

- A4 Paper @ ₹231 (+₹11, +5.0%)
- Ballpoint Pens @ ₹472 (+₹22, +4.9%)
- Staplers unchanged @ ₹1,900
- Subtotal: ₹51,540 (+₹1,540, +3.08%)
- 18% GST applied correctly: ₹9,277.20
- Total: ₹60,817.20

GRN (GRN-2024-0892): All items fully received, no pending, no rejected.

Supplier Master (SUP-0441 - OfficeNeed Supplies): Bank account and GSTIN both match the invoice exactly. No fraud signals.

Exception Flag: `PRICE_MISMATCH` - "Invoice total ₹51,540 exceeds PO ₹50,000 by ₹1,540 (3.08%). Above auto-approval threshold."
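A document factory for this task might look like the sketch below (a fresh list per call, per the BaseTask requirement; `LineItem` is redeclared here only to keep the sketch self-contained, and the helper name is illustrative):

```python
from typing import List
from pydantic import BaseModel


class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float
    tax_rate: float


def build_task1_invoice_lines() -> List[LineItem]:
    """Invoice INV-ON-8821 line items: two prices raised, staplers unchanged."""
    return [
        LineItem(description="A4 Paper", quantity=100, unit_price=231.0,
                 total=23100.0, tax_rate=0.18),
        LineItem(description="Ballpoint Pens", quantity=20, unit_price=472.0,
                 total=9440.0, tax_rate=0.18),
        LineItem(description="Staplers", quantity=10, unit_price=1900.0,
                 total=19000.0, tax_rate=0.18),
    ]


# 23100 + 9440 + 19000 = 51540, i.e. +3.08% over the ₹50,000 PO
subtotal = sum(item.total for item in build_task1_invoice_lines())
```

Returning a new list on every call means one episode can never mutate the documents seen by the next.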
**Knowledge base entries:**

- POL-001: Price variance ≤ ±2% may be auto-approved. Above 2% requires exception approval.
- POL-002: Exception approval requires confirmation from the originating department.
- POL-003: Any approved invoice with a price change must be followed by a PO amendment request.
- POL-004: Bank account on invoice must match supplier master.

**Simulator logic:**

`simulate_inspect`: Return meaningful values for invoice line_items (+0.10), invoice total_amount (+0.08), po line_items (+0.06), grn items_received (+0.05). Return +0.01 for unknown fields.

`simulate_cross_check`: The key cross-checks are:

- `(unit_price, invoice, po)` → finds the Paper and Pen mismatch, reward +0.12
- `(total_amount, invoice, po)` → confirms the 3.08% variance, reward +0.10
- `(bank_account, invoice, supplier_master)` → match (no fraud), reward +0.03
- `(gstin, invoice, supplier_master)` → match, reward +0.02
- `(quantity, invoice, grn)` → match (full delivery), reward +0.04

`simulate_run_check`:

- `"tolerance_rule"` → 3.08% > 2%, FAILS, reward +0.14 (most important check)
- `"grn_match"` → PASSES (all received), reward +0.06
- `"duplicate_detection"` → PASSES (not a dup), reward +0.02
- `"bank_account_verification"` → PASSES, reward +0.02
- `"gst_verification"` → PASSES, reward +0.02
- `"po_match"` → FAILS on price, reward +0.08

`simulate_query_supplier`: Returns an email from the supplier explaining a raw-material price increase communicated to Arjun Mehta in procurement on Feb 20. Reward +0.10.

`simulate_query_internal`:

- `"procurement"` → Arjun Mehta confirms verbal approval, says he'll raise a PO amendment. Reward +0.12.
- Others → generic responses, reward +0.03.

`simulate_apply_rule`:

- `"tolerance_2pct_auto_approve"` → BLOCKED (3.08% > 2%), reward -0.05
- `"tolerance_exception_approval"` → APPLIED, reward +0.10
- `"rejection_with_reason"` → APPLIED but wrong, reward -0.08
- `"partial_approval"` → not applicable here, reward -0.05

`simulate_make_decision`:

- `"approve"` with tolerance check + procurement query: reward +0.25
- `"approve"` with tolerance check only: reward +0.18
- `"approve"` with nothing checked: reward +0.05 (bad approval - should have verified)
- `"reject"`: reward -0.10 (wrong decision, delays the supplier)
- `"hold"`: reward +0.08

`simulate_route_to`:

- `"procurement"` → reward +0.12 (correct - a PO amendment is needed)
- `"finance"` → reward +0.03
- `"legal"` → reward -0.05 (overkill for a price variance)

`simulate_close`: reward +0.12 if approved + tolerance checked + procurement routed; +0.06 if only partially satisfied; else 0.
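One simple way to implement `simulate_run_check` for this task is a lookup table keyed by check name, as in this sketch (the real method returns `CheckResult` objects; the detail strings here are illustrative):

```python
from typing import Tuple

# (passed, detail, reward_delta) per check, mirroring the table above
TASK1_CHECK_TABLE = {
    "tolerance_rule": (False, "3.08% variance exceeds the 2% auto-approval band", 0.14),
    "grn_match": (True, "All items fully received", 0.06),
    "duplicate_detection": (True, "No matching prior invoice found", 0.02),
    "bank_account_verification": (True, "Bank account matches supplier master", 0.02),
    "gst_verification": (True, "GSTIN matches supplier master", 0.02),
    "po_match": (False, "Unit prices differ from PO on 2 of 3 lines", 0.08),
}


def simulate_run_check(check_name: str) -> Tuple[bool, str, float]:
    """Return (passed, detail, reward_delta) for a named check."""
    # Unknown checks pass trivially and earn only the +0.01 exploration crumb
    return TASK1_CHECK_TABLE.get(
        check_name, (True, "Unknown check - nothing found", 0.01))
```

Keeping the rewards in one table makes it easy to audit that the most diagnostic check (`tolerance_rule`) pays the most.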
**Grader (`grade` method):**

```python
def grade(self, ep: EpisodeData) -> Dict[str, float]:
    checks_run = {c.check_name for c in ep.checks}
    queries_to = {q.target for q in ep.queries}

    # Did the agent correctly diagnose?
    d = 0.0
    if any("unit_price" in c.check_name or "total" in c.check_name
           for c in ep.checks):
        d += 0.12
    if "tolerance_rule" in checks_run:
        d += 0.14
    if "grn_match" in checks_run:
        d += 0.06

    # Did the agent investigate properly?
    i = 0.0
    if "supplier" in queries_to:
        i += 0.10
    if "procurement" in queries_to:
        i += 0.12
    if "tolerance_exception_approval" in ep.rules_applied:
        i += 0.08

    # Correct decision?
    dec = 0.0
    if ep.decision == "approve":
        dec += 0.18
    elif ep.decision == "hold":
        dec += 0.06
    elif ep.decision == "reject":
        dec -= 0.10

    # Correct routing?
    route = 0.12 if "procurement" in ep.routed_to else 0.0

    # Closed cleanly?
    closure = 0.08 if ep.closed else 0.0

    # Efficiency bonus - penalise extra steps
    eff = max(0.0, 0.06 - 0.004 * max(0, ep.step_count - 9))

    total = d + i + dec + route + closure + eff
    return {
        "score": round(max(0.0, min(1.0, total)), 4),
        "diagnosis_score": round(d, 4),
        "investigation_score": round(i, 4),
        "decision_score": round(dec, 4),
        "routing_score": round(route, 4),
        "closure_score": round(closure, 4),
        "efficiency_score": round(eff, 4),
    }
```

---
#### Task 2: DuplicateTaxErrorTask (task2_duplicate_tax)

**The scenario:** A logistics supplier submits INV-2024-891 for transport services. The system flags
it as a possible duplicate. It turns out it IS a duplicate of INV-2024-819 - the numbers differ
by a digit transposition (891 vs 819). That original invoice was already paid. BUT: the original
invoice applied 15% GST when the correct rate is 18%, so the company underpaid ₹3,240 in tax.
The new invoice has the correct rate. So it's both a duplicate AND a legitimate correction.

**task_id:** `"task2_duplicate_tax"`
**max_steps:** `20`
**difficulty:** `"medium"`

**The documents:**

PO (PO-2024-0778): Logistics services

- Mumbai-Pune Transport 20 trips @ ₹4,500 = ₹90,000
- Warehousing charges Feb 2024 @ ₹18,000 = ₹18,000
- Total: ₹1,08,000, Net-15 terms

Invoice (INV-2024-891): Same services, same amounts - correct on the face of it

- Subtotal: ₹1,08,000
- GST 18%: ₹19,440 → this is CORRECT
- Total: ₹1,27,440

GRN (GRN-2024-0740): Services confirmed complete (transport + warehousing).

Supplier Master (SUP-0229 - FastMove Logistics): Bank and GSTIN match the invoice. No fraud signals.

Exception Flag: `POSSIBLE_DUPLICATE` - "Invoice INV-2024-891 closely matches a previously processed invoice."

**Hidden state (not in documents, revealed by checks):**

- INV-2024-819 was paid 12 days ago for ₹1,24,200
- INV-2024-819 applied 15% GST = ₹16,200 (wrong rate)
- Correct 18% GST = ₹19,440
- Tax shortfall on the original payment: ₹3,240
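The hidden-state arithmetic can be checked directly. This is a standalone calculation (integer rupees, not the tasks.py implementation):

```python
# Hidden payment-history facts for Task 2, surfaced only through checks
ORIGINAL_INVOICE = "INV-2024-819"  # paid 12 days before the new invoice arrived
SUBTOTAL = 108_000

paid_gst = SUBTOTAL * 15 // 100     # wrong 15% rate on the paid invoice
correct_gst = SUBTOTAL * 18 // 100  # 18% rate carried by the new invoice
paid_total = SUBTOTAL + paid_gst

# The delta the partial approval should cover
tax_delta = correct_gst - paid_gst
```

Working through it: ₹16,200 was paid as GST, ₹19,440 was due, and the paid total of ₹1,24,200 matches the hidden record above.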
**Key checks and what they reveal:**

- `run_check("duplicate_detection")` → FAILS → finds INV-2024-819 paid 12 days ago, reward +0.18
- `run_check("tax_calculation_verify")` → FAILS → discovers the 15% error on the original, reveals the ₹3,240 delta, reward +0.16
- `cross_check(invoice_number, invoice, payment_history)` → finds the digit transposition, reward +0.15
- `cross_check(tax_amount, invoice, payment_history)` → confirms the ₹3,240 delta, reward +0.14
- `query_internal("finance")` → confirms the tax error on the original payment, reward +0.12
- `query_supplier` → supplier confirms they know and wants partial approval for the delta, reward +0.10
- `apply_rule("partial_approval")` → correct pathway, reward +0.12
- `apply_rule("credit_note_request")` → supplier must issue a credit note for the balance, reward +0.10

**Decision logic:**

`simulate_make_decision`:

- `"partial_approve"` with dup + tax error found: reward +0.28 → optimal
- `"partial_approve"` with dup only: reward +0.14 → incomplete
- `"reject"` with dup found: reward +0.08 → catches the dup, misses the correction
- `"approve"` (pays the full duplicate): reward -0.15 → bad

**Grader weights:**

- diagnosis_score: up to 0.30 (dup found +0.16, tax error found +0.14)
- investigation_score: up to 0.32 (finance queried, supplier queried, rules applied)
- decision_score: up to 0.20 (partial_approve = 0.20, reject = 0.05, approve = -0.15)
- routing_score: up to 0.08
- closure_score: up to 0.06
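The decision table above maps naturally to a small conditional. A sketch (only the four listed combinations are grounded in the spec; any other combination falls through to 0.0, which is an assumption):

```python
def task2_decision_reward(decision: str, found_dup: bool, found_tax: bool) -> float:
    """Reward for make_decision in Task 2, per the table above."""
    if decision == "partial_approve" and found_dup and found_tax:
        return 0.28   # optimal: pay only the tax correction
    if decision == "partial_approve" and found_dup:
        return 0.14   # right instinct, incomplete diagnosis
    if decision == "reject" and found_dup:
        return 0.08   # catches the duplicate, misses the correction
    if decision == "approve":
        return -0.15  # pays the full duplicate
    return 0.0        # combinations not covered by the spec (assumption)
```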
---

#### Task 3: CompoundFraudTask (task3_compound_fraud)

**The scenario:** An IT supplier submits a ₹8,47,500 invoice for 15 laptops. The system flags a bank
account change. But there are FOUR simultaneous fraud signals, and the agent must find all of them.

**task_id:** `"task3_compound_fraud"`
**max_steps:** `25`
**difficulty:** `"hard"`

**The four signals:**

1. **Bank account fraud (Signal 1):** The invoice has a different bank account than the supplier
   master. The change request came from `techcore-solutions.com`. The registered domain is
   `techcore-solutions.in`. A classic Business Email Compromise (BEC) attack.
2. **GSTIN fraud (Signal 2):** The GST number on the invoice (`07AABCT9999X1Z8`) belongs to
   "TechCore Trading Pvt Ltd" - a completely different entity in Delhi. The supplier master shows
   `07AABCT1234Y1Z5` for "TechCore Solutions."
3. **Quantity mismatch (Signal 3):** The invoice claims 15 laptops. The GRN shows only 13 received;
   2 units are still marked as pending.
4. **Price inflation (Signal 4):** ₹56,500/unit on the invoice vs ₹52,000/unit on the PO. That's
   8.65% above the agreed price. No price revision was ever approved.

**Bonus signals (smaller, still notable):**

- The invoice is dated a Sunday (2024-03-10) - unusual for B2B
- The PO was raised Friday March 8 - a 2-day turnaround is suspiciously fast for IT equipment

**The critical trap - channel selection:**

`simulate_query_supplier(question, channel="email")` →
Returns the fraudster's response urging payment to the new account. Reward: **-0.15**.

`simulate_query_supplier(question, channel="phone")` →
The real TechCore Solutions confirms they sent no bank change request. Confirms the fraud. Reward: **+0.15**.

This tests whether the agent follows POL-009 ("bank account change must be verified via
registered phone number - NEVER via email"), which is in the knowledge base.
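The trap can be implemented by branching on the channel, as in this sketch (response strings are illustrative; the real simulator returns a `QueryResult`):

```python
from typing import Tuple


def simulate_query_supplier(question: str, channel: str = "email") -> Tuple[str, float]:
    """The channel decides who answers: email reaches the fraudster, phone the real supplier."""
    if channel == "phone":
        # Registered phone number from the supplier master -> the real company
        response = ("TechCore Solutions (registered number): we never requested "
                    "a bank account change. Treat that request as fraudulent.")
        return response, 0.15

    # Email (or anything else) goes to the lookalike domain
    response = ("Reply from techcore-solutions.com: please process payment "
                "urgently to the updated account.")
    return response, -0.15
```

The asymmetry is the point: the agent only gets the truthful answer by following POL-009's channel rule.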
**Available checks and rewards:**

```
"bank_account_verification"  → FAILS, finds lookalike domain, reward +0.18
"gst_verification"           → FAILS, GST belongs to different entity, reward +0.18
"grn_match"                  → FAILS, 13 vs 15 received, reward +0.14
"email_domain_verification"  → FAILS, lookalike domain confirmed, reward +0.16
"invoice_date_validation"    → FAILS, Sunday flag, reward +0.08
"quantity_check"             → FAILS, quantity inflated, reward +0.12
"price_check"                → FAILS, 8.65% above PO, reward +0.10
"duplicate_detection"        → PASSES (not a dup), reward +0.02
"po_match"                   → FAILS (GST + qty + price all wrong), reward +0.08
```

**Decision logic:**

`simulate_make_decision`:

- `"reject"` → reward = 0.10 + 0.05 × (number of signals found) → max ~0.30
- `"approve"` → reward -0.40 (catastrophic - approved fraud)
- `"partial_approve"` → reward -0.20 (you can't partially approve fraud)
- `"hold"` → reward = 0.08 + 0.03 × signals found → acceptable but not optimal

**Route logic:**

```
"legal"        → reward +0.14   # must escalate to legal
"security"     → reward +0.12   # BEC attack needs security investigation
"finance"      → reward +0.08   # finance needs to block payment
"procurement"  → reward +0.06
```
**Grader - the signal detection scoring:**

```python
def grade(self, ep: EpisodeData) -> Dict[str, float]:
    checks_run = {c.check_name for c in ep.checks}
    bank_found = "bank_account_verification" in checks_run
    gst_found = "gst_verification" in checks_run
    qty_found = "grn_match" in checks_run
    domain_found = "email_domain_verification" in checks_run
    price_found = "price_check" in checks_run

    # Diagnosis - finding all the signals is the whole point
    d = (0.12 if bank_found else 0) + (0.12 if gst_found else 0) \
        + (0.10 if qty_found else 0) + (0.10 if domain_found else 0) \
        + (0.06 if price_found else 0)

    # Investigation - reward for using phone, not email
    i = 0.0
    for q in ep.queries:
        if q.target == "supplier" and q.channel not in ("email", "mail"):
            i += 0.10   # correct channel
        elif q.target == "supplier" and q.channel in ("email", "mail"):
            i -= 0.15   # contacting the fraudster

    queried = {q.target for q in ep.queries}
    if "legal" in queried:
        i += 0.06
    if "security" in queried:
        i += 0.06

    # Decision
    signals = sum([bank_found, gst_found, qty_found, domain_found])
    dec = 0.0
    if ep.decision == "reject":
        dec = 0.08 + 0.03 * signals
    elif ep.decision == "approve":
        dec = -0.35
    elif ep.decision == "partial_approve":
        dec = -0.15
    elif ep.decision == "hold":
        dec = 0.06

    # Routing
    routes = set(ep.routed_to)
    route = (0.10 if "legal" in routes else 0) \
        + (0.06 if "security" in routes else 0) \
        + (0.04 if "finance" in routes else 0)

    closure = 0.06 if (ep.closed and ep.decision == "reject") else 0.0
    eff = max(0.0, 0.04 - 0.002 * max(0, ep.step_count - 12))

    total = d + i + dec + route + closure + eff
    return {
        "score": round(max(0.0, min(1.0, total)), 4),
        "signals_found": sum([bank_found, gst_found, qty_found, domain_found, price_found]),
        "diagnosis_score": round(d, 4),
        "investigation_score": round(i, 4),
        "decision_score": round(dec, 4),
        "routing_score": round(route, 4),
        "closure_score": round(closure, 4),
        "efficiency_score": round(eff, 4),
    }
```
### Task Registry

At the bottom of tasks.py:

```python
TASK_REGISTRY: Dict[str, type] = {
    "task1_price_variance": PriceVarianceTask,
    "task2_duplicate_tax": DuplicateTaxErrorTask,
    "task3_compound_fraud": CompoundFraudTask,
}

ALL_TASKS = list(TASK_REGISTRY.keys())


def make_task(task_id: str) -> BaseTask:
    cls = TASK_REGISTRY.get(task_id)
    if cls is None:
        raise ValueError(f"Unknown task '{task_id}'. Available: {ALL_TASKS}")
    return cls()
```

---
## Step 5 - env/environment.py

This is the `InvoiceExceptionEnv` class. It is the only thing external code needs to import.

```python
class InvoiceExceptionEnv:
    """
    OpenEnv-compatible Invoice Exception Handler environment.

    Usage:
        env = InvoiceExceptionEnv(seed=42)
        obs = env.reset("task1_price_variance")
        result = env.step(Action.run_check("tolerance_rule"))
        scores = env.grade()
    """
```

### Constructor

Takes an optional `seed: Optional[int] = None` for reproducibility.
Initialises `self._rng = random.Random(seed)`.
Initialises `self._task`, `self._ep`, `self._state` to None and `self._done` to False.

### reset(task_id)

```python
def reset(self, task_id: Optional[str] = None) -> EnvironmentState:
    """
    Start a new episode. If task_id is None, picks one at random.
    Returns the initial EnvironmentState showing all documents and available actions.
    """
```

1. Pick the task (random if None)
2. Create `EpisodeData()`
3. Set `self._done = False`
4. Call `self._build_state()` and store the result
5. Return the state

### step(action)

```python
def step(self, action: Union[Action, Dict[str, Any]]) -> StepResult:
    """
    Execute one action. Returns observation, reward, done flag, and info dict.
    Raises RuntimeError if called before reset() or after the episode is done.
    """
```

1. Validate that we're in an active episode
2. Convert a dict to an Action if needed
3. Call `self._dispatch(action)` → gets (reward, info)
4. Increment the step count
5. Check the SLA (step count vs max_steps)
6. Check the done condition (closed or SLA breach)
7. Rebuild the state
8. Return a StepResult
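The step/done bookkeeping in the numbered list can be sketched in isolation (a toy class, not the real environment; the dispatch reward arrives as a parameter here because the simulators live in the tasks):

```python
class StepFlowDemo:
    """Toy model of InvoiceExceptionEnv.step() lifecycle bookkeeping."""

    def __init__(self, max_steps: int = 18):
        self.max_steps = max_steps
        self.step_count = 0
        self.closed = False
        self.done = False

    def step(self, reward_from_dispatch: float, closes_case: bool):
        # 1. Validate we're in an active episode
        if self.done:
            raise RuntimeError("Episode finished - call reset() first")

        # 3-4. dispatch already produced a reward; count the step
        self.step_count += 1
        if closes_case:
            self.closed = True

        # 5-6. An explicit close or an SLA breach ends the episode
        sla_breached = self.step_count >= self.max_steps
        self.done = self.closed or sla_breached
        return reward_from_dispatch, self.done
```

This separation (dispatch computes rewards, step owns the episode lifecycle) keeps the done logic in exactly one place.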
### state()

Non-destructive. Just returns `self._state`. Raises RuntimeError if not initialised.

### grade()

Calls `self._task.grade(self._ep)` and returns the dict.

### _dispatch(action)

The routing function: a single if/elif chain over the ActionTypes.

For each action:

1. Call the appropriate task simulator
2. Update EpisodeData
3. Return (reward, info dict)

Handle repeated actions (inspecting the same field twice, running the same check twice) with a small -0.02 to -0.05 penalty and return early.
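A stripped-down sketch of the dispatch pattern, including the repeat penalty (the function name, the flat 0.10 reward, and the set-based dedup are illustrative; the real method consults the task simulators and `EpisodeData`):

```python
from typing import Any, Dict, Set, Tuple


def dispatch_demo(action_type: str, params: Dict[str, Any],
                  seen_checks: Set[str]) -> Tuple[float, Dict[str, Any]]:
    """Toy dispatch: route by action type, penalise repeated checks."""
    if action_type == "run_check":
        name = params["check_name"]
        if name in seen_checks:
            # Repeating a check earns a small penalty instead of its reward
            return -0.02, {"note": "check already run"}
        seen_checks.add(name)
        return 0.10, {"note": f"ran {name}"}  # real reward is task-specific
    elif action_type == "close_case":
        return 0.05, {"note": "case closed"}
    return 0.0, {"note": "action type not handled in this sketch"}
```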
### _build_state()

Constructs an `EnvironmentState` from the current `_task` and `_ep`. Called after every step.
Also determines the current `CaseStatus` based on the episode data.

### action_space_sample()

Returns a random valid action (for random baseline agents). Uses `self._rng` for reproducibility.

---
## Step 6 - env/__init__.py

```python
from .environment import InvoiceExceptionEnv
from .models import Action, ActionType, EnvironmentState, StepResult
from .tasks import ALL_TASKS, make_task

__all__ = [
    "InvoiceExceptionEnv",
    "Action",
    "ActionType",
    "EnvironmentState",
    "StepResult",
    "ALL_TASKS",
    "make_task",
]
```

---
## Step 7 - Smoke Test Before Continuing

Before writing openenv.yaml or inference.py, verify that the environment works.

```python
# test_smoke.py - run this, do not commit it
from env import InvoiceExceptionEnv, Action, ALL_TASKS

print("Tasks:", ALL_TASKS)

env = InvoiceExceptionEnv(seed=42)
for task_id in ALL_TASKS:
    obs = env.reset(task_id)
    print(f"\n--- {task_id} ---")
    print("Ticket:", obs.exception_flag.description[:80])

    # Take a few actions
    r1 = env.step(Action.run_check(obs.available_checks[0]))
    print(f"Step 1 reward: {r1.reward}")

    r2 = env.step(Action.make_decision("approve", "test"))
    print(f"Step 2 reward: {r2.reward}")

    r3 = env.step(Action.close_case("closed"))
    print(f"Step 3 reward: {r3.reward}, done: {r3.done}")

    scores = env.grade()
    print(f"Grade: {scores['score']}")

print("\nSmoke test passed.")
```

All three tasks must complete without errors. Scores must be in [0.0, 1.0].

---
| ## Step 8 β openenv.yaml | |
| This file must pass `openenv validate`. Write it carefully. | |
```yaml
# openenv.yaml
name: Invoice Exception Handler
version: "1.0.0"
description: |
  An agent learning environment simulating accounts payable exception handling.
  The agent acts as an AP analyst: investigates flagged invoices, applies business
  rules, detects fraud signals, makes decisions, and closes cases with an audit trail.
authors:
  - name: Your Name
    email: your@email.com
license: MIT

tasks:
  - id: task1_price_variance
    name: Price Variance Exception
    difficulty: easy
    description: |
      Office stationery invoice arrives 3.08% above PO. Company tolerance policy
      allows ±2% auto-approval. Agent must detect the variance, verify through
      the tolerance rule, confirm verbal approval with procurement, and approve
      with a PO amendment request.
    max_steps: 18
    optimal_score: 1.0
    min_passing_score: 0.60
  - id: task2_duplicate_tax
    name: Duplicate Invoice with Tax Error
    difficulty: medium
    description: |
      Logistics supplier submits INV-2024-891, a duplicate of paid INV-2024-819
      (digit transposition: 891 vs 819). Original invoice had wrong GST rate (15%
      vs correct 18%), so the company overpaid ₹3,240. New invoice has correct rate.
      Agent must detect the duplicate, identify the tax error in the original,
      and partially approve only the ₹3,240 tax correction.
    max_steps: 20
    optimal_score: 1.0
    min_passing_score: 0.50
  - id: task3_compound_fraud
    name: Compound Fraud Signals
    difficulty: hard
    description: |
      IT equipment supplier invoice with four simultaneous fraud signals: bank
      account changed via BEC attack (lookalike email domain), GSTIN belongs to
      a different entity, 2 of 15 laptops not yet received, and unit price 8.65%
      above PO. Agent must find all signals, use the correct communication channel
      (phone, not email, which would contact the fraudster), and escalate to legal
      and security.
    max_steps: 25
    optimal_score: 1.0
    min_passing_score: 0.40

observation_space:
  type: object
  description: EnvironmentState Pydantic model
  fields:
    task_id: {type: string}
    step_number: {type: integer}
    case_status: {type: string, enum: [open, in_review, decided, routed, closed]}
    purchase_order: {type: object, description: "PO with line items and terms"}
    invoice: {type: object, description: "Supplier invoice with line items and tax"}
    grn: {type: object, description: "Goods receipt - what actually arrived"}
    supplier_master: {type: object, description: "Verified supplier record"}
    exception_flag: {type: object, description: "Why the system flagged this invoice"}
    inspections: {type: array, description: "Fields the agent has inspected"}
    checks_run: {type: array, description: "Validation checks completed"}
    queries: {type: array, description: "Internal and supplier queries"}
    rules_applied: {type: array, description: "Business rules applied"}
    decision: {type: string, nullable: true}
    routed_to: {type: array}
    available_actions: {type: array}
    available_checks: {type: array}
    available_rules: {type: array}
    knowledge_base: {type: array}
    cumulative_reward: {type: number}

action_space:
  type: object
  description: Action with type and params
  actions:
    inspect_field:
      params: {document: string, field: string}
    cross_check:
      params: {field: string, doc_a: string, doc_b: string}
    run_check:
      params: {check_name: string}
    query_supplier:
      params: {question: string, channel: string}
    query_internal:
      params: {department: string, question: string}
    apply_rule:
      params: {rule_id: string}
    make_decision:
      params: {decision: string, reason: string}
    route_to:
      params: {team: string, notes: string}
    close_case:
      params: {summary: string}

reward:
  range: [-1.0, 1.0]
  description: |
    Shaped reward at every step. Relevant inspections: +0.01 to +0.14.
    Diagnostics revealing issues: +0.08 to +0.18. Correct fixes: +0.08 to +0.30.
    Wrong decision on fraud: -0.15 to -0.40. Repeat actions: -0.02 to -0.05.
    SLA breach: -0.10.

grading:
  method: task_grader
  scores:
    - score  # 0.0-1.0 overall
    - diagnosis_score
    - investigation_score
    - decision_score
    - routing_score
    - closure_score
    - efficiency_score

api:
  reset:
    signature: "reset(task_id: str | None = None) -> EnvironmentState"
  step:
    signature: "step(action: Action | dict) -> StepResult"
  state:
    signature: "state() -> EnvironmentState"
  grade:
    signature: "grade() -> Dict[str, float]"

http_endpoints:
  - path: /reset
    method: POST
    description: Reset environment, returns EnvironmentState JSON
  - path: /step
    method: POST
    description: Execute action, returns StepResult JSON
  - path: /state
    method: GET
    description: Current state, returns EnvironmentState JSON
  - path: /grade
    method: POST
    description: Grade current episode
  - path: /health
    method: GET
    description: Health check

dependencies:
  python: ">=3.11"
  packages:
    - pydantic==2.7.1
    - fastapi==0.111.0
    - uvicorn==0.29.0
    - gradio==4.36.1
    - openai==1.35.3
    - pyyaml==6.0.1

docker:
  port: 7860
  health_check: /health
```
---

## Step 9 – inference.py

This is the most critical file for the hackathon validator. Get the format exactly right.

### Required env vars

```python
import os

API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY", "")
```
### Required stdout format

Every line to stdout must be exactly:

```
[START] task=<task_id> env=invoice-exception-handler model=<model_name>
[STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
[END] success=<true|false> steps=<n> score=<0.000> rewards=<r1,r2,...>
```

Rules (do not deviate):

- One `[START]` line at episode begin
- One `[STEP]` line per step, immediately after `env.step()` returns
- One `[END]` line after the episode, always emitted even on exception
- `reward` and all values in `rewards` formatted to exactly 2 decimal places
- `score` formatted to exactly 3 decimal places
- `done` and `success` are lowercase: `true` or `false`
- `error` is the error message string, or exactly `null` if none
- No newlines within a single line
- `flush=True` on every print so the validator sees output in real time
### System prompt for the LLM

Write a clear system prompt that tells the model:

- It is an AP analyst handling a flagged invoice
- It has a structured action space (list all 9 action types)
- It must respond in JSON: `{"type": "...", "params": {...}}`
- It should investigate before deciding
- Never approve without checking, never contact supplier by email if fraud is suspected
- Available documents: PO, Invoice, GRN, Supplier Master, Exception Flag
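One possible wording that covers all six points (the exact phrasing is up to you; only the JSON response contract is load-bearing):

```python
SYSTEM_PROMPT = """\
You are an accounts payable (AP) analyst handling a flagged invoice exception.

You act through a structured action space with 9 action types:
inspect_field, cross_check, run_check, query_supplier, query_internal,
apply_rule, make_decision, route_to, close_case.

Available documents: PO, Invoice, GRN, Supplier Master, Exception Flag.

Rules:
- Investigate before deciding: inspect fields and run checks first.
- Never approve an invoice without at least one validation check.
- If fraud is suspected, never contact the supplier by email; use phone.

Respond with ONLY a JSON object, no prose:
{"type": "<action_type>", "params": {...}}
"""
```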
### User prompt per step

Include in the user prompt:

- Current step number and max steps
- The exception flag (what was flagged and why)
- Available checks (list them)
- Available rules (list them)
- Knowledge base entries (the policy list)
- What has been done so far (checks run, queries made, inspections done)
- Current cumulative reward
- Ask for next action as JSON
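A minimal `build_prompt` covering those items might look like this. It is a sketch only: it assumes the observation exposes the fields from the observation table as attributes, and it summarizes history via the strings run_task accumulates rather than re-listing checks and queries individually.

```python
def build_prompt(obs, step, max_steps, history):
    """Assemble the per-step user prompt from the current observation."""
    lines = [
        f"Step {step} of {max_steps}.",
        f"Exception flag: {obs.exception_flag.flag_description}",
        "Available checks: " + ", ".join(obs.available_checks),
        "Available rules: " + ", ".join(obs.available_rules),
        "Policies:",
    ]
    lines += [f"- {entry}" for entry in obs.knowledge_base]

    if history:
        lines.append("Actions so far:")
        lines += history[-10:]  # keep the prompt bounded on long episodes

    lines.append(f"Cumulative reward: {obs.cumulative_reward:+.2f}")
    lines.append('Reply with the next action as JSON: {"type": "...", "params": {...}}')
    return "\n".join(lines)
```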
### Parsing LLM output

```python
import json
import re


def parse_action(raw_text: str) -> dict:
    """
    Parse the model's response into an action dict.
    Handles markdown code fences, extra whitespace, and minor formatting errors.
    Falls back to run_check(po_match) if parsing fails.
    """
    text = raw_text.strip()

    # Remove ```json or ``` fences if present
    if text.startswith("```"):
        lines = text.split("\n")
        text = "\n".join(lines[1:-1] if lines[-1] == "```" else lines[1:])

    try:
        return json.loads(text.strip())
    except json.JSONDecodeError:
        # Try to find JSON within the text
        match = re.search(r'\{.*\}', text, re.DOTALL)
        if match:
            try:
                return json.loads(match.group())
            except json.JSONDecodeError:
                pass

    # Safe fallback
    return {"type": "run_check", "params": {"check_name": "po_match"}}
```
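The run_task loop below also needs a `call_llm` helper, which the spec does not spell out. A minimal version could look like this; the system prompt default and the temperature/max_tokens values are assumptions, not requirements:

```python
import os

MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")


def call_llm(client, user_prompt: str, system_prompt: str = "You are an AP analyst.") -> str:
    """One chat completion via the OpenAI client; returns the raw model text."""
    response = client.chat.completions.create(
        model=MODEL_NAME,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0.2,   # low temperature keeps the JSON output stable
        max_tokens=300,
    )
    return response.choices[0].message.content
```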
### Overall structure

```python
import json

from openai import OpenAI

from env import InvoiceExceptionEnv, ALL_TASKS


def run_task(client, env, task_id, max_steps=20):
    """Run one task episode and return (steps_taken, score, rewards)."""
    rewards = []
    print(f"[START] task={task_id} env=invoice-exception-handler model={MODEL_NAME}", flush=True)

    obs = env.reset(task_id)
    history = []

    for step in range(1, max_steps + 1):
        # Build prompt from observation
        user_prompt = build_prompt(obs, step, max_steps, history)

        # Call LLM
        raw = call_llm(client, user_prompt)
        action_dict = parse_action(raw)

        # Execute
        try:
            result = env.step(action_dict)
            reward = result.reward
            done = result.done
            error = None
        except Exception as e:
            reward = 0.0
            done = False
            error = str(e)
            result = None

        rewards.append(reward)
        action_str = json.dumps(action_dict)
        print(
            f"[STEP] step={step} action={action_str} "
            f"reward={reward:.2f} done={str(done).lower()} "
            f"error={error or 'null'}",
            flush=True
        )
        history.append(f"Step {step}: {action_str} -> reward {reward:+.2f}")

        if result:
            obs = result.observation
        if done:
            break

    score = env.grade()["score"]
    success = score >= 0.5  # simple global threshold; stricter per-task thresholds live in openenv.yaml
    steps_taken = min(step, max_steps)
    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
    print(
        f"[END] success={str(success).lower()} steps={steps_taken} "
        f"score={score:.3f} rewards={rewards_str}",
        flush=True
    )
    return steps_taken, score, rewards


def main():
    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
    env = InvoiceExceptionEnv(seed=42)
    for task_id in ALL_TASKS:
        run_task(client, env, task_id)


if __name__ == "__main__":
    main()
```
---

## Step 10 – app.py

The app.py serves two purposes:

1. Provides the FastAPI HTTP endpoints that the validator pings (`POST /reset` must return 200)
2. Provides a Gradio UI for interactive exploration on HF Spaces

### Architecture

Run both FastAPI and Gradio in the same process on port 7860.
Use `gr.mount_gradio_app` to mount Gradio on FastAPI, or run Gradio alongside FastAPI.
The cleanest approach:
```python
import gradio as gr
import uvicorn
from fastapi import FastAPI
from fastapi.responses import JSONResponse

from env import InvoiceExceptionEnv, ALL_TASKS

app = FastAPI(title="Invoice Exception Handler OpenEnv")
env = InvoiceExceptionEnv(seed=42)  # shared environment instance


@app.post("/reset")
async def http_reset(body: dict | None = None):
    task_id = (body or {}).get("task_id")
    obs = env.reset(task_id)
    return JSONResponse(obs.model_dump(mode="json"))


@app.post("/step")
async def http_step(body: dict):
    result = env.step(body)
    return JSONResponse(result.model_dump(mode="json"))


@app.get("/state")
async def http_state():
    return JSONResponse(env.state().model_dump(mode="json"))


@app.post("/grade")
async def http_grade():
    return JSONResponse(env.grade())


@app.get("/tasks")
async def http_tasks():
    return JSONResponse(ALL_TASKS)


@app.get("/health")
async def health():
    return JSONResponse({"status": "ok", "version": "1.0.0"})


# Mount Gradio at the root path; the API routes registered above keep precedence.
# build_gradio_ui() is described in the next section.
gradio_app = build_gradio_ui()
app = gr.mount_gradio_app(app, gradio_app, path="/")

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=7860)
```
### Gradio UI – what to build

Keep the UI simple and functional. Three tabs:

**Tab 1: Manual Play**

- Dropdown to select task (labels: "Task 1 - Price Variance (Easy)", etc.)
- Reset button
- Shows the exception flag, the key document fields, and available actions
- Dropdown or textbox to compose and submit an action
- Shows reward, cumulative reward, and status after each step
- Shows grade breakdown when episode ends

**Tab 2: Agent Demo**

- Select task
- Shows a hardcoded optimal action sequence running step by step
- Good for demonstrating the environment to judges who won't run code

**Tab 3: API Reference**

- Code examples for each action type
- Reward table
- Grader score breakdown explanation

---
## Step 11 – Dockerfile

```dockerfile
FROM python:3.11-slim

# Install system dependencies
RUN apt-get update \
    && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*

# Create non-root user (required by HF Spaces)
RUN useradd -m -u 1000 appuser
WORKDIR /app

# Copy and install dependencies first (layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY --chown=appuser:appuser . .
USER appuser

EXPOSE 7860

# Health check: pings the /health endpoint
HEALTHCHECK --interval=30s --timeout=10s --start-period=20s --retries=3 \
    CMD curl -f http://localhost:7860/health || exit 1

ENV PYTHONUNBUFFERED=1
ENV GRADIO_SERVER_NAME=0.0.0.0
ENV GRADIO_SERVER_PORT=7860

CMD ["python", "app.py"]
```
---

## Step 12 – End-to-End Test Checklist

Before pushing, check every item in this list.

```bash
# 1. Imports work
python -c "from env import InvoiceExceptionEnv, Action, ALL_TASKS; print('OK')"

# 2. All three tasks complete without errors
python -c "
from env import InvoiceExceptionEnv, Action, ALL_TASKS
env = InvoiceExceptionEnv(seed=42)
for t in ALL_TASKS:
    obs = env.reset(t)
    result = env.step(Action.run_check(obs.available_checks[0]))
    result = env.step(Action.make_decision('reject', 'test'))
    result = env.step(Action.close_case('test'))
    score = env.grade()['score']
    assert 0.0 <= score <= 1.0, f'Score out of range: {score}'
    print(f'{t}: {score}')
print('All tasks OK')
"

# 3. Graders are deterministic
python -c "
from env import InvoiceExceptionEnv, Action
env1 = InvoiceExceptionEnv(seed=42)
env2 = InvoiceExceptionEnv(seed=42)
obs1 = env1.reset('task1_price_variance')
obs2 = env2.reset('task1_price_variance')
env1.step(Action.run_check('tolerance_rule'))
env2.step(Action.run_check('tolerance_rule'))
env1.step(Action.make_decision('approve', 'test'))
env2.step(Action.make_decision('approve', 'test'))
env1.step(Action.close_case('done'))
env2.step(Action.close_case('done'))
s1 = env1.grade()['score']
s2 = env2.grade()['score']
assert s1 == s2, f'Non-deterministic: {s1} vs {s2}'
print(f'Deterministic: {s1}')
"

# 4. inference.py log format (with a fake API key)
# The API call will fail, but the [START] line must still print first
API_BASE_URL=https://api.example.com HF_TOKEN=fake MODEL_NAME=test \
    python inference.py 2>/dev/null | head -n 1 | grep -q '^\[START\]' \
    && echo 'START line OK'

# 5. Docker builds
docker build -t invoice-env-test .

# 6. Docker runs and /health returns 200
docker run -d -p 7860:7860 --name test-env invoice-env-test
sleep 15
curl -f http://localhost:7860/health
curl -s -X POST http://localhost:7860/reset -H "Content-Type: application/json" -d '{}'
docker stop test-env && docker rm test-env

# 7. openenv validate (if openenv-core is installed)
pip install openenv-core
openenv validate
```
---

## Step 13 – documents/ Folder

Create these four files. Keep them updated as the project evolves.

### documents/CHANGELOG.md

```markdown
# Changelog

All changes to the Invoice Exception Handler environment are recorded here.
Format: Date | Version | What changed | Why

---

## [1.0.0] - 2025-01-20

### Added

- Initial implementation of InvoiceExceptionEnv with full OpenEnv API
- Three tasks: task1_price_variance, task2_duplicate_tax, task3_compound_fraud
- Pydantic v2 typed models for all environment objects
- FastAPI HTTP endpoints for HF Spaces validation
- Gradio UI for interactive exploration
- inference.py using OpenAI client with [START][STEP][END] log format
- openenv.yaml spec file
- Dockerfile for HF Spaces deployment

### Design decisions

- Used pure Python simulation (no external databases) for portability and determinism
- Compound fraud task has four signals to prevent simple greedy agents from scoring well
- Channel selection in Task 3 (phone vs email) tests policy knowledge, not just anomaly detection
- Grader uses sub-scores to allow partial credit for partial solutions
```
### documents/ARCHITECTURE.md

Document the system architecture. Include:

- A text diagram of how the components connect
- Why FastAPI and Gradio in the same process (HF Spaces constraint)
- Why Pydantic v2 (spec requirement, validation)
- How EpisodeData separates mutable state from immutable document context
- Why tasks are separate classes (easy to extend)
### documents/BASELINE-SCORES.md

Record the reproducible baseline scores. Run them yourself and copy the output here.

```markdown
# Baseline Scores

Recorded on: 2025-01-20
Seed: 42
Machine: 2 vCPU, 8GB RAM

## Random Agent (action_space_sample())

| Task | Score | Steps |
|------|-------|-------|
| task1_price_variance | ~0.18 | 18 (SLA breach) |
| task2_duplicate_tax | ~0.12 | 20 (SLA breach) |
| task3_compound_fraud | ~0.08 | 25 (SLA breach) |
| **Average** | **~0.13** | |

## Optimal Agent (hardcoded correct actions)

| Task | Score | Steps |
|------|-------|-------|
| task1_price_variance | ~0.98 | 9 |
| task2_duplicate_tax | ~0.95 | 10 |
| task3_compound_fraud | ~0.92 | 14 |
| **Average** | **~0.95** | |
```
---

## Step 14 – Push and Verify

```bash
# Final commit
git add .
git commit -m "feat: complete invoice exception handler v1.0.0

- 3 tasks with deterministic graders (easy/medium/hard)
- Full OpenEnv API: reset/step/state/grade
- FastAPI HTTP endpoints for validator (/reset, /step, /state, /health)
- Gradio UI for HF Spaces
- inference.py with OpenAI client and [START][STEP][END] format
- openenv.yaml spec
- Dockerfile for HF Spaces deployment
- documents/ folder with PRD, changelog, architecture, baseline scores"
git push origin main

# Deploy to HF Spaces (if not using git-based deployment)
# The Dockerfile and app.py handle this automatically when pushed to HF
```
---

## Action Space Reference

| Action Type | Required Params | Description |
|---|---|---|
| `inspect_field` | `document, field` | Look at a specific field in a document |
| `cross_check` | `field, doc_a, doc_b` | Compare a field between two documents |
| `run_check` | `check_name` | Run a named validation check |
| `query_supplier` | `question, channel` | Ask the supplier something (channel: phone or email) |
| `query_internal` | `department, question` | Ask an internal team |
| `apply_rule` | `rule_id` | Apply a business policy rule |
| `make_decision` | `decision, reason` | approve / reject / hold / partial_approve |
| `route_to` | `team, notes` | Escalate to a team |
| `close_case` | `summary` | Close with an audit trail summary |
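Over HTTP, each action goes to `POST /step` as a JSON body with `type` and `params` keys matching the table above. The field values in these examples are illustrative only:

```python
import json

# Illustrative payloads for POST /step; param names follow the action table
example_actions = [
    {"type": "inspect_field", "params": {"document": "invoice", "field": "bank_account"}},
    {"type": "cross_check", "params": {"field": "unit_price", "doc_a": "invoice", "doc_b": "purchase_order"}},
    {"type": "query_supplier", "params": {"question": "Please confirm the bank account change.", "channel": "phone"}},
    {"type": "make_decision", "params": {"decision": "hold", "reason": "fraud signals under review"}},
]

for action in example_actions:
    print(json.dumps(action))
```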
---

## Observation Space Reference

| Field | Type | Description |
|---|---|---|
| `task_id` | str | Which task is running |
| `step_number` | int | Current step |
| `case_status` | str | open / in_review / decided / routed / closed |
| `purchase_order` | PurchaseOrder | What was agreed to be purchased |
| `invoice` | Invoice | What the supplier is claiming |
| `grn` | GoodsReceiptNote | What actually arrived |
| `supplier_master` | SupplierMaster | Verified supplier record |
| `exception_flag` | ExceptionFlag | Why this invoice was flagged |
| `inspections` | List | Fields already inspected |
| `checks_run` | List | Validation checks already run |
| `queries` | List | Queries made and responses |
| `rules_applied` | List | Business rules applied |
| `decision` | str? | Current decision if made |
| `routed_to` | List | Teams this case has been escalated to |
| `available_actions` | List | All 9 action types |
| `available_checks` | List | Check names valid for this task |
| `available_rules` | List | Rule IDs valid for this task |
| `knowledge_base` | List | Policy entries relevant to this task |
| `cumulative_reward` | float | Sum of all rewards so far |
---

## Reward Reference

| Event | Reward |
|---|---|
| Inspecting a key field that reveals an anomaly | +0.08 to +0.14 |
| Inspecting a routine field | +0.01 to +0.06 |
| Cross-check that finds a mismatch | +0.12 to +0.15 |
| Running a check that finds an issue | +0.08 to +0.18 |
| Querying the right person | +0.04 to +0.12 |
| Contacting supplier via wrong channel (Task 3) | -0.15 |
| Applying the correct business rule | +0.08 to +0.12 |
| Applying the wrong rule | -0.05 to -0.10 |
| Correct decision (approve/reject/partial) | +0.18 to +0.28 |
| Approving a fraudulent invoice | -0.35 to -0.40 |
| Wrong rejection (task1) | -0.10 |
| Routing to the right team | +0.06 to +0.14 |
| Clean case closure | +0.06 to +0.12 |
| Repeat action | -0.02 to -0.05 |
| SLA breach (exceed max_steps) | -0.10 |
---

## Expected Baseline Scores

These are the scores you should see when running `inference.py` with a good LLM.

| Task | Difficulty | Random Agent | Rule Agent | LLM Agent (Qwen-72B) |
|---|---|---|---|---|
| task1_price_variance | Easy | ~0.18 | ~0.85 | ~0.80 |
| task2_duplicate_tax | Medium | ~0.12 | ~0.72 | ~0.68 |
| task3_compound_fraud | Hard | ~0.08 | ~0.55 | ~0.45 |

The hard task should be genuinely hard for LLMs: a score of 0.45 is expected, not a failure.
---

## Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| `API_BASE_URL` | Yes | `https://router.huggingface.co/v1` | LLM endpoint |
| `MODEL_NAME` | Yes | `Qwen/Qwen2.5-72B-Instruct` | Model to use |
| `HF_TOKEN` | Yes | – | API key for the LLM endpoint |
| `ANTHROPIC_API_KEY` | No | – | Only if using Anthropic models directly |
---

## Setup Instructions

### Local Development

```bash
# Clone the repo
git clone https://github.com/YOUR_USERNAME/invoice-exception-handler.git
cd invoice-exception-handler

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run the app locally
python app.py
# Visit http://localhost:7860
```
### Run Inference

```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="your-token-here"
python inference.py
```

### Docker

```bash
docker build -t invoice-exception-handler .
docker run -p 7860:7860 \
    -e API_BASE_URL="https://router.huggingface.co/v1" \
    -e MODEL_NAME="Qwen/Qwen2.5-72B-Instruct" \
    -e HF_TOKEN="your-token-here" \
    invoice-exception-handler
```
### HF Spaces Deployment

1. Create a new Space with the Docker SDK (not the Gradio SDK; the Space builds from the Dockerfile)
2. Push this repository to it
3. Add secrets in Space settings: `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN`
4. The Space will build and deploy automatically from the Dockerfile

### Validate Submission

```bash
# Install validator
pip install openenv-core

# Validate the spec
openenv validate

# Run the full submission validator script
chmod +x scripts/validate-submission.sh
./scripts/validate-submission.sh https://your-space.hf.space .
```
---

## Common Mistakes to Avoid

1. **Don't rename `inference.py`.** The validator looks for exactly `inference.py` in the repo root.
2. **Don't use the Anthropic SDK in inference.py.** The spec requires the OpenAI client. Use `from openai import OpenAI`.
3. **Don't forget `flush=True` on print statements.** The validator reads stdout line by line. Without flush, logs may not appear.
4. **Don't let the Gradio UI crash the FastAPI server.** If the UI has an error, it should fail gracefully, not bring down `/reset`.
5. **Don't hardcode the model name.** Always read it from `os.getenv("MODEL_NAME")`.
6. **Don't put business logic in models.py.** That file is just data shapes.
7. **Don't mutate documents during a step.** The documents (PO, Invoice, GRN) are fixed for the duration of an episode. Only EpisodeData changes.
8. **Don't forget to test determinism.** The same seed plus the same actions must give the same score. Run the determinism test.
9. **Don't skip the docker build test.** The validator builds your Docker image. If it doesn't build, you're disqualified.
10. **Don't forget the changelog.** Update `documents/CHANGELOG.md` before every push.
---

## License

MIT License. See the LICENSE file.