---
title: InvoiceOps Environment Server
emoji: 📄
colorFrom: yellow
colorTo: gray
sdk: docker
pinned: false
app_port: 8000
base_path: /
tags:
  - openenv
  - finance
  - accounts-payable
  - invoices
---
# InvoiceOps Environment

Submitted by team: Markov
InvoiceOps is a deterministic OpenEnv environment for accounts payable (AP) invoice exception handling. Each episode is one invoice case. The agent inspects surfaced exceptions, opens typed supporting artifacts, optionally runs duplicate checks, writes structured notes, saves line and header resolutions, and submits the case for deterministic grading.
In real AP operations, this is the core decision problem: determine whether an invoice can be paid now, partially released, or routed for further review based on invoices, POs, receipts, approval status, and policy evidence.
The workflow is loosely modeled on real enterprise AP controls used in systems such as Microsoft Dynamics 365 Accounts payable, including invoice review and approval, invoice matching, workflow routing, and partial payment handling.
This environment is intentionally small and CPU-friendly, but it still measures real AP judgment:
- evidence gathering before payment decisions
- line-level vs header-level separation
- duplicate-review strategy selection
- receipt support judgment
- partial release vs full hold
- chronology-aware exception handling
- routing to the correct follow-up owner when payment is not safe
## Public Benchmark

The public benchmark has four tasks. `easy` is a warm-up. `medium` and `medium_plus` test distinct mid-tier capabilities. `hard` is the composition case.
| Task | Core burden | Best outcome |
|---|---|---|
| `easy` | Start a missing approval workflow for a non-PO invoice | Hold and route to requester |
| `medium` | Clear a duplicate exception using the correct evidence path | Approve both lines and release payment |
| `medium_plus` | Combine duplicate clearance with mixed line outcomes | Approve L1, hold L2, release approved lines |
| `hard` | Combine duplicate review, invoice arithmetic, receipt chronology, and a tax block | Approve L1 and L3, hold L2, hold header to tax |
## Task Details

### easy
Non-PO invoice with no initiated approval workflow. The invoice amount is within requester authority, so the correct action is to hold and route to requester.
### medium
PO-backed invoice with a possible duplicate flag. The decisive evidence appears only after a duplicate search on the normalized invoice number. Approving safely requires the right duplicate path plus PO and receipt review.
### medium_plus
PO-backed invoice with a possible duplicate flag and one short-received line above the de minimis threshold. The agent must clear the duplicate, separate line outcomes correctly, and use release_approved_lines instead of a blanket hold.
### hard
Project invoice with interacting burdens: duplicate review, de minimis invoice arithmetic on L1, chronology-sensitive receipt support on L2, and a tax header block that routes to tax.
## Action Space

`InvoiceOpsAction` is a typed action model with these actions:

- `open_artifact`
- `inspect_exception`
- `run_duplicate_check`
- `add_note`
- `set_line_resolution`
- `set_header_resolution`
- `submit_case`
## Observation Space

`InvoiceOpsObservation` includes:
- queue-level case summary
- available artifacts
- most recently opened artifact
- exception stubs and inspected exception details
- duplicate candidates surfaced by the chosen strategy
- saved notes
- draft line and header resolutions
- progress counters
- final deterministic submission report after submit
## Scoring
The reward function provides dense trajectory signal for useful work such as first-time artifact opens, exception inspection, duplicate checks, notes, and valid saved resolutions. It penalizes invalid or redundant actions and inefficient trajectories.
Final grading is deterministic and two-stage:
- Assign a `decision_band`: `best`, `safe_suboptimal`, `wrong`, or `unsafe`.
- Score within that band using core decision quality, timely evidence, structured documentation coverage, and efficiency.
Important grading rule: `best` outcomes require the agent to uncover the required evidence before saving the decision. Conservative holds can still earn `safe_suboptimal` when the observed evidence justifies caution.
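The two-stage structure can be sketched as follows. Only the four band names come from the grading description above; the band boundaries and the within-band weights are invented purely for illustration:

```python
# Illustrative two-stage grader. Band names are from the README;
# the numeric ranges and weights below are assumptions.
BAND_RANGES = {
    "best": (0.8, 1.0),
    "safe_suboptimal": (0.5, 0.8),
    "wrong": (0.2, 0.5),
    "unsafe": (0.0, 0.2),
}

def final_score(band: str, quality: float, evidence: float,
                documentation: float, efficiency: float) -> float:
    """Stage 1 fixes the band; stage 2 interpolates within its range."""
    lo, hi = BAND_RANGES[band]
    within = 0.4 * quality + 0.3 * evidence + 0.2 * documentation + 0.1 * efficiency
    return lo + (hi - lo) * within
```

Note how the band assignment dominates: a perfect within-band score in `safe_suboptimal` still lands below the floor of `best`, which is exactly the capping behavior described in Design Choices.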
## Design Choices
This benchmark was iterated on, not created in one pass. We tried weaker task and grader shapes first, then removed designs that were easy to game or that clustered strong models for the wrong reasons.
Key anti-gaming choices:
- no pre-opened artifacts, auto-inspected exceptions, or auto-run duplicate checks
- no hidden scenario-specific solver logic in the environment or grader
- no prose grading; scores depend on typed actions, saved resolutions, observed evidence, and timing
- fallback runs are zeroed in baseline mean scoring
- conservative blanket holds are capped at `safe_suboptimal`; they do not earn `best`
Main lessons from iteration:
- making partial credit harsher did not improve the benchmark; harder tasks had to require better evidence use and better judgment
- gating on restated citation strings was too brittle; grading now depends on evidence actually uncovered before the decision was saved
## Local Setup

```shell
cd invoiceops_env
uv sync --extra dev
uv run pytest -q
uv run server --port 8000
```
Run validation from the environment root:

```shell
openenv validate .
openenv validate --url http://localhost:8000
```
If `openenv` is not installed in the current environment:

```shell
uvx --from openenv-core openenv validate .
uvx --from openenv-core openenv validate --url http://localhost:8000
```
## Baseline

The root `inference.py` script is the reproducible baseline.
- OpenAI Python client
- default `ENV_URL`: `https://ehsaaniqbal-invoiceops-env.hf.space`
- default `API_BASE_URL`: `https://router.huggingface.co/v1`
- default `MODEL_NAME`: `zai-org/GLM-5.1`
- fallback tasks are zeroed in `mean_score` by default, while raw environment scores are still preserved
- run artifacts are written under `outputs/evals/`
Verified baseline on the current public benchmark:

- model: `zai-org/GLM-5.1`
- mean score: `0.6149`
- task scores: `easy 0.9862`, `medium 0.9628`, `medium_plus 0.3130`, `hard 0.1975`
Run it with:

```shell
cd invoiceops_env
HF_TOKEN=... \
uv run python inference.py
```
This matches the competition-style run: `inference.py` talks to the deployed Space by default and uses the Hugging Face router unless you override `ENV_URL`, `API_BASE_URL`, or `MODEL_NAME`.
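For example, to point the baseline at a locally running server instead of the deployed Space (the local URL assumes the `uv run server --port 8000` command from Local Setup; the model override shown is just a placeholder):

```shell
cd invoiceops_env
HF_TOKEN=... \
ENV_URL=http://localhost:8000 \
MODEL_NAME=zai-org/GLM-5.1 \
uv run python inference.py
```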
Optional environment variables: `HF_TOKEN`, `API_BASE_URL`, `MODEL_NAME`, `ENV_URL`, `EVAL_RUN_NAME`, `MAX_TOKENS`, `RETRY_MAX_TOKENS`, `STRICT_BASELINE_SCORING`.
## Docker

```shell
cd invoiceops_env
docker build -t invoiceops-env:latest .
docker run -p 8000:8000 invoiceops-env:latest
```