title: HARvestGym
emoji: 🕸️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 8000
pinned: false
tags:
- openenv
- reinforcement-learning
- api-agent
- web-tasks
base_path: /web
HARvestGym
Can a small model learn to reverse-engineer any web application's API, and complete real tasks through that API, without ever opening a browser?
Web applications are full of APIs. Every click in a browser triggers an HTTP call with a precise schema, a specific authentication header, and an exact sequence of prerequisites. HARvestGym trains a small model to do all of that directly: given a task and a URL, it discovers the relevant endpoints, figures out what each one needs, chains the calls in the right order, and completes the task without any browser.
The model starts with nothing: no schema, no documentation, no endpoint list. It uses tools to explore, issuing requests, inspecting responses, and building up its own understanding of how the application works. This is what a developer does when they reverse-engineer an API. The model learns to do the same.
Why It Matters
The skill HARvestGym trains (discover an API, understand its dependencies, complete a task through it) is one of the most broadly useful things an AI agent can do.
| Application | What it unlocks |
|---|---|
| API reverse engineering | Point an agent at any web app and get a working API map: no docs, no SDK, no source code required |
| Automatic API documentation | Capture real call sequences and parameter provenance to produce accurate, living documentation from observed traffic |
| Browser-free automation | Automate form submissions, cart flows, content management, and data extraction at the HTTP layer, immune to UI redesigns |
| Any website as MCP tools | Turn any web application into a set of callable agent tools on the fly, with no official integration needed |
| Security & API auditing | Autonomously probe endpoints, trace auth flows, and map attack surfaces for penetration testing and compliance review |
How It Works
Task + App URL
      │
      ▼
Policy Model (RL Agent)
  small model, no prior knowledge of the app
  Step 1  ──► browser_agent(task, url)    → filtered API endpoint list
  Step 2+ ──► search_endpoints(query)     → full schema for a specific endpoint
          ──► curl_exec(command)          → execute HTTP call, get response
          ──► search_episode_data(query)  → search prior response bodies
          ──► done(result)                → declare task complete
      │
      ▼
Live WebArena Apps (EC2) ←── real HTTP responses (always live, never mocked)
      │
      ▼
Deterministic Judge (compares against ground truth API catalog)
      │
      ▼
Reward Signal ──► GRPO ──► updated policy
The agent calls browser_agent once at the start. This runs a real browser to complete the same task while recording all network traffic, then returns the filtered list of API endpoints observed. The agent now has a map of what endpoints exist. What it does not know:
- which of those endpoints are actually needed for this specific task
- in what order they must be called (you cannot add to a cart before the cart exists)
- where each required parameter value comes from
- how to re-authenticate if a session expires mid-episode
The model must learn to discover all of this on its own.
Architecture
TRAINING LOOP

Task + App URL
  │
  ▼
Policy Model (RL Agent)
  small model, no prior knowledge of the app
  Observation: task + history + session_state + last_result
  Step 1  ──► browser_agent(task, url)
  Step 2+ ──► search_endpoints(query)
          ──► curl_exec(command)
          ──► search_episode_data(query)
          ──► done(result)
  │
  ├──► Browser Agent (step 1 only)
  │      Training:  load pre-recorded cached HAR from disk, or launch on real browser
  │      Inference: launch real browser via Playwright + bu-30b-a3b-preview
  │      Both paths produce:
  │        • Filtered HAR
  │        • OpenAPI-like spec
  │        • GEMMA embeddings for search_endpoints()
  │
  └──► Environment
         • Executes curl_exec via subprocess
         • Auto-injects session cookies
         • Smart-truncates response bodies
         • Indexes full responses into per-episode BM25 + GEMMA store
         • Manages session_state: cookies, CSRF tokens, auth headers
         │
         │ HTTP calls (always live)
         ▼
       WebArena EC2 (live apps)
         :7770  Shopping (Magento 2)
         :7780  Shopping Admin
         :9999  Forum (Postmill)
         :8888  Wikipedia (Kiwix)
         :3000  Map (OpenStreetMap)
         │
         │ episode trajectory
         ▼
       Deterministic Judge
         Per-template programmatic grader:
         • Inspects episode trajectory
         • Optionally probes live app state
         • Verifies parameter sourcing
           (TASK_SPEC / PREV_CALL / AUTH_FLOW / STATIC / DERIVED)
         • Scores in [0.0, 1.0]
         │
         ▼
       Reward Signal
         Per-step:
           +0.2   valid API call (2xx)
           +0.1   new path explored
           +0.25  correct param sourcing
           −0.15  repeated identical call
           −0.3   browser_agent called again
         Episode end:
           +2.0 to +5.0  task complete (easy→hard)
           −1.5          task failed
         │
         ▼
       GRPO (via HF TRL)
         8 parallel rollouts per prompt
         Computes advantages without a value function
         Updates policy weights
         │
         └──► updated Policy Model
Target Applications
All running on a single AWS EC2 instance: real production software, no simulation.
| App | Port | Software |
|---|---|---|
| Shopping | 7770 | Magento 2, an open-source e-commerce platform |
| Shopping Admin | 7780 | Magento 2 Admin, the backend panel for the same store |
| Forum | 9999 | Postmill, an open-source Reddit-like forum |
| Wikipedia | 8888 | Kiwix, a read-only offline mirror of Wikipedia |
| Map | 3000 | OpenStreetMap, a collaborative mapping platform |
Source: WebArena environment_docker
Tasks
HARvestGym trains on 7 task templates across three complexity tiers. Each template is a parameterized scenario with one reward function, one ground truth catalog entry, and one grader, but potentially hundreds of distinct episode variations produced by substituting different values for the template slots ({product_name}, {category_name}, etc.).
Complexity Tiers
| Tier | Characteristic | API calls required |
|---|---|---|
| Easy | Single call, no auth | 1 |
| Medium | Auth + 1-2 dependent calls | 2-3 |
| Hard | Multi-step chain with ID threading, full auth | 4-8+ |
The model only graduates to harder templates once it reliably solves easier ones.
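The graduation rule is not spelled out in this README. A minimal sketch of one way to gate it, assuming a rolling success-rate threshold per tier (the `Curriculum` class, the window of 50 episodes, and the 0.8 threshold are all illustrative assumptions, not values from the project):

```python
from collections import deque

TIERS = ["easy", "medium", "hard"]


class Curriculum:
    """Hypothetical gate: unlock the next tier once the rolling success
    rate on the current tier reaches a threshold."""

    def __init__(self, window: int = 50, threshold: float = 0.8):
        self.window = window
        self.threshold = threshold
        self.tier_index = 0
        self.results = deque(maxlen=window)  # most recent episode outcomes

    @property
    def tier(self) -> str:
        return TIERS[self.tier_index]

    def record(self, success: bool) -> None:
        """Log one episode outcome and graduate if the window is full and good."""
        self.results.append(success)
        window_full = len(self.results) == self.window
        rate = sum(self.results) / len(self.results)
        has_next = self.tier_index < len(TIERS) - 1
        if window_full and rate >= self.threshold and has_next:
            self.tier_index += 1
            self.results.clear()  # start a fresh window for the new tier
```

Requiring a full window before graduating avoids promoting the model on a lucky streak of two or three episodes.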
Task Templates
| # | Tier | App | Template | Key Challenge |
|---|---|---|---|---|
| 1 | Easy | Shopping | `List products in category {category_name}` | Single GET with query params |
| 2 | Easy | Wikipedia | `Retrieve article summary for {title}` | Single GET, path parameter resolution |
| 3 | Medium | Shopping | `Add {product_name} to a guest cart` | 2 calls: create cart → add item; ID threading |
| 4 | Medium | Forum | `Retrieve all posts in {forum_category} (authed)` | Login → extract session → GET |
| 5 | Hard | Forum | `Create a post titled {title} in {category}` | Login → extract CSRF form_key → POST with full schema |
| 6 | Hard | Shopping | `Guest checkout for {product_name}` | 5+ chained calls; cart → item → shipping → payment |
| 7 | Hard | Shopping Admin | `Create a new product with SKU {sku}, price {price}` | Admin bearer token → full Magento product schema |
Template parameters are populated from a static parameter pool built by querying the live applications before training (see parameter_pools.json, refreshed via scripts/build_parameter_pools.py). Each episode samples randomly from its pool. The model never sees the pool directly; it must discover the correct values through its own API calls.
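As a rough illustration of how a template plus a pool yields many distinct episodes, the sketch below fills the slots of one template. The `POOLS` layout and the `sample_task` helper are assumptions made for the example; the actual structure of parameter_pools.json is not shown in this README.

```python
import random

# Hypothetical pool layout: template string -> slot name -> candidate values.
POOLS = {
    "Add {product_name} to a guest cart": {
        "product_name": ["Radiant Tee", "Camera Backpack", "Flannel Jacket"],
    },
}


def sample_task(template: str, pools: dict, rng: random.Random) -> tuple[str, dict]:
    """Draw one value per slot and return the filled task text plus the slots."""
    slots = {name: rng.choice(values) for name, values in pools[template].items()}
    return template.format(**slots), slots
```

Only the filled task string would be shown to the model; the sampled slot values stay on the environment side for the grader.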
Each task has a deterministic programmatic grader (score in [0.0, 1.0]):
- Easy graders: check HTTP response body for expected values
- Medium graders: probe application state after episode (e.g., fetch the cart, verify item is present)
- Hard graders: verify multi-step state change in the application (e.g., post exists, checkout created)
Spaces
Observation Space
What the model sees at each step:
from typing import Any

from pydantic import BaseModel


class Observation(BaseModel):
    task: str               # Natural language task
    app_base_url: str       # Root URL of the target application
    last_tool_result: Any   # Result of last tool call
    history: list[dict]     # Full episode trajectory: [{action, tool_result}, ...]
    session_state: dict     # Auto-managed: cookies, tokens, CSRF values
    step_count: int         # Steps taken so far in this episode
    max_steps: int          # 20
session_state is maintained by the environment: the model decides when to authenticate and which session values to use, while the environment handles extraction from Set-Cookie headers and response bodies.
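A minimal sketch of the cookie half of that extraction, using only the standard library. The `update_session_state` helper and the `"cookies"` key are assumptions for illustration; a real response can also carry several Set-Cookie headers, which a plain dict of headers cannot represent.

```python
from http.cookies import SimpleCookie


def update_session_state(session_state: dict, headers: dict[str, str]) -> dict:
    """Return a copy of session_state with cookies from Set-Cookie merged in."""
    cookies = dict(session_state.get("cookies", {}))
    raw = headers.get("Set-Cookie")
    if raw:
        jar = SimpleCookie()
        jar.load(raw)  # parses "name=value; Path=/; HttpOnly" style strings
        for name, morsel in jar.items():
            cookies[name] = morsel.value
    return {**session_state, "cookies": cookies}
```

Returning a new dict rather than mutating in place keeps each step's observation stable in the episode history.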
Response truncation rules applied in order:
- Non-JSON body (HTML, CSS): truncated to 3,000 characters
- JSON primitive (string, number): never truncated, since these are tokens and IDs
- Error response (4xx/5xx): never truncated, because the model needs every word to self-correct
- Small JSON (no large arrays): returned as-is
- Large JSON array (≥ 3 items): first 2 items shown + `_list_truncated` annotation + hint to call `search_episode_data()`
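The rules above can be sketched as a single function that checks them in the listed order. This is an illustration under assumptions: the function names, the constants, and the exact shape of the `_list_truncated` annotation are made up here, and the environment's real output format may differ.

```python
import json

MAX_TEXT_CHARS = 3000   # limit for non-JSON bodies
LARGE_ARRAY_MIN = 3     # arrays with this many items or more are shortened
ITEMS_SHOWN = 2


def _shorten_arrays(value):
    """Recursively keep the first items of large arrays and append a note."""
    if isinstance(value, dict):
        return {k: _shorten_arrays(v) for k, v in value.items()}
    if isinstance(value, list):
        if len(value) < LARGE_ARRAY_MIN:
            return [_shorten_arrays(v) for v in value]
        note = {"_list_truncated": f"{len(value) - ITEMS_SHOWN} more items hidden; "
                                   "call search_episode_data() to retrieve them"}
        return [_shorten_arrays(v) for v in value[:ITEMS_SHOWN]] + [note]
    return value


def truncate_body(status_code: int, body: str):
    try:
        parsed = json.loads(body)
    except ValueError:
        return body[:MAX_TEXT_CHARS]        # non-JSON: cut to 3,000 characters
    if parsed is None or isinstance(parsed, (str, int, float)):
        return parsed                       # JSON primitive: never truncated
    if status_code >= 400:
        return parsed                       # error response: never truncated
    return _shorten_arrays(parsed)          # small JSON passes through unchanged
```

Because the non-JSON check comes first in this sketch, an HTML error page would still be cut at 3,000 characters; whether the real environment exempts those is not stated here.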
Every curl_exec call indexes the full response into a per-episode hybrid index (BM25 + GEMMA embeddings) before truncation, so all items remain retrievable even when only 2 were shown.
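To make the retrieval side concrete, here is a keyword-only stand-in for that index: a small BM25 scorer over the JSON objects seen in an episode. The `EpisodeIndex` class is hypothetical, it omits the embedding half entirely, and the assumption that list items live under an `items` key simply mirrors the Magento-style responses shown in this README.

```python
import json
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


class EpisodeIndex:
    """Keyword-only (BM25) sketch of a per-episode response store."""

    def __init__(self, k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.docs: list = []
        self.tokens: list[Counter] = []

    def add_response(self, body) -> None:
        # Index array items one by one so truncated items stay findable.
        if isinstance(body, dict) and isinstance(body.get("items"), list):
            items = body["items"]
        else:
            items = [body]
        for item in items:
            self.docs.append(item)
            self.tokens.append(Counter(tokenize(json.dumps(item))))

    def search(self, query: str, top_k: int = 5) -> list:
        n = len(self.docs)
        if n == 0:
            return []
        lengths = [sum(c.values()) for c in self.tokens]
        avg_len = sum(lengths) / n
        scores = [0.0] * n
        for term in set(tokenize(query)):
            df = sum(1 for c in self.tokens if term in c)
            if df == 0:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            for i, c in enumerate(self.tokens):
                tf = c[term]
                if tf:
                    norm = tf + self.k1 * (1 - self.b + self.b * lengths[i] / avg_len)
                    scores[i] += idf * tf * (self.k1 + 1) / norm
        ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
        return [self.docs[i] for i in ranked[:top_k] if scores[i] > 0]
```

The default `top_k=5` matches the Top-5 behaviour described for search_episode_data; indexing happens on the untruncated body, which is what makes hidden list items recoverable.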
Action Space
The model outputs a single tool call per step.
| Tool | Input | Output |
|---|---|---|
| `browser_agent(task, url)` | Task string + app base URL | Summary list of API endpoint names + methods (e.g. GET /products) |
| `search_endpoints(query)` | Natural language query | Top-3 endpoint schemas (method, path, auth, params with sources, response fields) |
| `curl_exec(command)` | Full curl command string | {status_code, headers, body}; body smart-truncated, full body indexed |
| `search_episode_data(query)` | Keyword or natural language query | Top-5 JSON objects from this episode's request/response history |
| `done(result?)` | Optional result string | Ends episode, triggers judge evaluation |
browser_agent is called exactly once per episode at step 1. Calling it again incurs a −0.3 penalty. During training, it loads a cached HAR file; at inference, it launches a live browser session.
Full technical specifications for all tools: TOOLS.md
Reward Space
Per-step:
| Signal | Value | Trigger |
|---|---|---|
| Valid API call (2xx) | +0.2 | curl_exec returns 2xx status |
| New path called this episode | +0.1 | Normalized path not called before; discourages looping |
| Correct parameter sourcing | +0.25 | Judge: value came from the correct source type |
| Session value correctly used | +0.1 | Auth token/cookie present and correct in curl call |
| Repeated identical call | −0.15 | Exact duplicate curl command issued twice |
| browser_agent called again | −0.3 | browser_agent called after step 1 |
| Malformed curl command | −0.1 | curl cannot be parsed or executed |
| 4xx response (recoverable) | −0.05 | Call failed but episode continues |
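A sketch of how a subset of these signals could be computed from one curl_exec call. It covers only the four signals that need no judge (2xx, new path, duplicate command, 4xx); the `StepTracker` class and the path normalization rule (drop the query string, replace purely numeric segments with `{id}`) are assumptions, since the README does not define normalization.

```python
import re
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class StepTracker:
    seen_commands: set = field(default_factory=set)
    seen_paths: set = field(default_factory=set)


def normalize_path(url: str) -> str:
    """Drop the query string and replace numeric path segments with {id}."""
    path = urlsplit(url).path
    return re.sub(r"/\d+(?=/|$)", "/{id}", path)


def step_reward(tracker: StepTracker, command: str, url: str, status_code: int) -> float:
    reward = 0.0
    if command in tracker.seen_commands:
        reward -= 0.15                      # repeated identical call
    tracker.seen_commands.add(command)

    path = normalize_path(url)
    if path not in tracker.seen_paths:
        reward += 0.1                       # new path called this episode
        tracker.seen_paths.add(path)

    if 200 <= status_code < 300:
        reward += 0.2                       # valid API call
    elif 400 <= status_code < 500:
        reward -= 0.05                      # recoverable client error
    return round(reward, 2)
```

Tracking exact command strings separately from normalized paths lets the same endpoint be called again with different arguments without triggering the duplicate penalty.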
Episode end:
| Outcome | Reward |
|---|---|
| Task completed correctly | +2.0 to +5.0 (scales with difficulty tier) |
| Partial completion (right endpoints, wrong param threading) | +0.5 to +1.5 |
| Authentication correctly obtained (even if task fails) | +0.3 |
| Timeout / task failed entirely | −1.5 |
Target signal separation: successful episodes +3 to +7, failed episodes −2 to −1. Required for GRPO.
Reward design note: Pure step-level rewards can teach a model to "look busy", accumulating exploration rewards while never completing the task. The terminal outcome reward is designed to dominate the sum of all per-step rewards. The curriculum is the primary defense: Easy tasks have a trivially short optimal path (2 steps), so there's no room to accumulate fake exploration reward before the model learns that the terminal reward is what matters.
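Why the separation matters for GRPO can be seen in the group-relative advantage itself. The sketch below uses the common formulation, in which each rollout's return is normalized against the mean and standard deviation of its group; the `group_advantages` function and the `eps` value are illustrative, and the exact normalization used by the training library may differ.

```python
from statistics import mean, pstdev


def group_advantages(returns: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize each rollout's return against its own group (no value function)."""
    mu = mean(returns)
    sigma = pstdev(returns)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in returns]


# One prompt, 8 rollouts: three successes and five failures (made-up returns
# chosen inside the target ranges quoted above).
rollouts = [5.0, 4.0, 3.5, -1.5, -1.5, -2.0, -1.5, -2.0]
advantages = group_advantages(rollouts)
```

If every rollout in a group earns the same return, all advantages are zero and the update carries no signal, which is why successful and failed episodes need clearly different totals.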
Key Design Decisions
Browser Agent as a Discovery Tool
The RL agent has access to a browser agent tool powered by bu-30b-a3b-preview, a 30B MoE vision-language model (3B active parameters) served via the browser-use library on Playwright. When called, it completes the task in a real browser while intercepting all network traffic, then returns the filtered API call list.
Training vs. inference: The browser agent output is pre-computed and cached per task during training, so the RL model receives it instantly and no live browser session runs. At inference, the browser agent runs live to handle novel tasks.
Full details: BROWSER_AGENT.md
Ground Truth from the Codebase, Not the Browser
The browser agent shows what API calls happen. It does not explain why: where each parameter comes from or what field constraints exist. That comes from a one-time static analysis of each WebArena application's Docker image source, producing a ground truth API catalog:
endpoint: POST /rest/V1/guest-carts/{cartId}/items
path_params:
  cartId: obtained from: POST /rest/V1/guest-carts → response body
body:
  cartItem.sku: the product's SKU, from: GET /rest/V1/products → items[].sku
  cartItem.qty: quantity, from: task specification
  cartItem.quote_id: same as cartId
The judge uses this to verify not just what the model called, but where each parameter value came from. Source types: TASK_SPEC, PREV_CALL, AUTH_FLOW, STATIC, DERIVED. This is how partial credit works: the model gets reward for correctly threading a cart_id even if the final call had a wrong field elsewhere.
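A rough sketch of what a provenance check could look like for two of the five source types. Everything here is an assumption for illustration: the helper names, the exact-match rule against scalar values in earlier responses, and the `UNKNOWN` fallback are not taken from the actual judge, and AUTH_FLOW, STATIC, and DERIVED are not handled.

```python
def leaf_values(obj):
    """Yield every scalar inside a decoded JSON value, as a string."""
    if isinstance(obj, dict):
        for v in obj.values():
            yield from leaf_values(v)
    elif isinstance(obj, list):
        for v in obj:
            yield from leaf_values(v)
    else:
        yield str(obj)


def classify_source(value, task: str, prior_bodies: list) -> str:
    """Guess a source type for a parameter value used in the current call."""
    needle = str(value)
    if any(needle in set(leaf_values(body)) for body in prior_bodies):
        return "PREV_CALL"   # value appeared verbatim in an earlier response
    if needle.lower() in task.lower():
        return "TASK_SPEC"   # value was stated in the task text
    return "UNKNOWN"


def sourcing_score(used: dict, expected: dict, task: str, prior_bodies: list) -> float:
    """Fraction of parameters whose observed source matches the catalog entry."""
    if not used:
        return 0.0
    hits = sum(
        classify_source(value, task, prior_bodies) == expected.get(name)
        for name, value in used.items()
    )
    return hits / len(used)
```

Scoring per parameter rather than per call is what allows partial credit: a correctly threaded cartId still counts when the SKU in the same request was wrong.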
Full extraction process: GROUND_TRUTH_EXTRACTION.md
HTML and Form-Based Applications
Not every endpoint returns JSON. The Forum (Postmill) relies on HTML form submissions with CSRF tokens; Wikipedia (Kiwix) serves static HTML pages. The agent handles both:
- CSRF tokens: The model GETs the form page, reads the `value` attribute of `input[name="_csrf_token"]` from the returned HTML, and places it in the subsequent POST. If the token is beyond the 3,000-character truncation point, it calls `search_episode_data("_csrf_token")`; the full HTML is indexed before truncation.
- HTML-only responses: Wikipedia responses are returned in the standard `{status_code, headers, body}` envelope. Search results appear in `<a href>` tags; article content in `<p>` tags.
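For reference, the token lookup described in the first bullet is equivalent to the following standard-library parse. The model performs this step by reading the HTML itself; the `CsrfTokenParser` class and `extract_csrf_token` helper are only an illustration of what is being located.

```python
from html.parser import HTMLParser


class CsrfTokenParser(HTMLParser):
    """Capture the value attribute of the first matching <input> element."""

    def __init__(self, field_name: str = "_csrf_token"):
        super().__init__()
        self.field_name = field_name
        self.token = None

    def handle_starttag(self, tag, attrs):
        if tag != "input" or self.token is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == self.field_name:
            self.token = attributes.get("value")


def extract_csrf_token(html: str, field_name: str = "_csrf_token"):
    """Return the token string, or None if the form field is absent."""
    parser = CsrfTokenParser(field_name)
    parser.feed(html)
    return parser.token
```

A `None` result on a truncated body is the cue to fall back to search_episode_data, since the field may sit past the 3,000-character cut.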
Example Episode
TASK: Add "Radiant Tee" to a guest cart at http://ec2-16-59-2-56.../

STEP 1: browser_agent("Add Radiant Tee to a guest cart", "http://ec2-...:7770/")
  → {"endpoints": ["POST /rest/V1/guest-carts", "GET /rest/V1/products",
     "POST /rest/V1/guest-carts/{id}/items", ...]}

STEP 2: search_endpoints("create guest cart")
  → ["endpoint: POST /rest/V1/guest-carts | auth: none | returns: string (cartId)"]

STEP 3: curl_exec("curl -X POST 'http://ec2-.../rest/V1/guest-carts' -H 'Content-Type: application/json'")
  → {status_code: 200, body: "cart-abc123"}

STEP 4: search_endpoints("find product by name get sku")
  → ["endpoint: GET /rest/V1/products | query: searchCriteria filters | returns: .items[].sku"]

STEP 5: curl_exec("curl 'http://ec2-.../rest/V1/products?searchCriteria[filter_groups][0][filters][0][field]=name&...[value]=Radiant+Tee'")
  → {status_code: 200, body: {"items":[{"sku":"MH01","name":"Radiant Tee","price":22.0}]}}

STEP 6: search_endpoints("add item to guest cart cartId")
  → ["endpoint: POST /rest/V1/guest-carts/{cartId}/items | path: cartId from POST /rest/V1/guest-carts | body: cartItem.sku, cartItem.qty, cartItem.quote_id"]

STEP 7: curl_exec("curl -X POST 'http://ec2-.../rest/V1/guest-carts/cart-abc123/items' -H 'Content-Type: application/json' -d '{\"cartItem\":{\"sku\":\"MH01\",\"qty\":1,\"quote_id\":\"cart-abc123\"}}'")
  → {status_code: 200, body: {"item_id": 5, "sku": "MH01", "qty": 1}}

STEP 8: done("Radiant Tee added to cart")
Values from prior responses (cart-abc123, MH01) are threaded directly from the growing episode history. If step 5 had returned 200 products truncated to 2, the model would call search_episode_data("Radiant Tee sku") to retrieve MH01 from the episode index.
Setup
Prerequisites
- Docker installed and running
- Python 3.11+ with `uv`
- A Hugging Face token with read access
Local Development
# Clone and enter the project
git clone <your-hf-space-url>
cd HARvestGym
# Install dependencies
uv sync
# Validate the OpenEnv spec
openenv validate
# Build and run the Docker image
docker build -t harvgym .
docker run -p 8000:8000 harvgym
# Run the inference script
HF_TOKEN=hf_xxx uv run inference.py
Environment Variables
| Variable | Default | Required | Purpose |
|---|---|---|---|
| `HF_TOKEN` | (none) | Yes | HuggingFace auth token |
| `API_BASE_URL` | `https://router.huggingface.co/v1` | No | LLM API endpoint |
| `MODEL_NAME` | `google/gemma-4-31B-it` | No | Model for inference |
| `HARVGYM_TASK` | `har_classify_easy` | No | Override which task to run |
API Endpoints
# Reset episode
curl -X POST http://localhost:8000/reset
# Execute a step
curl -X POST http://localhost:8000/step \
-H "Content-Type: application/json" \
-d '{"tool": "browser_agent", "args": {"task": "...", "url": "..."}}'
# Get current state
curl http://localhost:8000/state
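The same step call can be issued from Python with the standard library. This is a sketch under assumptions: the `build_step_request` and `send` helpers are hypothetical, and the reply is assumed to be JSON, which this README does not state explicitly.

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:8000"


def build_step_request(tool: str, args: dict, base_url: str = BASE_URL) -> Request:
    """Build the POST /step request with the same payload shape as the curl example."""
    payload = json.dumps({"tool": tool, "args": args}).encode("utf-8")
    return Request(
        f"{base_url}/step",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: Request) -> dict:
    """Send the request and decode the reply (requires the container to be running)."""
    with urlopen(request) as response:
        return json.loads(response.read())
```

Separating request construction from sending makes the payload easy to inspect or log before anything touches the network.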
Baseline Performance
Scores generated by running uv run inference.py with google/gemma-4-31B-it via the HuggingFace Router.
| Task | Difficulty | Score | Steps | Notes |
|---|---|---|---|---|
| `easy_list_pants` | Easy | 0.74 | 7 | List products in 'Pants' category |
| `medium_cart_camera_backpack` | Medium | 0.46 | 20 | Add Camera Backpack to guest cart |
| `medium_cart_flannel_jacket` | Medium | 0.52 | 20 | Add Flannel Jacket to guest cart |
| `hard_checkout_ripstop_pants` | Hard | 0.14 | 20 | Full guest checkout (hit step limit) |
| Overall | - | 0.47 | - | |
To regenerate:
HF_TOKEN=hf_xxx uv run inference.py