---
title: HARvestGym
emoji: 🕸️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 8000
pinned: false
tags:
  - openenv
  - reinforcement-learning
  - api-agent
  - web-tasks
base_path: /web
---

# HARvestGym

### Can a small model learn to reverse-engineer any web application's API — and complete real tasks through those APIs, without ever opening a browser?

Web applications are full of APIs. Every click in a browser triggers an HTTP call with a precise schema, a specific authentication header, an exact sequence of prerequisites. **HARvestGym trains a small model to do all of that directly** — given a task and a URL, it discovers the relevant endpoints, figures out what each one needs, chains the calls in the right order, and completes the task without any browser.

The model starts with nothing: no schema, no documentation, no endpoint list. It uses tools to explore — issuing requests, inspecting responses, building up its own understanding of how the application works. This is what a developer does when they reverse-engineer an API. The model learns to do the same.

---

## Why It Matters

The behavior HARvestGym trains — discover an API, understand its dependencies, complete a task through it — is one of the most broadly useful things an AI agent can do.
| Application | What it unlocks |
| --- | --- |
| 🔍 **API reverse engineering** | Point an agent at any web app and get a working API map — no docs, no SDK, no source code required |
| 📄 **Automatic API documentation** | Capture real call sequences and parameter provenance to produce accurate, living documentation from observed traffic |
| 🤖 **Browser-free automation** | Automate form submissions, cart flows, content management, and data extraction at the HTTP layer — immune to UI redesigns |
| 🔧 **Any website as MCP tools** | Turn any web application into a set of callable agent tools on the fly, with no official integration needed |
| 🛡️ **Security & API auditing** | Autonomously probe endpoints, trace auth flows, and map attack surfaces for penetration testing and compliance review |

---

## How It Works

```
Task + App URL
      │
      ▼
Policy Model (RL Agent)
  small model — no prior knowledge of the app

  Step 1  ──► browser_agent(task, url)    → filtered API endpoint list
  Step 2+ ──► search_endpoints(query)     → full schema for a specific endpoint
          ──► curl_exec(command)          → execute HTTP call, get response
          ──► search_episode_data(query)  → search prior response bodies
          ──► done(result)                → declare task complete
      │
      ▼
Live WebArena Apps (EC2)  ←── real HTTP responses (always live, never mocked)
      │
      ▼
Deterministic Judge (compares against ground truth API catalog)
      │
      ▼
Reward Signal ──► GRPO ──► updated policy
```

The agent calls `browser_agent` once at the start — this runs a real browser to complete the same task while recording all network traffic, then returns the filtered list of API endpoints observed. The agent now has a map of what endpoints exist.
What it does *not* know:

- which of those endpoints are actually needed for this specific task
- in what order they must be called (you cannot add to a cart before the cart exists)
- where each required parameter value comes from
- how to re-authenticate if a session expires mid-episode

The model must learn to discover all of this on its own.

---

## Architecture

```
TRAINING LOOP

Task + App URL
      │
      ▼
Policy Model (RL Agent)
  small model — no prior knowledge of the app
  Observation: task + history + session_state + last_result

  Step 1  ──► browser_agent(task, url)
  Step 2+ ──► search_endpoints(query)
          ──► curl_exec(command)
          ──► search_episode_data(query)
          ──► done(result)
      │
      ├──► Browser Agent (step 1 only)
      │      Training:  load pre-recorded cached HAR from disk
      │                 or launch on real browser
      │      Inference: launch real browser via Playwright +
      │                 bu-30b-a3b-preview
      │      Both paths produce: filtered HAR, OpenAPI-like spec,
      │                 GEMMA embeddings for search_endpoints()
      ▼
Environment
  • Executes curl_exec via subprocess
  • Auto-injects session cookies
  • Smart-truncates response bodies
  • Indexes full responses into per-episode BM25 + GEMMA store
  • Manages session_state: cookies, CSRF tokens, auth headers
      │  HTTP calls (always live)
      ▼
WebArena EC2 (live apps)
  :7770 Shopping (Magento 2)
  :7780 Shopping Admin
  :9999 Forum (Postmill)
  :8888 Wikipedia (Kiwix)
  :3000 Map (OpenStreetMap)
      │  episode trajectory
      ▼
Deterministic Judge
  Per-template programmatic grader:
  • Inspects episode trajectory
  • Optionally probes live app state
  • Verifies parameter sourcing
    (TASK_SPEC / PREV_CALL / AUTH_FLOW / STATIC / DERIVED)
  • Scores [0.0 → 1.0]
      │
      ▼
Reward Signal
  Per-step:
    +0.2  valid API call (2xx)
    +0.1  new path explored
    +0.25 correct param sourcing
    −0.15 repeated identical call
    −0.3  browser_agent called again
  Episode end:
    +2.0 to +5.0 task complete (easy → hard)
    −1.5         task failed
      │
      ▼
GRPO (via HF TRL)
  8 parallel rollouts per prompt
  Computes advantages without a value function
  Updates policy weights
      │
      └──► updated Policy Model
```

---

## Target Applications

All running on a single AWS EC2 instance — real production software, no simulation.
| App            | Port | Software                                           |
| -------------- | ---- | -------------------------------------------------- |
| Shopping       | 7770 | Magento 2 — open-source e-commerce platform        |
| Shopping Admin | 7780 | Magento 2 Admin — backend panel for the same store |
| Forum          | 9999 | Postmill — open-source Reddit-like forum           |
| Wikipedia      | 8888 | Kiwix — read-only offline mirror of Wikipedia      |
| Map            | 3000 | OpenStreetMap — collaborative mapping platform     |

Source: [WebArena environment_docker](https://github.com/web-arena-x/webarena/tree/main/environment_docker)

---

## Tasks

HARvestGym trains on **7 task templates** across three complexity tiers. Each template is a parameterized scenario: one reward function, one ground truth catalog entry, one grader — but potentially hundreds of distinct episode variations produced by substituting different values for the template slots (`{product_name}`, `{category_name}`, etc.).

### Complexity Tiers

| Tier   | Characteristic                                | API calls required |
| ------ | --------------------------------------------- | ------------------ |
| Easy   | Single call, no auth                          | 1                  |
| Medium | Auth + 1–2 dependent calls                    | 2–3                |
| Hard   | Multi-step chain with ID threading, full auth | 4–8+               |

The model only graduates to harder templates once it reliably solves easier ones.
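The graduation rule can be sketched as a per-tier success-rate gate. This is a minimal sketch: the `Curriculum` class, window size, and threshold are illustrative assumptions, not HARvestGym's actual implementation.

```python
from collections import deque

class Curriculum:
    """Unlock harder tiers once the current tier is reliably solved."""
    TIERS = ["easy", "medium", "hard"]

    def __init__(self, window: int = 50, threshold: float = 0.8):
        self.window = window        # recent episodes considered per tier
        self.threshold = threshold  # success rate required to graduate
        self.results = {t: deque(maxlen=window) for t in self.TIERS}
        self.active = 0             # index of the hardest unlocked tier

    def record(self, tier: str, solved: bool) -> None:
        """Log an episode outcome and graduate if the bar is cleared."""
        self.results[tier].append(solved)
        current = self.TIERS[self.active]
        recent = self.results[current]
        if (len(recent) == self.window
                and sum(recent) / self.window >= self.threshold
                and self.active < len(self.TIERS) - 1):
            self.active += 1

    def unlocked(self) -> list[str]:
        return self.TIERS[: self.active + 1]
```

With `window=50, threshold=0.8`, the sampler would draw only Easy templates until 40 of the last 50 Easy episodes succeed, then mix in Medium, and so on.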
### Task Templates

| #   | Tier   | App            | Template                                               | Key Challenge                                            |
| --- | ------ | -------------- | ------------------------------------------------------ | -------------------------------------------------------- |
| 1   | Easy   | Shopping       | List products in category `{category_name}`            | Single GET with query params                             |
| 2   | Easy   | Wikipedia      | Retrieve article summary for `{title}`                 | Single GET, path parameter resolution                    |
| 3   | Medium | Shopping       | Add `{product_name}` to a guest cart                   | 2 calls: create cart → add item; ID threading            |
| 4   | Medium | Forum          | Retrieve all posts in `{forum_category}` (authed)      | Login → extract session → GET                            |
| 5   | Hard   | Forum          | Create a post titled `{title}` in `{category}`         | Login → extract CSRF `form_key` → POST with full schema  |
| 6   | Hard   | Shopping       | Guest checkout for `{product_name}`                    | 5+ chained calls; cart → item → shipping → payment       |
| 7   | Hard   | Shopping Admin | Create a new product with SKU `{sku}`, price `{price}` | Admin bearer token → full Magento product schema         |

**Template parameters** are populated from a static parameter pool built by querying the live applications before training (see `parameter_pools.json`, refreshed via `scripts/build_parameter_pools.py`). Each episode samples randomly from its pool — the model never sees the pool directly; it must discover the correct values through its own API calls.
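Episode instantiation from a template and its pool can be sketched as follows. The pool contents and the `instantiate` helper are hypothetical; the real values live in `parameter_pools.json`.

```python
import random
import re

# Hypothetical excerpt of a parameter pool; the real one is built by
# querying the live apps before training (parameter_pools.json).
PARAMETER_POOLS = {
    "category_name": ["Pants", "Jackets", "Tees"],
    "product_name": ["Radiant Tee", "Camera Backpack", "Flannel Jacket"],
}

def instantiate(template: str, rng: random.Random) -> str:
    """Fill every {slot} in a task template with a value sampled from its pool."""
    return re.sub(
        r"\{(\w+)\}",
        lambda m: rng.choice(PARAMETER_POOLS[m.group(1)]),
        template,
    )

task = instantiate("Add {product_name} to a guest cart", random.Random())
```

One reward function and one grader serve every instantiation of a template; only the slot values change between episodes.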
Each task has a deterministic programmatic grader (score in `[0.0, 1.0]`):

- **Easy graders**: check the HTTP response body for expected values
- **Medium graders**: probe application state after the episode (e.g., fetch the cart, verify the item is present)
- **Hard graders**: verify a multi-step state change in the application (e.g., post exists, checkout created)

---

## Spaces

### Observation Space

What the model sees at each step:

```python
from typing import Any

from pydantic import BaseModel

class Observation(BaseModel):
    task: str              # Natural language task
    app_base_url: str      # Root URL of the target application
    last_tool_result: Any  # Result of last tool call
    history: list[dict]    # Full episode trajectory: [{action, tool_result}, ...]
    session_state: dict    # Auto-managed: cookies, tokens, CSRF values
    step_count: int
    max_steps: int         # 20
```

`session_state` is maintained by the environment — the model decides *when* to authenticate and *which* session values to use; the environment handles *extraction* from `Set-Cookie` headers and response bodies.

**Response truncation** rules, applied in order:

1. Non-JSON body (HTML, CSS): truncated to 3,000 characters
2. JSON primitive (string, number): never truncated — these are tokens, IDs
3. Error response (4xx/5xx): never truncated — the model needs every word to self-correct
4. Small JSON (no large arrays): returned as-is
5. Large JSON array (≥ 3 items): first 2 items shown + `_list_truncated` annotation + hint to call `search_episode_data()`

Every `curl_exec` call indexes the *full* response into a per-episode hybrid index (BM25 + GEMMA embeddings) *before* truncation — so all items are always retrievable even when only 2 were shown.

### Action Space

The model outputs a single tool call per step.

| Tool                         | Input                             | Output                                                                            |
| ---------------------------- | --------------------------------- | --------------------------------------------------------------------------------- |
| `browser_agent(task, url)`   | Task string + app base URL        | Summary list of API endpoint names + methods (e.g. `GET /products`)               |
| `search_endpoints(query)`    | Natural language query            | Top-3 endpoint schemas (method, path, auth, params with sources, response fields) |
| `curl_exec(command)`         | Full curl command string          | `{status_code, headers, body}` — body smart-truncated; full body indexed          |
| `search_episode_data(query)` | Keyword or natural language query | Top-5 JSON objects from this episode's request/response history                   |
| `done(result?)`              | Optional result string            | Ends episode, triggers judge evaluation                                           |

`browser_agent` is called **exactly once per episode at step 1**. Calling it again applies a −0.3 penalty. During training, it loads a cached HAR file; at inference, it launches a live browser session.

Full technical specifications for all tools: [`TOOLS.md`](./TOOLS.md)

### Reward Space

**Per-step:**

| Signal                        | Value  | Trigger                                                 |
| ----------------------------- | ------ | ------------------------------------------------------- |
| Valid API call (2xx)          | +0.2   | `curl_exec` returns 2xx status                          |
| New path called this episode  | +0.1   | Normalized path not called before — discourages looping |
| Correct parameter sourcing    | +0.25  | Judge: value came from the correct source type          |
| Session value correctly used  | +0.1   | Auth token/cookie present and correct in curl call      |
| Repeated identical call       | −0.15  | Exact duplicate curl command issued twice               |
| browser_agent called again    | −0.3   | `browser_agent` called after step 1                     |
| Malformed curl command        | −0.1   | curl cannot be parsed or executed                       |
| 4xx response (recoverable)    | −0.05  | Call failed but episode continues                       |

**Episode end:**

| Outcome                                                     | Reward                                     |
| ----------------------------------------------------------- | ------------------------------------------ |
| Task completed correctly                                    | +2.0 to +5.0 (scales with difficulty tier) |
| Partial completion (right endpoints, wrong param threading) | +0.5 to +1.5                               |
| Authentication correctly obtained (even if task fails)      | +0.3                                       |
| Timeout / task failed entirely                              | −1.5                                       |

Target signal separation: successful episodes land in `+3` to `+7`, failed episodes in `−2` to `−1`; GRPO's group-relative advantages need this gap to carry signal.

> **Reward design note:** Pure step-level rewards can teach a model to "look busy" — accumulating exploration rewards while never completing the task. The terminal outcome reward is designed to dominate the sum of all per-step rewards. The curriculum is the primary defense: Easy tasks have a trivially short optimal path (2 steps), so there's no room to accumulate fake exploration reward before the model learns that the terminal reward is what matters.

---

## Key Design Decisions

### Browser Agent as a Discovery Tool

The RL agent has access to a **browser agent tool** powered by [`bu-30b-a3b-preview`](https://huggingface.co/browser-use/bu-30b-a3b-preview) — a 30B MoE vision-language model (3B active parameters) served via the [browser-use](https://github.com/browser-use/browser-use) library on Playwright. When called, it completes the task in a real browser while intercepting all network traffic, then returns the filtered API call list.

**Training vs. inference:** The browser agent output is pre-computed and cached per task during training — the RL model receives it instantly, no live browser session runs. At inference, the browser agent runs live to handle novel tasks.

Full details: [`BROWSER_AGENT.md`](BROWSER_AGENT.md)

### Ground Truth from the Codebase, Not the Browser

The browser agent shows *what* API calls happen. It does not explain *why* — where each parameter comes from or what field constraints exist.
That comes from a one-time static analysis of each WebArena application's Docker image source, producing a **ground truth API catalog**:

```
endpoint: POST /rest/V1/guest-carts/{cartId}/items
path_params:
  cartId:
    obtained from: POST /rest/V1/guest-carts → response body
body:
  cartItem.sku:      the product's SKU, from: GET /rest/V1/products → items[].sku
  cartItem.qty:      quantity, from: task specification
  cartItem.quote_id: same as cartId
```

The judge uses this to verify not just *what* the model called, but *where each parameter value came from*. Source types: `TASK_SPEC`, `PREV_CALL`, `AUTH_FLOW`, `STATIC`, `DERIVED`. This is how partial credit works — the model gets reward for correctly threading a `cart_id` even if the final call had a wrong field elsewhere.

Full extraction process: [`GROUND_TRUTH_EXTRACTION.md`](GROUND_TRUTH_EXTRACTION.md)

### HTML and Form-Based Applications

Not every endpoint returns JSON. The Forum (Postmill) relies on HTML form submissions with CSRF tokens; Wikipedia (Kiwix) serves static HTML pages. The agent handles both:

- **CSRF tokens**: The model GETs the form page, reads the `value` attribute of `input[name="_csrf_token"]` from the returned HTML, and places it in the subsequent POST. If the token is beyond the 3,000-character truncation point, it calls `search_episode_data("_csrf_token")` — the full HTML is indexed before truncation.
- **HTML-only responses**: Wikipedia responses are returned in the standard `{status_code, headers, body}` envelope. Search results appear in `` tags; article content in `` tags.

---

## Example Episode

```
TASK: Add "Radiant Tee" to a guest cart at http://ec2-16-59-2-56.../

STEP 1: browser_agent("Add Radiant Tee to a guest cart", "http://ec2-...:7770/")
        → {"endpoints": ["POST /rest/V1/guest-carts",
                         "GET /rest/V1/products",
                         "POST /rest/V1/guest-carts/{id}/items", ...]}

STEP 2: search_endpoints("create guest cart")
        → ["endpoint: POST /rest/V1/guest-carts | auth: none | returns: string (cartId)"]

STEP 3: curl_exec("curl -X POST 'http://ec2-.../rest/V1/guest-carts'
                   -H 'Content-Type: application/json'")
        → {status_code: 200, body: "cart-abc123"}

STEP 4: search_endpoints("find product by name get sku")
        → ["endpoint: GET /rest/V1/products | query: searchCriteria filters | returns: .items[].sku"]

STEP 5: curl_exec("curl 'http://ec2-.../rest/V1/products?searchCriteria[filter_groups][0]
                   [filters][0][field]=name&...[value]=Radiant+Tee'")
        → {status_code: 200, body: {"items":[{"sku":"MH01","name":"Radiant Tee","price":22.0}]}}

STEP 6: search_endpoints("add item to guest cart cartId")
        → ["endpoint: POST /rest/V1/guest-carts/{cartId}/items |
            path: cartId from POST /rest/V1/guest-carts |
            body: cartItem.sku, cartItem.qty, cartItem.quote_id"]

STEP 7: curl_exec("curl -X POST 'http://ec2-.../rest/V1/guest-carts/cart-abc123/items'
                   -H 'Content-Type: application/json'
                   -d '{\"cartItem\":{\"sku\":\"MH01\",\"qty\":1,\"quote_id\":\"cart-abc123\"}}'")
        → {status_code: 200, body: {"item_id": 5, "sku": "MH01", "qty": 1}}

STEP 8: done("Radiant Tee added to cart")
```

Values from prior responses (`cart-abc123`, `MH01`) are threaded directly from the growing episode history. If step 5 had returned 200 products truncated to 2, the model would call `search_episode_data("Radiant Tee sku")` to retrieve `MH01` from the episode index.
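That fallback can be approximated with a toy keyword index — a deliberately simplified stand-in for the real per-episode BM25 + GEMMA hybrid store, using plain term counting instead of proper ranking or embeddings:

```python
import json

class EpisodeIndex:
    """Toy keyword index over full (pre-truncation) response objects."""

    def __init__(self):
        self.docs: list[dict] = []

    def add(self, obj: dict) -> None:
        """Index one JSON object from a response body."""
        self.docs.append(obj)

    def search(self, query: str, k: int = 5) -> list[dict]:
        """Return up to k objects, ranked by how often the query terms appear."""
        terms = query.lower().split()

        def score(doc: dict) -> int:
            text = json.dumps(doc).lower()
            return sum(text.count(t) for t in terms)

        ranked = sorted(self.docs, key=score, reverse=True)
        return [d for d in ranked if score(d) > 0][:k]

index = EpisodeIndex()
# Index every product from the (hypothetically truncated) step-5 response.
index.add({"sku": "MH01", "name": "Radiant Tee", "price": 22.0})
index.add({"sku": "WB04", "name": "Camera Backpack", "price": 59.0})

hits = index.search("Radiant Tee sku")
# hits[0]["sku"] == "MH01"
```

Because indexing happens before truncation, a query like this recovers items the model never saw in the truncated observation.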
---

## Setup

### Prerequisites

- Docker installed and running
- Python 3.11+ with [`uv`](https://github.com/astral-sh/uv)
- A Hugging Face token with read access

### Local Development

```bash
# Clone and enter the project
git clone
cd HARvestGym

# Install dependencies
uv sync

# Validate the OpenEnv spec
openenv validate

# Build and run the Docker image
docker build -t harvgym .
docker run -p 8000:8000 harvgym

# Run the inference script
HF_TOKEN=hf_xxx uv run inference.py
```

### Environment Variables

| Variable       | Default                            | Required | Purpose                    |
| -------------- | ---------------------------------- | -------- | -------------------------- |
| `HF_TOKEN`     | —                                  | **Yes**  | HuggingFace auth token     |
| `API_BASE_URL` | `https://router.huggingface.co/v1` | No       | LLM API endpoint           |
| `MODEL_NAME`   | `google/gemma-4-31B-it`            | No       | Model for inference        |
| `HARVGYM_TASK` | `har_classify_easy`                | No       | Override which task to run |

### API Endpoints

```bash
# Reset episode
curl -X POST http://localhost:8000/reset

# Execute a step
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"tool": "browser_agent", "args": {"task": "...", "url": "..."}}'

# Get current state
curl http://localhost:8000/state
```

---

## Baseline Performance

Scores generated by running `uv run inference.py` with `google/gemma-4-31B-it` via the HuggingFace Router.

| Task                          | Difficulty | Score    | Steps | Notes                                |
| ----------------------------- | ---------- | -------- | ----- | ------------------------------------ |
| `easy_list_pants`             | Easy       | **0.74** | 7     | List products in 'Pants' category    |
| `medium_cart_camera_backpack` | Medium     | **0.46** | 20    | Add Camera Backpack to guest cart    |
| `medium_cart_flannel_jacket`  | Medium     | **0.52** | 20    | Add Flannel Jacket to guest cart     |
| `hard_checkout_ripstop_pants` | Hard       | **0.14** | 20    | Full guest checkout (hit step limit) |
| **Overall**                   | —          | **0.47** | —     |                                      |

> **To regenerate:** `HF_TOKEN=hf_xxx uv run inference.py`