title: HARvestGym
emoji: 🕸️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 8000
pinned: false
tags:
  - openenv
  - reinforcement-learning
  - api-agent
  - web-tasks
base_path: /web

HARvestGym

Can a small model learn to reverse-engineer any web application's API and complete real tasks through those APIs, without ever opening a browser?

Web applications are full of APIs. Every click in a browser triggers an HTTP call with a precise schema, a specific authentication header, and an exact sequence of prerequisites. HARvestGym trains a small model to work at that layer directly: given a task and a URL, it discovers the relevant endpoints, figures out what each one needs, chains the calls in the right order, and completes the task without any browser.

The model starts with nothing: no schema, no documentation, no endpoint list. It uses tools to explore, issuing requests, inspecting responses, and building up its own understanding of how the application works. This is what a developer does when reverse-engineering an API. The model learns to do the same.


Why It Matters

The skill HARvestGym trains (discover an API, understand its dependencies, complete a task through it) is one of the most broadly useful things an AI agent can do.

| Application | What it unlocks |
| --- | --- |
| 🔍 API reverse engineering | Point an agent at any web app and get a working API map: no docs, no SDK, no source code required |
| 📄 Automatic API documentation | Capture real call sequences and parameter provenance to produce accurate, living documentation from observed traffic |
| 🤖 Browser-free automation | Automate form submissions, cart flows, content management, and data extraction at the HTTP layer, immune to UI redesigns |
| 🔧 Any website as MCP tools | Turn any web application into a set of callable agent tools on the fly, with no official integration needed |
| 🛡️ Security & API auditing | Autonomously probe endpoints, trace auth flows, and map attack surfaces for penetration testing and compliance review |

How It Works

Task + App URL
      │
      ▼
Policy Model (RL Agent)
  small model - no prior knowledge of the app

  Step 1     ──► browser_agent(task, url)     → filtered API endpoint list
  Step 2+    ──► search_endpoints(query)      → full schema for a specific endpoint
             ──► curl_exec(command)           → execute HTTP call, get response
             ──► search_episode_data(query)   → search prior response bodies
             ──► done(result)                 → declare task complete
      │
      ▼
Live WebArena Apps (EC2)  ←── real HTTP responses (always live, never mocked)
      │
      ▼
Deterministic Judge (compares against ground truth API catalog)
      │
      ▼
Reward Signal  ──►  GRPO  ──►  updated policy

The agent calls browser_agent once at the start. This runs a real browser to complete the same task while recording all network traffic, then returns the filtered list of API endpoints observed. The agent now has a map of which endpoints exist. What it does not know:

  • which of those endpoints are actually needed for this specific task
  • in what order they must be called (you cannot add to a cart before the cart exists)
  • where each required parameter value comes from
  • how to re-authenticate if a session expires mid-episode

The model must learn to discover all of this on its own.
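
The filtering step that browser_agent performs on recorded traffic can be pictured with a minimal sketch. It assumes the standard HAR layout (log.entries[].request / response); the filter rule itself (keep JSON responses, drop static assets), the function name, and the sample host are hypothetical, and a JSON-only rule would miss the Forum's HTML form endpoints.

```python
from urllib.parse import urlsplit

# Hypothetical filter rule: drop static assets, keep JSON responses.
STATIC_SUFFIXES = (".js", ".css", ".png", ".jpg", ".svg", ".woff2", ".ico")

def filter_har(har: dict) -> list[str]:
    """Return unique 'METHOD /path' labels for API-like entries in a HAR log."""
    seen: list[str] = []
    for entry in har["log"]["entries"]:
        request, response = entry["request"], entry["response"]
        path = urlsplit(request["url"]).path
        mime = response.get("content", {}).get("mimeType", "")
        if path.lower().endswith(STATIC_SUFFIXES):
            continue  # static asset, not an API call
        if "json" not in mime:
            continue  # assumption: only JSON responses count as API traffic
        label = f'{request["method"]} {path}'
        if label not in seen:
            seen.append(label)
    return seen

# Tiny hand-written HAR fragment (hypothetical host and entries).
sample_har = {"log": {"entries": [
    {"request": {"method": "GET", "url": "http://shop.example/static/app.js"},
     "response": {"content": {"mimeType": "application/javascript"}}},
    {"request": {"method": "POST", "url": "http://shop.example/rest/V1/guest-carts"},
     "response": {"content": {"mimeType": "application/json"}}},
    {"request": {"method": "GET", "url": "http://shop.example/rest/V1/products?page=1"},
     "response": {"content": {"mimeType": "application/json; charset=utf-8"}}},
]}}

print(filter_har(sample_har))
```

The result is only a list of candidate endpoints; ordering and parameter provenance still have to be worked out by the policy.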


Architecture

                          TRAINING LOOP
┌─────────────────────────────────────────────────────────────────────────┐
│                                                                         │
│  Task + App URL                                                         │
│       │                                                                 │
│       ▼                                                                 │
│  ┌────────────────────────────────────────────────────────────────┐     │
│  │                  Policy Model (RL Agent)                       │     │
│  │         small model - no prior knowledge of the app            │     │
│  │                                                                │     │
│  │  Observation: task + history + session_state + last_result     │     │
│  │                                                                │     │
│  │  Step 1   ──► browser_agent(task, url)                         │     │
│  │  Step 2+  ──► search_endpoints(query)                          │     │
│  │           ──► curl_exec(command)                               │     │
│  │           ──► search_episode_data(query)                       │     │
│  │           ──► done(result)                                     │     │
│  └────────┬───────────────────────────────────────────────────────┘     │
│           │                                                             │
│    ┌──────┴──────────────────────────────┐                              │
│    │                                     │                              │
│    ▼                                     ▼                              │
│  ┌─────────────────────┐    ┌─────────────────────────────────────┐     │
│  │   Browser Agent     │    │         Environment                 │     │
│  │  (step 1 only)      │    │                                     │     │
│  │                     │    │  • Executes curl_exec via subprocess│     │
│  │ Training:           │    │  • Auto-injects session cookies     │     │
│  │  Load pre-recorded  │    │  • Smart-truncates response bodies  │     │
│  │  cached HAR from    │    │  • Indexes full responses into      │     │
│  │   disk or launch    │    │    per-episode BM25 + GEMMA store   │     │
│  │   on real browser   │    │  • Manages session_state: cookies,  │     │
│  │                     │    │    CSRF tokens, auth headers        │     │
│  │ Inference:          │    └──────────────┬──────────────────────┘     │
│  │  Launch real browser│                   │                            │
│  │  via Playwright +   │                   │ HTTP calls (always live)   │
│  │  bu-30b-a3b-preview │                   ▼                            │
│  │                     │    ┌─────────────────────────────────────┐     │
│  │ Both paths produce: │    │     WebArena EC2 (live apps)        │     │
│  │  • Filtered HAR     │    │                                     │     │
│  │  • OpenAPI-like spec│    │  :7770  Shopping (Magento 2)        │     │
│  │  • GEMMA embeddings │    │  :7780  Shopping Admin              │     │
│  │    for search_      │    │  :9999  Forum (Postmill)            │     │
│  │    endpoints()      │    │  :8888  Wikipedia (Kiwix)           │     │
│  └─────────────────────┘    │  :3000  Map (OpenStreetMap)         │     │
│                             └──────────────┬──────────────────────┘     │
│                                            │                            │
│                                            │ episode trajectory         │
│                                            ▼                            │
│                             ┌─────────────────────────────────────┐     │
│                             │    Deterministic Judge              │     │
│                             │                                     │     │
│                             │  Per-template programmatic grader:  │     │
│                             │  • Inspects episode trajectory      │     │
│                             │  • Optionally probes live app state │     │
│                             │  • Verifies parameter sourcing      │     │
│                             │    (TASK_SPEC / PREV_CALL /         │     │
│                             │     AUTH_FLOW / STATIC / DERIVED)   │     │
│                             │  • Scores [0.0 → 1.0]               │     │
│                             └──────────────┬──────────────────────┘     │
│                                            │                            │
│                                            ▼                            │
│                             ┌─────────────────────────────────────┐     │
│                             │         Reward Signal               │     │
│                             │                                     │     │
│                             │  Per-step:                          │     │
│                             │   +0.2  valid API call (2xx)        │     │
│                             │   +0.1  new path explored           │     │
│                             │   +0.25 correct param sourcing      │     │
│                             │   −0.15 repeated identical call     │     │
│                             │   −0.3  browser_agent called again  │     │
│                             │                                     │     │
│                             │  Episode end:                       │     │
│                             │  +2.0–+5.0 task complete (easy→hard)│     │
│                             │  −1.5      task failed              │     │
│                             └──────────────┬──────────────────────┘     │
│                                            │                            │
│                                            ▼                            │
│                             ┌─────────────────────────────────────┐     │
│                             │    GRPO (via HF TRL)                │     │
│                             │                                     │     │
│                             │  8 parallel rollouts per prompt     │     │
│                             │  Computes advantages without        │     │
│                             │  a value function                   │     │
│                             │  Updates policy weights             │     │
│                             └──────────────┬──────────────────────┘     │
│                                            │                            │
│                                            └──► updated Policy Model    │
└─────────────────────────────────────────────────────────────────────────┘

Target Applications

All applications run on a single AWS EC2 instance: real production software, no simulation.

| App | Port | Software |
| --- | --- | --- |
| Shopping | 7770 | Magento 2 (open-source e-commerce platform) |
| Shopping Admin | 7780 | Magento 2 Admin (backend panel for the same store) |
| Forum | 9999 | Postmill (open-source Reddit-like forum) |
| Wikipedia | 8888 | Kiwix (read-only offline mirror of Wikipedia) |
| Map | 3000 | OpenStreetMap (collaborative mapping platform) |

Source: WebArena environment_docker


Tasks

HARvestGym trains on 7 task templates across three complexity tiers. Each template is a parameterized scenario with one reward function, one ground truth catalog entry, and one grader, but potentially hundreds of distinct episode variations produced by substituting different values for the template slots ({product_name}, {category_name}, etc.).

Complexity Tiers

| Tier | Characteristic | API calls required |
| --- | --- | --- |
| Easy | Single call, no auth | 1 |
| Medium | Auth + 1–2 dependent calls | 2–3 |
| Hard | Multi-step chain with ID threading, full auth | 4–8+ |

The model only graduates to harder templates once it reliably solves easier ones.
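
One way to picture this gating is a rolling success-rate check. The sketch below is an illustration only: the class name, the window size, and the promotion threshold are hypothetical and are not taken from the HARvestGym training code.

```python
from collections import deque

TIERS = ["easy", "medium", "hard"]

class Curriculum:
    """Advance one tier once the rolling success rate clears a threshold."""

    def __init__(self, window: int = 50, threshold: float = 0.8) -> None:
        self.window = window          # hypothetical: episodes per evaluation window
        self.threshold = threshold    # hypothetical: success rate needed to advance
        self.tier_index = 0
        self.results: deque[bool] = deque(maxlen=window)

    @property
    def tier(self) -> str:
        return TIERS[self.tier_index]

    def record(self, success: bool) -> None:
        """Store one episode outcome and promote if the window is full and good."""
        self.results.append(success)
        window_full = len(self.results) == self.window
        rate = sum(self.results) / len(self.results)
        can_advance = self.tier_index < len(TIERS) - 1
        if window_full and rate >= self.threshold and can_advance:
            self.tier_index += 1
            self.results.clear()  # start a fresh window for the new tier

curriculum = Curriculum(window=4, threshold=0.75)
for outcome in [True, False, True, True, True]:
    curriculum.record(outcome)
print(curriculum.tier)
```

Clearing the window on promotion keeps easy-tier successes from counting toward the next promotion.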

Task Templates

| # | Tier | App | Template | Key Challenge |
| --- | --- | --- | --- | --- |
| 1 | Easy | Shopping | List products in category {category_name} | Single GET with query params |
| 2 | Easy | Wikipedia | Retrieve article summary for {title} | Single GET, path parameter resolution |
| 3 | Medium | Shopping | Add {product_name} to a guest cart | 2 calls: create cart → add item; ID threading |
| 4 | Medium | Forum | Retrieve all posts in {forum_category} (authed) | Login → extract session → GET |
| 5 | Hard | Forum | Create a post titled {title} in {category} | Login → extract CSRF form_key → POST with full schema |
| 6 | Hard | Shopping | Guest checkout for {product_name} | 5+ chained calls; cart → item → shipping → payment |
| 7 | Hard | Shopping Admin | Create a new product with SKU {sku}, price {price} | Admin bearer token → full Magento product schema |

Template parameters are populated from a static parameter pool built by querying the live applications before training (see parameter_pools.json, refreshed via scripts/build_parameter_pools.py). Each episode samples randomly from its pool. The model never sees the pool directly; it must discover the correct values through its own API calls.
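
Slot substitution of this kind can be sketched in a few lines. The pool layout below (slot name mapped to a list of values) is an assumption; the actual schema of parameter_pools.json is not shown here, and sample_task is a hypothetical helper name.

```python
import random
import string

def sample_task(template: str, pool: dict[str, list[str]], rng: random.Random) -> str:
    """Fill every {slot} in a task template with a random value from its pool."""
    slots = [name for _, name, _, _ in string.Formatter().parse(template) if name]
    values = {slot: rng.choice(pool[slot]) for slot in slots}
    return template.format(**values)

# Hypothetical pool contents; the real values live in parameter_pools.json.
pool = {"product_name": ["Radiant Tee", "Camera Backpack", "Flannel Jacket"]}
task = sample_task("Add {product_name} to a guest cart", pool, random.Random(0))
print(task)
```

Passing an explicit random.Random instance keeps episode sampling reproducible when a seed is fixed.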

Each task has a deterministic programmatic grader (score in [0.0, 1.0]):

  • Easy graders: check HTTP response body for expected values
  • Medium graders: probe application state after episode (e.g., fetch the cart, verify item is present)
  • Hard graders: verify multi-step state change in the application (e.g., post exists, checkout created)

Spaces

Observation Space

What the model sees at each step:

from typing import Any

from pydantic import BaseModel


class Observation(BaseModel):
    task: str                  # Natural language task
    app_base_url: str          # Root URL of the target application
    last_tool_result: Any      # Result of last tool call
    history: list[dict]        # Full episode trajectory: [{action, tool_result}, ...]
    session_state: dict        # Auto-managed: cookies, tokens, CSRF values
    step_count: int            # Steps taken so far in this episode
    max_steps: int             # 20

session_state is maintained by the environment. The model decides when to authenticate and which session values to use; the environment handles extraction from Set-Cookie headers and response bodies.
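
Cookie extraction from Set-Cookie headers can be sketched with the standard library. The session_state layout used here (a "cookies" dict) and the function name are assumptions for illustration, not the environment's actual code.

```python
from http.cookies import SimpleCookie

def update_session_state(session_state: dict, headers: list[tuple[str, str]]) -> dict:
    """Merge cookies from Set-Cookie response headers into session_state."""
    cookies = dict(session_state.get("cookies", {}))
    for name, value in headers:
        if name.lower() != "set-cookie":
            continue
        parsed = SimpleCookie()
        parsed.load(value)  # parses 'name=value; attr=...' into morsels
        for key, morsel in parsed.items():
            cookies[key] = morsel.value
    return {**session_state, "cookies": cookies}

state = update_session_state(
    {"cookies": {}},
    [("Content-Type", "text/html"),
     ("Set-Cookie", "PHPSESSID=abc123; path=/; HttpOnly")],
)
print(state)
```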

Response truncation rules applied in order:

  1. Non-JSON body (HTML, CSS): truncated to 3,000 characters
  2. JSON primitive (string, number): never truncated, because these are tokens and IDs
  3. Error response (4xx/5xx): never truncated, because the model needs every word to self-correct
  4. Small JSON (no large arrays): returned as-is
  5. Large JSON array (≥ 3 items): first 2 items shown + _list_truncated annotation + hint to call search_episode_data()
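
The rules above can be sketched as a single function. This takes the listed order literally (so a non-JSON error page would hit rule 1 first, which may not match the real environment), and the shape of the _list_truncated marker is an assumption.

```python
import json
from typing import Any

MAX_TEXT_CHARS = 3000    # rule 1 limit
LARGE_ARRAY_MIN = 3      # rule 5: arrays with at least this many items are shortened
ITEMS_SHOWN = 2          # rule 5: items kept from a large array

def shorten_lists(value: Any) -> Any:
    """Recursively keep only the first items of large arrays (rules 4 and 5)."""
    if isinstance(value, dict):
        return {key: shorten_lists(item) for key, item in value.items()}
    if isinstance(value, list):
        if len(value) < LARGE_ARRAY_MIN:
            return [shorten_lists(item) for item in value]
        shown = [shorten_lists(item) for item in value[:ITEMS_SHOWN]]
        # Hypothetical marker layout; only the key name comes from the README.
        note = {"_list_truncated": f"{len(value) - ITEMS_SHOWN} more items; "
                                   "call search_episode_data() to retrieve them"}
        return shown + [note]
    return value

def truncate_body(status_code: int, body: str) -> Any:
    """Apply the five truncation rules in the order they are listed."""
    try:
        parsed = json.loads(body)
    except ValueError:
        return body[:MAX_TEXT_CHARS]       # rule 1: non-JSON text
    if not isinstance(parsed, (dict, list)):
        return parsed                       # rule 2: JSON primitive
    if status_code >= 400:
        return parsed                       # rule 3: error responses stay whole
    return shorten_lists(parsed)            # rules 4 and 5

print(truncate_body(200, json.dumps({"items": [1, 2, 3, 4]})))
```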

Every curl_exec call indexes the full response into a per-episode hybrid index (BM25 + GEMMA embeddings) before truncation, so all items remain retrievable even when only 2 were shown.
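
The keyword half of such an index can be illustrated with a minimal Okapi BM25 implementation. This is a sketch under stated assumptions: it covers BM25 only (no embeddings), the class and tokenizer are hypothetical, and the second sample document is invented for the example.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Lowercase and split on anything that is not a letter, digit, or underscore."""
    return re.findall(r"[a-z0-9_]+", text.lower())

class BM25Index:
    """Minimal Okapi BM25 over an in-memory list of documents."""

    def __init__(self, k1: float = 1.5, b: float = 0.75) -> None:
        self.k1, self.b = k1, b
        self.docs: list[str] = []
        self.term_counts: list[Counter] = []
        self.doc_freq: Counter = Counter()

    def add(self, doc: str) -> None:
        counts = Counter(tokenize(doc))
        self.docs.append(doc)
        self.term_counts.append(counts)
        self.doc_freq.update(counts.keys())  # one count per document per term

    def search(self, query: str, top_k: int = 5) -> list[str]:
        n_docs = len(self.docs)
        if n_docs == 0:
            return []
        avg_len = sum(sum(c.values()) for c in self.term_counts) / n_docs
        terms = tokenize(query)
        scored = []
        for doc, counts in zip(self.docs, self.term_counts):
            doc_len = sum(counts.values())
            score = 0.0
            for term in terms:
                freq = counts.get(term, 0)
                if freq == 0:
                    continue
                df = self.doc_freq[term]
                idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
                norm = freq + self.k1 * (1 - self.b + self.b * doc_len / avg_len)
                score += idf * freq * (self.k1 + 1) / norm
            if score > 0:
                scored.append((score, doc))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored[:top_k]]

index = BM25Index()
index.add('{"sku": "MH01", "name": "Radiant Tee", "price": 22.0}')
index.add('{"sku": "WJ12", "name": "Olivia Jacket", "price": 77.0}')  # invented item
print(index.search("Radiant Tee sku")[0])
```

Indexing each JSON object as its own document is what lets a later query recover an item that was cut from the truncated view.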

Action Space

The model outputs a single tool call per step.

| Tool | Input | Output |
| --- | --- | --- |
| browser_agent(task, url) | Task string + app base URL | Summary list of API endpoint names + methods (e.g. GET /products) |
| search_endpoints(query) | Natural language query | Top-3 endpoint schemas (method, path, auth, params with sources, response fields) |
| curl_exec(command) | Full curl command string | {status_code, headers, body}; body smart-truncated, full body indexed |
| search_episode_data(query) | Keyword or natural language query | Top-5 JSON objects from this episode's request/response history |
| done(result?) | Optional result string | Ends episode, triggers judge evaluation |

browser_agent is called exactly once per episode, at step 1. Calling it again applies a −0.3 penalty. During training, it loads a cached HAR file; at inference, it launches a live browser session.

Full technical specifications for all tools: TOOLS.md

Reward Space

Per-step:

| Signal | Value | Trigger |
| --- | --- | --- |
| Valid API call (2xx) | +0.2 | curl_exec returns 2xx status |
| New path called this episode | +0.1 | Normalized path not called before; discourages looping |
| Correct parameter sourcing | +0.25 | Judge: value came from the correct source type |
| Session value correctly used | +0.1 | Auth token/cookie present and correct in curl call |
| Repeated identical call | −0.15 | Exact duplicate curl command issued twice |
| browser_agent called again | −0.3 | browser_agent called after step 1 |
| Malformed curl command | −0.1 | curl cannot be parsed or executed |
| 4xx response (recoverable) | −0.05 | Call failed but episode continues |

Episode end:

| Outcome | Reward |
| --- | --- |
| Task completed correctly | +2.0 to +5.0 (scales with difficulty tier) |
| Partial completion (right endpoints, wrong param threading) | +0.5 to +1.5 |
| Authentication correctly obtained (even if task fails) | +0.3 |
| Timeout / task failed entirely | −1.5 |

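Target signal separation: successful episodes +3 to +7, failed episodes −2 to −1. GRPO relies on this gap: it computes each rollout's advantage relative to the other rollouts in the same group, so successes and failures must score clearly apart.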
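
Why the separation matters can be seen from how group-relative advantages are commonly computed. The sketch below assumes the usual formulation (reward minus group mean, divided by group standard deviation); TRL's exact normalization and epsilon may differ, and the reward values are illustrative only.

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each rollout's reward against its own group (no value function)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std over the group
    return [(reward - mean) / (std + eps) for reward in rewards]

# Eight rollouts for one prompt; values are made up for illustration.
rewards = [4.1, -1.5, 3.6, -1.8, -1.5, 5.2, -2.0, -1.5]
advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])

# If every rollout in a group scores the same, all advantages collapse to zero.
print(group_relative_advantages([1.0] * 8))
```

A group with identical rewards yields zero advantages and therefore no learning signal, which is why successful and failed episodes need clearly different totals.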

Reward design note: Pure step-level rewards can teach a model to "look busy", accumulating exploration rewards while never completing the task. The terminal outcome reward is designed to dominate the sum of all per-step rewards. The curriculum is the primary defense: Easy tasks have a trivially short optimal path (2 steps), so there is no room to accumulate fake exploration reward before the model learns that the terminal reward is what matters.
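
As a worked illustration of how the two reward tables combine, the sketch below tallies an episode return. The numeric values come from the tables above; the event key names and the two example episodes are hypothetical.

```python
# Per-step values from the reward table; the dictionary keys are invented labels.
STEP_REWARDS = {
    "valid_call": 0.2,
    "new_path": 0.1,
    "correct_param_source": 0.25,
    "session_value_used": 0.1,
    "repeated_call": -0.15,
    "browser_agent_again": -0.3,
    "malformed_curl": -0.1,
    "recoverable_4xx": -0.05,
}

def episode_return(step_events: list[list[str]], terminal_reward: float) -> float:
    """Sum the per-step signals of every step, then add the terminal reward."""
    step_total = sum(STEP_REWARDS[event] for events in step_events for event in events)
    return round(step_total + terminal_reward, 2)

# A short successful Easy episode: one valid call on a new path, then done().
solved = episode_return([["valid_call", "new_path"]], terminal_reward=2.0)

# A failed episode that made six valid calls on new paths without finishing.
busy = episode_return([["valid_call", "new_path"]] * 6, terminal_reward=-1.5)

print(solved, busy)
```

In this example the short solved episode still scores well above the longer failed one.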


Key Design Decisions

Browser Agent as a Discovery Tool

The RL agent has access to a browser agent tool powered by bu-30b-a3b-preview, a 30B MoE vision-language model (3B active parameters) served via the browser-use library on Playwright. When called, it completes the task in a real browser while intercepting all network traffic, then returns the filtered API call list.

Training vs. inference: The browser agent output is pre-computed and cached per task during training, so the RL model receives it instantly and no live browser session runs. At inference, the browser agent runs live to handle novel tasks.

Full details: BROWSER_AGENT.md

Ground Truth from the Codebase, Not the Browser

The browser agent shows what API calls happen. It does not explain why: where each parameter comes from or what field constraints exist. That comes from a one-time static analysis of each WebArena application's Docker image source, producing a ground truth API catalog:

endpoint:    POST /rest/V1/guest-carts/{cartId}/items
path_params:
  cartId:    obtained from: POST /rest/V1/guest-carts → response body
body:
  cartItem.sku:       the product's SKU, from: GET /rest/V1/products → items[].sku
  cartItem.qty:       quantity, from: task specification
  cartItem.quote_id:  same as cartId

The judge uses this to verify not just what the model called, but where each parameter value came from. Source types: TASK_SPEC, PREV_CALL, AUTH_FLOW, STATIC, DERIVED. This is how partial credit works: the model gets reward for correctly threading a cart_id even if the final call had a wrong field elsewhere.
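
A rough picture of such a provenance check, for two of the five source types, is sketched below. It is a simplification, not the judge's actual logic: the function name is hypothetical and plain substring matching stands in for whatever matching the real grader uses.

```python
def sourced_correctly(expected_source: str, value: str, task: str,
                      prior_bodies: list[str]) -> bool:
    """Check that a parameter value is actually present in its expected source."""
    if expected_source == "TASK_SPEC":
        return value in task
    if expected_source == "PREV_CALL":
        return any(value in body for body in prior_bodies)
    # AUTH_FLOW, STATIC, and DERIVED need app-specific logic and are left out.
    raise NotImplementedError(f"source type not covered by this sketch: {expected_source}")

task = 'Add "Radiant Tee" to a guest cart'
prior_bodies = [
    '"cart-abc123"',
    '{"items": [{"sku": "MH01", "name": "Radiant Tee", "price": 22.0}]}',
]

print(sourced_correctly("PREV_CALL", "cart-abc123", task, prior_bodies))
print(sourced_correctly("PREV_CALL", "cart-zzz999", task, prior_bodies))
print(sourced_correctly("TASK_SPEC", "Radiant Tee", task, prior_bodies))
```

A cart ID that appears in no earlier response fails the PREV_CALL check, which is the kind of case where threading credit would be withheld.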

Full extraction process: GROUND_TRUTH_EXTRACTION.md

HTML and Form-Based Applications

Not every endpoint returns JSON. The Forum (Postmill) relies on HTML form submissions with CSRF tokens; Wikipedia (Kiwix) serves static HTML pages. The agent handles both:

  • CSRF tokens: The model GETs the form page, reads the value attribute of input[name="_csrf_token"] from the returned HTML, and places it in the subsequent POST. If the token is beyond the 3,000-character truncation point, it calls search_episode_data("_csrf_token"), which works because the full HTML is indexed before truncation.
  • HTML-only responses: Wikipedia responses are returned in the standard {status_code, headers, body} envelope. Search results appear in <a href> tags; article content in <p> tags.

Example Episode

TASK: Add "Radiant Tee" to a guest cart at http://ec2-16-59-2-56.../

STEP 1: browser_agent("Add Radiant Tee to a guest cart", "http://ec2-...:7770/")
→ {"endpoints": ["POST /rest/V1/guest-carts", "GET /rest/V1/products",
                  "POST /rest/V1/guest-carts/{id}/items", ...]}

STEP 2: search_endpoints("create guest cart")
→ ["endpoint: POST /rest/V1/guest-carts | auth: none | returns: string (cartId)"]

STEP 3: curl_exec("curl -X POST 'http://ec2-.../rest/V1/guest-carts' -H 'Content-Type: application/json'")
→ {status_code: 200, body: "cart-abc123"}

STEP 4: search_endpoints("find product by name get sku")
→ ["endpoint: GET /rest/V1/products | query: searchCriteria filters | returns: .items[].sku"]

STEP 5: curl_exec("curl 'http://ec2-.../rest/V1/products?searchCriteria[filter_groups][0][filters][0][field]=name&...[value]=Radiant+Tee'")
→ {status_code: 200, body: {"items":[{"sku":"MH01","name":"Radiant Tee","price":22.0}]}}

STEP 6: search_endpoints("add item to guest cart cartId")
→ ["endpoint: POST /rest/V1/guest-carts/{cartId}/items | path: cartId from POST /rest/V1/guest-carts | body: cartItem.sku, cartItem.qty, cartItem.quote_id"]

STEP 7: curl_exec("curl -X POST 'http://ec2-.../rest/V1/guest-carts/cart-abc123/items' -H 'Content-Type: application/json' -d '{\"cartItem\":{\"sku\":\"MH01\",\"qty\":1,\"quote_id\":\"cart-abc123\"}}'")
→ {status_code: 200, body: {"item_id": 5, "sku": "MH01", "qty": 1}}

STEP 8: done("Radiant Tee added to cart")

Values from prior responses (cart-abc123, MH01) are threaded directly from the growing episode history. If step 5 had returned 200 products truncated to 2, the model would call search_episode_data("Radiant Tee sku") to retrieve MH01 from the episode index.
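
For comparison, a scripted baseline could assemble the step 7 command from the threaded values as sketched below. The helper name and the base URL are hypothetical placeholders; the endpoint path and payload fields follow the episode above.

```python
import json
import shlex

def build_add_item_curl(base_url: str, cart_id: str, sku: str, qty: int) -> str:
    """Compose the add-to-cart curl command from values returned by earlier steps."""
    payload = {"cartItem": {"sku": sku, "qty": qty, "quote_id": cart_id}}
    url = f"{base_url.rstrip('/')}/rest/V1/guest-carts/{cart_id}/items"
    parts = ["curl", "-X", "POST", url,
             "-H", "Content-Type: application/json",
             "-d", json.dumps(payload)]
    return shlex.join(parts)  # quotes each argument safely for a shell

# cart_id comes from step 3, sku from step 5; the host is a placeholder.
command = build_add_item_curl("http://shop.example:7770/", "cart-abc123", "MH01", 1)
print(command)
```

Serializing the body with json.dumps and quoting with shlex.join avoids the hand-escaped quotes seen in the raw command string.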


Setup

Prerequisites

  • Docker installed and running
  • Python 3.11+ with uv
  • A Hugging Face token with read access

Local Development

# Clone and enter the project
git clone <your-hf-space-url>
cd HARvestGym

# Install dependencies
uv sync

# Validate the OpenEnv spec
openenv validate

# Build and run the Docker image
docker build -t harvgym .
docker run -p 8000:8000 harvgym

# Run the inference script
HF_TOKEN=hf_xxx uv run inference.py

Environment Variables

| Variable | Default | Required | Purpose |
| --- | --- | --- | --- |
| HF_TOKEN | (none) | Yes | Hugging Face auth token |
| API_BASE_URL | https://router.huggingface.co/v1 | No | LLM API endpoint |
| MODEL_NAME | google/gemma-4-31B-it | No | Model for inference |
| HARVGYM_TASK | har_classify_easy | No | Override which task to run |

API Endpoints

# Reset episode
curl -X POST http://localhost:8000/reset

# Execute a step
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"tool": "browser_agent", "args": {"task": "...", "url": "..."}}'

# Get current state
curl http://localhost:8000/state
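
The same calls can be made from Python with the standard library. This sketch assumes the server replies with JSON and that done takes its argument under the key "result"; neither is confirmed here, and the helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"

def build_step_request(tool: str, args: dict) -> urllib.request.Request:
    """Build the POST /step request for one tool call without sending it."""
    body = json.dumps({"tool": tool, "args": args}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/step",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )

def send(request: urllib.request.Request) -> dict:
    """Send a request to a running server and decode the reply (assumed JSON)."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))

# Nothing is sent until send(request) is called against a live server.
request = build_step_request("done", {"result": "Radiant Tee added to cart"})
print(request.get_method(), request.full_url)
```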

Baseline Performance

Scores were generated by running uv run inference.py with google/gemma-4-31B-it via the Hugging Face Router.

| Task | Difficulty | Score | Steps | Notes |
| --- | --- | --- | --- | --- |
| easy_list_pants | Easy | 0.74 | 7 | List products in 'Pants' category |
| medium_cart_camera_backpack | Medium | 0.46 | 20 | Add Camera Backpack to guest cart |
| medium_cart_flannel_jacket | Medium | 0.52 | 20 | Add Flannel Jacket to guest cart |
| hard_checkout_ripstop_pants | Hard | 0.14 | 20 | Full guest checkout (hit step limit) |
| Overall | n/a | 0.47 | n/a | |

To regenerate: HF_TOKEN=hf_xxx uv run inference.py