
HARvestGym Tool Specification

Technical specification for all tools available to the RL agent. Each tool is a Python function called by the environment on behalf of the model. The model outputs a single tool call per step; the environment executes it and returns the result.


Tool Set Summary

| Tool | Input | What It Does | Output |
|------|-------|--------------|--------|
| browser_agent(task, url) | Task string + app base URL | Checks whether a pre-recorded HAR file exists for this app and, if so, processes it (training only; at inference time the live browser agent always runs); otherwise launches a live browser agent to perform the task and record network traffic. Either way, extracts an OpenAPI-like spec from the captured traffic, builds GEMMA embeddings for the endpoint search index, and returns a summary endpoint list. | Deduplicated list of endpoint names with HTTP methods (e.g. GET /products, POST /guest-carts). Summary only: no headers, bodies, or schemas. Use search_endpoints() with a natural-language query for full details on any endpoint. |
| search_endpoints(query) | Natural-language query | Semantic search over the endpoint embeddings built by browser_agent. Matches the query against the GEMMA-embedded OpenAPI-like spec and returns the top-3 endpoint schemas with full parameter details. | Top-3 endpoint schemas (method, path, auth, params with sources, response fields) |
| curl_exec(command) | Full curl command string | Parses the curl command, executes it via subprocess against the live EC2 server, indexes the full response body (before truncation) into the episode's hybrid BM25 + GEMMA store, then returns a truncated observation. | {status_code, headers, body}; body is smart-truncated, full body indexed into the episode store for search_episode_data() |
| search_episode_data(query) | Natural-language or keyword query | Hybrid BM25 + GEMMA semantic search across all request/response bodies accumulated during this episode from prior curl_exec calls. BM25 handles exact keyword matches (IDs, SKUs); GEMMA handles semantic paraphrases. Finds specific values from truncated or prior responses. | Top-5 JSON objects from this episode's request/response history, each annotated with step number and source endpoint |
| done(result?) | Optional result string | Signals that the model believes the task is complete. Triggers the judge to evaluate the episode against the ground-truth catalog. | Ends the episode |

browser_agent is always called once at the start of an episode (step 1). It gives the model an API landscape map for the target application, so the model knows which endpoints exist before it begins probing. All subsequent discovery and execution uses the other tools. If the model calls browser_agent again after step 1, it receives a -0.3 reward penalty; the call still executes normally (loads the HAR if it exists, or runs the live browser), and the penalty is applied only to the reward signal.


Tool 0: browser_agent(task, url)

Purpose

Give the model an initial map of the API surface for the target application at the start of every episode. The browser agent is a multi-stage pipeline that:

  1. Obtains HAR data (from pre-recorded file if available, or by launching a live browser)
  2. Processes it to extract an OpenAPI-like spec
  3. Builds GEMMA embeddings so search_endpoints() can work
  4. Returns a summary-only endpoint list to the RL agent: just names and methods, no schemas

The output is intentionally sparse: an application can expose many endpoints, and returning full schemas for all of them would waste context window. The agent sees which endpoints exist but not how to call them. It must call search_endpoints() to get full parameter details for any endpoint.

Interface

def browser_agent(task: str, url: str) -> dict:
    """
    Multi-stage pipeline:
    1. Check for pre-recorded HAR file → load if exists, else launch live browser
    2. Filter HAR → extract OpenAPI-like spec (methods, paths, params, bodies)
    3. Build GEMMA embeddings over the spec → stored for search_endpoints()
    4. Return summary endpoint list (names + methods only)

    Returns: {
        "app": str,                      # resolved app name (shopping, forum, osm, wikipedia)
        "endpoints": list[dict],         # summary: [{method, path}] β€” no schemas, no headers
        "total_endpoints": int,          # count of deduplicated endpoints
        "note": str                      # directs agent to use search_endpoints() for details
    }
    """

Stage 1 — HAR Data Source

The browser agent first checks whether a pre-recorded HAR file exists. If it does, the browser is never launched, saving 30-120s per episode.

hars/
  shopping.har       # all shopping tasks, all API calls recorded for all task templates
  shopping_admin.har # all admin tasks
  forum.har          # all forum tasks
  osm.har            # all OSM tasks
  wikipedia.har      # all Wikipedia tasks
HAR_MAP = {
    ":7770": "hars/shopping.har",
    ":7780": "hars/shopping_admin.har",
    ":9999": "hars/forum.har",
    ":3000": "hars/osm.har",
    ":8888": "hars/wikipedia.har",
}

async def get_har_data(task: str, url: str) -> dict:
    har_path = resolve_har_path(url)       # port-based lookup from HAR_MAP
    if har_path and os.path.exists(har_path):
        # HAR exists: load from disk, no browser needed
        with open(har_path) as f:
            return json.load(f)
    else:
        # No HAR: launch live browser, perform task, capture traffic
        raw_log = await run_browser_agent_live(task, url, "browser-use/bu-30b-a3b-preview")
        return convert_raw_log_to_har(raw_log)

If no HAR exists, the browser agent launches Chromium via Playwright, connects the bu-30b-a3b-preview LLM, performs the task while intercepting all network traffic, and produces a HAR-format output. See BROWSER_AGENT.md for the live browser implementation.

Stage 2 — Filter and Extract OpenAPI-like Spec

The HAR data (from either source) is processed to extract a structured spec:

def extract_openapi_spec(har_data: dict, app_base_url: str) -> list[dict]:
    entries = har_data["log"]["entries"]
    seen = set()
    spec_entries = []

    for entry in entries:
        req = entry["request"]
        resp = entry["response"]
        raw_url = req["url"]
        method = req["method"]

        # 1. Skip static assets (images, fonts, CSS, JS bundles, favicon)
        if _is_static_asset(raw_url):
            continue

        # 2. Skip page navigation (HTML document loads)
        content_type = _get_response_content_type(resp)
        if "text/html" in content_type and method == "GET":
            continue

        # 3. Normalise path: replace concrete IDs with {id} placeholders
        path = _normalise_path(urlparse(raw_url).path)

        # 4. Deduplicate by (method, normalised_path)
        key = f"{method} {path}"
        if key in seen:
            continue
        seen.add(key)

        # 5. Extract auth, body, query params for the spec document
        has_auth = any(
            h["name"].lower() in ("authorization", "x-api-key", "cookie")
            for h in req["headers"]
        )

        spec_entries.append({
            "method": method,
            "path": path,
            "query_params": urlparse(raw_url).query or None,
            "request_body": _extract_body(req),
            "status_code": resp["status"],
            "response_content_type": content_type,
            "response_body_sample": _truncate_body(resp),
            "auth_observed": has_auth,
        })

    return spec_entries

Stage 3 — Build GEMMA Embeddings

The spec entries are embedded using google/embeddinggemma-300m (GEMMA). These embeddings are stored in the environment and power search_endpoints().

def build_endpoint_embeddings(spec_entries: list[dict], app_name: str):
    model = SentenceTransformer("google/embeddinggemma-300m", token=os.environ.get("HF_TOKEN"))
    chunks = [spec_entry_to_text(e, app_name) for e in spec_entries]
    embeddings = model.encode_document(chunks, batch_size=32)
    return embeddings, chunks  # stored in env for search_endpoints()

Stage 4 — Return Summary

The RL agent receives only endpoint names and methods: no schemas, no headers, no body details:

{
  "app": "shopping",
  "endpoints": [
    {"method": "POST", "path": "/rest/V1/integration/customer/token"},
    {"method": "GET",  "path": "/rest/V1/products"},
    {"method": "GET",  "path": "/rest/V1/products/{id}"},
    {"method": "POST", "path": "/rest/V1/guest-carts"},
    {"method": "POST", "path": "/rest/V1/guest-carts/{id}/items"},
    {"method": "GET",  "path": "/rest/V1/guest-carts/{id}/totals"},
    {"method": "POST", "path": "/rest/V1/guest-carts/{id}/order"},
    {"method": "GET",  "path": "/rest/V1/categories"}
  ],
  "total_endpoints": 8,
  "note": "These endpoints were observed for this application. Use search_endpoints() with a natural language query to get the full schema, parameters, and auth details for any endpoint."
}

Path Normalisation

The _normalise_path function replaces concrete dynamic segments with {id} placeholders so that duplicates collapse:

  • Numeric IDs: /products/42 → /products/{id}
  • UUIDs: /carts/3fa85f64-5717-4562-b3fc → /carts/{id}
  • Magento cart IDs (mixed alphanumeric, 32+ chars): detected by length and character set
  • OSM node/way/relation IDs: /api/0.6/node/12345678 → /api/0.6/node/{id}
  • Forum post slugs: /f/general/1-hello-world → /f/{slug}/{id}-{slug}

Normalisation is pattern-based (regex), not AI-generated. No external calls.

When to Call

browser_agent is called exactly once per episode, at step 1, before any other tool. It serves as the API landscape orientation and builds the search index. If called again after step 1, the call executes normally but the model receives a -0.3 reward penalty. The model should not need to call it again mid-episode.

Relationship to Other Tools

browser_agent  →  "what endpoints exist?" (summary only)
    │                    + builds GEMMA embeddings internally
    │
    ▼
search_endpoints  →  "give me full schema for endpoint X"
    │                    (searches the GEMMA embeddings built above)
    ▼
curl_exec         →  "call endpoint X, get live response"
    │                    (indexes full response into BM25 episode store)
    ▼
search_episode_data  →  "find specific value from a prior response"
                         (BM25 search over indexed episode data)

browser_agent provides breadth (what exists) and builds the search index. search_endpoints provides depth (how to call it). curl_exec provides live data and feeds the episode index. search_episode_data retrieves specific values from that index.


Tool 1: search_endpoints(query)

Purpose

Find which API endpoint to call for a given subtask. The model calls this when it does not yet know the correct URL, method, or parameter schema for the next HTTP call it needs to make.

Interface

def search_endpoints(query: str) -> list[str]:
    """
    Semantic search over the endpoint embeddings built by browser_agent.
    Returns the top-3 matching endpoint schemas as formatted text strings.
    """

Underlying Index

  • Source: The GEMMA embeddings built by browser_agent during Stage 3. These embeddings are created from the OpenAPI-like spec extracted from HAR data: the actual network traffic observed when the browser agent performed tasks on the application.
  • Embedding model: google/embeddinggemma-300m via sentence-transformers
  • Built: By browser_agent at the start of each episode (Stage 3). The browser agent processes the HAR data, extracts the OpenAPI-like spec, converts each endpoint to a text chunk, and embeds them using GEMMA.
  • At runtime: Stored in environment memory after browser_agent completes. Available for the rest of the episode. Discarded at episode end (rebuilt from HAR at next episode start).
  • Query embedding: Uses the encode_query method with prompt task: search result | query: {query}.
  • Document embedding: Uses the encode_document method with prompt title: {endpoint} | text: {full_schema_text}.
  • Similarity: Use the similarity function specified by the google/embeddinggemma-300m model card. The model's sentence_transformers_config.json specifies the correct metric (typically cosine similarity for normalized embeddings). Pure numpy, no FAISS needed at this scale.

Document Structure (one per extracted endpoint)

Each endpoint from the browser agent's OpenAPI-like spec is converted to a searchable text chunk by spec_entry_to_text():

app: shopping | endpoint: POST /rest/V1/guest-carts/{id}/items | status: 200 | auth: none | body: {"cartItem":{"sku":"MH01","qty":1,"quote_id":"cart-abc123"}} | response_sample: {"item_id":5,"sku":"MH01","qty":1}

The text chunks include method, path, status code, auth observation, query params, request body sample, and response body sample, all extracted from the actual HAR traffic. This is richer than the bare endpoint names the RL agent sees in browser_agent's summary output, but less structured than a hand-written catalog.
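
A sketch of spec_entry_to_text producing the chunk format shown above; the field names follow the spec_entries built by extract_openapi_spec, but the exact formatting is an assumption:

```python
import json

def spec_entry_to_text(e: dict, app_name: str) -> str:
    # One searchable text chunk per endpoint, pipe-separated as in the example
    parts = [
        f"app: {app_name}",
        f"endpoint: {e['method']} {e['path']}",
        f"status: {e['status_code']}",
        f"auth: {'observed' if e.get('auth_observed') else 'none'}",
    ]
    if e.get("query_params"):
        parts.append(f"query: {e['query_params']}")
    if e.get("request_body"):
        parts.append(f"body: {json.dumps(e['request_body'])}")
    if e.get("response_body_sample"):
        parts.append(f"response_sample: {e['response_body_sample']}")
    return " | ".join(parts)
```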

Output Format

Returns a list of 3 strings, each being the full text of one matching endpoint schema. The model reads these directly and extracts the method, URL pattern, observed parameters, and response structure.

When to Call

  • At the start of a subtask: "I need to authenticate; what endpoint handles login?"
  • When discovering a prerequisite: "I need a cart ID first β€” what creates a cart?"
  • When unsure of the exact URL pattern: "Is it /products/{id} or /products?id=?"

Caveats

  • Returns observed traffic patterns, not formal API documentation. The schemas reflect what was seen in the HAR, not what the API formally supports. Some optional parameters may be missing if the browser agent's session didn't exercise them.
  • Returns schemas, not live values. The model still needs curl_exec to get actual data (product SKUs, cart IDs, etc.).
  • If no relevant endpoint exists in the index, returns the closest matches by cosine similarity. The model should treat low-confidence results skeptically and try curl_exec to probe.
  • The index covers only the current application (determined by browser_agent's URL). Each episode's index is specific to one app.

Tool 2: curl_exec(command)

Purpose

Execute an HTTP request against the live EC2 application and return the response. This is the primary action tool: it is how the model actually interacts with the application.

Interface

def curl_exec(command: str) -> dict:
    """
    Parses a curl command string, executes it via subprocess against the live EC2 server,
    indexes the full response into the episode store, then returns a truncated observation.

    Returns: {
        "status_code": int,
        "headers": dict,          # response headers
        "body": str | dict        # truncated; see truncation rules below
    }
    """

Execution Pipeline

The environment performs these steps in order on every curl_exec call:

1. Parse the curl command string
      Extract: method, URL, headers, body
      Validate: URL host must match app_base_url (reject requests to external hosts)
      Inject: session cookies from session_state into headers automatically

2. Execute via subprocess
      subprocess.run(["curl", ...parsed args...], timeout=10)
      Capture: status_code, response headers, response body (full, untruncated)

3. Index into episode store (BEFORE truncation)
      Index the request body (if any)
      Index the response body
      See: Episode Store section below

4. Truncate the response body for context
      Apply truncation rules (see below)
      Add truncation note if any array was trimmed

5. Return to model
      {status_code, headers, truncated_body} or error
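
Step 1 of the pipeline can be sketched with shlex; this minimal parser is an assumption and handles only the curl flags that appear in this document (-X, -H, -d):

```python
import shlex

def parse_curl(command: str) -> dict:
    # Tokenize respecting shell quoting, then walk the flags
    tokens = shlex.split(command)
    assert tokens and tokens[0] == "curl", "expected a curl command"
    req = {"method": "GET", "url": None, "headers": {}, "body": None}
    i = 1
    while i < len(tokens):
        t = tokens[i]
        if t == "-X":
            req["method"] = tokens[i + 1]; i += 2
        elif t == "-H":
            name, _, value = tokens[i + 1].partition(":")
            req["headers"][name.strip()] = value.strip(); i += 2
        elif t in ("-d", "--data"):
            req["body"] = tokens[i + 1]
            if req["method"] == "GET":
                req["method"] = "POST"  # curl implies POST when -d is given
            i += 2
        else:
            req["url"] = t; i += 1
    return req
```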

Truncation Rules

Applied in order. First matching rule wins.

Rule 1 — Non-JSON body:

HTML from form-serving pages (login, post creation, etc.) gets a generous cutoff because CSRF tokens and <input> fields are embedded inside the markup. The model locates them by reading the raw HTML string; no HTML parser is required, since tokens appear as predictable plain-text patterns (<input type="hidden" name="_csrf_token" value="…">). Even with 3,000 characters, if the CSRF token appears after the cutoff (possible in large pages), the full body is still indexed in the episode store and can be retrieved with search_episode_data("_csrf_token").

NONJSON_MAX_CHARS = 3000

if not is_valid_json(body):
    return body[:NONJSON_MAX_CHARS] + (" [truncated - non-JSON response]" if len(body) > NONJSON_MAX_CHARS else "")

Rule 2 — JSON primitive (string, number, boolean, null):

if isinstance(parsed, (str, int, float, bool)) or parsed is None:
    return body  # never truncate; these are tokens, IDs, simple confirmations

Rule 3 — Error response (4xx or 5xx):

if status_code >= 400:
    return body  # never truncate error messages; the model needs every word to self-correct

Rule 4 — JSON object or array with no large arrays:

# "large" means an array with >= 3 objects (dicts)
if no array field contains >= 3 dict items:
    return body  # small enough; return as-is

Rule 5 — JSON with large array field(s):

# For each top-level field whose value is a list of >= 3 dicts:
#   Keep first 2 items, drop the rest
#   Add a _list_truncated annotation at the top level

truncated = {k: v for k, v in parsed.items() if not is_large_list(v)}
for key, val in parsed.items():
    if is_large_list(val):
        truncated[key] = val[:2]
        truncated["_list_truncated"] = {
            "field": key,
            "shown": 2,
            "total": len(val),
            "note": f"Showing 2 of {len(val)} items. Use search_episode_data() to find a specific item from this response."
        }
return json.dumps(truncated)

The note is a static Python format string. It is not AI-generated. It does not suggest specific query parameters or URL patterns.
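
The five rules can be combined into one function. A minimal sketch: is_large_list and the top-level-array branch are assumptions, and the constants mirror the snippets above:

```python
import json

NONJSON_MAX_CHARS = 3000   # Rule 1 cutoff
LARGE_LIST_MIN = 3         # "large" = array with >= 3 dict items
KEEP_ITEMS = 2             # items kept per large array (Rule 5)

def is_large_list(v) -> bool:
    return isinstance(v, list) and sum(isinstance(x, dict) for x in v) >= LARGE_LIST_MIN

def truncate_body(body: str, status_code: int) -> str:
    # Rule 1: non-JSON bodies get a flat character cutoff
    try:
        parsed = json.loads(body)
    except (json.JSONDecodeError, TypeError):
        suffix = " [truncated - non-JSON response]" if len(body) > NONJSON_MAX_CHARS else ""
        return body[:NONJSON_MAX_CHARS] + suffix
    # Rule 2: JSON primitives are never truncated
    if not isinstance(parsed, (dict, list)):
        return body
    # Rule 3: error bodies are never truncated
    if status_code >= 400:
        return body
    # Top-level arrays: trim directly if large (assumed behavior)
    if isinstance(parsed, list):
        return json.dumps(parsed[:KEEP_ITEMS]) if is_large_list(parsed) else body
    # Rule 4: objects with no large array fields pass through
    if not any(is_large_list(v) for v in parsed.values()):
        return body
    # Rule 5: keep the first KEEP_ITEMS items of each large array field
    truncated = {}
    for k, v in parsed.items():
        if is_large_list(v):
            truncated[k] = v[:KEEP_ITEMS]
            truncated["_list_truncated"] = {
                "field": k, "shown": KEEP_ITEMS, "total": len(v),
                "note": f"Showing {KEEP_ITEMS} of {len(v)} items. "
                        "Use search_episode_data() to find a specific item from this response.",
            }
        else:
            truncated[k] = v
    return json.dumps(truncated)
```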

Session State Injection

Before executing the curl command, the environment reads session_state and injects any relevant cookies or tokens:

  • If session_state contains PHPSESSID (used by both the Shopping apps and the Forum on port 9999), inject as Cookie: PHPSESSID=...
  • If session_state contains form_key (Magento CSRF), inject as a header: X-Form-Key: ...
  • If session_state contains a bearer token, inject as Authorization: Bearer ..., but only if the model's curl command does not already include an Authorization header
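
A sketch of these injection rules; the session_state key names PHPSESSID and form_key come from this section, while bearer_token is an assumed key name:

```python
def inject_session(headers: dict, session_state: dict) -> dict:
    # Returns a new header dict with session values injected; never
    # overwrites an Authorization header the model set itself
    out = dict(headers)
    if "PHPSESSID" in session_state:
        cookie = f"PHPSESSID={session_state['PHPSESSID']}"
        out["Cookie"] = f"{out['Cookie']}; {cookie}" if "Cookie" in out else cookie
    if "form_key" in session_state:
        out["X-Form-Key"] = session_state["form_key"]
    if "bearer_token" in session_state and "Authorization" not in out:
        out["Authorization"] = f"Bearer {session_state['bearer_token']}"
    return out
```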

CSRF note for Postmill (Forum): Postmill's _csrf_token is a request-body field, not a header. The environment does not auto-inject it; the model must extract it from the HTML form response and include it explicitly in the POST body. The session_state cookie (PHPSESSID) is auto-injected so the server associates the CSRF token with the active session. The expected workflow:

GET /login → HTML body contains <input type="hidden" name="_csrf_token" value="XYZ">
Model reads "XYZ" from body string
POST /login -d '_csrf_token=XYZ&_username=user&_password=pass'
Environment auto-injects Cookie: PHPSESSID from session_state

The model is responsible for setting the correct Content-Type in its curl command. The model declares intent (which headers to include); the environment fills in the actual token values from session_state.

Caveats

  • curl_exec always hits the live EC2 server. No responses are mocked.
  • The timeout is 10 seconds. If the server does not respond, returns {status_code: 0, error: "timeout"}.
  • URL must be on the same host as app_base_url. Cross-host requests are rejected with {status_code: 0, error: "host_not_allowed"}.
  • The model must include the full URL including host and port. Relative paths are not supported.

Tool 3: search_episode_data(query)

Purpose

Search through all request and response bodies accumulated during the current episode. The model calls this when it needs a specific value (an ID, a name, a token) that was returned in a prior API response but the list was truncated.

This tool exists because not every data type has a search or filter API endpoint. For applications where the model cannot make a targeted filtered query (e.g., a listing endpoint that only supports pagination, not field-based filtering), searching the accumulated episode data is the only way to recover a specific value.

Interface

def search_episode_data(query: str) -> list[str]:
    """
    Hybrid BM25 + GEMMA search over all request and response bodies indexed during this episode.
    Returns the top-5 matching JSON objects as formatted text strings, each annotated
    with the step number and endpoint that produced them.
    """

Hybrid Search: BM25 + GEMMA Semantic Embeddings

The episode index uses a hybrid approach combining BM25 keyword matching with GEMMA semantic embeddings (google/embeddinggemma-300m). Both indexes are maintained in parallel: BM25 for fast exact keyword recall, GEMMA for semantic understanding when the agent's query uses different terminology than what appears in the response data.

Why hybrid, not BM25 alone:

BM25 excels at exact keyword matches ("MH01", "cart-abc123", "Radiant Tee") but fails on paraphrases. If the agent queries "price of the tee shirt I found earlier", BM25 won't match "Radiant Tee" because the terms don't overlap. GEMMA semantic embeddings bridge this gap: "tee shirt" and "Radiant Tee" are semantically close in embedding space.

Why hybrid, not GEMMA alone:

GEMMA embeddings are weaker at exact string matching. Searching for a specific cart ID like "cart-abc123" benefits from BM25's precise token matching. The hybrid approach gets the best of both.

Scoring: Results are ranked by a weighted combination of BM25 score (normalized) and GEMMA cosine similarity:

hybrid_score = alpha * bm25_score_normalized + (1 - alpha) * GEMMA_cosine_similarity
# alpha = 0.4 (tunable; favors semantic slightly over keyword)

Performance: GEMMA is 300M parameters. On GPU, embedding a batch of 200 response items takes ~1-2 seconds, an acceptable overhead per curl_exec call. The GEMMA model is already loaded in memory for search_endpoints, so there is no additional model loading cost. BM25 remains instantaneous.

Fallback: If no GPU is available, the system falls back to BM25-only mode. The GEMMA model is shared with search_endpoints; if it is loaded, episode data search uses it too.
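
The scoring formula above can be sketched end to end. The BM25 implementation here is a minimal Okapi stand-in for whatever BM25 library the environment actually uses, and embeddings are assumed L2-normalized so a dot product equals cosine similarity:

```python
import math
import numpy as np

def bm25_scores(query_tokens: list[str], docs_tokens: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> np.ndarray:
    # Minimal Okapi BM25 over whitespace tokens
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    df: dict[str, int] = {}
    for d in docs_tokens:
        for t in set(d):
            df[t] = df.get(t, 0) + 1
    scores = np.zeros(N)
    for i, d in enumerate(docs_tokens):
        for t in query_tokens:
            if t not in df:
                continue
            tf = d.count(t)
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            scores[i] += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avgdl))
    return scores

def hybrid_rank(query: str, docs: list[str], query_emb: np.ndarray,
                doc_embs: np.ndarray, alpha: float = 0.4, top_k: int = 5) -> list[str]:
    # hybrid_score = alpha * bm25_normalized + (1 - alpha) * cosine
    bm25 = bm25_scores(query.lower().split(), [d.lower().split() for d in docs])
    bm25_norm = bm25 / bm25.max() if bm25.max() > 0 else bm25
    cos = doc_embs @ query_emb          # embeddings assumed L2-normalized
    hybrid = alpha * bm25_norm + (1 - alpha) * cos
    return [docs[i] for i in np.argsort(-hybrid)[:top_k]]
```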

Episode Index — Document Construction

Every time curl_exec completes, the environment constructs embedding documents from the full (pre-truncation) request and response bodies and adds them to the in-memory BM25 index and GEMMA embedding store for the current episode.

Algorithm:

def build_index_documents(step: int, method: str, path: str,
                           request_body: Any, response_body: Any,
                           status_code: int) -> list[str]:
    docs = []

    # 1. Index the request body (if any)
    if request_body is not None:
        docs.append(
            f"step:{step} source:request endpoint:{method} {path} "
            f"body:{json.dumps(request_body, ensure_ascii=False)}"
        )

    # 2. Index the response body
    if response_body is None or not is_valid_json(response_body):
        docs.append(
            f"step:{step} source:response endpoint:{method} {path} "
            f"status:{status_code} body:{str(response_body)[:500]}"
        )
        return docs

    parsed = json.loads(response_body) if isinstance(response_body, str) else response_body

    # 3. JSON primitive β€” one document
    if isinstance(parsed, (str, int, float, bool)) or parsed is None:
        docs.append(
            f"step:{step} source:response endpoint:{method} {path} "
            f"status:{status_code} value:{parsed}"
        )
        return docs

    # 4. JSON object β€” find top-level array fields
    array_fields = {k: v for k, v in parsed.items()
                    if isinstance(v, list) and len(v) > 0 and isinstance(v[0], dict)}
    scalar_fields = {k: v for k, v in parsed.items() if k not in array_fields}

    if not array_fields:
        # No arrays β€” one document for the whole object
        docs.append(
            f"step:{step} source:response endpoint:{method} {path} "
            f"status:{status_code} data:{json.dumps(parsed, ensure_ascii=False)}"
        )
        return docs

    # 5. Has array fields β€” one document per array item, with parent context attached
    parent_context = (
        f"step:{step} source:response endpoint:{method} {path} status:{status_code} "
        + " ".join(f"{k}:{v}" for k, v in scalar_fields.items())
    )
    for field_name, items in array_fields.items():
        for item in items:
            # Flatten nested arrays within each item to strings (do not recurse further)
            flat_item = {}
            for k, v in item.items():
                flat_item[k] = json.dumps(v) if isinstance(v, (list, dict)) else v
            docs.append(
                f"{parent_context} list_field:{field_name} "
                f"item:{json.dumps(flat_item, ensure_ascii=False)}"
            )

    return docs

Key design principle: The parent context (step number, endpoint, HTTP status, scalar fields like total_count) is prepended to every child item document. When the model searches for "Radiant Tee product SKU", the returned document contains both name:Radiant Tee sku:MH01 and the context endpoint:GET /rest/V1/products step:2, so the model knows where this value came from and which step it appeared in.

Episode Index — Lifecycle

episode start  →  BM25 index initialized (empty)
                  GEMMA embedding store initialized (empty)
                        │
each curl_exec  →  build_index_documents() called
                   documents appended to BM25 corpus (BM25 index rebuilt, fast)
                   documents embedded via GEMMA and appended to embedding store
                        │
search_episode_data()  →  BM25 scores computed (keyword match)
                          GEMMA cosine similarity computed (semantic match)
                          hybrid ranking: alpha * BM25 + (1-alpha) * GEMMA
                          top-5 documents returned
                        │
episode end    →  both indexes discarded entirely
next episode   →  fresh indexes from scratch

Output Format

Returns a list of up to 5 strings, each being one indexed document. Example:

[
  "step:2 source:response endpoint:GET /rest/V1/products status:200 total_count:200 list_field:items item:{\"sku\": \"MH01\", \"name\": \"Radiant Tee\", \"price\": 22.0, \"type_id\": \"simple\"}",
  "step:2 source:response endpoint:GET /rest/V1/products status:200 total_count:200 list_field:items item:{\"sku\": \"MH03\", \"name\": \"Radiant Tee Long Sleeve\", \"price\": 28.0, \"type_id\": \"simple\"}"
]

The model reads sku: MH01 from the first result and uses it in the next curl call.

When to Call

  • A prior curl response was truncated (_list_truncated present in the response) and the model needs a specific item not shown in the 2-item sample.
  • The model needs a value from a prior step but cannot easily locate it by scanning history (many steps ago, or buried in a complex response).
  • There is no filter/search API for the data type (practical assumption: not all applications expose filtered listing endpoints for every resource).

Caveats

  • Only searches data from the current episode. Values from prior episodes are not accessible (each episode starts with an empty index).
  • Only finds data that was actually returned by a curl_exec call in this episode. If the relevant API has not been called yet, the data is not indexed.
  • The hybrid search handles both exact keywords ("MH01", "cart-abc123") and semantic paraphrases ("tee shirt price"). However, using the same terminology seen in the response still produces the best results.
  • For large lists (200+ items), all items are indexed. BM25 search is fast regardless of index size. GEMMA embedding of large responses adds 1-2 seconds of overhead per curl_exec call.

Tool 4: done(result?)

Interface

def done(result: str = None) -> None:
    """
    Signals that the model believes the task is complete.
    Triggers the judge to evaluate the episode against the ground truth catalog.
    Ends the episode.
    """

Behavior

  • Calling done() immediately ends the episode. No further tool calls are processed.
  • The optional result string is logged but does not affect the reward. The judge evaluates the live application state, not the model's self-report.
  • If the model calls done() and the task is not actually complete, the episode ends with -1.5 reward (timeout/failure outcome). The model should only call done() after the final curl_exec has returned a 2xx confirming the required state change.

How the Model Learns When to Call done()

There is no explicit "task complete" signal from the environment. The model learns when to call done() purely through the reward signal over many episodes:

  • Calling done() too early (before the task is actually complete) → judge finds the expected state change is missing → -1.5 reward. The model learns to avoid this.
  • Calling done() after a successful final API call (e.g., add-to-cart returns 2xx with item_id) → judge confirms the state change → +2.0 to +5.0 reward. The model learns that a 2xx response confirming the desired action is the right signal to call done().
  • Never calling done() (running out of steps) → episode times out → -1.5 reward. The model learns it must eventually commit.

The learned pattern is: after the final state-changing curl_exec returns a 2xx response whose body confirms the expected outcome (e.g., item_id present in add-to-cart, order_id present in checkout), call done(). This mirrors how a human developer knows an API call succeeded: you check the response.

Optional verification step: Before calling done(), the model can issue one more curl_exec to verify the state change (e.g., GET /rest/V1/guest-carts/{id} to confirm the item is in the cart). This costs one step but reduces the risk of premature done() calls. The model learns whether verification is worth the step cost through reward optimization.


Full Episode Example

Task: "Add 'Radiant Tee' to a guest cart at http://ec2-.../"

SYSTEM: ...

TASK: Add "Radiant Tee" to a guest cart at http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7770/

[session_state: {}]

STEP 1 ACTION: browser_agent("Add Radiant Tee to a guest cart", "http://ec2-...:7770/")
STEP 1 RESULT: {
  "app": "shopping",
  "endpoints": [
    {"method": "POST", "path": "/rest/V1/guest-carts",             "status": 200, "auth_observed": false},
    {"method": "GET",  "path": "/rest/V1/products",                "status": 200, "auth_observed": false},
    {"method": "GET",  "path": "/rest/V1/products/{id}",           "status": 200, "auth_observed": false},
    {"method": "POST", "path": "/rest/V1/guest-carts/{id}/items",  "status": 200, "auth_observed": false},
    ...
  ],
  "note": "These endpoints were observed in pre-recorded browser sessions..."
}

STEP 2 ACTION: search_endpoints("find product by name, get sku")
STEP 2 RESULT: [
  "endpoint: GET /rest/V1/products | auth: none | query: searchCriteria filters (field, value, condition_type) | returns: .items[].sku, .items[].name, .items[].price",
  ...
]

STEP 3 ACTION: curl_exec("curl 'http://ec2-.../rest/V1/products?searchCriteria[filter_groups][0][filters][0][field]=name&searchCriteria[filter_groups][0][filters][0][value]=Radiant+Tee'")
STEP 3 RESULT: {"status_code": 200, "body": {"items": [{"sku": "MH01", "name": "Radiant Tee", "price": 22.0}], "total_count": 1}}

STEP 4 ACTION: search_endpoints("create guest cart, get cart id")
STEP 4 RESULT: [
  "endpoint: POST /rest/V1/guest-carts | auth: none | no body | returns: cartId string (plain quoted string)",
  ...
]

STEP 5 ACTION: curl_exec("curl -X POST 'http://ec2-.../rest/V1/guest-carts' -H 'Content-Type: application/json'")
STEP 5 RESULT: {"status_code": 200, "body": "cart-abc123"}

STEP 6 ACTION: search_endpoints("add item to guest cart")
STEP 6 RESULT: [
  "endpoint: POST /rest/V1/guest-carts/{cartId}/items | auth: none | path: cartId (from POST /rest/V1/guest-carts) | body: cartItem.sku, cartItem.qty, cartItem.quote_id (same as cartId) | returns: item_id",
  ...
]

STEP 7 ACTION: curl_exec("curl -X POST 'http://ec2-.../rest/V1/guest-carts/cart-abc123/items' -H 'Content-Type: application/json' -d '{\"cartItem\":{\"sku\":\"MH01\",\"qty\":1,\"quote_id\":\"cart-abc123\"}}'")
STEP 7 RESULT: {"status_code": 200, "body": {"item_id": 5, "sku": "MH01", "qty": 1}}

STEP 8 ACTION: done("Radiant Tee (MH01) added to guest cart cart-abc123, item_id 5")

Embedding build (by browser_agent, once per episode)

The GEMMA embeddings for search_endpoints are built by browser_agent during Stage 4, not pre-built offline. Each episode starts fresh:

from sentence_transformers import SentenceTransformer
import numpy as np
import os

# google/embeddinggemma-300m requires accepting Google's license on HuggingFace.
# Set HF_TOKEN env variable to a token that has accepted the license.
# Uses encode_query / encode_document / similarity API from sentence-transformers.
# NOTE: activations do not support float16 — use float32 or bfloat16.
HF_TOKEN = os.environ.get("HF_TOKEN")  # must have accepted the license

def build_endpoint_embeddings(spec_entries: list[dict], app_name: str):
    """Called by browser_agent Stage 4 after extracting OpenAPI-like spec from HAR."""
    model = SentenceTransformer("google/embeddinggemma-300m", token=HF_TOKEN)
    chunks = [spec_entry_to_text(e, app_name) for e in spec_entries]
    # encode_document uses: "title: {endpoint} | text: {rest of chunk}"
    embeddings = model.encode_document(chunks, batch_size=32)
    # embeddings are returned normalized; dot product = cosine similarity
    return embeddings, chunks  # stored in env memory for search_endpoints()
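spec_entry_to_text is referenced above but not shown. A minimal sketch, assuming each spec entry carries method, path, params, and response-field lists (the key names here are illustrative, not the real HAR-extraction schema):

```python
def spec_entry_to_text(entry: dict, app_name: str) -> str:
    # Render one OpenAPI-like spec entry as a flat "key: value | ..." chunk,
    # matching the format search_endpoints() results use. The entry keys
    # ("method", "path", "params", "returns") are assumptions.
    parts = [f"endpoint: {entry['method']} {entry['path']}", f"app: {app_name}"]
    if entry.get("params"):
        parts.append("params: " + ", ".join(entry["params"]))
    if entry.get("returns"):
        parts.append("returns: " + ", ".join(entry["returns"]))
    return " | ".join(parts)
```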

Runtime search

def search_endpoints(query: str, embeddings, texts, model, top_k=3) -> list[str]:
    q_emb = model.encode_query(query)          # shape: (D,)
    # Use similarity metric specified by google/embeddinggemma-300m model card.
    # similarity() returns a torch.Tensor of shape (1, N); move it to NumPy
    # explicitly before ranking with np.argsort.
    scores = model.similarity(q_emb, embeddings)[0].cpu().numpy()  # shape: (N,)
    top_idx = np.argsort(scores)[::-1][:top_k]
    return [texts[int(i)] for i in top_idx]
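The ranking step itself is plain vector math and can be sanity-checked without loading the model. A toy example with hand-made 2-D unit vectors (placeholder texts, not real endpoint chunks):

```python
import numpy as np

def rank_chunks(q_emb: np.ndarray, embeddings: np.ndarray,
                texts: list[str], top_k: int = 3) -> list[str]:
    # Embeddings are L2-normalized, so dot product equals cosine similarity.
    scores = embeddings @ q_emb                # shape: (N,)
    top_idx = np.argsort(scores)[::-1][:top_k]
    return [texts[int(i)] for i in top_idx]

texts = ["create cart", "add item to cart", "search products"]
embeddings = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [0.7071, 0.7071]])
q_emb = np.array([0.0, 1.0])  # query points toward "add item to cart"
assert rank_chunks(q_emb, embeddings, texts, top_k=2) == ["add item to cart", "search products"]
```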

Index size per episode

Typical endpoint counts per app (after HAR filtering and deduplication):

  • Shopping (Magento REST): ~8–15 endpoints per HAR
  • Shopping Admin (Magento Admin AJAX + REST): ~5–10 endpoints per HAR
  • Forum (Postmill forms + REST): ~3–8 endpoints per HAR
  • OSM (Rails API + web): ~5–10 endpoints per HAR
  • Wikipedia (Kiwix): ~2 endpoints per HAR

Typical: ~5–15 endpoints Γ— 768 dims Γ— 4 bytes = negligible memory. Embedding time on GPU: <1 second per episode.
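The back-of-envelope arithmetic for the worst case:

```python
# Upper bound on the per-episode endpoint index.
endpoints, dims, bytes_per_float = 15, 768, 4  # float32 embeddings
index_bytes = endpoints * dims * bytes_per_float
assert index_bytes == 46_080  # ~45 KiB per episode
```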


Truncation Helper — Python Pseudocode

import json

TRUNCATE_LIST_AT = 2       # keep this many items from large arrays
LARGE_ARRAY_THRESHOLD = 3  # arrays with >= this many dicts are "large"
NONJSON_MAX_CHARS = 3000   # 3,000 chars — enough to capture hidden CSRF inputs in most HTML forms

def truncate_response_body(body: str, status_code: int) -> str:
    # Rule 3: never truncate errors
    if status_code >= 400:
        return body

    # Rule 1: non-JSON
    if not _is_json(body):
        if len(body) > NONJSON_MAX_CHARS:
            return body[:NONJSON_MAX_CHARS] + " [truncated — non-JSON response]"
        return body

    parsed = json.loads(body)

    # Rule 2: primitive
    if not isinstance(parsed, (dict, list)):
        return body

    # Rule 4/5: find large array fields
    if isinstance(parsed, list):
        if len(parsed) >= LARGE_ARRAY_THRESHOLD and isinstance(parsed[0], dict):
            result = parsed[:TRUNCATE_LIST_AT]
            note = {"_list_truncated": {
                "shown": TRUNCATE_LIST_AT,
                "total": len(parsed),
                "note": f"Showing {TRUNCATE_LIST_AT} of {len(parsed)} items. "
                        "Use search_episode_data() to find a specific item from this response."
            }}
            return json.dumps(result + [note])
        return body

    # parsed is a dict β€” check each value
    needs_truncation = {
        k for k, v in parsed.items()
        if isinstance(v, list) and len(v) >= LARGE_ARRAY_THRESHOLD
           and isinstance(v[0], dict)  # threshold >= 1, so v is non-empty here
    }
    if not needs_truncation:
        return body

    result = {}
    total_truncated = {}
    for k, v in parsed.items():
        if k in needs_truncation:
            result[k] = v[:TRUNCATE_LIST_AT]
            total_truncated[k] = len(v)
        else:
            result[k] = v

    result["_list_truncated"] = {
        "fields": total_truncated,
        "shown_per_field": TRUNCATE_LIST_AT,
        "note": (
            f"List fields truncated: "
            + ", ".join(f"{k} showing {TRUNCATE_LIST_AT}/{n}" for k, n in total_truncated.items())
            + ". Use search_episode_data() to find a specific item from this response."
        )
    }
    return json.dumps(result)


def _is_json(s: str) -> bool:
    try:
        json.loads(s)
        return True
    except (ValueError, TypeError):
        return False
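A self-contained check of the dict branch, condensed to its core (the sample response is hypothetical; the constants mirror TRUNCATE_LIST_AT and LARGE_ARRAY_THRESHOLD above):

```python
import json

TRUNCATE_LIST_AT = 2
LARGE_ARRAY_THRESHOLD = 3

def truncate_dict_lists(body: str) -> str:
    # Condensed, dict-only restatement of truncate_response_body's Rule 4/5.
    parsed = json.loads(body)
    large = {k for k, v in parsed.items()
             if isinstance(v, list) and len(v) >= LARGE_ARRAY_THRESHOLD
             and isinstance(v[0], dict)}
    if not large:
        return body
    result = {k: (v[:TRUNCATE_LIST_AT] if k in large else v)
              for k, v in parsed.items()}
    result["_list_truncated"] = {
        "fields": {k: len(parsed[k]) for k in large},
        "shown_per_field": TRUNCATE_LIST_AT,
    }
    return json.dumps(result)

sample = json.dumps({"items": [{"sku": f"MH{i:02d}"} for i in range(4)],
                     "total_count": 4})
out = json.loads(truncate_dict_lists(sample))
assert len(out["items"]) == 2          # list cut to TRUNCATE_LIST_AT entries
assert out["total_count"] == 4         # scalar fields pass through untouched
assert out["_list_truncated"]["fields"]["items"] == 4
```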

Tool Call Format in the Episode Prompt

The growing episode context uses this format for tool calls and results:

SYSTEM: You are an API agent. Your task is to complete the given task by calling the
available tools: browser_agent, search_endpoints, curl_exec, search_episode_data, done.
Complete the task using only HTTP calls to the application at the given URL.
When a response body is HTML, read hidden input fields directly from the markup to
extract CSRF tokens (pattern: <input type="hidden" name="_csrf_token" value="...">).
For form submissions, use Content-Type: application/x-www-form-urlencoded.

TASK: Add "Radiant Tee" to a guest cart at http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7770/

[session_state: {}]

STEP 1 ACTION: browser_agent("Add Radiant Tee to a guest cart", "http://ec2-...:7770/")
STEP 1 RESULT: {
  "app": "shopping",
  "endpoints": [
    "POST /rest/V1/guest-carts",
    "GET  /rest/V1/products",
    "GET  /rest/V1/products/{sku}",
    "POST /rest/V1/guest-carts/{id}/items",
    ...
  ],
  "note": "Use search_endpoints() to get full schema for any of these."
}

STEP 2 ACTION: search_endpoints("find product by name, get sku")
STEP 2 RESULT: [
  "endpoint: GET /rest/V1/products | auth: none | query: searchCriteria filters (field, value, condition_type) | returns: .items[].sku, .items[].name, .items[].price",
  ...
]

STEP 3 ACTION: curl_exec("curl 'http://ec2-.../rest/V1/products?searchCriteria[filter_groups][0][filters][0][field]=name&searchCriteria[filter_groups][0][filters][0][value]=Radiant+Tee'")
STEP 3 RESULT: {"status_code": 200, "body": {"items": [{"sku": "MH01", "name": "Radiant Tee", "price": 22.0}], "total_count": 1}}

STEP 4 ACTION: search_endpoints("create guest cart, get cart id")
STEP 4 RESULT: [
  "endpoint: POST /rest/V1/guest-carts | auth: none | no body | returns: cartId string (plain quoted string)",
  ...
]

STEP 5 ACTION: curl_exec("curl -X POST 'http://ec2-.../rest/V1/guest-carts' -H 'Content-Type: application/json'")
STEP 5 RESULT: {"status_code": 200, "body": "cart-abc123"}

STEP 6 ACTION: search_endpoints("add item to guest cart")
STEP 6 RESULT: [
  "endpoint: POST /rest/V1/guest-carts/{cartId}/items | auth: none | path: cartId (from POST /rest/V1/guest-carts) | body: cartItem.sku, cartItem.qty, cartItem.quote_id (same as cartId) | returns: item_id",
  ...
]

STEP 7 ACTION: curl_exec("curl -X POST 'http://ec2-.../rest/V1/guest-carts/cart-abc123/items' -H 'Content-Type: application/json' -d '{\"cartItem\":{\"sku\":\"MH01\",\"qty\":1,\"quote_id\":\"cart-abc123\"}}'")
STEP 7 RESULT: {"status_code": 200, "body": {"item_id": 5, "sku": "MH01", "qty": 1}}

STEP 8 ACTION: done("Radiant Tee (MH01) added to guest cart cart-abc123, item_id 5")

Value threading happens entirely through the multi-turn context. The model reads "MH01" from step 3's result and "cart-abc123" from step 5's result directly — no explicit store/retrieve tools needed.