kdcyberdude committed
Commit e6ce96e · verified · 1 Parent(s): f2c780a

Upload folder using huggingface_hub

Files changed (50)
  1. .gitattributes +2 -0
  2. BROWSER_AGENT.md +502 -0
  3. BUILD_NOTES.md +205 -0
  4. Dockerfile +69 -0
  5. GROUND_TRUTH_EXTRACTION.md +470 -0
  6. HAR_TASK_LIST.md +276 -0
  7. JUDGE.md +660 -0
  8. README.md +645 -3
  9. TOOLS.md +847 -0
  10. __init__.py +1 -0
  11. catalogs/forum.json +1517 -0
  12. catalogs/osm.json +0 -0
  13. catalogs/shopping.json +0 -0
  14. catalogs/shopping_admin.json +0 -0
  15. catalogs/wikipedia.json +48 -0
  16. client.py +59 -0
  17. hars/forum.har +0 -0
  18. hars/shopping.har +3 -0
  19. hars/shopping_admin.har +3 -0
  20. hars/wikipedia.har +0 -0
  21. inference.py +375 -0
  22. models.py +14 -0
  23. openenv.yaml +6 -0
  24. openenv_harvestgym.egg-info/PKG-INFO +18 -0
  25. openenv_harvestgym.egg-info/SOURCES.txt +20 -0
  26. openenv_harvestgym.egg-info/dependency_links.txt +1 -0
  27. openenv_harvestgym.egg-info/entry_points.txt +2 -0
  28. openenv_harvestgym.egg-info/requires.txt +14 -0
  29. openenv_harvestgym.egg-info/top_level.txt +1 -0
  30. parameter_pools.json +1090 -0
  31. pyproject.toml +37 -0
  32. scripts/build_parameter_pools.py +364 -0
  33. server/__init__.py +0 -0
  34. server/app.py +49 -0
  35. server/episode.py +53 -0
  36. server/judge.py +691 -0
  37. server/models.py +517 -0
  38. server/tools/__init__.py +0 -0
  39. server/tools/browser_agent.py +418 -0
  40. server/tools/curl_exec.py +434 -0
  41. server/tools/search_endpoints.py +93 -0
  42. server/tools/search_episode_data.py +87 -0
  43. tests/mock_data/mock_catalog.json +88 -0
  44. tests/mock_data/mock_har.json +170 -0
  45. tests/test_e2e_episode.py +272 -0
  46. tests/test_real_har.py +93 -0
  47. tests/tool_browser_agent.py +327 -0
  48. tests/tool_curl_exec.py +442 -0
  49. tests/tool_search_endpoints.py +239 -0
  50. tests/tool_search_episode_data.py +273 -0
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+hars/shopping.har filter=lfs diff=lfs merge=lfs -text
+hars/shopping_admin.har filter=lfs diff=lfs merge=lfs -text
BROWSER_AGENT.md ADDED
@@ -0,0 +1,502 @@
# Browser Agent Component

This document describes the browser agent tool used by the HARvestGym RL agent — how it works, how to build it, and how it integrates with the environment.

---

## What It Is

The browser agent is a multi-stage tool the RL agent calls at the start of every episode. Given a natural language task and a URL, it:

1. **Checks if a pre-recorded HAR file exists** for this application
2. If a HAR exists → loads it directly (no browser launched)
3. If no HAR → **launches a real browser** (Chromium via Playwright), connects an LLM, performs the task, and records all network traffic as a HAR file
4. **Processes the HAR** (from either source) to extract an OpenAPI-like spec
5. **Builds GEMMA embeddings** over the extracted spec so `search_endpoints()` can do semantic search
6. **Returns a summary** — the list of API endpoint names and HTTP methods only

The browser agent is a script that orchestrates multiple processing stages. The RL agent sees only the final summary output — a list of endpoints like `GET /products`, `POST /guest-carts`. No headers, no body schemas, no parameter details. To get full details about any endpoint, the agent calls `search_endpoints()` with a natural language query — this searches the GEMMA embeddings built during the browser agent's processing stage.

---

## Library: browser-use

**Repository:** [browser-use/browser-use](https://github.com/browser-use/browser-use)
**Stars:** 86k+ (April 2026)
**License:** MIT
**Language:** Python 3.11+

`browser-use` connects any LLM to a Playwright-controlled browser. The LLM receives the page state (DOM, screenshot, or both), decides on an action (click, type, navigate, extract), and `browser-use` executes it. It uses a sense-plan-act loop with built-in error handling.

Install:

```bash
pip install browser-use
playwright install chromium
```

---

## How It Works: Full Pipeline

### Stage 1 — Obtain HAR Data

The browser agent first checks whether a pre-recorded HAR file exists for the target application. If it does, the browser is never launched — this saves 30–120 seconds per episode.

```python
import json, os

HAR_MAP = {
    ":7770": "hars/shopping.har",
    ":7780": "hars/shopping_admin.har",
    ":9999": "hars/forum.har",
    ":3000": "hars/osm.har",
    ":8888": "hars/wikipedia.har",
}

def resolve_har_path(url: str) -> str | None:
    """Check if a pre-recorded HAR exists for this app."""
    for port_key, path in HAR_MAP.items():
        if port_key in url and os.path.exists(path):
            return path
    return None


async def get_har_data(task: str, url: str, llm_model: str) -> dict:
    """
    Stage 1: Get HAR data — from file if available, from a live browser otherwise.
    Returns the parsed HAR JSON.
    """
    har_path = resolve_har_path(url)

    if har_path:
        # HAR exists — load directly, no browser needed
        with open(har_path) as f:
            return json.load(f)

    # No HAR — run a live browser session and capture traffic
    raw_log = await run_browser_agent_live(task, url, llm_model)
    return convert_raw_log_to_har(raw_log)
```

### Stage 2 — Live Browser Session (only if no HAR exists)

When no pre-recorded HAR is available, the browser agent launches a real Chromium browser, connects the LLM, and performs the task while intercepting all network traffic:

```python
from playwright.async_api import async_playwright
from browser_use import Agent
from langchain_openai import ChatOpenAI

async def run_browser_agent_live(task: str, url: str, llm_model: str) -> list[dict]:
    """
    Runs the browser-use agent on the given task, intercepts all network traffic,
    and returns the raw request/response log.
    """
    requests_log = []

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()

        # Attach network interceptors
        async def on_request(request):
            requests_log.append({
                "type": "request",
                "url": request.url,
                "method": request.method,
                "headers": dict(request.headers),
                "post_data": request.post_data,
            })

        async def on_response(response):
            try:
                body = await response.text()
            except Exception:
                body = None
            requests_log.append({
                "type": "response",
                "url": response.url,
                "status": response.status,
                "headers": dict(response.headers),
                "body": body,
            })

        page.on("request", on_request)
        page.on("response", on_response)

        # Navigate to the app first
        await page.goto(url)

        # Run the browser agent
        llm = ChatOpenAI(model=llm_model, base_url="https://router.huggingface.co/v1")
        agent = Agent(task=task, llm=llm, page=page)
        await agent.run()

        await browser.close()

    return requests_log
```
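
The `convert_raw_log_to_har()` helper referenced in Stage 1 is not shown above. A minimal sketch of what it could look like, under the assumption that it only needs to produce the field subset Stage 3 reads (method, url, header name/value pairs, status, body) — pairing each response with the oldest unmatched request for the same URL:

```python
def convert_raw_log_to_har(raw_log: list[dict]) -> dict:
    """Pair raw request/response records into minimal HAR-style entries.

    Sketch only: matches each response to the oldest unmatched request
    with the same URL and emits just the fields the Stage 3 extractor
    reads. A real HAR 1.2 file carries many more fields (timings, cookies).
    """
    pending: dict[str, list[dict]] = {}  # url -> queue of unmatched requests
    entries = []

    for record in raw_log:
        if record["type"] == "request":
            pending.setdefault(record["url"], []).append(record)
        else:  # response
            queue = pending.get(record["url"], [])
            req = queue.pop(0) if queue else {
                "url": record["url"], "method": "GET", "headers": {}, "post_data": None,
            }
            entries.append({
                "request": {
                    "url": req["url"],
                    "method": req["method"],
                    "headers": [{"name": k, "value": v} for k, v in req["headers"].items()],
                    "postData": {"text": req["post_data"]} if req.get("post_data") else None,
                },
                "response": {
                    "status": record["status"],
                    "headers": [{"name": k, "value": v} for k, v in record["headers"].items()],
                    "content": {"text": record["body"]},
                },
            })

    return {"log": {"version": "1.2", "entries": entries}}
```

URL-based pairing is a simplification; Playwright's `response.request` property would give an exact match if the raw log kept object references.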

### Stage 3 — Filter and Extract OpenAPI-like Spec

The HAR data (from either source) contains everything: fonts, analytics, CDN requests, JS bundles, CSS. The browser agent filters this down and extracts a structured OpenAPI-like specification:

```python
from urllib.parse import urlparse

SKIP_EXTENSIONS = {".css", ".png", ".jpg", ".svg", ".ico", ".woff", ".woff2", ".ttf", ".gif"}
SKIP_DOMAINS = {"google-analytics.com", "doubleclick.net", "cloudflare.com", "cdn.", "fonts.googleapis.com"}
SKIP_PATH_PREFIXES = ["/static/", "/media/", "/_next/", "/assets/", "/__webpack"]

def is_application_api_call(url: str, app_base_url: str) -> bool:
    parsed = urlparse(url)
    app_host = urlparse(app_base_url).netloc

    # Only keep same-origin traffic (this also drops all SKIP_DOMAINS hosts)
    if parsed.netloc != app_host:
        return False

    path = parsed.path.lower()
    for ext in SKIP_EXTENSIONS:
        if path.endswith(ext):
            return False
    for prefix in SKIP_PATH_PREFIXES:
        if path.startswith(prefix):
            return False

    return True


def extract_openapi_spec(har_data: dict, app_base_url: str) -> list[dict]:
    """
    Stage 3: Process HAR entries into an OpenAPI-like spec.
    Each entry becomes a structured endpoint document with method, path,
    query params, request body schema, response body schema, status codes, and auth info.
    """
    entries = har_data["log"]["entries"]
    seen = set()
    spec_entries = []

    for entry in entries:
        req = entry["request"]
        resp = entry["response"]
        raw_url = req["url"]
        method = req["method"]

        # Filter non-API traffic
        if not is_application_api_call(raw_url, app_base_url):
            continue

        # Skip HTML page navigations
        content_type = _get_response_content_type(resp)
        if "text/html" in content_type and method == "GET":
            continue

        # Normalise path: replace IDs with {id}
        parsed = urlparse(raw_url)
        path = _normalise_path(parsed.path)

        # Deduplicate by (method, normalised_path)
        key = f"{method} {path}"
        if key in seen:
            continue
        seen.add(key)

        # Extract auth info
        has_auth = any(
            h["name"].lower() in ("authorization", "x-api-key", "cookie")
            for h in req["headers"]
        )

        # Build endpoint spec document
        spec_entry = {
            "method": method,
            "path": path,
            "query_params": parsed.query or None,
            "request_headers": {h["name"]: h["value"] for h in req["headers"]
                                if h["name"].lower() in ("content-type", "authorization", "x-requested-with")},
            "request_body": _extract_body(req),
            "status_code": resp["status"],
            "response_content_type": content_type,
            "response_body_sample": _truncate_body(resp),
            "auth_observed": has_auth,
        }
        spec_entries.append(spec_entry)

    return spec_entries
```

### Stage 4 — Build GEMMA Embeddings for Search

The extracted spec entries are converted to text documents and embedded using GEMMA embeddings. These embeddings power the `search_endpoints()` tool — when the RL agent queries "how to add item to cart", the semantic search finds the matching endpoint spec.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

def build_endpoint_embeddings(spec_entries: list[dict], app_name: str) -> tuple[np.ndarray, list[str]]:
    """
    Stage 4: Convert spec entries to text chunks and build GEMMA embeddings.
    These embeddings are stored in memory for the duration of the episode,
    enabling search_endpoints() to do semantic search.
    """
    model = SentenceTransformer("google/embeddinggemma-300m")

    chunks = [spec_entry_to_text(entry, app_name) for entry in spec_entries]

    # GEMMA encode_document: "title: {endpoint} | text: {rest of chunk}"
    embeddings = model.encode_document(chunks, batch_size=32)
    # Score queries against these using the similarity metric from the
    # google/embeddinggemma-300m model card

    return embeddings, chunks


def spec_entry_to_text(entry: dict, app_name: str) -> str:
    """Convert a single spec entry to a searchable text document."""
    parts = [
        f"app: {app_name}",
        f"endpoint: {entry['method']} {entry['path']}",
        f"status: {entry['status_code']}",
        f"auth: {'required' if entry['auth_observed'] else 'none'}",
    ]
    if entry.get("query_params"):
        parts.append(f"query: {entry['query_params']}")
    if entry.get("request_body"):
        parts.append(f"body: {entry['request_body']}")
    if entry.get("response_body_sample"):
        parts.append(f"response_sample: {entry['response_body_sample']}")
    return " | ".join(parts)
```

### Stage 5 — Return Summary to RL Agent

The browser agent returns **only a summary** — endpoint names and HTTP methods. No headers, no body schemas, no parameter details. The agent must call `search_endpoints()` to get the full details.

```python
def build_browser_agent_output(spec_entries: list[dict], app_name: str) -> dict:
    """
    Stage 5: Build the summary output returned to the RL agent.
    This is intentionally sparse — just endpoint names and methods.
    """
    summary_endpoints = [
        {"method": entry["method"], "path": entry["path"]}
        for entry in spec_entries
    ]

    return {
        "app": app_name,
        "endpoints": summary_endpoints,
        "total_endpoints": len(summary_endpoints),
        "note": (
            "These endpoints were observed for this application. "
            "Use search_endpoints() with a natural language query to get "
            "the full schema, parameters, and auth details for any endpoint."
        ),
    }
```

### Full Orchestration

```python
async def browser_agent(task: str, url: str) -> dict:
    """
    Complete browser agent pipeline:
    1. Get HAR data (from file or live browser)
    2. Filter and extract an OpenAPI-like spec
    3. Build GEMMA embeddings for search_endpoints()
    4. Return the summary endpoint list to the RL agent
    """
    app_name = resolve_app_name(url)
    llm_model = "browser-use/bu-30b-a3b-preview"

    # Stages 1-2: Get HAR data
    har_data = await get_har_data(task, url, llm_model)

    # Stage 3: Extract OpenAPI-like spec
    spec_entries = extract_openapi_spec(har_data, url)

    # Stage 4: Build GEMMA embeddings (stored in the environment for search_endpoints)
    embeddings, chunks = build_endpoint_embeddings(spec_entries, app_name)
    store_episode_embeddings(app_name, embeddings, chunks)  # makes search_endpoints() work

    # Stage 5: Return summary to RL agent
    return build_browser_agent_output(spec_entries, app_name)
```
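
`resolve_app_name()` is referenced above but not defined. A minimal sketch, assuming the same port-to-app convention as `HAR_MAP` in Stage 1 (the `"unknown"` fallback is an assumption):

```python
APP_BY_PORT = {
    ":7770": "shopping",
    ":7780": "shopping_admin",
    ":9999": "forum",
    ":3000": "osm",
    ":8888": "wikipedia",
}

def resolve_app_name(url: str) -> str:
    """Map a task URL to its application name by port (mirrors HAR_MAP)."""
    for port_key, app_name in APP_BY_PORT.items():
        if port_key in url:
            return app_name
    return "unknown"
```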

---

## Output Example

What the RL agent sees (summary only — no schemas, no headers, no body details):

```json
{
  "app": "shopping",
  "endpoints": [
    {"method": "POST", "path": "/rest/V1/integration/customer/token"},
    {"method": "GET", "path": "/rest/V1/products"},
    {"method": "GET", "path": "/rest/V1/products/{id}"},
    {"method": "POST", "path": "/rest/V1/guest-carts"},
    {"method": "POST", "path": "/rest/V1/guest-carts/{id}/items"},
    {"method": "GET", "path": "/rest/V1/guest-carts/{id}/totals"},
    {"method": "POST", "path": "/rest/V1/guest-carts/{id}/order"},
    {"method": "GET", "path": "/rest/V1/categories"}
  ],
  "total_endpoints": 8,
  "note": "These endpoints were observed for this application. Use search_endpoints() with a natural language query to get the full schema, parameters, and auth details for any endpoint."
}
```

To get full details, the agent calls:

```
search_endpoints("add item to guest cart")
→ returns the full schema: POST /rest/V1/guest-carts/{cartId}/items, body params, auth, response fields
```

---

## How search_endpoints() Uses the Embeddings

The GEMMA embeddings built in Stage 4 are what power `search_endpoints()`. When the RL agent calls `search_endpoints("create guest cart")`:

1. The query is encoded using GEMMA `encode_query`
2. Similarity is computed against all endpoint embeddings
3. The top-3 matching endpoint spec documents are returned with full details

```python
import numpy as np

def search_endpoints(query: str, embeddings, texts, model, top_k=3) -> list[str]:
    q_emb = model.encode_query(query)  # shape: (D,)
    # Use the similarity metric specified by the google/embeddinggemma-300m model card
    scores = np.asarray(model.similarity(q_emb, embeddings).squeeze(0))  # shape: (N,)
    top_idx = np.argsort(scores)[::-1][:top_k]
    return [texts[i] for i in top_idx]
```

The endpoint documents returned by search contain the full extracted spec — method, path, query params, request body structure, response samples, auth requirements. This is the detailed view that complements the summary list from `browser_agent`.

---

## LLM Choice for Browser Agent

We use **[`browser-use/bu-30b-a3b-preview`](https://huggingface.co/browser-use/bu-30b-a3b-preview)** — a model purpose-built and fine-tuned specifically for browser-use tasks.

| Property | Value |
|----------|-------|
| **Base model** | Qwen3-VL-30B-A3B-Instruct |
| **Architecture** | Vision-Language MoE (Mixture of Experts) |
| **Total parameters** | 30B |
| **Active parameters** | 3B (MoE — only 3B fire per forward pass) |
| **Context length** | 65,536 tokens |
| **Specialization** | DOM understanding + visual reasoning for web tasks |

This model is designed to be served with vLLM and integrates directly with the `browser-use` library via its `ChatOpenAI`-compatible interface:

```python
from browser_use import Agent, ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",  # vLLM server
    model="browser-use/bu-30b-a3b-preview",
    temperature=0.6,
    top_p=0.95,
    dont_force_structured_output=True,  # speeds up inference
)

agent = Agent(task=task, llm=llm)
agent.run_sync()
```

Serve with vLLM:

```bash
vllm serve browser-use/bu-30b-a3b-preview \
  --max-model-len 65536 \
  --host 0.0.0.0 \
  --port 8000
```

Because only 3B parameters are active per forward pass (MoE), this model is fast enough for deployment without requiring a full large-model GPU allocation.

---

## Training vs. Inference: What Changes

```
                  Training                           Inference
                  │                                  │
browser_agent     │ HAR file exists → loads from     │ HAR file may not exist
Stage 1           │ disk, no browser launched        │ → launches live browser session
                  │                                  │ → records traffic as HAR
                  │                                  │
browser_agent     │ Processes HAR → extracts spec    │ Same processing pipeline
Stages 3-5        │ → builds GEMMA embeddings        │ on the live-captured traffic
                  │ → returns summary                │
                  │                                  │
curl_exec         │ hits REAL live server            │ hits REAL live server
calls             │ (WebArena EC2)                   │ (WebArena EC2)
                  │                                  │
judge             │ probes REAL live server          │ probes REAL live server
verification      │ to verify task completion        │ to verify task completion
```

**What changes between training and inference:** only Stage 1 — where the HAR data comes from. During training, pre-recorded HAR files exist for all tasks, so the browser is never launched. At inference, the HAR may not exist for novel tasks, so the browser runs live.

**What never changes:** Stages 3-5 (spec extraction, embedding, summary output) run identically regardless of the HAR source. And `curl_exec` always hits the real live server — no responses are ever mocked.

---

## Integration with the Environment

```
RL Environment (FastAPI server)
│
├── receives Action: {tool: "browser_agent", input: {task, url}}
│
├── Stage 1: HAR file exists?
│     ├── YES → load HAR from disk (~0ms)
│     └── NO  → spawn live browser session (30-120s)
│            ├── Playwright + bu-30b-a3b-preview
│            ├── intercept all HTTP traffic
│            └── produce HAR data
│
├── Stage 3: Extract OpenAPI-like spec from HAR
│
├── Stage 4: Build GEMMA embeddings → stored in env for search_endpoints()
│
└── Stage 5: Return summary endpoint list as Observation.last_tool_result
       │ (agent now knows WHAT endpoints exist, but not HOW to call them)

search_endpoints("natural language query")
   → semantic search over GEMMA embeddings
   → returns full endpoint schema with params, auth, response fields

curl_exec("curl -X POST ...")
   → executes against real live WebArena server (EC2)
   → indexes full response into episode BM25 store

search_episode_data("keyword query")
   → BM25 search over indexed responses from this episode

done() → judge evaluates against ground truth
```

---

## Reference Tools

- [browser-use GitHub](https://github.com/browser-use/browser-use) — the core library
- [browser-use docs](https://docs.browser-use.com) — configuration, custom actions, LLM setup
- [Playwright network events](https://playwright.dev/python/docs/network) — request/response interception API
- [har-to-openapi](https://github.com/jonluca/har-to-openapi) — alternative: convert HAR files to OpenAPI spec format
- [jsluice](https://github.com/BishopFox/jsluice) — extract API routes from JavaScript bundles (useful supplement to network interception) — future scope
BUILD_NOTES.md ADDED
@@ -0,0 +1,205 @@
# HARvestGym — Build Notes & Deferred Items

This file captures caveats, deferred implementation decisions, and things to keep in mind during the building phase. It is not a specification — it is a living checklist.

---

## Critical Build-Time Checklist

### 1. `google/embeddinggemma-300m` — License Acceptance Required

**Status:** Deferred to build time
**Action needed:** Accept the Google license for `google/embeddinggemma-300m` at https://huggingface.co/google/embeddinggemma-300m while logged in to Hugging Face. Then ensure `HF_TOKEN` is set in the environment before running any embedding code.

```bash
export HF_TOKEN=hf_...  # must have accepted the google/embeddinggemma-300m license
```

The model also requires `float32` or `bfloat16` — **not `float16`**. If you see activation errors, check the dtype:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m", token=HF_TOKEN)
# Default dtype is float32; explicitly set bfloat16 if on GPU:
# model = SentenceTransformer("google/embeddinggemma-300m", token=HF_TOKEN,
#                             model_kwargs={"torch_dtype": "bfloat16"})
```

---

### 2. Judge Verification — Trajectory-Based, No External Token Needed

**Status:** Resolved in design
**Detail:** The judge does **not** need a pre-set admin token or an outbound probe call to verify task completion. Verification is done by inspecting the episode trajectory already available in the environment.

**Approach:**
- The judge reads the `curl_exec` request/response history from the current episode (already stored in `episode.steps`).
- If the final state-changing call returned a 2xx response with the expected payload (e.g., `item_id` in add-to-cart, `order_id` in checkout, `post` in the forum response body), that response **is** the ground truth — the web server confirmed it.
- No re-probe is needed: the application already validated the request and returned success. The environment trusts that a 2xx from the live server is accurate.

**When a live probe is still used:** Template 3 (add-to-cart) and Template 7 (product creation) optionally re-fetch the resource (cart contents, product by SKU) to double-check state. These probes use the admin credentials the RL agent itself obtained during the episode (extracted from `session_state`), not a pre-configured environment token. If the agent did not authenticate (e.g., it tried to create a product without admin auth), the probe will 401 — which correctly scores the episode as failed.

**Implementation note for the judge:**
```python
# Prefer: check the response body from the agent's own curl calls
for step in episode.steps:
    if step.curl_parsed and step.curl_parsed.status_code == 200:
        body = step.curl_parsed.response_body
        # e.g. Template 3: look for item_id in the add-to-cart response
        if isinstance(body, dict) and "item_id" in body:
            return 1.0

# Fallback live probe (optional, uses the agent's own session token from the episode):
admin_token = _extract_admin_token(episode)  # from the agent's auth step
if admin_token:
    product = _judge_probe(f"GET /rest/V1/products/{sku}", base_url,
                           headers={"Authorization": f"Bearer {admin_token}"})
```

---

### 3. Forum (Postmill) — CSRF Token Position in HTML

**Status:** Handled in design; verify at build time
**Detail:** The HTML truncation limit was raised to 3,000 characters specifically to capture hidden `<input type="hidden" name="_csrf_token">` fields. However, on some Postmill routes, the CSRF token appears after the main nav and the full form body. At build time, test the actual login page HTML to confirm the token appears within the first 3,000 characters.

```bash
curl -s 'http://ec2-...:9999/login' | head -c 3000 | grep _csrf_token
```

If the token is not captured, either:
- Increase `NONJSON_MAX_CHARS` further in `tools/curl_exec.py`
- Or rely on `search_episode_data("_csrf_token")` — the full HTML is indexed before truncation, so the token is always retrievable by keyword search regardless of position.

---

### 4. Wikipedia — HTML Wrapping

**Status:** Designed; implement in `curl_exec`
**Detail:** Wikipedia (Kiwix) returns HTML. The environment wraps all non-JSON responses in a uniform JSON envelope `{status_code, headers, body}` before returning them to the model. This wrapping is already part of `curl_exec`'s response structure (the `body` field is always a string for non-JSON content). No additional wrapping is needed — just ensure the system prompt tells the model to expect HTML strings in `body` for Wikipedia URLs.
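
As a sanity check, the envelope described above can be sketched as follows — a minimal sketch, assuming a helper name and truncation behavior that are illustrative, not the shipped `curl_exec` code:

```python
import json

NONJSON_MAX_CHARS = 3000  # illustrative; see the CSRF note above for the real constant

def wrap_response(status_code: int, headers: dict, raw_body: str) -> dict:
    """Uniform envelope for curl_exec results: JSON bodies stay structured,
    everything else (HTML, plain text) becomes a truncated string."""
    content_type = headers.get("content-type", "")
    if "application/json" in content_type:
        try:
            body = json.loads(raw_body)
        except json.JSONDecodeError:
            body = raw_body[:NONJSON_MAX_CHARS]
    else:
        body = raw_body[:NONJSON_MAX_CHARS]
    return {"status_code": status_code, "headers": headers, "body": body}
```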

---

### 5. Browser Agent — Deferred Live Implementation

**Status:** Deferred
**Detail:** During training, `browser_agent` always loads from pre-recorded HAR files. The live browser agent (using Playwright + `browser-use/bu-30b-a3b-preview` served as a local service) is NOT needed for the initial training run.

At inference time, the live browser agent will be called as a separate service. The interface contract is:

```
# The environment connects to the browser agent service via HTTP:
POST http://browser-agent-service/run
{"task": "...", "url": "..."}
→ {"app": "...", "endpoints": [...], "note": "..."}
```

Implementation details are in `BROWSER_AGENT.md`. Skip for now — HAR files cover the full training set.

---

### 6. OSM (Map Application) — Not In Initial Training Scope

**Status:** Intentionally excluded
**Detail:** The OpenStreetMap application (port 3000) has ground truth catalogs and HAR recording tasks defined, but **no RL task templates target it in the initial training run**. The OSM artifacts are not needed for the first training loop.

Do not spend time on OSM tasks until the 7 current templates are training successfully.

---

### 7. `max_steps` Is 20, Not 12

**Status:** Updated in README and observation space
**Reminder:** All code that initializes episodes must use `max_steps=20`. Search for any hardcoded `12` in the codebase before the first training run:

```bash
grep -r "max_steps.*12\|12.*max_steps" --include="*.py" .
```

---
119
+ ### 8. GRPO Training Configuration
120
+
121
**Status:** Specified — follow the EcomRLVE-Gym pattern
**Reference:** `winner_projects_last_ht/EcomRLVE-Gym/scripts/train_openenv.py` and `src/ecom_rlve/training/grpo.py`

**Stack:** Unsloth + TRL `GRPOTrainer`. The training script structure from EcomRLVE-Gym maps directly onto HARvestGym — replace the environment wrapper and reward functions, keep the training scaffolding.

**Policy model:** `Qwen/Qwen3-1.7B` with 4-bit quantization via Unsloth. LoRA rank 16, targeting `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`.

**Key configuration values (from EcomRLVE-Gym, adapted for HARvestGym):**

```python
from trl import GRPOConfig, GRPOTrainer
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-1.7B",
    max_seq_length=8192,        # must fit 20-step Hard task episodes
    load_in_4bit=True,
    fast_inference=True,        # vLLM-backed fast generation for GRPO rollouts
    max_lora_rank=16,
    gpu_memory_utilization=0.6,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

training_args = GRPOConfig(
    num_generations=4,          # G=4 rollouts per prompt (EcomRLVE default; bump to 8 if VRAM allows)
    temperature=0.7,
    max_prompt_length=4096,     # auto-detect from dataset sample + 20% headroom
    max_completion_length=512,  # one tool call per step; curl commands are short
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    optim="adamw_8bit",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    max_steps=300,
    bf16=True,                  # use bfloat16 on Ampere+ GPUs
    output_dir="outputs/harvestgym_grpo",
)
```

**Reward functions passed to GRPOTrainer (three, like EcomRLVE-Gym):**
1. `format_reward` — does the output parse as a valid tool call? (`+1.0` / `-2.0`)
2. `tool_usage_reward` — is the tool name valid and arguments well-formed? (`+1.0` / `-0.5`)
3. `env_reward` — environment scalar reward from the judge, scaled ×5 to dominate (`-7.5` to `+25.0`)
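
The three rewards above can be sketched as callables with the TRL reward-function signature (a list of completion strings in, a list of floats out). The JSON tool-call format, the tool-name set, and the way the environment scalar is threaded in via `kwargs` are illustrative assumptions here, not the EcomRLVE-Gym implementation:

```python
import json
import re

TOOLS = {"curl_exec", "search_endpoints", "search_episode_data", "browser_agent"}

def _parse_tool_call(text: str):
    """Extract a JSON tool call from a completion; None if malformed (assumed format)."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        return None
    try:
        call = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return call if isinstance(call, dict) and "tool" in call else None

def format_reward(completions, **kwargs):
    # +1.0 when the output parses as a tool call, -2.0 otherwise
    return [1.0 if _parse_tool_call(c) else -2.0 for c in completions]

def tool_usage_reward(completions, **kwargs):
    # +1.0 for a known tool with dict-shaped arguments, -0.5 otherwise
    rewards = []
    for c in completions:
        call = _parse_tool_call(c)
        ok = call is not None and call.get("tool") in TOOLS and isinstance(call.get("args"), dict)
        rewards.append(1.0 if ok else -0.5)
    return rewards

def env_reward(completions, env_rewards=None, **kwargs):
    # scale the judge's scalar x5 so the terminal signal dominates the shaping terms
    env_rewards = env_rewards or [0.0] * len(completions)
    return [5.0 * r for r in env_rewards]
```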

**Curriculum:** Start all episodes on Template 1 (Easy). Introduce Medium templates when Easy success rate > 70%. Introduce Hard templates when Medium success rate > 60%.
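
A minimal sketch of that gating logic, assuming a rolling success window (the window size of 50 is an assumption, the thresholds are from the text above):

```python
from collections import deque

class Curriculum:
    """Unlock template difficulty based on rolling success rates."""

    def __init__(self, window: int = 50):
        self.results = {"easy": deque(maxlen=window), "medium": deque(maxlen=window)}

    def record(self, difficulty: str, success: bool) -> None:
        if difficulty in self.results:
            self.results[difficulty].append(success)

    def _rate(self, difficulty: str) -> float:
        r = self.results[difficulty]
        return sum(r) / len(r) if r else 0.0

    def unlocked(self) -> list[str]:
        levels = ["easy"]
        if self._rate("easy") > 0.70:
            levels.append("medium")
            if self._rate("medium") > 0.60:
                levels.append("hard")
        return levels
```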

**KL coefficient:** Start at `0.01`. If the model diverges from pretrained behavior rapidly (reward collapses after initial improvement), reduce to `0.005`.

---

### 9. System Prompt — Form vs JSON Guidance

**Status:** Designed in README; implement at build time
**Detail:** The system prompt must include explicit instructions on when to use `Content-Type: application/x-www-form-urlencoded` vs `application/json`. Specifically:

```
For Postmill (Forum, port 9999): use form-encoded for login and post creation.
For Magento REST (Shopping/Admin, ports 7770/7780): use application/json.
For Wikipedia (port 8888): GET requests only, no Content-Type needed.
When in doubt: check the endpoint schema returned by search_endpoints() — it specifies the expected Content-Type.
```

---

## Non-Issues (Resolved in Design)

- ~~`store_finding` / `get_findings` tools~~ — **Removed**. Value threading happens through episode `history`.
- ~~`google/embeddinggemma-300m` doesn't exist~~ — **Confirmed real**. Uses `sentence-transformers` with `encode_query`/`encode_document`/`similarity`. Requires HF_TOKEN.
- ~~12 steps too few~~ — **Fixed to 20**.
- ~~Reward signal rewards busy episodes~~ — **Addressed** via curriculum learning + terminal reward dominance design. See README reward section.
- ~~Wikipedia task unwinnable~~ — **Resolved**: check for HTTP 200 + correct URL, not JSON content.
- ~~Forum CSRF handling~~ — **Resolved**: 3,000-char HTML truncation + `search_episode_data` fallback. No dedicated tool needed.
- ~~JUDGE_ADMIN_TOKEN expiry risk~~ — **Resolved**: judge reads trajectory response bodies directly; uses agent's own session token for optional probes only.
- ~~Concurrent episode isolation~~ — **Not needed**: multi-turn retry handles errors; no episode ID embedding required.
- ~~Parameter pool drift~~ — **Not a concern**: no training tasks involve deletion or reorganization; graders compare against expected values, not absolute DB state.
Dockerfile ADDED
@@ -0,0 +1,69 @@
# HARvestGym — OpenEnv Environment
# Multi-stage build using openenv-base

ARG BASE_IMAGE=ghcr.io/meta-pytorch/openenv-base:latest
FROM ${BASE_IMAGE} AS builder

WORKDIR /app

# Install git (for VCS dependencies)
RUN apt-get update && \
    apt-get install -y --no-install-recommends git curl && \
    rm -rf /var/lib/apt/lists/*

# Build mode
ARG BUILD_MODE=in-repo
ARG ENV_NAME=HARvestGym

# Copy the entire project
COPY . /app/env

WORKDIR /app/env

# Ensure uv is available
RUN if ! command -v uv >/dev/null 2>&1; then \
        curl -LsSf https://astral.sh/uv/install.sh | sh && \
        mv /root/.local/bin/uv /usr/local/bin/uv && \
        mv /root/.local/bin/uvx /usr/local/bin/uvx 2>/dev/null || true; \
    fi

# Install dependencies
RUN --mount=type=cache,target=/root/.cache/uv \
    if [ -f uv.lock ]; then \
        uv sync --frozen --no-install-project --no-editable; \
    else \
        uv sync --no-install-project --no-editable; \
    fi

RUN --mount=type=cache,target=/root/.cache/uv \
    if [ -f uv.lock ]; then \
        uv sync --frozen --no-editable; \
    else \
        uv sync --no-editable; \
    fi

# Final stage
FROM ${BASE_IMAGE}

WORKDIR /app

# Enable Gradio web interface
ENV ENABLE_WEB_INTERFACE=true

# Copy venv from builder
COPY --from=builder /app/env/.venv /app/.venv

# Copy project code
COPY --from=builder /app/env /app/env

# Set paths
ENV PATH="/app/.venv/bin:$PATH"
ENV PYTHONPATH="/app/env:$PYTHONPATH"

# Health check
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

EXPOSE 8000

CMD ["sh", "-c", "cd /app/env && uvicorn server.app:app --host 0.0.0.0 --port 8000"]
GROUND_TRUTH_EXTRACTION.md ADDED
@@ -0,0 +1,470 @@
# Ground Truth Extraction

Extract the API catalog from each live WebArena container by connecting to EC2 from Cursor, entering each container, and running Claude Code with the prompts below.

Container names confirmed from the running EC2 instance:

| App            | Container name                | Image                                |
| -------------- | ----------------------------- | ------------------------------------ |
| Shopping       | `shopping`                    | `shopping_final_0712`                |
| Shopping Admin | `shopping_admin`              | `shopping_admin_final_0719`          |
| Forum          | `forum`                       | `postmill-populated-exposed-withimg` |
| Wikipedia      | `kiwix33`                     | `ghcr.io/kiwix/kiwix-serve:3.3.0`    |
| Map (web)      | `openstreetmap-website-web-1` | `openstreetmap-website-web`          |

GitLab (`gitlab`) is skipped for now — it is returning intermittent 502 errors.
Wikipedia (Kiwix) serves a static ZIM file — there is no source code to analyze. Its catalog entry is hardcoded at the bottom of this file.

---

## Connection workflow

```
Cursor → Remote SSH → EC2 host → Dev Containers extension → Attach to Running Container → pick app container → paste prompt into Cursor sidebar
```

**Step 1 — Connect from Cursor to EC2**

`Cmd+Shift+P` → `Remote-SSH: Connect to Host` → `ubuntu@<EC2_IP>`

**Step 2 — Attach to a container**

With the Remote-SSH window open: `Cmd+Shift+P` → `Dev Containers: Attach to Running Container` → select the container (e.g. `shopping`).

Cursor opens a new window with the container's filesystem as the workspace. The full source code is now loaded and indexed — no copying needed.

**Step 3 — Paste the prompt into the Cursor sidebar**

Open the AI chat sidebar, paste the prompt for that container (see sections below), and run it. Cursor's AI has the full codebase in context and will write `api_catalog.json` into the workspace (inside the container).

**Step 4 — Copy the output back**

The file written inside the container can be downloaded via `File → Download` from Cursor's remote explorer, or via `scp` from the EC2 host after the fact.

Repeat for each container: `shopping`, `shopping_admin`, `forum`, `openstreetmap-website-web-1`.

---

## Expected output format

The catalog captures **all API surface** found in the codebase — REST, GraphQL, WebSocket, form submissions. The `api_type` field distinguishes them. Do not restrict to only the endpoints used by the 7 training tasks; document everything.

```json
[
  {
    "api_type": "rest",
    "endpoint": "POST /rest/V1/guest-carts/{cartId}/items",
    "auth": "none",
    "path_params": {
      "cartId": {
        "type": "string",
        "source": "PREV_CALL",
        "from_endpoint": "POST /rest/V1/guest-carts",
        "from_field": ".body",
        "notes": "entire response body is the cartId string"
      }
    },
    "body_params": {
      "cartItem.sku": { "type": "string", "source": "PREV_CALL", "from_endpoint": "GET /rest/V1/products", "from_field": ".items[0].sku" },
      "cartItem.qty": { "type": "number", "source": "TASK_SPEC" },
      "cartItem.quote_id": { "type": "string", "source": "DERIVED", "same_as": "cartId" }
    },
    "response_key_fields": []
  },
  {
    "api_type": "graphql",
    "endpoint": "POST /graphql",
    "operation_name": "GetProducts",
    "operation_type": "query",
    "auth": "none",
    "variables": {
      "search": { "type": "String", "source": "TASK_SPEC" },
      "pageSize": { "type": "Int", "source": "STATIC", "value": 20 }
    },
    "response_key_fields": [".products.items[].sku", ".products.items[].name"]
  },
  {
    "api_type": "websocket",
    "endpoint": "ws:///realtime",
    "auth": "session_cookie",
    "notes": "describe message protocol; include event names and payload shapes"
  },
  {
    "api_type": "form",
    "endpoint": "POST /submission/create",
    "auth": "session_cookie+csrf",
    "content_type": "application/x-www-form-urlencoded",
    "form_params": {
      "_token": { "type": "string", "source": "AUTH_FLOW", "notes": "hidden input in page HTML" },
      "title": { "type": "string", "source": "TASK_SPEC" },
      "url": { "type": "string", "source": "TASK_SPEC", "notes": "optional if body provided" }
    },
    "response_key_fields": []
  }
]
```

**`api_type`:** `rest` | `graphql` | `websocket` | `form`

**`source`:**

- `TASK_SPEC` — given in the task description
- `PREV_CALL` — from a prior response this episode; specify `from_endpoint` + `from_field`
- `AUTH_FLOW` — token / cookie / CSRF from the login flow
- `STATIC` — hardcoded in the app; document the actual value
- `DERIVED` — aliased from another value (e.g. `quote_id` = `cartId`)
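
Because every entry shares this `api_type`/`source` vocabulary, a small structural check catches most extraction mistakes before the judge ever loads a catalog. The required-field rules below are inferred from the format described above; treat this as an illustrative sketch, not the project's actual validator:

```python
VALID_API_TYPES = {"rest", "graphql", "websocket", "form"}
VALID_SOURCES = {"TASK_SPEC", "PREV_CALL", "AUTH_FLOW", "STATIC", "DERIVED"}
PARAM_SECTIONS = ("path_params", "query_params", "body_params", "form_params", "variables")

def check_entry(entry: dict) -> list[str]:
    """Return a list of structural problems with one catalog entry."""
    problems = []
    if entry.get("api_type") not in VALID_API_TYPES:
        problems.append(f"bad api_type: {entry.get('api_type')!r}")
    if "endpoint" not in entry:
        problems.append("missing endpoint")
    for section in PARAM_SECTIONS:
        for name, spec in entry.get(section, {}).items():
            src = spec.get("source")
            if src not in VALID_SOURCES:
                problems.append(f"{section}.{name}: bad source {src!r}")
            elif src == "PREV_CALL" and not (spec.get("from_endpoint") and spec.get("from_field")):
                problems.append(f"{section}.{name}: PREV_CALL needs from_endpoint and from_field")
    return problems
```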

---

## App 1 — Shopping and Shopping Admin (Magento 2)

**Container:** `shopping`
**Source root:** `/var/www/magento2/` (confirmed — WebArena README runs `docker exec shopping /var/www/magento2/bin/magento ...`)

Attach to the `shopping` container in Cursor, open `/var/www/magento2/` as the workspace, then paste this prompt into the sidebar:

```
You are working inside a Magento 2 codebase. Start by exploring the directory structure to orient yourself.

Your job: produce a COMPLETE api_catalog.json covering ALL APIs exposed by this Magento 2
installation — not just a subset. Document every endpoint you find, regardless of whether
it is used in a specific task or not. The goal is a full map of the application's API surface.

API types to scan for (all of them):
1. REST endpoints — declared in webapi.xml files. These are the primary API.
2. GraphQL — Magento has a full GraphQL API parallel to REST. Find the .graphqls schema files
   and document every query and mutation.
3. WebSockets — Magento does not typically use WebSockets, but check. If none found, note it.
4. Admin AJAX endpoints — controllers under adminhtml/ that handle JSON AJAX requests.
   These are separate from the REST API.

For REST and Admin AJAX endpoints, produce:
{
  "api_type": "rest",
  "endpoint": "METHOD /path/{template}",
  "auth": "none | bearer_token | admin_bearer_token | session_cookie",
  "path_params": { "<name>": { "type": "...", "source": "...", "from_endpoint": "...", "from_field": "...", "notes": "..." } },
  "query_params": { ... },
  "body_params": { ... },
  "response_key_fields": ["jq paths that downstream calls will consume"]
}

For GraphQL queries/mutations, produce:
{
  "api_type": "graphql",
  "endpoint": "POST /graphql",
  "operation_name": "...",
  "operation_type": "query | mutation | subscription",
  "auth": "none | bearer_token",
  "variables": { "<name>": { "type": "...", "source": "...", "notes": "..." } },
  "response_key_fields": ["jq paths that downstream calls will consume"]
}

Source types: TASK_SPEC | PREV_CALL | AUTH_FLOW | STATIC | DERIVED
- PREV_CALL: must include from_endpoint and from_field (jq path into that response)
- AUTH_FLOW: any token/cookie obtained during login
- STATIC: include the actual static value

Rules:
- Document REQUIRED parameters only. Skip X-Requested-With, Cache-Control, correlation IDs.
- For guest-cart: POST /rest/V1/guest-carts returns a plain quoted string — that IS the cartId.
- quote_id in add-item body equals cartId — mark DERIVED.
- For searchCriteria filter params, document the exact query string structure.
- For GraphQL: read ALL .graphqls files to find every query, mutation, subscription.

Write the output to api_catalog.json at the root of the codebase.
```

---

## App 2 — Forum (Postmill / Symfony)

**Container:** `forum`
**Source root:** `/var/www/html/` (confirmed — `docker exec forum find / -name "composer.json"` returned `/var/www/html/composer.json`)

Attach to the `forum` container in Cursor, open `/var/www/html/` as the workspace, then paste this prompt into the sidebar:

```
You are working inside a Postmill forum codebase (PHP / Symfony). Start by exploring the
directory structure to orient yourself.

Your job: produce a COMPLETE api_catalog.json covering ALL HTTP endpoints exposed by this
Postmill installation — every route, every form action, every AJAX endpoint.

API types to scan for:
1. Form submissions (POST, application/x-www-form-urlencoded) — the primary interaction pattern
2. JSON AJAX endpoints — controllers that return JsonResponse
3. REST-style endpoints — if any exist under /api/
4. WebSockets — Postmill does not typically use WebSockets but check for any Mercure
   or Pusher integration. If none, note it.

For form submissions and JSON endpoints, produce:
{
  "api_type": "form" | "rest",
  "endpoint": "METHOD /path/{template}",
  "auth": "none | session_cookie | session_cookie+csrf",
  "content_type": "application/x-www-form-urlencoded | application/json | multipart/form-data",
  "path_params": { "<name>": { "type": "...", "source": "...", "from_endpoint": "...", "from_field": "...", "notes": "..." } },
  "query_params": { ... },
  "form_params": { ... },   // use this for form submissions
  "body_params": { ... },   // use this for JSON body
  "response_key_fields": ["what downstream calls consume from this response"]
}

Source types: TASK_SPEC | PREV_CALL | AUTH_FLOW | STATIC | DERIVED

Postmill-specific notes:
- Login is a form POST. Find the exact CSRF token field name in the security config or login form type.
- All write operations (create post, vote, comment) require session_cookie+csrf.
- The community slug / post ID in path templates come from TASK_SPEC or PREV_CALL.
- Read every FormType class to get the exact field names for each form.
- For CSRF tokens in forms: source is AUTH_FLOW (extracted from the page HTML before submit).

Write the output to api_catalog.json at the root of the codebase.
```

---

## App 3 — Map (OpenStreetMap / Rails)

**Container:** `openstreetmap-website-web-1`
**Source root:** `/app` (confirmed — `docker exec openstreetmap-website-web-1 ls /app` shows Gemfile, app/, config/, db/, etc.)

Attach to the `openstreetmap-website-web-1` container in Cursor, open `/app` as the workspace, then paste this prompt into the sidebar:

```
You are working inside an OpenStreetMap Rails codebase. Start by exploring the directory
structure to orient yourself.

Your job: produce a COMPLETE api_catalog.json covering ALL HTTP endpoints exposed by this
OpenStreetMap installation — every route, every API endpoint, every format variant.

API types to scan for:
1. REST API under /api/0.6/ — the main machine-readable API (XML and JSON variants)
2. Search / geocoding — how place searches are handled (may proxy to Nominatim, or local)
3. Web interface endpoints — HTML controllers, but also any that return JSON
4. OAuth endpoints — any OAuth 1.0 or 2.0 flows
5. WebSockets — unlikely but check for ActionCable or similar integration

For each endpoint, produce:
{
  "api_type": "rest" | "form" | "websocket",
  "endpoint": "METHOD /path/{template}",
  "auth": "none | oauth | session_cookie",
  "format_variants": [".json", ".xml"],   // if the endpoint supports multiple formats via extension
  "path_params": { "<name>": { "type": "...", "source": "...", "from_endpoint": "...", "from_field": "...", "notes": "..." } },
  "query_params": { ... },
  "body_params": { ... },
  "response_key_fields": ["XPath or jq paths downstream calls consume"]
}

Source types: TASK_SPEC | PREV_CALL | AUTH_FLOW | STATIC | DERIVED

OpenStreetMap-specific notes:
- The /api/0.6/ endpoints return XML by default; .json suffix returns JSON. Document both variants.
- Node, way, relation IDs are integers — source is TASK_SPEC for direct tasks, PREV_CALL when
  they come from a search result.
- The search endpoint may be /search or proxied through Nominatim — read the routes and
  controllers carefully to find where geographic searches are handled.
- Read ALL controller files, not just the api/ subdirectory. There may be JSON endpoints in
  the main web controllers too.

Start with the routes file (e.g. config/routes.rb) to get the complete route list, then read each controller. Write the output to api_catalog.json at the root of the codebase.
```

---

## App 4 — Wikipedia (Kiwix) — No extraction needed

Kiwix serves a static ZIM file at `/data/wikipedia_en_all_maxi_2022-05.zim` (confirmed — `docker exec kiwix33 find / -name "*.zim"` returned that path). There is no application source code to analyze — `kiwix-serve` is a C++ binary, not a web framework. The catalog entry is hardcoded below.

Hardcoded catalog entry (save as `catalogs/wikipedia.json`):

```json
{
  "_meta": {
    "generated": "2026-04-08",
    "source": "hardcoded — kiwix-serve binary serves a static ZIM file; no application source to analyze",
    "zim_file": "/data/wikipedia_en_all_maxi_2022-05.zim",
    "search_response": "HTML only — GET /search returns HTML page; agent must parse <a href> links for article URLs",
    "article_page": "GET /wikipedia_en_all_maxi_2022-05/A/{title} — returns HTML article",
    "websockets": "none"
  },
  "endpoints": [
    {
      "api_type": "rest",
      "endpoint": "GET /search",
      "auth": "none",
      "query_params": {
        "pattern": {
          "type": "string",
          "source": "TASK_SPEC",
          "notes": "the search query, URL-encoded"
        },
        "books.name": {
          "type": "string",
          "source": "STATIC",
          "value": "wikipedia_en_all_maxi_2022-05",
          "notes": "selects which ZIM book to search"
        }
      },
      "response_key_fields": [],
      "notes": "IMPORTANT: response is HTML, not JSON. Parse <a href> anchor links matching /wikipedia_en_all_maxi_2022-05/A/... to extract article slugs."
    },
    {
      "api_type": "rest",
      "endpoint": "GET /wikipedia_en_all_maxi_2022-05/A/{article_title}",
      "auth": "none",
      "path_params": {
        "article_title": {
          "type": "string",
          "source": "PREV_CALL",
          "from_endpoint": "GET /search",
          "from_field": "href attribute of first search result <a> tag",
          "notes": "URL-encoded article slug, e.g. Albert_Einstein. Extract from the href on the search results HTML page."
        }
      },
      "response_key_fields": [],
      "notes": "Returns full HTML article page. HTTP 200 when article exists, 404 when not found."
    }
  ]
}
```
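
Since `/search` returns HTML, the article slug has to be scraped out of anchor hrefs. A minimal regex-based extraction sketch, where the href shape comes from the catalog entry above and the sample HTML in the test is made up:

```python
import re

BOOK = "wikipedia_en_all_maxi_2022-05"

def extract_article_slugs(search_html: str) -> list[str]:
    """Pull article slugs out of result links shaped like /<book>/A/<slug>."""
    pattern = rf'href="/{re.escape(BOOK)}/A/([^"?#]+)"'
    return re.findall(pattern, search_html)
```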

---

## Validation — smoke-test each catalog entry

After Claude Code writes `api_catalog.json` for each app, validate a few key entries against the live server before committing:

```bash
EC2="ec2-16-59-2-56.us-east-2.compute.amazonaws.com"

# Shopping: guest cart + add item
CART=$(curl -s -X POST http://$EC2:7770/rest/V1/guest-carts \
  -H "Content-Type: application/json" | tr -d '"')
echo "cart_id: $CART"

# Get admin token first (product listing requires auth)
ADMIN_TOKEN=$(curl -s -X POST http://$EC2:7770/rest/V1/integration/admin/token \
  -H "Content-Type: application/json" \
  -d '{"username":"admin","password":"admin1234"}' | tr -d '"')

SKU=$(curl -s "http://$EC2:7770/rest/V1/products?searchCriteria%5BpageSize%5D=1" \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['items'][0]['sku'])")
echo "sku: $SKU"

curl -s -X POST http://$EC2:7770/rest/V1/guest-carts/$CART/items \
  -H "Content-Type: application/json" \
  -d "{\"cartItem\":{\"sku\":\"$SKU\",\"qty\":1,\"quote_id\":\"$CART\"}}" | python3 -m json.tool
```

If the response is a 200 with item details, the catalog entries for tasks 3–5 are correct.

Or use the automated validator (requires `pip install requests`):

```bash
python3 validate_catalog.py --host ec2-16-59-2-56.us-east-2.compute.amazonaws.com --all
```

---

## Final structure

All source roots confirmed by running commands on the live EC2 instance:

| Container                     | Source root          | How confirmed                                                          |
| ----------------------------- | -------------------- | ---------------------------------------------------------------------- |
| `shopping`                    | `/var/www/magento2/` | WebArena README `docker exec` commands                                 |
| `shopping_admin`              | `/var/www/magento2/` | Same                                                                   |
| `forum`                       | `/var/www/html/`     | `find / -name "composer.json"` returned `/var/www/html/composer.json`  |
| `openstreetmap-website-web-1` | `/app`               | `ls /app` shows Gemfile, app/, config/, db/                            |
| `kiwix33`                     | N/A                  | Binary server; data at `/data/wikipedia_en_all_maxi_2022-05.zim`       |

After running the AI on each container, download `api_catalog.json` via Cursor's remote explorer (`Right-click → Download`) and save locally as:

```
catalogs/
  shopping.json          ← from shopping container
  shopping_admin.json    ← from shopping_admin container
  forum.json             ← from forum container
  osm.json               ← from openstreetmap-website-web-1 container
  wikipedia.json         ← hardcoded above (no container needed)
```

These five files are committed to the repo and loaded by the judge at startup. They are never regenerated during training.

---

## Catalog status — live endpoint verification (2026-04-08)

All five catalogs have been extracted and are committed. Below is the result of live testing against `ec2-16-59-2-56.us-east-2.compute.amazonaws.com`.

### Summary table

| Catalog               | Endpoints | JSON valid | Structure | Live test | Notes |
| --------------------- | --------- | ---------- | --------- | --------- | ----- |
| `shopping.json`       | 502       | ✅         | ✅        | ✅ PASS   | See details below |
| `shopping_admin.json` | 552       | ✅         | ✅        | ✅ PASS   | See details below |
| `forum.json`          | 91        | ✅         | ✅        | ⚠️ WARN   | Login not verified (see below) |
| `osm.json`            | 217       | ✅         | ✅        | ✅ PASS   | See details below |
| `wikipedia.json`      | 2         | ✅         | ✅        | ⚠️ WARN   | Search returns HTML not JSON (corrected) |

---

### Shopping (port 7770) — PASS

**Auth:** `POST /rest/V1/integration/admin/token` with `admin`/`admin1234` returns a JWT bearer token. ✅
**Guest cart:** `POST /rest/V1/guest-carts` returns a plain quoted string — confirmed this is the cartId. ✅
**Product listing:** `GET /rest/V1/products?searchCriteria[pageSize]=N` requires bearer token — returns full product JSON with `total_count: 104368`. ✅
**Add to cart:** `POST /rest/V1/guest-carts/{cartId}/items` with `{sku, qty, quote_id}` returns item detail at HTTP 200. ✅

**Key finding:** `GET /rest/V1/products` without auth returns HTTP 401 ("consumer isn't authorized"). The catalog documents `auth: "admin_bearer_token"` for this endpoint — **correct**.

---

### Shopping Admin (port 7780)

Shopping Admin uses the same Magento 2 REST API as Shopping but accessed on port 7780 with admin credentials. The `shopping_admin.json` catalog documents the same REST surface with admin-scoped auth. The admin UI itself is a browser-based SPA — its internal AJAX endpoints are documented in the catalog under `admin_ajax` type entries.

---

### Forum (port 9999)

**Homepage:** HTTP 200. ✅
**Login form structure:** Confirmed via HTML inspection. Form action is `POST /login_check`. CSRF field is `_csrf_token` (Symfony token, not form_key). ✅
**Login result:** `POST /login_check` with `MarvelsGrantMan136`/`test1234` redirects to `/` (homepage) — login successful. ✅ (The original password `notarobot` from WebArena defaults was stale; the correct password on this instance is `test1234`.)

**Catalog correctness:** The `forum.json` catalog correctly documents:
- `POST /login_check` with `_csrf_token`, `_username`, `_password`
- All write endpoints require `session_cookie+csrf`
- `route_name` field on each entry (extra metadata, not used by judge)
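
The one fiddly part of driving that login flow programmatically is pulling `_csrf_token` out of the login-page HTML before posting. A minimal extraction sketch; the attribute ordering assumed by the regex and the sample input in the test are assumptions, not verified against the live page markup:

```python
import re

def extract_csrf(login_html: str) -> str:
    """Pull the Symfony _csrf_token hidden-input value from login-page HTML.
    Assumes name="_csrf_token" appears before value="..." within the input tag."""
    m = re.search(r'name="_csrf_token"[^>]*value="([^"]*)"', login_html)
    if not m:
        raise ValueError("no _csrf_token field found in login page")
    return m.group(1)

# The token then goes into the documented form POST:
#   POST /login_check  with  _csrf_token, _username, _password  (form-encoded)
```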
449
+
450
+ ---
451
+
452
+ ### OSM / Map (port 3000)
453
+
454
+ **Capabilities:** `GET /api/0.6/capabilities` returns XML. ✅
455
+ **Map bbox:** `GET /api/0.6/map?bbox=-0.1,51.5,0.1,51.6` returns valid OSM XML with `<osm>` root. ✅
456
+
457
+ **Search finding:** `GET /search?query=...` returns an **HTML page** (HTTP 200), not JSON. The actual geocoding is dispatched client-side to sub-endpoints:
458
+ - `POST /geocoder/search_osm_nominatim` — Nominatim-backed search
459
+ - `POST /geocoder/search_latlon` — coordinate-based search
460
+ - `POST /geocoder/search_osm_nominatim_reverse` — reverse geocode
461
+
462
+ **Search param name:** The catalog documents `query` as the query param name. Confirmed: `GET /search?query=New+York` returns HTTP 200 (HTML).
463
+
464
+ ---
465
+
466
+ ### Wikipedia / Kiwix (port 8888)
467
+
468
+ **Search endpoint:** `GET /search?pattern=...&books.name=wikipedia_en_all_maxi_2022-05` returns HTTP 200. ✅
469
+ **Article endpoint:** `GET /wikipedia_en_all_maxi_2022-05/A/Albert_Einstein` returns HTTP 200. ✅
470
+ ---
HAR_TASK_LIST.md ADDED
@@ -0,0 +1,276 @@
# HAR Recording Task List

The browser agent performs all of these tasks in a single session with network recording enabled. Every request is captured — no filtering needed. One HAR dump is exported at the end.

**Credentials used throughout:**
- Shopping (customer): `emma.lopez@gmail.com` / `Password.1`
- Shopping Admin: `admin` / `admin1234`
- Forum: `MarvelsGrantMan136` / `test1234`
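
A HAR dump is plain JSON (`log.entries[]`, each entry carrying `request.url` and `request.method` in the standard HAR 1.2 layout), so a quick post-recording sanity check is to count captured requests per host. The aggregation below is an illustrative sketch:

```python
from collections import Counter
from urllib.parse import urlparse

def requests_per_host(har: dict) -> Counter:
    """Count recorded requests per host in a HAR 1.2 dump."""
    hosts = Counter()
    for entry in har.get("log", {}).get("entries", []):
        url = entry.get("request", {}).get("url", "")
        hosts[urlparse(url).netloc] += 1
    return hosts

# Usage sketch:
#   import json
#   har = json.load(open("hars/shopping.har"))
#   print(requests_per_host(har).most_common(5))
```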
9
+
10
+ ---
11
+
12
+ ## App 1 — Shopping (port 7770)
13
+ `http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7770/`
14
+
15
+ ### Guest flows (no login)
16
+
17
+ 1. Open the homepage.
18
+ 2. Click into the **Beauty & Personal Care** top-level category from the nav.
19
+ 3. Navigate to **Beauty & Personal Care > Oral Care > Toothbrushes & Accessories** — let the product list load.
20
+ 4. Search for `"ginger"` using the search bar — let results load.
21
+ 5. Click on any product from the search results — let the product detail page load fully.
22
+ 6. Add that product to the cart (select a quantity if required, click Add to Cart).
23
+ 7. Click the cart icon — open the mini-cart.
24
+ 8. Click **Proceed to Checkout**.
25
+ 9. Fill in the guest checkout shipping form:
26
+ - Email: `test@example.com`
27
+ - First Name: `Test`
28
+ - Last Name: `User`
29
+ - Street: `123 Main St`
30
+ - City: `New York`
31
+ - State: `New York`
32
+ - ZIP: `10001`
33
+ - Country: `United States`
34
+ - Phone: `5551234567`
35
+ 10. Select the first available shipping method and click **Next**.
36
+ 11. On the payment step, leave the default payment method and click **Place Order**.
37
+
38
+ ### Logged-in customer flows
39
+
40
+ 12. Log in with `emma.lopez@gmail.com` / `Password.1` (My Account → Sign In).
41
+ 13. After login, open **My Account** dashboard.
42
+ 14. Navigate to **My Orders** under the account sidebar.
43
+ 15. Click into any existing order to view its detail page.
44
+ 16. Go to **My Wishlist** (account sidebar).
45
+ 17. Navigate to a product — **Sports & Outdoors > Exercise & Fitness** — pick any product.
46
+ 18. Click **Add to Wish List** on that product.
47
+ 19. Go to **My Wishlist** again to confirm it was added.
48
+ 20. From the wishlist, click **Add to Cart** for that same product.
49
+ 21. Go to the cart, change the quantity of one item to `2`, and click **Update Cart**.
50
+ 22. Navigate to **My Account > Address Book** and view existing addresses.
51
+ 23. Log out.
52
+
53
+ ---
54
+
55
+ ## App 2 — Shopping Admin (port 7780)
56
+ `http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7780/admin`
57
+
58
+ > **Note:** Port 7780 serves the customer-facing storefront at the root URL. The Magento Admin panel is at the `/admin` subpath. The browser agent must navigate directly to `/admin` to reach the admin login page.
59
+
60
+ ### Authentication
61
+
62
+ 24. Go to the admin login page at `http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7780/admin`.
63
+ 25. Log in with `admin` / `admin1234`.
64
+
65
+ ### Catalog management
66
+
67
+ 26. Navigate to **Catalog > Products** from the left sidebar.
68
+ 27. Let the product grid load with the default filters.
69
+ 28. Use the search/filter bar to filter products by **Name** containing `"tee"` — apply filters.
70
+ 29. Click into any product from the filtered list to open the product edit form.
71
+ 30. Change the product **Price** to any nearby value (e.g., add $1), scroll down to **Save**.
72
+ 31. Navigate back to **Catalog > Products**.
73
+ 32. Click **Add Product** (top right) — select **Simple Product** if prompted.
74
+ 33. Fill in the new product form:
75
+ - Product Name: `HAR Test Product`
76
+ - SKU: `HAR-TEST-001`
77
+ - Price: `19.99`
78
+ - Quantity: `100`
79
+ - Attribute Set: Default
80
+ 34. Click **Save** on the new product.
81
+
82
+ ### Order management
83
+
84
+ 35. Navigate to **Sales > Orders** from the left sidebar.
85
+ 36. Let the order grid load.
86
+ 37. Click into any existing order to open the order detail view.
87
+ 38. Note the order status. Click **Invoice** (if the button is available) — fill in the invoice form defaults and click **Submit Invoice**.
88
+
89
+ ### Customer management
90
+
91
+ 39. Navigate to **Customers > All Customers**.
92
+ 40. Click into any customer record to view the account detail page.
93
+ 41. In the customer account page, click the **Orders** tab to see their order history.
94
+
95
+ ### Reports
96
+
97
+ 42. Navigate to **Reports > Products > Bestsellers**.
98
+ 43. Navigate to **Reports > Sales > Orders** — let the report load.
99
+
100
+ ### Logout
101
+
102
+ 44. Log out from the admin panel (Admin menu, top right → Sign Out).
103
+
104
+ ---
105
+
106
+ ## App 3 — Forum (port 9999)
107
+ `http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:9999/`
108
+
109
+ ### Guest browsing
110
+
111
+ 45. Open the forum homepage.
112
+ 46. Click on **Forums** in the nav — let the forum list load.
113
+ 47. Click into any available forum/subforum.
114
+ 48. Click on any post/thread to open it.
115
+ 49. Click on any user's username to view their profile page.
116
+
117
+ ### Authenticated flows
118
+
119
+ 50. Click **Log In** and sign in with `MarvelsGrantMan136` / `test1234`.
120
+ 51. After login, return to the homepage — confirm you are logged in.
121
+ 52. Click into a forum that allows posting.
122
+ 53. Click **New Thread**, **Submit Link**, or **Submit Text** (whichever post-creation button is present).
123
+ 54. Fill in the post form:
124
+ - Title: `HAR Test Post - API Coverage`
125
+ - Body/URL: `This is a test post created for HAR recording.`
126
+ 55. Submit the post.
127
+ 56. After submitting, view the created post's page.
128
+ 57. On the post, click the **Comment** or reply area — type a comment: `"Test comment for HAR recording."` — submit it.
129
+ 58. On any other post (not your own), click the **upvote** button.
130
+ 59. On any post, click **Save** / bookmark if the option exists.
131
+ 60. Navigate to your own profile page (click your username in the top bar).
132
+ 61. Click **Log Out**.
133
+
134
+ ---
135
+
136
+ ## App 4 — Map (port 3000)
137
+ `http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:3000/`
138
+
139
+ ### Browse and search
140
+
141
+ 62. Open the map homepage — let the default map tiles load.
142
+ 63. In the **Search** bar at the top, type `"New York"` and press enter/click search — let results load and map pan.
143
+ 64. Click on one of the search results to zoom into that location.
144
+ 65. Zoom in several levels using the `+` button or scroll wheel.
145
+ 66. Zoom back out using the `−` button.
146
+ 67. Pan the map by clicking and dragging to a different area.
147
+ 68. Search for `"London"` — let results load.
148
+ 69. Click **Export** in the top nav — let the export panel open (you don't need to actually download).
149
+ 70. Click on the map to drop a marker — then click **Where is this?** in the top bar with the marker active.
150
+
151
+ ### Node/way detail
152
+
153
+ 71. In the search box, search for `"Central Park"` — click the result.
154
+ 72. Click on any map feature (node/way) that becomes clickable — let the sidebar panel load the feature detail.
155
+
156
+ ---
157
+
158
+ ## Coverage cross-check
159
+
160
+ After completing all tasks above, you should have HAR traffic covering:
161
+
162
+ | App | Auth endpoints | Product/content listing | Item creation/mutation | Session/cookie flows |
163
+ |-----|---------------|------------------------|----------------------|---------------------|
164
+ | Shopping (guest) | — | ✓ category, search | ✓ cart, checkout | ✓ guest session |
165
+ | Shopping (authed) | ✓ login, logout | ✓ orders, wishlist | ✓ wishlist add, cart update | ✓ customer token |
166
+ | Admin | ✓ admin login/logout | ✓ product grid, order grid | ✓ product edit, create, invoice | ✓ admin token |
167
+ | Forum (guest) | — | ✓ forums, posts | — | — |
168
+ | Forum (authed) | ✓ login, logout | ✓ profile | ✓ post create, comment, vote | ✓ CSRF form_key |
169
+ | Map | — | ✓ tile loads, search | — | — |
170
+
171
+ ---
172
+
173
+ ## Initial Run — Browser Agent Tasks for the 7 Training Templates
174
+
175
+ The full task list above covers broad application exploration. For the initial training run, the browser agent only needs to complete the tasks that produce HAR traffic relevant to the **7 task templates defined in [README.md](README.md)**. Below is the minimum set grouped by application so the browser agent can work through one app at a time in a single session.
176
+
177
+ ---
178
+
179
+ ### Shopping (port 7770) — Templates 1, 3, 6
180
+ `http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7770/`
181
+
182
+ Covers: category listing (Easy), add-to-cart (Medium), and full guest checkout (Hard).
183
+
184
+ The browser agent runs the guest checkout flow end-to-end. This single pass captures all the HAR traffic needed for all three Shopping templates — category browsing produces Template 1 traffic, cart creation + item addition produces Template 3 traffic, and the full checkout completes Template 6.
185
+
186
+ 1. Open the Shopping homepage.
187
+ 2. Click into **Beauty & Personal Care** from the nav. *(Template 1: category listing)*
188
+ 3. Navigate to **Beauty & Personal Care > Oral Care > Toothbrushes & Accessories** — let the product list load. *(Template 1: product listing)*
189
+ 4. Search for `"ginger"` using the search bar — let results load. *(Template 3: product lookup)*
190
+ 5. Click on any product from the search results — let the product detail page load fully. *(Template 3: product detail)*
191
+ 6. Add that product to the cart. *(Template 3: cart creation + item addition)*
192
+ 7. Click the cart icon — open the mini-cart. *(Template 3: cart state)*
193
+ 8. Click **Proceed to Checkout**. *(Template 6: checkout begins)*
194
+ 9. Fill in the guest checkout shipping form: *(Template 6: shipping)*
195
+ - Email: `test@example.com`
196
+ - First Name: `Test`, Last Name: `User`
197
+ - Street: `123 Main St`, City: `New York`, State: `New York`, ZIP: `10001`
198
+ - Country: `United States`, Phone: `5551234567`
199
+ 10. Select the first available shipping method and click **Next**. *(Template 6: shipping method)*
200
+ 11. On the payment step, leave the default payment method and click **Place Order**. *(Template 6: payment + order)*
201
+
202
+ **HAR traffic captured:** category tree API, product list/search API, guest cart creation, add-to-cart, estimate-shipping, set-shipping-information, payment-information, place-order.
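Once the HAR is recorded, this coverage can be cross-checked programmatically. A minimal sketch (the expected-path list and helper names are illustrative, not part of the recorded HAR):

```python
import re

# Expected Magento REST paths for Templates 1, 3, and 6 (illustrative list).
EXPECTED_SHOPPING_PATHS = [
    "/rest/V1/categories",
    "/rest/V1/guest-carts",
    "/rest/V1/guest-carts/{id}/items",
    "/rest/V1/guest-carts/{id}/estimate-shipping-methods",
    "/rest/V1/guest-carts/{id}/shipping-information",
    "/rest/V1/guest-carts/{id}/payment-information",
]

def normalize(path: str) -> str:
    """Replace concrete cart IDs/hashes with a {id} placeholder."""
    return re.sub(r"/guest-carts/[A-Za-z0-9]+", "/guest-carts/{id}", path)

def missing_endpoints(har_paths: list[str]) -> list[str]:
    """Return expected paths that never appear in the recorded HAR."""
    seen = {normalize(p) for p in har_paths}
    return [p for p in EXPECTED_SHOPPING_PATHS if p not in seen]
```

Running this against the request paths pulled from the HAR quickly shows whether another browser pass is needed.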
203
+
204
+ ---
205
+
206
+ ### Shopping Admin (port 7780) — Template 7
207
+ `http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7780/admin`
208
+
209
+ Covers: admin product creation (Hard).
210
+
211
+ > **Note:** The root URL on port 7780 shows the customer storefront, not the admin panel. The browser agent must navigate to `/admin` to reach the admin login.
212
+
213
+ 1. Go to the admin login page at `http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7780/admin`.
214
+ 2. Log in with `admin` / `admin1234`.
215
+ 3. Click **Add Product** (top right) — select **Simple Product** if prompted.
216
+ 4. Fill in the new product form:
217
+ - Product Name: `HAR Test Product`
218
+ - SKU: `HAR-TEST-001`
219
+ - Price: `19.99`
220
+ - Quantity: `100`
221
+ - Attribute Set: Default
222
+ 5. Click **Save** on the new product.
223
+
224
+ **HAR traffic captured:** admin auth token flow, product creation POST with full Magento product schema.
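The "full Magento product schema" in that POST reduces to a small required core. A hedged sketch of a minimal creation body (field names follow Magento REST conventions but should be confirmed against the recorded HAR; `attribute_set_id = 4` is the stock "Default" set on a standard install):

```python
def build_product_payload(name: str, sku: str, price: float, qty: int) -> dict:
    """Minimal Magento V1 product-creation body.

    attribute_set_id=4 assumes the stock "Default" attribute set;
    verify the ID against your install / the recorded HAR.
    """
    return {
        "product": {
            "sku": sku,
            "name": name,
            "price": price,
            "attribute_set_id": 4,
            "type_id": "simple",
            "status": 1,       # enabled
            "visibility": 4,   # catalog + search
            "extension_attributes": {
                "stock_item": {"qty": qty, "is_in_stock": True}
            },
        }
    }
```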
225
+
226
+ ---
227
+
228
+ ### Forum (port 9999) — Templates 4, 5
229
+ `http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:9999/`
230
+
231
+ Covers: authenticated category browsing (Medium) and post creation (Hard).
232
+
233
+ The browser agent logs in once, browses categories, then creates a post. This single pass captures traffic for both Forum templates — browsing produces Template 4 traffic, and post creation produces Template 5 traffic.
234
+
235
+ 1. Open the Forum homepage.
236
+ 2. Click on **Forums** in the nav — let the forum list load. *(Template 4: category listing)*
237
+ 3. Log in with `MarvelsGrantMan136` / `test1234`. *(Templates 4 & 5: auth + CSRF token)*
238
+ 4. After login, return to the homepage and confirm you are logged in.
239
+ 5. Click into any available forum/subforum. *(Template 4: authed category browse)*
240
+ 6. Click on any post/thread to open it. *(Template 4: authed post retrieval)*
241
+ 7. Navigate to a forum that allows posting. *(Template 5: post creation begins)*
242
+ 8. Click **New Thread** / **Submit Text**. *(Template 5: creation form)*
243
+ 9. Fill in the post form: *(Template 5: post body)*
244
+ - Title: `HAR Test Post - API Coverage`
245
+ - Body: `This is a test post created for HAR recording.`
246
+ 10. Submit the post. *(Template 5: POST with CSRF form_key)*
247
+ 11. View the created post's page. *(Template 5: confirm creation)*
248
+
249
+ **HAR traffic captured:** login + session/CSRF extraction, forum/subforum listing (authed), thread listing, post creation with form_key.
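Replaying the post-creation flow outside the browser means pulling the CSRF token out of the form HTML first. A minimal extraction sketch (matching any hidden input whose name mentions `token` is an assumption about Postmill's markup, and the regex assumes `type`/`name`/`value` attribute order; confirm both against the HAR):

```python
import re
from html import unescape

def extract_csrf_tokens(html: str) -> dict[str, str]:
    """Collect hidden-input tokens from a login/submit form.

    Assumes attributes appear in type/name/value order, which holds
    for the simple forms seen here but not for arbitrary HTML.
    """
    tokens = {}
    for m in re.finditer(
        r'<input[^>]*type="hidden"[^>]*name="([^"]*token[^"]*)"[^>]*value="([^"]*)"',
        html,
    ):
        tokens[unescape(m.group(1))] = unescape(m.group(2))
    return tokens
```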
250
+
251
+ ---
252
+
253
+ ### Wikipedia (port 8888) — Template 2
254
+ `http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:8888/`
255
+
256
+ Covers: article summary retrieval (Easy).
257
+
258
+ Wikipedia is not covered in the full task list above — these are new tasks for the initial run.
259
+
260
+ 1. Open the Wikipedia homepage.
261
+ 2. Search for any article (e.g., `"Python (programming language)"`).
262
+ 3. Click into the article and let the full page load.
263
+
264
+ **HAR traffic captured:** Kiwix search API, article content retrieval API.
265
+
266
+ ---
267
+
268
+ ### What is NOT needed for the initial run
269
+
270
+ | Skipped section | Tasks | Why not needed |
271
+ |----------------|-------|----------------|
272
+ | Shopping — logged-in customer flows | 12–23 | No template targets authed customer actions (orders, wishlist, address book) |
273
+ | Admin — catalog editing | 26–31 | Template 7 only needs product *creation*, not editing existing products |
274
+ | Admin — orders, customers, reports | 35–44 | No template targets admin read flows |
275
+ | Forum — voting, commenting, bookmarking | 57–61 | Templates 4 & 5 cover browse and post creation only |
276
+ | Map (port 3000) | 62–72 | No template targets the Map application |
JUDGE.md ADDED
@@ -0,0 +1,660 @@
1
+ # HARvestGym Judge Architecture
2
+
3
+ This document specifies the full judge architecture — how task completion is verified and how rewards are computed after each episode ends.
4
+
5
+ The judge is a deterministic, programmatic component. It does **not** use an LLM to score episodes. Every grader produces a score in `[0.0, 1.0]` that is then scaled to the reward range defined in `README.md`.
6
+
7
+ ---
8
+
9
+ ## Overview
10
+
11
+ ```
12
+ Episode ends (model calls done() or max_steps=20 reached)
13
+
14
+
15
+ Judge.evaluate(episode: Episode, task: Task) → EpisodeResult
16
+
17
+ ├─► 1. Identify task template from task.template_id
18
+
19
+ ├─► 2. Run programmatic grader for this template
20
+ │ │
21
+ │ ├─► Probe live application state (HTTP calls from judge, not model)
22
+ │ ├─► Inspect episode trajectory (call sequence, parameter sources)
23
+ │ └─► Compute score in [0.0, 1.0]
24
+
25
+ ├─► 3. Verify parameter sourcing (for partial credit)
26
+ │ │
27
+ │ └─► Cross-reference each curl call against ground truth catalog
28
+
29
+ └─► 4. Compute final reward
30
+
31
+ └─► Combine task score + parameter sourcing + step-level signals
32
+ ```
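The dispatch step at the top of this pipeline can be sketched as follows (the grader registry and defensive clamping are illustrative, not a spec):

```python
from types import SimpleNamespace

def evaluate(episode, task, graders: dict) -> float:
    """Route an episode to the grader registered for its template and
    clamp the result into [0.0, 1.0] (defensive; graders should
    already comply)."""
    grader = graders.get(task.template_id)
    if grader is None:
        raise ValueError(f"no grader registered for template {task.template_id}")
    return max(0.0, min(1.0, grader(episode, task)))
```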
33
+
34
+ ---
35
+
36
+ ## Data Structures
37
+
38
+ ```python
39
+ @dataclass
40
+ class Episode:
41
+ task: Task
42
+ steps: list[Step] # all tool calls and results
43
+ session_state: dict # final session state
44
+ total_steps: int
45
+ terminated_by: str # "done_call" | "max_steps"
46
+
47
+ @dataclass
48
+ class Step:
49
+ step_num: int
50
+ tool: str # browser_agent | search_endpoints | curl_exec | search_episode_data | done
51
+ action: str # raw tool call string
52
+ result: Any # tool return value
53
+ curl_parsed: CurlCall | None # None for non-curl steps
54
+
55
+ @dataclass
56
+ class CurlCall:
57
+ method: str
58
+ url: str
59
+ path: str # normalized (IDs replaced with {id})
60
+ headers: dict
61
+ body: dict | str | None
62
+ status_code: int
63
+ response_body: Any
64
+
65
+ @dataclass
66
+ class Task:
67
+ template_id: int # 1–7
68
+ description: str # instantiated task string (with actual values)
69
+ params: dict # e.g. {"product_name": "Radiant Tee", "sku": "MH01"}
70
+ app: str # shopping | forum | wikipedia | shopping_admin
71
+ base_url: str
72
+ difficulty: str # easy | medium | hard
73
+
74
+ @dataclass
75
+ class EpisodeResult:
76
+ task_score: float # 0.0–1.0 from grader
77
+ parameter_sourcing_score: float # 0.0–1.0 from trajectory analysis
78
+ auth_obtained: bool # did the model successfully authenticate?
79
+ reward: float # final composite reward
80
+ details: dict # per-grader diagnostic info for logging
81
+ ```
82
+
83
+ ---
84
+
85
+ ## Graders: Per-Template Verification
86
+
87
+ Each template has its own grader. Where a task mutates state (carts, posts, products), the grader verifies it with real HTTP calls to the live EC2 application rather than relying solely on the episode trajectory.
88
+
89
+ ### Template 1 — Easy | Shopping: List products in category `{category_name}`
90
+
91
+ **Success condition:** The model's curl call returned a 200 response containing at least one product in the correct category.
92
+
93
+ ```python
94
+ def grade_template_1(episode: Episode, task: Task) -> float:
95
+ category_name = task.params["category_name"]
96
+
97
+ # Find the curl call that returned products
98
+ for step in episode.steps:
99
+ if step.curl_parsed and step.curl_parsed.status_code == 200:
100
+ body = step.curl_parsed.response_body
101
+ if isinstance(body, dict) and "items" in body:
102
+ items = body["items"]
103
+ # Verify at least one item belongs to the target category
104
+ for item in items:
105
+ # Check category_links or category name in item
106
+ if _item_matches_category(item, category_name):
107
+ return 1.0
108
+ # Items returned but wrong category — partial credit
109
+ if len(items) > 0:
110
+ return 0.3
111
+ return 0.0
112
+
113
+ def _item_matches_category(item: dict, category_name: str) -> bool:
114
+ """Check category_links or custom_attributes for category match."""
115
+ # Magento items carry category_links: [{"category_id": N}]
116
+ # Judge verifies by calling GET /rest/V1/categories?searchCriteria[filter...]=name
117
+ # and comparing category IDs. This is a judge-side probe, not relying on model output.
118
+ ...
119
+ ```
120
+
121
+ **Reward mapping:**
122
+
123
+ | Score | Meaning | Reward |
124
+ |-------|---------|--------|
125
+ | 1.0 | Products listed from correct category | +2.0 |
126
+ | 0.3 | Products returned but wrong/unknown category | +0.5 |
127
+ | 0.0 | No valid product list response | −1.5 |
128
+
129
+ ---
130
+
131
+ ### Template 2 — Easy | Wikipedia: Retrieve article for `{title}`
132
+
133
+ **Success condition:** The model issued an HTTP GET that returned a 200 response for a URL containing the article title (or a redirect to it). Content parsing is explicitly not required.
134
+
135
+ ```python
136
+ def grade_template_2(episode: Episode, task: Task) -> float:
137
+ title = task.params["title"]
138
+ title_slug = title.lower().replace(" ", "_")
139
+
140
+ for step in episode.steps:
141
+ if step.curl_parsed and step.curl_parsed.status_code == 200:
142
+ url = step.curl_parsed.url.lower()
143
+ if title_slug in url or title.lower() in url:
144
+ return 1.0
145
+
146
+ # Check for search result that found the article (indirect)
147
+ for step in episode.steps:
148
+ if step.curl_parsed and step.curl_parsed.status_code == 200:
149
+ body_str = str(step.curl_parsed.response_body).lower()
150
+ if title.lower() in body_str and "wiki" in step.curl_parsed.url.lower():
151
+ return 0.5 # found reference but didn't fetch the article directly
152
+
153
+ return 0.0
154
+ ```
155
+
156
+ **Reward mapping:**
157
+
158
+ | Score | Reward |
159
+ |-------|--------|
160
+ | 1.0 | Correct article URL fetched with 200 | +2.0 |
161
+ | 0.5 | Article title found in search results but not fetched | +0.5 |
162
+ | 0.0 | No Wikipedia response | −1.5 |
163
+
164
+ ---
165
+
166
+ ### Template 3 — Medium | Shopping: Add `{product_name}` to a guest cart
167
+
168
+ **Success condition:** Judge probes the cart after the episode to verify the item is present.
169
+
170
+ ```python
171
+ def grade_template_3(episode: Episode, task: Task) -> float:
172
+ product_name = task.params["product_name"]
173
+ sku = task.params.get("sku") # known from parameter pool
174
+
175
+ # Extract cart_id from episode trajectory
176
+ cart_id = _extract_cart_id(episode)
177
+ if not cart_id:
178
+ return _partial_score_no_cart(episode)
179
+
180
+ # Judge probes the live application
181
+ cart_response = _judge_probe(
182
+ f"GET /rest/V1/guest-carts/{cart_id}",
183
+ task.base_url
184
+ )
185
+ if not cart_response or cart_response.status_code != 200:
186
+ return 0.1 # cart was created but can't be verified
187
+
188
+ items = cart_response.body.get("items", [])
189
+ for item in items:
190
+ if item.get("sku") == sku or _fuzzy_match(item.get("name", ""), product_name):
191
+ return 1.0
192
+
193
+ # Cart exists but item not in it
194
+ if len(items) == 0 and cart_id:
195
+ return 0.2 # cart created, item not added
196
+
197
+ return 0.0
198
+
199
+ def _partial_score_no_cart(episode: Episode) -> float:
200
+ """Partial credit: did the model attempt the right sequence?"""
201
+ attempted_cart_create = any(
202
+ s.curl_parsed and "guest-carts" in s.curl_parsed.path
203
+ and s.curl_parsed.method == "POST"
204
+ for s in episode.steps if s.curl_parsed
205
+ )
206
+ return 0.15 if attempted_cart_create else 0.0
207
+ ```
208
+
209
+ **Reward mapping:**
210
+
211
+ | Score | Reward |
212
+ |-------|--------|
213
+ | 1.0 | Item confirmed in cart via judge probe | +3.5 |
214
+ | 0.2 | Cart created, item not added | +0.5 |
215
+ | 0.15 | Correct call attempted, cart not created | +0.3 |
216
+ | 0.0 | No valid attempt | −1.5 |
217
+
218
+ ---
219
+
220
+ ### Template 4 — Medium | Forum: Retrieve all posts in `{forum_category}` (authed)
221
+
222
+ **Success condition:** The model authenticated and fetched a post listing that includes posts from the target category.
223
+
224
+ ```python
225
+ def grade_template_4(episode: Episode, task: Task) -> float:
226
+ forum_category = task.params["forum_category"]
227
+ score = 0.0
228
+
229
+ # Check authentication was obtained
230
+ auth_obtained = _check_forum_auth(episode)
231
+ if auth_obtained:
232
+ score += 0.3 # auth is partial credit on its own (see reward table)
233
+
234
+ # Find a curl call that returned a post listing for the correct category
235
+ for step in episode.steps:
236
+ if step.curl_parsed and step.curl_parsed.status_code == 200:
237
+ url = step.curl_parsed.url
238
+ body = step.curl_parsed.response_body
239
+
240
+ # Postmill returns post listings at /f/{category}
241
+ if f"/f/{forum_category.lower()}" in url.lower():
242
+ if _response_contains_posts(body):
243
+ return 1.0
244
+
245
+ return score # 0.3 if only auth, 0.0 if nothing
246
+
247
+ def _check_forum_auth(episode: Episode) -> bool:
248
+ """Authentication: a POST to /login returned a redirect (302) or 200 with session cookie."""
249
+ for step in episode.steps:
250
+ if step.curl_parsed:
251
+ if step.curl_parsed.method == "POST" and "/login" in step.curl_parsed.path:
252
+ if step.curl_parsed.status_code in (200, 302):
253
+ return True
254
+ return False
255
+ ```
256
+
257
+ **Reward mapping:**
258
+
259
+ | Score | Reward |
260
+ |-------|--------|
261
+ | 1.0 | Authenticated + posts fetched from correct category | +3.5 |
262
+ | 0.3 | Authentication only, no post fetch | +0.8 |
263
+ | 0.0 | No valid attempt | −1.5 |
264
+
265
+ ---
266
+
267
+ ### Template 5 — Hard | Forum: Create a post titled `{title}` in `{category}`
268
+
269
+ **Success condition:** Judge probes the forum category page after the episode to verify the post exists.
270
+
271
+ ```python
272
+ def grade_template_5(episode: Episode, task: Task) -> float:
273
+ title = task.params["title"]
274
+ category = task.params["category"]
275
+
276
+ # Phase 1: check authentication
277
+ auth_ok = _check_forum_auth(episode)
278
+
279
+ # Phase 2: check CSRF token was extracted and used
280
+ csrf_used = _check_csrf_in_trajectory(episode)
281
+
282
+ # Phase 3: judge probes the forum to verify post exists
283
+ posts = _judge_probe_forum_category(category, task.base_url)
284
+ for post in posts:
285
+ if _fuzzy_match(post.get("title", ""), title):
286
+ return 1.0
287
+
288
+ # Partial credit breakdown
289
+ if auth_ok and csrf_used:
290
+ return 0.5 # got auth and CSRF right, but post didn't land
291
+ if auth_ok:
292
+ return 0.3
293
+ return 0.0
294
+
295
+ def _check_csrf_in_trajectory(episode: Episode) -> bool:
296
+ """Check that a POST body contained a _csrf_token field."""
297
+ for step in episode.steps:
298
+ if step.curl_parsed and step.curl_parsed.method == "POST":
299
+ body_str = str(step.curl_parsed.body or "")
300
+ if "_csrf_token" in body_str and len(body_str) > 20:
301
+ return True
302
+ return False
303
+ ```
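`_fuzzy_match`, used by several graders, is also not defined here. A minimal token-overlap sketch (the 0.8 threshold is an illustrative choice):

```python
import re

def fuzzy_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Loose title/name comparison: lowercase, strip punctuation,
    and require sufficient token overlap between the two strings."""
    tok_a = set(re.findall(r"[a-z0-9]+", a.lower()))
    tok_b = set(re.findall(r"[a-z0-9]+", b.lower()))
    if not tok_a or not tok_b:
        return False
    overlap = len(tok_a & tok_b) / min(len(tok_a), len(tok_b))
    return overlap >= threshold
```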
304
+
305
+ **Reward mapping:**
306
+
307
+ | Score | Reward |
308
+ |-------|--------|
309
+ | 1.0 | Post confirmed in forum via judge probe | +5.0 |
310
+ | 0.5 | Auth + CSRF correct, post not created | +1.5 |
311
+ | 0.3 | Auth only | +0.8 |
312
+ | 0.0 | No valid attempt | −1.5 |
313
+
314
+ ---
315
+
316
+ ### Template 6 — Hard | Shopping: Guest checkout for `{product_name}`
317
+
318
+ **Success condition:** A complete order was created. Judge checks for an order ID in the trajectory and optionally probes the admin API.
319
+
320
+ ```python
321
+ def grade_template_6(episode: Episode, task: Task) -> float:
322
+ sku = task.params.get("sku")
323
+
324
+ # Check for order ID in trajectory (checkout success returns an integer order ID)
325
+ for step in episode.steps:
326
+ if step.curl_parsed and step.curl_parsed.status_code == 200:
327
+ body = step.curl_parsed.response_body
328
+ # Magento checkout success: POST /rest/V1/guest-carts/{id}/order returns integer
329
+ if isinstance(body, int) and body > 0:
330
+ return 1.0
331
+ # Magento checkout success: body could also be JSON with "order_id"
332
+ if isinstance(body, dict) and body.get("order_id"):
333
+ return 1.0
334
+
335
+ # Partial credit: did the model get through cart + item + shipping?
336
+ stages = _checkout_stages_completed(episode, sku)
337
+ if stages >= 4: # cart + item + shipping + payment hit, but no order ID returned
338
+ return 0.6
339
+ if stages >= 2: # cart + item
340
+ return 0.3
341
+ if stages >= 1: # cart only
342
+ return 0.1
343
+ return 0.0
344
+
345
+ def _checkout_stages_completed(episode: Episode, sku: str) -> int:
346
+ """Count how many checkout stages the model completed successfully."""
347
+ stages = 0
348
+ paths_hit = {s.curl_parsed.path for s in episode.steps if s.curl_parsed and s.curl_parsed.status_code == 200}
349
+
350
+ if any("guest-carts" in p and "{" not in p for p in paths_hit): stages += 1 # cart created
351
+ if any("guest-carts" in p and "items" in p for p in paths_hit): stages += 1 # item added
352
+ if any("guest-carts" in p and "shipping" in p for p in paths_hit): stages += 1 # shipping
353
+ if any("guest-carts" in p and "payment" in p for p in paths_hit): stages += 1 # payment/order
354
+ return stages
355
+ ```
356
+
357
+ **Reward mapping:**
358
+
359
+ | Score | Reward |
360
+ |-------|--------|
361
+ | 1.0 | Order created (order_id in response) | +5.0 |
362
+ | 0.6 | 4+ stages completed | +2.5 |
363
+ | 0.3 | Cart + item only | +0.8 |
364
+ | 0.1 | Cart only | +0.3 |
365
+ | 0.0 | No valid attempt | −1.5 |
366
+
367
+ ---
368
+
369
+ ### Template 7 — Hard | Shopping Admin: Create product with SKU `{sku}`, price `{price}`
370
+
371
+ **Success condition:** Judge probes the admin API to confirm the product exists with the correct SKU and price.
372
+
373
+ ```python
374
+ def grade_template_7(episode: Episode, task: Task) -> float:
375
+ sku = task.params["sku"]
376
+ price = float(task.params["price"])
377
+
378
+ # Phase 1: check admin authentication
379
+ admin_token = _extract_admin_token(episode)
380
+ if not admin_token:
381
+ return 0.0
382
+
383
+ # Phase 2: judge probes the Magento REST API to confirm product exists
384
+ product = _judge_probe(
385
+ f"GET /rest/V1/products/{sku}",
386
+ task.base_url,
387
+ headers={"Authorization": f"Bearer {admin_token}"}
388
+ )
389
+ if not product or product.status_code != 200:
390
+ # Product might exist under a different auth context — try admin token from env
391
+ product = _judge_probe_with_env_admin_token(f"GET /rest/V1/products/{sku}", task.base_url)
392
+
393
+ if product and product.status_code == 200:
394
+ actual_price = float(product.body.get("price", -1))
395
+ price_ok = abs(actual_price - price) < 0.01
396
+ return 1.0 if price_ok else 0.7 # product exists but wrong price
397
+
398
+ # Partial credit: correct API called with correct schema
399
+ if _attempted_product_creation(episode, sku):
400
+ return 0.2
401
+
402
+ return 0.0
403
+
404
+ def _extract_admin_token(episode: Episode) -> str | None:
405
+ """Find admin bearer token from a POST /rest/V1/integration/admin/token response."""
406
+ for step in episode.steps:
407
+ if step.curl_parsed and step.curl_parsed.status_code == 200:
408
+ if "integration/admin/token" in step.curl_parsed.path:
409
+ body = step.curl_parsed.response_body
410
+ if isinstance(body, str) and len(body) > 10:
411
+ return body.strip('"')
412
+ return None
413
+ ```
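`_attempted_product_creation` is likewise undefined in this document. A sketch over a flattened `(method, path, body)` view of the curl steps (the flattening is for illustration):

```python
def attempted_product_creation(calls: list[tuple], sku: str) -> bool:
    """True if any POST to the Magento products endpoint carried the
    target SKU in its body, i.e. the model called the right API with
    the right schema even if the product never landed."""
    for method, path, body in calls:
        if method == "POST" and path.rstrip("/").endswith("/rest/V1/products"):
            if sku in str(body):
                return True
    return False
```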
414
+
415
+ **Reward mapping:**
416
+
417
+ | Score | Reward |
418
+ |-------|--------|
419
+ | 1.0 | Product confirmed in Magento with correct price | +5.0 |
420
+ | 0.7 | Product exists but wrong price | +2.0 |
421
+ | 0.2 | Admin auth + correct endpoint called | +0.5 |
422
+ | 0.0 | No admin auth | −1.5 |
423
+
424
+ ---
425
+
426
+ ## Parameter Sourcing Verification
427
+
428
+ In addition to the task-specific grader, the judge runs a parameter sourcing analysis over the full episode trajectory. This cross-references each curl call against the ground truth catalog to verify that parameter values were obtained from the correct sources.
429
+
430
+ ```python
431
+ def verify_parameter_sourcing(episode: Episode, task: Task, catalog: list[dict]) -> float:
432
+ """
433
+ Returns a score in [0.0, 1.0] representing how correctly the model
434
+ sourced parameter values across all curl calls in the episode.
435
+
436
+ Checks each curl call against the ground truth catalog entry for that endpoint.
437
+ """
438
+ correct = 0
439
+ total = 0
440
+
441
+ for step in episode.steps:
442
+ if not step.curl_parsed:
443
+ continue
444
+
445
+ catalog_entry = _find_catalog_entry(step.curl_parsed.path, step.curl_parsed.method, catalog)
446
+ if not catalog_entry:
447
+ continue
448
+
449
+ # Check each parameter in the curl call
450
+ for param_name, param_meta in catalog_entry.get("path_params", {}).items():
451
+ total += 1
452
+ value_used = _extract_path_param(step.curl_parsed.url, param_name, catalog_entry)
453
+ if value_used and _param_sourced_correctly(value_used, param_meta, episode, step):
454
+ correct += 1
455
+
456
+ for param_name, param_meta in catalog_entry.get("body_params", {}).items():
457
+ total += 1
458
+ value_used = _extract_body_param(step.curl_parsed.body, param_name)
459
+ if value_used and _param_sourced_correctly(value_used, param_meta, episode, step):
460
+ correct += 1
461
+
462
+ if total == 0:
463
+ return 0.0
464
+ return correct / total
465
+
466
+ def _param_sourced_correctly(value: Any, param_meta: dict, episode: Episode, step: Step) -> bool:
467
+ """
468
+ Verify that a parameter value came from the expected source.
469
+
470
+ Source types:
471
+ TASK_SPEC — value must appear in the task description string
472
+ PREV_CALL — value must appear in a prior step's response body
473
+ AUTH_FLOW — value must come from an auth response (token, session)
474
+ STATIC — value must match a known constant (e.g., store_id = 1)
475
+ DERIVED — value must be derivable from another parameter in this call
476
+ """
477
+ source = param_meta.get("source")
478
+
479
+ if source == "TASK_SPEC":
480
+ return str(value) in episode.task.description
481
+
482
+ elif source == "PREV_CALL":
483
+ from_endpoint = param_meta.get("from_endpoint")
484
+ from_field = param_meta.get("from_field")
485
+ # Check prior steps for a response from from_endpoint with value at from_field
486
+ for prior_step in episode.steps:
487
+ if prior_step.step_num >= step.step_num:
488
+ break
489
+ if prior_step.curl_parsed:
490
+ if _path_matches(prior_step.curl_parsed.path, from_endpoint):
491
+ extracted = _extract_field(prior_step.curl_parsed.response_body, from_field)
492
+ if str(extracted) == str(value):
493
+ return True
494
+ return False
495
+
496
+ elif source == "AUTH_FLOW":
497
+ # Value must appear in a session_state field or auth response
498
+ return str(value) in str(episode.session_state.values())
499
+
500
+ elif source == "STATIC":
501
+ expected = param_meta.get("value")
502
+ return str(value) == str(expected)
503
+
504
+ elif source == "DERIVED":
505
+ same_as = param_meta.get("same_as")
506
+ # Value must equal another param in the same call
507
+ # (e.g., quote_id must equal cart_id which is in the path)
508
+ if same_as and step.curl_parsed:
509
+ other_value = _extract_param_from_call(step.curl_parsed, same_as)
510
+ return str(value) == str(other_value)
511
+ return False
512
+
513
+ return False
514
+ ```
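The `from_field` lookup above relies on `_extract_field`, which is not shown. A minimal dotted-path resolver sketch (the `items.0.sku` path syntax is an assumption about the catalog format):

```python
def extract_field(body, dotted_path: str):
    """Resolve a dotted field path (e.g. "items.0.sku") against a
    JSON-like response body; returns None when any hop is missing."""
    current = body
    for key in dotted_path.split("."):
        if isinstance(current, dict):
            current = current.get(key)
        elif isinstance(current, list) and key.isdigit():
            idx = int(key)
            current = current[idx] if idx < len(current) else None
        else:
            return None
    return current
```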
515
+
516
+ ---
517
+
518
+ ## Final Reward Computation
519
+
520
+ ```python
521
+ def compute_reward(
522
+ task_score: float,
523
+ parameter_sourcing_score: float,
524
+ step_rewards: float, # accumulated per-step rewards from README reward table
525
+ auth_obtained: bool,
526
+ task: Task,
527
+ terminated_by: str
528
+ ) -> float:
529
+ """
530
+ Combines task grader score, parameter sourcing, and step-level signals
531
+ into the final episode reward.
532
+ """
533
+ # Map task score to outcome reward (scales with difficulty tier)
534
+ tier_multipliers = {"easy": 1.0, "medium": 1.75, "hard": 2.5}
535
+ tier = task.difficulty
536
+ m = tier_multipliers.get(tier, 1.0)
537
+
538
+ if task_score == 1.0:
539
+ outcome_reward = 2.0 * m # +2.0 (easy), +3.5 (medium), +5.0 (hard)
540
+ elif task_score >= 0.5:
541
+ outcome_reward = 0.5 * m # partial
542
+ elif task_score > 0.0:
543
+ outcome_reward = 0.15 * m # minimal attempt credit
544
+ else:
545
+ outcome_reward = -1.5 # complete failure (same across tiers)
546
+
547
+ # Auth bonus: applies even on task failure — model learned authentication
548
+ auth_bonus = 0.3 if auth_obtained and task_score < 1.0 else 0.0
549
+
550
+ # Parameter sourcing bonus (weighted into outcome, not additive)
551
+ # Only applied when task succeeds partially — avoids rewarding "busy" episodes
552
+ param_bonus = 0.0
553
+ if 0.0 < task_score < 1.0:
554
+ param_bonus = parameter_sourcing_score * 0.5 * m
555
+
556
+ total = outcome_reward + auth_bonus + param_bonus + step_rewards
557
+ return round(total, 4)
558
+ ```
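As a quick sanity check on the arithmetic, the same logic can be reproduced in a self-contained sketch (mirroring the function above, not importing it):

```python
def reward_sketch(task_score, param_score, step_rewards, auth_obtained, tier):
    # Condensed re-statement of compute_reward for spot-checking values
    m = {"easy": 1.0, "medium": 1.75, "hard": 2.5}[tier]
    if task_score == 1.0:
        outcome = 2.0 * m
    elif task_score >= 0.5:
        outcome = 0.5 * m
    elif task_score > 0.0:
        outcome = 0.15 * m
    else:
        outcome = -1.5
    auth_bonus = 0.3 if auth_obtained and task_score < 1.0 else 0.0
    param_bonus = param_score * 0.5 * m if 0.0 < task_score < 1.0 else 0.0
    return round(outcome + auth_bonus + param_bonus + step_rewards, 4)
```

For example, `reward_sketch(1.0, 0.8, 0.85, False, "medium")` gives 4.35: the medium-tier outcome (2.0 × 1.75 = 3.5) plus 0.85 accumulated step rewards, with no auth or parameter bonus on a full success.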
559
+
560
+ **Reward separation guarantee:**
561
+
562
+ | Episode type | Approximate total reward |
563
+ |---|---|
564
+ | Easy task success (perfect param sourcing) | +2.0 to +3.2 |
565
+ | Easy task failure (busy with steps) | −1.5 + max_step_rewards ≈ −0.2 |
566
+ | Hard task success | +5.0 to +7.5 |
567
+ | Hard task failure (some progress) | −1.5 + partial ≈ −0.5 to +1.5 |
568
+
569
+ The terminal outcome reward dominates for complete successes and complete failures. Partial episodes sit in the middle — GRPO can distinguish all three signal zones.
570
+
571
+ ---
572
+
573
+ ## Judge Utilities
574
+
575
+ ```python
576
+ import os
+ import requests
+
+ def _judge_probe(path: str, base_url: str, headers: dict | None = None) -> ProbeResult | None:
577
+ """
578
+ The judge makes its own HTTP calls to verify application state.
579
+ These calls are NOT part of the episode trajectory and do NOT affect rewards.
580
+ Judge probes use a dedicated admin token from environment variables.
581
+ """
582
+ url = base_url.rstrip("/") + path
583
+ admin_headers = {"Authorization": f"Bearer {os.environ['JUDGE_ADMIN_TOKEN']}"}
584
+ if headers:
585
+ admin_headers.update(headers)
586
+     try:
+         resp = requests.get(url, headers=admin_headers, timeout=10)
+     except requests.RequestException:
+         return None
+     try:
+         body = resp.json() if resp.text else None
+     except ValueError:
+         body = resp.text  # non-JSON probe response (e.g. an HTML error page)
+     return ProbeResult(status_code=resp.status_code, body=body)
591
+
592
+ def _fuzzy_match(s1: str, s2: str, threshold: float = 0.85) -> bool:
593
+ """Case-insensitive substring or similarity match."""
594
+ s1, s2 = s1.lower().strip(), s2.lower().strip()
595
+ if s1 == s2 or s1 in s2 or s2 in s1:
596
+ return True
597
+ # Jaccard similarity as fallback
598
+ tokens1, tokens2 = set(s1.split()), set(s2.split())
599
+ if not tokens1 or not tokens2:
600
+ return False
601
+ return len(tokens1 & tokens2) / len(tokens1 | tokens2) >= threshold
602
+ ```
603
+
604
+ ---
605
+
606
+ ## Parameter Pool Alignment
607
+
608
+ The judge is aware that parameter pools are pre-built snapshots of the live application state. For graders that verify values (e.g., SKU, price), the comparison is:
609
+
610
+ - **SKU matching:** exact string match (SKUs are immutable in Magento)
611
+ - **Price matching:** float comparison with ±0.01 tolerance
612
+ - **Product name matching:** fuzzy match with 85% threshold (handles whitespace/casing)
613
+ - **Category name matching:** fuzzy match, verified against live category tree
614
+
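A minimal sketch of the exact-value matchers described above (the fuzzy name matcher is the `_fuzzy_match` utility shown earlier):

```python
def sku_matches(observed: str, expected: str) -> bool:
    # SKUs are immutable in Magento: exact string match only
    return observed == expected

def price_matches(observed: float, expected: float, tol: float = 0.01) -> bool:
    # Float comparison with a ±0.01 tolerance
    return abs(float(observed) - float(expected)) <= tol
```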
615
+ The judge does **not** penalize the model if the live application has drifted from the parameter pool (e.g., a product was deleted). In this case, the episode is flagged as `invalid_episode` in the logs and excluded from the training batch. The `build_parameter_pools.py` script should be re-run to refresh the pool if too many episodes are flagged.
616
+
617
+ ---
618
+
619
+ ## Concurrent Episode Isolation
620
+
621
+ All judge probes use read-only endpoints (GETs, admin token reads) to avoid interfering with other concurrent training episodes. The judge never issues write calls to the live application — it only reads state to verify what the model did.
622
+
623
+ Write isolation (preventing two concurrent episodes from interfering with each other) is handled at the training harness level, not the judge level:
624
+
625
+ - For **Easy** tasks (read-only): no isolation needed
626
+ - For **Medium** tasks (cart operations): each episode uses a fresh guest cart; carts are session-scoped and do not conflict
627
+ - For **Hard** tasks (post creation, product creation): episode IDs are embedded in the task params (e.g., the episode ID is appended to the SKU: `{sku}_{episode_id}`) to prevent naming collisions
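The SKU scoping scheme is a one-liner, sketched here for completeness (function name is illustrative):

```python
def episode_scoped_sku(base_sku: str, episode_id: str) -> str:
    # Embed the episode ID so concurrent episodes never create colliding SKUs
    return f"{base_sku}_{episode_id}"
```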
628
+
629
+ ---
630
+
631
+ ## Logging and Diagnostics
632
+
633
+ Every episode produces a structured log entry:
634
+
635
+ ```json
636
+ {
637
+ "episode_id": "ep_1234",
638
+ "template_id": 3,
639
+ "task_description": "Add 'Radiant Tee' to a guest cart",
640
+ "task_score": 1.0,
641
+ "parameter_sourcing_score": 0.8,
642
+ "auth_obtained": false,
643
+   "reward": 4.35,
644
+ "step_rewards": 0.85,
645
+ "terminated_by": "done_call",
646
+ "total_steps": 7,
647
+ "grader_details": {
648
+ "cart_id_found": "cart-abc123",
649
+ "item_confirmed_in_cart": true,
650
+ "item_sku": "MH01"
651
+ },
652
+ "parameter_sourcing_details": [
653
+ {"step": 5, "param": "cartId", "source": "PREV_CALL", "correct": true},
654
+ {"step": 7, "param": "cartItem.sku", "source": "PREV_CALL", "correct": true},
655
+ {"step": 7, "param": "cartItem.quote_id", "source": "DERIVED", "correct": true}
656
+ ]
657
+ }
658
+ ```
659
+
660
+ These logs drive the training analytics and help identify which parameter sourcing patterns the model is still learning.
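One useful aggregation over these logs (a sketch; field names taken from the entry above) is per-source parameter accuracy, which shows which sourcing patterns still fail:

```python
from collections import defaultdict

def sourcing_accuracy(episode_logs: list[dict]) -> dict[str, float]:
    # Fraction of correctly sourced parameters, grouped by source type
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for log in episode_logs:
        for detail in log.get("parameter_sourcing_details", []):
            totals[detail["source"]] += 1
            hits[detail["source"]] += int(detail["correct"])
    return {src: hits[src] / totals[src] for src in totals}
```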
README.md CHANGED
@@ -1,10 +1,652 @@
1
  ---
2
  title: HARvestGym
3
- emoji:
4
  colorFrom: blue
5
- colorTo: pink
6
  sdk: docker
7
  pinned: false
8
  ---
9
 
10
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
1
  ---
2
  title: HARvestGym
3
+ emoji: 🕸️
4
  colorFrom: blue
5
+ colorTo: purple
6
  sdk: docker
7
  pinned: false
8
+ tags:
9
+ - openenv
10
+ - reinforcement-learning
11
+ - api-agent
12
+ - web-tasks
13
+ base_path: /web
14
+ ---
15
+
16
+ # HARvestGym
17
+
18
+ *Core idea: Trains LLMs to reverse-engineer and complete web tasks through raw HTTP APIs. No browser. No docs. Just a URL and a task.*
19
+
20
+ ### Can a small model learn to explore the API surface of any web application — and complete real tasks through those APIs, without ever opening a browser?
21
+
22
+ Web applications are full of APIs. Every click in a browser triggers an HTTP call with a precise schema, a specific authentication header, an exact sequence of prerequisites. **HARvestGym trains a small model to do all of that directly** — given a task and a URL, it discovers the relevant endpoints, understands what each one needs, chains the calls in the right order, and completes the task without any browser.
23
+
24
+ The model starts with nothing: no schema, no documentation, no endpoint list. It uses tools to explore — issuing requests, inspecting responses, building up its own understanding of how the application works. This is what a developer does when they reverse-engineer an API. The model learns to do the same.
25
+
26
+ Given a URL and a task string, the agent must discover which endpoints exist, figure out schemas and parameter dependencies, and execute the right sequence. Zero prior knowledge.
27
+
28
+ ## What the Model (Policy) Is Learning
29
+
30
+ Given: a natural language task + a live web application URL. No prior knowledge of the application.
31
+
32
+ The model calls `browser_agent` first — this returns the list of API endpoints the browser used to complete the task. The model now has a map: it knows what endpoints exist. What it does not know:
33
+
34
+ - which of those endpoints are actually needed for this specific task
35
+ - in what order they must be called (you cannot add to a cart before the cart exists)
36
+ - where each required parameter value comes from
37
+ - how to re-authenticate if a session expires mid-episode
38
+
39
+ The model must learn to:
40
+
41
+ 1. **Discover endpoints** — by using a browser agent tool that completes the same task in a real browser while recording all network traffic, then filtering that traffic to extract only the meaningful application API calls (stripping out CDN requests, analytics, static assets). The browser agent runs once and generates the raw discovery data; the model uses this as its starting context.
42
+ 2. **Select the right endpoints** — from the browser agent's list, identify the subset relevant to the current task (not every observed endpoint is needed)
43
+ 3. **Sequence calls correctly** — determine the prerequisite order (create cart → find product → add item), including calls that must happen before others even though the task description doesn't say so
44
+ 4. **Thread parameters** — this is the hardest part. APIs form a dependency graph:
45
+ - Some values come from a previous response (`cart_id` from step 1 → path param in step 3)
46
+ - Some values come from the authentication flow (`form_key`, `Bearer token` → header in every subsequent call)
47
+ - Some values come from the task description (`product name` → search query → `sku` → body of add-item call)
48
+ - The ground truth catalog defines these relationships precisely; the model learns to navigate them
49
+ 5. **Handle auth and errors** — detect 401 / session-expired responses, re-authenticate, and continue; interpret 4xx errors and adjust the next call accordingly
50
+
51
+ ---
52
+
53
+ ## Architecture
54
+
55
+ ```
56
+ TRAINING LOOP
57
+ ┌─────────────────────────────────────────────────────────────────────────┐
58
+ │ │
59
+ │ Task + App URL │
60
+ │ │ │
61
+ │ ▼ │
62
+ │ ┌────────────────────────────────────────────────────────────────┐ │
63
+ │ │ Policy Model (RL Agent) │ │
64
+ │ │ small model — no prior knowledge of the app │ │
65
+ │ │ │ │
66
+ │ │ Observation: task + history + session_state + last_result │ │
67
+ │ │ │ │
68
+ │ │ Step 1 ──► browser_agent(task, url) │ │
69
+ │ │ Step 2+ ──► search_endpoints(query) │ │
70
+ │ │ ──► curl_exec(command) │ │
71
+ │ │ ──► search_episode_data(query) │ │
72
+ │ │ ──► done(result) │ │
73
+ │ └────────┬───────────────────────────────────────────────────────┘ │
74
+ │ │ │
75
+ │ ┌──────┴──────────────────────────────┐ │
76
+ │ │ │ │
77
+ │ ▼ ▼ │
78
+ │ ┌─────────────────────┐ ┌─────────────────────────────────────┐ │
79
+ │ │ Browser Agent │ │ Environment │ │
80
+ │ │ (step 1 only) │ │ │ │
81
+ │ │ │ │ • Executes curl_exec via subprocess│ │
82
+ │ │ Training: │ │ • Auto-injects session cookies │ │
83
+ │ │ Load pre-recorded │ │ • Smart-truncates response bodies │ │
84
+ │ │ cached HAR from │ │ • Indexes full responses into │ │
85
+ │ │ disk or launch │ │ per-episode BM25 + GEMMA store │ │
86
+ │ │ on real browser │ │ • Manages session_state: cookies, │ │
87
+ │ │ │ │ CSRF tokens, auth headers │ │
88
+ │ │ Inference: │ └──────────────┬──────────────────────┘ │
89
+ │ │ Launch real browser│ │ │
90
+ │ │ via Playwright + │ │ HTTP calls (always live) │
91
+ │ │ bu-30b-a3b-preview │ ▼ │
92
+ │ │ │ ┌─────────────────────────────────────┐ │
93
+ │ │ Both paths produce: │ │ WebArena EC2 (live apps) │ │
94
+ │ │ • Filtered HAR │ │ │ │
95
+ │ │ • OpenAPI-like spec│ │ :7770 Shopping (Magento 2) │ │
96
+ │ │ • GEMMA embeddings │ │ :7780 Shopping Admin │ │
97
+ │ │ for search_ │ │ :9999 Forum (Postmill) │ │
98
+ │ │ endpoints() │ │ :8888 Wikipedia (Kiwix) │ │
99
+ │ └─────────────────────┘ │ :3000 Map (OpenStreetMap) │ │
100
+ │ └──────────────┬──────────────────────┘ │
101
+ │ │ │
102
+ │ │ episode trajectory │
103
+ │ ▼ │
104
+ │ ┌─────────────────────────────────────┐ │
105
+ │ │ Deterministic Judge │ │
106
+ │ │ │ │
107
+ │ │ Per-template programmatic grader: │ │
108
+ │ │ • Inspects episode trajectory │ │
109
+ │ │ • Optionally probes live app state │ │
110
+ │ │ • Verifies parameter sourcing │ │
111
+ │ │ (TASK_SPEC / PREV_CALL / │ │
112
+ │ │ AUTH_FLOW / STATIC / DERIVED) │ │
113
+ │ │ • Scores [0.0 → 1.0] │ │
114
+ │ └──────────────┬──────────────────────┘ │
115
+ │ │ │
116
+ │ ▼ │
117
+ │ ┌─────────────────────────────────────┐ │
118
+ │ │ Reward Signal │ │
119
+ │ │ │ │
120
+ │ │ Per-step: │ │
121
+ │ │ +0.2 valid API call (2xx) │ │
122
+ │ │ +0.1 new path explored │ │
123
+ │ │ +0.25 correct param sourcing │ │
124
+ │ │ −0.15 repeated identical call │ │
125
+ │ │ −0.3 browser_agent called again │ │
126
+ │ │ │ │
127
+ │ │ Episode end: │ │
128
+ │ │ +2.0–+5.0 task complete (easy→hard│ │
129
+ │ │ −1.5 task failed │ │
130
+ │ └──────────────┬──────────────────────┘ │
131
+ │ │ │
132
+ │ ▼ │
133
+ │ ┌─────────────────────────────────────┐ │
134
+ │ │ GRPO (via HF TRL) │ │
135
+ │ │ │ │
136
+ │ │ 8 parallel rollouts per prompt │ │
137
+ │ │ Computes advantages without │ │
138
+ │ │ a value function │ │
139
+ │ │ Updates policy weights │ │
140
+ │ └─────────────────────────────────────┘ │
141
+ │ │ │
142
+ │ └──► updated Policy Model │
143
+ └─────────────────────────────────────────────────────────────────────────┘
144
+ ```
145
+
146
+ ### Data Flow: Browser Agent → Search Index → Execution
147
+
148
+ ```
149
+ HAR File (cached using Browser Agent) ──► filter_har_entries()
150
+
151
+
152
+ drop: CDN, analytics, static assets
153
+ keep: {method, path, request_body,
154
+ response_body, status_code}
155
+
156
+
157
+ extract_openapi_spec()
158
+ → structured endpoint catalog
159
+ {path, method, params, auth, response_fields}
160
+
161
+ ┌──────┴──────┐
162
+ │ │
163
+ ▼ ▼
164
+ build_GEMMA_embeddings return summary list
165
+ (search_endpoints to RL agent:
166
+ index — full schemas) [GET /products,
167
+ POST /guest-carts, ...]
168
+
169
+
170
+ search_endpoints("create guest cart")
171
+ → top-3 endpoint schemas with:
172
+ • path params + sources
173
+ • body params + sources
174
+ • auth requirements
175
+ • response field names
176
+ ```
177
+
178
+ ### Episode Response Indexing
179
+
180
+ ```
181
+ curl_exec(command)
182
+
183
+ ├──► subprocess: execute against live EC2
184
+
185
+ ├──► index_full_response()
186
+ │ BM25 index ── keyword match (IDs, SKUs, tokens)
187
+ │ GEMMA embed ── semantic match (paraphrases)
188
+ │ (indexes BEFORE truncation — all items stored)
189
+
190
+ └──► smart_truncate()
191
+ non-JSON HTML → 3,000 chars
192
+ JSON primitive → never truncated
193
+ error (4xx/5xx) → never truncated
194
+ small JSON → returned as-is
195
+ large array → first 2 items shown
196
+ + _list_truncated annotation
197
+ + hint to call search_episode_data()
198
+ ```
199
+
200
+ ### Parameter Dependency Graph (what the judge tracks)
201
+
202
+ ```
203
+ Task: "Add 'Radiant Tee' to a guest cart"
204
+
205
+ ┌─────────────────────────────────────────────────────────┐
206
+ │ TASK_SPEC ──────────────────────────────────────────┐ │
207
+ │ "Radiant Tee" (product name) │ │
208
+ │ │ │ │
209
+ │ ▼ │ │
210
+ │ GET /rest/V1/products?name=Radiant+Tee │ │
211
+ │ → items[0].sku = "MH01" (PREV_CALL) ──┐ │ │
212
+ │ │ │ │
213
+ │ POST /rest/V1/guest-carts │ │ │
214
+ │ → body = "cart-abc123" (PREV_CALL) ──┼──┼─►│
215
+ │ │ │ │
216
+ │ POST /rest/V1/guest-carts/{cartId}/items │ │ │
217
+ │ path: cartId ◄────── "cart-abc123" ───────┘ │ │
218
+ │ body: sku ◄────── "MH01" ─────────┘ │
219
+ │ body: qty ◄────── TASK_SPEC (quantity) │
220
+ │ body: quote_id ◄────── DERIVED (= cartId) │
221
+ └─────────────────────────────────────────────────────────┘
222
+
223
+ Source types tracked by the judge:
224
+ TASK_SPEC — value stated in the task string
225
+ PREV_CALL — value from a prior curl response in this episode
226
+ AUTH_FLOW — value from a session/token auth step
227
+ STATIC — fixed application constant (e.g. store_id = 1)
228
+ DERIVED — computed from another param (e.g. quote_id = cart_id)
229
+ ```
230
+
231
+ ### Curriculum: Complexity Tiers
232
+
233
+ ```
234
+ Easy ──────────────────────── graduate when P(success) > 0.7
235
+ │ Single call, no auth │
236
+ │ Templates 1, 2 │
237
+ │ 1 API call required │
238
+ │ ▼
239
+ Medium ──────────────────────── graduate when P(success) > 0.7
240
+ │ Auth + 1–2 dependent calls │
241
+ │ Templates 3, 4 │
242
+ │ 2–3 API calls required │
243
+ │ ▼
244
+ Hard ────────────────────────── final tier
245
+ Multi-step chain, full auth, ID threading
246
+ Templates 5, 6, 7
247
+ 4–8+ API calls required
248
+ Reward scaling: ×2.5 vs Easy
249
+ ```
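The graduation rule can be sketched with a rolling success window. The 0.7 threshold comes from the diagram above; the window size of 50 is an assumption:

```python
from collections import deque

class CurriculumGate:
    """Graduate a tier once the rolling success rate exceeds the threshold."""

    def __init__(self, threshold: float = 0.7, window: int = 50):
        self.threshold = threshold
        self.results: deque = deque(maxlen=window)

    def record(self, success: bool) -> None:
        self.results.append(success)

    def should_graduate(self) -> bool:
        # Require a full window of evidence before comparing to the threshold
        if len(self.results) < (self.results.maxlen or 0):
            return False
        return sum(self.results) / len(self.results) > self.threshold
```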
250
+
251
+ ### The RL Agent's Tool: Browser Agent
252
+
253
+ The RL agent has access to a **browser agent tool** powered by [browser-use/bu-30b-a3b-preview](https://huggingface.co/browser-use/bu-30b-a3b-preview) — a 30B MoE vision-language model (3B active parameters) purpose-built for web task completion, served via the [browser-use](https://github.com/browser-use/browser-use) library on Playwright. When the RL agent calls this tool with a natural language task, the browser agent:
254
+
255
+ 1. Opens the target application in a real browser
256
+ 2. Completes the task by clicking, typing, and navigating — exactly as a human would
257
+ 3. All HTTP traffic is intercepted via Playwright network events
258
+ 4. Returns the intercepted traffic, filtered down to only the application's own API calls
259
+
260
+ The filtering step strips analytics pings, CDN requests, font loads, JS/CSS bundles and returns only `{method, path, request_body, response_body, status_code}` tuples for the app's actual API endpoints.
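The exact filter predicate lives in the browser agent implementation; a heuristic sketch of what it might look like (the extension list and noise-host substrings are assumptions, not the shipped rules):

```python
from urllib.parse import urlparse

STATIC_EXT = (".js", ".css", ".png", ".jpg", ".svg", ".woff", ".woff2", ".ico")
NOISE_HOSTS = ("analytics", "doubleclick", "fonts.", "cdn.")

def keep_entry(url: str, content_type: str) -> bool:
    # Drop static assets by extension and third-party noise by host substring;
    # keep JSON responses and Magento-style REST paths.
    parsed = urlparse(url)
    if parsed.path.lower().endswith(STATIC_EXT):
        return False
    if any(h in parsed.netloc.lower() for h in NOISE_HOSTS):
        return False
    return "json" in content_type.lower() or parsed.path.startswith("/rest/")
```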
261
+
262
+ **Training vs. inference — what gets cached:**
263
+
264
+ - The browser agent output (filtered endpoint list) is pre-computed once per task and cached. During training, the RL model receives this cached result instantly — no live browser session runs.
265
+ - The RL agent's own `curl_exec` calls **always hit the real live WebArena server** — during both training and inference. No API response is mocked or cached.
266
+ - At inference, the browser agent runs live to handle novel tasks or changed application state.
267
+
268
+ Full architecture and code: [BROWSER_AGENT.md](BROWSER_AGENT.md)
269
+
270
+ ### Ground Truth: From the Codebase, Not the Browser
271
+
272
+ The browser agent shows *what* API calls happen. It does not explain *why* — specifically, it does not document where each parameter comes from or what field constraints exist. That comes from the application codebase.
273
+
274
+ For each WebArena application, we perform a one-time static analysis (using a large model against the Docker image source) to produce a **ground truth API catalog** — a precise, hard-coded document specifying:
275
+
276
+ ```
277
+ endpoint: POST /rest/V1/guest-carts/{cartId}/items
278
+ method: POST
279
+ auth: None (guest cart)
280
+ path_params:
281
+ cartId: [string] obtained from: POST /rest/V1/guest-carts → response body
282
+ body:
283
+ cartItem.sku: [string] the product's SKU, from: GET /rest/V1/products → items[].sku
284
+ cartItem.qty: [number] quantity, from: task specification
285
+ cartItem.quote_id: [string] same as cartId
286
+ ```
287
+
288
+ This is what the judge compares against. The ground truth defines the complete parameter relationship graph for each application.
289
+
290
+ Full extraction process: [GROUND_TRUTH_EXTRACTION.md](GROUND_TRUTH_EXTRACTION.md)
291
+
292
+ ### The Training Loop
293
+
294
+ ```
295
+ Task (natural language) + App URL
296
+
297
+
298
+ Policy Model (sees: task + history of all prior actions/results + session_state + findings)
299
+ │ calls tools to explore and execute
300
+ ├─► browser_agent(task, url) → filtered API call list (cached during training)
301
+ ├─► search_endpoints(query) → full schema for a specific endpoint
302
+ ├─► curl_exec(command) → execute HTTP call, get {status, headers, body}
303
+ ├─► search_episode_data(q) → search prior response bodies in this episode
304
+ └─► done(result) → declare task complete
305
+
306
+
307
+ Live WebArena App (EC2) ←─── real HTTP responses (always live, never mocked)
308
+
309
+
310
+ Judge (compares against ground truth API catalog)
311
+
312
+
313
+ Reward Signal ──► GRPO ──► updated policy
314
+ ```
315
+
316
+ ---
317
+
318
+ ## Target Applications
319
+
320
+ All running on a single AWS EC2 instance. Real production software, no simulation.
321
+
322
+
323
+ | App | Port | URL | Software |
324
+ | -------------- | ---- | -------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------- |
325
+ | Shopping | 7770 | [http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7770/](http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7770/) | Magento 2 — open-source e-commerce platform |
326
+ | Shopping Admin | 7780 | [http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7780/](http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7780/) | Magento 2 Admin — backend panel for the same store |
327
+ | Forum | 9999 | [http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:9999/](http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:9999/) | Postmill — open-source Reddit-like link aggregation forum |
328
+ | Wikipedia | 8888 | [http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:8888/](http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:8888/) | Kiwix — read-only offline mirror of English Wikipedia |
329
+ | Map | 3000 | [http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:3000/](http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:3000/) | OpenStreetMap — open-source collaborative mapping platform |
330
+
331
+
332
+ Source: [WebArena environment_docker](https://github.com/web-arena-x/webarena/tree/main/environment_docker)
333
+
334
+ ---
335
+
336
+ ## Spaces
337
+
338
+ ### Observation Space
339
+
340
+ What the model sees at each step:
341
+
342
+ ```python
343
+ class Observation(BaseModel):
344
+ task: str # Natural language task
345
+ app_base_url: str # Root URL of the target application
346
+ last_tool_result: Any # Result of last tool call:
347
+ # search_endpoints → list of endpoint schema strings
348
+ # curl_exec → {status_code, headers, body (smart-truncated)}
349
+ # search_episode_data → list of matching JSON object strings
350
+ history: list[dict] # Full episode trajectory: list of {action, tool_result} pairs
351
+ # from all prior steps. The model sees what it already tried,
352
+ # enabling value threading (read a cart_id from step 2's response
353
+ # and use it in step 5's curl call) and loop avoidance.
354
+ session_state: dict # Auto-managed by environment: cookies, tokens, CSRF values
355
+ # extracted from all prior HTTP Set-Cookie and response bodies
356
+ # e.g. {"PHPSESSID": "abc", "form_key": "xyz", "cart_id": "123"}
357
+ step_count: int
358
+ max_steps: int # 20
359
+ ```
360
+
361
+ `session_state` is maintained by the environment. The model never parses `Set-Cookie` headers — the environment extracts tokens automatically and makes them available. The model decides *when* to authenticate and *which* session values to use; the environment handles *extraction*.
362
+
363
+ **curl execution:** The agent outputs a curl command string. The environment parses it and executes it via subprocess against the live EC2 server — the agent machine never has a direct network connection to WebArena. The environment also injects cookies from `session_state` automatically before each call.
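A minimal sketch of the cookie-injection step, assuming `session_state` stores raw name/value pairs and that the hypothetical `COOKIE_KEYS` set marks which entries are cookies (the real environment learns this from `Set-Cookie` headers):

```python
COOKIE_KEYS = {"PHPSESSID", "frontend"}  # hypothetical: which state entries are cookies

def inject_cookies(curl_cmd: str, session_state: dict) -> str:
    # Append a -b flag carrying all known session cookies before execution
    cookies = {k: v for k, v in session_state.items() if k in COOKIE_KEYS}
    if not cookies:
        return curl_cmd
    cookie_str = "; ".join(f"{k}={v}" for k, v in sorted(cookies.items()))
    return f"{curl_cmd} -b '{cookie_str}'"
```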
364
+
365
+ **Response truncation — smart array truncation, not byte cutoff:** HTTP response bodies are processed by a pure Python function before being returned to the model. Rules applied in order:
366
+
367
+ 1. **Non-JSON body** (HTML, CSS, JS, plain text): truncate to 3,000 characters. HTML from form-serving pages (login, post creation) is kept longer than pure prose because CSRF tokens and `<input>` fields are embedded inside the markup and the model needs to locate them. See the [HTML / Form-Submission Handling](#html--form-submission-handling) section below for how the model is expected to work with HTML responses.
368
+ 2. **JSON primitive** (string, number, boolean): never truncated — these are tokens, IDs, confirmations.
369
+ 3. **Error response (4xx / 5xx)**: never truncated — the model needs every word to self-correct.
370
+ 4. **JSON object or array with no large arrays** (< 3 dict items per array): returned as-is.
371
+ 5. **JSON with a large array field** (≥ 3 dict items): keep first 2 items, drop the rest, and add a `_list_truncated` annotation:
372
+
373
+ ```json
374
+ {
375
+ "items": [
376
+ {"sku": "MH01", "name": "Radiant Tee", "price": 22.0},
377
+ {"sku": "MH02", "name": "Breathe-Easy Tank", "price": 34.0}
378
+ ],
379
+ "_list_truncated": {
380
+ "field": "items",
381
+ "shown": 2,
382
+ "total": 50,
383
+ "note": "Showing 2 of 50 items. Use search_episode_data() to find a specific item from this response."
384
+ }
385
+ }
386
+ ```
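Rules 1, 2, 4, and 5 can be sketched as a pure function (rule 3 needs the status code, which the real implementation receives alongside the body; the limits here mirror the ones stated above):

```python
import json

def smart_truncate(body: str, html_limit: int = 3000, keep: int = 2):
    try:
        data = json.loads(body)
    except ValueError:
        return body[:html_limit]                  # rule 1: non-JSON → 3,000 chars
    if not isinstance(data, (dict, list)):
        return data                               # rule 2: JSON primitive, never truncated
    if isinstance(data, dict):
        # rule 5: find the first array field with >= 3 dict items
        big = next((f for f, v in data.items()
                    if isinstance(v, list) and sum(isinstance(i, dict) for i in v) >= 3),
                   None)
        if big is not None:
            total = len(data[big])
            data[big] = data[big][:keep]
            data["_list_truncated"] = {
                "field": big, "shown": keep, "total": total,
                "note": f"Showing {keep} of {total} items. Use search_episode_data().",
            }
    return data                                   # rule 4: small JSON returned as-is
```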
387
+
388
+ **Episode response indexing:** Every `curl_exec` call indexes the full request and response bodies into a per-episode hybrid index (BM25 for keyword matching + GEMMA semantic embeddings for paraphrase handling). When a list is truncated, all items (not just the 2 shown) are indexed. The model can retrieve any specific object using `search_episode_data("keyword or natural language query")` without needing a filtered API endpoint to exist. See `TOOLS.md` for the full indexing algorithm.
389
+
390
+ ### Action Space
391
+
392
+ The model outputs a single tool call per step. Full technical specifications for all tools (document construction, truncation implementation, index architecture, caveats) are in [TOOLS.md](./TOOLS.md).
393
+
394
+
395
+ | Tool | Input | What It Does | Output |
396
+ | ---------------------------- | --------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------ |
397
+ | `browser_agent(task, url)` | Task string + app base URL | Checks for pre-recorded HAR; if found, processes it — otherwise launches live browser to perform task and record traffic. Extracts OpenAPI-like spec, builds GEMMA embeddings for search. | Summary list of API endpoint names + methods (e.g. `GET /products`). No schemas/headers. Use `search_endpoints()` for details. |
398
+ | `search_endpoints(query)` | Natural language query | Semantic search over GEMMA-embedded endpoint spec built by `browser_agent`. Returns full parameter details for matching endpoints. | Top-3 endpoint schemas (method, path, auth, params with sources, response fields) |
399
+ | `curl_exec(command)` | Full curl command string | Executes HTTP call against live EC2 server, indexes full response into episode BM25 store, returns truncated observation. | `{status_code, headers, body}` — body smart-truncated; full body indexed to episode store |
400
+ | `search_episode_data(query)` | Keyword or natural language query | Hybrid BM25 + GEMMA semantic search over all request/response bodies from prior `curl_exec` calls in this episode. | Top-5 JSON objects from this episode's request/response history |
401
+ | `done(result?)` | Optional result string | Signals task complete, triggers judge evaluation. | Ends episode |
402
+
403
+
404
+ `browser_agent` is called **exactly once per episode at step 1**. During training, it loads a cached pre-recorded HAR file (if available); at inference, it launches a live browser session. It returns the deduplicated list of API endpoint patterns observed in the network traffic. **If called again after step 1, the call executes normally but a −0.3 penalty is applied to the reward.** `search_endpoints` then provides the full schema for any specific endpoint the model wants to call — searching the GEMMA embeddings built by `browser_agent` from the HAR data.
405
+
406
+ `curl_exec` is the primary HTTP action — one string that encodes method, URL, headers, and body together, exactly as API documentation is written. This lets the model leverage its pretrained knowledge of `curl` syntax while producing calls that are self-documenting.
407
+
408
+ ```bash
409
+ # Step 1 — Discover which endpoint creates a guest cart
410
+ # (model calls search_endpoints first, sees: POST /rest/V1/guest-carts)
411
+
412
+ # Step 2 — Create guest cart
413
+ curl -X POST 'http://ec2-.../rest/V1/guest-carts' -H 'Content-Type: application/json'
414
+ # → body: "cart-abc123" (plain string — never truncated)
415
+
416
+ # Step 3 — Find the product SKU (list response, truncated to 2 items + note)
417
+ curl 'http://ec2-.../rest/V1/products?searchCriteria[filter_groups][0][filters][0][field]=name&searchCriteria[filter_groups][0][filters][0][value]=Radiant+Tee'
418
+ # → body: {"items":[{"sku":"MH01","name":"Radiant Tee","price":22.0}],"total_count":1}
419
+ # (1 item — not truncated; if 200 items, all 200 indexed, 2 shown in context)
420
+
421
+ # Step 4 — Add item (model reads cart-abc123 from step 2, MH01 from step 3 — all in history)
422
+ curl -X POST 'http://ec2-.../rest/V1/guest-carts/cart-abc123/items' \
423
+ -H 'Content-Type: application/json' \
424
+ -d '{"cartItem":{"sku":"MH01","qty":1,"quote_id":"cart-abc123"}}'
425
+ ```
426
+
427
+ Values from prior responses (cart IDs, SKUs, tokens) are threaded directly from the growing episode history. `session_state` tokens (cookies, CSRF values) are auto-injected by the environment. If a list response was truncated and the model needs a specific item not shown in the 2-item sample, it calls `search_episode_data("Radiant Tee sku")` — all 200 items are indexed, even though only 2 were shown in context.
428
+
429
+ ### Prompt Structure
430
+
431
+ ```
432
+ SYSTEM: You are an API agent. Complete the task using only the tools available:
433
+ browser_agent, search_endpoints, curl_exec, search_episode_data, done.
434
+ When a response is HTML, look for JSON data embedded in <script> tags or
435
+ extract values from <input> fields. CSRF tokens appear as hidden inputs:
436
+ <input type="hidden" name="_csrf_token" value="XYZ">
437
+
438
+ TASK: Add "Radiant Tee" to a guest cart at http://ec2-16-59-2-56.../
439
+
440
+ [session_state: {}]
441
+
442
+ STEP 1 ACTION: browser_agent("Add Radiant Tee to a guest cart", "http://ec2-...:7770/")
443
+ STEP 1 RESULT: {"app": "shopping", "endpoints": [
444
+ "POST /rest/V1/guest-carts",
445
+ "GET /rest/V1/products",
446
+ "POST /rest/V1/guest-carts/{id}/items",
447
+ ...
448
+ ], "note": "Use search_endpoints() to get full schema for any of these."}
449
+
450
+ STEP 2 ACTION: search_endpoints("create guest cart")
451
+ STEP 2 RESULT: ["endpoint: POST /rest/V1/guest-carts | auth: none | returns: string (cartId)", ...]
452
+
453
+ STEP 3 ACTION: curl_exec("curl -X POST 'http://ec2-.../rest/V1/guest-carts' -H 'Content-Type: application/json'")
454
+ STEP 3 RESULT: {status_code: 200, body: "cart-abc123"}
455
+
456
+ STEP 4 ACTION: search_endpoints("find product by name get sku")
457
+ STEP 4 RESULT: ["endpoint: GET /rest/V1/products | query: searchCriteria filters | returns: .items[].sku .items[].name", ...]
458
+
459
+ STEP 5 ACTION: curl_exec("curl 'http://ec2-.../rest/V1/products?searchCriteria[filter_groups][0][filters][0][field]=name&searchCriteria[filter_groups][0][filters][0][value]=Radiant+Tee'")
460
+ STEP 5 RESULT: {status_code: 200, body: {"items":[{"sku":"MH01","name":"Radiant Tee","price":22.0}],"total_count":1}}
461
+
462
+ STEP 6 ACTION: search_endpoints("add item to guest cart cartId")
463
+ STEP 6 RESULT: ["endpoint: POST /rest/V1/guest-carts/{cartId}/items | path: cartId from POST /rest/V1/guest-carts | body: cartItem.sku, cartItem.qty, cartItem.quote_id (same as cartId)", ...]
464
+
465
+ STEP 7 ACTION: curl_exec("curl -X POST 'http://ec2-.../rest/V1/guest-carts/cart-abc123/items' -H 'Content-Type: application/json' -d '{\"cartItem\":{\"sku\":\"MH01\",\"qty\":1,\"quote_id\":\"cart-abc123\"}}'")
466
+ STEP 7 RESULT: {status_code: 200, body: {"item_id": 5, "sku": "MH01", "qty": 1}}
467
+
468
+ → generate STEP 8: done("Radiant Tee added to cart")
469
+ ```
470
+
471
+ `browser_agent` at step 1 gives the model the full endpoint landscape upfront — it can see `/rest/V1/guest-carts` and `/rest/V1/products` immediately and plan the call sequence before making any HTTP calls. `search_endpoints` fills in the exact parameter schemas. Value threading (`"MH01"`, `"cart-abc123"`) happens through the growing history — if step 5 had returned 200 products truncated to 2, the model would call `search_episode_data("Radiant Tee sku")` to retrieve `MH01` from the episode index.
472
+
473
+ ### Parameter Relationship Graph (What the Judge Knows)
474
+
475
+ The judge holds a complete dependency map for each task:
476
+
477
+ ```
478
+ Parameter Source Types:
479
+ TASK_SPEC — value given directly in the task (e.g., "product #42")
480
+ PREV_CALL — value from a prior API response in this episode
481
+ AUTH_FLOW — value obtained during authentication (session token, CSRF key)
482
+ STATIC — fixed value known from the application (e.g., store_id = 1)
483
+ DERIVED — computed from another value (e.g., cart_id = quote_id)
484
+ ```
485
+
486
+ For each task, the judge knows which parameters fall into which category, and whether the model correctly sourced each value. This is how partial credit works — the model gets reward for correctly threading a `cart_id` even if the final call had a wrong field elsewhere.
487
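The source taxonomy above can be made concrete as a per-task dependency map. A minimal sketch of what the judge might hold for the guest-cart task — the structure and field names here are illustrative, not the actual catalog schema:

```python
# Hypothetical judge-side parameter map for "Add {product_name} to a guest cart".
# Source types mirror the taxonomy above; field names are illustrative.
PARAM_MAP = {
    "POST /rest/V1/guest-carts/{cartId}/items": {
        "cartId":            {"source": "PREV_CALL", "from": "POST /rest/V1/guest-carts"},
        "cartItem.sku":      {"source": "PREV_CALL", "from": "GET /rest/V1/products"},
        "cartItem.qty":      {"source": "TASK_SPEC"},
        "cartItem.quote_id": {"source": "DERIVED",   "from": "cartId"},
    },
}

def sourcing_reward(endpoint: str, param: str, observed_source: str) -> float:
    """+0.25 when the model sourced a value from the expected place, else 0."""
    expected = PARAM_MAP.get(endpoint, {}).get(param, {}).get("source")
    return 0.25 if observed_source == expected else 0.0
```

This is how partial credit stays granular: each parameter is scored independently against its expected source type.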
+
488
+ ### Reward Space
489
+
490
+ **Per-step:**
491
+
492
+
493
+ | Signal | Value | Trigger |
494
+ | ---------------------------- | ----- | --------------------------------------------------------------------------------------------------- |
495
+ | Valid API call (2xx) | +0.2 | `curl_exec` returns 2xx status |
496
+ | New path called this episode | +0.1 | `curl_exec` normalized path not called before in this episode — discourages looping on one endpoint |
497
+ | Correct parameter sourcing | +0.25 | judge: value in curl call came from the correct source type |
498
+ | Session value correctly used | +0.1 | auth token/cookie present and correct in curl call |
499
+ | Repeated identical call | −0.15 | exact duplicate curl command issued twice |
500
+ | browser_agent called again | −0.3 | `browser_agent` called after step 1 — call executes normally, penalty applied to reward |
501
+ | Malformed curl command | −0.1 | curl cannot be parsed or executed by the environment |
502
+ | 4xx response (recoverable) | −0.05 | call failed but episode continues |
503
+
504
+
505
+ Note: `search_endpoints`, `search_episode_data`, and `done` carry no direct per-step reward. Using `search_endpoints` to find the correct schema is indirectly rewarded by enabling correct parameter sourcing (+0.25) in the curl call that follows. `search_episode_data` is indirectly rewarded by allowing the model to retrieve the correct value to place in the next curl command.
506
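A minimal sketch of how the per-step signals above could be summed into one scalar per step — event names are assumptions, the constants mirror the table:

```python
# Per-step reward accumulation; values mirror the table above.
STEP_REWARDS = {
    "valid_2xx":            0.2,
    "new_path":             0.1,
    "correct_sourcing":     0.25,
    "session_used":         0.1,
    "duplicate_call":      -0.15,
    "browser_agent_again": -0.3,
    "malformed_curl":      -0.1,
    "recoverable_4xx":     -0.05,
}

def step_reward(events: set[str]) -> float:
    # Sum the rewards for every signal that fired on this step.
    return round(sum(STEP_REWARDS[e] for e in events), 4)
```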
+
507
+ **Episode end:**
508
+
509
+
510
+ | Outcome | Reward |
511
+ | ----------------------------------------------------------- | ------------------------------------------ |
512
+ | Task completed correctly | +2.0 to +5.0 (scales with difficulty tier) |
513
+ | Partial completion (right endpoints, wrong param threading) | +0.5 to +1.5 |
514
+ | Authentication correctly obtained (even if task fails) | +0.3 |
515
+ | Timeout / task failed entirely | −1.5 |
516
+
517
+
518
+ Target signal separation: successful episodes `+3` to `+7`, failed episodes `−2` to `−1`. This separation is required for GRPO's group-relative advantage estimates to carry useful signal.
519
+
520
+ > **Reward design insight:** Pure step-level rewards can teach a model to "look busy" — accumulating +0.2 (valid call) and +0.1 (new path) rewards while never converging to task completion. To prevent this, the terminal outcome reward must dominate the sum of all per-step rewards. Two mechanisms enforce this:
521
+ >
522
+ > 1. **Hard ceiling on step rewards per episode.** Maximum achievable per-step reward over 20 steps is bounded: `20 × (0.2 + 0.1 + 0.25 + 0.1) = 13`. But a failed episode still ends at `−1.5`, so any correct episode completion still produces a substantially better total.
523
+ > 2. **Curriculum learning as the primary defense.** Easy tasks (Template 1: single GET, no auth) have a trivially short optimal path (2 steps). There is no room to accumulate "fake" exploration reward when the optimal episode only needs 2 calls. The model learns that the terminal reward is the only thing that matters before it encounters tasks long enough to be gamed. Medium and Hard tiers are introduced only after the model reliably solves Easy — by then the behavior pattern is already anchored. This mirrors how SWE-gym-style environments scale difficulty: start simple enough that the reward signal is unambiguous, then broaden.
524
+ >
525
+ > **Premature `done()` penalty:** If the judge scores the final state as incorrect (task not completed), the episode ends at `−1.5`. There is no bonus for calling `done()` early — it is strictly worse than continuing to make correct API calls. The model only benefits from calling `done()` when the task is actually complete.
526
+
527
+ **Reset behavior:** `reset()` clears session state, episode history, episode BM25 index, step counter. It does not reset the remote application database. The judge evaluates relative state (did the cart contain the item?), not absolute state (is the DB row count exactly N?).
528
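That reset contract could look like the following sketch — the attribute names are assumptions, not the actual implementation:

```python
class Episode:
    """Per-episode state container; the remote application DB lives outside it."""

    def __init__(self):
        self.session_state: dict = {}  # cookies, CSRF tokens
        self.history: list = []        # prior tool calls + results
        self.episode_index = None      # BM25 + embedding store
        self.step: int = 0

    def reset(self):
        # Clears per-episode state only; the remote application DB is untouched.
        self.session_state = {}
        self.history = []
        self.episode_index = None
        self.step = 0
```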
+
529
+ ---
530
+
531
+ ## HTML / Form-Submission Handling
532
+
533
+ Not every endpoint in the target applications returns JSON. The Forum (Postmill) and Wikipedia (Kiwix) applications rely on HTML form submissions and HTML responses respectively. The agent is designed to handle both transparently.
534
+
535
+ ### Why This Matters
536
+
537
+ A generalizable API agent must work with the full spectrum of web interfaces — not just REST JSON endpoints. Form-based POST submissions (with CSRF tokens, multipart bodies, URL-encoded fields) are ubiquitous in real web applications. Training on them is intentional: the model learns to identify the correct request format from context rather than assuming JSON everywhere.
538
+
539
+ ### CSRF Token Extraction
540
+
541
+ Postmill protects state-changing routes (login, post creation) with a per-session CSRF token. This token is embedded as a hidden `<input>` field in the HTML form:
542
+
543
+ ```html
544
+ <input type="hidden" name="_csrf_token" value="abc123XYZ">
545
+ ```
546
+
547
+ **How the model handles this — no dedicated CSRF tool needed:**
548
+
549
+ 1. The model issues a GET to the form page (e.g., `GET /login`).
550
+ 2. The environment returns the HTML body, truncated to 3,000 characters (raised from 1,000 specifically to ensure hidden input fields near the end of small forms are included).
551
+ 3. The model reads the `value` attribute of `input[name="_csrf_token"]` directly from the returned HTML string. HTML parsing is not required — the token appears as a predictable plain-text pattern in the markup.
552
+ 4. The model places the extracted token into the subsequent POST body or form field.
553
+ 5. The environment auto-extracts any `Set-Cookie` header from the login response into `session_state`, so subsequent requests are automatically authenticated.
554
+
555
+ If the CSRF token is positioned after the 3,000-character cutoff (possible in very large rendered pages), the model can call `search_episode_data("_csrf_token")` — the full HTML body is indexed into the episode store before truncation, making the token retrievable by keyword search.
556
+
557
+ ```bash
558
+ # Forum login flow
559
+ curl -X POST 'http://ec2-.../login' \
560
+ -H 'Content-Type: application/x-www-form-urlencoded' \
561
+ -d '_csrf_token=abc123XYZ&_username=user&_password=pass'
562
+ # → 302 redirect + Set-Cookie: PHPSESSID=... (auto-injected into session_state)
563
+
564
+ # Forum post creation
565
+ curl -X POST 'http://ec2-.../f/general/submit' \
566
+ -H 'Content-Type: application/x-www-form-urlencoded' \
567
+ -d '_csrf_token=abc123XYZ&title=My+Post&body=Hello+World'
568
+ ```
569
+
570
+ ### Wikipedia / HTML-Only Responses
571
+
572
+ Kiwix serves static HTML pages — there is no JSON API. The agent treats Wikipedia responses as structured text: search results appear in `<a href>` anchor tags; article content is in `<p>` tags.
573
+
574
+ The environment wraps the truncated HTML response in a lightweight JSON envelope before returning it to the model, so the observation format is always `{status_code, headers, body}` regardless of content type:
575
+
576
+ ```json
577
+ {
578
+ "status_code": 200,
579
+ "headers": {"Content-Type": "text/html"},
580
+ "body": "<html>...<ul class='mw-search-results'><li><a href='/wiki/Mars'>Mars</a>...</ul>..."
581
+ }
582
+ ```
583
+
584
+ For Template 2 ("Retrieve article summary for `{title}`"), task completion is verified by confirming the correct article URL was fetched and returned HTTP 200 — not by parsing article content. This makes the grader robust to HTML structure changes.
585
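Under that verification rule, the Template 2 grader reduces to a URL-plus-status check. A sketch, assuming the episode log exposes `(url, status_code)` pairs:

```python
def grade_article_fetch(episode_calls: list[tuple[str, int]], title: str) -> float:
    # Pass iff the correct article URL was fetched and returned HTTP 200.
    target = "/wiki/" + title.replace(" ", "_")
    ok = any(target in url and status == 200 for url, status in episode_calls)
    return 1.0 if ok else 0.0
```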
+
586
+ ### Form vs. JSON Detection
587
+
588
+ `curl_exec` detects whether a request is form-encoded or JSON by inspecting the `Content-Type` header in the curl command string:
589
+
590
+ - `Content-Type: application/json` → body is JSON, response indexed as JSON
591
+ - `Content-Type: application/x-www-form-urlencoded` or `multipart/form-data` → body is form data, response indexed as text
592
+ - No `Content-Type` (GET requests) → response indexed based on `Content-Type` of the response
593
+
594
+ The model is responsible for setting the correct `Content-Type` in its curl command. The system prompt includes explicit guidance on when to use each.
595
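The detection rule above can be sketched as a small classifier over the raw curl string — a simplification of whatever parsing `curl_exec` actually performs:

```python
def detect_body_kind(curl_cmd: str) -> str:
    # Inspect the Content-Type header embedded in the curl command string.
    cmd = curl_cmd.lower()
    if "application/json" in cmd:
        return "json"
    if "application/x-www-form-urlencoded" in cmd or "multipart/form-data" in cmd:
        return "form"
    return "auto"  # no Content-Type: fall back to the response's Content-Type
```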
+
596
+ ---
597
+
598
+ ## Tasks
599
+
600
+ HARvestGym trains on **7 task templates** rather than a larger flat task list. Each template is a parameterized scenario: one reward function, one ground truth catalog entry, one grader — but potentially hundreds of distinct episode variations produced by substituting different values for the template slots (`{product_name}`, `{category_name}`, etc.).
601
+
602
+ If training goes smoothly, we can scale this to automated task creation that generates variations covering every aspect of a task.
603
+
604
+ **How template parameters are populated:** Before training, a one-time data prep step calls the application's own listing APIs and builds a static **parameter pool** for each template (see [parameter_pools.json](parameter_pools.json), refreshed via [scripts/build_parameter_pools.py](scripts/build_parameter_pools.py)):
605
+
606
+
607
+ | Template slot | Source |
608
+ | ----------------------------- | --------------------------------------------------------------- |
609
+ | `{category_name}` | `GET /rest/V1/categories` — all leaf category names |
610
+ | `{product_name}` | `GET /rest/V1/products?pageSize=200` — all product names + SKUs |
611
+ | `{forum_category}` | Forum's category listing API |
612
+ | `{title}`, `{sku}`, `{price}` | Generated or sampled from existing product names |
613
+
614
+
615
+ Each episode samples randomly from its pool. The model never sees the pool directly — it gets the task string (e.g., `"Add 'Radiant Tee' to a guest cart"`) and must discover the correct endpoint + SKU through its own API calls.
616
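Episode sampling then reduces to a random draw from the pool plus string substitution. A sketch with illustrative pool contents — the ground-truth fields (like the SKU) go to the judge, never into the task string beyond what the template exposes:

```python
import random

# Illustrative pool; the real one is built by scripts/build_parameter_pools.py.
POOLS = {
    "product_name": [
        {"name": "Radiant Tee", "sku": "MH01"},
        {"name": "Breathe-Easy Tank", "sku": "WS01"},
    ],
}
TEMPLATE = "Add '{product_name}' to a guest cart"

def sample_task(rng: random.Random) -> tuple[str, dict]:
    choice = rng.choice(POOLS["product_name"])
    # The task string goes to the model; the full record goes to the judge.
    return TEMPLATE.format(product_name=choice["name"]), choice
```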
+
617
+ ### Complexity Tiers
618
+
619
+ Templates are organized into **complexity tiers** for curriculum training — the model only graduates to harder templates once it reliably solves easier ones:
620
+
621
+
622
+ | Tier | Characteristic | API calls required |
623
+ | ------ | --------------------------------------------- | ------------------ |
624
+ | Easy | Single call, no auth | 1 |
625
+ | Medium | Auth + 1–2 dependent calls | 2–3 |
626
+ | Hard | Multi-step chain with ID threading, full auth | 4–8+ |
627
+
628
+
629
+ ### Task Templates
630
+
631
+
632
+ | # | Tier | App | Template | Key Challenge |
633
+ | --- | ------ | -------------- | ------------------------------------------------------ | ------------------------------------------------------- |
634
+ | 1 | Easy | Shopping | List products in category `{category_name}` | Single GET with query params |
635
+ | 2 | Easy | Wikipedia | Retrieve article summary for `{title}` | Single GET, path parameter resolution |
636
+ | 3 | Medium | Shopping | Add `{product_name}` to a guest cart | 2 calls: create cart → add item; ID threading |
637
+ | 4 | Medium | Forum | Retrieve all posts in `{forum_category}` (authed) | Login → extract session → GET |
638
+ | 5 | Hard | Forum | Create a post titled `{title}` in `{category}` | Login → extract CSRF `form_key` → POST with full schema |
639
+ | 6 | Hard | Shopping | Guest checkout for `{product_name}` | 5+ chained calls; cart → item → shipping → payment |
640
+ | 7 | Hard | Shopping Admin | Create a new product with SKU `{sku}`, price `{price}` | Admin bearer token → full Magento product schema |
641
+
642
+
643
+ Each task has a deterministic programmatic grader (score in `[0.0, 1.0]`):
644
+
645
+ - **Easy graders**: check HTTP response body for expected values
646
+ - **Medium graders**: probe application state after episode (e.g., fetch the cart, verify item is present)
647
+ - **Hard graders**: verify multi-step state change in the application (e.g., post exists, checkout created)
648
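A medium-tier grader might probe state like this — the `fetch_cart` callable is a stand-in for a real GET against the cart items endpoint:

```python
def grade_add_to_cart(fetch_cart, cart_id: str, expected_sku: str) -> float:
    # Probe application state after the episode: is the item in the cart?
    items = fetch_cart(cart_id)  # e.g. GET /rest/V1/guest-carts/{cartId}/items
    return 1.0 if any(i.get("sku") == expected_sku for i in items) else 0.0
```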
+
649
+ **On optional request parameters:** API responses and real network traffic often contain extra headers and parameters (`X-Requested-With`, `Cache-Control`, correlation IDs, etc.) that are not functionally required. The judge scores only on *required* parameters. Extra or missing optional headers or body params do not affect the reward signal.
650
+
651
  ---
652
 
 
TOOLS.md ADDED
@@ -0,0 +1,847 @@
1
+ # HARvestGym Tool Specification
2
+
3
+ Technical specification for all tools available to the RL agent. Each tool is a Python function called by the environment on behalf of the model. The model outputs a single tool call per step; the environment executes it and returns the result.
4
+
5
+ ---
6
+
7
+ ## Tool Set Summary
8
+
9
+
10
+ | Tool | Input | What It Does | Output |
11
+ | ---------------------------- | --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
12
+ | `browser_agent(task, url)` | Task string + app base URL | Checks whether a pre-recorded HAR file exists for this app; if so, processes it (only during training; at inference time the live browser agent is always used); otherwise launches a live browser agent to perform the task and record network traffic. Either way, extracts an OpenAPI-like spec from the captured traffic, builds GEMMA embeddings for the search endpoint index, and returns a summary endpoint list. | Deduplicated list of API endpoint names with HTTP methods (e.g. `GET /products`, `POST /guest-carts`) — summary only, no headers/body/schemas. Use `search_endpoints()` with a natural-language query to get full details for any endpoint. |
13
+ | `search_endpoints(query)` | Natural language query | Semantic search over the endpoint embeddings built by `browser_agent`. Matches the query against the GEMMA-embedded OpenAPI-like spec and returns the top-3 endpoint schemas with full parameter details. | Top-3 endpoint schemas (method, path, auth, params with sources, response fields) |
14
+ | `curl_exec(command)` | Full curl command string | Parses the curl command, executes it via subprocess against the live EC2 server, indexes the full response body into the episode's hybrid BM25 + GEMMA store (before truncation), then returns a truncated observation. | `{status_code, headers, body}` — body smart-truncated; full body indexed into episode store for `search_episode_data()` |
15
+ | `search_episode_data(query)` | Natural language or keyword query | Hybrid BM25 + GEMMA semantic search across all request/response bodies accumulated during this episode from prior `curl_exec` calls. BM25 handles exact keyword matches (IDs, SKUs); GEMMA handles semantic paraphrases. Finds specific values from truncated or prior responses. | Top-5 JSON objects from this episode's request/response history, each annotated with step number and source endpoint |
16
+ | `done(result?)` | Optional result string | Signals the model believes the task is complete. Triggers the judge to evaluate the episode against the ground truth catalog. | Ends the episode |
17
+
18
+
19
+ `browser_agent` is always called **once at the start of an episode** (step 1). It gives the model an API landscape map for the target application, so the model knows which endpoints exist before it begins probing. All subsequent discovery and execution uses the other tools. **If the model calls `browser_agent` again after step 1, it receives a −0.3 penalty reward** — the call still executes normally (loads HAR if it exists, or runs live browser), the penalty is just applied to the reward signal.
20
+
21
+ ---
22
+
23
+ ## Tool 0: `browser_agent(task, url)`
24
+
25
+ ### Purpose
26
+
27
+ Give the model an initial map of the API surface for the target application at the start of every episode. The browser agent is a multi-stage pipeline that:
28
+
29
+ 1. Obtains HAR data (from pre-recorded file if available, or by launching a live browser)
30
+ 2. Processes it to extract an OpenAPI-like spec
31
+ 3. Builds GEMMA embeddings so `search_endpoints()` can work
32
+ 4. Returns a **summary-only** endpoint list to the RL agent — just names and methods, no schemas
33
+
34
+ The output is intentionally sparse: an application can expose many endpoints, and returning full schemas for all of them would waste context window. The agent sees *what* endpoints exist but not *how* to call them. It must call `search_endpoints()` to get full parameter details for any endpoint.
35
+
36
+ ### Interface
37
+
38
+ ```python
39
+ def browser_agent(task: str, url: str) -> dict:
40
+ """
41
+ Multi-stage pipeline:
42
+ 1. Check for pre-recorded HAR file → load if exists, else launch live browser
43
+ 2. Filter HAR → extract OpenAPI-like spec (methods, paths, params, bodies)
44
+ 3. Build GEMMA embeddings over the spec → stored for search_endpoints()
45
+ 4. Return summary endpoint list (names + methods only)
46
+
47
+ Returns: {
48
+ "app": str, # resolved app name (shopping, forum, osm, wikipedia)
49
+ "endpoints": list[dict], # summary: [{method, path}] — no schemas, no headers
50
+ "total_endpoints": int, # count of deduplicated endpoints
51
+ "note": str # directs agent to use search_endpoints() for details
52
+ }
53
+ """
54
+ ```
55
+
56
+ ### Stage 1 — HAR Data Source
57
+
58
+ The browser agent first checks if a pre-recorded HAR file exists. If it does, the browser is never launched — saving 30–120s per episode.
59
+
60
+ ```
61
+ hars/
62
+ shopping.har # all shopping tasks, all API calls recorded for all task templates
63
+ shopping_admin.har # all admin tasks
64
+ forum.har # all forum tasks
65
+ osm.har # all OSM tasks
66
+ wikipedia.har # all Wikipedia tasks
67
+ ```
68
+
69
+ ```python
70
+ HAR_MAP = {
71
+ ":7770": "hars/shopping.har",
72
+ ":7780": "hars/shopping_admin.har",
73
+ ":9999": "hars/forum.har",
74
+ ":3000": "hars/osm.har",
75
+ ":8888": "hars/wikipedia.har",
76
+ }
77
+
78
+ async def get_har_data(task: str, url: str) -> dict:
79
+ har_path = resolve_har_path(url) # port-based lookup from HAR_MAP
80
+ if har_path and os.path.exists(har_path):
81
+ # HAR exists — load from disk, no browser needed
82
+ with open(har_path) as f:
83
+ return json.load(f)
84
+ else:
85
+ # No HAR — launch live browser, perform task, capture traffic
86
+ raw_log = await run_browser_agent_live(task, url, "browser-use/bu-30b-a3b-preview")
87
+ return convert_raw_log_to_har(raw_log)
88
+ ```
89
+
90
+ If no HAR exists, the browser agent launches Chromium via Playwright, connects the `bu-30b-a3b-preview` LLM, performs the task while intercepting all network traffic, and produces a HAR-format output. See `BROWSER_AGENT.md` for the live browser implementation.
91
+
92
+ ### Stage 2 — Filter and Extract OpenAPI-like Spec
93
+
94
+ The HAR data (from either source) is processed to extract a structured spec:
95
+
96
+ ```python
97
+ def extract_openapi_spec(har_data: dict, app_base_url: str) -> list[dict]:
98
+ entries = har_data["log"]["entries"]
99
+ seen = set()
100
+ spec_entries = []
101
+
102
+ for entry in entries:
103
+ req = entry["request"]
104
+ resp = entry["response"]
105
+ raw_url = req["url"]
106
+ method = req["method"]
107
+
108
+ # 1. Skip static assets (images, fonts, CSS, JS bundles, favicon)
109
+ if _is_static_asset(raw_url):
110
+ continue
111
+
112
+ # 2. Skip page navigation (HTML document loads)
113
+ content_type = _get_response_content_type(resp)
114
+ if "text/html" in content_type and method == "GET":
115
+ continue
116
+
117
+ # 3. Normalise path: replace concrete IDs with {id} placeholders
118
+ path = _normalise_path(urlparse(raw_url).path)
119
+
120
+ # 4. Deduplicate by (method, normalised_path)
121
+ key = f"{method} {path}"
122
+ if key in seen:
123
+ continue
124
+ seen.add(key)
125
+
126
+ # 5. Extract auth, body, query params for the spec document
127
+ has_auth = any(
128
+ h["name"].lower() in ("authorization", "x-api-key", "cookie")
129
+ for h in req["headers"]
130
+ )
131
+
132
+ spec_entries.append({
133
+ "method": method,
134
+ "path": path,
135
+ "query_params": urlparse(raw_url).query or None,
136
+ "request_body": _extract_body(req),
137
+ "status_code": resp["status"],
138
+ "response_content_type": content_type,
139
+ "response_body_sample": _truncate_body(resp),
140
+ "auth_observed": has_auth,
141
+ })
142
+
143
+ return spec_entries
144
+ ```
145
+
146
+ ### Stage 3 — Build GEMMA Embeddings
147
+
148
+ The spec entries are embedded using `google/embeddinggemma-300m` (GEMMA). These embeddings are stored in the environment and power `search_endpoints()`.
149
+
150
+ ```python
151
+ def build_endpoint_embeddings(spec_entries: list[dict], app_name: str):
152
+ model = SentenceTransformer("google/embeddinggemma-300m", token=os.environ.get("HF_TOKEN"))
153
+ chunks = [spec_entry_to_text(e, app_name) for e in spec_entries]
154
+ embeddings = model.encode_document(chunks, batch_size=32)
155
+ return embeddings, chunks # stored in env for search_endpoints()
156
+ ```
157
+
158
+ ### Stage 4 — Return Summary
159
+
160
+ The RL agent receives **only endpoint names and methods** — no schemas, no headers, no body details:
161
+
162
+ ```json
163
+ {
164
+ "app": "shopping",
165
+ "endpoints": [
166
+ {"method": "POST", "path": "/rest/V1/integration/customer/token"},
167
+ {"method": "GET", "path": "/rest/V1/products"},
168
+ {"method": "GET", "path": "/rest/V1/products/{id}"},
169
+ {"method": "POST", "path": "/rest/V1/guest-carts"},
170
+ {"method": "POST", "path": "/rest/V1/guest-carts/{id}/items"},
171
+ {"method": "GET", "path": "/rest/V1/guest-carts/{id}/totals"},
172
+ {"method": "POST", "path": "/rest/V1/guest-carts/{id}/order"},
173
+ {"method": "GET", "path": "/rest/V1/categories"}
174
+ ],
175
+ "total_endpoints": 8,
176
+ "note": "These endpoints were observed for this application. Use search_endpoints() with a natural language query to get the full schema, parameters, and auth details for any endpoint."
177
+ }
178
+ ```
179
+
180
+ ### Path Normalisation
181
+
182
+ The `_normalise_path` function replaces concrete dynamic segments with `{id}` placeholders so that duplicates collapse:
183
+
184
+ - Numeric IDs: `/products/42` → `/products/{id}`
185
+ - UUIDs: `/carts/3fa85f64-5717-4562-b3fc` → `/carts/{id}`
186
+ - Magento cart IDs (mixed alphanumeric, 32+ chars): detected by length and character set
187
+ - OSM node/way/relation IDs: `/api/0.6/node/12345678` → `/api/0.6/node/{id}`
188
+ - Forum post slugs: `/f/general/1-hello-world` → `/f/{slug}/{id}-{slug}`
189
+
190
+ Normalisation is pattern-based (regex), not AI-generated. No external calls.
191
+
192
+ ### When to Call
193
+
194
+ `browser_agent` is called **exactly once per episode, at step 1**, before any other tool. It serves as the API landscape orientation AND builds the search index. **If called again after step 1, the call executes normally but the model receives a −0.3 penalty reward.** The model should not need to call it again mid-episode.
195
+
196
+ ### Relationship to Other Tools
197
+
198
+ ```
199
+ browser_agent → "what endpoints exist?" (summary only)
200
+ │ + builds GEMMA embeddings internally
201
+
202
+
203
+ search_endpoints → "give me full schema for endpoint X"
204
+ │ (searches the GEMMA embeddings built above)
205
+
206
+ curl_exec → "call endpoint X, get live response"
207
+ │ (indexes full response into BM25 episode store)
208
+
209
+ search_episode_data → "find specific value from a prior response"
210
+ (BM25 search over indexed episode data)
211
+ ```
212
+
213
+ `browser_agent` provides breadth (what exists) and builds the search index. `search_endpoints` provides depth (how to call it). `curl_exec` provides live data and feeds the episode index. `search_episode_data` retrieves specific values from that index.
214
+
215
+ ---
216
+
217
+ ## Tool 1: `search_endpoints(query)`
218
+
219
+ ### Purpose
220
+
221
+ Find which API endpoint to call for a given subtask. The model calls this when it does not yet know the correct URL, method, or parameter schema for the next HTTP call it needs to make.
222
+
223
+ ### Interface
224
+
225
+ ```python
226
+ def search_endpoints(query: str) -> list[str]:
227
+ """
228
+ Semantic search over the endpoint embeddings built by browser_agent.
229
+ Returns the top-3 matching endpoint schemas as formatted text strings.
230
+ """
231
+ ```
232
+
233
+ ### Underlying Index
234
+
235
+ - **Source:** The GEMMA embeddings built by `browser_agent` during Stage 3. These embeddings are created from the OpenAPI-like spec extracted from HAR data — the actual network traffic observed when the browser agent performed tasks on the application.
236
+ - **Embedding model:** `google/embeddinggemma-300m` via `sentence-transformers`
237
+ - **Built:** By `browser_agent` at the start of each episode (Stage 3). The browser agent processes the HAR data, extracts the OpenAPI-like spec, converts each endpoint to a text chunk, and embeds them using GEMMA.
238
+ - **At runtime:** Stored in environment memory after `browser_agent` completes. Available for the rest of the episode. Discarded at episode end (rebuilt from HAR at next episode start).
239
+ - **Query embedding:** Uses the `encode_query` method with prompt `task: search result | query: {query}`.
240
+ - **Document embedding:** Uses the `encode_document` method with prompt `title: {endpoint} | text: {full_schema_text}`.
241
+ - **Similarity:** Use the similarity function specified by the `google/embeddinggemma-300m` model card. The model's `sentence_transformers_config.json` specifies the correct metric (typically cosine similarity for normalized embeddings). Pure numpy, no FAISS needed at this scale.
242
+
243
+ ### Document Structure (one per extracted endpoint)
244
+
245
+ Each endpoint from the browser agent's OpenAPI-like spec is converted to a searchable text chunk by `spec_entry_to_text()`:
246
+
247
+ ```
248
+ app: shopping | endpoint: POST /rest/V1/guest-carts/{id}/items | status: 200 | auth: none | body: {"cartItem":{"sku":"MH01","qty":1,"quote_id":"cart-abc123"}} | response_sample: {"item_id":5,"sku":"MH01","qty":1}
249
+ ```
250
+
251
+ The text chunks include method, path, status code, auth observation, query params, request body sample, and response body sample — all extracted from the actual HAR traffic. This is richer than just endpoint names (which is what the RL agent sees from `browser_agent`'s summary output) but less structured than a hand-written catalog.
252
+
253
+ ### Output Format
254
+
255
+ Returns a list of 3 strings, each being the full text of one matching endpoint schema. The model reads these directly and extracts the method, URL pattern, observed parameters, and response structure.
256
+
257
+ ### When to Call
258
+
259
+ - At the start of a task subtask: "I need to authenticate — what endpoint handles login?"
260
+ - When discovering a prerequisite: "I need a cart ID first — what creates a cart?"
261
+ - When unsure of the exact URL pattern: "Is it `/products/{id}` or `/products?id=`?"
262
+
263
+ ### Caveats
264
+
265
+ - Returns observed traffic patterns, not formal API documentation. The schemas reflect what was seen in the HAR, not what the API formally supports. Some optional parameters may be missing if the browser agent's session didn't exercise them.
266
+ - Returns schemas, not live values. The model still needs `curl_exec` to get actual data (product SKUs, cart IDs, etc.).
267
+ - If no relevant endpoint exists in the index, returns the closest matches by cosine similarity. The model should treat low-confidence results skeptically and try `curl_exec` to probe.
268
+ - The index covers only the current application (determined by `browser_agent`'s URL). Each episode's index is specific to one app.
269
+
270
+ ---
271
+
272
+ ## Tool 2: `curl_exec(command)`
273
+
274
+ ### Purpose
275
+
276
+ Execute an HTTP request against the live EC2 application and return the response. This is the primary action tool — it is how the model actually interacts with the application.
277
+
278
+ ### Interface
279
+
280
+ ```python
281
+ def curl_exec(command: str) -> dict:
282
+ """
283
+ Parses a curl command string, executes it via subprocess against the live EC2 server,
284
+ indexes the full response into the episode store, then returns a truncated observation.
285
+
286
+ Returns: {
287
+ "status_code": int,
288
+ "headers": dict, # response headers
289
+ "body": str | dict # truncated; see truncation rules below
290
+ }
291
+ """
292
+ ```
293
+
294
+ ### Execution Pipeline
295
+
296
+ The environment performs these steps in order on every `curl_exec` call:
297
+
298
+ ```
299
+ 1. Parse the curl command string
300
+ Extract: method, URL, headers, body
301
+ Validate: URL host must match app_base_url (reject requests to external hosts)
302
+ Inject: session cookies from session_state into headers automatically
303
+
304
+ 2. Execute via subprocess
305
+ subprocess.run(["curl", ...parsed args...], timeout=10)
306
+ Capture: status_code, response headers, response body (full, untruncated)
307
+
308
+ 3. Index into episode store (BEFORE truncation)
309
+ Index the request body (if any)
310
+ Index the response body
311
+ See: Episode Store section below
312
+
313
+ 4. Truncate the response body for context
314
+ Apply truncation rules (see below)
315
+ Add truncation note if any array was trimmed
316
+
317
+ 5. Return to model
318
+ {status_code, headers, truncated_body} or error
319
+ ```
320
+
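Step 1's host validation can be sketched in a few lines; the helper name here is hypothetical, but the `host_not_allowed` error string matches the caveats listed later in this section:

```python
import shlex
from urllib.parse import urlparse

def check_curl_host(command: str, app_base_url: str):
    # Return None if every URL in the curl command targets the app host,
    # otherwise the "host_not_allowed" error string.
    allowed = urlparse(app_base_url).netloc
    for token in shlex.split(command):
        if token.startswith(("http://", "https://")):
            if urlparse(token).netloc != allowed:
                return "host_not_allowed"
    return None
```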
321
+ ### Truncation Rules
322
+
323
+ Applied in order. First matching rule wins.
324
+
325
+ **Rule 1 — Non-JSON body:**
326
+
327
+ HTML from form-serving pages (login, post creation, etc.) gets a generous character cutoff because CSRF tokens and `<input>` fields are embedded inside the markup. The model locates them by reading the raw HTML string; no HTML parser is required, since tokens appear as predictable plain-text patterns (`<input type="hidden" name="_csrf_token" value="…">`). Even with 3,000 characters, if the CSRF token appears after the cutoff (possible in large pages), the full body is indexed in the episode store and can be retrieved with `search_episode_data("_csrf_token")`.
328
+
329
+ ```python
330
+ NONJSON_MAX_CHARS = 3000
331
+
332
+ if not is_valid_json(body):
333
+ return body[:NONJSON_MAX_CHARS] + (" [truncated — non-JSON response]" if len(body) > NONJSON_MAX_CHARS else "")
334
+ ```
335
+
336
+ **Rule 2 — JSON primitive (string, number, boolean, null):**
337
+
338
+ ```python
339
+ if isinstance(parsed, (str, int, float, bool)) or parsed is None:
340
+ return body # never truncate; these are tokens, IDs, simple confirmations
341
+ ```
342
+
343
+ **Rule 3 — Error response (4xx or 5xx):**
344
+
345
+ ```python
346
+ if status_code >= 400:
347
+ return body # never truncate error messages; the model needs every word to self-correct
348
+ ```
349
+
350
+ **Rule 4 — JSON object or array with no large arrays:**
351
+
352
+ ```python
353
+ # "large" means an array with >= 3 objects (dicts)
354
+ if no array field contains >= 3 dict items:
355
+ return body # small enough; return as-is
356
+ ```
357
+
358
+ **Rule 5 — JSON with large array field(s):**
359
+
360
+ ```python
361
+ # For each top-level field whose value is a list of >= 3 dicts:
362
+ # Keep first 2 items, drop the rest
363
+ # Add a _list_truncated annotation at the top level
364
+
365
+ truncated = {}
+ truncated_fields = {}
+ for key, val in parsed.items():
+     if is_large_list(val):
+         truncated[key] = val[:2]
+         truncated_fields[key] = len(val)
+     else:
+         truncated[key] = val
+ truncated["_list_truncated"] = {
+     "fields": truncated_fields,
+     "shown_per_field": 2,
+     "note": "Showing 2 items per truncated field. Use search_episode_data() to find a specific item from this response."
+ }
+ return json.dumps(truncated)
376
+ ```
377
+
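Rules 4 and 5 share an `is_large_list` predicate. A plausible one-liner, assuming only the first element's type is checked:

```python
def is_large_list(value) -> bool:
    # "Large" per the rules above: an array holding at least 3 objects (dicts).
    return isinstance(value, list) and len(value) >= 3 and isinstance(value[0], dict)
```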
378
+ The note is a static Python format string. It is not AI-generated. It does not suggest specific query parameters or URL patterns.
379
+
380
+ ### Session State Injection
381
+
382
+ Before executing the curl command, the environment reads `session_state` and injects any relevant cookies or tokens:
383
+
384
+ - If `session_state` contains `PHPSESSID`, inject as `Cookie: PHPSESSID=...` (applies to both Magento sessions and Forum requests on port 9999)
+ - If `session_state` contains `form_key` (Magento CSRF), inject as a header: `X-Form-Key: ...`
387
+ - If `session_state` contains a bearer token, inject as `Authorization: Bearer ...` only if the model's curl command does not already include an `Authorization` header
388
+
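The injection rules above can be sketched as a pure function over the header dict; the `bearer_token` key name is an assumption, since this document does not fix the `session_state` schema:

```python
def inject_session_headers(headers: dict, session_state: dict) -> dict:
    # Sketch only: copies the model's declared headers, then fills in token
    # values from session_state. "bearer_token" is a hypothetical key name.
    out = dict(headers)
    if "PHPSESSID" in session_state:
        out["Cookie"] = f"PHPSESSID={session_state['PHPSESSID']}"
    if "form_key" in session_state:
        out["X-Form-Key"] = session_state["form_key"]
    if "bearer_token" in session_state and "Authorization" not in out:
        out["Authorization"] = f"Bearer {session_state['bearer_token']}"
    return out
```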
389
+ **CSRF note for Postmill (Forum):** Postmill's `_csrf_token` is a request-body field, not a header. The environment does **not** auto-inject it — the model must extract it from the HTML form response and include it explicitly in the POST body. The `session_state` cookie (`PHPSESSID`) is auto-injected so the server associates the CSRF token with the active session. The expected workflow:
390
+
391
+ ```
392
+ GET /login → HTML body contains <input type="hidden" name="_csrf_token" value="XYZ">
393
+ Model reads "XYZ" from body string
394
+ POST /login -d '_csrf_token=XYZ&_username=user&_password=pass'
395
+ Environment auto-injects Cookie: PHPSESSID from session_state
396
+ ```
397
+
398
+ The model is responsible for setting the correct `Content-Type` in its curl command. The model declares intent (which headers to include); the environment fills in the actual token values from `session_state`.
399
+
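The token-reading step in the workflow above needs only a regular expression over the raw HTML; a sketch, assuming the attribute order matches Postmill's rendered form:

```python
import re

def extract_csrf_token(html: str):
    # Matches the hidden-input pattern described above; returns None if absent.
    m = re.search(r'<input[^>]*name="_csrf_token"[^>]*value="([^"]*)"', html)
    return m.group(1) if m else None
```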
400
+ ### Caveats
401
+
402
+ - `curl_exec` always hits the live EC2 server. No responses are mocked.
403
+ - The timeout is 10 seconds. If the server does not respond, returns `{status_code: 0, error: "timeout"}`.
404
+ - URL must be on the same host as `app_base_url`. Cross-host requests are rejected with `{status_code: 0, error: "host_not_allowed"}`.
405
+ - The model must include the full URL including host and port. Relative paths are not supported.
406
+
407
+ ---
408
+
409
+ ## Tool 3: `search_episode_data(query)`
410
+
411
+ ### Purpose
412
+
413
+ Search through all request and response bodies accumulated during the current episode. The model calls this when it needs a specific value (an ID, a name, a token) that was returned in a prior API response but the list was truncated.
414
+
415
+ This tool exists because **not every data type has a search or filter API endpoint**. For applications where the model cannot make a targeted filtered query (e.g., a listing endpoint that only supports pagination, not field-based filtering), searching the already-fetched episode data is the only way to locate a specific value without paging through the entire collection.
416
+
417
+ ### Interface
418
+
419
+ ```python
420
+ def search_episode_data(query: str) -> list[str]:
421
+ """
422
+ Keyword + BM25 search over all request and response bodies indexed during this episode.
423
+ Returns the top-5 matching JSON objects as formatted text strings, each annotated
424
+ with the step number and endpoint that produced them.
425
+ """
426
+ ```
427
+
428
+ ### Hybrid Search: BM25 + GEMMA Semantic Embeddings
429
+
430
+ The episode index uses a **hybrid approach** combining BM25 keyword matching with GEMMA semantic embeddings (`google/embeddinggemma-300m`). Both indexes are maintained in parallel — BM25 for fast exact keyword recall, GEMMA for semantic understanding when the agent's query uses different terminology than what appears in the response data.
431
+
432
+ **Why hybrid, not BM25 alone:**
433
+
434
+ BM25 excels at exact keyword matches ("MH01", "cart-abc123", "Radiant Tee") but fails on paraphrases. If the agent queries "price of the tee shirt I found earlier", BM25 won't match "Radiant Tee" because the terms don't overlap. GEMMA semantic embeddings bridge this gap — "tee shirt" and "Radiant Tee" are semantically close in embedding space.
435
+
436
+ **Why hybrid, not GEMMA alone:**
437
+
438
+ GEMMA embeddings are weaker at exact string matching. Searching for a specific cart ID like "cart-abc123" benefits from BM25's precise token matching. The hybrid approach gets the best of both.
439
+
440
+ **Scoring:** Results are ranked by a weighted combination of BM25 score (normalized) and GEMMA cosine similarity:
441
+
442
+ ```python
443
+ hybrid_score = alpha * bm25_score_normalized + (1 - alpha) * GEMMA_cosine_similarity
444
+ # alpha = 0.4 (tunable; favors semantic slightly over keyword)
445
+ ```
446
+
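The blend can be sketched in a few lines of numpy; min-max normalization for the BM25 scores is an assumption, since the text only says "normalized":

```python
import numpy as np

def hybrid_rank(bm25_scores, cosine_sims, alpha=0.4, top_k=5):
    # Min-max normalize BM25 so both signals live in [0, 1], then blend.
    bm25 = np.asarray(bm25_scores, dtype=float)
    cos = np.asarray(cosine_sims, dtype=float)
    span = bm25.max() - bm25.min()
    bm25_norm = (bm25 - bm25.min()) / span if span > 0 else np.zeros_like(bm25)
    hybrid = alpha * bm25_norm + (1 - alpha) * cos
    return np.argsort(hybrid)[::-1][:top_k].tolist()
```

With `alpha = 0.4`, a document that is only a keyword match needs a strong BM25 lead to outrank a close semantic match, which is the stated bias toward semantics.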
447
+ **Performance:** GEMMA is 300M parameters. On GPU, embedding a batch of 200 response items takes ~1-2 seconds — acceptable overhead per `curl_exec` call. The GEMMA model is already loaded in memory for `search_endpoints`, so no additional model loading cost. BM25 remains instantaneous.
448
+
449
+ **Fallback:** If no GPU is available, the system falls back to BM25-only mode. The GEMMA model is shared with `search_endpoints` — if it's loaded, episode data search uses it too.
450
+
451
+ ### Episode Index — Document Construction
452
+
453
+ Every time `curl_exec` completes, the environment constructs embedding documents from the full (pre-truncation) request and response bodies and adds them to the in-memory BM25 index for the current episode.
454
+
455
+ **Algorithm:**
456
+
457
+ ```python
458
+ def build_index_documents(step: int, method: str, path: str,
459
+ request_body: Any, response_body: Any,
460
+ status_code: int) -> list[str]:
461
+ docs = []
462
+
463
+ # 1. Index the request body (if any)
464
+ if request_body is not None:
465
+ docs.append(
466
+ f"step:{step} source:request endpoint:{method} {path} "
467
+ f"body:{json.dumps(request_body, ensure_ascii=False)}"
468
+ )
469
+
470
+ # 2. Index the response body
471
+ if response_body is None or not is_valid_json(response_body):
472
+ docs.append(
473
+ f"step:{step} source:response endpoint:{method} {path} "
474
+ f"status:{status_code} body:{str(response_body)[:500]}"
475
+ )
476
+ return docs
477
+
478
+ parsed = json.loads(response_body) if isinstance(response_body, str) else response_body
479
+
480
+ # 3. JSON primitive — one document
481
+ if isinstance(parsed, (str, int, float, bool)) or parsed is None:
482
+ docs.append(
483
+ f"step:{step} source:response endpoint:{method} {path} "
484
+ f"status:{status_code} value:{parsed}"
485
+ )
486
+ return docs
487
+
+     # 3b. JSON top-level array: one document per item (a JSON body can be a bare list)
+     if isinstance(parsed, list):
+         for item in parsed:
+             docs.append(
+                 f"step:{step} source:response endpoint:{method} {path} "
+                 f"status:{status_code} item:{json.dumps(item, ensure_ascii=False)}"
+             )
+         return docs
488
+ # 4. JSON object — find top-level array fields
489
+ array_fields = {k: v for k, v in parsed.items()
490
+ if isinstance(v, list) and len(v) > 0 and isinstance(v[0], dict)}
491
+ scalar_fields = {k: v for k, v in parsed.items() if k not in array_fields}
492
+
493
+ if not array_fields:
494
+ # No arrays — one document for the whole object
495
+ docs.append(
496
+ f"step:{step} source:response endpoint:{method} {path} "
497
+ f"status:{status_code} data:{json.dumps(parsed, ensure_ascii=False)}"
498
+ )
499
+ return docs
500
+
501
+ # 5. Has array fields — one document per array item, with parent context attached
502
+ parent_context = (
503
+ f"step:{step} source:response endpoint:{method} {path} status:{status_code} "
504
+ + " ".join(f"{k}:{v}" for k, v in scalar_fields.items())
505
+ )
506
+ for field_name, items in array_fields.items():
507
+ for item in items:
508
+ # Flatten nested arrays within each item to strings (do not recurse further)
509
+ flat_item = {}
510
+ for k, v in item.items():
511
+ flat_item[k] = json.dumps(v) if isinstance(v, (list, dict)) else v
512
+ docs.append(
513
+ f"{parent_context} list_field:{field_name} "
514
+ f"item:{json.dumps(flat_item, ensure_ascii=False)}"
515
+ )
516
+
517
+ return docs
518
+ ```
519
+
520
+ **Key design principle:** The parent context (step number, endpoint, HTTP status, scalar fields like `total_count`) is prepended to every child item document. When the model searches for "Radiant Tee product SKU", the returned document contains both `name:Radiant Tee sku:MH01` and the context `endpoint:GET /rest/V1/products step:2` — the model knows where this value came from and which step it appeared in.
521
+
522
+ ### Episode Index — Lifecycle
523
+
524
+ ```
525
+ episode start → BM25 index initialized (empty)
526
+ GEMMA embedding store initialized (empty)
527
+
528
+ each curl_exec → build_index_documents() called
529
+ documents appended to BM25 corpus (BM25 index rebuilt, fast)
530
+ documents embedded via GEMMA and appended to embedding store
531
+
532
+ search_episode_data() → BM25 scores computed (keyword match)
533
+ GEMMA cosine similarity computed (semantic match)
534
+ hybrid ranking: alpha * BM25 + (1-alpha) * GEMMA
535
+ top-5 documents returned
536
+
537
+ episode end → both indexes discarded entirely
538
+ next episode → fresh indexes from scratch
539
+ ```
540
+
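As a toy stand-in for this lifecycle, the following uses plain IDF-weighted keyword overlap in place of the BM25 + GEMMA hybrid; the class and scoring are illustrative only:

```python
import math
import re
from collections import Counter

def _tokens(text: str):
    # Crude tokenizer: lowercase alphanumeric runs (keeps ids like "cart-abc123").
    return re.findall(r"[a-z0-9_-]+", text.lower())

class EpisodeIndex:
    """Illustrative per-episode store: append-only docs, keyword scoring, discarded at episode end."""
    def __init__(self):
        self.docs = []

    def add(self, doc: str):
        self.docs.append(doc)

    def search(self, query: str, top_k: int = 5):
        q = _tokens(query)
        n = len(self.docs) or 1
        df = Counter(t for d in self.docs for t in set(_tokens(d)))
        def score(doc):
            toks = _tokens(doc)
            return sum(toks.count(t) * math.log(1 + n / (1 + df[t])) for t in q)
        ranked = sorted(self.docs, key=score, reverse=True)[:top_k]
        return [d for d in ranked if score(d) > 0]
```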
541
+ ### Output Format
542
+
543
+ Returns a list of up to 5 strings, each being one indexed document. Example:
544
+
545
+ ```
546
+ [
547
+ "step:2 source:response endpoint:GET /rest/V1/products status:200 total_count:200 list_field:items item:{\"sku\": \"MH01\", \"name\": \"Radiant Tee\", \"price\": 22.0, \"type_id\": \"simple\"}",
548
+ "step:2 source:response endpoint:GET /rest/V1/products status:200 total_count:200 list_field:items item:{\"sku\": \"MH03\", \"name\": \"Radiant Tee Long Sleeve\", \"price\": 28.0, \"type_id\": \"simple\"}"
549
+ ]
550
+ ```
551
+
552
+ The model reads `sku: MH01` from the first result and uses it in the next curl call.
553
+
554
+ ### When to Call
555
+
556
+ - A prior curl response was truncated (`_list_truncated` present in the response) and the model needs a specific item not shown in the 2-item sample.
557
+ - The model needs a value from a prior step but cannot easily locate it by scanning history (many steps ago, or buried in a complex response).
558
+ - There is no filter/search API for the data type (practical assumption: not all applications expose filtered listing endpoints for every resource).
559
+
560
+ ### Caveats
561
+
562
+ - Only searches data from **the current episode**. Values from prior episodes are not accessible (each episode starts with an empty index).
563
+ - Only finds data that was actually returned by a `curl_exec` call in this episode. If the relevant API has not been called yet, the data is not indexed.
564
+ - The hybrid search handles both exact keywords ("MH01", "cart-abc123") and semantic paraphrases ("tee shirt price"). However, using the same terminology seen in the response still produces the best results.
565
+ - For large lists (200+ items), all items are indexed. BM25 search is fast regardless of index size. GEMMA embedding of large responses adds 1-2 seconds of overhead per `curl_exec` call.
566
+
567
+ ---
568
+
569
+ ## Tool 4: `done(result?)`
570
+
571
+ ### Interface
572
+
573
+ ```python
574
+ def done(result: str = None) -> None:
575
+ """
576
+ Signals that the model believes the task is complete.
577
+ Triggers the judge to evaluate the episode against the ground truth catalog.
578
+ Ends the episode.
579
+ """
580
+ ```
581
+
582
+ ### Behavior
583
+
584
+ - Calling `done()` immediately ends the episode. No further tool calls are processed.
585
+ - The optional `result` string is logged but does not affect the reward. The judge evaluates the live application state, not the model's self-report.
586
+ - If the model calls `done()` and the task is not actually complete, the episode ends with `−1.5` reward (timeout/failure outcome). The model should only call `done()` after the final `curl_exec` has returned a 2xx confirming the required state change.
587
+
588
+ ### How the Model Learns When to Call `done()`
589
+
590
+ There is **no explicit "task complete" signal** from the environment. The model learns when to call `done()` purely through the reward signal over many episodes:
591
+
592
+ - **Calling `done()` too early** (before the task is actually complete) → judge finds the expected state change is missing → `−1.5` reward. The model learns to avoid this.
593
+ - **Calling `done()` after a successful final API call** (e.g., add-to-cart returns 2xx with `item_id`) → judge confirms the state change → `+2.0` to `+5.0` reward. The model learns that a 2xx response confirming the desired action is the right signal to call `done()`.
594
+ - **Never calling `done()`** (running out of steps) → episode times out → `−1.5` reward. The model learns it must eventually commit.
595
+
596
+ The learned pattern is: after the final state-changing `curl_exec` returns a 2xx response whose body confirms the expected outcome (e.g., `item_id` present in add-to-cart, `order_id` present in checkout), call `done()`. This mirrors how a human developer knows an API call succeeded — you check the response.
597
+
598
+ **Optional verification step:** Before calling `done()`, the model can issue one more `curl_exec` to verify the state change (e.g., `GET /rest/V1/guest-carts/{id}` to confirm the item is in the cart). This costs one step but reduces the risk of premature `done()` calls. The model learns whether verification is worth the step cost through reward optimization.
599
+
600
+ ---
601
+
602
+ ## Episode Index — Full Example
603
+
604
+ **Task:** `"Add 'Radiant Tee' to a guest cart at http://ec2-.../"`
605
+
606
+ ```
607
+ SYSTEM: ...
608
+
609
+ TASK: Add "Radiant Tee" to a guest cart at http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7770/
610
+
611
+ [session_state: {}]
612
+
613
+ STEP 1 ACTION: browser_agent("Add Radiant Tee to a guest cart", "http://ec2-...:7770/")
614
+ STEP 1 RESULT: {
615
+ "app": "shopping",
616
+ "endpoints": [
617
+ {"method": "POST", "path": "/rest/V1/guest-carts", "status": 200, "auth_observed": false},
618
+ {"method": "GET", "path": "/rest/V1/products", "status": 200, "auth_observed": false},
619
+ {"method": "GET", "path": "/rest/V1/products/{id}", "status": 200, "auth_observed": false},
620
+ {"method": "POST", "path": "/rest/V1/guest-carts/{id}/items", "status": 200, "auth_observed": false},
621
+ ...
622
+ ],
623
+ "note": "These endpoints were observed in pre-recorded browser sessions..."
624
+ }
625
+
626
+ STEP 2 ACTION: search_endpoints("find product by name, get sku")
627
+ STEP 2 RESULT: [
628
+ "endpoint: GET /rest/V1/products | auth: none | query: searchCriteria filters (field, value, condition_type) | returns: .items[].sku, .items[].name, .items[].price",
629
+ ...
630
+ ]
631
+
632
+ STEP 3 ACTION: curl_exec("curl 'http://ec2-.../rest/V1/products?searchCriteria[filter_groups][0][filters][0][field]=name&searchCriteria[filter_groups][0][filters][0][value]=Radiant+Tee'")
633
+ STEP 3 RESULT: {"status_code": 200, "body": {"items": [{"sku": "MH01", "name": "Radiant Tee", "price": 22.0}], "total_count": 1}}
634
+
635
+ STEP 4 ACTION: search_endpoints("create guest cart, get cart id")
636
+ STEP 4 RESULT: [
637
+ "endpoint: POST /rest/V1/guest-carts | auth: none | no body | returns: cartId string (plain quoted string)",
638
+ ...
639
+ ]
640
+
641
+ STEP 5 ACTION: curl_exec("curl -X POST 'http://ec2-.../rest/V1/guest-carts' -H 'Content-Type: application/json'")
642
+ STEP 5 RESULT: {"status_code": 200, "body": "cart-abc123"}
643
+
644
+ STEP 6 ACTION: search_endpoints("add item to guest cart")
645
+ STEP 6 RESULT: [
646
+ "endpoint: POST /rest/V1/guest-carts/{cartId}/items | auth: none | path: cartId (from POST /rest/V1/guest-carts) | body: cartItem.sku, cartItem.qty, cartItem.quote_id (same as cartId) | returns: item_id",
647
+ ...
648
+ ]
649
+
650
+ STEP 7 ACTION: curl_exec("curl -X POST 'http://ec2-.../rest/V1/guest-carts/cart-abc123/items' -H 'Content-Type: application/json' -d '{\"cartItem\":{\"sku\":\"MH01\",\"qty\":1,\"quote_id\":\"cart-abc123\"}}'")
651
+ STEP 7 RESULT: {"status_code": 200, "body": {"item_id": 5, "sku": "MH01", "qty": 1}}
652
+
653
+ STEP 8 ACTION: done("Radiant Tee (MH01) added to guest cart cart-abc123, item_id 5")
654
+ ```
655
+
656
+ ### Embedding build (by browser_agent, once per episode)
657
+
658
+ The GEMMA embeddings for `search_endpoints` are built by `browser_agent` during Stage 4, not pre-built offline. Each episode starts fresh:
659
+
660
+ ```python
661
+ from sentence_transformers import SentenceTransformer
662
+ import numpy as np
663
+ import os
664
+
665
+ # google/embeddinggemma-300m requires accepting Google's license on HuggingFace.
666
+ # Set HF_TOKEN env variable to a token that has accepted the license.
667
+ # Uses encode_query / encode_document / similarity API from sentence-transformers.
668
+ # NOTE: activations do not support float16 — use float32 or bfloat16.
669
+ HF_TOKEN = os.environ.get("HF_TOKEN") # must have accepted the license
670
+
671
+ def build_endpoint_embeddings(spec_entries: list[dict], app_name: str):
672
+ """Called by browser_agent Stage 4 after extracting OpenAPI-like spec from HAR."""
673
+ model = SentenceTransformer("google/embeddinggemma-300m", token=HF_TOKEN)
674
+ chunks = [spec_entry_to_text(e, app_name) for e in spec_entries]
675
+ # encode_document uses: "title: {endpoint} | text: {rest of chunk}"
676
+ embeddings = model.encode_document(chunks, batch_size=32)
677
+ # embeddings are returned normalized; dot product = cosine similarity
678
+ return embeddings, chunks # stored in env memory for search_endpoints()
679
+ ```
680
+
681
+ ### Runtime search
682
+
683
+ ```python
684
+ def search_endpoints(query: str, embeddings, texts, model, top_k=3) -> list[str]:
685
+ q_emb = model.encode_query(query) # shape: (D,)
686
+ # Use similarity metric specified by google/embeddinggemma-300m model card
687
+ scores = model.similarity(q_emb, embeddings)[0].cpu().numpy() # shape: (N,); numpy so np.argsort works
688
+ top_idx = np.argsort(scores)[::-1][:top_k]
689
+ return [texts[i] for i in top_idx]
690
+ ```
691
+
692
+ ### Index size per episode
693
+
694
+ Typical endpoint counts per app (after HAR filtering and deduplication):
695
+
696
+ - Shopping (Magento REST): ~8–15 endpoints per HAR
697
+ - Shopping Admin (Magento Admin AJAX + REST): ~5–10 endpoints per HAR
698
+ - Forum (Postmill forms + REST): ~3–8 endpoints per HAR
699
+ - OSM (Rails API + web): ~5–10 endpoints per HAR
700
+ - Wikipedia (Kiwix): ~2 endpoints per HAR
701
+
702
+ **Typical: ~5–15 endpoints × 768 dims × 4 bytes = negligible memory.** Embedding time on GPU: <1 second per episode.
703
+
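Checking the arithmetic on the worst case above:

```python
# Worst case from the counts above: 15 endpoints, 768-dim float32 embeddings.
endpoints, dims, bytes_per_float = 15, 768, 4
total_bytes = endpoints * dims * bytes_per_float
print(total_bytes)  # 46080 bytes, i.e. ~45 KiB
```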
704
+ ---
705
+
706
+ ## Truncation Helper — Python Pseudocode
707
+
708
+ ```python
709
+ import json
710
+
711
+ TRUNCATE_LIST_AT = 2 # keep this many items from large arrays
712
+ LARGE_ARRAY_THRESHOLD = 3 # arrays with >= this many dicts are "large"
713
+ NONJSON_MAX_CHARS = 3000 # 3,000 chars — enough to capture hidden CSRF inputs in most HTML forms
714
+
715
+ def truncate_response_body(body: str, status_code: int) -> str:
716
+ # Rule 3: never truncate errors
717
+ if status_code >= 400:
718
+ return body
719
+
720
+ # Rule 1: non-JSON
721
+ if not _is_json(body):
722
+ if len(body) > NONJSON_MAX_CHARS:
723
+ return body[:NONJSON_MAX_CHARS] + " [truncated — non-JSON response]"
724
+ return body
725
+
726
+ parsed = json.loads(body)
727
+
728
+ # Rule 2: primitive
729
+ if not isinstance(parsed, (dict, list)):
730
+ return body
731
+
732
+ # Rule 4/5: find large array fields
733
+ if isinstance(parsed, list):
734
+ if len(parsed) >= LARGE_ARRAY_THRESHOLD and isinstance(parsed[0], dict):
735
+ result = parsed[:TRUNCATE_LIST_AT]
736
+ note = {"_list_truncated": {
737
+ "shown": TRUNCATE_LIST_AT,
738
+ "total": len(parsed),
739
+ "note": f"Showing {TRUNCATE_LIST_AT} of {len(parsed)} items. "
740
+ "Use search_episode_data() to find a specific item from this response."
741
+ }}
742
+ return json.dumps(result + [note])
743
+ return body
744
+
745
+ # parsed is a dict — check each value
746
+ needs_truncation = {
747
+ k for k, v in parsed.items()
748
+ if isinstance(v, list) and len(v) >= LARGE_ARRAY_THRESHOLD
749
+ and len(v) > 0 and isinstance(v[0], dict)
750
+ }
751
+ if not needs_truncation:
752
+ return body
753
+
754
+ result = {}
755
+ total_truncated = {}
756
+ for k, v in parsed.items():
757
+ if k in needs_truncation:
758
+ result[k] = v[:TRUNCATE_LIST_AT]
759
+ total_truncated[k] = len(v)
760
+ else:
761
+ result[k] = v
762
+
763
+ result["_list_truncated"] = {
764
+ "fields": total_truncated,
765
+ "shown_per_field": TRUNCATE_LIST_AT,
766
+ "note": (
767
+ f"List fields truncated: "
768
+ + ", ".join(f"{k} showing {TRUNCATE_LIST_AT}/{n}" for k, n in total_truncated.items())
769
+ + ". Use search_episode_data() to find a specific item from this response."
770
+ )
771
+ }
772
+ return json.dumps(result)
773
+
774
+
775
+ def _is_json(s: str) -> bool:
776
+ try:
777
+ json.loads(s)
778
+ return True
779
+ except (ValueError, TypeError):
780
+ return False
781
+ ```
782
+
783
+ ---
784
+
785
+ ## Tool Call Format in the Episode Prompt
786
+
787
+ The growing episode context uses this format for tool calls and results:
788
+
789
+ ```
790
+ SYSTEM: You are an API agent. Your task is to complete the given task by calling the
791
+ available tools: browser_agent, search_endpoints, curl_exec, search_episode_data, done.
792
+ Complete the task using only HTTP calls to the application at the given URL.
793
+ When a response body is HTML, read hidden input fields directly from the markup to
794
+ extract CSRF tokens (pattern: <input type="hidden" name="_csrf_token" value="...">).
795
+ For form submissions, use Content-Type: application/x-www-form-urlencoded.
796
+
797
+ TASK: Add "Radiant Tee" to a guest cart at http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7770/
798
+
799
+ [session_state: {}]
800
+
801
+ STEP 1 ACTION: browser_agent("Add Radiant Tee to a guest cart", "http://ec2-...:7770/")
802
+ STEP 1 RESULT: {
803
+ "app": "shopping",
804
+ "endpoints": [
805
+ "POST /rest/V1/guest-carts",
806
+ "GET /rest/V1/products",
807
+ "GET /rest/V1/products/{sku}",
808
+ "POST /rest/V1/guest-carts/{id}/items",
809
+ ...
810
+ ],
811
+ "note": "Use search_endpoints() to get full schema for any of these."
812
+ }
813
+
814
+ STEP 2 ACTION: search_endpoints("find product by name, get sku")
815
+ STEP 2 RESULT: [
816
+ "endpoint: GET /rest/V1/products | auth: none | query: searchCriteria filters (field, value, condition_type) | returns: .items[].sku, .items[].name, .items[].price",
817
+ ...
818
+ ]
819
+
820
+ STEP 3 ACTION: curl_exec("curl 'http://ec2-.../rest/V1/products?searchCriteria[filter_groups][0][filters][0][field]=name&searchCriteria[filter_groups][0][filters][0][value]=Radiant+Tee'")
821
+ STEP 3 RESULT: {"status_code": 200, "body": {"items": [{"sku": "MH01", "name": "Radiant Tee", "price": 22.0}], "total_count": 1}}
822
+
823
+ STEP 4 ACTION: search_endpoints("create guest cart, get cart id")
824
+ STEP 4 RESULT: [
825
+ "endpoint: POST /rest/V1/guest-carts | auth: none | no body | returns: cartId string (plain quoted string)",
826
+ ...
827
+ ]
828
+
829
+ STEP 5 ACTION: curl_exec("curl -X POST 'http://ec2-.../rest/V1/guest-carts' -H 'Content-Type: application/json'")
830
+ STEP 5 RESULT: {"status_code": 200, "body": "cart-abc123"}
831
+
832
+ STEP 6 ACTION: search_endpoints("add item to guest cart")
833
+ STEP 6 RESULT: [
834
+ "endpoint: POST /rest/V1/guest-carts/{cartId}/items | auth: none | path: cartId (from POST /rest/V1/guest-carts) | body: cartItem.sku, cartItem.qty, cartItem.quote_id (same as cartId) | returns: item_id",
835
+ ...
836
+ ]
837
+
838
+ STEP 7 ACTION: curl_exec("curl -X POST 'http://ec2-.../rest/V1/guest-carts/cart-abc123/items' -H 'Content-Type: application/json' -d '{\"cartItem\":{\"sku\":\"MH01\",\"qty\":1,\"quote_id\":\"cart-abc123\"}}'")
839
+ STEP 7 RESULT: {"status_code": 200, "body": {"item_id": 5, "sku": "MH01", "qty": 1}}
840
+
841
+ STEP 8 ACTION: done("Radiant Tee (MH01) added to guest cart cart-abc123, item_id 5")
842
+ ```
843
+
844
+ Value threading happens entirely through the multi-turn context. The model reads `"MH01"` from step 2's result and `"cart-abc123"` from step 4's result directly — no explicit store/retrieve tools needed.
845
+
846
+ ---
847
+
__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ """HARvestGym OpenEnv Environment."""
catalogs/forum.json ADDED
@@ -0,0 +1,1517 @@
 
 
+ {
+   "_meta": {
+     "generated": "2026-04-08",
+     "websockets": "none — no Mercure, Pusher, or WebSocket integration found in codebase"
+   },
+   "endpoints": [
+     {
+       "api_type": "form",
+       "route_name": "login_check",
+       "endpoint": "POST /login_check",
+       "auth": "none",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {},
+       "query_params": {},
+       "form_params": {
+         "_username": { "type": "string", "source": "TASK_SPEC", "notes": "Login username" },
+         "_password": { "type": "string", "source": "TASK_SPEC", "notes": "Login password" },
+         "_remember_me": { "type": "checkbox", "source": "STATIC", "notes": "Optional; value 'on' to persist session" },
+         "_csrf_token": { "type": "string", "source": "AUTH_FLOW", "notes": "Extracted from login page HTML; token id = 'authenticate'" }
+       },
+       "response_key_fields": ["Set-Cookie: PHPSESSID", "Set-Cookie: REMEMBERME"]
+     },
+     {
+       "api_type": "form",
+       "route_name": "log_out",
+       "endpoint": "GET /log_out",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {},
+       "query_params": {
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token for logout; csrf_parameter name is 'token'" }
+       },
+       "form_params": {},
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "registration",
+       "endpoint": "POST /registration",
+       "auth": "none",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {},
+       "query_params": {},
+       "form_params": {
+         "user_type[username]": { "type": "string", "source": "TASK_SPEC", "notes": "Username; 3-25 chars" },
+         "user_type[password][first]": { "type": "string", "source": "TASK_SPEC", "notes": "Password" },
+         "user_type[password][second]": { "type": "string", "source": "TASK_SPEC", "notes": "Repeat password" },
+         "user_type[email]": { "type": "string", "source": "TASK_SPEC", "notes": "Optional email" },
+         "user_type[phone]": { "type": "string", "source": "STATIC", "notes": "Honeypot — must be left empty" },
+         "user_type[verification]": { "type": "string", "source": "AUTH_FLOW", "notes": "Captcha answer if registration captcha enabled" },
+         "user_type[_token]": { "type": "string", "source": "AUTH_FLOW", "notes": "Symfony CSRF token extracted from form HTML" }
+       },
+       "response_key_fields": ["redirect to login link on success"]
+     },
+     {
+       "api_type": "form",
+       "route_name": "submit",
+       "endpoint": "POST /submit/{forum_name}",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "TASK_SPEC", "notes": "Optional; forum slug; can be omitted to pre-select from form field" }
+       },
+       "query_params": {},
+       "form_params": {
+         "submission[title]": { "type": "string", "source": "TASK_SPEC", "notes": "Post title; max 300 chars" },
+         "submission[url]": { "type": "string", "source": "TASK_SPEC", "notes": "URL for link posts; optional" },
+         "submission[body]": { "type": "markdown", "source": "TASK_SPEC", "notes": "Post body; optional" },
+         "submission[mediaType]": { "type": "string", "source": "STATIC", "notes": "One of: url, image; only present if user can upload images" },
+         "submission[image]": { "type": "file", "source": "TASK_SPEC", "notes": "Image file if mediaType=image; multipart/form-data required" },
+         "submission[forum]": { "type": "string", "source": "TASK_SPEC", "notes": "Forum name if not in path" },
+         "submission[userFlag]": { "type": "string", "source": "TASK_SPEC", "notes": "Optional user/mod flair flag" },
+         "submission[email]": { "type": "string", "source": "STATIC", "notes": "Honeypot — must be left empty" },
+         "submission[_token]": { "type": "string", "source": "AUTH_FLOW", "notes": "Symfony CSRF token from form HTML" }
+       },
+       "response_key_fields": ["redirect to /f/{forum_name}/{submission_id}/{slug} on success"]
+     },
+     {
+       "api_type": "form",
+       "route_name": "edit_submission",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/edit",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL", "from_endpoint": "submit", "from_field": "forum_name" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL", "from_endpoint": "submit", "from_field": "submission_id" },
+         "slug": { "type": "string", "source": "PREV_CALL", "from_endpoint": "submit", "from_field": "slug", "notes": "Can be '-' as placeholder" }
+       },
+       "query_params": {},
+       "form_params": {
+         "submission[title]": { "type": "string", "source": "TASK_SPEC" },
+         "submission[url]": { "type": "string", "source": "TASK_SPEC", "notes": "Only for url-type submissions" },
+         "submission[body]": { "type": "markdown", "source": "TASK_SPEC" },
+         "submission[userFlag]": { "type": "string", "source": "TASK_SPEC" },
+         "submission[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "submission_delete_own",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/delete",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL" },
+         "slug": { "type": "string", "source": "PREV_CALL", "notes": "Can be '-'" }
+       },
+       "query_params": {},
+       "form_params": {
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'delete_submission'" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "submission_mod_delete",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/mod_delete",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL" },
+         "slug": { "type": "string", "source": "PREV_CALL" }
+       },
+       "query_params": {},
+       "form_params": {
+         "delete_reason[reason]": { "type": "string", "source": "TASK_SPEC", "notes": "Moderator deletion reason" },
+         "delete_reason[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "submission_purge",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/purge",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL" },
+         "slug": { "type": "string", "source": "PREV_CALL" }
+       },
+       "query_params": {},
+       "form_params": {
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'purge_submission'" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "submission_restore",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/restore",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL" },
+         "slug": { "type": "string", "source": "PREV_CALL" }
+       },
+       "query_params": {},
+       "form_params": {
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'restore_submission'" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "lock",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/lock",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL" },
+         "slug": { "type": "string", "source": "PREV_CALL" }
+       },
+       "query_params": {},
+       "form_params": {
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'lock'" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "unlock",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/unlock",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL" },
+         "slug": { "type": "string", "source": "PREV_CALL" }
+       },
+       "query_params": {},
+       "form_params": {
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'lock'" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "pin",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/pin",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL" },
+         "slug": { "type": "string", "source": "PREV_CALL" }
+       },
+       "query_params": {},
+       "form_params": {
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'pin'" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "unpin",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/unpin",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL" },
+         "slug": { "type": "string", "source": "PREV_CALL" }
+       },
+       "query_params": {},
+       "form_params": {
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'pin'" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "submission_flair",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/flair",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL" },
+         "slug": { "type": "string", "source": "PREV_CALL" }
+       },
+       "query_params": {},
+       "form_params": {
+         "custom_text_flair[text]": { "type": "string", "source": "TASK_SPEC", "notes": "Flair text to apply" },
+         "custom_text_flair[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "submission_remove_flairs",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/remove_flairs",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL" },
+         "slug": { "type": "string", "source": "PREV_CALL" }
+       },
+       "query_params": {},
+       "form_params": {
+         "id[]": { "type": "uuid[]", "source": "PREV_CALL", "notes": "Array of flair UUIDs to remove" },
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'remove_flair'" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "submission_vote",
+       "endpoint": "POST /sv/{id}.{_format}",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "id": { "type": "integer", "source": "PREV_CALL", "from_endpoint": "submit or submission list", "from_field": "submission_id" },
+         "_format": { "type": "string", "source": "STATIC", "notes": "html or json; use json for AJAX" }
+       },
+       "query_params": {},
+       "form_params": {
+         "choice": { "type": "integer", "source": "STATIC", "notes": "1 = upvote, -1 = downvote, 0 = retract" },
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'vote'" }
+       },
+       "response_key_fields": ["netScore"]
+     },
+     {
+       "api_type": "form",
+       "route_name": "comment_post",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/comment/{comment_id}",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL" },
+         "slug": { "type": "string", "source": "PREV_CALL", "notes": "Can be '-'" },
+         "comment_id": { "type": "integer", "source": "PREV_CALL", "notes": "Parent comment ID for reply; omit/null for top-level reply to submission" }
+       },
+       "query_params": {},
+       "form_params": {
+         "reply_to_submission_{submissionId}[comment]": { "type": "markdown", "source": "TASK_SPEC", "notes": "Form name is 'reply_to_submission_{id}' for top-level, 'reply_to_comment_{id}' for replies; field path 'body'" },
+         "reply_to_submission_{submissionId}[userFlag]": { "type": "string", "source": "TASK_SPEC", "notes": "Optional flair flag" },
+         "reply_to_submission_{submissionId}[email]": { "type": "string", "source": "STATIC", "notes": "Honeypot — must be left empty" },
+         "reply_to_submission_{submissionId}[_token]": { "type": "string", "source": "AUTH_FLOW", "notes": "Symfony CSRF token" }
+       },
+       "response_key_fields": ["redirect to comment anchor on success"]
+     },
+     {
+       "api_type": "form",
+       "route_name": "edit_comment",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/comment/{comment_id}/edit",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL" },
+         "slug": { "type": "string", "source": "PREV_CALL" },
+         "comment_id": { "type": "integer", "source": "PREV_CALL" }
+       },
+       "query_params": {},
+       "form_params": {
+         "comment[comment]": { "type": "markdown", "source": "TASK_SPEC" },
+         "comment[userFlag]": { "type": "string", "source": "TASK_SPEC" },
+         "comment[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "comment_delete_own",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/comment/{comment_id}/delete_own",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL" },
+         "slug": { "type": "string", "source": "PREV_CALL" },
+         "comment_id": { "type": "integer", "source": "PREV_CALL" }
+       },
+       "query_params": {},
+       "form_params": {
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'delete_own_comment'" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "comment_delete",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/comment/{comment_id}/delete",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL" },
+         "slug": { "type": "string", "source": "PREV_CALL" },
+         "comment_id": { "type": "integer", "source": "PREV_CALL" }
+       },
+       "query_params": {},
+       "form_params": {
+         "delete_reason[reason]": { "type": "string", "source": "TASK_SPEC" },
+         "delete_reason[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "comment_delete_thread",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/comment/{comment_id}/delete_thread",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL" },
+         "slug": { "type": "string", "source": "PREV_CALL" },
+         "comment_id": { "type": "integer", "source": "PREV_CALL" }
+       },
+       "query_params": {},
+       "form_params": {
+         "delete_reason[reason]": { "type": "string", "source": "TASK_SPEC" },
+         "delete_reason[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "comment_purge",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/comment/{comment_id}/purge",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL" },
+         "slug": { "type": "string", "source": "PREV_CALL" },
+         "comment_id": { "type": "integer", "source": "PREV_CALL" }
+       },
+       "query_params": {},
+       "form_params": {
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'purge_comment'" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "comment_restore",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/comment/{comment_id}/restore",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL" },
+         "slug": { "type": "string", "source": "PREV_CALL" },
+         "comment_id": { "type": "integer", "source": "PREV_CALL" }
+       },
+       "query_params": {},
+       "form_params": {
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'restore_comment'" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "comment_vote",
+       "endpoint": "POST /cv/{id}.{_format}",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "id": { "type": "integer", "source": "PREV_CALL", "notes": "Comment ID" },
+         "_format": { "type": "string", "source": "STATIC", "notes": "html or json" }
+       },
+       "query_params": {},
+       "form_params": {
+         "choice": { "type": "integer", "source": "STATIC", "notes": "1 = upvote, -1 = downvote, 0 = retract" },
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'vote'" }
+       },
+       "response_key_fields": ["netScore"]
+     },
+     {
+       "api_type": "form",
+       "route_name": "create_forum",
+       "endpoint": "POST /create_forum",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {},
+       "query_params": {},
+       "form_params": {
+         "forum[name]": { "type": "string", "source": "TASK_SPEC", "notes": "Forum slug; 3-25 chars" },
+         "forum[title]": { "type": "string", "source": "TASK_SPEC", "notes": "Forum display title" },
+         "forum[description]": { "type": "string", "source": "TASK_SPEC" },
+         "forum[sidebar]": { "type": "markdown", "source": "TASK_SPEC" },
+         "forum[tags]": { "type": "string[]", "source": "TASK_SPEC", "notes": "Array of tag names" },
+         "forum[moderationLogPublic]": { "type": "checkbox", "source": "TASK_SPEC", "notes": "Admins/mods only" },
+         "forum[featured]": { "type": "checkbox", "source": "TASK_SPEC", "notes": "ROLE_ADMIN only" },
+         "forum[email]": { "type": "string", "source": "STATIC", "notes": "Honeypot — must be left empty" },
+         "forum[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": ["redirect to /f/{forum_name}"]
+     },
+     {
+       "api_type": "form",
+       "route_name": "edit_forum",
+       "endpoint": "POST /f/{forum_name}/edit",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "TASK_SPEC" }
+       },
+       "query_params": {},
+       "form_params": {
+         "forum[name]": { "type": "string", "source": "TASK_SPEC" },
+         "forum[title]": { "type": "string", "source": "TASK_SPEC" },
+         "forum[description]": { "type": "string", "source": "TASK_SPEC" },
+         "forum[sidebar]": { "type": "markdown", "source": "TASK_SPEC" },
+         "forum[tags]": { "type": "string[]", "source": "TASK_SPEC" },
+         "forum[moderationLogPublic]": { "type": "checkbox", "source": "TASK_SPEC" },
+         "forum[featured]": { "type": "checkbox", "source": "TASK_SPEC", "notes": "ROLE_ADMIN only" },
+         "forum[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "delete_forum",
+       "endpoint": "POST /f/{forum_name}/delete",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "TASK_SPEC" }
+       },
+       "query_params": {},
+       "form_params": {
+         "confirm_deletion[name]": { "type": "string", "source": "DERIVED", "notes": "Must equal the forum name exactly" },
+         "confirm_deletion[confirm]": { "type": "checkbox", "source": "STATIC", "notes": "Must be checked" },
+         "confirm_deletion[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "forum_appearance",
+       "endpoint": "POST /f/{forum_name}/appearance",
+       "auth": "session_cookie+csrf",
+       "content_type": "multipart/form-data",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "TASK_SPEC" }
+       },
+       "query_params": {},
+       "form_params": {
+         "forum_appearance[suggestedTheme]": { "type": "string", "source": "TASK_SPEC", "notes": "Theme UUID or empty" },
+         "forum_appearance[backgroundImage]": { "type": "file", "source": "TASK_SPEC", "notes": "Image file; only if user can upload images" },
+         "forum_appearance[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "subscribe",
+       "endpoint": "POST /f/{forum_name}/subscribe.{_format}",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "TASK_SPEC" },
+         "_format": { "type": "string", "source": "STATIC", "notes": "html or json" }
+       },
+       "query_params": {},
+       "form_params": {
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'subscribe'" }
+       },
+       "response_key_fields": ["subscribed"]
+     },
+     {
+       "api_type": "form",
+       "route_name": "unsubscribe",
+       "endpoint": "POST /f/{forum_name}/unsubscribe.{_format}",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "TASK_SPEC" },
+         "_format": { "type": "string", "source": "STATIC", "notes": "html or json" }
+       },
+       "query_params": {},
+       "form_params": {
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'subscribe'" }
+       },
+       "response_key_fields": ["subscribed"]
+     },
+     {
+       "api_type": "form",
+       "route_name": "add_moderator",
+       "endpoint": "POST /f/{forum_name}/add_moderator",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "TASK_SPEC" }
+       },
+       "query_params": {},
+       "form_params": {
+         "moderator[user]": { "type": "string", "source": "TASK_SPEC", "notes": "Username to add as moderator" },
+         "moderator[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "remove_moderator",
+       "endpoint": "POST /f/{forum_name}/remove_moderator/{moderator_id}",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "TASK_SPEC" },
+         "moderator_id": { "type": "uuid", "source": "PREV_CALL", "notes": "Moderator record UUID from moderators list" }
+       },
+       "query_params": {},
+       "form_params": {
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'remove_moderator'" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "forum_ban",
+       "endpoint": "POST /f/{forum_name}/ban/{username}",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "TASK_SPEC" },
+         "username": { "type": "string", "source": "TASK_SPEC" }
+       },
+       "query_params": {},
+       "form_params": {
+         "forum_ban[reason]": { "type": "string", "source": "TASK_SPEC" },
+         "forum_ban[expires][date]": { "type": "date", "source": "TASK_SPEC", "notes": "Optional expiry date; format YYYY-MM-DD" },
+         "forum_ban[expires][time]": { "type": "time", "source": "TASK_SPEC", "notes": "Optional expiry time; format HH:MM" },
+         "forum_ban[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "forum_unban",
+       "endpoint": "POST /f/{forum_name}/unban/{username}",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "TASK_SPEC" },
+         "username": { "type": "string", "source": "TASK_SPEC" }
+       },
+       "query_params": {},
+       "form_params": {
+         "forum_ban[reason]": { "type": "string", "source": "TASK_SPEC" },
+         "forum_ban[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "forum_tag_edit",
+       "endpoint": "POST /tag/{name}/edit",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "name": { "type": "string", "source": "TASK_SPEC" }
+       },
+       "query_params": {},
+       "form_params": {
+         "forum_tag[name]": { "type": "string", "source": "TASK_SPEC" },
+         "forum_tag[description]": { "type": "string", "source": "TASK_SPEC" },
+         "forum_tag[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "compose_message",
+       "endpoint": "POST /user/{username}/compose_message",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "username": { "type": "string", "source": "TASK_SPEC", "notes": "Recipient username" }
+       },
+       "query_params": {},
+       "form_params": {
+         "message[body]": { "type": "string", "source": "TASK_SPEC" },
+         "message[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": ["redirect to /messages/thread/{id}; thread id from redirect"]
+     },
+     {
+       "api_type": "form",
+       "route_name": "reply_to_message",
+       "endpoint": "POST /message_reply/{id}",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "id": { "type": "integer", "source": "PREV_CALL", "from_endpoint": "compose_message", "notes": "Thread ID" }
+       },
+       "query_params": {},
+       "form_params": {
+         "message[body]": { "type": "string", "source": "TASK_SPEC" },
+         "message[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "delete_message",
+       "endpoint": "POST /messages/message/{id}/delete",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "id": { "type": "uuid", "source": "PREV_CALL", "notes": "Message UUID" }
+       },
+       "query_params": {},
+       "form_params": {
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'delete_message'" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "request_password_reset",
+       "endpoint": "POST /reset_password",
+       "auth": "none",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {},
+       "query_params": {},
+       "form_params": {
+         "request_password_reset[email]": { "type": "string", "source": "TASK_SPEC" },
+         "request_password_reset[verification]": { "type": "string", "source": "AUTH_FLOW", "notes": "Captcha answer" },
+         "request_password_reset[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "password_reset",
+       "endpoint": "POST /reset_password/{id}/{expires}/{checksum}",
+       "auth": "none",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "id": { "type": "integer", "source": "AUTH_FLOW", "notes": "User ID from reset email link" },
+         "expires": { "type": "integer", "source": "AUTH_FLOW", "notes": "Unix timestamp from reset email link" },
+         "checksum": { "type": "string", "source": "AUTH_FLOW", "notes": "HMAC checksum from reset email link" }
+       },
+       "query_params": {},
+       "form_params": {
+         "user[username]": { "type": "string", "source": "TASK_SPEC" },
+         "user[password][first]": { "type": "string", "source": "TASK_SPEC" },
+         "user[password][second]": { "type": "string", "source": "TASK_SPEC" },
+         "user[email]": { "type": "string", "source": "TASK_SPEC", "notes": "Optional" },
+         "user[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "edit_user",
+       "endpoint": "POST /user/{username}/account",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "username": { "type": "string", "source": "AUTH_FLOW", "notes": "Must match authenticated user or admin" }
+       },
+       "query_params": {},
+       "form_params": {
+         "user[username]": { "type": "string", "source": "TASK_SPEC", "notes": "May be disabled if username change not allowed" },
+         "user[password][first]": { "type": "string", "source": "TASK_SPEC", "notes": "Optional; leave blank to keep current" },
+         "user[password][second]": { "type": "string", "source": "TASK_SPEC" },
+         "user[email]": { "type": "string", "source": "TASK_SPEC", "notes": "Optional" },
+         "user[phone]": { "type": "string", "source": "STATIC", "notes": "Honeypot — must be left empty" },
+         "user[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "user_settings",
+       "endpoint": "POST /user/{username}/preferences",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "username": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "query_params": {},
+       "form_params": {
+         "user_settings[locale]": { "type": "string", "source": "TASK_SPEC", "notes": "BCP47 locale code" },
+         "user_settings[frontPage]": { "type": "string", "source": "TASK_SPEC", "notes": "featured|subscribed|all|moderated" },
+         "user_settings[frontPageSortMode]": { "type": "string", "source": "TASK_SPEC", "notes": "hot|new|active" },
+         "user_settings[openExternalLinksInNewTab]": { "type": "checkbox", "source": "TASK_SPEC" },
+         "user_settings[autoFetchSubmissionTitles]": { "type": "checkbox", "source": "TASK_SPEC" },
+         "user_settings[enablePostPreviews]": { "type": "checkbox", "source": "TASK_SPEC" },
+         "user_settings[showThumbnails]": { "type": "checkbox", "source": "TASK_SPEC" },
+         "user_settings[notifyOnReply]": { "type": "checkbox", "source": "TASK_SPEC" },
+         "user_settings[notifyOnMentions]": { "type": "checkbox", "source": "TASK_SPEC" },
+         "user_settings[allowPrivateMessages]": { "type": "checkbox", "source": "TASK_SPEC" },
+         "user_settings[preferredFonts]": { "type": "string", "source": "TASK_SPEC" },
+         "user_settings[nightMode]": { "type": "string", "source": "TASK_SPEC", "notes": "0=auto, 1=light, 2=dark" },
+         "user_settings[preferredTheme]": { "type": "string", "source": "TASK_SPEC", "notes": "Theme UUID" },
+         "user_settings[showCustomStylesheets]": { "type": "checkbox", "source": "TASK_SPEC" },
+         "user_settings[poppersEnabled]": { "type": "checkbox", "source": "TASK_SPEC" },
762
+ "user_settings[fullWidthDisplayEnabled]": { "type": "checkbox", "source": "TASK_SPEC" },
763
+ "user_settings[submissionLinkDestination]": { "type": "string", "source": "TASK_SPEC" },
764
+ "user_settings[_token]": { "type": "string", "source": "AUTH_FLOW" }
765
+ },
766
+ "response_key_fields": []
767
+ },
768
+ {
769
+ "api_type": "form",
770
+ "route_name": "edit_biography",
771
+ "endpoint": "POST /user/{username}/edit_biography",
772
+ "auth": "session_cookie+csrf",
773
+ "content_type": "application/x-www-form-urlencoded",
774
+ "path_params": {
775
+ "username": { "type": "string", "source": "AUTH_FLOW" }
776
+ },
777
+ "query_params": {},
778
+ "form_params": {
779
+ "user_biography[biography]": { "type": "markdown", "source": "TASK_SPEC" },
780
+ "user_biography[_token]": { "type": "string", "source": "AUTH_FLOW" }
781
+ },
782
+ "response_key_fields": []
783
+ },
784
+ {
785
+ "api_type": "form",
786
+ "route_name": "delete_account",
787
+ "endpoint": "POST /user/{username}/delete_account",
788
+ "auth": "session_cookie+csrf",
789
+ "content_type": "application/x-www-form-urlencoded",
790
+ "path_params": {
791
+ "username": { "type": "string", "source": "AUTH_FLOW" }
792
+ },
793
+ "query_params": {},
794
+ "form_params": {
795
+ "confirm_deletion[name]": { "type": "string", "source": "DERIVED", "notes": "Must equal the username exactly" },
796
+ "confirm_deletion[confirm]": { "type": "checkbox", "source": "STATIC" },
797
+ "confirm_deletion[_token]": { "type": "string", "source": "AUTH_FLOW" }
798
+ },
799
+ "response_key_fields": []
800
+ },
801
+ {
802
+ "api_type": "form",
803
+ "route_name": "block_user",
804
+ "endpoint": "POST /user/{username}/block_user",
805
+ "auth": "session_cookie+csrf",
806
+ "content_type": "application/x-www-form-urlencoded",
807
+ "path_params": {
808
+ "username": { "type": "string", "source": "TASK_SPEC", "notes": "User to block" }
809
+ },
810
+ "query_params": {},
811
+ "form_params": {
812
+ "user_block[comment]": { "type": "string", "source": "TASK_SPEC", "notes": "Optional private note about block" },
813
+ "user_block[_token]": { "type": "string", "source": "AUTH_FLOW" }
814
+ },
815
+ "response_key_fields": []
816
+ },
817
+ {
818
+ "api_type": "form",
819
+ "route_name": "unblock_user",
820
+ "endpoint": "POST /user/{username}/unblock_user",
821
+ "auth": "session_cookie+csrf",
822
+ "content_type": "application/x-www-form-urlencoded",
823
+ "path_params": {
824
+ "username": { "type": "string", "source": "TASK_SPEC" }
825
+ },
826
+ "query_params": {},
827
+ "form_params": {
828
+ "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'unblock'" }
829
+ },
830
+ "response_key_fields": []
831
+ },
832
+ {
833
+ "api_type": "form",
834
+ "route_name": "clear_notifications",
835
+ "endpoint": "POST /clear_notifications",
836
+ "auth": "session_cookie+csrf",
837
+ "content_type": "application/x-www-form-urlencoded",
838
+ "path_params": {},
839
+ "query_params": {},
840
+ "form_params": {
841
+ "id[]": { "type": "uuid[]", "source": "PREV_CALL", "notes": "Array of notification UUIDs to clear; omit to clear all" },
842
+ "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'clear_notifications'" }
843
+ },
844
+ "response_key_fields": []
845
+ },
846
+ {
+ "api_type": "form",
+ "route_name": "user_whitelist",
+ "endpoint": "POST /user/{username}/whitelist",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {
+ "username": { "type": "string", "source": "TASK_SPEC" }
+ },
+ "query_params": {},
+ "form_params": {
+ "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'whitelist'" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "user_dewhitelist",
+ "endpoint": "POST /user/{username}/dewhitelist",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {
+ "username": { "type": "string", "source": "TASK_SPEC" }
+ },
+ "query_params": {},
+ "form_params": {
+ "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'whitelist'" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "hide_forum",
+ "endpoint": "POST /user/{username}/hide_forum/{forum}",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {
+ "username": { "type": "string", "source": "AUTH_FLOW" },
+ "forum": { "type": "string", "source": "TASK_SPEC", "notes": "Forum name" }
+ },
+ "query_params": {},
+ "form_params": {
+ "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'hide_forum'" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "unhide_forum",
+ "endpoint": "POST /user/{username}/unhide_forum/{forum}",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {
+ "username": { "type": "string", "source": "AUTH_FLOW" },
+ "forum": { "type": "string", "source": "TASK_SPEC" }
+ },
+ "query_params": {},
+ "form_params": {
+ "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'hide_forum'" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "change_night_mode",
+ "endpoint": "POST /night_mode.{_format}",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {
+ "_format": { "type": "string", "source": "STATIC", "notes": "html or json" }
+ },
+ "query_params": {},
+ "form_params": {
+ "nightMode": { "type": "string", "source": "TASK_SPEC", "notes": "0=auto, 1=light, 2=dark" },
+ "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'night_mode'" }
+ },
+ "response_key_fields": ["nightMode"]
+ },
+ {
+ "api_type": "form",
+ "route_name": "ban_user",
+ "endpoint": "POST /bans/ban_user/{username}",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {
+ "username": { "type": "string", "source": "TASK_SPEC" }
+ },
+ "query_params": {},
+ "form_params": {
+ "ban_user[reason]": { "type": "string", "source": "TASK_SPEC" },
+ "ban_user[expires][date]": { "type": "date", "source": "TASK_SPEC", "notes": "Optional; YYYY-MM-DD" },
+ "ban_user[expires][time]": { "type": "time", "source": "TASK_SPEC", "notes": "Optional; HH:MM" },
+ "ban_user[ban_ip]": { "type": "checkbox", "source": "TASK_SPEC", "notes": "Also ban associated IPs" },
+ "ban_user[ips]": { "type": "string", "source": "DERIVED", "notes": "Comma/newline-separated IP list; auto-populated from user history" },
+ "ban_user[_token]": { "type": "string", "source": "AUTH_FLOW" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "unban_user",
+ "endpoint": "POST /bans/unban_user/{username}",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {
+ "username": { "type": "string", "source": "TASK_SPEC" }
+ },
+ "query_params": {},
+ "form_params": {
+ "unban_user[reason]": { "type": "string", "source": "TASK_SPEC" },
+ "unban_user[unban_ips]": { "type": "checkbox", "source": "TASK_SPEC", "notes": "Also lift associated IP bans" },
+ "unban_user[_token]": { "type": "string", "source": "AUTH_FLOW" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "ban_ip",
+ "endpoint": "POST /bans/ban_ip",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {},
+ "query_params": {},
+ "form_params": {
+ "ip_ban[ip]": { "type": "string", "source": "TASK_SPEC", "notes": "IP address to ban" },
+ "ip_ban[reason]": { "type": "string", "source": "TASK_SPEC" },
+ "ip_ban[expires][date]": { "type": "date", "source": "TASK_SPEC", "notes": "Optional" },
+ "ip_ban[expires][time]": { "type": "time", "source": "TASK_SPEC", "notes": "Optional" },
+ "ip_ban[user]": { "type": "string", "source": "TASK_SPEC", "notes": "Optional; username associated with this IP" },
+ "ip_ban[_token]": { "type": "string", "source": "AUTH_FLOW" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "unban_ips",
+ "endpoint": "POST /bans/unban_ips",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {},
+ "query_params": {},
+ "form_params": {
+ "ban[]": { "type": "integer[]", "source": "PREV_CALL", "notes": "Array of IP ban IDs to remove" },
+ "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'unban_ips'" }
+ },
+ "response_key_fields": []
+ },
993
+ {
+ "api_type": "form",
+ "route_name": "bad_phrase_add",
+ "endpoint": "POST /site/bad_phrases/add",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {},
+ "query_params": {},
+ "form_params": {
+ "bad_phrase[phrase]": { "type": "string", "source": "TASK_SPEC" },
+ "bad_phrase[phraseType]": { "type": "string", "source": "STATIC", "notes": "text or regex" },
+ "bad_phrase[_token]": { "type": "string", "source": "AUTH_FLOW" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "bad_phrase_remove",
+ "endpoint": "POST /site/bad_phrases/remove",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {},
+ "query_params": {},
+ "form_params": {
+ "remove_bad_phrase[]": { "type": "uuid[]", "source": "PREV_CALL", "notes": "Array of bad phrase UUIDs to remove" },
+ "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'remove_bad_phrase'" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "site_settings",
+ "endpoint": "POST /site/settings",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {},
+ "query_params": {},
+ "form_params": {
+ "site_settings[siteName]": { "type": "string", "source": "TASK_SPEC" },
+ "site_settings[registrationOpen]": { "type": "checkbox", "source": "TASK_SPEC" },
+ "site_settings[usernameChangeEnabled]": { "type": "checkbox", "source": "TASK_SPEC" },
+ "site_settings[unwhitelistedUserMessagesEnabled]": { "type": "checkbox", "source": "TASK_SPEC" },
+ "site_settings[defaultSortMode]": { "type": "string", "source": "TASK_SPEC", "notes": "hot|active|new" },
+ "site_settings[defaultTheme]": { "type": "string", "source": "TASK_SPEC", "notes": "Theme UUID or empty" },
+ "site_settings[urlImagesEnabled]": { "type": "checkbox", "source": "TASK_SPEC" },
+ "site_settings[trashEnabled]": { "type": "checkbox", "source": "TASK_SPEC" },
+ "site_settings[wikiEnabled]": { "type": "checkbox", "source": "TASK_SPEC" },
+ "site_settings[wikiLogPublic]": { "type": "checkbox", "source": "TASK_SPEC" },
+ "site_settings[forumCreateRole]": { "type": "string", "source": "TASK_SPEC", "notes": "ROLE_ADMIN|ROLE_WHITELISTED|ROLE_USER" },
+ "site_settings[moderatorsCanSetForumLogVisibility]": { "type": "checkbox", "source": "TASK_SPEC" },
+ "site_settings[imageUploadRole]": { "type": "string", "source": "TASK_SPEC" },
+ "site_settings[wikiEditRole]": { "type": "string", "source": "TASK_SPEC" },
+ "site_settings[registrationCaptchaEnabled]": { "type": "checkbox", "source": "TASK_SPEC" },
+ "site_settings[submissionLinkDestination]": { "type": "string", "source": "TASK_SPEC" },
+ "site_settings[_token]": { "type": "string", "source": "AUTH_FLOW" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "theme_create_css",
+ "endpoint": "POST /site/themes/css/create",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {},
+ "query_params": {},
+ "form_params": {
+ "css_theme[name]": { "type": "string", "source": "TASK_SPEC" },
+ "css_theme[css]": { "type": "string", "source": "TASK_SPEC" },
+ "css_theme[_token]": { "type": "string", "source": "AUTH_FLOW" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "theme_edit_css",
+ "endpoint": "POST /site/themes/css/{id}/edit",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {
+ "id": { "type": "uuid", "source": "PREV_CALL" }
+ },
+ "query_params": {},
+ "form_params": {
+ "css_theme[name]": { "type": "string", "source": "TASK_SPEC" },
+ "css_theme[css]": { "type": "string", "source": "TASK_SPEC" },
+ "css_theme[_token]": { "type": "string", "source": "AUTH_FLOW" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "theme_delete",
+ "endpoint": "POST /site/themes/{id}/delete",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {
+ "id": { "type": "uuid", "source": "PREV_CALL" }
+ },
+ "query_params": {},
+ "form_params": {
+ "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'delete_theme'" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "theme_sync",
+ "endpoint": "POST /site/themes/sync",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {},
+ "query_params": {},
+ "form_params": {
+ "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'sync_themes'" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "wiki_create",
+ "endpoint": "POST /wiki/_create/{path}",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {
+ "path": { "type": "string", "source": "TASK_SPEC", "notes": "Wiki page path; can be empty to enter in form" }
+ },
+ "query_params": {},
+ "form_params": {
+ "wiki[path]": { "type": "string", "source": "TASK_SPEC", "notes": "Only present when path is empty" },
+ "wiki[title]": { "type": "string", "source": "TASK_SPEC" },
+ "wiki[body]": { "type": "markdown", "source": "TASK_SPEC" },
+ "wiki[_token]": { "type": "string", "source": "AUTH_FLOW" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "wiki_edit",
+ "endpoint": "POST /wiki/_edit/{path}",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {
+ "path": { "type": "string", "source": "TASK_SPEC" }
+ },
+ "query_params": {},
+ "form_params": {
+ "wiki[title]": { "type": "string", "source": "TASK_SPEC" },
+ "wiki[body]": { "type": "markdown", "source": "TASK_SPEC" },
+ "wiki[_token]": { "type": "string", "source": "AUTH_FLOW" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "wiki_delete",
+ "endpoint": "POST /wiki/_delete/{path}",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {
+ "path": { "type": "string", "source": "TASK_SPEC" }
+ },
+ "query_params": {},
+ "form_params": {
+ "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'wiki_delete'" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "wiki_lock",
+ "endpoint": "POST /wiki/_lock/{path}",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {
+ "path": { "type": "string", "source": "TASK_SPEC" }
+ },
+ "query_params": {},
+ "form_params": {
+ "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'wiki_lock'" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "wiki_unlock",
+ "endpoint": "POST /wiki/_unlock/{path}",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {
+ "path": { "type": "string", "source": "TASK_SPEC" }
+ },
+ "query_params": {},
+ "form_params": {
+ "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'wiki_lock'" }
+ },
+ "response_key_fields": []
+ },
1190
+ },
1191
+ {
1192
+ "api_type": "rest",
1193
+ "route_name": "fetch_title",
1194
+ "endpoint": "POST /ft.json",
1195
+ "auth": "session_cookie",
1196
+ "content_type": "application/x-www-form-urlencoded",
1197
+ "path_params": {},
1198
+ "query_params": {},
1199
+ "form_params": {
1200
+ "url": { "type": "string", "source": "TASK_SPEC", "notes": "URL whose title to fetch; must be a valid URL" }
1201
+ },
1202
+ "body_params": {},
1203
+ "response_key_fields": ["title"]
1204
+ },
1205
+ {
1206
+ "api_type": "rest",
1207
+ "route_name": "user_popper",
1208
+ "endpoint": "GET /_up/{username}",
1209
+ "auth": "none",
1210
+ "content_type": "none",
1211
+ "path_params": {
1212
+ "username": { "type": "string", "source": "TASK_SPEC" }
1213
+ },
1214
+ "query_params": {},
1215
+ "body_params": {},
1216
+ "response_key_fields": ["HTML fragment for user popover"]
1217
+ },
1218
+ {
1219
+ "api_type": "rest",
1220
+ "route_name": "comment_json",
1221
+ "endpoint": "GET /f/{forum_name}/{submission_id}/{slug}/comment/{comment_id}.json",
1222
+ "auth": "session_cookie",
1223
+ "content_type": "none",
1224
+ "path_params": {
1225
+ "forum_name": { "type": "string", "source": "PREV_CALL" },
1226
+ "submission_id": { "type": "integer", "source": "PREV_CALL" },
1227
+ "slug": { "type": "string", "source": "PREV_CALL" },
1228
+ "comment_id": { "type": "integer", "source": "PREV_CALL" }
1229
+ },
1230
+ "query_params": {},
1231
+ "body_params": {},
1232
+ "response_key_fields": ["id", "body", "author", "created_at", "net_score", "visibility"]
1233
+ },
1234
+ {
1235
+ "api_type": "rest",
1236
+ "route_name": "submission_json",
1237
+ "endpoint": "GET /f/{forum_name}/{submission_id}.json",
1238
+ "auth": "session_cookie",
1239
+ "content_type": "none",
1240
+ "path_params": {
1241
+ "forum_name": { "type": "string", "source": "PREV_CALL" },
1242
+ "submission_id": { "type": "integer", "source": "PREV_CALL" }
1243
+ },
1244
+ "query_params": {},
1245
+ "body_params": {},
1246
+ "response_key_fields": ["id", "title", "url", "body", "author", "forum", "created_at", "net_score"]
1247
+ },
1248
+ {
1249
+ "api_type": "rest",
1250
+ "route_name": "api_comments_list",
1251
+ "endpoint": "GET /api/comments",
1252
+ "auth": "session_cookie",
1253
+ "content_type": "application/json",
1254
+ "path_params": {},
1255
+ "query_params": {},
1256
+ "body_params": {},
1257
+ "response_key_fields": ["id", "body", "author", "submission", "created_at", "net_score"]
1258
+ },
1259
+ {
1260
+ "api_type": "rest",
1261
+ "route_name": "api_comment_read",
1262
+ "endpoint": "GET /api/comments/{id}",
1263
+ "auth": "session_cookie",
1264
+ "content_type": "application/json",
1265
+ "path_params": {
1266
+ "id": { "type": "integer", "source": "PREV_CALL" }
1267
+ },
1268
+ "query_params": {},
1269
+ "body_params": {},
1270
+ "response_key_fields": ["id", "body", "author", "submission", "created_at", "net_score"]
1271
+ },
1272
+ {
1273
+ "api_type": "rest",
1274
+ "route_name": "api_comment_update",
1275
+ "endpoint": "PUT /api/comments/{id}",
1276
+ "auth": "session_cookie",
1277
+ "content_type": "application/json",
1278
+ "path_params": {
1279
+ "id": { "type": "integer", "source": "PREV_CALL" }
1280
+ },
1281
+ "query_params": {},
1282
+ "body_params": {
1283
+ "body": { "type": "string", "source": "TASK_SPEC", "notes": "Comment body markdown; denormalization group: comment:update" }
1284
+ },
1285
+ "response_key_fields": []
1286
+ },
1287
+ {
1288
+ "api_type": "rest",
1289
+ "route_name": "api_forum_read",
1290
+ "endpoint": "GET /api/forums/{id}",
1291
+ "auth": "session_cookie",
1292
+ "content_type": "application/json",
1293
+ "path_params": {
1294
+ "id": { "type": "integer", "source": "PREV_CALL" }
1295
+ },
1296
+ "query_params": {},
1297
+ "body_params": {},
1298
+ "response_key_fields": ["id", "name", "title", "description", "sidebar", "created_at", "subscriber_count"]
1299
+ },
1300
+ {
1301
+ "api_type": "rest",
1302
+ "route_name": "api_forum_read_by_name",
1303
+ "endpoint": "GET /api/forums/by_name/{name}",
1304
+ "auth": "session_cookie",
1305
+ "content_type": "application/json",
1306
+ "path_params": {
1307
+ "name": { "type": "string", "source": "TASK_SPEC" }
1308
+ },
1309
+ "query_params": {},
1310
+ "body_params": {},
1311
+ "response_key_fields": ["id", "name", "title", "description", "sidebar", "created_at", "subscriber_count"]
1312
+ },
1313
+ {
1314
+ "api_type": "rest",
1315
+ "route_name": "api_forum_create",
1316
+ "endpoint": "POST /api/forums",
1317
+ "auth": "session_cookie",
1318
+ "content_type": "application/json",
1319
+ "path_params": {},
1320
+ "query_params": {},
1321
+ "body_params": {
1322
+ "name": { "type": "string", "source": "TASK_SPEC", "notes": "Forum slug; denormalization group: forum:create" },
1323
+ "title": { "type": "string", "source": "TASK_SPEC" },
1324
+ "description": { "type": "string", "source": "TASK_SPEC" },
1325
+ "sidebar": { "type": "string", "source": "TASK_SPEC" }
1326
+ },
1327
+ "response_key_fields": ["id", "name"]
1328
+ },
1329
+ {
1330
+ "api_type": "rest",
1331
+ "route_name": "api_forum_update",
1332
+ "endpoint": "PUT /api/forums/{id}",
1333
+ "auth": "session_cookie",
1334
+ "content_type": "application/json",
1335
+ "path_params": {
1336
+ "id": { "type": "integer", "source": "PREV_CALL" }
1337
+ },
1338
+ "query_params": {},
1339
+ "body_params": {
1340
+ "title": { "type": "string", "source": "TASK_SPEC", "notes": "denormalization group: forum:update" },
1341
+ "description": { "type": "string", "source": "TASK_SPEC" },
1342
+ "sidebar": { "type": "string", "source": "TASK_SPEC" }
1343
+ },
1344
+ "response_key_fields": []
1345
+ },
1346
+ {
+ "api_type": "rest",
+ "route_name": "api_submissions_list",
+ "endpoint": "GET /api/submissions",
+ "auth": "session_cookie",
+ "content_type": "application/json",
+ "path_params": {},
+ "query_params": {
+ "sortBy": { "type": "string", "source": "TASK_SPEC", "notes": "hot|new|active; defaults to user preference" },
+ "filter": { "type": "string", "source": "TASK_SPEC", "notes": "featured|subscribed|moderated|all; defaults to user preference" }
+ },
+ "body_params": {},
+ "response_key_fields": ["id", "title", "url", "body", "forum", "author", "created_at", "net_score"]
+ },
+ {
+ "api_type": "rest",
+ "route_name": "api_submission_read",
+ "endpoint": "GET /api/submissions/{id}",
+ "auth": "session_cookie",
+ "content_type": "application/json",
+ "path_params": {
+ "id": { "type": "integer", "source": "PREV_CALL" }
+ },
+ "query_params": {},
+ "body_params": {},
+ "response_key_fields": ["id", "title", "url", "body", "forum", "author", "created_at", "net_score"]
+ },
+ {
+ "api_type": "rest",
+ "route_name": "api_submission_create",
+ "endpoint": "POST /api/submissions",
+ "auth": "session_cookie",
+ "content_type": "application/json",
+ "path_params": {},
+ "query_params": {},
+ "body_params": {
+ "title": { "type": "string", "source": "TASK_SPEC", "notes": "denormalization group: submission:create" },
+ "url": { "type": "string", "source": "TASK_SPEC" },
+ "body": { "type": "string", "source": "TASK_SPEC" },
+ "forum": { "type": "string|integer", "source": "TASK_SPEC", "notes": "Forum name or ID" },
+ "mediaType": { "type": "string", "source": "STATIC", "notes": "url|image" }
+ },
+ "response_key_fields": ["id", "forum", "title"]
+ },
+ {
+ "api_type": "rest",
+ "route_name": "api_submission_update",
+ "endpoint": "PUT /api/submissions/{id}",
+ "auth": "session_cookie",
+ "content_type": "application/json",
+ "path_params": {
+ "id": { "type": "integer", "source": "PREV_CALL" }
+ },
+ "query_params": {},
+ "body_params": {
+ "title": { "type": "string", "source": "TASK_SPEC", "notes": "denormalization group: submission:update" },
+ "url": { "type": "string", "source": "TASK_SPEC" },
+ "body": { "type": "string", "source": "TASK_SPEC" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "rest",
+ "route_name": "api_submission_delete",
+ "endpoint": "DELETE /api/submissions/{id}",
+ "auth": "session_cookie",
+ "content_type": "application/json",
+ "path_params": {
+ "id": { "type": "integer", "source": "PREV_CALL" }
+ },
+ "query_params": {},
+ "body_params": {},
+ "response_key_fields": []
+ },
+ {
+ "api_type": "rest",
+ "route_name": "api_submission_comments",
+ "endpoint": "GET /api/submissions/{id}/comments",
+ "auth": "session_cookie",
+ "content_type": "application/json",
+ "path_params": {
+ "id": { "type": "integer", "source": "PREV_CALL" }
+ },
+ "query_params": {},
+ "body_params": {},
+ "response_key_fields": ["id", "body", "author", "net_score", "replies"]
+ },
+ {
+ "api_type": "rest",
+ "route_name": "api_user_read",
+ "endpoint": "GET /api/users/{id}",
+ "auth": "session_cookie",
+ "content_type": "application/json",
+ "path_params": {
+ "id": { "type": "integer", "source": "PREV_CALL" }
+ },
+ "query_params": {},
+ "body_params": {},
+ "response_key_fields": ["id", "username", "created_at"]
+ },
+ {
+ "api_type": "rest",
+ "route_name": "api_user_self",
+ "endpoint": "GET /api/users/self",
+ "auth": "session_cookie",
+ "content_type": "application/json",
+ "path_params": {},
+ "query_params": {},
+ "body_params": {},
+ "response_key_fields": ["id", "username"]
+ },
+ {
+ "api_type": "rest",
+ "route_name": "api_user_read_preferences",
+ "endpoint": "GET /api/users/{id}/preferences",
+ "auth": "session_cookie",
+ "content_type": "application/json",
+ "path_params": {
+ "id": { "type": "integer", "source": "PREV_CALL" }
+ },
+ "query_params": {},
+ "body_params": {},
+ "response_key_fields": ["locale", "front_page", "front_page_sort_mode", "night_mode", "notify_on_reply"]
+ },
+ {
+ "api_type": "rest",
+ "route_name": "api_user_update_preferences",
+ "endpoint": "PUT /api/users/{id}/preferences",
+ "auth": "session_cookie",
+ "content_type": "application/json",
+ "path_params": {
+ "id": { "type": "integer", "source": "PREV_CALL" }
+ },
+ "query_params": {},
+ "body_params": {
+ "locale": { "type": "string", "source": "TASK_SPEC", "notes": "denormalization group: user:preferences" },
+ "front_page": { "type": "string", "source": "TASK_SPEC" },
+ "front_page_sort_mode": { "type": "string", "source": "TASK_SPEC" },
+ "night_mode": { "type": "integer", "source": "TASK_SPEC" },
+ "notify_on_reply": { "type": "boolean", "source": "TASK_SPEC" },
+ "notify_on_mentions": { "type": "boolean", "source": "TASK_SPEC" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "rest",
+ "route_name": "api_user_submissions",
+ "endpoint": "GET /api/users/{id}/submissions",
+ "auth": "session_cookie",
+ "content_type": "application/json",
+ "path_params": {
+ "id": { "type": "integer", "source": "PREV_CALL" }
+ },
+ "query_params": {},
+ "body_params": {},
+ "response_key_fields": ["id", "title", "forum", "net_score"]
+ },
+ {
+ "api_type": "rest",
+ "route_name": "api_user_moderator_of",
+ "endpoint": "GET /api/users/{id}/moderator_of",
+ "auth": "session_cookie",
+ "content_type": "application/json",
+ "path_params": {
+ "id": { "type": "integer", "source": "PREV_CALL" }
+ },
+ "query_params": {},
+ "body_params": {},
+ "response_key_fields": ["entries[].forum", "entries[].user"]
+ }
+ ]
+ }
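The form endpoints in this catalog all post Symfony-style nested field names (`user_settings[locale]`, `user[_token]`, and so on) as `application/x-www-form-urlencoded`, with a CSRF `_token` scraped from the form's hidden input. A minimal sketch of how such a body could be built in Python; the field values and token string below are hypothetical placeholders, not values taken from the catalog:

```python
from urllib.parse import urlencode

# Hypothetical values; the real _token must be scraped from the rendered
# form's hidden input (source: AUTH_FLOW in the catalog above).
fields = {
    "user_settings[locale]": "en",
    "user_settings[frontPage]": "subscribed",
    "user_settings[nightMode]": "2",
    "user_settings[_token]": "csrf-token-from-hidden-input",
}

# urlencode percent-escapes the brackets in the nested field names,
# producing a body suitable for an application/x-www-form-urlencoded POST.
body = urlencode(fields)
print(body)
```

The bracket characters are percent-encoded on the wire (`user_settings%5Blocale%5D=en`), which the server decodes back into the nested form structure.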
catalogs/osm.json ADDED
The diff for this file is too large to render. See raw diff
 
catalogs/shopping.json ADDED
The diff for this file is too large to render. See raw diff
 
catalogs/shopping_admin.json ADDED
The diff for this file is too large to render. See raw diff
 
catalogs/wikipedia.json ADDED
@@ -0,0 +1,48 @@
+ {
+   "_meta": {
+     "generated": "2026-04-08",
+     "source": "hardcoded — kiwix-serve binary serves a static ZIM file; no application source to analyze",
+     "zim_file": "/data/wikipedia_en_all_maxi_2022-05.zim",
+     "search_response": "HTML only — GET /search returns HTML page; agent must parse <a href> links for article URLs",
+     "article_page": "GET /wikipedia_en_all_maxi_2022-05/A/{title} — returns HTML article",
+     "websockets": "none"
+   },
+   "endpoints": [
+     {
+       "api_type": "rest",
+       "endpoint": "GET /search",
+       "auth": "none",
+       "query_params": {
+         "pattern": {
+           "type": "string",
+           "source": "TASK_SPEC",
+           "notes": "the search query, URL-encoded"
+         },
+         "books.name": {
+           "type": "string",
+           "source": "STATIC",
+           "value": "wikipedia_en_all_maxi_2022-05",
+           "notes": "selects which ZIM book to search"
+         }
+       },
+       "response_key_fields": [],
+       "notes": "IMPORTANT: response is HTML, not JSON. Parse <a href> anchor links matching /wikipedia_en_all_maxi_2022-05/A/... to extract article slugs. The .results[0].url jq path does NOT apply — use HTML parsing."
+     },
+     {
+       "api_type": "rest",
+       "endpoint": "GET /wikipedia_en_all_maxi_2022-05/A/{article_title}",
+       "auth": "none",
+       "path_params": {
+         "article_title": {
+           "type": "string",
+           "source": "PREV_CALL",
+           "from_endpoint": "GET /search",
+           "from_field": "href attribute of first search result <a> tag",
+           "notes": "URL-encoded article slug, e.g. Albert_Einstein. Extract from the href on the search results HTML page. Verified live: HTTP 200 for valid titles."
+         }
+       },
+       "response_key_fields": [],
+       "notes": "Returns full HTML article page. HTTP 200 when article exists, 404 when not found."
+     }
+   ]
+ }
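Because `/search` serves HTML rather than JSON, the article slug has to be scraped out of the result anchors, exactly as the catalog notes stress. A stdlib-only sketch of that extraction (the href prefix matches the catalog above; the class name and sample markup are ours):

```python
from html.parser import HTMLParser

class SearchResultParser(HTMLParser):
    """Collect article slugs from kiwix-serve search result anchors."""

    PREFIX = "/wikipedia_en_all_maxi_2022-05/A/"

    def __init__(self) -> None:
        super().__init__()
        self.slugs = []

    def handle_starttag(self, tag, attrs) -> None:
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if self.PREFIX in href:
            # keep only the URL-encoded slug after the /A/ namespace
            self.slugs.append(href.split("/A/", 1)[1])

page = '<ul><li><a href="/wikipedia_en_all_maxi_2022-05/A/Albert_Einstein">Albert Einstein</a></li></ul>'
parser = SearchResultParser()
parser.feed(page)
print(parser.slugs)  # ['Albert_Einstein']
```

The first collected slug can then be plugged straight into `GET /wikipedia_en_all_maxi_2022-05/A/{article_title}`.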
client.py ADDED
@@ -0,0 +1,59 @@
+ """HARvestGym client for interacting with the environment server."""
+
+ from typing import Dict
+
+ from openenv.core import EnvClient
+ from openenv.core.client_types import StepResult
+ from openenv.core.env_server.types import State
+
+ try:
+     from .server.models import HarvestGymAction, HarvestGymObservation
+ except ModuleNotFoundError:
+     from server.models import HarvestGymAction, HarvestGymObservation
+
+
+ class HARvestGymEnv(EnvClient[HarvestGymAction, HarvestGymObservation, State]):
+     """
+     Client for the HARvestGym Environment.
+
+     Example:
+         >>> async with HARvestGymEnv(base_url="http://localhost:8000") as env:
+         ...     result = await env.reset()
+         ...     result = await env.step(HarvestGymAction(
+         ...         tool="browser_agent",
+         ...         args={"task": "List products in category Gear",
+         ...               "url": "http://ec2-.../"}
+         ...     ))
+     """
+
+     def _step_payload(self, action: HarvestGymAction) -> Dict:
+         return {
+             "tool": action.tool,
+             "args": action.args,
+         }
+
+     def _parse_result(self, payload: Dict) -> StepResult[HarvestGymObservation]:
+         obs_data = payload.get("observation", {})
+         observation = HarvestGymObservation(
+             task=obs_data.get("task", ""),
+             app_base_url=obs_data.get("app_base_url", ""),
+             last_tool_result=obs_data.get("last_tool_result"),
+             history=obs_data.get("history", []),
+             session_state=obs_data.get("session_state", {}),
+             step_count=obs_data.get("step_count", 0),
+             max_steps=obs_data.get("max_steps", 20),
+             done=payload.get("done", False),
+             reward=payload.get("reward"),
+             metadata=obs_data.get("metadata", {}),
+         )
+         return StepResult(
+             observation=observation,
+             reward=payload.get("reward"),
+             done=payload.get("done", False),
+         )
+
+     def _parse_state(self, payload: Dict) -> State:
+         return State(
+             episode_id=payload.get("episode_id"),
+             step_count=payload.get("step_count", 0),
+         )
hars/forum.har ADDED
The diff for this file is too large to render. See raw diff
 
hars/shopping.har ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dc116ba8f3cb52e5fe8335dcaf1eefbb88161df4d494f30832338f57bbe52ed9
+ size 13392889
hars/shopping_admin.har ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1c9d48fde1cc1f65c0e81ff9a46d1b23fece9c352b1c548de91ca848ee2411f1
+ size 60961456
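The three-line `.har` files above are Git LFS pointers, not the HAR payloads themselves; the real archives are fetched by content hash at checkout. A small sketch of reading such a pointer (the field layout follows the LFS v1 pointer text shown above; the helper name is ours):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer file into its key/value fields."""
    # each pointer line is "<key> <value>"
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "oid_algo": algo,
        "oid": digest,
        "size": int(fields["size"]),
    }

pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:dc116ba8f3cb52e5fe8335dcaf1eefbb88161df4d494f30832338f57bbe52ed9\n"
    "size 13392889\n"
)
info = parse_lfs_pointer(pointer)
print(info["oid_algo"], info["size"])  # sha256 13392889
```

The `size` field is the byte length of the real HAR file, which is why the 13 MB and 60 MB archives render as three-line diffs here.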
hars/wikipedia.har ADDED
The diff for this file is too large to render. See raw diff
 
inference.py ADDED
@@ -0,0 +1,375 @@
+ """
+ HARvestGym — Inference Script
+ ==============================
+
+ Runs the RL agent (driven by an LLM via OpenAI client) through three tasks:
+   1. har_classify_easy — Template 1: list products in a category
+   2. har_classify_medium — Template 3: add product to guest cart
+   3. har_pipeline_hard — Template 6: complete guest checkout
+
+ STDOUT FORMAT (strictly enforced by hackathon):
+   [START] task=<task_name> env=<benchmark> model=<model_name>
+   [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
+   [END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
+
+ Usage:
+   HF_TOKEN=hf_xxx uv run inference.py
+   HF_TOKEN=hf_xxx MODEL_NAME=Qwen/Qwen2.5-72B-Instruct uv run inference.py
+ """
+
+ import asyncio
+ import json
+ import os
+ import sys
+ import textwrap
+ from typing import Any, List, Optional
+
+ from openai import OpenAI
+
+ # ---------------------------------------------------------------------------
+ # Configuration
+ # ---------------------------------------------------------------------------
+
+ API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
+ MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
+ HF_TOKEN = os.getenv("HF_TOKEN")
+
+ if not HF_TOKEN:
+     raise ValueError(
+         "HF_TOKEN environment variable is required but not set. "
+         "Export it with: export HF_TOKEN=hf_your_token_here"
+     )
+
+ BENCHMARK = "harvgym"
+ MAX_STEPS = 20
+ TEMPERATURE = 0.7
+ MAX_TOKENS = 512
+ SUCCESS_SCORE_THRESHOLD = 0.5
+
+ # Task definitions for inference
+ TASKS = [
+     {
+         "task_name": "har_classify_easy",
+         "template_id": 1,
+         "description": "List products in the 'Gear' category",
+         "app_base_url": "http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7770/",
+         "difficulty": "easy",
+     },
+     {
+         "task_name": "har_classify_medium",
+         "template_id": 3,
+         "description": "Add 'Radiant Tee' to a guest cart",
+         "app_base_url": "http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7770/",
+         "difficulty": "medium",
+     },
+     {
+         "task_name": "har_pipeline_hard",
+         "template_id": 6,
+         "description": "Complete a guest checkout for 'Radiant Tee'",
+         "app_base_url": "http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7770/",
+         "difficulty": "hard",
+     },
+ ]
+
+ # ---------------------------------------------------------------------------
+ # Logging helpers (hackathon format)
+ # ---------------------------------------------------------------------------
+
+ def log_start(task: str, env: str, model: str) -> None:
+     print(f"[START] task={task} env={env} model={model}", flush=True)
+
+
+ def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
+     error_val = error if error else "null"
+     done_val = str(done).lower()
+     # Sanitize action: no newlines
+     action_clean = action.replace("\n", " ").replace("\r", "")[:200]
+     print(
+         f"[STEP] step={step} action={action_clean} reward={reward:.2f} done={done_val} error={error_val}",
+         flush=True,
+     )
+
+
+ def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
+     rewards_str = ",".join(f"{r:.2f}" for r in rewards)
+     print(
+         f"[END] success={str(success).lower()} steps={steps} score={score:.2f} rewards={rewards_str}",
+         flush=True,
+     )
+
+
+ # ---------------------------------------------------------------------------
+ # System prompt
+ # ---------------------------------------------------------------------------
+
+ SYSTEM_PROMPT = textwrap.dedent("""
+ You are an API agent. Your goal is to complete a real-world task by calling the correct
+ sequence of HTTP API endpoints on a live web application.
+
+ You have exactly these tools available (output ONE tool call per turn as JSON):
+
+ 1. browser_agent(task, url)
+    → Discovers which API endpoints exist for this app. Call this FIRST and ONLY ONCE.
+    → Returns: list of {method, path} endpoint names (no schemas)
+
+ 2. search_endpoints(query)
+    → Semantic search for endpoint schemas. Use after browser_agent to get full details.
+    → Example: search_endpoints("create guest cart") returns method, path, auth, params
+
+ 3. curl_exec(command)
+    → Execute an HTTP call. Returns {status_code, headers, body}.
+    → Use full curl syntax: curl -X POST 'URL' -H 'Content-Type: application/json' -d '{...}'
+    → Session cookies are auto-injected; you do NOT need to set Cookie headers manually.
+
+ 4. search_episode_data(query)
+    → Search all prior API responses in this episode for a specific value.
+    → Use when a response list was truncated and you need a specific item.
+
+ 5. done(result?)
+    → Call when the task is complete.
+
+ RULES:
+ - Output ONLY a single JSON object with keys "tool" and "args". Nothing else.
+ - Call browser_agent exactly once at step 1.
+ - Read values from prior responses (cart_id, sku, tokens) from the history.
+ - For Magento Shopping API (port 7770/7780): use Content-Type: application/json
+ - For Forum Postmill (port 9999): use Content-Type: application/x-www-form-urlencoded for login/post
+ - For Wikipedia (port 8888): GET requests only
+
+ EXAMPLE output format:
+ {"tool": "curl_exec", "args": {"command": "curl -X POST 'http://ec2-.../rest/V1/guest-carts' -H 'Content-Type: application/json'"}}
+ """).strip()
+
+
+ # ---------------------------------------------------------------------------
+ # LLM agent loop
+ # ---------------------------------------------------------------------------
+
+ def build_user_prompt(task_desc: str, app_base_url: str, step: int,
+                       last_result: Any, history: List[dict],
+                       session_state: dict) -> str:
+     """Build the user prompt for each step."""
+     history_str = ""
+     if history:
+         recent = history[-6:]  # Last 6 steps to stay within context
+         lines = []
+         for h in recent:
+             result_str = json.dumps(h.get("result", ""))[:500]
+             lines.append(f"  Step {h['step']}: {h['tool']}({h.get('args', {})}) → {result_str}")
+         history_str = "\n".join(lines)
+
+     session_str = json.dumps(session_state, indent=2)[:300] if session_state else "{}"
+
+     last_result_str = json.dumps(last_result)[:800] if last_result is not None else "null"
+
+     return textwrap.dedent(f"""
+ TASK: {task_desc}
+ APP URL: {app_base_url}
+ STEP: {step}/{MAX_STEPS}
+
+ SESSION STATE (auto-managed cookies/tokens):
+ {session_str}
+
+ LAST TOOL RESULT:
+ {last_result_str}
+
+ RECENT HISTORY:
+ {history_str if history_str else "  (none yet)"}
+
+ What is your next tool call? Output ONLY the JSON object.
+ """).strip()
+
+
+ def get_model_action(client: OpenAI, task_desc: str, app_base_url: str,
+                      step: int, last_result: Any, history: List[dict],
+                      session_state: dict) -> dict:
+     """Ask the LLM for the next action. Returns parsed tool call dict."""
+     user_prompt = build_user_prompt(task_desc, app_base_url, step,
+                                     last_result, history, session_state)
+     try:
+         completion = client.chat.completions.create(
+             model=MODEL_NAME,
+             messages=[
+                 {"role": "system", "content": SYSTEM_PROMPT},
+                 {"role": "user", "content": user_prompt},
+             ],
+             temperature=TEMPERATURE,
+             max_tokens=MAX_TOKENS,
+             stream=False,
+         )
+         text = (completion.choices[0].message.content or "").strip()
+
+         # Parse JSON from response
+         # Handle markdown code blocks
+         if "```json" in text:
+             text = text.split("```json")[1].split("```")[0].strip()
+         elif "```" in text:
+             text = text.split("```")[1].split("```")[0].strip()
+
+         # Find first { ... } block
+         start = text.find("{")
+         end = text.rfind("}") + 1
+         if start >= 0 and end > start:
+             text = text[start:end]
+
+         parsed = json.loads(text)
+         if "tool" in parsed:
+             return parsed
+
+         # LLM returned something else — default to done
+         return {"tool": "done", "args": {"result": "Model returned non-tool response"}}
+
+     except json.JSONDecodeError:
+         # Couldn't parse JSON — try to extract tool name at minimum
+         if "browser_agent" in text:
+             return {"tool": "browser_agent", "args": {"task": task_desc, "url": app_base_url}}
+         elif "done" in text.lower():
+             return {"tool": "done", "args": {}}
+         else:
+             return {"tool": "done", "args": {"result": f"Parse error: {text[:100]}"}}
+     except Exception as exc:
+         print(f"[DEBUG] LLM call failed: {exc}", flush=True)
+         # Default to browser_agent on first step, done otherwise
+         if step == 1:
+             return {"tool": "browser_agent", "args": {"task": task_desc, "url": app_base_url}}
+         return {"tool": "done", "args": {"result": f"LLM error: {exc}"}}
+
+
+ # ---------------------------------------------------------------------------
+ # Single task episode runner
+ # ---------------------------------------------------------------------------
+
+ async def run_episode(task_config: dict, client: OpenAI) -> dict:
+     """
+     Run a single episode for one task.
+
+     Returns: {"task_name", "success", "steps", "score", "rewards"}
+     """
+     from server.models import HARvestGymEnvironment, HarvestGymAction
+
+     task_name = task_config["task_name"]
+     template_id = task_config["template_id"]
+     task_description = task_config["description"]
+     app_base_url = task_config["app_base_url"]
+
+     # Configure environment for this task
+     os.environ["HARVGYM_TASK"] = str(template_id)
+
+     env = HARvestGymEnvironment()
+
+     rewards: List[float] = []
+     steps_taken = 0
+     score = 0.0
+     success = False
+     last_result = None
+     history: List[dict] = []
+     session_state: dict = {}
+
+     log_start(task=task_name, env=BENCHMARK, model=MODEL_NAME)
+
+     try:
+         # Reset
+         obs = env.reset()
+         task_desc = obs.task or task_description
+         base_url = obs.app_base_url or app_base_url
+
+         for step in range(1, MAX_STEPS + 1):
+             if getattr(obs, "done", False):
+                 break
+
+             # Get action from LLM
+             action_dict = get_model_action(
+                 client=client,
+                 task_desc=task_desc,
+                 app_base_url=base_url,
+                 step=step,
+                 last_result=last_result,
+                 history=history,
+                 session_state=session_state,
+             )
+
+             tool = action_dict.get("tool", "done")
+             args = action_dict.get("args", {})
+
+             action_str = f"{tool}({json.dumps(args)[:150]})"
+             error_str = None
+
+             try:
+                 action = HarvestGymAction(tool=tool, args=args)
+                 obs = env.step(action)
+
+                 reward = float(obs.reward or 0.0)
+                 done = bool(obs.done)
+                 last_result = obs.last_tool_result
+                 session_state = dict(obs.session_state or {})
+
+                 # Update history
+                 history.append({
+                     "step": step,
+                     "tool": tool,
+                     "args": args,
+                     "result": last_result,
+                 })
+
+             except Exception as exc:
+                 reward = -0.1
+                 done = False
+                 error_str = str(exc)[:200]
+
+             rewards.append(reward)
+             steps_taken = step
+             log_step(step=step, action=action_str, reward=reward, done=done, error=error_str)
+
+             if done:
+                 break
+
+         # Compute episode score from cumulative rewards
+         # Normalize: terminal reward dominates; clamp to [0, 1]
+         total_reward = sum(rewards)
+         # Map reward to [0, 1]: reward range is roughly [-1.5, +7.5] per design
+         score = (total_reward + 1.5) / 9.0
+         score = max(0.0, min(1.0, score))
+         success = score >= SUCCESS_SCORE_THRESHOLD
+
+     except Exception as exc:
+         error_str = str(exc)[:200]
+         print(f"[DEBUG] Episode error: {error_str}", flush=True)
+     finally:
+         log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
+
+     return {
+         "task_name": task_name,
+         "success": success,
+         "steps": steps_taken,
+         "score": score,
+         "rewards": rewards,
+     }
+
+
+ # ---------------------------------------------------------------------------
+ # Main
+ # ---------------------------------------------------------------------------
+
+ async def main() -> None:
+     client = OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN)
+
+     results = []
+     for task_config in TASKS:
+         result = await run_episode(task_config, client)
+         results.append(result)
+
+     # Summary
+     print("\n[SUMMARY]", flush=True)
+     for r in results:
+         status = "PASS" if r["success"] else "FAIL"
+         print(
+             f"  [{status}] {r['task_name']} — score={r['score']:.2f} steps={r['steps']}",
+             flush=True,
+         )
+
+     overall_score = sum(r["score"] for r in results) / len(results) if results else 0.0
+     print(f"\n  overall_score={overall_score:.2f}", flush=True)
+
+
+ if __name__ == "__main__":
+     asyncio.run(main())
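Tooling that scores or replays runs can recover each field of the strict `[STEP]` stdout format with a single regex. A sketch of such a parser (the pattern mirrors `log_step` in the script above; the parser itself is our addition, not part of the hackathon spec):

```python
import re

STEP_RE = re.compile(
    r"\[STEP\] step=(?P<step>\d+) action=(?P<action>.*) "
    r"reward=(?P<reward>-?\d+\.\d+) done=(?P<done>true|false) error=(?P<error>.*)"
)

def parse_step(line: str) -> dict:
    """Parse one [STEP] log line back into typed fields."""
    m = STEP_RE.match(line)
    if not m:
        raise ValueError(f"not a [STEP] line: {line!r}")
    d = m.groupdict()
    return {
        "step": int(d["step"]),
        "action": d["action"],
        "reward": float(d["reward"]),
        "done": d["done"] == "true",
        "error": None if d["error"] == "null" else d["error"],
    }

line = "[STEP] step=3 action=curl_exec({}) reward=0.50 done=false error=null"
print(parse_step(line)["reward"])  # 0.5
```

The greedy `action` group relies on `log_step` having stripped newlines from the action string, which the sanitization in the script guarantees.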
models.py ADDED
@@ -0,0 +1,14 @@
+ """
+ HARvestGym — Root models.py (required by OpenEnv spec).
+
+ Re-exports Action and Observation classes from server/models.py.
+ """
+
+ from server.models import HarvestGymAction as HARvestGymAction
+ from server.models import HarvestGymObservation as HARvestGymObservation
+
+ # OpenEnv spec requires these names at module root
+ Action = HARvestGymAction
+ Observation = HARvestGymObservation
+
+ __all__ = ["HARvestGymAction", "HARvestGymObservation", "Action", "Observation"]
openenv.yaml ADDED
@@ -0,0 +1,6 @@
+ spec_version: 1
+ name: HARvestGym
+ type: space
+ runtime: fastapi
+ app: server.app:app
+ port: 8000
openenv_harvestgym.egg-info/PKG-INFO ADDED
@@ -0,0 +1,18 @@
+ Metadata-Version: 2.4
+ Name: openenv-harvestgym
+ Version: 0.1.0
+ Summary: HARvestGym: RL environment for training API-native web agents via HAR-guided exploration
+ Requires-Python: >=3.10
+ Requires-Dist: openenv-core[core]>=0.2.2
+ Requires-Dist: pydantic>=2.0.0
+ Requires-Dist: fastapi>=0.100.0
+ Requires-Dist: uvicorn>=0.23.0
+ Requires-Dist: requests>=2.31.0
+ Requires-Dist: rank-bm25>=0.2.2
+ Requires-Dist: sentence-transformers>=3.0.0
+ Requires-Dist: openai>=1.0.0
+ Requires-Dist: numpy>=1.24.0
+ Provides-Extra: dev
+ Requires-Dist: pytest>=8.0.0; extra == "dev"
+ Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
+ Requires-Dist: pytest-asyncio>=0.23.0; extra == "dev"
openenv_harvestgym.egg-info/SOURCES.txt ADDED
@@ -0,0 +1,20 @@
+ README.md
+ pyproject.toml
+ openenv_harvestgym.egg-info/PKG-INFO
+ openenv_harvestgym.egg-info/SOURCES.txt
+ openenv_harvestgym.egg-info/dependency_links.txt
+ openenv_harvestgym.egg-info/entry_points.txt
+ openenv_harvestgym.egg-info/requires.txt
+ openenv_harvestgym.egg-info/top_level.txt
+ server/__init__.py
+ server/app.py
+ server/episode.py
+ server/judge.py
+ server/models.py
+ server/tools/__init__.py
+ server/tools/browser_agent.py
+ server/tools/curl_exec.py
+ server/tools/search_endpoints.py
+ server/tools/search_episode_data.py
+ tests/test_e2e_episode.py
+ tests/test_real_har.py
openenv_harvestgym.egg-info/dependency_links.txt ADDED
@@ -0,0 +1 @@
+
openenv_harvestgym.egg-info/entry_points.txt ADDED
@@ -0,0 +1,2 @@
+ [console_scripts]
+ server = server.app:main
openenv_harvestgym.egg-info/requires.txt ADDED
@@ -0,0 +1,14 @@
+ openenv-core[core]>=0.2.2
+ pydantic>=2.0.0
+ fastapi>=0.100.0
+ uvicorn>=0.23.0
+ requests>=2.31.0
+ rank-bm25>=0.2.2
+ sentence-transformers>=3.0.0
+ openai>=1.0.0
+ numpy>=1.24.0
+
+ [dev]
+ pytest>=8.0.0
+ pytest-cov>=4.0.0
+ pytest-asyncio>=0.23.0
openenv_harvestgym.egg-info/top_level.txt ADDED
@@ -0,0 +1 @@
+ server
parameter_pools.json ADDED
@@ -0,0 +1,1090 @@
+ {
+   "_meta": {
+     "description": "Static parameter pools for the 7 HARvestGym task templates.",
+     "generated_at": "2026-04-08",
+     "source": {
+       "categories": "GET /rest/V1/categories/list (live EC2, port 7780)",
+       "products": "GET /rest/V1/products type_id=simple + configurable (live EC2, port 7780)",
+       "forums": "HTML scrape of /forums page (live EC2, port 9999) + HTTP 200 verification per slug",
+       "wikipedia": "Well-known Wikipedia titles \u2014 verified by grader at runtime via HEAD /wikipedia_en.../A/{slug}",
+       "admin_skus": "Generated (HAR-TEST-NNN namespace, no collision with existing catalog)",
+       "post_titles": "Generated \u2014 grader checks post was created, not the exact title wording"
+     },
+     "grader_matching_notes": {
+       "template_1": "category_id stored for grader; category_name is what appears in task string",
+       "template_2": "expected_slug stored for grader (verifies HTTP 200); display title is in task string",
+       "template_3": "sku stored for grader (verifies cart item); product name is in task string",
+       "template_4": "forum_name must exist and return posts; no exact value matching needed",
+       "template_5": "title is free-form generated; grader only checks post was created in that forum",
+       "template_6": "sku stored for grader (verifies order was placed); product name is in task string",
+       "template_7": "sku+price are exact \u2014 grader calls GET /rest/V1/products/{sku} to verify creation"
+     }
+   },
+   "template_1": {
+     "description": "List products in category {category_name}",
+     "tier": "Easy",
+     "app": "shopping",
+     "slots": [
+       "category_name"
+     ],
+     "pool": {
+       "category_name": [
+         { "name": "Gear", "category_id": 3 },
+         { "name": "Bags", "category_id": 4 },
+         { "name": "Fitness Equipment", "category_id": 5 },
+         { "name": "Watches", "category_id": 6 },
+         { "name": "New Luma Yoga Collection", "category_id": 8 },
+         { "name": "Training", "category_id": 9 },
+         { "name": "Video Download", "category_id": 10 },
+         { "name": "Men", "category_id": 11 },
+         { "name": "Tops", "category_id": 12 },
+         { "name": "Bottoms", "category_id": 13 },
+         { "name": "Jackets", "category_id": 14 },
+         { "name": "Hoodies & Sweatshirts", "category_id": 15 },
+         { "name": "Tees", "category_id": 16 },
+         { "name": "Tanks", "category_id": 17 },
+         { "name": "Pants", "category_id": 18 },
+         { "name": "Shorts", "category_id": 19 },
+         { "name": "Women", "category_id": 20 },
+         { "name": "Tops", "category_id": 21 },
+         { "name": "Bottoms", "category_id": 22 },
+         { "name": "Jackets", "category_id": 23 },
+         { "name": "Hoodies & Sweatshirts", "category_id": 24 },
+         { "name": "Tees", "category_id": 25 },
+         { "name": "Bras & Tanks", "category_id": 26 },
+         { "name": "Pants", "category_id": 27 },
+         { "name": "Shorts", "category_id": 28 },
+         { "name": "Women Sale", "category_id": 30 },
+         { "name": "Men Sale", "category_id": 31 },
+         { "name": "Pants", "category_id": 32 },
+         { "name": "Tees", "category_id": 33 },
+         { "name": "Erin Recommends", "category_id": 34 },
+         { "name": "Performance Fabrics", "category_id": 35 },
+         { "name": "Eco Friendly", "category_id": 36 },
+         { "name": "Sale", "category_id": 37 },
+         { "name": "What's New", "category_id": 38 },
+         { "name": "Performance Sportswear New", "category_id": 39 },
+         { "name": "Eco Collection New", "category_id": 40 }
+       ]
+     }
+   },
+   "template_2": {
+     "description": "Retrieve article summary for {title}",
+     "tier": "Easy",
+     "app": "wikipedia",
+     "slots": [
+       "title"
+     ],
+     "pool": {
+       "title": [
+         { "display": "Python (programming language)", "search_query": "Python programming language", "expected_slug": "Python_(programming_language)" },
+         { "display": "Albert Einstein", "search_query": "Albert Einstein", "expected_slug": "Albert_Einstein" },
+         { "display": "World War II", "search_query": "World War II", "expected_slug": "World_War_II" },
+         { "display": "Photosynthesis", "search_query": "Photosynthesis", "expected_slug": "Photosynthesis" },
+         { "display": "Marie Curie", "search_query": "Marie Curie", "expected_slug": "Marie_Curie" },
+         { "display": "Moon", "search_query": "Moon", "expected_slug": "Moon" },
+         { "display": "JavaScript", "search_query": "JavaScript", "expected_slug": "JavaScript" },
+         { "display": "Eiffel Tower", "search_query": "Eiffel Tower", "expected_slug": "Eiffel_Tower" },
+         { "display": "Black hole", "search_query": "Black hole", "expected_slug": "Black_hole" },
+         { "display": "Charles Darwin", "search_query": "Charles Darwin", "expected_slug": "Charles_Darwin" },
+         { "display": "Artificial intelligence", "search_query": "Artificial intelligence", "expected_slug": "Artificial_intelligence" },
+         { "display": "DNA", "search_query": "DNA", "expected_slug": "DNA" },
+         { "display": "Mount Everest", "search_query": "Mount Everest", "expected_slug": "Mount_Everest" },
+         { "display": "Isaac Newton", "search_query": "Isaac Newton", "expected_slug": "Isaac_Newton" },
+         { "display": "Solar System", "search_query": "Solar System", "expected_slug": "Solar_System" },
+         { "display": "Great Wall of China", "search_query": "Great Wall of China", "expected_slug": "Great_Wall_of_China" },
+         { "display": "William Shakespeare", "search_query": "William Shakespeare", "expected_slug": "William_Shakespeare" },
+         { "display": "Amazon River", "search_query": "Amazon River", "expected_slug": "Amazon_River" },
+         { "display": "Quantum mechanics", "search_query": "Quantum mechanics", "expected_slug": "Quantum_mechanics" },
+         { "display": "Napoleon", "search_query": "Napoleon", "expected_slug": "Napoleon" }
+       ]
+     }
+   },
+   "template_3": {
+     "description": "Add {product_name} to a guest cart",
+     "tier": "Medium",
+     "app": "shopping",
+     "slots": [
+       "product_name"
+     ],
+     "pool": {
+       "product_name": [
+         { "name": "Joust Duffle Bag", "sku": "24-MB01" },
+         { "name": "Strive Shoulder Pack", "sku": "24-MB04" },
+         { "name": "Crown Summit Backpack", "sku": "24-MB03" },
+         { "name": "Wayfarer Messenger Bag", "sku": "24-MB05" },
+         { "name": "Rival Field Messenger", "sku": "24-MB06" },
+         { "name": "Fusion Backpack", "sku": "24-MB02" },
+         { "name": "Impulse Duffle", "sku": "24-UB02" },
+         { "name": "Voyage Yoga Bag", "sku": "24-WB01" },
+         { "name": "Compete Track Tote", "sku": "24-WB02" },
+         { "name": "Savvy Shoulder Tote", "sku": "24-WB05" },
+         { "name": "Endeavor Daytrip Backpack", "sku": "24-WB06" },
+         { "name": "Driven Backpack", "sku": "24-WB03" },
+         { "name": "Overnight Duffle", "sku": "24-WB07" },
+         { "name": "Push It Messenger Bag", "sku": "24-WB04" },
+         { "name": "Affirm Water Bottle", "sku": "24-UG06" },
+         { "name": "Dual Handle Cardio Ball", "sku": "24-UG07" },
+         { "name": "Zing Jump Rope", "sku": "24-UG04" },
+         { "name": "Pursuit Lumaflex&trade; Tone Band", "sku": "24-UG02" },
+         { "name": "Go-Get'r Pushup Grips", "sku": "24-UG05" },
+         { "name": "Quest Lumaflex&trade; Band", "sku": "24-UG01" },
+         { "name": "Sprite Foam Yoga Brick", "sku": "24-WG084" },
+         { "name": "Sprite Foam Roller", "sku": "24-WG088" },
+         { "name": "Harmony Lumaflex&trade; Strength Band Kit", "sku": "24-UG03" },
+         { "name": "Sprite Stasis Ball 55 cm", "sku": "24-WG081-gray" },
+         { "name": "Sprite Stasis Ball 65 cm", "sku": "24-WG082-gray" },
+         { "name": "Sprite Stasis Ball 75 cm", "sku": "24-WG083-gray" },
+         { "name": "Sprite Yoga Strap 6 foot", "sku": "24-WG085" },
+         { "name": "Sprite Yoga Strap 8 foot", "sku": "24-WG086" },
+         { "name": "Sprite Yoga Strap 10 foot", "sku": "24-WG087" },
+         { "name": "Aim Analog Watch", "sku": "24-MG04" },
+         { "name": "Endurance Watch", "sku": "24-MG01" },
+         { "name": "Summit Watch", "sku": "24-MG03" },
+         { "name": "Cruise Dual Analog Watch", "sku": "24-MG05" },
+         { "name": "Dash Digital Watch", "sku": "24-MG02" },
+         { "name": "Luma Analog Watch", "sku": "24-WG09" },
+         { "name": "Bolo Sport Watch", "sku": "24-WG01" },
+         { "name": "Clamber Watch", "sku": "24-WG03" },
448
+ {
449
+ "name": "Didi Sport Watch",
450
+ "sku": "24-WG02"
451
+ },
452
+ {
453
+ "name": "Stellar Solar Jacket",
454
+ "sku": "WJ01"
455
+ },
456
+ {
457
+ "name": "Josie Yoga Jacket",
458
+ "sku": "WJ02"
459
+ },
460
+ {
461
+ "name": "Augusta Pullover Jacket",
462
+ "sku": "WJ03"
463
+ },
464
+ {
465
+ "name": "Ingrid Running Jacket",
466
+ "sku": "WJ04"
467
+ },
468
+ {
469
+ "name": "Riona Full Zip Jacket",
470
+ "sku": "WJ05"
471
+ },
472
+ {
473
+ "name": "Juno Jacket",
474
+ "sku": "WJ06"
475
+ },
476
+ {
477
+ "name": "Inez Full Zip Jacket",
478
+ "sku": "WJ07"
479
+ },
480
+ {
481
+ "name": "Adrienne Trek Jacket",
482
+ "sku": "WJ08"
483
+ },
484
+ {
485
+ "name": "Jade Yoga Jacket",
486
+ "sku": "WJ09"
487
+ },
488
+ {
489
+ "name": "Nadia Elements Shell",
490
+ "sku": "WJ10"
491
+ },
492
+ {
493
+ "name": "Neve Studio Dance Jacket",
494
+ "sku": "WJ11"
495
+ },
496
+ {
497
+ "name": "Olivia 1/4 Zip Light Jacket",
498
+ "sku": "WJ12"
499
+ },
500
+ {
501
+ "name": "Chaz Kangeroo Hoodie",
502
+ "sku": "MH01"
503
+ },
504
+ {
505
+ "name": "Teton Pullover Hoodie",
506
+ "sku": "MH02"
507
+ },
508
+ {
509
+ "name": "Bruno Compete Hoodie",
510
+ "sku": "MH03"
511
+ },
512
+ {
513
+ "name": "Frankie Sweatshirt",
514
+ "sku": "MH04"
515
+ },
516
+ {
517
+ "name": "Hollister Backyard Sweatshirt",
518
+ "sku": "MH05"
519
+ },
520
+ {
521
+ "name": "Stark Fundamental Hoodie",
522
+ "sku": "MH06"
523
+ },
524
+ {
525
+ "name": "Hero Hoodie",
526
+ "sku": "MH07"
527
+ },
528
+ {
529
+ "name": "Oslo Trek Hoodie",
530
+ "sku": "MH08"
531
+ }
532
+ ]
533
+ }
534
+ },
535
+ "template_4": {
536
+ "description": "Retrieve all posts in {forum_category} (authed)",
537
+ "tier": "Medium",
538
+ "app": "forum",
539
+ "slots": [
540
+ "forum_category"
541
+ ],
542
+ "pool": {
543
+ "forum_category": [
544
+ {
545
+ "forum_name": "AskReddit"
546
+ },
547
+ {
548
+ "forum_name": "relationship_advice"
549
+ },
550
+ {
551
+ "forum_name": "worldnews"
552
+ },
553
+ {
554
+ "forum_name": "news"
555
+ },
556
+ {
557
+ "forum_name": "movies"
558
+ },
559
+ {
560
+ "forum_name": "memes"
561
+ },
562
+ {
563
+ "forum_name": "wallstreetbets"
564
+ },
565
+ {
566
+ "forum_name": "gaming"
567
+ },
568
+ {
569
+ "forum_name": "technology"
570
+ },
571
+ {
572
+ "forum_name": "pics"
573
+ },
574
+ {
575
+ "forum_name": "funny"
576
+ },
577
+ {
578
+ "forum_name": "television"
579
+ },
580
+ {
581
+ "forum_name": "mildlyinteresting"
582
+ },
583
+ {
584
+ "forum_name": "Showerthoughts"
585
+ },
586
+ {
587
+ "forum_name": "todayilearned"
588
+ },
589
+ {
590
+ "forum_name": "personalfinance"
591
+ },
592
+ {
593
+ "forum_name": "LifeProTips"
594
+ },
595
+ {
596
+ "forum_name": "Futurology"
597
+ },
598
+ {
599
+ "forum_name": "Music"
600
+ },
601
+ {
602
+ "forum_name": "explainlikeimfive"
603
+ },
604
+ {
605
+ "forum_name": "books"
606
+ },
607
+ {
608
+ "forum_name": "science"
609
+ },
610
+ {
611
+ "forum_name": "Jokes"
612
+ },
613
+ {
614
+ "forum_name": "tifu"
615
+ },
616
+ {
617
+ "forum_name": "space"
618
+ }
619
+ ]
620
+ }
621
+ },
622
+ "template_5": {
623
+ "description": "Create a post titled {title} in {category}",
624
+ "tier": "Hard",
625
+ "app": "forum",
626
+ "slots": [
627
+ "title",
628
+ "category"
629
+ ],
630
+ "pool": {
631
+ "title": [
632
+ "Thoughts on the latest developments in AI safety",
633
+ "Best practices for remote work in 2026",
634
+ "How do you stay motivated when learning a new skill?",
635
+ "What are your favourite open-source projects right now?",
636
+ "Underrated books that changed how you think",
637
+ "Tips for beginner photographers \u2014 what I wish I knew",
638
+ "The most interesting science paper I read this week",
639
+ "Ask me anything about Python performance tuning",
640
+ "Weekly discussion: what are you building this month?",
641
+ "Hidden gems in streaming music you should know about",
642
+ "Travel destinations that are worth the hype",
643
+ "How to cook a perfect risotto \u2014 my method after 10 attempts",
644
+ "What sport have you picked up recently and why?",
645
+ "Recommend a documentary that genuinely surprised you",
646
+ "Discussion: is functional programming overrated?",
647
+ "Things that made you better at managing personal finance",
648
+ "The weirdest film you watched and actually enjoyed",
649
+ "My experience switching from VS Code to a different editor",
650
+ "Why I started journaling and what changed",
651
+ "Gaming setup upgrades that actually made a difference"
652
+ ],
653
+ "category": [
654
+ {
655
+ "forum_name": "AskReddit"
656
+ },
657
+ {
658
+ "forum_name": "relationship_advice"
659
+ },
660
+ {
661
+ "forum_name": "worldnews"
662
+ },
663
+ {
664
+ "forum_name": "news"
665
+ },
666
+ {
667
+ "forum_name": "movies"
668
+ },
669
+ {
670
+ "forum_name": "memes"
671
+ },
672
+ {
673
+ "forum_name": "wallstreetbets"
674
+ },
675
+ {
676
+ "forum_name": "gaming"
677
+ },
678
+ {
679
+ "forum_name": "technology"
680
+ },
681
+ {
682
+ "forum_name": "pics"
683
+ },
684
+ {
685
+ "forum_name": "funny"
686
+ },
687
+ {
688
+ "forum_name": "television"
689
+ },
690
+ {
691
+ "forum_name": "mildlyinteresting"
692
+ },
693
+ {
694
+ "forum_name": "Showerthoughts"
695
+ },
696
+ {
697
+ "forum_name": "todayilearned"
698
+ },
699
+ {
700
+ "forum_name": "personalfinance"
701
+ },
702
+ {
703
+ "forum_name": "LifeProTips"
704
+ },
705
+ {
706
+ "forum_name": "Futurology"
707
+ },
708
+ {
709
+ "forum_name": "Music"
710
+ },
711
+ {
712
+ "forum_name": "explainlikeimfive"
713
+ },
714
+ {
715
+ "forum_name": "books"
716
+ },
717
+ {
718
+ "forum_name": "science"
719
+ },
720
+ {
721
+ "forum_name": "Jokes"
722
+ },
723
+ {
724
+ "forum_name": "tifu"
725
+ },
726
+ {
727
+ "forum_name": "space"
728
+ }
729
+ ]
730
+ }
731
+ },
732
+ "template_6": {
733
+ "description": "Guest checkout for {product_name}",
734
+ "tier": "Hard",
735
+ "app": "shopping",
736
+ "slots": [
737
+ "product_name"
738
+ ],
739
+ "pool": {
740
+ "product_name": [
741
+ {
742
+ "name": "Joust Duffle Bag",
743
+ "sku": "24-MB01"
744
+ },
745
+ {
746
+ "name": "Strive Shoulder Pack",
747
+ "sku": "24-MB04"
748
+ },
749
+ {
750
+ "name": "Crown Summit Backpack",
751
+ "sku": "24-MB03"
752
+ },
753
+ {
754
+ "name": "Wayfarer Messenger Bag",
755
+ "sku": "24-MB05"
756
+ },
757
+ {
758
+ "name": "Rival Field Messenger",
759
+ "sku": "24-MB06"
760
+ },
761
+ {
762
+ "name": "Fusion Backpack",
763
+ "sku": "24-MB02"
764
+ },
765
+ {
766
+ "name": "Impulse Duffle",
767
+ "sku": "24-UB02"
768
+ },
769
+ {
770
+ "name": "Voyage Yoga Bag",
771
+ "sku": "24-WB01"
772
+ },
773
+ {
774
+ "name": "Compete Track Tote",
775
+ "sku": "24-WB02"
776
+ },
777
+ {
778
+ "name": "Savvy Shoulder Tote",
779
+ "sku": "24-WB05"
780
+ },
781
+ {
782
+ "name": "Endeavor Daytrip Backpack",
783
+ "sku": "24-WB06"
784
+ },
785
+ {
786
+ "name": "Driven Backpack",
787
+ "sku": "24-WB03"
788
+ },
789
+ {
790
+ "name": "Overnight Duffle",
791
+ "sku": "24-WB07"
792
+ },
793
+ {
794
+ "name": "Push It Messenger Bag",
795
+ "sku": "24-WB04"
796
+ },
797
+ {
798
+ "name": "Affirm Water Bottle",
799
+ "sku": "24-UG06"
800
+ },
801
+ {
802
+ "name": "Dual Handle Cardio Ball",
803
+ "sku": "24-UG07"
804
+ },
805
+ {
806
+ "name": "Zing Jump Rope",
807
+ "sku": "24-UG04"
808
+ },
809
+ {
810
+ "name": "Pursuit Lumaflex&trade; Tone Band",
811
+ "sku": "24-UG02"
812
+ },
813
+ {
814
+ "name": "Go-Get'r Pushup Grips",
815
+ "sku": "24-UG05"
816
+ },
817
+ {
818
+ "name": "Quest Lumaflex&trade; Band",
819
+ "sku": "24-UG01"
820
+ },
821
+ {
822
+ "name": "Sprite Foam Yoga Brick",
823
+ "sku": "24-WG084"
824
+ },
825
+ {
826
+ "name": "Sprite Foam Roller",
827
+ "sku": "24-WG088"
828
+ },
829
+ {
830
+ "name": "Harmony Lumaflex&trade; Strength Band Kit",
831
+ "sku": "24-UG03"
832
+ },
833
+ {
834
+ "name": "Sprite Stasis Ball 55 cm",
835
+ "sku": "24-WG081-gray"
836
+ },
837
+ {
838
+ "name": "Sprite Stasis Ball 65 cm",
839
+ "sku": "24-WG082-gray"
840
+ },
841
+ {
842
+ "name": "Sprite Stasis Ball 75 cm",
843
+ "sku": "24-WG083-gray"
844
+ },
845
+ {
846
+ "name": "Sprite Yoga Strap 6 foot",
847
+ "sku": "24-WG085"
848
+ },
849
+ {
850
+ "name": "Sprite Yoga Strap 8 foot",
851
+ "sku": "24-WG086"
852
+ },
853
+ {
854
+ "name": "Sprite Yoga Strap 10 foot",
855
+ "sku": "24-WG087"
856
+ },
857
+ {
858
+ "name": "Aim Analog Watch",
859
+ "sku": "24-MG04"
860
+ },
861
+ {
862
+ "name": "Endurance Watch",
863
+ "sku": "24-MG01"
864
+ },
865
+ {
866
+ "name": "Summit Watch",
867
+ "sku": "24-MG03"
868
+ },
869
+ {
870
+ "name": "Cruise Dual Analog Watch",
871
+ "sku": "24-MG05"
872
+ },
873
+ {
874
+ "name": "Dash Digital Watch",
875
+ "sku": "24-MG02"
876
+ },
877
+ {
878
+ "name": "Luma Analog Watch",
879
+ "sku": "24-WG09"
880
+ },
881
+ {
882
+ "name": "Bolo Sport Watch",
883
+ "sku": "24-WG01"
884
+ },
885
+ {
886
+ "name": "Clamber Watch",
887
+ "sku": "24-WG03"
888
+ },
889
+ {
890
+ "name": "Didi Sport Watch",
891
+ "sku": "24-WG02"
892
+ },
893
+ {
894
+ "name": "Stellar Solar Jacket",
895
+ "sku": "WJ01"
896
+ },
897
+ {
898
+ "name": "Josie Yoga Jacket",
899
+ "sku": "WJ02"
900
+ },
901
+ {
902
+ "name": "Augusta Pullover Jacket",
903
+ "sku": "WJ03"
904
+ },
905
+ {
906
+ "name": "Ingrid Running Jacket",
907
+ "sku": "WJ04"
908
+ },
909
+ {
910
+ "name": "Riona Full Zip Jacket",
911
+ "sku": "WJ05"
912
+ },
913
+ {
914
+ "name": "Juno Jacket",
915
+ "sku": "WJ06"
916
+ },
917
+ {
918
+ "name": "Inez Full Zip Jacket",
919
+ "sku": "WJ07"
920
+ },
921
+ {
922
+ "name": "Adrienne Trek Jacket",
923
+ "sku": "WJ08"
924
+ },
925
+ {
926
+ "name": "Jade Yoga Jacket",
927
+ "sku": "WJ09"
928
+ },
929
+ {
930
+ "name": "Nadia Elements Shell",
931
+ "sku": "WJ10"
932
+ },
933
+ {
934
+ "name": "Neve Studio Dance Jacket",
935
+ "sku": "WJ11"
936
+ },
937
+ {
938
+ "name": "Olivia 1/4 Zip Light Jacket",
939
+ "sku": "WJ12"
940
+ },
941
+ {
942
+ "name": "Chaz Kangeroo Hoodie",
943
+ "sku": "MH01"
944
+ },
945
+ {
946
+ "name": "Teton Pullover Hoodie",
947
+ "sku": "MH02"
948
+ },
949
+ {
950
+ "name": "Bruno Compete Hoodie",
951
+ "sku": "MH03"
952
+ },
953
+ {
954
+ "name": "Frankie Sweatshirt",
955
+ "sku": "MH04"
956
+ },
957
+ {
958
+ "name": "Hollister Backyard Sweatshirt",
959
+ "sku": "MH05"
960
+ },
961
+ {
962
+ "name": "Stark Fundamental Hoodie",
963
+ "sku": "MH06"
964
+ },
965
+ {
966
+ "name": "Hero Hoodie",
967
+ "sku": "MH07"
968
+ },
969
+ {
970
+ "name": "Oslo Trek Hoodie",
971
+ "sku": "MH08"
972
+ }
973
+ ]
974
+ }
975
+ },
976
+ "template_7": {
977
+ "description": "Create a new product with SKU {sku}, price {price}",
978
+ "tier": "Hard",
979
+ "app": "shopping_admin",
980
+ "slots": [
981
+ "sku",
982
+ "price",
983
+ "product_name"
984
+ ],
985
+ "pool": {
986
+ "product_spec": [
987
+ {
988
+ "sku": "HAR-TEST-001",
989
+ "price": 19.99,
990
+ "product_name": "HAR Training Widget Alpha"
991
+ },
992
+ {
993
+ "sku": "HAR-TEST-002",
994
+ "price": 34.5,
995
+ "product_name": "HAR Training Widget Beta"
996
+ },
997
+ {
998
+ "sku": "HAR-TEST-003",
999
+ "price": 9.99,
1000
+ "product_name": "HAR Economy Pack"
1001
+ },
1002
+ {
1003
+ "sku": "HAR-TEST-004",
1004
+ "price": 49.0,
1005
+ "product_name": "HAR Premium Kit"
1006
+ },
1007
+ {
1008
+ "sku": "HAR-TEST-005",
1009
+ "price": 7.75,
1010
+ "product_name": "HAR Starter Bundle"
1011
+ },
1012
+ {
1013
+ "sku": "HAR-TEST-006",
1014
+ "price": 129.0,
1015
+ "product_name": "HAR Deluxe Set"
1016
+ },
1017
+ {
1018
+ "sku": "HAR-TEST-007",
1019
+ "price": 22.0,
1020
+ "product_name": "HAR Standard Unit"
1021
+ },
1022
+ {
1023
+ "sku": "HAR-TEST-008",
1024
+ "price": 14.95,
1025
+ "product_name": "HAR Basic Module"
1026
+ },
1027
+ {
1028
+ "sku": "HAR-TEST-009",
1029
+ "price": 59.99,
1030
+ "product_name": "HAR Advanced Pack"
1031
+ },
1032
+ {
1033
+ "sku": "HAR-TEST-010",
1034
+ "price": 3.5,
1035
+ "product_name": "HAR Mini Component"
1036
+ },
1037
+ {
1038
+ "sku": "HAR-TEST-011",
1039
+ "price": 89.0,
1040
+ "product_name": "HAR Pro Edition"
1041
+ },
1042
+ {
1043
+ "sku": "HAR-TEST-012",
1044
+ "price": 11.25,
1045
+ "product_name": "HAR Lite Version"
1046
+ },
1047
+ {
1048
+ "sku": "HAR-TEST-013",
1049
+ "price": 199.99,
1050
+ "product_name": "HAR Enterprise Module"
1051
+ },
1052
+ {
1053
+ "sku": "HAR-TEST-014",
1054
+ "price": 6.0,
1055
+ "product_name": "HAR Sample Item"
1056
+ },
1057
+ {
1058
+ "sku": "HAR-TEST-015",
1059
+ "price": 45.0,
1060
+ "product_name": "HAR Mid-Range Pack"
1061
+ },
1062
+ {
1063
+ "sku": "HAR-TEST-016",
1064
+ "price": 25.0,
1065
+ "product_name": "HAR Core Component"
1066
+ },
1067
+ {
1068
+ "sku": "HAR-TEST-017",
1069
+ "price": 75.0,
1070
+ "product_name": "HAR Extended Kit"
1071
+ },
1072
+ {
1073
+ "sku": "HAR-TEST-018",
1074
+ "price": 18.5,
1075
+ "product_name": "HAR Value Bundle"
1076
+ },
1077
+ {
1078
+ "sku": "HAR-TEST-019",
1079
+ "price": 99.0,
1080
+ "product_name": "HAR Complete Suite"
1081
+ },
1082
+ {
1083
+ "sku": "HAR-TEST-020",
1084
+ "price": 2.99,
1085
+ "product_name": "HAR Micro Unit"
1086
+ }
1087
+ ]
1088
+ }
1089
+ }
1090
+ }
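The pools file above is consumed when instantiating a task from a template. A minimal sketch of that sampling step — the helper name `sample_task` and the inlined mini-pool are my own illustration, only the JSON layout comes from the file:

```python
import random

# Trimmed copy of the structure shown in parameter_pools.json (assumption:
# the real environment loads the full file with json.load instead).
pools = {
    "template_3": {
        "description": "Add {product_name} to a guest cart",
        "pool": {"product_name": [{"name": "Joust Duffle Bag", "sku": "24-MB01"}]},
    }
}

def sample_task(template_id: str, rng: random.Random) -> dict:
    """Draw one value per slot and fill the description template."""
    t = pools[template_id]
    chosen = {slot: rng.choice(vals) for slot, vals in t["pool"].items()}
    # Dict-valued slots render with their human-readable "name" field;
    # the full dict (with sku etc.) is kept aside for the grader.
    fills = {k: (v["name"] if isinstance(v, dict) else v) for k, v in chosen.items()}
    return {"task": t["description"].format(**fills), "ground_truth": chosen}

task = sample_task("template_3", random.Random(0))
print(task["task"])  # -> Add Joust Duffle Bag to a guest cart
```

This mirrors the notes embedded in the pools: the task string exposes only display values, while identifiers like `sku` stay hidden for verification.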
pyproject.toml ADDED
@@ -0,0 +1,37 @@

1
+ [build-system]
2
+ requires = ["setuptools>=45", "wheel"]
3
+ build-backend = "setuptools.build_meta"
4
+
5
+ [project]
6
+ name = "openenv-harvestgym"
7
+ version = "0.1.0"
8
+ description = "HARvestGym: RL environment for training API-native web agents via HAR-guided exploration"
9
+ requires-python = ">=3.10"
10
+ dependencies = [
11
+ "openenv-core[core]>=0.2.2",
12
+ "pydantic>=2.0.0",
13
+ "fastapi>=0.100.0",
14
+ "uvicorn>=0.23.0",
15
+ "requests>=2.31.0",
16
+ "rank-bm25>=0.2.2",
17
+ "sentence-transformers>=3.0.0",
18
+ "openai>=1.0.0",
19
+ "numpy>=1.24.0",
20
+ ]
21
+
22
+ [project.optional-dependencies]
23
+ dev = [
24
+ "pytest>=8.0.0",
25
+ "pytest-cov>=4.0.0",
26
+ "pytest-asyncio>=0.23.0",
27
+ ]
28
+
29
+ [project.scripts]
30
+ server = "server.app:main"
31
+
32
+ [tool.setuptools]
33
+ include-package-data = true
34
+ packages = ["server", "server.tools"]
35
+
36
+ [tool.setuptools.package-data]
37
+ "*" = ["hars/*.har", "catalogs/*.json", "parameter_pools.json"]
scripts/build_parameter_pools.py ADDED
@@ -0,0 +1,364 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Build (or refresh) parameter_pools.json by calling the live EC2 application APIs.
4
+
5
+ Usage:
6
+ python scripts/build_parameter_pools.py --host ec2-16-59-2-56.us-east-2.compute.amazonaws.com
7
+ python scripts/build_parameter_pools.py --host localhost # if running directly on EC2
8
+ python scripts/build_parameter_pools.py --host <IP> --output parameter_pools.json
9
+
10
+ Requirements: pip install requests
11
+ """
12
+
13
+ import argparse
14
+ import json
15
+ import sys
16
+ from datetime import date
17
+ from pathlib import Path
18
+
19
+ try:
20
+ import requests
21
+ requests.packages.urllib3.disable_warnings()
22
+ except ImportError:
23
+ print("pip install requests")
24
+ sys.exit(1)
25
+
26
+ # ── Config ────────────────────────────────────────────────────────────────────
27
+ PORTS = {
28
+ "shopping": 7770,
29
+ "shopping_admin": 7780,
30
+ "forum": 9999,
31
+ "wikipedia": 8888,
32
+ }
33
+
34
+ ADMIN_USER = "admin"
35
+ ADMIN_PASS = "admin1234"
36
+
37
+ # Wikipedia articles to verify exist in the ZIM snapshot
38
+ WIKIPEDIA_TITLES = [
39
+ ("Python (programming language)", "Python programming language", "Python_(programming_language)"),
40
+ ("Albert Einstein", "Albert Einstein", "Albert_Einstein"),
41
+ ("World War II", "World War II", "World_War_II"),
42
+ ("Photosynthesis", "Photosynthesis", "Photosynthesis"),
43
+ ("Marie Curie", "Marie Curie", "Marie_Curie"),
44
+ ("Moon", "Moon", "Moon"),
45
+ ("JavaScript", "JavaScript", "JavaScript"),
46
+ ("Eiffel Tower", "Eiffel Tower", "Eiffel_Tower"),
47
+ ("Black hole", "Black hole", "Black_hole"),
48
+ ("Charles Darwin", "Charles Darwin", "Charles_Darwin"),
49
+ ("Artificial intelligence", "Artificial intelligence", "Artificial_intelligence"),
50
+ ("DNA", "DNA", "DNA"),
51
+ ("Mount Everest", "Mount Everest", "Mount_Everest"),
52
+ ("Isaac Newton", "Isaac Newton", "Isaac_Newton"),
53
+ ("Solar System", "Solar System", "Solar_System"),
54
+ ("Great Wall of China", "Great Wall of China", "Great_Wall_of_China"),
55
+ ("William Shakespeare", "William Shakespeare", "William_Shakespeare"),
56
+ ("Amazon River", "Amazon River", "Amazon_River"),
57
+ ("Quantum mechanics", "Quantum mechanics", "Quantum_mechanics"),
58
+ ("Napoleon", "Napoleon", "Napoleon"),
59
+ ]
60
+
61
+ # Post titles generated for template_5 — not fetched from the live app
62
+ FORUM_POST_TITLES = [
63
+ "Thoughts on the latest developments in AI safety",
64
+ "Best practices for remote work in 2026",
65
+ "How do you stay motivated when learning a new skill?",
66
+ "What are your favourite open-source projects right now?",
67
+ "Underrated books that changed how you think",
68
+ "Tips for beginner photographers — what I wish I knew",
69
+ "The most interesting science paper I read this week",
70
+ "Ask me anything about Python performance tuning",
71
+ "Weekly discussion: what are you building this month?",
72
+ "Hidden gems in streaming music you should know about",
73
+ "Travel destinations that are worth the hype",
74
+ "How to cook a perfect risotto — my method after 10 attempts",
75
+ "What sport have you picked up recently and why?",
76
+ "Recommend a documentary that genuinely surprised you",
77
+ "Discussion: is functional programming overrated?",
78
+ "Things that made you better at managing personal finance",
79
+ "The weirdest film you watched and actually enjoyed",
80
+ "My experience switching from VS Code to a different editor",
81
+ "Why I started journaling and what changed",
82
+ "Gaming setup upgrades that actually made a difference",
83
+ ]
84
+
85
+ # ── Helpers ───────────────────────────────────────────────────────────────────
86
+
87
+ def base_url(host: str, app: str) -> str:
88
+ return f"http://{host}:{PORTS[app]}"
89
+
90
+
91
+ def get_admin_token(host: str) -> str:
92
+ url = f"{base_url(host, 'shopping_admin')}/rest/V1/integration/admin/token"
93
+ resp = requests.post(url, json={"username": ADMIN_USER, "password": ADMIN_PASS}, timeout=10)
94
+ resp.raise_for_status()
95
+ return resp.json()
96
+
97
+
98
+ def admin_get(host: str, path: str, token: str, params: dict | None = None):
99
+ url = f"{base_url(host, 'shopping_admin')}{path}"
100
+ resp = requests.get(url, headers={"Authorization": f"Bearer {token}"}, params=params, timeout=15)
101
+ resp.raise_for_status()
102
+ return resp.json()
103
+
104
+
105
+ # ── Pool builders ─────────────────────────────────────────────────────────────
106
+
107
+ def build_category_pool(host: str, token: str) -> list:
108
+ """Template 1: leaf categories from GET /rest/V1/categories/list."""
109
+ data = admin_get(host, "/rest/V1/categories/list", token, params={"searchCriteria[pageSize]": 500})
110
+ items = data.get("items", [])
111
+ pool = []
112
+ for item in items:
113
+ # include all named categories; caller can filter to leaf nodes if needed
114
+ if item.get("name") and item.get("id"):
115
+ pool.append({"name": item["name"], "category_id": item["id"]})
116
+ return pool
117
+
118
+
119
+ def build_product_pool(host: str, token: str, max_items: int = 50) -> list:
120
+ """Templates 3 & 6: simple, in-stock products."""
121
+ data = admin_get(host, "/rest/V1/products", token, params={
122
+ "searchCriteria[filterGroups][0][filters][0][field]": "type_id",
123
+ "searchCriteria[filterGroups][0][filters][0][value]": "simple",
124
+ "searchCriteria[filterGroups][0][filters][0][conditionType]": "eq",
125
+ "searchCriteria[pageSize]": max_items,
126
+ })
127
+ items = data.get("items", [])
128
+ pool = []
129
+ for item in items:
130
+ name = item.get("name", "").strip()
131
+ sku = item.get("sku", "").strip()
132
+ if name and sku:
133
+ pool.append({"name": name, "sku": sku})
134
+ return pool
135
+
136
+
137
+ def build_wikipedia_pool(host: str) -> list:
138
+ """Template 2: verify known articles exist in the ZIM snapshot."""
139
+ base = base_url(host, "wikipedia")
140
+ verified = []
141
+ for display, search_query, expected_slug in WIKIPEDIA_TITLES:
142
+ check_url = f"{base}/wikipedia_en_all_maxi_2022-05/A/{expected_slug}"
143
+ try:
144
+ r = requests.head(check_url, timeout=8, allow_redirects=True)
145
+ if r.status_code == 200:
146
+ verified.append({
147
+ "display": display,
148
+ "search_query": search_query,
149
+ "expected_slug": expected_slug,
150
+ })
151
+ else:
152
+ print(f" [wikipedia] WARNING: {expected_slug} → HTTP {r.status_code}, skipping")
153
+ except Exception as e:
154
+ print(f" [wikipedia] WARNING: could not reach {check_url}: {e}")
155
+ return verified
156
+
157
+
158
+ def build_forum_category_pool(host: str) -> list:
159
+ """Templates 4 & 5: forum slugs with at least one submission."""
160
+ base = base_url(host, "forum")
161
+ pool = []
162
+ page = 1
163
+ while True:
164
+ try:
165
+ r = requests.get(f"{base}/api/forums", params={"page": page}, timeout=10)
166
+ r.raise_for_status()
167
+ data = r.json()
168
+ except Exception as e:
169
+ print(f" [forum] WARNING: could not reach forums API: {e}")
170
+ break
171
+ items = data if isinstance(data, list) else data.get("items", data.get("forums", []))
172
+ if not items:
173
+ break
174
+ for item in items:
175
+ name = item.get("name") or item.get("forum_name") or item.get("normalizedName")
176
+ display = item.get("title") or item.get("displayName") or name
177
+ if name:
178
+ pool.append({"forum_name": name, "display_name": display or name})
179
+ if len(items) < 20:
180
+ break
181
+ page += 1
182
+ # deduplicate by forum_name
183
+ seen = set()
184
+ deduped = []
185
+ for entry in pool:
186
+ if entry["forum_name"] not in seen:
187
+ seen.add(entry["forum_name"])
188
+ deduped.append(entry)
189
+ return deduped
190
+
191
+
192
+ # ── Template 7 pool ───────────────────────────────────────────────────────────
193
+
194
+ def build_admin_product_pool() -> list:
195
+ """Template 7: fully generated SKU/price/name tuples. No API call needed."""
196
+ specs = [
197
+ ("HAR-TEST-001", 19.99, "HAR Training Widget Alpha"),
198
+ ("HAR-TEST-002", 34.50, "HAR Training Widget Beta"),
199
+ ("HAR-TEST-003", 9.99, "HAR Economy Pack"),
200
+ ("HAR-TEST-004", 49.00, "HAR Premium Kit"),
201
+ ("HAR-TEST-005", 7.75, "HAR Starter Bundle"),
202
+ ("HAR-TEST-006", 129.00, "HAR Deluxe Set"),
203
+ ("HAR-TEST-007", 22.00, "HAR Standard Unit"),
204
+ ("HAR-TEST-008", 14.95, "HAR Basic Module"),
205
+ ("HAR-TEST-009", 59.99, "HAR Advanced Pack"),
206
+ ("HAR-TEST-010", 3.50, "HAR Mini Component"),
207
+ ("HAR-TEST-011", 89.00, "HAR Pro Edition"),
208
+ ("HAR-TEST-012", 11.25, "HAR Lite Version"),
209
+ ("HAR-TEST-013", 199.99, "HAR Enterprise Module"),
210
+ ("HAR-TEST-014", 6.00, "HAR Sample Item"),
211
+ ("HAR-TEST-015", 45.00, "HAR Mid-Range Pack"),
212
+ ("HAR-TEST-016", 25.00, "HAR Core Component"),
213
+ ("HAR-TEST-017", 75.00, "HAR Extended Kit"),
214
+ ("HAR-TEST-018", 18.50, "HAR Value Bundle"),
215
+ ("HAR-TEST-019", 99.00, "HAR Complete Suite"),
216
+ ("HAR-TEST-020", 2.99, "HAR Micro Unit"),
217
+ ]
218
+ return [{"sku": sku, "price": price, "product_name": name} for sku, price, name in specs]
219
+
220
+
221
+ # ── Main ──────────────────────────────────────────────────────────────────────
222
+
223
+ def main():
224
+ parser = argparse.ArgumentParser(description="Build HARvestGym parameter pools from live EC2 apps.")
225
+ parser.add_argument("--host", default="ec2-16-59-2-56.us-east-2.compute.amazonaws.com")
226
+ parser.add_argument("--output", default="parameter_pools.json")
227
+ args = parser.parse_args()
228
+
229
+ host = args.host
230
+ output = Path(args.output)
231
+
232
+ print(f"Building parameter pools — host: {host}\n")
233
+
234
+ # Admin token (needed for Shopping endpoints)
235
+ print("[1/7] Fetching admin token...")
236
+ token = get_admin_token(host)
237
+ print(" OK\n")
238
+
239
+ # Template 1 — category pool
240
+ print("[2/7] Template 1: Shopping categories...")
241
+ cat_pool = build_category_pool(host, token)
242
+ print(f" {len(cat_pool)} categories found\n")
243
+
244
+ # Templates 3 & 6 — product pool
245
+ print("[3/7] Templates 3 & 6: Shopping products (simple, in-stock)...")
246
+ prod_pool = build_product_pool(host, token)
247
+ print(f" {len(prod_pool)} products found\n")
248
+
249
+ # Template 2 — Wikipedia
250
+ print("[4/7] Template 2: Verifying Wikipedia articles...")
251
+ wiki_pool = build_wikipedia_pool(host)
252
+ print(f" {len(wiki_pool)} articles verified\n")
253
+
254
+ # Templates 4 & 5 — Forum categories
255
+ print("[5/7] Templates 4 & 5: Forum categories...")
256
+ forum_pool = build_forum_category_pool(host)
257
+ # template_5 category pool excludes any image-only forums — same list for now
258
+ forum_pool_t5 = forum_pool
259
+ print(f" {len(forum_pool)} forums found\n")
260
+
261
+ # Template 5 — post titles (static)
262
+ print("[6/7] Template 5: Post titles (static list, no API call)...")
263
+ print(f" {len(FORUM_POST_TITLES)} titles loaded\n")
264
+
265
+ # Template 7 — admin product specs (static)
266
+ print("[7/7] Template 7: Admin product specs (generated, no API call)...")
267
+ admin_pool = build_admin_product_pool()
268
+ print(f" {len(admin_pool)} product specs loaded\n")
269
+
270
+ # ── Assemble output ───────────────────────────────────────────────────────
271
+ pools = {
272
+ "_meta": {
273
+ "description": "Static parameter pools for the 7 HARvestGym task templates.",
274
+ "generated_at": str(date.today()),
275
+ "generated_from_host": host,
276
+ "how_to_refresh": "python scripts/build_parameter_pools.py --host <EC2_HOST>",
277
+ "source_apps": {
278
+ "shopping": f"http://{host}:{PORTS['shopping']}/",
279
+ "shopping_admin": f"http://{host}:{PORTS['shopping_admin']}/admin",
280
+ "forum": f"http://{host}:{PORTS['forum']}/",
281
+ "wikipedia": f"http://{host}:{PORTS['wikipedia']}/",
282
+ },
283
+ },
284
+ "template_1": {
285
+ "description": "List products in category {category_name}",
286
+ "tier": "Easy",
287
+ "app": "shopping",
288
+ "slots": ["category_name"],
289
+ "source_endpoint": "GET /rest/V1/categories/list?searchCriteria[pageSize]=500",
290
+ "note": "Only leaf categories are meaningful for product listing tasks. category_id is stored for grader use — not exposed in the task string.",
291
+ "pool": {"category_name": cat_pool},
292
+ },
293
+ "template_2": {
294
+ "description": "Retrieve article summary for {title}",
295
+ "tier": "Easy",
296
+ "app": "wikipedia",
297
+ "slots": ["title"],
298
+ "source_endpoint": "HEAD /wikipedia_en_all_maxi_2022-05/A/{slug} (verification only)",
299
+ "note": "expected_slug is stored for grader verification. The agent must derive the slug independently via GET /search.",
300
+ "pool": {"title": wiki_pool},
301
+ },
302
+ "template_3": {
303
+ "description": "Add {product_name} to a guest cart",
304
+ "tier": "Medium",
305
+ "app": "shopping",
306
+ "slots": ["product_name"],
307
+ "source_endpoint": "GET /rest/V1/products?searchCriteria[pageSize]=50 (simple, in-stock only)",
308
+ "note": "SKU stored for grader use — agent must independently discover it via product search.",
309
+ "pool": {"product_name": prod_pool},
310
+ },
311
+ "template_4": {
312
+ "description": "Retrieve all posts in {forum_category} (authed)",
313
+ "tier": "Medium",
314
+ "app": "forum",
315
+ "slots": ["forum_category"],
316
+ "source_endpoint": "GET /api/forums?page=1",
317
+ "note": "forum_name is the URL slug; display_name is the human-readable label.",
318
+ "pool": {"forum_category": forum_pool},
319
+ },
320
+ "template_5": {
321
+ "description": "Create a post titled {title} in {category}",
322
+ "tier": "Hard",
323
+ "app": "forum",
324
+ "slots": ["title", "category"],
325
+ "source_endpoint": "GET /api/forums?page=1 (for category); titles are generated",
326
+ "note": "title and category are sampled independently. category list excludes any image-only forums.",
327
+ "pool": {
328
+ "title": FORUM_POST_TITLES,
329
+ "category": forum_pool_t5,
330
+ },
331
+ },
332
+ "template_6": {
333
+ "description": "Guest checkout for {product_name}",
334
+ "tier": "Hard",
335
+ "app": "shopping",
336
+ "slots": ["product_name"],
337
+ "source_endpoint": "GET /rest/V1/products?searchCriteria[pageSize]=50 (same pool as template_3)",
338
+ "note": "Guest checkout email is always test@example.com (STATIC). Grader queries /rest/V1/orders by email to confirm order creation.",
339
+ "pool": {"product_name": prod_pool},
340
+ },
341
+ "template_7": {
342
+ "description": "Create a new product with SKU {sku}, price {price}",
343
+ "tier": "Hard",
344
+ "app": "shopping_admin",
345
+ "slots": ["sku", "price", "product_name"],
346
+ "source_endpoint": "Fully generated — SKUs follow HAR-XXXXX pattern, no collision with existing catalog.",
347
+ "note": "All slots sampled together as a product_spec tuple. attribute_set_id=4 (Default) is STATIC. Grader calls GET /rest/V1/products/{sku} to verify creation.",
348
+ "pool": {"product_spec": admin_pool},
349
+ },
350
+ }
351
+
352
+ output.write_text(json.dumps(pools, indent=2))
353
+ print(f"Written to {output} ({output.stat().st_size:,} bytes)")
354
+
355
+ # Summary
356
+ print("\n=== POOL SUMMARY ===")
357
+ for tid in [k for k in pools if k.startswith("template")]:
358
+ t = pools[tid]
359
+ counts = {slot: len(vals) for slot, vals in t["pool"].items()}
360
+ print(f" {tid}: {counts}")
361
+
362
+
363
+ if __name__ == "__main__":
364
+ main()
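At episode start, the written pools file is sampled to produce a concrete task string. A minimal sketch of that consumption (the pool contents here are stand-ins, and `instantiate` is a hypothetical helper, not part of the script above):

```python
import random

# Stand-in for one template entry, shaped like parameter_pools.json above.
pools = {
    "template_4": {
        "description": "Retrieve all posts in {forum_category} (authed)",
        "pool": {"forum_category": ["books", "gaming", "science"]},
    },
}

def instantiate(template_id: str, seed: int) -> tuple[str, dict]:
    """Sample one value per slot, then fill the description template."""
    t = pools[template_id]
    rng = random.Random(seed)
    params = {slot: rng.choice(vals) for slot, vals in t["pool"].items()}
    return t["description"].format(**params), params

desc, params = instantiate("template_4", seed=0)
```

Seeding a `random.Random` per episode keeps task sampling reproducible across runs.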
server/__init__.py ADDED
File without changes
server/app.py ADDED
@@ -0,0 +1,49 @@
+ """
+ FastAPI application for HARvestGym.
+ 
+ Exposes HARvestGymEnvironment over HTTP endpoints compatible with OpenEnv EnvClient.
+ 
+ Endpoints:
+     POST /reset  — Reset the environment
+     POST /step   — Execute an action
+     GET /state   — Get current state
+     GET /schema  — Get action/observation schemas
+     GET /health  — Health check
+     WS /ws       — WebSocket for persistent sessions
+ """
+ 
+ try:
+     from openenv.core.env_server.http_server import create_app
+ except Exception as e:
+     raise ImportError(
+         "openenv is required. Install dependencies with 'uv sync'"
+     ) from e
+ 
+ try:
+     from .models import HarvestGymAction, HarvestGymObservation, HARvestGymEnvironment
+ except ModuleNotFoundError:
+     from server.models import HarvestGymAction, HarvestGymObservation, HARvestGymEnvironment
+ 
+ app = create_app(
+     HARvestGymEnvironment,
+     HarvestGymAction,
+     HarvestGymObservation,
+     env_name="HARvestGym",
+     max_concurrent_envs=4,
+ )
+ 
+ 
+ def main(host: str = "0.0.0.0", port: int = 8000):
+     import uvicorn
+     uvicorn.run(app, host=host, port=port)
+ 
+ 
+ if __name__ == "__main__":
+     import argparse
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--port", type=int, default=8000)
+     args = parser.parse_args()
+     main(port=args.port)
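A client drives these endpoints with JSON bodies; the `/step` payload for this environment is just a serialized `HarvestGymAction`. A sketch of building one (field names taken from `server/models.py`; the endpoint wiring itself comes from OpenEnv's `create_app`, so the exact envelope may differ):

```python
import json

# One tool call, serialized the way an EnvClient would POST it to /step.
action = {
    "tool": "curl_exec",
    "args": {"command": "curl -s http://localhost:7770/rest/V1/products?searchCriteria[pageSize]=5"},
}
body = json.dumps(action)

# Round-trips cleanly, matching HarvestGymAction's two fields.
decoded = json.loads(body)
```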
server/episode.py ADDED
@@ -0,0 +1,53 @@
+ """Episode data structures for HARvestGym."""
+ 
+ from dataclasses import dataclass, field
+ from typing import Any
+ 
+ 
+ @dataclass
+ class CurlCall:
+     method: str
+     url: str
+     path: str  # normalized (IDs replaced with {id})
+     headers: dict
+     body: dict | str | None
+     status_code: int
+     response_body: Any
+     response_headers: dict = field(default_factory=dict)
+ 
+ 
+ @dataclass
+ class Step:
+     step_num: int
+     tool: str  # browser_agent | search_endpoints | curl_exec | search_episode_data | done
+     action: str  # raw tool call string
+     result: Any  # tool return value
+     curl_parsed: CurlCall | None = None
+ 
+ 
+ @dataclass
+ class Task:
+     template_id: int  # 1-7
+     description: str  # instantiated task string
+     params: dict  # e.g. {"product_name": "Radiant Tee", "sku": "MH01"}
+     app: str  # shopping | forum | wikipedia | shopping_admin
+     base_url: str
+     difficulty: str  # easy | medium | hard
+ 
+ 
+ @dataclass
+ class Episode:
+     task: Task
+     steps: list[Step] = field(default_factory=list)
+     session_state: dict = field(default_factory=dict)
+     total_steps: int = 0
+     terminated_by: str = ""  # "done_call" | "max_steps"
+ 
+ 
+ @dataclass
+ class EpisodeResult:
+     task_score: float  # 0.0-1.0 from grader
+     parameter_sourcing_score: float  # 0.0-1.0 from trajectory analysis
+     auth_obtained: bool
+     reward: float  # final composite reward
+     details: dict = field(default_factory=dict)
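`CurlCall.path` stores a normalized path with concrete IDs replaced by `{id}`. The normalization itself happens elsewhere in the pipeline; a plausible sketch of it (hypothetical helper, not part of this file) is:

```python
import re

def normalize_path(path: str) -> str:
    """Replace numeric or long-hex path segments with an {id} placeholder."""
    out = []
    for seg in path.split("/"):
        # Numeric IDs and long hex tokens (e.g. cart hashes) become {id}.
        if seg.isdigit() or re.fullmatch(r"[0-9a-f]{16,}", seg):
            out.append("{id}")
        else:
            out.append(seg)
    return "/".join(out)
```

Normalizing this way lets two calls that differ only in a concrete ID count as hits on the same endpoint.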
server/judge.py ADDED
@@ -0,0 +1,691 @@
+ """
+ HARvestGym Judge — deterministic programmatic graders for all 7 task templates.
+ 
+ Each grader inspects the episode trajectory and/or probes the live application
+ to compute a task score in [0.0, 1.0], then maps it to the reward range.
+ """
+ 
+ from __future__ import annotations
+ 
+ import json
+ import re
+ from pathlib import Path
+ from typing import Any
+ 
+ try:
+     import requests as _requests
+     _REQUESTS_AVAILABLE = True
+ except ImportError:
+     _REQUESTS_AVAILABLE = False
+ 
+ from .episode import Episode, EpisodeResult, Step, Task
+ 
+ # ---------------------------------------------------------------------------
+ # Reward tables (score → reward)
+ # ---------------------------------------------------------------------------
+ 
+ REWARD_TABLES = {
+     1: {1.0: 2.0, 0.3: 0.5, 0.0: -1.5},
+     2: {1.0: 2.0, 0.5: 0.5, 0.0: -1.5},
+     3: {1.0: 3.5, 0.2: 0.5, 0.15: 0.3, 0.0: -1.5},
+     4: {1.0: 3.5, 0.3: 0.8, 0.0: -1.5},
+     5: {1.0: 5.0, 0.5: 1.5, 0.3: 0.8, 0.0: -1.5},
+     6: {1.0: 5.0, 0.6: 2.5, 0.3: 0.8, 0.1: 0.3, 0.0: -1.5},
+     7: {1.0: 5.0, 0.7: 2.0, 0.2: 0.5, 0.0: -1.5},
+ }
+ 
+ AUTH_BONUS = 0.3  # added when auth was successfully obtained even if task fails
+ 
+ 
+ def _score_to_reward(score: float, template_id: int) -> float:
+     """Map a [0,1] task score to a reward using the template's reward table."""
+     table = REWARD_TABLES.get(template_id, {1.0: 2.0, 0.0: -1.5})
+     # Walk thresholds from highest to lowest; take the first one the score meets
+     thresholds = sorted(table.keys(), reverse=True)
+     for threshold in thresholds:
+         if score >= threshold:
+             return table[threshold]
+     return table.get(0.0, -1.5)
+ 
+ 
+ # ---------------------------------------------------------------------------
+ # HTTP probe helper
+ # ---------------------------------------------------------------------------
+ 
+ def _judge_probe(path: str, base_url: str, headers: dict | None = None,
+                  timeout: int = 10) -> Any:
+     """Issue an HTTP GET from the judge (not the model) to verify live state."""
+     if not _REQUESTS_AVAILABLE:
+         return None
+     url = base_url.rstrip("/") + path
+     try:
+         resp = _requests.get(url, headers=headers or {}, timeout=timeout, verify=False)
+         result = type("ProbeResult", (), {
+             "status_code": resp.status_code,
+             "body": None,
+         })()
+         try:
+             result.body = resp.json()
+         except Exception:
+             result.body = resp.text
+         return result
+     except Exception as e:
+         print(f"[judge] probe failed {url}: {e}", flush=True)
+         return None
+ 
+ 
+ def _judge_post_probe(path: str, base_url: str, data: dict | None = None,
+                       headers: dict | None = None, timeout: int = 10) -> Any:
+     """Issue an HTTP POST probe from the judge."""
+     if not _REQUESTS_AVAILABLE:
+         return None
+     url = base_url.rstrip("/") + path
+     try:
+         resp = _requests.post(url, json=data, headers=headers or {}, timeout=timeout, verify=False)
+         result = type("ProbeResult", (), {"status_code": resp.status_code, "body": None})()
+         try:
+             result.body = resp.json()
+         except Exception:
+             result.body = resp.text
+         return result
+     except Exception as e:
+         print(f"[judge] post probe failed {url}: {e}", flush=True)
+         return None
+ 
+ 
+ # ---------------------------------------------------------------------------
+ # Shared helpers
+ # ---------------------------------------------------------------------------
+ 
+ def _fuzzy_match(a: str, b: str) -> bool:
+     """Case-insensitive substring match in both directions."""
+     a, b = a.lower().strip(), b.lower().strip()
+     return a in b or b in a
+ 
+ 
+ def _path_matches(path: str, pattern: str) -> bool:
+     """Check if a (normalized) path matches a pattern."""
+     return pattern.lower() in path.lower() or path.lower() in pattern.lower()
+ 
+ 
+ def _extract_field(obj: Any, field_path: str) -> Any:
+     """Extract a nested field via dot notation: 'items.0.sku'."""
+     parts = field_path.split(".")
+     current = obj
+     for part in parts:
+         if current is None:
+             return None
+         if isinstance(current, dict):
+             current = current.get(part)
+         elif isinstance(current, list):
+             try:
+                 current = current[int(part)]
+             except (IndexError, ValueError):
+                 return None
+         else:
+             return None
+     return current
+ 
+ 
+ def _get_curl_steps(episode: Episode):
+     """Return only steps that have curl_parsed."""
+     return [s for s in episode.steps if s.curl_parsed is not None]
+ 
+ 
+ # ---------------------------------------------------------------------------
+ # Template graders
+ # ---------------------------------------------------------------------------
+ 
+ def grade_template_1(episode: Episode, task: Task) -> float:
+     """Easy — Shopping: List products in category {category_name}"""
+     category_name = task.params.get("category_name", "")
+ 
+     for step in _get_curl_steps(episode):
+         cp = step.curl_parsed
+         if cp.status_code == 200:
+             body = cp.response_body
+             if isinstance(body, dict) and "items" in body:
+                 items = body["items"]
+                 if len(items) > 0:
+                     # Check if any item mentions the category
+                     for item in items:
+                         if _item_matches_category(item, category_name):
+                             return 1.0
+                     # Items returned but can't verify category — partial
+                     return 0.3
+             # Also check if it's a raw list
+             if isinstance(body, list) and len(body) > 0:
+                 return 0.3
+ 
+     return 0.0
+ 
+ 
+ def _item_matches_category(item: dict, category_name: str) -> bool:
+     """Check if an item is in the given category (best-effort)."""
+     # Check extension_attributes for category links; the category *name* is
+     # not verified here — the response is trusted at face value.
+     ext = item.get("extension_attributes", {})
+     category_links = ext.get("category_links", [])
+     if category_links:
+         return True  # has category links; assume matches
+     # Fallback: just having items is enough for category listing
+     return True
+ 
+ 
+ def grade_template_2(episode: Episode, task: Task) -> float:
+     """Easy — Wikipedia: Retrieve article for {title}"""
+     title = task.params.get("title", "")
+     title_slug = title.lower().replace(" ", "_")
+     title_lower = title.lower()
+ 
+     for step in _get_curl_steps(episode):
+         cp = step.curl_parsed
+         if cp.status_code == 200:
+             url_lower = cp.url.lower()
+             # Direct article fetch (slug appears in the URL, e.g. wiki/<slug>)
+             if title_slug in url_lower:
+                 return 1.0
+ 
+     # Search result found the article
+     for step in _get_curl_steps(episode):
+         cp = step.curl_parsed
+         if cp.status_code == 200:
+             body_str = str(cp.response_body).lower()
+             if title_lower in body_str and "wiki" in cp.url.lower():
+                 return 0.5
+ 
+     return 0.0
+ 
+ 
+ def _extract_cart_id(episode: Episode) -> str | None:
+     """Extract guest cart ID from episode trajectory."""
+     for step in _get_curl_steps(episode):
+         cp = step.curl_parsed
+         if cp.status_code == 200:
+             # POST /rest/V1/guest-carts returns bare string cart ID
+             if "guest-carts" in cp.path and cp.method == "POST":
+                 body = cp.response_body
+                 if isinstance(body, str) and len(body) > 5:
+                     return body.strip('"').strip()
+     return None
+ 
+ 
+ def grade_template_3(episode: Episode, task: Task) -> float:
+     """Medium — Shopping: Add {product_name} to a guest cart"""
+     product_name = task.params.get("product_name", "")
+     sku = task.params.get("sku")
+ 
+     # Primary: check if add-to-cart responded with item_id
+     for step in _get_curl_steps(episode):
+         cp = step.curl_parsed
+         if cp.status_code == 200:
+             body = cp.response_body
+             if isinstance(body, dict) and "item_id" in body:
+                 # Verify the sku if we have it
+                 if sku and body.get("sku") == sku:
+                     return 1.0
+                 if _fuzzy_match(str(body.get("name", "")), product_name):
+                     return 1.0
+                 if body.get("item_id"):
+                     return 1.0
+ 
+     # Try live probe
+     cart_id = _extract_cart_id(episode)
+     if cart_id:
+         probe = _judge_probe(f"/rest/V1/guest-carts/{cart_id}", task.base_url)
+         if probe and probe.status_code == 200:
+             items = probe.body.get("items", []) if isinstance(probe.body, dict) else []
+             for item in items:
+                 if sku and item.get("sku") == sku:
+                     return 1.0
+                 if _fuzzy_match(str(item.get("name", "")), product_name):
+                     return 1.0
+             if len(items) == 0:
+                 return 0.2  # cart created, item not added
+ 
+     # Partial: cart was created
+     if cart_id:
+         return 0.2
+ 
+     # Partial: attempted cart creation
+     if any("guest-carts" in (s.curl_parsed.path or "") and
+            s.curl_parsed.method == "POST"
+            for s in _get_curl_steps(episode)):
+         return 0.15
+ 
+     return 0.0
+ 
+ 
+ def _check_forum_auth(episode: Episode) -> bool:
+     """Check if forum authentication was obtained."""
+     for step in _get_curl_steps(episode):
+         cp = step.curl_parsed
+         if cp.method == "POST" and "/login" in cp.path:
+             if cp.status_code in (200, 302):
+                 return True
+     return False
+ 
+ 
+ def _response_contains_posts(body: Any) -> bool:
+     """Check if a response body contains forum posts."""
+     if isinstance(body, list) and len(body) > 0:
+         return True
+     if isinstance(body, dict):
+         # Could be JSON with posts array or HTML
+         for key in ("posts", "items", "data", "results"):
+             if key in body and isinstance(body[key], list) and len(body[key]) > 0:
+                 return True
+         # Postmill returns HTML — check for common post indicators
+         body_str = str(body).lower()
+         if "post" in body_str or "submission" in body_str:
+             return True
+     if isinstance(body, str) and len(body) > 100:
+         return True  # HTML response from forum
+     return False
+ 
+ 
+ def grade_template_4(episode: Episode, task: Task) -> float:
+     """Medium — Forum: Retrieve posts in {forum_category} (authed)"""
+     forum_category = task.params.get("forum_category", "")
+     score = 0.0
+ 
+     auth_obtained = _check_forum_auth(episode)
+     if auth_obtained:
+         score += 0.3
+ 
+     for step in _get_curl_steps(episode):
+         cp = step.curl_parsed
+         if cp.status_code == 200:
+             url = cp.url
+             body = cp.response_body
+             if f"/f/{forum_category.lower()}" in url.lower():
+                 if _response_contains_posts(body):
+                     return 1.0
+             # Also accept generic post listing with the category in URL
+             if forum_category.lower() in url.lower() and _response_contains_posts(body):
+                 return 1.0
+ 
+     return score
+ 
+ 
+ def _check_csrf_in_trajectory(episode: Episode) -> bool:
+     """Check that a POST body contained a _csrf_token field."""
+     for step in _get_curl_steps(episode):
+         cp = step.curl_parsed
+         if cp.method == "POST":
+             body_str = str(cp.body or "").lower()
+             if "_csrf_token" in body_str and len(body_str) > 20:
+                 return True
+     return False
+ 
+ 
+ def _judge_probe_forum_category(category: str, base_url: str) -> list:
+     """Probe the forum to get posts in a category."""
+     probe = _judge_probe(f"/f/{category}.json", base_url)
+     if probe and probe.status_code == 200:
+         body = probe.body
+         if isinstance(body, dict):
+             return body.get("posts", body.get("submissions", []))
+         if isinstance(body, list):
+             return body
+     return []
+ 
+ 
+ def grade_template_5(episode: Episode, task: Task) -> float:
+     """Hard — Forum: Create a post titled {title} in {category}"""
+     title = task.params.get("title", "")
+     category = task.params.get("category", "")
+ 
+     auth_ok = _check_forum_auth(episode)
+     csrf_used = _check_csrf_in_trajectory(episode)
+ 
+     # Check if POST to submit returned success
+     for step in _get_curl_steps(episode):
+         cp = step.curl_parsed
+         if cp.method == "POST" and cp.status_code in (200, 201, 302):
+             path_lower = cp.path.lower()
+             if "submit" in path_lower or "post" in path_lower:
+                 # Post creation succeeded
+                 body_str = str(cp.response_body or "").lower()
+                 if title.lower() in body_str or "redirect" in str(cp.response_headers).lower():
+                     return 1.0
+                 if cp.status_code in (201, 302):
+                     return 1.0
+ 
+     # Try judge probe
+     posts = _judge_probe_forum_category(category, task.base_url)
+     for post in posts:
+         post_title = post.get("title", post.get("name", ""))
+         if _fuzzy_match(post_title, title):
+             return 1.0
+ 
+     if auth_ok and csrf_used:
+         return 0.5
+     if auth_ok:
+         return 0.3
+     return 0.0
+ 
+ 
+ def _checkout_stages_completed(episode: Episode, sku: str | None) -> int:
+     """Count checkout stages completed successfully."""
+     stages = 0
+     paths_hit = {
+         s.curl_parsed.path
+         for s in _get_curl_steps(episode)
+         if s.curl_parsed.status_code == 200
+     }
+ 
+     if any("guest-carts" in p and "{" not in p for p in paths_hit):
+         stages += 1
+     if any("guest-carts" in p and "items" in p for p in paths_hit):
+         stages += 1
+     if any("guest-carts" in p and ("shipping" in p or "email" in p) for p in paths_hit):
+         stages += 1
+     if any("guest-carts" in p and ("payment" in p or "order" in p) for p in paths_hit):
+         stages += 1
+ 
+     return stages
+ 
+ 
+ def grade_template_6(episode: Episode, task: Task) -> float:
+     """Hard — Shopping: Guest checkout for {product_name}"""
+     sku = task.params.get("sku")
+ 
+     # Check for order ID
+     for step in _get_curl_steps(episode):
+         cp = step.curl_parsed
+         if cp.status_code == 200:
+             body = cp.response_body
+             if isinstance(body, int) and body > 0:
+                 return 1.0
+             if isinstance(body, str):
+                 try:
+                     v = int(body.strip('"').strip())
+                     if v > 0:
+                         return 1.0
+                 except (ValueError, AttributeError):
+                     pass
+             if isinstance(body, dict) and body.get("order_id"):
+                 return 1.0
+ 
+     stages = _checkout_stages_completed(episode, sku)
+     if stages >= 4:
+         return 0.6
+     if stages >= 2:
+         return 0.3
+     if stages >= 1:
+         return 0.1
+     return 0.0
+ 
+ 
+ def _extract_admin_token(episode: Episode) -> str | None:
+     """Find admin bearer token from episode trajectory."""
+     for step in _get_curl_steps(episode):
+         cp = step.curl_parsed
+         if cp.status_code == 200 and "integration/admin/token" in cp.path:
+             body = cp.response_body
+             if isinstance(body, str) and len(body) > 10:
+                 return body.strip('"').strip()
+     return None
+ 
+ 
+ def _attempted_product_creation(episode: Episode, sku: str) -> bool:
+     """Check if the model attempted to create a product with this SKU."""
+     for step in _get_curl_steps(episode):
+         cp = step.curl_parsed
+         if cp.method == "POST" and "products" in cp.path:
+             body_str = str(cp.body or "").lower()
+             if sku.lower() in body_str:
+                 return True
+     return False
+ 
+ 
+ def grade_template_7(episode: Episode, task: Task) -> float:
+     """Hard — Shopping Admin: Create product with SKU {sku}, price {price}"""
+     sku = task.params.get("sku", "")
+     price = float(task.params.get("price", 0))
+ 
+     admin_token = _extract_admin_token(episode)
+     if not admin_token:
+         return 0.0
+ 
+     # Check if product creation returned success
+     for step in _get_curl_steps(episode):
+         cp = step.curl_parsed
+         if cp.status_code == 200 and cp.method == "POST" and "products" in cp.path:
+             body = cp.response_body
+             if isinstance(body, dict) and body.get("id"):
+                 actual_price = float(body.get("price", -1))
+                 price_ok = abs(actual_price - price) < 0.01
+                 return 1.0 if price_ok else 0.7
+ 
+     # Judge probe
+     probe = _judge_probe(
+         f"/rest/V1/products/{sku}",
+         task.base_url,
+         headers={"Authorization": f"Bearer {admin_token}"}
+     )
+     if probe and probe.status_code == 200 and isinstance(probe.body, dict):
+         actual_price = float(probe.body.get("price", -1))
+         price_ok = abs(actual_price - price) < 0.01
+         return 1.0 if price_ok else 0.7
+ 
+     if _attempted_product_creation(episode, sku):
+         return 0.2
+ 
+     return 0.0
+ 
+ 
+ # ---------------------------------------------------------------------------
+ # Parameter sourcing verification
+ # ---------------------------------------------------------------------------
+ 
+ def _load_catalog(app: str) -> list[dict]:
+     """Load the ground truth catalog for an app."""
+     # judge.py lives in server/, so the repo-root catalogs/ dir is one level up
+     catalog_path = Path(__file__).parent.parent / "catalogs" / f"{app}.json"
+     if not catalog_path.exists():
+         return []
+     try:
+         with open(catalog_path) as f:
+             data = json.load(f)
+         return data if isinstance(data, list) else data.get("endpoints", [])
+     except Exception:
+         return []
+ 
+ 
+ def _find_catalog_entry(path: str, method: str, catalog: list[dict]) -> dict | None:
+     method = method.upper()
+     for entry in catalog:
+         cat_method = entry.get("method", "GET").upper()
+         cat_path = entry.get("path", "")
+         # Pattern match: {id} in catalog matches any segment
+         if cat_method == method and _path_pattern_match(path, cat_path):
+             return entry
+     return None
+ 
+ 
+ def _path_pattern_match(actual_path: str, catalog_path: str) -> bool:
+     """Match actual path against catalog pattern with {id} wildcards."""
+     # Convert catalog pattern to regex
+     pattern = re.escape(catalog_path)
+     pattern = pattern.replace(r"\{", "{").replace(r"\}", "}")
+     pattern = re.sub(r"\{[^}]+\}", "[^/]+", pattern)
+     pattern = f"^{pattern}$"
+     return bool(re.match(pattern, actual_path, re.IGNORECASE))
+ 
+ 
+ def verify_parameter_sourcing(episode: Episode, task: Task) -> float:
+     """Analyze parameter sourcing across episode trajectory. Returns [0, 1] score."""
+     catalog = _load_catalog(task.app)
+     if not catalog:
+         return 0.5  # neutral if no catalog
+ 
+     correct = 0
+     total = 0
+     steps = _get_curl_steps(episode)
+ 
+     for step in steps:
+         cp = step.curl_parsed
+         catalog_entry = _find_catalog_entry(cp.path, cp.method, catalog)
+         if not catalog_entry:
+             continue
+ 
+         path_params = catalog_entry.get("path_params", {})
+         body_params = catalog_entry.get("body_params", {})
+ 
+         for param_name, param_meta in path_params.items():
+             total += 1
+             value = _extract_path_param_value(cp.url, param_name)
+             if value and _param_sourced_correctly(value, param_meta, episode, step):
+                 correct += 1
+ 
+         for param_name, param_meta in body_params.items():
+             total += 1
+             value = _extract_body_param_value(cp.body, param_name)
+             if value and _param_sourced_correctly(value, param_meta, episode, step):
+                 correct += 1
+ 
+     if total == 0:
+         return 0.5
+     return correct / total
+ 
+ 
+ def _extract_path_param_value(url: str, param_name: str) -> str | None:
+     """Best-effort path param extraction."""
+     # Just extract last non-empty path segment as a value
+     from urllib.parse import urlparse
+     path = urlparse(url).path
+     segments = [s for s in path.split("/") if s]
+     if segments:
+         return segments[-1]
+     return None
+ 
+ 
+ def _extract_body_param_value(body: Any, param_name: str) -> Any:
+     """Extract a named param from request body."""
+     if body is None:
+         return None
+     if isinstance(body, dict):
+         if param_name in body:
+             return body[param_name]
+         # Search nested
+         for v in body.values():
+             if isinstance(v, dict):
+                 result = _extract_body_param_value(v, param_name)
+                 if result is not None:
+                     return result
+     if isinstance(body, str):
+         # Form-encoded: key=value&...
+         for pair in body.split("&"):
+             if "=" in pair:
+                 k, _, v = pair.partition("=")
+                 if k.strip() == param_name:
+                     return v.strip()
+     return None
+ 
+ 
+ def _param_sourced_correctly(value: Any, param_meta: dict,
+                              episode: Episode, step: Step) -> bool:
+     source = param_meta.get("source", "")
+     value_str = str(value)
+ 
+     if source == "TASK_SPEC":
+         return value_str in episode.task.description
+ 
+     elif source == "PREV_CALL":
+         from_endpoint = param_meta.get("from_endpoint", "")
+         from_field = param_meta.get("from_field", "")
+         for prior_step in episode.steps:
+             if prior_step.step_num >= step.step_num:
+                 break
+             if prior_step.curl_parsed:
+                 ps = prior_step.curl_parsed
+                 if _path_matches(ps.path, from_endpoint):
+                     extracted = _extract_field(ps.response_body, from_field)
+                     if str(extracted) == value_str:
+                         return True
+         return False
+ 
+     elif source == "AUTH_FLOW":
+         return value_str in str(episode.session_state.values())
+ 
+     elif source == "STATIC":
+         expected = str(param_meta.get("value", ""))
+         return value_str == expected
+ 
+     elif source == "DERIVED":
+         # Simplified: check if it appeared anywhere in session state
+         return value_str in str(episode.session_state.values())
+ 
+     return True  # unknown source type — don't penalize
+ 
+ 
+ # ---------------------------------------------------------------------------
+ # Main judge entry point
+ # ---------------------------------------------------------------------------
+ 
+ _GRADERS = {
+     1: grade_template_1,
+     2: grade_template_2,
+     3: grade_template_3,
+     4: grade_template_4,
+     5: grade_template_5,
+     6: grade_template_6,
+     7: grade_template_7,
+ }
+ 
+ 
+ def evaluate(episode: Episode) -> EpisodeResult:
+     """
+     Evaluate a completed episode and return reward + diagnostics.
+ 
+     Args:
+         episode: Completed episode with all steps recorded.
+ 
+     Returns:
+         EpisodeResult with task_score, parameter_sourcing_score, reward, details.
+     """
+     task = episode.task
+     template_id = task.template_id
+ 
+     grader = _GRADERS.get(template_id)
+     if grader is None:
+         return EpisodeResult(
+             task_score=0.0,
+             parameter_sourcing_score=0.0,
+             auth_obtained=False,
+             reward=-1.5,
+             details={"error": f"Unknown template_id: {template_id}"},
+         )
+ 
+     task_score = grader(episode, task)
+     param_score = verify_parameter_sourcing(episode, task)
+     auth_obtained = _check_forum_auth(episode) or bool(_extract_admin_token(episode))
+ 
+     # Compute reward
+     reward = _score_to_reward(task_score, template_id)
+ 
+     # Bonus for auth obtained even on task failure
+     if task_score < 0.5 and auth_obtained:
+         reward = max(reward, AUTH_BONUS)
+ 
+     return EpisodeResult(
+         task_score=task_score,
+         parameter_sourcing_score=param_score,
+         auth_obtained=auth_obtained,
+         reward=reward,
+         details={
+             "template_id": template_id,
+             "difficulty": task.difficulty,
+             "task_score": task_score,
+             "param_score": param_score,
+             "terminated_by": episode.terminated_by,
+             "total_steps": episode.total_steps,
+         },
+     )
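Catalog matching above hinges on `_path_pattern_match`, which rewrites `{param}` segments into single-segment regex wildcards. The same translation, shown standalone with a usage check (logic copied from the function above; the SKU path is an illustrative example):

```python
import re

def path_pattern_match(actual_path: str, catalog_path: str) -> bool:
    """Treat {param} segments in the catalog path as single-segment wildcards."""
    pattern = re.escape(catalog_path)
    # re.escape also escapes braces; un-escape them so the placeholder survives
    pattern = pattern.replace(r"\{", "{").replace(r"\}", "}")
    # Each {param} matches exactly one path segment (no slashes)
    pattern = re.sub(r"\{[^}]+\}", "[^/]+", pattern)
    return bool(re.match(f"^{pattern}$", actual_path, re.IGNORECASE))

path_pattern_match("/rest/V1/products/MH01", "/rest/V1/products/{sku}")
```

Because `[^/]+` cannot cross a slash and the pattern is anchored at both ends, `/rest/V1/products/MH01/media` does not match `/rest/V1/products/{sku}`.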
server/models.py ADDED
@@ -0,0 +1,517 @@
+ """
+ HARvestGym Environment — core OpenEnv models.py
+ 
+ Implements the OpenEnv spec:
+ - Observation, Action, Reward as Pydantic models
+ - reset() → initial observation + clean state
+ - step(action) → (observation, reward, done, info)
+ - state() → current state snapshot
+ 
+ The environment manages episode state, dispatches tool calls, computes per-step
+ rewards, and invokes the judge at episode end.
+ """
+ 
+ from __future__ import annotations
+ 
+ import json
+ import os
+ import random
+ from pathlib import Path
+ from typing import Any
+ from uuid import uuid4
+ 
+ from openenv.core.env_server.interfaces import Environment
+ from openenv.core.env_server.types import Action as BaseAction, Observation as BaseObservation, State
+ from pydantic import Field
+ 
+ # ---------------------------------------------------------------------------
+ # Pydantic models
+ # ---------------------------------------------------------------------------
+ 
+ 
+ class HarvestGymObservation(BaseObservation):
+     """What the RL agent sees at each step."""
+ 
+     task: str = Field(default="", description="Natural language task description")
+     app_base_url: str = Field(default="", description="Root URL of the target application")
+     last_tool_result: Any = Field(default=None, description="Result of last tool call")
+     history: list[dict] = Field(default_factory=list, description="Full episode trajectory")
+     session_state: dict = Field(default_factory=dict, description="Auto-managed cookies/tokens")
+     step_count: int = Field(default=0)
+     max_steps: int = Field(default=20)
+     available_tools: list[str] = Field(
+         default_factory=lambda: [
+             "browser_agent(task, url)",
+             "search_endpoints(query)",
+             "curl_exec(command)",
+             "search_episode_data(query)",
+             "done(result?)",
+         ]
+     )
+ 
+ 
+ class HarvestGymAction(BaseAction):
+     """One tool call from the RL agent."""
+ 
+     tool: str = Field(..., description="Tool name: browser_agent|search_endpoints|curl_exec|search_episode_data|done")
+     args: dict = Field(default_factory=dict, description="Tool-specific arguments")
+ 
+ 
+ class HarvestGymReward(BaseObservation):
+     """Reward signal (returned as part of the observation)."""
+ 
+     value: float = Field(default=0.0, description="Scalar reward for this step")
+     breakdown: dict = Field(default_factory=dict, description="Per-signal reward components")
+ 
+ 
+ # ---------------------------------------------------------------------------
+ # Per-step reward constants
+ # ---------------------------------------------------------------------------
+ 
+ REWARD_VALID_API_CALL = 0.2         # curl_exec returns 2xx
+ REWARD_NEW_PATH = 0.1               # curl path not seen before this episode
+ REWARD_CORRECT_PARAM = 0.25         # judge: correct parameter sourcing (applied at end)
+ REWARD_SESSION_VALUE = 0.1          # auth token/cookie correctly used
+ PENALTY_REPEATED_CALL = -0.15       # exact duplicate curl command
+ PENALTY_BROWSER_AGENT_AGAIN = -0.3  # browser_agent called after step 1
+ PENALTY_MALFORMED_CURL = -0.1       # curl can't be parsed/executed
+ PENALTY_4XX = -0.05                 # recoverable HTTP error
+ 
+ MAX_STEPS = 20
+ 
+ # ---------------------------------------------------------------------------
+ # Task templates
+ # ---------------------------------------------------------------------------
+ 
+ TEMPLATE_META = {
+     1: {"tier": "easy", "app": "shopping", "base_url_port": 7770},
+     2: {"tier": "easy", "app": "wikipedia", "base_url_port": 8888},
+     3: {"tier": "medium", "app": "shopping", "base_url_port": 7770},
+     4: {"tier": "medium", "app": "forum", "base_url_port": 9999},
+     5: {"tier": "hard", "app": "forum", "base_url_port": 9999},
+     6: {"tier": "hard", "app": "shopping", "base_url_port": 7770},
+     7: {"tier": "hard", "app": "shopping_admin", "base_url_port": 7780},
+ }
97
+
98
+ EC2_HOST = os.environ.get("EC2_HOST", "ec2-16-59-2-56.us-east-2.compute.amazonaws.com")
+
+ TASK_NAME_TO_TEMPLATE = {
+     "har_classify_easy": 1,
+     "har_classify_medium": 3,
+     "har_pipeline_hard": 6,
+ }
+
+ TEMPLATE_DESCRIPTIONS = {
+     1: "List products in category {category_name}",
+     2: "Retrieve the Wikipedia article for '{title}'",
+     3: "Add '{product_name}' to a guest cart",
+     4: "Retrieve all posts in the '{forum_category}' forum (you must log in first)",
+     5: "Create a forum post titled '{title}' in the '{category}' forum",
+     6: "Complete a guest checkout for '{product_name}'",
+     7: "Create a new product in the admin panel with SKU '{sku}' and price {price}",
+ }
+
+
+ def _load_parameter_pools() -> dict:
+     pools_path = Path(__file__).parent.parent / "parameter_pools.json"
+     if pools_path.exists():
+         with open(pools_path) as f:
+             return json.load(f)
+     return {}
+
+
+ def _sample_task(template_id: int, parameter_pools: dict) -> tuple[str, dict, str]:
+     """
+     Sample a task instance from the parameter pool.
+
+     Returns: (task_description, params_dict, app_base_url)
+     """
+     meta = TEMPLATE_META[template_id]
+     pool_key = f"template_{template_id}"
+     pool_data = parameter_pools.get(pool_key, {})
+     pool = pool_data.get("pool", {})
+
+     params: dict = {}
+
+     if template_id == 1:
+         items = pool.get("category_name", [{"name": "Gear", "category_id": 3}])
+         chosen = random.choice(items)
+         params = {"category_name": chosen["name"], "category_id": chosen.get("category_id")}
+         description = TEMPLATE_DESCRIPTIONS[1].format(**params)
+
+     elif template_id == 2:
+         items = pool.get("title", [{"title": "Python (programming language)", "expected_slug": "Python_(programming_language)"}])
+         if not items:
+             items = [{"title": "Python (programming language)", "expected_slug": "Python_(programming_language)"}]
+         chosen = random.choice(items)
+         title = chosen.get("title", chosen) if isinstance(chosen, dict) else chosen
+         # Guard: pool entries may be plain strings, which have no .get()
+         expected_slug = chosen.get("expected_slug", title.replace(" ", "_")) if isinstance(chosen, dict) else title.replace(" ", "_")
+         params = {"title": title, "expected_slug": expected_slug}
+         description = TEMPLATE_DESCRIPTIONS[2].format(**params)
+
+     elif template_id == 3:
+         items = pool.get("product_name", [{"name": "Radiant Tee", "sku": "MH01"}])
+         if not items:
+             items = [{"name": "Radiant Tee", "sku": "MH01"}]
+         chosen = random.choice(items)
+         product_name = chosen.get("name", chosen) if isinstance(chosen, dict) else chosen
+         sku = chosen.get("sku", "") if isinstance(chosen, dict) else ""
+         params = {"product_name": product_name, "sku": sku}
+         description = TEMPLATE_DESCRIPTIONS[3].format(**params)
+
+     elif template_id == 4:
+         items = pool.get("forum_category", [{"slug": "general", "name": "General"}])
+         if not items:
+             items = [{"slug": "general", "name": "General"}]
+         chosen = random.choice(items)
+         forum_cat = chosen.get("slug", chosen.get("name", "general")) if isinstance(chosen, dict) else chosen
+         params = {"forum_category": forum_cat}
+         description = TEMPLATE_DESCRIPTIONS[4].format(**params)
+
+     elif template_id == 5:
+         categories = pool.get("forum_category", [{"slug": "general"}])
+         titles = pool.get("post_title", ["Testing the API agent framework"])
+         if not categories:
+             categories = [{"slug": "general"}]
+         if not titles:
+             titles = ["Testing the API agent framework"]
+         chosen_cat = random.choice(categories)
+         chosen_title = random.choice(titles) if isinstance(titles[0], str) else random.choice(titles).get("title", "Test post")
+         forum_cat = chosen_cat.get("slug", "general") if isinstance(chosen_cat, dict) else chosen_cat
+         params = {"title": chosen_title, "category": forum_cat}
+         description = TEMPLATE_DESCRIPTIONS[5].format(**params)
+
+     elif template_id == 6:
+         items = pool.get("product_name", [{"name": "Radiant Tee", "sku": "MH01"}])
+         if not items:
+             items = [{"name": "Radiant Tee", "sku": "MH01"}]
+         chosen = random.choice(items)
+         product_name = chosen.get("name", chosen) if isinstance(chosen, dict) else chosen
+         sku = chosen.get("sku", "") if isinstance(chosen, dict) else ""
+         params = {"product_name": product_name, "sku": sku}
+         description = TEMPLATE_DESCRIPTIONS[6].format(**params)
+
+     elif template_id == 7:
+         items = pool.get("admin_sku", [{"sku": "HAR-TEST-001", "price": "29.99"}])
+         if not items:
+             items = [{"sku": "HAR-TEST-001", "price": "29.99"}]
+         chosen = random.choice(items)
+         sku = chosen.get("sku", "HAR-TEST-001") if isinstance(chosen, dict) else chosen
+         price = str(chosen.get("price", "29.99")) if isinstance(chosen, dict) else "29.99"
+         params = {"sku": sku, "price": price}
+         description = TEMPLATE_DESCRIPTIONS[7].format(**params)
+
+     else:
+         params = {}
+         description = f"Template {template_id}"
+
+     port = meta["base_url_port"]
+     base_url = f"http://{EC2_HOST}:{port}/"
+     return description, params, base_url
+
+
214
+ # ---------------------------------------------------------------------------
215
+ # Environment
216
+ # ---------------------------------------------------------------------------
217
+
218
+ class HARvestGymEnvironment(Environment):
219
+ """
220
+ HARvestGym: RL environment for training API-native web agents.
221
+
222
+ The agent must discover and execute the correct sequence of HTTP API calls
223
+ to complete real-world tasks on live web applications — starting from only
224
+ a task description and a URL, with no prior knowledge of the API schema.
225
+ """
226
+
227
+ SUPPORTS_CONCURRENT_SESSIONS: bool = True
228
+
229
+ def __init__(self):
230
+ self._state = State(episode_id=str(uuid4()), step_count=0)
231
+ self._parameter_pools = _load_parameter_pools()
232
+ self._current_task = None # Task dataclass
233
+ self._episode = None # Episode dataclass
234
+ self._session_state: dict = {}
235
+ self._episode_store: dict = {} # embeddings, BM25 corpus, etc.
236
+ self._called_paths: set = set() # for new-path reward
237
+ self._last_curl_commands: list = [] # for duplicate detection
238
+ self._step_rewards: list[float] = []
239
+ self._done = False
240
+
241
+ # Determine default template from env var
242
+ self._task_name = os.environ.get("HARVGYM_TASK", "har_classify_easy")
243
+
244
+ def _get_template_id(self) -> int:
245
+ """Resolve task name or template ID from env var."""
246
+ task_name = self._task_name
247
+ if task_name in TASK_NAME_TO_TEMPLATE:
248
+ return TASK_NAME_TO_TEMPLATE[task_name]
249
+ # Try integer
250
+ try:
251
+ tid = int(task_name)
252
+ if 1 <= tid <= 7:
253
+ return tid
254
+ except (ValueError, TypeError):
255
+ pass
256
+ return 1 # default: easy
257
+
258
+ def reset(self) -> HarvestGymObservation:
259
+ """Reset environment: clear episode state, sample new task."""
260
+ from .episode import Episode, Task
261
+
262
+ template_id = self._get_template_id()
263
+ description, params, base_url = _sample_task(template_id, self._parameter_pools)
264
+
265
+ meta = TEMPLATE_META[template_id]
266
+ self._current_task = Task(
267
+ template_id=template_id,
268
+ description=description,
269
+ params=params,
270
+ app=meta["app"],
271
+ base_url=base_url,
272
+ difficulty=meta["tier"],
273
+ )
274
+
275
+ self._episode = Episode(task=self._current_task)
276
+ self._session_state = {}
277
+ self._episode_store = {}
278
+ self._called_paths = set()
279
+ self._last_curl_commands = []
280
+ self._step_rewards = []
281
+ self._done = False
282
+ self._state = State(episode_id=str(uuid4()), step_count=0)
283
+
284
+ return HarvestGymObservation(
285
+ task=description,
286
+ app_base_url=base_url,
287
+ last_tool_result=None,
288
+ history=[],
289
+ session_state={},
290
+ step_count=0,
291
+ max_steps=MAX_STEPS,
292
+ done=False,
293
+ reward=0.0,
294
+ metadata={
295
+ "template_id": template_id,
296
+ "difficulty": meta["tier"],
297
+ "app": meta["app"],
298
+ },
299
+ )
300
+
301
+     def step(self, action: HarvestGymAction) -> HarvestGymObservation:  # type: ignore[override]
+         """Execute one tool call and return the next observation."""
+         from .episode import Step, CurlCall
+
+         if self._done:
+             # Episode already finished
+             return self._make_obs(
+                 last_tool_result={"error": "Episode already done. Call reset()."},
+                 reward=0.0,
+                 done=True,
+             )
+
+         self._state.step_count += 1
+         step_num = self._state.step_count
+
+         tool = action.tool.lower().strip()
+         args = action.args or {}
+
+         # Dispatch tool
+         result, step_reward, done = self._dispatch_tool(tool, args, step_num)
+
+         # Record step in episode
+         step_obj = Step(
+             step_num=step_num,
+             tool=tool,
+             action=f"{tool}({json.dumps(args)})",
+             result=result,
+         )
+
+         # If curl_exec, parse the curl call for the judge
+         if tool == "curl_exec":
+             command = args.get("command", "")
+             try:
+                 from urllib.parse import urlparse
+
+                 from .tools.browser_agent import _normalise_path
+                 from .tools.curl_exec import parse_curl_command
+
+                 parsed = parse_curl_command(command)
+                 path = urlparse(parsed["url"]).path if parsed["url"] else ""
+                 norm_path = _normalise_path(path)
+
+                 resp = result if isinstance(result, dict) else {}
+                 step_obj.curl_parsed = CurlCall(
+                     method=parsed["method"],
+                     url=parsed["url"] or "",
+                     path=norm_path,
+                     headers=parsed["headers"],
+                     body=parsed["body"],
+                     status_code=resp.get("status_code", 0),
+                     response_body=resp.get("body"),
+                     response_headers=resp.get("headers", {}),
+                 )
+             except Exception:
+                 pass
+
+         if self._episode:
+             self._episode.steps.append(step_obj)
+             self._episode.total_steps = step_num
+
+         self._step_rewards.append(step_reward)
+
+         # Check max steps
+         if step_num >= MAX_STEPS and not done:
+             done = True
+             if self._episode:
+                 self._episode.terminated_by = "max_steps"
+             # Invoke judge
+             judge_reward = self._invoke_judge()
+             step_reward += judge_reward
+
+         if done and self._episode and not self._episode.terminated_by:
+             self._episode.terminated_by = "done_call"
+
+         self._done = done
+
+         # Build history entry
+         history_entry = {
+             "step": step_num,
+             "tool": tool,
+             "args": args,
+             "result": result,
+             "reward": step_reward,
+         }
+         if self._episode:
+             history_for_obs = [
+                 {"step": s.step_num, "tool": s.tool, "result": s.result}
+                 for s in self._episode.steps
+             ]
+         else:
+             history_for_obs = [history_entry]
+
+         return HarvestGymObservation(
+             task=self._current_task.description if self._current_task else "",
+             app_base_url=self._current_task.base_url if self._current_task else "",
+             last_tool_result=result,
+             history=history_for_obs,
+             session_state=dict(self._session_state),
+             step_count=step_num,
+             max_steps=MAX_STEPS,
+             done=done,
+             reward=step_reward,
+             metadata={
+                 "step": step_num,
+                 "tool": tool,
+                 "step_reward": step_reward,
+             },
+         )
+
+     def _dispatch_tool(self, tool: str, args: dict, step_num: int) -> tuple[Any, float, bool]:
+         """
+         Dispatch to the correct tool. Returns (result, step_reward, done).
+         """
+         reward = 0.0
+         done = False
+
+         if tool == "browser_agent":
+             task = args.get("task", self._current_task.description if self._current_task else "")
+             url = args.get("url", self._current_task.base_url if self._current_task else "")
+
+             # Penalty if called after step 1
+             if step_num > 1:
+                 reward += PENALTY_BROWSER_AGENT_AGAIN
+
+             from .tools.browser_agent import run_browser_agent
+             result = run_browser_agent(task, url, episode_store=self._episode_store)
+
+         elif tool == "search_endpoints":
+             query = args.get("query", "")
+             from .tools.search_endpoints import search_endpoints
+             result = search_endpoints(query, self._episode_store)
+
+         elif tool == "curl_exec":
+             command = args.get("command", "")
+             if not command:
+                 return {"error": "curl_exec requires 'command' argument"}, PENALTY_MALFORMED_CURL, False
+
+             # Duplicate detection
+             if command in self._last_curl_commands:
+                 reward += PENALTY_REPEATED_CALL
+             self._last_curl_commands.append(command)
+
+             from .tools.curl_exec import curl_exec
+             result = curl_exec(
+                 command=command,
+                 session_state=self._session_state,
+                 episode_store=self._episode_store,
+                 app_base_url=self._current_task.base_url if self._current_task else "",
+             )
+
+             status = result.get("status_code", 0)
+             if status == -1 or "error" in result:
+                 reward += PENALTY_MALFORMED_CURL
+             elif 200 <= status < 300:
+                 reward += REWARD_VALID_API_CALL
+                 # New path bonus
+                 import shlex
+                 from urllib.parse import urlparse
+
+                 from .tools.browser_agent import _normalise_path
+                 try:
+                     for t in shlex.split(command):
+                         if t.startswith("http"):
+                             path = _normalise_path(urlparse(t.strip("'\"")).path)
+                             if path and path not in self._called_paths:
+                                 self._called_paths.add(path)
+                                 reward += REWARD_NEW_PATH
+                             break
+                 except Exception:
+                     pass
+             elif 400 <= status < 500:
+                 reward += PENALTY_4XX
+
+         elif tool == "search_episode_data":
+             query = args.get("query", "")
+             from .tools.search_episode_data import search_episode_data
+             result = search_episode_data(query, self._episode_store)
+
+         elif tool == "done":
+             result_str = args.get("result", "")
+             result = {"status": "done", "result": result_str}
+             done = True
+             # Invoke judge for final reward
+             judge_reward = self._invoke_judge()
+             reward += judge_reward
+
+         else:
+             result = {"error": f"Unknown tool: {tool}. Available: browser_agent, search_endpoints, curl_exec, search_episode_data, done"}
+             reward += PENALTY_MALFORMED_CURL
+
+         return result, reward, done
+
+     def _invoke_judge(self) -> float:
+         """Run the judge on the completed episode and return terminal reward."""
+         if self._episode is None or self._current_task is None:
+             return -1.5
+         try:
+             from .judge import evaluate
+             episode_result = evaluate(self._episode)
+             return episode_result.reward
+         except Exception as e:
+             print(f"[HARvestGym] Judge error: {e}", flush=True)
+             return -1.5
+
+     def _make_obs(self, last_tool_result: Any, reward: float, done: bool) -> HarvestGymObservation:
+         return HarvestGymObservation(
+             task=self._current_task.description if self._current_task else "",
+             app_base_url=self._current_task.base_url if self._current_task else "",
+             last_tool_result=last_tool_result,
+             history=[],
+             session_state=dict(self._session_state),
+             step_count=self._state.step_count,
+             max_steps=MAX_STEPS,
+             done=done,
+             reward=reward,
+         )
+
+     @property
+     def state(self) -> State:
+         return self._state
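For orientation, the per-step shaping applied in `_dispatch_tool` can be condensed into a small standalone function. This is an illustrative sketch of the `curl_exec` reward path only (duplicate-command penalty, 2xx bonus, new-path bonus, 4xx penalty); `curl_step_reward` and its signature are hypothetical names, not part of the environment:

```python
# Illustrative sketch (not part of the environment): how the per-step reward
# constants combine for a single curl_exec call.
REWARD_VALID_API_CALL = 0.2
REWARD_NEW_PATH = 0.1
PENALTY_REPEATED_CALL = -0.15
PENALTY_4XX = -0.05

def curl_step_reward(status: int, path: str, seen_paths: set,
                     prior_commands: list, command: str) -> float:
    reward = 0.0
    if command in prior_commands:    # exact duplicate command
        reward += PENALTY_REPEATED_CALL
    if 200 <= status < 300:          # valid API call
        reward += REWARD_VALID_API_CALL
        if path not in seen_paths:   # first success on this path this episode
            seen_paths.add(path)
            reward += REWARD_NEW_PATH
    elif 400 <= status < 500:        # recoverable client error
        reward += PENALTY_4XX
    return reward

seen: set = set()
assert abs(curl_step_reward(200, "/rest/V1/products", seen, [], "curl A") - 0.3) < 1e-9
assert abs(curl_step_reward(200, "/rest/V1/products", seen, ["curl A"], "curl A") - 0.05) < 1e-9
assert abs(curl_step_reward(404, "/nope", seen, [], "curl B") - (-0.05)) < 1e-9
```

Note that malformed-curl and repeated-browser_agent penalties, and the terminal judge reward, sit outside this sketch.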
server/tools/__init__.py ADDED
File without changes
server/tools/browser_agent.py ADDED
@@ -0,0 +1,418 @@
+ """
+ browser_agent tool: HAR-based API surface discovery.
+
+ At step 1, loads a pre-recorded HAR file for the target application,
+ extracts an OpenAPI-like spec, and builds GEMMA embeddings for search_endpoints().
+ Falls back to all-MiniLM-L6-v2 if google/embeddinggemma-300m is unavailable.
+ """
+
+ from __future__ import annotations
+
+ import json
+ import os
+ import re
+ from pathlib import Path
+ from typing import Any
+ from urllib.parse import urlparse
+
+ import numpy as np
+
+ # ---------------------------------------------------------------------------
+ # HAR path resolution
+ # ---------------------------------------------------------------------------
+
+ HARS_DIR = Path(__file__).parent.parent.parent / "hars"
+ CATALOGS_DIR = Path(__file__).parent.parent.parent / "catalogs"
+
+ HAR_MAP: dict[str, str] = {
+     ":7770": "shopping.har",
+     ":7780": "shopping_admin.har",
+     ":9999": "forum.har",
+     ":3000": "osm.har",
+     ":8888": "wikipedia.har",
+ }
+
+ APP_NAME_MAP: dict[str, str] = {
+     ":7770": "shopping",
+     ":7780": "shopping_admin",
+     ":9999": "forum",
+     ":3000": "osm",
+     ":8888": "wikipedia",
+ }
+
+ # Static asset patterns to skip
+ _STATIC_RE = re.compile(
+     r"\.(js|css|png|jpg|jpeg|gif|ico|svg|woff|woff2|ttf|eot|map|webp|avif|otf)(\?|$)",
+     re.IGNORECASE,
+ )
+ _ANALYTICS_HOSTS = {"google-analytics.com", "doubleclick.net", "googletagmanager.com",
+                     "cdn.jsdelivr.net", "cdnjs.cloudflare.com"}
+
+ # ID normalisation patterns
+ _ID_PATTERNS = [
+     (re.compile(r"/[0-9a-f]{32,}(?=/|$)"), "/{id}"),  # Magento cart IDs
+     (re.compile(r"/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}(?=/|$)"), "/{id}"),  # UUIDs
+     (re.compile(r"/\d+(?=/|$)"), "/{id}"),  # numeric IDs
+ ]
+
+
+ def _is_static_asset(url: str) -> bool:
+     parsed = urlparse(url)
+     if _STATIC_RE.search(parsed.path):
+         return True
+     if parsed.netloc in _ANALYTICS_HOSTS:
+         return True
+     return False
+
+
+ def _normalise_path(path: str) -> str:
+     for pattern, replacement in _ID_PATTERNS:
+         path = pattern.sub(replacement, path)
+     return path
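To make the `_ID_PATTERNS` rules concrete: concrete resource IDs in observed paths are rewritten to a shared `/{id}` template so repeated calls dedupe to one endpoint. A standalone sketch (the rules are duplicated here so the snippet runs on its own; `normalise_path` is an illustrative name):

```python
import re

# Same rules as _ID_PATTERNS above, in application order.
ID_PATTERNS = [
    (re.compile(r"/[0-9a-f]{32,}(?=/|$)"), "/{id}"),   # long hex tokens (e.g. cart IDs)
    (re.compile(r"/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}(?=/|$)"), "/{id}"),  # UUIDs
    (re.compile(r"/\d+(?=/|$)"), "/{id}"),             # numeric IDs
]

def normalise_path(path: str) -> str:
    for pattern, replacement in ID_PATTERNS:
        path = pattern.sub(replacement, path)
    return path

print(normalise_path("/rest/V1/carts/123/items"))            # /rest/V1/carts/{id}/items
print(normalise_path("/rest/V1/guest-carts/" + "ab12" * 8))  # /rest/V1/guest-carts/{id}
```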
+
+
+ def _get_content_type(entry: dict, which: str) -> str:
+     """Extract Content-Type from request or response headers."""
+     headers_key = "request" if which == "request" else "response"
+     obj = entry.get(headers_key, {})
+     for h in obj.get("headers", []):
+         if h.get("name", "").lower() == "content-type":
+             return h.get("value", "").lower()
+     if which == "response":
+         ct = obj.get("content", {}).get("mimeType", "")
+         return ct.lower()
+     return ""
+
+
+ def _extract_body(req: dict) -> Any:
+     post_data = req.get("postData", {})
+     if not post_data:
+         return None
+     text = post_data.get("text", "")
+     if not text:
+         return None
+     try:
+         return json.loads(text)
+     except Exception:
+         return text[:200] if text else None
+
+
+ def _truncate_response_sample(resp: dict) -> Any:
+     content = resp.get("content", {})
+     text = content.get("text", "")
+     if not text:
+         return None
+     try:
+         parsed = json.loads(text)
+         if isinstance(parsed, list) and len(parsed) > 2:
+             return parsed[:2]
+         if isinstance(parsed, dict):
+             # truncate large arrays in the response
+             truncated = {}
+             for k, v in parsed.items():
+                 if isinstance(v, list) and len(v) > 2:
+                     truncated[k] = v[:2]
+                 else:
+                     truncated[k] = v
+             return truncated
+         return parsed
+     except Exception:
+         return text[:300] if text else None
+
+
+ def extract_openapi_spec(har_data: dict, app_base_url: str) -> list[dict]:
+     """Extract an OpenAPI-like spec from HAR data."""
+     entries = har_data.get("log", {}).get("entries", [])
+     seen: set[str] = set()
+     spec_entries = []
+
+     for entry in entries:
+         req = entry.get("request", {})
+         resp = entry.get("response", {})
+         raw_url = req.get("url", "")
+         method = req.get("method", "GET").upper()
+
+         if not raw_url:
+             continue
+         if _is_static_asset(raw_url):
+             continue
+
+         resp_ct = _get_content_type(entry, "response")
+         req_ct = _get_content_type(entry, "request")
+
+         parsed_url = urlparse(raw_url)
+         path = parsed_url.path
+
+         # Skip plain HTML page loads (GET returning text/html), but keep
+         # API paths, admin paths, and JSON responses. Non-GET methods are
+         # never caught by this filter.
+         is_html_get = "text/html" in resp_ct and method == "GET"
+         has_api_path = any(x in path for x in ["/rest/", "/api/", "/ajax/", "/mui/", ".json"])
+         is_admin_path = "/admin/" in path or "/rest/V1/" in path
+         has_json_response = "json" in resp_ct
+
+         if is_html_get and not has_api_path and not is_admin_path and not has_json_response:
+             continue
+
+         path_norm = _normalise_path(path)
+         key = f"{method} {path_norm}"
+         if key in seen:
+             continue
+         seen.add(key)
+
+         has_auth = any(
+             h.get("name", "").lower() in ("authorization", "x-api-key", "cookie")
+             for h in req.get("headers", [])
+         )
+
+         spec_entries.append({
+             "method": method,
+             "path": path_norm,
+             "query_params": parsed_url.query or None,
+             "request_body": _extract_body(req),
+             "status_code": resp.get("status", 0),
+             "response_content_type": resp_ct,
+             "response_body_sample": _truncate_response_sample(resp),
+             "auth_observed": has_auth,
+         })
+
+     return spec_entries
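To see the extraction loop end to end, here is a stripped-down, self-contained sketch that applies only the numeric-ID normalisation and dedup-by-`METHOD path` steps to a toy HAR (`extract_endpoints` and `toy_har` are illustrative names; the real function above also filters static assets, classifies content types, and samples bodies):

```python
import re
from urllib.parse import urlparse

_NUM_ID = re.compile(r"/\d+(?=/|$)")

def extract_endpoints(har: dict) -> list[dict]:
    """Collapse HAR entries into unique (method, templated path) endpoints."""
    seen, out = set(), []
    for entry in har.get("log", {}).get("entries", []):
        req, resp = entry.get("request", {}), entry.get("response", {})
        url = req.get("url", "")
        if not url:
            continue
        method = req.get("method", "GET").upper()
        path = _NUM_ID.sub("/{id}", urlparse(url).path)  # collapse numeric IDs
        key = f"{method} {path}"
        if key in seen:
            continue
        seen.add(key)
        out.append({"method": method, "path": path, "status": resp.get("status", 0)})
    return out

toy_har = {"log": {"entries": [
    {"request": {"method": "GET", "url": "http://shop.test/rest/V1/products/42"},
     "response": {"status": 200}},
    {"request": {"method": "GET", "url": "http://shop.test/rest/V1/products/99"},
     "response": {"status": 200}},
]}}
# Both entries collapse to one templated endpoint:
# [{"method": "GET", "path": "/rest/V1/products/{id}", "status": 200}]
```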
+
+
+ def catalog_to_spec_entries(app_name: str) -> list[dict]:
+     """Load the ground-truth catalog as spec entries when the HAR doesn't yield results."""
+     catalog_path = CATALOGS_DIR / f"{app_name}.json"
+     if not catalog_path.exists():
+         return []
+     try:
+         with open(catalog_path) as f:
+             data = json.load(f)
+         endpoints = data if isinstance(data, list) else data.get("endpoints", [])
+         spec_entries = []
+         for ep in endpoints:
+             # Handle "endpoint": "POST /rest/V1/..." format
+             endpoint_str = ep.get("endpoint", "")
+             if endpoint_str and " " in endpoint_str:
+                 parts = endpoint_str.split(" ", 1)
+                 method = parts[0].upper()
+                 path = parts[1]
+             else:
+                 path = ep.get("path", endpoint_str)
+                 method = ep.get("method", "GET").upper()
+
+             if not path:
+                 continue
+
+             auth = ep.get("auth", ep.get("authentication", "none"))
+             spec_entries.append({
+                 "method": method,
+                 "path": path,
+                 "query_params": None,
+                 "request_body": ep.get("body_params") or ep.get("body"),
+                 "status_code": 200,
+                 "response_content_type": "application/json",
+                 "response_body_sample": ep.get("response_fields") or ep.get("response_sample"),
+                 "auth_observed": auth not in ("none", "None", None, ""),
+             })
+         return spec_entries
+     except Exception as e:
+         print(f"[browser_agent] Failed to load catalog {app_name}: {e}", flush=True)
+         return []
+
+
+ def spec_entry_to_text(entry: dict, app_name: str) -> str:
+     """Convert a spec entry to searchable text for embedding."""
+     parts = [
+         f"app: {app_name}",
+         f"endpoint: {entry['method']} {entry['path']}",
+         f"status: {entry['status_code']}",
+         f"auth: {'required' if entry['auth_observed'] else 'none'}",
+     ]
+     if entry.get("query_params"):
+         parts.append(f"query: {entry['query_params']}")
+     if entry.get("request_body"):
+         body_str = json.dumps(entry["request_body"])[:300] if not isinstance(entry["request_body"], str) else entry["request_body"][:300]
+         parts.append(f"body: {body_str}")
+     if entry.get("response_body_sample") is not None:
+         resp_str = json.dumps(entry["response_body_sample"])[:300] if not isinstance(entry["response_body_sample"], str) else str(entry["response_body_sample"])[:300]
+         parts.append(f"response_sample: {resp_str}")
+     return " | ".join(parts)
+
+
+ # ---------------------------------------------------------------------------
+ # Embedding model (lazy load)
+ # ---------------------------------------------------------------------------
+
+ _embedding_model = None
+ _embedding_model_name = None
+
+
+ def _get_embedding_model():
+     global _embedding_model, _embedding_model_name
+     if _embedding_model is not None:
+         return _embedding_model, _embedding_model_name
+
+     hf_token = os.environ.get("HF_TOKEN")
+
+     # Set a writable cache dir to avoid read-only filesystem errors
+     import tempfile
+     cache_dir = os.environ.get("HF_HOME", os.environ.get("TRANSFORMERS_CACHE",
+                                os.path.join(tempfile.gettempdir(), "hf_cache")))
+     os.makedirs(cache_dir, exist_ok=True)
+     os.environ.setdefault("HF_HOME", cache_dir)
+     os.environ.setdefault("TRANSFORMERS_CACHE", cache_dir)
+     os.environ.setdefault("SENTENCE_TRANSFORMERS_HOME", cache_dir)
+
+     # Skip embedding if HARVGYM_NO_EMBED is set (for testing/offline use)
+     if os.environ.get("HARVGYM_NO_EMBED"):
+         raise RuntimeError("Embeddings disabled via HARVGYM_NO_EMBED")
+
+     # Try GEMMA first, fall back to MiniLM
+     candidates = [
+         ("google/embeddinggemma-300m", hf_token),
+         ("all-MiniLM-L6-v2", None),
+         ("sentence-transformers/all-MiniLM-L6-v2", None),
+     ]
+
+     for model_name, token in candidates:
+         try:
+             from sentence_transformers import SentenceTransformer
+             kwargs: dict = {"cache_folder": cache_dir}
+             if token:
+                 kwargs["token"] = token
+             model = SentenceTransformer(model_name, **kwargs)
+             _embedding_model = model
+             _embedding_model_name = model_name
+             print(f"[browser_agent] Loaded embedding model: {model_name}", flush=True)
+             return _embedding_model, _embedding_model_name
+         except Exception as e:
+             print(f"[browser_agent] Could not load {model_name}: {type(e).__name__}: {str(e)[:100]}", flush=True)
+
+     raise RuntimeError("No embedding model available. Install sentence-transformers.")
+
+
+ def build_endpoint_embeddings(spec_entries: list[dict], app_name: str):
+     """Build embeddings over spec entries. Returns (embeddings_array, text_chunks)."""
+     model, model_name = _get_embedding_model()
+     chunks = [spec_entry_to_text(e, app_name) for e in spec_entries]
+     if not chunks:
+         return np.array([]), []
+
+     # Use encode_document if available (GEMMA), else plain encode
+     if hasattr(model, "encode_document"):
+         embeddings = model.encode_document(chunks, batch_size=32, show_progress_bar=False)
+     else:
+         embeddings = model.encode(chunks, batch_size=32, show_progress_bar=False)
+
+     if not isinstance(embeddings, np.ndarray):
+         embeddings = np.array(embeddings)
+
+     # Normalise rows for cosine similarity
+     norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
+     norms = np.where(norms == 0, 1, norms)
+     embeddings = embeddings / norms
+
+     return embeddings, chunks
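Because `build_endpoint_embeddings` L2-normalises its rows, a downstream query (as in `search_endpoints`) reduces to a dot product followed by a sort. A minimal sketch, with tiny hypothetical vectors standing in for real model output (`top_k` is an illustrative name, not the tool's API):

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_embeddings: np.ndarray, k: int = 3) -> list[int]:
    """Return indices of the k most cosine-similar rows, best first."""
    q = query_vec / (np.linalg.norm(query_vec) or 1.0)  # normalise the query too
    scores = doc_embeddings @ q           # cosine similarity per document row
    return list(np.argsort(-scores)[:k])  # best-first indices

docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])  # rows already unit-norm
assert top_k(np.array([0.0, 2.0]), docs, k=2) == [1, 2]
```

The `or 1.0` guard avoids dividing by zero for an all-zero query vector, mirroring the zero-norm handling in the normalisation above.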
+
319
+
320
+ # ---------------------------------------------------------------------------
321
+ # Public API
322
+ # ---------------------------------------------------------------------------
323
+
324
+ def run_browser_agent(task: str, url: str, episode_store=None) -> dict:
325
+ """
326
+     Load HAR for the app inferred from URL, extract spec, build embeddings.
+     Returns summary endpoint list.
+
+     episode_store: mutable dict where we store embeddings/spec for search_endpoints().
+     """
+     # Detect app from URL
+     app_name = "unknown"
+     har_filename = None
+     for port_suffix, fname in HAR_MAP.items():
+         if port_suffix in url:
+             har_filename = fname
+             app_name = APP_NAME_MAP[port_suffix]
+             break
+
+     if har_filename is None:
+         # Try to guess from URL path
+         if "shopping" in url.lower() or "7770" in url or "7780" in url:
+             har_filename = "shopping.har"
+             app_name = "shopping"
+         elif "forum" in url.lower() or "9999" in url:
+             har_filename = "forum.har"
+             app_name = "forum"
+         elif "wiki" in url.lower() or "8888" in url:
+             har_filename = "wikipedia.har"
+             app_name = "wikipedia"
+         else:
+             har_filename = "shopping.har"
+             app_name = "shopping"
+
+     har_path = HARS_DIR / har_filename
+     if not har_path.exists():
+         return {
+             "app": app_name,
+             "endpoints": [],
+             "total_endpoints": 0,
+             "note": f"HAR file not found: {har_path}. No endpoints available.",
+             "error": f"Missing HAR: {har_filename}",
+         }
+
+     with open(har_path) as f:
+         har_data = json.load(f)
+
+     spec_entries = extract_openapi_spec(har_data, url)
+
+     # Augment with ground truth catalog if HAR extraction is sparse
+     catalog_entries = catalog_to_spec_entries(app_name)
+     if len(spec_entries) < 5 and catalog_entries:
+         print(f"[browser_agent] HAR yielded {len(spec_entries)} endpoints, augmenting from catalog ({len(catalog_entries)} entries)", flush=True)
+         # Merge: catalog takes priority for proper paths
+         har_paths = {e["path"] for e in spec_entries}
+         for ce in catalog_entries:
+             if ce["path"] not in har_paths:
+                 spec_entries.append(ce)
+     elif catalog_entries:
+         # Augment with any catalog endpoints not found in the HAR
+         har_paths = {e["path"] for e in spec_entries}
+         for ce in catalog_entries:
+             if ce["path"] not in har_paths:
+                 spec_entries.append(ce)
+
+     # Build embeddings and store in episode_store for search_endpoints
+     if spec_entries and episode_store is not None:
+         try:
+             embeddings, chunks = build_endpoint_embeddings(spec_entries, app_name)
+             episode_store["endpoint_embeddings"] = embeddings
+             episode_store["endpoint_chunks"] = chunks
+             episode_store["spec_entries"] = spec_entries
+             episode_store["app_name"] = app_name
+         except Exception as e:
+             print(f"[browser_agent] Embedding build failed: {e}. Storing spec without embeddings.", flush=True)
+             # Store chunks as plain text even without embeddings for keyword fallback
+             chunks = [spec_entry_to_text(e, app_name) for e in spec_entries]
+             episode_store["endpoint_chunks"] = chunks
+             episode_store["endpoint_embeddings"] = None
+             episode_store["spec_entries"] = spec_entries
+             episode_store["app_name"] = app_name
+     elif episode_store is not None:
+         episode_store["spec_entries"] = []
+         episode_store["app_name"] = app_name
+
+     # Return summary only (no schemas)
+     summary_endpoints = [{"method": e["method"], "path": e["path"]} for e in spec_entries]
+
+     return {
+         "app": app_name,
+         "endpoints": summary_endpoints,
+         "total_endpoints": len(summary_endpoints),
+         "note": (
+             "These endpoints were observed for this application. "
+             "Use search_endpoints() with a natural language query to get the full schema, "
+             "parameters, and auth details for any endpoint."
+         ),
+     }
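The URL-to-app fallback above can be sketched as a standalone helper. This is a minimal sketch for illustration only: `guess_app` is a hypothetical name, and the real tool consults `HAR_MAP` before falling back to these keyword/port checks.

```python
def guess_app(url: str) -> str:
    # Mirrors the fallback branch: match a known port or keyword,
    # defaulting to "shopping" when nothing matches.
    u = url.lower()
    if "shopping" in u or "7770" in url or "7780" in url:
        return "shopping"
    if "forum" in u or "9999" in url:
        return "forum"
    if "wiki" in u or "8888" in url:
        return "wikipedia"
    return "shopping"

print(guess_app("http://localhost:9999/f/books"))  # forum
```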
server/tools/curl_exec.py ADDED
@@ -0,0 +1,434 @@
+ """
+ curl_exec tool — execute HTTP calls via subprocess, index responses, return truncated result.
+
+ Parses a curl command string, executes it against the live EC2 server, auto-injects session
+ cookies, indexes the full response into the episode BM25 store, and returns a smart-truncated
+ observation.
+ """
+
+ from __future__ import annotations
+
+ import json
+ import re
+ import shlex
+ import subprocess
+ from typing import Any
+
+ # ---------------------------------------------------------------------------
+ # Truncation constants
+ # ---------------------------------------------------------------------------
+
+ NONJSON_MAX_CHARS = 3000   # HTML / plain-text truncation (raised for CSRF token visibility)
+ ARRAY_PREVIEW_ITEMS = 2    # How many items to show in large arrays
+ ARRAY_LARGE_THRESHOLD = 3  # Arrays >= this size are truncated
+
+
+ # ---------------------------------------------------------------------------
+ # Curl command parser
+ # ---------------------------------------------------------------------------
+
+ def parse_curl_command(command: str) -> dict:
+     """
+     Parse a curl command string into components.
+
+     Returns a dict with keys: method, url, headers, body, data_type.
+     """
+     # Normalize: remove newline continuations
+     command = re.sub(r"\\\s*\n\s*", " ", command)
+
+     try:
+         tokens = shlex.split(command)
+     except ValueError:
+         # Fall back to a simple split if shlex fails
+         tokens = command.split()
+
+     if not tokens or tokens[0] != "curl":
+         raise ValueError(f"Not a curl command: {command[:100]}")
+
+     result: dict = {
+         "method": "GET",
+         "url": None,
+         "headers": {},
+         "body": None,
+         "data_type": None,  # "json" | "form" | None
+     }
+
+     i = 1
+     while i < len(tokens):
+         tok = tokens[i]
+
+         if tok in ("-X", "--request") and i + 1 < len(tokens):
+             result["method"] = tokens[i + 1].upper()
+             i += 2
+
+         elif tok in ("-H", "--header") and i + 1 < len(tokens):
+             header = tokens[i + 1]
+             if ":" in header:
+                 name, _, value = header.partition(":")
+                 result["headers"][name.strip().lower()] = value.strip()
+             i += 2
+
+         elif tok in ("-d", "--data", "--data-raw", "--data-binary") and i + 1 < len(tokens):
+             result["body"] = tokens[i + 1]
+             if result["method"] == "GET":
+                 result["method"] = "POST"
+             i += 2
+
+         elif tok == "--data-urlencode" and i + 1 < len(tokens):
+             # Append to the existing body
+             existing = result.get("body") or ""
+             if existing:
+                 result["body"] = existing + "&" + tokens[i + 1]
+             else:
+                 result["body"] = tokens[i + 1]
+             if result["method"] == "GET":
+                 result["method"] = "POST"
+             i += 2
+
+         elif tok in ("-F", "--form") and i + 1 < len(tokens):
+             existing = result.get("body") or ""
+             if existing:
+                 result["body"] = existing + "&" + tokens[i + 1]
+             else:
+                 result["body"] = tokens[i + 1]
+             if result["method"] == "GET":
+                 result["method"] = "POST"
+             i += 2
+
+         elif tok in ("-u", "--user") and i + 1 < len(tokens):
+             i += 2  # skip basic auth for now
+
+         elif tok in ("-L", "--location", "-s", "--silent", "-v", "--verbose",
+                      "-k", "--insecure", "--compressed", "-g", "--globoff"):
+             i += 1
+
+         elif tok in ("-o", "--output", "--max-time", "--connect-timeout",
+                      "--retry", "-A", "--user-agent", "-e", "--referer"):
+             i += 2  # skip flag + value
+
+         elif not tok.startswith("-") and result["url"] is None:
+             result["url"] = tok.strip("'\"")
+             i += 1
+
+         elif tok.startswith("http"):
+             result["url"] = tok.strip("'\"")
+             i += 1
+
+         else:
+             i += 1
+
+     # Infer data_type from the Content-Type header
+     ct = result["headers"].get("content-type", "")
+     if "application/json" in ct:
+         result["data_type"] = "json"
+     elif "application/x-www-form-urlencoded" in ct or "multipart/form-data" in ct:
+         result["data_type"] = "form"
+     elif result["body"]:
+         # Guess from the body
+         if result["body"].strip().startswith("{") or result["body"].strip().startswith("["):
+             result["data_type"] = "json"
+         else:
+             result["data_type"] = "form"
+
+     return result
+
+
+ # ---------------------------------------------------------------------------
+ # Smart truncation
+ # ---------------------------------------------------------------------------
+
+ def smart_truncate(body_text: str, content_type: str = "") -> Any:
+     """
+     Apply truncation rules to a response body string.
+
+     Rules (first match wins):
+       1. Non-JSON → truncate to NONJSON_MAX_CHARS
+       2. JSON primitive (str/int/bool/null) → never truncate
+       3. Error (detected by content) → never truncate
+       4. JSON object/array with no large arrays → return as-is
+       5. JSON with a large array → keep first ARRAY_PREVIEW_ITEMS, add a _list_truncated note
+     """
+     if not body_text:
+         return ""
+
+     # Rule 1: non-JSON
+     if "application/json" not in content_type and not _looks_like_json(body_text):
+         return body_text[:NONJSON_MAX_CHARS]
+
+     # Try to parse as JSON
+     try:
+         parsed = json.loads(body_text)
+     except (json.JSONDecodeError, ValueError):
+         return body_text[:NONJSON_MAX_CHARS]
+
+     # Rule 2: JSON primitive
+     if not isinstance(parsed, (dict, list)):
+         return parsed
+
+     # Rule 3: detect error (4xx/5xx already handled by caller; this checks body content)
+     if isinstance(parsed, dict) and ("message" in parsed or "error" in parsed):
+         return parsed  # never truncate errors
+
+     # Rules 4 and 5
+     return _truncate_json(parsed)
+
+
+ def _looks_like_json(text: str) -> bool:
+     stripped = text.strip()
+     return stripped.startswith("{") or stripped.startswith("[") or stripped.startswith('"')
+
+
+ def _truncate_json(obj: Any) -> Any:
+     if isinstance(obj, list):
+         if len(obj) >= ARRAY_LARGE_THRESHOLD:
+             return {
+                 "items": obj[:ARRAY_PREVIEW_ITEMS],
+                 "_list_truncated": {
+                     "shown": ARRAY_PREVIEW_ITEMS,
+                     "total": len(obj),
+                     "note": (
+                         f"Showing {ARRAY_PREVIEW_ITEMS} of {len(obj)} items. "
+                         "Use search_episode_data() to find a specific item from this response."
+                     ),
+                 },
+             }
+         return obj
+
+     if isinstance(obj, dict):
+         result = {}
+         for k, v in obj.items():
+             if isinstance(v, list) and len(v) >= ARRAY_LARGE_THRESHOLD:
+                 result[k] = v[:ARRAY_PREVIEW_ITEMS]
+                 result["_list_truncated"] = {
+                     "field": k,
+                     "shown": ARRAY_PREVIEW_ITEMS,
+                     "total": len(v),
+                     "note": (
+                         f"Showing {ARRAY_PREVIEW_ITEMS} of {len(v)} items. "
+                         "Use search_episode_data() to find a specific item from this response."
+                     ),
+                 }
+             else:
+                 result[k] = v
+         return result
+
+     return obj
+
+
+ # ---------------------------------------------------------------------------
+ # Cookie injection
+ # ---------------------------------------------------------------------------
+
+ def _inject_cookies(headers: dict, session_state: dict) -> dict:
+     """Inject cookies from session_state into the request headers."""
+     headers = dict(headers)  # copy
+
+     # Collect cookie values
+     cookie_parts = []
+     for key, value in session_state.items():
+         if key.lower() in ("phpsessid", "sessid", "session", "cookie",
+                            "mage-cache-sessid", "private_content_version",
+                            "form_key"):
+             cookie_parts.append(f"{key}={value}")
+
+     # Merge with any raw Cookie header already present
+     existing = headers.get("cookie", "")
+     if cookie_parts:
+         headers["cookie"] = existing + ("; " if existing else "") + "; ".join(cookie_parts)
+
+     return headers
+
+
+ # ---------------------------------------------------------------------------
+ # Session state extraction
+ # ---------------------------------------------------------------------------
+
+ def _extract_set_cookies(response_headers: dict, session_state: dict) -> None:
+     """Extract Set-Cookie headers into session_state."""
+     for name, value in response_headers.items():
+         if name.lower() == "set-cookie":
+             # Parse "NAME=VALUE; Path=...; ..."
+             cookies = value.split(";")
+             if cookies:
+                 kv = cookies[0].strip()
+                 if "=" in kv:
+                     k, _, v = kv.partition("=")
+                     session_state[k.strip()] = v.strip()
+
+
+ def _extract_tokens_from_body(body: Any, session_state: dict) -> None:
+     """Extract auth tokens from JSON response bodies into session_state."""
+     if isinstance(body, str) and 10 < len(body) < 500:
+         # Likely a token (Magento returns bare quoted strings for auth tokens)
+         stripped = body.strip('"').strip()
+         if re.match(r"^[A-Za-z0-9_\-\.]{20,}$", stripped):
+             session_state["_last_token"] = stripped
+
+     if isinstance(body, dict):
+         for key in ("access_token", "token", "cart_id", "form_key"):
+             if key in body and body[key]:
+                 session_state[key] = body[key]
+
+
+ # ---------------------------------------------------------------------------
+ # Public API
+ # ---------------------------------------------------------------------------
+
+ def curl_exec(command: str, session_state: dict, episode_store: dict,
+               app_base_url: str = "") -> dict:
+     """
+     Parse and execute a curl command against the live app.
+
+     Args:
+         command: Full curl command string
+         session_state: Current session state (cookies/tokens), mutated in place
+         episode_store: Per-episode store for BM25 indexing, mutated in place
+         app_base_url: Base URL to validate requests against
+
+     Returns:
+         {status_code, headers, body} with a smart-truncated body
+     """
+     try:
+         parsed = parse_curl_command(command)
+     except Exception as e:
+         return {"status_code": -1, "headers": {}, "body": f"curl parse error: {e}", "error": str(e)}
+
+     if not parsed["url"]:
+         return {"status_code": -1, "headers": {}, "body": "No URL in curl command", "error": "missing url"}
+
+     # Inject session cookies
+     parsed["headers"] = _inject_cookies(parsed["headers"], session_state)
+
+     # Build the actual curl args
+     args = ["curl", "-s", "-i", "-L", "--max-time", "15"]
+     args += ["-X", parsed["method"]]
+     args += [parsed["url"]]
+
+     for h_name, h_val in parsed["headers"].items():
+         args += ["-H", f"{h_name}: {h_val}"]
+
+     if parsed["body"]:
+         args += ["-d", parsed["body"]]
+
+     try:
+         result = subprocess.run(
+             args,
+             capture_output=True,
+             text=True,
+             timeout=20,
+         )
+         raw_output = result.stdout
+     except subprocess.TimeoutExpired:
+         return {"status_code": -1, "headers": {}, "body": "Request timed out (20s)", "error": "timeout"}
+     except Exception as e:
+         return {"status_code": -1, "headers": {}, "body": f"subprocess error: {e}", "error": str(e)}
+
+     # Parse the HTTP response: headers and body split at the first blank line
+     status_code = 0
+     resp_headers: dict[str, str] = {}
+     body_text = ""
+
+     if raw_output:
+         # Find the status line (handle redirects: multiple HTTP/ headers)
+         lines = raw_output.split("\r\n") if "\r\n" in raw_output else raw_output.split("\n")
+         header_lines = []
+         body_lines = []
+         in_body = False
+         last_status = 0
+
+         for line in lines:
+             if in_body:
+                 body_lines.append(line)
+             elif line.startswith("HTTP/"):
+                 # Could be a redirect status; keep the last one
+                 parts = line.split(" ", 2)
+                 if len(parts) >= 2:
+                     try:
+                         last_status = int(parts[1])
+                     except ValueError:
+                         pass
+                 header_lines = []  # reset headers for this response
+             elif line.strip() == "":
+                 if last_status:  # we've seen at least one status line
+                     in_body = True
+             else:
+                 header_lines.append(line)
+
+         status_code = last_status
+         body_text = "\n".join(body_lines).strip()
+
+         for h_line in header_lines:
+             if ":" in h_line:
+                 h_name, _, h_val = h_line.partition(":")
+                 resp_headers[h_name.strip().lower()] = h_val.strip()
+
+     # Extract cookies / tokens into session_state
+     _extract_set_cookies(resp_headers, session_state)
+
+     # Try to parse the body as JSON
+     resp_ct = resp_headers.get("content-type", "")
+     parsed_body: Any = body_text
+     try:
+         parsed_body = json.loads(body_text) if body_text else ""
+     except (json.JSONDecodeError, ValueError):
+         parsed_body = body_text
+
+     # Extract tokens from the body
+     _extract_tokens_from_body(parsed_body, session_state)
+
+     # Index into the episode BM25 store BEFORE truncation
+     _index_into_episode_store(
+         episode_store=episode_store,
+         request_body=parsed["body"],
+         response_body=parsed_body,
+         url=parsed["url"],
+         method=parsed["method"],
+         status_code=status_code,
+     )
+
+     # Apply smart truncation
+     if status_code >= 400:
+         # Never truncate errors
+         truncated_body = parsed_body
+     else:
+         body_for_truncation = body_text if isinstance(parsed_body, str) else json.dumps(parsed_body)
+         truncated_body = smart_truncate(body_for_truncation, resp_ct)
+
+     return {
+         "status_code": status_code,
+         "headers": resp_headers,
+         "body": truncated_body,
+     }
+
+
+ # ---------------------------------------------------------------------------
+ # Episode store indexing
+ # ---------------------------------------------------------------------------
+
+ def _index_into_episode_store(episode_store: dict, request_body: Any,
+                               response_body: Any, url: str, method: str,
+                               status_code: int) -> None:
+     """Index a request/response pair into the episode BM25 store for search_episode_data()."""
+     if "bm25_corpus" not in episode_store:
+         episode_store["bm25_corpus"] = []
+         episode_store["bm25_metadata"] = []
+
+     def _to_text(obj: Any) -> str:
+         if obj is None:
+             return ""
+         if isinstance(obj, str):
+             return obj
+         return json.dumps(obj)
+
+     entry_text = (f"url: {url} | method: {method} | status: {status_code} | "
+                   f"request: {_to_text(request_body)} | response: {_to_text(response_body)}")
+
+     episode_store["bm25_corpus"].append(entry_text)
+     episode_store["bm25_metadata"].append({
+         "url": url,
+         "method": method,
+         "status_code": status_code,
+         "response_body": response_body,
+     })
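The large-array rule in `_truncate_json` is the core of the observation budget. A minimal standalone sketch of that behavior (using the same thresholds as the constants above; `truncate_list` is a hypothetical helper, not part of the module):

```python
ARRAY_PREVIEW_ITEMS = 2
ARRAY_LARGE_THRESHOLD = 3

def truncate_list(items: list) -> object:
    # Lists of ARRAY_LARGE_THRESHOLD or more items are cut to a short preview
    # plus a _list_truncated marker, as curl_exec does for response bodies.
    if len(items) >= ARRAY_LARGE_THRESHOLD:
        return {
            "items": items[:ARRAY_PREVIEW_ITEMS],
            "_list_truncated": {"shown": ARRAY_PREVIEW_ITEMS, "total": len(items)},
        }
    return items

out = truncate_list([{"sku": f"MH0{i}"} for i in range(1, 5)])
```

Small lists pass through unchanged, so the agent only pays the truncation marker cost when a response is actually large.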
server/tools/search_endpoints.py ADDED
@@ -0,0 +1,93 @@
+ """
+ search_endpoints tool — semantic search over endpoint embeddings from browser_agent.
+
+ Returns the top-3 endpoint schemas (full text) for a natural language query.
+ """
+
+ from __future__ import annotations
+
+ import numpy as np
+
+
+ def search_endpoints(query: str, episode_store: dict) -> list[str]:
+     """
+     Semantic search over the endpoint embeddings built by browser_agent.
+
+     Args:
+         query: Natural language query (e.g. "create guest cart", "add item to cart")
+         episode_store: Mutable dict containing embeddings from browser_agent.
+
+     Returns:
+         List of up to 3 endpoint schema text strings.
+     """
+     chunks: list[str] = episode_store.get("endpoint_chunks", [])
+     embeddings = episode_store.get("endpoint_embeddings")
+
+     if not chunks:
+         return ["No endpoint index available. Call browser_agent(task, url) first."]
+
+     # If there are no embeddings, use the keyword fallback directly
+     if embeddings is None or (hasattr(embeddings, "__len__") and len(embeddings) == 0):
+         query_terms = query.lower().split()
+         matches = [chunk for chunk in chunks
+                    if any(term in chunk.lower() for term in query_terms)]
+         return matches[:3] if matches else chunks[:3]
+
+     try:
+         model, _ = _get_embedding_model()
+
+         if hasattr(model, "encode_query"):
+             q_emb = model.encode_query([query], show_progress_bar=False)
+         else:
+             q_emb = model.encode([query], show_progress_bar=False)
+
+         if not isinstance(q_emb, np.ndarray):
+             q_emb = np.array(q_emb)
+
+         # Normalize the query embedding
+         norm = np.linalg.norm(q_emb, axis=1, keepdims=True)
+         if norm[0, 0] > 0:
+             q_emb = q_emb / norm
+
+         # Cosine similarity (stored embeddings are already normalized)
+         scores = (embeddings @ q_emb.T).flatten()
+         top_k = min(3, len(scores))
+         top_indices = np.argsort(scores)[::-1][:top_k]
+
+         return [chunks[int(idx)] for idx in top_indices]
+
+     except Exception as e:
+         # Fallback: keyword match
+         print(f"[search_endpoints] Embedding search failed: {e}. Using keyword fallback.", flush=True)
+         query_terms = query.lower().split()
+         matches = [chunk for chunk in chunks
+                    if any(term in chunk.lower() for term in query_terms)]
+         return matches[:3] if matches else chunks[:3]
+
+
+ # Lazy-load the model (shared with browser_agent)
+ _embedding_model = None
+ _embedding_model_name = None
+
+
+ def _get_embedding_model():
+     global _embedding_model, _embedding_model_name
+     if _embedding_model is not None:
+         return _embedding_model, _embedding_model_name
+
+     # Re-use browser_agent's loader
+     from .browser_agent import _get_embedding_model as _ba_get
+     model, name = _ba_get()
+     _embedding_model = model
+     _embedding_model_name = name
+     return model, name
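The scoring step relies on the stored embeddings being row-normalized, so a plain matrix product is cosine similarity. A toy sketch of that ranking with made-up 2-D vectors (not real model output):

```python
import numpy as np

# Three unit-norm "endpoint" embeddings and one raw query vector.
emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.7071, 0.7071]])
q = np.array([[0.9, 0.1]])
q = q / np.linalg.norm(q, axis=1, keepdims=True)  # normalize the query too

scores = (emb @ q.T).flatten()          # cosine similarity per endpoint
top_indices = np.argsort(scores)[::-1]  # best match first
```

With both sides normalized, `argsort` of the dot products gives the same ordering a full cosine computation would.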
server/tools/search_episode_data.py ADDED
@@ -0,0 +1,87 @@
+ """
+ search_episode_data tool — BM25 + keyword search over accumulated episode response data.
+
+ Searches all request/response bodies from prior curl_exec calls in this episode.
+ """
+
+ from __future__ import annotations
+
+ import re
+
+
+ def search_episode_data(query: str, episode_store: dict) -> list[dict]:
+     """
+     Hybrid BM25 + keyword search over the episode's accumulated response bodies.
+
+     Args:
+         query: Keyword or natural language query (e.g. "Radiant Tee sku", "_csrf_token")
+         episode_store: Per-episode store containing bm25_corpus and bm25_metadata
+
+     Returns:
+         Top-5 matching JSON objects from the episode history, annotated with step info
+     """
+     corpus: list[str] = episode_store.get("bm25_corpus", [])
+     metadata: list[dict] = episode_store.get("bm25_metadata", [])
+
+     if not corpus:
+         return [{"note": "No episode data yet. Make API calls with curl_exec() first."}]
+
+     # Try BM25 ranking
+     try:
+         from rank_bm25 import BM25Okapi
+
+         tokenized_corpus = [_tokenize(doc) for doc in corpus]
+         tokenized_query = _tokenize(query)
+         bm25 = BM25Okapi(tokenized_corpus)
+         scores = bm25.get_scores(tokenized_query)
+
+         # Take the top 5 by BM25 score
+         top_k = min(5, len(scores))
+         top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
+
+         results = []
+         for idx in top_indices:
+             if scores[idx] > 0:
+                 meta = metadata[idx]
+                 results.append({
+                     "step": idx + 1,
+                     "url": meta.get("url", ""),
+                     "method": meta.get("method", ""),
+                     "status_code": meta.get("status_code", 0),
+                     "data": meta.get("response_body"),
+                 })
+
+         if results:
+             return results
+
+     except ImportError:
+         pass
+     except Exception as e:
+         print(f"[search_episode_data] BM25 error: {e}", flush=True)
+
+     # Fallback: keyword match
+     query_terms = query.lower().split()
+     results = []
+     for idx, doc in enumerate(corpus):
+         if any(term in doc.lower() for term in query_terms):
+             meta = metadata[idx]
+             results.append({
+                 "step": idx + 1,
+                 "url": meta.get("url", ""),
+                 "method": meta.get("method", ""),
+                 "status_code": meta.get("status_code", 0),
+                 "data": meta.get("response_body"),
+             })
+     return results[:5] if results else [{"note": f"No results found for: {query}"}]
+
+
+ def _tokenize(text: str) -> list[str]:
+     """Simple lowercase regex tokenizer for BM25."""
+     tokens = re.findall(r"[a-z0-9_\-\.]+", text.lower())
+     return tokens if tokens else [""]
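The tokenizer regex keeps runs of alphanumerics plus `_`, `-`, and `.`, which preserves SKUs, snake_case field names, and dotted paths as single tokens. A standalone copy of that behavior (`tokenize` is a hypothetical name for this sketch):

```python
import re

def tokenize(text: str) -> list[str]:
    # Lowercase, then keep runs matching [a-z0-9_\-\.] — the same
    # pattern _tokenize uses; punctuation like ':' splits tokens.
    tokens = re.findall(r"[a-z0-9_\-\.]+", text.lower())
    return tokens if tokens else [""]

print(tokenize("Radiant Tee sku: MH01"))  # ['radiant', 'tee', 'sku', 'mh01']
```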
tests/mock_data/mock_catalog.json ADDED
@@ -0,0 +1,88 @@
+ {
+   "_meta": {
+     "generated": "2026-04-08",
+     "source": "mock catalog for testing"
+   },
+   "endpoints": [
+     {
+       "api_type": "rest",
+       "endpoint": "GET /rest/V1/categories",
+       "auth": "none",
+       "query_params": {},
+       "response_key_fields": ["id", "name", "children_data"],
+       "notes": "Returns the full category tree. No auth required."
+     },
+     {
+       "api_type": "rest",
+       "endpoint": "GET /rest/V1/products",
+       "auth": "none",
+       "query_params": {
+         "searchCriteria[filter_groups][0][filters][0][field]": {"type": "string", "source": "TASK_SPEC", "notes": "field name to filter on (e.g. 'name', 'sku')"},
+         "searchCriteria[filter_groups][0][filters][0][value]": {"type": "string", "source": "TASK_SPEC", "notes": "value to filter for"},
+         "searchCriteria[filter_groups][0][filters][0][condition_type]": {"type": "string", "source": "STATIC", "value": "like", "notes": "comparison operator"}
+       },
+       "response_key_fields": ["items[].sku", "items[].name", "items[].price", "total_count"],
+       "notes": "Search/list products. Use searchCriteria filters for targeted lookups."
+     },
+     {
+       "api_type": "rest",
+       "endpoint": "POST /rest/V1/guest-carts",
+       "auth": "none",
+       "body_params": {},
+       "response_key_fields": ["cartId (plain string in body)"],
+       "notes": "Creates a new guest cart. Returns the cartId as a plain quoted string."
+     },
+     {
+       "api_type": "rest",
+       "endpoint": "POST /rest/V1/guest-carts/{cartId}/items",
+       "auth": "none",
+       "path_params": {
+         "cartId": {"type": "string", "source": "PREV_CALL", "from_endpoint": "POST /rest/V1/guest-carts", "from_field": "response body"}
+       },
+       "body_params": {
+         "cartItem.sku": {"type": "string", "source": "PREV_CALL", "from_endpoint": "GET /rest/V1/products", "from_field": "items[].sku"},
+         "cartItem.qty": {"type": "number", "source": "TASK_SPEC", "notes": "quantity to add"},
+         "cartItem.quote_id": {"type": "string", "source": "DERIVED", "notes": "same value as cartId"}
+       },
+       "response_key_fields": ["item_id", "sku", "qty"],
+       "notes": "Add an item to a guest cart. cartId must exist first."
+     },
+     {
+       "api_type": "rest",
+       "endpoint": "POST /rest/V1/integration/customer/token",
+       "auth": "none",
+       "body_params": {
+         "username": {"type": "string", "source": "TASK_SPEC"},
+         "password": {"type": "string", "source": "TASK_SPEC"}
+       },
+       "response_key_fields": ["bearer token (plain string in body)"],
+       "notes": "Authenticate a customer. Returns a bearer token string."
+     },
+     {
+       "api_type": "rest",
+       "endpoint": "POST /rest/V1/guest-carts/{cartId}/estimate-shipping-methods",
+       "auth": "none",
+       "path_params": {
+         "cartId": {"type": "string", "source": "PREV_CALL", "from_endpoint": "POST /rest/V1/guest-carts", "from_field": "response body"}
+       },
+       "body_params": {
+         "address.city": {"type": "string", "source": "TASK_SPEC"},
+         "address.region_id": {"type": "number", "source": "TASK_SPEC"},
+         "address.postcode": {"type": "string", "source": "TASK_SPEC"},
+         "address.country_id": {"type": "string", "source": "TASK_SPEC"}
+       },
+       "response_key_fields": ["[].carrier_code", "[].method_code", "[].amount"],
+       "notes": "Get available shipping methods for a guest cart."
+     },
+     {
+       "api_type": "rest",
+       "endpoint": "GET /rest/V1/guest-carts/{cartId}/totals",
+       "auth": "none",
+       "path_params": {
+         "cartId": {"type": "string", "source": "PREV_CALL", "from_endpoint": "POST /rest/V1/guest-carts", "from_field": "response body"}
+       },
+       "response_key_fields": ["grand_total", "subtotal", "items[].item_id", "items[].price"],
+       "notes": "Get cart totals and line items."
+     }
+   ]
+ }
tests/mock_data/mock_har.json ADDED
@@ -0,0 +1,170 @@
+ {
+   "log": {
+     "version": "1.2",
+     "creator": {"name": "MockHAR", "version": "1.0"},
+     "pages": [],
+     "entries": [
+       {
+         "request": {
+           "method": "GET",
+           "url": "http://localhost:7770/rest/V1/categories",
+           "headers": [
+             {"name": "Accept", "value": "application/json"},
+             {"name": "Content-Type", "value": "application/json"}
+           ],
+           "queryString": [],
+           "postData": null
+         },
+         "response": {
+           "status": 200,
+           "headers": [{"name": "Content-Type", "value": "application/json"}],
+           "content": {"mimeType": "application/json", "text": "{\"id\":1,\"name\":\"Root\",\"children_data\":[{\"id\":2,\"name\":\"Default Category\"},{\"id\":3,\"name\":\"Beauty & Personal Care\"}]}"}
+         }
+       },
+       {
+         "request": {
+           "method": "GET",
+           "url": "http://localhost:7770/rest/V1/products?searchCriteria[filter_groups][0][filters][0][field]=name&searchCriteria[filter_groups][0][filters][0][value]=Radiant+Tee",
+           "headers": [
+             {"name": "Accept", "value": "application/json"},
+             {"name": "Content-Type", "value": "application/json"}
+           ],
+           "queryString": [{"name": "searchCriteria[filter_groups][0][filters][0][field]", "value": "name"}],
+           "postData": null
+         },
+         "response": {
+           "status": 200,
+           "headers": [{"name": "Content-Type", "value": "application/json"}],
+           "content": {"mimeType": "application/json", "text": "{\"items\":[{\"sku\":\"MH01\",\"name\":\"Radiant Tee\",\"price\":22.0,\"type_id\":\"simple\"},{\"sku\":\"MH02\",\"name\":\"Breathe-Easy Tank\",\"price\":34.0,\"type_id\":\"simple\"},{\"sku\":\"MH03\",\"name\":\"Stellar Solar Jacket\",\"price\":75.0,\"type_id\":\"configurable\"},{\"sku\":\"MH04\",\"name\":\"Argus All-Weather Tank\",\"price\":22.0,\"type_id\":\"simple\"}],\"total_count\":4}"}
+         }
+       },
+       {
+         "request": {
+           "method": "POST",
+           "url": "http://localhost:7770/rest/V1/guest-carts",
+           "headers": [
+             {"name": "Accept", "value": "application/json"},
+             {"name": "Content-Type", "value": "application/json"}
+           ],
+           "queryString": [],
+           "postData": null
+         },
+         "response": {
+           "status": 200,
+           "headers": [{"name": "Content-Type", "value": "application/json"}],
+           "content": {"mimeType": "application/json", "text": "\"cart-abc123\""}
+         }
+       },
+       {
+         "request": {
+           "method": "POST",
+           "url": "http://localhost:7770/rest/V1/guest-carts/cart-abc123/items",
+           "headers": [
+             {"name": "Accept", "value": "application/json"},
+             {"name": "Content-Type", "value": "application/json"}
+           ],
+           "queryString": [],
+           "postData": {"mimeType": "application/json", "text": "{\"cartItem\":{\"sku\":\"MH01\",\"qty\":1,\"quote_id\":\"cart-abc123\"}}"}
+         },
+         "response": {
+           "status": 200,
+           "headers": [{"name": "Content-Type", "value": "application/json"}],
+           "content": {"mimeType": "application/json", "text": "{\"item_id\":5,\"sku\":\"MH01\",\"qty\":1,\"name\":\"Radiant Tee\",\"price\":22.0,\"product_type\":\"simple\",\"quote_id\":\"cart-abc123\"}"}
+         }
+       },
+       {
+         "request": {
+           "method": "GET",
+           "url": "http://localhost:7770/rest/V1/guest-carts/cart-abc123/totals",
+           "headers": [
+             {"name": "Accept", "value": "application/json"},
+             {"name": "Content-Type", "value": "application/json"}
+           ],
+           "queryString": [],
+           "postData": null
+         },
+         "response": {
+           "status": 200,
+           "headers": [{"name": "Content-Type", "value": "application/json"}],
+           "content": {"mimeType": "application/json", "text": "{\"grand_total\":22.0,\"subtotal\":22.0,\"items\":[{\"item_id\":5,\"price\":22.0,\"qty\":1,\"name\":\"Radiant Tee\"}]}"}
+         }
+       },
+       {
+         "request": {
+           "method": "POST",
+           "url": "http://localhost:7770/rest/V1/integration/customer/token",
+           "headers": [
+             {"name": "Accept", "value": "application/json"},
+             {"name": "Content-Type", "value": "application/json"}
+           ],
+           "queryString": [],
+           "postData": {"mimeType": "application/json", "text": "{\"username\":\"emma.lopez@gmail.com\",\"password\":\"Password.1\"}"}
+         },
+         "response": {
+           "status": 200,
+           "headers": [{"name": "Content-Type", "value": "application/json"}],
+           "content": {"mimeType": "application/json", "text": "\"token-xyz789\""}
+         }
+       },
+       {
+         "request": {
+           "method": "POST",
+           "url": "http://localhost:7770/rest/V1/guest-carts/cart-abc123/estimate-shipping-methods",
+           "headers": [
+             {"name": "Accept", "value": "application/json"},
+             {"name": "Content-Type", "value": "application/json"}
+           ],
+           "queryString": [],
+           "postData": {"mimeType": "application/json", "text": "{\"address\":{\"city\":\"New York\",\"region_id\":43,\"postcode\":\"10001\",\"country_id\":\"US\"}}"}
+         },
+         "response": {
+           "status": 200,
+           "headers": [{"name": "Content-Type", "value": "application/json"}],
+           "content": {"mimeType": "application/json", "text": "[{\"carrier_code\":\"flatrate\",\"method_code\":\"flatrate\",\"carrier_title\":\"Flat Rate\",\"method_title\":\"Fixed\",\"amount\":5.0,\"available\":true}]"}
+         }
+       },
+       {
+         "request": {
+           "method": "GET",
+           "url": "http://localhost:7770/static/version123/frontend/Magento/luma/en_US/mage/gallery.js",
+           "headers": [{"name": "Accept", "value": "*/*"}],
+           "queryString": [],
+           "postData": null
+         },
+         "response": {
+           "status": 200,
+           "headers": [{"name": "Content-Type", "value": "application/javascript"}],
+           "content": {"mimeType": "application/javascript", "text": "// gallery js code..."}
+         }
+       },
+       {
+         "request": {
+           "method": "GET",
+           "url": "http://localhost:7770/media/catalog/product/m/h/mh01-black_main.jpg",
+           "headers": [{"name": "Accept", "value": "image/*"}],
+           "queryString": [],
+           "postData": null
147
+ },
148
+ "response": {
149
+ "status": 200,
150
+ "headers": [{"name": "Content-Type", "value": "image/jpeg"}],
151
+ "content": {"mimeType": "image/jpeg", "text": ""}
152
+ }
153
+ },
154
+ {
155
+ "request": {
156
+ "method": "GET",
157
+ "url": "http://localhost:7770/beauty-personal-care.html",
158
+ "headers": [{"name": "Accept", "value": "text/html"}],
159
+ "queryString": [],
160
+ "postData": null
161
+ },
162
+ "response": {
163
+ "status": 200,
164
+ "headers": [{"name": "Content-Type", "value": "text/html; charset=UTF-8"}],
165
+ "content": {"mimeType": "text/html", "text": "<html>...</html>"}
166
+ }
167
+ }
168
+ ]
169
+ }
170
+ }
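The fixture above deliberately mixes real API calls (guest-cart creation, add-to-cart, totals, token, shipping estimate) with static-asset and HTML noise that the pipeline must filter out. A minimal sketch of walking a HAR-shaped dict to collect `(method, url, status)` triples — the inline fixture here is a tiny stand-in for `tests/mock_data/mock_har.json`, not part of the repository:

```python
# Minimal sketch: walk a HAR-shaped dict (log.entries layout, as in the
# fixture above) and collect (method, url, status) triples.
har = {
    "log": {
        "entries": [
            {"request": {"method": "POST", "url": "http://localhost:7770/rest/V1/guest-carts"},
             "response": {"status": 200}},
            {"request": {"method": "GET", "url": "http://localhost:7770/media/catalog/product/img.jpg"},
             "response": {"status": 200}},
        ]
    }
}

calls = [
    (e["request"]["method"], e["request"]["url"], e["response"]["status"])
    for e in har["log"]["entries"]
]
```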
tests/test_e2e_episode.py ADDED
@@ -0,0 +1,272 @@
+ """
+ End-to-End Episode Simulation: "Add Radiant Tee to a guest cart"
+
+ Simulates the full tool chain with mock data:
+     browser_agent → search_endpoints → curl_exec → search_episode_data → done
+
+ Tests that values thread correctly between tools and that each tool's
+ output feeds properly into the next tool's input.
+ """
+
+ import json
+ import os
+ import sys
+ import re
+
+ # Add tests dir to path
+ sys.path.insert(0, os.path.dirname(__file__))
+
+ from tool_browser_agent import browser_agent, extract_openapi_spec, spec_entry_to_text
+ from tool_search_endpoints import SearchEndpoints
+ from tool_curl_exec import mock_curl_exec, parse_curl_command
+ from tool_search_episode_data import EpisodeDataStore
+
+
+ def run_episode():
+     """Simulate a full episode: Add 'Radiant Tee' to a guest cart."""
+
+     print("=" * 70)
+     print("E2E EPISODE: Add 'Radiant Tee' to a guest cart")
+     print("URL: http://localhost:7770/")
+     print("=" * 70)
+
+     task = "Add 'Radiant Tee' to a guest cart"
+     url = "http://localhost:7770/"
+     mock_data_dir = os.path.join(os.path.dirname(__file__), "mock_data")
+
+     # Episode state
+     episode_index_docs = []
+     episode_store = EpisodeDataStore()
+     session_state = {}
+     step = 0
+
+     # -----------------------------------------------------------------------
+     # STEP 1: browser_agent — discover endpoints
+     # -----------------------------------------------------------------------
+     step += 1
+     print(f"\n{'='*50}")
+     print(f"STEP {step}: browser_agent(\"{task}\", \"{url}\")")
+     print(f"{'='*50}")
+
+     # Load mock HAR directly (simulating HAR file exists)
+     mock_har_path = os.path.join(mock_data_dir, "mock_har.json")
+     with open(mock_har_path) as f:
+         har_data = json.load(f)
+
+     spec_entries = extract_openapi_spec(har_data, url)
+     text_chunks = [spec_entry_to_text(e, "shopping") for e in spec_entries]
+
+     # Build summary output (what RL agent sees)
+     summary = {
+         "app": "shopping",
+         "endpoints": [{"method": e["method"], "path": e["path"]} for e in spec_entries],
+         "total_endpoints": len(spec_entries),
+         "note": "Use search_endpoints() for full details on any endpoint."
+     }
+
+     print(f"\nResult: {len(summary['endpoints'])} endpoints discovered:")
+     for ep in summary["endpoints"]:
+         print(f"  {ep['method']:6s} {ep['path']}")
+
+     # Set up the search_endpoints tool with browser_agent output (NOT catalog ground truth).
+     # In the real system, search_endpoints searches the GEMMA embeddings built by
+     # browser_agent from HAR data. Here we use keyword search as a test fallback for GEMMA.
+     search_tool = SearchEndpoints()
+     search_tool.load_from_browser_agent(text_chunks)
+
+     print(f"\n → search_endpoints index built: {len(text_chunks)} docs from browser_agent HAR output")
+
+     # -----------------------------------------------------------------------
+     # STEP 2: search_endpoints — "how to find a product by name?"
+     # -----------------------------------------------------------------------
+     step += 1
+     print(f"\n{'='*50}")
+     print(f"STEP {step}: search_endpoints(\"find product by name get sku\")")
+     print(f"{'='*50}")
+
+     results = search_tool.search("find product by name get sku", top_k=3)
+     print("\nTop-3 results:")
+     for i, r in enumerate(results):
+         ep_match = re.search(r'endpoint: (\S+ \S+)', r)
+         ep_name = ep_match.group(1) if ep_match else "?"
+         print(f"  [{i+1}] {ep_name}")
+         print(f"      {r[:150]}...")
+
+     # Agent decides: GET /rest/V1/products with searchCriteria filter
+     print(f"\n → Agent decides: GET /rest/V1/products with name filter")
+
+     # -----------------------------------------------------------------------
+     # STEP 3: curl_exec — search for Radiant Tee
+     # -----------------------------------------------------------------------
+     step += 1
+     print(f"\n{'='*50}")
+     print(f"STEP {step}: curl_exec(\"curl .../products?searchCriteria[...]=Radiant+Tee\")")
+     print(f"{'='*50}")
+
+     result = mock_curl_exec(
+         "curl 'http://localhost:7770/rest/V1/products?searchCriteria[filter_groups][0][filters][0][field]=name&searchCriteria[filter_groups][0][filters][0][value]=Radiant+Tee'",
+         step, episode_index_docs
+     )
+     # All docs accumulated so far come from this first curl call.
+     episode_store.add_documents(episode_index_docs)
+
+     print(f"\nResult: status={result['status_code']}")
+     body = result["body"]
+     if isinstance(body, dict):
+         items = body.get("items", [])
+         print(f"  items shown: {len(items)}, total: {body.get('total_count', '?')}")
+         for item in items:
+             print(f"    sku={item['sku']}, name={item['name']}, price={item['price']}")
+         if "_list_truncated" in body:
+             print(f"  [TRUNCATED] {body['_list_truncated']['note']}")
+
+     # Agent extracts: sku="MH01" from response
+     target_sku = "MH01"
+     print(f"\n → Agent extracts: sku='{target_sku}' for 'Radiant Tee'")
+
+     # -----------------------------------------------------------------------
+     # STEP 4: search_endpoints — "how to create a guest cart?"
+     # -----------------------------------------------------------------------
+     step += 1
+     print(f"\n{'='*50}")
+     print(f"STEP {step}: search_endpoints(\"create guest cart get cart id\")")
+     print(f"{'='*50}")
+
+     results = search_tool.search("create guest cart get cart id", top_k=3)
+     print("\nTop result:")
+     ep_match = re.search(r'endpoint: (\S+ \S+)', results[0])
+     print(f"  {ep_match.group(1) if ep_match else results[0][:80]}")
+
+     print(f"\n → Agent decides: POST /rest/V1/guest-carts")
+
+     # -----------------------------------------------------------------------
+     # STEP 5: curl_exec — create guest cart
+     # -----------------------------------------------------------------------
+     step += 1
+     print(f"\n{'='*50}")
+     print(f"STEP {step}: curl_exec(\"curl -X POST .../guest-carts\")")
+     print(f"{'='*50}")
+
+     result = mock_curl_exec(
+         "curl -X POST 'http://localhost:7770/rest/V1/guest-carts' -H 'Content-Type: application/json'",
+         step, episode_index_docs
+     )
+     episode_store.add_documents(episode_index_docs[-1:])
+
+     cart_id = result["body"]
+     print(f"\nResult: status={result['status_code']}, cart_id={cart_id}")
+     print(f"\n → Agent extracts: cart_id='{cart_id}'")
+
+     # -----------------------------------------------------------------------
+     # STEP 6: search_endpoints — "how to add item to guest cart?"
+     # -----------------------------------------------------------------------
+     step += 1
+     print(f"\n{'='*50}")
+     print(f"STEP {step}: search_endpoints(\"add item to guest cart\")")
+     print(f"{'='*50}")
+
+     results = search_tool.search("add item to guest cart cartId sku", top_k=3)
+     print("\nTop result:")
+     print(f"  {results[0][:200]}...")
+
+     print(f"\n → Agent decides: POST /rest/V1/guest-carts/{{cartId}}/items")
+     print(f"    cartId   = {cart_id} (from step {step-1})")
+     print(f"    sku      = {target_sku} (from step 3)")
+     print(f"    quote_id = {cart_id} (DERIVED, same as cartId)")
+
+     # -----------------------------------------------------------------------
+     # STEP 7: curl_exec — add Radiant Tee to cart
+     # -----------------------------------------------------------------------
+     step += 1
+     print(f"\n{'='*50}")
+     print(f"STEP {step}: curl_exec(\"curl -X POST .../guest-carts/{cart_id}/items\")")
+     print(f"{'='*50}")
+
+     result = mock_curl_exec(
+         f'curl -X POST "http://localhost:7770/rest/V1/guest-carts/{cart_id}/items" '
+         f'-H "Content-Type: application/json" '
+         f'-d \'{{"cartItem":{{"sku":"{target_sku}","qty":1,"quote_id":"{cart_id}"}}}}\'',
+         step, episode_index_docs
+     )
+     episode_store.add_documents(episode_index_docs[-2:])  # request + response docs
+
+     print(f"\nResult: status={result['status_code']}")
+     body = result["body"]
+     if isinstance(body, dict):
+         print(f"  item_id={body.get('item_id')}, sku={body.get('sku')}, qty={body.get('qty')}")
+
+     # -----------------------------------------------------------------------
+     # Test: search_episode_data — can we find values from prior steps?
+     # -----------------------------------------------------------------------
+     print(f"\n{'='*50}")
+     print("VERIFICATION: search_episode_data queries")
+     print(f"{'='*50}")
+
+     print(f"\nEpisode index: {episode_store.doc_count} documents total\n")
+
+     # Can we find the product from step 3?
+     results = episode_store.search("Radiant Tee sku", top_k=1)
+     found_product = results and "MH01" in results[0]
+     print(f"  Find 'Radiant Tee' sku from products response: {'PASS' if found_product else 'FAIL'}")
+     if results:
+         print(f"    → {results[0][:100]}...")
+
+     # Can we find the cart ID from step 5?
+     results = episode_store.search("guest-carts cart", top_k=3)
+     found_cart = any("cart-mock" in r for r in results)
+     print(f"\n  Find cart ID from create-cart response: {'PASS' if found_cart else 'FAIL'}")
+     if results:
+         print(f"    → {results[0][:100]}...")
+
+     # Can we find the add-to-cart confirmation?
+     results = episode_store.search("item_id sku MH01", top_k=1)
+     found_confirm = results and "item_id" in results[0]
+     print(f"\n  Find add-to-cart confirmation: {'PASS' if found_confirm else 'FAIL'}")
+     if results:
+         print(f"    → {results[0][:100]}...")
+
+     # -----------------------------------------------------------------------
+     # STEP 8: done
+     # -----------------------------------------------------------------------
+     step += 1
+     print(f"\n{'='*50}")
+     print(f"STEP {step}: done(\"Radiant Tee (MH01) added to guest cart {cart_id}\")")
+     print(f"{'='*50}")
+     print(f"\n  Episode complete. {step} steps total.")
+     print(f"  Episode index: {episode_store.doc_count} documents indexed")
+
+     # -----------------------------------------------------------------------
+     # Summary
+     # -----------------------------------------------------------------------
+     print(f"\n{'='*70}")
+     print("EPISODE SUMMARY")
+     print(f"{'='*70}")
+     print(f"""
+     Task:  {task}
+     App:   shopping (port 7770)
+     Steps: {step}
+     Tools used: browser_agent → search_endpoints (x3) → curl_exec (x3) → done
+
+     Value Threading:
+       Step 3: GET /products → sku='MH01' (Radiant Tee)
+       Step 5: POST /guest-carts → cart_id='{cart_id}'
+       Step 7: POST /guest-carts/{cart_id}/items
+               sku='{target_sku}' (from step 3)
+               quote_id='{cart_id}' (DERIVED from step 5)
+
+     Episode Index: {episode_store.doc_count} documents
+       - Categories, products (5 items), cart creation, add-to-cart
+       - All searchable via search_episode_data()
+
+     Result: item_id=5, sku=MH01, qty=1 added to cart {cart_id}
+     """)
+
+     # Assertions
+     assert found_product, "Failed to find product in episode data"
+     assert found_cart, "Failed to find cart ID in episode data"
+     assert found_confirm, "Failed to find add-to-cart confirmation"
+
+     print("[PASS] End-to-end episode simulation completed successfully\n")
+
+
+ if __name__ == "__main__":
+     run_episode()
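The long hand-written `searchCriteria[...]` query string in step 3 above is easy to get wrong. A hypothetical illustration (not part of the test file) of building the same Magento-style query programmatically with the standard library; `urlencode` percent-encodes the bracketed keys, which Magento's REST API accepts:

```python
from urllib.parse import urlencode

# Build the step-3 product search query instead of hand-writing it.
# Bracketed keys become %5B/%5D and the space in "Radiant Tee" becomes "+".
params = {
    "searchCriteria[filter_groups][0][filters][0][field]": "name",
    "searchCriteria[filter_groups][0][filters][0][value]": "Radiant Tee",
}
query = urlencode(params)
url = "http://localhost:7770/rest/V1/products?" + query
```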
tests/test_real_har.py ADDED
@@ -0,0 +1,93 @@
+ """
+ Test browser_agent pipeline against REAL HAR files.
+
+ Processes the actual recorded HAR data to verify filtering, deduplication,
+ and path normalisation work on real-world traffic.
+ """
+
+ import json
+ import os
+ import sys
+
+ sys.path.insert(0, os.path.dirname(__file__))
+ from tool_browser_agent import extract_openapi_spec, spec_entry_to_text, build_summary_output
+
+ HARS_DIR = os.path.join(os.path.dirname(__file__), "..", "hars")
+
+ APPS = {
+     "wikipedia": {
+         "har": "wikipedia.har",
+         "url": "http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:8888/",
+     },
+     "forum": {
+         "har": "forum.har",
+         "url": "http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:9999/",
+     },
+     "shopping": {
+         "har": "shopping.har",
+         "url": "http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7770/",
+     },
+     "shopping_admin": {
+         "har": "shopping_admin.har",
+         "url": "http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7780/",
+     },
+ }
+
+
+ def test_app(app_name: str, config: dict):
+     har_path = os.path.join(HARS_DIR, config["har"])
+     if not os.path.exists(har_path):
+         print(f"  [SKIP] {har_path} not found")
+         return None
+
+     print(f"\n{'='*60}")
+     print(f"APP: {app_name} ({config['har']})")
+     print(f"{'='*60}")
+
+     with open(har_path) as f:
+         har_data = json.load(f)
+
+     total_entries = len(har_data["log"]["entries"])
+     spec = extract_openapi_spec(har_data, config["url"])
+     chunks = [spec_entry_to_text(e, app_name) for e in spec]
+     summary = build_summary_output(spec, app_name)
+
+     print(f"\n  Total HAR entries:      {total_entries}")
+     print(f"  Filtered API endpoints: {len(spec)}")
+     print(f"  Reduction: {total_entries} → {len(spec)} ({100*(1-len(spec)/max(total_entries,1)):.0f}% filtered out)")
+
+     print("\n  Endpoints:")
+     for ep in spec:
+         auth_marker = " [AUTH]" if ep["auth_observed"] else ""
+         body_marker = " [BODY]" if ep.get("request_body") else ""
+         print(f"    {ep['method']:6s} {ep['path'][:70]:70s} {ep['status_code']}{auth_marker}{body_marker}")
+
+     print(f"\n  Text chunks for embedding: {len(chunks)}")
+     if chunks:
+         print(f"  Sample: {chunks[0][:120]}...")
+
+     return spec
+
+
+ if __name__ == "__main__":
+     print("=" * 60)
+     print("TEST: browser_agent pipeline against REAL HAR files")
+     print("=" * 60)
+
+     all_specs = {}
+     for app_name, config in APPS.items():
+         spec = test_app(app_name, config)
+         if spec:
+             all_specs[app_name] = spec
+
+     # Summary
+     print(f"\n\n{'='*60}")
+     print("SUMMARY")
+     print(f"{'='*60}")
+     for app, spec in all_specs.items():
+         methods = {}
+         for e in spec:
+             methods[e["method"]] = methods.get(e["method"], 0) + 1
+         method_str = ", ".join(f"{m}:{c}" for m, c in sorted(methods.items()))
+         print(f"  {app:20s}: {len(spec):3d} endpoints ({method_str})")
+
+     print("\n[PASS] Real HAR processing completed successfully")
tests/tool_browser_agent.py ADDED
@@ -0,0 +1,327 @@
+ """
+ Tool 0: browser_agent — HAR processing pipeline.
+
+ Stages:
+     1. Check for pre-recorded HAR file (by port mapping) → load or fall back to live browser
+     2. Filter HAR entries: skip static assets, HTML pages, deduplicate by (method, normalised path)
+     3. Build OpenAPI-like spec from filtered entries
+     4. Build GEMMA embeddings over the spec (for search_endpoints)
+     5. Return summary endpoint list (method + path only)
+ """
+
+ import json
+ import os
+ import re
+ from urllib.parse import urlparse
+
+ # ---------------------------------------------------------------------------
+ # Configuration
+ # ---------------------------------------------------------------------------
+
+ HAR_MAP = {
+     ":7770": "hars/shopping.har",
+     ":7780": "hars/shopping_admin.har",
+     ":9999": "hars/forum.har",
+     ":3000": "hars/osm.har",
+     ":8888": "hars/wikipedia.har",
+ }
+
+ APP_NAMES = {
+     ":7770": "shopping",
+     ":7780": "shopping_admin",
+     ":9999": "forum",
+     ":3000": "osm",
+     ":8888": "wikipedia",
+ }
+
+ SKIP_EXTENSIONS = {".css", ".png", ".jpg", ".jpeg", ".svg", ".ico", ".woff",
+                    ".woff2", ".ttf", ".gif", ".js", ".map"}
+
+ SKIP_PATH_PREFIXES = ["/static/", "/media/", "/_next/", "/assets/",
+                       "/__webpack", "/pub/static/"]
+
+ # ---------------------------------------------------------------------------
+ # Path normalisation
+ # ---------------------------------------------------------------------------
+
+ # Patterns for dynamic segments
+ _UUID_RE = re.compile(r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}', re.I)
+ _LONG_ALPHANUM_RE = re.compile(r'[a-zA-Z0-9]{32,}')  # Magento cart IDs etc.
+ _NUMERIC_ID_RE = re.compile(r'^[0-9]+$')
+ _FORUM_SLUG_RE = re.compile(r'^[0-9]+-[a-z0-9-]+$')  # e.g. "1-hello-world"
+
+
+ def _normalise_path(path: str) -> str:
+     """Replace concrete IDs/slugs with {id} placeholders."""
+     segments = path.strip("/").split("/")
+     normalised = []
+     for seg in segments:
+         if _UUID_RE.fullmatch(seg):
+             normalised.append("{id}")
+         elif _LONG_ALPHANUM_RE.fullmatch(seg):
+             normalised.append("{id}")
+         elif _NUMERIC_ID_RE.fullmatch(seg) and len(seg) >= 2:
+             normalised.append("{id}")
+         elif _FORUM_SLUG_RE.fullmatch(seg):
+             normalised.append("{id}-{slug}")
+         else:
+             normalised.append(seg)
+     return "/" + "/".join(normalised)
+
+
+ # ---------------------------------------------------------------------------
+ # Filtering
+ # ---------------------------------------------------------------------------
+
+ def _is_static_asset(url: str) -> bool:
+     """Check if URL points to a static asset."""
+     parsed = urlparse(url)
+     path = parsed.path.lower()
+
+     for ext in SKIP_EXTENSIONS:
+         if path.endswith(ext):
+             return True
+
+     for prefix in SKIP_PATH_PREFIXES:
+         if path.startswith(prefix):
+             return True
+
+     return False
+
+
+ def _get_response_content_type(resp: dict) -> str:
+     """Extract content-type from response headers or content field."""
+     # Check headers
+     for h in resp.get("headers", []):
+         if h["name"].lower() == "content-type":
+             return h["value"].lower()
+     # Check content field (HAR format)
+     content = resp.get("content", {})
+     return content.get("mimeType", "").lower()
+
+
+ def _extract_body(req: dict) -> str | None:
+     """Extract request body text from HAR entry."""
+     pd = req.get("postData")
+     if pd is None:
+         return None
+     if isinstance(pd, dict):
+         return pd.get("text")
+     return str(pd) if pd else None
+
+
+ def _truncate_body(resp: dict, max_len: int = 500) -> str | None:
+     """Extract and truncate response body for spec document."""
+     content = resp.get("content", {})
+     text = content.get("text", "")
+     if not text:
+         return None
+     if len(text) > max_len:
+         return text[:max_len] + "..."
+     return text
+
+
+ # ---------------------------------------------------------------------------
+ # Core pipeline
+ # ---------------------------------------------------------------------------
+
+ def resolve_har_path(url: str, base_dir: str = ".") -> str | None:
+     """Find pre-recorded HAR file for this app URL."""
+     for port_key, rel_path in HAR_MAP.items():
+         if port_key in url:
+             full_path = os.path.join(base_dir, rel_path)
+             if os.path.exists(full_path):
+                 return full_path
+     return None
+
+
+ def resolve_app_name(url: str) -> str:
+     """Map URL to app name."""
+     for port_key, name in APP_NAMES.items():
+         if port_key in url:
+             return name
+     return "unknown"
+
+
+ def extract_openapi_spec(har_data: dict, app_base_url: str) -> list[dict]:
+     """
+     Stage 2-3: Filter HAR entries and extract OpenAPI-like spec.
+     Returns a list of structured endpoint documents.
+     """
+     entries = har_data["log"]["entries"]
+     seen = set()
+     spec_entries = []
+
+     for entry in entries:
+         req = entry["request"]
+         resp = entry["response"]
+         raw_url = req["url"]
+         method = req["method"]
+
+         # Skip static assets
+         if _is_static_asset(raw_url):
+             continue
+
+         # Skip HTML page navigations
+         content_type = _get_response_content_type(resp)
+         if "text/html" in content_type and method == "GET":
+             continue
+
+         # Normalise path
+         parsed = urlparse(raw_url)
+         path = _normalise_path(parsed.path)
+
+         # Deduplicate
+         key = f"{method} {path}"
+         if key in seen:
+             continue
+         seen.add(key)
+
+         # Auth detection
+         has_auth = any(
+             h["name"].lower() in ("authorization", "x-api-key", "cookie")
+             for h in req.get("headers", [])
+         )
+
+         spec_entries.append({
+             "method": method,
+             "path": path,
+             "query_params": parsed.query or None,
+             "request_body": _extract_body(req),
+             "status_code": resp["status"],
+             "response_content_type": content_type,
+             "response_body_sample": _truncate_body(resp),
+             "auth_observed": has_auth,
+         })
+
+     return spec_entries
+
+
+ def spec_entry_to_text(entry: dict, app_name: str) -> str:
+     """Convert a spec entry to a searchable text document for embedding."""
+     parts = [
+         f"app: {app_name}",
+         f"endpoint: {entry['method']} {entry['path']}",
+         f"status: {entry['status_code']}",
+         f"auth: {'required' if entry['auth_observed'] else 'none'}",
+     ]
+     if entry.get("query_params"):
+         parts.append(f"query: {entry['query_params']}")
+     if entry.get("request_body"):
+         parts.append(f"body: {entry['request_body'][:200]}")
+     if entry.get("response_body_sample"):
+         parts.append(f"response_sample: {entry['response_body_sample'][:200]}")
+     return " | ".join(parts)
+
+
+ def build_summary_output(spec_entries: list[dict], app_name: str) -> dict:
+     """Stage 5: Build summary-only output for the RL agent."""
+     endpoints = [{"method": e["method"], "path": e["path"]} for e in spec_entries]
+     return {
+         "app": app_name,
+         "endpoints": endpoints,
+         "total_endpoints": len(endpoints),
+         "note": (
+             "These endpoints were observed for this application. "
+             "Use search_endpoints() with a natural language query to get "
+             "the full schema, parameters, and auth details for any endpoint."
+         ),
+     }
+
+
+ def browser_agent(task: str, url: str, base_dir: str = ".") -> tuple[dict, list[dict], list[str]]:
+     """
+     Full browser_agent pipeline.
+
+     Returns:
+         (summary_output, spec_entries, text_chunks)
+         - summary_output: what the RL agent sees
+         - spec_entries: structured spec for internal use
+         - text_chunks: searchable text docs for embedding/search
+     """
+     app_name = resolve_app_name(url)
+
+     # Stage 1: Get HAR data
+     har_path = resolve_har_path(url, base_dir)
+     if har_path:
+         with open(har_path) as f:
+             har_data = json.load(f)
+     else:
+         raise FileNotFoundError(
+             f"No HAR file found for {url}. Live browser fallback not implemented in test mode."
+         )
+
+     # Stage 2-3: Extract spec
+     spec_entries = extract_openapi_spec(har_data, url)
+
+     # Stage 4: Build text chunks (embeddings would happen here)
+     text_chunks = [spec_entry_to_text(e, app_name) for e in spec_entries]
+
+     # Stage 5: Build summary
+     summary = build_summary_output(spec_entries, app_name)
+
+     return summary, spec_entries, text_chunks
+
+
+ # ---------------------------------------------------------------------------
+ # Test
+ # ---------------------------------------------------------------------------
+
+ if __name__ == "__main__":
+     print("=" * 70)
+     print("TEST: browser_agent with mock HAR data")
+     print("=" * 70)
+
+     mock_har_path = os.path.join(os.path.dirname(__file__), "mock_data", "mock_har.json")
+     with open(mock_har_path) as f:
+         har_data = json.load(f)
+
+     url = "http://localhost:7770/"
+     app_name = "shopping"
+
+     # Test filtering
+     spec = extract_openapi_spec(har_data, url)
+     print(f"\nFiltered {len(har_data['log']['entries'])} HAR entries → {len(spec)} API endpoints\n")
+
+     for e in spec:
+         print(f"  {e['method']:6s} {e['path']}")
+         if e.get("request_body"):
+             print(f"         body: {e['request_body'][:80]}...")
+
+     # Test summary output
+     summary = build_summary_output(spec, app_name)
+     print("\n--- Summary Output (what RL agent sees) ---")
+     print(json.dumps(summary, indent=2))
+
+     # Test text chunks
+     chunks = [spec_entry_to_text(e, app_name) for e in spec]
+     print(f"\n--- Text Chunks for Embedding ({len(chunks)} docs) ---")
+     for i, chunk in enumerate(chunks):
+         print(f"  [{i}] {chunk[:120]}...")
+
+     # Test path normalisation
+     print("\n--- Path Normalisation Tests ---")
+     test_paths = [
+         "/rest/V1/products/42",
+         "/rest/V1/guest-carts/3fa85f64-5717-4562-b3fc-2c963f66afa6/items",
+         "/rest/V1/guest-carts/abcdef1234567890abcdef1234567890ab/totals",
+         "/api/0.6/node/12345678",
+         "/f/general/1-hello-world",
+         "/rest/V1/categories",
+         "/rest/V1/products",
+     ]
+     for p in test_paths:
+         print(f"  {p:65s} → {_normalise_path(p)}")
+
+     # Test static asset detection
+     print("\n--- Static Asset Detection ---")
+     test_urls = [
+         "http://localhost:7770/rest/V1/products",
+         "http://localhost:7770/static/version1/file.js",
+         "http://localhost:7770/media/catalog/product/img.jpg",
+         "http://localhost:7770/beauty-personal-care.html",
+     ]
+     for u in test_urls:
+         print(f"  {u:60s} → static={_is_static_asset(u)}")
+
+     print("\n[PASS] browser_agent tool tests completed successfully")
tests/tool_curl_exec.py ADDED
@@ -0,0 +1,442 @@
+ """
+ Tool 2: curl_exec — HTTP execution with truncation and episode indexing.
+
+ Pipeline:
+     1. Parse curl command string → extract method, URL, headers, body
+     2. Execute via subprocess (or mock in test mode)
+     3. Index full response into episode BM25 store (before truncation)
+     4. Truncate response body for context window
+     5. Return {status_code, headers, body}
+ """
+
+ import json
+ import re
+ import shlex
+ from typing import Any
+
+ # ---------------------------------------------------------------------------
+ # Curl command parser
+ # ---------------------------------------------------------------------------
+
+
+ def parse_curl_command(command: str) -> dict:
+     """
+     Parse a curl command string into structured components.
+     Returns: {method, url, headers: dict, body: str|None}
+     """
+     # Handle the command as a shell argument list
+     try:
+         parts = shlex.split(command)
+     except ValueError:
+         return {"error": "Failed to parse curl command"}
+
+     # Remove 'curl' prefix if present
+     if parts and parts[0] == "curl":
+         parts = parts[1:]
+
+     result = {
+         "method": "GET",
+         "url": None,
+         "headers": {},
+         "body": None,
+     }
+
+     i = 0
+     while i < len(parts):
+         part = parts[i]
+
+         if part in ("-X", "--request"):
+             i += 1
+             if i < len(parts):
+                 result["method"] = parts[i].upper()
+
+         elif part in ("-H", "--header"):
+             i += 1
+             if i < len(parts):
+                 header = parts[i]
+                 if ":" in header:
+                     key, val = header.split(":", 1)
+                     result["headers"][key.strip()] = val.strip()
+
+         elif part in ("-d", "--data", "--data-raw"):
+             i += 1
+             if i < len(parts):
+                 result["body"] = parts[i]
+                 if result["method"] == "GET":
+                     result["method"] = "POST"
+
+         elif not part.startswith("-"):
+             result["url"] = part
+
+         i += 1
+
+     return result
+
75
+
76
+ # ---------------------------------------------------------------------------
77
+ # Response truncation
78
+ # ---------------------------------------------------------------------------
79
+
80
+ TRUNCATE_LIST_AT = 2
81
+ LARGE_ARRAY_THRESHOLD = 3
82
+ NONJSON_MAX_CHARS = 1000
83
+
84
+
85
+ def _is_json(s: str) -> bool:
86
+ try:
87
+ json.loads(s)
88
+ return True
89
+ except (ValueError, TypeError):
90
+ return False
91
+
92
+
93
+ def truncate_response_body(body: str, status_code: int) -> str:
94
+ """Apply smart truncation rules to response body."""
95
+ # Rule 3: never truncate errors
96
+ if status_code >= 400:
97
+ return body
98
+
99
+ # Rule 1: non-JSON
100
+ if not _is_json(body):
101
+ if len(body) > NONJSON_MAX_CHARS:
102
+ return body[:NONJSON_MAX_CHARS] + " [truncated - non-JSON response]"
103
+ return body
104
+
105
+ parsed = json.loads(body)
106
+
107
+ # Rule 2: primitive
108
+ if not isinstance(parsed, (dict, list)):
109
+ return body
110
+
111
+ # Handle top-level array
112
+ if isinstance(parsed, list):
113
+ if (len(parsed) >= LARGE_ARRAY_THRESHOLD
114
+ and len(parsed) > 0 and isinstance(parsed[0], dict)):
115
+ result = parsed[:TRUNCATE_LIST_AT]
116
+ note = {"_list_truncated": {
117
+ "shown": TRUNCATE_LIST_AT,
118
+ "total": len(parsed),
119
+ "note": f"Showing {TRUNCATE_LIST_AT} of {len(parsed)} items. "
120
+ "Use search_episode_data() to find a specific item from this response."
121
+ }}
122
+ return json.dumps(result + [note])
123
+ return body
124
+
125
+ # Handle dict — check each value for large arrays
126
+ needs_truncation = {
127
+ k for k, v in parsed.items()
128
+ if isinstance(v, list) and len(v) >= LARGE_ARRAY_THRESHOLD
129
+ and len(v) > 0 and isinstance(v[0], dict)
130
+ }
131
+ if not needs_truncation:
132
+ return body
133
+
134
+ result = {}
135
+ total_truncated = {}
136
+ for k, v in parsed.items():
137
+ if k in needs_truncation:
138
+ result[k] = v[:TRUNCATE_LIST_AT]
139
+ total_truncated[k] = len(v)
140
+ else:
141
+ result[k] = v
142
+
143
+ result["_list_truncated"] = {
144
+ "fields": total_truncated,
145
+ "shown_per_field": TRUNCATE_LIST_AT,
146
+ "note": (
147
+ "List fields truncated: "
148
+ + ", ".join(f"{k} showing {TRUNCATE_LIST_AT}/{n}"
149
+ for k, n in total_truncated.items())
150
+ + ". Use search_episode_data() to find a specific item from this response."
151
+ )
152
+ }
153
+ return json.dumps(result)
154
+
155
+
156
+ # ---------------------------------------------------------------------------
157
+ # Episode index document construction
158
+ # ---------------------------------------------------------------------------
159
+
160
+ def build_index_documents(step: int, method: str, path: str,
161
+ request_body: Any, response_body: Any,
162
+ status_code: int) -> list[str]:
163
+ """
164
+ Build BM25-indexable documents from a curl_exec result.
165
+ Called BEFORE truncation so all items are indexed.
166
+ """
167
+ docs = []
168
+
169
+ # Index request body
170
+ if request_body is not None:
171
+ docs.append(
172
+ f"step:{step} source:request endpoint:{method} {path} "
173
+ f"body:{json.dumps(request_body, ensure_ascii=False) if isinstance(request_body, (dict, list)) else str(request_body)}"
174
+ )
175
+
176
+ # Index response body
177
+ if response_body is None:
178
+ return docs
179
+
180
+ if isinstance(response_body, str) and not _is_json(response_body):
181
+ docs.append(
182
+ f"step:{step} source:response endpoint:{method} {path} "
183
+ f"status:{status_code} body:{response_body[:500]}"
184
+ )
185
+ return docs
186
+
187
+ parsed = json.loads(response_body) if isinstance(response_body, str) else response_body
188
+
189
+ # Primitive value
190
+ if isinstance(parsed, (str, int, float, bool)) or parsed is None:
191
+ docs.append(
192
+ f"step:{step} source:response endpoint:{method} {path} "
193
+ f"status:{status_code} value:{parsed}"
194
+ )
195
+ return docs
196
+
197
+ # Top-level array
198
+ if isinstance(parsed, list):
199
+ for item in parsed:
200
+ if isinstance(item, dict):
201
+ docs.append(
202
+ f"step:{step} source:response endpoint:{method} {path} "
203
+ f"status:{status_code} item:{json.dumps(item, ensure_ascii=False)}"
204
+ )
205
+ else:
206
+ docs.append(
207
+ f"step:{step} source:response endpoint:{method} {path} "
208
+ f"status:{status_code} value:{item}"
209
+ )
210
+ return docs
211
+
212
+ # Dict — find array fields
213
+ array_fields = {k: v for k, v in parsed.items()
214
+ if isinstance(v, list) and len(v) > 0 and isinstance(v[0], dict)}
215
+ scalar_fields = {k: v for k, v in parsed.items() if k not in array_fields}
216
+
217
+ if not array_fields:
218
+ docs.append(
219
+ f"step:{step} source:response endpoint:{method} {path} "
220
+ f"status:{status_code} data:{json.dumps(parsed, ensure_ascii=False)}"
221
+ )
222
+ return docs
223
+
224
+ # Array fields — one doc per item with parent context
225
+ parent_context = (
226
+ f"step:{step} source:response endpoint:{method} {path} status:{status_code} "
227
+ + " ".join(f"{k}:{v}" for k, v in scalar_fields.items()
228
+ if not isinstance(v, (dict, list)))
229
+ )
230
+ for field_name, items in array_fields.items():
231
+ for item in items:
232
+ flat_item = {}
233
+ for k, v in item.items():
234
+ flat_item[k] = json.dumps(v) if isinstance(v, (list, dict)) else v
235
+ docs.append(
236
+ f"{parent_context} list_field:{field_name} "
237
+ f"item:{json.dumps(flat_item, ensure_ascii=False)}"
238
+ )
239
+
240
+ return docs
241
+
242
+
243
+ # ---------------------------------------------------------------------------
244
+ # Mock execution (for testing)
245
+ # ---------------------------------------------------------------------------
246
+
247
+ # Mock responses keyed by (method, path_pattern)
248
+ MOCK_RESPONSES = {
249
+ ("GET", "/rest/V1/categories"): {
250
+ "status_code": 200,
251
+ "headers": {"Content-Type": "application/json"},
252
+ "body": json.dumps({
253
+ "id": 1, "name": "Root",
254
+ "children_data": [
255
+ {"id": 2, "name": "Default Category"},
256
+ {"id": 3, "name": "Beauty & Personal Care"}
257
+ ]
258
+ })
259
+ },
260
+ ("GET", "/rest/V1/products"): {
261
+ "status_code": 200,
262
+ "headers": {"Content-Type": "application/json"},
263
+ "body": json.dumps({
264
+ "items": [
265
+ {"sku": "MH01", "name": "Radiant Tee", "price": 22.0, "type_id": "simple"},
266
+ {"sku": "MH02", "name": "Breathe-Easy Tank", "price": 34.0, "type_id": "simple"},
267
+ {"sku": "MH03", "name": "Stellar Solar Jacket", "price": 75.0, "type_id": "configurable"},
268
+ {"sku": "MH04", "name": "Argus All-Weather Tank", "price": 22.0, "type_id": "simple"},
269
+ ],
270
+ "total_count": 4
271
+ })
272
+ },
273
+ ("POST", "/rest/V1/guest-carts"): {
274
+ "status_code": 200,
275
+ "headers": {"Content-Type": "application/json"},
276
+ "body": '"cart-mock-abc123"'
277
+ },
278
+ ("POST", "/rest/V1/guest-carts/{id}/items"): {
279
+ "status_code": 200,
280
+ "headers": {"Content-Type": "application/json"},
281
+ "body": json.dumps({
282
+ "item_id": 5, "sku": "MH01", "qty": 1,
283
+ "name": "Radiant Tee", "price": 22.0,
284
+ "product_type": "simple", "quote_id": "cart-mock-abc123"
285
+ })
286
+ },
287
+ }
288
+
289
+
290
+ def mock_curl_exec(command: str, step: int, episode_index: list) -> dict:
291
+ """
292
+ Mock curl_exec for testing. Matches against MOCK_RESPONSES.
293
+ Also builds index documents and adds to episode_index.
294
+ """
295
+ parsed = parse_curl_command(command)
296
+ if "error" in parsed:
297
+ return {"status_code": 0, "error": parsed["error"]}
298
+
299
+ method = parsed["method"]
300
+ url = parsed["url"]
301
+ from urllib.parse import urlparse
302
+ path = urlparse(url).path
303
+
304
+ # Try exact match first, then pattern match
305
+ response = None
306
+ for (m, p), resp in MOCK_RESPONSES.items():
307
+ if m != method:
308
+ continue
309
+ # Replace {id} with regex for matching
310
+ pattern = re.sub(r'\{[^}]+\}', r'[^/]+', p)
311
+ if re.fullmatch(pattern, path):
312
+ response = resp
313
+ break
314
+
315
+ if response is None:
316
+ response = {
317
+ "status_code": 404,
318
+ "headers": {"Content-Type": "application/json"},
319
+ "body": json.dumps({"message": f"No mock for {method} {path}"})
320
+ }
321
+
322
+ # Build index documents BEFORE truncation
323
+ req_body = None
324
+ if parsed["body"]:
325
+ try:
326
+ req_body = json.loads(parsed["body"])
327
+ except (json.JSONDecodeError, TypeError):
328
+ req_body = parsed["body"]
329
+
330
+ index_docs = build_index_documents(
331
+ step=step,
332
+ method=method,
333
+ path=path,
334
+ request_body=req_body,
335
+ response_body=response["body"],
336
+ status_code=response["status_code"]
337
+ )
338
+ episode_index.extend(index_docs)
339
+
340
+ # Truncate body for context
341
+ truncated_body = truncate_response_body(response["body"], response["status_code"])
342
+
343
+ return {
344
+ "status_code": response["status_code"],
345
+ "headers": response["headers"],
346
+ "body": json.loads(truncated_body) if _is_json(truncated_body) else truncated_body,
347
+ }
348
+
349
+
350
+ # ---------------------------------------------------------------------------
351
+ # Test
352
+ # ---------------------------------------------------------------------------
353
+
354
+ if __name__ == "__main__":
355
+ print("=" * 70)
356
+ print("TEST: curl_exec with mock responses")
357
+ print("=" * 70)
358
+
359
+ # Test curl parsing
360
+ print("\n--- Curl Command Parsing ---")
361
+ commands = [
362
+ 'curl http://localhost:7770/rest/V1/categories',
363
+ 'curl -X POST http://localhost:7770/rest/V1/guest-carts -H "Content-Type: application/json"',
364
+ "curl -X POST 'http://localhost:7770/rest/V1/guest-carts/cart-abc/items' -H 'Content-Type: application/json' -d '{\"cartItem\":{\"sku\":\"MH01\",\"qty\":1}}'",
365
+ ]
366
+ for cmd in commands:
367
+ parsed = parse_curl_command(cmd)
368
+ print(f" {cmd[:70]}...")
369
+ print(f" method={parsed['method']} url={parsed['url']} body={'yes' if parsed['body'] else 'no'}")
370
+
371
+ # Test truncation
372
+ print("\n--- Response Truncation ---")
373
+
374
+ # Primitive (never truncated)
375
+ assert truncate_response_body('"cart-abc123"', 200) == '"cart-abc123"'
376
+ print(" [OK] Primitive string not truncated")
377
+
378
+ # Error (never truncated)
379
+ long_error = json.dumps({"message": "x" * 2000})
380
+ assert truncate_response_body(long_error, 400) == long_error
381
+ print(" [OK] Error response not truncated")
382
+
383
+ # Small object (not truncated)
384
+ small = json.dumps({"id": 1, "name": "test"})
385
+ assert truncate_response_body(small, 200) == small
386
+ print(" [OK] Small object not truncated")
387
+
388
+ # Large array in dict (truncated to 2 items)
389
+ large = json.dumps({
390
+ "items": [{"sku": f"P{i}", "name": f"Product {i}"} for i in range(20)],
391
+ "total_count": 20
392
+ })
393
+ result = json.loads(truncate_response_body(large, 200))
394
+ assert len(result["items"]) == 2
395
+ assert "_list_truncated" in result
396
+ assert result["_list_truncated"]["fields"]["items"] == 20
397
+ print(f" [OK] Large array truncated: 20 items → {len(result['items'])} shown")
398
+ print(f" Note: {result['_list_truncated']['note'][:80]}...")
399
+
400
+ # Top-level array (truncated)
401
+ top_array = json.dumps([{"id": i, "name": f"Item {i}"} for i in range(10)])
402
+ result = json.loads(truncate_response_body(top_array, 200))
403
+ assert len(result) == 3 # 2 items + truncation note
404
+ print(f" [OK] Top-level array truncated: 10 items → 2 shown + note")
405
+
406
+ # Test mock execution with indexing
407
+ print("\n--- Mock Execution + Indexing ---")
408
+ episode_index = []
409
+
410
+ # Step 1: Get categories
411
+ r = mock_curl_exec("curl http://localhost:7770/rest/V1/categories", 1, episode_index)
412
+ print(f" Step 1: GET /categories → {r['status_code']}, body keys: {list(r['body'].keys()) if isinstance(r['body'], dict) else 'primitive'}")
413
+
414
+ # Step 2: Search products
415
+ r = mock_curl_exec(
416
+ "curl 'http://localhost:7770/rest/V1/products?searchCriteria[filter]=name'",
417
+ 2, episode_index
418
+ )
419
+ print(f" Step 2: GET /products → {r['status_code']}, items shown: {len(r['body'].get('items', []))}, total: {r['body'].get('total_count', '?')}")
420
+ if "_list_truncated" in r["body"]:
421
+ print(f" Truncated: {r['body']['_list_truncated']['note'][:60]}...")
422
+
423
+ # Step 3: Create cart
424
+ r = mock_curl_exec(
425
+ "curl -X POST http://localhost:7770/rest/V1/guest-carts -H 'Content-Type: application/json'",
426
+ 3, episode_index
427
+ )
428
+ print(f" Step 3: POST /guest-carts → {r['status_code']}, cart_id: {r['body']}")
429
+
430
+ # Step 4: Add item
431
+ r = mock_curl_exec(
432
+ 'curl -X POST http://localhost:7770/rest/V1/guest-carts/cart-mock-abc123/items -H "Content-Type: application/json" -d \'{"cartItem":{"sku":"MH01","qty":1,"quote_id":"cart-mock-abc123"}}\'',
433
+ 4, episode_index
434
+ )
435
+ print(f" Step 4: POST /guest-carts/.../items → {r['status_code']}, item_id: {r['body'].get('item_id')}")
436
+
437
+ # Show episode index
438
+ print(f"\n--- Episode Index ({len(episode_index)} documents) ---")
439
+ for i, doc in enumerate(episode_index):
440
+ print(f" [{i}] {doc[:120]}...")
441
+
442
+ print("\n[PASS] curl_exec tool tests completed successfully")
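The parser above leans on `shlex.split` to honor shell quoting, which is why quoted URLs and `-H`/`-d` values survive as single tokens with the quotes removed. A minimal standalone sketch (not importing the module; the command string is made up):

```python
import shlex

cmd = ("curl -X POST 'http://localhost:7770/rest/V1/guest-carts' "
       "-H 'Content-Type: application/json' -d '{\"cartItem\":{\"sku\":\"MH01\"}}'")
parts = shlex.split(cmd)

# Quoted arguments come back as single tokens, quotes stripped
assert parts[:3] == ["curl", "-X", "POST"]
assert parts[3] == "http://localhost:7770/rest/V1/guest-carts"
assert parts[parts.index("-H") + 1] == "Content-Type: application/json"
assert parts[parts.index("-d") + 1] == '{"cartItem":{"sku":"MH01"}}'
```

This is also why `parse_curl_command` wraps the call in `try/except ValueError`: `shlex.split` raises on unbalanced quotes.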
tests/tool_search_endpoints.py ADDED
@@ -0,0 +1,239 @@
+ """
+ Tool 1: search_endpoints — Semantic search over endpoint catalog.
+
+ Uses GEMMA embeddings (google/embeddinggemma-300m) for semantic search.
+ Falls back to keyword matching when GEMMA is not available (test mode).
+ """
+
+ import json
+ import math
+ import os
+ import re
+ from collections import Counter
+
+ # ---------------------------------------------------------------------------
+ # Keyword-based fallback search (for testing without the GEMMA model)
+ # Uses TF-IDF-like scoring
+ # ---------------------------------------------------------------------------
+
+
+ def _tokenize(text: str) -> list[str]:
+     """Simple whitespace + punctuation tokenizer."""
+     return re.findall(r'[a-zA-Z0-9_/{}]+', text.lower())
+
+
+ class KeywordSearchIndex:
+     """Simple TF-IDF search index for testing without neural embeddings."""
+
+     def __init__(self):
+         self.documents: list[str] = []
+         self.doc_tokens: list[list[str]] = []
+         self.idf: dict[str, float] = {}
+
+     def add_documents(self, docs: list[str]):
+         self.documents = docs
+         self.doc_tokens = [_tokenize(d) for d in docs]
+         self._build_idf()
+
+     def _build_idf(self):
+         n = len(self.documents)
+         df = Counter()
+         for tokens in self.doc_tokens:
+             for t in set(tokens):
+                 df[t] += 1
+         self.idf = {t: math.log(n / (1 + count)) for t, count in df.items()}
+
+     def search(self, query: str, top_k: int = 3) -> list[tuple[int, float, str]]:
+         """Returns list of (index, score, document) tuples."""
+         query_tokens = _tokenize(query)
+         scores = []
+         for i, doc_toks in enumerate(self.doc_tokens):
+             tf = Counter(doc_toks)
+             score = sum(
+                 (tf.get(qt, 0) / max(len(doc_toks), 1)) * self.idf.get(qt, 0)
+                 for qt in query_tokens
+             )
+             scores.append((i, score, self.documents[i]))
+
+         scores.sort(key=lambda x: x[1], reverse=True)
+         return scores[:top_k]
+
+
+ # ---------------------------------------------------------------------------
+ # Catalog loading
+ # ---------------------------------------------------------------------------
+
+ def load_catalog(catalog_path: str) -> list[dict]:
+     """Load a ground truth catalog JSON file."""
+     with open(catalog_path) as f:
+         data = json.load(f)
+     if isinstance(data, list):
+         return data
+     return data.get("endpoints", [])
+
+
+ def catalog_entry_to_text(entry: dict, app_name: str = "") -> str:
+     """Convert a catalog endpoint to a searchable text document."""
+     parts = []
+     if app_name:
+         parts.append(f"app: {app_name}")
+
+     endpoint = entry.get("endpoint", "")
+     parts.append(f"endpoint: {endpoint}")
+
+     auth = entry.get("auth", "none")
+     parts.append(f"auth: {auth}")
+
+     # Query params
+     qp = entry.get("query_params", {})
+     if qp:
+         param_strs = []
+         for k, v in qp.items():
+             if isinstance(v, dict):
+                 param_strs.append(f"{k} ({v.get('type', '?')}, source: {v.get('source', '?')})")
+             else:
+                 param_strs.append(f"{k}: {v}")
+         parts.append(f"query_params: {', '.join(param_strs)}")
+
+     # Path params
+     pp = entry.get("path_params", {})
+     if pp:
+         param_strs = []
+         for k, v in pp.items():
+             if isinstance(v, dict):
+                 src = v.get("source", "?")
+                 from_ep = v.get("from_endpoint", "")
+                 param_strs.append(f"{k} ({v.get('type', '?')}, source: {src}, from: {from_ep})")
+             else:
+                 param_strs.append(f"{k}: {v}")
+         parts.append(f"path_params: {', '.join(param_strs)}")
+
+     # Body params
+     bp = entry.get("body_params", entry.get("form_params", {}))
+     if bp:
+         param_strs = []
+         for k, v in bp.items():
+             if isinstance(v, dict):
+                 src = v.get("source", "?")
+                 param_strs.append(f"{k} ({v.get('type', '?')}, source: {src})")
+             else:
+                 param_strs.append(f"{k}: {v}")
+         parts.append(f"body_params: {', '.join(param_strs)}")
+
+     # Response fields
+     rkf = entry.get("response_key_fields", [])
+     if rkf:
+         parts.append(f"returns: {', '.join(str(f) for f in rkf)}")
+
+     # Notes
+     notes = entry.get("notes", "")
+     if notes:
+         parts.append(f"notes: {notes}")
+
+     return " | ".join(parts)
+
+
+ # ---------------------------------------------------------------------------
+ # search_endpoints tool
+ # ---------------------------------------------------------------------------
+
+ class SearchEndpoints:
+     """
+     Tool 1 implementation.
+     Loads catalog, builds search index, provides search interface.
+     """
+
+     def __init__(self):
+         self.index = KeywordSearchIndex()
+         self.raw_entries: list[dict] = []
+         self.text_chunks: list[str] = []
+
+     def load_catalog(self, catalog_path: str, app_name: str = ""):
+         """Load a catalog and build the search index."""
+         self.raw_entries = load_catalog(catalog_path)
+         self.text_chunks = [catalog_entry_to_text(e, app_name) for e in self.raw_entries]
+         self.index.add_documents(self.text_chunks)
+
+     def load_from_browser_agent(self, text_chunks: list[str]):
+         """Load text chunks produced by browser_agent Stage 4."""
+         self.text_chunks = text_chunks
+         self.index.add_documents(text_chunks)
+
+     def search(self, query: str, top_k: int = 3) -> list[str]:
+         """
+         Search endpoints by natural language query.
+         Returns top-k matching endpoint schema texts.
+         """
+         results = self.index.search(query, top_k)
+         return [doc for _, _, doc in results]
+
+     def search_with_scores(self, query: str, top_k: int = 3) -> list[tuple[float, str]]:
+         """Search with scores for debugging."""
+         results = self.index.search(query, top_k)
+         return [(score, doc) for _, score, doc in results]
+
+
+ # ---------------------------------------------------------------------------
+ # Test
+ # ---------------------------------------------------------------------------
+
+ if __name__ == "__main__":
+     print("=" * 70)
+     print("TEST: search_endpoints with browser_agent output")
+     print("=" * 70)
+
+     # PRIMARY TEST: load from browser_agent output (this is the real data flow).
+     # In production, search_endpoints searches GEMMA embeddings built by browser_agent
+     # from HAR data. Here we exercise the keyword-search fallback instead of GEMMA.
+     print("\n--- Primary: load from browser_agent HAR output ---")
+     from tool_browser_agent import extract_openapi_spec, spec_entry_to_text
+
+     mock_har_path = os.path.join(os.path.dirname(__file__), "mock_data", "mock_har.json")
+     with open(mock_har_path) as f:
+         har_data = json.load(f)
+
+     spec = extract_openapi_spec(har_data, "http://localhost:7770/")
+     chunks = [spec_entry_to_text(e, "shopping") for e in spec]
+
+     tool = SearchEndpoints()
+     tool.load_from_browser_agent(chunks)
+
+     print(f"\nLoaded {len(tool.text_chunks)} endpoint documents from browser_agent output\n")
+     for i, chunk in enumerate(tool.text_chunks):
+         print(f"  [{i}] {chunk[:100]}...")
+
+     # Test queries against browser_agent output
+     queries = [
+         "find product by name get sku",
+         "create guest cart",
+         "add item to guest cart",
+         "authenticate customer login",
+         "shipping methods for cart",
+         "get cart total",
+         "list categories",
+     ]
+
+     print("\n--- Search Results (from browser_agent HAR output) ---\n")
+     for q in queries:
+         print(f"Query: \"{q}\"")
+         results = tool.search_with_scores(q, top_k=3)
+         for score, doc in results:
+             # Extract just the endpoint name for display
+             ep_match = re.search(r'endpoint: (\S+ \S+)', doc)
+             ep_name = ep_match.group(1) if ep_match else doc[:60]
+             print(f"  [{score:.3f}] {ep_name}")
+         print()
+
+     # SECONDARY TEST: catalog loading (used by the judge for ground truth, NOT by search_endpoints)
+     print("--- Secondary: catalog loading (for judge ground truth, not search_endpoints) ---")
+     catalog_path = os.path.join(os.path.dirname(__file__), "mock_data", "mock_catalog.json")
+
+     tool2 = SearchEndpoints()
+     tool2.load_catalog(catalog_path, app_name="shopping")
+     print(f"  Catalog loaded: {len(tool2.text_chunks)} endpoint documents (judge reference only)")
+
+     results = tool2.search("add item to cart", top_k=1)
+     print("  Query: 'add item to cart' → top result:")
+     print(f"    {results[0][:120]}...")
+
+     print("\n[PASS] search_endpoints tool tests completed successfully")
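The fallback index scores each document as a sum over query terms of length-normalized term frequency times log inverse document frequency. A standalone sketch of the same scoring, with three toy endpoint documents (the document strings here are made up for illustration):

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Same token pattern as the fallback tokenizer
    return re.findall(r"[a-zA-Z0-9_/{}]+", text.lower())

docs = [
    "endpoint: GET /rest/V1/products | search products by name",
    "endpoint: POST /rest/V1/guest-carts | create a guest cart",
    "endpoint: GET /rest/V1/categories | list categories",
]
tokens = [tokenize(d) for d in docs]
n = len(docs)

# Document frequency → IDF; terms unique to one doc score highest
df = Counter(t for toks in tokens for t in set(toks))
idf = {t: math.log(n / (1 + c)) for t, c in df.items()}

query = tokenize("create guest cart")
scores = [
    sum((Counter(toks)[q] / len(toks)) * idf.get(q, 0) for q in query)
    for toks in tokens
]
best = scores.index(max(scores))  # the guest-cart document wins
```

Note that with very small corpora the `log(n / (1 + df))` variant can go to zero or negative for common terms, which is acceptable for ranking but would not be for absolute thresholds.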
tests/tool_search_episode_data.py ADDED
@@ -0,0 +1,273 @@
+ """
+ Tool 3: search_episode_data — BM25 search over episode request/response history.
+
+ Keyword search over all data indexed by curl_exec calls, implemented as a
+ minimal pure-Python BM25 so the tests run without the rank_bm25 dependency.
+ """
+
+ import math
+ import re
+ from collections import Counter
+
+ # ---------------------------------------------------------------------------
+ # Simple BM25 implementation (no external dependencies)
+ # ---------------------------------------------------------------------------
+
+
+ def _tokenize(text: str) -> list[str]:
+     """Tokenize text into words."""
+     return re.findall(r'[a-zA-Z0-9_./{}:]+', text.lower())
+
+
+ class SimpleBM25:
+     """
+     Minimal BM25 implementation for episode data search.
+     No external dependencies — pure Python.
+     """
+
+     def __init__(self, k1: float = 1.5, b: float = 0.75):
+         self.k1 = k1
+         self.b = b
+         self.corpus: list[str] = []
+         self.tokenized: list[list[str]] = []
+         self.doc_len: list[int] = []
+         self.avgdl: float = 0
+         self.idf: dict[str, float] = {}
+         self.n_docs: int = 0
+
+     def index(self, documents: list[str]):
+         """Build BM25 index from documents."""
+         self.corpus = documents
+         self.tokenized = [_tokenize(d) for d in documents]
+         self.doc_len = [len(t) for t in self.tokenized]
+         self.n_docs = len(documents)
+         self.avgdl = sum(self.doc_len) / max(self.n_docs, 1)
+         self._compute_idf()
+
+     def _compute_idf(self):
+         """Compute the standard BM25 IDF for every term in the corpus."""
+         df = Counter()
+         for tokens in self.tokenized:
+             for t in set(tokens):
+                 df[t] += 1
+         self.idf = {
+             term: math.log((self.n_docs - freq + 0.5) / (freq + 0.5) + 1)
+             for term, freq in df.items()
+         }
+
+     def add_documents(self, new_docs: list[str]):
+         """Incrementally add documents and rebuild the index."""
+         self.corpus.extend(new_docs)
+         new_tokenized = [_tokenize(d) for d in new_docs]
+         self.tokenized.extend(new_tokenized)
+         self.doc_len.extend(len(t) for t in new_tokenized)
+         self.n_docs = len(self.corpus)
+         self.avgdl = sum(self.doc_len) / max(self.n_docs, 1)
+         self._compute_idf()
+
+     def search(self, query: str, top_k: int = 5) -> list[tuple[int, float, str]]:
+         """
+         Search for query in corpus.
+         Returns: list of (doc_index, score, document) tuples, sorted by score descending.
+         """
+         query_tokens = _tokenize(query)
+         scores = []
+
+         for i, doc_tokens in enumerate(self.tokenized):
+             score = 0.0
+             tf = Counter(doc_tokens)
+             dl = self.doc_len[i]
+
+             for qt in query_tokens:
+                 if qt not in self.idf:
+                     continue
+                 term_freq = tf.get(qt, 0)
+                 idf = self.idf[qt]
+                 numerator = term_freq * (self.k1 + 1)
+                 denominator = term_freq + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
+                 score += idf * numerator / denominator
+
+             scores.append((i, score, self.corpus[i]))
+
+         scores.sort(key=lambda x: x[1], reverse=True)
+         return scores[:top_k]
+
+
+ # ---------------------------------------------------------------------------
+ # Episode data store
+ # ---------------------------------------------------------------------------
+
+ class EpisodeDataStore:
+     """
+     Per-episode BM25 index over all request/response bodies.
+     Initialized empty at episode start, grows with each curl_exec call.
+     Discarded at episode end.
+     """
+
+     def __init__(self):
+         self.bm25 = SimpleBM25()
+         self.bm25.index([])  # Initialize empty
+
+     def add_documents(self, docs: list[str]):
+         """Add new documents (from a curl_exec call) to the index."""
+         self.bm25.add_documents(docs)
+
+     def search(self, query: str, top_k: int = 5) -> list[str]:
+         """
+         Search episode data by keyword query.
+         Returns top-k matching documents as strings.
+         """
+         if self.bm25.n_docs == 0:
+             return []
+         results = self.bm25.search(query, top_k)
+         return [doc for _, score, doc in results if score > 0]
+
+     def search_with_scores(self, query: str, top_k: int = 5) -> list[tuple[float, str]]:
+         """Search with scores for debugging."""
+         results = self.bm25.search(query, top_k)
+         return [(score, doc) for _, score, doc in results]
+
+     @property
+     def doc_count(self) -> int:
+         return self.bm25.n_docs
+
+     def reset(self):
+         """Clear all data (called at episode end)."""
+         self.bm25 = SimpleBM25()
+         self.bm25.index([])
+
+
+ # ---------------------------------------------------------------------------
+ # Test
+ # ---------------------------------------------------------------------------
+
+ if __name__ == "__main__":
+     print("=" * 70)
+     print("TEST: search_episode_data with simulated episode")
+     print("=" * 70)
+
+     import json
+
+     from tool_curl_exec import build_index_documents
+
+     store = EpisodeDataStore()
+
+     # Simulate an episode: 4 curl_exec calls building up the index
+
+     # Step 1: GET /categories
+     docs = build_index_documents(
+         step=1, method="GET", path="/rest/V1/categories",
+         request_body=None,
+         response_body=json.dumps({
+             "id": 1, "name": "Root",
+             "children_data": [
+                 {"id": 2, "name": "Default Category"},
+                 {"id": 3, "name": "Beauty & Personal Care"}
+             ]
+         }),
+         status_code=200
+     )
+     store.add_documents(docs)
+     print(f"\nStep 1: indexed {len(docs)} docs from GET /categories")
+
+     # Step 2: GET /products (with array items)
+     docs = build_index_documents(
+         step=2, method="GET", path="/rest/V1/products",
+         request_body=None,
+         response_body=json.dumps({
+             "items": [
+                 {"sku": "MH01", "name": "Radiant Tee", "price": 22.0},
+                 {"sku": "MH02", "name": "Breathe-Easy Tank", "price": 34.0},
+                 {"sku": "MH03", "name": "Stellar Solar Jacket", "price": 75.0},
+                 {"sku": "MH04", "name": "Argus All-Weather Tank", "price": 22.0},
+                 {"sku": "WS01", "name": "Iris Workout Top", "price": 29.0},
+             ],
+             "total_count": 5
+         }),
+         status_code=200
+     )
+     store.add_documents(docs)
+     print(f"Step 2: indexed {len(docs)} docs from GET /products (5 items)")
+
+     # Step 3: POST /guest-carts
+     docs = build_index_documents(
+         step=3, method="POST", path="/rest/V1/guest-carts",
+         request_body=None,
+         response_body='"cart-mock-abc123"',
+         status_code=200
+     )
+     store.add_documents(docs)
+     print(f"Step 3: indexed {len(docs)} docs from POST /guest-carts")
+
+     # Step 4: POST /guest-carts/.../items
+     docs = build_index_documents(
+         step=4, method="POST", path="/rest/V1/guest-carts/cart-mock-abc123/items",
+         request_body={"cartItem": {"sku": "MH01", "qty": 1, "quote_id": "cart-mock-abc123"}},
+         response_body=json.dumps({
+             "item_id": 5, "sku": "MH01", "qty": 1,
+             "name": "Radiant Tee", "price": 22.0
+         }),
+         status_code=200
+     )
+     store.add_documents(docs)
+     print(f"Step 4: indexed {len(docs)} docs from POST /guest-carts/.../items")
+
+     print(f"\nTotal documents in episode index: {store.doc_count}")
+
+     # Test searches
+     print("\n--- Search Tests ---\n")
+
+     queries = [
+         ("Radiant Tee sku", "Should find MH01 product"),
+         ("Stellar Solar Jacket price", "Should find MH03 at $75"),
+         ("cart-mock-abc123", "Should find cart ID"),
+         ("Beauty Personal Care", "Should find category"),
+         ("item_id 5", "Should find add-to-cart result"),
+         ("Iris Workout Top", "Should find WS01 product"),
+     ]
+
+     for query, description in queries:
+         results = store.search_with_scores(query, top_k=3)
+         print(f"Query: \"{query}\" ({description})")
+         if results:
+             for score, doc in results:
+                 print(f"  [{score:.3f}] {doc[:120]}...")
+         else:
+             print("  [NO RESULTS]")
+         print()
+
+     # Verify specific lookups
+     print("--- Specific Value Lookups ---\n")
+
+     # Can we find the product SKU from a name?
+     results = store.search("Radiant Tee", top_k=1)
+     found_sku = "MH01" in results[0] if results else False
+     print(f"  Find 'Radiant Tee' SKU: {'PASS' if found_sku else 'FAIL'} ({'MH01' if found_sku else 'not found'})")
+
+     # Can we find the cart ID?
+     results = store.search("cart guest-carts", top_k=3)
+     found_cart = any("cart-mock-abc123" in r for r in results)
+     print(f"  Find cart ID: {'PASS' if found_cart else 'FAIL'}")
+
+     # Can we tell from which step the data came?
+     results = store.search("Radiant Tee", top_k=1)
+     found_step = "step:2" in results[0] if results else False
+     print(f"  Step annotation present: {'PASS' if found_step else 'FAIL'}")
+
+     # Test reset
+     store.reset()
+     assert store.doc_count == 0
+     print(f"\n  Episode reset: doc_count = {store.doc_count} [PASS]")
+
+     print("\n[PASS] search_episode_data tool tests completed successfully")
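The per-term score SimpleBM25 accumulates is the standard Okapi BM25 formula. A worked example with made-up corpus statistics, mirroring the class defaults:

```python
import math

# BM25 parameters (SimpleBM25 defaults)
k1, b = 1.5, 0.75

# Made-up statistics for one query term in one document
n_docs = 10          # documents in the index
doc_freq = 2         # documents containing the term
tf = 3               # occurrences of the term in this document
dl, avgdl = 50, 40   # this document's length and the corpus average

# IDF rewards rare terms; the +1 keeps it non-negative
idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)

# TF saturates with k1; b penalizes documents longer than average
score = idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
```

The `k1` term caps how much repeated occurrences help, and `b` controls how strongly the above-average document length (`dl > avgdl` here) discounts the score.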