HARvestGym / GROUND_TRUTH_EXTRACTION.md

Ground Truth Extraction

Extract the API catalog from each live WebArena container by connecting to EC2 from Cursor, entering each container, and running Claude Code with the prompts below.

Container names confirmed from the running EC2 instance:

| App | Container name | Image |
| --- | --- | --- |
| Shopping | shopping | shopping_final_0712 |
| Shopping Admin | shopping_admin | shopping_admin_final_0719 |
| Forum | forum | postmill-populated-exposed-withimg |
| Wikipedia | kiwix33 | ghcr.io/kiwix/kiwix-serve:3.3.0 |
| Map (web) | openstreetmap-website-web-1 | openstreetmap-website-web |

Skipping GitLab (gitlab) for now: it is returning intermittent 502 errors.
Wikipedia (Kiwix) serves a static ZIM file, so there is no source code to analyze. Its catalog entry is hardcoded at the bottom of this file.


Connection workflow

Cursor → Remote SSH → EC2 host → Dev Containers extension → Attach to Running Container → pick app container → paste prompt into Cursor sidebar

Step 1 — Connect from Cursor to EC2

Cmd+Shift+P → Remote-SSH: Connect to Host → ubuntu@<EC2_IP>

Step 2 — Attach to a container

With the Remote-SSH window open: Cmd+Shift+P → Dev Containers: Attach to Running Container → select the container (e.g. shopping).

Cursor opens a new window with the container's filesystem as the workspace. The full source code is now loaded and indexed; no copying is needed.

Step 3 — Paste the prompt into the Cursor sidebar

Open the AI chat sidebar, paste the prompt for that container (see sections below), and run it. Cursor's AI has the full codebase in context and will write api_catalog.json into the workspace (inside the container).

Step 4 — Copy the output back

The file written inside the container can be downloaded via File → Download in Cursor's remote explorer, or copied to the EC2 host with docker cp and fetched via scp afterwards.

Repeat for each container: shopping, shopping_admin, forum, openstreetmap-website-web-1.


Expected output format

The catalog captures the entire API surface found in the codebase: REST, GraphQL, WebSocket, and form submissions. The api_type field distinguishes them. Do not restrict it to the endpoints used by the 7 training tasks; document everything.

[
  {
    "api_type": "rest",
    "endpoint": "POST /rest/V1/guest-carts/{cartId}/items",
    "auth": "none",
    "path_params": {
      "cartId": {
        "type": "string",
        "source": "PREV_CALL",
        "from_endpoint": "POST /rest/V1/guest-carts",
        "from_field": ".body",
        "notes": "entire response body is the cartId string"
      }
    },
    "body_params": {
      "cartItem.sku":      { "type": "string", "source": "PREV_CALL", "from_endpoint": "GET /rest/V1/products", "from_field": ".items[0].sku" },
      "cartItem.qty":      { "type": "number", "source": "TASK_SPEC" },
      "cartItem.quote_id": { "type": "string", "source": "DERIVED", "same_as": "cartId" }
    },
    "response_key_fields": []
  },
  {
    "api_type": "graphql",
    "endpoint": "POST /graphql",
    "operation_name": "GetProducts",
    "operation_type": "query",
    "auth": "none",
    "variables": {
      "search":    { "type": "String", "source": "TASK_SPEC" },
      "pageSize":  { "type": "Int",    "source": "STATIC", "value": 20 }
    },
    "response_key_fields": [".products.items[].sku", ".products.items[].name"]
  },
  {
    "api_type": "websocket",
    "endpoint": "ws:///realtime",
    "auth": "session_cookie",
    "notes": "describe message protocol; include event names and payload shapes"
  },
  {
    "api_type": "form",
    "endpoint": "POST /submission/create",
    "auth": "session_cookie+csrf",
    "content_type": "application/x-www-form-urlencoded",
    "form_params": {
      "_token": { "type": "string", "source": "AUTH_FLOW", "notes": "hidden input in page HTML" },
      "title":  { "type": "string", "source": "TASK_SPEC" },
      "url":    { "type": "string", "source": "TASK_SPEC", "notes": "optional if body provided" }
    },
    "response_key_fields": []
  }
]
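Before committing a generated catalog, a quick structural pass can catch missing fields early. A minimal sketch in Python; the required-field sets below are inferred from the examples above, not from an official schema, so adjust them as the catalogs evolve:

```python
import json

# Required top-level fields per api_type (inferred from the examples
# in this document; not an official schema).
REQUIRED = {
    "rest": {"endpoint", "auth"},
    "graphql": {"endpoint", "operation_name", "operation_type", "auth"},
    "websocket": {"endpoint", "auth"},
    "form": {"endpoint", "auth", "content_type", "form_params"},
}

def check_entry(entry):
    """Return a list of problems found in one catalog entry."""
    api_type = entry.get("api_type")
    if api_type not in REQUIRED:
        return ["unknown api_type: %r" % (api_type,)]
    missing = REQUIRED[api_type] - entry.keys()
    return ["missing field: " + f for f in sorted(missing)]

def check_catalog(path):
    """Load an api_catalog.json and report problems keyed by entry index."""
    with open(path) as f:
        entries = json.load(f)
    return {i: probs for i, e in enumerate(entries) if (probs := check_entry(e))}
```

An empty result from check_catalog means every entry at least has the expected shape; it says nothing about whether the endpoints actually exist.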

**api_type:** rest | graphql | websocket | form

**source:**

  • TASK_SPEC — given in the task description
  • PREV_CALL — from a prior response this episode; specify from_endpoint + from_field
  • AUTH_FLOW — token / cookie / CSRF from the login flow
  • STATIC — hardcoded in the app; document the actual value
  • DERIVED — aliased from another value (e.g. quote_id = cartId)
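The five source kinds are mechanical enough to resolve with a small helper. A sketch with hypothetical function and argument names (the judge may structure this differently); the '.body' convention and the guest-cart example follow the catalog format above:

```python
import re

def get_path(body, path):
    """Follow a simple jq-style scalar path like '.items[0].sku' into a decoded body."""
    value = body
    for key, idx in re.findall(r"\.(\w+)(?:\[(\d+)\])?", path):
        value = value[key]
        if idx:
            value = value[int(idx)]
    return value

def resolve(spec, name, task_spec, prev, resolved, auth):
    """Resolve one parameter value from its declared source.

    spec      -- the param descriptor from the catalog
    task_spec -- values given in the task description
    prev      -- decoded bodies of earlier calls this episode, keyed by endpoint
    resolved  -- params already resolved for this call (for DERIVED aliases)
    auth      -- tokens/cookies captured during the login flow
    """
    src = spec["source"]
    if src == "TASK_SPEC":
        return task_spec[name]
    if src == "STATIC":
        return spec["value"]
    if src == "AUTH_FLOW":
        return auth[name]
    if src == "DERIVED":
        return resolved[spec["same_as"]]
    if src == "PREV_CALL":
        body = prev[spec["from_endpoint"]]
        # '.body' means the entire response body (the guest-cart cartId case)
        return body if spec["from_field"] == ".body" else get_path(body, spec["from_field"])
    raise ValueError(f"unknown source: {src}")
```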

App 1 — Shopping and Shopping Admin (Magento 2)

Container: shopping
Source root: /var/www/magento2/ (confirmed: the WebArena README runs docker exec shopping /var/www/magento2/bin/magento ...)

Attach to the shopping container in Cursor, open /var/www/magento2/ as the workspace, then paste this prompt into the sidebar:

You are working inside a Magento 2 codebase. Start by exploring the directory structure to orient yourself.

Your job: produce a COMPLETE api_catalog.json covering ALL APIs exposed by this Magento 2
installation, not just a subset. Document every endpoint you find, regardless of whether
it is used in a specific task or not. The goal is a full map of the application's API surface.

API types to scan for (all of them):
1. REST endpoints, declared in webapi.xml files. These are the primary API.
2. GraphQL: Magento has a full GraphQL API parallel to REST. Find the .graphqls schema files
   and document every query and mutation.
3. WebSockets: Magento does not typically use WebSockets, but check. If none found, note it.
4. Admin AJAX endpoints: controllers under adminhtml/ that handle JSON AJAX requests.
   These are separate from the REST API.

For REST and Admin AJAX endpoints, produce:
{
  "api_type": "rest",
  "endpoint": "METHOD /path/{template}",
  "auth": "none | bearer_token | admin_bearer_token | session_cookie",
  "path_params":   { "<name>": { "type": "...", "source": "...", "from_endpoint": "...", "from_field": "...", "notes": "..." } },
  "query_params":  { ... },
  "body_params":   { ... },
  "response_key_fields": ["jq paths that downstream calls will consume"]
}

For GraphQL queries/mutations, produce:
{
  "api_type": "graphql",
  "endpoint": "POST /graphql",
  "operation_name": "...",
  "operation_type": "query | mutation | subscription",
  "auth": "none | bearer_token",
  "variables": { "<name>": { "type": "...", "source": "...", "notes": "..." } },
  "response_key_fields": ["jq paths that downstream calls will consume"]
}

Source types: TASK_SPEC | PREV_CALL | AUTH_FLOW | STATIC | DERIVED
- PREV_CALL: must include from_endpoint and from_field (jq path into that response)
- AUTH_FLOW: any token/cookie obtained during login
- STATIC: include the actual static value

Rules:
- Document REQUIRED parameters only. Skip X-Requested-With, Cache-Control, correlation IDs.
- For guest-cart: POST /rest/V1/guest-carts returns a plain quoted string; that string IS the cartId.
- quote_id in the add-item body equals cartId; mark it DERIVED.
- For searchCriteria filter params, document the exact query string structure.
- For GraphQL: read ALL .graphqls files to find every query, mutation, subscription.

Write the output to api_catalog.json at the root of the codebase.
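As a sanity reference for step 1 of the prompt above (not part of the prompt itself): Magento's webapi.xml routes can be flattened into catalog-style endpoint strings with the standard library. The sample XML mirrors the usual Magento 2 layout, but verify against the real files under app/code and vendor/; the auth mapping here is a rough guess, not the catalog's final word:

```python
import re
import xml.etree.ElementTree as ET

# Trimmed sample in the standard Magento 2 webapi.xml shape; real files
# declare many routes and richer <resources> entries.
SAMPLE = """
<routes>
  <route url="/V1/guest-carts/:cartId/items" method="POST">
    <service class="Magento\\Quote\\Api\\GuestCartItemRepositoryInterface" method="save"/>
    <resources><resource ref="anonymous"/></resources>
  </route>
</routes>
"""

def routes_from_webapi(xml_text):
    """Map each <route> to 'METHOD /rest{path}' with {param} placeholders."""
    out = []
    for route in ET.fromstring(xml_text).iter("route"):
        url = re.sub(r":(\w+)", r"{\1}", route.get("url"))  # :cartId -> {cartId}
        refs = [r.get("ref") for r in route.iter("resource")]
        # Rough guess: anonymous routes need no auth, everything else a token.
        auth = "none" if "anonymous" in refs else "bearer_token"
        out.append({"endpoint": f'{route.get("method")} /rest{url}', "auth": auth})
    return out
```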

App 2 — Forum (Postmill / Symfony)

Container: forum
Source root: /var/www/html/ (confirmed: docker exec forum find / -name "composer.json" returned /var/www/html/composer.json)

Attach to the forum container in Cursor, open /var/www/html/ as the workspace, then paste this prompt into the sidebar:

You are working inside a Postmill forum codebase (PHP / Symfony). Start by exploring the
directory structure to orient yourself.

Your job: produce a COMPLETE api_catalog.json covering ALL HTTP endpoints exposed by this
Postmill installation: every route, every form action, every AJAX endpoint.

API types to scan for:
1. Form submissions (POST, application/x-www-form-urlencoded): the primary interaction pattern
2. JSON AJAX endpoints: controllers that return JsonResponse
3. REST-style endpoints, if any exist under /api/
4. WebSockets: Postmill does not typically use WebSockets, but check for any Mercure
   or Pusher integration. If none, note it.

For form submissions and JSON endpoints, produce:
{
  "api_type": "form" | "rest",
  "endpoint": "METHOD /path/{template}",
  "auth": "none | session_cookie | session_cookie+csrf",
  "content_type": "application/x-www-form-urlencoded | application/json | multipart/form-data",
  "path_params":  { "<name>": { "type": "...", "source": "...", "from_endpoint": "...", "from_field": "...", "notes": "..." } },
  "query_params": { ... },
  "form_params":  { ... },   // use this for form submissions
  "body_params":  { ... },   // use this for JSON body
  "response_key_fields": ["what downstream calls consume from this response"]
}

Source types: TASK_SPEC | PREV_CALL | AUTH_FLOW | STATIC | DERIVED

Postmill-specific notes:
- Login is a form POST. Find the exact CSRF token field name in the security config or login form type.
- All write operations (create post, vote, comment) require session_cookie+csrf.
- The community slug / post ID in path templates come from TASK_SPEC or PREV_CALL.
- Read every FormType class to get the exact field names for each form.
- For CSRF tokens in forms: source is AUTH_FLOW (extracted from the page HTML before submit).

Write the output to api_catalog.json at the root of the codebase.
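Outside the prompt, on the agent/judge side: the AUTH_FLOW CSRF value can be pulled from the login page's hidden input with the standard library before submitting the form. A sketch; the default field name _csrf_token matches what the live verification later in this file reports for /login_check:

```python
from html.parser import HTMLParser

class HiddenInputs(HTMLParser):
    """Collect name -> value for every <input type="hidden"> on a page."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("type") == "hidden":
            self.fields[a.get("name")] = a.get("value", "")

def csrf_token(page_html, field="_csrf_token"):
    """Return the CSRF token embedded in the page, or None if absent."""
    p = HiddenInputs()
    p.feed(page_html)
    return p.fields.get(field)
```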

App 3 — Map (OpenStreetMap / Rails)

Container: openstreetmap-website-web-1
Source root: /app (confirmed: docker exec openstreetmap-website-web-1 ls /app shows Gemfile, app/, config/, db/, etc.)

Attach to the openstreetmap-website-web-1 container in Cursor, open /app as the workspace, then paste this prompt into the sidebar:

You are working inside an OpenStreetMap Rails codebase. Start by exploring the directory
structure to orient yourself.

Your job: produce a COMPLETE api_catalog.json covering ALL HTTP endpoints exposed by this
OpenStreetMap installation: every route, every API endpoint, every format variant.

API types to scan for:
1. REST API under /api/0.6/: the main machine-readable API (XML and JSON variants)
2. Search / geocoding: how place searches are handled (may proxy to Nominatim, or be local)
3. Web interface endpoints: HTML controllers, but also any that return JSON
4. OAuth endpoints: any OAuth 1.0 or 2.0 flows
5. WebSockets: unlikely, but check for ActionCable or similar integration

For each endpoint, produce:
{
  "api_type": "rest" | "form" | "websocket",
  "endpoint": "METHOD /path/{template}",
  "auth": "none | oauth | session_cookie",
  "format_variants": [".json", ".xml"],   // if the endpoint supports multiple formats via extension
  "path_params":  { "<name>": { "type": "...", "source": "...", "from_endpoint": "...", "from_field": "...", "notes": "..." } },
  "query_params": { ... },
  "body_params":  { ... },
  "response_key_fields": ["XPath or jq paths downstream calls consume"]
}

Source types: TASK_SPEC | PREV_CALL | AUTH_FLOW | STATIC | DERIVED

OpenStreetMap-specific notes:
- The /api/0.6/ endpoints return XML by default; a .json suffix returns JSON. Document both variants.
- Node, way, and relation IDs are integers; source is TASK_SPEC for direct tasks, PREV_CALL when
  they come from a search result.
- The search endpoint may be /search or proxied through Nominatim; read the routes and
  controllers carefully to find where geographic searches are handled.
- Read ALL controller files, not just the api/ subdirectory. There may be JSON endpoints in
  the main web controllers too.

Start with the routes file (e.g. config/routes.rb) to get the complete route list, then read each controller. Write the output to api_catalog.json at the root of the codebase.
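For reference when consuming the catalog (not part of the prompt): the /api/0.6/ XML responses parse cleanly with the standard library. A sketch over a trimmed, made-up map/bbox response; real responses carry more attributes and child elements:

```python
import xml.etree.ElementTree as ET

# Illustrative fragment in the shape of a /api/0.6/map response.
SAMPLE = """
<osm version="0.6">
  <node id="101" lat="51.50" lon="-0.09"/>
  <node id="102" lat="51.51" lon="-0.08"/>
  <way id="7"><nd ref="101"/><nd ref="102"/></way>
</osm>
"""

def node_ids(xml_text):
    """Return the integer ids of all top-level <node> elements."""
    root = ET.fromstring(xml_text)
    assert root.tag == "osm"  # sanity check on the response root
    return [int(n.get("id")) for n in root.findall("node")]
```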


App 4 — Wikipedia (Kiwix) — No extraction needed

Kiwix serves a static ZIM file at /data/wikipedia_en_all_maxi_2022-05.zim (confirmed: docker exec kiwix33 find / -name "*.zim" returned that path). There is no application source code to analyze; kiwix-serve is a C++ binary, not a web framework. The catalog entry is hardcoded below.

Hardcoded catalog entry (save as catalogs/wikipedia.json):

{
  "_meta": {
    "generated": "2026-04-08",
    "source": "hardcoded — kiwix-serve binary serves a static ZIM file; no application source to analyze",
    "zim_file": "/data/wikipedia_en_all_maxi_2022-05.zim",
    "search_response": "HTML only — GET /search returns an HTML page; agent must parse <a href> links for article URLs",
    "article_page": "GET /wikipedia_en_all_maxi_2022-05/A/{title} — returns HTML article",
    "websockets": "none"
  },
  "endpoints": [
    {
      "api_type": "rest",
      "endpoint": "GET /search",
      "auth": "none",
      "query_params": {
        "pattern": {
          "type": "string",
          "source": "TASK_SPEC",
          "notes": "the search query, URL-encoded"
        },
        "books.name": {
          "type": "string",
          "source": "STATIC",
          "value": "wikipedia_en_all_maxi_2022-05",
          "notes": "selects which ZIM book to search"
        }
      },
      "response_key_fields": [],
      "notes": "IMPORTANT: response is HTML, not JSON. Parse <a href> anchor links matching /wikipedia_en_all_maxi_2022-05/A/... to extract article slugs."
    },
    {
      "api_type": "rest",
      "endpoint": "GET /wikipedia_en_all_maxi_2022-05/A/{article_title}",
      "auth": "none",
      "path_params": {
        "article_title": {
          "type": "string",
          "source": "PREV_CALL",
          "from_endpoint": "GET /search",
          "from_field": "href attribute of first search result <a> tag",
          "notes": "URL-encoded article slug, e.g. Albert_Einstein. Extract from the href on the search results HTML page."
        }
      },
      "response_key_fields": [],
      "notes": "Returns full HTML article page. HTTP 200 when article exists, 404 when not found."
    }
  ]
}
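Since the search response is HTML, the agent needs a link parser to realize the PREV_CALL chain above. A sketch using the standard library; the anchor markup in the test is illustrative, so check the real /search output before relying on it:

```python
from html.parser import HTMLParser

BOOK = "wikipedia_en_all_maxi_2022-05"

class ArticleLinks(HTMLParser):
    """Collect article slugs from hrefs of the form /{BOOK}/A/{slug}."""
    def __init__(self):
        super().__init__()
        self.slugs = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href", "")
        prefix = f"/{BOOK}/A/"
        if tag == "a" and prefix in href:
            self.slugs.append(href.split(prefix, 1)[1])

def article_slugs(page_html):
    """Return article slugs in the order they appear on the results page."""
    p = ArticleLinks()
    p.feed(page_html)
    return p.slugs
```

The first element of the returned list corresponds to the from_field "href attribute of first search result" in the catalog entry above.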

Validation — smoke-test each catalog entry

After Claude Code writes api_catalog.json for each app, validate a few key entries against the live server before committing:

EC2="ec2-16-59-2-56.us-east-2.compute.amazonaws.com"

# Shopping: guest cart + add item
CART=$(curl -s -X POST http://$EC2:7770/rest/V1/guest-carts \
  -H "Content-Type: application/json" | tr -d '"')
echo "cart_id: $CART"

# Get admin token first (product listing requires auth)
ADMIN_TOKEN=$(curl -s -X POST http://$EC2:7770/rest/V1/integration/admin/token \
  -H "Content-Type: application/json" \
  -d '{"username":"admin","password":"admin1234"}' | tr -d '"')

SKU=$(curl -s "http://$EC2:7770/rest/V1/products?searchCriteria%5BpageSize%5D=1" \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['items'][0]['sku'])")
echo "sku: $SKU"

curl -s -X POST http://$EC2:7770/rest/V1/guest-carts/$CART/items \
  -H "Content-Type: application/json" \
  -d "{\"cartItem\":{\"sku\":\"$SKU\",\"qty\":1,\"quote_id\":\"$CART\"}}" | python3 -m json.tool

If the response is a 200 with item details, the catalog entries for tasks 3–5 are correct.

Or use the automated validator (requires pip install requests):

python3 validate_catalog.py --host ec2-16-59-2-56.us-east-2.compute.amazonaws.com --all

Final structure

All source roots confirmed by running commands on the live EC2 instance:

| Container | Source root | How confirmed |
| --- | --- | --- |
| shopping | /var/www/magento2/ | WebArena README docker exec commands |
| shopping_admin | /var/www/magento2/ | Same |
| forum | /var/www/html/ | find / -name "composer.json" returned /var/www/html/composer.json |
| openstreetmap-website-web-1 | /app | ls /app shows Gemfile, app/, config/, db/ |
| kiwix33 | N/A | Binary server; data at /data/wikipedia_en_all_maxi_2022-05.zim |

After running the AI on each container, download api_catalog.json via Cursor's remote explorer (Right-click → Download) and save locally as:

catalogs/
  shopping.json       ← from shopping container
  shopping_admin.json ← from shopping_admin container
  forum.json          ← from forum container
  osm.json            ← from openstreetmap-website-web-1 container
  wikipedia.json      ← hardcoded above (no container needed)

These five files are committed to the repo and loaded by the judge at startup. They are never regenerated during training.
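The load step on the judge side can be as simple as reading the directory once at startup. A sketch with hypothetical function and key names:

```python
import json
from pathlib import Path

def load_catalogs(catalog_dir):
    """Read every catalogs/*.json once; return {app_name: parsed_catalog}.

    The file stem (e.g. 'forum' from forum.json) becomes the app key.
    Called once at judge startup; the files are never regenerated later.
    """
    return {
        path.stem: json.loads(path.read_text())
        for path in sorted(Path(catalog_dir).glob("*.json"))
    }
```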


Catalog status — live endpoint verification (2026-04-08)

All five catalogs have been extracted and are committed. Below is the result of live testing against ec2-16-59-2-56.us-east-2.compute.amazonaws.com.

Summary table

| Catalog | Endpoints | JSON valid | Structure | Live test | Notes |
| --- | --- | --- | --- | --- | --- |
| shopping.json | 502 | ✅ | ✅ | ✅ PASS | See details below |
| shopping_admin.json | 552 | ✅ | ✅ | ✅ PASS | See details below |
| forum.json | 91 | ✅ | ✅ | ✅ PASS | Login verified with corrected password (see below) |
| osm.json | 217 | ✅ | ✅ | ✅ PASS | See details below |
| wikipedia.json | 2 | ✅ | ✅ | ⚠️ WARN | Search returns HTML, not JSON (catalog corrected) |

Shopping (port 7770) — PASS

Auth: POST /rest/V1/integration/admin/token with admin/admin1234 returns a JWT bearer token. ✅
Guest cart: POST /rest/V1/guest-carts returns a plain quoted string; confirmed this is the cartId. ✅
Product listing: GET /rest/V1/products?searchCriteria[pageSize]=N requires a bearer token; returns full product JSON with total_count: 104368. ✅
Add to cart: POST /rest/V1/guest-carts/{cartId}/items with {sku, qty, quote_id} returns item detail at HTTP 200. ✅

Key finding: GET /rest/V1/products without auth returns HTTP 401 ("consumer isn't authorized"). The catalog documents auth: "admin_bearer_token" for this endpoint, which is correct.


Shopping Admin (port 7780)

Shopping Admin uses the same Magento 2 REST API as Shopping, but accessed on port 7780 with admin credentials. The shopping_admin.json catalog documents the same REST surface with admin-scoped auth. The admin UI itself is a browser-based SPA; its internal AJAX endpoints are documented in the catalog under admin_ajax type entries.


Forum (port 9999)

Homepage: HTTP 200. ✅
Login form structure: Confirmed via HTML inspection. Form action is POST /login_check. CSRF field is _csrf_token (a Symfony token, not form_key). ✅
Login result: POST /login_check with MarvelsGrantMan136/test1234 redirects to / (homepage), so login is successful. ✅ (The original password notarobot from the WebArena defaults was stale; the correct password on this instance is test1234.)

Catalog correctness: The forum.json catalog correctly documents:

  • POST /login_check with _csrf_token, _username, _password
  • All write endpoints require session_cookie+csrf
  • A route_name field on each entry (extra metadata, not used by the judge)

OSM / Map (port 3000)

Capabilities: GET /api/0.6/capabilities returns XML. ✅
Map bbox: GET /api/0.6/map?bbox=-0.1,51.5,0.1,51.6 returns valid OSM XML with <osm> root. ✅

Search finding: GET /search?query=... returns an HTML page (HTTP 200), not JSON. The actual geocoding is dispatched client-side to sub-endpoints:

  • POST /geocoder/search_osm_nominatim — Nominatim-backed search
  • POST /geocoder/search_latlon — coordinate-based search
  • POST /geocoder/search_osm_nominatim_reverse — reverse geocode

Search param name: The catalog documents query as the query param name. Confirmed: GET /search?query=New+York returns HTTP 200 (HTML).


Wikipedia / Kiwix (port 8888)

Search endpoint: GET /search?pattern=...&books.name=wikipedia_en_all_maxi_2022-05 returns HTTP 200. ✅
Article endpoint: GET /wikipedia_en_all_maxi_2022-05/A/Albert_Einstein returns HTTP 200. ✅