kdcyberdude committed
Commit e6ce96e · verified · 1 Parent(s): f2c780a

Upload folder using huggingface_hub

Files changed (50)
  1. .gitattributes +2 -0
  2. BROWSER_AGENT.md +502 -0
  3. BUILD_NOTES.md +205 -0
  4. Dockerfile +69 -0
  5. GROUND_TRUTH_EXTRACTION.md +470 -0
  6. HAR_TASK_LIST.md +276 -0
  7. JUDGE.md +660 -0
  8. README.md +645 -3
  9. TOOLS.md +847 -0
  10. __init__.py +1 -0
  11. catalogs/forum.json +1517 -0
  12. catalogs/osm.json +0 -0
  13. catalogs/shopping.json +0 -0
  14. catalogs/shopping_admin.json +0 -0
  15. catalogs/wikipedia.json +48 -0
  16. client.py +59 -0
  17. hars/forum.har +0 -0
  18. hars/shopping.har +3 -0
  19. hars/shopping_admin.har +3 -0
  20. hars/wikipedia.har +0 -0
  21. inference.py +375 -0
  22. models.py +14 -0
  23. openenv.yaml +6 -0
  24. openenv_harvestgym.egg-info/PKG-INFO +18 -0
  25. openenv_harvestgym.egg-info/SOURCES.txt +20 -0
  26. openenv_harvestgym.egg-info/dependency_links.txt +1 -0
  27. openenv_harvestgym.egg-info/entry_points.txt +2 -0
  28. openenv_harvestgym.egg-info/requires.txt +14 -0
  29. openenv_harvestgym.egg-info/top_level.txt +1 -0
  30. parameter_pools.json +1090 -0
  31. pyproject.toml +37 -0
  32. scripts/build_parameter_pools.py +364 -0
  33. server/__init__.py +0 -0
  34. server/app.py +49 -0
  35. server/episode.py +53 -0
  36. server/judge.py +691 -0
  37. server/models.py +517 -0
  38. server/tools/__init__.py +0 -0
  39. server/tools/browser_agent.py +418 -0
  40. server/tools/curl_exec.py +434 -0
  41. server/tools/search_endpoints.py +93 -0
  42. server/tools/search_episode_data.py +87 -0
  43. tests/mock_data/mock_catalog.json +88 -0
  44. tests/mock_data/mock_har.json +170 -0
  45. tests/test_e2e_episode.py +272 -0
  46. tests/test_real_har.py +93 -0
  47. tests/tool_browser_agent.py +327 -0
  48. tests/tool_curl_exec.py +442 -0
  49. tests/tool_search_endpoints.py +239 -0
  50. tests/tool_search_episode_data.py +273 -0
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+hars/shopping.har filter=lfs diff=lfs merge=lfs -text
+hars/shopping_admin.har filter=lfs diff=lfs merge=lfs -text
BROWSER_AGENT.md ADDED
@@ -0,0 +1,502 @@
# Browser Agent Component

This document describes the browser agent tool used by the HARvestGym RL agent — how it works, how to build it, and how it integrates with the environment.

---

## What It Is

The browser agent is a multi-stage tool the RL agent calls at the start of every episode. Given a natural language task and a URL, it:

1. **Checks if a pre-recorded HAR file exists** for this application
2. If a HAR exists → loads it directly (no browser launched)
3. If no HAR → **launches a real browser** (Chromium via Playwright), connects an LLM, performs the task, and records all network traffic as a HAR file
4. **Processes the HAR** (from either source) to extract an OpenAPI-like spec
5. **Builds GEMMA embeddings** over the extracted spec so `search_endpoints()` can do semantic search
6. **Returns a summary** — the list of API endpoint names and HTTP methods only

The browser agent is a script that orchestrates multiple processing stages. The RL agent sees only the final summary output — a list of endpoints like `GET /products`, `POST /guest-carts`. No headers, no body schemas, no parameter details. To get full details about any endpoint, the agent calls `search_endpoints()` with a natural language query — this searches the GEMMA embeddings built during the browser agent's processing stage.

---

## Library: browser-use

**Repository:** [browser-use/browser-use](https://github.com/browser-use/browser-use)
**Stars:** 86k+ (April 2026)
**License:** MIT
**Language:** Python 3.11+

`browser-use` connects any LLM to a Playwright-controlled browser. The LLM receives the page state (DOM, screenshot, or both), decides on an action (click, type, navigate, extract), and `browser-use` executes it. It uses a sense-plan-act loop with built-in error handling.

Install:

```bash
pip install browser-use
playwright install chromium
```

---

## How It Works: Full Pipeline

### Stage 1 — Obtain HAR Data

The browser agent first checks whether a pre-recorded HAR file exists for the target application. If it does, the browser is never launched — this saves 30–120 seconds per episode.

```python
import json, os

HAR_MAP = {
    ":7770": "hars/shopping.har",
    ":7780": "hars/shopping_admin.har",
    ":9999": "hars/forum.har",
    ":3000": "hars/osm.har",
    ":8888": "hars/wikipedia.har",
}

def resolve_har_path(url: str) -> str | None:
    """Check if a pre-recorded HAR exists for this app."""
    for port_key, path in HAR_MAP.items():
        if port_key in url and os.path.exists(path):
            return path
    return None


async def get_har_data(task: str, url: str, llm_model: str) -> dict:
    """
    Stage 1: Get HAR data — from file if available, from a live browser otherwise.
    Returns the parsed HAR JSON.
    """
    har_path = resolve_har_path(url)

    if har_path:
        # HAR exists — load directly, no browser needed
        with open(har_path) as f:
            return json.load(f)

    # No HAR — run a live browser session and capture traffic
    raw_log = await run_browser_agent_live(task, url, llm_model)
    return convert_raw_log_to_har(raw_log)
```

### Stage 2 — Live Browser Session (only if no HAR exists)

When no pre-recorded HAR is available, the browser agent launches a real Chromium browser, connects the LLM, and performs the task while intercepting all network traffic:

```python
from playwright.async_api import async_playwright
from browser_use import Agent
from langchain_openai import ChatOpenAI

async def run_browser_agent_live(task: str, url: str, llm_model: str) -> list[dict]:
    """
    Runs the browser-use agent on the given task, intercepts all network traffic,
    and returns the raw request/response log.
    """
    requests_log = []

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()

        # Attach network interceptors
        async def on_request(request):
            requests_log.append({
                "type": "request",
                "url": request.url,
                "method": request.method,
                "headers": dict(request.headers),
                "post_data": request.post_data,
            })

        async def on_response(response):
            try:
                body = await response.text()
            except Exception:
                body = None
            requests_log.append({
                "type": "response",
                "url": response.url,
                "status": response.status,
                "headers": dict(response.headers),
                "body": body,
            })

        page.on("request", on_request)
        page.on("response", on_response)

        # Navigate to the app first
        await page.goto(url)

        # Run the browser agent
        llm = ChatOpenAI(model=llm_model, base_url="https://router.huggingface.co/v1")
        agent = Agent(task=task, llm=llm, page=page)
        await agent.run()

        await browser.close()

    return requests_log
```
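
The `convert_raw_log_to_har()` helper referenced in Stage 1 is not shown above. A minimal sketch of what it could look like, under the assumption that it only needs to produce the field subset Stage 3 reads (method, url, header name/value pairs, status, body) — pairing each response with the oldest unmatched request for the same URL:

```python
def convert_raw_log_to_har(raw_log: list[dict]) -> dict:
    """Pair raw request/response records into minimal HAR-style entries.

    Sketch only: matches each response to the oldest unmatched request
    with the same URL and emits just the fields the Stage 3 extractor
    reads. A real HAR 1.2 file carries many more fields (timings, cookies).
    """
    pending: dict[str, list[dict]] = {}  # url -> queue of unmatched requests
    entries = []

    for record in raw_log:
        if record["type"] == "request":
            pending.setdefault(record["url"], []).append(record)
        else:  # response
            queue = pending.get(record["url"], [])
            req = queue.pop(0) if queue else {
                "url": record["url"], "method": "GET", "headers": {}, "post_data": None,
            }
            entries.append({
                "request": {
                    "url": req["url"],
                    "method": req["method"],
                    "headers": [{"name": k, "value": v} for k, v in req["headers"].items()],
                    "postData": {"text": req["post_data"]} if req.get("post_data") else None,
                },
                "response": {
                    "status": record["status"],
                    "headers": [{"name": k, "value": v} for k, v in record["headers"].items()],
                    "content": {"text": record["body"]},
                },
            })

    return {"log": {"version": "1.2", "entries": entries}}
```

URL-based pairing is a simplification; Playwright's `response.request` property would give an exact match if the raw log kept object references.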

### Stage 3 — Filter and Extract OpenAPI-like Spec

The HAR data (from either source) contains everything: fonts, analytics, CDN requests, JS bundles, CSS. The browser agent filters this down and extracts a structured OpenAPI-like specification:

```python
from urllib.parse import urlparse

SKIP_EXTENSIONS = {".css", ".png", ".jpg", ".svg", ".ico", ".woff", ".woff2", ".ttf", ".gif"}
SKIP_DOMAINS = {"google-analytics.com", "doubleclick.net", "cloudflare.com", "cdn.", "fonts.googleapis.com"}
SKIP_PATH_PREFIXES = ["/static/", "/media/", "/_next/", "/assets/", "/__webpack"]

def is_application_api_call(url: str, app_base_url: str) -> bool:
    parsed = urlparse(url)
    app_host = urlparse(app_base_url).netloc

    # Only keep same-origin traffic (this also drops all SKIP_DOMAINS hosts)
    if parsed.netloc != app_host:
        return False

    path = parsed.path.lower()
    for ext in SKIP_EXTENSIONS:
        if path.endswith(ext):
            return False
    for prefix in SKIP_PATH_PREFIXES:
        if path.startswith(prefix):
            return False

    return True


def extract_openapi_spec(har_data: dict, app_base_url: str) -> list[dict]:
    """
    Stage 3: Process HAR entries into an OpenAPI-like spec.
    Each entry becomes a structured endpoint document with method, path,
    query params, request body schema, response body schema, status codes, and auth info.
    """
    entries = har_data["log"]["entries"]
    seen = set()
    spec_entries = []

    for entry in entries:
        req = entry["request"]
        resp = entry["response"]
        raw_url = req["url"]
        method = req["method"]

        # Filter non-API traffic
        if not is_application_api_call(raw_url, app_base_url):
            continue

        # Skip HTML page navigations
        content_type = _get_response_content_type(resp)
        if "text/html" in content_type and method == "GET":
            continue

        # Normalise path: replace IDs with {id}
        parsed = urlparse(raw_url)
        path = _normalise_path(parsed.path)

        # Deduplicate by (method, normalised_path)
        key = f"{method} {path}"
        if key in seen:
            continue
        seen.add(key)

        # Extract auth info
        has_auth = any(
            h["name"].lower() in ("authorization", "x-api-key", "cookie")
            for h in req["headers"]
        )

        # Build endpoint spec document
        spec_entry = {
            "method": method,
            "path": path,
            "query_params": parsed.query or None,
            "request_headers": {h["name"]: h["value"] for h in req["headers"]
                                if h["name"].lower() in ("content-type", "authorization", "x-requested-with")},
            "request_body": _extract_body(req),
            "status_code": resp["status"],
            "response_content_type": content_type,
            "response_body_sample": _truncate_body(resp),
            "auth_observed": has_auth,
        }
        spec_entries.append(spec_entry)

    return spec_entries
```

### Stage 4 — Build GEMMA Embeddings for Search

The extracted spec entries are converted to text documents and embedded using GEMMA embeddings. These embeddings power the `search_endpoints()` tool — when the RL agent queries "how to add item to cart", the semantic search finds the matching endpoint spec.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

def build_endpoint_embeddings(spec_entries: list[dict], app_name: str) -> tuple[np.ndarray, list[str]]:
    """
    Stage 4: Convert spec entries to text chunks and build GEMMA embeddings.
    These embeddings are stored in memory for the duration of the episode,
    enabling search_endpoints() to do semantic search.
    """
    model = SentenceTransformer("google/embeddinggemma-300m")

    chunks = [spec_entry_to_text(entry, app_name) for entry in spec_entries]

    # GEMMA encode_document: "title: {endpoint} | text: {rest of chunk}"
    embeddings = model.encode_document(chunks, batch_size=32)
    # Score queries against these using the similarity metric from the
    # google/embeddinggemma-300m model card

    return embeddings, chunks


def spec_entry_to_text(entry: dict, app_name: str) -> str:
    """Convert a single spec entry to a searchable text document."""
    parts = [
        f"app: {app_name}",
        f"endpoint: {entry['method']} {entry['path']}",
        f"status: {entry['status_code']}",
        f"auth: {'required' if entry['auth_observed'] else 'none'}",
    ]
    if entry.get("query_params"):
        parts.append(f"query: {entry['query_params']}")
    if entry.get("request_body"):
        parts.append(f"body: {entry['request_body']}")
    if entry.get("response_body_sample"):
        parts.append(f"response_sample: {entry['response_body_sample']}")
    return " | ".join(parts)
```

### Stage 5 — Return Summary to RL Agent

The browser agent returns **only a summary** — endpoint names and HTTP methods. No headers, no body schemas, no parameter details. The agent must call `search_endpoints()` to get the full details.

```python
def build_browser_agent_output(spec_entries: list[dict], app_name: str) -> dict:
    """
    Stage 5: Build the summary output returned to the RL agent.
    This is intentionally sparse — just endpoint names and methods.
    """
    summary_endpoints = [
        {"method": entry["method"], "path": entry["path"]}
        for entry in spec_entries
    ]

    return {
        "app": app_name,
        "endpoints": summary_endpoints,
        "total_endpoints": len(summary_endpoints),
        "note": (
            "These endpoints were observed for this application. "
            "Use search_endpoints() with a natural language query to get "
            "the full schema, parameters, and auth details for any endpoint."
        ),
    }
```

### Full Orchestration

```python
async def browser_agent(task: str, url: str) -> dict:
    """
    Complete browser agent pipeline:
    1. Get HAR data (from file or live browser)
    2. Filter and extract an OpenAPI-like spec
    3. Build GEMMA embeddings for search_endpoints()
    4. Return the summary endpoint list to the RL agent
    """
    app_name = resolve_app_name(url)
    llm_model = "browser-use/bu-30b-a3b-preview"

    # Stages 1-2: Get HAR data
    har_data = await get_har_data(task, url, llm_model)

    # Stage 3: Extract OpenAPI-like spec
    spec_entries = extract_openapi_spec(har_data, url)

    # Stage 4: Build GEMMA embeddings (stored in the environment for search_endpoints)
    embeddings, chunks = build_endpoint_embeddings(spec_entries, app_name)
    store_episode_embeddings(app_name, embeddings, chunks)  # makes search_endpoints() work

    # Stage 5: Return summary to RL agent
    return build_browser_agent_output(spec_entries, app_name)
```
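
`resolve_app_name()` is referenced above but not defined. A minimal sketch, assuming the same port-to-app convention as `HAR_MAP` in Stage 1 (the `"unknown"` fallback is an assumption):

```python
APP_BY_PORT = {
    ":7770": "shopping",
    ":7780": "shopping_admin",
    ":9999": "forum",
    ":3000": "osm",
    ":8888": "wikipedia",
}

def resolve_app_name(url: str) -> str:
    """Map a task URL to its application name by port (mirrors HAR_MAP)."""
    for port_key, app_name in APP_BY_PORT.items():
        if port_key in url:
            return app_name
    return "unknown"
```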

---

## Output Example

What the RL agent sees (summary only — no schemas, no headers, no body details):

```json
{
  "app": "shopping",
  "endpoints": [
    {"method": "POST", "path": "/rest/V1/integration/customer/token"},
    {"method": "GET", "path": "/rest/V1/products"},
    {"method": "GET", "path": "/rest/V1/products/{id}"},
    {"method": "POST", "path": "/rest/V1/guest-carts"},
    {"method": "POST", "path": "/rest/V1/guest-carts/{id}/items"},
    {"method": "GET", "path": "/rest/V1/guest-carts/{id}/totals"},
    {"method": "POST", "path": "/rest/V1/guest-carts/{id}/order"},
    {"method": "GET", "path": "/rest/V1/categories"}
  ],
  "total_endpoints": 8,
  "note": "These endpoints were observed for this application. Use search_endpoints() with a natural language query to get the full schema, parameters, and auth details for any endpoint."
}
```

To get full details, the agent calls:

```
search_endpoints("add item to guest cart")
→ returns the full schema: POST /rest/V1/guest-carts/{cartId}/items, body params, auth, response fields
```

---

## How search_endpoints() Uses the Embeddings

The GEMMA embeddings built in Stage 4 are what power `search_endpoints()`. When the RL agent calls `search_endpoints("create guest cart")`:

1. The query is encoded using GEMMA `encode_query`
2. Similarity is computed against all endpoint embeddings
3. The top-3 matching endpoint spec documents are returned with full details

```python
import numpy as np

def search_endpoints(query: str, embeddings, texts, model, top_k=3) -> list[str]:
    q_emb = model.encode_query(query)  # shape: (D,)
    # Use the similarity metric specified by the google/embeddinggemma-300m model card
    scores = np.asarray(model.similarity(q_emb, embeddings).squeeze(0))  # shape: (N,)
    top_idx = np.argsort(scores)[::-1][:top_k]
    return [texts[i] for i in top_idx]
```

The endpoint documents returned by search contain the full extracted spec — method, path, query params, request body structure, response samples, auth requirements. This is the detailed view that complements the summary list from `browser_agent`.

---

## LLM Choice for Browser Agent

We use **[`browser-use/bu-30b-a3b-preview`](https://huggingface.co/browser-use/bu-30b-a3b-preview)** — a model purpose-built and fine-tuned specifically for browser-use tasks.

| Property | Value |
|----------|-------|
| **Base model** | Qwen3-VL-30B-A3B-Instruct |
| **Architecture** | Vision-Language MoE (Mixture of Experts) |
| **Total parameters** | 30B |
| **Active parameters** | 3B (MoE — only 3B fire per forward pass) |
| **Context length** | 65,536 tokens |
| **Specialization** | DOM understanding + visual reasoning for web tasks |

This model is designed to be served with vLLM and integrates directly with the `browser-use` library via its `ChatOpenAI`-compatible interface:

```python
from browser_use import Agent, ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",  # vLLM server
    model="browser-use/bu-30b-a3b-preview",
    temperature=0.6,
    top_p=0.95,
    dont_force_structured_output=True,  # speeds up inference
)

agent = Agent(task=task, llm=llm)
agent.run_sync()
```

Serve with vLLM:

```bash
vllm serve browser-use/bu-30b-a3b-preview \
  --max-model-len 65536 \
  --host 0.0.0.0 \
  --port 8000
```

Because only 3B parameters are active per forward pass (MoE), this model is fast enough for deployment without requiring a full large-model GPU allocation.

---

## Training vs. Inference: What Changes

```
                  Training                           Inference
                  │                                  │
browser_agent     │ HAR file exists → loads from     │ HAR file may not exist
Stage 1           │ disk, no browser launched        │ → launches live browser session
                  │                                  │ → records traffic as HAR
                  │                                  │
browser_agent     │ Processes HAR → extracts spec    │ Same processing pipeline
Stages 3-5        │ → builds GEMMA embeddings        │ on the live-captured traffic
                  │ → returns summary                │
                  │                                  │
curl_exec         │ hits REAL live server            │ hits REAL live server
calls             │ (WebArena EC2)                   │ (WebArena EC2)
                  │                                  │
judge             │ probes REAL live server          │ probes REAL live server
verification      │ to verify task completion        │ to verify task completion
```

**What changes between training and inference:** only Stage 1 — where the HAR data comes from. During training, pre-recorded HAR files exist for all tasks, so the browser is never launched. At inference, the HAR may not exist for novel tasks, so the browser runs live.

**What never changes:** Stages 3-5 (spec extraction, embedding, summary output) run identically regardless of the HAR source. And `curl_exec` always hits the real live server — no responses are ever mocked.

---

## Integration with the Environment

```
RL Environment (FastAPI server)
│
├── receives Action: {tool: "browser_agent", input: {task, url}}
│
├── Stage 1: HAR file exists?
│     ├── YES → load HAR from disk (~0ms)
│     └── NO  → spawn live browser session (30-120s)
│            ├── Playwright + bu-30b-a3b-preview
│            ├── intercept all HTTP traffic
│            └── produce HAR data
│
├── Stage 3: Extract OpenAPI-like spec from HAR
│
├── Stage 4: Build GEMMA embeddings → stored in env for search_endpoints()
│
└── Stage 5: Return summary endpoint list as Observation.last_tool_result
       │ (agent now knows WHAT endpoints exist, but not HOW to call them)

search_endpoints("natural language query")
   → semantic search over GEMMA embeddings
   → returns full endpoint schema with params, auth, response fields

curl_exec("curl -X POST ...")
   → executes against real live WebArena server (EC2)
   → indexes full response into episode BM25 store

search_episode_data("keyword query")
   → BM25 search over indexed responses from this episode

done() → judge evaluates against ground truth
```

---

## Reference Tools

- [browser-use GitHub](https://github.com/browser-use/browser-use) — the core library
- [browser-use docs](https://docs.browser-use.com) — configuration, custom actions, LLM setup
- [Playwright network events](https://playwright.dev/python/docs/network) — request/response interception API
- [har-to-openapi](https://github.com/jonluca/har-to-openapi) — alternative: convert HAR files to OpenAPI spec format
- [jsluice](https://github.com/BishopFox/jsluice) — extract API routes from JavaScript bundles (useful supplement to network interception) — future scope
BUILD_NOTES.md ADDED
@@ -0,0 +1,205 @@
# HARvestGym — Build Notes & Deferred Items

This file captures caveats, deferred implementation decisions, and things to keep in mind during the building phase. It is not a specification — it is a living checklist.

---

## Critical Build-Time Checklist

### 1. `google/embeddinggemma-300m` — License Acceptance Required

**Status:** Deferred to build time
**Action needed:** Accept the Google license for `google/embeddinggemma-300m` at https://huggingface.co/google/embeddinggemma-300m while logged in to Hugging Face. Then ensure `HF_TOKEN` is set in the environment before running any embedding code.

```bash
export HF_TOKEN=hf_...  # must have accepted the google/embeddinggemma-300m license
```

The model also requires `float32` or `bfloat16` — **not `float16`**. If you see activation errors, check the dtype:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m", token=HF_TOKEN)
# Default dtype is float32; explicitly set bfloat16 if on GPU:
# model = SentenceTransformer("google/embeddinggemma-300m", token=HF_TOKEN,
#                             model_kwargs={"torch_dtype": "bfloat16"})
```

---

### 2. Judge Verification — Trajectory-Based, No External Token Needed

**Status:** Resolved in design
**Detail:** The judge does **not** need a pre-set admin token or an outbound probe call to verify task completion. Verification is done by inspecting the episode trajectory already available in the environment.

**Approach:**
- The judge reads the `curl_exec` request/response history from the current episode (already stored in `episode.steps`).
- If the final state-changing call returned a 2xx response with the expected payload (e.g., `item_id` in add-to-cart, `order_id` in checkout, `post` in the forum response body), that response **is** the ground truth — the web server confirmed it.
- No re-probe is needed: the application already validated the request and returned success. The environment trusts that a 2xx from the live server is accurate.

**When a live probe is still used:** Template 3 (add-to-cart) and Template 7 (product creation) optionally re-fetch the resource (cart contents, product by SKU) to double-check state. These probes use the admin credentials the RL agent itself obtained during the episode (extracted from `session_state`), not a pre-configured environment token. If the agent did not authenticate (e.g., it tried to create a product without admin auth), the probe will 401 — which correctly scores the episode as failed.

**Implementation note for the judge:**
```python
# Prefer: check the response body from the agent's own curl calls
for step in episode.steps:
    if step.curl_parsed and step.curl_parsed.status_code == 200:
        body = step.curl_parsed.response_body
        # e.g. Template 3: look for item_id in the add-to-cart response
        if isinstance(body, dict) and "item_id" in body:
            return 1.0

# Fallback live probe (optional, uses the agent's own session token from the episode):
admin_token = _extract_admin_token(episode)  # from the agent's auth step
if admin_token:
    product = _judge_probe(f"GET /rest/V1/products/{sku}", base_url,
                           headers={"Authorization": f"Bearer {admin_token}"})
```

---

### 3. Forum (Postmill) — CSRF Token Position in HTML

**Status:** Handled in design; verify at build time
**Detail:** The HTML truncation limit was raised to 3,000 characters specifically to capture hidden `<input type="hidden" name="_csrf_token">` fields. However, on some Postmill routes, the CSRF token appears after the main nav and the full form body. At build time, test the actual login page HTML to confirm the token appears within the first 3,000 characters.

```bash
curl -s 'http://ec2-...:9999/login' | head -c 3000 | grep _csrf_token
```

If the token is not captured, either:
- Increase `NONJSON_MAX_CHARS` further in `tools/curl_exec.py`
- Or rely on `search_episode_data("_csrf_token")` — the full HTML is indexed before truncation, so the token is always retrievable by keyword search regardless of position.

---

### 4. Wikipedia — HTML Wrapping

**Status:** Designed; implement in `curl_exec`
**Detail:** Wikipedia (Kiwix) returns HTML. The environment wraps all non-JSON responses in a uniform JSON envelope `{status_code, headers, body}` before returning them to the model. This wrapping is already part of `curl_exec`'s response structure (the `body` field is always a string for non-JSON content). No additional wrapping is needed — just ensure the system prompt tells the model to expect HTML strings in `body` for Wikipedia URLs.
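
As a sanity check, the envelope described above can be sketched as follows — a minimal sketch, assuming a helper name and truncation behavior that are illustrative, not the shipped `curl_exec` code:

```python
import json

NONJSON_MAX_CHARS = 3000  # illustrative; see the CSRF note above for the real constant

def wrap_response(status_code: int, headers: dict, raw_body: str) -> dict:
    """Uniform envelope for curl_exec results: JSON bodies stay structured,
    everything else (HTML, plain text) becomes a truncated string."""
    content_type = headers.get("content-type", "")
    if "application/json" in content_type:
        try:
            body = json.loads(raw_body)
        except json.JSONDecodeError:
            body = raw_body[:NONJSON_MAX_CHARS]
    else:
        body = raw_body[:NONJSON_MAX_CHARS]
    return {"status_code": status_code, "headers": headers, "body": body}
```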

---

### 5. Browser Agent — Deferred Live Implementation

**Status:** Deferred
**Detail:** During training, `browser_agent` always loads from pre-recorded HAR files. The live browser agent (using Playwright + `browser-use/bu-30b-a3b-preview` served as a local service) is NOT needed for the initial training run.

At inference time, the live browser agent will be called as a separate service. The interface contract is:

```
# The environment connects to the browser agent service via HTTP:
POST http://browser-agent-service/run
{"task": "...", "url": "..."}
→ {"app": "...", "endpoints": [...], "note": "..."}
```

Implementation details are in `BROWSER_AGENT.md`. Skip for now — HAR files cover the full training set.

---

### 6. OSM (Map Application) — Not In Initial Training Scope

**Status:** Intentionally excluded
**Detail:** The OpenStreetMap application (port 3000) has ground truth catalogs and HAR recording tasks defined, but **no RL task templates target it in the initial training run**. The OSM artifacts are not needed for the first training loop.

Do not spend time on OSM tasks until the 7 current templates are training successfully.

---

### 7. `max_steps` Is 20, Not 12

**Status:** Updated in README and observation space
**Reminder:** All code that initializes episodes must use `max_steps=20`. Search for any hardcoded `12` in the codebase before the first training run:

```bash
grep -r "max_steps.*12\|12.*max_steps" --include="*.py" .
```

---
119
+ ### 8. GRPO Training Configuration
120
+
121
**Status:** Specified — follow the EcomRLVE-Gym pattern
**Reference:** `winner_projects_last_ht/EcomRLVE-Gym/scripts/train_openenv.py` and `src/ecom_rlve/training/grpo.py`

**Stack:** Unsloth + TRL `GRPOTrainer`. The training script structure from EcomRLVE-Gym maps directly onto HARvestGym — replace the environment wrapper and reward functions, keep the training scaffolding.

**Policy model:** `Qwen/Qwen3-1.7B` with 4-bit quantization via Unsloth. LoRA rank 16, targeting `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`.

**Key configuration values (from EcomRLVE-Gym, adapted for HARvestGym):**

```python
from trl import GRPOConfig, GRPOTrainer
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-1.7B",
    max_seq_length=8192,        # must fit 20-step Hard task episodes
    load_in_4bit=True,
    fast_inference=True,        # vLLM-backed fast generation for GRPO rollouts
    max_lora_rank=16,
    gpu_memory_utilization=0.6,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

training_args = GRPOConfig(
    num_generations=4,          # G=4 rollouts per prompt (EcomRLVE default; bump to 8 if VRAM allows)
    temperature=0.7,
    max_prompt_length=4096,     # auto-detect from dataset sample + 20% headroom
    max_completion_length=512,  # one tool call per step; curl commands are short
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    optim="adamw_8bit",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    max_steps=300,
    bf16=True,                  # use bfloat16 on Ampere+ GPUs
    output_dir="outputs/harvestgym_grpo",
)
```

**Reward functions passed to GRPOTrainer (three, like EcomRLVE-Gym):**
1. `format_reward` — does the output parse as a valid tool call? (`+1.0` / `-2.0`)
2. `tool_usage_reward` — is the tool name valid and arguments well-formed? (`+1.0` / `-0.5`)
3. `env_reward` — environment scalar reward from the judge, scaled ×5 to dominate (`-7.5` to `+25.0`)
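
The three rewards above can be sketched as callables with the TRL reward-function signature (a list of completion strings in, a list of floats out). The JSON tool-call format, the tool-name set, and the way the environment scalar is threaded in via `kwargs` are illustrative assumptions here, not the EcomRLVE-Gym implementation:

```python
import json
import re

TOOLS = {"curl_exec", "search_endpoints", "search_episode_data", "browser_agent"}

def _parse_tool_call(text: str):
    """Extract a JSON tool call from a completion; None if malformed (assumed format)."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        return None
    try:
        call = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return call if isinstance(call, dict) and "tool" in call else None

def format_reward(completions, **kwargs):
    # +1.0 when the output parses as a tool call, -2.0 otherwise
    return [1.0 if _parse_tool_call(c) else -2.0 for c in completions]

def tool_usage_reward(completions, **kwargs):
    # +1.0 for a known tool with dict-shaped arguments, -0.5 otherwise
    rewards = []
    for c in completions:
        call = _parse_tool_call(c)
        ok = call is not None and call.get("tool") in TOOLS and isinstance(call.get("args"), dict)
        rewards.append(1.0 if ok else -0.5)
    return rewards

def env_reward(completions, env_rewards=None, **kwargs):
    # scale the judge's scalar x5 so the terminal signal dominates the shaping terms
    env_rewards = env_rewards or [0.0] * len(completions)
    return [5.0 * r for r in env_rewards]
```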

**Curriculum:** Start all episodes on Template 1 (Easy). Introduce Medium templates when Easy success rate > 70%. Introduce Hard templates when Medium success rate > 60%.
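
A minimal sketch of that gating logic, assuming a rolling success window (the window size of 50 is an assumption, the thresholds are from the text above):

```python
from collections import deque

class Curriculum:
    """Unlock template difficulty based on rolling success rates."""

    def __init__(self, window: int = 50):
        self.results = {"easy": deque(maxlen=window), "medium": deque(maxlen=window)}

    def record(self, difficulty: str, success: bool) -> None:
        if difficulty in self.results:
            self.results[difficulty].append(success)

    def _rate(self, difficulty: str) -> float:
        r = self.results[difficulty]
        return sum(r) / len(r) if r else 0.0

    def unlocked(self) -> list[str]:
        levels = ["easy"]
        if self._rate("easy") > 0.70:
            levels.append("medium")
            if self._rate("medium") > 0.60:
                levels.append("hard")
        return levels
```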

**KL coefficient:** Start at `0.01`. If the model diverges from pretrained behavior rapidly (reward collapses after initial improvement), reduce to `0.005`.

---

### 9. System Prompt — Form vs JSON Guidance

**Status:** Designed in README; implement at build time
**Detail:** The system prompt must include explicit instructions on when to use `Content-Type: application/x-www-form-urlencoded` vs `application/json`. Specifically:

```
For Postmill (Forum, port 9999): use form-encoded for login and post creation.
For Magento REST (Shopping/Admin, ports 7770/7780): use application/json.
For Wikipedia (port 8888): GET requests only, no Content-Type needed.
When in doubt: check the endpoint schema returned by search_endpoints() — it specifies the expected Content-Type.
```

---

## Non-Issues (Resolved in Design)

- ~~`store_finding` / `get_findings` tools~~ — **Removed**. Value threading happens through episode `history`.
- ~~`google/embeddinggemma-300m` doesn't exist~~ — **Confirmed real**. Uses `sentence-transformers` with `encode_query`/`encode_document`/`similarity`. Requires HF_TOKEN.
- ~~12 steps too few~~ — **Fixed to 20**.
- ~~Reward signal rewards busy episodes~~ — **Addressed** via curriculum learning + terminal reward dominance design. See README reward section.
- ~~Wikipedia task unwinnable~~ — **Resolved**: check for HTTP 200 + correct URL, not JSON content.
- ~~Forum CSRF handling~~ — **Resolved**: 3,000-char HTML truncation + `search_episode_data` fallback. No dedicated tool needed.
- ~~JUDGE_ADMIN_TOKEN expiry risk~~ — **Resolved**: judge reads trajectory response bodies directly; uses agent's own session token for optional probes only.
- ~~Concurrent episode isolation~~ — **Not needed**: multi-turn retry handles errors; no episode ID embedding required.
- ~~Parameter pool drift~~ — **Not a concern**: no training tasks involve deletion or reorganization; graders compare against expected values, not absolute DB state.
Dockerfile ADDED
@@ -0,0 +1,69 @@
# HARvestGym — OpenEnv Environment
# Multi-stage build using openenv-base

ARG BASE_IMAGE=ghcr.io/meta-pytorch/openenv-base:latest
FROM ${BASE_IMAGE} AS builder

WORKDIR /app

# Install git (for VCS dependencies)
RUN apt-get update && \
    apt-get install -y --no-install-recommends git curl && \
    rm -rf /var/lib/apt/lists/*

# Build mode
ARG BUILD_MODE=in-repo
ARG ENV_NAME=HARvestGym

# Copy the entire project
COPY . /app/env

WORKDIR /app/env

# Ensure uv is available
RUN if ! command -v uv >/dev/null 2>&1; then \
        curl -LsSf https://astral.sh/uv/install.sh | sh && \
        mv /root/.local/bin/uv /usr/local/bin/uv && \
        mv /root/.local/bin/uvx /usr/local/bin/uvx 2>/dev/null || true; \
    fi

# Install dependencies
RUN --mount=type=cache,target=/root/.cache/uv \
    if [ -f uv.lock ]; then \
        uv sync --frozen --no-install-project --no-editable; \
    else \
        uv sync --no-install-project --no-editable; \
    fi

RUN --mount=type=cache,target=/root/.cache/uv \
    if [ -f uv.lock ]; then \
        uv sync --frozen --no-editable; \
    else \
        uv sync --no-editable; \
    fi

# Final stage
FROM ${BASE_IMAGE}

WORKDIR /app

# Enable Gradio web interface
ENV ENABLE_WEB_INTERFACE=true

# Copy venv from builder
COPY --from=builder /app/env/.venv /app/.venv

# Copy project code
COPY --from=builder /app/env /app/env

# Set paths
ENV PATH="/app/.venv/bin:$PATH"
ENV PYTHONPATH="/app/env:$PYTHONPATH"

# Health check
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

EXPOSE 8000

CMD ["sh", "-c", "cd /app/env && uvicorn server.app:app --host 0.0.0.0 --port 8000"]
GROUND_TRUTH_EXTRACTION.md ADDED
@@ -0,0 +1,470 @@
# Ground Truth Extraction

Extract the API catalog from each live WebArena container by connecting to EC2 from Cursor, entering each container, and running Claude Code with the prompts below.

Container names confirmed from the running EC2 instance:

| App            | Container name                | Image                                |
| -------------- | ----------------------------- | ------------------------------------ |
| Shopping       | `shopping`                    | `shopping_final_0712`                |
| Shopping Admin | `shopping_admin`              | `shopping_admin_final_0719`          |
| Forum          | `forum`                       | `postmill-populated-exposed-withimg` |
| Wikipedia      | `kiwix33`                     | `ghcr.io/kiwix/kiwix-serve:3.3.0`    |
| Map (web)      | `openstreetmap-website-web-1` | `openstreetmap-website-web`          |

GitLab (`gitlab`) is skipped for now — it is returning intermittent 502 errors.
Wikipedia (Kiwix) serves a static ZIM file — there is no source code to analyze. Its catalog entry is hardcoded at the bottom of this file.

---

## Connection workflow

```
Cursor → Remote SSH → EC2 host → Dev Containers extension → Attach to Running Container → pick app container → paste prompt into Cursor sidebar
```

**Step 1 — Connect from Cursor to EC2**

`Cmd+Shift+P` → `Remote-SSH: Connect to Host` → `ubuntu@<EC2_IP>`

**Step 2 — Attach to a container**

With the Remote-SSH window open: `Cmd+Shift+P` → `Dev Containers: Attach to Running Container` → select the container (e.g. `shopping`).

Cursor opens a new window with the container's filesystem as the workspace. The full source code is now loaded and indexed — no copying needed.

**Step 3 — Paste the prompt into the Cursor sidebar**

Open the AI chat sidebar, paste the prompt for that container (see sections below), and run it. Cursor's AI has the full codebase in context and will write `api_catalog.json` into the workspace (inside the container).

**Step 4 — Copy the output back**

The file written inside the container can be downloaded via `File → Download` from Cursor's remote explorer, or via `scp` from the EC2 host after the fact.

Repeat for each container: `shopping`, `shopping_admin`, `forum`, `openstreetmap-website-web-1`.

---

## Expected output format

The catalog captures **all API surface** found in the codebase — REST, GraphQL, WebSocket, form submissions. The `api_type` field distinguishes them. Do not restrict to only the endpoints used by the 7 training tasks; document everything.

```json
[
  {
    "api_type": "rest",
    "endpoint": "POST /rest/V1/guest-carts/{cartId}/items",
    "auth": "none",
    "path_params": {
      "cartId": {
        "type": "string",
        "source": "PREV_CALL",
        "from_endpoint": "POST /rest/V1/guest-carts",
        "from_field": ".body",
        "notes": "entire response body is the cartId string"
      }
    },
    "body_params": {
      "cartItem.sku": { "type": "string", "source": "PREV_CALL", "from_endpoint": "GET /rest/V1/products", "from_field": ".items[0].sku" },
      "cartItem.qty": { "type": "number", "source": "TASK_SPEC" },
      "cartItem.quote_id": { "type": "string", "source": "DERIVED", "same_as": "cartId" }
    },
    "response_key_fields": []
  },
  {
    "api_type": "graphql",
    "endpoint": "POST /graphql",
    "operation_name": "GetProducts",
    "operation_type": "query",
    "auth": "none",
    "variables": {
      "search": { "type": "String", "source": "TASK_SPEC" },
      "pageSize": { "type": "Int", "source": "STATIC", "value": 20 }
    },
    "response_key_fields": [".products.items[].sku", ".products.items[].name"]
  },
  {
    "api_type": "websocket",
    "endpoint": "ws:///realtime",
    "auth": "session_cookie",
    "notes": "describe message protocol; include event names and payload shapes"
  },
  {
    "api_type": "form",
    "endpoint": "POST /submission/create",
    "auth": "session_cookie+csrf",
    "content_type": "application/x-www-form-urlencoded",
    "form_params": {
      "_token": { "type": "string", "source": "AUTH_FLOW", "notes": "hidden input in page HTML" },
      "title": { "type": "string", "source": "TASK_SPEC" },
      "url": { "type": "string", "source": "TASK_SPEC", "notes": "optional if body provided" }
    },
    "response_key_fields": []
  }
]
```

**`api_type`:** `rest` | `graphql` | `websocket` | `form`

**`source`:**

- `TASK_SPEC` — given in the task description
- `PREV_CALL` — from a prior response this episode; specify `from_endpoint` + `from_field`
- `AUTH_FLOW` — token / cookie / CSRF from the login flow
- `STATIC` — hardcoded in the app; document the actual value
- `DERIVED` — aliased from another value (e.g. `quote_id` = `cartId`)
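
Because every entry shares this `api_type`/`source` vocabulary, a small structural check catches most extraction mistakes before the judge ever loads a catalog. The required-field rules below are inferred from the format described above; treat this as an illustrative sketch, not the project's actual validator:

```python
VALID_API_TYPES = {"rest", "graphql", "websocket", "form"}
VALID_SOURCES = {"TASK_SPEC", "PREV_CALL", "AUTH_FLOW", "STATIC", "DERIVED"}
PARAM_SECTIONS = ("path_params", "query_params", "body_params", "form_params", "variables")

def check_entry(entry: dict) -> list[str]:
    """Return a list of structural problems with one catalog entry."""
    problems = []
    if entry.get("api_type") not in VALID_API_TYPES:
        problems.append(f"bad api_type: {entry.get('api_type')!r}")
    if "endpoint" not in entry:
        problems.append("missing endpoint")
    for section in PARAM_SECTIONS:
        for name, spec in entry.get(section, {}).items():
            src = spec.get("source")
            if src not in VALID_SOURCES:
                problems.append(f"{section}.{name}: bad source {src!r}")
            elif src == "PREV_CALL" and not (spec.get("from_endpoint") and spec.get("from_field")):
                problems.append(f"{section}.{name}: PREV_CALL needs from_endpoint and from_field")
    return problems
```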

---

## App 1 — Shopping and Shopping Admin (Magento 2)

**Container:** `shopping`
**Source root:** `/var/www/magento2/` (confirmed — WebArena README runs `docker exec shopping /var/www/magento2/bin/magento ...`)

Attach to the `shopping` container in Cursor, open `/var/www/magento2/` as the workspace, then paste this prompt into the sidebar:

```
You are working inside a Magento 2 codebase. Start by exploring the directory structure to orient yourself.

Your job: produce a COMPLETE api_catalog.json covering ALL APIs exposed by this Magento 2
installation — not just a subset. Document every endpoint you find, regardless of whether
it is used in a specific task or not. The goal is a full map of the application's API surface.

API types to scan for (all of them):
1. REST endpoints — declared in webapi.xml files. These are the primary API.
2. GraphQL — Magento has a full GraphQL API parallel to REST. Find the .graphqls schema files
   and document every query and mutation.
3. WebSockets — Magento does not typically use WebSockets, but check. If none found, note it.
4. Admin AJAX endpoints — controllers under adminhtml/ that handle JSON AJAX requests.
   These are separate from the REST API.

For REST and Admin AJAX endpoints, produce:
{
  "api_type": "rest",
  "endpoint": "METHOD /path/{template}",
  "auth": "none | bearer_token | admin_bearer_token | session_cookie",
  "path_params": { "<name>": { "type": "...", "source": "...", "from_endpoint": "...", "from_field": "...", "notes": "..." } },
  "query_params": { ... },
  "body_params": { ... },
  "response_key_fields": ["jq paths that downstream calls will consume"]
}

For GraphQL queries/mutations, produce:
{
  "api_type": "graphql",
  "endpoint": "POST /graphql",
  "operation_name": "...",
  "operation_type": "query | mutation | subscription",
  "auth": "none | bearer_token",
  "variables": { "<name>": { "type": "...", "source": "...", "notes": "..." } },
  "response_key_fields": ["jq paths that downstream calls will consume"]
}

Source types: TASK_SPEC | PREV_CALL | AUTH_FLOW | STATIC | DERIVED
- PREV_CALL: must include from_endpoint and from_field (jq path into that response)
- AUTH_FLOW: any token/cookie obtained during login
- STATIC: include the actual static value

Rules:
- Document REQUIRED parameters only. Skip X-Requested-With, Cache-Control, correlation IDs.
- For guest-cart: POST /rest/V1/guest-carts returns a plain quoted string — that IS the cartId.
- quote_id in add-item body equals cartId — mark DERIVED.
- For searchCriteria filter params, document the exact query string structure.
- For GraphQL: read ALL .graphqls files to find every query, mutation, subscription.

Write the output to api_catalog.json at the root of the codebase.
```

---

## App 2 — Forum (Postmill / Symfony)

**Container:** `forum`
**Source root:** `/var/www/html/` (confirmed — `docker exec forum find / -name "composer.json"` returned `/var/www/html/composer.json`)

Attach to the `forum` container in Cursor, open `/var/www/html/` as the workspace, then paste this prompt into the sidebar:

```
You are working inside a Postmill forum codebase (PHP / Symfony). Start by exploring the
directory structure to orient yourself.

Your job: produce a COMPLETE api_catalog.json covering ALL HTTP endpoints exposed by this
Postmill installation — every route, every form action, every AJAX endpoint.

API types to scan for:
1. Form submissions (POST, application/x-www-form-urlencoded) — the primary interaction pattern
2. JSON AJAX endpoints — controllers that return JsonResponse
3. REST-style endpoints — if any exist under /api/
4. WebSockets — Postmill does not typically use WebSockets but check for any Mercure
   or Pusher integration. If none, note it.

For form submissions and JSON endpoints, produce:
{
  "api_type": "form" | "rest",
  "endpoint": "METHOD /path/{template}",
  "auth": "none | session_cookie | session_cookie+csrf",
  "content_type": "application/x-www-form-urlencoded | application/json | multipart/form-data",
  "path_params": { "<name>": { "type": "...", "source": "...", "from_endpoint": "...", "from_field": "...", "notes": "..." } },
  "query_params": { ... },
  "form_params": { ... },   // use this for form submissions
  "body_params": { ... },   // use this for JSON body
  "response_key_fields": ["what downstream calls consume from this response"]
}

Source types: TASK_SPEC | PREV_CALL | AUTH_FLOW | STATIC | DERIVED

Postmill-specific notes:
- Login is a form POST. Find the exact CSRF token field name in the security config or login form type.
- All write operations (create post, vote, comment) require session_cookie+csrf.
- The community slug / post ID in path templates come from TASK_SPEC or PREV_CALL.
- Read every FormType class to get the exact field names for each form.
- For CSRF tokens in forms: source is AUTH_FLOW (extracted from the page HTML before submit).

Write the output to api_catalog.json at the root of the codebase.
```

---

## App 3 — Map (OpenStreetMap / Rails)

**Container:** `openstreetmap-website-web-1`
**Source root:** `/app` (confirmed — `docker exec openstreetmap-website-web-1 ls /app` shows Gemfile, app/, config/, db/, etc.)

Attach to the `openstreetmap-website-web-1` container in Cursor, open `/app` as the workspace, then paste this prompt into the sidebar:

```
You are working inside an OpenStreetMap Rails codebase. Start by exploring the directory
structure to orient yourself.

Your job: produce a COMPLETE api_catalog.json covering ALL HTTP endpoints exposed by this
OpenStreetMap installation — every route, every API endpoint, every format variant.

API types to scan for:
1. REST API under /api/0.6/ — the main machine-readable API (XML and JSON variants)
2. Search / geocoding — how place searches are handled (may proxy to Nominatim, or local)
3. Web interface endpoints — HTML controllers, but also any that return JSON
4. OAuth endpoints — any OAuth 1.0 or 2.0 flows
5. WebSockets — unlikely but check for ActionCable or similar integration

For each endpoint, produce:
{
  "api_type": "rest" | "form" | "websocket",
  "endpoint": "METHOD /path/{template}",
  "auth": "none | oauth | session_cookie",
  "format_variants": [".json", ".xml"],   // if the endpoint supports multiple formats via extension
  "path_params": { "<name>": { "type": "...", "source": "...", "from_endpoint": "...", "from_field": "...", "notes": "..." } },
  "query_params": { ... },
  "body_params": { ... },
  "response_key_fields": ["XPath or jq paths downstream calls consume"]
}

Source types: TASK_SPEC | PREV_CALL | AUTH_FLOW | STATIC | DERIVED

OpenStreetMap-specific notes:
- The /api/0.6/ endpoints return XML by default; .json suffix returns JSON. Document both variants.
- Node, way, relation IDs are integers — source is TASK_SPEC for direct tasks, PREV_CALL when
  they come from a search result.
- The search endpoint may be /search or proxied through Nominatim — read the routes and
  controllers carefully to find where geographic searches are handled.
- Read ALL controller files, not just the api/ subdirectory. There may be JSON endpoints in
  the main web controllers too.

Start with the routes file (e.g. config/routes.rb) to get the complete route list, then read each controller. Write the output to api_catalog.json at the root of the codebase.
```

---

## App 4 — Wikipedia (Kiwix) — No extraction needed

Kiwix serves a static ZIM file at `/data/wikipedia_en_all_maxi_2022-05.zim` (confirmed — `docker exec kiwix33 find / -name "*.zim"` returned that path). There is no application source code to analyze — `kiwix-serve` is a C++ binary, not a web framework. The catalog entry is hardcoded below.

Hardcoded catalog entry (save as `catalogs/wikipedia.json`):

```json
{
  "_meta": {
    "generated": "2026-04-08",
    "source": "hardcoded — kiwix-serve binary serves a static ZIM file; no application source to analyze",
    "zim_file": "/data/wikipedia_en_all_maxi_2022-05.zim",
    "search_response": "HTML only — GET /search returns HTML page; agent must parse <a href> links for article URLs",
    "article_page": "GET /wikipedia_en_all_maxi_2022-05/A/{title} — returns HTML article",
    "websockets": "none"
  },
  "endpoints": [
    {
      "api_type": "rest",
      "endpoint": "GET /search",
      "auth": "none",
      "query_params": {
        "pattern": {
          "type": "string",
          "source": "TASK_SPEC",
          "notes": "the search query, URL-encoded"
        },
        "books.name": {
          "type": "string",
          "source": "STATIC",
          "value": "wikipedia_en_all_maxi_2022-05",
          "notes": "selects which ZIM book to search"
        }
      },
      "response_key_fields": [],
      "notes": "IMPORTANT: response is HTML, not JSON. Parse <a href> anchor links matching /wikipedia_en_all_maxi_2022-05/A/... to extract article slugs."
    },
    {
      "api_type": "rest",
      "endpoint": "GET /wikipedia_en_all_maxi_2022-05/A/{article_title}",
      "auth": "none",
      "path_params": {
        "article_title": {
          "type": "string",
          "source": "PREV_CALL",
          "from_endpoint": "GET /search",
          "from_field": "href attribute of first search result <a> tag",
          "notes": "URL-encoded article slug, e.g. Albert_Einstein. Extract from the href on the search results HTML page."
        }
      },
      "response_key_fields": [],
      "notes": "Returns full HTML article page. HTTP 200 when article exists, 404 when not found."
    }
  ]
}
```
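
Since `/search` returns HTML, the article slug has to be scraped out of anchor hrefs. A minimal regex-based extraction sketch, where the href shape comes from the catalog entry above and the sample HTML in the test is made up:

```python
import re

BOOK = "wikipedia_en_all_maxi_2022-05"

def extract_article_slugs(search_html: str) -> list[str]:
    """Pull article slugs out of result links shaped like /<book>/A/<slug>."""
    pattern = rf'href="/{re.escape(BOOK)}/A/([^"?#]+)"'
    return re.findall(pattern, search_html)
```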

---

## Validation — smoke-test each catalog entry

After Claude Code writes `api_catalog.json` for each app, validate a few key entries against the live server before committing:

```bash
EC2="ec2-16-59-2-56.us-east-2.compute.amazonaws.com"

# Shopping: guest cart + add item
CART=$(curl -s -X POST http://$EC2:7770/rest/V1/guest-carts \
  -H "Content-Type: application/json" | tr -d '"')
echo "cart_id: $CART"

# Get admin token first (product listing requires auth)
ADMIN_TOKEN=$(curl -s -X POST http://$EC2:7770/rest/V1/integration/admin/token \
  -H "Content-Type: application/json" \
  -d '{"username":"admin","password":"admin1234"}' | tr -d '"')

SKU=$(curl -s "http://$EC2:7770/rest/V1/products?searchCriteria%5BpageSize%5D=1" \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['items'][0]['sku'])")
echo "sku: $SKU"

curl -s -X POST http://$EC2:7770/rest/V1/guest-carts/$CART/items \
  -H "Content-Type: application/json" \
  -d "{\"cartItem\":{\"sku\":\"$SKU\",\"qty\":1,\"quote_id\":\"$CART\"}}" | python3 -m json.tool
```

If the response is a 200 with item details, the catalog entries for tasks 3–5 are correct.

Or use the automated validator (requires `pip install requests`):

```bash
python3 validate_catalog.py --host ec2-16-59-2-56.us-east-2.compute.amazonaws.com --all
```

---

## Final structure

All source roots confirmed by running commands on the live EC2 instance:

| Container                     | Source root          | How confirmed                                                          |
| ----------------------------- | -------------------- | ---------------------------------------------------------------------- |
| `shopping`                    | `/var/www/magento2/` | WebArena README `docker exec` commands                                 |
| `shopping_admin`              | `/var/www/magento2/` | Same                                                                   |
| `forum`                       | `/var/www/html/`     | `find / -name "composer.json"` returned `/var/www/html/composer.json`  |
| `openstreetmap-website-web-1` | `/app`               | `ls /app` shows Gemfile, app/, config/, db/                            |
| `kiwix33`                     | N/A                  | Binary server; data at `/data/wikipedia_en_all_maxi_2022-05.zim`       |

After running the AI on each container, download `api_catalog.json` via Cursor's remote explorer (`Right-click → Download`) and save locally as:

```
catalogs/
  shopping.json          ← from shopping container
  shopping_admin.json    ← from shopping_admin container
  forum.json             ← from forum container
  osm.json               ← from openstreetmap-website-web-1 container
  wikipedia.json         ← hardcoded above (no container needed)
```

These five files are committed to the repo and loaded by the judge at startup. They are never regenerated during training.

---

## Catalog status — live endpoint verification (2026-04-08)

All five catalogs have been extracted and are committed. Below is the result of live testing against `ec2-16-59-2-56.us-east-2.compute.amazonaws.com`.

### Summary table

| Catalog               | Endpoints | JSON valid | Structure | Live test | Notes |
| --------------------- | --------- | ---------- | --------- | --------- | ----- |
| `shopping.json`       | 502       | ✅         | ✅        | ✅ PASS   | See details below |
| `shopping_admin.json` | 552       | ✅         | ✅        | ✅ PASS   | See details below |
| `forum.json`          | 91        | ✅         | ✅        | ⚠️ WARN   | Login not verified (see below) |
| `osm.json`            | 217       | ✅         | ✅        | ✅ PASS   | See details below |
| `wikipedia.json`      | 2         | ✅         | ✅        | ⚠️ WARN   | Search returns HTML not JSON (corrected) |

---

### Shopping (port 7770) — PASS

**Auth:** `POST /rest/V1/integration/admin/token` with `admin`/`admin1234` returns a JWT bearer token. ✅
**Guest cart:** `POST /rest/V1/guest-carts` returns a plain quoted string — confirmed this is the cartId. ✅
**Product listing:** `GET /rest/V1/products?searchCriteria[pageSize]=N` requires bearer token — returns full product JSON with `total_count: 104368`. ✅
**Add to cart:** `POST /rest/V1/guest-carts/{cartId}/items` with `{sku, qty, quote_id}` returns item detail at HTTP 200. ✅

**Key finding:** `GET /rest/V1/products` without auth returns HTTP 401 ("consumer isn't authorized"). The catalog documents `auth: "admin_bearer_token"` for this endpoint — **correct**.

---

### Shopping Admin (port 7780)

Shopping Admin uses the same Magento 2 REST API as Shopping but accessed on port 7780 with admin credentials. The `shopping_admin.json` catalog documents the same REST surface with admin-scoped auth. The admin UI itself is a browser-based SPA — its internal AJAX endpoints are documented in the catalog under `admin_ajax` type entries.

---

### Forum (port 9999)

**Homepage:** HTTP 200. ✅
**Login form structure:** Confirmed via HTML inspection. Form action is `POST /login_check`. CSRF field is `_csrf_token` (Symfony token, not form_key). ✅
**Login result:** `POST /login_check` with `MarvelsGrantMan136`/`test1234` redirects to `/` (homepage) — login successful. ✅ (The original password `notarobot` from WebArena defaults was stale; the correct password on this instance is `test1234`.)

**Catalog correctness:** The `forum.json` catalog correctly documents:
- `POST /login_check` with `_csrf_token`, `_username`, `_password`
- All write endpoints require `session_cookie+csrf`
- `route_name` field on each entry (extra metadata, not used by judge)
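
The one fiddly part of driving that login flow programmatically is pulling `_csrf_token` out of the login-page HTML before posting. A minimal extraction sketch; the attribute ordering assumed by the regex and the sample input in the test are assumptions, not verified against the live page markup:

```python
import re

def extract_csrf(login_html: str) -> str:
    """Pull the Symfony _csrf_token hidden-input value from login-page HTML.
    Assumes name="_csrf_token" appears before value="..." within the input tag."""
    m = re.search(r'name="_csrf_token"[^>]*value="([^"]*)"', login_html)
    if not m:
        raise ValueError("no _csrf_token field found in login page")
    return m.group(1)

# The token then goes into the documented form POST:
#   POST /login_check  with  _csrf_token, _username, _password  (form-encoded)
```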
449
+
450
+ ---
451
+
452
+ ### OSM / Map (port 3000)
453
+
454
+ **Capabilities:** `GET /api/0.6/capabilities` returns XML. ✅
455
+ **Map bbox:** `GET /api/0.6/map?bbox=-0.1,51.5,0.1,51.6` returns valid OSM XML with `<osm>` root. ✅
456
+
457
+ **Search finding:** `GET /search?query=...` returns an **HTML page** (HTTP 200), not JSON. The actual geocoding is dispatched client-side to sub-endpoints:
458
+ - `POST /geocoder/search_osm_nominatim` — Nominatim-backed search
459
+ - `POST /geocoder/search_latlon` — coordinate-based search
460
+ - `POST /geocoder/search_osm_nominatim_reverse` — reverse geocode
461
+
462
+ **Search param name:** The catalog documents `query` as the query param name. Confirmed: `GET /search?query=New+York` returns HTTP 200 (HTML).
463
+
464
+ ---
465
+
466
+ ### Wikipedia / Kiwix (port 8888)
467
+
468
+ **Search endpoint:** `GET /search?pattern=...&books.name=wikipedia_en_all_maxi_2022-05` returns HTTP 200. ✅
469
+ **Article endpoint:** `GET /wikipedia_en_all_maxi_2022-05/A/Albert_Einstein` returns HTTP 200. ✅
470
+ ---
HAR_TASK_LIST.md ADDED
@@ -0,0 +1,276 @@
# HAR Recording Task List

The browser agent performs all of these tasks in a single session with network recording enabled. Every request is captured — no filtering needed. One HAR dump is exported at the end.

**Credentials used throughout:**
- Shopping (customer): `emma.lopez@gmail.com` / `Password.1`
- Shopping Admin: `admin` / `admin1234`
- Forum: `MarvelsGrantMan136` / `test1234`
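
A HAR dump is plain JSON (`log.entries[]`, each entry carrying `request.url` and `request.method` in the standard HAR 1.2 layout), so a quick post-recording sanity check is to count captured requests per host. The aggregation below is an illustrative sketch:

```python
from collections import Counter
from urllib.parse import urlparse

def requests_per_host(har: dict) -> Counter:
    """Count recorded requests per host in a HAR 1.2 dump."""
    hosts = Counter()
    for entry in har.get("log", {}).get("entries", []):
        url = entry.get("request", {}).get("url", "")
        hosts[urlparse(url).netloc] += 1
    return hosts

# Usage sketch:
#   import json
#   har = json.load(open("hars/shopping.har"))
#   print(requests_per_host(har).most_common(5))
```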
9
+
10
+ ---
11
+
12
+ ## App 1 — Shopping (port 7770)
13
+ `http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7770/`
14
+
15
+ ### Guest flows (no login)
16
+
17
+ 1. Open the homepage.
18
+ 2. Click into the **Beauty & Personal Care** top-level category from the nav.
19
+ 3. Navigate to **Beauty & Personal Care > Oral Care > Toothbrushes & Accessories** — let the product list load.
20
+ 4. Search for `"ginger"` using the search bar — let results load.
21
+ 5. Click on any product from the search results — let the product detail page load fully.
22
+ 6. Add that product to the cart (select a quantity if required, click Add to Cart).
23
+ 7. Click the cart icon — open the mini-cart.
24
+ 8. Click **Proceed to Checkout**.
25
+ 9. Fill in the guest checkout shipping form:
26
+ - Email: `test@example.com`
27
+ - First Name: `Test`
28
+ - Last Name: `User`
29
+ - Street: `123 Main St`
30
+ - City: `New York`
31
+ - State: `New York`
32
+ - ZIP: `10001`
33
+ - Country: `United States`
34
+ - Phone: `5551234567`
35
+ 10. Select the first available shipping method and click **Next**.
36
+ 11. On the payment step, leave the default payment method and click **Place Order**.
37
+
38
+ ### Logged-in customer flows
39
+
40
+ 12. Log in with `emma.lopez@gmail.com` / `Password.1` (My Account → Sign In).
41
+ 13. After login, open **My Account** dashboard.
42
+ 14. Navigate to **My Orders** under the account sidebar.
43
+ 15. Click into any existing order to view its detail page.
44
+ 16. Go to **My Wishlist** (account sidebar).
45
+ 17. Navigate to a product — **Sports & Outdoors > Exercise & Fitness** — pick any product.
46
+ 18. Click **Add to Wish List** on that product.
47
+ 19. Go to **My Wishlist** again to confirm it was added.
48
+ 20. From the wishlist, click **Add to Cart** for that same product.
49
+ 21. Go to the cart, change the quantity of one item to `2`, and click **Update Cart**.
50
+ 22. Navigate to **My Account > Address Book** and view existing addresses.
51
+ 23. Log out.
52
+
53
+ ---
54
+
55
+ ## App 2 — Shopping Admin (port 7780)
56
+ `http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7780/admin`
57
+
58
+ > **Note:** Port 7780 serves the customer-facing storefront at the root URL. The Magento Admin panel is at the `/admin` subpath. The browser agent must navigate directly to `/admin` to reach the admin login page.
59
+
60
+ ### Authentication
61
+
62
+ 24. Go to the admin login page at `http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7780/admin`.
63
+ 25. Log in with `admin` / `admin1234`.
64
+
65
+ ### Catalog management
66
+
67
+ 26. Navigate to **Catalog > Products** from the left sidebar.
68
+ 27. Let the product grid load with the default filters.
69
+ 28. Use the search/filter bar to filter products by **Name** containing `"tee"` — apply filters.
70
+ 29. Click into any product from the filtered list to open the product edit form.
71
+ 30. Change the product **Price** to any nearby value (e.g., add $1), scroll down to **Save**.
72
+ 31. Navigate back to **Catalog > Products**.
73
+ 32. Click **Add Product** (top right) — select **Simple Product** if prompted.
74
+ 33. Fill in the new product form:
75
+ - Product Name: `HAR Test Product`
76
+ - SKU: `HAR-TEST-001`
77
+ - Price: `19.99`
78
+ - Quantity: `100`
79
+ - Attribute Set: Default
80
+ 34. Click **Save** on the new product.
81
+
82
+ ### Order management
83
+
84
+ 35. Navigate to **Sales > Orders** from the left sidebar.
85
+ 36. Let the order grid load.
86
+ 37. Click into any existing order to open the order detail view.
87
+ 38. Note the order status. Click **Invoice** (if the button is available) — fill in the invoice form defaults and click **Submit Invoice**.
88
+
89
+ ### Customer management
90
+
91
+ 39. Navigate to **Customers > All Customers**.
92
+ 40. Click into any customer record to view the account detail page.
93
+ 41. In the customer account page, click the **Orders** tab to see their order history.
94
+
95
+ ### Reports
96
+
97
+ 42. Navigate to **Reports > Products > Bestsellers**.
98
+ 43. Navigate to **Reports > Sales > Orders** — let the report load.
99
+
100
+ ### Logout
101
+
102
+ 44. Log out from the admin panel (Admin menu, top right → Sign Out).
103
+
104
+ ---
105
+
106
+ ## App 3 — Forum (port 9999)
107
+ `http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:9999/`
108
+
109
+ ### Guest browsing
110
+
111
+ 45. Open the forum homepage.
112
+ 46. Click on **Forums** in the nav — let the forum list load.
113
+ 47. Click into any available forum/subforum.
114
+ 48. Click on any post/thread to open it.
115
+ 49. Click on any user's username to view their profile page.
116
+
117
+ ### Authenticated flows
118
+
119
+ 50. Click **Log In** and sign in with `MarvelsGrantMan136` / `test1234`.
120
+ 51. After login, return to the homepage — confirm you are logged in.
121
+ 52. Click into a forum that allows posting.
122
+ 53. Click **New Thread**, **Submit Link**, or **Submit Text** (whichever post-creation button is present).
123
+ 54. Fill in the post form:
124
+ - Title: `HAR Test Post - API Coverage`
125
+ - Body/URL: `This is a test post created for HAR recording.`
126
+ 55. Submit the post.
127
+ 56. After submitting, view the created post's page.
128
+ 57. On the post, click the **Comment** or reply area — type a comment: `"Test comment for HAR recording."` — submit it.
129
+ 58. On any other post (not your own), click the **upvote** button.
130
+ 59. On any post, click **Save** / bookmark if the option exists.
131
+ 60. Navigate to your own profile page (click your username in the top bar).
132
+ 61. Click **Log Out**.
133
+
134
+ ---
135
+
136
+ ## App 4 — Map (port 3000)
137
+ `http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:3000/`
138
+
139
+ ### Browse and search
140
+
141
+ 62. Open the map homepage — let the default map tiles load.
142
+ 63. In the **Search** bar at the top, type `"New York"` and press enter/click search — let results load and map pan.
143
+ 64. Click on one of the search results to zoom into that location.
144
+ 65. Zoom in several levels using the `+` button or scroll wheel.
145
+ 66. Zoom back out using the `−` button.
146
+ 67. Pan the map by clicking and dragging to a different area.
147
+ 68. Search for `"London"` — let results load.
148
+ 69. Click **Export** in the top nav — let the export panel open (you don't need to actually download).
149
+ 70. Click on the map to drop a marker — then click **Where is this?** in the top bar with the marker active.
150
+
151
+ ### Node/way detail
152
+
153
+ 71. In the search box, search for `"Central Park"` — click the result.
154
+ 72. Click on any map feature (node/way) that becomes clickable — let the sidebar panel load the feature detail.
155
+
156
+ ---
157
+
158
+ ## Coverage cross-check
159
+
160
+ After completing all tasks above, you should have HAR traffic covering:
161
+
162
+ | App | Auth endpoints | Product/content listing | Item creation/mutation | Session/cookie flows |
163
+ |-----|---------------|------------------------|----------------------|---------------------|
164
+ | Shopping (guest) | — | ✓ category, search | ✓ cart, checkout | ✓ guest session |
165
+ | Shopping (authed) | ✓ login, logout | ✓ orders, wishlist | ✓ wishlist add, cart update | ✓ customer token |
166
+ | Admin | ✓ admin login/logout | ✓ product grid, order grid | ✓ product edit, create, invoice | ✓ admin token |
167
+ | Forum (guest) | — | ✓ forums, posts | — | — |
168
+ | Forum (authed) | ✓ login, logout | ✓ profile | ✓ post create, comment, vote | ✓ CSRF form_key |
169
+ | Map | — | ✓ tile loads, search | — | — |
170
+
171
+ ---
172
+
173
+ ## Initial Run — Browser Agent Tasks for the 7 Training Templates
174
+
175
+ The full task list above covers broad application exploration. For the initial training run, the browser agent only needs to complete the tasks that produce HAR traffic relevant to the **7 task templates defined in [README.md](README.md)**. Below is the minimum set grouped by application so the browser agent can work through one app at a time in a single session.
176
+
177
+ ---
178
+
179
+ ### Shopping (port 7770) — Templates 1, 3, 6
180
+ `http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7770/`
181
+
182
+ Covers: category listing (Easy), add-to-cart (Medium), and full guest checkout (Hard).
183
+
184
+ The browser agent runs the guest checkout flow end-to-end. This single pass captures all the HAR traffic needed for all three Shopping templates — category browsing produces Template 1 traffic, cart creation + item addition produces Template 3 traffic, and the full checkout completes Template 6.
185
+
186
+ 1. Open the Shopping homepage.
187
+ 2. Click into **Beauty & Personal Care** from the nav. *(Template 1: category listing)*
188
+ 3. Navigate to **Beauty & Personal Care > Oral Care > Toothbrushes & Accessories** — let the product list load. *(Template 1: product listing)*
189
+ 4. Search for `"ginger"` using the search bar — let results load. *(Template 3: product lookup)*
190
+ 5. Click on any product from the search results — let the product detail page load fully. *(Template 3: product detail)*
191
+ 6. Add that product to the cart. *(Template 3: cart creation + item addition)*
192
+ 7. Click the cart icon — open the mini-cart. *(Template 3: cart state)*
193
+ 8. Click **Proceed to Checkout**. *(Template 6: checkout begins)*
194
+ 9. Fill in the guest checkout shipping form: *(Template 6: shipping)*
195
+ - Email: `test@example.com`
196
+ - First Name: `Test`, Last Name: `User`
197
+ - Street: `123 Main St`, City: `New York`, State: `New York`, ZIP: `10001`
198
+ - Country: `United States`, Phone: `5551234567`
199
+ 10. Select the first available shipping method and click **Next**. *(Template 6: shipping method)*
200
+ 11. On the payment step, leave the default payment method and click **Place Order**. *(Template 6: payment + order)*
201
+
202
+ **HAR traffic captured:** category tree API, product list/search API, guest cart creation, add-to-cart, estimate-shipping, set-shipping-information, payment-information, place-order.
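Once the HAR is recorded, this coverage can be cross-checked programmatically. A minimal sketch (the expected-path list and helper names are illustrative, not part of the recorded HAR):

```python
import re

# Expected Magento REST paths for Templates 1, 3, and 6 (illustrative list).
EXPECTED_SHOPPING_PATHS = [
    "/rest/V1/categories",
    "/rest/V1/guest-carts",
    "/rest/V1/guest-carts/{id}/items",
    "/rest/V1/guest-carts/{id}/estimate-shipping-methods",
    "/rest/V1/guest-carts/{id}/shipping-information",
    "/rest/V1/guest-carts/{id}/payment-information",
]

def normalize(path: str) -> str:
    """Replace concrete cart IDs/hashes with a {id} placeholder."""
    return re.sub(r"/guest-carts/[A-Za-z0-9]+", "/guest-carts/{id}", path)

def missing_endpoints(har_paths: list[str]) -> list[str]:
    """Return expected paths that never appear in the recorded HAR."""
    seen = {normalize(p) for p in har_paths}
    return [p for p in EXPECTED_SHOPPING_PATHS if p not in seen]
```

Running this against the request paths pulled from the HAR quickly shows whether another browser pass is needed.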
203
+
204
+ ---
205
+
206
+ ### Shopping Admin (port 7780) — Template 7
207
+ `http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7780/admin`
208
+
209
+ Covers: admin product creation (Hard).
210
+
211
+ > **Note:** The root URL on port 7780 shows the customer storefront, not the admin panel. The browser agent must navigate to `/admin` to reach the admin login.
212
+
213
+ 1. Go to the admin login page at `http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7780/admin`.
214
+ 2. Log in with `admin` / `admin1234`.
215
+ 3. Click **Add Product** (top right) — select **Simple Product** if prompted.
216
+ 4. Fill in the new product form:
217
+ - Product Name: `HAR Test Product`
218
+ - SKU: `HAR-TEST-001`
219
+ - Price: `19.99`
220
+ - Quantity: `100`
221
+ - Attribute Set: Default
222
+ 5. Click **Save** on the new product.
223
+
224
+ **HAR traffic captured:** admin auth token flow, product creation POST with full Magento product schema.
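The "full Magento product schema" in that POST reduces to a small required core. A hedged sketch of a minimal creation body (field names follow Magento REST conventions but should be confirmed against the recorded HAR; `attribute_set_id = 4` is the stock "Default" set on a standard install):

```python
def build_product_payload(name: str, sku: str, price: float, qty: int) -> dict:
    """Minimal Magento V1 product-creation body.

    attribute_set_id=4 assumes the stock "Default" attribute set;
    verify the ID against your install / the recorded HAR.
    """
    return {
        "product": {
            "sku": sku,
            "name": name,
            "price": price,
            "attribute_set_id": 4,
            "type_id": "simple",
            "status": 1,       # enabled
            "visibility": 4,   # catalog + search
            "extension_attributes": {
                "stock_item": {"qty": qty, "is_in_stock": True}
            },
        }
    }
```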
225
+
226
+ ---
227
+
228
+ ### Forum (port 9999) — Templates 4, 5
229
+ `http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:9999/`
230
+
231
+ Covers: authenticated category browsing (Medium) and post creation (Hard).
232
+
233
+ The browser agent logs in once, browses categories, then creates a post. This single pass captures traffic for both Forum templates — browsing produces Template 4 traffic, and post creation produces Template 5 traffic.
234
+
235
+ 1. Open the Forum homepage.
236
+ 2. Click on **Forums** in the nav — let the forum list load. *(Template 4: category listing)*
237
+ 3. Log in with `MarvelsGrantMan136` / `test1234`. *(Templates 4 & 5: auth + CSRF token)*
238
+ 4. After login, return to the homepage and confirm you are logged in.
239
+ 5. Click into any available forum/subforum. *(Template 4: authed category browse)*
240
+ 6. Click on any post/thread to open it. *(Template 4: authed post retrieval)*
241
+ 7. Navigate to a forum that allows posting. *(Template 5: post creation begins)*
242
+ 8. Click **New Thread** / **Submit Text**. *(Template 5: creation form)*
243
+ 9. Fill in the post form: *(Template 5: post body)*
244
+ - Title: `HAR Test Post - API Coverage`
245
+ - Body: `This is a test post created for HAR recording.`
246
+ 10. Submit the post. *(Template 5: POST with CSRF form_key)*
247
+ 11. View the created post's page. *(Template 5: confirm creation)*
248
+
249
+ **HAR traffic captured:** login + session/CSRF extraction, forum/subforum listing (authed), thread listing, post creation with form_key.
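Replaying the post-creation flow outside the browser means pulling the CSRF token out of the form HTML first. A minimal extraction sketch (matching any hidden input whose name mentions `token` is an assumption about Postmill's markup, and the regex assumes `type`/`name`/`value` attribute order; confirm both against the HAR):

```python
import re
from html import unescape

def extract_csrf_tokens(html: str) -> dict[str, str]:
    """Collect hidden-input tokens from a login/submit form.

    Assumes attributes appear in type/name/value order, which holds
    for the simple forms seen here but not for arbitrary HTML.
    """
    tokens = {}
    for m in re.finditer(
        r'<input[^>]*type="hidden"[^>]*name="([^"]*token[^"]*)"[^>]*value="([^"]*)"',
        html,
    ):
        tokens[unescape(m.group(1))] = unescape(m.group(2))
    return tokens
```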
250
+
251
+ ---
252
+
253
+ ### Wikipedia (port 8888) — Template 2
254
+ `http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:8888/`
255
+
256
+ Covers: article summary retrieval (Easy).
257
+
258
+ Wikipedia is not covered in the full task list above — these are new tasks for the initial run.
259
+
260
+ 1. Open the Wikipedia homepage.
261
+ 2. Search for any article (e.g., `"Python (programming language)"`).
262
+ 3. Click into the article and let the full page load.
263
+
264
+ **HAR traffic captured:** Kiwix search API, article content retrieval API.
265
+
266
+ ---
267
+
268
+ ### What is NOT needed for the initial run
269
+
270
+ | Skipped section | Tasks | Why not needed |
271
+ |----------------|-------|----------------|
272
+ | Shopping — logged-in customer flows | 12–23 | No template targets authed customer actions (orders, wishlist, address book) |
273
+ | Admin — catalog editing | 26–31 | Template 7 only needs product *creation*, not editing existing products |
274
+ | Admin — orders, customers, reports | 35–44 | No template targets admin read flows |
275
+ | Forum — voting, commenting, bookmarking | 57–61 | Templates 4 & 5 cover browse and post creation only |
276
+ | Map (port 3000) | 62–72 | No template targets the Map application |
JUDGE.md ADDED
@@ -0,0 +1,660 @@
1
+ # HARvestGym Judge Architecture
2
+
3
+ This document specifies the full judge architecture — how task completion is verified and how rewards are computed after each episode ends.
4
+
5
+ The judge is a deterministic, programmatic component. It does **not** use an LLM to score episodes. Every grader produces a score in `[0.0, 1.0]` that is then scaled to the reward range defined in `README.md`.
6
+
7
+ ---
8
+
9
+ ## Overview
10
+
11
+ ```
12
+ Episode ends (model calls done() or max_steps=20 reached)
13
+
14
+
15
+ Judge.evaluate(episode: Episode, task: Task) → EpisodeResult
16
+
17
+ ├─► 1. Identify task template from task.template_id
18
+
19
+ ├─► 2. Run programmatic grader for this template
20
+ │ │
21
+ │ ├─► Probe live application state (HTTP calls from judge, not model)
22
+ │ ├─► Inspect episode trajectory (call sequence, parameter sources)
23
+ │ └─► Compute score in [0.0, 1.0]
24
+
25
+ ├─► 3. Verify parameter sourcing (for partial credit)
26
+ │ │
27
+ │ └─► Cross-reference each curl call against ground truth catalog
28
+
29
+ └─► 4. Compute final reward
30
+
31
+ └─► Combine task score + parameter sourcing + step-level signals
32
+ ```
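The dispatch step at the top of this pipeline can be sketched as follows (the grader registry and defensive clamping are illustrative, not a spec):

```python
from types import SimpleNamespace

def evaluate(episode, task, graders: dict) -> float:
    """Route an episode to the grader registered for its template and
    clamp the result into [0.0, 1.0] (defensive; graders should
    already comply)."""
    grader = graders.get(task.template_id)
    if grader is None:
        raise ValueError(f"no grader registered for template {task.template_id}")
    return max(0.0, min(1.0, grader(episode, task)))
```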
33
+
34
+ ---
35
+
36
+ ## Data Structures
37
+
38
+ ```python
39
+ @dataclass
40
+ class Episode:
41
+ task: Task
42
+ steps: list[Step] # all tool calls and results
43
+ session_state: dict # final session state
44
+ total_steps: int
45
+ terminated_by: str # "done_call" | "max_steps"
46
+
47
+ @dataclass
48
+ class Step:
49
+ step_num: int
50
+ tool: str # browser_agent | search_endpoints | curl_exec | search_episode_data | done
51
+ action: str # raw tool call string
52
+ result: Any # tool return value
53
+ curl_parsed: CurlCall | None # None for non-curl steps
54
+
55
+ @dataclass
56
+ class CurlCall:
57
+ method: str
58
+ url: str
59
+ path: str # normalized (IDs replaced with {id})
60
+ headers: dict
61
+ body: dict | str | None
62
+ status_code: int
63
+ response_body: Any
64
+
65
+ @dataclass
66
+ class Task:
67
+ template_id: int # 1–7
68
+ description: str # instantiated task string (with actual values)
69
+ params: dict # e.g. {"product_name": "Radiant Tee", "sku": "MH01"}
70
+ app: str # shopping | forum | wikipedia | shopping_admin
71
+ base_url: str
72
+ difficulty: str # easy | medium | hard
73
+
74
+ @dataclass
75
+ class EpisodeResult:
76
+ task_score: float # 0.0–1.0 from grader
77
+ parameter_sourcing_score: float # 0.0–1.0 from trajectory analysis
78
+ auth_obtained: bool # did the model successfully authenticate?
79
+ reward: float # final composite reward
80
+ details: dict # per-grader diagnostic info for logging
81
+ ```
82
+
83
+ ---
84
+
85
+ ## Graders: Per-Template Verification
86
+
87
+ Each template has its own grader. Where a task mutates state (carts, posts, products), the grader verifies it with real HTTP calls to the live EC2 application rather than relying solely on the episode trajectory.
88
+
89
+ ### Template 1 — Easy | Shopping: List products in category `{category_name}`
90
+
91
+ **Success condition:** The model's curl call returned a 200 response containing at least one product in the correct category.
92
+
93
+ ```python
94
+ def grade_template_1(episode: Episode, task: Task) -> float:
95
+ category_name = task.params["category_name"]
96
+
97
+ # Find the curl call that returned products
98
+ for step in episode.steps:
99
+ if step.curl_parsed and step.curl_parsed.status_code == 200:
100
+ body = step.curl_parsed.response_body
101
+ if isinstance(body, dict) and "items" in body:
102
+ items = body["items"]
103
+ # Verify at least one item belongs to the target category
104
+ for item in items:
105
+ # Check category_links or category name in item
106
+ if _item_matches_category(item, category_name):
107
+ return 1.0
108
+ # Items returned but wrong category — partial credit
109
+ if len(items) > 0:
110
+ return 0.3
111
+ return 0.0
112
+
113
+ def _item_matches_category(item: dict, category_name: str) -> bool:
114
+ """Check category_links or custom_attributes for category match."""
115
+ # Magento items carry category_links: [{"category_id": N}]
116
+ # Judge verifies by calling GET /rest/V1/categories?searchCriteria[filter...]=name
117
+ # and comparing category IDs. This is a judge-side probe, not relying on model output.
118
+ ...
119
+ ```
120
+
121
+ **Reward mapping:**
122
+
123
+ | Score | Meaning | Reward |
124
+ |-------|---------|--------|
125
+ | 1.0 | Products listed from correct category | +2.0 |
126
+ | 0.3 | Products returned but wrong/unknown category | +0.5 |
127
+ | 0.0 | No valid product list response | −1.5 |
128
+
129
+ ---
130
+
131
+ ### Template 2 — Easy | Wikipedia: Retrieve article for `{title}`
132
+
133
+ **Success condition:** The model issued an HTTP GET that returned a 200 response for a URL containing the article title (or a redirect to it). Content parsing is explicitly not required.
134
+
135
+ ```python
136
+ def grade_template_2(episode: Episode, task: Task) -> float:
137
+ title = task.params["title"]
138
+ title_slug = title.lower().replace(" ", "_")
139
+
140
+ for step in episode.steps:
141
+ if step.curl_parsed and step.curl_parsed.status_code == 200:
142
+ url = step.curl_parsed.url.lower()
143
+ if title_slug in url or title.lower() in url:
144
+ return 1.0
145
+
146
+ # Check for search result that found the article (indirect)
147
+ for step in episode.steps:
148
+ if step.curl_parsed and step.curl_parsed.status_code == 200:
149
+ body_str = str(step.curl_parsed.response_body).lower()
150
+ if title.lower() in body_str and "wiki" in step.curl_parsed.url.lower():
151
+ return 0.5 # found reference but didn't fetch the article directly
152
+
153
+ return 0.0
154
+ ```
155
+
156
+ **Reward mapping:**
157
+
158
+ | Score | Reward |
159
+ |-------|--------|
160
+ | 1.0 | Correct article URL fetched with 200 | +2.0 |
161
+ | 0.5 | Article title found in search results but not fetched | +0.5 |
162
+ | 0.0 | No Wikipedia response | −1.5 |
163
+
164
+ ---
165
+
166
+ ### Template 3 — Medium | Shopping: Add `{product_name}` to a guest cart
167
+
168
+ **Success condition:** Judge probes the cart after the episode to verify the item is present.
169
+
170
+ ```python
171
+ def grade_template_3(episode: Episode, task: Task) -> float:
172
+ product_name = task.params["product_name"]
173
+ sku = task.params.get("sku") # known from parameter pool
174
+
175
+ # Extract cart_id from episode trajectory
176
+ cart_id = _extract_cart_id(episode)
177
+ if not cart_id:
178
+ return _partial_score_no_cart(episode)
179
+
180
+ # Judge probes the live application
181
+ cart_response = _judge_probe(
182
+ f"GET /rest/V1/guest-carts/{cart_id}",
183
+ task.base_url
184
+ )
185
+ if not cart_response or cart_response.status_code != 200:
186
+ return 0.1 # cart was created but can't be verified
187
+
188
+ items = cart_response.body.get("items", [])
189
+ for item in items:
190
+ if item.get("sku") == sku or _fuzzy_match(item.get("name", ""), product_name):
191
+ return 1.0
192
+
193
+ # Cart exists but item not in it
194
+ if len(items) == 0 and cart_id:
195
+ return 0.2 # cart created, item not added
196
+
197
+ return 0.0
198
+
199
+ def _partial_score_no_cart(episode: Episode) -> float:
200
+ """Partial credit: did the model attempt the right sequence?"""
201
+ attempted_cart_create = any(
202
+ s.curl_parsed and "guest-carts" in s.curl_parsed.path
203
+ and s.curl_parsed.method == "POST"
204
+ for s in episode.steps if s.curl_parsed
205
+ )
206
+ return 0.15 if attempted_cart_create else 0.0
207
+ ```
208
+
209
+ **Reward mapping:**
210
+
211
+ | Score | Reward |
212
+ |-------|--------|
213
+ | 1.0 | Item confirmed in cart via judge probe | +3.5 |
214
+ | 0.2 | Cart created, item not added | +0.5 |
215
+ | 0.15 | Correct call attempted, cart not created | +0.3 |
216
+ | 0.0 | No valid attempt | −1.5 |
217
+
218
+ ---
219
+
220
+ ### Template 4 — Medium | Forum: Retrieve all posts in `{forum_category}` (authed)
221
+
222
+ **Success condition:** The model authenticated and fetched a post listing that includes posts from the target category.
223
+
224
+ ```python
225
+ def grade_template_4(episode: Episode, task: Task) -> float:
226
+ forum_category = task.params["forum_category"]
227
+ score = 0.0
228
+
229
+ # Check authentication was obtained
230
+ auth_obtained = _check_forum_auth(episode)
231
+ if auth_obtained:
232
+ score += 0.3 # auth is partial credit on its own (see reward table)
233
+
234
+ # Find a curl call that returned a post listing for the correct category
235
+ for step in episode.steps:
236
+ if step.curl_parsed and step.curl_parsed.status_code == 200:
237
+ url = step.curl_parsed.url
238
+ body = step.curl_parsed.response_body
239
+
240
+ # Postmill returns post listings at /f/{category}
241
+ if f"/f/{forum_category.lower()}" in url.lower():
242
+ if _response_contains_posts(body):
243
+ return 1.0
244
+
245
+ return score # 0.3 if only auth, 0.0 if nothing
246
+
247
+ def _check_forum_auth(episode: Episode) -> bool:
248
+ """Authentication: a POST to /login returned a redirect (302) or 200 with session cookie."""
249
+ for step in episode.steps:
250
+ if step.curl_parsed:
251
+ if step.curl_parsed.method == "POST" and "/login" in step.curl_parsed.path:
252
+ if step.curl_parsed.status_code in (200, 302):
253
+ return True
254
+ return False
255
+ ```
256
+
257
+ **Reward mapping:**
258
+
259
+ | Score | Reward |
260
+ |-------|--------|
261
+ | 1.0 | Authenticated + posts fetched from correct category | +3.5 |
262
+ | 0.3 | Authentication only, no post fetch | +0.8 |
263
+ | 0.0 | No valid attempt | −1.5 |
264
+
265
+ ---
266
+
267
+ ### Template 5 — Hard | Forum: Create a post titled `{title}` in `{category}`
268
+
269
+ **Success condition:** Judge probes the forum category page after the episode to verify the post exists.
270
+
271
+ ```python
272
+ def grade_template_5(episode: Episode, task: Task) -> float:
273
+ title = task.params["title"]
274
+ category = task.params["category"]
275
+
276
+ # Phase 1: check authentication
277
+ auth_ok = _check_forum_auth(episode)
278
+
279
+ # Phase 2: check CSRF token was extracted and used
280
+ csrf_used = _check_csrf_in_trajectory(episode)
281
+
282
+ # Phase 3: judge probes the forum to verify post exists
283
+ posts = _judge_probe_forum_category(category, task.base_url)
284
+ for post in posts:
285
+ if _fuzzy_match(post.get("title", ""), title):
286
+ return 1.0
287
+
288
+ # Partial credit breakdown
289
+ if auth_ok and csrf_used:
290
+ return 0.5 # got auth and CSRF right, but post didn't land
291
+ if auth_ok:
292
+ return 0.3
293
+ return 0.0
294
+
295
+ def _check_csrf_in_trajectory(episode: Episode) -> bool:
296
+ """Check that a POST body contained a _csrf_token field."""
297
+ for step in episode.steps:
298
+ if step.curl_parsed and step.curl_parsed.method == "POST":
299
+ body_str = str(step.curl_parsed.body or "")
300
+ if "_csrf_token" in body_str and len(body_str) > 20:
301
+ return True
302
+ return False
303
+ ```
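`_fuzzy_match`, used by several graders, is also not defined here. A minimal token-overlap sketch (the 0.8 threshold is an illustrative choice):

```python
import re

def fuzzy_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Loose title/name comparison: lowercase, strip punctuation,
    and require sufficient token overlap between the two strings."""
    tok_a = set(re.findall(r"[a-z0-9]+", a.lower()))
    tok_b = set(re.findall(r"[a-z0-9]+", b.lower()))
    if not tok_a or not tok_b:
        return False
    overlap = len(tok_a & tok_b) / min(len(tok_a), len(tok_b))
    return overlap >= threshold
```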
304
+
305
+ **Reward mapping:**
306
+
307
+ | Score | Reward |
308
+ |-------|--------|
309
+ | 1.0 | Post confirmed in forum via judge probe | +5.0 |
310
+ | 0.5 | Auth + CSRF correct, post not created | +1.5 |
311
+ | 0.3 | Auth only | +0.8 |
312
+ | 0.0 | No valid attempt | −1.5 |
313
+
314
+ ---
315
+
316
+ ### Template 6 — Hard | Shopping: Guest checkout for `{product_name}`
317
+
318
+ **Success condition:** A complete order was created. Judge checks for an order ID in the trajectory and optionally probes the admin API.
319
+
320
+ ```python
321
+ def grade_template_6(episode: Episode, task: Task) -> float:
322
+ sku = task.params.get("sku")
323
+
324
+ # Check for order ID in trajectory (checkout success returns an integer order ID)
325
+ for step in episode.steps:
326
+ if step.curl_parsed and step.curl_parsed.status_code == 200:
327
+ body = step.curl_parsed.response_body
328
+ # Magento checkout success: POST /rest/V1/guest-carts/{id}/order returns integer
329
+ if isinstance(body, int) and body > 0:
330
+ return 1.0
331
+ # Magento checkout success: body could also be JSON with "order_id"
332
+ if isinstance(body, dict) and body.get("order_id"):
333
+ return 1.0
334
+
335
+ # Partial credit: did the model get through cart + item + shipping?
336
+ stages = _checkout_stages_completed(episode, sku)
337
+ if stages >= 4: # cart + item + shipping + payment hit, but no order ID returned
338
+ return 0.6
339
+ if stages >= 2: # cart + item
340
+ return 0.3
341
+ if stages >= 1: # cart only
342
+ return 0.1
343
+ return 0.0
344
+
345
+ def _checkout_stages_completed(episode: Episode, sku: str) -> int:
346
+ """Count how many checkout stages the model completed successfully."""
347
+ stages = 0
348
+ paths_hit = {s.curl_parsed.path for s in episode.steps if s.curl_parsed and s.curl_parsed.status_code == 200}
349
+
350
+ if any("guest-carts" in p and "{" not in p for p in paths_hit): stages += 1 # cart created
351
+ if any("guest-carts" in p and "items" in p for p in paths_hit): stages += 1 # item added
352
+ if any("guest-carts" in p and "shipping" in p for p in paths_hit): stages += 1 # shipping
353
+ if any("guest-carts" in p and "payment" in p for p in paths_hit): stages += 1 # payment/order
354
+ return stages
355
+ ```
356
+
357
+ **Reward mapping:**
358
+
359
+ | Score | Reward |
360
+ |-------|--------|
361
+ | 1.0 | Order created (order_id in response) | +5.0 |
362
+ | 0.6 | 4+ stages completed | +2.5 |
363
+ | 0.3 | Cart + item only | +0.8 |
364
+ | 0.1 | Cart only | +0.3 |
365
+ | 0.0 | No valid attempt | −1.5 |
366
+
367
+ ---
368
+
369
+ ### Template 7 — Hard | Shopping Admin: Create product with SKU `{sku}`, price `{price}`
370
+
371
+ **Success condition:** Judge probes the admin API to confirm the product exists with the correct SKU and price.
372
+
373
+ ```python
374
+ def grade_template_7(episode: Episode, task: Task) -> float:
375
+ sku = task.params["sku"]
376
+ price = float(task.params["price"])
377
+
378
+ # Phase 1: check admin authentication
379
+ admin_token = _extract_admin_token(episode)
380
+ if not admin_token:
381
+ return 0.0
382
+
383
+ # Phase 2: judge probes the Magento REST API to confirm product exists
384
+ product = _judge_probe(
385
+ f"GET /rest/V1/products/{sku}",
386
+ task.base_url,
387
+ headers={"Authorization": f"Bearer {admin_token}"}
388
+ )
389
+ if not product or product.status_code != 200:
390
+ # Product might exist under a different auth context — try admin token from env
391
+ product = _judge_probe_with_env_admin_token(f"GET /rest/V1/products/{sku}", task.base_url)
392
+
393
+ if product and product.status_code == 200:
394
+ actual_price = float(product.body.get("price", -1))
395
+ price_ok = abs(actual_price - price) < 0.01
396
+ return 1.0 if price_ok else 0.7 # product exists but wrong price
397
+
398
+ # Partial credit: correct API called with correct schema
399
+ if _attempted_product_creation(episode, sku):
400
+ return 0.2
401
+
402
+ return 0.0
403
+
404
+ def _extract_admin_token(episode: Episode) -> str | None:
405
+ """Find admin bearer token from a POST /rest/V1/integration/admin/token response."""
406
+ for step in episode.steps:
407
+ if step.curl_parsed and step.curl_parsed.status_code == 200:
408
+ if "integration/admin/token" in step.curl_parsed.path:
409
+ body = step.curl_parsed.response_body
410
+ if isinstance(body, str) and len(body) > 10:
411
+ return body.strip('"')
412
+ return None
413
+ ```
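`_attempted_product_creation` is likewise undefined in this document. A sketch over a flattened `(method, path, body)` view of the curl steps (the flattening is for illustration):

```python
def attempted_product_creation(calls: list[tuple], sku: str) -> bool:
    """True if any POST to the Magento products endpoint carried the
    target SKU in its body, i.e. the model called the right API with
    the right schema even if the product never landed."""
    for method, path, body in calls:
        if method == "POST" and path.rstrip("/").endswith("/rest/V1/products"):
            if sku in str(body):
                return True
    return False
```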
414
+
415
+ **Reward mapping:**
416
+
417
+ | Score | Reward |
418
+ |-------|--------|
419
+ | 1.0 | Product confirmed in Magento with correct price | +5.0 |
420
+ | 0.7 | Product exists but wrong price | +2.0 |
421
+ | 0.2 | Admin auth + correct endpoint called | +0.5 |
422
+ | 0.0 | No admin auth | −1.5 |
423
+
424
+ ---
425
+
426
+ ## Parameter Sourcing Verification
427
+
428
+ In addition to the task-specific grader, the judge runs a parameter sourcing analysis over the full episode trajectory. This cross-references each curl call against the ground truth catalog to verify that parameter values were obtained from the correct sources.
429
+
430
+ ```python
431
+ def verify_parameter_sourcing(episode: Episode, task: Task, catalog: list[dict]) -> float:
432
+ """
433
+ Returns a score in [0.0, 1.0] representing how correctly the model
434
+ sourced parameter values across all curl calls in the episode.
435
+
436
+ Checks each curl call against the ground truth catalog entry for that endpoint.
437
+ """
438
+ correct = 0
439
+ total = 0
440
+
441
+ for step in episode.steps:
442
+ if not step.curl_parsed:
443
+ continue
444
+
445
+ catalog_entry = _find_catalog_entry(step.curl_parsed.path, step.curl_parsed.method, catalog)
446
+ if not catalog_entry:
447
+ continue
448
+
449
+ # Check each parameter in the curl call
450
+ for param_name, param_meta in catalog_entry.get("path_params", {}).items():
451
+ total += 1
452
+ value_used = _extract_path_param(step.curl_parsed.url, param_name, catalog_entry)
453
+ if value_used and _param_sourced_correctly(value_used, param_meta, episode, step):
454
+ correct += 1
455
+
456
+ for param_name, param_meta in catalog_entry.get("body_params", {}).items():
457
+ total += 1
458
+ value_used = _extract_body_param(step.curl_parsed.body, param_name)
459
+ if value_used and _param_sourced_correctly(value_used, param_meta, episode, step):
460
+ correct += 1
461
+
462
+ if total == 0:
463
+ return 0.0
464
+ return correct / total
465
+
466
+ def _param_sourced_correctly(value: Any, param_meta: dict, episode: Episode, step: Step) -> bool:
467
+ """
468
+ Verify that a parameter value came from the expected source.
469
+
470
+ Source types:
471
+ TASK_SPEC — value must appear in the task description string
472
+ PREV_CALL — value must appear in a prior step's response body
473
+ AUTH_FLOW — value must come from an auth response (token, session)
474
+ STATIC — value must match a known constant (e.g., store_id = 1)
475
+ DERIVED — value must be derivable from another parameter in this call
476
+ """
477
+ source = param_meta.get("source")
478
+
479
+ if source == "TASK_SPEC":
480
+ return str(value) in episode.task.description
481
+
482
+ elif source == "PREV_CALL":
483
+ from_endpoint = param_meta.get("from_endpoint")
484
+ from_field = param_meta.get("from_field")
485
+ # Check prior steps for a response from from_endpoint with value at from_field
486
+ for prior_step in episode.steps:
487
+ if prior_step.step_num >= step.step_num:
488
+ break
489
+ if prior_step.curl_parsed:
490
+ if _path_matches(prior_step.curl_parsed.path, from_endpoint):
491
+ extracted = _extract_field(prior_step.curl_parsed.response_body, from_field)
492
+ if str(extracted) == str(value):
493
+ return True
494
+ return False
495
+
496
+ elif source == "AUTH_FLOW":
497
+ # Value must appear in a session_state field or auth response
498
+ return str(value) in str(episode.session_state.values())
499
+
500
+ elif source == "STATIC":
501
+ expected = param_meta.get("value")
502
+ return str(value) == str(expected)
503
+
504
+ elif source == "DERIVED":
505
+ same_as = param_meta.get("same_as")
506
+ # Value must equal another param in the same call
507
+ # (e.g., quote_id must equal cart_id which is in the path)
508
+ if same_as and step.curl_parsed:
509
+ other_value = _extract_param_from_call(step.curl_parsed, same_as)
510
+ return str(value) == str(other_value)
511
+ return False
512
+
513
+ return False
514
+ ```
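The `from_field` lookup above relies on `_extract_field`, which is not shown. A minimal dotted-path resolver sketch (the `items.0.sku` path syntax is an assumption about the catalog format):

```python
def extract_field(body, dotted_path: str):
    """Resolve a dotted field path (e.g. "items.0.sku") against a
    JSON-like response body; returns None when any hop is missing."""
    current = body
    for key in dotted_path.split("."):
        if isinstance(current, dict):
            current = current.get(key)
        elif isinstance(current, list) and key.isdigit():
            idx = int(key)
            current = current[idx] if idx < len(current) else None
        else:
            return None
    return current
```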
515
+
516
+ ---
517
+
518
+ ## Final Reward Computation
519
+
520
+ ```python
521
+ def compute_reward(
522
+ task_score: float,
523
+ parameter_sourcing_score: float,
524
+ step_rewards: float, # accumulated per-step rewards from README reward table
525
+ auth_obtained: bool,
526
+ task: Task,
527
+ terminated_by: str
528
+ ) -> float:
529
+ """
530
+ Combines task grader score, parameter sourcing, and step-level signals
531
+ into the final episode reward.
532
+ """
533
+ # Map task score to outcome reward (scales with difficulty tier)
534
+ tier_multipliers = {"easy": 1.0, "medium": 1.75, "hard": 2.5}
535
+ tier = task.difficulty
536
+ m = tier_multipliers.get(tier, 1.0)
537
+
538
+ if task_score == 1.0:
539
+ outcome_reward = 2.0 * m # +2.0 (easy), +3.5 (medium), +5.0 (hard)
540
+ elif task_score >= 0.5:
541
+ outcome_reward = 0.5 * m # partial
542
+ elif task_score > 0.0:
543
+ outcome_reward = 0.15 * m # minimal attempt credit
544
+ else:
545
+ outcome_reward = -1.5 # complete failure (same across tiers)
546
+
547
+ # Auth bonus: applies even on task failure — model learned authentication
548
+ auth_bonus = 0.3 if auth_obtained and task_score < 1.0 else 0.0
549
+
550
+ # Parameter sourcing bonus (weighted into outcome, not additive)
551
+ # Only applied when task succeeds partially — avoids rewarding "busy" episodes
552
+ param_bonus = 0.0
553
+ if 0.0 < task_score < 1.0:
554
+ param_bonus = parameter_sourcing_score * 0.5 * m
555
+
556
+ total = outcome_reward + auth_bonus + param_bonus + step_rewards
557
+ return round(total, 4)
558
+ ```
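As a quick sanity check on the arithmetic, the same logic can be reproduced in a self-contained sketch (mirroring the function above, not importing it):

```python
def reward_sketch(task_score, param_score, step_rewards, auth_obtained, tier):
    # Condensed re-statement of compute_reward for spot-checking values
    m = {"easy": 1.0, "medium": 1.75, "hard": 2.5}[tier]
    if task_score == 1.0:
        outcome = 2.0 * m
    elif task_score >= 0.5:
        outcome = 0.5 * m
    elif task_score > 0.0:
        outcome = 0.15 * m
    else:
        outcome = -1.5
    auth_bonus = 0.3 if auth_obtained and task_score < 1.0 else 0.0
    param_bonus = param_score * 0.5 * m if 0.0 < task_score < 1.0 else 0.0
    return round(outcome + auth_bonus + param_bonus + step_rewards, 4)
```

For example, `reward_sketch(1.0, 0.8, 0.85, False, "medium")` gives 4.35: the medium-tier outcome (2.0 × 1.75 = 3.5) plus 0.85 accumulated step rewards, with no auth or parameter bonus on a full success.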
559
+
560
+ **Reward separation guarantee:**
561
+
562
+ | Episode type | Approximate total reward |
563
+ |---|---|
564
+ | Easy task success (perfect param sourcing) | +2.0 to +3.2 |
565
+ | Easy task failure (busy with steps) | −1.5 + max_step_rewards ≈ −0.2 |
566
+ | Hard task success | +5.0 to +7.5 |
567
+ | Hard task failure (some progress) | −1.5 + partial ≈ −0.5 to +1.5 |
568
+
569
+ The terminal outcome reward dominates for complete successes and complete failures. Partial episodes sit in the middle — GRPO can distinguish all three signal zones.
570
+
571
+ ---
572
+
573
+ ## Judge Utilities
574
+
575
+ ```python
576
+ import os
+ import requests
+
+ def _judge_probe(path: str, base_url: str, headers: dict | None = None) -> ProbeResult | None:
577
+ """
578
+ The judge makes its own HTTP calls to verify application state.
579
+ These calls are NOT part of the episode trajectory and do NOT affect rewards.
580
+ Judge probes use a dedicated admin token from environment variables.
581
+ """
582
+ url = base_url.rstrip("/") + path
583
+ admin_headers = {"Authorization": f"Bearer {os.environ['JUDGE_ADMIN_TOKEN']}"}
584
+ if headers:
585
+ admin_headers.update(headers)
586
+     try:
+         resp = requests.get(url, headers=admin_headers, timeout=10)
+     except requests.RequestException:
+         return None
+     try:
+         body = resp.json() if resp.text else None
+     except ValueError:
+         body = resp.text  # non-JSON probe response (e.g. an HTML error page)
+     return ProbeResult(status_code=resp.status_code, body=body)
591
+
592
+ def _fuzzy_match(s1: str, s2: str, threshold: float = 0.85) -> bool:
593
+ """Case-insensitive substring or similarity match."""
594
+ s1, s2 = s1.lower().strip(), s2.lower().strip()
595
+ if s1 == s2 or s1 in s2 or s2 in s1:
596
+ return True
597
+ # Jaccard similarity as fallback
598
+ tokens1, tokens2 = set(s1.split()), set(s2.split())
599
+ if not tokens1 or not tokens2:
600
+ return False
601
+ return len(tokens1 & tokens2) / len(tokens1 | tokens2) >= threshold
602
+ ```
603
+
604
+ ---
605
+
606
+ ## Parameter Pool Alignment
607
+
608
+ The judge is aware that parameter pools are pre-built snapshots of the live application state. For graders that verify values (e.g., SKU, price), the comparison is:
609
+
610
+ - **SKU matching:** exact string match (SKUs are immutable in Magento)
611
+ - **Price matching:** float comparison with ±0.01 tolerance
612
+ - **Product name matching:** fuzzy match with 85% threshold (handles whitespace/casing)
613
+ - **Category name matching:** fuzzy match, verified against live category tree
614
+
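A minimal sketch of the exact-value matchers described above (the fuzzy name matcher is the `_fuzzy_match` utility shown earlier):

```python
def sku_matches(observed: str, expected: str) -> bool:
    # SKUs are immutable in Magento: exact string match only
    return observed == expected

def price_matches(observed: float, expected: float, tol: float = 0.01) -> bool:
    # Float comparison with a ±0.01 tolerance
    return abs(float(observed) - float(expected)) <= tol
```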
615
+ The judge does **not** penalize the model if the live application has drifted from the parameter pool (e.g., a product was deleted). In this case, the episode is flagged as `invalid_episode` in the logs and excluded from the training batch. The `build_parameter_pools.py` script should be re-run to refresh the pool if too many episodes are flagged.
616
+
617
+ ---
618
+
619
+ ## Concurrent Episode Isolation
620
+
621
+ All judge probes use read-only endpoints (GETs, admin token reads) to avoid interfering with other concurrent training episodes. The judge never issues write calls to the live application — it only reads state to verify what the model did.
622
+
623
+ Write isolation (preventing two concurrent episodes from interfering with each other) is handled at the training harness level, not the judge level:
624
+
625
+ - For **Easy** tasks (read-only): no isolation needed
626
+ - For **Medium** tasks (cart operations): each episode uses a fresh guest cart; carts are session-scoped and do not conflict
627
+ - For **Hard** tasks (post creation, product creation): episode IDs are embedded in the task params (e.g., the episode ID is appended to the SKU: `{sku}_{episode_id}`) to prevent naming collisions
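The SKU scoping scheme is a one-liner, sketched here for completeness (function name is illustrative):

```python
def episode_scoped_sku(base_sku: str, episode_id: str) -> str:
    # Embed the episode ID so concurrent episodes never create colliding SKUs
    return f"{base_sku}_{episode_id}"
```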
628
+
629
+ ---
630
+
631
+ ## Logging and Diagnostics
632
+
633
+ Every episode produces a structured log entry:
634
+
635
+ ```json
636
+ {
637
+ "episode_id": "ep_1234",
638
+ "template_id": 3,
639
+ "task_description": "Add 'Radiant Tee' to a guest cart",
640
+ "task_score": 1.0,
641
+ "parameter_sourcing_score": 0.8,
642
+ "auth_obtained": false,
643
+   "reward": 4.35,
644
+ "step_rewards": 0.85,
645
+ "terminated_by": "done_call",
646
+ "total_steps": 7,
647
+ "grader_details": {
648
+ "cart_id_found": "cart-abc123",
649
+ "item_confirmed_in_cart": true,
650
+ "item_sku": "MH01"
651
+ },
652
+ "parameter_sourcing_details": [
653
+ {"step": 5, "param": "cartId", "source": "PREV_CALL", "correct": true},
654
+ {"step": 7, "param": "cartItem.sku", "source": "PREV_CALL", "correct": true},
655
+ {"step": 7, "param": "cartItem.quote_id", "source": "DERIVED", "correct": true}
656
+ ]
657
+ }
658
+ ```
659
+
660
+ These logs drive the training analytics and help identify which parameter sourcing patterns the model is still learning.
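One useful aggregation over these logs (a sketch; field names taken from the entry above) is per-source parameter accuracy, which shows which sourcing patterns still fail:

```python
from collections import defaultdict

def sourcing_accuracy(episode_logs: list[dict]) -> dict[str, float]:
    # Fraction of correctly sourced parameters, grouped by source type
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for log in episode_logs:
        for detail in log.get("parameter_sourcing_details", []):
            totals[detail["source"]] += 1
            hits[detail["source"]] += int(detail["correct"])
    return {src: hits[src] / totals[src] for src in totals}
```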
README.md CHANGED
@@ -1,10 +1,652 @@
1
  ---
2
  title: HARvestGym
3
- emoji:
4
  colorFrom: blue
5
- colorTo: pink
6
  sdk: docker
7
  pinned: false
8
  ---
9
 
10
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
1
  ---
2
  title: HARvestGym
3
+ emoji: 🕸️
4
  colorFrom: blue
5
+ colorTo: purple
6
  sdk: docker
7
  pinned: false
8
+ tags:
9
+ - openenv
10
+ - reinforcement-learning
11
+ - api-agent
12
+ - web-tasks
13
+ base_path: /web
14
+ ---
15
+
16
+ # HARvestGym
17
+
18
+ *Core idea: Trains LLMs to reverse-engineer and complete web tasks through raw HTTP APIs. No browser. No docs. Just a URL and a task.*
19
+
20
+ ### Can a small model learn to explore the API surface of any web application — and complete real tasks through those APIs, without ever opening a browser?
21
+
22
+ Web applications are full of APIs. Every click in a browser triggers an HTTP call with a precise schema, a specific authentication header, an exact sequence of prerequisites. **HARvestGym trains a small model to do all of that directly** — given a task and a URL, it discovers the relevant endpoints, understands what each one needs, chains the calls in the right order, and completes the task without any browser.
23
+
24
+ The model starts with nothing: no schema, no documentation, no endpoint list. It uses tools to explore — issuing requests, inspecting responses, building up its own understanding of how the application works. This is what a developer does when they reverse-engineer an API. The model learns to do the same.
25
+
26
+ Given a URL and a task string, the agent must discover which endpoints exist, figure out schemas and parameter dependencies, and execute the right sequence. Zero prior knowledge.
27
+
28
+ ## What the Model (Policy) Is Learning
29
+
30
+ Given: a natural language task + a live web application URL. No prior knowledge of the application.
31
+
32
+ The model calls `browser_agent` first — this returns the list of API endpoints the browser used to complete the task. The model now has a map: it knows what endpoints exist. What it does not know:
33
+
34
+ - which of those endpoints are actually needed for this specific task
35
+ - in what order they must be called (you cannot add to a cart before the cart exists)
36
+ - where each required parameter value comes from
37
+ - how to re-authenticate if a session expires mid-episode
38
+
39
+ The model must learn to:
40
+
41
+ 1. **Discover endpoints** — by using a browser agent tool that completes the same task in a real browser while recording all network traffic, then filtering that traffic to extract only the meaningful application API calls (stripping out CDN requests, analytics, static assets). The browser agent runs once and generates the raw discovery data; the model uses this as its starting context.
42
+ 2. **Select the right endpoints** — from the browser agent's list, identify the subset relevant to the current task (not every observed endpoint is needed)
43
+ 3. **Sequence calls correctly** — determine the prerequisite order (create cart → find product → add item), including calls that must happen before others even though the task description doesn't say so
44
+ 4. **Thread parameters** — this is the hardest part. APIs form a dependency graph:
45
+ - Some values come from a previous response (`cart_id` from step 1 → path param in step 3)
46
+ - Some values come from the authentication flow (`form_key`, `Bearer token` → header in every subsequent call)
47
+ - Some values come from the task description (`product name` → search query → `sku` → body of add-item call)
48
+ - The ground truth catalog defines these relationships precisely; the model learns to navigate them
49
+ 5. **Handle auth and errors** — detect 401 / session-expired responses, re-authenticate, and continue; interpret 4xx errors and adjust the next call accordingly
50
+
51
+ ---
52
+
53
+ ## Architecture
54
+
55
+ ```
56
+ TRAINING LOOP
57
+ ┌─────────────────────────────────────────────────────────────────────────┐
58
+ │ │
59
+ │ Task + App URL │
60
+ │ │ │
61
+ │ ▼ │
62
+ │ ┌────────────────────────────────────────────────────────────────┐ │
63
+ │ │ Policy Model (RL Agent) │ │
64
+ │ │ small model — no prior knowledge of the app │ │
65
+ │ │ │ │
66
+ │ │ Observation: task + history + session_state + last_result │ │
67
+ │ │ │ │
68
+ │ │ Step 1 ──► browser_agent(task, url) │ │
69
+ │ │ Step 2+ ──► search_endpoints(query) │ │
70
+ │ │ ──► curl_exec(command) │ │
71
+ │ │ ──► search_episode_data(query) │ │
72
+ │ │ ──► done(result) │ │
73
+ │ └────────┬───────────────────────────────────────────────────────┘ │
74
+ │ │ │
75
+ │ ┌──────┴──────────────────────────────┐ │
76
+ │ │ │ │
77
+ │ ▼ ▼ │
78
+ │ ┌─────────────────────┐ ┌─────────────────────────────────────┐ │
79
+ │ │ Browser Agent │ │ Environment │ │
80
+ │ │ (step 1 only) │ │ │ │
81
+ │ │ │ │ • Executes curl_exec via subprocess│ │
82
+ │ │ Training: │ │ • Auto-injects session cookies │ │
83
+ │ │ Load pre-recorded │ │ • Smart-truncates response bodies │ │
84
+ │ │ cached HAR from │ │ • Indexes full responses into │ │
85
+ │ │ disk or launch │ │ per-episode BM25 + GEMMA store │ │
86
+ │ │ on real browser │ │ • Manages session_state: cookies, │ │
87
+ │ │ │ │ CSRF tokens, auth headers │ │
88
+ │ │ Inference: │ └──────────────┬──────────────────────┘ │
89
+ │ │ Launch real browser│ │ │
90
+ │ │ via Playwright + │ │ HTTP calls (always live) │
91
+ │ │ bu-30b-a3b-preview │ ▼ │
92
+ │ │ │ ┌─────────────────────────────────────┐ │
93
+ │ │ Both paths produce: │ │ WebArena EC2 (live apps) │ │
94
+ │ │ • Filtered HAR │ │ │ │
95
+ │ │ • OpenAPI-like spec│ │ :7770 Shopping (Magento 2) │ │
96
+ │ │ • GEMMA embeddings │ │ :7780 Shopping Admin │ │
97
+ │ │ for search_ │ │ :9999 Forum (Postmill) │ │
98
+ │ │ endpoints() │ │ :8888 Wikipedia (Kiwix) │ │
99
+ │ └─────────────────────┘ │ :3000 Map (OpenStreetMap) │ │
100
+ │ └──────────────┬──────────────────────┘ │
101
+ │ │ │
102
+ │ │ episode trajectory │
103
+ │ ▼ │
104
+ │ ┌─────────────────────────────────────┐ │
105
+ │ │ Deterministic Judge │ │
106
+ │ │ │ │
107
+ │ │ Per-template programmatic grader: │ │
108
+ │ │ • Inspects episode trajectory │ │
109
+ │ │ • Optionally probes live app state │ │
110
+ │ │ • Verifies parameter sourcing │ │
111
+ │ │ (TASK_SPEC / PREV_CALL / │ │
112
+ │ │ AUTH_FLOW / STATIC / DERIVED) │ │
113
+ │ │ • Scores [0.0 → 1.0] │ │
114
+ │ └──────────────┬──────────────────────┘ │
115
+ │ │ │
116
+ │ ▼ │
117
+ │ ┌─────────────────────────────────────┐ │
118
+ │ │ Reward Signal │ │
119
+ │ │ │ │
120
+ │ │ Per-step: │ │
121
+ │ │ +0.2 valid API call (2xx) │ │
122
+ │ │ +0.1 new path explored │ │
123
+ │ │ +0.25 correct param sourcing │ │
124
+ │ │ −0.15 repeated identical call │ │
125
+ │ │ −0.3 browser_agent called again │ │
126
+ │ │ │ │
127
+ │ │ Episode end: │ │
128
+ │ │ +2.0–+5.0 task complete (easy→hard│ │
129
+ │ │ −1.5 task failed │ │
130
+ │ └──────────────┬──────────────────────┘ │
131
+ │ │ │
132
+ │ ▼ │
133
+ │ ┌─────────────────────────────────────┐ │
134
+ │ │ GRPO (via HF TRL) │ │
135
+ │ │ │ │
136
+ │ │ 8 parallel rollouts per prompt │ │
137
+ │ │ Computes advantages without │ │
138
+ │ │ a value function │ │
139
+ │ │ Updates policy weights │ │
140
+ │ └─────────────────────────────────────┘ │
141
+ │ │ │
142
+ │ └──► updated Policy Model │
143
+ └─────────────────────────────────────────────────────────────────────────┘
144
+ ```
145
+
146
+ ### Data Flow: Browser Agent → Search Index → Execution
147
+
148
+ ```
149
+ HAR File (cached using Browser Agent) ──► filter_har_entries()
150
+
151
+
152
+ drop: CDN, analytics, static assets
153
+ keep: {method, path, request_body,
154
+ response_body, status_code}
155
+
156
+
157
+ extract_openapi_spec()
158
+ → structured endpoint catalog
159
+ {path, method, params, auth, response_fields}
160
+
161
+ ┌──────┴──────┐
162
+ │ │
163
+ ▼ ▼
164
+ build_GEMMA_embeddings return summary list
165
+ (search_endpoints to RL agent:
166
+ index — full schemas) [GET /products,
167
+ POST /guest-carts, ...]
168
+
169
+
170
+ search_endpoints("create guest cart")
171
+ → top-3 endpoint schemas with:
172
+ • path params + sources
173
+ • body params + sources
174
+ • auth requirements
175
+ • response field names
176
+ ```
177
+
178
+ ### Episode Response Indexing
179
+
180
+ ```
181
+ curl_exec(command)
182
+
183
+ ├──► subprocess: execute against live EC2
184
+
185
+ ├──► index_full_response()
186
+ │ BM25 index ── keyword match (IDs, SKUs, tokens)
187
+ │ GEMMA embed ── semantic match (paraphrases)
188
+ │ (indexes BEFORE truncation — all items stored)
189
+
190
+ └──► smart_truncate()
191
+ non-JSON HTML → 3,000 chars
192
+ JSON primitive → never truncated
193
+ error (4xx/5xx) → never truncated
194
+ small JSON → returned as-is
195
+ large array → first 2 items shown
196
+ + _list_truncated annotation
197
+ + hint to call search_episode_data()
198
+ ```
199
+
200
+ ### Parameter Dependency Graph (what the judge tracks)
201
+
202
+ ```
203
+ Task: "Add 'Radiant Tee' to a guest cart"
204
+
205
+ ┌─────────────────────────────────────────────────────────┐
206
+ │ TASK_SPEC ──────────────────────────────────────────┐ │
207
+ │ "Radiant Tee" (product name) │ │
208
+ │ │ │ │
209
+ │ ▼ │ │
210
+ │ GET /rest/V1/products?name=Radiant+Tee │ │
211
+ │ → items[0].sku = "MH01" (PREV_CALL) ──┐ │ │
212
+ │ │ │ │
213
+ │ POST /rest/V1/guest-carts │ │ │
214
+ │ → body = "cart-abc123" (PREV_CALL) ──┼──┼─►│
215
+ │ │ │ │
216
+ │ POST /rest/V1/guest-carts/{cartId}/items │ │ │
217
+ │ path: cartId ◄────── "cart-abc123" ───────┘ │ │
218
+ │ body: sku ◄────── "MH01" ─────────┘ │
219
+ │ body: qty ◄────── TASK_SPEC (quantity) │
220
+ │ body: quote_id ◄────── DERIVED (= cartId) │
221
+ └─────────────────────────────────────────────────────────┘
222
+
223
+ Source types tracked by the judge:
224
+ TASK_SPEC — value stated in the task string
225
+ PREV_CALL — value from a prior curl response in this episode
226
+ AUTH_FLOW — value from a session/token auth step
227
+ STATIC — fixed application constant (e.g. store_id = 1)
228
+ DERIVED — computed from another param (e.g. quote_id = cart_id)
229
+ ```
230
+
231
+ ### Curriculum: Complexity Tiers
232
+
233
+ ```
234
+ Easy ──────────────────────── graduate when P(success) > 0.7
235
+ │ Single call, no auth │
236
+ │ Templates 1, 2 │
237
+ │ 1 API call required │
238
+ │ ▼
239
+ Medium ──────────────────────── graduate when P(success) > 0.7
240
+ │ Auth + 1–2 dependent calls │
241
+ │ Templates 3, 4 │
242
+ │ 2–3 API calls required │
243
+ │ ▼
244
+ Hard ────────────────────────── final tier
245
+ Multi-step chain, full auth, ID threading
246
+ Templates 5, 6, 7
247
+ 4–8+ API calls required
248
+ Reward scaling: ×2.5 vs Easy
249
+ ```
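The graduation rule can be sketched with a rolling success window. The 0.7 threshold comes from the diagram above; the window size of 50 is an assumption:

```python
from collections import deque

class CurriculumGate:
    """Graduate a tier once the rolling success rate exceeds the threshold."""

    def __init__(self, threshold: float = 0.7, window: int = 50):
        self.threshold = threshold
        self.results: deque = deque(maxlen=window)

    def record(self, success: bool) -> None:
        self.results.append(success)

    def should_graduate(self) -> bool:
        # Require a full window of evidence before comparing to the threshold
        if len(self.results) < (self.results.maxlen or 0):
            return False
        return sum(self.results) / len(self.results) > self.threshold
```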
250
+
251
+ ### The RL Agent's Tool: Browser Agent
252
+
253
+ The RL agent has access to a **browser agent tool** powered by [browser-use/bu-30b-a3b-preview](https://huggingface.co/browser-use/bu-30b-a3b-preview) — a 30B MoE vision-language model (3B active parameters) purpose-built for web task completion, served via the [browser-use](https://github.com/browser-use/browser-use) library on Playwright. When the RL agent calls this tool with a natural language task, the browser agent:
254
+
255
+ 1. Opens the target application in a real browser
256
+ 2. Completes the task by clicking, typing, and navigating — exactly as a human would
257
+ 3. All HTTP traffic is intercepted via Playwright network events
258
+ 4. Returns the intercepted traffic, filtered down to only the application's own API calls
259
+
260
+ The filtering step strips analytics pings, CDN requests, font loads, JS/CSS bundles and returns only `{method, path, request_body, response_body, status_code}` tuples for the app's actual API endpoints.
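The exact filter predicate lives in the browser agent implementation; a heuristic sketch of what it might look like (the extension list and noise-host substrings are assumptions, not the shipped rules):

```python
from urllib.parse import urlparse

STATIC_EXT = (".js", ".css", ".png", ".jpg", ".svg", ".woff", ".woff2", ".ico")
NOISE_HOSTS = ("analytics", "doubleclick", "fonts.", "cdn.")

def keep_entry(url: str, content_type: str) -> bool:
    # Drop static assets by extension and third-party noise by host substring;
    # keep JSON responses and Magento-style REST paths.
    parsed = urlparse(url)
    if parsed.path.lower().endswith(STATIC_EXT):
        return False
    if any(h in parsed.netloc.lower() for h in NOISE_HOSTS):
        return False
    return "json" in content_type.lower() or parsed.path.startswith("/rest/")
```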
261
+
262
+ **Training vs. inference — what gets cached:**
263
+
264
+ - The browser agent output (filtered endpoint list) is pre-computed once per task and cached. During training, the RL model receives this cached result instantly — no live browser session runs.
265
+ - The RL agent's own `curl_exec` calls **always hit the real live WebArena server** — during both training and inference. No API response is mocked or cached.
266
+ - At inference, the browser agent runs live to handle novel tasks or changed application state.
267
+
268
+ Full architecture and code: [BROWSER_AGENT.md](BROWSER_AGENT.md)
269
+
270
+ ### Ground Truth: From the Codebase, Not the Browser
271
+
272
+ The browser agent shows *what* API calls happen. It does not explain *why* — specifically, it does not document where each parameter comes from or what field constraints exist. That comes from the application codebase.
273
+
274
+ For each WebArena application, we perform a one-time static analysis (using a large model against the Docker image source) to produce a **ground truth API catalog** — a precise, hard-coded document specifying:
275
+
276
+ ```
277
+ endpoint: POST /rest/V1/guest-carts/{cartId}/items
278
+ method: POST
279
+ auth: None (guest cart)
280
+ path_params:
281
+ cartId: [string] obtained from: POST /rest/V1/guest-carts → response body
282
+ body:
283
+ cartItem.sku: [string] the product's SKU, from: GET /rest/V1/products → items[].sku
284
+ cartItem.qty: [number] quantity, from: task specification
285
+ cartItem.quote_id: [string] same as cartId
286
+ ```
287
+
288
+ This is what the judge compares against. The ground truth defines the complete parameter relationship graph for each application.
289
+
290
+ Full extraction process: [GROUND_TRUTH_EXTRACTION.md](GROUND_TRUTH_EXTRACTION.md)
291
+
292
+ ### The Training Loop
293
+
294
+ ```
295
+ Task (natural language) + App URL
296
+
297
+
298
+ Policy Model (sees: task + history of all prior actions/results + session_state + findings)
299
+ │ calls tools to explore and execute
300
+ ├─► browser_agent(task, url) → filtered API call list (cached during training)
301
+ ├─► search_endpoints(query) → full schema for a specific endpoint
302
+ ├─► curl_exec(command) → execute HTTP call, get {status, headers, body}
303
+ ├─► search_episode_data(q) → search prior response bodies in this episode
304
+ └─► done(result) → declare task complete
305
+
306
+
307
+ Live WebArena App (EC2) ←─── real HTTP responses (always live, never mocked)
308
+
309
+
310
+ Judge (compares against ground truth API catalog)
311
+
312
+
313
+ Reward Signal ──► GRPO ──► updated policy
314
+ ```
315
+
316
+ ---
317
+
318
+ ## Target Applications
319
+
320
+ All running on a single AWS EC2 instance. Real production software, no simulation.
321
+
322
+
323
+ | App | Port | URL | Software |
324
+ | -------------- | ---- | -------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------- |
325
+ | Shopping | 7770 | [http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7770/](http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7770/) | Magento 2 — open-source e-commerce platform |
326
+ | Shopping Admin | 7780 | [http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7780/](http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7780/) | Magento 2 Admin — backend panel for the same store |
327
+ | Forum | 9999 | [http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:9999/](http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:9999/) | Postmill — open-source Reddit-like link aggregation forum |
328
+ | Wikipedia | 8888 | [http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:8888/](http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:8888/) | Kiwix — read-only offline mirror of English Wikipedia |
329
+ | Map | 3000 | [http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:3000/](http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:3000/) | OpenStreetMap — open-source collaborative mapping platform |
330
+
331
+
332
+ Source: [WebArena environment_docker](https://github.com/web-arena-x/webarena/tree/main/environment_docker)
333
+
334
+ ---
335
+
336
+ ## Spaces
337
+
338
+ ### Observation Space
339
+
340
+ What the model sees at each step:
341
+
342
+ ```python
343
+ class Observation(BaseModel):
344
+ task: str # Natural language task
345
+ app_base_url: str # Root URL of the target application
346
+ last_tool_result: Any # Result of last tool call:
347
+ # search_endpoints → list of endpoint schema strings
348
+ # curl_exec → {status_code, headers, body (smart-truncated)}
349
+ # search_episode_data → list of matching JSON object strings
350
+ history: list[dict] # Full episode trajectory: list of {action, tool_result} pairs
351
+ # from all prior steps. The model sees what it already tried,
352
+ # enabling value threading (read a cart_id from step 2's response
353
+ # and use it in step 5's curl call) and loop avoidance.
354
+ session_state: dict # Auto-managed by environment: cookies, tokens, CSRF values
355
+ # extracted from all prior HTTP Set-Cookie and response bodies
356
+ # e.g. {"PHPSESSID": "abc", "form_key": "xyz", "cart_id": "123"}
357
+ step_count: int
358
+ max_steps: int # 20
359
+ ```
360
+
361
+ `session_state` is maintained by the environment. The model never parses `Set-Cookie` headers — the environment extracts tokens automatically and makes them available. The model decides *when* to authenticate and *which* session values to use; the environment handles *extraction*.
362
+
363
+ **curl execution:** The agent outputs a curl command string. The environment parses it and executes it via subprocess against the live EC2 server — the agent machine never has a direct network connection to WebArena. The environment also injects cookies from `session_state` automatically before each call.
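A minimal sketch of the cookie-injection step, assuming `session_state` stores raw name/value pairs and that the hypothetical `COOKIE_KEYS` set marks which entries are cookies (the real environment learns this from `Set-Cookie` headers):

```python
COOKIE_KEYS = {"PHPSESSID", "frontend"}  # hypothetical: which state entries are cookies

def inject_cookies(curl_cmd: str, session_state: dict) -> str:
    # Append a -b flag carrying all known session cookies before execution
    cookies = {k: v for k, v in session_state.items() if k in COOKIE_KEYS}
    if not cookies:
        return curl_cmd
    cookie_str = "; ".join(f"{k}={v}" for k, v in sorted(cookies.items()))
    return f"{curl_cmd} -b '{cookie_str}'"
```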
364
+
365
+ **Response truncation — smart array truncation, not byte cutoff:** HTTP response bodies are processed by a pure Python function before being returned to the model. Rules applied in order:
366
+
367
+ 1. **Non-JSON body** (HTML, CSS, JS, plain text): truncate to 3,000 characters. HTML from form-serving pages (login, post creation) is kept longer than pure prose because CSRF tokens and `<input>` fields are embedded inside the markup and the model needs to locate them. See the [HTML / Form-Submission Handling](#html--form-submission-handling) section below for how the model is expected to work with HTML responses.
368
+ 2. **JSON primitive** (string, number, boolean): never truncated — these are tokens, IDs, confirmations.
369
+ 3. **Error response (4xx / 5xx)**: never truncated — the model needs every word to self-correct.
370
+ 4. **JSON object or array with no large arrays** (< 3 dict items per array): returned as-is.
371
+ 5. **JSON with a large array field** (≥ 3 dict items): keep first 2 items, drop the rest, and add a `_list_truncated` annotation:
372
+
373
+ ```json
374
+ {
375
+ "items": [
376
+ {"sku": "MH01", "name": "Radiant Tee", "price": 22.0},
377
+ {"sku": "MH02", "name": "Breathe-Easy Tank", "price": 34.0}
378
+ ],
379
+ "_list_truncated": {
380
+ "field": "items",
381
+ "shown": 2,
382
+ "total": 50,
383
+ "note": "Showing 2 of 50 items. Use search_episode_data() to find a specific item from this response."
384
+ }
385
+ }
386
+ ```
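Rules 1, 2, 4, and 5 can be sketched as a pure function (rule 3 needs the status code, which the real implementation receives alongside the body; the limits here mirror the ones stated above):

```python
import json

def smart_truncate(body: str, html_limit: int = 3000, keep: int = 2):
    try:
        data = json.loads(body)
    except ValueError:
        return body[:html_limit]                  # rule 1: non-JSON → 3,000 chars
    if not isinstance(data, (dict, list)):
        return data                               # rule 2: JSON primitive, never truncated
    if isinstance(data, dict):
        # rule 5: find the first array field with >= 3 dict items
        big = next((f for f, v in data.items()
                    if isinstance(v, list) and sum(isinstance(i, dict) for i in v) >= 3),
                   None)
        if big is not None:
            total = len(data[big])
            data[big] = data[big][:keep]
            data["_list_truncated"] = {
                "field": big, "shown": keep, "total": total,
                "note": f"Showing {keep} of {total} items. Use search_episode_data().",
            }
    return data                                   # rule 4: small JSON returned as-is
```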
387
+
388
+ **Episode response indexing:** Every `curl_exec` call indexes the full request and response bodies into a per-episode hybrid index (BM25 for keyword matching + GEMMA semantic embeddings for paraphrase handling). When a list is truncated, all items (not just the 2 shown) are indexed. The model can retrieve any specific object using `search_episode_data("keyword or natural language query")` without needing a filtered API endpoint to exist. See `TOOLS.md` for the full indexing algorithm.
389
+
390
+ ### Action Space
391
+
392
+ The model outputs a single tool call per step. Full technical specifications for all tools (document construction, truncation implementation, index architecture, caveats) are in [TOOLS.md](./TOOLS.md).
393
+
394
+
395
+ | Tool | Input | What It Does | Output |
396
+ | ---------------------------- | --------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------ |
397
+ | `browser_agent(task, url)` | Task string + app base URL | Checks for pre-recorded HAR; if found, processes it — otherwise launches live browser to perform task and record traffic. Extracts OpenAPI-like spec, builds GEMMA embeddings for search. | Summary list of API endpoint names + methods (e.g. `GET /products`). No schemas/headers. Use `search_endpoints()` for details. |
398
+ | `search_endpoints(query)` | Natural language query | Semantic search over GEMMA-embedded endpoint spec built by `browser_agent`. Returns full parameter details for matching endpoints. | Top-3 endpoint schemas (method, path, auth, params with sources, response fields) |
399
+ | `curl_exec(command)` | Full curl command string | Executes HTTP call against live EC2 server, indexes full response into episode BM25 store, returns truncated observation. | `{status_code, headers, body}` — body smart-truncated; full body indexed to episode store |
400
+ | `search_episode_data(query)` | Keyword or natural language query | Hybrid BM25 + GEMMA semantic search over all request/response bodies from prior `curl_exec` calls in this episode. | Top-5 JSON objects from this episode's request/response history |
401
+ | `done(result?)` | Optional result string | Signals task complete, triggers judge evaluation. | Ends episode |
402
+
403
+
404
+ `browser_agent` is called **exactly once per episode at step 1**. During training, it loads a cached pre-recorded HAR file (if available); at inference, it launches a live browser session. It returns the deduplicated list of API endpoint patterns observed in the network traffic. **If called again after step 1, the call executes normally but a −0.3 penalty is applied to the reward.** `search_endpoints` then provides the full schema for any specific endpoint the model wants to call — searching the GEMMA embeddings built by `browser_agent` from the HAR data.
405
+
406
+ `curl_exec` is the primary HTTP action — one string that encodes method, URL, headers, and body together, exactly as API documentation is written. This lets the model leverage its pretrained knowledge of `curl` syntax while producing calls that are self-documenting.
407
+
408
+ ```bash
409
+ # Step 1 — Discover which endpoint creates a guest cart
410
+ # (model calls search_endpoints first, sees: POST /rest/V1/guest-carts)
411
+
412
+ # Step 2 — Create guest cart
413
+ curl -X POST 'http://ec2-.../rest/V1/guest-carts' -H 'Content-Type: application/json'
414
+ # → body: "cart-abc123" (plain string — never truncated)
415
+
416
+ # Step 3 — Find the product SKU (list response, truncated to 2 items + note)
417
+ curl 'http://ec2-.../rest/V1/products?searchCriteria[filter_groups][0][filters][0][field]=name&searchCriteria[filter_groups][0][filters][0][value]=Radiant+Tee'
418
+ # → body: {"items":[{"sku":"MH01","name":"Radiant Tee","price":22.0}],"total_count":1}
419
+ # (1 item — not truncated; if 200 items, all 200 indexed, 2 shown in context)
420
+
421
+ # Step 4 — Add item (model reads cart-abc123 from step 2, MH01 from step 3 — all in history)
422
+ curl -X POST 'http://ec2-.../rest/V1/guest-carts/cart-abc123/items' \
423
+ -H 'Content-Type: application/json' \
424
+ -d '{"cartItem":{"sku":"MH01","qty":1,"quote_id":"cart-abc123"}}'
425
+ ```
426
+
427
+ Values from prior responses (cart IDs, SKUs, tokens) are threaded directly from the growing episode history. `session_state` tokens (cookies, CSRF values) are auto-injected by the environment. If a list response was truncated and the model needs a specific item not shown in the 2-item sample, it calls `search_episode_data("Radiant Tee sku")` — all 200 items are indexed, even though only 2 were shown in context.
428
+
429
+ ### Prompt Structure
430
+
431
+ ```
432
+ SYSTEM: You are an API agent. Complete the task using only the tools available:
433
+ browser_agent, search_endpoints, curl_exec, search_episode_data, done.
434
+ When a response is HTML, look for JSON data embedded in <script> tags or
435
+ extract values from <input> fields. CSRF tokens appear as hidden inputs:
436
+ <input type="hidden" name="_csrf_token" value="XYZ">
437
+
438
+ TASK: Add "Radiant Tee" to a guest cart at http://ec2-16-59-2-56.../
439
+
440
+ [session_state: {}]
441
+
442
+ STEP 1 ACTION: browser_agent("Add Radiant Tee to a guest cart", "http://ec2-...:7770/")
443
+ STEP 1 RESULT: {"app": "shopping", "endpoints": [
444
+ "POST /rest/V1/guest-carts",
445
+ "GET /rest/V1/products",
446
+ "POST /rest/V1/guest-carts/{id}/items",
447
+ ...
448
+ ], "note": "Use search_endpoints() to get full schema for any of these."}
449
+
450
+ STEP 2 ACTION: search_endpoints("create guest cart")
451
+ STEP 2 RESULT: ["endpoint: POST /rest/V1/guest-carts | auth: none | returns: string (cartId)", ...]
452
+
453
+ STEP 3 ACTION: curl_exec("curl -X POST 'http://ec2-.../rest/V1/guest-carts' -H 'Content-Type: application/json'")
454
+ STEP 3 RESULT: {status_code: 200, body: "cart-abc123"}
455
+
456
+ STEP 4 ACTION: search_endpoints("find product by name get sku")
457
+ STEP 4 RESULT: ["endpoint: GET /rest/V1/products | query: searchCriteria filters | returns: .items[].sku .items[].name", ...]
458
+
459
+ STEP 5 ACTION: curl_exec("curl 'http://ec2-.../rest/V1/products?searchCriteria[filter_groups][0][filters][0][field]=name&searchCriteria[filter_groups][0][filters][0][value]=Radiant+Tee'")
460
+ STEP 5 RESULT: {status_code: 200, body: {"items":[{"sku":"MH01","name":"Radiant Tee","price":22.0}],"total_count":1}}
461
+
462
+ STEP 6 ACTION: search_endpoints("add item to guest cart cartId")
463
+ STEP 6 RESULT: ["endpoint: POST /rest/V1/guest-carts/{cartId}/items | path: cartId from POST /rest/V1/guest-carts | body: cartItem.sku, cartItem.qty, cartItem.quote_id (same as cartId)", ...]
464
+
465
+ STEP 7 ACTION: curl_exec("curl -X POST 'http://ec2-.../rest/V1/guest-carts/cart-abc123/items' -H 'Content-Type: application/json' -d '{\"cartItem\":{\"sku\":\"MH01\",\"qty\":1,\"quote_id\":\"cart-abc123\"}}'")
466
+ STEP 7 RESULT: {status_code: 200, body: {"item_id": 5, "sku": "MH01", "qty": 1}}
467
+
468
+ → generate STEP 8: done("Radiant Tee added to cart")
469
+ ```
470
+
471
+ `browser_agent` at step 1 gives the model the full endpoint landscape upfront — it can see `/rest/V1/guest-carts` and `/rest/V1/products` immediately and plan the call sequence before making any HTTP calls. `search_endpoints` fills in the exact parameter schemas. Value threading (`"MH01"`, `"cart-abc123"`) happens through the growing history — if step 5 had returned 200 products truncated to 2, the model would call `search_episode_data("Radiant Tee sku")` to retrieve `MH01` from the episode index.
472
+
473
+ ### Parameter Relationship Graph (What the Judge Knows)
474
+
475
+ The judge holds a complete dependency map for each task:
476
+
477
+ ```
478
+ Parameter Source Types:
479
+ TASK_SPEC — value given directly in the task (e.g., "product #42")
480
+ PREV_CALL — value from a prior API response in this episode
481
+ AUTH_FLOW — value obtained during authentication (session token, CSRF key)
482
+ STATIC — fixed value known from the application (e.g., store_id = 1)
483
+ DERIVED — computed from another value (e.g., cart_id = quote_id)
484
+ ```
485
+
486
+ For each task, the judge knows which parameters fall into which category, and whether the model correctly sourced each value. This is how partial credit works — the model gets reward for correctly threading a `cart_id` even if the final call had a wrong field elsewhere.
487
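The source taxonomy above can be made concrete as a per-task dependency map. A minimal sketch of what the judge might hold for the guest-cart task — the structure and field names here are illustrative, not the actual catalog schema:

```python
# Hypothetical judge-side parameter map for "Add {product_name} to a guest cart".
# Source types mirror the taxonomy above; field names are illustrative.
PARAM_MAP = {
    "POST /rest/V1/guest-carts/{cartId}/items": {
        "cartId":            {"source": "PREV_CALL", "from": "POST /rest/V1/guest-carts"},
        "cartItem.sku":      {"source": "PREV_CALL", "from": "GET /rest/V1/products"},
        "cartItem.qty":      {"source": "TASK_SPEC"},
        "cartItem.quote_id": {"source": "DERIVED",   "from": "cartId"},
    },
}

def sourcing_reward(endpoint: str, param: str, observed_source: str) -> float:
    """+0.25 when the model sourced a value from the expected place, else 0."""
    expected = PARAM_MAP.get(endpoint, {}).get(param, {}).get("source")
    return 0.25 if observed_source == expected else 0.0
```

This is how partial credit stays granular: each parameter is scored independently against its expected source type.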
+
488
+ ### Reward Space
489
+
490
+ **Per-step:**
491
+
492
+
493
+ | Signal | Value | Trigger |
494
+ | ---------------------------- | ----- | --------------------------------------------------------------------------------------------------- |
495
+ | Valid API call (2xx) | +0.2 | `curl_exec` returns 2xx status |
496
+ | New path called this episode | +0.1 | `curl_exec` normalized path not called before in this episode — discourages looping on one endpoint |
497
+ | Correct parameter sourcing | +0.25 | judge: value in curl call came from the correct source type |
498
+ | Session value correctly used | +0.1 | auth token/cookie present and correct in curl call |
499
+ | Repeated identical call | −0.15 | exact duplicate curl command issued twice |
500
+ | browser_agent called again | −0.3 | `browser_agent` called after step 1 — call executes normally, penalty applied to reward |
501
+ | Malformed curl command | −0.1 | curl cannot be parsed or executed by the environment |
502
+ | 4xx response (recoverable) | −0.05 | call failed but episode continues |
503
+
504
+
505
+ Note: `search_endpoints`, `search_episode_data`, and `done` carry no direct per-step reward. Using `search_endpoints` to find the correct schema is indirectly rewarded by enabling correct parameter sourcing (+0.25) in the curl call that follows. `search_episode_data` is indirectly rewarded by allowing the model to retrieve the correct value to place in the next curl command.
506
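A minimal sketch of how the per-step signals above could be summed into one scalar per step — event names are assumptions, the constants mirror the table:

```python
# Per-step reward accumulation; values mirror the table above.
STEP_REWARDS = {
    "valid_2xx":            0.2,
    "new_path":             0.1,
    "correct_sourcing":     0.25,
    "session_used":         0.1,
    "duplicate_call":      -0.15,
    "browser_agent_again": -0.3,
    "malformed_curl":      -0.1,
    "recoverable_4xx":     -0.05,
}

def step_reward(events: set[str]) -> float:
    # Sum the rewards for every signal that fired on this step.
    return round(sum(STEP_REWARDS[e] for e in events), 4)
```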
+
507
+ **Episode end:**
508
+
509
+
510
+ | Outcome | Reward |
511
+ | ----------------------------------------------------------- | ------------------------------------------ |
512
+ | Task completed correctly | +2.0 to +5.0 (scales with difficulty tier) |
513
+ | Partial completion (right endpoints, wrong param threading) | +0.5 to +1.5 |
514
+ | Authentication correctly obtained (even if task fails) | +0.3 |
515
+ | Timeout / task failed entirely | −1.5 |
516
+
517
+
518
+ Target signal separation: successful episodes `+3` to `+7`, failed episodes `−2` to `−1`. This separation is required for GRPO's group-relative advantage estimates to carry useful signal.
519
+
520
+ > **Reward design insight:** Pure step-level rewards can teach a model to "look busy" — accumulating +0.2 (valid call) and +0.1 (new path) rewards while never converging to task completion. To prevent this, the terminal outcome reward must dominate the sum of all per-step rewards. Two mechanisms enforce this:
521
+ >
522
+ > 1. **Hard ceiling on step rewards per episode.** Maximum achievable per-step reward over 20 steps is bounded: `20 × (0.2 + 0.1 + 0.25 + 0.1) = 13`. But a failed episode still ends at `−1.5`, so any correct episode completion still produces a substantially better total.
523
+ > 2. **Curriculum learning as the primary defense.** Easy tasks (Template 1: single GET, no auth) have a trivially short optimal path (2 steps). There is no room to accumulate "fake" exploration reward when the optimal episode only needs 2 calls. The model learns that the terminal reward is the only thing that matters before it encounters tasks long enough to be gamed. Medium and Hard tiers are introduced only after the model reliably solves Easy — by then the behavior pattern is already anchored. This mirrors how SWE-gym-style environments scale difficulty: start simple enough that the reward signal is unambiguous, then broaden.
524
+ >
525
+ > **Premature `done()` penalty:** If the judge scores the final state as incorrect (task not completed), the episode ends at `−1.5`. There is no bonus for calling `done()` early — it is strictly worse than continuing to make correct API calls. The model only benefits from calling `done()` when the task is actually complete.
526
+
527
+ **Reset behavior:** `reset()` clears session state, episode history, episode BM25 index, step counter. It does not reset the remote application database. The judge evaluates relative state (did the cart contain the item?), not absolute state (is the DB row count exactly N?).
528
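That reset contract could look like the following sketch — the attribute names are assumptions, not the actual implementation:

```python
class Episode:
    """Per-episode state container; the remote application DB lives outside it."""

    def __init__(self):
        self.session_state: dict = {}  # cookies, CSRF tokens
        self.history: list = []        # prior tool calls + results
        self.episode_index = None      # BM25 + embedding store
        self.step: int = 0

    def reset(self):
        # Clears per-episode state only; the remote application DB is untouched.
        self.session_state = {}
        self.history = []
        self.episode_index = None
        self.step = 0
```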
+
529
+ ---
530
+
531
+ ## HTML / Form-Submission Handling
532
+
533
+ Not every endpoint in the target applications returns JSON. The Forum (Postmill) and Wikipedia (Kiwix) applications rely on HTML form submissions and HTML responses respectively. The agent is designed to handle both transparently.
534
+
535
+ ### Why This Matters
536
+
537
+ A generalizable API agent must work with the full spectrum of web interfaces — not just REST JSON endpoints. Form-based POST submissions (with CSRF tokens, multipart bodies, URL-encoded fields) are ubiquitous in real web applications. Training on them is intentional: the model learns to identify the correct request format from context rather than assuming JSON everywhere.
538
+
539
+ ### CSRF Token Extraction
540
+
541
+ Postmill protects state-changing routes (login, post creation) with a per-session CSRF token. This token is embedded as a hidden `<input>` field in the HTML form:
542
+
543
+ ```html
544
+ <input type="hidden" name="_csrf_token" value="abc123XYZ">
545
+ ```
546
+
547
+ **How the model handles this — no dedicated CSRF tool needed:**
548
+
549
+ 1. The model issues a GET to the form page (e.g., `GET /login`).
550
+ 2. The environment returns the HTML body, truncated to 3,000 characters (raised from 1,000 specifically to ensure hidden input fields near the end of small forms are included).
551
+ 3. The model reads the `value` attribute of `input[name="_csrf_token"]` directly from the returned HTML string. HTML parsing is not required — the token appears as a predictable plain-text pattern in the markup.
552
+ 4. The model places the extracted token into the subsequent POST body or form field.
553
+ 5. The environment auto-extracts any `Set-Cookie` header from the login response into `session_state`, so subsequent requests are automatically authenticated.
554
+
555
+ If the CSRF token is positioned after the 3,000-character cutoff (possible in very large rendered pages), the model can call `search_episode_data("_csrf_token")` — the full HTML body is indexed into the episode store before truncation, making the token retrievable by keyword search.
556
+
557
+ ```bash
558
+ # Forum login flow
559
+ curl -X POST 'http://ec2-.../login' \
560
+ -H 'Content-Type: application/x-www-form-urlencoded' \
561
+ -d '_csrf_token=abc123XYZ&_username=user&_password=pass'
562
+ # → 302 redirect + Set-Cookie: PHPSESSID=... (auto-injected into session_state)
563
+
564
+ # Forum post creation
565
+ curl -X POST 'http://ec2-.../f/general/submit' \
566
+ -H 'Content-Type: application/x-www-form-urlencoded' \
567
+ -d '_csrf_token=abc123XYZ&title=My+Post&body=Hello+World'
568
+ ```
569
+
570
+ ### Wikipedia / HTML-Only Responses
571
+
572
+ Kiwix serves static HTML pages — there is no JSON API. The agent treats Wikipedia responses as structured text: search results appear in `<a href>` anchor tags; article content is in `<p>` tags.
573
+
574
+ The environment wraps the truncated HTML response in a lightweight JSON envelope before returning it to the model, so the observation format is always `{status_code, headers, body}` regardless of content type:
575
+
576
+ ```json
577
+ {
578
+ "status_code": 200,
579
+ "headers": {"Content-Type": "text/html"},
580
+ "body": "<html>...<ul class='mw-search-results'><li><a href='/wiki/Mars'>Mars</a>...</ul>..."
581
+ }
582
+ ```
583
+
584
+ For Template 2 ("Retrieve article summary for `{title}`"), task completion is verified by confirming the correct article URL was fetched and returned HTTP 200 — not by parsing article content. This makes the grader robust to HTML structure changes.
585
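Under that verification rule, the Template 2 grader reduces to a URL-plus-status check. A sketch, assuming the episode log exposes `(url, status_code)` pairs:

```python
def grade_article_fetch(episode_calls: list[tuple[str, int]], title: str) -> float:
    # Pass iff the correct article URL was fetched and returned HTTP 200.
    target = "/wiki/" + title.replace(" ", "_")
    ok = any(target in url and status == 200 for url, status in episode_calls)
    return 1.0 if ok else 0.0
```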
+
586
+ ### Form vs. JSON Detection
587
+
588
+ `curl_exec` detects whether a request is form-encoded or JSON by inspecting the `Content-Type` header in the curl command string:
589
+
590
+ - `Content-Type: application/json` → body is JSON, response indexed as JSON
591
+ - `Content-Type: application/x-www-form-urlencoded` or `multipart/form-data` → body is form data, response indexed as text
592
+ - No `Content-Type` (GET requests) → response indexed based on `Content-Type` of the response
593
+
594
+ The model is responsible for setting the correct `Content-Type` in its curl command. The system prompt includes explicit guidance on when to use each.
595
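The detection rule above can be sketched as a small classifier over the raw curl string — a simplification of whatever parsing `curl_exec` actually performs:

```python
def detect_body_kind(curl_cmd: str) -> str:
    # Inspect the Content-Type header embedded in the curl command string.
    cmd = curl_cmd.lower()
    if "application/json" in cmd:
        return "json"
    if "application/x-www-form-urlencoded" in cmd or "multipart/form-data" in cmd:
        return "form"
    return "auto"  # no Content-Type: fall back to the response's Content-Type
```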
+
596
+ ---
597
+
598
+ ## Tasks
599
+
600
+ HARvestGym trains on **7 task templates** rather than a larger flat task list. Each template is a parameterized scenario: one reward function, one ground truth catalog entry, one grader — but potentially hundreds of distinct episode variations produced by substituting different values for the template slots (`{product_name}`, `{category_name}`, etc.).
601
+
602
+ If training goes smoothly, we can scale this to automated task creation that generates variations covering every aspect of a task.
603
+
604
+ **How template parameters are populated:** Before training, a one-time data prep step calls the application's own listing APIs and builds a static **parameter pool** for each template (see [parameter_pools.json](parameter_pools.json), refreshed via [scripts/build_parameter_pools.py](scripts/build_parameter_pools.py)):
605
+
606
+
607
+ | Template slot | Source |
608
+ | ----------------------------- | --------------------------------------------------------------- |
609
+ | `{category_name}` | `GET /rest/V1/categories` — all leaf category names |
610
+ | `{product_name}` | `GET /rest/V1/products?pageSize=200` — all product names + SKUs |
611
+ | `{forum_category}` | Forum's category listing API |
612
+ | `{title}`, `{sku}`, `{price}` | Generated or sampled from existing product names |
613
+
614
+
615
+ Each episode samples randomly from its pool. The model never sees the pool directly — it gets the task string (e.g., `"Add 'Radiant Tee' to a guest cart"`) and must discover the correct endpoint + SKU through its own API calls.
616
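Episode sampling then reduces to a random draw from the pool plus string substitution. A sketch with illustrative pool contents — the ground-truth fields (like the SKU) go to the judge, never into the task string beyond what the template exposes:

```python
import random

# Illustrative pool; the real one is built by scripts/build_parameter_pools.py.
POOLS = {
    "product_name": [
        {"name": "Radiant Tee", "sku": "MH01"},
        {"name": "Breathe-Easy Tank", "sku": "WS01"},
    ],
}
TEMPLATE = "Add '{product_name}' to a guest cart"

def sample_task(rng: random.Random) -> tuple[str, dict]:
    choice = rng.choice(POOLS["product_name"])
    # The task string goes to the model; the full record goes to the judge.
    return TEMPLATE.format(product_name=choice["name"]), choice
```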
+
617
+ ### Complexity Tiers
618
+
619
+ Templates are organized into **complexity tiers** for curriculum training — the model only graduates to harder templates once it reliably solves easier ones:
620
+
621
+
622
+ | Tier | Characteristic | API calls required |
623
+ | ------ | --------------------------------------------- | ------------------ |
624
+ | Easy | Single call, no auth | 1 |
625
+ | Medium | Auth + 1–2 dependent calls | 2–3 |
626
+ | Hard | Multi-step chain with ID threading, full auth | 4–8+ |
627
+
628
+
629
+ ### Task Templates
630
+
631
+
632
+ | # | Tier | App | Template | Key Challenge |
633
+ | --- | ------ | -------------- | ------------------------------------------------------ | ------------------------------------------------------- |
634
+ | 1 | Easy | Shopping | List products in category `{category_name}` | Single GET with query params |
635
+ | 2 | Easy | Wikipedia | Retrieve article summary for `{title}` | Single GET, path parameter resolution |
636
+ | 3 | Medium | Shopping | Add `{product_name}` to a guest cart | 2 calls: create cart → add item; ID threading |
637
+ | 4 | Medium | Forum | Retrieve all posts in `{forum_category}` (authed) | Login → extract session → GET |
638
+ | 5 | Hard | Forum | Create a post titled `{title}` in `{category}` | Login → extract CSRF `form_key` → POST with full schema |
639
+ | 6 | Hard | Shopping | Guest checkout for `{product_name}` | 5+ chained calls; cart → item → shipping → payment |
640
+ | 7 | Hard | Shopping Admin | Create a new product with SKU `{sku}`, price `{price}` | Admin bearer token → full Magento product schema |
641
+
642
+
643
+ Each task has a deterministic programmatic grader (score in `[0.0, 1.0]`):
644
+
645
+ - **Easy graders**: check HTTP response body for expected values
646
+ - **Medium graders**: probe application state after episode (e.g., fetch the cart, verify item is present)
647
+ - **Hard graders**: verify multi-step state change in the application (e.g., post exists, checkout created)
648
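A medium-tier grader might probe state like this — the `fetch_cart` callable is a stand-in for a real GET against the cart items endpoint:

```python
def grade_add_to_cart(fetch_cart, cart_id: str, expected_sku: str) -> float:
    # Probe application state after the episode: is the item in the cart?
    items = fetch_cart(cart_id)  # e.g. GET /rest/V1/guest-carts/{cartId}/items
    return 1.0 if any(i.get("sku") == expected_sku for i in items) else 0.0
```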
+
649
+ **On optional request parameters:** API responses and real network traffic often contain extra headers and parameters (`X-Requested-With`, `Cache-Control`, correlation IDs, etc.) that are not functionally required. The judge scores only on *required* parameters. Extra or missing optional headers or body params do not affect the reward signal.
650
+
651
  ---
652
 
 
TOOLS.md ADDED
@@ -0,0 +1,847 @@
1
+ # HARvestGym Tool Specification
2
+
3
+ Technical specification for all tools available to the RL agent. Each tool is a Python function called by the environment on behalf of the model. The model outputs a single tool call per step; the environment executes it and returns the result.
4
+
5
+ ---
6
+
7
+ ## Tool Set Summary
8
+
9
+
10
+ | Tool | Input | What It Does | Output |
11
+ | ---------------------------- | --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
12
+ | `browser_agent(task, url)` | Task string + app base URL | Checks whether a pre-recorded HAR file exists for this app; if so, processes it (only during training; at inference time the live browser agent is always used); otherwise launches a live browser agent to perform the task and record network traffic. Either way, extracts an OpenAPI-like spec from the captured traffic, builds GEMMA embeddings for the search endpoint index, and returns a summary endpoint list. | Deduplicated list of API endpoint names with HTTP methods (e.g. `GET /products`, `POST /guest-carts`) — summary only, no headers/body/schemas. Use `search_endpoints()` with a natural-language query to get full details for any endpoint. |
13
+ | `search_endpoints(query)` | Natural language query | Semantic search over the endpoint embeddings built by `browser_agent`. Matches the query against the GEMMA-embedded OpenAPI-like spec and returns the top-3 endpoint schemas with full parameter details. | Top-3 endpoint schemas (method, path, auth, params with sources, response fields) |
14
+ | `curl_exec(command)` | Full curl command string | Parses the curl command, executes it via subprocess against the live EC2 server, indexes the full response body into the episode's hybrid BM25 + GEMMA store (before truncation), then returns a truncated observation. | `{status_code, headers, body}` — body smart-truncated; full body indexed into episode store for `search_episode_data()` |
15
+ | `search_episode_data(query)` | Natural language or keyword query | Hybrid BM25 + GEMMA semantic search across all request/response bodies accumulated during this episode from prior `curl_exec` calls. BM25 handles exact keyword matches (IDs, SKUs); GEMMA handles semantic paraphrases. Finds specific values from truncated or prior responses. | Top-5 JSON objects from this episode's request/response history, each annotated with step number and source endpoint |
16
+ | `done(result?)` | Optional result string | Signals the model believes the task is complete. Triggers the judge to evaluate the episode against the ground truth catalog. | Ends the episode |
17
+
18
+
19
+ `browser_agent` is always called **once at the start of an episode** (step 1). It gives the model an API landscape map for the target application, so the model knows which endpoints exist before it begins probing. All subsequent discovery and execution uses the other tools. **If the model calls `browser_agent` again after step 1, it receives a −0.3 penalty reward** — the call still executes normally (loads HAR if it exists, or runs live browser), the penalty is just applied to the reward signal.
20
+
21
+ ---
22
+
23
+ ## Tool 0: `browser_agent(task, url)`
24
+
25
+ ### Purpose
26
+
27
+ Give the model an initial map of the API surface for the target application at the start of every episode. The browser agent is a multi-stage pipeline that:
28
+
29
+ 1. Obtains HAR data (from pre-recorded file if available, or by launching a live browser)
30
+ 2. Processes it to extract an OpenAPI-like spec
31
+ 3. Builds GEMMA embeddings so `search_endpoints()` can work
32
+ 4. Returns a **summary-only** endpoint list to the RL agent — just names and methods, no schemas
33
+
34
+ The output is intentionally sparse: an application can expose many endpoints, and returning full schemas for all of them would waste context window. The agent sees *what* endpoints exist but not *how* to call them. It must call `search_endpoints()` to get full parameter details for any endpoint.
35
+
36
+ ### Interface
37
+
38
+ ```python
39
+ def browser_agent(task: str, url: str) -> dict:
40
+ """
41
+ Multi-stage pipeline:
42
+ 1. Check for pre-recorded HAR file → load if exists, else launch live browser
43
+ 2. Filter HAR → extract OpenAPI-like spec (methods, paths, params, bodies)
44
+ 3. Build GEMMA embeddings over the spec → stored for search_endpoints()
45
+ 4. Return summary endpoint list (names + methods only)
46
+
47
+ Returns: {
48
+ "app": str, # resolved app name (shopping, forum, osm, wikipedia)
49
+ "endpoints": list[dict], # summary: [{method, path}] — no schemas, no headers
50
+ "total_endpoints": int, # count of deduplicated endpoints
51
+ "note": str # directs agent to use search_endpoints() for details
52
+ }
53
+ """
54
+ ```
55
+
56
+ ### Stage 1 — HAR Data Source
57
+
58
+ The browser agent first checks if a pre-recorded HAR file exists. If it does, the browser is never launched — saving 30–120s per episode.
59
+
60
+ ```
61
+ hars/
62
+ shopping.har # all shopping tasks, all API calls recorded for all task templates
63
+ shopping_admin.har # all admin tasks
64
+ forum.har # all forum tasks
65
+ osm.har # all OSM tasks
66
+ wikipedia.har # all Wikipedia tasks
67
+ ```
68
+
69
+ ```python
70
+ HAR_MAP = {
71
+ ":7770": "hars/shopping.har",
72
+ ":7780": "hars/shopping_admin.har",
73
+ ":9999": "hars/forum.har",
74
+ ":3000": "hars/osm.har",
75
+ ":8888": "hars/wikipedia.har",
76
+ }
77
+
78
+ async def get_har_data(task: str, url: str) -> dict:
79
+ har_path = resolve_har_path(url) # port-based lookup from HAR_MAP
80
+ if har_path and os.path.exists(har_path):
81
+ # HAR exists — load from disk, no browser needed
82
+ with open(har_path) as f:
83
+ return json.load(f)
84
+ else:
85
+ # No HAR — launch live browser, perform task, capture traffic
86
+ raw_log = await run_browser_agent_live(task, url, "browser-use/bu-30b-a3b-preview")
87
+ return convert_raw_log_to_har(raw_log)
88
+ ```
89
+
90
+ If no HAR exists, the browser agent launches Chromium via Playwright, connects the `bu-30b-a3b-preview` LLM, performs the task while intercepting all network traffic, and produces a HAR-format output. See `BROWSER_AGENT.md` for the live browser implementation.
91
+
92
+ ### Stage 2 — Filter and Extract OpenAPI-like Spec
93
+
94
+ The HAR data (from either source) is processed to extract a structured spec:
95
+
96
+ ```python
97
+ def extract_openapi_spec(har_data: dict, app_base_url: str) -> list[dict]:
98
+ entries = har_data["log"]["entries"]
99
+ seen = set()
100
+ spec_entries = []
101
+
102
+ for entry in entries:
103
+ req = entry["request"]
104
+ resp = entry["response"]
105
+ raw_url = req["url"]
106
+ method = req["method"]
107
+
108
+ # 1. Skip static assets (images, fonts, CSS, JS bundles, favicon)
109
+ if _is_static_asset(raw_url):
110
+ continue
111
+
112
+ # 2. Skip page navigation (HTML document loads)
113
+ content_type = _get_response_content_type(resp)
114
+ if "text/html" in content_type and method == "GET":
115
+ continue
116
+
117
+ # 3. Normalise path: replace concrete IDs with {id} placeholders
118
+ path = _normalise_path(urlparse(raw_url).path)
119
+
120
+ # 4. Deduplicate by (method, normalised_path)
121
+ key = f"{method} {path}"
122
+ if key in seen:
123
+ continue
124
+ seen.add(key)
125
+
126
+ # 5. Extract auth, body, query params for the spec document
127
+ has_auth = any(
128
+ h["name"].lower() in ("authorization", "x-api-key", "cookie")
129
+ for h in req["headers"]
130
+ )
131
+
132
+ spec_entries.append({
133
+ "method": method,
134
+ "path": path,
135
+ "query_params": urlparse(raw_url).query or None,
136
+ "request_body": _extract_body(req),
137
+ "status_code": resp["status"],
138
+ "response_content_type": content_type,
139
+ "response_body_sample": _truncate_body(resp),
140
+ "auth_observed": has_auth,
141
+ })
142
+
143
+ return spec_entries
144
+ ```
145
+
146
+ ### Stage 3 — Build GEMMA Embeddings
147
+
148
+ The spec entries are embedded using `google/embeddinggemma-300m` (GEMMA). These embeddings are stored in the environment and power `search_endpoints()`.
149
+
150
+ ```python
151
+ def build_endpoint_embeddings(spec_entries: list[dict], app_name: str):
152
+ model = SentenceTransformer("google/embeddinggemma-300m", token=os.environ.get("HF_TOKEN"))
153
+ chunks = [spec_entry_to_text(e, app_name) for e in spec_entries]
154
+ embeddings = model.encode_document(chunks, batch_size=32)
155
+ return embeddings, chunks # stored in env for search_endpoints()
156
+ ```
157
+
158
+ ### Stage 4 — Return Summary
159
+
160
+ The RL agent receives **only endpoint names and methods** — no schemas, no headers, no body details:
161
+
162
+ ```json
163
+ {
164
+ "app": "shopping",
165
+ "endpoints": [
166
+ {"method": "POST", "path": "/rest/V1/integration/customer/token"},
167
+ {"method": "GET", "path": "/rest/V1/products"},
168
+ {"method": "GET", "path": "/rest/V1/products/{id}"},
169
+ {"method": "POST", "path": "/rest/V1/guest-carts"},
170
+ {"method": "POST", "path": "/rest/V1/guest-carts/{id}/items"},
171
+ {"method": "GET", "path": "/rest/V1/guest-carts/{id}/totals"},
172
+ {"method": "POST", "path": "/rest/V1/guest-carts/{id}/order"},
173
+ {"method": "GET", "path": "/rest/V1/categories"}
174
+ ],
175
+ "total_endpoints": 8,
176
+ "note": "These endpoints were observed for this application. Use search_endpoints() with a natural language query to get the full schema, parameters, and auth details for any endpoint."
177
+ }
178
+ ```
179
+
180
+ ### Path Normalisation
181
+
182
+ The `_normalise_path` function replaces concrete dynamic segments with `{id}` placeholders so that duplicates collapse:
183
+
184
+ - Numeric IDs: `/products/42` → `/products/{id}`
185
+ - UUIDs: `/carts/3fa85f64-5717-4562-b3fc` → `/carts/{id}`
186
+ - Magento cart IDs (mixed alphanumeric, 32+ chars): detected by length and character set
187
+ - OSM node/way/relation IDs: `/api/0.6/node/12345678` → `/api/0.6/node/{id}`
188
+ - Forum post slugs: `/f/general/1-hello-world` → `/f/{slug}/{id}-{slug}`
189
+
190
+ Normalisation is pattern-based (regex), not AI-generated. No external calls.
191
+
192
+ ### When to Call
193
+
194
+ `browser_agent` is called **exactly once per episode, at step 1**, before any other tool. It serves as the API landscape orientation AND builds the search index. **If called again after step 1, the call executes normally but the model receives a −0.3 penalty reward.** The model should not need to call it again mid-episode.
195
+
196
+ ### Relationship to Other Tools
197
+
198
+ ```
199
+ browser_agent → "what endpoints exist?" (summary only)
200
+ │ + builds GEMMA embeddings internally
201
+
202
+
203
+ search_endpoints → "give me full schema for endpoint X"
204
+ │ (searches the GEMMA embeddings built above)
205
+
206
+ curl_exec → "call endpoint X, get live response"
207
+ │ (indexes full response into BM25 episode store)
208
+
209
+ search_episode_data → "find specific value from a prior response"
210
+ (BM25 search over indexed episode data)
211
+ ```
212
+
213
+ `browser_agent` provides breadth (what exists) and builds the search index. `search_endpoints` provides depth (how to call it). `curl_exec` provides live data and feeds the episode index. `search_episode_data` retrieves specific values from that index.
214
+
215
+ ---
216
+
217
+ ## Tool 1: `search_endpoints(query)`
218
+
219
+ ### Purpose
220
+
221
+ Find which API endpoint to call for a given subtask. The model calls this when it does not yet know the correct URL, method, or parameter schema for the next HTTP call it needs to make.
222
+
223
+ ### Interface
224
+
225
+ ```python
226
+ def search_endpoints(query: str) -> list[str]:
227
+ """
228
+ Semantic search over the endpoint embeddings built by browser_agent.
229
+ Returns the top-3 matching endpoint schemas as formatted text strings.
230
+ """
231
+ ```
232
+
233
+ ### Underlying Index
234
+
235
+ - **Source:** The GEMMA embeddings built by `browser_agent` during Stage 3. These embeddings are created from the OpenAPI-like spec extracted from HAR data — the actual network traffic observed when the browser agent performed tasks on the application.
236
+ - **Embedding model:** `google/embeddinggemma-300m` via `sentence-transformers`
237
+ - **Built:** By `browser_agent` at the start of each episode (Stage 3). The browser agent processes the HAR data, extracts the OpenAPI-like spec, converts each endpoint to a text chunk, and embeds them using GEMMA.
238
+ - **At runtime:** Stored in environment memory after `browser_agent` completes. Available for the rest of the episode. Discarded at episode end (rebuilt from HAR at next episode start).
239
+ - **Query embedding:** Uses the `encode_query` method with prompt `task: search result | query: {query}`.
240
+ - **Document embedding:** Uses the `encode_document` method with prompt `title: {endpoint} | text: {full_schema_text}`.
241
+ - **Similarity:** Use the similarity function specified by the `google/embeddinggemma-300m` model card. The model's `sentence_transformers_config.json` specifies the correct metric (typically cosine similarity for normalized embeddings). Pure numpy, no FAISS needed at this scale.
242
+
243
+ ### Document Structure (one per extracted endpoint)
244
+
245
+ Each endpoint from the browser agent's OpenAPI-like spec is converted to a searchable text chunk by `spec_entry_to_text()`:
246
+
247
+ ```
248
+ app: shopping | endpoint: POST /rest/V1/guest-carts/{id}/items | status: 200 | auth: none | body: {"cartItem":{"sku":"MH01","qty":1,"quote_id":"cart-abc123"}} | response_sample: {"item_id":5,"sku":"MH01","qty":1}
249
+ ```
250
+
251
+ The text chunks include method, path, status code, auth observation, query params, request body sample, and response body sample — all extracted from the actual HAR traffic. This is richer than just endpoint names (which is what the RL agent sees from `browser_agent`'s summary output) but less structured than a hand-written catalog.
252
+
253
+ ### Output Format
254
+
255
+ Returns a list of 3 strings, each being the full text of one matching endpoint schema. The model reads these directly and extracts the method, URL pattern, observed parameters, and response structure.
256
+
257
+ ### When to Call
258
+
259
+ - At the start of a task subtask: "I need to authenticate — what endpoint handles login?"
260
+ - When discovering a prerequisite: "I need a cart ID first — what creates a cart?"
261
+ - When unsure of the exact URL pattern: "Is it `/products/{id}` or `/products?id=`?"
262
+
263
+ ### Caveats
264
+
265
+ - Returns observed traffic patterns, not formal API documentation. The schemas reflect what was seen in the HAR, not what the API formally supports. Some optional parameters may be missing if the browser agent's session didn't exercise them.
266
+ - Returns schemas, not live values. The model still needs `curl_exec` to get actual data (product SKUs, cart IDs, etc.).
267
+ - If no relevant endpoint exists in the index, returns the closest matches by cosine similarity. The model should treat low-confidence results skeptically and try `curl_exec` to probe.
268
+ - The index covers only the current application (determined by `browser_agent`'s URL). Each episode's index is specific to one app.
269
+
270
+ ---
271
+
272
+ ## Tool 2: `curl_exec(command)`
273
+
274
+ ### Purpose
275
+
276
+ Execute an HTTP request against the live EC2 application and return the response. This is the primary action tool — it is how the model actually interacts with the application.
277
+
278
+ ### Interface
279
+
280
+ ```python
281
+ def curl_exec(command: str) -> dict:
282
+ """
283
+ Parses a curl command string, executes it via subprocess against the live EC2 server,
284
+ indexes the full response into the episode store, then returns a truncated observation.
285
+
286
+ Returns: {
287
+ "status_code": int,
288
+ "headers": dict, # response headers
289
+ "body": str | dict # truncated; see truncation rules below
290
+ }
291
+ """
292
+ ```
293
+
294
+ ### Execution Pipeline
295
+
296
+ The environment performs these steps in order on every `curl_exec` call:
297
+
298
+ ```
299
+ 1. Parse the curl command string
300
+ Extract: method, URL, headers, body
301
+ Validate: URL host must match app_base_url (reject requests to external hosts)
302
+ Inject: session cookies from session_state into headers automatically
303
+
304
+ 2. Execute via subprocess
305
+ subprocess.run(["curl", ...parsed args...], timeout=10)
306
+ Capture: status_code, response headers, response body (full, untruncated)
307
+
308
+ 3. Index into episode store (BEFORE truncation)
309
+ Index the request body (if any)
310
+ Index the response body
311
+ See: Episode Store section below
312
+
313
+ 4. Truncate the response body for context
314
+ Apply truncation rules (see below)
315
+ Add truncation note if any array was trimmed
316
+
317
+ 5. Return to model
318
+ {status_code, headers, truncated_body} or error
319
+ ```
320
+
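Step 1's host validation can be sketched in a few lines; the helper name here is hypothetical, but the `host_not_allowed` error string matches the caveats listed later in this section:

```python
import shlex
from urllib.parse import urlparse

def check_curl_host(command: str, app_base_url: str):
    # Return None if every URL in the curl command targets the app host,
    # otherwise the "host_not_allowed" error string.
    allowed = urlparse(app_base_url).netloc
    for token in shlex.split(command):
        if token.startswith(("http://", "https://")):
            if urlparse(token).netloc != allowed:
                return "host_not_allowed"
    return None
```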
321
+ ### Truncation Rules
322
+
323
+ Applied in order. First matching rule wins.
324
+
325
+ **Rule 1 — Non-JSON body:**
326
+
327
+ HTML from form-serving pages (login, post creation, etc.) gets a generous character cutoff because CSRF tokens and `<input>` fields are embedded inside the markup. The model locates them by reading the raw HTML string; no HTML parser is required, since tokens appear as predictable plain-text patterns (`<input type="hidden" name="_csrf_token" value="…">`). Even with 3,000 characters, if the CSRF token appears after the cutoff (possible in large pages), the full body is indexed in the episode store and can be retrieved with `search_episode_data("_csrf_token")`.
328
+
329
+ ```python
330
+ NONJSON_MAX_CHARS = 3000
331
+
332
+ if not is_valid_json(body):
333
+ return body[:NONJSON_MAX_CHARS] + (" [truncated — non-JSON response]" if len(body) > NONJSON_MAX_CHARS else "")
334
+ ```
335
+
336
+ **Rule 2 — JSON primitive (string, number, boolean, null):**
337
+
338
+ ```python
339
+ if isinstance(parsed, (str, int, float, bool)) or parsed is None:
340
+ return body # never truncate; these are tokens, IDs, simple confirmations
341
+ ```
342
+
343
+ **Rule 3 — Error response (4xx or 5xx):**
344
+
345
+ ```python
346
+ if status_code >= 400:
347
+ return body # never truncate error messages; the model needs every word to self-correct
348
+ ```
349
+
350
+ **Rule 4 — JSON object or array with no large arrays:**
351
+
352
+ ```python
353
+ # "large" means an array with >= 3 objects (dicts)
354
+ if no array field contains >= 3 dict items:
355
+ return body # small enough; return as-is
356
+ ```
357
+
358
+ **Rule 5 — JSON with large array field(s):**
359
+
360
+ ```python
361
+ # For each top-level field whose value is a list of >= 3 dicts:
362
+ # Keep first 2 items, drop the rest
363
+ # Add a _list_truncated annotation at the top level
364
+
365
+ truncated = {}
+ truncated_fields = {}
+ for key, val in parsed.items():
+     if is_large_list(val):
+         truncated[key] = val[:2]
+         truncated_fields[key] = len(val)
+     else:
+         truncated[key] = val
+ truncated["_list_truncated"] = {
+     "fields": truncated_fields,
+     "shown_per_field": 2,
+     "note": "Showing 2 items per truncated field. Use search_episode_data() to find a specific item from this response."
+ }
+ return json.dumps(truncated)
376
+ ```
377
+
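Rules 4 and 5 share an `is_large_list` predicate. A plausible one-liner, assuming only the first element's type is checked:

```python
def is_large_list(value) -> bool:
    # "Large" per the rules above: an array holding at least 3 objects (dicts).
    return isinstance(value, list) and len(value) >= 3 and isinstance(value[0], dict)
```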
378
+ The note is a static Python format string. It is not AI-generated. It does not suggest specific query parameters or URL patterns.
379
+
380
+ ### Session State Injection
381
+
382
+ Before executing the curl command, the environment reads `session_state` and injects any relevant cookies or tokens:
383
+
384
+ - If `session_state` contains `PHPSESSID`, inject as `Cookie: PHPSESSID=...` (applies to both Magento sessions and Forum requests on port 9999)
+ - If `session_state` contains `form_key` (Magento CSRF), inject as a header: `X-Form-Key: ...`
387
+ - If `session_state` contains a bearer token, inject as `Authorization: Bearer ...` only if the model's curl command does not already include an `Authorization` header
388
+
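The injection rules above can be sketched as a pure function over the header dict; the `bearer_token` key name is an assumption, since this document does not fix the `session_state` schema:

```python
def inject_session_headers(headers: dict, session_state: dict) -> dict:
    # Sketch only: copies the model's declared headers, then fills in token
    # values from session_state. "bearer_token" is a hypothetical key name.
    out = dict(headers)
    if "PHPSESSID" in session_state:
        out["Cookie"] = f"PHPSESSID={session_state['PHPSESSID']}"
    if "form_key" in session_state:
        out["X-Form-Key"] = session_state["form_key"]
    if "bearer_token" in session_state and "Authorization" not in out:
        out["Authorization"] = f"Bearer {session_state['bearer_token']}"
    return out
```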
389
+ **CSRF note for Postmill (Forum):** Postmill's `_csrf_token` is a request-body field, not a header. The environment does **not** auto-inject it — the model must extract it from the HTML form response and include it explicitly in the POST body. The `session_state` cookie (`PHPSESSID`) is auto-injected so the server associates the CSRF token with the active session. The expected workflow:
390
+
391
+ ```
392
+ GET /login → HTML body contains <input type="hidden" name="_csrf_token" value="XYZ">
393
+ Model reads "XYZ" from body string
394
+ POST /login -d '_csrf_token=XYZ&_username=user&_password=pass'
395
+ Environment auto-injects Cookie: PHPSESSID from session_state
396
+ ```
397
+
398
+ The model is responsible for setting the correct `Content-Type` in its curl command. The model declares intent (which headers to include); the environment fills in the actual token values from `session_state`.
399
+
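The token-reading step in the workflow above needs only a regular expression over the raw HTML; a sketch, assuming the attribute order matches Postmill's rendered form:

```python
import re

def extract_csrf_token(html: str):
    # Matches the hidden-input pattern described above; returns None if absent.
    m = re.search(r'<input[^>]*name="_csrf_token"[^>]*value="([^"]*)"', html)
    return m.group(1) if m else None
```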
400
+ ### Caveats
401
+
402
+ - `curl_exec` always hits the live EC2 server. No responses are mocked.
403
+ - The timeout is 10 seconds. If the server does not respond, returns `{status_code: 0, error: "timeout"}`.
404
+ - URL must be on the same host as `app_base_url`. Cross-host requests are rejected with `{status_code: 0, error: "host_not_allowed"}`.
405
+ - The model must include the full URL including host and port. Relative paths are not supported.
406
+
407
+ ---
408
+
409
+ ## Tool 3: `search_episode_data(query)`
410
+
411
+ ### Purpose
412
+
413
+ Search through all request and response bodies accumulated during the current episode. The model calls this when it needs a specific value (an ID, a name, a token) that was returned in a prior API response but the list was truncated.
414
+
415
+ This tool exists because **not every data type has a search or filter API endpoint**. For applications where the model cannot make a targeted filtered query (e.g., a listing endpoint that only supports pagination, not field-based filtering), searching the already-fetched episode data is the only way to locate a specific value without paging through the entire collection.
416
+
417
+ ### Interface
418
+
419
+ ```python
420
+ def search_episode_data(query: str) -> list[str]:
421
+ """
422
+ Keyword + BM25 search over all request and response bodies indexed during this episode.
423
+ Returns the top-5 matching JSON objects as formatted text strings, each annotated
424
+ with the step number and endpoint that produced them.
425
+ """
426
+ ```
427
+
428
+ ### Hybrid Search: BM25 + GEMMA Semantic Embeddings
429
+
430
+ The episode index uses a **hybrid approach** combining BM25 keyword matching with GEMMA semantic embeddings (`google/embeddinggemma-300m`). Both indexes are maintained in parallel — BM25 for fast exact keyword recall, GEMMA for semantic understanding when the agent's query uses different terminology than what appears in the response data.
431
+
432
+ **Why hybrid, not BM25 alone:**
433
+
434
+ BM25 excels at exact keyword matches ("MH01", "cart-abc123", "Radiant Tee") but fails on paraphrases. If the agent queries "price of the tee shirt I found earlier", BM25 won't match "Radiant Tee" because the terms don't overlap. GEMMA semantic embeddings bridge this gap — "tee shirt" and "Radiant Tee" are semantically close in embedding space.
435
+
436
+ **Why hybrid, not GEMMA alone:**
437
+
438
+ GEMMA embeddings are weaker at exact string matching. Searching for a specific cart ID like "cart-abc123" benefits from BM25's precise token matching. The hybrid approach gets the best of both.
439
+
440
+ **Scoring:** Results are ranked by a weighted combination of BM25 score (normalized) and GEMMA cosine similarity:
441
+
442
+ ```python
443
+ hybrid_score = alpha * bm25_score_normalized + (1 - alpha) * GEMMA_cosine_similarity
444
+ # alpha = 0.4 (tunable; favors semantic slightly over keyword)
445
+ ```
446
+
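The blend can be sketched in a few lines of numpy; min-max normalization for the BM25 scores is an assumption, since the text only says "normalized":

```python
import numpy as np

def hybrid_rank(bm25_scores, cosine_sims, alpha=0.4, top_k=5):
    # Min-max normalize BM25 so both signals live in [0, 1], then blend.
    bm25 = np.asarray(bm25_scores, dtype=float)
    cos = np.asarray(cosine_sims, dtype=float)
    span = bm25.max() - bm25.min()
    bm25_norm = (bm25 - bm25.min()) / span if span > 0 else np.zeros_like(bm25)
    hybrid = alpha * bm25_norm + (1 - alpha) * cos
    return np.argsort(hybrid)[::-1][:top_k].tolist()
```

With `alpha = 0.4`, a document that is only a keyword match needs a strong BM25 lead to outrank a close semantic match, which is the stated bias toward semantics.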
447
+ **Performance:** GEMMA is 300M parameters. On GPU, embedding a batch of 200 response items takes ~1-2 seconds — acceptable overhead per `curl_exec` call. The GEMMA model is already loaded in memory for `search_endpoints`, so no additional model loading cost. BM25 remains instantaneous.
448
+
449
+ **Fallback:** If no GPU is available, the system falls back to BM25-only mode. The GEMMA model is shared with `search_endpoints` — if it's loaded, episode data search uses it too.
450
+
451
+ ### Episode Index — Document Construction
452
+
453
+ Every time `curl_exec` completes, the environment constructs embedding documents from the full (pre-truncation) request and response bodies and adds them to the in-memory BM25 index for the current episode.
454
+
455
+ **Algorithm:**
456
+
457
+ ```python
458
+ def build_index_documents(step: int, method: str, path: str,
459
+ request_body: Any, response_body: Any,
460
+ status_code: int) -> list[str]:
461
+ docs = []
462
+
463
+ # 1. Index the request body (if any)
464
+ if request_body is not None:
465
+ docs.append(
466
+ f"step:{step} source:request endpoint:{method} {path} "
467
+ f"body:{json.dumps(request_body, ensure_ascii=False)}"
468
+ )
469
+
470
+ # 2. Index the response body
471
+ if response_body is None or not is_valid_json(response_body):
472
+ docs.append(
473
+ f"step:{step} source:response endpoint:{method} {path} "
474
+ f"status:{status_code} body:{str(response_body)[:500]}"
475
+ )
476
+ return docs
477
+
478
+ parsed = json.loads(response_body) if isinstance(response_body, str) else response_body
479
+
480
+ # 3. JSON primitive — one document
481
+ if isinstance(parsed, (str, int, float, bool)) or parsed is None:
482
+ docs.append(
483
+ f"step:{step} source:response endpoint:{method} {path} "
484
+ f"status:{status_code} value:{parsed}"
485
+ )
486
+ return docs
487
+
+     # 3b. JSON top-level array: one document per item (a JSON body can be a bare list)
+     if isinstance(parsed, list):
+         for item in parsed:
+             docs.append(
+                 f"step:{step} source:response endpoint:{method} {path} "
+                 f"status:{status_code} item:{json.dumps(item, ensure_ascii=False)}"
+             )
+         return docs
488
+ # 4. JSON object — find top-level array fields
489
+ array_fields = {k: v for k, v in parsed.items()
490
+ if isinstance(v, list) and len(v) > 0 and isinstance(v[0], dict)}
491
+ scalar_fields = {k: v for k, v in parsed.items() if k not in array_fields}
492
+
493
+ if not array_fields:
494
+ # No arrays — one document for the whole object
495
+ docs.append(
496
+ f"step:{step} source:response endpoint:{method} {path} "
497
+ f"status:{status_code} data:{json.dumps(parsed, ensure_ascii=False)}"
498
+ )
499
+ return docs
500
+
501
+ # 5. Has array fields — one document per array item, with parent context attached
502
+ parent_context = (
503
+ f"step:{step} source:response endpoint:{method} {path} status:{status_code} "
504
+ + " ".join(f"{k}:{v}" for k, v in scalar_fields.items())
505
+ )
506
+ for field_name, items in array_fields.items():
507
+ for item in items:
508
+ # Flatten nested arrays within each item to strings (do not recurse further)
509
+ flat_item = {}
510
+ for k, v in item.items():
511
+ flat_item[k] = json.dumps(v) if isinstance(v, (list, dict)) else v
512
+ docs.append(
513
+ f"{parent_context} list_field:{field_name} "
514
+ f"item:{json.dumps(flat_item, ensure_ascii=False)}"
515
+ )
516
+
517
+ return docs
518
+ ```
519
+
520
+ **Key design principle:** The parent context (step number, endpoint, HTTP status, scalar fields like `total_count`) is prepended to every child item document. When the model searches for "Radiant Tee product SKU", the returned document contains both `name:Radiant Tee sku:MH01` and the context `endpoint:GET /rest/V1/products step:2` — the model knows where this value came from and which step it appeared in.
521
+
522
+ ### Episode Index — Lifecycle
523
+
524
+ ```
525
+ episode start → BM25 index initialized (empty)
526
+ GEMMA embedding store initialized (empty)
527
+
528
+ each curl_exec → build_index_documents() called
529
+ documents appended to BM25 corpus (BM25 index rebuilt, fast)
530
+ documents embedded via GEMMA and appended to embedding store
531
+
532
+ search_episode_data() → BM25 scores computed (keyword match)
533
+ GEMMA cosine similarity computed (semantic match)
534
+ hybrid ranking: alpha * BM25 + (1-alpha) * GEMMA
535
+ top-5 documents returned
536
+
537
+ episode end → both indexes discarded entirely
538
+ next episode → fresh indexes from scratch
539
+ ```
540
+
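As a toy stand-in for this lifecycle, the following uses plain IDF-weighted keyword overlap in place of the BM25 + GEMMA hybrid; the class and scoring are illustrative only:

```python
import math
import re
from collections import Counter

def _tokens(text: str):
    # Crude tokenizer: lowercase alphanumeric runs (keeps ids like "cart-abc123").
    return re.findall(r"[a-z0-9_-]+", text.lower())

class EpisodeIndex:
    """Illustrative per-episode store: append-only docs, keyword scoring, discarded at episode end."""
    def __init__(self):
        self.docs = []

    def add(self, doc: str):
        self.docs.append(doc)

    def search(self, query: str, top_k: int = 5):
        q = _tokens(query)
        n = len(self.docs) or 1
        df = Counter(t for d in self.docs for t in set(_tokens(d)))
        def score(doc):
            toks = _tokens(doc)
            return sum(toks.count(t) * math.log(1 + n / (1 + df[t])) for t in q)
        ranked = sorted(self.docs, key=score, reverse=True)[:top_k]
        return [d for d in ranked if score(d) > 0]
```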
541
+ ### Output Format
542
+
543
+ Returns a list of up to 5 strings, each being one indexed document. Example:
544
+
545
+ ```
546
+ [
547
+ "step:2 source:response endpoint:GET /rest/V1/products status:200 total_count:200 list_field:items item:{\"sku\": \"MH01\", \"name\": \"Radiant Tee\", \"price\": 22.0, \"type_id\": \"simple\"}",
548
+ "step:2 source:response endpoint:GET /rest/V1/products status:200 total_count:200 list_field:items item:{\"sku\": \"MH03\", \"name\": \"Radiant Tee Long Sleeve\", \"price\": 28.0, \"type_id\": \"simple\"}"
549
+ ]
550
+ ```
551
+
552
+ The model reads `sku: MH01` from the first result and uses it in the next curl call.
553
+
554
+ ### When to Call
555
+
556
+ - A prior curl response was truncated (`_list_truncated` present in the response) and the model needs a specific item not shown in the 2-item sample.
557
+ - The model needs a value from a prior step but cannot easily locate it by scanning history (many steps ago, or buried in a complex response).
558
+ - There is no filter/search API for the data type (practical assumption: not all applications expose filtered listing endpoints for every resource).
559
+
560
+ ### Caveats
561
+
562
+ - Only searches data from **the current episode**. Values from prior episodes are not accessible (each episode starts with an empty index).
563
+ - Only finds data that was actually returned by a `curl_exec` call in this episode. If the relevant API has not been called yet, the data is not indexed.
564
+ - The hybrid search handles both exact keywords ("MH01", "cart-abc123") and semantic paraphrases ("tee shirt price"). However, using the same terminology seen in the response still produces the best results.
565
+ - For large lists (200+ items), all items are indexed. BM25 search is fast regardless of index size. GEMMA embedding of large responses adds 1-2 seconds of overhead per `curl_exec` call.
566
+
567
+ ---
568
+
569
+ ## Tool 4: `done(result?)`
570
+
571
+ ### Interface
572
+
573
+ ```python
574
+ def done(result: str = None) -> None:
575
+ """
576
+ Signals that the model believes the task is complete.
577
+ Triggers the judge to evaluate the episode against the ground truth catalog.
578
+ Ends the episode.
579
+ """
580
+ ```
581
+
582
+ ### Behavior
583
+
584
+ - Calling `done()` immediately ends the episode. No further tool calls are processed.
585
+ - The optional `result` string is logged but does not affect the reward. The judge evaluates the live application state, not the model's self-report.
586
+ - If the model calls `done()` and the task is not actually complete, the episode ends with `−1.5` reward (timeout/failure outcome). The model should only call `done()` after the final `curl_exec` has returned a 2xx confirming the required state change.
587
+
588
+ ### How the Model Learns When to Call `done()`
589
+
590
+ There is **no explicit "task complete" signal** from the environment. The model learns when to call `done()` purely through the reward signal over many episodes:
591
+
592
+ - **Calling `done()` too early** (before the task is actually complete) → judge finds the expected state change is missing → `−1.5` reward. The model learns to avoid this.
593
+ - **Calling `done()` after a successful final API call** (e.g., add-to-cart returns 2xx with `item_id`) → judge confirms the state change → `+2.0` to `+5.0` reward. The model learns that a 2xx response confirming the desired action is the right signal to call `done()`.
594
+ - **Never calling `done()`** (running out of steps) → episode times out → `−1.5` reward. The model learns it must eventually commit.
595
+
596
+ The learned pattern is: after the final state-changing `curl_exec` returns a 2xx response whose body confirms the expected outcome (e.g., `item_id` present in add-to-cart, `order_id` present in checkout), call `done()`. This mirrors how a human developer knows an API call succeeded — you check the response.
597
+
598
+ **Optional verification step:** Before calling `done()`, the model can issue one more `curl_exec` to verify the state change (e.g., `GET /rest/V1/guest-carts/{id}` to confirm the item is in the cart). This costs one step but reduces the risk of premature `done()` calls. The model learns whether verification is worth the step cost through reward optimization.
599
+
600
+ ---
601
+
602
+ ## Episode Index — Full Example
603
+
604
+ **Task:** `"Add 'Radiant Tee' to a guest cart at http://ec2-.../"`
605
+
606
+ ```
607
+ SYSTEM: ...
608
+
609
+ TASK: Add "Radiant Tee" to a guest cart at http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7770/
610
+
611
+ [session_state: {}]
612
+
613
+ STEP 1 ACTION: browser_agent("Add Radiant Tee to a guest cart", "http://ec2-...:7770/")
614
+ STEP 1 RESULT: {
615
+ "app": "shopping",
616
+ "endpoints": [
617
+ {"method": "POST", "path": "/rest/V1/guest-carts", "status": 200, "auth_observed": false},
618
+ {"method": "GET", "path": "/rest/V1/products", "status": 200, "auth_observed": false},
619
+ {"method": "GET", "path": "/rest/V1/products/{id}", "status": 200, "auth_observed": false},
620
+ {"method": "POST", "path": "/rest/V1/guest-carts/{id}/items", "status": 200, "auth_observed": false},
621
+ ...
622
+ ],
623
+ "note": "These endpoints were observed in pre-recorded browser sessions..."
624
+ }
625
+
626
+ STEP 2 ACTION: search_endpoints("find product by name, get sku")
627
+ STEP 2 RESULT: [
628
+ "endpoint: GET /rest/V1/products | auth: none | query: searchCriteria filters (field, value, condition_type) | returns: .items[].sku, .items[].name, .items[].price",
629
+ ...
630
+ ]
631
+
632
+ STEP 3 ACTION: curl_exec("curl 'http://ec2-.../rest/V1/products?searchCriteria[filter_groups][0][filters][0][field]=name&searchCriteria[filter_groups][0][filters][0][value]=Radiant+Tee'")
633
+ STEP 3 RESULT: {"status_code": 200, "body": {"items": [{"sku": "MH01", "name": "Radiant Tee", "price": 22.0}], "total_count": 1}}
634
+
635
+ STEP 4 ACTION: search_endpoints("create guest cart, get cart id")
636
+ STEP 4 RESULT: [
637
+ "endpoint: POST /rest/V1/guest-carts | auth: none | no body | returns: cartId string (plain quoted string)",
638
+ ...
639
+ ]
640
+
641
+ STEP 5 ACTION: curl_exec("curl -X POST 'http://ec2-.../rest/V1/guest-carts' -H 'Content-Type: application/json'")
642
+ STEP 5 RESULT: {"status_code": 200, "body": "cart-abc123"}
643
+
644
+ STEP 6 ACTION: search_endpoints("add item to guest cart")
645
+ STEP 6 RESULT: [
646
+ "endpoint: POST /rest/V1/guest-carts/{cartId}/items | auth: none | path: cartId (from POST /rest/V1/guest-carts) | body: cartItem.sku, cartItem.qty, cartItem.quote_id (same as cartId) | returns: item_id",
647
+ ...
648
+ ]
649
+
650
+ STEP 7 ACTION: curl_exec("curl -X POST 'http://ec2-.../rest/V1/guest-carts/cart-abc123/items' -H 'Content-Type: application/json' -d '{\"cartItem\":{\"sku\":\"MH01\",\"qty\":1,\"quote_id\":\"cart-abc123\"}}'")
651
+ STEP 7 RESULT: {"status_code": 200, "body": {"item_id": 5, "sku": "MH01", "qty": 1}}
652
+
653
+ STEP 8 ACTION: done("Radiant Tee (MH01) added to guest cart cart-abc123, item_id 5")
654
+ ```
655
+
656
+ ### Embedding build (by browser_agent, once per episode)
657
+
658
+ The GEMMA embeddings for `search_endpoints` are built by `browser_agent` during Stage 4, not pre-built offline. Each episode starts fresh:
659
+
660
+ ```python
661
+ from sentence_transformers import SentenceTransformer
662
+ import numpy as np
663
+ import os
664
+
665
+ # google/embeddinggemma-300m requires accepting Google's license on HuggingFace.
666
+ # Set HF_TOKEN env variable to a token that has accepted the license.
667
+ # Uses encode_query / encode_document / similarity API from sentence-transformers.
668
+ # NOTE: activations do not support float16 — use float32 or bfloat16.
669
+ HF_TOKEN = os.environ.get("HF_TOKEN") # must have accepted the license
670
+
671
+ def build_endpoint_embeddings(spec_entries: list[dict], app_name: str):
672
+ """Called by browser_agent Stage 4 after extracting OpenAPI-like spec from HAR."""
673
+ model = SentenceTransformer("google/embeddinggemma-300m", token=HF_TOKEN)
674
+ chunks = [spec_entry_to_text(e, app_name) for e in spec_entries]
675
+ # encode_document uses: "title: {endpoint} | text: {rest of chunk}"
676
+ embeddings = model.encode_document(chunks, batch_size=32)
677
+ # embeddings are returned normalized; dot product = cosine similarity
678
+ return embeddings, chunks # stored in env memory for search_endpoints()
679
+ ```
680
+
681
+ ### Runtime search
682
+
683
+ ```python
684
+ def search_endpoints(query: str, embeddings, texts, model, top_k=3) -> list[str]:
685
+ q_emb = model.encode_query(query) # shape: (D,)
686
+ # Use similarity metric specified by google/embeddinggemma-300m model card
687
+ scores = model.similarity(q_emb, embeddings)[0].cpu().numpy() # shape: (N,); numpy so np.argsort works
688
+ top_idx = np.argsort(scores)[::-1][:top_k]
689
+ return [texts[i] for i in top_idx]
690
+ ```
691
+
692
+ ### Index size per episode
693
+
694
+ Typical endpoint counts per app (after HAR filtering and deduplication):
695
+
696
+ - Shopping (Magento REST): ~8–15 endpoints per HAR
697
+ - Shopping Admin (Magento Admin AJAX + REST): ~5–10 endpoints per HAR
698
+ - Forum (Postmill forms + REST): ~3–8 endpoints per HAR
699
+ - OSM (Rails API + web): ~5–10 endpoints per HAR
700
+ - Wikipedia (Kiwix): ~2 endpoints per HAR
701
+
702
+ **Typical: ~5–15 endpoints × 768 dims × 4 bytes = negligible memory.** Embedding time on GPU: <1 second per episode.
703
+
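Checking the arithmetic on the worst case above:

```python
# Worst case from the counts above: 15 endpoints, 768-dim float32 embeddings.
endpoints, dims, bytes_per_float = 15, 768, 4
total_bytes = endpoints * dims * bytes_per_float
print(total_bytes)  # 46080 bytes, i.e. ~45 KiB
```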
704
+ ---
705
+
706
+ ## Truncation Helper — Python Pseudocode
707
+
708
+ ```python
709
+ import json
710
+
711
+ TRUNCATE_LIST_AT = 2 # keep this many items from large arrays
712
+ LARGE_ARRAY_THRESHOLD = 3 # arrays with >= this many dicts are "large"
713
+ NONJSON_MAX_CHARS = 3000 # 3,000 chars — enough to capture hidden CSRF inputs in most HTML forms
714
+
715
+ def truncate_response_body(body: str, status_code: int) -> str:
716
+ # Rule 3: never truncate errors
717
+ if status_code >= 400:
718
+ return body
719
+
720
+ # Rule 1: non-JSON
721
+ if not _is_json(body):
722
+ if len(body) > NONJSON_MAX_CHARS:
723
+ return body[:NONJSON_MAX_CHARS] + " [truncated — non-JSON response]"
724
+ return body
725
+
726
+ parsed = json.loads(body)
727
+
728
+ # Rule 2: primitive
729
+ if not isinstance(parsed, (dict, list)):
730
+ return body
731
+
732
+ # Rule 4/5: find large array fields
733
+ if isinstance(parsed, list):
734
+ if len(parsed) >= LARGE_ARRAY_THRESHOLD and isinstance(parsed[0], dict):
735
+ result = parsed[:TRUNCATE_LIST_AT]
736
+ note = {"_list_truncated": {
737
+ "shown": TRUNCATE_LIST_AT,
738
+ "total": len(parsed),
739
+ "note": f"Showing {TRUNCATE_LIST_AT} of {len(parsed)} items. "
740
+ "Use search_episode_data() to find a specific item from this response."
741
+ }}
742
+ return json.dumps(result + [note])
743
+ return body
744
+
745
+ # parsed is a dict — check each value
746
+ needs_truncation = {
747
+ k for k, v in parsed.items()
748
+ if isinstance(v, list) and len(v) >= LARGE_ARRAY_THRESHOLD
749
+ and len(v) > 0 and isinstance(v[0], dict)
750
+ }
751
+ if not needs_truncation:
752
+ return body
753
+
754
+ result = {}
755
+ total_truncated = {}
756
+ for k, v in parsed.items():
757
+ if k in needs_truncation:
758
+ result[k] = v[:TRUNCATE_LIST_AT]
759
+ total_truncated[k] = len(v)
760
+ else:
761
+ result[k] = v
762
+
763
+ result["_list_truncated"] = {
764
+ "fields": total_truncated,
765
+ "shown_per_field": TRUNCATE_LIST_AT,
766
+ "note": (
767
+ f"List fields truncated: "
768
+ + ", ".join(f"{k} showing {TRUNCATE_LIST_AT}/{n}" for k, n in total_truncated.items())
769
+ + ". Use search_episode_data() to find a specific item from this response."
770
+ )
771
+ }
772
+ return json.dumps(result)
773
+
774
+
775
+ def _is_json(s: str) -> bool:
776
+ try:
777
+ json.loads(s)
778
+ return True
779
+ except (ValueError, TypeError):
780
+ return False
781
+ ```
782
+
783
+ ---
784
+
785
+ ## Tool Call Format in the Episode Prompt
786
+
787
+ The growing episode context uses this format for tool calls and results:
788
+
789
+ ```
790
+ SYSTEM: You are an API agent. Your task is to complete the given task by calling the
791
+ available tools: browser_agent, search_endpoints, curl_exec, search_episode_data, done.
792
+ Complete the task using only HTTP calls to the application at the given URL.
793
+ When a response body is HTML, read hidden input fields directly from the markup to
794
+ extract CSRF tokens (pattern: <input type="hidden" name="_csrf_token" value="...">).
795
+ For form submissions, use Content-Type: application/x-www-form-urlencoded.
796
+
797
+ TASK: Add "Radiant Tee" to a guest cart at http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7770/
798
+
799
+ [session_state: {}]
800
+
801
+ STEP 1 ACTION: browser_agent("Add Radiant Tee to a guest cart", "http://ec2-...:7770/")
802
+ STEP 1 RESULT: {
803
+ "app": "shopping",
804
+ "endpoints": [
805
+ "POST /rest/V1/guest-carts",
806
+ "GET /rest/V1/products",
807
+ "GET /rest/V1/products/{sku}",
808
+ "POST /rest/V1/guest-carts/{id}/items",
809
+ ...
810
+ ],
811
+ "note": "Use search_endpoints() to get full schema for any of these."
812
+ }
813
+
814
+ STEP 2 ACTION: search_endpoints("find product by name, get sku")
815
+ STEP 2 RESULT: [
816
+ "endpoint: GET /rest/V1/products | auth: none | query: searchCriteria filters (field, value, condition_type) | returns: .items[].sku, .items[].name, .items[].price",
817
+ ...
818
+ ]
819
+
820
+ STEP 3 ACTION: curl_exec("curl 'http://ec2-.../rest/V1/products?searchCriteria[filter_groups][0][filters][0][field]=name&searchCriteria[filter_groups][0][filters][0][value]=Radiant+Tee'")
821
+ STEP 3 RESULT: {"status_code": 200, "body": {"items": [{"sku": "MH01", "name": "Radiant Tee", "price": 22.0}], "total_count": 1}}
822
+
823
+ STEP 4 ACTION: search_endpoints("create guest cart, get cart id")
824
+ STEP 4 RESULT: [
825
+ "endpoint: POST /rest/V1/guest-carts | auth: none | no body | returns: cartId string (plain quoted string)",
826
+ ...
827
+ ]
828
+
829
+ STEP 5 ACTION: curl_exec("curl -X POST 'http://ec2-.../rest/V1/guest-carts' -H 'Content-Type: application/json'")
830
+ STEP 5 RESULT: {"status_code": 200, "body": "cart-abc123"}
831
+
832
+ STEP 6 ACTION: search_endpoints("add item to guest cart")
833
+ STEP 6 RESULT: [
834
+ "endpoint: POST /rest/V1/guest-carts/{cartId}/items | auth: none | path: cartId (from POST /rest/V1/guest-carts) | body: cartItem.sku, cartItem.qty, cartItem.quote_id (same as cartId) | returns: item_id",
835
+ ...
836
+ ]
837
+
838
+ STEP 7 ACTION: curl_exec("curl -X POST 'http://ec2-.../rest/V1/guest-carts/cart-abc123/items' -H 'Content-Type: application/json' -d '{\"cartItem\":{\"sku\":\"MH01\",\"qty\":1,\"quote_id\":\"cart-abc123\"}}'")
839
+ STEP 7 RESULT: {"status_code": 200, "body": {"item_id": 5, "sku": "MH01", "qty": 1}}
840
+
841
+ STEP 8 ACTION: done("Radiant Tee (MH01) added to guest cart cart-abc123, item_id 5")
842
+ ```
843
+
844
+ Value threading happens entirely through the multi-turn context. The model reads `"MH01"` from step 2's result and `"cart-abc123"` from step 4's result directly — no explicit store/retrieve tools needed.
845
+
846
+ ---
847
+
__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ """HARvestGym OpenEnv Environment."""
catalogs/forum.json ADDED
@@ -0,0 +1,1517 @@
 
 
+ {
+   "_meta": {
+     "generated": "2026-04-08",
+     "websockets": "none — no Mercure, Pusher, or WebSocket integration found in codebase"
+   },
+   "endpoints": [
+     {
+       "api_type": "form",
+       "route_name": "login_check",
+       "endpoint": "POST /login_check",
+       "auth": "none",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {},
+       "query_params": {},
+       "form_params": {
+         "_username": { "type": "string", "source": "TASK_SPEC", "notes": "Login username" },
+         "_password": { "type": "string", "source": "TASK_SPEC", "notes": "Login password" },
+         "_remember_me": { "type": "checkbox", "source": "STATIC", "notes": "Optional; value 'on' to persist session" },
+         "_csrf_token": { "type": "string", "source": "AUTH_FLOW", "notes": "Extracted from login page HTML; token id = 'authenticate'" }
+       },
+       "response_key_fields": ["Set-Cookie: PHPSESSID", "Set-Cookie: REMEMBERME"]
+     },
+     {
+       "api_type": "form",
+       "route_name": "log_out",
+       "endpoint": "GET /log_out",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {},
+       "query_params": {
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token for logout; csrf_parameter name is 'token'" }
+       },
+       "form_params": {},
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "registration",
+       "endpoint": "POST /registration",
+       "auth": "none",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {},
+       "query_params": {},
+       "form_params": {
+         "user_type[username]": { "type": "string", "source": "TASK_SPEC", "notes": "Username; 3-25 chars" },
+         "user_type[password][first]": { "type": "string", "source": "TASK_SPEC", "notes": "Password" },
+         "user_type[password][second]": { "type": "string", "source": "TASK_SPEC", "notes": "Repeat password" },
+         "user_type[email]": { "type": "string", "source": "TASK_SPEC", "notes": "Optional email" },
+         "user_type[phone]": { "type": "string", "source": "STATIC", "notes": "Honeypot — must be left empty" },
+         "user_type[verification]": { "type": "string", "source": "AUTH_FLOW", "notes": "Captcha answer if registration captcha enabled" },
+         "user_type[_token]": { "type": "string", "source": "AUTH_FLOW", "notes": "Symfony CSRF token extracted from form HTML" }
+       },
+       "response_key_fields": ["redirect to login link on success"]
+     },
+     {
+       "api_type": "form",
+       "route_name": "submit",
+       "endpoint": "POST /submit/{forum_name}",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "TASK_SPEC", "notes": "Optional; forum slug; can be omitted to pre-select from form field" }
+       },
+       "query_params": {},
+       "form_params": {
+         "submission[title]": { "type": "string", "source": "TASK_SPEC", "notes": "Post title; max 300 chars" },
+         "submission[url]": { "type": "string", "source": "TASK_SPEC", "notes": "URL for link posts; optional" },
+         "submission[body]": { "type": "markdown", "source": "TASK_SPEC", "notes": "Post body; optional" },
+         "submission[mediaType]": { "type": "string", "source": "STATIC", "notes": "One of: url, image; only present if user can upload images" },
+         "submission[image]": { "type": "file", "source": "TASK_SPEC", "notes": "Image file if mediaType=image; multipart/form-data required" },
+         "submission[forum]": { "type": "string", "source": "TASK_SPEC", "notes": "Forum name if not in path" },
+         "submission[userFlag]": { "type": "string", "source": "TASK_SPEC", "notes": "Optional user/mod flair flag" },
+         "submission[email]": { "type": "string", "source": "STATIC", "notes": "Honeypot — must be left empty" },
+         "submission[_token]": { "type": "string", "source": "AUTH_FLOW", "notes": "Symfony CSRF token from form HTML" }
+       },
+       "response_key_fields": ["redirect to /f/{forum_name}/{submission_id}/{slug} on success"]
+     },
+     {
+       "api_type": "form",
+       "route_name": "edit_submission",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/edit",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL", "from_endpoint": "submit", "from_field": "forum_name" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL", "from_endpoint": "submit", "from_field": "submission_id" },
+         "slug": { "type": "string", "source": "PREV_CALL", "from_endpoint": "submit", "from_field": "slug", "notes": "Can be '-' as placeholder" }
+       },
+       "query_params": {},
+       "form_params": {
+         "submission[title]": { "type": "string", "source": "TASK_SPEC" },
+         "submission[url]": { "type": "string", "source": "TASK_SPEC", "notes": "Only for url-type submissions" },
+         "submission[body]": { "type": "markdown", "source": "TASK_SPEC" },
+         "submission[userFlag]": { "type": "string", "source": "TASK_SPEC" },
+         "submission[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "submission_delete_own",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/delete",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL" },
+         "slug": { "type": "string", "source": "PREV_CALL", "notes": "Can be '-'" }
+       },
+       "query_params": {},
+       "form_params": {
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'delete_submission'" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "submission_mod_delete",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/mod_delete",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL" },
+         "slug": { "type": "string", "source": "PREV_CALL" }
+       },
+       "query_params": {},
+       "form_params": {
+         "delete_reason[reason]": { "type": "string", "source": "TASK_SPEC", "notes": "Moderator deletion reason" },
+         "delete_reason[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "submission_purge",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/purge",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL" },
+         "slug": { "type": "string", "source": "PREV_CALL" }
+       },
+       "query_params": {},
+       "form_params": {
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'purge_submission'" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "submission_restore",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/restore",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL" },
+         "slug": { "type": "string", "source": "PREV_CALL" }
+       },
+       "query_params": {},
+       "form_params": {
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'restore_submission'" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "lock",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/lock",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL" },
+         "slug": { "type": "string", "source": "PREV_CALL" }
+       },
+       "query_params": {},
+       "form_params": {
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'lock'" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "unlock",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/unlock",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL" },
+         "slug": { "type": "string", "source": "PREV_CALL" }
+       },
+       "query_params": {},
+       "form_params": {
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'lock'" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "pin",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/pin",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL" },
+         "slug": { "type": "string", "source": "PREV_CALL" }
+       },
+       "query_params": {},
+       "form_params": {
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'pin'" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "unpin",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/unpin",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL" },
+         "slug": { "type": "string", "source": "PREV_CALL" }
+       },
+       "query_params": {},
+       "form_params": {
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'pin'" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "submission_flair",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/flair",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL" },
+         "slug": { "type": "string", "source": "PREV_CALL" }
+       },
+       "query_params": {},
+       "form_params": {
+         "custom_text_flair[text]": { "type": "string", "source": "TASK_SPEC", "notes": "Flair text to apply" },
+         "custom_text_flair[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "submission_remove_flairs",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/remove_flairs",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL" },
+         "slug": { "type": "string", "source": "PREV_CALL" }
+       },
+       "query_params": {},
+       "form_params": {
+         "id[]": { "type": "uuid[]", "source": "PREV_CALL", "notes": "Array of flair UUIDs to remove" },
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'remove_flair'" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "submission_vote",
+       "endpoint": "POST /sv/{id}.{_format}",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "id": { "type": "integer", "source": "PREV_CALL", "from_endpoint": "submit or submission list", "from_field": "submission_id" },
+         "_format": { "type": "string", "source": "STATIC", "notes": "html or json; use json for AJAX" }
+       },
+       "query_params": {},
+       "form_params": {
+         "choice": { "type": "integer", "source": "STATIC", "notes": "1 = upvote, -1 = downvote, 0 = retract" },
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'vote'" }
+       },
+       "response_key_fields": ["netScore"]
+     },
+     {
+       "api_type": "form",
+       "route_name": "comment_post",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/comment/{comment_id}",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL" },
+         "slug": { "type": "string", "source": "PREV_CALL", "notes": "Can be '-'" },
+         "comment_id": { "type": "integer", "source": "PREV_CALL", "notes": "Parent comment ID for reply; omit/null for top-level reply to submission" }
+       },
+       "query_params": {},
+       "form_params": {
+         "reply_to_submission_{submissionId}[comment]": { "type": "markdown", "source": "TASK_SPEC", "notes": "Form name is 'reply_to_submission_{id}' for top-level, 'reply_to_comment_{id}' for replies; field path 'body'" },
+         "reply_to_submission_{submissionId}[userFlag]": { "type": "string", "source": "TASK_SPEC", "notes": "Optional flair flag" },
+         "reply_to_submission_{submissionId}[email]": { "type": "string", "source": "STATIC", "notes": "Honeypot — must be left empty" },
+         "reply_to_submission_{submissionId}[_token]": { "type": "string", "source": "AUTH_FLOW", "notes": "Symfony CSRF token" }
+       },
+       "response_key_fields": ["redirect to comment anchor on success"]
+     },
+     {
+       "api_type": "form",
+       "route_name": "edit_comment",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/comment/{comment_id}/edit",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL" },
+         "slug": { "type": "string", "source": "PREV_CALL" },
+         "comment_id": { "type": "integer", "source": "PREV_CALL" }
+       },
+       "query_params": {},
+       "form_params": {
+         "comment[comment]": { "type": "markdown", "source": "TASK_SPEC" },
+         "comment[userFlag]": { "type": "string", "source": "TASK_SPEC" },
+         "comment[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "comment_delete_own",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/comment/{comment_id}/delete_own",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL" },
+         "slug": { "type": "string", "source": "PREV_CALL" },
+         "comment_id": { "type": "integer", "source": "PREV_CALL" }
+       },
+       "query_params": {},
+       "form_params": {
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'delete_own_comment'" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "comment_delete",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/comment/{comment_id}/delete",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL" },
+         "slug": { "type": "string", "source": "PREV_CALL" },
+         "comment_id": { "type": "integer", "source": "PREV_CALL" }
+       },
+       "query_params": {},
+       "form_params": {
+         "delete_reason[reason]": { "type": "string", "source": "TASK_SPEC" },
+         "delete_reason[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "comment_delete_thread",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/comment/{comment_id}/delete_thread",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL" },
+         "slug": { "type": "string", "source": "PREV_CALL" },
+         "comment_id": { "type": "integer", "source": "PREV_CALL" }
+       },
+       "query_params": {},
+       "form_params": {
+         "delete_reason[reason]": { "type": "string", "source": "TASK_SPEC" },
+         "delete_reason[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "comment_purge",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/comment/{comment_id}/purge",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL" },
+         "slug": { "type": "string", "source": "PREV_CALL" },
+         "comment_id": { "type": "integer", "source": "PREV_CALL" }
+       },
+       "query_params": {},
+       "form_params": {
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'purge_comment'" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "comment_restore",
+       "endpoint": "POST /f/{forum_name}/{submission_id}/{slug}/comment/{comment_id}/restore",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "PREV_CALL" },
+         "submission_id": { "type": "integer", "source": "PREV_CALL" },
+         "slug": { "type": "string", "source": "PREV_CALL" },
+         "comment_id": { "type": "integer", "source": "PREV_CALL" }
+       },
+       "query_params": {},
+       "form_params": {
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'restore_comment'" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "comment_vote",
+       "endpoint": "POST /cv/{id}.{_format}",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "id": { "type": "integer", "source": "PREV_CALL", "notes": "Comment ID" },
+         "_format": { "type": "string", "source": "STATIC", "notes": "html or json" }
+       },
+       "query_params": {},
+       "form_params": {
+         "choice": { "type": "integer", "source": "STATIC", "notes": "1 = upvote, -1 = downvote, 0 = retract" },
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'vote'" }
+       },
+       "response_key_fields": ["netScore"]
+     },
+     {
+       "api_type": "form",
+       "route_name": "create_forum",
+       "endpoint": "POST /create_forum",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {},
+       "query_params": {},
+       "form_params": {
+         "forum[name]": { "type": "string", "source": "TASK_SPEC", "notes": "Forum slug; 3-25 chars" },
+         "forum[title]": { "type": "string", "source": "TASK_SPEC", "notes": "Forum display title" },
+         "forum[description]": { "type": "string", "source": "TASK_SPEC" },
+         "forum[sidebar]": { "type": "markdown", "source": "TASK_SPEC" },
+         "forum[tags]": { "type": "string[]", "source": "TASK_SPEC", "notes": "Array of tag names" },
+         "forum[moderationLogPublic]": { "type": "checkbox", "source": "TASK_SPEC", "notes": "Admins/mods only" },
+         "forum[featured]": { "type": "checkbox", "source": "TASK_SPEC", "notes": "ROLE_ADMIN only" },
+         "forum[email]": { "type": "string", "source": "STATIC", "notes": "Honeypot — must be left empty" },
+         "forum[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": ["redirect to /f/{forum_name}"]
+     },
+     {
+       "api_type": "form",
+       "route_name": "edit_forum",
+       "endpoint": "POST /f/{forum_name}/edit",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "TASK_SPEC" }
+       },
+       "query_params": {},
+       "form_params": {
+         "forum[name]": { "type": "string", "source": "TASK_SPEC" },
+         "forum[title]": { "type": "string", "source": "TASK_SPEC" },
+         "forum[description]": { "type": "string", "source": "TASK_SPEC" },
+         "forum[sidebar]": { "type": "markdown", "source": "TASK_SPEC" },
+         "forum[tags]": { "type": "string[]", "source": "TASK_SPEC" },
+         "forum[moderationLogPublic]": { "type": "checkbox", "source": "TASK_SPEC" },
+         "forum[featured]": { "type": "checkbox", "source": "TASK_SPEC", "notes": "ROLE_ADMIN only" },
+         "forum[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "delete_forum",
+       "endpoint": "POST /f/{forum_name}/delete",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "TASK_SPEC" }
+       },
+       "query_params": {},
+       "form_params": {
+         "confirm_deletion[name]": { "type": "string", "source": "DERIVED", "notes": "Must equal the forum name exactly" },
+         "confirm_deletion[confirm]": { "type": "checkbox", "source": "STATIC", "notes": "Must be checked" },
+         "confirm_deletion[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "forum_appearance",
+       "endpoint": "POST /f/{forum_name}/appearance",
+       "auth": "session_cookie+csrf",
+       "content_type": "multipart/form-data",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "TASK_SPEC" }
+       },
+       "query_params": {},
+       "form_params": {
+         "forum_appearance[suggestedTheme]": { "type": "string", "source": "TASK_SPEC", "notes": "Theme UUID or empty" },
+         "forum_appearance[backgroundImage]": { "type": "file", "source": "TASK_SPEC", "notes": "Image file; only if user can upload images" },
+         "forum_appearance[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "subscribe",
+       "endpoint": "POST /f/{forum_name}/subscribe.{_format}",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "TASK_SPEC" },
+         "_format": { "type": "string", "source": "STATIC", "notes": "html or json" }
+       },
+       "query_params": {},
+       "form_params": {
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'subscribe'" }
+       },
+       "response_key_fields": ["subscribed"]
+     },
+     {
+       "api_type": "form",
+       "route_name": "unsubscribe",
+       "endpoint": "POST /f/{forum_name}/unsubscribe.{_format}",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "TASK_SPEC" },
+         "_format": { "type": "string", "source": "STATIC", "notes": "html or json" }
+       },
+       "query_params": {},
+       "form_params": {
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'subscribe'" }
+       },
+       "response_key_fields": ["subscribed"]
+     },
+     {
+       "api_type": "form",
+       "route_name": "add_moderator",
+       "endpoint": "POST /f/{forum_name}/add_moderator",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "TASK_SPEC" }
+       },
+       "query_params": {},
+       "form_params": {
+         "moderator[user]": { "type": "string", "source": "TASK_SPEC", "notes": "Username to add as moderator" },
+         "moderator[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "remove_moderator",
+       "endpoint": "POST /f/{forum_name}/remove_moderator/{moderator_id}",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "TASK_SPEC" },
+         "moderator_id": { "type": "uuid", "source": "PREV_CALL", "notes": "Moderator record UUID from moderators list" }
+       },
+       "query_params": {},
+       "form_params": {
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'remove_moderator'" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "forum_ban",
+       "endpoint": "POST /f/{forum_name}/ban/{username}",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "TASK_SPEC" },
+         "username": { "type": "string", "source": "TASK_SPEC" }
+       },
+       "query_params": {},
+       "form_params": {
+         "forum_ban[reason]": { "type": "string", "source": "TASK_SPEC" },
+         "forum_ban[expires][date]": { "type": "date", "source": "TASK_SPEC", "notes": "Optional expiry date; format YYYY-MM-DD" },
+         "forum_ban[expires][time]": { "type": "time", "source": "TASK_SPEC", "notes": "Optional expiry time; format HH:MM" },
+         "forum_ban[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "forum_unban",
+       "endpoint": "POST /f/{forum_name}/unban/{username}",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "forum_name": { "type": "string", "source": "TASK_SPEC" },
+         "username": { "type": "string", "source": "TASK_SPEC" }
+       },
+       "query_params": {},
+       "form_params": {
+         "forum_ban[reason]": { "type": "string", "source": "TASK_SPEC" },
+         "forum_ban[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "forum_tag_edit",
+       "endpoint": "POST /tag/{name}/edit",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "name": { "type": "string", "source": "TASK_SPEC" }
+       },
+       "query_params": {},
+       "form_params": {
+         "forum_tag[name]": { "type": "string", "source": "TASK_SPEC" },
+         "forum_tag[description]": { "type": "string", "source": "TASK_SPEC" },
+         "forum_tag[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "compose_message",
+       "endpoint": "POST /user/{username}/compose_message",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "username": { "type": "string", "source": "TASK_SPEC", "notes": "Recipient username" }
+       },
+       "query_params": {},
+       "form_params": {
+         "message[body]": { "type": "string", "source": "TASK_SPEC" },
+         "message[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": ["redirect to /messages/thread/{id}; thread id from redirect"]
+     },
+     {
+       "api_type": "form",
+       "route_name": "reply_to_message",
+       "endpoint": "POST /message_reply/{id}",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "id": { "type": "integer", "source": "PREV_CALL", "from_endpoint": "compose_message", "notes": "Thread ID" }
+       },
+       "query_params": {},
+       "form_params": {
+         "message[body]": { "type": "string", "source": "TASK_SPEC" },
+         "message[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "delete_message",
+       "endpoint": "POST /messages/message/{id}/delete",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "id": { "type": "uuid", "source": "PREV_CALL", "notes": "Message UUID" }
+       },
+       "query_params": {},
+       "form_params": {
+         "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'delete_message'" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "request_password_reset",
+       "endpoint": "POST /reset_password",
+       "auth": "none",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {},
+       "query_params": {},
+       "form_params": {
+         "request_password_reset[email]": { "type": "string", "source": "TASK_SPEC" },
+         "request_password_reset[verification]": { "type": "string", "source": "AUTH_FLOW", "notes": "Captcha answer" },
+         "request_password_reset[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "password_reset",
+       "endpoint": "POST /reset_password/{id}/{expires}/{checksum}",
+       "auth": "none",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "id": { "type": "integer", "source": "AUTH_FLOW", "notes": "User ID from reset email link" },
+         "expires": { "type": "integer", "source": "AUTH_FLOW", "notes": "Unix timestamp from reset email link" },
+         "checksum": { "type": "string", "source": "AUTH_FLOW", "notes": "HMAC checksum from reset email link" }
+       },
+       "query_params": {},
+       "form_params": {
+         "user[username]": { "type": "string", "source": "TASK_SPEC" },
+         "user[password][first]": { "type": "string", "source": "TASK_SPEC" },
+         "user[password][second]": { "type": "string", "source": "TASK_SPEC" },
+         "user[email]": { "type": "string", "source": "TASK_SPEC", "notes": "Optional" },
+         "user[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "edit_user",
+       "endpoint": "POST /user/{username}/account",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "username": { "type": "string", "source": "AUTH_FLOW", "notes": "Must match authenticated user or admin" }
+       },
+       "query_params": {},
+       "form_params": {
+         "user[username]": { "type": "string", "source": "TASK_SPEC", "notes": "May be disabled if username change not allowed" },
+         "user[password][first]": { "type": "string", "source": "TASK_SPEC", "notes": "Optional; leave blank to keep current" },
+         "user[password][second]": { "type": "string", "source": "TASK_SPEC" },
+         "user[email]": { "type": "string", "source": "TASK_SPEC", "notes": "Optional" },
+         "user[phone]": { "type": "string", "source": "STATIC", "notes": "Honeypot — must be left empty" },
+         "user[_token]": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "response_key_fields": []
+     },
+     {
+       "api_type": "form",
+       "route_name": "user_settings",
+       "endpoint": "POST /user/{username}/preferences",
+       "auth": "session_cookie+csrf",
+       "content_type": "application/x-www-form-urlencoded",
+       "path_params": {
+         "username": { "type": "string", "source": "AUTH_FLOW" }
+       },
+       "query_params": {},
+       "form_params": {
+         "user_settings[locale]": { "type": "string", "source": "TASK_SPEC", "notes": "BCP47 locale code" },
+         "user_settings[frontPage]": { "type": "string", "source": "TASK_SPEC", "notes": "featured|subscribed|all|moderated" },
+         "user_settings[frontPageSortMode]": { "type": "string", "source": "TASK_SPEC", "notes": "hot|new|active" },
+         "user_settings[openExternalLinksInNewTab]": { "type": "checkbox", "source": "TASK_SPEC" },
+         "user_settings[autoFetchSubmissionTitles]": { "type": "checkbox", "source": "TASK_SPEC" },
+         "user_settings[enablePostPreviews]": { "type": "checkbox", "source": "TASK_SPEC" },
+         "user_settings[showThumbnails]": { "type": "checkbox", "source": "TASK_SPEC" },
+         "user_settings[notifyOnReply]": { "type": "checkbox", "source": "TASK_SPEC" },
+         "user_settings[notifyOnMentions]": { "type": "checkbox", "source": "TASK_SPEC" },
+         "user_settings[allowPrivateMessages]": { "type": "checkbox", "source": "TASK_SPEC" },
+         "user_settings[preferredFonts]": { "type": "string", "source": "TASK_SPEC" },
+         "user_settings[nightMode]": { "type": "string", "source": "TASK_SPEC", "notes": "0=auto, 1=light, 2=dark" },
+         "user_settings[preferredTheme]": { "type": "string", "source": "TASK_SPEC", "notes": "Theme UUID" },
+         "user_settings[showCustomStylesheets]": { "type": "checkbox", "source": "TASK_SPEC" },
+         "user_settings[poppersEnabled]": { "type": "checkbox", "source": "TASK_SPEC" },
762
+ "user_settings[fullWidthDisplayEnabled]": { "type": "checkbox", "source": "TASK_SPEC" },
763
+ "user_settings[submissionLinkDestination]": { "type": "string", "source": "TASK_SPEC" },
764
+ "user_settings[_token]": { "type": "string", "source": "AUTH_FLOW" }
765
+ },
766
+ "response_key_fields": []
767
+ },
768
+ {
769
+ "api_type": "form",
770
+ "route_name": "edit_biography",
771
+ "endpoint": "POST /user/{username}/edit_biography",
772
+ "auth": "session_cookie+csrf",
773
+ "content_type": "application/x-www-form-urlencoded",
774
+ "path_params": {
775
+ "username": { "type": "string", "source": "AUTH_FLOW" }
776
+ },
777
+ "query_params": {},
778
+ "form_params": {
779
+ "user_biography[biography]": { "type": "markdown", "source": "TASK_SPEC" },
780
+ "user_biography[_token]": { "type": "string", "source": "AUTH_FLOW" }
781
+ },
782
+ "response_key_fields": []
783
+ },
784
+ {
785
+ "api_type": "form",
786
+ "route_name": "delete_account",
787
+ "endpoint": "POST /user/{username}/delete_account",
788
+ "auth": "session_cookie+csrf",
789
+ "content_type": "application/x-www-form-urlencoded",
790
+ "path_params": {
791
+ "username": { "type": "string", "source": "AUTH_FLOW" }
792
+ },
793
+ "query_params": {},
794
+ "form_params": {
795
+ "confirm_deletion[name]": { "type": "string", "source": "DERIVED", "notes": "Must equal the username exactly" },
796
+ "confirm_deletion[confirm]": { "type": "checkbox", "source": "STATIC" },
797
+ "confirm_deletion[_token]": { "type": "string", "source": "AUTH_FLOW" }
798
+ },
799
+ "response_key_fields": []
800
+ },
801
+ {
802
+ "api_type": "form",
803
+ "route_name": "block_user",
804
+ "endpoint": "POST /user/{username}/block_user",
805
+ "auth": "session_cookie+csrf",
806
+ "content_type": "application/x-www-form-urlencoded",
807
+ "path_params": {
808
+ "username": { "type": "string", "source": "TASK_SPEC", "notes": "User to block" }
809
+ },
810
+ "query_params": {},
811
+ "form_params": {
812
+ "user_block[comment]": { "type": "string", "source": "TASK_SPEC", "notes": "Optional private note about block" },
813
+ "user_block[_token]": { "type": "string", "source": "AUTH_FLOW" }
814
+ },
815
+ "response_key_fields": []
816
+ },
817
+ {
818
+ "api_type": "form",
819
+ "route_name": "unblock_user",
820
+ "endpoint": "POST /user/{username}/unblock_user",
821
+ "auth": "session_cookie+csrf",
822
+ "content_type": "application/x-www-form-urlencoded",
823
+ "path_params": {
824
+ "username": { "type": "string", "source": "TASK_SPEC" }
825
+ },
826
+ "query_params": {},
827
+ "form_params": {
828
+ "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'unblock'" }
829
+ },
830
+ "response_key_fields": []
831
+ },
832
+ {
833
+ "api_type": "form",
834
+ "route_name": "clear_notifications",
835
+ "endpoint": "POST /clear_notifications",
836
+ "auth": "session_cookie+csrf",
837
+ "content_type": "application/x-www-form-urlencoded",
838
+ "path_params": {},
839
+ "query_params": {},
840
+ "form_params": {
841
+ "id[]": { "type": "uuid[]", "source": "PREV_CALL", "notes": "Array of notification UUIDs to clear; omit to clear all" },
842
+ "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'clear_notifications'" }
843
+ },
844
+ "response_key_fields": []
845
+ },
846
+ {
+ "api_type": "form",
+ "route_name": "user_whitelist",
+ "endpoint": "POST /user/{username}/whitelist",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {
+ "username": { "type": "string", "source": "TASK_SPEC" }
+ },
+ "query_params": {},
+ "form_params": {
+ "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'whitelist'" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "user_dewhitelist",
+ "endpoint": "POST /user/{username}/dewhitelist",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {
+ "username": { "type": "string", "source": "TASK_SPEC" }
+ },
+ "query_params": {},
+ "form_params": {
+ "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'whitelist'" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "hide_forum",
+ "endpoint": "POST /user/{username}/hide_forum/{forum}",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {
+ "username": { "type": "string", "source": "AUTH_FLOW" },
+ "forum": { "type": "string", "source": "TASK_SPEC", "notes": "Forum name" }
+ },
+ "query_params": {},
+ "form_params": {
+ "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'hide_forum'" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "unhide_forum",
+ "endpoint": "POST /user/{username}/unhide_forum/{forum}",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {
+ "username": { "type": "string", "source": "AUTH_FLOW" },
+ "forum": { "type": "string", "source": "TASK_SPEC" }
+ },
+ "query_params": {},
+ "form_params": {
+ "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'hide_forum'" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "change_night_mode",
+ "endpoint": "POST /night_mode.{_format}",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {
+ "_format": { "type": "string", "source": "STATIC", "notes": "html or json" }
+ },
+ "query_params": {},
+ "form_params": {
+ "nightMode": { "type": "string", "source": "TASK_SPEC", "notes": "0=auto, 1=light, 2=dark" },
+ "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'night_mode'" }
+ },
+ "response_key_fields": ["nightMode"]
+ },
+ {
+ "api_type": "form",
+ "route_name": "ban_user",
+ "endpoint": "POST /bans/ban_user/{username}",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {
+ "username": { "type": "string", "source": "TASK_SPEC" }
+ },
+ "query_params": {},
+ "form_params": {
+ "ban_user[reason]": { "type": "string", "source": "TASK_SPEC" },
+ "ban_user[expires][date]": { "type": "date", "source": "TASK_SPEC", "notes": "Optional; YYYY-MM-DD" },
+ "ban_user[expires][time]": { "type": "time", "source": "TASK_SPEC", "notes": "Optional; HH:MM" },
+ "ban_user[ban_ip]": { "type": "checkbox", "source": "TASK_SPEC", "notes": "Also ban associated IPs" },
+ "ban_user[ips]": { "type": "string", "source": "DERIVED", "notes": "Comma/newline-separated IP list; auto-populated from user history" },
+ "ban_user[_token]": { "type": "string", "source": "AUTH_FLOW" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "unban_user",
+ "endpoint": "POST /bans/unban_user/{username}",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {
+ "username": { "type": "string", "source": "TASK_SPEC" }
+ },
+ "query_params": {},
+ "form_params": {
+ "unban_user[reason]": { "type": "string", "source": "TASK_SPEC" },
+ "unban_user[unban_ips]": { "type": "checkbox", "source": "TASK_SPEC", "notes": "Also lift associated IP bans" },
+ "unban_user[_token]": { "type": "string", "source": "AUTH_FLOW" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "ban_ip",
+ "endpoint": "POST /bans/ban_ip",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {},
+ "query_params": {},
+ "form_params": {
+ "ip_ban[ip]": { "type": "string", "source": "TASK_SPEC", "notes": "IP address to ban" },
+ "ip_ban[reason]": { "type": "string", "source": "TASK_SPEC" },
+ "ip_ban[expires][date]": { "type": "date", "source": "TASK_SPEC", "notes": "Optional" },
+ "ip_ban[expires][time]": { "type": "time", "source": "TASK_SPEC", "notes": "Optional" },
+ "ip_ban[user]": { "type": "string", "source": "TASK_SPEC", "notes": "Optional; username associated with this IP" },
+ "ip_ban[_token]": { "type": "string", "source": "AUTH_FLOW" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "unban_ips",
+ "endpoint": "POST /bans/unban_ips",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {},
+ "query_params": {},
+ "form_params": {
+ "ban[]": { "type": "integer[]", "source": "PREV_CALL", "notes": "Array of IP ban IDs to remove" },
+ "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'unban_ips'" }
+ },
+ "response_key_fields": []
+ },
993
+ {
+ "api_type": "form",
+ "route_name": "bad_phrase_add",
+ "endpoint": "POST /site/bad_phrases/add",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {},
+ "query_params": {},
+ "form_params": {
+ "bad_phrase[phrase]": { "type": "string", "source": "TASK_SPEC" },
+ "bad_phrase[phraseType]": { "type": "string", "source": "STATIC", "notes": "text or regex" },
+ "bad_phrase[_token]": { "type": "string", "source": "AUTH_FLOW" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "bad_phrase_remove",
+ "endpoint": "POST /site/bad_phrases/remove",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {},
+ "query_params": {},
+ "form_params": {
+ "remove_bad_phrase[]": { "type": "uuid[]", "source": "PREV_CALL", "notes": "Array of bad phrase UUIDs to remove" },
+ "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'remove_bad_phrase'" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "site_settings",
+ "endpoint": "POST /site/settings",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {},
+ "query_params": {},
+ "form_params": {
+ "site_settings[siteName]": { "type": "string", "source": "TASK_SPEC" },
+ "site_settings[registrationOpen]": { "type": "checkbox", "source": "TASK_SPEC" },
+ "site_settings[usernameChangeEnabled]": { "type": "checkbox", "source": "TASK_SPEC" },
+ "site_settings[unwhitelistedUserMessagesEnabled]": { "type": "checkbox", "source": "TASK_SPEC" },
+ "site_settings[defaultSortMode]": { "type": "string", "source": "TASK_SPEC", "notes": "hot|active|new" },
+ "site_settings[defaultTheme]": { "type": "string", "source": "TASK_SPEC", "notes": "Theme UUID or empty" },
+ "site_settings[urlImagesEnabled]": { "type": "checkbox", "source": "TASK_SPEC" },
+ "site_settings[trashEnabled]": { "type": "checkbox", "source": "TASK_SPEC" },
+ "site_settings[wikiEnabled]": { "type": "checkbox", "source": "TASK_SPEC" },
+ "site_settings[wikiLogPublic]": { "type": "checkbox", "source": "TASK_SPEC" },
+ "site_settings[forumCreateRole]": { "type": "string", "source": "TASK_SPEC", "notes": "ROLE_ADMIN|ROLE_WHITELISTED|ROLE_USER" },
+ "site_settings[moderatorsCanSetForumLogVisibility]": { "type": "checkbox", "source": "TASK_SPEC" },
+ "site_settings[imageUploadRole]": { "type": "string", "source": "TASK_SPEC" },
+ "site_settings[wikiEditRole]": { "type": "string", "source": "TASK_SPEC" },
+ "site_settings[registrationCaptchaEnabled]": { "type": "checkbox", "source": "TASK_SPEC" },
+ "site_settings[submissionLinkDestination]": { "type": "string", "source": "TASK_SPEC" },
+ "site_settings[_token]": { "type": "string", "source": "AUTH_FLOW" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "theme_create_css",
+ "endpoint": "POST /site/themes/css/create",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {},
+ "query_params": {},
+ "form_params": {
+ "css_theme[name]": { "type": "string", "source": "TASK_SPEC" },
+ "css_theme[css]": { "type": "string", "source": "TASK_SPEC" },
+ "css_theme[_token]": { "type": "string", "source": "AUTH_FLOW" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "theme_edit_css",
+ "endpoint": "POST /site/themes/css/{id}/edit",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {
+ "id": { "type": "uuid", "source": "PREV_CALL" }
+ },
+ "query_params": {},
+ "form_params": {
+ "css_theme[name]": { "type": "string", "source": "TASK_SPEC" },
+ "css_theme[css]": { "type": "string", "source": "TASK_SPEC" },
+ "css_theme[_token]": { "type": "string", "source": "AUTH_FLOW" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "theme_delete",
+ "endpoint": "POST /site/themes/{id}/delete",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {
+ "id": { "type": "uuid", "source": "PREV_CALL" }
+ },
+ "query_params": {},
+ "form_params": {
+ "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'delete_theme'" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "theme_sync",
+ "endpoint": "POST /site/themes/sync",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {},
+ "query_params": {},
+ "form_params": {
+ "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'sync_themes'" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "wiki_create",
+ "endpoint": "POST /wiki/_create/{path}",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {
+ "path": { "type": "string", "source": "TASK_SPEC", "notes": "Wiki page path; can be empty to enter in form" }
+ },
+ "query_params": {},
+ "form_params": {
+ "wiki[path]": { "type": "string", "source": "TASK_SPEC", "notes": "Only present when path is empty" },
+ "wiki[title]": { "type": "string", "source": "TASK_SPEC" },
+ "wiki[body]": { "type": "markdown", "source": "TASK_SPEC" },
+ "wiki[_token]": { "type": "string", "source": "AUTH_FLOW" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "wiki_edit",
+ "endpoint": "POST /wiki/_edit/{path}",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {
+ "path": { "type": "string", "source": "TASK_SPEC" }
+ },
+ "query_params": {},
+ "form_params": {
+ "wiki[title]": { "type": "string", "source": "TASK_SPEC" },
+ "wiki[body]": { "type": "markdown", "source": "TASK_SPEC" },
+ "wiki[_token]": { "type": "string", "source": "AUTH_FLOW" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "wiki_delete",
+ "endpoint": "POST /wiki/_delete/{path}",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {
+ "path": { "type": "string", "source": "TASK_SPEC" }
+ },
+ "query_params": {},
+ "form_params": {
+ "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'wiki_delete'" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "wiki_lock",
+ "endpoint": "POST /wiki/_lock/{path}",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {
+ "path": { "type": "string", "source": "TASK_SPEC" }
+ },
+ "query_params": {},
+ "form_params": {
+ "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'wiki_lock'" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "form",
+ "route_name": "wiki_unlock",
+ "endpoint": "POST /wiki/_unlock/{path}",
+ "auth": "session_cookie+csrf",
+ "content_type": "application/x-www-form-urlencoded",
+ "path_params": {
+ "path": { "type": "string", "source": "TASK_SPEC" }
+ },
+ "query_params": {},
+ "form_params": {
+ "token": { "type": "string", "source": "AUTH_FLOW", "notes": "CSRF token id = 'wiki_lock'" }
+ },
+ "response_key_fields": []
+ },
1190
+ },
1191
+ {
1192
+ "api_type": "rest",
1193
+ "route_name": "fetch_title",
1194
+ "endpoint": "POST /ft.json",
1195
+ "auth": "session_cookie",
1196
+ "content_type": "application/x-www-form-urlencoded",
1197
+ "path_params": {},
1198
+ "query_params": {},
1199
+ "form_params": {
1200
+ "url": { "type": "string", "source": "TASK_SPEC", "notes": "URL whose title to fetch; must be a valid URL" }
1201
+ },
1202
+ "body_params": {},
1203
+ "response_key_fields": ["title"]
1204
+ },
1205
+ {
1206
+ "api_type": "rest",
1207
+ "route_name": "user_popper",
1208
+ "endpoint": "GET /_up/{username}",
1209
+ "auth": "none",
1210
+ "content_type": "none",
1211
+ "path_params": {
1212
+ "username": { "type": "string", "source": "TASK_SPEC" }
1213
+ },
1214
+ "query_params": {},
1215
+ "body_params": {},
1216
+ "response_key_fields": ["HTML fragment for user popover"]
1217
+ },
1218
+ {
1219
+ "api_type": "rest",
1220
+ "route_name": "comment_json",
1221
+ "endpoint": "GET /f/{forum_name}/{submission_id}/{slug}/comment/{comment_id}.json",
1222
+ "auth": "session_cookie",
1223
+ "content_type": "none",
1224
+ "path_params": {
1225
+ "forum_name": { "type": "string", "source": "PREV_CALL" },
1226
+ "submission_id": { "type": "integer", "source": "PREV_CALL" },
1227
+ "slug": { "type": "string", "source": "PREV_CALL" },
1228
+ "comment_id": { "type": "integer", "source": "PREV_CALL" }
1229
+ },
1230
+ "query_params": {},
1231
+ "body_params": {},
1232
+ "response_key_fields": ["id", "body", "author", "created_at", "net_score", "visibility"]
1233
+ },
1234
+ {
1235
+ "api_type": "rest",
1236
+ "route_name": "submission_json",
1237
+ "endpoint": "GET /f/{forum_name}/{submission_id}.json",
1238
+ "auth": "session_cookie",
1239
+ "content_type": "none",
1240
+ "path_params": {
1241
+ "forum_name": { "type": "string", "source": "PREV_CALL" },
1242
+ "submission_id": { "type": "integer", "source": "PREV_CALL" }
1243
+ },
1244
+ "query_params": {},
1245
+ "body_params": {},
1246
+ "response_key_fields": ["id", "title", "url", "body", "author", "forum", "created_at", "net_score"]
1247
+ },
1248
+ {
1249
+ "api_type": "rest",
1250
+ "route_name": "api_comments_list",
1251
+ "endpoint": "GET /api/comments",
1252
+ "auth": "session_cookie",
1253
+ "content_type": "application/json",
1254
+ "path_params": {},
1255
+ "query_params": {},
1256
+ "body_params": {},
1257
+ "response_key_fields": ["id", "body", "author", "submission", "created_at", "net_score"]
1258
+ },
1259
+ {
1260
+ "api_type": "rest",
1261
+ "route_name": "api_comment_read",
1262
+ "endpoint": "GET /api/comments/{id}",
1263
+ "auth": "session_cookie",
1264
+ "content_type": "application/json",
1265
+ "path_params": {
1266
+ "id": { "type": "integer", "source": "PREV_CALL" }
1267
+ },
1268
+ "query_params": {},
1269
+ "body_params": {},
1270
+ "response_key_fields": ["id", "body", "author", "submission", "created_at", "net_score"]
1271
+ },
1272
+ {
1273
+ "api_type": "rest",
1274
+ "route_name": "api_comment_update",
1275
+ "endpoint": "PUT /api/comments/{id}",
1276
+ "auth": "session_cookie",
1277
+ "content_type": "application/json",
1278
+ "path_params": {
1279
+ "id": { "type": "integer", "source": "PREV_CALL" }
1280
+ },
1281
+ "query_params": {},
1282
+ "body_params": {
1283
+ "body": { "type": "string", "source": "TASK_SPEC", "notes": "Comment body markdown; denormalization group: comment:update" }
1284
+ },
1285
+ "response_key_fields": []
1286
+ },
1287
+ {
1288
+ "api_type": "rest",
1289
+ "route_name": "api_forum_read",
1290
+ "endpoint": "GET /api/forums/{id}",
1291
+ "auth": "session_cookie",
1292
+ "content_type": "application/json",
1293
+ "path_params": {
1294
+ "id": { "type": "integer", "source": "PREV_CALL" }
1295
+ },
1296
+ "query_params": {},
1297
+ "body_params": {},
1298
+ "response_key_fields": ["id", "name", "title", "description", "sidebar", "created_at", "subscriber_count"]
1299
+ },
1300
+ {
1301
+ "api_type": "rest",
1302
+ "route_name": "api_forum_read_by_name",
1303
+ "endpoint": "GET /api/forums/by_name/{name}",
1304
+ "auth": "session_cookie",
1305
+ "content_type": "application/json",
1306
+ "path_params": {
1307
+ "name": { "type": "string", "source": "TASK_SPEC" }
1308
+ },
1309
+ "query_params": {},
1310
+ "body_params": {},
1311
+ "response_key_fields": ["id", "name", "title", "description", "sidebar", "created_at", "subscriber_count"]
1312
+ },
1313
+ {
1314
+ "api_type": "rest",
1315
+ "route_name": "api_forum_create",
1316
+ "endpoint": "POST /api/forums",
1317
+ "auth": "session_cookie",
1318
+ "content_type": "application/json",
1319
+ "path_params": {},
1320
+ "query_params": {},
1321
+ "body_params": {
1322
+ "name": { "type": "string", "source": "TASK_SPEC", "notes": "Forum slug; denormalization group: forum:create" },
1323
+ "title": { "type": "string", "source": "TASK_SPEC" },
1324
+ "description": { "type": "string", "source": "TASK_SPEC" },
1325
+ "sidebar": { "type": "string", "source": "TASK_SPEC" }
1326
+ },
1327
+ "response_key_fields": ["id", "name"]
1328
+ },
1329
+ {
1330
+ "api_type": "rest",
1331
+ "route_name": "api_forum_update",
1332
+ "endpoint": "PUT /api/forums/{id}",
1333
+ "auth": "session_cookie",
1334
+ "content_type": "application/json",
1335
+ "path_params": {
1336
+ "id": { "type": "integer", "source": "PREV_CALL" }
1337
+ },
1338
+ "query_params": {},
1339
+ "body_params": {
1340
+ "title": { "type": "string", "source": "TASK_SPEC", "notes": "denormalization group: forum:update" },
1341
+ "description": { "type": "string", "source": "TASK_SPEC" },
1342
+ "sidebar": { "type": "string", "source": "TASK_SPEC" }
1343
+ },
1344
+ "response_key_fields": []
1345
+ },
1346
+ {
+ "api_type": "rest",
+ "route_name": "api_submissions_list",
+ "endpoint": "GET /api/submissions",
+ "auth": "session_cookie",
+ "content_type": "application/json",
+ "path_params": {},
+ "query_params": {
+ "sortBy": { "type": "string", "source": "TASK_SPEC", "notes": "hot|new|active; defaults to user preference" },
+ "filter": { "type": "string", "source": "TASK_SPEC", "notes": "featured|subscribed|moderated|all; defaults to user preference" }
+ },
+ "body_params": {},
+ "response_key_fields": ["id", "title", "url", "body", "forum", "author", "created_at", "net_score"]
+ },
+ {
+ "api_type": "rest",
+ "route_name": "api_submission_read",
+ "endpoint": "GET /api/submissions/{id}",
+ "auth": "session_cookie",
+ "content_type": "application/json",
+ "path_params": {
+ "id": { "type": "integer", "source": "PREV_CALL" }
+ },
+ "query_params": {},
+ "body_params": {},
+ "response_key_fields": ["id", "title", "url", "body", "forum", "author", "created_at", "net_score"]
+ },
+ {
+ "api_type": "rest",
+ "route_name": "api_submission_create",
+ "endpoint": "POST /api/submissions",
+ "auth": "session_cookie",
+ "content_type": "application/json",
+ "path_params": {},
+ "query_params": {},
+ "body_params": {
+ "title": { "type": "string", "source": "TASK_SPEC", "notes": "denormalization group: submission:create" },
+ "url": { "type": "string", "source": "TASK_SPEC" },
+ "body": { "type": "string", "source": "TASK_SPEC" },
+ "forum": { "type": "string|integer", "source": "TASK_SPEC", "notes": "Forum name or ID" },
+ "mediaType": { "type": "string", "source": "STATIC", "notes": "url|image" }
+ },
+ "response_key_fields": ["id", "forum", "title"]
+ },
+ {
+ "api_type": "rest",
+ "route_name": "api_submission_update",
+ "endpoint": "PUT /api/submissions/{id}",
+ "auth": "session_cookie",
+ "content_type": "application/json",
+ "path_params": {
+ "id": { "type": "integer", "source": "PREV_CALL" }
+ },
+ "query_params": {},
+ "body_params": {
+ "title": { "type": "string", "source": "TASK_SPEC", "notes": "denormalization group: submission:update" },
+ "url": { "type": "string", "source": "TASK_SPEC" },
+ "body": { "type": "string", "source": "TASK_SPEC" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "rest",
+ "route_name": "api_submission_delete",
+ "endpoint": "DELETE /api/submissions/{id}",
+ "auth": "session_cookie",
+ "content_type": "application/json",
+ "path_params": {
+ "id": { "type": "integer", "source": "PREV_CALL" }
+ },
+ "query_params": {},
+ "body_params": {},
+ "response_key_fields": []
+ },
+ {
+ "api_type": "rest",
+ "route_name": "api_submission_comments",
+ "endpoint": "GET /api/submissions/{id}/comments",
+ "auth": "session_cookie",
+ "content_type": "application/json",
+ "path_params": {
+ "id": { "type": "integer", "source": "PREV_CALL" }
+ },
+ "query_params": {},
+ "body_params": {},
+ "response_key_fields": ["id", "body", "author", "net_score", "replies"]
+ },
+ {
+ "api_type": "rest",
+ "route_name": "api_user_read",
+ "endpoint": "GET /api/users/{id}",
+ "auth": "session_cookie",
+ "content_type": "application/json",
+ "path_params": {
+ "id": { "type": "integer", "source": "PREV_CALL" }
+ },
+ "query_params": {},
+ "body_params": {},
+ "response_key_fields": ["id", "username", "created_at"]
+ },
+ {
+ "api_type": "rest",
+ "route_name": "api_user_self",
+ "endpoint": "GET /api/users/self",
+ "auth": "session_cookie",
+ "content_type": "application/json",
+ "path_params": {},
+ "query_params": {},
+ "body_params": {},
+ "response_key_fields": ["id", "username"]
+ },
+ {
+ "api_type": "rest",
+ "route_name": "api_user_read_preferences",
+ "endpoint": "GET /api/users/{id}/preferences",
+ "auth": "session_cookie",
+ "content_type": "application/json",
+ "path_params": {
+ "id": { "type": "integer", "source": "PREV_CALL" }
+ },
+ "query_params": {},
+ "body_params": {},
+ "response_key_fields": ["locale", "front_page", "front_page_sort_mode", "night_mode", "notify_on_reply"]
+ },
+ {
+ "api_type": "rest",
+ "route_name": "api_user_update_preferences",
+ "endpoint": "PUT /api/users/{id}/preferences",
+ "auth": "session_cookie",
+ "content_type": "application/json",
+ "path_params": {
+ "id": { "type": "integer", "source": "PREV_CALL" }
+ },
+ "query_params": {},
+ "body_params": {
+ "locale": { "type": "string", "source": "TASK_SPEC", "notes": "denormalization group: user:preferences" },
+ "front_page": { "type": "string", "source": "TASK_SPEC" },
+ "front_page_sort_mode": { "type": "string", "source": "TASK_SPEC" },
+ "night_mode": { "type": "integer", "source": "TASK_SPEC" },
+ "notify_on_reply": { "type": "boolean", "source": "TASK_SPEC" },
+ "notify_on_mentions": { "type": "boolean", "source": "TASK_SPEC" }
+ },
+ "response_key_fields": []
+ },
+ {
+ "api_type": "rest",
+ "route_name": "api_user_submissions",
+ "endpoint": "GET /api/users/{id}/submissions",
+ "auth": "session_cookie",
+ "content_type": "application/json",
+ "path_params": {
+ "id": { "type": "integer", "source": "PREV_CALL" }
+ },
+ "query_params": {},
+ "body_params": {},
+ "response_key_fields": ["id", "title", "forum", "net_score"]
+ },
+ {
+ "api_type": "rest",
+ "route_name": "api_user_moderator_of",
+ "endpoint": "GET /api/users/{id}/moderator_of",
+ "auth": "session_cookie",
+ "content_type": "application/json",
+ "path_params": {
+ "id": { "type": "integer", "source": "PREV_CALL" }
+ },
+ "query_params": {},
+ "body_params": {},
+ "response_key_fields": ["entries[].forum", "entries[].user"]
+ }
+ ]
+ }
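The form endpoints in this catalog all post Symfony-style nested field names (`user_settings[locale]`, `user[_token]`, and so on) as `application/x-www-form-urlencoded`, with a CSRF `_token` scraped from the form's hidden input. A minimal sketch of how such a body could be built in Python; the field values and token string below are hypothetical placeholders, not values taken from the catalog:

```python
from urllib.parse import urlencode

# Hypothetical values; the real _token must be scraped from the rendered
# form's hidden input (source: AUTH_FLOW in the catalog above).
fields = {
    "user_settings[locale]": "en",
    "user_settings[frontPage]": "subscribed",
    "user_settings[nightMode]": "2",
    "user_settings[_token]": "csrf-token-from-hidden-input",
}

# urlencode percent-escapes the brackets in the nested field names,
# producing a body suitable for an application/x-www-form-urlencoded POST.
body = urlencode(fields)
print(body)
```

The bracket characters are percent-encoded on the wire (`user_settings%5Blocale%5D=en`), which the server decodes back into the nested form structure.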
catalogs/osm.json ADDED
The diff for this file is too large to render. See raw diff
 
catalogs/shopping.json ADDED
The diff for this file is too large to render. See raw diff
 
catalogs/shopping_admin.json ADDED
The diff for this file is too large to render. See raw diff
 
catalogs/wikipedia.json ADDED
@@ -0,0 +1,48 @@
+ {
+   "_meta": {
+     "generated": "2026-04-08",
+     "source": "hardcoded — kiwix-serve binary serves a static ZIM file; no application source to analyze",
+     "zim_file": "/data/wikipedia_en_all_maxi_2022-05.zim",
+     "search_response": "HTML only — GET /search returns HTML page; agent must parse <a href> links for article URLs",
+     "article_page": "GET /wikipedia_en_all_maxi_2022-05/A/{title} — returns HTML article",
+     "websockets": "none"
+   },
+   "endpoints": [
+     {
+       "api_type": "rest",
+       "endpoint": "GET /search",
+       "auth": "none",
+       "query_params": {
+         "pattern": {
+           "type": "string",
+           "source": "TASK_SPEC",
+           "notes": "the search query, URL-encoded"
+         },
+         "books.name": {
+           "type": "string",
+           "source": "STATIC",
+           "value": "wikipedia_en_all_maxi_2022-05",
+           "notes": "selects which ZIM book to search"
+         }
+       },
+       "response_key_fields": [],
+       "notes": "IMPORTANT: response is HTML, not JSON. Parse <a href> anchor links matching /wikipedia_en_all_maxi_2022-05/A/... to extract article slugs. The .results[0].url jq path does NOT apply — use HTML parsing."
+     },
+     {
+       "api_type": "rest",
+       "endpoint": "GET /wikipedia_en_all_maxi_2022-05/A/{article_title}",
+       "auth": "none",
+       "path_params": {
+         "article_title": {
+           "type": "string",
+           "source": "PREV_CALL",
+           "from_endpoint": "GET /search",
+           "from_field": "href attribute of first search result <a> tag",
+           "notes": "URL-encoded article slug, e.g. Albert_Einstein. Extract from the href on the search results HTML page. Verified live: HTTP 200 for valid titles."
+         }
+       },
+       "response_key_fields": [],
+       "notes": "Returns full HTML article page. HTTP 200 when article exists, 404 when not found."
+     }
+   ]
+ }
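Because `/search` serves HTML rather than JSON, the article slug has to be scraped out of the result anchors, exactly as the catalog notes stress. A stdlib-only sketch of that extraction (the href prefix matches the catalog above; the class name and sample markup are ours):

```python
from html.parser import HTMLParser

class SearchResultParser(HTMLParser):
    """Collect article slugs from kiwix-serve search result anchors."""

    PREFIX = "/wikipedia_en_all_maxi_2022-05/A/"

    def __init__(self) -> None:
        super().__init__()
        self.slugs = []

    def handle_starttag(self, tag, attrs) -> None:
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if self.PREFIX in href:
            # keep only the URL-encoded slug after the /A/ namespace
            self.slugs.append(href.split("/A/", 1)[1])

page = '<ul><li><a href="/wikipedia_en_all_maxi_2022-05/A/Albert_Einstein">Albert Einstein</a></li></ul>'
parser = SearchResultParser()
parser.feed(page)
print(parser.slugs)  # ['Albert_Einstein']
```

The first collected slug can then be plugged straight into `GET /wikipedia_en_all_maxi_2022-05/A/{article_title}`.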
client.py ADDED
@@ -0,0 +1,59 @@
+ """HARvestGym client for interacting with the environment server."""
+
+ from typing import Dict
+
+ from openenv.core import EnvClient
+ from openenv.core.client_types import StepResult
+ from openenv.core.env_server.types import State
+
+ try:
+     from .server.models import HarvestGymAction, HarvestGymObservation
+ except ModuleNotFoundError:
+     from server.models import HarvestGymAction, HarvestGymObservation
+
+
+ class HARvestGymEnv(EnvClient[HarvestGymAction, HarvestGymObservation, State]):
+     """
+     Client for the HARvestGym Environment.
+
+     Example:
+         >>> async with HARvestGymEnv(base_url="http://localhost:8000") as env:
+         ...     result = await env.reset()
+         ...     result = await env.step(HarvestGymAction(
+         ...         tool="browser_agent",
+         ...         args={"task": "List products in category Gear",
+         ...               "url": "http://ec2-.../"}
+         ...     ))
+     """
+
+     def _step_payload(self, action: HarvestGymAction) -> Dict:
+         return {
+             "tool": action.tool,
+             "args": action.args,
+         }
+
+     def _parse_result(self, payload: Dict) -> StepResult[HarvestGymObservation]:
+         obs_data = payload.get("observation", {})
+         observation = HarvestGymObservation(
+             task=obs_data.get("task", ""),
+             app_base_url=obs_data.get("app_base_url", ""),
+             last_tool_result=obs_data.get("last_tool_result"),
+             history=obs_data.get("history", []),
+             session_state=obs_data.get("session_state", {}),
+             step_count=obs_data.get("step_count", 0),
+             max_steps=obs_data.get("max_steps", 20),
+             done=payload.get("done", False),
+             reward=payload.get("reward"),
+             metadata=obs_data.get("metadata", {}),
+         )
+         return StepResult(
+             observation=observation,
+             reward=payload.get("reward"),
+             done=payload.get("done", False),
+         )
+
+     def _parse_state(self, payload: Dict) -> State:
+         return State(
+             episode_id=payload.get("episode_id"),
+             step_count=payload.get("step_count", 0),
+         )
hars/forum.har ADDED
The diff for this file is too large to render. See raw diff
 
hars/shopping.har ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dc116ba8f3cb52e5fe8335dcaf1eefbb88161df4d494f30832338f57bbe52ed9
+ size 13392889
hars/shopping_admin.har ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1c9d48fde1cc1f65c0e81ff9a46d1b23fece9c352b1c548de91ca848ee2411f1
+ size 60961456
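The three-line `.har` files above are Git LFS pointers, not the HAR payloads themselves; the real archives are fetched by content hash at checkout. A small sketch of reading such a pointer (the field layout follows the LFS v1 pointer text shown above; the helper name is ours):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer file into its key/value fields."""
    # each pointer line is "<key> <value>"
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "oid_algo": algo,
        "oid": digest,
        "size": int(fields["size"]),
    }

pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:dc116ba8f3cb52e5fe8335dcaf1eefbb88161df4d494f30832338f57bbe52ed9\n"
    "size 13392889\n"
)
info = parse_lfs_pointer(pointer)
print(info["oid_algo"], info["size"])  # sha256 13392889
```

The `size` field is the byte length of the real HAR file, which is why the 13 MB and 60 MB archives render as three-line diffs here.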
hars/wikipedia.har ADDED
The diff for this file is too large to render. See raw diff
 
inference.py ADDED
@@ -0,0 +1,375 @@
+ """
+ HARvestGym — Inference Script
+ ==============================
+
+ Runs the RL agent (driven by an LLM via OpenAI client) through three tasks:
+   1. har_classify_easy — Template 1: list products in a category
+   2. har_classify_medium — Template 3: add product to guest cart
+   3. har_pipeline_hard — Template 6: complete guest checkout
+
+ STDOUT FORMAT (strictly enforced by hackathon):
+   [START] task=<task_name> env=<benchmark> model=<model_name>
+   [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
+   [END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
+
+ Usage:
+   HF_TOKEN=hf_xxx uv run inference.py
+   HF_TOKEN=hf_xxx MODEL_NAME=Qwen/Qwen2.5-72B-Instruct uv run inference.py
+ """
+
+ import asyncio
+ import json
+ import os
+ import sys
+ import textwrap
+ from typing import Any, List, Optional
+
+ from openai import OpenAI
+
+ # ---------------------------------------------------------------------------
+ # Configuration
+ # ---------------------------------------------------------------------------
+
+ API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
+ MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
+ HF_TOKEN = os.getenv("HF_TOKEN")
+
+ if not HF_TOKEN:
+     raise ValueError(
+         "HF_TOKEN environment variable is required but not set. "
+         "Export it with: export HF_TOKEN=hf_your_token_here"
+     )
+
+ BENCHMARK = "harvgym"
+ MAX_STEPS = 20
+ TEMPERATURE = 0.7
+ MAX_TOKENS = 512
+ SUCCESS_SCORE_THRESHOLD = 0.5
+
+ # Task definitions for inference
+ TASKS = [
+     {
+         "task_name": "har_classify_easy",
+         "template_id": 1,
+         "description": "List products in the 'Gear' category",
+         "app_base_url": "http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7770/",
+         "difficulty": "easy",
+     },
+     {
+         "task_name": "har_classify_medium",
+         "template_id": 3,
+         "description": "Add 'Radiant Tee' to a guest cart",
+         "app_base_url": "http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7770/",
+         "difficulty": "medium",
+     },
+     {
+         "task_name": "har_pipeline_hard",
+         "template_id": 6,
+         "description": "Complete a guest checkout for 'Radiant Tee'",
+         "app_base_url": "http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7770/",
+         "difficulty": "hard",
+     },
+ ]
+
+ # ---------------------------------------------------------------------------
+ # Logging helpers (hackathon format)
+ # ---------------------------------------------------------------------------
+
+ def log_start(task: str, env: str, model: str) -> None:
+     print(f"[START] task={task} env={env} model={model}", flush=True)
+
+
+ def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
+     error_val = error if error else "null"
+     done_val = str(done).lower()
+     # Sanitize action: no newlines
+     action_clean = action.replace("\n", " ").replace("\r", "")[:200]
+     print(
+         f"[STEP] step={step} action={action_clean} reward={reward:.2f} done={done_val} error={error_val}",
+         flush=True,
+     )
+
+
+ def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
+     rewards_str = ",".join(f"{r:.2f}" for r in rewards)
+     print(
+         f"[END] success={str(success).lower()} steps={steps} score={score:.2f} rewards={rewards_str}",
+         flush=True,
+     )
+
+
+ # ---------------------------------------------------------------------------
+ # System prompt
+ # ---------------------------------------------------------------------------
+
+ SYSTEM_PROMPT = textwrap.dedent("""
+ You are an API agent. Your goal is to complete a real-world task by calling the correct
+ sequence of HTTP API endpoints on a live web application.
+
+ You have exactly these tools available (output ONE tool call per turn as JSON):
+
+ 1. browser_agent(task, url)
+    → Discovers which API endpoints exist for this app. Call this FIRST and ONLY ONCE.
+    → Returns: list of {method, path} endpoint names (no schemas)
+
+ 2. search_endpoints(query)
+    → Semantic search for endpoint schemas. Use after browser_agent to get full details.
+    → Example: search_endpoints("create guest cart") returns method, path, auth, params
+
+ 3. curl_exec(command)
+    → Execute an HTTP call. Returns {status_code, headers, body}.
+    → Use full curl syntax: curl -X POST 'URL' -H 'Content-Type: application/json' -d '{...}'
+    → Session cookies are auto-injected; you do NOT need to set Cookie headers manually.
+
+ 4. search_episode_data(query)
+    → Search all prior API responses in this episode for a specific value.
+    → Use when a response list was truncated and you need a specific item.
+
+ 5. done(result?)
+    → Call when the task is complete.
+
+ RULES:
+ - Output ONLY a single JSON object with keys "tool" and "args". Nothing else.
+ - Call browser_agent exactly once at step 1.
+ - Read values from prior responses (cart_id, sku, tokens) from the history.
+ - For Magento Shopping API (port 7770/7780): use Content-Type: application/json
+ - For Forum Postmill (port 9999): use Content-Type: application/x-www-form-urlencoded for login/post
+ - For Wikipedia (port 8888): GET requests only
+
+ EXAMPLE output format:
+ {"tool": "curl_exec", "args": {"command": "curl -X POST 'http://ec2-.../rest/V1/guest-carts' -H 'Content-Type: application/json'"}}
+ """).strip()
+
+
+ # ---------------------------------------------------------------------------
+ # LLM agent loop
+ # ---------------------------------------------------------------------------
+
+ def build_user_prompt(task_desc: str, app_base_url: str, step: int,
+                       last_result: Any, history: List[dict],
+                       session_state: dict) -> str:
+     """Build the user prompt for each step."""
+     history_str = ""
+     if history:
+         recent = history[-6:]  # Last 6 steps to stay within context
+         lines = []
+         for h in recent:
+             result_str = json.dumps(h.get("result", ""))[:500]
+             lines.append(f"  Step {h['step']}: {h['tool']}({h.get('args', {})}) → {result_str}")
+         history_str = "\n".join(lines)
+
+     session_str = json.dumps(session_state, indent=2)[:300] if session_state else "{}"
+
+     last_result_str = json.dumps(last_result)[:800] if last_result is not None else "null"
+
+     return textwrap.dedent(f"""
+ TASK: {task_desc}
+ APP URL: {app_base_url}
+ STEP: {step}/{MAX_STEPS}
+
+ SESSION STATE (auto-managed cookies/tokens):
+ {session_str}
+
+ LAST TOOL RESULT:
+ {last_result_str}
+
+ RECENT HISTORY:
+ {history_str if history_str else "  (none yet)"}
+
+ What is your next tool call? Output ONLY the JSON object.
+ """).strip()
+
+
+ def get_model_action(client: OpenAI, task_desc: str, app_base_url: str,
+                      step: int, last_result: Any, history: List[dict],
+                      session_state: dict) -> dict:
+     """Ask the LLM for the next action. Returns parsed tool call dict."""
+     user_prompt = build_user_prompt(task_desc, app_base_url, step,
+                                     last_result, history, session_state)
+     try:
+         completion = client.chat.completions.create(
+             model=MODEL_NAME,
+             messages=[
+                 {"role": "system", "content": SYSTEM_PROMPT},
+                 {"role": "user", "content": user_prompt},
+             ],
+             temperature=TEMPERATURE,
+             max_tokens=MAX_TOKENS,
+             stream=False,
+         )
+         text = (completion.choices[0].message.content or "").strip()
+
+         # Parse JSON from response
+         # Handle markdown code blocks
+         if "```json" in text:
+             text = text.split("```json")[1].split("```")[0].strip()
+         elif "```" in text:
+             text = text.split("```")[1].split("```")[0].strip()
+
+         # Find first { ... } block
+         start = text.find("{")
+         end = text.rfind("}") + 1
+         if start >= 0 and end > start:
+             text = text[start:end]
+
+         parsed = json.loads(text)
+         if "tool" in parsed:
+             return parsed
+
+         # LLM returned something else — default to done
+         return {"tool": "done", "args": {"result": "Model returned non-tool response"}}
+
+     except json.JSONDecodeError:
+         # Couldn't parse JSON — try to extract tool name at minimum
+         if "browser_agent" in text:
+             return {"tool": "browser_agent", "args": {"task": task_desc, "url": app_base_url}}
+         elif "done" in text.lower():
+             return {"tool": "done", "args": {}}
+         else:
+             return {"tool": "done", "args": {"result": f"Parse error: {text[:100]}"}}
+     except Exception as exc:
+         print(f"[DEBUG] LLM call failed: {exc}", flush=True)
+         # Default to browser_agent on first step, done otherwise
+         if step == 1:
+             return {"tool": "browser_agent", "args": {"task": task_desc, "url": app_base_url}}
+         return {"tool": "done", "args": {"result": f"LLM error: {exc}"}}
+
+
+ # ---------------------------------------------------------------------------
+ # Single task episode runner
+ # ---------------------------------------------------------------------------
+
+ async def run_episode(task_config: dict, client: OpenAI) -> dict:
+     """
+     Run a single episode for one task.
+
+     Returns: {"task_name", "success", "steps", "score", "rewards"}
+     """
+     from server.models import HARvestGymEnvironment, HarvestGymAction
+
+     task_name = task_config["task_name"]
+     template_id = task_config["template_id"]
+     task_description = task_config["description"]
+     app_base_url = task_config["app_base_url"]
+
+     # Configure environment for this task
+     os.environ["HARVGYM_TASK"] = str(template_id)
+
+     env = HARvestGymEnvironment()
+
+     rewards: List[float] = []
+     steps_taken = 0
+     score = 0.0
+     success = False
+     last_result = None
+     history: List[dict] = []
+     session_state: dict = {}
+
+     log_start(task=task_name, env=BENCHMARK, model=MODEL_NAME)
+
+     try:
+         # Reset
+         obs = env.reset()
+         task_desc = obs.task or task_description
+         base_url = obs.app_base_url or app_base_url
+
+         for step in range(1, MAX_STEPS + 1):
+             if getattr(obs, "done", False):
+                 break
+
+             # Get action from LLM
+             action_dict = get_model_action(
+                 client=client,
+                 task_desc=task_desc,
+                 app_base_url=base_url,
+                 step=step,
+                 last_result=last_result,
+                 history=history,
+                 session_state=session_state,
+             )
+
+             tool = action_dict.get("tool", "done")
+             args = action_dict.get("args", {})
+
+             action_str = f"{tool}({json.dumps(args)[:150]})"
+             error_str = None
+
+             try:
+                 action = HarvestGymAction(tool=tool, args=args)
+                 obs = env.step(action)
+
+                 reward = float(obs.reward or 0.0)
+                 done = bool(obs.done)
+                 last_result = obs.last_tool_result
+                 session_state = dict(obs.session_state or {})
+
+                 # Update history
+                 history.append({
+                     "step": step,
+                     "tool": tool,
+                     "args": args,
+                     "result": last_result,
+                 })
+
+             except Exception as exc:
+                 reward = -0.1
+                 done = False
+                 error_str = str(exc)[:200]
+
+             rewards.append(reward)
+             steps_taken = step
+             log_step(step=step, action=action_str, reward=reward, done=done, error=error_str)
+
+             if done:
+                 break
+
+         # Compute episode score from cumulative rewards
+         # Normalize: terminal reward dominates; clamp to [0, 1]
+         total_reward = sum(rewards)
+         # Map reward to [0, 1]: reward range is roughly [-1.5, +7.5] per design
+         score = (total_reward + 1.5) / 9.0
+         score = max(0.0, min(1.0, score))
+         success = score >= SUCCESS_SCORE_THRESHOLD
+
+     except Exception as exc:
+         error_str = str(exc)[:200]
+         print(f"[DEBUG] Episode error: {error_str}", flush=True)
+     finally:
+         log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
+
+     return {
+         "task_name": task_name,
+         "success": success,
+         "steps": steps_taken,
+         "score": score,
+         "rewards": rewards,
+     }
+
+
+ # ---------------------------------------------------------------------------
+ # Main
+ # ---------------------------------------------------------------------------
+
+ async def main() -> None:
+     client = OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN)
+
+     results = []
+     for task_config in TASKS:
+         result = await run_episode(task_config, client)
+         results.append(result)
+
+     # Summary
+     print("\n[SUMMARY]", flush=True)
+     for r in results:
+         status = "PASS" if r["success"] else "FAIL"
+         print(
+             f"  [{status}] {r['task_name']} — score={r['score']:.2f} steps={r['steps']}",
+             flush=True,
+         )
+
+     overall_score = sum(r["score"] for r in results) / len(results) if results else 0.0
+     print(f"\n  overall_score={overall_score:.2f}", flush=True)
+
+
+ if __name__ == "__main__":
+     asyncio.run(main())
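Tooling that scores or replays runs can recover each field of the strict `[STEP]` stdout format with a single regex. A sketch of such a parser (the pattern mirrors `log_step` in the script above; the parser itself is our addition, not part of the hackathon spec):

```python
import re

STEP_RE = re.compile(
    r"\[STEP\] step=(?P<step>\d+) action=(?P<action>.*) "
    r"reward=(?P<reward>-?\d+\.\d+) done=(?P<done>true|false) error=(?P<error>.*)"
)

def parse_step(line: str) -> dict:
    """Parse one [STEP] log line back into typed fields."""
    m = STEP_RE.match(line)
    if not m:
        raise ValueError(f"not a [STEP] line: {line!r}")
    d = m.groupdict()
    return {
        "step": int(d["step"]),
        "action": d["action"],
        "reward": float(d["reward"]),
        "done": d["done"] == "true",
        "error": None if d["error"] == "null" else d["error"],
    }

line = "[STEP] step=3 action=curl_exec({}) reward=0.50 done=false error=null"
print(parse_step(line)["reward"])  # 0.5
```

The greedy `action` group relies on `log_step` having stripped newlines from the action string, which the sanitization in the script guarantees.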
models.py ADDED
@@ -0,0 +1,14 @@
+ """
+ HARvestGym — Root models.py (required by OpenEnv spec).
+
+ Re-exports Action and Observation classes from server/models.py.
+ """
+
+ from server.models import HarvestGymAction as HARvestGymAction
+ from server.models import HarvestGymObservation as HARvestGymObservation
+
+ # OpenEnv spec requires these names at module root
+ Action = HARvestGymAction
+ Observation = HARvestGymObservation
+
+ __all__ = ["HARvestGymAction", "HARvestGymObservation", "Action", "Observation"]
openenv.yaml ADDED
@@ -0,0 +1,6 @@
+ spec_version: 1
+ name: HARvestGym
+ type: space
+ runtime: fastapi
+ app: server.app:app
+ port: 8000
openenv_harvestgym.egg-info/PKG-INFO ADDED
@@ -0,0 +1,18 @@
+ Metadata-Version: 2.4
+ Name: openenv-harvestgym
+ Version: 0.1.0
+ Summary: HARvestGym: RL environment for training API-native web agents via HAR-guided exploration
+ Requires-Python: >=3.10
+ Requires-Dist: openenv-core[core]>=0.2.2
+ Requires-Dist: pydantic>=2.0.0
+ Requires-Dist: fastapi>=0.100.0
+ Requires-Dist: uvicorn>=0.23.0
+ Requires-Dist: requests>=2.31.0
+ Requires-Dist: rank-bm25>=0.2.2
+ Requires-Dist: sentence-transformers>=3.0.0
+ Requires-Dist: openai>=1.0.0
+ Requires-Dist: numpy>=1.24.0
+ Provides-Extra: dev
+ Requires-Dist: pytest>=8.0.0; extra == "dev"
+ Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
+ Requires-Dist: pytest-asyncio>=0.23.0; extra == "dev"
openenv_harvestgym.egg-info/SOURCES.txt ADDED
@@ -0,0 +1,20 @@
+ README.md
+ pyproject.toml
+ openenv_harvestgym.egg-info/PKG-INFO
+ openenv_harvestgym.egg-info/SOURCES.txt
+ openenv_harvestgym.egg-info/dependency_links.txt
+ openenv_harvestgym.egg-info/entry_points.txt
+ openenv_harvestgym.egg-info/requires.txt
+ openenv_harvestgym.egg-info/top_level.txt
+ server/__init__.py
+ server/app.py
+ server/episode.py
+ server/judge.py
+ server/models.py
+ server/tools/__init__.py
+ server/tools/browser_agent.py
+ server/tools/curl_exec.py
+ server/tools/search_endpoints.py
+ server/tools/search_episode_data.py
+ tests/test_e2e_episode.py
+ tests/test_real_har.py
openenv_harvestgym.egg-info/dependency_links.txt ADDED
@@ -0,0 +1 @@
+
openenv_harvestgym.egg-info/entry_points.txt ADDED
@@ -0,0 +1,2 @@
+ [console_scripts]
+ server = server.app:main
openenv_harvestgym.egg-info/requires.txt ADDED
@@ -0,0 +1,14 @@
+ openenv-core[core]>=0.2.2
+ pydantic>=2.0.0
+ fastapi>=0.100.0
+ uvicorn>=0.23.0
+ requests>=2.31.0
+ rank-bm25>=0.2.2
+ sentence-transformers>=3.0.0
+ openai>=1.0.0
+ numpy>=1.24.0
+
+ [dev]
+ pytest>=8.0.0
+ pytest-cov>=4.0.0
+ pytest-asyncio>=0.23.0
openenv_harvestgym.egg-info/top_level.txt ADDED
@@ -0,0 +1 @@
+ server
parameter_pools.json ADDED
@@ -0,0 +1,1090 @@
+ {
+   "_meta": {
+     "description": "Static parameter pools for the 7 HARvestGym task templates.",
+     "generated_at": "2026-04-08",
+     "source": {
+       "categories": "GET /rest/V1/categories/list (live EC2, port 7780)",
+       "products": "GET /rest/V1/products type_id=simple + configurable (live EC2, port 7780)",
+       "forums": "HTML scrape of /forums page (live EC2, port 9999) + HTTP 200 verification per slug",
+       "wikipedia": "Well-known Wikipedia titles \u2014 verified by grader at runtime via HEAD /wikipedia_en.../A/{slug}",
+       "admin_skus": "Generated (HAR-TEST-NNN namespace, no collision with existing catalog)",
+       "post_titles": "Generated \u2014 grader checks post was created, not the exact title wording"
+     },
+     "grader_matching_notes": {
+       "template_1": "category_id stored for grader; category_name is what appears in task string",
+       "template_2": "expected_slug stored for grader (verifies HTTP 200); display title is in task string",
+       "template_3": "sku stored for grader (verifies cart item); product name is in task string",
+       "template_4": "forum_name must exist and return posts; no exact value matching needed",
+       "template_5": "title is free-form generated; grader only checks post was created in that forum",
+       "template_6": "sku stored for grader (verifies order was placed); product name is in task string",
+       "template_7": "sku+price are exact \u2014 grader calls GET /rest/V1/products/{sku} to verify creation"
+     }
+   },
+   "template_1": {
+     "description": "List products in category {category_name}",
+     "tier": "Easy",
+     "app": "shopping",
+     "slots": [
+       "category_name"
+     ],
+     "pool": {
+       "category_name": [
+         { "name": "Gear", "category_id": 3 },
+         { "name": "Bags", "category_id": 4 },
+         { "name": "Fitness Equipment", "category_id": 5 },
+         { "name": "Watches", "category_id": 6 },
+         { "name": "New Luma Yoga Collection", "category_id": 8 },
+         { "name": "Training", "category_id": 9 },
+         { "name": "Video Download", "category_id": 10 },
+         { "name": "Men", "category_id": 11 },
+         { "name": "Tops", "category_id": 12 },
+         { "name": "Bottoms", "category_id": 13 },
+         { "name": "Jackets", "category_id": 14 },
+         { "name": "Hoodies & Sweatshirts", "category_id": 15 },
+         { "name": "Tees", "category_id": 16 },
+         { "name": "Tanks", "category_id": 17 },
+         { "name": "Pants", "category_id": 18 },
+         { "name": "Shorts", "category_id": 19 },
+         { "name": "Women", "category_id": 20 },
+         { "name": "Tops", "category_id": 21 },
+         { "name": "Bottoms", "category_id": 22 },
+         { "name": "Jackets", "category_id": 23 },
+         { "name": "Hoodies & Sweatshirts", "category_id": 24 },
+         { "name": "Tees", "category_id": 25 },
+         { "name": "Bras & Tanks", "category_id": 26 },
+         { "name": "Pants", "category_id": 27 },
+         { "name": "Shorts", "category_id": 28 },
+         { "name": "Women Sale", "category_id": 30 },
+         { "name": "Men Sale", "category_id": 31 },
+         { "name": "Pants", "category_id": 32 },
+         { "name": "Tees", "category_id": 33 },
+         { "name": "Erin Recommends", "category_id": 34 },
+         { "name": "Performance Fabrics", "category_id": 35 },
+         { "name": "Eco Friendly", "category_id": 36 },
+         { "name": "Sale", "category_id": 37 },
+         { "name": "What's New", "category_id": 38 },
+         { "name": "Performance Sportswear New", "category_id": 39 },
+         { "name": "Eco Collection New", "category_id": 40 }
+       ]
+     }
+   },
+   "template_2": {
+     "description": "Retrieve article summary for {title}",
+     "tier": "Easy",
+     "app": "wikipedia",
+     "slots": [
+       "title"
+     ],
+     "pool": {
+       "title": [
+         { "display": "Python (programming language)", "search_query": "Python programming language", "expected_slug": "Python_(programming_language)" },
+         { "display": "Albert Einstein", "search_query": "Albert Einstein", "expected_slug": "Albert_Einstein" },
+         { "display": "World War II", "search_query": "World War II", "expected_slug": "World_War_II" },
+         { "display": "Photosynthesis", "search_query": "Photosynthesis", "expected_slug": "Photosynthesis" },
+         { "display": "Marie Curie", "search_query": "Marie Curie", "expected_slug": "Marie_Curie" },
+         { "display": "Moon", "search_query": "Moon", "expected_slug": "Moon" },
+         { "display": "JavaScript", "search_query": "JavaScript", "expected_slug": "JavaScript" },
+         { "display": "Eiffel Tower", "search_query": "Eiffel Tower", "expected_slug": "Eiffel_Tower" },
+         { "display": "Black hole", "search_query": "Black hole", "expected_slug": "Black_hole" },
+         { "display": "Charles Darwin", "search_query": "Charles Darwin", "expected_slug": "Charles_Darwin" },
+         { "display": "Artificial intelligence", "search_query": "Artificial intelligence", "expected_slug": "Artificial_intelligence" },
+         { "display": "DNA", "search_query": "DNA", "expected_slug": "DNA" },
+         { "display": "Mount Everest", "search_query": "Mount Everest", "expected_slug": "Mount_Everest" },
+         { "display": "Isaac Newton", "search_query": "Isaac Newton", "expected_slug": "Isaac_Newton" },
+         { "display": "Solar System", "search_query": "Solar System", "expected_slug": "Solar_System" },
+         { "display": "Great Wall of China", "search_query": "Great Wall of China", "expected_slug": "Great_Wall_of_China" },
+         { "display": "William Shakespeare", "search_query": "William Shakespeare", "expected_slug": "William_Shakespeare" },
+         { "display": "Amazon River", "search_query": "Amazon River", "expected_slug": "Amazon_River" },
+         { "display": "Quantum mechanics", "search_query": "Quantum mechanics", "expected_slug": "Quantum_mechanics" },
+         { "display": "Napoleon", "search_query": "Napoleon", "expected_slug": "Napoleon" }
+       ]
+     }
+   },
+   "template_3": {
+     "description": "Add {product_name} to a guest cart",
+     "tier": "Medium",
+     "app": "shopping",
+     "slots": [
+       "product_name"
+     ],
+     "pool": {
+       "product_name": [
+         { "name": "Joust Duffle Bag", "sku": "24-MB01" },
+         { "name": "Strive Shoulder Pack", "sku": "24-MB04" },
+         { "name": "Crown Summit Backpack", "sku": "24-MB03" },
+         { "name": "Wayfarer Messenger Bag", "sku": "24-MB05" },
+         { "name": "Rival Field Messenger", "sku": "24-MB06" },
+         { "name": "Fusion Backpack", "sku": "24-MB02" },
+         { "name": "Impulse Duffle", "sku": "24-UB02" },
+         { "name": "Voyage Yoga Bag", "sku": "24-WB01" },
+         { "name": "Compete Track Tote", "sku": "24-WB02" },
+         { "name": "Savvy Shoulder Tote", "sku": "24-WB05" },
+         { "name": "Endeavor Daytrip Backpack", "sku": "24-WB06" },
+         { "name": "Driven Backpack", "sku": "24-WB03" },
+         { "name": "Overnight Duffle", "sku": "24-WB07" },
+         { "name": "Push It Messenger Bag", "sku": "24-WB04" },
+         { "name": "Affirm Water Bottle", "sku": "24-UG06" },
+         { "name": "Dual Handle Cardio Ball", "sku": "24-UG07" },
+         { "name": "Zing Jump Rope", "sku": "24-UG04" },
+         { "name": "Pursuit Lumaflex&trade; Tone Band", "sku": "24-UG02" },
+         { "name": "Go-Get'r Pushup Grips", "sku": "24-UG05" },
+         { "name": "Quest Lumaflex&trade; Band", "sku": "24-UG01" },
+         { "name": "Sprite Foam Yoga Brick", "sku": "24-WG084" },
+         { "name": "Sprite Foam Roller", "sku": "24-WG088" },
+         { "name": "Harmony Lumaflex&trade; Strength Band Kit", "sku": "24-UG03" },
+         { "name": "Sprite Stasis Ball 55 cm", "sku": "24-WG081-gray" },
+         { "name": "Sprite Stasis Ball 65 cm", "sku": "24-WG082-gray" },
+         { "name": "Sprite Stasis Ball 75 cm", "sku": "24-WG083-gray" },
+         { "name": "Sprite Yoga Strap 6 foot", "sku": "24-WG085" },
+         { "name": "Sprite Yoga Strap 8 foot", "sku": "24-WG086" },
+         { "name": "Sprite Yoga Strap 10 foot", "sku": "24-WG087" },
+         { "name": "Aim Analog Watch", "sku": "24-MG04" },
+         { "name": "Endurance Watch", "sku": "24-MG01" },
+         { "name": "Summit Watch", "sku": "24-MG03" },
+         { "name": "Cruise Dual Analog Watch", "sku": "24-MG05" },
+         { "name": "Dash Digital Watch", "sku": "24-MG02" },
+         { "name": "Luma Analog Watch", "sku": "24-WG09" },
+         { "name": "Bolo Sport Watch", "sku": "24-WG01" },
+         { "name": "Clamber Watch", "sku": "24-WG03" },
448
+ {
449
+ "name": "Didi Sport Watch",
450
+ "sku": "24-WG02"
451
+ },
452
+ {
453
+ "name": "Stellar Solar Jacket",
454
+ "sku": "WJ01"
455
+ },
456
+ {
457
+ "name": "Josie Yoga Jacket",
458
+ "sku": "WJ02"
459
+ },
460
+ {
461
+ "name": "Augusta Pullover Jacket",
462
+ "sku": "WJ03"
463
+ },
464
+ {
465
+ "name": "Ingrid Running Jacket",
466
+ "sku": "WJ04"
467
+ },
468
+ {
469
+ "name": "Riona Full Zip Jacket",
470
+ "sku": "WJ05"
471
+ },
472
+ {
473
+ "name": "Juno Jacket",
474
+ "sku": "WJ06"
475
+ },
476
+ {
477
+ "name": "Inez Full Zip Jacket",
478
+ "sku": "WJ07"
479
+ },
480
+ {
481
+ "name": "Adrienne Trek Jacket",
482
+ "sku": "WJ08"
483
+ },
484
+ {
485
+ "name": "Jade Yoga Jacket",
486
+ "sku": "WJ09"
487
+ },
488
+ {
489
+ "name": "Nadia Elements Shell",
490
+ "sku": "WJ10"
491
+ },
492
+ {
493
+ "name": "Neve Studio Dance Jacket",
494
+ "sku": "WJ11"
495
+ },
496
+ {
497
+ "name": "Olivia 1/4 Zip Light Jacket",
498
+ "sku": "WJ12"
499
+ },
500
+ {
501
+ "name": "Chaz Kangeroo Hoodie",
502
+ "sku": "MH01"
503
+ },
504
+ {
505
+ "name": "Teton Pullover Hoodie",
506
+ "sku": "MH02"
507
+ },
508
+ {
509
+ "name": "Bruno Compete Hoodie",
510
+ "sku": "MH03"
511
+ },
512
+ {
513
+ "name": "Frankie Sweatshirt",
514
+ "sku": "MH04"
515
+ },
516
+ {
517
+ "name": "Hollister Backyard Sweatshirt",
518
+ "sku": "MH05"
519
+ },
520
+ {
521
+ "name": "Stark Fundamental Hoodie",
522
+ "sku": "MH06"
523
+ },
524
+ {
525
+ "name": "Hero Hoodie",
526
+ "sku": "MH07"
527
+ },
528
+ {
529
+ "name": "Oslo Trek Hoodie",
530
+ "sku": "MH08"
531
+ }
532
+ ]
533
+ }
534
+ },
535
+ "template_4": {
536
+ "description": "Retrieve all posts in {forum_category} (authed)",
537
+ "tier": "Medium",
538
+ "app": "forum",
539
+ "slots": [
540
+ "forum_category"
541
+ ],
542
+ "pool": {
543
+ "forum_category": [
544
+ {
545
+ "forum_name": "AskReddit"
546
+ },
547
+ {
548
+ "forum_name": "relationship_advice"
549
+ },
550
+ {
551
+ "forum_name": "worldnews"
552
+ },
553
+ {
554
+ "forum_name": "news"
555
+ },
556
+ {
557
+ "forum_name": "movies"
558
+ },
559
+ {
560
+ "forum_name": "memes"
561
+ },
562
+ {
563
+ "forum_name": "wallstreetbets"
564
+ },
565
+ {
566
+ "forum_name": "gaming"
567
+ },
568
+ {
569
+ "forum_name": "technology"
570
+ },
571
+ {
572
+ "forum_name": "pics"
573
+ },
574
+ {
575
+ "forum_name": "funny"
576
+ },
577
+ {
578
+ "forum_name": "television"
579
+ },
580
+ {
581
+ "forum_name": "mildlyinteresting"
582
+ },
583
+ {
584
+ "forum_name": "Showerthoughts"
585
+ },
586
+ {
587
+ "forum_name": "todayilearned"
588
+ },
589
+ {
590
+ "forum_name": "personalfinance"
591
+ },
592
+ {
593
+ "forum_name": "LifeProTips"
594
+ },
595
+ {
596
+ "forum_name": "Futurology"
597
+ },
598
+ {
599
+ "forum_name": "Music"
600
+ },
601
+ {
602
+ "forum_name": "explainlikeimfive"
603
+ },
604
+ {
605
+ "forum_name": "books"
606
+ },
607
+ {
608
+ "forum_name": "science"
609
+ },
610
+ {
611
+ "forum_name": "Jokes"
612
+ },
613
+ {
614
+ "forum_name": "tifu"
615
+ },
616
+ {
617
+ "forum_name": "space"
618
+ }
619
+ ]
620
+ }
621
+ },
622
+ "template_5": {
623
+ "description": "Create a post titled {title} in {category}",
624
+ "tier": "Hard",
625
+ "app": "forum",
626
+ "slots": [
627
+ "title",
628
+ "category"
629
+ ],
630
+ "pool": {
631
+ "title": [
632
+ "Thoughts on the latest developments in AI safety",
633
+ "Best practices for remote work in 2026",
634
+ "How do you stay motivated when learning a new skill?",
635
+ "What are your favourite open-source projects right now?",
636
+ "Underrated books that changed how you think",
637
+ "Tips for beginner photographers \u2014 what I wish I knew",
638
+ "The most interesting science paper I read this week",
639
+ "Ask me anything about Python performance tuning",
640
+ "Weekly discussion: what are you building this month?",
641
+ "Hidden gems in streaming music you should know about",
642
+ "Travel destinations that are worth the hype",
643
+ "How to cook a perfect risotto \u2014 my method after 10 attempts",
644
+ "What sport have you picked up recently and why?",
645
+ "Recommend a documentary that genuinely surprised you",
646
+ "Discussion: is functional programming overrated?",
647
+ "Things that made you better at managing personal finance",
648
+ "The weirdest film you watched and actually enjoyed",
649
+ "My experience switching from VS Code to a different editor",
650
+ "Why I started journaling and what changed",
651
+ "Gaming setup upgrades that actually made a difference"
652
+ ],
653
+ "category": [
654
+ {
655
+ "forum_name": "AskReddit"
656
+ },
657
+ {
658
+ "forum_name": "relationship_advice"
659
+ },
660
+ {
661
+ "forum_name": "worldnews"
662
+ },
663
+ {
664
+ "forum_name": "news"
665
+ },
666
+ {
667
+ "forum_name": "movies"
668
+ },
669
+ {
670
+ "forum_name": "memes"
671
+ },
672
+ {
673
+ "forum_name": "wallstreetbets"
674
+ },
675
+ {
676
+ "forum_name": "gaming"
677
+ },
678
+ {
679
+ "forum_name": "technology"
680
+ },
681
+ {
682
+ "forum_name": "pics"
683
+ },
684
+ {
685
+ "forum_name": "funny"
686
+ },
687
+ {
688
+ "forum_name": "television"
689
+ },
690
+ {
691
+ "forum_name": "mildlyinteresting"
692
+ },
693
+ {
694
+ "forum_name": "Showerthoughts"
695
+ },
696
+ {
697
+ "forum_name": "todayilearned"
698
+ },
699
+ {
700
+ "forum_name": "personalfinance"
701
+ },
702
+ {
703
+ "forum_name": "LifeProTips"
704
+ },
705
+ {
706
+ "forum_name": "Futurology"
707
+ },
708
+ {
709
+ "forum_name": "Music"
710
+ },
711
+ {
712
+ "forum_name": "explainlikeimfive"
713
+ },
714
+ {
715
+ "forum_name": "books"
716
+ },
717
+ {
718
+ "forum_name": "science"
719
+ },
720
+ {
721
+ "forum_name": "Jokes"
722
+ },
723
+ {
724
+ "forum_name": "tifu"
725
+ },
726
+ {
727
+ "forum_name": "space"
728
+ }
729
+ ]
730
+ }
731
+ },
732
+ "template_6": {
733
+ "description": "Guest checkout for {product_name}",
734
+ "tier": "Hard",
735
+ "app": "shopping",
736
+ "slots": [
737
+ "product_name"
738
+ ],
739
+ "pool": {
740
+ "product_name": [
741
+ {
742
+ "name": "Joust Duffle Bag",
743
+ "sku": "24-MB01"
744
+ },
745
+ {
746
+ "name": "Strive Shoulder Pack",
747
+ "sku": "24-MB04"
748
+ },
749
+ {
750
+ "name": "Crown Summit Backpack",
751
+ "sku": "24-MB03"
752
+ },
753
+ {
754
+ "name": "Wayfarer Messenger Bag",
755
+ "sku": "24-MB05"
756
+ },
757
+ {
758
+ "name": "Rival Field Messenger",
759
+ "sku": "24-MB06"
760
+ },
761
+ {
762
+ "name": "Fusion Backpack",
763
+ "sku": "24-MB02"
764
+ },
765
+ {
766
+ "name": "Impulse Duffle",
767
+ "sku": "24-UB02"
768
+ },
769
+ {
770
+ "name": "Voyage Yoga Bag",
771
+ "sku": "24-WB01"
772
+ },
773
+ {
774
+ "name": "Compete Track Tote",
775
+ "sku": "24-WB02"
776
+ },
777
+ {
778
+ "name": "Savvy Shoulder Tote",
779
+ "sku": "24-WB05"
780
+ },
781
+ {
782
+ "name": "Endeavor Daytrip Backpack",
783
+ "sku": "24-WB06"
784
+ },
785
+ {
786
+ "name": "Driven Backpack",
787
+ "sku": "24-WB03"
788
+ },
789
+ {
790
+ "name": "Overnight Duffle",
791
+ "sku": "24-WB07"
792
+ },
793
+ {
794
+ "name": "Push It Messenger Bag",
795
+ "sku": "24-WB04"
796
+ },
797
+ {
798
+ "name": "Affirm Water Bottle",
799
+ "sku": "24-UG06"
800
+ },
801
+ {
802
+ "name": "Dual Handle Cardio Ball",
803
+ "sku": "24-UG07"
804
+ },
805
+ {
806
+ "name": "Zing Jump Rope",
807
+ "sku": "24-UG04"
808
+ },
809
+ {
810
+ "name": "Pursuit Lumaflex&trade; Tone Band",
811
+ "sku": "24-UG02"
812
+ },
813
+ {
814
+ "name": "Go-Get'r Pushup Grips",
815
+ "sku": "24-UG05"
816
+ },
817
+ {
818
+ "name": "Quest Lumaflex&trade; Band",
819
+ "sku": "24-UG01"
820
+ },
821
+ {
822
+ "name": "Sprite Foam Yoga Brick",
823
+ "sku": "24-WG084"
824
+ },
825
+ {
826
+ "name": "Sprite Foam Roller",
827
+ "sku": "24-WG088"
828
+ },
829
+ {
830
+ "name": "Harmony Lumaflex&trade; Strength Band Kit",
831
+ "sku": "24-UG03"
832
+ },
833
+ {
834
+ "name": "Sprite Stasis Ball 55 cm",
835
+ "sku": "24-WG081-gray"
836
+ },
837
+ {
838
+ "name": "Sprite Stasis Ball 65 cm",
839
+ "sku": "24-WG082-gray"
840
+ },
841
+ {
842
+ "name": "Sprite Stasis Ball 75 cm",
843
+ "sku": "24-WG083-gray"
844
+ },
845
+ {
846
+ "name": "Sprite Yoga Strap 6 foot",
847
+ "sku": "24-WG085"
848
+ },
849
+ {
850
+ "name": "Sprite Yoga Strap 8 foot",
851
+ "sku": "24-WG086"
852
+ },
853
+ {
854
+ "name": "Sprite Yoga Strap 10 foot",
855
+ "sku": "24-WG087"
856
+ },
857
+ {
858
+ "name": "Aim Analog Watch",
859
+ "sku": "24-MG04"
860
+ },
861
+ {
862
+ "name": "Endurance Watch",
863
+ "sku": "24-MG01"
864
+ },
865
+ {
866
+ "name": "Summit Watch",
867
+ "sku": "24-MG03"
868
+ },
869
+ {
870
+ "name": "Cruise Dual Analog Watch",
871
+ "sku": "24-MG05"
872
+ },
873
+ {
874
+ "name": "Dash Digital Watch",
875
+ "sku": "24-MG02"
876
+ },
877
+ {
878
+ "name": "Luma Analog Watch",
879
+ "sku": "24-WG09"
880
+ },
881
+ {
882
+ "name": "Bolo Sport Watch",
883
+ "sku": "24-WG01"
884
+ },
885
+ {
886
+ "name": "Clamber Watch",
887
+ "sku": "24-WG03"
888
+ },
889
+ {
890
+ "name": "Didi Sport Watch",
891
+ "sku": "24-WG02"
892
+ },
893
+ {
894
+ "name": "Stellar Solar Jacket",
895
+ "sku": "WJ01"
896
+ },
897
+ {
898
+ "name": "Josie Yoga Jacket",
899
+ "sku": "WJ02"
900
+ },
901
+ {
902
+ "name": "Augusta Pullover Jacket",
903
+ "sku": "WJ03"
904
+ },
905
+ {
906
+ "name": "Ingrid Running Jacket",
907
+ "sku": "WJ04"
908
+ },
909
+ {
910
+ "name": "Riona Full Zip Jacket",
911
+ "sku": "WJ05"
912
+ },
913
+ {
914
+ "name": "Juno Jacket",
915
+ "sku": "WJ06"
916
+ },
917
+ {
918
+ "name": "Inez Full Zip Jacket",
919
+ "sku": "WJ07"
920
+ },
921
+ {
922
+ "name": "Adrienne Trek Jacket",
923
+ "sku": "WJ08"
924
+ },
925
+ {
926
+ "name": "Jade Yoga Jacket",
927
+ "sku": "WJ09"
928
+ },
929
+ {
930
+ "name": "Nadia Elements Shell",
931
+ "sku": "WJ10"
932
+ },
933
+ {
934
+ "name": "Neve Studio Dance Jacket",
935
+ "sku": "WJ11"
936
+ },
937
+ {
938
+ "name": "Olivia 1/4 Zip Light Jacket",
939
+ "sku": "WJ12"
940
+ },
941
+ {
942
+ "name": "Chaz Kangeroo Hoodie",
943
+ "sku": "MH01"
944
+ },
945
+ {
946
+ "name": "Teton Pullover Hoodie",
947
+ "sku": "MH02"
948
+ },
949
+ {
950
+ "name": "Bruno Compete Hoodie",
951
+ "sku": "MH03"
952
+ },
953
+ {
954
+ "name": "Frankie Sweatshirt",
955
+ "sku": "MH04"
956
+ },
957
+ {
958
+ "name": "Hollister Backyard Sweatshirt",
959
+ "sku": "MH05"
960
+ },
961
+ {
962
+ "name": "Stark Fundamental Hoodie",
963
+ "sku": "MH06"
964
+ },
965
+ {
966
+ "name": "Hero Hoodie",
967
+ "sku": "MH07"
968
+ },
969
+ {
970
+ "name": "Oslo Trek Hoodie",
971
+ "sku": "MH08"
972
+ }
973
+ ]
974
+ }
975
+ },
976
+ "template_7": {
977
+ "description": "Create a new product with SKU {sku}, price {price}",
978
+ "tier": "Hard",
979
+ "app": "shopping_admin",
980
+ "slots": [
981
+ "sku",
982
+ "price",
983
+ "product_name"
984
+ ],
985
+ "pool": {
986
+ "product_spec": [
987
+ {
988
+ "sku": "HAR-TEST-001",
989
+ "price": 19.99,
990
+ "product_name": "HAR Training Widget Alpha"
991
+ },
992
+ {
993
+ "sku": "HAR-TEST-002",
994
+ "price": 34.5,
995
+ "product_name": "HAR Training Widget Beta"
996
+ },
997
+ {
998
+ "sku": "HAR-TEST-003",
999
+ "price": 9.99,
1000
+ "product_name": "HAR Economy Pack"
1001
+ },
1002
+ {
1003
+ "sku": "HAR-TEST-004",
1004
+ "price": 49.0,
1005
+ "product_name": "HAR Premium Kit"
1006
+ },
1007
+ {
1008
+ "sku": "HAR-TEST-005",
1009
+ "price": 7.75,
1010
+ "product_name": "HAR Starter Bundle"
1011
+ },
1012
+ {
1013
+ "sku": "HAR-TEST-006",
1014
+ "price": 129.0,
1015
+ "product_name": "HAR Deluxe Set"
1016
+ },
1017
+ {
1018
+ "sku": "HAR-TEST-007",
1019
+ "price": 22.0,
1020
+ "product_name": "HAR Standard Unit"
1021
+ },
1022
+ {
1023
+ "sku": "HAR-TEST-008",
1024
+ "price": 14.95,
1025
+ "product_name": "HAR Basic Module"
1026
+ },
1027
+ {
1028
+ "sku": "HAR-TEST-009",
1029
+ "price": 59.99,
1030
+ "product_name": "HAR Advanced Pack"
1031
+ },
1032
+ {
1033
+ "sku": "HAR-TEST-010",
1034
+ "price": 3.5,
1035
+ "product_name": "HAR Mini Component"
1036
+ },
1037
+ {
1038
+ "sku": "HAR-TEST-011",
1039
+ "price": 89.0,
1040
+ "product_name": "HAR Pro Edition"
1041
+ },
1042
+ {
1043
+ "sku": "HAR-TEST-012",
1044
+ "price": 11.25,
1045
+ "product_name": "HAR Lite Version"
1046
+ },
1047
+ {
1048
+ "sku": "HAR-TEST-013",
1049
+ "price": 199.99,
1050
+ "product_name": "HAR Enterprise Module"
1051
+ },
1052
+ {
1053
+ "sku": "HAR-TEST-014",
1054
+ "price": 6.0,
1055
+ "product_name": "HAR Sample Item"
1056
+ },
1057
+ {
1058
+ "sku": "HAR-TEST-015",
1059
+ "price": 45.0,
1060
+ "product_name": "HAR Mid-Range Pack"
1061
+ },
1062
+ {
1063
+ "sku": "HAR-TEST-016",
1064
+ "price": 25.0,
1065
+ "product_name": "HAR Core Component"
1066
+ },
1067
+ {
1068
+ "sku": "HAR-TEST-017",
1069
+ "price": 75.0,
1070
+ "product_name": "HAR Extended Kit"
1071
+ },
1072
+ {
1073
+ "sku": "HAR-TEST-018",
1074
+ "price": 18.5,
1075
+ "product_name": "HAR Value Bundle"
1076
+ },
1077
+ {
1078
+ "sku": "HAR-TEST-019",
1079
+ "price": 99.0,
1080
+ "product_name": "HAR Complete Suite"
1081
+ },
1082
+ {
1083
+ "sku": "HAR-TEST-020",
1084
+ "price": 2.99,
1085
+ "product_name": "HAR Micro Unit"
1086
+ }
1087
+ ]
1088
+ }
1089
+ }
1090
+ }
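The pools file above is consumed when instantiating a task from a template. A minimal sketch of that sampling step — the helper name `sample_task` and the inlined mini-pool are my own illustration, only the JSON layout comes from the file:

```python
import random

# Trimmed copy of the structure shown in parameter_pools.json (assumption:
# the real environment loads the full file with json.load instead).
pools = {
    "template_3": {
        "description": "Add {product_name} to a guest cart",
        "pool": {"product_name": [{"name": "Joust Duffle Bag", "sku": "24-MB01"}]},
    }
}

def sample_task(template_id: str, rng: random.Random) -> dict:
    """Draw one value per slot and fill the description template."""
    t = pools[template_id]
    chosen = {slot: rng.choice(vals) for slot, vals in t["pool"].items()}
    # Dict-valued slots render with their human-readable "name" field;
    # the full dict (with sku etc.) is kept aside for the grader.
    fills = {k: (v["name"] if isinstance(v, dict) else v) for k, v in chosen.items()}
    return {"task": t["description"].format(**fills), "ground_truth": chosen}

task = sample_task("template_3", random.Random(0))
print(task["task"])  # -> Add Joust Duffle Bag to a guest cart
```

This mirrors the notes embedded in the pools: the task string exposes only display values, while identifiers like `sku` stay hidden for verification.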
pyproject.toml ADDED
@@ -0,0 +1,37 @@

1
+ [build-system]
2
+ requires = ["setuptools>=45", "wheel"]
3
+ build-backend = "setuptools.build_meta"
4
+
5
+ [project]
6
+ name = "openenv-harvestgym"
7
+ version = "0.1.0"
8
+ description = "HARvestGym: RL environment for training API-native web agents via HAR-guided exploration"
9
+ requires-python = ">=3.10"
10
+ dependencies = [
11
+ "openenv-core[core]>=0.2.2",
12
+ "pydantic>=2.0.0",
13
+ "fastapi>=0.100.0",
14
+ "uvicorn>=0.23.0",
15
+ "requests>=2.31.0",
16
+ "rank-bm25>=0.2.2",
17
+ "sentence-transformers>=3.0.0",
18
+ "openai>=1.0.0",
19
+ "numpy>=1.24.0",
20
+ ]
21
+
22
+ [project.optional-dependencies]
23
+ dev = [
24
+ "pytest>=8.0.0",
25
+ "pytest-cov>=4.0.0",
26
+ "pytest-asyncio>=0.23.0",
27
+ ]
28
+
29
+ [project.scripts]
30
+ server = "server.app:main"
31
+
32
+ [tool.setuptools]
33
+ include-package-data = true
34
+ packages = ["server", "server.tools"]
35
+
36
+ [tool.setuptools.package-data]
37
+ "*" = ["hars/*.har", "catalogs/*.json", "parameter_pools.json"]
scripts/build_parameter_pools.py ADDED
@@ -0,0 +1,364 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Build (or refresh) parameter_pools.json by calling the live EC2 application APIs.
4
+
5
+ Usage:
6
+ python scripts/build_parameter_pools.py --host ec2-16-59-2-56.us-east-2.compute.amazonaws.com
7
+ python scripts/build_parameter_pools.py --host localhost # if running directly on EC2
8
+ python scripts/build_parameter_pools.py --host <IP> --output parameter_pools.json
9
+
10
+ Requirements: pip install requests
11
+ """
12
+
13
+ import argparse
14
+ import json
15
+ import sys
16
+ from datetime import date
17
+ from pathlib import Path
18
+
19
+ try:
20
+ import requests
21
+ requests.packages.urllib3.disable_warnings()
22
+ except ImportError:
23
+ print("pip install requests")
24
+ sys.exit(1)
25
+
26
+ # ── Config ────────────────────────────────────────────────────────────────────
27
+ PORTS = {
28
+ "shopping": 7770,
29
+ "shopping_admin": 7780,
30
+ "forum": 9999,
31
+ "wikipedia": 8888,
32
+ }
33
+
34
+ ADMIN_USER = "admin"
35
+ ADMIN_PASS = "admin1234"
36
+
37
+ # Wikipedia articles to verify exist in the ZIM snapshot
38
+ WIKIPEDIA_TITLES = [
39
+ ("Python (programming language)", "Python programming language", "Python_(programming_language)"),
40
+ ("Albert Einstein", "Albert Einstein", "Albert_Einstein"),
41
+ ("World War II", "World War II", "World_War_II"),
42
+ ("Photosynthesis", "Photosynthesis", "Photosynthesis"),
43
+ ("Marie Curie", "Marie Curie", "Marie_Curie"),
44
+ ("Moon", "Moon", "Moon"),
45
+ ("JavaScript", "JavaScript", "JavaScript"),
46
+ ("Eiffel Tower", "Eiffel Tower", "Eiffel_Tower"),
47
+ ("Black hole", "Black hole", "Black_hole"),
48
+ ("Charles Darwin", "Charles Darwin", "Charles_Darwin"),
49
+ ("Artificial intelligence", "Artificial intelligence", "Artificial_intelligence"),
50
+ ("DNA", "DNA", "DNA"),
51
+ ("Mount Everest", "Mount Everest", "Mount_Everest"),
52
+ ("Isaac Newton", "Isaac Newton", "Isaac_Newton"),
53
+ ("Solar System", "Solar System", "Solar_System"),
54
+ ("Great Wall of China", "Great Wall of China", "Great_Wall_of_China"),
55
+ ("William Shakespeare", "William Shakespeare", "William_Shakespeare"),
56
+ ("Amazon River", "Amazon River", "Amazon_River"),
57
+ ("Quantum mechanics", "Quantum mechanics", "Quantum_mechanics"),
58
+ ("Napoleon", "Napoleon", "Napoleon"),
59
+ ]
60
+
61
+ # Post titles generated for template_5 — not fetched from the live app
62
+ FORUM_POST_TITLES = [
63
+ "Thoughts on the latest developments in AI safety",
64
+ "Best practices for remote work in 2026",
65
+ "How do you stay motivated when learning a new skill?",
66
+ "What are your favourite open-source projects right now?",
67
+ "Underrated books that changed how you think",
68
+ "Tips for beginner photographers — what I wish I knew",
69
+ "The most interesting science paper I read this week",
70
+ "Ask me anything about Python performance tuning",
71
+ "Weekly discussion: what are you building this month?",
72
+ "Hidden gems in streaming music you should know about",
73
+ "Travel destinations that are worth the hype",
74
+ "How to cook a perfect risotto — my method after 10 attempts",
75
+ "What sport have you picked up recently and why?",
76
+ "Recommend a documentary that genuinely surprised you",
77
+ "Discussion: is functional programming overrated?",
78
+ "Things that made you better at managing personal finance",
79
+ "The weirdest film you watched and actually enjoyed",
80
+ "My experience switching from VS Code to a different editor",
81
+ "Why I started journaling and what changed",
82
+ "Gaming setup upgrades that actually made a difference",
83
+ ]
84
+
85
+ # ── Helpers ───────────────────────────────────────────────────────────────────
86
+
87
+ def base_url(host: str, app: str) -> str:
88
+ return f"http://{host}:{PORTS[app]}"
89
+
90
+
91
+ def get_admin_token(host: str) -> str:
92
+ url = f"{base_url(host, 'shopping_admin')}/rest/V1/integration/admin/token"
93
+ resp = requests.post(url, json={"username": ADMIN_USER, "password": ADMIN_PASS}, timeout=10)
94
+ resp.raise_for_status()
95
+ return resp.json()
96
+
97
+
98
+ def admin_get(host: str, path: str, token: str, params: dict | None = None):
99
+ url = f"{base_url(host, 'shopping_admin')}{path}"
100
+ resp = requests.get(url, headers={"Authorization": f"Bearer {token}"}, params=params, timeout=15)
101
+ resp.raise_for_status()
102
+ return resp.json()
103
+
104
+
105
+ # ── Pool builders ─────────────────────────────────────────────────────────────
106
+
107
+ def build_category_pool(host: str, token: str) -> list:
108
+ """Template 1: leaf categories from GET /rest/V1/categories/list."""
109
+ data = admin_get(host, "/rest/V1/categories/list", token, params={"searchCriteria[pageSize]": 500})
110
+ items = data.get("items", [])
111
+ pool = []
112
+ for item in items:
113
+ # include all named categories; caller can filter to leaf nodes if needed
114
+ if item.get("name") and item.get("id"):
115
+ pool.append({"name": item["name"], "category_id": item["id"]})
116
+ return pool
117
+
118
+
119
+ def build_product_pool(host: str, token: str, max_items: int = 50) -> list:
120
+ """Templates 3 & 6: simple, in-stock products."""
121
+ data = admin_get(host, "/rest/V1/products", token, params={
122
+ "searchCriteria[filterGroups][0][filters][0][field]": "type_id",
123
+ "searchCriteria[filterGroups][0][filters][0][value]": "simple",
124
+ "searchCriteria[filterGroups][0][filters][0][conditionType]": "eq",
125
+ "searchCriteria[pageSize]": max_items,
126
+ })
127
+ items = data.get("items", [])
128
+ pool = []
129
+ for item in items:
130
+ name = item.get("name", "").strip()
131
+ sku = item.get("sku", "").strip()
132
+ if name and sku:
133
+ pool.append({"name": name, "sku": sku})
134
+ return pool
135
+
136
+
137
+ def build_wikipedia_pool(host: str) -> list:
138
+ """Template 2: verify known articles exist in the ZIM snapshot."""
139
+ base = base_url(host, "wikipedia")
140
+ verified = []
141
+ for display, search_query, expected_slug in WIKIPEDIA_TITLES:
142
+ check_url = f"{base}/wikipedia_en_all_maxi_2022-05/A/{expected_slug}"
143
+ try:
144
+ r = requests.head(check_url, timeout=8, allow_redirects=True)
145
+ if r.status_code == 200:
146
+ verified.append({
147
+ "display": display,
148
+ "search_query": search_query,
149
+ "expected_slug": expected_slug,
150
+ })
151
+ else:
152
+ print(f" [wikipedia] WARNING: {expected_slug} → HTTP {r.status_code}, skipping")
153
+ except Exception as e:
154
+ print(f" [wikipedia] WARNING: could not reach {check_url}: {e}")
155
+ return verified
156
+
157
+
158
+ def build_forum_category_pool(host: str) -> list:
159
+ """Templates 4 & 5: forum slugs with at least one submission."""
160
+ base = base_url(host, "forum")
161
+ pool = []
162
+ page = 1
163
+ while True:
164
+ try:
165
+ r = requests.get(f"{base}/api/forums", params={"page": page}, timeout=10)
166
+ r.raise_for_status()
167
+ data = r.json()
168
+ except Exception as e:
169
+ print(f" [forum] WARNING: could not reach forums API: {e}")
170
+ break
171
+ items = data if isinstance(data, list) else data.get("items", data.get("forums", []))
172
+ if not items:
173
+ break
174
+ for item in items:
175
+ name = item.get("name") or item.get("forum_name") or item.get("normalizedName")
176
+ display = item.get("title") or item.get("displayName") or name
177
+ if name:
178
+ pool.append({"forum_name": name, "display_name": display or name})
179
+ if len(items) < 20:
180
+ break
181
+ page += 1
182
+ # deduplicate by forum_name
183
+ seen = set()
184
+ deduped = []
185
+ for entry in pool:
186
+ if entry["forum_name"] not in seen:
187
+ seen.add(entry["forum_name"])
188
+ deduped.append(entry)
189
+ return deduped
190
+
191
+
192
+ # ── Template 7 pool ───────────────────────────────────────────────────────────
193
+
194
+ def build_admin_product_pool() -> list:
195
+ """Template 7: fully generated SKU/price/name tuples. No API call needed."""
196
+ specs = [
197
+ ("HAR-TEST-001", 19.99, "HAR Training Widget Alpha"),
198
+ ("HAR-TEST-002", 34.50, "HAR Training Widget Beta"),
199
+ ("HAR-TEST-003", 9.99, "HAR Economy Pack"),
200
+ ("HAR-TEST-004", 49.00, "HAR Premium Kit"),
201
+ ("HAR-TEST-005", 7.75, "HAR Starter Bundle"),
202
+ ("HAR-TEST-006", 129.00, "HAR Deluxe Set"),
203
+ ("HAR-TEST-007", 22.00, "HAR Standard Unit"),
204
+ ("HAR-TEST-008", 14.95, "HAR Basic Module"),
205
+ ("HAR-TEST-009", 59.99, "HAR Advanced Pack"),
206
+ ("HAR-TEST-010", 3.50, "HAR Mini Component"),
207
+ ("HAR-TEST-011", 89.00, "HAR Pro Edition"),
208
+ ("HAR-TEST-012", 11.25, "HAR Lite Version"),
209
+ ("HAR-TEST-013", 199.99, "HAR Enterprise Module"),
210
+ ("HAR-TEST-014", 6.00, "HAR Sample Item"),
211
+ ("HAR-TEST-015", 45.00, "HAR Mid-Range Pack"),
212
+ ("HAR-TEST-016", 25.00, "HAR Core Component"),
213
+ ("HAR-TEST-017", 75.00, "HAR Extended Kit"),
214
+ ("HAR-TEST-018", 18.50, "HAR Value Bundle"),
215
+ ("HAR-TEST-019", 99.00, "HAR Complete Suite"),
216
+ ("HAR-TEST-020", 2.99, "HAR Micro Unit"),
217
+ ]
218
+ return [{"sku": sku, "price": price, "product_name": name} for sku, price, name in specs]
219
+
220
+
221
+ # ── Main ──────────────────────────────────────────────────────────────────────
222
+
223
+ def main():
224
+ parser = argparse.ArgumentParser(description="Build HARvestGym parameter pools from live EC2 apps.")
225
+ parser.add_argument("--host", default="ec2-16-59-2-56.us-east-2.compute.amazonaws.com")
226
+ parser.add_argument("--output", default="parameter_pools.json")
227
+ args = parser.parse_args()
228
+
229
+ host = args.host
230
+ output = Path(args.output)
231
+
232
+ print(f"Building parameter pools — host: {host}\n")
233
+
234
+ # Admin token (needed for Shopping endpoints)
235
+ print("[1/7] Fetching admin token...")
236
+ token = get_admin_token(host)
237
+ print(" OK\n")
238
+
239
+ # Template 1 — category pool
240
+ print("[2/7] Template 1: Shopping categories...")
241
+ cat_pool = build_category_pool(host, token)
242
+ print(f" {len(cat_pool)} categories found\n")
243
+
244
+ # Templates 3 & 6 — product pool
245
+ print("[3/7] Templates 3 & 6: Shopping products (simple, in-stock)...")
246
+ prod_pool = build_product_pool(host, token)
247
+ print(f" {len(prod_pool)} products found\n")
248
+
249
+ # Template 2 — Wikipedia
250
+ print("[4/7] Template 2: Verifying Wikipedia articles...")
251
+ wiki_pool = build_wikipedia_pool(host)
252
+ print(f" {len(wiki_pool)} articles verified\n")
253
+
254
+ # Templates 4 & 5 — Forum categories
255
+ print("[5/7] Templates 4 & 5: Forum categories...")
256
+ forum_pool = build_forum_category_pool(host)
257
+ # template_5 category pool excludes any image-only forums — same list for now
258
+ forum_pool_t5 = forum_pool
259
+ print(f" {len(forum_pool)} forums found\n")
260
+
261
+ # Template 5 — post titles (static)
262
+ print("[6/7] Template 5: Post titles (static list, no API call)...")
263
+ print(f" {len(FORUM_POST_TITLES)} titles loaded\n")
264
+
265
+ # Template 7 — admin product specs (static)
266
+ print("[7/7] Template 7: Admin product specs (generated, no API call)...")
267
+ admin_pool = build_admin_product_pool()
268
+ print(f" {len(admin_pool)} product specs loaded\n")
269
+
270
+ # ── Assemble output ───────────────────────────────────────────────────────
271
+ pools = {
272
+ "_meta": {
273
+ "description": "Static parameter pools for the 7 HARvestGym task templates.",
274
+ "generated_at": str(date.today()),
275
+ "generated_from_host": host,
276
+ "how_to_refresh": "python scripts/build_parameter_pools.py --host <EC2_HOST>",
277
+ "source_apps": {
278
+ "shopping": f"http://{host}:{PORTS['shopping']}/",
279
+ "shopping_admin": f"http://{host}:{PORTS['shopping_admin']}/admin",
280
+ "forum": f"http://{host}:{PORTS['forum']}/",
281
+ "wikipedia": f"http://{host}:{PORTS['wikipedia']}/",
282
+ },
283
+ },
284
+ "template_1": {
285
+ "description": "List products in category {category_name}",
286
+ "tier": "Easy",
287
+ "app": "shopping",
288
+ "slots": ["category_name"],
289
+ "source_endpoint": "GET /rest/V1/categories/list?searchCriteria[pageSize]=500",
290
+ "note": "Only leaf categories are meaningful for product listing tasks. category_id is stored for grader use — not exposed in the task string.",
291
+ "pool": {"category_name": cat_pool},
292
+ },
293
+ "template_2": {
294
+ "description": "Retrieve article summary for {title}",
295
+ "tier": "Easy",
296
+ "app": "wikipedia",
297
+ "slots": ["title"],
298
+ "source_endpoint": "HEAD /wikipedia_en_all_maxi_2022-05/A/{slug} (verification only)",
299
+ "note": "expected_slug is stored for grader verification. The agent must derive the slug independently via GET /search.",
300
+ "pool": {"title": wiki_pool},
301
+ },
302
+ "template_3": {
303
+ "description": "Add {product_name} to a guest cart",
304
+ "tier": "Medium",
305
+ "app": "shopping",
306
+ "slots": ["product_name"],
307
+ "source_endpoint": "GET /rest/V1/products?searchCriteria[pageSize]=50 (simple, in-stock only)",
308
+ "note": "SKU stored for grader use — agent must independently discover it via product search.",
309
+ "pool": {"product_name": prod_pool},
310
+ },
311
+ "template_4": {
312
+ "description": "Retrieve all posts in {forum_category} (authed)",
313
+ "tier": "Medium",
314
+ "app": "forum",
315
+ "slots": ["forum_category"],
316
+ "source_endpoint": "GET /api/forums?page=1",
317
+ "note": "forum_name is the URL slug; display_name is the human-readable label.",
318
+ "pool": {"forum_category": forum_pool},
319
+ },
320
+ "template_5": {
321
+ "description": "Create a post titled {title} in {category}",
322
+ "tier": "Hard",
323
+ "app": "forum",
324
+ "slots": ["title", "category"],
325
+ "source_endpoint": "GET /api/forums?page=1 (for category); titles are generated",
326
+ "note": "title and category are sampled independently. category list excludes any image-only forums.",
327
+ "pool": {
328
+ "title": FORUM_POST_TITLES,
329
+ "category": forum_pool_t5,
330
+ },
331
+ },
332
+ "template_6": {
333
+ "description": "Guest checkout for {product_name}",
334
+ "tier": "Hard",
335
+ "app": "shopping",
336
+ "slots": ["product_name"],
337
+ "source_endpoint": "GET /rest/V1/products?searchCriteria[pageSize]=50 (same pool as template_3)",
338
+ "note": "Guest checkout email is always test@example.com (STATIC). Grader queries /rest/V1/orders by email to confirm order creation.",
339
+ "pool": {"product_name": prod_pool},
340
+ },
341
+ "template_7": {
342
+ "description": "Create a new product with SKU {sku}, price {price}",
343
+ "tier": "Hard",
344
+ "app": "shopping_admin",
345
+ "slots": ["sku", "price", "product_name"],
346
+ "source_endpoint": "Fully generated — SKUs follow HAR-XXXXX pattern, no collision with existing catalog.",
347
+ "note": "All slots sampled together as a product_spec tuple. attribute_set_id=4 (Default) is STATIC. Grader calls GET /rest/V1/products/{sku} to verify creation.",
348
+ "pool": {"product_spec": admin_pool},
349
+ },
350
+ }
351
+
352
+ output.write_text(json.dumps(pools, indent=2))
353
+ print(f"Written to {output} ({output.stat().st_size:,} bytes)")
354
+
355
+ # Summary
356
+ print("\n=== POOL SUMMARY ===")
357
+ for tid in [k for k in pools if k.startswith("template")]:
358
+ t = pools[tid]
359
+ counts = {slot: len(vals) for slot, vals in t["pool"].items()}
360
+ print(f" {tid}: {counts}")
361
+
362
+
363
+ if __name__ == "__main__":
364
+ main()
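At episode start, the written pools file is sampled to produce a concrete task string. A minimal sketch of that consumption (the pool contents here are stand-ins, and `instantiate` is a hypothetical helper, not part of the script above):

```python
import random

# Stand-in for one template entry, shaped like parameter_pools.json above.
pools = {
    "template_4": {
        "description": "Retrieve all posts in {forum_category} (authed)",
        "pool": {"forum_category": ["books", "gaming", "science"]},
    },
}

def instantiate(template_id: str, seed: int) -> tuple[str, dict]:
    """Sample one value per slot, then fill the description template."""
    t = pools[template_id]
    rng = random.Random(seed)
    params = {slot: rng.choice(vals) for slot, vals in t["pool"].items()}
    return t["description"].format(**params), params

desc, params = instantiate("template_4", seed=0)
```

Seeding a `random.Random` per episode keeps task sampling reproducible across runs.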
server/__init__.py ADDED
File without changes
server/app.py ADDED
@@ -0,0 +1,49 @@
+ """
+ FastAPI application for HARvestGym.
+ 
+ Exposes HARvestGymEnvironment over HTTP endpoints compatible with OpenEnv EnvClient.
+ 
+ Endpoints:
+     POST /reset  — Reset the environment
+     POST /step   — Execute an action
+     GET /state   — Get current state
+     GET /schema  — Get action/observation schemas
+     GET /health  — Health check
+     WS /ws       — WebSocket for persistent sessions
+ """
+ 
+ try:
+     from openenv.core.env_server.http_server import create_app
+ except Exception as e:
+     raise ImportError(
+         "openenv is required. Install dependencies with 'uv sync'"
+     ) from e
+ 
+ try:
+     from .models import HarvestGymAction, HarvestGymObservation, HARvestGymEnvironment
+ except ModuleNotFoundError:
+     from server.models import HarvestGymAction, HarvestGymObservation, HARvestGymEnvironment
+ 
+ app = create_app(
+     HARvestGymEnvironment,
+     HarvestGymAction,
+     HarvestGymObservation,
+     env_name="HARvestGym",
+     max_concurrent_envs=4,
+ )
+ 
+ 
+ def main(host: str = "0.0.0.0", port: int = 8000):
+     import uvicorn
+     uvicorn.run(app, host=host, port=port)
+ 
+ 
+ if __name__ == "__main__":
+     import argparse
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--port", type=int, default=8000)
+     args = parser.parse_args()
+     main(port=args.port)
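A client drives these endpoints with JSON bodies; the `/step` payload for this environment is just a serialized `HarvestGymAction`. A sketch of building one (field names taken from `server/models.py`; the endpoint wiring itself comes from OpenEnv's `create_app`, so the exact envelope may differ):

```python
import json

# One tool call, serialized the way an EnvClient would POST it to /step.
action = {
    "tool": "curl_exec",
    "args": {"command": "curl -s http://localhost:7770/rest/V1/products?searchCriteria[pageSize]=5"},
}
body = json.dumps(action)

# Round-trips cleanly, matching HarvestGymAction's two fields.
decoded = json.loads(body)
```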
server/episode.py ADDED
@@ -0,0 +1,53 @@
+ """Episode data structures for HARvestGym."""
+ 
+ from dataclasses import dataclass, field
+ from typing import Any
+ 
+ 
+ @dataclass
+ class CurlCall:
+     method: str
+     url: str
+     path: str  # normalized (IDs replaced with {id})
+     headers: dict
+     body: dict | str | None
+     status_code: int
+     response_body: Any
+     response_headers: dict = field(default_factory=dict)
+ 
+ 
+ @dataclass
+ class Step:
+     step_num: int
+     tool: str  # browser_agent | search_endpoints | curl_exec | search_episode_data | done
+     action: str  # raw tool call string
+     result: Any  # tool return value
+     curl_parsed: CurlCall | None = None
+ 
+ 
+ @dataclass
+ class Task:
+     template_id: int  # 1-7
+     description: str  # instantiated task string
+     params: dict  # e.g. {"product_name": "Radiant Tee", "sku": "MH01"}
+     app: str  # shopping | forum | wikipedia | shopping_admin
+     base_url: str
+     difficulty: str  # easy | medium | hard
+ 
+ 
+ @dataclass
+ class Episode:
+     task: Task
+     steps: list[Step] = field(default_factory=list)
+     session_state: dict = field(default_factory=dict)
+     total_steps: int = 0
+     terminated_by: str = ""  # "done_call" | "max_steps"
+ 
+ 
+ @dataclass
+ class EpisodeResult:
+     task_score: float  # 0.0-1.0 from grader
+     parameter_sourcing_score: float  # 0.0-1.0 from trajectory analysis
+     auth_obtained: bool
+     reward: float  # final composite reward
+     details: dict = field(default_factory=dict)
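`CurlCall.path` stores a normalized path with concrete IDs replaced by `{id}`. The normalization itself happens elsewhere in the pipeline; a plausible sketch of it (hypothetical helper, not part of this file) is:

```python
import re

def normalize_path(path: str) -> str:
    """Replace numeric or long-hex path segments with an {id} placeholder."""
    out = []
    for seg in path.split("/"):
        # Numeric IDs and long hex tokens (e.g. cart hashes) become {id}.
        if seg.isdigit() or re.fullmatch(r"[0-9a-f]{16,}", seg):
            out.append("{id}")
        else:
            out.append(seg)
    return "/".join(out)
```

Normalizing this way lets two calls that differ only in a concrete ID count as hits on the same endpoint.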
server/judge.py ADDED
@@ -0,0 +1,691 @@
+ """
+ HARvestGym Judge — deterministic programmatic graders for all 7 task templates.
+ 
+ Each grader inspects the episode trajectory and/or probes the live application
+ to compute a task score in [0.0, 1.0], then maps it to the reward range.
+ """
+ 
+ from __future__ import annotations
+ 
+ import json
+ import re
+ from pathlib import Path
+ from typing import Any
+ 
+ try:
+     import requests as _requests
+     _REQUESTS_AVAILABLE = True
+ except ImportError:
+     _REQUESTS_AVAILABLE = False
+ 
+ from .episode import Episode, EpisodeResult, Step, Task
+ 
+ # ---------------------------------------------------------------------------
+ # Reward tables (score → reward)
+ # ---------------------------------------------------------------------------
+ 
+ REWARD_TABLES = {
+     1: {1.0: 2.0, 0.3: 0.5, 0.0: -1.5},
+     2: {1.0: 2.0, 0.5: 0.5, 0.0: -1.5},
+     3: {1.0: 3.5, 0.2: 0.5, 0.15: 0.3, 0.0: -1.5},
+     4: {1.0: 3.5, 0.3: 0.8, 0.0: -1.5},
+     5: {1.0: 5.0, 0.5: 1.5, 0.3: 0.8, 0.0: -1.5},
+     6: {1.0: 5.0, 0.6: 2.5, 0.3: 0.8, 0.1: 0.3, 0.0: -1.5},
+     7: {1.0: 5.0, 0.7: 2.0, 0.2: 0.5, 0.0: -1.5},
+ }
+ 
+ AUTH_BONUS = 0.3  # added when auth was successfully obtained even if task fails
+ 
+ 
+ def _score_to_reward(score: float, template_id: int) -> float:
+     """Map a [0,1] task score to a reward using the template's reward table."""
+     table = REWARD_TABLES.get(template_id, {1.0: 2.0, 0.0: -1.5})
+     # Walk thresholds from highest to lowest; take the first one the score meets
+     thresholds = sorted(table.keys(), reverse=True)
+     for threshold in thresholds:
+         if score >= threshold:
+             return table[threshold]
+     return table.get(0.0, -1.5)
+ 
+ 
+ # ---------------------------------------------------------------------------
+ # HTTP probe helper
+ # ---------------------------------------------------------------------------
+ 
+ def _judge_probe(path: str, base_url: str, headers: dict | None = None,
+                  timeout: int = 10) -> Any:
+     """Issue an HTTP GET from the judge (not the model) to verify live state."""
+     if not _REQUESTS_AVAILABLE:
+         return None
+     url = base_url.rstrip("/") + path
+     try:
+         resp = _requests.get(url, headers=headers or {}, timeout=timeout, verify=False)
+         result = type("ProbeResult", (), {
+             "status_code": resp.status_code,
+             "body": None,
+         })()
+         try:
+             result.body = resp.json()
+         except Exception:
+             result.body = resp.text
+         return result
+     except Exception as e:
+         print(f"[judge] probe failed {url}: {e}", flush=True)
+         return None
+ 
+ 
+ def _judge_post_probe(path: str, base_url: str, data: dict | None = None,
+                       headers: dict | None = None, timeout: int = 10) -> Any:
+     """Issue an HTTP POST probe from the judge."""
+     if not _REQUESTS_AVAILABLE:
+         return None
+     url = base_url.rstrip("/") + path
+     try:
+         resp = _requests.post(url, json=data, headers=headers or {}, timeout=timeout, verify=False)
+         result = type("ProbeResult", (), {"status_code": resp.status_code, "body": None})()
+         try:
+             result.body = resp.json()
+         except Exception:
+             result.body = resp.text
+         return result
+     except Exception as e:
+         print(f"[judge] post probe failed {url}: {e}", flush=True)
+         return None
+ 
+ 
+ # ---------------------------------------------------------------------------
+ # Shared helpers
+ # ---------------------------------------------------------------------------
+ 
+ def _fuzzy_match(a: str, b: str) -> bool:
+     """Case-insensitive substring match in both directions."""
+     a, b = a.lower().strip(), b.lower().strip()
+     return a in b or b in a
+ 
+ 
+ def _path_matches(path: str, pattern: str) -> bool:
+     """Check if a (normalized) path matches a pattern."""
+     return pattern.lower() in path.lower() or path.lower() in pattern.lower()
+ 
+ 
+ def _extract_field(obj: Any, field_path: str) -> Any:
+     """Extract a nested field via dot notation: 'items.0.sku'."""
+     parts = field_path.split(".")
+     current = obj
+     for part in parts:
+         if current is None:
+             return None
+         if isinstance(current, dict):
+             current = current.get(part)
+         elif isinstance(current, list):
+             try:
+                 current = current[int(part)]
+             except (IndexError, ValueError):
+                 return None
+         else:
+             return None
+     return current
+ 
+ 
+ def _get_curl_steps(episode: Episode):
+     """Return only steps that have curl_parsed."""
+     return [s for s in episode.steps if s.curl_parsed is not None]
+ 
+ 
+ # ---------------------------------------------------------------------------
+ # Template graders
+ # ---------------------------------------------------------------------------
+ 
+ def grade_template_1(episode: Episode, task: Task) -> float:
+     """Easy — Shopping: List products in category {category_name}"""
+     category_name = task.params.get("category_name", "")
+ 
+     for step in _get_curl_steps(episode):
+         cp = step.curl_parsed
+         if cp.status_code == 200:
+             body = cp.response_body
+             if isinstance(body, dict) and "items" in body:
+                 items = body["items"]
+                 if len(items) > 0:
+                     # Check if any item mentions the category
+                     for item in items:
+                         if _item_matches_category(item, category_name):
+                             return 1.0
+                     # Items returned but can't verify category — partial
+                     return 0.3
+             # Also check if it's a raw list
+             if isinstance(body, list) and len(body) > 0:
+                 return 0.3
+ 
+     return 0.0
+ 
+ 
+ def _item_matches_category(item: dict, category_name: str) -> bool:
+     """Check if an item is in the given category (best-effort)."""
+     # Check extension_attributes for category links; the category *name* is
+     # not verified here — the response is trusted at face value.
+     ext = item.get("extension_attributes", {})
+     category_links = ext.get("category_links", [])
+     if category_links:
+         return True  # has category links; assume matches
+     # Fallback: just having items is enough for category listing
+     return True
+ 
+ 
+ def grade_template_2(episode: Episode, task: Task) -> float:
+     """Easy — Wikipedia: Retrieve article for {title}"""
+     title = task.params.get("title", "")
+     title_slug = title.lower().replace(" ", "_")
+     title_lower = title.lower()
+ 
+     for step in _get_curl_steps(episode):
+         cp = step.curl_parsed
+         if cp.status_code == 200:
+             url_lower = cp.url.lower()
+             # Direct article fetch (slug appears in the URL, e.g. wiki/<slug>)
+             if title_slug in url_lower:
+                 return 1.0
+ 
+     # Search result found the article
+     for step in _get_curl_steps(episode):
+         cp = step.curl_parsed
+         if cp.status_code == 200:
+             body_str = str(cp.response_body).lower()
+             if title_lower in body_str and "wiki" in cp.url.lower():
+                 return 0.5
+ 
+     return 0.0
+ 
+ 
+ def _extract_cart_id(episode: Episode) -> str | None:
+     """Extract guest cart ID from episode trajectory."""
+     for step in _get_curl_steps(episode):
+         cp = step.curl_parsed
+         if cp.status_code == 200:
+             # POST /rest/V1/guest-carts returns bare string cart ID
+             if "guest-carts" in cp.path and cp.method == "POST":
+                 body = cp.response_body
+                 if isinstance(body, str) and len(body) > 5:
+                     return body.strip('"').strip()
+     return None
+ 
+ 
+ def grade_template_3(episode: Episode, task: Task) -> float:
+     """Medium — Shopping: Add {product_name} to a guest cart"""
+     product_name = task.params.get("product_name", "")
+     sku = task.params.get("sku")
+ 
+     # Primary: check if add-to-cart responded with item_id
+     for step in _get_curl_steps(episode):
+         cp = step.curl_parsed
+         if cp.status_code == 200:
+             body = cp.response_body
+             if isinstance(body, dict) and "item_id" in body:
+                 # Verify the sku if we have it
+                 if sku and body.get("sku") == sku:
+                     return 1.0
+                 if _fuzzy_match(str(body.get("name", "")), product_name):
+                     return 1.0
+                 if body.get("item_id"):
+                     return 1.0
+ 
+     # Try live probe
+     cart_id = _extract_cart_id(episode)
+     if cart_id:
+         probe = _judge_probe(f"/rest/V1/guest-carts/{cart_id}", task.base_url)
+         if probe and probe.status_code == 200:
+             items = probe.body.get("items", []) if isinstance(probe.body, dict) else []
+             for item in items:
+                 if sku and item.get("sku") == sku:
+                     return 1.0
+                 if _fuzzy_match(str(item.get("name", "")), product_name):
+                     return 1.0
+             if len(items) == 0:
+                 return 0.2  # cart created, item not added
+ 
+     # Partial: cart was created
+     if cart_id:
+         return 0.2
+ 
+     # Partial: attempted cart creation
+     if any("guest-carts" in (s.curl_parsed.path or "") and
+            s.curl_parsed.method == "POST"
+            for s in _get_curl_steps(episode)):
+         return 0.15
+ 
+     return 0.0
+ 
+ 
+ def _check_forum_auth(episode: Episode) -> bool:
+     """Check if forum authentication was obtained."""
+     for step in _get_curl_steps(episode):
+         cp = step.curl_parsed
+         if cp.method == "POST" and "/login" in cp.path:
+             if cp.status_code in (200, 302):
+                 return True
+     return False
+ 
+ 
+ def _response_contains_posts(body: Any) -> bool:
+     """Check if a response body contains forum posts."""
+     if isinstance(body, list) and len(body) > 0:
+         return True
+     if isinstance(body, dict):
+         # Could be JSON with posts array or HTML
+         for key in ("posts", "items", "data", "results"):
+             if key in body and isinstance(body[key], list) and len(body[key]) > 0:
+                 return True
+         # Postmill returns HTML — check for common post indicators
+         body_str = str(body).lower()
+         if "post" in body_str or "submission" in body_str:
+             return True
+     if isinstance(body, str) and len(body) > 100:
+         return True  # HTML response from forum
+     return False
+ 
+ 
+ def grade_template_4(episode: Episode, task: Task) -> float:
+     """Medium — Forum: Retrieve posts in {forum_category} (authed)"""
+     forum_category = task.params.get("forum_category", "")
+     score = 0.0
+ 
+     auth_obtained = _check_forum_auth(episode)
+     if auth_obtained:
+         score += 0.3
+ 
+     for step in _get_curl_steps(episode):
+         cp = step.curl_parsed
+         if cp.status_code == 200:
+             url = cp.url
+             body = cp.response_body
+             if f"/f/{forum_category.lower()}" in url.lower():
+                 if _response_contains_posts(body):
+                     return 1.0
+             # Also accept generic post listing with the category in URL
+             if forum_category.lower() in url.lower() and _response_contains_posts(body):
+                 return 1.0
+ 
+     return score
+ 
+ 
+ def _check_csrf_in_trajectory(episode: Episode) -> bool:
+     """Check that a POST body contained a _csrf_token field."""
+     for step in _get_curl_steps(episode):
+         cp = step.curl_parsed
+         if cp.method == "POST":
+             body_str = str(cp.body or "").lower()
+             if "_csrf_token" in body_str and len(body_str) > 20:
+                 return True
+     return False
+ 
+ 
+ def _judge_probe_forum_category(category: str, base_url: str) -> list:
+     """Probe the forum to get posts in a category."""
+     probe = _judge_probe(f"/f/{category}.json", base_url)
+     if probe and probe.status_code == 200:
+         body = probe.body
+         if isinstance(body, dict):
+             return body.get("posts", body.get("submissions", []))
+         if isinstance(body, list):
+             return body
+     return []
+ 
+ 
+ def grade_template_5(episode: Episode, task: Task) -> float:
+     """Hard — Forum: Create a post titled {title} in {category}"""
+     title = task.params.get("title", "")
+     category = task.params.get("category", "")
+ 
+     auth_ok = _check_forum_auth(episode)
+     csrf_used = _check_csrf_in_trajectory(episode)
+ 
+     # Check if POST to submit returned success
+     for step in _get_curl_steps(episode):
+         cp = step.curl_parsed
+         if cp.method == "POST" and cp.status_code in (200, 201, 302):
+             path_lower = cp.path.lower()
+             if "submit" in path_lower or "post" in path_lower:
+                 # Post creation succeeded
+                 body_str = str(cp.response_body or "").lower()
+                 if title.lower() in body_str or "redirect" in str(cp.response_headers).lower():
+                     return 1.0
+                 if cp.status_code in (201, 302):
+                     return 1.0
+ 
+     # Try judge probe
+     posts = _judge_probe_forum_category(category, task.base_url)
+     for post in posts:
+         post_title = post.get("title", post.get("name", ""))
+         if _fuzzy_match(post_title, title):
+             return 1.0
+ 
+     if auth_ok and csrf_used:
+         return 0.5
+     if auth_ok:
+         return 0.3
+     return 0.0
+ 
+ 
+ def _checkout_stages_completed(episode: Episode, sku: str | None) -> int:
+     """Count checkout stages completed successfully."""
+     stages = 0
+     paths_hit = {
+         s.curl_parsed.path
+         for s in _get_curl_steps(episode)
+         if s.curl_parsed.status_code == 200
+     }
+ 
+     if any("guest-carts" in p and "{" not in p for p in paths_hit):
+         stages += 1
+     if any("guest-carts" in p and "items" in p for p in paths_hit):
+         stages += 1
+     if any("guest-carts" in p and ("shipping" in p or "email" in p) for p in paths_hit):
+         stages += 1
+     if any("guest-carts" in p and ("payment" in p or "order" in p) for p in paths_hit):
+         stages += 1
+ 
+     return stages
+ 
+ 
+ def grade_template_6(episode: Episode, task: Task) -> float:
+     """Hard — Shopping: Guest checkout for {product_name}"""
+     sku = task.params.get("sku")
+ 
+     # Check for order ID
+     for step in _get_curl_steps(episode):
+         cp = step.curl_parsed
+         if cp.status_code == 200:
+             body = cp.response_body
+             if isinstance(body, int) and body > 0:
+                 return 1.0
+             if isinstance(body, str):
+                 try:
+                     v = int(body.strip('"').strip())
+                     if v > 0:
+                         return 1.0
+                 except (ValueError, AttributeError):
+                     pass
+             if isinstance(body, dict) and body.get("order_id"):
+                 return 1.0
+ 
+     stages = _checkout_stages_completed(episode, sku)
+     if stages >= 4:
+         return 0.6
+     if stages >= 2:
+         return 0.3
+     if stages >= 1:
+         return 0.1
+     return 0.0
+ 
+ 
+ def _extract_admin_token(episode: Episode) -> str | None:
+     """Find admin bearer token from episode trajectory."""
+     for step in _get_curl_steps(episode):
+         cp = step.curl_parsed
+         if cp.status_code == 200 and "integration/admin/token" in cp.path:
+             body = cp.response_body
+             if isinstance(body, str) and len(body) > 10:
+                 return body.strip('"').strip()
+     return None
+ 
+ 
+ def _attempted_product_creation(episode: Episode, sku: str) -> bool:
+     """Check if the model attempted to create a product with this SKU."""
+     for step in _get_curl_steps(episode):
+         cp = step.curl_parsed
+         if cp.method == "POST" and "products" in cp.path:
+             body_str = str(cp.body or "").lower()
+             if sku.lower() in body_str:
+                 return True
+     return False
+ 
+ 
+ def grade_template_7(episode: Episode, task: Task) -> float:
+     """Hard — Shopping Admin: Create product with SKU {sku}, price {price}"""
+     sku = task.params.get("sku", "")
+     price = float(task.params.get("price", 0))
+ 
+     admin_token = _extract_admin_token(episode)
+     if not admin_token:
+         return 0.0
+ 
+     # Check if product creation returned success
+     for step in _get_curl_steps(episode):
+         cp = step.curl_parsed
+         if cp.status_code == 200 and cp.method == "POST" and "products" in cp.path:
+             body = cp.response_body
+             if isinstance(body, dict) and body.get("id"):
+                 actual_price = float(body.get("price", -1))
+                 price_ok = abs(actual_price - price) < 0.01
+                 return 1.0 if price_ok else 0.7
+ 
+     # Judge probe
+     probe = _judge_probe(
+         f"/rest/V1/products/{sku}",
+         task.base_url,
+         headers={"Authorization": f"Bearer {admin_token}"}
+     )
+     if probe and probe.status_code == 200 and isinstance(probe.body, dict):
+         actual_price = float(probe.body.get("price", -1))
+         price_ok = abs(actual_price - price) < 0.01
+         return 1.0 if price_ok else 0.7
+ 
+     if _attempted_product_creation(episode, sku):
+         return 0.2
+ 
+     return 0.0
+ 
+ 
+ # ---------------------------------------------------------------------------
+ # Parameter sourcing verification
+ # ---------------------------------------------------------------------------
+ 
+ def _load_catalog(app: str) -> list[dict]:
+     """Load the ground truth catalog for an app."""
+     # judge.py lives in server/, so the repo-root catalogs/ dir is one level up
+     catalog_path = Path(__file__).parent.parent / "catalogs" / f"{app}.json"
+     if not catalog_path.exists():
+         return []
+     try:
+         with open(catalog_path) as f:
+             data = json.load(f)
+         return data if isinstance(data, list) else data.get("endpoints", [])
+     except Exception:
+         return []
+ 
+ 
+ def _find_catalog_entry(path: str, method: str, catalog: list[dict]) -> dict | None:
+     method = method.upper()
+     for entry in catalog:
+         cat_method = entry.get("method", "GET").upper()
+         cat_path = entry.get("path", "")
+         # Pattern match: {id} in catalog matches any segment
+         if cat_method == method and _path_pattern_match(path, cat_path):
+             return entry
+     return None
+ 
+ 
+ def _path_pattern_match(actual_path: str, catalog_path: str) -> bool:
+     """Match actual path against catalog pattern with {id} wildcards."""
+     # Convert catalog pattern to regex
+     pattern = re.escape(catalog_path)
+     pattern = pattern.replace(r"\{", "{").replace(r"\}", "}")
+     pattern = re.sub(r"\{[^}]+\}", "[^/]+", pattern)
+     pattern = f"^{pattern}$"
+     return bool(re.match(pattern, actual_path, re.IGNORECASE))
+ 
+ 
+ def verify_parameter_sourcing(episode: Episode, task: Task) -> float:
+     """Analyze parameter sourcing across episode trajectory. Returns [0, 1] score."""
+     catalog = _load_catalog(task.app)
+     if not catalog:
+         return 0.5  # neutral if no catalog
+ 
+     correct = 0
+     total = 0
+     steps = _get_curl_steps(episode)
+ 
+     for step in steps:
+         cp = step.curl_parsed
+         catalog_entry = _find_catalog_entry(cp.path, cp.method, catalog)
+         if not catalog_entry:
+             continue
+ 
+         path_params = catalog_entry.get("path_params", {})
+         body_params = catalog_entry.get("body_params", {})
+ 
+         for param_name, param_meta in path_params.items():
+             total += 1
+             value = _extract_path_param_value(cp.url, param_name)
+             if value and _param_sourced_correctly(value, param_meta, episode, step):
+                 correct += 1
+ 
+         for param_name, param_meta in body_params.items():
+             total += 1
+             value = _extract_body_param_value(cp.body, param_name)
+             if value and _param_sourced_correctly(value, param_meta, episode, step):
+                 correct += 1
+ 
+     if total == 0:
+         return 0.5
+     return correct / total
+ 
+ 
+ def _extract_path_param_value(url: str, param_name: str) -> str | None:
+     """Best-effort path param extraction."""
+     # Just extract last non-empty path segment as a value
+     from urllib.parse import urlparse
+     path = urlparse(url).path
+     segments = [s for s in path.split("/") if s]
+     if segments:
+         return segments[-1]
+     return None
+ 
+ 
+ def _extract_body_param_value(body: Any, param_name: str) -> Any:
+     """Extract a named param from request body."""
+     if body is None:
+         return None
+     if isinstance(body, dict):
+         if param_name in body:
+             return body[param_name]
+         # Search nested
+         for v in body.values():
+             if isinstance(v, dict):
+                 result = _extract_body_param_value(v, param_name)
+                 if result is not None:
+                     return result
+     if isinstance(body, str):
+         # Form-encoded: key=value&...
+         for pair in body.split("&"):
+             if "=" in pair:
+                 k, _, v = pair.partition("=")
+                 if k.strip() == param_name:
+                     return v.strip()
+     return None
+ 
+ 
+ def _param_sourced_correctly(value: Any, param_meta: dict,
+                              episode: Episode, step: Step) -> bool:
+     source = param_meta.get("source", "")
+     value_str = str(value)
+ 
+     if source == "TASK_SPEC":
+         return value_str in episode.task.description
+ 
+     elif source == "PREV_CALL":
+         from_endpoint = param_meta.get("from_endpoint", "")
+         from_field = param_meta.get("from_field", "")
+         for prior_step in episode.steps:
+             if prior_step.step_num >= step.step_num:
+                 break
+             if prior_step.curl_parsed:
+                 ps = prior_step.curl_parsed
+                 if _path_matches(ps.path, from_endpoint):
+                     extracted = _extract_field(ps.response_body, from_field)
+                     if str(extracted) == value_str:
+                         return True
+         return False
+ 
+     elif source == "AUTH_FLOW":
+         return value_str in str(episode.session_state.values())
+ 
+     elif source == "STATIC":
+         expected = str(param_meta.get("value", ""))
+         return value_str == expected
+ 
+     elif source == "DERIVED":
+         # Simplified: check if it appeared anywhere in session state
+         return value_str in str(episode.session_state.values())
+ 
+     return True  # unknown source type — don't penalize
+ 
+ 
+ # ---------------------------------------------------------------------------
+ # Main judge entry point
+ # ---------------------------------------------------------------------------
+ 
+ _GRADERS = {
+     1: grade_template_1,
+     2: grade_template_2,
+     3: grade_template_3,
+     4: grade_template_4,
+     5: grade_template_5,
+     6: grade_template_6,
+     7: grade_template_7,
+ }
+ 
+ 
+ def evaluate(episode: Episode) -> EpisodeResult:
+     """
+     Evaluate a completed episode and return reward + diagnostics.
+ 
+     Args:
+         episode: Completed episode with all steps recorded.
+ 
+     Returns:
+         EpisodeResult with task_score, parameter_sourcing_score, reward, details.
+     """
+     task = episode.task
+     template_id = task.template_id
+ 
+     grader = _GRADERS.get(template_id)
+     if grader is None:
+         return EpisodeResult(
+             task_score=0.0,
+             parameter_sourcing_score=0.0,
+             auth_obtained=False,
+             reward=-1.5,
+             details={"error": f"Unknown template_id: {template_id}"},
+         )
+ 
+     task_score = grader(episode, task)
+     param_score = verify_parameter_sourcing(episode, task)
+     auth_obtained = _check_forum_auth(episode) or bool(_extract_admin_token(episode))
+ 
+     # Compute reward
+     reward = _score_to_reward(task_score, template_id)
+ 
+     # Bonus for auth obtained even on task failure
+     if task_score < 0.5 and auth_obtained:
+         reward = max(reward, AUTH_BONUS)
+ 
+     return EpisodeResult(
+         task_score=task_score,
+         parameter_sourcing_score=param_score,
+         auth_obtained=auth_obtained,
+         reward=reward,
+         details={
+             "template_id": template_id,
+             "difficulty": task.difficulty,
+             "task_score": task_score,
+             "param_score": param_score,
+             "terminated_by": episode.terminated_by,
+             "total_steps": episode.total_steps,
+         },
+     )
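Catalog matching above hinges on `_path_pattern_match`, which rewrites `{param}` segments into single-segment regex wildcards. The same translation, shown standalone with a usage check (logic copied from the function above; the SKU path is an illustrative example):

```python
import re

def path_pattern_match(actual_path: str, catalog_path: str) -> bool:
    """Treat {param} segments in the catalog path as single-segment wildcards."""
    pattern = re.escape(catalog_path)
    # re.escape also escapes braces; un-escape them so the placeholder survives
    pattern = pattern.replace(r"\{", "{").replace(r"\}", "}")
    # Each {param} matches exactly one path segment (no slashes)
    pattern = re.sub(r"\{[^}]+\}", "[^/]+", pattern)
    return bool(re.match(f"^{pattern}$", actual_path, re.IGNORECASE))

path_pattern_match("/rest/V1/products/MH01", "/rest/V1/products/{sku}")
```

Because `[^/]+` cannot cross a slash and the pattern is anchored at both ends, `/rest/V1/products/MH01/media` does not match `/rest/V1/products/{sku}`.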
server/models.py ADDED
@@ -0,0 +1,517 @@
+ """
+ HARvestGym Environment — core OpenEnv models.py
+ 
+ Implements the OpenEnv spec:
+ - Observation, Action, Reward as Pydantic models
+ - reset() → initial observation + clean state
+ - step(action) → (observation, reward, done, info)
+ - state() → current state snapshot
+ 
+ The environment manages episode state, dispatches tool calls, computes per-step
+ rewards, and invokes the judge at episode end.
+ """
+ 
+ from __future__ import annotations
+ 
+ import json
+ import os
+ import random
+ from pathlib import Path
+ from typing import Any
+ from uuid import uuid4
+ 
+ from openenv.core.env_server.interfaces import Environment
+ from openenv.core.env_server.types import Action as BaseAction, Observation as BaseObservation, State
+ from pydantic import Field
+ 
+ # ---------------------------------------------------------------------------
+ # Pydantic models
+ # ---------------------------------------------------------------------------
+ 
+ 
+ class HarvestGymObservation(BaseObservation):
+     """What the RL agent sees at each step."""
+ 
+     task: str = Field(default="", description="Natural language task description")
+     app_base_url: str = Field(default="", description="Root URL of the target application")
+     last_tool_result: Any = Field(default=None, description="Result of last tool call")
+     history: list[dict] = Field(default_factory=list, description="Full episode trajectory")
+     session_state: dict = Field(default_factory=dict, description="Auto-managed cookies/tokens")
+     step_count: int = Field(default=0)
+     max_steps: int = Field(default=20)
+     available_tools: list[str] = Field(
+         default_factory=lambda: [
+             "browser_agent(task, url)",
+             "search_endpoints(query)",
+             "curl_exec(command)",
+             "search_episode_data(query)",
+             "done(result?)",
+         ]
+     )
+ 
+ 
+ class HarvestGymAction(BaseAction):
+     """One tool call from the RL agent."""
+ 
+     tool: str = Field(..., description="Tool name: browser_agent|search_endpoints|curl_exec|search_episode_data|done")
+     args: dict = Field(default_factory=dict, description="Tool-specific arguments")
+ 
+ 
+ class HarvestGymReward(BaseObservation):
+     """Reward signal (returned as part of the observation)."""
+ 
+     value: float = Field(default=0.0, description="Scalar reward for this step")
+     breakdown: dict = Field(default_factory=dict, description="Per-signal reward components")
+ 
+ 
+ # ---------------------------------------------------------------------------
+ # Per-step reward constants
+ # ---------------------------------------------------------------------------
+ 
+ REWARD_VALID_API_CALL = 0.2         # curl_exec returns 2xx
+ REWARD_NEW_PATH = 0.1               # curl path not seen before this episode
+ REWARD_CORRECT_PARAM = 0.25         # judge: correct parameter sourcing (applied at end)
+ REWARD_SESSION_VALUE = 0.1          # auth token/cookie correctly used
+ PENALTY_REPEATED_CALL = -0.15       # exact duplicate curl command
+ PENALTY_BROWSER_AGENT_AGAIN = -0.3  # browser_agent called after step 1
+ PENALTY_MALFORMED_CURL = -0.1       # curl can't be parsed/executed
+ PENALTY_4XX = -0.05                 # recoverable HTTP error
+ 
+ MAX_STEPS = 20
+ 
+ # ---------------------------------------------------------------------------
+ # Task templates
+ # ---------------------------------------------------------------------------
+ 
+ TEMPLATE_META = {
+     1: {"tier": "easy", "app": "shopping", "base_url_port": 7770},
+     2: {"tier": "easy", "app": "wikipedia", "base_url_port": 8888},
+     3: {"tier": "medium", "app": "shopping", "base_url_port": 7770},
+     4: {"tier": "medium", "app": "forum", "base_url_port": 9999},
+     5: {"tier": "hard", "app": "forum", "base_url_port": 9999},
+     6: {"tier": "hard", "app": "shopping", "base_url_port": 7770},
+     7: {"tier": "hard", "app": "shopping_admin", "base_url_port": 7780},
+ }
97
+
98
+ EC2_HOST = os.environ.get("EC2_HOST", "ec2-16-59-2-56.us-east-2.compute.amazonaws.com")
+
+ TASK_NAME_TO_TEMPLATE = {
+     "har_classify_easy": 1,
+     "har_classify_medium": 3,
+     "har_pipeline_hard": 6,
+ }
+
+ TEMPLATE_DESCRIPTIONS = {
+     1: "List products in category {category_name}",
+     2: "Retrieve the Wikipedia article for '{title}'",
+     3: "Add '{product_name}' to a guest cart",
+     4: "Retrieve all posts in the '{forum_category}' forum (you must log in first)",
+     5: "Create a forum post titled '{title}' in the '{category}' forum",
+     6: "Complete a guest checkout for '{product_name}'",
+     7: "Create a new product in the admin panel with SKU '{sku}' and price {price}",
+ }
+
+
+ def _load_parameter_pools() -> dict:
+     pools_path = Path(__file__).parent.parent / "parameter_pools.json"
+     if pools_path.exists():
+         with open(pools_path) as f:
+             return json.load(f)
+     return {}
+
+
+ def _sample_task(template_id: int, parameter_pools: dict) -> tuple[str, dict, str]:
+     """
+     Sample a task instance from the parameter pool.
+
+     Returns: (task_description, params_dict, app_base_url)
+     """
+     meta = TEMPLATE_META[template_id]
+     pool_key = f"template_{template_id}"
+     pool_data = parameter_pools.get(pool_key, {})
+     pool = pool_data.get("pool", {})
+
+     params: dict = {}
+
+     if template_id == 1:
+         items = pool.get("category_name", [{"name": "Gear", "category_id": 3}])
+         chosen = random.choice(items)
+         params = {"category_name": chosen["name"], "category_id": chosen.get("category_id")}
+         description = TEMPLATE_DESCRIPTIONS[1].format(**params)
+
+     elif template_id == 2:
+         items = pool.get("title", [{"title": "Python (programming language)", "expected_slug": "Python_(programming_language)"}])
+         if not items:
+             items = [{"title": "Python (programming language)", "expected_slug": "Python_(programming_language)"}]
+         chosen = random.choice(items)
+         title = chosen.get("title", chosen) if isinstance(chosen, dict) else chosen
+         # Guard: pool entries may be plain strings, which have no .get()
+         expected_slug = chosen.get("expected_slug", title.replace(" ", "_")) if isinstance(chosen, dict) else title.replace(" ", "_")
+         params = {"title": title, "expected_slug": expected_slug}
+         description = TEMPLATE_DESCRIPTIONS[2].format(**params)
+
+     elif template_id == 3:
+         items = pool.get("product_name", [{"name": "Radiant Tee", "sku": "MH01"}])
+         if not items:
+             items = [{"name": "Radiant Tee", "sku": "MH01"}]
+         chosen = random.choice(items)
+         product_name = chosen.get("name", chosen) if isinstance(chosen, dict) else chosen
+         sku = chosen.get("sku", "") if isinstance(chosen, dict) else ""
+         params = {"product_name": product_name, "sku": sku}
+         description = TEMPLATE_DESCRIPTIONS[3].format(**params)
+
+     elif template_id == 4:
+         items = pool.get("forum_category", [{"slug": "general", "name": "General"}])
+         if not items:
+             items = [{"slug": "general", "name": "General"}]
+         chosen = random.choice(items)
+         forum_cat = chosen.get("slug", chosen.get("name", "general")) if isinstance(chosen, dict) else chosen
+         params = {"forum_category": forum_cat}
+         description = TEMPLATE_DESCRIPTIONS[4].format(**params)
+
+     elif template_id == 5:
+         categories = pool.get("forum_category", [{"slug": "general"}])
+         titles = pool.get("post_title", ["Testing the API agent framework"])
+         if not categories:
+             categories = [{"slug": "general"}]
+         if not titles:
+             titles = ["Testing the API agent framework"]
+         chosen_cat = random.choice(categories)
+         chosen_title = random.choice(titles) if isinstance(titles[0], str) else random.choice(titles).get("title", "Test post")
+         forum_cat = chosen_cat.get("slug", "general") if isinstance(chosen_cat, dict) else chosen_cat
+         params = {"title": chosen_title, "category": forum_cat}
+         description = TEMPLATE_DESCRIPTIONS[5].format(**params)
+
+     elif template_id == 6:
+         items = pool.get("product_name", [{"name": "Radiant Tee", "sku": "MH01"}])
+         if not items:
+             items = [{"name": "Radiant Tee", "sku": "MH01"}]
+         chosen = random.choice(items)
+         product_name = chosen.get("name", chosen) if isinstance(chosen, dict) else chosen
+         sku = chosen.get("sku", "") if isinstance(chosen, dict) else ""
+         params = {"product_name": product_name, "sku": sku}
+         description = TEMPLATE_DESCRIPTIONS[6].format(**params)
+
+     elif template_id == 7:
+         items = pool.get("admin_sku", [{"sku": "HAR-TEST-001", "price": "29.99"}])
+         if not items:
+             items = [{"sku": "HAR-TEST-001", "price": "29.99"}]
+         chosen = random.choice(items)
+         sku = chosen.get("sku", "HAR-TEST-001") if isinstance(chosen, dict) else chosen
+         price = str(chosen.get("price", "29.99")) if isinstance(chosen, dict) else "29.99"
+         params = {"sku": sku, "price": price}
+         description = TEMPLATE_DESCRIPTIONS[7].format(**params)
+
+     else:
+         params = {}
+         description = f"Template {template_id}"
+
+     port = meta["base_url_port"]
+     base_url = f"http://{EC2_HOST}:{port}/"
+     return description, params, base_url
+
+
214
+ # ---------------------------------------------------------------------------
215
+ # Environment
216
+ # ---------------------------------------------------------------------------
217
+
218
+ class HARvestGymEnvironment(Environment):
219
+ """
220
+ HARvestGym: RL environment for training API-native web agents.
221
+
222
+ The agent must discover and execute the correct sequence of HTTP API calls
223
+ to complete real-world tasks on live web applications — starting from only
224
+ a task description and a URL, with no prior knowledge of the API schema.
225
+ """
226
+
227
+ SUPPORTS_CONCURRENT_SESSIONS: bool = True
228
+
229
+ def __init__(self):
230
+ self._state = State(episode_id=str(uuid4()), step_count=0)
231
+ self._parameter_pools = _load_parameter_pools()
232
+ self._current_task = None # Task dataclass
233
+ self._episode = None # Episode dataclass
234
+ self._session_state: dict = {}
235
+ self._episode_store: dict = {} # embeddings, BM25 corpus, etc.
236
+ self._called_paths: set = set() # for new-path reward
237
+ self._last_curl_commands: list = [] # for duplicate detection
238
+ self._step_rewards: list[float] = []
239
+ self._done = False
240
+
241
+ # Determine default template from env var
242
+ self._task_name = os.environ.get("HARVGYM_TASK", "har_classify_easy")
243
+
244
+ def _get_template_id(self) -> int:
245
+ """Resolve task name or template ID from env var."""
246
+ task_name = self._task_name
247
+ if task_name in TASK_NAME_TO_TEMPLATE:
248
+ return TASK_NAME_TO_TEMPLATE[task_name]
249
+ # Try integer
250
+ try:
251
+ tid = int(task_name)
252
+ if 1 <= tid <= 7:
253
+ return tid
254
+ except (ValueError, TypeError):
255
+ pass
256
+ return 1 # default: easy
257
+
258
+ def reset(self) -> HarvestGymObservation:
259
+ """Reset environment: clear episode state, sample new task."""
260
+ from .episode import Episode, Task
261
+
262
+ template_id = self._get_template_id()
263
+ description, params, base_url = _sample_task(template_id, self._parameter_pools)
264
+
265
+ meta = TEMPLATE_META[template_id]
266
+ self._current_task = Task(
267
+ template_id=template_id,
268
+ description=description,
269
+ params=params,
270
+ app=meta["app"],
271
+ base_url=base_url,
272
+ difficulty=meta["tier"],
273
+ )
274
+
275
+ self._episode = Episode(task=self._current_task)
276
+ self._session_state = {}
277
+ self._episode_store = {}
278
+ self._called_paths = set()
279
+ self._last_curl_commands = []
280
+ self._step_rewards = []
281
+ self._done = False
282
+ self._state = State(episode_id=str(uuid4()), step_count=0)
283
+
284
+ return HarvestGymObservation(
285
+ task=description,
286
+ app_base_url=base_url,
287
+ last_tool_result=None,
288
+ history=[],
289
+ session_state={},
290
+ step_count=0,
291
+ max_steps=MAX_STEPS,
292
+ done=False,
293
+ reward=0.0,
294
+ metadata={
295
+ "template_id": template_id,
296
+ "difficulty": meta["tier"],
297
+ "app": meta["app"],
298
+ },
299
+ )
300
+
301
+     def step(self, action: HarvestGymAction) -> HarvestGymObservation:  # type: ignore[override]
+         """Execute one tool call and return the next observation."""
+         from .episode import Step, CurlCall
+
+         if self._done:
+             # Episode already finished
+             return self._make_obs(
+                 last_tool_result={"error": "Episode already done. Call reset()."},
+                 reward=0.0,
+                 done=True,
+             )
+
+         self._state.step_count += 1
+         step_num = self._state.step_count
+
+         tool = action.tool.lower().strip()
+         args = action.args or {}
+
+         # Dispatch tool
+         result, step_reward, done = self._dispatch_tool(tool, args, step_num)
+
+         # Record step in episode
+         step_obj = Step(
+             step_num=step_num,
+             tool=tool,
+             action=f"{tool}({json.dumps(args)})",
+             result=result,
+         )
+
+         # If curl_exec, parse the curl call for the judge
+         if tool == "curl_exec":
+             command = args.get("command", "")
+             try:
+                 from urllib.parse import urlparse
+
+                 from .tools.browser_agent import _normalise_path
+                 from .tools.curl_exec import parse_curl_command
+
+                 parsed = parse_curl_command(command)
+                 path = urlparse(parsed["url"]).path if parsed["url"] else ""
+                 norm_path = _normalise_path(path)
+
+                 resp = result if isinstance(result, dict) else {}
+                 step_obj.curl_parsed = CurlCall(
+                     method=parsed["method"],
+                     url=parsed["url"] or "",
+                     path=norm_path,
+                     headers=parsed["headers"],
+                     body=parsed["body"],
+                     status_code=resp.get("status_code", 0),
+                     response_body=resp.get("body"),
+                     response_headers=resp.get("headers", {}),
+                 )
+             except Exception:
+                 pass
+
+         if self._episode:
+             self._episode.steps.append(step_obj)
+             self._episode.total_steps = step_num
+
+         self._step_rewards.append(step_reward)
+
+         # Check max steps
+         if step_num >= MAX_STEPS and not done:
+             done = True
+             if self._episode:
+                 self._episode.terminated_by = "max_steps"
+             # Invoke judge
+             judge_reward = self._invoke_judge()
+             step_reward += judge_reward
+
+         if done and self._episode and not self._episode.terminated_by:
+             self._episode.terminated_by = "done_call"
+
+         self._done = done
+
+         # Build history entry
+         history_entry = {
+             "step": step_num,
+             "tool": tool,
+             "args": args,
+             "result": result,
+             "reward": step_reward,
+         }
+         if self._episode:
+             history_for_obs = [
+                 {"step": s.step_num, "tool": s.tool, "result": s.result}
+                 for s in self._episode.steps
+             ]
+         else:
+             history_for_obs = [history_entry]
+
+         return HarvestGymObservation(
+             task=self._current_task.description if self._current_task else "",
+             app_base_url=self._current_task.base_url if self._current_task else "",
+             last_tool_result=result,
+             history=history_for_obs,
+             session_state=dict(self._session_state),
+             step_count=step_num,
+             max_steps=MAX_STEPS,
+             done=done,
+             reward=step_reward,
+             metadata={
+                 "step": step_num,
+                 "tool": tool,
+                 "step_reward": step_reward,
+             },
+         )
+
+     def _dispatch_tool(self, tool: str, args: dict, step_num: int) -> tuple[Any, float, bool]:
+         """
+         Dispatch to the correct tool. Returns (result, step_reward, done).
+         """
+         reward = 0.0
+         done = False
+
+         if tool == "browser_agent":
+             task = args.get("task", self._current_task.description if self._current_task else "")
+             url = args.get("url", self._current_task.base_url if self._current_task else "")
+
+             # Penalty if called after step 1
+             if step_num > 1:
+                 reward += PENALTY_BROWSER_AGENT_AGAIN
+
+             from .tools.browser_agent import run_browser_agent
+             result = run_browser_agent(task, url, episode_store=self._episode_store)
+
+         elif tool == "search_endpoints":
+             query = args.get("query", "")
+             from .tools.search_endpoints import search_endpoints
+             result = search_endpoints(query, self._episode_store)
+
+         elif tool == "curl_exec":
+             command = args.get("command", "")
+             if not command:
+                 return {"error": "curl_exec requires 'command' argument"}, PENALTY_MALFORMED_CURL, False
+
+             # Duplicate detection
+             if command in self._last_curl_commands:
+                 reward += PENALTY_REPEATED_CALL
+             self._last_curl_commands.append(command)
+
+             from .tools.curl_exec import curl_exec
+             result = curl_exec(
+                 command=command,
+                 session_state=self._session_state,
+                 episode_store=self._episode_store,
+                 app_base_url=self._current_task.base_url if self._current_task else "",
+             )
+
+             status = result.get("status_code", 0)
+             if status == -1 or "error" in result:
+                 reward += PENALTY_MALFORMED_CURL
+             elif 200 <= status < 300:
+                 reward += REWARD_VALID_API_CALL
+                 # New path bonus
+                 import shlex
+                 from urllib.parse import urlparse
+
+                 from .tools.browser_agent import _normalise_path
+                 try:
+                     for t in shlex.split(command):
+                         if t.startswith("http"):
+                             path = _normalise_path(urlparse(t.strip("'\"")).path)
+                             if path and path not in self._called_paths:
+                                 self._called_paths.add(path)
+                                 reward += REWARD_NEW_PATH
+                             break
+                 except Exception:
+                     pass
+             elif 400 <= status < 500:
+                 reward += PENALTY_4XX
+
+         elif tool == "search_episode_data":
+             query = args.get("query", "")
+             from .tools.search_episode_data import search_episode_data
+             result = search_episode_data(query, self._episode_store)
+
+         elif tool == "done":
+             result_str = args.get("result", "")
+             result = {"status": "done", "result": result_str}
+             done = True
+             # Invoke judge for final reward
+             judge_reward = self._invoke_judge()
+             reward += judge_reward
+
+         else:
+             result = {"error": f"Unknown tool: {tool}. Available: browser_agent, search_endpoints, curl_exec, search_episode_data, done"}
+             reward += PENALTY_MALFORMED_CURL
+
+         return result, reward, done
+
+     def _invoke_judge(self) -> float:
+         """Run the judge on the completed episode and return terminal reward."""
+         if self._episode is None or self._current_task is None:
+             return -1.5
+         try:
+             from .judge import evaluate
+             episode_result = evaluate(self._episode)
+             return episode_result.reward
+         except Exception as e:
+             print(f"[HARvestGym] Judge error: {e}", flush=True)
+             return -1.5
+
+     def _make_obs(self, last_tool_result: Any, reward: float, done: bool) -> HarvestGymObservation:
+         return HarvestGymObservation(
+             task=self._current_task.description if self._current_task else "",
+             app_base_url=self._current_task.base_url if self._current_task else "",
+             last_tool_result=last_tool_result,
+             history=[],
+             session_state=dict(self._session_state),
+             step_count=self._state.step_count,
+             max_steps=MAX_STEPS,
+             done=done,
+             reward=reward,
+         )
+
+     @property
+     def state(self) -> State:
+         return self._state
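For orientation, the per-step shaping applied in `_dispatch_tool` can be condensed into a small standalone function. This is an illustrative sketch of the `curl_exec` reward path only (duplicate-command penalty, 2xx bonus, new-path bonus, 4xx penalty); `curl_step_reward` and its signature are hypothetical names, not part of the environment:

```python
# Illustrative sketch (not part of the environment): how the per-step reward
# constants combine for a single curl_exec call.
REWARD_VALID_API_CALL = 0.2
REWARD_NEW_PATH = 0.1
PENALTY_REPEATED_CALL = -0.15
PENALTY_4XX = -0.05

def curl_step_reward(status: int, path: str, seen_paths: set,
                     prior_commands: list, command: str) -> float:
    reward = 0.0
    if command in prior_commands:    # exact duplicate command
        reward += PENALTY_REPEATED_CALL
    if 200 <= status < 300:          # valid API call
        reward += REWARD_VALID_API_CALL
        if path not in seen_paths:   # first success on this path this episode
            seen_paths.add(path)
            reward += REWARD_NEW_PATH
    elif 400 <= status < 500:        # recoverable client error
        reward += PENALTY_4XX
    return reward

seen: set = set()
assert abs(curl_step_reward(200, "/rest/V1/products", seen, [], "curl A") - 0.3) < 1e-9
assert abs(curl_step_reward(200, "/rest/V1/products", seen, ["curl A"], "curl A") - 0.05) < 1e-9
assert abs(curl_step_reward(404, "/nope", seen, [], "curl B") - (-0.05)) < 1e-9
```

Note that malformed-curl and repeated-browser_agent penalties, and the terminal judge reward, sit outside this sketch.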
server/tools/__init__.py ADDED
File without changes
server/tools/browser_agent.py ADDED
@@ -0,0 +1,418 @@
+ """
+ browser_agent tool: HAR-based API surface discovery.
+
+ At step 1, loads a pre-recorded HAR file for the target application,
+ extracts an OpenAPI-like spec, and builds GEMMA embeddings for search_endpoints().
+ Falls back to all-MiniLM-L6-v2 if google/embeddinggemma-300m is unavailable.
+ """
+
+ from __future__ import annotations
+
+ import json
+ import os
+ import re
+ from pathlib import Path
+ from typing import Any
+ from urllib.parse import urlparse
+
+ import numpy as np
+
+ # ---------------------------------------------------------------------------
+ # HAR path resolution
+ # ---------------------------------------------------------------------------
+
+ HARS_DIR = Path(__file__).parent.parent.parent / "hars"
+ CATALOGS_DIR = Path(__file__).parent.parent.parent / "catalogs"
+
+ HAR_MAP: dict[str, str] = {
+     ":7770": "shopping.har",
+     ":7780": "shopping_admin.har",
+     ":9999": "forum.har",
+     ":3000": "osm.har",
+     ":8888": "wikipedia.har",
+ }
+
+ APP_NAME_MAP: dict[str, str] = {
+     ":7770": "shopping",
+     ":7780": "shopping_admin",
+     ":9999": "forum",
+     ":3000": "osm",
+     ":8888": "wikipedia",
+ }
+
+ # Static asset patterns to skip
+ _STATIC_RE = re.compile(
+     r"\.(js|css|png|jpg|jpeg|gif|ico|svg|woff|woff2|ttf|eot|map|webp|avif|otf)(\?|$)",
+     re.IGNORECASE,
+ )
+ _ANALYTICS_HOSTS = {"google-analytics.com", "doubleclick.net", "googletagmanager.com",
+                     "cdn.jsdelivr.net", "cdnjs.cloudflare.com"}
+
+ # ID normalisation patterns
+ _ID_PATTERNS = [
+     (re.compile(r"/[0-9a-f]{32,}(?=/|$)"), "/{id}"),  # Magento cart IDs
+     (re.compile(r"/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}(?=/|$)"), "/{id}"),  # UUIDs
+     (re.compile(r"/\d+(?=/|$)"), "/{id}"),  # numeric IDs
+ ]
+
+
+ def _is_static_asset(url: str) -> bool:
+     parsed = urlparse(url)
+     if _STATIC_RE.search(parsed.path):
+         return True
+     if parsed.netloc in _ANALYTICS_HOSTS:
+         return True
+     return False
+
+
+ def _normalise_path(path: str) -> str:
+     for pattern, replacement in _ID_PATTERNS:
+         path = pattern.sub(replacement, path)
+     return path
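To make the `_ID_PATTERNS` rules concrete: concrete resource IDs in observed paths are rewritten to a shared `/{id}` template so repeated calls dedupe to one endpoint. A standalone sketch (the rules are duplicated here so the snippet runs on its own; `normalise_path` is an illustrative name):

```python
import re

# Same rules as _ID_PATTERNS above, in application order.
ID_PATTERNS = [
    (re.compile(r"/[0-9a-f]{32,}(?=/|$)"), "/{id}"),   # long hex tokens (e.g. cart IDs)
    (re.compile(r"/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}(?=/|$)"), "/{id}"),  # UUIDs
    (re.compile(r"/\d+(?=/|$)"), "/{id}"),             # numeric IDs
]

def normalise_path(path: str) -> str:
    for pattern, replacement in ID_PATTERNS:
        path = pattern.sub(replacement, path)
    return path

print(normalise_path("/rest/V1/carts/123/items"))            # /rest/V1/carts/{id}/items
print(normalise_path("/rest/V1/guest-carts/" + "ab12" * 8))  # /rest/V1/guest-carts/{id}
```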
+
+
+ def _get_content_type(entry: dict, which: str) -> str:
+     """Extract Content-Type from request or response headers."""
+     headers_key = "request" if which == "request" else "response"
+     obj = entry.get(headers_key, {})
+     for h in obj.get("headers", []):
+         if h.get("name", "").lower() == "content-type":
+             return h.get("value", "").lower()
+     if which == "response":
+         ct = obj.get("content", {}).get("mimeType", "")
+         return ct.lower()
+     return ""
+
+
+ def _extract_body(req: dict) -> Any:
+     post_data = req.get("postData", {})
+     if not post_data:
+         return None
+     text = post_data.get("text", "")
+     if not text:
+         return None
+     try:
+         return json.loads(text)
+     except Exception:
+         return text[:200] if text else None
+
+
+ def _truncate_response_sample(resp: dict) -> Any:
+     content = resp.get("content", {})
+     text = content.get("text", "")
+     if not text:
+         return None
+     try:
+         parsed = json.loads(text)
+         if isinstance(parsed, list) and len(parsed) > 2:
+             return parsed[:2]
+         if isinstance(parsed, dict):
+             # truncate large arrays in the response
+             truncated = {}
+             for k, v in parsed.items():
+                 if isinstance(v, list) and len(v) > 2:
+                     truncated[k] = v[:2]
+                 else:
+                     truncated[k] = v
+             return truncated
+         return parsed
+     except Exception:
+         return text[:300] if text else None
+
+
+ def extract_openapi_spec(har_data: dict, app_base_url: str) -> list[dict]:
+     """Extract an OpenAPI-like spec from HAR data."""
+     entries = har_data.get("log", {}).get("entries", [])
+     seen: set[str] = set()
+     spec_entries = []
+
+     for entry in entries:
+         req = entry.get("request", {})
+         resp = entry.get("response", {})
+         raw_url = req.get("url", "")
+         method = req.get("method", "GET").upper()
+
+         if not raw_url:
+             continue
+         if _is_static_asset(raw_url):
+             continue
+
+         resp_ct = _get_content_type(entry, "response")
+         req_ct = _get_content_type(entry, "request")
+
+         parsed_url = urlparse(raw_url)
+         path = parsed_url.path
+
+         # Skip plain HTML page loads (GET returning text/html), but keep
+         # API paths, admin paths, and JSON responses. Non-GET methods are
+         # never caught by this filter.
+         is_html_get = "text/html" in resp_ct and method == "GET"
+         has_api_path = any(x in path for x in ["/rest/", "/api/", "/ajax/", "/mui/", ".json"])
+         is_admin_path = "/admin/" in path or "/rest/V1/" in path
+         has_json_response = "json" in resp_ct
+
+         if is_html_get and not has_api_path and not is_admin_path and not has_json_response:
+             continue
+
+         path_norm = _normalise_path(path)
+         key = f"{method} {path_norm}"
+         if key in seen:
+             continue
+         seen.add(key)
+
+         has_auth = any(
+             h.get("name", "").lower() in ("authorization", "x-api-key", "cookie")
+             for h in req.get("headers", [])
+         )
+
+         spec_entries.append({
+             "method": method,
+             "path": path_norm,
+             "query_params": parsed_url.query or None,
+             "request_body": _extract_body(req),
+             "status_code": resp.get("status", 0),
+             "response_content_type": resp_ct,
+             "response_body_sample": _truncate_response_sample(resp),
+             "auth_observed": has_auth,
+         })
+
+     return spec_entries
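To see the extraction loop end to end, here is a stripped-down, self-contained sketch that applies only the numeric-ID normalisation and dedup-by-`METHOD path` steps to a toy HAR (`extract_endpoints` and `toy_har` are illustrative names; the real function above also filters static assets, classifies content types, and samples bodies):

```python
import re
from urllib.parse import urlparse

_NUM_ID = re.compile(r"/\d+(?=/|$)")

def extract_endpoints(har: dict) -> list[dict]:
    """Collapse HAR entries into unique (method, templated path) endpoints."""
    seen, out = set(), []
    for entry in har.get("log", {}).get("entries", []):
        req, resp = entry.get("request", {}), entry.get("response", {})
        url = req.get("url", "")
        if not url:
            continue
        method = req.get("method", "GET").upper()
        path = _NUM_ID.sub("/{id}", urlparse(url).path)  # collapse numeric IDs
        key = f"{method} {path}"
        if key in seen:
            continue
        seen.add(key)
        out.append({"method": method, "path": path, "status": resp.get("status", 0)})
    return out

toy_har = {"log": {"entries": [
    {"request": {"method": "GET", "url": "http://shop.test/rest/V1/products/42"},
     "response": {"status": 200}},
    {"request": {"method": "GET", "url": "http://shop.test/rest/V1/products/99"},
     "response": {"status": 200}},
]}}
# Both entries collapse to one templated endpoint:
# [{"method": "GET", "path": "/rest/V1/products/{id}", "status": 200}]
```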
+
+
+ def catalog_to_spec_entries(app_name: str) -> list[dict]:
+     """Load the ground-truth catalog as spec entries when the HAR doesn't yield results."""
+     catalog_path = CATALOGS_DIR / f"{app_name}.json"
+     if not catalog_path.exists():
+         return []
+     try:
+         with open(catalog_path) as f:
+             data = json.load(f)
+         endpoints = data if isinstance(data, list) else data.get("endpoints", [])
+         spec_entries = []
+         for ep in endpoints:
+             # Handle "endpoint": "POST /rest/V1/..." format
+             endpoint_str = ep.get("endpoint", "")
+             if endpoint_str and " " in endpoint_str:
+                 parts = endpoint_str.split(" ", 1)
+                 method = parts[0].upper()
+                 path = parts[1]
+             else:
+                 path = ep.get("path", endpoint_str)
+                 method = ep.get("method", "GET").upper()
+
+             if not path:
+                 continue
+
+             auth = ep.get("auth", ep.get("authentication", "none"))
+             spec_entries.append({
+                 "method": method,
+                 "path": path,
+                 "query_params": None,
+                 "request_body": ep.get("body_params") or ep.get("body"),
+                 "status_code": 200,
+                 "response_content_type": "application/json",
+                 "response_body_sample": ep.get("response_fields") or ep.get("response_sample"),
+                 "auth_observed": auth not in ("none", "None", None, ""),
+             })
+         return spec_entries
+     except Exception as e:
+         print(f"[browser_agent] Failed to load catalog {app_name}: {e}", flush=True)
+         return []
+
+
+ def spec_entry_to_text(entry: dict, app_name: str) -> str:
+     """Convert a spec entry to searchable text for embedding."""
+     parts = [
+         f"app: {app_name}",
+         f"endpoint: {entry['method']} {entry['path']}",
+         f"status: {entry['status_code']}",
+         f"auth: {'required' if entry['auth_observed'] else 'none'}",
+     ]
+     if entry.get("query_params"):
+         parts.append(f"query: {entry['query_params']}")
+     if entry.get("request_body"):
+         body_str = json.dumps(entry["request_body"])[:300] if not isinstance(entry["request_body"], str) else entry["request_body"][:300]
+         parts.append(f"body: {body_str}")
+     if entry.get("response_body_sample") is not None:
+         resp_str = json.dumps(entry["response_body_sample"])[:300] if not isinstance(entry["response_body_sample"], str) else str(entry["response_body_sample"])[:300]
+         parts.append(f"response_sample: {resp_str}")
+     return " | ".join(parts)
+
+
+ # ---------------------------------------------------------------------------
+ # Embedding model (lazy load)
+ # ---------------------------------------------------------------------------
+
+ _embedding_model = None
+ _embedding_model_name = None
+
+
+ def _get_embedding_model():
+     global _embedding_model, _embedding_model_name
+     if _embedding_model is not None:
+         return _embedding_model, _embedding_model_name
+
+     hf_token = os.environ.get("HF_TOKEN")
+
+     # Set a writable cache dir to avoid read-only filesystem errors
+     import tempfile
+     cache_dir = os.environ.get("HF_HOME", os.environ.get("TRANSFORMERS_CACHE",
+                                os.path.join(tempfile.gettempdir(), "hf_cache")))
+     os.makedirs(cache_dir, exist_ok=True)
+     os.environ.setdefault("HF_HOME", cache_dir)
+     os.environ.setdefault("TRANSFORMERS_CACHE", cache_dir)
+     os.environ.setdefault("SENTENCE_TRANSFORMERS_HOME", cache_dir)
+
+     # Skip embedding if HARVGYM_NO_EMBED is set (for testing/offline use)
+     if os.environ.get("HARVGYM_NO_EMBED"):
+         raise RuntimeError("Embeddings disabled via HARVGYM_NO_EMBED")
+
+     # Try GEMMA first, fall back to MiniLM
+     candidates = [
+         ("google/embeddinggemma-300m", hf_token),
+         ("all-MiniLM-L6-v2", None),
+         ("sentence-transformers/all-MiniLM-L6-v2", None),
+     ]
+
+     for model_name, token in candidates:
+         try:
+             from sentence_transformers import SentenceTransformer
+             kwargs: dict = {"cache_folder": cache_dir}
+             if token:
+                 kwargs["token"] = token
+             model = SentenceTransformer(model_name, **kwargs)
+             _embedding_model = model
+             _embedding_model_name = model_name
+             print(f"[browser_agent] Loaded embedding model: {model_name}", flush=True)
+             return _embedding_model, _embedding_model_name
+         except Exception as e:
+             print(f"[browser_agent] Could not load {model_name}: {type(e).__name__}: {str(e)[:100]}", flush=True)
+
+     raise RuntimeError("No embedding model available. Install sentence-transformers.")
+
+
+ def build_endpoint_embeddings(spec_entries: list[dict], app_name: str):
+     """Build embeddings over spec entries. Returns (embeddings_array, text_chunks)."""
+     model, model_name = _get_embedding_model()
+     chunks = [spec_entry_to_text(e, app_name) for e in spec_entries]
+     if not chunks:
+         return np.array([]), []
+
+     # Use encode_document if available (GEMMA), else plain encode
+     if hasattr(model, "encode_document"):
+         embeddings = model.encode_document(chunks, batch_size=32, show_progress_bar=False)
+     else:
+         embeddings = model.encode(chunks, batch_size=32, show_progress_bar=False)
+
+     if not isinstance(embeddings, np.ndarray):
+         embeddings = np.array(embeddings)
+
+     # Normalise rows for cosine similarity
+     norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
+     norms = np.where(norms == 0, 1, norms)
+     embeddings = embeddings / norms
+
+     return embeddings, chunks
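Because `build_endpoint_embeddings` L2-normalises its rows, a downstream query (as in `search_endpoints`) reduces to a dot product followed by a sort. A minimal sketch, with tiny hypothetical vectors standing in for real model output (`top_k` is an illustrative name, not the tool's API):

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_embeddings: np.ndarray, k: int = 3) -> list[int]:
    """Return indices of the k most cosine-similar rows, best first."""
    q = query_vec / (np.linalg.norm(query_vec) or 1.0)  # normalise the query too
    scores = doc_embeddings @ q           # cosine similarity per document row
    return list(np.argsort(-scores)[:k])  # best-first indices

docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])  # rows already unit-norm
assert top_k(np.array([0.0, 2.0]), docs, k=2) == [1, 2]
```

The `or 1.0` guard avoids dividing by zero for an all-zero query vector, mirroring the zero-norm handling in the normalisation above.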
+
319
+
320
+ # ---------------------------------------------------------------------------
321
+ # Public API
322
+ # ---------------------------------------------------------------------------
323
+
324
+ def run_browser_agent(task: str, url: str, episode_store=None) -> dict:
325
+ """
326
+     Load HAR for the app inferred from URL, extract spec, build embeddings.
+     Returns summary endpoint list.
+
+     episode_store: mutable dict where we store embeddings/spec for search_endpoints().
+     """
+     # Detect app from URL
+     app_name = "unknown"
+     har_filename = None
+     for port_suffix, fname in HAR_MAP.items():
+         if port_suffix in url:
+             har_filename = fname
+             app_name = APP_NAME_MAP[port_suffix]
+             break
+
+     if har_filename is None:
+         # Try to guess from URL path
+         if "shopping" in url.lower() or "7770" in url or "7780" in url:
+             har_filename = "shopping.har"
+             app_name = "shopping"
+         elif "forum" in url.lower() or "9999" in url:
+             har_filename = "forum.har"
+             app_name = "forum"
+         elif "wiki" in url.lower() or "8888" in url:
+             har_filename = "wikipedia.har"
+             app_name = "wikipedia"
+         else:
+             har_filename = "shopping.har"
+             app_name = "shopping"
+
+     har_path = HARS_DIR / har_filename
+     if not har_path.exists():
+         return {
+             "app": app_name,
+             "endpoints": [],
+             "total_endpoints": 0,
+             "note": f"HAR file not found: {har_path}. No endpoints available.",
+             "error": f"Missing HAR: {har_filename}",
+         }
+
+     with open(har_path) as f:
+         har_data = json.load(f)
+
+     spec_entries = extract_openapi_spec(har_data, url)
+
+     # Augment with ground truth catalog if HAR extraction is sparse
+     catalog_entries = catalog_to_spec_entries(app_name)
+     if len(spec_entries) < 5 and catalog_entries:
+         print(f"[browser_agent] HAR yielded {len(spec_entries)} endpoints, augmenting from catalog ({len(catalog_entries)} entries)", flush=True)
+         # Merge: catalog takes priority for proper paths
+         har_paths = {e["path"] for e in spec_entries}
+         for ce in catalog_entries:
+             if ce["path"] not in har_paths:
+                 spec_entries.append(ce)
+     elif catalog_entries:
+         # Augment with any catalog endpoints not found in the HAR
+         har_paths = {e["path"] for e in spec_entries}
+         for ce in catalog_entries:
+             if ce["path"] not in har_paths:
+                 spec_entries.append(ce)
+
+     # Build embeddings and store in episode_store for search_endpoints
+     if spec_entries and episode_store is not None:
+         try:
+             embeddings, chunks = build_endpoint_embeddings(spec_entries, app_name)
+             episode_store["endpoint_embeddings"] = embeddings
+             episode_store["endpoint_chunks"] = chunks
+             episode_store["spec_entries"] = spec_entries
+             episode_store["app_name"] = app_name
+         except Exception as e:
+             print(f"[browser_agent] Embedding build failed: {e}. Storing spec without embeddings.", flush=True)
+             # Store chunks as plain text even without embeddings for keyword fallback
+             chunks = [spec_entry_to_text(e, app_name) for e in spec_entries]
+             episode_store["endpoint_chunks"] = chunks
+             episode_store["endpoint_embeddings"] = None
+             episode_store["spec_entries"] = spec_entries
+             episode_store["app_name"] = app_name
+     elif episode_store is not None:
+         episode_store["spec_entries"] = []
+         episode_store["app_name"] = app_name
+
+     # Return summary only (no schemas)
+     summary_endpoints = [{"method": e["method"], "path": e["path"]} for e in spec_entries]
+
+     return {
+         "app": app_name,
+         "endpoints": summary_endpoints,
+         "total_endpoints": len(summary_endpoints),
+         "note": (
+             "These endpoints were observed for this application. "
+             "Use search_endpoints() with a natural language query to get the full schema, "
+             "parameters, and auth details for any endpoint."
+         ),
+     }
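The URL-to-app fallback above can be sketched as a standalone helper. This is a minimal sketch for illustration only: `guess_app` is a hypothetical name, and the real tool consults `HAR_MAP` before falling back to these keyword/port checks.

```python
def guess_app(url: str) -> str:
    # Mirrors the fallback branch: match a known port or keyword,
    # defaulting to "shopping" when nothing matches.
    u = url.lower()
    if "shopping" in u or "7770" in url or "7780" in url:
        return "shopping"
    if "forum" in u or "9999" in url:
        return "forum"
    if "wiki" in u or "8888" in url:
        return "wikipedia"
    return "shopping"

print(guess_app("http://localhost:9999/f/books"))  # forum
```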
server/tools/curl_exec.py ADDED
@@ -0,0 +1,434 @@
+ """
+ curl_exec tool — execute HTTP calls via subprocess, index responses, return truncated result.
+
+ Parses a curl command string, executes it against the live EC2 server, auto-injects session
+ cookies, indexes the full response into the episode BM25 store, and returns a smart-truncated
+ observation.
+ """
+
+ from __future__ import annotations
+
+ import json
+ import re
+ import shlex
+ import subprocess
+ from typing import Any
+
+ # ---------------------------------------------------------------------------
+ # Truncation constants
+ # ---------------------------------------------------------------------------
+
+ NONJSON_MAX_CHARS = 3000   # HTML / plain-text truncation (raised for CSRF token visibility)
+ ARRAY_PREVIEW_ITEMS = 2    # How many items to show in large arrays
+ ARRAY_LARGE_THRESHOLD = 3  # Arrays >= this size are truncated
+
+
+ # ---------------------------------------------------------------------------
+ # Curl command parser
+ # ---------------------------------------------------------------------------
+
+ def parse_curl_command(command: str) -> dict:
+     """
+     Parse a curl command string into components.
+
+     Returns a dict with keys: method, url, headers, body, data_type.
+     """
+     # Normalize: remove newline continuations
+     command = re.sub(r"\\\s*\n\s*", " ", command)
+
+     try:
+         tokens = shlex.split(command)
+     except ValueError:
+         # Fall back to a simple split if shlex fails
+         tokens = command.split()
+
+     if not tokens or tokens[0] != "curl":
+         raise ValueError(f"Not a curl command: {command[:100]}")
+
+     result: dict = {
+         "method": "GET",
+         "url": None,
+         "headers": {},
+         "body": None,
+         "data_type": None,  # "json" | "form" | None
+     }
+
+     i = 1
+     while i < len(tokens):
+         tok = tokens[i]
+
+         if tok in ("-X", "--request") and i + 1 < len(tokens):
+             result["method"] = tokens[i + 1].upper()
+             i += 2
+
+         elif tok in ("-H", "--header") and i + 1 < len(tokens):
+             header = tokens[i + 1]
+             if ":" in header:
+                 name, _, value = header.partition(":")
+                 result["headers"][name.strip().lower()] = value.strip()
+             i += 2
+
+         elif tok in ("-d", "--data", "--data-raw", "--data-binary") and i + 1 < len(tokens):
+             result["body"] = tokens[i + 1]
+             if result["method"] == "GET":
+                 result["method"] = "POST"
+             i += 2
+
+         elif tok == "--data-urlencode" and i + 1 < len(tokens):
+             # Append to the existing body
+             existing = result.get("body") or ""
+             if existing:
+                 result["body"] = existing + "&" + tokens[i + 1]
+             else:
+                 result["body"] = tokens[i + 1]
+             if result["method"] == "GET":
+                 result["method"] = "POST"
+             i += 2
+
+         elif tok in ("-F", "--form") and i + 1 < len(tokens):
+             existing = result.get("body") or ""
+             if existing:
+                 result["body"] = existing + "&" + tokens[i + 1]
+             else:
+                 result["body"] = tokens[i + 1]
+             if result["method"] == "GET":
+                 result["method"] = "POST"
+             i += 2
+
+         elif tok in ("-u", "--user") and i + 1 < len(tokens):
+             i += 2  # skip basic auth for now
+
+         elif tok in ("-L", "--location", "-s", "--silent", "-v", "--verbose",
+                      "-k", "--insecure", "--compressed", "-g", "--globoff"):
+             i += 1
+
+         elif tok in ("-o", "--output", "--max-time", "--connect-timeout",
+                      "--retry", "-A", "--user-agent", "-e", "--referer"):
+             i += 2  # skip flag + value
+
+         elif not tok.startswith("-") and result["url"] is None:
+             result["url"] = tok.strip("'\"")
+             i += 1
+
+         elif tok.startswith("http"):
+             result["url"] = tok.strip("'\"")
+             i += 1
+
+         else:
+             i += 1
+
+     # Infer data_type from the Content-Type header
+     ct = result["headers"].get("content-type", "")
+     if "application/json" in ct:
+         result["data_type"] = "json"
+     elif "application/x-www-form-urlencoded" in ct or "multipart/form-data" in ct:
+         result["data_type"] = "form"
+     elif result["body"]:
+         # Guess from the body
+         if result["body"].strip().startswith("{") or result["body"].strip().startswith("["):
+             result["data_type"] = "json"
+         else:
+             result["data_type"] = "form"
+
+     return result
+
+
+ # ---------------------------------------------------------------------------
+ # Smart truncation
+ # ---------------------------------------------------------------------------
+
+ def smart_truncate(body_text: str, content_type: str = "") -> Any:
+     """
+     Apply truncation rules to a response body string.
+
+     Rules (first match wins):
+       1. Non-JSON → truncate to NONJSON_MAX_CHARS
+       2. JSON primitive (str/int/bool/null) → never truncate
+       3. Error (detected by content) → never truncate
+       4. JSON object/array with no large arrays → return as-is
+       5. JSON with a large array → keep first ARRAY_PREVIEW_ITEMS, add a _list_truncated note
+     """
+     if not body_text:
+         return ""
+
+     # Rule 1: non-JSON
+     if "application/json" not in content_type and not _looks_like_json(body_text):
+         return body_text[:NONJSON_MAX_CHARS]
+
+     # Try to parse as JSON
+     try:
+         parsed = json.loads(body_text)
+     except (json.JSONDecodeError, ValueError):
+         return body_text[:NONJSON_MAX_CHARS]
+
+     # Rule 2: JSON primitive
+     if not isinstance(parsed, (dict, list)):
+         return parsed
+
+     # Rule 3: detect error (4xx/5xx already handled by caller; this checks body content)
+     if isinstance(parsed, dict) and ("message" in parsed or "error" in parsed):
+         return parsed  # never truncate errors
+
+     # Rules 4 and 5
+     return _truncate_json(parsed)
+
+
+ def _looks_like_json(text: str) -> bool:
+     stripped = text.strip()
+     return stripped.startswith("{") or stripped.startswith("[") or stripped.startswith('"')
+
+
+ def _truncate_json(obj: Any) -> Any:
+     if isinstance(obj, list):
+         if len(obj) >= ARRAY_LARGE_THRESHOLD:
+             return {
+                 "items": obj[:ARRAY_PREVIEW_ITEMS],
+                 "_list_truncated": {
+                     "shown": ARRAY_PREVIEW_ITEMS,
+                     "total": len(obj),
+                     "note": (
+                         f"Showing {ARRAY_PREVIEW_ITEMS} of {len(obj)} items. "
+                         "Use search_episode_data() to find a specific item from this response."
+                     ),
+                 },
+             }
+         return obj
+
+     if isinstance(obj, dict):
+         result = {}
+         for k, v in obj.items():
+             if isinstance(v, list) and len(v) >= ARRAY_LARGE_THRESHOLD:
+                 result[k] = v[:ARRAY_PREVIEW_ITEMS]
+                 result["_list_truncated"] = {
+                     "field": k,
+                     "shown": ARRAY_PREVIEW_ITEMS,
+                     "total": len(v),
+                     "note": (
+                         f"Showing {ARRAY_PREVIEW_ITEMS} of {len(v)} items. "
+                         "Use search_episode_data() to find a specific item from this response."
+                     ),
+                 }
+             else:
+                 result[k] = v
+         return result
+
+     return obj
+
+
+ # ---------------------------------------------------------------------------
+ # Cookie injection
+ # ---------------------------------------------------------------------------
+
+ def _inject_cookies(headers: dict, session_state: dict) -> dict:
+     """Inject cookies from session_state into the request headers."""
+     headers = dict(headers)  # copy
+
+     # Collect cookie values
+     cookie_parts = []
+     for key, value in session_state.items():
+         if key.lower() in ("phpsessid", "sessid", "session", "cookie",
+                            "mage-cache-sessid", "private_content_version",
+                            "form_key"):
+             cookie_parts.append(f"{key}={value}")
+
+     # Merge with any raw Cookie header already present
+     existing = headers.get("cookie", "")
+     if cookie_parts:
+         headers["cookie"] = existing + ("; " if existing else "") + "; ".join(cookie_parts)
+
+     return headers
+
+
+ # ---------------------------------------------------------------------------
+ # Session state extraction
+ # ---------------------------------------------------------------------------
+
+ def _extract_set_cookies(response_headers: dict, session_state: dict) -> None:
+     """Extract Set-Cookie headers into session_state."""
+     for name, value in response_headers.items():
+         if name.lower() == "set-cookie":
+             # Parse "NAME=VALUE; Path=...; ..."
+             cookies = value.split(";")
+             if cookies:
+                 kv = cookies[0].strip()
+                 if "=" in kv:
+                     k, _, v = kv.partition("=")
+                     session_state[k.strip()] = v.strip()
+
+
+ def _extract_tokens_from_body(body: Any, session_state: dict) -> None:
+     """Extract auth tokens from JSON response bodies into session_state."""
+     if isinstance(body, str) and 10 < len(body) < 500:
+         # Likely a token (Magento returns bare quoted strings for auth tokens)
+         stripped = body.strip('"').strip()
+         if re.match(r"^[A-Za-z0-9_\-\.]{20,}$", stripped):
+             session_state["_last_token"] = stripped
+
+     if isinstance(body, dict):
+         for key in ("access_token", "token", "cart_id", "form_key"):
+             if key in body and body[key]:
+                 session_state[key] = body[key]
+
+
+ # ---------------------------------------------------------------------------
+ # Public API
+ # ---------------------------------------------------------------------------
+
+ def curl_exec(command: str, session_state: dict, episode_store: dict,
+               app_base_url: str = "") -> dict:
+     """
+     Parse and execute a curl command against the live app.
+
+     Args:
+         command: Full curl command string
+         session_state: Current session state (cookies/tokens), mutated in place
+         episode_store: Per-episode store for BM25 indexing, mutated in place
+         app_base_url: Base URL to validate requests against
+
+     Returns:
+         {status_code, headers, body} with a smart-truncated body
+     """
+     try:
+         parsed = parse_curl_command(command)
+     except Exception as e:
+         return {"status_code": -1, "headers": {}, "body": f"curl parse error: {e}", "error": str(e)}
+
+     if not parsed["url"]:
+         return {"status_code": -1, "headers": {}, "body": "No URL in curl command", "error": "missing url"}
+
+     # Inject session cookies
+     parsed["headers"] = _inject_cookies(parsed["headers"], session_state)
+
+     # Build the actual curl args
+     args = ["curl", "-s", "-i", "-L", "--max-time", "15"]
+     args += ["-X", parsed["method"]]
+     args += [parsed["url"]]
+
+     for h_name, h_val in parsed["headers"].items():
+         args += ["-H", f"{h_name}: {h_val}"]
+
+     if parsed["body"]:
+         args += ["-d", parsed["body"]]
+
+     try:
+         result = subprocess.run(
+             args,
+             capture_output=True,
+             text=True,
+             timeout=20,
+         )
+         raw_output = result.stdout
+     except subprocess.TimeoutExpired:
+         return {"status_code": -1, "headers": {}, "body": "Request timed out (20s)", "error": "timeout"}
+     except Exception as e:
+         return {"status_code": -1, "headers": {}, "body": f"subprocess error: {e}", "error": str(e)}
+
+     # Parse the HTTP response: headers and body split at the first blank line
+     status_code = 0
+     resp_headers: dict[str, str] = {}
+     body_text = ""
+
+     if raw_output:
+         # Find the status line (handle redirects: multiple HTTP/ headers)
+         lines = raw_output.split("\r\n") if "\r\n" in raw_output else raw_output.split("\n")
+         header_lines = []
+         body_lines = []
+         in_body = False
+         last_status = 0
+
+         for line in lines:
+             if in_body:
+                 body_lines.append(line)
+             elif line.startswith("HTTP/"):
+                 # Could be a redirect status; keep the last one
+                 parts = line.split(" ", 2)
+                 if len(parts) >= 2:
+                     try:
+                         last_status = int(parts[1])
+                     except ValueError:
+                         pass
+                 header_lines = []  # reset headers for this response
+             elif line.strip() == "":
+                 if last_status:  # we've seen at least one status line
+                     in_body = True
+             else:
+                 header_lines.append(line)
+
+         status_code = last_status
+         body_text = "\n".join(body_lines).strip()
+
+         for h_line in header_lines:
+             if ":" in h_line:
+                 h_name, _, h_val = h_line.partition(":")
+                 resp_headers[h_name.strip().lower()] = h_val.strip()
+
+     # Extract cookies / tokens into session_state
+     _extract_set_cookies(resp_headers, session_state)
+
+     # Try to parse the body as JSON
+     resp_ct = resp_headers.get("content-type", "")
+     parsed_body: Any = body_text
+     try:
+         parsed_body = json.loads(body_text) if body_text else ""
+     except (json.JSONDecodeError, ValueError):
+         parsed_body = body_text
+
+     # Extract tokens from the body
+     _extract_tokens_from_body(parsed_body, session_state)
+
+     # Index into the episode BM25 store BEFORE truncation
+     _index_into_episode_store(
+         episode_store=episode_store,
+         request_body=parsed["body"],
+         response_body=parsed_body,
+         url=parsed["url"],
+         method=parsed["method"],
+         status_code=status_code,
+     )
+
+     # Apply smart truncation
+     if status_code >= 400:
+         # Never truncate errors
+         truncated_body = parsed_body
+     else:
+         body_for_truncation = body_text if isinstance(parsed_body, str) else json.dumps(parsed_body)
+         truncated_body = smart_truncate(body_for_truncation, resp_ct)
+
+     return {
+         "status_code": status_code,
+         "headers": resp_headers,
+         "body": truncated_body,
+     }
+
+
+ # ---------------------------------------------------------------------------
+ # Episode store indexing
+ # ---------------------------------------------------------------------------
+
+ def _index_into_episode_store(episode_store: dict, request_body: Any,
+                               response_body: Any, url: str, method: str,
+                               status_code: int) -> None:
+     """Index a request/response pair into the episode BM25 store for search_episode_data()."""
+     if "bm25_corpus" not in episode_store:
+         episode_store["bm25_corpus"] = []
+         episode_store["bm25_metadata"] = []
+
+     def _to_text(obj: Any) -> str:
+         if obj is None:
+             return ""
+         if isinstance(obj, str):
+             return obj
+         return json.dumps(obj)
+
+     entry_text = (f"url: {url} | method: {method} | status: {status_code} | "
+                   f"request: {_to_text(request_body)} | response: {_to_text(response_body)}")
+
+     episode_store["bm25_corpus"].append(entry_text)
+     episode_store["bm25_metadata"].append({
+         "url": url,
+         "method": method,
+         "status_code": status_code,
+         "response_body": response_body,
+     })
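The large-array rule in `_truncate_json` is the core of the observation budget. A minimal standalone sketch of that behavior (using the same thresholds as the constants above; `truncate_list` is a hypothetical helper, not part of the module):

```python
ARRAY_PREVIEW_ITEMS = 2
ARRAY_LARGE_THRESHOLD = 3

def truncate_list(items: list) -> object:
    # Lists of ARRAY_LARGE_THRESHOLD or more items are cut to a short preview
    # plus a _list_truncated marker, as curl_exec does for response bodies.
    if len(items) >= ARRAY_LARGE_THRESHOLD:
        return {
            "items": items[:ARRAY_PREVIEW_ITEMS],
            "_list_truncated": {"shown": ARRAY_PREVIEW_ITEMS, "total": len(items)},
        }
    return items

out = truncate_list([{"sku": f"MH0{i}"} for i in range(1, 5)])
```

Small lists pass through unchanged, so the agent only pays the truncation marker cost when a response is actually large.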
server/tools/search_endpoints.py ADDED
@@ -0,0 +1,93 @@
+ """
+ search_endpoints tool — semantic search over endpoint embeddings from browser_agent.
+
+ Returns the top-3 endpoint schemas (full text) for a natural language query.
+ """
+
+ from __future__ import annotations
+
+ import numpy as np
+
+
+ def search_endpoints(query: str, episode_store: dict) -> list[str]:
+     """
+     Semantic search over the endpoint embeddings built by browser_agent.
+
+     Args:
+         query: Natural language query (e.g. "create guest cart", "add item to cart")
+         episode_store: Mutable dict containing embeddings from browser_agent.
+
+     Returns:
+         List of up to 3 endpoint schema text strings.
+     """
+     chunks: list[str] = episode_store.get("endpoint_chunks", [])
+     embeddings = episode_store.get("endpoint_embeddings")
+
+     if not chunks:
+         return ["No endpoint index available. Call browser_agent(task, url) first."]
+
+     # If there are no embeddings, use the keyword fallback directly
+     if embeddings is None or (hasattr(embeddings, "__len__") and len(embeddings) == 0):
+         query_terms = query.lower().split()
+         matches = [chunk for chunk in chunks
+                    if any(term in chunk.lower() for term in query_terms)]
+         return matches[:3] if matches else chunks[:3]
+
+     try:
+         model, _ = _get_embedding_model()
+
+         if hasattr(model, "encode_query"):
+             q_emb = model.encode_query([query], show_progress_bar=False)
+         else:
+             q_emb = model.encode([query], show_progress_bar=False)
+
+         if not isinstance(q_emb, np.ndarray):
+             q_emb = np.array(q_emb)
+
+         # Normalize the query embedding
+         norm = np.linalg.norm(q_emb, axis=1, keepdims=True)
+         if norm[0, 0] > 0:
+             q_emb = q_emb / norm
+
+         # Cosine similarity (stored embeddings are already normalized)
+         scores = (embeddings @ q_emb.T).flatten()
+         top_k = min(3, len(scores))
+         top_indices = np.argsort(scores)[::-1][:top_k]
+
+         return [chunks[int(idx)] for idx in top_indices]
+
+     except Exception as e:
+         # Fallback: keyword match
+         print(f"[search_endpoints] Embedding search failed: {e}. Using keyword fallback.", flush=True)
+         query_terms = query.lower().split()
+         matches = [chunk for chunk in chunks
+                    if any(term in chunk.lower() for term in query_terms)]
+         return matches[:3] if matches else chunks[:3]
+
+
+ # Lazy-load the model (shared with browser_agent)
+ _embedding_model = None
+ _embedding_model_name = None
+
+
+ def _get_embedding_model():
+     global _embedding_model, _embedding_model_name
+     if _embedding_model is not None:
+         return _embedding_model, _embedding_model_name
+
+     # Re-use browser_agent's loader
+     from .browser_agent import _get_embedding_model as _ba_get
+     model, name = _ba_get()
+     _embedding_model = model
+     _embedding_model_name = name
+     return model, name
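The scoring step relies on the stored embeddings being row-normalized, so a plain matrix product is cosine similarity. A toy sketch of that ranking with made-up 2-D vectors (not real model output):

```python
import numpy as np

# Three unit-norm "endpoint" embeddings and one raw query vector.
emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.7071, 0.7071]])
q = np.array([[0.9, 0.1]])
q = q / np.linalg.norm(q, axis=1, keepdims=True)  # normalize the query too

scores = (emb @ q.T).flatten()          # cosine similarity per endpoint
top_indices = np.argsort(scores)[::-1]  # best match first
```

With both sides normalized, `argsort` of the dot products gives the same ordering a full cosine computation would.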
server/tools/search_episode_data.py ADDED
@@ -0,0 +1,87 @@
+ """
+ search_episode_data tool — BM25 + keyword search over accumulated episode response data.
+
+ Searches all request/response bodies from prior curl_exec calls in this episode.
+ """
+
+ from __future__ import annotations
+
+ import re
+
+
+ def search_episode_data(query: str, episode_store: dict) -> list[dict]:
+     """
+     Hybrid BM25 + keyword search over the episode's accumulated response bodies.
+
+     Args:
+         query: Keyword or natural language query (e.g. "Radiant Tee sku", "_csrf_token")
+         episode_store: Per-episode store containing bm25_corpus and bm25_metadata
+
+     Returns:
+         Top-5 matching JSON objects from the episode history, annotated with step info
+     """
+     corpus: list[str] = episode_store.get("bm25_corpus", [])
+     metadata: list[dict] = episode_store.get("bm25_metadata", [])
+
+     if not corpus:
+         return [{"note": "No episode data yet. Make API calls with curl_exec() first."}]
+
+     # Try BM25 ranking
+     try:
+         from rank_bm25 import BM25Okapi
+
+         tokenized_corpus = [_tokenize(doc) for doc in corpus]
+         tokenized_query = _tokenize(query)
+         bm25 = BM25Okapi(tokenized_corpus)
+         scores = bm25.get_scores(tokenized_query)
+
+         # Take the top 5 by BM25 score
+         top_k = min(5, len(scores))
+         top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
+
+         results = []
+         for idx in top_indices:
+             if scores[idx] > 0:
+                 meta = metadata[idx]
+                 results.append({
+                     "step": idx + 1,
+                     "url": meta.get("url", ""),
+                     "method": meta.get("method", ""),
+                     "status_code": meta.get("status_code", 0),
+                     "data": meta.get("response_body"),
+                 })
+
+         if results:
+             return results
+
+     except ImportError:
+         pass
+     except Exception as e:
+         print(f"[search_episode_data] BM25 error: {e}", flush=True)
+
+     # Fallback: keyword match
+     query_terms = query.lower().split()
+     results = []
+     for idx, doc in enumerate(corpus):
+         if any(term in doc.lower() for term in query_terms):
+             meta = metadata[idx]
+             results.append({
+                 "step": idx + 1,
+                 "url": meta.get("url", ""),
+                 "method": meta.get("method", ""),
+                 "status_code": meta.get("status_code", 0),
+                 "data": meta.get("response_body"),
+             })
+     return results[:5] if results else [{"note": f"No results found for: {query}"}]
+
+
+ def _tokenize(text: str) -> list[str]:
+     """Simple lowercase regex tokenizer for BM25."""
+     tokens = re.findall(r"[a-z0-9_\-\.]+", text.lower())
+     return tokens if tokens else [""]
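The tokenizer regex keeps runs of alphanumerics plus `_`, `-`, and `.`, which preserves SKUs, snake_case field names, and dotted paths as single tokens. A standalone copy of that behavior (`tokenize` is a hypothetical name for this sketch):

```python
import re

def tokenize(text: str) -> list[str]:
    # Lowercase, then keep runs matching [a-z0-9_\-\.] — the same
    # pattern _tokenize uses; punctuation like ':' splits tokens.
    tokens = re.findall(r"[a-z0-9_\-\.]+", text.lower())
    return tokens if tokens else [""]

print(tokenize("Radiant Tee sku: MH01"))  # ['radiant', 'tee', 'sku', 'mh01']
```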
tests/mock_data/mock_catalog.json ADDED
@@ -0,0 +1,88 @@
+ {
+   "_meta": {
+     "generated": "2026-04-08",
+     "source": "mock catalog for testing"
+   },
+   "endpoints": [
+     {
+       "api_type": "rest",
+       "endpoint": "GET /rest/V1/categories",
+       "auth": "none",
+       "query_params": {},
+       "response_key_fields": ["id", "name", "children_data"],
+       "notes": "Returns the full category tree. No auth required."
+     },
+     {
+       "api_type": "rest",
+       "endpoint": "GET /rest/V1/products",
+       "auth": "none",
+       "query_params": {
+         "searchCriteria[filter_groups][0][filters][0][field]": {"type": "string", "source": "TASK_SPEC", "notes": "field name to filter on (e.g. 'name', 'sku')"},
+         "searchCriteria[filter_groups][0][filters][0][value]": {"type": "string", "source": "TASK_SPEC", "notes": "value to filter for"},
+         "searchCriteria[filter_groups][0][filters][0][condition_type]": {"type": "string", "source": "STATIC", "value": "like", "notes": "comparison operator"}
+       },
+       "response_key_fields": ["items[].sku", "items[].name", "items[].price", "total_count"],
+       "notes": "Search/list products. Use searchCriteria filters for targeted lookups."
+     },
+     {
+       "api_type": "rest",
+       "endpoint": "POST /rest/V1/guest-carts",
+       "auth": "none",
+       "body_params": {},
+       "response_key_fields": ["cartId (plain string in body)"],
+       "notes": "Creates a new guest cart. Returns the cartId as a plain quoted string."
+     },
+     {
+       "api_type": "rest",
+       "endpoint": "POST /rest/V1/guest-carts/{cartId}/items",
+       "auth": "none",
+       "path_params": {
+         "cartId": {"type": "string", "source": "PREV_CALL", "from_endpoint": "POST /rest/V1/guest-carts", "from_field": "response body"}
+       },
+       "body_params": {
+         "cartItem.sku": {"type": "string", "source": "PREV_CALL", "from_endpoint": "GET /rest/V1/products", "from_field": "items[].sku"},
+         "cartItem.qty": {"type": "number", "source": "TASK_SPEC", "notes": "quantity to add"},
+         "cartItem.quote_id": {"type": "string", "source": "DERIVED", "notes": "same value as cartId"}
+       },
+       "response_key_fields": ["item_id", "sku", "qty"],
+       "notes": "Add an item to a guest cart. cartId must exist first."
+     },
+     {
+       "api_type": "rest",
+       "endpoint": "POST /rest/V1/integration/customer/token",
+       "auth": "none",
+       "body_params": {
+         "username": {"type": "string", "source": "TASK_SPEC"},
+         "password": {"type": "string", "source": "TASK_SPEC"}
+       },
+       "response_key_fields": ["bearer token (plain string in body)"],
+       "notes": "Authenticate a customer. Returns a bearer token string."
+     },
+     {
+       "api_type": "rest",
+       "endpoint": "POST /rest/V1/guest-carts/{cartId}/estimate-shipping-methods",
+       "auth": "none",
+       "path_params": {
+         "cartId": {"type": "string", "source": "PREV_CALL", "from_endpoint": "POST /rest/V1/guest-carts", "from_field": "response body"}
+       },
+       "body_params": {
+         "address.city": {"type": "string", "source": "TASK_SPEC"},
+         "address.region_id": {"type": "number", "source": "TASK_SPEC"},
+         "address.postcode": {"type": "string", "source": "TASK_SPEC"},
+         "address.country_id": {"type": "string", "source": "TASK_SPEC"}
+       },
+       "response_key_fields": ["[].carrier_code", "[].method_code", "[].amount"],
+       "notes": "Get available shipping methods for a guest cart."
+     },
+     {
+       "api_type": "rest",
+       "endpoint": "GET /rest/V1/guest-carts/{cartId}/totals",
+       "auth": "none",
+       "path_params": {
+         "cartId": {"type": "string", "source": "PREV_CALL", "from_endpoint": "POST /rest/V1/guest-carts", "from_field": "response body"}
+       },
+       "response_key_fields": ["grand_total", "subtotal", "items[].item_id", "items[].price"],
+       "notes": "Get cart totals and line items."
+     }
+   ]
+ }
tests/mock_data/mock_har.json ADDED
@@ -0,0 +1,170 @@
+ {
+   "log": {
+     "version": "1.2",
+     "creator": {"name": "MockHAR", "version": "1.0"},
+     "pages": [],
+     "entries": [
+       {
+         "request": {
+           "method": "GET",
+           "url": "http://localhost:7770/rest/V1/categories",
+           "headers": [
+             {"name": "Accept", "value": "application/json"},
+             {"name": "Content-Type", "value": "application/json"}
+           ],
+           "queryString": [],
+           "postData": null
+         },
+         "response": {
+           "status": 200,
+           "headers": [{"name": "Content-Type", "value": "application/json"}],
+           "content": {"mimeType": "application/json", "text": "{\"id\":1,\"name\":\"Root\",\"children_data\":[{\"id\":2,\"name\":\"Default Category\"},{\"id\":3,\"name\":\"Beauty & Personal Care\"}]}"}
+         }
+       },
+       {
+         "request": {
+           "method": "GET",
+           "url": "http://localhost:7770/rest/V1/products?searchCriteria[filter_groups][0][filters][0][field]=name&searchCriteria[filter_groups][0][filters][0][value]=Radiant+Tee",
+           "headers": [
+             {"name": "Accept", "value": "application/json"},
+             {"name": "Content-Type", "value": "application/json"}
+           ],
+           "queryString": [{"name": "searchCriteria[filter_groups][0][filters][0][field]", "value": "name"}],
+           "postData": null
+         },
+         "response": {
+           "status": 200,
+           "headers": [{"name": "Content-Type", "value": "application/json"}],
+           "content": {"mimeType": "application/json", "text": "{\"items\":[{\"sku\":\"MH01\",\"name\":\"Radiant Tee\",\"price\":22.0,\"type_id\":\"simple\"},{\"sku\":\"MH02\",\"name\":\"Breathe-Easy Tank\",\"price\":34.0,\"type_id\":\"simple\"},{\"sku\":\"MH03\",\"name\":\"Stellar Solar Jacket\",\"price\":75.0,\"type_id\":\"configurable\"},{\"sku\":\"MH04\",\"name\":\"Argus All-Weather Tank\",\"price\":22.0,\"type_id\":\"simple\"}],\"total_count\":4}"}
+         }
+       },
+       {
+         "request": {
+           "method": "POST",
+           "url": "http://localhost:7770/rest/V1/guest-carts",
+           "headers": [
+             {"name": "Accept", "value": "application/json"},
+             {"name": "Content-Type", "value": "application/json"}
+           ],
+           "queryString": [],
+           "postData": null
+         },
+         "response": {
+           "status": 200,
+           "headers": [{"name": "Content-Type", "value": "application/json"}],
+           "content": {"mimeType": "application/json", "text": "\"cart-abc123\""}
+         }
+       },
+       {
+         "request": {
+           "method": "POST",
+           "url": "http://localhost:7770/rest/V1/guest-carts/cart-abc123/items",
+           "headers": [
+             {"name": "Accept", "value": "application/json"},
+             {"name": "Content-Type", "value": "application/json"}
+           ],
+           "queryString": [],
+           "postData": {"mimeType": "application/json", "text": "{\"cartItem\":{\"sku\":\"MH01\",\"qty\":1,\"quote_id\":\"cart-abc123\"}}"}
+         },
+         "response": {
+           "status": 200,
+           "headers": [{"name": "Content-Type", "value": "application/json"}],
+           "content": {"mimeType": "application/json", "text": "{\"item_id\":5,\"sku\":\"MH01\",\"qty\":1,\"name\":\"Radiant Tee\",\"price\":22.0,\"product_type\":\"simple\",\"quote_id\":\"cart-abc123\"}"}
+         }
+       },
+       {
+         "request": {
+           "method": "GET",
+           "url": "http://localhost:7770/rest/V1/guest-carts/cart-abc123/totals",
+           "headers": [
+             {"name": "Accept", "value": "application/json"},
+             {"name": "Content-Type", "value": "application/json"}
+           ],
+           "queryString": [],
+           "postData": null
+         },
+         "response": {
+           "status": 200,
+           "headers": [{"name": "Content-Type", "value": "application/json"}],
+           "content": {"mimeType": "application/json", "text": "{\"grand_total\":22.0,\"subtotal\":22.0,\"items\":[{\"item_id\":5,\"price\":22.0,\"qty\":1,\"name\":\"Radiant Tee\"}]}"}
+         }
+       },
+       {
+         "request": {
+           "method": "POST",
+           "url": "http://localhost:7770/rest/V1/integration/customer/token",
+           "headers": [
+             {"name": "Accept", "value": "application/json"},
+             {"name": "Content-Type", "value": "application/json"}
+           ],
+           "queryString": [],
+           "postData": {"mimeType": "application/json", "text": "{\"username\":\"emma.lopez@gmail.com\",\"password\":\"Password.1\"}"}
+         },
+         "response": {
+           "status": 200,
+           "headers": [{"name": "Content-Type", "value": "application/json"}],
+           "content": {"mimeType": "application/json", "text": "\"token-xyz789\""}
+         }
+       },
+       {
+         "request": {
+           "method": "POST",
+           "url": "http://localhost:7770/rest/V1/guest-carts/cart-abc123/estimate-shipping-methods",
+           "headers": [
+             {"name": "Accept", "value": "application/json"},
+             {"name": "Content-Type", "value": "application/json"}
+           ],
+           "queryString": [],
+           "postData": {"mimeType": "application/json", "text": "{\"address\":{\"city\":\"New York\",\"region_id\":43,\"postcode\":\"10001\",\"country_id\":\"US\"}}"}
+         },
+         "response": {
+           "status": 200,
+           "headers": [{"name": "Content-Type", "value": "application/json"}],
+           "content": {"mimeType": "application/json", "text": "[{\"carrier_code\":\"flatrate\",\"method_code\":\"flatrate\",\"carrier_title\":\"Flat Rate\",\"method_title\":\"Fixed\",\"amount\":5.0,\"available\":true}]"}
+         }
+       },
+       {
+         "request": {
+           "method": "GET",
+           "url": "http://localhost:7770/static/version123/frontend/Magento/luma/en_US/mage/gallery.js",
+           "headers": [{"name": "Accept", "value": "*/*"}],
+           "queryString": [],
+           "postData": null
+         },
+         "response": {
+           "status": 200,
+           "headers": [{"name": "Content-Type", "value": "application/javascript"}],
+           "content": {"mimeType": "application/javascript", "text": "// gallery js code..."}
+         }
+       },
+       {
+         "request": {
+           "method": "GET",
+           "url": "http://localhost:7770/media/catalog/product/m/h/mh01-black_main.jpg",
+           "headers": [{"name": "Accept", "value": "image/*"}],
+           "queryString": [],
+           "postData": null
147
+ },
148
+ "response": {
149
+ "status": 200,
150
+ "headers": [{"name": "Content-Type", "value": "image/jpeg"}],
151
+ "content": {"mimeType": "image/jpeg", "text": ""}
152
+ }
153
+ },
154
+ {
155
+ "request": {
156
+ "method": "GET",
157
+ "url": "http://localhost:7770/beauty-personal-care.html",
158
+ "headers": [{"name": "Accept", "value": "text/html"}],
159
+ "queryString": [],
160
+ "postData": null
161
+ },
162
+ "response": {
163
+ "status": 200,
164
+ "headers": [{"name": "Content-Type", "value": "text/html; charset=UTF-8"}],
165
+ "content": {"mimeType": "text/html", "text": "<html>...</html>"}
166
+ }
167
+ }
168
+ ]
169
+ }
170
+ }
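The fixture above deliberately mixes real API calls (guest-cart creation, add-to-cart, totals, token, shipping estimate) with static-asset and HTML noise that the pipeline must filter out. A minimal sketch of walking a HAR-shaped dict to collect `(method, url, status)` triples — the inline fixture here is a tiny stand-in for `tests/mock_data/mock_har.json`, not part of the repository:

```python
# Minimal sketch: walk a HAR-shaped dict (log.entries layout, as in the
# fixture above) and collect (method, url, status) triples.
har = {
    "log": {
        "entries": [
            {"request": {"method": "POST", "url": "http://localhost:7770/rest/V1/guest-carts"},
             "response": {"status": 200}},
            {"request": {"method": "GET", "url": "http://localhost:7770/media/catalog/product/img.jpg"},
             "response": {"status": 200}},
        ]
    }
}

calls = [
    (e["request"]["method"], e["request"]["url"], e["response"]["status"])
    for e in har["log"]["entries"]
]
```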
tests/test_e2e_episode.py ADDED
@@ -0,0 +1,272 @@
+ """
+ End-to-End Episode Simulation: "Add Radiant Tee to a guest cart"
+
+ Simulates the full tool chain with mock data:
+     browser_agent → search_endpoints → curl_exec → search_episode_data → done
+
+ Tests that values thread correctly between tools and that each tool's
+ output feeds properly into the next tool's input.
+ """
+
+ import json
+ import os
+ import sys
+ import re
+
+ # Add tests dir to path
+ sys.path.insert(0, os.path.dirname(__file__))
+
+ from tool_browser_agent import browser_agent, extract_openapi_spec, spec_entry_to_text
+ from tool_search_endpoints import SearchEndpoints
+ from tool_curl_exec import mock_curl_exec, parse_curl_command
+ from tool_search_episode_data import EpisodeDataStore
+
+
+ def run_episode():
+     """Simulate a full episode: Add 'Radiant Tee' to a guest cart."""
+
+     print("=" * 70)
+     print("E2E EPISODE: Add 'Radiant Tee' to a guest cart")
+     print("URL: http://localhost:7770/")
+     print("=" * 70)
+
+     task = "Add 'Radiant Tee' to a guest cart"
+     url = "http://localhost:7770/"
+     mock_data_dir = os.path.join(os.path.dirname(__file__), "mock_data")
+
+     # Episode state
+     episode_index_docs = []
+     episode_store = EpisodeDataStore()
+     session_state = {}
+     step = 0
+
+     # -----------------------------------------------------------------------
+     # STEP 1: browser_agent — discover endpoints
+     # -----------------------------------------------------------------------
+     step += 1
+     print(f"\n{'='*50}")
+     print(f"STEP {step}: browser_agent(\"{task}\", \"{url}\")")
+     print(f"{'='*50}")
+
+     # Load mock HAR directly (simulating HAR file exists)
+     mock_har_path = os.path.join(mock_data_dir, "mock_har.json")
+     with open(mock_har_path) as f:
+         har_data = json.load(f)
+
+     spec_entries = extract_openapi_spec(har_data, url)
+     text_chunks = [spec_entry_to_text(e, "shopping") for e in spec_entries]
+
+     # Build summary output (what RL agent sees)
+     summary = {
+         "app": "shopping",
+         "endpoints": [{"method": e["method"], "path": e["path"]} for e in spec_entries],
+         "total_endpoints": len(spec_entries),
+         "note": "Use search_endpoints() for full details on any endpoint."
+     }
+
+     print(f"\nResult: {len(summary['endpoints'])} endpoints discovered:")
+     for ep in summary["endpoints"]:
+         print(f"  {ep['method']:6s} {ep['path']}")
+
+     # Set up the search_endpoints tool with browser_agent output (NOT catalog ground truth).
+     # In the real system, search_endpoints searches the GEMMA embeddings built by
+     # browser_agent from HAR data. Here we use keyword search as a test fallback for GEMMA.
+     search_tool = SearchEndpoints()
+     search_tool.load_from_browser_agent(text_chunks)
+
+     print(f"\n → search_endpoints index built: {len(text_chunks)} docs from browser_agent HAR output")
+
+     # -----------------------------------------------------------------------
+     # STEP 2: search_endpoints — "how to find a product by name?"
+     # -----------------------------------------------------------------------
+     step += 1
+     print(f"\n{'='*50}")
+     print(f"STEP {step}: search_endpoints(\"find product by name get sku\")")
+     print(f"{'='*50}")
+
+     results = search_tool.search("find product by name get sku", top_k=3)
+     print("\nTop-3 results:")
+     for i, r in enumerate(results):
+         ep_match = re.search(r'endpoint: (\S+ \S+)', r)
+         ep_name = ep_match.group(1) if ep_match else "?"
+         print(f"  [{i+1}] {ep_name}")
+         print(f"      {r[:150]}...")
+
+     # Agent decides: GET /rest/V1/products with searchCriteria filter
+     print(f"\n → Agent decides: GET /rest/V1/products with name filter")
+
+     # -----------------------------------------------------------------------
+     # STEP 3: curl_exec — search for Radiant Tee
+     # -----------------------------------------------------------------------
+     step += 1
+     print(f"\n{'='*50}")
+     print(f"STEP {step}: curl_exec(\"curl .../products?searchCriteria[...]=Radiant+Tee\")")
+     print(f"{'='*50}")
+
+     result = mock_curl_exec(
+         "curl 'http://localhost:7770/rest/V1/products?searchCriteria[filter_groups][0][filters][0][field]=name&searchCriteria[filter_groups][0][filters][0][value]=Radiant+Tee'",
+         step, episode_index_docs
+     )
+     # All docs accumulated so far come from this first curl call.
+     episode_store.add_documents(episode_index_docs)
+
+     print(f"\nResult: status={result['status_code']}")
+     body = result["body"]
+     if isinstance(body, dict):
+         items = body.get("items", [])
+         print(f"  items shown: {len(items)}, total: {body.get('total_count', '?')}")
+         for item in items:
+             print(f"    sku={item['sku']}, name={item['name']}, price={item['price']}")
+         if "_list_truncated" in body:
+             print(f"  [TRUNCATED] {body['_list_truncated']['note']}")
+
+     # Agent extracts: sku="MH01" from response
+     target_sku = "MH01"
+     print(f"\n → Agent extracts: sku='{target_sku}' for 'Radiant Tee'")
+
+     # -----------------------------------------------------------------------
+     # STEP 4: search_endpoints — "how to create a guest cart?"
+     # -----------------------------------------------------------------------
+     step += 1
+     print(f"\n{'='*50}")
+     print(f"STEP {step}: search_endpoints(\"create guest cart get cart id\")")
+     print(f"{'='*50}")
+
+     results = search_tool.search("create guest cart get cart id", top_k=3)
+     print("\nTop result:")
+     ep_match = re.search(r'endpoint: (\S+ \S+)', results[0])
+     print(f"  {ep_match.group(1) if ep_match else results[0][:80]}")
+
+     print(f"\n → Agent decides: POST /rest/V1/guest-carts")
+
+     # -----------------------------------------------------------------------
+     # STEP 5: curl_exec — create guest cart
+     # -----------------------------------------------------------------------
+     step += 1
+     print(f"\n{'='*50}")
+     print(f"STEP {step}: curl_exec(\"curl -X POST .../guest-carts\")")
+     print(f"{'='*50}")
+
+     result = mock_curl_exec(
+         "curl -X POST 'http://localhost:7770/rest/V1/guest-carts' -H 'Content-Type: application/json'",
+         step, episode_index_docs
+     )
+     episode_store.add_documents(episode_index_docs[-1:])
+
+     cart_id = result["body"]
+     print(f"\nResult: status={result['status_code']}, cart_id={cart_id}")
+     print(f"\n → Agent extracts: cart_id='{cart_id}'")
+
+     # -----------------------------------------------------------------------
+     # STEP 6: search_endpoints — "how to add item to guest cart?"
+     # -----------------------------------------------------------------------
+     step += 1
+     print(f"\n{'='*50}")
+     print(f"STEP {step}: search_endpoints(\"add item to guest cart\")")
+     print(f"{'='*50}")
+
+     results = search_tool.search("add item to guest cart cartId sku", top_k=3)
+     print("\nTop result:")
+     print(f"  {results[0][:200]}...")
+
+     print(f"\n → Agent decides: POST /rest/V1/guest-carts/{{cartId}}/items")
+     print(f"    cartId   = {cart_id} (from step {step-1})")
+     print(f"    sku      = {target_sku} (from step 3)")
+     print(f"    quote_id = {cart_id} (DERIVED, same as cartId)")
+
+     # -----------------------------------------------------------------------
+     # STEP 7: curl_exec — add Radiant Tee to cart
+     # -----------------------------------------------------------------------
+     step += 1
+     print(f"\n{'='*50}")
+     print(f"STEP {step}: curl_exec(\"curl -X POST .../guest-carts/{cart_id}/items\")")
+     print(f"{'='*50}")
+
+     result = mock_curl_exec(
+         f'curl -X POST "http://localhost:7770/rest/V1/guest-carts/{cart_id}/items" '
+         f'-H "Content-Type: application/json" '
+         f'-d \'{{"cartItem":{{"sku":"{target_sku}","qty":1,"quote_id":"{cart_id}"}}}}\'',
+         step, episode_index_docs
+     )
+     episode_store.add_documents(episode_index_docs[-2:])  # request + response docs
+
+     print(f"\nResult: status={result['status_code']}")
+     body = result["body"]
+     if isinstance(body, dict):
+         print(f"  item_id={body.get('item_id')}, sku={body.get('sku')}, qty={body.get('qty')}")
+
+     # -----------------------------------------------------------------------
+     # Test: search_episode_data — can we find values from prior steps?
+     # -----------------------------------------------------------------------
+     print(f"\n{'='*50}")
+     print("VERIFICATION: search_episode_data queries")
+     print(f"{'='*50}")
+
+     print(f"\nEpisode index: {episode_store.doc_count} documents total\n")
+
+     # Can we find the product from step 3?
+     results = episode_store.search("Radiant Tee sku", top_k=1)
+     found_product = results and "MH01" in results[0]
+     print(f"  Find 'Radiant Tee' sku from products response: {'PASS' if found_product else 'FAIL'}")
+     if results:
+         print(f"    → {results[0][:100]}...")
+
+     # Can we find the cart ID from step 5?
+     results = episode_store.search("guest-carts cart", top_k=3)
+     found_cart = any("cart-mock" in r for r in results)
+     print(f"\n  Find cart ID from create-cart response: {'PASS' if found_cart else 'FAIL'}")
+     if results:
+         print(f"    → {results[0][:100]}...")
+
+     # Can we find the add-to-cart confirmation?
+     results = episode_store.search("item_id sku MH01", top_k=1)
+     found_confirm = results and "item_id" in results[0]
+     print(f"\n  Find add-to-cart confirmation: {'PASS' if found_confirm else 'FAIL'}")
+     if results:
+         print(f"    → {results[0][:100]}...")
+
+     # -----------------------------------------------------------------------
+     # STEP 8: done
+     # -----------------------------------------------------------------------
+     step += 1
+     print(f"\n{'='*50}")
+     print(f"STEP {step}: done(\"Radiant Tee (MH01) added to guest cart {cart_id}\")")
+     print(f"{'='*50}")
+     print(f"\n  Episode complete. {step} steps total.")
+     print(f"  Episode index: {episode_store.doc_count} documents indexed")
+
+     # -----------------------------------------------------------------------
+     # Summary
+     # -----------------------------------------------------------------------
+     print(f"\n{'='*70}")
+     print("EPISODE SUMMARY")
+     print(f"{'='*70}")
+     print(f"""
+     Task:  {task}
+     App:   shopping (port 7770)
+     Steps: {step}
+     Tools used: browser_agent → search_endpoints (x3) → curl_exec (x3) → done
+
+     Value Threading:
+       Step 3: GET /products → sku='MH01' (Radiant Tee)
+       Step 5: POST /guest-carts → cart_id='{cart_id}'
+       Step 7: POST /guest-carts/{cart_id}/items
+               sku='{target_sku}' (from step 3)
+               quote_id='{cart_id}' (DERIVED from step 5)
+
+     Episode Index: {episode_store.doc_count} documents
+       - Categories, products (5 items), cart creation, add-to-cart
+       - All searchable via search_episode_data()
+
+     Result: item_id=5, sku=MH01, qty=1 added to cart {cart_id}
+     """)
+
+     # Assertions
+     assert found_product, "Failed to find product in episode data"
+     assert found_cart, "Failed to find cart ID in episode data"
+     assert found_confirm, "Failed to find add-to-cart confirmation"
+
+     print("[PASS] End-to-end episode simulation completed successfully\n")
+
+
+ if __name__ == "__main__":
+     run_episode()
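The long hand-written `searchCriteria[...]` query string in step 3 above is easy to get wrong. A hypothetical illustration (not part of the test file) of building the same Magento-style query programmatically with the standard library; `urlencode` percent-encodes the bracketed keys, which Magento's REST API accepts:

```python
from urllib.parse import urlencode

# Build the step-3 product search query instead of hand-writing it.
# Bracketed keys become %5B/%5D and the space in "Radiant Tee" becomes "+".
params = {
    "searchCriteria[filter_groups][0][filters][0][field]": "name",
    "searchCriteria[filter_groups][0][filters][0][value]": "Radiant Tee",
}
query = urlencode(params)
url = "http://localhost:7770/rest/V1/products?" + query
```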
tests/test_real_har.py ADDED
@@ -0,0 +1,93 @@
+ """
+ Test browser_agent pipeline against REAL HAR files.
+
+ Processes the actual recorded HAR data to verify filtering, deduplication,
+ and path normalisation work on real-world traffic.
+ """
+
+ import json
+ import os
+ import sys
+
+ sys.path.insert(0, os.path.dirname(__file__))
+ from tool_browser_agent import extract_openapi_spec, spec_entry_to_text, build_summary_output
+
+ HARS_DIR = os.path.join(os.path.dirname(__file__), "..", "hars")
+
+ APPS = {
+     "wikipedia": {
+         "har": "wikipedia.har",
+         "url": "http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:8888/",
+     },
+     "forum": {
+         "har": "forum.har",
+         "url": "http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:9999/",
+     },
+     "shopping": {
+         "har": "shopping.har",
+         "url": "http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7770/",
+     },
+     "shopping_admin": {
+         "har": "shopping_admin.har",
+         "url": "http://ec2-16-59-2-56.us-east-2.compute.amazonaws.com:7780/",
+     },
+ }
+
+
+ def test_app(app_name: str, config: dict):
+     har_path = os.path.join(HARS_DIR, config["har"])
+     if not os.path.exists(har_path):
+         print(f"  [SKIP] {har_path} not found")
+         return None
+
+     print(f"\n{'='*60}")
+     print(f"APP: {app_name} ({config['har']})")
+     print(f"{'='*60}")
+
+     with open(har_path) as f:
+         har_data = json.load(f)
+
+     total_entries = len(har_data["log"]["entries"])
+     spec = extract_openapi_spec(har_data, config["url"])
+     chunks = [spec_entry_to_text(e, app_name) for e in spec]
+     summary = build_summary_output(spec, app_name)
+
+     print(f"\n  Total HAR entries:      {total_entries}")
+     print(f"  Filtered API endpoints: {len(spec)}")
+     print(f"  Reduction: {total_entries} → {len(spec)} ({100*(1-len(spec)/max(total_entries,1)):.0f}% filtered out)")
+
+     print("\n  Endpoints:")
+     for ep in spec:
+         auth_marker = " [AUTH]" if ep["auth_observed"] else ""
+         body_marker = " [BODY]" if ep.get("request_body") else ""
+         print(f"    {ep['method']:6s} {ep['path'][:70]:70s} {ep['status_code']}{auth_marker}{body_marker}")
+
+     print(f"\n  Text chunks for embedding: {len(chunks)}")
+     if chunks:
+         print(f"  Sample: {chunks[0][:120]}...")
+
+     return spec
+
+
+ if __name__ == "__main__":
+     print("=" * 60)
+     print("TEST: browser_agent pipeline against REAL HAR files")
+     print("=" * 60)
+
+     all_specs = {}
+     for app_name, config in APPS.items():
+         spec = test_app(app_name, config)
+         if spec:
+             all_specs[app_name] = spec
+
+     # Summary
+     print(f"\n\n{'='*60}")
+     print("SUMMARY")
+     print(f"{'='*60}")
+     for app, spec in all_specs.items():
+         methods = {}
+         for e in spec:
+             methods[e["method"]] = methods.get(e["method"], 0) + 1
+         method_str = ", ".join(f"{m}:{c}" for m, c in sorted(methods.items()))
+         print(f"  {app:20s}: {len(spec):3d} endpoints ({method_str})")
+
+     print("\n[PASS] Real HAR processing completed successfully")
tests/tool_browser_agent.py ADDED
@@ -0,0 +1,327 @@
+ """
+ Tool 0: browser_agent — HAR processing pipeline.
+
+ Stages:
+     1. Check for pre-recorded HAR file (by port mapping) → load or fall back to live browser
+     2. Filter HAR entries: skip static assets, HTML pages, deduplicate by (method, normalised path)
+     3. Build OpenAPI-like spec from filtered entries
+     4. Build GEMMA embeddings over the spec (for search_endpoints)
+     5. Return summary endpoint list (method + path only)
+ """
+
+ import json
+ import os
+ import re
+ from urllib.parse import urlparse
+
+ # ---------------------------------------------------------------------------
+ # Configuration
+ # ---------------------------------------------------------------------------
+
+ HAR_MAP = {
+     ":7770": "hars/shopping.har",
+     ":7780": "hars/shopping_admin.har",
+     ":9999": "hars/forum.har",
+     ":3000": "hars/osm.har",
+     ":8888": "hars/wikipedia.har",
+ }
+
+ APP_NAMES = {
+     ":7770": "shopping",
+     ":7780": "shopping_admin",
+     ":9999": "forum",
+     ":3000": "osm",
+     ":8888": "wikipedia",
+ }
+
+ SKIP_EXTENSIONS = {".css", ".png", ".jpg", ".jpeg", ".svg", ".ico", ".woff",
+                    ".woff2", ".ttf", ".gif", ".js", ".map"}
+
+ SKIP_PATH_PREFIXES = ["/static/", "/media/", "/_next/", "/assets/",
+                       "/__webpack", "/pub/static/"]
+
+ # ---------------------------------------------------------------------------
+ # Path normalisation
+ # ---------------------------------------------------------------------------
+
+ # Patterns for dynamic segments
+ _UUID_RE = re.compile(r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}', re.I)
+ _LONG_ALPHANUM_RE = re.compile(r'[a-zA-Z0-9]{32,}')  # Magento cart IDs etc.
+ _NUMERIC_ID_RE = re.compile(r'^[0-9]+$')
+ _FORUM_SLUG_RE = re.compile(r'^[0-9]+-[a-z0-9-]+$')  # e.g. "1-hello-world"
+
+
+ def _normalise_path(path: str) -> str:
+     """Replace concrete IDs/slugs with {id} placeholders."""
+     segments = path.strip("/").split("/")
+     normalised = []
+     for seg in segments:
+         if _UUID_RE.fullmatch(seg):
+             normalised.append("{id}")
+         elif _LONG_ALPHANUM_RE.fullmatch(seg):
+             normalised.append("{id}")
+         elif _NUMERIC_ID_RE.fullmatch(seg) and len(seg) >= 2:
+             normalised.append("{id}")
+         elif _FORUM_SLUG_RE.fullmatch(seg):
+             normalised.append("{id}-{slug}")
+         else:
+             normalised.append(seg)
+     return "/" + "/".join(normalised)
+
+
+ # ---------------------------------------------------------------------------
+ # Filtering
+ # ---------------------------------------------------------------------------
+
+ def _is_static_asset(url: str) -> bool:
+     """Check if URL points to a static asset."""
+     parsed = urlparse(url)
+     path = parsed.path.lower()
+
+     for ext in SKIP_EXTENSIONS:
+         if path.endswith(ext):
+             return True
+
+     for prefix in SKIP_PATH_PREFIXES:
+         if path.startswith(prefix):
+             return True
+
+     return False
+
+
+ def _get_response_content_type(resp: dict) -> str:
+     """Extract content-type from response headers or content field."""
+     # Check headers
+     for h in resp.get("headers", []):
+         if h["name"].lower() == "content-type":
+             return h["value"].lower()
+     # Check content field (HAR format)
+     content = resp.get("content", {})
+     return content.get("mimeType", "").lower()
+
+
+ def _extract_body(req: dict) -> str | None:
+     """Extract request body text from HAR entry."""
+     pd = req.get("postData")
+     if pd is None:
+         return None
+     if isinstance(pd, dict):
+         return pd.get("text")
+     return str(pd) if pd else None
+
+
+ def _truncate_body(resp: dict, max_len: int = 500) -> str | None:
+     """Extract and truncate response body for spec document."""
+     content = resp.get("content", {})
+     text = content.get("text", "")
+     if not text:
+         return None
+     if len(text) > max_len:
+         return text[:max_len] + "..."
+     return text
+
+
+ # ---------------------------------------------------------------------------
+ # Core pipeline
+ # ---------------------------------------------------------------------------
+
+ def resolve_har_path(url: str, base_dir: str = ".") -> str | None:
+     """Find pre-recorded HAR file for this app URL."""
+     for port_key, rel_path in HAR_MAP.items():
+         if port_key in url:
+             full_path = os.path.join(base_dir, rel_path)
+             if os.path.exists(full_path):
+                 return full_path
+     return None
+
+
+ def resolve_app_name(url: str) -> str:
+     """Map URL to app name."""
+     for port_key, name in APP_NAMES.items():
+         if port_key in url:
+             return name
+     return "unknown"
+
+
+ def extract_openapi_spec(har_data: dict, app_base_url: str) -> list[dict]:
+     """
+     Stage 2-3: Filter HAR entries and extract OpenAPI-like spec.
+     Returns a list of structured endpoint documents.
+     """
+     entries = har_data["log"]["entries"]
+     seen = set()
+     spec_entries = []
+
+     for entry in entries:
+         req = entry["request"]
+         resp = entry["response"]
+         raw_url = req["url"]
+         method = req["method"]
+
+         # Skip static assets
+         if _is_static_asset(raw_url):
+             continue
+
+         # Skip HTML page navigations
+         content_type = _get_response_content_type(resp)
+         if "text/html" in content_type and method == "GET":
+             continue
+
+         # Normalise path
+         parsed = urlparse(raw_url)
+         path = _normalise_path(parsed.path)
+
+         # Deduplicate
+         key = f"{method} {path}"
+         if key in seen:
+             continue
+         seen.add(key)
+
+         # Auth detection
+         has_auth = any(
+             h["name"].lower() in ("authorization", "x-api-key", "cookie")
+             for h in req.get("headers", [])
+         )
+
+         spec_entries.append({
+             "method": method,
+             "path": path,
+             "query_params": parsed.query or None,
+             "request_body": _extract_body(req),
+             "status_code": resp["status"],
+             "response_content_type": content_type,
+             "response_body_sample": _truncate_body(resp),
+             "auth_observed": has_auth,
+         })
+
+     return spec_entries
+
+
+ def spec_entry_to_text(entry: dict, app_name: str) -> str:
+     """Convert a spec entry to a searchable text document for embedding."""
+     parts = [
+         f"app: {app_name}",
+         f"endpoint: {entry['method']} {entry['path']}",
+         f"status: {entry['status_code']}",
+         f"auth: {'required' if entry['auth_observed'] else 'none'}",
+     ]
+     if entry.get("query_params"):
+         parts.append(f"query: {entry['query_params']}")
+     if entry.get("request_body"):
+         parts.append(f"body: {entry['request_body'][:200]}")
+     if entry.get("response_body_sample"):
+         parts.append(f"response_sample: {entry['response_body_sample'][:200]}")
+     return " | ".join(parts)
+
+
+ def build_summary_output(spec_entries: list[dict], app_name: str) -> dict:
+     """Stage 5: Build summary-only output for the RL agent."""
+     endpoints = [{"method": e["method"], "path": e["path"]} for e in spec_entries]
+     return {
+         "app": app_name,
+         "endpoints": endpoints,
+         "total_endpoints": len(endpoints),
+         "note": (
+             "These endpoints were observed for this application. "
+             "Use search_endpoints() with a natural language query to get "
+             "the full schema, parameters, and auth details for any endpoint."
+         ),
+     }
+
+
+ def browser_agent(task: str, url: str, base_dir: str = ".") -> tuple[dict, list[dict], list[str]]:
+     """
+     Full browser_agent pipeline.
+
+     Returns:
+         (summary_output, spec_entries, text_chunks)
+         - summary_output: what the RL agent sees
+         - spec_entries: structured spec for internal use
+         - text_chunks: searchable text docs for embedding/search
+     """
+     app_name = resolve_app_name(url)
+
+     # Stage 1: Get HAR data
+     har_path = resolve_har_path(url, base_dir)
+     if har_path:
+         with open(har_path) as f:
+             har_data = json.load(f)
+     else:
+         raise FileNotFoundError(
+             f"No HAR file found for {url}. Live browser fallback not implemented in test mode."
+         )
+
+     # Stage 2-3: Extract spec
+     spec_entries = extract_openapi_spec(har_data, url)
+
+     # Stage 4: Build text chunks (embeddings would happen here)
+     text_chunks = [spec_entry_to_text(e, app_name) for e in spec_entries]
+
+     # Stage 5: Build summary
+     summary = build_summary_output(spec_entries, app_name)
+
+     return summary, spec_entries, text_chunks
+
+
+ # ---------------------------------------------------------------------------
+ # Test
+ # ---------------------------------------------------------------------------
+
+ if __name__ == "__main__":
+     print("=" * 70)
+     print("TEST: browser_agent with mock HAR data")
+     print("=" * 70)
+
+     mock_har_path = os.path.join(os.path.dirname(__file__), "mock_data", "mock_har.json")
+     with open(mock_har_path) as f:
+         har_data = json.load(f)
+
+     url = "http://localhost:7770/"
+     app_name = "shopping"
+
+     # Test filtering
+     spec = extract_openapi_spec(har_data, url)
+     print(f"\nFiltered {len(har_data['log']['entries'])} HAR entries → {len(spec)} API endpoints\n")
+
+     for e in spec:
+         print(f"  {e['method']:6s} {e['path']}")
+         if e.get("request_body"):
+             print(f"         body: {e['request_body'][:80]}...")
+
+     # Test summary output
+     summary = build_summary_output(spec, app_name)
+     print("\n--- Summary Output (what RL agent sees) ---")
+     print(json.dumps(summary, indent=2))
+
+     # Test text chunks
+     chunks = [spec_entry_to_text(e, app_name) for e in spec]
+     print(f"\n--- Text Chunks for Embedding ({len(chunks)} docs) ---")
+     for i, chunk in enumerate(chunks):
+         print(f"  [{i}] {chunk[:120]}...")
+
+     # Test path normalisation
+     print("\n--- Path Normalisation Tests ---")
+     test_paths = [
+         "/rest/V1/products/42",
+         "/rest/V1/guest-carts/3fa85f64-5717-4562-b3fc-2c963f66afa6/items",
+         "/rest/V1/guest-carts/abcdef1234567890abcdef1234567890ab/totals",
+         "/api/0.6/node/12345678",
+         "/f/general/1-hello-world",
+         "/rest/V1/categories",
+         "/rest/V1/products",
+     ]
+     for p in test_paths:
+         print(f"  {p:65s} → {_normalise_path(p)}")
+
+     # Test static asset detection
+     print("\n--- Static Asset Detection ---")
+     test_urls = [
+         "http://localhost:7770/rest/V1/products",
+         "http://localhost:7770/static/version1/file.js",
+         "http://localhost:7770/media/catalog/product/img.jpg",
+         "http://localhost:7770/beauty-personal-care.html",
+     ]
+     for u in test_urls:
+         print(f"  {u:60s} → static={_is_static_asset(u)}")
+
+     print("\n[PASS] browser_agent tool tests completed successfully")
tests/tool_curl_exec.py ADDED
@@ -0,0 +1,442 @@
+ """
+ Tool 2: curl_exec — HTTP execution with truncation and episode indexing.
+
+ Pipeline:
+     1. Parse curl command string → extract method, URL, headers, body
+     2. Execute via subprocess (or mock in test mode)
+     3. Index full response into episode BM25 store (before truncation)
+     4. Truncate response body for context window
+     5. Return {status_code, headers, body}
+ """
+
+ import json
+ import re
+ import shlex
+ from typing import Any
+
+ # ---------------------------------------------------------------------------
+ # Curl command parser
+ # ---------------------------------------------------------------------------
+
+
+ def parse_curl_command(command: str) -> dict:
+     """
+     Parse a curl command string into structured components.
+     Returns: {method, url, headers: dict, body: str|None}
+     """
+     # Handle the command as a shell argument list
+     try:
+         parts = shlex.split(command)
+     except ValueError:
+         return {"error": "Failed to parse curl command"}
+
+     # Remove 'curl' prefix if present
+     if parts and parts[0] == "curl":
+         parts = parts[1:]
+
+     result = {
+         "method": "GET",
+         "url": None,
+         "headers": {},
+         "body": None,
+     }
+
+     i = 0
+     while i < len(parts):
+         part = parts[i]
+
+         if part in ("-X", "--request"):
+             i += 1
+             if i < len(parts):
+                 result["method"] = parts[i].upper()
+
+         elif part in ("-H", "--header"):
+             i += 1
+             if i < len(parts):
+                 header = parts[i]
+                 if ":" in header:
+                     key, val = header.split(":", 1)
+                     result["headers"][key.strip()] = val.strip()
+
+         elif part in ("-d", "--data", "--data-raw"):
+             i += 1
+             if i < len(parts):
+                 result["body"] = parts[i]
+                 if result["method"] == "GET":
+                     result["method"] = "POST"
+
+         elif not part.startswith("-"):
+             result["url"] = part
+
+         i += 1
+
+     return result
+
75
+
76
+ # ---------------------------------------------------------------------------
77
+ # Response truncation
78
+ # ---------------------------------------------------------------------------
79
+
80
+ TRUNCATE_LIST_AT = 2
81
+ LARGE_ARRAY_THRESHOLD = 3
82
+ NONJSON_MAX_CHARS = 1000
83
+
84
+
85
+ def _is_json(s: str) -> bool:
86
+ try:
87
+ json.loads(s)
88
+ return True
89
+ except (ValueError, TypeError):
90
+ return False
91
+
92
+
93
+ def truncate_response_body(body: str, status_code: int) -> str:
94
+ """Apply smart truncation rules to response body."""
95
+ # Rule 3: never truncate errors
96
+ if status_code >= 400:
97
+ return body
98
+
99
+ # Rule 1: non-JSON
100
+ if not _is_json(body):
101
+ if len(body) > NONJSON_MAX_CHARS:
102
+ return body[:NONJSON_MAX_CHARS] + " [truncated - non-JSON response]"
103
+ return body
104
+
105
+ parsed = json.loads(body)
106
+
107
+ # Rule 2: primitive
108
+ if not isinstance(parsed, (dict, list)):
109
+ return body
110
+
111
+ # Handle top-level array
112
+ if isinstance(parsed, list):
113
+ if (len(parsed) >= LARGE_ARRAY_THRESHOLD
114
+ and len(parsed) > 0 and isinstance(parsed[0], dict)):
115
+ result = parsed[:TRUNCATE_LIST_AT]
116
+ note = {"_list_truncated": {
117
+ "shown": TRUNCATE_LIST_AT,
118
+ "total": len(parsed),
119
+ "note": f"Showing {TRUNCATE_LIST_AT} of {len(parsed)} items. "
120
+ "Use search_episode_data() to find a specific item from this response."
121
+ }}
122
+ return json.dumps(result + [note])
123
+ return body
124
+
125
+ # Handle dict — check each value for large arrays
126
+ needs_truncation = {
127
+ k for k, v in parsed.items()
128
+ if isinstance(v, list) and len(v) >= LARGE_ARRAY_THRESHOLD
129
+ and len(v) > 0 and isinstance(v[0], dict)
130
+ }
131
+ if not needs_truncation:
132
+ return body
133
+
134
+ result = {}
135
+ total_truncated = {}
136
+ for k, v in parsed.items():
137
+ if k in needs_truncation:
138
+ result[k] = v[:TRUNCATE_LIST_AT]
139
+ total_truncated[k] = len(v)
140
+ else:
141
+ result[k] = v
142
+
143
+ result["_list_truncated"] = {
144
+ "fields": total_truncated,
145
+ "shown_per_field": TRUNCATE_LIST_AT,
146
+ "note": (
147
+ "List fields truncated: "
148
+ + ", ".join(f"{k} showing {TRUNCATE_LIST_AT}/{n}"
149
+ for k, n in total_truncated.items())
150
+ + ". Use search_episode_data() to find a specific item from this response."
151
+ )
152
+ }
153
+ return json.dumps(result)
154
+
155
+
156
+ # ---------------------------------------------------------------------------
157
+ # Episode index document construction
158
+ # ---------------------------------------------------------------------------
159
+
160
+ def build_index_documents(step: int, method: str, path: str,
161
+ request_body: Any, response_body: Any,
162
+ status_code: int) -> list[str]:
163
+ """
164
+ Build BM25-indexable documents from a curl_exec result.
165
+ Called BEFORE truncation so all items are indexed.
166
+ """
167
+ docs = []
168
+
169
+ # Index request body
170
+ if request_body is not None:
171
+ docs.append(
172
+ f"step:{step} source:request endpoint:{method} {path} "
173
+ f"body:{json.dumps(request_body, ensure_ascii=False) if isinstance(request_body, (dict, list)) else str(request_body)}"
174
+ )
175
+
176
+ # Index response body
177
+ if response_body is None:
178
+ return docs
179
+
180
+ if isinstance(response_body, str) and not _is_json(response_body):
181
+ docs.append(
182
+ f"step:{step} source:response endpoint:{method} {path} "
183
+ f"status:{status_code} body:{response_body[:500]}"
184
+ )
185
+ return docs
186
+
187
+ parsed = json.loads(response_body) if isinstance(response_body, str) else response_body
188
+
189
+ # Primitive value
190
+ if isinstance(parsed, (str, int, float, bool)) or parsed is None:
191
+ docs.append(
192
+ f"step:{step} source:response endpoint:{method} {path} "
193
+ f"status:{status_code} value:{parsed}"
194
+ )
195
+ return docs
196
+
197
+ # Top-level array
198
+ if isinstance(parsed, list):
199
+ for item in parsed:
200
+ if isinstance(item, dict):
201
+ docs.append(
202
+ f"step:{step} source:response endpoint:{method} {path} "
203
+ f"status:{status_code} item:{json.dumps(item, ensure_ascii=False)}"
204
+ )
205
+ else:
206
+ docs.append(
207
+ f"step:{step} source:response endpoint:{method} {path} "
208
+ f"status:{status_code} value:{item}"
209
+ )
210
+ return docs
211
+
212
+ # Dict — find array fields
213
+ array_fields = {k: v for k, v in parsed.items()
214
+ if isinstance(v, list) and len(v) > 0 and isinstance(v[0], dict)}
215
+ scalar_fields = {k: v for k, v in parsed.items() if k not in array_fields}
216
+
217
+ if not array_fields:
218
+ docs.append(
219
+ f"step:{step} source:response endpoint:{method} {path} "
220
+ f"status:{status_code} data:{json.dumps(parsed, ensure_ascii=False)}"
221
+ )
222
+ return docs
223
+
224
+ # Array fields — one doc per item with parent context
225
+ parent_context = (
226
+ f"step:{step} source:response endpoint:{method} {path} status:{status_code} "
227
+ + " ".join(f"{k}:{v}" for k, v in scalar_fields.items()
228
+ if not isinstance(v, (dict, list)))
229
+ )
230
+ for field_name, items in array_fields.items():
231
+ for item in items:
232
+ flat_item = {}
233
+ for k, v in item.items():
234
+ flat_item[k] = json.dumps(v) if isinstance(v, (list, dict)) else v
235
+ docs.append(
236
+ f"{parent_context} list_field:{field_name} "
237
+ f"item:{json.dumps(flat_item, ensure_ascii=False)}"
238
+ )
239
+
240
+ return docs
241
+
242
+
243
+ # ---------------------------------------------------------------------------
244
+ # Mock execution (for testing)
245
+ # ---------------------------------------------------------------------------
246
+
247
+ # Mock responses keyed by (method, path_pattern)
248
+ MOCK_RESPONSES = {
249
+ ("GET", "/rest/V1/categories"): {
250
+ "status_code": 200,
251
+ "headers": {"Content-Type": "application/json"},
252
+ "body": json.dumps({
253
+ "id": 1, "name": "Root",
254
+ "children_data": [
255
+ {"id": 2, "name": "Default Category"},
256
+ {"id": 3, "name": "Beauty & Personal Care"}
257
+ ]
258
+ })
259
+ },
260
+ ("GET", "/rest/V1/products"): {
261
+ "status_code": 200,
262
+ "headers": {"Content-Type": "application/json"},
263
+ "body": json.dumps({
264
+ "items": [
265
+ {"sku": "MH01", "name": "Radiant Tee", "price": 22.0, "type_id": "simple"},
266
+ {"sku": "MH02", "name": "Breathe-Easy Tank", "price": 34.0, "type_id": "simple"},
267
+ {"sku": "MH03", "name": "Stellar Solar Jacket", "price": 75.0, "type_id": "configurable"},
268
+ {"sku": "MH04", "name": "Argus All-Weather Tank", "price": 22.0, "type_id": "simple"},
269
+ ],
270
+ "total_count": 4
271
+ })
272
+ },
273
+ ("POST", "/rest/V1/guest-carts"): {
274
+ "status_code": 200,
275
+ "headers": {"Content-Type": "application/json"},
276
+ "body": '"cart-mock-abc123"'
277
+ },
278
+ ("POST", "/rest/V1/guest-carts/{id}/items"): {
279
+ "status_code": 200,
280
+ "headers": {"Content-Type": "application/json"},
281
+ "body": json.dumps({
282
+ "item_id": 5, "sku": "MH01", "qty": 1,
283
+ "name": "Radiant Tee", "price": 22.0,
284
+ "product_type": "simple", "quote_id": "cart-mock-abc123"
285
+ })
286
+ },
287
+ }
288
+
289
+
290
+ def mock_curl_exec(command: str, step: int, episode_index: list) -> dict:
291
+ """
292
+ Mock curl_exec for testing. Matches against MOCK_RESPONSES.
293
+ Also builds index documents and adds to episode_index.
294
+ """
295
+ parsed = parse_curl_command(command)
296
+ if "error" in parsed:
297
+ return {"status_code": 0, "error": parsed["error"]}
298
+
299
+ method = parsed["method"]
300
+ url = parsed["url"]
301
+ from urllib.parse import urlparse
302
+ path = urlparse(url).path
303
+
304
+ # Try exact match first, then pattern match
305
+ response = None
306
+ for (m, p), resp in MOCK_RESPONSES.items():
307
+ if m != method:
308
+ continue
309
+ # Replace {id} with regex for matching
310
+ pattern = re.sub(r'\{[^}]+\}', r'[^/]+', p)
311
+ if re.fullmatch(pattern, path):
312
+ response = resp
313
+ break
314
+
315
+ if response is None:
316
+ response = {
317
+ "status_code": 404,
318
+ "headers": {"Content-Type": "application/json"},
319
+ "body": json.dumps({"message": f"No mock for {method} {path}"})
320
+ }
321
+
322
+ # Build index documents BEFORE truncation
323
+ req_body = None
324
+ if parsed["body"]:
325
+ try:
326
+ req_body = json.loads(parsed["body"])
327
+ except (json.JSONDecodeError, TypeError):
328
+ req_body = parsed["body"]
329
+
330
+ index_docs = build_index_documents(
331
+ step=step,
332
+ method=method,
333
+ path=path,
334
+ request_body=req_body,
335
+ response_body=response["body"],
336
+ status_code=response["status_code"]
337
+ )
338
+ episode_index.extend(index_docs)
339
+
340
+ # Truncate body for context
341
+ truncated_body = truncate_response_body(response["body"], response["status_code"])
342
+
343
+ return {
344
+ "status_code": response["status_code"],
345
+ "headers": response["headers"],
346
+ "body": json.loads(truncated_body) if _is_json(truncated_body) else truncated_body,
347
+ }
348
+
349
+
350
+ # ---------------------------------------------------------------------------
351
+ # Test
352
+ # ---------------------------------------------------------------------------
353
+
354
+ if __name__ == "__main__":
355
+ print("=" * 70)
356
+ print("TEST: curl_exec with mock responses")
357
+ print("=" * 70)
358
+
359
+ # Test curl parsing
360
+ print("\n--- Curl Command Parsing ---")
361
+ commands = [
362
+ 'curl http://localhost:7770/rest/V1/categories',
363
+ 'curl -X POST http://localhost:7770/rest/V1/guest-carts -H "Content-Type: application/json"',
364
+ "curl -X POST 'http://localhost:7770/rest/V1/guest-carts/cart-abc/items' -H 'Content-Type: application/json' -d '{\"cartItem\":{\"sku\":\"MH01\",\"qty\":1}}'",
365
+ ]
366
+ for cmd in commands:
367
+ parsed = parse_curl_command(cmd)
368
+ print(f" {cmd[:70]}...")
369
+ print(f" method={parsed['method']} url={parsed['url']} body={'yes' if parsed['body'] else 'no'}")
370
+
371
+ # Test truncation
372
+ print("\n--- Response Truncation ---")
373
+
374
+ # Primitive (never truncated)
375
+ assert truncate_response_body('"cart-abc123"', 200) == '"cart-abc123"'
376
+ print(" [OK] Primitive string not truncated")
377
+
378
+ # Error (never truncated)
379
+ long_error = json.dumps({"message": "x" * 2000})
380
+ assert truncate_response_body(long_error, 400) == long_error
381
+ print(" [OK] Error response not truncated")
382
+
383
+ # Small object (not truncated)
384
+ small = json.dumps({"id": 1, "name": "test"})
385
+ assert truncate_response_body(small, 200) == small
386
+ print(" [OK] Small object not truncated")
387
+
388
+ # Large array in dict (truncated to 2 items)
389
+ large = json.dumps({
390
+ "items": [{"sku": f"P{i}", "name": f"Product {i}"} for i in range(20)],
391
+ "total_count": 20
392
+ })
393
+ result = json.loads(truncate_response_body(large, 200))
394
+ assert len(result["items"]) == 2
395
+ assert "_list_truncated" in result
396
+ assert result["_list_truncated"]["fields"]["items"] == 20
397
+ print(f" [OK] Large array truncated: 20 items → {len(result['items'])} shown")
398
+ print(f" Note: {result['_list_truncated']['note'][:80]}...")
399
+
400
+ # Top-level array (truncated)
401
+ top_array = json.dumps([{"id": i, "name": f"Item {i}"} for i in range(10)])
402
+ result = json.loads(truncate_response_body(top_array, 200))
403
+ assert len(result) == 3 # 2 items + truncation note
404
+ print(f" [OK] Top-level array truncated: 10 items → 2 shown + note")
405
+
406
+ # Test mock execution with indexing
407
+ print("\n--- Mock Execution + Indexing ---")
408
+ episode_index = []
409
+
410
+ # Step 1: Get categories
411
+ r = mock_curl_exec("curl http://localhost:7770/rest/V1/categories", 1, episode_index)
412
+ print(f" Step 1: GET /categories → {r['status_code']}, body keys: {list(r['body'].keys()) if isinstance(r['body'], dict) else 'primitive'}")
413
+
414
+ # Step 2: Search products
415
+ r = mock_curl_exec(
416
+ "curl 'http://localhost:7770/rest/V1/products?searchCriteria[filter]=name'",
417
+ 2, episode_index
418
+ )
419
+ print(f" Step 2: GET /products → {r['status_code']}, items shown: {len(r['body'].get('items', []))}, total: {r['body'].get('total_count', '?')}")
420
+ if "_list_truncated" in r["body"]:
421
+ print(f" Truncated: {r['body']['_list_truncated']['note'][:60]}...")
422
+
423
+ # Step 3: Create cart
424
+ r = mock_curl_exec(
425
+ "curl -X POST http://localhost:7770/rest/V1/guest-carts -H 'Content-Type: application/json'",
426
+ 3, episode_index
427
+ )
428
+ print(f" Step 3: POST /guest-carts → {r['status_code']}, cart_id: {r['body']}")
429
+
430
+ # Step 4: Add item
431
+ r = mock_curl_exec(
432
+ 'curl -X POST http://localhost:7770/rest/V1/guest-carts/cart-mock-abc123/items -H "Content-Type: application/json" -d \'{"cartItem":{"sku":"MH01","qty":1,"quote_id":"cart-mock-abc123"}}\'',
433
+ 4, episode_index
434
+ )
435
+ print(f" Step 4: POST /guest-carts/.../items → {r['status_code']}, item_id: {r['body'].get('item_id')}")
436
+
437
+ # Show episode index
438
+ print(f"\n--- Episode Index ({len(episode_index)} documents) ---")
439
+ for i, doc in enumerate(episode_index):
440
+ print(f" [{i}] {doc[:120]}...")
441
+
442
+ print("\n[PASS] curl_exec tool tests completed successfully")
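The parser above leans on `shlex.split` to honor shell quoting, which is why quoted URLs and `-H`/`-d` values survive as single tokens with the quotes removed. A minimal standalone sketch (not importing the module; the command string is made up):

```python
import shlex

cmd = ("curl -X POST 'http://localhost:7770/rest/V1/guest-carts' "
       "-H 'Content-Type: application/json' -d '{\"cartItem\":{\"sku\":\"MH01\"}}'")
parts = shlex.split(cmd)

# Quoted arguments come back as single tokens, quotes stripped
assert parts[:3] == ["curl", "-X", "POST"]
assert parts[3] == "http://localhost:7770/rest/V1/guest-carts"
assert parts[parts.index("-H") + 1] == "Content-Type: application/json"
assert parts[parts.index("-d") + 1] == '{"cartItem":{"sku":"MH01"}}'
```

This is also why `parse_curl_command` wraps the call in `try/except ValueError`: `shlex.split` raises on unbalanced quotes.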
tests/tool_search_endpoints.py ADDED
@@ -0,0 +1,239 @@
+ """
+ Tool 1: search_endpoints — Semantic search over endpoint catalog.
+
+ Uses GEMMA embeddings (google/embeddinggemma-300m) for semantic search.
+ Falls back to keyword matching when GEMMA is not available (test mode).
+ """
+
+ import json
+ import math
+ import os
+ import re
+ from collections import Counter
+
+ # ---------------------------------------------------------------------------
+ # Keyword-based fallback search (for testing without the GEMMA model)
+ # Uses TF-IDF-like scoring
+ # ---------------------------------------------------------------------------
+
+
+ def _tokenize(text: str) -> list[str]:
+     """Simple whitespace + punctuation tokenizer."""
+     return re.findall(r'[a-zA-Z0-9_/{}]+', text.lower())
+
+
+ class KeywordSearchIndex:
+     """Simple TF-IDF search index for testing without neural embeddings."""
+
+     def __init__(self):
+         self.documents: list[str] = []
+         self.doc_tokens: list[list[str]] = []
+         self.idf: dict[str, float] = {}
+
+     def add_documents(self, docs: list[str]):
+         self.documents = docs
+         self.doc_tokens = [_tokenize(d) for d in docs]
+         self._build_idf()
+
+     def _build_idf(self):
+         n = len(self.documents)
+         df = Counter()
+         for tokens in self.doc_tokens:
+             for t in set(tokens):
+                 df[t] += 1
+         self.idf = {t: math.log(n / (1 + count)) for t, count in df.items()}
+
+     def search(self, query: str, top_k: int = 3) -> list[tuple[int, float, str]]:
+         """Returns list of (index, score, document) tuples."""
+         query_tokens = _tokenize(query)
+         scores = []
+         for i, doc_toks in enumerate(self.doc_tokens):
+             tf = Counter(doc_toks)
+             score = sum(
+                 (tf.get(qt, 0) / max(len(doc_toks), 1)) * self.idf.get(qt, 0)
+                 for qt in query_tokens
+             )
+             scores.append((i, score, self.documents[i]))
+
+         scores.sort(key=lambda x: x[1], reverse=True)
+         return scores[:top_k]
+
+
+ # ---------------------------------------------------------------------------
+ # Catalog loading
+ # ---------------------------------------------------------------------------
+
+ def load_catalog(catalog_path: str) -> list[dict]:
+     """Load a ground truth catalog JSON file."""
+     with open(catalog_path) as f:
+         data = json.load(f)
+     if isinstance(data, list):
+         return data
+     return data.get("endpoints", [])
+
+
+ def catalog_entry_to_text(entry: dict, app_name: str = "") -> str:
+     """Convert a catalog endpoint to a searchable text document."""
+     parts = []
+     if app_name:
+         parts.append(f"app: {app_name}")
+
+     endpoint = entry.get("endpoint", "")
+     parts.append(f"endpoint: {endpoint}")
+
+     auth = entry.get("auth", "none")
+     parts.append(f"auth: {auth}")
+
+     # Query params
+     qp = entry.get("query_params", {})
+     if qp:
+         param_strs = []
+         for k, v in qp.items():
+             if isinstance(v, dict):
+                 param_strs.append(f"{k} ({v.get('type', '?')}, source: {v.get('source', '?')})")
+             else:
+                 param_strs.append(f"{k}: {v}")
+         parts.append(f"query_params: {', '.join(param_strs)}")
+
+     # Path params
+     pp = entry.get("path_params", {})
+     if pp:
+         param_strs = []
+         for k, v in pp.items():
+             if isinstance(v, dict):
+                 src = v.get("source", "?")
+                 from_ep = v.get("from_endpoint", "")
+                 param_strs.append(f"{k} ({v.get('type', '?')}, source: {src}, from: {from_ep})")
+             else:
+                 param_strs.append(f"{k}: {v}")
+         parts.append(f"path_params: {', '.join(param_strs)}")
+
+     # Body params
+     bp = entry.get("body_params", entry.get("form_params", {}))
+     if bp:
+         param_strs = []
+         for k, v in bp.items():
+             if isinstance(v, dict):
+                 src = v.get("source", "?")
+                 param_strs.append(f"{k} ({v.get('type', '?')}, source: {src})")
+             else:
+                 param_strs.append(f"{k}: {v}")
+         parts.append(f"body_params: {', '.join(param_strs)}")
+
+     # Response fields
+     rkf = entry.get("response_key_fields", [])
+     if rkf:
+         parts.append(f"returns: {', '.join(str(f) for f in rkf)}")
+
+     # Notes
+     notes = entry.get("notes", "")
+     if notes:
+         parts.append(f"notes: {notes}")
+
+     return " | ".join(parts)
+
+
+ # ---------------------------------------------------------------------------
+ # search_endpoints tool
+ # ---------------------------------------------------------------------------
+
+ class SearchEndpoints:
+     """
+     Tool 1 implementation.
+     Loads catalog, builds search index, provides search interface.
+     """
+
+     def __init__(self):
+         self.index = KeywordSearchIndex()
+         self.raw_entries: list[dict] = []
+         self.text_chunks: list[str] = []
+
+     def load_catalog(self, catalog_path: str, app_name: str = ""):
+         """Load a catalog and build the search index."""
+         self.raw_entries = load_catalog(catalog_path)
+         self.text_chunks = [catalog_entry_to_text(e, app_name) for e in self.raw_entries]
+         self.index.add_documents(self.text_chunks)
+
+     def load_from_browser_agent(self, text_chunks: list[str]):
+         """Load text chunks produced by browser_agent Stage 4."""
+         self.text_chunks = text_chunks
+         self.index.add_documents(text_chunks)
+
+     def search(self, query: str, top_k: int = 3) -> list[str]:
+         """
+         Search endpoints by natural language query.
+         Returns top-k matching endpoint schema texts.
+         """
+         results = self.index.search(query, top_k)
+         return [doc for _, _, doc in results]
+
+     def search_with_scores(self, query: str, top_k: int = 3) -> list[tuple[float, str]]:
+         """Search with scores for debugging."""
+         results = self.index.search(query, top_k)
+         return [(score, doc) for _, score, doc in results]
+
+
+ # ---------------------------------------------------------------------------
+ # Test
+ # ---------------------------------------------------------------------------
+
+ if __name__ == "__main__":
+     print("=" * 70)
+     print("TEST: search_endpoints with browser_agent output")
+     print("=" * 70)
+
+     # PRIMARY TEST: load from browser_agent output (this is the real data flow).
+     # In production, search_endpoints searches GEMMA embeddings built by browser_agent
+     # from HAR data. Here we exercise the keyword-search fallback instead of GEMMA.
+     print("\n--- Primary: load from browser_agent HAR output ---")
+     from tool_browser_agent import extract_openapi_spec, spec_entry_to_text
+
+     mock_har_path = os.path.join(os.path.dirname(__file__), "mock_data", "mock_har.json")
+     with open(mock_har_path) as f:
+         har_data = json.load(f)
+
+     spec = extract_openapi_spec(har_data, "http://localhost:7770/")
+     chunks = [spec_entry_to_text(e, "shopping") for e in spec]
+
+     tool = SearchEndpoints()
+     tool.load_from_browser_agent(chunks)
+
+     print(f"\nLoaded {len(tool.text_chunks)} endpoint documents from browser_agent output\n")
+     for i, chunk in enumerate(tool.text_chunks):
+         print(f"  [{i}] {chunk[:100]}...")
+
+     # Test queries against browser_agent output
+     queries = [
+         "find product by name get sku",
+         "create guest cart",
+         "add item to guest cart",
+         "authenticate customer login",
+         "shipping methods for cart",
+         "get cart total",
+         "list categories",
+     ]
+
+     print("\n--- Search Results (from browser_agent HAR output) ---\n")
+     for q in queries:
+         print(f"Query: \"{q}\"")
+         results = tool.search_with_scores(q, top_k=3)
+         for score, doc in results:
+             # Extract just the endpoint name for display
+             ep_match = re.search(r'endpoint: (\S+ \S+)', doc)
+             ep_name = ep_match.group(1) if ep_match else doc[:60]
+             print(f"  [{score:.3f}] {ep_name}")
+         print()
+
+     # SECONDARY TEST: catalog loading (used by the judge for ground truth, NOT by search_endpoints)
+     print("--- Secondary: catalog loading (for judge ground truth, not search_endpoints) ---")
+     catalog_path = os.path.join(os.path.dirname(__file__), "mock_data", "mock_catalog.json")
+
+     tool2 = SearchEndpoints()
+     tool2.load_catalog(catalog_path, app_name="shopping")
+     print(f"  Catalog loaded: {len(tool2.text_chunks)} endpoint documents (judge reference only)")
+
+     results = tool2.search("add item to cart", top_k=1)
+     print("  Query: 'add item to cart' → top result:")
+     print(f"    {results[0][:120]}...")
+
+     print("\n[PASS] search_endpoints tool tests completed successfully")
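The fallback index scores each document as a sum over query terms of length-normalized term frequency times log inverse document frequency. A standalone sketch of the same scoring, with three toy endpoint documents (the document strings here are made up for illustration):

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Same token pattern as the fallback tokenizer
    return re.findall(r"[a-zA-Z0-9_/{}]+", text.lower())

docs = [
    "endpoint: GET /rest/V1/products | search products by name",
    "endpoint: POST /rest/V1/guest-carts | create a guest cart",
    "endpoint: GET /rest/V1/categories | list categories",
]
tokens = [tokenize(d) for d in docs]
n = len(docs)

# Document frequency → IDF; terms unique to one doc score highest
df = Counter(t for toks in tokens for t in set(toks))
idf = {t: math.log(n / (1 + c)) for t, c in df.items()}

query = tokenize("create guest cart")
scores = [
    sum((Counter(toks)[q] / len(toks)) * idf.get(q, 0) for q in query)
    for toks in tokens
]
best = scores.index(max(scores))  # the guest-cart document wins
```

Note that with very small corpora the `log(n / (1 + df))` variant can go to zero or negative for common terms, which is acceptable for ranking but would not be for absolute thresholds.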
tests/tool_search_episode_data.py ADDED
@@ -0,0 +1,273 @@
+ """
+ Tool 3: search_episode_data — BM25 search over episode request/response history.
+
+ Keyword search over all data indexed by curl_exec calls, implemented as a
+ minimal pure-Python BM25 so the tests run without the rank_bm25 dependency.
+ """
+
+ import math
+ import re
+ from collections import Counter
+
+ # ---------------------------------------------------------------------------
+ # Simple BM25 implementation (no external dependencies)
+ # ---------------------------------------------------------------------------
+
+
+ def _tokenize(text: str) -> list[str]:
+     """Tokenize text into words."""
+     return re.findall(r'[a-zA-Z0-9_./{}:]+', text.lower())
+
+
+ class SimpleBM25:
+     """
+     Minimal BM25 implementation for episode data search.
+     No external dependencies — pure Python.
+     """
+
+     def __init__(self, k1: float = 1.5, b: float = 0.75):
+         self.k1 = k1
+         self.b = b
+         self.corpus: list[str] = []
+         self.tokenized: list[list[str]] = []
+         self.doc_len: list[int] = []
+         self.avgdl: float = 0
+         self.idf: dict[str, float] = {}
+         self.n_docs: int = 0
+
+     def index(self, documents: list[str]):
+         """Build BM25 index from documents."""
+         self.corpus = documents
+         self.tokenized = [_tokenize(d) for d in documents]
+         self.doc_len = [len(t) for t in self.tokenized]
+         self.n_docs = len(documents)
+         self.avgdl = sum(self.doc_len) / max(self.n_docs, 1)
+         self._compute_idf()
+
+     def _compute_idf(self):
+         """Compute the standard BM25 IDF for every term in the corpus."""
+         df = Counter()
+         for tokens in self.tokenized:
+             for t in set(tokens):
+                 df[t] += 1
+         self.idf = {
+             term: math.log((self.n_docs - freq + 0.5) / (freq + 0.5) + 1)
+             for term, freq in df.items()
+         }
+
+     def add_documents(self, new_docs: list[str]):
+         """Incrementally add documents and rebuild the index."""
+         self.corpus.extend(new_docs)
+         new_tokenized = [_tokenize(d) for d in new_docs]
+         self.tokenized.extend(new_tokenized)
+         self.doc_len.extend(len(t) for t in new_tokenized)
+         self.n_docs = len(self.corpus)
+         self.avgdl = sum(self.doc_len) / max(self.n_docs, 1)
+         self._compute_idf()
+
+     def search(self, query: str, top_k: int = 5) -> list[tuple[int, float, str]]:
+         """
+         Search for query in corpus.
+         Returns: list of (doc_index, score, document) tuples, sorted by score descending.
+         """
+         query_tokens = _tokenize(query)
+         scores = []
+
+         for i, doc_tokens in enumerate(self.tokenized):
+             score = 0.0
+             tf = Counter(doc_tokens)
+             dl = self.doc_len[i]
+
+             for qt in query_tokens:
+                 if qt not in self.idf:
+                     continue
+                 term_freq = tf.get(qt, 0)
+                 idf = self.idf[qt]
+                 numerator = term_freq * (self.k1 + 1)
+                 denominator = term_freq + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
+                 score += idf * numerator / denominator
+
+             scores.append((i, score, self.corpus[i]))
+
+         scores.sort(key=lambda x: x[1], reverse=True)
+         return scores[:top_k]
+
+
+ # ---------------------------------------------------------------------------
+ # Episode data store
+ # ---------------------------------------------------------------------------
+
+ class EpisodeDataStore:
+     """
+     Per-episode BM25 index over all request/response bodies.
+     Initialized empty at episode start, grows with each curl_exec call.
+     Discarded at episode end.
+     """
+
+     def __init__(self):
+         self.bm25 = SimpleBM25()
+         self.bm25.index([])  # Initialize empty
+
+     def add_documents(self, docs: list[str]):
+         """Add new documents (from a curl_exec call) to the index."""
+         self.bm25.add_documents(docs)
+
+     def search(self, query: str, top_k: int = 5) -> list[str]:
+         """
+         Search episode data by keyword query.
+         Returns top-k matching documents as strings.
+         """
+         if self.bm25.n_docs == 0:
+             return []
+         results = self.bm25.search(query, top_k)
+         return [doc for _, score, doc in results if score > 0]
+
+     def search_with_scores(self, query: str, top_k: int = 5) -> list[tuple[float, str]]:
+         """Search with scores for debugging."""
+         results = self.bm25.search(query, top_k)
+         return [(score, doc) for _, score, doc in results]
+
+     @property
+     def doc_count(self) -> int:
+         return self.bm25.n_docs
+
+     def reset(self):
+         """Clear all data (called at episode end)."""
+         self.bm25 = SimpleBM25()
+         self.bm25.index([])
+
+
+ # ---------------------------------------------------------------------------
+ # Test
+ # ---------------------------------------------------------------------------
+
+ if __name__ == "__main__":
+     print("=" * 70)
+     print("TEST: search_episode_data with simulated episode")
+     print("=" * 70)
+
+     import json
+
+     from tool_curl_exec import build_index_documents
+
+     store = EpisodeDataStore()
+
+     # Simulate an episode: 4 curl_exec calls building up the index
+
+     # Step 1: GET /categories
+     docs = build_index_documents(
+         step=1, method="GET", path="/rest/V1/categories",
+         request_body=None,
+         response_body=json.dumps({
+             "id": 1, "name": "Root",
+             "children_data": [
+                 {"id": 2, "name": "Default Category"},
+                 {"id": 3, "name": "Beauty & Personal Care"}
+             ]
+         }),
+         status_code=200
+     )
+     store.add_documents(docs)
+     print(f"\nStep 1: indexed {len(docs)} docs from GET /categories")
+
+     # Step 2: GET /products (with array items)
+     docs = build_index_documents(
+         step=2, method="GET", path="/rest/V1/products",
+         request_body=None,
+         response_body=json.dumps({
+             "items": [
+                 {"sku": "MH01", "name": "Radiant Tee", "price": 22.0},
+                 {"sku": "MH02", "name": "Breathe-Easy Tank", "price": 34.0},
+                 {"sku": "MH03", "name": "Stellar Solar Jacket", "price": 75.0},
+                 {"sku": "MH04", "name": "Argus All-Weather Tank", "price": 22.0},
+                 {"sku": "WS01", "name": "Iris Workout Top", "price": 29.0},
+             ],
+             "total_count": 5
+         }),
+         status_code=200
+     )
+     store.add_documents(docs)
+     print(f"Step 2: indexed {len(docs)} docs from GET /products (5 items)")
+
+     # Step 3: POST /guest-carts
+     docs = build_index_documents(
+         step=3, method="POST", path="/rest/V1/guest-carts",
+         request_body=None,
+         response_body='"cart-mock-abc123"',
+         status_code=200
+     )
+     store.add_documents(docs)
+     print(f"Step 3: indexed {len(docs)} docs from POST /guest-carts")
+
+     # Step 4: POST /guest-carts/.../items
+     docs = build_index_documents(
+         step=4, method="POST", path="/rest/V1/guest-carts/cart-mock-abc123/items",
+         request_body={"cartItem": {"sku": "MH01", "qty": 1, "quote_id": "cart-mock-abc123"}},
+         response_body=json.dumps({
+             "item_id": 5, "sku": "MH01", "qty": 1,
+             "name": "Radiant Tee", "price": 22.0
+         }),
+         status_code=200
+     )
+     store.add_documents(docs)
+     print(f"Step 4: indexed {len(docs)} docs from POST /guest-carts/.../items")
+
+     print(f"\nTotal documents in episode index: {store.doc_count}")
+
+     # Test searches
+     print("\n--- Search Tests ---\n")
+
+     queries = [
+         ("Radiant Tee sku", "Should find MH01 product"),
+         ("Stellar Solar Jacket price", "Should find MH03 at $75"),
+         ("cart-mock-abc123", "Should find cart ID"),
+         ("Beauty Personal Care", "Should find category"),
+         ("item_id 5", "Should find add-to-cart result"),
+         ("Iris Workout Top", "Should find WS01 product"),
+     ]
+
+     for query, description in queries:
+         results = store.search_with_scores(query, top_k=3)
+         print(f"Query: \"{query}\" ({description})")
+         if results:
+             for score, doc in results:
+                 print(f"  [{score:.3f}] {doc[:120]}...")
+         else:
+             print("  [NO RESULTS]")
+         print()
+
+     # Verify specific lookups
+     print("--- Specific Value Lookups ---\n")
+
+     # Can we find the product SKU from a name?
+     results = store.search("Radiant Tee", top_k=1)
+     found_sku = "MH01" in results[0] if results else False
+     print(f"  Find 'Radiant Tee' SKU: {'PASS' if found_sku else 'FAIL'} ({'MH01' if found_sku else 'not found'})")
+
+     # Can we find the cart ID?
+     results = store.search("cart guest-carts", top_k=3)
+     found_cart = any("cart-mock-abc123" in r for r in results)
+     print(f"  Find cart ID: {'PASS' if found_cart else 'FAIL'}")
+
+     # Can we tell from which step the data came?
+     results = store.search("Radiant Tee", top_k=1)
+     found_step = "step:2" in results[0] if results else False
+     print(f"  Step annotation present: {'PASS' if found_step else 'FAIL'}")
+
+     # Test reset
+     store.reset()
+     assert store.doc_count == 0
+     print(f"\n  Episode reset: doc_count = {store.doc_count} [PASS]")
+
+     print("\n[PASS] search_episode_data tool tests completed successfully")
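The per-term score SimpleBM25 accumulates is the standard Okapi BM25 formula. A worked example with made-up corpus statistics, mirroring the class defaults:

```python
import math

# BM25 parameters (SimpleBM25 defaults)
k1, b = 1.5, 0.75

# Made-up statistics for one query term in one document
n_docs = 10          # documents in the index
doc_freq = 2         # documents containing the term
tf = 3               # occurrences of the term in this document
dl, avgdl = 50, 40   # this document's length and the corpus average

# IDF rewards rare terms; the +1 keeps it non-negative
idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)

# TF saturates with k1; b penalizes documents longer than average
score = idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
```

The `k1` term caps how much repeated occurrences help, and `b` controls how strongly the above-average document length (`dl > avgdl` here) discounts the score.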