jskarthik45 committed on
Commit 0fe141f · 1 Parent(s): 8c883c7

Initial commit: Code Review Environment
README.md CHANGED
@@ -1,11 +1,262 @@
  ---
  title: CodeReviewEnv
  emoji: 🔍
- colorFrom: blue
- colorTo: indigo
  sdk: docker
  pinned: false
  ---
-
- # Code Review Environment
- My custom environment for code review tasks.
+ # 🔍 CodeReviewEnv
+
+ > An OpenEnv-compliant benchmark environment where AI agents act as senior engineers reviewing pull requests: catching bugs, finding security holes, and fixing broken code.
+
+ ---
+
+ ## Overview & Motivation
+
+ Code review is one of the highest-leverage activities in software engineering, yet it is time-consuming, inconsistent, and cognitively demanding. A model that can reliably triage pull requests, identify security vulnerabilities, and produce corrected patches would meaningfully accelerate software delivery.
+
+ **CodeReviewEnv** simulates exactly this. Three tasks of increasing difficulty present agents with realistic pull requests containing planted defects. The agent must reason over code, report issues with structured annotations, submit a corrected patch, and deliver a final verdict, all within a bounded step budget.
+
+ ---
+
+ ## Environment Architecture
+
+ ```
+ code-review-env/
+ ├── env.py                 # Core OpenEnv environment (reset / step / state)
+ ├── server.py              # FastAPI HTTP server exposing the OpenEnv interface
+ ├── models.py              # Pydantic typed models: Action, Observation, Reward, State
+ ├── openenv.yaml           # OpenEnv metadata
+ ├── tasks/
+ │   ├── task1_easy.py      # Bug hunt: simple Python utility
+ │   ├── task2_medium.py    # Security audit: Flask auth endpoint
+ │   └── task3_hard.py      # Correctness: distributed LRU cache
+ ├── graders/
+ │   └── grader.py          # Deterministic keyword + AST graders
+ ├── agents/
+ │   └── baseline_agent.py  # HF Inference API baseline (OpenAI-compatible)
+ ├── Dockerfile
+ ├── requirements.txt
+ └── README.md
+ ```
+
+ ---
+
+ ## Action Space
+
+ Each agent turn is a single `ReviewAction` JSON object:
+
+ | Field | Type | Description |
+ |---|---|---|
+ | `action_type` | `"review" \| "patch" \| "comment" \| "submit"` | What the agent is doing |
+ | `severity` | `"critical" \| "major" \| "minor" \| "info"` | Issue severity (for `review`) |
+ | `issue_type` | `"bug" \| "security" \| "performance" \| "logic" \| "style"` | Issue category |
+ | `line_number` | `int \| null` | Line the issue is on |
+ | `description` | `str` | Concise natural-language description of the issue |
+ | `patched_code` | `str \| null` | Full corrected code (for `patch` actions) |
+ | `comment` | `str \| null` | Free-form annotation |
+ | `verdict` | `"approve" \| "request_changes" \| "reject"` | Final verdict (for `submit`) |
+ | `confidence` | `float [0.0, 1.0]` | Agent's self-reported confidence |
+
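For illustration, a single review turn might emit the following (the field values here are hypothetical; the schema is the table above):

```python
import json

# A hypothetical "review" action flagging a critical bug. Optional fields
# that do not apply to this action type are left as None/null.
action = {
    "action_type": "review",
    "severity": "critical",
    "issue_type": "bug",
    "line_number": 3,
    "description": "Assignment operator = used instead of comparison ==",
    "patched_code": None,
    "comment": None,
    "verdict": None,
    "confidence": 0.9,
}

payload = json.dumps(action)  # the agent emits one such JSON object per turn
```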
54
+ ---
55
+
56
+ ## Observation Space
57
+
58
+ Each step returns an `Observation` containing:
59
+
60
+ | Field | Description |
61
+ |---|---|
62
+ | `task_id` | Identifier of the current task |
63
+ | `step` / `max_steps` | Current step and budget |
64
+ | `review_context` | Full PR: title, author, description, code files, linter output, test results |
65
+ | `previous_actions` | All actions taken so far this episode |
66
+ | `issues_found_so_far` | Structured list of issues reported |
67
+ | `score_so_far` | Running cumulative intermediate reward |
68
+ | `done` | Whether the episode has ended |
69
+
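A client loop typically reads only a few of these fields to decide what to do next. A minimal sketch, with invented values (the real `review_context` carries the full PR):

```python
# Hypothetical decoded observation; field names follow the table above,
# values are invented for illustration.
obs = {
    "task_id": "task_1_easy_bug_hunt",
    "step": 2,
    "max_steps": 8,
    "review_context": {"pull_request_title": "Add utility helpers", "files_changed": []},
    "previous_actions": [],
    "issues_found_so_far": [{"severity": "critical", "line": 3, "description": "= vs =="}],
    "score_so_far": 0.07,
    "done": False,
}

# Typical client-side bookkeeping: how much budget remains, and whether
# it is time to patch and submit rather than keep reviewing.
steps_left = obs["max_steps"] - obs["step"]
should_wrap_up = obs["done"] or steps_left <= 2
```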
+ ---
+
+ ## Reward Function
+
+ Reward is **dense**: provided at every step, not only at the end.
+
+ ### Intermediate (per-step)
+
+ | Signal | Value | Rationale |
+ |---|---|---|
+ | Step penalty | -0.01 | Encourages efficiency |
+ | Review with description | +0.05 | Rewards substantive annotations |
+ | Critical severity bonus | +0.03 | Rewards correct triage |
+ | Patch submitted | +0.10 | Rewards producing a fix |
+ | Repetition penalty | -0.05 | Penalises looping / copy-paste |
+
+ ### Terminal (on `submit` or step exhaustion)
+
+ The programmatic grader runs and returns a score in **[0.0, 1.0]** based on which issues were correctly identified and how well the submitted patch addresses them. This final score overwrites the episode total.
+
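The per-step signals above compose as a simple accumulator. The sketch below is an illustrative reconstruction from the table, not the environment's actual code:

```python
def intermediate_reward(action_type, severity=None, description=None, is_repeat=False):
    """Illustrative per-step reward, following the intermediate-signal table."""
    reward = -0.01                      # step penalty, applied every step
    if action_type == "review" and description:
        reward += 0.05                  # substantive annotation
        if severity == "critical":
            reward += 0.03              # correct-triage bonus
    if action_type == "patch":
        reward += 0.10                  # produced a fix
    if is_repeat:
        reward -= 0.05                  # looping / copy-paste
    return round(reward, 2)
```

A critical review with a description therefore nets 0.07 per step, matching the running `score_so_far` granularity.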
+ ---
+
+ ## Tasks
+
+ ### Task 1 - Easy: Bug Hunt (`task_1_easy_bug_hunt`)
+
+ **Max steps:** 8
+ **File reviewed:** `utils.py` (Python, 30 lines)
+
+ A developer submits three utility functions. Three bugs are planted:
+
+ | # | Line | Bug | Severity |
+ |---|---|---|---|
+ | 1 | 3 | `=` (assignment) used instead of `==` (comparison), causing a `SyntaxError` | Critical |
+ | 2 | 6 | `range(1, len(numbers) + 1)`: off-by-one causes an `IndexError` | Critical |
+ | 3 | 9 | Missing `return max_val`, so the function silently returns `None` | Major |
+
+ **Grading:** 30% per critical bug identified, 20% for the major bug, 20% for a syntactically valid patch with all three fixes applied.
+
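The actual `utils.py` ships with the environment and is not reproduced here; the function below is a hypothetical reconstruction of the kind of corrected patch the grader accepts, with comments marking where each planted bug was:

```python
def find_max(numbers):
    """Corrected version of the style of function Task 1 reviews."""
    if len(numbers) == 0:                 # bug 1: original used `=` instead of `==`
        return None
    max_val = numbers[0]
    for i in range(1, len(numbers)):      # bug 2: original ranged to len(numbers) + 1
        if numbers[i] > max_val:
            max_val = numbers[i]
    return max_val                        # bug 3: original omitted this return
```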
+ ---
+
+ ### Task 2 - Medium: Security Audit (`task_2_medium_security`)
+
+ **Max steps:** 12
+ **File reviewed:** `auth.py` (Flask, 55 lines)
+
+ A backend developer submits login and registration endpoints. Six security vulnerabilities are present:
+
+ | # | Line | Vulnerability | Severity |
+ |---|---|---|---|
+ | 1 | 23 | SQL injection in `login` query (f-string interpolation) | Critical |
+ | 2 | 44 | SQL injection in `register` INSERT | Critical |
+ | 3 | 39 | Plaintext password storage (no hashing) | Critical |
+ | 4 | - | No rate limiting on `/login` (brute-force possible) | Major |
+ | 5 | 30 | Sensitive data leakage: error distinguishes "wrong password" vs "user not found" | Major |
+ | 6 | 5 | Hardcoded `secret_key` in source | Major |
+
+ **Grading:** Weighted by severity. Patch checked for parameterized queries, password hashing, and environment variable use.
+
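A minimal sketch of the remediations the grader looks for, using only the standard library. `auth.py` itself is not reproduced here, and the table schema below is invented for illustration:

```python
import hashlib
import os
import sqlite3

def hash_password(password, salt=None):
    """Salted PBKDF2 digest instead of plaintext storage."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT PRIMARY KEY, salt BLOB, pw_hash BLOB)")

salt, pw_hash = hash_password("s3cret")
# Parameterized INSERT: placeholders instead of f-string interpolation,
# so user input can never be spliced into the SQL text.
conn.execute("INSERT INTO users VALUES (?, ?, ?)", ("alice", salt, pw_hash))

row = conn.execute(
    "SELECT salt, pw_hash FROM users WHERE username = ?", ("alice",)
).fetchone()
ok = row is not None and hash_password("s3cret", row[0])[1] == row[1]
```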
+ ---
+
+ ### Task 3 - Hard: Distributed Systems Correctness (`task_3_hard_perf_correctness`)
+
+ **Max steps:** 16
+ **File reviewed:** `cache.py` (Python, 55 lines)
+
+ A senior engineer submits a Redis-backed LRU cache claimed to be production-ready. Six issues lurk:
+
+ | # | Issue | Type | Severity |
+ |---|---|---|---|
+ | 1 | Non-atomic `EXISTS` + `GET` creates a race condition | Concurrency | Critical |
+ | 2 | Local `dict` grows unboundedly; the `capacity` parameter is ignored | Performance | Critical |
+ | 3 | `get_many` calls `self.get()` in a loop (N+1 round trips) | Performance | Major |
+ | 4 | `dict` preserves insertion order, not access order, so LRU eviction is wrong | Logic | Major |
+ | 5 | Shared `dict` modified without a `threading.Lock` | Concurrency | Critical |
+ | 6 | `pickle.loads` on bytes from Redis allows arbitrary code execution | Security | Critical |
+
+ **Grading:** Equally weighted. Patch checked structurally for `threading.Lock`, `OrderedDict.move_to_end`, `mget`, and `json` instead of `pickle`.
+
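A minimal sketch of the local-cache layer that the structural checks point at (`threading.Lock`, `OrderedDict.move_to_end`, `json` serialization). The Redis side, including `mget` batching, is omitted, and the class name is hypothetical:

```python
import json
import threading
from collections import OrderedDict

class LocalLRU:
    """Bounded, access-ordered, lock-guarded local cache layer."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()
        self._lock = threading.Lock()

    def put(self, key, value):
        with self._lock:                         # guard shared state
            self._data[key] = json.dumps(value)  # json, not pickle
            self._data.move_to_end(key)          # mark most-recently used
            while len(self._data) > self.capacity:
                self._data.popitem(last=False)   # evict least-recently used

    def get(self, key):
        with self._lock:
            if key not in self._data:
                return None
            self._data.move_to_end(key)          # a read also refreshes recency
            return json.loads(self._data[key])
```

Because `get` calls `move_to_end`, eviction follows access order rather than insertion order, which is exactly the distinction issue 4 tests.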
+ ---
+
+ ## Baseline Performance
+
+ Evaluated with `Qwen/Qwen2.5-72B-Instruct` via Hugging Face Inference API:
+
+ | Task | Score |
+ |---|---|
+ | Task 1 - Easy | 0.72 |
+ | Task 2 - Medium | 0.55 |
+ | Task 3 - Hard | 0.38 |
+ | **Aggregate** | **0.55** |
+
+ ---
+
+ ## Setup & Usage
+
+ ### 1. Local (Python)
+
+ ```bash
+ git clone <repo>
+ cd code-review-env
+ pip install -r requirements.txt
+ python server.py
+ # Server running at http://localhost:7860
+ ```
+
+ ### 2. Docker
+
+ ```bash
+ docker build -t code-review-env .
+ docker run -p 7860:7860 code-review-env
+ ```
+
+ ### 3. API Quickstart
+
+ ```bash
+ # Reset to task 1
+ curl -X POST http://localhost:7860/reset \
+   -H "Content-Type: application/json" \
+   -d '{"task_id": "task_1_easy_bug_hunt"}'
+
+ # Take a step
+ curl -X POST http://localhost:7860/step \
+   -H "Content-Type: application/json" \
+   -d '{
+     "session_id": "<session_id>",
+     "action": {
+       "action_type": "review",
+       "severity": "critical",
+       "issue_type": "bug",
+       "line_number": 3,
+       "description": "Assignment operator = used instead of comparison == on line 3"
+     }
+   }'
+ ```
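The same two calls can be made from Python with only the standard library. This is a sketch that assumes a running server; the response field names used in the commented walkthrough (such as `session_id`) are assumptions about the reply shape:

```python
import json
import urllib.request

BASE = "http://localhost:7860"  # server started via step 1 or step 2 above

def post(path, payload):
    """POST a JSON body to the environment server and decode the JSON reply."""
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# With a running server, an episode would start like this:
#   state = post("/reset", {"task_id": "task_1_easy_bug_hunt"})
#   obs = post("/step", {"session_id": state["session_id"],
#                        "action": {"action_type": "review",
#                                   "severity": "critical",
#                                   "issue_type": "bug",
#                                   "line_number": 3,
#                                   "description": "= used instead of =="}})
```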
+
+ ### 4. Run inference script
+
+ ```bash
+ export HF_TOKEN=hf_your_token_here
+ export API_BASE_URL=https://router.huggingface.co/v1
+ export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
+ python inference.py
+ ```
+
+ Expected stdout format:
+ ```
+ [START] task=task_1_easy_bug_hunt env=code-review-env model=Qwen/Qwen2.5-72B-Instruct
+ [STEP] step=1 action=review:assignment operator = instead of == reward=0.07 done=false error=null
+ [STEP] step=2 action=review:off-by-one in range reward=0.07 done=false error=null
+ [STEP] step=3 action=patch:fixed code reward=0.09 done=false error=null
+ [STEP] step=4 action=submit:request_changes reward=1.00 done=true error=null
+ [END] success=true steps=4 score=1.000 rewards=0.07,0.07,0.09,1.00
+ ```
+
+ ### 5. OpenEnv validation
+
+ ```bash
+ openenv validate .
+ ```
+
231
+ ---
232
+
233
+ ## HTTP API Reference
234
+
235
+ | Method | Endpoint | Description |
236
+ |---|---|---|
237
+ | `GET` | `/` | Environment info |
238
+ | `GET` | `/tasks` | List all tasks |
239
+ | `POST` | `/reset` | Start a new episode |
240
+ | `POST` | `/step` | Take an action |
241
+ | `GET` | `/state/{session_id}` | Inspect full environment state |
242
+ | `DELETE` | `/session/{session_id}` | Clean up session |
243
+
+ ---
+
+ ## Hugging Face Spaces Deployment
+
+ The `Dockerfile` targets port `7860` and runs as a non-root user, making it compatible with the HF Spaces Docker SDK out of the box. Tag the Space with `openenv`.
+
+ ```yaml
+ # README header for HF Spaces
  ---
  title: CodeReviewEnv
  emoji: 🔍
+ colorFrom: indigo
+ colorTo: blue
  sdk: docker
  pinned: false
+ tags:
+   - openenv
  ---
+ ```
 
 
__pycache__/env.cpython-312.pyc DELETED
Binary file (8.1 kB)
 
__pycache__/models.cpython-312.pyc DELETED
Binary file (4.16 kB)
 
graders/__pycache__/__init__.cpython-312.pyc DELETED
Binary file (145 Bytes)
 
graders/__pycache__/grader.cpython-312.pyc DELETED
Binary file (7.57 kB)
 
inference.py ADDED
@@ -0,0 +1,275 @@
+ """
+ inference.py - CodeReviewEnv baseline inference script.
+
+ Mandatory env vars:
+     API_BASE_URL   The API endpoint for the LLM.
+     MODEL_NAME     The model identifier to use for inference.
+     HF_TOKEN       Your Hugging Face / API key.
+
+ STDOUT format (strictly followed):
+     [START] task=<task_name> env=<benchmark> model=<model_name>
+     [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
+     [END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
+ """
+
+ import json
+ import os
+ import sys
+ import textwrap
+ from typing import Any, Dict, List, Optional
+
+ from openai import OpenAI
+
+ sys.path.insert(0, os.path.dirname(__file__))
+ from env import CodeReviewEnv, TASK_IDS
+ from models import ReviewAction
+
+ # ── Env vars ──────────────────────────────────────────────────────────────────
+ API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
+ API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
+ MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
+ BENCHMARK = "code-review-env"
+ SUCCESS_SCORE_THRESHOLD = 0.5
+
+ # ── Logging helpers ───────────────────────────────────────────────────────────
+
+ def log_start(task: str, env: str, model: str) -> None:
+     print(f"[START] task={task} env={env} model={model}", flush=True)
+
+
+ def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
+     error_val = error if error else "null"
+     done_val = str(done).lower()
+     action_clean = action.replace("\n", " ").replace("\r", "")[:120]
+     print(
+         f"[STEP] step={step} action={action_clean} reward={reward:.2f} done={done_val} error={error_val}",
+         flush=True,
+     )
+
+
+ def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
+     rewards_str = ",".join(f"{r:.2f}" for r in rewards)
+     print(
+         f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}",
+         flush=True,
+     )
+
+
+ # ── Prompts ───────────────────────────────────────────────────────────────────
+
+ SYSTEM_PROMPT = textwrap.dedent("""
+     You are an expert software engineer performing a thorough code review.
+     Your job is to:
+     1. Identify ALL bugs, security vulnerabilities, performance issues, and logic errors.
+     2. For each issue, output a JSON action with action_type="review".
+     3. After identifying all issues, output a patch with action_type="patch".
+     4. Finally, output action_type="submit" with your verdict.
+
+     Each response must be a single valid JSON object. No markdown, no explanation outside JSON.
+
+     Schema:
+     {
+       "action_type": "review" | "patch" | "comment" | "submit",
+       "severity": "critical" | "major" | "minor" | "info",
+       "issue_type": "bug" | "security" | "performance" | "logic" | "style",
+       "line_number": <int or null>,
+       "description": "<description of the issue>",
+       "patched_code": "<full corrected code>",
+       "comment": "<optional>",
+       "verdict": "approve" | "request_changes" | "reject",
+       "confidence": <0.0-1.0>
+     }
+
+     Output ONE JSON object per response. Be precise and thorough.
+ """).strip()
+
+
+ def build_user_prompt(obs: Dict[str, Any]) -> str:
+     ctx = obs["review_context"]
+     files_text = "\n\n".join(
+         f"=== {f['filename']} ({f['language']}) ===\n{f['content']}"
+         for f in ctx["files_changed"]
+     )
+     issues_so_far = obs.get("issues_found_so_far", [])
+
+     prompt = textwrap.dedent(f"""
+         Pull Request: {ctx['pull_request_title']}
+         Author: {ctx['author']}
+         Description: {ctx['description']}
+         Linter: {ctx.get('linter_output', 'N/A')}
+         Tests: {ctx.get('test_results', 'N/A')}
+
+         --- CODE ---
+         {files_text}
+         --- END CODE ---
+
+         Step: {obs['step']} / {obs['max_steps']}
+         Issues reported so far: {len(issues_so_far)}
+     """).strip()
+
+     if issues_so_far:
+         prompt += "\n\nIssues already reported (do NOT repeat these):"
+         for iss in issues_so_far:
+             prompt += f"\n  - [{iss.get('severity','?')}] line {iss.get('line','?')}: {iss.get('description','')}"
+
+     steps_left = obs['max_steps'] - obs['step']
+     if steps_left <= 2:
+         prompt += "\n\nYou are almost out of steps. Submit your patch and verdict NOW."
+     elif obs['step'] == 0:
+         prompt += "\n\nBegin your review. Output your first action as JSON."
+     else:
+         prompt += "\n\nContinue reviewing or submit if done. Output next action as JSON."
+
+     return prompt
+
+
+ # ── JSON extraction ───────────────────────────────────────────────────────────
+
+ def extract_json(text: str) -> Dict[str, Any]:
+     text = text.strip()
+     if text.startswith("```"):
+         lines = text.split("\n")
+         text = "\n".join(lines[1:-1]) if len(lines) > 2 else text
+     try:
+         return json.loads(text)
+     except json.JSONDecodeError:
+         pass
+     start = text.find("{")
+     if start == -1:
+         raise ValueError("No JSON object found in response")
+     depth = 0
+     for i, ch in enumerate(text[start:], start):
+         if ch == "{":
+             depth += 1
+         elif ch == "}":
+             depth -= 1
+             if depth == 0:
+                 return json.loads(text[start:i + 1])
+     raise ValueError("Unbalanced JSON in response")
+
+
+ # ── Episode runner ────────────────────────────────────────────────────────────
+
+ def run_episode(client: OpenAI, task_id: str) -> Dict[str, Any]:
+     env = CodeReviewEnv()
+     obs_obj = env.reset(task_id)
+     obs = obs_obj.model_dump()
+
+     log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)
+
+     rewards: List[float] = []
+     steps_taken = 0
+     score = 0.0
+     success = False
+     history: List[Dict[str, str]] = []
+     patch_submitted = False
+     error_msg: Optional[str] = None
+
+     try:
+         for step in range(1, obs_obj.max_steps + 1):
+             if obs.get("done"):
+                 break
+
+             error_msg = None
+             steps_left = obs["max_steps"] - obs["step"]
+
+             # Force patch then submit near step limit
+             if steps_left <= 1 and not patch_submitted:
+                 action_dict = {
+                     "action_type": "patch",
+                     "patched_code": obs["review_context"]["files_changed"][0]["content"],
+                 }
+             elif steps_left <= 0:
+                 action_dict = {
+                     "action_type": "submit",
+                     "verdict": "request_changes",
+                     "confidence": 0.5,
+                 }
+             else:
+                 user_msg = build_user_prompt(obs)
+                 history.append({"role": "user", "content": user_msg})
+
+                 try:
+                     completion = client.chat.completions.create(
+                         model=MODEL_NAME,
+                         messages=[{"role": "system", "content": SYSTEM_PROMPT}] + history,
+                         max_tokens=1024,
+                         temperature=0.2,
+                         stream=False,
+                     )
+                     raw = (completion.choices[0].message.content or "").strip()
+                     history.append({"role": "assistant", "content": raw})
+                     action_dict = extract_json(raw)
+                 except Exception as exc:
+                     error_msg = str(exc)[:80]
+                     action_dict = {
+                         "action_type": "submit",
+                         "verdict": "request_changes",
+                         "confidence": 0.3,
+                     }
+
+             if action_dict.get("action_type") == "patch":
+                 patch_submitted = True
+
+             # Validate action
+             try:
+                 action = ReviewAction(**action_dict)
+             except Exception as exc:
+                 error_msg = str(exc)[:80]
+                 action = ReviewAction(
+                     action_type="submit",
+                     verdict="request_changes",
+                     confidence=0.3,
+                 )
+
+             # Step environment
+             obs_obj, reward_obj, done, info = env.step(action)
+             obs = obs_obj.model_dump()
+
+             reward = reward_obj.value
+             rewards.append(reward)
+             steps_taken = step
+
+             action_summary = f"{action.action_type}:{(action.description or action.verdict or '')[:60]}"
+             log_step(step=step, action=action_summary, reward=reward, done=done, error=error_msg)
+
+             if done:
+                 score = info.get("final_score", 0.0)
+                 break
+
+         success = score >= SUCCESS_SCORE_THRESHOLD
+
+     finally:
+         log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
+
+     return {"task_id": task_id, "score": score, "steps": steps_taken, "success": success}
+
+
+ # ── Main ──────────────────────────────────────────────────────────────────────
+
+ def main() -> None:
+     if not API_KEY:
+         print("[ERROR] HF_TOKEN environment variable not set.", flush=True)
+         sys.exit(1)
+
+     client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
+
+     task_ids = os.getenv("TASK_IDS", ",".join(TASK_IDS)).split(",")
+     task_ids = [t.strip() for t in task_ids if t.strip()]
+
+     all_results = []
+     for task_id in task_ids:
+         result = run_episode(client, task_id)
+         all_results.append(result)
+
+     # Aggregate summary to stderr so it doesn't pollute stdout log format
+     print("\n[SUMMARY]", file=sys.stderr)
+     for r in all_results:
+         print(f"  {r['task_id']}: score={r['score']:.3f} steps={r['steps']} success={r['success']}", file=sys.stderr)
+     if all_results:
+         avg = sum(r["score"] for r in all_results) / len(all_results)
+         print(f"  aggregate: {avg:.3f}", file=sys.stderr)
+
+
+ if __name__ == "__main__":
+     main()
tasks/__pycache__/__init__.cpython-312.pyc DELETED
Binary file (143 Bytes)
 
tasks/__pycache__/task1_easy.cpython-312.pyc DELETED
Binary file (3.73 kB)
 
tasks/__pycache__/task2_medium.cpython-312.pyc DELETED
Binary file (6.01 kB)
 
tasks/__pycache__/task3_hard.cpython-312.pyc DELETED
Binary file (7.85 kB)