Commit 0fe141f
Parent(s): 8c883c7
Initial commit: Code Review Environment
- README.md +256 -5
- __pycache__/env.cpython-312.pyc +0 -0
- __pycache__/models.cpython-312.pyc +0 -0
- graders/__pycache__/__init__.cpython-312.pyc +0 -0
- graders/__pycache__/grader.cpython-312.pyc +0 -0
- inference.py +275 -0
- tasks/__pycache__/__init__.cpython-312.pyc +0 -0
- tasks/__pycache__/task1_easy.cpython-312.pyc +0 -0
- tasks/__pycache__/task2_medium.cpython-312.pyc +0 -0
- tasks/__pycache__/task3_hard.cpython-312.pyc +0 -0
README.md
CHANGED
@@ -1,11 +1,262 @@
Removed from the old README.md:

-colorFrom:
-colorTo:
-
-# Code Review Environment
-My custom environment for code review tasks.

New README.md content:
# π CodeReviewEnv

> An OpenEnv-compliant benchmark environment where AI agents act as senior engineers reviewing pull requests: catching bugs, finding security holes, and fixing broken code.

---

## Overview & Motivation

Code review is one of the highest-leverage activities in software engineering, yet it is time-consuming, inconsistent, and cognitively demanding. A model that can reliably triage pull requests, identify security vulnerabilities, and produce corrected patches would meaningfully accelerate software delivery.

**CodeReviewEnv** simulates exactly this. Three tasks of increasing difficulty present agents with realistic pull requests containing planted defects. The agent must reason over code, report issues with structured annotations, submit a corrected patch, and deliver a final verdict, all within a bounded step budget.

---

## Environment Architecture

```
code-review-env/
├── env.py                  # Core OpenEnv environment (reset / step / state)
├── server.py               # FastAPI HTTP server exposing the OpenEnv interface
├── models.py               # Pydantic typed models: Action, Observation, Reward, State
├── openenv.yaml            # OpenEnv metadata
├── tasks/
│   ├── task1_easy.py       # Bug hunt: simple Python utility
│   ├── task2_medium.py     # Security audit: Flask auth endpoint
│   └── task3_hard.py       # Correctness: distributed LRU cache
├── graders/
│   └── grader.py           # Deterministic keyword + AST graders
├── agents/
│   └── baseline_agent.py   # HF Inference API baseline (OpenAI-compatible)
├── Dockerfile
├── requirements.txt
└── README.md
```

---

## Action Space

Each agent turn is a single `ReviewAction` JSON object:

| Field | Type | Description |
|---|---|---|
| `action_type` | `"review" \| "patch" \| "comment" \| "submit"` | What the agent is doing |
| `severity` | `"critical" \| "major" \| "minor" \| "info"` | Issue severity (for `review`) |
| `issue_type` | `"bug" \| "security" \| "performance" \| "logic" \| "style"` | Issue category |
| `line_number` | `int \| null` | Line the issue is on |
| `description` | `str` | Concise natural-language description of the issue |
| `patched_code` | `str \| null` | Full corrected code (for `patch` actions) |
| `comment` | `str \| null` | Free-form annotation |
| `verdict` | `"approve" \| "request_changes" \| "reject"` | Final verdict (for `submit`) |
| `confidence` | `float [0.0, 1.0]` | Agent's self-reported confidence |

---
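A concrete `review` action following this schema might look like the sketch below (field values are illustrative; optional fields not relevant to a review are null):

```python
import json

# An illustrative "review" action; fields follow the ReviewAction table above.
action = {
    "action_type": "review",
    "severity": "critical",
    "issue_type": "bug",
    "line_number": 3,
    "description": "Assignment operator = used instead of comparison ==",
    "patched_code": None,
    "comment": None,
    "verdict": None,
    "confidence": 0.9,
}

payload = json.dumps(action)
print(payload)
```

A `submit` action would instead set `verdict` (and typically `confidence`) while leaving the per-issue fields null.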
## Observation Space

Each step returns an `Observation` containing:

| Field | Description |
|---|---|
| `task_id` | Identifier of the current task |
| `step` / `max_steps` | Current step and budget |
| `review_context` | Full PR: title, author, description, code files, linter output, test results |
| `previous_actions` | All actions taken so far this episode |
| `issues_found_so_far` | Structured list of issues reported |
| `score_so_far` | Running cumulative intermediate reward |
| `done` | Whether the episode has ended |

---

## Reward Function

Reward is **dense**: provided at every step, not only at the end.

### Intermediate (per-step)

| Signal | Value | Rationale |
|---|---|---|
| Step penalty | -0.01 | Encourages efficiency |
| Review with description | +0.05 | Rewards substantive annotations |
| Critical severity bonus | +0.03 | Rewards correct triage |
| Patch submitted | +0.10 | Rewards producing a fix |
| Repetition penalty | -0.05 | Penalises looping / copy-paste |

### Terminal (on `submit` or step exhaustion)

The programmatic grader runs and returns a score in **[0.0, 1.0]** based on which issues were correctly identified and how well the submitted patch addresses them. This final score overwrites the episode total.

---
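As a rough sketch, the per-step signals can combine like this (the function and its flags are hypothetical; the exact interaction of the signals, e.g. whether the step penalty applies to patch steps, is defined in `env.py`):

```python
def intermediate_reward(action_type: str, has_description: bool,
                        is_critical: bool, is_repeat: bool) -> float:
    """Illustrative combination of the per-step signals in the table above."""
    reward = -0.01                      # step penalty: encourages efficiency
    if action_type == "review" and has_description:
        reward += 0.05                  # substantive annotation
        if is_critical:
            reward += 0.03              # correct triage of a critical issue
    if action_type == "patch":
        reward += 0.10                  # produced a fix
    if is_repeat:
        reward -= 0.05                  # looping / copy-paste penalty
    return round(reward, 2)

# A critical review step earns -0.01 + 0.05 + 0.03:
print(intermediate_reward("review", True, True, False))  # 0.07
```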
## Tasks

### Task 1 - Easy: Bug Hunt (`task_1_easy_bug_hunt`)

**Max steps:** 8
**File reviewed:** `utils.py` (Python, 30 lines)

A developer submits three utility functions. Three bugs are planted:

| # | Line | Bug | Severity |
|---|---|---|---|
| 1 | 3 | `=` (assignment) used instead of `==` (comparison), causing a `SyntaxError` | Critical |
| 2 | 6 | `range(1, len(numbers) + 1)`: off-by-one causes an `IndexError` | Critical |
| 3 | 9 | Missing `return max_val`, so the function silently returns `None` | Major |

**Grading:** 30% per critical bug identified, 20% for the major bug, 20% for a syntactically valid patch with all three fixes applied.

---
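The planted bug patterns look roughly like the following (an illustrative reconstruction, not the actual `utils.py`; the function name is hypothetical):

```python
# Bug 1: `=` instead of `==` is a SyntaxError in Python:
#     if x = target:        # SyntaxError: invalid syntax

# Bugs 2 and 3, shown in their fixed form:
def find_max(numbers):
    max_val = numbers[0]
    # Buggy version used range(1, len(numbers) + 1), which indexes one
    # past the end and raises IndexError on the last iteration.
    for i in range(1, len(numbers)):
        if numbers[i] > max_val:
            max_val = numbers[i]
    return max_val  # Buggy version omitted this, silently returning None

print(find_max([3, 1, 4, 1, 5]))  # 5
```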
### Task 2 - Medium: Security Audit (`task_2_medium_security`)

**Max steps:** 12
**File reviewed:** `auth.py` (Flask, 55 lines)

A backend developer submits login and registration endpoints. Six security vulnerabilities are present:

| # | Line | Vulnerability | Severity |
|---|---|---|---|
| 1 | 23 | SQL injection in `login` query (f-string interpolation) | Critical |
| 2 | 44 | SQL injection in `register` INSERT | Critical |
| 3 | 39 | Plaintext password storage (no hashing) | Critical |
| 4 | n/a | No rate limiting on `/login` (brute-force possible) | Major |
| 5 | 30 | Sensitive data leakage: error distinguishes "wrong password" vs "user not found" | Major |
| 6 | 5 | Hardcoded `secret_key` in source | Major |

**Grading:** Weighted by severity. Patch checked for parameterized queries, password hashing, and environment variable use.

---
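The kinds of fixes the grader looks for can be sketched as follows (illustrative only; the sqlite3 backend, table layout, and helper name are assumptions, not the actual `auth.py`):

```python
import hashlib
import hmac
import os
import sqlite3

def hash_password(password: str, salt: bytes) -> str:
    # Salted hash instead of plaintext storage (vuln #3).
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000).hex()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, salt BLOB, pw_hash TEXT)")

salt = os.urandom(16)
# Parameterized INSERT: placeholders, not f-string interpolation (vulns #1, #2).
conn.execute(
    "INSERT INTO users (username, salt, pw_hash) VALUES (?, ?, ?)",
    ("alice", salt, hash_password("s3cret", salt)),
)

row = conn.execute(
    "SELECT salt, pw_hash FROM users WHERE username = ?", ("alice",)
).fetchone()
# Constant-time comparison, and a single generic "invalid credentials"
# error avoids revealing which part failed (vuln #5). The secret key
# would come from the environment rather than source (vuln #6).
ok = row is not None and hmac.compare_digest(row[1], hash_password("s3cret", row[0]))
print(ok)
```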
### Task 3 - Hard: Distributed Systems Correctness (`task_3_hard_perf_correctness`)

**Max steps:** 16
**File reviewed:** `cache.py` (Python, 55 lines)

A senior engineer submits a Redis-backed LRU cache claimed to be production-ready. Six issues lurk:

| # | Issue | Type | Severity |
|---|---|---|---|
| 1 | Non-atomic `EXISTS` + `GET` creates a race condition | Concurrency | Critical |
| 2 | Local `dict` grows unboundedly; the `capacity` parameter is ignored | Performance | Critical |
| 3 | `get_many` calls `self.get()` in a loop (N+1 round trips) | Performance | Major |
| 4 | `dict` preserves insertion order, not access order, so LRU eviction is wrong | Logic | Major |
| 5 | Shared `dict` modified without a `threading.Lock` | Concurrency | Critical |
| 6 | `pickle.loads` on bytes from Redis allows arbitrary code execution | Security | Critical |

**Grading:** Equally weighted. Patch checked structurally for `threading.Lock`, `OrderedDict.move_to_end`, `mget`, and `json` instead of `pickle`.

---
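The local-cache half of a fix might look like this sketch of the graded ingredients (class and method names are assumptions, not the real `cache.py`):

```python
import json
import threading
from collections import OrderedDict

class LocalLRU:
    """Bounded, thread-safe LRU layer (illustrative sketch)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity           # honoured, fixing issue #2
        self._data: "OrderedDict[str, str]" = OrderedDict()
        self._lock = threading.Lock()      # guards shared state, fixing issue #5

    def put(self, key: str, value: object) -> None:
        with self._lock:
            # json instead of pickle removes the deserialization RCE (issue #6).
            self._data[key] = json.dumps(value)
            self._data.move_to_end(key)    # access order, fixing issue #4
            if len(self._data) > self.capacity:
                self._data.popitem(last=False)  # evict least recently used

    def get(self, key: str) -> object:
        with self._lock:
            if key not in self._data:
                return None
            self._data.move_to_end(key)
            return json.loads(self._data[key])

cache = LocalLRU(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # touch "a", so "b" becomes least recently used
cache.put("c", 3)    # evicts "b"
print(cache.get("b"), cache.get("a"))  # None 1
```

On the Redis side, `mget` would batch `get_many` into one round trip (issue #3), and a single `GET` followed by a null check replaces the non-atomic `EXISTS` + `GET` pair (issue #1).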
## Baseline Performance

Evaluated with `Qwen/Qwen2.5-72B-Instruct` via the Hugging Face Inference API:

| Task | Score |
|---|---|
| Task 1 - Easy | 0.72 |
| Task 2 - Medium | 0.55 |
| Task 3 - Hard | 0.38 |
| **Aggregate** | **0.55** |

---
## Setup & Usage

### 1. Local (Python)

```bash
git clone <repo>
cd code-review-env
pip install -r requirements.txt
python server.py
# Server running at http://localhost:7860
```

### 2. Docker

```bash
docker build -t code-review-env .
docker run -p 7860:7860 code-review-env
```

### 3. API Quickstart

```bash
# Reset to task 1
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "task_1_easy_bug_hunt"}'

# Take a step
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "<session_id>",
    "action": {
      "action_type": "review",
      "severity": "critical",
      "issue_type": "bug",
      "line_number": 3,
      "description": "Assignment operator = used instead of comparison == on line 3"
    }
  }'
```
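The same two requests can be built from Python with the standard library (a sketch mirroring the curl examples; the `<session_id>` placeholder comes from the `/reset` response, as above):

```python
import json
import urllib.request

BASE = "http://localhost:7860"

def make_request(path: str, payload: dict) -> urllib.request.Request:
    # Build the same JSON POSTs as the curl examples above.
    return urllib.request.Request(
        url=BASE + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

reset_req = make_request("/reset", {"task_id": "task_1_easy_bug_hunt"})
step_req = make_request("/step", {
    "session_id": "<session_id>",  # taken from the /reset response
    "action": {
        "action_type": "review",
        "severity": "critical",
        "issue_type": "bug",
        "line_number": 3,
        "description": "Assignment operator = used instead of comparison ==",
    },
})
# Sending is one call against a running server:
#     urllib.request.urlopen(step_req)
print(reset_req.full_url)
```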
### 4. Run the inference script

```bash
export HF_TOKEN=hf_your_token_here
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
python inference.py
```

Expected stdout format:

```
[START] task=task_1_easy_bug_hunt env=code-review-env model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action=review:assignment operator = instead of == reward=0.07 done=false error=null
[STEP] step=2 action=review:off-by-one in range reward=0.07 done=false error=null
[STEP] step=3 action=patch:fixed code reward=0.10 done=false error=null
[STEP] step=4 action=submit:request_changes reward=1.00 done=true error=null
[END] success=true steps=4 score=1.000 rewards=0.07,0.07,0.10,1.00
```
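Downstream tooling can parse these `[STEP]` lines with a small regex (a sketch; the line shape is taken from the format above):

```python
import re

# Mirrors the [STEP] line format shown above.
STEP_RE = re.compile(
    r"\[STEP\] step=(?P<step>\d+) action=(?P<action>.*) "
    r"reward=(?P<reward>-?\d+\.\d+) done=(?P<done>true|false) error=(?P<error>.*)"
)

line = "[STEP] step=3 action=patch:fixed code reward=0.10 done=false error=null"
m = STEP_RE.match(line)
assert m is not None
print(m["step"], m["reward"], m["done"])  # 3 0.10 false
```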
### 5. OpenEnv validation

```bash
openenv validate .
```

---

## HTTP API Reference

| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/` | Environment info |
| `GET` | `/tasks` | List all tasks |
| `POST` | `/reset` | Start a new episode |
| `POST` | `/step` | Take an action |
| `GET` | `/state/{session_id}` | Inspect full environment state |
| `DELETE` | `/session/{session_id}` | Clean up session |

---
## Hugging Face Spaces Deployment

The `Dockerfile` targets port `7860` and runs as a non-root user; it is compatible with the HF Spaces Docker SDK out of the box. Tag the Space with `openenv`.

```yaml
# README header for HF Spaces
---
title: CodeReviewEnv
emoji: π
colorFrom: indigo
colorTo: blue
sdk: docker
pinned: false
tags:
  - openenv
---
```
__pycache__/env.cpython-312.pyc DELETED (binary file, 8.1 kB)

__pycache__/models.cpython-312.pyc DELETED (binary file, 4.16 kB)

graders/__pycache__/__init__.cpython-312.pyc DELETED (binary file, 145 Bytes)

graders/__pycache__/grader.cpython-312.pyc DELETED (binary file, 7.57 kB)

inference.py ADDED
@@ -0,0 +1,275 @@
"""
inference.py - CodeReviewEnv baseline inference script.

Mandatory env vars:
    API_BASE_URL   The API endpoint for the LLM.
    MODEL_NAME     The model identifier to use for inference.
    HF_TOKEN       Your Hugging Face / API key.

STDOUT format (strictly followed):
    [START] task=<task_name> env=<benchmark> model=<model_name>
    [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
    [END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
"""

import json
import os
import sys
import textwrap
from typing import Any, Dict, List, Optional

from openai import OpenAI

sys.path.insert(0, os.path.dirname(__file__))
from env import CodeReviewEnv, TASK_IDS
from models import ReviewAction

# ── Env vars ──────────────────────────────────────────────────────────────────
API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
BENCHMARK = "code-review-env"
SUCCESS_SCORE_THRESHOLD = 0.5

# ── Logging helpers ───────────────────────────────────────────────────────────

def log_start(task: str, env: str, model: str) -> None:
    print(f"[START] task={task} env={env} model={model}", flush=True)


def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
    error_val = error if error else "null"
    done_val = str(done).lower()
    action_clean = action.replace("\n", " ").replace("\r", "")[:120]
    print(
        f"[STEP] step={step} action={action_clean} reward={reward:.2f} done={done_val} error={error_val}",
        flush=True,
    )


def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
    print(
        f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}",
        flush=True,
    )


# ── Prompts ───────────────────────────────────────────────────────────────────

SYSTEM_PROMPT = textwrap.dedent("""
You are an expert software engineer performing a thorough code review.
Your job is to:
1. Identify ALL bugs, security vulnerabilities, performance issues, and logic errors.
2. For each issue, output a JSON action with action_type="review".
3. After identifying all issues, output a patch with action_type="patch".
4. Finally, output action_type="submit" with your verdict.

Each response must be a single valid JSON object. No markdown, no explanation outside JSON.

Schema:
{
  "action_type": "review" | "patch" | "comment" | "submit",
  "severity": "critical" | "major" | "minor" | "info",
  "issue_type": "bug" | "security" | "performance" | "logic" | "style",
  "line_number": <int or null>,
  "description": "<description of the issue>",
  "patched_code": "<full corrected code>",
  "comment": "<optional>",
  "verdict": "approve" | "request_changes" | "reject",
  "confidence": <0.0-1.0>
}

Output ONE JSON object per response. Be precise and thorough.
""").strip()


def build_user_prompt(obs: Dict[str, Any]) -> str:
    ctx = obs["review_context"]
    files_text = "\n\n".join(
        f"=== {f['filename']} ({f['language']}) ===\n{f['content']}"
        for f in ctx["files_changed"]
    )
    issues_so_far = obs.get("issues_found_so_far", [])

    prompt = textwrap.dedent(f"""
Pull Request: {ctx['pull_request_title']}
Author: {ctx['author']}
Description: {ctx['description']}
Linter: {ctx.get('linter_output', 'N/A')}
Tests: {ctx.get('test_results', 'N/A')}

--- CODE ---
{files_text}
--- END CODE ---

Step: {obs['step']} / {obs['max_steps']}
Issues reported so far: {len(issues_so_far)}
""").strip()

    if issues_so_far:
        prompt += "\n\nIssues already reported (do NOT repeat these):"
        for iss in issues_so_far:
            prompt += f"\n - [{iss.get('severity','?')}] line {iss.get('line','?')}: {iss.get('description','')}"

    steps_left = obs['max_steps'] - obs['step']
    if steps_left <= 2:
        prompt += "\n\nYou are almost out of steps. Submit your patch and verdict NOW."
    elif obs['step'] == 0:
        prompt += "\n\nBegin your review. Output your first action as JSON."
    else:
        prompt += "\n\nContinue reviewing or submit if done. Output next action as JSON."

    return prompt


# ── JSON extraction ───────────────────────────────────────────────────────────

def extract_json(text: str) -> Dict[str, Any]:
    text = text.strip()
    if text.startswith("```"):
        lines = text.split("\n")
        text = "\n".join(lines[1:-1]) if len(lines) > 2 else text
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    start = text.find("{")
    if start == -1:
        raise ValueError("No JSON object found in response")
    depth = 0
    for i, ch in enumerate(text[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return json.loads(text[start:i + 1])
    raise ValueError("Unbalanced JSON in response")


# ── Episode runner ────────────────────────────────────────────────────────────

def run_episode(client: OpenAI, task_id: str) -> Dict[str, Any]:
    env = CodeReviewEnv()
    obs_obj = env.reset(task_id)
    obs = obs_obj.model_dump()

    log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)

    rewards: List[float] = []
    steps_taken = 0
    score = 0.0
    success = False
    history: List[Dict[str, str]] = []
    patch_submitted = False
    error_msg: Optional[str] = None

    try:
        for step in range(1, obs_obj.max_steps + 1):
            if obs.get("done"):
                break

            error_msg = None
            steps_left = obs["max_steps"] - obs["step"]

            # Force patch then submit near step limit
            if steps_left <= 1 and not patch_submitted:
                action_dict = {
                    "action_type": "patch",
                    "patched_code": obs["review_context"]["files_changed"][0]["content"],
                }
            elif steps_left <= 0:
                action_dict = {
                    "action_type": "submit",
                    "verdict": "request_changes",
                    "confidence": 0.5,
                }
            else:
                user_msg = build_user_prompt(obs)
                history.append({"role": "user", "content": user_msg})

                try:
                    completion = client.chat.completions.create(
                        model=MODEL_NAME,
                        messages=[{"role": "system", "content": SYSTEM_PROMPT}] + history,
                        max_tokens=1024,
                        temperature=0.2,
                        stream=False,
                    )
                    raw = (completion.choices[0].message.content or "").strip()
                    history.append({"role": "assistant", "content": raw})
                    action_dict = extract_json(raw)
                except Exception as exc:
                    error_msg = str(exc)[:80]
                    action_dict = {
                        "action_type": "submit",
                        "verdict": "request_changes",
                        "confidence": 0.3,
                    }

            if action_dict.get("action_type") == "patch":
                patch_submitted = True

            # Validate action
            try:
                action = ReviewAction(**action_dict)
            except Exception as exc:
                error_msg = str(exc)[:80]
                action = ReviewAction(
                    action_type="submit",
                    verdict="request_changes",
                    confidence=0.3,
                )

            # Step environment
            obs_obj, reward_obj, done, info = env.step(action)
            obs = obs_obj.model_dump()

            reward = reward_obj.value
            rewards.append(reward)
            steps_taken = step

            action_summary = f"{action.action_type}:{(action.description or action.verdict or '')[:60]}"
            log_step(step=step, action=action_summary, reward=reward, done=done, error=error_msg)

            if done:
                score = info.get("final_score", 0.0)
                break

        success = score >= SUCCESS_SCORE_THRESHOLD

    finally:
        log_end(success=success, steps=steps_taken, score=score, rewards=rewards)

    return {"task_id": task_id, "score": score, "steps": steps_taken, "success": success}


# ── Main ──────────────────────────────────────────────────────────────────────

def main() -> None:
    if not API_KEY:
        print("[ERROR] HF_TOKEN environment variable not set.", flush=True)
        sys.exit(1)

    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)

    task_ids = os.getenv("TASK_IDS", ",".join(TASK_IDS)).split(",")
    task_ids = [t.strip() for t in task_ids if t.strip()]

    all_results = []
    for task_id in task_ids:
        result = run_episode(client, task_id)
        all_results.append(result)

    # Aggregate summary to stderr so it doesn't pollute stdout log format
    print("\n[SUMMARY]", file=sys.stderr)
    for r in all_results:
        print(f" {r['task_id']}: score={r['score']:.3f} steps={r['steps']} success={r['success']}", file=sys.stderr)
    if all_results:
        avg = sum(r["score"] for r in all_results) / len(all_results)
        print(f" aggregate: {avg:.3f}", file=sys.stderr)


if __name__ == "__main__":
    main()
tasks/__pycache__/__init__.cpython-312.pyc DELETED (binary file, 143 Bytes)

tasks/__pycache__/task1_easy.cpython-312.pyc DELETED (binary file, 3.73 kB)

tasks/__pycache__/task2_medium.cpython-312.pyc DELETED (binary file, 6.01 kB)

tasks/__pycache__/task3_hard.cpython-312.pyc DELETED (binary file, 7.85 kB)