Upload folder using huggingface_hub

Files changed:
- README.md +136 -98
- openenv.yaml +16 -16
- openenv_code_review.egg-info/SOURCES.txt +2 -1
- server/code_review_environment.py +6 -6
README.md
CHANGED
# Code Review Environment

A reinforcement learning benchmark environment where an agent acts as a senior software engineer reviewing pull requests. The agent must identify bugs, suggest fixes, and make approval decisions across progressively harder code review tasks, spanning missing imports, logic errors, and security vulnerabilities.

---

## Motivation

Code review is a high-stakes, multi-step reasoning task that requires an agent to:

- **Detect bugs and security vulnerabilities** from raw code diffs
- **Generate corrective code** that resolves identified issues
- **Make a final judgment** (approve or reject) backed by technical reasoning

Existing benchmarks test code generation or comprehension in isolation. This environment tests the full review loop (detection, remediation, and decision-making) in a structured, scorable way. It is designed to evaluate whether LLMs can act as reliable automated reviewers in real software development pipelines.

---

## Setup and Usage

### Install dependencies

```bash
pip install openenv-core
```

```bash
git clone https://github.com/Ajay-Ganapathy/code_review && cd code_review
uv pip install -e .
```

### Run the server locally (optional)

```bash
uv run server --host 0.0.0.0 --port 8000
```

### Run the agent

```bash
uv run python inference.py
```

### Environment variables

Set the following before running:

| Variable | Description |
|----------|-------------|
| `API_BASE_URL` | The API endpoint for the LLM (e.g. `https://router.huggingface.co/v1`) |
| `MODEL_NAME` | The model identifier to use for inference |
| `HF_TOKEN` | Your Hugging Face API key |
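For example (the values below are illustrative; substitute your own token):

```shell
# Illustrative values only; replace the token placeholder with your own key
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
export HF_TOKEN="<your-hf-token>"
```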
### Key constants in `inference.py`

| Constant | Default | Description |
|----------|---------|-------------|
| `MAX_STEPS` | `3` | Steps per episode |
| `NUM_EPISODES` | `16` | Number of PRs to review |
| `TEMPERATURE` | `0.2` | Sampling temperature (lower = more deterministic) |
| `MAX_TOKENS` | `512` | Max tokens per LLM response |
| `SUCCESS_SCORE_THRESHOLD` | `0.1` | Minimum score to count as success |

---

## Environment Description

The agent receives a pull request observation at each step and must respond with a structured JSON action. Each episode runs for up to `MAX_STEPS = 3` steps following a fixed workflow:

| Step | Expected Action | Purpose |
|------|----------------|---------|
| 1 | `comment` | Identify all issues in the diff |
| 2 | `suggest_fix` | Provide corrected code |
| 3 | `final_decision` | Approve or reject the PR |

Each step is independently scored. The final episode score is the maximum score achieved across all steps.

The environment automatically selects a grader tier (`easy`, `medium`, or `hard`) based on the `task_type` field of each dataset sample. No manual configuration is needed; the grader switches per episode as `reset()` is called.
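The three-step workflow can be sketched as a small driver loop. Here `step_fn` is a hypothetical stand-in for the client call that POSTs an action to the `/step` endpoint; the real OpenEnv client API may differ:

```python
def run_episode(step_fn, max_steps=3):
    """Drive one episode through the fixed comment -> suggest_fix -> final_decision workflow.

    step_fn(action_dict) -> (reward, done) is a hypothetical stand-in for the
    real client call that sends the action to the /step endpoint.
    """
    actions = [
        {"action_type": "comment",
         "comment": "Possible bug: datetime used without import",
         "suggested_code": None, "decision": None},
        {"action_type": "suggest_fix", "comment": None,
         "suggested_code": "from datetime import datetime", "decision": None},
        {"action_type": "final_decision", "comment": None,
         "suggested_code": None, "decision": "reject"},
    ]
    rewards = []
    for action in actions[:max_steps]:
        reward, done = step_fn(action)
        rewards.append(reward)
        if done:
            break
    # The episode score is the maximum step reward, as described above
    return max(rewards)
```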
---

## Action Space

Actions must be returned as JSON with the following fields:

```json
{
  "action_type": "comment | suggest_fix | final_decision",
  "comment": "str | null",
  "suggested_code": "str | null",
  "decision": "approve | reject | null"
}
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `action_type` | `str` | Always | One of `comment`, `suggest_fix`, `final_decision` |
| `comment` | `str \| null` | Step 1 | Review comment identifying issues in the diff |
| `suggested_code` | `str \| null` | Step 2 | Corrected code replacing the buggy diff |
| `decision` | `str \| null` | Step 3 | `approve` or `reject`; `null` otherwise |
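A schema-conforming action can be built and serialised with plain dicts (the helper below is illustrative, not part of the environment's API):

```python
import json

def make_action(action_type, comment=None, suggested_code=None, decision=None):
    """Build an action dict with the field names from the table above."""
    allowed = {"comment", "suggest_fix", "final_decision"}
    if action_type not in allowed:
        raise ValueError(f"unknown action_type: {action_type}")
    return {
        "action_type": action_type,
        "comment": comment,
        "suggested_code": suggested_code,
        "decision": decision,
    }

# A step-1 action, serialised for the request body
payload = json.dumps(make_action("comment", comment="datetime used without import"))
```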
---

## Observation Space

Each step returns a `CodeReviewObservation` with the following fields:

| Field | Type | Description |
|-------|------|-------------|
| `pr` | `CodeReviewPullRequest` | The pull request under review |
| `step_count` | `int` | Current step number |
| `max_steps` | `int` | Maximum steps per episode (default: 3) |

---
## Scoring

### Grader tiers

The dataset contains three difficulty levels, each backed by a dedicated grader class in `graders.py`. The grader is selected automatically from `task_type` in the dataset sample.

| Tier | Class | Issue matching | Wrong decision | Done scoring |
|------|-------|---------------|---------------|--------------|
| `easy` | `EasyGrader` | Substring match | 0.2 partial credit | Max over full history |
| `medium` | `MediumGrader` | Token overlap + substring fallback | 0.1 partial credit | Recency-weighted max |
| `hard` | `HardGrader` | Token overlap + sequence similarity (threshold 0.3) | No credit | Final step only |

### Score components per tier

| Component | Easy | Medium | Hard |
|-----------|------|--------|------|
| Issue detection | 40% | 42% | 45% |
| Fix quality | 30% | 30% | 28% |
| Decision accuracy | 30% | 28% | 27% |

**Fix quality** is computed as a weighted combination of token overlap, sequence similarity, and (for medium/hard) line-level exact matching. **Issue detection** checks how many ground-truth issues appear in the agent's comment. All scores are clamped to `[0.01, 0.99]`.

### Bonuses and penalties

| Condition | Easy | Medium | Hard |
|-----------|------|--------|------|
| Comment length > 30 chars | +0.15 | +0.10 | n/a |
| Correct decision at step 1 | +0.10 | +0.10 | +0.05 |
| Correct decision at step 2 | +0.10 | +0.05 | n/a |
| No comment on non-decision step | -0.05 | -0.08 | -0.12 |
| Step count > 3 | n/a | -0.04/step | -0.05 × (steps - 2) |
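The token-overlap and sequence-similarity components can be approximated with the standard library. This is an illustrative sketch, not the graders' exact code:

```python
from difflib import SequenceMatcher

def token_overlap(expected: str, actual: str) -> float:
    """Fraction of expected tokens that appear in the agent's text."""
    expected_tokens = set(expected.lower().split())
    actual_tokens = set(actual.lower().split())
    if not expected_tokens:
        return 0.0
    return len(expected_tokens & actual_tokens) / len(expected_tokens)

def fix_quality(expected_fix: str, suggested: str) -> float:
    """Blend token overlap with character-level sequence similarity,
    clamped to [0.01, 0.99] as described above. Weights are illustrative."""
    overlap = token_overlap(expected_fix, suggested)
    seq_sim = SequenceMatcher(None, expected_fix, suggested).ratio()
    score = 0.5 * overlap + 0.5 * seq_sim
    return min(max(score, 0.01), 0.99)
```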
---

## Task Descriptions

### Easy

Straightforward single-file issues with an obvious fix. The `EasyGrader` uses simple substring matching: the agent gets full issue credit if the issue phrase appears anywhere in the comment.

| PR | Issue | Expected Decision |
|----|-------|------------------|
| Missing import | `datetime` used without import | reject |

**What the agent must do:** Detect the missing `from datetime import datetime` statement and supply the corrected import line.

---
### Medium

Logical or performance issues that require understanding of Python semantics. The `MediumGrader` uses token overlap, so paraphrased descriptions still score well.

| PR | Issue | Expected Decision |
|----|-------|------------------|
| Division function | No guard against division by zero | reject |
| Inefficient loop | `range(len(arr))` pattern; can use `in` directly | approve |

**What the agent must do:** For the division task, add an `if b == 0: return None` guard. For the loop task, recognise it as a style issue rather than a correctness bug; the correct decision is **approve**.
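Sketched, the two expected fixes look like this (the function shapes are assumptions; the actual PR code may differ):

```python
def divide(a, b):
    # Guard against division by zero, as the grader expects
    if b == 0:
        return None
    return a / b

def squares(arr):
    # Idiomatic iteration: no "for i in range(len(arr))" index juggling
    return [x * x for x in arr]
```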
---
### Hard

Security vulnerabilities, injection attacks, and cross-file null-handling bugs. The `HardGrader` applies a minimum similarity threshold: vague or generic comments receive zero issue credit.

| PR | Issue | Expected Decision |
|----|-------|------------------|
| Authentication logic | Hardcoded plaintext password `admin123` | reject |
| SQL query | String concatenation exposes SQL injection | reject |
| Cross-file null bug | `get_user(None)` called without input validation | reject |

**What the agent must do:**

- **Auth:** Detect the hardcoded secret and propose `bcrypt`-based password comparison.
- **SQL:** Detect string concatenation and replace it with a parameterised query using a `%s` placeholder and `cursor.execute`.
- **Null bug:** Validate `id is not None` before the `db[id]` lookup and fix the call site in `controller.py`.
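To illustrate the SQL task, here is the vulnerable pattern next to a parameterised version. The sketch uses `sqlite3` so it runs anywhere; sqlite's placeholder is `?`, while drivers such as psycopg2 use the `%s` style named above:

```python
import sqlite3

def find_user_unsafe(cur, name):
    # VULNERABLE: user input is concatenated straight into the SQL text,
    # so a payload like "' OR '1'='1" rewrites the query
    return cur.execute(
        "SELECT id FROM users WHERE name = '" + name + "'"
    ).fetchall()

def find_user_safe(cur, name):
    # Parameterised: the driver treats the value as data, never as SQL
    return cur.execute(
        "SELECT id FROM users WHERE name = ?", (name,)
    ).fetchall()
```

With the injection payload `' OR '1'='1`, the unsafe version returns every row while the parameterised version matches nothing.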
---
## Baseline Scores
Expected performance ranges by model capability:

| Score Range | Interpretation |
|-------------|---------------|
| 0.00 – 0.20 | Failing – agent cannot follow the JSON schema or identify basic issues |
| 0.50 – 0.75 | Competent – agent handles easy and medium tasks; struggles with hard security/null cases |
| 0.75 – 1.00 | Strong – agent reliably detects all issue types, generates correct fixes, and makes sound decisions |
### Step-level log format
```
[START] task=code_review env=code_review_benchmark model=meta-llama/Llama-3.1-8B-Instruct
[STEP] step=1 action=comment reward=0.55 done=false error=null
[STEP] step=2 action=suggest_fix reward=0.72 done=false error=null
[STEP] step=3 action=final_decision reward=0.85 done=true error=null
[END] success=true steps=3 score=0.850 rewards=0.55,0.72,0.85
```
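Since these log lines are plain `key=value` pairs, a run can be summarised with a few lines of parsing (a sketch, assuming the format shown above):

```python
def parse_log_line(line: str) -> dict:
    """Split '[TAG] k=v k=v ...' into a dict (values stay strings)."""
    tag, _, rest = line.partition(" ")
    fields = dict(pair.split("=", 1) for pair in rest.split())
    fields["tag"] = tag.strip("[]")
    return fields

log = """[STEP] step=1 action=comment reward=0.55 done=false error=null
[STEP] step=2 action=suggest_fix reward=0.72 done=false error=null
[STEP] step=3 action=final_decision reward=0.85 done=true error=null"""

steps = [parse_log_line(line) for line in log.splitlines()]
# Episode score = maximum step reward, matching the scoring rule above
best = max(float(s["reward"]) for s in steps)
```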
---
## Conclusion
The Code Review Environment provides a structured, reproducible benchmark for evaluating LLM-based agents on one of the most practically valuable tasks in software engineering. By decomposing the review process into three distinct steps (issue detection, fix generation, and final judgment) and by scaling difficulty through dedicated grader tiers, it rewards agents that reason carefully rather than those that simply pattern-match on surface-level symptoms.
openenv.yaml
CHANGED

```diff
@@ -8,67 +8,67 @@ tasks:
   - id: task_1
     description: "Easy – missing import detection"
     max_steps: 3
-    grader: graders:EasyGrader
+    grader: server.graders:EasyGrader
   - id: task_2
     description: "Easy – missing return statement"
     max_steps: 3
-    grader: graders:EasyGrader
+    grader: server.graders:EasyGrader
   - id: task_3
     description: "Easy – wrong comparison operator"
     max_steps: 3
-    grader: graders:EasyGrader
+    grader: server.graders:EasyGrader
   - id: task_4
     description: "Easy – undefined variable"
     max_steps: 3
-    grader: graders:EasyGrader
+    grader: server.graders:EasyGrader
   - id: task_5
     description: "Easy – clean utility function"
     max_steps: 3
-    grader: graders:EasyGrader
+    grader: server.graders:EasyGrader
   - id: task_6
     description: "Medium – division by zero handling"
     max_steps: 3
-    grader: graders:MediumGrader
+    grader: server.graders:MediumGrader
   - id: task_7
     description: "Medium – inefficient loop optimization"
     max_steps: 3
-    grader: graders:MediumGrader
+    grader: server.graders:MediumGrader
   - id: task_8
     description: "Medium – mutable default argument"
     max_steps: 3
-    grader: graders:MediumGrader
+    grader: server.graders:MediumGrader
   - id: task_9
     description: "Medium – unhandled exception"
     max_steps: 3
-    grader: graders:MediumGrader
+    grader: server.graders:MediumGrader
   - id: task_10
     description: "Medium – missing input validation"
     max_steps: 3
-    grader: graders:MediumGrader
+    grader: server.graders:MediumGrader
   - id: task_11
     description: "Hard – hardcoded password security vulnerability"
     max_steps: 3
-    grader: graders:HardGrader
+    grader: server.graders:HardGrader
   - id: task_12
     description: "Hard – SQL injection vulnerability"
     max_steps: 3
-    grader: graders:HardGrader
+    grader: server.graders:HardGrader
   - id: task_13
     description: "Hard – cross-file null handling bug"
     max_steps: 3
-    grader: graders:HardGrader
+    grader: server.graders:HardGrader
   - id: task_14
     description: "Hard – race condition in counter"
     max_steps: 3
-    grader: graders:HardGrader
+    grader: server.graders:HardGrader
   - id: task_15
     description: "Hard – insecure deserialization"
     max_steps: 3
-    grader: graders:HardGrader
+    grader: server.graders:HardGrader
   - id: task_16
     description: "Hard – path traversal vulnerability"
     max_steps: 3
-    grader: graders:HardGrader
+    grader: server.graders:HardGrader
 endpoints:
   reset: /reset
   step: /step
```
openenv_code_review.egg-info/SOURCES.txt
CHANGED
```diff
@@ -12,4 +12,5 @@ openenv_code_review.egg-info/requires.txt
 openenv_code_review.egg-info/top_level.txt
 server/__init__.py
 server/app.py
-server/code_review_environment.py
+server/code_review_environment.py
+server/graders.py
```
server/code_review_environment.py
CHANGED
```diff
@@ -80,11 +80,11 @@ class CodeReviewEnvironment(Environment):
 
         self.sample = self.dataset[self.task_index % len(self.dataset)]
 
-        self.pr
-        self.gt
-        self.task_type
-        grader_level
-        self.grader
+        self.pr = CodeReviewPullRequest(**self.sample["pr"])
+        self.gt = self.sample["ground_truth"]
+        self.task_type = self.sample.get("task_type", "unknown")
+        grader_level = self.task_type if self.task_type in ("easy", "medium", "hard") else "medium"
+        self.grader = get_grader(grader_level)
         self.grader_level = grader_level
 
         self.history = []
@@ -166,7 +166,7 @@ class CodeReviewEnvironment(Environment):
         )
 
         rew = CodeReviewReward(score=score, feedback="graded")
-
+        print(f"[{self.grader_level.upper()}] Step {self.step_count} – Score: {rew.score:.4f}")
 
         return CodeReviewStepResponse(
             observation=obs,
```
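The `get_grader(grader_level)` call added above implies a simple level-to-class dispatch in `server/graders.py`. A hedged sketch of that dispatch, with stub classes standing in for the real graders:

```python
# Hypothetical sketch: the real EasyGrader/MediumGrader/HardGrader classes
# in server/graders.py implement the tier-specific scoring described in the README.
class EasyGrader: ...
class MediumGrader: ...
class HardGrader: ...

_GRADERS = {"easy": EasyGrader, "medium": MediumGrader, "hard": HardGrader}

def get_grader(level: str):
    # Fall back to the medium tier, mirroring the reset() logic above
    return _GRADERS.get(level, MediumGrader)()
```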