h1manshu committed on
Commit a0ea022 · verified · 1 Parent(s): 215184f

Upload folder using huggingface_hub
README.md CHANGED
@@ -13,61 +13,89 @@ tags:
 
 # Code Review Environment
 
- A reinforcement learning benchmark environment where an agent acts as a senior software engineer reviewing pull requests. The agent must identify bugs, suggest fixes, and make approval decisions across progressively harder code review tasks.
 
- ## Quick Start
-
- Install the OpenEnv core package:
 ```bash
 pip install openenv-core
 ```
 
- Clone the repo
 ```bash
 git clone https://github.com/Ajay-Ganapathy/code_review && cd code_review
- ```
-
- Install packages
- ```bash
 uv pip install -e .
 ```
 
- `[OPTIONAL]` To run server locally
 ```bash
 uv run server --host 0.0.0.0 --port 8000
 ```
 
- Run the agent in another terminal
 ```bash
 uv run python inference.py
 ```
 
- ## Motivation
-
- Code review is a high-stakes, multi-step reasoning task that requires an agent to:
-
- - **Detect bugs and security vulnerabilities** from raw code diffs
- - **Generate corrective code** that resolves identified issues
- - **Make a final judgment** (approve/reject) backed by technical reasoning
-
- Existing benchmarks test code generation or comprehension in isolation. This environment tests the full review loop — detection, remediation, and decision-making — in a structured, scorable way. It is designed to evaluate whether LLMs can act as reliable automated reviewers in software development pipelines.
 
  ## Environment Description
-
- The agent receives a pull request observation at each step and must respond with a structured JSON action. The episode runs for up to `MAX_STEPS = 3` steps, following a prescribed workflow:
-
 | Step | Expected Action | Purpose |
 |------|----------------|---------|
 | 1 | `comment` | Identify all issues in the diff |
 | 2 | `suggest_fix` | Provide corrected code |
 | 3 | `final_decision` | Approve or reject the PR |
-
- Each step is independently scored, and the final episode score is the maximum score achieved across all steps.
 
 ## Action Space
-
- Actions are instances of `CodeReviewAction` and must be returned as JSON with the following fields:
-
  ```json
 {
 "action_type": "comment | suggest_fix | final_decision",
@@ -76,7 +104,7 @@ Actions are instances of `CodeReviewAction` and must be returned as JSON with th
 "decision": "approve | reject | null"
 }
 ```
-
 | Field | Type | Required | Description |
 |-------|------|----------|-------------|
 | `action_type` | `str` | Always | One of `comment`, `suggest_fix`, `final_decision` |
@@ -84,10 +112,12 @@ Actions are instances of `CodeReviewAction` and must be returned as JSON with th
 | `suggested_code` | `str \| null` | Step 2 | Corrected code replacing the buggy diff |
 | `decision` | `str \| null` | Step 3 | `approve` or `reject`; `null` otherwise |
 
 ## Observation Space
-
 Each step returns a `CodeReviewObservation` with the following fields:
-
 | Field | Type | Description |
 |-------|------|-------------|
 | `pr` | `CodeReviewPullRequest` | The pull request under review |
@@ -102,93 +132,90 @@ Each step returns a `CodeReviewObservation` with the following fields:
 | `step_count` | `int` | Current step number |
 | `max_steps` | `int` | Maximum steps per episode (default: 3) |
 
  ## Scoring
-
- Each action is scored across three components:
-
- | Component | Weight | Method |
- |-----------|--------|--------|
- | Issue Detection | 40% | Fraction of ground-truth issues mentioned in `comment` |
- | Fix Quality | 30% | Token overlap + sequence similarity between `suggested_code` and ground-truth fix |
- | Decision Accuracy | 30% | Exact match with ground-truth `approve`/`reject`; partial credit (0.2) for wrong decision |
-
- **Bonuses and penalties applied per step:**
-
- - `+0.1` — comment length > 30 characters (encourages detail)
- - `+0.1` — correct final decision reached in step ≤ 2 (encourages efficiency)
- - `-0.1` — no comment provided on a non-decision step (penalizes lazy steps)
- - `-0.05` — step count exceeds 3 (penalizes long trajectories)
-
- The final episode score is the **maximum** `grade_action` score across all steps in the episode. Scores are clamped to `[0.0, 1.0]`.
 
  ## Task Descriptions
-
- The dataset contains tasks at three difficulty levels:
-
 ### Easy
-
- Straightforward single-file issues with an obvious fix.
-
 | PR | Issue | Expected Decision |
 |----|-------|------------------|
 | Missing import | `datetime` used without import | reject |
-
- **What the agent must do:** Detect the missing `from datetime import datetime` statement and supply the corrected import.
-
 ---
-
 ### Medium
-
- Logical or performance issues requiring understanding of Python semantics.
-
 | PR | Issue | Expected Decision |
 |----|-------|------------------|
 | Division function | No guard against division by zero | reject |
- | Inefficient loop | `range(len(arr))` pattern; can use `in` operator | approve |
-
- **What the agent must do:** For the division task, add a `if b == 0: return None` guard. For the loop task, recognize it as a style/efficiency issue but not a correctness bug — the correct decision is **approve**.
-
 ---
-
 ### Hard
-
- Security vulnerabilities, injection attacks, and cross-file null-handling bugs.
-
 | PR | Issue | Expected Decision |
 |----|-------|------------------|
 | Authentication logic | Hardcoded plaintext password `admin123` | reject |
 | SQL query | String concatenation exposes SQL injection | reject |
 | Cross-file null bug | `get_user(None)` called without input validation | reject |
-
 **What the agent must do:**
- - For auth: detect the hardcoded secret and propose `bcrypt`-based password comparison.
- - For SQL: detect string concatenation and replace with a parameterized query (`%s` placeholder + `cursor.execute`).
- - For null bug: validate `id is not None` before the `db[id]` lookup, and fix the call site in `controller.py`.
-
- The agent runs `NUM_EPISODES = 16` episodes (configurable) with each `MAX_STEPS = 3` and logs each step:
-
- ```
- [START] task=code_review env=code_review_benchmark model=meta-llama/Llama-3.1-8B-Instruct
- [STEP] step=1 action=... reward=0.55 done=false error=null
- [STEP] step=2 action=... reward=0.72 done=false error=null
- [STEP] step=3 action=... reward=0.85 done=true error=null
- [END] success=true steps=12 score=0.850 rewards=0.55,0.72,0.85,0.60
- ```
 
- ## Configuration
-
- Key constants in `inference.py`:
-
- | Constant | Default | Description |
- |----------|---------|-------------|
- | `MAX_STEPS` | `3` | Steps per episode |
- | `NUM_EPISODES` | `16` | Number of PRs to review |
- | `TEMPERATURE` | `0.2` | Sampling temperature (lower = more deterministic) |
- | `MAX_TOKENS` | `256` | Max tokens per LLM response |
- | `SUCCESS_SCORE_THRESHOLD` | `0.1` | Minimum normalized score to count as success |
 
- ## Score Interpretation
-
 | Score Range | Interpretation |
 |-------------|---------------|
 | 0.00 – 0.20 | Failing — agent cannot follow the JSON schema or identify basic issues |
@@ -196,7 +223,18 @@ Key constants in `inference.py`:
 | 0.50 – 0.75 | Competent — agent handles easy and medium tasks; struggles with hard security/null cases |
 | 0.75 – 1.00 | Strong — agent reliably detects all issue types, generates correct fixes, and makes sound decisions |
 
 ## Conclusion
-
- The Code Review Environment provides a structured, reproducible benchmark for evaluating LLM-based agents on one of the most practically valuable tasks in software engineering. By decomposing the review process into three distinct steps — issue detection, fix generation, and final judgment — it rewards agents that reason carefully rather than those that simply pattern-match on surface-level symptoms.
 
 # Code Review Environment
 
+ A reinforcement learning benchmark environment where an agent acts as a senior software engineer reviewing pull requests. The agent must identify bugs, suggest fixes, and make approval decisions across progressively harder code review tasks — spanning missing imports, logic errors, and security vulnerabilities.
+
+ ---
+
+ ## Motivation
+
+ Code review is a high-stakes, multi-step reasoning task that requires an agent to:
+
+ - **Detect bugs and security vulnerabilities** from raw code diffs
+ - **Generate corrective code** that resolves identified issues
+ - **Make a final judgment** (approve or reject) backed by technical reasoning
+
+ Existing benchmarks test code generation or comprehension in isolation. This environment tests the full review loop — detection, remediation, and decision-making — in a structured, scorable way. It is designed to evaluate whether LLMs can act as reliable automated reviewers in real software development pipelines.
+
+ ---
+
+ ## Setup and Usage
+
+ ### Install dependencies
 
 ```bash
 pip install openenv-core
 ```
 
 ```bash
 git clone https://github.com/Ajay-Ganapathy/code_review && cd code_review
 uv pip install -e .
 ```
 
+ ### Run the server locally (optional)
+
 ```bash
 uv run server --host 0.0.0.0 --port 8000
 ```
 
+ ### Run the agent
+
 ```bash
 uv run python inference.py
 ```
 
+ ### Environment variables
+
+ Set the following before running:
+
+ | Variable | Description |
+ |----------|-------------|
+ | `API_BASE_URL` | The API endpoint for the LLM (e.g. `https://router.huggingface.co/v1`) |
+ | `MODEL_NAME` | The model identifier to use for inference |
+ | `HF_TOKEN` | Your Hugging Face / API key |
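
For example, the three variables can be exported in the shell before launching the agent. The values below are placeholders, not real credentials:

```shell
# Placeholder values: substitute your own endpoint, model, and token.
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
export HF_TOKEN="hf_xxxxxxxx"

uv run python inference.py
```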
+
+ ### Key constants in `inference.py`
+
+ | Constant | Default | Description |
+ |----------|---------|-------------|
+ | `MAX_STEPS` | `3` | Steps per episode |
+ | `NUM_EPISODES` | `16` | Number of PRs to review |
+ | `TEMPERATURE` | `0.2` | Sampling temperature (lower = more deterministic) |
+ | `MAX_TOKENS` | `512` | Max tokens per LLM response |
+ | `SUCCESS_SCORE_THRESHOLD` | `0.1` | Minimum score to count as success |
+
+ ---
 
 ## Environment Description
+
+ The agent receives a pull request observation at each step and must respond with a structured JSON action. Each episode runs for up to `MAX_STEPS = 3` steps following a fixed workflow:
+
 | Step | Expected Action | Purpose |
 |------|----------------|---------|
 | 1 | `comment` | Identify all issues in the diff |
 | 2 | `suggest_fix` | Provide corrected code |
 | 3 | `final_decision` | Approve or reject the PR |
+
+ Each step is independently scored. The final episode score is the maximum score achieved across all steps.
+
+ The environment automatically selects a grader tier (`easy`, `medium`, or `hard`) based on the `task_type` field of each dataset sample. No manual configuration is needed — the grader switches per episode as `reset()` is called.
+
+ ---
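
The per-episode tier selection reduces to one line of logic, mirroring the `grader_level` expression visible in `server/code_review_environment.py` further down this commit:

```python
def select_grader_level(task_type: str) -> str:
    """Pick the grader tier for a dataset sample, defaulting to medium."""
    # Unknown or missing task types fall back to the medium grader.
    return task_type if task_type in ("easy", "medium", "hard") else "medium"

print(select_grader_level("hard"))     # hard
print(select_grader_level("unknown"))  # medium
```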
 
 ## Action Space
+
+ Actions must be returned as JSON with the following fields:
+
 ```json
 {
 "action_type": "comment | suggest_fix | final_decision",
 
 "decision": "approve | reject | null"
 }
 ```
+
 | Field | Type | Required | Description |
 |-------|------|----------|-------------|
 | `action_type` | `str` | Always | One of `comment`, `suggest_fix`, `final_decision` |
 
 | `suggested_code` | `str \| null` | Step 2 | Corrected code replacing the buggy diff |
 | `decision` | `str \| null` | Step 3 | `approve` or `reject`; `null` otherwise |
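
As a sketch, a client can build one step's action as a plain dict with the field names from the table above; the actual `CodeReviewAction` class lives in the environment package:

```python
import json

# Step 1: a `comment` action. Fields not used at this step stay None.
action = {
    "action_type": "comment",
    "comment": "Bug: `datetime` is used but never imported.",
    "suggested_code": None,
    "decision": None,
}

payload = json.dumps(action)   # what the agent sends
parsed = json.loads(payload)   # what the server reads back
assert parsed["action_type"] in ("comment", "suggest_fix", "final_decision")
print(parsed["action_type"])  # comment
```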
 
+ ---
+
 ## Observation Space
+
 Each step returns a `CodeReviewObservation` with the following fields:
+
 | Field | Type | Description |
 |-------|------|-------------|
 | `pr` | `CodeReviewPullRequest` | The pull request under review |
 
 | `step_count` | `int` | Current step number |
 | `max_steps` | `int` | Maximum steps per episode (default: 3) |
 
+ ---
+
 ## Scoring
+
+ ### Grader tiers
+
+ The dataset contains three difficulty levels, each backed by a dedicated grader class in `graders.py`. The grader is selected automatically from `task_type` in the dataset sample.
+
+ | Tier | Class | Issue matching | Wrong decision | Done scoring |
+ |------|-------|---------------|---------------|--------------|
+ | `easy` | `EasyGrader` | Substring match | 0.2 partial credit | Max over full history |
+ | `medium` | `MediumGrader` | Token overlap + substring fallback | 0.1 partial credit | Recency-weighted max |
+ | `hard` | `HardGrader` | Token overlap + seq sim (threshold 0.3) | No credit | Final step only |
+
+ ### Score components per tier
+
+ | Component | Easy | Medium | Hard |
+ |-----------|------|--------|------|
+ | Issue detection | 40% | 42% | 45% |
+ | Fix quality | 30% | 30% | 28% |
+ | Decision accuracy | 30% | 28% | 27% |
+
+ **Fix quality** is computed as a weighted combination of token overlap, sequence similarity, and (for medium/hard) line-level exact matching. **Issue detection** checks how many ground-truth issues appear in the agent's comment. All scores are clamped to `[0.01, 0.99]`.
+
+ ### Bonuses and penalties
+
+ | Condition | Easy | Medium | Hard |
+ |-----------|------|--------|------|
+ | Comment length > 30 chars | +0.15 | +0.10 | — |
+ | Correct decision at step 1 | +0.10 | +0.10 | +0.05 |
+ | Correct decision at step 2 | +0.10 | +0.05 | — |
+ | No comment on non-decision step | −0.05 | −0.08 | −0.12 |
+ | Step count > 3 | — | −0.04/step | −0.05 × (steps − 2) |
+
+ ---
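
The token-overlap-plus-sequence-similarity idea behind fix quality can be sketched with the standard library. The exact weights and the line-level matching used in `graders.py` are not reproduced here; the 50/50 blend below is an illustrative assumption:

```python
from difflib import SequenceMatcher

def fix_quality(suggested: str, ground_truth: str) -> float:
    """Blend token overlap with character-level sequence similarity."""
    sug_tokens, gt_tokens = set(suggested.split()), set(ground_truth.split())
    overlap = len(sug_tokens & gt_tokens) / len(gt_tokens) if gt_tokens else 0.0
    seq_sim = SequenceMatcher(None, suggested, ground_truth).ratio()
    return 0.5 * overlap + 0.5 * seq_sim  # assumed equal weighting

score = fix_quality("if b == 0: return None", "if b == 0:\n    return None")
assert 0.0 <= score <= 1.0
```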
170
 
171
  ## Task Descriptions
172
+
 
 
173
  ### Easy
174
+
175
+ Straightforward single-file issues with an obvious fix. The `EasyGrader` uses simple substring matching β€” the agent gets full issue credit if the issue phrase appears anywhere in the comment.
176
+
177
  | PR | Issue | Expected Decision |
178
  |----|-------|------------------|
179
  | Missing import | `datetime` used without import | reject |
180
+
181
+ **What the agent must do:** Detect the missing `from datetime import datetime` statement and supply the corrected import line.
182
+
183
  ---
184
+
185
  ### Medium
186
+
187
+ Logical or performance issues that require understanding of Python semantics. The `MediumGrader` uses token overlap so paraphrased descriptions still score well.
188
+
189
  | PR | Issue | Expected Decision |
190
  |----|-------|------------------|
191
  | Division function | No guard against division by zero | reject |
192
+ | Inefficient loop | `range(len(arr))` pattern; can use `in` directly | approve |
193
+
194
+ **What the agent must do:** For the division task, add a `if b == 0: return None` guard. For the loop task, recognise it as a style issue but not a correctness bug β€” the correct decision is **approve**.
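
The expected fix for the division task can be written out directly from the description above (returning `None` on a zero divisor):

```python
def divide(a: float, b: float):
    # Ground-truth fix: guard against division by zero.
    if b == 0:
        return None
    return a / b

print(divide(10, 2))  # 5.0
print(divide(1, 0))   # None
```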
+
 ---
+
 ### Hard
+
+ Security vulnerabilities, injection attacks, and cross-file null-handling bugs. The `HardGrader` applies a minimum similarity threshold: vague or generic comments receive zero issue credit.
+
 | PR | Issue | Expected Decision |
 |----|-------|------------------|
 | Authentication logic | Hardcoded plaintext password `admin123` | reject |
 | SQL query | String concatenation exposes SQL injection | reject |
 | Cross-file null bug | `get_user(None)` called without input validation | reject |
+
 **What the agent must do:**
+ - **Auth:** Detect the hardcoded secret and propose `bcrypt`-based password comparison.
+ - **SQL:** Detect string concatenation and replace with a parameterised query using `%s` placeholder + `cursor.execute`.
+ - **Null bug:** Validate `id is not None` before the `db[id]` lookup and fix the call site in `controller.py`.
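
For the SQL task, the remediation pattern described in the bullet above (a `%s` placeholder passed to `cursor.execute`) contrasts with the vulnerable concatenation like so; the table name and cursor here are illustrative, not taken from the dataset:

```python
# Vulnerable: user input is spliced into the SQL string, enabling injection.
def get_user_unsafe(cursor, username: str):
    cursor.execute("SELECT * FROM users WHERE name = '" + username + "'")

# Fixed: the value travels as a separate parameter and the driver escapes it.
def get_user_safe(cursor, username: str):
    cursor.execute("SELECT * FROM users WHERE name = %s", (username,))
```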
 
+ ---
+
+ ## Baseline Scores
+
+ Expected performance ranges by model capability:
 
 | Score Range | Interpretation |
 |-------------|---------------|
 | 0.00 – 0.20 | Failing — agent cannot follow the JSON schema or identify basic issues |
 
 | 0.50 – 0.75 | Competent — agent handles easy and medium tasks; struggles with hard security/null cases |
 | 0.75 – 1.00 | Strong — agent reliably detects all issue types, generates correct fixes, and makes sound decisions |
 
+ ### Step-level log format
+
+ ```
+ [START] task=code_review env=code_review_benchmark model=meta-llama/Llama-3.1-8B-Instruct
+ [STEP] step=1 action=comment reward=0.55 done=false error=null
+ [STEP] step=2 action=suggest_fix reward=0.72 done=false error=null
+ [STEP] step=3 action=final_decision reward=0.85 done=true error=null
+ [END] success=true steps=3 score=0.850 rewards=0.55,0.72,0.85
+ ```
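
Step lines in this format are easy to parse back into structured records; a sketch using the standard library, with the field layout taken from the example above:

```python
import re

line = "[STEP] step=2 action=suggest_fix reward=0.72 done=false error=null"
pattern = r"\[STEP\] step=(\d+) action=(\w+) reward=([\d.]+) done=(\w+) error=(\w+)"
m = re.match(pattern, line)
assert m is not None
step, action, reward = int(m.group(1)), m.group(2), float(m.group(3))
print(step, action, reward)  # 2 suggest_fix 0.72
```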
+
+ ---
+
 ## Conclusion
 
+ The Code Review Environment provides a structured, reproducible benchmark for evaluating LLM-based agents on one of the most practically valuable tasks in software engineering. By decomposing the review process into three distinct steps — issue detection, fix generation, and final judgment — and by scaling difficulty through dedicated grader tiers, it rewards agents that reason carefully rather than those that simply pattern-match on surface-level symptoms.
openenv.yaml CHANGED
@@ -8,67 +8,67 @@ tasks:
 - id: task_1
 description: "Easy — missing import detection"
 max_steps: 3
- grader: graders:EasyGrader
+ grader: server.graders:EasyGrader
 - id: task_2
 description: "Easy — missing return statement"
 max_steps: 3
- grader: graders:EasyGrader
+ grader: server.graders:EasyGrader
 - id: task_3
 description: "Easy — wrong comparison operator"
 max_steps: 3
- grader: graders:EasyGrader
+ grader: server.graders:EasyGrader
 - id: task_4
 description: "Easy — undefined variable"
 max_steps: 3
- grader: graders:EasyGrader
+ grader: server.graders:EasyGrader
 - id: task_5
 description: "Easy — clean utility function"
 max_steps: 3
- grader: graders:EasyGrader
+ grader: server.graders:EasyGrader
 - id: task_6
 description: "Medium — division by zero handling"
 max_steps: 3
- grader: graders:MediumGrader
+ grader: server.graders:MediumGrader
 - id: task_7
 description: "Medium — inefficient loop optimization"
 max_steps: 3
- grader: graders:MediumGrader
+ grader: server.graders:MediumGrader
 - id: task_8
 description: "Medium — mutable default argument"
 max_steps: 3
- grader: graders:MediumGrader
+ grader: server.graders:MediumGrader
 - id: task_9
 description: "Medium — unhandled exception"
 max_steps: 3
- grader: graders:MediumGrader
+ grader: server.graders:MediumGrader
 - id: task_10
 description: "Medium — missing input validation"
 max_steps: 3
- grader: graders:MediumGrader
+ grader: server.graders:MediumGrader
 - id: task_11
 description: "Hard — hardcoded password security vulnerability"
 max_steps: 3
- grader: graders:HardGrader
+ grader: server.graders:HardGrader
 - id: task_12
 description: "Hard — SQL injection vulnerability"
 max_steps: 3
- grader: graders:HardGrader
+ grader: server.graders:HardGrader
 - id: task_13
 description: "Hard — cross-file null handling bug"
 max_steps: 3
- grader: graders:HardGrader
+ grader: server.graders:HardGrader
 - id: task_14
 description: "Hard — race condition in counter"
 max_steps: 3
- grader: graders:HardGrader
+ grader: server.graders:HardGrader
 - id: task_15
 description: "Hard — insecure deserialization"
 max_steps: 3
- grader: graders:HardGrader
+ grader: server.graders:HardGrader
 - id: task_16
 description: "Hard — path traversal vulnerability"
 max_steps: 3
- grader: graders:HardGrader
+ grader: server.graders:HardGrader
 endpoints:
 reset: /reset
 step: /step
openenv_code_review.egg-info/SOURCES.txt CHANGED
@@ -12,4 +12,5 @@ openenv_code_review.egg-info/requires.txt
 openenv_code_review.egg-info/top_level.txt
 server/__init__.py
 server/app.py
- server/code_review_environment.py
+ server/code_review_environment.py
+ server/graders.py
server/code_review_environment.py CHANGED
@@ -80,11 +80,11 @@ class CodeReviewEnvironment(Environment):
 
 self.sample = self.dataset[self.task_index % len(self.dataset)]
 
- self.pr = CodeReviewPullRequest(**self.sample["pr"])
- self.gt = self.sample["ground_truth"]
- self.task_type = self.sample.get("task_type", "unknown")
- grader_level = self.task_type if self.task_type in ("easy", "medium", "hard") else "medium"
- self.grader = get_grader(grader_level)
+ self.pr = CodeReviewPullRequest(**self.sample["pr"])
+ self.gt = self.sample["ground_truth"]
+ self.task_type = self.sample.get("task_type", "unknown")
+ grader_level = self.task_type if self.task_type in ("easy", "medium", "hard") else "medium"
+ self.grader = get_grader(grader_level)
 self.grader_level = grader_level
 
 self.history = []
@@ -166,7 +166,7 @@ class CodeReviewEnvironment(Environment):
 )
 
 rew = CodeReviewReward(score=score, feedback="graded")
- # print(f"[{self.grader_level.upper()}] Step {self.step_count} — Score: {rew.score:.4f}")
+ print(f"[{self.grader_level.upper()}] Step {self.step_count} — Score: {rew.score:.4f}")
 
 return CodeReviewStepResponse(
 observation=obs,