h1manshu committed on
Commit a0ea022 · verified · 1 Parent(s): 215184f

Upload folder using huggingface_hub
README.md CHANGED
@@ -13,61 +13,89 @@ tags:
 
 # Code Review Environment
 
- A reinforcement learning benchmark environment where an agent acts as a senior software engineer reviewing pull requests. The agent must identify bugs, suggest fixes, and make approval decisions across progressively harder code review tasks.
 
- ## Quick Start
-
- Install the OpenEnv core package:
 ```bash
 pip install openenv-core
 ```
 
- Clone the repo
 ```bash
 git clone https://github.com/Ajay-Ganapathy/code_review && cd code_review
- ```
-
- Install packages
- ```bash
 uv pip install -e .
 ```
 
- `[OPTIONAL]` To run server locally
 ```bash
 uv run server --host 0.0.0.0 --port 8000
 ```
 
- Run the agent in another terminal
 ```bash
 uv run python inference.py
 ```
 
- ## Motivation
-
- Code review is a high-stakes, multi-step reasoning task that requires an agent to:
-
- - **Detect bugs and security vulnerabilities** from raw code diffs
- - **Generate corrective code** that resolves identified issues
- - **Make a final judgment** (approve/reject) backed by technical reasoning
-
- Existing benchmarks test code generation or comprehension in isolation. This environment tests the full review loop — detection, remediation, and decision-making — in a structured, scorable way. It is designed to evaluate whether LLMs can act as reliable automated reviewers in software development pipelines.
 
  ## Environment Description
-
- The agent receives a pull request observation at each step and must respond with a structured JSON action. The episode runs for up to `MAX_STEPS = 3` steps, following a prescribed workflow:
-
 | Step | Expected Action | Purpose |
 |------|----------------|---------|
 | 1 | `comment` | Identify all issues in the diff |
 | 2 | `suggest_fix` | Provide corrected code |
 | 3 | `final_decision` | Approve or reject the PR |
-
- Each step is independently scored, and the final episode score is the maximum score achieved across all steps.
 
 ## Action Space
-
- Actions are instances of `CodeReviewAction` and must be returned as JSON with the following fields:
-
  ```json
 {
 "action_type": "comment | suggest_fix | final_decision",
@@ -76,7 +104,7 @@ Actions are instances of `CodeReviewAction` and must be returned as JSON with th
 "decision": "approve | reject | null"
 }
 ```
-
 | Field | Type | Required | Description |
 |-------|------|----------|-------------|
 | `action_type` | `str` | Always | One of `comment`, `suggest_fix`, `final_decision` |
@@ -84,10 +112,12 @@ Actions are instances of `CodeReviewAction` and must be returned as JSON with th
 | `suggested_code` | `str \| null` | Step 2 | Corrected code replacing the buggy diff |
 | `decision` | `str \| null` | Step 3 | `approve` or `reject`; `null` otherwise |
 
 ## Observation Space
-
 Each step returns a `CodeReviewObservation` with the following fields:
-
 | Field | Type | Description |
 |-------|------|-------------|
 | `pr` | `CodeReviewPullRequest` | The pull request under review |
@@ -102,93 +132,90 @@ Each step returns a `CodeReviewObservation` with the following fields:
 | `step_count` | `int` | Current step number |
 | `max_steps` | `int` | Maximum steps per episode (default: 3) |
 
  ## Scoring
-
- Each action is scored across three components:
-
- | Component | Weight | Method |
- |-----------|--------|--------|
- | Issue Detection | 40% | Fraction of ground-truth issues mentioned in `comment` |
- | Fix Quality | 30% | Token overlap + sequence similarity between `suggested_code` and ground-truth fix |
- | Decision Accuracy | 30% | Exact match with ground-truth `approve`/`reject`; partial credit (0.2) for wrong decision |
-
- **Bonuses and penalties applied per step:**
-
- - `+0.1` — comment length > 30 characters (encourages detail)
- - `+0.1` — correct final decision reached in step ≤ 2 (encourages efficiency)
- - `-0.1` — no comment provided on a non-decision step (penalizes lazy steps)
- - `-0.05` — step count exceeds 3 (penalizes long trajectories)
-
- The final episode score is the **maximum** `grade_action` score across all steps in the episode. Scores are clamped to `[0.0, 1.0]`.
 
  ## Task Descriptions
-
- The dataset contains tasks at three difficulty levels:
-
 ### Easy
-
- Straightforward single-file issues with an obvious fix.
-
 | PR | Issue | Expected Decision |
 |----|-------|------------------|
 | Missing import | `datetime` used without import | reject |
-
- **What the agent must do:** Detect the missing `from datetime import datetime` statement and supply the corrected import.
-
 ---
-
 ### Medium
-
- Logical or performance issues requiring understanding of Python semantics.
-
 | PR | Issue | Expected Decision |
 |----|-------|------------------|
 | Division function | No guard against division by zero | reject |
- | Inefficient loop | `range(len(arr))` pattern; can use `in` operator | approve |
-
- **What the agent must do:** For the division task, add a `if b == 0: return None` guard. For the loop task, recognize it as a style/efficiency issue but not a correctness bug — the correct decision is **approve**.
-
 ---
-
 ### Hard
-
- Security vulnerabilities, injection attacks, and cross-file null-handling bugs.
-
 | PR | Issue | Expected Decision |
 |----|-------|------------------|
 | Authentication logic | Hardcoded plaintext password `admin123` | reject |
 | SQL query | String concatenation exposes SQL injection | reject |
 | Cross-file null bug | `get_user(None)` called without input validation | reject |
-
 **What the agent must do:**
- - For auth: detect the hardcoded secret and propose `bcrypt`-based password comparison.
- - For SQL: detect string concatenation and replace with a parameterized query (`%s` placeholder + `cursor.execute`).
- - For null bug: validate `id is not None` before the `db[id]` lookup, and fix the call site in `controller.py`.
-
- The agent runs `NUM_EPISODES = 16` episodes (configurable) with each `MAX_STEPS = 3` and logs each step:
-
- ```
- [START] task=code_review env=code_review_benchmark model=meta-llama/Llama-3.1-8B-Instruct
- [STEP] step=1 action=... reward=0.55 done=false error=null
- [STEP] step=2 action=... reward=0.72 done=false error=null
- [STEP] step=3 action=... reward=0.85 done=true error=null
- [END] success=true steps=12 score=0.850 rewards=0.55,0.72,0.85,0.60
- ```
 
- ## Configuration
-
- Key constants in `inference.py`:
-
- | Constant | Default | Description |
- |----------|---------|-------------|
- | `MAX_STEPS` | `3` | Steps per episode |
- | `NUM_EPISODES` | `16` | Number of PRs to review |
- | `TEMPERATURE` | `0.2` | Sampling temperature (lower = more deterministic) |
- | `MAX_TOKENS` | `256` | Max tokens per LLM response |
- | `SUCCESS_SCORE_THRESHOLD` | `0.1` | Minimum normalized score to count as success |
 
- ## Score Interpretation
-
 | Score Range | Interpretation |
 |-------------|---------------|
 | 0.00 – 0.20 | Failing — agent cannot follow the JSON schema or identify basic issues |
@@ -196,7 +223,18 @@ Key constants in `inference.py`:
 | 0.50 – 0.75 | Competent — agent handles easy and medium tasks; struggles with hard security/null cases |
 | 0.75 – 1.00 | Strong — agent reliably detects all issue types, generates correct fixes, and makes sound decisions |
 
 ## Conclusion
-
- The Code Review Environment provides a structured, reproducible benchmark for evaluating LLM-based agents on one of the most practically valuable tasks in software engineering. By decomposing the review process into three distinct steps — issue detection, fix generation, and final judgment — it rewards agents that reason carefully rather than those that simply pattern-match on surface-level symptoms.
 
 # Code Review Environment
 
+ A reinforcement learning benchmark environment where an agent acts as a senior software engineer reviewing pull requests. The agent must identify bugs, suggest fixes, and make approval decisions across progressively harder code review tasks — spanning missing imports, logic errors, and security vulnerabilities.
+
+ ---
+
+ ## Motivation
+
+ Code review is a high-stakes, multi-step reasoning task that requires an agent to:
+
+ - **Detect bugs and security vulnerabilities** from raw code diffs
+ - **Generate corrective code** that resolves identified issues
+ - **Make a final judgment** (approve or reject) backed by technical reasoning
+
+ Existing benchmarks test code generation or comprehension in isolation. This environment tests the full review loop — detection, remediation, and decision-making — in a structured, scorable way. It is designed to evaluate whether LLMs can act as reliable automated reviewers in real software development pipelines.
+
+ ---
+
+ ## Setup and Usage
+
+ ### Install dependencies
 
 ```bash
 pip install openenv-core
 ```
 
 ```bash
 git clone https://github.com/Ajay-Ganapathy/code_review && cd code_review
 uv pip install -e .
 ```
 
+ ### Run the server locally (optional)
+
 ```bash
 uv run server --host 0.0.0.0 --port 8000
 ```
 
+ ### Run the agent
+
 ```bash
 uv run python inference.py
 ```
 
+ ### Environment variables
+
+ Set the following before running:
+
+ | Variable | Description |
+ |----------|-------------|
+ | `API_BASE_URL` | The API endpoint for the LLM (e.g. `https://router.huggingface.co/v1`) |
+ | `MODEL_NAME` | The model identifier to use for inference |
+ | `HF_TOKEN` | Your Hugging Face / API key |
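
For example, the three variables can be exported in the shell before launching the agent. The values below are placeholders, not real credentials:

```shell
# Placeholder values: substitute your own endpoint, model, and token.
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
export HF_TOKEN="hf_xxxxxxxx"

uv run python inference.py
```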
+
+ ### Key constants in `inference.py`
+
+ | Constant | Default | Description |
+ |----------|---------|-------------|
+ | `MAX_STEPS` | `3` | Steps per episode |
+ | `NUM_EPISODES` | `16` | Number of PRs to review |
+ | `TEMPERATURE` | `0.2` | Sampling temperature (lower = more deterministic) |
+ | `MAX_TOKENS` | `512` | Max tokens per LLM response |
+ | `SUCCESS_SCORE_THRESHOLD` | `0.1` | Minimum score to count as success |
+
+ ---
 
 ## Environment Description
+
+ The agent receives a pull request observation at each step and must respond with a structured JSON action. Each episode runs for up to `MAX_STEPS = 3` steps following a fixed workflow:
+
 | Step | Expected Action | Purpose |
 |------|----------------|---------|
 | 1 | `comment` | Identify all issues in the diff |
 | 2 | `suggest_fix` | Provide corrected code |
 | 3 | `final_decision` | Approve or reject the PR |
+
+ Each step is independently scored. The final episode score is the maximum score achieved across all steps.
+
+ The environment automatically selects a grader tier (`easy`, `medium`, or `hard`) based on the `task_type` field of each dataset sample. No manual configuration is needed — the grader switches per episode as `reset()` is called.
+
+ ---
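
The per-episode tier selection reduces to one line of logic, mirroring the `grader_level` expression visible in `server/code_review_environment.py` further down this commit:

```python
def select_grader_level(task_type: str) -> str:
    """Pick the grader tier for a dataset sample, defaulting to medium."""
    # Unknown or missing task types fall back to the medium grader.
    return task_type if task_type in ("easy", "medium", "hard") else "medium"

print(select_grader_level("hard"))     # hard
print(select_grader_level("unknown"))  # medium
```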
 
 ## Action Space
+
+ Actions must be returned as JSON with the following fields:
+
 ```json
 {
 "action_type": "comment | suggest_fix | final_decision",
 
 "decision": "approve | reject | null"
 }
 ```
+
 | Field | Type | Required | Description |
 |-------|------|----------|-------------|
 | `action_type` | `str` | Always | One of `comment`, `suggest_fix`, `final_decision` |
 
 | `suggested_code` | `str \| null` | Step 2 | Corrected code replacing the buggy diff |
 | `decision` | `str \| null` | Step 3 | `approve` or `reject`; `null` otherwise |
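
As a sketch, a client can build one step's action as a plain dict with the field names from the table above; the actual `CodeReviewAction` class lives in the environment package:

```python
import json

# Step 1: a `comment` action. Fields not used at this step stay None.
action = {
    "action_type": "comment",
    "comment": "Bug: `datetime` is used but never imported.",
    "suggested_code": None,
    "decision": None,
}

payload = json.dumps(action)   # what the agent sends
parsed = json.loads(payload)   # what the server reads back
assert parsed["action_type"] in ("comment", "suggest_fix", "final_decision")
print(parsed["action_type"])  # comment
```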
 
+ ---
+
 ## Observation Space
+
 Each step returns a `CodeReviewObservation` with the following fields:
+
 | Field | Type | Description |
 |-------|------|-------------|
 | `pr` | `CodeReviewPullRequest` | The pull request under review |
 
 | `step_count` | `int` | Current step number |
 | `max_steps` | `int` | Maximum steps per episode (default: 3) |
 
+ ---
+
 ## Scoring
+
+ ### Grader tiers
+
+ The dataset contains three difficulty levels, each backed by a dedicated grader class in `graders.py`. The grader is selected automatically from `task_type` in the dataset sample.
+
+ | Tier | Class | Issue matching | Wrong decision | Done scoring |
+ |------|-------|---------------|---------------|--------------|
+ | `easy` | `EasyGrader` | Substring match | 0.2 partial credit | Max over full history |
+ | `medium` | `MediumGrader` | Token overlap + substring fallback | 0.1 partial credit | Recency-weighted max |
+ | `hard` | `HardGrader` | Token overlap + seq sim (threshold 0.3) | No credit | Final step only |
+
+ ### Score components per tier
+
+ | Component | Easy | Medium | Hard |
+ |-----------|------|--------|------|
+ | Issue detection | 40% | 42% | 45% |
+ | Fix quality | 30% | 30% | 28% |
+ | Decision accuracy | 30% | 28% | 27% |
+
+ **Fix quality** is computed as a weighted combination of token overlap, sequence similarity, and (for medium/hard) line-level exact matching. **Issue detection** checks how many ground-truth issues appear in the agent's comment. All scores are clamped to `[0.01, 0.99]`.
+
+ ### Bonuses and penalties
+
+ | Condition | Easy | Medium | Hard |
+ |-----------|------|--------|------|
+ | Comment length > 30 chars | +0.15 | +0.10 | — |
+ | Correct decision at step 1 | +0.10 | +0.10 | +0.05 |
+ | Correct decision at step 2 | +0.10 | +0.05 | — |
+ | No comment on non-decision step | −0.05 | −0.08 | −0.12 |
+ | Step count > 3 | — | −0.04/step | −0.05 × (steps − 2) |
+
+ ---
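
The token-overlap-plus-sequence-similarity idea behind fix quality can be sketched with the standard library. The exact weights and the line-level matching used in `graders.py` are not reproduced here; the 50/50 blend below is an illustrative assumption:

```python
from difflib import SequenceMatcher

def fix_quality(suggested: str, ground_truth: str) -> float:
    """Blend token overlap with character-level sequence similarity."""
    sug_tokens, gt_tokens = set(suggested.split()), set(ground_truth.split())
    overlap = len(sug_tokens & gt_tokens) / len(gt_tokens) if gt_tokens else 0.0
    seq_sim = SequenceMatcher(None, suggested, ground_truth).ratio()
    return 0.5 * overlap + 0.5 * seq_sim  # assumed equal weighting

score = fix_quality("if b == 0: return None", "if b == 0:\n    return None")
assert 0.0 <= score <= 1.0
```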
170
 
171
  ## Task Descriptions
172
+
 
 
173
  ### Easy
174
+
175
+ Straightforward single-file issues with an obvious fix. The `EasyGrader` uses simple substring matching β€” the agent gets full issue credit if the issue phrase appears anywhere in the comment.
176
+
177
  | PR | Issue | Expected Decision |
178
  |----|-------|------------------|
179
  | Missing import | `datetime` used without import | reject |
180
+
181
+ **What the agent must do:** Detect the missing `from datetime import datetime` statement and supply the corrected import line.
182
+
183
  ---
184
+
185
  ### Medium
186
+
187
+ Logical or performance issues that require understanding of Python semantics. The `MediumGrader` uses token overlap so paraphrased descriptions still score well.
188
+
189
  | PR | Issue | Expected Decision |
190
  |----|-------|------------------|
191
  | Division function | No guard against division by zero | reject |
192
+ | Inefficient loop | `range(len(arr))` pattern; can use `in` directly | approve |
193
+
194
+ **What the agent must do:** For the division task, add a `if b == 0: return None` guard. For the loop task, recognise it as a style issue but not a correctness bug β€” the correct decision is **approve**.
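
The expected fix for the division task can be written out directly from the description above (returning `None` on a zero divisor):

```python
def divide(a: float, b: float):
    # Ground-truth fix: guard against division by zero.
    if b == 0:
        return None
    return a / b

print(divide(10, 2))  # 5.0
print(divide(1, 0))   # None
```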
+
 ---
+
 ### Hard
+
+ Security vulnerabilities, injection attacks, and cross-file null-handling bugs. The `HardGrader` applies a minimum similarity threshold: vague or generic comments receive zero issue credit.
+
 | PR | Issue | Expected Decision |
 |----|-------|------------------|
 | Authentication logic | Hardcoded plaintext password `admin123` | reject |
 | SQL query | String concatenation exposes SQL injection | reject |
 | Cross-file null bug | `get_user(None)` called without input validation | reject |
+
 **What the agent must do:**
+ - **Auth:** Detect the hardcoded secret and propose `bcrypt`-based password comparison.
+ - **SQL:** Detect string concatenation and replace with a parameterised query using `%s` placeholder + `cursor.execute`.
+ - **Null bug:** Validate `id is not None` before the `db[id]` lookup and fix the call site in `controller.py`.
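
For the SQL task, the remediation pattern described in the bullet above (a `%s` placeholder passed to `cursor.execute`) contrasts with the vulnerable concatenation like so; the table name and cursor here are illustrative, not taken from the dataset:

```python
# Vulnerable: user input is spliced into the SQL string, enabling injection.
def get_user_unsafe(cursor, username: str):
    cursor.execute("SELECT * FROM users WHERE name = '" + username + "'")

# Fixed: the value travels as a separate parameter and the driver escapes it.
def get_user_safe(cursor, username: str):
    cursor.execute("SELECT * FROM users WHERE name = %s", (username,))
```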
 
+ ---
+
+ ## Baseline Scores
+
+ Expected performance ranges by model capability:
 
 | Score Range | Interpretation |
 |-------------|---------------|
 | 0.00 – 0.20 | Failing — agent cannot follow the JSON schema or identify basic issues |
 
 | 0.50 – 0.75 | Competent — agent handles easy and medium tasks; struggles with hard security/null cases |
 | 0.75 – 1.00 | Strong — agent reliably detects all issue types, generates correct fixes, and makes sound decisions |
 
+ ### Step-level log format
+
+ ```
+ [START] task=code_review env=code_review_benchmark model=meta-llama/Llama-3.1-8B-Instruct
+ [STEP] step=1 action=comment reward=0.55 done=false error=null
+ [STEP] step=2 action=suggest_fix reward=0.72 done=false error=null
+ [STEP] step=3 action=final_decision reward=0.85 done=true error=null
+ [END] success=true steps=3 score=0.850 rewards=0.55,0.72,0.85
+ ```
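
Step lines in this format are easy to parse back into structured records; a sketch using the standard library, with the field layout taken from the example above:

```python
import re

line = "[STEP] step=2 action=suggest_fix reward=0.72 done=false error=null"
pattern = r"\[STEP\] step=(\d+) action=(\w+) reward=([\d.]+) done=(\w+) error=(\w+)"
m = re.match(pattern, line)
assert m is not None
step, action, reward = int(m.group(1)), m.group(2), float(m.group(3))
print(step, action, reward)  # 2 suggest_fix 0.72
```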
+
+ ---
+
 ## Conclusion
 
+ The Code Review Environment provides a structured, reproducible benchmark for evaluating LLM-based agents on one of the most practically valuable tasks in software engineering. By decomposing the review process into three distinct steps — issue detection, fix generation, and final judgment — and by scaling difficulty through dedicated grader tiers, it rewards agents that reason carefully rather than those that simply pattern-match on surface-level symptoms.
openenv.yaml CHANGED
@@ -8,67 +8,67 @@ tasks:
 - id: task_1
 description: "Easy — missing import detection"
 max_steps: 3
- grader: graders:EasyGrader
+ grader: server.graders:EasyGrader
 - id: task_2
 description: "Easy — missing return statement"
 max_steps: 3
- grader: graders:EasyGrader
+ grader: server.graders:EasyGrader
 - id: task_3
 description: "Easy — wrong comparison operator"
 max_steps: 3
- grader: graders:EasyGrader
+ grader: server.graders:EasyGrader
 - id: task_4
 description: "Easy — undefined variable"
 max_steps: 3
- grader: graders:EasyGrader
+ grader: server.graders:EasyGrader
 - id: task_5
 description: "Easy — clean utility function"
 max_steps: 3
- grader: graders:EasyGrader
+ grader: server.graders:EasyGrader
 - id: task_6
 description: "Medium — division by zero handling"
 max_steps: 3
- grader: graders:MediumGrader
+ grader: server.graders:MediumGrader
 - id: task_7
 description: "Medium — inefficient loop optimization"
 max_steps: 3
- grader: graders:MediumGrader
+ grader: server.graders:MediumGrader
 - id: task_8
 description: "Medium — mutable default argument"
 max_steps: 3
- grader: graders:MediumGrader
+ grader: server.graders:MediumGrader
 - id: task_9
 description: "Medium — unhandled exception"
 max_steps: 3
- grader: graders:MediumGrader
+ grader: server.graders:MediumGrader
 - id: task_10
 description: "Medium — missing input validation"
 max_steps: 3
- grader: graders:MediumGrader
+ grader: server.graders:MediumGrader
 - id: task_11
 description: "Hard — hardcoded password security vulnerability"
 max_steps: 3
- grader: graders:HardGrader
+ grader: server.graders:HardGrader
 - id: task_12
 description: "Hard — SQL injection vulnerability"
 max_steps: 3
- grader: graders:HardGrader
+ grader: server.graders:HardGrader
 - id: task_13
 description: "Hard — cross-file null handling bug"
 max_steps: 3
- grader: graders:HardGrader
+ grader: server.graders:HardGrader
 - id: task_14
 description: "Hard — race condition in counter"
 max_steps: 3
- grader: graders:HardGrader
+ grader: server.graders:HardGrader
 - id: task_15
 description: "Hard — insecure deserialization"
 max_steps: 3
- grader: graders:HardGrader
+ grader: server.graders:HardGrader
 - id: task_16
 description: "Hard — path traversal vulnerability"
 max_steps: 3
- grader: graders:HardGrader
+ grader: server.graders:HardGrader
 endpoints:
 reset: /reset
 step: /step
openenv_code_review.egg-info/SOURCES.txt CHANGED
@@ -12,4 +12,5 @@ openenv_code_review.egg-info/requires.txt
 openenv_code_review.egg-info/top_level.txt
 server/__init__.py
 server/app.py
- server/code_review_environment.py
+ server/code_review_environment.py
+ server/graders.py
server/code_review_environment.py CHANGED
@@ -80,11 +80,11 @@ class CodeReviewEnvironment(Environment):
 
 self.sample = self.dataset[self.task_index % len(self.dataset)]
 
- self.pr = CodeReviewPullRequest(**self.sample["pr"])
- self.gt = self.sample["ground_truth"]
- self.task_type = self.sample.get("task_type", "unknown")
- grader_level = self.task_type if self.task_type in ("easy", "medium", "hard") else "medium"
- self.grader = get_grader(grader_level)
+ self.pr = CodeReviewPullRequest(**self.sample["pr"])
+ self.gt = self.sample["ground_truth"]
+ self.task_type = self.sample.get("task_type", "unknown")
+ grader_level = self.task_type if self.task_type in ("easy", "medium", "hard") else "medium"
+ self.grader = get_grader(grader_level)
 self.grader_level = grader_level
 
 self.history = []
@@ -166,7 +166,7 @@ class CodeReviewEnvironment(Environment):
 )
 
 rew = CodeReviewReward(score=score, feedback="graded")
- # print(f"[{self.grader_level.upper()}] Step {self.step_count} — Score: {rew.score:.4f}")
+ print(f"[{self.grader_level.upper()}] Step {self.step_count} — Score: {rew.score:.4f}")
 
 return CodeReviewStepResponse(
 observation=obs,