jskarthik45 committed on
Commit 0fe141f · 1 Parent(s): 8c883c7

Initial commit: Code Review Environment
README.md CHANGED
@@ -1,11 +1,262 @@
  ---
  title: CodeReviewEnv
  emoji: 🔍
- colorFrom: blue
- colorTo: indigo
  sdk: docker
  pinned: false
  ---
-
- # Code Review Environment
- My custom environment for code review tasks.
+ # 🔍 CodeReviewEnv
+
+ > An OpenEnv-compliant benchmark environment where AI agents act as senior engineers reviewing pull requests: catching bugs, finding security holes, and fixing broken code.
+
+ ---
+
+ ## Overview & Motivation
+
+ Code review is one of the highest-leverage activities in software engineering, yet it is time-consuming, inconsistent, and cognitively demanding. A model that can reliably triage pull requests, identify security vulnerabilities, and produce corrected patches would meaningfully accelerate software delivery.
+
+ **CodeReviewEnv** simulates exactly this. Three tasks of increasing difficulty present agents with realistic pull requests containing planted defects. The agent must reason over code, report issues with structured annotations, submit a corrected patch, and deliver a final verdict, all within a bounded step budget.
+
+ ---
+
+ ## Environment Architecture
+
+ ```
+ code-review-env/
+ ├── env.py                 # Core OpenEnv environment (reset / step / state)
+ ├── server.py              # FastAPI HTTP server exposing the OpenEnv interface
+ ├── models.py              # Pydantic typed models: Action, Observation, Reward, State
+ ├── openenv.yaml           # OpenEnv metadata
+ ├── tasks/
+ │   ├── task1_easy.py      # Bug hunt: simple Python utility
+ │   ├── task2_medium.py    # Security audit: Flask auth endpoint
+ │   └── task3_hard.py      # Correctness: distributed LRU cache
+ ├── graders/
+ │   └── grader.py          # Deterministic keyword + AST graders
+ ├── agents/
+ │   └── baseline_agent.py  # HF Inference API baseline (OpenAI-compatible)
+ ├── Dockerfile
+ ├── requirements.txt
+ └── README.md
+ ```
+
+ ---
+
+ ## Action Space
+
+ Each agent turn is a single `ReviewAction` JSON object:
+
+ | Field | Type | Description |
+ |---|---|---|
+ | `action_type` | `"review" \| "patch" \| "comment" \| "submit"` | What the agent is doing |
+ | `severity` | `"critical" \| "major" \| "minor" \| "info"` | Issue severity (for `review`) |
+ | `issue_type` | `"bug" \| "security" \| "performance" \| "logic" \| "style"` | Issue category |
+ | `line_number` | `int \| null` | Line the issue is on |
+ | `description` | `str` | Concise natural-language description of the issue |
+ | `patched_code` | `str \| null` | Full corrected code (for `patch` actions) |
+ | `comment` | `str \| null` | Free-form annotation |
+ | `verdict` | `"approve" \| "request_changes" \| "reject"` | Final verdict (for `submit`) |
+ | `confidence` | `float [0.0, 1.0]` | Agent's self-reported confidence |
+
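For illustration, a single review turn might emit the following (the field values here are hypothetical; the schema is the table above):

```python
import json

# A hypothetical "review" action flagging a critical bug. Optional fields
# that do not apply to this action type are left as None/null.
action = {
    "action_type": "review",
    "severity": "critical",
    "issue_type": "bug",
    "line_number": 3,
    "description": "Assignment operator = used instead of comparison ==",
    "patched_code": None,
    "comment": None,
    "verdict": None,
    "confidence": 0.9,
}

payload = json.dumps(action)  # the agent emits one such JSON object per turn
```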
54
+ ---
55
+
56
+ ## Observation Space
57
+
58
+ Each step returns an `Observation` containing:
59
+
60
+ | Field | Description |
61
+ |---|---|
62
+ | `task_id` | Identifier of the current task |
63
+ | `step` / `max_steps` | Current step and budget |
64
+ | `review_context` | Full PR: title, author, description, code files, linter output, test results |
65
+ | `previous_actions` | All actions taken so far this episode |
66
+ | `issues_found_so_far` | Structured list of issues reported |
67
+ | `score_so_far` | Running cumulative intermediate reward |
68
+ | `done` | Whether the episode has ended |
69
+
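A client loop typically reads only a few of these fields to decide what to do next. A minimal sketch, with invented values (the real `review_context` carries the full PR):

```python
# Hypothetical decoded observation; field names follow the table above,
# values are invented for illustration.
obs = {
    "task_id": "task_1_easy_bug_hunt",
    "step": 2,
    "max_steps": 8,
    "review_context": {"pull_request_title": "Add utility helpers", "files_changed": []},
    "previous_actions": [],
    "issues_found_so_far": [{"severity": "critical", "line": 3, "description": "= vs =="}],
    "score_so_far": 0.07,
    "done": False,
}

# Typical client-side bookkeeping: how much budget remains, and whether
# it is time to patch and submit rather than keep reviewing.
steps_left = obs["max_steps"] - obs["step"]
should_wrap_up = obs["done"] or steps_left <= 2
```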
+ ---
+
+ ## Reward Function
+
+ Reward is **dense**: provided at every step, not only at the end.
+
+ ### Intermediate (per-step)
+
+ | Signal | Value | Rationale |
+ |---|---|---|
+ | Step penalty | -0.01 | Encourages efficiency |
+ | Review with description | +0.05 | Rewards substantive annotations |
+ | Critical severity bonus | +0.03 | Rewards correct triage |
+ | Patch submitted | +0.10 | Rewards producing a fix |
+ | Repetition penalty | -0.05 | Penalises looping / copy-paste |
+
+ ### Terminal (on `submit` or step exhaustion)
+
+ The programmatic grader runs and returns a score in **[0.0, 1.0]** based on which issues were correctly identified and how well the submitted patch addresses them. This final score overwrites the episode total.
+
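The per-step signals above compose as a simple accumulator. The sketch below is an illustrative reconstruction from the table, not the environment's actual code:

```python
def intermediate_reward(action_type, severity=None, description=None, is_repeat=False):
    """Illustrative per-step reward, following the intermediate-signal table."""
    reward = -0.01                      # step penalty, applied every step
    if action_type == "review" and description:
        reward += 0.05                  # substantive annotation
        if severity == "critical":
            reward += 0.03              # correct-triage bonus
    if action_type == "patch":
        reward += 0.10                  # produced a fix
    if is_repeat:
        reward -= 0.05                  # looping / copy-paste
    return round(reward, 2)
```

A critical review with a description therefore nets 0.07 per step, matching the running `score_so_far` granularity.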
+ ---
+
+ ## Tasks
+
+ ### Task 1 - Easy: Bug Hunt (`task_1_easy_bug_hunt`)
+
+ **Max steps:** 8
+ **File reviewed:** `utils.py` (Python, 30 lines)
+
+ A developer submits three utility functions. Three bugs are planted:
+
+ | # | Line | Bug | Severity |
+ |---|---|---|---|
+ | 1 | 3 | `=` (assignment) used instead of `==` (comparison), causing a `SyntaxError` | Critical |
+ | 2 | 6 | `range(1, len(numbers) + 1)`: off-by-one causes an `IndexError` | Critical |
+ | 3 | 9 | Missing `return max_val`, so the function silently returns `None` | Major |
+
+ **Grading:** 30% per critical bug identified, 20% for the major bug, 20% for a syntactically valid patch with all three fixes applied.
+
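The actual `utils.py` ships with the environment and is not reproduced here; the function below is a hypothetical reconstruction of the kind of corrected patch the grader accepts, with comments marking where each planted bug was:

```python
def find_max(numbers):
    """Corrected version of the style of function Task 1 reviews."""
    if len(numbers) == 0:                 # bug 1: original used `=` instead of `==`
        return None
    max_val = numbers[0]
    for i in range(1, len(numbers)):      # bug 2: original ranged to len(numbers) + 1
        if numbers[i] > max_val:
            max_val = numbers[i]
    return max_val                        # bug 3: original omitted this return
```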
+ ---
+
+ ### Task 2 - Medium: Security Audit (`task_2_medium_security`)
+
+ **Max steps:** 12
+ **File reviewed:** `auth.py` (Flask, 55 lines)
+
+ A backend developer submits login and registration endpoints. Six security vulnerabilities are present:
+
+ | # | Line | Vulnerability | Severity |
+ |---|---|---|---|
+ | 1 | 23 | SQL injection in `login` query (f-string interpolation) | Critical |
+ | 2 | 44 | SQL injection in `register` INSERT | Critical |
+ | 3 | 39 | Plaintext password storage (no hashing) | Critical |
+ | 4 | - | No rate limiting on `/login` (brute-force possible) | Major |
+ | 5 | 30 | Sensitive data leakage: error distinguishes "wrong password" vs "user not found" | Major |
+ | 6 | 5 | Hardcoded `secret_key` in source | Major |
+
+ **Grading:** Weighted by severity. Patch checked for parameterized queries, password hashing, and environment variable use.
+
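A minimal sketch of the remediations the grader looks for, using only the standard library. `auth.py` itself is not reproduced here, and the table schema below is invented for illustration:

```python
import hashlib
import os
import sqlite3

def hash_password(password, salt=None):
    """Salted PBKDF2 digest instead of plaintext storage."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT PRIMARY KEY, salt BLOB, pw_hash BLOB)")

salt, pw_hash = hash_password("s3cret")
# Parameterized INSERT: placeholders instead of f-string interpolation,
# so user input can never be spliced into the SQL text.
conn.execute("INSERT INTO users VALUES (?, ?, ?)", ("alice", salt, pw_hash))

row = conn.execute(
    "SELECT salt, pw_hash FROM users WHERE username = ?", ("alice",)
).fetchone()
ok = row is not None and hash_password("s3cret", row[0])[1] == row[1]
```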
+ ---
+
+ ### Task 3 - Hard: Distributed Systems Correctness (`task_3_hard_perf_correctness`)
+
+ **Max steps:** 16
+ **File reviewed:** `cache.py` (Python, 55 lines)
+
+ A senior engineer submits a Redis-backed LRU cache claimed to be production-ready. Six issues lurk:
+
+ | # | Issue | Type | Severity |
+ |---|---|---|---|
+ | 1 | Non-atomic `EXISTS` + `GET` creates a race condition | Concurrency | Critical |
+ | 2 | Local `dict` grows unboundedly; the `capacity` parameter is ignored | Performance | Critical |
+ | 3 | `get_many` calls `self.get()` in a loop (N+1 round trips) | Performance | Major |
+ | 4 | `dict` preserves insertion order, not access order, so LRU eviction is wrong | Logic | Major |
+ | 5 | Shared `dict` modified without a `threading.Lock` | Concurrency | Critical |
+ | 6 | `pickle.loads` on bytes from Redis allows arbitrary code execution | Security | Critical |
+
+ **Grading:** Equally weighted. Patch checked structurally for `threading.Lock`, `OrderedDict.move_to_end`, `mget`, and `json` instead of `pickle`.
+
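A minimal sketch of the local-cache layer that the structural checks point at (`threading.Lock`, `OrderedDict.move_to_end`, `json` serialization). The Redis side, including `mget` batching, is omitted, and the class name is hypothetical:

```python
import json
import threading
from collections import OrderedDict

class LocalLRU:
    """Bounded, access-ordered, lock-guarded local cache layer."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()
        self._lock = threading.Lock()

    def put(self, key, value):
        with self._lock:                         # guard shared state
            self._data[key] = json.dumps(value)  # json, not pickle
            self._data.move_to_end(key)          # mark most-recently used
            while len(self._data) > self.capacity:
                self._data.popitem(last=False)   # evict least-recently used

    def get(self, key):
        with self._lock:
            if key not in self._data:
                return None
            self._data.move_to_end(key)          # a read also refreshes recency
            return json.loads(self._data[key])
```

Because `get` calls `move_to_end`, eviction follows access order rather than insertion order, which is exactly the distinction issue 4 tests.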
+ ---
+
+ ## Baseline Performance
+
+ Evaluated with `Qwen/Qwen2.5-72B-Instruct` via Hugging Face Inference API:
+
+ | Task | Score |
+ |---|---|
+ | Task 1 - Easy | 0.72 |
+ | Task 2 - Medium | 0.55 |
+ | Task 3 - Hard | 0.38 |
+ | **Aggregate** | **0.55** |
+
+ ---
+
+ ## Setup & Usage
+
+ ### 1. Local (Python)
+
+ ```bash
+ git clone <repo>
+ cd code-review-env
+ pip install -r requirements.txt
+ python server.py
+ # Server running at http://localhost:7860
+ ```
+
+ ### 2. Docker
+
+ ```bash
+ docker build -t code-review-env .
+ docker run -p 7860:7860 code-review-env
+ ```
+
+ ### 3. API Quickstart
+
+ ```bash
+ # Reset to task 1
+ curl -X POST http://localhost:7860/reset \
+   -H "Content-Type: application/json" \
+   -d '{"task_id": "task_1_easy_bug_hunt"}'
+
+ # Take a step
+ curl -X POST http://localhost:7860/step \
+   -H "Content-Type: application/json" \
+   -d '{
+     "session_id": "<session_id>",
+     "action": {
+       "action_type": "review",
+       "severity": "critical",
+       "issue_type": "bug",
+       "line_number": 3,
+       "description": "Assignment operator = used instead of comparison == on line 3"
+     }
+   }'
+ ```
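The same two calls can be made from Python with only the standard library. This is a sketch that assumes a running server; the response field names used in the commented walkthrough (such as `session_id`) are assumptions about the reply shape:

```python
import json
import urllib.request

BASE = "http://localhost:7860"  # server started via step 1 or step 2 above

def post(path, payload):
    """POST a JSON body to the environment server and decode the JSON reply."""
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# With a running server, an episode would start like this:
#   state = post("/reset", {"task_id": "task_1_easy_bug_hunt"})
#   obs = post("/step", {"session_id": state["session_id"],
#                        "action": {"action_type": "review",
#                                   "severity": "critical",
#                                   "issue_type": "bug",
#                                   "line_number": 3,
#                                   "description": "= used instead of =="}})
```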
+
+ ### 4. Run inference script
+
+ ```bash
+ export HF_TOKEN=hf_your_token_here
+ export API_BASE_URL=https://router.huggingface.co/v1
+ export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
+ python inference.py
+ ```
+
+ Expected stdout format:
+ ```
+ [START] task=task_1_easy_bug_hunt env=code-review-env model=Qwen/Qwen2.5-72B-Instruct
+ [STEP] step=1 action=review:assignment operator = instead of == reward=0.07 done=false error=null
+ [STEP] step=2 action=review:off-by-one in range reward=0.07 done=false error=null
+ [STEP] step=3 action=patch:fixed code reward=0.09 done=false error=null
+ [STEP] step=4 action=submit:request_changes reward=1.00 done=true error=null
+ [END] success=true steps=4 score=1.000 rewards=0.07,0.07,0.09,1.00
+ ```
+
+ ### 5. OpenEnv validation
+
+ ```bash
+ openenv validate .
+ ```
+
231
+ ---
232
+
233
+ ## HTTP API Reference
234
+
235
+ | Method | Endpoint | Description |
236
+ |---|---|---|
237
+ | `GET` | `/` | Environment info |
238
+ | `GET` | `/tasks` | List all tasks |
239
+ | `POST` | `/reset` | Start a new episode |
240
+ | `POST` | `/step` | Take an action |
241
+ | `GET` | `/state/{session_id}` | Inspect full environment state |
242
+ | `DELETE` | `/session/{session_id}` | Clean up session |
243
+
+ ---
+
+ ## Hugging Face Spaces Deployment
+
+ The `Dockerfile` targets port `7860` and runs as a non-root user, making it compatible with the HF Spaces Docker SDK out of the box. Tag the Space with `openenv`.
+
+ ```yaml
+ # README header for HF Spaces
  ---
  title: CodeReviewEnv
  emoji: 🔍
+ colorFrom: indigo
+ colorTo: blue
  sdk: docker
  pinned: false
+ tags:
+   - openenv
  ---
+ ```
 
 
__pycache__/env.cpython-312.pyc DELETED
Binary file (8.1 kB)
 
__pycache__/models.cpython-312.pyc DELETED
Binary file (4.16 kB)
 
graders/__pycache__/__init__.cpython-312.pyc DELETED
Binary file (145 Bytes)
 
graders/__pycache__/grader.cpython-312.pyc DELETED
Binary file (7.57 kB)
 
inference.py ADDED
@@ -0,0 +1,275 @@
+ """
+ inference.py - CodeReviewEnv baseline inference script.
+
+ Mandatory env vars:
+     API_BASE_URL   The API endpoint for the LLM.
+     MODEL_NAME     The model identifier to use for inference.
+     HF_TOKEN       Your Hugging Face / API key.
+
+ STDOUT format (strictly followed):
+     [START] task=<task_name> env=<benchmark> model=<model_name>
+     [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
+     [END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
+ """
+
+ import json
+ import os
+ import sys
+ import textwrap
+ from typing import Any, Dict, List, Optional
+
+ from openai import OpenAI
+
+ sys.path.insert(0, os.path.dirname(__file__))
+ from env import CodeReviewEnv, TASK_IDS
+ from models import ReviewAction
+
+ # ── Env vars ──────────────────────────────────────────────────────────────────
+ API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
+ API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
+ MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
+ BENCHMARK = "code-review-env"
+ SUCCESS_SCORE_THRESHOLD = 0.5
+
+ # ── Logging helpers ───────────────────────────────────────────────────────────
+
+ def log_start(task: str, env: str, model: str) -> None:
+     print(f"[START] task={task} env={env} model={model}", flush=True)
+
+
+ def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
+     error_val = error if error else "null"
+     done_val = str(done).lower()
+     action_clean = action.replace("\n", " ").replace("\r", "")[:120]
+     print(
+         f"[STEP] step={step} action={action_clean} reward={reward:.2f} done={done_val} error={error_val}",
+         flush=True,
+     )
+
+
+ def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
+     rewards_str = ",".join(f"{r:.2f}" for r in rewards)
+     print(
+         f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}",
+         flush=True,
+     )
+
+
+ # ── Prompts ───────────────────────────────────────────────────────────────────
+
+ SYSTEM_PROMPT = textwrap.dedent("""
+     You are an expert software engineer performing a thorough code review.
+     Your job is to:
+     1. Identify ALL bugs, security vulnerabilities, performance issues, and logic errors.
+     2. For each issue, output a JSON action with action_type="review".
+     3. After identifying all issues, output a patch with action_type="patch".
+     4. Finally, output action_type="submit" with your verdict.
+
+     Each response must be a single valid JSON object. No markdown, no explanation outside JSON.
+
+     Schema:
+     {
+       "action_type": "review" | "patch" | "comment" | "submit",
+       "severity": "critical" | "major" | "minor" | "info",
+       "issue_type": "bug" | "security" | "performance" | "logic" | "style",
+       "line_number": <int or null>,
+       "description": "<description of the issue>",
+       "patched_code": "<full corrected code>",
+       "comment": "<optional>",
+       "verdict": "approve" | "request_changes" | "reject",
+       "confidence": <0.0-1.0>
+     }
+
+     Output ONE JSON object per response. Be precise and thorough.
+ """).strip()
+
+
+ def build_user_prompt(obs: Dict[str, Any]) -> str:
+     ctx = obs["review_context"]
+     files_text = "\n\n".join(
+         f"=== {f['filename']} ({f['language']}) ===\n{f['content']}"
+         for f in ctx["files_changed"]
+     )
+     issues_so_far = obs.get("issues_found_so_far", [])
+
+     prompt = textwrap.dedent(f"""
+         Pull Request: {ctx['pull_request_title']}
+         Author: {ctx['author']}
+         Description: {ctx['description']}
+         Linter: {ctx.get('linter_output', 'N/A')}
+         Tests: {ctx.get('test_results', 'N/A')}
+
+         --- CODE ---
+         {files_text}
+         --- END CODE ---
+
+         Step: {obs['step']} / {obs['max_steps']}
+         Issues reported so far: {len(issues_so_far)}
+     """).strip()
+
+     if issues_so_far:
+         prompt += "\n\nIssues already reported (do NOT repeat these):"
+         for iss in issues_so_far:
+             prompt += f"\n  - [{iss.get('severity','?')}] line {iss.get('line','?')}: {iss.get('description','')}"
+
+     steps_left = obs['max_steps'] - obs['step']
+     if steps_left <= 2:
+         prompt += "\n\nYou are almost out of steps. Submit your patch and verdict NOW."
+     elif obs['step'] == 0:
+         prompt += "\n\nBegin your review. Output your first action as JSON."
+     else:
+         prompt += "\n\nContinue reviewing or submit if done. Output next action as JSON."
+
+     return prompt
+
+
+ # ── JSON extraction ───────────────────────────────────────────────────────────
+
+ def extract_json(text: str) -> Dict[str, Any]:
+     text = text.strip()
+     if text.startswith("```"):
+         lines = text.split("\n")
+         text = "\n".join(lines[1:-1]) if len(lines) > 2 else text
+     try:
+         return json.loads(text)
+     except json.JSONDecodeError:
+         pass
+     start = text.find("{")
+     if start == -1:
+         raise ValueError("No JSON object found in response")
+     depth = 0
+     for i, ch in enumerate(text[start:], start):
+         if ch == "{":
+             depth += 1
+         elif ch == "}":
+             depth -= 1
+             if depth == 0:
+                 return json.loads(text[start:i + 1])
+     raise ValueError("Unbalanced JSON in response")
+
+
+ # ── Episode runner ────────────────────────────────────────────────────────────
+
+ def run_episode(client: OpenAI, task_id: str) -> Dict[str, Any]:
+     env = CodeReviewEnv()
+     obs_obj = env.reset(task_id)
+     obs = obs_obj.model_dump()
+
+     log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)
+
+     rewards: List[float] = []
+     steps_taken = 0
+     score = 0.0
+     success = False
+     history: List[Dict[str, str]] = []
+     patch_submitted = False
+     error_msg: Optional[str] = None
+
+     try:
+         for step in range(1, obs_obj.max_steps + 1):
+             if obs.get("done"):
+                 break
+
+             error_msg = None
+             steps_left = obs["max_steps"] - obs["step"]
+
+             # Force patch then submit near step limit
+             if steps_left <= 1 and not patch_submitted:
+                 action_dict = {
+                     "action_type": "patch",
+                     "patched_code": obs["review_context"]["files_changed"][0]["content"],
+                 }
+             elif steps_left <= 0:
+                 action_dict = {
+                     "action_type": "submit",
+                     "verdict": "request_changes",
+                     "confidence": 0.5,
+                 }
+             else:
+                 user_msg = build_user_prompt(obs)
+                 history.append({"role": "user", "content": user_msg})
+
+                 try:
+                     completion = client.chat.completions.create(
+                         model=MODEL_NAME,
+                         messages=[{"role": "system", "content": SYSTEM_PROMPT}] + history,
+                         max_tokens=1024,
+                         temperature=0.2,
+                         stream=False,
+                     )
+                     raw = (completion.choices[0].message.content or "").strip()
+                     history.append({"role": "assistant", "content": raw})
+                     action_dict = extract_json(raw)
+                 except Exception as exc:
+                     error_msg = str(exc)[:80]
+                     action_dict = {
+                         "action_type": "submit",
+                         "verdict": "request_changes",
+                         "confidence": 0.3,
+                     }
+
+             if action_dict.get("action_type") == "patch":
+                 patch_submitted = True
+
+             # Validate action
+             try:
+                 action = ReviewAction(**action_dict)
+             except Exception as exc:
+                 error_msg = str(exc)[:80]
+                 action = ReviewAction(
+                     action_type="submit",
+                     verdict="request_changes",
+                     confidence=0.3,
+                 )
+
+             # Step environment
+             obs_obj, reward_obj, done, info = env.step(action)
+             obs = obs_obj.model_dump()
+
+             reward = reward_obj.value
+             rewards.append(reward)
+             steps_taken = step
+
+             action_summary = f"{action.action_type}:{(action.description or action.verdict or '')[:60]}"
+             log_step(step=step, action=action_summary, reward=reward, done=done, error=error_msg)
+
+             if done:
+                 score = info.get("final_score", 0.0)
+                 break
+
+         success = score >= SUCCESS_SCORE_THRESHOLD
+
+     finally:
+         log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
+
+     return {"task_id": task_id, "score": score, "steps": steps_taken, "success": success}
+
+
+ # ── Main ──────────────────────────────────────────────────────────────────────
+
+ def main() -> None:
+     if not API_KEY:
+         print("[ERROR] HF_TOKEN environment variable not set.", flush=True)
+         sys.exit(1)
+
+     client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
+
+     task_ids = os.getenv("TASK_IDS", ",".join(TASK_IDS)).split(",")
+     task_ids = [t.strip() for t in task_ids if t.strip()]
+
+     all_results = []
+     for task_id in task_ids:
+         result = run_episode(client, task_id)
+         all_results.append(result)
+
+     # Aggregate summary to stderr so it doesn't pollute stdout log format
+     print("\n[SUMMARY]", file=sys.stderr)
+     for r in all_results:
+         print(f"  {r['task_id']}: score={r['score']:.3f} steps={r['steps']} success={r['success']}", file=sys.stderr)
+     if all_results:
+         avg = sum(r["score"] for r in all_results) / len(all_results)
+         print(f"  aggregate: {avg:.3f}", file=sys.stderr)
+
+
+ if __name__ == "__main__":
+     main()
tasks/__pycache__/__init__.cpython-312.pyc DELETED
Binary file (143 Bytes)
 
tasks/__pycache__/task1_easy.cpython-312.pyc DELETED
Binary file (3.73 kB)
 
tasks/__pycache__/task2_medium.cpython-312.pyc DELETED
Binary file (6.01 kB)
 
tasks/__pycache__/task3_hard.cpython-312.pyc DELETED
Binary file (7.85 kB)