# CodeReviewEnv / openenv.yaml
name: code-review-env
version: "1.0.0"
spec: openenv/v1
tags:
  - openenv
  - code-review
  - software-engineering
  - security
  - agent-evaluation
description: >
  A code review environment where AI agents act as senior engineers reviewing
  pull requests. Tasks span bug hunting (easy), security auditing (medium),
  and distributed systems correctness review (hard). Fully OpenEnv-compliant
  with typed Pydantic models, dense reward signals, and programmatic graders.
author: "Meta Hackathon Submission"
license: MIT
observation_space:
  type: object
  description: >
    Structured pull request context including code files, linter output,
    test results, and history of previous actions taken in the episode.
  fields:
    - task_id: string
    - step: integer
    - max_steps: integer
    - review_context: ReviewContext
    - previous_actions: list[ReviewAction]
    - issues_found_so_far: list[dict]
    - score_so_far: float [0.0, 1.0]
    - done: boolean
action_space:
  type: object
  description: >
    Agents may review (annotate an issue), patch (submit corrected code),
    comment (free-form annotation), or submit (final verdict).
  action_types:
    - review: annotate a specific issue with severity, type, line, and description
    - patch: provide full corrected code
    - comment: free-form annotation
    - submit: final verdict (approve | request_changes | reject) with confidence
reward:
  type: dense
  range: [-1.0, 1.0]
  description: >
    Intermediate reward encourages efficient, non-repetitive, actionable reviews.
    Final reward (at submit or max_steps) is the programmatic grader score in [0.0, 1.0].
  components:
    step_penalty: -0.01 per step (encourages efficiency)
    review_description_bonus: +0.05 for a substantive review action
    critical_severity_bonus: +0.03 for marking an issue as critical
    patch_submitted_bonus: +0.10 for submitting a non-trivial patch
    repetition_penalty: -0.05 for repeating an identical description
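# Illustrative only, not part of the manifest: a minimal Python sketch of how
# the intermediate reward components above might combine for one step. It
# assumes the components are simply additive and that the step penalty applies
# on every step; the manifest does not state either. StepInfo and
# intermediate_reward are hypothetical names, not the environment's API.

```python
from dataclasses import dataclass

# Values copied from the reward components listed in the manifest.
STEP_PENALTY = -0.01
REVIEW_DESCRIPTION_BONUS = 0.05
CRITICAL_SEVERITY_BONUS = 0.03
PATCH_SUBMITTED_BONUS = 0.10
REPETITION_PENALTY = -0.05


@dataclass
class StepInfo:
    """Hypothetical per-step flags; the real environment may derive these differently."""
    substantive_review: bool = False
    marked_critical: bool = False
    nontrivial_patch: bool = False
    repeated_description: bool = False


def intermediate_reward(info: StepInfo) -> float:
    """Add up the listed components for one step (assumed additive)."""
    reward = STEP_PENALTY  # assumed to be charged on every step
    if info.substantive_review:
        reward += REVIEW_DESCRIPTION_BONUS
    if info.marked_critical:
        reward += CRITICAL_SEVERITY_BONUS
    if info.nontrivial_patch:
        reward += PATCH_SUBMITTED_BONUS
    if info.repeated_description:
        reward += REPETITION_PENALTY
    # Clamp to the declared reward range and round away float noise.
    return max(-1.0, min(1.0, round(reward, 4)))


# A substantive review that also flags a critical issue: -0.01 + 0.05 + 0.03
print(intermediate_reward(StepInfo(substantive_review=True, marked_critical=True)))
```

# Under these assumptions an empty step costs 0.01, so an agent gains nothing
# by padding the episode with no-op actions.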
tasks:
  - id: task_1_easy_bug_hunt
    difficulty: easy
    max_steps: 8
    description: >
      Find three planted bugs in a Python utility module:
      assignment-instead-of-comparison, off-by-one loop bound, missing return.
    grader: keyword-match + AST parse of patch
    max_score: 1.0
  - id: task_2_medium_security
    difficulty: medium
    max_steps: 12
    description: >
      Audit a Flask authentication endpoint for six security vulnerabilities:
      SQL injection (×2), plaintext passwords, no rate limiting,
      sensitive data leakage, hardcoded secret key.
    grader: keyword-match across action descriptions + patch structural check
    max_score: 1.0
  - id: task_3_hard_perf_correctness
    difficulty: hard
    max_steps: 16
    description: >
      Review a distributed LRU cache backed by Redis for six issues:
      race condition, memory leak, N+1 query, wrong LRU order,
      thread-safety violation, pickle deserialization exploit.
    grader: keyword-match + patch structural check (Lock, OrderedDict, mget, json)
    max_score: 1.0
baseline_scores:
  model: Qwen/Qwen2.5-72B-Instruct
  task_1_easy_bug_hunt: 0.72
  task_2_medium_security: 0.55
  task_3_hard_perf_correctness: 0.38
  aggregate: 0.55
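# Illustrative only, not part of the manifest: the listed aggregate is
# consistent with an unweighted mean of the three task scores. The manifest
# does not say how the aggregate is computed, so treating it as a plain mean
# is an assumption; the short Python check below just reproduces the number.

```python
from statistics import mean

# Per-task baseline scores copied from the manifest.
baseline = {
    "task_1_easy_bug_hunt": 0.72,
    "task_2_medium_security": 0.55,
    "task_3_hard_perf_correctness": 0.38,
}

# Assumption: aggregate = unweighted mean of the task scores, to 2 decimals.
aggregate = round(mean(baseline.values()), 2)
print(aggregate)
```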
deployment:
  platform: huggingface_spaces
  sdk: docker
  port: 7860