---
title: Python Code Review Environment Server
sdk: docker
app_port: 8000
base_path: /web
pinned: false
tags:
  - openenv
  - code-review
---

# Python Code Review Environment

A production-grade OpenEnv environment for Python code review, repair, and optimization tasks. This environment simulates real-world developer workflows where an AI agent reviews, fixes, and improves Python code.

## Overview

**`python_code_review_env`** is a deterministic benchmark environment featuring:



- ✅ **3 real-world tasks** with increasing difficulty (Syntax, Bug Fix, Optimization)

- ✅ **Deterministic graders** using AST analysis, pytest execution, and performance benchmarking

- ✅ **OpenAI-compatible API** supporting free/open models and providers (Gemini, DeepSeek, Together, OpenRouter)

- ✅ **Production-ready Docker** deployment for Hugging Face Spaces

- ✅ **Structured Observations & Actions** following the OpenEnv spec

- ✅ **Rich reward shaping** with bonuses for syntax fixes, test passes, and optimization



## Tasks



### 1. 🟢 Easy: Syntax Fixing



**Task ID**: `syntax-fix-easy`



Fix broken Python code with syntax errors.



- **Difficulty**: Easy

- **Goal**: Repair syntax errors to make code compile

- **Starter Code**: Function with missing closing parenthesis

- **Grading**: Compilation check + code similarity to reference

- **Score Range**: 0.0–1.0



### 2. 🟡 Medium: Bug Fixing



**Task ID**: `bug-fix-medium`



Fix logic bugs with visible and hidden test cases.



- **Difficulty**: Medium

- **Goal**: Repair a logic error in an invoice calculation

- **Starter Code**: Function that returns the wrong total (the subtotal instead of the discounted total)

- **Grading**: Test pass fraction (visible & hidden)

- **Score Range**: 0.0–1.0



### 3. 🔴 Hard: Optimization & Refactoring



**Task ID**: `optimization-hard`



Optimize inefficient code while maintaining correctness.



- **Difficulty**: Hard

- **Goal**: Convert O(n²) duplicate removal to O(n) with a set

- **Starter Code**: Slow nested-loop implementation

- **Grading**: 50% correctness + 30% speedup + 15% code quality + 5% style

- **Score Range**: 0.0–1.0

- **Bonus**: Runtime benchmarking against reference implementation
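The intended transformation can be sketched as below. This is an illustrative example, not the actual starter or reference code (those live in `tasks/task_bank.py`):

```python
# Illustration of the optimization task: the starter code removes
# duplicates with a nested membership scan (O(n^2) overall); the
# optimized version tracks seen items in a set (O(n)) while
# preserving the original order.

def remove_duplicates_slow(items):
    result = []
    for item in items:
        if item not in result:  # linear scan of the result list each time
            result.append(item)
    return result


def remove_duplicates_fast(items):
    seen = set()
    result = []
    for item in items:
        if item not in seen:  # O(1) average-case membership check
            seen.add(item)
            result.append(item)
    return result
```

For n items, the nested version performs up to n list scans per element, while the set-based version does a constant-time membership check per element.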



## Quick Start



### Run Locally



```bash
cd python-code-review-env
pip install -r server/requirements.txt
python -m server.app
```



Visit http://localhost:8000/docs for the interactive API documentation.



### Run with Docker



```bash
docker build -f server/Dockerfile -t python_code_review_env:latest .
docker run -p 8000:8000 python_code_review_env:latest
```



### Run Inference



```bash
python inference.py --model "gpt-3.5-turbo" --base-url "http://localhost:8000/v1"
```



## OpenEnv Specification



### Observation



```json
{
  "task_id": "syntax-fix-easy",
  "difficulty": "easy",
  "task_description": "Fix syntax errors...",
  "current_code": "def normalize_username(raw_name: str) -> str:\n    cleaned = raw_name.strip().lower(\n    ...",
  "errors": "invalid syntax (line 2, column 40)",
  "test_results": "Not run yet.",
  "visible_tests": ["normalize_username('  Alice Smith  ') == 'alice_smith'"],
  "history": [],
  "attempts_remaining": 8,
  "score": 0.0,
  "reward": {
    "value": 0.0,
    "reason": "Episode reset."
  }
}
```



### Action



```json
{
  "action_type": "edit_code",
  "code": "def normalize_username(raw_name: str) -> str:\n    cleaned = raw_name.strip().lower()\n    if not cleaned:\n        return \"anonymous\"\n    return cleaned.replace(\" \", \"_\")"
}
```



### Reward Details



- **+0.2**: Syntax fixed (one-time per episode)

- **+0.15**: Passing additional test (cumulative per test)

- **+0.1**: Code quality improvement  

- **+0.5**: Full correctness (100% hidden tests, one-time)

- **-0.1**: Invalid action
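Conceptually, the bonuses above combine per step roughly as in this sketch. The names and structure are illustrative, not the environment's actual implementation:

```python
# Illustrative sketch of the reward shaping table above.
# prev/curr summarize grading results before and after the action, e.g.
# {"syntax_ok": bool, "tests_passed": int, "all_hidden_passed": bool,
#  "quality_improved": bool}. These field names are hypothetical.

def shaped_reward(prev, curr, action_valid=True):
    if not action_valid:
        return -0.1
    reward = 0.0
    if curr["syntax_ok"] and not prev["syntax_ok"]:
        reward += 0.2  # one-time syntax-fix bonus
    newly_passed = curr["tests_passed"] - prev["tests_passed"]
    if newly_passed > 0:
        reward += 0.15 * newly_passed  # per additional passing test
    if curr.get("quality_improved"):
        reward += 0.1
    if curr["all_hidden_passed"] and not prev["all_hidden_passed"]:
        reward += 0.5  # one-time full-correctness bonus
    return reward
```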



## Architecture



```
python_code_review_env/
├── models.py            # Pydantic models (Observation, Action, Reward)
├── server/
│   ├── app.py           # FastAPI server
│   ├── env.py           # OpenEnv environment
│   ├── Dockerfile       # Docker config
│   └── requirements.txt
├── graders/
│   ├── common.py        # Shared utilities
│   ├── syntax.py        # Syntax/bug graders
│   ├── optimization.py  # Optimization grader
│   └── pytest_runner.py
├── tasks/
│   ├── task_bank.py     # 3 deterministic tasks
│   └── __init__.py
├── inference.py         # Baseline evaluation script
├── openenv.yaml         # OpenEnv spec
├── pyproject.toml       # Project metadata
└── README.md
```



## FastAPI Endpoints



- `GET /health` – Health check

- `GET /tasks` – List all tasks

- `GET /tasks/{task_id}` – Get task details

- `POST /tasks/{task_id}/grade` – Grade code offline

- Standard OpenEnv endpoints (`/reset`, `/step`, `/state`)



## Deterministic Graders



### Syntax Fix

```
if code compiles:
  score = 1.0
else:
  score = 0.15 + 0.55 * similarity_to_reference
```
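This rule can be sketched in Python as follows. The similarity metric shown (`difflib.SequenceMatcher`) is an assumption for illustration; the real grader may compare code via AST analysis instead:

```python
import difflib

# Sketch of the syntax-fix scoring rule: full score if the submission
# compiles, otherwise partial credit scaled by text similarity to the
# reference solution. The similarity metric here is an assumption.

def grade_syntax_fix(code: str, reference: str) -> float:
    try:
        compile(code, "<submission>", "exec")
        return 1.0
    except SyntaxError:
        similarity = difflib.SequenceMatcher(None, code, reference).ratio()
        return 0.15 + 0.55 * similarity  # bounded in [0.15, 0.70]
```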



### Bug Fix  

```
score = test_pass_fraction (0.0 to 1.0)
```



### Optimization

```
score = (
  0.5 * test_fraction +
  0.3 * speedup_score +
  0.15 * code_quality +
  0.05 * pep8_style
)
```
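As a sketch, the weighted sum is straightforward; each component is assumed to be pre-normalized to [0, 1] by its respective sub-grader:

```python
# Sketch of the optimization-task weighted score. All four inputs are
# assumed to already lie in [0, 1]; the weights sum to 1.0, so the
# final score does too.

def grade_optimization(test_fraction, speedup_score, code_quality, pep8_style):
    return (
        0.5 * test_fraction
        + 0.3 * speedup_score
        + 0.15 * code_quality
        + 0.05 * pep8_style
    )
```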



## Examples



### Using Python



```python
from server.env import PythonCodeReviewEnvironment
from models import PythonCodeReviewAction

env = PythonCodeReviewEnvironment()
obs = env.reset(task_id="syntax-fix-easy")

action = PythonCodeReviewAction(
    action_type="edit_code",
    code="""def normalize_username(raw_name: str) -> str:
    cleaned = raw_name.strip().lower()
    if not cleaned:
        return "anonymous"
    return cleaned.replace(" ", "_")
""",
)

obs = env.step(action)
print(f"Score: {obs.score}")
print(f"Reward: {obs.reward.value:+.3f}")
```



### Using cURL



```bash
# Check health
curl http://localhost:8000/health

# List tasks
curl http://localhost:8000/tasks

# Grade code
curl -X POST http://localhost:8000/tasks/syntax-fix-easy/grade \
  -H "Content-Type: application/json" \
  -d '{"action_type": "edit_code", "code": "..."}'
```



## Deployment



### Hugging Face Spaces



1. Create a new Space and select the Docker SDK

2. Upload files + `server/Dockerfile`

3. Space auto-deploys on CPU

4. Monitor `/health` endpoint



### Local Docker



```bash
docker build -f server/Dockerfile -t python_code_review_env .
docker run -p 8000:8000 \
  -e MAX_CONCURRENT_ENVS=16 \
  python_code_review_env
```



## Performance



- Startup: < 5s

- Reset: < 100ms

- Step: 50ms–3s (depends on the action)

- Inference (3 tasks): < 20 minutes

- CPU: Works on 2 vCPU, 8GB RAM



## Validation Checklist



- ✅ 3 deterministic tasks

- ✅ Deterministic graders (AST, pytest, benchmarks)

- ✅ `/health` → 200

- ✅ Scores vary per task (not constant)

- ✅ Docker builds successfully

- ✅ OpenEnv spec compliant

- ✅ Reward shaping working

- ✅ All tests deterministic and reproducible



## License



MIT



---



**Built for production. Deterministic. Deployable. Extensible.**