---
title: Python Code Review Environment Server
sdk: docker
app_port: 8000
base_path: /web
pinned: false
tags:
  - openenv
  - code-review
---

# Python Code Review Environment

A production-grade OpenEnv environment for Python code review, repair, and optimization tasks. This environment simulates real-world developer workflows where an AI agent reviews, fixes, and improves Python code.

## Overview

**`python_code_review_env`** is a deterministic benchmark environment featuring:

- ✅ **3 real-world tasks** with increasing difficulty (Syntax, Bug Fix, Optimization)
- ✅ **Deterministic graders** using AST analysis, pytest execution, and performance benchmarking
- ✅ **OpenAI-compatible API** supporting free/open models (Gemini, DeepSeek, Together, OpenRouter)
- ✅ **Production-ready Docker** deployment for Hugging Face Spaces
- ✅ **Structured Observations & Actions** following OpenEnv spec
- ✅ **Rich reward shaping** with bonuses for syntax fixes, test passes, and optimization

## Tasks

### 1. 🟢 Easy: Syntax Fixing

**Task ID**: `syntax-fix-easy`

Fix broken Python code with syntax errors.

- **Difficulty**: Easy
- **Goal**: Repair syntax errors to make code compile
- **Starter Code**: Function with missing closing parenthesis
- **Grading**: Compilation check + code similarity to reference
- **Score Range**: 0.0–1.0

### 2. 🟡 Medium: Bug Fixing

**Task ID**: `bug-fix-medium`

Fix logic bugs with visible and hidden test cases.

- **Difficulty**: Medium  
- **Goal**: Repair a logic error in invoice calculation
- **Starter Code**: Function that returns wrong total (returns subtotal instead of discounted)
- **Grading**: Test pass fraction (visible & hidden)
- **Score Range**: 0.0–1.0

### 3. 🔴 Hard: Optimization & Refactoring

**Task ID**: `optimization-hard`

Optimize inefficient code while maintaining correctness.

- **Difficulty**: Hard
- **Goal**: Convert O(n²) duplicate removal to O(n) with set
- **Starter Code**: Slow nested-loop implementation
- **Grading**: 50% correctness + 30% speedup + 15% code quality + 5% style
- **Score Range**: 0.0–1.0
- **Bonus**: Runtime benchmarking against reference implementation

## Quick Start

### Run Locally

```bash
cd python-code-review-env
pip install -r server/requirements.txt
python -m server.app
```

Visit http://localhost:8000/docs for interactive API

### Run with Docker

```bash
docker build -f server/Dockerfile -t python_code_review_env:latest .
docker run -p 8000:8000 python_code_review_env:latest
```

### Run Inference

```bash
python inference.py --model "gpt-3.5-turbo" --base-url "http://localhost:8000/v1"
```

## OpenEnv Specification

### Observation

```json
{
  "task_id": "syntax-fix-easy",
  "difficulty": "easy",
  "task_description": "Fix syntax errors...",
  "current_code": "def normalize_username(raw_name: str) -> str:\n    cleaned = raw_name.strip().lower(\n    ...",
  "errors": "invalid syntax ( line 2, column 40 )",
  "test_results": "Not run yet.",
  "visible_tests": ["normalize_username('  Alice Smith  ') == 'alice_smith'"],
  "history": [],
  "attempts_remaining": 8,
  "score": 0.0,
  "reward": {
    "value": 0.0,
    "reason": "Episode reset."
  }
}
```

### Action

```json
{
  "action_type": "edit_code",
  "code": "def normalize_username(raw_name: str) -> str:\n    cleaned = raw_name.strip().lower()\n    if not cleaned:\n        return \"anonymous\"\n    return cleaned.replace(\" \", \"_\")"
}
```

### Reward Details

- **+0.2**: Syntax fixed (one-time per episode)
- **+0.15**: Passing additional test (cumulative per test)
- **+0.1**: Code quality improvement  
- **+0.5**: Full correctness (100% hidden tests, one-time)
- **-0.1**: Invalid action

## Architecture

```
python_code_review_env/
├── models.py          # Pydantic models (Observation, Action, Reward)
├── server/
│   ├── app.py         # FastAPI server  
│   ├── env.py         # OpenEnv environment
│   ├── Dockerfile     # Docker config
│   └── requirements.txt
├── graders/
│   ├── common.py      # Shared utilities
│   ├── syntax.py      # Syntax/bug graders
│   ├── optimization.py# Optimization grader
│   └── pytest_runner.py
├── tasks/
│   ├── task_bank.py   # 3 deterministic tasks
│   └── __init__.py
├── inference.py       # Baseline evaluation script
├── openenv.yaml       # OpenEnv spec
├── pyproject.toml     # Project metadata
└── README.md
```

## FastAPI Endpoints

- `GET /health` – Health check
- `GET /tasks` – List all tasks
- `GET /tasks/{task_id}` – Get task details
- `POST /tasks/{task_id}/grade` – Grade code offline
- Standard OpenEnv endpoints (`/reset`, `/step`, `/state`)

## Deterministic Graders

### Syntax Fix
```
if code compiles:
  score = 1.0
else:
  score = 0.15 + 0.55 * similarity_to_reference
```

### Bug Fix  
```
score = test_pass_fraction (0.0 to 1.0)
```

### Optimization
```
score = (
  0.5 * test_fraction +
  0.3 * speedup_score +
  0.15 * code_quality +
  0.05 * pep8_style
)
```

## Examples

### Using Python

```python
from server.env import PythonCodeReviewEnvironment
from models import PythonCodeReviewAction

env = PythonCodeReviewEnvironment()
obs = env.reset(task_id="syntax-fix-easy")

action = PythonCodeReviewAction(
    action_type="edit_code",
    code="""def normalize_username(raw_name: str) -> str:
    cleaned = raw_name.strip().lower()
    if not cleaned:
        return "anonymous"
    return cleaned.replace(" ", "_")
"""
)

obs = env.step(action)
print(f"Score: {obs.score}")
print(f"Reward: {obs.reward.value:+.3f}")
```

### Using cURL

```bash
# Check health
curl http://localhost:8000/health

# List tasks
curl http://localhost:8000/tasks

# Grade code
curl -X POST http://localhost:8000/tasks/syntax-fix-easy/grade \
  -H "Content-Type: application/json" \
  -d '{"action_type": "edit_code", "code": "..."}'
```

## Deployment

### Hugging Face Spaces

1. Create Space > Docker
2. Upload files + `server/Dockerfile`
3. Space auto-deploys on CPU
4. Monitor `/health` endpoint

### Local Docker

```bash
docker build -f server/Dockerfile -t python_code_review_env .
docker run -p 8000:8000 \
  -e MAX_CONCURRENT_ENVS=16 \
  python_code_review_env
```

## Performance

- Startup: < 5s
- Reset: < 100ms
- Step: 50ms–3s (depends on action)
- Inference (3 tasks): < 20 minutes
- CPU: Works on 2 vCPU, 8GB RAM

## Validation Checklist

- ✅ 3 deterministic tasks
- ✅ Deterministic graders (AST, pytest, benchmarks)
- ✅ `/health` → 200
- ✅ Scores vary per task (not constant)
- ✅ Docker builds successfully
- ✅ OpenEnv spec compliant
- ✅ Reward shaping working
- ✅ All tests deterministic and reproducible

## License

MIT

---

**Built for production. Deterministic. Deployable. Extensible.**