---
title: Python Code Review Environment Server
sdk: docker
app_port: 8000
base_path: /web
pinned: false
tags:
- openenv
- code-review
---
# Python Code Review Environment
A production-grade OpenEnv environment for Python code review, repair, and optimization tasks. This environment simulates real-world developer workflows where an AI agent reviews, fixes, and improves Python code.
## Overview
**`python_code_review_env`** is a deterministic benchmark environment featuring:
- βœ… **3 real-world tasks** with increasing difficulty (Syntax, Bug Fix, Optimization)
- βœ… **Deterministic graders** using AST analysis, pytest execution, and performance benchmarking
- βœ… **OpenAI-compatible API** supporting free/open models (Gemini, DeepSeek, Together, OpenRouter)
- βœ… **Production-ready Docker** deployment for Hugging Face Spaces
- βœ… **Structured Observations & Actions** following OpenEnv spec
- βœ… **Rich reward shaping** with bonuses for syntax fixes, test passes, and optimization
## Tasks
### 1. 🟒 Easy: Syntax Fixing
**Task ID**: `syntax-fix-easy`
Fix broken Python code with syntax errors.
- **Difficulty**: Easy
- **Goal**: Repair syntax errors to make code compile
- **Starter Code**: Function with missing closing parenthesis
- **Grading**: Compilation check + code similarity to reference
- **Score Range**: 0.0–1.0
### 2. 🟑 Medium: Bug Fixing
**Task ID**: `bug-fix-medium`
Fix logic bugs with visible and hidden test cases.
- **Difficulty**: Medium
- **Goal**: Repair a logic error in invoice calculation
- **Starter Code**: Function that returns the wrong total (the subtotal instead of the discounted amount)
- **Grading**: Test pass fraction (visible & hidden)
- **Score Range**: 0.0–1.0
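A minimal sketch of the kind of bug this task targets (function and variable names here are illustrative assumptions, not the actual starter code): the buggy version computes the discount but returns the undiscounted subtotal.

```python
# Hypothetical illustration of the bug-fix task. The buggy function
# computes a discounted total but returns the wrong variable.

def invoice_total_buggy(prices, discount_rate):
    subtotal = sum(prices)
    discounted = subtotal * (1 - discount_rate)
    return subtotal  # bug: the discount just computed is ignored

def invoice_total_fixed(prices, discount_rate):
    subtotal = sum(prices)
    return subtotal * (1 - discount_rate)  # fix: return the discounted total

print(invoice_total_fixed([100.0, 50.0], 0.10))  # 135.0
```

Hidden tests catch exactly this class of error: the buggy version passes any test where the discount happens to be zero, and fails everywhere else.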
### 3. πŸ”΄ Hard: Optimization & Refactoring
**Task ID**: `optimization-hard`
Optimize inefficient code while maintaining correctness.
- **Difficulty**: Hard
- **Goal**: Convert O(nΒ²) duplicate removal to O(n) with set
- **Starter Code**: Slow nested-loop implementation
- **Grading**: 50% correctness + 30% speedup + 15% code quality + 5% style
- **Score Range**: 0.0–1.0
- **Bonus**: Runtime benchmarking against reference implementation
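The intended transformation can be sketched as follows (a hedged illustration, not the actual starter or reference code): both functions remove duplicates while preserving first-seen order; only the complexity differs.

```python
# Hypothetical sketch of the optimization task.

def remove_duplicates_slow(items):
    """O(n^2): membership test scans the growing result list each time."""
    result = []
    for item in items:
        if item not in result:  # O(n) list scan per element
            result.append(item)
    return result

def remove_duplicates_fast(items):
    """O(n): a set gives O(1) average-case membership checks."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result

print(remove_duplicates_fast([3, 1, 3, 2, 1]))  # [3, 1, 2]
```

Because the grader benchmarks runtime against a reference implementation, an agent that only preserves correctness without the set-based rewrite would score on the correctness component but lose most of the speedup component.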
## Quick Start
### Run Locally
```bash
cd python-code-review-env
pip install -r server/requirements.txt
python -m server.app
```
Visit http://localhost:8000/docs for the interactive API documentation.
### Run with Docker
```bash
docker build -f server/Dockerfile -t python_code_review_env:latest .
docker run -p 8000:8000 python_code_review_env:latest
```
### Run Inference
```bash
python inference.py --model "gpt-3.5-turbo" --base-url "http://localhost:8000/v1"
```
## OpenEnv Specification
### Observation
```json
{
  "task_id": "syntax-fix-easy",
  "difficulty": "easy",
  "task_description": "Fix syntax errors...",
  "current_code": "def normalize_username(raw_name: str) -> str:\n    cleaned = raw_name.strip().lower(\n    ...",
  "errors": "invalid syntax (line 2, column 40)",
  "test_results": "Not run yet.",
  "visible_tests": ["normalize_username(' Alice Smith ') == 'alice_smith'"],
  "history": [],
  "attempts_remaining": 8,
  "score": 0.0,
  "reward": {
    "value": 0.0,
    "reason": "Episode reset."
  }
}
```
### Action
```json
{
  "action_type": "edit_code",
  "code": "def normalize_username(raw_name: str) -> str:\n    cleaned = raw_name.strip().lower()\n    if not cleaned:\n        return \"anonymous\"\n    return cleaned.replace(\" \", \"_\")"
}
```
### Reward Details
- **+0.2**: Syntax fixed (one-time per episode)
- **+0.15**: Passing additional test (cumulative per test)
- **+0.1**: Code quality improvement
- **+0.5**: Full correctness (100% hidden tests, one-time)
- **-0.1**: Invalid action
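The increments above might combine into a single per-step reward roughly as follows (a sketch under assumed semantics; the server's actual bookkeeping of one-time bonuses may differ):

```python
def shaped_step_reward(
    syntax_just_fixed: bool,
    newly_passing_tests: int,
    quality_improved: bool,
    achieved_full_correctness: bool,
    invalid_action: bool,
) -> float:
    """Hypothetical sketch of the reward shaping described above."""
    if invalid_action:
        return -0.1
    reward = 0.0
    if syntax_just_fixed:
        reward += 0.2                     # one-time syntax bonus
    reward += 0.15 * newly_passing_tests  # cumulative per newly passing test
    if quality_improved:
        reward += 0.1
    if achieved_full_correctness:
        reward += 0.5                     # one-time full-correctness bonus
    return reward

print(shaped_step_reward(True, 2, False, True, False))  # 1.0
```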
## Architecture
```
python_code_review_env/
β”œβ”€β”€ models.py            # Pydantic models (Observation, Action, Reward)
β”œβ”€β”€ server/
β”‚   β”œβ”€β”€ app.py           # FastAPI server
β”‚   β”œβ”€β”€ env.py           # OpenEnv environment
β”‚   β”œβ”€β”€ Dockerfile       # Docker config
β”‚   └── requirements.txt
β”œβ”€β”€ graders/
β”‚   β”œβ”€β”€ common.py        # Shared utilities
β”‚   β”œβ”€β”€ syntax.py        # Syntax/bug graders
β”‚   β”œβ”€β”€ optimization.py  # Optimization grader
β”‚   └── pytest_runner.py
β”œβ”€β”€ tasks/
β”‚   β”œβ”€β”€ task_bank.py     # 3 deterministic tasks
β”‚   └── __init__.py
β”œβ”€β”€ inference.py         # Baseline evaluation script
β”œβ”€β”€ openenv.yaml         # OpenEnv spec
β”œβ”€β”€ pyproject.toml       # Project metadata
└── README.md
```
## FastAPI Endpoints
- `GET /health` – Health check
- `GET /tasks` – List all tasks
- `GET /tasks/{task_id}` – Get task details
- `POST /tasks/{task_id}/grade` – Grade code offline
- Standard OpenEnv endpoints (`/reset`, `/step`, `/state`)
## Deterministic Graders
### Syntax Fix
```
if code compiles:
    score = 1.0
else:
    score = 0.15 + 0.55 * similarity_to_reference
```
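In Python this grading rule could look like the sketch below. The use of `difflib.SequenceMatcher` for the similarity term is an assumption on my part; the actual grader in `graders/syntax.py` may use a different metric.

```python
import difflib

def grade_syntax_fix(code: str, reference: str) -> float:
    """Hypothetical grader sketch: full credit if the submission compiles,
    partial credit scaled by textual similarity to the reference otherwise."""
    try:
        compile(code, "<submission>", "exec")
        return 1.0
    except SyntaxError:
        similarity = difflib.SequenceMatcher(None, code, reference).ratio()
        return 0.15 + 0.55 * similarity  # bounded in [0.15, 0.7]
```

Note the floor of 0.15 rewards any attempt, while non-compiling code can never reach the 0.7+ range reserved for compiling submissions.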
### Bug Fix
```
score = test_pass_fraction (0.0 to 1.0)
```
### Optimization
```
score = (
    0.5  * test_fraction +
    0.3  * speedup_score +
    0.15 * code_quality +
    0.05 * pep8_style
)
```
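Plugging sample component scores into this weighted sum shows how the weights trade off (the helper name here is illustrative):

```python
def optimization_score(test_fraction, speedup_score, code_quality, pep8_style):
    """Weighted sum from the grading formula above; each input is in [0, 1]."""
    return (
        0.5  * test_fraction +
        0.3  * speedup_score +
        0.15 * code_quality +
        0.05 * pep8_style
    )

# Fully correct, clean code with a partial (0.8) speedup:
print(optimization_score(1.0, 0.8, 1.0, 1.0))  # 0.94
```

Correctness dominates: a fully correct but unoptimized solution (speedup 0.0) still earns up to 0.7, while a fast but broken one is capped well below passing.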
## Examples
### Using Python
```python
from server.env import PythonCodeReviewEnvironment
from models import PythonCodeReviewAction

env = PythonCodeReviewEnvironment()
obs = env.reset(task_id="syntax-fix-easy")

action = PythonCodeReviewAction(
    action_type="edit_code",
    code="""def normalize_username(raw_name: str) -> str:
    cleaned = raw_name.strip().lower()
    if not cleaned:
        return "anonymous"
    return cleaned.replace(" ", "_")
""",
)

obs = env.step(action)
print(f"Score: {obs.score}")
print(f"Reward: {obs.reward.value:+.3f}")
```
### Using cURL
```bash
# Check health
curl http://localhost:8000/health

# List tasks
curl http://localhost:8000/tasks

# Grade code
curl -X POST http://localhost:8000/tasks/syntax-fix-easy/grade \
  -H "Content-Type: application/json" \
  -d '{"action_type": "edit_code", "code": "..."}'
```
## Deployment
### Hugging Face Spaces
1. Create a new Space and select the **Docker** SDK
2. Upload the project files, including `server/Dockerfile`
3. The Space auto-deploys on CPU hardware
4. Monitor the `/health` endpoint
### Local Docker
```bash
docker build -f server/Dockerfile -t python_code_review_env .
docker run -p 8000:8000 \
  -e MAX_CONCURRENT_ENVS=16 \
  python_code_review_env
```
## Performance
- Startup: < 5s
- Reset: < 100ms
- Step: 50ms–3s (depends on action)
- Inference (3 tasks): < 20 minutes
- CPU: Works on 2 vCPU, 8GB RAM
## Validation Checklist
- βœ… 3 deterministic tasks
- βœ… Deterministic graders (AST, pytest, benchmarks)
- βœ… `/health` β†’ 200
- βœ… Scores vary per task (not constant)
- βœ… Docker builds successfully
- βœ… OpenEnv spec compliant
- βœ… Reward shaping working
- βœ… All tests deterministic and reproducible
## License
MIT
---
**Built for production. Deterministic. Deployable. Extensible.**