---
title: Python Code Review Environment Server
sdk: docker
app_port: 8000
base_path: /web
pinned: false
tags:
  - openenv
  - code-review
---

# Python Code Review Environment

A production-grade OpenEnv environment for Python code review, repair, and optimization tasks. This environment simulates real-world developer workflows where an AI agent reviews, fixes, and improves Python code.

## Overview

`python_code_review_env` is a deterministic benchmark environment featuring:

- ✅ 3 real-world tasks with increasing difficulty (syntax, bug fix, optimization)
- ✅ Deterministic graders using AST analysis, pytest execution, and performance benchmarking
- ✅ OpenAI-compatible API supporting free/open models (Gemini, DeepSeek, Together, OpenRouter)
- ✅ Production-ready Docker deployment for Hugging Face Spaces
- ✅ Structured observations and actions following the OpenEnv spec
- ✅ Rich reward shaping with bonuses for syntax fixes, test passes, and optimization

## Tasks

### 1. 🟢 Easy: Syntax Fixing

**Task ID:** `syntax-fix-easy`

Fix broken Python code with syntax errors.

- **Difficulty:** Easy
- **Goal:** Repair syntax errors to make the code compile
- **Starter Code:** Function with a missing closing parenthesis
- **Grading:** Compilation check + code similarity to the reference
- **Score Range:** 0.0–1.0
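The grader's first gate is whether the submission compiles. A minimal sketch of that kind of check, using a hypothetical broken/fixed snippet rather than the task's actual starter code:

```python
# Hypothetical illustration of a compilation check (not the grader's code).
broken = "def add(a, b:\n    return a + b\n"   # missing closing parenthesis
fixed = "def add(a, b):\n    return a + b\n"

def compiles(src: str) -> bool:
    """Return True if the source compiles without a SyntaxError."""
    try:
        compile(src, "<submission>", "exec")
        return True
    except SyntaxError:
        return False

print(compiles(broken))  # False
print(compiles(fixed))   # True
```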

### 2. 🟡 Medium: Bug Fixing

**Task ID:** `bug-fix-medium`

Fix logic bugs with visible and hidden test cases.

- **Difficulty:** Medium
- **Goal:** Repair a logic error in an invoice calculation
- **Starter Code:** Function that returns the wrong total (the subtotal instead of the discounted total)
- **Grading:** Test pass fraction (visible & hidden)
- **Score Range:** 0.0–1.0
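To illustrate the shape of this bug (with hypothetical function and argument names, not the task's actual starter code): the starter computes a discount but returns the undiscounted subtotal.

```python
# Hypothetical sketch of the bug class in this task, not the real starter code.
def calculate_total_buggy(prices, discount_rate):
    subtotal = sum(prices)
    total = subtotal * (1 - discount_rate)
    return subtotal  # bug: returns the subtotal instead of the discounted total

def calculate_total_fixed(prices, discount_rate):
    subtotal = sum(prices)
    return subtotal * (1 - discount_rate)  # fix: return the discounted total

print(calculate_total_buggy([100, 50], 0.1))  # 150 (wrong)
print(calculate_total_fixed([100, 50], 0.1))  # 135.0
```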

### 3. 🔴 Hard: Optimization & Refactoring

**Task ID:** `optimization-hard`

Optimize inefficient code while maintaining correctness.

- **Difficulty:** Hard
- **Goal:** Convert O(n²) duplicate removal to O(n) with a set
- **Starter Code:** Slow nested-loop implementation
- **Grading:** 50% correctness + 30% speedup + 15% code quality + 5% style
- **Score Range:** 0.0–1.0
- **Bonus:** Runtime benchmarking against the reference implementation
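A sketch of the transformation this task asks for, using illustrative code rather than the task's actual starter or reference implementation: list membership tests make the nested version quadratic, while a seen-set keeps each lookup O(1).

```python
# Illustrative only; not the task's actual starter/reference code.
def remove_duplicates_slow(items):
    # O(n^2): `item not in result` scans the result list on every iteration
    result = []
    for item in items:
        if item not in result:
            result.append(item)
    return result

def remove_duplicates_fast(items):
    # O(n): membership is tested against a set, preserving first-seen order
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result

print(remove_duplicates_fast([3, 1, 3, 2, 1]))  # [3, 1, 2]
```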

## Quick Start

### Run Locally

```bash
cd python-code-review-env
pip install -r server/requirements.txt
python -m server.app
```

Visit http://localhost:8000/docs for the interactive API documentation.

### Run with Docker

```bash
docker build -f server/Dockerfile -t python_code_review_env:latest .
docker run -p 8000:8000 python_code_review_env:latest
```

### Run Inference

```bash
python inference.py --model "gpt-3.5-turbo" --base-url "http://localhost:8000/v1"
```

## OpenEnv Specification

### Observation

```json
{
  "task_id": "syntax-fix-easy",
  "difficulty": "easy",
  "task_description": "Fix syntax errors...",
  "current_code": "def normalize_username(raw_name: str) -> str:\n    cleaned = raw_name.strip().lower(\n    ...",
  "errors": "invalid syntax (line 2, column 40)",
  "test_results": "Not run yet.",
  "visible_tests": ["normalize_username('  Alice Smith  ') == 'alice_smith'"],
  "history": [],
  "attempts_remaining": 8,
  "score": 0.0,
  "reward": {
    "value": 0.0,
    "reason": "Episode reset."
  }
}
```

### Action

```json
{
  "action_type": "edit_code",
  "code": "def normalize_username(raw_name: str) -> str:\n    cleaned = raw_name.strip().lower()\n    if not cleaned:\n        return \"anonymous\"\n    return cleaned.replace(\" \", \"_\")"
}
```

## Reward Details

- **+0.2**: Syntax fixed (one-time per episode)
- **+0.15**: Each additional passing test (cumulative per test)
- **+0.1**: Code quality improvement
- **+0.5**: Full correctness (100% of hidden tests, one-time)
- **-0.1**: Invalid action
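As an arithmetic illustration of how these bonuses accumulate (the environment computes this internally; the episode trajectory here is hypothetical), an agent that fixes the syntax, then passes two additional tests, then reaches full correctness would accrue:

```python
# Hypothetical episode: syntax fix, two additional test passes, full correctness.
rewards = [0.2, 0.15, 0.15, 0.5]
print(sum(rewards))  # 1.0
```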

## Architecture

```text
python_code_review_env/
├── models.py          # Pydantic models (Observation, Action, Reward)
├── server/
│   ├── app.py         # FastAPI server
│   ├── env.py         # OpenEnv environment
│   ├── Dockerfile     # Docker config
│   └── requirements.txt
├── graders/
│   ├── common.py      # Shared utilities
│   ├── syntax.py      # Syntax/bug graders
│   ├── optimization.py  # Optimization grader
│   └── pytest_runner.py
├── tasks/
│   ├── task_bank.py   # 3 deterministic tasks
│   └── __init__.py
├── inference.py       # Baseline evaluation script
├── openenv.yaml       # OpenEnv spec
├── pyproject.toml     # Project metadata
└── README.md
```

## FastAPI Endpoints

- `GET /health` – Health check
- `GET /tasks` – List all tasks
- `GET /tasks/{task_id}` – Get task details
- `POST /tasks/{task_id}/grade` – Grade code offline
- Standard OpenEnv endpoints (`/reset`, `/step`, `/state`)

## Deterministic Graders

### Syntax Fix

```text
if code compiles:
    score = 1.0
else:
    score = 0.15 + 0.55 * similarity_to_reference
```
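A runnable sketch of this rule. Note that `similarity_to_reference` is assumed here to be a `difflib.SequenceMatcher` ratio; the actual grader may measure similarity differently.

```python
import difflib

def grade_syntax(submission: str, reference: str) -> float:
    """Sketch of the syntax grader: 1.0 if the code compiles, otherwise
    a partial score scaled by similarity to the reference solution."""
    try:
        compile(submission, "<submission>", "exec")
        return 1.0
    except SyntaxError:
        # Assumption: similarity measured as a difflib sequence ratio in [0, 1]
        sim = difflib.SequenceMatcher(None, submission, reference).ratio()
        return 0.15 + 0.55 * sim

print(grade_syntax("def f(): return 1", "def f(): return 1"))  # 1.0
```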

### Bug Fix

```text
score = test_pass_fraction  (0.0 to 1.0)
```

### Optimization

```text
score = (
    0.5  * test_fraction +
    0.3  * speedup_score +
    0.15 * code_quality +
    0.05 * pep8_style
)
```
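A sketch of combining these weighted components. The speedup normalization (capping at a 10× speedup) is an assumption for illustration, not the grader's actual formula; each input is assumed to be a fraction in [0, 1] except the raw speedup factor.

```python
def optimization_score(test_fraction, speedup, code_quality, pep8_style):
    """Combine the weighted components of the optimization grade.

    Assumption: raw speedup is normalized to [0, 1] by capping at 10x;
    the real grader's normalization may differ.
    """
    speedup_score = min(speedup / 10.0, 1.0)
    return (0.5 * test_fraction
            + 0.3 * speedup_score
            + 0.15 * code_quality
            + 0.05 * pep8_style)

print(optimization_score(1.0, 10.0, 1.0, 1.0))  # 1.0
```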

## Examples

### Using Python

```python
from server.env import PythonCodeReviewEnvironment
from models import PythonCodeReviewAction

env = PythonCodeReviewEnvironment()
obs = env.reset(task_id="syntax-fix-easy")

action = PythonCodeReviewAction(
    action_type="edit_code",
    code="""def normalize_username(raw_name: str) -> str:
    cleaned = raw_name.strip().lower()
    if not cleaned:
        return "anonymous"
    return cleaned.replace(" ", "_")
""",
)

obs = env.step(action)
print(f"Score: {obs.score}")
print(f"Reward: {obs.reward.value:+.3f}")
```

### Using cURL

```bash
# Check health
curl http://localhost:8000/health

# List tasks
curl http://localhost:8000/tasks

# Grade code
curl -X POST http://localhost:8000/tasks/syntax-fix-easy/grade \
  -H "Content-Type: application/json" \
  -d '{"action_type": "edit_code", "code": "..."}'
```

## Deployment

### Hugging Face Spaces

1. Create a new Space and select the **Docker** SDK
2. Upload the project files, including `server/Dockerfile`
3. The Space auto-deploys on CPU
4. Monitor the `/health` endpoint

### Local Docker

```bash
docker build -f server/Dockerfile -t python_code_review_env .
docker run -p 8000:8000 \
  -e MAX_CONCURRENT_ENVS=16 \
  python_code_review_env
```

## Performance

- **Startup:** < 5 s
- **Reset:** < 100 ms
- **Step:** 50 ms–3 s (depends on the action)
- **Inference (3 tasks):** < 20 minutes
- **CPU:** Runs on 2 vCPUs / 8 GB RAM

## Validation Checklist

- ✅ 3 deterministic tasks
- ✅ Deterministic graders (AST, pytest, benchmarks)
- ✅ `/health` → 200
- ✅ Scores vary per task (not constant)
- ✅ Docker builds successfully
- ✅ OpenEnv spec compliant
- ✅ Reward shaping working
- ✅ All tests deterministic and reproducible

## License

MIT


*Built for production. Deterministic. Deployable. Extensible.*