---
title: Python Code Review Environment Server
sdk: docker
app_port: 8000
base_path: /web
pinned: false
tags:
  - openenv
  - code-review
---
# Python Code Review Environment

A production-grade OpenEnv environment for Python code review, repair, and optimization tasks. This environment simulates real-world developer workflows in which an AI agent reviews, fixes, and improves Python code.
## Overview

**`python_code_review_env`** is a deterministic benchmark environment featuring:

- ✅ **3 real-world tasks** with increasing difficulty (Syntax, Bug Fix, Optimization)
- ✅ **Deterministic graders** using AST analysis, pytest execution, and performance benchmarking
- ✅ **OpenAI-compatible API** supporting free/open models (Gemini, DeepSeek, Together, OpenRouter)
- ✅ **Production-ready Docker** deployment for Hugging Face Spaces
- ✅ **Structured Observations & Actions** following the OpenEnv spec
- ✅ **Rich reward shaping** with bonuses for syntax fixes, test passes, and optimization
## Tasks

### 1. 🟢 Easy: Syntax Fixing

**Task ID**: `syntax-fix-easy`

Fix broken Python code with syntax errors.

- **Difficulty**: Easy
- **Goal**: Repair syntax errors so the code compiles
- **Starter Code**: Function with a missing closing parenthesis
- **Grading**: Compilation check + code similarity to reference
- **Score Range**: 0.0–1.0
### 2. 🟡 Medium: Bug Fixing

**Task ID**: `bug-fix-medium`

Fix logic bugs with visible and hidden test cases.

- **Difficulty**: Medium
- **Goal**: Repair a logic error in an invoice calculation
- **Starter Code**: Function that returns the wrong total (returns the subtotal instead of the discounted total)
- **Grading**: Test pass fraction (visible & hidden)
- **Score Range**: 0.0–1.0
### 3. 🔴 Hard: Optimization & Refactoring

**Task ID**: `optimization-hard`

Optimize inefficient code while maintaining correctness.

- **Difficulty**: Hard
- **Goal**: Convert O(n²) duplicate removal to O(n) using a set
- **Starter Code**: Slow nested-loop implementation
- **Grading**: 50% correctness + 30% speedup + 15% code quality + 5% style
- **Score Range**: 0.0–1.0
- **Bonus**: Runtime benchmarking against the reference implementation
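The target transformation for this task can be sketched as follows. The function names are illustrative, not the actual starter code:

```python
def remove_duplicates_slow(items):
    # O(n^2): membership test on a list rescans it for every element
    result = []
    for item in items:
        if item not in result:
            result.append(item)
    return result


def remove_duplicates_fast(items):
    # O(n): a set gives O(1) membership checks; insertion order is preserved
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result
```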
## Quick Start

### Run Locally

```bash
cd python-code-review-env
pip install -r server/requirements.txt
python -m server.app
```

Visit http://localhost:8000/docs for the interactive API documentation.
### Run with Docker

```bash
docker build -f server/Dockerfile -t python_code_review_env:latest .
docker run -p 8000:8000 python_code_review_env:latest
```

### Run Inference

```bash
python inference.py --model "gpt-3.5-turbo" --base-url "http://localhost:8000/v1"
```
## OpenEnv Specification

### Observation

```json
{
  "task_id": "syntax-fix-easy",
  "difficulty": "easy",
  "task_description": "Fix syntax errors...",
  "current_code": "def normalize_username(raw_name: str) -> str:\n    cleaned = raw_name.strip().lower(\n    ...",
  "errors": "invalid syntax (line 2, column 40)",
  "test_results": "Not run yet.",
  "visible_tests": ["normalize_username(' Alice Smith ') == 'alice_smith'"],
  "history": [],
  "attempts_remaining": 8,
  "score": 0.0,
  "reward": {
    "value": 0.0,
    "reason": "Episode reset."
  }
}
```
### Action

```json
{
  "action_type": "edit_code",
  "code": "def normalize_username(raw_name: str) -> str:\n    cleaned = raw_name.strip().lower()\n    if not cleaned:\n        return \"anonymous\"\n    return cleaned.replace(\" \", \"_\")"
}
```
### Reward Details

- **+0.2**: Syntax fixed (one-time per episode)
- **+0.15**: Each additional passing test (cumulative per test)
- **+0.1**: Code quality improvement
- **+0.5**: Full correctness (100% of hidden tests, one-time)
- **-0.1**: Invalid action
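The bonus structure above can be sketched as a reward-shaping function. The state dicts and key names here are hypothetical; this is an illustration of the scheme, not the environment's actual implementation:

```python
def shape_reward(prev: dict, curr: dict) -> float:
    """Illustrative reward shaping following the bonuses listed above.

    `prev` and `curr` are hypothetical state snapshots with keys:
    syntax_ok (bool), tests_passed (int), hidden_pass_rate (float),
    quality_improved (bool), and valid (bool) on `curr`.
    """
    if not curr["valid"]:
        return -0.1                       # invalid action penalty
    reward = 0.0
    if curr["syntax_ok"] and not prev["syntax_ok"]:
        reward += 0.2                     # one-time syntax-fix bonus
    newly_passing = curr["tests_passed"] - prev["tests_passed"]
    if newly_passing > 0:
        reward += 0.15 * newly_passing    # cumulative per-test bonus
    if curr.get("quality_improved"):
        reward += 0.1                     # code quality improvement
    if curr["hidden_pass_rate"] == 1.0 and prev["hidden_pass_rate"] < 1.0:
        reward += 0.5                     # one-time full-correctness bonus
    return reward
```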
## Architecture

```
python_code_review_env/
├── models.py            # Pydantic models (Observation, Action, Reward)
├── server/
│   ├── app.py           # FastAPI server
│   ├── env.py           # OpenEnv environment
│   ├── Dockerfile       # Docker config
│   └── requirements.txt
├── graders/
│   ├── common.py        # Shared utilities
│   ├── syntax.py        # Syntax/bug graders
│   ├── optimization.py  # Optimization grader
│   └── pytest_runner.py
├── tasks/
│   ├── task_bank.py     # 3 deterministic tasks
│   └── __init__.py
├── inference.py         # Baseline evaluation script
├── openenv.yaml         # OpenEnv spec
├── pyproject.toml       # Project metadata
└── README.md
```
## FastAPI Endpoints

- `GET /health` → Health check
- `GET /tasks` → List all tasks
- `GET /tasks/{task_id}` → Get task details
- `POST /tasks/{task_id}/grade` → Grade code offline
- Standard OpenEnv endpoints (`/reset`, `/step`, `/state`)
## Deterministic Graders

### Syntax Fix

```
if code compiles:
    score = 1.0
else:
    score = 0.15 + 0.55 * similarity_to_reference
```
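A minimal runnable sketch of this rule, assuming similarity is measured with `difflib` (the actual grader in `graders/syntax.py` may compute it differently):

```python
import difflib


def grade_syntax_fix(code: str, reference: str) -> float:
    # Full credit if the submission compiles
    try:
        compile(code, "<submission>", "exec")
        return 1.0
    except SyntaxError:
        # Otherwise, partial credit scales with similarity to the reference
        similarity = difflib.SequenceMatcher(None, code, reference).ratio()
        return 0.15 + 0.55 * similarity
```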
### Bug Fix

```
score = test_pass_fraction (0.0 to 1.0)
```
### Optimization

```
score = (
    0.5  * test_fraction +
    0.3  * speedup_score +
    0.15 * code_quality +
    0.05 * pep8_style
)
```
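The speedup term can be estimated by timing both implementations. Here is a hedged sketch using a best-of-N wall-clock ratio clamped to [0, 1]; the environment's actual benchmark in `graders/optimization.py` may differ:

```python
import time


def speedup_score(candidate, reference, data, repeats: int = 5) -> float:
    """Best-of-N timing ratio, clamped to [0, 1]. Illustrative only."""
    def best_time(fn) -> float:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            fn(data)
            best = min(best, time.perf_counter() - start)
        return best

    # A candidate at least as fast as the reference scores 1.0
    ratio = best_time(reference) / max(best_time(candidate), 1e-12)
    return min(ratio, 1.0)
```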
## Examples

### Using Python

```python
from server.env import PythonCodeReviewEnvironment
from models import PythonCodeReviewAction

env = PythonCodeReviewEnvironment()
obs = env.reset(task_id="syntax-fix-easy")

action = PythonCodeReviewAction(
    action_type="edit_code",
    code="""def normalize_username(raw_name: str) -> str:
    cleaned = raw_name.strip().lower()
    if not cleaned:
        return "anonymous"
    return cleaned.replace(" ", "_")
""",
)
obs = env.step(action)
print(f"Score: {obs.score}")
print(f"Reward: {obs.reward.value:+.3f}")
```
### Using cURL

```bash
# Check health
curl http://localhost:8000/health

# List tasks
curl http://localhost:8000/tasks

# Grade code
curl -X POST http://localhost:8000/tasks/syntax-fix-easy/grade \
  -H "Content-Type: application/json" \
  -d '{"action_type": "edit_code", "code": "..."}'
```
## Deployment

### Hugging Face Spaces

1. Create a new Space with the Docker SDK
2. Upload the project files, including `server/Dockerfile`
3. The Space auto-deploys on CPU hardware
4. Monitor the `/health` endpoint

### Local Docker

```bash
docker build -f server/Dockerfile -t python_code_review_env .
docker run -p 8000:8000 \
  -e MAX_CONCURRENT_ENVS=16 \
  python_code_review_env
```
## Performance

- Startup: < 5 s
- Reset: < 100 ms
- Step: 50 ms–3 s (depends on the action)
- Inference (3 tasks): < 20 minutes
- CPU: Works on 2 vCPUs / 8 GB RAM
## Validation Checklist

- ✅ 3 deterministic tasks
- ✅ Deterministic graders (AST, pytest, benchmarks)
- ✅ `/health` returns 200
- ✅ Scores vary per task (not constant)
- ✅ Docker builds successfully
- ✅ OpenEnv spec compliant
- ✅ Reward shaping works
- ✅ All tests deterministic and reproducible
## License

MIT

---

**Built for production. Deterministic. Deployable. Extensible.**