| title: Python Code Review Environment Server | |
| sdk: docker | |
| app_port: 8000 | |
| base_path: /web | |
| pinned: false | |
| tags: | |
| - openenv | |
| # OpenEnv Python Code Review Environment | |
| Production-ready hackathon submission for OpenEnv evaluation, deterministic validator runs, and Hugging Face Docker deployment. | |
| ## Architecture | |
| ```text | |
| root | |
| ├── inference.py # Root validator entrypoint | |
| ├── openenv.yaml # OpenEnv manifest | |
| ├── app/ | |
| │   ├── agents/ # Action policy and fallback strategy | |
| │   ├── env/ # RL loop runner and stdout contract | |
| │   ├── models/ # Inference dataclasses/config | |
| │   ├── services/ # OpenAI client wrapper with retries | |
| │   └── utils/ # Formatting, task loading, log suppression | |
| ├── server/ | |
| │   ├── env.py # OpenEnv environment and reward shaping | |
| │   ├── app.py # FastAPI/OpenEnv app, optional Gradio mount | |
| │   └── Dockerfile # Hugging Face Docker image | |
| ├── graders/ # Syntax, bug-fix, optimization graders | |
| ├── tasks/ # Deterministic benchmark tasks and references | |
| ├── services/ # Multi-domain analysis services | |
| ├── analyzers/ # Domain-specific analyzers | |
| ├── models/ # Lazy-loaded PyTorch scoring model | |
| ├── schemas/ # API request/response contracts | |
| └── tests/ # Local validation coverage | |
| ``` | |
| Runtime flow: | |
| ```text | |
| inference.py | |
| -> app.env.runner.InferenceRunner | |
| -> env.reset(task_id=...) | |
| -> ReviewAgent(action planning) | |
| -> env.step_result(action) | |
| -> strict [START]/[STEP]/[END] output | |
| ``` | |
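The flow above can be sketched as a small loop. This is a minimal sketch under stated assumptions, not the project's runner: `ToyEnv`, `StepResult`, `pick_action`, and the reward values are hypothetical stand-ins. Only the `reset()` / `step_result()` calls, the `last_action_error` field, and the `[START]`/`[STEP]`/`[END]` line shapes are taken from this README.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StepResult:
    """Hypothetical result shape; the real environment's fields may differ."""
    reward: float
    done: bool
    last_action_error: Optional[str] = None


class ToyEnv:
    """Stand-in environment: the episode ends when the agent submits."""

    def __init__(self) -> None:
        self._steps = 0

    def reset(self, task_id: str) -> None:
        self._steps = 0

    def step_result(self, action: str) -> StepResult:
        self._steps += 1
        # Made-up reward that grows with each step, for illustration only.
        return StepResult(reward=0.25 * self._steps, done=(action == "submit_solution"))


def pick_action(step: int) -> str:
    """Hypothetical fixed policy: test, edit, then submit."""
    script = ["run_tests", "edit_code", "submit_solution"]
    return script[min(step, len(script)) - 1]


def run_episode(env: ToyEnv, task_id: str, max_steps: int = 8) -> List[str]:
    """Return the stdout lines for one episode; [END] is always the last line."""
    lines = [f"[START] task={task_id} env=toy_env model=none"]
    rewards: List[float] = []
    done = False
    try:
        env.reset(task_id=task_id)
        while not done and len(rewards) < max_steps:
            step = len(rewards) + 1
            action = pick_action(step)
            result = env.step_result(action)
            rewards.append(result.reward)
            done = result.done
            lines.append(
                f"[STEP] step={step} action={action} reward={result.reward:.2f} "
                f"done={str(done).lower()} error={result.last_action_error or 'null'}"
            )
    except Exception:
        done = False  # a crash counts as failure, but [END] is still emitted below
    joined = ",".join(f"{r:.2f}" for r in rewards)
    lines.append(f"[END] success={str(done).lower()} steps={len(rewards)} rewards={joined}")
    return lines
```

Calling `print("\n".join(run_episode(ToyEnv(), "syntax_fix_invoice_totals")))` prints one `[START]` line, one `[STEP]` line per action, and a final `[END]` line. Building the `[END]` line outside the `try` block is one simple way to guarantee it appears on failure paths too.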
| ## What Was Fixed | |
| - `inference.py` now lives at the repo root and delegates to a strict runner under `app/env`. | |
| - OpenAI usage is limited to the official Python client: | |
| `client = OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN)`. | |
| - Defaults are applied for `API_BASE_URL` and `MODEL_NAME`; `HF_TOKEN` is read without a default, and its absence is handled explicitly. | |
| - Output now matches the required single-line contract exactly and always emits `[END]`, including failure paths. | |
| - The RL loop now uses `reset()` plus `step_result()` in a proper `while not done` loop. | |
| - Step errors now surface through `last_action_error` and are printed in `[STEP]`. | |
| - Reward shaping is now dynamic in the OpenEnv environment: | |
| code quality, test progress, runtime progress, error removal, regressions, and completion are all part of the reward. | |
| - The API-side reward service is no longer a static weighted sum and now exposes quality, error-reduction, and completion signals. | |
| - The Docker image now builds from the repo root, caches dependency installation more effectively, and runs `server.app:app` directly on port `8000`. | |
| - Server startup is lighter: | |
| the PyTorch analyzer is lazy-loaded and the Gradio demo is disabled by default. | |
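To make the reward-shaping item more concrete, here is one way the listed signals (code quality, test progress, runtime progress, error removal, regressions, completion) could be combined into a per-step reward. The field names, weights, and clamping range are all assumptions chosen for illustration; the formula in `server/env.py` is not shown in this README and may look quite different.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Per-step changes; field names are illustrative, not the project's schema."""
    quality_delta: float   # change in code-quality score since the last step
    tests_delta: float     # change in the fraction of passing tests
    runtime_delta: float   # change in the runtime score
    errors_removed: int    # errors fixed by this step
    regressions: int       # previously passing checks that now fail
    completed: bool        # whether the task was solved on this step


def shaped_reward(s: Signals) -> float:
    """Combine progress signals and clamp to [0, 1]. All weights are assumed."""
    raw = (
        0.2 * s.quality_delta
        + 0.4 * s.tests_delta
        + 0.1 * s.runtime_delta
        + 0.05 * s.errors_removed
        - 0.1 * s.regressions
        + (0.3 if s.completed else 0.0)
    )
    return max(0.0, min(1.0, raw))
```

Because the inputs are deltas between steps, the same action can earn a different reward depending on how much it actually moved the code forward, and a step that introduces regressions can be pushed down to zero.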
| ## Local Setup | |
| Install dev dependencies: | |
| ```bash | |
| pip install -e .[dev] | |
| ``` | |
| Run the test suite: | |
| ```bash | |
| pytest -q | |
| ``` | |
| Run the OpenEnv server locally: | |
| ```bash | |
| python -m uvicorn server.app:app --host 0.0.0.0 --port 8000 | |
| ``` | |
| Optional demo UI: | |
| ```bash | |
| # bash syntax; in Windows cmd use `set NAME=value` | |
| export ENABLE_GRADIO_DEMO=true | |
| export ENABLE_WEB_INTERFACE=true | |
| python -m uvicorn server.app:app --host 0.0.0.0 --port 8000 | |
| ``` | |
| ## Inference Contract | |
| Required environment variables: | |
| - `API_BASE_URL` | |
| Default: `https://router.huggingface.co/v1` | |
| - `MODEL_NAME` | |
| Default: `Qwen/Qwen2.5-3B-Instruct` | |
| - `HF_TOKEN` | |
| Mandatory for LLM guidance; no default is injected (see Known Limitations for the behavior when it is absent) | |
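The variable handling described above might look roughly like the following. This is a sketch only: `InferenceConfig` and `load_config` are hypothetical names, and the project's real config lives under `app/models` with a shape this README does not show. The two default values are the ones listed above.

```python
import os
from typing import Mapping, NamedTuple, Optional


class InferenceConfig(NamedTuple):
    """Hypothetical config container for the three variables."""
    api_base_url: str
    model_name: str
    hf_token: Optional[str]


def load_config(env: Mapping[str, str] = os.environ) -> InferenceConfig:
    """Apply defaults to API_BASE_URL and MODEL_NAME, but never to HF_TOKEN."""
    return InferenceConfig(
        api_base_url=env.get("API_BASE_URL", "https://router.huggingface.co/v1"),
        model_name=env.get("MODEL_NAME", "Qwen/Qwen2.5-3B-Instruct"),
        # An unset or empty token becomes None so callers can branch on it explicitly.
        hf_token=env.get("HF_TOKEN") or None,
    )
```

With a config like this, the client call quoted earlier becomes `OpenAI(base_url=cfg.api_base_url, api_key=cfg.hf_token)`, guarded by a check that `cfg.hf_token` is not `None`.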
| Example: | |
| ```bash | |
| # bash syntax; in Windows cmd use `set NAME=value` | |
| export API_BASE_URL=https://router.huggingface.co/v1 | |
| export MODEL_NAME=Qwen/Qwen2.5-3B-Instruct | |
| export HF_TOKEN=hf_xxx | |
| python inference.py | |
| ``` | |
| Expected stdout shape: | |
| ```text | |
| [START] task=syntax_fix_invoice_totals env=python_code_review_env model=Qwen/Qwen2.5-3B-Instruct | |
| [STEP] step=1 action=run_tests reward=0.12 done=false error=null | |
| [STEP] step=2 action=edit_code reward=0.96 done=false error=null | |
| [STEP] step=3 action=run_tests reward=0.99 done=false error=null | |
| [STEP] step=4 action=submit_solution reward=0.99 done=true error=null | |
| [END] success=true steps=4 rewards=0.12,0.96,0.99,0.99 | |
| ``` | |
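A quick local check of this shape can catch contract drift before a validator run. The helper below is a sketch, not the official validator: `parse_line` and `check_transcript` are hypothetical names, and it assumes every field is a `key=value` pair with no spaces inside values.

```python
import re
from typing import Dict, List

_LINE = re.compile(r"^\[(START|STEP|END)\] (.+)$")


def parse_line(line: str) -> Dict[str, str]:
    """Split one contract line into its tag and key=value fields."""
    match = _LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a contract line: {line!r}")
    fields = {"tag": match.group(1)}
    for pair in match.group(2).split(" "):
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"malformed field {pair!r} in {line!r}")
        fields[key] = value
    return fields


def check_transcript(lines: List[str]) -> bool:
    """True if lines are START, STEP..., END and [END] agrees with the steps."""
    parsed = [parse_line(line) for line in lines]
    if len(parsed) < 2 or parsed[0]["tag"] != "START" or parsed[-1]["tag"] != "END":
        return False
    steps = parsed[1:-1]
    if any(p["tag"] != "STEP" for p in steps):
        return False
    end = parsed[-1]
    rewards = [r for r in end.get("rewards", "").split(",") if r]
    # The step count and the reward list in [END] must mirror the [STEP] lines.
    return end.get("steps") == str(len(steps)) and rewards == [p.get("reward") for p in steps]
```

Feeding it the six lines shown above returns `True`; dropping the `[END]` line returns `False`.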
| ## Docker | |
| Build from the project root: | |
| ```bash | |
| docker build -f server/Dockerfile -t openenv-python-code-review-env . | |
| ``` | |
| Run locally: | |
| ```bash | |
| docker run --rm -p 8000:8000 \ | |
| -e API_BASE_URL=https://router.huggingface.co/v1 \ | |
| -e MODEL_NAME=Qwen/Qwen2.5-3B-Instruct \ | |
| -e HF_TOKEN=hf_xxx \ | |
| openenv-python-code-review-env | |
| ``` | |
| Container behavior: | |
| - Base image: `python:3.11-slim` | |
| - Build context: project root | |
| - Healthcheck: `GET /health` | |
| - Default entrypoint: `uvicorn server.app:app --host 0.0.0.0 --port 8000` | |
| ## Hugging Face Spaces | |
| Recommended deployment steps: | |
| 1. Create a Docker Space. | |
| 2. Push this repository as-is. | |
| 3. Let Spaces build with `server/Dockerfile`. | |
| 4. Set Space secrets: | |
| `HF_TOKEN` | |
| 5. Set Space variables as needed: | |
| `API_BASE_URL`, `MODEL_NAME`, `ENABLE_GRADIO_DEMO=false` | |
| `ENABLE_WEB_INTERFACE=false` is also supported for OpenEnv-managed deploys. | |
| 6. Confirm the app listens on port `8000`. | |
| 7. Smoke-test: | |
| `/health` | |
| `/reset` | |
| `/step` | |
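The smoke test in step 7 can be scripted with the standard library. This is a hedged sketch: `GET /health` comes from the healthcheck above, but the use of `POST` with a JSON body for `/reset` and `/step` is an assumption, and the payloads are left to the caller because this README does not document the request schemas.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_request(
    base_url: str, path: str, payload: Optional[Dict[str, Any]] = None
) -> urllib.request.Request:
    """GET when payload is None, otherwise POST the payload as JSON."""
    url = base_url.rstrip("/") + path
    if payload is None:
        return urllib.request.Request(url, method="GET")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def smoke_test(
    base_url: str,
    reset_payload: Dict[str, Any],
    step_payload: Dict[str, Any],
    timeout: float = 10.0,
) -> Dict[str, int]:
    """Hit the three endpoints in order and return their HTTP status codes.

    urlopen raises urllib.error.HTTPError on 4xx/5xx, so a clean return
    means every endpoint answered successfully.
    """
    checks = [("/health", None), ("/reset", reset_payload), ("/step", step_payload)]
    statuses: Dict[str, int] = {}
    for path, payload in checks:
        request = build_request(base_url, path, payload)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            statuses[path] = response.status
    return statuses
```

For a local container, `base_url` would be `http://localhost:8000`; for a Space, use the Space's public URL.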
| ## Performance Notes | |
| - The maximum number of concurrent environments defaults to `2`, sized for a `2 vCPU / 8 GB RAM` target. | |
| - The analyzer model is lazy-loaded instead of being created at startup. | |
| - The inference runner relies on short prompts, low token budgets, and limited retries. | |
| - The policy uses deterministic reference-code fallback instead of expensive iterative code generation. | |
| - Run public validation before the final submission to avoid wasting hidden-eval steps. | |
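The lazy-loading note can be illustrated with a small thread-safe wrapper. This is a generic sketch of the pattern, not the project's code: `Lazy` is a hypothetical helper, and the factory passed to it stands in for whatever constructs the PyTorch analyzer.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class Lazy(Generic[T]):
    """Build a value on first use instead of at import or startup time."""

    def __init__(self, factory: Callable[[], T]) -> None:
        self._factory = factory
        self._value: Optional[T] = None
        self._loaded = False
        self._lock = threading.Lock()

    def get(self) -> T:
        # Double-checked locking: the lock is only taken until the first load finishes.
        if not self._loaded:
            with self._lock:
                if not self._loaded:
                    self._value = self._factory()
                    self._loaded = True
        return self._value  # type: ignore[return-value]
```

A module-level `analyzer = Lazy(build_analyzer)` (with `build_analyzer` being a hypothetical constructor) keeps server startup cheap; the first request that calls `analyzer.get()` pays the load cost once.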
| ## Known Limitations | |
| - If `HF_TOKEN` is absent, inference still completes with deterministic fallback actions, but LLM guidance is skipped. | |
| - The benchmark tasks are deterministic and intentionally small; this is good for validator stability but not a full training benchmark. | |
| - Gradio remains optional and is disabled by default to keep deployment lighter. | |
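The first limitation, completing without `HF_TOKEN`, amounts to a guarded policy choice. The sketch below shows the general idea under assumptions: `choose_action`, `fallback_action`, and the fixed action script are hypothetical, and the project's real fallback is described above as reference-code based rather than a simple script. The action names are the ones that appear in the expected stdout example.

```python
from typing import Callable, Optional

ALLOWED_ACTIONS = ("run_tests", "edit_code", "submit_solution")


def fallback_action(step: int) -> str:
    """Deterministic script (assumed): test, edit, re-test, then submit."""
    script = ["run_tests", "edit_code", "run_tests", "submit_solution"]
    return script[min(step, len(script)) - 1]


def choose_action(
    step: int,
    hf_token: Optional[str],
    ask_llm: Optional[Callable[[int], str]],
) -> str:
    """Prefer the LLM's suggestion, but never depend on it."""
    if not hf_token or ask_llm is None:
        return fallback_action(step)  # no token: skip LLM guidance entirely
    try:
        suggestion = ask_llm(step)
    except Exception:
        return fallback_action(step)  # network or API failure
    # Unknown suggestions are ignored so the episode stays within valid actions.
    return suggestion if suggestion in ALLOWED_ACTIONS else fallback_action(step)
```

Keeping the fallback deterministic means a run without a token produces the same action sequence every time, which suits validator stability.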