# Hackathon Checklist

This file translates the tutorial folder into a concrete plan for `python_env`.

It is not a generic OpenEnv summary. It is a project-specific checklist showing:

- what the tutorials are teaching
- how this repo maps to those ideas
- what is already done
- what still needs to be finished before submission

## 1. What The Tutorials Mean For This Project

### Tutorial 1: OpenEnv Pattern

Main concept:

- every environment should follow a clean pattern:
  - typed models
  - environment logic
  - client
  - FastAPI/OpenEnv app
  - Docker packaging

How `python_env` maps:

- `models.py`: typed action/observation/config/evaluation models
- `server/code_review_environment.py`: environment logic
- `client.py`: Python client for reset/step/state
- `server/app.py`: OpenEnv app plus helper routes
- `server/Dockerfile`: container packaging

Status:

- done

What to keep in mind:

- do not break the OpenEnv contract while adding features
- treat models as the public interface

### Tutorial 2: Deployment

Main concept:

- local development first
- Docker second
- HF Spaces deployment third
- test `/health`, `/reset`, `/docs`, `/ws`

How `python_env` maps:

- local server: `uvicorn server.app:app --reload --host 0.0.0.0 --port 8000`
- Docker: `docker build -t python_env-env:latest -f server/Dockerfile .`
- Spaces: `openenv push`

Status:

- app boots locally
- Dockerfile exists and now supports `HOST`, `PORT`, `WORKERS`, `MAX_CONCURRENT_ENVS`
- live Docker build still needs final verification
- Spaces deployment still needs to be executed and checked

### Tutorial 3: Scaling

Main concept:

- OpenEnv works best with WebSocket sessions
- use environment class/factory instead of a singleton for OpenEnv session handling
- support concurrency with `MAX_CONCURRENT_ENVS`

How `python_env` maps:

- `create_app(PythonEnvironment, PythonReviewAction, PythonReviewObservation, max_concurrent_envs=...)`
- `MAX_CONCURRENT_ENVS` is now read from env vars
- Docker now exposes `MAX_CONCURRENT_ENVS`

Status:

- partially done

Important caveat:

- OpenEnv `/reset` and `/step` use the class-based session model
- custom routes such as `/history` and `/config` still use a singleton helper instance
- this is acceptable for manual tooling, but it is not a perfect unified session model

Recommendation:

- keep it for now if your priority is submission
- refactor only if it starts causing testing confusion

### Tutorial 4: RL Training And Reward Design

Main concept:

- a good RL environment needs:
  - meaningful reward
  - repeated trajectories
  - enough task diversity
  - an inference/training loop

How `python_env` maps:

- reward shaping already exists:
  - matched rubric items
  - false-positive penalties
  - duplicate penalties
  - hint penalties
  - patch bonus
  - finalize bonus
- `inference.py` already provides a baseline model-vs-env loop

Status:

- partially done

Gap:

- 3 tasks are enough for hackathon minimums
- 3 tasks are not enough for serious RL learning

## 2. Current Repo Status

### Strong Areas

- real-world task: code review
- typed Pydantic/OpenEnv models
- deterministic grader
- 3 difficulty levels
- partial-progress reward shaping
- manual routes for health/tasks/review/config/history
- baseline inference script
- docs in `README.md`, `Project.md`

### Weak Areas

- benchmark still small
- Docker image build not fully verified end-to-end
- HF Spaces deployment not yet executed
- `openenv validate` still needs to be run in your actual runtime
- no large trajectory dataset yet
- custom REST state and OpenEnv session state are not fully unified

## 3. What You Need To Do To Be Submission-Ready

### Step 1: Validate Local Server

Run:

```powershell
uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
```

Manually verify:

- `http://127.0.0.1:8000/docs`
- `http://127.0.0.1:8000/health`
- `POST /reset`
- `POST /step`
- `GET /tasks`
- `POST /review`

### Step 2: Run Tests

Run:

```powershell
python -m pytest tests -q
```

You want all tests green before Docker or HF deployment.
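
Once the server from Step 1 is running, the manual endpoint checks can also be scripted so they are repeatable alongside the test run. The following is a minimal sketch, not part of the repo: the function names `http_status` and `run_checks` are hypothetical, and sending an empty JSON body to `POST /reset` is an assumption about what the route accepts. `POST /step` and `POST /review` are left out because their payloads depend on the action models in `models.py`.

```python
import json
import urllib.error
import urllib.request

BASE_URL = "http://127.0.0.1:8000"

# (method, path) pairs taken from the Step 1 checklist. /step and /review
# are omitted because they need a valid action payload.
CHECKS = [
    ("GET", "/docs"),
    ("GET", "/health"),
    ("POST", "/reset"),
    ("GET", "/tasks"),
]


def http_status(method, url, payload=None, timeout=5.0):
    """Return the HTTP status code, or None if the server is unreachable."""
    # Assumption: POST routes accept a JSON body; an empty object is sent
    # when no payload is given.
    data = json.dumps(payload or {}).encode("utf-8") if method == "POST" else None
    request = urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None


def run_checks(base_url=BASE_URL, fetch=http_status):
    """Map each 'METHOD /path' to True when the route returns a 2xx status."""
    results = {}
    for method, path in CHECKS:
        status = fetch(method, base_url + path)
        results[f"{method} {path}"] = status is not None and 200 <= status < 300
    return results


if __name__ == "__main__":
    for name, ok in run_checks().items():
        print(f"{'ok  ' if ok else 'FAIL'} {name}")
```

The `fetch` parameter exists so the check logic can be exercised without a live server, and the same script can be pointed at the Docker container or the live Space by changing `base_url`.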
### Step 3: Run OpenEnv Validation

Run:

```powershell
openenv validate
```

This is a hard requirement.

If validation fails:

- fix schema mismatch first
- fix route mismatch second
- fix packaging third

### Step 4: Run Baseline Inference

Run:

```powershell
$env:API_BASE_URL="https://api.openai.com/v1"
$env:MODEL_NAME="gpt-4.1-mini"
$env:OPENAI_API_KEY="your_key"
$env:ENV_BASE_URL="http://127.0.0.1:8000"
python inference.py
```

You want:

- script completes without crashing
- `inference_results.json` gets written
- all 3 tasks run
- scores are reproducible

### Step 5: Verify Docker

Run:

```powershell
docker build -t python_env-env:latest -f server/Dockerfile .
docker run --rm -p 8000:8000 python_env-env:latest
```

Then test:

- `GET /health`
- `POST /reset`
- `POST /step`

### Step 6: Deploy To HF Spaces

Run:

```powershell
openenv push
```

Then verify the live Space:

- `/health`
- `/docs`
- `/reset`
- `/web`

## 4. What Will Help You “Win” Instead Of Just “Submit”

Passing minimum requirements is not enough. To be competitive, improve these areas:

### A. Increase Task Diversity

Current:

- 3 benchmark tasks

Target:

- at least 10 to 20 tasks before final submission if possible

Good additions:

- SQL injection review
- unsafe YAML/pickle loading
- file-handle leak
- race-condition style bug
- retry/backoff misuse
- caching bug
- logging/privacy leak
- API timeout handling

### B. Improve Observation Context

Good RL environments provide enough context for the model to improve.

Possible improvements:

- add matched categories so far
- add a short summary of uncovered issue types
- add previous actions in structured form, not just free text
- add rubric coverage signals without leaking exact answers

### C. Collect Trajectories

You need data that shows:

- first attempt
- improved second attempt
- final attempt
- failures
- false positives
- hint usage

This is much more useful than only saving final scores.

### D. Improve Reward Design Carefully

Current reward design is already decent.

Good refinements:

- slightly larger reward for critical security findings
- bonus for correct line numbers
- bonus for high-quality recommendation text
- penalty for vague findings with no rationale

Do not overcomplicate the reward before submission. Stability matters more.

## 5. Recommended Immediate Priority Order

If time is limited, do the work in this order:

1. `pytest`
2. `openenv validate`
3. local inference run
4. Docker build and run
5. HF Space deployment
6. add 5 to 10 more tasks
7. collect trajectory data

## 6. One-Sentence Summary

You are following the correct OpenEnv architecture from the tutorials already; the main remaining work is not redesign, it is validation, deployment verification, and expanding task/data quality so the environment scores well in human review.
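
For the trajectory collection in section 4C and priority item 7, one simple option is to append every step to a JSON Lines file rather than saving only final scores. This is a sketch under assumptions: `StepRecord`, its field names, and the helpers `append_step` and `load_steps` are hypothetical and do not exist in the repo; the fields would need to be adapted to whatever `inference.py` and the observation model actually expose.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class StepRecord:
    """One environment step. All field names here are illustrative."""
    task_id: str
    attempt: int               # 1 = first attempt, 2 = improved second attempt, ...
    action: dict               # structured action that was sent to the environment
    reward: float
    false_positives: int = 0
    used_hint: bool = False
    done: bool = False         # True on the final attempt of an episode


def append_step(path, record):
    """Append one step to the file as a single JSON line."""
    with Path(path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(asdict(record)) + "\n")


def load_steps(path):
    """Read every recorded step back, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as handle:
        return [StepRecord(**json.loads(line)) for line in handle if line.strip()]
```

Appending one line per step keeps failed episodes, false positives, and hint usage in the data even when a run crashes partway through, which a single end-of-run results file would lose.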