# Hackathon Checklist
This file translates the tutorial folder into a concrete plan for `python_env`.
It is not a generic OpenEnv summary. It is a project-specific checklist showing:
- what the tutorials are teaching
- how this repo maps to those ideas
- what is already done
- what still needs to be finished before submission
## 1. What The Tutorials Mean For This Project
### Tutorial 1: OpenEnv Pattern
Main concept:
- every environment should follow a clean pattern:
  - typed models
  - environment logic
  - client
  - FastAPI/OpenEnv app
  - Docker packaging
How `python_env` maps:
- `models.py`: typed action/observation/config/evaluation models
- `server/code_review_environment.py`: environment logic
- `client.py`: Python client for reset/step/state
- `server/app.py`: OpenEnv app plus helper routes
- `server/Dockerfile`: container packaging
Status:
- done
What to keep in mind:
- do not break the OpenEnv contract while adding features
- treat models as the public interface
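The "models as the public interface" idea can be sketched as follows. This is a hypothetical shape, not the actual contents of `models.py`: the field names (`task_id`, `findings`, `request_hint`, `finalize`) are assumptions for illustration.

```python
from dataclasses import dataclass, field, asdict

# Hypothetical sketch of the typed action model; the real fields live in models.py.
@dataclass
class PythonReviewAction:
    task_id: str
    findings: list = field(default_factory=list)  # e.g. [{"category": "security", "line": 12}]
    request_hint: bool = False
    finalize: bool = False

def to_step_payload(action: PythonReviewAction) -> dict:
    """Serialize a typed action into the JSON body a /step request would carry."""
    return {"action": asdict(action)}

payload = to_step_payload(PythonReviewAction(task_id="task-1"))
```

Keeping serialization behind a typed model like this is what makes it safe to add features without breaking the OpenEnv contract.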
### Tutorial 2: Deployment
Main concept:
- local development first
- Docker second
- HF Spaces deployment third
- test `/health`, `/reset`, `/docs`, `/ws`
How `python_env` maps:
- local server: `uvicorn server.app:app --reload --host 0.0.0.0 --port 8000`
- Docker: `docker build -t python_env-env:latest -f server/Dockerfile .`
- Spaces: `openenv push`
Status:
- app boots locally
- Dockerfile exists and now supports `HOST`, `PORT`, `WORKERS`, `MAX_CONCURRENT_ENVS`
- live Docker build still needs final verification
- Spaces deployment still needs to be executed and checked
### Tutorial 3: Scaling
Main concept:
- OpenEnv works best with WebSocket sessions
- use environment class/factory instead of a singleton for OpenEnv session handling
- support concurrency with `MAX_CONCURRENT_ENVS`
How `python_env` maps:
- `create_app(PythonEnvironment, PythonReviewAction, PythonReviewObservation, max_concurrent_envs=...)`
- `MAX_CONCURRENT_ENVS` is now read from env vars
- Docker now exposes `MAX_CONCURRENT_ENVS`
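Reading `MAX_CONCURRENT_ENVS` robustly might look like the sketch below. The default of 4 and the helper name are assumptions, not the repo's actual code; the `create_app` call in the comment mirrors the signature quoted above.

```python
import os

def read_max_concurrent_envs(default: int = 4) -> int:
    """Parse MAX_CONCURRENT_ENVS from the environment, falling back on bad input."""
    raw = os.environ.get("MAX_CONCURRENT_ENVS", "")
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default

# app = create_app(PythonEnvironment, PythonReviewAction, PythonReviewObservation,
#                  max_concurrent_envs=read_max_concurrent_envs())
```

Falling back to a sane default keeps a typo in the Docker `-e` flag from crashing the container at boot.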
Status:
- partially done
Important caveat:
- OpenEnv `/reset` and `/step` use the class-based session model
- custom routes such as `/history` and `/config` still use a singleton helper instance
- this is acceptable for manual tooling, but it is not a perfect unified session model
Recommendation:
- keep it for now if your priority is submission
- refactor only if it starts causing testing confusion
### Tutorial 4: RL Training And Reward Design
Main concept:
- a good RL environment needs:
  - meaningful reward
  - repeated trajectories
  - enough task diversity
  - an inference/training loop
How `python_env` maps:
- reward shaping already exists:
  - matched rubric items
  - false-positive penalties
  - duplicate penalties
  - hint penalties
  - patch bonus
  - finalize bonus
- `inference.py` already provides a baseline model-vs-env loop
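The reward components listed above could combine along these lines. The weights here are invented for illustration; the grader's real coefficients live in the environment code.

```python
def shape_reward(matched: int, false_positives: int, duplicates: int,
                 hints_used: int, patched: bool, finalized: bool) -> float:
    """Illustrative reward shaping; weights are placeholders, not the repo's values."""
    reward = 1.0 * matched            # matched rubric items
    reward -= 0.5 * false_positives   # false-positive penalty
    reward -= 0.25 * duplicates       # duplicate penalty
    reward -= 0.1 * hints_used        # hint penalty
    if patched:
        reward += 0.5                 # patch bonus
    if finalized:
        reward += 0.25                # finalize bonus
    return reward
```

Keeping each component as a separate additive term makes it easy to ablate one signal at a time when tuning.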
Status:
- partially done
Gap:
- 3 tasks are enough for hackathon minimums
- 3 tasks are not enough for serious RL learning
## 2. Current Repo Status
### Strong Areas
- real-world task: code review
- typed Pydantic/OpenEnv models
- deterministic grader
- 3 difficulty levels
- partial-progress reward shaping
- manual routes for health/tasks/review/config/history
- baseline inference script
- docs in `README.md`, `Project.md`
### Weak Areas
- benchmark still small
- Docker image build not fully verified end-to-end
- HF Spaces deployment not yet executed
- `openenv validate` still needs to be run in your actual runtime
- no large trajectory dataset yet
- custom REST state and OpenEnv session state are not fully unified
## 3. What You Need To Do To Be Submission-Ready
### Step 1: Validate Local Server
Run:
```powershell
uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
```
Manually verify:
- `http://127.0.0.1:8000/docs`
- `http://127.0.0.1:8000/health`
- `POST /reset`
- `POST /step`
- `GET /tasks`
- `POST /review`
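The manual checks above can be scripted as a small smoke test. This is a stdlib-only sketch; the empty POST bodies are placeholders and the real endpoints may require richer payloads.

```python
import json
import urllib.request

BASE = "http://127.0.0.1:8000"

# (method, path, body) triples for the manual checks in this step.
SMOKE_CHECKS = [
    ("GET", "/health", None),
    ("GET", "/docs", None),
    ("POST", "/reset", {}),
    ("POST", "/step", {}),
    ("GET", "/tasks", None),
    ("POST", "/review", {}),
]

def run_smoke_checks(base: str = BASE) -> dict:
    """Hit each endpoint and record the HTTP status (or the error) per path."""
    results = {}
    for method, path, body in SMOKE_CHECKS:
        data = json.dumps(body).encode() if body is not None else None
        req = urllib.request.Request(base + path, data=data, method=method,
                                     headers={"Content-Type": "application/json"})
        try:
            with urllib.request.urlopen(req, timeout=5) as resp:
                results[path] = resp.status
        except Exception as exc:  # server down, 4xx/5xx, malformed body, etc.
            results[path] = repr(exc)
    return results
```

Run it with the server up and eyeball the dict; anything that is not a 200 needs a look before moving on to Docker.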
### Step 2: Run Tests
Run:
```powershell
python -m pytest tests -q
```
You want all tests green before Docker or HF deployment.
### Step 3: Run OpenEnv Validation
Run:
```powershell
openenv validate
```
This is a hard requirement.
If validation fails:
- fix schema mismatch first
- fix route mismatch second
- fix packaging third
### Step 4: Run Baseline Inference
Run:
```powershell
$env:API_BASE_URL="https://api.openai.com/v1"
$env:MODEL_NAME="gpt-4.1-mini"
$env:OPENAI_API_KEY="your_key"
$env:ENV_BASE_URL="http://127.0.0.1:8000"
python inference.py
```
You want:
- script completes without crashing
- `inference_results.json` gets written
- all 3 tasks run
- scores are reproducible
### Step 5: Verify Docker
Run:
```powershell
docker build -t python_env-env:latest -f server/Dockerfile .
docker run --rm -p 8000:8000 python_env-env:latest
```
Then test:
- `GET /health`
- `POST /reset`
- `POST /step`
### Step 6: Deploy To HF Spaces
Run:
```powershell
openenv push
```
Then verify the live Space:
- `/health`
- `/docs`
- `/reset`
- `/web`
## 4. What Will Help You “Win” Instead Of Just “Submit”
Passing minimum requirements is not enough. To be competitive, improve these areas:
### A. Increase Task Diversity
Current:
- 3 benchmark tasks
Target:
- at least 10 to 20 tasks before final submission, if time allows
Good additions:
- SQL injection review
- unsafe YAML/pickle loading
- file-handle leak
- race-condition style bug
- retry/backoff misuse
- caching bug
- logging/privacy leak
- API timeout handling
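One of the additions above, sketched as data. The field names (`task_id`, `difficulty`, `code`, `rubric`) are hypothetical, not the repo's actual task schema; adapt them to whatever the existing three tasks use.

```python
# Hypothetical shape for a new SQL-injection benchmark task.
sql_injection_task = {
    "task_id": "sql-injection-1",
    "difficulty": "medium",
    "code": (
        "def get_user(db, name):\n"
        "    return db.execute(\"SELECT * FROM users WHERE name = '\" + name + \"'\")\n"
    ),
    "rubric": [
        {
            "category": "security",
            "issue": "string-concatenated SQL enables injection",
            "fix": "use parameterized queries",
        },
    ],
}
```

Authoring new tasks as plain data like this keeps the deterministic grader unchanged while the benchmark grows.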
### B. Improve Observation Context
Good RL environments provide enough context for the model to improve.
Possible improvements:
- add matched categories so far
- add a short summary of uncovered issue types
- add previous actions in structured form, not just free text
- add rubric coverage signals without leaking exact answers
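A richer observation could be assembled along these lines. The helper and its field names are an assumption sketched for illustration, not the repo's `PythonReviewObservation`.

```python
def build_observation(matched_categories: list, total_rubric_items: int,
                      actions_so_far: list) -> dict:
    """Hypothetical richer observation: coverage signals without leaking answers."""
    covered = sorted(set(matched_categories))
    return {
        "matched_categories": covered,             # categories found so far
        "coverage": len(covered) / max(total_rubric_items, 1),
        "previous_actions": actions_so_far,        # structured, not free text
        "issues_remain": len(covered) < total_rubric_items,
    }
```

Note that the observation exposes *how much* is left, never *which* rubric items are missing, so the answer key stays hidden.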
### C. Collect Trajectories
You need data that shows:
- first attempt
- improved second attempt
- final attempt
- failures
- false positives
- hint usage
This is much more useful than only saving final scores.
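A minimal way to capture those trajectories is append-only JSONL, one record per attempt. The record schema in the comment is illustrative, not prescribed by the repo.

```python
import json
from pathlib import Path

def append_trajectory(path: Path, record: dict) -> None:
    """Append one attempt record as a single JSON line."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")

# Example record (fields are illustrative):
# {"task_id": "task-1", "attempt": 1, "findings": [...],
#  "false_positives": 0, "hints_used": 0, "reward": 1.5}
```

JSONL keeps first, improved, and final attempts in order on disk and loads trivially into pandas or a training pipeline later.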
### D. Improve Reward Design Carefully
Current reward design is already decent.
Good refinements:
- slightly larger reward for critical security findings
- bonus for correct line numbers
- bonus for high-quality recommendation text
- penalty for vague findings with no rationale
Do not overcomplicate the reward before submission. Stability matters more.
## 5. Recommended Immediate Priority Order
If time is limited, do the work in this order:
1. `pytest`
2. `openenv validate`
3. local inference run
4. Docker build and run
5. HF Space deployment
6. add 5 to 10 more tasks
7. collect trajectory data
## 6. One-Sentence Summary
You are already following the correct OpenEnv architecture from the tutorials; the remaining work is not a redesign but validation, deployment verification, and expanding task and data quality so the environment scores well in human review.