# Hackathon Checklist
This file translates the tutorial folder into a concrete plan for `python_env`.
It is not a generic OpenEnv summary. It is a project-specific checklist showing:
- what the tutorials are teaching
- how this repo maps to those ideas
- what is already done
- what still needs to be finished before submission
## 1. What The Tutorials Mean For This Project
### Tutorial 1: OpenEnv Pattern
Main concept:
- every environment should follow a clean pattern:
- typed models
- environment logic
- client
- FastAPI/OpenEnv app
- Docker packaging
How `python_env` maps:
- `models.py`
typed action/observation/config/evaluation models
- `server/code_review_environment.py`
environment logic
- `client.py`
Python client for reset/step/state
- `server/app.py`
OpenEnv app plus helper routes
- `server/Dockerfile`
container packaging
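The "typed models" layer above can be sketched minimally. The repo's `models.py` uses Pydantic; this stdlib-dataclass sketch only illustrates the shape, and every field name here is hypothetical, not the actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical shape of the action/observation pair in models.py.
# The real repo uses Pydantic models; field names here are illustrative only.
@dataclass
class ReviewAction:
    findings: list[str] = field(default_factory=list)  # issues the agent reports
    request_hint: bool = False                         # ask the env for a hint
    finalize: bool = False                             # end the review episode

@dataclass
class ReviewObservation:
    code: str = ""        # snippet under review
    feedback: str = ""    # grader feedback for the last action
    reward: float = 0.0
    done: bool = False
```

Keeping the action/observation types this small is what makes them safe to treat as the public interface.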
Status:
- done
What to keep in mind:
- do not break the OpenEnv contract while adding features
- treat models as the public interface
### Tutorial 2: Deployment
Main concept:
- local development first
- Docker second
- HF Spaces deployment third
- test `/health`, `/reset`, `/docs`, `/ws`
How `python_env` maps:
- local server:
`uvicorn server.app:app --reload --host 0.0.0.0 --port 8000`
- Docker:
`docker build -t python_env-env:latest -f server/Dockerfile .`
- Spaces:
`openenv push`
Status:
- app boots locally
- Dockerfile exists and now supports `HOST`, `PORT`, `WORKERS`, `MAX_CONCURRENT_ENVS`
- live Docker build still needs final verification
- Spaces deployment still needs to be executed and checked
### Tutorial 3: Scaling
Main concept:
- OpenEnv works best with WebSocket sessions
- use environment class/factory instead of a singleton for OpenEnv session handling
- support concurrency with `MAX_CONCURRENT_ENVS`
How `python_env` maps:
- `create_app(PythonEnvironment, PythonReviewAction, PythonReviewObservation, max_concurrent_envs=...)`
- `MAX_CONCURRENT_ENVS` is now read from env vars
- Docker now exposes `MAX_CONCURRENT_ENVS`
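Reading those settings from the environment can be sketched as below; the variable names mirror the Dockerfile, but the defaults are assumptions, not the repo's actual values:

```python
import os

# Sketch of how server/app.py might read its scaling settings.
# The env var names match the Dockerfile above; the defaults are placeholders.
def read_scaling_config() -> dict:
    return {
        "host": os.environ.get("HOST", "0.0.0.0"),
        "port": int(os.environ.get("PORT", "8000")),
        "workers": int(os.environ.get("WORKERS", "1")),
        "max_concurrent_envs": int(os.environ.get("MAX_CONCURRENT_ENVS", "4")),
    }
```

Centralizing this in one function keeps the Docker contract and the `create_app(...)` call from drifting apart.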
Status:
- partially done
Important caveat:
- OpenEnv `/reset` and `/step` use the class-based session model
- custom routes such as `/history` and `/config` still use a singleton helper instance
- this is acceptable for manual tooling, but it is not a perfect unified session model
Recommendation:
- keep it for now if your priority is submission
- refactor only if it starts causing testing confusion
### Tutorial 4: RL Training And Reward Design
Main concept:
- a good RL environment needs:
- meaningful reward
- repeated trajectories
- enough task diversity
- an inference/training loop
How `python_env` maps:
- reward shaping already exists:
- matched rubric items
- false-positive penalties
- duplicate penalties
- hint penalties
- patch bonus
- finalize bonus
- `inference.py` already provides a baseline model-vs-env loop
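The reward components listed above combine into a single scalar roughly like this; the weights below are placeholders for illustration, not the values in `server/code_review_environment.py`:

```python
# Hedged sketch of the reward shaping described above.
# All weights are placeholder values, not the repo's actual constants.
def compute_reward(matched: int, false_positives: int, duplicates: int,
                   hints_used: int, patch_ok: bool, finalized: bool) -> float:
    reward = 1.0 * matched            # matched rubric items
    reward -= 0.5 * false_positives   # false-positive penalty
    reward -= 0.25 * duplicates       # duplicate-finding penalty
    reward -= 0.1 * hints_used        # hint penalty
    if patch_ok:
        reward += 0.5                 # patch bonus
    if finalized:
        reward += 0.25                # finalize bonus
    return reward
```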
Status:
- partially done
Gap:
- 3 tasks are enough for hackathon minimums
- 3 tasks are not enough for serious RL learning
## 2. Current Repo Status
### Strong Areas
- real-world task: code review
- typed Pydantic/OpenEnv models
- deterministic grader
- 3 difficulty levels
- partial-progress reward shaping
- manual routes for health/tasks/review/config/history
- baseline inference script
- docs in `README.md`, `Project.md`
### Weak Areas
- benchmark still small
- Docker image build not fully verified end-to-end
- HF Spaces deployment not yet executed
- `openenv validate` still needs to be run in your actual runtime
- no large trajectory dataset yet
- custom REST state and OpenEnv session state are not fully unified
## 3. What You Need To Do To Be Submission-Ready
### Step 1: Validate Local Server
Run:
```powershell
uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
```
Manually verify:
- `http://127.0.0.1:8000/docs`
- `http://127.0.0.1:8000/health`
- `POST /reset`
- `POST /step`
- `GET /tasks`
- `POST /review`
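The manual checks above can be scripted with only the standard library. `BASE` and the empty `/reset` payload are assumptions to adapt to the actual request schema:

```python
import json
import urllib.request

# Manual smoke check against the local server started in Step 1.
# BASE and the POST payloads are assumptions; adjust to the real schema.
BASE = "http://127.0.0.1:8000"
GET_PATHS = ["/docs", "/health", "/tasks"]

def get_status(path: str) -> int:
    """Return the HTTP status for a GET against the local server."""
    with urllib.request.urlopen(BASE + path) as resp:
        return resp.status

def post_status(path: str, payload: dict) -> int:
    """POST a JSON payload and return the HTTP status."""
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

if __name__ == "__main__":
    for path in GET_PATHS:
        print(path, get_status(path))
    print("/reset", post_status("/reset", {}))
```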
### Step 2: Run Tests
Run:
```powershell
python -m pytest tests -q
```
You want all tests green before Docker or HF deployment.
### Step 3: Run OpenEnv Validation
Run:
```powershell
openenv validate
```
This is a hard requirement.
If validation fails:
- fix schema mismatch first
- fix route mismatch second
- fix packaging third
### Step 4: Run Baseline Inference
Run:
```powershell
$env:API_BASE_URL="https://api.openai.com/v1"
$env:MODEL_NAME="gpt-4.1-mini"
$env:OPENAI_API_KEY="your_key"
$env:ENV_BASE_URL="http://127.0.0.1:8000"
python inference.py
```
You want:
- script completes without crashing
- `inference_results.json` gets written
- all 3 tasks run
- scores are reproducible
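A quick sanity check on the written results can be automated. The expected keys (`task_id`, `score`) are assumptions about the schema of `inference_results.json`:

```python
# Hypothetical sanity check for inference_results.json.
# The required keys are assumptions about the file's schema.
def validate_results(results: list[dict], expected_tasks: int = 3) -> list[str]:
    """Return a list of problems found; an empty list means the run looks sane."""
    problems = []
    if len(results) != expected_tasks:
        problems.append(f"expected {expected_tasks} tasks, got {len(results)}")
    for i, entry in enumerate(results):
        for key in ("task_id", "score"):
            if key not in entry:
                problems.append(f"entry {i} missing {key!r}")
    return problems
```

Running this after each inference pass makes "scores are reproducible" a checkable claim rather than a visual inspection.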
### Step 5: Verify Docker
Run:
```powershell
docker build -t python_env-env:latest -f server/Dockerfile .
docker run --rm -p 8000:8000 python_env-env:latest
```
Then test:
- `GET /health`
- `POST /reset`
- `POST /step`
### Step 6: Deploy To HF Spaces
Run:
```powershell
openenv push
```
Then verify the live Space:
- `/health`
- `/docs`
- `/reset`
- `/web`
## 4. What Will Help You “Win” Instead Of Just “Submit”
Passing minimum requirements is not enough. To be competitive, improve these areas:
### A. Increase Task Diversity
Current:
- 3 benchmark tasks
Target:
- at least 10 tasks, ideally 20, before final submission
Good additions:
- SQL injection review
- unsafe YAML/pickle loading
- file-handle leak
- race-condition style bug
- retry/backoff misuse
- caching bug
- logging/privacy leak
- API timeout handling
### B. Improve Observation Context
Good RL environments provide enough context for the model to improve.
Possible improvements:
- add matched categories so far
- add a short summary of uncovered issue types
- add previous actions in structured form, not just free text
- add rubric coverage signals without leaking exact answers
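One way to expose coverage without leaking answers is to report counts and fractions instead of the rubric items themselves. The field names below are hypothetical, not the repo's observation schema:

```python
from dataclasses import dataclass, field

# Illustrative sketch of richer observation context; field names are
# hypothetical extensions, not the actual PythonReviewObservation schema.
@dataclass
class ReviewContext:
    matched_categories: list[str] = field(default_factory=list)
    uncovered_issue_count: int = 0                     # count only, no answers
    previous_actions: list[dict] = field(default_factory=list)

def coverage_signal(matched: int, total: int) -> float:
    """Fraction of rubric items found so far, exposed instead of the items."""
    return matched / total if total else 0.0
```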
### C. Collect Trajectories
You need data that shows:
- first attempt
- improved second attempt
- final attempt
- failures
- false positives
- hint usage
This is much more useful than only saving final scores.
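A simple JSONL log per step is enough to capture all of the above. This record format is a sketch, not an existing repo schema:

```python
import json
from dataclasses import dataclass, asdict

# Minimal trajectory record; fields mirror the checklist above and are
# assumptions, not an existing schema in the repo.
@dataclass
class TrajectoryStep:
    task_id: str
    attempt: int          # 1 = first attempt, 2 = improved, ...
    action: dict          # the raw action the model took
    reward: float
    false_positives: int = 0
    used_hint: bool = False

def append_step(path: str, step: TrajectoryStep) -> None:
    """Append one step as a JSON line so whole runs can be replayed later."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(step)) + "\n")
```

JSONL keeps the log append-only and streamable, which matters once runs grow past a handful of tasks.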
### D. Improve Reward Design Carefully
Current reward design is already decent.
Good refinements:
- slightly larger reward for critical security findings
- bonus for correct line numbers
- bonus for high-quality recommendation text
- penalty for vague findings with no rationale
Do not overcomplicate the reward before submission. Stability matters more.
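The refinements above stay simple if each finding is scored independently. The weight table and bonuses below are placeholders, not the grader's actual values:

```python
# Hedged sketch of severity-weighted per-finding scoring.
# The weight table and bonus/penalty values are placeholders.
SEVERITY_WEIGHTS = {"critical": 1.5, "major": 1.0, "minor": 0.5}

def finding_reward(severity: str, correct_line: bool, has_rationale: bool) -> float:
    reward = SEVERITY_WEIGHTS.get(severity, 0.5)
    if correct_line:
        reward += 0.1      # bonus for pinpointing the right line
    if not has_rationale:
        reward -= 0.2      # penalize vague findings with no reasoning
    return reward
```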
## 5. Recommended Immediate Priority Order
If time is limited, do the work in this order:
1. `pytest`
2. `openenv validate`
3. local inference run
4. Docker build and run
5. HF Space deployment
6. add 5 to 10 more tasks
7. collect trajectory data
## 6. One-Sentence Summary
You are already following the correct OpenEnv architecture from the tutorials; the main remaining work is not redesign but validation, deployment verification, and expanding task and data quality so the environment scores well in human review.