# Hackathon Checklist
This file translates the tutorial folder into a concrete plan for `python_env`.
It is not a generic OpenEnv summary. It is a project-specific checklist showing:
- what the tutorials are teaching
- how this repo maps to those ideas
- what is already done
- what still needs to be finished before submission
## 1. What The Tutorials Mean For This Project
### Tutorial 1: OpenEnv Pattern
Main concept:
- every environment should follow a clean pattern:
  - typed models
  - environment logic
  - client
  - FastAPI/OpenEnv app
  - Docker packaging
How `python_env` maps:
- `models.py`: typed action/observation/config/evaluation models
- `server/code_review_environment.py`: environment logic
- `client.py`: Python client for reset/step/state
- `server/app.py`: OpenEnv app plus helper routes
- `server/Dockerfile`: container packaging
Status:
- done
What to keep in mind:
- do not break the OpenEnv contract while adding features
- treat models as the public interface
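The "models as the public interface" idea can be sketched as follows. This is a hypothetical shape, not the actual contents of `models.py`: the field names (`task_id`, `findings`, `request_hint`, `finalize`) are assumptions for illustration.

```python
from dataclasses import dataclass, field, asdict

# Hypothetical sketch of the typed action model; the real fields live in models.py.
@dataclass
class PythonReviewAction:
    task_id: str
    findings: list = field(default_factory=list)  # e.g. [{"category": "security", "line": 12}]
    request_hint: bool = False
    finalize: bool = False

def to_step_payload(action: PythonReviewAction) -> dict:
    """Serialize a typed action into the JSON body a /step request would carry."""
    return {"action": asdict(action)}

payload = to_step_payload(PythonReviewAction(task_id="task-1"))
```

Keeping serialization behind a typed model like this is what makes it safe to add features without breaking the OpenEnv contract.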
### Tutorial 2: Deployment
Main concept:
- local development first
- Docker second
- HF Spaces deployment third
- test `/health`, `/reset`, `/docs`, `/ws`
How `python_env` maps:
- local server: `uvicorn server.app:app --reload --host 0.0.0.0 --port 8000`
- Docker: `docker build -t python_env-env:latest -f server/Dockerfile .`
- Spaces: `openenv push`
Status:
- app boots locally
- Dockerfile exists and now supports `HOST`, `PORT`, `WORKERS`, `MAX_CONCURRENT_ENVS`
- live Docker build still needs final verification
- Spaces deployment still needs to be executed and checked
### Tutorial 3: Scaling
Main concept:
- OpenEnv works best with WebSocket sessions
- use environment class/factory instead of a singleton for OpenEnv session handling
- support concurrency with `MAX_CONCURRENT_ENVS`
How `python_env` maps:
- `create_app(PythonEnvironment, PythonReviewAction, PythonReviewObservation, max_concurrent_envs=...)`
- `MAX_CONCURRENT_ENVS` is now read from env vars
- Docker now exposes `MAX_CONCURRENT_ENVS`
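Reading `MAX_CONCURRENT_ENVS` robustly might look like the sketch below. The default of 4 and the helper name are assumptions, not the repo's actual code; the `create_app` call in the comment mirrors the signature quoted above.

```python
import os

def read_max_concurrent_envs(default: int = 4) -> int:
    """Parse MAX_CONCURRENT_ENVS from the environment, falling back on bad input."""
    raw = os.environ.get("MAX_CONCURRENT_ENVS", "")
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default

# app = create_app(PythonEnvironment, PythonReviewAction, PythonReviewObservation,
#                  max_concurrent_envs=read_max_concurrent_envs())
```

Falling back to a sane default keeps a typo in the Docker `-e` flag from crashing the container at boot.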
Status:
- partially done
Important caveat:
- OpenEnv `/reset` and `/step` use the class-based session model
- custom routes such as `/history` and `/config` still use a singleton helper instance
- this is acceptable for manual tooling, but it is not a perfect unified session model
Recommendation:
- keep it for now if your priority is submission
- refactor only if it starts causing testing confusion
### Tutorial 4: RL Training And Reward Design
Main concept:
- a good RL environment needs:
  - meaningful reward
  - repeated trajectories
  - enough task diversity
  - an inference/training loop
How `python_env` maps:
- reward shaping already exists:
  - matched rubric items
  - false-positive penalties
  - duplicate penalties
  - hint penalties
  - patch bonus
  - finalize bonus
- `inference.py` already provides a baseline model-vs-env loop
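The reward components listed above could combine along these lines. The weights here are invented for illustration; the grader's real coefficients live in the environment code.

```python
def shape_reward(matched: int, false_positives: int, duplicates: int,
                 hints_used: int, patched: bool, finalized: bool) -> float:
    """Illustrative reward shaping; weights are placeholders, not the repo's values."""
    reward = 1.0 * matched            # matched rubric items
    reward -= 0.5 * false_positives   # false-positive penalty
    reward -= 0.25 * duplicates       # duplicate penalty
    reward -= 0.1 * hints_used        # hint penalty
    if patched:
        reward += 0.5                 # patch bonus
    if finalized:
        reward += 0.25                # finalize bonus
    return reward
```

Keeping each component as a separate additive term makes it easy to ablate one signal at a time when tuning.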
Status:
- partially done
Gap:
- 3 tasks are enough for hackathon minimums
- 3 tasks are not enough for serious RL learning
## 2. Current Repo Status
### Strong Areas
- real-world task: code review
- typed Pydantic/OpenEnv models
- deterministic grader
- 3 difficulty levels
- partial-progress reward shaping
- manual routes for health/tasks/review/config/history
- baseline inference script
- docs in `README.md`, `Project.md`
### Weak Areas
- benchmark still small
- Docker image build not fully verified end-to-end
- HF Spaces deployment not yet executed
- `openenv validate` still needs to be run in your actual runtime
- no large trajectory dataset yet
- custom REST state and OpenEnv session state are not fully unified
## 3. What You Need To Do To Be Submission-Ready
### Step 1: Validate Local Server
Run:
```powershell
uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
```
Manually verify:
- `http://127.0.0.1:8000/docs`
- `http://127.0.0.1:8000/health`
- `POST /reset`
- `POST /step`
- `GET /tasks`
- `POST /review`
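The manual checks above can be scripted as a small smoke test. This is a stdlib-only sketch; the empty POST bodies are placeholders and the real endpoints may require richer payloads.

```python
import json
import urllib.request

BASE = "http://127.0.0.1:8000"

# (method, path, body) triples for the manual checks in this step.
SMOKE_CHECKS = [
    ("GET", "/health", None),
    ("GET", "/docs", None),
    ("POST", "/reset", {}),
    ("POST", "/step", {}),
    ("GET", "/tasks", None),
    ("POST", "/review", {}),
]

def run_smoke_checks(base: str = BASE) -> dict:
    """Hit each endpoint and record the HTTP status (or the error) per path."""
    results = {}
    for method, path, body in SMOKE_CHECKS:
        data = json.dumps(body).encode() if body is not None else None
        req = urllib.request.Request(base + path, data=data, method=method,
                                     headers={"Content-Type": "application/json"})
        try:
            with urllib.request.urlopen(req, timeout=5) as resp:
                results[path] = resp.status
        except Exception as exc:  # server down, 4xx/5xx, malformed body, etc.
            results[path] = repr(exc)
    return results
```

Run it with the server up and eyeball the dict; anything that is not a 200 needs a look before moving on to Docker.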
### Step 2: Run Tests
Run:
```powershell
python -m pytest tests -q
```
You want all tests green before Docker or HF deployment.
### Step 3: Run OpenEnv Validation
Run:
```powershell
openenv validate
```
This is a hard requirement.
If validation fails:
- fix schema mismatch first
- fix route mismatch second
- fix packaging third
### Step 4: Run Baseline Inference
Run:
```powershell
$env:API_BASE_URL="https://api.openai.com/v1"
$env:MODEL_NAME="gpt-4.1-mini"
$env:OPENAI_API_KEY="your_key"
$env:ENV_BASE_URL="http://127.0.0.1:8000"
python inference.py
```
You want:
- script completes without crashing
- `inference_results.json` gets written
- all 3 tasks run
- scores are reproducible
### Step 5: Verify Docker
Run:
```powershell
docker build -t python_env-env:latest -f server/Dockerfile .
docker run --rm -p 8000:8000 python_env-env:latest
```
Then test:
- `GET /health`
- `POST /reset`
- `POST /step`
### Step 6: Deploy To HF Spaces
Run:
```powershell
openenv push
```
Then verify the live Space:
- `/health`
- `/docs`
- `/reset`
- `/web`
## 4. What Will Help You “Win” Instead Of Just “Submit”
Passing minimum requirements is not enough. To be competitive, improve these areas:
### A. Increase Task Diversity
Current:
- 3 benchmark tasks
Target:
- at least 10 to 20 tasks before final submission, if time allows
Good additions:
- SQL injection review
- unsafe YAML/pickle loading
- file-handle leak
- race-condition style bug
- retry/backoff misuse
- caching bug
- logging/privacy leak
- API timeout handling
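One of the additions above, sketched as data. The field names (`task_id`, `difficulty`, `code`, `rubric`) are hypothetical, not the repo's actual task schema; adapt them to whatever the existing three tasks use.

```python
# Hypothetical shape for a new SQL-injection benchmark task.
sql_injection_task = {
    "task_id": "sql-injection-1",
    "difficulty": "medium",
    "code": (
        "def get_user(db, name):\n"
        "    return db.execute(\"SELECT * FROM users WHERE name = '\" + name + \"'\")\n"
    ),
    "rubric": [
        {
            "category": "security",
            "issue": "string-concatenated SQL enables injection",
            "fix": "use parameterized queries",
        },
    ],
}
```

Authoring new tasks as plain data like this keeps the deterministic grader unchanged while the benchmark grows.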
### B. Improve Observation Context
Good RL environments provide enough context for the model to improve.
Possible improvements:
- add matched categories so far
- add a short summary of uncovered issue types
- add previous actions in structured form, not just free text
- add rubric coverage signals without leaking exact answers
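A richer observation could be assembled along these lines. The helper and its field names are an assumption sketched for illustration, not the repo's `PythonReviewObservation`.

```python
def build_observation(matched_categories: list, total_rubric_items: int,
                      actions_so_far: list) -> dict:
    """Hypothetical richer observation: coverage signals without leaking answers."""
    covered = sorted(set(matched_categories))
    return {
        "matched_categories": covered,             # categories found so far
        "coverage": len(covered) / max(total_rubric_items, 1),
        "previous_actions": actions_so_far,        # structured, not free text
        "issues_remain": len(covered) < total_rubric_items,
    }
```

Note that the observation exposes *how much* is left, never *which* rubric items are missing, so the answer key stays hidden.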
### C. Collect Trajectories
You need data that shows:
- first attempt
- improved second attempt
- final attempt
- failures
- false positives
- hint usage
This is much more useful than only saving final scores.
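A minimal way to capture those trajectories is append-only JSONL, one record per attempt. The record schema in the comment is illustrative, not prescribed by the repo.

```python
import json
from pathlib import Path

def append_trajectory(path: Path, record: dict) -> None:
    """Append one attempt record as a single JSON line."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")

# Example record (fields are illustrative):
# {"task_id": "task-1", "attempt": 1, "findings": [...],
#  "false_positives": 0, "hints_used": 0, "reward": 1.5}
```

JSONL keeps first, improved, and final attempts in order on disk and loads trivially into pandas or a training pipeline later.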
### D. Improve Reward Design Carefully
Current reward design is already decent.
Good refinements:
- slightly larger reward for critical security findings
- bonus for correct line numbers
- bonus for high-quality recommendation text
- penalty for vague findings with no rationale
Do not overcomplicate the reward before submission. Stability matters more.
## 5. Recommended Immediate Priority Order
If time is limited, do the work in this order:
1. `pytest`
2. `openenv validate`
3. local inference run
4. Docker build and run
5. HF Space deployment
6. add 5 to 10 more tasks
7. collect trajectory data
## 6. One-Sentence Summary
You are already following the correct OpenEnv architecture from the tutorials; the remaining work is not a redesign but validation, deployment verification, and expanding task and data quality so the environment scores well in human review.