# Hackathon Checklist

This file translates the tutorial folder into a concrete plan for `python_env`.

It is not a generic OpenEnv summary. It is a project-specific checklist showing:

- what the tutorials are teaching
- how this repo maps to those ideas
- what is already done
- what still needs to be finished before submission

## 1. What The Tutorials Mean For This Project

### Tutorial 1: OpenEnv Pattern

Main concept:

- every environment should follow a clean pattern:
  - typed models
  - environment logic
  - client
  - FastAPI/OpenEnv app
  - Docker packaging

How `python_env` maps:

- `models.py`:
  typed action/observation/config/evaluation models
- `server/code_review_environment.py`:
  environment logic
- `client.py`:
  Python client for reset/step/state
- `server/app.py`:
  OpenEnv app plus helper routes
- `server/Dockerfile`:
  container packaging

Status:

- done

What to keep in mind:

- do not break the OpenEnv contract while adding features
- treat models as the public interface

### Tutorial 2: Deployment

Main concept:

- local development first
- Docker second
- HF Spaces deployment third
- test `/health`, `/reset`, `/docs`, `/ws`

How `python_env` maps:

- local server:
  `uvicorn server.app:app --reload --host 0.0.0.0 --port 8000`
- Docker:
  `docker build -t python_env-env:latest -f server/Dockerfile .`
- Spaces:
  `openenv push`

Status:

- app boots locally
- Dockerfile exists and now supports `HOST`, `PORT`, `WORKERS`, `MAX_CONCURRENT_ENVS`
- live Docker build still needs final verification
- Spaces deployment still needs to be executed and checked

### Tutorial 3: Scaling

Main concept:

- OpenEnv works best with WebSocket sessions
- use environment class/factory instead of a singleton for OpenEnv session handling
- support concurrency with `MAX_CONCURRENT_ENVS`
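
The class/factory idea can be illustrated with a toy registry. This is a minimal sketch, not OpenEnv's actual session handling: `ToyEnv`, `SessionRegistry`, and their methods are hypothetical names used only to show why one instance per session differs from a singleton.

```python
from typing import Callable, Dict


class ToyEnv:
    """Hypothetical stand-in for an environment class; holds per-session state."""

    def __init__(self) -> None:
        self.steps = 0

    def step(self) -> int:
        self.steps += 1
        return self.steps


class SessionRegistry:
    """Creates one environment per session, up to a fixed cap."""

    def __init__(self, factory: Callable[[], ToyEnv], max_concurrent_envs: int) -> None:
        self._factory = factory
        self._max = max_concurrent_envs
        self._envs: Dict[str, ToyEnv] = {}

    def get(self, session_id: str) -> ToyEnv:
        # Reuse the environment for a known session, otherwise build a new one.
        if session_id not in self._envs:
            if len(self._envs) >= self._max:
                raise RuntimeError("too many concurrent environments")
            self._envs[session_id] = self._factory()
        return self._envs[session_id]

    def close(self, session_id: str) -> None:
        # Drop the session so its slot can be reused.
        self._envs.pop(session_id, None)


registry = SessionRegistry(ToyEnv, max_concurrent_envs=2)
registry.get("a").step()
registry.get("a").step()
registry.get("b").step()
# Sessions "a" and "b" keep separate step counts; a singleton would share one.
```

Passing the class (or any zero-argument factory) rather than an instance is what lets each session start from clean state.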

How `python_env` maps:

- `create_app(PythonEnvironment, PythonReviewAction, PythonReviewObservation, max_concurrent_envs=...)`
- `MAX_CONCURRENT_ENVS` is now read from env vars
- Docker now exposes `MAX_CONCURRENT_ENVS`
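
Reading the limit from the environment can be sketched as below. This is an assumption about how the value might be parsed, not the repo's actual code: `read_max_concurrent_envs` is a hypothetical helper, and the default of `1` is a placeholder.

```python
import os
from typing import Mapping


def read_max_concurrent_envs(environ: Mapping[str, str] = os.environ, default: int = 1) -> int:
    """Parse MAX_CONCURRENT_ENVS, falling back to a default when unset or blank."""
    raw = environ.get("MAX_CONCURRENT_ENVS", "").strip()
    if not raw:
        return default
    value = int(raw)  # raises ValueError on non-numeric input
    if value < 1:
        raise ValueError("MAX_CONCURRENT_ENVS must be at least 1")
    return value
```

Failing loudly on a bad value is a design choice here: a silently ignored typo in a Docker `-e` flag is harder to debug than a startup error.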

Status:

- partially done

Important caveat:

- OpenEnv `/reset` and `/step` use the class-based session model
- custom routes such as `/history` and `/config` still use a singleton helper instance
- this is acceptable for manual tooling, but it means the session model is not fully unified

Recommendation:

- keep the singleton helper for now if your priority is submission
- refactor only if the split state starts to confuse testing

### Tutorial 4: RL Training And Reward Design

Main concept:

- a good RL environment needs:
  - meaningful reward
  - repeated trajectories
  - enough task diversity
  - an inference/training loop

How `python_env` maps:

- reward shaping already exists:
  - matched rubric items
  - false-positive penalties
  - duplicate penalties
  - hint penalties
  - patch bonus
  - finalize bonus
- `inference.py` already provides a baseline model-vs-env loop
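
The six shaping terms listed above could combine roughly as follows. This is a sketch under stated assumptions, not the grader's actual formula: `RewardWeights`, `shaped_reward`, and every weight value are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical weights for illustration; not the repo's actual values."""

    match: float = 1.0
    false_positive: float = 0.25
    duplicate: float = 0.1
    hint: float = 0.2
    patch_bonus: float = 0.5
    finalize_bonus: float = 0.5


def shaped_reward(
    matched: int,
    false_positives: int,
    duplicates: int,
    hints_used: int,
    patch_accepted: bool,
    finalized: bool,
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Combine the six shaping terms into one scalar reward."""
    reward = weights.match * matched
    reward -= weights.false_positive * false_positives
    reward -= weights.duplicate * duplicates
    reward -= weights.hint * hints_used
    if patch_accepted:
        reward += weights.patch_bonus
    if finalized:
        reward += weights.finalize_bonus
    return reward
```

Keeping the weights in one frozen object makes it easy to log them next to each run, which helps when comparing scores later.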

Status:

- partially done

Gap:

- 3 tasks are enough for hackathon minimums
- 3 tasks are not enough for serious RL learning

## 2. Current Repo Status

### Strong Areas

- real-world task: code review
- typed Pydantic/OpenEnv models
- deterministic grader
- 3 difficulty levels
- partial-progress reward shaping
- manual routes for health/tasks/review/config/history
- baseline inference script
- docs in `README.md`, `Project.md`

### Weak Areas

- benchmark still small
- Docker image build not fully verified end-to-end
- HF Spaces deployment not yet executed
- `openenv validate` still needs to be run in your actual runtime
- no large trajectory dataset yet
- custom REST state and OpenEnv session state are not fully unified

## 3. What You Need To Do To Be Submission-Ready

### Step 1: Validate Local Server

Run:

```powershell
uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
```

Manually verify:

- `http://127.0.0.1:8000/docs`
- `http://127.0.0.1:8000/health`
- `POST /reset`
- `POST /step`
- `GET /tasks`
- `POST /review`
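
Part of this manual check can be scripted. The sketch below uses only the standard library and assumes that `POST /reset` accepts an empty JSON object, which may not match the real schema. `/step` and `/review` are left out because their payloads depend on the models in `models.py`. `CHECKS`, `http_status`, and `smoke_check` are hypothetical names.

```python
import json
import urllib.error
import urllib.request
from typing import Dict, Optional

# (method, path, JSON body). The empty reset payload is an assumption.
CHECKS = [
    ("GET", "/health", None),
    ("GET", "/docs", None),
    ("GET", "/tasks", None),
    ("POST", "/reset", {}),
]


def http_status(method: str, url: str, body: Optional[dict]) -> int:
    """Return the HTTP status code for one request, including 4xx/5xx codes."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    request = urllib.request.Request(url, data=data, headers=headers, method=method)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code


def smoke_check(base_url: str, fetch=http_status) -> Dict[str, int]:
    """Hit each endpoint once and map "METHOD /path" to its status code."""
    results = {}
    for method, path, body in CHECKS:
        results[f"{method} {path}"] = fetch(method, base_url.rstrip("/") + path, body)
    return results


# Example, with the local server running:
# print(smoke_check("http://127.0.0.1:8000"))
```

A connection failure raises `urllib.error.URLError` rather than returning a code, so a server that is not running shows up as an exception.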

### Step 2: Run Tests

Run:

```powershell
python -m pytest tests -q
```

You want all tests green before Docker or HF deployment.

### Step 3: Run OpenEnv Validation

Run:

```powershell
openenv validate
```

This is a hard requirement.

If validation fails:

- fix schema mismatch first
- fix route mismatch second
- fix packaging third

### Step 4: Run Baseline Inference

Run:

```powershell
$env:API_BASE_URL="https://api.openai.com/v1"
$env:MODEL_NAME="gpt-4.1-mini"
$env:OPENAI_API_KEY="your_key"
$env:ENV_BASE_URL="http://127.0.0.1:8000"
python inference.py
```

You want:

- script completes without crashing
- `inference_results.json` gets written
- all 3 tasks run
- scores are reproducible across repeated runs
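
Reproducibility can be checked by saving the output of two runs and comparing them. The sketch below ASSUMES that `inference_results.json` is a flat JSON object mapping task id to a numeric score. The real file layout may differ, so adjust `load_scores` accordingly. Both function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List


def load_scores(path: str) -> Dict[str, float]:
    """Load scores, ASSUMING a flat JSON object of task id -> score."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return {str(task): float(score) for task, score in data.items()}


def score_drift(first: Dict[str, float], second: Dict[str, float], tol: float = 1e-9) -> List[str]:
    """Return task ids missing from one run or whose scores differ by more than tol."""
    drifted = []
    for task in sorted(set(first) | set(second)):
        if task not in first or task not in second:
            drifted.append(task)
        elif abs(first[task] - second[task]) > tol:
            drifted.append(task)
    return drifted
```

An empty list from `score_drift` means both runs covered the same tasks with matching scores.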

### Step 5: Verify Docker

Run:

```powershell
docker build -t python_env-env:latest -f server/Dockerfile .
docker run --rm -p 8000:8000 python_env-env:latest
```

Then test:

- `GET /health`
- `POST /reset`
- `POST /step`

### Step 6: Deploy To HF Spaces

Run:

```powershell
openenv push
```

Then verify the live Space:

- `/health`
- `/docs`
- `/reset`
- `/web`

## 4. What Will Help You “Win” Instead Of Just “Submit”

Passing minimum requirements is not enough. To be competitive, improve these areas:

### A. Increase Task Diversity

Current:

- 3 benchmark tasks

Target:

- 10 to 20 tasks before final submission, if possible

Good additions:

- SQL injection review
- unsafe YAML/pickle loading
- file-handle leak
- race-condition style bug
- retry/backoff misuse
- caching bug
- logging/privacy leak
- API timeout handling

### B. Improve Observation Context

Good RL environments provide enough context for the model to improve.

Possible improvements:

- add matched categories so far
- add a short summary of uncovered issue types
- add previous actions in structured form, not just free text
- add rubric coverage signals without leaking exact answers
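
A coverage signal that avoids leaking answers could report per-category counts only. This is a minimal sketch: `coverage_summary` is a hypothetical helper, and it assumes each rubric item carries a category label such as `security` or `style`.

```python
from collections import Counter
from typing import Dict, Iterable


def coverage_summary(
    rubric_categories: Iterable[str], matched_categories: Iterable[str]
) -> Dict[str, Dict[str, int]]:
    """Per-category counts of matched and remaining rubric items, without item text."""
    total = Counter(rubric_categories)
    matched = Counter(matched_categories)
    summary = {}
    for category in sorted(total):
        # Cap at the rubric total so over-reporting cannot go negative.
        found = min(matched[category], total[category])
        summary[category] = {"matched": found, "remaining": total[category] - found}
    return summary
```

If exact remaining counts turn out to be too strong a hint, the same structure can be reduced to a boolean per category.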

### C. Collect Trajectories

You need data that shows:

- first attempt
- improved second attempt
- final attempt
- failures
- false positives
- hint usage

This is much more useful than only saving final scores.
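
One simple way to capture this is one JSON line per step. The sketch below is an assumption about a workable format, not something the repo already does: `log_step` and the record fields are hypothetical.

```python
import json
from typing import IO, Any, Dict


def log_step(
    stream: IO[str],
    episode_id: str,
    step: int,
    action: Dict[str, Any],
    reward: float,
    done: bool,
) -> None:
    """Append one environment step to the stream as a single JSON line."""
    record = {
        "episode_id": episode_id,
        "step": step,
        "action": action,
        "reward": reward,
        "done": done,
    }
    stream.write(json.dumps(record, sort_keys=True) + "\n")


# Example:
# with open("trajectories.jsonl", "a", encoding="utf-8") as fh:
#     log_step(fh, "episode-1", 0, {"type": "finding"}, 0.5, False)
```

JSON Lines keeps each step independently parseable, so a crashed run still leaves usable data.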

### D. Improve Reward Design Carefully

The current reward design is already decent.

Good refinements:

- slightly larger reward for critical security findings
- bonus for correct line numbers
- bonus for high-quality recommendation text
- penalty for vague findings with no rationale

Do not overcomplicate the reward before submission. Stability matters more.
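
Three of the refinements above fit in a few lines. This is a sketch only: `SEVERITY_WEIGHT`, `finding_reward`, and all numeric values are hypothetical. Scoring recommendation quality is left out because it needs a text-quality measure that this sketch does not define.

```python
# Hypothetical severity weights; critical findings earn slightly more.
SEVERITY_WEIGHT = {"critical": 1.5, "major": 1.0, "minor": 0.5}


def finding_reward(
    severity: str,
    line_correct: bool,
    has_rationale: bool,
    line_bonus: float = 0.1,
    vague_penalty: float = 0.2,
) -> float:
    """Reward for one matched finding, adjusted for line accuracy and rationale."""
    reward = SEVERITY_WEIGHT.get(severity, 1.0)  # unknown severities get the base weight
    if line_correct:
        reward += line_bonus
    if not has_rationale:
        reward -= vague_penalty
    return reward
```

Small adjustments like these keep the overall reward scale close to the existing one, which fits the stability advice above.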

## 5. Recommended Immediate Priority Order

If time is limited, do the work in this order:

1. `pytest`
2. `openenv validate`
3. local inference run
4. Docker build and run
5. HF Space deployment
6. add 5 to 10 more tasks
7. collect trajectory data

## 6. One-Sentence Summary

You are already following the correct OpenEnv architecture from the tutorials; the main remaining work is not redesign but validation, deployment verification, and improving task coverage and data quality so the environment scores well in human review.