# Idea document and participant guide — implementation map
This document maps your polypharmacy / OpenEnv design notes, and typical hackathon submission requirements, to the files in this repository that implement them.
## Submission narrative (required bullets)
| Requirement | Status | Where |
| --- | --- | --- |
| Problem statement | Documented + implemented | Root [`README.md`](../../README.md), `polyguard-rl/README.md`, `docs/safety.md` |
| Environment (agent operates here) | Implemented | `PolyGuardEnv`, `app/env/env_core.py`, `app/env/fastapi_app.py`, `openenv.yaml`, `server/app.py` |
| Agent capabilities | Implemented | `app/agents/`, `docs/agents.md` |
| Tasks | Implemented | Scenario JSONL under `data/scenarios/`, presets in `app/env/catalog.py` |
| Reward / evaluation logic | Implemented | `app/env/reward_router.py`, `app/env/verifier.py`, `configs/rewards.yaml`, `docs/reward_design.md`, `docs/evaluation.md` |
| Post-training / self-improvement | Implemented | `scripts/train_sft_trl.py`, `scripts/train_grpo_trl.py`, `app/training/grpo_trl.py`, `docs/training.md` |
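
As a rough illustration of what a reward router of this kind might do, the sketch below sums weighted component scores into named channels. It is not the code in `app/env/reward_router.py`: the component names, channel names, weights, and the `route_rewards` helper are all hypothetical, chosen only to show the shape of the idea.

```python
from collections import defaultdict

# Hypothetical routing table: component name -> (channel, weight).
# The project's real components and weights are defined in the repo, not here.
ROUTES = {
    "ddi_avoided": ("safety", 1.0),
    "dose_in_range": ("safety", 0.5),
    "format_valid": ("format", 1.0),
    "steps_used": ("efficiency", -0.1),
}


def route_rewards(components: dict[str, float]) -> dict[str, float]:
    """Sum weighted component scores into their assigned channels."""
    channels: dict[str, float] = defaultdict(float)
    for name, score in components.items():
        if name not in ROUTES:
            # Fail loudly so an unrouted component cannot be silently ignored.
            raise KeyError(f"unrouted reward component: {name}")
        channel, weight = ROUTES[name]
        channels[channel] += weight * score
    return dict(channels)


scores = route_rewards(
    {"ddi_avoided": 1.0, "dose_in_range": 1.0, "format_valid": 1.0, "steps_used": 2.0}
)
```

Keeping the routing table as data (rather than branching in code) is one way to let a YAML file such as `configs/rewards.yaml` drive the weights, though whether the repository does so is an assumption here.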
## Your “Plan” sections vs codebase
| Plan item | Status | Notes |
| --- | --- | --- |
| OpenEnv `reset` / `step` / `state`, timeouts, safety | Done | `env_core.py`, `fastapi_app.py`, max steps per sub-env, `anti_cheat.py` |
| Local + remote execution | Done | Local FastAPI + `docker-compose.yml`, HF Space via `scripts/deploy_space_api.py`, `Dockerfile.space`, `docker/space/` |
| Specific envs: DDI, bandit mining, regimen risk | Done | `SubEnvironment` enum, transitions in `app/env/transition.py` |
| Precision dosing, deprescribing, web search, alternatives, new drug (hard) | Done | Matching enum values + scenario tracks; “new drug” is `NEW_DRUG_DECOMPOSITION` |
| Multiple reward functions + anti-hacking | Done | 13 components → 4 channels; anti-cheat and tests in `tests/` |
| TRL + Unsloth, metrics, generations | Done | TRL scripts + reports; Unsloth optional (`--use-unsloth`); `app/training/metrics.py` |
| Post-training + inference | Done | merge + `test_inference_postsave.py`, active manifest / API path |
| Product / Space demo, UI | Done | FastAPI `app/api/`, React `app/ui/frontend/`, Space deployment scripts |
| Benchmarks + plots + sample generations | Done | `scripts/evaluate_*.py`, `docs/results/`, `scripts/generate_submission_evidence.py` |
| Deploy: OpenEnv, container, HF Space | Done | See `docs/deployment.md` |
| Easy / medium / hard | Done | `scenarios_easy.jsonl`, `scenarios_medium.jsonl`, `scenarios_hard.jsonl` |
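
The `reset` / `step` / `state` row above, with its per-sub-environment step cap, can be illustrated with a toy environment. This is a minimal sketch, not the code in `env_core.py`: the class `ToyEnv`, its fields, and the constant zero reward are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEnv:
    """Hypothetical stand-in for an environment that enforces a step cap."""

    max_steps: int = 3
    steps: int = 0
    history: list[str] = field(default_factory=list)

    def reset(self) -> dict:
        # Start a fresh episode and return the initial observation.
        self.steps = 0
        self.history = []
        return self.state()

    def state(self) -> dict:
        # Return a copy so callers cannot mutate internal state.
        return {"steps": self.steps, "history": list(self.history)}

    def step(self, action: str) -> tuple[dict, float, bool]:
        if self.steps >= self.max_steps:
            raise RuntimeError("episode finished; call reset()")
        self.steps += 1
        self.history.append(action)
        done = self.steps >= self.max_steps  # the cap ends the episode
        return self.state(), 0.0, done
```

A caller would loop on `step` until `done` is true, then call `reset` before starting the next episode.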
## Themes (world modeling, multi-agent, self-improvement)
| Theme | Status | Notes |
| --- | --- | --- |
| World modeling / professional tasks | Primary fit | Stateful regimen, verifiers, tool-like actions |
| Multi-agent | Partial | Supervisor/orchestrator and policy stack (`app/agents/orchestrator.py`, `supervisor_agent.py`); not a separate multi-player env |
| Self-improving systems | Via GRPO | Environment-backed RLVR-style training, not online self-play |
## “What to submit” checklist
| Deliverable | Where |
| --- | --- |
| GitHub repo + URLs in README | Root + `polyguard-rl/README.md` |
| HF Space URL | In README |
| Points from doc | `docs/participant_guide_traceability.md`, this file |
| Colab | `PolyGuard_SFT_GRPO_One_Run_Runner.ipynb`, `notebooks/09_training_loop.ipynb` |
| Video or blog | README links to the blog; **publish** the draft in `docs/hf_blog_draft.md` or swap in the final URL |
## Future ideas from your notes (not claimed as done)
- Medicine images / barcodes: listed under Future Work in README.
- Web search agents: sub-env `WEB_SEARCH_MISSING_DATA` exists; “full web agent product” is beyond current scope.
## Fresh clone reminder
Generated data and many `outputs/` reports are produced by scripts (see `scripts/bootstrap_data.py` and `REQUIRED_ARTIFACTS` in `scripts/acceptance_gate.py`). Run the bootstrap/build pipeline first: on an empty workspace, strict acceptance with `POLYGUARD_ENFORCE_SUBMISSION_LINKS=true` is not expected to pass.
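
A pre-flight check along these lines can show which artifacts are still missing before strict mode is switched on. This is only a sketch under stated assumptions: `EXAMPLE_ARTIFACTS` is a made-up placeholder (the real list is `REQUIRED_ARTIFACTS` in `scripts/acceptance_gate.py`), and how the gate actually parses the environment variable is assumed, not confirmed.

```python
import os
from pathlib import Path

# Placeholder list for illustration only; not the project's real artifacts.
EXAMPLE_ARTIFACTS = ["outputs/example_report.json"]


def missing_artifacts(root: Path, required: list[str]) -> list[str]:
    """Return the required relative paths that do not exist under root."""
    return [rel for rel in required if not (root / rel).exists()]


def strict_mode_enabled() -> bool:
    # Assumption: the flag is treated as on when it equals "true" (any case).
    value = os.environ.get("POLYGUARD_ENFORCE_SUBMISSION_LINKS", "")
    return value.strip().lower() == "true"


if __name__ == "__main__":
    missing = missing_artifacts(Path.cwd(), EXAMPLE_ARTIFACTS)
    if strict_mode_enabled() and missing:
        print("Run the bootstrap/build pipeline first; missing:", missing)
```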