# Idea document and participant guide: implementation map

This document maps your polypharmacy / OpenEnv design notes, and typical hackathon submission requirements, to the files in this repository that implement them.

## Submission narrative (required bullets)

| Requirement | Status | Where |
| --- | --- | --- |
| Problem statement | Documented + implemented | Root [`README.md`](../../README.md), `polyguard-rl/README.md`, `docs/safety.md` |
| Environment (agent operates here) | Implemented | `PolyGuardEnv`, `app/env/env_core.py`, `app/env/fastapi_app.py`, `openenv.yaml`, `server/app.py` |
| Agent capabilities | Implemented | `app/agents/`, `docs/agents.md` |
| Tasks | Implemented | Scenario JSONL under `data/scenarios/`, presets in `app/env/catalog.py` |
| Reward / evaluation logic | Implemented | `app/env/reward_router.py`, `app/env/verifier.py`, `configs/rewards.yaml`, `docs/reward_design.md`, `docs/evaluation.md` |
| Post-training / self-improvement | Implemented | `scripts/train_sft_trl.py`, `scripts/train_grpo_trl.py`, `app/training/grpo_trl.py`, `docs/training.md` |
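
The "Where" column can be spot-checked mechanically. The following is a minimal sketch, not part of the repository: the helper names `referenced_paths` and `missing_paths` are hypothetical, and the rule for deciding what counts as a path (a slash or a file extension) is an assumption.

```python
import re
from pathlib import Path

# Matches text between single backticks, e.g. `docs/safety.md`.
BACKTICK = re.compile(r"`([^`]+)`")


def referenced_paths(markdown: str) -> list[str]:
    """Return backticked tokens that look like file or directory paths."""
    tokens = BACKTICK.findall(markdown)
    # Heuristic (an assumption): a path contains a slash or ends in an extension.
    # Class names such as `PolyGuardEnv` are skipped by this rule.
    return [t for t in tokens if "/" in t or re.search(r"\.\w+$", t)]


def missing_paths(markdown: str, root: Path) -> list[str]:
    """List referenced paths that do not exist under root."""
    return [p for p in referenced_paths(markdown) if not (root / p).exists()]


if __name__ == "__main__":
    row = "| Agent capabilities | Implemented | `app/agents/`, `docs/agents.md` |"
    print(referenced_paths(row))
```

Entries written as bare filenames (for example `env_core.py`) or as globs (for example `scripts/evaluate_*.py`) will not resolve with this check and would need separate handling.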

## Your “Plan” sections vs codebase

| Plan item | Status | Notes |
| --- | --- | --- |
| OpenEnv `reset` / `step` / `state`, timeouts, safety | Done | `env_core.py`, `fastapi_app.py`, max steps per sub-env, `anti_cheat.py` |
| Local + remote execution | Done | Local FastAPI + `docker-compose.yml`, HF Space via `scripts/deploy_space_api.py`, `Dockerfile.space`, `docker/space/` |
| Specific envs: DDI, bandit mining, regimen risk | Done | `SubEnvironment` enum, transitions in `app/env/transition.py` |
| Precision dosing, deprescribing, web search, alternatives, new drug (hard) | Done | Matching enum values + scenario tracks; “new drug” is `NEW_DRUG_DECOMPOSITION` |
| Multiple reward functions + anti-hacking | Done | 13 components → 4 channels; anti-cheat and tests in `tests/` |
| TRL + Unsloth, metrics, generations | Done | TRL scripts + reports; Unsloth optional (`--use-unsloth`); `app/training/metrics.py` |
| Post-training + inference | Done | merge + `test_inference_postsave.py`, active manifest / API path |
| Product / Space demo, UI | Done | FastAPI `app/api/`, React `app/ui/frontend/`, Space deployment scripts |
| Benchmarks + plots + sample generations | Done | `scripts/evaluate_*.py`, `docs/results/`, `scripts/generate_submission_evidence.py` |
| Deploy: OpenEnv, container, HF Space | Done | See `docs/deployment.md` |
| Easy / medium / hard | Done | `scenarios_easy.jsonl`, `scenarios_medium.jsonl`, `scenarios_hard.jsonl` |
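
To illustrate the first row (a `reset` / `step` / `state` loop with a step cap per sub-environment), here is a toy sketch. It is not the code in `env_core.py` and does not use the OpenEnv API: `ToyEnv`, the `MAX_STEPS` values, the enum string values, and the zero reward are all hypothetical. Only the two enum member names are taken from this document.

```python
from dataclasses import dataclass, field
from enum import Enum


class SubEnvironment(Enum):
    # Only these two member names appear in this document; the real enum
    # has more members, and these string values are made up.
    WEB_SEARCH_MISSING_DATA = "web_search_missing_data"
    NEW_DRUG_DECOMPOSITION = "new_drug_decomposition"


# Hypothetical step caps; the real limits live in the repository.
MAX_STEPS = {
    SubEnvironment.WEB_SEARCH_MISSING_DATA: 3,
    SubEnvironment.NEW_DRUG_DECOMPOSITION: 5,
}


@dataclass
class ToyEnv:
    sub_env: SubEnvironment
    steps_taken: int = 0
    history: list[str] = field(default_factory=list)

    def reset(self) -> dict:
        """Start a new episode and return the initial state."""
        self.steps_taken = 0
        self.history = []
        return self.state()

    def step(self, action: str) -> tuple[dict, float, bool]:
        """Apply one action; the episode ends when the step cap is reached."""
        cap = MAX_STEPS[self.sub_env]
        if self.steps_taken >= cap:
            raise RuntimeError("episode is over; call reset()")
        self.steps_taken += 1
        self.history.append(action)
        done = self.steps_taken >= cap
        reward = 0.0  # placeholder; real rewards come from the reward logic
        return self.state(), reward, done

    def state(self) -> dict:
        """Return a copy of the observable state."""
        return {
            "sub_env": self.sub_env.name,
            "steps_taken": self.steps_taken,
            "history": list(self.history),
        }
```

The cap makes every episode terminate even if the agent never emits a finishing action, which is one simple way to bound rollouts during training.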

## Themes (world modeling, multi-agent, self-improvement)

| Theme | Status | Notes |
| --- | --- | --- |
| World modeling / professional tasks | Primary fit | Stateful regimen, verifiers, tool-like actions |
| Multi-agent | Partial | Supervisor/orchestrator and policy stack (`app/agents/orchestrator.py`, `supervisor_agent.py`); not a separate multi-player env |
| Self-improving systems | Via GRPO | Environment-backed RLVR-style training, not online self-play |

## “What to submit” checklist

| Deliverable | Where |
| --- | --- |
| GitHub repo + URLs in README | Root + `polyguard-rl/README.md` |
| HF Space URL | In README |
| Points from doc | `docs/participant_guide_traceability.md`, this file |
| Colab | `PolyGuard_SFT_GRPO_One_Run_Runner.ipynb`, `notebooks/09_training_loop.ipynb` |
| Video or blog | README links a blog post; publish the draft in `docs/hf_blog_draft.md` or swap in the final URL |

## Future ideas from your notes (not claimed as done)

- Medicine images / barcodes: listed under Future Work in README.
- Web search agents: sub-env `WEB_SEARCH_MISSING_DATA` exists; “full web agent product” is beyond current scope.

## Fresh clone reminder

Generated data and many `outputs/` reports are produced by scripts (see `scripts/bootstrap_data.py` and `REQUIRED_ARTIFACTS` in `scripts/acceptance_gate.py`). On a fresh clone, run the bootstrap/build pipeline first; strict acceptance with `POLYGUARD_ENFORCE_SUBMISSION_LINKS=true` is not expected to pass on an empty workspace.
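
A pre-flight check along these lines can show why strict acceptance fails on an empty workspace. This is a sketch under assumptions, not the logic of `scripts/acceptance_gate.py`: the function names are hypothetical, the artifact list is passed in by the caller rather than read from `REQUIRED_ARTIFACTS`, and treating only the value `true` as strict is a guess at how the flag is parsed.

```python
import os
from pathlib import Path


def strict_mode_enabled(env=None) -> bool:
    """Return True when the strict flag is set to 'true' (assumed parsing)."""
    env = os.environ if env is None else env
    value = env.get("POLYGUARD_ENFORCE_SUBMISSION_LINKS", "")
    return value.strip().lower() == "true"


def missing_artifacts(required, root) -> list[str]:
    """List required artifact paths that do not exist under root."""
    return [p for p in required if not (Path(root) / p).exists()]


def preflight(required, root, env=None) -> bool:
    """Fail only when artifacts are missing and strict mode is on."""
    missing = missing_artifacts(required, root)
    for path in missing:
        print(f"missing artifact: {path}")
    return not (missing and strict_mode_enabled(env))
```

With strict mode off, the check reports missing files but still passes, which matches the intent of running the bootstrap/build pipeline before switching the flag on.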
