PolyGuard (OpenEnv implementation package)
Run all CLI commands from this directory (cd polyguard-rl). The repository root README.md carries the same submission narrative with paths adjusted for viewers landing on the GitHub repo home page.
Submission Links
- GitHub Repo URL: https://github.com/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK
- HF Space URL: https://huggingface.co/spaces/TheJackBright/polyguard-openenv-workbench
- Colab Notebook URL: https://colab.research.google.com/github/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK/blob/master/polyguard-rl/PolyGuard_SFT_GRPO_One_Run_Runner.ipynb (see also notebooks/09_training_loop.ipynb for a modular training walkthrough)
- YouTube Video URL: not used for this submission; the repository root README is the story artifact.
- Story artifact: the repository root README.md is the final blog-style narrative and evidence map.
Shared Environment, Logs, And Scripts
The required environment files, training logs, and training scripts are shared in the repo and indexed in Submission Artifact Index.
- Environment/runtime: openenv.yaml, pyproject.toml, uv.lock, requirements*.txt, Dockerfile*, app/env/, server/app.py, and app/hf_space/Dockerfile.
- Training scripts/notebooks: PolyGuard_SFT_GRPO_One_Run_Runner.ipynb, notebooks/09_training_loop.ipynb, scripts/train_sft_trl.py, scripts/train_grpo_trl.py, scripts/deploy_training_space.py, app/hf_space/training_runner.py, and app/training/.
- Training logs/results: docs/results/final_submission_evidence/reports/, docs/results/sweeps/, docs/results/submission_evidence_qwen_0_5b_1_5b_3b/reports/, and docs/results/qwen_completed_runs/reports/.
- Final downloadable artifact Space: https://huggingface.co/spaces/adithya9903/polyguard-openenv-final-artifacts.
Problem Statement
Polypharmacy decisions are long-horizon, partially observable, and safety-critical. PolyGuard is a research environment where an LLM agent selects constrained clinical actions, receives verifier-backed reward, and improves via SFT + GRPO rather than generic open-ended chat fine-tuning.
Environment
PolyGuardEnv exposes OpenEnv-style HTTP/WebSocket endpoints (/reset, /step, /state, /metadata, /schema, /mcp, /health, /ws). Sub-environments include DDI, bandit mining, regimen risk, precision dosing, longitudinal deprescribing, web-search missing data, alternative suggestion, and new-drug decomposition. See openenv.yaml, app/env/env_core.py, app/env/fastapi_app.py, and docs/environment_design.md.
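As a rough illustration of how a client might address the HTTP endpoints listed above, the sketch below builds requests with the Python standard library. The endpoint paths come from this README; the base URL, the HTTP methods (GET without a body, POST with one), and the payload shapes are assumptions, so check openenv.yaml and app/env/fastapi_app.py for the actual schema.

```python
import json
from urllib.request import Request, urlopen

# Paths as listed in this README. The methods used below are assumptions.
ENDPOINTS = {
    "reset": "/reset",
    "step": "/step",
    "state": "/state",
    "metadata": "/metadata",
    "schema": "/schema",
    "health": "/health",
}


def build_request(base_url, name, payload=None):
    """Build (but do not send) a request for one of the listed endpoints.

    Assumption: endpoints accept JSON bodies via POST and plain GET otherwise.
    """
    url = base_url.rstrip("/") + ENDPOINTS[name]
    if payload is None:
        return Request(url, method="GET")
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return Request(url, data=body, headers=headers, method="POST")


def send(request, timeout=10.0):
    """Send a prepared request and decode a JSON response body."""
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def demo(base_url="http://localhost:8000"):
    """Hypothetical episode start; base URL and payloads are placeholders."""
    print(send(build_request(base_url, "health")))
    print(send(build_request(base_url, "reset", {})))
```

Calling `demo()` requires a running server. The real action payload for /step is defined by the environment schema, which the /schema endpoint is presumably meant to expose.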
Agent Capabilities
Medication reconciliation, evidence retrieval, graph safety, dosing guardrails, candidate generation, supervisor routing, planner/critic stack, explanations, and contextual bandit ranking for ablations (app/agents/, docs/agents.md).
Tasks
DDI risk reduction, safe adds/substitutions, regimen optimization, taper/deprescribing sequences, precision dosing, missing-data recovery, and new-drug decomposition (data/scenarios/, app/env/catalog.py).
Reward Model / Evaluation Logic
Thirteen verifier-backed reward components roll up into four primary channels (safety_legality, clinical_improvement, dosing_quality, process_integrity), clamped to [0.001, 0.999], with anti-cheat and timeout logic (app/env/reward_router.py, app/env/anti_cheat.py, docs/reward_design.md).
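The roll-up described above can be sketched as follows. Only the four channel names and the [0.001, 0.999] bounds come from this README; the component names, the component-to-channel mapping, and the use of a plain mean are hypothetical stand-ins for whatever app/env/reward_router.py actually does.

```python
REWARD_MIN, REWARD_MAX = 0.001, 0.999

# Channel names as given in this README.
CHANNELS = (
    "safety_legality",
    "clinical_improvement",
    "dosing_quality",
    "process_integrity",
)


def clamp(value, low=REWARD_MIN, high=REWARD_MAX):
    """Keep a score inside the closed interval [low, high]."""
    return max(low, min(high, value))


def roll_up(component_scores, component_to_channel):
    """Group component scores by channel and clamp each channel score.

    Assumption: a channel score is the mean of its components, and a
    channel with no components falls back to the lower bound.
    """
    buckets = {channel: [] for channel in CHANNELS}
    for name, score in component_scores.items():
        buckets[component_to_channel[name]].append(score)
    result = {}
    for channel, scores in buckets.items():
        if scores:
            result[channel] = clamp(sum(scores) / len(scores))
        else:
            result[channel] = REWARD_MIN
    return result


# Hypothetical component names, for illustration only.
example_mapping = {
    "ddi_check": "safety_legality",
    "legal_action": "safety_legality",
    "dose_range": "dosing_quality",
}
example_scores = {"ddi_check": 1.2, "legal_action": 0.8, "dose_range": -0.5}
example_channels = roll_up(example_scores, example_mapping)
```

Clamping to an open-ended range just inside (0, 1) keeps every reward strictly positive and strictly below 1. Whether the project clamps per channel, on the final scalar, or both is not stated here.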
Training And Post-Training Strategy
Build corpora (scripts/bootstrap_data.py, scripts/build_training_corpus.py), SFT with TRL (scripts/train_sft_trl.py), GRPO with environment reward (scripts/train_grpo_trl.py), merge adapters (scripts/merge_adapters_safe.py), validate inference (scripts/test_inference_postsave.py), evaluate and plot (scripts/evaluate_*.py, docs/results/). Optional HF GPU training uses scripts/deploy_training_space.py; public review should start with the repository root README.md, then docs/training.md for implementation notes.
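The stage order above can be captured in a small dry-run planner. This is a sketch only: the script paths are those named in this README, but each script very likely needs arguments that are not shown here, so the commands are printed without flags and nothing is executed unless `dry_run=False`. The evaluation step is omitted because scripts/evaluate_*.py is a glob rather than a single script.

```python
import subprocess
import sys

# Stage order as described in this README. No flags are included because
# the real command-line arguments are not documented in this section.
PIPELINE = [
    ("bootstrap data", "scripts/bootstrap_data.py"),
    ("build training corpus", "scripts/build_training_corpus.py"),
    ("SFT with TRL", "scripts/train_sft_trl.py"),
    ("GRPO with environment reward", "scripts/train_grpo_trl.py"),
    ("merge adapters", "scripts/merge_adapters_safe.py"),
    ("validate inference", "scripts/test_inference_postsave.py"),
]


def plan_pipeline(python=sys.executable):
    """Return one argv list per stage, in execution order."""
    return [[python, script] for _, script in PIPELINE]


def run_pipeline(dry_run=True):
    """Print each stage; run it only when dry_run is False."""
    commands = plan_pipeline()
    for (label, _), command in zip(PIPELINE, commands):
        print(f"[{label}] {' '.join(command)}")
        if not dry_run:
            # check=True stops the pipeline at the first failing stage.
            subprocess.run(command, check=True)
    return commands
```

Running `run_pipeline()` from the polyguard-rl directory prints the planned commands. docs/training.md is the place to look for the arguments each stage expects.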