
PolyGuard (OpenEnv implementation package)

Run all CLI commands from this directory (cd polyguard-rl). The repository root README.md carries the same submission narrative with paths adjusted for viewers landing on the GitHub repo home page.

Submission Links

Shared Environment, Logs, And Scripts

The required environment files, training logs, and training scripts are shared in the repo and indexed in Submission Artifact Index.

  • Environment/runtime: openenv.yaml, pyproject.toml, uv.lock, requirements*.txt, Dockerfile*, app/env/, server/app.py, and app/hf_space/Dockerfile.
  • Training scripts/notebooks: PolyGuard_SFT_GRPO_One_Run_Runner.ipynb, notebooks/09_training_loop.ipynb, scripts/train_sft_trl.py, scripts/train_grpo_trl.py, scripts/deploy_training_space.py, app/hf_space/training_runner.py, and app/training/.
  • Training logs/results: docs/results/final_submission_evidence/reports/, docs/results/sweeps/, docs/results/submission_evidence_qwen_0_5b_1_5b_3b/reports/, and docs/results/qwen_completed_runs/reports/.
  • Final downloadable artifact Space: https://huggingface.co/spaces/adithya9903/polyguard-openenv-final-artifacts.

Problem Statement

Polypharmacy decisions are long-horizon, partially observable, and safety-critical. PolyGuard is a research environment where an LLM agent selects constrained clinical actions, receives verifier-backed reward, and improves via SFT + GRPO rather than generic open-ended chat fine-tuning.

Environment

PolyGuardEnv exposes OpenEnv-style HTTP/WebSocket endpoints (/reset, /step, /state, /metadata, /schema, /mcp, /health, /ws). Sub-environments include drug-drug interaction (DDI), bandit mining, regimen risk, precision dosing, longitudinal deprescribing, web-search missing data, alternative suggestion, and new-drug decomposition. See openenv.yaml, app/env/env_core.py, app/env/fastapi_app.py, and docs/environment_design.md.
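As a rough illustration of how a client might address these endpoints, the sketch below builds JSON requests with the Python standard library. Only the endpoint names come from this README. The base URL, the use of POST for /reset and /step, and the payload shape are assumptions for illustration; the actual request and response schemas are defined by the server (see /schema and app/env/fastapi_app.py).

```python
import json
from urllib import request

# Endpoint names as listed in this README; nothing else here is confirmed.
ENDPOINTS = ("reset", "step", "state", "metadata", "schema", "mcp", "health", "ws")


def endpoint_url(base_url: str, name: str) -> str:
    """Join a base URL and one of the documented endpoint names."""
    if name not in ENDPOINTS:
        raise ValueError(f"unknown endpoint: {name}")
    return f"{base_url.rstrip('/')}/{name}"


def build_post(base_url: str, name: str, payload: dict) -> request.Request:
    """Build (but do not send) a JSON POST request.

    Assumption: /reset and /step accept JSON bodies via POST.
    """
    body = json.dumps(payload).encode("utf-8")
    return request.Request(
        endpoint_url(base_url, name),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: request.Request, timeout: float = 10.0) -> dict:
    """Send a prepared request and decode the JSON response body."""
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Hypothetical usage against a locally running server (payload is illustrative):
#   obs = send(build_post("http://localhost:8000", "reset", {}))
#   out = send(build_post("http://localhost:8000", "step", {"action": {}}))
```

Keeping request construction separate from sending makes it possible to inspect or log a request without a running server.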

Agent Capabilities

Agents cover medication reconciliation, evidence retrieval, graph safety, dosing guardrails, candidate generation, supervisor routing, a planner/critic stack, explanations, and contextual bandit ranking for ablations (app/agents/, docs/agents.md).

Tasks

Tasks span DDI risk reduction, safe adds/substitutions, regimen optimization, taper/deprescribing sequences, precision dosing, missing-data recovery, and new-drug decomposition (data/scenarios/, app/env/catalog.py).

Reward Model / Evaluation Logic

Thirteen verifier-backed reward components roll up into four primary channels (safety_legality, clinical_improvement, dosing_quality, process_integrity), clamped to [0.001, 0.999], with anti-cheat and timeout logic (app/env/reward_router.py, app/env/anti_cheat.py, docs/reward_design.md).
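The roll-up and clamp described above can be sketched as follows. The four channel names and the [0.001, 0.999] bounds come from this README. The equal-weight average and the choice to clamp the final scalar (rather than each channel) are assumptions made for illustration; the actual aggregation lives in app/env/reward_router.py and may differ.

```python
# Channel names and clamp bounds are taken from the README.
CHANNELS = (
    "safety_legality",
    "clinical_improvement",
    "dosing_quality",
    "process_integrity",
)
REWARD_MIN, REWARD_MAX = 0.001, 0.999


def clamp_reward(value: float) -> float:
    """Keep a reward strictly inside the open interval (0, 1)."""
    return max(REWARD_MIN, min(REWARD_MAX, value))


def combine_channels(channel_scores: dict, weights: dict = None) -> float:
    """Weighted mean of the four channels, then clamp.

    Assumption: equal weights by default; the real weighting is not
    stated in the README.
    """
    if weights is None:
        weights = {name: 1.0 for name in CHANNELS}
    total_weight = sum(weights[name] for name in CHANNELS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    raw = sum(weights[name] * channel_scores[name] for name in CHANNELS)
    return clamp_reward(raw / total_weight)


# Hypothetical scores for a single step.
example = {name: 0.5 for name in CHANNELS}
print(combine_channels(example))
```

Clamping away from exactly 0 and 1 is a common way to avoid degenerate reward values during policy optimization, though the README does not state the motivation.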

Training And Post-Training Strategy

The pipeline stages are:

  • Build corpora: scripts/bootstrap_data.py, scripts/build_training_corpus.py.
  • SFT with TRL: scripts/train_sft_trl.py.
  • GRPO with environment reward: scripts/train_grpo_trl.py.
  • Merge adapters: scripts/merge_adapters_safe.py.
  • Validate inference: scripts/test_inference_postsave.py.
  • Evaluate and plot: scripts/evaluate_*.py, docs/results/.

Optional HF GPU training uses scripts/deploy_training_space.py. Public review should start with the repository root README.md, then docs/training.md for implementation notes.
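The stage order above can be captured in a small driver, sketched here. The script paths come from this README. Invoking each script with the current Python interpreter and with no arguments is an assumption; the real scripts likely take flags (model name, output directory, and so on) that are documented in docs/training.md. The evaluation scripts are omitted because the README names them only by the glob scripts/evaluate_*.py.

```python
import subprocess
import sys

# (label, script path) pairs in the order the README lists them.
PIPELINE = [
    ("bootstrap data", "scripts/bootstrap_data.py"),
    ("build training corpus", "scripts/build_training_corpus.py"),
    ("SFT with TRL", "scripts/train_sft_trl.py"),
    ("GRPO with environment reward", "scripts/train_grpo_trl.py"),
    ("merge adapters", "scripts/merge_adapters_safe.py"),
    ("validate inference", "scripts/test_inference_postsave.py"),
]


def build_commands(python: str = sys.executable) -> list:
    """Return one argv list per stage.

    Assumption: each script runs without arguments.
    """
    return [[python, script] for _, script in PIPELINE]


def run_pipeline(dry_run: bool = True) -> None:
    """Print each stage and, unless dry_run is set, execute it.

    check=True stops at the first failing stage so later stages never
    run on incomplete outputs.
    """
    for (label, _), cmd in zip(PIPELINE, build_commands()):
        print(f"[{label}] {' '.join(cmd)}")
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default: only prints the commands.
    run_pipeline(dry_run=True)
```

Run it from the polyguard-rl directory so the relative script paths resolve.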

Documentation index