---
title: PolyGuard OpenEnv
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8100
pinned: false
---

# POLYGUARD-OPENENV

PolyGuard is an OpenEnv-compatible reinforcement-learning environment for **polypharmacy safety, medication optimization, deprescribing, and precision dosing**. The project turns medication decision making into a stateful environment where an LLM agent observes a patient/regimen state, chooses constrained clinical actions, receives verifier-backed reward, and improves through TRL/GRPO-style post-training.

> Clinical safety note: this is a research environment and demo system for RL environment design. It is not a medical device and must not be used for patient care.

## Submission Links

- GitHub Repo URL: [https://github.com/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK](https://github.com/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK)
- HF Space URL: [https://huggingface.co/spaces/Vishwa-docs/polyguard-openenv](https://huggingface.co/spaces/Vishwa-docs/polyguard-openenv) *(deployment target; verify before final submission)*
- Colab Notebook URL: [https://colab.research.google.com/github/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK/blob/master/polyguard-rl/notebooks/09_training_loop.ipynb](https://colab.research.google.com/github/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK/blob/master/polyguard-rl/notebooks/09_training_loop.ipynb)
- YouTube Video URL: not used for this submission; the Hugging Face blog URL below is the selected story artifact.
- Hugging Face Blog URL: [https://huggingface.co/blog/Vishwa-docs/polyguard-openenv](https://huggingface.co/blog/Vishwa-docs/polyguard-openenv) *(story target; publish before final submission)*

## Current Readiness

Verified locally:

- `uv run pytest`: 36 tests passed during the audit pass.
- `uv run openenv validate .`: local OpenEnv packaging passed.
- `bash scripts/bootstrap_openenv.sh --runtime-check`: runtime OpenEnv HTTP contract passed when localhost access was allowed.
- `npm run build` in `app/ui/frontend`: production UI build passed.

Still required for a final judge-ready submission:

- Authenticate Hugging Face with `./.venv/bin/hf auth login`.
- Deploy and verify the HF Space.
- Run real TRL/Unsloth SFT and GRPO on GPU/Colab so reports no longer show fallback paths.
- Replace `docs/results/hf_space_verification.json` with a successful verification payload.
- Regenerate final plots and reports with `improvement_report.improved == true`.
- Run strict readiness: `POLYGUARD_ENFORCE_SUBMISSION_LINKS=true ./.venv/bin/python scripts/acceptance_gate.py`.

## Problem Statement

Polypharmacy decisions are long-horizon, partially observable, and safety-critical. A useful LLM agent must do more than produce a plausible recommendation: it should identify drug-drug interaction risk, reason over comorbidities and labs, choose safe substitutions or deprescribing sequences, request review when uncertain, and expose why it acted.

PolyGuard targets the OpenEnv **World Modeling / Professional Tasks** theme, with multi-agent and self-improvement elements. It asks whether environment-backed feedback can make a model better at safe medication action selection than prompt-only or rule-only baselines.

## Environment

The environment is implemented by `PolyGuardEnv` and exposed through FastAPI/OpenEnv-compatible endpoints:

- `POST /reset`
- `POST /step`
- `GET /state`
- `GET /metadata`
- `GET /schema`
- `POST /mcp`
- `GET /health`
- Backward-compatible aliases under `/env/*` plus `/ws`

OpenEnv packaging lives at the repo root:

- `openenv.yaml`
- `__init__.py`
- `client.py`
- `models.py`
- `server/app.py`

Each episode samples a patient/regimen scenario and a sub-environment:

- `DDI`
- `BANDIT_MINING`
- `REGIMEN_RISK`
- `PRECISION_DOSING`
- `LONGITUDINAL_DEPRESCRIBING`
- `WEB_SEARCH_MISSING_DATA`
- `ALTERNATIVE_SUGGESTION`
- `NEW_DRUG_DECOMPOSITION`

Difficulty tracks are available as easy, medium, and hard scenario sets.
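The reset/step contract above can be sketched as a minimal episode loop. This is an illustrative sketch only: the payload fields (`action`, `reward`, `done`) are assumptions about the schema in `models.py`, and the port comes from `app_port: 8100` in the Space config.

```python
"""Minimal sketch of driving one PolyGuard episode over the OpenEnv HTTP
contract. Payload field names (`action`, `reward`, `done`) are assumed,
not taken from the verified schema."""
import json
import urllib.request
from typing import Any, Callable


def http_post(base_url: str, route: str, payload: dict) -> dict:
    """POST a JSON payload to an OpenEnv route such as /reset or /step."""
    req = urllib.request.Request(
        f"{base_url}{route}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def run_episode(
    post: Callable[[str, dict], dict],
    choose_action: Callable[[dict], Any],
    max_steps: int = 8,
) -> float:
    """Reset, then step until the environment reports done; return total reward."""
    obs = post("/reset", {})
    total = 0.0
    for _ in range(max_steps):
        result = post("/step", {"action": choose_action(obs)})
        total += result.get("reward", 0.0)
        if result.get("done"):
            break
        obs = result
    return total
```

Against a locally running server this would be invoked roughly as `run_episode(lambda route, p: http_post("http://localhost:8100", route, p), my_policy)`; the generated `client.py` is the authoritative interface.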
## Agent Capabilities

The agent stack is deliberately decomposed so reward, safety, and explanation can be inspected:

- Medication reconciliation
- Evidence retrieval and missing-data recovery
- Graph safety analysis for DDI and side effects
- Dosing guardrails
- Candidate generation
- Supervisor routing between regimen, dose, and review modes
- Planner policy selection
- Critic safety veto
- Explanation generation
- Contextual bandit ranking for policy-stack ablations

## Tasks

PolyGuard evaluates these action-selection tasks:

- Find bad drug combinations and reduce DDI/polypharmacy side-effect risk.
- Recommend safe adds, substitutions, and alternatives.
- Optimize regimens under uncertainty.
- Produce taper/deprescribing sequences over time.
- Choose precision dosing actions when organ function or dose sensitivity matters.
- Fetch evidence when critical data is missing.
- Decompose a new drug into components for first-pass safety reasoning.

## Reward Model / Evaluation Logic

Rewards are verifier-backed and clamped to `[0.001, 0.999]`. The environment exposes 13 detailed reward columns and 4 primary channels:

- `safety_legality`
- `clinical_improvement`
- `dosing_quality`
- `process_integrity`

Reward logic combines:

- Legal action checks
- Safety delta and burden improvement
- Dosing quality
- Abstention quality under uncertainty
- Format compliance
- Process fidelity
- Explanation grounding
- Anti-cheat and timeout penalties

Anti-hacking checks block repeated action loops, review abuse, keep-regimen abuse, candidate ID mismatches, parser exploit patterns, and unsafe no-op behavior on known holdout DDIs.

## Training And Post-Training Strategy

The intended pipeline is:

1. Build data assets from local knowledge, synthetic patients, scenario rollouts, optional HF instruction data, optional DDI API augmentation, and optional web fallback.
2. Run SFT with TRL and optional Unsloth/QLoRA acceleration to teach action-selection format.
3. Run GRPO with environment-backed reward verification.
4. Track per-component reward columns and sampled generations.
5. Run policy-stack ablations against baselines.
6. Merge/export adapters safely.
7. Validate post-save inference from the exported artifact.
8. Deploy the OpenEnv environment to Hugging Face Spaces.

Core commands:

```bash
cd polyguard-rl
bash scripts/bootstrap_venv.sh
.venv/bin/python scripts/bootstrap_data.py
.venv/bin/python scripts/build_training_corpus.py --profile small --with-local --with-synthetic --with-hf
.venv/bin/python scripts/train_sft_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --epochs 1 --max-steps 20 --use-unsloth
.venv/bin/python scripts/train_grpo_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --max-steps 20 --num-generations 2 --use-unsloth
.venv/bin/python scripts/merge_adapters_safe.py --adapter-dir checkpoints/sft_adapter --output-dir checkpoints/merged
.venv/bin/python scripts/test_inference_postsave.py --samples 3
.venv/bin/python scripts/evaluate_all.py
```

## Results

Tracked smoke/evaluation artifacts are mirrored in `docs/results/` because `outputs/` and `checkpoints/` are intentionally ignored.

![Average reward](docs/results/avg_reward.png)
![Policy stack average reward](docs/results/policy_stack_avg_reward.png)

Current smoke reports show the environment, evaluation, and plotting paths are wired, but final training is not yet judge-ready:

- `docs/results/sft_trl_run.json` currently records a fallback backend.
- `docs/results/grpo_trl_run.json` currently records an environment-reward fallback path.
- `docs/results/postsave_inference.json` currently uses fallback inference.
- `docs/results/improvement_report.json` currently records no positive improvement.
- `docs/results/hf_space_verification.json` is blocked until HF auth/deployment succeeds.

Final submission should replace these with real GPU/Colab TRL/Unsloth artifacts.
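A check over these mirrored artifacts can be sketched as follows. The `improved` key matches the `improvement_report.improved == true` readiness criterion stated in this README; treating a missing file as not-ready is an assumption about how the gate behaves.

```python
"""Sketch of a readiness probe over docs/results artifacts. Only the
`improved` field is taken from the README; file-missing handling is an
illustrative assumption, not the acceptance gate's actual logic."""
import json
from pathlib import Path


def report_is_judge_ready(results_dir: Path) -> bool:
    """True only when the improvement report exists and records a real gain."""
    report_path = results_dir / "improvement_report.json"
    if not report_path.exists():
        # No report mirrored yet: treat as not judge-ready.
        return False
    report = json.loads(report_path.read_text())
    return report.get("improved") is True
```

The real enforcement lives in `scripts/acceptance_gate.py`, which checks the other fallback-path artifacts as well.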
## Dataset Gather

Implemented data generation and packaging covers:

- Normalized drug vocabulary and class tables
- Interaction graph edges
- Burden, taper, renal, hepatic, duplicate-therapy, and substitution rules
- Synthetic patients
- Easy/medium/hard scenario files
- Retrieval corpus and local evidence index
- Unified SFT and GRPO prompt corpora

Once generated, the current local corpus summary lives in `data/processed/training_corpus_summary.json`.

## Deployment

Use the repository-local HF CLI entrypoint; the globally installed `hf` command on this machine is known to be incompatible with its installed Typer version.

```bash
./.venv/bin/hf auth login
./.venv/bin/hf auth whoami
export HF_SPACE_REPO_ID="Vishwa-docs/polyguard-openenv"
bash scripts/deploy_space.sh --repo-id "$HF_SPACE_REPO_ID"
./.venv/bin/hf spaces info "$HF_SPACE_REPO_ID"
openenv validate --url "https://Vishwa-docs-polyguard-openenv.hf.space"
```

After deployment, save the successful Space info plus the OpenEnv validation payload into `docs/results/hf_space_verification.json`.

## Strict Submission Gate

Non-strict local readiness:

```bash
.venv/bin/python scripts/acceptance_gate.py
```

Final submission readiness:

```bash
export POLYGUARD_ENFORCE_SUBMISSION_LINKS=true
.venv/bin/python scripts/acceptance_gate.py
```

Strict mode fails unless README links are real, tracked plots exist, HF Space verification passed, SFT/GRPO used real TRL/Unsloth paths, post-save inference uses the exported artifact, and measured improvement is positive.
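The strict/non-strict split can be sketched as below. Only the `POLYGUARD_ENFORCE_SUBMISSION_LINKS` name comes from this README; the truthy-value parsing, the `strict_` key prefix, and the check dictionary are hypothetical illustrations of how such a gate is commonly structured, not the actual logic in `scripts/acceptance_gate.py`.

```python
"""Illustrative sketch of an env-var-toggled strict gate. The variable name
is from the README; everything else (value parsing, `strict_` prefix,
check names) is a hypothetical example."""
import os
from typing import Mapping, Optional


def strict_mode_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Interpret POLYGUARD_ENFORCE_SUBMISSION_LINKS as a boolean flag."""
    env = os.environ if env is None else env
    value = env.get("POLYGUARD_ENFORCE_SUBMISSION_LINKS", "")
    return value.lower() in {"1", "true", "yes"}


def gate(checks: Mapping[str, bool], strict: bool) -> bool:
    """Non-strict mode skips strict-only checks; strict mode requires all."""
    if strict:
        required = dict(checks)
    else:
        required = {k: v for k, v in checks.items() if not k.startswith("strict_")}
    return all(required.values())
```

In this sketch a run like `gate({"tests_pass": True, "strict_hf_space_verified": False}, strict_mode_enabled())` passes locally but fails once the flag is exported.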
## Documentation

- [Architecture](docs/architecture.md)
- [Environment Design](docs/environment_design.md)
- [Reward Design](docs/reward_design.md)
- [Training](docs/training.md)
- [Evaluation](docs/evaluation.md)
- [Deployment](docs/deployment.md)
- [Safety](docs/safety.md)
- [Agents](docs/agents.md)
- [Datasets](docs/datasets.md)
- [Math](docs/math.md)
- [Submission Checklist](docs/submission_checklist.md)

## Future Work

- Medicine image/barcode ingestion for regimen capture
- Larger-model GRPO sweeps
- Stronger real-world drug-label ingestion and calibration
- More clinician-facing explanation studies
- Published HF blog or short video walkthrough

## License

MIT