---
title: PolyGuard OpenEnv
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8100
pinned: false
---
# PolyGuard OpenEnv

PolyGuard is an OpenEnv-compatible reinforcement-learning environment for **polypharmacy safety, medication optimization, deprescribing, and precision dosing**. The project turns medication decision-making into a stateful environment where an LLM agent observes a patient/regimen state, chooses constrained clinical actions, receives verifier-backed reward, and improves through TRL/GRPO-style post-training.

> Clinical safety note: this is a research environment and demo system for RL environment design. It is not a medical device and must not be used for patient care.

## Submission Links

- GitHub Repo URL: [https://github.com/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK](https://github.com/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK)
- HF Space URL: [https://huggingface.co/spaces/Vishwa-docs/polyguard-openenv](https://huggingface.co/spaces/Vishwa-docs/polyguard-openenv) *(deployment target; verify before final submission)*
- Colab Notebook URL: [https://colab.research.google.com/github/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK/blob/master/polyguard-rl/notebooks/09_training_loop.ipynb](https://colab.research.google.com/github/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK/blob/master/polyguard-rl/notebooks/09_training_loop.ipynb)
- YouTube Video URL: not used for this submission; the Hugging Face blog URL below is the selected story artifact.
- Hugging Face Blog URL: [https://huggingface.co/blog/Vishwa-docs/polyguard-openenv](https://huggingface.co/blog/Vishwa-docs/polyguard-openenv) *(story target; publish before final submission)*
## Current Readiness

Verified locally:

- `uv run pytest`: 36 tests passed during the audit pass.
- `uv run openenv validate .`: local OpenEnv packaging passed.
- `bash scripts/bootstrap_openenv.sh --runtime-check`: runtime OpenEnv HTTP contract passed when localhost access was allowed.
- `npm run build` in `app/ui/frontend`: production UI build passed.

Still required for a judge-ready final submission:

- Authenticate with Hugging Face via `./.venv/bin/hf auth login`.
- Deploy and verify the HF Space.
- Run real TRL/Unsloth SFT and GRPO on GPU/Colab so the reports no longer show fallback paths.
- Replace `docs/results/hf_space_verification.json` with a successful verification payload.
- Regenerate final plots and reports with `improvement_report.improved == true`.
- Run the strict readiness gate: `POLYGUARD_ENFORCE_SUBMISSION_LINKS=true ./.venv/bin/python scripts/acceptance_gate.py`.

## Problem Statement

Polypharmacy decisions are long-horizon, partially observable, and safety-critical. A useful LLM agent must do more than produce a plausible recommendation: it should identify drug-drug interaction risk, reason over comorbidities and labs, choose safe substitutions or deprescribing sequences, request review when uncertain, and expose why it acted.

PolyGuard targets the OpenEnv **World Modeling / Professional Tasks** theme, with multi-agent and self-improvement elements. It asks whether environment-backed feedback can make a model better at safe medication action selection than prompt-only or rule-only baselines.
## Environment

The environment is implemented by `PolyGuardEnv` and exposed through FastAPI/OpenEnv-compatible endpoints:

- `POST /reset`
- `POST /step`
- `GET /state`
- `GET /metadata`
- `GET /schema`
- `POST /mcp`
- `GET /health`
- Backward-compatible aliases under `/env/*`, plus `/ws`
OpenEnv packaging lives at the repo root:

- `openenv.yaml`
- `__init__.py`
- `client.py`
- `models.py`
- `server/app.py`

Each episode samples a patient/regimen scenario and a sub-environment:

- `DDI`
- `BANDIT_MINING`
- `REGIMEN_RISK`
- `PRECISION_DOSING`
- `LONGITUDINAL_DEPRESCRIBING`
- `WEB_SEARCH_MISSING_DATA`
- `ALTERNATIVE_SUGGESTION`
- `NEW_DRUG_DECOMPOSITION`

Difficulty tracks are available as easy, medium, and hard scenario sets.
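A minimal sketch of the per-episode sampling described above, assuming a seeded RNG for reproducible rollouts (the actual sampler lives inside `PolyGuardEnv`; the `patient_id` format here is invented for illustration):

```python
import random

SUB_ENVS = [
    "DDI", "BANDIT_MINING", "REGIMEN_RISK", "PRECISION_DOSING",
    "LONGITUDINAL_DEPRESCRIBING", "WEB_SEARCH_MISSING_DATA",
    "ALTERNATIVE_SUGGESTION", "NEW_DRUG_DECOMPOSITION",
]
DIFFICULTIES = ["easy", "medium", "hard"]

def sample_episode(seed: int) -> dict:
    """Deterministically pick a sub-environment and difficulty for one episode."""
    rng = random.Random(seed)  # per-episode RNG so rollouts are reproducible
    return {
        "sub_env": rng.choice(SUB_ENVS),
        "difficulty": rng.choice(DIFFICULTIES),
        "patient_id": f"synthetic-{rng.randrange(10_000):05d}",
    }
```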
## Agent Capabilities

The agent stack is deliberately decomposed so that reward, safety, and explanation can each be inspected:

- Medication reconciliation
- Evidence retrieval and missing-data recovery
- Graph safety analysis for DDIs and side effects
- Dosing guardrails
- Candidate generation
- Supervisor routing between regimen, dose, and review modes
- Planner policy selection
- Critic safety veto
- Explanation generation
- Contextual bandit ranking for policy-stack ablations
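To make the decomposition concrete, here is a minimal sketch of the critic safety veto: it rejects a planner candidate that would introduce a known interaction edge. The drug names, candidate shape, and edge representation are illustrative; the real critic consults the environment's interaction graph.

```python
def critic_veto(candidate: dict, interaction_edges: set) -> tuple[bool, str]:
    """Return (vetoed, reason) for a planner candidate.

    A candidate is vetoed when the drug it adds forms a known DDI pair
    with anything already in the regimen. Edges are unordered, so each
    pair is stored as a frozenset.
    """
    new_drug = candidate.get("add")
    if new_drug is None:
        return False, "no new drug introduced"
    for existing in candidate.get("current_regimen", []):
        if frozenset((new_drug, existing)) in interaction_edges:
            return True, f"DDI risk: {new_drug} + {existing}"
    return False, "no known interaction"
```

Because the veto is a pure function of the candidate and the edge set, it can be unit-tested and audited independently of the planner.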
## Tasks

PolyGuard evaluates these action-selection tasks:

- Detect harmful drug combinations and reduce DDI/polypharmacy side-effect risk.
- Recommend safe additions, substitutions, and alternatives.
- Optimize regimens under uncertainty.
- Produce taper/deprescribing sequences over time.
- Choose precision dosing actions when organ function or dose sensitivity matters.
- Fetch evidence when critical data is missing.
- Decompose a new drug into components for first-pass safety reasoning.

## Reward Model / Evaluation Logic

Rewards are verifier-backed and clamped to `[0.001, 0.999]`. The environment exposes 13 detailed reward columns and 4 primary channels:

- `safety_legality`
- `clinical_improvement`
- `dosing_quality`
- `process_integrity`

The reward logic combines:

- Legal-action checks
- Safety delta and burden improvement
- Dosing quality
- Abstention quality under uncertainty
- Format compliance
- Process fidelity
- Explanation grounding
- Anti-cheat and timeout penalties
Anti-hacking checks block repeated action loops, review abuse, keep-regimen abuse, candidate ID mismatches, parser exploit patterns, and unsafe no-op behavior on known holdout DDIs.
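As one example of these checks, a repeated-action-loop detector can be sketched as a sliding-window count. The window size and repeat threshold are illustrative assumptions; the environment's anti-cheat logic covers the other patterns (review abuse, keep-regimen abuse, parser exploits) as well.

```python
from collections import Counter

def detect_action_loop(action_history: list,
                       window: int = 6, max_repeats: int = 3) -> bool:
    """Flag reward hacking via the same action repeated in the recent window.

    Thresholds are illustrative, not the environment's actual values.
    """
    recent = action_history[-window:]
    if not recent:
        return False
    most_common_count = Counter(recent).most_common(1)[0][1]
    return most_common_count >= max_repeats
```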
## Training and Post-Training Strategy

The intended pipeline is:

1. Build data assets from local knowledge, synthetic patients, scenario rollouts, optional HF instruction data, optional DDI API augmentation, and optional web fallback.
2. Run SFT with TRL and optional Unsloth/QLoRA acceleration to teach the action-selection format.
3. Run GRPO with environment-backed reward verification.
4. Track per-component reward columns and sampled generations.
5. Run policy-stack ablations against baselines.
6. Merge/export adapters safely.
7. Validate post-save inference from the exported artifact.
8. Deploy the OpenEnv environment to Hugging Face Spaces.

Core commands:

```bash
cd polyguard-rl
bash scripts/bootstrap_venv.sh
.venv/bin/python scripts/bootstrap_data.py
.venv/bin/python scripts/build_training_corpus.py --profile small --with-local --with-synthetic --with-hf
.venv/bin/python scripts/train_sft_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --epochs 1 --max-steps 20 --use-unsloth
.venv/bin/python scripts/train_grpo_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --max-steps 20 --num-generations 2 --use-unsloth
.venv/bin/python scripts/merge_adapters_safe.py --adapter-dir checkpoints/sft_adapter --output-dir checkpoints/merged
.venv/bin/python scripts/test_inference_postsave.py --samples 3
.venv/bin/python scripts/evaluate_all.py
```
## Results

Tracked smoke/evaluation artifacts are mirrored in `docs/results/` because `outputs/` and `checkpoints/` are intentionally git-ignored.




Current smoke reports show that the environment, evaluation, and plotting paths are wired, but final training is not yet judge-ready:

- `docs/results/sft_trl_run.json` currently records a fallback backend.
- `docs/results/grpo_trl_run.json` currently records an environment-reward fallback path.
- `docs/results/postsave_inference.json` currently uses fallback inference.
- `docs/results/improvement_report.json` currently records no positive improvement.
- `docs/results/hf_space_verification.json` is blocked until HF auth/deployment succeeds.

The final submission should replace these with real GPU/Colab TRL/Unsloth artifacts.
## Dataset Gathering

Implemented data generation and packaging covers:

- Normalized drug vocabulary and class tables
- Interaction graph edges
- Burden, taper, renal, hepatic, duplicate-therapy, and substitution rules
- Synthetic patients
- Easy/medium/hard scenario files
- Retrieval corpus and local evidence index
- Unified SFT and GRPO prompt corpora

The current local corpus summary is written to `data/processed/training_corpus_summary.json` when generated.
## Deployment

Use the repository-local HF CLI entrypoint; the global `hf` command on this machine is known to be incompatible with its installed Typer version.

```bash
./.venv/bin/hf auth login
./.venv/bin/hf auth whoami
export HF_SPACE_REPO_ID="Vishwa-docs/polyguard-openenv"
bash scripts/deploy_space.sh --repo-id "$HF_SPACE_REPO_ID"
./.venv/bin/hf spaces info "$HF_SPACE_REPO_ID"
openenv validate --url "https://Vishwa-docs-polyguard-openenv.hf.space"
```

After deployment, save the successful Space info plus the OpenEnv validation payload into `docs/results/hf_space_verification.json`.
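A small helper like the following can bundle the two payloads into that file. The field names (`verified_at`, `space`, `openenv_validation`, `passed`, and the `ok` key on the validation payload) are assumptions for illustration; keep whatever shape `scripts/acceptance_gate.py` actually expects.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def write_space_verification(space_info: dict, validation: dict,
                             path: str = "docs/results/hf_space_verification.json") -> dict:
    """Bundle Space info and the OpenEnv validation payload into one record.

    Field names are illustrative, not the acceptance gate's contract.
    """
    payload = {
        "verified_at": datetime.now(timezone.utc).isoformat(),
        "space": space_info,
        "openenv_validation": validation,
        "passed": bool(validation.get("ok")),
    }
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)  # docs/results/ may not exist yet
    out.write_text(json.dumps(payload, indent=2))
    return payload
```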
## Strict Submission Gate

Non-strict local readiness:

```bash
.venv/bin/python scripts/acceptance_gate.py
```

Final submission readiness:

```bash
export POLYGUARD_ENFORCE_SUBMISSION_LINKS=true
.venv/bin/python scripts/acceptance_gate.py
```

Strict mode fails unless the README links are real, tracked plots exist, HF Space verification passed, SFT/GRPO used real TRL/Unsloth paths, post-save inference used the exported artifact, and the measured improvement is positive.

## Documentation

- [Architecture](docs/architecture.md)
- [Environment Design](docs/environment_design.md)
- [Reward Design](docs/reward_design.md)
- [Training](docs/training.md)
- [Evaluation](docs/evaluation.md)
- [Deployment](docs/deployment.md)
- [Safety](docs/safety.md)
- [Agents](docs/agents.md)
- [Datasets](docs/datasets.md)
- [Math](docs/math.md)
- [Submission Checklist](docs/submission_checklist.md)

## Future Work

- Medicine image/barcode ingestion for regimen capture
- Larger-model GRPO sweeps
- Stronger real-world drug-label ingestion and calibration
- More clinician-facing explanation studies
- A published HF blog post or short video walkthrough

## License

MIT