---
title: PolyGuard OpenEnv
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8100
pinned: false
---
# POLYGUARD-OPENENV
PolyGuard is an OpenEnv-compatible reinforcement-learning environment for **polypharmacy safety, medication optimization, deprescribing, and precision dosing**. The project turns medication decision-making into a stateful environment where an LLM agent observes a patient/regimen state, chooses constrained clinical actions, receives verifier-backed rewards, and improves through TRL/GRPO-style post-training.
> Clinical safety note: this is a research environment and demo system for RL environment design. It is not a medical device and must not be used for patient care.
## Submission Links
- GitHub Repo URL: [https://github.com/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK](https://github.com/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK)
- HF Space URL: [https://huggingface.co/spaces/Vishwa-docs/polyguard-openenv](https://huggingface.co/spaces/Vishwa-docs/polyguard-openenv) *(deployment target; verify before final submission)*
- Colab Notebook URL: [https://colab.research.google.com/github/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK/blob/master/polyguard-rl/notebooks/09_training_loop.ipynb](https://colab.research.google.com/github/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK/blob/master/polyguard-rl/notebooks/09_training_loop.ipynb)
- YouTube Video URL: not used for this submission; the Hugging Face blog URL below is the selected story artifact.
- Hugging Face Blog URL: [https://huggingface.co/blog/Vishwa-docs/polyguard-openenv](https://huggingface.co/blog/Vishwa-docs/polyguard-openenv) *(story target; publish before final submission)*
## Current Readiness
Verified locally:
- `uv run pytest`: 36 tests passed during the audit pass.
- `uv run openenv validate .`: local OpenEnv packaging passed.
- `bash scripts/bootstrap_openenv.sh --runtime-check`: runtime OpenEnv HTTP contract passed when localhost access was allowed.
- `npm run build` in `app/ui/frontend`: production UI build passed.
Still required for final judge-ready submission:
- Authenticate Hugging Face with `./.venv/bin/hf auth login`.
- Deploy and verify the HF Space.
- Run real TRL/Unsloth SFT and GRPO on GPU/Colab so reports no longer show fallback paths.
- Replace `docs/results/hf_space_verification.json` with a successful verification payload.
- Regenerate final plots and reports with `improvement_report.improved == true`.
- Run strict readiness: `POLYGUARD_ENFORCE_SUBMISSION_LINKS=true ./.venv/bin/python scripts/acceptance_gate.py`.
## Problem Statement
Polypharmacy decisions are long-horizon, partially observable, and safety-critical. A useful LLM agent must do more than produce a plausible recommendation: it should identify drug-drug interaction risk, reason over comorbidities and labs, choose safe substitutions or deprescribing sequences, request review when uncertain, and expose why it acted.
PolyGuard targets the OpenEnv **World Modeling / Professional Tasks** theme, with multi-agent and self-improvement elements. It asks whether environment-backed feedback can make a model better at safe medication action selection than prompt-only or rule-only baselines.
## Environment
The environment is implemented by `PolyGuardEnv` and exposed through FastAPI/OpenEnv-compatible endpoints:
- `POST /reset`
- `POST /step`
- `GET /state`
- `GET /metadata`
- `GET /schema`
- `POST /mcp`
- `GET /health`
- Backward-compatible aliases under `/env/*` plus `/ws`
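As a rough sketch, the HTTP contract above can be exercised with a minimal stdlib client like the one below. The local port is assumed from `app_port: 8100`, and the payload shapes are placeholders; the real request/response schemas live in `models.py` and behind `GET /schema`.

```python
import json
from urllib import request

BASE_URL = "http://localhost:8100"  # assumed local port, matching app_port above

def _post(path: str, payload: dict) -> dict:
    """POST a JSON payload to the environment server and decode the reply."""
    req = request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

def reset() -> dict:
    """Start a new episode; returns the initial patient/regimen observation."""
    return _post("/reset", {})

def step(action: dict) -> dict:
    """Submit one clinical action; returns observation, reward, and done fields."""
    return _post("/step", {"action": action})
```

A training loop would call `reset()` once per episode and `step(...)` until the environment signals termination.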
OpenEnv packaging lives at repo root:
- `openenv.yaml`
- `__init__.py`
- `client.py`
- `models.py`
- `server/app.py`
Each episode samples a patient/regimen scenario and a sub-environment:
- `DDI`
- `BANDIT_MINING`
- `REGIMEN_RISK`
- `PRECISION_DOSING`
- `LONGITUDINAL_DEPRESCRIBING`
- `WEB_SEARCH_MISSING_DATA`
- `ALTERNATIVE_SUGGESTION`
- `NEW_DRUG_DECOMPOSITION`
Difficulty tracks are available as easy, medium, and hard scenario sets.
## Agent Capabilities
The agent stack is deliberately decomposed so reward, safety, and explanation can be inspected:
- Medication reconciliation
- Evidence retrieval and missing-data recovery
- Graph safety analysis for DDI and side effects
- Dosing guardrails
- Candidate generation
- Supervisor routing between regimen, dose, and review modes
- Planner policy selection
- Critic safety veto
- Explanation generation
- Contextual bandit ranking for policy-stack ablations
## Tasks
PolyGuard evaluates these action-selection tasks:
- Find bad drug combinations and reduce DDI/polypharmacy side-effect risk.
- Recommend safe adds, substitutions, and alternatives.
- Optimize regimens under uncertainty.
- Produce taper/deprescribing sequences over time.
- Choose precision dosing actions when organ function or dose sensitivity matters.
- Fetch evidence when critical data is missing.
- Decompose a new drug into components for first-pass safety reasoning.
## Reward Model / Evaluation Logic
Rewards are verifier-backed and clamped to `[0.001, 0.999]`. The environment exposes 13 detailed reward columns and 4 primary channels:
- `safety_legality`
- `clinical_improvement`
- `dosing_quality`
- `process_integrity`
Reward logic combines:
- Legal action checks
- Safety delta and burden improvement
- Dosing quality
- Abstention quality under uncertainty
- Format compliance
- Process fidelity
- Explanation grounding
- Anti-cheat and timeout penalties
Anti-hacking checks block repeated action loops, review abuse, keep-regimen abuse, candidate ID mismatches, parser exploit patterns, and unsafe no-op behavior on known holdout DDIs.
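The clamped multi-channel combination can be illustrated with a small sketch. Only the clamp range `[0.001, 0.999]` and the four channel names come from the environment; the weights here are made-up placeholders, not the real reward configuration.

```python
CHANNELS = ("safety_legality", "clinical_improvement",
            "dosing_quality", "process_integrity")

def combine_reward(scores: dict, weights: dict) -> float:
    """Weighted sum of per-channel verifier scores, clamped to [0.001, 0.999].

    Clamping away from exact 0 and 1 keeps gradients informative and
    prevents any single channel from fully saturating the reward.
    """
    raw = sum(weights.get(c, 0.0) * scores.get(c, 0.0) for c in CHANNELS)
    return min(0.999, max(0.001, raw))
```

For example, an unsafe action whose channels all score zero still receives the floor reward of `0.001` rather than exactly zero.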
## Training And Post-Training Strategy
The intended pipeline is:
1. Build data assets from local knowledge, synthetic patients, scenario rollouts, optional HF instruction data, optional DDI API augmentation, and optional web fallback.
2. Run SFT with TRL and optional Unsloth/QLoRA acceleration to teach action-selection format.
3. Run GRPO with environment-backed reward verification.
4. Track per-component reward columns and sampled generations.
5. Run policy-stack ablations against baselines.
6. Merge/export adapters safely.
7. Validate post-save inference from the exported artifact.
8. Deploy the OpenEnv environment to Hugging Face Spaces.
Core commands:
```bash
cd polyguard-rl
bash scripts/bootstrap_venv.sh
.venv/bin/python scripts/bootstrap_data.py
.venv/bin/python scripts/build_training_corpus.py --profile small --with-local --with-synthetic --with-hf
.venv/bin/python scripts/train_sft_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --epochs 1 --max-steps 20 --use-unsloth
.venv/bin/python scripts/train_grpo_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --max-steps 20 --num-generations 2 --use-unsloth
.venv/bin/python scripts/merge_adapters_safe.py --adapter-dir checkpoints/sft_adapter --output-dir checkpoints/merged
.venv/bin/python scripts/test_inference_postsave.py --samples 3
.venv/bin/python scripts/evaluate_all.py
```
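For step 3, TRL's `GRPOTrainer` accepts reward functions that take the batch of sampled completions and return one float per completion. A hedged sketch of how the environment verifier could be plugged in; the JSON-format check below is a stub standing in for a real round trip to the environment's `/step` endpoint.

```python
def env_reward(completions, **kwargs):
    """Return one clamped reward per sampled completion.

    A real implementation would parse each completion into an action,
    submit it to the environment, and read the verifier-backed reward;
    here a JSON-shape check stands in for that round trip.
    """
    rewards = []
    for text in completions:
        stripped = text.strip()
        looks_like_action = stripped.startswith("{") and stripped.endswith("}")
        rewards.append(0.999 if looks_like_action else 0.001)
    return rewards

# Hypothetical wiring (requires trl to be installed):
# trainer = GRPOTrainer(model=..., reward_funcs=[env_reward], ...)
```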
## Results
Tracked smoke/evaluation artifacts are mirrored in `docs/results/` because `outputs/` and `checkpoints/` are intentionally git-ignored.


Current smoke reports show the environment, evaluation, and plotting paths are wired, but final training is not yet judge-ready:
- `docs/results/sft_trl_run.json` currently records a fallback backend.
- `docs/results/grpo_trl_run.json` currently records an environment-reward fallback path.
- `docs/results/postsave_inference.json` currently uses fallback inference.
- `docs/results/improvement_report.json` currently records no positive improvement.
- `docs/results/hf_space_verification.json` is blocked until HF auth/deployment succeeds.
Final submission should replace these with real GPU/Colab TRL/Unsloth artifacts.
## Dataset Gather
Implemented data generation and packaging covers:
- Normalized drug vocabulary and class tables
- Interaction graph edges
- Burden, taper, renal, hepatic, duplicate-therapy, and substitution rules
- Synthetic patients
- Easy/medium/hard scenario files
- Retrieval corpus and local evidence index
- Unified SFT and GRPO prompt corpora
The current local corpus summary is in `data/processed/training_corpus_summary.json` when generated.
## Deployment
Use the repository-local HF CLI entrypoint. The global `hf` command on this machine is known to be incompatible with its installed Typer version.
```bash
./.venv/bin/hf auth login
./.venv/bin/hf auth whoami
export HF_SPACE_REPO_ID="Vishwa-docs/polyguard-openenv"
bash scripts/deploy_space.sh --repo-id "$HF_SPACE_REPO_ID"
./.venv/bin/hf spaces info "$HF_SPACE_REPO_ID"
openenv validate --url "https://Vishwa-docs-polyguard-openenv.hf.space"
```
After deployment, save the successful Space info plus OpenEnv validation payload into `docs/results/hf_space_verification.json`.
## Strict Submission Gate
Non-strict local readiness:
```bash
.venv/bin/python scripts/acceptance_gate.py
```
Final submission readiness:
```bash
export POLYGUARD_ENFORCE_SUBMISSION_LINKS=true
.venv/bin/python scripts/acceptance_gate.py
```
Strict mode fails unless README links are real, tracked plots exist, HF Space verification passed, SFT/GRPO used real TRL/Unsloth paths, post-save inference uses the exported artifact, and measured improvement is positive.
## Documentation
- [Architecture](docs/architecture.md)
- [Environment Design](docs/environment_design.md)
- [Reward Design](docs/reward_design.md)
- [Training](docs/training.md)
- [Evaluation](docs/evaluation.md)
- [Deployment](docs/deployment.md)
- [Safety](docs/safety.md)
- [Agents](docs/agents.md)
- [Datasets](docs/datasets.md)
- [Math](docs/math.md)
- [Submission Checklist](docs/submission_checklist.md)
## Future Work
- Medicine image/barcode ingestion for regimen capture
- Larger model GRPO sweeps
- Stronger real-world drug-label ingestion and calibration
- More clinician-facing explanation studies
- Published HF blog or short video walkthrough
## License
MIT