---
title: PolyGuard OpenEnv
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8100
pinned: false
---

# POLYGUARD-OPENENV

PolyGuard is an OpenEnv-compatible reinforcement-learning environment for **polypharmacy safety, medication optimization, deprescribing, and precision dosing**. The project turns medication decision making into a stateful environment where an LLM agent observes a patient/regimen state, chooses constrained clinical actions, receives verifier-backed reward, and improves through TRL/GRPO-style post-training.

> Clinical safety note: this is a research environment and demo system for RL environment design. It is not a medical device and must not be used for patient care.

## Submission Links

- GitHub Repo URL: [https://github.com/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK](https://github.com/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK)
- HF Space URL: [https://huggingface.co/spaces/Vishwa-docs/polyguard-openenv](https://huggingface.co/spaces/Vishwa-docs/polyguard-openenv) *(deployment target; verify before final submission)*
- Colab Notebook URL: [https://colab.research.google.com/github/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK/blob/master/polyguard-rl/notebooks/09_training_loop.ipynb](https://colab.research.google.com/github/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK/blob/master/polyguard-rl/notebooks/09_training_loop.ipynb)
- YouTube Video URL: not used for this submission; the Hugging Face blog URL below is the selected story artifact.
- Hugging Face Blog URL: [https://huggingface.co/blog/Vishwa-docs/polyguard-openenv](https://huggingface.co/blog/Vishwa-docs/polyguard-openenv) *(story target; publish before final submission)*

## Current Readiness

Verified locally:

- `uv run pytest`: 36 tests passed during the audit pass.
- `uv run openenv validate .`: local OpenEnv packaging passed.
- `bash scripts/bootstrap_openenv.sh --runtime-check`: runtime OpenEnv HTTP contract passed when localhost access was allowed.
- `npm run build` in `app/ui/frontend`: production UI build passed.

Still required for a final judge-ready submission:

- Authenticate Hugging Face with `./.venv/bin/hf auth login`.
- Deploy and verify the HF Space.
- Run real TRL/Unsloth SFT and GRPO on GPU/Colab so reports no longer show fallback paths.
- Replace `docs/results/hf_space_verification.json` with a successful verification payload.
- Regenerate final plots and reports with `improvement_report.improved == true`.
- Run strict readiness: `POLYGUARD_ENFORCE_SUBMISSION_LINKS=true ./.venv/bin/python scripts/acceptance_gate.py`.

## Problem Statement

Polypharmacy decisions are long-horizon, partially observable, and safety-critical. A useful LLM agent must do more than produce a plausible recommendation: it should identify drug-drug interaction risk, reason over comorbidities and labs, choose safe substitutions or deprescribing sequences, request review when uncertain, and expose why it acted.

PolyGuard targets the OpenEnv **World Modeling / Professional Tasks** theme, with multi-agent and self-improvement elements. It asks whether environment-backed feedback can make a model better at safe medication action selection than prompt-only or rule-only baselines.

## Environment

The environment is implemented by `PolyGuardEnv` and exposed through FastAPI/OpenEnv-compatible endpoints:

- `POST /reset`
- `POST /step`
- `GET /state`
- `GET /metadata`
- `GET /schema`
- `POST /mcp`
- `GET /health`
- Backward-compatible aliases under `/env/*` plus `/ws`

OpenEnv packaging lives at the repo root:

- `openenv.yaml`
- `__init__.py`
- `client.py`
- `models.py`
- `server/app.py`

Each episode samples a patient/regimen scenario and a sub-environment:

- `DDI`
- `BANDIT_MINING`
- `REGIMEN_RISK`
- `PRECISION_DOSING`
- `LONGITUDINAL_DEPRESCRIBING`
- `WEB_SEARCH_MISSING_DATA`
- `ALTERNATIVE_SUGGESTION`
- `NEW_DRUG_DECOMPOSITION`

Difficulty tracks are available as easy, medium, and hard scenario sets.
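The reset/step contract above can be sketched as a minimal episode loop. This is an illustrative sketch only: the payload fields (`action`, `reward`, `done`) are assumptions about the schema in `models.py`, and the port comes from `app_port: 8100` in the Space config.

```python
"""Minimal sketch of driving one PolyGuard episode over the OpenEnv HTTP
contract. Payload field names (`action`, `reward`, `done`) are assumed,
not taken from the verified schema."""
import json
import urllib.request
from typing import Any, Callable


def http_post(base_url: str, route: str, payload: dict) -> dict:
    """POST a JSON payload to an OpenEnv route such as /reset or /step."""
    req = urllib.request.Request(
        f"{base_url}{route}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def run_episode(
    post: Callable[[str, dict], dict],
    choose_action: Callable[[dict], Any],
    max_steps: int = 8,
) -> float:
    """Reset, then step until the environment reports done; return total reward."""
    obs = post("/reset", {})
    total = 0.0
    for _ in range(max_steps):
        result = post("/step", {"action": choose_action(obs)})
        total += result.get("reward", 0.0)
        if result.get("done"):
            break
        obs = result
    return total
```

Against a locally running server this would be invoked roughly as `run_episode(lambda route, p: http_post("http://localhost:8100", route, p), my_policy)`; the generated `client.py` is the authoritative interface.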
## Agent Capabilities

The agent stack is deliberately decomposed so reward, safety, and explanation can be inspected:

- Medication reconciliation
- Evidence retrieval and missing-data recovery
- Graph safety analysis for DDI and side effects
- Dosing guardrails
- Candidate generation
- Supervisor routing between regimen, dose, and review modes
- Planner policy selection
- Critic safety veto
- Explanation generation
- Contextual bandit ranking for policy-stack ablations

## Tasks

PolyGuard evaluates these action-selection tasks:

- Find bad drug combinations and reduce DDI/polypharmacy side-effect risk.
- Recommend safe adds, substitutions, and alternatives.
- Optimize regimens under uncertainty.
- Produce taper/deprescribing sequences over time.
- Choose precision dosing actions when organ function or dose sensitivity matters.
- Fetch evidence when critical data is missing.
- Decompose a new drug into components for first-pass safety reasoning.

## Reward Model / Evaluation Logic

Rewards are verifier-backed and clamped to `[0.001, 0.999]`. The environment exposes 13 detailed reward columns and 4 primary channels:

- `safety_legality`
- `clinical_improvement`
- `dosing_quality`
- `process_integrity`

Reward logic combines:

- Legal action checks
- Safety delta and burden improvement
- Dosing quality
- Abstention quality under uncertainty
- Format compliance
- Process fidelity
- Explanation grounding
- Anti-cheat and timeout penalties

Anti-hacking checks block repeated action loops, review abuse, keep-regimen abuse, candidate ID mismatches, parser exploit patterns, and unsafe no-op behavior on known holdout DDIs.

## Training And Post-Training Strategy

The intended pipeline is:

1. Build data assets from local knowledge, synthetic patients, scenario rollouts, optional HF instruction data, optional DDI API augmentation, and optional web fallback.
2. Run SFT with TRL and optional Unsloth/QLoRA acceleration to teach action-selection format.
3. Run GRPO with environment-backed reward verification.
4. Track per-component reward columns and sampled generations.
5. Run policy-stack ablations against baselines.
6. Merge/export adapters safely.
7. Validate post-save inference from the exported artifact.
8. Deploy the OpenEnv environment to Hugging Face Spaces.

Core commands:

```bash
cd polyguard-rl
bash scripts/bootstrap_venv.sh
.venv/bin/python scripts/bootstrap_data.py
.venv/bin/python scripts/build_training_corpus.py --profile small --with-local --with-synthetic --with-hf
.venv/bin/python scripts/train_sft_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --epochs 1 --max-steps 20 --use-unsloth
.venv/bin/python scripts/train_grpo_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --max-steps 20 --num-generations 2 --use-unsloth
.venv/bin/python scripts/merge_adapters_safe.py --adapter-dir checkpoints/sft_adapter --output-dir checkpoints/merged
.venv/bin/python scripts/test_inference_postsave.py --samples 3
.venv/bin/python scripts/evaluate_all.py
```

## Results

Tracked smoke/evaluation artifacts are mirrored in `docs/results/` because `outputs/` and `checkpoints/` are intentionally ignored.

![Average reward](docs/results/avg_reward.png)
![Policy stack average reward](docs/results/policy_stack_avg_reward.png)

Current smoke reports show the environment, evaluation, and plotting paths are wired, but final training is not yet judge-ready:

- `docs/results/sft_trl_run.json` currently records a fallback backend.
- `docs/results/grpo_trl_run.json` currently records an environment-reward fallback path.
- `docs/results/postsave_inference.json` currently uses fallback inference.
- `docs/results/improvement_report.json` currently records no positive improvement.
- `docs/results/hf_space_verification.json` is blocked until HF auth/deployment succeeds.

Final submission should replace these with real GPU/Colab TRL/Unsloth artifacts.
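A check over these mirrored artifacts can be sketched as follows. The `improved` key matches the `improvement_report.improved == true` readiness criterion stated in this README; treating a missing file as not-ready is an assumption about how the gate behaves.

```python
"""Sketch of a readiness probe over docs/results artifacts. Only the
`improved` field is taken from the README; file-missing handling is an
illustrative assumption, not the acceptance gate's actual logic."""
import json
from pathlib import Path


def report_is_judge_ready(results_dir: Path) -> bool:
    """True only when the improvement report exists and records a real gain."""
    report_path = results_dir / "improvement_report.json"
    if not report_path.exists():
        # No report mirrored yet: treat as not judge-ready.
        return False
    report = json.loads(report_path.read_text())
    return report.get("improved") is True
```

The real enforcement lives in `scripts/acceptance_gate.py`, which checks the other fallback-path artifacts as well.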
## Dataset Gather

Implemented data generation and packaging covers:

- Normalized drug vocabulary and class tables
- Interaction graph edges
- Burden, taper, renal, hepatic, duplicate-therapy, and substitution rules
- Synthetic patients
- Easy/medium/hard scenario files
- Retrieval corpus and local evidence index
- Unified SFT and GRPO prompt corpora

Once generated, the current local corpus summary lives in `data/processed/training_corpus_summary.json`.

## Deployment

Use the repository-local HF CLI entrypoint; the globally installed `hf` command on this machine is known to be incompatible with its installed Typer version.

```bash
./.venv/bin/hf auth login
./.venv/bin/hf auth whoami
export HF_SPACE_REPO_ID="Vishwa-docs/polyguard-openenv"
bash scripts/deploy_space.sh --repo-id "$HF_SPACE_REPO_ID"
./.venv/bin/hf spaces info "$HF_SPACE_REPO_ID"
openenv validate --url "https://Vishwa-docs-polyguard-openenv.hf.space"
```

After deployment, save the successful Space info plus the OpenEnv validation payload into `docs/results/hf_space_verification.json`.

## Strict Submission Gate

Non-strict local readiness:

```bash
.venv/bin/python scripts/acceptance_gate.py
```

Final submission readiness:

```bash
export POLYGUARD_ENFORCE_SUBMISSION_LINKS=true
.venv/bin/python scripts/acceptance_gate.py
```

Strict mode fails unless README links are real, tracked plots exist, HF Space verification passed, SFT/GRPO used real TRL/Unsloth paths, post-save inference uses the exported artifact, and measured improvement is positive.
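The strict/non-strict split can be sketched as below. Only the `POLYGUARD_ENFORCE_SUBMISSION_LINKS` name comes from this README; the truthy-value parsing, the `strict_` key prefix, and the check dictionary are hypothetical illustrations of how such a gate is commonly structured, not the actual logic in `scripts/acceptance_gate.py`.

```python
"""Illustrative sketch of an env-var-toggled strict gate. The variable name
is from the README; everything else (value parsing, `strict_` prefix,
check names) is a hypothetical example."""
import os
from typing import Mapping, Optional


def strict_mode_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Interpret POLYGUARD_ENFORCE_SUBMISSION_LINKS as a boolean flag."""
    env = os.environ if env is None else env
    value = env.get("POLYGUARD_ENFORCE_SUBMISSION_LINKS", "")
    return value.lower() in {"1", "true", "yes"}


def gate(checks: Mapping[str, bool], strict: bool) -> bool:
    """Non-strict mode skips strict-only checks; strict mode requires all."""
    if strict:
        required = dict(checks)
    else:
        required = {k: v for k, v in checks.items() if not k.startswith("strict_")}
    return all(required.values())
```

In this sketch a run like `gate({"tests_pass": True, "strict_hf_space_verified": False}, strict_mode_enabled())` passes locally but fails once the flag is exported.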
## Documentation

- [Architecture](docs/architecture.md)
- [Environment Design](docs/environment_design.md)
- [Reward Design](docs/reward_design.md)
- [Training](docs/training.md)
- [Evaluation](docs/evaluation.md)
- [Deployment](docs/deployment.md)
- [Safety](docs/safety.md)
- [Agents](docs/agents.md)
- [Datasets](docs/datasets.md)
- [Math](docs/math.md)
- [Submission Checklist](docs/submission_checklist.md)

## Future Work

- Medicine image/barcode ingestion for regimen capture
- Larger-model GRPO sweeps
- Stronger real-world drug-label ingestion and calibration
- More clinician-facing explanation studies
- Published HF blog or short video walkthrough

## License

MIT