---
title: PolyGuard OpenEnv
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8100
pinned: false
---

# PolyGuard OpenEnv
PolyGuard is an OpenEnv-compatible reinforcement-learning environment for polypharmacy safety, medication optimization, deprescribing, and precision dosing. The project turns medication decision making into a stateful environment where an LLM agent observes a patient/regimen state, chooses constrained clinical actions, receives verifier-backed reward, and improves through TRL/GRPO-style post-training.
Clinical safety note: this is a research environment and demo system for RL environment design. It is not a medical device and must not be used for patient care.
## Submission Links
- GitHub Repo URL: https://github.com/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK
- HF Space URL: https://huggingface.co/spaces/Vishwa-docs/polyguard-openenv (deployment target; verify before final submission)
- Colab Notebook URL: https://colab.research.google.com/github/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK/blob/master/polyguard-rl/notebooks/09_training_loop.ipynb
- YouTube Video URL: not used for this submission; the Hugging Face blog URL below is the selected story artifact.
- Hugging Face Blog URL: https://huggingface.co/blog/Vishwa-docs/polyguard-openenv (story target; publish before final submission)
## Current Readiness
Verified locally:
- `uv run pytest`: 36 tests passed during the audit pass.
- `uv run openenv validate .`: local OpenEnv packaging passed.
- `bash scripts/bootstrap_openenv.sh --runtime-check`: runtime OpenEnv HTTP contract passed when localhost access was allowed.
- `npm run build` in `app/ui/frontend`: production UI build passed.
Still required for final judge-ready submission:
- Authenticate Hugging Face with `./.venv/bin/hf auth login`.
- Deploy and verify the HF Space.
- Run real TRL/Unsloth SFT and GRPO on GPU/Colab so reports no longer show fallback paths.
- Replace `docs/results/hf_space_verification.json` with a successful verification payload.
- Regenerate final plots and reports with `improvement_report.improved == true`.
- Run strict readiness: `POLYGUARD_ENFORCE_SUBMISSION_LINKS=true ./.venv/bin/python scripts/acceptance_gate.py`.
## Problem Statement
Polypharmacy decisions are long-horizon, partially observable, and safety-critical. A useful LLM agent must do more than produce a plausible recommendation: it should identify drug-drug interaction risk, reason over comorbidities and labs, choose safe substitutions or deprescribing sequences, request review when uncertain, and expose why it acted.
PolyGuard targets the OpenEnv World Modeling / Professional Tasks theme, with multi-agent and self-improvement elements. It asks whether environment-backed feedback can make a model better at safe medication action selection than prompt-only or rule-only baselines.
## Environment

The environment is implemented by `PolyGuardEnv` and exposed through FastAPI/OpenEnv-compatible endpoints:
- `POST /reset`
- `POST /step`
- `GET /state`
- `GET /metadata`
- `GET /schema`
- `POST /mcp`
- `GET /health`
- Backward-compatible aliases under `/env/*` plus `/ws`
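For intuition, the HTTP contract can be exercised with a thin client. The sketch below only builds endpoint URLs and request bodies; the action payload shape (`{"action": {"type": ...}}`) is an illustrative assumption, not the environment's real schema (see `models.py` or `GET /schema` for that).

```python
# Minimal sketch of a client for the PolyGuard OpenEnv HTTP contract.
# NOTE: the action payload shape is an illustrative assumption.
from dataclasses import dataclass


@dataclass
class PolyGuardHTTPClient:
    base_url: str

    def _endpoint(self, path: str) -> str:
        # Join base URL and endpoint path without doubling slashes.
        return self.base_url.rstrip("/") + "/" + path.lstrip("/")

    def reset_request(self) -> tuple[str, dict]:
        # POST /reset starts a new episode.
        return self._endpoint("reset"), {}

    def step_request(self, action: dict) -> tuple[str, dict]:
        # POST /step submits one constrained clinical action.
        return self._endpoint("step"), {"action": action}


client = PolyGuardHTTPClient("https://Vishwa-docs-polyguard-openenv.hf.space")
url, body = client.step_request({"type": "KEEP_REGIMEN"})
print(url)  # -> https://Vishwa-docs-polyguard-openenv.hf.space/step
```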
OpenEnv packaging lives at repo root:
- `openenv.yaml`
- `__init__.py`
- `client.py`
- `models.py`
- `server/app.py`
Each episode samples a patient/regimen scenario and a sub-environment:
- `DDI`
- `BANDIT_MINING`
- `REGIMEN_RISK`
- `PRECISION_DOSING`
- `LONGITUDINAL_DEPRESCRIBING`
- `WEB_SEARCH_MISSING_DATA`
- `ALTERNATIVE_SUGGESTION`
- `NEW_DRUG_DECOMPOSITION`
Difficulty tracks are available as easy, medium, and hard scenario sets.
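As a rough sketch, episode sampling over sub-environments and difficulty tracks might look like the following; uniform sampling is an assumption here, and the real environment may weight scenarios differently.

```python
# Illustrative episode sampler over the sub-environments and difficulty
# tracks listed above. Uniform sampling is an assumption for this sketch.
import random

SUB_ENVS = [
    "DDI", "BANDIT_MINING", "REGIMEN_RISK", "PRECISION_DOSING",
    "LONGITUDINAL_DEPRESCRIBING", "WEB_SEARCH_MISSING_DATA",
    "ALTERNATIVE_SUGGESTION", "NEW_DRUG_DECOMPOSITION",
]
DIFFICULTIES = ["easy", "medium", "hard"]


def sample_episode(rng: random.Random) -> dict:
    # One episode = one patient/regimen scenario + one sub-environment.
    return {
        "sub_env": rng.choice(SUB_ENVS),
        "difficulty": rng.choice(DIFFICULTIES),
    }


episode = sample_episode(random.Random(42))
print(episode)
```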
## Agent Capabilities
The agent stack is deliberately decomposed so reward, safety, and explanation can be inspected:
- Medication reconciliation
- Evidence retrieval and missing-data recovery
- Graph safety analysis for DDI and side effects
- Dosing guardrails
- Candidate generation
- Supervisor routing between regimen, dose, and review modes
- Planner policy selection
- Critic safety veto
- Explanation generation
- Contextual bandit ranking for policy-stack ablations
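For intuition, the supervisor/critic split can be caricatured as a router plus a veto layer. The state keys and the uncertainty threshold below are purely hypothetical; the real components are separate learned/rule-based modules.

```python
# Caricature of supervisor routing and the critic safety veto.
# State keys ("uncertainty", "renal_impairment", ...) and the 0.7
# threshold are hypothetical illustrations.

def route_mode(state: dict) -> str:
    # Supervisor: pick between regimen, dose, and review modes.
    if state.get("uncertainty", 0.0) > 0.7:
        return "review"   # abstain and request human review
    if state.get("renal_impairment") or state.get("dose_sensitive"):
        return "dose"     # precision-dosing pathway
    return "regimen"      # regimen-level optimization


def critic_veto(action: dict, known_ddi_pairs: set) -> bool:
    # Critic: veto any action that introduces a known drug-drug interaction.
    added = action.get("add_drug")
    current = action.get("current_drugs", [])
    return any((added, d) in known_ddi_pairs or (d, added) in known_ddi_pairs
               for d in current)


print(route_mode({"uncertainty": 0.9}))  # -> review
```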
## Tasks
PolyGuard evaluates these action-selection tasks:
- Find bad drug combinations and reduce DDI/polypharmacy side-effect risk.
- Recommend safe adds, substitutions, and alternatives.
- Optimize regimens under uncertainty.
- Produce taper/deprescribing sequences over time.
- Choose precision dosing actions when organ function or dose sensitivity matters.
- Fetch evidence when critical data is missing.
- Decompose a new drug into components for first-pass safety reasoning.
## Reward Model / Evaluation Logic
Rewards are verifier-backed and clamped to [0.001, 0.999]. The environment exposes 13 detailed reward columns and 4 primary channels:
- `safety_legality`
- `clinical_improvement`
- `dosing_quality`
- `process_integrity`
Reward logic combines:
- Legal action checks
- Safety delta and burden improvement
- Dosing quality
- Abstention quality under uncertainty
- Format compliance
- Process fidelity
- Explanation grounding
- Anti-cheat and timeout penalties
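A minimal sketch of the aggregation: a weighted combination over the four primary channels, clamped to the documented [0.001, 0.999] range. Only the clamp range and channel names come from this document; the weights and the linear combination are invented for illustration.

```python
# Sketch of verifier-backed reward aggregation. The weights and linear
# combination are illustrative assumptions; the clamp range is from the docs.

WEIGHTS = {
    "safety_legality": 0.4,
    "clinical_improvement": 0.3,
    "dosing_quality": 0.2,
    "process_integrity": 0.1,
}


def combine_reward(channels: dict[str, float], penalty: float = 0.0) -> float:
    # Weighted sum of channel scores minus anti-cheat/timeout penalties,
    # clamped so rewards never saturate at exactly 0 or 1.
    raw = sum(WEIGHTS[name] * channels.get(name, 0.0) for name in WEIGHTS)
    raw -= penalty
    return min(0.999, max(0.001, raw))


print(combine_reward({k: 1.0 for k in WEIGHTS}))             # -> 0.999
print(combine_reward({k: 0.0 for k in WEIGHTS}, penalty=1))  # -> 0.001
```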
Anti-hacking checks block repeated action loops, review abuse, keep-regimen abuse, candidate ID mismatches, parser exploit patterns, and unsafe no-op behavior on known holdout DDIs.
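One of these checks, the repeated-action-loop block, can be illustrated with a trivial detector; the window size of three identical consecutive actions is an invented threshold, not the environment's actual rule.

```python
# Toy version of the repeated-action-loop anti-hacking check.
# The threshold (3 identical consecutive actions) is an assumption.

def is_action_loop(action_history: list[str], max_repeats: int = 3) -> bool:
    # Flag an episode when the same action is replayed max_repeats times
    # in a row, which would otherwise farm reward without clinical progress.
    if len(action_history) < max_repeats:
        return False
    tail = action_history[-max_repeats:]
    return len(set(tail)) == 1


print(is_action_loop(["KEEP", "KEEP", "KEEP"]))  # -> True
```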
## Training and Post-Training Strategy
The intended pipeline is:
- Build data assets from local knowledge, synthetic patients, scenario rollouts, optional HF instruction data, optional DDI API augmentation, and optional web fallback.
- Run SFT with TRL and optional Unsloth/QLoRA acceleration to teach action-selection format.
- Run GRPO with environment-backed reward verification.
- Track per-component reward columns and sampled generations.
- Run policy-stack ablations against baselines.
- Merge/export adapters safely.
- Validate post-save inference from the exported artifact.
- Deploy the OpenEnv environment to Hugging Face Spaces.
Core commands:
```bash
cd polyguard-rl
bash scripts/bootstrap_venv.sh
.venv/bin/python scripts/bootstrap_data.py
.venv/bin/python scripts/build_training_corpus.py --profile small --with-local --with-synthetic --with-hf
.venv/bin/python scripts/train_sft_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --epochs 1 --max-steps 20 --use-unsloth
.venv/bin/python scripts/train_grpo_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --max-steps 20 --num-generations 2 --use-unsloth
.venv/bin/python scripts/merge_adapters_safe.py --adapter-dir checkpoints/sft_adapter --output-dir checkpoints/merged
.venv/bin/python scripts/test_inference_postsave.py --samples 3
.venv/bin/python scripts/evaluate_all.py
```
## Results
Tracked smoke/evaluation artifacts are mirrored in `docs/results/` because `outputs/` and `checkpoints/` are intentionally ignored.
Current smoke reports show the environment, evaluation, and plotting paths are wired, but final training is not yet judge-ready:
- `docs/results/sft_trl_run.json` currently records a fallback backend.
- `docs/results/grpo_trl_run.json` currently records an environment-reward fallback path.
- `docs/results/postsave_inference.json` currently uses fallback inference.
- `docs/results/improvement_report.json` currently records no positive improvement.
- `docs/results/hf_space_verification.json` is blocked until HF auth/deployment succeeds.
Final submission should replace these with real GPU/Colab TRL/Unsloth artifacts.
## Dataset Gathering
Implemented data generation and packaging covers:
- Normalized drug vocabulary and class tables
- Interaction graph edges
- Burden, taper, renal, hepatic, duplicate-therapy, and substitution rules
- Synthetic patients
- Easy/medium/hard scenario files
- Retrieval corpus and local evidence index
- Unified SFT and GRPO prompt corpora
The current local corpus summary is in `data/processed/training_corpus_summary.json` when generated.
## Deployment

Use the repository-local HF CLI entrypoint. The global `hf` command on this machine is known to be incompatible with its installed Typer version.
```bash
./.venv/bin/hf auth login
./.venv/bin/hf auth whoami
export HF_SPACE_REPO_ID="Vishwa-docs/polyguard-openenv"
bash scripts/deploy_space.sh --repo-id "$HF_SPACE_REPO_ID"
./.venv/bin/hf spaces info "$HF_SPACE_REPO_ID"
openenv validate --url "https://Vishwa-docs-polyguard-openenv.hf.space"
```
After deployment, save the successful Space info plus OpenEnv validation payload into `docs/results/hf_space_verification.json`.
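A minimal sketch of persisting that payload is below. The payload schema here is an assumption; match whatever `scripts/acceptance_gate.py` actually expects before relying on it.

```python
# Sketch of persisting the Space verification payload.
# The payload field names are assumptions, not the gate's real schema.
import json
import pathlib
from datetime import datetime, timezone

payload = {
    "space_repo_id": "Vishwa-docs/polyguard-openenv",
    "space_url": "https://Vishwa-docs-polyguard-openenv.hf.space",
    "openenv_validate": {"passed": True},
    "checked_at": datetime.now(timezone.utc).isoformat(),
}

out = pathlib.Path("docs/results/hf_space_verification.json")
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(payload, indent=2) + "\n")
print(out)
```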
## Strict Submission Gate
Non-strict local readiness:

```bash
.venv/bin/python scripts/acceptance_gate.py
```

Final submission readiness:

```bash
export POLYGUARD_ENFORCE_SUBMISSION_LINKS=true
.venv/bin/python scripts/acceptance_gate.py
```
Strict mode fails unless README links are real, tracked plots exist, HF Space verification passed, SFT/GRPO used real TRL/Unsloth paths, post-save inference uses the exported artifact, and measured improvement is positive.
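The strict-mode criteria reduce to an all-of conjunction over independent checks; a toy equivalent follows, with check identifiers invented to mirror the prose (the real gate lives in `scripts/acceptance_gate.py`).

```python
# Toy equivalent of strict-mode gating: every criterion must hold.
# The check names mirror the prose but are invented identifiers.

STRICT_CHECKS = [
    "readme_links_real",
    "tracked_plots_exist",
    "hf_space_verified",
    "sft_grpo_real_trl",
    "postsave_uses_exported_artifact",
    "improvement_positive",
]


def strict_gate(results: dict[str, bool]) -> bool:
    # Fail closed: a missing check counts as a failure.
    return all(results.get(name, False) for name in STRICT_CHECKS)


print(strict_gate({name: True for name in STRICT_CHECKS}))  # -> True
```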
## Documentation
- Architecture
- Environment Design
- Reward Design
- Training
- Evaluation
- Deployment
- Safety
- Agents
- Datasets
- Math
- Submission Checklist
## Future Work
- Medicine image/barcode ingestion for regimen capture
- Larger model GRPO sweeps
- Stronger real-world drug-label ingestion and calibration
- More clinician-facing explanation studies
- Published HF blog or short video walkthrough
## License
MIT

