---
title: PolyGuard OpenEnv
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8100
pinned: false
---

POLYGUARD-OPENENV

PolyGuard is an OpenEnv-compatible reinforcement-learning environment for polypharmacy safety, medication optimization, deprescribing, and precision dosing. The project turns medication decision-making into a stateful environment where an LLM agent observes a patient/regimen state, chooses constrained clinical actions, receives verifier-backed rewards, and improves through TRL/GRPO-style post-training.

Clinical safety note: this is a research environment and demo system for RL environment design. It is not a medical device and must not be used for patient care.

Submission Links

Current Readiness

Verified locally:

  • uv run pytest: 36 tests passed during the audit pass.
  • uv run openenv validate .: local OpenEnv packaging passed.
  • bash scripts/bootstrap_openenv.sh --runtime-check: runtime OpenEnv HTTP contract passed when localhost access was allowed.
  • npm run build in app/ui/frontend: production UI build passed.

Still required for final judge-ready submission:

  • Authenticate Hugging Face with ./.venv/bin/hf auth login.
  • Deploy and verify the HF Space.
  • Run real TRL/Unsloth SFT and GRPO on GPU/Colab so reports no longer show fallback paths.
  • Replace docs/results/hf_space_verification.json with a successful verification payload.
  • Regenerate final plots and reports with improvement_report.improved == true.
  • Run strict readiness: POLYGUARD_ENFORCE_SUBMISSION_LINKS=true ./.venv/bin/python scripts/acceptance_gate.py.

Problem Statement

Polypharmacy decisions are long-horizon, partially observable, and safety-critical. A useful LLM agent must do more than produce a plausible recommendation: it should identify drug-drug interaction risk, reason over comorbidities and labs, choose safe substitutions or deprescribing sequences, request review when uncertain, and expose why it acted.

PolyGuard targets the OpenEnv World Modeling / Professional Tasks theme, with multi-agent and self-improvement elements. It asks whether environment-backed feedback can make a model better at safe medication action selection than prompt-only or rule-only baselines.

Environment

The environment is implemented by PolyGuardEnv and exposed through FastAPI/OpenEnv-compatible endpoints:

  • POST /reset
  • POST /step
  • GET /state
  • GET /metadata
  • GET /schema
  • POST /mcp
  • GET /health
  • Backward-compatible aliases under /env/* plus /ws
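
A minimal episode loop against these endpoints might look like the following sketch. The payload field names (`observation`, `action`, `reward`, `done`) are assumptions about the response shape, not the confirmed schema, and the stub transport stands in for a real HTTP client such as requests or httpx.

```python
from typing import Callable

BASE_URL = "http://localhost:8100"  # matches app_port in the Space metadata

def run_episode(post: Callable[[str, dict], dict], max_steps: int = 5) -> float:
    """Drive one episode via POST /reset and POST /step.

    `post` is any callable that sends a JSON body to a path and returns the
    parsed JSON response; swap in a live HTTP client to hit a real server.
    """
    obs = post("/reset", {})
    total_reward = 0.0
    for _ in range(max_steps):
        # A real agent would choose an action from obs; the action name here
        # is an illustrative assumption, not a documented action type.
        result = post("/step", {"action": {"name": "KEEP_REGIMEN"}})
        total_reward += result.get("reward", 0.0)
        if result.get("done"):
            break
    return total_reward

# Stub transport so the sketch runs without a server.
def fake_post(path: str, body: dict) -> dict:
    if path == "/reset":
        return {"observation": {"patient": "demo"}}
    return {"reward": 0.5, "done": True}

print(run_episode(fake_post))  # prints 0.5 with the stub transport
```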

OpenEnv packaging lives at repo root:

  • openenv.yaml
  • __init__.py
  • client.py
  • models.py
  • server/app.py

Each episode samples a patient/regimen scenario and a sub-environment:

  • DDI
  • BANDIT_MINING
  • REGIMEN_RISK
  • PRECISION_DOSING
  • LONGITUDINAL_DEPRESCRIBING
  • WEB_SEARCH_MISSING_DATA
  • ALTERNATIVE_SUGGESTION
  • NEW_DRUG_DECOMPOSITION

Difficulty tracks are available as easy, medium, and hard scenario sets.
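
Episode sampling can be pictured as a draw over sub-environment and difficulty track. The names below mirror the lists above, but uniform sampling is an illustrative assumption; the real environment may weight tracks differently.

```python
import random

SUB_ENVS = [
    "DDI", "BANDIT_MINING", "REGIMEN_RISK", "PRECISION_DOSING",
    "LONGITUDINAL_DEPRESCRIBING", "WEB_SEARCH_MISSING_DATA",
    "ALTERNATIVE_SUGGESTION", "NEW_DRUG_DECOMPOSITION",
]
DIFFICULTIES = ["easy", "medium", "hard"]

def sample_scenario(seed=None):
    """Sample one (sub_env, difficulty) pair uniformly -- an assumption,
    not the project's actual scenario sampler."""
    rng = random.Random(seed)
    return {"sub_env": rng.choice(SUB_ENVS), "difficulty": rng.choice(DIFFICULTIES)}

scenario = sample_scenario(seed=7)
print(scenario["sub_env"] in SUB_ENVS and scenario["difficulty"] in DIFFICULTIES)
```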

Agent Capabilities

The agent stack is deliberately decomposed so reward, safety, and explanation can be inspected:

  • Medication reconciliation
  • Evidence retrieval and missing-data recovery
  • Graph safety analysis for DDI and side effects
  • Dosing guardrails
  • Candidate generation
  • Supervisor routing between regimen, dose, and review modes
  • Planner policy selection
  • Critic safety veto
  • Explanation generation
  • Contextual bandit ranking for policy-stack ablations
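
The contextual bandit ranker at the bottom of this stack could be sketched as epsilon-greedy over policy-stack arms; the arm names and the epsilon value below are illustrative assumptions, not the project's actual ablation setup.

```python
import random

class EpsilonGreedyRanker:
    """Rank policy-stack variants by running-mean reward, exploring with probability epsilon."""

    def __init__(self, arms, epsilon=0.1, seed=None):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {a: 0 for a in arms}
        self.means = {a: 0.0 for a in arms}

    def select(self):
        # Explore a random arm with probability epsilon, else exploit the best mean.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.counts))
        return max(self.means, key=self.means.get)

    def update(self, arm, reward):
        # Incremental mean update.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

ranker = EpsilonGreedyRanker(["full_stack", "no_critic", "planner_only"], epsilon=0.0, seed=0)
ranker.update("full_stack", 0.9)
ranker.update("no_critic", 0.4)
print(ranker.select())  # full_stack (greedy with epsilon=0)
```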

Tasks

PolyGuard evaluates these action-selection tasks:

  • Find bad drug combinations and reduce DDI/polypharmacy side-effect risk.
  • Recommend safe adds, substitutions, and alternatives.
  • Optimize regimens under uncertainty.
  • Produce taper/deprescribing sequences over time.
  • Choose precision dosing actions when organ function or dose sensitivity matters.
  • Fetch evidence when critical data is missing.
  • Decompose a new drug into components for first-pass safety reasoning.

Reward Model / Evaluation Logic

Rewards are verifier-backed and clamped to [0.001, 0.999]. The environment exposes 13 detailed reward columns and 4 primary channels:

  • safety_legality
  • clinical_improvement
  • dosing_quality
  • process_integrity

Reward logic combines:

  • Legal action checks
  • Safety delta and burden improvement
  • Dosing quality
  • Abstention quality under uncertainty
  • Format compliance
  • Process fidelity
  • Explanation grounding
  • Anti-cheat and timeout penalties
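
A hedged sketch of how such component scores might be combined into a clamped scalar reward. The channel names match the list above, but the uniform weights and simple weighted sum are assumptions, not the project's actual reward formula; only the [0.001, 0.999] clamp comes from the text.

```python
def clamp(x, lo=0.001, hi=0.999):
    """Clamp reward into [0.001, 0.999], as the environment specifies."""
    return max(lo, min(hi, x))

def combine_reward(channels, weights=None):
    """Weighted sum of per-channel scores; uniform weights are an assumption."""
    weights = weights or {k: 1.0 / len(channels) for k in channels}
    raw = sum(weights[k] * channels[k] for k in channels)
    return clamp(raw)

r = combine_reward({
    "safety_legality": 1.0,
    "clinical_improvement": 0.8,
    "dosing_quality": 0.6,
    "process_integrity": 1.0,
})
print(r)  # approximately 0.85 with uniform weights
```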

Anti-hacking checks block repeated action loops, review abuse, keep-regimen abuse, candidate ID mismatches, parser exploit patterns, and unsafe no-op behavior on known holdout DDIs.
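
One of these checks, flagging repeated action loops, could be sketched as a simple history window; the window size and repeat threshold below are illustrative assumptions, and the real check may also consider action arguments and state deltas.

```python
from collections import Counter

def repeated_action_loop(history, window=6, max_repeats=3):
    """Flag an episode when any single action dominates the recent window."""
    recent = history[-window:]
    if not recent:
        return False
    _, count = Counter(recent).most_common(1)[0]
    return count > max_repeats

print(repeated_action_loop(["REVIEW"] * 5))               # True: review-abuse pattern
print(repeated_action_loop(["SWAP", "TAPER", "REVIEW"]))  # False: no dominant action
```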

Training And Post-Training Strategy

The intended pipeline is:

  1. Build data assets from local knowledge, synthetic patients, scenario rollouts, optional HF instruction data, optional DDI API augmentation, and optional web fallback.
  2. Run SFT with TRL and optional Unsloth/QLoRA acceleration to teach action-selection format.
  3. Run GRPO with environment-backed reward verification.
  4. Track per-component reward columns and sampled generations.
  5. Run policy-stack ablations against baselines.
  6. Merge/export adapters safely.
  7. Validate post-save inference from the exported artifact.
  8. Deploy the OpenEnv environment to Hugging Face Spaces.

Core commands:

cd polyguard-rl
bash scripts/bootstrap_venv.sh
.venv/bin/python scripts/bootstrap_data.py
.venv/bin/python scripts/build_training_corpus.py --profile small --with-local --with-synthetic --with-hf
.venv/bin/python scripts/train_sft_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --epochs 1 --max-steps 20 --use-unsloth
.venv/bin/python scripts/train_grpo_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --max-steps 20 --num-generations 2 --use-unsloth
.venv/bin/python scripts/merge_adapters_safe.py --adapter-dir checkpoints/sft_adapter --output-dir checkpoints/merged
.venv/bin/python scripts/test_inference_postsave.py --samples 3
.venv/bin/python scripts/evaluate_all.py

Results

Tracked smoke/evaluation artifacts are mirrored in docs/results/ because outputs/ and checkpoints/ are intentionally git-ignored.

Plots tracked in docs/results/: average reward; policy-stack average reward.

Current smoke reports show the environment, evaluation, and plotting paths are wired, but final training is not yet judge-ready:

  • docs/results/sft_trl_run.json currently records a fallback backend.
  • docs/results/grpo_trl_run.json currently records an environment-reward fallback path.
  • docs/results/postsave_inference.json currently uses fallback inference.
  • docs/results/improvement_report.json currently records no positive improvement.
  • docs/results/hf_space_verification.json is blocked until HF auth/deployment succeeds.

Final submission should replace these with real GPU/Colab TRL/Unsloth artifacts.

Dataset Gather

Implemented data generation and packaging covers:

  • Normalized drug vocabulary and class tables
  • Interaction graph edges
  • Burden, taper, renal, hepatic, duplicate-therapy, and substitution rules
  • Synthetic patients
  • Easy/medium/hard scenario files
  • Retrieval corpus and local evidence index
  • Unified SFT and GRPO prompt corpora

The current local corpus summary is in data/processed/training_corpus_summary.json when generated.

Deployment

Use the repository-local HF CLI entrypoint. The global hf command on this machine is known to be incompatible with its installed Typer version.

./.venv/bin/hf auth login
./.venv/bin/hf auth whoami
export HF_SPACE_REPO_ID="Vishwa-docs/polyguard-openenv"
bash scripts/deploy_space.sh --repo-id "$HF_SPACE_REPO_ID"
./.venv/bin/hf spaces info "$HF_SPACE_REPO_ID"
openenv validate --url "https://Vishwa-docs-polyguard-openenv.hf.space"

After deployment, save the successful Space info plus OpenEnv validation payload into docs/results/hf_space_verification.json.
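
Merging the two payloads into that tracked artifact could look like the following sketch; the field names inside the combined payload are assumptions about what the Space info and OpenEnv validation output contain.

```python
import json
from pathlib import Path

def save_verification(space_info, validation,
                      out_path="docs/results/hf_space_verification.json"):
    """Merge Space info and OpenEnv validation output into one tracked JSON artifact."""
    payload = {"space_info": space_info, "openenv_validation": validation}
    path = Path(out_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, indent=2))
    return path
```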

Strict Submission Gate

Non-strict local readiness:

.venv/bin/python scripts/acceptance_gate.py

Final submission readiness:

export POLYGUARD_ENFORCE_SUBMISSION_LINKS=true
.venv/bin/python scripts/acceptance_gate.py

Strict mode fails unless README links are real, tracked plots exist, HF Space verification passed, SFT/GRPO used real TRL/Unsloth paths, post-save inference uses the exported artifact, and measured improvement is positive.
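
The gate's strict/non-strict split can be sketched as follows. The check names mirror the conditions in the paragraph above, but their grouping into core versus strict-only sets is an assumption about the real scripts/acceptance_gate.py logic.

```python
import os

def strict_mode_enabled(env=None):
    """True when POLYGUARD_ENFORCE_SUBMISSION_LINKS is set to a truthy string."""
    env = env if env is not None else os.environ
    return env.get("POLYGUARD_ENFORCE_SUBMISSION_LINKS", "").lower() in {"1", "true", "yes"}

def gate(checks, strict):
    """Non-strict mode requires only core checks; strict mode requires all of them."""
    core = ["tests_pass", "openenv_packaging"]
    strict_only = ["submission_links", "hf_space_verified",
                   "real_trl_paths", "improvement_positive"]
    required = core + (strict_only if strict else [])
    return all(checks.get(name, False) for name in required)

checks = {"tests_pass": True, "openenv_packaging": True, "submission_links": False}
print(gate(checks, strict=False))  # True
print(gate(checks, strict=True))   # False until strict-only checks pass
```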

Documentation

Future Work

  • Medicine image/barcode ingestion for regimen capture
  • Larger model GRPO sweeps
  • Stronger real-world drug-label ingestion and calibration
  • More clinician-facing explanation studies
  • Published HF blog or short video walkthrough

License

MIT