# POLYGUARD-OPENENV

PolyGuard is an OpenEnv-compatible reinforcement-learning environment for **polypharmacy safety, medication optimization, deprescribing, and precision dosing**. The project turns medication decision-making into a stateful environment where an LLM agent observes a patient/regimen state, chooses constrained clinical actions, receives verifier-backed reward, and improves through TRL/GRPO-style post-training.

> Clinical safety note: this is a research environment and demo system for RL environment design. It is not a medical device and must not be used for patient care.

## Submission Links

- GitHub Repo URL: [https://github.com/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK](https://github.com/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK)
- HF Space URL: [https://huggingface.co/spaces/TheJackBright/polyguard-openenv](https://huggingface.co/spaces/TheJackBright/polyguard-openenv)
- Colab Notebook URL: [https://colab.research.google.com/github/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK/blob/master/polyguard-rl/notebooks/09_training_loop.ipynb](https://colab.research.google.com/github/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK/blob/master/polyguard-rl/notebooks/09_training_loop.ipynb)
- YouTube Video URL: not used for this submission; the Hugging Face blog URL below is the selected story artifact.
- Hugging Face Blog URL: [https://huggingface.co/blog/TheJackBright/polyguard-openenv](https://huggingface.co/blog/TheJackBright/polyguard-openenv) *(story target; currently unpublished/404 until `docs/hf_blog_draft.md` is published or this link is replaced)*
|
|
## Current Readiness

Verified locally and against the live Space:

- `uv run pytest`: 49 tests passed.
- `uv run openenv validate .`: local OpenEnv packaging passed.
- `POLYGUARD_ENFORCE_SUBMISSION_LINKS=true uv run python scripts/acceptance_gate.py`: strict acceptance gate passed.
- `bash scripts/bootstrap_openenv.sh --runtime-check`: runtime OpenEnv HTTP contract passed when localhost access was allowed.
- `npm run build` in `app/ui/frontend`: production UI build passed.
- `scripts/train_sft_trl.py`: non-fallback TRL Transformers SFT artifact generated for a tiny local compliance run.
- `scripts/train_grpo_trl.py`: non-fallback TRL Transformers GRPO artifact generated with environment-backed reward verification.
- `scripts/test_inference_postsave.py`: post-save inference loads the exported merged artifact, not the fallback policy.
- `scripts/evaluate_compare_runs.py`: current report shows positive average-reward improvement against the no-change baseline.
- `scripts/deploy_training_space.py`: deploys the full Hugging Face A10G training Space and Qwen sweep runner.
- `scripts/generate_hf_training_report.py`: writes SFT-vs-GRPO charts, Qwen sweep charts, and anti-hacking/overfit reports.
- `scripts/generate_submission_evidence.py`: writes the Qwen 0.5B/1.5B submission evidence bundle without retraining.
- `scripts/deploy_evidence_space.py`: deploys a separate HF Space for evaluation-only evidence generation so the running training Space is not interrupted.
- `scripts/activate_sweep_model.py`: activates pulled Qwen sweep artifacts for the API/UI inference path.
- `curl -s https://thejackbright-polyguard-openenv.hf.space/health`: live Space returned `{"status":"healthy"}` on April 26, 2026.
- `curl -s https://thejackbright-polyguard-openenv.hf.space/metadata`: live Space reported `version: 0.2.0`, reward range `[0.001, 0.999]`, and OpenEnv simulation metadata.
|
|
Still required for final judge-ready submission:

- Publish the external Hugging Face blog story URL, or replace it with a real YouTube/slide/blog URL. The current blog URL was checked on April 26, 2026 and returns 404.
- If you want to claim a full public per-model GRPO sweep, pull and mirror those private artifacts first. The current tracked evidence is a 3-model SFT-baseline sweep plus a top-level environment-backed GRPO run; private HF training artifact repos require authentication and should not be used as public judge links.
- After the story artifact is published, run `uv run python scripts/validate_submission_links.py` to catch any remaining broken README URLs.
|
|
## Problem Statement

Polypharmacy decisions are long-horizon, partially observable, and safety-critical. A useful LLM agent must do more than produce a plausible recommendation: it should identify drug-drug interaction risk, reason over comorbidities and labs, choose safe substitutions or deprescribing sequences, request review when uncertain, and expose why it acted.

PolyGuard targets the OpenEnv **World Modeling / Professional Tasks** theme, with multi-agent and self-improvement elements. It asks whether environment-backed feedback can make a model better at safe medication action selection than prompt-only or rule-only baselines.
|
|
## Environment

The environment is implemented by `PolyGuardEnv` and exposed through FastAPI/OpenEnv-compatible endpoints:

- `POST /reset`
- `POST /step`
- `GET /state`
- `GET /metadata`
- `GET /schema`
- `POST /mcp`
- `GET /health`
- Backward-compatible aliases under `/env/*` plus `/ws`
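A minimal client for these endpoints can be sketched as below. The payload shapes (an empty `reset` body, an `action` key for `step`) are assumptions for illustration; the real request/response contract lives in `client.py` and `GET /schema`.

```python
import json
import urllib.request


class PolyGuardClient:
    """Minimal HTTP sketch for the endpoints above (not the repo's client.py).

    Payload shapes here are assumptions; consult GET /schema for the real
    request/response contract.
    """

    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")

    def _post(self, path: str, payload: dict) -> dict:
        req = urllib.request.Request(
            f"{self.base_url}{path}",
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    def reset(self) -> dict:
        # Start a new episode; returns the initial observation.
        return self._post("/reset", {})

    def step(self, action: dict) -> dict:
        # Apply one constrained clinical action; returns observation and reward.
        return self._post("/step", {"action": action})
```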
|
|
OpenEnv packaging lives at repo root:

- `openenv.yaml`
- `__init__.py`
- `client.py`
- `models.py`
- `server/app.py`

Each episode samples a patient/regimen scenario and a sub-environment:

- `DDI`
- `BANDIT_MINING`
- `REGIMEN_RISK`
- `PRECISION_DOSING`
- `LONGITUDINAL_DEPRESCRIBING`
- `WEB_SEARCH_MISSING_DATA`
- `ALTERNATIVE_SUGGESTION`
- `NEW_DRUG_DECOMPOSITION`

Difficulty tracks are available as easy, medium, and hard scenario sets.
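Per-episode sampling can be illustrated with a uniform sketch; the environment's actual sampling weights and scenario pairing logic are not shown here.

```python
import random

# Sub-environments and difficulty tracks as listed above.
SUB_ENVS = [
    "DDI",
    "BANDIT_MINING",
    "REGIMEN_RISK",
    "PRECISION_DOSING",
    "LONGITUDINAL_DEPRESCRIBING",
    "WEB_SEARCH_MISSING_DATA",
    "ALTERNATIVE_SUGGESTION",
    "NEW_DRUG_DECOMPOSITION",
]
DIFFICULTIES = ["easy", "medium", "hard"]


def sample_episode_config(rng=None):
    """Uniform illustrative sampling; the real environment's weights differ."""
    rng = rng or random.Random()
    return {
        "sub_env": rng.choice(SUB_ENVS),
        "difficulty": rng.choice(DIFFICULTIES),
    }
```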
|
|
## Agent Capabilities

The agent stack is deliberately decomposed so reward, safety, and explanation can be inspected:

- Medication reconciliation
- Evidence retrieval and missing-data recovery
- Graph safety analysis for DDI and side effects
- Dosing guardrails
- Candidate generation
- Supervisor routing between regimen, dose, and review modes
- Planner policy selection
- Critic safety veto
- Explanation generation
- Contextual bandit ranking for policy-stack ablations

## Tasks

PolyGuard evaluates these action-selection tasks:

- Find bad drug combinations and reduce DDI/polypharmacy side-effect risk.
- Recommend safe adds, substitutions, and alternatives.
- Optimize regimens under uncertainty.
- Produce taper/deprescribing sequences over time.
- Choose precision dosing actions when organ function or dose sensitivity matters.
- Fetch evidence when critical data is missing.
- Decompose a new drug into components for first-pass safety reasoning.
|
## Reward Model / Evaluation Logic

Rewards are verifier-backed and clamped to `[0.001, 0.999]`. The environment exposes 13 detailed reward columns and 4 primary channels:

- `safety_legality`
- `clinical_improvement`
- `dosing_quality`
- `process_integrity`

Reward logic combines:

- Legal action checks
- Safety delta and burden improvement
- Dosing quality
- Abstention quality under uncertainty
- Format compliance
- Process fidelity
- Explanation grounding
- Anti-cheat and timeout penalties
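An equal-weight combination of the four primary channels, clamped to the documented range, can be sketched as follows. The weighting is hypothetical; the real verifier aggregates 13 detailed reward columns.

```python
def combine_reward(channels, weights=None, lo=0.001, hi=0.999):
    """Weighted combination of the primary channels, clamped to [lo, hi].

    Equal weights are an illustrative assumption, not the environment's
    actual reward formula.
    """
    weights = weights or {name: 1.0 / len(channels) for name in channels}
    raw = sum(value * weights[name] for name, value in channels.items())
    return max(lo, min(hi, raw))
```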
|
|
Anti-hacking checks block repeated action loops, review abuse, keep-regimen abuse, candidate ID mismatches, parser exploit patterns, and unsafe no-op behavior on known holdout DDIs.
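One of these checks, flagging repeated action loops, can be sketched like this; the repeat threshold is illustrative, not the environment's tuned value.

```python
def is_action_loop(history, min_repeats=3):
    """Flag an episode whose most recent action repeats min_repeats times
    in a row. The threshold here is an illustrative assumption."""
    if len(history) < min_repeats:
        return False
    tail = history[-min_repeats:]
    return all(action == tail[-1] for action in tail)
```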
|
|
## Training And Post-Training Strategy

The intended pipeline is:

1. Build data assets from local knowledge, synthetic patients, scenario rollouts, optional HF instruction data, optional DDI API augmentation, and optional web fallback.
2. Run SFT with TRL and optional Unsloth/QLoRA acceleration to teach action-selection format.
3. Run GRPO with environment-backed reward verification.
4. Track per-component reward columns and sampled generations.
5. Run policy-stack ablations against baselines.
6. Merge/export adapters safely.
7. Validate post-save inference from the exported artifact.
8. Deploy the OpenEnv environment to Hugging Face Spaces.

Core commands:

```bash
cd polyguard-rl
bash scripts/bootstrap_venv.sh
.venv/bin/python scripts/bootstrap_data.py
.venv/bin/python scripts/build_training_corpus.py --profile small --with-local --with-synthetic --with-hf
.venv/bin/python scripts/train_sft_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --epochs 1 --max-steps 20 --report-path outputs/reports/sft_trl_run.json --use-unsloth
.venv/bin/python scripts/train_grpo_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --max-steps 20 --num-generations 2 --use-unsloth
.venv/bin/python scripts/merge_adapters_safe.py --adapter-dir checkpoints/sft_adapter --output-dir checkpoints/merged
.venv/bin/python scripts/test_inference_postsave.py --samples 3
.venv/bin/python scripts/evaluate_all.py
```

Optional full GPU training uses Hugging Face Spaces, not local Ollama:

```bash
export HF_TOKEN="<write-token>"
.venv/bin/python scripts/deploy_training_space.py \
  --repo-id TheJackBright/polyguard-openenv-training-full \
  --artifact-repo-id TheJackBright/polyguard-openenv-training-full-artifacts \
  --hardware a10g-large \
  --model-sweep Qwen/Qwen2.5-0.5B-Instruct,Qwen/Qwen2.5-1.5B-Instruct,Qwen/Qwen2.5-3B-Instruct \
  --sft-epochs 2 \
  --grpo-epochs 1 \
  --sft-max-steps 0 \
  --grpo-max-steps 0 \
  --grpo-max-prompts 0
.venv/bin/python scripts/pull_training_artifacts.py \
  --artifact-repo-id TheJackBright/polyguard-openenv-training-full-artifacts
```

`max_steps <= 0` means full-epoch training over the corpus. The Space uploads per-model SFT/GRPO reports, histories, adapters, post-save inference outputs, and comparison charts under `outputs/reports/sweeps/`, `outputs/plots/`, `docs/results/`, and `checkpoints/sweeps/`. Those artifact repositories are private/authenticated by design; mirror the final public evidence into `docs/results/` before claiming it in the submission story.
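The `max_steps` convention above can be expressed as a small helper; the function name and epoch arithmetic are illustrative, not copied from the training scripts.

```python
def resolve_max_steps(max_steps, steps_per_epoch, epochs):
    """Sketch of the convention above: a non-positive max_steps means
    full-epoch training over the corpus; a positive value caps steps."""
    if max_steps <= 0:
        return steps_per_epoch * epochs
    return max_steps
```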
|
|
Current remote training/evidence links:

- Training Space: [https://huggingface.co/spaces/TheJackBright/polyguard-openenv-training-full](https://huggingface.co/spaces/TheJackBright/polyguard-openenv-training-full)
- Training artifact repo: [https://huggingface.co/TheJackBright/polyguard-openenv-training-full-artifacts](https://huggingface.co/TheJackBright/polyguard-openenv-training-full-artifacts)
- Evidence Space target: [https://huggingface.co/spaces/TheJackBright/polyguard-openenv-evidence](https://huggingface.co/spaces/TheJackBright/polyguard-openenv-evidence)
- Implementation-ready active model bundle: [https://huggingface.co/TheJackBright/polyguard-openenv-training-full-artifacts/tree/main/usable_model_bundles/local-qwen-0-5b-active-smoke](https://huggingface.co/TheJackBright/polyguard-openenv-training-full-artifacts/tree/main/usable_model_bundles/local-qwen-0-5b-active-smoke)
- Qwen 0.5B/1.5B evidence bundle: `docs/results/submission_evidence_qwen_0_5b_1_5b/`
- Zipped evidence bundle: `submission_bundle/qwen_0_5b_1_5b_evidence.zip`
- Local active model bundle: `submission_bundle/model_artifacts/local-qwen-0-5b-active-smoke/`
- Local active model zip: `submission_bundle/model_artifacts/local-qwen-0-5b-active-smoke.zip`
|
|
Generate the Qwen 0.5B/1.5B submission evidence bundle without retraining:

```bash
.venv/bin/python scripts/generate_submission_evidence.py \
  --models qwen-qwen2-5-0-5b-instruct,qwen-qwen2-5-1-5b-instruct \
  --artifact-repo-id TheJackBright/polyguard-openenv-training-full-artifacts \
  --training-space-url https://thejackbright-polyguard-openenv-training-full.hf.space \
  --episodes 8
```

Deploy the evidence-only HF Space:

```bash
export HF_TOKEN="<write-token>"
.venv/bin/python scripts/deploy_evidence_space.py \
  --repo-id TheJackBright/polyguard-openenv-evidence \
  --artifact-repo-id TheJackBright/polyguard-openenv-training-full-artifacts \
  --training-space-url https://thejackbright-polyguard-openenv-training-full.hf.space \
  --models qwen-qwen2-5-0-5b-instruct,qwen-qwen2-5-1-5b-instruct \
  --hardware cpu-basic
```

As of the April 26, 2026 live status pull, Qwen 0.5B and 1.5B SFT, GRPO, GRPO post-save inference, and policy ablation have completed on the training Space. The artifact repo still exposes no meaningful files beyond `.gitattributes`, so per-run GRPO curves/checkpoints are labeled `remote_completed_pending_artifact_upload` in the evidence bundle until upload finishes.
|
|
The current active implementation bundle has been pushed separately so the product can be tested immediately while the full sweep upload is pending. It contains `grpo_adapter`, `sft_adapter`, `merged`, active manifests, and active-model reports:

```bash
export HF_TOKEN="$(cat ~/.cache/huggingface/token)"
./.venv/bin/hf download TheJackBright/polyguard-openenv-training-full-artifacts \
  --repo-type model \
  --include 'usable_model_bundles/local-qwen-0-5b-active-smoke/**' \
  --local-dir ./hf_artifacts
```

Restore that bundle into the app:

```bash
cp -R hf_artifacts/usable_model_bundles/local-qwen-0-5b-active-smoke/checkpoints/grpo_adapter checkpoints/grpo_adapter
cp -R hf_artifacts/usable_model_bundles/local-qwen-0-5b-active-smoke/checkpoints/sft_adapter checkpoints/sft_adapter
cp -R hf_artifacts/usable_model_bundles/local-qwen-0-5b-active-smoke/checkpoints/merged checkpoints/merged
mkdir -p checkpoints/active
cp hf_artifacts/usable_model_bundles/local-qwen-0-5b-active-smoke/manifests/active_model_manifest.json checkpoints/active/active_model_manifest.json
```
|
|
To pull only the Qwen 0.5B run when the artifact repo has uploaded it:

```bash
.venv/bin/python scripts/pull_sweep_artifacts.py \
  --artifact-repo-id TheJackBright/polyguard-openenv-training-full-artifacts \
  --run-id qwen-qwen2-5-0-5b-instruct
.venv/bin/python scripts/activate_sweep_model.py \
  --source sweep \
  --run-id qwen-qwen2-5-0-5b-instruct \
  --preferred-artifact grpo_adapter
```
|
|
To pull Qwen 1.5B once available:

```bash
.venv/bin/python scripts/pull_sweep_artifacts.py \
  --artifact-repo-id TheJackBright/polyguard-openenv-training-full-artifacts \
  --run-id qwen-qwen2-5-1-5b-instruct
.venv/bin/python scripts/activate_sweep_model.py \
  --source sweep \
  --run-id qwen-qwen2-5-1-5b-instruct \
  --preferred-artifact grpo_adapter
```
|
|
Current local product activation uses the available top-level Qwen 0.5B GRPO smoke artifacts:

```bash
.venv/bin/python scripts/activate_sweep_model.py \
  --source top-level \
  --run-id qwen-qwen2-5-0-5b-instruct \
  --preferred-artifact grpo_adapter \
  --label local-qwen-0.5b-active-smoke
curl http://127.0.0.1:8200/policy/model_status
```

The API/UI inference path checks `checkpoints/active/active_model_manifest.json`, prefers the GRPO adapter, and falls back to the merged Qwen 0.5B artifact if an adapter/base-model load is unavailable. The UI shows the active model chip on the Patient Workbench.
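That preference order can be sketched as follows; the `preferred_artifact` manifest field name is an assumption for illustration, not the manifest's confirmed schema.

```python
import json
from pathlib import Path


def select_inference_artifact(checkpoint_root="checkpoints"):
    """Sketch of the fallback order described above: read the active-model
    manifest, prefer the adapter it names (GRPO by default), and fall back
    to the merged artifact. The manifest field name is an assumption."""
    root = Path(checkpoint_root)
    manifest_path = root / "active" / "active_model_manifest.json"
    if manifest_path.exists():
        manifest = json.loads(manifest_path.read_text())
        preferred = root / manifest.get("preferred_artifact", "grpo_adapter")
        if preferred.exists():
            return preferred
    # Fallback when no adapter/base-model load is available.
    return root / "merged"
```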
|
|
## Results

Tracked smoke/evaluation artifacts are mirrored in `docs/results/` because `outputs/` and `checkpoints/` are intentionally ignored.





Current tracked reports show the environment, evaluation, plotting, SFT, GRPO, and non-fallback inference paths are wired:

- `docs/results/sft_trl_run.json` records `backend: trl_transformers`.
- `docs/results/grpo_trl_run.json` records `status: ok`, `backend: trl_transformers`, and an adapter artifact path.
- `docs/results/postsave_inference.json` records `model_source: merged`.
- `docs/results/improvement_report.json` records `improved: true` against the no-change baseline.
- `docs/results/hf_space_verification.json` records the latest HF Space validation result.
- `docs/results/hf_sweep_summary.json` currently documents a 3-model SFT-baseline sweep. Do not claim a full public per-model GRPO sweep unless those private artifacts are pulled, mirrored, and documented.
- `docs/results/anti_hacking_overfit_report.json` and tracked charts cover reward, legality, success, process fidelity, SFT-vs-GRPO comparison, SFT loss, GRPO reward, Qwen model size comparison, reward components, anti-cheat/failure rate, train-vs-holdout gap, inference validity/reward, and inference latency/validity.
- `docs/results/submission_evidence_qwen_0_5b_1_5b/` contains the current Qwen 0.5B/1.5B submission bundle with SFT curves, live HF stage-duration evidence, basic-LLM-vs-PolyGuard verifier comparison, action traces, failure cases, and explicit pending markers for GRPO files not yet uploaded by the private artifact repo.

For best storytelling, replace the tiny local compliance artifacts with the larger Colab/HF GPU run before the final pitch if time permits.
|
|
## Dataset Gathering

Implemented data generation and packaging covers:

- Normalized drug vocabulary and class tables
- Interaction graph edges
- Burden, taper, renal, hepatic, duplicate-therapy, and substitution rules
- Synthetic patients
- Easy/medium/hard scenario files
- Retrieval corpus and local evidence index
- Unified SFT and GRPO prompt corpora

Once generated, the current local corpus summary is written to `data/processed/training_corpus_summary.json`.
|
|
## Deployment

Use the repository-local HF CLI entrypoint; the globally installed `hf` command on this machine is incompatible with the Typer version installed alongside it.

```bash
./.venv/bin/hf auth login
./.venv/bin/hf auth whoami
export HF_SPACE_REPO_ID="TheJackBright/polyguard-openenv"
uv run python scripts/deploy_space_api.py --repo-id "$HF_SPACE_REPO_ID"
uv run python -c "from huggingface_hub import HfApi; print(HfApi().space_info('$HF_SPACE_REPO_ID').id)"
openenv validate --url "https://thejackbright-polyguard-openenv.hf.space"
```

After deployment, save the successful Space info plus the OpenEnv validation payload into `docs/results/hf_space_verification.json`.
|
|
## Strict Submission Gate

Non-strict local readiness:

```bash
.venv/bin/python scripts/acceptance_gate.py
```

Final submission readiness:

```bash
export POLYGUARD_ENFORCE_SUBMISSION_LINKS=true
.venv/bin/python scripts/acceptance_gate.py
```

As of April 26, 2026, strict readiness passes locally. This gate validates required files, artifacts, result assets, and submission link presence; it does not perform live HTTP status checks. Use the optional link validator after publishing the story artifact:

```bash
uv run python scripts/validate_submission_links.py
```

Strict mode fails unless README links are real, tracked plots exist, HF Space verification passed, SFT/GRPO used real TRL/Unsloth paths, post-save inference uses the exported artifact, and measured improvement is positive.
|
|
## Documentation

- [Architecture](docs/architecture.md)
- [Environment Design](docs/environment_design.md)
- [Reward Design](docs/reward_design.md)
- [Training](docs/training.md)
- [Evaluation](docs/evaluation.md)
- [Deployment](docs/deployment.md)
- [Safety](docs/safety.md)
- [Agents](docs/agents.md)
- [Datasets](docs/datasets.md)
- [Math](docs/math.md)
- [Submission Checklist](docs/submission_checklist.md)
- [Participant Guide Traceability](docs/participant_guide_traceability.md)
- [HF Blog Draft](docs/hf_blog_draft.md)
|
|
## Future Work

- Medicine image/barcode ingestion for regimen capture
- Larger model GRPO sweeps
- Stronger real-world drug-label ingestion and calibration
- More clinician-facing explanation studies
- Larger Qwen/Unsloth GPU run for stronger final curves

## License

MIT
|
|