---
title: PolyGuard OpenEnv
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8100
pinned: false
---
# PolyGuard OpenEnv

PolyGuard is an OpenEnv-compatible reinforcement-learning environment for **polypharmacy safety, medication optimization, deprescribing, and precision dosing**. The project turns medication decision-making into a stateful environment where an LLM agent observes a patient/regimen state, chooses constrained clinical actions, receives verifier-backed reward, and improves through TRL/GRPO-style post-training.

> Clinical safety note: this is a research environment and demo system for RL environment design. It is not a medical device and must not be used for patient care.

## Submission Links

- GitHub Repo URL: [https://github.com/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK](https://github.com/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK)
- HF Space URL: [https://huggingface.co/spaces/Vishwa-docs/polyguard-openenv](https://huggingface.co/spaces/Vishwa-docs/polyguard-openenv) *(deployment target; verify before final submission)*
- Colab Notebook URL: [https://colab.research.google.com/github/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK/blob/master/polyguard-rl/notebooks/09_training_loop.ipynb](https://colab.research.google.com/github/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK/blob/master/polyguard-rl/notebooks/09_training_loop.ipynb)
- YouTube Video URL: not used for this submission; the Hugging Face blog URL below is the selected story artifact.
- Hugging Face Blog URL: [https://huggingface.co/blog/Vishwa-docs/polyguard-openenv](https://huggingface.co/blog/Vishwa-docs/polyguard-openenv) *(story target; publish before final submission)*
## Current Readiness

Verified locally:

- `uv run pytest`: 36 tests passed during the audit pass.
- `uv run openenv validate .`: local OpenEnv packaging passed.
- `bash scripts/bootstrap_openenv.sh --runtime-check`: runtime OpenEnv HTTP contract passed when localhost access was allowed.
- `npm run build` in `app/ui/frontend`: production UI build passed.

Still required for a judge-ready final submission:

- Authenticate with Hugging Face via `./.venv/bin/hf auth login`.
- Deploy and verify the HF Space.
- Run real TRL/Unsloth SFT and GRPO on GPU/Colab so the reports no longer show fallback paths.
- Replace `docs/results/hf_space_verification.json` with a successful verification payload.
- Regenerate final plots and reports with `improvement_report.improved == true`.
- Run the strict readiness gate: `POLYGUARD_ENFORCE_SUBMISSION_LINKS=true ./.venv/bin/python scripts/acceptance_gate.py`.

## Problem Statement

Polypharmacy decisions are long-horizon, partially observable, and safety-critical. A useful LLM agent must do more than produce a plausible recommendation: it should identify drug-drug interaction risk, reason over comorbidities and labs, choose safe substitutions or deprescribing sequences, request review when uncertain, and expose why it acted.

PolyGuard targets the OpenEnv **World Modeling / Professional Tasks** theme, with multi-agent and self-improvement elements. It asks whether environment-backed feedback can make a model better at safe medication action selection than prompt-only or rule-only baselines.
## Environment

The environment is implemented by `PolyGuardEnv` and exposed through FastAPI/OpenEnv-compatible endpoints:

- `POST /reset`
- `POST /step`
- `GET /state`
- `GET /metadata`
- `GET /schema`
- `POST /mcp`
- `GET /health`
- Backward-compatible aliases under `/env/*`, plus `/ws`
OpenEnv packaging lives at the repo root:

- `openenv.yaml`
- `__init__.py`
- `client.py`
- `models.py`
- `server/app.py`

Each episode samples a patient/regimen scenario and a sub-environment:

- `DDI`
- `BANDIT_MINING`
- `REGIMEN_RISK`
- `PRECISION_DOSING`
- `LONGITUDINAL_DEPRESCRIBING`
- `WEB_SEARCH_MISSING_DATA`
- `ALTERNATIVE_SUGGESTION`
- `NEW_DRUG_DECOMPOSITION`

Difficulty tracks are available as easy, medium, and hard scenario sets.
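A minimal sketch of the per-episode sampling described above, assuming a seeded RNG for reproducible rollouts (the actual sampler lives inside `PolyGuardEnv`; the `patient_id` format here is invented for illustration):

```python
import random

SUB_ENVS = [
    "DDI", "BANDIT_MINING", "REGIMEN_RISK", "PRECISION_DOSING",
    "LONGITUDINAL_DEPRESCRIBING", "WEB_SEARCH_MISSING_DATA",
    "ALTERNATIVE_SUGGESTION", "NEW_DRUG_DECOMPOSITION",
]
DIFFICULTIES = ["easy", "medium", "hard"]

def sample_episode(seed: int) -> dict:
    """Deterministically pick a sub-environment and difficulty for one episode."""
    rng = random.Random(seed)  # per-episode RNG so rollouts are reproducible
    return {
        "sub_env": rng.choice(SUB_ENVS),
        "difficulty": rng.choice(DIFFICULTIES),
        "patient_id": f"synthetic-{rng.randrange(10_000):05d}",
    }
```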
## Agent Capabilities

The agent stack is deliberately decomposed so that reward, safety, and explanation can each be inspected:

- Medication reconciliation
- Evidence retrieval and missing-data recovery
- Graph safety analysis for DDIs and side effects
- Dosing guardrails
- Candidate generation
- Supervisor routing between regimen, dose, and review modes
- Planner policy selection
- Critic safety veto
- Explanation generation
- Contextual bandit ranking for policy-stack ablations
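To make the decomposition concrete, here is a minimal sketch of the critic safety veto: it rejects a planner candidate that would introduce a known interaction edge. The drug names, candidate shape, and edge representation are illustrative; the real critic consults the environment's interaction graph.

```python
def critic_veto(candidate: dict, interaction_edges: set) -> tuple[bool, str]:
    """Return (vetoed, reason) for a planner candidate.

    A candidate is vetoed when the drug it adds forms a known DDI pair
    with anything already in the regimen. Edges are unordered, so each
    pair is stored as a frozenset.
    """
    new_drug = candidate.get("add")
    if new_drug is None:
        return False, "no new drug introduced"
    for existing in candidate.get("current_regimen", []):
        if frozenset((new_drug, existing)) in interaction_edges:
            return True, f"DDI risk: {new_drug} + {existing}"
    return False, "no known interaction"
```

Because the veto is a pure function of the candidate and the edge set, it can be unit-tested and audited independently of the planner.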
## Tasks

PolyGuard evaluates these action-selection tasks:

- Detect harmful drug combinations and reduce DDI/polypharmacy side-effect risk.
- Recommend safe additions, substitutions, and alternatives.
- Optimize regimens under uncertainty.
- Produce taper/deprescribing sequences over time.
- Choose precision dosing actions when organ function or dose sensitivity matters.
- Fetch evidence when critical data is missing.
- Decompose a new drug into components for first-pass safety reasoning.

## Reward Model / Evaluation Logic

Rewards are verifier-backed and clamped to `[0.001, 0.999]`. The environment exposes 13 detailed reward columns and 4 primary channels:

- `safety_legality`
- `clinical_improvement`
- `dosing_quality`
- `process_integrity`

The reward logic combines:

- Legal-action checks
- Safety delta and burden improvement
- Dosing quality
- Abstention quality under uncertainty
- Format compliance
- Process fidelity
- Explanation grounding
- Anti-cheat and timeout penalties
Anti-hacking checks block repeated action loops, review abuse, keep-regimen abuse, candidate ID mismatches, parser exploit patterns, and unsafe no-op behavior on known holdout DDIs.
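As one example of these checks, a repeated-action-loop detector can be sketched as a sliding-window count. The window size and repeat threshold are illustrative assumptions; the environment's anti-cheat logic covers the other patterns (review abuse, keep-regimen abuse, parser exploits) as well.

```python
from collections import Counter

def detect_action_loop(action_history: list,
                       window: int = 6, max_repeats: int = 3) -> bool:
    """Flag reward hacking via the same action repeated in the recent window.

    Thresholds are illustrative, not the environment's actual values.
    """
    recent = action_history[-window:]
    if not recent:
        return False
    most_common_count = Counter(recent).most_common(1)[0][1]
    return most_common_count >= max_repeats
```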
## Training and Post-Training Strategy

The intended pipeline is:

1. Build data assets from local knowledge, synthetic patients, scenario rollouts, optional HF instruction data, optional DDI API augmentation, and optional web fallback.
2. Run SFT with TRL and optional Unsloth/QLoRA acceleration to teach the action-selection format.
3. Run GRPO with environment-backed reward verification.
4. Track per-component reward columns and sampled generations.
5. Run policy-stack ablations against baselines.
6. Merge/export adapters safely.
7. Validate post-save inference from the exported artifact.
8. Deploy the OpenEnv environment to Hugging Face Spaces.

Core commands:

```bash
cd polyguard-rl
bash scripts/bootstrap_venv.sh
.venv/bin/python scripts/bootstrap_data.py
.venv/bin/python scripts/build_training_corpus.py --profile small --with-local --with-synthetic --with-hf
.venv/bin/python scripts/train_sft_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --epochs 1 --max-steps 20 --use-unsloth
.venv/bin/python scripts/train_grpo_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --max-steps 20 --num-generations 2 --use-unsloth
.venv/bin/python scripts/merge_adapters_safe.py --adapter-dir checkpoints/sft_adapter --output-dir checkpoints/merged
.venv/bin/python scripts/test_inference_postsave.py --samples 3
.venv/bin/python scripts/evaluate_all.py
```
## Results

Tracked smoke/evaluation artifacts are mirrored in `docs/results/` because `outputs/` and `checkpoints/` are intentionally git-ignored.




Current smoke reports show that the environment, evaluation, and plotting paths are wired, but final training is not yet judge-ready:

- `docs/results/sft_trl_run.json` currently records a fallback backend.
- `docs/results/grpo_trl_run.json` currently records an environment-reward fallback path.
- `docs/results/postsave_inference.json` currently uses fallback inference.
- `docs/results/improvement_report.json` currently records no positive improvement.
- `docs/results/hf_space_verification.json` is blocked until HF auth/deployment succeeds.

The final submission should replace these with real GPU/Colab TRL/Unsloth artifacts.
## Dataset Gathering

Implemented data generation and packaging covers:

- Normalized drug vocabulary and class tables
- Interaction graph edges
- Burden, taper, renal, hepatic, duplicate-therapy, and substitution rules
- Synthetic patients
- Easy/medium/hard scenario files
- Retrieval corpus and local evidence index
- Unified SFT and GRPO prompt corpora

The current local corpus summary is written to `data/processed/training_corpus_summary.json` when generated.
## Deployment

Use the repository-local HF CLI entrypoint; the global `hf` command on this machine is known to be incompatible with its installed Typer version.

```bash
./.venv/bin/hf auth login
./.venv/bin/hf auth whoami
export HF_SPACE_REPO_ID="Vishwa-docs/polyguard-openenv"
bash scripts/deploy_space.sh --repo-id "$HF_SPACE_REPO_ID"
./.venv/bin/hf spaces info "$HF_SPACE_REPO_ID"
openenv validate --url "https://Vishwa-docs-polyguard-openenv.hf.space"
```

After deployment, save the successful Space info plus the OpenEnv validation payload into `docs/results/hf_space_verification.json`.
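A small helper like the following can bundle the two payloads into that file. The field names (`verified_at`, `space`, `openenv_validation`, `passed`, and the `ok` key on the validation payload) are assumptions for illustration; keep whatever shape `scripts/acceptance_gate.py` actually expects.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def write_space_verification(space_info: dict, validation: dict,
                             path: str = "docs/results/hf_space_verification.json") -> dict:
    """Bundle Space info and the OpenEnv validation payload into one record.

    Field names are illustrative, not the acceptance gate's contract.
    """
    payload = {
        "verified_at": datetime.now(timezone.utc).isoformat(),
        "space": space_info,
        "openenv_validation": validation,
        "passed": bool(validation.get("ok")),
    }
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)  # docs/results/ may not exist yet
    out.write_text(json.dumps(payload, indent=2))
    return payload
```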
## Strict Submission Gate

Non-strict local readiness:

```bash
.venv/bin/python scripts/acceptance_gate.py
```

Final submission readiness:

```bash
export POLYGUARD_ENFORCE_SUBMISSION_LINKS=true
.venv/bin/python scripts/acceptance_gate.py
```

Strict mode fails unless the README links are real, tracked plots exist, HF Space verification passed, SFT/GRPO used real TRL/Unsloth paths, post-save inference used the exported artifact, and the measured improvement is positive.

## Documentation

- [Architecture](docs/architecture.md)
- [Environment Design](docs/environment_design.md)
- [Reward Design](docs/reward_design.md)
- [Training](docs/training.md)
- [Evaluation](docs/evaluation.md)
- [Deployment](docs/deployment.md)
- [Safety](docs/safety.md)
- [Agents](docs/agents.md)
- [Datasets](docs/datasets.md)
- [Math](docs/math.md)
- [Submission Checklist](docs/submission_checklist.md)

## Future Work

- Medicine image/barcode ingestion for regimen capture
- Larger-model GRPO sweeps
- Stronger real-world drug-label ingestion and calibration
- More clinician-facing explanation studies
- A published HF blog post or short video walkthrough

## License

MIT