# POLYGUARD-OPENENV

PolyGuard is an OpenEnv-compatible reinforcement-learning environment for **polypharmacy safety, medication optimization, deprescribing, and precision dosing**. The project turns medication decision-making into a stateful environment where an LLM agent observes a patient/regimen state, chooses constrained clinical actions, receives verifier-backed reward, and improves through TRL/GRPO-style post-training.

> Clinical safety note: this is a research environment and demo system for RL environment design. It is not a medical device and must not be used for patient care.

## Submission Links

- GitHub Repo URL: [https://github.com/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK](https://github.com/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK)
- HF Space URL: [https://huggingface.co/spaces/TheJackBright/polyguard-openenv](https://huggingface.co/spaces/TheJackBright/polyguard-openenv)
- Colab Notebook URL: [https://colab.research.google.com/github/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK/blob/master/polyguard-rl/notebooks/09_training_loop.ipynb](https://colab.research.google.com/github/Vishwa-docs/Meta_Pytorch_OpenEnv_Scaler_VK/blob/master/polyguard-rl/notebooks/09_training_loop.ipynb)
- YouTube Video URL: not used for this submission; the Hugging Face blog URL below is the selected story artifact.
- Hugging Face Blog URL: [https://huggingface.co/blog/TheJackBright/polyguard-openenv](https://huggingface.co/blog/TheJackBright/polyguard-openenv) *(story target; currently unpublished/404 until `docs/hf_blog_draft.md` is published or this link is replaced)*
|
|
## Current Readiness

Verified locally and against the live Space:

- `uv run pytest`: 49 tests passed.
- `uv run openenv validate .`: local OpenEnv packaging passed.
- `POLYGUARD_ENFORCE_SUBMISSION_LINKS=true uv run python scripts/acceptance_gate.py`: strict acceptance gate passed.
- `bash scripts/bootstrap_openenv.sh --runtime-check`: runtime OpenEnv HTTP contract passed when localhost access was allowed.
- `npm run build` in `app/ui/frontend`: production UI build passed.
- `scripts/train_sft_trl.py`: non-fallback TRL Transformers SFT artifact generated for a tiny local compliance run.
- `scripts/train_grpo_trl.py`: non-fallback TRL Transformers GRPO artifact generated with environment-backed reward verification.
- `scripts/test_inference_postsave.py`: post-save inference loads the exported merged artifact, not the fallback policy.
- `scripts/evaluate_compare_runs.py`: current report shows positive average-reward improvement against the no-change baseline.
- `scripts/deploy_training_space.py`: deploys the full Hugging Face A10G training Space and Qwen sweep runner.
- `scripts/generate_hf_training_report.py`: writes SFT-vs-GRPO charts, Qwen sweep charts, and anti-hacking/overfit reports.
- `scripts/generate_submission_evidence.py`: writes the Qwen 0.5B/1.5B submission evidence bundle without retraining.
- `scripts/deploy_evidence_space.py`: deploys a separate HF Space for evaluation-only evidence generation so the running training Space is not interrupted.
- `scripts/activate_sweep_model.py`: activates pulled Qwen sweep artifacts for the API/UI inference path.
- `curl -s https://thejackbright-polyguard-openenv.hf.space/health`: live Space returned `{"status":"healthy"}` on April 26, 2026.
- `curl -s https://thejackbright-polyguard-openenv.hf.space/metadata`: live Space reported `version: 0.2.0`, reward range `[0.001, 0.999]`, and OpenEnv simulation metadata.
|
|
Still required for final judge-ready submission:

- Publish the external Hugging Face blog story URL, or replace it with a real YouTube/slide/blog URL. The current blog URL was checked on April 26, 2026 and returns 404.
- If you want to claim a full public per-model GRPO sweep, pull and mirror those private artifacts first. The current tracked evidence is a 3-model SFT-baseline sweep plus a top-level environment-backed GRPO run; private HF training artifact repos require authentication and should not be used as public judge links.
- After the story artifact is published, run `uv run python scripts/validate_submission_links.py` to catch any remaining broken README URLs.
|
|
## Problem Statement

Polypharmacy decisions are long-horizon, partially observable, and safety-critical. A useful LLM agent must do more than produce a plausible recommendation: it should identify drug-drug interaction risk, reason over comorbidities and labs, choose safe substitutions or deprescribing sequences, request review when uncertain, and expose why it acted.

PolyGuard targets the OpenEnv **World Modeling / Professional Tasks** theme, with multi-agent and self-improvement elements. It asks whether environment-backed feedback can make a model better at safe medication action selection than prompt-only or rule-only baselines.
|
|
## Environment

The environment is implemented by `PolyGuardEnv` and exposed through FastAPI/OpenEnv-compatible endpoints:

- `POST /reset`
- `POST /step`
- `GET /state`
- `GET /metadata`
- `GET /schema`
- `POST /mcp`
- `GET /health`
- Backward-compatible aliases under `/env/*` plus `/ws`
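A minimal client for these endpoints can be sketched as below. The payload shapes (an empty `reset` body, an `action` key for `step`) are assumptions for illustration; the real request/response contract lives in `client.py` and `GET /schema`.

```python
import json
import urllib.request


class PolyGuardClient:
    """Minimal HTTP sketch for the endpoints above (not the repo's client.py).

    Payload shapes here are assumptions; consult GET /schema for the real
    request/response contract.
    """

    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")

    def _post(self, path: str, payload: dict) -> dict:
        req = urllib.request.Request(
            f"{self.base_url}{path}",
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    def reset(self) -> dict:
        # Start a new episode; returns the initial observation.
        return self._post("/reset", {})

    def step(self, action: dict) -> dict:
        # Apply one constrained clinical action; returns observation and reward.
        return self._post("/step", {"action": action})
```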
|
|
OpenEnv packaging lives at repo root:

- `openenv.yaml`
- `__init__.py`
- `client.py`
- `models.py`
- `server/app.py`

Each episode samples a patient/regimen scenario and a sub-environment:

- `DDI`
- `BANDIT_MINING`
- `REGIMEN_RISK`
- `PRECISION_DOSING`
- `LONGITUDINAL_DEPRESCRIBING`
- `WEB_SEARCH_MISSING_DATA`
- `ALTERNATIVE_SUGGESTION`
- `NEW_DRUG_DECOMPOSITION`

Difficulty tracks are available as easy, medium, and hard scenario sets.
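Per-episode sampling can be illustrated with a uniform sketch; the environment's actual sampling weights and scenario pairing logic are not shown here.

```python
import random

# Sub-environments and difficulty tracks as listed above.
SUB_ENVS = [
    "DDI",
    "BANDIT_MINING",
    "REGIMEN_RISK",
    "PRECISION_DOSING",
    "LONGITUDINAL_DEPRESCRIBING",
    "WEB_SEARCH_MISSING_DATA",
    "ALTERNATIVE_SUGGESTION",
    "NEW_DRUG_DECOMPOSITION",
]
DIFFICULTIES = ["easy", "medium", "hard"]


def sample_episode_config(rng=None):
    """Uniform illustrative sampling; the real environment's weights differ."""
    rng = rng or random.Random()
    return {
        "sub_env": rng.choice(SUB_ENVS),
        "difficulty": rng.choice(DIFFICULTIES),
    }
```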
|
|
## Agent Capabilities

The agent stack is deliberately decomposed so reward, safety, and explanation can be inspected:

- Medication reconciliation
- Evidence retrieval and missing-data recovery
- Graph safety analysis for DDI and side effects
- Dosing guardrails
- Candidate generation
- Supervisor routing between regimen, dose, and review modes
- Planner policy selection
- Critic safety veto
- Explanation generation
- Contextual bandit ranking for policy-stack ablations

## Tasks

PolyGuard evaluates these action-selection tasks:

- Find bad drug combinations and reduce DDI/polypharmacy side-effect risk.
- Recommend safe adds, substitutions, and alternatives.
- Optimize regimens under uncertainty.
- Produce taper/deprescribing sequences over time.
- Choose precision dosing actions when organ function or dose sensitivity matters.
- Fetch evidence when critical data is missing.
- Decompose a new drug into components for first-pass safety reasoning.
|
## Reward Model / Evaluation Logic

Rewards are verifier-backed and clamped to `[0.001, 0.999]`. The environment exposes 13 detailed reward columns and 4 primary channels:

- `safety_legality`
- `clinical_improvement`
- `dosing_quality`
- `process_integrity`

Reward logic combines:

- Legal action checks
- Safety delta and burden improvement
- Dosing quality
- Abstention quality under uncertainty
- Format compliance
- Process fidelity
- Explanation grounding
- Anti-cheat and timeout penalties
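An equal-weight combination of the four primary channels, clamped to the documented range, can be sketched as follows. The weighting is hypothetical; the real verifier aggregates 13 detailed reward columns.

```python
def combine_reward(channels, weights=None, lo=0.001, hi=0.999):
    """Weighted combination of the primary channels, clamped to [lo, hi].

    Equal weights are an illustrative assumption, not the environment's
    actual reward formula.
    """
    weights = weights or {name: 1.0 / len(channels) for name in channels}
    raw = sum(value * weights[name] for name, value in channels.items())
    return max(lo, min(hi, raw))
```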
|
|
Anti-hacking checks block repeated action loops, review abuse, keep-regimen abuse, candidate ID mismatches, parser exploit patterns, and unsafe no-op behavior on known holdout DDIs.
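One of these checks, flagging repeated action loops, can be sketched like this; the repeat threshold is illustrative, not the environment's tuned value.

```python
def is_action_loop(history, min_repeats=3):
    """Flag an episode whose most recent action repeats min_repeats times
    in a row. The threshold here is an illustrative assumption."""
    if len(history) < min_repeats:
        return False
    tail = history[-min_repeats:]
    return all(action == tail[-1] for action in tail)
```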
|
|
## Training And Post-Training Strategy

The intended pipeline is:

1. Build data assets from local knowledge, synthetic patients, scenario rollouts, optional HF instruction data, optional DDI API augmentation, and optional web fallback.
2. Run SFT with TRL and optional Unsloth/QLoRA acceleration to teach action-selection format.
3. Run GRPO with environment-backed reward verification.
4. Track per-component reward columns and sampled generations.
5. Run policy-stack ablations against baselines.
6. Merge/export adapters safely.
7. Validate post-save inference from the exported artifact.
8. Deploy the OpenEnv environment to Hugging Face Spaces.

Core commands:

```bash
cd polyguard-rl
bash scripts/bootstrap_venv.sh
.venv/bin/python scripts/bootstrap_data.py
.venv/bin/python scripts/build_training_corpus.py --profile small --with-local --with-synthetic --with-hf
.venv/bin/python scripts/train_sft_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --epochs 1 --max-steps 20 --report-path outputs/reports/sft_trl_run.json --use-unsloth
.venv/bin/python scripts/train_grpo_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --max-steps 20 --num-generations 2 --use-unsloth
.venv/bin/python scripts/merge_adapters_safe.py --adapter-dir checkpoints/sft_adapter --output-dir checkpoints/merged
.venv/bin/python scripts/test_inference_postsave.py --samples 3
.venv/bin/python scripts/evaluate_all.py
```

Optional full GPU training uses Hugging Face Spaces, not local Ollama:

```bash
export HF_TOKEN="<write-token>"
.venv/bin/python scripts/deploy_training_space.py \
  --repo-id TheJackBright/polyguard-openenv-training-full \
  --artifact-repo-id TheJackBright/polyguard-openenv-training-full-artifacts \
  --hardware a10g-large \
  --model-sweep Qwen/Qwen2.5-0.5B-Instruct,Qwen/Qwen2.5-1.5B-Instruct,Qwen/Qwen2.5-3B-Instruct \
  --sft-epochs 2 \
  --grpo-epochs 1 \
  --sft-max-steps 0 \
  --grpo-max-steps 0 \
  --grpo-max-prompts 0
.venv/bin/python scripts/pull_training_artifacts.py \
  --artifact-repo-id TheJackBright/polyguard-openenv-training-full-artifacts
```

`max_steps <= 0` means full-epoch training over the corpus. The Space uploads per-model SFT/GRPO reports, histories, adapters, post-save inference outputs, and comparison charts under `outputs/reports/sweeps/`, `outputs/plots/`, `docs/results/`, and `checkpoints/sweeps/`. Those artifact repositories are private/authenticated by design; mirror the final public evidence into `docs/results/` before claiming it in the submission story.
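The `max_steps` convention above can be expressed as a small helper; the function name and epoch arithmetic are illustrative, not copied from the training scripts.

```python
def resolve_max_steps(max_steps, steps_per_epoch, epochs):
    """Sketch of the convention above: a non-positive max_steps means
    full-epoch training over the corpus; a positive value caps steps."""
    if max_steps <= 0:
        return steps_per_epoch * epochs
    return max_steps
```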
|
|
Current remote training/evidence links:

- Training Space: [https://huggingface.co/spaces/TheJackBright/polyguard-openenv-training-full](https://huggingface.co/spaces/TheJackBright/polyguard-openenv-training-full)
- Training artifact repo: [https://huggingface.co/TheJackBright/polyguard-openenv-training-full-artifacts](https://huggingface.co/TheJackBright/polyguard-openenv-training-full-artifacts)
- Evidence Space target: [https://huggingface.co/spaces/TheJackBright/polyguard-openenv-evidence](https://huggingface.co/spaces/TheJackBright/polyguard-openenv-evidence)
- Implementation-ready active model bundle: [https://huggingface.co/TheJackBright/polyguard-openenv-training-full-artifacts/tree/main/usable_model_bundles/local-qwen-0-5b-active-smoke](https://huggingface.co/TheJackBright/polyguard-openenv-training-full-artifacts/tree/main/usable_model_bundles/local-qwen-0-5b-active-smoke)
- Qwen 0.5B/1.5B evidence bundle: `docs/results/submission_evidence_qwen_0_5b_1_5b/`
- Zipped evidence bundle: `submission_bundle/qwen_0_5b_1_5b_evidence.zip`
- Local active model bundle: `submission_bundle/model_artifacts/local-qwen-0-5b-active-smoke/`
- Local active model zip: `submission_bundle/model_artifacts/local-qwen-0-5b-active-smoke.zip`
|
|
Generate the Qwen 0.5B/1.5B submission evidence bundle without retraining:

```bash
.venv/bin/python scripts/generate_submission_evidence.py \
  --models qwen-qwen2-5-0-5b-instruct,qwen-qwen2-5-1-5b-instruct \
  --artifact-repo-id TheJackBright/polyguard-openenv-training-full-artifacts \
  --training-space-url https://thejackbright-polyguard-openenv-training-full.hf.space \
  --episodes 8
```

Deploy the evidence-only HF Space:

```bash
export HF_TOKEN="<write-token>"
.venv/bin/python scripts/deploy_evidence_space.py \
  --repo-id TheJackBright/polyguard-openenv-evidence \
  --artifact-repo-id TheJackBright/polyguard-openenv-training-full-artifacts \
  --training-space-url https://thejackbright-polyguard-openenv-training-full.hf.space \
  --models qwen-qwen2-5-0-5b-instruct,qwen-qwen2-5-1-5b-instruct \
  --hardware cpu-basic
```

As of the April 26, 2026 live status pull, Qwen 0.5B and 1.5B SFT, GRPO, GRPO post-save inference, and policy ablation have completed on the training Space. The artifact repo still exposes no meaningful files beyond `.gitattributes`, so per-run GRPO curves/checkpoints are labeled `remote_completed_pending_artifact_upload` in the evidence bundle until upload finishes.
|
|
The current active implementation bundle has been pushed separately so the product can be tested immediately while the full sweep upload is pending. It contains `grpo_adapter`, `sft_adapter`, `merged`, active manifests, and active-model reports:

```bash
export HF_TOKEN="$(cat ~/.cache/huggingface/token)"
./.venv/bin/hf download TheJackBright/polyguard-openenv-training-full-artifacts \
  --repo-type model \
  --include 'usable_model_bundles/local-qwen-0-5b-active-smoke/**' \
  --local-dir ./hf_artifacts
```

Restore that bundle into the app:

```bash
cp -R hf_artifacts/usable_model_bundles/local-qwen-0-5b-active-smoke/checkpoints/grpo_adapter checkpoints/grpo_adapter
cp -R hf_artifacts/usable_model_bundles/local-qwen-0-5b-active-smoke/checkpoints/sft_adapter checkpoints/sft_adapter
cp -R hf_artifacts/usable_model_bundles/local-qwen-0-5b-active-smoke/checkpoints/merged checkpoints/merged
mkdir -p checkpoints/active
cp hf_artifacts/usable_model_bundles/local-qwen-0-5b-active-smoke/manifests/active_model_manifest.json checkpoints/active/active_model_manifest.json
```
|
|
To pull only the Qwen 0.5B run when the artifact repo has uploaded it:

```bash
.venv/bin/python scripts/pull_sweep_artifacts.py \
  --artifact-repo-id TheJackBright/polyguard-openenv-training-full-artifacts \
  --run-id qwen-qwen2-5-0-5b-instruct
.venv/bin/python scripts/activate_sweep_model.py \
  --source sweep \
  --run-id qwen-qwen2-5-0-5b-instruct \
  --preferred-artifact grpo_adapter
```
|
|
To pull Qwen 1.5B once available:

```bash
.venv/bin/python scripts/pull_sweep_artifacts.py \
  --artifact-repo-id TheJackBright/polyguard-openenv-training-full-artifacts \
  --run-id qwen-qwen2-5-1-5b-instruct
.venv/bin/python scripts/activate_sweep_model.py \
  --source sweep \
  --run-id qwen-qwen2-5-1-5b-instruct \
  --preferred-artifact grpo_adapter
```
|
|
Current local product activation uses the available top-level Qwen 0.5B GRPO smoke artifacts:

```bash
.venv/bin/python scripts/activate_sweep_model.py \
  --source top-level \
  --run-id qwen-qwen2-5-0-5b-instruct \
  --preferred-artifact grpo_adapter \
  --label local-qwen-0.5b-active-smoke
curl http://127.0.0.1:8200/policy/model_status
```

The API/UI inference path checks `checkpoints/active/active_model_manifest.json`, prefers the GRPO adapter, and falls back to the merged Qwen 0.5B artifact if an adapter/base-model load is unavailable. The UI shows the active model chip on the Patient Workbench.
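That preference order can be sketched as follows; the `preferred_artifact` manifest field name is an assumption for illustration, not the manifest's confirmed schema.

```python
import json
from pathlib import Path


def select_inference_artifact(checkpoint_root="checkpoints"):
    """Sketch of the fallback order described above: read the active-model
    manifest, prefer the adapter it names (GRPO by default), and fall back
    to the merged artifact. The manifest field name is an assumption."""
    root = Path(checkpoint_root)
    manifest_path = root / "active" / "active_model_manifest.json"
    if manifest_path.exists():
        manifest = json.loads(manifest_path.read_text())
        preferred = root / manifest.get("preferred_artifact", "grpo_adapter")
        if preferred.exists():
            return preferred
    # Fallback when no adapter/base-model load is available.
    return root / "merged"
```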
|
|
## Results

Tracked smoke/evaluation artifacts are mirrored in `docs/results/` because `outputs/` and `checkpoints/` are intentionally ignored.





Current tracked reports show the environment, evaluation, plotting, SFT, GRPO, and non-fallback inference paths are wired:

- `docs/results/sft_trl_run.json` records `backend: trl_transformers`.
- `docs/results/grpo_trl_run.json` records `status: ok`, `backend: trl_transformers`, and an adapter artifact path.
- `docs/results/postsave_inference.json` records `model_source: merged`.
- `docs/results/improvement_report.json` records `improved: true` against the no-change baseline.
- `docs/results/hf_space_verification.json` records the latest HF Space validation result.
- `docs/results/hf_sweep_summary.json` currently documents a 3-model SFT-baseline sweep. Do not claim a full public per-model GRPO sweep unless those private artifacts are pulled, mirrored, and documented.
- `docs/results/anti_hacking_overfit_report.json` and tracked charts cover reward, legality, success, process fidelity, SFT-vs-GRPO comparison, SFT loss, GRPO reward, Qwen model size comparison, reward components, anti-cheat/failure rate, train-vs-holdout gap, inference validity/reward, and inference latency/validity.
- `docs/results/submission_evidence_qwen_0_5b_1_5b/` contains the current Qwen 0.5B/1.5B submission bundle with SFT curves, live HF stage-duration evidence, basic-LLM-vs-PolyGuard verifier comparison, action traces, failure cases, and explicit pending markers for GRPO files not yet uploaded by the private artifact repo.

For best storytelling, replace the tiny local compliance artifacts with the larger Colab/HF GPU run before the final pitch if time permits.
|
|
## Dataset Gathering

Implemented data generation and packaging covers:

- Normalized drug vocabulary and class tables
- Interaction graph edges
- Burden, taper, renal, hepatic, duplicate-therapy, and substitution rules
- Synthetic patients
- Easy/medium/hard scenario files
- Retrieval corpus and local evidence index
- Unified SFT and GRPO prompt corpora

Once generated, the current local corpus summary is written to `data/processed/training_corpus_summary.json`.
|
|
## Deployment

Use the repository-local HF CLI entrypoint; the globally installed `hf` command on this machine is incompatible with the Typer version installed alongside it.

```bash
./.venv/bin/hf auth login
./.venv/bin/hf auth whoami
export HF_SPACE_REPO_ID="TheJackBright/polyguard-openenv"
uv run python scripts/deploy_space_api.py --repo-id "$HF_SPACE_REPO_ID"
uv run python -c "from huggingface_hub import HfApi; print(HfApi().space_info('$HF_SPACE_REPO_ID').id)"
openenv validate --url "https://thejackbright-polyguard-openenv.hf.space"
```

After deployment, save the successful Space info plus the OpenEnv validation payload into `docs/results/hf_space_verification.json`.
|
|
## Strict Submission Gate

Non-strict local readiness:

```bash
.venv/bin/python scripts/acceptance_gate.py
```

Final submission readiness:

```bash
export POLYGUARD_ENFORCE_SUBMISSION_LINKS=true
.venv/bin/python scripts/acceptance_gate.py
```

As of April 26, 2026, strict readiness passes locally. This gate validates required files, artifacts, result assets, and submission link presence; it does not perform live HTTP status checks. Use the optional link validator after publishing the story artifact:

```bash
uv run python scripts/validate_submission_links.py
```

Strict mode fails unless README links are real, tracked plots exist, HF Space verification passed, SFT/GRPO used real TRL/Unsloth paths, post-save inference uses the exported artifact, and measured improvement is positive.
|
|
## Documentation

- [Architecture](docs/architecture.md)
- [Environment Design](docs/environment_design.md)
- [Reward Design](docs/reward_design.md)
- [Training](docs/training.md)
- [Evaluation](docs/evaluation.md)
- [Deployment](docs/deployment.md)
- [Safety](docs/safety.md)
- [Agents](docs/agents.md)
- [Datasets](docs/datasets.md)
- [Math](docs/math.md)
- [Submission Checklist](docs/submission_checklist.md)
- [Participant Guide Traceability](docs/participant_guide_traceability.md)
- [HF Blog Draft](docs/hf_blog_draft.md)
|
|
## Future Work

- Medicine image/barcode ingestion for regimen capture
- Larger model GRPO sweeps
- Stronger real-world drug-label ingestion and calibration
- More clinician-facing explanation studies
- Larger Qwen/Unsloth GPU run for stronger final curves

## License

MIT
|
|