# PolyGuard OpenEnv Blog Draft
PolyGuard turns polypharmacy safety into an OpenEnv-compatible reinforcement-learning environment. The agent observes a partially observable patient/regimen state, chooses among constrained medication actions, and receives verifier-backed rewards covering legality, safety, dosing quality, process fidelity, explanation grounding, uncertainty calibration, and anti-cheat checks.
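The interaction contract can be sketched as a standard reset/step loop where each step returns one score per verifier channel. Everything below is illustrative: `PolyGuardEnvSketch`, `Observation`, the channel names' grouping, and the toy clinical rule are stand-ins, not the actual environment API.

```python
from dataclasses import dataclass

REWARD_CHANNELS = [
    "legality", "safety", "dosing", "process",
    "grounding", "calibration", "anti_cheat",
]

@dataclass
class Observation:
    patient_state: dict   # partial view: labs, comorbidities, current regimen
    candidates: list      # constrained medication actions available this step

class PolyGuardEnvSketch:
    """Toy stand-in showing the interaction shape, not real clinical logic."""

    def reset(self, seed=0):
        self._t = 0
        return Observation(
            patient_state={"egfr": 55, "regimen": ["warfarin"]},
            candidates=["hold", "reduce_dose", "add_amiodarone"],
        )

    def step(self, action):
        self._t += 1
        # Each channel is scored by an independent verifier in [0, 1].
        rewards = {ch: 1.0 for ch in REWARD_CHANNELS}
        if action == "add_amiodarone":   # toy rule: known warfarin DDI
            rewards["safety"] = 0.0
        done = self._t >= 1
        return Observation({"egfr": 55, "regimen": []}, []), rewards, done

env = PolyGuardEnvSketch()
obs = env.reset()
next_obs, rewards, done = env.step(obs.candidates[0])
```

Keeping the channels separate in the step result (rather than pre-summing them) is what lets the workbench render per-check reward bars and makes individual failures visible.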
The environment targets the World Modeling / Professional Tasks theme. Medication optimization is not a one-shot answer task: safe action selection depends on state, evidence, comorbidities, labs, drug-drug interactions, uncertainty, and rollback behavior when an action is unsafe.
The demo includes:
- Easy, medium, and hard task presets over DDI screening, regimen risk, bandit mining, precision dosing, deprescribing, missing-data search, alternatives, and new-drug decomposition.
- A React workbench for reset/step interaction, clickable candidates, task/environment selection, reward bars, action history, and event traces.
- A TRL SFT warm start and GRPO loop using environment-backed rewards.
- Post-save inference checks from exported artifacts.
- Baseline comparison and plots committed under `docs/results/`.
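In the GRPO loop, each sampled completion is scored by the environment's verifiers rather than a learned reward model. A minimal sketch of an environment-backed reward function in the callable shape TRL's `GRPOTrainer` accepts (a batch of completions in, one scalar per completion out); `score_with_verifiers` is a hypothetical stub standing in for the real checks:

```python
def score_with_verifiers(completion: str) -> dict:
    """Stub verifier: the real environment runs legality/safety/dosing/etc.
    checks. As a placeholder, we only reward completions stating a dose."""
    has_dose = "mg" in completion
    return {"legality": 1.0, "safety": 1.0 if has_dose else 0.0}

def env_reward(completions, **kwargs):
    """Environment-backed reward function. Legality multiplies the mean of
    the channel scores, so an illegal action can never earn reward."""
    scores = []
    for text in completions:
        channels = score_with_verifiers(text)
        scores.append(channels["legality"] * sum(channels.values()) / len(channels))
    return scores

# Wiring it into training (assumed config, not run here):
# from trl import GRPOConfig, GRPOTrainer
# trainer = GRPOTrainer(model=..., reward_funcs=env_reward,
#                       args=GRPOConfig(output_dir="out"), train_dataset=...)
# trainer.train()
```

The same function can be reused after the SFT warm start and for the post-save inference checks, so training and evaluation score completions identically.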
The current local compliance run uses a tiny model so the full pipeline can be verified quickly. For the final pitch, rerun the same notebook on a GPU with the Qwen model and Unsloth enabled, then replace the result artifacts with those from the stronger run.
Key result to show: the current benchmark report shows average reward improving over the no-change baseline while preserving legality. The reward is intentionally decomposed into multiple independent checks to reduce reward hacking and make failures visible.
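One way this decomposition resists reward hacking is to treat some channels as hard gates rather than tradable terms: if legality or anti-cheat fails, the whole reward is zero, and only the graded channels are averaged. A sketch under that assumption (the gate set and weights are illustrative, not the shipped configuration):

```python
def aggregate(channels, weights=None):
    """Combine independent verifier scores (each in [0, 1]) into one reward.

    Hard gates zero the reward outright, so a policy cannot trade a safety
    or legality violation against a high dosing score.
    """
    gates = ("legality", "anti_cheat")
    if any(channels.get(g, 1.0) < 1.0 for g in gates):
        return 0.0
    graded = {k: v for k, v in channels.items() if k not in gates}
    w = weights or {k: 1.0 for k in graded}
    total = sum(w.values())
    return sum(w[k] * graded[k] for k in graded) / total if total else 0.0
```

Because each check is computed independently, a low aggregate score can be traced back to the specific failing channel instead of a single opaque scalar.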