	---
	title: Pharmacovigilance Signal Detector
	colorFrom: blue
	colorTo: green
	sdk: docker
	app_port: 7860
	pinned: false
	license: mit
	short_description: OpenEnv pharmacovigilance signal detection environment
	tags:
	- openenv
	- healthcare
	- pharmacovigilance
	- safety
	- real-world
	base_path: /web
	---

	# Pharmacovigilance Signal Detector

	`Pharmacovigilance Signal Detector` is a real-world OpenEnv environment where an agent acts like a drug-safety analyst. The agent reviews synthetic adverse event reports, uses a hardcoded drug interaction knowledge base, and decides whether the case is a new safety signal, a known side effect, or low-value noise. This mirrors pharmacovigilance triage work performed by regulators and pharmaceutical safety teams.

	All case data in this repo is synthetic. No real patient data is used.

	## Why This Environment Matters

	Pharmacovigilance teams are responsible for detecting harmful safety patterns after a drug is already on the market. That work is operationally important, high-stakes, and difficult: analysts must distinguish expected reactions from true emerging risks, recognize confounding from polypharmacy, and escalate only when justified. This makes the domain a strong fit for agent evaluation because it tests causal reasoning, prioritization, and safety-sensitive decision making.

	## Environment Overview

	| Item | Value |
	|---|---|
	| Environment name | `pharma-vigilance` |
	| Domain | Pharmacovigilance / drug safety triage |
	| Episode length | 2-step triage and review workflow |
	| Task count | 3 |
	| Difficulties | Easy, Medium, Hard |
	| Step reward range | `-0.25` to `1.0` |
	| Final grader range | strict `(0, 1)` |
	| API | `reset()`, `step()`, `state()` |
	| Server | FastAPI |

	Each episode has two phases. On step 1 the agent performs an initial triage. The environment then returns additional senior-review context through feedback, and on step 2 the agent submits a final reviewed assessment. Each task includes one or more synthetic reports plus a hardcoded drug interaction database. The environment never exposes ground truth to the agent.
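
The two-phase flow described above can be sketched with a stand-in environment. This is a minimal illustration, not the code in `env.py`: `StubEnv`, `run_episode`, and `example_policy` are hypothetical names, the reward is a placeholder, and the zero-based `step_number` is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Observation:
    task_id: str
    step_number: int
    max_steps: int
    feedback: Optional[str] = None


class StubEnv:
    """Hypothetical stand-in that mimics only the two-phase episode shape."""

    def __init__(self) -> None:
        self._task_id = ""
        self._step = 0

    def reset(self, task_id: str) -> Observation:
        self._task_id = task_id
        self._step = 0
        return Observation(task_id, step_number=0, max_steps=2)

    def step(self, action: dict) -> Tuple[Observation, float, bool, dict]:
        self._step += 1
        done = self._step >= 2
        # After step 1 the environment returns senior-review context as feedback.
        feedback = None if done else "senior-review note (placeholder)"
        obs = Observation(self._task_id, self._step, 2, feedback)
        return obs, 0.0, done, {}  # placeholder reward, not the real grader


def run_episode(env: StubEnv, policy: Callable[[Observation], dict], task_id: str) -> List[dict]:
    """Call the policy once per step until the environment reports done."""
    obs = env.reset(task_id)
    actions: List[dict] = []
    done = False
    while not done:
        action = policy(obs)
        actions.append(action)
        obs, _reward, done, _info = env.step(action)
    return actions


def example_policy(obs: Observation) -> dict:
    # No feedback yet means initial triage; feedback present means final review.
    if obs.feedback is None:
        return {"classification": "known_side_effect", "reasoning": "initial triage"}
    return {"classification": "known_side_effect", "reasoning": "final review after senior feedback"}
```

The policy branches on `feedback` because that is the only signal in the observation that distinguishes the initial triage from the reviewed assessment.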

	## Action Space

	| Field | Type | Allowed values | Purpose |
	|---|---|---|---|
	| `classification` | `str` | `new_signal`, `known_side_effect`, `noise`, `duplicate` | Overall pharmacovigilance judgment |
	| `suspect_drug` | `str` | Free text | Drug or interaction the agent believes is causal |
	| `severity_assessment` | `str` | `mild`, `moderate`, `severe`, `critical` | Clinical severity assessment |
	| `recommended_action` | `str` | `escalate`, `log_and_monitor`, `dismiss`, `request_more_info` | Operational follow-up |
	| `reasoning` | `str` | Free text | Short explanation; used for the reasoning bonus on the hard task |
	| `confidence` | `Optional[int]` | `0` to `100` | Optional analyst confidence used for calibration-aware reward shaping |
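
As a rough illustration of the action schema, the sketch below validates the allowed values with a standard-library dataclass. The repository's models are Pydantic classes in `env.py`; the class name `TriageAction` and the choice to raise `ValueError` are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional

CLASSIFICATIONS = {"new_signal", "known_side_effect", "noise", "duplicate"}
SEVERITIES = {"mild", "moderate", "severe", "critical"}
ACTIONS = {"escalate", "log_and_monitor", "dismiss", "request_more_info"}


@dataclass
class TriageAction:
    """Hypothetical stand-in for the action model; field names follow the table."""

    classification: str
    suspect_drug: str
    severity_assessment: str
    recommended_action: str
    reasoning: str
    confidence: Optional[int] = None

    def __post_init__(self) -> None:
        # Reject values outside the documented enumerations.
        if self.classification not in CLASSIFICATIONS:
            raise ValueError(f"invalid classification: {self.classification!r}")
        if self.severity_assessment not in SEVERITIES:
            raise ValueError(f"invalid severity_assessment: {self.severity_assessment!r}")
        if self.recommended_action not in ACTIONS:
            raise ValueError(f"invalid recommended_action: {self.recommended_action!r}")
        # Confidence is optional, but must lie in 0..100 when provided.
        if self.confidence is not None and not 0 <= self.confidence <= 100:
            raise ValueError("confidence must be between 0 and 100")


example_action = TriageAction(
    classification="known_side_effect",
    suspect_drug="Lisinopril",
    severity_assessment="mild",  # illustrative value, not the task's ground truth
    recommended_action="log_and_monitor",
    reasoning="Dry cough is a labeled ACE-inhibitor reaction.",
    confidence=80,
)
```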

	## Observation Space

	| Field | Type | Description |
	|---|---|---|
	| `task_id` | `str` | Current task identifier |
	| `reports` | `List[AdverseEventReport]` | Synthetic adverse event reports for the task |
	| `drug_interaction_db` | `dict` | Hardcoded safety and interaction hints |
	| `step_number` | `int` | Current step index |
	| `max_steps` | `int` | Maximum number of steps in the episode |
	| `feedback` | `Optional[str]` | Feedback or senior-review note returned after the previous action |

	Each `AdverseEventReport` contains:

	| Field | Description |
	|---|---|
	| `report_id` | Unique synthetic report identifier |
	| `patient_age` | Patient age |
	| `patient_sex` | Patient sex |
	| `drugs` | All drugs the patient was taking |
	| `suspect_drug` | Drug named by the original reporter |
	| `reaction` | Observed adverse reaction |
	| `onset_days` | Days after drug start when reaction began |
	| `severity` | Reported severity |
	| `outcome` | Recovery status |
	| `similar_reports_last_30d` | Count of similar recent reports |
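
To make the report shape concrete, here is a sketch of the record as a dataclass with one helper that lists co-medications, which is the information an agent needs when checking for polypharmacy confounding. The field types, the `co_medications` helper, and every value in `example_report` are assumptions for illustration; they are not taken from `data.py`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AdverseEventReport:
    """Field names follow the table above; the types are assumed."""

    report_id: str
    patient_age: int
    patient_sex: str
    drugs: List[str]
    suspect_drug: str
    reaction: str
    onset_days: int
    severity: str
    outcome: str
    similar_reports_last_30d: int


def co_medications(report: AdverseEventReport) -> List[str]:
    """Return every listed drug other than the reporter's suspect drug."""
    suspect = report.suspect_drug.lower()
    return [drug for drug in report.drugs if drug.lower() != suspect]


# Made-up record shaped like the hard-task scenario; all values are illustrative.
example_report = AdverseEventReport(
    report_id="SYN-0001",
    patient_age=54,
    patient_sex="F",
    drugs=["Trimethoprim-sulfamethoxazole", "Voriconazole", "Tacrolimus"],
    suspect_drug="Trimethoprim-sulfamethoxazole",
    reaction="acute kidney injury",
    onset_days=6,
    severity="severe",
    outcome="recovering",
    similar_reports_last_30d=1,
)
```

Calling `co_medications(example_report)` surfaces the two drugs the reporter did not name, which is where the hard task's interaction sits.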

	## Tasks

	| Task | Difficulty | Scenario | Ground-truth goal | Expected baseline |
	|---|---|---|---|---|
	| `known_signal_easy` | Easy | Patient on `Lisinopril` develops a persistent dry cough, an in-label reaction with many similar recent reports | Recognize a known side effect and recommend `log_and_monitor` | Around `0.85` |
	| `cluster_signal_medium` | Medium | Four recent `Cardiovexa` cases show symptomatic bradycardia and near-syncope despite no labeled rhythm toxicity | Recognize a plausible emerging signal and `escalate` | Around `0.65` |
	| `confounded_hard` | Hard | Acute kidney injury in a transplant patient is attributed to `Trimethoprim-sulfamethoxazole`, but the underlying cause is a `Voriconazole`-`Tacrolimus` interaction | Detect the interaction, classify as `new_signal`, and `escalate` | Around `0.40` |

	The hard task is intentionally more difficult because the named suspect drug is not the true cause. The agent must reason over interaction evidence and therapeutic drug-monitoring clues in the provided hardcoded drug database.

	## Reward Function

	The environment uses deterministic programmatic graders. Reward is shaped across the two-step trajectory:

	1. initial triage reward on step 1
	2. final review reward on step 2 after additional context arrives

	Within each step, the agent is also scored on classification, causal attribution, severity, and action, then receives extra credit if those sub-decisions form a coherent triage story.

	| Reward component | Value |
	|---|---|
	| Correct `classification` | `+0.25` |
	| Correct `suspect_drug` | `+0.25` |
	| Correct `severity_assessment` | `+0.20` |
	| Correct `recommended_action` | `+0.15` |
	| Consistency bonus when classification, severity, and action form a coherent pharmacovigilance pipeline | `+0.10` |
	| Calibration bonus for high-confidence correct answers | `+0.05` |
	| Overconfidence penalty for high-confidence weak answers | `-0.10` |
	| Underconfidence penalty for low-confidence strong answers | `-0.03` |
	| False alarm penalty: agent says `new_signal` when truth is `noise` | `-0.10` |
	| Missed signal penalty: agent says `noise` when truth is `new_signal` | `-0.20` |
	| Hard-task reasoning bonus if explanation mentions `drug interaction`, `tacrolimus`, `voriconazole`, `azole`, `calcineurin`, or `level monitoring` | `+0.05` |
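
The additive part of the table can be tallied as in the sketch below. It is not the grader in `tasks.py`: the function name `tally_reward`, exact-match comparison, and case-insensitive keyword search are assumptions, and the consistency and confidence terms are left out because this README does not define their thresholds.

```python
from typing import Mapping

REASONING_KEYWORDS = (
    "drug interaction",
    "tacrolimus",
    "voriconazole",
    "azole",
    "calcineurin",
    "level monitoring",
)


def tally_reward(action: Mapping[str, str], truth: Mapping[str, str], hard_task: bool = False) -> float:
    """Sum only the components whose conditions the table states explicitly."""
    reward = 0.0
    if action["classification"] == truth["classification"]:
        reward += 0.25
    if action["suspect_drug"].lower() == truth["suspect_drug"].lower():
        reward += 0.25
    if action["severity_assessment"] == truth["severity_assessment"]:
        reward += 0.20
    if action["recommended_action"] == truth["recommended_action"]:
        reward += 0.15
    # False alarm: a new signal is claimed where the truth is noise.
    if action["classification"] == "new_signal" and truth["classification"] == "noise":
        reward -= 0.10
    # Missed signal: a true new signal is dismissed as noise.
    if action["classification"] == "noise" and truth["classification"] == "new_signal":
        reward -= 0.20
    # Reasoning bonus applies to the hard task only.
    if hard_task:
        text = action.get("reasoning", "").lower()
        if any(keyword in text for keyword in REASONING_KEYWORDS):
            reward += 0.05
    return round(reward, 2)


# Hypothetical ground truth; the severity value is assumed for illustration.
example_truth = {
    "classification": "new_signal",
    "suspect_drug": "Cardiovexa",
    "severity_assessment": "severe",
    "recommended_action": "escalate",
}
```

Under these assumptions, an action matching all four fields of `example_truth` tallies `0.85`, while the same action with `classification` set to `noise` tallies `0.40` because it loses the classification credit and takes the missed-signal penalty.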

	Notes:
	- Step-level rewards may be slightly negative for clearly unsafe or suboptimal actions.
	- Final grader outputs remain deterministic and strictly bounded inside `(0, 1)` for evaluation safety.
	- `suspect_drug` matching is forgiving for the hard task and allows substring matches.
	- The environment is deterministic and reproducible because all tasks and grading logic are hardcoded.
	- Confidence is optional, but calibrated confidence can improve reward while reckless overconfidence is penalized.
	- Step 1 gives partial reward for initial triage and returns new review context; step 2 gives the final adjudicated reward.
	- The environment also rewards productive revision and penalizes stubbornly repeating a weak initial answer or making an unjustified late flip.
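
Two of the notes above lend themselves to a short sketch: keeping a final score strictly inside `(0, 1)`, and forgiving substring matching for `suspect_drug`. Both helpers are hypothetical; the margin `eps=0.01` and the direction of the substring test (accepted term contained in the agent's answer) are assumptions, not values read from the grader.

```python
from typing import Iterable


def clamp_open_unit(score: float, eps: float = 0.01) -> float:
    """Clamp a raw score into [eps, 1 - eps] so it never reaches 0 or 1."""
    return min(1.0 - eps, max(eps, score))


def suspect_matches(predicted: str, accepted_terms: Iterable[str]) -> bool:
    """Return True if any accepted term appears inside the predicted text."""
    text = predicted.lower()
    return any(term.lower() in text for term in accepted_terms)
```

For example, `suspect_matches("Voriconazole-Tacrolimus interaction", ["voriconazole", "tacrolimus"])` is true under this rule, while an answer that names only `Trimethoprim-sulfamethoxazole` is not.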

	## Project Structure

	| Path | Purpose |
	|---|---|
	| `env.py` | Main environment class and Pydantic models |
	| `tasks.py` | Task definitions and grader functions |
	| `data.py` | Synthetic reports and drug interaction database |
	| `server.py` | Root FastAPI entrypoint |
	| `server/app.py` | OpenEnv-compatible app entrypoint |
	| `inference.py` | Baseline inference runner |
	| `openenv.yaml` | OpenEnv metadata |
	| `Dockerfile` | Multi-stage OpenEnv-style container build |
	| `tests/test_env.py` | Local tests |
	| `validate-submission.sh` | Pre-submission validation helper |

	## Running Locally

	### Option 1: Local virtual environment

	If you have already created the local virtual environment in this repo, activate it (PowerShell):

	```powershell
	.\.venv\Scripts\Activate.ps1
	```

	Install dependencies if needed:

	```bash
	pip install -r requirements.txt
	```

	Start the server:

	```bash
	uvicorn server:app --host 0.0.0.0 --port 7860
	```

	### Option 2: Docker

	Build the image:

	```bash
	docker build -t pharmacovigilance-env .
	```

	Run the container:

	```bash
	docker run -p 7860:7860 pharmacovigilance-env
	```

	The health endpoint will be available at:

	```text
	http://localhost:7860/health
	```

	## API Endpoints

	| Method | Endpoint | Description |
	|---|---|---|
	| `POST` | `/reset` | Starts a task and returns the initial observation |
	| `POST` | `/step` | Submits the current agent action and returns observation, reward, done, info |
	| `GET` | `/state` | Returns internal environment state summary |
	| `GET` | `/tasks` | Lists available task ids |
	| `GET` | `/health` | Health check endpoint |
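
A client for these endpoints could look roughly like the sketch below, which uses only the standard library. The helper names `build_request` and `call` are hypothetical, and the request body shapes (for example `{"task_id": ...}` for `/reset`) and JSON responses are assumptions; check `server.py` for the actual schemas.

```python
import json
from typing import Optional
from urllib.request import Request, urlopen


def build_request(base_url: str, method: str, path: str, payload: Optional[dict] = None) -> Request:
    """Build a request for one endpoint; a payload is sent as a JSON body."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return Request(base_url.rstrip("/") + path, data=data, headers=headers, method=method)


def call(request: Request, timeout: float = 10.0) -> dict:
    """Send the request and decode the response, assuming it is JSON."""
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage against a locally running server (not executed here):
#   base = "http://localhost:7860"
#   tasks = call(build_request(base, "GET", "/tasks"))
#   obs = call(build_request(base, "POST", "/reset", {"task_id": "known_signal_easy"}))
```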

	## Baseline Inference Script

	The required baseline runner is `inference.py`.

	It:
	- reads `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN`, and optional `ENV_URL`
	- uses the OpenAI client for all model calls
	- runs all three tasks sequentially
	- follows the full 2-step episode loop until `done=true`
	- emits the required `[START]`, `[STEP]`, and `[END]` lines
	- keeps stdout restricted to the judge-expected line types

	Environment variables (`ENV_URL` is optional):

	```bash
	export API_BASE_URL=https://router.huggingface.co/v1
	export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
	export HF_TOKEN=hf_your_token_here
	export ENV_URL=http://localhost:7860
	```
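
One way a runner might read these variables is sketched below. This is not the code in `inference.py`: `RunnerConfig` and `load_config` are hypothetical names, and the fallback of `http://localhost:7860` for `ENV_URL` is an assumption based on the example value above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RunnerConfig:
    api_base_url: str
    model_name: str
    hf_token: str
    env_url: str


def load_config(env: Optional[Mapping[str, str]] = None) -> RunnerConfig:
    """Read runner settings, failing early if a required variable is missing."""
    source = os.environ if env is None else env
    required = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")
    missing = [name for name in required if not source.get(name)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    return RunnerConfig(
        api_base_url=source["API_BASE_URL"],
        model_name=source["MODEL_NAME"],
        hf_token=source["HF_TOKEN"],
        # ENV_URL is optional; the default here is an assumed fallback.
        env_url=source.get("ENV_URL", "http://localhost:7860"),
    )
```

Accepting a mapping argument keeps the function testable without touching the real process environment.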

	Run:

	```bash
	python inference.py
	```

	## Testing And Validation

	Run local tests:

	```bash
	pytest tests/test_env.py -q
	```

	Run OpenEnv validation:

	```bash
	openenv validate
	```

	Run the pre-submission helper:

	```bash
	chmod +x validate-submission.sh
	./validate-submission.sh https://your-space.hf.space
	```

	That script checks:
	1. your Hugging Face Space responds to `POST /reset`
	2. the Docker image builds
	3. `openenv validate` passes

	## Submission Checklist

	- `openenv validate` passes
	- `docker build` succeeds
	- `docker run` starts cleanly
	- `POST /reset` returns HTTP `200`
	- `inference.py` runs all 3 tasks successfully
	- your Hugging Face Space responds to `POST /reset`
	- replace the expected baseline values with your measured live baseline values before final submission

	## Notes

	- No external API calls are made by the environment itself.
	- The drug interaction database is hardcoded.
	- Ground truth is never exposed in the observation returned to the agent.
	- The environment is lightweight enough for a 2 vCPU / 8 GB RAM target.
	- The expected baseline scores in this README are planning targets until replaced with measured live results.