---
title: Pharmacovigilance Signal Detector
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: OpenEnv pharmacovigilance signal detection environment
tags:
  - openenv
  - healthcare
  - pharmacovigilance
  - safety
  - real-world
base_path: /web
---

# Pharmacovigilance Signal Detector

Pharmacovigilance Signal Detector is a real-world OpenEnv environment in which an agent acts as a drug-safety analyst. The agent reviews synthetic adverse event reports, consults a hardcoded drug interaction knowledge base, and decides whether each case is a new safety signal, a known side effect, or low-value noise. This mirrors the pharmacovigilance triage work performed by regulators and pharmaceutical safety teams.

All case data in this repo is synthetic. No real patient data is used.

## Why This Environment Matters

Pharmacovigilance teams are responsible for detecting harmful safety patterns after a drug is already on the market. That work is operationally important, high-stakes, and difficult: analysts must distinguish expected reactions from true emerging risks, recognize confounding from polypharmacy, and escalate only when justified. This makes the domain a strong fit for agent evaluation because it tests causal reasoning, prioritization, and safety-sensitive decision making.

## Environment Overview

| Item | Value |
| --- | --- |
| Environment name | `pharma-vigilance` |
| Domain | Pharmacovigilance / drug safety triage |
| Episode length | 2-step triage and review workflow |
| Task count | 3 |
| Difficulties | Easy, Medium, Hard |
| Step reward range | -0.25 to 1.0 |
| Final grader range | strict (0, 1) |
| API | `reset()`, `step()`, `state()` |
| Server | FastAPI |

Each episode has two phases. On step 1 the agent performs an initial triage. The environment then returns additional senior-review context through the `feedback` field, and on step 2 the agent submits a final reviewed assessment. Each task includes one or more synthetic reports plus a hardcoded drug interaction database. The environment never exposes ground truth to the agent.
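The two-phase loop above can be driven over HTTP. The sketch below is a reader's illustration, not the shipped client: the endpoint paths match the API Endpoints table in this README, but the JSON field names (`task_id`, `reward`, `done`) are assumptions about the server schema, and the toy policy simply repeats one answer instead of revising on feedback.

```python
# Minimal driver for one two-step episode over HTTP (stdlib only).
# Endpoint paths follow the API Endpoints table; the JSON field names
# ("task_id", "reward", "done") are assumptions about the actual schema.
import json
import urllib.request

ENV_URL = "http://localhost:7860"

def post_json(path, payload):
    """POST a JSON payload to the environment server and decode the reply."""
    req = urllib.request.Request(
        ENV_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def triage_action(step_number):
    """Toy policy: submits the same judgment on both steps. A real agent
    should revise using the senior-review feedback returned after step 1."""
    return {
        "classification": "known_side_effect",
        "suspect_drug": "Lisinopril",
        "severity_assessment": "mild",
        "recommended_action": "log_and_monitor",
        "reasoning": f"Step {step_number}: dry cough is a labeled ACE-inhibitor effect.",
        "confidence": 85,
    }

if __name__ == "__main__":
    post_json("/reset", {"task_id": "known_signal_easy"})
    done, step = False, 1
    while not done:
        result = post_json("/step", triage_action(step))
        print(f"step {step} reward:", result.get("reward"))
        done, step = result.get("done", False), step + 1
```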

## Action Space

| Field | Type | Allowed values | Purpose |
| --- | --- | --- | --- |
| `classification` | `str` | `new_signal`, `known_side_effect`, `noise`, `duplicate` | Overall pharmacovigilance judgment |
| `suspect_drug` | `str` | Free text | Drug or interaction the agent believes is causal |
| `severity_assessment` | `str` | `mild`, `moderate`, `severe`, `critical` | Clinical severity assessment |
| `recommended_action` | `str` | `escalate`, `log_and_monitor`, `dismiss`, `request_more_info` | Operational follow-up |
| `reasoning` | `str` | Free text | Short explanation used for the grading bonus on the hard task |
| `confidence` | `Optional[int]` | 0 to 100 | Optional analyst confidence used for calibration-aware reward shaping |
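A concrete action for the hard task might look like the following. The field names come from the table above; the client-side validity check is purely illustrative and is not part of the environment.

```python
# Allowed values per field, copied from the Action Space table.
ALLOWED = {
    "classification": {"new_signal", "known_side_effect", "noise", "duplicate"},
    "severity_assessment": {"mild", "moderate", "severe", "critical"},
    "recommended_action": {"escalate", "log_and_monitor", "dismiss", "request_more_info"},
}

def validate_action(action: dict) -> dict:
    """Illustrative client-side sanity check before submitting an action."""
    for field, allowed in ALLOWED.items():
        if action.get(field) not in allowed:
            raise ValueError(f"{field}={action.get(field)!r} not in {sorted(allowed)}")
    if "confidence" in action and not 0 <= action["confidence"] <= 100:
        raise ValueError("confidence must be between 0 and 100")
    return action

# An action an analyst might submit on step 2 of the hard task.
action = validate_action({
    "classification": "new_signal",
    "suspect_drug": "Voriconazole-Tacrolimus interaction",
    "severity_assessment": "severe",
    "recommended_action": "escalate",
    "reasoning": "AKI pattern fits an azole/calcineurin-inhibitor interaction; "
                 "tacrolimus level monitoring supports this over TMP-SMX.",
    "confidence": 70,
})
```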

## Observation Space

| Field | Type | Description |
| --- | --- | --- |
| `task_id` | `str` | Current task identifier |
| `reports` | `List[AdverseEventReport]` | Synthetic adverse event reports for the task |
| `drug_interaction_db` | `dict` | Hardcoded safety and interaction hints |
| `step_number` | `int` | Current step index |
| `max_steps` | `int` | Maximum number of steps in the episode |
| `feedback` | `Optional[str]` | Feedback or senior-review note returned after the previous action |

Each `AdverseEventReport` contains:

| Field | Description |
| --- | --- |
| `report_id` | Unique synthetic report identifier |
| `patient_age` | Patient age |
| `patient_sex` | Patient sex |
| `drugs` | All drugs the patient was taking |
| `suspect_drug` | Drug named by the original reporter |
| `reaction` | Observed adverse reaction |
| `onset_days` | Days after drug start when the reaction began |
| `severity` | Reported severity |
| `outcome` | Recovery status |
| `similar_reports_last_30d` | Count of similar recent reports |
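Put together, one entry in `reports` might look like the following record. All values here are made up for documentation; each task ships its own synthetic data in `data.py`.

```python
# Illustrative synthetic adverse event report with the fields listed above.
# Every value is invented for this example.
report = {
    "report_id": "AER-2024-0001",
    "patient_age": 62,
    "patient_sex": "F",
    "drugs": ["Lisinopril", "Metformin"],
    "suspect_drug": "Lisinopril",
    "reaction": "persistent dry cough",
    "onset_days": 14,
    "severity": "mild",
    "outcome": "recovering",
    "similar_reports_last_30d": 41,
}
```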

## Tasks

| Task | Difficulty | Scenario | Ground-truth goal | Expected baseline |
| --- | --- | --- | --- | --- |
| `known_signal_easy` | Easy | Patient on Lisinopril develops a persistent dry cough, with many similar recent reports; the reaction is already known in-label | Recognize a known side effect and recommend `log_and_monitor` | Around 0.85 |
| `cluster_signal_medium` | Medium | Four recent Cardiovexa cases show symptomatic bradycardia and near-syncope despite no labeled rhythm toxicity | Recognize a plausible emerging signal and escalate | Around 0.65 |
| `confounded_hard` | Hard | A transplant patient's acute kidney injury is blamed on Trimethoprim-sulfamethoxazole, but the deeper issue is a Voriconazole-Tacrolimus interaction | Detect the interaction, classify as `new_signal`, and escalate | Around 0.40 |

The hard task is intentionally more difficult: the named suspect drug is not the true cause, so the agent must reason over interaction evidence and therapeutic drug-monitoring clues in the provided drug interaction database.

## Reward Function

The environment uses deterministic programmatic graders. Reward is shaped across the full two-step trajectory:

1. an initial triage reward on step 1
2. a final review reward on step 2, after additional context arrives

Within each step, the agent is also scored on classification, causal attribution, severity, and action, then receives extra credit if those sub-decisions form a coherent triage story.

| Reward component | Value |
| --- | --- |
| Correct `classification` | +0.25 |
| Correct `suspect_drug` | +0.25 |
| Correct `severity_assessment` | +0.20 |
| Correct `recommended_action` | +0.15 |
| Consistency bonus when classification, severity, and action form a coherent pharmacovigilance pipeline | +0.10 |
| Calibration bonus for high-confidence correct answers | +0.05 |
| Overconfidence penalty for high-confidence weak answers | -0.10 |
| Underconfidence penalty for low-confidence strong answers | -0.03 |
| False alarm penalty: agent says `new_signal` when the truth is noise | -0.10 |
| Missed signal penalty: agent says `noise` when the truth is a new signal | -0.20 |
| Hard-task reasoning bonus if the explanation mentions drug interaction, tacrolimus, voriconazole, azole, calcineurin, or level monitoring | +0.05 |
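The component arithmetic can be sketched as below. This is a reader's back-of-the-envelope reconstruction of the table, not the environment's actual grader (that lives in `tasks.py`); the `penalties` and `calibration_bonus` arguments bundle the signed adjustments from the table's last rows.

```python
# Back-of-the-envelope step reward mirroring the component table above.
# This is a documentation sketch, not the grader shipped in tasks.py.
def step_reward(correct_cls, correct_drug, correct_sev, correct_act,
                coherent=False, calibration_bonus=0.0, penalties=0.0):
    reward = 0.0
    reward += 0.25 if correct_cls else 0.0   # correct classification
    reward += 0.25 if correct_drug else 0.0  # correct suspect_drug
    reward += 0.20 if correct_sev else 0.0   # correct severity_assessment
    reward += 0.15 if correct_act else 0.0   # correct recommended_action
    reward += 0.10 if coherent else 0.0      # consistency bonus
    return reward + calibration_bonus + penalties

# A fully correct, coherent, well-calibrated answer tops out at 1.0:
best = step_reward(True, True, True, True, coherent=True, calibration_bonus=0.05)
```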

Notes:

- Step-level rewards may be slightly negative for clearly unsafe or suboptimal actions.
- Final grader outputs remain deterministic and strictly bounded inside (0, 1) for evaluation safety.
- `suspect_drug` matching is forgiving for the hard task and allows substring matches.
- The environment is deterministic and reproducible because all tasks and grading logic are hardcoded.
- Confidence is optional, but calibrated confidence can improve reward, while reckless overconfidence is penalized.
- Step 1 gives partial reward for initial triage and returns new review context; step 2 gives the final adjudicated reward.
- The environment also rewards productive revision and penalizes both stubbornly repeating a weak initial answer and making an unjustified late flip.
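The revision behavior described in the last note above could take a shape like the following. This is purely a reader's sketch: the branch conditions paraphrase the note, and the adjustment magnitudes are invented placeholders, not values from the actual grader.

```python
# Hedged sketch of step-1-to-step-2 revision scoring, paraphrasing the notes
# above. The +/-0.05 magnitudes are invented placeholders; see tasks.py for
# the real logic.
def revision_adjustment(step1_correct, step2_correct, changed_answer):
    if not step1_correct and step2_correct and changed_answer:
        return 0.05    # productive revision after the new review context
    if not step1_correct and not step2_correct and not changed_answer:
        return -0.05   # stubbornly repeating a weak initial answer
    if step1_correct and not step2_correct and changed_answer:
        return -0.05   # unjustified late flip away from a good answer
    return 0.0
```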

## Project Structure

| Path | Purpose |
| --- | --- |
| `env.py` | Main environment class and Pydantic models |
| `tasks.py` | Task definitions and grader functions |
| `data.py` | Synthetic reports and drug interaction database |
| `server.py` | Root FastAPI entrypoint |
| `server/app.py` | OpenEnv-compatible app entrypoint |
| `inference.py` | Baseline inference runner |
| `openenv.yaml` | OpenEnv metadata |
| `Dockerfile` | Multi-stage OpenEnv-style container build |
| `tests/test_env.py` | Local tests |
| `validate-submission.sh` | Pre-submission validation helper |

## Running Locally

### Option 1: Local virtual environment

If you already created the local virtual environment in this repo, activate it (Windows PowerShell):

```powershell
.\.venv\Scripts\Activate.ps1
```

On macOS/Linux, use `source .venv/bin/activate` instead.

Install dependencies if needed:

```bash
pip install -r requirements.txt
```

Start the server:

```bash
uvicorn server:app --host 0.0.0.0 --port 7860
```

### Option 2: Docker

Build the image:

```bash
docker build -t pharmacovigilance-env .
```

Run the container:

```bash
docker run -p 7860:7860 pharmacovigilance-env
```

The health endpoint will be available at `http://localhost:7860/health`.

## API Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | `/reset` | Starts a task and returns the initial observation |
| POST | `/step` | Submits the current agent action and returns observation, reward, done, and info |
| GET | `/state` | Returns an internal environment state summary |
| GET | `/tasks` | Lists available task ids |
| GET | `/health` | Health check endpoint |

## Baseline Inference Script

The required baseline runner is `inference.py`. It:

- reads `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN`, and the optional `ENV_URL`
- uses the OpenAI client for all model calls
- runs all three tasks sequentially
- follows the full 2-step episode loop until `done=true`
- emits the required `[START]`, `[STEP]`, and `[END]` lines
- keeps stdout restricted to the judge-expected line types

Required environment variables:

```bash
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
export HF_TOKEN=hf_your_token_here
export ENV_URL=http://localhost:7860
```

Run:

```bash
python inference.py
```

## Testing And Validation

Run local tests:

```bash
pytest tests/test_env.py -q
```

Run OpenEnv validation:

```bash
openenv validate
```

Run the pre-submission helper:

```bash
chmod +x validate-submission.sh
./validate-submission.sh https://your-space.hf.space
```

That script checks that:

1. your Hugging Face Space responds to `POST /reset`
2. the Docker image builds
3. `openenv validate` passes

## Submission Checklist

- [ ] `openenv validate` passes
- [ ] `docker build` succeeds
- [ ] `docker run` starts cleanly
- [ ] `POST /reset` returns HTTP 200
- [ ] `inference.py` runs all 3 tasks successfully
- [ ] your Hugging Face Space responds to `POST /reset`
- [ ] the expected baseline values are replaced with your measured live baseline values before final submission

## Notes

- No external API calls are made by the environment itself.
- The drug interaction database is hardcoded.
- Ground truth is never exposed in the observation returned to the agent.
- The environment is lightweight enough for a 2 vCPU / 8 GB RAM target.
- The expected baseline scores in this README are planning targets until they are replaced with measured live results.