---
title: Pharmacovigilance Signal Detector
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: OpenEnv pharmacovigilance signal detection environment
tags:
  - openenv
  - healthcare
  - pharmacovigilance
  - safety
  - real-world
base_path: /web
---

# Pharmacovigilance Signal Detector

`Pharmacovigilance Signal Detector` is a real-world OpenEnv environment where the agent acts as a drug-safety analyst. The agent reviews synthetic adverse event reports, consults a hardcoded drug interaction knowledge base, and decides whether each case is a new safety signal, a known side effect, or low-value noise. This mirrors the pharmacovigilance triage work performed by regulators and pharmaceutical safety teams.

All case data in this repo is synthetic. No real patient data is used.

## Why This Environment Matters

Pharmacovigilance teams are responsible for detecting harmful safety patterns after a drug is already on the market. That work is operationally important, high-stakes, and difficult: analysts must distinguish expected reactions from true emerging risks, recognize confounding from polypharmacy, and escalate only when justified. This makes the domain a strong fit for agent evaluation because it tests causal reasoning, prioritization, and safety-sensitive decision making.

## Environment Overview

| Item | Value |
|---|---|
| Environment name | `pharma-vigilance` |
| Domain | Pharmacovigilance / drug safety triage |
| Episode length | 2-step triage and review workflow |
| Task count | 3 |
| Difficulties | Easy, Medium, Hard |
| Step reward range | `-0.25` to `1.0` |
| Final grader range | strictly inside `(0, 1)` |
| API | `reset()`, `step()`, `state()` |
| Server | FastAPI |

Each episode has two phases. On step 1 the agent performs an initial triage. The environment then returns additional senior-review context through feedback, and on step 2 the agent submits a final reviewed assessment.
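The two-phase flow can be sketched as a minimal client loop. This is illustrative only: `run_episode` and the `client` wrapper are hypothetical names, the real transport is the HTTP API, and the hardcoded action simply shows the shape of the action space.

```python
# Illustrative 2-step episode loop. The `client` object is a hypothetical
# wrapper over the environment's HTTP API (`/reset`, `/step`).

def run_episode(client, task_id="known_signal_easy"):
    """Run one two-phase triage episode and return the accumulated reward."""
    obs = client.reset(task_id)  # step 1 observation: reports + interaction hints
    total_reward = 0.0
    done = False
    while not done:
        # On step 2, `obs` would also carry the senior-review feedback that
        # the agent should use to revise its initial triage.
        action = {
            "classification": "known_side_effect",
            "suspect_drug": "Lisinopril",
            "severity_assessment": "moderate",
            "recommended_action": "log_and_monitor",
            "reasoning": "Persistent dry cough is a well-documented effect.",
            "confidence": 80,  # optional; used for calibration-aware shaping
        }
        obs, reward, done, info = client.step(action)
        total_reward += reward
    return total_reward
```

A real agent would of course derive the action from the observation (and revise it on step 2) rather than submit a fixed answer.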
Each task includes one or more synthetic reports plus a hardcoded drug interaction database. The environment never exposes ground truth to the agent.

## Action Space

| Field | Type | Allowed values | Purpose |
|---|---|---|---|
| `classification` | `str` | `new_signal`, `known_side_effect`, `noise`, `duplicate` | Overall pharmacovigilance judgment |
| `suspect_drug` | `str` | Free text | Drug or interaction the agent believes is causal |
| `severity_assessment` | `str` | `mild`, `moderate`, `severe`, `critical` | Clinical severity assessment |
| `recommended_action` | `str` | `escalate`, `log_and_monitor`, `dismiss`, `request_more_info` | Operational follow-up |
| `reasoning` | `str` | Free text | Short explanation used for the grading bonus on the hard task |
| `confidence` | `Optional[int]` | `0` to `100` | Optional analyst confidence used for calibration-aware reward shaping |

## Observation Space

| Field | Type | Description |
|---|---|---|
| `task_id` | `str` | Current task identifier |
| `reports` | `List[AdverseEventReport]` | Synthetic adverse event reports for the task |
| `drug_interaction_db` | `dict` | Hardcoded safety and interaction hints |
| `step_number` | `int` | Current step index |
| `max_steps` | `int` | Maximum number of steps in the episode |
| `feedback` | `Optional[str]` | Feedback or senior-review note returned after the previous action |

Each `AdverseEventReport` contains:

| Field | Description |
|---|---|
| `report_id` | Unique synthetic report identifier |
| `patient_age` | Patient age |
| `patient_sex` | Patient sex |
| `drugs` | All drugs the patient was taking |
| `suspect_drug` | Drug named by the original reporter |
| `reaction` | Observed adverse reaction |
| `onset_days` | Days after drug start when the reaction began |
| `severity` | Reported severity |
| `outcome` | Recovery status |
| `similar_reports_last_30d` | Count of similar recent reports |

## Tasks

| Task | Difficulty | Scenario | Ground-truth goal | Expected baseline |
|---|---|---|---|---|
| `known_signal_easy` | Easy | Patient on `Lisinopril` develops persistent dry cough with many similar recent reports already known in-label | Recognize a known side effect and recommend `log_and_monitor` | Around `0.85` |
| `cluster_signal_medium` | Medium | Four recent `Cardiovexa` cases show symptomatic bradycardia and near-syncope despite no labeled rhythm toxicity | Recognize a plausible emerging signal and `escalate` | Around `0.65` |
| `confounded_hard` | Hard | Transplant patient with acute kidney injury is blamed on `Trimethoprim-sulfamethoxazole`, but the deeper issue is a `Voriconazole`-`Tacrolimus` interaction | Detect the interaction, classify as `new_signal`, and `escalate` | Around `0.40` |

The hard task is intentionally more difficult because the named suspect drug is not the true cause. The agent must reason over interaction evidence and therapeutic drug-monitoring clues in the provided hardcoded drug interaction database.

## Reward Function

The environment uses deterministic programmatic graders. Reward is shaped across the full two-step trajectory:

1. an initial triage reward on step 1
2. a final review reward on step 2, after additional context arrives

Within each step, the agent is also scored on classification, causal attribution, severity, and action, then receives extra credit if those sub-decisions form a coherent triage story.
| Reward component | Value |
|---|---|
| Correct `classification` | `+0.25` |
| Correct `suspect_drug` | `+0.25` |
| Correct `severity_assessment` | `+0.20` |
| Correct `recommended_action` | `+0.15` |
| Consistency bonus when classification, severity, and action form a coherent pharmacovigilance pipeline | `+0.10` |
| Calibration bonus for high-confidence correct answers | `+0.05` |
| Overconfidence penalty for high-confidence weak answers | `-0.10` |
| Underconfidence penalty for low-confidence strong answers | `-0.03` |
| False alarm penalty: agent says `new_signal` when truth is `noise` | `-0.10` |
| Missed signal penalty: agent says `noise` when truth is `new_signal` | `-0.20` |
| Hard-task reasoning bonus if the explanation mentions `drug interaction`, `tacrolimus`, `voriconazole`, `azole`, `calcineurin`, or `level monitoring` | `+0.05` |

Notes:

- Step-level rewards may be slightly negative for clearly unsafe or suboptimal actions.
- Final grader outputs remain deterministic and strictly bounded inside `(0, 1)` for evaluation safety.
- `suspect_drug` matching is forgiving for the hard task and allows substring matches.
- The environment is deterministic and reproducible because all tasks and grading logic are hardcoded.
- Confidence is optional, but calibrated confidence can improve reward while reckless overconfidence is penalized.
- Step 1 gives partial reward for initial triage and returns new review context; step 2 gives the final adjudicated reward.
- The environment also rewards productive revision and penalizes stubbornly repeating a weak initial answer or making an unjustified late flip.
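Put together, the step-level shaping behaves roughly like the additive grader sketched below. This is a simplification under assumptions: the real graders live in `tasks.py`, the confidence thresholds here are invented for illustration, and the consistency bonus and trajectory-level revision logic are not reproduced.

```python
def grade_step(action: dict, truth: dict) -> float:
    """Illustrative additive step reward mirroring the component table.

    `truth` is a hypothetical ground-truth record; the real grading code
    in `tasks.py` is authoritative.
    """
    reward = 0.0
    if action["classification"] == truth["classification"]:
        reward += 0.25
    # Substring matching keeps the hard task forgiving about drug naming.
    if truth["suspect_drug"].lower() in action["suspect_drug"].lower():
        reward += 0.25
    if action["severity_assessment"] == truth["severity_assessment"]:
        reward += 0.20
    if action["recommended_action"] == truth["recommended_action"]:
        reward += 0.15
    # Safety-asymmetric penalties: a missed signal costs more than a false alarm.
    if action["classification"] == "new_signal" and truth["classification"] == "noise":
        reward -= 0.10
    if action["classification"] == "noise" and truth["classification"] == "new_signal":
        reward -= 0.20
    # Calibration shaping (thresholds are illustrative, not the real cutoffs).
    confidence = action.get("confidence")
    if confidence is not None:
        strong = reward >= 0.65
        if confidence >= 80 and strong:
            reward += 0.05   # calibration bonus
        elif confidence >= 80 and not strong:
            reward -= 0.10   # overconfidence penalty
        elif confidence <= 30 and strong:
            reward -= 0.03   # underconfidence penalty
    return reward
```

Under this sketch, a fully correct, high-confidence answer on the easy task scores `0.25 + 0.25 + 0.20 + 0.15 + 0.05 = 0.90` before any consistency bonus.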
## Project Structure

| Path | Purpose |
|---|---|
| `env.py` | Main environment class and Pydantic models |
| `tasks.py` | Task definitions and grader functions |
| `data.py` | Synthetic reports and drug interaction database |
| `server.py` | Root FastAPI entrypoint |
| `server/app.py` | OpenEnv-compatible app entrypoint |
| `inference.py` | Baseline inference runner |
| `openenv.yaml` | OpenEnv metadata |
| `Dockerfile` | Multi-stage OpenEnv-style container build |
| `tests/test_env.py` | Local tests |
| `validate-submission.sh` | Pre-submission validation helper |

## Running Locally

### Option 1: Local virtual environment

If you already created the local virtual environment in this repo:

```powershell
.\.venv\Scripts\Activate.ps1
```

Install dependencies if needed:

```bash
pip install -r requirements.txt
```

Start the server:

```bash
uvicorn server:app --host 0.0.0.0 --port 7860
```

### Option 2: Docker

Build the image:

```bash
docker build -t pharmacovigilance-env .
```

Run the container:

```bash
docker run -p 7860:7860 pharmacovigilance-env
```

The health endpoint will be available at:

```text
http://localhost:7860/health
```

## API Endpoints

| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/reset` | Starts a task and returns the initial observation |
| `POST` | `/step` | Submits the current agent action and returns observation, reward, done, info |
| `GET` | `/state` | Returns internal environment state summary |
| `GET` | `/tasks` | Lists available task ids |
| `GET` | `/health` | Health check endpoint |

## Baseline Inference Script

The required baseline runner is `inference.py`.
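For quick manual experiments against the endpoints above, or as the transport layer inside a runner, a minimal stdlib-only HTTP helper might look like the sketch below. The helper name `call` and the payload field names are illustrative; the authoritative request schemas are defined by `server.py`.

```python
# Minimal JSON-over-HTTP helper using only the standard library.
# Payload field names below are illustrative, not a schema reference.
import json
import urllib.request
from typing import Optional

def call(base_url: str, endpoint: str, payload: Optional[dict] = None) -> dict:
    """POST `payload` as JSON (or GET when payload is None) and decode the reply."""
    url = f"{base_url}{endpoint}"
    if payload is None:
        req = urllib.request.Request(url)  # plain GET, e.g. /health or /tasks
    else:
        req = urllib.request.Request(
            url,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Typical session against a locally running server (illustrative payloads):
#   obs = call("http://localhost:7860", "/reset", {"task_id": "known_signal_easy"})
#   result = call("http://localhost:7860", "/step", {"action": {...}})
```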
The runner:

- reads `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN`, and optional `ENV_URL`
- uses the OpenAI client for all model calls
- runs all three tasks sequentially
- follows the full 2-step episode loop until `done=true`
- emits the required `[START]`, `[STEP]`, and `[END]` lines
- keeps stdout restricted to the judge-expected line types

Required environment variables:

```bash
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
export HF_TOKEN=hf_your_token_here
export ENV_URL=http://localhost:7860
```

Run:

```bash
python inference.py
```

## Testing And Validation

Run local tests:

```bash
pytest tests/test_env.py -q
```

Run OpenEnv validation:

```bash
openenv validate
```

Run the pre-submission helper:

```bash
chmod +x validate-submission.sh
./validate-submission.sh https://your-space.hf.space
```

That script checks:

1. your Hugging Face Space responds to `POST /reset`
2. the Docker image builds
3. `openenv validate` passes

## Submission Checklist

- `openenv validate` passes
- `docker build` succeeds
- `docker run` starts cleanly
- `POST /reset` returns HTTP `200`
- `inference.py` runs all 3 tasks successfully
- your Hugging Face Space responds to `POST /reset`
- replace the expected baseline values with your measured live baseline values before final submission

## Notes

- No external API calls are made by the environment itself.
- The drug interaction database is hardcoded.
- Ground truth is never exposed in the observation returned to the agent.
- The environment is lightweight enough for a 2 vCPU / 8GB RAM target.
- The expected baseline scores in this README are planning targets until replaced with measured live results.