---
title: Pharmacovigilance Signal Detector
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: OpenEnv pharmacovigilance signal detection environment
tags:
  - openenv
  - healthcare
  - pharmacovigilance
  - safety
  - real-world
base_path: /web
---
# Pharmacovigilance Signal Detector

`Pharmacovigilance Signal Detector` is a real-world OpenEnv environment where an agent acts as a drug-safety analyst. The agent reviews synthetic adverse event reports, consults a hardcoded drug interaction knowledge base, and decides whether each case is a new safety signal, a known side effect, or low-value noise. This mirrors the pharmacovigilance triage work performed by regulators and pharmaceutical safety teams.

All case data in this repo is synthetic. No real patient data is used.

## Why This Environment Matters

Pharmacovigilance teams are responsible for detecting harmful safety patterns after a drug is already on the market. That work is operationally important, high-stakes, and difficult: analysts must distinguish expected reactions from true emerging risks, recognize confounding from polypharmacy, and escalate only when justified. This makes the domain a strong fit for agent evaluation because it tests causal reasoning, prioritization, and safety-sensitive decision making.

## Environment Overview

| Item | Value |
|---|---|
| Environment name | `pharma-vigilance` |
| Domain | Pharmacovigilance / drug safety triage |
| Episode length | 2-step triage and review workflow |
| Task count | 3 |
| Difficulties | Easy, Medium, Hard |
| Step reward range | `-0.25` to `1.0` |
| Final grader range | strict `(0, 1)` |
| API | `reset()`, `step()`, `state()` |
| Server | FastAPI |

Each episode has two phases. On step 1 the agent performs an initial triage. The environment then returns additional senior-review context through feedback, and on step 2 the agent submits a final reviewed assessment. Each task includes one or more synthetic reports plus a hardcoded drug interaction database. The environment never exposes ground truth to the agent.
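The two-phase loop above can be sketched as a small client driver. This is a sketch under assumptions: the exact request and response payload shapes (`{"task_id": ...}`, `{"action": ...}`, the `observation`/`reward`/`done` keys) are inferred from the endpoint descriptions below, not confirmed by the server code. The HTTP call is injected so the control flow can be tested without a running server.

```python
import json
import urllib.request


def http_post(base_url: str, path: str, payload: dict) -> dict:
    """Minimal JSON POST helper using only the standard library."""
    req = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


def run_episode(post, task_id: str, triage_action: dict, final_action: dict):
    """Drive one two-step episode: initial triage, then a reviewed answer.

    `post(path, payload) -> dict` abstracts the HTTP call, e.g.
    `lambda p, body: http_post("http://localhost:7860", p, body)`.
    """
    post("/reset", {"task_id": task_id})              # phase 1 observation
    first = post("/step", {"action": triage_action})  # partial reward + feedback
    # first["observation"]["feedback"] carries the senior-review context
    second = post("/step", {"action": final_action})  # final adjudicated reward
    return first["reward"], second["reward"], second["done"]
```

Because the transport is injected, the same loop works against a local `uvicorn` server, a Docker container, or a deployed Space by swapping the base URL.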
## Action Space

| Field | Type | Allowed values | Purpose |
|---|---|---|---|
| `classification` | `str` | `new_signal`, `known_side_effect`, `noise`, `duplicate` | Overall pharmacovigilance judgment |
| `suspect_drug` | `str` | Free text | Drug or interaction the agent believes is causal |
| `severity_assessment` | `str` | `mild`, `moderate`, `severe`, `critical` | Clinical severity assessment |
| `recommended_action` | `str` | `escalate`, `log_and_monitor`, `dismiss`, `request_more_info` | Operational follow-up |
| `reasoning` | `str` | Free text | Short explanation used for the grading bonus on the hard task |
| `confidence` | `Optional[int]` | `0` to `100` | Optional analyst confidence used for calibration-aware reward shaping |
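Assembled as a plain dict, an action for this space might look like the sketch below. Field names and enumerated values follow the table above; the local validation helper and the example values are illustrative, not part of the environment's own API.

```python
ALLOWED_CLASSIFICATIONS = {"new_signal", "known_side_effect", "noise", "duplicate"}
ALLOWED_SEVERITIES = {"mild", "moderate", "severe", "critical"}
ALLOWED_ACTIONS = {"escalate", "log_and_monitor", "dismiss", "request_more_info"}


def make_action(classification, suspect_drug, severity, recommended_action,
                reasoning="", confidence=None):
    """Build an action dict, checking the enumerated fields client-side."""
    assert classification in ALLOWED_CLASSIFICATIONS
    assert severity in ALLOWED_SEVERITIES
    assert recommended_action in ALLOWED_ACTIONS
    if confidence is not None:
        assert 0 <= confidence <= 100
    return {
        "classification": classification,
        "suspect_drug": suspect_drug,
        "severity_assessment": severity,
        "recommended_action": recommended_action,
        "reasoning": reasoning,
        "confidence": confidence,
    }


# illustrative triage for the easy task
action = make_action("known_side_effect", "Lisinopril", "mild",
                     "log_and_monitor", "In-label ACE-inhibitor cough.", 85)
```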
## Observation Space

| Field | Type | Description |
|---|---|---|
| `task_id` | `str` | Current task identifier |
| `reports` | `List[AdverseEventReport]` | Synthetic adverse event reports for the task |
| `drug_interaction_db` | `dict` | Hardcoded safety and interaction hints |
| `step_number` | `int` | Current step index |
| `max_steps` | `int` | Maximum number of steps in the episode |
| `feedback` | `Optional[str]` | Feedback or senior-review note returned after the previous action |
Each `AdverseEventReport` contains:

| Field | Description |
|---|---|
| `report_id` | Unique synthetic report identifier |
| `patient_age` | Patient age |
| `patient_sex` | Patient sex |
| `drugs` | All drugs the patient was taking |
| `suspect_drug` | Drug named by the original reporter |
| `reaction` | Observed adverse reaction |
| `onset_days` | Days after drug start when the reaction began |
| `severity` | Reported severity |
| `outcome` | Recovery status |
| `similar_reports_last_30d` | Count of similar recent reports |
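Mirroring the field list above, a client-side model of a report could look like this dataclass. The concrete types (e.g. `List[str]` for `drugs`, `int` for `onset_days`) and the example values are assumptions; the authoritative Pydantic models live in `env.py`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AdverseEventReport:
    """Client-side mirror of the report fields listed above (types assumed)."""
    report_id: str
    patient_age: int
    patient_sex: str
    drugs: List[str]            # all drugs the patient was taking
    suspect_drug: str           # drug named by the original reporter
    reaction: str
    onset_days: int             # days after drug start when the reaction began
    severity: str
    outcome: str
    similar_reports_last_30d: int


# illustrative synthetic report, loosely matching the easy task
report = AdverseEventReport(
    report_id="SYN-0001",
    patient_age=67,
    patient_sex="F",
    drugs=["Lisinopril", "Metformin"],
    suspect_drug="Lisinopril",
    reaction="persistent dry cough",
    onset_days=14,
    severity="mild",
    outcome="recovering",
    similar_reports_last_30d=42,
)
```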
## Tasks

| Task | Difficulty | Scenario | Ground-truth goal | Expected baseline |
|---|---|---|---|---|
| `known_signal_easy` | Easy | Patient on `Lisinopril` develops a persistent dry cough, with many similar recent reports of this known in-label reaction | Recognize a known side effect and recommend `log_and_monitor` | Around `0.85` |
| `cluster_signal_medium` | Medium | Four recent `Cardiovexa` cases show symptomatic bradycardia and near-syncope despite no labeled rhythm toxicity | Recognize a plausible emerging signal and `escalate` | Around `0.65` |
| `confounded_hard` | Hard | Acute kidney injury in a transplant patient is blamed on `Trimethoprim-sulfamethoxazole`, but the deeper issue is a `Voriconazole`-`Tacrolimus` interaction | Detect the interaction, classify as `new_signal`, and `escalate` | Around `0.40` |

The hard task is intentionally more difficult because the named suspect drug is not the true cause. The agent must reason over interaction evidence and therapeutic drug-monitoring clues in the provided hardcoded drug database.
## Reward Function

The environment uses deterministic programmatic graders. Reward is shaped across the two-step trajectory:

1. initial triage reward on step 1
2. final review reward on step 2, after additional context arrives

Within each step, the agent is scored on classification, causal attribution, severity, and action, then receives extra credit if those sub-decisions form a coherent triage story.

| Reward component | Value |
|---|---|
| Correct `classification` | `+0.25` |
| Correct `suspect_drug` | `+0.25` |
| Correct `severity_assessment` | `+0.20` |
| Correct `recommended_action` | `+0.15` |
| Consistency bonus when classification, severity, and action form a coherent pharmacovigilance pipeline | `+0.10` |
| Calibration bonus for high-confidence correct answers | `+0.05` |
| Overconfidence penalty for high-confidence weak answers | `-0.10` |
| Underconfidence penalty for low-confidence strong answers | `-0.03` |
| False alarm penalty: agent says `new_signal` when the truth is `noise` | `-0.10` |
| Missed signal penalty: agent says `noise` when the truth is `new_signal` | `-0.20` |
| Hard-task reasoning bonus if the explanation mentions `drug interaction`, `tacrolimus`, `voriconazole`, `azole`, `calcineurin`, or `level monitoring` | `+0.05` |

Notes:

- Step-level rewards may be slightly negative for clearly unsafe or suboptimal actions.
- Final grader outputs remain deterministic and strictly bounded inside `(0, 1)` for evaluation safety.
- `suspect_drug` matching is forgiving for the hard task and allows substring matches.
- The environment is deterministic and reproducible because all tasks and grading logic are hardcoded.
- Confidence is optional, but calibrated confidence can improve reward, while reckless overconfidence is penalized.
- Step 1 gives partial reward for the initial triage and returns new review context; step 2 gives the final adjudicated reward.
- The environment also rewards productive revision and penalizes both stubbornly repeating a weak initial answer and making an unjustified late flip.
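The core components in the table above can be approximated by a small scoring function. This is a sketch only: the clipping bounds come from the environment overview, the substring match follows the `suspect_drug` note, and how the real grader in `tasks.py` combines components (consistency, calibration, and revision shaping are omitted here) is an assumption.

```python
def score_step(answer: dict, truth: dict) -> float:
    """Rough per-step score following the component table (combination assumed)."""
    r = 0.0
    r += 0.25 if answer["classification"] == truth["classification"] else 0.0
    # forgiving substring match, as described for the hard task
    r += 0.25 if truth["suspect_drug"].lower() in answer["suspect_drug"].lower() else 0.0
    r += 0.20 if answer["severity_assessment"] == truth["severity_assessment"] else 0.0
    r += 0.15 if answer["recommended_action"] == truth["recommended_action"] else 0.0
    # asymmetric safety penalties: missing a signal costs more than a false alarm
    if answer["classification"] == "new_signal" and truth["classification"] == "noise":
        r -= 0.10
    if answer["classification"] == "noise" and truth["classification"] == "new_signal":
        r -= 0.20
    return max(-0.25, min(1.0, r))  # step rewards stay within [-0.25, 1.0]
```

The asymmetry between the two penalties encodes the domain prior that a missed safety signal is worse than a spurious escalation.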
## Project Structure

| Path | Purpose |
|---|---|
| `env.py` | Main environment class and Pydantic models |
| `tasks.py` | Task definitions and grader functions |
| `data.py` | Synthetic reports and drug interaction database |
| `server.py` | Root FastAPI entrypoint |
| `server/app.py` | OpenEnv-compatible app entrypoint |
| `inference.py` | Baseline inference runner |
| `openenv.yaml` | OpenEnv metadata |
| `Dockerfile` | Multi-stage OpenEnv-style container build |
| `tests/test_env.py` | Local tests |
| `validate-submission.sh` | Pre-submission validation helper |
## Running Locally

### Option 1: Local virtual environment

If you have already created the local virtual environment in this repo, activate it:

```powershell
.\.venv\Scripts\Activate.ps1
```

Install dependencies if needed:

```bash
pip install -r requirements.txt
```

Start the server:

```bash
uvicorn server:app --host 0.0.0.0 --port 7860
```

### Option 2: Docker

Build the image:

```bash
docker build -t pharmacovigilance-env .
```

Run the container:

```bash
docker run -p 7860:7860 pharmacovigilance-env
```

The health endpoint will then be available at:

```text
http://localhost:7860/health
```
## API Endpoints

| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/reset` | Starts a task and returns the initial observation |
| `POST` | `/step` | Submits the current agent action and returns observation, reward, done, and info |
| `GET` | `/state` | Returns an internal environment state summary |
| `GET` | `/tasks` | Lists available task ids |
| `GET` | `/health` | Health check endpoint |
## Baseline Inference Script

The required baseline runner is `inference.py`. It:

- reads `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN`, and the optional `ENV_URL`
- uses the OpenAI client for all model calls
- runs all three tasks sequentially
- follows the full two-step episode loop until `done=true`
- emits the required `[START]`, `[STEP]`, and `[END]` lines
- keeps stdout restricted to the judge-expected line types
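The stdout protocol can be wrapped in a small helper so that only tagged lines ever reach stdout. The payload after each tag below is purely illustrative; this README does not specify the judge's exact expected fields, so treat the JSON bodies and field names as assumptions and defer to the real `inference.py`.

```python
import json


def emit(tag: str, payload: dict) -> str:
    """Print one protocol line; only [START]/[STEP]/[END] lines reach stdout."""
    line = f"[{tag}] {json.dumps(payload, sort_keys=True)}"
    print(line)
    return line


# illustrative episode trace (field names are assumptions)
emit("START", {"task_id": "known_signal_easy"})
emit("STEP", {"reward": 0.5, "step": 1})
emit("END", {"total_reward": 0.85})
```

Keeping all other logging on stderr preserves the restriction that stdout carries only the judge-expected line types.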
Required environment variables:

```bash
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
export HF_TOKEN=hf_your_token_here
export ENV_URL=http://localhost:7860
```
Run:

```bash
python inference.py
```

## Testing And Validation

Run the local tests:

```bash
pytest tests/test_env.py -q
```

Run OpenEnv validation:

```bash
openenv validate
```

Run the pre-submission helper:

```bash
chmod +x validate-submission.sh
./validate-submission.sh https://your-space.hf.space
```
That script checks that:

1. your Hugging Face Space responds to `POST /reset`
2. the Docker image builds
3. `openenv validate` passes

## Submission Checklist

- `openenv validate` passes
- `docker build` succeeds
- `docker run` starts cleanly
- `POST /reset` returns HTTP `200`
- `inference.py` runs all 3 tasks successfully
- your Hugging Face Space responds to `POST /reset`
- the expected baseline values have been replaced with your measured live baseline values before final submission

## Notes

- No external API calls are made by the environment itself.
- The drug interaction database is hardcoded.
- Ground truth is never exposed in the observation returned to the agent.
- The environment is lightweight enough for a 2 vCPU / 8 GB RAM target.
- The expected baseline scores in this README are planning targets until they are replaced with measured live results.