---

title: Pharmacovigilance Signal Detector
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: OpenEnv pharmacovigilance signal detection environment
tags:
  - openenv
  - healthcare
  - pharmacovigilance
  - safety
  - real-world
base_path: /web
---


# Pharmacovigilance Signal Detector

`Pharmacovigilance Signal Detector` is a real-world OpenEnv environment in which an agent acts as a drug-safety analyst. The agent reviews synthetic adverse event reports, consults a hardcoded drug interaction knowledge base, and decides whether each case is a new safety signal, a known side effect, or low-value noise. This mirrors the pharmacovigilance triage work performed by regulators and pharmaceutical safety teams.

All case data in this repo is synthetic. No real patient data is used.

## Why This Environment Matters

Pharmacovigilance teams are responsible for detecting harmful safety patterns after a drug is already on the market. That work is operationally important, high-stakes, and difficult: analysts must distinguish expected reactions from true emerging risks, recognize confounding from polypharmacy, and escalate only when justified. This makes the domain a strong fit for agent evaluation because it tests causal reasoning, prioritization, and safety-sensitive decision making.

## Environment Overview

| Item | Value |
|---|---|
| Environment name | `pharma-vigilance` |
| Domain | Pharmacovigilance / drug safety triage |
| Episode length | 2-step triage and review workflow |
| Task count | 3 |
| Difficulties | Easy, Medium, Hard |
| Step reward range | `-0.25` to `1.0` |
| Final grader range | strict `(0, 1)` |
| API | `reset()`, `step()`, `state()` |
| Server | FastAPI |

Each episode has two phases. On step 1 the agent performs an initial triage. The environment then returns additional senior-review context through feedback, and on step 2 the agent submits a final reviewed assessment. Each task includes one or more synthetic reports plus a hardcoded drug interaction database. The environment never exposes ground truth to the agent.

## Action Space

| Field | Type | Allowed values | Purpose |
|---|---|---|---|
| `classification` | `str` | `new_signal`, `known_side_effect`, `noise`, `duplicate` | Overall pharmacovigilance judgment |
| `suspect_drug` | `str` | Free text | Drug or interaction the agent believes is causal |
| `severity_assessment` | `str` | `mild`, `moderate`, `severe`, `critical` | Clinical severity assessment |
| `recommended_action` | `str` | `escalate`, `log_and_monitor`, `dismiss`, `request_more_info` | Operational follow-up |
| `reasoning` | `str` | Free text | Short explanation used for grading bonus on hard task |
| `confidence` | `Optional[int]` | `0` to `100` | Optional analyst confidence used for calibration-aware reward shaping |
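
As a concrete illustration, a valid action for the easy Lisinopril task might look like the sketch below. The field names come from the table above; the specific values are an example only, not the graded ground truth.

```python
# Illustrative action payload; values are an example, not the graded answer.
action = {
    "classification": "known_side_effect",
    "suspect_drug": "Lisinopril",
    "severity_assessment": "mild",
    "recommended_action": "log_and_monitor",
    "reasoning": "Persistent dry cough is a well-documented ACE inhibitor effect.",
    "confidence": 80,  # optional, 0-100
}

# Client-side sanity checks mirroring the allowed values in the table.
assert action["classification"] in {"new_signal", "known_side_effect", "noise", "duplicate"}
assert action["severity_assessment"] in {"mild", "moderate", "severe", "critical"}
assert action["recommended_action"] in {"escalate", "log_and_monitor", "dismiss", "request_more_info"}
assert action["confidence"] is None or 0 <= action["confidence"] <= 100
```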

## Observation Space

| Field | Type | Description |
|---|---|---|
| `task_id` | `str` | Current task identifier |
| `reports` | `List[AdverseEventReport]` | Synthetic adverse event reports for the task |
| `drug_interaction_db` | `dict` | Hardcoded safety and interaction hints |
| `step_number` | `int` | Current step index |
| `max_steps` | `int` | Maximum number of steps in the episode |
| `feedback` | `Optional[str]` | Feedback or senior-review note returned after the previous action |

Each `AdverseEventReport` contains:

| Field | Description |
|---|---|
| `report_id` | Unique synthetic report identifier |
| `patient_age` | Patient age |
| `patient_sex` | Patient sex |
| `drugs` | All drugs the patient was taking |
| `suspect_drug` | Drug named by the original reporter |
| `reaction` | Observed adverse reaction |
| `onset_days` | Days after drug start when reaction began |
| `severity` | Reported severity |
| `outcome` | Recovery status |
| `similar_reports_last_30d` | Count of similar recent reports |
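
A minimal sketch of the report structure, using the field names from the table above. The real model in `env.py` is a Pydantic model; this dataclass and the example values are purely illustrative.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AdverseEventReport:
    """Illustrative sketch; the authoritative Pydantic model lives in env.py."""
    report_id: str                 # unique synthetic report identifier
    patient_age: int
    patient_sex: str
    drugs: List[str]               # all drugs the patient was taking
    suspect_drug: str              # drug named by the original reporter
    reaction: str                  # observed adverse reaction
    onset_days: int                # days after drug start when reaction began
    severity: str                  # reported severity
    outcome: str                   # recovery status
    similar_reports_last_30d: int  # count of similar recent reports

# Example synthetic report (all values invented for illustration).
example = AdverseEventReport(
    report_id="SYN-0001",
    patient_age=64,
    patient_sex="F",
    drugs=["Lisinopril", "Metformin"],
    suspect_drug="Lisinopril",
    reaction="persistent dry cough",
    onset_days=14,
    severity="mild",
    outcome="recovering",
    similar_reports_last_30d=42,
)
```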

## Tasks

| Task | Difficulty | Scenario | Ground-truth goal | Expected baseline |
|---|---|---|---|---|
| `known_signal_easy` | Easy | Patient on `Lisinopril` develops persistent dry cough with many similar recent reports already known in-label | Recognize a known side effect and recommend `log_and_monitor` | Around `0.85` |
| `cluster_signal_medium` | Medium | Four recent `Cardiovexa` cases show symptomatic bradycardia and near-syncope despite no labeled rhythm toxicity | Recognize a plausible emerging signal and `escalate` | Around `0.65` |
| `confounded_hard` | Hard | Transplant patient with acute kidney injury is blamed on `Trimethoprim-sulfamethoxazole`, but the deeper issue is a `Voriconazole`-`Tacrolimus` interaction | Detect the interaction, classify as `new_signal`, and `escalate` | Around `0.40` |

The hard task is intentionally more difficult because the named suspect drug is not the true cause. The agent must reason over interaction evidence and therapeutic drug-monitoring clues in the provided hardcoded drug database.

## Reward Function

The environment uses deterministic programmatic graders. Reward is shaped across the full two-step trajectory:

1. initial triage reward on step 1
2. final review reward on step 2 after additional context arrives

Within each step, the agent is also scored on classification, causal attribution, severity, and action, and receives extra credit when those sub-decisions form a coherent triage story.

| Reward component | Value |
|---|---|
| Correct `classification` | `+0.25` |
| Correct `suspect_drug` | `+0.25` |
| Correct `severity_assessment` | `+0.20` |
| Correct `recommended_action` | `+0.15` |
| Consistency bonus when classification, severity, and action form a coherent pharmacovigilance pipeline | `+0.10` |
| Calibration bonus for high-confidence correct answers | `+0.05` |
| Overconfidence penalty for high-confidence weak answers | `-0.10` |
| Underconfidence penalty for low-confidence strong answers | `-0.03` |
| False alarm penalty: agent says `new_signal` when truth is `noise` | `-0.10` |
| Missed signal penalty: agent says `noise` when truth is `new_signal` | `-0.20` |
| Hard-task reasoning bonus if explanation mentions `drug interaction`, `tacrolimus`, `voriconazole`, `azole`, `calcineurin`, or `level monitoring` | `+0.05` |

Notes:
- Step-level rewards may be slightly negative for clearly unsafe or suboptimal actions.
- Final grader outputs remain deterministic and strictly bounded inside `(0, 1)` for evaluation safety.
- `suspect_drug` matching is forgiving for the hard task and allows substring matches.
- The environment is deterministic and reproducible because all tasks and grading logic are hardcoded.
- Confidence is optional, but calibrated confidence can improve reward while reckless overconfidence is penalized.
- Step 1 gives partial reward for initial triage and returns new review context; step 2 gives the final adjudicated reward.
- The environment also rewards productive revision and penalizes stubbornly repeating a weak initial answer or making an unjustified late flip.
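
The component sums above can be sketched roughly as follows. This is a simplified illustration of the per-step base credit only; the authoritative graders live in `tasks.py` and additionally apply the calibration, false-alarm, missed-signal, and revision terms listed in the table.

```python
def sketch_step_reward(correct_classification: bool,
                       correct_suspect_drug: bool,
                       correct_severity: bool,
                       correct_action: bool,
                       consistent_pipeline: bool) -> float:
    """Simplified sketch of the per-step base reward; the real graders in
    tasks.py also add calibration bonuses/penalties and revision shaping."""
    reward = 0.0
    reward += 0.25 if correct_classification else 0.0
    reward += 0.25 if correct_suspect_drug else 0.0
    reward += 0.20 if correct_severity else 0.0
    reward += 0.15 if correct_action else 0.0
    reward += 0.10 if consistent_pipeline else 0.0
    # Clamp to the documented step reward range of -0.25 to 1.0.
    return max(-0.25, min(1.0, reward))

# A fully correct, coherent triage earns 0.95 base credit before
# calibration or reasoning bonuses.
base = sketch_step_reward(True, True, True, True, True)
```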

## Project Structure

| Path | Purpose |
|---|---|
| `env.py` | Main environment class and Pydantic models |
| `tasks.py` | Task definitions and grader functions |
| `data.py` | Synthetic reports and drug interaction database |
| `server.py` | Root FastAPI entrypoint |
| `server/app.py` | OpenEnv-compatible app entrypoint |
| `inference.py` | Baseline inference runner |
| `openenv.yaml` | OpenEnv metadata |
| `Dockerfile` | Multi-stage OpenEnv-style container build |
| `tests/test_env.py` | Local tests |
| `validate-submission.sh` | Pre-submission validation helper |

## Running Locally

### Option 1: Local virtual environment

If you already created the local virtual environment in this repo:

```powershell
.\.venv\Scripts\Activate.ps1
```

Install dependencies if needed:

```bash
pip install -r requirements.txt
```

Start the server:

```bash
uvicorn server:app --host 0.0.0.0 --port 7860
```

### Option 2: Docker

Build the image:

```bash
docker build -t pharmacovigilance-env .
```

Run the container:

```bash
docker run -p 7860:7860 pharmacovigilance-env
```

The health endpoint will be available at:

```text
http://localhost:7860/health
```

## API Endpoints

| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/reset` | Starts a task and returns the initial observation |
| `POST` | `/step` | Submits the current agent action and returns observation, reward, done, info |
| `GET` | `/state` | Returns internal environment state summary |
| `GET` | `/tasks` | Lists available task ids |
| `GET` | `/health` | Health check endpoint |
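
A minimal Python client sketch of the two-step episode loop against these endpoints. The JSON payload field names used here (`task_id`, `action`, `observation`, `reward`, `done`) are assumptions about the request/response schema, not a guaranteed contract; check the server models before relying on them.

```python
import json
import urllib.request

ENV_URL = "http://localhost:7860"  # assumes the server is running locally

def post_json(path: str, payload: dict) -> dict:
    """POST a JSON payload to the environment server and decode the JSON reply."""
    req = urllib.request.Request(
        ENV_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def run_episode(task_id: str, choose_action) -> float:
    """Drive one reset/step/step episode; payload field names are illustrative.

    choose_action(obs) should return a dict matching the action space table.
    """
    obs = post_json("/reset", {"task_id": task_id})
    total = 0.0
    for _ in range(2):  # episodes are at most two steps
        result = post_json("/step", {"action": choose_action(obs)})
        total += result.get("reward", 0.0)
        obs = result.get("observation", obs)
        if result.get("done"):
            break
    return total
```

Start the server first (see Running Locally), then call `run_episode("known_signal_easy", my_policy)` with a policy function of your own.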

## Baseline Inference Script

The required baseline runner is `inference.py`.

It:
- reads `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN`, and optional `ENV_URL`
- uses the OpenAI client for all model calls
- runs all three tasks sequentially
- follows the full 2-step episode loop until `done=true`
- emits the required `[START]`, `[STEP]`, and `[END]` lines
- keeps stdout restricted to the judge-expected line types

Required environment variables:

```bash
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
export HF_TOKEN=hf_your_token_here
export ENV_URL=http://localhost:7860
```

Run:

```bash
python inference.py
```

## Testing And Validation

Run local tests:

```bash
pytest tests/test_env.py -q
```

Run OpenEnv validation:

```bash
openenv validate
```

Run the pre-submission helper:

```bash
chmod +x validate-submission.sh
./validate-submission.sh https://your-space.hf.space
```

That script checks:
1. your Hugging Face Space responds to `POST /reset`
2. the Docker image builds
3. `openenv validate` passes

## Submission Checklist

- `openenv validate` passes
- `docker build` succeeds
- `docker run` starts cleanly
- `POST /reset` returns HTTP `200`
- `inference.py` runs all 3 tasks successfully
- your Hugging Face Space responds to `POST /reset`
- replace the expected baseline values with your measured live baseline values before final submission

## Notes

- No external API calls are made by the environment itself.
- The drug interaction database is hardcoded.
- Ground truth is never exposed in the observation returned to the agent.
- The environment is lightweight enough for a 2 vCPU / 8GB RAM target.
- The expected baseline scores in this README are planning targets until replaced with measured live results.