---

title: Pharmacovigilance Signal Detector
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: OpenEnv pharmacovigilance signal detection environment
tags:
  - openenv
  - healthcare
  - pharmacovigilance
  - safety
  - real-world
base_path: /web
---


# Pharmacovigilance Signal Detector

`Pharmacovigilance Signal Detector` is a real-world OpenEnv environment in which an agent acts as a drug-safety analyst. The agent reviews synthetic adverse event reports, consults a hardcoded drug interaction knowledge base, and decides whether each case is a new safety signal, a known side effect, or low-value noise. This mirrors the pharmacovigilance triage work performed by regulators and pharmaceutical safety teams.

All case data in this repo is synthetic. No real patient data is used.

## Why This Environment Matters

Pharmacovigilance teams are responsible for detecting harmful safety patterns after a drug is already on the market. That work is operationally important, high-stakes, and difficult: analysts must distinguish expected reactions from true emerging risks, recognize confounding from polypharmacy, and escalate only when justified. This makes the domain a strong fit for agent evaluation because it tests causal reasoning, prioritization, and safety-sensitive decision making.

## Environment Overview

| Item | Value |
|---|---|
| Environment name | `pharma-vigilance` |
| Domain | Pharmacovigilance / drug safety triage |
| Episode length | 2-step triage and review workflow |
| Task count | 3 |
| Difficulties | Easy, Medium, Hard |
| Step reward range | `-0.25` to `1.0` |
| Final grader range | strict `(0, 1)` |
| API | `reset()`, `step()`, `state()` |
| Server | FastAPI |

Each episode has two phases. On step 1 the agent performs an initial triage. The environment then returns additional senior-review context through feedback, and on step 2 the agent submits a final reviewed assessment. Each task includes one or more synthetic reports plus a hardcoded drug interaction database. The environment never exposes ground truth to the agent.

## Action Space

| Field | Type | Allowed values | Purpose |
|---|---|---|---|
| `classification` | `str` | `new_signal`, `known_side_effect`, `noise`, `duplicate` | Overall pharmacovigilance judgment |
| `suspect_drug` | `str` | Free text | Drug or interaction the agent believes is causal |
| `severity_assessment` | `str` | `mild`, `moderate`, `severe`, `critical` | Clinical severity assessment |
| `recommended_action` | `str` | `escalate`, `log_and_monitor`, `dismiss`, `request_more_info` | Operational follow-up |
| `reasoning` | `str` | Free text | Short explanation used for grading bonus on hard task |
| `confidence` | `Optional[int]` | `0` to `100` | Optional analyst confidence used for calibration-aware reward shaping |
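
As a concrete illustration, a valid action for the easy Lisinopril task might look like the sketch below. The field names come from the table above; the specific values are an example only, not the graded ground truth.

```python
# Illustrative action payload; values are an example, not the graded answer.
action = {
    "classification": "known_side_effect",
    "suspect_drug": "Lisinopril",
    "severity_assessment": "mild",
    "recommended_action": "log_and_monitor",
    "reasoning": "Persistent dry cough is a well-documented ACE inhibitor effect.",
    "confidence": 80,  # optional, 0-100
}

# Client-side sanity checks mirroring the allowed values in the table.
assert action["classification"] in {"new_signal", "known_side_effect", "noise", "duplicate"}
assert action["severity_assessment"] in {"mild", "moderate", "severe", "critical"}
assert action["recommended_action"] in {"escalate", "log_and_monitor", "dismiss", "request_more_info"}
assert action["confidence"] is None or 0 <= action["confidence"] <= 100
```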

## Observation Space

| Field | Type | Description |
|---|---|---|
| `task_id` | `str` | Current task identifier |
| `reports` | `List[AdverseEventReport]` | Synthetic adverse event reports for the task |
| `drug_interaction_db` | `dict` | Hardcoded safety and interaction hints |
| `step_number` | `int` | Current step index |
| `max_steps` | `int` | Maximum number of steps in the episode |
| `feedback` | `Optional[str]` | Feedback or senior-review note returned after the previous action |

Each `AdverseEventReport` contains:

| Field | Description |
|---|---|
| `report_id` | Unique synthetic report identifier |
| `patient_age` | Patient age |
| `patient_sex` | Patient sex |
| `drugs` | All drugs the patient was taking |
| `suspect_drug` | Drug named by the original reporter |
| `reaction` | Observed adverse reaction |
| `onset_days` | Days after drug start when reaction began |
| `severity` | Reported severity |
| `outcome` | Recovery status |
| `similar_reports_last_30d` | Count of similar recent reports |
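
A minimal sketch of the report structure, using the field names from the table above. The real model in `env.py` is a Pydantic model; this dataclass and the example values are purely illustrative.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AdverseEventReport:
    """Illustrative sketch; the authoritative Pydantic model lives in env.py."""
    report_id: str                 # unique synthetic report identifier
    patient_age: int
    patient_sex: str
    drugs: List[str]               # all drugs the patient was taking
    suspect_drug: str              # drug named by the original reporter
    reaction: str                  # observed adverse reaction
    onset_days: int                # days after drug start when reaction began
    severity: str                  # reported severity
    outcome: str                   # recovery status
    similar_reports_last_30d: int  # count of similar recent reports

# Example synthetic report (all values invented for illustration).
example = AdverseEventReport(
    report_id="SYN-0001",
    patient_age=64,
    patient_sex="F",
    drugs=["Lisinopril", "Metformin"],
    suspect_drug="Lisinopril",
    reaction="persistent dry cough",
    onset_days=14,
    severity="mild",
    outcome="recovering",
    similar_reports_last_30d=42,
)
```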

## Tasks

| Task | Difficulty | Scenario | Ground-truth goal | Expected baseline |
|---|---|---|---|---|
| `known_signal_easy` | Easy | Patient on `Lisinopril` develops persistent dry cough with many similar recent reports already known in-label | Recognize a known side effect and recommend `log_and_monitor` | Around `0.85` |
| `cluster_signal_medium` | Medium | Four recent `Cardiovexa` cases show symptomatic bradycardia and near-syncope despite no labeled rhythm toxicity | Recognize a plausible emerging signal and `escalate` | Around `0.65` |
| `confounded_hard` | Hard | Transplant patient with acute kidney injury is blamed on `Trimethoprim-sulfamethoxazole`, but the deeper issue is a `Voriconazole`-`Tacrolimus` interaction | Detect the interaction, classify as `new_signal`, and `escalate` | Around `0.40` |

The hard task is intentionally more difficult because the named suspect drug is not the true cause. The agent must reason over interaction evidence and therapeutic drug-monitoring clues in the provided hardcoded drug database.

## Reward Function

The environment uses deterministic programmatic graders. Reward is shaped across the full two-step trajectory:

1. initial triage reward on step 1
2. final review reward on step 2 after additional context arrives

Within each step, the agent is also scored on classification, causal attribution, severity, and action, and receives extra credit when those sub-decisions form a coherent triage story.

| Reward component | Value |
|---|---|
| Correct `classification` | `+0.25` |
| Correct `suspect_drug` | `+0.25` |
| Correct `severity_assessment` | `+0.20` |
| Correct `recommended_action` | `+0.15` |
| Consistency bonus when classification, severity, and action form a coherent pharmacovigilance pipeline | `+0.10` |
| Calibration bonus for high-confidence correct answers | `+0.05` |
| Overconfidence penalty for high-confidence weak answers | `-0.10` |
| Underconfidence penalty for low-confidence strong answers | `-0.03` |
| False alarm penalty: agent says `new_signal` when truth is `noise` | `-0.10` |
| Missed signal penalty: agent says `noise` when truth is `new_signal` | `-0.20` |
| Hard-task reasoning bonus if explanation mentions `drug interaction`, `tacrolimus`, `voriconazole`, `azole`, `calcineurin`, or `level monitoring` | `+0.05` |

Notes:
- Step-level rewards may be slightly negative for clearly unsafe or suboptimal actions.
- Final grader outputs remain deterministic and strictly bounded inside `(0, 1)` for evaluation safety.
- `suspect_drug` matching is forgiving for the hard task and allows substring matches.
- The environment is deterministic and reproducible because all tasks and grading logic are hardcoded.
- Confidence is optional, but calibrated confidence can improve reward while reckless overconfidence is penalized.
- Step 1 gives partial reward for initial triage and returns new review context; step 2 gives the final adjudicated reward.
- The environment also rewards productive revision and penalizes stubbornly repeating a weak initial answer or making an unjustified late flip.
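
The component sums above can be sketched roughly as follows. This is a simplified illustration of the per-step base credit only; the authoritative graders live in `tasks.py` and additionally apply the calibration, false-alarm, missed-signal, and revision terms listed in the table.

```python
def sketch_step_reward(correct_classification: bool,
                       correct_suspect_drug: bool,
                       correct_severity: bool,
                       correct_action: bool,
                       consistent_pipeline: bool) -> float:
    """Simplified sketch of the per-step base reward; the real graders in
    tasks.py also add calibration bonuses/penalties and revision shaping."""
    reward = 0.0
    reward += 0.25 if correct_classification else 0.0
    reward += 0.25 if correct_suspect_drug else 0.0
    reward += 0.20 if correct_severity else 0.0
    reward += 0.15 if correct_action else 0.0
    reward += 0.10 if consistent_pipeline else 0.0
    # Clamp to the documented step reward range of -0.25 to 1.0.
    return max(-0.25, min(1.0, reward))

# A fully correct, coherent triage earns 0.95 base credit before
# calibration or reasoning bonuses.
base = sketch_step_reward(True, True, True, True, True)
```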

## Project Structure

| Path | Purpose |
|---|---|
| `env.py` | Main environment class and Pydantic models |
| `tasks.py` | Task definitions and grader functions |
| `data.py` | Synthetic reports and drug interaction database |
| `server.py` | Root FastAPI entrypoint |
| `server/app.py` | OpenEnv-compatible app entrypoint |
| `inference.py` | Baseline inference runner |
| `openenv.yaml` | OpenEnv metadata |
| `Dockerfile` | Multi-stage OpenEnv-style container build |
| `tests/test_env.py` | Local tests |
| `validate-submission.sh` | Pre-submission validation helper |

## Running Locally

### Option 1: Local virtual environment

If you already created the local virtual environment in this repo:

```powershell
.\.venv\Scripts\Activate.ps1
```

Install dependencies if needed:

```bash
pip install -r requirements.txt
```

Start the server:

```bash
uvicorn server:app --host 0.0.0.0 --port 7860
```

### Option 2: Docker

Build the image:

```bash
docker build -t pharmacovigilance-env .
```

Run the container:

```bash
docker run -p 7860:7860 pharmacovigilance-env
```

The health endpoint will be available at:

```text
http://localhost:7860/health
```

## API Endpoints

| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/reset` | Starts a task and returns the initial observation |
| `POST` | `/step` | Submits the current agent action and returns observation, reward, done, info |
| `GET` | `/state` | Returns internal environment state summary |
| `GET` | `/tasks` | Lists available task ids |
| `GET` | `/health` | Health check endpoint |
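
A minimal Python client sketch of the two-step episode loop against these endpoints. The JSON payload field names used here (`task_id`, `action`, `observation`, `reward`, `done`) are assumptions about the request/response schema, not a guaranteed contract; check the server models before relying on them.

```python
import json
import urllib.request

ENV_URL = "http://localhost:7860"  # assumes the server is running locally

def post_json(path: str, payload: dict) -> dict:
    """POST a JSON payload to the environment server and decode the JSON reply."""
    req = urllib.request.Request(
        ENV_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def run_episode(task_id: str, choose_action) -> float:
    """Drive one reset/step/step episode; payload field names are illustrative.

    choose_action(obs) should return a dict matching the action space table.
    """
    obs = post_json("/reset", {"task_id": task_id})
    total = 0.0
    for _ in range(2):  # episodes are at most two steps
        result = post_json("/step", {"action": choose_action(obs)})
        total += result.get("reward", 0.0)
        obs = result.get("observation", obs)
        if result.get("done"):
            break
    return total
```

Start the server first (see Running Locally), then call `run_episode("known_signal_easy", my_policy)` with a policy function of your own.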

## Baseline Inference Script

The required baseline runner is `inference.py`.

It:
- reads `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN`, and optional `ENV_URL`
- uses the OpenAI client for all model calls
- runs all three tasks sequentially
- follows the full 2-step episode loop until `done=true`
- emits the required `[START]`, `[STEP]`, and `[END]` lines
- keeps stdout restricted to the judge-expected line types

Required environment variables:

```bash
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
export HF_TOKEN=hf_your_token_here
export ENV_URL=http://localhost:7860
```

Run:

```bash
python inference.py
```

## Testing And Validation

Run local tests:

```bash
pytest tests/test_env.py -q
```

Run OpenEnv validation:

```bash
openenv validate
```

Run the pre-submission helper:

```bash
chmod +x validate-submission.sh
./validate-submission.sh https://your-space.hf.space
```

That script checks:
1. your Hugging Face Space responds to `POST /reset`
2. the Docker image builds
3. `openenv validate` passes

## Submission Checklist

- `openenv validate` passes
- `docker build` succeeds
- `docker run` starts cleanly
- `POST /reset` returns HTTP `200`
- `inference.py` runs all 3 tasks successfully
- your Hugging Face Space responds to `POST /reset`
- replace the expected baseline values with your measured live baseline values before final submission

## Notes

- No external API calls are made by the environment itself.
- The drug interaction database is hardcoded.
- Ground truth is never exposed in the observation returned to the agent.
- The environment is lightweight enough for a 2 vCPU / 8GB RAM target.
- The expected baseline scores in this README are planning targets until replaced with measured live results.