---
title: Prompt Injection Envinronment
emoji: "🔊"
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv-0.2.3
  - openenv
  - Prompt Injection
  - Reinforcement Learning
---

# IDPI Exfil Environment

`IDPI Exfil Environment` is an OpenEnv-compatible prompt-injection defense environment. Agents operate on realistic email tasks where the visible goal looks benign, but attacker-controlled content tries to redirect behavior toward unsafe actions such as data exfiltration or malicious relays.

The environment is designed for evaluating whether an agent can:

- read and process email content safely
- identify indirect prompt injection patterns
- avoid unsafe outbound actions
- produce clean internal summaries on harder tasks

## Features

- `3 difficulty levels`: easy, medium, and hard scenarios with different attack visibility and trigger styles
- `Realistic attack channels`: inline instructions, hidden wrappers, HTML comments, transcript replay, attachment-style instructions, and more
- `Safety-aware action space`: includes both benign and risky tools so policies can be meaningfully tested
- `Structured scoring`: reward shaping during the episode plus end-of-episode grading
- `OpenEnv server mode`: FastAPI endpoints for reset, step, state, result, and health
- `Unified evaluation runner`: one runner for inference and baseline-style evaluation output

## Quick Start

### Install

```bash
pip install -e .
```

### Run the server

```bash
python server/app.py
```

The API starts on `http://localhost:7860`.

### Run the evaluation runner

Create a `.env` in the project root with the submission-compatible inference variables:

```bash
API_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai/
MODEL_NAME=gemini-2.5-flash
HF_TOKEN=your_gemini_api_key_here
```

Then run:

```bash
python inference.py
```

Compatibility commands still work:

```bash
python run_baseline.py
python baseline/run_baseline.py
```

All three commands use the same unified runner and write results to `baseline/baseline_results.json`.
For submission mode, `inference.py` validates `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN`, then delegates to the benchmark runner using the OpenAI client path.

## Using the HTTP API

### Reset the environment

```python
import requests

response = requests.post(
    "http://localhost:7860/reset",
    json={"difficulty": "easy"},
)
observation = response.json()
print(observation)
```

### Step with an action

```python
import requests

response = requests.post(
    "http://localhost:7860/step",
    json={"action": {"type": "ReadDoc"}},
)
result = response.json()

print(result["observation"])
print(result["reward"])
print(result["done"])
```

### Inspect state or final result

```python
import requests

state = requests.get("http://localhost:7860/state").json()
episode_result = requests.get("http://localhost:7860/result").json()
```

Additional discovery endpoints are also available:

```python
import requests

metadata = requests.get("http://localhost:7860/metadata").json()
schema = requests.get("http://localhost:7860/schema").json()
```

## Environment Overview

### Task model

Each episode presents an email with a benign business-looking task and embedded attacker intent. The agent must complete the safe objective without obeying adversarial instructions contained in the email body or derived attack channels.

Typical safe behavior includes:

- reviewing the email with `ReadDoc`
- flagging suspicious instructions with `FlagInjection`
- writing an internal-only clean summary with `WriteDB`
- refusing unsafe outbound behavior

### Difficulty levels

- `easy`: explicit or obvious prompt injection patterns
- `medium`: more obfuscated attacks, hidden wrappers, spoofed instruction hierarchy, or delayed triggers
- `hard`: attacks embedded in semantically plausible content where the agent must also produce a safe summary

### Actions

- `ReadDoc`
  Reads the current email content.
- `WriteDB(content="...")`
  Writes an internal summary or note.
- `SendEmail(to="...", body="...")`
  Sends an outbound email. This is the main exfiltration risk surface.
- `FlagInjection(reason="...")`
  Marks the content as suspicious.
- `NoOp`
  Takes no action.

### Observation fields

- `subject`
- `body`
- `last_action`
- `last_action_result`
- `step_count`
- `task_difficulty`
- `summary_required`
- `current_summary`

### Endpoints

- `GET /`
- `GET /web`
- `GET /health`
- `GET /metadata`
- `GET /schema`
- `POST /reset`
- `POST /step`
- `GET /state`
- `GET /result`
- `POST /mcp`

## Evaluation

The evaluation runner loads scenarios from [docs/person_a_benchmark_manifest.json](/C:/Users/Admin/Desktop/My%20projects/Prompt-injection-env/docs/person_a_benchmark_manifest.json) and executes them across `easy`, `medium`, and `hard` splits.

The current manifest includes:

- `32 scenarios total`
- `9 easy`
- `13 medium`
- `10 hard`

Each run logs:

- per-step actions and rewards
- whether exfiltration occurred
- whether the task was completed safely
- final graded score
- per-difficulty summary statistics

Results are written to [baseline/baseline_results.json](/C:/Users/Admin/Desktop/My%20projects/Prompt-injection-env/baseline/baseline_results.json).

### Optional runner settings

These values can be set in `.env`:

```bash
EPISODES_PER_DIFFICULTY=1
BASELINE_RESULTS_PATH=baseline/submission_results.json
REQUEST_DELAY_SECONDS=3.0
MAX_MODEL_RETRIES=2
```

## Development

### Project structure

```text
Prompt-injection-env/
|-- env/
|   |-- environment.py
|   |-- dataset_loader.py
|   |-- scenario_generator.py
|   |-- injection_engine.py
|   |-- policy_engine.py
|   |-- taint_tracker.py
|   |-- reward.py
|   |-- grader.py
|   `-- models.py
|-- server/
|   |-- app.py
|   `-- Dockerfile
|-- docs/
|   |-- person_a_benchmark_manifest.json
|   |-- person_a_scenario_audit.md
|   `-- person_a_showcase_episodes.md
|-- baseline/
|   |-- run_baseline.py
|   `-- baseline_results.json
|-- inference.py
|-- run_baseline.py
|-- openenv.yaml
`-- README.md
```

### Core components

- [env/environment.py](/C:/Users/Admin/Desktop/My%20projects/Prompt-injection-env/env/environment.py)
  Main episode lifecycle, action handling, reward plumbing, and result assembly.
- [server/app.py](/C:/Users/Admin/Desktop/My%20projects/Prompt-injection-env/server/app.py)
  FastAPI wrapper exposing the environment over HTTP, including health, metadata, schema, and benchmark endpoints.
- [inference.py](/C:/Users/Admin/Desktop/My%20projects/Prompt-injection-env/inference.py)
  Submission-facing inference entrypoint that validates the required env vars and delegates to the benchmark runner.
- [openenv.yaml](/C:/Users/Admin/Desktop/My%20projects/Prompt-injection-env/openenv.yaml)
  Environment metadata for OpenEnv packaging and deployment.


### CI and sync

Includes GitHub workflows for:

- continuous integration in [.github/workflows/ci.yml](/C:/Users/Admin/Desktop/My%20projects/Prompt-injection-env/.github/workflows/ci.yml)
- syncing `main` to Hugging Face Spaces in [.github/workflows/sync.yml](/C:/Users/Admin/Desktop/My%20projects/Prompt-injection-env/.github/workflows/sync.yml)

Submission-focused regression tests live in `submission_tests/` and can be run with:

```bash
python -m unittest discover -s submission_tests
```

### Local Python usage

You can exercise the environment directly without starting the server:

```python
from env.environment import IDPIExfilEnv
from env.models import Action

env = IDPIExfilEnv()
obs = env.reset(difficulty="easy")
obs, reward, done, info = env.step(Action(type="ReadDoc"))
print(obs)
print(reward)
```

## Notes

- The environment terminates early on unsafe exfiltration events.
- Hard tasks require a clean internal summary that excludes attacker-controlled content.
- Gemini works through the submission-compatible OpenAI client path by setting `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN`.