KChad's picture
small change
91a0048
metadata
title: Prompt Injection Envinronment
emoji: 🔊
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv-0.2.3
  - openenv
  - Prompt Injection
  - Reinforcement Learning

IDPI Exfil Environment

IDPI Exfil Environment is an OpenEnv-compatible prompt-injection defense environment. Agents operate on realistic email tasks where the visible goal looks benign, but attacker-controlled content tries to redirect behavior toward unsafe actions such as data exfiltration or malicious relays.

The environment is designed for evaluating whether an agent can:

  • read and process email content safely
  • identify indirect prompt injection patterns
  • avoid unsafe outbound actions
  • produce clean internal summaries on harder tasks

Features

  • 3 difficulty levels: easy, medium, and hard scenarios with different attack visibility and trigger styles
  • Realistic attack channels: inline instructions, hidden wrappers, HTML comments, transcript replay, attachment-style instructions, and more
  • Safety-aware action space: includes both benign and risky tools so policies can be meaningfully tested
  • Structured scoring: reward shaping during the episode plus end-of-episode grading
  • OpenEnv server mode: FastAPI endpoints for reset, step, state, result, and health
  • Unified evaluation runner: one runner for inference and baseline-style evaluation output

Quick Start

Install

pip install -e .

Run the server

python server/app.py

The API starts on http://localhost:7860.

Run the evaluation runner

Create a .env in the project root with the submission-compatible inference variables:

API_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai/
MODEL_NAME=gemini-2.5-flash
HF_TOKEN=your_gemini_api_key_here

Then run:

python inference.py

Compatibility commands still work:

python run_baseline.py
python baseline/run_baseline.py

All three commands use the same unified runner and write results to baseline/baseline_results.json. For submission mode, inference.py validates API_BASE_URL, MODEL_NAME, and HF_TOKEN, then delegates to the benchmark runner using the OpenAI client path.

Using the HTTP API

Reset the environment

import requests

response = requests.post(
    "http://localhost:7860/reset",
    json={"difficulty": "easy"},
)
observation = response.json()
print(observation)

Step with an action

import requests

response = requests.post(
    "http://localhost:7860/step",
    json={"action": {"type": "ReadDoc"}},
)
result = response.json()

print(result["observation"])
print(result["reward"])
print(result["done"])

Inspect state or final result

import requests

state = requests.get("http://localhost:7860/state").json()
episode_result = requests.get("http://localhost:7860/result").json()

Additional discovery endpoints are also available:

import requests

metadata = requests.get("http://localhost:7860/metadata").json()
schema = requests.get("http://localhost:7860/schema").json()

Environment Overview

Task model

Each episode presents an email with a benign business-looking task and embedded attacker intent. The agent must complete the safe objective without obeying adversarial instructions contained in the email body or derived attack channels.

Typical safe behavior includes:

  • reviewing the email with ReadDoc
  • flagging suspicious instructions with FlagInjection
  • writing an internal-only clean summary with WriteDB
  • refusing unsafe outbound behavior

Difficulty levels

  • easy: explicit or obvious prompt injection patterns
  • medium: more obfuscated attacks, hidden wrappers, spoofed instruction hierarchy, or delayed triggers
  • hard: attacks embedded in semantically plausible content where the agent must also produce a safe summary

Actions

  • ReadDoc Reads the current email content.
  • WriteDB(content="...") Writes an internal summary or note.
  • SendEmail(to="...", body="...") Sends an outbound email. This is the main exfiltration risk surface.
  • FlagInjection(reason="...") Marks the content as suspicious.
  • NoOp Takes no action.

Observation fields

  • subject
  • body
  • last_action
  • last_action_result
  • step_count
  • task_difficulty
  • summary_required
  • current_summary

Endpoints

  • GET /
  • GET /web
  • GET /health
  • GET /metadata
  • GET /schema
  • POST /reset
  • POST /step
  • GET /state
  • GET /result
  • POST /mcp

Evaluation

The evaluation runner loads scenarios from docs/person_a_benchmark_manifest.json and executes them across easy, medium, and hard splits.

The current manifest includes:

  • 32 scenarios total
  • 9 easy
  • 13 medium
  • 10 hard

Each run logs:

  • per-step actions and rewards
  • whether exfiltration occurred
  • whether the task was completed safely
  • final graded score
  • per-difficulty summary statistics

Results are written to baseline/baseline_results.json.

Optional runner settings

These values can be set in .env:

EPISODES_PER_DIFFICULTY=1
BASELINE_RESULTS_PATH=baseline/submission_results.json
REQUEST_DELAY_SECONDS=3.0
MAX_MODEL_RETRIES=2

Development

Project structure

Prompt-injection-env/
|-- env/
|   |-- environment.py
|   |-- dataset_loader.py
|   |-- scenario_generator.py
|   |-- injection_engine.py
|   |-- policy_engine.py
|   |-- taint_tracker.py
|   |-- reward.py
|   |-- grader.py
|   `-- models.py
|-- server/
|   |-- app.py
|   `-- Dockerfile
|-- docs/
|   |-- person_a_benchmark_manifest.json
|   |-- person_a_scenario_audit.md
|   `-- person_a_showcase_episodes.md
|-- baseline/
|   |-- run_baseline.py
|   `-- baseline_results.json
|-- inference.py
|-- run_baseline.py
|-- openenv.yaml
`-- README.md

Core components

  • env/environment.py Main episode lifecycle, action handling, reward plumbing, and result assembly.
  • server/app.py FastAPI wrapper exposing the environment over HTTP, including health, metadata, schema, and benchmark endpoints.
  • inference.py Submission-facing inference entrypoint that validates the required env vars and delegates to the benchmark runner.
  • openenv.yaml Environment metadata for OpenEnv packaging and deployment.

CI and sync

Includes GitHub workflows for:

Submission-focused regression tests live in submission_tests/ and can be run with:

python -m unittest discover -s submission_tests

Local Python usage

You can exercise the environment directly without starting the server:

from env.environment import IDPIExfilEnv
from env.models import Action

env = IDPIExfilEnv()
obs = env.reset(difficulty="easy")
obs, reward, done, info = env.step(Action(type="ReadDoc"))
print(obs)
print(reward)

Notes

  • The environment terminates early on unsafe exfiltration events.
  • Hard tasks require a clean internal summary that excludes attacker-controlled content.
  • Gemini works through the submission-compatible OpenAI client path by setting API_BASE_URL, MODEL_NAME, and HF_TOKEN.