title: Prompt Injection Environment
emoji: 🔊
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv-0.2.3
- openenv
- Prompt Injection
- Reinforcement Learning
IDPI Exfil Environment
IDPI Exfil Environment is an OpenEnv-compatible prompt-injection defense environment. Agents operate on realistic email tasks where the visible goal looks benign, but attacker-controlled content tries to redirect behavior toward unsafe actions such as data exfiltration or malicious relays.
The environment is designed for evaluating whether an agent can:
- read and process email content safely
- identify indirect prompt injection patterns
- avoid unsafe outbound actions
- produce clean internal summaries on harder tasks
Features
- 3 difficulty levels: easy, medium, and hard scenarios with different attack visibility and trigger styles
- Realistic attack channels: inline instructions, hidden wrappers, HTML comments, transcript replay, attachment-style instructions, and more
- Safety-aware action space: includes both benign and risky tools so policies can be meaningfully tested
- Structured scoring: reward shaping during the episode plus end-of-episode grading
- OpenEnv server mode: FastAPI endpoints for reset, step, state, result, and health
- Unified evaluation runner: one runner for inference and baseline-style evaluation output
Quick Start
Install
pip install -e .
Run the server
python server/app.py
The API starts on http://localhost:7860.
Run the evaluation runner
Create a .env in the project root with the submission-compatible inference variables:
API_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai/
MODEL_NAME=gemini-2.5-flash
HF_TOKEN=your_gemini_api_key_here
Then run:
python inference.py
Compatibility commands still work:
python run_baseline.py
python baseline/run_baseline.py
All three commands use the same unified runner and write results to baseline/baseline_results.json.
For submission mode, inference.py validates API_BASE_URL, MODEL_NAME, and HF_TOKEN, then delegates to the benchmark runner using the OpenAI client path.
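As a sketch of that validation step (the function name, error message, and return shape are illustrative, not the actual inference.py code):

```python
import os

# The three variables inference.py checks before delegating to the runner.
REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")

def validate_submission_env(env=os.environ):
    """Return the required values, raising if any is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required env vars: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_VARS}
```

The returned dict can then be handed to the OpenAI client path described above.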
Using the HTTP API
Reset the environment
import requests
response = requests.post(
"http://localhost:7860/reset",
json={"difficulty": "easy"},
)
observation = response.json()
print(observation)
Step with an action
import requests
response = requests.post(
"http://localhost:7860/step",
json={"action": {"type": "ReadDoc"}},
)
result = response.json()
print(result["observation"])
print(result["reward"])
print(result["done"])
Inspect state or final result
import requests
state = requests.get("http://localhost:7860/state").json()
episode_result = requests.get("http://localhost:7860/result").json()
Additional discovery endpoints are available:
import requests
metadata = requests.get("http://localhost:7860/metadata").json()
schema = requests.get("http://localhost:7860/schema").json()
Environment Overview
Task model
Each episode presents an email with a benign business-looking task and embedded attacker intent. The agent must complete the safe objective without obeying adversarial instructions contained in the email body or derived attack channels.
Typical safe behavior includes:
- reviewing the email with ReadDoc
- flagging suspicious instructions with FlagInjection
- writing an internal-only clean summary with WriteDB
- refusing unsafe outbound behavior
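The safe trajectory above can be sketched as a small policy loop against the /step endpoint. The HTTP call is injected as a callable so the sketch runs offline; the response shape mirrors the step example earlier, and the stub rewards below are invented for illustration, not the environment's actual reward values:

```python
def run_episode(step, actions, max_steps=20):
    """Drive one episode: send each action via `step` until done or the list runs out.

    `step` takes an action dict (e.g. {"type": "ReadDoc"}) and returns the
    /step response shape: {"observation": ..., "reward": ..., "done": ...}.
    """
    total_reward = 0.0
    for action in actions[:max_steps]:
        result = step(action)
        total_reward += result["reward"]
        if result["done"]:
            break
    return total_reward

# Offline stub standing in for requests.post(".../step", json=...).json();
# rewards here are made up for the example.
def fake_step(action):
    done = action["type"] == "WriteDB"
    return {"observation": {}, "reward": 1.0 if done else 0.1, "done": done}

safe_policy = [
    {"type": "ReadDoc"},
    {"type": "FlagInjection", "reason": "hidden instructions in body"},
    {"type": "WriteDB", "content": "clean internal summary"},
]
total = run_episode(fake_step, safe_policy)
```

Against the live server, `step` would simply wrap the requests.post call shown above.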
Difficulty levels
- easy: explicit or obvious prompt injection patterns
- medium: more obfuscated attacks, hidden wrappers, spoofed instruction hierarchy, or delayed triggers
- hard: attacks embedded in semantically plausible content where the agent must also produce a safe summary
Actions
- ReadDoc: reads the current email content.
- WriteDB(content="..."): writes an internal summary or note.
- SendEmail(to="...", body="..."): sends an outbound email. This is the main exfiltration risk surface.
- FlagInjection(reason="..."): marks the content as suspicious.
- NoOp: takes no action.
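Based on the /step payload shown in the HTTP API section, a small helper can assemble these actions into the wire format (the `{"action": {...}}` envelope is taken from the step example; parameter names match the action list above):

```python
def make_action(action_type, **params):
    """Build the JSON body for POST /step, e.g.
    {"action": {"type": "SendEmail", "to": "...", "body": "..."}}."""
    return {"action": {"type": action_type, **params}}

# Example: an internal-only summary write.
payload = make_action("WriteDB", content="internal summary only")
# requests.post("http://localhost:7860/step", json=payload)
```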
Observation fields
- subject
- body
- last_action
- last_action_result
- step_count
- task_difficulty
- summary_required
- current_summary
Endpoints
- GET /
- GET /web
- GET /health
- GET /metadata
- GET /schema
- POST /reset
- POST /step
- GET /state
- GET /result
- POST /mcp
Evaluation
The evaluation runner loads scenarios from docs/person_a_benchmark_manifest.json and executes them across easy, medium, and hard splits.
The current manifest includes:
- 32 scenarios total
- 9 easy
- 13 medium
- 10 hard
Each run logs:
- per-step actions and rewards
- whether exfiltration occurred
- whether the task was completed safely
- final graded score
- per-difficulty summary statistics
Results are written to baseline/baseline_results.json.
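The per-difficulty summary step can be sketched as a small aggregation over episode records. The field names below (`difficulty`, `exfiltrated`, `score`) are illustrative assumptions, not the exact baseline_results.json schema:

```python
from collections import defaultdict

def summarize_by_difficulty(episodes):
    """Aggregate episode records into per-difficulty stats.

    Each record is assumed to look like
    {"difficulty": "easy", "exfiltrated": False, "score": 0.9}
    (illustrative fields only).
    """
    stats = defaultdict(lambda: {"episodes": 0, "exfiltrations": 0, "mean_score": 0.0})
    for ep in episodes:
        s = stats[ep["difficulty"]]
        s["episodes"] += 1
        s["exfiltrations"] += int(ep["exfiltrated"])
        s["mean_score"] += ep["score"]
    for s in stats.values():
        s["mean_score"] /= s["episodes"]  # running sum -> mean
    return dict(stats)
```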
Optional runner settings
These values can be set in .env:
EPISODES_PER_DIFFICULTY=1
BASELINE_RESULTS_PATH=baseline/submission_results.json
REQUEST_DELAY_SECONDS=3.0
MAX_MODEL_RETRIES=2
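A sketch of how a runner might read these optional settings with fallbacks; the default values here are assumptions taken from the examples above and the documented results path, not confirmed from the runner source:

```python
import os

def load_runner_settings(env=os.environ):
    """Read optional runner settings, falling back to assumed defaults."""
    return {
        "episodes_per_difficulty": int(env.get("EPISODES_PER_DIFFICULTY", "1")),
        "results_path": env.get("BASELINE_RESULTS_PATH", "baseline/baseline_results.json"),
        "request_delay_seconds": float(env.get("REQUEST_DELAY_SECONDS", "3.0")),
        "max_model_retries": int(env.get("MAX_MODEL_RETRIES", "2")),
    }
```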
Development
Project structure
Prompt-injection-env/
|-- env/
| |-- environment.py
| |-- dataset_loader.py
| |-- scenario_generator.py
| |-- injection_engine.py
| |-- policy_engine.py
| |-- taint_tracker.py
| |-- reward.py
| |-- grader.py
| `-- models.py
|-- server/
| |-- app.py
| `-- Dockerfile
|-- docs/
| |-- person_a_benchmark_manifest.json
| |-- person_a_scenario_audit.md
| `-- person_a_showcase_episodes.md
|-- baseline/
| |-- run_baseline.py
| `-- baseline_results.json
|-- inference.py
|-- run_baseline.py
|-- openenv.yaml
`-- README.md
Core components
- env/environment.py Main episode lifecycle, action handling, reward plumbing, and result assembly.
- server/app.py FastAPI wrapper exposing the environment over HTTP, including health, metadata, schema, and benchmark endpoints.
- inference.py Submission-facing inference entrypoint that validates the required env vars and delegates to the benchmark runner.
- openenv.yaml Environment metadata for OpenEnv packaging and deployment.
CI and sync
Includes GitHub workflows for:
- continuous integration in .github/workflows/ci.yml
- syncing main to Hugging Face Spaces in .github/workflows/sync.yml
Submission-focused regression tests live in submission_tests/ and can be run with:
python -m unittest discover -s submission_tests
Local Python usage
You can exercise the environment directly without starting the server:
from env.environment import IDPIExfilEnv
from env.models import Action
env = IDPIExfilEnv()
obs = env.reset(difficulty="easy")
obs, reward, done, info = env.step(Action(type="ReadDoc"))
print(obs)
print(reward)
Notes
- The environment terminates early on unsafe exfiltration events.
- Hard tasks require a clean internal summary that excludes attacker-controlled content.
- Gemini works through the submission-compatible OpenAI client path by setting API_BASE_URL, MODEL_NAME, and HF_TOKEN.