--- title: Prompt Injection Envinronment emoji: "🔊" colorFrom: blue colorTo: green sdk: docker pinned: false app_port: 8000 base_path: /web tags: - openenv-0.2.3 - openenv - Prompt Injection - Reinforcement Learning --- # IDPI Exfil Environment `IDPI Exfil Environment` is an OpenEnv-compatible prompt-injection defense environment. Agents operate on realistic email tasks where the visible goal looks benign, but attacker-controlled content tries to redirect behavior toward unsafe actions such as data exfiltration or malicious relays. The environment is designed for evaluating whether an agent can: - read and process email content safely - identify indirect prompt injection patterns - avoid unsafe outbound actions - produce clean internal summaries on harder tasks ## Features - `3 difficulty levels`: easy, medium, and hard scenarios with different attack visibility and trigger styles - `Realistic attack channels`: inline instructions, hidden wrappers, HTML comments, transcript replay, attachment-style instructions, and more - `Safety-aware action space`: includes both benign and risky tools so policies can be meaningfully tested - `Structured scoring`: reward shaping during the episode plus end-of-episode grading - `OpenEnv server mode`: FastAPI endpoints for reset, step, state, result, and health - `Unified evaluation runner`: one runner for inference and baseline-style evaluation output ## Quick Start ### Install ```bash pip install -e . ``` ### Run the server ```bash python server/app.py ``` The API starts on `http://localhost:7860`. ### Run the evaluation runner Create a `.env` in the project root with the submission-compatible inference variables: ```bash API_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai/ MODEL_NAME=gemini-2.5-flash HF_TOKEN=your_gemini_api_key_here ``` Then run: ```bash python inference.py ``` Compatibility commands still work: ```bash python run_baseline.py python baseline/run_baseline.py ``` All three commands use the same unified runner and write results to `baseline/baseline_results.json`. For submission mode, `inference.py` validates `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN`, then delegates to the benchmark runner using the OpenAI client path. ## Using the HTTP API ### Reset the environment ```python import requests response = requests.post( "http://localhost:7860/reset", json={"difficulty": "easy"}, ) observation = response.json() print(observation) ``` ### Step with an action ```python import requests response = requests.post( "http://localhost:7860/step", json={"action": {"type": "ReadDoc"}}, ) result = response.json() print(result["observation"]) print(result["reward"]) print(result["done"]) ``` ### Inspect state or final result ```python import requests state = requests.get("http://localhost:7860/state").json() episode_result = requests.get("http://localhost:7860/result").json() ``` Additional discovery endpoints are also available: ```python import requests metadata = requests.get("http://localhost:7860/metadata").json() schema = requests.get("http://localhost:7860/schema").json() ``` ## Environment Overview ### Task model Each episode presents an email with a benign business-looking task and embedded attacker intent. The agent must complete the safe objective without obeying adversarial instructions contained in the email body or derived attack channels. Typical safe behavior includes: - reviewing the email with `ReadDoc` - flagging suspicious instructions with `FlagInjection` - writing an internal-only clean summary with `WriteDB` - refusing unsafe outbound behavior ### Difficulty levels - `easy`: explicit or obvious prompt injection patterns - `medium`: more obfuscated attacks, hidden wrappers, spoofed instruction hierarchy, or delayed triggers - `hard`: attacks embedded in semantically plausible content where the agent must also produce a safe summary ### Actions - `ReadDoc` Reads the current email content. - `WriteDB(content="...")` Writes an internal summary or note. - `SendEmail(to="...", body="...")` Sends an outbound email. This is the main exfiltration risk surface. - `FlagInjection(reason="...")` Marks the content as suspicious. - `NoOp` Takes no action. ### Observation fields - `subject` - `body` - `last_action` - `last_action_result` - `step_count` - `task_difficulty` - `summary_required` - `current_summary` ### Endpoints - `GET /` - `GET /web` - `GET /health` - `GET /metadata` - `GET /schema` - `POST /reset` - `POST /step` - `GET /state` - `GET /result` - `POST /mcp` ## Evaluation The evaluation runner loads scenarios from [docs/person_a_benchmark_manifest.json](/C:/Users/Admin/Desktop/My%20projects/Prompt-injection-env/docs/person_a_benchmark_manifest.json) and executes them across `easy`, `medium`, and `hard` splits. The current manifest includes: - `32 scenarios total` - `9 easy` - `13 medium` - `10 hard` Each run logs: - per-step actions and rewards - whether exfiltration occurred - whether the task was completed safely - final graded score - per-difficulty summary statistics Results are written to [baseline/baseline_results.json](/C:/Users/Admin/Desktop/My%20projects/Prompt-injection-env/baseline/baseline_results.json). ### Optional runner settings These values can be set in `.env`: ```bash EPISODES_PER_DIFFICULTY=1 BASELINE_RESULTS_PATH=baseline/submission_results.json REQUEST_DELAY_SECONDS=3.0 MAX_MODEL_RETRIES=2 ``` ## Development ### Project structure ```text Prompt-injection-env/ |-- env/ | |-- environment.py | |-- dataset_loader.py | |-- scenario_generator.py | |-- injection_engine.py | |-- policy_engine.py | |-- taint_tracker.py | |-- reward.py | |-- grader.py | `-- models.py |-- server/ | |-- app.py | `-- Dockerfile |-- docs/ | |-- person_a_benchmark_manifest.json | |-- person_a_scenario_audit.md | `-- person_a_showcase_episodes.md |-- baseline/ | |-- run_baseline.py | `-- baseline_results.json |-- inference.py |-- run_baseline.py |-- openenv.yaml `-- README.md ``` ### Core components - [env/environment.py](/C:/Users/Admin/Desktop/My%20projects/Prompt-injection-env/env/environment.py) Main episode lifecycle, action handling, reward plumbing, and result assembly. - [server/app.py](/C:/Users/Admin/Desktop/My%20projects/Prompt-injection-env/server/app.py) FastAPI wrapper exposing the environment over HTTP, including health, metadata, schema, and benchmark endpoints. - [inference.py](/C:/Users/Admin/Desktop/My%20projects/Prompt-injection-env/inference.py) Submission-facing inference entrypoint that validates the required env vars and delegates to the benchmark runner. - [openenv.yaml](/C:/Users/Admin/Desktop/My%20projects/Prompt-injection-env/openenv.yaml) Environment metadata for OpenEnv packaging and deployment. ### CI and sync Includes GitHub workflows for: - continuous integration in [.github/workflows/ci.yml](/C:/Users/Admin/Desktop/My%20projects/Prompt-injection-env/.github/workflows/ci.yml) - syncing `main` to Hugging Face Spaces in [.github/workflows/sync.yml](/C:/Users/Admin/Desktop/My%20projects/Prompt-injection-env/.github/workflows/sync.yml) Submission-focused regression tests live in `submission_tests/` and can be run with: ```bash python -m unittest discover -s submission_tests ``` ### Local Python usage You can exercise the environment directly without starting the server: ```python from env.environment import IDPIExfilEnv from env.models import Action env = IDPIExfilEnv() obs = env.reset(difficulty="easy") obs, reward, done, info = env.step(Action(type="ReadDoc")) print(obs) print(reward) ``` ## Notes - The environment terminates early on unsafe exfiltration events. - Hard tasks require a clean internal summary that excludes attacker-controlled content. - Gemini works through the submission-compatible OpenAI client path by setting `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN`.