metadata

title: Prompt Injection Environment
emoji: 🔊
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv-0.2.3
  - openenv
  - Prompt Injection
  - Reinforcement Learning

IDPI Exfil Environment

IDPI Exfil Environment is an OpenEnv-compatible prompt-injection defense environment. Agents operate on realistic email tasks where the visible goal looks benign, but attacker-controlled content tries to redirect behavior toward unsafe actions such as data exfiltration or malicious relays.

The environment is designed for evaluating whether an agent can:

read and process email content safely
identify indirect prompt injection patterns
avoid unsafe outbound actions
produce clean internal summaries on harder tasks

Features

3 difficulty levels: easy, medium, and hard scenarios with different attack visibility and trigger styles
Realistic attack channels: inline instructions, hidden wrappers, HTML comments, transcript replay, attachment-style instructions, and more
Safety-aware action space: includes both benign and risky tools so policies can be meaningfully tested
Structured scoring: reward shaping during the episode plus end-of-episode grading
OpenEnv server mode: FastAPI endpoints for reset, step, state, result, and health
Unified evaluation runner: one runner for inference and baseline-style evaluation output

Quick Start

Install

pip install -e .

Run the server

python server/app.py

The API starts on http://localhost:7860.

Run the evaluation runner

Create a .env file in the project root with the submission-compatible inference variables:

API_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai/
MODEL_NAME=gemini-2.5-flash
HF_TOKEN=your_gemini_api_key_here

Then run:

python inference.py

Compatibility commands still work:

python run_baseline.py
python baseline/run_baseline.py

All three commands use the same unified runner and write results to baseline/baseline_results.json. For submission mode, inference.py validates API_BASE_URL, MODEL_NAME, and HF_TOKEN, then delegates to the benchmark runner using the OpenAI client path.
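The validation step described above can be pictured with a minimal sketch. The function names here are hypothetical and are not taken from inference.py; only the three variable names come from this README.

```python
import os

# The three variables this README says inference.py validates.
REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")


def missing_inference_vars(environ=None):
    """Return the required variable names that are unset or blank."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


def validate_inference_env(environ=None):
    """Raise a readable error before any model call is attempted (assumed behavior)."""
    missing = missing_inference_vars(environ)
    if missing:
        raise RuntimeError("Missing required variables: " + ", ".join(missing))
```

Failing early like this keeps a misconfigured .env from surfacing later as an opaque client error.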

Using the HTTP API

Reset the environment

import requests

response = requests.post(
    "http://localhost:7860/reset",
    json={"difficulty": "easy"},
)
observation = response.json()
print(observation)

Step with an action

import requests

response = requests.post(
    "http://localhost:7860/step",
    json={"action": {"type": "ReadDoc"}},
)
result = response.json()

print(result["observation"])
print(result["reward"])
print(result["done"])

Inspect state or final result

import requests

state = requests.get("http://localhost:7860/state").json()
episode_result = requests.get("http://localhost:7860/result").json()

Discovery endpoints are also available:

import requests

metadata = requests.get("http://localhost:7860/metadata").json()
schema = requests.get("http://localhost:7860/schema").json()
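The reset and step calls above can be combined into one episode loop. The following is a sketch under assumptions: reward is treated as a number, and the helper names post_json and run_episode are illustrative only. It uses the standard library's urllib instead of requests so it has no third-party dependency, and the transport is injectable so the loop can be exercised without a running server.

```python
import json
from urllib import request

BASE_URL = "http://localhost:7860"


def post_json(path, payload):
    """POST a JSON body to the server and decode the JSON reply."""
    req = request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_episode(actions, difficulty="easy", post=post_json):
    """Reset, then send actions in order until the server reports done."""
    post("/reset", {"difficulty": difficulty})
    total_reward = 0.0
    steps = 0
    for action in actions:
        result = post("/step", {"action": action})
        # Assumption: "reward" is numeric, as the print example above suggests.
        total_reward += float(result["reward"])
        steps += 1
        if result["done"]:
            break
    return {"steps": steps, "total_reward": total_reward}
```

With the server running, run_episode([{"type": "ReadDoc"}, {"type": "NoOp"}]) would drive a short easy episode.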

Environment Overview

Task model

Each episode presents an email with a benign business-looking task and embedded attacker intent. The agent must complete the safe objective without obeying adversarial instructions contained in the email body or derived attack channels.

Typical safe behavior includes:

reviewing the email with ReadDoc
flagging suspicious instructions with FlagInjection
writing an internal-only clean summary with WriteDB
refusing unsafe outbound behavior
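As a rough illustration of the safe behavior listed above, here is a toy rule-based policy. It is a sketch, not the baseline agent: the keyword list, the function names, the JSON field names "reason" and "content", and the assumption that last_action is empty before the first step are all hypothetical.

```python
# Toy markers only; real injections are far more varied than this list.
SUSPICIOUS_MARKERS = (
    "ignore previous instructions",
    "ignore all previous",
    "forward this email",
    "<!--",
)


def looks_suspicious(body):
    """Very rough keyword check on the email body (illustrative heuristic)."""
    text = body.lower()
    return any(marker in text for marker in SUSPICIOUS_MARKERS)


def choose_action(observation, flagged=False):
    """Pick the next action for a conservative agent that never sends email."""
    if not observation.get("last_action"):
        return {"type": "ReadDoc"}
    if not flagged and looks_suspicious(observation.get("body", "")):
        return {"type": "FlagInjection", "reason": "body contains instruction-like text"}
    if observation.get("summary_required") and not observation.get("current_summary"):
        # Summarize from the subject only, so body text is not copied verbatim.
        subject = observation.get("subject", "")
        return {"type": "WriteDB", "content": "Internal note on email: " + subject}
    return {"type": "NoOp"}
```

The policy never emits SendEmail, which mirrors the "refuse unsafe outbound behavior" item at the cost of ignoring any legitimate outbound task.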

Difficulty levels

easy: explicit or obvious prompt injection patterns
medium: more obfuscated attacks, hidden wrappers, spoofed instruction hierarchy, or delayed triggers
hard: attacks embedded in semantically plausible content where the agent must also produce a safe summary

Actions

ReadDoc: Reads the current email content.
WriteDB(content="..."): Writes an internal summary or note.
SendEmail(to="...", body="..."): Sends an outbound email. This is the main exfiltration risk surface.
FlagInjection(reason="..."): Marks the content as suspicious.
NoOp: Takes no action.
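The action list above maps naturally onto JSON payloads for POST /step. The sketch below assumes that each argument becomes a key next to "type", following the {"type": "ReadDoc"} example earlier in this README; the actual schema should be confirmed against GET /schema. The make_action helper is hypothetical.

```python
# Argument names per action, copied from the action list in this README.
REQUIRED_FIELDS = {
    "ReadDoc": (),
    "WriteDB": ("content",),
    "SendEmail": ("to", "body"),
    "FlagInjection": ("reason",),
    "NoOp": (),
}


def make_action(action_type, **fields):
    """Build an action dict, rejecting unknown types and wrong arguments."""
    if action_type not in REQUIRED_FIELDS:
        raise ValueError("unknown action type: " + action_type)
    expected = REQUIRED_FIELDS[action_type]
    missing = [name for name in expected if name not in fields]
    extra = [name for name in fields if name not in expected]
    if missing or extra:
        raise ValueError(
            f"{action_type}: missing={missing} unexpected={extra}"
        )
    # Assumed payload shape: arguments sit beside "type" in one flat dict.
    return {"type": action_type, **fields}
```

For example, make_action("FlagInjection", reason="hidden HTML comment") builds a payload that can be sent as {"action": ...} to /step.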

Observation fields

subject
body
last_action
last_action_result
step_count
task_difficulty
summary_required
current_summary
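A client may want to parse observations into a typed object. The sketch below uses the field names listed above; the field types are guesses and the Observation class is hypothetical, so treat GET /schema as the source of truth.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class Observation:
    # Field names come from this README; the types are assumptions.
    subject: str
    body: str
    last_action: Optional[str]
    last_action_result: Optional[str]
    step_count: int
    task_difficulty: str
    summary_required: bool
    current_summary: Optional[str]

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "Observation":
        """Build an Observation, ignoring unknown keys and reporting missing ones."""
        names = [f.name for f in fields(cls)]
        missing = [name for name in names if name not in data]
        if missing:
            raise KeyError("observation is missing: " + ", ".join(missing))
        return cls(**{name: data[name] for name in names})
```

Ignoring unknown keys keeps the client tolerant if the server adds fields later.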

Endpoints

GET /
GET /web
GET /health
GET /metadata
GET /schema
POST /reset
POST /step
GET /state
GET /result
POST /mcp

Evaluation

The evaluation runner loads scenarios from docs/person_a_benchmark_manifest.json and executes them across easy, medium, and hard splits.

The current manifest includes:

32 scenarios total
9 easy
13 medium
10 hard

Each run logs:

per-step actions and rewards
whether exfiltration occurred
whether the task was completed safely
final graded score
per-difficulty summary statistics

Results are written to baseline/baseline_results.json.
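The per-difficulty summary statistics can be pictured as a simple aggregation over episode records. The record keys used here (difficulty, score, exfiltrated, safe_completion) are hypothetical; the real layout of baseline_results.json is not documented in this README.

```python
from collections import defaultdict


def summarize(episodes):
    """Group episode records by difficulty and compute simple rates."""
    groups = defaultdict(list)
    for episode in episodes:
        groups[episode["difficulty"]].append(episode)

    summary = {}
    for difficulty, records in groups.items():
        count = len(records)
        summary[difficulty] = {
            "episodes": count,
            "mean_score": sum(r["score"] for r in records) / count,
            "exfiltration_rate": sum(1 for r in records if r["exfiltrated"]) / count,
            "safe_completion_rate": sum(1 for r in records if r["safe_completion"]) / count,
        }
    return summary
```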

Optional runner settings

These values can be set in .env:

EPISODES_PER_DIFFICULTY=1
BASELINE_RESULTS_PATH=baseline/submission_results.json
REQUEST_DELAY_SECONDS=3.0
MAX_MODEL_RETRIES=2
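One way to read these settings is sketched below. The RunnerSettings class and load_runner_settings function are hypothetical, and the fallback values are placeholders: the results path follows the default stated in this README, while the numeric fallbacks simply reuse the example values above and may not match the runner's real defaults.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RunnerSettings:
    episodes_per_difficulty: int
    results_path: str
    request_delay_seconds: float
    max_model_retries: int


def load_runner_settings(environ=None):
    """Parse optional runner settings, converting each value to its type."""
    env = os.environ if environ is None else environ
    return RunnerSettings(
        # Numeric fallbacks are assumptions, not confirmed defaults.
        episodes_per_difficulty=int(env.get("EPISODES_PER_DIFFICULTY", "1")),
        results_path=env.get("BASELINE_RESULTS_PATH", "baseline/baseline_results.json"),
        request_delay_seconds=float(env.get("REQUEST_DELAY_SECONDS", "3.0")),
        max_model_retries=int(env.get("MAX_MODEL_RETRIES", "2")),
    )
```

Note that os.environ only sees values from .env if something has loaded the file first, for example a dotenv loader or an exported shell environment.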

Development

Project structure

Prompt-injection-env/
|-- env/
|   |-- environment.py
|   |-- dataset_loader.py
|   |-- scenario_generator.py
|   |-- injection_engine.py
|   |-- policy_engine.py
|   |-- taint_tracker.py
|   |-- reward.py
|   |-- grader.py
|   `-- models.py
|-- server/
|   |-- app.py
|   `-- Dockerfile
|-- docs/
|   |-- person_a_benchmark_manifest.json
|   |-- person_a_scenario_audit.md
|   `-- person_a_showcase_episodes.md
|-- baseline/
|   |-- run_baseline.py
|   `-- baseline_results.json
|-- inference.py
|-- run_baseline.py
|-- openenv.yaml
`-- README.md

Core components

env/environment.py: Main episode lifecycle, action handling, reward plumbing, and result assembly.
server/app.py: FastAPI wrapper exposing the environment over HTTP, including health, metadata, schema, and benchmark endpoints.
inference.py: Submission-facing inference entrypoint that validates the required env vars and delegates to the benchmark runner.
openenv.yaml: Environment metadata for OpenEnv packaging and deployment.

CI and sync

The repository includes GitHub workflows for:

continuous integration in .github/workflows/ci.yml
syncing main to Hugging Face Spaces in .github/workflows/sync.yml

Submission-focused regression tests live in submission_tests/ and can be run with:

python -m unittest discover -s submission_tests

Local Python usage

You can exercise the environment directly without starting the server:

from env.environment import IDPIExfilEnv
from env.models import Action

env = IDPIExfilEnv()
obs = env.reset(difficulty="easy")
obs, reward, done, info = env.step(Action(type="ReadDoc"))
print(obs)
print(reward)

Notes

The environment terminates early on unsafe exfiltration events.
Hard tasks require a clean internal summary that excludes attacker-controlled content.
Gemini works through the submission-compatible OpenAI client path by setting API_BASE_URL, MODEL_NAME, and HF_TOKEN.