iitian committed on
Commit b23936a · 1 Parent(s): 30134ef

Prepare and optimize for Hugging Face Spaces deployment
Files changed (6)
  1. .vscode/settings.json +1 -0
  2. Dockerfile +3 -3
  3. README.md +57 -47
  4. inference.py +65 -70
  5. openenv.yaml +2 -2
  6. server/app.py +1 -1
.vscode/settings.json ADDED
@@ -0,0 +1 @@
+ {}
Dockerfile CHANGED
@@ -14,8 +14,8 @@ COPY openenv.yaml .
  COPY README.md .
  COPY DOCUMENTATION.md .
 
- # Expose the API port
- EXPOSE 8000
+ # Expose the API port (Hugging Face default)
+ EXPOSE 7860
 
  # Start server
- CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]
+ CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860"]
README.md CHANGED
@@ -1,64 +1,74 @@
- # CloudSecurityAuditor OpenEnv
-
- A standardized AI agent environment for simulating real-world cloud security audits. Built using the **OpenEnv** specification, it allows agents to interact with a mock cloud infrastructure to identify and remediate vulnerabilities.
-
+ ---
+ title: Cloud Security Auditor
+ emoji: 🛡️
+ colorFrom: blue
+ colorTo: indigo
+ sdk: docker
+ app_port: 7860
+ pinned: false
+ license: apache-2.0
+ ---
+
+ # 🛡️ CloudSecurityAuditor OpenEnv
+
+ **CloudSecurityAuditor** is a high-fidelity, standardized AI agent environment designed to simulate real-world cloud security audit scenarios. Built upon the **OpenEnv** specification, it provides a safe, reproducible sandbox where autonomous agents can practice identifying, analyzing, and remediating critical security vulnerabilities in a mock cloud infrastructure.
+
+ This environment is specifically engineered for benchmarking LLM-based security agents, offering a structured API and deterministic evaluation metrics.
+
  ## 🌟 Key Features
- - **Typed Models**: Full Pydantic support for actions and observations.
- - **Three Task Tiers**: Includes Easy (Information Gathering), Medium (Remediation), and Hard (Forensic Analysis).
- - **Gymnasium-Compatible API**: Implements `step()`, `reset()`, and `state()` methods.
- - **Reward-Driven**: Scalar rewards from 0.0 to 1.0 based on task completion.
-
- ## 🛠 Action Space
- The agent can perform the following actions via the `step()` method:
-
- - **`list`**: Lists resources of a specific type (`s3`, `ec2`).
- - **`describe`**: Fetches detailed configuration for a specific resource ID.
- - **`modify`**: Updates resource configurations (e.g., security groups).
- - **`logs`**: Retrieves logs for a specific resource or service.
- - **`submit`**: Submits the final answer for the evaluation tasks.
-
- ## 📊 Observation Space
- Each step returns a `CloudObservation` containing:
- - `resources`: A list of discovered resource records.
- - `details`: Metadata for a specific resource.
- - `logs`: Relevant log entries.
- - `status`: Human-readable status message.
- - `info`: Additional environment metadata.
-
- ## 📋 Tasks
-
- 1. **Easy (S3 Public Audit)**: Identify all public S3 buckets in the 'prod' region.
- 2. **Medium (EC2 Security Patch)**: Find an EC2 instance with RDP port open to the internet and close it.
- 3. **Hard (IAM Log Forensic)**: Trace unauthorized actions in `auth-logs` to identify a rogue IP address.
-
- ## 🚀 Setup & Installation
-
- ### Local Installation
+
+ - **Standardized API**: Fully compliant with the `openenv-core` specification, featuring Gymnasium-style `step()`, `reset()`, and `state()` methods.
+ - **Realistic Cloud Mocking**: Simulates S3 bucket configurations, EC2 security groups, and IAM audit logs with high precision.
+ - **Multi-Tiered Evaluation**:
+   - **Easy (Audit)**: Focuses on information gathering and resource tagging.
+   - **Medium (Remediation)**: Requires active patching and configuration changes.
+   - **Hard (Forensics)**: Demands log analysis and pattern matching to identify rogue actors.
+ - **Typed Observations**: Robust Pydantic-based action and observation models ensure reliable agent-environment interactions.
+ - **Automated Grading**: Scalar reward functions (0.0 to 1.0) provide immediate, granular feedback on agent performance.
+
+ ## 🛠 Action & Observation Space
+
+ ### Actions
+ - `list`: Inventory resources (`s3`, `ec2`).
+ - `describe`: Deep-dive into resource metadata.
+ - `modify`: Apply security patches and rule updates.
+ - `logs`: Extract forensic evidence from authentication logs.
+ - `submit`: Finalize the task with a structured answer.
+
+ ### Observations
+ - `resources`: Comprehensive resource records.
+ - `details`: Metadata for specific entities.
+ - `logs`: Event-based log entries.
+ - `status`: Execution status and helper messages.
+
+ ## 📊 Available Tasks
+
+ | ID | Name | Objective | Difficulty |
+ |:---|:---|:---|:---|
+ | `easy` | **S3 Public Audit** | Identify public 'prod' buckets. | Auditing |
+ | `medium` | **EC2 Security Patch** | Remediate open RDP ports (3389). | Remediation |
+ | `hard` | **IAM Log Forensic** | Trace 'DeleteStorage' actions in logs. | Forensics |
+
+ ## 🚀 Quick Start (Hugging Face)
+
+ If you are running this in a **Hugging Face Space**:
+
+ 1. **Examine the API**: The environment is hosted as a FastAPI server. Use the `/ui` endpoint for a visual dashboard.
+ 2. **Inference**: Run the `inference.py` script locally, pointing the `ENV_URL` to your Space's URL.
+ 3. **Evaluate**: The system will emit standardized logs for automated leaderboard tracking.
+
+ ## 🐳 Local Deployment
+
  ```bash
+ # Clone and Install
  pip install -r requirements.txt
- ```
-
- ### Running the Server
- ```bash
+
+ # Run Server
  python -m server.app
- ```
- The server will start on `http://localhost:8000`.
-
- ### Running the Baseline Agent
- ```bash
- python scripts/baseline_inference.py
- ```
-
- ## 🐳 Docker Deployment
- To build and run the containerized environment:
- ```bash
- docker build -t cloud-security-auditor-env .
- docker run -p 8000:8000 cloud-security-auditor-env
+
+ # Run Baseline
+ python inference.py
  ```
-
- ## 🤗 Hugging Face Spaces
- This environment is designed to be deployed as an **OpenEnv Space**.
- 1. Create a new Space on Hugging Face.
- 2. Select **Docker** as the SDK.
- 3. Upload the repository contents (including `openenv.yaml` and `Dockerfile`).
- 4. Set the `entrypoint` to match the `uvicorn` command in `openenv.yaml`.
+
+ ---
+ Built with ❤️ for the AI Security community.
inference.py CHANGED
@@ -21,11 +21,13 @@ from openai import OpenAI
  # ──────────────────────────────────────────────
  # Configuration from environment variables
  # ──────────────────────────────────────────────
- API_BASE_URL = os.environ.get("API_BASE_URL", "https://openrouter.ai/api/v1")
- MODEL_NAME = os.environ.get("MODEL_NAME", "openai/gpt-4o-mini")
- HF_TOKEN = os.environ.get("HF_TOKEN", "")
+ API_BASE_URL = os.getenv("API_BASE_URL", "https://openrouter.ai/api/v1")
+ MODEL_NAME = os.getenv("MODEL_NAME", "openai/gpt-4o-mini")
+ HF_TOKEN = os.getenv("HF_TOKEN", "")
+ LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME", "")
 
- ENV_URL = os.environ.get("ENV_URL", "http://localhost:8000")
+ ENV_URL = os.getenv("ENV_URL", "http://localhost:8000")
+ BENCHMARK_NAME = "cloud-security-auditor"
 
  # Initialize OpenAI-compatible client
  client = OpenAI(
@@ -119,7 +121,13 @@ def ask_llm(system_prompt: str, conversation: list) -> dict:
      # Strip markdown code fences if present
      if raw.startswith("```"):
          lines = raw.split("\n")
-         raw = "\n".join(lines[1:-1]) if len(lines) > 2 else raw
+         # Handle cases where the JSON block is the only content
+         if "{" in raw:
+             start = raw.find("{")
+             end = raw.rfind("}") + 1
+             raw = raw[start:end]
+         else:
+             raw = "\n".join(lines[1:-1]) if len(lines) > 2 else raw
 
      try:
          return json.loads(raw)
@@ -135,81 +143,91 @@ def ask_llm(system_prompt: str, conversation: list) -> dict:
  # ──────────────────────────────────────────────
  # Structured logging helpers
  # ──────────────────────────────────────────────
- def log_start(task_id: str, task_name: str):
-     """Emit [START] log."""
-     print(f"[START] task_id={task_id} task_name={task_name} timestamp={datetime.now(timezone.utc).isoformat()}")
+ def log_start(task_name: str):
+     """
+     [START] task=<task_name> env=<benchmark> model=<model_name>
+     """
+     print(f"[START] task={task_name} env={BENCHMARK_NAME} model={MODEL_NAME}")
      sys.stdout.flush()
 
 
- def log_step(task_id: str, step_num: int, action: dict, observation: dict, reward: float, done: bool):
-     """Emit [STEP] log."""
-     print(
-         f"[STEP] task_id={task_id} step={step_num} "
-         f"action={json.dumps(action)} "
-         f"observation={json.dumps(observation)} "
-         f"reward={reward} done={done} "
-         f"timestamp={datetime.now(timezone.utc).isoformat()}"
-     )
+ def log_step(step_num: int, action: dict, reward: float, done: bool, error: str = None):
+     """
+     [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
+     """
+     error_str = "null" if not error else error
+     # Remove newlines from action for single-line requirement
+     action_str = json.dumps(action).replace("\n", " ")
+     done_str = "true" if done else "false"
+     print(f"[STEP] step={step_num} action={action_str} reward={reward:.2f} done={done_str} error={error_str}")
      sys.stdout.flush()
 
 
- def log_end(task_id: str, task_name: str, final_score: float, total_steps: int):
-     """Emit [END] log."""
-     print(
-         f"[END] task_id={task_id} task_name={task_name} "
-         f"score={final_score} steps={total_steps} "
-         f"timestamp={datetime.now(timezone.utc).isoformat()}"
-     )
+ def log_end(success: bool, total_steps: int, score: float, rewards: list):
+     """
+     [END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
+     """
+     success_str = "true" if success else "false"
+     rewards_str = ",".join([f"{r:.2f}" for r in rewards])
+     print(f"[END] success={success_str} steps={total_steps} score={score:.2f} rewards={rewards_str}")
      sys.stdout.flush()
 
 
  # ──────────────────────────────────────────────
  # Main task runner
  # ──────────────────────────────────────────────
- def run_task(task: dict) -> float:
-     """Run a single task using the LLM agent. Returns the final reward score."""
+ def run_task(task: dict):
+     """Run a single task using the LLM agent."""
      task_id = task["id"]
      task_name = task["name"]
      system_prompt = task["system_prompt"]
 
-     log_start(task_id, task_name)
+     log_start(task_id)
 
      # Reset environment
-     reset_data = env_reset(task_id)
-     obs = reset_data.get("observation", {})
-     info = obs.get("info", "")
+     try:
+         reset_data = env_reset(task_id)
+         obs = reset_data.get("observation", {})
+         info = obs.get("info", "")
+     except Exception as e:
+         log_end(success=False, total_steps=0, score=0.0, rewards=[])
+         return
 
      conversation = [
          {"role": "user", "content": f"Task started. Environment says: {info}\nDecide your first action."}
      ]
 
-     cumulative_reward = 0.0
+     rewards = []
      step_num = 0
+     success = False
+     last_error = None
 
      for step_num in range(1, MAX_STEPS_PER_TASK + 1):
          try:
             # Ask LLM for next action
             action = ask_llm(system_prompt, conversation)
         except Exception as e:
-             print(f"[ERROR] LLM call failed at step {step_num}: {e}", file=sys.stderr)
+             last_error = f"LLM error: {str(e)}"
+             log_step(step_num, {"error": "LLM failed"}, 0.0, True, error=last_error)
             break
 
         # Execute the action in the environment
         try:
             result = env_step(action)
+             obs = result.get("observation", {})
+             reward = result.get("reward", 0.0)
+             done = result.get("done", False)
+             last_error = obs.get("last_action_error")
         except Exception as e:
-             print(f"[ERROR] Environment step failed at step {step_num}: {e}", file=sys.stderr)
+             last_error = f"Env error: {str(e)}"
+             log_step(step_num, action, 0.0, True, error=last_error)
             break
 
-         obs = result.get("observation", {})
-         reward = result.get("reward", 0.0)
-         done = result.get("done", False)
-         cumulative_reward += reward
-
-         # Log the step
-         log_step(task_id, step_num, action, obs, reward, done)
+         rewards.append(reward)
+         log_step(step_num, action, reward, done, error=last_error)
 
         if done:
+             success = (reward >= 1.0)  # Assume 1.0 is full success
             break
 
         # Build observation summary for the LLM
@@ -231,44 +249,21 @@ def run_task(task: dict) -> float:
      conversation.append({"role": "assistant", "content": json.dumps(action)})
      conversation.append({"role": "user", "content": f"Observation from environment:\n{obs_text}\n\nDecide your next action."})
 
-     log_end(task_id, task_name, cumulative_reward, step_num)
-     return cumulative_reward
+     # Calculate final score (normalized to [0, 1])
+     final_score = max(0.0, min(1.0, sum(rewards)))
+
+     log_end(success=success, total_steps=step_num, score=final_score, rewards=rewards)
 
 
  # ──────────────────────────────────────────────
  # Entry point
  # ──────────────────────────────────────────────
  def main():
-     print("=" * 60)
-     print("CloudSecurityAuditor — OpenEnv Inference")
-     print(f"Model: {MODEL_NAME}")
-     print(f"API: {API_BASE_URL}")
-     print(f"Env: {ENV_URL}")
-     print("=" * 60)
-     sys.stdout.flush()
-
-     total_score = 0.0
-     results = []
-
      for task in TASKS:
         try:
-             score = run_task(task)
-             results.append({"task_id": task["id"], "task_name": task["name"], "score": score})
-             total_score += score
-         except Exception as e:
-             print(f"[ERROR] Task {task['id']} failed: {e}", file=sys.stderr)
-             results.append({"task_id": task["id"], "task_name": task["name"], "score": 0.0})
-
-     # Final summary
-     print("\n" + "=" * 60)
-     print("FINAL RESULTS")
-     print("=" * 60)
-     for r in results:
-         status = "✅ PASS" if r["score"] >= 1.0 else "❌ FAIL"
-         print(f" {r['task_name']:25s} → score={r['score']:.2f} {status}")
-     print(f"\n Total Score: {total_score:.2f} / {len(TASKS)}.00")
-     print("=" * 60)
-     sys.stdout.flush()
+             run_task(task)
+         except Exception:
+             pass
 
 
  if __name__ == "__main__":
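The new fence-stripping branch in `ask_llm` can be exercised standalone. A minimal sketch of the same logic (the `extract_json` wrapper name is hypothetical; the branch body mirrors the committed code):

```python
import json

def extract_json(raw: str) -> dict:
    """Mirror the commit's fence handling: prefer slicing from the
    first '{' to the last '}', falling back to dropping fence lines."""
    raw = raw.strip()
    if raw.startswith("```"):
        lines = raw.split("\n")
        if "{" in raw:
            start = raw.find("{")
            end = raw.rfind("}") + 1
            raw = raw[start:end]
        else:
            raw = "\n".join(lines[1:-1]) if len(lines) > 2 else raw
    return json.loads(raw)

print(extract_json('```json\n{"action": "list", "resource": "s3"}\n```'))
# {'action': 'list', 'resource': 's3'}
```

Slicing from the first `{` to the last `}` is more forgiving than dropping fence lines, since models often emit prose before or after the code block.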
openenv.yaml CHANGED
@@ -5,8 +5,8 @@ hardware:
  tier: "cpu-small"
  vCPU: 2
  RAM: 4Gi
- port: 8000
- entrypoint: "uvicorn server.app:app --host 0.0.0.0 --port 8000"
+ port: 7860
+ entrypoint: "uvicorn server.app:app --host 0.0.0.0 --port 7860"
  tags:
  - security
  - cloud
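After this commit the port must agree across `Dockerfile` (`EXPOSE`/`CMD`), `openenv.yaml` (`port`/`entrypoint`), and `server/app.py`. A stdlib-only consistency check could catch a missed spot; a sketch (the `extract_ports` helper and its regex are illustrative, not part of the repo):

```python
import re

def extract_ports(text: str) -> set:
    """Collect every port number mentioned in the common forms used here:
    EXPOSE 7860, port: 7860, --port 7860 / "--port", "7860", port=7860."""
    pattern = r"(?:EXPOSE\s+|port:\s*|--port[\s\"',]+|port=)(\d{2,5})"
    return {int(m) for m in re.findall(pattern, text)}

# Fragments of the three files touched by this commit
dockerfile = 'EXPOSE 7860\nCMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860"]'
openenv = 'port: 7860\nentrypoint: "uvicorn server.app:app --host 0.0.0.0 --port 7860"'
app_py = 'uvicorn.run(app, host="0.0.0.0", port=7860)'

ports = extract_ports(dockerfile) | extract_ports(openenv) | extract_ports(app_py)
assert ports == {7860}, f"port mismatch: {ports}"
```

Run against the real files, a mismatch (say, a leftover 8000) would surface as a two-element set.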
server/app.py CHANGED
@@ -39,4 +39,4 @@ async def get_state():
 
  if __name__ == "__main__":
      import uvicorn
-     uvicorn.run(app, host="0.0.0.0", port=8000)
+     uvicorn.run(app, host="0.0.0.0", port=7860)
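The `[START]`/`[STEP]`/`[END]` records introduced in `inference.py` are single-line `key=value` strings that downstream leaderboard tooling has to parse back. A sketch for the `[END]` line (the parser is hypothetical and assumes exactly the format emitted by `log_end`):

```python
import re

def parse_end_line(line: str) -> dict:
    """Parse an [END] record like:
    [END] success=true steps=4 score=1.00 rewards=0.10,0.90
    Returns typed fields; raises ValueError on non-END lines."""
    m = re.match(
        r"\[END\] success=(true|false) steps=(\d+) score=([\d.]+) rewards=([\d.,]*)",
        line,
    )
    if not m:
        raise ValueError(f"not an [END] line: {line!r}")
    success, steps, score, rewards = m.groups()
    return {
        "success": success == "true",
        "steps": int(steps),
        "score": float(score),
        "rewards": [float(r) for r in rewards.split(",")] if rewards else [],
    }

print(parse_end_line("[END] success=true steps=3 score=1.00 rewards=0.00,0.00,1.00"))
# {'success': True, 'steps': 3, 'score': 1.0, 'rewards': [0.0, 0.0, 1.0]}
```

Keeping each record on one line (as `log_step` does by stripping newlines from the action JSON) is what makes this kind of line-oriented parsing reliable.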