DEVessi committed on
Commit
ec8b2ca
·
verified ·
1 Parent(s): 1f6c7ae

Upload folder using huggingface_hub

Files changed (4)
  1. README.md +255 -224
  2. models.py +26 -10
  3. scenario_config.json +131 -0
  4. server/devops_sandbox_environment.py +313 -156
README.md CHANGED
@@ -1,224 +1,255 @@
- ---
- title: Self-Healing DevOps Sandbox
- emoji: 🔧
- colorFrom: red
- colorTo: green
- sdk: docker
- pinned: false
- app_port: 8000
- base_path: /web
- tags:
- - openenv
- ---
-
- # Self-Healing DevOps Sandbox
-
- An OpenEnv RL environment where an AI agent is dropped into a **broken Node.js backend** inside a Docker container. The agent must use **bash commands only** to diagnose bugs, edit files, and fix the app -- just like a real DevOps engineer would.
-
- Built for the **Meta PyTorch OpenEnv Hackathon**.
-
- ---
-
- ## What Is This?
-
- A 3-task challenge of increasing difficulty. The agent starts in a Docker container with a broken Express.js app in `/app` and must make all endpoints healthy.
-
- | # | Difficulty | Bug | What's Wrong |
- |---|-----------|-----------------|---------------------------------------|
- | 1 | Easy | `config.json` | Port set to `9999` instead of `3000` |
- | 2 | Medium | `routes/users.js`| Missing `)` causes SyntaxError crash |
- | 3 | Hard | `routes/data.js` | Missing `await` causes HTTP 500 |
-
- **Goal:** Fix all bugs so these endpoints return HTTP 200:
- - `GET /health` returns `{"status": "ok"}`
- - `GET /api/users` returns `{"users": [...]}`
- - `GET /api/data` returns `{"records": [...]}`
-
- ---
-
- ## Scoring (Partial Rewards)
-
- The grader runs **after every command** and awards cumulative points:
-
- | Milestone | Points | Total |
- |----------------------------------|--------|----------|
- | App starts on port 3000 | +0.35 | 0.35 |
- | `/health` returns 200 | +0.10 | 0.45 |
- | `/api/users` returns valid JSON | +0.15 | 0.60 |
- | `/api/data` returns valid JSON | +0.25 | 0.85 |
- | All endpoints correct | +0.15 | **1.00** |
-
- ---
-
- ## Getting Started
-
- ### Prerequisites
-
- - **Python 3.10+**
- - **Docker Desktop** (running)
- - **uv** package manager (`pip install uv`)
-
- ### 1. Install Dependencies
-
- ```bash
- cd devops_sandbox
- uv sync
- ```
-
- ### 2. Build the Sandbox Docker Image
-
- ```bash
- docker build -t devops-sandbox-node:latest -f simulated_app/Dockerfile simulated_app/
- ```
-
- ### 3. Start the Environment Server
-
- ```bash
- uv run server
- ```
-
- The server starts at `http://localhost:8000`.
-
- ### 4. Run the Baseline Agent
-
- In a **separate terminal**:
-
- ```bash
- # Set your OpenAI API key
- export OPENAI_API_KEY="sk-..."    # Linux/Mac
- $env:OPENAI_API_KEY = "sk-..."    # PowerShell
-
- # Run the baseline
- uv run python baseline.py
- ```
-
- ---
-
- ## Test Your Own Agent
-
- ### Option A: Use the Python Client
-
- ```python
- from devops_sandbox import BashAction, DevopsSandboxEnv
-
- with DevopsSandboxEnv(base_url="http://localhost:8000").sync() as env:
-     # Reset creates a fresh Docker container
-     result = env.reset()
-     print(result.observation.stdout)        # Task description
-     print(result.observation.grader_score)  # 0.0
-
-     # Send bash commands
-     result = env.step(BashAction(command="cat /app/config.json"))
-     print(result.observation.stdout)        # File contents
-     print(result.observation.grader_score)  # Score after grading
-
-     # Fix a bug
-     result = env.step(BashAction(command="sed -i 's/9999/3000/' /app/config.json"))
-     print(result.observation.grader_score)  # Partial score
-
-     # Check if done
-     if result.done:
-         print("Episode complete!")
- ```
-
- ### Option B: Use the REST API Directly
-
- ```bash
- # Reset the environment
- curl -X POST http://localhost:8000/reset
-
- # Send a command
- curl -X POST http://localhost:8000/step \
-   -H "Content-Type: application/json" \
-   -d '{"action": {"command": "ls -la /app"}}'
- ```
-
- ### Option C: Use the WebSocket Endpoint
-
- Connect to `ws://localhost:8000/ws` for persistent sessions.
-
- ---
-
- ## Project Structure
-
- ```
- devops_sandbox/
- |-- openenv.yaml       # OpenEnv manifest
- |-- pyproject.toml     # Python dependencies
- |-- README.md          # This file
- |-- baseline.py        # LLM-powered baseline agent
- |-- models.py          # BashAction & TerminalObservation schemas
- |-- client.py          # Python client for the environment
- |
- |-- server/
- |   |-- app.py         # FastAPI server (entry point)
- |   +-- devops_sandbox_environment.py  # Environment logic + grader
- |
- +-- simulated_app/     # The broken Node.js app (Docker context)
-     |-- Dockerfile     # node:20-slim sandbox container
-     |-- package.json   # Express.js project
-     |-- server.js      # Main entry point
-     |-- config.json    # Bug 1: wrong port
-     +-- routes/
-         |-- users.js   # Bug 2: syntax error
-         +-- data.js    # Bug 3: missing await
- ```
-
- ---
-
- ## How It Works
-
- ```
- +-----------+   BashAction    +------------+   docker exec   +--------------+
- |  Agent    | --------------> |  OpenEnv   | --------------> |   Docker     |
- | (LLM/RL)  |                 |   Server   |                 |  Container   |
- |           | <-------------- |   (8000)   | <-------------- | (broken app) |
- +-----------+   Observation   +-----+------+  stdout/stderr  +--------------+
-                + grader_score       |
-                               +-----+------+
-                               |   Grader   |
-                               | (curl test |
-                               |  endpoints)|
-                               +------------+
- ```
-
- 1. **Agent** sends a `BashAction` (e.g., `cat /app/config.json`)
- 2. **Server** runs it inside the Docker container via `docker exec`
- 3. **Grader** restarts the Node app and curls all endpoints
- 4. **Observation** returns: stdout, stderr, score (0.0-1.0), feedback
-
- ---
-
- ## Configuration
-
- | Env Variable | Default | Description |
- |--------------------|--------------------------|------------------------------------|
- | `OPENAI_API_KEY` | *(required)* | OpenAI API key for baseline |
- | `OPENAI_MODEL` | `gpt-4o-mini` | LLM model to use |
- | `OPENAI_BASE_URL` | *(OpenAI default)* | Custom endpoint (Ollama, vLLM) |
- | `MAX_TURNS` | `30` | Max steps per episode |
- | `DEVOPS_SANDBOX_URL`| `http://localhost:8000` | Environment server URL |
-
- ### Use with Local LLMs (Ollama, vLLM)
-
- ```bash
- export OPENAI_BASE_URL="http://localhost:11434/v1"
- export OPENAI_MODEL="llama3"
- export OPENAI_API_KEY="dummy"
- uv run python baseline.py
- ```
-
- ---
-
- ## Validation
-
- ```bash
- uv run openenv validate
- # Expected: [OK] devops_sandbox: Ready for multi-mode deployment
- ```
-
- ---
-
- ## License
-
- BSD-style license. See LICENSE for details.
+ ---
+ title: Self-Healing DevOps Sandbox
+ emoji: 🔧
+ colorFrom: red
+ colorTo: green
+ sdk: docker
+ pinned: false
+ app_port: 8000
+ base_path: /web
+ tags:
+ - openenv
+ ---
+
+ # 🔧 Self-Healing DevOps Sandbox
+
+ An **OpenEnv RL environment** where an AI agent is dropped into a broken Node.js Express backend and must use **bash commands only** to diagnose and fix production-like bugs — just like a real DevOps engineer responding to a 3 AM incident.
+
+ Built for the **Meta PyTorch OpenEnv Hackathon**.
+
+ ---
+
+ ## 🎯 Why This Environment?
+
+ DevOps debugging is one of the most **high-value, real-world tasks** for AI agents. Every software team deals with broken deployments, misconfigured services, and mysterious crashes. This environment tests whether an AI agent can:
+
+ - **Read and understand** error logs, config files, and source code
+ - **Diagnose root causes** from symptoms (crash logs → specific file + line)
+ - **Apply targeted fixes** using command-line tools (sed, echo, etc.)
+ - **Verify its own work** by restarting services and checking endpoints
+
+ ---
+
+ ## 🏗️ Task Design
+
+ ### Three Difficulty Levels
+
+ | # | Task | Bugs | What's Broken | Grading Target |
+ |---|------|------|---------------|----------------|
+ | 1 | `easy` | 1 | `config.json` → port `9999` instead of `3000` | Fix port, app starts |
+ | 2 | `medium` | 2 | + `routes/users.js` → missing `)` causes SyntaxError | + `/api/users` works |
+ | 3 | `hard` | 3 | + `routes/data.js` → missing `await` breaks async response | All endpoints pass |
+
+ Each task builds on the previous — meaningful difficulty progression where easy tasks are subsets of harder ones.
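The subset structure described above can be expressed directly. A minimal sketch (the `BUG_FILES` and `TASK_BUGS` names are illustrative, not taken from the repo):

```python
# The three bug files, in the order tasks introduce them
BUG_FILES = ["config.json", "routes/users.js", "routes/data.js"]

# Each difficulty level keeps a prefix of the bug list, so easier
# tasks are strict subsets of harder ones
TASK_BUGS = {
    "easy": BUG_FILES[:1],
    "medium": BUG_FILES[:2],
    "hard": BUG_FILES[:3],
}

assert set(TASK_BUGS["easy"]) < set(TASK_BUGS["medium"]) < set(TASK_BUGS["hard"])
```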
+
+ ### The Broken App (`/app`)
+
+ ```
+ /app/
+ ├── config.json      ← Bug 1: port set to 9999 (should be 3000)
+ ├── package.json     ← Express.js project config
+ ├── server.js        ← Main entry point (loads config + routes)
+ └── routes/
+     ├── users.js     ← Bug 2: missing closing parenthesis on router.get()
+     └── data.js      ← Bug 3: missing `await` before async DB call
+ ```
+
+ ---
+
+ ## 📊 Reward Shaping
+
+ The grader runs **after every command** and awards granular partial credit:
+
+ ### Phase 1: File-Level Verification
+ | Event | Points |
+ |-------|--------|
+ | Modified `config.json` | +0.05 |
+ | Modified `routes/users.js` | +0.05 |
+ | Modified `routes/data.js` | +0.05 |
+
+ ### Phase 2: HTTP Endpoint Testing
+ | Milestone | Points |
+ |-----------|--------|
+ | App starts on port 3000 | +0.30 |
+ | `GET /health` returns 200 | +0.10 |
+ | `GET /api/users` returns valid JSON | +0.15 |
+ | `GET /api/data` returns valid JSON | +0.20 |
+ | All endpoints passing (bonus) | +0.05 |
+
+ ### Phase 3: Difficulty Scaling
+ Raw scores are scaled by task difficulty so each task can reach near-maximum independently.
+
+ > **All scores are strictly within (0, 1)** per the OpenEnv specification — never exactly 0.0 or 1.0.
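One way to combine difficulty scaling with the strict (0, 1) bound is to scale the raw point total and then clamp it just inside the interval. A sketch under assumed values: the 0.01/0.99 bounds and the per-task scale factors are illustrative, the real constants live in the grader:

```python
# Illustrative per-task scale factors: each task's raw maximum maps near 1.0
TASK_SCALE = {"easy": 2.0, "medium": 1.4, "hard": 1.0}

def final_score(raw: float, task: str, lo: float = 0.01, hi: float = 0.99) -> float:
    # Scale by difficulty, then clamp strictly inside (0, 1)
    scaled = raw * TASK_SCALE[task]
    return max(lo, min(hi, scaled))
```

With this shape, a perfect easy run saturates at `hi` rather than 1.0, and a zero raw score is reported as `lo` rather than 0.0, matching the note above.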
+
+ ---
+
+ ## 🚀 Getting Started
+
+ ### Docker (Recommended)
+
+ ```bash
+ docker build -t devops-sandbox:latest .
+ docker run --rm -p 8000:8000 devops-sandbox:latest
+ curl http://localhost:8000/health
+ ```
+
+ Health response: `{"status":"healthy","service":"devops_sandbox"}`
+
+ ### Without Docker
+
+ ```bash
+ uv sync
+ uvicorn server.app:app --host 0.0.0.0 --port 8000
+ ```
+
+ ### Quick Start (Demo)
+
+ Update the API key in `scenario_config.json` and run:
+
+ ```bash
+ python inference.py
+ ```
+
+ ---
+
+ ## 🧪 Test Your Own Agent
+
+ ### Option A: Python Client
+
+ ```python
+ from client import DevopsSandboxEnv
+ from models import BashAction
+
+ with DevopsSandboxEnv(base_url="http://localhost:8000").sync() as env:
+     # Reset with task difficulty
+     result = env.reset(task_name="easy")
+     print(result.observation.stdout)          # Task description
+     print(result.observation.grader_score)    # 0.01
+
+     # Send bash commands
+     result = env.step(BashAction(command="cat /app/config.json"))
+     print(result.observation.stdout)          # File contents
+     print(result.observation.metadata)        # Rich metadata
+
+     # Fix a bug
+     result = env.step(BashAction(command="sed -i 's/9999/3000/' /app/config.json"))
+     print(result.observation.grader_score)    # Score increases
+     print(result.observation.grader_feedback) # "✓ Modified config.json (+0.05)"
+ ```
+
+ ### Option B: REST API
+
+ ```bash
+ # Reset the environment
+ curl -X POST http://localhost:8000/reset -d '{"task_name": "hard"}'
+
+ # Send a command
+ curl -X POST http://localhost:8000/step \
+   -H "Content-Type: application/json" \
+   -d '{"action": {"command": "ls -la /app"}}'
+ ```
+
+ ### Option C: WebSocket
+
+ Connect to `ws://localhost:8000/ws` for persistent sessions.
+
+ ---
+
+ ## 📁 Project Structure
+
+ ```
+ devops_sandbox/
+ ├── openenv.yaml          # OpenEnv manifest (spec_version: 1)
+ ├── pyproject.toml        # Python dependencies
+ ├── Dockerfile            # HF Spaces deployment
+ ├── scenario_config.json  # Task definitions + verifiers
+ ├── models.py             # BashAction & TerminalObservation (Pydantic)
+ ├── client.py             # Python client for the environment
+ ├── inference.py          # LLM baseline agent (3-task evaluation)
+ │
+ ├── server/
+ │   ├── app.py            # FastAPI server (OpenEnv entry point)
+ │   └── devops_sandbox_environment.py  # Core environment + grader
+ │
+ └── simulated_app/        # The broken Node.js app
+     ├── package.json
+     ├── server.js
+     ├── config.json       # Bug 1: wrong port
+     └── routes/
+         ├── users.js      # Bug 2: syntax error
+         └── data.js       # Bug 3: missing await
+ ```
+
+ ---
+
+ ## ⚙️ Architecture
+
+ ```
+ ┌──────────┐   BashAction    ┌────────────┐   subprocess   ┌──────────────┐
+ │  Agent   │ ──────────────> │  OpenEnv   │ ────────────>  │    /app/     │
+ │ (LLM/RL) │                 │   Server   │                │ (broken app) │
+ │          │ <────────────── │  (:8000)   │ <────────────  │              │
+ └──────────┘   Observation   └─────┬──────┘ stdout/stderr  └──────────────┘
+               + grader_score       │
+               + metadata     ┌─────┴──────┐
+                              │   Grader   │
+                              │ ┌────────┐ │
+                              │ │File Δ  │ │ ← Detects which files were modified
+                              │ │Checker │ │
+                              │ ├────────┤ │
+                              │ │HTTP    │ │ ← Starts app, curls all endpoints
+                              │ │Tester  │ │
+                              │ └────────┘ │
+                              └────────────┘
+ ```
+
+ 1. **Agent** sends a `BashAction` (e.g., `cat /app/config.json`)
+ 2. **Server** executes it via `subprocess.run()` in the `/app` directory
+ 3. **Grader** runs two-phase verification:
+    - **File tracking**: MD5 hash comparison to detect which bug files changed
+    - **HTTP testing**: Starts the Node app, curls `/health`, `/api/users`, `/api/data`
+ 4. **Observation** returns: stdout, stderr, score (0.01–0.99), feedback, and metadata
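The file-tracking phase can be sketched as a digest snapshot taken at reset and re-compared after each command. The function names here are illustrative, not the environment's actual API:

```python
import hashlib
from pathlib import Path

def snapshot(root: str, files: list[str]) -> dict[str, str]:
    """MD5 digest of each tracked bug file under `root`."""
    digests = {}
    for name in files:
        path = Path(root) / name
        if path.exists():
            digests[name] = hashlib.md5(path.read_bytes()).hexdigest()
    return digests

def modified_since(before: dict[str, str], root: str, files: list[str]) -> list[str]:
    """Names of tracked files whose content changed since the snapshot."""
    after = snapshot(root, files)
    return [n for n in files if before.get(n) != after.get(n)]
```

Comparing digests rather than mtimes means a no-op edit (e.g., `touch`) does not count as a modification, which keeps the +0.05 file-level rewards honest.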
+
+ ---
+
+ ## 📋 Observation Metadata
+
+ Each observation includes rich metadata for training analysis:
+
+ ```json
+ {
+   "episode_id": "abc-123",
+   "step": 3,
+   "task": "hard",
+   "max_steps": 50,
+   "bugs_total": 3,
+   "files_modified": ["config.json", "routes/users.js"],
+   "commands_count": 3
+ }
+ ```
+
+ ---
+
+ ## 🔧 Configuration
+
+ | Env Variable | Default | Description |
+ |-------------|---------|-------------|
+ | `HF_TOKEN` | *(required)* | Hugging Face token for LLM API |
+ | `MODEL_NAME` | `gpt-4o-mini` | LLM model to use |
+ | `API_BASE_URL` | `https://router.huggingface.co/v1` | LLM endpoint |
+ | `MAX_TURNS` | `8` | Max steps per task in inference |
+
+ ---
+
+ ## ✅ Validation
+
+ ```bash
+ uv run openenv validate
+ # Expected: [OK] devops_sandbox: Ready for deployment
+ ```
+
+ ---
+
+ ## 📄 License
+
+ BSD-style license. See LICENSE for details.
models.py CHANGED
@@ -8,10 +8,11 @@
 Data models for the Self-Healing DevOps Sandbox Environment.

 Defines the Action and Observation types used by the RL agent to interact
- with a broken Node.js backend running inside a Docker container.
 """

- from typing import Any, Dict

 from pydantic import Field

@@ -19,10 +20,10 @@ from openenv.core.env_server.types import Action, Observation


 class BashAction(Action):
-     """Action: a bash command to execute inside the Docker sandbox.

-     The agent sends shell commands (ls, cat, sed, node, etc.) to diagnose
-     and repair the broken Node.js application.
     """

     command: str = Field(
@@ -39,7 +40,7 @@ class TerminalObservation(Observation):
     """Observation returned after executing a bash command.

     Includes stdout/stderr from the command, working directory context,
-     the current task identifier, and the grader's partial score.
     """

     stdout: str = Field(
@@ -56,15 +57,30 @@ class TerminalObservation(Observation):
     )
     task_id: str = Field(
         default="devops_sandbox",
-         description="Identifier for the current task scenario.",
     )
     grader_score: float = Field(
-         default=0.0,
         ge=0.0,
         le=1.0,
-         description="The grader's partial reward (0.0 to 1.0).",
     )
     grader_feedback: str = Field(
         default="",
         description="Human-readable feedback from the grader.",
-     )

 Data models for the Self-Healing DevOps Sandbox Environment.

 Defines the Action and Observation types used by the RL agent to interact
+ with a broken Node.js backend. The agent acts as a DevOps engineer, diagnosing
+ and fixing production-like bugs using bash commands.
 """

+ from typing import Any, Dict, List, Optional

 from pydantic import Field


 class BashAction(Action):
+     """Action: a bash command to execute inside the sandbox.

+     The agent sends shell commands (ls, cat, sed, grep, node, npm, etc.)
+     to diagnose and repair the broken Node.js application.
     """

     command: str = Field(

     """Observation returned after executing a bash command.

     Includes stdout/stderr from the command, working directory context,
+     the current task identifier, grader's partial score, and episode metadata.
     """

     stdout: str = Field(

     )
     task_id: str = Field(
         default="devops_sandbox",
+         description="Identifier for the current task scenario (easy/medium/hard).",
     )
     grader_score: float = Field(
+         default=0.01,
         ge=0.0,
         le=1.0,
+         description="The grader's partial reward strictly within (0, 1).",
     )
     grader_feedback: str = Field(
         default="",
         description="Human-readable feedback from the grader.",
+     )
+     done: bool = Field(
+         default=False,
+         description="Whether the episode is complete (all bugs fixed or max steps reached).",
+     )
+     reward: Optional[float] = Field(
+         default=None,
+         description="Incremental reward for this step (score delta).",
+     )
+     metadata: Dict[str, Any] = Field(
+         default_factory=dict,
+         description="Additional metadata: files_modified, commands_count, bugs_found, etc.",
+     )
+
+
+ __all__ = ["BashAction", "TerminalObservation"]
scenario_config.json ADDED
@@ -0,0 +1,131 @@
+ {
+   "env_name": "devops_sandbox",
+   "env_url": "http://localhost:8000",
+   "llm_provider": "openai",
+   "llm_model": "gpt-4o-mini",
+   "llm_api_key": "",
+   "temperature": 0.2,
+   "max_tokens": 256,
+   "max_steps_per_task": 8,
+   "system_prompt": "You are an expert DevOps engineer and Node.js developer. You have been dropped into a Linux container with a broken Express.js backend in /app. Your goal is to diagnose and fix bugs so the app runs correctly. Respond ONLY with a JSON object: {\"command\": \"<bash command>\"}. Use standard bash/Linux commands. Do NOT use interactive editors (vi, nano). Be methodical: read files first, understand the bug, then fix it.",
+   "tasks": [
+     {
+       "task_name": "easy",
+       "description": "Fix the port configuration bug so the app starts on port 3000 and /health returns HTTP 200.",
+       "hints": ["Check config.json for wrong settings"],
+       "expected_bugs": ["config.json has port set to 9999 instead of 3000"],
+       "verifiers": [
+         {
+           "name": "Config Port Fixed",
+           "description": "Ensures config.json has port set to 3000",
+           "verification_type": "file_content",
+           "file_path": "/app/config.json",
+           "expected_content": "3000"
+         },
+         {
+           "name": "Health Endpoint Responds",
+           "description": "GET /health returns HTTP 200",
+           "verification_type": "http_check",
+           "endpoint": "/health",
+           "expected_status": 200
+         }
+       ]
+     },
+     {
+       "task_name": "medium",
+       "description": "Fix the port config AND the syntax error in routes/users.js so /api/users returns valid JSON.",
+       "hints": [
+         "Check config.json for wrong settings",
+         "Look for syntax errors in routes/users.js — missing closing parenthesis"
+       ],
+       "expected_bugs": [
+         "config.json has port set to 9999 instead of 3000",
+         "routes/users.js is missing a closing parenthesis on the router.get() call"
+       ],
+       "verifiers": [
+         {
+           "name": "Config Port Fixed",
+           "description": "Ensures config.json has port set to 3000",
+           "verification_type": "file_content",
+           "file_path": "/app/config.json",
+           "expected_content": "3000"
+         },
+         {
+           "name": "Syntax Error Fixed",
+           "description": "routes/users.js has closing parenthesis on router.get",
+           "verification_type": "file_content",
+           "file_path": "/app/routes/users.js",
+           "expected_content": "});"
+         },
+         {
+           "name": "Users Endpoint Responds",
+           "description": "GET /api/users returns HTTP 200 with users array",
+           "verification_type": "http_check",
+           "endpoint": "/api/users",
+           "expected_status": 200,
+           "expected_body_contains": "\"users\""
+         }
+       ]
+     },
+     {
+       "task_name": "hard",
+       "description": "Fix ALL three bugs: port config, syntax error, and missing await in the async data handler.",
+       "hints": [
+         "Check config.json for wrong settings",
+         "Look for syntax errors that prevent startup",
+         "Watch out for async/await issues in routes/data.js"
+       ],
+       "expected_bugs": [
+         "config.json has port set to 9999 instead of 3000",
+         "routes/users.js is missing a closing parenthesis on the router.get() call",
+         "routes/data.js is missing 'await' before fetchDataFromDB() causing a pending Promise"
+       ],
+       "verifiers": [
+         {
+           "name": "Config Port Fixed",
+           "description": "Ensures config.json has port set to 3000",
+           "verification_type": "file_content",
+           "file_path": "/app/config.json",
+           "expected_content": "3000"
+         },
+         {
+           "name": "Syntax Error Fixed",
+           "description": "routes/users.js has closing parenthesis",
+           "verification_type": "file_content",
+           "file_path": "/app/routes/users.js",
+           "expected_content": "});"
+         },
+         {
+           "name": "Await Added",
+           "description": "routes/data.js uses await before fetchDataFromDB()",
+           "verification_type": "file_content",
+           "file_path": "/app/routes/data.js",
+           "expected_content": "await fetchDataFromDB"
+         },
+         {
+           "name": "Health Endpoint",
+           "description": "GET /health returns HTTP 200",
+           "verification_type": "http_check",
+           "endpoint": "/health",
+           "expected_status": 200
+         },
+         {
+           "name": "Users Endpoint",
+           "description": "GET /api/users returns valid JSON with users",
+           "verification_type": "http_check",
+           "endpoint": "/api/users",
+           "expected_status": 200,
+           "expected_body_contains": "\"users\""
+         },
+         {
+           "name": "Data Endpoint",
+           "description": "GET /api/data returns valid JSON with records",
+           "verification_type": "http_check",
+           "endpoint": "/api/data",
+           "expected_status": 200,
+           "expected_body_contains": "\"records\""
+         }
+       ]
+     }
+   ]
+ }
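The `file_content` verifiers in this config are simple substring checks against a file. A minimal runner sketch, assuming the absolute `/app/...` paths are re-rooted under a local directory so it can run outside the container (`check_file_verifiers` is an illustrative name, not part of the repo):

```python
import json
from pathlib import Path

def check_file_verifiers(config: dict, app_root: str) -> dict[str, bool]:
    """Run every file_content verifier in a scenario config against app_root."""
    results = {}
    for task in config["tasks"]:
        for v in task["verifiers"]:
            if v["verification_type"] != "file_content":
                continue  # http_check verifiers need a running server
            rel = Path(v["file_path"]).relative_to("/app")
            target = Path(app_root) / rel
            passed = target.exists() and v["expected_content"] in target.read_text()
            results[f"{task['task_name']}:{v['name']}"] = passed
    return results
```

Loading the real file with `json.load(open("scenario_config.json"))` and passing a sandbox copy of `/app` would report, per task, which file-level checks currently pass.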
server/devops_sandbox_environment.py CHANGED
@@ -7,17 +7,32 @@
 """
 Self-Healing DevOps Sandbox — Environment Implementation.

 Runs entirely natively on the host filesystem (Hugging Face Spaces compatible).
- The RL agent executes bash commands to diagnose and fix 3 bugs via direct subprocesses.
 """

 import logging
 import os
 import shutil
 import subprocess
 import sys
 from pathlib import Path
- from typing import Any, Optional
 from uuid import uuid4

 from openenv.core.env_server.interfaces import Environment
@@ -37,11 +52,27 @@ EXPECTED_PORT = 3000  # The port the fixed app should listen on
 MAX_STEPS = 50  # Episode budget
 SIMULATED_APP_DIR = Path(__file__).resolve().parent.parent / "simulated_app"

 class DevOpsSandbox(Environment):
     """
     RL environment: fix a broken Node.js backend.
-     No longer uses Docker (Docker-in-Docker is unsupported in HF Spaces).
-     Instead, uses native subprocess.run() in a reset /app/ directory.
     """

     SUPPORTS_CONCURRENT_SESSIONS: bool = False
@@ -52,12 +83,12 @@ class DevOpsSandbox(Environment):
         self._current_dir: str = "/app"
         self._last_score: float = 0.01
         self._current_task: str = "hard"
-
-         # When running on Windows locally, `/app` and `/app_backup` don't exist naturally,
-         # so we will use absolute paths mapped to our repo if they aren't at root.
-         # But for HF Space (Linux), /app will be at root.
         if sys.platform == "win32":
-             # For Windows local dev, use safe paths inside the workspace
             workspace = Path(__file__).resolve().parent.parent
             self._app_dir = str(workspace / ".app_sandbox")
             self._app_backup_dir = str(SIMULATED_APP_DIR)
@@ -65,101 +96,89 @@ class DevOpsSandbox(Environment):
             os.makedirs(self._tmp_dir, exist_ok=True)
             self._current_dir = self._app_dir
         else:
-             # For Hugging Face Spaces (Linux)
             self._app_dir = "/app"
             self._app_backup_dir = "/app_backup"
             self._tmp_dir = "/tmp"
             self._current_dir = "/app"

     def reset(
         self,
         seed: Optional[int] = None,
         episode_id: Optional[str] = None,
         **kwargs: Any,
     ) -> TerminalObservation:
-         """Reset the environment state by copying the backup to the working dir."""
         eid = episode_id or str(uuid4())
         self._state = State(episode_id=eid, step_count=0)
         self._last_score = 0.01
         self._current_dir = self._app_dir
         self._current_task = kwargs.get("task_name", "hard")

         self._reset_filesystem()
         self._inject_grader_script()

         # Gather initial observation
-         init_stdout = self._exec_cmd(f"ls -la {self._app_dir} && echo '---' && cat {os.path.join(self._app_dir, 'config.json')}")
-
-         if self._current_task == "easy":
-             task_prompt = (
-                 "=== SELF-HEALING DEVOPS SANDBOX ===\n"
-                 f"You have been dropped into a container with a broken Node.js Express backend in {self._app_dir}.\n\n"
-                 "YOUR MISSION [EASY]: Diagnose and fix the port bug so that:\n"
-                 "  1. The app starts without errors on port 3000\n"
-                 "  2. GET /health returns HTTP 200\n\n"
-                 "HINTS:\n"
-                 "  - Check config.json for wrong settings\n\n"
-                 "Use bash commands to explore, edit files, and test.\n"
-                 "When you think you've fixed everything, run: npm start\n\n"
-                 "--- INITIAL DIRECTORY LISTING ---\n"
-                 f"{init_stdout}\n"
-             )
-         elif self._current_task == "medium":
-             task_prompt = (
-                 "=== SELF-HEALING DEVOPS SANDBOX ===\n"
-                 f"You have been dropped into a container with a broken Node.js Express backend in {self._app_dir}.\n\n"
-                 "YOUR MISSION [MEDIUM]: Diagnose and fix TWO bugs so that:\n"
-                 "  1. The app starts without errors on port 3000\n"
-                 "  2. GET /health returns HTTP 200\n"
-                 "  3. GET /api/users returns HTTP 200 with valid JSON\n\n"
-                 "HINTS:\n"
-                 "  - Check config.json for wrong settings\n"
-                 "  - Look for syntax errors in routes/users.js\n\n"
-                 "Use bash commands to explore, edit files, and test.\n"
-                 "When you think you've fixed everything, run: npm start\n\n"
-                 "--- INITIAL DIRECTORY LISTING ---\n"
-                 f"{init_stdout}\n"
-             )
-         else:
-             task_prompt = (
-                 "=== SELF-HEALING DEVOPS SANDBOX ===\n"
-                 f"You have been dropped into a container with a broken Node.js Express backend in {self._app_dir}.\n\n"
-                 "YOUR MISSION: Diagnose and fix ALL bugs so that:\n"
-                 "  1. The app starts without errors on port 3000\n"
-                 "  2. GET /health returns HTTP 200\n"
-                 "  3. GET /api/users returns HTTP 200 with valid JSON\n"
-                 "  4. GET /api/data returns HTTP 200 with valid JSON\n\n"
-                 "HINTS:\n"
-                 "  - Check config files for wrong settings\n"
-                 "  - Look for syntax errors that prevent startup\n"
-                 "  - Watch out for async/await issues\n\n"
-                 "Use bash commands to explore, edit files, and test.\n"
-                 "When you think you've fixed everything, run: npm start\n\n"
-                 "--- INITIAL DIRECTORY LISTING ---\n"
-                 f"{init_stdout}\n"
             )

         return TerminalObservation(
             stdout=task_prompt,
             stderr="",
             current_dir=self._current_dir,
             task_id=self._current_task,
             grader_score=0.01,
-             grader_feedback="Episode started. Fix the bugs!",
             done=False,
             reward=0.01,
         )

     def step(
         self,
-         action: BashAction,  # type: ignore[override]
         timeout_s: Optional[float] = None,
         **kwargs: Any,
     ) -> TerminalObservation:
-         """Execute the agent's command natively, run grader, return observation."""
-         self._state.step_count += 1
         command = action.command.strip()
         if not command:
             return TerminalObservation(
                 stdout="",
@@ -170,42 +189,14 @@ class DevOpsSandbox(Environment):
                 grader_feedback="No command executed.",
                 done=False,
                 reward=0.01,
             )

-         # Handle 'cd' commands manually since subprocess run is transient
-         if command.startswith("cd "):
-             target = command[3:].strip()
-             # Handle standard cd edge cases
-             if target == "" or target == "~":
-                 # Assuming /app is home for this exercise
-                 new_dir = self._app_dir
-             elif target.startswith("/"):
-                 new_dir = os.path.normpath(target)
-             else:
-                 new_dir = os.path.normpath(os.path.join(self._current_dir, target))
-
-             if os.path.isdir(new_dir):
-                 self._current_dir = new_dir
-                 stdout, stderr = "", ""
-             else:
-                 stdout, stderr = "", f"bash: cd: {target}: No such file or directory"
-
-             # Run the grader anyway, even if just a cd
-             score, feedback = self._grade()
-             reward = max(0.0, score - self._last_score)
-             self._last_score = score
-             episode_done = (score >= 0.99) or (self._state.step_count >= MAX_STEPS)

-             return TerminalObservation(
-                 stdout=stdout,
-                 stderr=stderr,
-                 current_dir=self._current_dir,
-                 task_id=self._current_task,
-                 grader_score=score,
-                 grader_feedback=feedback,
-                 done=episode_done,
-                 reward=reward,
-             )

         # Execute normal command
         try:
@@ -214,8 +205,12 @@ class DevOpsSandbox(Environment):
         except Exception as e:
             stdout, stderr = "", f"Command execution error: {e}"

         score, feedback = self._grade()
-         reward = max(0.0, score - self._last_score)
         self._last_score = score
         episode_done = (score >= 0.99) or (self._state.step_count >= MAX_STEPS)

@@ -228,6 +223,7 @@ class DevOpsSandbox(Environment):
             grader_feedback=feedback,
             done=episode_done,
             reward=reward,
         )

     @property
@@ -235,18 +231,151 @@ class DevOpsSandbox(Environment):
         return self._state

     def close(self) -> None:
-         # pkill node servers that we might have spawned during the session
         self._exec_cmd("pkill -f 'node server.js'")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
241
  # ==================================================================
242
  # FILESYSTEM & EXECUTION HELPERS
243
  # ==================================================================
244
  def _reset_filesystem(self) -> None:
245
- """Replace the current working /app with the pristine /app_backup."""
246
- # Ensure we don't accidentally wipe out the whole host on windows if paths are wrong
247
  os.makedirs(self._app_dir, exist_ok=True)
248
-
249
- # Clean contents of /app instead of deleting /app itself
250
  for item in os.listdir(self._app_dir):
251
  item_path = os.path.join(self._app_dir, item)
252
  if os.path.isdir(item_path):
@@ -256,8 +385,8 @@ class DevOpsSandbox(Environment):
256
  os.remove(item_path)
257
  except OSError:
258
  pass
259
-
260
- # Copy from backup to app dir
261
  if os.path.exists(self._app_backup_dir):
262
  for item in os.listdir(self._app_backup_dir):
263
  s = os.path.join(self._app_backup_dir, item)
@@ -267,14 +396,17 @@ class DevOpsSandbox(Environment):
267
  else:
268
  shutil.copy2(s, d)
269
  else:
270
- logger.warning(f"Backup directory {self._app_backup_dir} not found. Ensure Dockerfile copied simulated_app here.")
 
 
 
271
 
272
  def _exec_cmd(self, cmd: str, timeout: float = 30.0) -> str:
273
  """Execute command natively; return combined output."""
274
  stdout, stderr = self._exec_cmd_split(cmd, timeout)
275
  return (stdout + "\n" + stderr).strip()
276
 
277
- def _exec_cmd_split(self, cmd: str, timeout: float = 30.0) -> tuple:
278
  """Execute command natively; return (stdout, stderr)."""
279
  kwargs = {
280
  "cwd": self._current_dir,
@@ -282,8 +414,6 @@ class DevOpsSandbox(Environment):
282
  "capture_output": True,
283
  "timeout": timeout,
284
  }
285
-
286
- # Hugging Face space requires POSIX bash, windows uses powershell/cmd
287
  if sys.platform != "win32":
288
  kwargs["executable"] = "/bin/bash"
289
 
@@ -302,6 +432,7 @@ class DevOpsSandbox(Environment):
302
  # GRADER
303
  # ==================================================================
304
  def _inject_grader_script(self) -> None:
 
305
  self.grader_path = os.path.join(self._tmp_dir, "grader.sh")
306
  lines = [
307
  '#!/bin/bash',
@@ -314,6 +445,7 @@ class DevOpsSandbox(Environment):
314
  f'node server.js > {self._tmp_dir}/node.log 2>&1 &',
315
  'NODE_PID=$!',
316
  '',
 
317
  'for i in 1 2 3 4; do',
318
  ' sleep 1',
319
  ' if curl -s http://localhost:3000/health > /dev/null 2>&1; then',
@@ -339,22 +471,45 @@ class DevOpsSandbox(Environment):
339
  'echo "GRADER_USERS_BODY:${USERS_BODY}"',
340
  'echo "GRADER_DATA_BODY:${DATA_BODY}"',
341
  ]
342
-
343
  script_content = '\n'.join(lines) + '\n'
344
  with open(self.grader_path, "w", newline='\n') as f:
345
  f.write(script_content)
346
-
347
  if sys.platform != "win32":
348
  subprocess.run(["chmod", "+x", self.grader_path])
349
 
350
- def _grade(self) -> tuple:
 
 
 
 
 
 
 
 
 
 
 
 
351
  score = 0.0
352
  feedback_parts = []
353
 
 
 
 
 
 
 
 
 
 
 
 
 
 
354
  try:
355
  if sys.platform == "win32":
356
- # We use bash via wsl or bash.exe on Windows if we can,
357
- # but if not we might fail grading natively on Windows unless Git Bash is installed.
358
  raw = self._exec_cmd(f"bash {self.grader_path}", timeout=20.0)
359
  else:
360
  raw = self._exec_cmd(f"/bin/bash {self.grader_path}", timeout=20.0)
@@ -373,68 +528,70 @@ class DevOpsSandbox(Environment):
373
  data_body = results.get("GRADER_DATA_BODY", "")
374
 
375
  has_syntax_error = "SyntaxError" in startup_log
376
- has_crash = (has_syntax_error
377
- or "Cannot find module" in startup_log
378
- or "ReferenceError" in startup_log)
 
 
379
  app_listening = f"Server running on port {EXPECTED_PORT}" in startup_log
380
 
381
  if has_crash and not app_listening:
382
- feedback_parts.append(f"βœ— App crashes on startup")
383
  if has_syntax_error:
384
  feedback_parts.append("(SyntaxError detected)")
385
- return (score, " | ".join(feedback_parts))
386
-
387
- if app_listening:
388
- score += 0.35
389
- feedback_parts.append("βœ“ App starts on port 3000 (+0.35)")
390
- else:
391
  feedback_parts.append("βœ— App not listening on port 3000")
392
- return (score, " | ".join(feedback_parts))
393
-
394
- if health_code == "200":
395
- score += 0.10
396
- feedback_parts.append("βœ“ /health returns 200 (+0.10)")
397
  else:
398
- feedback_parts.append(f"βœ— /health returned {health_code}")
 
 
399
 
400
- if users_code == "200":
401
- if '"users"' in users_body:
402
- score += 0.15
403
- feedback_parts.append("βœ“ /api/users returns valid JSON (+0.15)")
404
  else:
405
- score += 0.05
406
- feedback_parts.append("~ /api/users 200 but bad body (+0.05)")
407
- else:
408
- feedback_parts.append(f"βœ— /api/users returned {users_code}")
409
-
410
- if data_code == "200":
411
- if '"records"' in data_body:
412
- score += 0.25
413
- feedback_parts.append("βœ“ /api/data returns valid JSON (+0.25)")
414
  else:
415
- score += 0.05
416
- feedback_parts.append("~ /api/data 200 but bad body (+0.05)")
417
- else:
418
- feedback_parts.append(f"βœ— /api/data returned {data_code}")
 
 
 
 
 
 
 
419
 
420
- if score >= 0.85:
421
- score = min(score + 0.15, 1.0)
422
- feedback_parts.append("βœ“ All endpoints healthy β€” FULL SCORE (+0.15)")
423
 
424
  except Exception as exc:
425
  logger.exception("Grader error")
426
  feedback_parts.append(f"Grader error (score preserved): {exc}")
427
 
428
- # Scale score based on task difficulty
429
  if self._current_task == "easy":
430
- raw_target = 0.45
431
  elif self._current_task == "medium":
432
- raw_target = 0.60
433
  else:
434
  raw_target = 1.0
435
-
436
  final_score = min(1.0, score / raw_target)
437
- # Cap strictly within (0, 1) per Phase 2 Validator requirements
438
  final_score = round(min(max(final_score, 0.01), 0.99), 2)
439
-
440
  return (final_score, " | ".join(feedback_parts))
 
 """
 Self-Healing DevOps Sandbox β€” Environment Implementation.

+An RL environment where an AI agent is dropped into a broken Node.js Express
+backend and must use bash commands to diagnose and fix production-like bugs.
+
 Runs entirely natively on the host filesystem (Hugging Face Spaces compatible).
+The agent executes bash commands to diagnose and fix 3 bugs via direct subprocesses.
+
+Bugs injected:
+    1. config.json β€” wrong port (9999 instead of 3000)
+    2. routes/users.js β€” missing closing parenthesis (SyntaxError)
+    3. routes/data.js β€” missing `await` on async DB call (broken response)
+
+Grading:
+    - File-level verification: did the agent edit the correct file?
+    - HTTP endpoint testing: does the app start and respond correctly?
+    - Partial credit: smooth reward progression from 0.01 to 0.99
 """

+import hashlib
+import json
 import logging
 import os
 import shutil
 import subprocess
 import sys
 from pathlib import Path
+from typing import Any, Dict, List, Optional, Tuple
 from uuid import uuid4

 from openenv.core.env_server.interfaces import Environment

 MAX_STEPS = 50  # Episode budget
 SIMULATED_APP_DIR = Path(__file__).resolve().parent.parent / "simulated_app"

+# Files that contain bugs β€” used for file-change tracking
+BUG_FILES = {
+    "config.json": "port",
+    "routes/users.js": "syntax",
+    "routes/data.js": "await",
+}
+
+
 class DevOpsSandbox(Environment):
     """
     RL environment: fix a broken Node.js backend.
+
+    The agent operates in a Linux filesystem with a broken Express.js app.
+    It must use bash commands (ls, cat, sed, grep, etc.) to find and fix bugs.
+
+    Features:
+    - 3 difficulty levels (easy/medium/hard) with progressive bug counts
+    - File-change tracking for granular reward shaping
+    - HTTP endpoint verification via automated grader
+    - Rich metadata in observations (files_modified, bugs_found, etc.)
+    - All scores strictly within (0, 1) per OpenEnv spec
     """

     SUPPORTS_CONCURRENT_SESSIONS: bool = False

         self._current_dir: str = "/app"
         self._last_score: float = 0.01
         self._current_task: str = "hard"
+        self._file_hashes: Dict[str, str] = {}
+        self._files_modified: List[str] = []
+        self._commands_history: List[str] = []
+
+        # Platform-specific paths
         if sys.platform == "win32":
             workspace = Path(__file__).resolve().parent.parent
             self._app_dir = str(workspace / ".app_sandbox")
             self._app_backup_dir = str(SIMULATED_APP_DIR)

             os.makedirs(self._tmp_dir, exist_ok=True)
             self._current_dir = self._app_dir
         else:
             self._app_dir = "/app"
             self._app_backup_dir = "/app_backup"
             self._tmp_dir = "/tmp"
             self._current_dir = "/app"

+    # ==================================================================
+    # RESET
+    # ==================================================================
     def reset(
         self,
         seed: Optional[int] = None,
         episode_id: Optional[str] = None,
         **kwargs: Any,
     ) -> TerminalObservation:
+        """Reset the environment state for a new episode.
+
+        Args:
+            seed: Optional random seed (unused, bugs are deterministic).
+            episode_id: Optional episode identifier.
+            **kwargs: Must include task_name='easy'|'medium'|'hard'.
+
+        Returns:
+            TerminalObservation with the task prompt and initial state.
+        """
         eid = episode_id or str(uuid4())
         self._state = State(episode_id=eid, step_count=0)
         self._last_score = 0.01
         self._current_dir = self._app_dir
         self._current_task = kwargs.get("task_name", "hard")
+        self._files_modified = []
+        self._commands_history = []

         self._reset_filesystem()
+        self._snapshot_file_hashes()
         self._inject_grader_script()

         # Gather initial observation
+        init_stdout = self._exec_cmd(
+            f"ls -la {self._app_dir} && echo '---' && cat {os.path.join(self._app_dir, 'config.json')}"
         )

+        task_prompt = self._build_task_prompt(init_stdout)
+
         return TerminalObservation(
             stdout=task_prompt,
             stderr="",
             current_dir=self._current_dir,
             task_id=self._current_task,
             grader_score=0.01,
+            grader_feedback="Episode started. Diagnose and fix the bugs!",
             done=False,
             reward=0.01,
+            metadata={
+                "episode_id": eid,
+                "task": self._current_task,
+                "max_steps": MAX_STEPS,
+                "bugs_total": self._bugs_for_task(),
+                "bugs_found": 0,
+                "files_modified": [],
+            },
         )

+    # ==================================================================
+    # STEP
+    # ==================================================================
     def step(
         self,
+        action: BashAction,
         timeout_s: Optional[float] = None,
         **kwargs: Any,
     ) -> TerminalObservation:
+        """Execute the agent's command, run the grader, return observation.
+
+        Args:
+            action: BashAction containing the command string.
+            timeout_s: Optional timeout for command execution.
+
+        Returns:
+            TerminalObservation with command output, score, and metadata.
+        """
+        self._state.step_count += 1
         command = action.command.strip()
+
         if not command:
             return TerminalObservation(
                 stdout="",

                 grader_feedback="No command executed.",
                 done=False,
                 reward=0.01,
+                metadata=self._build_metadata(),
             )

+        self._commands_history.append(command)

+        # Handle 'cd' commands manually (subprocess is transient)
+        if command.startswith("cd "):
+            return self._handle_cd(command)

         # Execute normal command
         try:

         except Exception as e:
             stdout, stderr = "", f"Command execution error: {e}"

+        # Check for file modifications
+        self._detect_file_changes()
+
+        # Grade the current state
         score, feedback = self._grade()
+        reward = max(0.01, score - self._last_score)
         self._last_score = score
         episode_done = (score >= 0.99) or (self._state.step_count >= MAX_STEPS)

             grader_feedback=feedback,
             done=episode_done,
             reward=reward,
+            metadata=self._build_metadata(),
         )

     @property

         return self._state

     def close(self) -> None:
+        """Clean up: kill any Node.js servers spawned during the episode."""
         self._exec_cmd("pkill -f 'node server.js'")

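The step reward here is delta-shaped: grader scores are cumulative, and the agent is paid only the improvement over the previous step, floored at 0.01 so every action yields a small positive signal. A standalone sketch of that bookkeeping (the function name and example trajectory are illustrative, not part of the environment):

```python
def step_rewards(scores, floor=0.01):
    """Turn a sequence of cumulative grader scores into per-step rewards.

    Mirrors `reward = max(0.01, score - self._last_score)`: each step pays
    out the improvement since the last grade, never less than the floor.
    Results are rounded to 2 decimals, as the environment does for scores.
    """
    rewards, last = [], floor
    for score in scores:
        rewards.append(round(max(floor, score - last), 2))
        last = score
    return rewards

# A run that fixes one bug, then another, then stalls for a step:
print(step_rewards([0.35, 0.60, 0.60]))  # [0.34, 0.25, 0.01]
```

The floor means a stalled agent still receives 0.01 per step, so the shaping never produces a zero or negative reward.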
+    # ==================================================================
+    # TASK PROMPTS
+    # ==================================================================
+    def _build_task_prompt(self, init_stdout: str) -> str:
+        """Build the task prompt based on the current difficulty level."""
+        base = (
+            "=== SELF-HEALING DEVOPS SANDBOX ===\n"
+            f"You have been dropped into a container with a broken Node.js "
+            f"Express backend in {self._app_dir}.\n\n"
+        )
+
+        if self._current_task == "easy":
+            mission = (
+                "YOUR MISSION [EASY β€” 1 bug]:\n"
+                "  Fix the port configuration so that:\n"
+                "  1. The app starts without errors on port 3000\n"
+                "  2. GET /health returns HTTP 200\n\n"
+                "HINTS:\n"
+                "  - Check config.json for wrong settings\n"
+            )
+        elif self._current_task == "medium":
+            mission = (
+                "YOUR MISSION [MEDIUM β€” 2 bugs]:\n"
+                "  Fix BOTH bugs so that:\n"
+                "  1. The app starts without errors on port 3000\n"
+                "  2. GET /health returns HTTP 200\n"
+                "  3. GET /api/users returns HTTP 200 with valid JSON\n\n"
+                "HINTS:\n"
+                "  - Check config.json for wrong settings\n"
+                "  - Look for syntax errors in routes/users.js\n"
+            )
+        else:
+            mission = (
+                "YOUR MISSION [HARD β€” 3 bugs]:\n"
+                "  Fix ALL bugs so that:\n"
+                "  1. The app starts without errors on port 3000\n"
+                "  2. GET /health returns HTTP 200\n"
+                "  3. GET /api/users returns HTTP 200 with valid JSON\n"
+                "  4. GET /api/data returns HTTP 200 with valid JSON\n\n"
+                "HINTS:\n"
+                "  - Check config files for wrong settings\n"
+                "  - Look for syntax errors that prevent startup\n"
+                "  - Watch out for async/await issues\n"
+            )
+
+        return (
+            base + mission +
+            "\nUse bash commands to explore, edit files, and test.\n"
+            "When you think you've fixed everything, run: npm start\n\n"
+            f"--- INITIAL DIRECTORY LISTING ---\n{init_stdout}\n"
+        )
+
+    def _bugs_for_task(self) -> int:
+        """Return the number of bugs for the current task difficulty."""
+        return {"easy": 1, "medium": 2, "hard": 3}.get(self._current_task, 3)
+
+    # ==================================================================
+    # CD HANDLER
+    # ==================================================================
+    def _handle_cd(self, command: str) -> TerminalObservation:
+        """Handle cd commands manually since subprocess.run is transient."""
+        target = command[3:].strip()
+        if target == "" or target == "~":
+            new_dir = self._app_dir
+        elif target.startswith("/"):
+            new_dir = os.path.normpath(target)
+        else:
+            new_dir = os.path.normpath(os.path.join(self._current_dir, target))
+
+        if os.path.isdir(new_dir):
+            self._current_dir = new_dir
+            stdout, stderr = "", ""
+        else:
+            stdout, stderr = "", f"bash: cd: {target}: No such file or directory"
+
+        score, feedback = self._grade()
+        reward = max(0.01, score - self._last_score)
+        self._last_score = score
+        episode_done = (score >= 0.99) or (self._state.step_count >= MAX_STEPS)
+
+        return TerminalObservation(
+            stdout=stdout,
+            stderr=stderr,
+            current_dir=self._current_dir,
+            task_id=self._current_task,
+            grader_score=score,
+            grader_feedback=feedback,
+            done=episode_done,
+            reward=reward,
+            metadata=self._build_metadata(),
+        )
+
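Because every command runs in a fresh subprocess, `cd` cannot persist on its own, so `_handle_cd` tracks the working directory itself. The path resolution in isolation (this sketch uses `posixpath` for deterministic POSIX-style output on any platform; the environment itself calls `os.path.normpath`):

```python
import posixpath

def resolve_cd(current: str, target: str, home: str = "/app") -> str:
    """Resolve a `cd` target against the tracked working directory."""
    if target in ("", "~"):       # bare `cd` / `cd ~` go to the app root
        return home
    if target.startswith("/"):    # absolute path
        return posixpath.normpath(target)
    # relative path: join then normalize away any `..` / `.` segments
    return posixpath.normpath(posixpath.join(current, target))

print(resolve_cd("/app", "routes"))     # /app/routes
print(resolve_cd("/app/routes", ".."))  # /app
print(resolve_cd("/app/routes", "~"))   # /app
```

The caller still has to check `os.path.isdir` on the result and emit the familiar `bash: cd: ...: No such file or directory` message when it does not exist.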
+
329
+ # ==================================================================
330
+ # METADATA & FILE TRACKING
331
+ # ==================================================================
332
+ def _build_metadata(self) -> Dict[str, Any]:
333
+ """Build rich metadata for the current observation."""
334
+ return {
335
+ "episode_id": self._state.episode_id,
336
+ "step": self._state.step_count,
337
+ "task": self._current_task,
338
+ "max_steps": MAX_STEPS,
339
+ "bugs_total": self._bugs_for_task(),
340
+ "files_modified": list(self._files_modified),
341
+ "commands_count": len(self._commands_history),
342
+ }
343
+
344
+ def _snapshot_file_hashes(self) -> None:
345
+ """Take a hash snapshot of all bug-related files for change detection."""
346
+ self._file_hashes = {}
347
+ for relative_path in BUG_FILES:
348
+ full_path = os.path.join(self._app_dir, relative_path)
349
+ if os.path.isfile(full_path):
350
+ try:
351
+ with open(full_path, "rb") as f:
352
+ self._file_hashes[relative_path] = hashlib.md5(f.read()).hexdigest()
353
+ except OSError:
354
+ pass
355
+
356
+ def _detect_file_changes(self) -> None:
357
+ """Detect which bug files have been modified since reset."""
358
+ for relative_path in BUG_FILES:
359
+ if relative_path in self._files_modified:
360
+ continue
361
+ full_path = os.path.join(self._app_dir, relative_path)
362
+ if os.path.isfile(full_path):
363
+ try:
364
+ with open(full_path, "rb") as f:
365
+ current_hash = hashlib.md5(f.read()).hexdigest()
366
+ if current_hash != self._file_hashes.get(relative_path):
367
+ self._files_modified.append(relative_path)
368
+ except OSError:
369
+ pass
370
+
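The snapshot-and-compare hashing above is easy to exercise on its own; a self-contained sketch with a throwaway file (paths and names are illustrative):

```python
import hashlib
import os
import tempfile
from typing import Optional

def file_digest(path: str) -> Optional[str]:
    """MD5 of a file's bytes, or None if unreadable (mirrors the OSError guard)."""
    try:
        with open(path, "rb") as f:
            return hashlib.md5(f.read()).hexdigest()
    except OSError:
        return None

with tempfile.TemporaryDirectory() as d:
    cfg = os.path.join(d, "config.json")
    with open(cfg, "w") as f:
        f.write('{"port": 9999}')
    baseline = file_digest(cfg)       # snapshot at reset time

    with open(cfg, "w") as f:
        f.write('{"port": 3000}')     # the agent's fix
    changed = file_digest(cfg) != baseline
    print(changed)  # True
```

MD5 is fine here since the hash only detects edits, not adversarial collisions; any fast digest would do.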
     # ==================================================================
     # FILESYSTEM & EXECUTION HELPERS
     # ==================================================================
     def _reset_filesystem(self) -> None:
+        """Replace the working /app with the pristine backup."""
         os.makedirs(self._app_dir, exist_ok=True)
+
+        # Clean contents of /app
         for item in os.listdir(self._app_dir):
             item_path = os.path.join(self._app_dir, item)
             if os.path.isdir(item_path):

                 os.remove(item_path)
             except OSError:
                 pass
+
+        # Copy from backup
         if os.path.exists(self._app_backup_dir):
             for item in os.listdir(self._app_backup_dir):
                 s = os.path.join(self._app_backup_dir, item)

             else:
                 shutil.copy2(s, d)
         else:
+            logger.warning(
+                f"Backup directory {self._app_backup_dir} not found. "
+                "Ensure Dockerfile copied simulated_app here."
+            )

     def _exec_cmd(self, cmd: str, timeout: float = 30.0) -> str:
         """Execute command natively; return combined output."""
         stdout, stderr = self._exec_cmd_split(cmd, timeout)
         return (stdout + "\n" + stderr).strip()

+    def _exec_cmd_split(self, cmd: str, timeout: float = 30.0) -> Tuple[str, str]:
         """Execute command natively; return (stdout, stderr)."""
         kwargs = {
             "cwd": self._current_dir,

             "capture_output": True,
             "timeout": timeout,
         }
         if sys.platform != "win32":
             kwargs["executable"] = "/bin/bash"

     # GRADER
     # ==================================================================
     def _inject_grader_script(self) -> None:
+        """Write the grader bash script that tests the Node.js app endpoints."""
         self.grader_path = os.path.join(self._tmp_dir, "grader.sh")
         lines = [
             '#!/bin/bash',

             f'node server.js > {self._tmp_dir}/node.log 2>&1 &',
             'NODE_PID=$!',
             '',
+            '# Wait for server to start (up to 4 seconds)',
             'for i in 1 2 3 4; do',
             '  sleep 1',
             '  if curl -s http://localhost:3000/health > /dev/null 2>&1; then',

             'echo "GRADER_USERS_BODY:${USERS_BODY}"',
             'echo "GRADER_DATA_BODY:${DATA_BODY}"',
         ]
+
         script_content = '\n'.join(lines) + '\n'
         with open(self.grader_path, "w", newline='\n') as f:
             f.write(script_content)
+
         if sys.platform != "win32":
             subprocess.run(["chmod", "+x", self.grader_path])

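`_grade` reads the script's results back through the `GRADER_*:` marker lines it echoes. The parsing step itself is not shown in this diff, but a plausible sketch of it (the function name is illustrative) looks like:

```python
from typing import Dict

def parse_grader_output(raw: str) -> Dict[str, str]:
    """Collect 'GRADER_KEY:value' marker lines into a dict, ignoring noise."""
    results = {}
    for line in raw.splitlines():
        # Only marker lines; npm/node chatter in the combined output is skipped
        if line.startswith("GRADER_") and ":" in line:
            key, _, value = line.partition(":")
            results[key] = value
    return results

raw = "npm warn ...\nGRADER_HEALTH:200\nGRADER_USERS:500\nGRADER_DATA_BODY:"
print(parse_grader_output(raw)["GRADER_HEALTH"])  # 200
```

Prefixed marker lines make the grader robust to whatever else the Node process prints, since `results.get(...)` simply returns a default for any marker that never appeared.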
+    def _grade(self) -> Tuple[float, str]:
+        """Run the grader and return (score, feedback).
+
+        Scoring breakdown:
+        - File-level: +0.05 per correctly modified bug file
+        - App starts on port 3000: +0.30
+        - /health returns 200: +0.10
+        - /api/users returns valid JSON: +0.15
+        - /api/data returns valid JSON: +0.20
+        - All endpoints pass: +0.05 bonus
+
+        Total raw score is then scaled by task difficulty and clamped to (0, 1).
+        """
         score = 0.0
         feedback_parts = []

+        # --- Phase 1: File-change rewards (micro-rewards for finding bugs) ---
+        files_to_check = {
+            "easy": ["config.json"],
+            "medium": ["config.json", "routes/users.js"],
+            "hard": ["config.json", "routes/users.js", "routes/data.js"],
+        }.get(self._current_task, list(BUG_FILES.keys()))
+
+        for f in files_to_check:
+            if f in self._files_modified:
+                score += 0.05
+                feedback_parts.append(f"βœ“ Modified {f} (+0.05)")
+
+        # --- Phase 2: HTTP endpoint testing ---
         try:
             if sys.platform == "win32":
                 raw = self._exec_cmd(f"bash {self.grader_path}", timeout=20.0)
             else:
                 raw = self._exec_cmd(f"/bin/bash {self.grader_path}", timeout=20.0)

             data_body = results.get("GRADER_DATA_BODY", "")

             has_syntax_error = "SyntaxError" in startup_log
+            has_crash = (
+                has_syntax_error
+                or "Cannot find module" in startup_log
+                or "ReferenceError" in startup_log
+            )
             app_listening = f"Server running on port {EXPECTED_PORT}" in startup_log

             if has_crash and not app_listening:
+                feedback_parts.append("βœ— App crashes on startup")
                 if has_syntax_error:
                     feedback_parts.append("(SyntaxError detected)")
+                # Fall through to clamping β€” NO early return
+            elif not app_listening:
                 feedback_parts.append("βœ— App not listening on port 3000")
+                # Fall through to clamping β€” NO early return
             else:
+                # App is running β€” grade each endpoint
+                score += 0.30
+                feedback_parts.append("βœ“ App starts on port 3000 (+0.30)")

+                if health_code == "200":
+                    score += 0.10
+                    feedback_parts.append("βœ“ /health returns 200 (+0.10)")
                 else:
+                    feedback_parts.append(f"βœ— /health returned {health_code}")
+
+                if users_code == "200":
+                    if '"users"' in users_body:
+                        score += 0.15
+                        feedback_parts.append("βœ“ /api/users returns valid JSON (+0.15)")
+                    else:
+                        score += 0.05
+                        feedback_parts.append("~ /api/users 200 but malformed body (+0.05)")
                 else:
+                    feedback_parts.append(f"βœ— /api/users returned {users_code}")
+
+                if data_code == "200":
+                    if '"records"' in data_body:
+                        score += 0.20
+                        feedback_parts.append("βœ“ /api/data returns valid JSON (+0.20)")
+                    else:
+                        score += 0.05
+                        feedback_parts.append("~ /api/data 200 but malformed body (+0.05)")
+                else:
+                    feedback_parts.append(f"βœ— /api/data returned {data_code}")

+                if score >= 0.80:
+                    score += 0.05
+                    feedback_parts.append("βœ“ All endpoints healthy β€” bonus (+0.05)")

         except Exception as exc:
             logger.exception("Grader error")
             feedback_parts.append(f"Grader error (score preserved): {exc}")

+        # --- Phase 3: Scale by difficulty and clamp ---
         if self._current_task == "easy":
+            raw_target = 0.50
         elif self._current_task == "medium":
+            raw_target = 0.65
         else:
             raw_target = 1.0
+
         final_score = min(1.0, score / raw_target)
+        # Clamp strictly within (0, 1) β€” EVERY code path reaches here
         final_score = round(min(max(final_score, 0.01), 0.99), 2)
+
         return (final_score, " | ".join(feedback_parts))
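The Phase 3 normalization is a pure function of the raw score and the task name, so it can be tested in isolation. A sketch mirroring the targets above (0.50 easy, 0.65 medium, 1.0 hard):

```python
def scale_and_clamp(raw_score: float, task: str) -> float:
    """Scale raw points by the task's achievable maximum, clamp into (0, 1)."""
    raw_target = {"easy": 0.50, "medium": 0.65}.get(task, 1.0)
    final = min(1.0, raw_score / raw_target)
    # Strictly inside (0, 1): the score is never exactly 0.0 or 1.0
    return round(min(max(final, 0.01), 0.99), 2)

print(scale_and_clamp(0.50, "easy"))  # 0.99 β€” full marks, capped below 1.0
print(scale_and_clamp(0.0, "hard"))   # 0.01 β€” floor, never exactly 0
print(scale_and_clamp(0.45, "hard"))  # 0.45
```

Scaling by a per-difficulty target means an easy episode that earns all of its reachable points still maps to the full 0.99, so the reward range is comparable across difficulty levels.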