DEVessi committed on
Commit
ec8b2ca
·
verified ·
1 Parent(s): 1f6c7ae

Upload folder using huggingface_hub

Files changed (4)
  1. README.md +255 -224
  2. models.py +26 -10
  3. scenario_config.json +131 -0
  4. server/devops_sandbox_environment.py +313 -156
README.md CHANGED
@@ -1,224 +1,255 @@
- ---
- title: Self-Healing DevOps Sandbox
- emoji: 🔧
- colorFrom: red
- colorTo: green
- sdk: docker
- pinned: false
- app_port: 8000
- base_path: /web
- tags:
- - openenv
- ---
-
- # Self-Healing DevOps Sandbox
-
- An OpenEnv RL environment where an AI agent is dropped into a **broken Node.js backend** inside a Docker container. The agent must use **bash commands only** to diagnose bugs, edit files, and fix the app -- just like a real DevOps engineer would.
-
- Built for the **Meta PyTorch OpenEnv Hackathon**.
-
- ---
-
- ## What Is This?
-
- A 3-task challenge of increasing difficulty. The agent starts in a Docker container with a broken Express.js app in `/app` and must make all endpoints healthy.
-
- | # | Difficulty | Bug | What's Wrong |
- |---|-----------|-----------------|---------------------------------------|
- | 1 | Easy | `config.json` | Port set to `9999` instead of `3000` |
- | 2 | Medium | `routes/users.js`| Missing `)` causes SyntaxError crash |
- | 3 | Hard | `routes/data.js` | Missing `await` causes HTTP 500 |
-
- **Goal:** Fix all bugs so these endpoints return HTTP 200:
- - `GET /health` returns `{"status": "ok"}`
- - `GET /api/users` returns `{"users": [...]}`
- - `GET /api/data` returns `{"records": [...]}`
-
- ---
-
- ## Scoring (Partial Rewards)
-
- The grader runs **after every command** and awards cumulative points:
-
- | Milestone | Points | Total |
- |----------------------------------|--------|----------|
- | App starts on port 3000 | +0.35 | 0.35 |
- | `/health` returns 200 | +0.10 | 0.45 |
- | `/api/users` returns valid JSON | +0.15 | 0.60 |
- | `/api/data` returns valid JSON | +0.25 | 0.85 |
- | All endpoints correct | +0.15 | **1.00** |
-
- ---
-
- ## Getting Started
-
- ### Prerequisites
-
- - **Python 3.10+**
- - **Docker Desktop** (running)
- - **uv** package manager (`pip install uv`)
-
- ### 1. Install Dependencies
-
- ```bash
- cd devops_sandbox
- uv sync
- ```
-
- ### 2. Build the Sandbox Docker Image
-
- ```bash
- docker build -t devops-sandbox-node:latest -f simulated_app/Dockerfile simulated_app/
- ```
-
- ### 3. Start the Environment Server
-
- ```bash
- uv run server
- ```
-
- The server starts at `http://localhost:8000`.
-
- ### 4. Run the Baseline Agent
-
- In a **separate terminal**:
-
- ```bash
- # Set your OpenAI API key
- export OPENAI_API_KEY="sk-..."    # Linux/Mac
- $env:OPENAI_API_KEY = "sk-..."    # PowerShell
-
- # Run the baseline
- uv run python baseline.py
- ```
-
- ---
-
- ## Test Your Own Agent
-
- ### Option A: Use the Python Client
-
- ```python
- from devops_sandbox import BashAction, DevopsSandboxEnv
-
- with DevopsSandboxEnv(base_url="http://localhost:8000").sync() as env:
-     # Reset creates a fresh Docker container
-     result = env.reset()
-     print(result.observation.stdout)        # Task description
-     print(result.observation.grader_score)  # 0.0
-
-     # Send bash commands
-     result = env.step(BashAction(command="cat /app/config.json"))
-     print(result.observation.stdout)        # File contents
-     print(result.observation.grader_score)  # Score after grading
-
-     # Fix a bug
-     result = env.step(BashAction(command="sed -i 's/9999/3000/' /app/config.json"))
-     print(result.observation.grader_score)  # Partial score
-
-     # Check if done
-     if result.done:
-         print("Episode complete!")
- ```
-
- ### Option B: Use the REST API Directly
-
- ```bash
- # Reset the environment
- curl -X POST http://localhost:8000/reset
-
- # Send a command
- curl -X POST http://localhost:8000/step \
-   -H "Content-Type: application/json" \
-   -d '{"action": {"command": "ls -la /app"}}'
- ```
-
- ### Option C: Use the WebSocket Endpoint
-
- Connect to `ws://localhost:8000/ws` for persistent sessions.
-
- ---
-
- ## Project Structure
-
- ```
- devops_sandbox/
- |-- openenv.yaml       # OpenEnv manifest
- |-- pyproject.toml     # Python dependencies
- |-- README.md          # This file
- |-- baseline.py        # LLM-powered baseline agent
- |-- models.py          # BashAction & TerminalObservation schemas
- |-- client.py          # Python client for the environment
- |
- |-- server/
- |   |-- app.py         # FastAPI server (entry point)
- |   +-- devops_sandbox_environment.py  # Environment logic + grader
- |
- +-- simulated_app/     # The broken Node.js app (Docker context)
-     |-- Dockerfile     # node:20-slim sandbox container
-     |-- package.json   # Express.js project
-     |-- server.js      # Main entry point
-     |-- config.json    # Bug 1: wrong port
-     +-- routes/
-         |-- users.js   # Bug 2: syntax error
-         +-- data.js    # Bug 3: missing await
- ```
-
- ---
-
- ## How It Works
-
- ```
- +-----------+   BashAction    +------------+   docker exec   +--------------+
- |  Agent    | --------------> |  OpenEnv   | --------------> |   Docker     |
- | (LLM/RL)  |                 |   Server   |                 |  Container   |
- |           | <-------------- |   (8000)   | <-------------- | (broken app) |
- +-----------+   Observation   +-----+------+  stdout/stderr  +--------------+
-                + grader_score       |
-                               +-----+------+
-                               |   Grader   |
-                               | (curl test |
-                               |  endpoints)|
-                               +------------+
- ```
-
- 1. **Agent** sends a `BashAction` (e.g., `cat /app/config.json`)
- 2. **Server** runs it inside the Docker container via `docker exec`
- 3. **Grader** restarts the Node app and curls all endpoints
- 4. **Observation** returns: stdout, stderr, score (0.0-1.0), feedback
-
- ---
-
- ## Configuration
-
- | Env Variable | Default | Description |
- |--------------------|--------------------------|------------------------------------|
- | `OPENAI_API_KEY` | *(required)* | OpenAI API key for baseline |
- | `OPENAI_MODEL` | `gpt-4o-mini` | LLM model to use |
- | `OPENAI_BASE_URL` | *(OpenAI default)* | Custom endpoint (Ollama, vLLM) |
- | `MAX_TURNS` | `30` | Max steps per episode |
- | `DEVOPS_SANDBOX_URL`| `http://localhost:8000` | Environment server URL |
-
- ### Use with Local LLMs (Ollama, vLLM)
-
- ```bash
- export OPENAI_BASE_URL="http://localhost:11434/v1"
- export OPENAI_MODEL="llama3"
- export OPENAI_API_KEY="dummy"
- uv run python baseline.py
- ```
-
- ---
-
- ## Validation
-
- ```bash
- uv run openenv validate
- # Expected: [OK] devops_sandbox: Ready for multi-mode deployment
- ```
-
- ---
-
- ## License
-
- BSD-style license. See LICENSE for details.
+ ---
+ title: Self-Healing DevOps Sandbox
+ emoji: 🔧
+ colorFrom: red
+ colorTo: green
+ sdk: docker
+ pinned: false
+ app_port: 8000
+ base_path: /web
+ tags:
+ - openenv
+ ---
+
+ # 🔧 Self-Healing DevOps Sandbox
+
+ An **OpenEnv RL environment** where an AI agent is dropped into a broken Node.js Express backend and must use **bash commands only** to diagnose and fix production-like bugs — just like a real DevOps engineer responding to a 3 AM incident.
+
+ Built for the **Meta PyTorch OpenEnv Hackathon**.
+
+ ---
+
+ ## 🎯 Why This Environment?
+
+ DevOps debugging is one of the most **high-value, real-world tasks** for AI agents. Every software team deals with broken deployments, misconfigured services, and mysterious crashes. This environment tests whether an AI agent can:
+
+ - **Read and understand** error logs, config files, and source code
+ - **Diagnose root causes** from symptoms (crash logs → specific file + line)
+ - **Apply targeted fixes** using command-line tools (sed, echo, etc.)
+ - **Verify its own work** by restarting services and checking endpoints
+
+ ---
+
+ ## 🏗️ Task Design
+
+ ### Three Difficulty Levels
+
+ | # | Task | Bugs | What's Broken | Grading Target |
+ |---|------|------|---------------|----------------|
+ | 1 | `easy` | 1 | `config.json` → port `9999` instead of `3000` | Fix port, app starts |
+ | 2 | `medium` | 2 | + `routes/users.js` → missing `)` causes SyntaxError | + `/api/users` works |
+ | 3 | `hard` | 3 | + `routes/data.js` → missing `await` breaks async response | All endpoints pass |
+
+ Each task builds on the previous — meaningful difficulty progression where easy tasks are subsets of harder ones.
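The subset structure described above can be expressed directly. A minimal sketch (the `BUG_FILES` and `TASK_BUGS` names are illustrative, not taken from the repo):

```python
# The three bug files, in the order tasks introduce them
BUG_FILES = ["config.json", "routes/users.js", "routes/data.js"]

# Each difficulty level keeps a prefix of the bug list, so easier
# tasks are strict subsets of harder ones
TASK_BUGS = {
    "easy": BUG_FILES[:1],
    "medium": BUG_FILES[:2],
    "hard": BUG_FILES[:3],
}

assert set(TASK_BUGS["easy"]) < set(TASK_BUGS["medium"]) < set(TASK_BUGS["hard"])
```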
+
+ ### The Broken App (`/app`)
+
+ ```
+ /app/
+ ├── config.json      ← Bug 1: port set to 9999 (should be 3000)
+ ├── package.json     ← Express.js project config
+ ├── server.js        ← Main entry point (loads config + routes)
+ └── routes/
+     ├── users.js     ← Bug 2: missing closing parenthesis on router.get()
+     └── data.js      ← Bug 3: missing `await` before async DB call
+ ```
+
+ ---
+
+ ## 📊 Reward Shaping
+
+ The grader runs **after every command** and awards granular partial credit:
+
+ ### Phase 1: File-Level Verification
+ | Event | Points |
+ |-------|--------|
+ | Modified `config.json` | +0.05 |
+ | Modified `routes/users.js` | +0.05 |
+ | Modified `routes/data.js` | +0.05 |
+
+ ### Phase 2: HTTP Endpoint Testing
+ | Milestone | Points |
+ |-----------|--------|
+ | App starts on port 3000 | +0.30 |
+ | `GET /health` returns 200 | +0.10 |
+ | `GET /api/users` returns valid JSON | +0.15 |
+ | `GET /api/data` returns valid JSON | +0.20 |
+ | All endpoints passing (bonus) | +0.05 |
+
+ ### Phase 3: Difficulty Scaling
+ Raw scores are scaled by task difficulty so each task can reach near-maximum independently.
+
+ > **All scores are strictly within (0, 1)** per the OpenEnv specification — never exactly 0.0 or 1.0.
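One way to combine difficulty scaling with the strict (0, 1) bound is to scale the raw point total and then clamp it just inside the interval. A sketch under assumed values: the 0.01/0.99 bounds and the per-task scale factors are illustrative, the real constants live in the grader:

```python
# Illustrative per-task scale factors: each task's raw maximum maps near 1.0
TASK_SCALE = {"easy": 2.0, "medium": 1.4, "hard": 1.0}

def final_score(raw: float, task: str, lo: float = 0.01, hi: float = 0.99) -> float:
    # Scale by difficulty, then clamp strictly inside (0, 1)
    scaled = raw * TASK_SCALE[task]
    return max(lo, min(hi, scaled))
```

With this shape, a perfect easy run saturates at `hi` rather than 1.0, and a zero raw score is reported as `lo` rather than 0.0, matching the note above.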
+
+ ---
+
+ ## 🚀 Getting Started
+
+ ### Docker (Recommended)
+
+ ```bash
+ docker build -t devops-sandbox:latest .
+ docker run --rm -p 8000:8000 devops-sandbox:latest
+ curl http://localhost:8000/health
+ ```
+
+ Health response: `{"status":"healthy","service":"devops_sandbox"}`
+
+ ### Without Docker
+
+ ```bash
+ uv sync
+ uvicorn server.app:app --host 0.0.0.0 --port 8000
+ ```
+
+ ### Quick Start (Demo)
+
+ Update the API key in `scenario_config.json` and run:
+
+ ```bash
+ python inference.py
+ ```
+
+ ---
+
+ ## 🧪 Test Your Own Agent
+
+ ### Option A: Python Client
+
+ ```python
+ from client import DevopsSandboxEnv
+ from models import BashAction
+
+ with DevopsSandboxEnv(base_url="http://localhost:8000").sync() as env:
+     # Reset with task difficulty
+     result = env.reset(task_name="easy")
+     print(result.observation.stdout)          # Task description
+     print(result.observation.grader_score)    # 0.01
+
+     # Send bash commands
+     result = env.step(BashAction(command="cat /app/config.json"))
+     print(result.observation.stdout)          # File contents
+     print(result.observation.metadata)        # Rich metadata
+
+     # Fix a bug
+     result = env.step(BashAction(command="sed -i 's/9999/3000/' /app/config.json"))
+     print(result.observation.grader_score)    # Score increases
+     print(result.observation.grader_feedback) # "✓ Modified config.json (+0.05)"
+ ```
+
+ ### Option B: REST API
+
+ ```bash
+ # Reset the environment
+ curl -X POST http://localhost:8000/reset -d '{"task_name": "hard"}'
+
+ # Send a command
+ curl -X POST http://localhost:8000/step \
+   -H "Content-Type: application/json" \
+   -d '{"action": {"command": "ls -la /app"}}'
+ ```
+
+ ### Option C: WebSocket
+
+ Connect to `ws://localhost:8000/ws` for persistent sessions.
+
+ ---
+
+ ## 📁 Project Structure
+
+ ```
+ devops_sandbox/
+ ├── openenv.yaml          # OpenEnv manifest (spec_version: 1)
+ ├── pyproject.toml        # Python dependencies
+ ├── Dockerfile            # HF Spaces deployment
+ ├── scenario_config.json  # Task definitions + verifiers
+ ├── models.py             # BashAction & TerminalObservation (Pydantic)
+ ├── client.py             # Python client for the environment
+ ├── inference.py          # LLM baseline agent (3-task evaluation)
+ │
+ ├── server/
+ │   ├── app.py            # FastAPI server (OpenEnv entry point)
+ │   └── devops_sandbox_environment.py  # Core environment + grader
+ │
+ └── simulated_app/        # The broken Node.js app
+     ├── package.json
+     ├── server.js
+     ├── config.json       # Bug 1: wrong port
+     └── routes/
+         ├── users.js      # Bug 2: syntax error
+         └── data.js       # Bug 3: missing await
+ ```
+
+ ---
+
+ ## ⚙️ Architecture
+
+ ```
+ ┌──────────┐   BashAction    ┌────────────┐   subprocess   ┌──────────────┐
+ │  Agent   │ ──────────────> │  OpenEnv   │ ────────────>  │    /app/     │
+ │ (LLM/RL) │                 │   Server   │                │ (broken app) │
+ │          │ <────────────── │  (:8000)   │ <────────────  │              │
+ └──────────┘   Observation   └─────┬──────┘ stdout/stderr  └──────────────┘
+               + grader_score       │
+               + metadata     ┌─────┴──────┐
+                              │   Grader   │
+                              │ ┌────────┐ │
+                              │ │File Δ  │ │ ← Detects which files were modified
+                              │ │Checker │ │
+                              │ ├────────┤ │
+                              │ │HTTP    │ │ ← Starts app, curls all endpoints
+                              │ │Tester  │ │
+                              │ └────────┘ │
+                              └────────────┘
+ ```
+
+ 1. **Agent** sends a `BashAction` (e.g., `cat /app/config.json`)
+ 2. **Server** executes it via `subprocess.run()` in the `/app` directory
+ 3. **Grader** runs two-phase verification:
+    - **File tracking**: MD5 hash comparison to detect which bug files changed
+    - **HTTP testing**: Starts the Node app, curls `/health`, `/api/users`, `/api/data`
+ 4. **Observation** returns: stdout, stderr, score (0.01–0.99), feedback, and metadata
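The file-tracking phase can be sketched as a digest snapshot taken at reset and re-compared after each command. The function names here are illustrative, not the environment's actual API:

```python
import hashlib
from pathlib import Path

def snapshot(root: str, files: list[str]) -> dict[str, str]:
    """MD5 digest of each tracked bug file under `root`."""
    digests = {}
    for name in files:
        path = Path(root) / name
        if path.exists():
            digests[name] = hashlib.md5(path.read_bytes()).hexdigest()
    return digests

def modified_since(before: dict[str, str], root: str, files: list[str]) -> list[str]:
    """Names of tracked files whose content changed since the snapshot."""
    after = snapshot(root, files)
    return [n for n in files if before.get(n) != after.get(n)]
```

Comparing digests rather than mtimes means a no-op edit (e.g., `touch`) does not count as a modification, which keeps the +0.05 file-level rewards honest.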
+
+ ---
+
+ ## 📋 Observation Metadata
+
+ Each observation includes rich metadata for training analysis:
+
+ ```json
+ {
+   "episode_id": "abc-123",
+   "step": 3,
+   "task": "hard",
+   "max_steps": 50,
+   "bugs_total": 3,
+   "files_modified": ["config.json", "routes/users.js"],
+   "commands_count": 3
+ }
+ ```
+
+ ---
+
+ ## 🔧 Configuration
+
+ | Env Variable | Default | Description |
+ |-------------|---------|-------------|
+ | `HF_TOKEN` | *(required)* | Hugging Face token for LLM API |
+ | `MODEL_NAME` | `gpt-4o-mini` | LLM model to use |
+ | `API_BASE_URL` | `https://router.huggingface.co/v1` | LLM endpoint |
+ | `MAX_TURNS` | `8` | Max steps per task in inference |
+
+ ---
+
+ ## ✅ Validation
+
+ ```bash
+ uv run openenv validate
+ # Expected: [OK] devops_sandbox: Ready for deployment
+ ```
+
+ ---
+
+ ## 📄 License
+
+ BSD-style license. See LICENSE for details.
models.py CHANGED
@@ -8,10 +8,11 @@
 Data models for the Self-Healing DevOps Sandbox Environment.

 Defines the Action and Observation types used by the RL agent to interact
- with a broken Node.js backend running inside a Docker container.
 """

- from typing import Any, Dict

 from pydantic import Field

@@ -19,10 +20,10 @@ from openenv.core.env_server.types import Action, Observation


 class BashAction(Action):
-     """Action: a bash command to execute inside the Docker sandbox.

-     The agent sends shell commands (ls, cat, sed, node, etc.) to diagnose
-     and repair the broken Node.js application.
     """

     command: str = Field(
@@ -39,7 +40,7 @@ class TerminalObservation(Observation):
     """Observation returned after executing a bash command.

     Includes stdout/stderr from the command, working directory context,
-     the current task identifier, and the grader's partial score.
     """

     stdout: str = Field(
@@ -56,15 +57,30 @@ class TerminalObservation(Observation):
     )
     task_id: str = Field(
         default="devops_sandbox",
-         description="Identifier for the current task scenario.",
     )
     grader_score: float = Field(
-         default=0.0,
         ge=0.0,
         le=1.0,
-         description="The grader's partial reward (0.0 to 1.0).",
     )
     grader_feedback: str = Field(
         default="",
         description="Human-readable feedback from the grader.",
-     )

 Data models for the Self-Healing DevOps Sandbox Environment.

 Defines the Action and Observation types used by the RL agent to interact
+ with a broken Node.js backend. The agent acts as a DevOps engineer, diagnosing
+ and fixing production-like bugs using bash commands.
 """

+ from typing import Any, Dict, List, Optional

 from pydantic import Field


 class BashAction(Action):
+     """Action: a bash command to execute inside the sandbox.

+     The agent sends shell commands (ls, cat, sed, grep, node, npm, etc.)
+     to diagnose and repair the broken Node.js application.
     """

     command: str = Field(

     """Observation returned after executing a bash command.

     Includes stdout/stderr from the command, working directory context,
+     the current task identifier, grader's partial score, and episode metadata.
     """

     stdout: str = Field(

     )
     task_id: str = Field(
         default="devops_sandbox",
+         description="Identifier for the current task scenario (easy/medium/hard).",
     )
     grader_score: float = Field(
+         default=0.01,
         ge=0.0,
         le=1.0,
+         description="The grader's partial reward strictly within (0, 1).",
     )
     grader_feedback: str = Field(
         default="",
         description="Human-readable feedback from the grader.",
+     )
+     done: bool = Field(
+         default=False,
+         description="Whether the episode is complete (all bugs fixed or max steps reached).",
+     )
+     reward: Optional[float] = Field(
+         default=None,
+         description="Incremental reward for this step (score delta).",
+     )
+     metadata: Dict[str, Any] = Field(
+         default_factory=dict,
+         description="Additional metadata: files_modified, commands_count, bugs_found, etc.",
+     )
+
+
+ __all__ = ["BashAction", "TerminalObservation"]
scenario_config.json ADDED
@@ -0,0 +1,131 @@
+ {
+   "env_name": "devops_sandbox",
+   "env_url": "http://localhost:8000",
+   "llm_provider": "openai",
+   "llm_model": "gpt-4o-mini",
+   "llm_api_key": "",
+   "temperature": 0.2,
+   "max_tokens": 256,
+   "max_steps_per_task": 8,
+   "system_prompt": "You are an expert DevOps engineer and Node.js developer. You have been dropped into a Linux container with a broken Express.js backend in /app. Your goal is to diagnose and fix bugs so the app runs correctly. Respond ONLY with a JSON object: {\"command\": \"<bash command>\"}. Use standard bash/Linux commands. Do NOT use interactive editors (vi, nano). Be methodical: read files first, understand the bug, then fix it.",
+   "tasks": [
+     {
+       "task_name": "easy",
+       "description": "Fix the port configuration bug so the app starts on port 3000 and /health returns HTTP 200.",
+       "hints": ["Check config.json for wrong settings"],
+       "expected_bugs": ["config.json has port set to 9999 instead of 3000"],
+       "verifiers": [
+         {
+           "name": "Config Port Fixed",
+           "description": "Ensures config.json has port set to 3000",
+           "verification_type": "file_content",
+           "file_path": "/app/config.json",
+           "expected_content": "3000"
+         },
+         {
+           "name": "Health Endpoint Responds",
+           "description": "GET /health returns HTTP 200",
+           "verification_type": "http_check",
+           "endpoint": "/health",
+           "expected_status": 200
+         }
+       ]
+     },
+     {
+       "task_name": "medium",
+       "description": "Fix the port config AND the syntax error in routes/users.js so /api/users returns valid JSON.",
+       "hints": [
+         "Check config.json for wrong settings",
+         "Look for syntax errors in routes/users.js — missing closing parenthesis"
+       ],
+       "expected_bugs": [
+         "config.json has port set to 9999 instead of 3000",
+         "routes/users.js is missing a closing parenthesis on the router.get() call"
+       ],
+       "verifiers": [
+         {
+           "name": "Config Port Fixed",
+           "description": "Ensures config.json has port set to 3000",
+           "verification_type": "file_content",
+           "file_path": "/app/config.json",
+           "expected_content": "3000"
+         },
+         {
+           "name": "Syntax Error Fixed",
+           "description": "routes/users.js has closing parenthesis on router.get",
+           "verification_type": "file_content",
+           "file_path": "/app/routes/users.js",
+           "expected_content": "});"
+         },
+         {
+           "name": "Users Endpoint Responds",
+           "description": "GET /api/users returns HTTP 200 with users array",
+           "verification_type": "http_check",
+           "endpoint": "/api/users",
+           "expected_status": 200,
+           "expected_body_contains": "\"users\""
+         }
+       ]
+     },
+     {
+       "task_name": "hard",
+       "description": "Fix ALL three bugs: port config, syntax error, and missing await in the async data handler.",
+       "hints": [
+         "Check config.json for wrong settings",
+         "Look for syntax errors that prevent startup",
+         "Watch out for async/await issues in routes/data.js"
+       ],
+       "expected_bugs": [
+         "config.json has port set to 9999 instead of 3000",
+         "routes/users.js is missing a closing parenthesis on the router.get() call",
+         "routes/data.js is missing 'await' before fetchDataFromDB() causing a pending Promise"
+       ],
+       "verifiers": [
+         {
+           "name": "Config Port Fixed",
+           "description": "Ensures config.json has port set to 3000",
+           "verification_type": "file_content",
+           "file_path": "/app/config.json",
+           "expected_content": "3000"
+         },
+         {
+           "name": "Syntax Error Fixed",
+           "description": "routes/users.js has closing parenthesis",
+           "verification_type": "file_content",
+           "file_path": "/app/routes/users.js",
+           "expected_content": "});"
+         },
+         {
+           "name": "Await Added",
+           "description": "routes/data.js uses await before fetchDataFromDB()",
+           "verification_type": "file_content",
+           "file_path": "/app/routes/data.js",
+           "expected_content": "await fetchDataFromDB"
+         },
+         {
+           "name": "Health Endpoint",
+           "description": "GET /health returns HTTP 200",
+           "verification_type": "http_check",
+           "endpoint": "/health",
+           "expected_status": 200
+         },
+         {
+           "name": "Users Endpoint",
+           "description": "GET /api/users returns valid JSON with users",
+           "verification_type": "http_check",
+           "endpoint": "/api/users",
+           "expected_status": 200,
+           "expected_body_contains": "\"users\""
+         },
+         {
+           "name": "Data Endpoint",
+           "description": "GET /api/data returns valid JSON with records",
+           "verification_type": "http_check",
+           "endpoint": "/api/data",
+           "expected_status": 200,
+           "expected_body_contains": "\"records\""
+         }
+       ]
+     }
+   ]
+ }
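The `file_content` verifiers in this config are simple substring checks against a file. A minimal runner sketch, assuming the absolute `/app/...` paths are re-rooted under a local directory so it can run outside the container (`check_file_verifiers` is an illustrative name, not part of the repo):

```python
import json
from pathlib import Path

def check_file_verifiers(config: dict, app_root: str) -> dict[str, bool]:
    """Run every file_content verifier in a scenario config against app_root."""
    results = {}
    for task in config["tasks"]:
        for v in task["verifiers"]:
            if v["verification_type"] != "file_content":
                continue  # http_check verifiers need a running server
            rel = Path(v["file_path"]).relative_to("/app")
            target = Path(app_root) / rel
            passed = target.exists() and v["expected_content"] in target.read_text()
            results[f"{task['task_name']}:{v['name']}"] = passed
    return results
```

Loading the real file with `json.load(open("scenario_config.json"))` and passing a sandbox copy of `/app` would report, per task, which file-level checks currently pass.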
server/devops_sandbox_environment.py CHANGED
@@ -7,17 +7,32 @@
 """
 Self-Healing DevOps Sandbox — Environment Implementation.

 Runs entirely natively on the host filesystem (Hugging Face Spaces compatible).
- The RL agent executes bash commands to diagnose and fix 3 bugs via direct subprocesses.
 """

 import logging
 import os
 import shutil
 import subprocess
 import sys
 from pathlib import Path
- from typing import Any, Optional
 from uuid import uuid4

 from openenv.core.env_server.interfaces import Environment
@@ -37,11 +52,27 @@ EXPECTED_PORT = 3000  # The port the fixed app should listen on
 MAX_STEPS = 50  # Episode budget
 SIMULATED_APP_DIR = Path(__file__).resolve().parent.parent / "simulated_app"

 class DevOpsSandbox(Environment):
     """
     RL environment: fix a broken Node.js backend.
-     No longer uses Docker (Docker-in-Docker is unsupported in HF Spaces).
-     Instead, uses native subprocess.run() in a reset /app/ directory.
     """

     SUPPORTS_CONCURRENT_SESSIONS: bool = False
@@ -52,12 +83,12 @@ class DevOpsSandbox(Environment):
         self._current_dir: str = "/app"
         self._last_score: float = 0.01
         self._current_task: str = "hard"
-
-         # When running on Windows locally, `/app` and `/app_backup` don't exist naturally,
-         # so we will use absolute paths mapped to our repo if they aren't at root.
-         # But for HF Space (Linux), /app will be at root.
         if sys.platform == "win32":
-             # For Windows local dev, use safe paths inside the workspace
             workspace = Path(__file__).resolve().parent.parent
             self._app_dir = str(workspace / ".app_sandbox")
             self._app_backup_dir = str(SIMULATED_APP_DIR)
@@ -65,101 +96,89 @@ class DevOpsSandbox(Environment):
             os.makedirs(self._tmp_dir, exist_ok=True)
             self._current_dir = self._app_dir
         else:
-             # For Hugging Face Spaces (Linux)
             self._app_dir = "/app"
             self._app_backup_dir = "/app_backup"
             self._tmp_dir = "/tmp"
             self._current_dir = "/app"

     def reset(
         self,
         seed: Optional[int] = None,
         episode_id: Optional[str] = None,
         **kwargs: Any,
     ) -> TerminalObservation:
-         """Reset the environment state by copying the backup to the working dir."""
         eid = episode_id or str(uuid4())
         self._state = State(episode_id=eid, step_count=0)
         self._last_score = 0.01
         self._current_dir = self._app_dir
         self._current_task = kwargs.get("task_name", "hard")

         self._reset_filesystem()
         self._inject_grader_script()

         # Gather initial observation
-         init_stdout = self._exec_cmd(f"ls -la {self._app_dir} && echo '---' && cat {os.path.join(self._app_dir, 'config.json')}")
-
-         if self._current_task == "easy":
-             task_prompt = (
-                 "=== SELF-HEALING DEVOPS SANDBOX ===\n"
-                 f"You have been dropped into a container with a broken Node.js Express backend in {self._app_dir}.\n\n"
-                 "YOUR MISSION [EASY]: Diagnose and fix the port bug so that:\n"
-                 "  1. The app starts without errors on port 3000\n"
-                 "  2. GET /health returns HTTP 200\n\n"
-                 "HINTS:\n"
-                 "  - Check config.json for wrong settings\n\n"
-                 "Use bash commands to explore, edit files, and test.\n"
-                 "When you think you've fixed everything, run: npm start\n\n"
-                 "--- INITIAL DIRECTORY LISTING ---\n"
-                 f"{init_stdout}\n"
-             )
-         elif self._current_task == "medium":
-             task_prompt = (
-                 "=== SELF-HEALING DEVOPS SANDBOX ===\n"
-                 f"You have been dropped into a container with a broken Node.js Express backend in {self._app_dir}.\n\n"
-                 "YOUR MISSION [MEDIUM]: Diagnose and fix TWO bugs so that:\n"
-                 "  1. The app starts without errors on port 3000\n"
-                 "  2. GET /health returns HTTP 200\n"
-                 "  3. GET /api/users returns HTTP 200 with valid JSON\n\n"
-                 "HINTS:\n"
-                 "  - Check config.json for wrong settings\n"
-                 "  - Look for syntax errors in routes/users.js\n\n"
-                 "Use bash commands to explore, edit files, and test.\n"
-                 "When you think you've fixed everything, run: npm start\n\n"
-                 "--- INITIAL DIRECTORY LISTING ---\n"
-                 f"{init_stdout}\n"
-             )
-         else:
-             task_prompt = (
-                 "=== SELF-HEALING DEVOPS SANDBOX ===\n"
-                 f"You have been dropped into a container with a broken Node.js Express backend in {self._app_dir}.\n\n"
-                 "YOUR MISSION: Diagnose and fix ALL bugs so that:\n"
-                 "  1. The app starts without errors on port 3000\n"
-                 "  2. GET /health returns HTTP 200\n"
-                 "  3. GET /api/users returns HTTP 200 with valid JSON\n"
-                 "  4. GET /api/data returns HTTP 200 with valid JSON\n\n"
-                 "HINTS:\n"
-                 "  - Check config files for wrong settings\n"
-                 "  - Look for syntax errors that prevent startup\n"
-                 "  - Watch out for async/await issues\n\n"
-                 "Use bash commands to explore, edit files, and test.\n"
-                 "When you think you've fixed everything, run: npm start\n\n"
-                 "--- INITIAL DIRECTORY LISTING ---\n"
-                 f"{init_stdout}\n"
             )

         return TerminalObservation(
             stdout=task_prompt,
             stderr="",
             current_dir=self._current_dir,
             task_id=self._current_task,
             grader_score=0.01,
-             grader_feedback="Episode started. Fix the bugs!",
             done=False,
             reward=0.01,
         )

     def step(
         self,
-         action: BashAction,  # type: ignore[override]
         timeout_s: Optional[float] = None,
         **kwargs: Any,
     ) -> TerminalObservation:
-         """Execute the agent's command natively, run grader, return observation."""
-         self._state.step_count += 1
         command = action.command.strip()
         if not command:
             return TerminalObservation(
                 stdout="",
@@ -170,42 +189,14 @@ class DevOpsSandbox(Environment):
                 grader_feedback="No command executed.",
                 done=False,
                 reward=0.01,
             )

-         # Handle 'cd' commands manually since subprocess run is transient
-         if command.startswith("cd "):
-             target = command[3:].strip()
-             # Handle standard cd edge cases
-             if target == "" or target == "~":
-                 # Assuming /app is home for this exercise
-                 new_dir = self._app_dir
-             elif target.startswith("/"):
-                 new_dir = os.path.normpath(target)
-             else:
-                 new_dir = os.path.normpath(os.path.join(self._current_dir, target))
-
-             if os.path.isdir(new_dir):
-                 self._current_dir = new_dir
-                 stdout, stderr = "", ""
-             else:
-                 stdout, stderr = "", f"bash: cd: {target}: No such file or directory"
-
-             # Run the grader anyway, even if just a cd
-             score, feedback = self._grade()
-             reward = max(0.0, score - self._last_score)
-             self._last_score = score
-             episode_done = (score >= 0.99) or (self._state.step_count >= MAX_STEPS)

-             return TerminalObservation(
-                 stdout=stdout,
-                 stderr=stderr,
-                 current_dir=self._current_dir,
-                 task_id=self._current_task,
-                 grader_score=score,
-                 grader_feedback=feedback,
-                 done=episode_done,
-                 reward=reward,
-             )

         # Execute normal command
         try:
@@ -214,8 +205,12 @@ class DevOpsSandbox(Environment):
         except Exception as e:
             stdout, stderr = "", f"Command execution error: {e}"

         score, feedback = self._grade()
-         reward = max(0.0, score - self._last_score)
         self._last_score = score
         episode_done = (score >= 0.99) or (self._state.step_count >= MAX_STEPS)

@@ -228,6 +223,7 @@ class DevOpsSandbox(Environment):
             grader_feedback=feedback,
             done=episode_done,
             reward=reward,
         )

     @property
@@ -235,18 +231,151 @@ class DevOpsSandbox(Environment):
         return self._state

     def close(self) -> None:
-         # pkill node servers that we might have spawned during the session
         self._exec_cmd("pkill -f 'node server.js'")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
241
  # ==================================================================
242
  # FILESYSTEM & EXECUTION HELPERS
243
  # ==================================================================
244
  def _reset_filesystem(self) -> None:
245
- """Replace the current working /app with the pristine /app_backup."""
246
- # Ensure we don't accidentally wipe out the whole host on windows if paths are wrong
247
  os.makedirs(self._app_dir, exist_ok=True)
248
-
249
- # Clean contents of /app instead of deleting /app itself
250
  for item in os.listdir(self._app_dir):
251
  item_path = os.path.join(self._app_dir, item)
252
  if os.path.isdir(item_path):
@@ -256,8 +385,8 @@ class DevOpsSandbox(Environment):
256
  os.remove(item_path)
257
  except OSError:
258
  pass
259
-
260
- # Copy from backup to app dir
261
  if os.path.exists(self._app_backup_dir):
262
  for item in os.listdir(self._app_backup_dir):
263
  s = os.path.join(self._app_backup_dir, item)
@@ -267,14 +396,17 @@ class DevOpsSandbox(Environment):
267
  else:
268
  shutil.copy2(s, d)
269
  else:
270
- logger.warning(f"Backup directory {self._app_backup_dir} not found. Ensure Dockerfile copied simulated_app here.")
 
 
 
271
 
272
  def _exec_cmd(self, cmd: str, timeout: float = 30.0) -> str:
273
  """Execute command natively; return combined output."""
274
  stdout, stderr = self._exec_cmd_split(cmd, timeout)
275
  return (stdout + "\n" + stderr).strip()
276
 
277
- def _exec_cmd_split(self, cmd: str, timeout: float = 30.0) -> tuple:
278
  """Execute command natively; return (stdout, stderr)."""
279
  kwargs = {
280
  "cwd": self._current_dir,
@@ -282,8 +414,6 @@ class DevOpsSandbox(Environment):
282
  "capture_output": True,
283
  "timeout": timeout,
284
  }
285
-
286
- # Hugging Face space requires POSIX bash, windows uses powershell/cmd
287
  if sys.platform != "win32":
288
  kwargs["executable"] = "/bin/bash"
289
 
@@ -302,6 +432,7 @@ class DevOpsSandbox(Environment):
302
  # GRADER
303
  # ==================================================================
304
  def _inject_grader_script(self) -> None:
 
305
  self.grader_path = os.path.join(self._tmp_dir, "grader.sh")
306
  lines = [
307
  '#!/bin/bash',
@@ -314,6 +445,7 @@ class DevOpsSandbox(Environment):
314
  f'node server.js > {self._tmp_dir}/node.log 2>&1 &',
315
  'NODE_PID=$!',
316
  '',
 
317
  'for i in 1 2 3 4; do',
318
  ' sleep 1',
319
  ' if curl -s http://localhost:3000/health > /dev/null 2>&1; then',
@@ -339,22 +471,45 @@ class DevOpsSandbox(Environment):
339
  'echo "GRADER_USERS_BODY:${USERS_BODY}"',
340
  'echo "GRADER_DATA_BODY:${DATA_BODY}"',
341
  ]
342
-
343
  script_content = '\n'.join(lines) + '\n'
344
  with open(self.grader_path, "w", newline='\n') as f:
345
  f.write(script_content)
346
-
347
  if sys.platform != "win32":
348
  subprocess.run(["chmod", "+x", self.grader_path])
349
 
350
- def _grade(self) -> tuple:
 
 
 
 
 
 
 
 
 
 
 
 
351
  score = 0.0
352
  feedback_parts = []
353
 
 
 
 
 
 
 
 
 
 
 
 
 
 
354
  try:
355
  if sys.platform == "win32":
356
- # We use bash via wsl or bash.exe on Windows if we can,
357
- # but if not we might fail grading natively on Windows unless Git Bash is installed.
358
  raw = self._exec_cmd(f"bash {self.grader_path}", timeout=20.0)
359
  else:
360
  raw = self._exec_cmd(f"/bin/bash {self.grader_path}", timeout=20.0)
@@ -373,68 +528,70 @@ class DevOpsSandbox(Environment):
373
  data_body = results.get("GRADER_DATA_BODY", "")
374
 
375
  has_syntax_error = "SyntaxError" in startup_log
376
- has_crash = (has_syntax_error
377
- or "Cannot find module" in startup_log
378
- or "ReferenceError" in startup_log)
 
 
379
  app_listening = f"Server running on port {EXPECTED_PORT}" in startup_log
380
 
381
  if has_crash and not app_listening:
382
- feedback_parts.append(f"βœ— App crashes on startup")
383
  if has_syntax_error:
384
  feedback_parts.append("(SyntaxError detected)")
385
- return (score, " | ".join(feedback_parts))
386
-
387
- if app_listening:
388
- score += 0.35
389
- feedback_parts.append("βœ“ App starts on port 3000 (+0.35)")
390
- else:
391
  feedback_parts.append("βœ— App not listening on port 3000")
392
- return (score, " | ".join(feedback_parts))
393
-
394
- if health_code == "200":
395
- score += 0.10
396
- feedback_parts.append("βœ“ /health returns 200 (+0.10)")
397
  else:
398
- feedback_parts.append(f"βœ— /health returned {health_code}")
 
 
399
 
400
- if users_code == "200":
401
- if '"users"' in users_body:
402
- score += 0.15
403
- feedback_parts.append("βœ“ /api/users returns valid JSON (+0.15)")
404
  else:
405
- score += 0.05
406
- feedback_parts.append("~ /api/users 200 but bad body (+0.05)")
407
- else:
408
- feedback_parts.append(f"βœ— /api/users returned {users_code}")
409
-
410
- if data_code == "200":
411
- if '"records"' in data_body:
412
- score += 0.25
413
- feedback_parts.append("βœ“ /api/data returns valid JSON (+0.25)")
414
  else:
415
- score += 0.05
416
- feedback_parts.append("~ /api/data 200 but bad body (+0.05)")
417
- else:
418
- feedback_parts.append(f"βœ— /api/data returned {data_code}")
 
 
 
 
 
 
 
419
 
420
- if score >= 0.85:
421
- score = min(score + 0.15, 1.0)
422
- feedback_parts.append("βœ“ All endpoints healthy β€” FULL SCORE (+0.15)")
423
 
424
  except Exception as exc:
425
  logger.exception("Grader error")
426
  feedback_parts.append(f"Grader error (score preserved): {exc}")
427
 
428
- # Scale score based on task difficulty
429
  if self._current_task == "easy":
430
- raw_target = 0.45
431
  elif self._current_task == "medium":
432
- raw_target = 0.60
433
  else:
434
  raw_target = 1.0
435
-
436
  final_score = min(1.0, score / raw_target)
437
- # Cap strictly within (0, 1) per Phase 2 Validator requirements
438
  final_score = round(min(max(final_score, 0.01), 0.99), 2)
439
-
440
  return (final_score, " | ".join(feedback_parts))
 
 """
 Self-Healing DevOps Sandbox β€” Environment Implementation.

+An RL environment where an AI agent is dropped into a broken Node.js Express
+backend and must use bash commands to diagnose and fix production-like bugs.
+
 Runs entirely natively on the host filesystem (Hugging Face Spaces compatible).
+The agent executes bash commands to diagnose and fix 3 bugs via direct subprocesses.
+
+Bugs injected:
+    1. config.json β€” wrong port (9999 instead of 3000)
+    2. routes/users.js β€” missing closing parenthesis (SyntaxError)
+    3. routes/data.js β€” missing `await` on async DB call (broken response)
+
+Grading:
+    - File-level verification: did the agent edit the correct file?
+    - HTTP endpoint testing: does the app start and respond correctly?
+    - Partial credit: smooth reward progression from 0.01 to 0.99
 """

+import hashlib
+import json
 import logging
 import os
 import shutil
 import subprocess
 import sys
 from pathlib import Path
+from typing import Any, Dict, List, Optional, Tuple
 from uuid import uuid4

 from openenv.core.env_server.interfaces import Environment

 MAX_STEPS = 50  # Episode budget
 SIMULATED_APP_DIR = Path(__file__).resolve().parent.parent / "simulated_app"

+# Files that contain bugs β€” used for file-change tracking
+BUG_FILES = {
+    "config.json": "port",
+    "routes/users.js": "syntax",
+    "routes/data.js": "await",
+}
+
+
 class DevOpsSandbox(Environment):
     """
     RL environment: fix a broken Node.js backend.
+
+    The agent operates in a Linux filesystem with a broken Express.js app.
+    It must use bash commands (ls, cat, sed, grep, etc.) to find and fix bugs.
+
+    Features:
+    - 3 difficulty levels (easy/medium/hard) with progressive bug counts
+    - File-change tracking for granular reward shaping
+    - HTTP endpoint verification via automated grader
+    - Rich metadata in observations (files_modified, bugs_found, etc.)
+    - All scores strictly within (0, 1) per OpenEnv spec
     """

     SUPPORTS_CONCURRENT_SESSIONS: bool = False

         self._current_dir: str = "/app"
         self._last_score: float = 0.01
         self._current_task: str = "hard"
+        self._file_hashes: Dict[str, str] = {}
+        self._files_modified: List[str] = []
+        self._commands_history: List[str] = []
+
+        # Platform-specific paths
         if sys.platform == "win32":
             workspace = Path(__file__).resolve().parent.parent
             self._app_dir = str(workspace / ".app_sandbox")
             self._app_backup_dir = str(SIMULATED_APP_DIR)

             os.makedirs(self._tmp_dir, exist_ok=True)
             self._current_dir = self._app_dir
         else:
             self._app_dir = "/app"
             self._app_backup_dir = "/app_backup"
             self._tmp_dir = "/tmp"
             self._current_dir = "/app"

+    # ==================================================================
+    # RESET
+    # ==================================================================
     def reset(
         self,
         seed: Optional[int] = None,
         episode_id: Optional[str] = None,
         **kwargs: Any,
     ) -> TerminalObservation:
+        """Reset the environment state for a new episode.
+
+        Args:
+            seed: Optional random seed (unused, bugs are deterministic).
+            episode_id: Optional episode identifier.
+            **kwargs: Must include task_name='easy'|'medium'|'hard'.
+
+        Returns:
+            TerminalObservation with the task prompt and initial state.
+        """
         eid = episode_id or str(uuid4())
         self._state = State(episode_id=eid, step_count=0)
         self._last_score = 0.01
         self._current_dir = self._app_dir
         self._current_task = kwargs.get("task_name", "hard")
+        self._files_modified = []
+        self._commands_history = []

         self._reset_filesystem()
+        self._snapshot_file_hashes()
         self._inject_grader_script()

         # Gather initial observation
+        init_stdout = self._exec_cmd(
+            f"ls -la {self._app_dir} && echo '---' && cat {os.path.join(self._app_dir, 'config.json')}"
         )

+        task_prompt = self._build_task_prompt(init_stdout)
+
         return TerminalObservation(
             stdout=task_prompt,
             stderr="",
             current_dir=self._current_dir,
             task_id=self._current_task,
             grader_score=0.01,
+            grader_feedback="Episode started. Diagnose and fix the bugs!",
             done=False,
             reward=0.01,
+            metadata={
+                "episode_id": eid,
+                "task": self._current_task,
+                "max_steps": MAX_STEPS,
+                "bugs_total": self._bugs_for_task(),
+                "bugs_found": 0,
+                "files_modified": [],
+            },
         )

+    # ==================================================================
+    # STEP
+    # ==================================================================
     def step(
         self,
+        action: BashAction,
         timeout_s: Optional[float] = None,
         **kwargs: Any,
     ) -> TerminalObservation:
+        """Execute the agent's command, run the grader, return observation.
+
+        Args:
+            action: BashAction containing the command string.
+            timeout_s: Optional timeout for command execution.
+
+        Returns:
+            TerminalObservation with command output, score, and metadata.
+        """
+        self._state.step_count += 1
         command = action.command.strip()
+
         if not command:
             return TerminalObservation(
                 stdout="",

                 grader_feedback="No command executed.",
                 done=False,
                 reward=0.01,
+                metadata=self._build_metadata(),
             )

+        self._commands_history.append(command)

+        # Handle 'cd' commands manually (subprocess is transient)
+        if command.startswith("cd "):
+            return self._handle_cd(command)

         # Execute normal command
         try:

         except Exception as e:
             stdout, stderr = "", f"Command execution error: {e}"

+        # Check for file modifications
+        self._detect_file_changes()
+
+        # Grade the current state
         score, feedback = self._grade()
+        reward = max(0.01, score - self._last_score)
         self._last_score = score
         episode_done = (score >= 0.99) or (self._state.step_count >= MAX_STEPS)

             grader_feedback=feedback,
             done=episode_done,
             reward=reward,
+            metadata=self._build_metadata(),
         )

     @property

         return self._state

     def close(self) -> None:
+        """Clean up: kill any Node.js servers spawned during the episode."""
         self._exec_cmd("pkill -f 'node server.js'")

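The step reward here is delta-shaped: grader scores are cumulative, and the agent is paid only the improvement over the previous step, floored at 0.01 so every action yields a small positive signal. A standalone sketch of that bookkeeping (the function name and example trajectory are illustrative, not part of the environment):

```python
def step_rewards(scores, floor=0.01):
    """Turn a sequence of cumulative grader scores into per-step rewards.

    Mirrors `reward = max(0.01, score - self._last_score)`: each step pays
    out the improvement since the last grade, never less than the floor.
    Results are rounded to 2 decimals, as the environment does for scores.
    """
    rewards, last = [], floor
    for score in scores:
        rewards.append(round(max(floor, score - last), 2))
        last = score
    return rewards

# A run that fixes one bug, then another, then stalls for a step:
print(step_rewards([0.35, 0.60, 0.60]))  # [0.34, 0.25, 0.01]
```

The floor means a stalled agent still receives 0.01 per step, so the shaping never produces a zero or negative reward.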
+    # ==================================================================
+    # TASK PROMPTS
+    # ==================================================================
+    def _build_task_prompt(self, init_stdout: str) -> str:
+        """Build the task prompt based on the current difficulty level."""
+        base = (
+            "=== SELF-HEALING DEVOPS SANDBOX ===\n"
+            f"You have been dropped into a container with a broken Node.js "
+            f"Express backend in {self._app_dir}.\n\n"
+        )
+
+        if self._current_task == "easy":
+            mission = (
+                "YOUR MISSION [EASY β€” 1 bug]:\n"
+                "  Fix the port configuration so that:\n"
+                "  1. The app starts without errors on port 3000\n"
+                "  2. GET /health returns HTTP 200\n\n"
+                "HINTS:\n"
+                "  - Check config.json for wrong settings\n"
+            )
+        elif self._current_task == "medium":
+            mission = (
+                "YOUR MISSION [MEDIUM β€” 2 bugs]:\n"
+                "  Fix BOTH bugs so that:\n"
+                "  1. The app starts without errors on port 3000\n"
+                "  2. GET /health returns HTTP 200\n"
+                "  3. GET /api/users returns HTTP 200 with valid JSON\n\n"
+                "HINTS:\n"
+                "  - Check config.json for wrong settings\n"
+                "  - Look for syntax errors in routes/users.js\n"
+            )
+        else:
+            mission = (
+                "YOUR MISSION [HARD β€” 3 bugs]:\n"
+                "  Fix ALL bugs so that:\n"
+                "  1. The app starts without errors on port 3000\n"
+                "  2. GET /health returns HTTP 200\n"
+                "  3. GET /api/users returns HTTP 200 with valid JSON\n"
+                "  4. GET /api/data returns HTTP 200 with valid JSON\n\n"
+                "HINTS:\n"
+                "  - Check config files for wrong settings\n"
+                "  - Look for syntax errors that prevent startup\n"
+                "  - Watch out for async/await issues\n"
+            )
+
+        return (
+            base + mission +
+            "\nUse bash commands to explore, edit files, and test.\n"
+            "When you think you've fixed everything, run: npm start\n\n"
+            f"--- INITIAL DIRECTORY LISTING ---\n{init_stdout}\n"
+        )
+
+    def _bugs_for_task(self) -> int:
+        """Return the number of bugs for the current task difficulty."""
+        return {"easy": 1, "medium": 2, "hard": 3}.get(self._current_task, 3)
+
+    # ==================================================================
+    # CD HANDLER
+    # ==================================================================
+    def _handle_cd(self, command: str) -> TerminalObservation:
+        """Handle cd commands manually since subprocess.run is transient."""
+        target = command[3:].strip()
+        if target == "" or target == "~":
+            new_dir = self._app_dir
+        elif target.startswith("/"):
+            new_dir = os.path.normpath(target)
+        else:
+            new_dir = os.path.normpath(os.path.join(self._current_dir, target))
+
+        if os.path.isdir(new_dir):
+            self._current_dir = new_dir
+            stdout, stderr = "", ""
+        else:
+            stdout, stderr = "", f"bash: cd: {target}: No such file or directory"
+
+        score, feedback = self._grade()
+        reward = max(0.01, score - self._last_score)
+        self._last_score = score
+        episode_done = (score >= 0.99) or (self._state.step_count >= MAX_STEPS)
+
+        return TerminalObservation(
+            stdout=stdout,
+            stderr=stderr,
+            current_dir=self._current_dir,
+            task_id=self._current_task,
+            grader_score=score,
+            grader_feedback=feedback,
+            done=episode_done,
+            reward=reward,
+            metadata=self._build_metadata(),
+        )
+
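Because every command runs in a fresh subprocess, `cd` cannot persist on its own, so `_handle_cd` tracks the working directory itself. The path resolution in isolation (this sketch uses `posixpath` for deterministic POSIX-style output on any platform; the environment itself calls `os.path.normpath`):

```python
import posixpath

def resolve_cd(current: str, target: str, home: str = "/app") -> str:
    """Resolve a `cd` target against the tracked working directory."""
    if target in ("", "~"):       # bare `cd` / `cd ~` go to the app root
        return home
    if target.startswith("/"):    # absolute path
        return posixpath.normpath(target)
    # relative path: join then normalize away any `..` / `.` segments
    return posixpath.normpath(posixpath.join(current, target))

print(resolve_cd("/app", "routes"))     # /app/routes
print(resolve_cd("/app/routes", ".."))  # /app
print(resolve_cd("/app/routes", "~"))   # /app
```

The caller still has to check `os.path.isdir` on the result and emit the familiar `bash: cd: ...: No such file or directory` message when it does not exist.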
+
329
+ # ==================================================================
330
+ # METADATA & FILE TRACKING
331
+ # ==================================================================
332
+ def _build_metadata(self) -> Dict[str, Any]:
333
+ """Build rich metadata for the current observation."""
334
+ return {
335
+ "episode_id": self._state.episode_id,
336
+ "step": self._state.step_count,
337
+ "task": self._current_task,
338
+ "max_steps": MAX_STEPS,
339
+ "bugs_total": self._bugs_for_task(),
340
+ "files_modified": list(self._files_modified),
341
+ "commands_count": len(self._commands_history),
342
+ }
343
+
344
+ def _snapshot_file_hashes(self) -> None:
345
+ """Take a hash snapshot of all bug-related files for change detection."""
346
+ self._file_hashes = {}
347
+ for relative_path in BUG_FILES:
348
+ full_path = os.path.join(self._app_dir, relative_path)
349
+ if os.path.isfile(full_path):
350
+ try:
351
+ with open(full_path, "rb") as f:
352
+ self._file_hashes[relative_path] = hashlib.md5(f.read()).hexdigest()
353
+ except OSError:
354
+ pass
355
+
356
+ def _detect_file_changes(self) -> None:
357
+ """Detect which bug files have been modified since reset."""
358
+ for relative_path in BUG_FILES:
359
+ if relative_path in self._files_modified:
360
+ continue
361
+ full_path = os.path.join(self._app_dir, relative_path)
362
+ if os.path.isfile(full_path):
363
+ try:
364
+ with open(full_path, "rb") as f:
365
+ current_hash = hashlib.md5(f.read()).hexdigest()
366
+ if current_hash != self._file_hashes.get(relative_path):
367
+ self._files_modified.append(relative_path)
368
+ except OSError:
369
+ pass
370
+
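The snapshot-and-compare hashing above is easy to exercise on its own; a self-contained sketch with a throwaway file (paths and names are illustrative):

```python
import hashlib
import os
import tempfile
from typing import Optional

def file_digest(path: str) -> Optional[str]:
    """MD5 of a file's bytes, or None if unreadable (mirrors the OSError guard)."""
    try:
        with open(path, "rb") as f:
            return hashlib.md5(f.read()).hexdigest()
    except OSError:
        return None

with tempfile.TemporaryDirectory() as d:
    cfg = os.path.join(d, "config.json")
    with open(cfg, "w") as f:
        f.write('{"port": 9999}')
    baseline = file_digest(cfg)       # snapshot at reset time

    with open(cfg, "w") as f:
        f.write('{"port": 3000}')     # the agent's fix
    changed = file_digest(cfg) != baseline
    print(changed)  # True
```

MD5 is fine here since the hash only detects edits, not adversarial collisions; any fast digest would do.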
     # ==================================================================
     # FILESYSTEM & EXECUTION HELPERS
     # ==================================================================
     def _reset_filesystem(self) -> None:
+        """Replace the working /app with the pristine backup."""
         os.makedirs(self._app_dir, exist_ok=True)
+
+        # Clean contents of /app
         for item in os.listdir(self._app_dir):
             item_path = os.path.join(self._app_dir, item)
             if os.path.isdir(item_path):

                 os.remove(item_path)
             except OSError:
                 pass
+
+        # Copy from backup
         if os.path.exists(self._app_backup_dir):
             for item in os.listdir(self._app_backup_dir):
                 s = os.path.join(self._app_backup_dir, item)

             else:
                 shutil.copy2(s, d)
         else:
+            logger.warning(
+                f"Backup directory {self._app_backup_dir} not found. "
+                "Ensure Dockerfile copied simulated_app here."
+            )

     def _exec_cmd(self, cmd: str, timeout: float = 30.0) -> str:
         """Execute command natively; return combined output."""
         stdout, stderr = self._exec_cmd_split(cmd, timeout)
         return (stdout + "\n" + stderr).strip()

+    def _exec_cmd_split(self, cmd: str, timeout: float = 30.0) -> Tuple[str, str]:
         """Execute command natively; return (stdout, stderr)."""
         kwargs = {
             "cwd": self._current_dir,

             "capture_output": True,
             "timeout": timeout,
         }
         if sys.platform != "win32":
             kwargs["executable"] = "/bin/bash"

     # GRADER
     # ==================================================================
     def _inject_grader_script(self) -> None:
+        """Write the grader bash script that tests the Node.js app endpoints."""
         self.grader_path = os.path.join(self._tmp_dir, "grader.sh")
         lines = [
             '#!/bin/bash',

             f'node server.js > {self._tmp_dir}/node.log 2>&1 &',
             'NODE_PID=$!',
             '',
+            '# Wait for server to start (up to 4 seconds)',
             'for i in 1 2 3 4; do',
             '  sleep 1',
             '  if curl -s http://localhost:3000/health > /dev/null 2>&1; then',

             'echo "GRADER_USERS_BODY:${USERS_BODY}"',
             'echo "GRADER_DATA_BODY:${DATA_BODY}"',
         ]
+
         script_content = '\n'.join(lines) + '\n'
         with open(self.grader_path, "w", newline='\n') as f:
             f.write(script_content)
+
         if sys.platform != "win32":
             subprocess.run(["chmod", "+x", self.grader_path])

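`_grade` reads the script's results back through the `GRADER_*:` marker lines it echoes. The parsing step itself is not shown in this diff, but a plausible sketch of it (the function name is illustrative) looks like:

```python
from typing import Dict

def parse_grader_output(raw: str) -> Dict[str, str]:
    """Collect 'GRADER_KEY:value' marker lines into a dict, ignoring noise."""
    results = {}
    for line in raw.splitlines():
        # Only marker lines; npm/node chatter in the combined output is skipped
        if line.startswith("GRADER_") and ":" in line:
            key, _, value = line.partition(":")
            results[key] = value
    return results

raw = "npm warn ...\nGRADER_HEALTH:200\nGRADER_USERS:500\nGRADER_DATA_BODY:"
print(parse_grader_output(raw)["GRADER_HEALTH"])  # 200
```

Prefixed marker lines make the grader robust to whatever else the Node process prints, since `results.get(...)` simply returns a default for any marker that never appeared.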
+    def _grade(self) -> Tuple[float, str]:
+        """Run the grader and return (score, feedback).
+
+        Scoring breakdown:
+        - File-level: +0.05 per correctly modified bug file
+        - App starts on port 3000: +0.30
+        - /health returns 200: +0.10
+        - /api/users returns valid JSON: +0.15
+        - /api/data returns valid JSON: +0.20
+        - All endpoints pass: +0.05 bonus
+
+        Total raw score is then scaled by task difficulty and clamped to (0, 1).
+        """
         score = 0.0
         feedback_parts = []

+        # --- Phase 1: File-change rewards (micro-rewards for finding bugs) ---
+        files_to_check = {
+            "easy": ["config.json"],
+            "medium": ["config.json", "routes/users.js"],
+            "hard": ["config.json", "routes/users.js", "routes/data.js"],
+        }.get(self._current_task, list(BUG_FILES.keys()))
+
+        for f in files_to_check:
+            if f in self._files_modified:
+                score += 0.05
+                feedback_parts.append(f"βœ“ Modified {f} (+0.05)")
+
+        # --- Phase 2: HTTP endpoint testing ---
         try:
             if sys.platform == "win32":
                 raw = self._exec_cmd(f"bash {self.grader_path}", timeout=20.0)
             else:
                 raw = self._exec_cmd(f"/bin/bash {self.grader_path}", timeout=20.0)

             data_body = results.get("GRADER_DATA_BODY", "")

             has_syntax_error = "SyntaxError" in startup_log
+            has_crash = (
+                has_syntax_error
+                or "Cannot find module" in startup_log
+                or "ReferenceError" in startup_log
+            )
             app_listening = f"Server running on port {EXPECTED_PORT}" in startup_log

             if has_crash and not app_listening:
+                feedback_parts.append("βœ— App crashes on startup")
                 if has_syntax_error:
                     feedback_parts.append("(SyntaxError detected)")
+                # Fall through to clamping β€” NO early return
+            elif not app_listening:
                 feedback_parts.append("βœ— App not listening on port 3000")
+                # Fall through to clamping β€” NO early return
             else:
+                # App is running β€” grade each endpoint
+                score += 0.30
+                feedback_parts.append("βœ“ App starts on port 3000 (+0.30)")

+                if health_code == "200":
+                    score += 0.10
+                    feedback_parts.append("βœ“ /health returns 200 (+0.10)")
                 else:
+                    feedback_parts.append(f"βœ— /health returned {health_code}")
+
+                if users_code == "200":
+                    if '"users"' in users_body:
+                        score += 0.15
+                        feedback_parts.append("βœ“ /api/users returns valid JSON (+0.15)")
+                    else:
+                        score += 0.05
+                        feedback_parts.append("~ /api/users 200 but malformed body (+0.05)")
                 else:
+                    feedback_parts.append(f"βœ— /api/users returned {users_code}")
+
+                if data_code == "200":
+                    if '"records"' in data_body:
+                        score += 0.20
+                        feedback_parts.append("βœ“ /api/data returns valid JSON (+0.20)")
+                    else:
+                        score += 0.05
+                        feedback_parts.append("~ /api/data 200 but malformed body (+0.05)")
+                else:
+                    feedback_parts.append(f"βœ— /api/data returned {data_code}")

+                if score >= 0.80:
+                    score += 0.05
+                    feedback_parts.append("βœ“ All endpoints healthy β€” bonus (+0.05)")

         except Exception as exc:
             logger.exception("Grader error")
             feedback_parts.append(f"Grader error (score preserved): {exc}")

+        # --- Phase 3: Scale by difficulty and clamp ---
         if self._current_task == "easy":
+            raw_target = 0.50
         elif self._current_task == "medium":
+            raw_target = 0.65
         else:
             raw_target = 1.0
+
         final_score = min(1.0, score / raw_target)
+        # Clamp strictly within (0, 1) β€” EVERY code path reaches here
         final_score = round(min(max(final_score, 0.01), 0.99), 2)
+
         return (final_score, " | ".join(feedback_parts))
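The Phase 3 normalization is a pure function of the raw score and the task name, so it can be tested in isolation. A sketch mirroring the targets above (0.50 easy, 0.65 medium, 1.0 hard):

```python
def scale_and_clamp(raw_score: float, task: str) -> float:
    """Scale raw points by the task's achievable maximum, clamp into (0, 1)."""
    raw_target = {"easy": 0.50, "medium": 0.65}.get(task, 1.0)
    final = min(1.0, raw_score / raw_target)
    # Strictly inside (0, 1): the score is never exactly 0.0 or 1.0
    return round(min(max(final, 0.01), 0.99), 2)

print(scale_and_clamp(0.50, "easy"))  # 0.99 β€” full marks, capped below 1.0
print(scale_and_clamp(0.0, "hard"))   # 0.01 β€” floor, never exactly 0
print(scale_and_clamp(0.45, "hard"))  # 0.45
```

Scaling by a per-difficulty target means an easy episode that earns all of its reachable points still maps to the full 0.99, so the reward range is comparable across difficulty levels.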