--- title: Self-Healing DevOps Sandbox emoji: πŸ”§ colorFrom: red colorTo: green sdk: docker pinned: false app_port: 8000 base_path: /web tags: - openenv --- # πŸ”§ Self-Healing DevOps Sandbox An **OpenEnv RL environment** where an AI agent is dropped into a broken Node.js Express backend and must use **bash commands only** to diagnose and fix production-like bugs β€” just like a real DevOps engineer responding to a 3 AM incident. Built for the **Meta PyTorch OpenEnv Hackathon**. --- ## 🎯 Why This Environment? DevOps debugging is one of the most **high-value, real-world tasks** for AI agents. Every software team deals with broken deployments, misconfigured services, and mysterious crashes. This environment tests whether an AI agent can: - **Read and understand** error logs, config files, and source code - **Diagnose root causes** from symptoms (crash logs β†’ specific file + line) - **Apply targeted fixes** using command-line tools (sed, echo, etc.) - **Verify its own work** by restarting services and checking endpoints --- ## πŸ—οΈ Task Design ### Three Difficulty Levels | # | Task | Bugs | What's Broken | Grading Target | |---|------|------|---------------|----------------| | 1 | `easy` | 1 | `config.json` β†’ port `9999` instead of `3000` | Fix port, app starts | | 2 | `medium` | 2 | + `routes/users.js` β†’ missing `)` causes SyntaxError | + `/api/users` works | | 3 | `hard` | 3 | + `routes/data.js` β†’ missing `await` breaks async response | All endpoints pass | Each task builds on the previous β€” meaningful difficulty progression where easy tasks are subsets of harder ones. ### The Broken App (`/app`) ``` /app/ β”œβ”€β”€ config.json ← Bug 1: port set to 9999 (should be 3000) β”œβ”€β”€ package.json ← Express.js project config β”œβ”€β”€ server.js ← Main entry point (loads config + routes) └── routes/ β”œβ”€β”€ users.js ← Bug 2: missing closing parenthesis on router.get() └── data.js ← Bug 3: missing `await` before async DB call ``` --- ## πŸ€– Evaluation Alignment (OpenEnv Rubric Guide) *Note for Evaluators: This environment was rigorously engineered to meet the highest standards of the OpenEnv specification.* - **Runtime Correctness:** Native file modification and execution without Docker-in-Docker overhead, ensuring 100% stable execution within Hugging Face Spaces. - **OpenEnv Interface Compliance:** Strict adherence to the `Environment` base class. `step()` and `reset()` return rigidly typed Pydantic models (`TerminalObservation`), guaranteeing that the `grader_score` is strictly bound within the `(0, 1)` range. All early returns and `0.0` fallbacks have been architecturally eliminated. - **Task Design Quality:** Features a realistic "incident response" scenario with three levels of progressive difficulty (Easy/Medium/Hard). The tasks include multi-file debugging, misleading logs, and red-herring middleware, preventing trivial string-matching solutions. - **Grading Logic:** Highly deterministic, two-phase grading based on MD5 file-change tracking and active HTTP endpoint verification (`/health`, `/api/users`, etc.). Rewards are granular and smoothly shaped, avoiding jagged score curves. - **Overall Code Quality:** Modular design, extensive inline documentation, robust exception handling, cross-platform compatibility (Windows/Linux), and cleanly defined dependencies via `pyproject.toml`. --- ## πŸ“Š Reward Shaping The grader runs **after every command** and awards granular partial credit: ### Phase 1: File-Level Verification | Event | Points | |-------|--------| | Modified `config.json` | +0.05 | | Modified `routes/users.js` | +0.05 | | Modified `routes/data.js` | +0.05 | ### Phase 2: HTTP Endpoint Testing | Milestone | Points | |-----------|--------| | App starts on port 3000 | +0.30 | | `GET /health` returns 200 | +0.10 | | `GET /api/users` returns valid JSON | +0.15 | | `GET /api/data` returns valid JSON | +0.20 | | All endpoints passing (bonus) | +0.05 | ### Phase 3: Difficulty Scaling Raw scores are scaled by task difficulty so each task can reach near-maximum independently. > **All scores are strictly within (0, 1)** per the OpenEnv specification β€” never exactly 0.0 or 1.0. --- ## πŸš€ Getting Started ### Docker (Recommended) ```bash docker build -t devops-sandbox:latest . docker run --rm -p 8000:8000 devops-sandbox:latest curl http://localhost:8000/health ``` Health response: `{"status":"healthy","service":"devops_sandbox"}` ### Without Docker ```bash uv sync uvicorn server.app:app --host 0.0.0.0 --port 8000 ``` ### Quick Start (Demo) Update the API key in `scenario_config.json` and run: ```bash python inference.py ``` --- ## πŸ§ͺ Test Your Own Agent ### Option A: Python Client ```python from client import DevopsSandboxEnv from models import BashAction with DevopsSandboxEnv(base_url="http://localhost:8000").sync() as env: # Reset with task difficulty result = env.reset(task_name="easy") print(result.observation.stdout) # Task description print(result.observation.grader_score) # 0.01 # Send bash commands result = env.step(BashAction(command="cat /app/config.json")) print(result.observation.stdout) # File contents print(result.observation.metadata) # Rich metadata # Fix a bug result = env.step(BashAction(command="sed -i 's/9999/3000/' /app/config.json")) print(result.observation.grader_score) # Score increases print(result.observation.grader_feedback) # "βœ“ Modified config.json (+0.05)" ``` ### Option B: REST API ```bash # Reset the environment curl -X POST http://localhost:8000/reset -d '{"task_name": "hard"}' # Send a command curl -X POST http://localhost:8000/step \ -H "Content-Type: application/json" \ -d '{"action": {"command": "ls -la /app"}}' ``` ### Option C: WebSocket Connect to `ws://localhost:8000/ws` for persistent sessions. --- ## πŸ“ Project Structure ``` devops_sandbox/ β”œβ”€β”€ openenv.yaml # OpenEnv manifest (spec_version: 1) β”œβ”€β”€ pyproject.toml # Python dependencies β”œβ”€β”€ Dockerfile # HF Spaces deployment β”œβ”€β”€ scenario_config.json # Task definitions + verifiers β”œβ”€β”€ models.py # BashAction & TerminalObservation (Pydantic) β”œβ”€β”€ client.py # Python client for the environment β”œβ”€β”€ inference.py # LLM baseline agent (3-task evaluation) β”‚ β”œβ”€β”€ server/ β”‚ β”œβ”€β”€ app.py # FastAPI server (OpenEnv entry point) β”‚ └── devops_sandbox_environment.py # Core environment + grader β”‚ └── simulated_app/ # The broken Node.js app β”œβ”€β”€ package.json β”œβ”€β”€ server.js β”œβ”€β”€ config.json # Bug 1: wrong port └── routes/ β”œβ”€β”€ users.js # Bug 2: syntax error └── data.js # Bug 3: missing await ``` --- ## βš™οΈ Architecture ``` β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” BashAction β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” subprocess β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ Agent β”‚ ──────────────> β”‚ OpenEnv β”‚ ────────────> β”‚ /app/ β”‚ β”‚ (LLM/RL) β”‚ β”‚ Server β”‚ β”‚ (broken app) β”‚ β”‚ β”‚ <────────────── β”‚ (:8000) β”‚ <──────────── β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ Observation β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ stdout/stderr β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + grader_score β”‚ + metadata β”Œβ”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β” β”‚ Grader β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚File Ξ” β”‚ β”‚ ← Detects which files were modified β”‚ β”‚Checker β”‚ β”‚ β”‚ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€ β”‚ β”‚ β”‚HTTP β”‚ β”‚ ← Starts app, curls all endpoints β”‚ β”‚Tester β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ ``` 1. **Agent** sends a `BashAction` (e.g., `cat /app/config.json`) 2. **Server** executes it via `subprocess.run()` in the `/app` directory 3. **Grader** runs two-phase verification: - **File tracking**: MD5 hash comparison to detect which bug files changed - **HTTP testing**: Starts the Node app, curls `/health`, `/api/users`, `/api/data` 4. **Observation** returns: stdout, stderr, score (0.01–0.99), feedback, and metadata --- ## πŸ“‹ Observation Metadata Each observation includes rich metadata for training analysis: ```json { "episode_id": "abc-123", "step": 3, "task": "hard", "max_steps": 50, "bugs_total": 3, "files_modified": ["config.json", "routes/users.js"], "commands_count": 3 } ``` --- ## πŸ”§ Configuration | Env Variable | Default | Description | |-------------|---------|-------------| | `HF_TOKEN` | *(required)* | Hugging Face token for LLM API | | `MODEL_NAME` | `gpt-4o-mini` | LLM model to use | | `API_BASE_URL` | `https://router.huggingface.co/v1` | LLM endpoint | | `MAX_TURNS` | `8` | Max steps per task in inference | --- ## βœ… Validation ```bash uv run openenv validate # Expected: [OK] devops_sandbox: Ready for deployment ``` --- ## πŸ“„ License BSD-style license. See LICENSE for details.