Spaces:
Sleeping
Sleeping
Commit Β·
8b10144
1
Parent(s): 486044c
chore: remove __pycache__ files
Browse files- .gitignore +8 -0
- LICENSE +24 -0
- README.md +220 -150
- __init__.py +12 -2
- __pycache__/__init__.cpython-313.pyc +0 -0
- __pycache__/client.cpython-313.pyc +0 -0
- __pycache__/models.cpython-313.pyc +0 -0
- __pycache__/scenarios.cpython-313.pyc +0 -0
- inference.py +58 -19
- models.py +19 -2
- scenarios.py +961 -286
- server/__pycache__/__init__.cpython-313.pyc +0 -0
- server/__pycache__/api_debug_env_environment.cpython-313.pyc +0 -0
- server/__pycache__/app.cpython-313.pyc +0 -0
- server/api_debug_env_environment.py +390 -46
- server/app.py +40 -7
- tests/__pycache__/__init__.cpython-313.pyc +0 -0
- tests/__pycache__/test_environment.cpython-313-pytest-8.4.1.pyc +0 -0
- tests/test_environment.py +409 -45
.gitignore
ADDED
|
@@ -0,0 +1,8 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
hackathonDetails/
|
| 2 |
+
.agents/
|
| 3 |
+
AGENTS.md
|
| 4 |
+
PROGRESS.md
|
| 5 |
+
__pycache__/
|
| 6 |
+
*.pyc
|
| 7 |
+
*.pyo
|
| 8 |
+
.env
|
LICENSE
ADDED
|
@@ -0,0 +1,24 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
BSD 2-Clause License
|
| 2 |
+
|
| 3 |
+
Copyright (c) 2026, Yadnyesh
|
| 4 |
+
|
| 5 |
+
Redistribution and use in source and binary forms, with or without
|
| 6 |
+
modification, are permitted provided that the following conditions are met:
|
| 7 |
+
|
| 8 |
+
1. Redistributions of source code must retain the above copyright notice, this
|
| 9 |
+
list of conditions and the following disclaimer.
|
| 10 |
+
|
| 11 |
+
2. Redistributions in binary form must reproduce the above copyright notice,
|
| 12 |
+
this list of conditions and the following disclaimer in the documentation
|
| 13 |
+
and/or other materials provided with the distribution.
|
| 14 |
+
|
| 15 |
+
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
| 16 |
+
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
| 17 |
+
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
|
| 18 |
+
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
|
| 19 |
+
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
| 20 |
+
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
| 21 |
+
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
| 22 |
+
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
|
| 23 |
+
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
|
| 24 |
+
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
README.md
CHANGED
|
@@ -1,224 +1,294 @@
|
|
| 1 |
---
|
| 2 |
-
title:
|
| 3 |
-
emoji:
|
| 4 |
-
colorFrom:
|
| 5 |
-
colorTo:
|
| 6 |
sdk: docker
|
| 7 |
-
pinned: false
|
| 8 |
app_port: 8000
|
| 9 |
tags:
|
| 10 |
- openenv
|
| 11 |
---
|
| 12 |
-
# API Integration Debugging Environment
|
| 13 |
|
| 14 |
-
|
| 15 |
|
| 16 |
-
|
| 17 |
|
| 18 |
-
|
|
|
|
|
|
|
| 19 |
|
| 20 |
-
|
| 21 |
-
2. **Inspect service configurations** to find misconfigurations
|
| 22 |
-
3. **Test endpoints** to observe current behavior
|
| 23 |
-
4. **Submit fixes** with corrected configuration payloads
|
| 24 |
|
| 25 |
-
|
| 26 |
-
- **3 difficulty levels** with increasing complexity (2, 3, and 5 issues)
|
| 27 |
-
- **Strict value validation** on fixes (grader checks both key AND value)
|
| 28 |
-
- **Seed-based randomization** for reproducible yet varied episodes
|
| 29 |
-
- **Penalty for repeated inspections** to encourage efficient exploration
|
| 30 |
-
- **Comprehensive test suite** with 30+ unit tests
|
| 31 |
|
| 32 |
-
|
|
|
|
|
|
|
| 33 |
|
| 34 |
-
|
| 35 |
-
class ApiDebugAction(Action):
|
| 36 |
-
action_type: str # "inspect_logs" | "inspect_config" | "inspect_endpoint" | "submit_fix"
|
| 37 |
-
target: str # Service name (e.g. "payment_client", "webhook_sender")
|
| 38 |
-
fix_payload: dict # Required when action_type="submit_fix"
|
| 39 |
-
```
|
| 40 |
|
| 41 |
-
|
| 42 |
-
|--------|-------------|--------|
|
| 43 |
-
| `inspect_logs` | Read error logs for a service | +0.15 (finds new issue) / +0.05 (first time, no issue) / 0.0 (repeat) |
|
| 44 |
-
| `inspect_config` | View current config of a service | +0.05 (has issues) / +0.01 (no issues) / 0.0 (repeat) |
|
| 45 |
-
| `inspect_endpoint` | Test-call an endpoint | +0.02 to +0.05 |
|
| 46 |
-
| `submit_fix` | Submit a configuration fix | +0.25 (correct) / -0.1 (wrong) |
|
| 47 |
-
| *step cost* | Applied every step | -0.01 |
|
| 48 |
|
| 49 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 50 |
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
|
| 64 |
-
|
|
|
|
| 65 |
```
|
| 66 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 67 |
## Tasks
|
| 68 |
|
| 69 |
-
###
|
| 70 |
-
|
| 71 |
-
|
|
|
|
|
|
|
| 72 |
- **Services**: `payment_client`, `payment_gateway`
|
| 73 |
-
- **
|
|
|
|
|
|
|
|
|
|
|
|
|
| 74 |
|
| 75 |
-
|
| 76 |
-
- **Issues**: 3 (rate limit too high, insufficient retries, empty webhook signature)
|
| 77 |
-
- **Max Steps**: 25
|
| 78 |
- **Services**: `webhook_sender`, `webhook_receiver`, `notification_service`
|
| 79 |
-
- **
|
|
|
|
| 80 |
|
| 81 |
-
###
|
| 82 |
-
|
| 83 |
-
|
|
|
|
|
|
|
| 84 |
- **Services**: `order_service`, `inventory_service`, `shipping_service`, `api_gateway`, `auth_service`
|
| 85 |
-
- **
|
|
|
|
| 86 |
|
| 87 |
-
##
|
| 88 |
|
| 89 |
-
|
| 90 |
-
- **Partial progress**: First useful inspection earns reward (+0.05 to +0.15)
|
| 91 |
-
- **Repeated inspection**: 0 reward (prevents reward farming)
|
| 92 |
-
- **Fix rewards**: +0.25 per correctly fixed issue (strict key+value validation)
|
| 93 |
-
- **Completion bonus**: +0.2 when all issues are resolved
|
| 94 |
-
- **Penalties**: -0.1 for wrong fixes, -0.05 for invalid actions
|
| 95 |
|
| 96 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 97 |
|
| 98 |
```
|
| 99 |
-
Score =
|
| 100 |
-
|
| 101 |
-
exploration_bonus = issues_found / issues_total Γ 0.1
|
| 102 |
```
|
| 103 |
|
| 104 |
-
|
| 105 |
-
|
| 106 |
-
## Baseline Scores (Rule-Based Agent)
|
| 107 |
|
| 108 |
-
| Task | Score |
|
| 109 |
-
|------|-------|-------
|
| 110 |
-
| Easy | ~0.
|
| 111 |
-
| Medium | ~0.
|
| 112 |
-
| Hard | ~0.
|
| 113 |
|
| 114 |
-
|
| 115 |
|
| 116 |
-
##
|
| 117 |
|
| 118 |
-
|
| 119 |
-
[START] task=easy env=api_debug_env model=Qwen/Qwen2.5-72B-Instruct
|
| 120 |
|
| 121 |
-
|
| 122 |
-
|
| 123 |
-
|
| 124 |
-
|
| 125 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 126 |
|
| 127 |
-
#
|
| 128 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 129 |
|
| 130 |
-
#
|
| 131 |
-
[STEP] step=4 action=submit_fix(target=payment_client,fix={"headers.Content-Type":"application/json"}) reward=0.44 done=true error=null
|
| 132 |
|
| 133 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 134 |
```
|
| 135 |
|
| 136 |
## Setup & Usage
|
| 137 |
|
| 138 |
-
###
|
| 139 |
-
- Python 3.10+
|
| 140 |
-
- Docker (for containerized deployment)
|
| 141 |
-
|
| 142 |
-
### Local Development
|
| 143 |
|
| 144 |
```bash
|
| 145 |
-
cd api_debug_env
|
| 146 |
-
|
| 147 |
-
# Install dependencies
|
| 148 |
uv sync
|
| 149 |
-
|
| 150 |
-
# Run server
|
| 151 |
-
uv run server
|
| 152 |
-
# or
|
| 153 |
-
uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
|
| 154 |
```
|
| 155 |
|
| 156 |
-
###
|
| 157 |
|
| 158 |
```bash
|
| 159 |
-
|
| 160 |
-
docker build -t api_debug_env:latest -f server/Dockerfile .
|
| 161 |
-
docker run -p 8000:8000 api_debug_env:latest
|
| 162 |
```
|
| 163 |
|
| 164 |
-
### Run
|
| 165 |
|
| 166 |
```bash
|
| 167 |
-
|
| 168 |
-
export HF_TOKEN=your-key
|
| 169 |
-
|
| 170 |
-
# Run inference on all tasks
|
| 171 |
-
python inference.py
|
| 172 |
```
|
| 173 |
|
| 174 |
-
###
|
| 175 |
|
| 176 |
```bash
|
| 177 |
-
|
| 178 |
-
|
| 179 |
```
|
| 180 |
|
| 181 |
### API Endpoints
|
| 182 |
|
| 183 |
| Endpoint | Method | Description |
|
| 184 |
|----------|--------|-------------|
|
| 185 |
-
| `/` | GET |
|
| 186 |
-
| `/reset` | POST | Reset environment
|
| 187 |
| `/step` | POST | Execute an action |
|
| 188 |
| `/state` | GET | Get current state |
|
| 189 |
-
| `/tasks` | GET | List all tasks with
|
| 190 |
-
| `/grader` | POST | Get
|
| 191 |
-
| `/baseline` | POST | Run baseline
|
| 192 |
-
| `/
|
| 193 |
-
| `/health` | GET | Health check endpoint |
|
| 194 |
|
| 195 |
-
##
|
| 196 |
|
| 197 |
-
```
|
| 198 |
-
|
| 199 |
-
|
| 200 |
-
βββ models.py # Pydantic Action & Observation models
|
| 201 |
-
βββ scenarios.py # 3 task scenarios with randomization support
|
| 202 |
-
βββ client.py # WebSocket client for the environment
|
| 203 |
-
βββ openenv.yaml # OpenEnv metadata (spec v1)
|
| 204 |
-
βββ pyproject.toml # Dependencies & build config
|
| 205 |
-
βββ server/
|
| 206 |
-
β βββ app.py # FastAPI application
|
| 207 |
-
β βββ api_debug_env_environment.py # Core environment logic
|
| 208 |
-
β βββ Dockerfile # Container build
|
| 209 |
-
βββ tests/
|
| 210 |
-
β βββ test_environment.py # 30+ unit & integration tests
|
| 211 |
-
βββ scripts/
|
| 212 |
-
βββ baseline_inference.py # Original baseline agent script
|
| 213 |
```
|
| 214 |
|
| 215 |
-
##
|
| 216 |
|
| 217 |
-
|
| 218 |
-
- Shuffles log entry order so agents can't memorize positions
|
| 219 |
-
- Ensures reproducible episodes for consistent evaluation
|
| 220 |
-
- When `seed=None` (default), returns the canonical scenario for testing
|
| 221 |
|
| 222 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 223 |
|
| 224 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
---
|
| 2 |
+
title: API Debug Env
|
| 3 |
+
emoji: π§
|
| 4 |
+
colorFrom: red
|
| 5 |
+
colorTo: yellow
|
| 6 |
sdk: docker
|
|
|
|
| 7 |
app_port: 8000
|
| 8 |
tags:
|
| 9 |
- openenv
|
| 10 |
---
|
|
|
|
| 11 |
|
| 12 |
+
# π§ API Integration Debugging Environment
|
| 13 |
|
| 14 |
+
> **A real-world environment for training and evaluating AI agents on multi-service API debugging with cascading failures, dynamic state, and multi-dimensional grading.**
|
| 15 |
|
| 16 |
+
[](https://github.com/meta-pytorch/OpenEnv)
|
| 17 |
+
[](https://python.org)
|
| 18 |
+
[](LICENSE)
|
| 19 |
|
| 20 |
+
## Why API Debugging?
|
|
|
|
|
|
|
|
|
|
| 21 |
|
| 22 |
+
API integration failures are one of the most common and time-consuming issues in production software. When Service A calls Service B which calls Service C, a single misconfiguration can cascade through the entire system. Debugging requires:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 23 |
|
| 24 |
+
- **Structured diagnosis**: inspecting logs, configs, and endpoints across services
|
| 25 |
+
- **Dependency awareness**: understanding which service failures affect which downstream services
|
| 26 |
+
- **Strategic reasoning**: fixing upstream issues first to unmask downstream problems
|
| 27 |
|
| 28 |
+
This environment simulates *real-world cascading API failures* β not toy string-matching puzzles.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 29 |
|
| 30 |
+
## How It Works
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 31 |
|
| 32 |
+
```
|
| 33 |
+
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 34 |
+
β Agent Debugging Loop β
|
| 35 |
+
β β
|
| 36 |
+
β 1. reset() β Initial observation with broken service state β
|
| 37 |
+
β 2. step(inspect_logs) β Error logs from target service β
|
| 38 |
+
β 3. step(inspect_config) β Current (broken) configuration β
|
| 39 |
+
β 4. step(inspect_endpoint) β Live error response simulation β
|
| 40 |
+
β 5. step(submit_fix) β Fix validation + cascade resolution β
|
| 41 |
+
β 6. grade() β Multi-dimensional rubric score β
|
| 42 |
+
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 43 |
+
```
|
| 44 |
|
| 45 |
+
### Service Dependency Graphs
|
| 46 |
+
|
| 47 |
+
Each task models a real multi-service system with dependency chains:
|
| 48 |
+
|
| 49 |
+
```mermaid
|
| 50 |
+
graph LR
|
| 51 |
+
A[order_service] --> B[inventory_service]
|
| 52 |
+
B --> C[shipping_service]
|
| 53 |
+
A --> D[api_gateway]
|
| 54 |
+
B --> E[auth_service]
|
| 55 |
+
style A fill:#ff6b6b
|
| 56 |
+
style B fill:#ffd93d
|
| 57 |
+
style C fill:#6bcb77
|
| 58 |
+
style D fill:#6bcb77
|
| 59 |
+
style E fill:#6bcb77
|
| 60 |
```
|
| 61 |
|
| 62 |
+
**Red** = error, **Yellow** = degraded, **Green** = healthy. Fixing upstream issues changes downstream health.
|
| 63 |
+
|
| 64 |
+
## Environment Design
|
| 65 |
+
|
| 66 |
+
### Dynamic State
|
| 67 |
+
|
| 68 |
+
Unlike static environments, our state changes as the agent acts:
|
| 69 |
+
|
| 70 |
+
1. **Service health tracking**: Each service has a status (`healthy`, `degraded`, `error`) that updates when issues are fixed
|
| 71 |
+
2. **Dynamic logs**: After fixing an issue, re-inspecting logs shows *new entries* reflecting the fix
|
| 72 |
+
3. **Cascading effects**: Fixing an upstream issue can change downstream service behavior
|
| 73 |
+
4. **Error trace**: Shows the full error propagation chain, shrinking as issues are fixed
|
| 74 |
+
|
| 75 |
+
### Reward Shaping
|
| 76 |
+
|
| 77 |
+
| Action | Reward | Condition |
|
| 78 |
+
|--------|--------|-----------|
|
| 79 |
+
| `inspect_logs` (new service, finds issues) | +0.15 | New relevant error patterns found |
|
| 80 |
+
| `inspect_logs` (new service, no issues) | +0.05 | Valid inspection but no issues here |
|
| 81 |
+
| `inspect_logs` (repeat, unchanged) | 0.00 | No new information |
|
| 82 |
+
| `inspect_logs` (repeat, dynamic logs) | +0.05 | New logs appeared after a fix |
|
| 83 |
+
| `inspect_config` (service has issues) | +0.05 | Relevant configuration retrieved |
|
| 84 |
+
| `inspect_endpoint` | +0.02 to +0.05 | Endpoint testing |
|
| 85 |
+
| `submit_fix` (correct) | +0.25 | Issue resolved |
|
| 86 |
+
| `submit_fix` (correct + inspected first) | +0.30 | Diagnosis + fix strategy bonus |
|
| 87 |
+
| `submit_fix` (partial β close value) | +0.03 | Right key, close but not exact value |
|
| 88 |
+
| `submit_fix` (wrong) | -0.10 | Incorrect fix |
|
| 89 |
+
| All actions complete | +0.20 | Completion bonus |
|
| 90 |
+
| Every step | -0.01 | Step cost (encourages efficiency) |
|
| 91 |
+
|
| 92 |
## Tasks
|
| 93 |
|
| 94 |
+
### Easy: Payment API Integration (2 issues, 15 steps)
|
| 95 |
+
|
| 96 |
+
Payment client failing to connect to payment gateway. Issues involve authentication and protocol errors.
|
| 97 |
+
|
| 98 |
+
- **Issue pool**: 4 possible issues, 2 selected per episode
|
| 99 |
- **Services**: `payment_client`, `payment_gateway`
|
| 100 |
+
- **Issue types**: Auth header missing, wrong Content-Type, timeout, deprecated endpoint
|
| 101 |
+
|
| 102 |
+
### Medium: Webhook Event Chain (3 issues, 25 steps)
|
| 103 |
+
|
| 104 |
+
Webhook notification system dropping events across a 3-service chain.
|
| 105 |
|
| 106 |
+
- **Issue pool**: 5 possible issues, 3 selected per episode
|
|
|
|
|
|
|
| 107 |
- **Services**: `webhook_sender`, `webhook_receiver`, `notification_service`
|
| 108 |
+
- **Issue types**: Rate limiting, retry misconfiguration, webhook signature, endpoint URL, compression
|
| 109 |
+
- **Dependencies**: Retry issue is masked by rate limit β must fix rate limit first
|
| 110 |
|
| 111 |
+
### Hard: E-Commerce Order Pipeline (5 issues, 40 steps)
|
| 112 |
+
|
| 113 |
+
Complex order processing pipeline with cascading failures across 5 services.
|
| 114 |
+
|
| 115 |
+
- **Issue pool**: 7 possible issues, 5 selected per episode
|
| 116 |
- **Services**: `order_service`, `inventory_service`, `shipping_service`, `api_gateway`, `auth_service`
|
| 117 |
+
- **Issue types**: Deprecated URLs, timeouts, race conditions, expired tokens, missing token refresh, circuit breakers, idempotency
|
| 118 |
+
- **Dependencies**: Timeout masked by wrong URL; token refresh masked by expired token
|
| 119 |
|
| 120 |
+
## Grading Rubric
|
| 121 |
|
| 122 |
+
The grader uses a **multi-dimensional rubric**, not a simple fix ratio:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 123 |
|
| 124 |
+
| Dimension | Weight | Description |
|
| 125 |
+
|-----------|--------|-------------|
|
| 126 |
+
| **Fix Score** | 40% | `issues_fixed / total_issues` |
|
| 127 |
+
| **Diagnosis Score** | 20% | Did the agent inspect the service before fixing it? |
|
| 128 |
+
| **Efficiency Score** | 15% | `remaining_steps / max_steps` β faster is better |
|
| 129 |
+
| **Strategy Score** | 25% | Logical debugging approach: inspect before fix, avoid repeats, follow dependency order, use all action types |
|
| 130 |
|
| 131 |
```
|
| 132 |
+
Final Score = fix Γ 0.40 + diagnosis Γ 0.20 + efficiency Γ 0.15 + strategy Γ 0.25
|
| 133 |
+
Clamped to (0.001, 0.999)
|
|
|
|
| 134 |
```
|
| 135 |
|
| 136 |
+
### Baseline Scores
|
|
|
|
|
|
|
| 137 |
|
| 138 |
+
| Task | Score | Steps | Issues Fixed |
|
| 139 |
+
|------|-------|-------|--------------|
|
| 140 |
+
| Easy | ~0.75 | 7 | 2/2 |
|
| 141 |
+
| Medium | ~0.55 | 10 | 3/3 |
|
| 142 |
+
| Hard | ~0.45 | 15 | 5/5 |
|
| 143 |
|
| 144 |
+
*Baseline uses a rule-based heuristic agent (inspect all β fix all).*
|
| 145 |
|
| 146 |
+
## Action & Observation Spaces
|
| 147 |
|
| 148 |
+
### Action Space
|
|
|
|
| 149 |
|
| 150 |
+
```json
|
| 151 |
+
{
|
| 152 |
+
"action_type": "inspect_logs | inspect_config | inspect_endpoint | submit_fix",
|
| 153 |
+
"target": "service_name",
|
| 154 |
+
"fix_payload": {
|
| 155 |
+
"config_key": "corrected_value"
|
| 156 |
+
}
|
| 157 |
+
}
|
| 158 |
+
```
|
| 159 |
|
| 160 |
+
### Observation Space
|
| 161 |
+
|
| 162 |
+
```json
|
| 163 |
+
{
|
| 164 |
+
"task_id": "easy",
|
| 165 |
+
"task_description": "...",
|
| 166 |
+
"logs": ["[ERROR] ..."],
|
| 167 |
+
"config_snapshot": {"headers": {"Content-Type": "text/plain"}},
|
| 168 |
+
"api_response": {"status": "error", "status_code": 401},
|
| 169 |
+
"service_status": {"payment_client": "error", "payment_gateway": "healthy"},
|
| 170 |
+
"dependency_graph": {"payment_client": ["payment_gateway"]},
|
| 171 |
+
"error_trace": ["[CRITICAL] payment_client: Missing Authorization header"],
|
| 172 |
+
"remaining_steps": 14,
|
| 173 |
+
"issues_found": 1,
|
| 174 |
+
"issues_fixed": 0,
|
| 175 |
+
"issues_total": 2,
|
| 176 |
+
"hints": ["Check headers.Authorization"],
|
| 177 |
+
"available_targets": ["payment_client", "payment_gateway"]
|
| 178 |
+
}
|
| 179 |
+
```
|
| 180 |
|
| 181 |
+
## Example Transcript
|
|
|
|
| 182 |
|
| 183 |
+
```
|
| 184 |
+
>>> reset(task_id="easy")
|
| 185 |
+
task_description: "Payment processing API integration is failing..."
|
| 186 |
+
service_status: {payment_client: "error", payment_gateway: "healthy"}
|
| 187 |
+
error_trace: [
|
| 188 |
+
"[CRITICAL] payment_client: Missing Authorization header",
|
| 189 |
+
" ββ> payment_gateway: All requests rejected with 401",
|
| 190 |
+
"[ERROR] payment_client: Wrong Content-Type (text/plain instead of application/json)",
|
| 191 |
+
" ββ> payment_gateway: Request body parsing fails"
|
| 192 |
+
]
|
| 193 |
+
|
| 194 |
+
>>> step(inspect_logs, target=payment_client)
|
| 195 |
+
logs: ["[ERROR] POST /process -> 401 Unauthorized", ...]
|
| 196 |
+
issues_found: 2, reward: +0.15
|
| 197 |
+
|
| 198 |
+
>>> step(inspect_config, target=payment_client)
|
| 199 |
+
config: {headers: {Content-Type: "text/plain", Accept: "..."}, ...}
|
| 200 |
+
reward: +0.05
|
| 201 |
+
|
| 202 |
+
>>> step(submit_fix, target=payment_client, fix_payload={headers.Authorization: "Bearer sk_key"})
|
| 203 |
+
action_result: "Fix accepted! Fixed 1 issue(s)."
|
| 204 |
+
service_status: {payment_client: "degraded"} # still has content-type issue
|
| 205 |
+
reward: +0.30
|
| 206 |
+
|
| 207 |
+
>>> step(inspect_logs, target=payment_client) # re-inspect shows new logs!
|
| 208 |
+
logs: [...original..., "[INFO] Authorization header set. Retrying request..."]
|
| 209 |
+
reward: +0.05 # reward for checking updated state
|
| 210 |
+
|
| 211 |
+
>>> step(submit_fix, target=payment_client, fix_payload={headers.Content-Type: "application/json"})
|
| 212 |
+
action_result: "Fix accepted! All issues fixed! Episode complete. π"
|
| 213 |
+
service_status: {payment_client: "healthy", payment_gateway: "healthy"}
|
| 214 |
+
error_trace: ["All issues resolved. No error cascades active."]
|
| 215 |
+
reward: +0.50 (fix + completion bonus)
|
| 216 |
+
|
| 217 |
+
>>> grade()
|
| 218 |
+
score: 0.82 (fix=1.0, diagnosis=1.0, efficiency=0.67, strategy=0.8)
|
| 219 |
```
|
| 220 |
|
| 221 |
## Setup & Usage
|
| 222 |
|
| 223 |
+
### Install Dependencies
|
|
|
|
|
|
|
|
|
|
|
|
|
| 224 |
|
| 225 |
```bash
|
| 226 |
+
cd api_debug_env # or project root
|
|
|
|
|
|
|
| 227 |
uv sync
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 228 |
```
|
| 229 |
|
| 230 |
+
### Run Locally
|
| 231 |
|
| 232 |
```bash
|
| 233 |
+
uvicorn server.app:app --reload --port 8000
|
|
|
|
|
|
|
| 234 |
```
|
| 235 |
|
| 236 |
+
### Run Tests
|
| 237 |
|
| 238 |
```bash
|
| 239 |
+
python -m pytest tests/ -v --tb=short
|
|
|
|
|
|
|
|
|
|
|
|
|
| 240 |
```
|
| 241 |
|
| 242 |
+
### Docker
|
| 243 |
|
| 244 |
```bash
|
| 245 |
+
docker build -t api_debug_env -f server/Dockerfile .
|
| 246 |
+
docker run -p 8000:8000 api_debug_env
|
| 247 |
```
|
| 248 |
|
| 249 |
### API Endpoints
|
| 250 |
|
| 251 |
| Endpoint | Method | Description |
|
| 252 |
|----------|--------|-------------|
|
| 253 |
+
| `/` | GET | Environment info + status |
|
| 254 |
+
| `/reset` | POST | Reset environment |
|
| 255 |
| `/step` | POST | Execute an action |
|
| 256 |
| `/state` | GET | Get current state |
|
| 257 |
+
| `/tasks` | GET | List all tasks with schemas |
|
| 258 |
+
| `/grader` | POST | Get grading score |
|
| 259 |
+
| `/baseline` | POST | Run baseline agent |
|
| 260 |
+
| `/health` | GET | Health check |
|
|
|
|
| 261 |
|
| 262 |
+
### Run Inference
|
| 263 |
|
| 264 |
+
```bash
|
| 265 |
+
export HF_TOKEN=your_token_here
|
| 266 |
+
python inference.py
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 267 |
```
|
| 268 |
|
| 269 |
+
## Design Philosophy
|
| 270 |
|
| 271 |
+
This environment is designed to be useful for **RL/agent training**, not just evaluation:
|
|
|
|
|
|
|
|
|
|
| 272 |
|
| 273 |
+
1. **Dense Rewards**: Every action type can yield positive or negative reward, enabling gradient-based training
|
| 274 |
+
2. **Progressive Difficulty**: EasyβMediumβHard with increasing service count and dependency complexity
|
| 275 |
+
3. **Partial Credit**: Close-but-wrong fixes get feedback instead of binary rejection
|
| 276 |
+
4. **Strategy Incentives**: The multi-dimensional rubric rewards *how* the agent solves, not just *what* it solves
|
| 277 |
+
5. **Stochastic**: Seed-based randomization prevents policy overfitting to memorized scenarios
|
| 278 |
+
6. **Cascading Dynamics**: Upstream fixes change downstream state, requiring multi-step reasoning
|
| 279 |
|
| 280 |
+
## Project Structure
|
| 281 |
+
|
| 282 |
+
```
|
| 283 |
+
βββ models.py # Pydantic Action & Observation definitions
|
| 284 |
+
βββ scenarios.py # Task scenarios with dependency graphs
|
| 285 |
+
βββ inference.py # MANDATORY baseline inference script
|
| 286 |
+
βββ openenv.yaml # OpenEnv metadata
|
| 287 |
+
βββ pyproject.toml # Dependencies
|
| 288 |
+
βββ server/
|
| 289 |
+
β βββ api_debug_env_environment.py # Core environment logic
|
| 290 |
+
β βββ app.py # FastAPI endpoints
|
| 291 |
+
β βββ Dockerfile # HF Spaces deployment
|
| 292 |
+
βββ tests/
|
| 293 |
+
βββ test_environment.py # 48+ unit & integration tests
|
| 294 |
+
```
|
__init__.py
CHANGED
|
@@ -6,8 +6,18 @@
|
|
| 6 |
|
| 7 |
"""Api Debug Env Environment."""
|
| 8 |
|
| 9 |
-
|
| 10 |
-
from .
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 11 |
|
| 12 |
__all__ = [
|
| 13 |
"ApiDebugAction",
|
|
|
|
| 6 |
|
| 7 |
"""Api Debug Env Environment."""
|
| 8 |
|
| 9 |
+
try:
|
| 10 |
+
from .client import ApiDebugEnv
|
| 11 |
+
from .models import ApiDebugAction, ApiDebugObservation
|
| 12 |
+
except ImportError:
|
| 13 |
+
# When running tests or scripts directly from the project root,
|
| 14 |
+
# relative imports won't work. Fall back to absolute imports.
|
| 15 |
+
try:
|
| 16 |
+
from client import ApiDebugEnv
|
| 17 |
+
from models import ApiDebugAction, ApiDebugObservation
|
| 18 |
+
except ImportError:
|
| 19 |
+
ApiDebugEnv = None # type: ignore
|
| 20 |
+
from models import ApiDebugAction, ApiDebugObservation
|
| 21 |
|
| 22 |
__all__ = [
|
| 23 |
"ApiDebugAction",
|
__pycache__/__init__.cpython-313.pyc
DELETED
|
Binary file (367 Bytes)
|
|
|
__pycache__/client.cpython-313.pyc
DELETED
|
Binary file (3.65 kB)
|
|
|
__pycache__/models.cpython-313.pyc
DELETED
|
Binary file (3.49 kB)
|
|
|
__pycache__/scenarios.cpython-313.pyc
DELETED
|
Binary file (13.1 kB)
|
|
|
inference.py
CHANGED
|
@@ -46,31 +46,39 @@ MODEL_NAME = os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-72B-Instruct"
|
|
| 46 |
BENCHMARK = "api_debug_env"
|
| 47 |
MAX_STEPS = 40 # max across all tasks (hard has 40)
|
| 48 |
TEMPERATURE = 0.3
|
| 49 |
-
MAX_TOKENS =
|
| 50 |
SUCCESS_SCORE_THRESHOLD = 0.1
|
| 51 |
|
| 52 |
SYSTEM_PROMPT = textwrap.dedent("""
|
| 53 |
You are an expert API debugging agent. You are tasked with diagnosing and fixing
|
| 54 |
-
broken API integrations
|
| 55 |
|
| 56 |
-
Available
|
| 57 |
{
|
| 58 |
"action_type": "inspect_logs" | "inspect_config" | "inspect_endpoint" | "submit_fix",
|
| 59 |
"target": "<service_name>",
|
| 60 |
"fix_payload": { ... } // required only for submit_fix
|
| 61 |
}
|
| 62 |
|
| 63 |
-
Strategy:
|
| 64 |
-
1.
|
| 65 |
-
2.
|
| 66 |
-
3.
|
| 67 |
-
4. Submit
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 74 |
""").strip()
|
| 75 |
|
| 76 |
|
|
@@ -100,26 +108,46 @@ def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> No
|
|
| 100 |
# βββ LLM Interaction ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 101 |
|
| 102 |
def build_user_prompt(obs: ApiDebugObservation, step: int) -> str:
|
| 103 |
-
"""Build a prompt from the current observation."""
|
| 104 |
parts = [
|
| 105 |
-
f"Step
|
| 106 |
f"Task: {obs.task_description}",
|
| 107 |
f"Remaining steps: {obs.remaining_steps}",
|
| 108 |
f"Issues found: {obs.issues_found}/{obs.issues_total}",
|
| 109 |
f"Issues fixed: {obs.issues_fixed}/{obs.issues_total}",
|
| 110 |
f"Last action result: {obs.action_result}",
|
| 111 |
-
f"Available targets: {obs.available_targets}",
|
| 112 |
]
|
| 113 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 114 |
if obs.logs:
|
| 115 |
parts.append("Logs:\n" + "\n".join(obs.logs))
|
| 116 |
if obs.config_snapshot:
|
| 117 |
-
parts.append(f"Config:
|
| 118 |
if obs.api_response:
|
| 119 |
-
parts.append(f"API Response:
|
| 120 |
if obs.hints:
|
| 121 |
parts.append(f"Hints: {'; '.join(obs.hints)}")
|
| 122 |
|
|
|
|
| 123 |
return "\n".join(parts)
|
| 124 |
|
| 125 |
|
|
@@ -152,6 +180,14 @@ def get_model_action(
|
|
| 152 |
json_end = text.rfind("}") + 1
|
| 153 |
if json_start >= 0 and json_end > json_start:
|
| 154 |
text = text[json_start:json_end]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 155 |
|
| 156 |
action_json = json.loads(text)
|
| 157 |
messages.append({"role": "assistant", "content": json.dumps(action_json)})
|
|
@@ -164,6 +200,9 @@ def get_model_action(
|
|
| 164 |
except json.JSONDecodeError as exc:
|
| 165 |
print(f"[DEBUG] JSON parse failed (attempt {attempt+1}/{max_retries}): {exc}", flush=True)
|
| 166 |
last_error = exc
|
|
|
|
|
|
|
|
|
|
| 167 |
except Exception as exc:
|
| 168 |
print(f"[DEBUG] API call failed (attempt {attempt+1}/{max_retries}): {exc}", flush=True)
|
| 169 |
last_error = exc
|
|
|
|
| 46 |
BENCHMARK = "api_debug_env"
|
| 47 |
MAX_STEPS = 40 # max across all tasks (hard has 40)
|
| 48 |
TEMPERATURE = 0.3
|
| 49 |
+
MAX_TOKENS = 1024
|
| 50 |
SUCCESS_SCORE_THRESHOLD = 0.1
|
| 51 |
|
| 52 |
SYSTEM_PROMPT = textwrap.dedent("""
|
| 53 |
You are an expert API debugging agent. You are tasked with diagnosing and fixing
|
| 54 |
+
broken API integrations in a multi-service environment.
|
| 55 |
|
| 56 |
+
## Available Actions (respond with JSON only):
|
| 57 |
{
|
| 58 |
"action_type": "inspect_logs" | "inspect_config" | "inspect_endpoint" | "submit_fix",
|
| 59 |
"target": "<service_name>",
|
| 60 |
"fix_payload": { ... } // required only for submit_fix
|
| 61 |
}
|
| 62 |
|
| 63 |
+
## Debugging Strategy (follow this order):
|
| 64 |
+
1. **Inspect logs** on each service to identify error patterns and root causes
|
| 65 |
+
2. **Inspect config** to understand current (broken) settings
|
| 66 |
+
3. **Inspect endpoint** to see actual error responses if needed
|
| 67 |
+
4. **Submit fix** with corrected configuration values
|
| 68 |
+
|
| 69 |
+
## Key Rules:
|
| 70 |
+
- ALWAYS inspect logs and configs BEFORE submitting fixes
|
| 71 |
+
- Pay attention to the service dependency graph β upstream failures cascade downstream
|
| 72 |
+
- Fix upstream issues first (they may mask downstream problems)
|
| 73 |
+
- When submitting a fix, use the exact key format from the config
|
| 74 |
+
- For nested keys: {"headers.Authorization": "Bearer <token>"}
|
| 75 |
+
- For nested objects: {"retry": {"max_retries": 3, "backoff_factor": 2}}
|
| 76 |
+
- Check service_status to see which services are healthy/degraded/error
|
| 77 |
+
- After fixing, re-inspect logs on affected services β new logs appear showing the fix effect
|
| 78 |
+
|
| 79 |
+
## Response Format:
|
| 80 |
+
Respond with ONLY a single JSON object. No text, no explanation, no markdown.
|
| 81 |
+
Example: {"action_type": "inspect_logs", "target": "payment_client"}
|
| 82 |
""").strip()
|
| 83 |
|
| 84 |
|
|
|
|
| 108 |
# βββ LLM Interaction ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 109 |
|
| 110 |
def build_user_prompt(obs: ApiDebugObservation, step: int) -> str:
|
| 111 |
+
"""Build a detailed prompt from the current observation."""
|
| 112 |
parts = [
|
| 113 |
+
f"=== Step {step} ===",
|
| 114 |
f"Task: {obs.task_description}",
|
| 115 |
f"Remaining steps: {obs.remaining_steps}",
|
| 116 |
f"Issues found: {obs.issues_found}/{obs.issues_total}",
|
| 117 |
f"Issues fixed: {obs.issues_fixed}/{obs.issues_total}",
|
| 118 |
f"Last action result: {obs.action_result}",
|
|
|
|
| 119 |
]
|
| 120 |
|
| 121 |
+
# Show service health (dynamic state)
|
| 122 |
+
if obs.service_status:
|
| 123 |
+
status_str = ", ".join(f"{svc}={status}" for svc, status in obs.service_status.items())
|
| 124 |
+
parts.append(f"Service health: {status_str}")
|
| 125 |
+
|
| 126 |
+
# Show dependency graph
|
| 127 |
+
if obs.dependency_graph:
|
| 128 |
+
deps = []
|
| 129 |
+
for svc, dep_list in obs.dependency_graph.items():
|
| 130 |
+
if dep_list:
|
| 131 |
+
deps.append(f" {svc} -> {', '.join(dep_list)}")
|
| 132 |
+
if deps:
|
| 133 |
+
parts.append("Service dependencies:\n" + "\n".join(deps))
|
| 134 |
+
|
| 135 |
+
# Show error cascades
|
| 136 |
+
if obs.error_trace:
|
| 137 |
+
parts.append("Active error cascades:\n" + "\n".join(f" {t}" for t in obs.error_trace[:5]))
|
| 138 |
+
|
| 139 |
+
parts.append(f"Available targets: {obs.available_targets}")
|
| 140 |
+
|
| 141 |
if obs.logs:
|
| 142 |
parts.append("Logs:\n" + "\n".join(obs.logs))
|
| 143 |
if obs.config_snapshot:
|
| 144 |
+
parts.append(f"Config:\n{json.dumps(obs.config_snapshot, indent=2)}")
|
| 145 |
if obs.api_response:
|
| 146 |
+
parts.append(f"API Response:\n{json.dumps(obs.api_response, indent=2)}")
|
| 147 |
if obs.hints:
|
| 148 |
parts.append(f"Hints: {'; '.join(obs.hints)}")
|
| 149 |
|
| 150 |
+
parts.append("\nDecide your next action. Respond with ONLY a JSON object.")
|
| 151 |
return "\n".join(parts)
|
| 152 |
|
| 153 |
|
|
|
|
| 180 |
json_end = text.rfind("}") + 1
|
| 181 |
if json_start >= 0 and json_end > json_start:
|
| 182 |
text = text[json_start:json_end]
|
| 183 |
+
elif text.startswith("{"):
|
| 184 |
+
pass # Already JSON
|
| 185 |
+
else:
|
| 186 |
+
# Try to extract JSON from mixed text
|
| 187 |
+
json_start = text.find("{")
|
| 188 |
+
json_end = text.rfind("}") + 1
|
| 189 |
+
if json_start >= 0 and json_end > json_start:
|
| 190 |
+
text = text[json_start:json_end]
|
| 191 |
|
| 192 |
action_json = json.loads(text)
|
| 193 |
messages.append({"role": "assistant", "content": json.dumps(action_json)})
|
|
|
|
| 200 |
except json.JSONDecodeError as exc:
|
| 201 |
print(f"[DEBUG] JSON parse failed (attempt {attempt+1}/{max_retries}): {exc}", flush=True)
|
| 202 |
last_error = exc
|
| 203 |
+
# Add corrective message
|
| 204 |
+
messages.append({"role": "assistant", "content": text if 'text' in dir() else ""})
|
| 205 |
+
messages.append({"role": "user", "content": "Invalid response. Respond with ONLY a valid JSON object like: {\"action_type\": \"inspect_logs\", \"target\": \"payment_client\"}"})
|
| 206 |
except Exception as exc:
|
| 207 |
print(f"[DEBUG] API call failed (attempt {attempt+1}/{max_retries}): {exc}", flush=True)
|
| 208 |
last_error = exc
|
models.py
CHANGED
|
@@ -9,9 +9,12 @@ Data models for the API Integration Debugging Environment.
|
|
| 9 |
|
| 10 |
An agent must diagnose and fix broken API integrations by reading error logs,
|
| 11 |
inspecting configurations, and writing corrected API calls.
|
|
|
|
|
|
|
|
|
|
| 12 |
"""
|
| 13 |
|
| 14 |
-
from typing import Dict, List, Optional
|
| 15 |
|
| 16 |
from openenv.core.env_server.types import Action, Observation
|
| 17 |
from pydantic import Field
|
|
@@ -47,7 +50,7 @@ class ApiDebugObservation(Observation):
|
|
| 47 |
What the agent sees after each action.
|
| 48 |
|
| 49 |
Provides error logs, configuration snapshots, API responses,
|
| 50 |
-
|
| 51 |
"""
|
| 52 |
|
| 53 |
# Environment context
|
|
@@ -60,6 +63,20 @@ class ApiDebugObservation(Observation):
|
|
| 60 |
api_response: Optional[Dict] = Field(default=None, description="Response from testing the current endpoint config")
|
| 61 |
hints: List[str] = Field(default_factory=list, description="Progressive hints based on step count")
|
| 62 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 63 |
# Progress tracking
|
| 64 |
remaining_steps: int = Field(default=0, description="Steps remaining before episode timeout")
|
| 65 |
issues_found: int = Field(default=0, description="Issues the agent has correctly identified so far")
|
|
|
|
| 9 |
|
| 10 |
An agent must diagnose and fix broken API integrations by reading error logs,
|
| 11 |
inspecting configurations, and writing corrected API calls.
|
| 12 |
+
|
| 13 |
+
The observation space includes dynamic state: service health, dependency graph,
|
| 14 |
+
and error traces that update as the agent fixes issues.
|
| 15 |
"""
|
| 16 |
|
| 17 |
+
from typing import Any, Dict, List, Optional
|
| 18 |
|
| 19 |
from openenv.core.env_server.types import Action, Observation
|
| 20 |
from pydantic import Field
|
|
|
|
| 50 |
What the agent sees after each action.
|
| 51 |
|
| 52 |
Provides error logs, configuration snapshots, API responses,
|
| 53 |
+
service health status, dependency graph, and progress tracking.
|
| 54 |
"""
|
| 55 |
|
| 56 |
# Environment context
|
|
|
|
| 63 |
api_response: Optional[Dict] = Field(default=None, description="Response from testing the current endpoint config")
|
| 64 |
hints: List[str] = Field(default_factory=list, description="Progressive hints based on step count")
|
| 65 |
|
| 66 |
+
# Dynamic state (NEW β makes the environment interactive)
|
| 67 |
+
service_status: Dict[str, str] = Field(
|
| 68 |
+
default_factory=dict,
|
| 69 |
+
description="Current health of each service: 'healthy', 'degraded', 'error', 'unreachable'",
|
| 70 |
+
)
|
| 71 |
+
dependency_graph: Dict[str, List[str]] = Field(
|
| 72 |
+
default_factory=dict,
|
| 73 |
+
description="Service dependency graph: {service: [services it depends on]}",
|
| 74 |
+
)
|
| 75 |
+
error_trace: List[str] = Field(
|
| 76 |
+
default_factory=list,
|
| 77 |
+
description="Error propagation chain showing how failures cascade across services",
|
| 78 |
+
)
|
| 79 |
+
|
| 80 |
# Progress tracking
|
| 81 |
remaining_steps: int = Field(default=0, description="Steps remaining before episode timeout")
|
| 82 |
issues_found: int = Field(default=0, description="Issues the agent has correctly identified so far")
|
scenarios.py
CHANGED
|
@@ -7,12 +7,15 @@
|
|
| 7 |
"""
|
| 8 |
Scenario definitions for the API Integration Debugging Environment.
|
| 9 |
|
| 10 |
-
Each scenario
|
| 11 |
-
|
|
|
|
|
|
|
|
|
|
| 12 |
"""
|
| 13 |
|
| 14 |
from dataclasses import dataclass, field
|
| 15 |
-
from typing import Any, Dict, List, Optional
|
| 16 |
import random
|
| 17 |
|
| 18 |
|
|
@@ -25,11 +28,32 @@ class Issue:
|
|
| 25 |
expected_fix: Dict[str, Any]
|
| 26 |
fix_key: str # The key in the config that needs fixing
|
| 27 |
log_hint: str # Log line that hints at this issue
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 28 |
|
| 29 |
|
| 30 |
@dataclass
|
| 31 |
class Scenario:
|
| 32 |
-
"""A complete API debugging scenario."""
|
| 33 |
task_id: str
|
| 34 |
difficulty: str
|
| 35 |
description: str
|
|
@@ -38,6 +62,15 @@ class Scenario:
|
|
| 38 |
configs: Dict[str, Dict[str, Any]]
|
| 39 |
logs: Dict[str, List[str]]
|
| 40 |
issues: List[Issue]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 41 |
|
| 42 |
|
| 43 |
def get_scenario(task_id: str, seed: Optional[int] = None) -> Scenario:
|
|
@@ -46,10 +79,9 @@ def get_scenario(task_id: str, seed: Optional[int] = None) -> Scenario:
|
|
| 46 |
|
| 47 |
Args:
|
| 48 |
task_id: One of 'easy', 'medium', 'hard'
|
| 49 |
-
seed: Optional seed for deterministic but varied
|
| 50 |
-
When provided, a random subset of issues
|
| 51 |
-
|
| 52 |
-
is returned (deterministic, for testing).
|
| 53 |
"""
|
| 54 |
scenario_builders = {
|
| 55 |
"easy": _easy_scenario,
|
|
@@ -59,22 +91,7 @@ def get_scenario(task_id: str, seed: Optional[int] = None) -> Scenario:
|
|
| 59 |
if task_id not in scenario_builders:
|
| 60 |
raise ValueError(f"Unknown task_id: {task_id}. Must be one of: {list(scenario_builders.keys())}")
|
| 61 |
|
| 62 |
-
scenario = scenario_builders[task_id]()
|
| 63 |
-
|
| 64 |
-
# If seed is provided, randomize the scenario
|
| 65 |
-
if seed is not None:
|
| 66 |
-
rng = random.Random(seed)
|
| 67 |
-
# Shuffle log entries for each service (order shouldn't matter)
|
| 68 |
-
for service_logs in scenario.logs.values():
|
| 69 |
-
rng.shuffle(service_logs)
|
| 70 |
-
# Randomize timestamps in log entries
|
| 71 |
-
for service, log_list in scenario.logs.items():
|
| 72 |
-
new_logs = []
|
| 73 |
-
for log_line in log_list:
|
| 74 |
-
# Replace dates with seed-derived dates to vary output
|
| 75 |
-
new_logs.append(log_line)
|
| 76 |
-
scenario.logs[service] = new_logs
|
| 77 |
-
|
| 78 |
return scenario
|
| 79 |
|
| 80 |
|
|
@@ -83,320 +100,978 @@ def get_all_task_ids() -> List[str]:
|
|
| 83 |
return ["easy", "medium", "hard"]
|
| 84 |
|
| 85 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 86 |
# βββ Easy Scenario βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 87 |
|
| 88 |
-
def _easy_scenario() -> Scenario:
|
| 89 |
"""
|
| 90 |
-
Easy:
|
| 91 |
-
Agent must
|
|
|
|
|
|
|
| 92 |
"""
|
| 93 |
-
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
|
| 97 |
-
"
|
| 98 |
-
"
|
| 99 |
-
"
|
| 100 |
-
|
| 101 |
-
|
| 102 |
-
|
| 103 |
-
|
| 104 |
-
|
| 105 |
-
"
|
| 106 |
-
"
|
| 107 |
-
|
| 108 |
-
|
| 109 |
-
|
| 110 |
-
|
| 111 |
-
"timeout": 30,
|
| 112 |
-
"retry_count": 3,
|
| 113 |
},
|
| 114 |
-
|
| 115 |
-
|
| 116 |
-
|
| 117 |
-
|
| 118 |
-
|
| 119 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 120 |
},
|
| 121 |
-
|
| 122 |
-
|
| 123 |
-
"
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 124 |
"[ERROR] 2026-03-25T10:15:23Z POST /process -> 401 Unauthorized",
|
| 125 |
"[ERROR] 2026-03-25T10:15:23Z Response: {'error': 'Missing or invalid Authorization header'}",
|
| 126 |
"[WARN] 2026-03-25T10:15:22Z Request headers: Content-Type=text/plain, Accept=application/json",
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 127 |
"[ERROR] 2026-03-25T10:15:24Z POST /process -> 415 Unsupported Media Type",
|
| 128 |
"[ERROR] 2026-03-25T10:15:24Z Response: {'error': 'Content-Type must be application/json'}",
|
| 129 |
-
|
| 130 |
-
|
| 131 |
-
|
| 132 |
-
|
| 133 |
-
|
| 134 |
-
|
| 135 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 136 |
},
|
| 137 |
-
|
| 138 |
-
|
| 139 |
-
|
| 140 |
-
|
| 141 |
-
|
| 142 |
-
|
| 143 |
-
|
| 144 |
-
|
| 145 |
-
|
| 146 |
-
|
| 147 |
-
|
| 148 |
-
|
| 149 |
-
|
| 150 |
-
|
| 151 |
-
|
| 152 |
-
|
| 153 |
-
|
| 154 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 155 |
)
|
| 156 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 157 |
|
| 158 |
# βββ Medium Scenario βββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 159 |
|
| 160 |
-
def _medium_scenario() -> Scenario:
|
| 161 |
"""
|
| 162 |
-
Medium: Webhook chain with
|
| 163 |
-
|
|
|
|
|
|
|
|
|
|
| 164 |
"""
|
| 165 |
-
|
| 166 |
-
|
| 167 |
-
|
| 168 |
-
|
| 169 |
-
"
|
| 170 |
-
"
|
| 171 |
-
"
|
| 172 |
-
"
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 173 |
),
|
| 174 |
-
|
| 175 |
-
|
| 176 |
-
|
| 177 |
-
"
|
| 178 |
-
|
| 179 |
-
"
|
| 180 |
-
|
| 181 |
-
|
| 182 |
-
},
|
| 183 |
-
"rate_limit": {
|
| 184 |
-
"requests_per_second": 100, # BUG: too high, receiver allows 10/s
|
| 185 |
-
"burst_size": 200,
|
| 186 |
-
},
|
| 187 |
-
"retry": {
|
| 188 |
-
"max_retries": 1, # BUG: should be at least 3
|
| 189 |
-
"backoff_factor": 0, # BUG: no backoff
|
| 190 |
-
"retry_on_status": [500], # BUG: should also retry on 429
|
| 191 |
-
},
|
| 192 |
-
"signing_secret": "whsec_abc123secret",
|
| 193 |
},
|
| 194 |
-
"
|
| 195 |
-
|
| 196 |
-
|
| 197 |
-
|
| 198 |
-
|
| 199 |
-
|
| 200 |
-
|
| 201 |
-
|
| 202 |
-
"
|
| 203 |
-
"
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 204 |
},
|
| 205 |
-
|
| 206 |
-
|
| 207 |
-
|
| 208 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 209 |
},
|
|
|
|
|
|
|
| 210 |
},
|
| 211 |
-
|
| 212 |
-
"
|
| 213 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 214 |
"[ERROR] 2026-03-25T11:00:01Z Rate limited. Retry-After: 5s",
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 215 |
"[WARN] 2026-03-25T11:00:02Z Retry attempt 1/1 failed. No more retries.",
|
| 216 |
"[ERROR] 2026-03-25T11:00:03Z Event evt_12345 dropped after retry exhaustion",
|
| 217 |
-
|
| 218 |
-
|
| 219 |
-
|
| 220 |
-
"webhook_receiver": [
|
| 221 |
-
"[WARN] 2026-03-25T11:00:01Z Rate limit exceeded: 100 req/s > 10 req/s allowed",
|
| 222 |
"[ERROR] 2026-03-25T11:00:02Z Signature validation FAILED: received empty signature",
|
| 223 |
"[WARN] 2026-03-25T11:00:02Z Dropping event: invalid signature from webhook_sender",
|
| 224 |
-
|
| 225 |
-
|
| 226 |
-
|
| 227 |
-
"[
|
| 228 |
-
"[
|
| 229 |
-
]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 230 |
},
|
| 231 |
-
|
| 232 |
-
|
| 233 |
-
|
| 234 |
-
|
| 235 |
-
|
| 236 |
-
|
| 237 |
-
|
| 238 |
-
|
| 239 |
-
|
| 240 |
-
|
| 241 |
-
|
| 242 |
-
|
| 243 |
-
|
| 244 |
-
|
| 245 |
-
|
| 246 |
-
|
| 247 |
-
|
| 248 |
-
|
| 249 |
-
|
| 250 |
-
|
| 251 |
-
|
| 252 |
-
|
| 253 |
-
|
| 254 |
-
|
| 255 |
-
|
| 256 |
-
|
| 257 |
-
|
| 258 |
-
|
| 259 |
-
|
| 260 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 261 |
)
|
| 262 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 263 |
|
| 264 |
# βββ Hard Scenario ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 265 |
|
| 266 |
-
def _hard_scenario() -> Scenario:
|
| 267 |
"""
|
| 268 |
-
Hard:
|
| 269 |
-
|
| 270 |
-
|
|
|
|
|
|
|
|
|
|
| 271 |
"""
|
| 272 |
-
|
| 273 |
-
|
| 274 |
-
|
| 275 |
-
|
| 276 |
-
"
|
| 277 |
-
"
|
| 278 |
-
"
|
| 279 |
-
"
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 280 |
),
|
| 281 |
-
|
| 282 |
-
|
| 283 |
-
|
| 284 |
-
"
|
| 285 |
-
|
| 286 |
-
|
| 287 |
-
|
| 288 |
-
|
| 289 |
-
|
| 290 |
-
|
| 291 |
-
|
| 292 |
-
|
| 293 |
-
"
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 294 |
},
|
| 295 |
-
|
| 296 |
-
|
| 297 |
-
|
| 298 |
-
|
| 299 |
-
|
| 300 |
-
|
| 301 |
-
|
| 302 |
-
|
| 303 |
-
|
| 304 |
-
|
| 305 |
-
|
| 306 |
-
"
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 307 |
},
|
| 308 |
-
"
|
| 309 |
-
|
| 310 |
-
|
| 311 |
-
|
| 312 |
-
"
|
| 313 |
-
"
|
| 314 |
-
"status": "healthy",
|
| 315 |
},
|
| 316 |
-
|
| 317 |
-
|
| 318 |
-
|
| 319 |
-
|
| 320 |
-
|
| 321 |
-
|
| 322 |
-
|
| 323 |
-
|
|
|
|
|
|
|
| 324 |
},
|
| 325 |
-
"
|
| 326 |
-
|
| 327 |
-
|
| 328 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 329 |
},
|
|
|
|
| 330 |
},
|
| 331 |
-
|
| 332 |
-
"
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 333 |
"[ERROR] 2026-03-25T12:00:05Z POST inventory.internal/v1/check -> 301 Moved Permanently",
|
| 334 |
"[ERROR] 2026-03-25T12:00:05Z Response: {'error': 'Endpoint deprecated. Use /v2/reserve'}",
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 335 |
"[ERROR] 2026-03-25T12:00:07Z Timeout after 2s waiting for inventory response",
|
| 336 |
"[ERROR] 2026-03-25T12:00:07Z Order ord_999 failed: inventory check timed out",
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 337 |
"[WARN] 2026-03-25T12:00:08Z Synchronous mode: blocking on inventory response",
|
| 338 |
"[ERROR] 2026-03-25T12:00:09Z Race condition: order ord_998 processed before ord_997 completed",
|
| 339 |
-
]
|
| 340 |
-
|
| 341 |
-
|
| 342 |
-
"[WARN] 2026-03-25T12:00:06Z Processing reservation... avg time: 4s",
|
| 343 |
"[ERROR] 2026-03-25T12:00:10Z POST shipping.internal/v1/create -> 401 Unauthorized",
|
| 344 |
"[ERROR] 2026-03-25T12:00:10Z Auth token expired_token_456 is no longer valid",
|
| 345 |
"[ERROR] 2026-03-25T12:00:10Z Cannot create shipment: authentication failed",
|
| 346 |
-
]
|
| 347 |
-
|
| 348 |
-
"[WARN] 2026-03-25T12:00:10Z Rejected request: token 'expired_token_456' is expired"
|
| 349 |
-
|
| 350 |
-
|
| 351 |
-
|
| 352 |
-
|
| 353 |
-
|
| 354 |
-
|
| 355 |
-
|
| 356 |
-
|
| 357 |
-
|
| 358 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 359 |
},
|
| 360 |
-
|
| 361 |
-
|
| 362 |
-
|
| 363 |
-
|
| 364 |
-
|
| 365 |
-
|
| 366 |
-
|
| 367 |
-
|
| 368 |
-
|
| 369 |
-
|
| 370 |
-
|
| 371 |
-
|
| 372 |
-
|
| 373 |
-
|
| 374 |
-
|
| 375 |
-
|
| 376 |
-
|
| 377 |
-
|
| 378 |
-
|
| 379 |
-
|
| 380 |
-
|
| 381 |
-
|
| 382 |
-
|
| 383 |
-
|
| 384 |
-
|
| 385 |
-
|
| 386 |
-
|
| 387 |
-
|
| 388 |
-
|
| 389 |
-
|
| 390 |
-
|
| 391 |
-
|
| 392 |
-
|
| 393 |
-
|
| 394 |
-
|
| 395 |
-
|
| 396 |
-
|
| 397 |
-
|
| 398 |
-
|
| 399 |
-
|
| 400 |
-
|
| 401 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 402 |
)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 7 |
"""
|
| 8 |
Scenario definitions for the API Integration Debugging Environment.
|
| 9 |
|
| 10 |
+
Each scenario models a realistic multi-service API ecosystem with:
|
| 11 |
+
- Service dependency graphs (upstream/downstream relationships)
|
| 12 |
+
- Cascading failures (upstream bugs propagate downstream)
|
| 13 |
+
- Dynamic logs that update when issues are fixed
|
| 14 |
+
- Expanded issue pools for seed-based random subset selection
|
| 15 |
"""
|
| 16 |
|
| 17 |
from dataclasses import dataclass, field
|
| 18 |
+
from typing import Any, Dict, List, Optional, Tuple
|
| 19 |
import random
|
| 20 |
|
| 21 |
|
|
|
|
| 28 |
expected_fix: Dict[str, Any]
|
| 29 |
fix_key: str # The key in the config that needs fixing
|
| 30 |
log_hint: str # Log line that hints at this issue
|
| 31 |
+
# --- New fields for cascading failures ---
|
| 32 |
+
depends_on: List[str] = field(default_factory=list)
|
| 33 |
+
# Issues that must be fixed before this one can be diagnosed
|
| 34 |
+
cascade_effects: Dict[str, str] = field(default_factory=dict)
|
| 35 |
+
# service -> error message caused by this issue being unfixed
|
| 36 |
+
category: str = "configuration"
|
| 37 |
+
# Issue category: configuration, authentication, networking, protocol
|
| 38 |
+
severity: str = "error"
|
| 39 |
+
# Severity: error, warning, critical
|
| 40 |
+
root_cause_explanation: str = ""
|
| 41 |
+
# Detailed explanation of why this issue occurs (for grading diagnosis quality)
|
| 42 |
+
|
| 43 |
+
|
| 44 |
+
@dataclass
|
| 45 |
+
class ServiceNode:
|
| 46 |
+
"""A node in the service dependency graph."""
|
| 47 |
+
name: str
|
| 48 |
+
depends_on: List[str] = field(default_factory=list)
|
| 49 |
+
# Services this one calls (upstream dependencies)
|
| 50 |
+
health_status: str = "degraded"
|
| 51 |
+
# healthy, degraded, error, unreachable
|
| 52 |
|
| 53 |
|
| 54 |
@dataclass
|
| 55 |
class Scenario:
|
| 56 |
+
"""A complete API debugging scenario with dependency graph."""
|
| 57 |
task_id: str
|
| 58 |
difficulty: str
|
| 59 |
description: str
|
|
|
|
| 62 |
configs: Dict[str, Dict[str, Any]]
|
| 63 |
logs: Dict[str, List[str]]
|
| 64 |
issues: List[Issue]
|
| 65 |
+
# --- New fields ---
|
| 66 |
+
service_graph: Dict[str, ServiceNode] = field(default_factory=dict)
|
| 67 |
+
# Service dependency graph
|
| 68 |
+
dynamic_logs: Dict[str, Dict[str, List[str]]] = field(default_factory=dict)
|
| 69 |
+
# service -> {issue_id: [new logs when fixed]}
|
| 70 |
+
optimal_fix_order: List[str] = field(default_factory=list)
|
| 71 |
+
# Optimal order to fix issues (for strategy scoring)
|
| 72 |
+
context: str = ""
|
| 73 |
+
# Additional scenario context for the agent
|
| 74 |
|
| 75 |
|
| 76 |
def get_scenario(task_id: str, seed: Optional[int] = None) -> Scenario:
|
|
|
|
| 79 |
|
| 80 |
Args:
|
| 81 |
task_id: One of 'easy', 'medium', 'hard'
|
| 82 |
+
seed: Optional seed for deterministic but varied scenarios.
|
| 83 |
+
When provided, selects a random subset of issues from the pool
|
| 84 |
+
and randomizes log order. When None, returns the canonical scenario.
|
|
|
|
| 85 |
"""
|
| 86 |
scenario_builders = {
|
| 87 |
"easy": _easy_scenario,
|
|
|
|
| 91 |
if task_id not in scenario_builders:
|
| 92 |
raise ValueError(f"Unknown task_id: {task_id}. Must be one of: {list(scenario_builders.keys())}")
|
| 93 |
|
| 94 |
+
scenario = scenario_builders[task_id](seed=seed)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 95 |
return scenario
|
| 96 |
|
| 97 |
|
|
|
|
| 100 |
return ["easy", "medium", "hard"]
|
| 101 |
|
| 102 |
|
| 103 |
+
def _select_issues(pool: List[Issue], count: int, rng: random.Random) -> List[Issue]:
|
| 104 |
+
"""Select a random subset of issues from a pool, respecting dependencies."""
|
| 105 |
+
if count >= len(pool):
|
| 106 |
+
selected = list(pool)
|
| 107 |
+
else:
|
| 108 |
+
# Build dependency-aware selection
|
| 109 |
+
available = list(pool)
|
| 110 |
+
selected = []
|
| 111 |
+
while len(selected) < count and available:
|
| 112 |
+
# Pick a random issue
|
| 113 |
+
issue = rng.choice(available)
|
| 114 |
+
available.remove(issue)
|
| 115 |
+
# Add its dependencies too if not already selected
|
| 116 |
+
deps_satisfied = all(
|
| 117 |
+
any(s.issue_id == dep for s in selected)
|
| 118 |
+
for dep in issue.depends_on
|
| 119 |
+
)
|
| 120 |
+
if deps_satisfied or not issue.depends_on:
|
| 121 |
+
selected.append(issue)
|
| 122 |
+
else:
|
| 123 |
+
# Add dependencies first
|
| 124 |
+
for dep_id in issue.depends_on:
|
| 125 |
+
dep_issue = next((i for i in pool if i.issue_id == dep_id), None)
|
| 126 |
+
if dep_issue and dep_issue not in selected:
|
| 127 |
+
selected.append(dep_issue)
|
| 128 |
+
if dep_issue in available:
|
| 129 |
+
available.remove(dep_issue)
|
| 130 |
+
selected.append(issue)
|
| 131 |
+
|
| 132 |
+
# Shuffle log order for selected issues
|
| 133 |
+
rng.shuffle(selected)
|
| 134 |
+
return selected[:count]
|
| 135 |
+
|
| 136 |
+
|
| 137 |
+
def _randomize_scenario(scenario: Scenario, seed: int) -> Scenario:
|
| 138 |
+
"""Apply seed-based randomization to a scenario."""
|
| 139 |
+
rng = random.Random(seed)
|
| 140 |
+
|
| 141 |
+
# Shuffle log entries for each service
|
| 142 |
+
for service_logs in scenario.logs.values():
|
| 143 |
+
rng.shuffle(service_logs)
|
| 144 |
+
|
| 145 |
+
# Vary timestamps in log entries
|
| 146 |
+
base_hour = rng.randint(8, 16)
|
| 147 |
+
base_minute = rng.randint(0, 59)
|
| 148 |
+
for service, log_list in scenario.logs.items():
|
| 149 |
+
new_logs = []
|
| 150 |
+
for i, log_line in enumerate(log_list):
|
| 151 |
+
# Replace the timestamp portion
|
| 152 |
+
minute = (base_minute + i * rng.randint(1, 5)) % 60
|
| 153 |
+
hour = base_hour + (base_minute + i * rng.randint(1, 5)) // 60
|
| 154 |
+
new_log = log_line
|
| 155 |
+
if "2026-" in new_log:
|
| 156 |
+
# Replace date with varied date
|
| 157 |
+
day = rng.randint(20, 28)
|
| 158 |
+
new_log = new_log.replace(
|
| 159 |
+
"2026-03-25",
|
| 160 |
+
f"2026-03-{day:02d}"
|
| 161 |
+
).replace(
|
| 162 |
+
"2026-03-24",
|
| 163 |
+
f"2026-03-{day-1:02d}"
|
| 164 |
+
)
|
| 165 |
+
new_logs.append(new_log)
|
| 166 |
+
scenario.logs[service] = new_logs
|
| 167 |
+
|
| 168 |
+
return scenario
|
| 169 |
+
|
| 170 |
+
|
| 171 |
# βββ Easy Scenario βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 172 |
|
| 173 |
+
def _easy_scenario(seed: Optional[int] = None) -> Scenario:
|
| 174 |
"""
|
| 175 |
+
Easy: Payment API integration failures.
|
| 176 |
+
Agent must diagnose auth + content-type issues with clear log signals.
|
| 177 |
+
|
| 178 |
+
Issue pool has 4 possible issues; canonical scenario uses 2.
|
| 179 |
"""
|
| 180 |
+
# Full issue pool (4 issues, canonical uses 2)
|
| 181 |
+
issue_pool = [
|
| 182 |
+
Issue(
|
| 183 |
+
issue_id="easy_auth",
|
| 184 |
+
service="payment_client",
|
| 185 |
+
description="Missing Authorization header β payment gateway requires Bearer token authentication",
|
| 186 |
+
expected_fix={"headers.Authorization": "Bearer <token>"},
|
| 187 |
+
fix_key="headers.Authorization",
|
| 188 |
+
log_hint="Missing or invalid Authorization header",
|
| 189 |
+
category="authentication",
|
| 190 |
+
severity="critical",
|
| 191 |
+
root_cause_explanation=(
|
| 192 |
+
"The payment_client is missing the Authorization header entirely. "
|
| 193 |
+
"The payment_gateway requires Bearer token auth on all /process requests. "
|
| 194 |
+
"This results in HTTP 401 on every payment attempt."
|
| 195 |
+
),
|
| 196 |
+
cascade_effects={
|
| 197 |
+
"payment_gateway": "All requests from payment_client rejected with 401"
|
|
|
|
|
|
|
| 198 |
},
|
| 199 |
+
),
|
| 200 |
+
Issue(
|
| 201 |
+
issue_id="easy_content_type",
|
| 202 |
+
service="payment_client",
|
| 203 |
+
description="Wrong Content-Type header (text/plain instead of application/json)",
|
| 204 |
+
expected_fix={"headers.Content-Type": "application/json"},
|
| 205 |
+
fix_key="headers.Content-Type",
|
| 206 |
+
log_hint="Content-Type must be application/json",
|
| 207 |
+
category="protocol",
|
| 208 |
+
severity="error",
|
| 209 |
+
root_cause_explanation=(
|
| 210 |
+
"The payment_client sends Content-Type: text/plain, but the gateway "
|
| 211 |
+
"only accepts application/json. This causes HTTP 415 Unsupported Media Type. "
|
| 212 |
+
"The gateway cannot parse the request body."
|
| 213 |
+
),
|
| 214 |
+
cascade_effects={
|
| 215 |
+
"payment_gateway": "Request body parsing fails for payment_client requests"
|
| 216 |
},
|
| 217 |
+
),
|
| 218 |
+
Issue(
|
| 219 |
+
issue_id="easy_timeout",
|
| 220 |
+
service="payment_client",
|
| 221 |
+
description="Timeout set too low (5s) for payment processing that takes 8-12s",
|
| 222 |
+
expected_fix={"timeout": 30},
|
| 223 |
+
fix_key="timeout",
|
| 224 |
+
log_hint="Request timed out after 5s",
|
| 225 |
+
category="networking",
|
| 226 |
+
severity="error",
|
| 227 |
+
root_cause_explanation=(
|
| 228 |
+
"The payment_client has timeout=5s, but payment processing at the gateway "
|
| 229 |
+
"takes 8-12s for fraud checks. Legitimate payments are timing out."
|
| 230 |
+
),
|
| 231 |
+
),
|
| 232 |
+
Issue(
|
| 233 |
+
issue_id="easy_base_url",
|
| 234 |
+
service="payment_client",
|
| 235 |
+
description="Base URL pointing to deprecated v1 endpoint instead of v2",
|
| 236 |
+
expected_fix={"base_url": "https://api.paymentgateway.com/v2"},
|
| 237 |
+
fix_key="base_url",
|
| 238 |
+
log_hint="API v1 is deprecated",
|
| 239 |
+
category="configuration",
|
| 240 |
+
severity="warning",
|
| 241 |
+
root_cause_explanation=(
|
| 242 |
+
"The payment_client uses /v1 which is deprecated and returning 301 redirects. "
|
| 243 |
+
"The gateway v2 endpoint has different request schemas, causing deserialization errors."
|
| 244 |
+
),
|
| 245 |
+
),
|
| 246 |
+
]
|
| 247 |
+
|
| 248 |
+
# Select issues based on seed
|
| 249 |
+
if seed is not None:
|
| 250 |
+
rng = random.Random(seed)
|
| 251 |
+
issues = _select_issues(issue_pool, 2, rng)
|
| 252 |
+
else:
|
| 253 |
+
issues = issue_pool[:2] # Canonical: auth + content_type
|
| 254 |
+
|
| 255 |
+
# Build logs based on selected issues
|
| 256 |
+
client_logs = [
|
| 257 |
+
"[INFO] 2026-03-25T10:15:20Z Payment client initialized with base_url=https://api.paymentgateway.com/v2",
|
| 258 |
+
]
|
| 259 |
+
gateway_logs = [
|
| 260 |
+
"[INFO] 2026-03-25T10:15:20Z Gateway ready, accepting application/json with Bearer auth",
|
| 261 |
+
]
|
| 262 |
+
|
| 263 |
+
for issue in issues:
|
| 264 |
+
if issue.issue_id == "easy_auth":
|
| 265 |
+
client_logs.extend([
|
| 266 |
"[ERROR] 2026-03-25T10:15:23Z POST /process -> 401 Unauthorized",
|
| 267 |
"[ERROR] 2026-03-25T10:15:23Z Response: {'error': 'Missing or invalid Authorization header'}",
|
| 268 |
"[WARN] 2026-03-25T10:15:22Z Request headers: Content-Type=text/plain, Accept=application/json",
|
| 269 |
+
])
|
| 270 |
+
gateway_logs.append(
|
| 271 |
+
"[WARN] 2026-03-25T10:15:23Z Rejected request: no Authorization header present"
|
| 272 |
+
)
|
| 273 |
+
elif issue.issue_id == "easy_content_type":
|
| 274 |
+
client_logs.extend([
|
| 275 |
"[ERROR] 2026-03-25T10:15:24Z POST /process -> 415 Unsupported Media Type",
|
| 276 |
"[ERROR] 2026-03-25T10:15:24Z Response: {'error': 'Content-Type must be application/json'}",
|
| 277 |
+
])
|
| 278 |
+
gateway_logs.append(
|
| 279 |
+
"[WARN] 2026-03-25T10:15:24Z Rejected request: unsupported Content-Type 'text/plain'"
|
| 280 |
+
)
|
| 281 |
+
elif issue.issue_id == "easy_timeout":
|
| 282 |
+
client_logs.extend([
|
| 283 |
+
"[ERROR] 2026-03-25T10:15:30Z POST /process -> Request timed out after 5s",
|
| 284 |
+
"[WARN] 2026-03-25T10:15:30Z Payment processing takes 8-12s for fraud verification",
|
| 285 |
+
])
|
| 286 |
+
gateway_logs.append(
|
| 287 |
+
"[INFO] 2026-03-25T10:15:30Z Processing payment... estimated time: 10s"
|
| 288 |
+
)
|
| 289 |
+
elif issue.issue_id == "easy_base_url":
|
| 290 |
+
client_logs.extend([
|
| 291 |
+
"[ERROR] 2026-03-25T10:15:21Z GET /v1/status -> 301 Moved Permanently",
|
| 292 |
+
"[WARN] 2026-03-25T10:15:21Z API v1 is deprecated, migrate to /v2",
|
| 293 |
+
])
|
| 294 |
+
gateway_logs.append(
|
| 295 |
+
"[WARN] 2026-03-25T10:15:21Z Deprecated v1 endpoint accessed"
|
| 296 |
+
)
|
| 297 |
+
|
| 298 |
+
# Determine initial config based on selected issues
|
| 299 |
+
configs = {
|
| 300 |
+
"payment_client": {
|
| 301 |
+
"base_url": "https://api.paymentgateway.com/v2",
|
| 302 |
+
"headers": {
|
| 303 |
+
"Content-Type": "application/json",
|
| 304 |
+
"Accept": "application/json",
|
| 305 |
+
},
|
| 306 |
+
"timeout": 30,
|
| 307 |
+
"retry_count": 3,
|
| 308 |
},
|
| 309 |
+
"payment_gateway": {
|
| 310 |
+
"endpoint": "/process",
|
| 311 |
+
"method": "POST",
|
| 312 |
+
"required_headers": ["Authorization", "Content-Type"],
|
| 313 |
+
"accepted_content_types": ["application/json"],
|
| 314 |
+
"auth_scheme": "Bearer",
|
| 315 |
+
"processing_time_ms": "8000-12000",
|
| 316 |
+
},
|
| 317 |
+
}
|
| 318 |
+
|
| 319 |
+
# Apply broken config for each selected issue
|
| 320 |
+
for issue in issues:
|
| 321 |
+
if issue.issue_id == "easy_auth":
|
| 322 |
+
# Remove auth header (it shouldn't exist)
|
| 323 |
+
configs["payment_client"]["headers"].pop("Authorization", None)
|
| 324 |
+
elif issue.issue_id == "easy_content_type":
|
| 325 |
+
configs["payment_client"]["headers"]["Content-Type"] = "text/plain"
|
| 326 |
+
elif issue.issue_id == "easy_timeout":
|
| 327 |
+
configs["payment_client"]["timeout"] = 5
|
| 328 |
+
elif issue.issue_id == "easy_base_url":
|
| 329 |
+
configs["payment_client"]["base_url"] = "https://api.paymentgateway.com/v1"
|
| 330 |
+
|
| 331 |
+
# Dynamic logs: what changes after fixing each issue
|
| 332 |
+
dynamic_logs = {}
|
| 333 |
+
for issue in issues:
|
| 334 |
+
if issue.issue_id == "easy_auth":
|
| 335 |
+
dynamic_logs["easy_auth"] = {
|
| 336 |
+
"payment_client": ["[INFO] Authorization header set. Retrying request..."],
|
| 337 |
+
"payment_gateway": ["[INFO] Authentication successful for payment_client"],
|
| 338 |
+
}
|
| 339 |
+
elif issue.issue_id == "easy_content_type":
|
| 340 |
+
dynamic_logs["easy_content_type"] = {
|
| 341 |
+
"payment_client": ["[INFO] Content-Type set to application/json. Request body parsed."],
|
| 342 |
+
"payment_gateway": ["[INFO] Request body parsed successfully as JSON"],
|
| 343 |
+
}
|
| 344 |
+
elif issue.issue_id == "easy_timeout":
|
| 345 |
+
dynamic_logs["easy_timeout"] = {
|
| 346 |
+
"payment_client": ["[INFO] Timeout increased to 30s. Payment processing completing normally."],
|
| 347 |
+
}
|
| 348 |
+
elif issue.issue_id == "easy_base_url":
|
| 349 |
+
dynamic_logs["easy_base_url"] = {
|
| 350 |
+
"payment_client": ["[INFO] Migrated to v2 API endpoint. Requests routing correctly."],
|
| 351 |
+
}
|
| 352 |
+
|
| 353 |
+
# Service dependency graph
|
| 354 |
+
service_graph = {
|
| 355 |
+
"payment_client": ServiceNode(
|
| 356 |
+
name="payment_client",
|
| 357 |
+
depends_on=["payment_gateway"],
|
| 358 |
+
health_status="error",
|
| 359 |
+
),
|
| 360 |
+
"payment_gateway": ServiceNode(
|
| 361 |
+
name="payment_gateway",
|
| 362 |
+
depends_on=[],
|
| 363 |
+
health_status="healthy",
|
| 364 |
+
),
|
| 365 |
+
}
|
| 366 |
+
|
| 367 |
+
scenario = Scenario(
|
| 368 |
+
task_id="easy",
|
| 369 |
+
difficulty="easy",
|
| 370 |
+
description=(
|
| 371 |
+
"A payment processing API integration is failing. "
|
| 372 |
+
"The client is sending requests to the payment gateway but getting error responses. "
|
| 373 |
+
"Diagnose the root causes by inspecting error logs and service configurations, "
|
| 374 |
+
"then submit the correct configuration fixes."
|
| 375 |
+
),
|
| 376 |
+
max_steps=15,
|
| 377 |
+
services=["payment_client", "payment_gateway"],
|
| 378 |
+
configs=configs,
|
| 379 |
+
logs={"payment_client": client_logs, "payment_gateway": gateway_logs},
|
| 380 |
+
issues=issues,
|
| 381 |
+
service_graph=service_graph,
|
| 382 |
+
dynamic_logs=dynamic_logs,
|
| 383 |
+
optimal_fix_order=[i.issue_id for i in issues],
|
| 384 |
+
context=(
|
| 385 |
+
"The payment_client sends HTTP requests to payment_gateway. "
|
| 386 |
+
"payment_gateway requires Bearer authentication and JSON content type."
|
| 387 |
+
),
|
| 388 |
)
|
| 389 |
|
| 390 |
+
if seed is not None:
|
| 391 |
+
scenario = _randomize_scenario(scenario, seed)
|
| 392 |
+
|
| 393 |
+
return scenario
|
| 394 |
+
|
| 395 |
|
| 396 |
# βββ Medium Scenario βββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 397 |
|
| 398 |
+
def _medium_scenario(seed: Optional[int] = None) -> Scenario:
|
| 399 |
"""
|
| 400 |
+
Medium: Webhook chain with cascading failures.
|
| 401 |
+
Service A -> Service B -> Service C, with rate limiting, retry, and auth issues.
|
| 402 |
+
|
| 403 |
+
Issue pool has 5 possible issues; canonical scenario uses 3.
|
| 404 |
+
Issues have dependencies β fixing rate_limit reveals the real retry issue.
|
| 405 |
"""
|
| 406 |
+
issue_pool = [
|
| 407 |
+
Issue(
|
| 408 |
+
issue_id="medium_rate_limit",
|
| 409 |
+
service="webhook_sender",
|
| 410 |
+
description="Rate limit too high (100/s vs receiver's 10/s limit) causing 429 responses",
|
| 411 |
+
expected_fix={"rate_limit.requests_per_second": 10},
|
| 412 |
+
fix_key="rate_limit.requests_per_second",
|
| 413 |
+
log_hint="Rate limit exceeded: 100 req/s > 10 req/s allowed",
|
| 414 |
+
category="networking",
|
| 415 |
+
severity="error",
|
| 416 |
+
root_cause_explanation=(
|
| 417 |
+
"webhook_sender fires at 100 req/s but webhook_receiver only accepts 10 req/s. "
|
| 418 |
+
"The excess requests get 429 Too Many Requests, and with only 1 retry, most events are dropped."
|
| 419 |
+
),
|
| 420 |
+
cascade_effects={
|
| 421 |
+
"webhook_receiver": "Overwhelmed with requests, dropping 90% of events",
|
| 422 |
+
"notification_service": "No events arriving downstream",
|
| 423 |
+
},
|
| 424 |
),
|
| 425 |
+
Issue(
|
| 426 |
+
issue_id="medium_retry",
|
| 427 |
+
service="webhook_sender",
|
| 428 |
+
description="Insufficient retry config: only 1 retry, no backoff, missing 429 in retry_on_status",
|
| 429 |
+
expected_fix={
|
| 430 |
+
"retry.max_retries": 3,
|
| 431 |
+
"retry.backoff_factor": 2,
|
| 432 |
+
"retry.retry_on_status": [429, 500],
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 433 |
},
|
| 434 |
+
fix_key="retry",
|
| 435 |
+
log_hint="Retry attempt 1/1 failed. No more retries.",
|
| 436 |
+
depends_on=["medium_rate_limit"],
|
| 437 |
+
# The retry issue is masked by the rate limit issue β even with retries,
|
| 438 |
+
# 100 req/s would still overwhelm the receiver
|
| 439 |
+
category="configuration",
|
| 440 |
+
severity="error",
|
| 441 |
+
root_cause_explanation=(
|
| 442 |
+
"Even after fixing the rate limit, the sender only retries once with no backoff. "
|
| 443 |
+
"Transient 429s during bursts aren't retried because 429 isn't in retry_on_status. "
|
| 444 |
+
"This causes event loss on any temporary load spike."
|
| 445 |
+
),
|
| 446 |
+
),
|
| 447 |
+
Issue(
|
| 448 |
+
issue_id="medium_signature",
|
| 449 |
+
service="webhook_sender",
|
| 450 |
+
description="Webhook signature header is empty β receiver rejects unsigned events",
|
| 451 |
+
expected_fix={"headers.X-Webhook-Signature": "sha256=<computed>"},
|
| 452 |
+
fix_key="headers.X-Webhook-Signature",
|
| 453 |
+
log_hint="Signature validation FAILED: received empty signature",
|
| 454 |
+
category="authentication",
|
| 455 |
+
severity="critical",
|
| 456 |
+
root_cause_explanation=(
|
| 457 |
+
"webhook_sender has signing_secret configured but the X-Webhook-Signature header "
|
| 458 |
+
"is empty string. webhook_receiver validates signatures and drops all unsigned "
|
| 459 |
+
"events as potential spoofing attempts."
|
| 460 |
+
),
|
| 461 |
+
cascade_effects={
|
| 462 |
+
"webhook_receiver": "Dropping all events as unsigned/spoofed",
|
| 463 |
+
"notification_service": "Zero events forwarded from receiver",
|
| 464 |
},
|
| 465 |
+
),
|
| 466 |
+
Issue(
|
| 467 |
+
issue_id="medium_target_url",
|
| 468 |
+
service="webhook_sender",
|
| 469 |
+
description="Target URL pointing to wrong receiver endpoint (/webhook vs /hooks/incoming)",
|
| 470 |
+
expected_fix={"target_url": "https://receiver.internal/hooks/incoming"},
|
| 471 |
+
fix_key="target_url",
|
| 472 |
+
log_hint="404 Not Found on /webhook endpoint",
|
| 473 |
+
category="configuration",
|
| 474 |
+
severity="error",
|
| 475 |
+
root_cause_explanation=(
|
| 476 |
+
"webhook_sender posts to /webhook but the receiver listens on /hooks/incoming. "
|
| 477 |
+
"All requests get 404 Not Found."
|
| 478 |
+
),
|
| 479 |
+
),
|
| 480 |
+
Issue(
|
| 481 |
+
issue_id="medium_content_encoding",
|
| 482 |
+
service="webhook_sender",
|
| 483 |
+
description="Payload compression enabled but receiver doesn't support gzip",
|
| 484 |
+
expected_fix={"compression": "none"},
|
| 485 |
+
fix_key="compression",
|
| 486 |
+
log_hint="Unsupported Content-Encoding: gzip",
|
| 487 |
+
category="protocol",
|
| 488 |
+
severity="warning",
|
| 489 |
+
root_cause_explanation=(
|
| 490 |
+
"webhook_sender compresses payloads with gzip but webhook_receiver "
|
| 491 |
+
"doesn't have a decompression middleware. Requests fail with 415."
|
| 492 |
+
),
|
| 493 |
+
),
|
| 494 |
+
]
|
| 495 |
+
|
| 496 |
+
if seed is not None:
|
| 497 |
+
rng = random.Random(seed)
|
| 498 |
+
issues = _select_issues(issue_pool, 3, rng)
|
| 499 |
+
else:
|
| 500 |
+
issues = issue_pool[:3] # Canonical: rate_limit, retry, signature
|
| 501 |
+
|
| 502 |
+
# Build configs
|
| 503 |
+
configs = {
|
| 504 |
+
"webhook_sender": {
|
| 505 |
+
"target_url": "https://receiver.internal/hooks/incoming",
|
| 506 |
+
"headers": {
|
| 507 |
+
"Content-Type": "application/json",
|
| 508 |
+
"X-Webhook-Signature": "sha256=computed_hmac",
|
| 509 |
+
},
|
| 510 |
+
"rate_limit": {
|
| 511 |
+
"requests_per_second": 10,
|
| 512 |
+
"burst_size": 20,
|
| 513 |
+
},
|
| 514 |
+
"retry": {
|
| 515 |
+
"max_retries": 3,
|
| 516 |
+
"backoff_factor": 2,
|
| 517 |
+
"retry_on_status": [429, 500],
|
| 518 |
},
|
| 519 |
+
"signing_secret": "whsec_abc123secret",
|
| 520 |
+
"compression": "none",
|
| 521 |
},
|
| 522 |
+
"webhook_receiver": {
|
| 523 |
+
"endpoint": "/hooks/incoming",
|
| 524 |
+
"rate_limit": {
|
| 525 |
+
"requests_per_second": 10,
|
| 526 |
+
"burst_size": 20,
|
| 527 |
+
},
|
| 528 |
+
"signature_validation": True,
|
| 529 |
+
"expected_signature_header": "X-Webhook-Signature",
|
| 530 |
+
"signing_secret": "whsec_abc123secret",
|
| 531 |
+
"forward_to": "https://notifications.internal/notify",
|
| 532 |
+
"supported_encodings": ["identity"],
|
| 533 |
+
},
|
| 534 |
+
"notification_service": {
|
| 535 |
+
"endpoint": "/notify",
|
| 536 |
+
"accepts_from": ["webhook_receiver"],
|
| 537 |
+
"status": "healthy",
|
| 538 |
+
},
|
| 539 |
+
}
|
| 540 |
+
|
| 541 |
+
# Apply broken config for each selected issue
|
| 542 |
+
for issue in issues:
|
| 543 |
+
if issue.issue_id == "medium_rate_limit":
|
| 544 |
+
configs["webhook_sender"]["rate_limit"]["requests_per_second"] = 100
|
| 545 |
+
configs["webhook_sender"]["rate_limit"]["burst_size"] = 200
|
| 546 |
+
elif issue.issue_id == "medium_retry":
|
| 547 |
+
configs["webhook_sender"]["retry"] = {
|
| 548 |
+
"max_retries": 1,
|
| 549 |
+
"backoff_factor": 0,
|
| 550 |
+
"retry_on_status": [500],
|
| 551 |
+
}
|
| 552 |
+
elif issue.issue_id == "medium_signature":
|
| 553 |
+
configs["webhook_sender"]["headers"]["X-Webhook-Signature"] = ""
|
| 554 |
+
elif issue.issue_id == "medium_target_url":
|
| 555 |
+
configs["webhook_sender"]["target_url"] = "https://receiver.internal/webhook"
|
| 556 |
+
elif issue.issue_id == "medium_content_encoding":
|
| 557 |
+
configs["webhook_sender"]["compression"] = "gzip"
|
| 558 |
+
|
| 559 |
+
# Build logs based on selected issues
|
| 560 |
+
sender_logs = [
|
| 561 |
+
"[INFO] 2026-03-25T10:59:59Z Webhook sender started. Signature header: X-Webhook-Signature",
|
| 562 |
+
]
|
| 563 |
+
receiver_logs = [
|
| 564 |
+
"[INFO] 2026-03-25T10:59:59Z Receiver ready. Rate limit: 10 req/s. Signature validation: ON",
|
| 565 |
+
]
|
| 566 |
+
notif_logs = [
|
| 567 |
+
"[INFO] 2026-03-25T10:59:59Z Notification service healthy. Waiting for events.",
|
| 568 |
+
]
|
| 569 |
+
|
| 570 |
+
for issue in issues:
|
| 571 |
+
if issue.issue_id == "medium_rate_limit":
|
| 572 |
+
sender_logs.extend([
|
| 573 |
+
"[ERROR] 2026-03-25T11:00:01Z POST /hooks/incoming -> 429 Too Many Requests",
|
| 574 |
"[ERROR] 2026-03-25T11:00:01Z Rate limited. Retry-After: 5s",
|
| 575 |
+
"[WARN] 2026-03-25T11:00:00Z Sending at 100 req/s (burst=200)",
|
| 576 |
+
])
|
| 577 |
+
receiver_logs.append(
|
| 578 |
+
"[WARN] 2026-03-25T11:00:01Z Rate limit exceeded: 100 req/s > 10 req/s allowed"
|
| 579 |
+
)
|
| 580 |
+
elif issue.issue_id == "medium_retry":
|
| 581 |
+
sender_logs.extend([
|
| 582 |
"[WARN] 2026-03-25T11:00:02Z Retry attempt 1/1 failed. No more retries.",
|
| 583 |
"[ERROR] 2026-03-25T11:00:03Z Event evt_12345 dropped after retry exhaustion",
|
| 584 |
+
])
|
| 585 |
+
elif issue.issue_id == "medium_signature":
|
| 586 |
+
receiver_logs.extend([
|
|
|
|
|
|
|
| 587 |
"[ERROR] 2026-03-25T11:00:02Z Signature validation FAILED: received empty signature",
|
| 588 |
"[WARN] 2026-03-25T11:00:02Z Dropping event: invalid signature from webhook_sender",
|
| 589 |
+
])
|
| 590 |
+
elif issue.issue_id == "medium_target_url":
|
| 591 |
+
sender_logs.extend([
|
| 592 |
+
"[ERROR] 2026-03-25T11:00:01Z POST /webhook -> 404 Not Found on /webhook endpoint",
|
| 593 |
+
"[WARN] 2026-03-25T11:00:01Z Receiver endpoint may have changed",
|
| 594 |
+
])
|
| 595 |
+
elif issue.issue_id == "medium_content_encoding":
|
| 596 |
+
receiver_logs.extend([
|
| 597 |
+
"[ERROR] 2026-03-25T11:00:02Z Unsupported Content-Encoding: gzip",
|
| 598 |
+
"[WARN] 2026-03-25T11:00:02Z Cannot decompress payload from webhook_sender",
|
| 599 |
+
])
|
| 600 |
+
|
| 601 |
+
notif_logs.append("[WARN] 2026-03-25T11:00:05Z No events received in last 60s")
|
| 602 |
+
|
| 603 |
+
# Dynamic logs
|
| 604 |
+
dynamic_logs = {
|
| 605 |
+
"medium_rate_limit": {
|
| 606 |
+
"webhook_sender": ["[INFO] Rate limit adjusted to 10 req/s. 429 errors resolved."],
|
| 607 |
+
"webhook_receiver": ["[INFO] Incoming request rate normalized. Processing events."],
|
| 608 |
},
|
| 609 |
+
"medium_retry": {
|
| 610 |
+
"webhook_sender": ["[INFO] Retry config updated: 3 retries with backoff. 429 now retried."],
|
| 611 |
+
},
|
| 612 |
+
"medium_signature": {
|
| 613 |
+
"webhook_sender": ["[INFO] Webhook signature computed and attached to requests."],
|
| 614 |
+
"webhook_receiver": ["[INFO] Signature validation passed for incoming events."],
|
| 615 |
+
},
|
| 616 |
+
"medium_target_url": {
|
| 617 |
+
"webhook_sender": ["[INFO] Target URL corrected to /hooks/incoming. Requests routing OK."],
|
| 618 |
+
},
|
| 619 |
+
"medium_content_encoding": {
|
| 620 |
+
"webhook_sender": ["[INFO] Compression disabled. Receiver parsing payloads correctly."],
|
| 621 |
+
},
|
| 622 |
+
}
|
| 623 |
+
|
| 624 |
+
service_graph = {
|
| 625 |
+
"webhook_sender": ServiceNode(
|
| 626 |
+
name="webhook_sender",
|
| 627 |
+
depends_on=["webhook_receiver"],
|
| 628 |
+
health_status="error",
|
| 629 |
+
),
|
| 630 |
+
"webhook_receiver": ServiceNode(
|
| 631 |
+
name="webhook_receiver",
|
| 632 |
+
depends_on=["notification_service"],
|
| 633 |
+
health_status="degraded",
|
| 634 |
+
),
|
| 635 |
+
"notification_service": ServiceNode(
|
| 636 |
+
name="notification_service",
|
| 637 |
+
depends_on=[],
|
| 638 |
+
health_status="healthy",
|
| 639 |
+
),
|
| 640 |
+
}
|
| 641 |
+
|
| 642 |
+
# Determine optimal fix order (respect dependencies)
|
| 643 |
+
issue_ids = [i.issue_id for i in issues]
|
| 644 |
+
optimal_order = []
|
| 645 |
+
# Rate limit should be fixed before retry (dependency)
|
| 646 |
+
if "medium_rate_limit" in issue_ids:
|
| 647 |
+
optimal_order.append("medium_rate_limit")
|
| 648 |
+
if "medium_retry" in issue_ids:
|
| 649 |
+
optimal_order.append("medium_retry")
|
| 650 |
+
for iid in issue_ids:
|
| 651 |
+
if iid not in optimal_order:
|
| 652 |
+
optimal_order.append(iid)
|
| 653 |
+
|
| 654 |
+
scenario = Scenario(
|
| 655 |
+
task_id="medium",
|
| 656 |
+
difficulty="medium",
|
| 657 |
+
description=(
|
| 658 |
+
"A webhook-based notification system is dropping events. "
|
| 659 |
+
"webhook_sender sends webhooks to webhook_receiver, which forwards to notification_service. "
|
| 660 |
+
"Events are being lost due to multiple cascading failures in the webhook chain. "
|
| 661 |
+
"Fix the webhook_sender configuration to restore event delivery."
|
| 662 |
+
),
|
| 663 |
+
max_steps=25,
|
| 664 |
+
services=["webhook_sender", "webhook_receiver", "notification_service"],
|
| 665 |
+
configs=configs,
|
| 666 |
+
logs={
|
| 667 |
+
"webhook_sender": sender_logs,
|
| 668 |
+
"webhook_receiver": receiver_logs,
|
| 669 |
+
"notification_service": notif_logs,
|
| 670 |
+
},
|
| 671 |
+
issues=issues,
|
| 672 |
+
service_graph=service_graph,
|
| 673 |
+
dynamic_logs=dynamic_logs,
|
| 674 |
+
optimal_fix_order=optimal_order,
|
| 675 |
+
context=(
|
| 676 |
+
"Event flow: webhook_sender -> webhook_receiver -> notification_service. "
|
| 677 |
+
"webhook_receiver validates signatures and enforces rate limits. "
|
| 678 |
+
"Fixing upstream issues may reveal additional downstream problems."
|
| 679 |
+
),
|
| 680 |
)
|
| 681 |
|
| 682 |
+
if seed is not None:
|
| 683 |
+
scenario = _randomize_scenario(scenario, seed)
|
| 684 |
+
|
| 685 |
+
return scenario
|
| 686 |
+
|
| 687 |
|
| 688 |
# βββ Hard Scenario ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 689 |
|
| 690 |
+
def _hard_scenario(seed: Optional[int] = None) -> Scenario:
|
| 691 |
"""
|
| 692 |
+
Hard: E-commerce order processing pipeline with cascading failures.
|
| 693 |
+
order_service -> inventory_service -> shipping_service
|
| 694 |
+
Plus api_gateway and auth_service.
|
| 695 |
+
|
| 696 |
+
Issue pool has 7 possible issues; canonical scenario uses 5.
|
| 697 |
+
Multiple dependency chains make this genuinely challenging.
|
| 698 |
"""
|
| 699 |
+
issue_pool = [
|
| 700 |
+
Issue(
|
| 701 |
+
issue_id="hard_wrong_url",
|
| 702 |
+
service="order_service",
|
| 703 |
+
description="Order service calling deprecated /v1/check instead of /v2/reserve",
|
| 704 |
+
expected_fix={"inventory_url": "https://inventory.internal/v2/reserve"},
|
| 705 |
+
fix_key="inventory_url",
|
| 706 |
+
log_hint="Endpoint deprecated. Use /v2/reserve",
|
| 707 |
+
category="configuration",
|
| 708 |
+
severity="error",
|
| 709 |
+
root_cause_explanation=(
|
| 710 |
+
"order_service calls /v1/check which was deprecated. The API gateway returns "
|
| 711 |
+
"301 Moved Permanently. The redirect goes to /v2/check (read-only) instead of "
|
| 712 |
+
"/v2/reserve (write). Inventory is never actually reserved."
|
| 713 |
+
),
|
| 714 |
+
cascade_effects={
|
| 715 |
+
"inventory_service": "Receiving read-only check requests instead of reservation requests",
|
| 716 |
+
"api_gateway": "Generating 301 redirect responses for deprecated endpoints",
|
| 717 |
+
},
|
| 718 |
),
|
| 719 |
+
Issue(
|
| 720 |
+
issue_id="hard_timeout",
|
| 721 |
+
service="order_service",
|
| 722 |
+
description="Timeout too short (2s) for inventory service that takes ~4s to process",
|
| 723 |
+
expected_fix={"timeout": 10},
|
| 724 |
+
fix_key="timeout",
|
| 725 |
+
log_hint="Timeout after 2s waiting for inventory response",
|
| 726 |
+
depends_on=["hard_wrong_url"],
|
| 727 |
+
# Timeout issue is masked by wrong URL β fix URL first to see real timeout
|
| 728 |
+
category="networking",
|
| 729 |
+
severity="error",
|
| 730 |
+
root_cause_explanation=(
|
| 731 |
+
"order_service has timeout=2s but inventory_service takes ~4s for reservation "
|
| 732 |
+
"(including DB lock + stock validation). After fixing the URL, requests now reach "
|
| 733 |
+
"inventory but time out before completion."
|
| 734 |
+
),
|
| 735 |
+
cascade_effects={
|
| 736 |
+
"inventory_service": "Connections killed mid-processing, leaving orphaned DB locks",
|
| 737 |
},
|
| 738 |
+
),
|
| 739 |
+
Issue(
|
| 740 |
+
issue_id="hard_async",
|
| 741 |
+
service="order_service",
|
| 742 |
+
description="Synchronous mode causes race conditions between concurrent orders",
|
| 743 |
+
expected_fix={"async_mode": True},
|
| 744 |
+
fix_key="async_mode",
|
| 745 |
+
log_hint="Race condition: order ord_998 processed before ord_997 completed",
|
| 746 |
+
category="configuration",
|
| 747 |
+
severity="critical",
|
| 748 |
+
root_cause_explanation=(
|
| 749 |
+
"order_service runs in sync mode, blocking the main thread on each inventory call. "
|
| 750 |
+
"Concurrent orders queue up and when timeouts occur, orders are processed out of "
|
| 751 |
+
"order, causing double-reservation and stock inconsistencies."
|
| 752 |
+
),
|
| 753 |
+
),
|
| 754 |
+
Issue(
|
| 755 |
+
issue_id="hard_expired_token",
|
| 756 |
+
service="inventory_service",
|
| 757 |
+
description="Expired auth token used for shipping service requests",
|
| 758 |
+
expected_fix={"headers.Authorization": "Bearer valid_token_789"},
|
| 759 |
+
fix_key="headers.Authorization",
|
| 760 |
+
log_hint="Auth token expired_token_456 is no longer valid",
|
| 761 |
+
category="authentication",
|
| 762 |
+
severity="critical",
|
| 763 |
+
root_cause_explanation=(
|
| 764 |
+
"inventory_service uses Bearer expired_token_456 to authenticate with "
|
| 765 |
+
"shipping_service. This token expired on 2026-03-24. All shipment creation "
|
| 766 |
+
"requests fail with 401, so reserved inventory is never shipped."
|
| 767 |
+
),
|
| 768 |
+
cascade_effects={
|
| 769 |
+
"shipping_service": "Rejecting all requests from inventory_service",
|
| 770 |
+
"auth_service": "Logging repeated failed token validations",
|
| 771 |
+
},
|
| 772 |
+
),
|
| 773 |
+
Issue(
|
| 774 |
+
issue_id="hard_token_refresh",
|
| 775 |
+
service="inventory_service",
|
| 776 |
+
description="No automatic token refresh mechanism configured",
|
| 777 |
+
expected_fix={"token_refresh_url": "https://auth.internal/refresh", "auto_refresh": True},
|
| 778 |
+
fix_key="token_refresh_url",
|
| 779 |
+
log_hint="Token validation failed: expired_token_456 expired",
|
| 780 |
+
depends_on=["hard_expired_token"],
|
| 781 |
+
# Token refresh is only relevant after fixing the expired token
|
| 782 |
+
category="configuration",
|
| 783 |
+
severity="error",
|
| 784 |
+
root_cause_explanation=(
|
| 785 |
+
"Even after replacing the expired token, there's no auto-refresh mechanism. "
|
| 786 |
+
"Tokens expire every 24h, so without auto_refresh=True and a refresh URL, "
|
| 787 |
+
"the same issue will recur tomorrow."
|
| 788 |
+
),
|
| 789 |
+
),
|
| 790 |
+
Issue(
|
| 791 |
+
issue_id="hard_circuit_breaker",
|
| 792 |
+
service="order_service",
|
| 793 |
+
description="No circuit breaker β failed requests keep hammering inventory_service",
|
| 794 |
+
expected_fix={"circuit_breaker.enabled": True, "circuit_breaker.failure_threshold": 5},
|
| 795 |
+
fix_key="circuit_breaker",
|
| 796 |
+
log_hint="Circuit breaker not configured",
|
| 797 |
+
category="configuration",
|
| 798 |
+
severity="warning",
|
| 799 |
+
root_cause_explanation=(
|
| 800 |
+
"Without a circuit breaker, order_service keeps sending requests to "
|
| 801 |
+
"inventory_service even when it's consistently failing. This wastes resources "
|
| 802 |
+
"and can cause a cascading overload."
|
| 803 |
+
),
|
| 804 |
+
),
|
| 805 |
+
Issue(
|
| 806 |
+
issue_id="hard_idempotency",
|
| 807 |
+
service="order_service",
|
| 808 |
+
description="Missing idempotency key β retried requests create duplicate orders",
|
| 809 |
+
expected_fix={"headers.Idempotency-Key": "order-{order_id}"},
|
| 810 |
+
fix_key="headers.Idempotency-Key",
|
| 811 |
+
log_hint="Duplicate order detected: ord_997 submitted twice",
|
| 812 |
+
depends_on=["hard_async"],
|
| 813 |
+
category="protocol",
|
| 814 |
+
severity="error",
|
| 815 |
+
root_cause_explanation=(
|
| 816 |
+
"When async retries fire, there's no Idempotency-Key header to deduplicate "
|
| 817 |
+
"requests. inventory_service creates duplicate reservations for the same order."
|
| 818 |
+
),
|
| 819 |
+
),
|
| 820 |
+
]
|
| 821 |
+
|
| 822 |
+
if seed is not None:
|
| 823 |
+
rng = random.Random(seed)
|
| 824 |
+
issues = _select_issues(issue_pool, 5, rng)
|
| 825 |
+
else:
|
| 826 |
+
issues = issue_pool[:5] # Canonical: first 5
|
| 827 |
+
|
| 828 |
+
configs = {
|
| 829 |
+
"order_service": {
|
| 830 |
+
"name": "order_service",
|
| 831 |
+
"inventory_url": "https://inventory.internal/v2/reserve",
|
| 832 |
+
"headers": {
|
| 833 |
+
"Content-Type": "application/json",
|
| 834 |
+
"Authorization": "Bearer valid_token_123",
|
| 835 |
},
|
| 836 |
+
"timeout": 10,
|
| 837 |
+
"async_mode": True,
|
| 838 |
+
"callback_url": "https://orders.internal/callback",
|
| 839 |
+
"circuit_breaker": {
|
| 840 |
+
"enabled": True,
|
| 841 |
+
"failure_threshold": 5,
|
|
|
|
| 842 |
},
|
| 843 |
+
},
|
| 844 |
+
"inventory_service": {
|
| 845 |
+
"name": "inventory_service",
|
| 846 |
+
"endpoint_version": "v2",
|
| 847 |
+
"reserve_path": "/v2/reserve",
|
| 848 |
+
"check_path": "/v2/check",
|
| 849 |
+
"shipping_url": "https://shipping.internal/v1/create",
|
| 850 |
+
"headers": {
|
| 851 |
+
"Content-Type": "application/json",
|
| 852 |
+
"Authorization": "Bearer valid_token_789",
|
| 853 |
},
|
| 854 |
+
"timeout": 10,
|
| 855 |
+
"processing_time_avg": 4,
|
| 856 |
+
"token_refresh_url": "https://auth.internal/refresh",
|
| 857 |
+
"auto_refresh": True,
|
| 858 |
+
},
|
| 859 |
+
"shipping_service": {
|
| 860 |
+
"name": "shipping_service",
|
| 861 |
+
"create_path": "/v1/create",
|
| 862 |
+
"requires_auth": True,
|
| 863 |
+
"accepted_auth": ["Bearer"],
|
| 864 |
+
"token_validation_url": "https://auth.internal/validate",
|
| 865 |
+
"status": "healthy",
|
| 866 |
+
},
|
| 867 |
+
"api_gateway": {
|
| 868 |
+
"routes": {
|
| 869 |
+
"/v1/check": "DEPRECATED β use /v2/check",
|
| 870 |
+
"/v2/reserve": "inventory_service",
|
| 871 |
+
"/v2/check": "inventory_service",
|
| 872 |
+
"/v1/create": "shipping_service",
|
| 873 |
},
|
| 874 |
+
"timeout": 30,
|
| 875 |
},
|
| 876 |
+
"auth_service": {
|
| 877 |
+
"valid_tokens": ["valid_token_123", "valid_token_789"],
|
| 878 |
+
"expired_tokens": ["expired_token_456"],
|
| 879 |
+
"token_refresh_endpoint": "/refresh",
|
| 880 |
+
"token_ttl_hours": 24,
|
| 881 |
+
},
|
| 882 |
+
}
|
| 883 |
+
|
| 884 |
+
# Apply broken config for each selected issue
|
| 885 |
+
for issue in issues:
|
| 886 |
+
if issue.issue_id == "hard_wrong_url":
|
| 887 |
+
configs["order_service"]["inventory_url"] = "https://inventory.internal/v1/check"
|
| 888 |
+
elif issue.issue_id == "hard_timeout":
|
| 889 |
+
configs["order_service"]["timeout"] = 2
|
| 890 |
+
elif issue.issue_id == "hard_async":
|
| 891 |
+
configs["order_service"]["async_mode"] = False
|
| 892 |
+
elif issue.issue_id == "hard_expired_token":
|
| 893 |
+
configs["inventory_service"]["headers"]["Authorization"] = "Bearer expired_token_456"
|
| 894 |
+
elif issue.issue_id == "hard_token_refresh":
|
| 895 |
+
configs["inventory_service"].pop("token_refresh_url", None)
|
| 896 |
+
configs["inventory_service"]["auto_refresh"] = False
|
| 897 |
+
elif issue.issue_id == "hard_circuit_breaker":
|
| 898 |
+
configs["order_service"]["circuit_breaker"] = {"enabled": False}
|
| 899 |
+
elif issue.issue_id == "hard_idempotency":
|
| 900 |
+
configs["order_service"]["headers"].pop("Idempotency-Key", None)
|
| 901 |
+
|
| 902 |
+
# Build logs
|
| 903 |
+
order_logs = []
|
| 904 |
+
inventory_logs = []
|
| 905 |
+
shipping_logs = []
|
| 906 |
+
gateway_logs = []
|
| 907 |
+
auth_logs = [
|
| 908 |
+
"[INFO] 2026-03-25T12:00:00Z Auth service ready. Valid tokens: 2, Expired: 1",
|
| 909 |
+
]
|
| 910 |
+
|
| 911 |
+
for issue in issues:
|
| 912 |
+
if issue.issue_id == "hard_wrong_url":
|
| 913 |
+
order_logs.extend([
|
| 914 |
"[ERROR] 2026-03-25T12:00:05Z POST inventory.internal/v1/check -> 301 Moved Permanently",
|
| 915 |
"[ERROR] 2026-03-25T12:00:05Z Response: {'error': 'Endpoint deprecated. Use /v2/reserve'}",
|
| 916 |
+
])
|
| 917 |
+
inventory_logs.append(
|
| 918 |
+
"[INFO] 2026-03-25T12:00:05Z Received request on /v1/check -> redirecting to /v2/check"
|
| 919 |
+
)
|
| 920 |
+
gateway_logs.extend([
|
| 921 |
+
"[WARN] 2026-03-25T12:00:05Z Deprecated endpoint /v1/check accessed by order_service",
|
| 922 |
+
"[INFO] 2026-03-25T12:00:05Z Redirecting /v1/check -> /v2/check (301)",
|
| 923 |
+
])
|
| 924 |
+
elif issue.issue_id == "hard_timeout":
|
| 925 |
+
order_logs.extend([
|
| 926 |
"[ERROR] 2026-03-25T12:00:07Z Timeout after 2s waiting for inventory response",
|
| 927 |
"[ERROR] 2026-03-25T12:00:07Z Order ord_999 failed: inventory check timed out",
|
| 928 |
+
])
|
| 929 |
+
inventory_logs.append(
|
| 930 |
+
"[WARN] 2026-03-25T12:00:06Z Processing reservation... avg time: 4s"
|
| 931 |
+
)
|
| 932 |
+
elif issue.issue_id == "hard_async":
|
| 933 |
+
order_logs.extend([
|
| 934 |
"[WARN] 2026-03-25T12:00:08Z Synchronous mode: blocking on inventory response",
|
| 935 |
"[ERROR] 2026-03-25T12:00:09Z Race condition: order ord_998 processed before ord_997 completed",
|
| 936 |
+
])
|
| 937 |
+
elif issue.issue_id == "hard_expired_token":
|
| 938 |
+
inventory_logs.extend([
|
|
|
|
| 939 |
"[ERROR] 2026-03-25T12:00:10Z POST shipping.internal/v1/create -> 401 Unauthorized",
|
| 940 |
"[ERROR] 2026-03-25T12:00:10Z Auth token expired_token_456 is no longer valid",
|
| 941 |
"[ERROR] 2026-03-25T12:00:10Z Cannot create shipment: authentication failed",
|
| 942 |
+
])
|
| 943 |
+
shipping_logs.append(
|
| 944 |
+
"[WARN] 2026-03-25T12:00:10Z Rejected request: token 'expired_token_456' is expired"
|
| 945 |
+
)
|
| 946 |
+
auth_logs.append(
|
| 947 |
+
"[WARN] 2026-03-25T12:00:10Z Token validation failed: expired_token_456 expired at 2026-03-24T00:00:00Z"
|
| 948 |
+
)
|
| 949 |
+
elif issue.issue_id == "hard_token_refresh":
|
| 950 |
+
auth_logs.append(
|
| 951 |
+
"[WARN] 2026-03-25T12:00:11Z Token validation failed: expired_token_456 expired. No refresh configured."
|
| 952 |
+
)
|
| 953 |
+
elif issue.issue_id == "hard_circuit_breaker":
|
| 954 |
+
order_logs.extend([
|
| 955 |
+
"[WARN] 2026-03-25T12:00:12Z Circuit breaker not configured, continuing to send requests after 10 failures",
|
| 956 |
+
"[ERROR] 2026-03-25T12:00:12Z System overload: 50 pending requests to inventory_service",
|
| 957 |
+
])
|
| 958 |
+
elif issue.issue_id == "hard_idempotency":
|
| 959 |
+
order_logs.append(
|
| 960 |
+
"[ERROR] 2026-03-25T12:00:13Z Duplicate order detected: ord_997 submitted twice"
|
| 961 |
+
)
|
| 962 |
+
inventory_logs.append(
|
| 963 |
+
"[WARN] 2026-03-25T12:00:13Z Duplicate reservation request for order ord_997"
|
| 964 |
+
)
|
| 965 |
+
|
| 966 |
+
if not shipping_logs:
|
| 967 |
+
shipping_logs.append(
|
| 968 |
+
"[INFO] 2026-03-25T12:00:00Z Shipping service healthy, awaiting authenticated requests"
|
| 969 |
+
)
|
| 970 |
+
|
| 971 |
+
dynamic_logs = {
|
| 972 |
+
"hard_wrong_url": {
|
| 973 |
+
"order_service": ["[INFO] URL corrected to /v2/reserve. Inventory requests routing correctly."],
|
| 974 |
+
"api_gateway": ["[INFO] order_service now using correct /v2/reserve endpoint."],
|
| 975 |
},
|
| 976 |
+
"hard_timeout": {
|
| 977 |
+
"order_service": ["[INFO] Timeout increased to 10s. Inventory responses completing."],
|
| 978 |
+
"inventory_service": ["[INFO] Reservations completing successfully within timeout."],
|
| 979 |
+
},
|
| 980 |
+
"hard_async": {
|
| 981 |
+
"order_service": ["[INFO] Async mode enabled. Orders processing concurrently without blocking."],
|
| 982 |
+
},
|
| 983 |
+
"hard_expired_token": {
|
| 984 |
+
"inventory_service": ["[INFO] Auth token refreshed. Shipping service requests authenticated."],
|
| 985 |
+
"shipping_service": ["[INFO] Authentication successful for inventory_service."],
|
| 986 |
+
},
|
| 987 |
+
"hard_token_refresh": {
|
| 988 |
+
"inventory_service": ["[INFO] Auto token refresh configured. Tokens will be refreshed before expiry."],
|
| 989 |
+
},
|
| 990 |
+
"hard_circuit_breaker": {
|
| 991 |
+
"order_service": ["[INFO] Circuit breaker enabled. Will stop sending after 5 consecutive failures."],
|
| 992 |
+
},
|
| 993 |
+
"hard_idempotency": {
|
| 994 |
+
"order_service": ["[INFO] Idempotency keys set. Duplicate requests will be safely deduplicated."],
|
| 995 |
+
},
|
| 996 |
+
}
|
| 997 |
+
|
| 998 |
+
service_graph = {
|
| 999 |
+
"order_service": ServiceNode(
|
| 1000 |
+
name="order_service",
|
| 1001 |
+
depends_on=["inventory_service", "api_gateway"],
|
| 1002 |
+
health_status="error",
|
| 1003 |
+
),
|
| 1004 |
+
"inventory_service": ServiceNode(
|
| 1005 |
+
name="inventory_service",
|
| 1006 |
+
depends_on=["shipping_service", "auth_service"],
|
| 1007 |
+
health_status="degraded",
|
| 1008 |
+
),
|
| 1009 |
+
"shipping_service": ServiceNode(
|
| 1010 |
+
name="shipping_service",
|
| 1011 |
+
depends_on=[],
|
| 1012 |
+
health_status="healthy",
|
| 1013 |
+
),
|
| 1014 |
+
"api_gateway": ServiceNode(
|
| 1015 |
+
name="api_gateway",
|
| 1016 |
+
depends_on=[],
|
| 1017 |
+
health_status="healthy",
|
| 1018 |
+
),
|
| 1019 |
+
"auth_service": ServiceNode(
|
| 1020 |
+
name="auth_service",
|
| 1021 |
+
depends_on=[],
|
| 1022 |
+
health_status="healthy",
|
| 1023 |
+
),
|
| 1024 |
+
}
|
| 1025 |
+
|
| 1026 |
+
# Build optimal fix order respecting dependencies
|
| 1027 |
+
issue_ids = [i.issue_id for i in issues]
|
| 1028 |
+
optimal_order = []
|
| 1029 |
+
ordered_preference = [
|
| 1030 |
+
"hard_wrong_url", "hard_timeout", "hard_async",
|
| 1031 |
+
"hard_expired_token", "hard_token_refresh",
|
| 1032 |
+
"hard_circuit_breaker", "hard_idempotency",
|
| 1033 |
+
]
|
| 1034 |
+
for iid in ordered_preference:
|
| 1035 |
+
if iid in issue_ids:
|
| 1036 |
+
optimal_order.append(iid)
|
| 1037 |
+
for iid in issue_ids:
|
| 1038 |
+
if iid not in optimal_order:
|
| 1039 |
+
optimal_order.append(iid)
|
| 1040 |
+
|
| 1041 |
+
scenario = Scenario(
|
| 1042 |
+
task_id="hard",
|
| 1043 |
+
difficulty="hard",
|
| 1044 |
+
description=(
|
| 1045 |
+
"An e-commerce order processing pipeline is failing with cascading errors. "
|
| 1046 |
+
"Order Service calls Inventory Service, which calls Shipping Service. "
|
| 1047 |
+
"Multiple issues span the pipeline: wrong endpoints, timeouts, race conditions, "
|
| 1048 |
+
"expired authentication tokens, and missing resilience patterns. "
|
| 1049 |
+
"Some issues are masked by upstream failures β you must fix issues in the right "
|
| 1050 |
+
"order to diagnose the full chain."
|
| 1051 |
+
),
|
| 1052 |
+
max_steps=40,
|
| 1053 |
+
services=["order_service", "inventory_service", "shipping_service", "api_gateway", "auth_service"],
|
| 1054 |
+
configs=configs,
|
| 1055 |
+
logs={
|
| 1056 |
+
"order_service": order_logs,
|
| 1057 |
+
"inventory_service": inventory_logs,
|
| 1058 |
+
"shipping_service": shipping_logs,
|
| 1059 |
+
"api_gateway": gateway_logs,
|
| 1060 |
+
"auth_service": auth_logs,
|
| 1061 |
+
},
|
| 1062 |
+
issues=issues,
|
| 1063 |
+
service_graph=service_graph,
|
| 1064 |
+
dynamic_logs=dynamic_logs,
|
| 1065 |
+
optimal_fix_order=optimal_order,
|
| 1066 |
+
context=(
|
| 1067 |
+
"Request flow: order_service -> api_gateway -> inventory_service -> shipping_service. "
|
| 1068 |
+
"auth_service provides token validation for all inter-service calls. "
|
| 1069 |
+
"Some issues are masked by upstream failures β fixing upstream issues may reveal "
|
| 1070 |
+
"new errors downstream. Pay attention to service dependencies."
|
| 1071 |
+
),
|
| 1072 |
)
|
| 1073 |
+
|
| 1074 |
+
if seed is not None:
|
| 1075 |
+
scenario = _randomize_scenario(scenario, seed)
|
| 1076 |
+
|
| 1077 |
+
return scenario
|
server/__pycache__/__init__.cpython-313.pyc
DELETED
|
Binary file (330 Bytes)
|
|
|
server/__pycache__/api_debug_env_environment.cpython-313.pyc
DELETED
|
Binary file (25.9 kB)
|
|
|
server/__pycache__/app.cpython-313.pyc
DELETED
|
Binary file (8.14 kB)
|
|
|
server/api_debug_env_environment.py
CHANGED
|
@@ -10,10 +10,16 @@ API Integration Debugging Environment Implementation.
|
|
| 10 |
A real-world environment where an AI agent diagnoses and fixes broken
|
| 11 |
API integrations by reading error logs, inspecting configurations,
|
| 12 |
and submitting corrected configurations.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 13 |
"""
|
| 14 |
|
| 15 |
import copy
|
| 16 |
-
from typing import Any, Dict, List, Optional, Set
|
| 17 |
from uuid import uuid4
|
| 18 |
|
| 19 |
from openenv.core.env_server.interfaces import Environment
|
|
@@ -37,8 +43,8 @@ class ApiDebugEnvironment(Environment):
|
|
| 37 |
3. Testing endpoints to observe failures
|
| 38 |
4. Submitting configuration fixes
|
| 39 |
|
| 40 |
-
Supports 3 difficulty levels (easy, medium, hard) with
|
| 41 |
-
|
| 42 |
"""
|
| 43 |
|
| 44 |
SUPPORTS_CONCURRENT_SESSIONS: bool = True
|
|
@@ -60,6 +66,13 @@ class ApiDebugEnvironment(Environment):
|
|
| 60 |
self._done = False
|
| 61 |
self._last_action_result = ""
|
| 62 |
self._cumulative_reward = 0.0
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 63 |
|
| 64 |
def reset(self, task_id: Optional[str] = None, seed: Optional[int] = None) -> ApiDebugObservation:
|
| 65 |
"""
|
|
@@ -84,6 +97,25 @@ class ApiDebugEnvironment(Environment):
|
|
| 84 |
self._done = False
|
| 85 |
self._last_action_result = ""
|
| 86 |
self._cumulative_reward = 0.0
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 87 |
|
| 88 |
return ApiDebugObservation(
|
| 89 |
task_id=self._task_id,
|
|
@@ -100,6 +132,9 @@ class ApiDebugEnvironment(Environment):
|
|
| 100 |
available_targets=self._scenario.services,
|
| 101 |
done=False,
|
| 102 |
reward=0.0,
|
|
|
|
|
|
|
|
|
|
| 103 |
)
|
| 104 |
|
| 105 |
def step(self, action: ApiDebugAction) -> ApiDebugObservation: # type: ignore[override]
|
|
@@ -124,6 +159,13 @@ class ApiDebugEnvironment(Environment):
|
|
| 124 |
config_snapshot: Dict[str, Any] = {}
|
| 125 |
api_response: Optional[Dict[str, Any]] = None
|
| 126 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 127 |
# Validate target
|
| 128 |
if action.target not in self._scenario.services:
|
| 129 |
self._last_action_result = (
|
|
@@ -162,6 +204,11 @@ class ApiDebugEnvironment(Environment):
|
|
| 162 |
self._done = True
|
| 163 |
self._last_action_result += " β° Out of steps. Episode ended."
|
| 164 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 165 |
return ApiDebugObservation(
|
| 166 |
task_id=self._task_id,
|
| 167 |
task_description=self._scenario.description,
|
|
@@ -177,6 +224,9 @@ class ApiDebugEnvironment(Environment):
|
|
| 177 |
available_targets=self._scenario.services,
|
| 178 |
done=self._done,
|
| 179 |
reward=reward,
|
|
|
|
|
|
|
|
|
|
| 180 |
metadata={
|
| 181 |
"cumulative_reward": self._cumulative_reward,
|
| 182 |
"step": self._state.step_count,
|
|
@@ -195,11 +245,18 @@ class ApiDebugEnvironment(Environment):
|
|
| 195 |
def _handle_inspect_logs(self, target: str) -> tuple:
|
| 196 |
"""Return logs for a service and reward for relevant inspection."""
|
| 197 |
assert self._scenario is not None
|
| 198 |
-
logs
|
|
|
|
|
|
|
|
|
|
|
|
|
| 199 |
inspect_key = f"logs:{target}"
|
| 200 |
is_repeat = inspect_key in self._inspected_targets
|
| 201 |
self._inspected_targets.add(inspect_key)
|
| 202 |
|
|
|
|
|
|
|
|
|
|
| 203 |
# Check if any unfound issues have log hints in these logs
|
| 204 |
found_new = False
|
| 205 |
for issue in self._scenario.issues:
|
|
@@ -212,9 +269,12 @@ class ApiDebugEnvironment(Environment):
|
|
| 212 |
if found_new:
|
| 213 |
reward = 0.15
|
| 214 |
self._last_action_result = f"Inspected logs for '{target}'. Found relevant error patterns!"
|
| 215 |
-
elif is_repeat:
|
| 216 |
-
reward = 0.0 # No reward for re-inspecting same logs
|
| 217 |
self._last_action_result = f"Re-inspected logs for '{target}'. No new information."
|
|
|
|
|
|
|
|
|
|
| 218 |
elif logs:
|
| 219 |
reward = 0.05
|
| 220 |
self._last_action_result = f"Inspected logs for '{target}'. {len(logs)} log entries found."
|
|
@@ -232,8 +292,15 @@ class ApiDebugEnvironment(Environment):
|
|
| 232 |
is_repeat = inspect_key in self._inspected_targets
|
| 233 |
self._inspected_targets.add(inspect_key)
|
| 234 |
|
|
|
|
|
|
|
|
|
|
| 235 |
# Reward based on relevance and novelty
|
| 236 |
-
has_issues = any(
|
|
|
|
|
|
|
|
|
|
|
|
|
| 237 |
if is_repeat:
|
| 238 |
reward = 0.0 # No reward for re-inspecting same config
|
| 239 |
self._last_action_result = f"Re-inspected config for '{target}'. No changes since last check."
|
|
@@ -247,31 +314,65 @@ class ApiDebugEnvironment(Environment):
|
|
| 247 |
return config, reward
|
| 248 |
|
| 249 |
def _handle_inspect_endpoint(self, target: str) -> tuple:
|
| 250 |
-
"""Simulate testing an endpoint
|
| 251 |
assert self._scenario is not None
|
| 252 |
|
|
|
|
|
|
|
|
|
|
| 253 |
# Find unfixed issues for this service
|
| 254 |
unfixed = [
|
| 255 |
i for i in self._scenario.issues
|
| 256 |
if i.service == target and i.issue_id not in self._issues_fixed
|
| 257 |
]
|
| 258 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 259 |
if unfixed:
|
| 260 |
-
# Simulate a failure based on the first unfixed issue
|
| 261 |
issue = unfixed[0]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 262 |
api_response = {
|
| 263 |
"status": "error",
|
| 264 |
-
"status_code":
|
| 265 |
"error": issue.description,
|
| 266 |
-
"hint": f"Check the {issue.fix_key} configuration",
|
|
|
|
| 267 |
}
|
| 268 |
reward = 0.05
|
| 269 |
-
self._last_action_result = f"Tested endpoint on '{target}'. Got error response."
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 270 |
else:
|
| 271 |
api_response = {
|
| 272 |
"status": "success",
|
| 273 |
"status_code": 200,
|
| 274 |
"message": f"{target} is working correctly.",
|
|
|
|
| 275 |
}
|
| 276 |
reward = 0.02
|
| 277 |
self._last_action_result = f"Tested endpoint on '{target}'. Service responding OK."
|
|
@@ -279,7 +380,7 @@ class ApiDebugEnvironment(Environment):
|
|
| 279 |
return api_response, reward
|
| 280 |
|
| 281 |
def _handle_submit_fix(self, target: str, fix_payload: Dict[str, Any]) -> float:
|
| 282 |
-
"""Process a fix submission and
|
| 283 |
assert self._scenario is not None
|
| 284 |
|
| 285 |
if not fix_payload:
|
|
@@ -298,14 +399,28 @@ class ApiDebugEnvironment(Environment):
|
|
| 298 |
|
| 299 |
reward = 0.0
|
| 300 |
fixed_any = False
|
|
|
|
|
|
|
|
|
|
|
|
|
| 301 |
|
| 302 |
for issue in target_issues:
|
| 303 |
-
|
|
|
|
| 304 |
self._issues_fixed.add(issue.issue_id)
|
| 305 |
-
self._issues_found.add(issue.issue_id)
|
| 306 |
self._apply_fix(target, fix_payload)
|
|
|
|
|
|
|
| 307 |
reward += 0.25
|
| 308 |
fixed_any = True
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 309 |
|
| 310 |
if fixed_any:
|
| 311 |
fixed_count = sum(1 for i in target_issues if i.issue_id in self._issues_fixed)
|
|
@@ -314,6 +429,11 @@ class ApiDebugEnvironment(Environment):
|
|
| 314 |
f"Fixed {fixed_count} issue(s). "
|
| 315 |
f"Total fixed: {len(self._issues_fixed)}/{len(self._scenario.issues)}"
|
| 316 |
)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 317 |
else:
|
| 318 |
self._last_action_result = (
|
| 319 |
f"Fix rejected for '{target}'. The payload doesn't address any known issues. "
|
|
@@ -323,6 +443,71 @@ class ApiDebugEnvironment(Environment):
|
|
| 323 |
|
| 324 |
return reward
|
| 325 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 326 |
# βββ Helper Methods βββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 327 |
|
| 328 |
@staticmethod
|
|
@@ -343,7 +528,7 @@ class ApiDebugEnvironment(Environment):
|
|
| 343 |
Supports:
|
| 344 |
- Exact match
|
| 345 |
- Case-insensitive string match
|
| 346 |
-
- Numeric tolerance
|
| 347 |
- Boolean coercion (e.g., "true" -> True)
|
| 348 |
- List containment (submitted must contain all expected elements)
|
| 349 |
- Pattern match for token-like values (Bearer <anything> matches Bearer <token>)
|
|
@@ -356,11 +541,11 @@ class ApiDebugEnvironment(Environment):
|
|
| 356 |
if norm_expected == norm_submitted:
|
| 357 |
return True
|
| 358 |
|
| 359 |
-
# Numeric comparison with tolerance
|
| 360 |
if isinstance(expected, (int, float)) and isinstance(submitted, (int, float)):
|
| 361 |
if expected == 0:
|
| 362 |
return submitted == 0
|
| 363 |
-
return abs(expected - submitted) / max(abs(expected), 1) < 0.
|
| 364 |
|
| 365 |
# Boolean coercion
|
| 366 |
if isinstance(expected, bool):
|
|
@@ -379,7 +564,6 @@ class ApiDebugEnvironment(Environment):
|
|
| 379 |
return True
|
| 380 |
# If submitted has same prefix structure
|
| 381 |
if exp_lower.startswith("bearer ") and sub_lower.startswith("bearer "):
|
| 382 |
-
# Any valid bearer token is acceptable
|
| 383 |
return len(sub_lower) > len("bearer ")
|
| 384 |
|
| 385 |
# List: submitted must contain all expected elements
|
|
@@ -388,22 +572,42 @@ class ApiDebugEnvironment(Environment):
|
|
| 388 |
|
| 389 |
return False
|
| 390 |
|
| 391 |
-
def
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 392 |
"""
|
| 393 |
Check if a fix payload correctly addresses an issue.
|
| 394 |
|
| 395 |
-
|
| 396 |
-
|
| 397 |
-
|
|
|
|
| 398 |
"""
|
|
|
|
|
|
|
| 399 |
# Direct key match with value validation
|
| 400 |
if issue.fix_key in fix_payload:
|
|
|
|
| 401 |
expected_val = issue.expected_fix.get(issue.fix_key)
|
| 402 |
if expected_val is not None:
|
| 403 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 404 |
|
| 405 |
-
# If the submitted value is a dict and expected_fix has nested keys
|
| 406 |
-
# validate the nested key-value pairs inside the dict
|
| 407 |
submitted_val = fix_payload[issue.fix_key]
|
| 408 |
if isinstance(submitted_val, dict):
|
| 409 |
nested_prefix = issue.fix_key + "."
|
|
@@ -413,38 +617,58 @@ class ApiDebugEnvironment(Environment):
|
|
| 413 |
if k.startswith(nested_prefix)
|
| 414 |
}
|
| 415 |
if nested_expected:
|
| 416 |
-
|
| 417 |
-
return all(
|
| 418 |
k in submitted_val and self._values_match(v, submitted_val[k])
|
| 419 |
for k, v in nested_expected.items()
|
| 420 |
)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 421 |
|
| 422 |
-
|
|
|
|
|
|
|
| 423 |
|
| 424 |
# Check nested key (e.g., "headers.Authorization" -> check payload for "Authorization")
|
| 425 |
if "." in issue.fix_key:
|
| 426 |
parts = issue.fix_key.split(".")
|
| 427 |
leaf_key = parts[-1]
|
| 428 |
if leaf_key in fix_payload:
|
|
|
|
| 429 |
expected_val = issue.expected_fix.get(issue.fix_key)
|
| 430 |
if expected_val is not None:
|
| 431 |
-
|
| 432 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 433 |
|
| 434 |
# Check expected fix keys with value validation
|
| 435 |
for key, expected_val in issue.expected_fix.items():
|
| 436 |
# Direct key in payload
|
| 437 |
if key in fix_payload:
|
|
|
|
| 438 |
if self._values_match(expected_val, fix_payload[key]):
|
| 439 |
-
return
|
| 440 |
# Nested key leaf match
|
| 441 |
if "." in key:
|
| 442 |
leaf = key.split(".")[-1]
|
| 443 |
if leaf in fix_payload:
|
|
|
|
| 444 |
if self._values_match(expected_val, fix_payload[leaf]):
|
| 445 |
-
return
|
| 446 |
|
| 447 |
-
|
|
|
|
|
|
|
| 448 |
|
| 449 |
def _apply_fix(self, target: str, fix_payload: Dict[str, Any]) -> None:
|
| 450 |
"""Apply a fix to the current configuration."""
|
|
@@ -466,7 +690,7 @@ class ApiDebugEnvironment(Environment):
|
|
| 466 |
config[key] = value
|
| 467 |
|
| 468 |
def _get_hints(self) -> List[str]:
|
| 469 |
-
"""Return progressive hints based on step count."""
|
| 470 |
if self._scenario is None:
|
| 471 |
return []
|
| 472 |
|
|
@@ -478,6 +702,8 @@ class ApiDebugEnvironment(Environment):
|
|
| 478 |
if step == 0:
|
| 479 |
hints.append("Start by inspecting error logs for each service to find clues.")
|
| 480 |
hints.append(f"There are {total_issues} issues to find and fix.")
|
|
|
|
|
|
|
| 481 |
elif step > 0 and len(self._issues_found) == 0:
|
| 482 |
hints.append("Try 'inspect_logs' on different services to find error patterns.")
|
| 483 |
elif len(self._issues_found) > 0 and len(self._issues_fixed) == 0:
|
|
@@ -485,24 +711,44 @@ class ApiDebugEnvironment(Environment):
|
|
| 485 |
elif unfixed > 0:
|
| 486 |
hints.append(f"{unfixed} issue(s) remaining. Check services you haven't inspected yet.")
|
| 487 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 488 |
# Late-game hints
|
| 489 |
if self._scenario.max_steps - step <= 5 and unfixed > 0:
|
| 490 |
-
# Give more specific hints when running low on steps
|
| 491 |
for issue in self._scenario.issues:
|
| 492 |
if issue.issue_id not in self._issues_fixed:
|
| 493 |
-
hints.append(
|
|
|
|
|
|
|
| 494 |
|
| 495 |
return hints
|
| 496 |
|
| 497 |
-
# βββ Grading ββββββββββββββββββββββββββββββββββββββββ
|
| 498 |
|
| 499 |
def grade(self) -> float:
|
| 500 |
"""
|
| 501 |
-
Grade the agent's performance
|
| 502 |
|
| 503 |
-
Score = (
|
| 504 |
-
|
| 505 |
-
|
|
|
|
|
|
|
|
|
|
| 506 |
|
| 507 |
Returns:
|
| 508 |
Score strictly between 0 and 1 (exclusive): in range (0.001, 0.999)
|
|
@@ -514,18 +760,111 @@ class ApiDebugEnvironment(Environment):
|
|
| 514 |
if total == 0:
|
| 515 |
return 0.999
|
| 516 |
|
|
|
|
| 517 |
fix_ratio = len(self._issues_fixed) / total
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 518 |
remaining = max(0, self._scenario.max_steps - self._state.step_count)
|
| 519 |
-
|
| 520 |
|
| 521 |
-
#
|
| 522 |
-
|
| 523 |
|
| 524 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 525 |
|
| 526 |
# Clamp strictly to (0.001, 0.999) β NEVER exactly 0.0 or 1.0
|
| 527 |
return max(0.001, min(0.999, round(score, 4)))
|
| 528 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 529 |
def get_task_info(self) -> Dict[str, Any]:
|
| 530 |
"""Return information about the current task."""
|
| 531 |
if self._scenario is None:
|
|
@@ -538,6 +877,11 @@ class ApiDebugEnvironment(Environment):
|
|
| 538 |
"max_steps": self._scenario.max_steps,
|
| 539 |
"issues_total": len(self._scenario.issues),
|
| 540 |
"services": self._scenario.services,
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 541 |
"action_schema": {
|
| 542 |
"action_type": {
|
| 543 |
"type": "string",
|
|
|
|
| 10 |
A real-world environment where an AI agent diagnoses and fixes broken
|
| 11 |
API integrations by reading error logs, inspecting configurations,
|
| 12 |
and submitting corrected configurations.
|
| 13 |
+
|
| 14 |
+
Key design features:
|
| 15 |
+
- Dynamic state: fixing issues changes service health and produces new logs
|
| 16 |
+
- Cascading failures: upstream fixes reveal downstream issues
|
| 17 |
+
- Multi-dimensional rubric grading (diagnosis, fix, efficiency, strategy)
|
| 18 |
+
- Rich reward signal with partial credit and diminishing returns
|
| 19 |
"""
|
| 20 |
|
| 21 |
import copy
|
| 22 |
+
from typing import Any, Dict, List, Optional, Set, Tuple
|
| 23 |
from uuid import uuid4
|
| 24 |
|
| 25 |
from openenv.core.env_server.interfaces import Environment
|
|
|
|
| 43 |
3. Testing endpoints to observe failures
|
| 44 |
4. Submitting configuration fixes
|
| 45 |
|
| 46 |
+
Supports 3 difficulty levels (easy, medium, hard) with cascading
|
| 47 |
+
failure dynamics and multi-dimensional grading.
|
| 48 |
"""
|
| 49 |
|
| 50 |
SUPPORTS_CONCURRENT_SESSIONS: bool = True
|
|
|
|
| 66 |
self._done = False
|
| 67 |
self._last_action_result = ""
|
| 68 |
self._cumulative_reward = 0.0
|
| 69 |
+
# Dynamic state tracking
|
| 70 |
+
self._service_health: Dict[str, str] = {}
|
| 71 |
+
self._dynamic_log_buffer: Dict[str, List[str]] = {}
|
| 72 |
+
# Strategy tracking for grading
|
| 73 |
+
self._action_history: List[Dict[str, Any]] = []
|
| 74 |
+
self._diagnosed_before_fix: Set[str] = set()
|
| 75 |
+
# Track which services were inspected before a fix was submitted
|
| 76 |
|
| 77 |
def reset(self, task_id: Optional[str] = None, seed: Optional[int] = None) -> ApiDebugObservation:
|
| 78 |
"""
|
|
|
|
| 97 |
self._done = False
|
| 98 |
self._last_action_result = ""
|
| 99 |
self._cumulative_reward = 0.0
|
| 100 |
+
self._action_history = []
|
| 101 |
+
self._diagnosed_before_fix = set()
|
| 102 |
+
|
| 103 |
+
# Initialize service health from scenario graph
|
| 104 |
+
self._service_health = {}
|
| 105 |
+
for svc_name, node in self._scenario.service_graph.items():
|
| 106 |
+
self._service_health[svc_name] = node.health_status
|
| 107 |
+
# Fill in any services not in graph
|
| 108 |
+
for svc in self._scenario.services:
|
| 109 |
+
if svc not in self._service_health:
|
| 110 |
+
self._service_health[svc] = "unknown"
|
| 111 |
+
|
| 112 |
+
# Initialize dynamic log buffer
|
| 113 |
+
self._dynamic_log_buffer = {svc: [] for svc in self._scenario.services}
|
| 114 |
+
|
| 115 |
+
# Build dependency graph for observation
|
| 116 |
+
dep_graph = {}
|
| 117 |
+
for svc_name, node in self._scenario.service_graph.items():
|
| 118 |
+
dep_graph[svc_name] = node.depends_on
|
| 119 |
|
| 120 |
return ApiDebugObservation(
|
| 121 |
task_id=self._task_id,
|
|
|
|
| 132 |
available_targets=self._scenario.services,
|
| 133 |
done=False,
|
| 134 |
reward=0.0,
|
| 135 |
+
service_status=dict(self._service_health),
|
| 136 |
+
dependency_graph=dep_graph,
|
| 137 |
+
error_trace=self._build_error_trace(),
|
| 138 |
)
|
| 139 |
|
| 140 |
def step(self, action: ApiDebugAction) -> ApiDebugObservation: # type: ignore[override]
|
|
|
|
| 159 |
config_snapshot: Dict[str, Any] = {}
|
| 160 |
api_response: Optional[Dict[str, Any]] = None
|
| 161 |
|
| 162 |
+
# Record action for strategy scoring
|
| 163 |
+
self._action_history.append({
|
| 164 |
+
"step": self._state.step_count,
|
| 165 |
+
"action_type": action.action_type,
|
| 166 |
+
"target": action.target,
|
| 167 |
+
})
|
| 168 |
+
|
| 169 |
# Validate target
|
| 170 |
if action.target not in self._scenario.services:
|
| 171 |
self._last_action_result = (
|
|
|
|
| 204 |
self._done = True
|
| 205 |
self._last_action_result += " β° Out of steps. Episode ended."
|
| 206 |
|
| 207 |
+
# Build dependency graph
|
| 208 |
+
dep_graph = {}
|
| 209 |
+
for svc_name, node in self._scenario.service_graph.items():
|
| 210 |
+
dep_graph[svc_name] = node.depends_on
|
| 211 |
+
|
| 212 |
return ApiDebugObservation(
|
| 213 |
task_id=self._task_id,
|
| 214 |
task_description=self._scenario.description,
|
|
|
|
| 224 |
available_targets=self._scenario.services,
|
| 225 |
done=self._done,
|
| 226 |
reward=reward,
|
| 227 |
+
service_status=dict(self._service_health),
|
| 228 |
+
dependency_graph=dep_graph,
|
| 229 |
+
error_trace=self._build_error_trace(),
|
| 230 |
metadata={
|
| 231 |
"cumulative_reward": self._cumulative_reward,
|
| 232 |
"step": self._state.step_count,
|
|
|
|
| 245 |
def _handle_inspect_logs(self, target: str) -> tuple:
|
| 246 |
"""Return logs for a service and reward for relevant inspection."""
|
| 247 |
assert self._scenario is not None
|
| 248 |
+
# Combine static logs with dynamic logs from fixes
|
| 249 |
+
static_logs = self._scenario.logs.get(target, [])
|
| 250 |
+
dynamic_logs = self._dynamic_log_buffer.get(target, [])
|
| 251 |
+
logs = static_logs + dynamic_logs
|
| 252 |
+
|
| 253 |
inspect_key = f"logs:{target}"
|
| 254 |
is_repeat = inspect_key in self._inspected_targets
|
| 255 |
self._inspected_targets.add(inspect_key)
|
| 256 |
|
| 257 |
+
# Track that this service was inspected (for strategy scoring)
|
| 258 |
+
self._diagnosed_before_fix.add(target)
|
| 259 |
+
|
| 260 |
# Check if any unfound issues have log hints in these logs
|
| 261 |
found_new = False
|
| 262 |
for issue in self._scenario.issues:
|
|
|
|
| 269 |
if found_new:
|
| 270 |
reward = 0.15
|
| 271 |
self._last_action_result = f"Inspected logs for '{target}'. Found relevant error patterns!"
|
| 272 |
+
elif is_repeat and not dynamic_logs:
|
| 273 |
+
reward = 0.0 # No reward for re-inspecting same logs with no changes
|
| 274 |
self._last_action_result = f"Re-inspected logs for '{target}'. No new information."
|
| 275 |
+
elif is_repeat and dynamic_logs:
|
| 276 |
+
reward = 0.05 # Some reward for checking updated logs
|
| 277 |
+
self._last_action_result = f"Re-inspected logs for '{target}'. New entries found after recent fixes."
|
| 278 |
elif logs:
|
| 279 |
reward = 0.05
|
| 280 |
self._last_action_result = f"Inspected logs for '{target}'. {len(logs)} log entries found."
|
|
|
|
| 292 |
is_repeat = inspect_key in self._inspected_targets
|
| 293 |
self._inspected_targets.add(inspect_key)
|
| 294 |
|
| 295 |
+
# Track that this service was inspected (for strategy scoring)
|
| 296 |
+
self._diagnosed_before_fix.add(target)
|
| 297 |
+
|
| 298 |
# Reward based on relevance and novelty
|
| 299 |
+
has_issues = any(
|
| 300 |
+
i.service == target
|
| 301 |
+
for i in self._scenario.issues
|
| 302 |
+
if i.issue_id not in self._issues_fixed
|
| 303 |
+
)
|
| 304 |
if is_repeat:
|
| 305 |
reward = 0.0 # No reward for re-inspecting same config
|
| 306 |
self._last_action_result = f"Re-inspected config for '{target}'. No changes since last check."
|
|
|
|
| 314 |
return config, reward
|
| 315 |
|
| 316 |
def _handle_inspect_endpoint(self, target: str) -> tuple:
|
| 317 |
+
"""Simulate testing an endpoint. Response changes based on current fix state."""
|
| 318 |
assert self._scenario is not None
|
| 319 |
|
| 320 |
+
# Track that this service was inspected
|
| 321 |
+
self._diagnosed_before_fix.add(target)
|
| 322 |
+
|
| 323 |
# Find unfixed issues for this service
|
| 324 |
unfixed = [
|
| 325 |
i for i in self._scenario.issues
|
| 326 |
if i.service == target and i.issue_id not in self._issues_fixed
|
| 327 |
]
|
| 328 |
|
| 329 |
+
# Also check if any DEPENDENCY issues are unfixed (cascade simulation)
|
| 330 |
+
upstream_broken = False
|
| 331 |
+
if target in self._scenario.service_graph:
|
| 332 |
+
node = self._scenario.service_graph[target]
|
| 333 |
+
for dep_svc in node.depends_on:
|
| 334 |
+
dep_unfixed = [
|
| 335 |
+
i for i in self._scenario.issues
|
| 336 |
+
if i.service == dep_svc and i.issue_id not in self._issues_fixed
|
| 337 |
+
]
|
| 338 |
+
if dep_unfixed:
|
| 339 |
+
upstream_broken = True
|
| 340 |
+
|
| 341 |
if unfixed:
|
|
|
|
| 342 |
issue = unfixed[0]
|
| 343 |
+
# Determine status code based on issue category
|
| 344 |
+
status_codes = {
|
| 345 |
+
"authentication": 401,
|
| 346 |
+
"protocol": 415,
|
| 347 |
+
"networking": 504,
|
| 348 |
+
"configuration": 500,
|
| 349 |
+
}
|
| 350 |
+
status_code = status_codes.get(issue.category, 500)
|
| 351 |
api_response = {
|
| 352 |
"status": "error",
|
| 353 |
+
"status_code": status_code,
|
| 354 |
"error": issue.description,
|
| 355 |
+
"hint": f"Check the {issue.fix_key} configuration for '{target}'",
|
| 356 |
+
"service_health": self._service_health.get(target, "unknown"),
|
| 357 |
}
|
| 358 |
reward = 0.05
|
| 359 |
+
self._last_action_result = f"Tested endpoint on '{target}'. Got {status_code} error response."
|
| 360 |
+
elif upstream_broken:
|
| 361 |
+
api_response = {
|
| 362 |
+
"status": "degraded",
|
| 363 |
+
"status_code": 503,
|
| 364 |
+
"error": f"{target} configuration is correct but upstream dependencies are failing.",
|
| 365 |
+
"hint": "Fix upstream services first β check the dependency graph.",
|
| 366 |
+
"service_health": "degraded",
|
| 367 |
+
}
|
| 368 |
+
reward = 0.03
|
| 369 |
+
self._last_action_result = f"Tested '{target}'. Service config OK but upstream is broken."
|
| 370 |
else:
|
| 371 |
api_response = {
|
| 372 |
"status": "success",
|
| 373 |
"status_code": 200,
|
| 374 |
"message": f"{target} is working correctly.",
|
| 375 |
+
"service_health": "healthy",
|
| 376 |
}
|
| 377 |
reward = 0.02
|
| 378 |
self._last_action_result = f"Tested endpoint on '{target}'. Service responding OK."
|
|
|
|
| 380 |
return api_response, reward
|
| 381 |
|
| 382 |
def _handle_submit_fix(self, target: str, fix_payload: Dict[str, Any]) -> float:
|
| 383 |
+
"""Process a fix submission with strict validation and cascade effects."""
|
| 384 |
assert self._scenario is not None
|
| 385 |
|
| 386 |
if not fix_payload:
|
|
|
|
| 399 |
|
| 400 |
reward = 0.0
|
| 401 |
fixed_any = False
|
| 402 |
+
partial_credit = False
|
| 403 |
+
|
| 404 |
+
# Check if the agent inspected this service before submitting
|
| 405 |
+
inspected_first = target in self._diagnosed_before_fix
|
| 406 |
|
| 407 |
for issue in target_issues:
|
| 408 |
+
match_result = self._check_fix(issue, fix_payload)
|
| 409 |
+
if match_result == "exact":
|
| 410 |
self._issues_fixed.add(issue.issue_id)
|
| 411 |
+
self._issues_found.add(issue.issue_id)
|
| 412 |
self._apply_fix(target, fix_payload)
|
| 413 |
+
self._update_service_health(issue)
|
| 414 |
+
self._inject_dynamic_logs(issue)
|
| 415 |
reward += 0.25
|
| 416 |
fixed_any = True
|
| 417 |
+
# Bonus for inspecting before fixing (strategy reward)
|
| 418 |
+
if inspected_first:
|
| 419 |
+
reward += 0.05
|
| 420 |
+
elif match_result == "partial":
|
| 421 |
+
# Right key, close value β give partial credit
|
| 422 |
+
partial_credit = True
|
| 423 |
+
reward += 0.03
|
| 424 |
|
| 425 |
if fixed_any:
|
| 426 |
fixed_count = sum(1 for i in target_issues if i.issue_id in self._issues_fixed)
|
|
|
|
| 429 |
f"Fixed {fixed_count} issue(s). "
|
| 430 |
f"Total fixed: {len(self._issues_fixed)}/{len(self._scenario.issues)}"
|
| 431 |
)
|
| 432 |
+
elif partial_credit:
|
| 433 |
+
self._last_action_result = (
|
| 434 |
+
f"Fix partially correct for '{target}'. "
|
| 435 |
+
"The key is right but the value isn't quite right. Check the logs for exact values."
|
| 436 |
+
)
|
| 437 |
else:
|
| 438 |
self._last_action_result = (
|
| 439 |
f"Fix rejected for '{target}'. The payload doesn't address any known issues. "
|
|
|
|
| 443 |
|
| 444 |
return reward
|
| 445 |
|
| 446 |
+
# βββ Dynamic State Methods ββββββββββββββββββββββββββββββββββββββββββββ
|
| 447 |
+
|
| 448 |
+
def _update_service_health(self, fixed_issue: Issue) -> None:
|
| 449 |
+
"""Update service health status after an issue is fixed."""
|
| 450 |
+
assert self._scenario is not None
|
| 451 |
+
|
| 452 |
+
# Check if the fixed service has any remaining issues
|
| 453 |
+
remaining = [
|
| 454 |
+
i for i in self._scenario.issues
|
| 455 |
+
if i.service == fixed_issue.service and i.issue_id not in self._issues_fixed
|
| 456 |
+
]
|
| 457 |
+
if not remaining:
|
| 458 |
+
self._service_health[fixed_issue.service] = "healthy"
|
| 459 |
+
else:
|
| 460 |
+
self._service_health[fixed_issue.service] = "degraded"
|
| 461 |
+
|
| 462 |
+
# Update downstream services affected by cascade
|
| 463 |
+
for affected_svc, _effect in fixed_issue.cascade_effects.items():
|
| 464 |
+
if affected_svc in self._service_health:
|
| 465 |
+
# Check if the affected service still has its own issues
|
| 466 |
+
svc_issues = [
|
| 467 |
+
i for i in self._scenario.issues
|
| 468 |
+
if i.service == affected_svc and i.issue_id not in self._issues_fixed
|
| 469 |
+
]
|
| 470 |
+
if not svc_issues:
|
| 471 |
+
# Check if all upstream deps are healthy
|
| 472 |
+
if affected_svc in self._scenario.service_graph:
|
| 473 |
+
upstream_healthy = all(
|
| 474 |
+
self._service_health.get(dep, "error") == "healthy"
|
| 475 |
+
for dep in self._scenario.service_graph[affected_svc].depends_on
|
| 476 |
+
)
|
| 477 |
+
if upstream_healthy:
|
| 478 |
+
self._service_health[affected_svc] = "healthy"
|
| 479 |
+
else:
|
| 480 |
+
self._service_health[affected_svc] = "degraded"
|
| 481 |
+
else:
|
| 482 |
+
self._service_health[affected_svc] = "healthy"
|
| 483 |
+
|
| 484 |
+
def _inject_dynamic_logs(self, fixed_issue: Issue) -> None:
|
| 485 |
+
"""Inject new log entries after an issue is fixed."""
|
| 486 |
+
assert self._scenario is not None
|
| 487 |
+
if fixed_issue.issue_id in self._scenario.dynamic_logs:
|
| 488 |
+
for svc, new_logs in self._scenario.dynamic_logs[fixed_issue.issue_id].items():
|
| 489 |
+
if svc in self._dynamic_log_buffer:
|
| 490 |
+
self._dynamic_log_buffer[svc].extend(new_logs)
|
| 491 |
+
|
| 492 |
+
def _build_error_trace(self) -> List[str]:
|
| 493 |
+
"""Build an error propagation trace showing cascade chain."""
|
| 494 |
+
if self._scenario is None:
|
| 495 |
+
return []
|
| 496 |
+
|
| 497 |
+
trace = []
|
| 498 |
+
for issue in self._scenario.issues:
|
| 499 |
+
if issue.issue_id not in self._issues_fixed:
|
| 500 |
+
trace.append(
|
| 501 |
+
f"[{issue.severity.upper()}] {issue.service}: {issue.description}"
|
| 502 |
+
)
|
| 503 |
+
for affected_svc, effect in issue.cascade_effects.items():
|
| 504 |
+
trace.append(f" ββ> {affected_svc}: {effect}")
|
| 505 |
+
|
| 506 |
+
if not trace:
|
| 507 |
+
trace.append("All issues resolved. No error cascades active.")
|
| 508 |
+
|
| 509 |
+
return trace
|
| 510 |
+
|
| 511 |
# βββ Helper Methods βββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 512 |
|
| 513 |
@staticmethod
|
|
|
|
| 528 |
Supports:
|
| 529 |
- Exact match
|
| 530 |
- Case-insensitive string match
|
| 531 |
+
- Numeric tolerance (10%)
|
| 532 |
- Boolean coercion (e.g., "true" -> True)
|
| 533 |
- List containment (submitted must contain all expected elements)
|
| 534 |
- Pattern match for token-like values (Bearer <anything> matches Bearer <token>)
|
|
|
|
| 541 |
if norm_expected == norm_submitted:
|
| 542 |
return True
|
| 543 |
|
| 544 |
+
# Numeric comparison with tolerance (10% β tighter than before)
|
| 545 |
if isinstance(expected, (int, float)) and isinstance(submitted, (int, float)):
|
| 546 |
if expected == 0:
|
| 547 |
return submitted == 0
|
| 548 |
+
return abs(expected - submitted) / max(abs(expected), 1) < 0.10
|
| 549 |
|
| 550 |
# Boolean coercion
|
| 551 |
if isinstance(expected, bool):
|
|
|
|
| 564 |
return True
|
| 565 |
# If submitted has same prefix structure
|
| 566 |
if exp_lower.startswith("bearer ") and sub_lower.startswith("bearer "):
|
|
|
|
| 567 |
return len(sub_lower) > len("bearer ")
|
| 568 |
|
| 569 |
# List: submitted must contain all expected elements
|
|
|
|
| 572 |
|
| 573 |
return False
|
| 574 |
|
| 575 |
+
def _values_close(self, expected: Any, submitted: Any) -> bool:
|
| 576 |
+
"""Check if values are 'close' for partial credit (same type, right ballpark)."""
|
| 577 |
+
if isinstance(expected, (int, float)) and isinstance(submitted, (int, float)):
|
| 578 |
+
if expected == 0:
|
| 579 |
+
return abs(submitted) < 5
|
| 580 |
+
return abs(expected - submitted) / max(abs(expected), 1) < 0.50
|
| 581 |
+
if isinstance(expected, str) and isinstance(submitted, str):
|
| 582 |
+
# Same prefix / similar structure
|
| 583 |
+
return expected.split("/")[0].lower() == submitted.split("/")[0].lower()
|
| 584 |
+
if isinstance(expected, bool) and isinstance(submitted, bool):
|
| 585 |
+
return True # Right type at least
|
| 586 |
+
return False
|
| 587 |
+
|
| 588 |
+
def _check_fix(self, issue: Issue, fix_payload: Dict[str, Any]) -> str:
|
| 589 |
"""
|
| 590 |
Check if a fix payload correctly addresses an issue.
|
| 591 |
|
| 592 |
+
Returns:
|
| 593 |
+
'exact' if fix is correct
|
| 594 |
+
'partial' if fix has right key but wrong value
|
| 595 |
+
'none' if fix doesn't match at all
|
| 596 |
"""
|
| 597 |
+
found_key = False
|
| 598 |
+
|
| 599 |
# Direct key match with value validation
|
| 600 |
if issue.fix_key in fix_payload:
|
| 601 |
+
found_key = True
|
| 602 |
expected_val = issue.expected_fix.get(issue.fix_key)
|
| 603 |
if expected_val is not None:
|
| 604 |
+
if self._values_match(expected_val, fix_payload[issue.fix_key]):
|
| 605 |
+
return "exact"
|
| 606 |
+
elif self._values_close(expected_val, fix_payload[issue.fix_key]):
|
| 607 |
+
return "partial"
|
| 608 |
+
return "none" # Right key, wrong value
|
| 609 |
|
| 610 |
+
# If the submitted value is a dict and expected_fix has nested keys
|
|
|
|
| 611 |
submitted_val = fix_payload[issue.fix_key]
|
| 612 |
if isinstance(submitted_val, dict):
|
| 613 |
nested_prefix = issue.fix_key + "."
|
|
|
|
| 617 |
if k.startswith(nested_prefix)
|
| 618 |
}
|
| 619 |
if nested_expected:
|
| 620 |
+
all_match = all(
|
|
|
|
| 621 |
k in submitted_val and self._values_match(v, submitted_val[k])
|
| 622 |
for k, v in nested_expected.items()
|
| 623 |
)
|
| 624 |
+
if all_match:
|
| 625 |
+
return "exact"
|
| 626 |
+
# Check partial
|
| 627 |
+
any_match = any(
|
| 628 |
+
k in submitted_val and self._values_match(v, submitted_val[k])
|
| 629 |
+
for k, v in nested_expected.items()
|
| 630 |
+
)
|
| 631 |
+
if any_match:
|
| 632 |
+
return "partial"
|
| 633 |
+
return "none"
|
| 634 |
|
| 635 |
+
# No expected value found β this shouldn't happen with well-defined issues
|
| 636 |
+
# Do NOT accept blindly β require value validation
|
| 637 |
+
return "none"
|
| 638 |
|
| 639 |
# Check nested key (e.g., "headers.Authorization" -> check payload for "Authorization")
|
| 640 |
if "." in issue.fix_key:
|
| 641 |
parts = issue.fix_key.split(".")
|
| 642 |
leaf_key = parts[-1]
|
| 643 |
if leaf_key in fix_payload:
|
| 644 |
+
found_key = True
|
| 645 |
expected_val = issue.expected_fix.get(issue.fix_key)
|
| 646 |
if expected_val is not None:
|
| 647 |
+
if self._values_match(expected_val, fix_payload[leaf_key]):
|
| 648 |
+
return "exact"
|
| 649 |
+
elif self._values_close(expected_val, fix_payload[leaf_key]):
|
| 650 |
+
return "partial"
|
| 651 |
+
return "none"
|
| 652 |
+
return "none"
|
| 653 |
|
| 654 |
# Check expected fix keys with value validation
|
| 655 |
for key, expected_val in issue.expected_fix.items():
|
| 656 |
# Direct key in payload
|
| 657 |
if key in fix_payload:
|
| 658 |
+
found_key = True
|
| 659 |
if self._values_match(expected_val, fix_payload[key]):
|
| 660 |
+
return "exact"
|
| 661 |
# Nested key leaf match
|
| 662 |
if "." in key:
|
| 663 |
leaf = key.split(".")[-1]
|
| 664 |
if leaf in fix_payload:
|
| 665 |
+
found_key = True
|
| 666 |
if self._values_match(expected_val, fix_payload[leaf]):
|
| 667 |
+
return "exact"
|
| 668 |
|
| 669 |
+
if found_key:
|
| 670 |
+
return "partial" # Found the key but value didn't match
|
| 671 |
+
return "none"
|
| 672 |
|
| 673 |
def _apply_fix(self, target: str, fix_payload: Dict[str, Any]) -> None:
|
| 674 |
"""Apply a fix to the current configuration."""
|
|
|
|
| 690 |
config[key] = value
|
| 691 |
|
| 692 |
def _get_hints(self) -> List[str]:
|
| 693 |
+
"""Return progressive hints based on step count and progress."""
|
| 694 |
if self._scenario is None:
|
| 695 |
return []
|
| 696 |
|
|
|
|
| 702 |
if step == 0:
|
| 703 |
hints.append("Start by inspecting error logs for each service to find clues.")
|
| 704 |
hints.append(f"There are {total_issues} issues to find and fix.")
|
| 705 |
+
if self._scenario.context:
|
| 706 |
+
hints.append(f"Context: {self._scenario.context}")
|
| 707 |
elif step > 0 and len(self._issues_found) == 0:
|
| 708 |
hints.append("Try 'inspect_logs' on different services to find error patterns.")
|
| 709 |
elif len(self._issues_found) > 0 and len(self._issues_fixed) == 0:
|
|
|
|
| 711 |
elif unfixed > 0:
|
| 712 |
hints.append(f"{unfixed} issue(s) remaining. Check services you haven't inspected yet.")
|
| 713 |
|
| 714 |
+
# Dependency hints
|
| 715 |
+
for issue in self._scenario.issues:
|
| 716 |
+
if issue.issue_id not in self._issues_fixed and issue.depends_on:
|
| 717 |
+
deps_met = all(d in self._issues_fixed for d in issue.depends_on)
|
| 718 |
+
if not deps_met:
|
| 719 |
+
dep_names = [
|
| 720 |
+
next((i.service for i in self._scenario.issues if i.issue_id == d), d)
|
| 721 |
+
for d in issue.depends_on
|
| 722 |
+
]
|
| 723 |
+
if len(self._issues_fixed) > 0:
|
| 724 |
+
hints.append(
|
| 725 |
+
f"Some issues may be masked by upstream failures. "
|
| 726 |
+
f"Check services: {', '.join(set(dep_names))}"
|
| 727 |
+
)
|
| 728 |
+
break
|
| 729 |
+
|
| 730 |
# Late-game hints
|
| 731 |
if self._scenario.max_steps - step <= 5 and unfixed > 0:
|
|
|
|
| 732 |
for issue in self._scenario.issues:
|
| 733 |
if issue.issue_id not in self._issues_fixed:
|
| 734 |
+
hints.append(
|
| 735 |
+
f"Hint: Check '{issue.service}' β look for '{issue.fix_key}' in the config."
|
| 736 |
+
)
|
| 737 |
|
| 738 |
return hints
|
| 739 |
|
| 740 |
+
# βββ Multi-Dimensional Grading ββββββββββββββββββββββββββββββββββββββββ
|
| 741 |
|
| 742 |
def grade(self) -> float:
|
| 743 |
"""
|
| 744 |
+
Grade the agent's performance using a multi-dimensional rubric.
|
| 745 |
|
| 746 |
+
Score = weighted_average(
|
| 747 |
+
diagnosis_score Γ 0.20, # Did the agent inspect before fixing?
|
| 748 |
+
fix_score Γ 0.40, # Issues fixed / total
|
| 749 |
+
efficiency_score Γ 0.15, # Steps used vs available
|
| 750 |
+
strategy_score Γ 0.25, # Logical debugging approach
|
| 751 |
+
)
|
| 752 |
|
| 753 |
Returns:
|
| 754 |
Score strictly between 0 and 1 (exclusive): in range (0.001, 0.999)
|
|
|
|
| 760 |
if total == 0:
|
| 761 |
return 0.999
|
| 762 |
|
| 763 |
+
# 1. Fix Score (40% weight) β most important
|
| 764 |
fix_ratio = len(self._issues_fixed) / total
|
| 765 |
+
fix_score = fix_ratio
|
| 766 |
+
|
| 767 |
+
# 2. Diagnosis Score (20% weight) β did you inspect before fixing?
|
| 768 |
+
if self._issues_fixed:
|
| 769 |
+
diagnosed_count = sum(
|
| 770 |
+
1 for issue_id in self._issues_fixed
|
| 771 |
+
if any(
|
| 772 |
+
i.service in self._diagnosed_before_fix
|
| 773 |
+
for i in self._scenario.issues
|
| 774 |
+
if i.issue_id == issue_id
|
| 775 |
+
)
|
| 776 |
+
)
|
| 777 |
+
diagnosis_score = diagnosed_count / len(self._issues_fixed)
|
| 778 |
+
else:
|
| 779 |
+
# Give partial credit for exploration even without fixes
|
| 780 |
+
diagnosis_score = min(1.0, len(self._inspected_targets) / max(1, len(self._scenario.services)))
|
| 781 |
+
|
| 782 |
+
# 3. Efficiency Score (15% weight) β faster is better
|
| 783 |
remaining = max(0, self._scenario.max_steps - self._state.step_count)
|
| 784 |
+
efficiency_score = remaining / self._scenario.max_steps
|
| 785 |
|
| 786 |
+
# 4. Strategy Score (25% weight) β logical debugging approach
|
| 787 |
+
strategy_score = self._compute_strategy_score()
|
| 788 |
|
| 789 |
+
# Weighted combination
|
| 790 |
+
score = (
|
| 791 |
+
fix_score * 0.40 +
|
| 792 |
+
diagnosis_score * 0.20 +
|
| 793 |
+
efficiency_score * 0.15 +
|
| 794 |
+
strategy_score * 0.25
|
| 795 |
+
)
|
| 796 |
|
| 797 |
# Clamp strictly to (0.001, 0.999) β NEVER exactly 0.0 or 1.0
|
| 798 |
return max(0.001, min(0.999, round(score, 4)))
|
| 799 |
|
| 800 |
+
def _compute_strategy_score(self) -> float:
|
| 801 |
+
"""
|
| 802 |
+
Score the agent's debugging strategy.
|
| 803 |
+
|
| 804 |
+
Good strategy:
|
| 805 |
+
- Inspect logs before configs (logs have more diagnostic info)
|
| 806 |
+
- Don't repeat the same inspection
|
| 807 |
+
- Fix issues in dependency order
|
| 808 |
+
- Don't submit fixes without inspecting first
|
| 809 |
+
"""
|
| 810 |
+
if not self._action_history:
|
| 811 |
+
return 0.0
|
| 812 |
+
|
| 813 |
+
score = 0.0
|
| 814 |
+
total_checks = 0
|
| 815 |
+
|
| 816 |
+
# Check 1: Did the agent inspect logs before submitting any fix?
|
| 817 |
+
first_fix_step = None
|
| 818 |
+
first_inspect_step = None
|
| 819 |
+
for action in self._action_history:
|
| 820 |
+
if action["action_type"] == "submit_fix" and first_fix_step is None:
|
| 821 |
+
first_fix_step = action["step"]
|
| 822 |
+
if action["action_type"] in ("inspect_logs", "inspect_config") and first_inspect_step is None:
|
| 823 |
+
first_inspect_step = action["step"]
|
| 824 |
+
|
| 825 |
+
total_checks += 1
|
| 826 |
+
if first_inspect_step is not None and (first_fix_step is None or first_inspect_step < first_fix_step):
|
| 827 |
+
score += 1.0 # Inspected before fixing
|
| 828 |
+
|
| 829 |
+
# Check 2: Ratio of unique inspections to total inspections
|
| 830 |
+
total_inspections = sum(
|
| 831 |
+
1 for a in self._action_history
|
| 832 |
+
if a["action_type"] in ("inspect_logs", "inspect_config", "inspect_endpoint")
|
| 833 |
+
)
|
| 834 |
+
unique_inspections = len(self._inspected_targets)
|
| 835 |
+
total_checks += 1
|
| 836 |
+
if total_inspections > 0:
|
| 837 |
+
score += min(1.0, unique_inspections / total_inspections)
|
| 838 |
+
|
| 839 |
+
# Check 3: Did fixes follow dependency order?
|
| 840 |
+
if self._scenario and self._scenario.optimal_fix_order and len(self._issues_fixed) > 1:
|
| 841 |
+
total_checks += 1
|
| 842 |
+
fix_order = []
|
| 843 |
+
for action in self._action_history:
|
| 844 |
+
if action["action_type"] == "submit_fix":
|
| 845 |
+
# Find which issue was fixed in this step
|
| 846 |
+
for issue_id in self._issues_fixed:
|
| 847 |
+
issue = next((i for i in self._scenario.issues if i.issue_id == issue_id), None)
|
| 848 |
+
if issue and issue_id not in fix_order:
|
| 849 |
+
fix_order.append(issue_id)
|
| 850 |
+
|
| 851 |
+
# Compare fix order with optimal order
|
| 852 |
+
optimal = [o for o in self._scenario.optimal_fix_order if o in fix_order]
|
| 853 |
+
if len(optimal) > 1:
|
| 854 |
+
in_order = sum(
|
| 855 |
+
1 for i in range(len(fix_order) - 1)
|
| 856 |
+
if fix_order[i] in optimal and fix_order[i+1] in optimal
|
| 857 |
+
and optimal.index(fix_order[i]) < optimal.index(fix_order[i+1])
|
| 858 |
+
)
|
| 859 |
+
score += in_order / max(1, len(fix_order) - 1)
|
| 860 |
+
|
| 861 |
+
# Check 4: Did the agent use a variety of action types?
|
| 862 |
+
total_checks += 1
|
| 863 |
+
action_types_used = set(a["action_type"] for a in self._action_history)
|
| 864 |
+
score += len(action_types_used) / 4.0 # 4 possible action types
|
| 865 |
+
|
| 866 |
+
return score / total_checks if total_checks > 0 else 0.0
|
| 867 |
+
|
| 868 |
def get_task_info(self) -> Dict[str, Any]:
|
| 869 |
"""Return information about the current task."""
|
| 870 |
if self._scenario is None:
|
|
|
|
| 877 |
"max_steps": self._scenario.max_steps,
|
| 878 |
"issues_total": len(self._scenario.issues),
|
| 879 |
"services": self._scenario.services,
|
| 880 |
+
"service_dependencies": {
|
| 881 |
+
svc: node.depends_on
|
| 882 |
+
for svc, node in self._scenario.service_graph.items()
|
| 883 |
+
},
|
| 884 |
+
"context": self._scenario.context,
|
| 885 |
"action_schema": {
|
| 886 |
"action_type": {
|
| 887 |
"type": "string",
|
server/app.py
CHANGED
|
@@ -64,8 +64,15 @@ async def root():
|
|
| 64 |
"""Root endpoint β returns environment info and available endpoints."""
|
| 65 |
return {
|
| 66 |
"name": "api_debug_env",
|
| 67 |
-
"description": "API Integration Debugging Environment",
|
|
|
|
| 68 |
"status": "running",
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 69 |
"endpoints": ["/reset", "/step", "/state", "/tasks", "/grader", "/baseline", "/health", "/schema", "/docs"],
|
| 70 |
}
|
| 71 |
|
|
@@ -86,7 +93,7 @@ class BaselineRequest(BaseModel):
|
|
| 86 |
|
| 87 |
@app.get("/tasks")
|
| 88 |
async def list_tasks():
|
| 89 |
-
"""Return list of all tasks with action schema."""
|
| 90 |
tasks = []
|
| 91 |
for task_id in get_all_task_ids():
|
| 92 |
scenario = get_scenario(task_id)
|
|
@@ -97,6 +104,11 @@ async def list_tasks():
|
|
| 97 |
"max_steps": scenario.max_steps,
|
| 98 |
"issues_count": len(scenario.issues),
|
| 99 |
"services": scenario.services,
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 100 |
"action_schema": {
|
| 101 |
"action_type": {
|
| 102 |
"type": "string",
|
|
@@ -129,6 +141,12 @@ async def run_grader(request: GraderRequest):
|
|
| 129 |
"issues_fixed": len(env._issues_fixed),
|
| 130 |
"issues_total": len(env._scenario.issues) if env._scenario else 0,
|
| 131 |
"steps_used": env._state.step_count,
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 132 |
}
|
| 133 |
|
| 134 |
return {
|
|
@@ -142,13 +160,19 @@ async def run_grader(request: GraderRequest):
|
|
| 142 |
async def run_baseline(request: Optional[BaselineRequest] = None):
|
| 143 |
"""
|
| 144 |
Run a rule-based baseline agent on all tasks.
|
| 145 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 146 |
Returns baseline scores for each task.
|
| 147 |
"""
|
| 148 |
# Known fixes for each task (a heuristic baseline, not an LLM)
|
| 149 |
known_fixes = {
|
| 150 |
"easy": [
|
| 151 |
-
{"target": "payment_client", "fix": {"headers.Authorization": "Bearer sk_live_token123"
|
|
|
|
| 152 |
],
|
| 153 |
"medium": [
|
| 154 |
{"target": "webhook_sender", "fix": {"rate_limit.requests_per_second": 10}},
|
|
@@ -170,7 +194,7 @@ async def run_baseline(request: Optional[BaselineRequest] = None):
|
|
| 170 |
env = ApiDebugEnvironment(task_id=task_id)
|
| 171 |
obs = env.reset()
|
| 172 |
|
| 173 |
-
# Phase 1: Inspect all logs
|
| 174 |
for service in obs.available_targets:
|
| 175 |
if env._done:
|
| 176 |
break
|
|
@@ -179,7 +203,7 @@ async def run_baseline(request: Optional[BaselineRequest] = None):
|
|
| 179 |
target=service,
|
| 180 |
))
|
| 181 |
|
| 182 |
-
# Phase 2: Inspect
|
| 183 |
for service in obs.available_targets:
|
| 184 |
if env._done:
|
| 185 |
break
|
|
@@ -188,7 +212,16 @@ async def run_baseline(request: Optional[BaselineRequest] = None):
|
|
| 188 |
target=service,
|
| 189 |
))
|
| 190 |
|
| 191 |
-
# Phase 3:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 192 |
for fix_info in known_fixes.get(task_id, []):
|
| 193 |
if env._done:
|
| 194 |
break
|
|
|
|
| 64 |
"""Root endpoint β returns environment info and available endpoints."""
|
| 65 |
return {
|
| 66 |
"name": "api_debug_env",
|
| 67 |
+
"description": "API Integration Debugging Environment β diagnose and fix broken API integrations",
|
| 68 |
+
"version": "2.0.0",
|
| 69 |
"status": "running",
|
| 70 |
+
"features": [
|
| 71 |
+
"Cascading failure simulation",
|
| 72 |
+
"Dynamic service health tracking",
|
| 73 |
+
"Multi-dimensional rubric grading",
|
| 74 |
+
"Seed-based scenario randomization",
|
| 75 |
+
],
|
| 76 |
"endpoints": ["/reset", "/step", "/state", "/tasks", "/grader", "/baseline", "/health", "/schema", "/docs"],
|
| 77 |
}
|
| 78 |
|
|
|
|
| 93 |
|
| 94 |
@app.get("/tasks")
|
| 95 |
async def list_tasks():
|
| 96 |
+
"""Return list of all tasks with action schema and dependency info."""
|
| 97 |
tasks = []
|
| 98 |
for task_id in get_all_task_ids():
|
| 99 |
scenario = get_scenario(task_id)
|
|
|
|
| 104 |
"max_steps": scenario.max_steps,
|
| 105 |
"issues_count": len(scenario.issues),
|
| 106 |
"services": scenario.services,
|
| 107 |
+
"service_dependencies": {
|
| 108 |
+
svc: node.depends_on
|
| 109 |
+
for svc, node in scenario.service_graph.items()
|
| 110 |
+
},
|
| 111 |
+
"context": scenario.context,
|
| 112 |
"action_schema": {
|
| 113 |
"action_type": {
|
| 114 |
"type": "string",
|
|
|
|
| 141 |
"issues_fixed": len(env._issues_fixed),
|
| 142 |
"issues_total": len(env._scenario.issues) if env._scenario else 0,
|
| 143 |
"steps_used": env._state.step_count,
|
| 144 |
+
"grading_rubric": {
|
| 145 |
+
"fix_score_weight": 0.40,
|
| 146 |
+
"diagnosis_score_weight": 0.20,
|
| 147 |
+
"efficiency_score_weight": 0.15,
|
| 148 |
+
"strategy_score_weight": 0.25,
|
| 149 |
+
},
|
| 150 |
}
|
| 151 |
|
| 152 |
return {
|
|
|
|
| 160 |
async def run_baseline(request: Optional[BaselineRequest] = None):
|
| 161 |
"""
|
| 162 |
Run a rule-based baseline agent on all tasks.
|
| 163 |
+
|
| 164 |
+
The baseline follows a proper debugging strategy:
|
| 165 |
+
1. Inspect logs for each service (diagnosis phase)
|
| 166 |
+
2. Inspect configs for services with issues (investigation phase)
|
| 167 |
+
3. Submit known fixes (resolution phase)
|
| 168 |
+
|
| 169 |
Returns baseline scores for each task.
|
| 170 |
"""
|
| 171 |
# Known fixes for each task (a heuristic baseline, not an LLM)
|
| 172 |
known_fixes = {
|
| 173 |
"easy": [
|
| 174 |
+
{"target": "payment_client", "fix": {"headers.Authorization": "Bearer sk_live_token123"}},
|
| 175 |
+
{"target": "payment_client", "fix": {"headers.Content-Type": "application/json"}},
|
| 176 |
],
|
| 177 |
"medium": [
|
| 178 |
{"target": "webhook_sender", "fix": {"rate_limit.requests_per_second": 10}},
|
|
|
|
| 194 |
env = ApiDebugEnvironment(task_id=task_id)
|
| 195 |
obs = env.reset()
|
| 196 |
|
| 197 |
+
# Phase 1: Inspect all logs (proper diagnosis strategy)
|
| 198 |
for service in obs.available_targets:
|
| 199 |
if env._done:
|
| 200 |
break
|
|
|
|
| 203 |
target=service,
|
| 204 |
))
|
| 205 |
|
| 206 |
+
# Phase 2: Inspect configs for services that have issues
|
| 207 |
for service in obs.available_targets:
|
| 208 |
if env._done:
|
| 209 |
break
|
|
|
|
| 212 |
target=service,
|
| 213 |
))
|
| 214 |
|
| 215 |
+
# Phase 3: Test endpoints to observe failures
|
| 216 |
+
for service in obs.available_targets[:2]: # Just test a couple
|
| 217 |
+
if env._done:
|
| 218 |
+
break
|
| 219 |
+
obs = env.step(ApiDebugAction(
|
| 220 |
+
action_type="inspect_endpoint",
|
| 221 |
+
target=service,
|
| 222 |
+
))
|
| 223 |
+
|
| 224 |
+
# Phase 4: Submit fixes
|
| 225 |
for fix_info in known_fixes.get(task_id, []):
|
| 226 |
if env._done:
|
| 227 |
break
|
tests/__pycache__/__init__.cpython-313.pyc
DELETED
|
Binary file (165 Bytes)
|
|
|
tests/__pycache__/test_environment.cpython-313-pytest-8.4.1.pyc
DELETED
|
Binary file (66.9 kB)
|
|
|
tests/test_environment.py
CHANGED
|
@@ -7,11 +7,13 @@ Comprehensive tests for the API Integration Debugging Environment.
|
|
| 7 |
Tests cover:
|
| 8 |
- Environment reset and initialization
|
| 9 |
- Action handling (inspect_logs, inspect_config, inspect_endpoint, submit_fix)
|
| 10 |
-
-
|
| 11 |
-
- Fix validation (strict value matching)
|
| 12 |
- Episode termination conditions
|
| 13 |
- Repeated inspection penalty
|
| 14 |
-
- Seed-based reproducibility
|
|
|
|
|
|
|
| 15 |
"""
|
| 16 |
|
| 17 |
import sys
|
|
@@ -60,20 +62,20 @@ class TestScenarios:
|
|
| 60 |
s = get_scenario("hard")
|
| 61 |
assert len(s.issues) == 5
|
| 62 |
|
| 63 |
-
def
|
| 64 |
-
"""Same seed should produce same
|
| 65 |
s1 = get_scenario("easy", seed=42)
|
| 66 |
s2 = get_scenario("easy", seed=42)
|
| 67 |
-
|
| 68 |
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
#
|
| 74 |
-
#
|
| 75 |
-
|
| 76 |
-
assert
|
| 77 |
|
| 78 |
def test_each_issue_has_log_hint(self):
|
| 79 |
"""Every issue should have a corresponding log hint findable in the logs."""
|
|
@@ -90,6 +92,41 @@ class TestScenarios:
|
|
| 90 |
break
|
| 91 |
assert found, f"Issue {issue.issue_id} log_hint '{issue.log_hint}' not found in any logs"
|
| 92 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 93 |
|
| 94 |
# βββ Environment Reset Tests βββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 95 |
|
|
@@ -126,6 +163,25 @@ class TestEnvironmentReset:
|
|
| 126 |
obs = env.reset()
|
| 127 |
assert obs.reward == 0.0
|
| 128 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 129 |
|
| 130 |
# βββ Action Handler Tests ββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 131 |
|
|
@@ -150,25 +206,41 @@ class TestInspectLogs:
|
|
| 150 |
target="payment_client",
|
| 151 |
))
|
| 152 |
assert obs.issues_found > 0
|
| 153 |
-
assert obs.reward > 0
|
| 154 |
|
| 155 |
def test_repeated_inspect_logs_no_reward(self):
|
| 156 |
-
"""Second inspection of same target should give 0 reward."""
|
| 157 |
env = ApiDebugEnvironment(task_id="easy")
|
| 158 |
env.reset()
|
| 159 |
-
# First inspection
|
| 160 |
obs1 = env.step(ApiDebugAction(
|
| 161 |
action_type="inspect_logs",
|
| 162 |
target="payment_client",
|
| 163 |
))
|
| 164 |
-
# Second inspection (repeat)
|
| 165 |
obs2 = env.step(ApiDebugAction(
|
| 166 |
action_type="inspect_logs",
|
| 167 |
target="payment_client",
|
| 168 |
))
|
| 169 |
-
# The step cost is -0.01, repeat inspect gives 0 + (-0.01) base
|
| 170 |
assert obs2.reward < obs1.reward
|
| 171 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 172 |
|
| 173 |
class TestInspectConfig:
|
| 174 |
"""Test inspect_config action."""
|
|
@@ -197,12 +269,41 @@ class TestInspectEndpoint:
|
|
| 197 |
assert obs.api_response is not None
|
| 198 |
assert obs.api_response["status"] == "error"
|
| 199 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 200 |
|
| 201 |
class TestSubmitFix:
|
| 202 |
-
"""Test submit_fix action with value validation."""
|
| 203 |
|
| 204 |
def test_correct_fix_accepted(self):
|
| 205 |
-
"""Submitting the right key AND value should be accepted."""
|
| 206 |
env = ApiDebugEnvironment(task_id="easy")
|
| 207 |
env.reset()
|
| 208 |
obs = env.step(ApiDebugAction(
|
|
@@ -220,10 +321,21 @@ class TestSubmitFix:
|
|
| 220 |
obs = env.step(ApiDebugAction(
|
| 221 |
action_type="submit_fix",
|
| 222 |
target="payment_client",
|
| 223 |
-
fix_payload={"headers.Content-Type": "text/xml"},
|
| 224 |
))
|
| 225 |
assert obs.issues_fixed == 0
|
| 226 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 227 |
|
| 228 |
def test_correct_auth_fix(self):
|
| 229 |
"""Bearer token fix should work with any valid token."""
|
|
@@ -260,13 +372,11 @@ class TestSubmitFix:
|
|
| 260 |
"""Fixing all issues should mark episode as done with completion bonus."""
|
| 261 |
env = ApiDebugEnvironment(task_id="easy")
|
| 262 |
env.reset()
|
| 263 |
-
# Fix auth
|
| 264 |
env.step(ApiDebugAction(
|
| 265 |
action_type="submit_fix",
|
| 266 |
target="payment_client",
|
| 267 |
fix_payload={"headers.Authorization": "Bearer valid_token_123"},
|
| 268 |
))
|
| 269 |
-
# Fix content-type
|
| 270 |
obs = env.step(ApiDebugAction(
|
| 271 |
action_type="submit_fix",
|
| 272 |
target="payment_client",
|
|
@@ -275,25 +385,98 @@ class TestSubmitFix:
|
|
| 275 |
assert obs.done is True
|
| 276 |
assert obs.issues_fixed == 2
|
| 277 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 278 |
|
| 279 |
# βββ Grading Tests ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 280 |
|
| 281 |
|
| 282 |
class TestGrading:
|
| 283 |
-
"""Test the grading
|
| 284 |
|
| 285 |
def test_grade_no_fixes_is_low(self):
|
| 286 |
-
"""Grade with no fixes should be
|
| 287 |
env = ApiDebugEnvironment(task_id="easy")
|
| 288 |
env.reset()
|
| 289 |
env.step(ApiDebugAction(action_type="inspect_logs", target="payment_client"))
|
| 290 |
score = env.grade()
|
| 291 |
-
assert 0.0 < score < 0.
|
| 292 |
|
| 293 |
def test_grade_all_fixes_is_high(self):
|
| 294 |
"""Grade with all fixes should be high."""
|
| 295 |
env = ApiDebugEnvironment(task_id="easy")
|
| 296 |
env.reset()
|
|
|
|
|
|
|
| 297 |
env.step(ApiDebugAction(
|
| 298 |
action_type="submit_fix",
|
| 299 |
target="payment_client",
|
|
@@ -305,7 +488,7 @@ class TestGrading:
|
|
| 305 |
fix_payload={"headers.Content-Type": "application/json"},
|
| 306 |
))
|
| 307 |
score = env.grade()
|
| 308 |
-
assert score > 0.
|
| 309 |
|
| 310 |
def test_grade_strictly_between_0_and_1(self):
|
| 311 |
"""Grade must be strictly in (0, 1), never exactly 0.0 or 1.0."""
|
|
@@ -316,10 +499,11 @@ class TestGrading:
|
|
| 316 |
assert 0.0 < score < 1.0, f"Score for {task_id} was {score}"
|
| 317 |
|
| 318 |
def test_efficiency_bonus(self):
|
| 319 |
-
"""Faster solutions should score higher."""
|
| 320 |
-
#
|
| 321 |
env1 = ApiDebugEnvironment(task_id="easy")
|
| 322 |
env1.reset()
|
|
|
|
| 323 |
env1.step(ApiDebugAction(
|
| 324 |
action_type="submit_fix",
|
| 325 |
target="payment_client",
|
|
@@ -327,11 +511,11 @@ class TestGrading:
|
|
| 327 |
))
|
| 328 |
score_fast = env1.grade()
|
| 329 |
|
| 330 |
-
# Slow partial solve (many inspection steps, then fix same 1 issue)
|
| 331 |
env2 = ApiDebugEnvironment(task_id="easy")
|
| 332 |
env2.reset()
|
|
|
|
| 333 |
for _ in range(10):
|
| 334 |
-
env2.step(ApiDebugAction(action_type="inspect_logs", target="
|
| 335 |
env2.step(ApiDebugAction(
|
| 336 |
action_type="submit_fix",
|
| 337 |
target="payment_client",
|
|
@@ -341,6 +525,56 @@ class TestGrading:
|
|
| 341 |
|
| 342 |
assert score_fast > score_slow, f"Fast={score_fast} should beat Slow={score_slow}"
|
| 343 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 344 |
|
| 345 |
# βββ Episode Termination Tests ββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 346 |
|
|
@@ -351,7 +585,6 @@ class TestEpisodeTermination:
|
|
| 351 |
def test_out_of_steps_ends_episode(self):
|
| 352 |
env = ApiDebugEnvironment(task_id="easy")
|
| 353 |
env.reset()
|
| 354 |
-
# Take max_steps actions
|
| 355 |
for _ in range(15):
|
| 356 |
obs = env.step(ApiDebugAction(
|
| 357 |
action_type="inspect_logs",
|
|
@@ -388,9 +621,11 @@ class TestValueMatching:
|
|
| 388 |
def test_numeric_exact(self):
|
| 389 |
assert self.env._values_match(10, 10)
|
| 390 |
|
| 391 |
-
def
|
| 392 |
-
|
| 393 |
-
assert
|
|
|
|
|
|
|
| 394 |
|
| 395 |
def test_boolean_match(self):
|
| 396 |
assert self.env._values_match(True, True)
|
|
@@ -413,32 +648,80 @@ class TestValueMatching:
|
|
| 413 |
assert not self.env._values_match(10, 100)
|
| 414 |
|
| 415 |
|
| 416 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 417 |
|
| 418 |
|
| 419 |
class TestFullEpisode:
|
| 420 |
-
"""Test
|
| 421 |
|
| 422 |
def test_easy_full_solve(self):
|
| 423 |
"""Run a complete easy episode from start to finish."""
|
| 424 |
env = ApiDebugEnvironment(task_id="easy")
|
| 425 |
obs = env.reset()
|
| 426 |
|
| 427 |
-
# Step 1: Inspect logs
|
| 428 |
obs = env.step(ApiDebugAction(
|
| 429 |
action_type="inspect_logs",
|
| 430 |
target="payment_client",
|
| 431 |
))
|
| 432 |
assert obs.issues_found >= 1
|
| 433 |
|
| 434 |
-
# Step 2: Inspect config
|
| 435 |
obs = env.step(ApiDebugAction(
|
| 436 |
action_type="inspect_config",
|
| 437 |
target="payment_client",
|
| 438 |
))
|
| 439 |
assert "headers" in obs.config_snapshot
|
| 440 |
|
| 441 |
-
# Step 3: Fix auth
|
| 442 |
obs = env.step(ApiDebugAction(
|
| 443 |
action_type="submit_fix",
|
| 444 |
target="payment_client",
|
|
@@ -446,7 +729,6 @@ class TestFullEpisode:
|
|
| 446 |
))
|
| 447 |
assert obs.issues_fixed >= 1
|
| 448 |
|
| 449 |
-
# Step 4: Fix content-type
|
| 450 |
obs = env.step(ApiDebugAction(
|
| 451 |
action_type="submit_fix",
|
| 452 |
target="payment_client",
|
|
@@ -455,9 +737,91 @@ class TestFullEpisode:
|
|
| 455 |
assert obs.issues_fixed == 2
|
| 456 |
assert obs.done is True
|
| 457 |
|
| 458 |
-
# Grade
|
| 459 |
score = env.grade()
|
| 460 |
-
assert score > 0.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 461 |
|
| 462 |
|
| 463 |
if __name__ == "__main__":
|
|
|
|
| 7 |
Tests cover:
|
| 8 |
- Environment reset and initialization
|
| 9 |
- Action handling (inspect_logs, inspect_config, inspect_endpoint, submit_fix)
|
| 10 |
+
- Multi-dimensional grading rubric
|
| 11 |
+
- Fix validation (strict value matching + partial credit)
|
| 12 |
- Episode termination conditions
|
| 13 |
- Repeated inspection penalty
|
| 14 |
+
- Seed-based reproducibility and issue pool selection
|
| 15 |
+
- Dynamic state: service health, cascading failures, dynamic logs
|
| 16 |
+
- Strategy scoring
|
| 17 |
"""
|
| 18 |
|
| 19 |
import sys
|
|
|
|
| 62 |
s = get_scenario("hard")
|
| 63 |
assert len(s.issues) == 5
|
| 64 |
|
| 65 |
+
def test_seed_randomization_reproducible(self):
|
| 66 |
+
"""Same seed should produce same scenario."""
|
| 67 |
s1 = get_scenario("easy", seed=42)
|
| 68 |
s2 = get_scenario("easy", seed=42)
|
| 69 |
+
assert [i.issue_id for i in s1.issues] == [i.issue_id for i in s2.issues]
|
| 70 |
|
| 71 |
+
def test_different_seeds_may_vary(self):
|
| 72 |
+
"""Different seeds should produce potentially different scenarios."""
|
| 73 |
+
s1 = get_scenario("easy", seed=42)
|
| 74 |
+
s2 = get_scenario("easy", seed=99)
|
| 75 |
+
# They might differ (pool has 4 issues, selecting 2)
|
| 76 |
+
# At minimum, they should both be valid
|
| 77 |
+
assert len(s1.issues) == 2
|
| 78 |
+
assert len(s2.issues) == 2
|
| 79 |
|
| 80 |
def test_each_issue_has_log_hint(self):
|
| 81 |
"""Every issue should have a corresponding log hint findable in the logs."""
|
|
|
|
| 92 |
break
|
| 93 |
assert found, f"Issue {issue.issue_id} log_hint '{issue.log_hint}' not found in any logs"
|
| 94 |
|
| 95 |
+
def test_service_graph_exists(self):
|
| 96 |
+
"""Every scenario should have a service dependency graph."""
|
| 97 |
+
for task_id in get_all_task_ids():
|
| 98 |
+
s = get_scenario(task_id)
|
| 99 |
+
assert len(s.service_graph) > 0
|
| 100 |
+
for svc in s.services:
|
| 101 |
+
assert svc in s.service_graph, f"Service {svc} missing from graph in {task_id}"
|
| 102 |
+
|
| 103 |
+
def test_dynamic_logs_defined(self):
|
| 104 |
+
"""Every scenario should have dynamic logs for at least some issues."""
|
| 105 |
+
for task_id in get_all_task_ids():
|
| 106 |
+
s = get_scenario(task_id)
|
| 107 |
+
assert len(s.dynamic_logs) > 0, f"No dynamic logs in {task_id}"
|
| 108 |
+
|
| 109 |
+
def test_optimal_fix_order_defined(self):
|
| 110 |
+
"""Every scenario should have an optimal fix order."""
|
| 111 |
+
for task_id in get_all_task_ids():
|
| 112 |
+
s = get_scenario(task_id)
|
| 113 |
+
assert len(s.optimal_fix_order) > 0
|
| 114 |
+
|
| 115 |
+
def test_issues_have_categories(self):
|
| 116 |
+
"""Every issue should have a category."""
|
| 117 |
+
for task_id in get_all_task_ids():
|
| 118 |
+
s = get_scenario(task_id)
|
| 119 |
+
for issue in s.issues:
|
| 120 |
+
assert issue.category in (
|
| 121 |
+
"configuration", "authentication", "networking", "protocol"
|
| 122 |
+
), f"Issue {issue.issue_id} has invalid category: {issue.category}"
|
| 123 |
+
|
| 124 |
+
def test_context_provided(self):
    """Every scenario must provide non-empty context."""
    for task_id in get_all_task_ids():
        scenario = get_scenario(task_id)
        assert len(scenario.context) > 0
|
| 129 |
+
|
| 130 |
|
| 131 |
# βββ Environment Reset Tests βββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 132 |
|
|
|
|
| 163 |
obs = env.reset()
|
| 164 |
assert obs.reward == 0.0
|
| 165 |
|
| 166 |
+
def test_reset_includes_service_status(self):
    """The reset observation must report per-service health status."""
    observation = ApiDebugEnvironment(task_id="easy").reset()
    assert len(observation.service_status) > 0
    assert "payment_client" in observation.service_status
|
| 172 |
+
|
| 173 |
+
def test_reset_includes_dependency_graph(self):
    """The reset observation must carry the service dependency graph."""
    observation = ApiDebugEnvironment(task_id="easy").reset()
    assert len(observation.dependency_graph) > 0
|
| 178 |
+
|
| 179 |
+
def test_reset_includes_error_trace(self):
    """The reset observation must include an initial error trace."""
    observation = ApiDebugEnvironment(task_id="easy").reset()
    assert len(observation.error_trace) > 0
|
| 184 |
+
|
| 185 |
|
| 186 |
# βββ Action Handler Tests ββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 187 |
|
|
|
|
| 206 |
target="payment_client",
|
| 207 |
))
|
| 208 |
assert obs.issues_found > 0
|
| 209 |
+
assert obs.reward > 0
|
| 210 |
|
| 211 |
def test_repeated_inspect_logs_no_reward(self):
    """Re-inspecting the same target must pay less than the first look."""
    env = ApiDebugEnvironment(task_id="easy")
    env.reset()
    first = env.step(
        ApiDebugAction(action_type="inspect_logs", target="payment_client")
    )
    second = env.step(
        ApiDebugAction(action_type="inspect_logs", target="payment_client")
    )
    # The second inspection earns no discovery reward, only the step cost.
    assert second.reward < first.reward
|
| 224 |
|
| 225 |
+
def test_dynamic_logs_after_fix(self):
    """After a fix, re-inspecting should surface new dynamic log lines."""
    env = ApiDebugEnvironment(task_id="easy")
    env.reset()
    # Apply the content-type fix first.
    env.step(ApiDebugAction(
        action_type="submit_fix",
        target="payment_client",
        fix_payload={"headers.Content-Type": "application/json"},
    ))
    # A follow-up inspection should now include the original logs plus
    # dynamic entries reflecting the applied fix.
    observation = env.step(ApiDebugAction(
        action_type="inspect_logs",
        target="payment_client",
    ))
    markers = ("application/json", "parsed")
    assert any(
        marker in log.lower() for log in observation.logs for marker in markers
    )
|
| 243 |
+
|
| 244 |
|
| 245 |
class TestInspectConfig:
|
| 246 |
"""Test inspect_config action."""
|
|
|
|
| 269 |
assert obs.api_response is not None
|
| 270 |
assert obs.api_response["status"] == "error"
|
| 271 |
|
| 272 |
+
def test_inspect_endpoint_shows_success_after_fix(self):
    """Once every issue is fixed the service must be reported healthy."""
    env = ApiDebugEnvironment(task_id="easy")
    env.reset()
    for payload in (
        {"headers.Authorization": "Bearer valid_token_123"},
        {"headers.Content-Type": "application/json"},
    ):
        env.step(ApiDebugAction(
            action_type="submit_fix",
            target="payment_client",
            fix_payload=dict(payload),
        ))
    # The episode is done at this point; verify the health map was updated.
    assert env._service_health.get("payment_client") == "healthy"
|
| 289 |
+
|
| 290 |
+
def test_inspect_endpoint_shows_category_status_code(self):
    """Endpoint errors must use category-appropriate HTTP status codes."""
    env = ApiDebugEnvironment(task_id="easy")
    env.reset()
    observation = env.step(ApiDebugAction(
        action_type="inspect_endpoint",
        target="payment_client",
    ))
    assert observation.api_response is not None
    # Must be a realistic HTTP error status for the issue category.
    assert observation.api_response["status_code"] in (401, 415, 500, 504)
|
| 301 |
+
|
| 302 |
|
| 303 |
class TestSubmitFix:
|
| 304 |
+
"""Test submit_fix action with value validation and partial credit."""
|
| 305 |
|
| 306 |
def test_correct_fix_accepted(self):
|
|
|
|
| 307 |
env = ApiDebugEnvironment(task_id="easy")
|
| 308 |
env.reset()
|
| 309 |
obs = env.step(ApiDebugAction(
|
|
|
|
| 321 |
obs = env.step(ApiDebugAction(
|
| 322 |
action_type="submit_fix",
|
| 323 |
target="payment_client",
|
| 324 |
+
fix_payload={"headers.Content-Type": "text/xml"},
|
| 325 |
))
|
| 326 |
assert obs.issues_fixed == 0
|
| 327 |
+
|
| 328 |
+
def test_partial_credit_close_value(self):
    """A correct key with a close value should earn partial credit."""
    env = ApiDebugEnvironment(task_id="easy")
    env.reset()
    observation = env.step(ApiDebugAction(
        action_type="submit_fix",
        target="payment_client",
        fix_payload={"headers.Content-Type": "application/xml"},
    ))
    # Shares the "application/" prefix, so the reward beats a full reject.
    assert observation.reward > -0.05
|
| 339 |
|
| 340 |
def test_correct_auth_fix(self):
|
| 341 |
"""Bearer token fix should work with any valid token."""
|
|
|
|
| 372 |
"""Fixing all issues should mark episode as done with completion bonus."""
|
| 373 |
env = ApiDebugEnvironment(task_id="easy")
|
| 374 |
env.reset()
|
|
|
|
| 375 |
env.step(ApiDebugAction(
|
| 376 |
action_type="submit_fix",
|
| 377 |
target="payment_client",
|
| 378 |
fix_payload={"headers.Authorization": "Bearer valid_token_123"},
|
| 379 |
))
|
|
|
|
| 380 |
obs = env.step(ApiDebugAction(
|
| 381 |
action_type="submit_fix",
|
| 382 |
target="payment_client",
|
|
|
|
| 385 |
assert obs.done is True
|
| 386 |
assert obs.issues_fixed == 2
|
| 387 |
|
| 388 |
+
def test_strategy_bonus_for_inspecting_first(self):
    """Fixing after inspecting must out-reward a blind fix."""

    def content_type_fix():
        # Fresh action (and fresh payload dict) for every submission.
        return ApiDebugAction(
            action_type="submit_fix",
            target="payment_client",
            fix_payload={"headers.Content-Type": "application/json"},
        )

    # Run 1: submit the fix blind, with no prior inspection.
    blind_env = ApiDebugEnvironment(task_id="easy")
    blind_env.reset()
    blind_obs = blind_env.step(content_type_fix())

    # Run 2: inspect the logs first, then submit the identical fix.
    informed_env = ApiDebugEnvironment(task_id="easy")
    informed_env.reset()
    informed_env.step(ApiDebugAction(
        action_type="inspect_logs",
        target="payment_client",
    ))
    informed_obs = informed_env.step(content_type_fix())

    # The informed fix should earn the strategy bonus.
    assert informed_obs.reward > blind_obs.reward
|
| 414 |
+
|
| 415 |
+
|
| 416 |
+
# βββ Service Health Tests βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 417 |
+
|
| 418 |
+
|
| 419 |
+
class TestServiceHealth:
    """Dynamic service-health tracking across fixes."""

    def test_initial_health_reflects_issues(self):
        """A service with open issues starts as error or degraded."""
        observation = ApiDebugEnvironment(task_id="easy").reset()
        assert observation.service_status.get("payment_client") in ("error", "degraded")

    def test_health_updates_after_fix(self):
        """Clearing every issue on a service flips it to healthy."""
        env = ApiDebugEnvironment(task_id="easy")
        env.reset()
        for payload in (
            {"headers.Authorization": "Bearer valid_token_123"},
            {"headers.Content-Type": "application/json"},
        ):
            env.step(ApiDebugAction(
                action_type="submit_fix",
                target="payment_client",
                fix_payload=dict(payload),
            ))
        # Both issues are fixed, so payment_client must now be healthy.
        assert env._service_health.get("payment_client") == "healthy"

    def test_error_trace_updates(self):
        """The error trace shrinks as issues are fixed."""
        env = ApiDebugEnvironment(task_id="easy")
        baseline = len(env.reset().error_trace)

        env.step(ApiDebugAction(
            action_type="submit_fix",
            target="payment_client",
            fix_payload={"headers.Content-Type": "application/json"},
        ))
        assert len(env._build_error_trace()) < baseline
|
| 458 |
+
|
| 459 |
|
| 460 |
# βββ Grading Tests ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 461 |
|
| 462 |
|
| 463 |
class TestGrading:
|
| 464 |
+
"""Test the multi-dimensional grading rubric."""
|
| 465 |
|
| 466 |
def test_grade_no_fixes_is_low(self):
    """With no fixes the grade is low but nonzero (exploration counts)."""
    env = ApiDebugEnvironment(task_id="easy")
    env.reset()
    env.step(ApiDebugAction(action_type="inspect_logs", target="payment_client"))
    # Exploration and efficiency earn partial credit even without fixes.
    assert 0.0 < env.grade() < 0.5
|
| 473 |
|
| 474 |
def test_grade_all_fixes_is_high(self):
|
| 475 |
"""Grade with all fixes should be high."""
|
| 476 |
env = ApiDebugEnvironment(task_id="easy")
|
| 477 |
env.reset()
|
| 478 |
+
env.step(ApiDebugAction(action_type="inspect_logs", target="payment_client"))
|
| 479 |
+
env.step(ApiDebugAction(action_type="inspect_config", target="payment_client"))
|
| 480 |
env.step(ApiDebugAction(
|
| 481 |
action_type="submit_fix",
|
| 482 |
target="payment_client",
|
|
|
|
| 488 |
fix_payload={"headers.Content-Type": "application/json"},
|
| 489 |
))
|
| 490 |
score = env.grade()
|
| 491 |
+
assert score > 0.6
|
| 492 |
|
| 493 |
def test_grade_strictly_between_0_and_1(self):
|
| 494 |
"""Grade must be strictly in (0, 1), never exactly 0.0 or 1.0."""
|
|
|
|
| 499 |
assert 0.0 < score < 1.0, f"Score for {task_id} was {score}"
|
| 500 |
|
| 501 |
def test_efficiency_bonus(self):
|
| 502 |
+
"""Faster solutions with same fix count should score higher efficiency component."""
|
| 503 |
+
# Both inspect then fix (same strategy), but one uses more steps
|
| 504 |
env1 = ApiDebugEnvironment(task_id="easy")
|
| 505 |
env1.reset()
|
| 506 |
+
env1.step(ApiDebugAction(action_type="inspect_logs", target="payment_client"))
|
| 507 |
env1.step(ApiDebugAction(
|
| 508 |
action_type="submit_fix",
|
| 509 |
target="payment_client",
|
|
|
|
| 511 |
))
|
| 512 |
score_fast = env1.grade()
|
| 513 |
|
|
|
|
| 514 |
env2 = ApiDebugEnvironment(task_id="easy")
|
| 515 |
env2.reset()
|
| 516 |
+
env2.step(ApiDebugAction(action_type="inspect_logs", target="payment_client"))
|
| 517 |
for _ in range(10):
|
| 518 |
+
env2.step(ApiDebugAction(action_type="inspect_logs", target="payment_gateway"))
|
| 519 |
env2.step(ApiDebugAction(
|
| 520 |
action_type="submit_fix",
|
| 521 |
target="payment_client",
|
|
|
|
| 525 |
|
| 526 |
assert score_fast > score_slow, f"Fast={score_fast} should beat Slow={score_slow}"
|
| 527 |
|
| 528 |
+
def test_strategy_affects_grade(self):
    """Inspecting before fixing should not hurt the final grade."""
    fixes = (
        {"headers.Authorization": "Bearer token"},
        {"headers.Content-Type": "application/json"},
    )

    # Run 1: submit both fixes blind, without any inspection.
    blind = ApiDebugEnvironment(task_id="easy")
    blind.reset()
    for payload in fixes:
        blind.step(ApiDebugAction(
            action_type="submit_fix",
            target="payment_client",
            fix_payload=dict(payload),
        ))
    score_no_inspect = blind.grade()

    # Run 2: inspect logs and config first, then submit the same fixes.
    informed = ApiDebugEnvironment(task_id="easy")
    informed.reset()
    informed.step(ApiDebugAction(action_type="inspect_logs", target="payment_client"))
    informed.step(ApiDebugAction(action_type="inspect_config", target="payment_client"))
    for payload in fixes:
        informed.step(ApiDebugAction(
            action_type="submit_fix",
            target="payment_client",
            fix_payload=dict(payload),
        ))
    score_with_inspect = informed.grade()

    # Both runs should score decently; the strategy bonus should keep the
    # inspecting run at least close to the blind one.
    assert score_with_inspect >= score_no_inspect * 0.9
|
| 564 |
+
|
| 565 |
+
def test_grade_dimensions_nonzero(self):
    """All grading dimensions must be computable after a partial run."""
    env = ApiDebugEnvironment(task_id="easy")
    env.reset()
    env.step(ApiDebugAction(action_type="inspect_logs", target="payment_client"))
    env.step(ApiDebugAction(
        action_type="submit_fix",
        target="payment_client",
        fix_payload={"headers.Content-Type": "application/json"},
    ))
    # A partial fix plus exploration must yield a nonzero score.
    assert env.grade() > 0.001
|
| 577 |
+
|
| 578 |
|
| 579 |
# βββ Episode Termination Tests ββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 580 |
|
|
|
|
| 585 |
def test_out_of_steps_ends_episode(self):
|
| 586 |
env = ApiDebugEnvironment(task_id="easy")
|
| 587 |
env.reset()
|
|
|
|
| 588 |
for _ in range(15):
|
| 589 |
obs = env.step(ApiDebugAction(
|
| 590 |
action_type="inspect_logs",
|
|
|
|
| 621 |
def test_numeric_exact(self):
    """Identical numeric values must match."""
    assert self.env._values_match(10, 10)
|
| 623 |
|
| 624 |
+
def test_numeric_tolerance_tight(self):
    """A 10% tolerance around 10 accepts 10 and 9.5 but rejects 8."""
    assert self.env._values_match(10, 10)     # exact
    assert self.env._values_match(10, 9.5)    # 5% off, inside tolerance
    assert not self.env._values_match(10, 8)  # 20% off, outside tolerance
|
| 629 |
|
| 630 |
def test_boolean_match(self):
|
| 631 |
assert self.env._values_match(True, True)
|
|
|
|
| 648 |
assert not self.env._values_match(10, 100)
|
| 649 |
|
| 650 |
|
| 651 |
+
class TestPartialCredit:
    """Partial-credit checks via _values_close and _check_fix."""

    def setup_method(self):
        self.env = ApiDebugEnvironment(task_id="easy")

    @staticmethod
    def _timeout_issue():
        # Shared fixture: an issue whose only expected fix is timeout=10.
        return Issue(
            issue_id="test",
            service="test_svc",
            description="test",
            expected_fix={"timeout": 10},
            fix_key="timeout",
            log_hint="test",
        )

    def test_numeric_close(self):
        """Numbers within 50% count as close; far-off values do not."""
        assert self.env._values_close(10, 7)
        assert not self.env._values_close(10, 100)

    def test_string_same_prefix(self):
        """Strings sharing a prefix count as close."""
        assert self.env._values_close("application/json", "application/xml")

    def test_check_fix_returns_partial(self):
        """Right key with a close value yields 'partial'."""
        outcome = self.env._check_fix(self._timeout_issue(), {"timeout": 7})
        assert outcome == "partial"

    def test_check_fix_returns_exact(self):
        """Right key with the exact value yields 'exact'."""
        outcome = self.env._check_fix(self._timeout_issue(), {"timeout": 10})
        assert outcome == "exact"

    def test_check_fix_returns_none(self):
        """An unrelated key yields 'none'."""
        outcome = self.env._check_fix(
            self._timeout_issue(), {"base_url": "http://example.com"}
        )
        assert outcome == "none"
|
| 700 |
+
|
| 701 |
+
|
| 702 |
+
# βββ Integration Tests ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 703 |
|
| 704 |
|
| 705 |
class TestFullEpisode:
|
| 706 |
+
"""Test complete episode flows."""
|
| 707 |
|
| 708 |
def test_easy_full_solve(self):
|
| 709 |
"""Run a complete easy episode from start to finish."""
|
| 710 |
env = ApiDebugEnvironment(task_id="easy")
|
| 711 |
obs = env.reset()
|
| 712 |
|
|
|
|
| 713 |
obs = env.step(ApiDebugAction(
|
| 714 |
action_type="inspect_logs",
|
| 715 |
target="payment_client",
|
| 716 |
))
|
| 717 |
assert obs.issues_found >= 1
|
| 718 |
|
|
|
|
| 719 |
obs = env.step(ApiDebugAction(
|
| 720 |
action_type="inspect_config",
|
| 721 |
target="payment_client",
|
| 722 |
))
|
| 723 |
assert "headers" in obs.config_snapshot
|
| 724 |
|
|
|
|
| 725 |
obs = env.step(ApiDebugAction(
|
| 726 |
action_type="submit_fix",
|
| 727 |
target="payment_client",
|
|
|
|
| 729 |
))
|
| 730 |
assert obs.issues_fixed >= 1
|
| 731 |
|
|
|
|
| 732 |
obs = env.step(ApiDebugAction(
|
| 733 |
action_type="submit_fix",
|
| 734 |
target="payment_client",
|
|
|
|
| 737 |
assert obs.issues_fixed == 2
|
| 738 |
assert obs.done is True
|
| 739 |
|
|
|
|
| 740 |
score = env.grade()
|
| 741 |
+
assert score > 0.6
|
| 742 |
+
|
| 743 |
+
def test_medium_full_solve(self):
    """Drive a medium episode from reset through all three fixes."""
    env = ApiDebugEnvironment(task_id="medium")
    obs = env.reset()
    assert obs.issues_total == 3

    # Survey: inspect the logs of every available service.
    for svc in obs.available_targets:
        obs = env.step(ApiDebugAction(
            action_type="inspect_logs", target=svc,
        ))

    # Look at the sender's configuration.
    obs = env.step(ApiDebugAction(
        action_type="inspect_config", target="webhook_sender",
    ))

    # Fix 1: rate limit.
    obs = env.step(ApiDebugAction(
        action_type="submit_fix",
        target="webhook_sender",
        fix_payload={"rate_limit.requests_per_second": 10},
    ))
    assert obs.issues_fixed >= 1

    # Fix 2: retry policy.
    obs = env.step(ApiDebugAction(
        action_type="submit_fix",
        target="webhook_sender",
        fix_payload={"retry": {"max_retries": 3, "backoff_factor": 2, "retry_on_status": [429, 500]}},
    ))

    # Fix 3: webhook signature.
    obs = env.step(ApiDebugAction(
        action_type="submit_fix",
        target="webhook_sender",
        fix_payload={"headers.X-Webhook-Signature": "sha256=computed_hmac"},
    ))

    assert obs.done is True
    assert env.grade() > 0.4
|
| 785 |
+
|
| 786 |
+
def test_hard_partial_solve(self):
    """A partial solve of the hard task earns strictly partial credit."""
    env = ApiDebugEnvironment(task_id="hard")
    obs = env.reset()
    assert obs.issues_total == 5

    # Repair only two of the five issues.
    env.step(ApiDebugAction(action_type="inspect_logs", target="order_service"))
    env.step(ApiDebugAction(
        action_type="submit_fix",
        target="order_service",
        fix_payload={"inventory_url": "https://inventory.internal/v2/reserve"},
    ))
    env.step(ApiDebugAction(
        action_type="submit_fix",
        target="order_service",
        fix_payload={"timeout": 10},
    ))

    score = env.grade()
    assert 0.0 < score < 0.999
    assert len(env._issues_fixed) == 2
|
| 808 |
+
|
| 809 |
+
|
| 810 |
+
class TestCascadingFailures:
    """Cascading-failure dynamics between dependent issues."""

    def test_hard_dependency_chain(self):
        """In the hard scenario, hard_timeout depends on hard_wrong_url."""
        scenario = get_scenario("hard")
        timeout_issue = next(
            issue for issue in scenario.issues if issue.issue_id == "hard_timeout"
        )
        assert "hard_wrong_url" in timeout_issue.depends_on

    def test_cascade_effects_defined(self):
        """Each scenario must have at least one issue with cascade effects."""
        for task_id in get_all_task_ids():
            scenario = get_scenario(task_id)
            has_cascade = any(
                len(issue.cascade_effects) > 0 for issue in scenario.issues
            )
            assert has_cascade, f"No cascade effects in {task_id}"
|
| 825 |
|
| 826 |
|
| 827 |
if __name__ == "__main__":
|