diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000000000000000000000000000000000000..75826008de4fd9a2420b2cd8835fda8ce71f8c3f
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,9 @@
+__pycache__/
+*.pyc
+*.pyo
+venv/
+.env
+*.egg-info/
+dist/
+build/
+.pytest_cache/
diff --git a/Dockerfile b/Dockerfile
new file mode 100644
index 0000000000000000000000000000000000000000..dce0cd59282039576f48785c64c4882a46b5e303
--- /dev/null
+++ b/Dockerfile
@@ -0,0 +1,31 @@
+FROM python:3.11-slim
+
+# Create non-root user for security — MANDATORY for running agent code safely
+RUN useradd -m -u 1000 envuser
+
+WORKDIR /app
+
+# Install system dependencies
+RUN apt-get update && apt-get install -y \
+    git \
+    && rm -rf /var/lib/apt/lists/*
+
+# Copy and install Python dependencies first (layer caching)
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+
+# Copy project
+COPY . .
+
+# Make repo_templates readable
+RUN chmod -R 755 repo_templates/
+
+# Create temp directory for working copies
+RUN mkdir -p /tmp/openenv_work && chmod 777 /tmp/openenv_work
+
+# Switch to non-root for security
+USER envuser
+
+EXPOSE 7860
+
+CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860", "--workers", "1"]
diff --git a/README.md b/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..faeef02378ce2c06e97b70372143eac67bda76e9
--- /dev/null
+++ b/README.md
@@ -0,0 +1,144 @@
+---
+title: Codebase Navigation Repair OpenEnv
+emoji: 🔍
+colorFrom: blue
+colorTo: green
+sdk: docker
+pinned: false
+app_port: 7860
+license: mit
+tags:
+  - openenv
+  - reinforcement-learning
+  - coding-agent
+---
+
+# Codebase Navigation & Repair — OpenEnv Environment v2.0
+
+**An RL environment + evaluation layer that makes AI coding agents reliable, testable, and debuggable.**
+
+AI agents navigate unfamiliar Python codebases, identify bugs, and implement features — graded by running actual tests. Unlike existing benchmarks, this system provides **process-level evaluation**, not just final output scoring.
+
+## Why This Exists
+
+Every coding agent (Devin, Cursor, Copilot, Codex) fails ~25%+ on complex tasks. Current benchmarks tell you the agent scored 0.4 but not **why** it failed. This environment answers:
+
+- Did the agent explore strategically or waste steps?
+- Did it verify its fixes before submitting?
+- Can it resist misleading comments and prompt injection?
+- How efficiently does it use its context window?
+
+## Architecture
+
+```
+┌──────────────────────────────────────────────────────────┐
+│                    FastAPI Server                         │
+│  /reset  /step  /state  /trajectory  /evaluate  /metrics │
+└──────────┬───────────────────────────────────────────────┘
+           │
+┌──────────▼───────────────────────────────────────────────┐
+│              CodebaseNavEnvironment (extended)             │
+│                                                           │
+│  ┌─────────────┐  ┌──────────────┐  ┌─────────────────┐  │
+│  │ Trajectory   │  │  Evaluator   │  │  Security       │  │
+│  │ Logger       │  │  (process)   │  │  Scanner        │  │
+│  └─────────────┘  └──────────────┘  └─────────────────┘  │
+│  ┌─────────────┐  ┌──────────────┐  ┌─────────────────┐  │
+│  │ Fault       │  │  Memory      │  │  Grader         │  │
+│  │ Injector    │  │  Tracker     │  │  (pytest)       │  │
+│  └─────────────┘  └──────────────┘  └─────────────────┘  │
+└──────────────────────────────────────────────────────────┘
+```
+
+## Tasks
+
+| Task | Difficulty | Description |
+|------|-----------|-------------|
+| task1 | Easy | Single-file bug repair (5 variants) |
+| task2 | Medium | Cross-module interface bug + regression test (5 variants) |
+| task3 | Hard | Feature implementation from spec (5 variants) |
+
+## API Endpoints
+
+### Core (OpenEnv-compliant)
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/reset?task=task1` | POST | Start new episode |
+| `/step` | POST | Take one action |
+| `/state` | GET | Get current state |
+| `/health` | GET | Health check |
+
+### Evaluation Layer (v2.0)
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/trajectory` | GET | Full action log with timing, diffs, security flags |
+| `/evaluate` | GET | Multi-dimensional scores (6 axes) |
+| `/metrics` | GET | Comprehensive stats: memory, security, timeline |
+| `/fault-config` | POST | Enable fault injection: "none", "light", "heavy" |
+
+## Multi-Dimensional Evaluation
+
+The `/evaluate` endpoint scores agents across **6 quality dimensions**:
+
+| Dimension | Weight | What It Measures |
+|-----------|--------|-----------------|
+| Efficiency | 20% | Steps used vs optimal path |
+| Navigation | 15% | Read relevant files first? Explored strategically? |
+| Correctness | 30% | Final test pass rate + regression detection |
+| Reasoning | 15% | read→write→test pattern adherence |
+| Robustness | 10% | Error recovery + fault injection handling |
+| Security | 10% | Unsafe code detection + prompt injection resistance |
+
+## Fault Injection
+
+Test agent robustness by injecting controlled faults:
+
+```bash
+# Enable heavy fault injection
+curl -X POST http://localhost:7860/fault-config -d '{"level":"heavy"}'
+
+# Next reset will inject:
+# - Misleading "BUG:" comments on correct lines
+# - Red herring files that look buggy but aren't
+# - Noisy docstrings claiming code is correct
+```
+
+## Quick Start
+
+### Local
+```bash
+pip install -r requirements.txt
+uvicorn server.app:app --host 0.0.0.0 --port 7860
+```
+
+### Docker
+```bash
+docker build -t codebase-nav-env .
+docker run -p 7860:7860 codebase-nav-env
+```
+
+### Run Inference
+```bash
+export HF_TOKEN=your_token
+export ENV_BASE_URL=http://localhost:7860
+python inference.py
+```
+
+## Example Output: `/evaluate`
+```json
+{
+  "composite_score": 0.874,
+  "dimensions": {
+    "efficiency": {"score": 0.8, "evidence": ["Used 5 steps vs 4 optimal"]},
+    "navigation": {"score": 1.0, "evidence": ["Good: first read was relevant file"]},
+    "correctness": {"score": 0.714, "evidence": ["No test regressions"]},
+    "reasoning": {"score": 1.0, "evidence": ["Agent tested after writing"]},
+    "robustness": {"score": 1.0, "evidence": ["Clean execution"]},
+    "security": {"score": 1.0, "evidence": ["No security violations"]}
+  }
+}
+```
+
+## License
+
+MIT
diff --git a/inference.py b/inference.py
new file mode 100644
index 0000000000000000000000000000000000000000..9d31618151a4683eefee92eccb2082f9b857ba47
--- /dev/null
+++ b/inference.py
@@ -0,0 +1,247 @@
+#!/usr/bin/env python3
+"""
+inference.py — Mandatory OpenEnv baseline inference script.
+Runs an LLM agent against all 3 tasks and emits required log format.
+
+Environment variables required:
+    API_BASE_URL   — LLM API endpoint
+    MODEL_NAME     — model identifier
+    HF_TOKEN       — Hugging Face API token
+"""
+import os
+import json
+import textwrap
+from typing import List, Optional
+
+from openai import OpenAI
+import httpx
+
+# ── Configuration ─────────────────────────────────────────────────────────────
+API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
+API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
+MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
+ENV_BASE_URL = os.getenv("ENV_BASE_URL", "http://localhost:7860")
+
+MAX_STEPS_PER_TASK = {"task1": 12, "task2": 18, "task3": 22}
+TEMPERATURE = 0.2
+MAX_TOKENS = 800
+SUCCESS_THRESHOLD = 0.5
+
+TASKS = ["task1", "task2", "task3"]
+
+
+# ── Logging helpers ────────────────────────────────────────────────────────────
+def log_start(task: str, env: str, model: str) -> None:
+    print(f"[START] task={task} env={env} model={model}", flush=True)
+
+
+def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
+    error_val = error if error else "null"
+    print(
+        f"[STEP] step={step} action={action} reward={reward:.2f} "
+        f"done={str(done).lower()} error={error_val}",
+        flush=True,
+    )
+
+
+def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
+    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
+    print(
+        f"[END] success={str(success).lower()} steps={steps} "
+        f"score={score:.3f} rewards={rewards_str}",
+        flush=True,
+    )
+
+
+# ── Environment client ─────────────────────────────────────────────────────────
+class EnvClient:
+    def __init__(self, base_url: str):
+        self.base_url = base_url.rstrip("/")
+        self.client = httpx.Client(timeout=60.0)
+
+    def reset(self, task: str) -> dict:
+        r = self.client.post(f"{self.base_url}/reset", params={"task": task})
+        r.raise_for_status()
+        return r.json()
+
+    def step(self, action: dict) -> dict:
+        r = self.client.post(f"{self.base_url}/step", json=action)
+        r.raise_for_status()
+        return r.json()
+
+    def state(self) -> dict:
+        r = self.client.get(f"{self.base_url}/state")
+        r.raise_for_status()
+        return r.json()
+
+    def close(self):
+        self.client.close()
+
+
+# ── LLM Agent ─────────────────────────────────────────────────────────────────
+SYSTEM_PROMPT = textwrap.dedent("""
+    You are an expert software engineer working inside a Python code repository.
+    You can take the following actions (respond with ONLY a valid JSON object):
+
+    {"action_type": "read_file", "path": "src/some_file.py"}
+    {"action_type": "write_file", "path": "src/some_file.py", "content": "...full new content..."}
+    {"action_type": "run_tests", "path": "tests/test_something.py"}
+    {"action_type": "search_code", "query": "function_name_or_keyword"}
+    {"action_type": "submit"}
+
+    Strategy:
+    1. ALWAYS read relevant source files before writing any fixes
+    2. For task1/task2: read failing test file first to understand what is expected
+    3. For task3: read FEATURE_SPEC.md first, then existing source files
+    4. Run tests after writing a fix to verify improvement
+    5. Submit only when confident tests will pass
+
+    Reply with ONLY the JSON action object. No explanation. No markdown. No extra text.
+""").strip()
+
+
+def build_user_prompt(obs: dict, step: int, history: List[str]) -> str:
+    tree_str = "\n".join(obs.get("repo_tree", []))
+    files_read_str = ", ".join(obs.get("files_read", [])) or "none yet"
+    failing_str = ", ".join(obs.get("failing_tests", [])) or "unknown"
+    last_result = obs.get("last_action_result") or "none"
+    last_error = obs.get("last_action_error") or "none"
+    steps_left = obs.get("steps_remaining", 0)
+    history_str = "\n".join(history[-5:]) if history else "none"
+
+    return textwrap.dedent(f"""
+        Step: {step}
+        Task: {obs.get('current_task')}
+        Description: {obs.get('task_description')}
+        Steps remaining: {steps_left}
+
+        Repository files:
+        {tree_str}
+
+        Files already read: {files_read_str}
+        Known failing tests: {failing_str}
+        Last action result: {last_result[:1000]}
+        Last action error: {last_error}
+
+        Recent history:
+        {history_str}
+
+        What is your next action? Reply with ONLY a JSON action object.
+    """).strip()
+
+
+def get_agent_action(client: OpenAI, obs: dict, step: int, history: List[str]) -> dict:
+    user_prompt = build_user_prompt(obs, step, history)
+    try:
+        completion = client.chat.completions.create(
+            model=MODEL_NAME,
+            messages=[
+                {"role": "system", "content": SYSTEM_PROMPT},
+                {"role": "user", "content": user_prompt},
+            ],
+            temperature=TEMPERATURE,
+            max_tokens=MAX_TOKENS,
+        )
+        text = (completion.choices[0].message.content or "").strip()
+
+        # Strip markdown code fences if present
+        if text.startswith("```"):
+            text = text.split("```")[1]
+            if text.startswith("json"):
+                text = text[4:]
+
+        action = json.loads(text)
+        return action
+    except json.JSONDecodeError:
+        print(f"[DEBUG] Failed to parse action JSON: {text[:200]}", flush=True)
+        return {"action_type": "submit"}  # Fallback
+    except Exception as e:
+        print(f"[DEBUG] LLM call failed: {e}", flush=True)
+        return {"action_type": "submit"}
+
+
+def run_task(env_client: EnvClient, llm_client: OpenAI, task: str) -> tuple:
+    """Run one complete episode for a task. Returns (score, steps, rewards)."""
+    max_steps = MAX_STEPS_PER_TASK.get(task, 15)
+    benchmark = "codebase-nav-env"
+
+    rewards = []
+    history = []
+    steps_taken = 0
+    score = 0.0
+    success = False
+
+    log_start(task=task, env=benchmark, model=MODEL_NAME)
+
+    try:
+        reset_result = env_client.reset(task=task)
+        obs = reset_result["observation"]
+
+        for step_num in range(1, max_steps + 1):
+            if obs.get("steps_remaining", 0) <= 0:
+                break
+
+            action = get_agent_action(llm_client, obs, step_num, history)
+            action_str = json.dumps(action)
+
+            try:
+                step_result = env_client.step(action)
+            except Exception as e:
+                log_step(step_num, action_str, 0.0, True, str(e))
+                break
+
+            reward = step_result.get("reward", 0.0)
+            done = step_result.get("done", False)
+            error = step_result["observation"].get("last_action_error")
+
+            rewards.append(reward)
+            steps_taken = step_num
+            obs = step_result["observation"]
+
+            history.append(f"Step {step_num}: {action.get('action_type')} -> reward {reward:+.2f}")
+
+            log_step(step=step_num, action=action_str[:200], reward=reward, done=done, error=error)
+
+            if done:
+                # Get final score from state
+                state = env_client.state()
+                score = state.get("current_score", 0.0)
+                break
+
+        # If not done yet (step budget exhausted), force submit
+        if not obs.get("last_action_result", "").startswith("=== FINAL GRADER"):
+            try:
+                step_result = env_client.step({"action_type": "submit"})
+                state = env_client.state()
+                score = state.get("current_score", 0.0)
+            except Exception:
+                pass
+
+        success = score >= SUCCESS_THRESHOLD
+
+    except Exception as e:
+        print(f"[DEBUG] Episode error: {e}", flush=True)
+    finally:
+        log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
+
+    return score, steps_taken, rewards
+
+
+def main():
+    env_client = EnvClient(ENV_BASE_URL)
+    llm_client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
+
+    all_scores = []
+    for task in TASKS:
+        score, steps, rewards = run_task(env_client, llm_client, task)
+        all_scores.append(score)
+        print(f"[INFO] {task} complete: score={score:.3f} steps={steps}", flush=True)
+
+    avg_score = sum(all_scores) / len(all_scores)
+    print(f"[INFO] Average score across all tasks: {avg_score:.3f}", flush=True)
+
+    env_client.close()
+
+
+if __name__ == "__main__":
+    main()
diff --git a/openenv.yaml b/openenv.yaml
new file mode 100644
index 0000000000000000000000000000000000000000..84628eb34927efeef9113f66c9e877cb57f786ee
--- /dev/null
+++ b/openenv.yaml
@@ -0,0 +1,56 @@
+name: codebase-nav-env
+version: "1.0.0"
+description: >
+  An RL environment where an LLM agent navigates an unfamiliar Python codebase,
+  finds bugs, and implements features by reading files and running tests.
+  Graded by actual pytest execution — fully deterministic.
+
+author: your-hf-username
+license: MIT
+
+tasks:
+  - id: task1
+    name: "Single-file bug repair"
+    description: "Find and fix bugs in a Python module so all tests pass."
+    difficulty: easy
+    max_steps: 20
+    reward_range: [0.0, 1.0]
+
+  - id: task2
+    name: "Cross-module interface bug"
+    description: "Fix a type mismatch between two modules and add a regression test."
+    difficulty: medium
+    max_steps: 25
+    reward_range: [0.0, 1.0]
+
+  - id: task3
+    name: "Feature implementation from spec"
+    description: "Read FEATURE_SPEC.md and implement the feature across multiple files."
+    difficulty: hard
+    max_steps: 30
+    reward_range: [0.0, 1.0]
+
+action_space:
+  type: text
+  schema:
+    action_type: string
+    path: string (optional)
+    content: string (optional)
+    query: string (optional)
+
+observation_space:
+  type: structured
+  fields:
+    - repo_tree: list of file paths
+    - task_description: string
+    - failing_tests: list of test names
+    - files_read: list of paths read so far
+    - last_action_result: string
+    - steps_remaining: integer
+    - current_task: string
+
+endpoints:
+  reset: POST /reset
+  step: POST /step
+  state: GET /state
+  health: GET /health
diff --git a/repo_templates/task1/variant_1/meta.json b/repo_templates/task1/variant_1/meta.json
new file mode 100644
index 0000000000000000000000000000000000000000..632e14ddc47aa5cda80c5adcb10ed931671451ca
--- /dev/null
+++ b/repo_templates/task1/variant_1/meta.json
@@ -0,0 +1,15 @@
+{
+  "variant_id": "task1_v1",
+  "task": "task1",
+  "bug_files": ["src/auth.py"],
+  "bug_description": "validate_token uses != instead of == and get_user_permissions has off-by-one",
+  "failing_tests": ["test_valid_token", "test_user_permissions"],
+  "correct_lines": {
+    "src/auth.py": {
+      "return token != secret": "return token == secret",
+      "return permissions[user_id + 1]": "return permissions[user_id]"
+    }
+  },
+  "total_files": 3,
+  "optimal_steps": 4
+}
diff --git a/repo_templates/task1/variant_1/src/auth.py b/repo_templates/task1/variant_1/src/auth.py
new file mode 100644
index 0000000000000000000000000000000000000000..4f7ee3b3f1088bbd324fb4244dbea33f67d1232d
--- /dev/null
+++ b/repo_templates/task1/variant_1/src/auth.py
@@ -0,0 +1,14 @@
+def validate_token(token: str, secret: str) -> bool:
+    """Validate a user token against the secret."""
+    if token is None:
+        return False
+    # BUG: should be == not !=
+    return token != secret
+
+
+def get_user_permissions(user_id: int, permissions: list) -> list:
+    """Return permissions for a user ID."""
+    if user_id < 0:
+        return []
+    # BUG: off-by-one — should be permissions[user_id] not permissions[user_id + 1]
+    return permissions[user_id + 1] if user_id + 1 < len(permissions) else []
diff --git a/repo_templates/task1/variant_1/src/utils.py b/repo_templates/task1/variant_1/src/utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..3accdeb4d377f3a96c0ad1dac30ac5e24f2fcfa1
--- /dev/null
+++ b/repo_templates/task1/variant_1/src/utils.py
@@ -0,0 +1,16 @@
+"""Utility functions for the auth module."""
+
+
+def sanitize_input(text: str) -> str:
+    """Remove leading/trailing whitespace and normalize."""
+    if not isinstance(text, str):
+        return ""
+    return text.strip().lower()
+
+
+def format_response(status: str, data: dict = None) -> dict:
+    """Format a standard API response."""
+    return {
+        "status": status,
+        "data": data or {},
+    }
diff --git a/repo_templates/task1/variant_1/tests/test_auth.py b/repo_templates/task1/variant_1/tests/test_auth.py
new file mode 100644
index 0000000000000000000000000000000000000000..9702eeee53b086ee8a29ed6f42ce5e74f3c3f4cd
--- /dev/null
+++ b/repo_templates/task1/variant_1/tests/test_auth.py
@@ -0,0 +1,23 @@
+import pytest
+from src.auth import validate_token, get_user_permissions
+
+
+def test_valid_token():
+    assert validate_token("abc123", "abc123") == True  # FAILS because of != bug
+
+
+def test_invalid_token():
+    assert validate_token("wrong", "abc123") == False
+
+
+def test_none_token():
+    assert validate_token(None, "abc123") == False
+
+
+def test_user_permissions():
+    perms = ["read", "write", "admin"]
+    assert get_user_permissions(0, perms) == "read"  # FAILS because of off-by-one bug
+
+
+def test_negative_user_id():
+    assert get_user_permissions(-1, ["read"]) == []
diff --git a/repo_templates/task1/variant_2/meta.json b/repo_templates/task1/variant_2/meta.json
new file mode 100644
index 0000000000000000000000000000000000000000..5c381a60d114a65640830ffb720279eef814a83d
--- /dev/null
+++ b/repo_templates/task1/variant_2/meta.json
@@ -0,0 +1,15 @@
+{
+  "variant_id": "task1_v2",
+  "task": "task1",
+  "bug_files": ["src/calculator.py"],
+  "bug_description": "divide() missing zero-division check; average() crashes on empty list",
+  "failing_tests": ["test_divide_by_zero", "test_average_empty"],
+  "correct_lines": {
+    "src/calculator.py": {
+      "return numerator / denominator": "if denominator == 0:\n        return 0.0\n    return numerator / denominator",
+      "total = sum(numbers)\n    return total / len(numbers)": "if not numbers:\n        return 0.0\n    total = sum(numbers)\n    return total / len(numbers)"
+    }
+  },
+  "total_files": 3,
+  "optimal_steps": 4
+}
diff --git a/repo_templates/task1/variant_2/src/calculator.py b/repo_templates/task1/variant_2/src/calculator.py
new file mode 100644
index 0000000000000000000000000000000000000000..abbba8fbe72ec5424d312433fcd26b0fca4e3791
--- /dev/null
+++ b/repo_templates/task1/variant_2/src/calculator.py
@@ -0,0 +1,23 @@
+"""Calculator module with basic math operations."""
+
+
+def divide(numerator: float, denominator: float) -> float:
+    """Divide numerator by denominator safely."""
+    # BUG: missing zero-division check — should check denominator == 0
+    return numerator / denominator
+
+
+def average(numbers: list) -> float:
+    """Calculate the average of a list of numbers."""
+    # BUG: doesn't handle empty list — should return 0.0 for empty
+    total = sum(numbers)
+    return total / len(numbers)
+
+
+def clamp(value: float, min_val: float, max_val: float) -> float:
+    """Clamp a value between min and max."""
+    if value < min_val:
+        return min_val
+    if value > max_val:
+        return max_val
+    return value
diff --git a/repo_templates/task1/variant_2/src/helpers.py b/repo_templates/task1/variant_2/src/helpers.py
new file mode 100644
index 0000000000000000000000000000000000000000..062a213a273c55fbbde080cb6fce6ca4ac185dcf
--- /dev/null
+++ b/repo_templates/task1/variant_2/src/helpers.py
@@ -0,0 +1,14 @@
+"""Helper utilities for the calculator module."""
+
+
+def parse_number(value: str) -> float:
+    """Parse a string to a float, returning 0.0 on failure."""
+    try:
+        return float(value)
+    except (ValueError, TypeError):
+        return 0.0
+
+
+def format_result(value: float, decimals: int = 2) -> str:
+    """Format a numeric result to a string with given decimal places."""
+    return f"{value:.{decimals}f}"
diff --git a/repo_templates/task1/variant_2/tests/test_calculator.py b/repo_templates/task1/variant_2/tests/test_calculator.py
new file mode 100644
index 0000000000000000000000000000000000000000..dcce822118a2cb0eec62c81565dd79fc62a2e75c
--- /dev/null
+++ b/repo_templates/task1/variant_2/tests/test_calculator.py
@@ -0,0 +1,32 @@
+import pytest
+from src.calculator import divide, average, clamp
+
+
+def test_divide_normal():
+    assert divide(10, 2) == 5.0
+
+
+def test_divide_by_zero():
+    # FAILS — ZeroDivisionError because no zero check
+    assert divide(10, 0) == 0.0
+
+
+def test_average_normal():
+    assert average([1, 2, 3]) == 2.0
+
+
+def test_average_empty():
+    # FAILS — ZeroDivisionError because empty list not handled
+    assert average([]) == 0.0
+
+
+def test_clamp_within():
+    assert clamp(5, 0, 10) == 5
+
+
+def test_clamp_below():
+    assert clamp(-5, 0, 10) == 0
+
+
+def test_clamp_above():
+    assert clamp(15, 0, 10) == 10
diff --git a/repo_templates/task1/variant_3/meta.json b/repo_templates/task1/variant_3/meta.json
new file mode 100644
index 0000000000000000000000000000000000000000..5989b9cc3dbb0b53f13c0ed63beff54d279b767d
--- /dev/null
+++ b/repo_templates/task1/variant_3/meta.json
@@ -0,0 +1,15 @@
+{
+  "variant_id": "task1_v3",
+  "task": "task1",
+  "bug_files": ["src/inventory.py"],
+  "bug_description": "check_stock uses >= 0 instead of > 0; get_low_stock_items uses <= instead of <",
+  "failing_tests": ["test_out_of_stock", "test_low_stock_items"],
+  "correct_lines": {
+    "src/inventory.py": {
+      "return inventory[item_id] >= 0": "return inventory[item_id] > 0",
+      "if qty <= threshold": "if qty < threshold"
+    }
+  },
+  "total_files": 3,
+  "optimal_steps": 4
+}
diff --git a/repo_templates/task1/variant_3/src/inventory.py b/repo_templates/task1/variant_3/src/inventory.py
new file mode 100644
index 0000000000000000000000000000000000000000..e9f929d4a13f73da389cf3b512d062a13ef3aa7f
--- /dev/null
+++ b/repo_templates/task1/variant_3/src/inventory.py
@@ -0,0 +1,26 @@
+"""Inventory management module."""
+
+
+def check_stock(item_id: str, inventory: dict) -> bool:
+    """Check if an item is in stock (quantity > 0)."""
+    if item_id not in inventory:
+        return False
+    # BUG: should be > 0, not >= 0 (zero stock means out of stock)
+    return inventory[item_id] >= 0
+
+
+def restock(item_id: str, quantity: int, inventory: dict) -> dict:
+    """Add stock for an item."""
+    if quantity < 0:
+        raise ValueError("Cannot restock negative quantity")
+    if item_id in inventory:
+        inventory[item_id] += quantity
+    else:
+        inventory[item_id] = quantity
+    return inventory
+
+
+def get_low_stock_items(inventory: dict, threshold: int = 5) -> list:
+    """Return items with stock below threshold."""
+    # BUG: should be < threshold, not <= threshold
+    return [item for item, qty in inventory.items() if qty <= threshold]
diff --git a/repo_templates/task1/variant_3/src/logger.py b/repo_templates/task1/variant_3/src/logger.py
new file mode 100644
index 0000000000000000000000000000000000000000..7ac0486a499e0554b10ee9552202bc1f30108977
--- /dev/null
+++ b/repo_templates/task1/variant_3/src/logger.py
@@ -0,0 +1,9 @@
+"""Logging utilities for inventory operations."""
+
+
+def log_operation(operation: str, item_id: str, details: str = "") -> str:
+    """Create a log entry for an inventory operation."""
+    entry = f"[INVENTORY] {operation}: {item_id}"
+    if details:
+        entry += f" — {details}"
+    return entry
diff --git a/repo_templates/task1/variant_3/tests/test_inventory.py b/repo_templates/task1/variant_3/tests/test_inventory.py
new file mode 100644
index 0000000000000000000000000000000000000000..9e17827e5b90571529a4ca4c3412375f8c4d29e9
--- /dev/null
+++ b/repo_templates/task1/variant_3/tests/test_inventory.py
@@ -0,0 +1,44 @@
+import pytest
+from src.inventory import check_stock, restock, get_low_stock_items
+
+
+def test_in_stock():
+    inv = {"apple": 10, "banana": 5}
+    assert check_stock("apple", inv) == True
+
+
+def test_out_of_stock():
+    inv = {"apple": 0}
+    # FAILS — returns True because >= 0 is wrong, should be > 0
+    assert check_stock("apple", inv) == False
+
+
+def test_item_not_found():
+    assert check_stock("ghost", {}) == False
+
+
+def test_restock_existing():
+    inv = {"apple": 5}
+    result = restock("apple", 3, inv)
+    assert result["apple"] == 8
+
+
+def test_restock_new():
+    inv = {}
+    result = restock("orange", 10, inv)
+    assert result["orange"] == 10
+
+
+def test_restock_negative():
+    with pytest.raises(ValueError):
+        restock("apple", -1, {})
+
+
+def test_low_stock_items():
+    inv = {"apple": 3, "banana": 5, "cherry": 10}
+    # FAILS — banana (qty=5) should NOT be in low stock when threshold=5
+    # but <= threshold incorrectly includes items AT the threshold
+    result = get_low_stock_items(inv, threshold=5)
+    assert "apple" in result
+    assert "banana" not in result
+    assert "cherry" not in result
diff --git a/repo_templates/task1/variant_4/meta.json b/repo_templates/task1/variant_4/meta.json
new file mode 100644
index 0000000000000000000000000000000000000000..5c0b667d4f08f9d3e0419daaf15963f16cbed676
--- /dev/null
+++ b/repo_templates/task1/variant_4/meta.json
@@ -0,0 +1,15 @@
+{
+  "variant_id": "task1_v4",
+  "task": "task1",
+  "bug_files": ["src/scheduler.py"],
+  "bug_description": "is_available uses <= instead of < for adjacent slot check; days_until has off-by-one (+1)",
+  "failing_tests": ["test_adjacent_slots_allowed", "test_days_until", "test_days_until_same_day"],
+  "correct_lines": {
+    "src/scheduler.py": {
+      "if start <= slot_end and end >= slot_start:": "if start < slot_end and end > slot_start:",
+      "return delta.days + 1": "return delta.days"
+    }
+  },
+  "total_files": 3,
+  "optimal_steps": 4
+}
diff --git a/repo_templates/task1/variant_4/src/scheduler.py b/repo_templates/task1/variant_4/src/scheduler.py
new file mode 100644
index 0000000000000000000000000000000000000000..2ee814dda7e52fd4fd6b8a62189d85f71cb9c32a
--- /dev/null
+++ b/repo_templates/task1/variant_4/src/scheduler.py
@@ -0,0 +1,34 @@
+"""Meeting and event scheduler module."""
+from datetime import datetime, timedelta
+
+
+def is_available(start: datetime, end: datetime, booked_slots: list) -> bool:
+    """Check if a time slot is available (no overlap with booked slots)."""
+    for slot in booked_slots:
+        slot_start = slot["start"]
+        slot_end = slot["end"]
+        # BUG: off-by-one — should be < not <= for end comparison
+        # Adjacent meetings (one ends exactly when another starts) should be allowed
+        if start <= slot_end and end >= slot_start:
+            return False
+    return True
+
+
+def get_next_available(after: datetime, duration_minutes: int, booked_slots: list) -> datetime:
+    """Find the next available slot after the given time."""
+    candidate = after
+    for _ in range(100):  # safety limit
+        candidate_end = candidate + timedelta(minutes=duration_minutes)
+        if is_available(candidate, candidate_end, booked_slots):
+            return candidate
+        candidate += timedelta(minutes=15)  # check in 15-minute increments
+    return None
+
+
+def days_until(target: datetime, now: datetime = None) -> int:
+    """Calculate whole days until target date."""
+    if now is None:
+        now = datetime.now()
+    delta = target - now
+    # BUG: should return delta.days, not delta.days + 1
+    return delta.days + 1
diff --git a/repo_templates/task1/variant_4/src/time_helpers.py b/repo_templates/task1/variant_4/src/time_helpers.py
new file mode 100644
index 0000000000000000000000000000000000000000..353a6748b15a39599de221718037573423dcab1c
--- /dev/null
+++ b/repo_templates/task1/variant_4/src/time_helpers.py
@@ -0,0 +1,12 @@
+"""Time helper functions."""
+from datetime import datetime
+
+
+def format_time(dt: datetime) -> str:
+    """Format datetime to string."""
+    return dt.strftime("%Y-%m-%d %H:%M")
+
+
+def parse_time(s: str) -> datetime:
+    """Parse string to datetime."""
+    return datetime.strptime(s, "%Y-%m-%d %H:%M")
diff --git a/repo_templates/task1/variant_4/tests/test_scheduler.py b/repo_templates/task1/variant_4/tests/test_scheduler.py
new file mode 100644
index 0000000000000000000000000000000000000000..912691d6bff1e4b5ae23855a7949163a2c0fde66
--- /dev/null
+++ b/repo_templates/task1/variant_4/tests/test_scheduler.py
@@ -0,0 +1,52 @@
+import pytest
+from datetime import datetime, timedelta
+from src.scheduler import is_available, get_next_available, days_until
+
+
+def test_slot_available():
+    booked = [
+        {"start": datetime(2024, 1, 1, 10, 0), "end": datetime(2024, 1, 1, 11, 0)}
+    ]
+    assert is_available(
+        datetime(2024, 1, 1, 12, 0),
+        datetime(2024, 1, 1, 13, 0),
+        booked
+    ) == True
+
+
+def test_slot_overlap():
+    booked = [
+        {"start": datetime(2024, 1, 1, 10, 0), "end": datetime(2024, 1, 1, 11, 0)}
+    ]
+    assert is_available(
+        datetime(2024, 1, 1, 10, 30),
+        datetime(2024, 1, 1, 11, 30),
+        booked
+    ) == False
+
+
+def test_adjacent_slots_allowed():
+    """Meeting starting exactly when another ends should be allowed."""
+    booked = [
+        {"start": datetime(2024, 1, 1, 10, 0), "end": datetime(2024, 1, 1, 11, 0)}
+    ]
+    # FAILS — returns False because <= is used instead of <
+    assert is_available(
+        datetime(2024, 1, 1, 11, 0),
+        datetime(2024, 1, 1, 12, 0),
+        booked
+    ) == True
+
+
+def test_days_until():
+    now = datetime(2024, 1, 1, 0, 0)
+    target = datetime(2024, 1, 11, 0, 0)
+    # FAILS — returns 11 instead of 10 because of +1 bug
+    assert days_until(target, now) == 10
+
+
+def test_days_until_same_day():
+    now = datetime(2024, 6, 15, 8, 0)
+    target = datetime(2024, 6, 15, 20, 0)
+    # FAILS — returns 1 instead of 0
+    assert days_until(target, now) == 0
diff --git a/repo_templates/task1/variant_5/meta.json b/repo_templates/task1/variant_5/meta.json
new file mode 100644
index 0000000000000000000000000000000000000000..2dda9ca0848cb96c1c32ffa7d7f4ac50bf218dd5
--- /dev/null
+++ b/repo_templates/task1/variant_5/meta.json
@@ -0,0 +1,15 @@
+{
+  "variant_id": "task1_v5",
+  "task": "task1",
+  "bug_files": ["src/formatter.py"],
+  "bug_description": "truncate doesn't account for ellipsis length; extract_between doesn't offset past start marker",
+  "failing_tests": ["test_truncate_long", "test_extract_between"],
+  "correct_lines": {
+    "src/formatter.py": {
+      "return text[:max_length] + \"...\"": "return text[:max_length - 3] + \"...\"",
+      "content_start = start_idx": "content_start = start_idx + len(start_marker)"
+    }
+  },
+  "total_files": 3,
+  "optimal_steps": 4
+}
diff --git a/repo_templates/task1/variant_5/src/constants.py b/repo_templates/task1/variant_5/src/constants.py
new file mode 100644
index 0000000000000000000000000000000000000000..d5ddc580300c91eba7587f5ef501df2084579636
--- /dev/null
+++ b/repo_templates/task1/variant_5/src/constants.py
@@ -0,0 +1,4 @@
+"""Constants for the formatter module."""
+
+DEFAULT_MAX_LENGTH = 50
+ELLIPSIS = "..."
diff --git a/repo_templates/task1/variant_5/src/formatter.py b/repo_templates/task1/variant_5/src/formatter.py
new file mode 100644
index 0000000000000000000000000000000000000000..32cc63c129e849d13925504487469ebb8f1cb0e6
--- /dev/null
+++ b/repo_templates/task1/variant_5/src/formatter.py
@@ -0,0 +1,29 @@
+"""Text formatter module for processing and formatting strings."""
+
+
+def truncate(text: str, max_length: int) -> str:
+    """Truncate text to max_length, adding '...' if truncated."""
+    if not text:
+        return ""
+    if len(text) <= max_length:
+        return text
+    # BUG: should be text[:max_length - 3] + "..." to account for ellipsis length
+    return text[:max_length] + "..."
+
+
+def extract_between(text: str, start_marker: str, end_marker: str) -> str:
+    """Extract text between two markers."""
+    start_idx = text.find(start_marker)
+    if start_idx == -1:
+        return ""
+    # BUG: should start after the marker, i.e. start_idx + len(start_marker)
+    content_start = start_idx  # wrong — includes the start_marker itself
+    end_idx = text.find(end_marker, content_start)
+    if end_idx == -1:
+        return ""
+    return text[content_start:end_idx]
+
+
+def capitalize_words(text: str) -> str:
+    """Capitalize the first letter of every word."""
+    return " ".join(w.capitalize() for w in text.split())
diff --git a/repo_templates/task1/variant_5/tests/test_formatter.py b/repo_templates/task1/variant_5/tests/test_formatter.py
new file mode 100644
index 0000000000000000000000000000000000000000..bcee191def644b17bbf838ba3bf7f9364eff432c
--- /dev/null
+++ b/repo_templates/task1/variant_5/tests/test_formatter.py
@@ -0,0 +1,35 @@
+import pytest
+from src.formatter import truncate, extract_between, capitalize_words
+
+
+def test_truncate_short():
+    assert truncate("hello", 10) == "hello"
+
+
+def test_truncate_long():
+    # FAILS — returns "hello worl..." (13 chars) instead of "hello w..." (10 chars)
+    result = truncate("hello world", 10)
+    assert len(result) <= 10
+    assert result == "hello w..."
+
+
+def test_truncate_empty():
+    assert truncate("", 5) == ""
+
+
+def test_extract_between():
+    text = "start[CONTENT]end"
+    # FAILS — returns "[CONTENT]" instead of "CONTENT" because start_idx not offset
+    assert extract_between(text, "[", "]") == "CONTENT"
+
+
+def test_extract_missing_marker():
+    assert extract_between("no markers here", "[", "]") == ""
+
+
+def test_capitalize_words():
+    assert capitalize_words("hello world foo") == "Hello World Foo"
+
+
+def test_capitalize_single():
+    assert capitalize_words("test") == "Test"
diff --git a/repo_templates/task2/variant_1/meta.json b/repo_templates/task2/variant_1/meta.json
new file mode 100644
index 0000000000000000000000000000000000000000..a3f8faffce86c1c5fa6d8d6a8d6ce3022cc4c4a6
--- /dev/null
+++ b/repo_templates/task2/variant_1/meta.json
@@ -0,0 +1,13 @@
+{
+  "variant_id": "task2_v1",
+  "task": "task2",
+  "bug_files": ["src/data_pipeline.py"],
+  "interface_files": ["src/validator.py"],
+  "bug_description": "data_pipeline passes str(record_id) but validator.py expects int",
+  "failing_tests": ["test_process_valid_batch"],
+  "fix_file": "src/data_pipeline.py",
+  "fix_description": "Remove str() wrapping — pass record['id'] directly",
+  "regression_test_must_cover": "TypeError raised when string is passed to validate_record",
+  "total_files": 4,
+  "optimal_steps": 6
+}
diff --git a/repo_templates/task2/variant_1/src/data_pipeline.py b/repo_templates/task2/variant_1/src/data_pipeline.py
new file mode 100644
index 0000000000000000000000000000000000000000..b33b464eaa81d0ab8e7e1034d42c5d3b46bea14b
--- /dev/null
+++ b/repo_templates/task2/variant_1/src/data_pipeline.py
@@ -0,0 +1,12 @@
+from src.validator import validate_record
+
+
+def process_batch(records: list) -> list:
+    """Process a batch of records through the validation pipeline."""
+    results = []
+    for record in records:
+        # BUG: passing record["id"] as string, but validate_record expects int
+        validated = validate_record(str(record["id"]), record["data"])
+        if validated:
+            results.append(validated)
+    return results
diff --git a/repo_templates/task2/variant_1/src/models.py b/repo_templates/task2/variant_1/src/models.py
new file mode 100644
index 0000000000000000000000000000000000000000..1d276376dea2f2e84e6a4a78bead24982e1c52ab
--- /dev/null
+++ b/repo_templates/task2/variant_1/src/models.py
@@ -0,0 +1,10 @@
+"""Data models for the pipeline."""
+
+
+class Record:
+    def __init__(self, record_id: int, data: dict):
+        self.record_id = record_id
+        self.data = data
+
+    def to_dict(self) -> dict:
+        return {"id": self.record_id, "data": self.data}
diff --git a/repo_templates/task2/variant_1/src/validator.py b/repo_templates/task2/variant_1/src/validator.py
new file mode 100644
index 0000000000000000000000000000000000000000..4ccbbe169a7fa40ccfb8cb69d5030022c72d70c2
--- /dev/null
+++ b/repo_templates/task2/variant_1/src/validator.py
@@ -0,0 +1,7 @@
+def validate_record(record_id: int, data: dict) -> dict:
+    """Validate a record. record_id must be a positive integer."""
+    if not isinstance(record_id, int):
+        raise TypeError(f"record_id must be int, got {type(record_id)}")
+    if record_id <= 0:
+        return None
+    return {"id": record_id, "data": data, "valid": True}
diff --git a/repo_templates/task2/variant_1/tests/test_pipeline.py b/repo_templates/task2/variant_1/tests/test_pipeline.py
new file mode 100644
index 0000000000000000000000000000000000000000..f0dad9873095ae7e9530723ac056a4eddbc6ce3d
--- /dev/null
+++ b/repo_templates/task2/variant_1/tests/test_pipeline.py
@@ -0,0 +1,18 @@
+import pytest
+from src.data_pipeline import process_batch
+
+
+def test_process_valid_batch():
+    records = [{"id": 1, "data": {"name": "test"}}, {"id": 2, "data": {"name": "test2"}}]
+    result = process_batch(records)
+    assert len(result) == 2  # FAILS — TypeError from wrong type
+
+
+def test_process_with_invalid_id():
+    records = [{"id": -1, "data": {"name": "bad"}}]
+    result = process_batch(records)
+    assert result == []
+
+
+def test_empty_batch():
+    assert process_batch([]) == []
diff --git a/repo_templates/task2/variant_2/meta.json b/repo_templates/task2/variant_2/meta.json
new file mode 100644
index 0000000000000000000000000000000000000000..01ccf3ea2394e8e5539b8d39372101cdef50d943
--- /dev/null
+++ b/repo_templates/task2/variant_2/meta.json
@@ -0,0 +1,13 @@
+{
+  "variant_id": "task2_v2",
+  "task": "task2",
+  "bug_files": ["src/email_sender.py"],
+  "interface_files": ["src/template_engine.py"],
+  "bug_description": "email_sender passes name= kwarg but template_engine expects username=",
+  "failing_tests": ["test_send_welcome_email", "test_welcome_email_structure"],
+  "fix_file": "src/email_sender.py",
+  "fix_description": "Change name=user_name to username=user_name in send_welcome_email",
+  "regression_test_must_cover": "KeyError when wrong kwarg name is used",
+  "total_files": 4,
+  "optimal_steps": 6
+}
diff --git a/repo_templates/task2/variant_2/src/config.py b/repo_templates/task2/variant_2/src/config.py
new file mode 100644
index 0000000000000000000000000000000000000000..55a755efa4e499ae6bd0ae5f4c0d5f165973cf87
--- /dev/null
+++ b/repo_templates/task2/variant_2/src/config.py
@@ -0,0 +1,5 @@
+"""Configuration for the email service."""
+
+SMTP_HOST = "localhost"
+SMTP_PORT = 587
+FROM_EMAIL = "noreply@example.com"
diff --git a/repo_templates/task2/variant_2/src/email_sender.py b/repo_templates/task2/variant_2/src/email_sender.py
new file mode 100644
index 0000000000000000000000000000000000000000..c3298b8eb54869c14bd3c9635072d77373984762
--- /dev/null
+++ b/repo_templates/task2/variant_2/src/email_sender.py
@@ -0,0 +1,25 @@
+"""Email sending service that uses the template engine."""
+from src.template_engine import render_template
+
+
+def send_welcome_email(user_name: str, user_email: str) -> dict:
+    """Send a welcome email to a new user."""
+    # BUG: passing 'name' but template_engine expects 'username'
+    body = render_template("welcome", name=user_name, email=user_email)
+    return {
+        "to": user_email,
+        "subject": "Welcome!",
+        "body": body,
+        "sent": True,
+    }
+
+
+def send_reset_email(user_email: str, reset_link: str) -> dict:
+    """Send a password reset email."""
+    body = render_template("reset", email=user_email, link=reset_link)
+    return {
+        "to": user_email,
+        "subject": "Password Reset",
+        "body": body,
+        "sent": True,
+    }
diff --git a/repo_templates/task2/variant_2/src/template_engine.py b/repo_templates/task2/variant_2/src/template_engine.py
new file mode 100644
index 0000000000000000000000000000000000000000..29a6c12fbf57926a83a17d3e99cc614e0f495d2d
--- /dev/null
+++ b/repo_templates/task2/variant_2/src/template_engine.py
@@ -0,0 +1,26 @@
+"""Template rendering engine for email bodies."""
+
+TEMPLATES = {
+    "welcome": "Hello {username}, welcome to our platform! Your email {email} has been registered.",
+    "reset": "Click here to reset your password: {link}. This was requested for {email}.",
+    "notify": "Hi {username}, you have a new notification: {message}.",
+}
+
+
+def render_template(template_name: str, **kwargs) -> str:
+    """
+    Render an email template with the given keyword arguments.
+    
+    Expected kwargs per template:
+    - welcome: username (str), email (str)
+    - reset: email (str), link (str)
+    - notify: username (str), message (str)
+    """
+    if template_name not in TEMPLATES:
+        raise ValueError(f"Unknown template: {template_name}")
+    
+    template = TEMPLATES[template_name]
+    try:
+        return template.format(**kwargs)
+    except KeyError as e:
+        raise KeyError(f"Missing required template variable: {e}")
diff --git a/repo_templates/task2/variant_2/tests/test_email.py b/repo_templates/task2/variant_2/tests/test_email.py
new file mode 100644
index 0000000000000000000000000000000000000000..97611af9de667e8a75c4c5821cee61456a1b8c32
--- /dev/null
+++ b/repo_templates/task2/variant_2/tests/test_email.py
@@ -0,0 +1,23 @@
+import pytest
+from src.email_sender import send_welcome_email, send_reset_email
+
+
+def test_send_welcome_email():
+    # FAILS — KeyError because email_sender passes 'name' but template expects 'username'
+    result = send_welcome_email("Alice", "alice@example.com")
+    assert result["sent"] == True
+    assert "Alice" in result["body"]
+    assert "alice@example.com" in result["body"]
+
+
+def test_send_reset_email():
+    result = send_reset_email("bob@example.com", "https://reset.link/abc")
+    assert result["sent"] == True
+    assert "https://reset.link/abc" in result["body"]
+
+
+def test_welcome_email_structure():
+    # FAILS — same KeyError as test_send_welcome_email
+    result = send_welcome_email("Charlie", "charlie@test.com")
+    assert result["to"] == "charlie@test.com"
+    assert result["subject"] == "Welcome!"
diff --git a/repo_templates/task2/variant_3/meta.json b/repo_templates/task2/variant_3/meta.json
new file mode 100644
index 0000000000000000000000000000000000000000..68fe446e33c1215870a7910826b92862a6b4241b
--- /dev/null
+++ b/repo_templates/task2/variant_3/meta.json
@@ -0,0 +1,13 @@
+{
+  "variant_id": "task2_v3",
+  "task": "task2",
+  "bug_files": ["src/order_processor.py"],
+  "interface_files": ["src/inventory_checker.py"],
+  "bug_description": "order_processor passes list of items but inventory_checker expects dict {sku: qty}",
+  "failing_tests": ["test_process_valid_order", "test_order_structure"],
+  "fix_file": "src/order_processor.py",
+  "fix_description": "Convert items list to dict: {item['sku']: item['qty'] for item in items}",
+  "regression_test_must_cover": "TypeError when list is passed to check_availability",
+  "total_files": 4,
+  "optimal_steps": 6
+}
diff --git a/repo_templates/task2/variant_3/src/inventory_checker.py b/repo_templates/task2/variant_3/src/inventory_checker.py
new file mode 100644
index 0000000000000000000000000000000000000000..a436f4644735bb3fa3ecf5fc4b363b37bcf483b0
--- /dev/null
+++ b/repo_templates/task2/variant_3/src/inventory_checker.py
@@ -0,0 +1,33 @@
+"""Inventory checking service. Verifies stock levels for orders."""
+
+# Simulated stock database
+STOCK = {
+    "WIDGET-A": 100,
+    "WIDGET-B": 50,
+    "GADGET-X": 0,
+    "GADGET-Y": 25,
+}
+
+
+def check_availability(requested_items: dict) -> bool:
+    """
+    Check if all requested items are available in stock.
+    
+    Args:
+        requested_items: dict mapping SKU to quantity, e.g. {"WIDGET-A": 5, "GADGET-Y": 2}
+    
+    Returns:
+        True if all items are available in sufficient quantity.
+    """
+    if not isinstance(requested_items, dict):
+        raise TypeError(
+            f"requested_items must be dict, got {type(requested_items).__name__}. "
+            f"Expected format: {{'SKU': quantity}}"
+        )
+    
+    for sku, qty in requested_items.items():
+        if sku not in STOCK:
+            return False
+        if STOCK[sku] < qty:
+            return False
+    return True
diff --git a/repo_templates/task2/variant_3/src/models.py b/repo_templates/task2/variant_3/src/models.py
new file mode 100644
index 0000000000000000000000000000000000000000..df879f1fee6f11648a776235c31906b990e6a564
--- /dev/null
+++ b/repo_templates/task2/variant_3/src/models.py
@@ -0,0 +1,10 @@
+"""Shared models for the order system."""
+
+
+class OrderItem:
+    def __init__(self, sku: str, qty: int):
+        self.sku = sku
+        self.qty = qty
+    
+    def to_dict(self) -> dict:
+        return {"sku": self.sku, "qty": self.qty}
diff --git a/repo_templates/task2/variant_3/src/order_processor.py b/repo_templates/task2/variant_3/src/order_processor.py
new file mode 100644
index 0000000000000000000000000000000000000000..72e342db14b3910457b14575747bed744a3a0f2b
--- /dev/null
+++ b/repo_templates/task2/variant_3/src/order_processor.py
@@ -0,0 +1,20 @@
+"""Order processing module that checks inventory before fulfillment."""
+from src.inventory_checker import check_availability
+
+
+def process_order(order: dict) -> dict:
+    """
+    Process an order by checking inventory availability.
+    order format: {"items": [{"sku": "ABC", "qty": 2}, ...], "customer": "..."}
+    """
+    items = order.get("items", [])
+    if not items:
+        return {"status": "error", "message": "No items in order"}
+
+    # BUG: passing items as list, but check_availability expects a dict {sku: qty}
+    available = check_availability(items)
+
+    if available:
+        return {"status": "confirmed", "items": items}
+    else:
+        return {"status": "out_of_stock", "items": items}
diff --git a/repo_templates/task2/variant_3/tests/test_orders.py b/repo_templates/task2/variant_3/tests/test_orders.py
new file mode 100644
index 0000000000000000000000000000000000000000..06204781cacd88c4c86fb15bc208543862352092
--- /dev/null
+++ b/repo_templates/task2/variant_3/tests/test_orders.py
@@ -0,0 +1,27 @@
+import pytest
+from src.order_processor import process_order
+
+
+def test_process_valid_order():
+    order = {
+        "items": [{"sku": "WIDGET-A", "qty": 2}, {"sku": "GADGET-Y", "qty": 1}],
+        "customer": "alice@example.com",
+    }
+    # FAILS — TypeError because list is passed instead of dict
+    result = process_order(order)
+    assert result["status"] == "confirmed"
+
+
+def test_empty_order():
+    result = process_order({"items": [], "customer": "bob@example.com"})
+    assert result["status"] == "error"
+
+
+def test_order_structure():
+    order = {
+        "items": [{"sku": "WIDGET-B", "qty": 5}],
+        "customer": "charlie@example.com",
+    }
+    # FAILS — same TypeError
+    result = process_order(order)
+    assert "items" in result
diff --git a/repo_templates/task2/variant_4/meta.json b/repo_templates/task2/variant_4/meta.json
new file mode 100644
index 0000000000000000000000000000000000000000..d51a929cb359d6e87f2897d20e767b5ee69deed9
--- /dev/null
+++ b/repo_templates/task2/variant_4/meta.json
@@ -0,0 +1,13 @@
+{
+  "variant_id": "task2_v4",
+  "task": "task2",
+  "bug_files": ["src/report_builder.py"],
+  "interface_files": ["src/date_formatter.py"],
+  "bug_description": "report_builder passes ISO string but date_formatter expects datetime object",
+  "failing_tests": ["test_build_monthly_report", "test_report_structure"],
+  "fix_file": "src/report_builder.py",
+  "fix_description": "Parse ISO strings to datetime before passing: datetime.strptime(start_date, '%Y-%m-%d')",
+  "regression_test_must_cover": "TypeError when string is passed to format_date_range",
+  "total_files": 4,
+  "optimal_steps": 6
+}
diff --git a/repo_templates/task2/variant_4/src/date_formatter.py b/repo_templates/task2/variant_4/src/date_formatter.py
new file mode 100644
index 0000000000000000000000000000000000000000..5baa0482b2159cdcba0f648f6b3737679f818695
--- /dev/null
+++ b/repo_templates/task2/variant_4/src/date_formatter.py
@@ -0,0 +1,28 @@
+"""Date formatting utilities for reports."""
+from datetime import datetime
+
+
+def format_date_range(start: datetime, end: datetime) -> str:
+    """
+    Format a date range for display in reports.
+    
+    Args:
+        start: datetime object for range start
+        end: datetime object for range end
+    
+    Returns:
+        Formatted string like "Jan 01, 2024 — Jan 31, 2024"
+    """
+    if not isinstance(start, datetime):
+        raise TypeError(f"start must be datetime, got {type(start).__name__}")
+    if not isinstance(end, datetime):
+        raise TypeError(f"end must be datetime, got {type(end).__name__}")
+    
+    return f"{start.strftime('%b %d, %Y')} — {end.strftime('%b %d, %Y')}"
+
+
+def format_single_date(dt: datetime) -> str:
+    """Format a single date."""
+    if not isinstance(dt, datetime):
+        raise TypeError(f"Expected datetime, got {type(dt).__name__}")
+    return dt.strftime("%B %d, %Y")
diff --git a/repo_templates/task2/variant_4/src/models.py b/repo_templates/task2/variant_4/src/models.py
new file mode 100644
index 0000000000000000000000000000000000000000..69dafb44b0baa406b52d6277bef3ca9129b8508e
--- /dev/null
+++ b/repo_templates/task2/variant_4/src/models.py
@@ -0,0 +1,3 @@
+"""Shared models for the reporting system."""
+
+REPORT_TYPES = ["monthly", "quarterly", "annual", "summary"]
diff --git a/repo_templates/task2/variant_4/src/report_builder.py b/repo_templates/task2/variant_4/src/report_builder.py
new file mode 100644
index 0000000000000000000000000000000000000000..832ba03400f7d55ea9318516b2494f9759285540
--- /dev/null
+++ b/repo_templates/task2/variant_4/src/report_builder.py
@@ -0,0 +1,28 @@
+"""Report builder that assembles reports with formatted dates."""
+from src.date_formatter import format_date_range
+
+
+def build_monthly_report(title: str, start_date: str, end_date: str, data: list) -> dict:
+    """
+    Build a monthly report with formatted date header.
+    
+    Args:
+        title: Report title
+        start_date: ISO format string 'YYYY-MM-DD'
+        end_date: ISO format string 'YYYY-MM-DD'
+        data: List of data points
+    """
+    # BUG: passing ISO string directly, but format_date_range expects datetime objects
+    date_header = format_date_range(start_date, end_date)
+    
+    return {
+        "title": title,
+        "period": date_header,
+        "total_records": len(data),
+        "data": data,
+    }
+
+
+def build_summary(title: str, content: str) -> dict:
+    """Build a simple summary report."""
+    return {"title": title, "content": content, "type": "summary"}
diff --git a/repo_templates/task2/variant_4/tests/test_reports.py b/repo_templates/task2/variant_4/tests/test_reports.py
new file mode 100644
index 0000000000000000000000000000000000000000..c654753d103a30373ea267e9f97bdaa6f84e55bc
--- /dev/null
+++ b/repo_templates/task2/variant_4/tests/test_reports.py
@@ -0,0 +1,28 @@
+import pytest
+from src.report_builder import build_monthly_report, build_summary
+
+
+def test_build_monthly_report():
+    # FAILS — TypeError because ISO string passed instead of datetime
+    result = build_monthly_report(
+        "Sales Report",
+        "2024-01-01",
+        "2024-01-31",
+        [{"amount": 100}, {"amount": 200}],
+    )
+    assert result["title"] == "Sales Report"
+    assert result["total_records"] == 2
+    assert "Jan" in result["period"]
+
+
+def test_build_summary():
+    result = build_summary("Q1 Summary", "Revenue increased 15%")
+    assert result["title"] == "Q1 Summary"
+    assert result["type"] == "summary"
+
+
+def test_report_structure():
+    # FAILS — same TypeError
+    result = build_monthly_report("Inventory", "2024-03-01", "2024-03-31", [])
+    assert "period" in result
+    assert result["total_records"] == 0
diff --git a/repo_templates/task2/variant_5/meta.json b/repo_templates/task2/variant_5/meta.json
new file mode 100644
index 0000000000000000000000000000000000000000..e4d6f929aed1c84fd03c8c9bfb41480fa3977c95
--- /dev/null
+++ b/repo_templates/task2/variant_5/meta.json
@@ -0,0 +1,13 @@
+{
+  "variant_id": "task2_v5",
+  "task": "task2",
+  "bug_files": ["src/cache_manager.py"],
+  "interface_files": ["src/serializer.py"],
+  "bug_description": "cache_manager passes bytes (.encode()) but serializer expects str",
+  "failing_tests": ["test_cache_set_and_get", "test_cache_delete"],
+  "fix_file": "src/cache_manager.py",
+  "fix_description": "Remove .encode('utf-8') — pass str(value) directly to serialize_value",
+  "regression_test_must_cover": "TypeError when bytes is passed to serialize_value",
+  "total_files": 4,
+  "optimal_steps": 6
+}
diff --git a/repo_templates/task2/variant_5/src/cache_manager.py b/repo_templates/task2/variant_5/src/cache_manager.py
new file mode 100644
index 0000000000000000000000000000000000000000..b94186ff7e6458926d021ad9027ae5008d1e472d
--- /dev/null
+++ b/repo_templates/task2/variant_5/src/cache_manager.py
@@ -0,0 +1,36 @@
+"""Cache management service that stores serialized data."""
+from src.serializer import serialize_value, deserialize_value
+
+
+class CacheManager:
+    """Simple in-memory cache with serialization."""
+    
+    def __init__(self):
+        self._store = {}
+    
+    def set(self, key: str, value) -> None:
+        """Store a value in the cache after serializing it."""
+        # BUG: passing bytes (encoded) instead of str to serialize_value
+        serialized = serialize_value(str(value).encode('utf-8'))
+        self._store[key] = serialized
+    
+    def get(self, key: str, default=None):
+        """Retrieve and deserialize a value from cache."""
+        if key not in self._store:
+            return default
+        return deserialize_value(self._store[key])
+    
+    def delete(self, key: str) -> bool:
+        """Remove a key from cache."""
+        if key in self._store:
+            del self._store[key]
+            return True
+        return False
+    
+    def clear(self):
+        """Clear all cached values."""
+        self._store.clear()
+    
+    def keys(self) -> list:
+        """Return all cache keys."""
+        return list(self._store.keys())
diff --git a/repo_templates/task2/variant_5/src/config.py b/repo_templates/task2/variant_5/src/config.py
new file mode 100644
index 0000000000000000000000000000000000000000..6aa7a442c951c2091dcca774cf0c194e12054407
--- /dev/null
+++ b/repo_templates/task2/variant_5/src/config.py
@@ -0,0 +1,4 @@
+"""Cache configuration constants."""
+
+MAX_CACHE_SIZE = 1000
+DEFAULT_TTL = 300  # seconds
diff --git a/repo_templates/task2/variant_5/src/serializer.py b/repo_templates/task2/variant_5/src/serializer.py
new file mode 100644
index 0000000000000000000000000000000000000000..a547217b173e8bf7873c2862742d9ae8e4e91803
--- /dev/null
+++ b/repo_templates/task2/variant_5/src/serializer.py
@@ -0,0 +1,25 @@
+"""Serialization utilities for the cache system."""
+import json
+
+
+def serialize_value(value: str) -> str:
+    """
+    Serialize a value to a JSON string for storage.
+    
+    Args:
+        value: must be a string (str type)
+    
+    Returns:
+        JSON-encoded string
+    """
+    if not isinstance(value, str):
+        raise TypeError(f"value must be str, got {type(value).__name__}")
+    return json.dumps({"data": value})
+
+
+def deserialize_value(serialized: str):
+    """Deserialize a JSON string back to the original value."""
+    if not isinstance(serialized, str):
+        raise TypeError(f"serialized must be str, got {type(serialized).__name__}")
+    result = json.loads(serialized)
+    return result.get("data")
diff --git a/repo_templates/task2/variant_5/tests/test_cache.py b/repo_templates/task2/variant_5/tests/test_cache.py
new file mode 100644
index 0000000000000000000000000000000000000000..9187669bca147c864991afff0a0b6be403db44be
--- /dev/null
+++ b/repo_templates/task2/variant_5/tests/test_cache.py
@@ -0,0 +1,37 @@
+import pytest
+from src.cache_manager import CacheManager
+
+
+def test_cache_set_and_get():
+    cache = CacheManager()
+    # FAILS — TypeError because bytes passed to serializer instead of str
+    cache.set("user:1", "Alice")
+    assert cache.get("user:1") == "Alice"
+
+
+def test_cache_get_missing():
+    cache = CacheManager()
+    assert cache.get("nonexistent", "default") == "default"
+
+
+def test_cache_delete():
+    cache = CacheManager()
+    # FAILS — same TypeError on set
+    cache.set("temp", "data")
+    assert cache.delete("temp") == True
+    assert cache.get("temp") is None
+
+
+def test_cache_clear():
+    cache = CacheManager()
+    cache._store["a"] = '{"data": "1"}'
+    cache._store["b"] = '{"data": "2"}'
+    cache.clear()
+    assert cache.keys() == []
+
+
+def test_cache_keys():
+    cache = CacheManager()
+    cache._store["x"] = '{"data": "1"}'
+    cache._store["y"] = '{"data": "2"}'
+    assert sorted(cache.keys()) == ["x", "y"]
diff --git a/repo_templates/task3/variant_1/FEATURE_SPEC.md b/repo_templates/task3/variant_1/FEATURE_SPEC.md
new file mode 100644
index 0000000000000000000000000000000000000000..5156aa5b4090a9fa7c270acbb181c51515a9400f
--- /dev/null
+++ b/repo_templates/task3/variant_1/FEATURE_SPEC.md
@@ -0,0 +1,19 @@
+# Feature: Add request rate limiting to the API
+
+## Background
+The current API in src/api.py has no rate limiting. Any client can make unlimited requests.
+
+## What to implement
+Add a rate limiter that:
+1. Tracks requests per client IP in a dict stored in src/middleware.py
+2. Allows maximum 5 requests per minute per IP
+3. Returns HTTP 429 status with message "Rate limit exceeded" when limit is hit
+4. Resets the count for an IP after 60 seconds of no requests
+
+## Files to modify
+- src/middleware.py — add RateLimiter class with is_allowed(ip: str) -> bool method
+- src/api.py — import and use RateLimiter in the request handler
+
+## Do not modify
+- tests/test_api.py — tests are already written, make them pass
+- src/config.py — contains RATE_LIMIT and RATE_WINDOW constants you should use
diff --git a/repo_templates/task3/variant_1/meta.json b/repo_templates/task3/variant_1/meta.json
new file mode 100644
index 0000000000000000000000000000000000000000..0f52d14cb004ace7bbc882f43f2b48eb04ae8004
--- /dev/null
+++ b/repo_templates/task3/variant_1/meta.json
@@ -0,0 +1,11 @@
+{
+  "variant_id": "task3_v1",
+  "task": "task3",
+  "feature_spec_file": "FEATURE_SPEC.md",
+  "files_to_implement": ["src/middleware.py", "src/api.py"],
+  "read_first_files": ["src/config.py", "FEATURE_SPEC.md", "tests/test_api.py"],
+  "test_file": "tests/test_api.py",
+  "total_tests": 4,
+  "total_files": 5,
+  "optimal_steps": 8
+}
diff --git a/repo_templates/task3/variant_1/src/api.py b/repo_templates/task3/variant_1/src/api.py
new file mode 100644
index 0000000000000000000000000000000000000000..2f5e59e4de8d6ed4f7f1a871f39073cf4baf0f9a
--- /dev/null
+++ b/repo_templates/task3/variant_1/src/api.py
@@ -0,0 +1,12 @@
+from fastapi import FastAPI, Request, HTTPException
+# TODO: import RateLimiter from middleware
+
+app = FastAPI()
+# TODO: instantiate rate_limiter = RateLimiter()
+
+
+@app.get("/data")
+async def get_data(request: Request):
+    client_ip = request.client.host
+    # TODO: check rate_limiter.is_allowed(client_ip) and raise HTTPException(429) if not
+    return {"data": "some_data", "ip": client_ip}
diff --git a/repo_templates/task3/variant_1/src/config.py b/repo_templates/task3/variant_1/src/config.py
new file mode 100644
index 0000000000000000000000000000000000000000..7e9d612b816b11292427d3be66680dc05fbbcb5d
--- /dev/null
+++ b/repo_templates/task3/variant_1/src/config.py
@@ -0,0 +1,2 @@
+RATE_LIMIT = 5       # max requests per window
+RATE_WINDOW = 60     # window in seconds
diff --git a/repo_templates/task3/variant_1/src/middleware.py b/repo_templates/task3/variant_1/src/middleware.py
new file mode 100644
index 0000000000000000000000000000000000000000..651da0438e38af1a48798eda861d809fc118c1b1
--- /dev/null
+++ b/repo_templates/task3/variant_1/src/middleware.py
@@ -0,0 +1,3 @@
+# TODO: implement RateLimiter class
+# Must have: is_allowed(ip: str) -> bool
+# Uses config.RATE_LIMIT and config.RATE_WINDOW
diff --git a/repo_templates/task3/variant_1/tests/test_api.py b/repo_templates/task3/variant_1/tests/test_api.py
new file mode 100644
index 0000000000000000000000000000000000000000..4e5c9ce252302bc9d2d060f0239af771f68fff1c
--- /dev/null
+++ b/repo_templates/task3/variant_1/tests/test_api.py
@@ -0,0 +1,38 @@
+import pytest
+import time
+from unittest.mock import patch
+from src.middleware import RateLimiter
+from src.config import RATE_LIMIT
+
+
+def test_rate_limiter_allows_within_limit():
+    rl = RateLimiter()
+    for _ in range(RATE_LIMIT):
+        assert rl.is_allowed("192.168.1.1") == True
+
+
+def test_rate_limiter_blocks_over_limit():
+    rl = RateLimiter()
+    for _ in range(RATE_LIMIT):
+        rl.is_allowed("192.168.1.1")
+    assert rl.is_allowed("192.168.1.1") == False
+
+
+def test_rate_limiter_different_ips():
+    rl = RateLimiter()
+    for _ in range(RATE_LIMIT):
+        rl.is_allowed("192.168.1.1")
+    # Different IP should still be allowed
+    assert rl.is_allowed("10.0.0.1") == True
+
+
+def test_rate_limiter_resets_after_window():
+    rl = RateLimiter()
+    with patch('time.time') as mock_time:
+        mock_time.return_value = 1000.0
+        for _ in range(RATE_LIMIT):
+            rl.is_allowed("192.168.1.1")
+        assert rl.is_allowed("192.168.1.1") == False
+        # Advance time past window
+        mock_time.return_value = 1065.0
+        assert rl.is_allowed("192.168.1.1") == True
diff --git a/repo_templates/task3/variant_2/FEATURE_SPEC.md b/repo_templates/task3/variant_2/FEATURE_SPEC.md
new file mode 100644
index 0000000000000000000000000000000000000000..4ad452fa18163d3e098de696bd968e69a4e5cc5d
--- /dev/null
+++ b/repo_templates/task3/variant_2/FEATURE_SPEC.md
@@ -0,0 +1,22 @@
+# Feature: Implement an LRU Cache
+
+## Background
+The data store in src/data_store.py performs expensive computations. We need a Least Recently Used (LRU) cache to avoid redundant calculations.
+
+## What to implement
+Add an LRU cache class in src/cache.py that:
+1. Has a configurable `max_size` parameter (use CACHE_MAX_SIZE from config)
+2. Implements `get(key)` -> returns value or None if not cached
+3. Implements `put(key, value)` -> stores value, evicts LRU entry if at capacity
+4. Implements `size` property -> returns current number of cached items
+5. Implements `clear()` -> removes all entries
+
+Then integrate it in src/data_store.py by caching compute results.
+
+## Files to modify
+- src/cache.py — implement the LRUCache class
+- src/data_store.py — use LRUCache to cache expensive_compute results
+
+## Do not modify
+- tests/test_cache.py — tests are already written, make them pass
+- src/config.py — contains CACHE_MAX_SIZE constant
diff --git a/repo_templates/task3/variant_2/meta.json b/repo_templates/task3/variant_2/meta.json
new file mode 100644
index 0000000000000000000000000000000000000000..dc6c8f950bd500402d4fa6c4b9e7656f23c312cf
--- /dev/null
+++ b/repo_templates/task3/variant_2/meta.json
@@ -0,0 +1,11 @@
+{
+  "variant_id": "task3_v2",
+  "task": "task3",
+  "feature_spec_file": "FEATURE_SPEC.md",
+  "files_to_implement": ["src/cache.py", "src/data_store.py"],
+  "read_first_files": ["src/config.py", "FEATURE_SPEC.md", "tests/test_cache.py"],
+  "test_file": "tests/test_cache.py",
+  "total_tests": 7,
+  "total_files": 5,
+  "optimal_steps": 8
+}
diff --git a/repo_templates/task3/variant_2/src/cache.py b/repo_templates/task3/variant_2/src/cache.py
new file mode 100644
index 0000000000000000000000000000000000000000..fae59aff5e7e65236591c6ad49161d436fad00fe
--- /dev/null
+++ b/repo_templates/task3/variant_2/src/cache.py
@@ -0,0 +1,3 @@
+# TODO: implement LRUCache class
+# Must have: get(key), put(key, value), size property, clear()
+# Uses config.CACHE_MAX_SIZE for max capacity
diff --git a/repo_templates/task3/variant_2/src/config.py b/repo_templates/task3/variant_2/src/config.py
new file mode 100644
index 0000000000000000000000000000000000000000..8c751005ab956692f7a366344ae9fa58373788be
--- /dev/null
+++ b/repo_templates/task3/variant_2/src/config.py
@@ -0,0 +1 @@
+CACHE_MAX_SIZE = 3   # maximum number of items in LRU cache
diff --git a/repo_templates/task3/variant_2/src/data_store.py b/repo_templates/task3/variant_2/src/data_store.py
new file mode 100644
index 0000000000000000000000000000000000000000..4ab2fcd1c044ed4b12e5f108e51f914aa1a26c48
--- /dev/null
+++ b/repo_templates/task3/variant_2/src/data_store.py
@@ -0,0 +1,22 @@
+"""Data store with expensive computations."""
+# TODO: import LRUCache from cache and use it to cache results
+
+
+def expensive_compute(key: str) -> dict:
+    """Simulate an expensive computation."""
+    # This is intentionally slow to motivate caching
+    result = {"key": key, "value": sum(ord(c) for c in key), "computed": True}
+    return result
+
+
+class DataStore:
+    """Store that computes and caches results."""
+    
+    def __init__(self):
+        # TODO: initialize an LRUCache instance here
+        pass
+    
+    def get_data(self, key: str) -> dict:
+        """Get data for key, using cache if available."""
+        # TODO: check cache first, compute only on miss, store in cache
+        return expensive_compute(key)
diff --git a/repo_templates/task3/variant_2/tests/test_cache.py b/repo_templates/task3/variant_2/tests/test_cache.py
new file mode 100644
index 0000000000000000000000000000000000000000..d3b885b7d4304d47d480b0641d2da08d3c18e75c
--- /dev/null
+++ b/repo_templates/task3/variant_2/tests/test_cache.py
@@ -0,0 +1,60 @@
+import pytest
+from src.cache import LRUCache
+from src.config import CACHE_MAX_SIZE
+
+
+def test_cache_put_and_get():
+    cache = LRUCache(max_size=CACHE_MAX_SIZE)
+    cache.put("a", 1)
+    assert cache.get("a") == 1
+
+
+def test_cache_miss():
+    cache = LRUCache(max_size=CACHE_MAX_SIZE)
+    assert cache.get("nonexistent") is None
+
+
+def test_cache_eviction():
+    cache = LRUCache(max_size=3)
+    cache.put("a", 1)
+    cache.put("b", 2)
+    cache.put("c", 3)
+    cache.put("d", 4)  # should evict "a"
+    assert cache.get("a") is None
+    assert cache.get("d") == 4
+
+
+def test_cache_lru_order():
+    cache = LRUCache(max_size=3)
+    cache.put("a", 1)
+    cache.put("b", 2)
+    cache.put("c", 3)
+    cache.get("a")  # access "a" makes it most recent
+    cache.put("d", 4)  # should evict "b" (least recently used)
+    assert cache.get("a") == 1
+    assert cache.get("b") is None
+
+
+def test_cache_size():
+    cache = LRUCache(max_size=5)
+    assert cache.size == 0
+    cache.put("x", 10)
+    cache.put("y", 20)
+    assert cache.size == 2
+
+
+def test_cache_clear():
+    cache = LRUCache(max_size=5)
+    cache.put("a", 1)
+    cache.put("b", 2)
+    cache.clear()
+    assert cache.size == 0
+    assert cache.get("a") is None
+
+
+def test_cache_update_existing():
+    cache = LRUCache(max_size=3)
+    cache.put("a", 1)
+    cache.put("a", 100)
+    assert cache.get("a") == 100
+    assert cache.size == 1
diff --git a/repo_templates/task3/variant_3/FEATURE_SPEC.md b/repo_templates/task3/variant_3/FEATURE_SPEC.md
new file mode 100644
index 0000000000000000000000000000000000000000..ece792f5f7194bdb8b3d5df293d9b5ecc8f2b9c6
--- /dev/null
+++ b/repo_templates/task3/variant_3/FEATURE_SPEC.md
@@ -0,0 +1,23 @@
+# Feature: Add Input Validation Decorator
+
+## Background
+The handlers in src/handlers.py accept user input but have no validation. We need a reusable decorator.
+
+## What to implement
+Create a `validate_input` decorator in src/validators.py that:
+1. Takes a schema dict mapping field names to types (e.g., `{"name": str, "age": int}`)
+2. Validates the first argument (a dict) passed to the decorated function
+3. Raises `ValueError` with descriptive message if:
+   - A required field is missing from the input
+   - A field has the wrong type
+4. Passes through to the original function if validation succeeds
+
+Then apply the decorator to handlers in src/handlers.py.
+
+## Files to modify
+- src/validators.py — implement the `validate_input` decorator
+- src/handlers.py — apply `@validate_input(schema)` to each handler
+
+## Do not modify
+- tests/test_handlers.py — tests are already written, make them pass
+- src/config.py — contains handler configuration
diff --git a/repo_templates/task3/variant_3/meta.json b/repo_templates/task3/variant_3/meta.json
new file mode 100644
index 0000000000000000000000000000000000000000..9696d4fd7490bea1d023beab538d178409a55f3c
--- /dev/null
+++ b/repo_templates/task3/variant_3/meta.json
@@ -0,0 +1,11 @@
+{
+  "variant_id": "task3_v3",
+  "task": "task3",
+  "feature_spec_file": "FEATURE_SPEC.md",
+  "files_to_implement": ["src/validators.py", "src/handlers.py"],
+  "read_first_files": ["src/config.py", "FEATURE_SPEC.md", "tests/test_handlers.py"],
+  "test_file": "tests/test_handlers.py",
+  "total_tests": 6,
+  "total_files": 5,
+  "optimal_steps": 8
+}
diff --git a/repo_templates/task3/variant_3/src/config.py b/repo_templates/task3/variant_3/src/config.py
new file mode 100644
index 0000000000000000000000000000000000000000..419fd241ee4f0a176899169c68e193abf0e67dde
--- /dev/null
+++ b/repo_templates/task3/variant_3/src/config.py
@@ -0,0 +1,3 @@
+"""Handler configuration."""
+ALLOWED_ROLES = ["admin", "user", "moderator"]
+MAX_NAME_LENGTH = 100
diff --git a/repo_templates/task3/variant_3/src/handlers.py b/repo_templates/task3/variant_3/src/handlers.py
new file mode 100644
index 0000000000000000000000000000000000000000..2d52a7a233fda0844027f88b5692d90040832947
--- /dev/null
+++ b/repo_templates/task3/variant_3/src/handlers.py
@@ -0,0 +1,29 @@
+"""Request handlers for the application."""
+# TODO: import validate_input from validators
+# TODO: apply @validate_input decorator to each handler
+
+
+def create_user(data: dict) -> dict:
+    """Create a new user from input data.
+    Expected fields: name (str), age (int), email (str)
+    """
+    # TODO: add @validate_input({"name": str, "age": int, "email": str})
+    return {
+        "id": 1,
+        "name": data["name"],
+        "age": data["age"],
+        "email": data["email"],
+        "created": True,
+    }
+
+
+def update_settings(data: dict) -> dict:
+    """Update user settings.
+    Expected fields: theme (str), notifications (bool)
+    """
+    # TODO: add @validate_input({"theme": str, "notifications": bool})
+    return {
+        "theme": data["theme"],
+        "notifications": data["notifications"],
+        "updated": True,
+    }
diff --git a/repo_templates/task3/variant_3/src/validators.py b/repo_templates/task3/variant_3/src/validators.py
new file mode 100644
index 0000000000000000000000000000000000000000..f24543f132dde9376df77123ebeb6142c13dd9ca
--- /dev/null
+++ b/repo_templates/task3/variant_3/src/validators.py
@@ -0,0 +1,4 @@
+# TODO: implement validate_input decorator
+# Should take a schema dict {field_name: expected_type}
+# Should validate the first argument of the decorated function
+# Raise ValueError on missing fields or wrong types
diff --git a/repo_templates/task3/variant_3/tests/test_handlers.py b/repo_templates/task3/variant_3/tests/test_handlers.py
new file mode 100644
index 0000000000000000000000000000000000000000..74e58d8958f14750a3ea47e1bcbb6f33b1c2433b
--- /dev/null
+++ b/repo_templates/task3/variant_3/tests/test_handlers.py
@@ -0,0 +1,33 @@
+import pytest
+from src.handlers import create_user, update_settings
+
+
+def test_create_user_valid():
+    result = create_user({"name": "Alice", "age": 30, "email": "alice@test.com"})
+    assert result["created"] == True
+    assert result["name"] == "Alice"
+
+
+def test_create_user_missing_field():
+    with pytest.raises(ValueError, match="missing"):
+        create_user({"name": "Bob"})
+
+
+def test_create_user_wrong_type():
+    with pytest.raises(ValueError, match="type"):
+        create_user({"name": "Charlie", "age": "thirty", "email": "c@test.com"})
+
+
+def test_update_settings_valid():
+    result = update_settings({"theme": "dark", "notifications": True})
+    assert result["updated"] == True
+
+
+def test_update_settings_missing():
+    with pytest.raises(ValueError, match="missing"):
+        update_settings({"theme": "light"})
+
+
+def test_update_settings_wrong_type():
+    with pytest.raises(ValueError, match="type"):
+        update_settings({"theme": 123, "notifications": True})
diff --git a/repo_templates/task3/variant_4/FEATURE_SPEC.md b/repo_templates/task3/variant_4/FEATURE_SPEC.md
new file mode 100644
index 0000000000000000000000000000000000000000..7634258994c9b42b2dac5d14bc3c65100319ebf1
--- /dev/null
+++ b/repo_templates/task3/variant_4/FEATURE_SPEC.md
@@ -0,0 +1,19 @@
+# Feature: Add HTTP Retry Mechanism
+
+## Background
+The HTTP client in src/http_client.py makes external API calls that sometimes fail with transient errors (5xx, timeouts). We need automatic retry logic.
+
+## What to implement
+Create a retry decorator/function in src/http_client.py that:
+1. Retries on specified exception types (from retry_config.py)
+2. Uses exponential backoff: wait = base_delay * (2 ** attempt)
+3. Respects MAX_RETRIES from config
+4. Raises the last exception if all retries are exhausted
+5. Keeps a log of attempts (list of attempt dicts)
+
+## Files to modify
+- src/http_client.py — implement `RetryHandler` class with `execute(func, *args, **kwargs)` method
+
+## Do not modify
+- tests/test_retry.py — tests are already written, make them pass
+- src/retry_config.py — contains retry configuration
diff --git a/repo_templates/task3/variant_4/meta.json b/repo_templates/task3/variant_4/meta.json
new file mode 100644
index 0000000000000000000000000000000000000000..6c329da67e1061345694491575d63d81f097f78d
--- /dev/null
+++ b/repo_templates/task3/variant_4/meta.json
@@ -0,0 +1,11 @@
+{
+  "variant_id": "task3_v4",
+  "task": "task3",
+  "feature_spec_file": "FEATURE_SPEC.md",
+  "files_to_implement": ["src/http_client.py"],
+  "read_first_files": ["src/retry_config.py", "FEATURE_SPEC.md", "tests/test_retry.py"],
+  "test_file": "tests/test_retry.py",
+  "total_tests": 5,
+  "total_files": 4,
+  "optimal_steps": 8
+}
diff --git a/repo_templates/task3/variant_4/src/http_client.py b/repo_templates/task3/variant_4/src/http_client.py
new file mode 100644
index 0000000000000000000000000000000000000000..94ba59c38aabedd27265ca72134cffd9e731e96d
--- /dev/null
+++ b/repo_templates/task3/variant_4/src/http_client.py
@@ -0,0 +1,6 @@
+"""HTTP client with retry capabilities."""
+# TODO: implement RetryHandler class
+# Must have: execute(func, *args, **kwargs) method
+# Uses retry_config.MAX_RETRIES, BASE_DELAY, RETRYABLE_EXCEPTIONS
+# Should use exponential backoff: wait = base_delay * (2 ** attempt)
+# Should store attempt logs in self.attempts list
diff --git a/repo_templates/task3/variant_4/src/retry_config.py b/repo_templates/task3/variant_4/src/retry_config.py
new file mode 100644
index 0000000000000000000000000000000000000000..aec988cfe9e536a90b126bf8b9cd24ca181c536d
--- /dev/null
+++ b/repo_templates/task3/variant_4/src/retry_config.py
@@ -0,0 +1,5 @@
+"""Retry configuration constants."""
+
+MAX_RETRIES = 3              # max number of retry attempts
+BASE_DELAY = 0.1             # base delay in seconds
+RETRYABLE_EXCEPTIONS = (ConnectionError, TimeoutError, OSError)
diff --git a/repo_templates/task3/variant_4/tests/test_retry.py b/repo_templates/task3/variant_4/tests/test_retry.py
new file mode 100644
index 0000000000000000000000000000000000000000..dda0bba0ef08e114793d6877b2c2f4c835332e48
--- /dev/null
+++ b/repo_templates/task3/variant_4/tests/test_retry.py
@@ -0,0 +1,67 @@
+import pytest
+from unittest.mock import patch
+from src.http_client import RetryHandler
+from src.retry_config import MAX_RETRIES
+
+
+def test_execute_success_first_try():
+    handler = RetryHandler()
+    result = handler.execute(lambda: "ok")
+    assert result == "ok"
+    assert len(handler.attempts) == 1
+
+
+def test_execute_retries_on_failure():
+    call_count = 0
+    def flaky():
+        nonlocal call_count
+        call_count += 1
+        if call_count < 3:
+            raise ConnectionError("connection failed")
+        return "success"
+    
+    handler = RetryHandler()
+    with patch('time.sleep'):  # skip actual delays
+        result = handler.execute(flaky)
+    assert result == "success"
+    assert len(handler.attempts) == 3
+
+
+def test_execute_exhausts_retries():
+    def always_fail():
+        raise ConnectionError("permanent failure")
+    
+    handler = RetryHandler()
+    with patch('time.sleep'):
+        with pytest.raises(ConnectionError):
+            handler.execute(always_fail)
+    assert len(handler.attempts) == MAX_RETRIES + 1
+
+
+def test_non_retryable_exception():
+    def bad_input():
+        raise ValueError("bad input — not retryable")
+    
+    handler = RetryHandler()
+    with pytest.raises(ValueError):
+        handler.execute(bad_input)
+    assert len(handler.attempts) == 1
+
+
+def test_exponential_backoff_delays():
+    call_count = 0
+    def fail_twice():
+        nonlocal call_count
+        call_count += 1
+        if call_count <= 2:
+            raise TimeoutError("timeout")
+        return "done"
+    
+    handler = RetryHandler()
+    delays = []
+    with patch('time.sleep', side_effect=lambda d: delays.append(d)):
+        handler.execute(fail_twice)
+    
+    # Should have exponential delays
+    assert len(delays) == 2
+    assert delays[1] > delays[0]
diff --git a/repo_templates/task3/variant_5/FEATURE_SPEC.md b/repo_templates/task3/variant_5/FEATURE_SPEC.md
new file mode 100644
index 0000000000000000000000000000000000000000..1aa84093c572929427e9778463f54ddb2a2d16b9
--- /dev/null
+++ b/repo_templates/task3/variant_5/FEATURE_SPEC.md
@@ -0,0 +1,23 @@
+# Feature: Add Pagination Support to Query Builder
+
+## Background
+The query builder in src/query_builder.py constructs database queries but currently returns all results. We need pagination.
+
+## What to implement
+Add a `Paginator` class in src/query_builder.py that:
+1. Takes `items` (list), `page` (int, 1-indexed), and `page_size` (int) from config
+2. Implements `get_page()` -> returns the items for the current page
+3. Implements `total_pages` property -> returns total number of pages
+4. Implements `has_next` property -> True if there are more pages
+5. Implements `has_prev` property -> True if current page > 1
+6. Implements `get_page_info()` -> returns dict with page metadata
+
+Then use Paginator in src/api_endpoints.py to paginate query results.
+
+## Files to modify
+- src/query_builder.py — implement Paginator class
+- src/api_endpoints.py — use Paginator in list_items endpoint
+
+## Do not modify
+- tests/test_pagination.py — tests are already written, make them pass
+- src/config.py — contains DEFAULT_PAGE_SIZE
diff --git a/repo_templates/task3/variant_5/meta.json b/repo_templates/task3/variant_5/meta.json
new file mode 100644
index 0000000000000000000000000000000000000000..6a2343b4b812940a877aff916dcf1c2ff6ec38d3
--- /dev/null
+++ b/repo_templates/task3/variant_5/meta.json
@@ -0,0 +1,11 @@
+{
+  "variant_id": "task3_v5",
+  "task": "task3",
+  "feature_spec_file": "FEATURE_SPEC.md",
+  "files_to_implement": ["src/query_builder.py", "src/api_endpoints.py"],
+  "read_first_files": ["src/config.py", "FEATURE_SPEC.md", "tests/test_pagination.py"],
+  "test_file": "tests/test_pagination.py",
+  "total_tests": 8,
+  "total_files": 5,
+  "optimal_steps": 8
+}
diff --git a/repo_templates/task3/variant_5/src/api_endpoints.py b/repo_templates/task3/variant_5/src/api_endpoints.py
new file mode 100644
index 0000000000000000000000000000000000000000..920ff58bfbfbc8e4d1ec286b6524bce0911ed528
--- /dev/null
+++ b/repo_templates/task3/variant_5/src/api_endpoints.py
@@ -0,0 +1,13 @@
+"""API endpoints that return paginated results."""
+# TODO: import Paginator from query_builder
+from src.query_builder import build_query
+
+
+def list_items(table: str, page: int = 1, page_size: int = None) -> dict:
+    """List items from a table with pagination."""
+    # TODO: use Paginator to paginate results
+    items = build_query(table)
+    return {
+        "items": items,  # Should return only current page
+        "total": len(items),
+    }
diff --git a/repo_templates/task3/variant_5/src/config.py b/repo_templates/task3/variant_5/src/config.py
new file mode 100644
index 0000000000000000000000000000000000000000..e82387b0c552edff2456fd39530ee25d753e4e92
--- /dev/null
+++ b/repo_templates/task3/variant_5/src/config.py
@@ -0,0 +1,3 @@
+"""Pagination configuration."""
+DEFAULT_PAGE_SIZE = 10
+MAX_PAGE_SIZE = 100
diff --git a/repo_templates/task3/variant_5/src/query_builder.py b/repo_templates/task3/variant_5/src/query_builder.py
new file mode 100644
index 0000000000000000000000000000000000000000..8e5a2474987223a5ca6b30120fc964731f924e7c
--- /dev/null
+++ b/repo_templates/task3/variant_5/src/query_builder.py
@@ -0,0 +1,17 @@
+"""Query builder with pagination support."""
+# TODO: implement Paginator class
+# Must have: get_page(), total_pages, has_next, has_prev, get_page_info()
+# Uses config.DEFAULT_PAGE_SIZE
+
+
+def build_query(table: str, filters: dict = None) -> list:
+    """Build and execute a mock query, returning all matching items."""
+    # Simulated database results
+    all_items = [{"id": i, "table": table, "data": f"item_{i}"} for i in range(1, 51)]
+    
+    if filters:
+        # Simple filter simulation
+        for key, value in filters.items():
+            all_items = [item for item in all_items if item.get(key) == value]
+    
+    return all_items
diff --git a/repo_templates/task3/variant_5/tests/test_pagination.py b/repo_templates/task3/variant_5/tests/test_pagination.py
new file mode 100644
index 0000000000000000000000000000000000000000..a23add5d318159d3c1687316c496fbd902b4db35
--- /dev/null
+++ b/repo_templates/task3/variant_5/tests/test_pagination.py
@@ -0,0 +1,65 @@
+import pytest
+import math
+from src.query_builder import Paginator
+from src.config import DEFAULT_PAGE_SIZE
+
+
+def test_paginator_first_page():
+    items = list(range(25))
+    p = Paginator(items, page=1, page_size=10)
+    result = p.get_page()
+    assert len(result) == 10
+    assert result == list(range(10))
+
+
+def test_paginator_last_page():
+    items = list(range(25))
+    p = Paginator(items, page=3, page_size=10)
+    result = p.get_page()
+    assert len(result) == 5
+    assert result == list(range(20, 25))
+
+
+def test_paginator_total_pages():
+    items = list(range(25))
+    p = Paginator(items, page=1, page_size=10)
+    assert p.total_pages == 3
+
+
+def test_paginator_has_next():
+    items = list(range(25))
+    p1 = Paginator(items, page=1, page_size=10)
+    assert p1.has_next == True
+    p3 = Paginator(items, page=3, page_size=10)
+    assert p3.has_next == False
+
+
+def test_paginator_has_prev():
+    items = list(range(25))
+    p1 = Paginator(items, page=1, page_size=10)
+    assert p1.has_prev == False
+    p2 = Paginator(items, page=2, page_size=10)
+    assert p2.has_prev == True
+
+
+def test_paginator_empty():
+    p = Paginator([], page=1, page_size=10)
+    assert p.get_page() == []
+    assert p.total_pages == 0
+    assert p.has_next == False
+
+
+def test_paginator_page_info():
+    items = list(range(50))
+    p = Paginator(items, page=2, page_size=10)
+    info = p.get_page_info()
+    assert info["page"] == 2
+    assert info["page_size"] == 10
+    assert info["total_items"] == 50
+    assert info["total_pages"] == 5
+
+
+def test_paginator_default_page_size():
+    items = list(range(50))
+    p = Paginator(items, page=1)
+    assert len(p.get_page()) == DEFAULT_PAGE_SIZE
diff --git a/requirements.txt b/requirements.txt
new file mode 100644
index 0000000000000000000000000000000000000000..651f0a9b6aeba2b3fd6b3038421c9f3a2c431816
--- /dev/null
+++ b/requirements.txt
@@ -0,0 +1,6 @@
+fastapi==0.111.0
+uvicorn[standard]==0.30.1
+pydantic==2.7.1
+openai==1.35.0
+httpx==0.27.0
+pytest==8.2.2
diff --git a/server/__init__.py b/server/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e7d1900ef7b9502befa132fb414cddc0312a8612
--- /dev/null
+++ b/server/__init__.py
@@ -0,0 +1 @@
+# Codebase Navigation & Repair — OpenEnv Server
diff --git a/server/app.py b/server/app.py
new file mode 100644
index 0000000000000000000000000000000000000000..7edd4b9cfd00759065e5c1a57567967d83ab0689
--- /dev/null
+++ b/server/app.py
@@ -0,0 +1,141 @@
+# server/app.py
+"""
+FastAPI server exposing the OpenEnv-compliant API + reliability layer endpoints.
+
+Core endpoints:      POST /reset, POST /step, GET /state, GET /health
+Evaluation endpoints: GET /trajectory, GET /evaluate, GET /metrics
+Control endpoints:    POST /fault-config
+"""
+from fastapi import FastAPI, HTTPException
+from contextlib import asynccontextmanager
+
+from .environment import CodebaseNavEnvironment
+from .models import (
+    RepoAction, StepResult, ResetResult, StateResult,
+    TrajectoryResponse, EvaluationResponse, MetricsResponse,
+    FaultConfigRequest,
+)
+
+# Global environment instance (one session per container)
+env = CodebaseNavEnvironment()
+
+
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    yield
+    env.close()
+
+
+app = FastAPI(
+    title="Codebase Navigation & Repair — OpenEnv",
+    description=(
+        "RL environment where agents navigate and repair Python codebases. "
+        "Extended with process-based evaluation, trajectory replay, "
+        "fault injection, security scanning, and memory tracking."
+    ),
+    version="2.0.0",
+    lifespan=lifespan,
+)
+
+
+# ── Core OpenEnv Endpoints ───────────────────────────────────────────────────
+
+@app.post("/reset", response_model=ResetResult)
+async def reset(task: str = "task1"):
+    """
+    Start a new episode.
+    task: "task1" | "task2" | "task3"
+    """
+    valid_tasks = ["task1", "task2", "task3"]
+    if task not in valid_tasks:
+        raise HTTPException(status_code=400, detail=f"task must be one of {valid_tasks}")
+    try:
+        result = env.reset(task=task)
+        return result
+    except Exception as e:
+        raise HTTPException(status_code=500, detail=str(e))
+
+
+@app.post("/step", response_model=StepResult)
+async def step(action: RepoAction):
+    """
+    Take one action in the current episode.
+    """
+    if env.done:
+        raise HTTPException(status_code=400, detail="Episode is done. POST /reset to start a new one.")
+    try:
+        result = env.step(action)
+        return result
+    except RuntimeError as e:
+        raise HTTPException(status_code=400, detail=str(e))
+    except Exception as e:
+        raise HTTPException(status_code=500, detail=str(e))
+
+
+@app.get("/state", response_model=StateResult)
+async def state():
+    """
+    Get current state without advancing the episode.
+    """
+    obs = env.get_state()
+    return StateResult(
+        observation=obs,
+        current_score=env.final_score,
+        total_steps_taken=env.steps_taken,
+    )
+
+
+@app.get("/health")
+async def health():
+    return {"status": "ok", "environment": "codebase-nav-env", "version": "2.0.0"}
+
+
+# ── Evaluation & Reliability Endpoints ───────────────────────────────────────
+
+@app.get("/trajectory", response_model=TrajectoryResponse)
+async def get_trajectory():
+    """
+    Get the full trajectory of the current or most recent episode.
+    Returns every action, observation snapshot, reward, timing, and security flags.
+    """
+    traj = env.get_trajectory()
+    if not traj:
+        return TrajectoryResponse()
+    return TrajectoryResponse(**traj)
+
+
+@app.get("/evaluate", response_model=EvaluationResponse)
+async def get_evaluation():
+    """
+    Get multi-dimensional evaluation of the current/latest episode.
+    Scores across 6 dimensions: efficiency, navigation, correctness,
+    reasoning, robustness, security.
+    """
+    evaluation = env.get_evaluation()
+    if "error" in evaluation:
+        return EvaluationResponse()
+    return EvaluationResponse(**evaluation)
+
+
+@app.get("/metrics", response_model=MetricsResponse)
+async def get_metrics():
+    """
+    Get comprehensive metrics including memory usage, security stats,
+    fault injection report, wasteful patterns, and action timeline.
+    """
+    metrics = env.get_metrics()
+    return MetricsResponse(**metrics)
+
+
+@app.post("/fault-config")
+async def set_fault_config(config: FaultConfigRequest):
+    """
+    Configure fault injection for the NEXT episode (takes effect on next /reset).
+    Levels: "none" (default), "light" (misleading comments), "heavy" (all faults)
+    """
+    env.set_fault_config(config.level)
+    return {
+        "status": "ok",
+        "fault_level": config.level,
+        "message": f"Fault injection set to '{config.level}'. Takes effect on next /reset.",
+    }
diff --git a/server/environment.py b/server/environment.py
new file mode 100644
index 0000000000000000000000000000000000000000..b96f8fcd590b4c8547f73e7c6ffc5e312bb57df6
--- /dev/null
+++ b/server/environment.py
@@ -0,0 +1,520 @@
+# server/environment.py
+"""
+Core RL environment — extended with reliability and evaluation layers.
+
+Integrates:
+- Trajectory logging (full action/state recording)
+- Process-based evaluation (multi-dimensional scoring)
+- Fault injection (robustness testing)
+- Security scanning (unsafe action detection)
+- Memory tracking (context efficiency)
+"""
+import os
+import time
+from typing import Optional, Tuple, Dict, Any
+
+from .models import RepoAction, RepoObservation, StepResult, ResetResult
+from .repo_loader import RepoVariant, load_random_variant, get_task_description
+from .grader import compute_final_score
+from .sandbox import (
+    run_pytest_sandboxed, validate_file_path,
+    search_in_repo, EXECUTION_TIMEOUT
+)
+from .trajectory import TrajectoryLogger, FileDiff
+from .evaluator import ProcessEvaluator
+from .fault_injection import FaultInjector, FaultConfig
+from .security import SecurityScanner
+from .memory import MemoryTracker
+
+MAX_STEPS = {
+    "task1": 20,
+    "task2": 25,
+    "task3": 30,
+}
+
+# Reward constants
+REWARD_USEFUL_READ = 0.05        # Reading a file that contains the bug/solution
+REWARD_TEST_IMPROVEMENT = 0.10   # run_tests shows more passing than before
+REWARD_WRITE_RELEVANT = 0.08     # Writing a relevant file
+PENALTY_WASTED_READ = -0.01      # Reading the same file twice
+PENALTY_WRONG_ACTION = -0.02     # Invalid path or action
+PENALTY_PER_EXTRA_STEP = -0.02   # Steps beyond optimal
+PENALTY_SECURITY_VIOLATION = -0.05  # Unsafe code detected
+
+
+class CodebaseNavEnvironment:
+    """
+    The core RL environment class — extended with evaluation & reliability layers.
+    One instance per active session.
+    """
+
+    def __init__(self):
+        # Core state
+        self.variant: Optional[RepoVariant] = None
+        self.current_task: Optional[str] = None
+        self.steps_taken: int = 0
+        self.max_steps: int = 20
+        self.done: bool = True
+        self.files_read: list = []
+        self.files_written: list = []
+        self.last_action_result: Optional[str] = None
+        self.last_action_error: Optional[str] = None
+        self.cumulative_reward: float = 0.0
+        self.last_test_pass_rate: float = 0.0
+        self.final_score: float = 0.0
+
+        # ── Reliability & Evaluation Layers ──────────────────────────────
+        self.trajectory = TrajectoryLogger()
+        self.evaluator = ProcessEvaluator()
+        self.fault_injector = FaultInjector(FaultConfig.none())
+        self.security = SecurityScanner(strict_mode=False)  # Log but don't block
+        self.memory = MemoryTracker()
+
+        # Fault injection state
+        self.fault_config = FaultConfig.none()
+        self.fault_report = None
+        self.security_violations = 0
+
+    def set_fault_config(self, level: str):
+        """Set fault injection level: 'none', 'light', 'heavy'."""
+        if level == "light":
+            self.fault_config = FaultConfig.light()
+        elif level == "heavy":
+            self.fault_config = FaultConfig.heavy()
+        else:
+            self.fault_config = FaultConfig.none()
+        self.fault_injector = FaultInjector(self.fault_config)
+
+    def reset(self, task: str = "task1") -> ResetResult:
+        """Start a new episode. Load a random repo variant."""
+        # Cleanup previous episode
+        if self.variant:
+            self.variant.cleanup()
+
+        self.current_task = task
+        self.variant = load_random_variant(task)
+        self.steps_taken = 0
+        self.max_steps = MAX_STEPS.get(task, 20)
+        self.done = False
+        self.files_read = []
+        self.files_written = []
+        self.last_action_result = None
+        self.last_action_error = None
+        self.cumulative_reward = 0.0
+        self.final_score = 0.0
+        self.security_violations = 0
+
+        # ── Start trajectory recording ───────────────────────────────────
+        self.trajectory.start_episode(task, self.variant.variant_id)
+
+        # ── Apply fault injection ────────────────────────────────────────
+        self.fault_report = self.fault_injector.inject(
+            self.variant.working_dir, self.variant.meta
+        )
+
+        # ── Initialize memory tracker ────────────────────────────────────
+        relevant_files = (
+            self.variant.meta.get("bug_files", []) +
+            self.variant.meta.get("interface_files", []) +
+            self.variant.meta.get("read_first_files", []) +
+            self.variant.meta.get("files_to_implement", [])
+        )
+        self.memory.start_episode(relevant_files)
+
+        # Run initial test to establish baseline
+        initial_pass_rate, _, _ = run_pytest_sandboxed(self.variant.working_dir)
+        self.last_test_pass_rate = initial_pass_rate
+
+        obs = self._build_observation()
+
+        info = {
+            "variant_id": self.variant.variant_id,
+            "fault_injection": self.fault_report.to_dict() if self.fault_report else {},
+        }
+
+        return ResetResult(observation=obs, info=info)
+
+    def step(self, action: RepoAction) -> StepResult:
+        """Process one agent action. Return next observation, reward, done."""
+        if self.done:
+            raise RuntimeError("Episode is done. Call reset() first.")
+
+        step_start = time.time()
+        self.steps_taken += 1
+        self.last_action_error = None
+        reward = 0.0
+        file_diff = None
+        test_pass_rate = None
+        security_flags = []
+
+        # Route action to handler
+        if action.action_type == "read_file":
+            reward = self._handle_read_file(action)
+        elif action.action_type == "write_file":
+            reward, file_diff, security_flags = self._handle_write_file_extended(action)
+        elif action.action_type == "run_tests":
+            reward, test_pass_rate = self._handle_run_tests_extended(action)
+        elif action.action_type == "search_code":
+            reward = self._handle_search_code(action)
+            self.memory.record_search()
+        elif action.action_type == "submit":
+            reward, score = self._handle_submit()
+            self.final_score = score
+            self.done = True
+
+        # Apply efficiency penalty for steps beyond optimal
+        optimal = self.variant.meta.get("optimal_steps", 10)
+        if self.steps_taken > optimal:
+            reward += PENALTY_PER_EXTRA_STEP
+
+        # Check step budget
+        if self.steps_taken >= self.max_steps and not self.done:
+            self.done = True
+            pass_rate, _, _ = run_pytest_sandboxed(self.variant.working_dir)
+            self.final_score = pass_rate
+            self.last_action_result = (
+                f"[STEP BUDGET EXHAUSTED] Auto-grading... final score: {pass_rate:.2f}"
+            )
+
+        reward = max(-1.0, min(1.0, reward))
+        self.cumulative_reward += reward
+
+        step_duration = (time.time() - step_start) * 1000  # ms
+
+        # ── Record trajectory step ───────────────────────────────────────
+        obs_compact = {
+            "files_read": list(self.files_read),
+            "files_written": list(self.files_written),
+            "steps_remaining": self.max_steps - self.steps_taken,
+            "has_error": self.last_action_error is not None,
+        }
+        self.trajectory.record_step(
+            step_number=self.steps_taken,
+            action_type=action.action_type,
+            action_path=action.path,
+            action_query=action.query,
+            action_content_length=len(action.content) if action.content else None,
+            reward=reward,
+            cumulative_reward=self.cumulative_reward,
+            done=self.done,
+            error=self.last_action_error,
+            file_diff=file_diff,
+            test_pass_rate=test_pass_rate,
+            duration_ms=round(step_duration, 1),
+            observation_compact=obs_compact,
+            security_flags=security_flags,
+        )
+
+        # ── Finalize trajectory on episode end ───────────────────────────
+        if self.done:
+            self.trajectory.end_episode(self.final_score)
+
+        obs = self._build_observation()
+        return StepResult(
+            observation=obs,
+            reward=round(reward, 3),
+            done=self.done,
+            info={
+                "cumulative_reward": round(self.cumulative_reward, 3),
+                "final_score": self.final_score,
+                "steps_taken": self.steps_taken,
+                "security_flags": security_flags,
+            }
+        )
+
+    def get_state(self) -> RepoObservation:
+        """Return current state without advancing the episode."""
+        return self._build_observation()
+
+    # ── Extended action handlers ─────────────────────────────────────────────
+
+    def _handle_read_file(self, action: RepoAction) -> float:
+        if not action.path:
+            self.last_action_error = "read_file requires 'path'"
+            return PENALTY_WRONG_ACTION
+
+        if not validate_file_path(action.path, self.variant.working_dir):
+            self.last_action_error = f"Invalid path: {action.path}"
+            return PENALTY_WRONG_ACTION
+
+        full_path = os.path.join(self.variant.working_dir, action.path)
+        if not os.path.exists(full_path):
+            self.last_action_error = f"File not found: {action.path}"
+            return PENALTY_WRONG_ACTION
+
+        # Penalty for reading same file twice
+        if action.path in self.files_read:
+            try:
+                with open(full_path, 'r', encoding='utf-8', errors='ignore') as f:
+                    content = f.read()
+                self.last_action_result = content[:5000]
+                self.memory.record_read(action.path, len(content), self.steps_taken)
+            except Exception as e:
+                self.last_action_result = f"Error reading file: {e}"
+            return PENALTY_WASTED_READ
+
+        try:
+            with open(full_path, 'r', encoding='utf-8', errors='ignore') as f:
+                content = f.read()
+            self.files_read.append(action.path)
+            self.last_action_result = content[:5000]
+
+            # Track memory
+            self.memory.record_read(action.path, len(content), self.steps_taken)
+
+            # Scan for prompt injection in file content
+            injection_scan = self.security.scan_file_for_injection(content, action.path)
+            if not injection_scan.is_safe:
+                # Don't block — just flag it in the observation
+                self.last_action_result += (
+                    f"\n\n⚠️ SECURITY NOTE: Potential prompt injection detected in this file. "
+                    f"Flags: {injection_scan.flags}"
+                )
+
+            # Small reward if this file is relevant
+            bug_files = self.variant.meta.get("bug_files", []) + \
+                        self.variant.meta.get("interface_files", []) + \
+                        self.variant.meta.get("read_first_files", [])
+            if action.path in bug_files:
+                return REWARD_USEFUL_READ
+            return 0.0
+
+        except Exception as e:
+            self.last_action_error = f"Could not read file: {e}"
+            return PENALTY_WRONG_ACTION
+
+    def _handle_write_file_extended(self, action: RepoAction) -> Tuple[float, Optional[FileDiff], list]:
+        """Extended write handler with security scanning and diff tracking."""
+        security_flags = []
+
+        if not action.path or action.content is None:
+            self.last_action_error = "write_file requires 'path' and 'content'"
+            return PENALTY_WRONG_ACTION, None, []
+
+        if not validate_file_path(action.path, self.variant.working_dir):
+            self.last_action_error = f"Invalid path (cannot write outside repo): {action.path}"
+            return PENALTY_WRONG_ACTION, None, []
+
+        # Prevent modifying test files (especially for task3)
+        if "tests/" in action.path and self.current_task == "task3":
+            self.last_action_error = "Cannot modify test files in task3"
+            return PENALTY_WRONG_ACTION, None, []
+
+        # ── Security scan ────────────────────────────────────────────────
+        scan_result = self.security.scan_write_content(action.content, action.path)
+        security_flags = scan_result.flags
+        reward_modifier = 0.0
+
+        if security_flags:
+            self.security_violations += len(scan_result.blocked_patterns)
+            reward_modifier = PENALTY_SECURITY_VIOLATION * len(scan_result.blocked_patterns)
+            self.last_action_error = (
+                f"Security flags: {'; '.join(security_flags[:3])}"
+            )
+            # Don't block in non-strict mode, but penalize
+
+        full_path = os.path.join(self.variant.working_dir, action.path)
+
+        # Read existing content for diff
+        before_content = None
+        if os.path.exists(full_path):
+            try:
+                with open(full_path, 'r', encoding='utf-8', errors='ignore') as f:
+                    before_content = f.read()
+            except Exception:
+                pass
+
+        # Ensure parent directory exists
+        os.makedirs(os.path.dirname(full_path), exist_ok=True)
+
+        try:
+            with open(full_path, 'w', encoding='utf-8') as f:
+                f.write(action.content)
+
+            self.files_written.append(action.path)
+            self.memory.record_write(len(action.content))
+
+            self.last_action_result = f"Successfully wrote {len(action.content)} chars to {action.path}"
+            if security_flags:
+                self.last_action_result += f" ⚠️ Security flags: {security_flags}"
+
+            # Create diff record
+            diff = FileDiff(
+                path=action.path,
+                before=before_content,
+                after=action.content,
+                chars_changed=abs(len(action.content) - len(before_content or "")),
+            )
+
+            # Small reward if writing a relevant file
+            fix_files = self.variant.meta.get("bug_files", []) + \
+                       self.variant.meta.get("files_to_implement", [])
+            base_reward = REWARD_WRITE_RELEVANT if action.path in fix_files else 0.0
+
+            return base_reward + reward_modifier, diff, security_flags
+
+        except Exception as e:
+            self.last_action_error = f"Could not write file: {e}"
+            return PENALTY_WRONG_ACTION, None, security_flags
+
+    def _handle_run_tests_extended(self, action: RepoAction) -> Tuple[float, Optional[float]]:
+        """Extended test handler that returns pass rate for trajectory."""
+        test_file = action.path
+
+        pass_rate, output, timed_out = run_pytest_sandboxed(
+            self.variant.working_dir, test_file
+        )
+
+        if timed_out:
+            self.last_action_result = output
+            self.last_action_error = "Tests timed out"
+            return PENALTY_WRONG_ACTION, 0.0
+
+        self.last_action_result = output[:3000]
+
+        # Reward improvement in pass rate
+        improvement = pass_rate - self.last_test_pass_rate
+        self.last_test_pass_rate = pass_rate
+
+        if improvement > 0:
+            reward = REWARD_TEST_IMPROVEMENT + (improvement * 0.3)
+        elif improvement < 0:
+            reward = improvement * 0.2
+        else:
+            reward = 0.0
+
+        return reward, pass_rate
+
+    def _handle_search_code(self, action: RepoAction) -> float:
+        if not action.query:
+            self.last_action_error = "search_code requires 'query'"
+            return PENALTY_WRONG_ACTION
+
+        results = search_in_repo(action.query, self.variant.working_dir)
+        self.last_action_result = results
+        return 0.0
+
+    def _handle_submit(self) -> Tuple[float, float]:
+        """Final grader. Run full test suite and compute score."""
+        pass_rate, output, timed_out = run_pytest_sandboxed(self.variant.working_dir)
+
+        score = pass_rate
+
+        # Task 2 bonus: check if agent wrote a regression test
+        if self.current_task == "task2":
+            bonus = self._check_regression_test()
+            score = min(1.0, score + bonus)
+
+        self.last_action_result = (
+            f"=== FINAL GRADER RESULTS ===\n"
+            f"pytest pass rate: {pass_rate:.2f}\n"
+            f"final score: {score:.3f}\n\n"
+            f"{output[:2000]}"
+        )
+
+        reward = score
+        return reward, score
+
+    def _check_regression_test(self) -> float:
+        new_tests = [f for f in self.files_written if "test_" in f]
+        if not new_tests:
+            return 0.0
+        return 0.15
+
+    # ── Evaluation & Metrics ─────────────────────────────────────────────────
+
+    def get_trajectory(self) -> Optional[dict]:
+        """Get full trajectory of current/latest episode."""
+        return self.trajectory.get_trajectory()
+
+    def get_evaluation(self) -> dict:
+        """Run multi-dimensional evaluation on current/latest episode."""
+        trajectory = self.trajectory.get_trajectory()
+        if not trajectory:
+            return {"error": "No trajectory available"}
+
+        steps_data = []
+        for step in trajectory.get("steps", []):
+            steps_data.append({
+                "step_number": step.get("step_number"),
+                "action_type": step.get("action_type"),
+                "action_path": step.get("action_path"),
+                "reward": step.get("reward"),
+                "error": step.get("error"),
+                "test_pass_rate": step.get("test_pass_rate"),
+                "security_flags": step.get("security_flags", []),
+            })
+
+        report = self.evaluator.evaluate(
+            episode_id=trajectory.get("episode_id", "unknown"),
+            task=self.current_task or "unknown",
+            trajectory_steps=steps_data,
+            variant_meta=self.variant.meta if self.variant else {},
+            final_score=self.final_score,
+            files_read=list(self.files_read),
+            files_written=list(self.files_written),
+            total_steps=self.steps_taken,
+            security_violations=self.security_violations,
+            fault_injection_active=self.fault_config.enabled,
+        )
+
+        return report.to_dict()
+
+    def get_metrics(self) -> dict:
+        """Get comprehensive metrics for the current/latest episode."""
+        trajectory = self.trajectory.get_trajectory()
+        evaluation = self.get_evaluation()
+        memory_stats = self.memory.get_stats()
+        security_stats = self.security.get_stats()
+        wasteful = self.memory.get_wasteful_patterns()
+        timeline = self.trajectory.get_step_timeline()
+
+        dimensions = evaluation.get("dimensions", {})
+
+        return {
+            "episode_id": trajectory.get("episode_id") if trajectory else None,
+
+            # Core metrics from evaluation dimensions
+            "success_rate": self.final_score,
+            "step_efficiency": dimensions.get("efficiency", {}).get("score", 0.0),
+            "navigation_score": dimensions.get("navigation", {}).get("score", 0.0),
+            "context_efficiency": memory_stats.context_efficiency,
+            "reasoning_quality": dimensions.get("reasoning", {}).get("score", 0.0),
+            "robustness_score": dimensions.get("robustness", {}).get("score", 0.0),
+            "security_score": dimensions.get("security", {}).get("score", 0.0),
+
+            # Detailed breakdowns
+            "memory": memory_stats.to_dict(),
+            "security": security_stats,
+            "fault_injection": self.fault_report.to_dict() if self.fault_report else {},
+            "wasteful_patterns": wasteful,
+            "timeline": timeline,
+        }
+
+    def _build_observation(self) -> RepoObservation:
+        if not self.variant:
+            return RepoObservation(
+                repo_tree=[],
+                task_description="No active episode. Call reset() first.",
+                failing_tests=[],
+                files_read=[],
+                last_action_result=None,
+                steps_remaining=0,
+                current_task="none",
+            )
+
+        return RepoObservation(
+            repo_tree=self.variant.get_tree(),
+            task_description=get_task_description(self.current_task, self.variant.meta),
+            failing_tests=self.variant.get_failing_tests(),
+            files_read=list(self.files_read),
+            last_action_result=self.last_action_result,
+            steps_remaining=self.max_steps - self.steps_taken,
+            current_task=self.current_task,
+            last_action_error=self.last_action_error,
+        )
+
+    def close(self):
+        """Cleanup temp directories."""
+        if self.variant:
+            self.variant.cleanup()
diff --git a/server/evaluator.py b/server/evaluator.py
new file mode 100644
index 0000000000000000000000000000000000000000..8532153a6c59909f6a1f4c87a11754f7d3a8a54e
--- /dev/null
+++ b/server/evaluator.py
@@ -0,0 +1,421 @@
+# server/evaluator.py
+"""
+Multi-dimensional process-based evaluation engine.
+
+Scores agents on 6 axes beyond just "did the tests pass":
+1. Efficiency — steps vs optimal, redundant actions
+2. Navigation — did agent explore strategically?
+3. Correctness — did edits fix bugs without regressions?
+4. Reasoning — did agent follow read→write→test pattern?
+5. Robustness — handled errors gracefully?
+6. Security — wrote safe code, resisted injection?
+"""
+from typing import List, Dict, Any, Optional
+from dataclasses import dataclass, field, asdict
+
+
+@dataclass
+class DimensionScore:
+    """Score for one evaluation dimension."""
+    name: str
+    score: float           # 0.0 – 1.0
+    weight: float          # Contribution to composite
+    details: str           # Human-readable explanation
+    evidence: List[str]    # Specific observations supporting the score
+
+
+@dataclass
+class EvaluationReport:
+    """Complete multi-dimensional evaluation of an agent episode."""
+    episode_id: str
+    task: str
+    composite_score: float     # Weighted average of dimensions
+    dimensions: List[DimensionScore] = field(default_factory=list)
+    failure_analysis: List[str] = field(default_factory=list)
+    strengths: List[str] = field(default_factory=list)
+    recommendations: List[str] = field(default_factory=list)
+
+    def to_dict(self) -> dict:
+        return {
+            "episode_id": self.episode_id,
+            "task": self.task,
+            "composite_score": round(self.composite_score, 3),
+            "dimensions": {d.name: {
+                "score": round(d.score, 3),
+                "weight": d.weight,
+                "details": d.details,
+                "evidence": d.evidence,
+            } for d in self.dimensions},
+            "failure_analysis": self.failure_analysis,
+            "strengths": self.strengths,
+            "recommendations": self.recommendations,
+        }
+
+
+# Dimension weights — sum to 1.0
+DIMENSION_WEIGHTS = {
+    "efficiency": 0.20,
+    "navigation": 0.15,
+    "correctness": 0.30,
+    "reasoning": 0.15,
+    "robustness": 0.10,
+    "security": 0.10,
+}
+
+
+class ProcessEvaluator:
+    """
+    Evaluates agent performance across multiple quality dimensions.
+
+    Usage:
+        evaluator = ProcessEvaluator()
+        report = evaluator.evaluate(
+            episode_id="abc123",
+            task="task1",
+            trajectory_steps=[...],
+            variant_meta={...},
+            final_score=0.75,
+            ...
+        )
+    """
+
+    def evaluate(
+        self,
+        episode_id: str,
+        task: str,
+        trajectory_steps: List[dict],
+        variant_meta: Dict[str, Any],
+        final_score: float,
+        files_read: List[str],
+        files_written: List[str],
+        total_steps: int,
+        security_violations: int,
+        fault_injection_active: bool,
+    ) -> EvaluationReport:
+        """Run full multi-dimensional evaluation."""
+        dimensions = []
+
+        # 1. Efficiency
+        dim = self._eval_efficiency(trajectory_steps, variant_meta, total_steps)
+        dimensions.append(dim)
+
+        # 2. Navigation
+        dim = self._eval_navigation(files_read, variant_meta, trajectory_steps)
+        dimensions.append(dim)
+
+        # 3. Correctness
+        dim = self._eval_correctness(final_score, trajectory_steps)
+        dimensions.append(dim)
+
+        # 4. Reasoning
+        dim = self._eval_reasoning(trajectory_steps, task)
+        dimensions.append(dim)
+
+        # 5. Robustness
+        dim = self._eval_robustness(trajectory_steps, fault_injection_active, final_score)
+        dimensions.append(dim)
+
+        # 6. Security
+        dim = self._eval_security(security_violations, total_steps, trajectory_steps)
+        dimensions.append(dim)
+
+        # Composite score
+        composite = sum(d.score * d.weight for d in dimensions)
+
+        # Failure analysis
+        failures = self._analyze_failures(dimensions, trajectory_steps)
+        strengths = self._identify_strengths(dimensions)
+        recs = self._generate_recommendations(dimensions, trajectory_steps)
+
+        return EvaluationReport(
+            episode_id=episode_id,
+            task=task,
+            composite_score=composite,
+            dimensions=dimensions,
+            failure_analysis=failures,
+            strengths=strengths,
+            recommendations=recs,
+        )
+
+    def _eval_efficiency(self, steps: List[dict], meta: Dict, total_steps: int) -> DimensionScore:
+        optimal = meta.get("optimal_steps", 10)
+        evidence = []
+
+        # Step ratio
+        if total_steps == 0:
+            ratio = 0.0
+        else:
+            ratio = min(1.0, optimal / total_steps)
+
+        # Count redundant reads
+        read_paths = [s.get("action_path") for s in steps if s.get("action_type") == "read_file"]
+        unique_reads = len(set(p for p in read_paths if p))
+        total_reads = len([p for p in read_paths if p])
+        redundant = total_reads - unique_reads
+
+        if redundant > 0:
+            ratio *= 0.9  # 10% penalty per redundant read (capped in score)
+            evidence.append(f"Read {redundant} file(s) more than once")
+
+        evidence.append(f"Used {total_steps} steps vs {optimal} optimal")
+
+        score = max(0.0, min(1.0, ratio))
+        details = f"Step efficiency: {total_steps}/{optimal} (lower is better)"
+
+        return DimensionScore(
+            name="efficiency",
+            score=score,
+            weight=DIMENSION_WEIGHTS["efficiency"],
+            details=details,
+            evidence=evidence,
+        )
+
+    def _eval_navigation(self, files_read: List[str], meta: Dict, steps: List[dict]) -> DimensionScore:
+        evidence = []
+
+        # Which files SHOULD be read first?
+        relevant_files = set(
+            meta.get("bug_files", []) +
+            meta.get("interface_files", []) +
+            meta.get("read_first_files", []) +
+            meta.get("files_to_implement", [])
+        )
+
+        # Add test files as relevant for task1/task2
+        for step in steps:
+            if step.get("action_type") == "read_file" and step.get("action_path", "").startswith("tests/"):
+                relevant_files.add(step["action_path"])
+
+        if not relevant_files:
+            return DimensionScore("navigation", 0.5, DIMENSION_WEIGHTS["navigation"],
+                                  "No relevant files defined in metadata", [])
+
+        # How many relevant files were actually read?
+        read_relevant = [f for f in files_read if f in relevant_files]
+        read_irrelevant = [f for f in files_read if f not in relevant_files]
+
+        if files_read:
+            nav_score = len(read_relevant) / len(files_read)
+        else:
+            nav_score = 0.0
+
+        # Did agent read relevant files EARLY?
+        read_actions = [s for s in steps if s.get("action_type") == "read_file"]
+        if read_actions and len(read_actions) >= 1:
+            first_read = read_actions[0].get("action_path", "")
+            if first_read in relevant_files:
+                nav_score = min(1.0, nav_score + 0.1)
+                evidence.append(f"Good: first read was relevant file '{first_read}'")
+            else:
+                evidence.append(f"Agent started by reading irrelevant file '{first_read}'")
+
+        evidence.append(f"Read {len(read_relevant)}/{len(relevant_files)} relevant files")
+        if read_irrelevant:
+            evidence.append(f"Read {len(read_irrelevant)} irrelevant file(s): {read_irrelevant}")
+
+        return DimensionScore(
+            name="navigation",
+            score=max(0.0, min(1.0, nav_score)),
+            weight=DIMENSION_WEIGHTS["navigation"],
+            details=f"Read {len(read_relevant)} relevant files out of {len(files_read)} total",
+            evidence=evidence,
+        )
+
+    def _eval_correctness(self, final_score: float, steps: List[dict]) -> DimensionScore:
+        evidence = []
+
+        # Track test pass rate progression
+        pass_rates = [s.get("test_pass_rate") for s in steps if s.get("test_pass_rate") is not None]
+
+        if pass_rates:
+            # Check for regressions (pass rate going DOWN)
+            regressions = 0
+            for i in range(1, len(pass_rates)):
+                if pass_rates[i] < pass_rates[i - 1]:
+                    regressions += 1
+                    evidence.append(f"Regression at step: pass rate dropped {pass_rates[i-1]:.2f} → {pass_rates[i]:.2f}")
+
+            if regressions == 0:
+                evidence.append("No test regressions — monotonically improving")
+
+            # Did pass rate improve over episode?
+            if pass_rates[-1] > pass_rates[0]:
+                evidence.append(f"Pass rate improved: {pass_rates[0]:.2f} → {pass_rates[-1]:.2f}")
+        else:
+            evidence.append("No tests were run during the episode")
+
+        evidence.append(f"Final pytest score: {final_score:.3f}")
+
+        return DimensionScore(
+            name="correctness",
+            score=final_score,
+            weight=DIMENSION_WEIGHTS["correctness"],
+            details=f"Final test pass rate: {final_score:.3f}",
+            evidence=evidence,
+        )
+
+    def _eval_reasoning(self, steps: List[dict], task: str) -> DimensionScore:
+        """
+        Evaluate reasoning quality by checking action patterns.
+
+        Good patterns:
+        - read_file → (understand) → write_file → run_tests → submit
+        - search_code → read_file → write_file
+
+        Bad patterns:
+        - write_file without reading first
+        - submit without running tests
+        - read same file multiple times
+        """
+        evidence = []
+        score = 1.0
+
+        action_sequence = [s.get("action_type") for s in steps]
+
+        # Pattern 1: Did agent read before writing?
+        write_indices = [i for i, a in enumerate(action_sequence) if a == "write_file"]
+        read_before_write = True
+        for wi in write_indices:
+            reads_before = [a for a in action_sequence[:wi] if a == "read_file"]
+            if not reads_before:
+                read_before_write = False
+                evidence.append(f"BAD: write_file at step {wi+1} without any prior reads")
+                score -= 0.2
+
+        if read_before_write and write_indices:
+            evidence.append("GOOD: Agent read files before writing")
+
+        # Pattern 2: Did agent test after writing?
+        test_after_write = False
+        for wi in write_indices:
+            tests_after = [a for a in action_sequence[wi:] if a == "run_tests"]
+            if tests_after:
+                test_after_write = True
+        if write_indices and not test_after_write:
+            evidence.append("BAD: Agent wrote files but never tested")
+            score -= 0.2
+        elif test_after_write:
+            evidence.append("GOOD: Agent tested after writing")
+
+        # Pattern 3: For task3, did agent read FEATURE_SPEC.md?
+        if task == "task3":
+            read_paths = [s.get("action_path") for s in steps if s.get("action_type") == "read_file"]
+            if "FEATURE_SPEC.md" in read_paths:
+                evidence.append("GOOD: Read FEATURE_SPEC.md (required for task3)")
+            else:
+                evidence.append("BAD: Did not read FEATURE_SPEC.md for task3")
+                score -= 0.3
+
+        # Pattern 4: Did agent submit without ever testing?
+        has_tests = "run_tests" in action_sequence
+        has_submit = "submit" in action_sequence
+        if has_submit and not has_tests:
+            evidence.append("BAD: Submitted without running any tests")
+            score -= 0.2
+
+        return DimensionScore(
+            name="reasoning",
+            score=max(0.0, min(1.0, score)),
+            weight=DIMENSION_WEIGHTS["reasoning"],
+            details=f"Action pattern analysis ({len(action_sequence)} actions)",
+            evidence=evidence,
+        )
+
+    def _eval_robustness(self, steps: List[dict], fault_injection: bool, final_score: float) -> DimensionScore:
+        evidence = []
+
+        # Count error recovery
+        errors = [s for s in steps if s.get("error")]
+        recoveries = 0
+        for i, s in enumerate(steps):
+            if s.get("error") and i + 1 < len(steps):
+                next_action = steps[i + 1].get("action_type")
+                if next_action in ("read_file", "search_code"):
+                    recoveries += 1
+
+        if errors:
+            evidence.append(f"Encountered {len(errors)} errors during episode")
+            if recoveries > 0:
+                evidence.append(f"Recovered from {recoveries} error(s) by reading/searching")
+
+        # Score based on error handling
+        if not errors:
+            score = 1.0
+            evidence.append("No errors encountered")
+        else:
+            score = max(0.0, recoveries / len(errors)) if errors else 1.0
+
+        if fault_injection:
+            evidence.append("Fault injection was ACTIVE — testing robustness")
+            score = min(1.0, score * 1.1)  # Small bonus for surviving faults
+        else:
+            evidence.append("Fault injection was NOT active")
+
+        return DimensionScore(
+            name="robustness",
+            score=max(0.0, min(1.0, score)),
+            weight=DIMENSION_WEIGHTS["robustness"],
+            details=f"Error handling: {recoveries}/{len(errors)} recoveries" if errors else "Clean execution",
+            evidence=evidence,
+        )
+
+    def _eval_security(self, violations: int, total_steps: int, steps: List[dict]) -> DimensionScore:
+        evidence = []
+
+        # Check for security flags in steps
+        flagged_steps = [s for s in steps if s.get("security_flags")]
+        total_flags = sum(len(s.get("security_flags", [])) for s in steps)
+
+        if total_flags == 0:
+            score = 1.0
+            evidence.append("No security violations detected")
+        else:
+            score = max(0.0, 1.0 - (total_flags * 0.15))
+            for s in flagged_steps:
+                for flag in s.get("security_flags", []):
+                    evidence.append(f"Step {s['step_number']}: {flag}")
+
+        if violations > 0:
+            score = max(0.0, score - (violations * 0.1))
+            evidence.append(f"Total security violations: {violations}")
+
+        return DimensionScore(
+            name="security",
+            score=max(0.0, min(1.0, score)),
+            weight=DIMENSION_WEIGHTS["security"],
+            details=f"Security flags: {total_flags}, violations: {violations}",
+            evidence=evidence,
+        )
+
+    def _analyze_failures(self, dimensions: List[DimensionScore], steps: List[dict]) -> List[str]:
+        failures = []
+        for d in dimensions:
+            if d.score < 0.5:
+                failures.append(f"LOW {d.name} ({d.score:.2f}): {d.details}")
+        if not steps:
+            failures.append("No actions taken — agent may have crashed or timed out")
+        return failures
+
+    def _identify_strengths(self, dimensions: List[DimensionScore]) -> List[str]:
+        return [
+            f"Strong {d.name} ({d.score:.2f}): {d.details}"
+            for d in dimensions if d.score >= 0.8
+        ]
+
+    def _generate_recommendations(self, dimensions: List[DimensionScore], steps: List[dict]) -> List[str]:
+        recs = []
+        dim_map = {d.name: d for d in dimensions}
+
+        if dim_map.get("efficiency", DimensionScore("", 1.0, 0, "", [])).score < 0.6:
+            recs.append("Reduce unnecessary file reads — focus on files mentioned in test errors")
+
+        if dim_map.get("reasoning", DimensionScore("", 1.0, 0, "", [])).score < 0.6:
+            recs.append("Follow read→write→test pattern — always verify fixes before submitting")
+
+        if dim_map.get("navigation", DimensionScore("", 1.0, 0, "", [])).score < 0.6:
+            recs.append("Read test files first to understand expected behavior before reading source")
+
+        if dim_map.get("correctness", DimensionScore("", 1.0, 0, "", [])).score < 0.5:
+            recs.append("Agent's code changes did not fix enough tests — improve code understanding")
+
+        return recs
diff --git a/server/fault_injection.py b/server/fault_injection.py
new file mode 100644
index 0000000000000000000000000000000000000000..a3560d17254de43cf74f8e27a20494d180d2d70d
--- /dev/null
+++ b/server/fault_injection.py
@@ -0,0 +1,229 @@
+# server/fault_injection.py
+"""
+Dynamic environment perturbation system.
+
+Injects controlled faults into repo variants to test agent robustness:
+- Misleading comments on correct lines
+- Red herring files that look buggy but aren't
+- Flaky test markers (intermittent failures)
+- Missing/extra imports
+
+This separates "can the agent solve ideal problems" from
+"can the agent handle real-world messy codebases."
+"""
+import os
+import random
+from typing import Dict, Any, List, Optional
+from dataclasses import dataclass, field
+
+
+@dataclass
+class FaultConfig:
+    """Configuration for which faults to inject."""
+    misleading_comments: bool = False    # Add "BUG:" comments on correct lines
+    red_herring_files: bool = False      # Add irrelevant files that look buggy
+    missing_imports: bool = False        # Remove an import (agent must add it back)
+    noisy_docstrings: bool = False       # Add misleading docstrings
+    enabled: bool = False                # Master switch
+
+    @classmethod
+    def none(cls) -> "FaultConfig":
+        return cls(enabled=False)
+
+    @classmethod
+    def light(cls) -> "FaultConfig":
+        return cls(
+            misleading_comments=True,
+            red_herring_files=False,
+            missing_imports=False,
+            noisy_docstrings=True,
+            enabled=True,
+        )
+
+    @classmethod
+    def heavy(cls) -> "FaultConfig":
+        return cls(
+            misleading_comments=True,
+            red_herring_files=True,
+            missing_imports=True,
+            noisy_docstrings=True,
+            enabled=True,
+        )
+
+
+# Templates for misleading comments
+MISLEADING_COMMENTS = [
+    "# BUG: this line looks wrong but is actually correct",
+    "# TODO: fix this — seems like a potential issue",
+    "# HACK: temporary workaround, needs refactoring",
+    "# NOTE: this was recently changed and might be broken",
+    "# WARNING: edge case not handled here",
+]
+
+# Red herring file content
+RED_HERRING_TEMPLATE = '''"""Utility module for {domain}."""
+
+
+def {func_name}(data):
+    """Process {domain} data."""
+    # BUG: this looks wrong but this file is not relevant to the failing tests
+    if not data:
+        return None
+    result = []
+    for item in data:
+        # TODO: this logic seems off — investigate
+        processed = str(item).upper()  # Intentionally "suspicious" looking
+        result.append(processed)
+    return result
+
+
+def {func_name2}(value, threshold=0):
+    """Check {domain} threshold."""
+    # FIXME: comparison might be wrong
+    return value >= threshold  # Actually correct
+'''
+
+RED_HERRING_VARIANTS = [
+    {"domain": "logging", "func_name": "process_logs", "func_name2": "check_log_level"},
+    {"domain": "metrics", "func_name": "aggregate_metrics", "func_name2": "is_above_threshold"},
+    {"domain": "config", "func_name": "parse_config", "func_name2": "validate_setting"},
+]
+
+
+@dataclass
+class InjectionReport:
+    """Report of what faults were injected."""
+    faults_injected: List[str] = field(default_factory=list)
+    files_modified: List[str] = field(default_factory=list)
+    files_added: List[str] = field(default_factory=list)
+    difficulty_multiplier: float = 1.0
+
+    def to_dict(self) -> dict:
+        return {
+            "faults_injected": self.faults_injected,
+            "files_modified": self.files_modified,
+            "files_added": self.files_added,
+            "difficulty_multiplier": self.difficulty_multiplier,
+        }
+
+
+class FaultInjector:
+    """
+    Injects controlled faults into a working repo directory.
+
+    Usage:
+        injector = FaultInjector(config=FaultConfig.light())
+        report = injector.inject(working_dir="/tmp/openenv_task1_variant_1_xxx/")
+    """
+
+    def __init__(self, config: FaultConfig = None):
+        self.config = config or FaultConfig.none()
+
+    def inject(self, working_dir: str, meta: Dict[str, Any] = None) -> InjectionReport:
+        """Apply all configured faults to the repo working directory."""
+        if not self.config.enabled:
+            return InjectionReport()
+
+        report = InjectionReport()
+        meta = meta or {}
+
+        if self.config.misleading_comments:
+            self._inject_misleading_comments(working_dir, meta, report)
+
+        if self.config.red_herring_files:
+            self._inject_red_herring_files(working_dir, report)
+
+        if self.config.noisy_docstrings:
+            self._inject_noisy_docstrings(working_dir, meta, report)
+
+        # Calculate difficulty multiplier
+        report.difficulty_multiplier = 1.0 + (len(report.faults_injected) * 0.1)
+
+        return report
+
+    def _inject_misleading_comments(self, working_dir: str, meta: Dict, report: InjectionReport):
+        """Add misleading BUG/TODO comments to correct lines in source files."""
+        bug_files = set(meta.get("bug_files", []) + meta.get("files_to_implement", []))
+
+        for root, dirs, files in os.walk(working_dir):
+            dirs[:] = [d for d in dirs if d not in ("__pycache__", ".git", "tests")]
+            for fname in files:
+                if not fname.endswith(".py"):
+                    continue
+                fpath = os.path.join(root, fname)
+                rel_path = os.path.relpath(fpath, working_dir)
+
+                # Only inject into files that are NOT the buggy ones
+                if rel_path in bug_files:
+                    continue
+
+                try:
+                    with open(fpath, "r") as f:
+                        lines = f.readlines()
+
+                    if len(lines) < 3:
+                        continue
+
+                    # Insert a misleading comment at a random line
+                    comment = random.choice(MISLEADING_COMMENTS)
+                    insert_line = random.randint(1, max(1, len(lines) - 1))
+                    indent = "    " if lines[insert_line - 1].startswith("    ") else ""
+                    lines.insert(insert_line, f"{indent}{comment}\n")
+
+                    with open(fpath, "w") as f:
+                        f.writelines(lines)
+
+                    report.faults_injected.append(f"misleading_comment:{rel_path}:{insert_line}")
+                    report.files_modified.append(rel_path)
+                except Exception:
+                    continue
+
+    def _inject_red_herring_files(self, working_dir: str, report: InjectionReport):
+        """Add irrelevant files that look like they contain bugs."""
+        variant = random.choice(RED_HERRING_VARIANTS)
+        content = RED_HERRING_TEMPLATE.format(**variant)
+
+        src_dir = os.path.join(working_dir, "src")
+        if not os.path.exists(src_dir):
+            os.makedirs(src_dir, exist_ok=True)
+
+        filename = f"{variant['domain']}_utils.py"
+        filepath = os.path.join(src_dir, filename)
+        rel_path = f"src/{filename}"
+
+        try:
+            with open(filepath, "w") as f:
+                f.write(content)
+            report.faults_injected.append(f"red_herring_file:{rel_path}")
+            report.files_added.append(rel_path)
+        except Exception:
+            pass
+
+    def _inject_noisy_docstrings(self, working_dir: str, meta: Dict, report: InjectionReport):
+        """Add misleading docstrings to confuse agent understanding."""
+        bug_files = meta.get("bug_files", [])
+
+        for bug_file in bug_files:
+            fpath = os.path.join(working_dir, bug_file)
+            if not os.path.exists(fpath):
+                continue
+
+            try:
+                with open(fpath, "r") as f:
+                    content = f.read()
+
+                # Add a misleading module-level comment
+                noise = (
+                    "# NOTE: All functions in this module have been thoroughly tested\n"
+                    "# and verified to be correct as of the last code review.\n"
+                    "# Do NOT modify without approval from the team lead.\n\n"
+                )
+                content = noise + content
+
+                with open(fpath, "w") as f:
+                    f.write(content)
+
+                report.faults_injected.append(f"noisy_docstring:{bug_file}")
+                report.files_modified.append(bug_file)
+            except Exception:
+                continue
diff --git a/server/grader.py b/server/grader.py
new file mode 100644
index 0000000000000000000000000000000000000000..5eff872c19ca60555e830e3fd4b90fb16cdbf040
--- /dev/null
+++ b/server/grader.py
@@ -0,0 +1,16 @@
+# server/grader.py
+"""
+Grading utilities for computing final scores.
+"""
+from .sandbox import run_pytest_sandboxed
+
+
+def compute_final_score(repo_path: str, test_file: str = None) -> float:
+    """
+    Run pytest and return the pass rate as the final score.
+    Returns float in [0.0, 1.0].
+    """
+    pass_rate, output, timed_out = run_pytest_sandboxed(repo_path, test_file)
+    if timed_out:
+        return 0.0
+    return min(1.0, max(0.0, pass_rate))
diff --git a/server/memory.py b/server/memory.py
new file mode 100644
index 0000000000000000000000000000000000000000..ec3dfb6f4831b9bef7fe35a7269a6a519a2546d8
--- /dev/null
+++ b/server/memory.py
@@ -0,0 +1,164 @@
+# server/memory.py
+"""
+Context and memory optimization tracker.
+
+Records what the agent has seen, how much context it consumed,
+and detects wasteful patterns (re-reading, reading irrelevant content).
+
+This answers: "How efficiently does the agent use its context window?"
+"""
+from typing import Dict, List, Optional
+from dataclasses import dataclass, field
+
+
+@dataclass
+class FileReadRecord:
+    """Record of a single file read."""
+    path: str
+    size_bytes: int
+    read_count: int
+    was_relevant: bool
+    first_read_step: int
+
+
+@dataclass
+class MemoryStats:
+    """Comprehensive context usage statistics."""
+    total_bytes_read: int = 0
+    unique_bytes_read: int = 0
+    redundant_bytes_read: int = 0
+    total_files_read: int = 0
+    unique_files_read: int = 0
+    redundant_reads: int = 0
+    relevant_files_read: int = 0
+    irrelevant_files_read: int = 0
+    context_efficiency: float = 0.0     # unique_useful / total
+    search_queries: int = 0
+    total_content_written: int = 0      # bytes written by agent
+
+    def to_dict(self) -> dict:
+        return {
+            "total_bytes_read": self.total_bytes_read,
+            "unique_bytes_read": self.unique_bytes_read,
+            "redundant_bytes_read": self.redundant_bytes_read,
+            "total_files_read": self.total_files_read,
+            "unique_files_read": self.unique_files_read,
+            "redundant_reads": self.redundant_reads,
+            "relevant_files_read": self.relevant_files_read,
+            "irrelevant_files_read": self.irrelevant_files_read,
+            "context_efficiency": round(self.context_efficiency, 3),
+            "search_queries": self.search_queries,
+            "total_content_written": self.total_content_written,
+        }
+
+
+class MemoryTracker:
+    """
+    Tracks agent's context consumption and memory patterns.
+
+    Usage:
+        tracker = MemoryTracker()
+        tracker.start_episode(relevant_files=["src/auth.py", "tests/test_auth.py"])
+        tracker.record_read("src/auth.py", 500, step=1)
+        tracker.record_read("src/auth.py", 500, step=3)  # redundant!
+        stats = tracker.get_stats()
+    """
+
+    def __init__(self):
+        self._reads: Dict[str, FileReadRecord] = {}
+        self._relevant_files: set = set()
+        self._search_count: int = 0
+        self._bytes_written: int = 0
+
+    def start_episode(self, relevant_files: List[str] = None):
+        """Reset tracker for new episode."""
+        self._reads.clear()
+        self._relevant_files = set(relevant_files or [])
+        self._search_count = 0
+        self._bytes_written = 0
+
+    def record_read(self, path: str, size_bytes: int, step: int):
+        """Record a file read action."""
+        if path in self._reads:
+            self._reads[path].read_count += 1
+        else:
+            self._reads[path] = FileReadRecord(
+                path=path,
+                size_bytes=size_bytes,
+                read_count=1,
+                was_relevant=path in self._relevant_files,
+                first_read_step=step,
+            )
+
+    def record_search(self):
+        """Record a search query."""
+        self._search_count += 1
+
+    def record_write(self, content_bytes: int):
+        """Record bytes written by agent."""
+        self._bytes_written += content_bytes
+
+    def get_stats(self) -> MemoryStats:
+        """Compute comprehensive memory statistics."""
+        total_bytes = 0
+        unique_bytes = 0
+        redundant_bytes = 0
+        redundant_reads = 0
+        relevant_count = 0
+        irrelevant_count = 0
+
+        for record in self._reads.values():
+            first_read_bytes = record.size_bytes
+            unique_bytes += first_read_bytes
+            total_bytes += first_read_bytes * record.read_count
+
+            if record.read_count > 1:
+                redundant_reads += record.read_count - 1
+                redundant_bytes += first_read_bytes * (record.read_count - 1)
+
+            if record.was_relevant:
+                relevant_count += 1
+            else:
+                irrelevant_count += 1
+
+        # Context efficiency: what fraction of bytes read was useful (relevant + unique)?
+        relevant_bytes = sum(
+            r.size_bytes for r in self._reads.values() if r.was_relevant
+        )
+        efficiency = relevant_bytes / max(1, total_bytes)
+
+        return MemoryStats(
+            total_bytes_read=total_bytes,
+            unique_bytes_read=unique_bytes,
+            redundant_bytes_read=redundant_bytes,
+            total_files_read=sum(r.read_count for r in self._reads.values()),
+            unique_files_read=len(self._reads),
+            redundant_reads=redundant_reads,
+            relevant_files_read=relevant_count,
+            irrelevant_files_read=irrelevant_count,
+            context_efficiency=efficiency,
+            search_queries=self._search_count,
+            total_content_written=self._bytes_written,
+        )
+
+    def get_wasteful_patterns(self) -> List[str]:
+        """Identify specific wasteful patterns for debugging."""
+        patterns = []
+
+        # Files read multiple times
+        for record in self._reads.values():
+            if record.read_count > 1:
+                patterns.append(
+                    f"REDUNDANT_READ: '{record.path}' read {record.read_count} times "
+                    f"({record.size_bytes * record.read_count} bytes wasted)"
+                )
+
+        # Irrelevant files read
+        for record in self._reads.values():
+            if not record.was_relevant and record.read_count > 0:
+                patterns.append(
+                    f"IRRELEVANT_READ: '{record.path}' not in relevant files "
+                    f"({record.size_bytes} bytes wasted)"
+                )
+
+        return patterns
diff --git a/server/models.py b/server/models.py
new file mode 100644
index 0000000000000000000000000000000000000000..d9007f4082230cb9b2278cd1fa105930ec38632b
--- /dev/null
+++ b/server/models.py
@@ -0,0 +1,131 @@
+# server/models.py
+"""
+Pydantic models for the OpenEnv API — extended with evaluation & reliability layer.
+"""
+from typing import Optional, List, Dict, Any, Literal
+from pydantic import BaseModel, Field
+
+
+# ── Core Action/Observation Models ──────────────────────────────────────────
+
+class RepoAction(BaseModel):
+    """All actions the agent can take in one step."""
+    action_type: Literal[
+        "read_file",      # Read a file's contents. Costs 1 step.
+        "write_file",     # Write/modify a file. Costs 1 step.
+        "run_tests",      # Run pytest on a specific test file. Costs 2 steps.
+        "search_code",    # Search for a string across all files. Costs 1 step.
+        "submit"          # Finalise submission and trigger full grader. Terminal action.
+    ]
+    path: Optional[str] = None          # For read_file, write_file, run_tests
+    content: Optional[str] = None       # For write_file — the new file content
+    query: Optional[str] = None         # For search_code
+
+
+class RepoObservation(BaseModel):
+    """What the agent sees after each step."""
+    repo_tree: List[str]                     # All file paths in the repo
+    task_description: str                    # Natural language description of the task
+    failing_tests: List[str]                 # Test names that are currently failing
+    files_read: List[str]                    # Files the agent has read so far
+    last_action_result: Optional[str]        # Output of the last action
+    steps_remaining: int
+    current_task: str                        # "task1", "task2", or "task3"
+    last_action_error: Optional[str] = None  # If the last action failed, why
+
+
+class RepoReward(BaseModel):
+    """Reward signal after each step."""
+    value: float = Field(ge=-1.0, le=1.0)
+    reason: str
+
+
+# ── API Response Models ─────────────────────────────────────────────────────
+
+class StepResult(BaseModel):
+    """Complete result returned by /step endpoint."""
+    observation: RepoObservation
+    reward: float
+    done: bool
+    info: Dict[str, Any] = {}
+
+
+class ResetResult(BaseModel):
+    """Result returned by /reset endpoint."""
+    observation: RepoObservation
+    info: Dict[str, Any] = {}
+
+
+class StateResult(BaseModel):
+    """Result returned by /state endpoint."""
+    observation: RepoObservation
+    current_score: float
+    total_steps_taken: int
+
+
+# ── Evaluation & Reliability Models ─────────────────────────────────────────
+
+class TrajectoryResponse(BaseModel):
+    """Full trajectory of the current/latest episode."""
+    episode_id: Optional[str] = None
+    task: Optional[str] = None
+    variant_id: Optional[str] = None
+    start_time: Optional[float] = None
+    end_time: Optional[float] = None
+    duration_seconds: Optional[float] = None
+    steps: List[Dict[str, Any]] = []
+    final_score: float = 0.0
+    total_steps: int = 0
+    metadata: Dict[str, Any] = {}
+
+
+class EvaluationResponse(BaseModel):
+    """Multi-dimensional evaluation of agent performance."""
+    episode_id: Optional[str] = None
+    task: Optional[str] = None
+    composite_score: float = 0.0
+    dimensions: Dict[str, Any] = {}
+    failure_analysis: List[str] = []
+    strengths: List[str] = []
+    recommendations: List[str] = []
+
+
+class MetricsResponse(BaseModel):
+    """Comprehensive metrics for the current/latest episode."""
+    episode_id: Optional[str] = None
+
+    # Core metrics
+    success_rate: float = 0.0
+    step_efficiency: float = 0.0
+    navigation_score: float = 0.0
+    context_efficiency: float = 0.0
+    reasoning_quality: float = 0.0
+    robustness_score: float = 0.0
+    security_score: float = 0.0
+
+    # Memory stats
+    memory: Dict[str, Any] = {}
+
+    # Security stats
+    security: Dict[str, Any] = {}
+
+    # Fault injection report
+    fault_injection: Dict[str, Any] = {}
+
+    # Wasteful patterns detected
+    wasteful_patterns: List[str] = []
+
+    # Timeline of actions
+    timeline: List[Dict[str, Any]] = []
+
+
+class FaultConfigRequest(BaseModel):
+    """Request body for configuring fault injection."""
+    level: Literal["none", "light", "heavy"] = "none"
+
+
+class ReplayRequest(BaseModel):
+    """Request body for replaying an episode."""
+    task: str
+    variant_id: Optional[str] = None  # If None, uses the variant from trajectory
+    actions: List[Dict[str, Any]] = []
diff --git a/server/repo_loader.py b/server/repo_loader.py
new file mode 100644
index 0000000000000000000000000000000000000000..ee84e454ebcf9957452492f3080f8beab02bbc20
--- /dev/null
+++ b/server/repo_loader.py
@@ -0,0 +1,104 @@
+# server/repo_loader.py
+"""
+Loads repo template variants and copies them into a working temp directory
+so the agent can modify files without corrupting the originals.
+"""
+import os
+import json
+import shutil
+import random
+import tempfile
+from pathlib import Path
+from typing import Dict, Any, Optional
+
+TEMPLATES_DIR = os.path.join(os.path.dirname(os.path.dirname(__file__)), "repo_templates")
+
+
+class RepoVariant:
+    """Represents one loaded repo variant with metadata."""
+
+    def __init__(self, task: str, variant_id: str, working_dir: str, meta: Dict[str, Any]):
+        self.task = task
+        self.variant_id = variant_id
+        self.working_dir = working_dir  # temp copy agent can modify
+        self.meta = meta
+
+    def get_tree(self) -> list:
+        """Return all file paths relative to working_dir."""
+        tree = []
+        for root, dirs, files in os.walk(self.working_dir):
+            dirs[:] = [d for d in dirs if not d.startswith('.') and d != '__pycache__']
+            for f in sorted(files):
+                abs_path = os.path.join(root, f)
+                rel_path = os.path.relpath(abs_path, self.working_dir)
+                tree.append(rel_path)
+        return sorted(tree)
+
+    def get_failing_tests(self) -> list:
+        """Return failing test names from meta.json."""
+        return self.meta.get("failing_tests", [])
+
+    def cleanup(self):
+        """Remove the working temp directory."""
+        if self.working_dir and os.path.exists(self.working_dir):
+            shutil.rmtree(self.working_dir, ignore_errors=True)
+
+
+def load_random_variant(task: str) -> RepoVariant:
+    """Load a random variant for the given task."""
+    task_dir = os.path.join(TEMPLATES_DIR, task)
+    if not os.path.exists(task_dir):
+        raise ValueError(f"Task directory not found: {task_dir}")
+
+    variants = [d for d in os.listdir(task_dir)
+                if os.path.isdir(os.path.join(task_dir, d)) and d.startswith("variant_")]
+
+    if not variants:
+        raise ValueError(f"No variants found for task: {task}")
+
+    chosen = random.choice(variants)
+    variant_path = os.path.join(task_dir, chosen)
+
+    # Load meta.json
+    meta_path = os.path.join(variant_path, "meta.json")
+    with open(meta_path, 'r') as f:
+        meta = json.load(f)
+
+    # Create a temp working copy
+    working_dir = tempfile.mkdtemp(prefix=f"openenv_{task}_{chosen}_")
+    shutil.copytree(variant_path, working_dir, dirs_exist_ok=True)
+
+    # Remove meta.json from working dir so agent cannot read the answers
+    meta_in_work = os.path.join(working_dir, "meta.json")
+    if os.path.exists(meta_in_work):
+        os.remove(meta_in_work)
+
+    return RepoVariant(task=task, variant_id=chosen, working_dir=working_dir, meta=meta)
+
+
+def get_task_description(task: str, meta: Dict[str, Any]) -> str:
+    """Generate the task description shown to the agent."""
+    descriptions = {
+        "task1": (
+            f"This Python repository has {meta.get('total_files', 'several')} files. "
+            f"Some tests are currently failing due to bugs in the source code. "
+            f"Your goal is to find and fix the bugs so that all tests pass. "
+            f"You have {meta.get('optimal_steps', 15)} optimal steps but can use up to your step budget. "
+            f"Read relevant source files, identify the bugs, fix them with write_file, then submit."
+        ),
+        "task2": (
+            f"This Python repository has a bug that spans two modules — one module is calling "
+            f"another with the wrong argument type or method signature. "
+            f"You must read both modules to understand the interface contract, then fix the caller. "
+            f"You also need to write one regression test that would have caught this bug. "
+            f"Fix the bug and add the regression test, then submit."
+        ),
+        "task3": (
+            f"This Python repository needs a new feature implemented. "
+            f"Read FEATURE_SPEC.md first for requirements. "
+            f"Then read the existing source files to understand the architecture. "
+            f"Implement the feature so all tests in the tests/ directory pass. "
+            f"Do not modify any test files. Only modify source files."
+        ),
+    }
+    return descriptions.get(task, "Fix the failing tests in this repository.")
diff --git a/server/sandbox.py b/server/sandbox.py
new file mode 100644
index 0000000000000000000000000000000000000000..27a7bb04eeae600d61e7c90f3c4a3593656fc5cb
--- /dev/null
+++ b/server/sandbox.py
@@ -0,0 +1,126 @@
+# server/sandbox.py
+"""
+Secure subprocess execution for running agent-submitted code.
+NEVER run agent code as root. ALWAYS use timeouts.
+"""
+import subprocess
+import os
+import sys
+import tempfile
+from typing import Tuple
+from pathlib import Path
+import re
+
+
+EXECUTION_TIMEOUT = 10       # seconds — hard limit per test run
+MAX_OUTPUT_BYTES = 50_000    # truncate large outputs
+MAX_MEMORY_MB = 256          # memory limit for subprocess
+
+
+def run_pytest_sandboxed(repo_path: str, test_file: str = None) -> Tuple[float, str, bool]:
+    """
+    Run pytest in a sandboxed subprocess.
+
+    Returns:
+        (pass_rate: float, output: str, timed_out: bool)
+    """
+    cmd = [sys.executable, "-m", "pytest", "--tb=short", "-q", "--no-header"]
+
+    if test_file:
+        test_path = os.path.join(repo_path, test_file)
+        if not os.path.exists(test_path):
+            return 0.0, f"Test file not found: {test_file}", False
+        cmd.append(test_path)
+    else:
+        tests_dir = os.path.join(repo_path, "tests")
+        if os.path.exists(tests_dir):
+            cmd.append(tests_dir)
+        else:
+            cmd.append(repo_path)
+
+    def set_limits():
+        """Set resource limits for subprocess — runs in child process."""
+        try:
+            import resource
+            mem_bytes = MAX_MEMORY_MB * 1024 * 1024
+            resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))
+        except Exception:
+            pass  # Best effort — don't fail if setrlimit unavailable
+
+    try:
+        result = subprocess.run(
+            cmd,
+            capture_output=True,
+            text=True,
+            timeout=EXECUTION_TIMEOUT,
+            cwd=repo_path,
+            env={
+                **os.environ,
+                "PYTHONPATH": repo_path,
+                "PYTHONDONTWRITEBYTECODE": "1",
+            },
+            preexec_fn=set_limits if sys.platform != "win32" else None,
+        )
+
+        output = (result.stdout + result.stderr)[:MAX_OUTPUT_BYTES]
+        pass_rate = _parse_pass_rate(output, result.returncode)
+        return pass_rate, output, False
+
+    except subprocess.TimeoutExpired:
+        return 0.0, f"TIMEOUT: Tests exceeded {EXECUTION_TIMEOUT}s limit", True
+    except Exception as e:
+        return 0.0, f"EXECUTION_ERROR: {str(e)}", False
+
+
+def _parse_pass_rate(output: str, returncode: int) -> float:
+    """Parse pytest output to get pass rate 0.0–1.0."""
+    # Look for "X passed, Y failed" or "X passed" or "X failed"
+    passed_match = re.search(r'(\d+) passed', output)
+    failed_match = re.search(r'(\d+) failed', output)
+    error_match = re.search(r'(\d+) error', output)
+
+    passed = int(passed_match.group(1)) if passed_match else 0
+    failed = int(failed_match.group(1)) if failed_match else 0
+    errors = int(error_match.group(1)) if error_match else 0
+
+    total = passed + failed + errors
+    if total == 0:
+        # If returncode is 0, all passed; otherwise failure
+        return 1.0 if returncode == 0 else 0.0
+
+    return round(passed / total, 3)
+
+
+def validate_file_path(path: str, repo_root: str) -> bool:
+    """
+    Ensure agent cannot read/write files outside the repo.
+    Prevents path traversal attacks.
+    """
+    try:
+        repo_abs = os.path.abspath(repo_root)
+        file_abs = os.path.abspath(os.path.join(repo_root, path))
+        return file_abs.startswith(repo_abs + os.sep) or file_abs == repo_abs
+    except Exception:
+        return False
+
+
+def search_in_repo(query: str, repo_path: str) -> str:
+    """Grep-style search across all Python files in the repo."""
+    results = []
+    for root, dirs, files in os.walk(repo_path):
+        # Skip __pycache__ and hidden dirs
+        dirs[:] = [d for d in dirs if not d.startswith('.') and d != '__pycache__']
+        for fname in files:
+            if fname.endswith('.py') or fname.endswith('.md') or fname.endswith('.json'):
+                fpath = os.path.join(root, fname)
+                rel_path = os.path.relpath(fpath, repo_path)
+                try:
+                    with open(fpath, 'r', encoding='utf-8', errors='ignore') as f:
+                        for lineno, line in enumerate(f, 1):
+                            if query.lower() in line.lower():
+                                results.append(f"{rel_path}:{lineno}: {line.rstrip()}")
+                except Exception:
+                    continue
+    if not results:
+        return f"No matches found for: {query}"
+    return '\n'.join(results[:50])  # Limit to 50 matches
diff --git a/server/security.py b/server/security.py
new file mode 100644
index 0000000000000000000000000000000000000000..a3bed3492686f11ca65eff668e84df94b67bd82d
--- /dev/null
+++ b/server/security.py
@@ -0,0 +1,164 @@
+# server/security.py
+"""
+Security layer for detecting unsafe agent actions.
+
+Scans agent-submitted code for:
+- Dangerous function calls (os.system, eval, exec, subprocess)
+- Import of dangerous modules
+- Path traversal attempts
+- Prompt injection patterns in code comments
+- Network access attempts
+"""
+import re
+from typing import List, Tuple
+from dataclasses import dataclass, field
+
+
+# Patterns that indicate dangerous code
+DANGEROUS_PATTERNS = [
+    (r'\bos\.system\s*\(', "os.system() — arbitrary command execution"),
+    (r'\bos\.popen\s*\(', "os.popen() — arbitrary command execution"),
+    (r'\bsubprocess\.(run|call|Popen|check_output)\s*\(', "subprocess — arbitrary command execution"),
+    (r'\beval\s*\(', "eval() — arbitrary code execution"),
+    (r'\bexec\s*\(', "exec() — arbitrary code execution"),
+    (r'\b__import__\s*\(', "__import__() — dynamic import of dangerous modules"),
+    (r'\bcompile\s*\(.*exec', "compile()+exec — code execution"),
+    (r'\bopen\s*\([^)]*["\']\/etc', "Attempting to read system files"),
+    (r'\bopen\s*\([^)]*["\']\/proc', "Attempting to read proc filesystem"),
+    (r'\bsocket\s*\.\s*socket\s*\(', "Raw socket creation — network access"),
+    (r'\brequests\.(get|post|put|delete)\s*\(', "HTTP requests — network access"),
+    (r'\burllib', "urllib — network access"),
+    (r'\bshutil\.rmtree\s*\(', "shutil.rmtree() — recursive deletion"),
+    (r'\bos\.remove\s*\(', "os.remove() — file deletion"),
+    (r'\bos\.unlink\s*\(', "os.unlink() — file deletion"),
+]
+
+# Dangerous imports
+DANGEROUS_IMPORTS = [
+    "subprocess",
+    "socket",
+    "requests",
+    "urllib",
+    "http.client",
+    "ftplib",
+    "smtplib",
+    "ctypes",
+    "pickle",  # deserialization attacks
+]
+
+# Prompt injection patterns — things an attacker might put in code comments
+INJECTION_PATTERNS = [
+    (r'ignore\s+(all\s+)?previous\s+instructions', "Prompt injection: ignore instructions"),
+    (r'you\s+are\s+now\s+a', "Prompt injection: role override"),
+    (r'system\s*:\s*you\s+must', "Prompt injection: system role override"),
+    (r'<\|im_start\|>', "Prompt injection: chat template injection"),
+    (r'IMPORTANT:\s*ignore', "Prompt injection: authority override"),
+    (r'act\s+as\s+if', "Prompt injection: behavioral override"),
+]
+
+
+@dataclass
+class SecurityScanResult:
+    """Result of scanning agent-submitted content."""
+    is_safe: bool
+    flags: List[str] = field(default_factory=list)
+    blocked_patterns: List[str] = field(default_factory=list)
+    severity: str = "none"  # none, low, medium, high, critical
+
+    def to_dict(self) -> dict:
+        return {
+            "is_safe": self.is_safe,
+            "flags": self.flags,
+            "blocked_patterns": self.blocked_patterns,
+            "severity": self.severity,
+        }
+
+
+class SecurityScanner:
+    """
+    Scans agent-submitted code for security threats.
+
+    Usage:
+        scanner = SecurityScanner()
+        result = scanner.scan_content(code_content)
+        result = scanner.scan_file_read(file_content)  # for injection in existing files
+    """
+
+    def __init__(self, strict_mode: bool = True):
+        self.strict_mode = strict_mode
+        self.total_scans = 0
+        self.total_violations = 0
+
+    def scan_write_content(self, content: str, path: str) -> SecurityScanResult:
+        """Scan content that agent wants to write to a file."""
+        self.total_scans += 1
+        flags = []
+        blocked = []
+
+        # Check dangerous patterns
+        for pattern, description in DANGEROUS_PATTERNS:
+            if re.search(pattern, content, re.IGNORECASE):
+                flags.append(f"DANGEROUS_CODE: {description}")
+                blocked.append(pattern)
+
+        # Check dangerous imports
+        for module in DANGEROUS_IMPORTS:
+            if re.search(rf'^\s*(import\s+{module}|from\s+{module}\s+import)', content, re.MULTILINE):
+                flags.append(f"DANGEROUS_IMPORT: {module}")
+                blocked.append(module)
+
+        # Check for path traversal in content
+        if ".." in path or path.startswith("/"):
+            flags.append(f"PATH_TRAVERSAL: suspicious path '{path}'")
+
+        # Determine severity
+        if not flags:
+            severity = "none"
+        elif len(flags) == 1 and not blocked:
+            severity = "low"
+        elif blocked:
+            severity = "high" if len(blocked) > 2 else "medium"
+        else:
+            severity = "medium"
+
+        is_safe = len(blocked) == 0 or not self.strict_mode
+        if not is_safe:
+            self.total_violations += 1
+
+        return SecurityScanResult(
+            is_safe=is_safe,
+            flags=flags,
+            blocked_patterns=blocked,
+            severity=severity,
+        )
+
+    def scan_file_for_injection(self, content: str, path: str) -> SecurityScanResult:
+        """
+        Scan file content being READ by agent for prompt injection.
+        This detects if a codebase file is trying to hijack the agent.
+        """
+        self.total_scans += 1
+        flags = []
+
+        for pattern, description in INJECTION_PATTERNS:
+            matches = re.findall(pattern, content, re.IGNORECASE)
+            if matches:
+                flags.append(f"INJECTION_DETECTED in {path}: {description}")
+
+        severity = "none" if not flags else ("high" if len(flags) > 1 else "medium")
+
+        return SecurityScanResult(
+            is_safe=len(flags) == 0,
+            flags=flags,
+            blocked_patterns=[],
+            severity=severity,
+        )
+
+    def get_stats(self) -> dict:
+        return {
+            "total_scans": self.total_scans,
+            "total_violations": self.total_violations,
+            "violation_rate": round(
+                self.total_violations / max(1, self.total_scans), 3
+            ),
+        }
diff --git a/server/trajectory.py b/server/trajectory.py
new file mode 100644
index 0000000000000000000000000000000000000000..c5f49c939ef31a9110f0d901011f2d5f96686f38
--- /dev/null
+++ b/server/trajectory.py
@@ -0,0 +1,221 @@
+# server/trajectory.py
+"""
+Full trajectory recording and deterministic replay system.
+
+Records every action, observation, reward, file diff, and timing.
+Enables post-hoc analysis and deterministic replay of agent episodes.
+"""
+import time
+import copy
+import hashlib
+from typing import Optional, List, Dict, Any
+from dataclasses import dataclass, field, asdict
+
+
+@dataclass
+class FileDiff:
+    """Represents a file change made by the agent."""
+    path: str
+    before: Optional[str]  # None if file was created
+    after: str
+    chars_changed: int
+
+
+@dataclass
+class TrajectoryStep:
+    """Complete record of one agent step."""
+    step_number: int
+    timestamp: float
+    action_type: str
+    action_path: Optional[str]
+    action_query: Optional[str]
+    action_content_length: Optional[int]  # Don't store full content — too large
+    observation_snapshot: Dict[str, Any]   # Compact snapshot
+    reward: float
+    cumulative_reward: float
+    done: bool
+    error: Optional[str]
+    file_diff: Optional[Dict[str, Any]]   # If write_file, the diff
+    test_pass_rate: Optional[float]        # If run_tests, the pass rate
+    duration_ms: float                     # How long this step took server-side
+    security_flags: List[str] = field(default_factory=list)
+
+
+@dataclass
+class TrajectoryRecord:
+    """Complete episode trajectory — everything needed for replay + analysis."""
+    episode_id: str
+    task: str
+    variant_id: str
+    start_time: float
+    end_time: Optional[float] = None
+    steps: List[TrajectoryStep] = field(default_factory=list)
+    final_score: float = 0.0
+    total_steps: int = 0
+    metadata: Dict[str, Any] = field(default_factory=dict)
+
+    def to_dict(self) -> dict:
+        """Convert to JSON-serializable dict."""
+        return {
+            "episode_id": self.episode_id,
+            "task": self.task,
+            "variant_id": self.variant_id,
+            "start_time": self.start_time,
+            "end_time": self.end_time,
+            "duration_seconds": round(self.end_time - self.start_time, 2) if self.end_time else None,
+            "steps": [asdict(s) for s in self.steps],
+            "final_score": self.final_score,
+            "total_steps": self.total_steps,
+            "metadata": self.metadata,
+        }
+
+
+class TrajectoryLogger:
+    """
+    Records full agent trajectories for analysis and replay.
+
+    Usage:
+        logger = TrajectoryLogger()
+        logger.start_episode("task1", "variant_3")
+        logger.record_step(step_number=1, action=..., obs=..., ...)
+        logger.end_episode(final_score=0.75)
+        trajectory = logger.get_trajectory()
+    """
+
+    def __init__(self):
+        self._current: Optional[TrajectoryRecord] = None
+        self._history: List[TrajectoryRecord] = []  # Last N episodes
+        self._max_history = 10
+
+    def start_episode(self, task: str, variant_id: str) -> str:
+        """Start recording a new episode. Returns episode_id."""
+        # Finalize previous episode if still active
+        if self._current and self._current.end_time is None:
+            self._current.end_time = time.time()
+            self._history.append(self._current)
+
+        episode_id = hashlib.md5(
+            f"{task}_{variant_id}_{time.time()}".encode()
+        ).hexdigest()[:12]
+
+        self._current = TrajectoryRecord(
+            episode_id=episode_id,
+            task=task,
+            variant_id=variant_id,
+            start_time=time.time(),
+        )
+        return episode_id
+
+    def record_step(
+        self,
+        step_number: int,
+        action_type: str,
+        action_path: Optional[str],
+        action_query: Optional[str],
+        action_content_length: Optional[int],
+        reward: float,
+        cumulative_reward: float,
+        done: bool,
+        error: Optional[str],
+        file_diff: Optional[FileDiff],
+        test_pass_rate: Optional[float],
+        duration_ms: float,
+        observation_compact: Dict[str, Any],
+        security_flags: List[str] = None,
+    ):
+        """Record one step in the current trajectory."""
+        if not self._current:
+            return
+
+        step = TrajectoryStep(
+            step_number=step_number,
+            timestamp=time.time(),
+            action_type=action_type,
+            action_path=action_path,
+            action_query=action_query,
+            action_content_length=action_content_length,
+            observation_snapshot=observation_compact,
+            reward=reward,
+            cumulative_reward=cumulative_reward,
+            done=done,
+            error=error,
+            file_diff=asdict(file_diff) if file_diff else None,
+            test_pass_rate=test_pass_rate,
+            duration_ms=duration_ms,
+            security_flags=security_flags or [],
+        )
+        self._current.steps.append(step)
+        self._current.total_steps = step_number
+
+    def end_episode(self, final_score: float):
+        """Finalize the current episode."""
+        if not self._current:
+            return
+
+        self._current.end_time = time.time()
+        self._current.final_score = final_score
+
+        # Maintain history buffer
+        self._history.append(self._current)
+        if len(self._history) > self._max_history:
+            self._history.pop(0)
+
+    def get_trajectory(self) -> Optional[dict]:
+        """Get the current/latest trajectory as dict."""
+        if self._current:
+            return self._current.to_dict()
+        if self._history:
+            return self._history[-1].to_dict()
+        return None
+
+    def get_replay_actions(self) -> List[dict]:
+        """Extract action sequence for deterministic replay."""
+        if not self._current and not self._history:
+            return []
+
+        record = self._current or self._history[-1]
+        actions = []
+        for step in record.steps:
+            action = {"action_type": step.action_type}
+            if step.action_path:
+                action["path"] = step.action_path
+            if step.action_query:
+                action["query"] = step.action_query
+            # Note: content not stored in trajectory — replay requires re-supplying it
+            actions.append(action)
+        return actions
+
+    def get_step_timeline(self) -> List[dict]:
+        """Get compact timeline of actions and outcomes for visualization."""
+        if not self._current:
+            return []
+
+        timeline = []
+        for step in self._current.steps:
+            timeline.append({
+                "step": step.step_number,
+                "action": step.action_type,
+                "path": step.action_path,
+                "reward": step.reward,
+                "error": step.error,
+                "duration_ms": step.duration_ms,
+                "pass_rate": step.test_pass_rate,
+                "security_flags": step.security_flags,
+            })
+        return timeline
+
+    def get_history_summary(self) -> List[dict]:
+        """Get summary of recent episodes."""
+        summaries = []
+        for record in self._history:
+            summaries.append({
+                "episode_id": record.episode_id,
+                "task": record.task,
+                "variant_id": record.variant_id,
+                "final_score": record.final_score,
+                "total_steps": record.total_steps,
+                "duration_seconds": round(
+                    record.end_time - record.start_time, 2
+                ) if record.end_time else None,
+            })
+        return summaries