darshanajudiya7 committed
Commit d25ab77 · verified · 1 Parent(s): cf8ad9a

Upload folder using huggingface_hub
Dockerfile ADDED
@@ -0,0 +1,81 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ # Multi-stage build using openenv-base
+ # This Dockerfile is flexible and works for both:
+ # - In-repo environments (with local OpenEnv sources)
+ # - Standalone environments (with openenv from PyPI/Git)
+ # The build script (openenv build) handles context detection and sets appropriate build args.
+
+ ARG BASE_IMAGE=ghcr.io/meta-pytorch/openenv-base:latest
+ FROM ${BASE_IMAGE} AS builder
+
+ WORKDIR /app
+
+ # Ensure git is available (required for installing dependencies from VCS)
+ RUN apt-get update && \
+     apt-get install -y --no-install-recommends git && \
+     rm -rf /var/lib/apt/lists/*
+
+ # Build argument to control whether we're building standalone or in-repo
+ ARG BUILD_MODE=in-repo
+ ARG ENV_NAME=python_env
+
+ # Copy environment code (always at root of build context)
+ COPY . /app/env
+
+ # For in-repo builds, openenv is already vendored in the build context
+ # For standalone builds, openenv will be installed via pyproject.toml
+ WORKDIR /app/env
+
+ # Ensure uv is available (for local builds where the base image lacks it)
+ RUN if ! command -v uv >/dev/null 2>&1; then \
+         curl -LsSf https://astral.sh/uv/install.sh | sh && \
+         mv /root/.local/bin/uv /usr/local/bin/uv && \
+         mv /root/.local/bin/uvx /usr/local/bin/uvx; \
+     fi
+
+ # Install dependencies using uv sync
+ # If uv.lock exists, use it; otherwise resolve on the fly
+ RUN --mount=type=cache,target=/root/.cache/uv \
+     if [ -f uv.lock ]; then \
+         uv sync --frozen --no-install-project --no-editable; \
+     else \
+         uv sync --no-install-project --no-editable; \
+     fi
+
+ RUN --mount=type=cache,target=/root/.cache/uv \
+     if [ -f uv.lock ]; then \
+         uv sync --frozen --no-editable; \
+     else \
+         uv sync --no-editable; \
+     fi
+
+ # Final runtime stage
+ FROM ${BASE_IMAGE}
+
+ WORKDIR /app
+
+ # Copy the virtual environment from the builder
+ COPY --from=builder /app/env/.venv /app/.venv
+
+ # Copy the environment code
+ COPY --from=builder /app/env /app/env
+
+ # Set PATH to use the virtual environment
+ ENV PATH="/app/.venv/bin:$PATH"
+
+ # Set PYTHONPATH so imports work correctly
+ ENV PYTHONPATH="/app/env:$PYTHONPATH"
+
+ # Health check
+ HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
+     CMD curl -f http://localhost:8000/health || exit 1
+
+ # Run the FastAPI server
+ # The module path is constructed to work with the /app/env structure
+ ENV ENABLE_WEB_INTERFACE=true
+ CMD ["sh", "-c", "cd /app/env && uvicorn server.app:app --host 0.0.0.0 --port 8000"]
README.md CHANGED
@@ -1,10 +1,274 @@
  ---
- title: Python Env
- emoji: 📊
- colorFrom: green
- colorTo: purple
  sdk: docker
  pinned: false
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: Python Env Environment Server
+ emoji: 🎶
+ colorFrom: purple
+ colorTo: red
  sdk: docker
  pinned: false
+ app_port: 8000
+ base_path: /web
+ tags:
+   - openenv
  ---
 
+ # Python Code Review Environment
+
+ This repository now hosts a deterministic OpenEnv benchmark for Python code review. Agents review snippets one step at a time and receive dense rewards, `done` flags, precision/recall/F1 metrics, and task-specific grading suitable for RL training or evaluation.
+
+ Active task families:
+
+ - `task_easy`: style and convention review
+ - `task_medium`: logic bug detection
+ - `task_hard`: security vulnerability audit
+
+ Use the action schema in `models.py`, the runtime in `server/python_env_environment.py`, and the rollout loop in `inference.py` / `rollout.py` as the current source of truth. The remaining template sections below are legacy scaffolding and may lag behind the benchmark implementation.
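
The precision/recall/F1 metrics mentioned above follow the standard counter-based definitions. As a minimal sketch (the helper name `review_metrics` is illustrative, not part of the environment's API), they can be derived from the `true_positives` / `false_positives` / `missed_issues` counters like this:

```python
def review_metrics(true_positives: int, false_positives: int, missed_issues: int):
    """Standard precision/recall/F1 over code-review counters."""
    predicted = true_positives + false_positives   # comments the agent made
    actual = true_positives + missed_issues        # gold issues in the snippet
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

For example, 3 correct comments, 1 false positive, and 2 missed issues give precision 0.75 and recall 0.6.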
25
+
26
+ ## Quick Start
27
+
28
+ The simplest way to use the Python Env environment is through the `PythonEnv` class:
29
+
30
+ ```python
31
+ from python_env import PythonAction, PythonEnv
32
+
33
+ try:
34
+ # Create environment from Docker image
35
+ python_envenv = PythonEnv.from_docker_image("python_env-env:latest")
36
+
37
+ # Reset
38
+ result = python_envenv.reset()
39
+ print(f"Reset: {result.observation.echoed_message}")
40
+
41
+ # Send multiple messages
42
+ messages = ["Hello, World!", "Testing echo", "Final message"]
43
+
44
+ for msg in messages:
45
+ result = python_envenv.step(PythonAction(message=msg))
46
+ print(f"Sent: '{msg}'")
47
+ print(f" → Echoed: '{result.observation.echoed_message}'")
48
+ print(f" → Length: {result.observation.message_length}")
49
+ print(f" → Reward: {result.reward}")
50
+
51
+ finally:
52
+ # Always clean up
53
+ python_envenv.close()
54
+ ```
55
+
56
+ That's it! The `PythonEnv.from_docker_image()` method handles:
57
+ - Starting the Docker container
58
+ - Waiting for the server to be ready
59
+ - Connecting to the environment
60
+ - Container cleanup when you call `close()`
61
+
62
+ ## Building the Docker Image
63
+
64
+ Before using the environment, you need to build the Docker image:
65
+
66
+ ```bash
67
+ # From project root
68
+ docker build -t python_env-env:latest -f server/Dockerfile .
69
+ ```
70
+
71
+ ## Deploying to Hugging Face Spaces
72
+
73
+ You can easily deploy your OpenEnv environment to Hugging Face Spaces using the `openenv push` command:
74
+
75
+ ```bash
76
+ # From the environment directory (where openenv.yaml is located)
77
+ openenv push
78
+
79
+ # Or specify options
80
+ openenv push --namespace my-org --private
81
+ ```
82
+
83
+ The `openenv push` command will:
84
+ 1. Validate that the directory is an OpenEnv environment (checks for `openenv.yaml`)
85
+ 2. Prepare a custom build for Hugging Face Docker space (enables web interface)
86
+ 3. Upload to Hugging Face (ensuring you're logged in)
87
+
88
+ ### Prerequisites
89
+
90
+ - Authenticate with Hugging Face: The command will prompt for login if not already authenticated
91
+
92
+ ### Options
93
+
94
+ - `--directory`, `-d`: Directory containing the OpenEnv environment (defaults to current directory)
95
+ - `--repo-id`, `-r`: Repository ID in format 'username/repo-name' (defaults to 'username/env-name' from openenv.yaml)
96
+ - `--base-image`, `-b`: Base Docker image to use (overrides Dockerfile FROM)
97
+ - `--private`: Deploy the space as private (default: public)
98
+
99
+ ### Examples
100
+
101
+ ```bash
102
+ # Push to your personal namespace (defaults to username/env-name from openenv.yaml)
103
+ openenv push
104
+
105
+ # Push to a specific repository
106
+ openenv push --repo-id my-org/my-env
107
+
108
+ # Push with a custom base image
109
+ openenv push --base-image ghcr.io/meta-pytorch/openenv-base:latest
110
+
111
+ # Push as a private space
112
+ openenv push --private
113
+
114
+ # Combine options
115
+ openenv push --repo-id my-org/my-env --base-image custom-base:latest --private
116
+ ```
117
+
118
+ After deployment, your space will be available at:
119
+ `https://huggingface.co/spaces/<repo-id>`
120
+
121
+ The deployed space includes:
122
+ - **Web Interface** at `/web` - Interactive UI for exploring the environment
123
+ - **API Documentation** at `/docs` - Full OpenAPI/Swagger interface
124
+ - **Health Check** at `/health` - Container health monitoring
125
+ - **WebSocket** at `/ws` - Persistent session endpoint for low-latency interactions
126
+
127
+ ## Environment Details
128
+
129
+ ### Action
130
+ **PythonAction**: Contains a single field
131
+ - `message` (str) - The message to echo back
132
+
133
+ ### Observation
134
+ **PythonObservation**: Contains the echo response and metadata
135
+ - `echoed_message` (str) - The message echoed back
136
+ - `message_length` (int) - Length of the message
137
+ - `reward` (float) - Reward based on message length (length × 0.1)
138
+ - `done` (bool) - Always False for echo environment
139
+ - `metadata` (dict) - Additional info like step count
140
+
141
+ ### Reward
142
+ The reward is calculated as: `message_length × 0.1`
143
+ - "Hi" → reward: 0.2
144
+ - "Hello, World!" → reward: 1.3
145
+ - Empty message → reward: 0.0
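
The rule above is simple enough to restate as a one-liner; this sketch (a standalone reproduction for illustration, not the environment's actual implementation) confirms the example values:

```python
def echo_reward(message: str) -> float:
    # 0.1 reward per character, matching the table above.
    return len(message) * 0.1
```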
+
+ ## Advanced Usage
+
+ ### Connecting to an Existing Server
+
+ If you already have a Python Env environment server running, you can connect directly:
+
+ ```python
+ from python_env import PythonAction, PythonEnv
+
+ # Connect to an existing server
+ env = PythonEnv(base_url="<ENV_HTTP_URL_HERE>")
+
+ # Use as normal
+ result = env.reset()
+ result = env.step(PythonAction(message="Hello!"))
+ ```
+
+ Note: when connecting to an existing server, `env.close()` will NOT stop the server.
+
+ ### Using the Context Manager
+
+ The client supports context-manager usage for automatic connection management:
+
+ ```python
+ from python_env import PythonAction, PythonEnv
+
+ # Connect with a context manager (auto-connects and closes)
+ with PythonEnv(base_url="http://localhost:8000") as env:
+     result = env.reset()
+     print(f"Reset: {result.observation.echoed_message}")
+     # Multiple steps with low latency
+     for msg in ["Hello", "World", "!"]:
+         result = env.step(PythonAction(message=msg))
+         print(f"Echoed: {result.observation.echoed_message}")
+ ```
+
+ The client uses WebSocket connections for:
+ - **Lower latency**: No HTTP connection overhead per request
+ - **Persistent session**: The server maintains your environment state
+ - **Efficient episodes**: Better for many sequential steps
+
+ ### Concurrent WebSocket Sessions
+
+ The server supports multiple concurrent WebSocket connections. To enable this,
+ modify `server/app.py` to use factory mode:
+
+ ```python
+ # In server/app.py - use factory mode for concurrent sessions
+ app = create_app(
+     PythonEnvironment,  # Pass the class, not an instance
+     PythonAction,
+     PythonObservation,
+     max_concurrent_envs=4,  # Allow 4 concurrent sessions
+ )
+ ```
+
+ Then multiple clients can connect simultaneously:
+
+ ```python
+ from concurrent.futures import ThreadPoolExecutor
+
+ from python_env import PythonAction, PythonEnv
+
+ def run_episode(client_id: int):
+     with PythonEnv(base_url="http://localhost:8000") as env:
+         result = env.reset()
+         for i in range(10):
+             result = env.step(PythonAction(message=f"Client {client_id}, step {i}"))
+         return client_id, result.observation.message_length
+
+ # Run 4 episodes concurrently
+ with ThreadPoolExecutor(max_workers=4) as executor:
+     results = list(executor.map(run_episode, range(4)))
+ ```
+
+ ## Development & Testing
+
+ ### Direct Environment Testing
+
+ Test the environment logic directly without starting the HTTP server:
+
+ ```bash
+ # From the environment root
+ python3 server/python_env_environment.py
+ ```
+
+ This verifies that:
+ - The environment resets correctly
+ - Steps execute actions properly
+ - State tracking works
+ - Rewards are calculated correctly
+
+ ### Running Locally
+
+ Run the server locally for development:
+
+ ```bash
+ uvicorn server.app:app --reload
+ ```
+
+ ## Project Structure
+
+ ```
+ python_env/
+ ├── .dockerignore                  # Docker build exclusions
+ ├── __init__.py                    # Module exports
+ ├── README.md                      # This file
+ ├── openenv.yaml                   # OpenEnv manifest
+ ├── pyproject.toml                 # Project metadata and dependencies
+ ├── uv.lock                        # Locked dependencies (generated)
+ ├── client.py                      # PythonEnv client
+ ├── models.py                      # Action and Observation models
+ └── server/
+     ├── __init__.py                # Server module exports
+     ├── python_env_environment.py  # Core environment logic
+     ├── app.py                     # FastAPI application (HTTP + WebSocket endpoints)
+     └── Dockerfile                 # Container image definition
+ ```
+ ---------------------------------------
+
+ cd F:\python_env
+ # Edit your environment implementation in server/python_env_environment.py
+ # Edit your models in models.py
+ # Install dependencies: uv sync
+
+ # To integrate into the OpenEnv repo:
+ # 1. Copy this directory to <repo_root>/envs/python_env_env
+ # 2. Build from the repo root: docker build -t python_env_env:latest -f envs/python_env_env/server/Dockerfile .
+ # 3. Run your image: docker run -p 8000:8000 python_env_env:latest
__init__.py ADDED
@@ -0,0 +1,17 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """Python code-review benchmark environment."""
+
+ from .client import PythonEnv
+ from .models import PythonAction, PythonObservation, PythonState
+
+ __all__ = [
+     "PythonAction",
+     "PythonObservation",
+     "PythonState",
+     "PythonEnv",
+ ]
client.py ADDED
@@ -0,0 +1,94 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """Python Env Environment Client."""
+
+ from __future__ import annotations
+
+ from typing import Any, Dict
+ from urllib.parse import urlparse
+
+ import httpx
+
+ from openenv.core import EnvClient
+ from openenv.core.client_types import StepResult
+
+ try:
+     from .models import (
+         HealthResponse,
+         MetricsResponse,
+         PythonAction,
+         PythonObservation,
+         PythonState,
+         TaskListResponse,
+     )
+ except ImportError:
+     from models import (  # type: ignore
+         HealthResponse,
+         MetricsResponse,
+         PythonAction,
+         PythonObservation,
+         PythonState,
+         TaskListResponse,
+     )
+
+
+ def _to_http_base_url(base_url: str) -> str:
+     parsed = urlparse(base_url)
+     scheme = "https" if parsed.scheme == "wss" else "http"
+     if parsed.scheme in {"http", "https"}:
+         scheme = parsed.scheme
+     return f"{scheme}://{parsed.netloc}{parsed.path}".rstrip("/")
+
+
+ class PythonEnv(EnvClient[PythonAction, PythonObservation, PythonState]):
+     """Typed client for the Python code-review environment."""
+
+     def __init__(self, base_url: str, **kwargs: Any):
+         super().__init__(base_url=base_url, **kwargs)
+         self._http_base_url = _to_http_base_url(base_url)
+
+     def _step_payload(self, action: PythonAction) -> Dict[str, Any]:
+         """Convert a validated action model to the JSON payload expected by the server."""
+         return action.model_dump(exclude_none=True)
+
+     def _parse_result(self, payload: Dict[str, Any]) -> StepResult[PythonObservation]:
+         """Parse a server response into a typed step result."""
+         obs_data = dict(payload.get("observation", {}))
+         obs_data.setdefault("done", payload.get("done", False))
+         obs_data.setdefault("reward", payload.get("reward"))
+         observation = PythonObservation.model_validate(obs_data)
+
+         return StepResult(
+             observation=observation,
+             reward=payload.get("reward"),
+             done=payload.get("done", False),
+         )
+
+     def _parse_state(self, payload: Dict[str, Any]) -> PythonState:
+         """Parse the server state payload into the shared state model."""
+         return PythonState.model_validate(payload)
+
+     async def get_tasks(self) -> TaskListResponse:
+         async with httpx.AsyncClient() as client:
+             response = await client.get(f"{self._http_base_url}/tasks")
+             response.raise_for_status()
+             return TaskListResponse.model_validate(response.json())
+
+     async def get_metrics(self) -> MetricsResponse:
+         async with httpx.AsyncClient() as client:
+             response = await client.get(f"{self._http_base_url}/metrics")
+             response.raise_for_status()
+             return MetricsResponse.model_validate(response.json())
+
+     async def get_health(self) -> HealthResponse:
+         async with httpx.AsyncClient() as client:
+             response = await client.get(f"{self._http_base_url}/health")
+             response.raise_for_status()
+             return HealthResponse.model_validate(response.json())
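
The URL normalization performed by `_to_http_base_url` can be checked in isolation; this standalone copy (reproduced here for illustration) shows how WebSocket URLs are mapped to their HTTP equivalents and trailing slashes are stripped:

```python
from urllib.parse import urlparse

def to_http_base_url(base_url: str) -> str:
    # ws:// -> http://, wss:// -> https://; http(s) passes through unchanged.
    parsed = urlparse(base_url)
    scheme = "https" if parsed.scheme == "wss" else "http"
    if parsed.scheme in {"http", "https"}:
        scheme = parsed.scheme
    return f"{scheme}://{parsed.netloc}{parsed.path}".rstrip("/")
```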
inference.py ADDED
@@ -0,0 +1,264 @@
+ """Baseline inference script for the Python code-review environment."""
+
+ from __future__ import annotations
+
+ import asyncio
+ import json
+ import os
+ import re
+ from pathlib import Path
+ from typing import Any, Dict, List, Optional
+
+ from openai import OpenAI
+
+ from client import PythonEnv
+ from models import ActionType, PythonReviewAction
+
+
+ # Read all runtime configuration from environment variables so the script can
+ # be reused unchanged across local runs, CI, and HF Spaces validation.
+ API_BASE_URL = os.environ["API_BASE_URL"]
+ MODEL_NAME = os.environ["MODEL_NAME"]
+ API_KEY = os.getenv("HF_TOKEN") or os.getenv("OPENAI_API_KEY")
+ ENV_BASE_URL = os.getenv("ENV_BASE_URL")
+ DOCKER_IMAGE = os.getenv("PYTHON_ENV_IMAGE", "python_env-env:latest")
+ MAX_STEPS = int(os.getenv("MAX_STEPS", "25"))
+ REPORT_PATH = Path(os.getenv("INFERENCE_REPORT_PATH", "inference_results.json"))
+ TEMPERATURE = float(os.getenv("TEMPERATURE", "0"))
+ MAX_TOKENS = int(os.getenv("MAX_TOKENS", "900"))
+ TASK_IDS = ["task_easy", "task_medium", "task_hard"]
+
+ SYSTEM_PROMPT = """You are a precise senior Python code reviewer.
+ Return strict JSON using this schema:
+ {
+   "action_type": "ADD_COMMENT|APPROVE|REQUEST_CHANGES|ASK_CONTEXT|SKIP_LINE",
+   "line_number": 1,
+   "issue_type": "STYLE|LOGIC|SECURITY|PERFORMANCE|DOCS",
+   "severity": "LOW|MEDIUM|HIGH|CRITICAL",
+   "comment": "why this matters",
+   "suggestion": "optional fix suggestion",
+   "question": "optional context question"
+ }
+
+ Rules:
+ - Output JSON only. No markdown fences.
+ - Only report issues supported by the visible code.
+ - Use one action per step.
+ - Prefer high precision over quantity.
+ - Use REQUEST_CHANGES once you believe the code should be rejected.
+ - Use APPROVE only when the snippet is genuinely clean.
+ """
+
+
+ def _build_prompt(observation, step: int, history: List[str]) -> str:
+     """Build the task prompt sent to the model for one step."""
+     numbered_lines = "\n".join(
+         f"{index + 1:>3}: {line}" for index, line in enumerate(observation.lines)
+     )
+     history_text = "\n".join(history[-4:]) if history else "No previous attempts."
+     return (
+         f"Task ID: {observation.task_id}\n"
+         f"Step: {step}\n"
+         f"Current score: {observation.metrics.current_score:.2f}\n"
+         f"Last reward: {observation.reward_summary.step_reward:.2f}\n"
+         f"Cumulative reward: {observation.reward_summary.cumulative_reward:.2f}\n"
+         f"Latest feedback: {observation.feedback or 'None'}\n"
+         f"Attempt history:\n{history_text}\n\n"
+         f"Filename: {observation.filename}\n"
+         f"Context: {observation.context or 'None'}\n"
+         "Code to review:\n"
+         f"{numbered_lines}"
+     )
+
+
+ def _extract_text_content(message_content: Any) -> str:
+     """Normalize OpenAI response content into one text string."""
+     if isinstance(message_content, str):
+         return message_content
+     if isinstance(message_content, list):
+         parts: List[str] = []
+         for item in message_content:
+             if isinstance(item, dict):
+                 text = item.get("text")
+                 if isinstance(text, str):
+                     parts.append(text)
+         return "\n".join(parts)
+     return ""
+
+
+ def _extract_json_blob(content: str) -> str:
+     """Extract a JSON object from plain or fenced model output."""
+     fenced_match = re.search(r"```(?:json)?\s*(\{.*\})\s*```", content, re.DOTALL)
+     if fenced_match:
+         return fenced_match.group(1)
+
+     start = content.find("{")
+     end = content.rfind("}")
+     if start != -1 and end != -1 and end > start:
+         return content[start : end + 1]
+     return content
+
+
+ def _parse_response(content: str) -> Dict[str, Any]:
+     """Parse the model response into a normalized payload dict."""
+     raw = _extract_json_blob(content)
+     try:
+         data = json.loads(raw)
+     except json.JSONDecodeError:
+         return {"_parse_error": raw}
+     return data
+
+
+ def _completion(client: OpenAI, prompt: str) -> Dict[str, Any]:
+     """Send one completion request to the configured model endpoint."""
+     response = client.chat.completions.create(
+         model=MODEL_NAME,
+         temperature=TEMPERATURE,
+         max_tokens=MAX_TOKENS,
+         messages=[
+             {"role": "system", "content": SYSTEM_PROMPT},
+             {"role": "user", "content": prompt},
+         ],
+     )
+     content = _extract_text_content(response.choices[0].message.content) or "{}"
+     return _parse_response(content)
+
+
+ def _build_fallback_action(observation, note: str) -> PythonReviewAction:
+     """Create a safe fallback action when model output is unusable."""
+     return PythonReviewAction(
+         action_type=ActionType.REQUEST_CHANGES
+         if observation.current_step + 1 >= observation.max_steps
+         else ActionType.ASK_CONTEXT,
+         question=note if observation.current_step + 1 < observation.max_steps else None,
+     )
+
+
+ def _to_action(
+     payload: Dict[str, Any],
+     observation,
+ ) -> PythonReviewAction:
+     """Convert a parsed model payload into a valid environment action."""
+     try:
+         return PythonReviewAction.model_validate(payload)
+     except Exception:
+         note = "Model returned no valid action."
+         if payload.get("_parse_error"):
+             note = f"{note} Raw response could not be parsed as JSON."
+         return _build_fallback_action(observation, note)
+
+
+ def _make_env():
+     """Connect to a live environment or launch the Docker image."""
+     if ENV_BASE_URL:
+         return PythonEnv(base_url=ENV_BASE_URL).sync()
+     return asyncio.run(PythonEnv.from_docker_image(DOCKER_IMAGE)).sync()
+
+
+ def _task_result_dict(observation, step_logs: List[Dict[str, Any]]) -> Dict[str, Any]:
+     """Build the report payload for one completed task run."""
+     return {
+         "task_id": observation.task_id,
+         "snippet_id": observation.snippet_id,
+         "score": observation.metrics.current_score,
+         "precision": observation.metrics.precision,
+         "recall": observation.metrics.recall,
+         "f1": observation.metrics.f1,
+         "true_positives": observation.metrics.true_positives,
+         "false_positives": observation.metrics.false_positives,
+         "missed_issues": observation.metrics.missed_issues,
+         "cumulative_reward": observation.metrics.cumulative_reward,
+         "steps": step_logs,
+     }
+
+
+ def main() -> None:
+     """Run the configured model against the benchmark task set."""
+     if not API_KEY:
+         raise RuntimeError("Set HF_TOKEN or OPENAI_API_KEY before running inference.py")
+
+     client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
+     env = _make_env()
+     episode_results: List[Dict[str, Any]] = []
+
+     try:
+         for index, task_id in enumerate(TASK_IDS, start=1):
+             result = env.reset(task_id=task_id)
+             observation = result.observation
+             history: List[str] = []
+             step_logs: List[Dict[str, Any]] = []
+
+             print(f"Task {index}: {task_id} ({observation.snippet_id})")
+
+             for step in range(1, MAX_STEPS + 1):
+                 prompt = _build_prompt(observation, step, history)
+                 try:
+                     payload = _completion(client, prompt)
+                 except Exception as exc:
+                     payload = {"_error": str(exc)}
+
+                 action = _to_action(payload=payload, observation=observation)
+
+                 result = env.step(action)
+                 observation = result.observation
+
+                 step_log = {
+                     "step": step,
+                     "action_type": action.action_type.value,
+                     "line_number": action.line_number,
+                     "reward": result.reward or 0.0,
+                     "score": observation.metrics.current_score,
+                     "done": result.done,
+                     "feedback": observation.feedback,
+                 }
+                 if payload.get("_error"):
+                     step_log["model_error"] = payload["_error"]
+                 if payload.get("_parse_error"):
+                     step_log["parse_error"] = True
+                 step_logs.append(step_log)
+
+                 history.append(
+                     f"step={step} action={action.action_type.value} "
+                     f"line={action.line_number} score={observation.metrics.current_score:.2f} "
+                     f"reward={(result.reward or 0.0):.2f} feedback={observation.feedback}"
+                 )
+
+                 print(
+                     f"  step={step} action={action.action_type.value} "
+                     f"score={observation.metrics.current_score:.2f} reward={(result.reward or 0.0):.2f} "
+                     f"done={result.done}"
+                 )
+
+                 if result.done:
+                     break
+
+             episode_results.append(_task_result_dict(observation, step_logs))
+     finally:
+         env.close()
+
+     mean_score = (
+         sum(item["score"] for item in episode_results) / len(episode_results)
+         if episode_results
+         else 0.0
+     )
+     summary = {
+         "model_name": MODEL_NAME,
+         "api_base_url": API_BASE_URL,
+         "task_count": len(episode_results),
+         "mean_score": mean_score,
+         "results": episode_results,
+     }
+
+     REPORT_PATH.write_text(json.dumps(summary, indent=2), encoding="utf-8")
+     print(json.dumps(summary, indent=2))
+     print(f"\nSaved report to {REPORT_PATH}")
+
+
+ if __name__ == "__main__":
+     main()
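
When the model ignores the "JSON only" rule, `_extract_json_blob` above falls back to taking the outermost brace-delimited span. A standalone sketch of that fallback path (the fenced-output branch is omitted here for brevity):

```python
import json

def extract_json_blob(content: str) -> str:
    # Fallback: take everything from the first '{' to the last '}'.
    start = content.find("{")
    end = content.rfind("}")
    if start != -1 and end != -1 and end > start:
        return content[start:end + 1]
    return content
```

This recovers a usable payload from chatty replies such as `Sure! Here is my review: {"action_type": "APPROVE"}`.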
models.py ADDED
@@ -0,0 +1,304 @@
+ """Shared models for the Python code-review OpenEnv benchmark."""
+
+ from __future__ import annotations
+
+ from enum import Enum
+ from typing import Dict, List, Optional
+
+ from pydantic import BaseModel, Field, model_validator
+ from openenv.core.env_server.types import Action, Observation, State
+
+
+ class Difficulty(str, Enum):
+     EASY = "easy"
+     MEDIUM = "medium"
+     HARD = "hard"
+
+
+ class ActionType(str, Enum):
+     ADD_COMMENT = "ADD_COMMENT"
+     APPROVE = "APPROVE"
+     REQUEST_CHANGES = "REQUEST_CHANGES"
+     ASK_CONTEXT = "ASK_CONTEXT"
+     SKIP_LINE = "SKIP_LINE"
+
+
+ class IssueType(str, Enum):
+     STYLE = "STYLE"
+     LOGIC = "LOGIC"
+     SECURITY = "SECURITY"
+     PERFORMANCE = "PERFORMANCE"
+     DOCS = "DOCS"
+
+
+ class Severity(str, Enum):
+     LOW = "LOW"
+     MEDIUM = "MEDIUM"
+     HIGH = "HIGH"
+     CRITICAL = "CRITICAL"
+
+
+ class GoldIssue(BaseModel):
+     """Hidden benchmark annotation for one issue in a snippet."""
+
+     issue_id: str
+     line: int = Field(..., ge=1)
+     issue_type: IssueType
+     severity: Severity
+     description: str
+     required: bool = True
+     explanation_keywords: List[str] = Field(default_factory=list)
+     fix_keywords: List[str] = Field(default_factory=list)
+     owasp_category: Optional[str] = None
+     owasp_keywords: List[str] = Field(default_factory=list)
+
+
+ class ReviewComment(BaseModel):
+     """Stored review action visible to the agent in `review_history`."""
+
+     step_index: int = Field(..., ge=1)
+     action_type: ActionType
+     line_number: Optional[int] = Field(default=None, ge=1)
+     issue_type: Optional[IssueType] = None
+     severity: Optional[Severity] = None
+     comment: Optional[str] = None
+     suggestion: Optional[str] = None
+     question: Optional[str] = None
+     matched_issue_ids: List[str] = Field(default_factory=list)
+     reward_delta: float = 0.0
+
+
+ class CodeReviewSnippet(BaseModel):
+     """Benchmark sample loaded from JSON."""
+
+     snippet_id: str
+     filename: str
+     code: str
+     context: Optional[str] = None
+     diff: Optional[str] = None
+     gold_issues: List[GoldIssue]
+     must_approve: bool = False
+     must_reject: bool = True
+
+
+ class TaskMetadata(BaseModel):
+     """Visible task-family metadata."""
+
+     task_id: str
+     name: str
+     difficulty: Difficulty
+     description: str
+     snippet_count: int = Field(..., ge=0)
+     max_steps: int = Field(..., ge=1)
+     min_score: float = Field(default=0.0, ge=0.0, le=1.0)
+     max_score: float = Field(default=1.0, ge=0.0, le=1.0)
+
+
+ class ReviewFinding(BaseModel):
+     """Compatibility shim for earlier template-derived environment code."""
+
+     title: str = ""
+     line: Optional[int] = Field(default=None, ge=1)
+     category: str = "bug"
+     severity: str = "warning"
+     rationale: str = ""
+     recommendation: Optional[str] = None
+     rule_id: Optional[str] = None
+
+
+ class TaskDescriptor(BaseModel):
+     """Compatibility shim for earlier template-derived environment code."""
+
+     task_id: str
+     difficulty: str
+     title: str
+     objective: str
+     code: str
+     max_steps: int = Field(..., ge=1)
+     success_threshold: float = Field(default=0.0, ge=0.0, le=1.0)
+
+
+ class TaskEvaluation(BaseModel):
+     """Compatibility shim for earlier template-derived environment code."""
+
+     matched_reference_ids: List[str] = Field(default_factory=list)
+     matched_findings: int = 0
+     total_findings: int = 0
+     false_positives: int = 0
+     duplicate_findings: int = 0
+     weighted_recall: float = 0.0
+     patch_score: float = 0.0
+     score: float = 0.0
+     passed: bool = False
+
+
+ class PythonEnvConfig(BaseModel):
+     """Environment configuration used by the benchmark runtime."""
+
+     task_order: List[str] = Field(
+         default_factory=lambda: ["task_easy", "task_medium", "task_hard"]
+     )
+     max_steps_per_task: int = Field(default=25, ge=1, le=100)
+     max_history_entries: int = Field(default=200, ge=1, le=1000)
+     rotate_tasks: bool = True
+
+
+ class EpisodeMetrics(BaseModel):
+     """Current episode metrics for UI, evaluation, and RL logging."""
+
+     precision: float = Field(default=0.0, ge=0.0, le=1.0)
+     recall: float = Field(default=0.0, ge=0.0, le=1.0)
+     f1: float = Field(default=0.0, ge=0.0, le=1.0)
+     true_positives: int = Field(default=0, ge=0)
+     false_positives: int = Field(default=0, ge=0)
+     missed_issues: int = Field(default=0, ge=0)
+     required_found: int = Field(default=0, ge=0)
+     required_total: int = Field(default=0, ge=0)
+     bonus_found: int = Field(default=0, ge=0)
+     duplicate_comments: int = Field(default=0, ge=0)
+     context_requests: int = Field(default=0, ge=0)
+     skipped_clean_lines: int = Field(default=0, ge=0)
+     skipped_issue_lines: int = Field(default=0, ge=0)
+     current_score: float = Field(default=0.0, ge=0.0, le=1.0)
+     cumulative_reward: float = 0.0
+     breakdown: Dict[str, float] = Field(default_factory=dict)
+
+
+ class RewardSummary(BaseModel):
+     """Reward details from the most recent step."""
+
+     step_reward: float = 0.0
+     cumulative_reward: float = 0.0
+     breakdown: Dict[str, float] = Field(default_factory=dict)
+     false_positives: int = Field(default=0, ge=0)
+     true_positives: int = Field(default=0, ge=0)
+     missed_issues: int = Field(default=0, ge=0)
+
+
+ class PythonReviewAction(Action):
+     """Structured review action emitted by a model or trainer."""
180
+
181
+ action_type: ActionType
182
+ line_number: Optional[int] = Field(default=None, ge=1)
183
+ issue_type: Optional[IssueType] = None
184
+ severity: Optional[Severity] = None
185
+ comment: Optional[str] = None
186
+ suggestion: Optional[str] = None
187
+ question: Optional[str] = None
188
+
189
+ # Template compatibility
190
+ operation: str = "submit_findings"
191
+ findings: List[ReviewFinding] = Field(default_factory=list)
192
+ patched_code: Optional[str] = None
193
+
194
+ @model_validator(mode="after")
195
+ def validate_action_shape(self) -> "PythonReviewAction":
196
+ """Require the right fields for each action type."""
197
+
198
+ if self.action_type == ActionType.ADD_COMMENT:
199
+ missing = []
200
+ if self.line_number is None:
201
+ missing.append("line_number")
202
+ if self.issue_type is None:
203
+ missing.append("issue_type")
204
+ if self.severity is None:
205
+ missing.append("severity")
206
+ if not (self.comment or "").strip():
207
+ missing.append("comment")
208
+ if missing:
209
+ raise ValueError("ADD_COMMENT requires: " + ", ".join(missing))
210
+ elif self.action_type == ActionType.SKIP_LINE:
211
+ if self.line_number is None:
212
+ raise ValueError("SKIP_LINE requires line_number")
213
+ elif self.action_type == ActionType.ASK_CONTEXT:
214
+ if not (self.question or self.comment or "").strip():
215
+ raise ValueError("ASK_CONTEXT requires question or comment")
216
+ elif self.action_type in {ActionType.APPROVE, ActionType.REQUEST_CHANGES}:
217
+ noisy_fields = {
218
+ "line_number": self.line_number,
219
+ "issue_type": self.issue_type,
220
+ "severity": self.severity,
221
+ "comment": self.comment,
222
+ "suggestion": self.suggestion,
223
+ "question": self.question,
224
+ }
225
+ populated = [
226
+ name for name, value in noisy_fields.items() if value not in (None, "")
227
+ ]
228
+ if populated:
229
+ raise ValueError(
230
+ f"{self.action_type.value} does not accept extra fields: {', '.join(populated)}"
231
+ )
232
+ return self
233
+
234
+
235
+ class PythonReviewObservation(Observation):
236
+ """Observation returned by reset/step, including trainer-visible metrics."""
237
+
238
+ snippet_id: str
239
+ code: str
240
+ filename: str
241
+ language: str = "python"
242
+ context: Optional[str] = None
243
+ diff: Optional[str] = None
244
+ line_count: int = Field(..., ge=0)
245
+ current_step: int = Field(..., ge=0)
246
+ max_steps: int = Field(..., ge=1)
247
+ task_id: str
248
+ review_history: List[ReviewComment] = Field(default_factory=list)
249
+ lines: List[str] = Field(default_factory=list)
250
+ reward_summary: RewardSummary = Field(default_factory=RewardSummary)
251
+ metrics: EpisodeMetrics = Field(default_factory=EpisodeMetrics)
252
+ feedback: str = ""
253
+
254
+ # Template compatibility
255
+ task: Optional[TaskDescriptor] = None
256
+ instructions: str = ""
257
+ submitted_findings: List[ReviewFinding] = Field(default_factory=list)
258
+ hints_used: int = 0
259
+ attempts_remaining: int = 0
260
+ evaluation: Optional[TaskEvaluation] = None
261
+ score: float = 0.0
262
+
263
+
264
+ class PythonReviewState(State):
265
+ """Full server-side state exposed by `/state`."""
266
+
267
+ task_id: Optional[str] = None
268
+ difficulty: Optional[Difficulty] = None
269
+ snippet_id: Optional[str] = None
270
+ current_step: int = Field(default=0, ge=0)
271
+ max_steps: int = Field(default=0, ge=0)
272
+ done: bool = False
273
+ filename: Optional[str] = None
274
+ review_history: List[ReviewComment] = Field(default_factory=list)
275
+ metrics: EpisodeMetrics = Field(default_factory=EpisodeMetrics)
276
+ last_feedback: str = ""
277
+
278
+
279
+ class TaskListResponse(BaseModel):
280
+ tasks: List[TaskMetadata] = Field(default_factory=list)
281
+
282
+
283
+ class MetricsResponse(BaseModel):
284
+ task_id: Optional[str] = None
285
+ snippet_id: Optional[str] = None
286
+ done: bool = False
287
+ metrics: EpisodeMetrics = Field(default_factory=EpisodeMetrics)
288
+
289
+
290
+ class HealthResponse(BaseModel):
291
+ status: str = "ok"
292
+ environment: str = "python_code_review_env"
293
+ task_count: int = Field(default=0, ge=0)
294
+ active_task_id: Optional[str] = None
295
+ active_snippet_id: Optional[str] = None
296
+ active_episode_id: Optional[str] = None
297
+
298
+
299
+ PythonAction = PythonReviewAction
300
+ PythonObservation = PythonReviewObservation
301
+ PythonState = PythonReviewState
302
+ CodeReviewAction = PythonReviewAction
303
+ CodeReviewObservation = PythonReviewObservation
304
+ CodeReviewConfig = PythonEnvConfig
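The `validate_action_shape` validator above requires a different field set per `action_type`; for `ADD_COMMENT` it collects every missing field before raising. A minimal plain-Python sketch of that required-fields rule (no pydantic, field names mirror the model; the function name is ours, not part of the package):

```python
# Sketch of the ADD_COMMENT required-fields check from
# PythonReviewAction.validate_action_shape, as a standalone function.
from typing import List, Optional


def missing_add_comment_fields(
    line_number: Optional[int],
    issue_type: Optional[str],
    severity: Optional[str],
    comment: Optional[str],
) -> List[str]:
    """Return the names of fields an ADD_COMMENT action is still missing."""
    missing: List[str] = []
    if line_number is None:
        missing.append("line_number")
    if issue_type is None:
        missing.append("issue_type")
    if severity is None:
        missing.append("severity")
    # Whitespace-only comments count as missing, matching the validator.
    if not (comment or "").strip():
        missing.append("comment")
    return missing


print(missing_add_comment_fields(4, "STYLE", "LOW", "Rename 'l'."))  # []
print(missing_add_comment_fields(None, None, "LOW", "  "))  # ['line_number', 'issue_type', 'comment']
```

Collecting all missing fields into one error message, rather than failing on the first, gives a policy model complete feedback in a single rejected step.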
openenv.yaml ADDED
@@ -0,0 +1,6 @@
+ spec_version: 1
+ name: python-code-review
+ type: space
+ runtime: fastapi
+ app: server.app:app
+ port: 8000
openenv_python_env.egg-info/PKG-INFO ADDED
@@ -0,0 +1,11 @@
+ Metadata-Version: 2.4
+ Name: openenv-python_env
+ Version: 0.1.0
+ Summary: Python Env environment for OpenEnv
+ Requires-Python: >=3.10
+ Requires-Dist: openenv-core[core]>=0.2.2
+ Requires-Dist: httpx>=0.28.1
+ Requires-Dist: pydantic>=2.12.5
+ Provides-Extra: dev
+ Requires-Dist: pytest>=8.0.0; extra == "dev"
+ Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
openenv_python_env.egg-info/SOURCES.txt ADDED
@@ -0,0 +1,27 @@
+ README.md
+ __init__.py
+ client.py
+ inference.py
+ models.py
+ pyproject.toml
+ ./__init__.py
+ ./client.py
+ ./inference.py
+ ./models.py
+ ./rollout.py
+ openenv_python_env.egg-info/PKG-INFO
+ openenv_python_env.egg-info/SOURCES.txt
+ openenv_python_env.egg-info/dependency_links.txt
+ openenv_python_env.egg-info/entry_points.txt
+ openenv_python_env.egg-info/requires.txt
+ openenv_python_env.egg-info/top_level.txt
+ server/__init__.py
+ server/app.py
+ server/grading.py
+ server/python_env_environment.py
+ server/review_runtime.py
+ server/task_bank.py
+ server/data/snippets_easy.json
+ server/data/snippets_hard.json
+ server/data/snippets_medium.json
+ tests/test_env.py
openenv_python_env.egg-info/dependency_links.txt ADDED
@@ -0,0 +1 @@
+
openenv_python_env.egg-info/entry_points.txt ADDED
@@ -0,0 +1,2 @@
+ [console_scripts]
+ server = python_env.server.app:main
openenv_python_env.egg-info/requires.txt ADDED
@@ -0,0 +1,7 @@
+ openenv-core[core]>=0.2.2
+ httpx>=0.28.1
+ pydantic>=2.12.5
+
+ [dev]
+ pytest>=8.0.0
+ pytest-cov>=4.0.0
openenv_python_env.egg-info/top_level.txt ADDED
@@ -0,0 +1 @@
+ python_env
pyproject.toml ADDED
@@ -0,0 +1,50 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ [build-system]
+ requires = ["setuptools>=45", "wheel"]
+ build-backend = "setuptools.build_meta"
+
+ [project]
+ name = "openenv-python_env"
+ version = "0.1.0"
+ description = "Python Env environment for OpenEnv"
+ requires-python = ">=3.10"
+ dependencies = [
+     # Core OpenEnv runtime (provides FastAPI server + HTTP client types)
+     # install from github
+     # "openenv-core[core] @ git+https://github.com/meta-pytorch/OpenEnv.git",
+     "openenv-core[core]>=0.2.2",
+     # Environment-specific dependencies
+     # Add all dependencies needed for your environment here
+     # Examples:
+     # "numpy>=1.19.0",
+     # "torch>=2.0.0",
+     # "gymnasium>=0.29.0",
+     # "openspiel>=1.0.0",
+     # "smolagents>=1.22.0,<2",
+     "httpx>=0.28.1",
+     "pydantic>=2.12.5",
+ ]
+
+ [project.optional-dependencies]
+ dev = [
+     "pytest>=8.0.0",
+     "pytest-cov>=4.0.0",
+ ]
+
+ [project.scripts]
+ # Server entry point - enables running via: uv run --project . server
+ # or: python -m python_env.server.app
+ server = "python_env.server.app:main"
+
+ [tool.setuptools]
+ include-package-data = true
+ packages = ["python_env", "python_env.server"]
+ package-dir = { "python_env" = ".", "python_env.server" = "server" }
+
+ [tool.setuptools.package-data]
+ "python_env.server" = ["data/*.json"]
rollout.py ADDED
@@ -0,0 +1,71 @@
+ """Trajectory collection helpers for RL-style training loops."""
+
+ from __future__ import annotations
+
+ import json
+ from dataclasses import dataclass, asdict
+ from pathlib import Path
+ from typing import Callable, Dict, List, Optional
+
+ from client import PythonEnv
+ from models import PythonReviewAction
+
+
+ @dataclass
+ class TrajectoryStep:
+     observation: Dict[str, object]
+     action: Dict[str, object]
+     reward: float
+     done: bool
+
+
+ @dataclass
+ class TrajectoryEpisode:
+     task_id: str
+     snippet_id: str
+     final_score: float
+     cumulative_reward: float
+     steps: List[TrajectoryStep]
+
+
+ PolicyFn = Callable[[object], PythonReviewAction]
+
+
+ def collect_episode(env, task_id: str, policy: PolicyFn, max_steps: Optional[int] = None) -> TrajectoryEpisode:
+     """Collect one benchmark episode for an external trainer."""
+
+     result = env.reset(task_id=task_id)
+     observation = result.observation
+     steps: List[TrajectoryStep] = []
+     step_limit = max_steps or observation.max_steps
+
+     for _ in range(step_limit):
+         action = policy(observation)
+         result = env.step(action)
+         steps.append(
+             TrajectoryStep(
+                 observation=observation.model_dump(),
+                 action=action.model_dump(exclude_none=True),
+                 reward=float(result.reward or 0.0),
+                 done=bool(result.done),
+             )
+         )
+         observation = result.observation
+         if result.done:
+             break
+
+     return TrajectoryEpisode(
+         task_id=observation.task_id,
+         snippet_id=observation.snippet_id,
+         final_score=observation.metrics.current_score,
+         cumulative_reward=observation.metrics.cumulative_reward,
+         steps=steps,
+     )
+
+
+ def write_jsonl(episodes: List[TrajectoryEpisode], output_path: str | Path) -> None:
+     """Persist collected trajectories in a trainer-friendly JSONL format."""
+
+     path = Path(output_path)
+     lines = [json.dumps(asdict(episode)) for episode in episodes]
+     path.write_text("\n".join(lines), encoding="utf-8")
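`collect_episode` drives a reset/step loop: act, record the reward, swap in the new observation, and stop early when the episode reports done. A self-contained sketch of that same loop, with stub objects standing in for the live `PythonEnv` client (everything named `Stub*` is hypothetical scaffolding, not part of the package):

```python
# Stub-driven sketch of the collect_episode control flow from rollout.py.
# StubEnv and StubObservation are invented here purely to exercise the loop.
from dataclasses import dataclass
from typing import Any, List


@dataclass
class StubObservation:
    task_id: str = "task_easy"
    snippet_id: str = "easy_001"
    max_steps: int = 3


@dataclass
class StubResult:
    observation: StubObservation
    reward: float
    done: bool


class StubEnv:
    """Pays +0.5 per step and ends the episode after two steps."""

    def __init__(self) -> None:
        self._steps = 0

    def reset(self, task_id: str) -> StubResult:
        self._steps = 0
        return StubResult(StubObservation(task_id=task_id), 0.0, False)

    def step(self, action: Any) -> StubResult:
        self._steps += 1
        return StubResult(StubObservation(), 0.5, self._steps >= 2)


def collect(env: StubEnv, task_id: str) -> List[float]:
    """Mirror collect_episode: step until done or max_steps, record rewards."""
    result = env.reset(task_id=task_id)
    observation = result.observation
    rewards: List[float] = []
    for _ in range(observation.max_steps):
        result = env.step("approve")  # a real policy maps observation -> action
        rewards.append(result.reward)
        observation = result.observation
        if result.done:
            break
    return rewards


rewards = collect(StubEnv(), "task_easy")
print(rewards)  # [0.5, 0.5] -- the loop stops early once done=True
```

Note the ordering choice in the real helper: each `TrajectoryStep` pairs the observation the policy *saw* with the action it chose and the reward that followed, which is the (s, a, r) alignment most trainers expect.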
server/__init__.py ADDED
@@ -0,0 +1,11 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """Python Env environment server components."""
+
+ from .python_env_environment import PythonEnvironment
+
+ __all__ = ["PythonEnvironment"]
server/app.py ADDED
@@ -0,0 +1,148 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """
+ FastAPI application for the Python Env Environment.
+
+ This module creates an HTTP server that exposes the PythonEnvironment
+ over HTTP and WebSocket endpoints, compatible with EnvClient.
+
+ Endpoints:
+     - POST /reset: Reset the environment
+     - POST /step: Execute an action
+     - GET /state: Get current environment state
+     - GET /schema: Get action/observation schemas
+     - WS /ws: WebSocket endpoint for persistent sessions
+
+ Usage:
+     # Development (with auto-reload):
+     uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
+
+     # Production:
+     uvicorn server.app:app --host 0.0.0.0 --port 8000 --workers 4
+
+     # Or run directly:
+     python -m server.app
+ """
+
+ from fastapi.routing import APIRoute
+
+ try:
+     from openenv.core.env_server.http_server import create_app
+ except Exception as e:  # pragma: no cover
+     raise ImportError(
+         "openenv is required for the web interface. Install dependencies with '\n uv sync\n'"
+     ) from e
+
+ try:
+     from ..models import (
+         HealthResponse,
+         MetricsResponse,
+         PythonAction,
+         PythonObservation,
+         PythonState,
+         TaskListResponse,
+     )
+     from .python_env_environment import (
+         PythonEnvironment,
+         get_current_state,
+         get_health_response,
+         get_metrics_response,
+         get_tasks_response,
+     )
+ except ImportError:
+     from models import (  # type: ignore
+         HealthResponse,
+         MetricsResponse,
+         PythonAction,
+         PythonObservation,
+         PythonState,
+         TaskListResponse,
+     )
+     from server.python_env_environment import (  # type: ignore
+         PythonEnvironment,
+         get_current_state,
+         get_health_response,
+         get_metrics_response,
+         get_tasks_response,
+     )
+
+
+ # Create the app with web interface and README integration
+ app = create_app(
+     PythonEnvironment,
+     PythonAction,
+     PythonObservation,
+     env_name="python_env",
+     max_concurrent_envs=1,  # increase this number to allow more concurrent WebSocket sessions
+ )
+
+
+ def _remove_get_route(path: str) -> None:
+     app.router.routes = [
+         route
+         for route in app.router.routes
+         if not (
+             isinstance(route, APIRoute)
+             and route.path == path
+             and "GET" in route.methods
+         )
+     ]
+
+
+ _remove_get_route("/health")
+ _remove_get_route("/state")
+
+
+ @app.get("/health", response_model=HealthResponse, tags=["Health"])
+ async def health() -> HealthResponse:
+     return get_health_response()
+
+
+ @app.get("/state", response_model=PythonState, tags=["State Management"])
+ async def state() -> PythonState:
+     return get_current_state()
+
+
+ @app.get("/tasks", response_model=TaskListResponse, tags=["Environment Info"])
+ async def tasks() -> TaskListResponse:
+     return get_tasks_response()
+
+
+ @app.get("/metrics", response_model=MetricsResponse, tags=["Environment Info"])
+ async def metrics() -> MetricsResponse:
+     return get_metrics_response()
+
+
+ def main(host: str = "0.0.0.0", port: int = 8000):
+     """
+     Entry point for direct execution via uv run or python -m.
+
+     This function enables running the server without Docker:
+         uv run --project . server
+         uv run --project . server --port 8001
+         python -m python_env.server.app
+
+     Args:
+         host: Host address to bind to (default: "0.0.0.0")
+         port: Port number to listen on (default: 8000)
+
+     For production deployments, consider using uvicorn directly with
+     multiple workers:
+         uvicorn python_env.server.app:app --workers 4
+     """
+     import uvicorn
+
+     uvicorn.run(app, host=host, port=port)
+
+
+ if __name__ == "__main__":
+     import argparse
+
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--port", type=int, default=8000)
+     args = parser.parse_args()
+     main(port=args.port)
server/data/snippets_easy.json ADDED
@@ -0,0 +1,238 @@
+ [
+   {
+     "snippet_id": "easy_001",
+     "filename": "utils.py",
+     "code": "import math\n\ndef build_label(value):\n    l = str(value)\n    return l.upper()",
+     "context": "Utility helper used by several modules.",
+     "gold_issues": [
+       {
+         "issue_id": "easy_001_docs",
+         "line": 3,
+         "issue_type": "DOCS",
+         "severity": "LOW",
+         "description": "Missing docstring on public function.",
+         "required": false,
+         "explanation_keywords": ["docstring", "documentation", "public"]
+       },
+       {
+         "issue_id": "easy_001_name",
+         "line": 4,
+         "issue_type": "STYLE",
+         "severity": "LOW",
+         "description": "Variable name 'l' is ambiguous (PEP8 E741).",
+         "required": true,
+         "explanation_keywords": ["ambiguous", "pep8", "e741", "variable", "name"]
+       }
+     ],
+     "must_approve": false,
+     "must_reject": true
+   },
+   {
+     "snippet_id": "easy_002",
+     "filename": "cleanup.py",
+     "code": "def normalize_total(total, fee):\n    result=total+fee\n    return result",
+     "gold_issues": [
+       {
+         "issue_id": "easy_002_spacing",
+         "line": 2,
+         "issue_type": "STYLE",
+         "severity": "LOW",
+         "description": "Missing whitespace around operators.",
+         "required": true,
+         "explanation_keywords": ["whitespace", "spacing", "operator", "pep8"]
+       }
+     ],
+     "must_approve": false,
+     "must_reject": true
+   },
+   {
+     "snippet_id": "easy_003",
+     "filename": "deploy.py",
+     "code": "def deploy_service(name):\n    print(\"deploying\", name)\n    return name.lower()",
+     "context": "Runs during an automated deployment pipeline.",
+     "gold_issues": [
+       {
+         "issue_id": "easy_003_print",
+         "line": 2,
+         "issue_type": "STYLE",
+         "severity": "LOW",
+         "description": "Leftover print statement in production code.",
+         "required": true,
+         "explanation_keywords": ["print", "debug", "production", "logging"]
+       }
+     ],
+     "must_approve": false,
+     "must_reject": true
+   },
+   {
+     "snippet_id": "easy_004",
+     "filename": "imports.py",
+     "code": "import os\n\ndef slugify(name):\n    return name.strip().lower().replace(\" \", \"-\")",
+     "gold_issues": [
+       {
+         "issue_id": "easy_004_unused_import",
+         "line": 1,
+         "issue_type": "STYLE",
+         "severity": "LOW",
+         "description": "Unused import `os`.",
+         "required": true,
+         "explanation_keywords": ["unused", "import", "os"]
+       },
+       {
+         "issue_id": "easy_004_docs",
+         "line": 3,
+         "issue_type": "DOCS",
+         "severity": "LOW",
+         "description": "Missing docstring on public function.",
+         "required": false,
+         "explanation_keywords": ["docstring", "documentation", "public"]
+       }
+     ],
+     "must_approve": false,
+     "must_reject": true
+   },
+   {
+     "snippet_id": "easy_005",
+     "filename": "pricing.py",
+     "code": "def render_banner(product_name):\n    return f\"Product {product_name} ships worldwide with next business day handling and standard insured delivery included.\"",
+     "gold_issues": [
+       {
+         "issue_id": "easy_005_long_line",
+         "line": 2,
+         "issue_type": "STYLE",
+         "severity": "LOW",
+         "description": "Line exceeds 79 characters.",
+         "required": true,
+         "explanation_keywords": ["line", "79", "too long", "length"]
+       }
+     ],
+     "must_approve": false,
+     "must_reject": true
+   },
+   {
+     "snippet_id": "easy_006",
+     "filename": "stats.py",
+     "code": "def summarize(items):\n    total = len(items)\n    O = total / 2\n    return total, O",
+     "gold_issues": [
+       {
+         "issue_id": "easy_006_name",
+         "line": 3,
+         "issue_type": "STYLE",
+         "severity": "LOW",
+         "description": "Variable name 'O' is ambiguous (PEP8 E741).",
+         "required": true,
+         "explanation_keywords": ["ambiguous", "pep8", "e741", "variable", "name"]
+       },
+       {
+         "issue_id": "easy_006_docs",
+         "line": 1,
+         "issue_type": "DOCS",
+         "severity": "LOW",
+         "description": "Missing docstring on public function.",
+         "required": false,
+         "explanation_keywords": ["docstring", "documentation", "public"]
+       }
+     ],
+     "must_approve": false,
+     "must_reject": true
+   },
+   {
+     "snippet_id": "easy_007",
+     "filename": "parser.py",
+     "code": "import json\n\ndef parse_flag(flag):\n    I = flag.strip()\n    return I.lower() == \"yes\"",
+     "gold_issues": [
+       {
+         "issue_id": "easy_007_unused_import",
+         "line": 1,
+         "issue_type": "STYLE",
+         "severity": "LOW",
+         "description": "Unused import `json`.",
+         "required": true,
+         "explanation_keywords": ["unused", "import", "json"]
+       },
+       {
+         "issue_id": "easy_007_name",
+         "line": 4,
+         "issue_type": "STYLE",
+         "severity": "LOW",
+         "description": "Variable name 'I' is ambiguous (PEP8 E741).",
+         "required": true,
+         "explanation_keywords": ["ambiguous", "pep8", "e741", "variable", "name"]
+       }
+     ],
+     "must_approve": false,
+     "must_reject": true
+   },
+   {
+     "snippet_id": "easy_008",
+     "filename": "notifier.py",
+     "code": "def notify(user, count):\n    message = user + \":\" + str(count)\n    print(message)\n    return message",
+     "gold_issues": [
+       {
+         "issue_id": "easy_008_print",
+         "line": 3,
+         "issue_type": "STYLE",
+         "severity": "LOW",
+         "description": "Leftover print statement in production code.",
+         "required": true,
+         "explanation_keywords": ["print", "debug", "production", "logging"]
+       },
+       {
+         "issue_id": "easy_008_docs",
+         "line": 1,
+         "issue_type": "DOCS",
+         "severity": "LOW",
+         "description": "Missing docstring on public function.",
+         "required": false,
+         "explanation_keywords": ["docstring", "documentation", "public"]
+       }
+     ],
+     "must_approve": false,
+     "must_reject": true
+   },
+   {
+     "snippet_id": "easy_009",
+     "filename": "math_helpers.py",
+     "code": "def add_fee(total, fee):\n    amount = total+fee\n    return amount",
+     "gold_issues": [
+       {
+         "issue_id": "easy_009_spacing",
+         "line": 2,
+         "issue_type": "STYLE",
+         "severity": "LOW",
+         "description": "Missing whitespace around operators.",
+         "required": true,
+         "explanation_keywords": ["whitespace", "spacing", "operator", "pep8"]
+       }
+     ],
+     "must_approve": false,
+     "must_reject": true
+   },
+   {
+     "snippet_id": "easy_010",
+     "filename": "views.py",
+     "code": "import datetime\n\ndef build_title(user_name):\n    return f\"Welcome {user_name}, thanks for joining the quarterly partner enablement kickoff meeting today.\"",
+     "gold_issues": [
+       {
+         "issue_id": "easy_010_unused_import",
+         "line": 1,
+         "issue_type": "STYLE",
+         "severity": "LOW",
+         "description": "Unused import `datetime`.",
+         "required": true,
+         "explanation_keywords": ["unused", "import", "datetime"]
+       },
+       {
+         "issue_id": "easy_010_long_line",
+         "line": 4,
+         "issue_type": "STYLE",
+         "severity": "LOW",
+         "description": "Line exceeds 79 characters.",
+         "required": true,
+         "explanation_keywords": ["line", "79", "too long", "length"]
+       }
+     ],
+     "must_approve": false,
+     "must_reject": true
+   }
+ ]
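Each gold issue in these snippet files carries `explanation_keywords` that a reviewer's comment can be scored against. One plausible matching rule is case-insensitive substring matching; this is an illustrative sketch only, and the environment's actual scoring lives in server/grading.py and may well differ:

```python
# Illustrative keyword matching against a gold issue like those in
# snippets_easy.json. This is a sketch, not the environment's grading rule.
from typing import List


def matched_keywords(comment: str, keywords: List[str]) -> List[str]:
    """Return the gold keywords that appear (case-insensitively) in a comment."""
    lowered = comment.lower()
    return [kw for kw in keywords if kw.lower() in lowered]


# Keywords copied from gold issue easy_001_name above.
gold_keywords = ["ambiguous", "pep8", "e741", "variable", "name"]

comment = "Variable name 'l' is ambiguous; see PEP8 E741."
hits = matched_keywords(comment, gold_keywords)
print(hits)  # ['ambiguous', 'pep8', 'e741', 'variable', 'name']
```

Substring matching keeps the rule cheap and order-independent; a stricter grader might instead require a minimum number of keyword hits before crediting the issue.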
server/data/snippets_hard.json ADDED
@@ -0,0 +1,214 @@
+ [
+   {
+     "snippet_id": "hard_001",
+     "filename": "db.py",
+     "code": "def load_user(cursor, user_id):\n    query = f\"SELECT * FROM users WHERE id = {user_id}\"\n    return cursor.execute(query).fetchone()",
+     "context": "Used by an internal admin dashboard.",
+     "gold_issues": [
+       {
+         "issue_id": "hard_001_sqli",
+         "line": 2,
+         "issue_type": "SECURITY",
+         "severity": "CRITICAL",
+         "description": "SQL query interpolates user input directly and is vulnerable to SQL injection.",
+         "required": true,
+         "explanation_keywords": ["sql injection", "parameterized", "query", "interpolate", "user input"],
+         "fix_keywords": ["parameterized", "placeholder", "bind", "params"],
+         "owasp_category": "A03:2021-Injection",
+         "owasp_keywords": ["owasp", "injection", "a03"]
+       }
+     ],
+     "must_approve": false,
+     "must_reject": true
+   },
+   {
+     "snippet_id": "hard_002",
+     "filename": "settings.py",
+     "code": "API_KEY = \"sk-live-hardcoded-secret\"\n\ndef get_auth_header():\n    return {\"Authorization\": f\"Bearer {API_KEY}\"}",
+     "gold_issues": [
+       {
+         "issue_id": "hard_002_secret",
+         "line": 1,
+         "issue_type": "SECURITY",
+         "severity": "HIGH",
+         "description": "Hardcoded API secret in source code.",
+         "required": true,
+         "explanation_keywords": ["hardcoded", "secret", "api key", "credential"],
+         "fix_keywords": ["environment", "vault", "secret manager", "config"],
+         "owasp_category": "A07:2021-Identification and Authentication Failures",
+         "owasp_keywords": ["secret", "credential", "authentication", "owasp"]
+       }
+     ],
+     "must_approve": false,
+     "must_reject": true
+   },
+   {
+     "snippet_id": "hard_003",
+     "filename": "storage.py",
+     "code": "from pathlib import Path\n\ndef read_report(base_dir, user_filename):\n    path = Path(base_dir) / user_filename\n    return path.read_text()",
+     "gold_issues": [
+       {
+         "issue_id": "hard_003_path_traversal",
+         "line": 4,
+         "issue_type": "SECURITY",
+         "severity": "HIGH",
+         "description": "User-controlled file path can escape the intended directory.",
+         "required": true,
+         "explanation_keywords": ["path traversal", "directory", "user filename", "path"],
+         "fix_keywords": ["resolve", "validate", "allowlist", "basename"],
+         "owasp_category": "A01:2021-Broken Access Control",
+         "owasp_keywords": ["broken access control", "path traversal", "owasp", "a01"]
+       }
+     ],
+     "must_approve": false,
+     "must_reject": true
+   },
+   {
+     "snippet_id": "hard_004",
+     "filename": "net.py",
+     "code": "import os\n\ndef ping(hostname):\n    return os.system(f\"ping -c 1 {hostname}\")",
+     "gold_issues": [
+       {
+         "issue_id": "hard_004_command_injection",
+         "line": 4,
+         "issue_type": "SECURITY",
+         "severity": "CRITICAL",
+         "description": "Shell command is built from user input, allowing command injection.",
+         "required": true,
+         "explanation_keywords": ["command injection", "shell", "os.system", "user input"],
+         "fix_keywords": ["subprocess", "argument list", "shell false", "validate"],
+         "owasp_category": "A03:2021-Injection",
+         "owasp_keywords": ["owasp", "injection", "a03", "command injection"]
+       }
+     ],
+     "must_approve": false,
+     "must_reject": true
+   },
+   {
+     "snippet_id": "hard_005",
+     "filename": "serializer.py",
+     "code": "import pickle\n\ndef load_session(user_data):\n    return pickle.loads(user_data)",
+     "gold_issues": [
+       {
+         "issue_id": "hard_005_pickle",
+         "line": 4,
+         "issue_type": "SECURITY",
+         "severity": "CRITICAL",
+         "description": "Untrusted pickle deserialization can lead to arbitrary code execution.",
+         "required": true,
+         "explanation_keywords": ["pickle", "deserialization", "arbitrary code", "untrusted"],
+         "fix_keywords": ["json", "safe format", "validate", "trusted"],
+         "owasp_category": "A08:2021-Software and Data Integrity Failures",
+         "owasp_keywords": ["integrity", "deserialization", "owasp", "a08"]
+       }
+     ],
+     "must_approve": false,
+     "must_reject": true
+   },
+   {
+     "snippet_id": "hard_006",
+     "filename": "crypto.py",
+     "code": "import hashlib\n\ndef hash_password(password):\n    return hashlib.md5(password.encode()).hexdigest()",
+     "gold_issues": [
+       {
+         "issue_id": "hard_006_weak_crypto",
+         "line": 4,
+         "issue_type": "SECURITY",
+         "severity": "HIGH",
+         "description": "Uses MD5 for password hashing, which is cryptographically weak.",
+         "required": true,
+         "explanation_keywords": ["md5", "weak", "hash", "password", "crypto"],
+         "fix_keywords": ["bcrypt", "argon2", "scrypt", "salt"],
+         "owasp_category": "A02:2021-Cryptographic Failures",
+         "owasp_keywords": ["cryptographic", "owasp", "a02", "weak crypto"]
+       }
+     ],
+     "must_approve": false,
+     "must_reject": true
+   },
+   {
+     "snippet_id": "hard_007",
+     "filename": "shell.py",
+     "code": "import subprocess\n\ndef run(cmd):\n    return subprocess.run(cmd, shell=True, check=True)",
+     "gold_issues": [
+       {
+         "issue_id": "hard_007_shell_true",
+         "line": 4,
+         "issue_type": "SECURITY",
+         "severity": "CRITICAL",
+         "description": "Runs shell commands with shell=True on untrusted input.",
+         "required": true,
+         "explanation_keywords": ["shell=true", "subprocess", "command injection", "shell"],
+         "fix_keywords": ["shell false", "argument list", "validate", "subprocess"],
+         "owasp_category": "A03:2021-Injection",
+         "owasp_keywords": ["owasp", "injection", "a03", "shell"]
+       }
+     ],
+     "must_approve": false,
+     "must_reject": true
+   },
+   {
+     "snippet_id": "hard_008",
+     "filename": "redirects.py",
+     "code": "def login_redirect(next_url):\n    return {\"location\": next_url, \"status\": 302}",
+     "gold_issues": [
+       {
+         "issue_id": "hard_008_open_redirect",
+         "line": 2,
+         "issue_type": "SECURITY",
+         "severity": "HIGH",
+         "description": "Redirect target is fully user-controlled, creating an open redirect.",
+         "required": true,
+         "explanation_keywords": ["open redirect", "redirect", "next_url", "user controlled"],
+         "fix_keywords": ["allowlist", "relative path", "validate", "trusted host"],
+         "owasp_category": "A01:2021-Broken Access Control",
+         "owasp_keywords": ["owasp", "broken access control", "open redirect", "a01"]
+       }
+     ],
+     "must_approve": false,
+     "must_reject": true
+   },
+   {
+     "snippet_id": "hard_009",
+     "filename": "orders.py",
+     "code": "def view_order(request, db):\n    order_id = request.query_params[\"order_id\"]\n    return db.get_order(order_id)",
+     "context": "Customer dashboard handler.",
+     "gold_issues": [
+       {
+         "issue_id": "hard_009_idor",
+         "line": 3,
+         "issue_type": "SECURITY",
+         "severity": "HIGH",
+         "description": "Looks up an order by user-supplied id without an ownership check, enabling IDOR.",
+         "required": true,
+         "explanation_keywords": ["idor", "ownership", "authorization", "access control", "order id"],
+         "fix_keywords": ["authorize", "ownership", "current user", "scoped query"],
+         "owasp_category": "A01:2021-Broken Access Control",
+         "owasp_keywords": ["owasp", "broken access control", "idor", "a01"]
+       }
+     ],
+     "must_approve": false,
+     "must_reject": true
+   },
+   {
+     "snippet_id": "hard_010",
+     "filename": "yaml_loader.py",
+     "code": "import yaml\n\ndef parse_config(data):\n    return yaml.load(data, Loader=yaml.Loader)",
+     "gold_issues": [
+       {
+         "issue_id": "hard_010_yaml_load",
+         "line": 4,
+         "issue_type": "SECURITY",
+         "severity": "HIGH",
+         "description": "Unsafe YAML loader can construct arbitrary Python objects from untrusted input.",
+         "required": true,
+         "explanation_keywords": ["yaml.load", "unsafe", "loader", "object", "untrusted"],
+         "fix_keywords": ["safe_load", "safe loader", "validate", "trusted"],
207
+ "owasp_category": "A08:2021-Software and Data Integrity Failures",
208
+ "owasp_keywords": ["owasp", "integrity", "yaml", "a08"]
209
+ }
210
+ ],
211
+ "must_approve": false,
212
+ "must_reject": true
213
+ }
214
+ ]
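The hard_006 entry flags `hashlib.md5` for password hashing. A minimal standalone sketch (not part of the dataset) of why unsalted MD5 is considered weak here: equal passwords always produce the same, fast-to-compute digest, so leaked hashes are trivially linkable and brute-forceable.

```python
import hashlib

# Buggy pattern mirrored from the hard_006 snippet: unsalted, fast hash.
def hash_password(password: str) -> str:
    return hashlib.md5(password.encode()).hexdigest()

a = hash_password("hunter2")
b = hash_password("hunter2")
print(a == b)   # True: identical passwords yield identical digests
print(len(a))   # 32 hex characters (128-bit digest)
```

The fix_keywords for this snippet (bcrypt, argon2, scrypt, salt) point at the standard remedy: a salted, deliberately slow password-hashing function.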
server/data/snippets_medium.json ADDED
@@ -0,0 +1,191 @@
1
+ [
2
+ {
3
+ "snippet_id": "medium_001",
4
+ "filename": "cart.py",
5
+ "code": "def collect_names(items):\n names = []\n for i in range(len(items) - 1):\n names.append(items[i].name)\n return names",
6
+ "gold_issues": [
7
+ {
8
+ "issue_id": "medium_001_off_by_one",
9
+ "line": 3,
10
+ "issue_type": "LOGIC",
11
+ "severity": "MEDIUM",
12
+ "description": "Loop skips the last item because of an off-by-one range.",
13
+ "required": true,
14
+ "explanation_keywords": ["off-by-one", "last item", "range", "skip", "len"]
15
+ }
16
+ ],
17
+ "must_approve": false,
18
+ "must_reject": true
19
+ },
20
+ {
21
+ "snippet_id": "medium_002",
22
+ "filename": "lists.py",
23
+ "code": "def add_item(item, bucket=[]):\n bucket.append(item)\n return bucket",
24
+ "gold_issues": [
25
+ {
26
+ "issue_id": "medium_002_mutable_default",
27
+ "line": 1,
28
+ "issue_type": "LOGIC",
29
+ "severity": "HIGH",
30
+ "description": "Mutable default argument is shared between calls.",
31
+ "required": true,
32
+ "explanation_keywords": ["mutable", "default", "shared", "calls", "list"]
33
+ }
34
+ ],
35
+ "must_approve": false,
36
+ "must_reject": true
37
+ },
38
+ {
39
+ "snippet_id": "medium_003",
40
+ "filename": "loader.py",
41
+ "code": "def load_payload(reader):\n try:\n return reader.read()\n except Exception:\n pass\n return None",
42
+ "gold_issues": [
43
+ {
44
+ "issue_id": "medium_003_swallow",
45
+ "line": 4,
46
+ "issue_type": "LOGIC",
47
+ "severity": "HIGH",
48
+ "description": "Broad exception is swallowed, hiding errors from callers.",
49
+ "required": true,
50
+ "explanation_keywords": ["exception", "swallow", "pass", "hide", "error"]
51
+ }
52
+ ],
53
+ "must_approve": false,
54
+ "must_reject": true
55
+ },
56
+ {
57
+ "snippet_id": "medium_004",
58
+ "filename": "billing.py",
59
+ "code": "def total_price(prices):\n total = 0\n for price in prices:\n total = total + str(price)\n return total",
60
+ "gold_issues": [
61
+ {
62
+ "issue_id": "medium_004_type_bug",
63
+ "line": 4,
64
+ "issue_type": "LOGIC",
65
+ "severity": "MEDIUM",
66
+ "description": "Converts price to string and concatenates instead of adding numerically.",
67
+ "required": true,
68
+ "explanation_keywords": ["string", "concatenate", "numeric", "add", "type"]
69
+ }
70
+ ],
71
+ "must_approve": false,
72
+ "must_reject": true
73
+ },
74
+ {
75
+ "snippet_id": "medium_005",
76
+ "filename": "flags.py",
77
+ "code": "def should_run(user_input):\n if user_input == True:\n return True\n return False",
78
+ "gold_issues": [
79
+ {
80
+ "issue_id": "medium_005_bool_compare",
81
+ "line": 2,
82
+ "issue_type": "LOGIC",
83
+ "severity": "LOW",
84
+ "description": "Explicit comparison to True is brittle and can mis-handle truthy values.",
85
+ "required": true,
86
+ "explanation_keywords": ["true", "truthy", "boolean", "comparison", "if"]
87
+ }
88
+ ],
89
+ "must_approve": false,
90
+ "must_reject": true
91
+ },
92
+ {
93
+ "snippet_id": "medium_006",
94
+ "filename": "ranges.py",
95
+ "code": "def between(value, start, end):\n if value >= start or value <= end:\n return True\n return False",
96
+ "gold_issues": [
97
+ {
98
+ "issue_id": "medium_006_boolean_logic",
99
+ "line": 2,
100
+ "issue_type": "LOGIC",
101
+ "severity": "HIGH",
102
+ "description": "Uses `or` instead of `and`, so the range check almost always passes.",
103
+ "required": true,
104
+ "explanation_keywords": ["or", "and", "range", "always", "boolean"]
105
+ }
106
+ ],
107
+ "must_approve": false,
108
+ "must_reject": true
109
+ },
110
+ {
111
+ "snippet_id": "medium_007",
112
+ "filename": "cleanup.py",
113
+ "code": "def remove_empty(values):\n for value in values:\n if not value:\n values.remove(value)\n return values",
114
+ "gold_issues": [
115
+ {
116
+ "issue_id": "medium_007_mutation_during_iteration",
117
+ "line": 4,
118
+ "issue_type": "LOGIC",
119
+ "severity": "HIGH",
120
+ "description": "Mutates the list while iterating, causing elements to be skipped.",
121
+ "required": true,
122
+ "explanation_keywords": ["mutate", "iteration", "remove", "skip", "list"]
123
+ }
124
+ ],
125
+ "must_approve": false,
126
+ "must_reject": true
127
+ },
128
+ {
129
+ "snippet_id": "medium_008",
130
+ "filename": "averages.py",
131
+ "code": "def average(numbers):\n if not numbers:\n return 0\n return sum(numbers) / (len(numbers) - 1)",
132
+ "gold_issues": [
133
+ {
134
+ "issue_id": "medium_008_divisor",
135
+ "line": 4,
136
+ "issue_type": "LOGIC",
137
+ "severity": "HIGH",
138
+ "description": "Divides by len(numbers) - 1, producing the wrong average and crashing for one item.",
139
+ "required": true,
140
+ "explanation_keywords": ["average", "divide", "len", "minus 1", "wrong"]
141
+ }
142
+ ],
143
+ "must_approve": false,
144
+ "must_reject": true
145
+ },
146
+ {
147
+ "snippet_id": "medium_009",
148
+ "filename": "retry.py",
149
+ "code": "def fetch_name(client):\n try:\n return client.name()\n except ValueError:\n return \"\"\n return None",
150
+ "gold_issues": [
151
+ {
152
+ "issue_id": "medium_009_unreachable_fallback",
153
+ "line": 6,
154
+ "issue_type": "LOGIC",
155
+ "severity": "LOW",
156
+ "description": "The final return is unreachable and suggests the error path was designed incorrectly.",
157
+ "required": false,
158
+ "explanation_keywords": ["unreachable", "return", "fallback", "dead code"]
159
+ },
160
+ {
161
+ "issue_id": "medium_009_swallow",
162
+ "line": 5,
163
+ "issue_type": "LOGIC",
164
+ "severity": "MEDIUM",
165
+ "description": "Returns an empty string on ValueError, masking the failure as a valid result.",
166
+ "required": true,
167
+ "explanation_keywords": ["empty string", "mask", "failure", "valid result", "error"]
168
+ }
169
+ ],
170
+ "must_approve": false,
171
+ "must_reject": true
172
+ },
173
+ {
174
+ "snippet_id": "medium_010",
175
+ "filename": "tokens.py",
176
+ "code": "def normalize_token(token):\n if token is \"\":\n return None\n return token.strip()",
177
+ "gold_issues": [
178
+ {
179
+ "issue_id": "medium_010_is_compare",
180
+ "line": 2,
181
+ "issue_type": "LOGIC",
182
+ "severity": "MEDIUM",
183
+ "description": "Uses `is` for string comparison instead of equality.",
184
+ "required": true,
185
+ "explanation_keywords": ["is", "string", "comparison", "equality", "identity"]
186
+ }
187
+ ],
188
+ "must_approve": false,
189
+ "must_reject": true
190
+ }
191
+ ]
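The medium_002 bug is easy to reproduce. A standalone sketch mirroring the dataset's `add_item`: the default list is evaluated once at function definition, so every call without an explicit `bucket` appends to the same object.

```python
def add_item(item, bucket=[]):   # buggy: the default list is created once
    bucket.append(item)
    return bucket

first = add_item("a")
second = add_item("b")           # reuses the list from the first call
print(second)                    # ['a', 'b'] - state leaked across calls
print(first is second)           # True: both names point at one list
```

The conventional fix is `bucket=None` with `bucket = [] if bucket is None else bucket` inside the function body.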
server/grading.py ADDED
@@ -0,0 +1,465 @@
1
+ """Deterministic task graders for the code-review benchmark."""
2
+
3
+ from __future__ import annotations
4
+
5
+ from dataclasses import dataclass
6
+ from typing import Dict, Iterable, List, Optional
7
+
8
+ try:
9
+ from ..models import (
10
+ ActionType,
11
+ CodeReviewSnippet,
12
+ GoldIssue,
13
+ IssueType,
14
+ ReviewComment,
15
+ Severity,
16
+ )
17
+ except ImportError:
18
+ from models import ( # type: ignore
19
+ ActionType,
20
+ CodeReviewSnippet,
21
+ GoldIssue,
22
+ IssueType,
23
+ ReviewComment,
24
+ Severity,
25
+ )
26
+
27
+
28
+ def _normalize_text(value: Optional[str]) -> str:
29
+ return " ".join((value or "").lower().split())
30
+
31
+
32
+ def _keyword_match(text: str, keywords: Iterable[str]) -> bool:
33
+ normalized = _normalize_text(text)
34
+ return any(_normalize_text(keyword) in normalized for keyword in keywords if keyword)
35
+
36
+
37
+ def _keyword_match_score(text: str, keywords: Iterable[str]) -> float:
38
+ """
39
+ FIX: Returns partial score 0.0-1.0 based on how many keywords matched.
40
+ Old code: binary match (any keyword → True/False).
41
+ New code: count matches → partial credit even with 1 keyword hit.
42
+ """
43
+ normalized = _normalize_text(text)
44
+ kw_list = [k for k in keywords if k]
45
+ if not kw_list:
46
+ return 0.0
47
+ hits = sum(1 for kw in kw_list if _normalize_text(kw) in normalized)
48
+ return hits / len(kw_list)
49
+
50
+
51
+ def _terminal_action(history: List[ReviewComment]) -> Optional[ActionType]:
52
+ for item in reversed(history):
53
+ if item.action_type in {ActionType.APPROVE, ActionType.REQUEST_CHANGES}:
54
+ return item.action_type
55
+ return None
56
+
57
+
58
+ @dataclass
59
+ class GradeResult:
60
+ score: float
61
+ precision: float
62
+ recall: float
63
+ f1: float
64
+ true_positives: int
65
+ false_positives: int
66
+ missed_issues: int
67
+ required_found: int
68
+ required_total: int
69
+ bonus_found: int
70
+ matched_issue_ids: List[str]
71
+ breakdown: Dict[str, float]
72
+
73
+
74
+ def grade_review(
75
+ task_id: str,
76
+ snippet: CodeReviewSnippet,
77
+ history: List[ReviewComment],
78
+ duplicate_comments: int,
79
+ ) -> GradeResult:
80
+ """Grade a completed or in-progress review deterministically."""
81
+
82
+ comments = [item for item in history if item.action_type == ActionType.ADD_COMMENT]
83
+ if task_id == "task_easy":
84
+ return _grade_easy(snippet, comments, history, duplicate_comments)
85
+ if task_id == "task_medium":
86
+ return _grade_medium(snippet, comments, history, duplicate_comments)
87
+ return _grade_hard(snippet, comments, history, duplicate_comments)
88
+
89
+
90
+ def _grade_easy(
91
+ snippet: CodeReviewSnippet,
92
+ comments: List[ReviewComment],
93
+ history: List[ReviewComment],
94
+ duplicate_comments: int,
95
+ ) -> GradeResult:
96
+ required_issues = [issue for issue in snippet.gold_issues if issue.required]
97
+ required_denominator = max(len(required_issues), 1)
98
+
99
+ # FIX: Start credit at 0 for every issue
100
+ best_credit: Dict[str, float] = {issue.issue_id: 0.0 for issue in snippet.gold_issues}
101
+ matched_ids: set[str] = set()
102
+ false_positives = 0
103
+
104
+ for comment in comments:
105
+ positive = False
106
+ comment_text = f"{comment.comment or ''} {comment.suggestion or ''}"
107
+
108
+ for issue in snippet.gold_issues:
109
+ if comment.line_number is None:
110
+ continue
111
+
112
+ distance = abs(comment.line_number - issue.line)
113
+ credit = 0.0
114
+
115
+ if issue.required:
116
+ # FIX: More generous distance tolerance + keyword fallback
117
+ if comment.issue_type == issue.issue_type:
118
+ if distance <= 1:
119
+ credit = 0.30 / required_denominator
120
+ elif distance <= 3:
121
+ credit = 0.15 / required_denominator # FIX: was 0.10
122
+ elif distance <= 5:
123
+ credit = 0.08 / required_denominator # FIX: new tier
124
+ elif _keyword_match(comment_text, getattr(issue, "explanation_keywords", [])):
125
+ # FIX: Wrong issue_type but comment mentions the bug → partial credit
126
+ if distance <= 3:
127
+ credit = 0.08 / required_denominator
128
+ else:
129
+ # Bonus issues
130
+ if distance <= 3:
131
+ if comment.issue_type == issue.issue_type:
132
+ credit = 0.05
133
+ elif _keyword_match(comment_text, getattr(issue, "explanation_keywords", [])):
134
+ credit = 0.02 # FIX: small credit for keyword match
135
+
136
+ if credit > 0:
137
+ positive = True
138
+ best_credit[issue.issue_id] = max(best_credit[issue.issue_id], credit)
139
+ matched_ids.add(issue.issue_id)
140
+
141
+ if not positive:
142
+ false_positives += 1
143
+
144
+ required_score = sum(best_credit[issue.issue_id] for issue in required_issues)
145
+
146
+ bonus_score = min(
147
+ sum(
148
+ best_credit[issue.issue_id]
149
+ for issue in snippet.gold_issues
150
+ if not issue.required
151
+ ),
152
+ 0.15,
153
+ )
154
+
155
+ # FIX: Reduced false positive penalty — was 0.05 per FP, now 0.03
156
+ # Prevents over-penalising agents that flag too many issues
157
+ false_positive_penalty = min(false_positives * 0.03, 0.15)
158
+
159
+ final_action = _terminal_action(history)
160
+ action_adjustment = 0.0
161
+ if snippet.must_reject and final_action == ActionType.REQUEST_CHANGES:
162
+ action_adjustment = 0.10
163
+ elif snippet.must_reject and final_action == ActionType.APPROVE:
164
+ action_adjustment = -0.10
165
+
166
+ raw_score = required_score + bonus_score - false_positive_penalty + action_adjustment
167
+
168
+ required_found = sum(1 for issue in required_issues if best_credit[issue.issue_id] > 0)
169
+ bonus_found = sum(
170
+ 1
171
+ for issue in snippet.gold_issues
172
+ if not issue.required and best_credit[issue.issue_id] > 0
173
+ )
174
+
175
+ return _build_result(
176
+ score=raw_score,
177
+ matched_issue_ids=sorted(matched_ids),
178
+ false_positives=false_positives,
179
+ required_found=required_found,
180
+ required_total=len(required_issues),
181
+ bonus_found=bonus_found,
182
+ duplicate_comments=duplicate_comments,
183
+ breakdown={
184
+ "required_score": required_score,
185
+ "bonus_score": bonus_score,
186
+ "false_positive_penalty": -false_positive_penalty,
187
+ "action_adjustment": action_adjustment,
188
+ },
189
+ )
190
+
191
+
192
+ def _grade_medium(
193
+ snippet: CodeReviewSnippet,
194
+ comments: List[ReviewComment],
195
+ history: List[ReviewComment],
196
+ duplicate_comments: int,
197
+ ) -> GradeResult:
198
+ required_issues = [issue for issue in snippet.gold_issues if issue.required]
199
+ required_denominator = max(len(required_issues), 1)
200
+ best_credit: Dict[str, float] = {issue.issue_id: 0.0 for issue in snippet.gold_issues}
201
+ explanation_credit: Dict[str, float] = {issue.issue_id: 0.0 for issue in snippet.gold_issues}
202
+ matched_ids: set[str] = set()
203
+ false_positives = 0
204
+
205
+ for comment in comments:
206
+ positive = False
207
+ comment_text = f"{comment.comment or ''} {comment.suggestion or ''}"
208
+
209
+ for issue in snippet.gold_issues:
210
+ if comment.line_number is None:
211
+ continue
212
+
213
+ distance = abs(comment.line_number - issue.line)
214
+
215
+ # FIX: Relaxed from distance <= 5 to distance <= 8
216
+ if distance > 8:
217
+ continue
218
+
219
+ credit = 0.0
220
+ keyword_match = _keyword_match(comment_text, issue.explanation_keywords)
221
+
222
+ # FIX: Old code required BOTH issue_type match AND exact/near line.
223
+ # New code: issue_type OR keyword match gives credit, distance tiers.
224
+ if comment.issue_type == IssueType.LOGIC and issue.issue_type == IssueType.LOGIC:
225
+ if distance <= 1:
226
+ # FIX: was "distance == 0" — now ±1 for full credit
227
+ credit = 0.25 / required_denominator if issue.required else 0.05
228
+ elif distance <= 3:
229
+ credit = 0.15 / required_denominator if issue.required else 0.03 # FIX: was 0.10
230
+ elif distance <= 8:
231
+ credit = 0.08 / required_denominator if issue.required else 0.02 # FIX: new tier
232
+ elif keyword_match:
233
+ # FIX: keyword match alone is worth more — was 0.05, now 0.10
234
+ if distance <= 3:
235
+ credit = 0.10 / required_denominator if issue.required else 0.03
236
+ elif distance <= 8:
237
+ credit = 0.05 / required_denominator if issue.required else 0.01
238
+
239
+ if credit > 0:
240
+ positive = True
241
+ best_credit[issue.issue_id] = max(best_credit[issue.issue_id], credit)
242
+ matched_ids.add(issue.issue_id)
243
+
244
+ # FIX: Use partial keyword score instead of binary
245
+ kw_score = _keyword_match_score(comment_text, issue.explanation_keywords)
246
+ if kw_score > 0:
247
+ explanation_credit[issue.issue_id] = max(
248
+ explanation_credit[issue.issue_id],
249
+ # FIX: Scale explanation bonus by keyword match quality
250
+ (0.05 * kw_score) / required_denominator if issue.required else (0.02 * kw_score),
251
+ )
252
+
253
+ if not positive:
254
+ false_positives += 1
255
+
256
+ base_score = sum(best_credit.values()) + sum(explanation_credit.values())
257
+
258
+ # FIX: Reduced FP penalty — was 0.08 per FP, now 0.05
259
+ false_positive_penalty = min(false_positives * 0.05, 0.25)
260
+
261
+ final_action = _terminal_action(history)
262
+ action_adjustment = 0.0
263
+ if snippet.must_reject and final_action == ActionType.REQUEST_CHANGES:
264
+ action_adjustment = 0.10
265
+ elif snippet.must_reject and final_action == ActionType.APPROVE:
266
+ action_adjustment = -0.15
267
+
268
+ required_found = sum(1 for issue in required_issues if best_credit[issue.issue_id] > 0)
269
+ bonus_found = sum(
270
+ 1
271
+ for issue in snippet.gold_issues
272
+ if not issue.required and best_credit[issue.issue_id] > 0
273
+ )
274
+
275
+ return _build_result(
276
+ score=base_score - false_positive_penalty + action_adjustment,
277
+ matched_issue_ids=sorted(matched_ids),
278
+ false_positives=false_positives,
279
+ required_found=required_found,
280
+ required_total=len(required_issues),
281
+ bonus_found=bonus_found,
282
+ duplicate_comments=duplicate_comments,
283
+ breakdown={
284
+ "logic_score": sum(best_credit.values()),
285
+ "explanation_score": sum(explanation_credit.values()),
286
+ "false_positive_penalty": -false_positive_penalty,
287
+ "action_adjustment": action_adjustment,
288
+ },
289
+ )
290
+
291
+
292
+ def _grade_hard(
293
+ snippet: CodeReviewSnippet,
294
+ comments: List[ReviewComment],
295
+ history: List[ReviewComment],
296
+ duplicate_comments: int,
297
+ ) -> GradeResult:
298
+ required_issues = [issue for issue in snippet.gold_issues if issue.required]
299
+ required_denominator = max(len(required_issues), 1)
300
+ best_credit: Dict[str, float] = {issue.issue_id: 0.0 for issue in snippet.gold_issues}
301
+ owasp_credit: Dict[str, float] = {issue.issue_id: 0.0 for issue in snippet.gold_issues}
302
+ fix_credit: Dict[str, float] = {issue.issue_id: 0.0 for issue in snippet.gold_issues}
303
+ severity_credit: Dict[str, float] = {issue.issue_id: 0.0 for issue in snippet.gold_issues}
304
+ matched_ids: set[str] = set()
305
+ false_positives = 0
306
+
307
+ for comment in comments:
308
+ positive = False
309
+ comment_text = f"{comment.comment or ''} {comment.suggestion or ''}"
310
+
311
+ for issue in snippet.gold_issues:
312
+ # FIX: Was exact line match only (distance == 0).
313
+ # Security vulns span multiple lines — now ±2 tolerance.
314
+ if comment.line_number is None:
315
+ continue
316
+ distance = abs(comment.line_number - issue.line)
317
+ if distance > 2: # FIX: was `!= issue.line` (zero tolerance)
318
+ continue
319
+
320
+ credit = 0.0
321
+ if comment.issue_type == IssueType.SECURITY and issue.issue_type == IssueType.SECURITY:
322
+ if distance == 0:
323
+ credit = 0.20 / required_denominator if issue.required else 0.05
324
+ else:
325
+ # FIX: ±1-2 lines gets partial credit (was zero)
326
+ credit = 0.12 / required_denominator if issue.required else 0.03
327
+
328
+ # FIX: Even if issue_type is wrong, keyword match on SECURITY issue → small credit
329
+ elif _keyword_match(comment_text, getattr(issue, "owasp_keywords", []) + getattr(issue, "fix_keywords", [])):
330
+ if distance <= 2:
331
+ credit = 0.06 / required_denominator if issue.required else 0.02
332
+
333
+ if credit > 0:
334
+ positive = True
335
+ matched_ids.add(issue.issue_id)
336
+ best_credit[issue.issue_id] = max(best_credit[issue.issue_id], credit)
337
+
338
+ owasp_kw = list(getattr(issue, "owasp_keywords", []))
339
+ owasp_cat = [issue.owasp_category] if getattr(issue, "owasp_category", None) else []
340
+ if _keyword_match(comment_text, owasp_kw + owasp_cat):
341
+ owasp_credit[issue.issue_id] = max(
342
+ owasp_credit[issue.issue_id],
343
+ 0.10 / required_denominator if issue.required else 0.02,
344
+ )
345
+
346
+ fix_kw = list(getattr(issue, "fix_keywords", []))
347
+ if _keyword_match(comment_text, fix_kw):
348
+ fix_credit[issue.issue_id] = max(
349
+ fix_credit[issue.issue_id],
350
+ 0.05 / required_denominator if issue.required else 0.02,
351
+ )
352
+
353
+ if comment.severity in {Severity.HIGH, Severity.CRITICAL}:
354
+ if comment.severity == issue.severity or (
355
+ issue.severity == Severity.HIGH and comment.severity == Severity.CRITICAL
356
+ ):
357
+ severity_credit[issue.issue_id] = max(
358
+ severity_credit[issue.issue_id], 0.05 / required_denominator
359
+ )
360
+ elif issue.severity == Severity.CRITICAL and comment.severity in {
361
+ Severity.LOW,
362
+ Severity.MEDIUM,
363
+ }:
364
+ # FIX: Only penalise if we actually matched (was applying even with no match)
365
+ if best_credit[issue.issue_id] > 0:
366
+ severity_credit[issue.issue_id] = min(
367
+ severity_credit[issue.issue_id], -0.05 / required_denominator
368
+ )
369
+
370
+ if not positive:
371
+ false_positives += 1
372
+
373
+ # Missing critical penalty
374
+ missing_critical_penalty = 0.0
375
+ for issue in required_issues:
376
+ if issue.severity == Severity.CRITICAL and best_credit[issue.issue_id] == 0:
377
+ missing_critical_penalty += 0.15
378
+
379
+ # FIX: Reduced FP penalty for hard task — was 0.10, now 0.07
380
+ # Hard tasks have many lines so innocent FPs should cost less
381
+ false_positive_penalty = min(false_positives * 0.07, 0.35)
382
+
383
+ final_action = _terminal_action(history)
384
+ action_adjustment = 0.0
385
+ if snippet.must_reject and final_action == ActionType.REQUEST_CHANGES:
386
+ action_adjustment = 0.10
387
+ elif snippet.must_reject and final_action == ActionType.APPROVE:
388
+ action_adjustment = -0.20
389
+
390
+ required_found = sum(1 for issue in required_issues if best_credit[issue.issue_id] > 0)
391
+ bonus_found = sum(
392
+ 1
393
+ for issue in snippet.gold_issues
394
+ if not issue.required and best_credit[issue.issue_id] > 0
395
+ )
396
+
397
+ return _build_result(
398
+ score=(
399
+ sum(best_credit.values())
400
+ + sum(owasp_credit.values())
401
+ + sum(fix_credit.values())
402
+ + sum(severity_credit.values())
403
+ - false_positive_penalty
404
+ - missing_critical_penalty
405
+ + action_adjustment
406
+ ),
407
+ matched_issue_ids=sorted(matched_ids),
408
+ false_positives=false_positives,
409
+ required_found=required_found,
410
+ required_total=len(required_issues),
411
+ bonus_found=bonus_found,
412
+ duplicate_comments=duplicate_comments,
413
+ breakdown={
414
+ "security_score": sum(best_credit.values()),
415
+ "owasp_score": sum(owasp_credit.values()),
416
+ "fix_score": sum(fix_credit.values()),
417
+ "severity_score": sum(severity_credit.values()),
418
+ "false_positive_penalty": -false_positive_penalty,
419
+ "missing_critical_penalty": -missing_critical_penalty,
420
+ "action_adjustment": action_adjustment,
421
+ },
422
+ )
423
+
424
+
425
+ def _build_result(
426
+ *,
427
+ score: float,
428
+ matched_issue_ids: List[str],
429
+ false_positives: int,
430
+ required_found: int,
431
+ required_total: int,
432
+ bonus_found: int,
433
+ duplicate_comments: int,
434
+ breakdown: Dict[str, float],
435
+ ) -> GradeResult:
436
+ clamped_score = max(0.0, min(score, 1.0))
437
+ true_positives = len(matched_issue_ids)
438
+ missed_issues = max(required_total - required_found, 0)
439
+ precision = true_positives / max(true_positives + false_positives, 1)
440
+ recall = required_found / max(required_total, 1)
441
+ f1 = 0.0
442
+ if precision + recall:
443
+ f1 = 2 * precision * recall / (precision + recall)
444
+ breakdown = {
445
+ **breakdown,
446
+ "duplicate_comments": float(duplicate_comments),
447
+ "precision": precision,
448
+ "recall": recall,
449
+ "f1": f1,
450
+ "score": clamped_score,
451
+ }
452
+ return GradeResult(
453
+ score=clamped_score,
454
+ precision=precision,
455
+ recall=recall,
456
+ f1=f1,
457
+ true_positives=true_positives,
458
+ false_positives=false_positives,
459
+ missed_issues=missed_issues,
460
+ required_found=required_found,
461
+ required_total=required_total,
462
+ bonus_found=bonus_found,
463
+ matched_issue_ids=matched_issue_ids,
464
+ breakdown=breakdown,
465
+ )
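The fractional keyword scoring used by `_keyword_match_score` above can be exercised in isolation. A standalone sketch (reproducing the normalize-then-substring-count logic, not importing the module) shows how a comment earns partial credit per keyword hit:

```python
def normalize(value):
    # Lowercase and collapse whitespace, as in grading.py's _normalize_text.
    return " ".join((value or "").lower().split())

def keyword_match_score(text, keywords):
    # Fraction of non-empty keywords found as substrings of the comment.
    kw_list = [k for k in keywords if k]
    if not kw_list:
        return 0.0
    normalized = normalize(text)
    hits = sum(1 for kw in kw_list if normalize(kw) in normalized)
    return hits / len(kw_list)

comment = "This loop mutates the list during iteration, so items get skipped."
keywords = ["mutate", "iteration", "remove", "skip", "list"]
print(keyword_match_score(comment, keywords))  # 0.8 - 4 of 5 keywords hit
```

Because matching is substring-based, "mutate" is found inside "mutates" and "skip" inside "skipped", which is what lets near-miss explanations still earn credit.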
server/python_env_environment.py ADDED
@@ -0,0 +1,500 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # All rights reserved.
3
+ #
4
+ # This source code is licensed under the BSD-style license found in the
5
+ # LICENSE file in the root directory of this source tree.
6
+
7
+ """Python code-review environment implementation."""
8
+
9
+ from __future__ import annotations
10
+
11
+ from dataclasses import dataclass, field
12
+ from datetime import UTC, datetime
13
+ from typing import Dict, Iterable, List, Optional
14
+ from uuid import uuid4
15
+
16
+ from openenv.core.env_server.interfaces import Environment
17
+ from openenv.core.env_server.types import State
18
+
19
+ try:
20
+ from ..models import (
21
+ Difficulty,
22
+ PythonAction,
23
+ PythonEnvConfig,
24
+ PythonObservation,
25
+ PythonState,
26
+ ReviewFinding,
27
+ TaskDescriptor,
28
+ TaskEvaluation,
29
+ TaskMetadata,
30
+ )
31
+ except ImportError:
32
+ from models import ( # type: ignore
33
+ Difficulty,
34
+ PythonAction,
35
+ PythonEnvConfig,
36
+ PythonObservation,
37
+ PythonState,
38
+ ReviewFinding,
39
+ TaskDescriptor,
40
+ TaskEvaluation,
41
+ TaskMetadata,
42
+ )
43
+
44
+
45
+ @dataclass(frozen=True)
46
+ class ReferenceFinding:
47
+ """Hidden finding metadata used for deterministic grading."""
48
+
49
+ rule_id: str
50
+ title: str
51
+ line: int
52
+ category: str
53
+ severity: str
54
+ rationale: str
55
+ recommendation: str
56
+ weight: float
57
+ keywords: List[str] = field(default_factory=list)
58
+
59
+
60
+ @dataclass(frozen=True)
61
+ class ReviewTask:
62
+ """A visible task plus its hidden grading references."""
63
+
64
+ descriptor: TaskDescriptor
65
+ references: tuple[ReferenceFinding, ...]
66
+ hint: str
67
+ patched_code: Optional[str] = None
68
+
69
+
70
+ TASK_BANK: Dict[str, ReviewTask] = {
71
+ "py-review-easy": ReviewTask(
72
+ descriptor=TaskDescriptor(
73
+ task_id="py-review-easy",
74
+ difficulty="easy",
75
+ title="Mutable default argument",
76
+ objective="Find the correctness issue and explain a safe fix.",
77
+ code=(
78
+ "def add_tag(tag, tags=[]):\n"
79
+ " tags.append(tag)\n"
80
+ " return tags\n"
81
+ ),
82
+ max_steps=4,
83
+ success_threshold=0.7,
84
+ ),
85
+ references=(
86
+ ReferenceFinding(
87
+ rule_id="mutable-default",
88
+ title="Mutable default list is shared across calls",
89
+ line=1,
90
+ category="bug",
91
+ severity="warning",
92
+ rationale="The list persists between calls and leaks state.",
93
+ recommendation="Use None as the default and create a new list inside the function.",
94
+ weight=1.0,
95
+ keywords=["mutable", "default", "list", "shared", "persists", "leaks"],
96
+ ),
97
+ ),
98
+ hint="Look for state that survives between separate function calls.",
99
+ patched_code=(
100
+ "def add_tag(tag, tags=None):\n"
101
+ " if tags is None:\n"
102
+ " tags = []\n"
103
+ " tags.append(tag)\n"
104
+ " return tags\n"
105
+ ),
106
+ ),
107
+ "py-review-medium": ReviewTask(
108
+ descriptor=TaskDescriptor(
109
+ task_id="py-review-medium",
110
+ difficulty="medium",
111
+ title="Unsafe shell invocation",
112
+ objective="Review the snippet for security-sensitive behavior.",
113
+ code=(
114
+ "import os\n\n"
115
+ "def run_backup(path):\n"
116
+ " os.system(f\"tar -czf backup.tgz {path}\")\n"
117
+ ),
118
+ max_steps=4,
119
+ success_threshold=0.72,
120
+ ),
121
+ references=(
122
+ ReferenceFinding(
123
+ rule_id="shell-injection",
124
+ title="User input is interpolated into a shell command",
125
+ line=4,
126
+ category="security",
127
+ severity="critical",
128
+ rationale="An attacker can inject shell metacharacters through the path argument.",
129
+ recommendation="Use subprocess with an argument list instead of os.system.",
130
+ weight=1.0,
131
+ keywords=["shell", "injection", "os.system", "subprocess", "input", "unsanitized", "escaping"],
132
+ ),
133
+ ),
134
+ hint="Check how external commands are invoked and whether user input is escaped.",
135
+ patched_code=(
136
+ "import subprocess\n\n"
137
+ "def run_backup(path):\n"
138
+ " subprocess.run([\"tar\", \"-czf\", \"backup.tgz\", path], check=True)\n"
139
+ ),
140
+ ),
141
+ "py-review-hard": ReviewTask(
142
+ descriptor=TaskDescriptor(
143
+ task_id="py-review-hard",
144
+ difficulty="hard",
145
+ title="Retry helper hides failures",
146
+ objective="Identify correctness and maintainability issues in the retry logic.",
147
+ code=(
148
+ "import time\n\n"
149
+ "def fetch_with_retry(client, url, retries=3):\n"
150
+ " last_error = None\n"
151
+ " for _ in range(retries):\n"
152
+ " try:\n"
153
+ " return client.get(url, timeout=1)\n"
154
+ " except Exception as exc:\n"
155
+ " last_error = exc\n"
156
+ " time.sleep(0.1)\n"
157
+ " return None\n"
158
+ ),
159
+ max_steps=4,
160
+ success_threshold=0.74,
161
+ ),
162
+ references=(
163
+ ReferenceFinding(
164
+ rule_id="swallowed-error",
165
+ title="Function swallows the final exception and returns None",
166
+ line=10,
167
+ category="bug",
168
+ severity="warning",
169
+ rationale="Callers cannot distinguish a failed request from a valid None result.",
170
+ recommendation="Re-raise the last exception after retries are exhausted.",
171
+ weight=0.65,
172
+ keywords=["swallowed", "exception", "return none", "error handling"],
173
+ ),
174
+ ReferenceFinding(
175
+ rule_id="broad-except",
176
+ title="Broad exception handler catches unexpected failures",
177
+ line=7,
178
+ category="maintainability",
179
+ severity="info",
180
+ rationale="Catching Exception masks programming errors and interrupts.",
181
+ recommendation="Catch only the client or network exceptions you expect to retry.",
182
+ weight=0.35,
183
+ keywords=["broad", "except", "catch exception"],
184
+ ),
185
+ ),
186
+ hint="Consider what happens to the final error after the retry loop finishes.",
187
+ patched_code=(
188
+ "import time\n\n"
189
+ "def fetch_with_retry(client, url, retries=3):\n"
190
+ " last_error = None\n"
191
+ " for _ in range(retries):\n"
192
+ " try:\n"
193
+ " return client.get(url, timeout=1)\n"
194
+ " except client.retryable_exceptions as exc:\n"
195
+ " last_error = exc\n"
196
+ " time.sleep(0.1)\n"
197
+ " if last_error is not None:\n"
198
+ " raise last_error\n"
199
+ ),
200
+ ),
201
+ }
202
+
203
+
204
+ def _utc_now() -> str:
205
+ return datetime.now(UTC).isoformat()
206
+
207
+
208
+ def _normalize_text(value: Optional[str]) -> str:
209
+ return " ".join((value or "").strip().lower().split())
210
+
211
+
212
+ def _normalize_code(value: Optional[str]) -> str:
213
+ return "\n".join(line.rstrip() for line in (value or "").strip().splitlines())
214
+
215
+
216
+ class PythonEnvironment(Environment[PythonAction, PythonObservation, State]):
217
+ """Deterministic benchmark environment for Python code review tasks."""
218
+
219
+ SUPPORTS_CONCURRENT_SESSIONS: bool = True
220
+
221
+ def __init__(self, config: Optional[PythonEnvConfig] = None):
222
+ super().__init__()
223
+ self._config = config or PythonEnvConfig()
224
+ self._state = State(episode_id=str(uuid4()), step_count=0)
225
+ self._task_cursor = -1
226
+ self._current_task: Optional[ReviewTask] = None
227
+ self._submitted_findings: List[ReviewFinding] = []
228
+ self._hints_used = 0
229
+ self._created_at = _utc_now()
230
+
231
+ def reset(
232
+ self,
233
+ seed: Optional[int] = None,
234
+ episode_id: Optional[str] = None,
235
+ **kwargs,
236
+ ) -> PythonObservation:
237
+ """Start the next configured review task."""
238
+
239
+ del seed, kwargs
240
+ self._task_cursor = (self._task_cursor + 1) % len(self._config.task_order)
241
+ task_id = self._config.task_order[self._task_cursor]
242
+ self._current_task = TASK_BANK.get(task_id, TASK_BANK["py-review-easy"])
243
+ self._state = State(
244
+ episode_id=episode_id or str(uuid4()),
245
+ step_count=0,
246
+ )
247
+ self._submitted_findings = []
248
+ self._hints_used = 0
249
+ self._created_at = _utc_now()
250
+ return self._build_observation(
251
+ feedback="New review task loaded. Submit findings or request a hint.",
252
+ reward=0.0,
253
+ done=False,
254
+ )
255
+
256
+ def step(
257
+ self,
258
+ action: PythonAction,
259
+ timeout_s: Optional[float] = None,
260
+ **kwargs,
261
+ ) -> PythonObservation:
262
+ """Process one review action and return updated feedback."""
263
+
264
+ del timeout_s, kwargs
265
+ if self._current_task is None:
266
+ return self.reset()
267
+
268
+ self._state.step_count += 1
269
+ operation = action.operation
270
+ feedback = ""
271
+ reward = 0.0
272
+ done = False
273
+
274
+ if operation == "request_hint":
275
+ self._hints_used += 1
276
+ feedback = self._current_task.hint
277
+ evaluation = self._evaluate(self._submitted_findings, action.patched_code)
278
+ reward = evaluation.score
279
+ else:
280
+ if action.findings:
281
+ self._submitted_findings.extend(action.findings)
282
+ evaluation = self._evaluate(self._submitted_findings, action.patched_code)
283
+ reward = evaluation.score
284
+ if operation == "finalize":
285
+ done = True
286
+ feedback = (
287
+ "Review finalized. "
288
+ f"Matched {evaluation.matched_findings}/{evaluation.total_findings} "
289
+ "reference findings."
290
+ )
291
+ else:
292
+ feedback = (
293
+ f"Progress saved. Matched {evaluation.matched_findings}/"
294
+ f"{evaluation.total_findings} findings with score {evaluation.score:.2f}."
295
+ )
296
+
297
+ if self._state.step_count >= self._max_steps():
298
+ done = True
299
+ if operation != "finalize":
300
+ feedback = (
301
+ f"{feedback} Maximum steps reached."
302
+ if feedback
303
+ else "Maximum steps reached."
304
+ )
305
+
306
+ return self._build_observation(
307
+ feedback=feedback,
308
+ reward=reward,
309
+ done=done,
310
+ patched_code=action.patched_code,
311
+ )
312
+
313
+ def _build_observation(
314
+ self,
315
+ *,
316
+ feedback: str,
317
+ reward: float,
318
+ done: bool,
319
+ patched_code: Optional[str] = None,
320
+ ) -> PythonObservation:
321
+ assert self._current_task is not None
322
+ evaluation = self._evaluate(self._submitted_findings, patched_code)
323
+ attempts_remaining = max(
324
+ self._max_steps() - self._state.step_count,
325
+ 0,
326
+ )
327
+ return PythonObservation(
328
+ task=self._current_task.descriptor,
329
+ feedback=feedback,
330
+ submitted_findings=list(self._submitted_findings),
331
+ hints_used=self._hints_used,
332
+ attempts_remaining=attempts_remaining,
333
+ evaluation=evaluation,
334
+ score=evaluation.score,
335
+ review_time_ms=float(self._state.step_count * 125),
336
+ done=done,
337
+ reward=reward,
338
+ metadata={
339
+ "episode_id": self._state.episode_id,
340
+ "created_at": self._created_at,
341
+ "updated_at": _utc_now(),
342
+ },
343
+ )
344
+
345
+ def _evaluate(
346
+ self,
347
+ findings: Iterable[ReviewFinding],
348
+ patched_code: Optional[str],
349
+ ) -> TaskEvaluation:
350
+ assert self._current_task is not None
351
+
352
+ references = self._current_task.references
353
+ matched_reference_ids: List[str] = []
354
+ matched_weight = 0.0
355
+ false_positives = 0
356
+ duplicate_findings = 0
357
+
358
+ seen_ids = set()
359
+ for finding in findings:
360
+ ref_id = self._match_reference(finding, references)
361
+ if ref_id is None:
362
+ false_positives += 1
363
+ continue
364
+ if ref_id in seen_ids:
365
+ duplicate_findings += 1
366
+ continue
367
+ seen_ids.add(ref_id)
368
+ matched_reference_ids.append(ref_id)
369
+ matched_weight += next(ref.weight for ref in references if ref.rule_id == ref_id)
370
+
371
+ total_weight = sum(ref.weight for ref in references) or 1.0
372
+ weighted_recall = min(matched_weight / total_weight, 1.0)
373
+
374
+ patch_score = 0.0
375
+ if self._current_task.patched_code and patched_code:
376
+ patch_score = float(
377
+ _normalize_code(patched_code) == _normalize_code(self._current_task.patched_code)
378
+ )
379
+
380
+ raw_score = (
381
+ weighted_recall
382
+ + (self._config.patch_bonus_multiplier * patch_score)
383
+ - (self._config.false_positive_penalty * false_positives)
384
+ - (self._config.duplicate_penalty * duplicate_findings)
385
+ - (self._config.hint_penalty * self._hints_used)
386
+ )
387
+ score = max(0.0, min(raw_score, 1.0))
388
+
389
+ return TaskEvaluation(
390
+ matched_reference_ids=matched_reference_ids,
391
+ matched_findings=len(matched_reference_ids),
392
+ total_findings=len(references),
393
+ false_positives=false_positives,
394
+ duplicate_findings=duplicate_findings,
395
+ weighted_recall=weighted_recall,
396
+ patch_score=patch_score,
397
+ score=score,
398
+ passed=score >= self._current_task.descriptor.success_threshold,
399
+ )
400
+
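The score arithmetic in `_evaluate` can be reproduced standalone. The default constants below are illustrative stand-ins (the real values come from `PythonEnvConfig`); only the shape of the formula is taken from the code above:

```python
def clamped_score(weighted_recall, patch_score, false_positives, duplicates, hints,
                  patch_bonus=0.10, fp_penalty=0.05, dup_penalty=0.02, hint_penalty=0.05):
    # Same formula as _evaluate: recall plus patch bonus, minus linear
    # penalties, clamped into [0, 1]. Default constants are illustrative only.
    raw = (weighted_recall
           + patch_bonus * patch_score
           - fp_penalty * false_positives
           - dup_penalty * duplicates
           - hint_penalty * hints)
    return max(0.0, min(raw, 1.0))

assert clamped_score(1.0, 1.0, 0, 0, 0) == 1.0           # bonus cannot push past 1.0
assert abs(clamped_score(0.5, 0.0, 2, 0, 1) - 0.35) < 1e-9  # penalties subtract linearly
```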
401
+ def _match_reference(
402
+ self,
403
+ finding: ReviewFinding,
404
+ references: Iterable[ReferenceFinding],
405
+ ) -> Optional[str]:
406
+ finding_rule = _normalize_text(finding.rule_id)
407
+ finding_title = _normalize_text(finding.title)
408
+ for reference in references:
409
+ if finding_rule and finding_rule == _normalize_text(reference.rule_id):
410
+ return reference.rule_id
411
+ line_matches = finding.line is not None and finding.line == reference.line
412
+ category_matches = finding.category == reference.category
413
+ title_matches = finding_title and (
414
+ finding_title in _normalize_text(reference.title)
415
+ or _normalize_text(reference.title) in finding_title
416
+ )
417
+
418
+ # Keyword match: check if any reference keywords are in the finding text
419
+ keyword_match = any(
420
+ _normalize_text(kw) in finding_title
421
+ for kw in getattr(reference, "keywords", [])
422
+ ) if finding_title else False
423
+
424
+ # Relaxed matching: allow matching if the title or keywords match even if the line is missing
425
+ if (line_matches and (category_matches or title_matches)) or title_matches or keyword_match:
426
+ return reference.rule_id
427
+ return None
428
+
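The fallback path of `_match_reference` can be sketched in isolation. The function names here are illustrative; the logic mirrors the title-substring and keyword rules above:

```python
def _norm(s):
    # Same normalization as _normalize_text above.
    return " ".join((s or "").strip().lower().split())

def relaxed_title_match(finding_title, ref_title, keywords):
    # Sketch of the relaxed rules: match on a title substring in either
    # direction, or on any reference keyword appearing in the finding title.
    t = _norm(finding_title)
    if not t:
        return False
    title_match = t in _norm(ref_title) or _norm(ref_title) in t
    keyword_match = any(_norm(kw) in t for kw in keywords)
    return title_match or keyword_match

assert relaxed_title_match("Mutable Default list", "Mutable default list is shared across calls", [])
assert relaxed_title_match("state leaks between calls", "unrelated title", ["leaks"])
assert not relaxed_title_match("everything fine", "unrelated title", ["injection"])
```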
429
+ def _max_steps(self) -> int:
430
+ assert self._current_task is not None
431
+ return min(
432
+ self._current_task.descriptor.max_steps,
433
+ self._config.max_steps_per_task,
434
+ )
435
+
436
+ @property
437
+ def state(self) -> State:
438
+ """Return the current environment state."""
439
+
440
+ return self._state
441
+
442
+
443
+
448
+ # --- App Interface Shims ---
449
+
450
+ _GLOBAL_ENV: Optional[PythonEnvironment] = None
451
+
452
+ def _get_env() -> PythonEnvironment:
453
+ global _GLOBAL_ENV
454
+ if _GLOBAL_ENV is None:
455
+ _GLOBAL_ENV = PythonEnvironment()
456
+ return _GLOBAL_ENV
457
+
458
+ def get_current_state() -> PythonState:
459
+ env = _get_env()
460
+ obs = env._build_observation(feedback="State request", reward=0.0, done=False)
461
+ # Convert PythonObservation to PythonState if needed
462
+ return PythonState(
463
+ episode_id=env.state.episode_id,
464
+ current_step=env.state.step_count,
465
+ task_id=obs.task.task_id if obs.task else None,
466
+ difficulty=Difficulty(obs.task.difficulty) if obs.task else None,
467
+ done=False,
468
+ last_feedback=obs.feedback,
469
+ )
470
+
471
+ def get_health_response() -> HealthResponse:
472
+ return HealthResponse(
473
+ status="ok",
474
+ environment="python_env",
475
+ task_count=len(TASK_BANK),
476
+ )
477
+
478
+ def get_metrics_response() -> MetricsResponse:
479
+ return MetricsResponse()
480
+
481
+ def get_tasks_response() -> TaskListResponse:
482
+ from .task_bank import load_task_catalog
483
+ try:
484
+ tasks = load_task_catalog()
485
+ except Exception:
486
+ tasks = []
487
+ # If using local TASK_BANK, convert them
488
+ if not tasks:
489
+ tasks = [
490
+ TaskMetadata(
491
+ task_id=tid,
492
+ name=t.descriptor.title,
493
+ difficulty=Difficulty(t.descriptor.difficulty),
494
+ description=t.descriptor.objective,
495
+ snippet_count=1,
496
+ max_steps=t.descriptor.max_steps,
497
+ )
498
+ for tid, t in TASK_BANK.items()
499
+ ]
500
+ return TaskListResponse(tasks=tasks)
server/requirements.txt ADDED
@@ -0,0 +1,5 @@
1
+ openenv[core]>=0.2.0
2
+ fastapi>=0.115.0
3
+ uvicorn>=0.24.0
4
+ pydantic>=2.12.5
5
+ httpx>=0.28.1
server/review_runtime.py ADDED
@@ -0,0 +1,418 @@
1
+ """Benchmark runtime for the Python code-review environment."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import random
6
+ from dataclasses import dataclass, field
7
+ from datetime import UTC, datetime
8
+ from typing import Dict, List, Optional
9
+ from uuid import uuid4
10
+
11
+ from openenv.core.env_server.interfaces import Environment
12
+
13
+ try:
14
+ from ..models import (
15
+ ActionType,
16
+ CodeReviewSnippet,
17
+ EpisodeMetrics,
18
+ HealthResponse,
19
+ IssueType,
20
+ MetricsResponse,
21
+ PythonAction,
22
+ PythonEnvConfig,
23
+ PythonObservation,
24
+ PythonState,
25
+ ReviewComment,
26
+ RewardSummary,
27
+ TaskListResponse,
28
+ )
29
+ from .grading import GradeResult, grade_review
30
+ from .task_bank import get_task_metadata, load_task_bank, load_task_catalog
31
+ except ImportError:
32
+ from models import ( # type: ignore
33
+ ActionType,
34
+ CodeReviewSnippet,
35
+ EpisodeMetrics,
36
+ HealthResponse,
37
+ IssueType,
38
+ MetricsResponse,
39
+ PythonAction,
40
+ PythonEnvConfig,
41
+ PythonObservation,
42
+ PythonState,
43
+ ReviewComment,
44
+ RewardSummary,
45
+ TaskListResponse,
46
+ )
47
+ from server.grading import GradeResult, grade_review # type: ignore
48
+ from server.task_bank import get_task_metadata, load_task_bank, load_task_catalog # type: ignore
49
+
50
+
51
+ def _utc_now() -> str:
52
+ return datetime.now(UTC).isoformat()
53
+
54
+
55
+ def _severity_reward(issue_severity: str, bonus_issue: bool) -> float:
56
+ if bonus_issue:
57
+ return 0.03
58
+ if issue_severity in {"CRITICAL", "HIGH"}:
59
+ return 0.15
60
+ if issue_severity == "MEDIUM":
61
+ return 0.10
62
+ return 0.05
63
+
64
+
65
+ def _false_positive_penalty(action_severity: Optional[str]) -> float:
66
+ if action_severity == "CRITICAL":
67
+ return -0.12
68
+ if action_severity == "HIGH":
69
+ return -0.08
70
+ return -0.04
71
+
72
+
73
+ def _line_window_for_task(task_id: str) -> int:
74
+ if task_id == "task_easy":
75
+ return 3
76
+ if task_id == "task_medium":
77
+ return 5
78
+ return 0
79
+
80
+
81
+ @dataclass
82
+ class EpisodeRuntime:
83
+ episode_id: str
84
+ task_id: str
85
+ snippet: CodeReviewSnippet
86
+ current_step: int
87
+ max_steps: int
88
+ created_at: str
89
+ review_history: List[ReviewComment] = field(default_factory=list)
90
+ cumulative_reward: float = 0.0
91
+ done: bool = False
92
+ last_feedback: str = ""
93
+ found_issue_ids: set[str] = field(default_factory=set)
94
+ duplicate_comments: int = 0
95
+ context_requests: int = 0
96
+ skipped_clean_lines: int = 0
97
+ skipped_issue_lines: int = 0
98
+ commented_lines: set[int] = field(default_factory=set)
99
+ grade: GradeResult = field(
100
+ default_factory=lambda: GradeResult(
101
+ score=0.0,
102
+ precision=0.0,
103
+ recall=0.0,
104
+ f1=0.0,
105
+ true_positives=0,
106
+ false_positives=0,
107
+ missed_issues=0,
108
+ required_found=0,
109
+ required_total=0,
110
+ bonus_found=0,
111
+ matched_issue_ids=[],
112
+ breakdown={},
113
+ )
114
+ )
115
+ reward_summary: RewardSummary = field(default_factory=RewardSummary)
116
+
117
+
118
+ _ACTIVE_EPISODE: Optional[EpisodeRuntime] = None
119
+ _TASK_CURSOR = -1
120
+ _SNIPPET_CURSORS: Dict[str, int] = {task.task_id: -1 for task in load_task_catalog()}
121
+
122
+
123
+ def _set_active_episode(episode: Optional[EpisodeRuntime]) -> None:
124
+ global _ACTIVE_EPISODE
125
+ _ACTIVE_EPISODE = episode
126
+
127
+
128
+ def _current_episode() -> Optional[EpisodeRuntime]:
129
+ return _ACTIVE_EPISODE
130
+
131
+
132
+ def _match_issue_for_action(task_id: str, snippet: CodeReviewSnippet, action: PythonAction, found_issue_ids: set[str]) -> Optional[str]:
133
+ if action.action_type != ActionType.ADD_COMMENT or action.line_number is None or action.issue_type is None:
134
+ return None
135
+ max_distance = _line_window_for_task(task_id)
136
+ best_issue_id: Optional[str] = None
137
+ best_distance = max_distance + 1
138
+ for issue in snippet.gold_issues:
139
+ if issue.issue_id in found_issue_ids or issue.issue_type != action.issue_type:
140
+ continue
141
+ distance = abs(action.line_number - issue.line)
142
+ if distance <= max_distance and distance < best_distance:
143
+ best_issue_id = issue.issue_id
144
+ best_distance = distance
145
+ return best_issue_id
146
+
147
+
148
+ def build_metrics(episode: EpisodeRuntime) -> EpisodeMetrics:
149
+ return EpisodeMetrics(
150
+ precision=episode.grade.precision,
151
+ recall=episode.grade.recall,
152
+ f1=episode.grade.f1,
153
+ true_positives=episode.grade.true_positives,
154
+ false_positives=episode.grade.false_positives,
155
+ missed_issues=episode.grade.missed_issues,
156
+ required_found=episode.grade.required_found,
157
+ required_total=episode.grade.required_total,
158
+ bonus_found=episode.grade.bonus_found,
159
+ duplicate_comments=episode.duplicate_comments,
160
+ context_requests=episode.context_requests,
161
+ skipped_clean_lines=episode.skipped_clean_lines,
162
+ skipped_issue_lines=episode.skipped_issue_lines,
163
+ current_score=episode.grade.score,
164
+ cumulative_reward=episode.cumulative_reward,
165
+ breakdown=episode.grade.breakdown,
166
+ )
167
+
168
+
169
+ def build_state(episode: EpisodeRuntime) -> PythonState:
170
+ return PythonState(
171
+ episode_id=episode.episode_id,
172
+ step_count=episode.current_step,
173
+ task_id=episode.task_id,
174
+ difficulty=get_task_metadata(episode.task_id).difficulty,
175
+ snippet_id=episode.snippet.snippet_id,
176
+ current_step=episode.current_step,
177
+ max_steps=episode.max_steps,
178
+ done=episode.done,
179
+ filename=episode.snippet.filename,
180
+ review_history=list(episode.review_history),
181
+ metrics=build_metrics(episode),
182
+ last_feedback=episode.last_feedback,
183
+ )
184
+
185
+
186
+ def get_tasks_response() -> TaskListResponse:
187
+ return TaskListResponse(tasks=load_task_catalog())
188
+
189
+
190
+ def get_metrics_response() -> MetricsResponse:
191
+ episode = _current_episode()
192
+ if episode is None:
193
+ return MetricsResponse()
194
+ return MetricsResponse(task_id=episode.task_id, snippet_id=episode.snippet.snippet_id, done=episode.done, metrics=build_metrics(episode))
195
+
196
+
197
+ def get_health_response() -> HealthResponse:
198
+ episode = _current_episode()
199
+ return HealthResponse(
200
+ status="ok",
201
+ environment="python_code_review_env",
202
+ task_count=sum(len(items) for items in load_task_bank().values()),
203
+ active_task_id=episode.task_id if episode else None,
204
+ active_snippet_id=episode.snippet.snippet_id if episode else None,
205
+ active_episode_id=episode.episode_id if episode else None,
206
+ )
207
+
208
+
209
+ def get_current_state() -> PythonState:
210
+ episode = _current_episode()
211
+ return PythonState() if episode is None else build_state(episode)
212
+
213
+
214
+ class PythonReviewRuntime(Environment[PythonAction, PythonObservation, PythonState]):
215
+ """Deterministic code-review benchmark environment with dense rewards."""
216
+
217
+ SUPPORTS_CONCURRENT_SESSIONS: bool = True
218
+
219
+ def __init__(self, config: Optional[PythonEnvConfig] = None):
220
+ super().__init__()
221
+ self._config = config or PythonEnvConfig()
222
+ self._episode: Optional[EpisodeRuntime] = None
223
+
224
+ def _restore_episode(self) -> Optional[EpisodeRuntime]:
225
+ if self._episode is not None:
226
+ return self._episode
227
+ self._episode = _current_episode()
228
+ return self._episode
229
+
230
+ def _select_task_id(self, seed: Optional[int]) -> str:
231
+ task_order = list(self._config.task_order)
232
+ if seed is not None:
233
+ return random.Random(seed).choice(task_order)
234
+ if not self._config.rotate_tasks:
235
+ return task_order[0]
236
+ global _TASK_CURSOR
237
+ _TASK_CURSOR = (_TASK_CURSOR + 1) % len(task_order)
238
+ return task_order[_TASK_CURSOR]
239
+
240
+ def _select_snippet(self, task_id: str, seed: Optional[int]) -> CodeReviewSnippet:
241
+ snippets = load_task_bank()[task_id]
242
+ if seed is not None:
243
+ return random.Random(seed).choice(snippets)
244
+ _SNIPPET_CURSORS[task_id] = (_SNIPPET_CURSORS[task_id] + 1) % len(snippets)
245
+ return snippets[_SNIPPET_CURSORS[task_id]]
246
+
247
+ def _terminal_reward(self, episode: EpisodeRuntime, action_type: ActionType) -> float:
248
+ reward = 0.0
249
+ if episode.grade.required_found == episode.grade.required_total and episode.grade.required_total:
250
+ reward += 0.20
251
+ if episode.grade.false_positives == 0:
252
+ reward += 0.10
253
+ if action_type == ActionType.REQUEST_CHANGES and episode.snippet.must_reject:
254
+ reward += 0.10
255
+ if action_type == ActionType.APPROVE and episode.snippet.must_approve:
256
+ reward += 0.15
257
+ if action_type == ActionType.APPROVE and episode.snippet.must_reject:
258
+ reward -= 0.25
259
+ reward += 0.05 * (1 - (episode.current_step / max(episode.max_steps, 1)))
260
+ return reward
261
+
262
+ def reset(self, seed: Optional[int] = None, episode_id: Optional[str] = None, task_id: Optional[str] = None, **kwargs) -> PythonObservation:
263
+ del kwargs
264
+ selected_task_id = task_id or self._select_task_id(seed)
265
+ snippet = self._select_snippet(selected_task_id, seed)
266
+ metadata = get_task_metadata(selected_task_id)
267
+ episode = EpisodeRuntime(
268
+ episode_id=episode_id or str(uuid4()),
269
+ task_id=selected_task_id,
270
+ snippet=snippet,
271
+ current_step=0,
272
+ max_steps=min(metadata.max_steps, self._config.max_steps_per_task),
273
+ created_at=_utc_now(),
274
+ )
275
+ episode.grade = grade_review(selected_task_id, snippet, episode.review_history, episode.duplicate_comments)
276
+ episode.last_feedback = f"Loaded {metadata.name}. Review the code and submit comments line by line."
277
+ self._episode = episode
278
+ _set_active_episode(episode)
279
+ return self._build_observation(episode, 0.0)
280
+
281
+ def step(self, action: PythonAction, timeout_s: Optional[float] = None, **kwargs) -> PythonObservation:
282
+ del timeout_s, kwargs
283
+ episode = self._restore_episode()
284
+ if episode is None:
285
+ return self.reset()
286
+ if episode.done:
287
+ return self._build_observation(episode, 0.0)
288
+
289
+ episode.current_step += 1
290
+ step_reward = 0.0
291
+ breakdown: Dict[str, float] = {}
292
+ feedback = ""
293
+ matched_issue_ids: List[str] = []
294
+
295
+ if action.action_type == ActionType.ADD_COMMENT:
296
+ if action.line_number in episode.commented_lines:
297
+ episode.duplicate_comments += 1
298
+ step_reward -= 0.08
299
+ breakdown["duplicate_comment_penalty"] = -0.08
300
+ issue_id = _match_issue_for_action(episode.task_id, episode.snippet, action, episode.found_issue_ids)
301
+ if issue_id is not None:
302
+ issue = next(item for item in episode.snippet.gold_issues if item.issue_id == issue_id)
303
+ hit_reward = _severity_reward(issue.severity.value, not issue.required)
304
+ step_reward += hit_reward
305
+ breakdown["issue_hit"] = hit_reward
306
+ episode.found_issue_ids.add(issue_id)
307
+ matched_issue_ids = [issue_id]
308
+ feedback = f"Recorded issue on line {action.line_number}."
309
+ else:
310
+ penalty = _false_positive_penalty(action.severity.value if action.severity else None)
311
+ step_reward += penalty
312
+ breakdown["false_positive_penalty"] = penalty
313
+ feedback = "Comment did not match a benchmark issue."
314
+ if action.line_number is not None:
315
+ episode.commented_lines.add(action.line_number)
316
+
317
+ elif action.action_type == ActionType.SKIP_LINE:
318
+ assert action.line_number is not None
319
+ required_issue_on_line = any(
320
+ issue.required and issue.line == action.line_number
321
+ for issue in episode.snippet.gold_issues
322
+ )
323
+ if required_issue_on_line:
324
+ step_reward -= 0.10
325
+ episode.skipped_issue_lines += 1
326
+ breakdown["skip_issue_penalty"] = -0.10
327
+ feedback = "Skipped a line with a required issue."
328
+ else:
329
+ step_reward += 0.02
330
+ episode.skipped_clean_lines += 1
331
+ breakdown["skip_clean_reward"] = 0.02
332
+ feedback = "Marked the line as clean."
333
+
334
+ elif action.action_type == ActionType.ASK_CONTEXT:
335
+ episode.context_requests += 1
336
+ step_reward -= 0.03
337
+ breakdown["ask_context_penalty"] = -0.03
338
+ feedback = episode.snippet.context or episode.snippet.diff or "No additional context available."
339
+
340
+ elif action.action_type in {ActionType.APPROVE, ActionType.REQUEST_CHANGES}:
341
+ feedback = "Final review decision recorded."
342
+
343
+ episode.review_history.append(
344
+ ReviewComment(
345
+ step_index=episode.current_step,
346
+ action_type=action.action_type,
347
+ line_number=action.line_number,
348
+ issue_type=action.issue_type,
349
+ severity=action.severity,
350
+ comment=action.comment,
351
+ suggestion=action.suggestion,
352
+ question=action.question,
353
+ matched_issue_ids=matched_issue_ids,
354
+ reward_delta=step_reward,
355
+ )
356
+ )
357
+ if len(episode.review_history) > self._config.max_history_entries:
358
+ episode.review_history = episode.review_history[-self._config.max_history_entries :]
359
+
360
+ done = action.action_type in {ActionType.APPROVE, ActionType.REQUEST_CHANGES}
361
+ if episode.current_step >= episode.max_steps:
362
+ done = True
363
+ feedback = f"{feedback} Maximum steps reached.".strip()
364
+
365
+ episode.grade = grade_review(episode.task_id, episode.snippet, episode.review_history, episode.duplicate_comments)
366
+ if done:
367
+ terminal_bonus = self._terminal_reward(episode, action.action_type)
368
+ step_reward += terminal_bonus
369
+ breakdown["terminal_bonus"] = terminal_bonus
370
+ episode.done = True
371
+ feedback = f"{feedback} Final score {episode.grade.score:.2f}.".strip()
372
+
373
+ episode.cumulative_reward += step_reward
374
+ episode.reward_summary = RewardSummary(
375
+ step_reward=step_reward,
376
+ cumulative_reward=episode.cumulative_reward,
377
+ breakdown=breakdown,
378
+ false_positives=episode.grade.false_positives,
379
+ true_positives=episode.grade.true_positives,
380
+ missed_issues=episode.grade.missed_issues,
381
+ )
382
+ episode.last_feedback = feedback or "Step complete."
383
+ self._episode = episode
384
+ _set_active_episode(episode)
385
+ return self._build_observation(episode, step_reward)
386
+
387
+ def _build_observation(self, episode: EpisodeRuntime, reward: float) -> PythonObservation:
388
+ lines = episode.snippet.code.splitlines()
389
+ return PythonObservation(
390
+ snippet_id=episode.snippet.snippet_id,
391
+ code=episode.snippet.code,
392
+ filename=episode.snippet.filename,
393
+ language="python",
394
+ context=episode.snippet.context,
395
+ diff=episode.snippet.diff,
396
+ line_count=len(lines),
397
+ current_step=episode.current_step,
398
+ max_steps=episode.max_steps,
399
+ task_id=episode.task_id,
400
+ review_history=list(episode.review_history),
401
+ lines=lines,
402
+ reward_summary=episode.reward_summary,
403
+ metrics=build_metrics(episode),
404
+ feedback=episode.last_feedback,
405
+ done=episode.done,
406
+ reward=reward,
407
+ metadata={
408
+ "episode_id": episode.episode_id,
409
+ "created_at": episode.created_at,
410
+ "updated_at": _utc_now(),
411
+ "task_name": get_task_metadata(episode.task_id).name,
412
+ },
413
+ )
414
+
415
+ @property
416
+ def state(self) -> PythonState:
417
+ episode = self._restore_episode()
418
+ return PythonState() if episode is None else build_state(episode)
server/task_bank.py ADDED
@@ -0,0 +1,83 @@
1
+ """Dataset-backed task catalog for the Python code-review benchmark."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import json
6
+ from functools import lru_cache
7
+ from pathlib import Path
8
+ from typing import Dict, List
9
+
10
+ try:
11
+ from ..models import CodeReviewSnippet, Difficulty, TaskMetadata
12
+ except ImportError:
13
+ from models import CodeReviewSnippet, Difficulty, TaskMetadata # type: ignore
14
+
15
+
16
+ DATA_DIR = Path(__file__).with_name("data")
17
+
18
+
19
+ TASK_DEFINITIONS: Dict[str, dict[str, object]] = {
20
+ "task_easy": {
21
+ "name": "Style & Convention Review",
22
+ "difficulty": Difficulty.EASY,
23
+ "description": "Find style, naming, formatting, and documentation issues.",
24
+ "filename": "snippets_easy.json",
25
+ "max_steps": 25,
26
+ },
27
+ "task_medium": {
28
+ "name": "Logic Bug Detection",
29
+ "difficulty": Difficulty.MEDIUM,
30
+ "description": "Identify correctness issues in ordinary Python code.",
31
+ "filename": "snippets_medium.json",
32
+ "max_steps": 25,
33
+ },
34
+ "task_hard": {
35
+ "name": "Security Vulnerability Audit",
36
+ "difficulty": Difficulty.HARD,
37
+ "description": "Review web and data-processing code for security flaws.",
38
+ "filename": "snippets_hard.json",
39
+ "max_steps": 25,
40
+ },
41
+ }
42
+
43
+
44
+ @lru_cache(maxsize=1)
45
+ def load_task_bank() -> Dict[str, List[CodeReviewSnippet]]:
46
+ """Load and validate all snippet JSON files."""
47
+
48
+ task_bank: Dict[str, List[CodeReviewSnippet]] = {}
49
+ for task_id, spec in TASK_DEFINITIONS.items():
50
+ raw_items = json.loads((DATA_DIR / str(spec["filename"])).read_text(encoding="utf-8"))
51
+ task_bank[task_id] = [CodeReviewSnippet.model_validate(item) for item in raw_items]
52
+ return task_bank
53
+
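Each data file is a JSON list of snippet objects that `CodeReviewSnippet.model_validate` parses. A minimal sketch of one entry, using only field names the runtime reads elsewhere (`snippet_id`, `filename`, `code`, `gold_issues`, `context`, `diff`); the authoritative schema lives in the `CodeReviewSnippet` model:

```python
import json

# Hypothetical entry for snippets_easy.json; values are illustrative.
entry = {
    "snippet_id": "easy-001",
    "filename": "example.py",
    "code": "l = 1\n",
    "gold_issues": [],
    "context": None,
    "diff": None,
}
payload = json.dumps([entry])  # each file holds a JSON list of such objects
assert json.loads(payload)[0]["snippet_id"] == "easy-001"
```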
54
+
55
+ @lru_cache(maxsize=1)
56
+ def load_task_catalog() -> List[TaskMetadata]:
57
+ """Return visible task metadata for `/tasks` and environment resets."""
58
+
59
+ task_bank = load_task_bank()
60
+ catalog: List[TaskMetadata] = []
61
+ for task_id, spec in TASK_DEFINITIONS.items():
62
+ catalog.append(
63
+ TaskMetadata(
64
+ task_id=task_id,
65
+ name=str(spec["name"]),
66
+ difficulty=spec["difficulty"], # type: ignore[arg-type]
67
+ description=str(spec["description"]),
68
+ snippet_count=len(task_bank[task_id]),
69
+ max_steps=int(spec["max_steps"]),
70
+ min_score=0.0,
71
+ max_score=1.0,
72
+ )
73
+ )
74
+ return catalog
75
+
76
+
77
+ def get_task_metadata(task_id: str) -> TaskMetadata:
78
+ """Return task metadata for one family."""
79
+
80
+ for task in load_task_catalog():
81
+ if task.task_id == task_id:
82
+ return task
83
+ raise KeyError(f"Unknown task_id: {task_id}")
tests/test_env.py ADDED
@@ -0,0 +1,157 @@
+ from __future__ import annotations
+
+ import pytest
+ from fastapi.testclient import TestClient
+ from pydantic import ValidationError
+
+ from models import (
+     ActionType,
+     IssueType,
+     PythonReviewAction,
+     ReviewComment,
+     Severity,
+ )
+ from server.app import app
+ from server.grading import grade_review
+ from server.python_env_environment import PythonEnvironment
+ from server.task_bank import load_task_bank
+
+
+ def _snippet_by_id(task_id: str, snippet_id: str):
+     return next(item for item in load_task_bank()[task_id] if item.snippet_id == snippet_id)
+
+
+ def test_add_comment_requires_fields() -> None:
+     with pytest.raises(ValidationError):
+         PythonReviewAction(action_type=ActionType.ADD_COMMENT)
+
+
+ def test_approve_rejects_extra_fields() -> None:
+     with pytest.raises(ValidationError):
+         PythonReviewAction(
+             action_type=ActionType.APPROVE,
+             comment="looks good",
+         )
+
+
+ def test_easy_grader_rewards_required_issue_and_request_changes() -> None:
+     snippet = load_task_bank()["task_easy"][0]
+     history = [
+         PythonReviewAction(
+             action_type=ActionType.ADD_COMMENT,
+             line_number=4,
+             issue_type=IssueType.STYLE,
+             severity=Severity.LOW,
+             comment="Ambiguous variable name l violates PEP8 E741.",
+         ),
+         PythonReviewAction(action_type=ActionType.REQUEST_CHANGES),
+     ]
+     comments = []
+     for step, action in enumerate(history, start=1):
+         comments.append(
+             {
+                 "step_index": step,
+                 "action_type": action.action_type,
+                 "line_number": action.line_number,
+                 "issue_type": action.issue_type,
+                 "severity": action.severity,
+                 "comment": action.comment,
+             }
+         )
+     result = grade_review(
+         "task_easy",
+         snippet,
+         [ReviewComment.model_validate(item) for item in comments],
+         duplicate_comments=0,
+     )
+     assert result.score > 0.35
+     assert result.required_found >= 1
+
+
+ def test_hard_grader_rewards_security_metadata() -> None:
+     snippet = load_task_bank()["task_hard"][0]
+     review = ReviewComment(
+         step_index=1,
+         action_type=ActionType.ADD_COMMENT,
+         line_number=2,
+         issue_type=IssueType.SECURITY,
+         severity=Severity.CRITICAL,
+         comment="SQL injection risk. This is an OWASP injection issue because the query interpolates user input.",
+         suggestion="Use a parameterized query with placeholders instead of string interpolation.",
+     )
+     result = grade_review("task_hard", snippet, [review], duplicate_comments=0)
+     assert result.score > 0.30
+     assert result.true_positives == 1
+
+
+ def test_environment_step_updates_metrics() -> None:
+     env = PythonEnvironment()
+     observation = env.reset(task_id="task_easy").model_copy()
+     snippet = _snippet_by_id("task_easy", observation.snippet_id)
+     issue = next(item for item in snippet.gold_issues if item.required)
+
+     next_observation = env.step(
+         PythonReviewAction(
+             action_type=ActionType.ADD_COMMENT,
+             line_number=issue.line,
+             issue_type=issue.issue_type,
+             severity=issue.severity,
+             comment=issue.description,
+         )
+     )
+
+     assert next_observation.reward is not None
+     assert next_observation.metrics.true_positives >= 1
+     assert next_observation.review_history[-1].matched_issue_ids
+
+
+ def test_environment_terminal_action_sets_done() -> None:
+     env = PythonEnvironment()
+     env.reset(task_id="task_easy")
+     result = env.step(PythonReviewAction(action_type=ActionType.REQUEST_CHANGES))
+     assert result.done is True
+     assert result.metrics.current_score >= 0.0
+
+
+ def test_api_smoke_endpoints() -> None:
+     client = TestClient(app)
+
+     reset_response = client.post("/reset", json={"task_id": "task_easy"})
+     assert reset_response.status_code == 200
+     payload = reset_response.json()
+     assert payload["observation"]["task_id"] == "task_easy"
+     snippet = _snippet_by_id("task_easy", payload["observation"]["snippet_id"])
+     issue = next(item for item in snippet.gold_issues if item.required)
+
+     step_response = client.post(
+         "/step",
+         json={
+             "action": {
+                 "action_type": "ADD_COMMENT",
+                 "line_number": issue.line,
+                 "issue_type": issue.issue_type.value,
+                 "severity": issue.severity.value,
+                 "comment": issue.description,
+             }
+         },
+     )
+     assert step_response.status_code == 200
+     assert step_response.json()["observation"]["metrics"]["true_positives"] >= 1
+
+     tasks_response = client.get("/tasks")
+     assert tasks_response.status_code == 200
+     assert len(tasks_response.json()["tasks"]) == 3
+
+     metrics_response = client.get("/metrics")
+     assert metrics_response.status_code == 200
+     assert "metrics" in metrics_response.json()
+
+     health_response = client.get("/health")
+     assert health_response.status_code == 200
+     assert health_response.json()["status"] == "ok"
+
+     schema_response = client.get("/schema")
+     assert schema_response.status_code == 200
+     assert "action" in schema_response.json()
uv.lock ADDED
The diff for this file is too large to render. See raw diff