Spaces:
Sleeping
Sleeping
Deploy OpenEnv Submission
Browse files- Dockerfile +17 -0
- README.md +153 -6
- app/__init__.py +1 -0
- app/main.py +130 -0
- app/models.py +33 -0
- app/tasks/__init__.py +12 -0
- app/tasks/base.py +91 -0
- app/tasks/task_easy.py +88 -0
- app/tasks/task_hard.py +122 -0
- app/tasks/task_medium.py +94 -0
- inference.py +181 -0
- inference_local.py +193 -0
- openenv.yaml +62 -0
- requirements.txt +7 -0
Dockerfile
ADDED
|
@@ -0,0 +1,17 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
FROM python:3.11-slim

# Don't write .pyc files into the image layer, and keep stdout/stderr
# unbuffered so container logs stream live (important on HF Spaces).
ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1

# Set working directory
WORKDIR /app

# Install dependencies first (layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy project files
COPY . .

# HuggingFace Spaces expects port 7860
EXPOSE 7860

# Start FastAPI server
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "7860"]
|
README.md
CHANGED
|
@@ -1,10 +1,157 @@
|
|
| 1 |
---
|
| 2 |
-
title:
|
| 3 |
-
emoji:
|
| 4 |
-
colorFrom:
|
| 5 |
-
colorTo:
|
| 6 |
sdk: docker
|
| 7 |
-
|
| 8 |
---
|
| 9 |
|
| 10 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
---
|
| 2 |
+
title: Python Bug Fixer OpenEnv
|
| 3 |
+
emoji: 🐛
|
| 4 |
+
colorFrom: blue
|
| 5 |
+
colorTo: green
|
| 6 |
sdk: docker
|
| 7 |
+
app_port: 7860
|
| 8 |
---
|
| 9 |
|
| 10 |
+
# Python Bug Fixer — OpenEnv
|
| 11 |
+
|
| 12 |
+
An OpenEnv-compliant environment where an AI agent must identify and fix bugs
|
| 13 |
+
in Python code to produce correct program output. Simulates real-world
|
| 14 |
+
software debugging and code review workflows.
|
| 15 |
+
|
| 16 |
+
---
|
| 17 |
+
|
| 18 |
+
## Environment Description
|
| 19 |
+
|
| 20 |
+
The agent receives a buggy Python code snippet along with a description of
|
| 21 |
+
expected behavior. The agent's action is to return the corrected Python code.
|
| 22 |
+
The environment executes the code and rewards the agent based on how many
|
| 23 |
+
expected output lines are produced correctly.
|
| 24 |
+
|
| 25 |
+
---
|
| 26 |
+
|
| 27 |
+
## Observation Space
|
| 28 |
+
|
| 29 |
+
**Type:** Text
|
| 30 |
+
|
| 31 |
+
The observation contains:
|
| 32 |
+
- Task description and difficulty
|
| 33 |
+
- Expected stdout output (ground truth)
|
| 34 |
+
- The buggy Python code to fix
|
| 35 |
+
|
| 36 |
+
---
|
| 37 |
+
|
| 38 |
+
## Action Space
|
| 39 |
+
|
| 40 |
+
**Type:** Text
|
| 41 |
+
|
| 42 |
+
The action is raw Python code (no markdown, no code fences).
|
| 43 |
+
It must be valid Python that can be executed with `python3`.
|
| 44 |
+
|
| 45 |
+
---
|
| 46 |
+
|
| 47 |
+
## Tasks
|
| 48 |
+
|
| 49 |
+
| Task ID | Name | Difficulty | Bugs | Max Steps |
|
| 50 |
+
|---------|------|-----------|------|-----------|
|
| 51 |
+
| `task_easy` | Fix Index Errors | Easy | 2 | 5 |
|
| 52 |
+
| `task_medium` | Fix Binary Search | Medium | 2 | 5 |
|
| 53 |
+
| `task_hard` | Fix DataProcessor Class | Hard | 3 | 7 |
|
| 54 |
+
|
| 55 |
+
### Reward Function
|
| 56 |
+
- Reward ∈ [0.0, 1.0]
|
| 57 |
+
- Each expected output line is worth `1 / N` reward
|
| 58 |
+
- Partial credit awarded for partially correct fixes
|
| 59 |
+
- Code that crashes with a runtime error earns half-scaled credit for expected lines printed before the crash (0.1 floor if none matched)
|
| 60 |
+
|
| 61 |
+
---
|
| 62 |
+
|
| 63 |
+
## Setup & Run Locally
|
| 64 |
+
|
| 65 |
+
```bash
|
| 66 |
+
# 1. Install dependencies
|
| 67 |
+
pip install -r requirements.txt
|
| 68 |
+
|
| 69 |
+
# 2. Start the server
|
| 70 |
+
uvicorn app.main:app --host 0.0.0.0 --port 7860
|
| 71 |
+
|
| 72 |
+
# 3. Test endpoints
|
| 73 |
+
curl http://localhost:7860/health
|
| 74 |
+
curl http://localhost:7860/tasks
|
| 75 |
+
```
|
| 76 |
+
|
| 77 |
+
---
|
| 78 |
+
|
| 79 |
+
## Run Inference
|
| 80 |
+
|
| 81 |
+
```bash
|
| 82 |
+
export API_BASE_URL="https://api-inference.huggingface.co/v1"
|
| 83 |
+
export MODEL_NAME="meta-llama/Meta-Llama-3-8B-Instruct"
|
| 84 |
+
export HF_TOKEN="hf_YOUR_TOKEN_HERE"
|
| 85 |
+
export SPACE_URL="https://YOUR_USERNAME-python-bug-fixer.hf.space"
|
| 86 |
+
|
| 87 |
+
python inference.py
|
| 88 |
+
```
|
| 89 |
+
|
| 90 |
+
Expected output format:
|
| 91 |
+
```
|
| 92 |
+
[START] {"task_id": "task_easy", "session_id": "...", "model": "...", "timestamp": "..."}
|
| 93 |
+
[STEP] {"step": 1, "reward": 1.0, "done": true, ...}
|
| 94 |
+
[END] {"task_id": "task_easy", "total_reward": 1.0, "steps": 1, "success": true, ...}
|
| 95 |
+
```
|
| 96 |
+
|
| 97 |
+
---
|
| 98 |
+
|
| 99 |
+
## API Reference
|
| 100 |
+
|
| 101 |
+
### `POST /reset`
|
| 102 |
+
Start a new episode.
|
| 103 |
+
```json
|
| 104 |
+
Request: { "task_id": "task_easy" }
|
| 105 |
+
Response: { "session_id": "...", "task_id": "...", "observation": "...", "info": {} }
|
| 106 |
+
```
|
| 107 |
+
|
| 108 |
+
### `POST /step`
|
| 109 |
+
Submit fixed code as an action.
|
| 110 |
+
```json
|
| 111 |
+
Request: { "session_id": "...", "action": "def get_last_element(lst): ..." }
|
| 112 |
+
Response: { "observation": "...", "reward": 1.0, "done": true, "info": {} }
|
| 113 |
+
```
|
| 114 |
+
|
| 115 |
+
### `GET /state?session_id=...`
|
| 116 |
+
Get current episode state without advancing.
|
| 117 |
+
```json
|
| 118 |
+
Response: { "session_id": "...", "task_id": "...", "steps": 1, "done": true, "current_observation": "..." }
|
| 119 |
+
```
|
| 120 |
+
|
| 121 |
+
### `GET /tasks`
|
| 122 |
+
List all available tasks and metadata.
|
| 123 |
+
|
| 124 |
+
### `GET /health`
|
| 125 |
+
Returns `{"status": "ok"}`.
|
| 126 |
+
|
| 127 |
+
---
|
| 128 |
+
|
| 129 |
+
## Docker
|
| 130 |
+
|
| 131 |
+
```bash
|
| 132 |
+
docker build -t python-bug-fixer .
|
| 133 |
+
docker run -p 7860:7860 python-bug-fixer
|
| 134 |
+
```
|
| 135 |
+
|
| 136 |
+
---
|
| 137 |
+
|
| 138 |
+
## Project Structure
|
| 139 |
+
|
| 140 |
+
```
|
| 141 |
+
my-openenv/
|
| 142 |
+
├── inference.py # Baseline inference script (root — required)
|
| 143 |
+
├── openenv.yaml # OpenEnv specification
|
| 144 |
+
├── Dockerfile # Container definition
|
| 145 |
+
├── requirements.txt # Python dependencies
|
| 146 |
+
├── README.md
|
| 147 |
+
└── app/
|
| 148 |
+
├── __init__.py
|
| 149 |
+
├── main.py # FastAPI server (reset/step/state endpoints)
|
| 150 |
+
├── models.py # Pydantic request/response models
|
| 151 |
+
└── tasks/
|
| 152 |
+
├── __init__.py # Task registry
|
| 153 |
+
├── base.py # BaseTask + safe code runner
|
| 154 |
+
├── task_easy.py # Easy task (2 index bugs)
|
| 155 |
+
├── task_medium.py # Medium task (2 binary search bugs)
|
| 156 |
+
└── task_hard.py # Hard task (3 DataProcessor bugs)
|
| 157 |
+
```
|
app/__init__.py
ADDED
|
@@ -0,0 +1 @@
|
|
|
|
|
|
|
| 1 |
+
# app package
|
app/main.py
ADDED
|
@@ -0,0 +1,130 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Python Bug Fixer — OpenEnv-compliant FastAPI server.
|
| 3 |
+
|
| 4 |
+
Endpoints
|
| 5 |
+
---------
|
| 6 |
+
GET / → health + metadata
|
| 7 |
+
GET /health → {"status": "ok"}
|
| 8 |
+
GET /tasks → list all task IDs
|
| 9 |
+
POST /reset → start new episode body: {"task_id": "task_easy"}
|
| 10 |
+
POST /step → submit action body: {"session_id": "...", "action": "..."}
|
| 11 |
+
GET /state → current episode state ?session_id=...
|
| 12 |
+
"""
|
| 13 |
+
|
| 14 |
+
from fastapi import FastAPI, HTTPException
|
| 15 |
+
from fastapi.middleware.cors import CORSMiddleware
|
| 16 |
+
from app.models import ResetRequest, ResetResponse, StepRequest, StepResponse, StateResponse
|
| 17 |
+
from app.tasks import TASKS
|
| 18 |
+
import uuid
|
| 19 |
+
|
| 20 |
+
app = FastAPI(
|
| 21 |
+
title="Python Bug Fixer OpenEnv",
|
| 22 |
+
description="OpenEnv environment: agent fixes bugs in Python code.",
|
| 23 |
+
version="1.0.0",
|
| 24 |
+
)
|
| 25 |
+
|
| 26 |
+
# Allow external services to call the OpenEnv API
|
| 27 |
+
app.add_middleware(
|
| 28 |
+
CORSMiddleware,
|
| 29 |
+
allow_origins=["*"],
|
| 30 |
+
allow_credentials=True,
|
| 31 |
+
allow_methods=["*"],
|
| 32 |
+
allow_headers=["*"],
|
| 33 |
+
)
|
| 34 |
+
|
| 35 |
+
# In-memory session store {session_id: {"task_id": str, "task": BaseTask}}
|
| 36 |
+
sessions: dict = {}
|
| 37 |
+
|
| 38 |
+
|
| 39 |
+
# ── Health / metadata ──────────────────────────────────────────────────────────
|
| 40 |
+
|
| 41 |
+
@app.get("/")
def root():
    """Service metadata: liveness flag, environment name/version, and task IDs."""
    payload = {"status": "ok", "name": "python-bug-fixer", "version": "1.0.0"}
    payload["tasks"] = list(TASKS.keys())
    return payload
|
| 49 |
+
|
| 50 |
+
|
| 51 |
+
@app.get("/health")
def health():
    """Liveness probe used by the hosting platform; always reports ok."""
    return dict(status="ok")
|
| 54 |
+
|
| 55 |
+
|
| 56 |
+
@app.get("/tasks")
def list_tasks():
    """List every registered task with the info metadata its reset() reports."""
    # Each registered class is instantiated and reset once, purely to read
    # the "info" dict it exposes; the throwaway instance is then discarded.
    return {
        "tasks": [
            {"task_id": tid, "info": cls().reset().get("info", {})}
            for tid, cls in TASKS.items()
        ]
    }
|
| 67 |
+
|
| 68 |
+
|
| 69 |
+
# ── Core OpenEnv endpoints ─────────────────────────────────────────────────────
|
| 70 |
+
|
| 71 |
+
@app.post("/reset", response_model=ResetResponse)
def reset(req: ResetRequest):
    """Create a fresh episode for `req.task_id` and return its first observation."""
    task_cls = TASKS.get(req.task_id)
    if task_cls is None:
        raise HTTPException(
            status_code=400,
            detail=f"Unknown task_id '{req.task_id}'. Valid: {list(TASKS.keys())}",
        )

    task = task_cls()
    first = task.reset()
    new_id = str(uuid.uuid4())
    # Record the live episode so /step and /state can look it up later.
    sessions[new_id] = {"task_id": req.task_id, "task": task}

    return ResetResponse(
        session_id=new_id,
        task_id=req.task_id,
        observation=first["observation"],
        info=first.get("info", {}),
    )
|
| 92 |
+
|
| 93 |
+
|
| 94 |
+
@app.post("/step", response_model=StepRequest if False else StepResponse)
def step(req: StepRequest):
    """Run one environment step with the submitted action (candidate fixed code)."""
    session = sessions.get(req.session_id)
    if session is None:
        raise HTTPException(status_code=404, detail="Session not found. Call /reset first.")

    outcome = session["task"].step(req.action)

    # Finished episodes are dropped immediately so the in-memory store
    # cannot grow without bound.
    if outcome.get("done"):
        sessions.pop(req.session_id)

    return StepResponse(
        observation=outcome["observation"],
        reward=outcome["reward"],
        done=outcome["done"],
        info=outcome.get("info", {}),
    )
|
| 113 |
+
|
| 114 |
+
|
| 115 |
+
@app.get("/state", response_model=StateResponse)
def state(session_id: str):
    """Read-only view of an episode: step count, done flag, latest observation."""
    try:
        entry = sessions[session_id]
    except KeyError:
        raise HTTPException(status_code=404, detail="Session not found.")

    snapshot = entry["task"].state()
    return StateResponse(
        session_id=session_id,
        task_id=entry["task_id"],
        steps=snapshot["steps"],
        done=snapshot["done"],
        current_observation=snapshot["current_observation"],
    )
|
app/models.py
ADDED
|
@@ -0,0 +1,33 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Pydantic request/response schemas for the OpenEnv HTTP API."""

from pydantic import BaseModel
from typing import Any, Dict, Optional


class ResetRequest(BaseModel):
    """Body of POST /reset."""
    # Must be one of the IDs registered in app.tasks.TASKS.
    task_id: str = "task_easy"


class ResetResponse(BaseModel):
    """Response of POST /reset: a new session and its first observation."""
    session_id: str
    task_id: str
    observation: str
    info: Dict[str, Any] = {}


class StepRequest(BaseModel):
    """Body of POST /step."""
    session_id: str
    # Raw Python source submitted as the agent's fix.
    action: str


class StepResponse(BaseModel):
    """Response of POST /step: feedback, reward in [0, 1], and done flag."""
    observation: str
    reward: float
    done: bool
    info: Dict[str, Any] = {}


class StateResponse(BaseModel):
    """Response of GET /state: episode progress without advancing it."""
    session_id: str
    task_id: str
    steps: int
    done: bool
    current_observation: str
|
app/tasks/__init__.py
ADDED
|
@@ -0,0 +1,12 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from app.tasks.task_easy import TaskEasy
from app.tasks.task_medium import TaskMedium
from app.tasks.task_hard import TaskHard

# Registry — maps task_id → task class. Insertion order is observable:
# it is the order tasks are listed by GET / and GET /tasks.
TASKS = {
    "task_easy": TaskEasy,
    "task_medium": TaskMedium,
    "task_hard": TaskHard,
}

__all__ = ["TASKS", "TaskEasy", "TaskMedium", "TaskHard"]
|
app/tasks/base.py
ADDED
|
@@ -0,0 +1,91 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import subprocess
import sys
import tempfile
import os
from abc import ABC, abstractmethod
from typing import Dict, Any, List


class BaseTask(ABC):
    """
    Base class for all OpenEnv tasks.
    Every task must implement reset() and step().
    """

    def __init__(self):
        # Episode bookkeeping shared by every task.
        self.current_step: int = 0
        self.done: bool = False
        self.current_observation: str = ""

    @abstractmethod
    def reset(self) -> Dict[str, Any]:
        """
        Reset the environment and return the initial observation.
        Returns dict with keys: observation, info
        """

    @abstractmethod
    def step(self, action: str) -> Dict[str, Any]:
        """
        Take an action and return (observation, reward, done, info).
        Returns dict with keys: observation, reward, done, info
        """

    def state(self) -> Dict[str, Any]:
        """Return the current state of the environment."""
        return {
            "steps": self.current_step,
            "done": self.done,
            "current_observation": self.current_observation,
        }

    @staticmethod
    def _count_matches(stdout: str, expected_outputs: List[str]) -> int:
        """Number of expected substrings that appear in the captured stdout."""
        return sum(1 for e in expected_outputs if e in stdout)

    def _run_code_safely(self, code: str, expected_outputs: List[str]) -> float:
        """
        Execute code in a subprocess and check how many expected outputs
        appear in stdout. Returns a reward in [0.0, 1.0].

        Partial credit: each matching expected string = 1/N reward.
        A crashing program earns half-scaled credit for the lines it
        printed before failing (0.1 floor if none of them matched).
        """
        # Guard: with no expected outputs there is nothing to grade
        # (also prevents a ZeroDivisionError below).
        if not expected_outputs:
            return 0.0

        tmp_path = None
        try:
            # Write code to a temp file; delete=False so the interpreter
            # subprocess can open it after this handle is closed.
            with tempfile.NamedTemporaryFile(
                mode="w", suffix=".py", delete=False, encoding="utf-8"
            ) as f:
                f.write(code)
                tmp_path = f.name

            # Run with a timeout to contain infinite loops in submitted code.
            result = subprocess.run(
                [sys.executable, tmp_path],
                capture_output=True,
                text=True,
                timeout=10,
            )

            correct = self._count_matches(result.stdout, expected_outputs)

            # Non-zero exit = runtime error, but partial credit is possible
            # for output produced before the crash.
            if result.returncode != 0:
                if correct == 0:
                    return 0.1  # tiny credit for at least running
                return round(correct / len(expected_outputs) * 0.5, 2)

            return round(correct / len(expected_outputs), 2)

        except subprocess.TimeoutExpired:
            return 0.0
        except Exception:
            # Any sandbox/OS failure is treated as a zero-reward attempt.
            return 0.0
        finally:
            if tmp_path and os.path.exists(tmp_path):
                try:
                    os.unlink(tmp_path)
                except OSError:
                    pass
|
app/tasks/task_easy.py
ADDED
|
@@ -0,0 +1,88 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from app.tasks.base import BaseTask
|
| 2 |
+
from typing import Dict, Any
|
| 3 |
+
|
| 4 |
+
# ── Buggy code the agent must fix ──────────────────────────────────────────────
|
| 5 |
+
BUGGY_CODE = '''\
|
| 6 |
+
def get_last_element(lst):
|
| 7 |
+
# BUG 1: len(lst) is out of range — should be len(lst) - 1
|
| 8 |
+
return lst[len(lst)]
|
| 9 |
+
|
| 10 |
+
|
| 11 |
+
def compute_sum(numbers):
|
| 12 |
+
total = 0
|
| 13 |
+
# BUG 2: range(len(numbers) + 1) goes one too far → IndexError
|
| 14 |
+
for i in range(len(numbers) + 1):
|
| 15 |
+
total += numbers[i]
|
| 16 |
+
return total
|
| 17 |
+
|
| 18 |
+
|
| 19 |
+
result = get_last_element([1, 2, 3, 4, 5])
|
| 20 |
+
print(result)
|
| 21 |
+
|
| 22 |
+
total = compute_sum([10, 20, 30])
|
| 23 |
+
print(total)
|
| 24 |
+
'''
|
| 25 |
+
|
| 26 |
+
# ── What the agent sees ────────────────────────────────────────────────────────
|
| 27 |
+
DESCRIPTION = """\
|
| 28 |
+
=== TASK: Fix Index Errors (EASY) ===
|
| 29 |
+
|
| 30 |
+
The code below contains 2 bugs — both are off-by-one index errors.
|
| 31 |
+
Fix them so the program runs without errors and prints the expected output.
|
| 32 |
+
|
| 33 |
+
Expected output (exactly 2 lines):
|
| 34 |
+
5
|
| 35 |
+
60
|
| 36 |
+
|
| 37 |
+
Buggy code:
|
| 38 |
+
"""
|
| 39 |
+
|
| 40 |
+
TASK_INFO = {
|
| 41 |
+
"task_id": "task_easy",
|
| 42 |
+
"difficulty": "easy",
|
| 43 |
+
"num_bugs": 2,
|
| 44 |
+
"max_steps": 5,
|
| 45 |
+
}
|
| 46 |
+
|
| 47 |
+
# Expected strings that must appear in stdout to earn full reward
|
| 48 |
+
EXPECTED_OUTPUTS = ["5\n", "60\n"]
|
| 49 |
+
|
| 50 |
+
|
| 51 |
+
class TaskEasy(BaseTask):
    """
    Easy task: fix two off-by-one index errors.
    Reward: 0.5 per correct output line → max 1.0
    """

    def reset(self) -> Dict[str, Any]:
        # Fresh episode: clear counters and show the prompt plus buggy code.
        self.current_step = 0
        self.done = False
        self.current_observation = DESCRIPTION + BUGGY_CODE
        return {"observation": self.current_observation, "info": TASK_INFO}

    def step(self, action: str) -> Dict[str, Any]:
        self.current_step += 1
        reward = self._run_code_safely(action, EXPECTED_OUTPUTS)

        # Episode ends on a perfect fix or once the step budget is spent.
        if reward >= 1.0 or self.current_step >= TASK_INFO["max_steps"]:
            self.done = True

        if reward >= 1.0:
            obs = f"✓ All outputs correct! Reward: {reward:.2f}. Task complete."
        elif reward > 0.0:
            obs = f"Partial credit. Reward: {reward:.2f}. Some outputs still wrong. Try again."
        else:
            obs = f"Code failed to run or produced wrong output. Reward: {reward:.2f}. Try again."

        self.current_observation = obs
        return {
            "observation": obs,
            "reward": reward,
            "done": self.done,
            "info": {"step": self.current_step},
        }
|
app/tasks/task_hard.py
ADDED
|
@@ -0,0 +1,122 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from app.tasks.base import BaseTask
|
| 2 |
+
from typing import Dict, Any
|
| 3 |
+
|
| 4 |
+
# ── Buggy code ─────────────────────────────────────────────────────────────────
|
| 5 |
+
BUGGY_CODE = '''\
|
| 6 |
+
class DataProcessor:
|
| 7 |
+
"""Processes a list of employee records."""
|
| 8 |
+
|
| 9 |
+
def __init__(self):
|
| 10 |
+
self.data = []
|
| 11 |
+
|
| 12 |
+
def add_record(self, record: dict):
|
| 13 |
+
self.data.append(record)
|
| 14 |
+
|
| 15 |
+
def get_average(self, field: str) -> float:
|
| 16 |
+
"""Return the average value of a numeric field."""
|
| 17 |
+
if not self.data:
|
| 18 |
+
return 0.0
|
| 19 |
+
return sum(r[field] for r in self.data) / len(self.data)
|
| 20 |
+
|
| 21 |
+
def filter_records(self, field: str, value):
|
| 22 |
+
"""Return all records where record[field] == value."""
|
| 23 |
+
# BUG 1: single = is assignment, not comparison — SyntaxError
|
| 24 |
+
return [r for r in self.data if r[field] = value]
|
| 25 |
+
|
| 26 |
+
def get_sorted(self, field: str, reverse: bool = False):
|
| 27 |
+
"""Return records sorted by field. reverse=True means descending."""
|
| 28 |
+
# BUG 2: reverse logic is inverted — passing `not reverse` flips the sort
|
| 29 |
+
return sorted(self.data, key=lambda x: x[field], reverse=not reverse)
|
| 30 |
+
|
| 31 |
+
def get_max(self, field: str) -> dict:
|
| 32 |
+
"""Return the record with the highest value for field."""
|
| 33 |
+
# BUG 3: "field" is a string literal, not the variable — always uses key "field"
|
| 34 |
+
return max(self.data, key=lambda x: x["field"])
|
| 35 |
+
|
| 36 |
+
|
| 37 |
+
# ── Test harness ────────────────────────────────────────────────────────────────
|
| 38 |
+
p = DataProcessor()
|
| 39 |
+
p.add_record({"name": "Alice", "score": 85})
|
| 40 |
+
p.add_record({"name": "Bob", "score": 92})
|
| 41 |
+
p.add_record({"name": "Charlie", "score": 78})
|
| 42 |
+
|
| 43 |
+
print(round(p.get_average("score"), 1)) # Expected: 85.0
|
| 44 |
+
print(len(p.filter_records("name", "Alice"))) # Expected: 1
|
| 45 |
+
print(p.get_sorted("score", reverse=True)[0]["name"]) # Expected: Bob
|
| 46 |
+
print(p.get_max("score")["name"]) # Expected: Bob
|
| 47 |
+
'''
|
| 48 |
+
|
| 49 |
+
DESCRIPTION = """\
|
| 50 |
+
=== TASK: Fix DataProcessor Class (HARD) ===
|
| 51 |
+
|
| 52 |
+
The DataProcessor class below has exactly 3 bugs — one in each of three methods.
|
| 53 |
+
Fix all 3 bugs so the test harness at the bottom runs and prints the expected output.
|
| 54 |
+
|
| 55 |
+
Hint — methods with bugs:
|
| 56 |
+
filter_records → SyntaxError
|
| 57 |
+
get_sorted → wrong sort direction
|
| 58 |
+
get_max → wrong key lookup
|
| 59 |
+
|
| 60 |
+
Expected output (exactly 4 lines):
|
| 61 |
+
85.0
|
| 62 |
+
1
|
| 63 |
+
Bob
|
| 64 |
+
Bob
|
| 65 |
+
|
| 66 |
+
Buggy code:
|
| 67 |
+
"""
|
| 68 |
+
|
| 69 |
+
TASK_INFO = {
|
| 70 |
+
"task_id": "task_hard",
|
| 71 |
+
"difficulty": "hard",
|
| 72 |
+
"num_bugs": 3,
|
| 73 |
+
"max_steps": 7,
|
| 74 |
+
}
|
| 75 |
+
|
| 76 |
+
EXPECTED_OUTPUTS = ["85.0\n", "1\n", "Bob\nBob\n"]
|
| 77 |
+
|
| 78 |
+
|
| 79 |
+
class TaskHard(BaseTask):
    """
    Hard task: fix 3 bugs in a DataProcessor class (one per method).
    Reward: partial credit per correct output.
    """

    def reset(self) -> Dict[str, Any]:
        # Fresh episode: clear counters and present prompt + buggy class.
        self.current_step = 0
        self.done = False
        self.current_observation = DESCRIPTION + BUGGY_CODE
        return {"observation": self.current_observation, "info": TASK_INFO}

    def step(self, action: str) -> Dict[str, Any]:
        self.current_step += 1
        reward = self._run_code_safely(action, EXPECTED_OUTPUTS)

        # Done on a perfect score or when the step budget runs out.
        if reward >= 1.0 or self.current_step >= TASK_INFO["max_steps"]:
            self.done = True

        if reward >= 1.0:
            obs = f"✓ All 3 bugs fixed! Perfect output. Reward: {reward:.2f}. Task complete."
        elif reward > 0.0:
            obs = (
                f"Partial fix. Reward: {reward:.2f}. "
                f"Some outputs still wrong — check all 3 buggy methods."
            )
        else:
            obs = (
                f"Code failed or produced wrong output. Reward: {reward:.2f}. "
                f"Look for SyntaxError in filter_records first."
            )

        self.current_observation = obs
        return {
            "observation": obs,
            "reward": reward,
            "done": self.done,
            "info": {"step": self.current_step},
        }
|
app/tasks/task_medium.py
ADDED
|
@@ -0,0 +1,94 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from app.tasks.base import BaseTask
|
| 2 |
+
from typing import Dict, Any
|
| 3 |
+
|
| 4 |
+
# ── Buggy code ─────────────────────────────────────────────────────────────────
|
| 5 |
+
BUGGY_CODE = '''\
|
| 6 |
+
def binary_search(arr, target):
|
| 7 |
+
# BUG 1: right should be len(arr) - 1, not len(arr)
|
| 8 |
+
# This causes an IndexError on the first iteration
|
| 9 |
+
left, right = 0, len(arr)
|
| 10 |
+
|
| 11 |
+
while left <= right:
|
| 12 |
+
mid = (left + right) // 2
|
| 13 |
+
if arr[mid] == target:
|
| 14 |
+
return mid
|
| 15 |
+
elif arr[mid] < target:
|
| 16 |
+
# BUG 2: must be mid + 1 — using just mid causes infinite loop
|
| 17 |
+
left = mid
|
| 18 |
+
else:
|
| 19 |
+
right = mid - 1
|
| 20 |
+
|
| 21 |
+
return -1
|
| 22 |
+
|
| 23 |
+
|
| 24 |
+
arr = [1, 3, 5, 7, 9, 11, 13]
|
| 25 |
+
print(binary_search(arr, 7)) # Expected: 3
|
| 26 |
+
print(binary_search(arr, 11)) # Expected: 5
|
| 27 |
+
print(binary_search(arr, 4)) # Expected: -1
|
| 28 |
+
'''
|
| 29 |
+
|
| 30 |
+
DESCRIPTION = """\
|
| 31 |
+
=== TASK: Fix Binary Search (MEDIUM) ===
|
| 32 |
+
|
| 33 |
+
The binary search function below has 2 bugs:
|
| 34 |
+
Bug 1 — the right boundary is wrong (causes IndexError)
|
| 35 |
+
Bug 2 — the left pointer never advances (causes infinite loop)
|
| 36 |
+
|
| 37 |
+
Fix both bugs so the function works correctly.
|
| 38 |
+
|
| 39 |
+
Expected output (exactly 3 lines):
|
| 40 |
+
3
|
| 41 |
+
5
|
| 42 |
+
-1
|
| 43 |
+
|
| 44 |
+
Buggy code:
|
| 45 |
+
"""
|
| 46 |
+
|
| 47 |
+
TASK_INFO = {
|
| 48 |
+
"task_id": "task_medium",
|
| 49 |
+
"difficulty": "medium",
|
| 50 |
+
"num_bugs": 2,
|
| 51 |
+
"max_steps": 5,
|
| 52 |
+
}
|
| 53 |
+
|
| 54 |
+
EXPECTED_OUTPUTS = ["3\n", "5\n", "-1\n"]
|
| 55 |
+
|
| 56 |
+
|
| 57 |
+
class TaskMedium(BaseTask):
    """
    Medium task: fix 2 bugs in a binary search implementation.
    Reward: 1/3 per correct output line → max 1.0
    """

    def reset(self) -> Dict[str, Any]:
        # Fresh episode: clear counters and show prompt + buggy search code.
        self.current_step = 0
        self.done = False
        self.current_observation = DESCRIPTION + BUGGY_CODE
        return {"observation": self.current_observation, "info": TASK_INFO}

    def step(self, action: str) -> Dict[str, Any]:
        self.current_step += 1
        reward = self._run_code_safely(action, EXPECTED_OUTPUTS)

        # Done on a perfect score or when the step budget runs out.
        if reward >= 1.0 or self.current_step >= TASK_INFO["max_steps"]:
            self.done = True

        if reward >= 1.0:
            obs = f"✓ Perfect! All search results correct. Reward: {reward:.2f}. Task complete."
        elif reward > 0.0:
            obs = f"Partial credit. Reward: {reward:.2f}. Some search results are still wrong."
        else:
            obs = f"Code failed or produced wrong output. Reward: {reward:.2f}. Check both bugs."

        self.current_observation = obs
        return {
            "observation": obs,
            "reward": reward,
            "done": self.done,
            "info": {"step": self.current_step},
        }
|
inference.py
ADDED
|
@@ -0,0 +1,181 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
"""
inference.py — Baseline inference script for the Python Bug Fixer OpenEnv.

Drives each task end to end: resets the task on the environment server,
asks an OpenAI-compatible chat model for fixed code, and submits the fix
until the episode reports done.

Usage:
    export API_BASE_URL="https://api-inference.huggingface.co/v1"
    export MODEL_NAME="meta-llama/Meta-Llama-3-8B-Instruct"
    export HF_TOKEN="hf_YOUR_TOKEN"
    export SPACE_URL="https://YOUR_USERNAME-python-bug-fixer.hf.space"
    python inference.py

Log format (required — do not change):
    [START] {...json...}
    [STEP] {...json...}
    [END] {...json...}
"""

import json
import os
from datetime import datetime, timezone

import requests
from openai import OpenAI

# ── Configuration via environment variables ───────────────────────────────────
# The defaults below are placeholders only — set real values via the env vars.
API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
MODEL_NAME = os.getenv("MODEL_NAME", "meta-llama/Meta-Llama-3-8B-Instruct")
HF_TOKEN = os.getenv("HF_TOKEN", "hf_YOUR_TOKEN")
SPACE_URL = os.getenv("SPACE_URL", "http://localhost:7860")

# OpenAI-compatible client pointed at API_BASE_URL, authenticated with HF_TOKEN.
client = OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN)

# Tasks to evaluate, in order.
TASK_IDS = ["task_easy", "task_medium", "task_hard"]

# System prompt for the debugger agent.
SYSTEM_PROMPT = (
    "You are an expert Python developer and debugger. "
    "You will be shown buggy Python code along with the expected output. "
    "Your job is to return ONLY the corrected Python code — raw Python, "
    "no explanations, no markdown, no code fences (no ```). "
    "The code you return will be executed directly. Make it print the exact expected output."
)
|
| 47 |
+
|
| 48 |
+
|
| 49 |
+
# ── Helper functions ───────────────────────────────────────────────────────────
|
| 50 |
+
|
| 51 |
+
def now_iso() -> str:
    """Return the current UTC time as a timezone-aware ISO-8601 string."""
    utc_now = datetime.now(timezone.utc)
    return utc_now.isoformat()
|
| 53 |
+
|
| 54 |
+
|
| 55 |
+
def reset_task(task_id: str) -> dict:
    """POST /reset for *task_id* and return the decoded JSON response.

    Raises requests.HTTPError when the server answers with a non-2xx status.
    """
    response = requests.post(
        f"{SPACE_URL}/reset",
        json={"task_id": task_id},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
|
| 64 |
+
|
| 65 |
+
|
| 66 |
+
def step_task(session_id: str, action: str) -> dict:
    """POST /step with the candidate fixed code; return the decoded JSON.

    Raises requests.HTTPError when the server answers with a non-2xx status.
    """
    payload = {"session_id": session_id, "action": action}
    response = requests.post(f"{SPACE_URL}/step", json=payload, timeout=30)
    response.raise_for_status()
    return response.json()
|
| 75 |
+
|
| 76 |
+
|
| 77 |
+
def get_fixed_code(observation: str) -> str:
    """
    Call the LLM with the buggy-code observation and return fixed code.

    Uses the module-level OpenAI client configured via API_BASE_URL +
    MODEL_NAME. The returned text is executed directly by the environment,
    so although SYSTEM_PROMPT forbids markdown, a surrounding ``` fence is
    stripped defensively — many chat models emit one anyway, and a fenced
    answer would otherwise score 0 on a perfectly correct fix.
    """
    response = client.chat.completions.create(
        model=MODEL_NAME,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": observation},
        ],
        max_tokens=1000,
        temperature=0.1,
    )
    code = response.choices[0].message.content.strip()

    # Defensive fence stripping: drop the opening ``` line (which may carry a
    # language tag like ```python) and a trailing ``` line if present.
    if code.startswith("```"):
        lines = code.splitlines()[1:]
        if lines and lines[-1].strip() == "```":
            lines = lines[:-1]
        code = "\n".join(lines).strip()

    return code
|
| 92 |
+
|
| 93 |
+
|
| 94 |
+
# ── Core task runner ───────────────────────────────────────────────────────────
|
| 95 |
+
|
| 96 |
+
def run_task(task_id: str) -> dict:
    """
    Run a single task episode from reset until the environment reports done.

    Emits [START], [STEP], [END] logs to stdout (required format).
    Returns a summary dict: {task_id, reward, steps, success}.
    """
    # Reset the environment and open a session.
    reset_data = reset_task(task_id)
    session_id = reset_data["session_id"]
    observation = reset_data["observation"]

    # [START] log — required format
    start_log = {
        "task_id": task_id,
        "session_id": session_id,
        "model": MODEL_NAME,
        "timestamp": now_iso(),
    }
    print(f"[START] {json.dumps(start_log)}", flush=True)

    step_num = 0
    reward = 0.0
    done = False

    # Client-side safety cap: the environment enforces max_steps (5–7 per
    # openenv.yaml), but if a misbehaving server never sets done=True this
    # guard prevents an infinite loop and unbounded LLM calls.
    max_client_steps = 50

    while not done and step_num < max_client_steps:
        step_num += 1

        # Get a candidate fix from the LLM for the current observation.
        action = get_fixed_code(observation)

        # Submit the candidate to the environment and read the transition.
        result = step_task(session_id, action)
        observation = result["observation"]
        reward = result["reward"]
        done = result["done"]

        # [STEP] log — required format
        step_log = {
            "step": step_num,
            "action_chars": len(action),
            "reward": reward,
            "done": done,
            "observation": observation[:200],  # truncated for log readability
        }
        print(f"[STEP] {json.dumps(step_log)}", flush=True)

    # [END] log — required format
    end_log = {
        "task_id": task_id,
        "session_id": session_id,
        "total_reward": reward,
        "steps": step_num,
        "success": reward >= 0.8,
        "timestamp": now_iso(),
    }
    print(f"[END] {json.dumps(end_log)}", flush=True)

    return {"task_id": task_id, "reward": reward, "steps": step_num, "success": reward >= 0.8}
|
| 154 |
+
|
| 155 |
+
|
| 156 |
+
# ── Entry point ────────────────────────────────────────────────────────────────
|
| 157 |
+
|
| 158 |
+
def main():
    """Run every task in TASK_IDS and print a pass/fail summary."""
    print(f"Starting inference — model={MODEL_NAME} space={SPACE_URL}", flush=True)
    print("-" * 60, flush=True)

    results = []
    for task_id in TASK_IDS:
        results.append(run_task(task_id))
        print("-" * 60, flush=True)

    # Summary
    print("\n=== SUMMARY ===")
    total_reward = 0.0
    for outcome in results:
        status = "PASS" if outcome["success"] else "FAIL"
        print(f" [{status}] {outcome['task_id']:15s} reward={outcome['reward']:.2f} steps={outcome['steps']}")
        total_reward += outcome["reward"]
    avg = total_reward / len(results)
    print(f"\n Average reward: {avg:.2f}")
    print("=== END SUMMARY ===")


if __name__ == "__main__":
    main()
|
inference_local.py
ADDED
|
@@ -0,0 +1,193 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
"""
inference_local.py — Local inference for Python Bug Fixer OpenEnv.

This script runs all 3 tasks using pre-written correct solutions
(no external LLM required). Demonstrates the full environment loop.

Usage:
    # 1. Start the server first:
    #    uvicorn app.main:app --host 0.0.0.0 --port 7860
    # 2. Run this script:
    #    python inference_local.py
"""

import json
import requests
from datetime import datetime, timezone

# Environment server address; the local server started via uvicorn (see Usage).
SPACE_URL = "http://localhost:7860"

# ── Pre-written correct solutions for each task ───────────────────────────────
# Maps task_id -> corrected Python source. Each string is submitted verbatim
# as the action; the environment executes it and compares stdout against the
# task's expected output. Inner docstrings in task_hard are escaped (\"\"\")
# because they sit inside the outer triple-quoted literal.

SOLUTIONS = {
    "task_easy": """\
def get_last_element(lst):
    return lst[len(lst) - 1]


def compute_sum(numbers):
    total = 0
    for i in range(len(numbers)):
        total += numbers[i]
    return total


result = get_last_element([1, 2, 3, 4, 5])
print(result)

total = compute_sum([10, 20, 30])
print(total)
""",

    "task_medium": """\
def binary_search(arr, target):
    left, right = 0, len(arr) - 1

    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1

    return -1


arr = [1, 3, 5, 7, 9, 11, 13]
print(binary_search(arr, 7))
print(binary_search(arr, 11))
print(binary_search(arr, 4))
""",

    "task_hard": """\
class DataProcessor:
    \"\"\"Processes a list of employee records.\"\"\"

    def __init__(self):
        self.data = []

    def add_record(self, record: dict):
        self.data.append(record)

    def get_average(self, field: str) -> float:
        \"\"\"Return the average value of a numeric field.\"\"\"
        if not self.data:
            return 0.0
        return sum(r[field] for r in self.data) / len(self.data)

    def filter_records(self, field: str, value):
        \"\"\"Return all records where record[field] == value.\"\"\"
        return [r for r in self.data if r[field] == value]

    def get_sorted(self, field: str, reverse: bool = False):
        \"\"\"Return records sorted by field. reverse=True means descending.\"\"\"
        return sorted(self.data, key=lambda x: x[field], reverse=reverse)

    def get_max(self, field: str) -> dict:
        \"\"\"Return the record with the highest value for field.\"\"\"
        return max(self.data, key=lambda x: x[field])


p = DataProcessor()
p.add_record({"name": "Alice", "score": 85})
p.add_record({"name": "Bob", "score": 92})
p.add_record({"name": "Charlie", "score": 78})

print(round(p.get_average("score"), 1))
print(len(p.filter_records("name", "Alice")))
print(p.get_sorted("score", reverse=True)[0]["name"])
print(p.get_max("score")["name"])
""",
}

# Tasks to run, in order (must match the ids served by the environment).
TASK_IDS = ["task_easy", "task_medium", "task_hard"]
|
| 106 |
+
|
| 107 |
+
|
| 108 |
+
def now_iso() -> str:
    """Return the current UTC time as a timezone-aware ISO-8601 string."""
    current = datetime.now(timezone.utc)
    return current.isoformat()
|
| 110 |
+
|
| 111 |
+
|
| 112 |
+
def run_task(task_id: str) -> dict:
    """Run one task: reset → submit fixed code → check reward."""

    # Reset: open a fresh session for this task.
    reset_resp = requests.post(f"{SPACE_URL}/reset", json={"task_id": task_id}, timeout=30)
    reset_resp.raise_for_status()
    session_id = reset_resp.json()["session_id"]

    start_log = {
        "task_id": task_id,
        "session_id": session_id,
        "model": "local-solver",
        "timestamp": now_iso(),
    }
    print(f"[START] {json.dumps(start_log)}", flush=True)

    # Submit the pre-written correct solution as the single action.
    action = SOLUTIONS[task_id]
    step_resp = requests.post(
        f"{SPACE_URL}/step",
        json={"session_id": session_id, "action": action},
        timeout=30,
    )
    step_resp.raise_for_status()
    result = step_resp.json()

    reward = result["reward"]
    success = reward >= 0.8

    step_log = {
        "step": 1,
        "action_chars": len(action),
        "reward": reward,
        "done": result["done"],
        "observation": result["observation"][:200],
    }
    print(f"[STEP] {json.dumps(step_log)}", flush=True)

    end_log = {
        "task_id": task_id,
        "session_id": session_id,
        "total_reward": reward,
        "steps": 1,
        "success": success,
        "timestamp": now_iso(),
    }
    print(f"[END] {json.dumps(end_log)}", flush=True)

    return {
        "task_id": task_id,
        "reward": reward,
        "steps": 1,
        "success": success,
    }
|
| 164 |
+
|
| 165 |
+
|
| 166 |
+
def main():
    """Run every task with its known-good solution and print a summary."""
    banner = "=" * 60
    print(banner)
    print(" Python Bug Fixer — Local Inference (no LLM needed)")
    print(banner)
    print(f" Server: {SPACE_URL}")
    print(f" Tasks: {', '.join(TASK_IDS)}")
    print("-" * 60)

    results = []
    for task_id in TASK_IDS:
        results.append(run_task(task_id))
        print("-" * 60)

    # Summary
    print("\n=== SUMMARY ===")
    total_reward = 0.0
    for outcome in results:
        status = "✅ PASS" if outcome["success"] else "❌ FAIL"
        print(f" [{status}] {outcome['task_id']:15s} reward={outcome['reward']:.2f} steps={outcome['steps']}")
        total_reward += outcome["reward"]
    avg = total_reward / len(results)
    print(f"\n Average reward: {avg:.2f}")
    print("=== END SUMMARY ===")


if __name__ == "__main__":
    main()
|
openenv.yaml
ADDED
|
@@ -0,0 +1,62 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
name: python-bug-fixer
|
| 2 |
+
version: "1.0.0"
|
| 3 |
+
description: >
|
| 4 |
+
A real-world environment where an AI agent must identify and fix bugs
|
| 5 |
+
in Python code snippets. The agent receives buggy code along with a
|
| 6 |
+
description of expected behavior and must return corrected code that
|
| 7 |
+
runs without errors and produces the correct output.
|
| 8 |
+
Simulates real-world software debugging and code review workflows.
|
| 9 |
+
|
| 10 |
+
observation_space:
|
| 11 |
+
type: text
|
| 12 |
+
description: >
|
| 13 |
+
A buggy Python code snippet with a description of the expected behavior
|
| 14 |
+
and expected stdout output. May contain 1–3 bugs of varying types
|
| 15 |
+
(SyntaxError, IndexError, LogicError).
|
| 16 |
+
|
| 17 |
+
action_space:
|
| 18 |
+
type: text
|
| 19 |
+
description: >
|
| 20 |
+
The corrected Python code as a raw string. Must be valid Python that
|
| 21 |
+
can be executed directly with python3. No markdown, no code fences.
|
| 22 |
+
|
| 23 |
+
tasks:
|
| 24 |
+
- id: task_easy
|
| 25 |
+
name: "Fix Index Errors"
|
| 26 |
+
difficulty: easy
|
| 27 |
+
max_steps: 5
|
| 28 |
+
reward_threshold: 0.5
|
| 29 |
+
description: "Fix 2 off-by-one index errors in a list-processing script."
|
| 30 |
+
|
| 31 |
+
- id: task_medium
|
| 32 |
+
name: "Fix Binary Search Logic"
|
| 33 |
+
difficulty: medium
|
| 34 |
+
max_steps: 5
|
| 35 |
+
reward_threshold: 0.7
|
| 36 |
+
description: "Fix 2 bugs in a binary search implementation (boundary + infinite loop)."
|
| 37 |
+
|
| 38 |
+
- id: task_hard
|
| 39 |
+
name: "Fix DataProcessor Class"
|
| 40 |
+
difficulty: hard
|
| 41 |
+
max_steps: 7
|
| 42 |
+
reward_threshold: 0.8
|
| 43 |
+
description: "Fix 3 bugs across 3 methods of a DataProcessor class."
|
| 44 |
+
|
| 45 |
+
reward_range: [0.0, 1.0]
|
| 46 |
+
reward_description: >
|
| 47 |
+
Reward is computed by running the agent's fixed code in a sandboxed subprocess
|
| 48 |
+
and checking how many expected output strings appear in stdout.
|
| 49 |
+
Each expected output line is worth an equal fraction of 1.0.
|
| 50 |
+
Partial credit is awarded for partially correct fixes.
|
| 51 |
+
|
| 52 |
+
endpoints:
|
| 53 |
+
reset: "POST /reset"
|
| 54 |
+
step: "POST /step"
|
| 55 |
+
state: "GET /state"
|
| 56 |
+
tasks: "GET /tasks"
|
| 57 |
+
health: "GET /health"
|
| 58 |
+
|
| 59 |
+
runtime:
|
| 60 |
+
max_inference_minutes: 20
|
| 61 |
+
max_vcpu: 2
|
| 62 |
+
max_memory_gb: 8
|
requirements.txt
ADDED
|
@@ -0,0 +1,7 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
fastapi>=0.110
uvicorn[standard]>=0.29
pydantic>=2.6
openai>=1.30
PyYAML>=6.0
requests>=2.31
python-multipart>=0.0.9
|