Spaces:
Running
Running
Upload folder using huggingface_hub
Browse files- Dockerfile +17 -0
- README.md +201 -10
- __init__.py +2 -0
- client.py +51 -0
- env/__init__.py +1 -0
- env/environment.py +265 -0
- env/models.py +60 -0
- env/tasks.py +277 -0
- environment.py +3 -0
- inference.py +173 -0
- models.py +4 -0
- openenv.yaml +10 -0
- pyproject.toml +23 -0
- requirements.txt +2 -0
- server/__init__.py +2 -0
- server/app.py +24 -0
- tasks.py +4 -0
- uv.lock +15 -0
Dockerfile
ADDED
|
@@ -0,0 +1,17 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Lean runtime image for the openenv-productivity benchmark.
FROM python:3.11-slim

# Keep the container tidy and logs unbuffered.
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

WORKDIR /app

# Install dependencies first so Docker layer caching survives code changes.
COPY requirements.txt /app/requirements.txt
RUN pip install --no-cache-dir -r /app/requirements.txt

# Copy only what the runtime needs.
COPY env /app/env
COPY inference.py /app/inference.py
COPY openenv.yaml /app/openenv.yaml
COPY README.md /app/README.md

ENV ENABLE_WEB_INTERFACE=true
CMD ["python", "inference.py"]
|
README.md
CHANGED
|
@@ -1,10 +1,201 @@
|
|
| 1 |
-
---
|
| 2 |
-
title:
|
| 3 |
-
emoji:
|
| 4 |
-
colorFrom:
|
| 5 |
-
colorTo:
|
| 6 |
-
sdk: docker
|
| 7 |
-
pinned: false
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
---
|
| 2 |
+
title: openenv-productivity
|
| 3 |
+
emoji: 🚀
|
| 4 |
+
colorFrom: blue
|
| 5 |
+
colorTo: green
|
| 6 |
+
sdk: docker
|
| 7 |
+
pinned: false
|
| 8 |
+
base_path: /web
|
| 9 |
+
---
|
| 10 |
+
|
| 11 |
+
# OpenEnv Productivity Benchmark
|
| 12 |
+
|
| 13 |
+
`openenv-productivity` is a deterministic reinforcement learning benchmark with three productivity-oriented tasks designed for the OpenEnv RL Challenge. The environment is intentionally small, reproducible, and easy to validate while still rewarding iterative improvement over multiple steps.
|
| 14 |
+
|
| 15 |
+
## Environment Overview
|
| 16 |
+
|
| 17 |
+
The benchmark exposes a single environment class: `ProductivityEnvironment` in `env/environment.py`.
|
| 18 |
+
|
| 19 |
+
Implemented API:
|
| 20 |
+
|
| 21 |
+
- `reset(task_name="easy") -> Observation`
|
| 22 |
+
- `step(action) -> (Observation, Reward, done, info)`
|
| 23 |
+
- `state() -> Observation`
|
| 24 |
+
|
| 25 |
+
The environment is deterministic:
|
| 26 |
+
|
| 27 |
+
- No randomness is used anywhere in task generation or grading.
|
| 28 |
+
- All tasks use fixed payloads and fixed graders.
|
| 29 |
+
- Reward shaping is stable and repeatable for identical action sequences.
|
| 30 |
+
|
| 31 |
+
## Observation Space
|
| 32 |
+
|
| 33 |
+
Observations are validated with Pydantic through the `Observation` model in `env/models.py`.
|
| 34 |
+
|
| 35 |
+
Observation fields:
|
| 36 |
+
|
| 37 |
+
- `benchmark`: benchmark name
|
| 38 |
+
- `task_name`: `easy`, `medium`, or `hard`
|
| 39 |
+
- `instruction`: natural-language task instruction
|
| 40 |
+
- `payload`: task data and target schema
|
| 41 |
+
- `action_format`: supported action patterns
|
| 42 |
+
- `step_count`: current step index
|
| 43 |
+
- `max_steps`: maximum allowed steps
|
| 44 |
+
- `best_score`: best score seen so far in the episode
|
| 45 |
+
- `last_action`: previous action string
|
| 46 |
+
- `last_feedback`: deterministic grader feedback
|
| 47 |
+
- `done`: terminal flag
|
| 48 |
+
|
| 49 |
+
## Action Space
|
| 50 |
+
|
| 51 |
+
Actions are validated with Pydantic through the `Action` model.
|
| 52 |
+
|
| 53 |
+
Supported actions:
|
| 54 |
+
|
| 55 |
+
- `inspect`
|
| 56 |
+
- `propose:{"field":"value"}`
|
| 57 |
+
- `final:{"field":"value"}`
|
| 58 |
+
|
| 59 |
+
Notes:
|
| 60 |
+
|
| 61 |
+
- `inspect` lets an agent spend a step rereading the state, but it still incurs the step penalty.
|
| 62 |
+
- `propose:` is useful for incremental reward collection.
|
| 63 |
+
- `final:` ends the episode immediately, even if the answer is incomplete.
|
| 64 |
+
- Malformed actions and malformed JSON receive deterministic penalties.
|
| 65 |
+
|
| 66 |
+
## Reward Logic
|
| 67 |
+
|
| 68 |
+
Rewards are represented by the Pydantic `Reward` model.
|
| 69 |
+
|
| 70 |
+
Per-step reward includes:
|
| 71 |
+
|
| 72 |
+
- positive delta when a proposal improves over the previous best score
|
| 73 |
+
- partial credit from field-level grading
|
| 74 |
+
- `-0.02` step penalty on every step
|
| 75 |
+
- wrong-answer penalty when a submission regresses or scores zero
|
| 76 |
+
- malformed action penalty for invalid actions or invalid JSON
|
| 77 |
+
- loop penalty when the same action is repeated consecutively
|
| 78 |
+
|
| 79 |
+
This shaping discourages loops, rewards iterative progress, and stays deterministic.
|
| 80 |
+
|
| 81 |
+
## Tasks
|
| 82 |
+
|
| 83 |
+
Exactly three tasks are included.
|
| 84 |
+
|
| 85 |
+
### 1. Easy: Email Classification
|
| 86 |
+
|
| 87 |
+
Goal:
|
| 88 |
+
|
| 89 |
+
- classify an email into `label`
|
| 90 |
+
- assign `priority`
|
| 91 |
+
- determine `needs_reply`
|
| 92 |
+
|
| 93 |
+
Edge cases handled:
|
| 94 |
+
|
| 95 |
+
- sender intent outweighs superficial phrasing
|
| 96 |
+
- reply detection is based on explicit requested action
|
| 97 |
+
- constrained label and priority vocabularies
|
| 98 |
+
|
| 99 |
+
Expected strong baseline score:
|
| 100 |
+
|
| 101 |
+
- `1.00` within 1 to 2 steps
|
| 102 |
+
|
| 103 |
+
### 2. Medium: Calendar Scheduling
|
| 104 |
+
|
| 105 |
+
Goal:
|
| 106 |
+
|
| 107 |
+
- schedule a 60-minute meeting for all required participants
|
| 108 |
+
- avoid lunch and blocked windows
|
| 109 |
+
- select a room with enough capacity
|
| 110 |
+
|
| 111 |
+
Edge cases handled:
|
| 112 |
+
|
| 113 |
+
- partial overlap is not sufficient
|
| 114 |
+
- room capacity must satisfy participant count
|
| 115 |
+
- blocked windows override individual availability
|
| 116 |
+
|
| 117 |
+
Expected strong baseline score:
|
| 118 |
+
|
| 119 |
+
- `1.00` within 1 to 3 steps
|
| 120 |
+
|
| 121 |
+
### 3. Hard: Data Cleaning
|
| 122 |
+
|
| 123 |
+
Goal:
|
| 124 |
+
|
| 125 |
+
- clean a tabular dataset with duplicate IDs and malformed emails
|
| 126 |
+
- keep first duplicate occurrence only
|
| 127 |
+
- compute a normalized total from retained rows
|
| 128 |
+
|
| 129 |
+
Edge cases handled:
|
| 130 |
+
|
| 131 |
+
- whitespace trimming before validation
|
| 132 |
+
- duplicate handling before retention
|
| 133 |
+
- numeric normalization to two decimals
|
| 134 |
+
|
| 135 |
+
Expected strong baseline score:
|
| 136 |
+
|
| 137 |
+
- `1.00` within 2 to 4 steps
|
| 138 |
+
|
| 139 |
+
## Setup
|
| 140 |
+
|
| 141 |
+
### Local
|
| 142 |
+
|
| 143 |
+
```bash
|
| 144 |
+
python -m venv .venv
|
| 145 |
+
. .venv/bin/activate
|
| 146 |
+
pip install -r requirements.txt
|
| 147 |
+
export API_BASE_URL="https://your-openai-compatible-endpoint/v1"
|
| 148 |
+
export MODEL_NAME="your-model"
|
| 149 |
+
export HF_TOKEN="your-token"
|
| 150 |
+
python inference.py --task easy
|
| 151 |
+
```
|
| 152 |
+
|
| 153 |
+
On Windows PowerShell:
|
| 154 |
+
|
| 155 |
+
```powershell
|
| 156 |
+
python -m venv .venv
|
| 157 |
+
.venv\Scripts\Activate.ps1
|
| 158 |
+
pip install -r requirements.txt
|
| 159 |
+
$env:API_BASE_URL="https://your-openai-compatible-endpoint/v1"
|
| 160 |
+
$env:MODEL_NAME="your-model"
|
| 161 |
+
$env:HF_TOKEN="your-token"
|
| 162 |
+
python inference.py --task easy
|
| 163 |
+
```
|
| 164 |
+
|
| 165 |
+
## Inference Output Contract
|
| 166 |
+
|
| 167 |
+
`inference.py` emits only these lines:
|
| 168 |
+
|
| 169 |
+
```text
|
| 170 |
+
[START] task=<task_name> env=<benchmark> model=<model_name>
|
| 171 |
+
[STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
|
| 172 |
+
[END] success=<true|false> steps=<n> rewards=<r1,r2,...,rn>
|
| 173 |
+
```
|
| 174 |
+
|
| 175 |
+
Compliance guarantees:
|
| 176 |
+
|
| 177 |
+
- reward formatted to two decimals
|
| 178 |
+
- lowercase booleans
|
| 179 |
+
- no extra blank lines
|
| 180 |
+
- `[END]` is always printed, including on failure
|
| 181 |
+
- max steps capped at five
|
| 182 |
+
|
| 183 |
+
## Docker
|
| 184 |
+
|
| 185 |
+
Build:
|
| 186 |
+
|
| 187 |
+
```bash
|
| 188 |
+
docker build -t openenv-productivity .
|
| 189 |
+
```
|
| 190 |
+
|
| 191 |
+
Run:
|
| 192 |
+
|
| 193 |
+
```bash
|
| 194 |
+
docker run --rm \
|
| 195 |
+
-e API_BASE_URL="https://your-openai-compatible-endpoint/v1" \
|
| 196 |
+
-e MODEL_NAME="your-model" \
|
| 197 |
+
-e HF_TOKEN="your-token" \
|
| 198 |
+
openenv-productivity
|
| 199 |
+
```
|
| 200 |
+
|
| 201 |
+
The container is intentionally lean and suitable for a 2 CPU / 8 GB RAM runtime.
|
__init__.py
ADDED
|
@@ -0,0 +1,2 @@
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""OpenEnv productivity environment package root."""
|
| 2 |
+
|
client.py
ADDED
|
@@ -0,0 +1,51 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from __future__ import annotations
|
| 2 |
+
|
| 3 |
+
import json
|
| 4 |
+
import os
|
| 5 |
+
import re
|
| 6 |
+
from typing import Any
|
| 7 |
+
|
| 8 |
+
from openai import OpenAI
|
| 9 |
+
|
| 10 |
+
|
| 11 |
+
def _compact(text: str) -> str:
|
| 12 |
+
return re.sub(r"\s+", " ", text).strip()
|
| 13 |
+
|
| 14 |
+
|
| 15 |
+
class ProductivityClient:
    """Minimal OpenEnv client wrapper for remote action generation."""

    def __init__(self) -> None:
        """Read endpoint configuration from the environment and build the client.

        Raises:
            ValueError: if API_BASE_URL, MODEL_NAME, or HF_TOKEN is unset/empty.
        """
        base_url = os.getenv("API_BASE_URL")
        model = os.getenv("MODEL_NAME")
        hf_token = os.getenv("HF_TOKEN")

        # Fail fast with a distinct message per missing credential.
        if not base_url:
            raise ValueError("missing API_BASE_URL")
        if not model:
            raise ValueError("missing MODEL_NAME")
        if not hf_token:
            raise ValueError("missing HF_TOKEN")

        self.model_name = model
        self.client = OpenAI(base_url=base_url, api_key=hf_token)

    def act(self, observation: Any) -> str:
        """Ask the remote model for exactly one action line for *observation*.

        The observation is serialized canonically (sorted keys, no spaces) so
        identical observations always produce identical prompts.  Returns a
        whitespace-compacted single line, falling back to "inspect" when the
        model returns nothing usable.
        """
        serialized = json.dumps(observation, sort_keys=True, separators=(",", ":"))
        completion = self.client.chat.completions.create(
            model=self.model_name,
            temperature=0,
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Reply with exactly one line and no explanation. "
                        "Allowed formats: inspect, propose:{...}, final:{...}."
                    ),
                },
                {"role": "user", "content": serialized},
            ],
        )
        text = completion.choices[0].message.content or ""
        first_line = text.splitlines()[0].strip() if text else ""
        # A harmless "inspect" keeps the episode moving on an empty reply.
        return _compact(first_line) if first_line else "inspect"
|
env/__init__.py
ADDED
|
@@ -0,0 +1 @@
|
|
|
|
|
|
|
| 1 |
+
"""Environment package for openenv-productivity."""
|
env/environment.py
ADDED
|
@@ -0,0 +1,265 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from __future__ import annotations
|
| 2 |
+
|
| 3 |
+
import json
|
| 4 |
+
from decimal import Decimal, ROUND_HALF_UP
|
| 5 |
+
from typing import Any, Dict, Optional, Tuple
|
| 6 |
+
|
| 7 |
+
from env.models import Action, Observation, Reward, StepInfo
|
| 8 |
+
from env.tasks import get_task, task_names
|
| 9 |
+
|
| 10 |
+
|
| 11 |
+
class ProductivityEnvironment:
    """Deterministic RL environment over the three productivity tasks.

    Public API:
        - ``reset(task_name)`` starts a new episode and returns the first
          ``Observation``.
        - ``step(action)`` applies one action and returns
          ``(observation, reward, done, info)``.
        - ``state()`` returns the current observation without side effects.

    No randomness is used anywhere, so identical action sequences always
    yield identical rewards and observations.
    """

    benchmark_name = "openenv-productivity"

    def __init__(self, max_steps: int = 5) -> None:
        """Create the environment with the default "easy" task loaded.

        Args:
            max_steps: Global episode-length cap; the effective cap is the
                minimum of this value and the active task's own limit.
        """
        self._default_max_steps = max_steps
        self._task_name = "easy"
        self._task = get_task(self._task_name)
        self._step_count = 0
        self._done = False
        self._best_score = 0.0
        self._last_action: Optional[str] = None
        self._prior_action: Optional[str] = None
        self._last_feedback: Optional[str] = None
        self._last_reward = self._zero_reward("Environment initialized.")

    @staticmethod
    def _zero_reward(explanation: str) -> Reward:
        """Build an all-zero Reward carrying *explanation* (init/reset helper)."""
        return Reward(
            value=0.0,
            score=0.0,
            delta=0.0,
            step_penalty=0.0,
            wrong_answer_penalty=0.0,
            loop_penalty=0.0,
            malformed_penalty=0.0,
            explanation=explanation,
        )

    @property
    def max_steps(self) -> int:
        """Effective step cap: the stricter of the global and task limits."""
        return min(self._default_max_steps, self._task.max_steps)

    def available_tasks(self) -> list[str]:
        """Return the names of all registered tasks."""
        return task_names()

    def reset(self, task_name: str = "easy") -> Observation:
        """Start a fresh episode on *task_name* and return its first observation.

        Propagates whatever ``get_task`` raises for an unknown task name.
        """
        self._task_name = task_name
        self._task = get_task(task_name)
        self._step_count = 0
        self._done = False
        self._best_score = 0.0
        self._last_action = None
        self._prior_action = None
        self._last_feedback = "Environment reset."
        self._last_reward = self._zero_reward("Environment reset.")
        return self.state()

    def state(self) -> Observation:
        """Snapshot the current episode state as a validated Observation."""
        return Observation(
            benchmark=self.benchmark_name,
            task_name=self._task.name,
            instruction=self._task.instruction,
            payload={
                "data": self._task.public_payload(),
                "schema": self._task.public_schema(),
            },
            action_format=[
                "inspect",
                'propose:{"field":"value"}',
                'final:{"field":"value"}',
            ],
            step_count=self._step_count,
            max_steps=self.max_steps,
            best_score=self._best_score,
            last_action=self._last_action,
            last_feedback=self._last_feedback,
            done=self._done,
        )

    def step(self, action: Any) -> Tuple[Observation, Reward, bool, Dict[str, Any]]:
        """Apply one action and return ``(observation, reward, done, info)``.

        Handles, in order: steps after completion, unparseable actions,
        ``inspect``, then graded ``propose``/``final`` submissions.
        """
        if self._done:
            # Stepping a finished episode is flagged and penalized, never regraded.
            info = StepInfo(
                current_score=self._best_score,
                best_score=self._best_score,
                terminated_by="already_done",
                error="step called after completion",
            )
            reward = Reward(
                value=-0.1,
                score=self._best_score,
                delta=0.0,
                step_penalty=-0.1,
                wrong_answer_penalty=0.0,
                loop_penalty=0.0,
                malformed_penalty=0.0,
                explanation="Step rejected because the episode is already done.",
            )
            self._last_reward = reward
            self._last_feedback = reward.explanation
            return self.state(), reward, True, info.model_dump()

        parsed_action, action_error = self._coerce_action(action)
        self._step_count += 1

        if action_error is not None or parsed_action is None:
            reward = self._build_reward(
                current_score=self._best_score,
                previous_best=self._best_score,
                malformed_penalty=-0.25,
                explanation=f"Malformed action: {action_error}",
            )
            self._last_feedback = reward.explanation
            self._maybe_finish()
            info = StepInfo(
                current_score=self._best_score,
                best_score=self._best_score,
                terminated_by="max_steps" if self._done else None,
                error=action_error,
            )
            return self.state(), reward, self._done, info.model_dump()

        if parsed_action.raw == "inspect":
            self._prior_action = self._last_action
            self._last_action = parsed_action.raw
            reward = self._build_reward(
                current_score=self._best_score,
                previous_best=self._best_score,
                explanation="Inspection used. No new answer submitted.",
            )
            self._last_feedback = reward.explanation
            self._maybe_finish()
            info = StepInfo(
                parsed_action={"type": "inspect"},
                current_score=self._best_score,
                best_score=self._best_score,
                terminated_by="max_steps" if self._done else None,
            )
            return self.state(), reward, self._done, info.model_dump()

        action_type, candidate = self._parse_payload_action(parsed_action.raw)
        self._prior_action = self._last_action
        self._last_action = parsed_action.raw

        if candidate is None:
            reward = self._build_reward(
                current_score=self._best_score,
                previous_best=self._best_score,
                malformed_penalty=-0.25,
                explanation="Malformed JSON payload in action.",
            )
            self._last_feedback = reward.explanation
            self._maybe_finish()
            info = StepInfo(
                parsed_action={"type": action_type},
                current_score=self._best_score,
                best_score=self._best_score,
                terminated_by="max_steps" if self._done else None,
                error="invalid_json_payload",
            )
            return self.state(), reward, self._done, info.model_dump()

        current_score, components = self._task.grade_submission(candidate)
        previous_best = self._best_score
        if current_score > self._best_score:
            self._best_score = current_score

        # Penalize regressions harder than flat-zero submissions.
        wrong_answer_penalty = 0.0
        if current_score < previous_best:
            wrong_answer_penalty = -0.15
        elif current_score == 0.0:
            wrong_answer_penalty = -0.05

        explanation = (
            f"Submitted {action_type} with score {current_score:.2f}. "
            f"Components: {json.dumps(components, sort_keys=True)}."
        )
        reward = self._build_reward(
            current_score=current_score,
            previous_best=previous_best,
            wrong_answer_penalty=wrong_answer_penalty,
            explanation=explanation,
        )
        self._last_feedback = explanation

        # Termination: explicit final action, perfect score, or step budget.
        terminated_by = None
        if action_type == "final":
            self._done = True
            terminated_by = "final_action"
        elif self._best_score >= 1.0:
            self._done = True
            terminated_by = "perfect_score"
        else:
            self._maybe_finish()
            if self._done:
                terminated_by = "max_steps"

        info = StepInfo(
            parsed_action={"type": action_type, "candidate": candidate, "components": components},
            current_score=current_score,
            best_score=self._best_score,
            terminated_by=terminated_by,
        )
        return self.state(), reward, self._done, info.model_dump()

    def _coerce_action(self, action: Any) -> Tuple[Optional[Action], Optional[str]]:
        """Best-effort conversion of arbitrary input into a validated Action.

        Returns ``(action, None)`` on success or ``(None, error_message)``
        when pydantic validation rejects the value.
        """
        try:
            if isinstance(action, Action):
                return action, None
            if isinstance(action, dict) and "raw" in action:
                return Action.model_validate(action), None
            return Action(raw=str(action)), None
        except Exception as exc:
            return None, str(exc)

    def _parse_payload_action(self, raw: str) -> Tuple[str, Optional[Dict[str, Any]]]:
        """Split a ``propose:``/``final:`` action into its type and JSON object.

        Returns ``(action_type, None)`` when the payload is not a JSON object.
        BUGFIX: previously any raw string not starting with "propose:" fell
        into the "final:" branch and was sliced at the wrong offset (e.g.
        "inspection", which passes the Action prefix validator). Unrecognized
        prefixes are now reported as ``("invalid", None)`` and graded as
        malformed by the caller.
        """
        if raw.startswith("propose:"):
            action_type = "propose"
            payload_text = raw[len("propose:") :]
        elif raw.startswith("final:"):
            action_type = "final"
            payload_text = raw[len("final:") :]
        else:
            return "invalid", None

        try:
            parsed = json.loads(payload_text)
        except json.JSONDecodeError:
            return action_type, None

        if not isinstance(parsed, dict):
            return action_type, None
        return action_type, parsed

    def _build_reward(
        self,
        current_score: float,
        previous_best: float,
        explanation: str,
        wrong_answer_penalty: float = 0.0,
        malformed_penalty: float = 0.0,
    ) -> Reward:
        """Assemble the shaped Reward for this step and cache it.

        value = improvement delta + step penalty + wrong-answer penalty
        + loop penalty + malformed penalty, rounded to 2 decimals
        (ROUND_HALF_UP) and clamped to [-1, 1].
        """
        delta = max(current_score - previous_best, 0.0)
        step_penalty = -0.02
        # Repeating the immediately preceding action is discouraged.
        loop_penalty = -0.05 if self._last_action is not None and self._last_action == self._prior_action else 0.0
        value = delta + step_penalty + wrong_answer_penalty + loop_penalty + malformed_penalty
        value = float(Decimal(str(value)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))
        reward = Reward(
            value=max(-1.0, min(1.0, value)),
            score=current_score,
            delta=float(Decimal(str(delta)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)),
            step_penalty=step_penalty,
            wrong_answer_penalty=wrong_answer_penalty,
            loop_penalty=loop_penalty,
            malformed_penalty=malformed_penalty,
            explanation=explanation,
        )
        self._last_reward = reward
        return reward

    def _maybe_finish(self) -> None:
        """Mark the episode done once the step budget is exhausted."""
        if self._step_count >= self.max_steps:
            self._done = True
|
env/models.py
ADDED
|
@@ -0,0 +1,60 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from __future__ import annotations
|
| 2 |
+
|
| 3 |
+
from typing import Any, Dict, List, Literal, Optional
|
| 4 |
+
|
| 5 |
+
from pydantic import BaseModel, ConfigDict, Field, field_validator
|
| 6 |
+
|
| 7 |
+
|
| 8 |
+
class Action(BaseModel):
    """Agent action: one raw string in an allowed format.

    Whitespace is stripped on input (``str_strip_whitespace=True``) and
    unknown fields are rejected (``extra="forbid"``).
    """

    model_config = ConfigDict(extra="forbid", str_strip_whitespace=True)

    # Raw action text, e.g. "inspect" or 'propose:{"label":"work"}'.
    raw: str = Field(..., min_length=1, max_length=4000)

    @field_validator("raw")
    @classmethod
    def validate_action_prefix(cls, value: str) -> str:
        """Reject values that do not start with a recognized action prefix.

        NOTE(review): this is a prefix check only — a string like
        "inspection" passes validation; downstream parsing must handle such
        near-misses.
        """
        allowed_prefixes = ("inspect", "propose:", "final:")
        if not value.startswith(allowed_prefixes):
            raise ValueError(
                "action must start with 'inspect', 'propose:', or 'final:'"
            )
        return value
|
| 22 |
+
|
| 23 |
+
|
| 24 |
+
class Reward(BaseModel):
    """Per-step reward breakdown; ``value`` is the clamped sum of components.

    All penalty fields are non-positive; ``score`` and the bounds on ``value``
    mirror the environment's shaping (see ProductivityEnvironment._build_reward).
    """

    model_config = ConfigDict(extra="forbid")

    # Total shaped reward for the step, clamped to [-1, 1].
    value: float = Field(..., ge=-1.0, le=1.0)
    # Grader score of the submission evaluated this step.
    score: float = Field(..., ge=0.0, le=1.0)
    # Improvement over the episode's previous best score (never negative in practice).
    delta: float = Field(..., ge=-1.0, le=1.0)
    # Flat cost charged on every step.
    step_penalty: float = Field(..., ge=-1.0, le=0.0)
    # Cost for a regressing or zero-scoring submission.
    wrong_answer_penalty: float = Field(..., ge=-1.0, le=0.0)
    # Cost for repeating the immediately preceding action.
    loop_penalty: float = Field(..., ge=-1.0, le=0.0)
    # Cost for an invalid action string or invalid JSON payload.
    malformed_penalty: float = Field(..., ge=-1.0, le=0.0)
    # Human-readable account of how the reward was composed.
    explanation: str = Field(..., min_length=1, max_length=500)
|
| 35 |
+
|
| 36 |
+
|
| 37 |
+
class Observation(BaseModel):
    """Validated snapshot of the episode state returned by reset/step/state."""

    model_config = ConfigDict(extra="forbid")

    # Benchmark identifier (e.g. "openenv-productivity").
    benchmark: str
    # Active task; the benchmark defines exactly these three difficulties.
    task_name: Literal["easy", "medium", "hard"]
    # Natural-language task instruction.
    instruction: str
    # Task data plus the target answer schema.
    payload: Dict[str, Any]
    # Supported action patterns, e.g. 'propose:{"field":"value"}'.
    action_format: List[str]
    # Steps consumed so far in the episode.
    step_count: int = Field(..., ge=0)
    # Effective episode step cap.
    max_steps: int = Field(..., ge=1)
    # Best grader score achieved so far this episode.
    best_score: float = Field(..., ge=0.0, le=1.0)
    # Previous action string, if any.
    last_action: Optional[str] = None
    # Deterministic grader feedback for the previous action, if any.
    last_feedback: Optional[str] = None
    # Terminal flag.
    done: bool = False
|
| 51 |
+
|
| 52 |
+
|
| 53 |
+
class StepInfo(BaseModel):
    """Auxiliary per-step diagnostics returned in step()'s ``info`` dict."""

    model_config = ConfigDict(extra="forbid")

    # Parsed action details ({"type": ..., "candidate": ..., "components": ...}),
    # when the action could be parsed.
    parsed_action: Optional[Dict[str, Any]] = None
    # Grader score of this step's submission.
    current_score: float = Field(..., ge=0.0, le=1.0)
    # Best grader score achieved so far this episode.
    best_score: float = Field(..., ge=0.0, le=1.0)
    # Why the episode ended, when it did (e.g. "final_action", "max_steps").
    terminated_by: Optional[str] = None
    # Error description for rejected or malformed steps.
    error: Optional[str] = None
|
env/tasks.py
ADDED
|
@@ -0,0 +1,277 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from __future__ import annotations
|
| 2 |
+
|
| 3 |
+
import copy
|
| 4 |
+
import json
|
| 5 |
+
from dataclasses import dataclass
|
| 6 |
+
from decimal import Decimal, InvalidOperation, ROUND_HALF_UP
|
| 7 |
+
from typing import Any, Dict, List, Tuple
|
| 8 |
+
|
| 9 |
+
|
| 10 |
+
def _normalize_text(value: Any) -> str:
|
| 11 |
+
return " ".join(str(value).strip().lower().split())
|
| 12 |
+
|
| 13 |
+
|
| 14 |
+
def _normalize_bool(value: Any) -> str:
    """Map common truthy/falsy spellings to "yes"/"no"; pass others through.

    Unrecognized values are returned in text-normalized form so the grader's
    exact-match comparison stays deterministic.
    """
    text = " ".join(str(value).strip().lower().split())
    truthy = {"yes", "true", "1", "reply", "needed"}
    falsy = {"no", "false", "0", "none", "not needed"}
    if text in truthy:
        return "yes"
    if text in falsy:
        return "no"
    return text
|
| 21 |
+
|
| 22 |
+
|
| 23 |
+
def _normalize_date(value: Any) -> str:
    """Normalize a date string: lowercase, collapse whitespace, '/' -> '-'."""
    collapsed = " ".join(str(value).strip().lower().split())
    return collapsed.replace("/", "-")
|
| 25 |
+
|
| 26 |
+
|
| 27 |
+
def _normalize_time(value: Any) -> str:
    """Normalize a time string; expand a bare "HHMM" digit run to "HH:MM"."""
    text = " ".join(str(value).strip().lower().split())
    # Only exactly four colon-free digits (e.g. "0930") are rewritten;
    # everything else (e.g. "9:30", "930") passes through unchanged.
    if len(text) == 4 and ":" not in text and text.isdigit():
        return f"{text[:2]}:{text[2:]}"
    return text
|
| 32 |
+
|
| 33 |
+
|
| 34 |
+
def _normalize_list(values: Any) -> List[str]:
    """Return sorted, de-duplicated, text-normalized entries; [] if not a list.

    Blank entries (empty after stripping) are dropped before normalization.
    """
    if not isinstance(values, list):
        return []
    normalized = set()
    for item in values:
        if str(item).strip():
            normalized.add(" ".join(str(item).strip().lower().split()))
    return sorted(normalized)
|
| 38 |
+
|
| 39 |
+
|
| 40 |
+
def _normalize_decimal(value: Any) -> str:
|
| 41 |
+
try:
|
| 42 |
+
decimal = Decimal(str(value)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
|
| 43 |
+
except (InvalidOperation, TypeError, ValueError):
|
| 44 |
+
return ""
|
| 45 |
+
return format(decimal, ".2f")
|
| 46 |
+
|
| 47 |
+
|
| 48 |
+
def _exact_match(candidate: Any, expected: Any, normalizer) -> float:
|
| 49 |
+
return 1.0 if normalizer(candidate) == normalizer(expected) else 0.0
|
| 50 |
+
|
| 51 |
+
|
| 52 |
+
def _score_list(candidate: Any, expected: List[str]) -> float:
    """Fraction of expected entries present in candidate (order-insensitive).

    An empty expected list scores 1.0 only when the candidate is also empty.
    NOTE(review): extra (unexpected) entries in the candidate are not
    penalized — this is recall-style partial credit, matching the original
    grader's behavior.
    """
    def as_normalized_set(values: Any) -> set:
        # Mirrors _normalize_list: non-lists become empty, blanks dropped,
        # entries lowercased with collapsed whitespace, de-duplicated.
        if not isinstance(values, list):
            return set()
        return {
            " ".join(str(v).strip().lower().split())
            for v in values
            if str(v).strip()
        }

    actual = as_normalized_set(candidate)
    target = as_normalized_set(expected)
    if not target:
        return 1.0 if not actual else 0.0
    return len(actual & target) / len(target)
|
| 58 |
+
|
| 59 |
+
|
| 60 |
+
@dataclass(frozen=True)
class TaskSpec:
    """Immutable definition of one benchmark task plus its grading rubric."""

    # Registry key; also selects the grading branch in grade_submission.
    name: str
    # Difficulty tier label ("easy" / "medium" / "hard").
    difficulty: str
    # Prompt text describing the task and the required output JSON keys.
    instruction: str
    # Task inputs exposed to the agent (deep-copied by public_payload).
    payload: Dict[str, Any]
    # Key -> type-description map for the expected JSON answer.
    schema: Dict[str, str]
    # Ground-truth answer consulted by grade_submission.
    expected: Dict[str, Any]
    # Step budget advertised for this task.
    max_steps: int

    def grade_submission(self, candidate: Dict[str, Any]) -> Tuple[float, Dict[str, float]]:
        """Score a candidate answer dict against the expected answer.

        Returns (total, components): total is the weight-averaged score
        rounded half-up to two decimals; components maps each graded field
        to its individual score in [0, 1]. A non-dict candidate scores 0.0
        on every expected field.
        """
        if not isinstance(candidate, dict):
            return 0.0, {key: 0.0 for key in self.expected.keys()}

        if self.name == "easy":
            # Email classification: label dominates the grade.
            components = {
                "label": _exact_match(candidate.get("label"), self.expected["label"], _normalize_text),
                "priority": _exact_match(candidate.get("priority"), self.expected["priority"], _normalize_text),
                "needs_reply": _exact_match(
                    candidate.get("needs_reply"), self.expected["needs_reply"], _normalize_bool
                ),
            }
            weights = {"label": 0.6, "priority": 0.2, "needs_reply": 0.2}
        elif self.name == "medium":
            # Meeting scheduling: five equally weighted fields.
            components = {
                "day": _exact_match(candidate.get("day"), self.expected["day"], _normalize_date),
                "start": _exact_match(candidate.get("start"), self.expected["start"], _normalize_time),
                "end": _exact_match(candidate.get("end"), self.expected["end"], _normalize_time),
                "participants": _score_list(candidate.get("participants"), self.expected["participants"]),
                "room": _exact_match(candidate.get("room"), self.expected["room"], _normalize_text),
            }
            weights = {
                "day": 0.2,
                "start": 0.2,
                "end": 0.2,
                "participants": 0.2,
                "room": 0.2,
            }
        else:
            # "hard" (and any other name) uses the data-cleaning rubric.
            components = {
                "valid_rows": _exact_match(candidate.get("valid_rows"), self.expected["valid_rows"], _normalize_text),
                "duplicate_ids": _score_list(candidate.get("duplicate_ids"), self.expected["duplicate_ids"]),
                "invalid_emails": _score_list(candidate.get("invalid_emails"), self.expected["invalid_emails"]),
                "normalized_total": _exact_match(
                    candidate.get("normalized_total"), self.expected["normalized_total"], _normalize_decimal
                ),
                "retained_ids": _score_list(candidate.get("retained_ids"), self.expected["retained_ids"]),
            }
            weights = {
                "valid_rows": 0.2,
                "duplicate_ids": 0.2,
                "invalid_emails": 0.2,
                "normalized_total": 0.2,
                "retained_ids": 0.2,
            }

        # Weighted sum, then round half-up to two decimals so scores are
        # deterministic across float representations.
        score = sum(components[key] * weights[key] for key in weights)
        score = float(Decimal(str(score)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))
        return score, components

    def public_payload(self) -> Dict[str, Any]:
        """Deep copy of the payload so callers cannot mutate the spec."""
        return copy.deepcopy(self.payload)

    def public_schema(self) -> Dict[str, str]:
        """Deep copy of the answer schema so callers cannot mutate the spec."""
        return copy.deepcopy(self.schema)
|
| 125 |
+
|
| 126 |
+
|
| 127 |
+
# Registry of all benchmark tasks, keyed by difficulty name. Specs are
# frozen dataclasses; the payloads are deterministic fixtures graded
# against the `expected` answers below.
TASKS: Dict[str, TaskSpec] = {
    # easy: single-email classification (label / priority / needs_reply).
    "easy": TaskSpec(
        name="easy",
        difficulty="easy",
        instruction=(
            "Classify the email into a label, priority, and whether it needs a reply. "
            "Output JSON with keys: label, priority, needs_reply. "
            "Allowed labels: work, personal, spam, finance. "
            "Allowed priorities: low, normal, high. "
            "needs_reply must be yes or no."
        ),
        payload={
            "email": {
                "from": "billing@northstarbank.example",
                "subject": "Invoice 88421 available in your portal",
                "body": (
                    "Hello, your March statement is now available. "
                    "No action is needed unless you notice an error."
                ),
            },
            "edge_cases": [
                "Promotional language should not override sender intent.",
                "If no response is requested, needs_reply should be no.",
                "Classification must be based on content, not guesswork.",
            ],
        },
        schema={"label": "string", "priority": "string", "needs_reply": "string"},
        expected={"label": "finance", "priority": "normal", "needs_reply": "no"},
        max_steps=5,
    ),
    # medium: constrained meeting scheduling across availabilities and rooms.
    "medium": TaskSpec(
        name="medium",
        difficulty="medium",
        instruction=(
            "Schedule a 60-minute project sync that includes every required participant. "
            "Avoid blocked windows and lunch hours. "
            "Output JSON with keys: day, start, end, participants, room."
        ),
        payload={
            "duration_minutes": 60,
            "timezone": "Asia/Kolkata",
            "required_participants": ["Alex", "Priya", "Sam"],
            "blocked_windows": [
                {"day": "2026-04-09", "start": "12:00", "end": "13:00", "reason": "lunch"},
                {"day": "2026-04-09", "start": "16:00", "end": "17:00", "reason": "company all-hands"},
            ],
            "availability": {
                "Alex": [
                    {"day": "2026-04-09", "start": "09:00", "end": "11:00"},
                    {"day": "2026-04-09", "start": "14:00", "end": "16:00"},
                ],
                "Priya": [
                    {"day": "2026-04-09", "start": "10:00", "end": "11:00"},
                    {"day": "2026-04-09", "start": "14:00", "end": "15:30"},
                ],
                "Sam": [
                    {"day": "2026-04-09", "start": "09:30", "end": "10:30"},
                    {"day": "2026-04-09", "start": "14:00", "end": "17:00"},
                ],
            },
            "rooms": [
                {"name": "Focus-2", "capacity": 2, "available": [{"day": "2026-04-09", "start": "14:00", "end": "15:00"}]},
                {"name": "Focus-3", "capacity": 3, "available": [{"day": "2026-04-09", "start": "14:00", "end": "15:00"}]},
                {"name": "Board-6", "capacity": 6, "available": [{"day": "2026-04-09", "start": "15:00", "end": "16:00"}]},
            ],
            "edge_cases": [
                "A room with insufficient capacity is invalid even if time matches.",
                "The interval must fit every attendee exactly, not just overlap partially.",
                "Lunch block must be avoided even if users appear available.",
            ],
        },
        schema={
            "day": "YYYY-MM-DD",
            "start": "HH:MM",
            "end": "HH:MM",
            "participants": "list[string]",
            "room": "string",
        },
        # Expected values are pre-normalized (lowercased) forms.
        expected={
            "day": "2026-04-09",
            "start": "14:00",
            "end": "15:00",
            "participants": ["alex", "priya", "sam"],
            "room": "focus-3",
        },
        max_steps=5,
    ),
    # hard: deterministic data cleaning with dedup, email validation and sums.
    "hard": TaskSpec(
        name="hard",
        difficulty="hard",
        instruction=(
            "Clean the dataset deterministically using the provided rules. "
            "Keep the first occurrence of a duplicate id, drop rows with invalid emails, "
            "normalize amount to two decimals, and report summary metrics. "
            "Output JSON with keys: valid_rows, duplicate_ids, invalid_emails, normalized_total, retained_ids."
        ),
        payload={
            "rules": [
                "Trim whitespace from every string field.",
                "Emails must contain one @ and at least one dot after @.",
                "Duplicate ids are counted once per repeated id; keep the first occurrence only.",
                "Rows with invalid emails are removed before summing amounts.",
                "Sum uses the retained rows only and must be rounded to two decimals.",
            ],
            "rows": [
                {"id": "a001", "email": "alice@example.com", "amount": "120"},
                {"id": "b002", "email": "bob@example.com ", "amount": "80.5"},
                {"id": "c003", "email": "bad-email", "amount": "10.00"},
                {"id": "c003", "email": "carol@example.com", "amount": "10.00"},
                {"id": "d004", "email": " dan@example.org", "amount": "200.40"},
                {"id": "e005", "email": "eve@example.org", "amount": "160.50"},
            ],
            "edge_cases": [
                "Whitespace around email fields should be removed before validation.",
                "The second c003 row is discarded because the id is duplicate even though the email is valid.",
                "Amounts may arrive as integers or decimal strings.",
            ],
        },
        schema={
            "valid_rows": "integer",
            "duplicate_ids": "list[string]",
            "invalid_emails": "list[string]",
            "normalized_total": "string or number with two decimals",
            "retained_ids": "list[string]",
        },
        # 120 + 80.5 + 200.40 + 160.50 = 561.40 over the retained rows.
        expected={
            "valid_rows": 4,
            "duplicate_ids": ["c003"],
            "invalid_emails": ["bad-email"],
            "normalized_total": "561.40",
            "retained_ids": ["a001", "b002", "d004", "e005"],
        },
        max_steps=5,
    ),
}
|
| 262 |
+
|
| 263 |
+
|
| 264 |
+
def get_task(task_name: str) -> TaskSpec:
    """Look up a TaskSpec by (normalized) name.

    Raises ValueError listing the valid names when the task is unknown.
    """
    key = _normalize_text(task_name)
    spec = TASKS.get(key)
    if spec is None:
        valid = ", ".join(sorted(TASKS.keys()))
        raise ValueError(f"unknown task '{task_name}'. expected one of: {valid}")
    return spec
|
| 270 |
+
|
| 271 |
+
|
| 272 |
+
def task_names() -> List[str]:
    """Return the available task names in registry (insertion) order.

    Derived from TASKS instead of a hard-coded list so the names cannot
    drift from the registry when a task is added or renamed.
    """
    return list(TASKS.keys())
|
| 274 |
+
|
| 275 |
+
|
| 276 |
+
def schema_json(task_name: str) -> str:
    """Serialize the named task's public answer schema as deterministic JSON."""
    schema = get_task(task_name).public_schema()
    return json.dumps(schema, sort_keys=True)
|
environment.py
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from env.environment import ProductivityEnvironment
|
| 2 |
+
|
| 3 |
+
__all__ = ["ProductivityEnvironment"]
|
inference.py
ADDED
|
@@ -0,0 +1,173 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from __future__ import annotations
|
| 2 |
+
|
| 3 |
+
import argparse
|
| 4 |
+
import json
|
| 5 |
+
import os
|
| 6 |
+
import re
|
| 7 |
+
from typing import Optional, Tuple
|
| 8 |
+
|
| 9 |
+
from openai import OpenAI
|
| 10 |
+
|
| 11 |
+
from env.environment import ProductivityEnvironment
|
| 12 |
+
|
| 13 |
+
|
| 14 |
+
MAX_STEPS = 5
|
| 15 |
+
|
| 16 |
+
|
| 17 |
+
def _compact(text: Optional[str]) -> str:
|
| 18 |
+
if text is None:
|
| 19 |
+
return "null"
|
| 20 |
+
return re.sub(r"\s+", " ", str(text)).strip() or "null"
|
| 21 |
+
|
| 22 |
+
|
| 23 |
+
def _bool_text(value: bool) -> str:
|
| 24 |
+
return "true" if value else "false"
|
| 25 |
+
|
| 26 |
+
|
| 27 |
+
def _print_start(task_name: str, env_name: str, model_name: str) -> None:
|
| 28 |
+
print(f"[START] task={task_name} env={env_name} model={model_name}")
|
| 29 |
+
|
| 30 |
+
|
| 31 |
+
def _print_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
    """Emit one [STEP] log line summarizing a single environment step."""
    line = (
        f"[STEP] step={step} action={_compact(action)} reward={reward:.2f} "
        f"done={_bool_text(done)} error={_compact(error)}"
    )
    print(line)
|
| 36 |
+
|
| 37 |
+
|
| 38 |
+
def _print_end(success: bool, steps: int, rewards: list[float]) -> None:
    """Emit the final [END] log line including the full reward trace."""
    formatted = [f"{item:.2f}" for item in rewards]
    print(f"[END] success={_bool_text(success)} steps={steps} rewards={','.join(formatted)}")
|
| 41 |
+
|
| 42 |
+
|
| 43 |
+
def _build_client() -> Tuple[Optional[OpenAI], Optional[str], Optional[str]]:
|
| 44 |
+
api_base_url = os.getenv("API_BASE_URL")
|
| 45 |
+
model_name = os.getenv("MODEL_NAME")
|
| 46 |
+
token = os.getenv("HF_TOKEN")
|
| 47 |
+
|
| 48 |
+
if not api_base_url:
|
| 49 |
+
return None, model_name, "missing API_BASE_URL"
|
| 50 |
+
if not model_name:
|
| 51 |
+
return None, None, "missing MODEL_NAME"
|
| 52 |
+
if not token:
|
| 53 |
+
return None, model_name, "missing HF_TOKEN"
|
| 54 |
+
|
| 55 |
+
try:
|
| 56 |
+
client = OpenAI(base_url=api_base_url, api_key=token)
|
| 57 |
+
except Exception as exc:
|
| 58 |
+
return None, model_name, f"client_initialization_failed:{_compact(exc)}"
|
| 59 |
+
return client, model_name, None
|
| 60 |
+
|
| 61 |
+
|
| 62 |
+
def _extract_action(content: str) -> str:
|
| 63 |
+
text = content.strip()
|
| 64 |
+
if text.startswith("```"):
|
| 65 |
+
text = re.sub(r"^```[a-zA-Z0-9_-]*", "", text).strip()
|
| 66 |
+
text = re.sub(r"```$", "", text).strip()
|
| 67 |
+
return text.splitlines()[0].strip() if text else ""
|
| 68 |
+
|
| 69 |
+
|
| 70 |
+
def _query_model(client: OpenAI, model_name: str, observation_json: str) -> Tuple[Optional[str], Optional[str]]:
    """Ask the model for the next action given the serialized observation.

    Returns (action, error); exactly one of the two is None. Errors are
    encoded as "<kind>:<detail>" tokens suitable for the [STEP] log line.
    """
    system_prompt = (
        "You are solving a deterministic RL benchmark. "
        "Reply with exactly one line and no explanation. "
        "Allowed formats are inspect, propose:{...}, or final:{...}. "
        "Use compact JSON. Prefer final:{...} once confident."
    )
    try:
        response = client.chat.completions.create(
            model=model_name,
            temperature=0,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": observation_json},
            ],
        )
    except Exception as exc:
        return None, f"api_error:{_compact(exc)}"

    try:
        content = response.choices[0].message.content
    except Exception as exc:
        return None, f"malformed_response:{_compact(exc)}"

    if not content or not str(content).strip():
        return None, "empty_response"

    action = _extract_action(str(content))
    return (action, None) if action else (None, "empty_action")
|
| 103 |
+
|
| 104 |
+
|
| 105 |
+
def main() -> None:
    """Run one benchmark episode: reset the env, loop model->step, log results.

    Emits exactly one [START] line, one [STEP] line per attempted step, and
    one [END] line — including on reset/client/model/step failures.
    """
    parser = argparse.ArgumentParser()
    # CLI flag wins; falls back to TASK_NAME env var, then "easy".
    parser.add_argument("--task", default=os.getenv("TASK_NAME", "easy"))
    args = parser.parse_args()

    env = ProductivityEnvironment(max_steps=MAX_STEPS)
    task_name = args.task
    model_name_for_log = os.getenv("MODEL_NAME") or "unknown"
    rewards: list[float] = []
    success = False
    steps_taken = 0

    _print_start(task_name, env.benchmark_name, model_name_for_log)

    try:
        observation = env.reset(task_name=task_name)
    except Exception as exc:
        # Reset failures still produce a complete (failed) log trace.
        _print_step(1, "inspect", 0.00, True, f"reset_failed:{_compact(exc)}")
        _print_end(False, 1, [0.00])
        return

    client, model_name, init_error = _build_client()
    if init_error is not None or client is None or model_name is None:
        rewards.append(0.00)
        _print_step(1, "inspect", 0.00, True, init_error)
        _print_end(False, 1, rewards)
        return

    done = False
    last_error: Optional[str] = None

    for step_number in range(1, MAX_STEPS + 1):
        steps_taken = step_number
        # Observations are serialized compactly and key-sorted so the
        # prompt is deterministic across runs.
        action, model_error = _query_model(
            client,
            model_name,
            json.dumps(observation.model_dump(), separators=(",", ":"), sort_keys=True),
        )

        if model_error is not None or action is None:
            # Model-side failure ends the episode with a zero-reward step.
            rewards.append(0.00)
            _print_step(step_number, "inspect", 0.00, True, model_error)
            done = True
            last_error = model_error
            break

        try:
            observation, reward, done, info = env.step(action)
            error = info.get("error")
            rewards.append(reward.value)
            _print_step(step_number, action, reward.value, done, error)
            last_error = error
        except Exception as exc:
            # Environment-side failure also ends the episode.
            rewards.append(0.00)
            _print_step(step_number, action, 0.00, True, f"step_failed:{_compact(exc)}")
            done = True
            last_error = str(exc)
            break

        if done:
            break

    # Success requires a finished episode, a perfect best score, and no
    # error on the final step.
    if rewards:
        success = done and env.state().best_score >= 1.0 and last_error in (None, "")
    _print_end(success, max(steps_taken, 1), rewards if rewards else [0.00])
|
| 170 |
+
|
| 171 |
+
|
| 172 |
+
if __name__ == "__main__":
|
| 173 |
+
main()
|
models.py
ADDED
|
@@ -0,0 +1,4 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from env.models import Action, Observation, Reward, StepInfo
|
| 2 |
+
|
| 3 |
+
__all__ = ["Action", "Observation", "Reward", "StepInfo"]
|
| 4 |
+
|
openenv.yaml
ADDED
|
@@ -0,0 +1,10 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
name: openenv-productivity
|
| 2 |
+
version: "1.0.0"
|
| 3 |
+
description: Deterministic productivity benchmark with three tasks for the OpenEnv RL Challenge.
|
| 4 |
+
entrypoint: env.environment:ProductivityEnvironment
|
| 5 |
+
inference: python inference.py
|
| 6 |
+
max_steps: 5
|
| 7 |
+
tasks:
|
| 8 |
+
- easy
|
| 9 |
+
- medium
|
| 10 |
+
- hard
|
pyproject.toml
ADDED
|
@@ -0,0 +1,23 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
[build-system]
|
| 2 |
+
requires = ["setuptools>=68", "wheel"]
|
| 3 |
+
build-backend = "setuptools.build_meta"
|
| 4 |
+
|
| 5 |
+
[project]
|
| 6 |
+
name = "openenv-productivity"
|
| 7 |
+
version = "1.0.0"
|
| 8 |
+
description = "Deterministic productivity benchmark for the OpenEnv RL Challenge."
|
| 9 |
+
readme = "README.md"
|
| 10 |
+
requires-python = ">=3.10"
|
| 11 |
+
dependencies = [
|
| 12 |
+
"openenv-core>=0.2.0",
|
| 13 |
+
"openai>=1.30.0,<3.0.0",
|
| 14 |
+
"pydantic>=2.7.0,<3.0.0",
|
| 15 |
+
"fastapi>=0.110.0,<1.0.0",
|
| 16 |
+
"uvicorn>=0.30.0,<1.0.0",
|
| 17 |
+
]
|
| 18 |
+
|
| 19 |
+
[project.scripts]
|
| 20 |
+
server = "server.app:main"
|
| 21 |
+
|
| 22 |
+
[tool.setuptools]
|
| 23 |
+
packages = ["env", "server"]
|
requirements.txt
ADDED
|
@@ -0,0 +1,2 @@
|
|
|
|
|
|
|
|
|
|
| 1 |
+
openai>=1.30.0,<3.0.0
|
| 2 |
+
pydantic>=2.7.0,<3.0.0
|
server/__init__.py
ADDED
|
@@ -0,0 +1,2 @@
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Server package for OpenEnv deployment entrypoints."""
|
| 2 |
+
|
server/app.py
ADDED
|
@@ -0,0 +1,24 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from __future__ import annotations
|
| 2 |
+
|
| 3 |
+
import os
|
| 4 |
+
|
| 5 |
+
from fastapi import FastAPI
|
| 6 |
+
import uvicorn
|
| 7 |
+
|
| 8 |
+
|
| 9 |
+
app = FastAPI(title="openenv-productivity", version="1.0.0")
|
| 10 |
+
|
| 11 |
+
|
| 12 |
+
@app.get("/health")
def health() -> dict[str, str]:
    """Liveness probe endpoint for the hosting platform."""
    return dict(status="ok")
|
| 15 |
+
|
| 16 |
+
|
| 17 |
+
def main() -> None:
    """Launch the FastAPI app via uvicorn, honoring HOST/PORT overrides."""
    bind_host = os.getenv("HOST", "0.0.0.0")
    bind_port = int(os.getenv("PORT", "7860"))
    uvicorn.run("server.app:app", host=bind_host, port=bind_port, reload=False)
|
| 21 |
+
|
| 22 |
+
|
| 23 |
+
if __name__ == "__main__":
|
| 24 |
+
main()
|
tasks.py
ADDED
|
@@ -0,0 +1,4 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from env.tasks import TASKS, TaskSpec, get_task, schema_json, task_names
|
| 2 |
+
|
| 3 |
+
__all__ = ["TASKS", "TaskSpec", "get_task", "schema_json", "task_names"]
|
| 4 |
+
|
uv.lock
ADDED
|
@@ -0,0 +1,15 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version = 1
|
| 2 |
+
revision = 1
|
| 3 |
+
requires-python = ">=3.10"
|
| 4 |
+
|
| 5 |
+
[[package]]
|
| 6 |
+
name = "openenv-productivity"
|
| 7 |
+
version = "1.0.0"
|
| 8 |
+
source = { virtual = "." }
|
| 9 |
+
dependencies = [
|
| 10 |
+
{ name = "fastapi", specifier = ">=0.110.0,<1.0.0" },
|
| 11 |
+
{ name = "openai", specifier = ">=1.30.0,<3.0.0" },
|
| 12 |
+
{ name = "openenv-core", specifier = ">=0.2.0" },
|
| 13 |
+
{ name = "pydantic", specifier = ">=2.7.0,<3.0.0" },
|
| 14 |
+
{ name = "uvicorn", specifier = ">=0.30.0,<1.0.0" },
|
| 15 |
+
]
|