Your-Mate committed on
Commit 6e70e27 · verified · 1 Parent(s): cf4b443

Upload folder using huggingface_hub

Files changed (4):
  1. README.md +78 -134
  2. baseline.py +29 -0
  3. env/tasks.py +102 -7
  4. tests/test_environment.py +59 -0
README.md CHANGED
@@ -1,6 +1,6 @@
  ---
  title: openenv-productivity
- emoji: 🚀
+ emoji: "🚀"
  colorFrom: blue
  colorTo: green
  sdk: docker
@@ -10,133 +10,72 @@ base_path: /web
 
  # OpenEnv Productivity Benchmark
 
- `openenv-productivity` is a deterministic reinforcement learning benchmark with three productivity-oriented tasks designed for the OpenEnv RL Challenge. The environment is intentionally small, reproducible, and easy to validate while still rewarding iterative improvement over multiple steps.
-
- ## Environment Overview
-
- The benchmark exposes a single environment class: `ProductivityEnvironment` in `env/environment.py`.
-
- Implemented API:
+ `openenv-productivity` is a deterministic RL benchmark for real operational assistant workflows. It includes exactly three tasks with increasing difficulty and deterministic 0.00-1.00 grading.
+
+ ## Why This Is Useful
+
+ - Email triage mirrors real inbox operations.
+ - Calendar scheduling includes true resource constraints (people, time windows, rooms, lunch blocks).
+ - Data cleaning captures production ETL quality checks and audit-style outputs.
+
+ ## Environment API
 
  - `reset(task_name="easy") -> Observation`
  - `step(action) -> (Observation, Reward, done, info)`
  - `state() -> Observation`
 
- The environment is deterministic:
-
- - No randomness is used anywhere in task generation or grading.
- - All tasks use fixed payloads and fixed graders.
- - Reward shaping is stable and repeatable for identical action sequences.
+ Pydantic models:
+
+ - `Action`
+ - `Observation`
+ - `Reward`
 
- ## Observation Space
-
- Observations are validated with Pydantic through the `Observation` model in `env/models.py`.
-
- Observation fields:
-
- - `benchmark`: benchmark name
- - `task_name`: `easy`, `medium`, or `hard`
- - `instruction`: natural-language task instruction
- - `payload`: task data and target schema
- - `action_format`: supported action patterns
- - `step_count`: current step index
- - `max_steps`: maximum allowed steps
- - `best_score`: best score seen so far in the episode
- - `last_action`: previous action string
- - `last_feedback`: deterministic grader feedback
- - `done`: terminal flag
-
- ## Action Space
-
- Actions are validated with Pydantic through the `Action` model.
-
- Supported actions:
-
- - `inspect`
- - `propose:{"field":"value"}`
- - `final:{"field":"value"}`
-
- Notes:
-
- - `inspect` lets an agent spend a step rereading the state, but it still incurs the step penalty.
- - `propose:` is useful for incremental reward collection.
- - `final:` ends the episode immediately, even if the answer is incomplete.
- - Malformed actions and malformed JSON receive deterministic penalties.
-
- ## Reward Logic
-
- Rewards are represented by the Pydantic `Reward` model.
-
- Per-step reward includes:
-
- - positive delta when a proposal improves over the previous best score
- - partial credit from field-level grading
- - `-0.02` step penalty on every step
- - wrong-answer penalty when a submission regresses or scores zero
- - malformed action penalty for invalid actions or invalid JSON
- - loop penalty when the same action is repeated consecutively
-
- This shaping discourages loops, rewards iterative progress, and stays deterministic.
-
- ## Tasks
-
- Exactly three tasks are included.
-
- ### 1. Easy: Email Classification
-
- Goal:
-
- - classify an email into `label`
- - assign `priority`
- - determine `needs_reply`
-
- Edge cases handled:
-
- - sender intent outweighs superficial phrasing
- - reply detection is based on explicit requested action
- - constrained label and priority vocabularies
-
- Expected strong baseline score:
-
- - `1.00` within 1 to 2 steps
-
- ### 2. Medium: Calendar Scheduling
-
- Goal:
-
- - schedule a 60-minute meeting for all required participants
- - avoid lunch and blocked windows
- - select a room with enough capacity
-
- Edge cases handled:
-
- - partial overlap is not sufficient
- - room capacity must satisfy participant count
- - blocked windows override individual availability
-
- Expected strong baseline score:
-
- - `1.00` within 1 to 3 steps
-
- ### 3. Hard: Data Cleaning
-
- Goal:
-
- - clean a tabular dataset with duplicate IDs and malformed emails
- - keep first duplicate occurrence only
- - compute a normalized total from retained rows
-
- Edge cases handled:
-
- - whitespace trimming before validation
- - duplicate handling before retention
- - numeric normalization to two decimals
-
- Expected strong baseline score:
-
- - `1.00` within 2 to 4 steps
-
- ## Setup
+ ## Task Set (Exactly 3)
+
+ 1. `easy` - email classification
+ 2. `medium` - calendar scheduling
+ 3. `hard` - data cleaning
+
+ All graders are deterministic, reproducible, and return bounded scores in `0.00-1.00`.
+
+ ## Reward Design
+
+ Each step includes:
+
+ - incremental improvement reward (`delta` on better submissions)
+ - partial credit from component-level grading
+ - wrong-answer penalties for regressions / zero-quality answers
+ - malformed action penalty for invalid action or invalid JSON
+ - anti-loop penalty for repeated actions
+ - fixed step penalty to discourage long trajectories
+
+ This keeps rewards dense, stable, and useful for policy learning.
+
+ ## Determinism & Reproducibility
+
+ - no randomness in task payloads or grading
+ - fixed expected outputs with deterministic normalization
+ - reproducible baseline script for all tasks
+ - deterministic unit tests included
+
+ ## Inference Output Contract
+
+ `inference.py` emits only:
+
+ ```text
+ [START] task=<task_name> env=<benchmark> model=<model_name>
+ [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
+ [END] success=<true|false> steps=<n> rewards=<r1,r2,...,rn>
+ ```
+
+ Guarantees:
+
+ - reward formatted to two decimals
+ - lowercase booleans
+ - max 5 steps
+ - `[END]` always printed (including error paths)
+
+ ## Quickstart
 
  ### Local
 
@@ -144,41 +83,38 @@ Expected strong baseline score:
  python -m venv .venv
  . .venv/bin/activate
  pip install -r requirements.txt
- export API_BASE_URL="https://your-openai-compatible-endpoint/v1"
- export MODEL_NAME="your-model"
- export HF_TOKEN="your-token"
+ export API_BASE_URL="https://router.huggingface.co/v1"
+ export MODEL_NAME="zai-org/GLM-5.1"
+ export HF_TOKEN="hf_xxx"
  python inference.py --task easy
  ```
 
- On Windows PowerShell:
+ Windows cmd:
 
- ```powershell
- python -m venv .venv
- .venv\Scripts\Activate.ps1
+ ```cmd
  pip install -r requirements.txt
- $env:API_BASE_URL="https://your-openai-compatible-endpoint/v1"
- $env:MODEL_NAME="your-model"
- $env:HF_TOKEN="your-token"
+ set API_BASE_URL=https://router.huggingface.co/v1
+ set MODEL_NAME=zai-org/GLM-5.1
+ set HF_TOKEN=hf_xxx
  python inference.py --task easy
  ```
 
- ## Inference Output Contract
-
- `inference.py` emits only these lines:
-
- ```text
- [START] task=<task_name> env=<benchmark> model=<model_name>
- [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
- [END] success=<true|false> steps=<n> rewards=<r1,r2,...,rn>
- ```
-
- Compliance guarantees:
-
- - reward formatted to two decimals
- - lowercase booleans
- - no extra blank lines
- - `[END]` is always printed, including on failure
- - max steps capped at five
+ ### Deterministic Baseline
+
+ ```bash
+ python baseline.py
+ ```
+
+ Expected pattern:
+
+ - each task score = `1.0`
+ - each task reward = `0.98` (perfect delta - step penalty)
+
+ ### Tests
+
+ ```bash
+ python -m unittest discover -s tests -p "test_*.py"
+ ```
 
  ## Docker
 
@@ -188,14 +124,22 @@ Build:
  docker build -t openenv-productivity .
  ```
 
- Run:
+ Run server mode:
+
+ ```bash
+ docker run --rm -p 7860:7860 openenv-productivity
+ ```
+
+ Health check:
+
+ ```bash
+ curl http://localhost:7860/health
+ ```
+
+ ## OpenEnv Validation
 
  ```bash
- docker run --rm \
-   -e API_BASE_URL="https://your-openai-compatible-endpoint/v1" \
-   -e MODEL_NAME="your-model" \
-   -e HF_TOKEN="your-token" \
-   openenv-productivity
+ openenv validate
  ```
 
- The container is intentionally lean and suitable for a 2 CPU / 8 GB RAM runtime.
+ The project is designed to pass OpenEnv structure checks and deploy cleanly to Hugging Face Spaces.
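The `propose:`/`final:` action format documented in the README above pairs a verb with a compact JSON payload. As a minimal sketch (the `make_action` helper and the example field values are illustrative, not part of the repo), the canonical serialization that `baseline.py` uses — `json.dumps(..., separators=(',', ':'), sort_keys=True)` — can be wrapped like this:

```python
import json

def make_action(kind: str, fields: dict) -> str:
    """Build an action string in the documented `propose:`/`final:` shape.

    Canonical JSON (sorted keys, no extra whitespace) keeps identical
    submissions byte-identical, which matters for the loop penalty and
    deterministic replays. `make_action` is a hypothetical helper,
    not part of the repository.
    """
    if kind not in {"propose", "final"}:
        raise ValueError(f"unsupported action kind: {kind}")
    return f"{kind}:{json.dumps(fields, separators=(',', ':'), sort_keys=True)}"

# Illustrative payload for the easy task; the field values are made up.
action = make_action("final", {"label": "billing", "priority": "high", "needs_reply": True})
print(action)  # final:{"label":"billing","needs_reply":true,"priority":"high"}
```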
baseline.py ADDED
@@ -0,0 +1,29 @@
+ from __future__ import annotations
+
+ import json
+
+ from env.environment import ProductivityEnvironment
+ from env.tasks import get_task, task_names
+
+
+ def run_baseline() -> None:
+     env = ProductivityEnvironment(max_steps=5)
+     summary: dict[str, dict[str, float | int | bool]] = {}
+
+     for task_name in task_names():
+         task = get_task(task_name)
+         env.reset(task_name)
+         action = f"final:{json.dumps(task.expected, separators=(',', ':'), sort_keys=True)}"
+         _, reward, done, info = env.step(action)
+         summary[task_name] = {
+             "done": bool(done),
+             "score": float(info["best_score"]),
+             "reward": float(reward.value),
+             "steps": 1,
+         }
+
+     print(json.dumps(summary, indent=2, sort_keys=True))
+
+
+ if __name__ == "__main__":
+     run_baseline()
env/tasks.py CHANGED
@@ -31,12 +31,37 @@ def _normalize_time(value: Any) -> str:
      return text
 
 
+ def _minutes_since_midnight(value: Any) -> int:
+     text = _normalize_time(value)
+     parts = text.split(":")
+     if len(parts) != 2:
+         return -1
+     if not parts[0].isdigit() or not parts[1].isdigit():
+         return -1
+     hour = int(parts[0])
+     minute = int(parts[1])
+     if hour < 0 or hour > 23 or minute < 0 or minute > 59:
+         return -1
+     return hour * 60 + minute
+
+
  def _normalize_list(values: Any) -> List[str]:
      if not isinstance(values, list):
          return []
      return sorted({_normalize_text(item) for item in values if str(item).strip()})
 
 
+ def _normalize_list_in_order(values: Any) -> List[str]:
+     if not isinstance(values, list):
+         return []
+     output: List[str] = []
+     for item in values:
+         normalized = _normalize_text(item)
+         if normalized:
+             output.append(normalized)
+     return output
+
+
  def _normalize_decimal(value: Any) -> str:
      try:
          decimal = Decimal(str(value)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
@@ -57,6 +82,19 @@ def _score_list(candidate: Any, expected: List[str]) -> float:
      return len(actual.intersection(target)) / len(target)
 
 
+ def _in_any_window(day: str, start_minute: int, end_minute: int, windows: List[Dict[str, str]]) -> bool:
+     for window in windows:
+         if _normalize_date(window.get("day", "")) != day:
+             continue
+         ws = _minutes_since_midnight(window.get("start", ""))
+         we = _minutes_since_midnight(window.get("end", ""))
+         if ws < 0 or we < 0 or ws >= we:
+             continue
+         if start_minute >= ws and end_minute <= we:
+             return True
+     return False
+
+
  @dataclass(frozen=True)
  class TaskSpec:
      name: str
@@ -81,19 +119,70 @@ class TaskSpec:
              }
              weights = {"label": 0.6, "priority": 0.2, "needs_reply": 0.2}
          elif self.name == "medium":
+             day = _normalize_date(candidate.get("day", ""))
+             start_minute = _minutes_since_midnight(candidate.get("start", ""))
+             end_minute = _minutes_since_midnight(candidate.get("end", ""))
+             required_duration = int(self.payload.get("duration_minutes", 60))
+             duration_ok = 1.0 if start_minute >= 0 and end_minute - start_minute == required_duration else 0.0
+             no_blocked_overlap = 1.0
+             if day and start_minute >= 0 and end_minute > start_minute:
+                 for block in self.payload.get("blocked_windows", []):
+                     if _normalize_date(block.get("day", "")) != day:
+                         continue
+                     bs = _minutes_since_midnight(block.get("start", ""))
+                     be = _minutes_since_midnight(block.get("end", ""))
+                     if bs < 0 or be < 0:
+                         continue
+                     overlap = max(start_minute, bs) < min(end_minute, be)
+                     if overlap:
+                         no_blocked_overlap = 0.0
+                         break
+             else:
+                 no_blocked_overlap = 0.0
+
+             room_ok = 0.0
+             room_name = _normalize_text(candidate.get("room", ""))
+             participants = _normalize_list(candidate.get("participants", []))
+             for room in self.payload.get("rooms", []):
+                 if _normalize_text(room.get("name", "")) != room_name:
+                     continue
+                 capacity_ok = int(room.get("capacity", 0)) >= len(participants)
+                 slot_ok = _in_any_window(day, start_minute, end_minute, room.get("available", []))
+                 room_ok = 1.0 if capacity_ok and slot_ok else 0.0
+                 break
+
+             participant_availability = 1.0
+             required_people = _normalize_list(self.payload.get("required_participants", []))
+             if not set(required_people).issubset(set(participants)):
+                 participant_availability = 0.0
+             else:
+                 for person in required_people:
+                     availability = self.payload.get("availability", {}).get(person.title(), [])
+                     if not _in_any_window(day, start_minute, end_minute, availability):
+                         participant_availability = 0.0
+                         break
+
              components = {
                  "day": _exact_match(candidate.get("day"), self.expected["day"], _normalize_date),
                  "start": _exact_match(candidate.get("start"), self.expected["start"], _normalize_time),
                  "end": _exact_match(candidate.get("end"), self.expected["end"], _normalize_time),
                  "participants": _score_list(candidate.get("participants"), self.expected["participants"]),
                  "room": _exact_match(candidate.get("room"), self.expected["room"], _normalize_text),
+                 "duration_ok": duration_ok,
+                 "no_blocked_overlap": no_blocked_overlap,
+                 "participant_availability": participant_availability,
+                 "room_valid": room_ok,
              }
              weights = {
-                 "day": 0.2,
-                 "start": 0.2,
-                 "end": 0.2,
-                 "participants": 0.2,
-                 "room": 0.2,
+                 "day": 0.1,
+                 "start": 0.1,
+                 "end": 0.1,
+                 "participants": 0.15,
+                 "room": 0.1,
+                 "duration_ok": 0.15,
+                 "no_blocked_overlap": 0.1,
+                 "participant_availability": 0.1,
+                 "room_valid": 0.1,
              }
          else:
              components = {
@@ -104,13 +193,19 @@ class TaskSpec:
                      candidate.get("normalized_total"), self.expected["normalized_total"], _normalize_decimal
                  ),
                  "retained_ids": _score_list(candidate.get("retained_ids"), self.expected["retained_ids"]),
+                 "retained_ids_order": _exact_match(
+                     candidate.get("retained_ids"),
+                     self.expected["retained_ids"],
+                     lambda x: json.dumps(_normalize_list_in_order(x), separators=(",", ":")),
+                 ),
              }
              weights = {
                  "valid_rows": 0.2,
-                 "duplicate_ids": 0.2,
-                 "invalid_emails": 0.2,
+                 "duplicate_ids": 0.15,
+                 "invalid_emails": 0.15,
                  "normalized_total": 0.2,
                  "retained_ids": 0.2,
+                 "retained_ids_order": 0.1,
              }
 
          score = sum(components[key] * weights[key] for key in weights)
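The medium-task grader added in this diff rejects candidate slots that touch a blocked window using the standard interval-overlap test `max(start, bs) < min(end, be)`. A self-contained sketch of that check (the function names and sample times here are illustrative, not taken from the repo):

```python
def to_minutes(hhmm: str) -> int:
    # "HH:MM" -> minutes since midnight, like the diff's _minutes_since_midnight
    hour, minute = hhmm.split(":")
    return int(hour) * 60 + int(minute)

def overlaps(start_a: int, end_a: int, start_b: int, end_b: int) -> bool:
    # Two half-open windows [start, end) share time iff the later start
    # precedes the earlier end -- the same test the grader applies.
    return max(start_a, start_b) < min(end_a, end_b)

meeting = (to_minutes("14:00"), to_minutes("15:00"))
print(overlaps(*meeting, to_minutes("14:30"), to_minutes("16:00")))  # True
print(overlaps(*meeting, to_minutes("15:00"), to_minutes("16:00")))  # False: back-to-back windows do not overlap
```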
tests/test_environment.py ADDED
@@ -0,0 +1,59 @@
+ from __future__ import annotations
+
+ import json
+ import unittest
+
+ from env.environment import ProductivityEnvironment
+ from env.tasks import get_task
+
+
+ class EnvironmentDeterminismTests(unittest.TestCase):
+     def setUp(self) -> None:
+         self.env = ProductivityEnvironment(max_steps=5)
+
+     def test_easy_perfect_answer_scores_one(self) -> None:
+         task = get_task("easy")
+         self.env.reset("easy")
+         action = f"final:{json.dumps(task.expected, separators=(',', ':'), sort_keys=True)}"
+         _, reward, done, info = self.env.step(action)
+         self.assertTrue(done)
+         self.assertEqual(info["best_score"], 1.0)
+         self.assertAlmostEqual(reward.value, 0.98, places=2)
+
+     def test_medium_invalid_room_capacity_penalized(self) -> None:
+         self.env.reset("medium")
+         bad = {
+             "day": "2026-04-09",
+             "start": "14:00",
+             "end": "15:00",
+             "participants": ["Alex", "Priya", "Sam"],
+             "room": "Focus-2",
+         }
+         _, reward, done, info = self.env.step(f"final:{json.dumps(bad, separators=(',', ':'))}")
+         self.assertTrue(done)
+         self.assertLess(info["best_score"], 1.0)
+         self.assertLess(reward.value, 0.98)
+
+     def test_hard_deterministic(self) -> None:
+         self.env.reset("hard")
+         answer = {
+             "valid_rows": 4,
+             "duplicate_ids": ["c003"],
+             "invalid_emails": ["bad-email"],
+             "normalized_total": "561.40",
+             "retained_ids": ["a001", "b002", "d004", "e005"],
+         }
+         _, _, _, info_a = self.env.step(f"final:{json.dumps(answer, separators=(',', ':'))}")
+         self.env.reset("hard")
+         _, _, _, info_b = self.env.step(f"final:{json.dumps(answer, separators=(',', ':'))}")
+         self.assertEqual(info_a["best_score"], info_b["best_score"])
+
+     def test_loop_penalty(self) -> None:
+         self.env.reset("easy")
+         self.env.step("inspect")
+         _, reward, _, _ = self.env.step("inspect")
+         self.assertLess(reward.value, -0.02)
+
+
+ if __name__ == "__main__":
+     unittest.main()
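The `0.98` asserted in `test_easy_perfect_answer_scores_one` follows directly from the numbers quoted in the README: a perfect first submission earns the full `1.00` score delta minus the fixed `0.02` step penalty. A toy illustration of that arithmetic (the `one_step_reward` helper is hypothetical; the real shaping in `env/environment.py` has additional penalty terms):

```python
STEP_PENALTY = 0.02  # fixed per-step penalty quoted in the README

def one_step_reward(prev_best: float, new_score: float) -> float:
    # Improvement over the previous best score, minus the step penalty.
    # Hypothetical sketch only -- not the environment's actual reward code.
    delta = max(new_score - prev_best, 0.0)
    return round(delta - STEP_PENALTY, 2)

print(one_step_reward(0.0, 1.0))  # 0.98: perfect delta minus step penalty
```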