Spaces:
Sleeping
Sleeping
| title: Smart Calendar Resolver | |
| emoji: π | |
| colorFrom: blue | |
| colorTo: green | |
| sdk: docker | |
| app_file: app.py | |
| pinned: false | |
| # Smart Calendar Resolver β OpenEnv Environment | |
| A deterministic, multi-step OpenEnv environment for evaluating agent reasoning in real-world scheduling workflows. | |
| This environment models a constrained meeting scheduling problem where an agent must interpret user intent, reason over structured availability, and produce a valid, verified outcome through a staged interaction loop. | |
| --- | |
| ## Problem Definition | |
| Given: | |
| - a natural language meeting request | |
| - multiple participants with availability windows | |
| - constraints (duration, deadline, priority, timezone) | |
| The agent must: | |
| 1. Interpret the request | |
| 2. Aggregate and reason over availability | |
| 3. Select a valid time slot | |
| 4. Confirm and finalize the schedule | |
| This reflects real-world calendar coordination tasks commonly handled by assistants and productivity tools. | |
| --- | |
| ## Environment Design | |
| ### Core Loop | |
| The environment follows the standard OpenEnv interface: | |
| - `reset()` β returns initial observation | |
| - `step(action)` β returns (observation, reward, done, info) | |
| - `state` β internal environment state | |
| ### Stage-Based Interaction | |
| The task is decomposed into explicit stages: | |
| 1. `understand_request` | |
| 2. `evaluate_availability` | |
| 3. `propose_slot` | |
| 4. `confirm_schedule` | |
| Agents are expected to follow this progression. Out-of-order or invalid transitions are penalized. | |
| --- | |
| ## Dataset | |
| A small, fully deterministic, in-memory dataset is used. | |
| Each scenario includes: | |
| - request text | |
| - participants | |
| - availability windows | |
| - constraints (deadline, duration, priority) | |
| - ground-truth valid slot | |
| Difficulty levels: | |
| - **Easy**: single valid slot, minimal reasoning | |
| - **Medium**: conflicting availability with constraint filtering | |
| - **Hard**: multiple candidates requiring prioritization and constraint trade-offs | |
| Design choice: | |
| - Small dataset ensures reproducibility | |
| - No randomness ensures stable evaluation and debugging | |
| --- | |
| ## State Representation | |
| The environment maintains: | |
| - `episode_id` | |
| - `step_count` | |
| - `current_scenario` | |
| - `selected_slot` | |
| - `action_history` | |
| - `solved` flag | |
| This enables: | |
| - trajectory-based evaluation | |
| - reward shaping across steps | |
| - deterministic replay | |
| --- | |
| ## Observation Space | |
| Each observation contains: | |
| - request (natural language) | |
| - structured availability | |
| - constraints | |
| - current step index | |
| - feedback signal | |
| - action history | |
| - next expected stage | |
| - reward | |
| - done flag | |
| Observations are designed to balance: | |
| - realism (semi-structured inputs) | |
| - controllability (no external dependencies) | |
| --- | |
| ## Action Space | |
| Typed via Pydantic models: | |
| Fields include: | |
| - `stage` | |
| - `proposed_time_slot` | |
| - `confirm_schedule` | |
| - `final_note` | |
| Actions are structured but flexible enough to simulate agent reasoning. | |
| --- | |
| ## Reward Function | |
| Shaped reward encourages incremental progress: | |
| - + correct interpretation of request | |
| - + correct use of availability constraints | |
| - + valid slot selection | |
| - + correct final confirmation | |
| - + concise and relevant final note | |
| Penalties: | |
| - invalid stage transitions | |
| - incorrect slot selection | |
| - repeated or redundant actions | |
| Properties: | |
| - dense (not sparse) | |
| - deterministic | |
| - aligned with task completion | |
| --- | |
| ## Determinism & Reproducibility | |
| - No randomness in dataset or transitions | |
| - Fixed scenario ordering | |
| - Identical rewards for identical actions | |
| - Deterministic baseline policy | |
| This ensures: | |
| - reproducible scoring | |
| - stable evaluation across runs | |
| - compatibility with automated grading | |
| --- | |
| ## Baseline (Inference) | |
| A deterministic baseline is provided. | |
| Characteristics: | |
| - follows correct stage sequence | |
| - selects known valid slot | |
| - produces consistent output | |
| - uses the injected OpenAI-compatible proxy when `API_BASE_URL`, `API_KEY`, and `MODEL_NAME` are present | |
| - falls back to the deterministic local baseline when those submission env vars are absent | |
| ### Required Output Format | |
| The script emits strictly formatted logs: | |
| [START] task=<task_name> env=<env_name> model=<model_name> | |
| [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null> | |
| [END] success=<true|false> steps=<n> rewards=<r1,r2,...,rn> | |
| This format is required for evaluation pipelines. | |
| --- | |
| ## Validation & Testing | |
| The environment has been verified with: | |
| - `uv run openenv validate .` | |
| - deterministic baseline execution | |
| - pytest suite covering: | |
| - environment flow | |
| - state transitions | |
| - reward correctness | |
| - inference execution | |
| - API health | |
| All tests pass from repository root. | |
| --- | |
| ## Deployment | |
| ### Docker | |
| ```bash | |
| docker build -t smart-calendar-env . | |
| docker run -p 8000:8000 smart-calendar-env | |
| ``` | |
| Health check: | |
| curl http://localhost:8000/health | |
| Expected: | |
| {"status":"healthy"} | |
| Hugging Face Spaces | |
| Deploy using Docker SDK | |
| Use repository root as build context | |
| Verify /health endpoint | |
| Ensure logs show clean startup | |
| Key Design Decisions | |
| Stage-based decomposition β improves interpretability and grading | |
| Small synthetic dataset β ensures determinism and fast validation | |
| Structured actions β enables consistent evaluation | |
| Shaped rewards β provides meaningful learning signal | |
| Root-level Dockerfile β simplifies deployment pipeline | |
| Evaluation Alignment | |
| This environment directly satisfies OpenEnv requirements: | |
| real-world task simulation | |
| multi-step agent interaction | |
| deterministic graders | |
| meaningful reward shaping | |
| reproducible baseline | |
| Docker + HF Spaces deployability | |
| Summary | |
| Smart Calendar Resolver is a compact, deterministic environment that captures a realistic scheduling workflow while remaining easy to validate, deploy, and evaluate. | |
| It is designed to test: | |
| multi-step reasoning | |
| constraint handling | |
| structured decision making | |
| trajectory-based agent performance | |
| I also pushed this to huggingface spaces | |