---
title: Smart Calendar Resolver
emoji: 📅
colorFrom: blue
colorTo: green
sdk: docker
app_file: app.py
pinned: false
---

# Smart Calendar Resolver — OpenEnv Environment

A deterministic, multi-step OpenEnv environment for evaluating agent reasoning in real-world scheduling workflows.

This environment models a constrained meeting scheduling problem where an agent must interpret user intent, reason over structured availability, and produce a valid, verified outcome through a staged interaction loop.

---

## Problem Definition

Given:

- a natural language meeting request
- multiple participants with availability windows
- constraints (duration, deadline, priority, timezone)

The agent must:

1. Interpret the request
2. Aggregate and reason over availability
3. Select a valid time slot
4. Confirm and finalize the schedule

This reflects real-world calendar coordination tasks commonly handled by assistants and productivity tools.

---

## Environment Design

### Core Loop

The environment follows the standard OpenEnv interface:

- `reset()` → returns initial observation
- `step(action)` → returns (observation, reward, done, info)
- `state` → internal environment state

### Stage-Based Interaction

The task is decomposed into explicit stages:

1. `understand_request`
2. `evaluate_availability`
3. `propose_slot`
4. `confirm_schedule`

Agents are expected to follow this progression. Out-of-order or invalid transitions are penalized.

---

## Dataset

A small, fully deterministic, in-memory dataset is used.
Each scenario includes:

- request text
- participants
- availability windows
- constraints (deadline, duration, priority)
- ground-truth valid slot

Difficulty levels:

- **Easy**: single valid slot, minimal reasoning
- **Medium**: conflicting availability with constraint filtering
- **Hard**: multiple candidates requiring prioritization and constraint trade-offs

Design choice:

- Small dataset ensures reproducibility
- No randomness ensures stable evaluation and debugging

---

## State Representation

The environment maintains:

- `episode_id`
- `step_count`
- `current_scenario`
- `selected_slot`
- `action_history`
- `solved` flag

This enables:

- trajectory-based evaluation
- reward shaping across steps
- deterministic replay

---

## Observation Space

Each observation contains:

- request (natural language)
- structured availability
- constraints
- current step index
- feedback signal
- action history
- next expected stage
- reward
- done flag

Observations are designed to balance:

- realism (semi-structured inputs)
- controllability (no external dependencies)

---

## Action Space

Actions are typed via Pydantic models. Fields include:

- `stage`
- `proposed_time_slot`
- `confirm_schedule`
- `final_note`

Actions are structured but flexible enough to simulate agent reasoning.

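
The action fields and the stage ordering can be sketched together as follows. This is only an illustration: it uses a standard-library `dataclass` as a stand-in for the Pydantic models so the snippet has no dependencies, and the class name `CalendarAction`, the field types, the defaults, and the helper `is_valid_transition` are assumptions rather than the repository's actual definitions.

```python
from dataclasses import dataclass
from typing import List, Optional

# Stage names as listed under "Stage-Based Interaction".
STAGES = (
    "understand_request",
    "evaluate_availability",
    "propose_slot",
    "confirm_schedule",
)


@dataclass(frozen=True)
class CalendarAction:
    """Hypothetical action shape: field names follow the README, types are guesses."""

    stage: str
    proposed_time_slot: Optional[str] = None  # slot encoding is an assumption
    confirm_schedule: bool = False
    final_note: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject stage names outside the four documented stages.
        if self.stage not in STAGES:
            raise ValueError(f"unknown stage: {self.stage!r}")
        # A slot proposal without a slot is treated as malformed in this sketch.
        if self.stage == "propose_slot" and not self.proposed_time_slot:
            raise ValueError("propose_slot requires proposed_time_slot")


def is_valid_transition(history: List[str], stage: str) -> bool:
    """Return True when `stage` is the next stage in order after `history`."""
    return len(history) < len(STAGES) and STAGES[len(history)] == stage
```

With this shape, an agent that jumps straight to `propose_slot` on an empty history fails `is_valid_transition`, which is the kind of out-of-order action the environment is described as penalizing.
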

---

## Reward Function

Shaped reward encourages incremental progress:

- + correct interpretation of request
- + correct use of availability constraints
- + valid slot selection
- + correct final confirmation
- + concise and relevant final note

Penalties:

- invalid stage transitions
- incorrect slot selection
- repeated or redundant actions

Properties:

- dense (not sparse)
- deterministic
- aligned with task completion

---

## Determinism & Reproducibility

- No randomness in dataset or transitions
- Fixed scenario ordering
- Identical rewards for identical actions
- Deterministic baseline policy

This ensures:

- reproducible scoring
- stable evaluation across runs
- compatibility with automated grading

---

## Baseline (Inference)

A deterministic baseline is provided.

Characteristics:

- follows correct stage sequence
- selects known valid slot
- produces consistent output
- uses the injected OpenAI-compatible proxy when `API_BASE_URL`, `API_KEY`, and `MODEL_NAME` are present
- falls back to the deterministic local baseline when those submission env vars are absent

### Required Output Format

The script emits strictly formatted logs:

- `[START] task= env= model=`
- `[STEP] step= action= reward=<0.00> done= error=`
- `[END] success= steps= rewards=`

This format is required for evaluation pipelines.

---

## Validation & Testing

The environment has been verified with:

- `uv run openenv validate .`
- deterministic baseline execution
- pytest suite covering:
  - environment flow
  - state transitions
  - reward correctness
  - inference execution
  - API health

All tests pass from repository root.

---

## Deployment

### Docker

```bash
docker build -t smart-calendar-env .
docker run -p 8000:8000 smart-calendar-env
```

Health check: `curl http://localhost:8000/health`

Expected: `{"status":"healthy"}`

### Hugging Face Spaces

- Deploy using Docker SDK
- Use repository root as build context
- Verify `/health` endpoint
- Ensure logs show clean startup

---

## Key Design Decisions

- Stage-based decomposition → improves interpretability and grading
- Small synthetic dataset → ensures determinism and fast validation
- Structured actions → enables consistent evaluation
- Shaped rewards → provides meaningful learning signal
- Root-level Dockerfile → simplifies deployment pipeline

---

## Evaluation Alignment

This environment directly satisfies OpenEnv requirements:

- real-world task simulation
- multi-step agent interaction
- deterministic graders
- meaningful reward shaping
- reproducible baseline
- Docker + HF Spaces deployability

---

## Summary

Smart Calendar Resolver is a compact, deterministic environment that captures a realistic scheduling workflow while remaining easy to validate, deploy, and evaluate.

It is designed to test:

- multi-step reasoning
- constraint handling
- structured decision making
- trajectory-based agent performance

The environment has also been pushed to Hugging Face Spaces.
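
As a closing illustration of the staged `reset()` / `step(action)` loop and the dense, deterministic reward described in this README, the sketch below walks one episode through the four stages. It is a toy stand-in, not the repository's environment: the class name `ToyCalendarEnv`, the dict-based action, the observation keys, and the reward values (`0.25` per correct stage, `-0.1` per mistake) are all assumptions chosen for readability.

```python
from typing import Any, Dict, List, Tuple

STAGES = [
    "understand_request",
    "evaluate_availability",
    "propose_slot",
    "confirm_schedule",
]

StepResult = Tuple[Dict[str, Any], float, bool, Dict[str, Any]]


class ToyCalendarEnv:
    """Hypothetical miniature of the staged loop; rewards are illustrative only."""

    def __init__(self, valid_slot: str) -> None:
        self.valid_slot = valid_slot  # ground-truth slot for the single scenario
        self.history: List[str] = []
        self.solved = False

    def _observation(self, feedback: str) -> Dict[str, Any]:
        idx = len(self.history)
        return {
            "step": idx,
            "feedback": feedback,
            "next_expected_stage": STAGES[idx] if idx < len(STAGES) else None,
        }

    def reset(self) -> Dict[str, Any]:
        self.history = []
        self.solved = False
        return self._observation("start")

    def step(self, action: Dict[str, Any]) -> StepResult:
        if self.solved:
            # Episode already finished: no further reward.
            return self._observation("episode finished"), 0.0, True, {}
        expected = STAGES[len(self.history)]
        stage = action.get("stage")
        if stage != expected:
            # Out-of-order or unknown stage: penalize without advancing.
            return self._observation("invalid stage transition"), -0.1, False, {}
        if stage == "propose_slot" and action.get("proposed_time_slot") != self.valid_slot:
            return self._observation("incorrect slot"), -0.1, False, {}
        self.history.append(stage)
        self.solved = len(self.history) == len(STAGES)
        return self._observation("ok"), 0.25, self.solved, {}


def run_in_order(env: ToyCalendarEnv) -> float:
    """Deterministic rollout that follows the stage order and returns total reward."""
    env.reset()
    total = 0.0
    for stage in STAGES:
        _, reward, _, _ = env.step({"stage": stage, "proposed_time_slot": env.valid_slot})
        total += reward
    return total
```

Because nothing in the sketch is random, repeating `run_in_order` on the same scenario yields the same total every time, which mirrors the reproducibility property the environment is built around.
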