---
title: Smart Calendar Resolver
emoji: π
colorFrom: blue
colorTo: green
sdk: docker
app_file: app.py
pinned: false
---
# Smart Calendar Resolver: OpenEnv Environment
A deterministic, multi-step OpenEnv environment for evaluating agent reasoning in real-world scheduling workflows.
This environment models a constrained meeting scheduling problem where an agent must interpret user intent, reason over structured availability, and produce a valid, verified outcome through a staged interaction loop.
---
## Problem Definition
Given:
- a natural language meeting request
- multiple participants with availability windows
- constraints (duration, deadline, priority, timezone)
The agent must:
1. Interpret the request
2. Aggregate and reason over availability
3. Select a valid time slot
4. Confirm and finalize the schedule
This reflects real-world calendar coordination tasks commonly handled by assistants and productivity tools.
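Steps 2 and 3 can be illustrated with a small interval-intersection sketch. It assumes availability windows are `(start, end)` pairs in minutes since midnight, already normalized to one timezone and non-overlapping per participant; the function names and data layout are hypothetical and not taken from the environment's code.

```python
from typing import Dict, List, Optional, Tuple

Window = Tuple[int, int]  # (start, end) in minutes since midnight; assumed layout


def intersect(a: List[Window], b: List[Window]) -> List[Window]:
    """Return the overlap of two sorted lists of non-overlapping windows."""
    out: List[Window] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever window ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def earliest_slot(
    availability: Dict[str, List[Window]], duration: int, deadline: int
) -> Optional[Window]:
    """Earliest slot of `duration` minutes shared by everyone and ending by `deadline`."""
    common: Optional[List[Window]] = None
    for windows in availability.values():
        ordered = sorted(windows)
        common = ordered if common is None else intersect(common, ordered)
    for start, end in common or []:
        if min(end, deadline) - start >= duration:
            return (start, start + duration)
    return None


# Illustrative data only.
availability = {
    "alice": [(540, 660), (780, 900)],  # 09:00-11:00, 13:00-15:00
    "bob": [(600, 720), (840, 960)],    # 10:00-12:00, 14:00-16:00
}
slot = earliest_slot(availability, duration=60, deadline=1020)
print(slot)  # (600, 660), i.e. 10:00-11:00
```

Priority handling and timezone conversion are left out of this sketch; a real resolver would need both.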
---
## Environment Design
### Core Loop
The environment follows the standard OpenEnv interface:
- `reset()` → returns initial observation
- `step(action)` → returns `(observation, reward, done, info)`
- `state` → internal environment state
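The interface above can be exercised with a driver loop like the following. `ToyEnv` is a hypothetical stand-in written for this example, with a fixed four-step episode and a constant reward; it is not the environment's actual class and only mirrors the `reset()` / `step(action)` / `state` shape.

```python
from typing import Any, Dict, Tuple


class ToyEnv:
    """Hypothetical stand-in for the real environment; rewards are illustrative."""

    MAX_STEPS = 4

    def __init__(self) -> None:
        self._step_count = 0

    def reset(self) -> Dict[str, Any]:
        self._step_count = 0
        return {"step": 0}

    def step(self, action: Dict[str, Any]) -> Tuple[Dict[str, Any], float, bool, Dict[str, Any]]:
        self._step_count += 1
        done = self._step_count >= self.MAX_STEPS
        reward = 0.25  # constant placeholder reward
        return {"step": self._step_count}, reward, done, {"action": action}

    @property
    def state(self) -> Dict[str, Any]:
        return {"step_count": self._step_count}


stages = ["understand_request", "evaluate_availability", "propose_slot", "confirm_schedule"]

env = ToyEnv()
obs = env.reset()
total = 0.0
done = False
for stage in stages:
    obs, reward, done, info = env.step({"stage": stage})
    total += reward
    if done:
        break

print(total, done, env.state)
```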
### Stage-Based Interaction
The task is decomposed into explicit stages:
1. `understand_request`
2. `evaluate_availability`
3. `propose_slot`
4. `confirm_schedule`
Agents are expected to follow this progression. Out-of-order or invalid transitions are penalized.
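One way to enforce this progression is to compare each submitted stage with the next stage in a fixed order. The sketch below assumes a flat penalty of `-0.1` and that a rejected action leaves the history unchanged; both are assumptions for illustration, not documented behavior.

```python
from typing import List, Optional

STAGE_ORDER = [
    "understand_request",
    "evaluate_availability",
    "propose_slot",
    "confirm_schedule",
]


def next_expected_stage(history: List[str]) -> Optional[str]:
    """Stage the agent should submit next, or None once all stages are done."""
    if len(history) < len(STAGE_ORDER):
        return STAGE_ORDER[len(history)]
    return None


def apply_stage(history: List[str], stage: str, penalty: float = -0.1) -> float:
    """Accept `stage` if it is the expected one; otherwise return the penalty.

    The penalty value is hypothetical. History is only extended on success.
    """
    if stage == next_expected_stage(history):
        history.append(stage)
        return 0.0
    return penalty


history: List[str] = []
skipped = apply_stage(history, "propose_slot")        # out of order
accepted = apply_stage(history, "understand_request")  # in order
print(skipped, accepted, next_expected_stage(history))
```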
---
## Dataset
A small, fully deterministic, in-memory dataset is used.
Each scenario includes:
- request text
- participants
- availability windows
- constraints (deadline, duration, priority)
- ground-truth valid slot
Difficulty levels:
- **Easy**: single valid slot, minimal reasoning
- **Medium**: conflicting availability with constraint filtering
- **Hard**: multiple candidates requiring prioritization and constraint trade-offs
Design choice:
- Small dataset ensures reproducibility
- No randomness ensures stable evaluation and debugging
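A fixed in-memory dataset of this kind might look like the sketch below. The `Scenario` field names, the slot string format, and the two sample entries are invented for illustration; the real dataset's schema and contents are not shown in this README.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Scenario:
    """Hypothetical scenario record mirroring the fields listed above."""

    request: str
    participants: Tuple[str, ...]
    availability: Dict[str, Tuple[str, ...]]
    duration_minutes: int
    deadline: str
    priority: str
    valid_slot: str
    difficulty: str


# Illustrative entries only; not the environment's actual scenarios.
SCENARIOS: Tuple[Scenario, ...] = (
    Scenario(
        request="Set up a 30 minute sync with Bob before Friday.",
        participants=("alice", "bob"),
        availability={"alice": ("Thu 10:00-11:00",), "bob": ("Thu 10:00-10:30",)},
        duration_minutes=30,
        deadline="Fri 00:00",
        priority="normal",
        valid_slot="Thu 10:00-10:30",
        difficulty="easy",
    ),
    Scenario(
        request="Find an hour for the design review this week.",
        participants=("alice", "bob", "carol"),
        availability={
            "alice": ("Tue 09:00-12:00", "Wed 14:00-16:00"),
            "bob": ("Tue 11:00-13:00", "Wed 14:00-15:00"),
            "carol": ("Wed 13:00-15:00",),
        },
        duration_minutes=60,
        deadline="Sat 00:00",
        priority="high",
        valid_slot="Wed 14:00-15:00",
        difficulty="medium",
    ),
)


def get_scenario(episode_index: int) -> Scenario:
    """Fixed ordering with wrap-around: the same index always gives the same scenario."""
    return SCENARIOS[episode_index % len(SCENARIOS)]


print(get_scenario(0).difficulty, get_scenario(3).difficulty)
```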
---
## State Representation
The environment maintains:
- `episode_id`
- `step_count`
- `current_scenario`
- `selected_slot`
- `action_history`
- `solved` flag
This enables:
- trajectory-based evaluation
- reward shaping across steps
- deterministic replay
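As a rough illustration, the state fields could be held in a dataclass, which also makes deterministic replay easy to check through equality. The field types and the `record` and `replay` helpers are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EnvState:
    """Hypothetical container for the fields listed above; types are assumed."""

    episode_id: str
    step_count: int = 0
    current_scenario: Optional[str] = None
    selected_slot: Optional[str] = None
    action_history: List[str] = field(default_factory=list)
    solved: bool = False

    def record(self, action: str) -> None:
        """Append an action to the trajectory and advance the step counter."""
        self.action_history.append(action)
        self.step_count += 1


def replay(episode_id: str, actions: List[str]) -> EnvState:
    """Rebuild a state by re-applying a recorded action sequence."""
    state = EnvState(episode_id=episode_id)
    for action in actions:
        state.record(action)
    return state


actions = ["understand_request", "evaluate_availability", "propose_slot", "confirm_schedule"]
first = replay("ep-1", actions)
second = replay("ep-1", actions)
print(first == second, first.step_count)
```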
---
## Observation Space
Each observation contains:
- request (natural language)
- structured availability
- constraints
- current step index
- feedback signal
- action history
- next expected stage
- reward
- done flag
Observations are designed to balance:
- realism (semi-structured inputs)
- controllability (no external dependencies)
---
## Action Space
Actions are typed via Pydantic models. Fields include:
- `stage`
- `proposed_time_slot`
- `confirm_schedule`
- `final_note`
Actions are structured but flexible enough to simulate agent reasoning.
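A Pydantic model with these four fields might look like the sketch below. The class name `SchedulingAction`, the field types, the defaults, and the slot string format are assumptions; only the field names come from the list above.

```python
from typing import Literal, Optional

from pydantic import BaseModel, ValidationError

Stage = Literal[
    "understand_request",
    "evaluate_availability",
    "propose_slot",
    "confirm_schedule",
]


class SchedulingAction(BaseModel):
    """Hypothetical action model; types and defaults are assumed."""

    stage: Stage
    proposed_time_slot: Optional[str] = None
    confirm_schedule: bool = False
    final_note: Optional[str] = None


# A well-formed action passes validation.
action = SchedulingAction(stage="propose_slot", proposed_time_slot="Wed 14:00-15:00")

# An unknown stage name is rejected before it reaches the environment.
try:
    SchedulingAction(stage="skip_ahead")
    rejected = False
except ValidationError:
    rejected = True

print(action.stage, rejected)
```

Constraining `stage` to a `Literal` moves malformed actions into validation errors rather than reward penalties; whether the real models do this is not stated here.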
---
## Reward Function
Shaped reward encourages incremental progress:
- + correct interpretation of request
- + correct use of availability constraints
- + valid slot selection
- + correct final confirmation
- + concise and relevant final note
Penalties:
- invalid stage transitions
- incorrect slot selection
- repeated or redundant actions
Properties:
- dense (not sparse)
- deterministic
- aligned with task completion
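The shaping described above can be sketched as a pure function of the action and the episode history. All numeric weights below are hypothetical, the order of the checks is an assumption, and the bonus for the final note is omitted for brevity.

```python
from typing import List, Optional

# Hypothetical weights; the environment's actual values are not documented here.
STAGE_BONUS = 0.25
REPEAT_PENALTY = -0.125
INVALID_TRANSITION_PENALTY = -0.25
WRONG_SLOT_PENALTY = -0.5


def step_reward(
    stage: str,
    expected_stage: str,
    history: List[str],
    proposed_slot: Optional[str] = None,
    valid_slot: Optional[str] = None,
) -> float:
    """Deterministic per-step reward: same inputs always give the same value."""
    if stage in history:
        return REPEAT_PENALTY
    if stage != expected_stage:
        return INVALID_TRANSITION_PENALTY
    if stage == "propose_slot" and proposed_slot != valid_slot:
        return WRONG_SLOT_PENALTY
    return STAGE_BONUS


order = ["understand_request", "evaluate_availability", "propose_slot", "confirm_schedule"]
history: List[str] = []
rewards: List[float] = []
for stage in order:
    r = step_reward(stage, order[len(history)], history, "Wed 14:00", "Wed 14:00")
    rewards.append(r)
    history.append(stage)

print(rewards, sum(rewards))
```

Because every step yields a nonzero signal, an agent that stalls halfway still receives partial credit, which is what makes the reward dense rather than sparse.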
---
## Determinism & Reproducibility
- No randomness in dataset or transitions
- Fixed scenario ordering
- Identical rewards for identical actions
- Deterministic baseline policy
This ensures:
- reproducible scoring
- stable evaluation across runs
- compatibility with automated grading
---
## Baseline (Inference)
A deterministic baseline is provided.
Characteristics:
- follows correct stage sequence
- selects known valid slot
- produces consistent output
- uses the injected OpenAI-compatible proxy when `API_BASE_URL`, `API_KEY`, and `MODEL_NAME` are present
- falls back to the deterministic local baseline when those submission env vars are absent
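The switch between the proxy and the local baseline can be sketched as a check on the three environment variables. The function name `select_backend` and the choice to treat an empty value as absent are assumptions for this example.

```python
import os
from typing import Mapping, Optional

REQUIRED_VARS = ("API_BASE_URL", "API_KEY", "MODEL_NAME")


def select_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "proxy" when all three variables are set and non-empty, else "local".

    Treating empty strings as missing is an assumption of this sketch.
    """
    env = os.environ if env is None else env
    if all(env.get(name) for name in REQUIRED_VARS):
        return "proxy"
    return "local"


# Placeholder values for illustration only.
full = {"API_BASE_URL": "http://localhost:9000/v1", "API_KEY": "placeholder", "MODEL_NAME": "example-model"}
partial = {"API_BASE_URL": "http://localhost:9000/v1"}

print(select_backend(full), select_backend(partial), select_backend({}))
```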
### Required Output Format
The script emits strictly formatted logs:
- `[START] task=<task_name> env=<env_name> model=<model_name>`
- `[STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>`
- `[END] success=<true|false> steps=<n> rewards=<r1,r2,...,rn>`
This format is required for evaluation pipelines.
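Helpers along these lines could produce the three log lines. The function names are hypothetical, and formatting the `[END]` rewards with two decimals is an assumption based on the `reward=<0.00>` placeholder in the `[STEP]` line.

```python
from typing import List, Optional


def _flag(value: bool) -> str:
    """Render a boolean as lowercase true/false, as the format requires."""
    return "true" if value else "false"


def format_start(task: str, env: str, model: str) -> str:
    return f"[START] task={task} env={env} model={model}"


def format_step(step: int, action: str, reward: float, done: bool, error: Optional[str] = None) -> str:
    err = "null" if error is None else error
    return f"[STEP] step={step} action={action} reward={reward:.2f} done={_flag(done)} error={err}"


def format_end(success: bool, steps: int, rewards: List[float]) -> str:
    # Two-decimal rewards are an assumption; the spec only shows r1,r2,...,rn.
    joined = ",".join(f"{r:.2f}" for r in rewards)
    return f"[END] success={_flag(success)} steps={steps} rewards={joined}"


print(format_start("easy", "smart-calendar", "baseline"))
print(format_step(1, "understand_request", 0.25, False))
print(format_end(True, 2, [0.25, 0.5]))
```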
---
## Validation & Testing
The environment has been verified with:
- `uv run openenv validate .`
- deterministic baseline execution
- pytest suite covering:
  - environment flow
  - state transitions
  - reward correctness
  - inference execution
  - API health
All tests pass from repository root.
---
## Deployment
### Docker
```bash
docker build -t smart-calendar-env .
docker run -p 8000:8000 smart-calendar-env
```
Health check: `curl http://localhost:8000/health`

Expected response: `{"status":"healthy"}`
### Hugging Face Spaces
- Deploy using the Docker SDK
- Use the repository root as the build context
- Verify the `/health` endpoint
- Ensure logs show a clean startup
---
## Key Design Decisions
- Stage-based decomposition → improves interpretability and grading
- Small synthetic dataset → ensures determinism and fast validation
- Structured actions → enables consistent evaluation
- Shaped rewards → provides meaningful learning signal
- Root-level Dockerfile → simplifies deployment pipeline
---
## Evaluation Alignment
This environment directly satisfies OpenEnv requirements:
- real-world task simulation
- multi-step agent interaction
- deterministic graders
- meaningful reward shaping
- reproducible baseline
- Docker + HF Spaces deployability
---
## Summary
Smart Calendar Resolver is a compact, deterministic environment that captures a realistic scheduling workflow while remaining easy to validate, deploy, and evaluate.

It is designed to test:
- multi-step reasoning
- constraint handling
- structured decision making
- trajectory-based agent performance