maruthi committed
Add README for Smart Calendar Resolver environment

Add detailed README for Smart Calendar Resolver environment, outlining problem definition, environment design, dataset, state representation, observation and action spaces, reward function, determinism, validation, testing, and deployment instructions.

README.md (changed)
# Smart Calendar Resolver: OpenEnv Environment

A deterministic, multi-step OpenEnv environment for evaluating agent reasoning in real-world scheduling workflows.

This environment models a constrained meeting scheduling problem where an agent must interpret user intent, reason over structured availability, and produce a valid, verified outcome through a staged interaction loop.

---

## Problem Definition

Given:
- a natural language meeting request
- multiple participants with availability windows
- constraints (duration, deadline, priority, timezone)

The agent must:
1. Interpret the request
2. Aggregate and reason over availability
3. Select a valid time slot
4. Confirm and finalize the schedule

This reflects real-world calendar coordination tasks commonly handled by assistants and productivity tools.
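
Steps 2 and 3 above come down to intersecting the participants' availability windows and picking the earliest overlap that fits the duration before the deadline. The following is a minimal sketch of that idea, assuming windows are `(start, end)` minute offsets in one shared timezone. The function names and the sample data are hypothetical and are not taken from the environment's code.

```python
from typing import Dict, List, Optional, Tuple

Window = Tuple[int, int]  # (start, end) in minutes from midnight, end exclusive


def intersect(a: List[Window], b: List[Window]) -> List[Window]:
    """Return the overlap of two sorted, non-overlapping window lists."""
    out: List[Window] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever window finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def earliest_slot(
    availability: Dict[str, List[Window]], duration: int, deadline: int
) -> Optional[Window]:
    """Earliest slot of `duration` minutes shared by everyone and ending by `deadline`."""
    if not availability:
        return None
    lists = [sorted(windows) for windows in availability.values()]
    common = lists[0]
    for windows in lists[1:]:
        common = intersect(common, windows)
    for start, end in common:
        latest_end = min(end, deadline)
        if latest_end - start >= duration:
            return (start, start + duration)
    return None


# Hypothetical sample data.
availability = {
    "alice": [(540, 660), (780, 900)],  # 09:00-11:00, 13:00-15:00
    "bob": [(600, 720), (840, 960)],    # 10:00-12:00, 14:00-16:00
}
slot = earliest_slot(availability, duration=60, deadline=1020)  # deadline 17:00
```

With this sample data the earliest shared hour is 10:00-11:00, so `slot` is `(600, 660)`. Timezone conversion and priority handling are left out of the sketch.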

---

## Environment Design

### Core Loop

The environment follows the standard OpenEnv interface:

- `reset()`: returns the initial observation
- `step(action)`: returns `(observation, reward, done, info)`
- `state`: internal environment state

### Stage-Based Interaction

The task is decomposed into explicit stages:

1. `understand_request`
2. `evaluate_availability`
3. `propose_slot`
4. `confirm_schedule`

Agents are expected to follow this progression. Out-of-order or invalid transitions are penalized.
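
The `reset()` / `step(action)` loop and the stage ordering can be illustrated with a toy environment. This is only a sketch: `ToyStagedEnv` is a hypothetical stand-in rather than the real OpenEnv classes, and the reward values `0.25` and `-0.1` are invented for illustration.

```python
from typing import Any, Dict, Tuple

STAGES = ["understand_request", "evaluate_availability", "propose_slot", "confirm_schedule"]


class ToyStagedEnv:
    """Toy stand-in for the staged loop; not the real OpenEnv classes."""

    def reset(self) -> Dict[str, Any]:
        self.step_count = 0
        self.next_stage = 0  # index into STAGES
        return self._observation("episode started")

    def step(self, action: Dict[str, Any]) -> Tuple[Dict[str, Any], float, bool, Dict[str, Any]]:
        if self.next_stage >= len(STAGES):
            raise RuntimeError("episode is finished; call reset()")
        self.step_count += 1
        expected = STAGES[self.next_stage]
        if action.get("stage") == expected:
            self.next_stage += 1
            reward, feedback = 0.25, f"accepted {expected}"  # hypothetical value
        else:
            reward, feedback = -0.1, f"expected {expected}"  # hypothetical penalty
        done = self.next_stage == len(STAGES)
        return self._observation(feedback, done), reward, done, {}

    def _observation(self, feedback: str, done: bool = False) -> Dict[str, Any]:
        return {
            "step": self.step_count,
            "feedback": feedback,
            "next_expected_stage": None if done else STAGES[self.next_stage],
        }


env = ToyStagedEnv()
obs = env.reset()
rewards = []
# One out-of-order action first, then the correct sequence.
for stage in ["propose_slot"] + STAGES:
    obs, reward, done, info = env.step({"stage": stage})
    rewards.append(reward)
```

The first action is rejected because `understand_request` was expected, so the episode takes five steps instead of four and the reward list starts with the penalty.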

---

## Dataset

A small, fully deterministic, in-memory dataset is used.

Each scenario includes:
- request text
- participants
- availability windows
- constraints (deadline, duration, priority)
- ground-truth valid slot

Difficulty levels:
- **Easy**: single valid slot, minimal reasoning
- **Medium**: conflicting availability with constraint filtering
- **Hard**: multiple candidates requiring prioritization and constraint trade-offs

Design choices:
- A small dataset keeps runs reproducible
- The absence of randomness keeps evaluation and debugging stable
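
One way to picture such a dataset is a tuple of immutable records with a fixed lookup order. The sketch below is an assumption about shape only: the `Scenario` class, its field names, the `"HH:MM"` time format, and both sample entries are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Scenario:
    """One hypothetical dataset entry; field names are illustrative."""
    scenario_id: str
    difficulty: str  # "easy", "medium" or "hard"
    request: str
    availability: Dict[str, List[Tuple[str, str]]]  # participant -> [(start, end)] as "HH:MM"
    duration_minutes: int
    deadline: str  # "HH:MM"
    priority: str
    valid_slot: Tuple[str, str]  # ground truth used for grading


SCENARIOS: Tuple[Scenario, ...] = (
    Scenario(
        scenario_id="easy-1",
        difficulty="easy",
        request="Set up a 30 minute sync for Alice and Bob before noon.",
        availability={"alice": [("10:00", "11:00")], "bob": [("10:00", "10:30")]},
        duration_minutes=30,
        deadline="12:00",
        priority="normal",
        valid_slot=("10:00", "10:30"),
    ),
    Scenario(
        scenario_id="medium-1",
        difficulty="medium",
        request="Find an hour for Alice, Bob and Chen before 5pm.",
        availability={
            "alice": [("09:00", "11:00"), ("14:00", "16:00")],
            "bob": [("10:30", "12:00"), ("14:00", "15:00")],
            "chen": [("13:00", "15:30")],
        },
        duration_minutes=60,
        deadline="17:00",
        priority="high",
        valid_slot=("14:00", "15:00"),
    ),
)


def get_scenario(index: int) -> Scenario:
    """Fixed ordering with wrap-around and no random sampling."""
    return SCENARIOS[index % len(SCENARIOS)]
```

Because lookup depends only on the index, the same episode number always maps to the same scenario.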

---

## State Representation

The environment maintains:

- `episode_id`
- `step_count`
- `current_scenario`
- `selected_slot`
- `action_history`
- `solved` flag

This enables:
- trajectory-based evaluation
- reward shaping across steps
- deterministic replay
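
The state fields above, together with deterministic replay, can be sketched as a small dataclass plus a pure update rule. The field types and the `apply_action` logic are assumptions made for illustration; the real environment may update its state differently.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EpisodeState:
    """Mirror of the fields listed above; the types are assumptions."""
    episode_id: str
    current_scenario: str
    step_count: int = 0
    selected_slot: Optional[str] = None
    action_history: List[dict] = field(default_factory=list)
    solved: bool = False


def apply_action(state: EpisodeState, action: dict) -> None:
    """Hypothetical update rule: record the action and track slot and solved flag."""
    state.step_count += 1
    state.action_history.append(dict(action))
    if action.get("proposed_time_slot"):
        state.selected_slot = action["proposed_time_slot"]
    if action.get("confirm_schedule") and state.selected_slot is not None:
        state.solved = True


def replay(episode_id: str, scenario: str, actions: List[dict]) -> EpisodeState:
    """Rebuild a state from scratch by re-applying a recorded history."""
    state = EpisodeState(episode_id=episode_id, current_scenario=scenario)
    for action in actions:
        apply_action(state, action)
    return state


live = EpisodeState(episode_id="ep-1", current_scenario="easy-1")
apply_action(live, {"stage": "propose_slot", "proposed_time_slot": "10:00-10:30"})
apply_action(live, {"stage": "confirm_schedule", "confirm_schedule": True})
rebuilt = replay("ep-1", "easy-1", live.action_history)
```

Since the update rule has no hidden inputs, replaying `action_history` yields a state equal to the live one.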

---

## Observation Space

Each observation contains:

- request (natural language)
- structured availability
- constraints
- current step index
- feedback signal
- action history
- next expected stage
- reward
- done flag

Observations are designed to balance:
- realism (semi-structured inputs)
- controllability (no external dependencies)

---

## Action Space

Actions are typed via Pydantic models. Fields include:
- `stage`
- `proposed_time_slot`
- `confirm_schedule`
- `final_note`

Actions are structured but flexible enough to simulate agent reasoning.
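
The four fields can be pictured as a typed, validated record. The sketch below uses a standard-library dataclass as a dependency-free stand-in for the Pydantic model, so it is not the environment's actual class. The name `CalendarAction`, the field types, the defaults, and the validation rules are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional

VALID_STAGES = (
    "understand_request",
    "evaluate_availability",
    "propose_slot",
    "confirm_schedule",
)


@dataclass(frozen=True)
class CalendarAction:
    """Hypothetical action record; a stand-in for the Pydantic model."""
    stage: str
    proposed_time_slot: Optional[str] = None
    confirm_schedule: bool = False
    final_note: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject unknown stages early so malformed actions never reach the environment.
        if self.stage not in VALID_STAGES:
            raise ValueError(f"unknown stage: {self.stage!r}")
        # Assumed rule: a slot proposal must actually carry a slot.
        if self.stage == "propose_slot" and not self.proposed_time_slot:
            raise ValueError("propose_slot requires proposed_time_slot")


def parse_action(payload: dict) -> CalendarAction:
    """Build an action from a plain dict, as an agent's JSON output might provide."""
    return CalendarAction(**payload)


action = parse_action({"stage": "propose_slot", "proposed_time_slot": "10:00-10:30"})
```

An unknown stage name or a slot proposal without a slot raises `ValueError`, and an unexpected key raises `TypeError`, which gives the caller a clear error to report.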

---

## Reward Function

The shaped reward encourages incremental progress. Positive reward is given for:

- correct interpretation of the request
- correct use of availability constraints
- valid slot selection
- correct final confirmation
- a concise and relevant final note

Penalties:
- invalid stage transitions
- incorrect slot selection
- repeated or redundant actions

Properties:
- dense (not sparse)
- deterministic
- aligned with task completion
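
A per-step scoring function along these lines might look as follows. The README does not state the actual weights or the definition of a "concise" note, so every constant below, and the function `step_reward` itself, is a hypothetical choice for illustration.

```python
from typing import Dict, List

# Hypothetical weights; the real values are not given in this README.
STAGE_REWARD = 0.2
NOTE_BONUS = 0.1
STAGE_PENALTY = -0.1
WRONG_SLOT_PENALTY = -0.2
REPEAT_PENALTY = -0.05
MAX_NOTE_WORDS = 30  # assumed threshold for a "concise" note


def step_reward(action: Dict, expected_stage: str, valid_slot: str, history: List[Dict]) -> float:
    """Score one action using the bonus and penalty categories listed above."""
    if action in history:
        return REPEAT_PENALTY  # repeated or redundant action
    if action.get("stage") != expected_stage:
        return STAGE_PENALTY  # invalid stage transition
    if expected_stage == "propose_slot" and action.get("proposed_time_slot") != valid_slot:
        return WRONG_SLOT_PENALTY  # incorrect slot selection
    reward = STAGE_REWARD
    note = action.get("final_note")
    if expected_stage == "confirm_schedule" and note and len(note.split()) <= MAX_NOTE_WORDS:
        reward += NOTE_BONUS  # concise final note
    return reward
```

Every step returns a non-zero value, which is what makes the signal dense, and the result depends only on the arguments, which keeps it deterministic.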

---

## Determinism & Reproducibility

- No randomness in dataset or transitions
- Fixed scenario ordering
- Identical rewards for identical actions
- Deterministic baseline policy

This ensures:
- reproducible scoring
- stable evaluation across runs
- compatibility with automated grading
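
One way to check the "identical rewards for identical actions" property is to run the same policy twice and compare a hash of the reward trajectory. The rollout below is a toy stand-in for the environment, and `trajectory_fingerprint`, `run_episode`, and the reward values are hypothetical.

```python
import hashlib
import json
from typing import Callable, List, Sequence

STAGES = ["understand_request", "evaluate_availability", "propose_slot", "confirm_schedule"]


def trajectory_fingerprint(rewards: Sequence[float]) -> str:
    """Stable SHA-256 hash of a reward trajectory for run-to-run comparison."""
    payload = json.dumps([round(r, 6) for r in rewards])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_episode(policy: Callable[[int], str], expected: List[str]) -> List[float]:
    """Toy deterministic rollout: 1.0 when the policy names the expected stage, else -0.5."""
    return [1.0 if policy(i) == stage else -0.5 for i, stage in enumerate(expected)]


def baseline_policy(step_index: int) -> str:
    """Deterministic baseline that always picks the stage for the current step."""
    return STAGES[step_index]


first = run_episode(baseline_policy, STAGES)
second = run_episode(baseline_policy, STAGES)
same = trajectory_fingerprint(first) == trajectory_fingerprint(second)
```

Comparing fingerprints rather than raw lists makes it easy to store an expected value in a grading pipeline and detect any drift between runs.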

---

## Baseline (Inference)

A deterministic baseline is provided.

Characteristics:
- follows the correct stage sequence
- selects the known valid slot
- produces consistent output
- has no external model dependency

### Required Output Format

The script emits strictly formatted logs:

    [START] task=<task_name> env=<env_name> model=<model_name>
    [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
    [END] success=<true|false> steps=<n> rewards=<r1,r2,...,rn>

This format is required for evaluation pipelines.
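
The three log lines can be produced by small formatting helpers like the ones below. The helper names are hypothetical, and formatting each entry of `rewards` with two decimals is an assumption inferred from the `reward=<0.00>` placeholder.

```python
from typing import List, Optional


def format_start(task: str, env: str, model: str) -> str:
    """Build the [START] line."""
    return f"[START] task={task} env={env} model={model}"


def format_step(step: int, action: str, reward: float, done: bool, error: Optional[str] = None) -> str:
    """Build one [STEP] line with lowercase booleans and `null` for no error."""
    done_str = "true" if done else "false"
    error_str = error if error is not None else "null"
    return f"[STEP] step={step} action={action} reward={reward:.2f} done={done_str} error={error_str}"


def format_end(success: bool, rewards: List[float]) -> str:
    """Build the [END] line; the step count is taken from the reward list length."""
    success_str = "true" if success else "false"
    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
    return f"[END] success={success_str} steps={len(rewards)} rewards={rewards_str}"


line = format_step(1, "propose_slot", 0.25, False)
```

Keeping the formatting in one place reduces the chance that a stray capitalized `True` or a missing `null` breaks a parser downstream.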

---

## Validation & Testing

The environment has been verified with:

- `uv run openenv validate .`
- deterministic baseline execution
- pytest suite covering:
  - environment flow
  - state transitions
  - reward correctness
  - inference execution
  - API health

All tests pass when run from the repository root.

---

## Deployment

### Docker

```bash
docker build -t smart-calendar-env .
docker run -p 8000:8000 smart-calendar-env
```

Health check:

    curl http://localhost:8000/health

Expected:

    {"status":"healthy"}
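
The same check can be scripted, for example as a smoke test after starting the container. This sketch uses only the Python standard library; the function names are hypothetical, and the default URL simply mirrors the `curl` command above.

```python
import json
import urllib.request


def is_healthy(body: str) -> bool:
    """True when the response body is a JSON object with status == "healthy"."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and data.get("status") == "healthy"


def check_health(url: str = "http://localhost:8000/health", timeout: float = 5.0) -> bool:
    """Fetch the endpoint and validate the body; any network error counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return is_healthy(response.read().decode("utf-8"))
    except OSError:  # covers URLError, HTTPError and connection failures
        return False
```

Calling `check_health()` while the container from the previous step is running should return `True` if the endpoint responds as documented.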

### Hugging Face Spaces

- Deploy using the Docker SDK
- Use the repository root as the build context
- Verify the `/health` endpoint
- Ensure the logs show a clean startup

## Key Design Decisions

- Stage-based decomposition: improves interpretability and grading
- Small synthetic dataset: ensures determinism and fast validation
- Structured actions: enable consistent evaluation
- Shaped rewards: provide a meaningful learning signal
- Root-level Dockerfile: simplifies the deployment pipeline

## Evaluation Alignment

This environment directly satisfies OpenEnv requirements:

- real-world task simulation
- multi-step agent interaction
- deterministic graders
- meaningful reward shaping
- reproducible baseline
- Docker + HF Spaces deployability

## Summary

Smart Calendar Resolver is a compact, deterministic environment that captures a realistic scheduling workflow while remaining easy to validate, deploy, and evaluate.

It is designed to test:

- multi-step reasoning
- constraint handling
- structured decision making
- trajectory-based agent performance

The environment has also been pushed to Hugging Face Spaces.