Commit 6504bdb by maruthi (unverified) · 1 Parent(s): d02cfdb

Add README for Smart Calendar Resolver environment

Add detailed README for Smart Calendar Resolver environment, outlining problem definition, environment design, dataset, state representation, observation and action spaces, reward function, determinism, validation, testing, and deployment instructions.

Files changed (1): README.md (+248 -0)

# Smart Calendar Resolver: OpenEnv Environment

A deterministic, multi-step OpenEnv environment for evaluating agent reasoning in real-world scheduling workflows.

This environment models a constrained meeting scheduling problem where an agent must interpret user intent, reason over structured availability, and produce a valid, verified outcome through a staged interaction loop.

---

## Problem Definition

Given:
- a natural-language meeting request
- multiple participants with availability windows
- constraints (duration, deadline, priority, timezone)

The agent must:
1. Interpret the request
2. Aggregate and reason over availability
3. Select a valid time slot
4. Confirm and finalize the schedule

This reflects real-world calendar coordination tasks commonly handled by assistants and productivity tools.

---

## Environment Design

### Core Loop

The environment follows the standard OpenEnv interface:

- `reset()` → returns initial observation
- `step(action)` → returns `(observation, reward, done, info)`
- `state` → internal environment state
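
The interface above can be illustrated with a toy stand-in. This is only a sketch of the call shape: `ToyCalendarEnv`, `run_episode`, the observation keys, and the constant `0.0` reward are all hypothetical and are not taken from the actual environment or from the OpenEnv library.

```python
from typing import Any, Dict, Tuple

Observation = Dict[str, Any]


class ToyCalendarEnv:
    """Hypothetical stand-in with the reset/step/state shape listed above."""

    MAX_STEPS = 4  # assumption: one step per stage

    def __init__(self) -> None:
        self._step_count = 0

    def reset(self) -> Observation:
        """Start a new episode and return the initial observation."""
        self._step_count = 0
        return {"step": 0, "done": False}

    def step(self, action: Dict[str, Any]) -> Tuple[Observation, float, bool, Dict[str, Any]]:
        """Advance one step; the reward is a placeholder, not the real scoring."""
        self._step_count += 1
        done = self._step_count >= self.MAX_STEPS
        observation = {"step": self._step_count, "done": done}
        return observation, 0.0, done, {"last_action": action}

    @property
    def state(self) -> Dict[str, Any]:
        """Internal state, exposed read-only."""
        return {"step_count": self._step_count}


def run_episode(env: ToyCalendarEnv) -> int:
    """Drive the env until it reports done and return the number of steps taken."""
    env.reset()
    done = False
    while not done:
        _obs, _reward, done, _info = env.step({"stage": "placeholder"})
    return env.state["step_count"]
```

Calling `run_episode(ToyCalendarEnv())` keeps stepping until `done` is true, which in this toy version happens after four steps.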

### Stage-Based Interaction

The task is decomposed into explicit stages:

1. `understand_request`
2. `evaluate_availability`
3. `propose_slot`
4. `confirm_schedule`

Agents are expected to follow this progression. Out-of-order or invalid transitions are penalized.
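
One way to check the stage progression is to compare each submitted stage with the next one in the fixed order. The sketch below assumes the history holds only stages that were already accepted; the helper names `expected_stage` and `is_valid_transition` are hypothetical, and how the real environment sizes the penalty is not shown here.

```python
from typing import List, Optional

# Stage names as listed in the README, in their required order.
STAGES = (
    "understand_request",
    "evaluate_availability",
    "propose_slot",
    "confirm_schedule",
)


def expected_stage(accepted_history: List[str]) -> Optional[str]:
    """Return the next required stage, or None once every stage is complete."""
    completed = len(accepted_history)
    return STAGES[completed] if completed < len(STAGES) else None


def is_valid_transition(accepted_history: List[str], stage: str) -> bool:
    """A transition is valid only if it names the next stage in the sequence."""
    return stage == expected_stage(accepted_history)
```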

---

## Dataset

A small, fully deterministic, in-memory dataset is used.

Each scenario includes:
- request text
- participants
- availability windows
- constraints (deadline, duration, priority)
- ground-truth valid slot

Difficulty levels:
- **Easy**: single valid slot, minimal reasoning
- **Medium**: conflicting availability with constraint filtering
- **Hard**: multiple candidates requiring prioritization and constraint trade-offs

Design choices:
- A small dataset keeps results reproducible
- No randomness keeps evaluation and debugging stable
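
A scenario record along these lines might look like the sketch below. The `Scenario` class, its field names, the sample data, and the `valid_slots` helper are all hypothetical. As a simplification, windows are whole hours on a single day, and timezone and priority handling are left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Window = Tuple[int, int]  # (start_hour, end_hour) on one day, 24h clock


@dataclass(frozen=True)
class Scenario:
    request: str
    availability: Dict[str, List[Window]]  # participant -> free windows
    duration_hours: int
    deadline_hour: int
    priority: str
    ground_truth_slot: Window


def valid_slots(scenario: Scenario) -> List[Window]:
    """List hour-aligned slots that fit every participant and end by the deadline."""
    slots = []
    for start in range(0, 24 - scenario.duration_hours + 1):
        end = start + scenario.duration_hours
        if end > scenario.deadline_hour:
            continue
        fits_everyone = all(
            any(w_start <= start and end <= w_end for w_start, w_end in windows)
            for windows in scenario.availability.values()
        )
        if fits_everyone:
            slots.append((start, end))
    return slots


# Made-up "easy" scenario: exactly one slot satisfies all constraints.
EASY = Scenario(
    request="Schedule a 1-hour sync for Alice and Bob before 3 PM.",
    availability={"alice": [(9, 11), (13, 14)], "bob": [(10, 12)]},
    duration_hours=1,
    deadline_hour=15,
    priority="normal",
    ground_truth_slot=(10, 11),
)
```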

---

## State Representation

The environment maintains:

- `episode_id`
- `step_count`
- `current_scenario`
- `selected_slot`
- `action_history`
- `solved` flag

This enables:
- trajectory-based evaluation
- reward shaping across steps
- deterministic replay
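
The state fields above could be held in a plain dataclass, as in the sketch below. The field types, the `record` method, the `replay` helper, and the sample identifiers are assumptions made for illustration. Replaying the same list of stages yields an equal state, which is the property deterministic replay relies on.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EnvState:
    episode_id: str
    current_scenario: str
    step_count: int = 0
    selected_slot: Optional[str] = None
    action_history: List[str] = field(default_factory=list)
    solved: bool = False

    def record(self, stage: str, slot: Optional[str] = None) -> None:
        """Append one action to the trajectory and update the counters."""
        self.step_count += 1
        self.action_history.append(stage)
        if slot is not None:
            self.selected_slot = slot


def replay(stages: List[str]) -> EnvState:
    """Rebuild a state from a list of stages; same input gives an equal state."""
    state = EnvState(episode_id="ep-0", current_scenario="easy-0")
    for stage in stages:
        state.record(stage)
    return state
```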

---

## Observation Space

Each observation contains:

- request (natural language)
- structured availability
- constraints
- current step index
- feedback signal
- action history
- next expected stage
- reward
- done flag

Observations are designed to balance:
- realism (semi-structured inputs)
- controllability (no external dependencies)

---

## Action Space

Actions are typed via Pydantic models. Fields include:
- `stage`
- `proposed_time_slot`
- `confirm_schedule`
- `final_note`

Actions are structured but flexible enough to simulate agent reasoning.
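
The README types actions with Pydantic. To stay dependency-free, the sketch below uses a standard-library dataclass as a stand-in, so it is not the project's actual model. The class name `CalendarAction`, the field types, the defaults, and both validation rules are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

STAGES = (
    "understand_request",
    "evaluate_availability",
    "propose_slot",
    "confirm_schedule",
)


@dataclass(frozen=True)
class CalendarAction:
    """Dataclass stand-in for the Pydantic action model (hypothetical shape)."""

    stage: str
    proposed_time_slot: Optional[str] = None
    confirm_schedule: bool = False
    final_note: str = ""

    def __post_init__(self) -> None:
        # Reject stage names outside the four documented stages.
        if self.stage not in STAGES:
            raise ValueError(f"unknown stage: {self.stage!r}")
        # Assumed rule: proposing a slot requires naming the slot.
        if self.stage == "propose_slot" and not self.proposed_time_slot:
            raise ValueError("propose_slot requires proposed_time_slot")
```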

---

## Reward Function

A shaped reward encourages incremental progress.

Rewards:
- correct interpretation of the request
- correct use of availability constraints
- valid slot selection
- correct final confirmation
- concise and relevant final note

Penalties:
- invalid stage transitions
- incorrect slot selection
- repeated or redundant actions

Properties:
- dense (not sparse)
- deterministic
- aligned with task completion
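
A per-step shaped reward in this spirit could be sketched as below. The numeric values, the dictionary keys, and the `step_reward` signature are placeholders chosen for illustration, not the environment's real scoring, and the final-note bonus is not modeled. The function is pure, so identical inputs always give identical rewards.

```python
from typing import Optional, Sequence

# Placeholder magnitudes; the real values are not stated in the README.
REWARDS = {"valid_stage": 0.2, "valid_slot": 0.3, "confirmed": 0.3}
PENALTIES = {"invalid_transition": -0.2, "wrong_slot": -0.3, "repeated": -0.1}


def step_reward(
    expected_stage: str,
    stage: str,
    history: Sequence[str],
    proposed_slot: Optional[str] = None,
    truth_slot: Optional[str] = None,
) -> float:
    """Score one action from the expected stage, the history, and the slot."""
    if stage in history:
        return PENALTIES["repeated"]  # redundant action
    if stage != expected_stage:
        return PENALTIES["invalid_transition"]  # out-of-order stage
    if stage == "propose_slot":
        if proposed_slot == truth_slot:
            return REWARDS["valid_slot"]
        return PENALTIES["wrong_slot"]
    if stage == "confirm_schedule":
        return REWARDS["confirmed"]
    return REWARDS["valid_stage"]
```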

---

## Determinism & Reproducibility

- No randomness in dataset or transitions
- Fixed scenario ordering
- Identical rewards for identical actions
- Deterministic baseline policy

This ensures:
- reproducible scoring
- stable evaluation across runs
- compatibility with automated grading

---

## Baseline (Inference)

A deterministic baseline is provided.

Characteristics:
- follows correct stage sequence
- selects known valid slot
- produces consistent output
- no external model dependency

### Required Output Format

The script emits strictly formatted logs:

    [START] task=<task_name> env=<env_name> model=<model_name>
    [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
    [END] success=<true|false> steps=<n> rewards=<r1,r2,...,rn>

This format is required for evaluation pipelines.
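
The three log lines can be produced by small formatting helpers such as the ones sketched below. The function names are hypothetical. Printing each entry of the `rewards` list with two decimals is an assumption, since the template only fixes two decimals for the per-step `reward` field.

```python
from typing import Optional, Sequence


def _flag(value: bool) -> str:
    """Render booleans as lowercase true/false, as the template requires."""
    return "true" if value else "false"


def fmt_start(task: str, env: str, model: str) -> str:
    return f"[START] task={task} env={env} model={model}"


def fmt_step(step: int, action: str, reward: float, done: bool,
             error: Optional[str] = None) -> str:
    shown_error = "null" if error is None else error
    return (
        f"[STEP] step={step} action={action} reward={reward:.2f} "
        f"done={_flag(done)} error={shown_error}"
    )


def fmt_end(success: bool, rewards: Sequence[float]) -> str:
    # Assumption: steps equals the number of recorded rewards.
    joined = ",".join(f"{r:.2f}" for r in rewards)
    return f"[END] success={_flag(success)} steps={len(rewards)} rewards={joined}"
```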

---

## Validation & Testing

The environment has been verified with:

- `uv run openenv validate .`
- deterministic baseline execution
- pytest suite covering:
  - environment flow
  - state transitions
  - reward correctness
  - inference execution
  - API health

All tests pass when run from the repository root.

---

## Deployment

### Docker

```bash
docker build -t smart-calendar-env .
docker run -p 8000:8000 smart-calendar-env
```

Health check:

    curl http://localhost:8000/health

Expected:

    {"status":"healthy"}
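
The same health check can be scripted, for example as a deployment smoke test. The sketch below uses only the Python standard library; the helper names `is_healthy` and `check_health` are hypothetical, and it assumes the container from the Docker step is listening on port 8000.

```python
import json
from urllib.request import urlopen


def is_healthy(body: bytes) -> bool:
    """Return True when the body is JSON with status equal to "healthy"."""
    try:
        payload = json.loads(body.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return False
    return isinstance(payload, dict) and payload.get("status") == "healthy"


def check_health(url: str = "http://localhost:8000/health", timeout: float = 5.0) -> bool:
    """Fetch the health endpoint; any connection error counts as unhealthy."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200 and is_healthy(response.read())
    except OSError:  # covers URLError, HTTPError, and socket timeouts
        return False
```

With the container running, `check_health()` should return `True`; if nothing is listening on the port, it returns `False` instead of raising.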

### Hugging Face Spaces

- Deploy using the Docker SDK
- Use the repository root as the build context
- Verify the `/health` endpoint
- Ensure logs show a clean startup

---

## Key Design Decisions

- Stage-based decomposition → improves interpretability and grading
- Small synthetic dataset → ensures determinism and fast validation
- Structured actions → enables consistent evaluation
- Shaped rewards → provides meaningful learning signal
- Root-level Dockerfile → simplifies deployment pipeline

---

## Evaluation Alignment

This environment directly satisfies OpenEnv requirements:

- real-world task simulation
- multi-step agent interaction
- deterministic graders
- meaningful reward shaping
- reproducible baseline
- Docker + HF Spaces deployability

---

## Summary

Smart Calendar Resolver is a compact, deterministic environment that captures a realistic scheduling workflow while remaining easy to validate, deploy, and evaluate.

It is designed to test:

- multi-step reasoning
- constraint handling
- structured decision making
- trajectory-based agent performance

The environment has also been pushed to Hugging Face Spaces.