---

title: Smart Calendar Resolver
emoji: 📅
colorFrom: blue
colorTo: green
sdk: docker
app_file: app.py
pinned: false
---



# Smart Calendar Resolver: OpenEnv Environment

A deterministic, multi-step OpenEnv environment for evaluating agent reasoning in real-world scheduling workflows.

This environment models a constrained meeting scheduling problem where an agent must interpret user intent, reason over structured availability, and produce a valid, verified outcome through a staged interaction loop.

---

## Problem Definition

Given:
- a natural language meeting request
- multiple participants with availability windows
- constraints (duration, deadline, priority, timezone)

The agent must:
1. Interpret the request
2. Aggregate and reason over availability
3. Select a valid time slot
4. Confirm and finalize the schedule

This reflects real-world calendar coordination tasks commonly handled by assistants and productivity tools.
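
The availability reasoning in steps 2 and 3 can be sketched as an interval intersection. This is a minimal illustration, not the environment's own solver: the function name `earliest_common_slot`, the dictionary layout, and the use of naive datetimes in a single timezone are all assumptions made for the example.

```python
from datetime import datetime, timedelta

def earliest_common_slot(availability, duration, deadline):
    """Return the earliest (start, end) slot of `duration` that every
    participant can attend and that ends by `deadline`, or None."""
    # Start from one unbounded window and intersect it with each participant.
    common = [(datetime.min, datetime.max)]
    for windows in availability.values():
        common = [
            (max(a_start, b_start), min(a_end, b_end))
            for a_start, a_end in common
            for b_start, b_end in windows
            if max(a_start, b_start) < min(a_end, b_end)
        ]
    for start, end in sorted(common):
        end = min(end, deadline)  # the meeting must finish before the deadline
        if end - start >= duration:
            return start, start + duration
    return None

# Hypothetical availability data (not taken from the real dataset).
availability = {
    "alice": [(datetime(2024, 1, 8, 9), datetime(2024, 1, 8, 12))],
    "bob": [
        (datetime(2024, 1, 8, 10), datetime(2024, 1, 8, 11)),
        (datetime(2024, 1, 8, 14), datetime(2024, 1, 8, 16)),
    ],
}
slot = earliest_common_slot(
    availability, timedelta(minutes=30), datetime(2024, 1, 8, 17)
)
print(slot)
```

With this data the only shared window is 10:00 to 11:00, so a 30-minute request resolves to 10:00 to 10:30, while a two-hour request has no valid slot.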

---

## Environment Design

### Core Loop

The environment follows the standard OpenEnv interface:

- `reset()` → returns the initial observation
- `step(action)` → returns `(observation, reward, done, info)`
- `state` → internal environment state
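
A rollout against this interface might look like the sketch below. `ToyEnv` is a hypothetical stand-in written only to show the call pattern; its observations, reward value, and four-step episode length are assumptions, not the real environment's behavior.

```python
class ToyEnv:
    """Hypothetical stand-in exposing the reset()/step()/state shape."""

    def __init__(self):
        self._steps = 0

    def reset(self):
        self._steps = 0
        return {"request": "Schedule a 30-minute sync", "step": 0}

    def step(self, action):
        self._steps += 1
        done = self._steps >= 4  # assumption: one step per stage
        observation = {"last_action": action, "step": self._steps}
        return observation, 0.25, done, {}

    @property
    def state(self):
        return {"step_count": self._steps}

def run_episode(env, policy):
    """Drive one episode and collect the per-step rewards."""
    observation = env.reset()
    rewards, done = [], False
    while not done:
        observation, reward, done, info = env.step(policy(observation))
        rewards.append(reward)
    return rewards

env = ToyEnv()
rewards = run_episode(env, lambda obs: {"stage": "noop"})
print(rewards, env.state)
```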

### Stage-Based Interaction

The task is decomposed into explicit stages:

1. `understand_request`
2. `evaluate_availability`
3. `propose_slot`
4. `confirm_schedule`

Agents are expected to follow this progression. Out-of-order or invalid transitions are penalized.
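
One way to enforce this ordering is to compare each submitted stage against the next expected one. The helper below is an illustrative sketch; `check_transition` and its feedback strings are hypothetical and are not taken from the environment's source.

```python
STAGES = [
    "understand_request",
    "evaluate_availability",
    "propose_slot",
    "confirm_schedule",
]

def check_transition(completed, stage):
    """Return (ok, feedback) for submitting `stage` after the
    already-completed stages in `completed`."""
    if stage not in STAGES:
        return False, f"unknown stage: {stage}"
    if len(completed) >= len(STAGES):
        return False, "episode already finished"
    expected = STAGES[len(completed)]
    if stage != expected:
        return False, f"expected {expected}, got {stage}"
    return True, "ok"

print(check_transition([], "understand_request"))
print(check_transition(["understand_request"], "propose_slot"))
```

A failed check would be the natural place to apply the out-of-order penalty.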

---

## Dataset

A small, fully deterministic, in-memory dataset is used.

Each scenario includes:
- request text
- participants
- availability windows
- constraints (deadline, duration, priority)
- ground-truth valid slot

Difficulty levels:
- **Easy**: single valid slot, minimal reasoning
- **Medium**: conflicting availability with constraint filtering
- **Hard**: multiple candidates requiring prioritization and constraint trade-offs

Design choices:
- A small dataset ensures reproducibility
- The absence of randomness ensures stable evaluation and debugging

---

## State Representation

The environment maintains:

- `episode_id`
- `step_count`
- `current_scenario`
- `selected_slot`
- `action_history`
- `solved` flag

This enables:
- trajectory-based evaluation
- reward shaping across steps
- deterministic replay
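
The fields above map naturally onto a small record type. The sketch below uses a standard-library dataclass; the field types, the `record` method, and the `replay` helper are assumptions added to illustrate deterministic replay, not the environment's actual classes.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EpisodeState:
    episode_id: str
    current_scenario: dict
    step_count: int = 0
    selected_slot: Optional[str] = None
    action_history: list = field(default_factory=list)
    solved: bool = False

    def record(self, action: dict) -> None:
        """Append an action to the trajectory and track the chosen slot."""
        self.step_count += 1
        self.action_history.append(action)
        if action.get("proposed_time_slot"):
            self.selected_slot = action["proposed_time_slot"]

def replay(episode_id, scenario, actions):
    """Rebuild a state from scratch by re-applying the same actions."""
    state = EpisodeState(episode_id=episode_id, current_scenario=scenario)
    for action in actions:
        state.record(action)
    return state

actions = [
    {"stage": "understand_request"},
    {"stage": "propose_slot", "proposed_time_slot": "2024-01-08T10:00"},
]
first = replay("ep-1", {"name": "easy"}, actions)
second = replay("ep-1", {"name": "easy"}, actions)
print(first == second, first.selected_slot)
```

Because nothing in the state depends on time or randomness, replaying the same actions yields an equal state.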

---

## Observation Space

Each observation contains:

- request (natural language)
- structured availability
- constraints
- current step index
- feedback signal
- action history
- next expected stage
- reward
- done flag

Observations are designed to balance:
- realism (semi-structured inputs)
- controllability (no external dependencies)

---

## Action Space

Actions are typed via Pydantic models.

Fields include:
- `stage`
- `proposed_time_slot`
- `confirm_schedule`
- `final_note`

Actions are structured but flexible enough to simulate agent reasoning.
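
To make the shape of an action concrete, here is a dependency-free sketch. It uses a standard-library dataclass as a stand-in for the Pydantic model; the class name `CalendarAction`, the field types, the defaults, and the validation rules are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional

VALID_STAGES = (
    "understand_request",
    "evaluate_availability",
    "propose_slot",
    "confirm_schedule",
)

@dataclass(frozen=True)
class CalendarAction:
    stage: str
    proposed_time_slot: Optional[str] = None  # assumed to be a string label
    confirm_schedule: bool = False
    final_note: str = ""

    def __post_init__(self):
        # Reject malformed actions early, as a typed model would.
        if self.stage not in VALID_STAGES:
            raise ValueError(f"unknown stage: {self.stage}")
        if self.stage == "propose_slot" and not self.proposed_time_slot:
            raise ValueError("propose_slot requires proposed_time_slot")

action = CalendarAction(stage="propose_slot", proposed_time_slot="2024-01-08T10:00")
print(action)
```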

---

## Reward Function

A shaped reward encourages incremental progress. Positive reward is given for:

- correct interpretation of the request
- correct use of availability constraints
- valid slot selection
- correct final confirmation
- a concise and relevant final note

Penalties are applied for:
- invalid stage transitions
- incorrect slot selection
- repeated or redundant actions

Properties:
- dense (not sparse)
- deterministic
- aligned with task completion
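
A dense, deterministic reward of this kind can be sketched as a pure function of the step outcome. The weights, the 20-word note limit, and the function name `shaped_reward` below are invented for illustration and do not reflect the environment's actual values.

```python
def shaped_reward(stage_ok, slot_correct=None, repeated=False, note=""):
    """Combine per-step signals into one reward (hypothetical weights)."""
    reward = 0.2 if stage_ok else -0.2       # valid vs. invalid transition
    if slot_correct is True:
        reward += 0.4                        # valid slot selected
    elif slot_correct is False:
        reward -= 0.3                        # incorrect slot selected
    if repeated:
        reward -= 0.1                        # redundant action
    if note and len(note.split()) <= 20:
        reward += 0.1                        # concise final note
    return round(reward, 2)

print(shaped_reward(True, slot_correct=True))
print(shaped_reward(False, repeated=True))
```

Because the function has no hidden state or randomness, identical inputs always produce identical rewards.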

---

## Determinism & Reproducibility

- No randomness in dataset or transitions
- Fixed scenario ordering
- Identical rewards for identical actions
- Deterministic baseline policy

This ensures:
- reproducible scoring
- stable evaluation across runs
- compatibility with automated grading

---

## Baseline (Inference)

A deterministic baseline is provided.

Characteristics:
- follows correct stage sequence
- selects known valid slot
- produces consistent output
- uses the injected OpenAI-compatible proxy when `API_BASE_URL`, `API_KEY`, and `MODEL_NAME` are present
- falls back to the deterministic local baseline when those submission env vars are absent
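
The fallback rule in the last two bullets could be expressed as a check on the three environment variables. This is a sketch only: `select_policy` and its return labels are hypothetical, and the real script may decide differently.

```python
import os

REQUIRED_VARS = ("API_BASE_URL", "API_KEY", "MODEL_NAME")

def select_policy(environ=None):
    """Return "proxy" when all three variables are set and non-empty,
    otherwise "local_baseline"."""
    environ = os.environ if environ is None else environ
    if all(environ.get(name) for name in REQUIRED_VARS):
        return "proxy"
    return "local_baseline"

print(select_policy({}))
print(select_policy({
    "API_BASE_URL": "http://localhost:9000/v1",  # placeholder value
    "API_KEY": "dummy",
    "MODEL_NAME": "example-model",
}))
```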

### Required Output Format

The script emits strictly formatted logs:

    [START] task=<task_name> env=<env_name> model=<model_name>
    [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
    [END] success=<true|false> steps=<n> rewards=<r1,r2,...,rn>


This format is required for evaluation pipelines.
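
Small formatting helpers make it harder to drift from this layout. The functions below are a sketch with hypothetical names; printing each entry of the `rewards` list with two decimals is an assumption, since the template only fixes that precision for the per-step `reward` field.

```python
def format_start(task, env, model):
    return f"[START] task={task} env={env} model={model}"

def format_step(step, action, reward, done, error=None):
    # Booleans are lowercased and a missing error is printed as "null".
    return (
        f"[STEP] step={step} action={action} reward={reward:.2f} "
        f"done={str(done).lower()} "
        f"error={error if error is not None else 'null'}"
    )

def format_end(success, steps, rewards):
    joined = ",".join(f"{r:.2f}" for r in rewards)
    return f"[END] success={str(success).lower()} steps={steps} rewards={joined}"

print(format_start("easy", "smart_calendar", "baseline"))
print(format_step(1, "understand_request", 0.2, False))
print(format_end(True, 2, [0.2, 0.4]))
```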

---

## Validation & Testing

The environment has been verified with:

- `uv run openenv validate .`
- deterministic baseline execution
- pytest suite covering:
  - environment flow
  - state transitions
  - reward correctness
  - inference execution
  - API health

All tests pass when run from the repository root.

---

## Deployment

### Docker

```bash
docker build -t smart-calendar-env .
docker run -p 8000:8000 smart-calendar-env
```

Health check: `curl http://localhost:8000/health`

Expected response: `{"status":"healthy"}`

### Hugging Face Spaces

- Deploy using the Docker SDK
- Use the repository root as the build context
- Verify the `/health` endpoint
- Ensure the logs show a clean startup

---

## Key Design Decisions

- Stage-based decomposition → improves interpretability and grading
- Small synthetic dataset → ensures determinism and fast validation
- Structured actions → enable consistent evaluation
- Shaped rewards → provide a meaningful learning signal
- Root-level Dockerfile → simplifies the deployment pipeline


---

## Evaluation Alignment

This environment directly satisfies OpenEnv requirements:

- real-world task simulation
- multi-step agent interaction
- deterministic graders
- meaningful reward shaping
- reproducible baseline
- Docker + HF Spaces deployability


---

## Summary

Smart Calendar Resolver is a compact, deterministic environment that captures a realistic scheduling workflow while remaining easy to validate, deploy, and evaluate.

It is designed to test:

- multi-step reasoning
- constraint handling
- structured decision making
- trajectory-based agent performance

The environment has also been pushed to Hugging Face Spaces.