---
title: Smart Calendar Resolver
emoji: 📅
colorFrom: blue
colorTo: green
sdk: docker
app_file: app.py
pinned: false
---
# Smart Calendar Resolver: OpenEnv Environment
A deterministic, multi-step OpenEnv environment for evaluating agent reasoning in real-world scheduling workflows.
This environment models a constrained meeting scheduling problem where an agent must interpret user intent, reason over structured availability, and produce a valid, verified outcome through a staged interaction loop.
---
## Problem Definition
Given:
- a natural language meeting request
- multiple participants with availability windows
- constraints (duration, deadline, priority, timezone)
The agent must:
1. Interpret the request
2. Aggregate and reason over availability
3. Select a valid time slot
4. Confirm and finalize the schedule
This reflects real-world calendar coordination tasks commonly handled by assistants and productivity tools.
---
## Environment Design
### Core Loop
The environment follows the standard OpenEnv interface:
- `reset()` → returns the initial observation
- `step(action)` → returns `(observation, reward, done, info)`
- `state` → internal environment state
### Stage-Based Interaction
The task is decomposed into explicit stages:
1. `understand_request`
2. `evaluate_availability`
3. `propose_slot`
4. `confirm_schedule`
Agents are expected to follow this progression. Out-of-order or invalid transitions are penalized.
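The stage progression above can be sketched as a small ordering check. This is a minimal illustration, not the environment's actual code: the helper name `check_transition` and the idea of tracking accepted stages in a plain list are assumptions made for the example.

```python
# Stage names are taken from the list above; everything else is hypothetical.
STAGES = (
    "understand_request",
    "evaluate_availability",
    "propose_slot",
    "confirm_schedule",
)


def check_transition(history, stage):
    """Return (ok, expected) for a proposed stage, given the accepted stages so far."""
    # The next expected stage depends only on how many stages were already accepted.
    expected = STAGES[len(history)] if len(history) < len(STAGES) else None
    return stage == expected, expected


history = []
for stage in ("understand_request", "propose_slot", "evaluate_availability"):
    ok, expected = check_transition(history, stage)
    if ok:
        history.append(stage)
    print(stage, "accepted" if ok else f"rejected (expected {expected})")
```

In this sketch a rejected action leaves the history unchanged, so the agent can recover by submitting the expected stage next. Whether the real environment behaves the same way is not stated here.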
---
## Dataset
A small, fully deterministic, in-memory dataset is used.
Each scenario includes:
- request text
- participants
- availability windows
- constraints (deadline, duration, priority)
- ground-truth valid slot
Difficulty levels:
- **Easy**: single valid slot, minimal reasoning
- **Medium**: conflicting availability with constraint filtering
- **Hard**: multiple candidates requiring prioritization and constraint trade-offs
Design choice:
- Small dataset ensures reproducibility
- No randomness ensures stable evaluation and debugging
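To make the scenario structure concrete, here is a hedged sketch of one in-memory scenario and a deterministic slot search. The `Scenario` class, its field names, the minutes-from-midnight encoding, and the "earliest common slot" rule are all assumptions for illustration. The actual dataset schema and ground-truth logic may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Hypothetical scenario record; times are minutes from midnight."""
    request: str
    availability: dict  # participant name -> list of (start, end) windows
    duration: int       # required meeting length in minutes
    deadline: int       # latest allowed end time


def common_windows(a, b):
    """Intersect two lists of (start, end) windows."""
    out = []
    for s1, e1 in a:
        for s2, e2 in b:
            start, end = max(s1, s2), min(e1, e2)
            if start < end:
                out.append((start, end))
    return sorted(out)


def earliest_slot(scenario):
    """Return the earliest (start, end) slot that fits every participant, or None."""
    windows = None
    for person_windows in scenario.availability.values():
        windows = person_windows if windows is None else common_windows(windows, person_windows)
    for start, end in sorted(windows or []):
        end = min(end, scenario.deadline)  # clip the window to the deadline
        if end - start >= scenario.duration:
            return (start, start + scenario.duration)
    return None


scenario = Scenario(
    request="30 minute sync with Ana and Ben before noon",
    availability={
        "ana": [(9 * 60, 10 * 60), (11 * 60, 12 * 60)],
        "ben": [(9 * 60 + 30, 11 * 60 + 30)],
    },
    duration=30,
    deadline=12 * 60,
)
print(earliest_slot(scenario))  # (570, 600), i.e. 09:30 to 10:00
```

Because the search uses no randomness and iterates over sorted windows, the same scenario always yields the same slot, which is the property the design choice above relies on.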
---
## State Representation
The environment maintains:
- `episode_id`
- `step_count`
- `current_scenario`
- `selected_slot`
- `action_history`
- `solved` flag
This enables:
- trajectory-based evaluation
- reward shaping across steps
- deterministic replay
---
## Observation Space
Each observation contains:
- request (natural language)
- structured availability
- constraints
- current step index
- feedback signal
- action history
- next expected stage
- reward
- done flag
Observations are designed to balance:
- realism (semi-structured inputs)
- controllability (no external dependencies)
---
## Action Space
Actions are typed via Pydantic models. Fields include:
- `stage`
- `proposed_time_slot`
- `confirm_schedule`
- `final_note`
Actions are structured but flexible enough to simulate agent reasoning.
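The shape of an action can be sketched as follows. To keep the example dependency-free it uses a standard-library dataclass as a stand-in for the Pydantic model. The class name `CalendarAction`, the field types, the defaults, and the slot string format are assumptions; only the four field names come from the list above.

```python
from dataclasses import dataclass
from typing import Optional

VALID_STAGES = (
    "understand_request",
    "evaluate_availability",
    "propose_slot",
    "confirm_schedule",
)


@dataclass(frozen=True)
class CalendarAction:
    """Dataclass stand-in for the Pydantic action model (types are assumed)."""
    stage: str
    proposed_time_slot: Optional[str] = None
    confirm_schedule: bool = False
    final_note: Optional[str] = None

    def __post_init__(self):
        # Reject unknown stages up front, as a typed model would.
        if self.stage not in VALID_STAGES:
            raise ValueError(f"unknown stage: {self.stage!r}")


# The slot string format below is purely illustrative.
action = CalendarAction(stage="propose_slot", proposed_time_slot="2024-05-06T09:30/10:00")
print(action)
```

With Pydantic, the same stage check could plausibly be expressed as a `Literal` type annotation on `stage` instead of a manual check.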
---
## Reward Function
Shaped reward encourages incremental progress:
- + correct interpretation of request
- + correct use of availability constraints
- + valid slot selection
- + correct final confirmation
- + concise and relevant final note
Penalties:
- invalid stage transitions
- incorrect slot selection
- repeated or redundant actions
Properties:
- dense (not sparse)
- deterministic
- aligned with task completion
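A dense, deterministic reward of this kind might look like the sketch below. The numeric values, the function name `step_reward`, and the order in which penalties are checked are hypothetical. The README does not specify the actual magnitudes.

```python
STAGES = (
    "understand_request",
    "evaluate_availability",
    "propose_slot",
    "confirm_schedule",
)

# Hypothetical magnitudes, chosen only for illustration.
STAGE_REWARD = 0.2
INVALID_TRANSITION_PENALTY = -0.1
WRONG_SLOT_PENALTY = -0.2
REPEAT_PENALTY = -0.05


def step_reward(history, stage, slot=None, truth=None):
    """Score one action given the stages already accepted in this episode."""
    if stage in history:
        return REPEAT_PENALTY  # repeated or redundant action
    expected = STAGES[len(history)] if len(history) < len(STAGES) else None
    if stage != expected:
        return INVALID_TRANSITION_PENALTY  # out-of-order stage
    if stage == "propose_slot" and slot != truth:
        return WRONG_SLOT_PENALTY  # slot does not match the ground truth
    return STAGE_REWARD  # correct incremental progress


print(step_reward([], "understand_request"))
print(step_reward(["understand_request"], "propose_slot"))
```

The function is pure: it reads only its arguments, so identical histories and actions always produce identical rewards.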
---
## Determinism & Reproducibility
- No randomness in dataset or transitions
- Fixed scenario ordering
- Identical rewards for identical actions
- Deterministic baseline policy
This ensures:
- reproducible scoring
- stable evaluation across runs
- compatibility with automated grading
---
## Baseline (Inference)
A deterministic baseline is provided.
Characteristics:
- follows correct stage sequence
- selects known valid slot
- produces consistent output
- uses the injected OpenAI-compatible proxy when `API_BASE_URL`, `API_KEY`, and `MODEL_NAME` are present
- falls back to the deterministic local baseline when those submission env vars are absent
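The proxy-or-fallback choice described above can be sketched as a small selection function. The function name `select_backend` and its return labels are hypothetical, and treating an empty value as "absent" is an assumption.

```python
import os

# Variable names are taken from the list above.
REQUIRED_VARS = ("API_BASE_URL", "API_KEY", "MODEL_NAME")


def select_backend(env=None):
    """Return "proxy" when all submission variables are set, else "local_baseline"."""
    env = os.environ if env is None else env
    # Assumption: an empty string counts as absent.
    if all(env.get(name) for name in REQUIRED_VARS):
        return "proxy"
    return "local_baseline"


print(select_backend({}))
print(select_backend({"API_BASE_URL": "http://localhost:9999/v1", "API_KEY": "k", "MODEL_NAME": "m"}))
```

Passing the environment as an argument keeps the decision easy to test without mutating `os.environ`.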
### Required Output Format
The script emits strictly formatted logs:
- `[START] task=<task_name> env=<env_name> model=<model_name>`
- `[STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>`
- `[END] success=<true|false> steps=<n> rewards=<r1,r2,...,rn>`
This format is required for evaluation pipelines.
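As an illustration of the three log line shapes, the helpers below build each one from plain values. The function names are hypothetical, and formatting the `[END]` rewards to two decimals is an assumption carried over from the `reward=<0.00>` template in the `[STEP]` line.

```python
def format_start(task, env, model):
    return f"[START] task={task} env={env} model={model}"


def format_step(n, action, reward, done, error=None):
    error_text = "null" if error is None else error
    done_text = str(done).lower()  # Python True/False -> true/false
    return f"[STEP] step={n} action={action} reward={reward:.2f} done={done_text} error={error_text}"


def format_end(success, steps, rewards):
    joined = ",".join(f"{r:.2f}" for r in rewards)  # two decimals is assumed
    return f"[END] success={str(success).lower()} steps={steps} rewards={joined}"


print(format_start("easy", "smart_calendar", "baseline"))
print(format_step(1, "understand_request", 0.2, False))
print(format_end(True, 1, [0.2]))
```

Centralizing the formatting in small functions makes it harder for a stray `print` to break a pipeline that parses these lines.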
---
## Validation & Testing
The environment has been verified with:
- `uv run openenv validate .`
- deterministic baseline execution
- pytest suite covering:
  - environment flow
  - state transitions
  - reward correctness
  - inference execution
  - API health
All tests pass when run from the repository root.
---
## Deployment
### Docker
```bash
docker build -t smart-calendar-env .
docker run -p 8000:8000 smart-calendar-env
```
Health check: `curl http://localhost:8000/health`
Expected response: `{"status":"healthy"}`
### Hugging Face Spaces
- Deploy using the Docker SDK
- Use the repository root as the build context
- Verify the `/health` endpoint
- Ensure logs show a clean startup
---
## Key Design Decisions
- Stage-based decomposition → improves interpretability and grading
- Small synthetic dataset → ensures determinism and fast validation
- Structured actions → enables consistent evaluation
- Shaped rewards → provides meaningful learning signal
- Root-level Dockerfile → simplifies deployment pipeline
---
## Evaluation Alignment
This environment directly satisfies OpenEnv requirements:
- real-world task simulation
- multi-step agent interaction
- deterministic graders
- meaningful reward shaping
- reproducible baseline
- Docker + HF Spaces deployability
---
## Summary
Smart Calendar Resolver is a compact, deterministic environment that captures a realistic scheduling workflow while remaining easy to validate, deploy, and evaluate.
It is designed to test:
- multi-step reasoning
- constraint handling
- structured decision making
- trajectory-based agent performance
The environment has also been pushed to Hugging Face Spaces.