---
title: Smart Calendar Resolver
emoji: 📅
colorFrom: blue
colorTo: green
sdk: docker
app_file: app.py
pinned: false
---

Smart Calendar Resolver β€” OpenEnv Environment

A deterministic, multi-step OpenEnv environment for evaluating agent reasoning in real-world scheduling workflows.

This environment models a constrained meeting scheduling problem where an agent must interpret user intent, reason over structured availability, and produce a valid, verified outcome through a staged interaction loop.


Problem Definition

Given:

  • a natural language meeting request
  • multiple participants with availability windows
  • constraints (duration, deadline, priority, timezone)

The agent must:

  1. Interpret the request
  2. Aggregate and reason over availability
  3. Select a valid time slot
  4. Confirm and finalize the schedule

This reflects real-world calendar coordination tasks commonly handled by assistants and productivity tools.


Environment Design

Core Loop

The environment follows the standard OpenEnv interface:

  • reset() β†’ returns initial observation
  • step(action) β†’ returns (observation, reward, done, info)
  • state β†’ internal environment state
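The contract above can be sketched in a few lines. This is a minimal illustration of the reset/step loop, not the environment's actual implementation; the class name, observation fields, and reward values are assumptions.

```python
class SchedulerEnvSketch:
    """Illustrative reset/step contract (names and values are assumptions)."""

    def __init__(self):
        self.state = {"step_count": 0, "solved": False}

    def reset(self):
        # Start a fresh episode and return the initial observation.
        self.state = {"step_count": 0, "solved": False}
        return {"stage": "understand_request", "request": "Schedule a 30-minute sync"}

    def step(self, action):
        # Return (observation, reward, done, info) per the OpenEnv interface.
        self.state["step_count"] += 1
        done = action.get("stage") == "confirm_schedule"
        reward = 1.0 if done else 0.25
        obs = {"stage": action.get("stage")}
        return obs, reward, done, {"step_count": self.state["step_count"]}


env = SchedulerEnvSketch()
obs = env.reset()
obs, reward, done, info = env.step({"stage": "understand_request"})
```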

Stage-Based Interaction

The task is decomposed into explicit stages:

  1. understand_request
  2. evaluate_availability
  3. propose_slot
  4. confirm_schedule

Agents are expected to follow this progression. Out-of-order or invalid transitions are penalized.
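Enforcing that progression amounts to a simple index check. A sketch, using the stage names above (the helper name is illustrative):

```python
# The fixed stage progression, in order.
STAGES = ["understand_request", "evaluate_availability", "propose_slot", "confirm_schedule"]

def is_valid_transition(current_index, attempted_stage):
    """An action is valid only if it targets the next stage in the sequence."""
    return current_index < len(STAGES) and attempted_stage == STAGES[current_index]
```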


Dataset

A small, fully deterministic, in-memory dataset is used.

Each scenario includes:

  • request text
  • participants
  • availability windows
  • constraints (deadline, duration, priority)
  • ground-truth valid slot

Difficulty levels:

  • Easy: single valid slot, minimal reasoning
  • Medium: conflicting availability with constraint filtering
  • Hard: multiple candidates requiring prioritization and constraint trade-offs

Design choice:

  • Small dataset ensures reproducibility
  • No randomness ensures stable evaluation and debugging
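An illustrative scenario record with the fields listed above (the concrete values, key names, and time format here are assumptions, not the dataset's actual schema):

```python
# Hypothetical "easy" scenario: a single slot satisfies everyone.
scenario = {
    "request": "Find 45 minutes for a design review with Ana and Ben this week.",
    "participants": ["ana", "ben"],
    "availability": {
        "ana": [("2024-05-06T09:00", "2024-05-06T12:00")],
        "ben": [("2024-05-06T10:00", "2024-05-06T11:30")],
    },
    "constraints": {"duration_minutes": 45, "deadline": "2024-05-10", "priority": "high"},
    "valid_slot": ("2024-05-06T10:00", "2024-05-06T10:45"),
    "difficulty": "easy",
}
```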

State Representation

The environment maintains:

  • episode_id
  • step_count
  • current_scenario
  • selected_slot
  • action_history
  • solved flag

This enables:

  • trajectory-based evaluation
  • reward shaping across steps
  • deterministic replay
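The state fields above map naturally onto a small container type. A dependency-free sketch (field defaults are assumptions):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EnvState:
    """Illustrative environment state; mirrors the fields listed above."""
    episode_id: str
    step_count: int = 0
    current_scenario: Optional[dict] = None
    selected_slot: Optional[tuple] = None
    action_history: list = field(default_factory=list)
    solved: bool = False
```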

Observation Space

Each observation contains:

  • request (natural language)
  • structured availability
  • constraints
  • current step index
  • feedback signal
  • action history
  • next expected stage
  • reward
  • done flag

Observations are designed to balance:

  • realism (semi-structured inputs)
  • controllability (no external dependencies)
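A sketch of how such an observation could be assembled from the state (the builder function and the exact key names are assumptions):

```python
def make_observation(state, reward, done, feedback=""):
    """Illustrative observation builder covering the fields listed above."""
    scenario = state["current_scenario"]
    return {
        "request": scenario["request"],                  # natural language
        "availability": scenario["availability"],        # structured
        "constraints": scenario["constraints"],
        "step_index": state["step_count"],
        "feedback": feedback,
        "action_history": list(state["action_history"]),
        "next_expected_stage": state["next_expected_stage"],
        "reward": reward,
        "done": done,
    }
```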

Action Space

Actions are typed via Pydantic models. Fields include:

  • stage
  • proposed_time_slot
  • confirm_schedule
  • final_note

Actions are structured but flexible enough to simulate agent reasoning.
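The environment types its actions with Pydantic; to keep this sketch dependency-free, a plain dataclass stands in here, using the same field names listed above (defaults are assumptions):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ScheduleAction:
    """Dataclass stand-in for the Pydantic-typed action."""
    stage: str
    proposed_time_slot: Optional[str] = None
    confirm_schedule: bool = False
    final_note: Optional[str] = None
```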


Reward Function

Shaped reward encourages incremental progress:

  • correct interpretation of request
  • correct use of availability constraints
  • valid slot selection
  • correct final confirmation
  • concise and relevant final note

Penalties:

  • invalid stage transitions
  • incorrect slot selection
  • repeated or redundant actions

Properties:

  • dense (not sparse)
  • deterministic
  • aligned with task completion
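A toy version of such a shaped reward, combining a progress bonus with the penalties above (all numeric values and the function signature are assumptions, not the environment's actual reward):

```python
def shaped_reward(action_stage, expected_stage, slot_correct, history):
    """Illustrative dense, deterministic reward for a single step."""
    reward = 0.0
    if action_stage == expected_stage:
        reward += 0.25   # incremental progress through the stages
    else:
        reward -= 0.25   # invalid stage transition
    if action_stage == "propose_slot":
        reward += 0.5 if slot_correct else -0.5   # slot correctness
    if action_stage in history:
        reward -= 0.1    # repeated or redundant action
    return reward
```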

Determinism & Reproducibility

  • No randomness in dataset or transitions
  • Fixed scenario ordering
  • Identical rewards for identical actions
  • Deterministic baseline policy

This ensures:

  • reproducible scoring
  • stable evaluation across runs
  • compatibility with automated grading

Baseline (Inference)

A deterministic baseline is provided.

Characteristics:

  • follows correct stage sequence
  • selects known valid slot
  • produces consistent output
  • uses the injected OpenAI-compatible proxy when API_BASE_URL, API_KEY, and MODEL_NAME are present
  • falls back to the deterministic local baseline when those submission env vars are absent
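The proxy-vs-local selection above reduces to an environment-variable check. A sketch (the function name and return labels are illustrative; the three variable names come from the list above):

```python
import os

def select_policy():
    """Use the proxy-backed model only when all submission env vars are set;
    otherwise fall back to the deterministic local baseline."""
    required = ("API_BASE_URL", "API_KEY", "MODEL_NAME")
    if all(os.environ.get(var) for var in required):
        return "proxy"
    return "local_baseline"
```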

Required Output Format

The script emits strictly formatted logs:

[START] task= env= model=
[STEP] step= action= reward=<0.00> done=<true|false> error=<msg|null>
[END] success=<true|false> steps= rewards=<r1,r2,...,rn>

This format is required for evaluation pipelines.
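A per-step log line in that shape can be produced with a small formatter. A sketch (the helper name is an assumption; only the `[STEP]` line is shown):

```python
def format_step_log(step, action, reward, done, error=None):
    """Emit one [STEP] line with a 2-decimal reward, matching the spec above."""
    return (
        f"[STEP] step={step} action={action} reward={reward:.2f} "
        f"done={'true' if done else 'false'} error={error if error else 'null'}"
    )
```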


Validation & Testing

The environment has been verified with:

  • uv run openenv validate .
  • deterministic baseline execution
  • pytest suite covering:
    • environment flow
    • state transitions
    • reward correctness
    • inference execution
    • API health

All tests pass from repository root.


Deployment

Docker

docker build -t smart-calendar-env .
docker run -p 8000:8000 smart-calendar-env

Health check:

curl http://localhost:8000/health

Expected:

{"status":"healthy"}

Hugging Face Spaces

  • Deploy using the Docker SDK
  • Use the repository root as the build context
  • Verify the /health endpoint
  • Ensure logs show clean startup

Key Design Decisions

  • Stage-based decomposition → improves interpretability and grading
  • Small synthetic dataset → ensures determinism and fast validation
  • Structured actions → enables consistent evaluation
  • Shaped rewards → provide a meaningful learning signal
  • Root-level Dockerfile → simplifies the deployment pipeline

Evaluation Alignment

This environment directly satisfies OpenEnv requirements:

  • real-world task simulation
  • multi-step agent interaction
  • deterministic graders
  • meaningful reward shaping
  • reproducible baseline
  • Docker + HF Spaces deployability

Summary

Smart Calendar Resolver is a compact, deterministic environment that captures a realistic scheduling workflow while remaining easy to validate, deploy, and evaluate.

It is designed to test:

  • multi-step reasoning
  • constraint handling
  • structured decision making
  • trajectory-based agent performance

The environment is also deployed to Hugging Face Spaces.