---
title: Smart Calendar Resolver
emoji: 📅
colorFrom: blue
colorTo: green
sdk: docker
app_file: app.py
pinned: false
---

Smart Calendar Resolver β€” OpenEnv Environment

A deterministic, multi-step OpenEnv environment for evaluating agent reasoning in real-world scheduling workflows.

This environment models a constrained meeting scheduling problem where an agent must interpret user intent, reason over structured availability, and produce a valid, verified outcome through a staged interaction loop.


Problem Definition

Given:

  • a natural language meeting request
  • multiple participants with availability windows
  • constraints (duration, deadline, priority, timezone)

The agent must:

  1. Interpret the request
  2. Aggregate and reason over availability
  3. Select a valid time slot
  4. Confirm and finalize the schedule

This reflects real-world calendar coordination tasks commonly handled by assistants and productivity tools.


Environment Design

Core Loop

The environment follows the standard OpenEnv interface:

  • reset() β†’ returns initial observation
  • step(action) β†’ returns (observation, reward, done, info)
  • state β†’ internal environment state
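The contract above can be sketched in a few lines. This is a minimal illustration of the reset/step loop, not the environment's actual implementation; the class name, observation fields, and reward values are assumptions.

```python
class SchedulerEnvSketch:
    """Illustrative reset/step contract (names and values are assumptions)."""

    def __init__(self):
        self.state = {"step_count": 0, "solved": False}

    def reset(self):
        # Start a fresh episode and return the initial observation.
        self.state = {"step_count": 0, "solved": False}
        return {"stage": "understand_request", "request": "Schedule a 30-minute sync"}

    def step(self, action):
        # Return (observation, reward, done, info) per the OpenEnv interface.
        self.state["step_count"] += 1
        done = action.get("stage") == "confirm_schedule"
        reward = 1.0 if done else 0.25
        obs = {"stage": action.get("stage")}
        return obs, reward, done, {"step_count": self.state["step_count"]}


env = SchedulerEnvSketch()
obs = env.reset()
obs, reward, done, info = env.step({"stage": "understand_request"})
```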

Stage-Based Interaction

The task is decomposed into explicit stages:

  1. understand_request
  2. evaluate_availability
  3. propose_slot
  4. confirm_schedule

Agents are expected to follow this progression. Out-of-order or invalid transitions are penalized.
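Enforcing that progression amounts to a simple index check. A sketch, using the stage names above (the helper name is illustrative):

```python
# The fixed stage progression, in order.
STAGES = ["understand_request", "evaluate_availability", "propose_slot", "confirm_schedule"]

def is_valid_transition(current_index, attempted_stage):
    """An action is valid only if it targets the next stage in the sequence."""
    return current_index < len(STAGES) and attempted_stage == STAGES[current_index]
```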


Dataset

A small, fully deterministic, in-memory dataset is used.

Each scenario includes:

  • request text
  • participants
  • availability windows
  • constraints (deadline, duration, priority)
  • ground-truth valid slot

Difficulty levels:

  • Easy: single valid slot, minimal reasoning
  • Medium: conflicting availability with constraint filtering
  • Hard: multiple candidates requiring prioritization and constraint trade-offs

Design choice:

  • Small dataset ensures reproducibility
  • No randomness ensures stable evaluation and debugging
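An illustrative scenario record with the fields listed above (the concrete values, key names, and time format here are assumptions, not the dataset's actual schema):

```python
# Hypothetical "easy" scenario: a single slot satisfies everyone.
scenario = {
    "request": "Find 45 minutes for a design review with Ana and Ben this week.",
    "participants": ["ana", "ben"],
    "availability": {
        "ana": [("2024-05-06T09:00", "2024-05-06T12:00")],
        "ben": [("2024-05-06T10:00", "2024-05-06T11:30")],
    },
    "constraints": {"duration_minutes": 45, "deadline": "2024-05-10", "priority": "high"},
    "valid_slot": ("2024-05-06T10:00", "2024-05-06T10:45"),
    "difficulty": "easy",
}
```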

State Representation

The environment maintains:

  • episode_id
  • step_count
  • current_scenario
  • selected_slot
  • action_history
  • solved flag

This enables:

  • trajectory-based evaluation
  • reward shaping across steps
  • deterministic replay
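The state fields above map naturally onto a small container type. A dependency-free sketch (field defaults are assumptions):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EnvState:
    """Illustrative environment state; mirrors the fields listed above."""
    episode_id: str
    step_count: int = 0
    current_scenario: Optional[dict] = None
    selected_slot: Optional[tuple] = None
    action_history: list = field(default_factory=list)
    solved: bool = False
```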

Observation Space

Each observation contains:

  • request (natural language)
  • structured availability
  • constraints
  • current step index
  • feedback signal
  • action history
  • next expected stage
  • reward
  • done flag

Observations are designed to balance:

  • realism (semi-structured inputs)
  • controllability (no external dependencies)
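A sketch of how such an observation could be assembled from the state (the builder function and the exact key names are assumptions):

```python
def make_observation(state, reward, done, feedback=""):
    """Illustrative observation builder covering the fields listed above."""
    scenario = state["current_scenario"]
    return {
        "request": scenario["request"],                  # natural language
        "availability": scenario["availability"],        # structured
        "constraints": scenario["constraints"],
        "step_index": state["step_count"],
        "feedback": feedback,
        "action_history": list(state["action_history"]),
        "next_expected_stage": state["next_expected_stage"],
        "reward": reward,
        "done": done,
    }
```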

Action Space

Actions are typed via Pydantic models. Fields include:

  • stage
  • proposed_time_slot
  • confirm_schedule
  • final_note

Actions are structured but flexible enough to simulate agent reasoning.
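The environment types its actions with Pydantic; to keep this sketch dependency-free, a plain dataclass stands in here, using the same field names listed above (defaults are assumptions):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ScheduleAction:
    """Dataclass stand-in for the Pydantic-typed action."""
    stage: str
    proposed_time_slot: Optional[str] = None
    confirm_schedule: bool = False
    final_note: Optional[str] = None
```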


Reward Function

Shaped reward encourages incremental progress:

  • correct interpretation of request
  • correct use of availability constraints
  • valid slot selection
  • correct final confirmation
  • concise and relevant final note

Penalties:

  • invalid stage transitions
  • incorrect slot selection
  • repeated or redundant actions

Properties:

  • dense (not sparse)
  • deterministic
  • aligned with task completion
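A toy version of such a shaped reward, combining a progress bonus with the penalties above (all numeric values and the function signature are assumptions, not the environment's actual reward):

```python
def shaped_reward(action_stage, expected_stage, slot_correct, history):
    """Illustrative dense, deterministic reward for a single step."""
    reward = 0.0
    if action_stage == expected_stage:
        reward += 0.25   # incremental progress through the stages
    else:
        reward -= 0.25   # invalid stage transition
    if action_stage == "propose_slot":
        reward += 0.5 if slot_correct else -0.5   # slot correctness
    if action_stage in history:
        reward -= 0.1    # repeated or redundant action
    return reward
```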

Determinism & Reproducibility

  • No randomness in dataset or transitions
  • Fixed scenario ordering
  • Identical rewards for identical actions
  • Deterministic baseline policy

This ensures:

  • reproducible scoring
  • stable evaluation across runs
  • compatibility with automated grading

Baseline (Inference)

A deterministic baseline is provided.

Characteristics:

  • follows correct stage sequence
  • selects known valid slot
  • produces consistent output
  • uses the injected OpenAI-compatible proxy when API_BASE_URL, API_KEY, and MODEL_NAME are present
  • falls back to the deterministic local baseline when those submission env vars are absent
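The proxy-vs-local selection above reduces to an environment-variable check. A sketch (the function name and return labels are illustrative; the three variable names come from the list above):

```python
import os

def select_policy():
    """Use the proxy-backed model only when all submission env vars are set;
    otherwise fall back to the deterministic local baseline."""
    required = ("API_BASE_URL", "API_KEY", "MODEL_NAME")
    if all(os.environ.get(var) for var in required):
        return "proxy"
    return "local_baseline"
```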

Required Output Format

The script emits strictly formatted logs:

[START] task= env= model=
[STEP] step= action= reward=<0.00> done=<true|false> error=<msg|null>
[END] success=<true|false> steps= rewards=<r1,r2,...,rn>

This format is required for evaluation pipelines.
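A per-step log line in that shape can be produced with a small formatter. A sketch (the helper name is an assumption; only the `[STEP]` line is shown):

```python
def format_step_log(step, action, reward, done, error=None):
    """Emit one [STEP] line with a 2-decimal reward, matching the spec above."""
    return (
        f"[STEP] step={step} action={action} reward={reward:.2f} "
        f"done={'true' if done else 'false'} error={error if error else 'null'}"
    )
```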


Validation & Testing

The environment has been verified with:

  • uv run openenv validate .
  • deterministic baseline execution
  • pytest suite covering:
    • environment flow
    • state transitions
    • reward correctness
    • inference execution
    • API health

All tests pass from repository root.


Deployment

Docker

docker build -t smart-calendar-env .
docker run -p 8000:8000 smart-calendar-env

Health check:

curl http://localhost:8000/health

Expected:

{"status":"healthy"}

Hugging Face Spaces

  • Deploy using the Docker SDK
  • Use the repository root as the build context
  • Verify the /health endpoint
  • Ensure logs show clean startup

Key Design Decisions

  • Stage-based decomposition → improves interpretability and grading
  • Small synthetic dataset → ensures determinism and fast validation
  • Structured actions → enables consistent evaluation
  • Shaped rewards → provide a meaningful learning signal
  • Root-level Dockerfile → simplifies the deployment pipeline

Evaluation Alignment

This environment directly satisfies OpenEnv requirements:

  • real-world task simulation
  • multi-step agent interaction
  • deterministic graders
  • meaningful reward shaping
  • reproducible baseline
  • Docker + HF Spaces deployability

Summary

Smart Calendar Resolver is a compact, deterministic environment that captures a realistic scheduling workflow while remaining easy to validate, deploy, and evaluate.

It is designed to test:

  • multi-step reasoning
  • constraint handling
  • structured decision making
  • trajectory-based agent performance

The environment is also deployed to Hugging Face Spaces.