---
title: Sieve
sdk: docker
pinned: false
---

# Sieve: Customer Support RL Environment

Sieve is a reinforcement learning environment that simulates a real-world customer support inbox. An AI agent interacts with it through a standard `reset()` / `step()` / `state()` HTTP API, receiving emails, taking actions, and earning rewards based on how well it handles each situation.

## How It Works


The agent calls `/reset` to start an episode, then loops: read the current email from the Observation, post an Action to `/step`, and receive a Reward and the next Observation, until `done=true`. Each step reward reflects the immediate quality of the action. A -0.005 step penalty discourages unnecessary actions. The final grader score from `/grader` is a holistic metric computed over the full episode.
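
That loop can be sketched as a minimal client. The task ID shown is hypothetical (query `/tasks` for the real IDs), and the transport is injectable so the control flow can be read without a running server:

```python
import json
from urllib import request


def run_episode(base_url: str, choose_action, http_post=None) -> float:
    """Drive one episode: reset, then step until done, summing rewards.

    choose_action: callable mapping an observation dict to an action dict.
    http_post: callable (url, payload_or_None) -> response dict;
               defaults to a plain urllib JSON POST.
    """
    if http_post is None:
        def http_post(url, payload):
            body = json.dumps(payload).encode() if payload is not None else None
            req = request.Request(
                url, data=body,
                headers={"Content-Type": "application/json"},
                method="POST",
            )
            with request.urlopen(req) as resp:
                return json.load(resp)

    # "email_classification" is a placeholder task ID for this sketch.
    obs = http_post(f"{base_url}/reset?task_id=email_classification", None)
    total, done = 0.0, False
    while not done:
        result = http_post(f"{base_url}/step", choose_action(obs))
        total += result["reward"]["value"]
        done = result["done"]
        obs = result["observation"]
    return total
```
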

## Project Structure

```
.
├── models.py          # Shared Pydantic models (Action, Observation, Reward, etc.)
├── inference.py       # Baseline agent script using OpenAI client
├── logger.py          # Structured [START]/[STEP]/[END] stdout logger
├── openenv.yaml       # OpenEnv environment metadata
├── pyproject.toml     # Project config and dependencies
├── Dockerfile         # Container definition
├── .env.example       # Example environment variables (copy to .env)
└── server/
    ├── app.py         # FastAPI application and API endpoints
    ├── environment.py # Core environment logic (step, reset, reward, grader)
    ├── data.py        # Email datasets for all three tasks
    └── config.py      # Action schema definition
```

## Tasks

### Task 1: Email Classification (Easy)

The agent receives one email at a time and must classify it using the `classify` action.

Available action: `classify` only

#### Step Rewards

- Correct category: +0.15
- Wrong category: -0.05
- Correct urgency: +0.05
- Wrong urgency: -0.02
- Wrong action type: -0.05
- Step penalty: -0.005

#### Final Grader Score

- Category accuracy: 70% weight
- Urgency accuracy: 30% weight
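
The two accuracies combine linearly into the final score; a quick sketch of that arithmetic (the helper name is ours, not from the codebase):

```python
def task1_grader_score(category_correct: int, urgency_correct: int, total: int) -> float:
    """Blend category accuracy (70% weight) and urgency accuracy (30% weight)."""
    if total == 0:
        return 0.0
    category_acc = category_correct / total
    urgency_acc = urgency_correct / total
    return 0.7 * category_acc + 0.3 * urgency_acc


# e.g. 9/10 correct categories and 8/10 correct urgencies -> 0.87
```
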

### Task 2: Response Drafting (Medium)

The agent reads a customer email and drafts a professional response using the `respond` action.

Available action: `respond` only

#### Step Rewards

- Response >= 50 characters: +0.05
- Response < 50 characters: -0.10
- Keyword coverage: up to +0.25 (scaled by matched / min_required)
- Negative/unprofessional tone (VADER neg > 0.4): -0.10
- Wrong action type: -0.05
- Step penalty: -0.005

#### Final Grader Score

- Keyword coverage weighted at 0.80
- Length bonus up to 0.20 (scaled by length / 200, requires length > 50)
- Averaged across all emails in the task
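
A per-email score under these weights might be computed like this. Only the 0.80/0.20 weights, the length/200 scaling, and the length > 50 gate come from the list above; the exact clamping behavior is our assumption:

```python
def task2_email_score(matched: int, min_required: int, length: int) -> float:
    """Per-email grade: keyword coverage (0.80) plus a capped length bonus (0.20)."""
    # Coverage scaled by matched / min_required, clamped to 1.0 (assumption).
    coverage = min(matched / min_required, 1.0) if min_required else 1.0
    # Length bonus only kicks in past 50 characters, scaled by length / 200.
    length_bonus = min(length / 200, 1.0) * 0.20 if length > 50 else 0.0
    return 0.80 * coverage + length_bonus
```

The task score would then be the mean of this value over all emails, as stated above.
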

### Task 3: Full Support Session (Hard)

The agent manages a queue of 15 mixed emails. It must choose which email to handle, classify it, and take the right action, all in the correct priority order.

Available actions: `respond`, `escalate`, `archive`, `skip`

#### Priority rules

- VIP customers (`sender_tier=vip`) must be handled before standard customers
- High urgency emails take precedence over medium and low
- Security breaches and VIP incidents → escalate
- Spam and feature requests → archive
- Standard billing and technical issues → respond
- Use `email_id` in the action to select which email to process
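
The routing half of these rules (which action to take, as opposed to which email to pick next) collapses into a small decision function. This is illustrative only: `security_breach` stands in for whatever signal the environment derives from the email body, and edge cases such as spam from a VIP are resolved by our ordering, not the environment's:

```python
def route_email(category: str, sender_tier: str = "standard",
                security_breach: bool = False) -> str:
    """Choose an action for one email following the priority rules above."""
    if security_breach:
        return "escalate"          # security breaches always escalate
    if category in ("spam", "feature_request"):
        return "archive"           # spam and feature requests get archived
    if sender_tier == "vip":
        return "escalate"          # VIP incidents go to escalation
    # Standard billing/technical issues, and remaining general/account mail,
    # get a drafted response.
    return "respond"
```
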

#### Step Rewards

- VIP email handled in first 4 positions: +0.08
- VIP email delayed (position >= 4): -0.05
- High urgency email in first 6 positions: +0.05
- Low urgency email after position 6: +0.03
- Correct category: +0.04
- Correct urgency: +0.02
- Correct action: +0.06
- Wrong action: -0.03
- Response text provided and > 50 characters: +0.02
- Spam not archived: -0.04
- Step penalty: -0.005

#### Final Grader Score

- VIP prioritization: up to 0.20 (40% credit if handled late)
- High urgency prioritization: up to 0.10 (40% credit if handled late)
- Category accuracy: up to 0.15
- Urgency accuracy: up to 0.15
- Action accuracy: up to 0.30
- Email coverage: up to 0.10
- Maximum: 1.0

## Data Models

### Enums

#### ActionType

- `classify`: Classify an email into a category and urgency
- `respond`: Draft a response to an email
- `escalate`: Escalate an email with a reason
- `archive`: Archive an email
- `skip`: Skip the current email

#### Category

- `billing`: Payment, invoices, subscription issues
- `technical`: Bugs, errors, technical failures
- `general`: General inquiries
- `spam`: Unsolicited or irrelevant messages
- `account`: Account access, settings, profile issues
- `feature_request`: Requests for new features

#### Urgency

- `high`: Requires immediate attention
- `medium`: Standard priority
- `low`: Can be handled later

### Models

#### Email

- `id` (str): Unique email identifier
- `subject` (str): Email subject line
- `body` (str): Email body content
- `sender` (str): Sender's email address
- `sender_tier` (str, default: `"standard"`): Customer tier (`standard` or `vip`)
- `received_minutes_ago` (int, default: 0): How long ago the email was received

#### Action

- `action_type` (ActionType): The action to perform
- `category` (Category, optional): Email category, used with `classify`
- `urgency` (Urgency, optional): Email urgency, used with `classify`
- `response_text` (str, optional): Drafted response, used with `respond`
- `escalation_reason` (str, optional): Reason for escalation, used with `escalate`
- `email_id` (str, optional): Target email ID, used in `support_session` to select which email to process

#### Observation

- `current_email` (Email, optional): The email currently being processed
- `email_queue` (List[Email], default: `[]`): Queue of pending emails, populated in Task 3 only
- `processed_count` (int, default: 0): Number of emails processed so far
- `step_count` (int, default: 0): Current step number
- `task_id` (str): Active task identifier
- `task_description` (str): Human-readable task description
- `available_actions` (List[str]): Actions valid for the current state
- `context` (Dict): Additional context such as `max_steps`, `remaining_steps`, `queue_size`

#### Reward

- `value` (float): Total reward for the step
- `components` (Dict[str, float], default: `{}`): Breakdown of reward sub-components
- `reason` (str, default: `""`): Human-readable explanation of the reward

#### StepResult

- `observation` (Observation): Next environment observation
- `reward` (Reward): Reward received for the action
- `done` (bool): Whether the episode has ended
- `info` (Dict): Additional diagnostic information
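
For concreteness, here are example `Action` payloads built from the fields above. The email ID and response text are made up for illustration:

```python
import json

# Task 1: classify the current email.
classify_action = {
    "action_type": "classify",
    "category": "billing",
    "urgency": "high",
}

# Task 2: draft a response to the current email.
respond_action = {
    "action_type": "respond",
    "response_text": "Thanks for reaching out. We have refunded the duplicate "
                     "charge and it should appear on your statement shortly.",
}

# Task 3: escalate a specific email from the queue.
escalate_action = {
    "action_type": "escalate",
    "escalation_reason": "Possible security breach reported by a VIP customer",
    "email_id": "email-007",  # hypothetical ID; selects the email in Task 3
}

# Each dict serializes directly to the JSON body POSTed to /step.
print(json.dumps(escalate_action, indent=2))
```
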

## Backend API

| Method | Path | Description |
|--------|------|-------------|
| POST | `/reset?task_id=<id>` | Reset environment for a task; returns initial Observation |
| POST | `/step` | Submit an Action; returns `{observation, reward, done, info}` |
| GET | `/state` | Current environment state |
| GET | `/tasks` | List all tasks with action schema |
| GET | `/grader` | Current grader score (0.0–1.0) |

## Baseline Scores

Baseline agent: `gpt-4o-mini` via OpenAI API

| Task | Score | Steps | Total Reward |
|------|-------|-------|--------------|
| Email Classification | 0.930 | 10 | 1.755 |
| Response Drafting | 0.920 | 6 | 1.650 |
| Support Session | 0.882 | 15 | 1.506 |

## Local Development Setup

### Prerequisites

- Python 3.11 or 3.12 (matches the Docker image)
- Optional: `uv` for creating a virtual environment

### Steps

**1. Create and activate a virtual environment**

With `uv`:

```bash
uv venv --python 3.11
source .venv/bin/activate
```

Or with the standard library:

```bash
python3.11 -m venv .venv
source .venv/bin/activate
```

**2. Install dependencies**

```bash
pip install -r requirements.txt
```

**3. Download NLTK data (one time)**

```bash
python -c "import nltk; nltk.download('vader_lexicon', quiet=True); nltk.download('punkt_tab', quiet=True)"
```

**4. Environment variables**

Copy the example file and edit `.env`:

```bash
cp .env.example .env
```

| Variable | Required for | Description |
|----------|--------------|-------------|
| `API_BASE_URL` | Baseline inference | OpenAI-compatible API base URL (default: Hugging Face router). |
| `MODEL_NAME` | Baseline inference | Model identifier for that API. |
| `HF_TOKEN` | Baseline (HF) | Hugging Face token when using the HF router or similar. |
| `OPENAI_API_KEY` | Baseline (OpenAI) | OpenAI API key when using OpenAI’s API. Inference uses `HF_TOKEN` if set, otherwise `OPENAI_API_KEY`. |
| `ENV_BASE_URL` | Baseline inference | URL of this environment (`http://localhost:7860` locally). |

Running only the API server does not require LLM keys.

**5. Start the server**

```bash
uvicorn server.app:app --host 0.0.0.0 --port 7860 --reload
```

Open http://localhost:7860/docs to confirm the API is up.

## Baseline inference

With the server running (step 5) and `.env` configured with LLM credentials, run:

```bash
python inference.py
```

Structured logs go to stdout (`[START]`, `[STEP]`, `[END]`); a JSON summary is printed to stderr.

## Docker

Build and run the same service the Hugging Face Space uses:

```bash
docker build -t sieve .
docker run --rm -p 7860:7860 sieve
```

Then set `ENV_BASE_URL=http://localhost:7860` (or the container’s URL) for `inference.py`.