---
title: Sieve
sdk: docker
pinned: false
---

# Sieve — Customer Support RL Environment

Sieve is a reinforcement learning environment that simulates a real-world customer support inbox. An AI agent interacts with it through a standard `reset() / step() / state()` HTTP API, receiving emails, taking actions, and earning rewards based on how well it handles each situation.

## How It Works

![How It Works](assets/how_it_works_v2.svg)

The agent calls `/reset` to start an episode, then loops — reading the current email from the `Observation`, posting an `Action` to `/step`, and receiving a `Reward` and next `Observation` — until `done=true`. Each step reward reflects immediate quality. A `-0.005` step penalty discourages unnecessary actions. The final grader score from `/grader` is a holistic metric computed over the full episode.

## Project Structure

```
.
├── models.py          # Shared Pydantic models (Action, Observation, Reward, etc.)
├── inference.py       # Baseline agent script using OpenAI client
├── logger.py          # Structured [START]/[STEP]/[END] stdout logger
├── openenv.yaml       # OpenEnv environment metadata
├── pyproject.toml     # Project config and dependencies
├── Dockerfile         # Container definition
├── .env.example       # Example environment variables (copy to .env)
└── server/
    ├── app.py         # FastAPI application and API endpoints
    ├── environment.py # Core environment logic (step, reset, reward, grader)
    ├── data.py        # Email datasets for all three tasks
    └── config.py      # Action schema definition
```

## Tasks

### Task 1 — Email Classification (Easy)

The agent receives one email at a time and must classify it using the `classify` action.
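A Task 1 step posts a `classify` Action (field names follow the Action model in the Data Models section). As an illustration only (the shipped baseline in `inference.py` uses an LLM instead), a toy keyword heuristic could build the request body like this:

```python
# Toy Task 1 classifier: illustrative keyword matching, not the baseline agent.
# The keyword lists here are assumptions for the sketch.
KEYWORDS = {
    "billing": ("invoice", "payment", "charge", "refund"),
    "technical": ("error", "bug", "crash", "500"),
    "account": ("password", "login", "locked"),
}

def classify_action(email: dict) -> dict:
    """Build the JSON body for a /step classify action from an Email dict."""
    text = (email["subject"] + " " + email["body"]).lower()
    category = next(
        (cat for cat, words in KEYWORDS.items() if any(w in text for w in words)),
        "general",  # fall back when no keyword matches
    )
    urgency = "high" if ("urgent" in text or "asap" in text) else "medium"
    return {"action_type": "classify", "category": category, "urgency": urgency}
```

The returned dict is what the agent would send as the body of `POST /step`.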
**Available action:** `classify` only

**Step Rewards**
- Correct category: `+0.15`
- Wrong category: `-0.05`
- Correct urgency: `+0.05`
- Wrong urgency: `-0.02`
- Wrong action type: `-0.05`
- Step penalty: `-0.005`

**Final Grader Score**
- Category accuracy: `70%` weight
- Urgency accuracy: `30%` weight

---

### Task 2 — Response Drafting (Medium)

The agent reads a customer email and drafts a professional response using the `respond` action.

**Available action:** `respond` only

**Step Rewards**
- Response >= 50 characters: `+0.05`
- Response < 50 characters: `-0.10`
- Keyword coverage: up to `+0.25` (scaled by `matched / min_required`)
- Negative/unprofessional tone (VADER neg > 0.4): `-0.10`
- Wrong action type: `-0.05`
- Step penalty: `-0.005`

**Final Grader Score**
- Keyword coverage weighted at `0.80`
- Length bonus up to `0.20` (scaled by `length / 200`, requires length > 50)
- Averaged across all emails in the task

---

### Task 3 — Full Support Session (Hard)

The agent manages a queue of 15 mixed emails. It must choose which email to handle, classify it, and take the right action — all in the correct priority order.
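Incoming emails do not carry an urgency field, so the agent must infer urgency itself before ordering the queue. A queue-ordering sketch consistent with the priority rules listed below might look like the following, with `infer_urgency` standing in for the agent's own judgment (the function name and the tie-breaking order are illustrative assumptions, not the environment's exact implementation):

```python
URGENCY_RANK = {"high": 0, "medium": 1, "low": 2}

def next_email_id(queue, infer_urgency):
    """Pick which email_id to handle next: VIP senders first, then higher
    inferred urgency, then oldest email first."""
    best = min(
        queue,
        key=lambda e: (
            e.get("sender_tier", "standard") != "vip",  # False (0) sorts first
            URGENCY_RANK[infer_urgency(e)],             # high < medium < low
            -e.get("received_minutes_ago", 0),          # older emails first
        ),
    )
    return best["id"]
```

The chosen `id` would then be passed as `email_id` in the next Action.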
**Available actions:** `respond`, `escalate`, `archive`, `skip`

**Priority rules**
- VIP customers (`sender_tier=vip`) must be handled before standard customers
- High urgency emails take precedence over medium and low
- Security breaches and VIP incidents → `escalate`
- Spam and feature requests → `archive`
- Standard billing and technical issues → `respond`
- Use `email_id` in the action to select which email to process

**Step Rewards**
- VIP email handled in first 4 positions: `+0.08`
- VIP email delayed (position >= 4): `-0.05`
- High urgency email in first 6 positions: `+0.05`
- Low urgency email after position 6: `+0.03`
- Correct category: `+0.04`
- Correct urgency: `+0.02`
- Correct action: `+0.06`
- Wrong action: `-0.03`
- Response text provided and > 50 characters: `+0.02`
- Spam not archived: `-0.04`
- Step penalty: `-0.005`

**Final Grader Score**
- VIP prioritization: up to `0.20` (40% credit if handled late)
- High urgency prioritization: up to `0.10` (40% credit if handled late)
- Category accuracy: up to `0.15`
- Urgency accuracy: up to `0.15`
- Action accuracy: up to `0.30`
- Email coverage: up to `0.10`
- Maximum: `1.0`

---

## Data Models

### Enums

#### ActionType
- `classify` — Classify an email into a category and urgency
- `respond` — Draft a response to an email
- `escalate` — Escalate an email with a reason
- `archive` — Archive an email
- `skip` — Skip the current email

#### Category
- `billing` — Payment, invoices, subscription issues
- `technical` — Bugs, errors, technical failures
- `general` — General inquiries
- `spam` — Unsolicited or irrelevant messages
- `account` — Account access, settings, profile issues
- `feature_request` — Requests for new features

#### Urgency
- `high` — Requires immediate attention
- `medium` — Standard priority
- `low` — Can be handled later

### Models

#### Email
- `id` (`str`) — Unique email identifier
- `subject` (`str`) — Email subject line
- `body` (`str`) — Email body content
- `sender` (`str`) — Sender's email address
- `sender_tier` (`str`, default: `"standard"`) — Customer tier (`standard` or `vip`)
- `received_minutes_ago` (`int`, default: `0`) — How long ago the email was received

#### Action
- `action_type` (`ActionType`) — The action to perform
- `category` (`Category`, optional) — Email category, used with `classify`
- `urgency` (`Urgency`, optional) — Email urgency, used with `classify`
- `response_text` (`str`, optional) — Drafted response, used with `respond`
- `escalation_reason` (`str`, optional) — Reason for escalation, used with `escalate`
- `email_id` (`str`, optional) — Target email ID, used in `support_session` to select which email to process

#### Observation
- `current_email` (`Email`, optional) — The email currently being processed
- `email_queue` (`List[Email]`, default: `[]`) — Queue of pending emails, populated in Task 3 only
- `processed_count` (`int`, default: `0`) — Number of emails processed so far
- `step_count` (`int`, default: `0`) — Current step number
- `task_id` (`str`) — Active task identifier
- `task_description` (`str`) — Human-readable task description
- `available_actions` (`List[str]`) — Actions valid for the current state
- `context` (`Dict`) — Additional context such as `max_steps`, `remaining_steps`, `queue_size`

#### Reward
- `value` (`float`) — Total reward for the step
- `components` (`Dict[str, float]`, default: `{}`) — Breakdown of reward sub-components
- `reason` (`str`, default: `""`) — Human-readable explanation of the reward

#### StepResult
- `observation` (`Observation`) — Next environment observation
- `reward` (`Reward`) — Reward received for the action
- `done` (`bool`) — Whether the episode has ended
- `info` (`Dict`) — Additional diagnostic information

## Backend API

| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/reset?task_id=` | Reset environment for a task, returns initial Observation |
| `POST` | `/step` | Submit an Action, returns `{observation, reward, done, info}` |
| `GET` | `/state` | Current environment state |
| `GET` | `/tasks` | List all tasks with action schema |
| `GET` | `/grader` | Current grader score (0.0–1.0) |

## Baseline Scores

Baseline agent: `gpt-4o-mini` via OpenAI API

| Task | Score | Steps | Total Reward |
|------|-------|-------|--------------|
| Email Classification | 0.930 | 10 | 1.755 |
| Response Drafting | 0.920 | 6 | 1.650 |
| Support Session | 0.882 | 15 | 1.506 |

## Local Development Setup

### Prerequisites

- Python 3.11 or 3.12 (matches the Docker image)
- Optional: [uv](https://docs.astral.sh/uv/) for creating a virtual environment

### Steps

**1. Create and activate a virtual environment**

With uv:

```bash
uv venv --python 3.11
source .venv/bin/activate
```

Or with the standard library:

```bash
python3.11 -m venv .venv
source .venv/bin/activate
```

**2. Install dependencies**

```bash
pip install -r requirements.txt
```

**3. Download NLTK data (one time)**

```bash
python -c "import nltk; nltk.download('vader_lexicon', quiet=True); nltk.download('punkt_tab', quiet=True)"
```

**4. Environment variables**

Copy the example file and edit `.env`:

```bash
cp .env.example .env
```

| Variable | Required for | Description |
|----------|--------------|-------------|
| `API_BASE_URL` | Baseline inference | OpenAI-compatible API base URL (default: Hugging Face router). |
| `MODEL_NAME` | Baseline inference | Model identifier for that API. |
| `HF_TOKEN` | Baseline (HF) | Hugging Face token when using the HF router or similar. |
| `OPENAI_API_KEY` | Baseline (OpenAI) | OpenAI API key when using OpenAI’s API. |
| `ENV_BASE_URL` | Baseline inference | URL of this environment (`http://localhost:7860` locally). |

Inference uses `HF_TOKEN` if set, otherwise `OPENAI_API_KEY`. Running only the API server does not require LLM keys.

**5. Start the server**

```bash
uvicorn server.app:app --host 0.0.0.0 --port 7860 --reload
```

Open `http://localhost:7860/docs` to confirm the API is up.
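With the server up, the reset/step loop from "How It Works" can be driven by a small client. A sketch with the HTTP `post` callable passed in (use `requests.post`, or any function with a compatible signature) so the loop can also be exercised without a live server:

```python
def run_episode(base_url, task_id, choose_action, post):
    """Drive one episode against the Sieve HTTP API.

    choose_action maps an Observation dict to an Action dict; post is any
    requests.post-compatible callable. Returns the total reward accumulated
    over the episode.
    """
    obs = post(f"{base_url}/reset", params={"task_id": task_id}).json()
    total, done = 0.0, False
    while not done:
        result = post(f"{base_url}/step", json=choose_action(obs)).json()
        obs = result["observation"]
        done = result["done"]
        total += result["reward"]["value"]
    return total
```

For a real run, `run_episode("http://localhost:7860", task_id, agent_fn, requests.post)` would step the environment until `done=true`; the `task_id` value must be one of the IDs listed by `GET /tasks`.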
### Baseline inference

With the server running (step 5) and `.env` configured with LLM credentials, run:

```bash
python inference.py
```

Structured logs go to stdout (`[START]`, `[STEP]`, `[END]`); a JSON summary is printed to stderr.

### Docker

Build and run the same service the Hugging Face Space uses:

```bash
docker build -t sieve .
docker run --rm -p 7860:7860 sieve
```

Then set `ENV_BASE_URL=http://localhost:7860` (or the container’s URL) for `inference.py`.
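After a run (local or via Docker), the holistic score can be read back from `GET /grader`. A minimal stdlib helper, with the URL opener injectable so it can be tested without a server (the exact JSON shape of the response is whatever the endpoint returns; it is not pinned down here):

```python
import json
from urllib.request import urlopen

def grader_score(base_url="http://localhost:7860", opener=urlopen):
    """Fetch the grader result for the episode just run and return the
    decoded JSON body."""
    with opener(f"{base_url}/grader") as resp:
        return json.load(resp)
```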