| title: Sieve | |
| sdk: docker | |
| pinned: false | |
| # Sieve: Customer Support RL Environment | |
| Sieve is a reinforcement learning environment that simulates a real-world customer support inbox. An AI agent interacts with it through a standard `reset() / step() / state()` HTTP API, receiving emails, taking actions, and earning rewards based on how well it handles each situation. | |
| ## How It Works | |
| The agent calls `/reset` to start an episode, then loops until `done=true`: it reads the current email from the `Observation`, posts an `Action` to `/step`, and receives a `Reward` and the next `Observation`. Each step reward reflects the immediate quality of that action, and a `-0.005` step penalty discourages unnecessary actions. The final grader score from `/grader` is a holistic metric computed over the full episode. | |
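The loop described above can be sketched as follows. This is a minimal illustration, not the project's `inference.py`: `post_json`, `run_episode`, and the `policy` callback are hypothetical names, and the sketch assumes `/reset` accepts a POST with the task ID in the query string and that `/step` returns the `{observation, reward, done, info}` shape listed under Backend API.

```python
import json
from urllib import request


def post_json(url, payload=None):
    """POST a JSON body (an empty object if none is given) and decode the reply."""
    body = json.dumps(payload or {}).encode("utf-8")
    req = request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_episode(base_url, task_id, policy, post=post_json, max_steps=100):
    """Run one episode and return (total step reward, steps taken).

    `policy` is a hypothetical callback that maps an Observation dict to an
    Action dict. `post` is injectable so the loop can be exercised offline.
    """
    obs = post(f"{base_url}/reset?task_id={task_id}")
    total, steps = 0.0, 0
    for _ in range(max_steps):
        result = post(f"{base_url}/step", policy(obs))
        total += result["reward"]["value"]  # Reward.value is the step total
        steps += 1
        obs = result["observation"]
        if result["done"]:
            break
    return total, steps
```

Passing a stub for `post` makes it possible to check the control flow without a running server; against a live server the default `post_json` is used.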
| ## Project Structure | |
| ``` | |
| . | |
| ├── models.py # Shared Pydantic models (Action, Observation, Reward, etc.) | |
| ├── inference.py # Baseline agent script using OpenAI client | |
| ├── logger.py # Structured [START]/[STEP]/[END] stdout logger | |
| ├── openenv.yaml # OpenEnv environment metadata | |
| ├── pyproject.toml # Project config and dependencies | |
| ├── Dockerfile # Container definition | |
| ├── .env.example # Example environment variables (copy to .env) | |
| └── server/ | |
|     ├── app.py # FastAPI application and API endpoints | |
|     ├── environment.py # Core environment logic (step, reset, reward, grader) | |
|     ├── data.py # Email datasets for all three tasks | |
|     └── config.py # Action schema definition | |
| ``` | |
| ## Tasks | |
| ### Task 1: Email Classification (Easy) | |
| The agent receives one email at a time and must classify it using the `classify` action. | |
| **Available action:** `classify` only | |
| **Step Rewards** | |
| - Correct category: `+0.15` | |
| - Wrong category: `-0.05` | |
| - Correct urgency: `+0.05` | |
| - Wrong urgency: `-0.02` | |
| - Wrong action type: `-0.05` | |
| - Step penalty: `-0.005` | |
| **Final Grader Score** | |
| - Category accuracy: `70%` weight | |
| - Urgency accuracy: `30%` weight | |
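The Task 1 numbers above combine as in this short sketch. The function names are hypothetical and the code is not taken from `server/environment.py`; it only restates the listed values. The wrong-action-type penalty is left out because the list does not say how it interacts with the other components.

```python
def classification_step_reward(category_ok, urgency_ok):
    """Step reward for one `classify` action, using the values listed above."""
    reward = 0.15 if category_ok else -0.05
    reward += 0.05 if urgency_ok else -0.02
    reward -= 0.005  # per-step penalty
    return reward


def classification_grader_score(category_hits, urgency_hits, n_emails):
    """Final score: 70% category accuracy plus 30% urgency accuracy."""
    return 0.7 * category_hits / n_emails + 0.3 * urgency_hits / n_emails
```

For example, nine correct categories and ten correct urgencies out of ten emails give 0.7 × 0.9 + 0.3 × 1.0 = 0.93.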
| --- | |
| ### Task 2: Response Drafting (Medium) | |
| The agent reads a customer email and drafts a professional response using the `respond` action. | |
| **Available action:** `respond` only | |
| **Step Rewards** | |
| - Response >= 50 characters: `+0.05` | |
| - Response < 50 characters: `-0.10` | |
| - Keyword coverage: up to `+0.25` (scaled by `matched / min_required`) | |
| - Negative/unprofessional tone (VADER neg > 0.4): `-0.10` | |
| - Wrong action type: `-0.05` | |
| - Step penalty: `-0.005` | |
| **Final Grader Score** | |
| - Keyword coverage weighted at `0.80` | |
| - Length bonus up to `0.20` (scaled by `length / 200`, requires length > 50) | |
| - Averaged across all emails in the task | |
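A rough sketch of how the Task 2 components could fit together is shown below. Several details are assumptions rather than facts about the project: keywords are matched as case-insensitive substrings, coverage is capped at 1.0 (to honor "up to `+0.25`"), `min_required` is positive, and the VADER negative score is passed in as `neg_score` instead of being computed here. All function names are hypothetical.

```python
def keyword_coverage(text, keywords, min_required):
    """matched / min_required, capped at 1.0 (the cap is an assumption)."""
    lowered = text.lower()
    matched = sum(1 for kw in keywords if kw.lower() in lowered)
    return min(matched / min_required, 1.0)


def response_step_reward(text, keywords, min_required, neg_score=0.0):
    """Step reward for one `respond` action, using the values listed above."""
    reward = 0.05 if len(text) >= 50 else -0.10
    reward += 0.25 * keyword_coverage(text, keywords, min_required)
    if neg_score > 0.4:  # VADER `neg` score, supplied by the caller
        reward -= 0.10
    return reward - 0.005  # per-step penalty


def response_grader_score(text, keywords, min_required):
    """Per-email grader score: 0.80 keyword coverage plus up to 0.20 for length."""
    score = 0.80 * keyword_coverage(text, keywords, min_required)
    if len(text) > 50:
        score += 0.20 * min(len(text) / 200, 1.0)
    return score
```

The task-level score would then be the mean of `response_grader_score` over all emails in the task.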
| --- | |
| ### Task 3: Full Support Session (Hard) | |
| The agent manages a queue of 15 mixed emails. It must choose which email to handle, classify it, and take the right action, all in the correct priority order. | |
| **Available actions:** `respond`, `escalate`, `archive`, `skip` | |
| **Priority rules** | |
| - VIP customers (`sender_tier=vip`) must be handled before standard customers | |
| - High urgency emails take precedence over medium and low | |
| - Security breaches and VIP incidents → `escalate` | |
| - Spam and feature requests → `archive` | |
| - Standard billing and technical issues → `respond` | |
| - Use `email_id` in the action to select which email to process | |
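One way an agent might apply these priority rules is sketched below. It is an interpretation, not the environment's logic: the `urgency` key on each email is a hypothetical field holding the agent's own prediction (the `Email` model does not carry one), and "VIP incidents" is read here as any VIP email that is neither spam nor a feature request.

```python
URGENCY_RANK = {"high": 0, "medium": 1, "low": 2}


def order_queue(emails):
    """Sort VIP senders first, then by predicted urgency (high, medium, low).

    `sorted` is stable, so emails that tie keep their original queue order.
    """
    return sorted(
        emails,
        key=lambda e: (e["sender_tier"] != "vip", URGENCY_RANK[e["urgency"]]),
    )


def choose_action(category, sender_tier="standard", security_breach=False):
    """Map an email to an action type under the assumptions stated above."""
    if category in ("spam", "feature_request"):
        return "archive"
    if security_breach or sender_tier == "vip":
        return "escalate"
    return "respond"
```

The `id` of the first email returned by `order_queue` would then go into the action's `email_id` field.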
| **Step Rewards** | |
| - VIP email handled in first 4 positions: `+0.08` | |
| - VIP email delayed (position >= 4): `-0.05` | |
| - High urgency email in first 6 positions: `+0.05` | |
| - Low urgency email after position 6: `+0.03` | |
| - Correct category: `+0.04` | |
| - Correct urgency: `+0.02` | |
| - Correct action: `+0.06` | |
| - Wrong action: `-0.03` | |
| - Response text provided and > 50 characters: `+0.02` | |
| - Spam not archived: `-0.04` | |
| - Step penalty: `-0.005` | |
| **Final Grader Score** | |
| - VIP prioritization: up to `0.20` (40% credit if handled late) | |
| - High urgency prioritization: up to `0.10` (40% credit if handled late) | |
| - Category accuracy: up to `0.15` | |
| - Urgency accuracy: up to `0.15` | |
| - Action accuracy: up to `0.30` | |
| - Email coverage: up to `0.10` | |
| - Maximum: `1.0` | |
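The Task 3 weights above sum to 1.0. The sketch below shows one plausible way to combine them; the dictionary keys and function names are hypothetical, and the handling of late emails simply applies the stated 40% credit.

```python
TASK3_WEIGHTS = {
    "vip_prioritization": 0.20,
    "high_urgency_prioritization": 0.10,
    "category_accuracy": 0.15,
    "urgency_accuracy": 0.15,
    "action_accuracy": 0.30,
    "email_coverage": 0.10,
}


def prioritization_credit(on_time, late, total):
    """Fraction earned: full credit when on time, 40% credit when late."""
    if total == 0:
        return 0.0  # assumption: no such emails means no credit
    return (on_time + 0.4 * late) / total


def support_session_score(fractions):
    """Weighted sum of per-component fractions, each expected in [0, 1]."""
    return sum(w * fractions.get(name, 0.0) for name, w in TASK3_WEIGHTS.items())
```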
| --- | |
| ## Data Models | |
| ### Enums | |
| #### ActionType | |
| - `classify`: Classify an email into a category and urgency | |
| - `respond`: Draft a response to an email | |
| - `escalate`: Escalate an email with a reason | |
| - `archive`: Archive an email | |
| - `skip`: Skip the current email | |
| #### Category | |
| - `billing`: Payment, invoices, subscription issues | |
| - `technical`: Bugs, errors, technical failures | |
| - `general`: General inquiries | |
| - `spam`: Unsolicited or irrelevant messages | |
| - `account`: Account access, settings, profile issues | |
| - `feature_request`: Requests for new features | |
| #### Urgency | |
| - `high`: Requires immediate attention | |
| - `medium`: Standard priority | |
| - `low`: Can be handled later | |
| ### Models | |
| #### Email | |
| - `id` (`str`): Unique email identifier | |
| - `subject` (`str`): Email subject line | |
| - `body` (`str`): Email body content | |
| - `sender` (`str`): Sender's email address | |
| - `sender_tier` (`str`, default: `"standard"`): Customer tier (`standard` or `vip`) | |
| - `received_minutes_ago` (`int`, default: `0`): How long ago the email was received | |
| #### Action | |
| - `action_type` (`ActionType`): The action to perform | |
| - `category` (`Category`, optional): Email category, used with `classify` | |
| - `urgency` (`Urgency`, optional): Email urgency, used with `classify` | |
| - `response_text` (`str`, optional): Drafted response, used with `respond` | |
| - `escalation_reason` (`str`, optional): Reason for escalation, used with `escalate` | |
| - `email_id` (`str`, optional): Target email ID, used in `support_session` to select which email to process | |
| #### Observation | |
| - `current_email` (`Email`, optional): The email currently being processed | |
| - `email_queue` (`List[Email]`, default: `[]`): Queue of pending emails, populated in Task 3 only | |
| - `processed_count` (`int`, default: `0`): Number of emails processed so far | |
| - `step_count` (`int`, default: `0`): Current step number | |
| - `task_id` (`str`): Active task identifier | |
| - `task_description` (`str`): Human-readable task description | |
| - `available_actions` (`List[str]`): Actions valid for the current state | |
| - `context` (`Dict`): Additional context such as `max_steps`, `remaining_steps`, `queue_size` | |
| #### Reward | |
| - `value` (`float`): Total reward for the step | |
| - `components` (`Dict[str, float]`, default: `{}`): Breakdown of reward sub-components | |
| - `reason` (`str`, default: `""`): Human-readable explanation of the reward | |
| #### StepResult | |
| - `observation` (`Observation`): Next environment observation | |
| - `reward` (`Reward`): Reward received for the action | |
| - `done` (`bool`): Whether the episode has ended | |
| - `info` (`Dict`): Additional diagnostic information | |
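To make the `Action` fields concrete, here is a small sketch that builds JSON-ready action payloads. It mirrors the enums with the standard library's `enum` module rather than the project's Pydantic models in `models.py`, and the helper names `classify_action` and `respond_action` are hypothetical.

```python
from enum import Enum


class ActionType(str, Enum):
    CLASSIFY = "classify"
    RESPOND = "respond"
    ESCALATE = "escalate"
    ARCHIVE = "archive"
    SKIP = "skip"


class Category(str, Enum):
    BILLING = "billing"
    TECHNICAL = "technical"
    GENERAL = "general"
    SPAM = "spam"
    ACCOUNT = "account"
    FEATURE_REQUEST = "feature_request"


class Urgency(str, Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


def classify_action(category, urgency):
    """Build a `classify` payload; unknown values raise ValueError."""
    return {
        "action_type": ActionType.CLASSIFY.value,
        "category": Category(category).value,
        "urgency": Urgency(urgency).value,
    }


def respond_action(text, email_id=None):
    """Build a `respond` payload; `email_id` is only needed in Task 3."""
    action = {"action_type": ActionType.RESPOND.value, "response_text": text}
    if email_id is not None:
        action["email_id"] = email_id
    return action
```

Validating against the enums on the client side catches a misspelled category before the request is sent.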
| ## Backend API | |
| | Method | Path | Description | | |
| |--------|------|-------------| | |
| | `POST` | `/reset?task_id=<id>` | Reset environment for a task, returns initial Observation | | |
| | `POST` | `/step` | Submit an Action, returns `{observation, reward, done, info}` | | |
| | `GET` | `/state` | Current environment state | | |
| | `GET` | `/tasks` | List all tasks with action schema | | |
| `GET` | `/grader` | Current grader score (0.0 to 1.0) | |
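The endpoints in the table can be called with any HTTP client. The sketch below uses only the standard library; `build_request` and `call` are hypothetical helpers. The JSON shape returned by `/grader`, `/state`, and `/tasks` is not specified here, so the decoded reply is returned unchanged.

```python
import json
from urllib import request


def build_request(base_url, method, path, payload=None):
    """Create a urllib Request for one of the endpoints in the table above."""
    data, headers = None, {}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    url = f"{base_url.rstrip('/')}{path}"
    return request.Request(url, data=data, headers=headers, method=method)


def call(req):
    """Send the request and decode the JSON reply (requires a running server)."""
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running locally, `call(build_request("http://localhost:7860", "GET", "/grader"))` would fetch the current grader score.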
| ## Baseline Scores | |
| Baseline agent: `gpt-4o-mini` via OpenAI API | |
| | Task | Score | Steps | Total Reward | | |
| |------|-------|-------|--------------| | |
| | Email Classification | 0.930 | 10 | 1.755 | | |
| | Response Drafting | 0.920 | 6 | 1.650 | | |
| | Support Session | 0.882 | 15 | 1.506 | | |
| ## Local Development Setup | |
| ### Prerequisites | |
| - Python 3.11 or 3.12 (matches the Docker image) | |
| - Optional: [uv](https://docs.astral.sh/uv/) for creating a virtual environment | |
| ### Steps | |
| **1. Create and activate a virtual environment** | |
| With uv: | |
| ```bash | |
| uv venv --python 3.11 | |
| source .venv/bin/activate | |
| ``` | |
| Or with the standard library: | |
| ```bash | |
| python3.11 -m venv .venv | |
| source .venv/bin/activate | |
| ``` | |
| **2. Install dependencies** | |
| ```bash | |
| pip install -r requirements.txt | |
| ``` | |
| **3. Download NLTK data (one time)** | |
| ```bash | |
| python -c "import nltk; nltk.download('vader_lexicon', quiet=True); nltk.download('punkt_tab', quiet=True)" | |
| ``` | |
| **4. Environment variables** | |
| Copy the example file and edit `.env`: | |
| ```bash | |
| cp .env.example .env | |
| ``` | |
| | Variable | Required for | Description | | |
| |----------|----------------|-------------| | |
| | `API_BASE_URL` | Baseline inference | OpenAI-compatible API base URL (default: Hugging Face router). | | |
| | `MODEL_NAME` | Baseline inference | Model identifier for that API. | | |
| | `HF_TOKEN` | Baseline (HF) | Hugging Face token when using the HF router or similar. | | |
| `OPENAI_API_KEY` | Baseline (OpenAI) | OpenAI API key when using OpenAI's API. Inference uses `HF_TOKEN` if set, otherwise `OPENAI_API_KEY`. | |
| | `ENV_BASE_URL` | Baseline inference | URL of this environment (`http://localhost:7860` locally). | | |
| Running only the API server does not require LLM keys. | |
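The precedence between `HF_TOKEN` and `OPENAI_API_KEY` can be expressed as in this sketch. `resolve_settings` is a hypothetical helper, not the code in `inference.py`, and it assumes an empty variable counts as unset and that `ENV_BASE_URL` falls back to the local address.

```python
import os


def resolve_settings(env=None):
    """Read the variables from the table above; HF_TOKEN wins over OPENAI_API_KEY."""
    env = os.environ if env is None else env
    return {
        # `or` treats an empty string the same as a missing variable
        "api_key": env.get("HF_TOKEN") or env.get("OPENAI_API_KEY"),
        "api_base_url": env.get("API_BASE_URL"),
        "model_name": env.get("MODEL_NAME"),
        "env_base_url": env.get("ENV_BASE_URL", "http://localhost:7860"),
    }
```

Passing a plain dictionary instead of `os.environ` makes the precedence easy to check without touching the real environment.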
| **5. Start the server** | |
| ```bash | |
| uvicorn server.app:app --host 0.0.0.0 --port 7860 --reload | |
| ``` | |
| Open `http://localhost:7860/docs` to confirm the API is up. | |
| ### Baseline inference | |
| With the server running (step 5) and `.env` configured with LLM credentials, run: | |
| ```bash | |
| python inference.py | |
| ``` | |
| Structured logs go to stdout (`[START]`, `[STEP]`, `[END]`); a JSON summary is printed to stderr. | |
| ### Docker | |
| Build and run the same service the Hugging Face Space uses: | |
| ```bash | |
| docker build -t sieve . | |
| docker run --rm -p 7860:7860 sieve | |
| ``` | |
| Then set `ENV_BASE_URL=http://localhost:7860` (or the container's URL) for `inference.py`. | |