---
title: Sieve
sdk: docker
pinned: false
---
# Sieve – Customer Support RL Environment
Sieve is a reinforcement learning environment that simulates a real-world customer support inbox. An AI agent interacts with it through a standard `reset() / step() / state()` HTTP API, receiving emails, taking actions, and earning rewards based on how well it handles each situation.
## How It Works
![How It Works](assets/how_it_works_v2.svg)
The agent calls `/reset` to start an episode, then loops until `done=true`: it reads the current email from the `Observation`, posts an `Action` to `/step`, and receives a `Reward` and the next `Observation`. Each step reward reflects immediate quality. A `-0.005` step penalty discourages unnecessary actions. The final grader score from `/grader` is a holistic metric computed over the full episode.
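As a concrete sketch of that loop in Python (assuming the local server address from the setup section; the task ID and the hard-coded `classify` payload are placeholders, and real task IDs can be listed via `GET /tasks`):
```python
import requests

BASE = "http://localhost:7860"  # assumed local server address

# Start an episode; /reset returns the initial Observation as JSON.
obs = requests.post(f"{BASE}/reset", params={"task_id": "email_classification"}).json()

done = False
while not done:
    # Placeholder policy: always classify as low-urgency general mail.
    # A real agent would inspect obs["current_email"] before acting.
    action = {"action_type": "classify", "category": "general", "urgency": "low"}
    result = requests.post(f"{BASE}/step", json=action).json()
    obs, done = result["observation"], result["done"]
    print(result["reward"]["value"], result["reward"]["reason"])

# Holistic episode score once the loop ends.
print(requests.get(f"{BASE}/grader").json())
```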
## Project Structure
```
.
├── models.py        # Shared Pydantic models (Action, Observation, Reward, etc.)
├── inference.py     # Baseline agent script using OpenAI client
├── logger.py        # Structured [START]/[STEP]/[END] stdout logger
├── openenv.yaml     # OpenEnv environment metadata
├── pyproject.toml   # Project config and dependencies
├── Dockerfile       # Container definition
├── .env.example     # Example environment variables (copy to .env)
└── server/
    ├── app.py           # FastAPI application and API endpoints
    ├── environment.py   # Core environment logic (step, reset, reward, grader)
    ├── data.py          # Email datasets for all three tasks
    └── config.py        # Action schema definition
```
## Tasks
### Task 1 – Email Classification (Easy)
The agent receives one email at a time and must classify it using the `classify` action.
**Available action:** `classify` only
**Step Rewards**
- Correct category: `+0.15`
- Wrong category: `-0.05`
- Correct urgency: `+0.05`
- Wrong urgency: `-0.02`
- Wrong action type: `-0.05`
- Step penalty: `-0.005`
**Final Grader Score**
- Category accuracy: `70%` weight
- Urgency accuracy: `30%` weight
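A single fully correct classification therefore earns `0.15 + 0.05 - 0.005 = 0.195` on that step. The grader weighting itself is simple enough to sketch (this mirrors the weights above, not the server's actual code):
```python
def task1_grader(category_acc: float, urgency_acc: float) -> float:
    """Task 1 grader weighting as documented above (sketch)."""
    return 0.70 * category_acc + 0.30 * urgency_acc

# e.g. 9/10 categories and 8/10 urgencies correct:
print(task1_grader(0.9, 0.8))  # 0.87
```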
---
### Task 2 – Response Drafting (Medium)
The agent reads a customer email and drafts a professional response using the `respond` action.
**Available action:** `respond` only
**Step Rewards**
- Response >= 50 characters: `+0.05`
- Response < 50 characters: `-0.10`
- Keyword coverage: up to `+0.25` (scaled by `matched / min_required`)
- Negative/unprofessional tone (VADER neg > 0.4): `-0.10`
- Wrong action type: `-0.05`
- Step penalty: `-0.005`
**Final Grader Score**
- Keyword coverage weighted at `0.80`
- Length bonus up to `0.20` (scaled by `length / 200`, requires length > 50)
- Averaged across all emails in the task
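Under those weights, a per-email score might be computed as in the sketch below; capping coverage at 1 is an assumption, and the final grade averages this value across all emails:
```python
def task2_email_score(matched: int, min_required: int, length: int) -> float:
    """Per-email Task 2 score under the documented weighting (sketch)."""
    coverage = min(matched / min_required, 1.0)   # keyword coverage, assumed capped at 1
    length_bonus = min(length / 200, 1.0) if length > 50 else 0.0
    return 0.80 * coverage + 0.20 * length_bonus

# e.g. 3 of 4 required keywords in a 160-character reply:
print(task2_email_score(3, 4, 160))  # 0.76
```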
---
### Task 3 – Full Support Session (Hard)
The agent manages a queue of 15 mixed emails. It must choose which email to handle, classify it, and take the right action, all in the correct priority order.
**Available actions:** `respond`, `escalate`, `archive`, `skip`
**Priority rules**
- VIP customers (`sender_tier=vip`) must be handled before standard customers
- High urgency emails take precedence over medium and low
- Security breaches and VIP incidents → `escalate`
- Spam and feature requests → `archive`
- Standard billing and technical issues → `respond`
- Use `email_id` in the action to select which email to process
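Putting the rules together, an action that escalates a specific VIP security incident might look like this (the `email_id` value is invented; real IDs come from `email_queue` in the `Observation`):
```python
# Posted as the JSON body of POST /step during a support session.
action = {
    "action_type": "escalate",
    "email_id": "email_07",  # illustrative ID; read real ones from obs["email_queue"]
    "escalation_reason": "VIP customer reporting a possible account breach",
}
```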
**Step Rewards**
- VIP email handled in first 4 positions: `+0.08`
- VIP email delayed (position >= 4): `-0.05`
- High urgency email in first 6 positions: `+0.05`
- Low urgency email after position 6: `+0.03`
- Correct category: `+0.04`
- Correct urgency: `+0.02`
- Correct action: `+0.06`
- Wrong action: `-0.03`
- Response text provided and > 50 characters: `+0.02`
- Spam not archived: `-0.04`
- Step penalty: `-0.005`
**Final Grader Score**
- VIP prioritization: up to `0.20` (40% credit if handled late)
- High urgency prioritization: up to `0.10` (40% credit if handled late)
- Category accuracy: up to `0.15`
- Urgency accuracy: up to `0.15`
- Action accuracy: up to `0.30`
- Email coverage: up to `0.10`
- Maximum: `1.0`
---
## Data Models
### Enums
#### ActionType
- `classify` – Classify an email into a category and urgency
- `respond` – Draft a response to an email
- `escalate` – Escalate an email with a reason
- `archive` – Archive an email
- `skip` – Skip the current email
#### Category
- `billing` – Payment, invoices, subscription issues
- `technical` – Bugs, errors, technical failures
- `general` – General inquiries
- `spam` – Unsolicited or irrelevant messages
- `account` – Account access, settings, profile issues
- `feature_request` – Requests for new features
#### Urgency
- `high` – Requires immediate attention
- `medium` – Standard priority
- `low` – Can be handled later
### Models
#### Email
- `id` (`str`) – Unique email identifier
- `subject` (`str`) – Email subject line
- `body` (`str`) – Email body content
- `sender` (`str`) – Sender's email address
- `sender_tier` (`str`, default: `"standard"`) – Customer tier (`standard` or `vip`)
- `received_minutes_ago` (`int`, default: `0`) – How long ago the email was received
#### Action
- `action_type` (`ActionType`) – The action to perform
- `category` (`Category`, optional) – Email category, used with `classify`
- `urgency` (`Urgency`, optional) – Email urgency, used with `classify`
- `response_text` (`str`, optional) – Drafted response, used with `respond`
- `escalation_reason` (`str`, optional) – Reason for escalation, used with `escalate`
- `email_id` (`str`, optional) – Target email ID, used in `support_session` to select which email to process
#### Observation
- `current_email` (`Email`, optional) – The email currently being processed
- `email_queue` (`List[Email]`, default: `[]`) – Queue of pending emails, populated in Task 3 only
- `processed_count` (`int`, default: `0`) – Number of emails processed so far
- `step_count` (`int`, default: `0`) – Current step number
- `task_id` (`str`) – Active task identifier
- `task_description` (`str`) – Human-readable task description
- `available_actions` (`List[str]`) – Actions valid for the current state
- `context` (`Dict`) – Additional context such as `max_steps`, `remaining_steps`, `queue_size`
#### Reward
- `value` (`float`) – Total reward for the step
- `components` (`Dict[str, float]`, default: `{}`) – Breakdown of reward sub-components
- `reason` (`str`, default: `""`) – Human-readable explanation of the reward
#### StepResult
- `observation` (`Observation`) – Next environment observation
- `reward` (`Reward`) – Reward received for the action
- `done` (`bool`) – Whether the episode has ended
- `info` (`Dict`) – Additional diagnostic information
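A client can mirror these models with Pydantic and parse a `/step` response directly; the sketch below follows the field lists above and is not a copy of the project's `models.py`:
```python
from typing import Dict, List, Optional
from pydantic import BaseModel

class Email(BaseModel):
    id: str
    subject: str
    body: str
    sender: str
    sender_tier: str = "standard"
    received_minutes_ago: int = 0

class Observation(BaseModel):
    current_email: Optional[Email] = None
    email_queue: List[Email] = []
    processed_count: int = 0
    step_count: int = 0
    task_id: str
    task_description: str
    available_actions: List[str]
    context: Dict = {}

class Reward(BaseModel):
    value: float
    components: Dict[str, float] = {}
    reason: str = ""

class StepResult(BaseModel):
    observation: Observation
    reward: Reward
    done: bool
    info: Dict = {}

# e.g. result = StepResult.model_validate(response_json)
```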
## Backend API
| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/reset?task_id=<id>` | Reset environment for a task, returns initial Observation |
| `POST` | `/step` | Submit an Action, returns `{observation, reward, done, info}` |
| `GET` | `/state` | Current environment state |
| `GET` | `/tasks` | List all tasks with action schema |
| `GET` | `/grader` | Current grader score (0.0–1.0) |
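For a quick smoke test of the read-only endpoints from Python (again assuming the local address):
```python
import requests

BASE = "http://localhost:7860"  # assumed local address

print(requests.get(f"{BASE}/tasks").json())   # task list with action schemas
print(requests.get(f"{BASE}/state").json())   # current environment state
print(requests.get(f"{BASE}/grader").json())  # holistic score in [0.0, 1.0]
```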
## Baseline Scores
Baseline agent: `gpt-4o-mini` via OpenAI API
| Task | Score | Steps | Total Reward |
|------|-------|-------|--------------|
| Email Classification | 0.930 | 10 | 1.755 |
| Response Drafting | 0.920 | 6 | 1.650 |
| Support Session | 0.882 | 15 | 1.506 |
## Local Development Setup
### Prerequisites
- Python 3.11 or 3.12 (matches the Docker image)
- Optional: [uv](https://docs.astral.sh/uv/) for creating a virtual environment
### Steps
**1. Create and activate a virtual environment**
With uv:
```bash
uv venv --python 3.11
source .venv/bin/activate
```
Or with the standard library:
```bash
python3.11 -m venv .venv
source .venv/bin/activate
```
**2. Install dependencies**
```bash
pip install -r requirements.txt
```
**3. Download NLTK data (one time)**
```bash
python -c "import nltk; nltk.download('vader_lexicon', quiet=True); nltk.download('punkt_tab', quiet=True)"
```
**4. Environment variables**
Copy the example file and edit `.env`:
```bash
cp .env.example .env
```
| Variable | Required for | Description |
|----------|----------------|-------------|
| `API_BASE_URL` | Baseline inference | OpenAI-compatible API base URL (default: Hugging Face router). |
| `MODEL_NAME` | Baseline inference | Model identifier for that API. |
| `HF_TOKEN` | Baseline (HF) | Hugging Face token when using the HF router or similar. |
| `OPENAI_API_KEY` | Baseline (OpenAI) | OpenAI API key when using OpenAI's API. Inference uses `HF_TOKEN` if set, otherwise `OPENAI_API_KEY`. |
| `ENV_BASE_URL` | Baseline inference | URL of this environment (`http://localhost:7860` locally). |
Running only the API server does not require LLM keys.
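The credential precedence in the table can be expressed as a small helper; this is a sketch of the documented behavior, not a copy of `inference.py`:
```python
import os

def llm_token() -> str:
    """Pick the LLM credential as documented: HF_TOKEN wins, else OPENAI_API_KEY."""
    token = os.getenv("HF_TOKEN") or os.getenv("OPENAI_API_KEY")
    if not token:
        raise RuntimeError("Set HF_TOKEN or OPENAI_API_KEY in .env for baseline inference")
    return token
```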
**5. Start the server**
```bash
uvicorn server.app:app --host 0.0.0.0 --port 7860 --reload
```
Open `http://localhost:7860/docs` to confirm the API is up.
### Baseline inference
With the server running (step 5) and `.env` configured with LLM credentials, run:
```bash
python inference.py
```
Structured logs go to stdout (`[START]`, `[STEP]`, `[END]`); a JSON summary is printed to stderr.
### Docker
Build and run the same service the Hugging Face Space uses:
```bash
docker build -t sieve .
docker run --rm -p 7860:7860 sieve
```
Then set `ENV_BASE_URL=http://localhost:7860` (or the container's URL) for `inference.py`.