jampuramprem committed on
Commit a4e4bc8 · 1 Parent(s): c5a9938

Updated README

Files changed (1)
  1. README.md +138 -78
README.md CHANGED
@@ -1,65 +1,128 @@
- # Sieve : Customer Support Reinforcement Learning Environment

- Primarily there are gonna be three major tasks **Email Classification**, **Response Drafting** and **Support Session**

- ## Email Classification - Task 1

- The agent receives one email at a time and must classify it into a category and urgency using the `classify` action.

  **Step Rewards**
  - Correct category: `+0.15`
  - Wrong category: `-0.05`
  - Correct urgency: `+0.05`
  - Wrong urgency: `-0.02`
- - Wrong action type (not `classify`): `-0.05`
- - Step penalty (applied every step): `-0.005`

  **Final Grader Score**
- - Category accuracy accounts for `70%` of the final score
- - Urgency accuracy accounts for `30%` of the final score
 
- ## Response Drafting - Task 2

  The agent reads a customer email and drafts a professional response using the `respond` action.

  **Step Rewards**
- - Response length >= 50 characters: `+0.05`
- - Response length < 50 characters: `-0.10`
- - Keyword coverage: up to `+0.25` scaled by `matched / min_required` keywords
- - Negative/unprofessional tone (VADER negative score > 0.4): `-0.10`
- - Wrong action type (not `respond`): `-0.05`
- - Step penalty (applied every step): `-0.005`

  **Final Grader Score**
- - Keyword coverage (0.0–1.0) weighted at `0.80`
- - Length bonus of up to `0.20` for responses longer than 50 characters (scaled by `length / 200`)
- - Score is averaged across all emails in the task
 
- ## Support Session - Task 3

- The agent manages a queue of mixed emails and must prioritize, classify, and take the correct action on each one.

  **Step Rewards**
- - VIP email handled within first 4 positions: `+0.08`
- - VIP email handled at position 4 or later: `-0.05`
- - High urgency email handled within first 6 positions: `+0.05`
- - Low urgency email handled after position 6: `+0.03`
  - Correct category: `+0.04`
  - Correct urgency: `+0.02`
- - Correct action (`respond`, `escalate`, or `archive`): `+0.06`
  - Wrong action: `-0.03`
- - Response text provided and longer than 50 characters: `+0.02`
- - Spam email not archived: `-0.04`
- - Step penalty (applied every step): `-0.005`

  **Final Grader Score**
- - VIP prioritization: up to `0.20` (reduced to 40% if handled late)
- - High urgency prioritization: up to `0.10` (reduced to 40% if handled late)
  - Category accuracy: up to `0.15`
  - Urgency accuracy: up to `0.15`
  - Action accuracy: up to `0.30`
- - Email coverage (emails processed / total): up to `0.10`
- - Maximum possible score: `1.0`

  ## Data Models
 
@@ -124,52 +187,6 @@ The agent manages a queue of mixed emails and must prioritize, classify, and tak
  - `done` (`bool`) - Whether the episode has ended
  - `info` (`Dict`) - Additional diagnostic information

- ## Setup
-
- **Prerequisites:** Python 3.11+
-
- **Install dependencies**
- - `pip install -r requirements.txt`
-
- **Environment variables**
-
- - `API_BASE_URL` - LLM API endpoint (default: `https://router.huggingface.co/v1`)
- - `MODEL_NAME` - Model identifier (default: `Qwen/Qwen2.5-7B-Instruct`)
- - `OPENAI_API_KEY` - API key for the LLM provider
- - `HF_TOKEN` - Hugging Face token
- - `ENV_BASE_URL` - Running environment URL (default: `http://localhost:7860`)
-
- **Run the server**
- - `uvicorn app:app --host 0.0.0.0 --port 7860`
-
- **Run baseline inference**
- - `python inference.py`
-
- **Run with Docker**
- - `docker build -t sieve .`
- - `docker run -p 7860:7860 sieve`
-
- ## Baseline Scores
-
- Baseline agent: `gpt-4o-mini` via OpenAI API
-
- | Task | Score | Steps | Total Reward |
- |------|-------|-------|-------------|
- | Email Classification | 0.860 | 10 | 1.555 |
- | Response Drafting | 0.956 | 6 | 1.692 |
- | Support Session | 0.850 | 15 | 1.400 |
- | **Average** | **0.889** | - | - |
-
- ## Backend API
-
- | Method | Path | Description |
- |--------|------|-------------|
- | `POST` | `/reset?task_id=<id>` | Reset environment for a task, returns initial Observation |
- | `POST` | `/step` | Submit an Action, returns `{observation, reward, done, info}` |
- | `GET` | `/state` | Current environment state |
- | `GET` | `/tasks` | List all tasks with action schema |
- | `GET` | `/grader` | Current grader score (0.0–1.0) |
-
  ## Observation Space
 
  ```json
@@ -209,5 +226,48 @@ Baseline agent: `gpt-4o-mini` via OpenAI API
  }
  ```
 
 
+ # Sieve - Customer Support RL Environment

+ Sieve is a reinforcement learning environment that simulates a real-world customer support inbox. An AI agent interacts with it through a standard `reset() / step() / state()` HTTP API, receiving emails, taking actions, and earning rewards based on how well it handles each situation.

+ ## How It Works

+ ```
+ Agent                                  Sieve (FastAPI server)
+ |                                      |
+ |-- POST /reset?task_id=<id> --------> | Loads email queue, returns first Observation
+ |<- Observation ---------------------- |
+ |                                      |
+ |-- POST /step (Action) -------------> | Processes action, computes reward
+ |<- { observation, reward, done, info} |
+ |                                      |
+ |   ... repeat until done=true ...     |
+ |                                      |
+ |-- GET /grader ---------------------->| Returns final grader score (0.0–1.0)
+ ```
+
+ Each episode follows this loop:
+ - The agent calls `/reset` with a `task_id` to start a fresh episode and receive the initial `Observation`
+ - The agent reads the current email(s) from the observation and decides on an `Action`
+ - The agent posts the action to `/step` and receives the next `Observation`, a `Reward`, and a `done` flag
+ - When `done=true`, the agent calls `/grader` to get the final episode score
+
+ The reward at each step reflects immediate quality (correct classification, good response, right prioritization). A small step penalty of `-0.005` is applied every step to discourage unnecessary actions. The final grader score is a separate holistic metric computed over the full episode.
+
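Editor's sketch (not part of the commit): the episode loop described above could be driven by a small client like the one below. It uses only the endpoints named in the README (`/reset`, `/step`, `/grader`). The `policy` callable, the `max_steps` cap, and the injectable `transport` argument are hypothetical additions for illustration, and the exact `Action` payload is left to the caller because the schema is not shown in this diff.

```python
import json
import urllib.request
from typing import Any, Callable, Dict, List, Optional, Tuple

# Default ENV_BASE_URL taken from the Setup section of the README.
ENV_BASE_URL = "http://localhost:7860"


def http_json(method: str, path: str, body: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Send one request to the environment and decode the JSON reply."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        ENV_BASE_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def run_episode(
    task_id: str,
    policy: Callable[[Dict[str, Any]], Dict[str, Any]],
    transport: Callable[..., Dict[str, Any]] = http_json,
    max_steps: int = 100,
) -> Tuple[List[Any], Dict[str, Any]]:
    """Run reset -> step ... -> grader; return (per-step rewards, grader reply)."""
    observation = transport("POST", f"/reset?task_id={task_id}")
    rewards: List[Any] = []
    for _ in range(max_steps):  # safety cap in case `done` never arrives
        result = transport("POST", "/step", policy(observation))
        rewards.append(result["reward"])  # kept raw: the Reward shape is not shown here
        if result["done"]:
            break
        observation = result["observation"]
    return rewards, transport("GET", "/grader")
```

Passing a fake `transport` makes the loop easy to exercise without a running server; with the real server, `run_episode("<task_id>", my_policy)` would be enough.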
+ ## Project Structure
+
+ ```
+ .
+ ├── models.py       # Shared Pydantic models (Action, Observation, Reward, etc.)
+ ├── inference.py    # Baseline agent script using OpenAI client
+ ├── logger.py       # Structured [START]/[STEP]/[END] stdout logger
+ ├── openenv.yaml    # OpenEnv environment metadata
+ ├── pyproject.toml  # Project config and dependencies
+ ├── Dockerfile      # Container definition
+ └── server/
+     ├── app.py          # FastAPI application and API endpoints
+     ├── environment.py  # Core environment logic (step, reset, reward, grader)
+     ├── data.py         # Email datasets for all three tasks
+     └── config.py       # Action schema definition
+ ```
+
+ ## Tasks
+
+ ### Task 1 - Email Classification (Easy)
+
+ The agent receives one email at a time and must classify it using the `classify` action.
+
+ **Available action:** `classify` only

  **Step Rewards**
  - Correct category: `+0.15`
  - Wrong category: `-0.05`
  - Correct urgency: `+0.05`
  - Wrong urgency: `-0.02`
+ - Wrong action type: `-0.05`
+ - Step penalty: `-0.005`

  **Final Grader Score**
+ - Category accuracy: `70%` weight
+ - Urgency accuracy: `30%` weight
 
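Editor's sketch (not part of the commit): the Task 1 numbers above can be read as simple arithmetic. The snippet assumes the category, urgency, and step-penalty terms are added together for a `classify` action, which the README implies but does not state outright. The function names are hypothetical, and the wrong-action-type case is left out because the README does not say how it combines with the other terms.

```python
STEP_PENALTY = -0.005  # applied on every step, per the README


def classify_step_reward(category_ok: bool, urgency_ok: bool) -> float:
    """Reward for one `classify` step, assuming the listed terms are additive."""
    reward = 0.15 if category_ok else -0.05
    reward += 0.05 if urgency_ok else -0.02
    return round(reward + STEP_PENALTY, 3)


def classification_grade(category_accuracy: float, urgency_accuracy: float) -> float:
    """Final grader score: 70% category accuracy plus 30% urgency accuracy."""
    return 0.70 * category_accuracy + 0.30 * urgency_accuracy


print(classify_step_reward(True, True))    # both labels correct
print(classify_step_reward(False, False))  # both labels wrong
print(classification_grade(0.9, 1.0))
```

Under this reading, a fully correct step is worth `0.15 + 0.05 - 0.005 = 0.195`, so ten perfect steps would total `1.95`.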
+ ---
+
+ ### Task 2 - Response Drafting (Medium)

  The agent reads a customer email and drafts a professional response using the `respond` action.

+ **Available action:** `respond` only
+
  **Step Rewards**
+ - Response >= 50 characters: `+0.05`
+ - Response < 50 characters: `-0.10`
+ - Keyword coverage: up to `+0.25` (scaled by `matched / min_required`)
+ - Negative/unprofessional tone (VADER neg > 0.4): `-0.10`
+ - Wrong action type: `-0.05`
+ - Step penalty: `-0.005`

  **Final Grader Score**
+ - Keyword coverage weighted at `0.80`
+ - Length bonus up to `0.20` (scaled by `length / 200`, requires length > 50)
+ - Averaged across all emails in the task
+
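Editor's sketch (not part of the commit): one plausible reading of the Task 2 grader is shown below. It assumes keyword coverage is `matched / min_required` capped at `1.0`, and that the length bonus is `0.20 * length / 200` capped at `0.20` and granted only when the response is longer than 50 characters. The README gives the weights and the scaling factor; the capping and the function names are assumptions.

```python
from typing import Sequence


def response_grade(matched: int, min_required: int, length: int) -> float:
    """Per-email score: 0.80 * keyword coverage + length bonus (up to 0.20)."""
    coverage = min(1.0, matched / min_required) if min_required > 0 else 0.0
    length_bonus = 0.20 * min(1.0, length / 200) if length > 50 else 0.0
    return 0.80 * coverage + length_bonus


def task_grade(per_email_scores: Sequence[float]) -> float:
    """Task score: the per-email scores averaged across all emails."""
    return sum(per_email_scores) / len(per_email_scores) if per_email_scores else 0.0


scores = [response_grade(3, 3, 200), response_grade(2, 4, 100), response_grade(0, 3, 40)]
print(scores, task_grade(scores))
```

The middle example works out as `0.80 * 0.5 + 0.20 * 0.5 = 0.5`: half the required keywords and half the length needed for the full bonus.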
+ ---

+ ### Task 3 - Full Support Session (Hard)

+ The agent manages a queue of 15 mixed emails. It must choose which email to handle, classify it, and take the right action, all in the correct priority order.
+
+ **Available actions:** `respond`, `escalate`, `archive`, `skip`
+
+ **Priority rules**
+ - VIP customers (`sender_tier=vip`) must be handled before standard customers
+ - High urgency emails take precedence over medium and low
+ - Security breaches and VIP incidents → `escalate`
+ - Spam and feature requests → `archive`
+ - Standard billing and technical issues → `respond`
+ - Use `email_id` in the action to select which email to process
 
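Editor's sketch (not part of the commit): the first two priority rules amount to a two-level sort, VIP before standard and then high before medium before low. The snippet below assumes each queued email is a dict with `email_id` and `sender_tier` (both named in the README) plus an `urgency` value of `high`, `medium`, or `low`. That `urgency` field is hypothetical here and would in practice be the agent's own estimate, since urgency is something the agent has to classify.

```python
from typing import Any, Dict, List, Tuple

URGENCY_RANK = {"high": 0, "medium": 1, "low": 2}


def priority_key(email: Dict[str, Any]) -> Tuple[int, int]:
    """Sort key: VIP senders first, then by urgency (unknown urgency last)."""
    is_vip = email.get("sender_tier") == "vip"
    return (0 if is_vip else 1, URGENCY_RANK.get(email.get("urgency"), 3))


def handling_order(queue: List[Dict[str, Any]]) -> List[str]:
    """Return email_ids in the order the priority rules suggest handling them."""
    return [email["email_id"] for email in sorted(queue, key=priority_key)]


queue = [
    {"email_id": "e1", "sender_tier": "standard", "urgency": "high"},
    {"email_id": "e2", "sender_tier": "vip", "urgency": "low"},
    {"email_id": "e3", "sender_tier": "vip", "urgency": "high"},
]
print(handling_order(queue))
```

Because Python's sort is stable, emails with equal priority keep their original queue order.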
  **Step Rewards**
+ - VIP email handled in first 4 positions: `+0.08`
+ - VIP email delayed (position >= 4): `-0.05`
+ - High urgency email in first 6 positions: `+0.05`
+ - Low urgency email after position 6: `+0.03`
  - Correct category: `+0.04`
  - Correct urgency: `+0.02`
+ - Correct action: `+0.06`
  - Wrong action: `-0.03`
+ - Response text provided and > 50 characters: `+0.02`
+ - Spam not archived: `-0.04`
+ - Step penalty: `-0.005`

  **Final Grader Score**
+ - VIP prioritization: up to `0.20` (40% credit if handled late)
+ - High urgency prioritization: up to `0.10` (40% credit if handled late)
  - Category accuracy: up to `0.15`
  - Urgency accuracy: up to `0.15`
  - Action accuracy: up to `0.30`
+ - Email coverage: up to `0.10`
+ - Maximum: `1.0`
+
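Editor's sketch (not part of the commit): the six Task 3 grader components sum to `1.0`, which the snippet below checks. It then combines them as a weighted sum of per-component fractions in `[0, 1]`. How each fraction is computed is not shown in this diff, so `priority_fraction` (full credit for on-time emails, 40% credit for late ones, averaged) is only one possible interpretation, and all names are hypothetical.

```python
from typing import Dict

WEIGHTS: Dict[str, float] = {
    "vip_priority": 0.20,
    "high_urgency_priority": 0.10,
    "category_accuracy": 0.15,
    "urgency_accuracy": 0.15,
    "action_accuracy": 0.30,
    "coverage": 0.10,
}
LATE_CREDIT = 0.40  # "40% credit if handled late"


def priority_fraction(on_time: int, late: int) -> float:
    """Average credit over prioritized emails; 0.0 if there were none (assumption)."""
    total = on_time + late
    return (on_time + LATE_CREDIT * late) / total if total else 0.0


def session_grade(fractions: Dict[str, float]) -> float:
    """Weighted sum of component fractions; missing components count as 0."""
    return sum(weight * fractions.get(name, 0.0) for name, weight in WEIGHTS.items())


print(round(sum(WEIGHTS.values()), 10))
print(session_grade({"vip_priority": priority_fraction(1, 1), "action_accuracy": 1.0}))
```

With one VIP email on time and one late, the VIP fraction is `(1 + 0.4) / 2 = 0.7`, contributing `0.20 * 0.7 = 0.14` to the final score.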
+ ---

  ## Data Models

@@ -124,52 +187,6 @@
  - `done` (`bool`) - Whether the episode has ended
  - `info` (`Dict`) - Additional diagnostic information

  ## Observation Space
 
  ```json
 
@@ -209,5 +226,48 @@
  }
  ```
 
+ ## Backend API

+ | Method | Path | Description |
+ |--------|------|-------------|
+ | `POST` | `/reset?task_id=<id>` | Reset environment for a task, returns initial Observation |
+ | `POST` | `/step` | Submit an Action, returns `{observation, reward, done, info}` |
+ | `GET` | `/state` | Current environment state |
+ | `GET` | `/tasks` | List all tasks with action schema |
+ | `GET` | `/grader` | Current grader score (0.0–1.0) |
+
+ ## Setup
+
+ **Prerequisites:** Python 3.11+, [uv](https://github.com/astral-sh/uv)
+
+ **Install dependencies**
+ - `uv sync`

+ **Environment variables**
+
+ - `API_BASE_URL` - LLM API endpoint (default: `https://router.huggingface.co/v1`)
+ - `MODEL_NAME` - Model identifier (default: `Qwen/Qwen2.5-7B-Instruct`)
+ - `OPENAI_API_KEY` - API key for the LLM provider
+ - `HF_TOKEN` - Hugging Face token
+ - `ENV_BASE_URL` - Running environment URL (default: `http://localhost:7860`)
+
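Editor's sketch (not part of the commit): a script could read the environment variables listed above, with the documented defaults, roughly as follows. The `Settings` dataclass and `load_settings` helper are hypothetical names; how `inference.py` actually reads its configuration is not shown in this diff.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    api_base_url: str
    model_name: str
    openai_api_key: Optional[str]  # no default documented, so it may be absent
    hf_token: Optional[str]        # no default documented, so it may be absent
    env_base_url: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read the README's variables, falling back to the documented defaults."""
    source = os.environ if env is None else env
    return Settings(
        api_base_url=source.get("API_BASE_URL", "https://router.huggingface.co/v1"),
        model_name=source.get("MODEL_NAME", "Qwen/Qwen2.5-7B-Instruct"),
        openai_api_key=source.get("OPENAI_API_KEY"),
        hf_token=source.get("HF_TOKEN"),
        env_base_url=source.get("ENV_BASE_URL", "http://localhost:7860"),
    )


print(load_settings({}).env_base_url)
```

Accepting an optional mapping keeps the helper testable without touching the real process environment.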
+ **Run the server**
+ - `uvicorn server.app:app --host 0.0.0.0 --port 7860`
+
+ **Run baseline inference**
+ - `python inference.py`
+
+ **Run with Docker**
+ - `docker build -t sieve .`
+ - `docker run -p 7860:7860 -e OPENAI_API_KEY=... sieve`
+
+ ## Baseline Scores
+
+ Baseline agent: `gpt-4o-mini` via OpenAI API
+
+ | Task | Score | Steps | Total Reward |
+ |------|-------|-------|--------------|
+ | Email Classification | 0.930 | 10 | 1.755 |
+ | Response Drafting | 0.956 | 6 | 1.692 |
+ | Support Session | 0.870 | 15 | 1.490 |
+ | **Average** | **0.919** | - | - |