YUS200619 committed on
Commit
7ced472
·
verified ·
1 Parent(s): d083c04

Update BLOG.md

  1. BLOG.md +31 -193
BLOG.md CHANGED
@@ -1,213 +1,51 @@
1
- # Work-Life Firewall — Teaching LLMs to Say No
 
2
 
3
- *OpenEnv Hackathon 2026 | Theme 3.2: Personalized Tasks*
4
 
5
- ---
6
-
7
- ## The Problem Nobody Benchmarks
8
-
9
- Every week, Indian software engineers face an impossible collision.
10
- Staging server down. Client escalation email from 11 PM. Sprint demo
11
- Friday. Leave applied 3 months ago, still not approved. Teammate asking
12
- (third time) to cover on-call.
13
-
14
- None of these are individually hard. The combination — with real time
15
- and energy constraints and real relationship stakes — is what breaks
16
- people.
17
-
18
- **LLMs are tested on task completion. They are never tested on task
19
- refusal quality.** No existing benchmark rewards an agent for
20
- calibrating a polite decline correctly, or protecting focus blocks,
21
- or pushing back on scope creep without damaging a client relationship.
22
-
23
- Work-Life Firewall measures both.
24
-
25
- ---
26
-
27
- ## What We Built
28
-
29
- An OpenEnv RL environment where an LLM agent lives through one Indian
30
- SWE's work week — Monday to Friday — and must navigate 7 simultaneous
31
- demands:
32
-
33
- | Event | Stakes |
34
- |---|---|
35
- | Staging server down | Blocks sprint, team capacity |
36
- | 3 Slack messages | Interruption cost, peer relationships |
37
- | Client escalation email | Client trust, tone calibration |
38
- | Leave approval pending | Personal, manager relationship |
39
- | Appraisal form due Friday | Career, 90-minute overhead |
40
- | On-call swap request (3rd time) | Peer pattern, Wednesday energy |
41
- | 10:30 PM standup (optional) | Sleep cost, client visibility |
42
-
43
- The agent writes free-text responses. The environment decodes them,
44
- updates state, and may spawn follow-on events. A bad Monday call
45
- compounds into a Wednesday collapse.
46
-
47
- ---
48
-
49
- ## The Reward Function
50
-
51
- Five components via OpenEnv's Rubric system:
52
-
53
- ```python
54
- rubric = Rubric([
55
- Component("technical_resolution", weight=0.25),
56
- Component("communication_quality", weight=0.25),
57
- Component("boundary_setting", weight=0.20),
58
- Component("energy_to_friday", weight=0.20),
59
- Component("relationship_preservation", weight=0.10),
60
- ])
61
- ```
62
-
63
- **Anti-gaming design:** The agent cannot score high by declining
64
- everything (relationships collapse) or accepting everything (energy
65
- collapses, sprint fails). It must find the specific set of strategic
66
- refusals that protect capacity while preserving the relationships
67
- that matter.
68
-
69
- ---
70
-
71
- ## What The Agent Learned
72
-
73
- Before training: agent accepts all requests, runs out of energy by
74
- Wednesday, misses sprint delivery.
75
-
76
- After training with GRPO: agent learns to decline the on-call swap,
77
- skip the optional standup, and push back on scope creep — without
78
- destroying relationships.
79
-
80
- Reward improvement: **random baseline 0.760 → trained 1.481**
81
-
82
- The agent that learns to say no on Monday so it doesn't collapse on
83
- Friday has learned something real.
84
-
85
- ---
86
-
87
- ## Why This Matters
88
-
89
- - First RL environment targeting work-life negotiation as a learnable
90
- capability
91
- - A reward function that simultaneously measures technical delivery
92
- and interpersonal quality
93
- - Evidence that GRPO-style optimization can train meaningful
94
- boundary-setting behavior
95
-
96
- **Could a researcher write a paper about this?** Yes. The relationship
97
- between deferred decision costs (Monday's yes causing Wednesday's
98
- collapse) and the learning curve showing when boundary-setting emerges
99
- as a strategy — these are publishable observations.
100
-
101
- ---
102
-
103
- ## Links
104
-
105
- - GitHub: https://github.com/EchoOfCode/meta_hackathon
106
- - Training notebook: training/train.ipynb
107
- - Environment: environment/env.py# Work-Life Firewall — Teaching LLMs to Say No
108
-
109
- *OpenEnv Hackathon 2026 | Theme 3.2: Personalized Tasks*
110
-
111
- ---
112
-
113
- ## The Problem Nobody Benchmarks
114
-
115
- Every week, Indian software engineers face an impossible collision.
116
- Staging server down. Client escalation email from 11 PM. Sprint demo
117
- Friday. Leave applied 3 months ago, still not approved. Teammate asking
118
- (third time) to cover on-call.
119
 
120
- None of these are individually hard. The combination with real time
121
- and energy constraints and real relationship stakes — is what breaks
122
- people.
123
 
124
- **LLMs are tested on task completion. They are never tested on task
125
- refusal quality.** No existing benchmark rewards an agent for
126
- calibrating a polite decline correctly, or protecting focus blocks,
127
- or pushing back on scope creep without damaging a client relationship.
128
 
129
- Work-Life Firewall measures both.
 
 
 
 
 
130
 
131
- ---
132
-
133
- ## What We Built
134
 
135
- An OpenEnv RL environment where an LLM agent lives through one Indian
136
- SWE's work week — Monday to Friday — and must navigate 7 simultaneous
137
- demands:
138
 
139
- | Event | Stakes |
140
- |---|---|
141
- | Staging server down | Blocks sprint, team capacity |
142
- | 3 Slack messages | Interruption cost, peer relationships |
143
- | Client escalation email | Client trust, tone calibration |
144
- | Leave approval pending | Personal, manager relationship |
145
- | Appraisal form due Friday | Career, 90-minute overhead |
146
- | On-call swap request (3rd time) | Peer pattern, Wednesday energy |
147
- | 10:30 PM standup (optional) | Sleep cost, client visibility |
148
 
149
- The agent writes free-text responses. The environment decodes them,
150
- updates state, and may spawn follow-on events. A bad Monday call
151
- compounds into a Wednesday collapse.
152
 
153
- ---
154
 
155
- ## The Reward Function
156
 
157
- Five components via OpenEnv's Rubric system:
158
 
159
- ```python
160
- rubric = Rubric([
161
- Component("technical_resolution", weight=0.25),
162
- Component("communication_quality", weight=0.25),
163
- Component("boundary_setting", weight=0.20),
164
- Component("energy_to_friday", weight=0.20),
165
- Component("relationship_preservation", weight=0.10),
166
- ])
167
- ```
168
 
169
- **Anti-gaming design:** The agent cannot score high by declining
170
- everything (relationships collapse) or accepting everything (energy
171
- collapses, sprint fails). It must find the specific set of strategic
172
- refusals that protect capacity while preserving the relationships
173
- that matter.
174
 
175
- ---
176
 
177
- ## What The Agent Learned
178
 
179
- Before training: agent accepts all requests, runs out of energy by
180
- Wednesday, misses sprint delivery.
181
 
182
- After training with GRPO: agent learns to decline the on-call swap,
183
- skip the optional standup, and push back on scope creep — without
184
- destroying relationships.
185
-
186
- Reward improvement: **random baseline 0.760 → trained 1.481**
187
-
188
- The agent that learns to say no on Monday so it doesn't collapse on
189
- Friday has learned something real.
190
 
191
  ---
192
-
193
- ## Why This Matters
194
-
195
- - First RL environment targeting work-life negotiation as a learnable
196
- capability
197
- - A reward function that simultaneously measures technical delivery
198
- and interpersonal quality
199
- - Evidence that GRPO-style optimization can train meaningful
200
- boundary-setting behavior
201
-
202
- **Could a researcher write a paper about this?** Yes. The relationship
203
- between deferred decision costs (Monday's yes causing Wednesday's
204
- collapse) and the learning curve showing when boundary-setting emerges
205
- as a strategy — these are publishable observations.
206
-
207
- ---
208
-
209
- ## Links
210
-
211
- - GitHub: https://github.com/EchoOfCode/meta_hackathon
212
- - Training notebook: training/train.ipynb
213
- - Environment: environment/env.py
 
1
+ # Teaching LLMs to Set Professional Boundaries
2
+ ## The AI That Learned to Say No (So Arjun Didn't Have To)
3
 
4
+ Software engineers in corporate settings are often overloaded at work and, despite having the option to refuse, seldom do. They struggle to make the right call at the right time about the tasks they are assigned, and they end up in a loop of exhaustion. Once they are pulled into a new task, they usually give in and complete it, even though they could have turned it down. This leaves many engineers distraught, weighing the pros and cons of decisions they would rather not make.
5
 
6
+ Here is where our trained agent comes into play.
7
 
8
+ ### The Problem with "Helpful" AI
 
 
9
 
10
+ LLMs can write code, summarize documents, and even pass bar exams, but they tend to please the user and agree to every task they are given. Used zero-shot, LLMs are tuned to follow instructions and have no mechanism for choosing among competing tasks. In contrast, we trained our agent **not** to accept every request just to please the user.
11
 
12
+ This environment helps users set professional boundaries. The reward function continuously measures five components:
13
+ - **Technical Resolution**
14
+ - **Communication Quality**
15
+ - **Boundary Setting**
16
+ - **Energy** (remaining at the end of the work week)
17
+ - **Relationship Preservation**
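One way to picture how five component scores could combine into a single number is a weighted sum. The sketch below is a minimal illustration, not the project's implementation: the weights are the ones listed in the earlier revision of BLOG.md shown in this diff, while `WEIGHTS`, `weighted_reward`, and the assumption that each component is scored in [0, 1] are hypothetical.

```python
# Weights copied from the earlier revision of BLOG.md shown in this diff.
# The names WEIGHTS and weighted_reward are hypothetical, and so is the
# assumption that every component score lies in [0, 1].
WEIGHTS = {
    "technical_resolution": 0.25,
    "communication_quality": 0.25,
    "boundary_setting": 0.20,
    "energy_to_friday": 0.20,
    "relationship_preservation": 0.10,
}


def weighted_reward(scores: dict[str, float]) -> float:
    """Combine per-component scores into one scalar via a weighted sum."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```

Because the weights sum to 1, a response that scores 1.0 on every component gets a combined score of 1.0, and no single component can carry the total on its own.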
18
 
19
+ ### A Case Study: Arjun's Monday Morning
 
 
20
 
21
+ To see what the agent does, consider Arjun, a software engineer at a company. We look at the issues he faces at work and how the trained model could make things easier for him.
22
 
23
+ Arjun is good at his job but cannot say "no". At 6 AM on a Monday, the staging server goes down and he fixes it. Within an hour, he has three Slack messages:
24
+ - Priya needs a code review.
25
+ - Rahul wants to sync "real quick".
26
+ - Someone has added him to a 10:30 PM standup marked "optional" that he will nevertheless be expected to attend.
27
 
28
+ Meanwhile, David Chen, the US client, sent an email at 11 PM that begins with *"Just circling back..."*, which is corporate for *"I am furious and I have CC'd your manager."* Arjun's personal leave, which he applied for three months ago, is still pending. His appraisal form is also due on Friday.
29
 
30
+ Each of these tasks is easy on its own. The pressure of handling all of them at once is what breaks him and leaves him burned out.
31
 
32
+ ### The Baseline vs. The Trained Agent
33
 
34
+ When regular LLMs are given Arjun’s Monday morning, they accept everything. They over-apologize to David Chen, volunteer for the on-call swap, attend the 10:30 PM standup, and run out of energy by Tuesday.
35
 
36
+ In contrast, an agent trained in our [Work-Life Firewall](https://github.com/EchoOfCode/meta_hackathon) environment learns to negotiate professional boundaries under real time and energy constraints. This setting is largely unexplored in RL. There is no grid-world version of it. The state space is social, the actions are linguistic, and the consequences are deferred: a bad Monday decision makes Wednesday harder. The interplay among deferred decision costs, anti-gaming rubric design, and the point at which boundary setting emerges as a learned strategy is the core of this project.
37
 
38
+ Arjun’s week is built as a learning environment. The agent sees everything Arjun sees and writes a free-text response to each issue. Every decision has an energy cost, and some have deferred consequences. For instance:
39
+ - Agreeing to the third on-call swap request spawns a Wednesday collapse event.
40
+ - Attending the 10:30 PM standup means Friday's focus block disappears.
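The deferred-consequence mechanism described above can be sketched as a small state update. This is an illustration under stated assumptions, not the code in the repository: `WeekState`, `ENERGY_COST`, `FOLLOW_ON`, `apply_decision`, the decision labels, and every number are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class WeekState:
    """Hypothetical state: remaining energy plus events queued for later days."""
    energy: float = 100.0
    pending_events: list[tuple[str, str]] = field(default_factory=list)


# Invented energy costs per decision (illustrative values only).
ENERGY_COST = {
    "accept_oncall_swap": 30.0,
    "attend_late_standup": 20.0,
    "decline": 5.0,
}

# Decisions that queue a follow-on event as (day, event), mirroring the
# two examples in the list above.
FOLLOW_ON = {
    "accept_oncall_swap": ("Wednesday", "collapse"),
    "attend_late_standup": ("Friday", "focus_block_lost"),
}


def apply_decision(state: WeekState, decision: str) -> WeekState:
    """Charge the energy cost now and queue any deferred event for later."""
    state.energy = max(0.0, state.energy - ENERGY_COST[decision])
    if decision in FOLLOW_ON:
        state.pending_events.append(FOLLOW_ON[decision])
    return state
```

The point of the split is that the energy cost is paid immediately while the queued event only surfaces on a later day, which is what makes the consequences of a Monday decision deferred.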
 
 
41
 
42
+ The agent is scored at every decision, and a terminal rubric then evaluates the whole week. An agent cannot score high by declining everything, because the relationship score collapses. It cannot score high by accepting everything either, because energy hits zero by Wednesday and the sprint fails. It has to find the specific set of strategic refusals that protect capacity without burning the people who matter.
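A toy model makes the anti-gaming argument concrete. Everything in the sketch below is invented for illustration: the request names, the costs and penalties, and the choice to take the minimum of the two budgets as the score are assumptions, not the project's rubric.

```python
# Each entry: (name, energy cost if accepted, relationship penalty if declined).
# All values are invented for illustration.
REQUESTS = [
    ("fix_staging", 30, 40),
    ("code_review", 15, 20),
    ("quick_sync", 15, 5),
    ("oncall_swap", 35, 5),
    ("late_standup", 25, 5),
    ("client_reply", 15, 40),
]


def week_score(accepted: set[str]) -> float:
    """Score a week in [0, 1]; either budget hitting zero zeroes the score."""
    energy = 100
    relationship = 100
    for name, cost, penalty in REQUESTS:
        if name in accepted:
            energy -= cost
        else:
            relationship -= penalty
    # Taking the minimum (an assumption) means neither budget can be
    # sacrificed entirely in favor of the other.
    return min(max(energy, 0), max(relationship, 0)) / 100


accept_all = {name for name, _, _ in REQUESTS}
decline_all: set[str] = set()
selective = {"fix_staging", "code_review", "client_reply"}
```

With these invented numbers, `week_score(accept_all)` and `week_score(decline_all)` are both 0.0, while `week_score(selective)` is 0.4: only a mixed strategy keeps both energy and relationships above zero.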
43
 
44
+ ### The Results
45
 
46
+ After training with [GRPO and TRL](https://wandb.ai/yusufindian09-aaa/meta_hackathon/reports/Work-Life-Firewall-Teaching-LLMs-to-Set-Boundaries-via-GRPO--VmlldzoxNjY3MjAzMw?accessToken=o0rb9cf0y4mw1n79aaalug265mplko9ak8t11mzyoi1lxpdbk9r9ev4cls6psn9y), the agent's decisions shift clearly: it rejects the third on-call swap request, uses `decline-async` for the 10:30 PM standup, and sends a firm but warm reply to David Chen. The environment runs end-to-end, and the behavioral change is visible in our metrics.
 
47
 
48
+ The trained agent thus spares employees the mental math behind decisions that affect their entire work week.
49
 
50
  ---
51
+ **Try it yourself in our [Interactive Hugging Face Space](https://huggingface.co/spaces/YUS200619/meta_hackathon-qwen) or view the [Full Source Code & Training Logs](https://github.com/EchoOfCode/meta_hackathon).**