Spaces: YUS200619/meta_hackathon-qwen

YUS200619 committed 29 days ago

Commit 7ced472 · verified · 1 Parent(s): d083c04

Update BLOG.md

Files changed (1)

BLOG.md +31 -193

@@ -1,213 +1,51 @@
-# Work-Life Firewall — Teaching LLMs to Say No
-*OpenEnv Hackathon 2026 | Theme 3.2: Personalized Tasks*
----
-## The Problem Nobody Benchmarks
-Every week, Indian software engineers face an impossible collision.
-Staging server down. Client escalation email from 11 PM. Sprint demo
-Friday. Leave applied 3 months ago, still not approved. Teammate asking
-(third time) to cover on-call.
-None of these are individually hard. The combination — with real time
-and energy constraints and real relationship stakes — is what breaks
-people.
-**LLMs are tested on task completion. They are never tested on task
-refusal quality.** No existing benchmark rewards an agent for
-calibrating a polite decline correctly, or protecting focus blocks,
-or pushing back on scope creep without damaging a client relationship.
-Work-Life Firewall measures both.
----
-## What We Built
-An OpenEnv RL environment where an LLM agent lives through one Indian
-SWE's work week — Monday to Friday — and must navigate 7 simultaneous
-demands:
-| Event | Stakes |
-|---|---|
-| Staging server down | Blocks sprint, team capacity |
-| 3 Slack messages | Interruption cost, peer relationships |
-| Client escalation email | Client trust, tone calibration |
-| Leave approval pending | Personal, manager relationship |
-| Appraisal form due Friday | Career, 90-minute overhead |
-| On-call swap request (3rd time) | Peer pattern, Wednesday energy |
-| 10:30 PM standup (optional) | Sleep cost, client visibility |
-The agent writes free-text responses. The environment decodes them,
-updates state, and may spawn follow-on events. A bad Monday call
-compounds into a Wednesday collapse.
----
-## The Reward Function
-Five components via OpenEnv's Rubric system:
-```python
-rubric = Rubric([
-    Component("technical_resolution",    weight=0.25),
-    Component("communication_quality",   weight=0.25),
-    Component("boundary_setting",        weight=0.20),
-    Component("energy_to_friday",        weight=0.20),
-    Component("relationship_preservation", weight=0.10),
-])
-```
-**Anti-gaming design:** The agent cannot score high by declining
-everything (relationships collapse) or accepting everything (energy
-collapses, sprint fails). It must find the specific set of strategic
-refusals that protect capacity while preserving the relationships
-that matter.
----
-## What The Agent Learned
-Before training: agent accepts all requests, runs out of energy by
-Wednesday, misses sprint delivery.
-After training with GRPO: agent learns to decline the on-call swap,
-skip the optional standup, and push back on scope creep — without
-destroying relationships.
-Reward improvement: **random baseline 0.760 → trained 1.481**
-The agent that learns to say no on Monday so it doesn't collapse on
-Friday has learned something real.
----
-## Why This Matters
-- First RL environment targeting work-life negotiation as a learnable
-  capability
-- A reward function that simultaneously measures technical delivery
-  and interpersonal quality
-- Evidence that GRPO-style optimization can train meaningful
-  boundary-setting behavior
-**Could a researcher write a paper about this?** Yes. The relationship
-between deferred decision costs (Monday's yes causing Wednesday's
-collapse) and the learning curve showing when boundary-setting emerges
-as a strategy — these are publishable observations.
----
-## Links
-- GitHub: https://github.com/EchoOfCode/meta_hackathon
-- Training notebook: training/train.ipynb
-- Environment: environment/env.py# Work-Life Firewall — Teaching LLMs to Say No
-*OpenEnv Hackathon 2026 | Theme 3.2: Personalized Tasks*
----
-## The Problem Nobody Benchmarks
-Every week, Indian software engineers face an impossible collision.
-Staging server down. Client escalation email from 11 PM. Sprint demo
-Friday. Leave applied 3 months ago, still not approved. Teammate asking
-(third time) to cover on-call.
-None of these are individually hard. The combination — with real time
-and energy constraints and real relationship stakes — is what breaks
-people.
-**LLMs are tested on task completion. They are never tested on task
-refusal quality.** No existing benchmark rewards an agent for
-calibrating a polite decline correctly, or protecting focus blocks,
-or pushing back on scope creep without damaging a client relationship.
-Work-Life Firewall measures both.
----
-## What We Built
-An OpenEnv RL environment where an LLM agent lives through one Indian
-SWE's work week — Monday to Friday — and must navigate 7 simultaneous
-demands:
-| Event | Stakes |
-|---|---|
-| Staging server down | Blocks sprint, team capacity |
-| 3 Slack messages | Interruption cost, peer relationships |
-| Client escalation email | Client trust, tone calibration |
-| Leave approval pending | Personal, manager relationship |
-| Appraisal form due Friday | Career, 90-minute overhead |
-| On-call swap request (3rd time) | Peer pattern, Wednesday energy |
-| 10:30 PM standup (optional) | Sleep cost, client visibility |
-The agent writes free-text responses. The environment decodes them,
-updates state, and may spawn follow-on events. A bad Monday call
-compounds into a Wednesday collapse.
----
-## The Reward Function
-Five components via OpenEnv's Rubric system:
-```python
-rubric = Rubric([
-    Component("technical_resolution",    weight=0.25),
-    Component("communication_quality",   weight=0.25),
-    Component("boundary_setting",        weight=0.20),
-    Component("energy_to_friday",        weight=0.20),
-    Component("relationship_preservation", weight=0.10),
-])
-```
-**Anti-gaming design:** The agent cannot score high by declining
-everything (relationships collapse) or accepting everything (energy
-collapses, sprint fails). It must find the specific set of strategic
-refusals that protect capacity while preserving the relationships
-that matter.
----
-## What The Agent Learned
-Before training: agent accepts all requests, runs out of energy by
-Wednesday, misses sprint delivery.
-After training with GRPO: agent learns to decline the on-call swap,
-skip the optional standup, and push back on scope creep — without
-destroying relationships.
-Reward improvement: **random baseline 0.760 → trained 1.481**
-The agent that learns to say no on Monday so it doesn't collapse on
-Friday has learned something real.
 ---
-## Why This Matters
-- First RL environment targeting work-life negotiation as a learnable
-  capability
-- A reward function that simultaneously measures technical delivery
-  and interpersonal quality
-- Evidence that GRPO-style optimization can train meaningful
-  boundary-setting behavior
-**Could a researcher write a paper about this?** Yes. The relationship
-between deferred decision costs (Monday's yes causing Wednesday's
-collapse) and the learning curve showing when boundary-setting emerges
-as a strategy — these are publishable observations.
----
-## Links
-- GitHub: https://github.com/EchoOfCode/meta_hackathon
-- Training notebook: training/train.ipynb
-- Environment: environment/env.py

+# Teaching LLMs to Set Professional Boundaries
+## The AI That Learned to Say No (So Arjun Didn't Have To)
+Software engineers in corporate workplaces are often overloaded, and even when they are free to refuse a task, they seldom do. Unable to make the right call at the right time about the work they are assigned, they fall into a loop of exhaustion: once they are pulled into a new task, they give in and complete it, even though they could have turned it down. This leaves many of them distraught, weighing the pros and cons of decisions they never wanted to make.
+Here is where our trained agent comes into play.
+### The Problem with "Helpful" AI
+LLMs can code, summarize documents, and even pass bar exams, but they tend to please the user and agree to every task they are assigned. Zero-shot LLMs are trained to follow instructions and have no filter for choosing among the tasks they are given. Unlike regular AI agents, our system is trained **not** to accept every request just to please the user.
+This domain helps users set professional boundaries. The reward function continuously measures five things:
+- **Technical Resolution**
+- **Communication Quality**
+- **Boundary Setting**
+- **Energy** (remaining at the end of the work cycle)
+- **Relationship Preservation**
+### A Case Study: Arjun's Monday Morning
+To understand our agent better, consider a software engineer named Arjun: the issues he faces in his corporate environment and how this trained model could make things easier for him.
+Arjun is good at his job but cannot say "no". At 6 AM on a Monday, staging goes down, and he fixes it. Within an hour, he has three Slack messages:
+- Priya needs a code review.
+- Rahul wants to sync "real quick".
+- Someone has added him to a 10:30 PM standup marked "optional", which he will still be expected to attend.
+Meanwhile, David Chen, the US client, sent an email at 11 PM that begins with *"Just circling back..."* which is corporate for *"I am furious and I have CC'd your manager."* Arjun's personal leave, which he applied for three months ago, is still pending. His appraisal form is also due on Friday.
+Each of these tasks is easy on its own. The pressure of handling all of them at once is what breaks him and leaves him burned out.
+### The Baseline vs. The Trained Agent
+When regular LLMs are given Arjun’s Monday morning, they accept everything. They over-apologize to David Chen, volunteer for the on-call swap, attend the 10:30 PM standup, and run out of energy by Tuesday.
+In contrast, our [Work-Life Firewall](https://github.com/EchoOfCode/meta_hackathon) domain trains an agent to negotiate professional boundaries under real time and energy constraints. This problem is highly under-explored in RL, and there is no grid-world version of it: the state space is social, the actions are linguistic, and the consequences are deferred. A bad Monday decision makes Wednesday harder. The relationship between deferred decision costs, anti-gaming rubric design, and the point at which boundary setting emerges as a learned strategy is the core of this project.
+Arjun’s week is built as a learning environment. The agent sees everything Arjun sees and writes free-text responses to each issue. Every decision has an energy cost. For instance:
+- Agreeing to the third on-call swap request spawns a Wednesday collapse event.
+- Attending the 10:30 PM standup means Friday's focus block disappears.
+The model is scored at every decision, and a terminal rubric then evaluates the whole week. An agent cannot score high by declining everything, because the relationship score collapses. It cannot score high by accepting everything, because energy hits zero by Wednesday and the sprint fails. It has to find the specific set of strategic refusals that protect capacity without burning the people who matter.
+### The Results
+After training using [GRPO and TRL](https://wandb.ai/yusufindian09-aaa/meta_hackathon/reports/Work-Life-Firewall-Teaching-LLMs-to-Set-Boundaries-via-GRPO--VmlldzoxNjY3MjAzMw?accessToken=o0rb9cf0y4mw1n79aaalug265mplko9ak8t11mzyoi1lxpdbk9r9ev4cls6psn9y), there is a clear shift in the agent's decisions. It declined the third on-call swap request, used `decline-async` for the 10:30 PM standup, and sent a firm but warm reply to David Chen. The environment runs end-to-end, and the behavioral change is clearly visible in our metrics.
+Thus, the trained AI agent spares employees the mental math of decisions that affect their entire work cycle and handles those decisions in the best way possible.
 ---
+**Try it yourself in our [Interactive Hugging Face Space](https://huggingface.co/spaces/YUS200619/meta_hackathon-qwen) or view the [Full Source Code & Training Logs](https://github.com/EchoOfCode/meta_hackathon).**
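
The anti-gaming property described in the post (neither "accept everything" nor "decline everything" can score well) can be illustrated with a small toy calculation. This is only a sketch, not the project's environment code. The weights come from the earlier revision of BLOG.md shown in the removed lines, and it is an assumption that the current rubric still uses them. The `Request` class, the `WEEK` list, the energy costs, and the way each component is derived in `evaluate` are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass

# Weights from the earlier revision of BLOG.md. Assumption: the current
# environment may use different values.
WEIGHTS = {
    "technical_resolution": 0.25,
    "communication_quality": 0.25,
    "boundary_setting": 0.20,
    "energy_to_friday": 0.20,
    "relationship_preservation": 0.10,
}


def weekly_score(components: dict[str, float]) -> float:
    """Weighted sum of component scores, each expected in [0, 1]."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


@dataclass(frozen=True)
class Request:
    """Hypothetical model of one demand on the engineer's week."""
    name: str
    energy_cost: float  # illustrative share of weekly energy if accepted
    essential: bool     # illustrative flag: declining hurts delivery and trust


# Hypothetical week. The costs are made up, not the environment's values.
WEEK = [
    Request("fix staging server", 0.30, True),
    Request("reply to client escalation", 0.20, True),
    Request("third on-call swap", 0.35, False),
    Request("10:30 PM optional standup", 0.25, False),
]


def evaluate(accepted: set[str]) -> float:
    """Score a policy given the set of request names it accepts."""
    spent = sum(r.energy_cost for r in WEEK if r.name in accepted)
    energy = max(0.0, 1.0 - spent)
    essential = [r for r in WEEK if r.essential]
    optional = [r for r in WEEK if not r.essential]
    done = sum(r.name in accepted for r in essential) / len(essential)
    declined = sum(r.name not in accepted for r in optional) / len(optional)
    # Toy rule: delivery fails if energy runs out before the week ends.
    technical = done if energy > 0 else 0.0
    return weekly_score({
        "technical_resolution": technical,
        "communication_quality": done,      # crude proxy for reply quality
        "boundary_setting": declined,
        "energy_to_friday": energy,
        "relationship_preservation": done,  # crude proxy for kept trust
    })


if __name__ == "__main__":
    policies = {
        "accept everything": {r.name for r in WEEK},
        "decline everything": set(),
        "strategic refusals": {r.name for r in WEEK if r.essential},
    }
    for label, accepted in policies.items():
        print(f"{label}: {evaluate(accepted):.2f}")
```

Under these toy numbers, the policy that accepts only the essential requests scores higher than both extremes. That is the shape of incentive the rubric is described as creating, though the real environment scores free-text responses rather than a set of accepted names.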