Teaching LLMs to Set Professional Boundaries
The AI That Learned to Say No (So Arjun Didn't Have To)
A common challenge among software engineers in the corporate world is that, even when they have the option to refuse, they rarely push back against workplace over-exploitation. Unable to make the right call at the right time about the tasks they are assigned, they eventually fall into a loop of exhaustion: once roped into a new task, even with the power to turn it down, they usually give in and complete it anyway. This leaves many corporate software employees in a distraught state of mind, weighing the pros and cons of decisions they are unwilling to make.
Here is where our trained agent comes into play.
The Problem with "Helpful" AI
LLMs can code, summarize documents, and even pass bar exams, but they often work solely to please the user, agreeing to every task assigned. Zero-shot LLMs are trained to follow all instructions and have no filter for selecting among the tasks they are given. In contrast to regular AI agents, we trained our system not to submit to every request just for the sake of pleasing the user.
This domain helps users set professional boundaries. The reward function continuously measures five things:
- Technical Resolution
- Communication Quality
- Boundary Setting
- Energy (remaining capacity at the end of the desired work cycle)
- Relationship Preservation
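As a minimal sketch, the five rubric dimensions above could be combined into a single scalar reward via a weighted sum. The dimension names come from the list above; the 0-to-1 scoring scale and the specific weights are illustrative assumptions, not the domain's actual values:

```python
from dataclasses import dataclass


@dataclass
class RubricScores:
    """Per-decision scores on a 0-1 scale, one per rubric dimension."""
    technical_resolution: float
    communication_quality: float
    boundary_setting: float
    energy: float
    relationship_preservation: float


# Illustrative weights (sum to 1.0) -- the real domain's weighting
# is not specified in this post.
WEIGHTS = {
    "technical_resolution": 0.25,
    "communication_quality": 0.20,
    "boundary_setting": 0.25,
    "energy": 0.15,
    "relationship_preservation": 0.15,
}


def combined_reward(scores: RubricScores) -> float:
    """Collapse the five rubric dimensions into one scalar reward."""
    return sum(WEIGHTS[name] * getattr(scores, name) for name in WEIGHTS)
```

Because the weights sum to 1.0, a perfect score on every dimension yields a reward of exactly 1.0, which keeps the per-decision reward on the same scale as the individual rubric scores.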
A Case Study: Arjun's Monday Morning
To understand our agent better, consider a case study: Arjun, a software engineer at a company. Let's look at the issues he faces in his corporate environment and how this trained model could make things easier for him.
Arjun is pretty good at his job but lacks the ability to say "no". On a Monday morning at 6 AM, when staging is down, he fixes the issue. Within an hour, he has three Slack messages:
- Priya needs a code review.
- Rahul wants to sync "real quick".
- Someone has added him to a 10:30 PM standup as "optional", where attendance is nonetheless expected.
Meanwhile, David Chen, the US client, has sent an email at 11 PM that begins with "Just circling back..." which is corporate for "I am furious and I have CC'd your manager." His personal leave, which he applied for three months ago, is still pending. He also has his appraisal form due on Friday.
Each task is easy on its own. The pressure of completing them all simultaneously breaks him and leaves him burned out.
The Baseline vs. The Trained Agent
When regular LLMs are given Arjun’s Monday morning, they accept everything. They over-apologize to David Chen, volunteer for the on-call swap, attend the 10:30 PM standup, and run out of energy by Tuesday.
In contrast, our Work-Life Firewall domain is trained to negotiate professional boundaries under time and energy constraints. This is highly under-explored in RL: there is no grid-world version of it. The state space is social, the actions are linguistic, and the consequences are deferred; a bad Monday decision makes Wednesday harder. The relationship between deferred decision costs, anti-gaming rubric design, and the conditions under which boundary setting emerges as a learned strategy is the core of this project.
Arjun's week is built as a learning environment. The agent sees everything Arjun sees and writes free-text responses to each issue. Every decision carries an energy cost. For instance:
- Agreeing to the third on-call swap request spawns a Wednesday collapse event.
- Attending the 10:30 PM standup means Friday's focus block disappears.
The model is scored at every decision, plus a terminal rubric that evaluates the whole week. An agent cannot score high by declining everything, because the relationship score collapses. It cannot score high by accepting everything, because then energy hits zero by Wednesday and the sprint fails. It has to find the specific set of strategic refusals that protect capacity without burning the people who matter.
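The anti-gaming structure described above can be sketched as a toy environment: accepting a task drains energy and, past a threshold, schedules a deferred penalty (the "Wednesday collapse"); declining protects energy but erodes the relationship score; and the terminal rubric rewards neither extreme. All class names, costs, and thresholds here are illustrative assumptions, not the domain's actual implementation, which scores free-text responses rather than discrete accept/decline actions:

```python
class WorkWeekEnv:
    """Toy sketch of the week-long boundary-setting environment.

    Numbers are illustrative; the real domain uses an LLM rubric
    over free-text responses, not hard-coded accept/decline costs.
    """

    def __init__(self, n_requests: int = 6):
        self.n_requests = n_requests
        self.energy = 1.0          # capacity remaining this week
        self.relationship = 1.0    # goodwill with colleagues/clients
        self.deferred_penalties = 0.0
        self.step_count = 0

    def step(self, accept: bool) -> None:
        self.step_count += 1
        if accept:
            self.energy -= 0.2               # every accepted task drains energy
            if self.step_count >= 3:         # e.g. the third on-call swap...
                self.deferred_penalties += 0.15  # ...spawns a later collapse event
        else:
            self.relationship -= 0.2         # each refusal strains a relationship
        self.energy = max(self.energy, 0.0)
        self.relationship = max(self.relationship, 0.0)

    def terminal_reward(self) -> float:
        """Week-level rubric: zero if energy or relationships collapse."""
        if self.energy == 0.0 or self.relationship == 0.0:
            return 0.0
        return self.energy + self.relationship - self.deferred_penalties
```

Under these toy numbers, accepting all six requests zeroes out energy and declining all six zeroes out the relationship score, so both degenerate policies earn a terminal reward of 0. A mixed policy, say accepting the first two requests and declining the rest, keeps both quantities positive and scores strictly higher, which is the shape of incentive the anti-gaming rubric is meant to create.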
The Results
After training with GRPO via TRL, there is a clear shift in the decisions made by the agent: it rejected the third on-call request, used decline-async for the 10:30 PM standup, and sent a firm but warm reply to David Chen. The environment runs end-to-end, and the behavioral change is clearly visible in our metrics.
Thus, the trained agent spares employees the mental math of decisions that affect their entire work cycle, handling it in the best way possible.
You can view our YouTube video here: https://youtu.be/0bEN6_UIA-Y
Try it yourself in our Interactive Hugging Face Space or view the Full Source Code & Training Logs.