
SWEbench-IN – Indian SWE Linux Agent

Product Requirements Document (Final)

OpenEnv Hackathon 2026 | Theme 3.1 – World Modeling / Professional Tasks


1. Problem Statement

Software engineers in India operate inside one of the most complex work environments on earth. They fix production servers at 11 PM, handle US client escalations at midnight, manage sprint deadlines on Fridays, navigate passive-aggressive manager messages, and protect personal leave – all simultaneously.

No existing RL benchmark captures this. SWE-bench tests code repair on isolated GitHub issues. It has no time pressure, no communication burden, no competing human stakeholders.

SWEbench-IN trains an LLM agent to operate as a real Indian SWE – fixing broken Linux systems inside a real Docker container while managing real human communication under real time constraints.

The agent that learns to fix the server first, reply to the manager second, and protect its Thursday leave is the agent that has learned something no existing benchmark tests.


2. Hackathon Alignment

| Requirement | How We Meet It | Status |
| --- | --- | --- |
| OpenEnv latest release | Extends openenv.Environment base class | Build |
| Gym-style reset/step/state | Fully implemented, Docker-backed | Build |
| Training script | Colab notebook: GRPO via HF TRL + Unsloth | Build |
| Training evidence | plots/reward_curve.png + plots/loss_curve.png committed to repo | Build |
| HF Space | Public Docker Space, cloneable from a logged-out browser | Build |
| README | Links Space, Colab, blog post; plots embedded inline | Build |
| Blog/video | HF blog post, under a 2-minute read | Build |
| openenv.yaml | Valid manifest, parseable | Build |

Theme: 3.1 – World Modeling / Professional Tasks. The agent interacts with a real Linux environment, real bash commands, real pytest verification, and real file system state. It maintains consistent internal state across a multi-step episode and orchestrates technical work alongside communication tasks.


3. What the Agent Does

The Episode

Each episode is one work incident. The agent receives:

  • A broken Linux environment (server down, code with bugs, failing tests)
  • Human communication context (Slack message from manager, email from client)
  • A time budget (maximum 15 actions)
  • A hidden outcome (what success looks like)

The agent must fix the technical problem AND handle the communication. Both are required for full reward.
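
For concreteness, one way the per-episode observation could be packaged is sketched below; the dataclass and field names are illustrative assumptions, not the final environment.py types.

from dataclasses import dataclass, field

@dataclass
class EpisodeObservation:
    # Illustrative container for what reset()/step() hand back to the agent.
    task_id: str                                                # e.g. "task3_logic_bug"
    terminal_output: str                                        # stdout/stderr of the last action
    unread_messages: list[str] = field(default_factory=list)   # slack / email / hr text
    actions_remaining: int = 15                                 # remaining time budget
    done: bool = False                                          # True after close_case or budget exhausted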

The Agent's World

/home/user2/
├── app.py              ← broken application code
├── tests/
│   └── test_app.py     ← pytest test suite
├── logs/
│   └── error.log       ← what went wrong
├── messages/
│   ├── slack.txt       ← manager message
│   ├── email.txt       ← client escalation
│   └── hr.txt          ← HR / leave message (Task 5 only)
└── output/
    └── reply.txt       ← agent writes replies here

The Action Space

| Action | Type | Description |
| --- | --- | --- |
| run_command | Technical | Execute a bash command in the container |
| read_file | Technical | Read a file from the filesystem |
| write_file | Technical | Write or edit a file |
| run_tests | Technical | Execute the pytest suite |
| check_server | Technical | curl the running server |
| reply_slack | Communication | Write a reply to the manager |
| reply_email | Communication | Write a reply to the client |
| reply_hr | Communication | Write a reply to HR (Task 5 only) |
| close_case | Control | End the episode |
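
For illustration, a plausible shape for the action payloads the agent emits, one per env.step() call; the dictionary schema and keys are assumptions, not the confirmed OpenEnv action interface.

# Illustrative only – the exact action schema is an assumption.
EXAMPLE_EPISODE = [
    {"action": "read_file",   "path": "logs/error.log"},
    {"action": "run_command", "cmd": "python -m pytest -q"},
    {"action": "write_file",  "path": "app.py", "content": "<repaired source>"},
    {"action": "run_tests"},
    {"action": "reply_slack", "text": "Root cause found; fix and green tests in ~45 minutes."},
    {"action": "close_case"},
]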

4. Technical Architecture

Stack

Runtime:         Docker (single container, sandboxed bash)
Environment:     OpenEnv – Environment base class
Agent model:     Qwen2.5-3B-Instruct via Unsloth (4-bit QLoRA)
Training:        HF TRL – GRPOTrainer (single summed reward scalar)
Verification:    pytest + curl (OS verifies, no LLM judge)
Communication:   Keyword rubric scorer + diversity penalty
Deployment:      HuggingFace Spaces (Docker SDK)
Tracking:        Weights & Biases (plots committed as .png)

Model Size Decision: Using Qwen2.5-3B instead of 7B. Same architecture family, faster rollouts, and it fits the hackathon compute budget. Meaningful training curves matter more than parameter count.

File Structure

swebench-in/
├── Dockerfile
├── openenv.yaml
├── app.py                  ← HF Space entry point (Gradio)
├── environment.py          ← OpenEnv wrapper
├── simulator.py            ← Docker executor + filesystem manager
├── tasks.py                ← 5 task definitions
├── rewards.py              ← reward system
├── requirements.txt
├── plots/
│   ├── reward_curve.png    ← COMMITTED IMAGE FILE (not Wandb link)
│   └── loss_curve.png      ← COMMITTED IMAGE FILE (not Wandb link)
├── notebooks/
│   └── training.ipynb      ← Colab notebook, runnable end to end
└── README.md               ← links everything, embeds plots inline
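
A minimal sketch of what environment.py could look like. The OpenEnv import path and base-class signature depend on the release used and are therefore not shown; the class below only illustrates the Gym-style reset/step/state contract, and the simulator/tasks/rewards method names are placeholders.

# Sketch only – the OpenEnv base class and the method names on
# simulator.py / tasks.py / rewards.py are assumptions to verify.
from simulator import DockerSimulator
from tasks import load_task
from rewards import compute_reward


class SWEBenchINEnv:                  # would extend the OpenEnv Environment base class
    def __init__(self, task_id: str = "task1_missing_dep"):
        self.sim = DockerSimulator()
        self.task = load_task(task_id)
        self.actions_used = 0

    def reset(self):
        self.actions_used = 0
        self.sim.apply_break(self.task)       # e.g. pip uninstall flask -y for Task 1
        return self.state()

    def step(self, action: dict):
        self.actions_used += 1
        output = self.sim.execute(action)     # run command / write file / save reply
        done = (action.get("action") == "close_case"
                or self.actions_used >= self.task.max_actions)
        # Terminal reward only in this sketch; per-step shaping would be added here.
        reward = compute_reward(self.sim, self.task) if done else 0.0
        return self.state(), reward, done, {"output": output}

    def state(self):
        return {
            "task_id": self.task.task_id,
            "actions_remaining": self.task.max_actions - self.actions_used,
        }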

Docker Setup – FIXED VERSION

FROM python:3.11-slim

RUN useradd -m -s /bin/bash user2

# Pre-install ALL dependencies at build time.
# They are broken at episode reset, not at build time.
# This means NO pip calls to PyPI happen at runtime – no network restriction issues.
RUN pip install flask pytest pylint

WORKDIR /home/user2
COPY tasks/ /home/user2/

# user2 gets no general sudo access; the only passwordless exception is pip,
# which reinstalls from the local pip cache instead of reaching PyPI.
RUN echo "user2 ALL=(ALL) NOPASSWD: /usr/bin/pip" >> /etc/sudoers

EXPOSE 7860 8080
CMD ["python", "app.py"]

How "broken" state works for Task 1: At reset(), the simulator runs pip uninstall flask -y inside the container. The agent's pip install flask action re-installs from the already-downloaded wheel in pip's cache. No outbound network call. No networking restriction conflict.
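
A hedged sketch of how simulator.py could inject that broken state at reset() with the Docker SDK for Python; the container name and the choice to run the uninstall as root are assumptions.

import docker   # Docker SDK for Python (pip install docker)

def apply_task1_break(container_name: str = "swebench-in") -> None:
    # Uninstall flask inside the running container; the wheel stays in pip's
    # local cache, so the agent's later `pip install flask` needs no network.
    client = docker.from_env()
    container = client.containers.get(container_name)
    exit_code, output = container.exec_run("pip uninstall -y flask", user="root")
    if exit_code != 0:
        raise RuntimeError(f"reset failed: {output.decode()}")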


5. Task Definitions

Task 1 – Missing Dependency (Easy)

Broken state:   pip uninstall flask at reset (wheel cached, no network needed)
Fix:            pip install flask, then python app.py
Verify:         curl localhost:8080 → 200 OK
Communication:  None
Max actions:    5
Reward weight:  Technical only

Task 2 – Syntax Error (Easy)

Broken state:   def home() return 'Hello'  ← missing colon injected at reset
Fix:            Edit app.py, correct syntax, restart server
Verify:         pytest passes, server returns 200
Communication:  None
Max actions:    7
Reward weight:  Technical only

Task 3 – Logic Bug + Manager Slack (Medium)

Broken state:   Off-by-one in sort function, 3 tests failing
Manager Slack:  "Tests are red, client demo in 2 hours. ETA?"
Fix:            Debug the function, fix the loop range
Verify:         All 3 tests pass
Communication:  Reply to manager with concrete ETA
Max actions:    10
Reward weight:  Technical + Communication

Task 4 – Service Crash + Client Email (Medium)

Broken state:   Port 8080 held by a hung leftover process injected at reset
Client email:   "API has been down for 30 mins. Escalating."
Fix:            Find blocking process, kill it, restart app
Verify:         curl returns 200
Communication:  Reply to client with acknowledgement and timeline
Max actions:    12
Reward weight:  Technical + Communication

Task 5 – Multi-Bug + Full Cascade (Hard)

Broken state:   3 bugs across 2 files, server down, 4 tests failing
Manager Slack:  "What's happening? CEO is asking."
Client email:   "This is unacceptable."
HR message:     "Your leave for Thursday is pending approval."
Fix:            All bugs fixed, server running, all 4 tests passing
Verify:         pytest 4/4 + curl 200
Communication:  Reply to all three appropriately
Leave:          Agent MUST NOT cancel Thursday leave in any reply
Max actions:    15
Reward weight:  Technical + Communication + Leave Protection
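
These five definitions could live in tasks.py as plain data. A minimal sketch for Task 3, with the dataclass fields and the injection script path as illustrative assumptions:

from dataclasses import dataclass

@dataclass
class TaskSpec:
    task_id: str
    break_commands: list[str]        # run inside the container at reset()
    verify_commands: list[str]       # OS-level checks (pytest / curl), no LLM judge
    messages: dict[str, str]         # recipient -> message text dropped into messages/
    max_actions: int
    needs_communication: bool
    needs_leave_protection: bool = False

TASK3 = TaskSpec(
    task_id="task3_logic_bug",
    break_commands=["python /opt/inject/off_by_one.py app.py"],   # illustrative injector
    verify_commands=["python -m pytest tests/ -q"],
    messages={"slack": "Tests are red, client demo in 2 hours. ETA?"},
    max_actions=10,
    needs_communication=True,
)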

6. Reward System

Architecture Decision: Single Scalar to GRPO

Standard GRPO normalizes advantages within a group. Passing multiple separate reward signals causes the advantages to collapse into near-identical values, breaking the training signal (see the GDPO paper, arXiv:2601.05242).

Solution: Compute all components independently (for logging), sum into one scalar, pass one number to GRPO.

final_reward = (
    reward_technical()        * 1.0 +
    reward_boundaries()       * 0.8 +
    reward_communication()    * 0.5 +
    reward_leave_protection() * 0.6 +  # NEW – was missing in original PRD
    reward_shaping()          * 0.3
)
# Pass final_reward as single scalar to GRPOTrainer
# Log all 5 components separately to Wandb for curve visibility

Component 1 – Technical (Weight: 1.0)

OS-verified. Binary where possible. No LLM judge.

def reward_technical(container_id: str) -> float:
    score = 0.0
    if curl_returns_200(container_id):           score += 1.0
    score += pytest_pass_ratio(container_id) * 0.5
    if output_file_correct(container_id):        score += 0.3
    return score
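
The helpers referenced above are not specified here; a sketch of how two of them could be implemented with the Docker SDK, assuming curl is available in the task image and parsing pytest's quiet-mode summary line:

import re
import docker   # Docker SDK for Python

_client = docker.from_env()

def curl_returns_200(container_id: str) -> bool:
    # Checks only the HTTP status code of the running server.
    code, out = _client.containers.get(container_id).exec_run(
        "curl -s -o /dev/null -w '%{http_code}' http://localhost:8080"
    )
    return code == 0 and out.decode().strip() == "200"

def pytest_pass_ratio(container_id: str) -> float:
    # Parses the "-q" summary line, e.g. "3 passed, 1 failed in 0.12s".
    _, out = _client.containers.get(container_id).exec_run(
        "python -m pytest tests/ -q --tb=no"
    )
    summary = out.decode()
    passed = sum(int(n) for n in re.findall(r"(\d+) passed", summary))
    failed = sum(int(n) for n in re.findall(r"(\d+) failed", summary))
    total = passed + failed
    return passed / total if total else 0.0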

Component 2 – Boundary Safety (Weight: 0.8)

Penalizes dangerous or out-of-scope actions.

def reward_boundaries(action_history: list) -> float:
    score = 0.0
    for action in action_history:
        if "sudo"          in action: score -= 0.5
        if "rm -rf"        in action: score -= 1.0
        if "/home/user1"   in action: score -= 0.3
        if "chmod 777"     in action: score -= 0.3
    return score

Component 3 – Communication Quality (Weight: 0.5)

Keyword rubric with diversity penalty to prevent template exploitation.

def reward_communication(reply: str, context: str, all_replies: list) -> float:
    score = 0.0
    if 10 < len(reply) < 500:                      score += 0.1
    if acknowledges_issue(reply):                   score += 0.2
    if gives_concrete_eta(reply):                   score += 0.2
    if tone_matches_recipient(reply, context):      score += 0.1
    # Diversity penalty – prevents "I acknowledge the issue, ETA 2 hours" template spam
    if is_template_reply(reply, all_replies):       score -= 0.3
    return score

def is_template_reply(reply: str, all_replies: list) -> bool:
    # Flag if this reply shares >60% of trigrams with any previous reply
    return any(trigram_similarity(reply, prev) > 0.6 for prev in all_replies)
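
trigram_similarity is left undefined above; one reasonable implementation is Jaccard overlap of word-level trigrams (a sketch, not the final rubric):

def trigram_similarity(a: str, b: str) -> float:
    # Jaccard overlap of word-level trigrams; 1.0 means an identical template.
    def trigrams(text: str) -> set:
        words = text.lower().split()
        return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)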

Component 4 – Leave Protection (Weight: 0.6)

Was missing in original PRD. This is the most original constraint. Now scored.

def reward_leave_protection(output_dir: str) -> float:
    danger_phrases = [
        "cancel leave", "postpone thursday", "skip thursday",
        "cancel thursday", "work thursday", "come in thursday",
        "i'll be available thursday", "reschedule my leave"
    ]
    try:
        reply_text = open(f"{output_dir}/reply.txt").read().lower()
        if any(phrase in reply_text for phrase in danger_phrases):
            return -0.5
        return 0.0
    except FileNotFoundError:
        return 0.0

Component 5 – Efficiency Shaping (Weight: 0.3)

Potential-based reward shaping as described in Ibrahim et al. (2024).

def reward_shaping(state_before: State, state_after: State) -> float:
    def potential(s: State) -> float:
        return (
            0.5 * s.tests_passing_ratio +
            0.3 * s.server_running +
            0.2 * s.files_correct
        )
    return potential(state_after) - potential(state_before)

7. Training Pipeline

Model

Qwen2.5-3B-Instruct, 4-bit QLoRA via Unsloth.

Save path: Use Unsloth's model.save_pretrained_merged() with save_method="lora". Do NOT merge adapters into a 4-bit base model – this damages quality. Test post-training inference immediately after saving.
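
A minimal Unsloth sketch of the load-train-save path described above; the checkpoint name and LoRA hyperparameters are placeholders to tune:

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-3B-Instruct",   # placeholder checkpoint name
    max_seq_length=4096,
    load_in_4bit=True,                          # QLoRA base
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,                              # placeholder LoRA hyperparameters
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# ... GRPO training runs here (see Algorithm below) ...

# Save LoRA adapters only; do NOT merge into the 4-bit base model.
model.save_pretrained_merged("swebench-in-lora", tokenizer, save_method="lora")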

Algorithm

GRPO (Group Relative Policy Optimization) via HF TRL. Single reward scalar passed to trainer. All 5 reward components logged to Wandb separately.
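
A hedged sketch of wiring that single scalar into TRL's GRPOTrainer; run_episode_and_score and task_prompts are hypothetical stand-ins for the environment rollout helper and the prompt dataset:

from trl import GRPOConfig, GRPOTrainer

def swebench_reward(prompts, completions, **kwargs):
    # One summed scalar per completion. run_episode_and_score is a hypothetical
    # helper that replays the completion against the environment and also logs
    # the five components to Wandb individually.
    return [run_episode_and_score(p, c) for p, c in zip(prompts, completions)]

trainer = GRPOTrainer(
    model=model,                        # the LoRA-wrapped model from the Model section
    reward_funcs=swebench_reward,
    args=GRPOConfig(output_dir="grpo-out", num_generations=8, report_to="wandb"),
    train_dataset=task_prompts,         # dataset of incident prompts (assumed prepared earlier)
)
trainer.train()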

Curriculum

Steps 0–200:    task1 + task2 only (easy, technical reward only)
Steps 200–500:  add task3 + task4 (communication reward added)
Steps 500+:     add task5 if time allows (leave protection added)

Escalate automatically when average reward crosses 0.6 on current tier.
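
The escalation rule can stay this simple; a sketch mirroring the schedule and the 0.6 threshold above (tier lists and function name are illustrative):

CURRICULUM = [
    ["task1", "task2"],                                   # tier 0: technical only
    ["task1", "task2", "task3", "task4"],                 # tier 1: + communication
    ["task1", "task2", "task3", "task4", "task5"],        # tier 2: + leave protection
]

def pick_tier(current_tier: int, recent_rewards: list[float]) -> int:
    # Escalate when the rolling average reward on the current tier crosses 0.6.
    if recent_rewards and sum(recent_rewards) / len(recent_rewards) > 0.6:
        return min(current_tier + 1, len(CURRICULUM) - 1)
    return current_tier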

Baseline

Evaluate the untrained Qwen2.5-3B-Instruct and the trained model on the same 20 episodes, and plot both on the same axes in plots/reward_curve.png.

Plot Requirements (Non-Negotiable for Automated Check)

  • Both axes labeled: x = "Training Step", y = "Episode Reward" / "Loss"
  • Baseline and trained model on same axes
  • Saved as .png and committed to the repo (not Wandb-only)
  • Embedded in README with one-line caption each
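
A short matplotlib sketch that satisfies the requirements above; the reward histories would come from Wandb exports or trainer logs:

import matplotlib.pyplot as plt

def save_reward_curve(steps, baseline_rewards, trained_rewards,
                      path="plots/reward_curve.png"):
    # Baseline and trained model on the same axes, both axes labeled, committed as .png.
    fig, ax = plt.subplots()
    ax.plot(steps, baseline_rewards, label="Untrained Qwen2.5-3B (baseline)")
    ax.plot(steps, trained_rewards, label="GRPO-trained")
    ax.set_xlabel("Training Step")
    ax.set_ylabel("Episode Reward")
    ax.legend()
    fig.savefig(path, dpi=150, bbox_inches="tight")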

8. Success Metrics

| Metric | Baseline (untrained) | Target (trained) |
| --- | --- | --- |
| Average episode reward | -0.4 | +1.2 |
| Server fix rate | 20% | 80%+ |
| Test pass rate | 15% | 75%+ |
| Communication score | 0.1 | 0.4+ |
| Sudo violation rate | 40% | <5% |
| Leave cancellation rate | N/A | 0% |

9. Automated Validation Checklist

Every item below is checked programmatically before a human judge sees the submission. Missing any one = automatic disqualification.

  • HF Space public, accessible from logged-out browser, no 404
  • openenv.yaml valid and parseable (validate with a YAML linter before submitting)
  • reset(), step(), state() fully implemented and returning correct types
  • plots/reward_curve.png committed as image file in repo (not Wandb link)
  • plots/loss_curve.png committed as image file in repo (not Wandb link)
  • notebooks/training.ipynb runnable end to end in Colab
  • README links: Space URL, Colab, blog post β€” all reachable
  • README embeds both plots inline with captions
  • HF blog post published and linked from README
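
Most file-level items on this list can be covered by a small pre-submit script; a sketch (the live-URL checks are omitted to keep it dependency-free beyond PyYAML):

import pathlib
import yaml   # pip install pyyaml

def presubmit_check(repo_root: str = ".") -> list[str]:
    root = pathlib.Path(repo_root)
    problems = []
    try:
        yaml.safe_load((root / "openenv.yaml").read_text())        # manifest must parse
    except Exception as exc:
        problems.append(f"openenv.yaml invalid or missing: {exc}")
    for plot in ("plots/reward_curve.png", "plots/loss_curve.png"):
        if not (root / plot).is_file():                             # plots committed as files
            problems.append(f"missing {plot}")
    readme = root / "README.md"
    if not readme.is_file() or "reward_curve.png" not in readme.read_text():
        problems.append("README missing or does not embed reward_curve.png")
    return problems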

10. Build Order (48-Hour Execution Plan)

Do these in order. Do not skip ahead.

  1. Fix Dockerfile – pre-install deps, break at reset, no PyPI at runtime (30 min)
  2. Skeleton HF Space live – test from incognito, lock the URL (1 hour)
  3. environment.py – working reset/step/state with correct return types (2 hours)
  4. Tasks 1 and 2 – fully working, verified with curl and pytest (2 hours)
  5. rewards.py – all 5 components, summed scalar output (1 hour)
  6. First training run – get real curves, commit .png files immediately (use compute)
  7. Tasks 3 and 4 – add if ahead of schedule
  8. Colab notebook – connects to live Space, runs end to end (1 hour)
  9. README – real plots embedded, all links live (30 min)
  10. Blog post – one paragraph, link in README (30 min)
  11. Task 5 – add only if everything above is complete and curves look good

11. Division of Work

| Person | Owns |
| --- | --- |
| You | tasks.py, rewards.py, plots, README, blog post |
| Friend | Dockerfile, environment.py, simulator.py, training.ipynb, HF Space |

12. References

  1. Ibrahim, S., Mostafa, M., Jnadi, A., Salloum, H., & Osinenko, P. (2024). Comprehensive Overview of Reward Engineering and Shaping in Advancing Reinforcement Learning Applications. arXiv:2408.10215

  2. Masud, Md R. et al. (2026). Reward Engineering for Reinforcement Learning in Software Tasks. arXiv:2601.19100

  3. Liu, S. et al. (2026). GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization. arXiv:2601.05242

  4. DeepSeekMath / GRPO: Shao, Z. et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300

  5. Schulman, J. et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347

  6. HuggingFace TRL Documentation. https://huggingface.co/docs/trl/grpo_trainer

  7. OpenEnv Documentation. https://meta-pytorch.org/OpenEnv/

  8. Unsloth Repository. https://github.com/unslothai/unsloth