---
title: SalesPath Environment
emoji: 🤝
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: OpenEnv RL gym for training B2B sales agents via GRPO
---

# SalesPath – RL Environment for B2B Sales Agents

> **An OpenEnv-compliant gym for teaching an LLM to follow a multi-step,
> rule-governed B2B sales workflow with programmatic verification at every
> step. Targets the Scale AI bonus track on long-horizon non-code business
> workflows.**

* **Theme**: #2 – Long-Horizon Planning & Instruction Following
* **Bonus track**: Scale AI – Sales / PM / HR & IT workflows
* **HF Space**: https://huggingface.co/spaces/Lomesh7777/openenv-multi-agent-RL
* **Blog post**: _add link before submission_
* **Demo video (≤2 min)**: _add link before submission_

---

## 1. Problem

Off-the-shelf LLMs prompted to act as sales agents reliably break the
fundamentals of B2B selling: they pitch before qualifying, offer discounts
before establishing value, and ignore ordering constraints that real sales
orgs treat as inviolable. Not because they lack the knowledge, but because
no training environment has ever penalised these behaviours.

SalesPath is that environment.

The agent navigates a 3-to-8 step workflow against a deterministic
`ProspectSimulator`, and at every turn the environment programmatically
verifies nine business rules (R01–R09). A composed
[OpenEnv `Rubric`](salespath_env/server/reward.py) emits a dense five-component
reward.

## 2. Environment

### Observation
```jsonc
{
  "prospect_response":     "...",
  "workflow_stage":        "PRESENT",
  "constraints_violated":  ["R01"],
  "steps_completed":       ["PROSPECT", "PRESENT"],
  "turn_number":           3,
  "reward":                -0.18,
  "reward_components":     { "r_outcome": 0.0, "r_compliance": -0.2, ... },
  "done":                  false
}
```
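
To make the loop concrete, here is a minimal client round trip. Treat the route names and request shapes below as assumptions for illustration; the real interface is `salespath_env/client.py` against the FastAPI app in `server/app.py`.

```python
import requests

BASE = "http://localhost:7860"  # env server started via uvicorn (see section 3)

# NOTE: /reset and /step are assumed route names, not the documented API.
obs = requests.post(f"{BASE}/reset", json={"difficulty": 1}).json()

# R06: the first action must be PROSPECT.
action = {"action_type": "PROSPECT", "format_ok": True}
obs = requests.post(f"{BASE}/step", json=action).json()

print(obs["workflow_stage"], obs["reward"], obs["constraints_violated"])
```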

### Action

| Action | When to use |
|---|---|
| `PROSPECT` | Opening turn only – initial outreach |
| `QUALIFY` | Uncover budget, decision maker, pain points |
| `PRESENT` | Pitch the solution (requires `QUALIFY` first) |
| `HANDLE_OBJECTION` | Respond to pricing / timing objections |
| `OFFER_DEMO` | Schedule a live product demo |
| `NEGOTIATE` | Discuss pricing/terms (requires `OFFER_DEMO` + known budget) |
| `CLOSE` | Attempt to sign the deal |
| `FOLLOW_UP` | Re-engage after prospect silence |
| `DISQUALIFY` | End the conversation (only valid for low-budget, no-DM prospects) |

The action carries a `format_ok` flag set by the agent's parser. A malformed
completion that happens to coerce to a valid `action_type` is still penalised
by the `FormatRubric`, closing the silent format-hack surface from v1.
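
A minimal sketch of that parsing contract, assuming an `ACTION: <TYPE>` completion format (the regex and helper name are illustrative; only `action_type` and `format_ok` come from `models.py`):

```python
import re

VALID_ACTIONS = {
    "PROSPECT", "QUALIFY", "PRESENT", "HANDLE_OBJECTION", "OFFER_DEMO",
    "NEGOTIATE", "CLOSE", "FOLLOW_UP", "DISQUALIFY",
}

def parse_completion(text: str) -> dict:
    """format_ok is True only for an exactly well-formed completion, so a
    lucky substring match still pays the FormatRubric penalty."""
    match = re.fullmatch(r"\s*ACTION:\s*([A-Z_]+)\s*", text)
    if match and match.group(1) in VALID_ACTIONS:
        return {"action_type": match.group(1), "format_ok": True}
    # Best-effort coercion for malformed output, with the violation flagged.
    for action in sorted(VALID_ACTIONS, key=len, reverse=True):
        if action in text.upper():
            return {"action_type": action, "format_ok": False}
    return {"action_type": "INVALID", "format_ok": False}
```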

### Business rules (R01–R09)

| Rule | Description |
|---|---|
| R01 | Must `QUALIFY` before `PRESENT` |
| R02 | Must `OFFER_DEMO` before `NEGOTIATE` |
| R03 | Cannot `NEGOTIATE` while budget is unknown |
| R04 | Discount in `NEGOTIATE` only after 2 objections handled |
| R05 | Cannot repeat the same action on consecutive turns |
| R06 | First action must be `PROSPECT` |
| R07 | `FOLLOW_UP` only valid after prospect silence (stall) |
| R08 | `DISQUALIFY` valid only when `budget < threshold AND no decision_maker` |
| R09 | Must `OFFER_DEMO` before `CLOSE` (difficulty 2+) |
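
Because the verifier is deterministic rather than LLM-based, each rule reduces to a pure predicate over episode state. A sketch of R01 (the function name is illustrative; `steps_completed` is the field shown in the observation above):

```python
def violates_r01(action_type: str, steps_completed: list[str]) -> bool:
    """R01: the agent must QUALIFY before it may PRESENT."""
    return action_type == "PRESENT" and "QUALIFY" not in steps_completed

# Pitching on the first substantive turn trips R01; qualifying first does not.
assert violates_r01("PRESENT", ["PROSPECT"])
assert not violates_r01("PRESENT", ["PROSPECT", "QUALIFY"])
```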

### Reward β€” composed Rubric

`SalesPathRubric` is a `WeightedSum` over five sub-rubrics, each registered
as an OpenEnv `Rubric` so external tooling can introspect per-component
scores via `env.rubric.named_rubrics()`.

| Component | Weight | Type | What it captures |
|---|---|---|---|
| `compliance`  | 0.40 | per-turn | -0.2 per new rule violation, capped at -1.0 |
| `outcome`     | 0.20 | terminal | +1.0 success / +0.5 valid disqualify / -0.7 violation termination |
| `ordering`    | 0.20 | per-turn | **potential-based** – Δ correct-prefix length per turn (arXiv:2408.10215 §4.2) |
| `efficiency`  | 0.10 | terminal | -0.05 per turn over the per-difficulty optimum |
| `format`      | 0.10 | per-turn | +1.0 valid+parsed / -0.3 if `format_ok=False` or invalid action_type |

Why these weights: arXiv:2601.19100 §3.1 argues that for long-horizon
structured-output tasks the *process* signal must dominate the sparse
*outcome* signal. We therefore give compliance 2× the weight of outcome.
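
For intuition, here is a self-contained stand-in for the composition pattern. Everything below is a simplified sketch: the real `SalesPathRubric` and the exact OpenEnv `Rubric`/`WeightedSum` signatures live in `reward.py` and the Rubric RFC.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class WeightedSum:
    # name -> (scoring fn over a transition dict, weight)
    components: dict[str, tuple[Callable[[dict], float], float]]

    def score(self, transition: dict) -> float:
        return sum(w * fn(transition) for fn, w in self.components.values())

    def named_rubrics(self):
        # What external tooling would introspect per component.
        return [(name, fn) for name, (fn, _) in self.components.items()]

def compliance(t: dict) -> float:
    # -0.2 per new rule violation this turn, capped at -1.0.
    return max(-1.0, -0.2 * len(t.get("new_violations", [])))

def fmt(t: dict) -> float:
    return 1.0 if t.get("format_ok") else -0.3

rubric = WeightedSum({"compliance": (compliance, 0.40), "format": (fmt, 0.10)})
print(rubric.score({"new_violations": ["R01"], "format_ok": True}))  # 0.02
```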

### Difficulty curriculum

| Level | Description | Correct terminal action |
|---|---|---|
| 1 | Budget known, decision maker present | `CLOSE` |
| 2 | Budget hidden, 1 objection, demo required | `CLOSE` |
| 3 | Budget hidden, 2 objections, possible stalling | `CLOSE` |
| 4 | Adversarial: misleading high-budget signal, no decision maker | `DISQUALIFY` |

The task bank carries ~20 prospect profiles per level (`task_bank.py`); the
last 4 of each level are held out for `eval_baseline_vs_trained.py`.
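
The held-out split is then a one-liner per level (a sketch, assuming `task_bank.py` keeps an ordered list of profiles per level):

```python
# Hypothetical shape for illustration; see salespath_env/server/task_bank.py.
TASK_BANK: dict[int, list[dict]] = {level: [] for level in (1, 2, 3, 4)}

# The last 4 profiles of each level are reserved for evaluation only.
HELD_OUT = {level: profiles[-4:] for level, profiles in TASK_BANK.items()}
TRAIN    = {level: profiles[:-4] for level, profiles in TASK_BANK.items()}
```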

## 3. Training pipeline

```
sft_demos.jsonl  →  train_sft.py  →  ./sft_checkpoint
                                         │
                                         ▼
                                  train_grpo.py
                                         │
                            on-policy rollouts in
                            SalesPathEnvironment
                                         │
                                         ▼
                                  ./grpo_checkpoint
                                         │
                       ┌─────────────────┴─────────────────┐
                       ▼                                   ▼
                plot_rewards.py            eval_baseline_vs_trained.py
                       │                                   │
                       ▼                                   ▼
              ./plots/reward_curve.png            ./eval_results.md
```

### What's specifically engineered for fast training on Colab/Kaggle GPUs

* **Batched rollouts** – N parallel episodes, a single `.generate()` call per
  turn (left-padded for correctness).
* **Threaded reward fn** – reward computation across GRPO's group of
  candidate completions runs in a `ThreadPoolExecutor` (the env is
  rule-based and CPU-cheap, so threads overlap with GPU forwards).
* **State snapshots keyed by SHA1** – the `STATE_BANK` trick lets GRPO score
  single-action completions against a frozen state, avoiding full episode
  re-rollouts during the gradient step.
* **N-step shaping** (`GAMMA=0.95`) – `true_env_reward_fn` extends the
  immediate per-turn reward with a discounted heuristic continuation, so
  GRPO sees credit for actions that pay off later (see the sketch after
  this list). This is what gives this contextual-bandit-shaped problem a
  real long-horizon signal.
* **Optional vLLM** – `USE_VLLM=1` flips on TRL's vLLM-backed sampler for
  ~3× faster on-policy generation on A100/Kaggle P100.
* **Trainer-once** – `GRPOTrainer` is constructed and trained exactly once,
  preserving optimizer and LR-scheduler state across all gradient steps.
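
A stripped-down sketch of the n-step shaping idea (the continuation values and function name are illustrative assumptions; the real logic is `true_env_reward_fn` in `train_grpo.py`):

```python
GAMMA = 0.95  # n-step continuation discount (the GAMMA env var)

def shaped_reward(immediate: float, continuation: list[float]) -> float:
    """Extend the per-turn reward with a discounted heuristic continuation,
    so a completion gets credit for the value its action unlocks later."""
    return immediate + sum(GAMMA ** (k + 1) * r
                           for k, r in enumerate(continuation))

# Example: QUALIFY earns little now but unlocks a compliant PRESENT -> CLOSE
# path; assume heuristic per-turn values of [0.2, 0.8] for those two turns.
print(shaped_reward(0.05, [0.2, 0.8]))  # 0.05 + 0.95*0.2 + 0.95**2*0.8 = 0.962
```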

### Commands

```bash
# 0. Smoke test (~30 sec, no GPU)
python training/train_test.py

# 1. SFT warm-start (~10–15 min on a T4)
python training/train_sft.py

# 2. Start the env server and run GRPO (~45–90 min on a T4)
uvicorn salespath_env.server.app:app --port 7860 &
SFT_CHECKPOINT=./sft_checkpoint  USE_VLLM=0  python training/train_grpo.py

# 3. Plot reward curves
python training/plot_rewards.py

# 4. Baseline-vs-trained head-to-head on the held-out eval split
python training/eval_baseline_vs_trained.py \
    --base ./sft_checkpoint --trained ./grpo_checkpoint --episodes-per-level 8
```

Useful env vars for Colab/Kaggle tuning:

| Var | Default | Notes |
|---|---|---|
| `ROLLOUTS_PER_DIFFICULTY` | 8 | More β†’ bigger / more diverse state bank |
| `NUM_GENERATIONS`         | 4 | GRPO group size; on T4 keep ≤4 to fit VRAM |
| `PER_DEVICE_BATCH`        | 2 | T4 / Kaggle P100 default |
| `GRAD_ACCUM`              | 4 | Effective batch = 8 |
| `NUM_REWARD_WORKERS`      | 8 | Threadpool size for the reward fn |
| `USE_VLLM`                | 0 | Set to `1` on A100 only |
| `BETA`                    | 0.05 | KL-to-reference penalty |
| `GAMMA`                   | 0.95 | n-step continuation discount |

## 4. Results

After ~1 GRPO pass (eval on the **held-out** profiles, 8 episodes per level):

> See `eval_results.md` (regenerated by `eval_baseline_vs_trained.py`)
> and `plots/reward_curve.png` (regenerated by `plot_rewards.py`).

The conservative target table from the proposal:

| Metric | Base | After GRPO (target) |
|---|---|---|
| Rule violations per episode | 3.5 | < 0.5 |
| Correct step ordering rate  | 0.45 | > 0.85 |
| Successful close rate (L1)  | 0.30 | > 0.75 |
| Correct disqualification rate (L4) | 0.20 | > 0.65 |
| Mean episode reward         | ~0.10 | > 0.6 |

## 5. File layout

```
salespath-env/
├── salespath_env/
│   ├── __init__.py                ← public API exports
│   ├── client.py                  ← HTTP client for the env
│   ├── models.py                  ← Action / Observation / State + format_ok
│   ├── openenv.yaml               ← OpenEnv manifest (spec_version: 1)
│   └── server/
│       ├── app.py                 ← Custom stateful FastAPI (HF Spaces)
│       ├── salespath_environment.py
│       ├── prospect_simulator.py  ← Deterministic, state-seeded
│       ├── rules.py               ← R01–R09
│       ├── reward.py              ← SalesPathRubric (WeightedSum of 5)
│       └── task_bank.py           ← 19–20 profiles/level + held-out split
├── training/
│   ├── sft_demos.jsonl
│   ├── train_test.py              ← smoke test + bug regression
│   ├── train_sft.py
│   ├── train_grpo.py              ← GRPO + n-step + parallel reward fn
│   ├── eval_baseline_vs_trained.py
│   └── plot_rewards.py
├── Dockerfile
├── requirements.txt
└── pyproject.toml
```

## 6. Why this design wins on the rubric

| Criterion (weight) | How we hit it |
|---|---|
| **Environment Innovation (40%)** | Business workflow with programmatic verification, a deterministic rule-based simulator (no LLM in the verifier, which prevents reward hacking via prompt manipulation), a 4-level curriculum with held-out eval, and OpenEnv `Rubric` composition. |
| **Storytelling (30%)** | The sales workflow is legible to any reader in 10 seconds. The before/after table from `eval_baseline_vs_trained.py` is the headline. A live-demo script occupies 0:30–1:30 of the demo plan. |
| **Improvement in Rewards (20%)** | Five tracked metrics, dense per-turn signal, reward curves with min/max band and difficulty-step markers, baseline vs trained eval table. |
| **Reward & Pipeline (10%)** | Composed Rubric system; potential-based ordering shaping (no policy distortion); n-step continuation closes the contextual-bandit gap; format-hack surface explicitly closed; trainer instantiated once with optimizer state preserved. |

## 7. References

* Reward engineering survey – [arXiv:2408.10215](https://arxiv.org/abs/2408.10215)
* Reward engineering for software RL – [arXiv:2601.19100](https://arxiv.org/abs/2601.19100)
* OpenEnv – https://github.com/meta-pytorch/OpenEnv
* OpenEnv Rubric RFC – [`rfcs/004-rubrics.md`](https://github.com/meta-pytorch/OpenEnv)