---
title: SalesPath Environment
emoji: 🤖
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: OpenEnv RL gym for training B2B sales agents via GRPO
---
# SalesPath – RL Environment for B2B Sales Agents
> **An OpenEnv-compliant gym for teaching an LLM to follow a multi-step,
> rule-governed B2B sales workflow with programmatic verification at every
> step. Targets the Scale AI bonus track on long-horizon non-code business
> workflows.**
* **Theme**: #2 – Long-Horizon Planning & Instruction Following
* **Bonus track**: Scale AI – Sales / PM / HR & IT workflows
* **HF Space**: https://huggingface.co/spaces/Lomesh7777/openenv-multi-agent-RL
* **Blog post**: _add link before submission_
* **Demo video (≤2 min)**: _add link before submission_
---
## 1. Problem
Off-the-shelf LLMs prompted to act as sales agents reliably break the
fundamentals of B2B selling: they pitch before qualifying, offer discounts
before establishing value, and ignore ordering constraints that real sales
orgs treat as inviolable. Not because they lack the knowledge, but because
no training environment has ever penalised these behaviours.
SalesPath is that environment.
The agent navigates a 3-to-8 step workflow against a deterministic
`ProspectSimulator`, and at every turn the environment programmatically
verifies nine business rules (R01–R09). A composed
[OpenEnv `Rubric`](salespath_env/server/reward.py) emits a dense five-component
reward.
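Determinism in the simulator matters because the reward is computed
programmatically: the same conversation state must always yield the same
prospect reply. A minimal sketch of the state-seeded idea (illustrative only;
the real implementation lives in `prospect_simulator.py`):

```python
import hashlib
import random

def seeded_reply(state_fields: dict, templates: list[str]) -> str:
    """Same conversation state -> same RNG seed -> same prospect reply."""
    key = repr(sorted(state_fields.items())).encode()
    seed = int.from_bytes(hashlib.sha1(key).digest()[:4], "big")
    return random.Random(seed).choice(templates)

# e.g. seeded_reply({"stage": "QUALIFY", "turn": 2},
#                   ["Budget is about $50k.", "I'd need to loop in my VP."])
```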
## 2. Environment
### Observation
```jsonc
{
"prospect_response": "...",
"workflow_stage": "PRESENT",
"constraints_violated": ["R01"],
"steps_completed": ["PROSPECT", "PRESENT"],
"turn_number": 3,
"reward": -0.18,
"reward_components": { "r_outcome": 0.0, "r_compliance": -0.2, ... },
"done": false
}
```
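For orientation, here is roughly what driving the env over HTTP looks like.
The `/reset` and `/step` routes and payload fields below are assumptions
inferred from the observation above, not the verified server API; use the
bundled `client.py` in practice.

```python
import requests

BASE = "http://localhost:7860"  # uvicorn salespath_env.server.app:app --port 7860

# NOTE: route names and payload fields are illustrative assumptions.
obs = requests.post(f"{BASE}/reset", json={"difficulty": 1}).json()

action = {"action_type": "PROSPECT",
          "message": "Hi, I noticed your team is scaling outbound...",
          "format_ok": True}
obs = requests.post(f"{BASE}/step", json=action).json()
print(obs["workflow_stage"], obs["reward"], obs["constraints_violated"])
```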
### Action
| Action | When to use |
|---|---|
| `PROSPECT` | Opening turn only – initial outreach |
| `QUALIFY` | Uncover budget, decision maker, pain points |
| `PRESENT` | Pitch the solution (requires `QUALIFY` first) |
| `HANDLE_OBJECTION` | Respond to pricing / timing objections |
| `OFFER_DEMO` | Schedule a live product demo |
| `NEGOTIATE` | Discuss pricing/terms (requires `OFFER_DEMO` + known budget) |
| `CLOSE` | Attempt to sign the deal |
| `FOLLOW_UP` | Re-engage after prospect silence |
| `DISQUALIFY` | End the conversation (only valid for low-budget, no-DM prospects) |
The action carries a `format_ok` flag set by the agent's parser. A malformed
completion that happens to coerce to a valid `action_type` is still penalised
by the `FormatRubric`, closing the silent format-hack surface from v1.
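One way to set `format_ok` strictly (a sketch; the repo's actual parser and
the expected completion format are assumptions here): accept a completion
only when it matches the tagged form exactly, and never silently coerce free
text into an action.

```python
import re

VALID_ACTIONS = {
    "PROSPECT", "QUALIFY", "PRESENT", "HANDLE_OBJECTION", "OFFER_DEMO",
    "NEGOTIATE", "CLOSE", "FOLLOW_UP", "DISQUALIFY",
}
# Assumed completion shape: "ACTION: <type>\nMESSAGE: <text>"
_PATTERN = re.compile(r"^ACTION:\s*([A-Z_]+)\s*\nMESSAGE:\s*(.+)$", re.DOTALL)

def parse_action(completion: str) -> dict:
    m = _PATTERN.match(completion.strip())
    if m and m.group(1) in VALID_ACTIONS:
        return {"action_type": m.group(1), "message": m.group(2).strip(),
                "format_ok": True}
    # Malformed output: fall back to a harmless action but flag it, so the
    # FormatRubric still penalises the completion (no silent format hacking).
    return {"action_type": "FOLLOW_UP", "message": completion.strip(),
            "format_ok": False}
```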
### Business rules (R01–R09)
| Rule | Description |
|---|---|
| R01 | Must `QUALIFY` before `PRESENT` |
| R02 | Must `OFFER_DEMO` before `NEGOTIATE` |
| R03 | Cannot `NEGOTIATE` while budget is unknown |
| R04 | Discount in `NEGOTIATE` only after 2 objections handled |
| R05 | Cannot repeat the same action on consecutive turns |
| R06 | First action must be `PROSPECT` |
| R07 | `FOLLOW_UP` only valid after prospect silence (stall) |
| R08 | `DISQUALIFY` valid only when `budget < threshold AND no decision_maker` |
| R09 | Must `OFFER_DEMO` before `CLOSE` (difficulty 2+) |
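Each rule is a pure predicate over the episode state, so violations are cheap
to check on every turn. A sketch of R01, R03, and R05 under an assumed state
shape (the real `State` in `models.py` may differ):

```python
from dataclasses import dataclass, field

@dataclass
class EpisodeState:                 # illustrative fields, not the real State
    steps_completed: list[str] = field(default_factory=list)
    budget_known: bool = False

def check_rules(action_type: str, state: EpisodeState) -> list[str]:
    """Return IDs of rules the proposed action would violate this turn."""
    violations = []
    # R01: must QUALIFY before PRESENT
    if action_type == "PRESENT" and "QUALIFY" not in state.steps_completed:
        violations.append("R01")
    # R03: cannot NEGOTIATE while budget is unknown
    if action_type == "NEGOTIATE" and not state.budget_known:
        violations.append("R03")
    # R05: cannot repeat the same action on consecutive turns
    if state.steps_completed and action_type == state.steps_completed[-1]:
        violations.append("R05")
    # ... R02, R04, R06-R09 follow the same pattern
    return violations
```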
### Reward – composed Rubric
`SalesPathRubric` is a `WeightedSum` over five sub-rubrics, each registered
as an OpenEnv `Rubric` so external tooling can introspect per-component
scores via `env.rubric.named_rubrics()`.
| Component | Weight | Type | What it captures |
|---|---|---|---|
| `compliance` | 0.40 | per-turn | -0.2 per new rule violation, capped at -1.0 |
| `outcome` | 0.20 | terminal | +1.0 success / +0.5 valid disqualify / -0.7 violation termination |
| `ordering` | 0.20 | per-turn | **potential-based** – Δ correct-prefix length per turn (arXiv:2408.10215 §4.2) |
| `efficiency` | 0.10 | terminal | -0.05 per turn over the per-difficulty optimum |
| `format` | 0.10 | per-turn | +1.0 valid+parsed / -0.3 if `format_ok=False` or invalid action_type |
Why these weights: arXiv:2601.19100 §3.1 argues that for long-horizon
structured-output tasks the *process* signal must dominate the sparse
*outcome* signal. We give compliance 2× the weight of outcome.
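A self-contained stand-in for the composition (the real classes live in
`salespath_env/server/reward.py`; the constructor shape and per-component
argument names here are assumptions). It also shows the potential-based form
of the ordering term, F = Φ(s') − Φ(s) with Φ = correct-prefix length, which
telescopes over the episode and therefore cannot change the optimal policy:

```python
from typing import Callable, Dict, Tuple

class WeightedSumSketch:
    """Toy stand-in for the composed OpenEnv Rubric (names illustrative)."""
    def __init__(self, parts: Dict[str, Tuple[Callable[..., float], float]]):
        self.parts = parts
    def named_rubrics(self) -> Dict[str, float]:
        return {name: weight for name, (_, weight) in self.parts.items()}
    def __call__(self, **kw) -> float:
        return sum(w * fn(**kw) for fn, w in self.parts.values())

CANONICAL = ["PROSPECT", "QUALIFY", "PRESENT", "OFFER_DEMO", "NEGOTIATE", "CLOSE"]

def prefix_len(steps) -> int:
    """Phi(s): length of the prefix matching the canonical ordering."""
    n = 0
    for got, want in zip(steps, CANONICAL):
        if got != want:
            break
        n += 1
    return n

def ordering(prev_steps=(), next_steps=(), **_) -> float:
    # Potential-based shaping: F = Phi(s') - Phi(s).
    return float(prefix_len(next_steps) - prefix_len(prev_steps))

rubric = WeightedSumSketch({
    "compliance": (lambda new_violations=0, **_: max(-1.0, -0.2 * new_violations), 0.40),
    "outcome":    (lambda outcome=0.0, **_: outcome,                               0.20),
    "ordering":   (ordering,                                                       0.20),
    "efficiency": (lambda extra_turns=0, **_: -0.05 * extra_turns,                 0.10),
    "format":     (lambda format_ok=True, **_: 1.0 if format_ok else -0.3,         0.10),
})
print(rubric.named_rubrics())
print(rubric(prev_steps=["PROSPECT"], next_steps=["PROSPECT", "QUALIFY"]))
```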
### Difficulty curriculum
| Level | Description | Correct terminal action |
|---|---|---|
| 1 | Budget known, decision maker present | `CLOSE` |
| 2 | Budget hidden, 1 objection, demo required | `CLOSE` |
| 3 | Budget hidden, 2 objections, possible stalling | `CLOSE` |
| 4 | Adversarial: misleading high-budget signal, no decision maker | `DISQUALIFY` |
The task bank carries ~20 prospect profiles per level (`task_bank.py`); the
last 4 of each level are held out for `eval_baseline_vs_trained.py`.
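The held-out convention is simple enough to state in code (a sketch;
`task_bank.py` is the source of truth):

```python
def split_bank(profiles_by_level: dict[int, list]) -> tuple[dict, dict]:
    """Reserve the last 4 profiles of each difficulty level for evaluation."""
    train = {lvl: ps[:-4] for lvl, ps in profiles_by_level.items()}
    held_out = {lvl: ps[-4:] for lvl, ps in profiles_by_level.items()}
    return train, held_out
```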
## 3. Training pipeline
```
sft_demos.jsonl ──► train_sft.py ──► ./sft_checkpoint
                                            │
                                            ▼
                                     train_grpo.py
                                            │
                                  on-policy rollouts in
                                  SalesPathEnvironment
                                            │
                                            ▼
                                    ./grpo_checkpoint
                                            │
                     ┌──────────────────────┴──────────────────────┐
                     ▼                                             ▼
             plot_rewards.py                       eval_baseline_vs_trained.py
                     │                                             │
                     ▼                                             ▼
       ./plots/reward_curve.png                          ./eval_results.md
```
### What's specifically engineered for fast training on Colab/Kaggle GPUs
* **Batched rollouts** – N parallel episodes, single `.generate()` call per
  turn (left-padded for correctness).
* **Threaded reward fn** – reward computation across GRPO's group of
  candidate completions runs in a `ThreadPoolExecutor` (the env is
  rule-based / CPU-cheap, so threads overlap with GPU forwards).
* **State snapshots keyed by SHA1** – the `STATE_BANK` trick lets GRPO score
  single-action completions against a frozen state, avoiding full episode
  re-rollouts during the gradient step (see the sketch after this list).
* **N-step shaping** (`GAMMA=0.95`) – `true_env_reward_fn` extends the
  immediate per-turn reward with a discounted heuristic continuation, so
  GRPO sees credit for actions that pay off later. This is what gives this
  contextual-bandit-shaped problem a real long-horizon signal.
* **Optional vLLM** – `USE_VLLM=1` switches on TRL's vLLM-backed sampler for
  ~3× faster on-policy generation on A100/Kaggle P100.
* **Trainer-once** – `GRPOTrainer` is constructed once and trained once,
  preserving optimizer + LR-scheduler state across all gradient steps.
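A condensed sketch of the state-bank plus threaded-reward pattern (simplified
from `train_grpo.py`; the helper names and the `step_and_score` placeholder
are illustrative, not the real API):

```python
import hashlib
import pickle
from concurrent.futures import ThreadPoolExecutor

STATE_BANK: dict[str, bytes] = {}       # sha1 key -> pickled frozen env state

def snapshot(state) -> str:
    """Freeze the env state and return a key the GRPO metadata can carry."""
    blob = pickle.dumps(state)
    key = hashlib.sha1(blob).hexdigest()
    STATE_BANK[key] = blob
    return key

def step_and_score(state, completion: str) -> float:
    """Placeholder: apply the parsed action to the frozen state and return
    the per-turn rubric reward (the real logic lives in the env server)."""
    return 0.0

def score_one(key: str, completion: str) -> float:
    state = pickle.loads(STATE_BANK[key])    # restore, no episode re-rollout
    return step_and_score(state, completion)

def reward_fn(keys: list[str], completions: list[str]) -> list[float]:
    # Env checks are rule-based / CPU-cheap, so threads overlap GPU forwards.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(score_one, keys, completions))
```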
### Commands
```bash
# 0. Smoke test (~30 sec, no GPU)
python training/train_test.py
# 1. SFT warm-start (~10–15 min on a T4)
python training/train_sft.py
# 2. Start the env server and run GRPO (~45–90 min on a T4)
uvicorn salespath_env.server.app:app --port 7860 &
SFT_CHECKPOINT=./sft_checkpoint USE_VLLM=0 python training/train_grpo.py
# 3. Plot reward curves
python training/plot_rewards.py
# 4. Baseline-vs-trained head-to-head on the held-out eval split
python training/eval_baseline_vs_trained.py \
--base ./sft_checkpoint --trained ./grpo_checkpoint --episodes-per-level 8
```
### Useful env vars for Colab/Kaggle tuning

| Var | Default | Notes |
|---|---|---|
| `ROLLOUTS_PER_DIFFICULTY` | 8 | More → bigger / more diverse state bank |
| `NUM_GENERATIONS` | 4 | GRPO group size; on a T4 keep ≤4 to fit VRAM |
| `PER_DEVICE_BATCH` | 2 | T4 / Kaggle P100 default |
| `GRAD_ACCUM` | 4 | Effective batch = 8 |
| `NUM_REWARD_WORKERS` | 8 | Threadpool size for the reward fn |
| `USE_VLLM` | 0 | Set to `1` on A100 only |
| `BETA` | 0.05 | KL-to-reference penalty |
| `GAMMA` | 0.95 | n-step continuation discount |
## 4. Results
After ~1 GRPO pass (eval on the **held-out** profiles, 8 episodes per level):
> See `eval_results.md` (regenerated by `eval_baseline_vs_trained.py`)
> and `plots/reward_curve.png` (regenerated by `plot_rewards.py`).
The conservative target table from the proposal:
| Metric | Base | After GRPO (target) |
|---|---|---|
| Rule violations per episode | 3.5 | < 0.5 |
| Correct step ordering rate | 0.45 | > 0.85 |
| Successful close rate (L1) | 0.30 | > 0.75 |
| Correct disqualification rate (L4) | 0.20 | > 0.65 |
| Mean episode reward | ~0.10 | > 0.6 |
## 5. File layout
```
salespath-env/
├── salespath_env/
│   ├── __init__.py                 ← public API exports
│   ├── client.py                   ← HTTP client for the env
│   ├── models.py                   ← Action / Observation / State + format_ok
│   ├── openenv.yaml                ← OpenEnv manifest (spec_version: 1)
│   └── server/
│       ├── app.py                  ← Custom stateful FastAPI (HF Spaces)
│       ├── salespath_environment.py
│       ├── prospect_simulator.py   ← Deterministic, state-seeded
│       ├── rules.py                ← R01–R09
│       ├── reward.py               ← SalesPathRubric (WeightedSum of 5)
│       └── task_bank.py            ← 19–20 profiles/level + held-out split
├── training/
│   ├── sft_demos.jsonl
│   ├── train_test.py               ← smoke test + bug regression
│   ├── train_sft.py
│   ├── train_grpo.py               ← GRPO + n-step + parallel reward fn
│   ├── eval_baseline_vs_trained.py
│   └── plot_rewards.py
├── Dockerfile
├── requirements.txt
└── pyproject.toml
```
## 6. Why this design wins on the rubric
| Criterion (weight) | How we hit it |
|---|---|
| **Environment Innovation (40%)** | Business workflow with programmatic verification, deterministic rule-based simulator (no LLM in the verifier, which prevents reward hacking via prompt manipulation), 4-level curriculum with held-out eval, OpenEnv `Rubric` composition. |
| **Storytelling (30%)** | The sales workflow is legible to any reader in 10 seconds. The before/after table from `eval_baseline_vs_trained.py` is the headline. The live-demo script covers 0:30–1:30 of the demo plan. |
| **Improvement in Rewards (20%)** | Five tracked metrics, dense per-turn signal, reward curves with min/max band and difficulty-step markers, baseline vs trained eval table. |
| **Reward & Pipeline (10%)** | Composed Rubric system; potential-based ordering shaping (no policy distortion); n-step continuation closes the contextual-bandit gap; format-hack surface explicitly closed; trainer instantiated once with optimizer state preserved. |
## 7. References
* Reward engineering survey – [arXiv:2408.10215](https://arxiv.org/abs/2408.10215)
* Reward engineering for software RL – [arXiv:2601.19100](https://arxiv.org/abs/2601.19100)
* OpenEnv – https://github.com/meta-pytorch/OpenEnv
* OpenEnv Rubric RFC – [`rfcs/004-rubrics.md`](https://github.com/meta-pytorch/OpenEnv)