---
title: SalesPath Environment
emoji: 🤝
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: OpenEnv RL gym for training B2B sales agents via GRPO
---
# SalesPath: RL Environment for B2B Sales Agents
> **An OpenEnv-compliant gym for teaching an LLM to follow a multi-step,
> rule-governed B2B sales workflow with programmatic verification at every
> step. Targets the Scale AI bonus track on long-horizon non-code business
> workflows.**
* **Theme**: #2 – Long-Horizon Planning & Instruction Following
* **Bonus track**: Scale AI – Sales / PM / HR & IT workflows
* **HF Space**: https://huggingface.co/spaces/Lomesh7777/openenv-multi-agent-RL
* **Blog post**: _add link before submission_
* **Demo video (≤2 min)**: _add link before submission_
---
## 1. Problem
Off-the-shelf LLMs prompted to act as sales agents reliably break the
fundamentals of B2B selling: they pitch before qualifying, offer discounts
before establishing value, and ignore ordering constraints that real sales
orgs treat as inviolable. This is not because they lack knowledge, but
because no training environment ever penalised these behaviours.
SalesPath is that environment.
The agent navigates a 3-to-8-step workflow against a deterministic
`ProspectSimulator`, and at every turn the environment programmatically
verifies nine business rules (R01–R09). A composed
[OpenEnv `Rubric`](salespath_env/server/reward.py) emits a dense
five-component reward. A minimal client loop is sketched below.
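For orientation, here is a minimal client-side episode loop against a running
server. The class name `SalesPathClient`, the `reset`/`step` signatures, and
the scripted policy are illustrative assumptions; the actual interface lives
in `salespath_env/client.py` and `salespath_env/models.py`.
```python
# Hypothetical client API; check salespath_env/client.py for the real names.
from salespath_env.client import SalesPathClient
from salespath_env.models import SalesPathAction

client = SalesPathClient(base_url="http://localhost:7860")
obs = client.reset(difficulty=1)  # draw a level-1 prospect profile

# Scripted placeholder policy that respects the ordering rules (R01/R02/R06).
# A real agent would instead generate each action with an LLM conditioned on
# obs.prospect_response and the workflow state.
for step in ["PROSPECT", "QUALIFY", "PRESENT", "OFFER_DEMO", "NEGOTIATE", "CLOSE"]:
    if obs.done:
        break
    obs = client.step(SalesPathAction(action_type=step, format_ok=True))
    print(obs.turn_number, obs.reward, obs.constraints_violated)
```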
## 2. Environment
### Observation
```jsonc
{
"prospect_response": "...",
"workflow_stage": "PRESENT",
"constraints_violated": ["R01"],
"steps_completed": ["PROSPECT", "PRESENT"],
"turn_number": 3,
"reward": -0.18,
"reward_components": { "r_outcome": 0.0, "r_compliance": -0.2, ... },
"done": false
}
```
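The same payload as an illustrative dataclass; field types are inferred from
the example above, and the canonical definitions live in
`salespath_env/models.py`.
```python
from dataclasses import dataclass

@dataclass
class SalesPathObservation:
    prospect_response: str               # simulator's latest reply
    workflow_stage: str                  # e.g. "PRESENT"
    constraints_violated: list[str]      # rule IDs newly violated this turn
    steps_completed: list[str]           # actions taken so far, in order
    turn_number: int
    reward: float                        # weighted sum of the five components
    reward_components: dict[str, float]  # per-rubric scores
    done: bool
```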
### Action
| Action | When to use |
|---|---|
| `PROSPECT` | Opening turn only (initial outreach) |
| `QUALIFY` | Uncover budget, decision maker, pain points |
| `PRESENT` | Pitch the solution (requires `QUALIFY` first) |
| `HANDLE_OBJECTION` | Respond to pricing / timing objections |
| `OFFER_DEMO` | Schedule a live product demo |
| `NEGOTIATE` | Discuss pricing/terms (requires `OFFER_DEMO` + known budget) |
| `CLOSE` | Attempt to sign the deal |
| `FOLLOW_UP` | Re-engage after prospect silence |
| `DISQUALIFY` | End the conversation (only valid for low-budget, no-DM prospects) |
The action carries a `format_ok` flag set by the agent's parser. A malformed
completion that happens to coerce to a valid `action_type` is still penalised
by the `FormatRubric`, closing the silent format-hack surface from v1 (see
the parser sketch below).
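A sketch of that parsing contract, assuming the agent must emit exactly one
action token (the regex fallback and function name are illustrative; the real
parser is agent-side code):
```python
from __future__ import annotations
import re

VALID_ACTIONS = {
    "PROSPECT", "QUALIFY", "PRESENT", "HANDLE_OBJECTION", "OFFER_DEMO",
    "NEGOTIATE", "CLOSE", "FOLLOW_UP", "DISQUALIFY",
}

def parse_completion(text: str) -> tuple[str | None, bool]:
    """Return (action_type, format_ok)."""
    strict = text.strip()
    if strict in VALID_ACTIONS:
        return strict, True  # well-formed: exactly one action token
    # Lenient fallback: salvage the first action-like token so the episode
    # can continue, but report format_ok=False so the FormatRubric still
    # applies its -0.3 penalty. Coercion can't silently score as clean.
    match = re.search("|".join(VALID_ACTIONS), text)
    return (match.group(0), False) if match else (None, False)
```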
### Business rules (R01–R09)
| Rule | Description |
|---|---|
| R01 | Must `QUALIFY` before `PRESENT` |
| R02 | Must `OFFER_DEMO` before `NEGOTIATE` |
| R03 | Cannot `NEGOTIATE` while budget is unknown |
| R04 | Discount in `NEGOTIATE` only after 2 objections handled |
| R05 | Cannot repeat the same action on consecutive turns |
| R06 | First action must be `PROSPECT` |
| R07 | `FOLLOW_UP` only valid after prospect silence (stall) |
| R08 | `DISQUALIFY` valid only when `budget < threshold AND no decision_maker` |
| R09 | Must `OFFER_DEMO` before `CLOSE` (difficulty 2+) |
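Each rule is a deterministic predicate over the episode state and the
proposed action; there is no LLM in the verifier. A sketch of R01 and R08
(state field names are hypothetical; the real checks live in
`salespath_env/server/rules.py`):
```python
def violates_r01(state, action_type: str) -> bool:
    """R01: must QUALIFY before PRESENT."""
    return action_type == "PRESENT" and "QUALIFY" not in state.steps_completed

def violates_r08(state, action_type: str) -> bool:
    """R08: DISQUALIFY only when budget < threshold AND no decision maker."""
    if action_type != "DISQUALIFY":
        return False
    low_budget = state.budget is not None and state.budget < state.budget_threshold
    return not (low_budget and not state.has_decision_maker)
```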
### Reward: composed Rubric
`SalesPathRubric` is a `WeightedSum` over five sub-rubrics, each registered
as an OpenEnv `Rubric` so external tooling can introspect per-component
scores via `env.rubric.named_rubrics()`.
| Component | Weight | Type | What it captures |
|---|---|---|---|
| `compliance` | 0.40 | per-turn | -0.2 per new rule violation, capped at -1.0 |
| `outcome` | 0.20 | terminal | +1.0 success / +0.5 valid disqualify / -0.7 violation termination |
| `ordering` | 0.20 | per-turn | **potential-based**: Δ correct-prefix length per turn (arXiv:2408.10215 §4.2) |
| `efficiency` | 0.10 | terminal | -0.05 per turn over the per-difficulty optimum |
| `format` | 0.10 | per-turn | +1.0 valid+parsed / -0.3 if `format_ok=False` or invalid action_type |
Why these weights: arXiv:2601.19100 §3.1 argues that for long-horizon
structured-output tasks the *process* signal must dominate the sparse
*outcome* signal. We give compliance 2× the weight of outcome.
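A self-contained sketch of the composition and of the potential-based
ordering component. The real code subclasses OpenEnv's `Rubric`/`WeightedSum`;
the class and field names below are illustrative, not the OpenEnv API.
```python
from typing import Callable

class WeightedSumSketch:
    """Maps name -> (weight, scoring fn); what named_rubrics() would expose."""
    def __init__(self, components: dict[str, tuple[float, Callable]]):
        self.components = components

    def named_rubrics(self) -> dict:
        return dict(self.components)

    def score(self, state, action) -> tuple[float, dict[str, float]]:
        parts = {name: fn(state, action)
                 for name, (_, fn) in self.components.items()}
        total = sum(weight * parts[name]
                    for name, (weight, _) in self.components.items())
        return total, parts  # -> obs.reward, obs.reward_components

def ordering_score(state, action) -> float:
    # Potential-based shaping: Phi(s) = length of the longest correct
    # workflow prefix completed so far. Rewarding the per-turn delta
    # Phi(s') - Phi(s) telescopes over the episode, so the shaping cannot
    # change which policy is optimal.
    return state.prefix_len_after - state.prefix_len_before
```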
### Difficulty curriculum
| Level | Description | Correct terminal action |
|---|---|---|
| 1 | Budget known, decision maker present | `CLOSE` |
| 2 | Budget hidden, 1 objection, demo required | `CLOSE` |
| 3 | Budget hidden, 2 objections, possible stalling | `CLOSE` |
| 4 | Adversarial: misleading high-budget signal, no decision maker | `DISQUALIFY` |
The task bank carries ~20 prospect profiles per level (`task_bank.py`); the
last 4 of each level are held out for `eval_baseline_vs_trained.py` (see the
split sketch below).
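The held-out split itself is trivial; a hypothetical version of the helper
in `task_bank.py`:
```python
def train_eval_split(profiles: list[dict]) -> tuple[list[dict], list[dict]]:
    """Per level: everything but the last 4 profiles is used for training
    rollouts; the last 4 are reserved for eval_baseline_vs_trained.py."""
    return profiles[:-4], profiles[-4:]
```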
## 3. Training pipeline
```
sft_demos.jsonl → train_sft.py → ./sft_checkpoint
                                        │
                                        ▼
                                  train_grpo.py
                                        │
                              on-policy rollouts in
                              SalesPathEnvironment
                                        │
                                        ▼
                                 ./grpo_checkpoint
                                        │
                   ┌────────────────────┴────────────────────┐
                   ▼                                         ▼
           plot_rewards.py                   eval_baseline_vs_trained.py
                   │                                         │
                   ▼                                         ▼
     ./plots/reward_curve.png                        ./eval_results.md
```
### What's specifically engineered for fast Colab/Kaggle GPUs
* **Batched rollouts**: N parallel episodes, a single `.generate()` call per
  turn (left-padded for correctness).
* **Threaded reward fn**: reward computation across GRPO's group of
  candidate completions runs in a `ThreadPoolExecutor` (the env is
  rule-based and CPU-cheap, so threads overlap with GPU forward passes).
* **State snapshots keyed by SHA1**: the `STATE_BANK` trick lets GRPO score
  single-action completions against a frozen state, avoiding full episode
  re-rollouts during the gradient step (see the sketch after this list).
* **N-step shaping** (`GAMMA=0.95`): `true_env_reward_fn` extends the
  immediate per-turn reward with a discounted heuristic continuation, so
  GRPO sees credit for actions that pay off later. This is what gives the
  otherwise contextual-bandit-shaped problem a real long-horizon signal
  (also sketched after this list).
* **Optional vLLM**: `USE_VLLM=1` switches on TRL's vLLM-backed sampler for
  ~3× faster on-policy generation (recommended on A100 only; see the
  env-var table below).
* **Trainer-once**: `GRPOTrainer` is constructed once and trained once,
  preserving optimizer and LR-scheduler state across all gradient steps.
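Sketches of the state-bank and shaping tricks above, with hypothetical names
(the real versions live in `training/train_grpo.py`):
```python
import hashlib
import json

GAMMA = 0.95  # n-step continuation discount

def state_key(state_dict: dict) -> str:
    """STATE_BANK key: SHA1 over canonical JSON, so a single-action GRPO
    completion can be scored against a frozen snapshot of its episode."""
    payload = json.dumps(state_dict, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()

def shaped_reward(immediate: float, heuristic_continuation: list[float]) -> float:
    """N-step shaping: add a GAMMA-discounted heuristic continuation to the
    immediate per-turn reward so later payoffs still assign credit now."""
    return immediate + sum(GAMMA ** (k + 1) * r
                           for k, r in enumerate(heuristic_continuation))
```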
### Commands
```bash
# 0. Smoke test (~30 sec, no GPU)
python training/train_test.py
# 1. SFT warm-start (~10–15 min on a T4)
python training/train_sft.py
# 2. Start the env server and run GRPO (~45–90 min on a T4)
uvicorn salespath_env.server.app:app --port 7860 &
SFT_CHECKPOINT=./sft_checkpoint USE_VLLM=0 python training/train_grpo.py
# 3. Plot reward curves
python training/plot_rewards.py
# 4. Baseline-vs-trained head-to-head on the held-out eval split
python training/eval_baseline_vs_trained.py \
--base ./sft_checkpoint --trained ./grpo_checkpoint --episodes-per-level 8
```
### Useful env vars for Colab/Kaggle tuning
| Var | Default | Notes |
|---|---|---|
| `ROLLOUTS_PER_DIFFICULTY` | 8 | More → bigger / more diverse state bank |
| `NUM_GENERATIONS` | 4 | GRPO group size; on T4 keep ≤4 to fit VRAM |
| `PER_DEVICE_BATCH` | 2 | T4 / Kaggle P100 default |
| `GRAD_ACCUM` | 4 | Effective batch = 8 |
| `NUM_REWARD_WORKERS` | 8 | Threadpool size for the reward fn |
| `USE_VLLM` | 0 | Set to `1` on A100 only |
| `BETA` | 0.05 | KL-to-reference penalty |
| `GAMMA` | 0.95 | n-step continuation discount |
## 4. Results
After ~1 GRPO pass (eval on the **held-out** profiles, 8 episodes per level):
> See `eval_results.md` (regenerated by `eval_baseline_vs_trained.py`)
> and `plots/reward_curve.png` (regenerated by `plot_rewards.py`).
The conservative target table from the proposal:
| Metric | Base | After GRPO (target) |
|---|---|---|
| Rule violations per episode | 3.5 | < 0.5 |
| Correct step ordering rate | 0.45 | > 0.85 |
| Successful close rate (L1) | 0.30 | > 0.75 |
| Correct disqualification rate (L4) | 0.20 | > 0.65 |
| Mean episode reward | ~0.10 | > 0.6 |
## 5. File layout
```
salespath-env/
├── salespath_env/
│   ├── __init__.py                 ← public API exports
│   ├── client.py                   ← HTTP client for the env
│   ├── models.py                   ← Action / Observation / State + format_ok
│   ├── openenv.yaml                ← OpenEnv manifest (spec_version: 1)
│   └── server/
│       ├── app.py                  ← Custom stateful FastAPI (HF Spaces)
│       ├── salespath_environment.py
│       ├── prospect_simulator.py   ← Deterministic, state-seeded
│       ├── rules.py                ← R01–R09
│       ├── reward.py               ← SalesPathRubric (WeightedSum of 5)
│       └── task_bank.py            ← 19–20 profiles/level + held-out split
├── training/
│   ├── sft_demos.jsonl
│   ├── train_test.py               ← smoke test + bug regression
│   ├── train_sft.py
│   ├── train_grpo.py               ← GRPO + n-step + parallel reward fn
│   ├── eval_baseline_vs_trained.py
│   └── plot_rewards.py
├── Dockerfile
├── requirements.txt
└── pyproject.toml
```
## 6. Why this design wins on the rubric
| Criterion (weight) | How we hit it |
|---|---|
| **Environment Innovation (40%)** | Business workflow with programmatic verification, deterministic rule-based simulator (no LLM in the verifier, which prevents reward hacking via prompt manipulation), 4-level curriculum with held-out eval, OpenEnv `Rubric` composition. |
| **Storytelling (30%)** | The sales workflow is legible to any reader in 10 seconds. The before/after table from `eval_baseline_vs_trained.py` is the headline. A live-demo script covers 0:30–1:30 of the demo plan. |
| **Improvement in Rewards (20%)** | Five tracked metrics, dense per-turn signal, reward curves with min/max band and difficulty-step markers, baseline vs trained eval table. |
| **Reward & Pipeline (10%)** | Composed Rubric system; potential-based ordering shaping (no policy distortion); n-step continuation closes the contextual-bandit gap; format-hack surface explicitly closed; trainer instantiated once with optimizer state preserved. |
## 7. References
* Reward engineering survey – [arXiv:2408.10215](https://arxiv.org/abs/2408.10215)
* Reward engineering for software RL – [arXiv:2601.19100](https://arxiv.org/abs/2601.19100)
* OpenEnv – https://github.com/meta-pytorch/OpenEnv
* OpenEnv Rubric RFC – [`rfcs/004-rubrics.md`](https://github.com/meta-pytorch/OpenEnv)