---
title: SalesPath Environment
emoji: 🤝
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: OpenEnv RL gym for training B2B sales agents via GRPO
---

# SalesPath: RL Environment for B2B Sales Agents

> **An OpenEnv-compliant gym for teaching an LLM to follow a multi-step,
> rule-governed B2B sales workflow with programmatic verification at every
> step. Targets the Scale AI bonus track on long-horizon non-code business
> workflows.**

* **Theme**: #2 - Long-Horizon Planning & Instruction Following
* **Bonus track**: Scale AI - Sales / PM / HR & IT workflows
* **HF Space**: https://huggingface.co/spaces/Lomesh7777/openenv-multi-agent-RL
* **Blog post**: _add link before submission_
* **Demo video (≤2 min)**: _add link before submission_

---

## 1. Problem

Off-the-shelf LLMs prompted to act as sales agents reliably break the fundamentals of B2B selling: they pitch before qualifying, offer discounts before establishing value, and ignore ordering constraints that real sales orgs treat as inviolable. Not because they lack the knowledge, but because no training environment has ever penalised these behaviours.

SalesPath is that environment. The agent navigates a 3-to-8-step workflow against a deterministic `ProspectSimulator`, and at every turn the environment programmatically verifies nine business rules (R01..R09). A composed [OpenEnv `Rubric`](salespath_env/server/reward.py) emits a dense five-component reward.

## 2. Environment

### Observation

```jsonc
{
  "prospect_response": "...",
  "workflow_stage": "PRESENT",
  "constraints_violated": ["R01"],
  "steps_completed": ["PROSPECT", "PRESENT"],
  "turn_number": 3,
  "reward": -0.18,
  "reward_components": { "r_outcome": 0.0, "r_compliance": -0.2, ... },
  "done": false
}
```

### Action

| Action | When to use |
|---|---|
| `PROSPECT` | Opening turn only: initial outreach |
| `QUALIFY` | Uncover budget, decision maker, pain points |
| `PRESENT` | Pitch the solution (requires `QUALIFY` first) |
| `HANDLE_OBJECTION` | Respond to pricing / timing objections |
| `OFFER_DEMO` | Schedule a live product demo |
| `NEGOTIATE` | Discuss pricing/terms (requires `OFFER_DEMO` + known budget) |
| `CLOSE` | Attempt to sign the deal |
| `FOLLOW_UP` | Re-engage after prospect silence |
| `DISQUALIFY` | End the conversation (only valid for low-budget, no-DM prospects) |

The action carries a `format_ok` flag set by the agent's parser. A malformed completion that happens to coerce to a valid `action_type` is still penalised by the `FormatRubric`, closing the silent format-hack surface from v1.

### Business rules (R01..R09)

| Rule | Description |
|---|---|
| R01 | Must `QUALIFY` before `PRESENT` |
| R02 | Must `OFFER_DEMO` before `NEGOTIATE` |
| R03 | Cannot `NEGOTIATE` while budget is unknown |
| R04 | Discount in `NEGOTIATE` only after 2 objections handled |
| R05 | Cannot repeat the same action on consecutive turns |
| R06 | First action must be `PROSPECT` |
| R07 | `FOLLOW_UP` only valid after prospect silence (stall) |
| R08 | `DISQUALIFY` valid only when `budget < threshold AND no decision_maker` |
| R09 | Must `OFFER_DEMO` before `CLOSE` (difficulty 2+) |

### Reward: a composed Rubric

`SalesPathRubric` is a `WeightedSum` over five sub-rubrics, each registered as an OpenEnv `Rubric` so external tooling can introspect per-component scores via `env.rubric.named_rubrics()`.
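Conceptually, the composition reduces to a weighted sum of per-component scores. The sketch below is illustrative only: the class and the lambda sub-rubrics are self-contained stand-ins, not the actual OpenEnv API (the real definitions live in `salespath_env/server/reward.py`), and it uses the weights from the table that follows.

```python
# Self-contained sketch of the weighted-sum idea. All names and signatures
# here are stand-ins, NOT the actual OpenEnv API; the real sub-rubrics live
# in salespath_env/server/reward.py.
from dataclasses import dataclass
from typing import Callable, Dict

StepRecord = dict  # one turn's worth of env state, as in the Observation above

@dataclass
class WeightedSumSketch:
    rubrics: Dict[str, Callable[[StepRecord], float]]
    weights: Dict[str, float]

    def score(self, step: StepRecord) -> float:
        # total reward = sum over components of weight_i * sub_score_i
        return sum(self.weights[n] * fn(step) for n, fn in self.rubrics.items())

    def named_rubrics(self):
        # lets external tooling introspect each component separately
        return self.rubrics.items()

rubric = WeightedSumSketch(
    rubrics={
        # -0.2 per new rule violation, capped at -1.0
        "compliance": lambda s: max(-0.2 * len(s.get("new_violations", [])), -1.0),
        # +1.0 for a cleanly parsed action, -0.3 otherwise
        "format": lambda s: 1.0 if s.get("format_ok") else -0.3,
        # outcome / ordering / efficiency elided for brevity
    },
    weights={"compliance": 0.40, "format": 0.10},
)
```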
| Component | Weight | Type | What it captures |
|---|---|---|---|
| `compliance` | 0.40 | per-turn | -0.2 per new rule violation, capped at -1.0 |
| `outcome` | 0.20 | terminal | +1.0 success / +0.5 valid disqualify / -0.7 violation termination |
| `ordering` | 0.20 | per-turn | **potential-based**: Δ correct-prefix length per turn (arXiv:2408.10215 §4.2) |
| `efficiency` | 0.10 | terminal | -0.05 per turn over the per-difficulty optimum |
| `format` | 0.10 | per-turn | +1.0 valid+parsed / -0.3 if `format_ok=False` or invalid `action_type` |

Why these weights: arXiv:2601.19100 §3.1 argues that for long-horizon structured-output tasks the *process* signal must dominate the sparse *outcome* signal, so compliance gets 2× the weight of outcome.

### Difficulty curriculum

| Level | Description | Correct terminal action |
|---|---|---|
| 1 | Budget known, decision maker present | `CLOSE` |
| 2 | Budget hidden, 1 objection, demo required | `CLOSE` |
| 3 | Budget hidden, 2 objections, possible stalling | `CLOSE` |
| 4 | Adversarial: misleading high-budget signal, no decision maker | `DISQUALIFY` |

The task bank carries ~20 prospect profiles per level (`task_bank.py`); the last 4 of each level are held out for `eval_baseline_vs_trained.py`.

## 3. Training pipeline

```
sft_demos.jsonl → train_sft.py → ./sft_checkpoint
                                       │
                                       ▼
                                 train_grpo.py
                                       │  on-policy rollouts in SalesPathEnvironment
                                       ▼
                                ./grpo_checkpoint
                                       │
                 ┌─────────────────────┴─────────────────────┐
                 ▼                                           ▼
         plot_rewards.py                     eval_baseline_vs_trained.py
                 │                                           │
                 ▼                                           ▼
  ./plots/reward_curve.png                       ./eval_results.md
```

### What's specifically engineered for fast Colab/Kaggle GPUs

* **Batched rollouts**: N parallel episodes, a single `.generate()` call per turn (left-padded for correctness).
* **Threaded reward fn**: reward computation across GRPO's group of candidate completions runs in a `ThreadPoolExecutor` (the env is rule-based and CPU-cheap, so threads overlap with GPU forward passes).
* **State snapshots keyed by SHA1**: the `STATE_BANK` trick lets GRPO score single-action completions against a frozen state, avoiding full episode re-rollouts during the gradient step.
* **N-step shaping** (`GAMMA=0.95`): `true_env_reward_fn` extends the immediate per-turn reward with a discounted heuristic continuation, so GRPO sees credit for actions that pay off later. This is what gives this contextual-bandit-shaped problem a real long-horizon signal.
* **Optional vLLM**: `USE_VLLM=1` switches to TRL's vLLM-backed sampler for ~3× faster on-policy generation on an A100 or Kaggle P100.
* **Trainer-once**: `GRPOTrainer` is constructed once and trained once, preserving optimizer and LR-scheduler state across all gradient steps.

### Commands

```bash
# 0. Smoke test (~30 sec, no GPU)
python training/train_test.py

# 1. SFT warm-start (~10-15 min on a T4)
python training/train_sft.py

# 2. Start the env server and run GRPO (~45-90 min on a T4)
uvicorn salespath_env.server.app:app --port 7860 &
SFT_CHECKPOINT=./sft_checkpoint USE_VLLM=0 python training/train_grpo.py

# 3. Plot reward curves
python training/plot_rewards.py

# 4. Baseline-vs-trained head-to-head on the held-out eval split
python training/eval_baseline_vs_trained.py \
  --base ./sft_checkpoint --trained ./grpo_checkpoint --episodes-per-level 8
```
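Once the server from step 2 is up, a quick sanity check against the HTTP API might look like the sketch below. The `/reset` and `/step` routes and the payload fields are assumptions based on typical OpenEnv-style HTTP environments; verify them against `salespath_env/server/app.py` before relying on this.

```python
# Hypothetical smoke test against a running SalesPath server. Routes and
# payload fields are assumptions (typical of OpenEnv-style HTTP envs), not
# the verified contract; see salespath_env/server/app.py for the real one.
import requests

BASE = "http://localhost:7860"

# Start an episode and inspect the first observation.
obs = requests.post(f"{BASE}/reset", json={"difficulty": 1}).json()
print(obs)  # expect the observation shape shown in section 2

# R06: the first action must be PROSPECT.
obs = requests.post(
    f"{BASE}/step",
    json={"action_type": "PROSPECT", "format_ok": True},
).json()
print(obs["reward"], obs["constraints_violated"])  # expect no violations
```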