---
title: SalesPath Environment
emoji: 🤝
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: OpenEnv RL gym for training B2B sales agents via GRPO
---

# SalesPath — RL Environment for B2B Sales Agents

An OpenEnv-compliant gym for teaching an LLM to follow a multi-step, rule-governed B2B sales workflow with programmatic verification at every step. Targets the Scale AI bonus track on long-horizon non-code business workflows.


## 1. Problem

Off-the-shelf LLMs prompted to act as sales agents reliably break the fundamentals of B2B selling: they pitch before qualifying, offer discounts before establishing value, and ignore ordering constraints that real sales orgs treat as inviolable. Not because they lack knowledge, but because no training environment has ever penalised these behaviours.

SalesPath is that environment.

The agent navigates a 3-to-8 step workflow against a deterministic ProspectSimulator, and at every turn the environment programmatically verifies nine business rules (R01..R09). A composed OpenEnv Rubric emits a dense five-component reward.

## 2. Environment

### Observation

```json
{
  "prospect_response":     "...",
  "workflow_stage":        "PRESENT",
  "constraints_violated":  ["R01"],
  "steps_completed":       ["PROSPECT", "PRESENT"],
  "turn_number":           3,
  "reward":                -0.18,
  "reward_components":     { "r_outcome": 0.0, "r_compliance": -0.2, ... },
  "done":                  false
}
```
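For orientation, here is a minimal client loop against the served environment. The `/reset` and `/step` routes, their payload shapes, and the stub policy are illustrative assumptions, not the confirmed server API; see `salespath_env/server/app.py` for the routes the server actually exposes.

```python
import requests

BASE = "http://localhost:7860"

def pick_action(obs: dict) -> str:
    # Stand-in policy: open with PROSPECT, then always QUALIFY. A real agent
    # generates actions with the LLM; this stub will quickly trip R05 (the
    # no-consecutive-repeats rule) and collect compliance penalties.
    return "PROSPECT" if not obs["steps_completed"] else "QUALIFY"

obs = requests.post(f"{BASE}/reset", json={"difficulty": 1}).json()
while not obs["done"]:
    payload = {"action_type": pick_action(obs), "message": "...", "format_ok": True}
    obs = requests.post(f"{BASE}/step", json=payload).json()
    print(obs["workflow_stage"], obs["reward"], obs["constraints_violated"])
```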

### Action

| Action | When to use |
|---|---|
| `PROSPECT` | Opening turn only — initial outreach |
| `QUALIFY` | Uncover budget, decision maker, pain points |
| `PRESENT` | Pitch the solution (requires `QUALIFY` first) |
| `HANDLE_OBJECTION` | Respond to pricing / timing objections |
| `OFFER_DEMO` | Schedule a live product demo |
| `NEGOTIATE` | Discuss pricing/terms (requires `OFFER_DEMO` + known budget) |
| `CLOSE` | Attempt to sign the deal |
| `FOLLOW_UP` | Re-engage after prospect silence |
| `DISQUALIFY` | End the conversation (only valid for low-budget, no-decision-maker prospects) |

The action carries a `format_ok` flag set by the agent's parser. A malformed completion that happens to coerce to a valid `action_type` is still penalised by the FormatRubric — closing the silent format-hack surface from v1.
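A sketch of that parsing step. The expected completion format and the `Action` shape are assumptions for illustration, not the repo's actual parser:

```python
import re
from dataclasses import dataclass

VALID_ACTIONS = {
    "PROSPECT", "QUALIFY", "PRESENT", "HANDLE_OBJECTION", "OFFER_DEMO",
    "NEGOTIATE", "CLOSE", "FOLLOW_UP", "DISQUALIFY",
}

@dataclass
class Action:
    action_type: str
    message: str
    format_ok: bool  # set here, scored by FormatRubric

def parse_completion(text: str) -> Action:
    """Parse 'ACTION: <TYPE> MESSAGE: <text>' completions (assumed format)."""
    m = re.search(r"ACTION:\s*([A-Z_]+)\s+MESSAGE:\s*(.*)", text, re.DOTALL)
    if m and m.group(1) in VALID_ACTIONS:
        return Action(m.group(1), m.group(2).strip(), format_ok=True)
    # Salvage any action-like token, but keep format_ok=False so the
    # FormatRubric still applies its -0.3 penalty to the format hack.
    token = next((a for a in VALID_ACTIONS if a in text.upper()), "FOLLOW_UP")
    return Action(token, text.strip(), format_ok=False)
```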

### Business rules (R01..R09)

| Rule | Description |
|---|---|
| R01 | Must QUALIFY before PRESENT |
| R02 | Must OFFER_DEMO before NEGOTIATE |
| R03 | Cannot NEGOTIATE while budget is unknown |
| R04 | Discount in NEGOTIATE only after 2 objections handled |
| R05 | Cannot repeat the same action on consecutive turns |
| R06 | First action must be PROSPECT |
| R07 | FOLLOW_UP only valid after prospect silence (stall) |
| R08 | DISQUALIFY valid only when budget < threshold AND no decision_maker |
| R09 | Must OFFER_DEMO before CLOSE (difficulty 2+) |
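Every rule is a pure function of the conversation state, so verification is deterministic and cheap. A sketch of how a few of these checks might look; the state field names are assumptions, not the env's actual schema:

```python
def check_rules(state, action: str) -> list[str]:
    """Return the IDs of rules the proposed action would violate."""
    v = []
    if action == "PRESENT" and "QUALIFY" not in state.steps_completed:
        v.append("R01")  # must qualify before pitching
    if action == "NEGOTIATE" and "OFFER_DEMO" not in state.steps_completed:
        v.append("R02")  # demo before negotiating
    if action == "NEGOTIATE" and not state.budget_known:
        v.append("R03")  # never negotiate blind
    if action == state.last_action:
        v.append("R05")  # no consecutive repeats
    if not state.steps_completed and action != "PROSPECT":
        v.append("R06")  # every conversation opens with outreach
    return v
```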

### Reward — composed Rubric

`SalesPathRubric` is a `WeightedSum` over five sub-rubrics, each registered as an OpenEnv `Rubric` so external tooling can introspect per-component scores via `env.rubric.named_rubrics()`.

| Component | Weight | Type | What it captures |
|---|---|---|---|
| compliance | 0.40 | per-turn | -0.2 per new rule violation, capped at -1.0 |
| outcome | 0.20 | terminal | +1.0 success / +0.5 valid disqualify / -0.7 violation termination |
| ordering | 0.20 | per-turn | potential-based — Δ correct-prefix length per turn (arXiv:2408.10215 §4.2) |
| efficiency | 0.10 | terminal | -0.05 per turn over the per-difficulty optimum |
| format | 0.10 | per-turn | +1.0 valid+parsed / -0.3 if `format_ok=False` or invalid `action_type` |
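A minimal sketch of the weighted composition and of the potential-based ordering term. `named_rubrics()` is the documented introspection hook; the `CANONICAL` ordering and the helper names are assumptions for illustration:

```python
WEIGHTS = {
    "compliance": 0.40, "outcome": 0.20, "ordering": 0.20,
    "efficiency": 0.10, "format": 0.10,
}

# Assumed canonical workflow ordering, for illustration only.
CANONICAL = ["PROSPECT", "QUALIFY", "PRESENT", "OFFER_DEMO", "NEGOTIATE", "CLOSE"]

def correct_prefix_len(steps_completed: list[str]) -> int:
    """Length of the longest canonical prefix the agent has completed."""
    n = 0
    for expected, actual in zip(CANONICAL, steps_completed):
        if expected != actual:
            break
        n += 1
    return n

def ordering_score(prev_steps: list[str], next_steps: list[str]) -> int:
    # Potential-based shaping (Ng et al., 1999): reward the *change* in
    # potential, here the correct-prefix length, so the optimal policy is
    # unchanged by the shaping term.
    return correct_prefix_len(next_steps) - correct_prefix_len(prev_steps)

def composed_reward(components: dict[str, float]) -> float:
    """WeightedSum over the five sub-rubric scores."""
    return sum(WEIGHTS[name] * score for name, score in components.items())

# Per-component introspection, as exposed by the env:
#   for name, rubric in env.rubric.named_rubrics(): ...
```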

Why these weights: arXiv:2601.19100 §3.1 argues that for long-horizon structured-output tasks the process signal must dominate the sparse outcome signal, so compliance gets 2× the weight of outcome.

### Difficulty curriculum

| Level | Description | Correct terminal action |
|---|---|---|
| 1 | Budget known, decision maker present | CLOSE |
| 2 | Budget hidden, 1 objection, demo required | CLOSE |
| 3 | Budget hidden, 2 objections, possible stalling | CLOSE |
| 4 | Adversarial: misleading high-budget signal, no decision maker | DISQUALIFY |

The task bank carries ~20 prospect profiles per level (`task_bank.py`); the last 4 of each level are held out for `eval_baseline_vs_trained.py`.
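A sketch of how the curriculum levels and the held-out split might be materialised; the profile fields, the level-4 objection count, and the template values are assumptions derived from the table above:

```python
from dataclasses import dataclass

@dataclass
class ProspectProfile:
    level: int
    budget_known: bool
    n_objections: int
    can_stall: bool
    has_decision_maker: bool
    correct_terminal: str  # "CLOSE" or "DISQUALIFY"

LEVEL_TEMPLATES = {
    1: ProspectProfile(1, True,  0, False, True,  "CLOSE"),
    2: ProspectProfile(2, False, 1, False, True,  "CLOSE"),
    3: ProspectProfile(3, False, 2, True,  True,  "CLOSE"),
    4: ProspectProfile(4, False, 2, False, False, "DISQUALIFY"),  # adversarial
}

def split_bank(bank: list[ProspectProfile], holdout: int = 4):
    """Reserve the last `holdout` profiles of each level for evaluation."""
    train, evals = {}, {}
    for lvl in LEVEL_TEMPLATES:
        profiles = [p for p in bank if p.level == lvl]
        train[lvl], evals[lvl] = profiles[:-holdout], profiles[-holdout:]
    return train, evals
```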

## 3. Training pipeline

```
sft_demos.jsonl  →  train_sft.py  →  ./sft_checkpoint
                                         │
                                         ▼
                                  train_grpo.py
                                         │
                            on-policy rollouts in
                            SalesPathEnvironment
                                         │
                                         ▼
                                  ./grpo_checkpoint
                                         │
                       ┌─────────────────┴─────────────────┐
                       ▼                                   ▼
                plot_rewards.py            eval_baseline_vs_trained.py
                       │                                   │
                       ▼                                   ▼
              ./plots/reward_curve.png            ./eval_results.md
```

### What's specifically engineered for fast Colab/Kaggle GPUs

- **Batched rollouts** — N parallel episodes, a single `.generate()` call per turn (left-padded for correctness).
- **Threaded reward fn** — reward computation across GRPO's group of candidate completions runs in a `ThreadPoolExecutor` (the env is rule-based / CPU-cheap, so threads overlap with GPU forward passes).
- **State snapshots keyed by SHA1** — the `STATE_BANK` trick lets GRPO score single-action completions against a frozen state, avoiding full episode re-rollouts during the gradient step (see the sketch after this list).
- **N-step shaping (`GAMMA=0.95`)** — `true_env_reward_fn` extends the immediate per-turn reward with a discounted heuristic continuation, so GRPO sees credit for actions that pay off later. This is what gives this contextual-bandit-shaped problem a real long-horizon signal.
- **Optional vLLM** — `USE_VLLM=1` flips on TRL's vLLM-backed sampler for ~3× faster on-policy generation on A100/Kaggle P100.
- **Trainer-once** — `GRPOTrainer` is constructed once and trained once, preserving optimizer and LR-scheduler state across all gradient steps.
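A condensed sketch of the state-bank and shaping tricks. The `STATE_BANK`, `GAMMA`, and `true_env_reward_fn` names come from the description above; everything else (the stubbed env internals, the exact signatures) is an illustrative assumption rather than the repo's actual code:

```python
import hashlib
import pickle
from concurrent.futures import ThreadPoolExecutor

GAMMA = 0.95
STATE_BANK: dict[str, bytes] = {}  # SHA1 of a serialized state -> frozen snapshot

def snapshot(state) -> str:
    """Freeze the current env state and return its SHA1 key."""
    blob = pickle.dumps(state)
    key = hashlib.sha1(blob).hexdigest()
    STATE_BANK[key] = blob
    return key  # carried in rollout metadata so the reward fn can find it

# Stubs standing in for the real env internals (illustrative only):
def apply_action(state, completion): ...                # parse + step the simulator
def immediate_reward(state, completion) -> float: ...   # per-turn rubric score
def heuristic_continuation(state) -> float: ...         # cheap value estimate

def true_env_reward_fn(key: str, completion: str) -> float:
    """Score a single-action completion against a frozen snapshot."""
    state = pickle.loads(STATE_BANK[key])  # restore; no episode re-rollout
    r = immediate_reward(state, completion)
    # N-step shaping: add a discounted heuristic estimate of the return from
    # the successor state, so single-action scoring keeps a long-horizon signal.
    return r + GAMMA * heuristic_continuation(apply_action(state, completion))

def batched_rewards(keys: list[str], completions: list[str]) -> list[float]:
    # The env is rule-based / CPU-cheap, so threads overlap with GPU forwards.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(true_env_reward_fn, keys, completions))
```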

### Commands

```bash
# 0. Smoke test (~30 sec, no GPU)
python training/train_test.py

# 1. SFT warm-start (~10–15 min on a T4)
python training/train_sft.py

# 2. Start the env server and run GRPO (~45–90 min on a T4)
uvicorn salespath_env.server.app:app --port 7860 &
SFT_CHECKPOINT=./sft_checkpoint  USE_VLLM=0  python training/train_grpo.py

# 3. Plot reward curves
python training/plot_rewards.py

# 4. Baseline-vs-trained head-to-head on the held-out eval split
python training/eval_baseline_vs_trained.py \
    --base ./sft_checkpoint --trained ./grpo_checkpoint --episodes-per-level 8
```