---
title: SalesPath Environment
emoji: 🤖
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: OpenEnv RL gym for training B2B sales agents via GRPO
---
# SalesPath: RL Environment for B2B Sales Agents

An OpenEnv-compliant gym for teaching an LLM to follow a multi-step, rule-governed B2B sales workflow with programmatic verification at every step. Targets the Scale AI bonus track on long-horizon non-code business workflows.

- Theme: #2 (Long-Horizon Planning & Instruction Following)
- Bonus track: Scale AI (Sales / PM / HR & IT workflows)
- HF Space: https://huggingface.co/spaces/Lomesh7777/openenv-multi-agent-RL
- Blog post: add link before submission
- Demo video (≤2 min): add link before submission
## 1. Problem

Off-the-shelf LLMs prompted to act as sales agents reliably break the fundamentals of B2B selling: they pitch before qualifying, offer discounts before establishing value, and ignore ordering constraints that real sales orgs treat as inviolable. Not because they lack knowledge, but because no training environment ever penalised these behaviours.

SalesPath is that environment.
The agent navigates a 3-to-8-step workflow against a deterministic `ProspectSimulator`, and at every turn the environment programmatically verifies nine business rules (R01–R09). A composed OpenEnv `Rubric` emits a dense five-component reward.
## 2. Environment

### Observation
```json
{
  "prospect_response": "...",
  "workflow_stage": "PRESENT",
  "constraints_violated": ["R01"],
  "steps_completed": ["PROSPECT", "PRESENT"],
  "turn_number": 3,
  "reward": -0.18,
  "reward_components": { "r_outcome": 0.0, "r_compliance": -0.2, ... },
  "done": false
}
```
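For reference, the observation can be mirrored client-side as a plain dataclass. This is a minimal sketch, not part of the env's published client; the field names come from the JSON above, everything else is an assumption:

```python
from dataclasses import dataclass

@dataclass
class SalesPathObservation:
    """Client-side mirror of one observation payload (illustrative only)."""
    prospect_response: str
    workflow_stage: str                  # e.g. "PRESENT"
    constraints_violated: list[str]      # rule IDs tripped this turn
    steps_completed: list[str]           # ordered actions taken so far
    turn_number: int
    reward: float                        # weighted sum of the rubric components
    reward_components: dict[str, float]  # per-rubric breakdown
    done: bool

def observation_from_json(payload: dict) -> SalesPathObservation:
    # Assumes the server returns exactly the keys shown above.
    return SalesPathObservation(**payload)
```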
### Action

| Action | When to use |
|---|---|
| `PROSPECT` | Opening turn only: initial outreach |
| `QUALIFY` | Uncover budget, decision maker, pain points |
| `PRESENT` | Pitch the solution (requires `QUALIFY` first) |
| `HANDLE_OBJECTION` | Respond to pricing / timing objections |
| `OFFER_DEMO` | Schedule a live product demo |
| `NEGOTIATE` | Discuss pricing/terms (requires `OFFER_DEMO` + known budget) |
| `CLOSE` | Attempt to sign the deal |
| `FOLLOW_UP` | Re-engage after prospect silence |
| `DISQUALIFY` | End the conversation (only valid for low-budget, no-DM prospects) |
The action carries a `format_ok` flag set by the agent's parser. A malformed completion that happens to coerce to a valid `action_type` is still penalised by the `FormatRubric`, closing the silent format-hack surface from v1.
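A rough sketch of that parser contract (the regex, the function name, and the coercion fallback are illustrative assumptions, not the repo's actual parser):

```python
import re

VALID_ACTIONS = {
    "PROSPECT", "QUALIFY", "PRESENT", "HANDLE_OBJECTION", "OFFER_DEMO",
    "NEGOTIATE", "CLOSE", "FOLLOW_UP", "DISQUALIFY",
}

def parse_action(completion: str) -> dict:
    """Parse a model completion into an action, tracking format validity.

    format_ok stays False whenever the completion deviates from the expected
    format, even if it can still be coerced to a valid action_type; the
    FormatRubric penalises that case separately.
    """
    match = re.search(r"ACTION:\s*([A-Z_]+)", completion)
    if match and match.group(1) in VALID_ACTIONS:
        return {"action_type": match.group(1), "format_ok": True}
    # Coercion path: longest names first so DISQUALIFY beats its
    # substring QUALIFY.
    for action in sorted(VALID_ACTIONS, key=len, reverse=True):
        if action in completion.upper():
            return {"action_type": action, "format_ok": False}
    return {"action_type": "INVALID", "format_ok": False}
```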
### Business rules (R01–R09)
| Rule | Description |
|---|---|
| R01 | Must QUALIFY before PRESENT |
| R02 | Must OFFER_DEMO before NEGOTIATE |
| R03 | Cannot NEGOTIATE while budget is unknown |
| R04 | Discount in NEGOTIATE only after 2 objections handled |
| R05 | Cannot repeat the same action on consecutive turns |
| R06 | First action must be PROSPECT |
| R07 | FOLLOW_UP only valid after prospect silence (stall) |
| R08 | DISQUALIFY valid only when budget < threshold AND no decision_maker |
| R09 | Must OFFER_DEMO before CLOSE (difficulty 2+) |
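Each rule is effectively a pure predicate over the episode state, which is what makes the verification programmatic. A minimal sketch covering five of the nine rules; the state fields (`steps_completed`, `budget_known`) are assumptions:

```python
def violated_rules(action: str, state: dict) -> list[str]:
    """Return the rule IDs `action` would violate given the episode state.

    Assumed state fields: steps_completed (ordered list of past actions)
    and budget_known (bool). Only five of the nine rules are shown.
    """
    done = state["steps_completed"]
    hits = []
    if action == "PRESENT" and "QUALIFY" not in done:
        hits.append("R01")  # must QUALIFY before PRESENT
    if action == "NEGOTIATE" and "OFFER_DEMO" not in done:
        hits.append("R02")  # must OFFER_DEMO before NEGOTIATE
    if action == "NEGOTIATE" and not state["budget_known"]:
        hits.append("R03")  # cannot negotiate while budget is unknown
    if done and action == done[-1]:
        hits.append("R05")  # no repeating an action on consecutive turns
    if not done and action != "PROSPECT":
        hits.append("R06")  # first action must be PROSPECT
    return hits
```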
### Reward: a composed Rubric

`SalesPathRubric` is a `WeightedSum` over five sub-rubrics, each registered as an OpenEnv `Rubric` so external tooling can introspect per-component scores via `env.rubric.named_rubrics()`.
| Component | Weight | Type | What it captures |
|---|---|---|---|
| `compliance` | 0.40 | per-turn | -0.2 per new rule violation, capped at -1.0 |
| `outcome` | 0.20 | terminal | +1.0 success / +0.5 valid disqualify / -0.7 violation termination |
| `ordering` | 0.20 | per-turn | potential-based: Δ correct-prefix length per turn (arXiv:2408.10215 §4.2) |
| `efficiency` | 0.10 | terminal | -0.05 per turn over the per-difficulty optimum |
| `format` | 0.10 | per-turn | +1.0 valid+parsed / -0.3 if `format_ok=False` or invalid `action_type` |
Why these weights: arXiv:2601.19100 §3.1 argues that for long-horizon structured-output tasks the process signal must dominate the sparse outcome signal, so we give compliance 2× the weight of outcome.
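A self-contained sketch of the composition under these weights. This is not the OpenEnv `Rubric` API, just the shape of the computation; the function names and state fields are assumptions:

```python
from typing import Callable

# Each sub-rubric maps (state, action, next_state) -> a scalar score.
Rubric = Callable[[dict, dict, dict], float]

def compliance(state: dict, action: dict, next_state: dict) -> float:
    # -0.2 per new rule violation this turn, capped at -1.0.
    return max(-1.0, -0.2 * len(next_state["constraints_violated"]))

def ordering(state: dict, action: dict, next_state: dict) -> float:
    # Potential-based shaping: per-turn change (delta) in the length of the
    # longest correct action prefix.
    return next_state["correct_prefix_len"] - state["correct_prefix_len"]

WEIGHTED: dict[str, tuple[float, Rubric]] = {
    "compliance": (0.40, compliance),
    "ordering": (0.20, ordering),
    # outcome (0.20), efficiency (0.10) and format (0.10) omitted for brevity.
}

def weighted_sum(state: dict, action: dict, next_state: dict):
    """Return (total, per-component scores), as named_rubrics() would expose."""
    components = {name: fn(state, action, next_state)
                  for name, (_, fn) in WEIGHTED.items()}
    total = sum(weight * components[name]
                for name, (weight, _) in WEIGHTED.items())
    return total, components
```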
### Difficulty curriculum
| Level | Description | Correct terminal action |
|---|---|---|
| 1 | Budget known, decision maker present | CLOSE |
| 2 | Budget hidden, 1 objection, demo required | CLOSE |
| 3 | Budget hidden, 2 objections, possible stalling | CLOSE |
| 4 | Adversarial: misleading high-budget signal, no decision maker | DISQUALIFY |
The task bank carries ~20 prospect profiles per level (`task_bank.py`); the last 4 of each level are held out for `eval_baseline_vs_trained.py`.
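Under the assumption that `task_bank.py` exposes the bank as a per-level list of profiles, the split amounts to a slice (sketch; names are hypothetical):

```python
def split_bank(bank: dict[int, list[dict]]) -> tuple[dict, dict]:
    """Split a task bank of shape {level: [profile, ...]} (~20 per level):
    the last 4 profiles of every level are held out for evaluation."""
    train = {lvl: profiles[:-4] for lvl, profiles in bank.items()}
    held_out = {lvl: profiles[-4:] for lvl, profiles in bank.items()}
    return train, held_out
```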
## 3. Training pipeline
```
sft_demos.jsonl → train_sft.py → ./sft_checkpoint
                                        │
                                        ▼
                                  train_grpo.py
                                        │
                            on-policy rollouts in
                            SalesPathEnvironment
                                        │
                                        ▼
                                ./grpo_checkpoint
                                        │
                   ┌────────────────────┴────────────────────┐
                   ▼                                         ▼
           plot_rewards.py                   eval_baseline_vs_trained.py
                   │                                         │
                   ▼                                         ▼
     ./plots/reward_curve.png                    ./eval_results.md
```
### What's specifically engineered for fast Colab/Kaggle GPUs
- **Batched rollouts:** N parallel episodes, a single `.generate()` call per turn (left-padded for correctness).
- **Threaded reward fn:** reward computation across GRPO's group of candidate completions runs in a `ThreadPoolExecutor` (the env is rule-based and CPU-cheap, so threads overlap with GPU forward passes).
- **State snapshots keyed by SHA1:** the `STATE_BANK` trick lets GRPO score single-action completions against a frozen state, avoiding full episode re-rollouts during the gradient step.
- **N-step shaping (`GAMMA=0.95`):** `true_env_reward_fn` extends the immediate per-turn reward with a discounted heuristic continuation (sketched after this list), so GRPO sees credit for actions that pay off later. This is what gives this contextual-bandit-shaped problem a real long-horizon signal.
- **Optional vLLM:** `USE_VLLM=1` enables TRL's vLLM-backed sampler for ~3× faster on-policy generation on A100/Kaggle P100.
- **Trainer-once:** `GRPOTrainer` is constructed once and trained once, preserving optimizer and LR-scheduler state across all gradient steps.
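A minimal sketch of the N-step shaping idea; `GAMMA` and `true_env_reward_fn` are named in the list above, but the signature and the heuristic continuation shown here are illustrative assumptions:

```python
GAMMA = 0.95  # discount applied to the heuristic continuation

def true_env_reward_fn(immediate_reward: float,
                       heuristic_future: list[float]) -> float:
    """Extend the per-turn reward with a discounted heuristic continuation.

    `heuristic_future` estimates the rewards the remaining optimal actions
    would earn (illustrative; the repo derives this from the rule set).
    Without this term GRPO sees only a contextual-bandit signal: each scored
    completion is a single action against a frozen state snapshot.
    """
    shaped = immediate_reward
    for k, r in enumerate(heuristic_future, start=1):
        shaped += (GAMMA ** k) * r
    return shaped

# e.g. an action worth -0.1 now but enabling +1.0 two turns later:
# true_env_reward_fn(-0.1, [0.0, 1.0]) == -0.1 + 0.95**2 * 1.0 ≈ 0.8025
```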
### Commands
```bash
# 0. Smoke test (~30 sec, no GPU)
python training/train_test.py

# 1. SFT warm-start (~10–15 min on a T4)
python training/train_sft.py

# 2. Start the env server and run GRPO (~45–90 min on a T4)
uvicorn salespath_env.server.app:app --port 7860 &
SFT_CHECKPOINT=./sft_checkpoint USE_VLLM=0 python training/train_grpo.py

# 3. Plot reward curves
python training/plot_rewards.py

# 4. Baseline-vs-trained head-to-head on the held-out eval split
python training/eval_baseline_vs_trained.py \
  --base ./sft_checkpoint --trained ./grpo_checkpoint --episodes-per-level 8
```