---
title: SalesPath Environment
emoji: 🤖
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: OpenEnv RL gym for training B2B sales agents via GRPO
---

# SalesPath – RL Environment for B2B Sales Agents

> **An OpenEnv-compliant gym for teaching an LLM to follow a multi-step,
> rule-governed B2B sales workflow with programmatic verification at every
> step. Targets the Scale AI bonus track on long-horizon non-code business
> workflows.**

* **Theme**: #2 – Long-Horizon Planning & Instruction Following
* **Bonus track**: Scale AI – Sales / PM / HR & IT workflows
* **HF Space**: https://huggingface.co/spaces/Lomesh7777/openenv-multi-agent-RL
* **Blog post**: _add link before submission_
* **Demo video (≤2 min)**: _add link before submission_

---

## 1. Problem

Off-the-shelf LLMs prompted to act as sales agents reliably break the
fundamentals of B2B selling: they pitch before qualifying, offer discounts
before establishing value, and ignore ordering constraints that real sales
orgs treat as inviolable. Not because they lack knowledge, but because no
training environment ever penalised these behaviours.

SalesPath is that environment.

The agent navigates a 3-to-8-step workflow against a deterministic
`ProspectSimulator`, and at every turn the environment programmatically
verifies nine business rules (R01..R09). A composed
[OpenEnv `Rubric`](salespath_env/server/reward.py) emits a dense
five-component reward.

## 2. Environment

### Observation

```jsonc
{
  "prospect_response": "...",
  "workflow_stage": "PRESENT",
  "constraints_violated": ["R01"],
  "steps_completed": ["PROSPECT", "PRESENT"],
  "turn_number": 3,
  "reward": -0.18,
  "reward_components": { "r_outcome": 0.0, "r_compliance": -0.2, ... },
  "done": false
}
```
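
For orientation, here is roughly what one reset-and-step exchange against a
running server looks like. The endpoint paths and request payloads below are
illustrative assumptions (the real client lives in `salespath_env/client.py`);
only the observation schema above is authoritative.

```python
import requests

BASE = "http://localhost:7860"  # the Space / uvicorn port

# Start an episode (the "difficulty" field is assumed for illustration).
obs = requests.post(f"{BASE}/reset", json={"difficulty": 1}).json()

# R06: the first action of an episode must be PROSPECT.
action = {
    "action_type": "PROSPECT",
    "message": "Hi, reaching out because ...",
    "format_ok": True,
}
obs = requests.post(f"{BASE}/step", json=action).json()
print(obs["workflow_stage"], obs["reward"], obs["constraints_violated"])
```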

### Action

| Action | When to use |
|---|---|
| `PROSPECT` | Opening turn only – initial outreach |
| `QUALIFY` | Uncover budget, decision maker, pain points |
| `PRESENT` | Pitch the solution (requires `QUALIFY` first) |
| `HANDLE_OBJECTION` | Respond to pricing / timing objections |
| `OFFER_DEMO` | Schedule a live product demo |
| `NEGOTIATE` | Discuss pricing/terms (requires `OFFER_DEMO` + known budget) |
| `CLOSE` | Attempt to sign the deal |
| `FOLLOW_UP` | Re-engage after prospect silence |
| `DISQUALIFY` | End the conversation (only valid for low-budget, no-DM prospects) |

The action carries a `format_ok` flag set by the agent's parser. A malformed
completion that happens to coerce to a valid `action_type` is still penalised
by the `FormatRubric`, closing the silent format-hack surface from v1.
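
A minimal sketch of such a parser, assuming an illustrative
`ACTION: <TYPE>` completion format (the real expected format is whatever the
training prompts specify): a completion that only fuzzily matches still
yields a usable `action_type`, but with `format_ok=False` so the
`FormatRubric` can penalise it.

```python
import re

VALID_ACTIONS = {
    "PROSPECT", "QUALIFY", "PRESENT", "HANDLE_OBJECTION", "OFFER_DEMO",
    "NEGOTIATE", "CLOSE", "FOLLOW_UP", "DISQUALIFY",
}

def parse_completion(text: str) -> dict:
    # Strict path: completion follows the expected "ACTION: <TYPE>\n<message>" shape.
    m = re.match(r"ACTION:\s*([A-Z_]+)\s*\n?(.*)", text.strip(), re.DOTALL)
    if m and m.group(1) in VALID_ACTIONS:
        return {"action_type": m.group(1), "message": m.group(2).strip(),
                "format_ok": True}
    # Fuzzy fallback: the action may still be recoverable from free text,
    # but format_ok=False marks the completion as malformed for the rubric.
    for a in VALID_ACTIONS:
        if a in text.upper():
            return {"action_type": a, "message": text.strip(), "format_ok": False}
    return {"action_type": "INVALID", "message": text.strip(), "format_ok": False}
```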

### Business rules (R01..R09)

| Rule | Description |
|---|---|
| R01 | Must `QUALIFY` before `PRESENT` |
| R02 | Must `OFFER_DEMO` before `NEGOTIATE` |
| R03 | Cannot `NEGOTIATE` while budget is unknown |
| R04 | Discount in `NEGOTIATE` only after 2 objections handled |
| R05 | Cannot repeat the same action on consecutive turns |
| R06 | First action must be `PROSPECT` |
| R07 | `FOLLOW_UP` only valid after prospect silence (stall) |
| R08 | `DISQUALIFY` valid only when `budget < threshold AND no decision_maker` |
| R09 | Must `OFFER_DEMO` before `CLOSE` (difficulty 2+) |
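
These checks are plain state-machine lookups, not model judgments. A sketch
of a few of them, with illustrative state fields (the real implementations
live in `salespath_env/server/rules.py`):

```python
def violated_rules(action_type: str, state: dict) -> list[str]:
    """Return the rule IDs a proposed action would violate in `state`.
    The state fields (steps_completed, budget) are illustrative."""
    done = state["steps_completed"]  # actions taken so far, in order
    v = []
    if not done and action_type != "PROSPECT":
        v.append("R06")  # first action must be PROSPECT
    if action_type == "PRESENT" and "QUALIFY" not in done:
        v.append("R01")
    if action_type == "NEGOTIATE":
        if "OFFER_DEMO" not in done:
            v.append("R02")
        if state.get("budget") is None:
            v.append("R03")  # budget still unknown
    if done and action_type == done[-1]:
        v.append("R05")  # no identical consecutive actions
    return v
```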

### Reward – composed Rubric

`SalesPathRubric` is a `WeightedSum` over five sub-rubrics, each registered
as an OpenEnv `Rubric` so external tooling can introspect per-component
scores via `env.rubric.named_rubrics()`.

| Component | Weight | Type | What it captures |
|---|---|---|---|
| `compliance` | 0.40 | per-turn | -0.2 per new rule violation, capped at -1.0 |
| `outcome` | 0.20 | terminal | +1.0 success / +0.5 valid disqualify / -0.7 violation termination |
| `ordering` | 0.20 | per-turn | **potential-based** – Δ correct-prefix length per turn (arXiv:2408.10215 §4.2) |
| `efficiency` | 0.10 | terminal | -0.05 per turn over the per-difficulty optimum |
| `format` | 0.10 | per-turn | +1.0 valid+parsed / -0.3 if `format_ok=False` or invalid `action_type` |

Why these weights: arXiv:2601.19100 §3.1 argues that for long-horizon
structured-output tasks the *process* signal must dominate the sparse
*outcome* signal. We give compliance 2× the weight of outcome.
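
In spirit, the composition looks like the sketch below. `WeightedSum` here is
a simplified stand-in for the OpenEnv Rubric interfaces (the real one is in
`salespath_env/server/reward.py`), and the transition fields are illustrative.
Note the `ordering` term scores the *change* in correct-prefix length, which
is what makes it potential-based shaping.

```python
class WeightedSum:
    """Simplified stand-in for the OpenEnv WeightedSum rubric."""
    def __init__(self, **components):  # name -> (weight, scoring fn)
        self.components = components

    def named_rubrics(self):
        return {name: fn for name, (_, fn) in self.components.items()}

    def __call__(self, t: dict) -> float:
        return sum(w * fn(t) for w, fn in self.components.values())

salespath_rubric = WeightedSum(
    compliance=(0.40, lambda t: max(-1.0, -0.2 * len(t["new_violations"]))),
    outcome   =(0.20, lambda t: t["outcome_score"] if t["done"] else 0.0),
    # Δ correct-prefix length: rewarding the change in a potential function
    # leaves the optimal policy unchanged (arXiv:2408.10215 §4.2).
    ordering  =(0.20, lambda t: t["prefix_len"] - t["prev_prefix_len"]),
    efficiency=(0.10, lambda t: -0.05 * t["turns_over_optimum"] if t["done"] else 0.0),
    format    =(0.10, lambda t: 1.0 if t["format_ok"] else -0.3),
)
```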

### Difficulty curriculum

| Level | Description | Correct terminal action |
|---|---|---|
| 1 | Budget known, decision maker present | `CLOSE` |
| 2 | Budget hidden, 1 objection, demo required | `CLOSE` |
| 3 | Budget hidden, 2 objections, possible stalling | `CLOSE` |
| 4 | Adversarial: misleading high-budget signal, no decision maker | `DISQUALIFY` |

The task bank carries ~20 prospect profiles per level (`task_bank.py`); the
last 4 of each level are held out for `eval_baseline_vs_trained.py`.
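
The held-out split is a fixed slice rather than a random sample, so baseline
and trained models are evaluated on identical prospects. Schematically
(names illustrative; the real split lives in `task_bank.py`):

```python
def split_bank(bank: dict) -> tuple[dict, dict]:
    """bank maps difficulty level -> ordered list of ~20 prospect profiles."""
    train = {level: profiles[:-4] for level, profiles in bank.items()}
    heldout = {level: profiles[-4:] for level, profiles in bank.items()}
    return train, heldout
```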

## 3. Training pipeline

```
sft_demos.jsonl → train_sft.py → ./sft_checkpoint
                                        │
                                        ▼
                                  train_grpo.py
                                        │
                             on-policy rollouts in
                             SalesPathEnvironment
                                        │
                                        ▼
                                ./grpo_checkpoint
                                        │
                    ┌───────────────────┴───────────────────┐
                    ▼                                       ▼
             plot_rewards.py                   eval_baseline_vs_trained.py
                    │                                       │
                    ▼                                       ▼
        ./plots/reward_curve.png                    ./eval_results.md
```

### What's specifically engineered for fast Colab/Kaggle GPUs

* **Batched rollouts** – N parallel episodes, a single `.generate()` call per
  turn (left-padded for correct batched decoding).
* **Threaded reward fn** – reward computation across GRPO's group of
  candidate completions runs in a `ThreadPoolExecutor` (the env is
  rule-based / CPU-cheap, so threads overlap with GPU forwards).
* **State snapshots keyed by SHA1** – the `STATE_BANK` trick lets GRPO score
  single-action completions against a frozen state, avoiding full episode
  re-rollouts during the gradient step.
* **N-step shaping** (`GAMMA=0.95`) – `true_env_reward_fn` extends the
  immediate per-turn reward with a discounted heuristic continuation, so
  GRPO sees credit for actions that pay off later. This is what gives this
  contextual-bandit-shaped problem a real long-horizon signal (see the
  sketch after this list).
* **Optional vLLM** – `USE_VLLM=1` flips on TRL's vLLM-backed sampler for
  ~3× faster on-policy generation on A100/Kaggle P100.
* **Trainer-once** – `GRPOTrainer` is constructed once and trained once,
  preserving optimizer and LR-scheduler state across all gradient steps.
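
A compressed sketch of how the `STATE_BANK` snapshotting and the n-step
shaping fit together. `env_step` and `heuristic_value` are hypothetical
stand-ins for the actual calls in `train_grpo.py`:

```python
import copy
import hashlib
import json

GAMMA = 0.95
STATE_BANK: dict[str, dict] = {}  # sha1(state json) -> frozen state snapshot

def snapshot(state: dict) -> str:
    """Freeze the env state during rollout and return its SHA1 key."""
    key = hashlib.sha1(json.dumps(state, sort_keys=True).encode()).hexdigest()
    STATE_BANK[key] = copy.deepcopy(state)
    return key

def shaped_reward(state_key: str, completion: str) -> float:
    """Score one candidate completion against a frozen state (no episode
    re-rollout), then extend the per-turn reward with a discounted
    heuristic continuation so GRPO sees longer-horizon credit."""
    state = STATE_BANK[state_key]
    action = parse_completion(completion)        # parser sketch above
    r_now, next_state = env_step(state, action)  # illustrative one-step call
    return r_now + GAMMA * heuristic_value(next_state)
```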

### Commands

```bash
# 0. Smoke test (~30 sec, no GPU)
python training/train_test.py

# 1. SFT warm-start (~10–15 min on a T4)
python training/train_sft.py

# 2. Start the env server and run GRPO (~45–90 min on a T4)
uvicorn salespath_env.server.app:app --port 7860 &
SFT_CHECKPOINT=./sft_checkpoint USE_VLLM=0 python training/train_grpo.py

# 3. Plot reward curves
python training/plot_rewards.py

# 4. Baseline-vs-trained head-to-head on the held-out eval split
python training/eval_baseline_vs_trained.py \
  --base ./sft_checkpoint --trained ./grpo_checkpoint --episodes-per-level 8
```

### Useful env vars for Colab/Kaggle tuning

| Var | Default | Notes |
|---|---|---|
| `ROLLOUTS_PER_DIFFICULTY` | 8 | More → bigger / more diverse state bank |
| `NUM_GENERATIONS` | 4 | GRPO group size; on T4 keep ≤4 to fit VRAM |
| `PER_DEVICE_BATCH` | 2 | T4 / Kaggle P100 default |
| `GRAD_ACCUM` | 4 | Effective batch = 8 |
| `NUM_REWARD_WORKERS` | 8 | Threadpool size for the reward fn |
| `USE_VLLM` | 0 | Set to `1` on A100 only |
| `BETA` | 0.05 | KL-to-reference penalty |
| `GAMMA` | 0.95 | n-step continuation discount |
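
These knobs are plain environment variables; a typical read inside the
training script might look like this (illustrative, defaults matching the
table):

```python
import os

NUM_GENERATIONS = int(os.environ.get("NUM_GENERATIONS", "4"))
PER_DEVICE_BATCH = int(os.environ.get("PER_DEVICE_BATCH", "2"))
GRAD_ACCUM = int(os.environ.get("GRAD_ACCUM", "4"))
GAMMA = float(os.environ.get("GAMMA", "0.95"))

# Effective optimisation batch with the defaults: 2 * 4 = 8.
EFFECTIVE_BATCH = PER_DEVICE_BATCH * GRAD_ACCUM
```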

## 4. Results

After ~1 GRPO pass (eval on the **held-out** profiles, 8 episodes per level):

> See `eval_results.md` (regenerated by `eval_baseline_vs_trained.py`)
> and `plots/reward_curve.png` (regenerated by `plot_rewards.py`).

The conservative target table from the proposal:

| Metric | Base | After GRPO (target) |
|---|---|---|
| Rule violations per episode | 3.5 | < 0.5 |
| Correct step ordering rate | 0.45 | > 0.85 |
| Successful close rate (L1) | 0.30 | > 0.75 |
| Correct disqualification rate (L4) | 0.20 | > 0.65 |
| Mean episode reward | ~0.10 | > 0.6 |

## 5. File layout

```
salespath-env/
├── salespath_env/
│   ├── __init__.py               – public API exports
│   ├── client.py                 – HTTP client for the env
│   ├── models.py                 – Action / Observation / State + format_ok
│   ├── openenv.yaml              – OpenEnv manifest (spec_version: 1)
│   └── server/
│       ├── app.py                – custom stateful FastAPI (HF Spaces)
│       ├── salespath_environment.py
│       ├── prospect_simulator.py – deterministic, state-seeded
│       ├── rules.py              – R01–R09
│       ├── reward.py             – SalesPathRubric (WeightedSum of 5)
│       └── task_bank.py          – 19–20 profiles/level + held-out split
├── training/
│   ├── sft_demos.jsonl
│   ├── train_test.py             – smoke test + bug regression
│   ├── train_sft.py
│   ├── train_grpo.py             – GRPO + n-step + parallel reward fn
│   ├── eval_baseline_vs_trained.py
│   └── plot_rewards.py
├── Dockerfile
├── requirements.txt
└── pyproject.toml
```

## 6. Why this design wins on the rubric

| Criterion (weight) | How we hit it |
|---|---|
| **Environment Innovation (40%)** | Business workflow with programmatic verification, deterministic rule-based simulator (no LLM in the verifier, which prevents reward hacking via prompt manipulation), 4-level curriculum with held-out eval, OpenEnv `Rubric` composition. |
| **Storytelling (30%)** | The sales workflow is legible to any reader in 10 seconds. The before/after table from `eval_baseline_vs_trained.py` is the headline. Live-demo script in §0:30–1:30 of the demo plan. |
| **Improvement in Rewards (20%)** | Five tracked metrics, dense per-turn signal, reward curves with min/max band and difficulty-step markers, baseline-vs-trained eval table. |
| **Reward & Pipeline (10%)** | Composed Rubric system; potential-based ordering shaping (no policy distortion); n-step continuation closes the contextual-bandit gap; format-hack surface explicitly closed; trainer instantiated once with optimizer state preserved. |

## 7. References

* Reward engineering survey – [arXiv:2408.10215](https://arxiv.org/abs/2408.10215)
* Reward engineering for software RL – [arXiv:2601.19100](https://arxiv.org/abs/2601.19100)
* OpenEnv – https://github.com/meta-pytorch/OpenEnv
* OpenEnv Rubric RFC – [`rfcs/004-rubrics.md`](https://github.com/meta-pytorch/OpenEnv)