---
title: SalesPath Environment
emoji: 🤝
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: OpenEnv RL gym for training B2B sales agents via GRPO
---

# SalesPath: RL Environment for B2B Sales Agents

> **An OpenEnv-compliant gym for teaching an LLM to follow a multi-step,
> rule-governed B2B sales workflow with programmatic verification at every
> step. Targets the Scale AI bonus track on long-horizon non-code business
> workflows.**

* **Theme**: #2 - Long-Horizon Planning & Instruction Following
* **Bonus track**: Scale AI - Sales / PM / HR & IT workflows
* **HF Space**: https://huggingface.co/spaces/Lomesh7777/openenv-multi-agent-RL
* **Blog post**: _add link before submission_
* **Demo video (≤2 min)**: _add link before submission_

---

## 1. Problem

Off-the-shelf LLMs prompted to act as sales agents reliably break the fundamentals of B2B selling: they pitch before qualifying, offer discounts before establishing value, and ignore ordering constraints that real sales orgs treat as inviolable. Not because they lack the knowledge, but because no training environment has ever penalised these behaviours.

SalesPath is that environment. The agent navigates a 3-to-8-step workflow against a deterministic `ProspectSimulator`, and at every turn the environment programmatically verifies nine business rules (R01..R09). A composed [OpenEnv `Rubric`](salespath_env/server/reward.py) emits a dense five-component reward.

## 2. Environment

### Observation

```jsonc
{
  "prospect_response": "...",
  "workflow_stage": "PRESENT",
  "constraints_violated": ["R01"],
  "steps_completed": ["PROSPECT", "PRESENT"],
  "turn_number": 3,
  "reward": -0.18,
  "reward_components": { "r_outcome": 0.0, "r_compliance": -0.2, ... },
  "done": false
}
```

### Action

| Action | When to use |
|---|---|
| `PROSPECT` | Opening turn only: initial outreach |
| `QUALIFY` | Uncover budget, decision maker, pain points |
| `PRESENT` | Pitch the solution (requires `QUALIFY` first) |
| `HANDLE_OBJECTION` | Respond to pricing / timing objections |
| `OFFER_DEMO` | Schedule a live product demo |
| `NEGOTIATE` | Discuss pricing/terms (requires `OFFER_DEMO` + known budget) |
| `CLOSE` | Attempt to sign the deal |
| `FOLLOW_UP` | Re-engage after prospect silence |
| `DISQUALIFY` | End the conversation (only valid for low-budget, no-DM prospects) |

The action carries a `format_ok` flag set by the agent's parser. A malformed completion that happens to coerce to a valid `action_type` is still penalised by the `FormatRubric`, closing the silent format-hack surface from v1.

### Business rules (R01..R09)

| Rule | Description |
|---|---|
| R01 | Must `QUALIFY` before `PRESENT` |
| R02 | Must `OFFER_DEMO` before `NEGOTIATE` |
| R03 | Cannot `NEGOTIATE` while budget is unknown |
| R04 | Discount in `NEGOTIATE` only after 2 objections handled |
| R05 | Cannot repeat the same action on consecutive turns |
| R06 | First action must be `PROSPECT` |
| R07 | `FOLLOW_UP` only valid after prospect silence (stall) |
| R08 | `DISQUALIFY` valid only when `budget < threshold AND no decision_maker` |
| R09 | Must `OFFER_DEMO` before `CLOSE` (difficulty 2+) |

### Reward: a composed Rubric

`SalesPathRubric` is a `WeightedSum` over five sub-rubrics, each registered as an OpenEnv `Rubric` so external tooling can introspect per-component scores via `env.rubric.named_rubrics()`.
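Conceptually, the composition reduces to a weighted sum of per-component scores. The sketch below is illustrative only: the class and the lambda sub-rubrics are self-contained stand-ins, not the actual OpenEnv API (the real definitions live in `salespath_env/server/reward.py`), and it uses the weights from the table that follows.

```python
# Self-contained sketch of the weighted-sum idea. All names and signatures
# here are stand-ins, NOT the actual OpenEnv API; the real sub-rubrics live
# in salespath_env/server/reward.py.
from dataclasses import dataclass
from typing import Callable, Dict

StepRecord = dict  # one turn's worth of env state, as in the Observation above

@dataclass
class WeightedSumSketch:
    rubrics: Dict[str, Callable[[StepRecord], float]]
    weights: Dict[str, float]

    def score(self, step: StepRecord) -> float:
        # total reward = sum over components of weight_i * sub_score_i
        return sum(self.weights[n] * fn(step) for n, fn in self.rubrics.items())

    def named_rubrics(self):
        # lets external tooling introspect each component separately
        return self.rubrics.items()

rubric = WeightedSumSketch(
    rubrics={
        # -0.2 per new rule violation, capped at -1.0
        "compliance": lambda s: max(-0.2 * len(s.get("new_violations", [])), -1.0),
        # +1.0 for a cleanly parsed action, -0.3 otherwise
        "format": lambda s: 1.0 if s.get("format_ok") else -0.3,
        # outcome / ordering / efficiency elided for brevity
    },
    weights={"compliance": 0.40, "format": 0.10},
)
```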
| Component | Weight | Type | What it captures |
|---|---|---|---|
| `compliance` | 0.40 | per-turn | -0.2 per new rule violation, capped at -1.0 |
| `outcome` | 0.20 | terminal | +1.0 success / +0.5 valid disqualify / -0.7 violation termination |
| `ordering` | 0.20 | per-turn | **potential-based**: Δ correct-prefix length per turn (arXiv:2408.10215 §4.2) |
| `efficiency` | 0.10 | terminal | -0.05 per turn over the per-difficulty optimum |
| `format` | 0.10 | per-turn | +1.0 valid+parsed / -0.3 if `format_ok=False` or invalid `action_type` |

Why these weights: arXiv:2601.19100 §3.1 argues that for long-horizon structured-output tasks the *process* signal must dominate the sparse *outcome* signal, so compliance gets 2× the weight of outcome.

### Difficulty curriculum

| Level | Description | Correct terminal action |
|---|---|---|
| 1 | Budget known, decision maker present | `CLOSE` |
| 2 | Budget hidden, 1 objection, demo required | `CLOSE` |
| 3 | Budget hidden, 2 objections, possible stalling | `CLOSE` |
| 4 | Adversarial: misleading high-budget signal, no decision maker | `DISQUALIFY` |

The task bank carries ~20 prospect profiles per level (`task_bank.py`); the last 4 of each level are held out for `eval_baseline_vs_trained.py`.

## 3. Training pipeline

```
sft_demos.jsonl → train_sft.py → ./sft_checkpoint
                                       │
                                       ▼
                                 train_grpo.py
                                       │  on-policy rollouts in SalesPathEnvironment
                                       ▼
                                ./grpo_checkpoint
                                       │
                 ┌─────────────────────┴─────────────────────┐
                 ▼                                           ▼
         plot_rewards.py                     eval_baseline_vs_trained.py
                 │                                           │
                 ▼                                           ▼
  ./plots/reward_curve.png                       ./eval_results.md
```

### What's specifically engineered for fast Colab/Kaggle GPUs

* **Batched rollouts**: N parallel episodes, a single `.generate()` call per turn (left-padded for correctness).
* **Threaded reward fn**: reward computation across GRPO's group of candidate completions runs in a `ThreadPoolExecutor` (the env is rule-based and CPU-cheap, so threads overlap with GPU forward passes).
* **State snapshots keyed by SHA1**: the `STATE_BANK` trick lets GRPO score single-action completions against a frozen state, avoiding full episode re-rollouts during the gradient step.
* **N-step shaping** (`GAMMA=0.95`): `true_env_reward_fn` extends the immediate per-turn reward with a discounted heuristic continuation, so GRPO sees credit for actions that pay off later. This is what gives this contextual-bandit-shaped problem a real long-horizon signal.
* **Optional vLLM**: `USE_VLLM=1` switches to TRL's vLLM-backed sampler for ~3× faster on-policy generation on an A100 or Kaggle P100.
* **Trainer-once**: `GRPOTrainer` is constructed once and trained once, preserving optimizer and LR-scheduler state across all gradient steps.

### Commands

```bash
# 0. Smoke test (~30 sec, no GPU)
python training/train_test.py

# 1. SFT warm-start (~10-15 min on a T4)
python training/train_sft.py

# 2. Start the env server and run GRPO (~45-90 min on a T4)
uvicorn salespath_env.server.app:app --port 7860 &
SFT_CHECKPOINT=./sft_checkpoint USE_VLLM=0 python training/train_grpo.py

# 3. Plot reward curves
python training/plot_rewards.py

# 4. Baseline-vs-trained head-to-head on the held-out eval split
python training/eval_baseline_vs_trained.py \
  --base ./sft_checkpoint --trained ./grpo_checkpoint --episodes-per-level 8
```
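Once the server from step 2 is up, a quick sanity check against the HTTP API might look like the sketch below. The `/reset` and `/step` routes and the payload fields are assumptions based on typical OpenEnv-style HTTP environments; verify them against `salespath_env/server/app.py` before relying on this.

```python
# Hypothetical smoke test against a running SalesPath server. Routes and
# payload fields are assumptions (typical of OpenEnv-style HTTP envs), not
# the verified contract; see salespath_env/server/app.py for the real one.
import requests

BASE = "http://localhost:7860"

# Start an episode and inspect the first observation.
obs = requests.post(f"{BASE}/reset", json={"difficulty": 1}).json()
print(obs)  # expect the observation shape shown in section 2

# R06: the first action must be PROSPECT.
obs = requests.post(
    f"{BASE}/step",
    json={"action_type": "PROSPECT", "format_ok": True},
).json()
print(obs["reward"], obs["constraints_violated"])  # expect no violations
```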