---
title: SalesPath Environment
emoji: 🤝
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: OpenEnv RL gym for training B2B sales agents via GRPO
---

# SalesPath — RL Environment for B2B Sales Agents

An OpenEnv-compliant gym for teaching an LLM to follow a multi-step, rule-governed B2B sales workflow with programmatic verification at every step. Targets the Scale AI bonus track on long-horizon non-code business workflows.


## 1. Problem

Off-the-shelf LLMs prompted to act as sales agents reliably break the fundamentals of B2B selling: they pitch before qualifying, offer discounts before establishing value, and ignore ordering constraints that real sales orgs treat as inviolable. Not because they lack knowledge, but because no training environment has ever penalised these behaviours.

SalesPath is that environment.

The agent navigates a 3-to-8 step workflow against a deterministic ProspectSimulator, and at every turn the environment programmatically verifies nine business rules (R01..R09). A composed OpenEnv Rubric emits a dense five-component reward.

## 2. Environment

### Observation

```json
{
  "prospect_response":     "...",
  "workflow_stage":        "PRESENT",
  "constraints_violated":  ["R01"],
  "steps_completed":       ["PROSPECT", "PRESENT"],
  "turn_number":           3,
  "reward":                -0.18,
  "reward_components":     { "r_outcome": 0.0, "r_compliance": -0.2, ... },
  "done":                  false
}
```
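For orientation, here is a minimal client loop against the served environment. The `/reset` and `/step` routes, their payload shapes, and the stub policy are illustrative assumptions, not the confirmed server API; see `salespath_env/server/app.py` for the routes the server actually exposes.

```python
import requests

BASE = "http://localhost:7860"

def pick_action(obs: dict) -> str:
    # Stand-in policy: open with PROSPECT, then always QUALIFY. A real agent
    # generates actions with the LLM; this stub will quickly trip R05 (the
    # no-consecutive-repeats rule) and collect compliance penalties.
    return "PROSPECT" if not obs["steps_completed"] else "QUALIFY"

obs = requests.post(f"{BASE}/reset", json={"difficulty": 1}).json()
while not obs["done"]:
    payload = {"action_type": pick_action(obs), "message": "...", "format_ok": True}
    obs = requests.post(f"{BASE}/step", json=payload).json()
    print(obs["workflow_stage"], obs["reward"], obs["constraints_violated"])
```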

### Action

| Action | When to use |
|---|---|
| `PROSPECT` | Opening turn only — initial outreach |
| `QUALIFY` | Uncover budget, decision maker, pain points |
| `PRESENT` | Pitch the solution (requires `QUALIFY` first) |
| `HANDLE_OBJECTION` | Respond to pricing / timing objections |
| `OFFER_DEMO` | Schedule a live product demo |
| `NEGOTIATE` | Discuss pricing/terms (requires `OFFER_DEMO` + known budget) |
| `CLOSE` | Attempt to sign the deal |
| `FOLLOW_UP` | Re-engage after prospect silence |
| `DISQUALIFY` | End the conversation (only valid for low-budget, no-decision-maker prospects) |

The action carries a `format_ok` flag set by the agent's parser. A malformed completion that happens to coerce to a valid `action_type` is still penalised by the FormatRubric — closing the silent format-hack surface from v1.
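A sketch of that parsing step. The expected completion format and the `Action` shape are assumptions for illustration, not the repo's actual parser:

```python
import re
from dataclasses import dataclass

VALID_ACTIONS = {
    "PROSPECT", "QUALIFY", "PRESENT", "HANDLE_OBJECTION", "OFFER_DEMO",
    "NEGOTIATE", "CLOSE", "FOLLOW_UP", "DISQUALIFY",
}

@dataclass
class Action:
    action_type: str
    message: str
    format_ok: bool  # set here, scored by FormatRubric

def parse_completion(text: str) -> Action:
    """Parse 'ACTION: <TYPE> MESSAGE: <text>' completions (assumed format)."""
    m = re.search(r"ACTION:\s*([A-Z_]+)\s+MESSAGE:\s*(.*)", text, re.DOTALL)
    if m and m.group(1) in VALID_ACTIONS:
        return Action(m.group(1), m.group(2).strip(), format_ok=True)
    # Salvage any action-like token, but keep format_ok=False so the
    # FormatRubric still applies its -0.3 penalty to the format hack.
    token = next((a for a in VALID_ACTIONS if a in text.upper()), "FOLLOW_UP")
    return Action(token, text.strip(), format_ok=False)
```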

### Business rules (R01..R09)

| Rule | Description |
|---|---|
| R01 | Must QUALIFY before PRESENT |
| R02 | Must OFFER_DEMO before NEGOTIATE |
| R03 | Cannot NEGOTIATE while budget is unknown |
| R04 | Discount in NEGOTIATE only after 2 objections handled |
| R05 | Cannot repeat the same action on consecutive turns |
| R06 | First action must be PROSPECT |
| R07 | FOLLOW_UP only valid after prospect silence (stall) |
| R08 | DISQUALIFY valid only when budget < threshold AND no decision_maker |
| R09 | Must OFFER_DEMO before CLOSE (difficulty 2+) |
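Every rule is a pure function of the conversation state, so verification is deterministic and cheap. A sketch of how a few of these checks might look; the state field names are assumptions, not the env's actual schema:

```python
def check_rules(state, action: str) -> list[str]:
    """Return the IDs of rules the proposed action would violate."""
    v = []
    if action == "PRESENT" and "QUALIFY" not in state.steps_completed:
        v.append("R01")  # must qualify before pitching
    if action == "NEGOTIATE" and "OFFER_DEMO" not in state.steps_completed:
        v.append("R02")  # demo before negotiating
    if action == "NEGOTIATE" and not state.budget_known:
        v.append("R03")  # never negotiate blind
    if action == state.last_action:
        v.append("R05")  # no consecutive repeats
    if not state.steps_completed and action != "PROSPECT":
        v.append("R06")  # every conversation opens with outreach
    return v
```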

### Reward — composed Rubric

`SalesPathRubric` is a `WeightedSum` over five sub-rubrics, each registered as an OpenEnv `Rubric` so external tooling can introspect per-component scores via `env.rubric.named_rubrics()`.

| Component | Weight | Type | What it captures |
|---|---|---|---|
| compliance | 0.40 | per-turn | -0.2 per new rule violation, capped at -1.0 |
| outcome | 0.20 | terminal | +1.0 success / +0.5 valid disqualify / -0.7 violation termination |
| ordering | 0.20 | per-turn | potential-based — Δ correct-prefix length per turn (arXiv:2408.10215 §4.2) |
| efficiency | 0.10 | terminal | -0.05 per turn over the per-difficulty optimum |
| format | 0.10 | per-turn | +1.0 valid+parsed / -0.3 if `format_ok=False` or invalid `action_type` |
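A minimal sketch of the weighted composition and of the potential-based ordering term. `named_rubrics()` is the documented introspection hook; the `CANONICAL` ordering and the helper names are assumptions for illustration:

```python
WEIGHTS = {
    "compliance": 0.40, "outcome": 0.20, "ordering": 0.20,
    "efficiency": 0.10, "format": 0.10,
}

# Assumed canonical workflow ordering, for illustration only.
CANONICAL = ["PROSPECT", "QUALIFY", "PRESENT", "OFFER_DEMO", "NEGOTIATE", "CLOSE"]

def correct_prefix_len(steps_completed: list[str]) -> int:
    """Length of the longest canonical prefix the agent has completed."""
    n = 0
    for expected, actual in zip(CANONICAL, steps_completed):
        if expected != actual:
            break
        n += 1
    return n

def ordering_score(prev_steps: list[str], next_steps: list[str]) -> int:
    # Potential-based shaping (Ng et al., 1999): reward the *change* in
    # potential, here the correct-prefix length, so the optimal policy is
    # unchanged by the shaping term.
    return correct_prefix_len(next_steps) - correct_prefix_len(prev_steps)

def composed_reward(components: dict[str, float]) -> float:
    """WeightedSum over the five sub-rubric scores."""
    return sum(WEIGHTS[name] * score for name, score in components.items())

# Per-component introspection, as exposed by the env:
#   for name, rubric in env.rubric.named_rubrics(): ...
```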

Why these weights: arXiv:2601.19100 §3.1 argues that for long-horizon structured-output tasks the process signal must dominate the sparse outcome signal, so compliance gets 2× the weight of outcome.

### Difficulty curriculum

| Level | Description | Correct terminal action |
|---|---|---|
| 1 | Budget known, decision maker present | CLOSE |
| 2 | Budget hidden, 1 objection, demo required | CLOSE |
| 3 | Budget hidden, 2 objections, possible stalling | CLOSE |
| 4 | Adversarial: misleading high-budget signal, no decision maker | DISQUALIFY |

The task bank carries ~20 prospect profiles per level (`task_bank.py`); the last 4 of each level are held out for `eval_baseline_vs_trained.py`.
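A sketch of how the curriculum levels and the held-out split might be materialised; the profile fields, the level-4 objection count, and the template values are assumptions derived from the table above:

```python
from dataclasses import dataclass

@dataclass
class ProspectProfile:
    level: int
    budget_known: bool
    n_objections: int
    can_stall: bool
    has_decision_maker: bool
    correct_terminal: str  # "CLOSE" or "DISQUALIFY"

LEVEL_TEMPLATES = {
    1: ProspectProfile(1, True,  0, False, True,  "CLOSE"),
    2: ProspectProfile(2, False, 1, False, True,  "CLOSE"),
    3: ProspectProfile(3, False, 2, True,  True,  "CLOSE"),
    4: ProspectProfile(4, False, 2, False, False, "DISQUALIFY"),  # adversarial
}

def split_bank(bank: list[ProspectProfile], holdout: int = 4):
    """Reserve the last `holdout` profiles of each level for evaluation."""
    train, evals = {}, {}
    for lvl in LEVEL_TEMPLATES:
        profiles = [p for p in bank if p.level == lvl]
        train[lvl], evals[lvl] = profiles[:-holdout], profiles[-holdout:]
    return train, evals
```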

## 3. Training pipeline

```
sft_demos.jsonl  →  train_sft.py  →  ./sft_checkpoint
                                         │
                                         ▼
                                  train_grpo.py
                                         │
                            on-policy rollouts in
                            SalesPathEnvironment
                                         │
                                         ▼
                                  ./grpo_checkpoint
                                         │
                       ┌─────────────────┴─────────────────┐
                       ▼                                   ▼
                plot_rewards.py            eval_baseline_vs_trained.py
                       │                                   │
                       ▼                                   ▼
              ./plots/reward_curve.png            ./eval_results.md
```

### What's specifically engineered for fast Colab/Kaggle GPUs

- **Batched rollouts** — N parallel episodes, a single `.generate()` call per turn (left-padded for correctness).
- **Threaded reward fn** — reward computation across GRPO's group of candidate completions runs in a `ThreadPoolExecutor` (the env is rule-based / CPU-cheap, so threads overlap with GPU forward passes).
- **State snapshots keyed by SHA1** — the `STATE_BANK` trick lets GRPO score single-action completions against a frozen state, avoiding full episode re-rollouts during the gradient step (see the sketch after this list).
- **N-step shaping (`GAMMA=0.95`)** — `true_env_reward_fn` extends the immediate per-turn reward with a discounted heuristic continuation, so GRPO sees credit for actions that pay off later. This is what gives this contextual-bandit-shaped problem a real long-horizon signal.
- **Optional vLLM** — `USE_VLLM=1` flips on TRL's vLLM-backed sampler for ~3× faster on-policy generation on A100/Kaggle P100.
- **Trainer-once** — `GRPOTrainer` is constructed once and trained once, preserving optimizer and LR-scheduler state across all gradient steps.
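A condensed sketch of the state-bank and shaping tricks. The `STATE_BANK`, `GAMMA`, and `true_env_reward_fn` names come from the description above; everything else (the stubbed env internals, the exact signatures) is an illustrative assumption rather than the repo's actual code:

```python
import hashlib
import pickle
from concurrent.futures import ThreadPoolExecutor

GAMMA = 0.95
STATE_BANK: dict[str, bytes] = {}  # SHA1 of a serialized state -> frozen snapshot

def snapshot(state) -> str:
    """Freeze the current env state and return its SHA1 key."""
    blob = pickle.dumps(state)
    key = hashlib.sha1(blob).hexdigest()
    STATE_BANK[key] = blob
    return key  # carried in rollout metadata so the reward fn can find it

# Stubs standing in for the real env internals (illustrative only):
def apply_action(state, completion): ...                # parse + step the simulator
def immediate_reward(state, completion) -> float: ...   # per-turn rubric score
def heuristic_continuation(state) -> float: ...         # cheap value estimate

def true_env_reward_fn(key: str, completion: str) -> float:
    """Score a single-action completion against a frozen snapshot."""
    state = pickle.loads(STATE_BANK[key])  # restore; no episode re-rollout
    r = immediate_reward(state, completion)
    # N-step shaping: add a discounted heuristic estimate of the return from
    # the successor state, so single-action scoring keeps a long-horizon signal.
    return r + GAMMA * heuristic_continuation(apply_action(state, completion))

def batched_rewards(keys: list[str], completions: list[str]) -> list[float]:
    # The env is rule-based / CPU-cheap, so threads overlap with GPU forwards.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(true_env_reward_fn, keys, completions))
```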

### Commands

```bash
# 0. Smoke test (~30 sec, no GPU)
python training/train_test.py

# 1. SFT warm-start (~10–15 min on a T4)
python training/train_sft.py

# 2. Start the env server and run GRPO (~45–90 min on a T4)
uvicorn salespath_env.server.app:app --port 7860 &
SFT_CHECKPOINT=./sft_checkpoint  USE_VLLM=0  python training/train_grpo.py

# 3. Plot reward curves
python training/plot_rewards.py

# 4. Baseline-vs-trained head-to-head on the held-out eval split
python training/eval_baseline_vs_trained.py \
    --base ./sft_checkpoint --trained ./grpo_checkpoint --episodes-per-level 8
```