---
title: SalesPath Environment
emoji: 🤝
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: OpenEnv RL gym for training B2B sales agents via GRPO
---
# SalesPath: RL Environment for B2B Sales Agents
> **An OpenEnv-compliant gym for teaching an LLM to follow a multi-step,
> rule-governed B2B sales workflow with programmatic verification at every
> step. Targets the Scale AI bonus track on long-horizon non-code business
> workflows.**
* **Theme**: #2 – Long-Horizon Planning & Instruction Following
* **Bonus track**: Scale AI – Sales / PM / HR & IT workflows
* **HF Space**: https://huggingface.co/spaces/Lomesh7777/openenv-multi-agent-RL
* **Blog post**: _add link before submission_
* **Demo video (≤2 min)**: _add link before submission_
---
## 1. Problem
Off-the-shelf LLMs prompted to act as sales agents reliably break the
fundamentals of B2B selling: they pitch before qualifying, offer discounts
before establishing value, and ignore ordering constraints that real sales
orgs treat as inviolable. This is not because they lack knowledge, but
because no training environment ever penalised these behaviours.
SalesPath is that environment.
The agent navigates a 3-to-8-step workflow against a deterministic
`ProspectSimulator`, and at every turn the environment programmatically
verifies nine business rules (R01–R09). A composed
[OpenEnv `Rubric`](salespath_env/server/reward.py) emits a dense
five-component reward. A minimal client loop is sketched below.
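For orientation, here is a minimal client-side episode loop against a running
server. The class name `SalesPathClient`, the `reset`/`step` signatures, and
the scripted policy are illustrative assumptions; the actual interface lives
in `salespath_env/client.py` and `salespath_env/models.py`.
```python
# Hypothetical client API; check salespath_env/client.py for the real names.
from salespath_env.client import SalesPathClient
from salespath_env.models import SalesPathAction

client = SalesPathClient(base_url="http://localhost:7860")
obs = client.reset(difficulty=1)  # draw a level-1 prospect profile

# Scripted placeholder policy that respects the ordering rules (R01/R02/R06).
# A real agent would instead generate each action with an LLM conditioned on
# obs.prospect_response and the workflow state.
for step in ["PROSPECT", "QUALIFY", "PRESENT", "OFFER_DEMO", "NEGOTIATE", "CLOSE"]:
    if obs.done:
        break
    obs = client.step(SalesPathAction(action_type=step, format_ok=True))
    print(obs.turn_number, obs.reward, obs.constraints_violated)
```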
## 2. Environment
### Observation
```jsonc
{
"prospect_response": "...",
"workflow_stage": "PRESENT",
"constraints_violated": ["R01"],
"steps_completed": ["PROSPECT", "PRESENT"],
"turn_number": 3,
"reward": -0.18,
"reward_components": { "r_outcome": 0.0, "r_compliance": -0.2, ... },
"done": false
}
```
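The same payload as an illustrative dataclass; field types are inferred from
the example above, and the canonical definitions live in
`salespath_env/models.py`.
```python
from dataclasses import dataclass

@dataclass
class SalesPathObservation:
    prospect_response: str               # simulator's latest reply
    workflow_stage: str                  # e.g. "PRESENT"
    constraints_violated: list[str]      # rule IDs newly violated this turn
    steps_completed: list[str]           # actions taken so far, in order
    turn_number: int
    reward: float                        # weighted sum of the five components
    reward_components: dict[str, float]  # per-rubric scores
    done: bool
```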
### Action
| Action | When to use |
|---|---|
| `PROSPECT` | Opening turn only (initial outreach) |
| `QUALIFY` | Uncover budget, decision maker, pain points |
| `PRESENT` | Pitch the solution (requires `QUALIFY` first) |
| `HANDLE_OBJECTION` | Respond to pricing / timing objections |
| `OFFER_DEMO` | Schedule a live product demo |
| `NEGOTIATE` | Discuss pricing/terms (requires `OFFER_DEMO` + known budget) |
| `CLOSE` | Attempt to sign the deal |
| `FOLLOW_UP` | Re-engage after prospect silence |
| `DISQUALIFY` | End the conversation (only valid for low-budget, no-DM prospects) |
The action carries a `format_ok` flag set by the agent's parser. A malformed
completion that happens to coerce to a valid `action_type` is still penalised
by the `FormatRubric`, closing the silent format-hack surface from v1 (see
the parser sketch below).
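A sketch of that parsing contract, assuming the agent must emit exactly one
action token (the regex fallback and function name are illustrative; the real
parser is agent-side code):
```python
from __future__ import annotations
import re

VALID_ACTIONS = {
    "PROSPECT", "QUALIFY", "PRESENT", "HANDLE_OBJECTION", "OFFER_DEMO",
    "NEGOTIATE", "CLOSE", "FOLLOW_UP", "DISQUALIFY",
}

def parse_completion(text: str) -> tuple[str | None, bool]:
    """Return (action_type, format_ok)."""
    strict = text.strip()
    if strict in VALID_ACTIONS:
        return strict, True  # well-formed: exactly one action token
    # Lenient fallback: salvage the first action-like token so the episode
    # can continue, but report format_ok=False so the FormatRubric still
    # applies its -0.3 penalty. Coercion can't silently score as clean.
    match = re.search("|".join(VALID_ACTIONS), text)
    return (match.group(0), False) if match else (None, False)
```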
### Business rules (R01–R09)
| Rule | Description |
|---|---|
| R01 | Must `QUALIFY` before `PRESENT` |
| R02 | Must `OFFER_DEMO` before `NEGOTIATE` |
| R03 | Cannot `NEGOTIATE` while budget is unknown |
| R04 | Discount in `NEGOTIATE` only after 2 objections handled |
| R05 | Cannot repeat the same action on consecutive turns |
| R06 | First action must be `PROSPECT` |
| R07 | `FOLLOW_UP` only valid after prospect silence (stall) |
| R08 | `DISQUALIFY` valid only when `budget < threshold AND no decision_maker` |
| R09 | Must `OFFER_DEMO` before `CLOSE` (difficulty 2+) |
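Each rule is a deterministic predicate over the episode state and the
proposed action; there is no LLM in the verifier. A sketch of R01 and R08
(state field names are hypothetical; the real checks live in
`salespath_env/server/rules.py`):
```python
def violates_r01(state, action_type: str) -> bool:
    """R01: must QUALIFY before PRESENT."""
    return action_type == "PRESENT" and "QUALIFY" not in state.steps_completed

def violates_r08(state, action_type: str) -> bool:
    """R08: DISQUALIFY only when budget < threshold AND no decision maker."""
    if action_type != "DISQUALIFY":
        return False
    low_budget = state.budget is not None and state.budget < state.budget_threshold
    return not (low_budget and not state.has_decision_maker)
```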
### Reward: composed Rubric
`SalesPathRubric` is a `WeightedSum` over five sub-rubrics, each registered
as an OpenEnv `Rubric` so external tooling can introspect per-component
scores via `env.rubric.named_rubrics()`.
| Component | Weight | Type | What it captures |
|---|---|---|---|
| `compliance` | 0.40 | per-turn | -0.2 per new rule violation, capped at -1.0 |
| `outcome` | 0.20 | terminal | +1.0 success / +0.5 valid disqualify / -0.7 violation termination |
| `ordering` | 0.20 | per-turn | **potential-based**: Δ correct-prefix length per turn (arXiv:2408.10215 §4.2) |
| `efficiency` | 0.10 | terminal | -0.05 per turn over the per-difficulty optimum |
| `format` | 0.10 | per-turn | +1.0 valid+parsed / -0.3 if `format_ok=False` or invalid action_type |
Why these weights: arXiv:2601.19100 §3.1 argues that for long-horizon
structured-output tasks the *process* signal must dominate the sparse
*outcome* signal. We give compliance 2× the weight of outcome.
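A self-contained sketch of the composition and of the potential-based
ordering component. The real code subclasses OpenEnv's `Rubric`/`WeightedSum`;
the class and field names below are illustrative, not the OpenEnv API.
```python
from typing import Callable

class WeightedSumSketch:
    """Maps name -> (weight, scoring fn); what named_rubrics() would expose."""
    def __init__(self, components: dict[str, tuple[float, Callable]]):
        self.components = components

    def named_rubrics(self) -> dict:
        return dict(self.components)

    def score(self, state, action) -> tuple[float, dict[str, float]]:
        parts = {name: fn(state, action)
                 for name, (_, fn) in self.components.items()}
        total = sum(weight * parts[name]
                    for name, (weight, _) in self.components.items())
        return total, parts  # -> obs.reward, obs.reward_components

def ordering_score(state, action) -> float:
    # Potential-based shaping: Phi(s) = length of the longest correct
    # workflow prefix completed so far. Rewarding the per-turn delta
    # Phi(s') - Phi(s) telescopes over the episode, so the shaping cannot
    # change which policy is optimal.
    return state.prefix_len_after - state.prefix_len_before
```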
### Difficulty curriculum
| Level | Description | Correct terminal action |
|---|---|---|
| 1 | Budget known, decision maker present | `CLOSE` |
| 2 | Budget hidden, 1 objection, demo required | `CLOSE` |
| 3 | Budget hidden, 2 objections, possible stalling | `CLOSE` |
| 4 | Adversarial: misleading high-budget signal, no decision maker | `DISQUALIFY` |
The task bank carries ~20 prospect profiles per level (`task_bank.py`); the
last 4 of each level are held out for `eval_baseline_vs_trained.py` (see the
split sketch below).
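The held-out split itself is trivial; a hypothetical version of the helper
in `task_bank.py`:
```python
def train_eval_split(profiles: list[dict]) -> tuple[list[dict], list[dict]]:
    """Per level: everything but the last 4 profiles is used for training
    rollouts; the last 4 are reserved for eval_baseline_vs_trained.py."""
    return profiles[:-4], profiles[-4:]
```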
## 3. Training pipeline
```
sft_demos.jsonl → train_sft.py → ./sft_checkpoint
                                        │
                                        ▼
                                  train_grpo.py
                                        │
                              on-policy rollouts in
                              SalesPathEnvironment
                                        │
                                        ▼
                                 ./grpo_checkpoint
                                        │
                   ┌────────────────────┴────────────────────┐
                   ▼                                         ▼
           plot_rewards.py                   eval_baseline_vs_trained.py
                   │                                         │
                   ▼                                         ▼
     ./plots/reward_curve.png                        ./eval_results.md
```
### What's specifically engineered for fast Colab/Kaggle GPUs
* **Batched rollouts**: N parallel episodes, a single `.generate()` call per
  turn (left-padded for correctness).
* **Threaded reward fn**: reward computation across GRPO's group of
  candidate completions runs in a `ThreadPoolExecutor` (the env is
  rule-based and CPU-cheap, so threads overlap with GPU forward passes).
* **State snapshots keyed by SHA1**: the `STATE_BANK` trick lets GRPO score
  single-action completions against a frozen state, avoiding full episode
  re-rollouts during the gradient step (see the sketch after this list).
* **N-step shaping** (`GAMMA=0.95`): `true_env_reward_fn` extends the
  immediate per-turn reward with a discounted heuristic continuation, so
  GRPO sees credit for actions that pay off later. This is what gives the
  otherwise contextual-bandit-shaped problem a real long-horizon signal
  (also sketched after this list).
* **Optional vLLM**: `USE_VLLM=1` switches on TRL's vLLM-backed sampler for
  ~3× faster on-policy generation (recommended on A100 only; see the
  env-var table below).
* **Trainer-once**: `GRPOTrainer` is constructed once and trained once,
  preserving optimizer and LR-scheduler state across all gradient steps.
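Sketches of the state-bank and shaping tricks above, with hypothetical names
(the real versions live in `training/train_grpo.py`):
```python
import hashlib
import json

GAMMA = 0.95  # n-step continuation discount

def state_key(state_dict: dict) -> str:
    """STATE_BANK key: SHA1 over canonical JSON, so a single-action GRPO
    completion can be scored against a frozen snapshot of its episode."""
    payload = json.dumps(state_dict, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()

def shaped_reward(immediate: float, heuristic_continuation: list[float]) -> float:
    """N-step shaping: add a GAMMA-discounted heuristic continuation to the
    immediate per-turn reward so later payoffs still assign credit now."""
    return immediate + sum(GAMMA ** (k + 1) * r
                           for k, r in enumerate(heuristic_continuation))
```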
### Commands
```bash
# 0. Smoke test (~30 sec, no GPU)
python training/train_test.py
# 1. SFT warm-start (~10–15 min on a T4)
python training/train_sft.py
# 2. Start the env server and run GRPO (~45–90 min on a T4)
uvicorn salespath_env.server.app:app --port 7860 &
SFT_CHECKPOINT=./sft_checkpoint USE_VLLM=0 python training/train_grpo.py
# 3. Plot reward curves
python training/plot_rewards.py
# 4. Baseline-vs-trained head-to-head on the held-out eval split
python training/eval_baseline_vs_trained.py \
--base ./sft_checkpoint --trained ./grpo_checkpoint --episodes-per-level 8
```
### Useful env vars for Colab/Kaggle tuning
| Var | Default | Notes |
|---|---|---|
| `ROLLOUTS_PER_DIFFICULTY` | 8 | More → bigger / more diverse state bank |
| `NUM_GENERATIONS` | 4 | GRPO group size; on T4 keep ≤4 to fit VRAM |
| `PER_DEVICE_BATCH` | 2 | T4 / Kaggle P100 default |
| `GRAD_ACCUM` | 4 | Effective batch = 8 |
| `NUM_REWARD_WORKERS` | 8 | Threadpool size for the reward fn |
| `USE_VLLM` | 0 | Set to `1` on A100 only |
| `BETA` | 0.05 | KL-to-reference penalty |
| `GAMMA` | 0.95 | n-step continuation discount |
## 4. Results
After ~1 GRPO pass (eval on the **held-out** profiles, 8 episodes per level):
> See `eval_results.md` (regenerated by `eval_baseline_vs_trained.py`)
> and `plots/reward_curve.png` (regenerated by `plot_rewards.py`).
The conservative target table from the proposal:
| Metric | Base | After GRPO (target) |
|---|---|---|
| Rule violations per episode | 3.5 | < 0.5 |
| Correct step ordering rate | 0.45 | > 0.85 |
| Successful close rate (L1) | 0.30 | > 0.75 |
| Correct disqualification rate (L4) | 0.20 | > 0.65 |
| Mean episode reward | ~0.10 | > 0.6 |
## 5. File layout
```
salespath-env/
├── salespath_env/
│   ├── __init__.py                 ← public API exports
│   ├── client.py                   ← HTTP client for the env
│   ├── models.py                   ← Action / Observation / State + format_ok
│   ├── openenv.yaml                ← OpenEnv manifest (spec_version: 1)
│   └── server/
│       ├── app.py                  ← Custom stateful FastAPI (HF Spaces)
│       ├── salespath_environment.py
│       ├── prospect_simulator.py   ← Deterministic, state-seeded
│       ├── rules.py                ← R01–R09
│       ├── reward.py               ← SalesPathRubric (WeightedSum of 5)
│       └── task_bank.py            ← 19–20 profiles/level + held-out split
├── training/
│   ├── sft_demos.jsonl
│   ├── train_test.py               ← smoke test + bug regression
│   ├── train_sft.py
│   ├── train_grpo.py               ← GRPO + n-step + parallel reward fn
│   ├── eval_baseline_vs_trained.py
│   └── plot_rewards.py
├── Dockerfile
├── requirements.txt
└── pyproject.toml
```
## 6. Why this design wins on the rubric
| Criterion (weight) | How we hit it |
|---|---|
| **Environment Innovation (40%)** | Business workflow with programmatic verification, deterministic rule-based simulator (no LLM in the verifier, which prevents reward hacking via prompt manipulation), 4-level curriculum with held-out eval, OpenEnv `Rubric` composition. |
| **Storytelling (30%)** | The sales workflow is legible to any reader in 10 seconds. The before/after table from `eval_baseline_vs_trained.py` is the headline. A live-demo script covers 0:30–1:30 of the demo plan. |
| **Improvement in Rewards (20%)** | Five tracked metrics, dense per-turn signal, reward curves with min/max band and difficulty-step markers, baseline vs trained eval table. |
| **Reward & Pipeline (10%)** | Composed Rubric system; potential-based ordering shaping (no policy distortion); n-step continuation closes the contextual-bandit gap; format-hack surface explicitly closed; trainer instantiated once with optimizer state preserved. |
## 7. References
* Reward engineering survey – [arXiv:2408.10215](https://arxiv.org/abs/2408.10215)
* Reward engineering for software RL – [arXiv:2601.19100](https://arxiv.org/abs/2601.19100)
* OpenEnv – https://github.com/meta-pytorch/OpenEnv
* OpenEnv Rubric RFC – [`rfcs/004-rubrics.md`](https://github.com/meta-pytorch/OpenEnv)