---
title: SalesPath Environment
emoji: 🤝
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: RL gym environment for sales agent training
---

# SalesPath: Mastering Long-Horizon Sales via RL

**Submission for Scale AI Bonus Prize (Long-Horizon Planning)**

- Project Site: Hugging Face Space
- Deep-Dive Blog: blog.md
- Trained Model: Qwen 2.5 0.5B Sales Expert


## 💡 Motivation: The "Long-Horizon" Challenge

Most LLM agent benchmarks reward single-turn "correct" answers. Real-world business tasks, like closing a B2B sales deal, are fundamentally different. They require sequential decision-making over 20+ turns where every action constrains the future.

We built SalesPath to solve this. It's a reinforcement learning environment that forces an agent to learn a strict sales workflow and 9 complex business rules through GRPO (Group Relative Policy Optimization).

If the agent negotiates before qualifying, it's penalized. If it repeats itself, it loses. To win, it must navigate the entire path from PROSPECT to CLOSE.


πŸ› οΈ How it Works: The Environment

SalesPath is an OpenEnv-compatible gym environment.

### 1. The Sales Workflow

Agents must progress through a logical sequence: `PROSPECT → QUALIFY → PRESENT → HANDLE_OBJECTION → OFFER_DEMO → NEGOTIATE → CLOSE`
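The ordering constraint can be sketched as an ordered enum with a one-step transition check. This is an illustrative sketch, not the environment's actual code; the stage names come from the README, while `Stage` and `is_in_order` are hypothetical names.

```python
from enum import IntEnum

class Stage(IntEnum):
    # Ordered to match the SalesPath workflow
    PROSPECT = 0
    QUALIFY = 1
    PRESENT = 2
    HANDLE_OBJECTION = 3
    OFFER_DEMO = 4
    NEGOTIATE = 5
    CLOSE = 6

def is_in_order(current: Stage, proposed: Stage) -> bool:
    """Only a step to the immediately next stage keeps the workflow in order.

    Repeating a stage or skipping ahead (e.g. PROSPECT -> NEGOTIATE) is
    out of order and hurts the ordering component of the reward.
    """
    return proposed == current + 1

# The agent may advance one stage at a time, but never skip ahead:
assert is_in_order(Stage.PROSPECT, Stage.QUALIFY)
assert not is_in_order(Stage.PROSPECT, Stage.NEGOTIATE)
```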

### 2. Strict Business Rules

The environment enforces 9 rules at every turn (e.g., R03: Budget must be known before NEGOTIATE). Three violations end the episode with a heavy penalty.
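A minimal sketch of per-turn rule checking with three-strike termination might look like the following. Only rule R03 from the text is shown; the real environment enforces 9 rules, and all names here (`DealState`, `check_rules`, `apply_violations`) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DealState:
    stage: str = "PROSPECT"
    budget_known: bool = False
    violations: int = 0
    done: bool = False

def check_rules(state: DealState, action_type: str) -> list[str]:
    """Return the IDs of rules the proposed action would violate."""
    broken = []
    if action_type == "NEGOTIATE" and not state.budget_known:
        broken.append("R03")  # Budget must be known before NEGOTIATE
    return broken

def apply_violations(state: DealState, broken: list[str]) -> DealState:
    """Accumulate violations; three strikes ends the episode."""
    state.violations += len(broken)
    if state.violations >= 3:
        state.done = True
    return state
```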

### 3. Dense Reward Signal

We use 5 reward components to provide meaningful gradients at every step:

- **Outcome (40%):** Did the deal close?
- **Compliance (30%):** Rule adherence.
- **Ordering (15%):** Workflow sequence.
- **Efficiency (10%):** Turn count.
- **Format (5%):** Structural correctness (`ACTION:`/`CONTENT:`).
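The weighted combination can be sketched as a simple dot product over per-component scores. The weights come from the list above; the assumption that each component is scored in [0, 1] and the `total_reward` helper are illustrative, not the environment's actual implementation.

```python
# Component weights from the reward breakdown above (sum to 1.0)
WEIGHTS = {
    "outcome": 0.40,     # did the deal close?
    "compliance": 0.30,  # rule adherence
    "ordering": 0.15,    # workflow sequence
    "efficiency": 0.10,  # turn count
    "format": 0.05,      # ACTION:/CONTENT: structure
}

def total_reward(components: dict[str, float]) -> float:
    """Weighted sum of per-component scores, each assumed to be in [0, 1]."""
    assert set(components) == set(WEIGHTS)
    return sum(WEIGHTS[k] * components[k] for k in WEIGHTS)

# A compliant but unclosed episode still earns a dense partial reward:
total_reward({"outcome": 0.0, "compliance": 1.0, "ordering": 1.0,
              "efficiency": 0.8, "format": 1.0})  # ≈ 0.58
```

Because most of the weight sits on compliance, ordering, and format, the agent receives a meaningful gradient long before it ever closes a deal.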

## 📈 Results: 0.5B Validation (Proof of Concept)

We successfully trained Qwen 2.5 0.5B Instruct to prove that the GRPO pipeline can bake complex logic into even the smallest models.

| Metric | Before Training | After Training |
|---|---|---|
| mean_reward | -0.14 | 0.23 |
| violations / episode | 2.8 | 0.4 |
| close_success_rate | 5% | 35% |
| ordering_rate | 0.12 | 0.88 |

**Finding:** The 0.5B model learned the strict output format within 5 steps and mastered the basic workflow ordering within 100 steps, suggesting the reward function is well-shaped for scaling to 7B+ models.


## 🚀 Unified Architecture

Our Hugging Face Space is a unified submission:

1. **Inference/Env Server:** Acts as the OpenEnv API.
2. **On-Demand Training:** Can trigger a 7B GRPO scale-up via the `/train` endpoint.

### Interaction Example

```bash
# Reset for a new deal
curl -X POST https://imsachin010-salespath-env.hf.space/reset \
  -H "Content-Type: application/json" \
  -d '{"difficulty": 1}'

# Take a sales action
curl -X POST https://imsachin010-salespath-env.hf.space/step \
  -H "Content-Type: application/json" \
  -d '{"action": {"action_type": "PROSPECT", "content": "Hello! I see you are scaling your team..."}}'
```
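The same interaction can be scripted from Python using only the standard library. This is a minimal sketch assuming the endpoints accept and return JSON as in the curl example; `build_payload` and `post_json` are hypothetical helper names.

```python
import json
import urllib.request

BASE = "https://imsachin010-salespath-env.hf.space"

def build_payload(action_type: str, content: str) -> dict:
    """JSON body for the /step endpoint, mirroring the curl example."""
    return {"action": {"action_type": action_type, "content": content}}

def post_json(url: str, body: dict) -> dict:
    """POST a JSON body and decode the JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Usage (requires the Space to be running):
# obs = post_json(f"{BASE}/reset", {"difficulty": 1})
# result = post_json(f"{BASE}/step", build_payload("PROSPECT", "Hello!"))
```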

## 📚 Resources & References