---
title: SalesPath Environment
emoji: 🤖
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: RL gym environment for sales agent training
---
# SalesPath: Mastering Long-Horizon Sales via RL

**Submission for Scale AI Bonus Prize (Long-Horizon Planning)**

- Project Site: Hugging Face Space
- Deep-Dive Blog: blog.md
- Trained Model: Qwen 2.5 0.5B Sales Expert
## 💡 Motivation: The "Long-Horizon" Challenge
Most LLM agent benchmarks reward single-turn "correct" answers. Real-world business tasks, such as closing a B2B sales deal, are fundamentally different: they require sequential decision-making over 20+ turns, where every action constrains the future.
We built SalesPath to solve this. It's a reinforcement learning environment that forces an agent to learn a strict sales workflow and 9 complex business rules through GRPO (Group Relative Policy Optimization).
If the agent negotiates before qualifying, it's penalized. If it repeats itself, it loses. To win, it must navigate the entire path from PROSPECT to CLOSE.
## 🛠️ How it Works: The Environment
SalesPath is an OpenEnv-compatible gym environment.
### 1. The Sales Workflow
Agents must progress through a logical sequence:
`PROSPECT → QUALIFY → PRESENT → HANDLE_OBJECTION → OFFER_DEMO → NEGOTIATE → CLOSE`
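One way to picture the ordering constraint is as a simple state machine over the stages above. This is an illustrative sketch, not the environment's actual code; the only assumption beyond the source is that an in-order action either stays at the current stage or advances exactly one step:

```python
# Stage sequence taken from the workflow above.
STAGES = [
    "PROSPECT", "QUALIFY", "PRESENT", "HANDLE_OBJECTION",
    "OFFER_DEMO", "NEGOTIATE", "CLOSE",
]

def is_valid_transition(current: str, proposed: str) -> bool:
    """An action is in order if it stays at the current stage or advances one step."""
    i, j = STAGES.index(current), STAGES.index(proposed)
    return j in (i, i + 1)
```

Under this reading, `is_valid_transition("QUALIFY", "NEGOTIATE")` is `False`, which is exactly the "negotiates before qualifying" mistake penalized above.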
### 2. Strict Business Rules
The environment enforces 9 rules at every turn (e.g., R03: Budget must be known before NEGOTIATE). Three violations end the episode with a heavy penalty.
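Conceptually, per-turn rule checking and the three-strike termination might look like the following. This is a hypothetical sketch: `R03` is the one rule named above, while the `DealState` fields and the penalty magnitude are assumptions for illustration:

```python
from dataclasses import dataclass

MAX_VIOLATIONS = 3        # three violations end the episode (per the rules above)
VIOLATION_PENALTY = -1.0  # hypothetical per-violation penalty magnitude

@dataclass
class DealState:
    stage: str = "PROSPECT"
    budget_known: bool = False
    violations: int = 0
    done: bool = False

def check_rules(state: DealState, action_type: str) -> list[str]:
    """Return the IDs of rules the proposed action would violate."""
    violated = []
    # R03: budget must be known before NEGOTIATE (the example rule above).
    if action_type == "NEGOTIATE" and not state.budget_known:
        violated.append("R03")
    return violated

def apply_violations(state: DealState, violated: list[str]) -> float:
    """Accumulate violations; terminate the episode after the third strike."""
    state.violations += len(violated)
    if state.violations >= MAX_VIOLATIONS:
        state.done = True
    return VIOLATION_PENALTY * len(violated)
```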
### 3. Dense Reward Signal
We use 5 reward components to provide meaningful gradients at every step:
- Outcome (40%): Did the deal close?
- Compliance (30%): Rule adherence.
- Ordering (15%): Workflow sequence.
- Efficiency (10%): Turn count.
- Format (5%): Structural correctness (`ACTION:`/`CONTENT:`).
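The weighted sum above can be written down directly. The weights come from the list; the assumption (not stated in the source) is that each component score is normalized before weighting:

```python
# Weights from the five reward components listed above.
REWARD_WEIGHTS = {
    "outcome": 0.40,     # did the deal close?
    "compliance": 0.30,  # rule adherence
    "ordering": 0.15,    # workflow sequence
    "efficiency": 0.10,  # turn count
    "format": 0.05,      # ACTION:/CONTENT: structural correctness
}

def total_reward(components: dict[str, float]) -> float:
    """Weighted sum of the five per-step reward components."""
    return sum(REWARD_WEIGHTS[k] * components[k] for k in REWARD_WEIGHTS)
```

Since the weights sum to 1.0, a perfect step (every component at its maximum) yields the maximum total reward.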
## 📊 Results: 0.5B Validation (Proof of Concept)
We successfully trained Qwen 2.5 0.5B Instruct to prove that the GRPO pipeline can bake complex logic into even the smallest models.
| Metric | Before Training | After Training |
|---|---|---|
| `mean_reward` | -0.14 | 0.23 |
| violations / episode | 2.8 | 0.4 |
| `close_success_rate` | 5% | 35% |
| `ordering_rate` | 0.12 | 0.88 |
**Finding:** The 0.5B model learned the strict output format within 5 steps and mastered the basic workflow ordering within 100 steps, indicating the reward function is well-shaped for scaling to 7B+ models.
## 🚀 Unified Architecture
Our Hugging Face Space is a unified submission:
- Inference/Env Server: Acts as the OpenEnv API.
- On-Demand Training: Can trigger a 7B GRPO scale-up via the `/train` endpoint.
### Interaction Example
```bash
# Reset for a new deal
curl -X POST https://imsachin010-salespath-env.hf.space/reset \
  -H "Content-Type: application/json" \
  -d '{"difficulty": 1}'

# Take a sales action
curl -X POST https://imsachin010-salespath-env.hf.space/step \
  -H "Content-Type: application/json" \
  -d '{"action": {"action_type": "PROSPECT", "content": "Hello! I see you are scaling your team..."}}'
```
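The same interaction can be scripted from Python. This is a minimal client sketch using only the standard library, assuming the endpoints above accept and return JSON (the request/response schemas beyond the curl examples are not specified in this README):

```python
import json
import urllib.request

# Space URL taken from the curl examples above.
BASE = "https://imsachin010-salespath-env.hf.space"

def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for one of the Space's endpoints."""
    return urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def post_json(path: str, payload: dict) -> dict:
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(build_request(path, payload)) as resp:
        return json.load(resp)

# Usage (performs real HTTP calls against the Space):
#   post_json("/reset", {"difficulty": 1})
#   post_json("/step", {"action": {"action_type": "PROSPECT",
#                                  "content": "Hello! I see you are scaling your team..."}})
```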
## 📚 Resources & References
- Architecture Deep Dive: blog.md
- Rules Documentation: RULES.md
- Framework: OpenEnv Core
- Algorithm: DeepSeek's GRPO
- Training Tool: Unsloth