Commit · dd9667a
1 Parent(s): 0f1af14
Update blog with 0.5B results and project metrics
README.md CHANGED
@@ -10,42 +10,83 @@ license: mit
 short_description: RL gym environment for sales agent training
 ---

-# SalesPath

-

-

-
-|--------|----------|-------------|
-| `POST` | `/reset` | Reset the environment, returns initial observation |
-| `POST` | `/step` | Take an action, returns next observation + reward |
-| `GET` | `/health` | Health check |

-

-
-
-
-
-
-

-##
 ```bash
 curl -X POST https://imsachin010-salespath-env.hf.space/step \
--
-  -d '{"action": {"action_type": "PROSPECT", "content": "Hello, tell me about your workflow challenges."}}'
 ```

-
-
-
-
---
---
---
---
---
-- `DISQUALIFY`: Exit if prospect is not a fit
-- `CLOSE`: Attempt to close the deal
 short_description: RL gym environment for sales agent training
 ---

+# SalesPath: Mastering Long-Horizon Sales via RL

+**Submission for Scale AI Bonus Prize (Long-Horizon Planning)**
+**Project Site:** [Hugging Face Space](https://huggingface.co/spaces/Imsachin010/salespath-env)
+**Deep-Dive Blog:** [blog.md](blog.md)
+**Trained Model:** [Qwen 2.5 0.5B Sales Expert](https://huggingface.co/Imsachin010/salespath-qwen25-0.5b)

+---

+## Motivation: The "Long-Horizon" Challenge

+Most LLM agent benchmarks reward single-turn "correct" answers. Real-world business tasks, such as closing a B2B sales deal, are fundamentally different. They require **sequential decision-making** over 20+ turns, where every action constrains the ones that follow.

+We built **SalesPath** to address this. It is a reinforcement learning environment in which an agent must learn a strict sales workflow and 9 business rules through **GRPO (Group Relative Policy Optimization)**.
+
+If the agent negotiates before qualifying, it is penalized. If it repeats itself, it loses. To win, it must navigate the entire path from `PROSPECT` to `CLOSE`.
+
+---
+
+## How it Works: The Environment
+
+SalesPath is an [OpenEnv](https://github.com/openenv)-compatible gym environment.
+
+### 1. The Sales Workflow
+Agents must progress through a logical sequence:
+`PROSPECT` → `QUALIFY` → `PRESENT` → `HANDLE_OBJECTION` → `OFFER_DEMO` → `NEGOTIATE` → `CLOSE`
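As a rough illustration of what "progressing through the sequence" could mean, the sketch below flags the first action that skips ahead of the stage order listed above. The function name `first_out_of_order` and the rule that an action may be at most one stage past the furthest stage reached are assumptions made for this example; the environment's actual ordering check is not shown in this README.

```python
# Stage order copied from the workflow line above.
WORKFLOW = [
    "PROSPECT", "QUALIFY", "PRESENT", "HANDLE_OBJECTION",
    "OFFER_DEMO", "NEGOTIATE", "CLOSE",
]


def first_out_of_order(actions):
    """Return the index of the first action that skips ahead, or None.

    Assumption (not from the README): an action is in order if its stage
    is at most one step past the furthest stage reached so far, so
    repeating or revisiting an earlier stage is allowed here.
    """
    furthest = -1  # no stage reached yet
    for i, action in enumerate(actions):
        stage = WORKFLOW.index(action)  # raises ValueError for unknown actions
        if stage > furthest + 1:
            return i
        furthest = max(furthest, stage)
    return None
```

For example, `first_out_of_order(["PROSPECT", "NEGOTIATE"])` points at index 1, because `NEGOTIATE` is attempted before any of the intermediate stages.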
+
+### 2. Strict Business Rules
+The environment enforces 9 rules at every turn (e.g., *R03: Budget must be known before NEGOTIATE*). Three violations end the episode with a heavy penalty.
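A minimal sketch of how one such rule and the three-violation cutoff might be tracked is shown below. Only R03 and the limit of three come from the text; the names `DealState`, `budget_known`, `check_r03`, and `apply_action` are hypothetical, and the size of the penalty is not modeled because it is not stated here.

```python
from dataclasses import dataclass

MAX_VIOLATIONS = 3  # from the text: three violations end the episode


@dataclass
class DealState:
    """Hypothetical per-episode state; field names are illustrative."""
    budget_known: bool = False
    violations: int = 0
    done: bool = False


def check_r03(state, action_type):
    """R03 as described above: budget must be known before NEGOTIATE."""
    return not (action_type == "NEGOTIATE" and not state.budget_known)


def apply_action(state, action_type):
    """Count a violation when R03 fails and end the episode at the limit."""
    if not check_r03(state, action_type):
        state.violations += 1
        if state.violations >= MAX_VIOLATIONS:
            state.done = True
    return state
```

The remaining eight rules would presumably follow the same pattern: a predicate over the current state and the proposed action, checked on every turn.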
+
+### 3. Dense Reward Signal
+We use 5 reward components to provide meaningful gradients at every step:
+- **Outcome (40%)**: Did the deal close?
+- **Compliance (30%)**: Rule adherence.
+- **Ordering (15%)**: Workflow sequence.
+- **Efficiency (10%)**: Turn count.
+- **Format (5%)**: Structural correctness (`ACTION:/CONTENT:`).
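One plausible reading of these percentages is a weighted sum, sketched below. The weights are the ones listed above; the assumptions are that the components are combined linearly and that each component score is already scaled to a common range. The name `combine_reward` is hypothetical.

```python
# Weights taken from the percentages in the list above.
WEIGHTS = {
    "outcome": 0.40,
    "compliance": 0.30,
    "ordering": 0.15,
    "efficiency": 0.10,
    "format": 0.05,
}


def combine_reward(components):
    """Weighted sum of the five component scores.

    Assumption: a linear combination; how the environment actually
    combines the components is not shown in this README.
    """
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise KeyError(f"missing reward components: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
```

Under this reading, an episode that scores fully on everything except the outcome would receive about 0.60, so most of the signal is available even when the deal does not close.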
+
+---

+## Results: 0.5B Validation (Proof of Concept)
+
+We trained **Qwen 2.5 0.5B Instruct** to show that the GRPO pipeline can teach this workflow logic to even a very small model.
+
+| Metric | Before Training | After Training |
+|--------|----------------|----------------|
+| `mean_reward` | `-0.14` | **`0.23`** |
+| `violations / episode` | `2.8` | **`0.4`** |
+| `close_success_rate` | `5%` | **`35%`** |
+| `ordering_rate` | `0.12` | **`0.88`** |
+
+**Finding:** The 0.5B model learned the strict format within **5 steps** and the basic workflow ordering within **100 steps**, which suggests the reward function gives a usable learning signal for scaling to 7B+ models.
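For readers unfamiliar with GRPO, its core idea is to score each sampled completion relative to the other completions drawn for the same prompt, rather than against a learned value function. The sketch below shows that normalization only. Using the population standard deviation and a small `eps` guard are choices made for this example, and none of it is taken from the SalesPath training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    Each advantage is (reward - group mean) / (group std + eps).
    eps avoids division by zero when every sample gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal, which is one reason a dense, multi-component reward can help early in training.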
+
+---
+
+## Unified Architecture
+
+Our Hugging Face Space is a unified submission:
+1. **Inference/Env Server:** Acts as the OpenEnv API.
+2. **On-Demand Training:** Can trigger a 7B GRPO scale-up via the `/train` endpoint.
+
+### Interaction Example
 ```bash
+# Reset for a new deal
+curl -X POST https://imsachin010-salespath-env.hf.space/reset -d '{"difficulty": 1}'
+
+# Take a sales action
 curl -X POST https://imsachin010-salespath-env.hf.space/step \
+  -d '{"action": {"action_type": "PROSPECT", "content": "Hello! I see you are scaling your team..."}}'
 ```
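The same two calls could be made from Python using only the standard library, roughly as sketched below. The base URL and request bodies mirror the curl commands above. The helper names (`step_payload`, `post_json`, `demo`), the `Content-Type: application/json` header, and the assumption that replies are JSON are not confirmed by this README.

```python
import json
import urllib.request

# Base URL taken from the curl example above.
BASE_URL = "https://imsachin010-salespath-env.hf.space"


def step_payload(action_type, content):
    """Build the JSON body used by the /step call above."""
    return {"action": {"action_type": action_type, "content": content}}


def post_json(path, payload, timeout=30):
    """POST a JSON body and decode the reply.

    Assumptions: the server accepts application/json and answers with
    JSON; the reply schema is not documented in this README.
    """
    request = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def demo():
    """Reset the environment, then take one PROSPECT action."""
    print(post_json("/reset", {"difficulty": 1}))
    print(post_json("/step", step_payload(
        "PROSPECT", "Hello! I see you are scaling your team...")))
```

Calling `demo()` performs live network requests against the Space, so it only works while the Space is running.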

+---
+
+## Resources & References
+
+- **Architecture Deep Dive:** [blog.md](blog.md)
+- **Rules Documentation:** [RULES.md](RULES.md)
+- **Framework:** [OpenEnv Core](https://github.com/openenv/openenv)
+- **Algorithm:** [DeepSeek's GRPO](https://arxiv.org/abs/2402.03300)
+- **Training Tool:** [Unsloth](https://github.com/unslothai/unsloth)