Aviral Bhargava committed on
Commit · a72a5e3
1 Parent(s): 610b7e5
feat: complete OpenEnv compliance, tasks, and logic fixes
Files changed:
- .gitignore +29 -0
- Dockerfile +26 -0
- README.md +115 -1
- __pycache__/env_wrapper.cpython-312.pyc +0 -0
- env_wrapper.py +291 -43
- inference.py +161 -73
- openenv.yaml +94 -0
- requirements.txt +3 -0
- tasks.py +126 -0
- test_env.py +107 -0
- test_output.txt +45 -0
.gitignore ADDED
@@ -0,0 +1,29 @@
# Python
__pycache__/
*.py[cod]
*.pyo
*.egg-info/
dist/
build/
*.egg

# Environment
.env
.venv/
venv/

# IDE
.vscode/
.idea/
*.swp
*.swo

# OS
.DS_Store
Thumbs.db

# Build artifacts
*.exe
*.o
*.out
test_sim.exe

Dockerfile ADDED
@@ -0,0 +1,26 @@
# ── Negotiation Environment — OpenEnv Dockerfile ──
# Person 3: Complete this Dockerfile for HuggingFace Spaces deployment
# Requirements: Python 3.11+, pip dependencies, inference.py entrypoint
# Constraints: CPU only, vcpu=2, memory=8gb, runtime < 20min

FROM python:3.11-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy project files
COPY env_wrapper.py .
COPY tasks.py .
COPY inference.py .
COPY openenv.yaml .

# Environment variables (set at runtime, NOT hardcoded)
# API_BASE_URL — The API endpoint for the LLM
# MODEL_NAME — The model identifier to use for inference
# HF_TOKEN — Your HuggingFace API key

# Entrypoint
CMD ["python", "inference.py"]

README.md CHANGED
@@ -1 +1,115 @@

# 🤝 Strategic Negotiation Environment — OpenEnv

A simulation environment where an AI agent learns to negotiate under uncertainty, compliant with the [Meta OpenEnv specification](https://github.com/meta-llama/open-env).

## 🧠 Overview

This environment simulates **real-world price negotiation** — a task humans do daily in marketplaces, business deals, and automated pricing systems. The agent must:

- **Maximize profit** by negotiating favorable deals
- **Adapt to opponent behavior** (greedy, fair, or impatient personalities)
- **Make multi-step strategic decisions** under partial observability

The agent cannot see the opponent's true valuation or strategy — it must infer patterns and adjust.

---

## 🎮 Action Space

| Action | Description |
|---|---|
| `OFFER <price>` | Make a counter-offer at the given price (100–1000) |
| `ACCEPT` | Accept the current offer on the table |
| `REJECT` | Walk away from the negotiation (terminal, -50 penalty) |

## 👁️ Observation Space

| Field | Type | Description |
|---|---|---|
| `current_offer` | int | Current price on the table |
| `round` | int | Current round number |
| `max_rounds` | int | Maximum allowed rounds |
| `role` | string | Agent's role: "buyer" or "seller" |
| `last_opponent_action` | string | "START", "OFFER", or "ACCEPT" |
| `last_opponent_offer` | int | Opponent's last offered price |
| `history` | list | History of all actions this episode |
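
A minimal interaction sketch, mirroring the usage in `test_env.py` from this commit (no API key needed):

```python
from env_wrapper import EnvWrapper

# Buyer with private valuation 800 vs. a fair seller valuing the item at 400
env = EnvWrapper(opp_type="fair", a_val=800, o_val=400,
                 agent_role="buyer", max_rounds=10)
obs = env.reset()
print(obs.current_offer, obs.role)  # opening offer (1000 here), "buyer"

# Counter-offer; the opponent either accepts or counters
obs, reward, done, info = env.step("OFFER", 650)
print(obs.last_opponent_action, obs.last_opponent_offer, reward, done)
```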

## 💰 Reward Function

| Event | Reward |
|---|---|
| Successful deal | `profit × (1 - round/max_rounds)` |
| Bad deal (profit < 0) | Additional -20 penalty |
| Rejection / Timeout | -50 |
| Aggressive offers | Cumulative -2 per aggressive step |
| Progress toward deal | Small shaping signal (±2 max) |
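
Worked example: a buyer with valuation 800 who closes a deal at 650 in round 1 of 10 earns profit 800 − 650 = 150, scaled by the time factor 1 − 1/10 = 0.9 for a base reward of 135; the one aggressive offer (more than 150 away from the opponent's valuation) subtracts 2, giving 133 — exactly the value logged in Test 1 of `test_output.txt`.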

---

## 📋 Tasks

| Task | Difficulty | Opponent | ZOPA | Rounds | Threshold |
|---|---|---|---|---|---|
| `task_a_easy` | Easy | Fair | Wide (400) | 20 | 0.2 |
| `task_b_medium` | Medium | Greedy | Narrow (200) | 15 | 0.3 |
| `task_c_hard` | Hard | Impatient | Tight (120) | 6 | 0.4 |
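
Each task ships with a programmatic grader (see `tasks.py`). A minimal sketch, reproducing Test 4 of `test_env.py`:

```python
from tasks import ALL_TASKS, get_grader

task = ALL_TASKS[0]                        # task_a_easy
grader = get_grader(task)
# Args: per-step rewards, steps taken, whether a deal closed
result = grader.grade([0.0, 0.0, 50.0], 3, True)
print(result["score"], result["success"])  # 0.125 False (threshold 0.2)
```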

---

## 🚀 Setup & Usage

### Prerequisites
- Python 3.11+
- HuggingFace API token

### Install
```bash
pip install -r requirements.txt
```

### Configure Environment Variables
```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="meta-llama/Meta-Llama-3-8B-Instruct"
export HF_TOKEN="your_token_here"
```

### Run Inference
```bash
python inference.py
```

### Docker
```bash
docker build -t negotiation-env .
docker run -e HF_TOKEN=your_token -e API_BASE_URL=https://router.huggingface.co/v1 -e MODEL_NAME=meta-llama/Meta-Llama-3-8B-Instruct negotiation-env
```

---

## 📊 Baseline Scores

<!-- Person 3: Fill in baseline scores after running inference -->
| Task | Score | Steps | Deal Made |
|---|---|---|---|
| task_a_easy | _TBD_ | _TBD_ | _TBD_ |
| task_b_medium | _TBD_ | _TBD_ | _TBD_ |
| task_c_hard | _TBD_ | _TBD_ | _TBD_ |

---

## 🏗️ Architecture

```
LLM (HuggingFace via OpenAI Client)
        ↓
inference.py (control loop + logging)
        ↓
env_wrapper.py (OpenEnv-compatible environment)
        ↓
tasks.py (task configs + graders)
```
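
`inference.py` emits one structured log line per event; the format (placeholders shown, not literal output) is:

```
[START] task=<task_name> env=negotiation model=<model_name>
[STEP] step=<n> action=<action> reward=<r> done=<true|false> error=<msg|null>
[END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...>
```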

## 📄 License

Apache 2.0

__pycache__/env_wrapper.cpython-312.pyc CHANGED
Binary files a/__pycache__/env_wrapper.cpython-312.pyc and b/__pycache__/env_wrapper.cpython-312.pyc differ

env_wrapper.py CHANGED
@@ -1,100 +1,348 @@

"""
Negotiation Environment Wrapper — OpenEnv Compliant
Implements: reset(), step(), state()
Typed models via Pydantic for Observation, Action, Reward
"""

import random
from typing import Optional, List, Dict, Any
from pydantic import BaseModel, Field


# ─────────────────────────────────────────────
# OpenEnv Typed Models
# ─────────────────────────────────────────────

class Observation(BaseModel):
    """Observable state visible to the agent."""
    agent_value: int = Field(description="The agent's private valuation/target value for the deal")
    current_offer: int = Field(description="Current price on the table")
    round: int = Field(description="Current round number (0-indexed before first step)")
    max_rounds: int = Field(description="Maximum allowed rounds")
    role: str = Field(description="Agent role: 'buyer' or 'seller'")
    last_opponent_action: str = Field(description="Opponent's last action: 'START', 'OFFER', 'ACCEPT'")
    last_opponent_offer: int = Field(description="Opponent's last offered price")
    history: List[Dict[str, Any]] = Field(default_factory=list, description="History of all actions this episode")


class ActionModel(BaseModel):
    """Action the agent can take."""
    action_type: str = Field(description="One of: 'OFFER', 'ACCEPT', 'REJECT'")
    price: int = Field(default=0, description="Price for OFFER actions, ignored for ACCEPT/REJECT")


class RewardInfo(BaseModel):
    """Reward information returned by step()."""
    reward: float = Field(description="Numeric reward for this step")
    breakdown: Dict[str, float] = Field(default_factory=dict, description="Reward component breakdown")


# ─────────────────────────────────────────────
# Opponent Strategy
# ─────────────────────────────────────────────

class Opponent:
    """
    Simulates opponent negotiation behavior.
    Three personalities: greedy, fair, impatient.
    Each has different concession rates, anchor effects, patience, and noise.
    """

    PROFILES = {
        "greedy": {"r": 0.05, "alpha": 0.7, "patience": 10, "epsilon": 5},
        "fair": {"r": 0.15, "alpha": 0.4, "patience": 7, "epsilon": 10},
        "impatient": {"r": 0.25, "alpha": 0.2, "patience": 3, "epsilon": 15},
    }

    def __init__(self, type_str: str, value: int, role: str):
        self.type = type_str
        self.opponent_value = value
        self.opponent_role = role
        self.history: List[Dict[str, Any]] = []

        profile = self.PROFILES.get(type_str, self.PROFILES["fair"])
        self.r = profile["r"]
        self.alpha = profile["alpha"]
        self.patience = profile["patience"]
        self.epsilon = profile["epsilon"]
        self.concession_rate = self.r

    def reset_state(self):
        """Reset concession rate and history for new episode."""
        self.concession_rate = self.r
        self.history = []

    def get_response(self, round_num: int, current_offer: int, agent_offer: int, agent_action_type: str):
        """
        Generate opponent response to agent's action.
        Returns: (action_type: str, price: int)
        """
        if agent_action_type != "OFFER":
            return "REJECT", 0

        # ── Acceptance Check ──
        if self.opponent_role == "seller" and agent_offer >= self.opponent_value:
            self.history.append({"round": round_num, "action": "ACCEPT", "price": agent_offer})
            return "ACCEPT", agent_offer
        if self.opponent_role == "buyer" and agent_offer <= self.opponent_value:
            self.history.append({"round": round_num, "action": "ACCEPT", "price": agent_offer})
            return "ACCEPT", agent_offer

        # ── Patience-based concession acceleration ──
        if round_num > self.patience:
            self.concession_rate = min(0.4, self.concession_rate + 0.05)

        # ── Counter-offer calculation ──
        target = self.opponent_value
        delta = target - current_offer
        next_offer = current_offer + self.concession_rate * delta

        # Anchor effect — blend toward current offer
        next_offer = (1.0 - self.alpha) * next_offer + self.alpha * current_offer

        # Add noise
        next_offer += random.randint(-self.epsilon, self.epsilon)

        # ── VALUE-BASED CLAMPING (Tolerance Bug Fix) ──
        # Seller must not offer below their own value
        # Buyer must not offer above their own value
        next_offer_int = int(next_offer)
        if self.opponent_role == "seller":
            next_offer_int = max(next_offer_int, self.opponent_value)
        elif self.opponent_role == "buyer":
            next_offer_int = min(next_offer_int, self.opponent_value)

        # Absolute bounds
        next_offer_int = max(100, min(1000, next_offer_int))

        self.history.append({"round": round_num, "action": "OFFER", "price": next_offer_int})
        return "OFFER", next_offer_int


# ─────────────────────────────────────────────
# Main Environment Wrapper
# ─────────────────────────────────────────────

class EnvWrapper:
    """
    OpenEnv-compliant negotiation environment.
    Exposes: reset(), step(), state()
    """

    def __init__(self, opp_type: str = "fair", a_val: int = 800, o_val: int = 500,
                 agent_role: str = "buyer", max_rounds: int = 20):
        self.agent_value = a_val
        self.opponent_value = o_val
        self.role = agent_role
        self.opp_type = opp_type
        self.opp_role = "seller" if agent_role == "buyer" else "buyer"
        self.max_rounds = max_rounds
        self.opp = Opponent(opp_type, o_val, self.opp_role)

        # Episode tracking
        self.round = 0
        self.current_offer = 0
        self.last_opp_action = "START"
        self.last_opp_offer = 0
        self.history: List[Dict[str, Any]] = []
        self.cumulative_aggression_penalty = 0.0
        self.done = False

    def reset(self) -> Observation:
        """Reset environment and return initial observation."""
        self.round = 0
        self.done = False
        self.history = []
        self.cumulative_aggression_penalty = 0.0
        self.opp.reset_state()

        # Initial offer is shifted away from agent's value to force negotiation
        if self.role == "buyer":
            # Start high — agent (buyer) must negotiate DOWN
            self.current_offer = min(1000, self.agent_value + 200)
        else:
            # Start low — agent (seller) must negotiate UP
            self.current_offer = max(100, self.agent_value - 200)

        self.last_opp_action = "START"
        self.last_opp_offer = self.current_offer

        return self.state()

    def state(self) -> Observation:
        """Return current observable state."""
        return Observation(
            agent_value=self.agent_value,
            current_offer=self.current_offer,
            round=self.round,
            max_rounds=self.max_rounds,
            role=self.role,
            last_opponent_action=self.last_opp_action,
            last_opponent_offer=self.last_opp_offer,
            history=list(self.history),
        )

    def _compute_reward(self, deal_price: int) -> tuple:
        """
        Compute reward for a completed deal.
        Returns: (total_reward, breakdown_dict)
        """
        if self.role == "seller":
            profit = deal_price - self.agent_value
        else:
            profit = self.agent_value - deal_price

        time_factor = 1.0 - (self.round / self.max_rounds)
        base_reward = profit * time_factor

        # Penalty for bad deals (agent accepts a losing deal)
        bad_deal_penalty = -20.0 if profit < 0 else 0.0

        # Cumulative aggression penalty
        aggression = -self.cumulative_aggression_penalty

        total = base_reward + bad_deal_penalty + aggression

        breakdown = {
            "profit": float(profit),
            "time_factor": round(time_factor, 4),
            "base_reward": round(base_reward, 4),
            "bad_deal_penalty": bad_deal_penalty,
            "aggression_penalty": aggression,
            "total": round(total, 4),
        }
        return total, breakdown

    def _partial_progress_reward(self, action_str: str, action_price: int) -> tuple:
        """
        Provide a small shaping reward for intermediate steps.
        Rewards the agent for moving toward a deal (improving offers).
        """
        reward = 0.0
        breakdown = {}

        if action_str.startswith("OFFER") and len(self.history) >= 2:
            # Check if agent is making progress toward opponent
            prev_agent_offers = [h["agent_price"] for h in self.history[:-1]
                                 if h.get("agent_action", "").startswith("OFFER")]
            if prev_agent_offers:
                last_agent_offer = prev_agent_offers[-1]
                # Positive signal if agent moves toward a reasonable range
                if self.role == "buyer":
                    # Buyer should increase offers (toward seller's value)
                    improvement = action_price - last_agent_offer
                    reward = min(2.0, max(-1.0, improvement / 50.0))
                else:
                    # Seller should decrease offers (toward buyer's value)
                    improvement = last_agent_offer - action_price
                    reward = min(2.0, max(-1.0, improvement / 50.0))

            breakdown = {"progress_signal": round(reward, 4)}

        return reward, breakdown

    def step(self, action_str: str, action_price: int = 0):
        """
        Take one step in the environment.

        Args:
            action_str: "OFFER", "ACCEPT", or "REJECT"
            action_price: price for OFFER actions

        Returns:
            (observation: Observation, reward: float, done: bool, info: dict)
        """
        if self.done:
            return self.state(), 0.0, True, {"error": "Episode already ended"}

        self.round += 1
        reward = 0.0
        done = False
        info: Dict[str, Any] = {"error": None}
        breakdown: Dict[str, float] = {}

        # ── AGENT OFFER CLAMPING ──
        if action_str.startswith("OFFER"):
            action_price = max(100, min(1000, action_price))
            action_str = f"OFFER {action_price}"

            # ── CUMULATIVE AGGRESSION PENALTY ──
            if abs(action_price - self.opponent_value) > 150:
                self.cumulative_aggression_penalty += 2.0

        # Record this step in history
        step_record = {
            "round": self.round,
            "agent_action": action_str,
            "agent_price": action_price,
        }

        if action_str == "ACCEPT":
            deal_price = self.last_opp_offer
            reward, breakdown = self._compute_reward(deal_price)
            done = True
            info["deal_price"] = deal_price
            info["deal_type"] = "agent_accepted"

        elif action_str == "REJECT":
            reward = -50.0
            breakdown = {"rejection_penalty": -50.0}
            done = True
            info["deal_type"] = "agent_rejected"

        elif action_str.startswith("OFFER"):
            opp_action, opp_price = self.opp.get_response(
                self.round, self.current_offer, action_price, "OFFER"
            )

            if opp_action == "ACCEPT":
                deal_price = action_price
                reward, breakdown = self._compute_reward(deal_price)
                done = True
                self.last_opp_action = "ACCEPT"
                self.last_opp_offer = deal_price
                info["deal_price"] = deal_price
                info["deal_type"] = "opponent_accepted"
            else:
                # Opponent counters
                self.current_offer = opp_price
                self.last_opp_action = "OFFER"
                self.last_opp_offer = opp_price

                # Check max rounds
                if self.round >= self.max_rounds:
                    reward = -50.0
                    breakdown = {"timeout_penalty": -50.0}
                    done = True
                    info["deal_type"] = "timeout"
                else:
                    # Partial progress reward for intermediate steps
                    step_record["agent_price"] = action_price
                    self.history.append(step_record)
                    reward, breakdown = self._partial_progress_reward(action_str, action_price)
                    info["opponent_counter"] = opp_price

            step_record["opp_action"] = opp_action
            step_record["opp_price"] = opp_price

        # Record history for terminal steps too
        if done or action_str == "ACCEPT" or action_str == "REJECT":
            # Avoid double-append for non-OFFER terminal steps
            if step_record not in self.history:
                self.history.append(step_record)

        self.done = done
        info["reward_breakdown"] = breakdown

        return self.state(), reward, done, info


# ─────────────────────────────────────────────
# Convenience — max possible reward for scoring
# ─────────────────────────────────────────────

def get_max_possible_reward(agent_value: int, opponent_value: int) -> float:
    """
    Maximum reward possible if agent gets the best possible deal on round 1.
    """
    return float(abs(agent_value - opponent_value))

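As a quick sanity check of the scoring ceiling (a usage sketch, not part of the commit):

```python
from env_wrapper import get_max_possible_reward

# Matches the "Tight (120)" ZOPA listed for task_c_hard in the README
print(get_max_possible_reward(600, 480))  # 120.0
```
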
inference.py CHANGED
@@ -1,111 +1,199 @@

"""
Inference Script — OpenEnv Negotiation Environment
Runs LLM agent against all 3 tasks, produces structured logs.
Uses OpenAI-compatible client with HuggingFace router.
"""

import os
import re
import sys
from openai import OpenAI
from env_wrapper import EnvWrapper
from tasks import ALL_TASKS, get_grader


def parse_action(llm_text: str):
    """Parse LLM output into (action_str, action_price, error)."""
    match = re.search(r'(OFFER\s+\d+|ACCEPT|REJECT)', llm_text, re.IGNORECASE)
    if match:
        action = match.group(1).upper()
        if action.startswith("OFFER"):
            parts = action.split()
            try:
                price = int(parts[1])
                return f"OFFER {price}", price, None
            except (IndexError, ValueError):
                return "REJECT", 0, "invalid price in OFFER"
        return action, 0, None
    return None, 0, "no action match"


def run_task(client, model_name: str, task_config):
    """
    Run a single task: LLM negotiates against the environment.
    Returns the grader's result dict (score, success, steps, deal_made, ...).
    """
    env = EnvWrapper(
        opp_type=task_config.opp_type,
        a_val=task_config.agent_value,
        o_val=task_config.opponent_value,
        agent_role=task_config.agent_role,
        max_rounds=task_config.max_rounds,
    )
    obs = env.reset()

    print(f"[START] task={task_config.name} env=negotiation model={model_name}")

    done = False
    step_n = 0
    rewards = []
    deal_made = False
    history_for_prompt = []

    while not done and step_n < env.max_rounds:
        step_n += 1

        # ── Build prompt with history ──
        history_text = ""
        if history_for_prompt:
            history_lines = []
            for h in history_for_prompt[-5:]:  # Last 5 rounds for context
                history_lines.append(f"  Round {h['round']}: You → {h['agent']}, Opponent → {h['opp']}")
            history_text = "Negotiation history:\n" + "\n".join(history_lines) + "\n\n"

        target_goal = "buy for as low as possible (below your maximum value)" if obs.role == "buyer" else "sell for as high as possible (above your minimum value)"

        prompt = f"""You are negotiating as a {obs.role}. Your goal is to {target_goal} to maximize profit.

State:
* Your PRIVATE Valuation: {obs.agent_value} (DO NOT accept or offer a deal worse than this!)
* Current offer on the table: {obs.current_offer}
* Round: {step_n} of {obs.max_rounds}
* Opponent's last action: {obs.last_opponent_action}
* Opponent's last offer: {obs.last_opponent_offer}

{history_text}CRITICAL RULE: NEVER make an OFFER that is worse than your private valuation. For example, if you are a buyer with a valuation of 500, never offer >500.

Choose exactly ONE action:
* OFFER <price> — make a counter-offer (negotiate toward your private valuation)
* ACCEPT — accept the opponent's offer (ONLY if it is profitable compared to your valuation)
* REJECT — walk away (only if no deal is possible)

Respond with ONLY your chosen action, nothing else."""

        action_str = "REJECT"
        action_price = 0
        error_msg = "null"

        try:
            response = client.chat.completions.create(
                model=model_name,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=20,
                temperature=0.3,
            )
            llm_text = response.choices[0].message.content.strip()

            parsed_action, parsed_price, parse_err = parse_action(llm_text)

            if parsed_action:
                action_str = parsed_action
                action_price = parsed_price
            else:
                # Retry with stricter prompt
                error_msg = f"parse failed: {parse_err}, retrying"
                retry_response = client.chat.completions.create(
                    model=model_name,
                    messages=[
                        {"role": "user", "content": prompt},
                        {"role": "assistant", "content": llm_text},
                        {"role": "user", "content": "Output strictly ONLY ONE of: 'OFFER <price>', 'ACCEPT', or 'REJECT'. Nothing else."},
                    ],
                    max_tokens=15,
                    temperature=0.1,
                )
                llm_text2 = retry_response.choices[0].message.content.strip()
                parsed2, price2, err2 = parse_action(llm_text2)
                if parsed2:
                    action_str = parsed2
                    action_price = price2
                    error_msg = "null"
                else:
                    action_str = "REJECT"
                    action_price = 0
                    error_msg = "parse error on retry, defaulting to REJECT"

        except Exception as e:
            error_msg = f"API_Error: {str(e)[:50]}"
            action_str = "REJECT"
            action_price = 0

        # ── Step the environment ──
        obs, reward, done, info = env.step(action_str, action_price)
        rewards.append(reward)

        # Track deal
        if done and info.get("deal_type") in ("agent_accepted", "opponent_accepted"):
            deal_made = True

        # Track history for prompting
        history_for_prompt.append({
            "round": step_n,
            "agent": action_str,
            "opp": f"{obs.last_opponent_action} {obs.last_opponent_offer}" if obs.last_opponent_action == "OFFER" else obs.last_opponent_action,
        })

        # ── Log step ──
        log_action = action_str if not action_str.startswith("OFFER") else f"OFFER {action_price}"
        print(f"[STEP] step={step_n} action={log_action} reward={reward:.2f} done={str(done).lower()} error={error_msg}")

    # ── Score ──
    grader = get_grader(task_config)
    result = grader.grade(rewards, step_n, deal_made)

    rewards_str = ",".join([f"{r:.2f}" for r in rewards])
    print(f"[END] success={str(result['success']).lower()} steps={step_n} score={result['score']:.4f} rewards={rewards_str}")
    print()

    return result


def main():
    api_base_url = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
    model_name = os.getenv("MODEL_NAME", "meta-llama/Meta-Llama-3-8B-Instruct")
    hf_token = os.getenv("HF_TOKEN")

    if not hf_token:
        print("ERROR: HF_TOKEN environment variable is not set.")
        print("Set it with: $env:HF_TOKEN='your_token_here'")
        sys.exit(1)

    client = OpenAI(base_url=api_base_url, api_key=hf_token)

    print("=" * 60)
    print("NEGOTIATION ENVIRONMENT — OpenEnv Inference")
    print("=" * 60)
    print()

    all_results = []

    for task in ALL_TASKS:
        result = run_task(client, model_name, task)
        all_results.append(result)

    # ── Summary ──
    print("=" * 60)
    print("SUMMARY")
    print("=" * 60)
    for r in all_results:
        status = "PASS" if r["success"] else "FAIL"
        print(f"  [{status}] {r['task']} ({r['difficulty']}): score={r['score']:.4f} "
              f"steps={r['steps']} deal={r['deal_made']} threshold={r['threshold']}")

    avg_score = sum(r["score"] for r in all_results) / len(all_results)
    print(f"\n  Average Score: {avg_score:.4f}")
    print("=" * 60)


if __name__ == "__main__":
    main()

openenv.yaml ADDED
@@ -0,0 +1,94 @@
name: negotiation-env
version: "1.0.0"
description: >
  Strategic Negotiation Simulation Environment where an AI agent learns
  to negotiate under uncertainty with different opponent personalities.
  The agent must maximize profit through multi-round price negotiation
  while adapting to greedy, fair, or impatient opponents.

author: Team MEta_ai
license: Apache-2.0

environment:
  type: simulation
  domain: negotiation
  real_world_task: automated marketplace pricing and negotiation

observation_space:
  type: object
  fields:
    current_offer:
      type: integer
      description: Current price on the table
      range: [100, 1000]
    round:
      type: integer
      description: Current round number
      range: [0, 20]
    max_rounds:
      type: integer
      description: Maximum allowed rounds
    role:
      type: string
      enum: ["buyer", "seller"]
      description: Agent's role in the negotiation
    last_opponent_action:
      type: string
      enum: ["START", "OFFER", "ACCEPT"]
      description: Opponent's last action
    last_opponent_offer:
      type: integer
      description: Opponent's last offered price
      range: [100, 1000]
    history:
      type: array
      description: History of all actions this episode

action_space:
  type: object
  fields:
    action_type:
      type: string
      enum: ["OFFER", "ACCEPT", "REJECT"]
      description: Type of negotiation action
    price:
      type: integer
      description: Price for OFFER actions (ignored for ACCEPT/REJECT)
      range: [100, 1000]

reward:
  type: float
  range: [-50.0, 855.0]
  description: >
    Reward based on deal profit scaled by time factor.
    Partial progress signals during intermediate steps.
    Penalty for failed negotiations (-50), bad deals (-20),
    and aggressive offers (cumulative -2 per aggressive step).

tasks:
  - name: task_a_easy
    difficulty: easy
    description: Fair opponent, wide ZOPA, 20 rounds
    success_threshold: 0.2

  - name: task_b_medium
    difficulty: medium
    description: Greedy opponent, narrow ZOPA, 15 rounds
    success_threshold: 0.3

  - name: task_c_hard
    difficulty: hard
    description: Impatient opponent, tight margins, 6 rounds
    success_threshold: 0.4

inference:
  script: inference.py
  env_vars:
    - API_BASE_URL
    - MODEL_NAME
    - HF_TOKEN

deployment:
  dockerfile: Dockerfile
  platform: huggingface-spaces
  tag: openenv

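requirements.txt pins pyyaml, presumably for consuming this manifest; a minimal loading sketch (hypothetical usage — nothing in this commit reads the file directly):

```python
import yaml  # provided by the pyyaml requirement

with open("openenv.yaml") as f:
    spec = yaml.safe_load(f)

print(spec["name"], spec["version"])  # negotiation-env 1.0.0
for task in spec["tasks"]:
    print(task["name"], task["success_threshold"])
```
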
requirements.txt ADDED
@@ -0,0 +1,3 @@
openai>=1.0.0
pydantic>=2.0.0
pyyaml>=6.0

tasks.py ADDED
@@ -0,0 +1,126 @@
"""
Task Definitions & Graders for the Negotiation Environment.
3 tasks: Easy → Medium → Hard, each with a programmatic grader (0.0–1.0).
"""

from dataclasses import dataclass
from typing import List


@dataclass
class TaskConfig:
    """Configuration for a single evaluation task."""
    name: str
    description: str
    difficulty: str
    opp_type: str
    agent_value: int
    opponent_value: int
    agent_role: str
    max_rounds: int
    success_threshold: float  # score >= this means success


# ─────────────────────────────────────────────
# Task Definitions
# ─────────────────────────────────────────────

TASK_A = TaskConfig(
    name="task_a_easy",
    description="Easy negotiation: fair opponent, wide ZOPA, plenty of rounds",
    difficulty="easy",
    opp_type="fair",
    agent_value=800,
    opponent_value=400,
    agent_role="buyer",
    max_rounds=20,
    success_threshold=0.2,
)

TASK_B = TaskConfig(
    name="task_b_medium",
    description="Medium negotiation: greedy opponent, narrow ZOPA, fewer rounds",
    difficulty="medium",
    opp_type="greedy",
    agent_value=700,
    opponent_value=500,
    agent_role="buyer",
    max_rounds=15,
    success_threshold=0.3,
)

TASK_C = TaskConfig(
    name="task_c_hard",
    description="Hard negotiation: impatient opponent, tight margins, very few rounds",
    difficulty="hard",
    opp_type="impatient",
    agent_value=600,
    opponent_value=480,
    agent_role="buyer",
    max_rounds=6,
    success_threshold=0.4,
)

ALL_TASKS: List[TaskConfig] = [TASK_A, TASK_B, TASK_C]


# ─────────────────────────────────────────────
# Grader
# ─────────────────────────────────────────────

class Grader:
    """
    Programmatic grader for a negotiation task.
    Scores agent performance on a 0.0–1.0 scale.
    """

    def __init__(self, task: TaskConfig):
        self.task = task
        self.max_possible = float(abs(task.agent_value - task.opponent_value))

    def grade(self, rewards: List[float], steps: int, deal_made: bool) -> dict:
        """
        Grade an episode.

        Args:
            rewards: list of per-step rewards
            steps: number of steps taken
            deal_made: whether a deal was successfully completed

        Returns:
            dict with score, success, and breakdown
        """
        total_reward = sum(rewards)

        # Score normalization: total_reward / max_possible, clamped to [0, 1]
        if self.max_possible > 0:
            raw_score = total_reward / self.max_possible
        else:
            raw_score = 0.0

        score = max(0.0, min(1.0, raw_score))
        success = score >= self.task.success_threshold

        # ── Detailed breakdown ──
        efficiency = 0.0
        if deal_made and steps > 0:
            # Bonus for fewer steps — max 1.0 if done in 1 step
            efficiency = max(0.0, 1.0 - (steps / self.task.max_rounds))

        return {
            "task": self.task.name,
            "difficulty": self.task.difficulty,
            "score": round(score, 4),
            "success": success,
            "threshold": self.task.success_threshold,
            "total_reward": round(total_reward, 4),
            "max_possible": self.max_possible,
            "steps": steps,
            "deal_made": deal_made,
            "efficiency": round(efficiency, 4),
        }


def get_grader(task: TaskConfig) -> Grader:
    """Create a grader for the given task."""
    return Grader(task)

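Adding a fourth task is just another TaskConfig instance — a hypothetical sketch (TASK_D is not part of this commit):

```python
from tasks import TaskConfig, Grader

# Hypothetical seller-side task against a greedy buyer (not in the commit)
TASK_D = TaskConfig(
    name="task_d_seller",
    description="Seller-side negotiation against a greedy buyer",
    difficulty="medium",
    opp_type="greedy",
    agent_value=450,
    opponent_value=650,
    agent_role="seller",
    max_rounds=12,
    success_threshold=0.3,
)
print(Grader(TASK_D).max_possible)  # 200.0
```
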
test_env.py ADDED
@@ -0,0 +1,107 @@
"""Quick validation test for the environment — no API keys needed."""
import random
random.seed(42)

from env_wrapper import EnvWrapper, Observation
from tasks import ALL_TASKS, get_grader

print("=" * 50)
print("TEST 1: Multi-round negotiation")
print("=" * 50)
env = EnvWrapper(opp_type="fair", a_val=800, o_val=400, agent_role="buyer", max_rounds=10)
obs = env.reset()
print(f"Initial offer: {obs.current_offer}")

offers = [650, 600, 550, 500, 480, 450, 420, 400, 400, 400]
for i, price in enumerate(offers):
    obs, r, d, info = env.step("OFFER", price)
    opp_info = f"{obs.last_opponent_action} {obs.last_opponent_offer}"
    print(f"  R{i+1}: OFFER {price} -> Opp {opp_info} | reward={r:.2f} done={d}")
    if d:
        deal_type = info.get("deal_type", "none")
        deal_price = info.get("deal_price", "N/A")
        print(f"  >>> Deal: {deal_type}, price={deal_price}")
        break

print(f"  History entries: {len(obs.history)}")
print()

print("=" * 50)
print("TEST 2: Value-based clamping")
print("=" * 50)
from env_wrapper import Opponent
bugs = 0
for trial in range(100):
    opp = Opponent("fair", 500, "seller")
    for rnd in range(20):
        action, price = opp.get_response(rnd, 300, 250, "OFFER")
        if action == "OFFER" and price < 500:
            bugs += 1
            print(f"  BUG: trial={trial} round={rnd} seller offered {price} < 500")
            break
if bugs == 0:
    print("  PASS: Seller never offered below own value (100 trials x 20 rounds)")
else:
    print(f"  FAIL: {bugs} violations found")
print()

print("=" * 50)
print("TEST 3: Cumulative aggression penalty")
print("=" * 50)
env2 = EnvWrapper(opp_type="greedy", a_val=800, o_val=500, agent_role="buyer", max_rounds=20)
env2.reset()
# Make multiple aggressive offers (>150 away from opp_val=500, so <350 or >650)
for i in range(5):
    obs, r, d, info = env2.step("OFFER", 200)  # 300 away from 500 → aggressive
    print(f"  R{i+1}: penalty_so_far={env2.cumulative_aggression_penalty}")
    if d:
        break

expected_penalty = 10.0  # 5 rounds x 2.0 per round
actual_penalty = env2.cumulative_aggression_penalty
print(f"  Expected cumulative penalty: {expected_penalty}, Actual: {actual_penalty}")
print(f"  {'PASS' if actual_penalty == expected_penalty else 'FAIL'}")
print()

print("=" * 50)
print("TEST 4: Task configs and graders")
print("=" * 50)
for task in ALL_TASKS:
    grader = get_grader(task)
    # Test with sample rewards
    result = grader.grade([0.0, 0.0, 50.0], 3, True)
    print(f"  {task.name} ({task.difficulty}): score={result['score']}, success={result['success']}")
print()

print("=" * 50)
print("TEST 5: state() method")
print("=" * 50)
env3 = EnvWrapper(opp_type="fair", a_val=800, o_val=400, agent_role="buyer")
env3.reset()
s = env3.state()
assert isinstance(s, Observation), "state() must return Observation"
assert s.role == "buyer"
assert s.round == 0
print(f"  PASS: state() returns Observation with role={s.role}, round={s.round}")
print()

print("=" * 50)
print("TEST 6: ACCEPT and REJECT")
print("=" * 50)
# ACCEPT test
env4 = EnvWrapper(opp_type="fair", a_val=800, o_val=400, agent_role="buyer")
env4.reset()
# The seller (value 400) accepts 500 outright, ending the episode, so the
# ACCEPT below exercises the done-guard in step() (reward 0.0, deal_type None)
env4.step("OFFER", 500)
obs, r, d, info = env4.step("ACCEPT", 0)
print(f"  ACCEPT: reward={r:.2f} done={d} deal_type={info.get('deal_type')}")

# REJECT test
env5 = EnvWrapper(opp_type="fair", a_val=800, o_val=400, agent_role="buyer")
env5.reset()
obs, r, d, info = env5.step("REJECT", 0)
print(f"  REJECT: reward={r:.2f} done={d} deal_type={info.get('deal_type')}")
print()

print("=" * 50)
print("ALL TESTS COMPLETE")
print("=" * 50)

test_output.txt ADDED
@@ -0,0 +1,45 @@
==================================================
TEST 1: Multi-round negotiation
==================================================
Initial offer: 1000
  R1: OFFER 650 -> Opp ACCEPT 650 | reward=133.00 done=True
  >>> Deal: opponent_accepted, price=650
  History entries: 1

==================================================
TEST 2: Value-based clamping
==================================================
  PASS: Seller never offered below own value (100 trials x 20 rounds)

==================================================
TEST 3: Cumulative aggression penalty
==================================================
  R1: penalty_so_far=2.0
  R2: penalty_so_far=4.0
  R3: penalty_so_far=6.0
  R4: penalty_so_far=8.0
  R5: penalty_so_far=10.0
  Expected cumulative penalty: 10.0, Actual: 10.0
  PASS

==================================================
TEST 4: Task configs and graders
==================================================
  task_a_easy (easy): score=0.125, success=False
  task_b_medium (medium): score=0.25, success=False
  task_c_hard (hard): score=0.4167, success=True

==================================================
TEST 5: state() method
==================================================
  PASS: state() returns Observation with role=buyer, round=0

==================================================
TEST 6: ACCEPT and REJECT
==================================================
  ACCEPT: reward=0.00 done=True deal_type=None
  REJECT: reward=-50.00 done=True deal_type=agent_rejected

==================================================
ALL TESTS COMPLETE
==================================================