Aviral Bhargava committed
Commit a72a5e3 · 1 Parent(s): 610b7e5

feat: complete OpenEnv compliance, tasks, and logic fixes
Files changed (11)
  1. .gitignore +29 -0
  2. Dockerfile +26 -0
  3. README.md +115 -1
  4. __pycache__/env_wrapper.cpython-312.pyc +0 -0
  5. env_wrapper.py +291 -43
  6. inference.py +161 -73
  7. openenv.yaml +94 -0
  8. requirements.txt +3 -0
  9. tasks.py +126 -0
  10. test_env.py +107 -0
  11. test_output.txt +45 -0
.gitignore ADDED
@@ -0,0 +1,29 @@
+ # Python
+ __pycache__/
+ *.py[cod]
+ *.pyo
+ *.egg-info/
+ dist/
+ build/
+ *.egg
+
+ # Environment
+ .env
+ .venv/
+ venv/
+
+ # IDE
+ .vscode/
+ .idea/
+ *.swp
+ *.swo
+
+ # OS
+ .DS_Store
+ Thumbs.db
+
+ # Build artifacts
+ *.exe
+ *.o
+ *.out
+ test_sim.exe
Dockerfile ADDED
@@ -0,0 +1,26 @@
+ # ── Negotiation Environment — OpenEnv Dockerfile ──
+ # Person 3: Complete this Dockerfile for HuggingFace Spaces deployment
+ # Requirements: Python 3.11+, pip dependencies, inference.py entrypoint
+ # Constraints: CPU only, vcpu=2, memory=8gb, runtime < 20min
+
+ FROM python:3.11-slim
+
+ WORKDIR /app
+
+ # Install dependencies
+ COPY requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Copy project files
+ COPY env_wrapper.py .
+ COPY tasks.py .
+ COPY inference.py .
+ COPY openenv.yaml .
+
+ # Environment variables (set at runtime, NOT hardcoded)
+ # API_BASE_URL — the API endpoint for the LLM
+ # MODEL_NAME — the model identifier to use for inference
+ # HF_TOKEN — your HuggingFace API token
+
+ # Entrypoint
+ CMD ["python", "inference.py"]
README.md CHANGED
@@ -1 +1,115 @@
- # MEta_ai
+ # 🤝 Strategic Negotiation Environment — OpenEnv
+
+ A simulation environment where an AI agent learns to negotiate under uncertainty, compliant with the [Meta OpenEnv specification](https://github.com/meta-llama/open-env).
+
+ ## 🧠 Overview
+
+ This environment simulates **real-world price negotiation** — a task humans do daily in marketplaces, business deals, and automated pricing systems. The agent must:
+
+ - **Maximize profit** by negotiating favorable deals
+ - **Adapt to opponent behavior** (greedy, fair, or impatient personalities)
+ - **Make multi-step strategic decisions** under partial observability
+
+ The agent cannot see the opponent's true valuation or strategy — it must infer patterns and adjust.
+
+ ---
+
+ ## 🎮 Action Space
+
+ | Action | Description |
+ |---|---|
+ | `OFFER <price>` | Make a counter-offer at the given price (100–1000) |
+ | `ACCEPT` | Accept the current offer on the table |
+ | `REJECT` | Walk away from the negotiation (terminal, -50 penalty) |
+
+ ## 👁️ Observation Space
+
+ | Field | Type | Description |
+ |---|---|---|
+ | `current_offer` | int | Current price on the table |
+ | `round` | int | Current round number |
+ | `max_rounds` | int | Maximum allowed rounds |
+ | `role` | string | Agent's role: "buyer" or "seller" |
+ | `last_opponent_action` | string | "START", "OFFER", or "ACCEPT" |
+ | `last_opponent_offer` | int | Opponent's last offered price |
+ | `history` | list | History of all actions this episode |
+
+ ## 💰 Reward Function
+
+ | Event | Reward |
+ |---|---|
+ | Successful deal | `profit × (1 - round/max_rounds)` |
+ | Bad deal (profit < 0) | Additional -20 penalty |
+ | Rejection / Timeout | -50 |
+ | Aggressive offers | Cumulative -2 per aggressive step |
+ | Progress toward deal | Small shaping signal (±2 max) |
+
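+ **Worked example** (illustrative, matching TEST 1 in test_output.txt): a buyer with valuation 800 who closes at 650 on round 1 of 10, having made one aggressive offer:
+
+ ```python
+ profit = 800 - 650            # 150
+ reward = profit * (1 - 1/10)  # 135.0
+ reward -= 2                   # one aggressive offer -> 133.0
+ ```
+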
+ ---
+
+ ## 📋 Tasks
+
+ | Task | Difficulty | Opponent | ZOPA | Rounds | Threshold |
+ |---|---|---|---|---|---|
+ | `task_a_easy` | Easy | Fair | Wide (400) | 20 | 0.2 |
+ | `task_b_medium` | Medium | Greedy | Narrow (200) | 15 | 0.3 |
+ | `task_c_hard` | Hard | Impatient | Tight (120) | 6 | 0.4 |
+
+ ZOPA is the zone of possible agreement: the width of the overlap between the agent's and the opponent's valuations.
+
+ ---
+
+ ## 🚀 Setup & Usage
+
+ ### Prerequisites
+ - Python 3.11+
+ - HuggingFace API token
+
+ ### Install
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ ### Configure Environment Variables
+ ```bash
+ export API_BASE_URL="https://router.huggingface.co/v1"
+ export MODEL_NAME="meta-llama/Meta-Llama-3-8B-Instruct"
+ export HF_TOKEN="your_token_here"
+ ```
+
+ ### Run Inference
+ ```bash
+ python inference.py
+ ```
+
+ ### Docker
+ ```bash
+ docker build -t negotiation-env .
+ docker run -e HF_TOKEN=your_token -e API_BASE_URL=https://router.huggingface.co/v1 -e MODEL_NAME=meta-llama/Meta-Llama-3-8B-Instruct negotiation-env
+ ```
+
+ ---
+
+ ## 📊 Baseline Scores
+
+ <!-- Person 3: Fill in baseline scores after running inference -->
+ | Task | Score | Steps | Deal Made |
+ |---|---|---|---|
+ | task_a_easy | _TBD_ | _TBD_ | _TBD_ |
+ | task_b_medium | _TBD_ | _TBD_ | _TBD_ |
+ | task_c_hard | _TBD_ | _TBD_ | _TBD_ |
+
+ ---
+
+ ## 🏗️ Architecture
+
+ ```
+ LLM (HuggingFace via OpenAI Client)
+         ↓
+ inference.py (control loop + logging)
+         ↓
+ env_wrapper.py (OpenEnv-compatible environment)
+         ↓
+ tasks.py (task configs + graders)
+ ```
+
+ ## 📄 License
+
+ Apache 2.0
__pycache__/env_wrapper.cpython-312.pyc CHANGED
Binary files a/__pycache__/env_wrapper.cpython-312.pyc and b/__pycache__/env_wrapper.cpython-312.pyc differ
 
env_wrapper.py CHANGED
@@ -1,100 +1,348 @@
  import random

  class Opponent:
-     def __init__(self, type_str, value, role):
          self.type = type_str
          self.opponent_value = value
          self.opponent_role = role
-         if type_str == "greedy":
-             self.r, self.alpha, self.patience, self.epsilon = 0.05, 0.7, 10, 5
-         elif type_str == "fair":
-             self.r, self.alpha, self.patience, self.epsilon = 0.15, 0.4, 7, 10
-         elif type_str == "impatient":
-             self.r, self.alpha, self.patience, self.epsilon = 0.25, 0.2, 3, 15
-         else:
-             self.r, self.alpha, self.patience, self.epsilon = 0.15, 0.4, 7, 10
          self.concession_rate = self.r

-     def get_response(self, round_num, current_offer, agent_offer, agent_action_type):
          if agent_action_type != "OFFER":
              return "REJECT", 0

-         if self.opponent_role == "seller" and agent_offer >= self.opponent_value:
              return "ACCEPT", agent_offer
-         if self.opponent_role == "buyer" and agent_offer <= self.opponent_value:
              return "ACCEPT", agent_offer

          if round_num > self.patience:
              self.concession_rate = min(0.4, self.concession_rate + 0.05)

          target = self.opponent_value
          delta = target - current_offer
          next_offer = current_offer + self.concession_rate * delta
          next_offer = (1.0 - self.alpha) * next_offer + self.alpha * current_offer
          next_offer += random.randint(-self.epsilon, self.epsilon)
-         next_offer = max(100, min(1000, int(next_offer)))
-         return "OFFER", next_offer

  class EnvWrapper:
-     def __init__(self, opp_type="fair", a_val=800, o_val=500, agent_role="buyer"):
          self.agent_value = a_val
          self.opponent_value = o_val
          self.role = agent_role
          self.opp_role = "seller" if agent_role == "buyer" else "buyer"
          self.opp = Opponent(opp_type, o_val, self.opp_role)
-         self.max_rounds = 20
-         self.reset()
-
-     def reset(self):
          self.round = 0
          if self.role == "buyer":
-             self.current_offer = self.agent_value + 200
          else:
              self.current_offer = max(100, self.agent_value - 200)
          self.last_opp_action = "START"
          self.last_opp_offer = self.current_offer

-     def step(self, action_str, action_price=0):
          self.round += 1
-         aggressive = False
-         done = False
          reward = 0.0
-
          if action_str == "ACCEPT":
              deal_price = self.last_opp_offer
              done = True
-             profit = deal_price - self.agent_value if self.role == "seller" else self.agent_value - deal_price
-             t_factor = 1.0 - (self.round / self.max_rounds)
-             reward = profit * t_factor
-             if profit < 0: reward -= 20
-
          elif action_str == "REJECT":
              reward = -50.0
              done = True
-
          elif action_str.startswith("OFFER"):
-             aggressive = abs(action_price - self.opponent_value) > 150
-             opp_action, opp_price = self.opp.get_response(self.round, self.current_offer, action_price, "OFFER")
              if opp_action == "ACCEPT":
                  deal_price = action_price
                  done = True
                  self.last_opp_action = "ACCEPT"
                  self.last_opp_offer = deal_price
-
-                 profit = deal_price - self.agent_value if self.role == "seller" else self.agent_value - deal_price
-                 t_factor = 1.0 - (self.round / self.max_rounds)
-                 reward = profit * t_factor
-                 if profit < 0: reward -= 20
-                 if aggressive: reward -= 2
              else:
                  self.current_offer = opp_price
                  self.last_opp_action = "OFFER"
                  self.last_opp_offer = opp_price
          if self.round >= self.max_rounds:
              reward = -50.0
              done = True
-
-         if not done:
-             reward = 0.0
-
-         return reward, done
+ """
+ Negotiation Environment Wrapper — OpenEnv Compliant
+ Implements: reset(), step(), state()
+ Typed models via Pydantic for Observation, Action, Reward
+ """
+
  import random
+ from typing import Optional, List, Dict, Any
+ from pydantic import BaseModel, Field
+
+
+ # ─────────────────────────────────────────────
+ # OpenEnv Typed Models
+ # ─────────────────────────────────────────────
+
+ class Observation(BaseModel):
+     """Observable state visible to the agent."""
+     agent_value: int = Field(description="The agent's private valuation/target value for the deal")
+     current_offer: int = Field(description="Current price on the table")
+     round: int = Field(description="Current round number (0-indexed before first step)")
+     max_rounds: int = Field(description="Maximum allowed rounds")
+     role: str = Field(description="Agent role: 'buyer' or 'seller'")
+     last_opponent_action: str = Field(description="Opponent's last action: 'START', 'OFFER', 'ACCEPT'")
+     last_opponent_offer: int = Field(description="Opponent's last offered price")
+     history: List[Dict[str, Any]] = Field(default_factory=list, description="History of all actions this episode")
+
+
+ class ActionModel(BaseModel):
+     """Action the agent can take."""
+     action_type: str = Field(description="One of: 'OFFER', 'ACCEPT', 'REJECT'")
+     price: int = Field(default=0, description="Price for OFFER actions, ignored for ACCEPT/REJECT")
+
+
+ class RewardInfo(BaseModel):
+     """Reward information returned by step()."""
+     reward: float = Field(description="Numeric reward for this step")
+     breakdown: Dict[str, float] = Field(default_factory=dict, description="Reward component breakdown")
+
+
+ # ─────────────────────────────────────────────
+ # Opponent Strategy
+ # ─────────────────────────────────────────────

  class Opponent:
+     """
+     Simulates opponent negotiation behavior.
+     Three personalities: greedy, fair, impatient.
+     Each has different concession rates, anchor effects, patience, and noise.
+     """
+
+     PROFILES = {
+         "greedy": {"r": 0.05, "alpha": 0.7, "patience": 10, "epsilon": 5},
+         "fair": {"r": 0.15, "alpha": 0.4, "patience": 7, "epsilon": 10},
+         "impatient": {"r": 0.25, "alpha": 0.2, "patience": 3, "epsilon": 15},
+     }
+
+     def __init__(self, type_str: str, value: int, role: str):
          self.type = type_str
          self.opponent_value = value
          self.opponent_role = role
+         self.history: List[Dict[str, Any]] = []
+
+         profile = self.PROFILES.get(type_str, self.PROFILES["fair"])
+         self.r = profile["r"]
+         self.alpha = profile["alpha"]
+         self.patience = profile["patience"]
+         self.epsilon = profile["epsilon"]
+         self.concession_rate = self.r
+
+     def reset_state(self):
+         """Reset concession rate and history for new episode."""
          self.concession_rate = self.r
+         self.history = []

+     def get_response(self, round_num: int, current_offer: int, agent_offer: int, agent_action_type: str):
+         """
+         Generate opponent response to agent's action.
+         Returns: (action_type: str, price: int)
+         """
          if agent_action_type != "OFFER":
              return "REJECT", 0

+         # ── Acceptance Check ──
+         if self.opponent_role == "seller" and agent_offer >= self.opponent_value:
+             self.history.append({"round": round_num, "action": "ACCEPT", "price": agent_offer})
              return "ACCEPT", agent_offer
+         if self.opponent_role == "buyer" and agent_offer <= self.opponent_value:
+             self.history.append({"round": round_num, "action": "ACCEPT", "price": agent_offer})
              return "ACCEPT", agent_offer

+         # ── Patience-based concession acceleration ──
          if round_num > self.patience:
              self.concession_rate = min(0.4, self.concession_rate + 0.05)

+         # ── Counter-offer calculation ──
          target = self.opponent_value
          delta = target - current_offer
          next_offer = current_offer + self.concession_rate * delta
+
+         # Anchor effect — blend toward current offer
          next_offer = (1.0 - self.alpha) * next_offer + self.alpha * current_offer
+
+         # Add noise
          next_offer += random.randint(-self.epsilon, self.epsilon)
+
+         # ── VALUE-BASED CLAMPING (Tolerance Bug Fix) ──
+         # Seller must not offer below their own value;
+         # buyer must not offer above their own value.
+         next_offer_int = int(next_offer)
+         if self.opponent_role == "seller":
+             next_offer_int = max(next_offer_int, self.opponent_value)
+         elif self.opponent_role == "buyer":
+             next_offer_int = min(next_offer_int, self.opponent_value)
+
+         # Absolute bounds
+         next_offer_int = max(100, min(1000, next_offer_int))
+
+         self.history.append({"round": round_num, "action": "OFFER", "price": next_offer_int})
+         return "OFFER", next_offer_int
+
+
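+ # Worked example of the counter-offer math above (illustrative): fair profile
+ # (r=0.15, alpha=0.4), opponent is a seller with value 500, current offer 300:
+ #   raw   = 300 + 0.15 * (500 - 300) = 330
+ #   blend = 0.6 * 330 + 0.4 * 300   = 318
+ # plus noise in [-10, 10], then clamped up to 500 by the seller rule,
+ # so early counter-offers from this seller sit at the value floor.
+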
+ # ─────────────────────────────────────────────
+ # Main Environment Wrapper
+ # ─────────────────────────────────────────────

  class EnvWrapper:
+     """
+     OpenEnv-compliant negotiation environment.
+     Exposes: reset(), step(), state()
+     """
+
+     def __init__(self, opp_type: str = "fair", a_val: int = 800, o_val: int = 500,
+                  agent_role: str = "buyer", max_rounds: int = 20):
          self.agent_value = a_val
          self.opponent_value = o_val
          self.role = agent_role
+         self.opp_type = opp_type
          self.opp_role = "seller" if agent_role == "buyer" else "buyer"
+         self.max_rounds = max_rounds
          self.opp = Opponent(opp_type, o_val, self.opp_role)
+
+         # Episode tracking
+         self.round = 0
+         self.current_offer = 0
+         self.last_opp_action = "START"
+         self.last_opp_offer = 0
+         self.history: List[Dict[str, Any]] = []
+         self.cumulative_aggression_penalty = 0.0
+         self.done = False
+
+     def reset(self) -> Observation:
+         """Reset environment and return initial observation."""
          self.round = 0
+         self.done = False
+         self.history = []
+         self.cumulative_aggression_penalty = 0.0
+         self.opp.reset_state()
+
+         # Initial offer is shifted away from agent's value to force negotiation
          if self.role == "buyer":
+             # Start high — agent (buyer) must negotiate DOWN
+             self.current_offer = min(1000, self.agent_value + 200)
          else:
+             # Start low — agent (seller) must negotiate UP
              self.current_offer = max(100, self.agent_value - 200)
+
          self.last_opp_action = "START"
          self.last_opp_offer = self.current_offer

+         return self.state()
+
+     def state(self) -> Observation:
+         """Return current observable state."""
+         return Observation(
+             agent_value=self.agent_value,
+             current_offer=self.current_offer,
+             round=self.round,
+             max_rounds=self.max_rounds,
+             role=self.role,
+             last_opponent_action=self.last_opp_action,
+             last_opponent_offer=self.last_opp_offer,
+             history=list(self.history),
+         )
+
+     def _compute_reward(self, deal_price: int) -> tuple:
+         """
+         Compute reward for a completed deal.
+         Returns: (total_reward, breakdown_dict)
+         """
+         if self.role == "seller":
+             profit = deal_price - self.agent_value
+         else:
+             profit = self.agent_value - deal_price
+
+         time_factor = 1.0 - (self.round / self.max_rounds)
+         base_reward = profit * time_factor
+
+         # Penalty for bad deals (agent accepts a losing deal)
+         bad_deal_penalty = -20.0 if profit < 0 else 0.0
+
+         # Cumulative aggression penalty
+         aggression = -self.cumulative_aggression_penalty
+
+         total = base_reward + bad_deal_penalty + aggression
+
+         breakdown = {
+             "profit": float(profit),
+             "time_factor": round(time_factor, 4),
+             "base_reward": round(base_reward, 4),
+             "bad_deal_penalty": bad_deal_penalty,
+             "aggression_penalty": aggression,
+             "total": round(total, 4),
+         }
+         return total, breakdown
+
+     def _partial_progress_reward(self, action_str: str, action_price: int) -> tuple:
+         """
+         Provide a small shaping reward for intermediate steps.
+         Rewards the agent for moving toward a deal (improving offers).
+         """
+         reward = 0.0
+         breakdown = {}
+
+         if action_str.startswith("OFFER") and len(self.history) >= 2:
+             # Check if agent is making progress toward opponent
+             prev_agent_offers = [h["agent_price"] for h in self.history[:-1]
+                                  if h.get("agent_action", "").startswith("OFFER")]
+             if prev_agent_offers:
+                 last_agent_offer = prev_agent_offers[-1]
+                 # Positive signal if agent moves toward a reasonable range
+                 if self.role == "buyer":
+                     # Buyer should increase offers (toward seller's value)
+                     improvement = action_price - last_agent_offer
+                     reward = min(2.0, max(-1.0, improvement / 50.0))
+                 else:
+                     # Seller should decrease offers (toward buyer's value)
+                     improvement = last_agent_offer - action_price
+                     reward = min(2.0, max(-1.0, improvement / 50.0))
+
+                 breakdown = {"progress_signal": round(reward, 4)}
+
+         return reward, breakdown
+
+ def step(self, action_str: str, action_price: int = 0):
245
+ """
246
+ Take one step in the environment.
247
+
248
+ Args:
249
+ action_str: "OFFER", "ACCEPT", or "REJECT"
250
+ action_price: price for OFFER actions
251
+
252
+ Returns:
253
+ (observation: Observation, reward: float, done: bool, info: dict)
254
+ """
255
+ if self.done:
256
+ return self.state(), 0.0, True, {"error": "Episode already ended"}
257
+
258
  self.round += 1
 
 
259
  reward = 0.0
260
+ done = False
261
+ info: Dict[str, Any] = {"error": None}
262
+ breakdown: Dict[str, float] = {}
263
+
264
+ # ── AGENT OFFER CLAMPING ──
265
+ if action_str.startswith("OFFER"):
266
+ action_price = max(100, min(1000, action_price))
267
+ action_str = f"OFFER {action_price}"
268
+
269
+ # ── CUMULATIVE AGGRESSION PENALTY ──
270
+ if abs(action_price - self.opponent_value) > 150:
271
+ self.cumulative_aggression_penalty += 2.0
272
+
273
+ # Record this step in history
274
+ step_record = {
275
+ "round": self.round,
276
+ "agent_action": action_str,
277
+ "agent_price": action_price,
278
+ }
279
+
280
  if action_str == "ACCEPT":
281
  deal_price = self.last_opp_offer
282
+ reward, breakdown = self._compute_reward(deal_price)
283
  done = True
284
+ info["deal_price"] = deal_price
285
+ info["deal_type"] = "agent_accepted"
286
+
 
 
287
  elif action_str == "REJECT":
288
  reward = -50.0
289
+ breakdown = {"rejection_penalty": -50.0}
290
  done = True
291
+ info["deal_type"] = "agent_rejected"
292
+
293
  elif action_str.startswith("OFFER"):
294
+ opp_action, opp_price = self.opp.get_response(
295
+ self.round, self.current_offer, action_price, "OFFER"
296
+ )
297
+
298
  if opp_action == "ACCEPT":
299
  deal_price = action_price
300
+ reward, breakdown = self._compute_reward(deal_price)
301
  done = True
302
  self.last_opp_action = "ACCEPT"
303
  self.last_opp_offer = deal_price
304
+ info["deal_price"] = deal_price
305
+ info["deal_type"] = "opponent_accepted"
 
 
 
 
306
  else:
307
+ # Opponent counters
308
  self.current_offer = opp_price
309
  self.last_opp_action = "OFFER"
310
  self.last_opp_offer = opp_price
311
+
312
+ # Check max rounds
313
  if self.round >= self.max_rounds:
314
  reward = -50.0
315
+ breakdown = {"timeout_penalty": -50.0}
316
  done = True
317
+ info["deal_type"] = "timeout"
318
+ else:
319
+ # Partial progress reward for intermediate steps
320
+ step_record["agent_price"] = action_price
321
+ self.history.append(step_record)
322
+ reward, breakdown = self._partial_progress_reward(action_str, action_price)
323
+ info["opponent_counter"] = opp_price
324
+
325
+ step_record["opp_action"] = opp_action
326
+ step_record["opp_price"] = opp_price
327
+
328
+ # Record history for terminal steps too
329
+ if done or action_str == "ACCEPT" or action_str == "REJECT":
330
+ # Avoid double-append for non-OFFER terminal steps
331
+ if step_record not in self.history:
332
+ self.history.append(step_record)
333
+
334
+ self.done = done
335
+ info["reward_breakdown"] = breakdown
336
+
337
+ return self.state(), reward, done, info
338
+
339
+
340
+ # ─────────────────────────────────────────────
341
+ # Convenience — max possible reward for scoring
342
+ # ─────────────────────────────────────────────
343
+
344
+ def get_max_possible_reward(agent_value: int, opponent_value: int) -> float:
345
+ """
346
+ Maximum reward possible if agent gets the best possible deal on round 1.
347
+ """
348
+ return float(abs(agent_value - opponent_value))
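+
+
+ if __name__ == "__main__":
+     # Minimal smoke run (an illustrative sketch, not required by OpenEnv;
+     # test_env.py exercises the wrapper more fully): a buyer worth 800
+     # bids up from 300 against a fair seller worth 400.
+     env = EnvWrapper(opp_type="fair", a_val=800, o_val=400,
+                      agent_role="buyer", max_rounds=10)
+     obs = env.reset()
+     done, offer = False, 300
+     while not done:
+         obs, reward, done, info = env.step(f"OFFER {offer}", offer)
+         print(f"round={obs.round} reward={reward:.2f} deal={info.get('deal_type')}")
+         offer += 25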
inference.py CHANGED
@@ -1,111 +1,199 @@
  import os
  import re
  import sys
  from openai import OpenAI
  from env_wrapper import EnvWrapper

- def main():
-     api_base_url = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
-     model_name = os.getenv("MODEL_NAME", "meta-llama/Meta-Llama-3-8B-Instruct")
-     hf_token = os.getenv("HF_TOKEN")
-
-     if not hf_token:
-         print("ERROR: HF_TOKEN environment variable is not set.")
-         sys.exit(1)
-
-     env = EnvWrapper(opp_type="fair", a_val=300, o_val=700, agent_role="buyer")
-     env.max_rounds = 4
-     env.reset()
-
-     print(f"[START] task=negotiation env=custom model={model_name}")
-
-     client = OpenAI(base_url=api_base_url, api_key=hf_token)
-
      done = False
      step_n = 0
      rewards = []
-
      while not done and step_n < env.max_rounds:
          step_n += 1
-
-         prompt = f"""You are negotiating as a {env.role}.
  State:
- * Current offer: {env.current_offer}
- * Round: {env.round}
- * Max rounds: {env.max_rounds}
-
- Choose ONE:
- * OFFER <price> (Preferred: counter-offer if you do not like the price!)
- * ACCEPT
- * REJECT"""
-
          action_str = "REJECT"
          action_price = 0
          error_msg = "null"
-
          try:
              response = client.chat.completions.create(
                  model=model_name,
                  messages=[{"role": "user", "content": prompt}],
                  max_tokens=20,
-                 temperature=0.3
              )
              llm_text = response.choices[0].message.content.strip()
-
-             match = re.search(r'(OFFER\s+\d+|ACCEPT|REJECT)', llm_text, re.IGNORECASE)
-             if match:
-                 action_str = match.group(1).upper()
              else:
-                 error_msg = "parsing failed, retrying"
-                 response = client.chat.completions.create(
                      model=model_name,
-                     messages=[{"role": "user", "content": prompt}, {"role": "assistant", "content": llm_text}, {"role": "user", "content": "Output strictly ONLY ONE of: 'OFFER <price>', 'ACCEPT', or 'REJECT'."}],
                      max_tokens=15,
-                     temperature=0.1
                  )
-                 llm_text2 = response.choices[0].message.content.strip()
-                 match2 = re.search(r'(OFFER\s+\d+|ACCEPT|REJECT)', llm_text2, re.IGNORECASE)
-                 if match2:
-                     action_str = match2.group(1).upper()
                      error_msg = "null"
                  else:
                      action_str = "REJECT"
                      error_msg = "parse error on retry, defaulting to REJECT"
          except Exception as e:
-             error_msg = "API_Error"
              action_str = "REJECT"
-
-         if action_str.startswith("OFFER"):
-             try:
-                 action_price = int(action_str.split()[1])
-             except ValueError:
-                 action_str = "REJECT"
-                 action_price = 0
-                 error_msg = "invalid price format"
-         elif action_str == "ACCEPT":
-             action_str = "ACCEPT"
-         elif action_str == "REJECT":
-             action_str = "REJECT"
-
-         # Strip potential garbage
-         if "OFFER" in action_str:
-             action_str = f"OFFER {action_price}"
-
-         reward, d = env.step(action_str, action_price)
-         done = d
          rewards.append(reward)
-
-         print(f"[STEP] step={step_n} action={action_str} reward={reward:.2f} done={str(done).lower()} error={error_msg}")
-
-     # SCORING
-     max_possible_reward = float(abs(env.agent_value - env.opponent_value))
-     score = sum(rewards) / max_possible_reward if max_possible_reward > 0 else 0.0
-     score = max(0.0, min(1.0, score))
-     success = score > 0.3
-
      rewards_str = ",".join([f"{r:.2f}" for r in rewards])
-     print(f"[END] success={str(success).lower()} steps={step_n} score={score:.4f} rewards={rewards_str}")

  if __name__ == "__main__":
      main()
+ """
+ Inference Script — OpenEnv Negotiation Environment
+ Runs LLM agent against all 3 tasks, produces structured logs.
+ Uses OpenAI-compatible client with HuggingFace router.
+ """
+
  import os
  import re
  import sys
  from openai import OpenAI
  from env_wrapper import EnvWrapper
+ from tasks import ALL_TASKS, get_grader

+ def parse_action(llm_text: str):
+     """Parse LLM output into (action_str, action_price, error)."""
+     match = re.search(r'(OFFER\s+\d+|ACCEPT|REJECT)', llm_text, re.IGNORECASE)
+     if match:
+         action = match.group(1).upper()
+         if action.startswith("OFFER"):
+             parts = action.split()
+             try:
+                 price = int(parts[1])
+                 return f"OFFER {price}", price, None
+             except (IndexError, ValueError):
+                 return "REJECT", 0, "invalid price in OFFER"
+         return action, 0, None
+     return None, 0, "no action match"
+
+
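+ # Parser behavior (illustrative examples):
+ #   parse_action("OFFER 450")      -> ("OFFER 450", 450, None)
+ #   parse_action("Sure, I accept") -> ("ACCEPT", 0, None)
+ #   parse_action("hmm, let me...") -> (None, 0, "no action match")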
+ def run_task(client, model_name: str, task_config):
+     """
+     Run a single task: LLM negotiates against the environment.
+     Returns the result dict produced by the task's grader.
+     """
+     env = EnvWrapper(
+         opp_type=task_config.opp_type,
+         a_val=task_config.agent_value,
+         o_val=task_config.opponent_value,
+         agent_role=task_config.agent_role,
+         max_rounds=task_config.max_rounds,
+     )
+     obs = env.reset()
+
+     print(f"[START] task={task_config.name} env=negotiation model={model_name}")
+
      done = False
      step_n = 0
      rewards = []
+     deal_made = False
+     history_for_prompt = []
+
      while not done and step_n < env.max_rounds:
          step_n += 1
+
+         # ── Build prompt with history ──
+         history_text = ""
+         if history_for_prompt:
+             history_lines = []
+             for h in history_for_prompt[-5:]:  # Last 5 rounds for context
+                 history_lines.append(f"  Round {h['round']}: You → {h['agent']}, Opponent → {h['opp']}")
+             history_text = "Negotiation history:\n" + "\n".join(history_lines) + "\n\n"
+
+         target_goal = "buy for as low as possible (below your maximum value)" if obs.role == "buyer" else "sell for as high as possible (above your minimum value)"
+
+         prompt = f"""You are negotiating as a {obs.role}. Your goal is to {target_goal} to maximize profit.
+
  State:
+ * Your PRIVATE Valuation: {obs.agent_value} (DO NOT accept or offer a deal worse than this!)
+ * Current offer on the table: {obs.current_offer}
+ * Round: {step_n} of {obs.max_rounds}
+ * Opponent's last action: {obs.last_opponent_action}
+ * Opponent's last offer: {obs.last_opponent_offer}
+
+ {history_text}CRITICAL RULE: NEVER make an OFFER that is worse than your private valuation. For example, if you are a buyer with a valuation of 500, never offer >500.
+
+ Choose exactly ONE action:
+ * OFFER <price> — make a counter-offer (negotiate toward your private valuation)
+ * ACCEPT — accept the opponent's offer (ONLY if it is profitable compared to your valuation)
+ * REJECT — walk away (only if no deal is possible)
+
+ Respond with ONLY your chosen action, nothing else."""
+
          action_str = "REJECT"
          action_price = 0
          error_msg = "null"
+
          try:
              response = client.chat.completions.create(
                  model=model_name,
                  messages=[{"role": "user", "content": prompt}],
                  max_tokens=20,
+                 temperature=0.3,
              )
              llm_text = response.choices[0].message.content.strip()
+
+             parsed_action, parsed_price, parse_err = parse_action(llm_text)
+
+             if parsed_action:
+                 action_str = parsed_action
+                 action_price = parsed_price
              else:
+                 # Retry with stricter prompt
+                 error_msg = f"parse failed: {parse_err}, retrying"
+                 retry_response = client.chat.completions.create(
                      model=model_name,
+                     messages=[
+                         {"role": "user", "content": prompt},
+                         {"role": "assistant", "content": llm_text},
+                         {"role": "user", "content": "Output strictly ONLY ONE of: 'OFFER <price>', 'ACCEPT', or 'REJECT'. Nothing else."},
+                     ],
                      max_tokens=15,
+                     temperature=0.1,
                  )
+                 llm_text2 = retry_response.choices[0].message.content.strip()
+                 parsed2, price2, err2 = parse_action(llm_text2)
+                 if parsed2:
+                     action_str = parsed2
+                     action_price = price2
                      error_msg = "null"
                  else:
                      action_str = "REJECT"
+                     action_price = 0
                      error_msg = "parse error on retry, defaulting to REJECT"
+
          except Exception as e:
+             error_msg = f"API_Error: {str(e)[:50]}"
              action_str = "REJECT"
+             action_price = 0
+
+         # ── Step the environment ──
+         obs, reward, done, info = env.step(action_str, action_price)
          rewards.append(reward)
+
+         # Track deal
+         if done and info.get("deal_type") in ("agent_accepted", "opponent_accepted"):
+             deal_made = True
+
+         # Track history for prompting
+         history_for_prompt.append({
+             "round": step_n,
+             "agent": action_str,
+             "opp": f"{obs.last_opponent_action} {obs.last_opponent_offer}" if obs.last_opponent_action == "OFFER" else obs.last_opponent_action,
+         })
+
+         # ── Log step ──
+         log_action = action_str if not action_str.startswith("OFFER") else f"OFFER {action_price}"
+         print(f"[STEP] step={step_n} action={log_action} reward={reward:.2f} done={str(done).lower()} error={error_msg}")
+
+     # ── Score ──
+     grader = get_grader(task_config)
+     result = grader.grade(rewards, step_n, deal_made)
+
      rewards_str = ",".join([f"{r:.2f}" for r in rewards])
+     print(f"[END] success={str(result['success']).lower()} steps={step_n} score={result['score']:.4f} rewards={rewards_str}")
+     print()
+
+     return result
+
+
+ def main():
+     api_base_url = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
+     model_name = os.getenv("MODEL_NAME", "meta-llama/Meta-Llama-3-8B-Instruct")
+     hf_token = os.getenv("HF_TOKEN")
+
+     if not hf_token:
+         print("ERROR: HF_TOKEN environment variable is not set.")
+         print("Set it with: $env:HF_TOKEN='your_token_here'")
+         sys.exit(1)
+
+     client = OpenAI(base_url=api_base_url, api_key=hf_token)
+
+     print("=" * 60)
+     print("NEGOTIATION ENVIRONMENT — OpenEnv Inference")
+     print("=" * 60)
+     print()
+
+     all_results = []
+
+     for task in ALL_TASKS:
+         result = run_task(client, model_name, task)
+         all_results.append(result)
+
+     # ── Summary ──
+     print("=" * 60)
+     print("SUMMARY")
+     print("=" * 60)
+     for r in all_results:
+         status = "PASS" if r["success"] else "FAIL"
+         print(f"  [{status}] {r['task']} ({r['difficulty']}): score={r['score']:.4f} "
+               f"steps={r['steps']} deal={r['deal_made']} threshold={r['threshold']}")
+
+     avg_score = sum(r["score"] for r in all_results) / len(all_results)
+     print(f"\n  Average Score: {avg_score:.4f}")
+     print("=" * 60)
+

  if __name__ == "__main__":
      main()
openenv.yaml ADDED
@@ -0,0 +1,94 @@
+ name: negotiation-env
+ version: "1.0.0"
+ description: >
+   Strategic Negotiation Simulation Environment where an AI agent learns
+   to negotiate under uncertainty with different opponent personalities.
+   The agent must maximize profit through multi-round price negotiation
+   while adapting to greedy, fair, or impatient opponents.
+
+ author: Team MEta_ai
+ license: Apache-2.0
+
+ environment:
+   type: simulation
+   domain: negotiation
+   real_world_task: automated marketplace pricing and negotiation
+
+ observation_space:
+   type: object
+   fields:
+     current_offer:
+       type: integer
+       description: Current price on the table
+       range: [100, 1000]
+     round:
+       type: integer
+       description: Current round number
+       range: [0, 20]
+     max_rounds:
+       type: integer
+       description: Maximum allowed rounds
+     role:
+       type: string
+       enum: ["buyer", "seller"]
+       description: Agent's role in the negotiation
+     last_opponent_action:
+       type: string
+       enum: ["START", "OFFER", "ACCEPT"]
+       description: Opponent's last action
+     last_opponent_offer:
+       type: integer
+       description: Opponent's last offered price
+       range: [100, 1000]
+     history:
+       type: array
+       description: History of all actions this episode
+
+ action_space:
+   type: object
+   fields:
+     action_type:
+       type: string
+       enum: ["OFFER", "ACCEPT", "REJECT"]
+       description: Type of negotiation action
+     price:
+       type: integer
+       description: Price for OFFER actions (ignored for ACCEPT/REJECT)
+       range: [100, 1000]
+
+ reward:
+   type: float
+   range: [-50.0, 855.0]
+   description: >
+     Reward based on deal profit scaled by time factor.
+     Partial progress signals during intermediate steps.
+     Penalty for failed negotiations (-50), bad deals (-20),
+     and aggressive offers (cumulative -2 per aggressive step).
+
+ tasks:
+   - name: task_a_easy
+     difficulty: easy
+     description: Fair opponent, wide ZOPA, 20 rounds
+     success_threshold: 0.2
+
+   - name: task_b_medium
+     difficulty: medium
+     description: Greedy opponent, narrow ZOPA, 15 rounds
+     success_threshold: 0.3
+
+   - name: task_c_hard
+     difficulty: hard
+     description: Impatient opponent, tight margins, 6 rounds
+     success_threshold: 0.4
+
+ inference:
+   script: inference.py
+   env_vars:
+     - API_BASE_URL
+     - MODEL_NAME
+     - HF_TOKEN
+
+ deployment:
+   dockerfile: Dockerfile
+   platform: huggingface-spaces
+   tag: openenv
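
A quick sanity check of the manifest (illustrative sketch; pyyaml comes from requirements.txt):

```python
import yaml

with open("openenv.yaml") as f:
    spec = yaml.safe_load(f)

# Spot-check the fields the inference harness relies on
assert spec["name"] == "negotiation-env"
assert [t["name"] for t in spec["tasks"]] == ["task_a_easy", "task_b_medium", "task_c_hard"]
```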
requirements.txt ADDED
@@ -0,0 +1,3 @@
+ openai>=1.0.0
+ pydantic>=2.0.0
+ pyyaml>=6.0
tasks.py ADDED
@@ -0,0 +1,126 @@
+ """
+ Task Definitions & Graders for the Negotiation Environment.
+ 3 tasks: Easy → Medium → Hard, each with a programmatic grader (0.0–1.0).
+ """
+
+ from dataclasses import dataclass
+ from typing import List
+
+
+ @dataclass
+ class TaskConfig:
+     """Configuration for a single evaluation task."""
+     name: str
+     description: str
+     difficulty: str
+     opp_type: str
+     agent_value: int
+     opponent_value: int
+     agent_role: str
+     max_rounds: int
+     success_threshold: float  # score >= this means success
+
+
+ # ─────────────────────────────────────────────
+ # Task Definitions
+ # ─────────────────────────────────────────────
+
+ TASK_A = TaskConfig(
+     name="task_a_easy",
+     description="Easy negotiation: fair opponent, wide ZOPA, plenty of rounds",
+     difficulty="easy",
+     opp_type="fair",
+     agent_value=800,
+     opponent_value=400,
+     agent_role="buyer",
+     max_rounds=20,
+     success_threshold=0.2,
+ )
+
+ TASK_B = TaskConfig(
+     name="task_b_medium",
+     description="Medium negotiation: greedy opponent, narrow ZOPA, fewer rounds",
+     difficulty="medium",
+     opp_type="greedy",
+     agent_value=700,
+     opponent_value=500,
+     agent_role="buyer",
+     max_rounds=15,
+     success_threshold=0.3,
+ )
+
+ TASK_C = TaskConfig(
+     name="task_c_hard",
+     description="Hard negotiation: impatient opponent, tight margins, very few rounds",
+     difficulty="hard",
+     opp_type="impatient",
+     agent_value=600,
+     opponent_value=480,
+     agent_role="buyer",
+     max_rounds=6,
+     success_threshold=0.4,
+ )
+
+ ALL_TASKS: List[TaskConfig] = [TASK_A, TASK_B, TASK_C]
+
+
+ # ─────────────────────────────────────────────
+ # Grader
+ # ─────────────────────────────────────────────
+
+ class Grader:
+     """
+     Programmatic grader for a negotiation task.
+     Scores agent performance on a 0.0–1.0 scale.
+     """
+
+     def __init__(self, task: TaskConfig):
+         self.task = task
+         self.max_possible = float(abs(task.agent_value - task.opponent_value))
+
+     def grade(self, rewards: List[float], steps: int, deal_made: bool) -> dict:
+         """
+         Grade an episode.
+
+         Args:
+             rewards: list of per-step rewards
+             steps: number of steps taken
+             deal_made: whether a deal was successfully completed
+
+         Returns:
+             dict with score, success, and breakdown
+         """
+         total_reward = sum(rewards)
+
+         # Score normalization: total_reward / max_possible, clamped to [0, 1]
+         if self.max_possible > 0:
+             raw_score = total_reward / self.max_possible
+         else:
+             raw_score = 0.0
+
+         score = max(0.0, min(1.0, raw_score))
+         success = score >= self.task.success_threshold
+
+         # ── Detailed breakdown ──
+         efficiency = 0.0
+         if deal_made and steps > 0:
+             # Bonus for fewer steps — approaches 1.0 when done in one step
+             efficiency = max(0.0, 1.0 - (steps / self.task.max_rounds))
+
+         return {
+             "task": self.task.name,
+             "difficulty": self.task.difficulty,
+             "score": round(score, 4),
+             "success": success,
+             "threshold": self.task.success_threshold,
+             "total_reward": round(total_reward, 4),
+             "max_possible": self.max_possible,
+             "steps": steps,
+             "deal_made": deal_made,
+             "efficiency": round(efficiency, 4),
+         }
+
+
+ def get_grader(task: TaskConfig) -> Grader:
+     """Create a grader for the given task."""
+     return Grader(task)
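
A worked grading example, mirroring TEST 4 in test_env.py (illustrative):

```python
from tasks import TASK_A, get_grader

grader = get_grader(TASK_A)                       # max_possible = |800 - 400| = 400
result = grader.grade([0.0, 0.0, 50.0], 3, True)  # total_reward = 50
print(result["score"], result["success"])         # 0.125 False (threshold is 0.2)
```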
test_env.py ADDED
@@ -0,0 +1,107 @@
+ """Quick validation test for the environment — no API keys needed."""
+ import random
+ random.seed(42)
+
+ from env_wrapper import EnvWrapper, Observation
+ from tasks import ALL_TASKS, get_grader
+
+ print("=" * 50)
+ print("TEST 1: Multi-round negotiation")
+ print("=" * 50)
+ env = EnvWrapper(opp_type="fair", a_val=800, o_val=400, agent_role="buyer", max_rounds=10)
+ obs = env.reset()
+ print(f"Initial offer: {obs.current_offer}")
+
+ offers = [650, 600, 550, 500, 480, 450, 420, 400, 400, 400]
+ for i, price in enumerate(offers):
+     obs, r, d, info = env.step("OFFER", price)
+     opp_info = f"{obs.last_opponent_action} {obs.last_opponent_offer}"
+     print(f"  R{i+1}: OFFER {price} -> Opp {opp_info} | reward={r:.2f} done={d}")
+     if d:
+         deal_type = info.get("deal_type", "none")
+         deal_price = info.get("deal_price", "N/A")
+         print(f"  >>> Deal: {deal_type}, price={deal_price}")
+         break
+
+ print(f"  History entries: {len(obs.history)}")
+ print()
+
+ print("=" * 50)
+ print("TEST 2: Value-based clamping")
+ print("=" * 50)
+ from env_wrapper import Opponent
+ bugs = 0
+ for trial in range(100):
+     opp = Opponent("fair", 500, "seller")
+     for rnd in range(20):
+         action, price = opp.get_response(rnd, 300, 250, "OFFER")
+         if action == "OFFER" and price < 500:
+             bugs += 1
+             print(f"  BUG: trial={trial} round={rnd} seller offered {price} < 500")
+             break
+ if bugs == 0:
+     print("  PASS: Seller never offered below own value (100 trials x 20 rounds)")
+ else:
+     print(f"  FAIL: {bugs} violations found")
+ print()
+
+ print("=" * 50)
+ print("TEST 3: Cumulative aggression penalty")
+ print("=" * 50)
+ env2 = EnvWrapper(opp_type="greedy", a_val=800, o_val=500, agent_role="buyer", max_rounds=20)
+ env2.reset()
+ # Make multiple aggressive offers (>150 away from opp_val=500, so <350 or >650)
+ for i in range(5):
+     obs, r, d, info = env2.step("OFFER", 200)  # 300 away from 500 → aggressive
+     print(f"  R{i+1}: penalty_so_far={env2.cumulative_aggression_penalty}")
+     if d:
+         break
+
+ expected_penalty = 10.0  # 5 rounds x 2.0 per round
+ actual_penalty = env2.cumulative_aggression_penalty
+ print(f"  Expected cumulative penalty: {expected_penalty}, Actual: {actual_penalty}")
+ print(f"  {'PASS' if actual_penalty == expected_penalty else 'FAIL'}")
+ print()
+
+ print("=" * 50)
+ print("TEST 4: Task configs and graders")
+ print("=" * 50)
+ for task in ALL_TASKS:
+     grader = get_grader(task)
+     # Test with sample rewards
+     result = grader.grade([0.0, 0.0, 50.0], 3, True)
+     print(f"  {task.name} ({task.difficulty}): score={result['score']}, success={result['success']}")
+ print()
+
+ print("=" * 50)
+ print("TEST 5: state() method")
+ print("=" * 50)
+ env3 = EnvWrapper(opp_type="fair", a_val=800, o_val=400, agent_role="buyer")
+ env3.reset()
+ s = env3.state()
+ assert isinstance(s, Observation), "state() must return Observation"
+ assert s.role == "buyer"
+ assert s.round == 0
+ print(f"  PASS: state() returns Observation with role={s.role}, round={s.round}")
+ print()
+
+ print("=" * 50)
+ print("TEST 6: ACCEPT and REJECT")
+ print("=" * 50)
+ # ACCEPT test
+ env4 = EnvWrapper(opp_type="fair", a_val=800, o_val=400, agent_role="buyer")
+ env4.reset()
+ env4.step("OFFER", 500)  # 500 >= seller value 400, so this already ends the episode
+ obs, r, d, info = env4.step("ACCEPT", 0)  # stepping a finished episode is a no-op
+ print(f"  ACCEPT: reward={r:.2f} done={d} deal_type={info.get('deal_type')}")
+
+ # REJECT test
+ env5 = EnvWrapper(opp_type="fair", a_val=800, o_val=400, agent_role="buyer")
+ env5.reset()
+ obs, r, d, info = env5.step("REJECT", 0)
+ print(f"  REJECT: reward={r:.2f} done={d} deal_type={info.get('deal_type')}")
+ print()
+
+ print("=" * 50)
+ print("ALL TESTS COMPLETE")
+ print("=" * 50)
test_output.txt ADDED
@@ -0,0 +1,45 @@
+ ==================================================
+ TEST 1: Multi-round negotiation
+ ==================================================
+ Initial offer: 1000
+   R1: OFFER 650 -> Opp ACCEPT 650 | reward=133.00 done=True
+   >>> Deal: opponent_accepted, price=650
+   History entries: 1
+
+ ==================================================
+ TEST 2: Value-based clamping
+ ==================================================
+   PASS: Seller never offered below own value (100 trials x 20 rounds)
+
+ ==================================================
+ TEST 3: Cumulative aggression penalty
+ ==================================================
+   R1: penalty_so_far=2.0
+   R2: penalty_so_far=4.0
+   R3: penalty_so_far=6.0
+   R4: penalty_so_far=8.0
+   R5: penalty_so_far=10.0
+   Expected cumulative penalty: 10.0, Actual: 10.0
+   PASS
+
+ ==================================================
+ TEST 4: Task configs and graders
+ ==================================================
+   task_a_easy (easy): score=0.125, success=False
+   task_b_medium (medium): score=0.25, success=False
+   task_c_hard (hard): score=0.4167, success=True
+
+ ==================================================
+ TEST 5: state() method
+ ==================================================
+   PASS: state() returns Observation with role=buyer, round=0
+
+ ==================================================
+ TEST 6: ACCEPT and REJECT
+ ==================================================
+   ACCEPT: reward=0.00 done=True deal_type=None
+   REJECT: reward=-50.00 done=True deal_type=agent_rejected
+
+ ==================================================
+ ALL TESTS COMPLETE
+ ==================================================