Aviral Bhargava committed on
Commit · a72a5e3
1 Parent(s): 610b7e5
feat: complete OpenEnv compliance, tasks, and logic fixes
Files changed:
- .gitignore +29 -0
- Dockerfile +26 -0
- README.md +115 -1
- __pycache__/env_wrapper.cpython-312.pyc +0 -0
- env_wrapper.py +291 -43
- inference.py +161 -73
- openenv.yaml +94 -0
- requirements.txt +3 -0
- tasks.py +126 -0
- test_env.py +107 -0
- test_output.txt +45 -0
.gitignore ADDED
@@ -0,0 +1,29 @@
# Python
__pycache__/
*.py[cod]
*.pyo
*.egg-info/
dist/
build/
*.egg

# Environment
.env
.venv/
venv/

# IDE
.vscode/
.idea/
*.swp
*.swo

# OS
.DS_Store
Thumbs.db

# Build artifacts
*.exe
*.o
*.out
test_sim.exe

Dockerfile ADDED
@@ -0,0 +1,26 @@
# ── Negotiation Environment — OpenEnv Dockerfile ──
# Person 3: Complete this Dockerfile for HuggingFace Spaces deployment
# Requirements: Python 3.11+, pip dependencies, inference.py entrypoint
# Constraints: CPU only, vcpu=2, memory=8gb, runtime < 20min

FROM python:3.11-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy project files
COPY env_wrapper.py .
COPY tasks.py .
COPY inference.py .
COPY openenv.yaml .

# Environment variables (set at runtime, NOT hardcoded)
# API_BASE_URL — The API endpoint for the LLM
# MODEL_NAME — The model identifier to use for inference
# HF_TOKEN — Your HuggingFace API key

# Entrypoint
CMD ["python", "inference.py"]

README.md CHANGED
@@ -1 +1,115 @@

# 🤝 Strategic Negotiation Environment — OpenEnv

A simulation environment where an AI agent learns to negotiate under uncertainty, compliant with the [Meta OpenEnv specification](https://github.com/meta-llama/open-env).

## 🧠 Overview

This environment simulates **real-world price negotiation** — a task humans do daily in marketplaces, business deals, and automated pricing systems. The agent must:

- **Maximize profit** by negotiating favorable deals
- **Adapt to opponent behavior** (greedy, fair, or impatient personalities)
- **Make multi-step strategic decisions** under partial observability

The agent cannot see the opponent's true valuation or strategy — it must infer patterns and adjust.

---

## 🎮 Action Space

| Action | Description |
|---|---|
| `OFFER <price>` | Make a counter-offer at the given price (100–1000) |
| `ACCEPT` | Accept the current offer on the table |
| `REJECT` | Walk away from the negotiation (terminal, -50 penalty) |

## 👁️ Observation Space

| Field | Type | Description |
|---|---|---|
| `current_offer` | int | Current price on the table |
| `round` | int | Current round number |
| `max_rounds` | int | Maximum allowed rounds |
| `role` | string | Agent's role: "buyer" or "seller" |
| `last_opponent_action` | string | "START", "OFFER", or "ACCEPT" |
| `last_opponent_offer` | int | Opponent's last offered price |
| `history` | list | History of all actions this episode |
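
A minimal interaction sketch, mirroring the usage in `test_env.py` from this commit (no API key needed):

```python
from env_wrapper import EnvWrapper

# Buyer with private valuation 800 vs. a fair seller valuing the item at 400
env = EnvWrapper(opp_type="fair", a_val=800, o_val=400,
                 agent_role="buyer", max_rounds=10)
obs = env.reset()
print(obs.current_offer, obs.role)  # opening offer (1000 here), "buyer"

# Counter-offer; the opponent either accepts or counters
obs, reward, done, info = env.step("OFFER", 650)
print(obs.last_opponent_action, obs.last_opponent_offer, reward, done)
```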

## 💰 Reward Function

| Event | Reward |
|---|---|
| Successful deal | `profit × (1 - round/max_rounds)` |
| Bad deal (profit < 0) | Additional -20 penalty |
| Rejection / Timeout | -50 |
| Aggressive offers | Cumulative -2 per aggressive step |
| Progress toward deal | Small shaping signal (±2 max) |
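
Worked example: a buyer with valuation 800 who closes a deal at 650 in round 1 of 10 earns profit 800 − 650 = 150, scaled by the time factor 1 − 1/10 = 0.9 for a base reward of 135; the one aggressive offer (more than 150 away from the opponent's valuation) subtracts 2, giving 133 — exactly the value logged in Test 1 of `test_output.txt`.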

---

## 📋 Tasks

| Task | Difficulty | Opponent | ZOPA | Rounds | Threshold |
|---|---|---|---|---|---|
| `task_a_easy` | Easy | Fair | Wide (400) | 20 | 0.2 |
| `task_b_medium` | Medium | Greedy | Narrow (200) | 15 | 0.3 |
| `task_c_hard` | Hard | Impatient | Tight (120) | 6 | 0.4 |
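
Each task ships with a programmatic grader (see `tasks.py`). A minimal sketch, reproducing Test 4 of `test_env.py`:

```python
from tasks import ALL_TASKS, get_grader

task = ALL_TASKS[0]                        # task_a_easy
grader = get_grader(task)
# Args: per-step rewards, steps taken, whether a deal closed
result = grader.grade([0.0, 0.0, 50.0], 3, True)
print(result["score"], result["success"])  # 0.125 False (threshold 0.2)
```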

---

## 🚀 Setup & Usage

### Prerequisites
- Python 3.11+
- HuggingFace API token

### Install
```bash
pip install -r requirements.txt
```

### Configure Environment Variables
```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="meta-llama/Meta-Llama-3-8B-Instruct"
export HF_TOKEN="your_token_here"
```

### Run Inference
```bash
python inference.py
```

### Docker
```bash
docker build -t negotiation-env .
docker run -e HF_TOKEN=your_token -e API_BASE_URL=https://router.huggingface.co/v1 -e MODEL_NAME=meta-llama/Meta-Llama-3-8B-Instruct negotiation-env
```

---

## 📊 Baseline Scores

<!-- Person 3: Fill in baseline scores after running inference -->
| Task | Score | Steps | Deal Made |
|---|---|---|---|
| task_a_easy | _TBD_ | _TBD_ | _TBD_ |
| task_b_medium | _TBD_ | _TBD_ | _TBD_ |
| task_c_hard | _TBD_ | _TBD_ | _TBD_ |

---

## 🏗️ Architecture

```
LLM (HuggingFace via OpenAI Client)
        ↓
inference.py (control loop + logging)
        ↓
env_wrapper.py (OpenEnv-compatible environment)
        ↓
tasks.py (task configs + graders)
```
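
`inference.py` emits one structured log line per event; the format (placeholders shown, not literal output) is:

```
[START] task=<task_name> env=negotiation model=<model_name>
[STEP] step=<n> action=<action> reward=<r> done=<true|false> error=<msg|null>
[END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...>
```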

## 📄 License

Apache 2.0

__pycache__/env_wrapper.cpython-312.pyc CHANGED
Binary files a/__pycache__/env_wrapper.cpython-312.pyc and b/__pycache__/env_wrapper.cpython-312.pyc differ

env_wrapper.py CHANGED
@@ -1,100 +1,348 @@

"""
Negotiation Environment Wrapper — OpenEnv Compliant
Implements: reset(), step(), state()
Typed models via Pydantic for Observation, Action, Reward
"""

import random
from typing import Optional, List, Dict, Any
from pydantic import BaseModel, Field


# ─────────────────────────────────────────────
# OpenEnv Typed Models
# ─────────────────────────────────────────────

class Observation(BaseModel):
    """Observable state visible to the agent."""
    agent_value: int = Field(description="The agent's private valuation/target value for the deal")
    current_offer: int = Field(description="Current price on the table")
    round: int = Field(description="Current round number (0-indexed before first step)")
    max_rounds: int = Field(description="Maximum allowed rounds")
    role: str = Field(description="Agent role: 'buyer' or 'seller'")
    last_opponent_action: str = Field(description="Opponent's last action: 'START', 'OFFER', 'ACCEPT'")
    last_opponent_offer: int = Field(description="Opponent's last offered price")
    history: List[Dict[str, Any]] = Field(default_factory=list, description="History of all actions this episode")


class ActionModel(BaseModel):
    """Action the agent can take."""
    action_type: str = Field(description="One of: 'OFFER', 'ACCEPT', 'REJECT'")
    price: int = Field(default=0, description="Price for OFFER actions, ignored for ACCEPT/REJECT")


class RewardInfo(BaseModel):
    """Reward information returned by step()."""
    reward: float = Field(description="Numeric reward for this step")
    breakdown: Dict[str, float] = Field(default_factory=dict, description="Reward component breakdown")


# ─────────────────────────────────────────────
# Opponent Strategy
# ─────────────────────────────────────────────

class Opponent:
    """
    Simulates opponent negotiation behavior.
    Three personalities: greedy, fair, impatient.
    Each has different concession rates, anchor effects, patience, and noise.
    """

    PROFILES = {
        "greedy": {"r": 0.05, "alpha": 0.7, "patience": 10, "epsilon": 5},
        "fair": {"r": 0.15, "alpha": 0.4, "patience": 7, "epsilon": 10},
        "impatient": {"r": 0.25, "alpha": 0.2, "patience": 3, "epsilon": 15},
    }

    def __init__(self, type_str: str, value: int, role: str):
        self.type = type_str
        self.opponent_value = value
        self.opponent_role = role
        self.history: List[Dict[str, Any]] = []

        profile = self.PROFILES.get(type_str, self.PROFILES["fair"])
        self.r = profile["r"]
        self.alpha = profile["alpha"]
        self.patience = profile["patience"]
        self.epsilon = profile["epsilon"]
        self.concession_rate = self.r

    def reset_state(self):
        """Reset concession rate and history for new episode."""
        self.concession_rate = self.r
        self.history = []

    def get_response(self, round_num: int, current_offer: int, agent_offer: int, agent_action_type: str):
        """
        Generate opponent response to agent's action.
        Returns: (action_type: str, price: int)
        """
        if agent_action_type != "OFFER":
            return "REJECT", 0

        # ── Acceptance Check ──
        if self.opponent_role == "seller" and agent_offer >= self.opponent_value:
            self.history.append({"round": round_num, "action": "ACCEPT", "price": agent_offer})
            return "ACCEPT", agent_offer
        if self.opponent_role == "buyer" and agent_offer <= self.opponent_value:
            self.history.append({"round": round_num, "action": "ACCEPT", "price": agent_offer})
            return "ACCEPT", agent_offer

        # ── Patience-based concession acceleration ──
        if round_num > self.patience:
            self.concession_rate = min(0.4, self.concession_rate + 0.05)

        # ── Counter-offer calculation ──
        target = self.opponent_value
        delta = target - current_offer
        next_offer = current_offer + self.concession_rate * delta

        # Anchor effect — blend toward current offer
        next_offer = (1.0 - self.alpha) * next_offer + self.alpha * current_offer

        # Add noise
        next_offer += random.randint(-self.epsilon, self.epsilon)

        # ── VALUE-BASED CLAMPING (Tolerance Bug Fix) ──
        # Seller must not offer below their own value
        # Buyer must not offer above their own value
        next_offer_int = int(next_offer)
        if self.opponent_role == "seller":
            next_offer_int = max(next_offer_int, self.opponent_value)
        elif self.opponent_role == "buyer":
            next_offer_int = min(next_offer_int, self.opponent_value)

        # Absolute bounds
        next_offer_int = max(100, min(1000, next_offer_int))

        self.history.append({"round": round_num, "action": "OFFER", "price": next_offer_int})
        return "OFFER", next_offer_int


# ─────────────────────────────────────────────
# Main Environment Wrapper
# ─────────────────────────────────────────────

class EnvWrapper:
    """
    OpenEnv-compliant negotiation environment.
    Exposes: reset(), step(), state()
    """

    def __init__(self, opp_type: str = "fair", a_val: int = 800, o_val: int = 500,
                 agent_role: str = "buyer", max_rounds: int = 20):
        self.agent_value = a_val
        self.opponent_value = o_val
        self.role = agent_role
        self.opp_type = opp_type
        self.opp_role = "seller" if agent_role == "buyer" else "buyer"
        self.max_rounds = max_rounds
        self.opp = Opponent(opp_type, o_val, self.opp_role)

        # Episode tracking
        self.round = 0
        self.current_offer = 0
        self.last_opp_action = "START"
        self.last_opp_offer = 0
        self.history: List[Dict[str, Any]] = []
        self.cumulative_aggression_penalty = 0.0
        self.done = False

    def reset(self) -> Observation:
        """Reset environment and return initial observation."""
        self.round = 0
        self.done = False
        self.history = []
        self.cumulative_aggression_penalty = 0.0
        self.opp.reset_state()

        # Initial offer is shifted away from agent's value to force negotiation
        if self.role == "buyer":
            # Start high — agent (buyer) must negotiate DOWN
            self.current_offer = min(1000, self.agent_value + 200)
        else:
            # Start low — agent (seller) must negotiate UP
            self.current_offer = max(100, self.agent_value - 200)

        self.last_opp_action = "START"
        self.last_opp_offer = self.current_offer

        return self.state()

    def state(self) -> Observation:
        """Return current observable state."""
        return Observation(
            agent_value=self.agent_value,
            current_offer=self.current_offer,
            round=self.round,
            max_rounds=self.max_rounds,
            role=self.role,
            last_opponent_action=self.last_opp_action,
            last_opponent_offer=self.last_opp_offer,
            history=list(self.history),
        )

    def _compute_reward(self, deal_price: int) -> tuple:
        """
        Compute reward for a completed deal.
        Returns: (total_reward, breakdown_dict)
        """
        if self.role == "seller":
            profit = deal_price - self.agent_value
        else:
            profit = self.agent_value - deal_price

        time_factor = 1.0 - (self.round / self.max_rounds)
        base_reward = profit * time_factor

        # Penalty for bad deals (agent accepts a losing deal)
        bad_deal_penalty = -20.0 if profit < 0 else 0.0

        # Cumulative aggression penalty
        aggression = -self.cumulative_aggression_penalty

        total = base_reward + bad_deal_penalty + aggression

        breakdown = {
            "profit": float(profit),
            "time_factor": round(time_factor, 4),
            "base_reward": round(base_reward, 4),
            "bad_deal_penalty": bad_deal_penalty,
            "aggression_penalty": aggression,
            "total": round(total, 4),
        }
        return total, breakdown

    def _partial_progress_reward(self, action_str: str, action_price: int) -> tuple:
        """
        Provide a small shaping reward for intermediate steps.
        Rewards the agent for moving toward a deal (improving offers).
        """
        reward = 0.0
        breakdown = {}

        if action_str.startswith("OFFER") and len(self.history) >= 2:
            # Check if agent is making progress toward opponent
            prev_agent_offers = [h["agent_price"] for h in self.history[:-1]
                                 if h.get("agent_action", "").startswith("OFFER")]
            if prev_agent_offers:
                last_agent_offer = prev_agent_offers[-1]
                # Positive signal if agent moves toward a reasonable range
                if self.role == "buyer":
                    # Buyer should increase offers (toward seller's value)
                    improvement = action_price - last_agent_offer
                    reward = min(2.0, max(-1.0, improvement / 50.0))
                else:
                    # Seller should decrease offers (toward buyer's value)
                    improvement = last_agent_offer - action_price
                    reward = min(2.0, max(-1.0, improvement / 50.0))

            breakdown = {"progress_signal": round(reward, 4)}

        return reward, breakdown

    def step(self, action_str: str, action_price: int = 0):
        """
        Take one step in the environment.

        Args:
            action_str: "OFFER", "ACCEPT", or "REJECT"
            action_price: price for OFFER actions

        Returns:
            (observation: Observation, reward: float, done: bool, info: dict)
        """
        if self.done:
            return self.state(), 0.0, True, {"error": "Episode already ended"}

        self.round += 1
        reward = 0.0
        done = False
        info: Dict[str, Any] = {"error": None}
        breakdown: Dict[str, float] = {}

        # ── AGENT OFFER CLAMPING ──
        if action_str.startswith("OFFER"):
            action_price = max(100, min(1000, action_price))
            action_str = f"OFFER {action_price}"

            # ── CUMULATIVE AGGRESSION PENALTY ──
            if abs(action_price - self.opponent_value) > 150:
                self.cumulative_aggression_penalty += 2.0

        # Record this step in history
        step_record = {
            "round": self.round,
            "agent_action": action_str,
            "agent_price": action_price,
        }

        if action_str == "ACCEPT":
            deal_price = self.last_opp_offer
            reward, breakdown = self._compute_reward(deal_price)
            done = True
            info["deal_price"] = deal_price
            info["deal_type"] = "agent_accepted"

        elif action_str == "REJECT":
            reward = -50.0
            breakdown = {"rejection_penalty": -50.0}
            done = True
            info["deal_type"] = "agent_rejected"

        elif action_str.startswith("OFFER"):
            opp_action, opp_price = self.opp.get_response(
                self.round, self.current_offer, action_price, "OFFER"
            )

            if opp_action == "ACCEPT":
                deal_price = action_price
                reward, breakdown = self._compute_reward(deal_price)
                done = True
                self.last_opp_action = "ACCEPT"
                self.last_opp_offer = deal_price
                info["deal_price"] = deal_price
                info["deal_type"] = "opponent_accepted"
            else:
                # Opponent counters
                self.current_offer = opp_price
                self.last_opp_action = "OFFER"
                self.last_opp_offer = opp_price

                # Check max rounds
                if self.round >= self.max_rounds:
                    reward = -50.0
                    breakdown = {"timeout_penalty": -50.0}
                    done = True
                    info["deal_type"] = "timeout"
                else:
                    # Partial progress reward for intermediate steps
                    step_record["agent_price"] = action_price
                    self.history.append(step_record)
                    reward, breakdown = self._partial_progress_reward(action_str, action_price)
                    info["opponent_counter"] = opp_price

            step_record["opp_action"] = opp_action
            step_record["opp_price"] = opp_price

        # Record history for terminal steps too
        if done or action_str == "ACCEPT" or action_str == "REJECT":
            # Avoid double-append for non-OFFER terminal steps
            if step_record not in self.history:
                self.history.append(step_record)

        self.done = done
        info["reward_breakdown"] = breakdown

        return self.state(), reward, done, info


# ─────────────────────────────────────────────
# Convenience — max possible reward for scoring
# ─────────────────────────────────────────────

def get_max_possible_reward(agent_value: int, opponent_value: int) -> float:
    """
    Maximum reward possible if agent gets the best possible deal on round 1.
    """
    return float(abs(agent_value - opponent_value))

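As a quick sanity check of the scoring ceiling (a usage sketch, not part of the commit):

```python
from env_wrapper import get_max_possible_reward

# Matches the "Tight (120)" ZOPA listed for task_c_hard in the README
print(get_max_possible_reward(600, 480))  # 120.0
```
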
inference.py CHANGED
@@ -1,111 +1,199 @@

"""
Inference Script — OpenEnv Negotiation Environment
Runs LLM agent against all 3 tasks, produces structured logs.
Uses OpenAI-compatible client with HuggingFace router.
"""

import os
import re
import sys
from openai import OpenAI
from env_wrapper import EnvWrapper
from tasks import ALL_TASKS, get_grader


def parse_action(llm_text: str):
    """Parse LLM output into (action_str, action_price, error)."""
    match = re.search(r'(OFFER\s+\d+|ACCEPT|REJECT)', llm_text, re.IGNORECASE)
    if match:
        action = match.group(1).upper()
        if action.startswith("OFFER"):
            parts = action.split()
            try:
                price = int(parts[1])
                return f"OFFER {price}", price, None
            except (IndexError, ValueError):
                return "REJECT", 0, "invalid price in OFFER"
        return action, 0, None
    return None, 0, "no action match"


def run_task(client, model_name: str, task_config):
    """
    Run a single task: LLM negotiates against the environment.
    Returns the grader's result dict (score, success, steps, deal_made, ...).
    """
    env = EnvWrapper(
        opp_type=task_config.opp_type,
        a_val=task_config.agent_value,
        o_val=task_config.opponent_value,
        agent_role=task_config.agent_role,
        max_rounds=task_config.max_rounds,
    )
    obs = env.reset()

    print(f"[START] task={task_config.name} env=negotiation model={model_name}")

    done = False
    step_n = 0
    rewards = []
    deal_made = False
    history_for_prompt = []

    while not done and step_n < env.max_rounds:
        step_n += 1

        # ── Build prompt with history ──
        history_text = ""
        if history_for_prompt:
            history_lines = []
            for h in history_for_prompt[-5:]:  # Last 5 rounds for context
                history_lines.append(f"  Round {h['round']}: You → {h['agent']}, Opponent → {h['opp']}")
            history_text = "Negotiation history:\n" + "\n".join(history_lines) + "\n\n"

        target_goal = "buy for as low as possible (below your maximum value)" if obs.role == "buyer" else "sell for as high as possible (above your minimum value)"

        prompt = f"""You are negotiating as a {obs.role}. Your goal is to {target_goal} to maximize profit.

State:
* Your PRIVATE Valuation: {obs.agent_value} (DO NOT accept or offer a deal worse than this!)
* Current offer on the table: {obs.current_offer}
* Round: {step_n} of {obs.max_rounds}
* Opponent's last action: {obs.last_opponent_action}
* Opponent's last offer: {obs.last_opponent_offer}

{history_text}CRITICAL RULE: NEVER make an OFFER that is worse than your private valuation. For example, if you are a buyer with a valuation of 500, never offer >500.

Choose exactly ONE action:
* OFFER <price> — make a counter-offer (negotiate toward your private valuation)
* ACCEPT — accept the opponent's offer (ONLY if it is profitable compared to your valuation)
* REJECT — walk away (only if no deal is possible)

Respond with ONLY your chosen action, nothing else."""

        action_str = "REJECT"
        action_price = 0
        error_msg = "null"

        try:
            response = client.chat.completions.create(
                model=model_name,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=20,
                temperature=0.3,
            )
            llm_text = response.choices[0].message.content.strip()

            parsed_action, parsed_price, parse_err = parse_action(llm_text)

            if parsed_action:
                action_str = parsed_action
                action_price = parsed_price
            else:
                # Retry with stricter prompt
                error_msg = f"parse failed: {parse_err}, retrying"
                retry_response = client.chat.completions.create(
                    model=model_name,
                    messages=[
                        {"role": "user", "content": prompt},
                        {"role": "assistant", "content": llm_text},
                        {"role": "user", "content": "Output strictly ONLY ONE of: 'OFFER <price>', 'ACCEPT', or 'REJECT'. Nothing else."},
                    ],
                    max_tokens=15,
                    temperature=0.1,
                )
                llm_text2 = retry_response.choices[0].message.content.strip()
                parsed2, price2, err2 = parse_action(llm_text2)
                if parsed2:
                    action_str = parsed2
                    action_price = price2
                    error_msg = "null"
                else:
                    action_str = "REJECT"
                    action_price = 0
                    error_msg = "parse error on retry, defaulting to REJECT"

        except Exception as e:
            error_msg = f"API_Error: {str(e)[:50]}"
            action_str = "REJECT"
            action_price = 0

        # ── Step the environment ──
        obs, reward, done, info = env.step(action_str, action_price)
        rewards.append(reward)

        # Track deal
        if done and info.get("deal_type") in ("agent_accepted", "opponent_accepted"):
            deal_made = True

        # Track history for prompting
        history_for_prompt.append({
            "round": step_n,
            "agent": action_str,
            "opp": f"{obs.last_opponent_action} {obs.last_opponent_offer}" if obs.last_opponent_action == "OFFER" else obs.last_opponent_action,
        })

        # ── Log step ──
        log_action = action_str if not action_str.startswith("OFFER") else f"OFFER {action_price}"
        print(f"[STEP] step={step_n} action={log_action} reward={reward:.2f} done={str(done).lower()} error={error_msg}")

    # ── Score ──
    grader = get_grader(task_config)
    result = grader.grade(rewards, step_n, deal_made)

    rewards_str = ",".join([f"{r:.2f}" for r in rewards])
    print(f"[END] success={str(result['success']).lower()} steps={step_n} score={result['score']:.4f} rewards={rewards_str}")
    print()

    return result


def main():
    api_base_url = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
    model_name = os.getenv("MODEL_NAME", "meta-llama/Meta-Llama-3-8B-Instruct")
    hf_token = os.getenv("HF_TOKEN")

    if not hf_token:
        print("ERROR: HF_TOKEN environment variable is not set.")
        print("Set it with: $env:HF_TOKEN='your_token_here'")
        sys.exit(1)

    client = OpenAI(base_url=api_base_url, api_key=hf_token)

    print("=" * 60)
    print("NEGOTIATION ENVIRONMENT — OpenEnv Inference")
    print("=" * 60)
    print()

    all_results = []

    for task in ALL_TASKS:
        result = run_task(client, model_name, task)
        all_results.append(result)

    # ── Summary ──
    print("=" * 60)
    print("SUMMARY")
    print("=" * 60)
    for r in all_results:
        status = "PASS" if r["success"] else "FAIL"
        print(f"  [{status}] {r['task']} ({r['difficulty']}): score={r['score']:.4f} "
              f"steps={r['steps']} deal={r['deal_made']} threshold={r['threshold']}")

    avg_score = sum(r["score"] for r in all_results) / len(all_results)
    print(f"\n  Average Score: {avg_score:.4f}")
    print("=" * 60)


if __name__ == "__main__":
    main()

openenv.yaml ADDED
@@ -0,0 +1,94 @@
name: negotiation-env
version: "1.0.0"
description: >
  Strategic Negotiation Simulation Environment where an AI agent learns
  to negotiate under uncertainty with different opponent personalities.
  The agent must maximize profit through multi-round price negotiation
  while adapting to greedy, fair, or impatient opponents.

author: Team MEta_ai
license: Apache-2.0

environment:
  type: simulation
  domain: negotiation
  real_world_task: automated marketplace pricing and negotiation

observation_space:
  type: object
  fields:
    current_offer:
      type: integer
      description: Current price on the table
      range: [100, 1000]
    round:
      type: integer
      description: Current round number
      range: [0, 20]
    max_rounds:
      type: integer
      description: Maximum allowed rounds
    role:
      type: string
      enum: ["buyer", "seller"]
      description: Agent's role in the negotiation
    last_opponent_action:
      type: string
      enum: ["START", "OFFER", "ACCEPT"]
      description: Opponent's last action
    last_opponent_offer:
      type: integer
      description: Opponent's last offered price
      range: [100, 1000]
    history:
      type: array
      description: History of all actions this episode

action_space:
  type: object
  fields:
    action_type:
      type: string
      enum: ["OFFER", "ACCEPT", "REJECT"]
      description: Type of negotiation action
    price:
      type: integer
      description: Price for OFFER actions (ignored for ACCEPT/REJECT)
      range: [100, 1000]

reward:
  type: float
  range: [-50.0, 855.0]
  description: >
    Reward based on deal profit scaled by time factor.
    Partial progress signals during intermediate steps.
    Penalty for failed negotiations (-50), bad deals (-20),
    and aggressive offers (cumulative -2 per aggressive step).

tasks:
  - name: task_a_easy
    difficulty: easy
    description: Fair opponent, wide ZOPA, 20 rounds
    success_threshold: 0.2

  - name: task_b_medium
    difficulty: medium
    description: Greedy opponent, narrow ZOPA, 15 rounds
    success_threshold: 0.3

  - name: task_c_hard
    difficulty: hard
    description: Impatient opponent, tight margins, 6 rounds
    success_threshold: 0.4

inference:
  script: inference.py
  env_vars:
    - API_BASE_URL
    - MODEL_NAME
    - HF_TOKEN

deployment:
  dockerfile: Dockerfile
  platform: huggingface-spaces
  tag: openenv

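requirements.txt pins pyyaml, presumably for consuming this manifest; a minimal loading sketch (hypothetical usage — nothing in this commit reads the file directly):

```python
import yaml  # provided by the pyyaml requirement

with open("openenv.yaml") as f:
    spec = yaml.safe_load(f)

print(spec["name"], spec["version"])  # negotiation-env 1.0.0
for task in spec["tasks"]:
    print(task["name"], task["success_threshold"])
```
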
requirements.txt ADDED
@@ -0,0 +1,3 @@
openai>=1.0.0
pydantic>=2.0.0
pyyaml>=6.0

tasks.py ADDED
@@ -0,0 +1,126 @@
"""
Task Definitions & Graders for the Negotiation Environment.
3 tasks: Easy → Medium → Hard, each with a programmatic grader (0.0–1.0).
"""

from dataclasses import dataclass
from typing import List


@dataclass
class TaskConfig:
    """Configuration for a single evaluation task."""
    name: str
    description: str
    difficulty: str
    opp_type: str
    agent_value: int
    opponent_value: int
    agent_role: str
    max_rounds: int
    success_threshold: float  # score >= this means success


# ─────────────────────────────────────────────
# Task Definitions
# ─────────────────────────────────────────────

TASK_A = TaskConfig(
    name="task_a_easy",
    description="Easy negotiation: fair opponent, wide ZOPA, plenty of rounds",
    difficulty="easy",
    opp_type="fair",
    agent_value=800,
    opponent_value=400,
    agent_role="buyer",
    max_rounds=20,
    success_threshold=0.2,
)

TASK_B = TaskConfig(
    name="task_b_medium",
    description="Medium negotiation: greedy opponent, narrow ZOPA, fewer rounds",
    difficulty="medium",
    opp_type="greedy",
    agent_value=700,
    opponent_value=500,
    agent_role="buyer",
    max_rounds=15,
    success_threshold=0.3,
)

TASK_C = TaskConfig(
    name="task_c_hard",
    description="Hard negotiation: impatient opponent, tight margins, very few rounds",
    difficulty="hard",
    opp_type="impatient",
    agent_value=600,
    opponent_value=480,
    agent_role="buyer",
    max_rounds=6,
    success_threshold=0.4,
)

ALL_TASKS: List[TaskConfig] = [TASK_A, TASK_B, TASK_C]


# ─────────────────────────────────────────────
# Grader
# ─────────────────────────────────────────────

class Grader:
    """
    Programmatic grader for a negotiation task.
    Scores agent performance on a 0.0–1.0 scale.
    """

    def __init__(self, task: TaskConfig):
        self.task = task
        self.max_possible = float(abs(task.agent_value - task.opponent_value))

    def grade(self, rewards: List[float], steps: int, deal_made: bool) -> dict:
        """
        Grade an episode.

        Args:
            rewards: list of per-step rewards
            steps: number of steps taken
            deal_made: whether a deal was successfully completed

        Returns:
            dict with score, success, and breakdown
        """
        total_reward = sum(rewards)

        # Score normalization: total_reward / max_possible, clamped to [0, 1]
        if self.max_possible > 0:
            raw_score = total_reward / self.max_possible
        else:
            raw_score = 0.0

        score = max(0.0, min(1.0, raw_score))
        success = score >= self.task.success_threshold

        # ── Detailed breakdown ──
        efficiency = 0.0
        if deal_made and steps > 0:
            # Bonus for fewer steps — max 1.0 if done in 1 step
            efficiency = max(0.0, 1.0 - (steps / self.task.max_rounds))

        return {
            "task": self.task.name,
            "difficulty": self.task.difficulty,
            "score": round(score, 4),
            "success": success,
            "threshold": self.task.success_threshold,
            "total_reward": round(total_reward, 4),
            "max_possible": self.max_possible,
            "steps": steps,
            "deal_made": deal_made,
            "efficiency": round(efficiency, 4),
        }


def get_grader(task: TaskConfig) -> Grader:
    """Create a grader for the given task."""
    return Grader(task)

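Adding a fourth task is just another TaskConfig instance — a hypothetical sketch (TASK_D is not part of this commit):

```python
from tasks import TaskConfig, Grader

# Hypothetical seller-side task against a greedy buyer (not in the commit)
TASK_D = TaskConfig(
    name="task_d_seller",
    description="Seller-side negotiation against a greedy buyer",
    difficulty="medium",
    opp_type="greedy",
    agent_value=450,
    opponent_value=650,
    agent_role="seller",
    max_rounds=12,
    success_threshold=0.3,
)
print(Grader(TASK_D).max_possible)  # 200.0
```
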
test_env.py ADDED
@@ -0,0 +1,107 @@
"""Quick validation test for the environment — no API keys needed."""
import random
random.seed(42)

from env_wrapper import EnvWrapper, Observation
from tasks import ALL_TASKS, get_grader

print("=" * 50)
print("TEST 1: Multi-round negotiation")
print("=" * 50)
env = EnvWrapper(opp_type="fair", a_val=800, o_val=400, agent_role="buyer", max_rounds=10)
obs = env.reset()
print(f"Initial offer: {obs.current_offer}")

offers = [650, 600, 550, 500, 480, 450, 420, 400, 400, 400]
for i, price in enumerate(offers):
    obs, r, d, info = env.step("OFFER", price)
    opp_info = f"{obs.last_opponent_action} {obs.last_opponent_offer}"
    print(f"  R{i+1}: OFFER {price} -> Opp {opp_info} | reward={r:.2f} done={d}")
    if d:
        deal_type = info.get("deal_type", "none")
        deal_price = info.get("deal_price", "N/A")
        print(f"  >>> Deal: {deal_type}, price={deal_price}")
        break

print(f"  History entries: {len(obs.history)}")
print()

print("=" * 50)
print("TEST 2: Value-based clamping")
print("=" * 50)
from env_wrapper import Opponent
bugs = 0
for trial in range(100):
    opp = Opponent("fair", 500, "seller")
    for rnd in range(20):
        action, price = opp.get_response(rnd, 300, 250, "OFFER")
        if action == "OFFER" and price < 500:
            bugs += 1
            print(f"  BUG: trial={trial} round={rnd} seller offered {price} < 500")
            break
if bugs == 0:
    print("  PASS: Seller never offered below own value (100 trials x 20 rounds)")
else:
    print(f"  FAIL: {bugs} violations found")
print()

print("=" * 50)
print("TEST 3: Cumulative aggression penalty")
print("=" * 50)
env2 = EnvWrapper(opp_type="greedy", a_val=800, o_val=500, agent_role="buyer", max_rounds=20)
env2.reset()
# Make multiple aggressive offers (>150 away from opp_val=500, so <350 or >650)
for i in range(5):
    obs, r, d, info = env2.step("OFFER", 200)  # 300 away from 500 → aggressive
    print(f"  R{i+1}: penalty_so_far={env2.cumulative_aggression_penalty}")
    if d:
        break

expected_penalty = 10.0  # 5 rounds x 2.0 per round
actual_penalty = env2.cumulative_aggression_penalty
print(f"  Expected cumulative penalty: {expected_penalty}, Actual: {actual_penalty}")
print(f"  {'PASS' if actual_penalty == expected_penalty else 'FAIL'}")
print()

print("=" * 50)
print("TEST 4: Task configs and graders")
print("=" * 50)
for task in ALL_TASKS:
    grader = get_grader(task)
    # Test with sample rewards
    result = grader.grade([0.0, 0.0, 50.0], 3, True)
    print(f"  {task.name} ({task.difficulty}): score={result['score']}, success={result['success']}")
print()

print("=" * 50)
print("TEST 5: state() method")
print("=" * 50)
env3 = EnvWrapper(opp_type="fair", a_val=800, o_val=400, agent_role="buyer")
env3.reset()
s = env3.state()
assert isinstance(s, Observation), "state() must return Observation"
assert s.role == "buyer"
assert s.round == 0
print(f"  PASS: state() returns Observation with role={s.role}, round={s.round}")
print()

print("=" * 50)
print("TEST 6: ACCEPT and REJECT")
print("=" * 50)
# ACCEPT test
env4 = EnvWrapper(opp_type="fair", a_val=800, o_val=400, agent_role="buyer")
env4.reset()
# The seller (value 400) accepts 500 outright, ending the episode, so the
# ACCEPT below exercises the done-guard in step() (reward 0.0, deal_type None)
env4.step("OFFER", 500)
obs, r, d, info = env4.step("ACCEPT", 0)
print(f"  ACCEPT: reward={r:.2f} done={d} deal_type={info.get('deal_type')}")

# REJECT test
env5 = EnvWrapper(opp_type="fair", a_val=800, o_val=400, agent_role="buyer")
env5.reset()
obs, r, d, info = env5.step("REJECT", 0)
print(f"  REJECT: reward={r:.2f} done={d} deal_type={info.get('deal_type')}")
print()

print("=" * 50)
print("ALL TESTS COMPLETE")
print("=" * 50)

test_output.txt ADDED
@@ -0,0 +1,45 @@
==================================================
TEST 1: Multi-round negotiation
==================================================
Initial offer: 1000
  R1: OFFER 650 -> Opp ACCEPT 650 | reward=133.00 done=True
  >>> Deal: opponent_accepted, price=650
  History entries: 1

==================================================
TEST 2: Value-based clamping
==================================================
  PASS: Seller never offered below own value (100 trials x 20 rounds)

==================================================
TEST 3: Cumulative aggression penalty
==================================================
  R1: penalty_so_far=2.0
  R2: penalty_so_far=4.0
  R3: penalty_so_far=6.0
  R4: penalty_so_far=8.0
  R5: penalty_so_far=10.0
  Expected cumulative penalty: 10.0, Actual: 10.0
  PASS

==================================================
TEST 4: Task configs and graders
==================================================
  task_a_easy (easy): score=0.125, success=False
  task_b_medium (medium): score=0.25, success=False
  task_c_hard (hard): score=0.4167, success=True

==================================================
TEST 5: state() method
==================================================
  PASS: state() returns Observation with role=buyer, round=0

==================================================
TEST 6: ACCEPT and REJECT
==================================================
  ACCEPT: reward=0.00 done=True deal_type=None
  REJECT: reward=-50.00 done=True deal_type=agent_rejected

==================================================
ALL TESTS COMPLETE
==================================================