MridulNegi2005 / negotiation-openenv


title: Strategic Negotiation Environment
emoji: 🤝
colorFrom: blue
colorTo: purple
sdk: docker
tags:
  - openenv
pinned: false

Strategic Negotiation Simulation — Meta OpenEnv

A reinforcement learning simulation environment developed for the Meta PyTorch OpenEnv Hackathon. This project tests an AI agent's ability to negotiate prices under conditions of partial observability and uncertainty. The environment adheres strictly to the Meta OpenEnv Gymnasium specifications.

Project Overview

This environment evaluates an LLM agent on realistic B2B and marketplace price negotiation dynamics. Instead of traditional grid-based tasks, the agent is placed in multi-turn conversational exchanges where it must infer opponent constraints and optimize its final profit margins.

The core objectives for the agent are to:

- Optimize profit by closing favorable agreements.
- Deduce and adapt to hidden opponent parameters (such as minimum viable limits and behavioral models).
- Execute multi-step reasoning while adhering to strict environmental boundaries.

Architecture & Compliance

The codebase follows the required OpenEnv structural patterns.

- Pydantic Enforcement: The Observation and ActionModel definitions type-check all LLM outputs before they can influence the environment state.
- State Transparency: step(), reset(), and state() follow the OpenEnv interface, with step() returning (observation, reward, done, info).
- Decoupled Logic Components:
  - env_wrapper.py: Manages mathematical boundaries, execution logic, and reward distribution.
  - tasks.py: Defines task bracket parameters (ZOPA, or zone of possible agreement, margins and total rounds) and holds the programmatic Grader.
  - inference.py: Runs the LLM integration loop with automated parsing and fallback handling.

Interactive Space Definitions

Action Space

The agent has three declarative actions available during any given turn:

| Action | Execution Logic |
| --- | --- |
| `OFFER <price>` | Issues a counter-offer. The `<price>` parameter is constrained to integer values between 100 and 1000. |
| `ACCEPT` | Terminates the episode by agreeing to the `last_opponent_offer`. Calculates profit based on private valuation. |
| `REJECT` | Terminates the episode immediately with no deal, yielding a heavy penalty. |
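The action grammar above can be illustrated with a small parser. This is a minimal sketch, not the project's actual parsing code: `ParsedAction` and `parse_action` are hypothetical names, and treating input case-insensitively is an assumption. Only the three action keywords and the 100 to 1000 integer bound come from the table.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Price bounds stated in the action table.
PRICE_MIN, PRICE_MAX = 100, 1000

@dataclass(frozen=True)
class ParsedAction:
    """Hypothetical container for a validated action."""
    kind: str                    # "OFFER", "ACCEPT", or "REJECT"
    price: Optional[int] = None  # set only for OFFER

_OFFER_RE = re.compile(r"^OFFER\s+(\d+)$")

def parse_action(text: str) -> Optional[ParsedAction]:
    """Return a ParsedAction, or None when the text is not a valid action."""
    cleaned = text.strip().upper()  # assumption: case-insensitive input
    if cleaned in ("ACCEPT", "REJECT"):
        return ParsedAction(cleaned)
    match = _OFFER_RE.match(cleaned)
    if match:
        price = int(match.group(1))
        if PRICE_MIN <= price <= PRICE_MAX:
            return ParsedAction("OFFER", price)
    return None  # caller decides on a fallback action
```

Returning `None` instead of raising leaves the fallback policy (for example, repeating the previous offer) to the caller.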

Observation Space

The state() context exposes exactly what a real-world negotiator would know, while explicitly hiding the opponent's true target threshold.

| Field | Data Type | Implementation Detail |
| --- | --- | --- |
| `agent_value` | integer | The agent's private valuation target (its bottom line). |
| `current_offer` | integer | The active bid currently on the table. |
| `round` | integer | Current round number, counted toward `max_rounds`. |
| `max_rounds` | integer | Hard limit before a timeout termination. |
| `role` | string | Either "buyer" or "seller". Determines how profit is calculated. |
| `last_opponent_action` | string | Indicates "START", "OFFER", or "ACCEPT". |
| `last_opponent_offer` | integer | The integer value of the opponent's last proposal. |
| `history` | array | A step-by-step record of all previous bids in the episode. |
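To make the field list concrete, the sketch below mirrors it with a standard-library dataclass as a stand-in for the project's Pydantic model, and shows one plausible way `role` could drive the profit calculation. The profit convention (a buyer gains by paying below `agent_value`, a seller by receiving above it) and the `history` element type are assumptions; the source does not spell them out. `profit_if_accepted` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Observation:
    """Stand-in for the documented observation fields (not the real Pydantic model)."""
    agent_value: int
    current_offer: int
    round: int
    max_rounds: int
    role: str                  # "buyer" or "seller"
    last_opponent_action: str  # "START", "OFFER", or "ACCEPT"
    last_opponent_offer: int
    history: List[int] = field(default_factory=list)  # assumption: list of bid values

def profit_if_accepted(obs: Observation) -> int:
    """Profit from accepting the opponent's last offer, under the assumed convention."""
    if obs.role == "buyer":
        return obs.agent_value - obs.last_opponent_offer
    if obs.role == "seller":
        return obs.last_opponent_offer - obs.agent_value
    raise ValueError(f"unknown role: {obs.role!r}")
```

For example, a buyer valuing the item at 600 who accepts an offer of 500 would earn a profit of 100 under this convention.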

Reward Formulation

The environment employs both dense shaping signals and sparse terminal rewards to effectively direct the agent toward optimal strategies.

- Terminal Base Reward: Calculated as profit × (1 - (round / max_rounds)), so a profitable deal is worth more the earlier it closes.
- Negative Outcome Penalties:
  - Failing to reach an agreement, or issuing a REJECT, results in a -50.0 penalty.
  - Accepting a deal that results in negative profit yields an added -20.0 penalty.
- Aggression Stacking: Each offer that diverges wildly from reasonable limits incurs a -2.0 penalty, and these penalties accumulate over the episode.
- Dense Shaping: Small intermediate rewards (within ±2.0) encourage constructive moves toward the opponent's ZOPA limits.
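The terminal part of the reward can be sketched as follows. This is an illustration under stated assumptions, not the code in env_wrapper.py: `terminal_reward` is a hypothetical function, a failed negotiation is assumed to yield exactly -50.0, and the -20.0 penalty is assumed to be added to the time-discounted base reward. The aggression and dense-shaping terms are omitted because their triggering conditions are not specified here.

```python
def terminal_reward(profit: float, round_idx: int, max_rounds: int, agreed: bool) -> float:
    """Terminal reward under the assumptions stated in the lead-in."""
    if not agreed:
        return -50.0  # no deal or explicit REJECT
    # Base reward: profit scaled down linearly as rounds are used up.
    reward = profit * (1 - round_idx / max_rounds)
    if profit < 0:
        reward -= 20.0  # added penalty for accepting a losing deal
    return reward
```

With a profit of 100 closed in round 5 of 20, the base reward is 100 × (1 - 5/20) = 75.0; the same deal closed in round 15 would be worth 25.0.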

Evaluated Tasks

The environment defines three tasks of increasing difficulty, each driven by a different opponent behavior model.

| Task Profile | Classification | Opponent Bias | Margin of ZOPA | Turn Limit | Baseline Success Threshold |
| --- | --- | --- | --- | --- | --- |
| `task_a_easy` | Easy | Fair | Broad (400 units) | 20 Rounds | 0.2 |
| `task_b_medium` | Medium | Greedy | Constrained (200 units) | 15 Rounds | 0.3 |
| `task_c_hard` | Hard | Impatient | Narrow (120 units) | 6 Rounds | 0.4 |
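The task table maps naturally onto a small configuration structure. The sketch below is one possible encoding, not the actual contents of tasks.py: `TaskConfig`, `TASKS`, and the field names are hypothetical, while the values are copied from the table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskConfig:
    """Hypothetical record for one row of the task table."""
    name: str
    difficulty: str
    opponent: str
    zopa_width: int          # margin of the ZOPA, in price units
    max_rounds: int
    success_threshold: float

TASKS = {
    cfg.name: cfg
    for cfg in (
        TaskConfig("task_a_easy", "Easy", "Fair", 400, 20, 0.2),
        TaskConfig("task_b_medium", "Medium", "Greedy", 200, 15, 0.3),
        TaskConfig("task_c_hard", "Hard", "Impatient", 120, 6, 0.4),
    )
}
```

Difficulty rises along three axes at once: the ZOPA narrows, the round budget shrinks, and the success threshold increases.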

Installation and Execution

System Requirements

Python 3.11+
HuggingFace Inference Token (HF_TOKEN)

Local Environment Setup

Install required dependencies:

pip install -r requirements.txt

Assign environment variables:

export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="meta-llama/Meta-Llama-3-8B-Instruct"
export HF_TOKEN="insert_token_here"

Trigger the inference framework:

python inference.py
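A script driven by these three variables would typically validate them before starting. The following is a hedged sketch of that step, not the actual code in inference.py: `InferenceSettings` and `load_settings` are hypothetical names, and failing fast on a missing variable is an assumed design choice.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")

@dataclass(frozen=True)
class InferenceSettings:
    """Hypothetical bundle of the three documented environment variables."""
    api_base_url: str
    model_name: str
    hf_token: str

def load_settings(env: Mapping[str, str] = os.environ) -> InferenceSettings:
    """Read the required variables, raising early if any is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return InferenceSettings(env["API_BASE_URL"], env["MODEL_NAME"], env["HF_TOKEN"])
```

Accepting any mapping rather than reading `os.environ` directly keeps the function easy to test with a plain dictionary.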

Docker Deployment

docker build -t meta-openenv-negotiation .
docker run -e HF_TOKEN="insert_token_here" -e API_BASE_URL=https://router.huggingface.co/v1 -e MODEL_NAME=meta-llama/Meta-Llama-3-8B-Instruct meta-openenv-negotiation

Baseline Evaluation Scores

Model Tested: meta-llama/Meta-Llama-3-8B-Instruct
API Protocol: router.huggingface.co/v1

| Evaluated Task | Score Computed | Rounds Used | Agreement Reached |
| --- | --- | --- | --- |
| `task_a_easy` | 0.1138 | 1 | True |
| `task_b_medium` | 0.2333 | 1 | True |
| `task_c_hard` | 0.3472 | 1 | True |

Note: Baseline results come from the unmodified model, with no few-shot prompting. The model reached an agreement in the first round of every task.
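The baseline scores can be set against the success thresholds listed under Evaluated Tasks with a few lines of Python. This is only an illustrative check: it assumes a task counts as passed when its score is at least the threshold, which the source does not define, and `passes` is a hypothetical helper.

```python
# Success thresholds from the Evaluated Tasks table.
THRESHOLDS = {"task_a_easy": 0.2, "task_b_medium": 0.3, "task_c_hard": 0.4}

# Baseline scores from the table above.
BASELINE = {"task_a_easy": 0.1138, "task_b_medium": 0.2333, "task_c_hard": 0.3472}

def passes(task: str) -> bool:
    """Assumed pass criterion: score greater than or equal to the threshold."""
    return BASELINE[task] >= THRESHOLDS[task]

for task in THRESHOLDS:
    gap = THRESHOLDS[task] - BASELINE[task]
    print(f"{task}: score={BASELINE[task]:.4f} "
          f"threshold={THRESHOLDS[task]:.1f} passed={passes(task)} gap={gap:.4f}")
```

Under that criterion, each baseline score falls short of its threshold, by roughly 0.05 to 0.09.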

License

This project is licensed under the Apache License 2.0.