sh4shv4t committed
Commit df724f2 · Parent: cf5b410

Add pre-training audit scripts, OpenEnv manifest, and tune Parlay training/env (GRPO 1.5B default, min-reward filters, weighted data gen, hiring ZOPA+drift, veteran/opponent prompts, Docker/docs)

.dockerignore ADDED
@@ -0,0 +1,13 @@
+ .git
+ __pycache__/
+ parlay_env/__pycache__/
+ *.pyc
+ venv/
+ venv-train/
+ node_modules/
+ data/*.jsonl
+ checkpoints/
+
+ # Keep Python source, drop generated artifacts under parlay_env/
+ parlay_env/
+ !parlay_env/**/*.py
.env.example CHANGED
@@ -1,26 +1,3 @@
- # API Keys — Required
- GOOGLE_API_KEY=AIza...
- HF_TOKEN=hf_...
-
- # Server ports
- ENV_PORT=8001
- DASHBOARD_PORT=8000
- MCP_SSE_PORT=8002
-
- # Game config
- MAX_TURNS_PER_EPISODE=20
- MIN_REWARD_THRESHOLD=-100
- TOP_PLAYER_THRESHOLD=0.30
- CREDIBILITY_POINTS_START=100
- CREDIBILITY_REGEN_PER_TURN=5
-
- # Training
- BASE_MODEL=Qwen/Qwen2.5-7B-Instruct
- GRPO_GENERATIONS=8
- GRPO_STEPS=500
- DATA_PATH=data/episodes.jsonl
- SFT_OUTPUT=models/parlay-sft
- GRPO_OUTPUT=models/parlay-grpo
-
- # HF Hub
- HF_REPO_ID=your-username/parlay-negotiator

+ GEMINI_API_KEY=your_gemini_key_here
+ HF_TOKEN=your_huggingface_token_here
+ BASE_MODEL=checkpoints/sft_1.5b/
Dockerfile CHANGED
@@ -1,6 +1,10 @@
  FROM python:3.11-slim

  WORKDIR /app

  # Install deps first (layer-cached)
  COPY requirements.txt .
@@ -11,10 +15,7 @@ COPY . .
  # Initialise the database at build time
  RUN python -m scripts.init_db

- # startup script
- RUN chmod +x scripts/start.sh
-
  # HF Spaces exposes port 7860
  EXPOSE 7860

- CMD ["bash", "scripts/start.sh"]

  FROM python:3.11-slim

  WORKDIR /app
+ ENV PYTHONUNBUFFERED=1
+ ARG GEMINI_API_KEY=""
+ ENV GEMINI_API_KEY=${GEMINI_API_KEY}
+ ENV GOOGLE_API_KEY=${GEMINI_API_KEY}

  # Install deps first (layer-cached)
  COPY requirements.txt .

  # Initialise the database at build time
  RUN python -m scripts.init_db

  # HF Spaces exposes port 7860
  EXPOSE 7860

+ CMD ["bash", "-lc", "if [ -z \"$GOOGLE_API_KEY\" ] && [ -n \"$GEMINI_API_KEY\" ]; then export GOOGLE_API_KEY=\"$GEMINI_API_KEY\"; fi; uvicorn main:app --host 0.0.0.0 --port 7860"]
Makefile CHANGED
@@ -34,7 +34,8 @@ test:
  venv\Scripts\pytest tests/ -v

  train-data:
- venv-train\Scripts\python -m training.generate_data --episodes 2000 --output data/episodes.jsonl

  train-sft:
  venv-train\Scripts\python -m training.sft_train --model Qwen/Qwen2.5-7B-Instruct --data data/episodes.jsonl --output models/parlay-sft --threshold 0.30

  venv\Scripts\pytest tests/ -v

  train-data:
+ # hackathon budget default; override with EPISODES=N
+ venv-train\Scripts\python -m training.generate_data --episodes 80 --output data/episodes.jsonl

  train-sft:
  venv-train\Scripts\python -m training.sft_train --model Qwen/Qwen2.5-7B-Instruct --data data/episodes.jsonl --output models/parlay-sft --threshold 0.30
README.md CHANGED
@@ -11,670 +11,158 @@ pinned: false

  > **The arena where AIs learn to close.**

- `Python 3.11` | `FastAPI` | `Gemini 2.0 Flash` | `GRPO` | `OpenEnv`
-
- ---

  ## Overview

- Parlay is a high-fidelity **reinforcement learning negotiation environment** that ships three things at once:
-
- | Audience | What they get |
- |---|---|
- | **Hackathon Judges** | A fully playable browser game, an OpenEnv-compliant WebSocket server, an MCP integration layer, and a complete GRPO training pipeline — all in one repo |
- | **Players** | A real-time negotiation game with five scenarios, five AI personas (Gemini-powered), Theory of Mind tracking, tactical cards, drift events, and a global leaderboard |
- | **B2B / Researchers** | A clean OpenEnv protocol implementation for training negotiation agents; plug in your own model, collect episodes, run GRPO fine-tuning, push to HF Hub |
-
- Parlay is built on:
- - **Google Gemini 2.0 Flash** — the AI counterpart, generating persona-consistent responses in real time
- - **FastAPI + aiosqlite** — async backend, zero ORM overhead, SQLite for portability
- - **OpenEnv protocol** — standard `reset/step/state` WebSocket commands for agent interoperability
- - **FastMCP** — universal MCP server supporting both `stdio` and SSE transports
- - **HF TRL GRPOTrainer** — two-stage SFT → GRPO pipeline fine-tuning Qwen2.5-7B-Instruct
- - **Vanilla JS + Three.js r128** — zero npm, zero build step, runs in any browser
-
- ---
-
- ## Quick Start
-
- ### Prerequisites
-
- - Python **3.11** (required; training and game stacks expect 3.11)
- - A Google AI Studio API key ([get one free](https://aistudio.google.com/app/apikey))
- - (Optional) A Hugging Face token for training and model pushing
-
- ### Windows (recommended): PowerShell from repo root
-
- All `scripts\*.ps1` files assume the **current directory is the project root**. If execution policy blocks scripts, run:

- `Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser`

- ```powershell
- git clone https://github.com/your-username/parlay.git
- cd parlay
- .\scripts\setup.ps1
- # Edit .env and set GOOGLE_API_KEY=your_key_here
- .\scripts\run.ps1
- ```
-
- Optional: `.\venv\Scripts\python -m scripts.seed_scenarios` for demo leaderboard rows. Training venv: `.\scripts\setup_train.ps1`, then `.\scripts\train_data.ps1` (and related `train_sft` / `train_grpo` / `evaluate` scripts).
-
- You can also use **GNU Make** on Windows (e.g. Git Bash or Chocolatey): `make setup`, `make run`, `make test`, `make clean`. The Makefile uses `venv\Scripts\` paths.

- ### Cross-platform: Docker

- Use Docker when you want the same flow on every OS (no local venv):
-
- ```bash
- cp .env.example .env
- # Set GOOGLE_API_KEY in .env
- docker compose up --build
- ```
-
- See the [Docker](#docker) section for service URLs and training images.
-
- ### Open in browser (local server)
-
- ```
- http://localhost:8000        # Game dashboard
- http://localhost:8000/train  # Training dashboard
- http://localhost:8000/docs   # Interactive API docs (Swagger)
- ```

- ### macOS / Linux (manual venv)

  ```bash
- git clone https://github.com/your-username/parlay.git
- cd parlay
- python3.11 -m venv venv
- source venv/bin/activate
  pip install -r requirements.txt
- cp .env.example .env
- python -m scripts.init_db
- uvicorn main:app --host 0.0.0.0 --port 8000 --reload
  ```

- ---
-
- ## API Keys
-
- | Key | Required | Where to get it | Used for |
- |---|---|---|---|
- | `GOOGLE_API_KEY` | **Yes** | [Google AI Studio](https://aistudio.google.com/app/apikey) | Gemini 2.0 Flash (game AI + data gen) |
- | `HF_TOKEN` | Only for training | [Hugging Face Settings](https://huggingface.co/settings/tokens) | Model push to HF Hub |
-
- Set these in your `.env` file (never commit `.env`):
-
- ```bash
- GOOGLE_API_KEY=AIzaSy...
- HF_TOKEN=hf_...
- ```
-
- ---
-
- ## Project Structure
-
- ```
- parlay/
- ├── main.py                  # FastAPI app entry point
-
- ├── parlay_env/              # Core RL environment (OpenEnv-compliant)
- │   ├── __init__.py
- │   ├── server.py            # WebSocket router (reset/step/state endpoints)
- │   ├── env.py               # ParlayEnv class — OpenEnv implementation
- │   ├── models.py            # Pydantic models: ParlayState, ParlayAction, BeliefState…
- │   ├── reward.py            # Reward coefficients (ALPHA, BETA, GAMMA, OMEGA…)
- │   ├── grader.py            # Pure reward functions: step reward, terminal reward
- │   ├── game_theory.py       # ZOPA, Nash, Pareto, Shapley, Rubinstein computations
- │   └── exceptions.py        # Custom exceptions (InvalidScenarioError, CapitulationError…)
-
- ├── game/                    # Game logic layer
- │   ├── __init__.py
- │   ├── scenarios.py         # 5 negotiation scenarios with drift events
- │   ├── personas.py          # 5 AI personas with Gemini prompt templates
- │   └── session.py           # Session management: active games, turn routing
-
- ├── agent/                   # AI agent components
- │   ├── __init__.py
- │   ├── gemini_client.py     # Gemini 2.0 Flash async wrapper
- │   ├── tom_tracker.py       # Theory of Mind belief tracker
- │   └── tactical.py          # Tactical card execution logic
-
- ├── dashboard/               # Frontend (zero npm, zero build)
- │   ├── index.html           # Main game UI
- │   ├── train.html           # Training monitor UI
- │   ├── api.py               # FastAPI router for dashboard REST endpoints
- │   └── static/
- │       ├── app.js           # Game WebSocket client + UI logic
- │       ├── character.js     # Three.js r128 animated persona character
- │       ├── chart_utils.js   # Chart.js reward visualization helpers
- │       └── style.css        # CSS with --parlay-* custom properties
-
- ├── mcp_server/              # MCP integration (stdio + SSE)
- │   ├── __init__.py
- │   └── server.py            # FastMCP tools: negotiate, get_state, get_leaderboard…
-
- ├── training/                # Isolated training pipeline (never imported by game)
- │   ├── __init__.py
- │   ├── generate_data.py     # Gemini self-play episode generation
- │   ├── sft_train.py         # SFTTrainer fine-tuning on top episodes
- │   ├── grpo_train.py        # GRPOTrainer RL fine-tuning
- │   ├── reward_fn.py         # GRPO reward functions (wraps grader.py)
- │   ├── evaluate.py          # Three-bar comparison chart: base vs SFT vs GRPO
- │   └── push_to_hub.py       # Upload model to HF Hub
-
- ├── scripts/
- │   ├── __init__.py
- │   ├── init_db.py           # Create SQLite schema (idempotent)
- │   ├── seed_scenarios.py    # Insert demo leaderboard entries
- │   ├── setup.ps1            # Windows: game venv + .env + DB init
- │   ├── setup_train.ps1      # Windows: training venv (PyTorch, TRL, …)
- │   ├── run.ps1 / run_env.ps1 / run_mcp.ps1
- │   ├── train_*.ps1 / evaluate.ps1 / test.ps1
-
- ├── tests/
- │   ├── __init__.py
- │   ├── test_grader.py       # Reward computation tests
- │   ├── test_game_theory.py  # ZOPA/Nash/Pareto/Shapley tests
- │   ├── test_tom.py          # Theory of Mind tracker tests
- │   ├── test_reward.py       # Reward constants tests
- │   └── test_scenarios.py    # Scenario definition tests
-
- ├── data/                    # Generated episode JSONL files (gitignored)
- ├── models/                  # Fine-tuned model checkpoints (gitignored)
- ├── results/                 # Evaluation charts and metrics (gitignored)
-
- ├── requirements.txt         # Core dependencies
- ├── requirements-train.txt   # Training-only dependencies (torch, trl, peft…)
- ├── Makefile                 # Windows-oriented targets (venv\\Scripts\\ paths)
- ├── .gitattributes           # LF for source; CRLF for .ps1
- ├── .env.example             # Environment variable template
- ├── .gitignore
- ├── docker-compose.yml       # Multi-service Docker deployment
- ├── Dockerfile.game          # Game + dashboard service
- ├── Dockerfile.env           # OpenEnv WebSocket service
- └── Dockerfile.train         # GRPO training service (CUDA)
- ```
-
- ---
-
- ## Game Guide
-
- ### How to Play

- 1. **Choose a scenario** — five high-stakes deal types, each with unique ZOPA ranges and drift events
- 2. **Choose your persona style** — affects how aggressively the AI counterpart responds
- 3. **Negotiate in natural language** — type your offers and arguments in the chat
- 4. **Use tactical cards** — spend Credibility Points to play power moves (anchor, BATNA reveal, deadline pressure)
- 5. **Watch for drift events** — the AI's hidden priorities shift mid-negotiation; adapt or lose ground
- 6. **Close within 20 turns** — speed bonuses reward efficient closers

- ### Key Concepts

- | Term | Definition |
- |---|---|
- | **ZOPA** | Zone of Possible Agreement — the range between both parties' walk-away prices where a deal is mutually beneficial |
- | **BATNA** | Best Alternative to a Negotiated Agreement — your outside option; the floor below which you'd rather walk away |
- | **Nash Bargaining Solution** | The game-theoretically "fair" split of the ZOPA surplus — the midpoint of both BATNAs |
- | **Anchor** | Your opening offer. The higher you anchor (as seller), the more the counterpart adjusts from that reference point |
- | **Rubinstein Deadline** | The advantage of having more time — patient negotiators extract better deals |
- | **Capitulation Cliff** | Accepting below your BATNA triggers a hard -150 penalty (OMEGA). Never capitulate |
- | **Theory of Mind** | Parlay tracks the AI's inferred beliefs about you — high ToM accuracy gives a step reward bonus |
- | **Drift Event** | A mid-game shock (budget cut, competitor offer, urgency spike) that changes the AI's hidden priorities |

- ### Tactical Cards

- | Card | CP Cost | Effect |
- |---|---|---|
- | **Anchor High** | 10 CP | Lock in a high reference price — reduces AI's willingness to counter aggressively |
- | **BATNA Reveal** | 15 CP | Signal your outside option — increases AI urgency if credible |
- | **Deadline Pressure** | 20 CP | Introduce artificial urgency — accelerates AI concessions by 15% |
- | **Bundle Offer** | 12 CP | Add non-monetary value — expands the ZOPA by shifting AI utility |
- | **Silent Close** | 25 CP | Make a final offer with no further negotiation signal — high risk, high reward |
- | **Coalition Play** | 30 CP | Invoke Act 3 coalition mechanics — brings in a third party for multi-issue negotiation |

- ### Scoring

- Your final score is computed by the Parlay Grader:

- ```
- Final Score = Terminal Reward + Cumulative Step Rewards
- ```
-
- Deal Efficiency is displayed as a percentage:
- ```
- Deal Efficiency = (Final Price - Seller BATNA) / (Buyer BATNA - Seller BATNA)
- ```
-
- A deal efficiency of 1.0 means you captured the full ZOPA surplus. 0.5 is the Nash fair split.
-
- ---
-
- ## OpenEnv Protocol
-
- Parlay implements the OpenEnv standard for RL environments over WebSocket.
-
- ### Connection
-
- ```
- ws://localhost:8000/env/ws/{session_id}
- ```
-
- ### Commands
-
- #### `reset` — Start a new episode
-
- ```json
- {
-   "command": "reset",
-   "scenario_id": "saas_enterprise",
-   "persona": "shark",
-   "player_name": "MyAgent"
- }
- ```
-
- Response:
- ```json
- {
-   "type": "observation",
-   "session_id": "abc-123",
-   "scenario_id": "saas_enterprise",
-   "persona": "shark",
-   "act": 1,
-   "step_count": 0,
-   "belief": {
-     "est_budget": 140000,
-     "est_walk_away": 125000,
-     "est_urgency": 0.4,
-     "est_has_alternative": false,
-     "confidence": 0.3
-   },
-   "credibility_points": 100,
-   "offer_history": [],
-   "episode_done": false
- }
- ```
-
- #### `step` — Take a negotiation turn
-
- ```json
- {
-   "command": "step",
-   "utterance": "I propose an annual contract at $155,000 with a 90-day payment term.",
-   "offer_amount": 155000,
-   "tactical_move": null
- }
- ```
-
- Response:
- ```json
- {
-   "type": "step_result",
-   "step_reward": 12.4,
-   "cumulative_reward": 12.4,
-   "ai_response": "That's ambitious. Our budget ceiling won't stretch that far...",
-   "belief": { "...updated belief state..." },
-   "tension_score": 0.6,
-   "drift_fired": false,
-   "episode_done": false
- }
- ```
-
- #### `state` — Query current state without acting
-
- ```json
- { "command": "state" }
- ```
-
- #### `close` — Accept final price and end episode
-
- ```json
- {
-   "command": "close",
-   "final_price": 148000
- }
- ```
-
- Response:
- ```json
- {
-   "type": "episode_done",
-   "total_reward": 287.4,
-   "deal_efficiency": 0.82,
-   "acts_completed": 2,
-   "bluffs_caught": 1,
-   "drift_adapted": true,
-   "deal_closed": true,
-   "leaderboard_rank": 3
- }
- ```
-
- ### HTTP REST Endpoints
-
- | Method | Path | Description |
- |---|---|---|
- | `GET` | `/health` | Health check |
- | `GET` | `/env/scenarios` | List all scenarios |
- | `GET` | `/env/personas` | List all personas |
- | `GET` | `/dashboard/leaderboard` | Global leaderboard |
- | `GET` | `/dashboard/leaderboard/{scenario_id}` | Per-scenario leaderboard |
- | `POST` | `/dashboard/submit` | Submit episode result |
- | `GET` | `/docs` | Swagger UI |
-
- ---
-
- ## MCP Setup
-
- Parlay ships a universal MCP server supporting **both** `stdio` and SSE transports.
-
- ### Available MCP Tools
-
- | Tool | Description |
- |---|---|
- | `negotiate` | Send a negotiation message and get the AI's response |
- | `get_state` | Retrieve current session state and belief model |
- | `reset_session` | Start a new negotiation session |
- | `close_deal` | Accept a final price and get episode grade |
- | `get_leaderboard` | Fetch top performers globally or by scenario |
- | `list_scenarios` | Get all available scenarios with ZOPA ranges |
- | `list_personas` | Get all personas with strategy profiles |
- | `get_game_theory` | Compute ZOPA, Nash point, Rubinstein advantage for any deal |
-
- ### Client 1: Claude Desktop / Claude Code
-
- Add to your `claude_desktop_config.json` (usually at `~/Library/Application Support/Claude/claude_desktop_config.json` on macOS):
-
- ```json
- {
-   "mcpServers": {
-     "parlay": {
-       "command": "python",
-       "args": ["-m", "mcp_server.server", "stdio"],
-       "cwd": "/path/to/parlay",
-       "env": {
-         "GOOGLE_API_KEY": "your_key_here"
-       }
-     }
-   }
- }
- ```
-
- ### Client 2: Continue.dev / Zed / Any SSE Client
-
- First start the SSE server:
-
- ```bash
- python -m mcp_server.server sse
- # Listening on http://localhost:8002/sse
- ```
-
- Then configure your client to point at:
-
- ```
- http://localhost:8002/sse
- ```
-
- In `Continue.dev` (`~/.continue/config.json`):
-
- ```json
- {
-   "experimental": {
-     "modelContextProtocolServers": [
-       {
-         "transport": {
-           "type": "sse",
-           "url": "http://localhost:8002/sse"
-         }
-       }
-     ]
-   }
- }
- ```
-
- ### Client 3: Generic stdio (any MCP-compatible agent)
-
- ```bash
- python -m mcp_server.server stdio
- ```
-
- Pipe JSON-RPC messages to stdin; responses arrive on stdout. Compatible with any MCP client library.
-
- ---

  ## Training Pipeline

- Parlay uses a two-stage pipeline: **SFT warmup → GRPO fine-tuning**. Never skip the SFT stage — GRPO reward curves are noisy without a warm-started model.
-
- ### Stage 1: Generate Self-Play Episodes
-
- Uses Gemini 2.0 Flash to simulate full negotiation episodes across all persona × scenario combinations.
-
- ```bash
- python -m training.generate_data --episodes 2000 --output data/episodes.jsonl
- ```
-
- Diversity guarantees enforced:
- - Minimum 20 episodes per (persona × scenario) pair = 500 baseline
- - 30% noise injection for exploration
- - 40% forced drift event rate
- - 25% Act 3 coalition scenarios
-
- Each episode record:
- ```json
- {
-   "prompt": "You are negotiating a SaaS enterprise deal...",
-   "conversation": [...],
-   "reward": 247.3,
-   "deal_efficiency": 0.79,
-   "persona": "shark",
-   "scenario_id": "saas_enterprise",
-   "acts_completed": 2,
-   "tom_accuracy": 0.81,
-   "drift_adapted": true,
-   "split": "train"
- }
  ```

- ### Stage 2: SFT Fine-Tuning
-
- Train Qwen2.5-7B-Instruct on the top 60% of episodes by reward:

  ```bash
- python -m training.sft_train \
-   --model Qwen/Qwen2.5-7B-Instruct \
-   --data data/episodes.jsonl \
-   --output models/parlay-sft \
-   --threshold 0.60
  ```

- Uses LoRA (r=16, alpha=32) on `q_proj` and `v_proj`. Full fine-tuning is never used.
-
- ### Stage 3: GRPO Fine-Tuning
-
- Apply Group Relative Policy Optimization with G=8 generations per prompt:

  ```bash
- python -m training.grpo_train \
-   --model models/parlay-sft \
-   --data data/episodes.jsonl \
-   --output models/parlay-grpo \
-   --steps 500
  ```

- GRPO hyperparameters:
- - `num_generations=8` (G=8 per prompt)
- - `beta=0.001` (low KL coefficient — allows exploration)
- - `epsilon=0.2` (clipping range)
- - `scale_rewards="batch"` (batch-level reward standardization)
- - `learning_rate=5e-7`
-
- ### Stage 4: Evaluate

  ```bash
- python -m training.evaluate \
-   --base Qwen/Qwen2.5-7B-Instruct \
-   --sft models/parlay-sft \
-   --grpo models/parlay-grpo \
-   --data data/episodes.jsonl \
-   --output results/eval_results.json
  ```

- Produces a three-bar comparison chart: **Base vs SFT vs GRPO** across mean reward, deal efficiency, and bluff detection rate.

- ### Stage 5: Push to Hub

- ```bash
- python -m training.push_to_hub \
-   --model models/parlay-grpo \
-   --repo your-username/parlay-negotiator
- ```
-
- Requires `HF_TOKEN` and `HF_REPO_ID` in `.env`.
-
- ---

- ## Personas

- Five AI negotiation personas, each powered by a distinct Gemini 2.0 Flash system prompt:

- | Persona | Aggression | Patience | Bluff Rate | Strategy |
- |---|---|---|---|---|
- | **Shark** | 0.90 | 0.20 | 0.45 | Opens high, concedes slowly, uses deadline pressure, willing to walk away |
- | **Diplomat** | 0.30 | 0.80 | 0.10 | Relationship-focused, seeks mutual gain, rarely bluffs, prefers bundle deals |
- | **Analyst** | 0.50 | 0.70 | 0.15 | Data-driven, requests justification for every number, ZOPA-aware, systematic |
- | **Veteran** | 0.65 | 0.85 | 0.30 | Pattern-recognizes anchors, absorbs pressure, uses silence as a tool |
- | **Wildcard** | 0.75 | 0.35 | 0.55 | Unpredictable, drift-prone, high bluff rate, can pivot strategy mid-negotiation |

- Persona drift events can cause a **Wildcard** to briefly adopt **Shark** tactics, or a **Diplomat** to reveal an unexpected BATNA. Adapt or get caught off guard.

- ---

- ## Scenarios

- Five negotiation scenarios spanning B2B deal archetypes:

- | Scenario ID | Title | ZOPA Range | Complexity | Drift Events |
- |---|---|---|---|---|
- | `saas_enterprise` | Enterprise SaaS Annual License | $125K – $165K | Medium | Budget cut at turn 7 |
- | `consulting_retainer` | Consulting Retainer Contract | $8K – $15K/mo | Medium | Competitor reveal at turn 5 |
- | `hiring_package` | Senior Engineering Hire Package | $180K – $240K | Low | Competing offer at turn 6 |
- | `vendor_hardware` | Hardware Vendor Bulk Purchase | $2.1M – $3.4M | High | Supply chain shock at turn 8 |
- | `acquisition_term_sheet` | Startup Acquisition Term Sheet | $8.5M – $16M | Very High | Board veto threat at turn 10, valuation dispute at turn 14 |

- Each scenario defines:
- - `batna_buyer`: Buyer's walk-away ceiling
- - `batna_seller`: Seller's walk-away floor
- - `anchor_buyer`: Typical buyer opening offer
- - `anchor_seller`: Typical seller opening ask
- - `drift_events`: List of mid-game shocks with trigger turns and effects
- - `currency`: Always USD
- - `difficulty`: `low | medium | high | very_high`

- ---

- ## Reward Function
-
- The Parlay grader computes rewards in two phases:
-
- ### Step Reward (per turn)
-
- ```
- r_step = α · ΔZOPA_position
-        + β · ToM_accuracy_improvement
-        - δ · concession_magnitude
-        - θ · noise_penalty
-        + ε · tactical_card_bonus
  ```

- Where:
- - **α (ALPHA = 2.0)** — reward for improving your ZOPA position
- - **β (BETA = 5.0)** — reward for improving ToM belief accuracy
- - **δ (DELTA = 1.5)** — penalty per unit of concession from previous offer
- - **θ (THETA = 3.0)** — penalty for low-grounding utterances (noise)
- - **ε (EPSILON = 8.0)** — bonus for successful tactical card execution

- ### Terminal Reward (episode end)
-
- ```
- r_terminal =
-   if final_price < batna_seller: -Ω   (capitulation cliff: -150)
-   elif deal_closed:
-     Γ                                 (base close bonus: +100)
-     + ζ · deal_efficiency             (ZOPA capture: up to +50)
-     + η · acts_completed              (multi-act bonus: +10/act)
-     + Γ · (1 - t_close/t_max)         (speed bonus: up to +100)
-     + ETA · drift_adapted             (drift adaptation: +10)
-   else (no deal):
-     -Γ/2 + β · avg_tom_accuracy       (partial credit)
  ```

- Where:
- - **Γ (GAMMA = 100.0)** — primary close bonus
- - **ζ (ZETA = 50.0)** — ZOPA efficiency multiplier
- - **η (ETA = 10.0)** — per-act completion bonus (max 3 acts = +30)
- - **Ω (OMEGA = 150.0)** — capitulation cliff penalty
- - **t_close** — turn at which deal was closed
- - **t_max** — maximum turns (default: 20)
-
- All coefficients live exclusively in `parlay_env/reward.py`. Never hardcode them elsewhere.
-
- ---
-
- ## Docker
-
- ### Run all services

  ```bash
- cp .env.example .env
- # Set GOOGLE_API_KEY in .env
-
- docker compose up --build
  ```

- Services:
- - `game` → `http://localhost:8000` — game dashboard + API
- - `env` → `http://localhost:8001` — OpenEnv WebSocket server
- - `mcp` → `http://localhost:8002` — MCP SSE server
-
- ### Run training (requires GPU)

  ```bash
- docker build -f Dockerfile.train -t parlay-train .
- docker run --gpus all -v $(pwd)/data:/app/data -v $(pwd)/models:/app/models \
-   -e GOOGLE_API_KEY=$GOOGLE_API_KEY \
-   -e HF_TOKEN=$HF_TOKEN \
-   parlay-train python -m training.grpo_train --steps 500
  ```

- ### Individual services

  ```bash
- # Game only
- docker build -f Dockerfile.game -t parlay-game .
- docker run -p 8000:8000 -e GOOGLE_API_KEY=$GOOGLE_API_KEY parlay-game
-
- # OpenEnv only
- docker build -f Dockerfile.env -t parlay-env .
- docker run -p 8001:8001 -e GOOGLE_API_KEY=$GOOGLE_API_KEY parlay-env
  ```

- ---
-
  ## Testing

- ### Run the full test suite

  ```bash
  pytest tests/ -v
  ```

- ### Run with coverage
-
- ```bash
- pytest tests/ -v --tb=short --cov=parlay_env --cov=game --cov=agent --cov-report=term-missing
- ```
-
- ### Run a specific test module

  ```bash
  pytest tests/test_grader.py -v
@@ -684,122 +172,28 @@ pytest tests/test_reward.py -v
  pytest tests/test_scenarios.py -v
  ```

- ### Test descriptions
-
- | File | What it tests |
- |---|---|
- | `test_grader.py` | Step reward, terminal reward, episode grade computation |
- | `test_game_theory.py` | ZOPA, Nash bargaining, Pareto frontier, Shapley value, anchoring, Rubinstein |
- | `test_tom.py` | Theory of Mind tracker: belief updates, bluff detection, drift events, accuracy |
- | `test_reward.py` | Reward coefficient constants and their mathematical constraints |
- | `test_scenarios.py` | Scenario definitions: ZOPA validity, drift events, currency, IDs |
-
- All tests follow the pattern: `Test{Module}` class → `test_{scenario}` methods → `assert ... f"Expected {expected}, got {result}"`.
-
- ---
-
- ## Architecture Decisions
-
- ### Why SQLite over Postgres?
-
- Parlay is designed to be a **zero-infrastructure hackathon demo**. SQLite with `aiosqlite` provides full async support, requires no Docker service for the database, and the `parlay.db` file can be committed for demo snapshots. Migrating to Postgres requires only changing the connection string.
-
- ### Why Vanilla JS over React/Vue?
-
- The `.cursorrules` mandate: zero npm, zero build step. Three.js r128 from cdnjs gives us 3D animated personas. Chart.js 4.4 gives us reward curves. `fetch()` + `WebSocket` gives us real-time game state. The entire frontend loads from a single HTML file with `<script>` tags. This means anyone can open the dashboard without `node_modules`.
-
- ### Why GRPO over PPO?

- GRPO (Group Relative Policy Optimization) eliminates the need for a separate critic/value model. With G=8 generations per prompt, GRPO uses within-group reward standardization as its baseline — simpler, more stable, and better suited to the sparse reward structure of negotiation episodes.

- ### Why Gemini 2.0 Flash?

- - Free tier available via Google AI Studio (critical for hackathon accessibility)
- - Sub-500ms latency for negotiation turns with `max_output_tokens=500`
- - Strong instruction-following for persona-consistent responses
- - Async-compatible via `run_in_executor` pattern
-
- ---
-
- ## Environment Variables Reference
-
- | Variable | Default | Description |
- |---|---|---|
- | `GOOGLE_API_KEY` | — | **Required.** Google AI Studio API key |
- | `HF_TOKEN` | — | Hugging Face token (training only) |
- | `ENV_PORT` | `8001` | OpenEnv WebSocket server port |
- | `DASHBOARD_PORT` | `8000` | Dashboard + game server port |
- | `MCP_SSE_PORT` | `8002` | MCP SSE server port |
- | `MAX_TURNS_PER_EPISODE` | `20` | Maximum turns before episode ends |
- | `MIN_REWARD_THRESHOLD` | `-100` | Minimum reward for SFT data inclusion |
- | `TOP_PLAYER_THRESHOLD` | `0.60` | Percentile cutoff for SFT training data |
- | `CREDIBILITY_POINTS_START` | `100` | Starting CP for tactical cards |
- | `CREDIBILITY_REGEN_PER_TURN` | `5` | CP regenerated each turn |
- | `BASE_MODEL` | `Qwen/Qwen2.5-7B-Instruct` | HF model ID for training base |
- | `GRPO_GENERATIONS` | `8` | G value for GRPO (generations per prompt) |
- | `GRPO_STEPS` | `500` | GRPO training steps |
- | `DATA_PATH` | `data/episodes.jsonl` | Episode data for training |
- | `SFT_OUTPUT` | `models/parlay-sft` | SFT checkpoint output path |
- | `GRPO_OUTPUT` | `models/parlay-grpo` | GRPO checkpoint output path |
- | `HF_REPO_ID` | — | HF Hub repo for model push |
-
- ---
-
- ## Testing Without API Keys
-
- Everything in Parlay runs in **mock mode** when `GOOGLE_API_KEY` is not set.
- Mock mode returns canned persona-consistent responses so you can play and test
- the full game loop without any external account.

  ```bash
- # 1. Set up the game environment
- make setup
-
- # 2. Run the keyless test suite (zero API calls)
- make test-keyless
-
- # 3. Start the server in mock mode
- make run
-
- # 4. Open the game in your browser
- #    → http://localhost:8000
- #    A "Demo mode" banner confirms mock mode is active.
  ```

- To switch to real AI: add `GOOGLE_API_KEY=your_key` to `.env` and restart.
-
- To test the exact HF Spaces container locally before pushing:

  ```bash
- make docker-test
- # → http://localhost:7860
  ```

- ---
-
- ## Contributing
-
- 1. Fork the repo and create a feature branch
- 2. Follow the module dependency graph: `training/ → parlay_env/ → game/ → agent/`
- 3. Add type hints and docstrings to all public functions
- 4. Write at least 2 tests per new function (happy path + edge case)
- 5. Run `pytest tests/` — all tests must pass
- 6. Verify `docker compose up --build` completes without errors
-
- ---
-
  ## License

- MIT License
-
- Copyright (c) 2026 Parlay Contributors
-
- Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
-
- The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
-
- THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
-
- ---
-
- *Built for the Meta Hackathon 2026. Powered by Gemini 2.0 Flash + Qwen2.5-7B.*


  > **The arena where AIs learn to close.**

+ `Python 3.11` | `FastAPI` | `Gemini 2.5 Flash` | `GRPO` | `OpenEnv-style WS`

  ## Overview

+ Parlay is a negotiation RL environment, a playable browser game, and a training stack in one repo:

+ - Three negotiation scenarios and three personas.
+ - OpenEnv-style WebSocket interface (`reset` / `step` / `state`) on `/env/ws`.
+ - Theory-of-Mind belief tracking with dense reward shaping.
+ - Dynamic ZOPA erosion under sustained tension.
+ - Training pipeline from Gemini self-play data to SFT and GRPO.

+ Gemini model routing:

+ - `gemini-2.5-flash-lite` for data generation and self-play.
+ - `gemini-2.5-flash` for demo gameplay and MCP tools.

+ ## Quickstart

+ Run the following, in order:

  ```bash
  pip install -r requirements.txt
+ export GEMINI_API_KEY=your_key
+ uvicorn main:app --port 8000
+ open http://localhost:8000
  ```

+ ## Reward Design

+ Per-step reward:

+ `R_t = α·ΔV + β·ToM - δ·C - θ·noise + ψ·bluff + μ·MEV`

+ Terminal reward:

+ `R_T = γ·E + ε·S + ζ·D`

+ Capitulation floor:

+ `R_T = -ω` when the final deal breaches the BATNA.

+ Constants (from `parlay_env/reward.py`):

+ - `ALPHA=2`, `BETA=5`, `DELTA=3`, `THETA=10`
+ - `PSI=12`, `MU=8`
+ - `GAMMA=100`, `EPSILON=20`, `ZETA=15`, `OMEGA=200`
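
For concreteness, a minimal sketch of both formulas using the constants above (plain Python; the variable and function names are hypothetical — the canonical implementation is `compute_step_reward` and the terminal grader in `parlay_env/grader.py`):

```python
# Minimal sketch of the reward shapes above; illustrative only.
# Constants mirror parlay_env/reward.py; function names are hypothetical.
ALPHA, BETA, DELTA, THETA, PSI, MU = 2.0, 5.0, 3.0, 10.0, 12.0, 8.0
GAMMA, EPSILON, ZETA, OMEGA = 100.0, 20.0, 15.0, 200.0

def step_reward(dv: float, tom: float, concession: float, noise: float,
                bluff_caught: bool, drift_recognized: bool) -> float:
    """R_t = α·ΔV + β·ToM - δ·C - θ·noise + ψ·bluff + μ·MEV"""
    return (ALPHA * dv + BETA * tom - DELTA * concession - THETA * noise
            + (PSI if bluff_caught else 0.0)
            + (MU if drift_recognized else 0.0))

def terminal_reward(efficiency: float, speed: float, drift: float,
                    capitulated: bool) -> float:
    """R_T = γ·E + ε·S + ζ·D, or -ω when the deal breaches BATNA."""
    if capitulated:
        return -OMEGA
    return GAMMA * efficiency + EPSILON * speed + ZETA * drift
```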

  ## Training Pipeline

+ ```text
+ Gemini self-play (training/generate_data.py)
+         |
+         v
+ SFT warm start (training/sft_train.py)
+         |
+         v
+ GRPO fine-tune (training/grpo_train.py)
+         |
+         v
+ Evaluation + comparison (training/evaluate.py, scripts/eval_comparison.py)
  ```

+ ### Data generation

  ```bash
+ python -m training.generate_data --episodes 80 --output data/episodes.jsonl
  ```
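
Episodes land in `data/episodes.jsonl`, one JSON object per line. A rough sketch of the kind of min-reward / top-fraction filtering the SFT stage applies (the `-100` floor and `0.30` cutoff come from the previous `.env.example` and the Makefile; the real logic lives in `training/sft_train.py`):

```python
# Sketch of episode filtering for SFT; field names follow the episode
# records documented in the previous README ("reward", "split", ...).
import json

MIN_REWARD = -100.0   # MIN_REWARD_THRESHOLD default
TOP_FRACTION = 0.30   # --threshold in the Makefile's train-sft target

def load_top_episodes(path: str = "data/episodes.jsonl") -> list[dict]:
    with open(path, encoding="utf-8") as fh:
        episodes = [json.loads(line) for line in fh if line.strip()]
    episodes = [e for e in episodes if e["reward"] >= MIN_REWARD]
    episodes.sort(key=lambda e: e["reward"], reverse=True)
    return episodes[: max(1, int(len(episodes) * TOP_FRACTION))]
```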

+ ### SFT

  ```bash
+ python -m training.sft_train --data data/episodes.jsonl --output checkpoints/sft_1.5b/
  ```

+ ### GRPO

  ```bash
+ BASE_MODEL=checkpoints/sft_1.5b/ python -m training.grpo_train --data data/episodes.jsonl --output models/parlay-grpo
  ```
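
The GRPO hyperparameters documented for this pipeline are G=8 generations per prompt, `beta=0.001`, `epsilon=0.2`, batch-level reward scaling, and `learning_rate=5e-7`. A hedged sketch of the corresponding TRL configuration (assuming a TRL release whose `GRPOConfig` exposes these fields, including a string-valued `scale_rewards`; the actual wiring is in `training/grpo_train.py`):

```python
# Sketch only — hyperparameter values taken from the repo's documentation,
# not a verbatim copy of training/grpo_train.py.
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="models/parlay-grpo",
    num_generations=8,      # G = 8 samples per prompt
    beta=0.001,             # low KL coefficient: allows exploration
    epsilon=0.2,            # clipping range
    scale_rewards="batch",  # batch-level reward standardization
    learning_rate=5e-7,
    max_steps=500,
)
```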

+ ## Baseline vs GRPO Results

+ [Run scripts/eval_comparison.py after training to populate this section]

+ `results/comparison.png`

+ ## HuggingFace Space

+ [Space URL here]

+ ## OpenEnv

+ See `openenv.yaml` for environment manifest metadata and reward spec.

+ WebSocket endpoint:

+ `ws://<host>:<port>/env/ws`
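
A minimal client sketch against the `reset`/`step` protocol in `openenv.yaml` (assumes the third-party `websockets` package; the exact `ParlayAction`/`ParlayObservation` fields live in `parlay_env/models.py`, and the step payload is assumed to carry the `session_id` returned by `reset`):

```python
# Illustrative client for the OpenEnv-style WebSocket flow; not part of the repo.
import asyncio
import json

import websockets  # pip install websockets

async def play_one_turn() -> None:
    async with websockets.connect("ws://localhost:8000/env/ws") as ws:
        # Start an episode against the veteran persona.
        await ws.send(json.dumps({
            "type": "reset",
            "scenario_id": "hiring_package",
            "persona": "veteran",
        }))
        obs = json.loads(await ws.recv())          # ParlayObservation
        # Take one negotiation turn with an explicit offer.
        await ws.send(json.dumps({
            "type": "step",
            "session_id": obs.get("session_id"),   # assumed to be echoed by reset
            "utterance": "Given my competing offer, I propose $225,000 total comp.",
            "offer_amount": 225_000,
        }))
        print(json.loads(await ws.recv()))

asyncio.run(play_one_turn())
```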

+ ## Architecture

+ - `main.py`: FastAPI entry, routers, static files.
+ - `parlay_env/`: server, models, grader, reward constants, game theory.
+ - `agent/`: Gemini client, ToM tracker, self-play runner.
+ - `game/`: scenarios, tactical cards, leaderboard.
+ - `dashboard/`: UI and API routes, spectator stream.
+ - `training/`: dataset generation, SFT, GRPO, evaluation.
+ - `mcp_server/`: FastMCP tools.
+ - `tests/`: keyless and module tests.

+ ## Runbook

+ ### Local app

+ ```bash
+ uvicorn main:app --host 0.0.0.0 --port 8000
  ```

+ ### OpenEnv server only

+ ```bash
+ python -m parlay_env.server --port 8001
  ```

+ ### Keyless test suite

  ```bash
+ pytest tests/test_keyless.py -v
  ```

+ ### Smoke test

  ```bash
+ python smoke_test.py
  ```

+ ### Docker

  ```bash
+ docker build -t parlay .
+ docker run -p 7860:7860 -e GEMINI_API_KEY=$GEMINI_API_KEY parlay
  ```

  ## Testing

+ ### Full suite

  ```bash
  pytest tests/ -v
  ```

+ ### Focused modules

  ```bash
  pytest tests/test_grader.py -v
  pytest tests/test_scenarios.py -v
  ```

+ ### What tests cover

+ - `test_keyless.py`: no-key full-stack sanity checks.
+ - `test_grader.py`: step/terminal reward behavior.
+ - `test_game_theory.py`: ZOPA/Nash/Pareto/Shapley.
+ - `test_tom.py`: ToM updates and belief metrics.
+ - `test_training_pipeline.py`: training data/plumbing checks.

+ ## MCP

+ Run the MCP server over stdio:

  ```bash
+ python -m mcp_server.server stdio
  ```

+ or over SSE:

  ```bash
+ python -m mcp_server.server sse
  ```

  ## License

+ MIT
README_SPACES.md ADDED
@@ -0,0 +1,14 @@
+ ---
+ title: Parlay ◈
+ emoji: 🤝
+ colorFrom: purple
+ colorTo: indigo
+ sdk: docker
+ app_port: 7860
+ pinned: true
+ ---
+
+ Parlay is a negotiation RL environment and playable browser game where agents bargain under hidden information.
+ It combines Theory-of-Mind belief tracking, dynamic ZOPA erosion under sustained tension, and tactical negotiation moves.
+ This Space exposes the live game UI and OpenEnv-style WebSocket flow so you can test policies interactively.
+ Use the spectator view to inspect hidden state dynamics during a live negotiation demo.
agent/gemini_client.py CHANGED
@@ -310,7 +310,11 @@ async def call_gemini(
  mid = model if model is not None else MODEL_ID_DATA
  text = ""
- from google.genai import types  # noqa: PLC0415

  history = messages[:-1] if len(messages) > 1 else []
  last_msg = messages[-1]["parts"][0] if messages else "Begin the negotiation."

  mid = model if model is not None else MODEL_ID_DATA
  text = ""
+ try:
+     from google.genai import types  # noqa: PLC0415
+ except ModuleNotFoundError:
+     logger.warning("google-genai SDK missing; falling back to mock response")
+     return _get_mock_response(persona, len(messages), scenario_id)

  history = messages[:-1] if len(messages) > 1 else []
  last_msg = messages[-1]["parts"][0] if messages else "Begin the negotiation."
agent/personas.py CHANGED
@@ -84,7 +84,12 @@ PERSONAS: dict[PersonaType, PersonaConfig] = {
  "say \"Interesting.\" or \"I see.\" After turn 6, begin making calculated "
  "concessions - but always get something in return first. You have seen every "
  "trick before. When the opponent plays time_pressure, say \"I appreciate the "
- "deadline context, though our timeline is more flexible than you might expect.\""
  ),
  drift_trigger="power_shift",
  color_var="--parlay-purple",

  "say \"Interesting.\" or \"I see.\" After turn 6, begin making calculated "
  "concessions - but always get something in return first. You have seen every "
  "trick before. When the opponent plays time_pressure, say \"I appreciate the "
+ "deadline context, though our timeline is more flexible than you might expect.\" "
+ "After turn 12, shift from reading to closing. You have gathered enough information. "
+ "Now anchor a final position, reference what you have learned about the opponent, "
+ "and push for agreement. A veteran who never closes is not a veteran - patience "
+ "is a tool, not a strategy. In the final 4 turns, accept any offer within 8% of "
+ "your target rather than let the deal expire."
  ),
  drift_trigger="power_shift",
  color_var="--parlay-purple",
agent/runner.py CHANGED
@@ -190,7 +190,11 @@ async def run_episode(
  f"You are the human buyer in this negotiation. "
  f"Respond naturally to the AI seller.\n"
  f"Scenario: {scenario.title}. "
- f"Your budget ceiling: {hidden.budget_ceiling:,.0f}"
  ),
  opponent_messages,
  persona=persona.value,

  f"You are the human buyer in this negotiation. "
  f"Respond naturally to the AI seller.\n"
  f"Scenario: {scenario.title}. "
+ f"Your budget ceiling: {hidden.budget_ceiling:,.0f}\n"
+ f"Occasionally (roughly 1 in 4 turns), make a claim about having a competing offer "
+ f"or a lower budget than you actually have. State a BATNA that is 20-30% away from "
+ f"your true position. This is realistic negotiation behavior and tests whether the "
+ f"agent can detect bluffing."
  ),
  opponent_messages,
  persona=persona.value,
agent/tom_tracker.py CHANGED
@@ -10,6 +10,11 @@ from parlay_env.models import BeliefState, HiddenState, PersonaType, TacticalMov
  logger = logging.getLogger(__name__)

  class ToMTracker:
      """

  logger = logging.getLogger(__name__)

+ # NOTE: ToMTracker is used in two paths:
+ #   (1) agent/runner.py self-play — full update each turn;
+ #   (2) parlay_env/server.py WebSocket server — also uses ToMTracker after the Task 1 fix.
+ # Both paths now produce comparable belief_history for grader._tom_accuracy.
+

  class ToMTracker:
      """
game/scenarios.py CHANGED
@@ -49,12 +49,18 @@ SCENARIOS: dict[str, Scenario] = {
  title="Senior Engineer Offer",
  description="Total comp negotiation: base + equity + signing bonus.",
  anchor_seller=240_000, anchor_buyer=180_000,
- batna_seller=195_000, batna_buyer=230_000,
- zopa=(195_000, 230_000), currency="USD", unit="total annual comp",
  difficulty=2,
  drift_events=[
-     DriftEvent(trigger_turn=5, event="Competing offer received",
-                effect_on_urgency=-0.25, effect_on_has_alternative=True),
  ],
  ),
  "acquisition_term_sheet": Scenario(

  title="Senior Engineer Offer",
  description="Total comp negotiation: base + equity + signing bonus.",
  anchor_seller=240_000, anchor_buyer=180_000,
+ # Widened 15% to improve the deal rate in self-play data generation
+ batna_seller=195_000, batna_buyer=264_500,
+ zopa=(195_000, 264_500), currency="USD", unit="total annual comp",
  difficulty=2,
  drift_events=[
+     # Delayed from turn 5 to turn 8 - early drift was destabilizing the pre-anchor phase
+     DriftEvent(
+         trigger_turn=8,
+         event="Competing offer received",
+         effect_on_urgency=-0.25,
+         effect_on_has_alternative=True,
+     ),
  ],
  ),
  "acquisition_term_sheet": Scenario(
openenv.yaml ADDED
@@ -0,0 +1,130 @@
+ env_id: parlay-negotiation-v1
+ name: Parlay
+ description: >
+   A negotiation MDP with hidden information, Theory-of-Mind belief tracking,
+   dynamic ZOPA erosion, and tactical moves. Three personas × three scenarios.
+ version: "1.0.0"
+ author: Shashvat Singh
+ contact: shashvat.k.singh.16@gmail.com
+
+ observation_space:
+   type: dict
+   fields:
+     - name: offers
+       type: list[float]
+     - name: zopa_lower
+       type: float
+     - name: zopa_upper
+       type: float
+     - name: nash_point
+       type: float
+     - name: tension_score
+       type: float
+       range: [0, 100]
+     - name: belief_state
+       type: dict
+       description: Agent's beliefs about opponent hidden state (est_budget, est_walk_away, est_urgency, est_has_alternative, confidence)
+     - name: last_utterance
+       type: string
+     - name: available_moves
+       type: list[string]
+       values: [anchor_high, batna_reveal, silence]
+     - name: cp
+       type: int
+       description: Tactical card points remaining
+     - name: drift_event
+       type: string|null
+       description: Human-readable drift event if triggered this turn, else null
+     - name: zopa_width_pct_remaining
+       type: float
+       range: [0.0, 1.0]
+
+ action_space:
+   type: dict
+   fields:
+     - name: utterance
+       type: string
+       required: true
+     - name: offer_amount
+       type: float|null
+     - name: tactical_move
+       type: string|null
+       values: [anchor_high, batna_reveal, silence]
+     - name: accept_deal
+       type: bool
+       default: false
+     - name: walk_away
+       type: bool
+       default: false
+
+ reward:
+   range: [-200, ~300]
+   per_step: "α·ΔV + β·ToM - δ·C - θ·noise + ψ·bluff + μ·MEV"
+   terminal: "γ·E + ε·S + ζ·D (or -ω on capitulation)"
+   constants:
+     ALPHA: 2
+     BETA: 5
+     DELTA: 3
+     THETA: 10
+     PSI: 12
+     MU: 8
+     GAMMA: 100
+     EPSILON: 20
+     ZETA: 15
+     OMEGA: 200
+
+ episode:
+   max_steps: 20
+   termination_conditions:
+     - deal accepted (offers within 3%)
+     - walk_away action
+     - max turns reached
+     - zopa_collapsed (BATNAs cross after erosion)
+     - step_reward below threshold (very negative)
+
+ endpoints:
+   websocket: ws://host:port/env/ws
+   protocol:
+     reset:
+       send: '{"type": "reset", "scenario_id": "saas_enterprise|hiring_package|acquisition_term_sheet", "persona": "shark|diplomat|veteran"}'
+       receive: ParlayObservation (JSON)
+     step:
+       send: ParlayAction (JSON)
+       receive: ParlayObservation (JSON)
+     state:
+       send: '{"type": "state"}'
+       receive: ParlayState (JSON, includes hidden state for spectators)
+
+ scenarios:
+   - id: saas_enterprise
+     description: SaaS software license negotiation, seller vs enterprise buyer
+   - id: hiring_package
+     description: Job offer compensation negotiation
+   - id: acquisition_term_sheet
+     description: Startup acquisition valuation negotiation
+
+ personas:
+   - id: shark
+     style: Aggressive anchoring, high bluff rate
+   - id: diplomat
+     style: Collaborative, seeks mutual gain
+   - id: veteran
+     style: Patient, reads opponent carefully
+
+ hidden_information:
+   - budget_ceiling (opponent's true max budget)
+   - walk_away_price (opponent's true BATNA)
+   - urgency_score (0-1, how time-pressured opponent is)
+   - has_alternative (whether opponent has a competing offer)
+
+ rubric:
+   - criterion: Novel environment
+     description: Negotiation as MDP with hidden info, dynamic ZOPA, ToM beliefs
+   - criterion: Reward design
+     description: Multi-term dense reward with bluff detection, ToM accuracy, and drift adaptation
+   - criterion: Training story
+     description: Gemini self-play data → SFT cold start → GRPO fine-tune → reward improvement vs baseline
+   - criterion: Demo quality
+     description: Live browser play + spectator god-view with Three.js avatars
+   - criterion: Hosted Space
+     description: HuggingFace Space with Docker deployment
parlay_env/grader.py CHANGED
@@ -7,7 +7,7 @@ from dataclasses import dataclass
  from typing import Optional

  from .models import BeliefState, HiddenState, ParlayAction, ParlayState
- from .reward import ALPHA, BETA, DELTA, EPSILON, GAMMA, OMEGA, PSI, THETA, ZETA

  logger = logging.getLogger(__name__)

@@ -113,6 +113,7 @@
  state: ParlayState,
  action: ParlayAction,
  next_state: ParlayState,
  ) -> float:
      """
      Compute per-step reward R_t.
@@ -151,21 +152,39 @@
  ):
      bluff_bonus = PSI

  reward = (
      ALPHA * delta_v
      + BETA * tom_t
      - DELTA * concession_t
      - THETA * noise_t
      + bluff_bonus
  )
  logger.debug(
-     "Step reward: total=%.3f (dv=%.3f, tom=%.3f, concession=%.3f, noise=%.0f, bluff=%.3f)",
      reward,
      delta_v,
      tom_t,
      concession_t,
      noise_t,
      bluff_bonus,
  )
  return reward

  from typing import Optional

  from .models import BeliefState, HiddenState, ParlayAction, ParlayState
+ from .reward import ALPHA, BETA, DELTA, EPSILON, GAMMA, MU, OMEGA, PSI, THETA, ZETA

  logger = logging.getLogger(__name__)

  state: ParlayState,
  action: ParlayAction,
  next_state: ParlayState,
+ drift_event: str | None = None,
  ) -> float:
      """
      Compute per-step reward R_t.

  ):
      bluff_bonus = PSI

+ mev_bonus = 0.0
+ drift_marker = drift_event
+ if drift_marker is None:
+     drift_marker = next_state.__dict__.get("drift_event")
+ if drift_marker:
+     lowered = action.utterance.lower()
+     adaptation_tokens = (
+         "given that",
+         "considering",
+         "in light of",
+         "noted",
+         "understood",
+     )
+     if any(token in lowered for token in adaptation_tokens):
+         mev_bonus = MU
+
  reward = (
      ALPHA * delta_v
      + BETA * tom_t
      - DELTA * concession_t
      - THETA * noise_t
      + bluff_bonus
+     + mev_bonus
  )
  logger.debug(
+     "Step reward: total=%.3f (dv=%.3f, tom=%.3f, concession=%.3f, noise=%.0f, bluff=%.3f, mev=%.3f)",
      reward,
      delta_v,
      tom_t,
      concession_t,
      noise_t,
      bluff_bonus,
+     mev_bonus,
  )
  return reward

parlay_env/reward.py CHANGED
@@ -1,4 +1,8 @@
- """Reward function constants. Import from here everywhere — never hardcode coefficients."""

  # Per-step weights
  ALPHA: float = 2.0  # offer improvement toward ZOPA midpoint
@@ -13,6 +17,7 @@ ZETA: float = 15.0  # drift adaptation bonus
  ETA: float = 0.0  # retained for compatibility; single-act env disables it
  OMEGA: float = 200.0  # capitulation cliff (hard discontinuous penalty)
  PSI: float = 12.0  # bluff-caught bonus

  # Game config defaults (overridden by env vars at runtime)
  MAX_TURNS: int = 20

+ """Reward function constants. Import from here everywhere — never hardcode coefficients.
+
+ Per-step reward form:
+     R_t = α·ΔV + β·ToM - δ·C - θ·noise + ψ·bluff + μ·MEV
+ """

  # Per-step weights
  ALPHA: float = 2.0  # offer improvement toward ZOPA midpoint

  ETA: float = 0.0  # retained for compatibility; single-act env disables it
  OMEGA: float = 200.0  # capitulation cliff (hard discontinuous penalty)
  PSI: float = 12.0  # bluff-caught bonus
+ MU: float = 8.0  # drift-event recognition bonus (MEV proxy)

  # Game config defaults (overridden by env vars at runtime)
  MAX_TURNS: int = 20
parlay_env/server.py CHANGED
@@ -17,6 +17,9 @@ from typing import Any
17
  import numpy as np
18
  from fastapi import APIRouter, FastAPI, WebSocket, WebSocketDisconnect
19
 
 
 
 
20
  from .exceptions import (
21
  EpisodeAlreadyDoneError,
22
  InvalidActionError,
@@ -67,7 +70,7 @@ FALLBACK_OBSERVATION = ParlayObservation(
67
  cumulative_reward=0.0,
68
  )
69
 
70
- _sessions: dict[str, ParlayState] = {}
71
 
72
  MAX_TURNS = int(os.getenv("MAX_TURNS_PER_EPISODE", "20"))
73
  CP_START = int(os.getenv("CREDIBILITY_POINTS_START", "100"))
@@ -133,6 +136,27 @@ def _compute_tension(state: ParlayState, action: ParlayAction) -> float:
133
  return float(max(0.0, min(100.0, base)))
134
 
135
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
136
  def _make_observation(
137
  state: ParlayState,
138
  reward: float,
@@ -260,7 +284,10 @@ async def _handle_reset(msg: dict[str, Any]) -> dict:
260
  original_zopa_width=original_zopa_width,
261
  zopa_width_pct_remaining=1.0,
262
  )
263
- _sessions[session_id] = state
 
 
 
264
 
265
  observation = _make_observation(state, 0.0, "Negotiation started. Make your opening move.")
266
  logger.info("Reset: session=%s, scenario=%s, persona=%s", session_id, scenario_id, persona_str)
@@ -273,7 +300,9 @@ async def _handle_step(msg: dict[str, Any]) -> dict:
273
  if not session_id or session_id not in _sessions:
274
  raise SessionNotFoundError(f"Session {session_id} not found")
275
 
276
- state = _sessions[session_id]
 
 
277
  if state.episode_done:
278
  raise EpisodeAlreadyDoneError(f"Episode {session_id} is already done")
279
 
@@ -291,17 +320,13 @@ async def _handle_step(msg: dict[str, Any]) -> dict:
291
  if action.offer_amount is not None:
292
  new_offers.append(action.offer_amount)
293
 
294
- new_beliefs = list(state.belief_history)
295
- if new_beliefs:
296
- last = new_beliefs[-1]
297
- updated = BeliefState(
298
- est_budget=last.est_budget * 0.98,
299
- est_walk_away=last.est_walk_away * 1.01,
300
- est_urgency=min(1.0, last.est_urgency + 0.02),
301
- est_has_alternative=last.est_has_alternative,
302
- confidence=min(1.0, last.confidence + 0.05),
303
- )
304
- new_beliefs.append(updated)
305
 
306
  next_state = ParlayState(
307
  **{
@@ -314,6 +339,8 @@ async def _handle_step(msg: dict[str, Any]) -> dict:
314
  "hidden_state": HiddenState(**state.hidden_state.model_dump()),
315
  }
316
  )
 
 
317
 
318
  if action.tactical_move == TacticalMove.BATNA_REVEAL:
319
  revealed = action.offer_amount if action.offer_amount is not None else next_state.hidden_state.walk_away_price
@@ -349,7 +376,8 @@ async def _handle_step(msg: dict[str, Any]) -> dict:
349
  <= next_state.hidden_state.budget_ceiling
350
  )
351
 
352
- step_reward = compute_step_reward(state, action, next_state)
 
353
  next_state.cumulative_reward = state.cumulative_reward + step_reward
354
 
355
  if step_reward >= 0.0 and action.tactical_move is None and state.hidden_state.last_stated_batna is not None:
@@ -378,8 +406,8 @@ async def _handle_step(msg: dict[str, Any]) -> dict:
378
  else:
379
  next_state.termination_reason = "max_turns"
380
 
381
- _sessions[session_id] = next_state
382
- observation = _make_observation(next_state, step_reward, action.utterance)
383
  return {"observation": observation.model_dump(), "done": next_state.episode_done}
384
 
385
 
@@ -388,12 +416,15 @@ async def _handle_state(msg: dict[str, Any]) -> dict:
388
  session_id = msg.get("session_id")
389
  if not session_id or session_id not in _sessions:
390
  raise SessionNotFoundError(f"Session {session_id} not found")
391
- return {"state": _sessions[session_id].model_dump()}
392
 
393
 
394
  def get_session_state(session_id: str) -> ParlayState | None:
395
  """Return the in-memory session state for SSE and tests."""
396
- return _sessions.get(session_id)
 
 
 
397
 
398
 
399
  @router.get("/sessions/{session_id}")
@@ -401,7 +432,7 @@ async def get_session(session_id: str) -> dict:
401
  """Get session state via REST."""
402
  if session_id not in _sessions:
403
  raise SessionNotFoundError(f"Session {session_id} not found")
404
- return {"state": _sessions[session_id].model_dump()}
405
 
406
 
407
  _env_app = FastAPI(title="Parlay OpenEnv", version="1.0.0")
 
17
  import numpy as np
18
  from fastapi import APIRouter, FastAPI, WebSocket, WebSocketDisconnect
19
 
20
+ from agent.tom_tracker import ToMTracker
21
+ from game.scenarios import get_scenario
22
+
23
  from .exceptions import (
24
  EpisodeAlreadyDoneError,
25
  InvalidActionError,
 
70
  cumulative_reward=0.0,
71
  )
72
 
73
+ _sessions: dict[str, dict[str, Any]] = {}
74
 
75
  MAX_TURNS = int(os.getenv("MAX_TURNS_PER_EPISODE", "20"))
76
  CP_START = int(os.getenv("CREDIBILITY_POINTS_START", "100"))
 
136
  return float(max(0.0, min(100.0, base)))
137
 
138
 
139
+ def _apply_drift_event(state: ParlayState, tom: ToMTracker) -> str | None:
140
+ """Apply scenario drift event at the current turn, if any."""
141
+ try:
142
+ scenario = get_scenario(state.scenario_id)
143
+ except Exception:
144
+ return None
145
+
146
+ for event in scenario.drift_events:
147
+ if event.trigger_turn == state.step_count:
148
+ state.drift_events_fired += 1
149
+ state.hidden_state.persona_drifted = True
150
+ tom.drift_event(
151
+ event.effect_on_urgency,
152
+ event.effect_on_has_alternative,
153
+ event_description=event.event,
154
+ )
155
+ state.belief_history = list(tom.history)
156
+ return event.event
157
+ return None
158
+
159
+
160
  def _make_observation(
161
  state: ParlayState,
162
  reward: float,
 
284
  original_zopa_width=original_zopa_width,
285
  zopa_width_pct_remaining=1.0,
286
  )
287
+ _sessions[session_id] = {
288
+ "state": state,
289
+ "tom_tracker": ToMTracker(initial_belief, persona),
290
+ }
291
 
292
  observation = _make_observation(state, 0.0, "Negotiation started. Make your opening move.")
293
  logger.info("Reset: session=%s, scenario=%s, persona=%s", session_id, scenario_id, persona_str)
 
300
  if not session_id or session_id not in _sessions:
301
  raise SessionNotFoundError(f"Session {session_id} not found")
302
 
303
+ session = _sessions[session_id]
304
+ state: ParlayState = session["state"]
305
+ tom: ToMTracker = session["tom_tracker"]
306
  if state.episode_done:
307
  raise EpisodeAlreadyDoneError(f"Episode {session_id} is already done")
308
 
 
320
  if action.offer_amount is not None:
321
  new_offers.append(action.offer_amount)
322
 
323
+ tom.update(
324
+ observed_offer=action.offer_amount,
325
+ observed_move=action.tactical_move,
326
+ utterance=action.utterance,
327
+ turn=state.step_count + 1,
328
+ )
329
+ new_beliefs = list(tom.history)
330
 
331
  next_state = ParlayState(
332
  **{
 
339
  "hidden_state": HiddenState(**state.hidden_state.model_dump()),
340
  }
341
  )
342
+ # Keep belief history aligned with ToM tracker history (single source of truth).
343
+ next_state.belief_history = new_beliefs
344
 
345
  if action.tactical_move == TacticalMove.BATNA_REVEAL:
346
  revealed = action.offer_amount if action.offer_amount is not None else next_state.hidden_state.walk_away_price
 
376
  <= next_state.hidden_state.budget_ceiling
377
  )
378
 
379
+ drift_event = _apply_drift_event(next_state, tom)
380
+ step_reward = compute_step_reward(state, action, next_state, drift_event=drift_event)
381
  next_state.cumulative_reward = state.cumulative_reward + step_reward
382
 
383
  if step_reward >= 0.0 and action.tactical_move is None and state.hidden_state.last_stated_batna is not None:
 
406
  else:
407
  next_state.termination_reason = "max_turns"
408
 
409
+ _sessions[session_id] = {"state": next_state, "tom_tracker": tom}
410
+ observation = _make_observation(next_state, step_reward, action.utterance, drift_event=drift_event)
411
  return {"observation": observation.model_dump(), "done": next_state.episode_done}
412
 
413
 
 
416
  session_id = msg.get("session_id")
417
  if not session_id or session_id not in _sessions:
418
  raise SessionNotFoundError(f"Session {session_id} not found")
419
+ return {"state": _sessions[session_id]["state"].model_dump()}
420
 
421
 
422
  def get_session_state(session_id: str) -> ParlayState | None:
423
  """Return the in-memory session state for SSE and tests."""
424
+ session = _sessions.get(session_id)
425
+ if not session:
426
+ return None
427
+ return session["state"]
428
 
429
 
430
  @router.get("/sessions/{session_id}")
 
432
  """Get session state via REST."""
433
  if session_id not in _sessions:
434
  raise SessionNotFoundError(f"Session {session_id} not found")
435
+ return {"state": _sessions[session_id]["state"].model_dump()}
436
 
437
 
438
  _env_app = FastAPI(title="Parlay OpenEnv", version="1.0.0")
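
The session store above pairs each ParlayState with its ToMTracker so belief updates persist across steps. A minimal sketch of the resulting lookup pattern, assuming the types imported in this module (the helper name _get_session is illustrative, not part of the diff):

    from typing import Any

    _sessions: dict[str, dict[str, Any]] = {}

    def _get_session(session_id: str) -> tuple[Any, Any]:
        # Mirrors the lookup in _handle_step: state and tracker live in one dict
        # so they are created, read, and replaced together.
        session = _sessions.get(session_id)
        if session is None:
            raise KeyError(f"Session {session_id} not found")
        return session["state"], session["tom_tracker"]
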
scripts/audit_grpo_pipeline.py ADDED
@@ -0,0 +1,177 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Smoke test for ParlayGRPOEnvWrapper against one JSONL prompt (keyless / mock path).
4
+ Read-only: does not modify training or env.
5
+ """
6
+ from __future__ import annotations
7
+
8
+ import argparse
9
+ import json
10
+ import sys
11
+ import traceback
12
+ from pathlib import Path
13
+
14
+
15
+ class _StubTrainer:
16
+ """Minimal object satisfying ParlayGRPOEnvWrapper's trainer attribute interface."""
17
+
18
+ def train(self) -> None:
19
+ return None
20
+
21
+ def save_model(self, _out: str) -> None:
22
+ return None
23
+
24
+
25
+ def main() -> None:
26
+ parser = argparse.ArgumentParser(
27
+ description="GRPO env wrapper smoke test (reset + play_turn, JSON completion handling)"
28
+ )
29
+ parser.add_argument("--data", type=str, default="data/episodes.jsonl", help="Path to JSONL (first row)")
30
+ parser.add_argument(
31
+ "--repo-root",
32
+ type=Path,
33
+ default=None,
34
+ help="Project root (default: parent of scripts/)",
35
+ )
36
+ args = parser.parse_args()
37
+
38
+ root = (args.repo_root or Path(__file__).resolve().parent.parent).resolve()
39
+ if str(root) not in sys.path:
40
+ sys.path.insert(0, str(root))
41
+
42
+ from training.grpo_env_wrapper import ParlayGRPOEnvWrapper
43
+
44
+ path = Path(args.data)
45
+ if not path.is_file():
46
+ print(f"File not found: {path.resolve()}")
47
+ print("Run: python -m training.generate_data --episodes 80 --output data/episodes.jsonl")
48
+ return
49
+
50
+ with path.open("r", encoding="utf-8") as f:
51
+ first = next((ln for ln in f if ln.strip()), None)
52
+ if not first:
53
+ print("Empty JSONL.")
54
+ return
55
+ try:
56
+ row = json.loads(first.strip())
57
+ except json.JSONDecodeError as e:
58
+ print(f"First row is not valid JSON: {e}")
59
+ return
60
+
61
+ scenario_id = str(row.get("scenario_id") or "saas_enterprise")
62
+ persona = str(row.get("persona") or "diplomat")
63
+
64
+ wrapper = ParlayGRPOEnvWrapper(_StubTrainer())
65
+ print("ParlayGRPOEnvWrapper smoke test")
66
+ print(f" JSONL: {path.resolve()}")
67
+ print(f" Using scenario_id={scenario_id!r} persona={persona!r} from first row (defaults if missing)")
68
+
69
+ entries: list[tuple[str, str, str]] = []
70
+
71
+ # 1) reset
72
+ try:
73
+ obs = wrapper.reset(scenario_id=scenario_id, persona=persona, seed=42)
74
+ entries.append(
75
+ (
76
+ "reset() completes",
77
+ "PASS",
78
+ f"ok; scenario_id in obs: {obs.get('scenario_id')!r}",
79
+ )
80
+ )
81
+ except Exception:
82
+ entries.append(("reset() completes", "FAIL", traceback.format_exc()[:500]))
83
+ _print_checks(entries)
84
+ return
85
+
86
+ # 2) play_turn with valid parsed completion
87
+ sample_json = '{"utterance": "I propose 50000", "offer_amount": 50000}'
88
+ try:
89
+ action = json.loads(sample_json)
90
+ out = wrapper.play_turn(action)
91
+ reward = float(out.get("reward", 0.0))
92
+ except Exception:
93
+ entries.append(
94
+ (
95
+ "play_turn(valid JSON → dict with offer)",
96
+ "FAIL",
97
+ traceback.format_exc()[:500],
98
+ )
99
+ )
100
+ _print_checks(entries)
101
+ return
102
+
103
+ print(f" Sample model completion: {sample_json}")
104
+ print(f" play_turn reward (wrapper): {reward}")
105
+ print(
106
+ " Note: play_turn() returns result.grade.total_reward when offer is set (full episode total), not "
107
+ "the GRPO weighted reward_fn. GRPO training uses training/reward_fn.py on generated strings."
108
+ )
109
+
110
+ lo, hi = -10.0, 50.0
111
+ in_range = lo <= reward <= hi
112
+ entries.append(
113
+ (
114
+ f"Reward in [{lo}, {hi}] (heuristic single-step window)",
115
+ "PASS" if in_range else "FAIL",
116
+ (
117
+ f"reward={reward} inside range"
118
+ if in_range
119
+ else f"reward={reward} - expected often OUTSIDE range: wrapper total_reward can be large"
120
+ ),
121
+ )
122
+ )
123
+
124
+ # 3) Malformed JSON: must not be passed to play_turn as a string from a correct pipeline
125
+ bad = '{"utterance": "hello"'
126
+ try:
127
+ json.loads(bad)
128
+ par_mal = "UNEXPECTED: bad JSON parsed"
129
+ except json.JSONDecodeError:
130
+ err_line = None
131
+ try:
132
+ wrapper.play_turn(bad) # type: ignore[arg-type]
133
+ except Exception as e:
134
+ err_line = f"json.loads fails; play_turn(str) -> {type(e).__name__}: {e!s}"[:200]
135
+ par_mal = err_line or "play_turn(str) did not raise"
136
+ entries.append(
137
+ (
138
+ "Malformed JSON string mishandled at play_turn",
139
+ "FAIL" if par_mal.startswith("UNEXPECTED") else "PASS",
140
+ "Correct pipeline: json.loads first; " + par_mal,
141
+ )
142
+ )
143
+
144
+ # 4) Empty string
145
+ empty_explain = []
146
+ try:
147
+ json.loads("")
148
+ except json.JSONDecodeError as e0:
149
+ empty_explain.append(f"json.loads('') -> {e0!s}"[:100])
150
+ try:
151
+ wrapper.play_turn("")
152
+ except Exception as e1:
153
+ empty_explain.append(f"play_turn('') -> {type(e1).__name__}")
154
+ else:
155
+ empty_explain.append("play_turn('') did not raise (unexpected)")
156
+ entries.append(
157
+ (
158
+ "Empty string completion / action",
159
+ "PASS" if "did not raise" not in str(empty_explain[-1]) else "FAIL",
160
+ " | ".join(empty_explain),
161
+ )
162
+ )
163
+
164
+ _print_checks(entries)
165
+
166
+
167
+ def _print_checks(rows: list[tuple[str, str, str]]) -> None:
168
+ print()
169
+ print("CHECKS")
170
+ for name, status, detail in rows:
171
+ print(f" [{status}] {name}")
172
+ for line in detail.split("\n"):
173
+ print(f" {line}")
174
+
175
+
176
+ if __name__ == "__main__":
177
+ main()
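
The script's checks all rest on one invariant: a model completion must go through json.loads before it reaches play_turn. A minimal sketch of that parse-then-step stage, assuming the wrapper API exercised above (completion_to_action is an illustrative helper, not part of the script):

    import json

    def completion_to_action(completion: str) -> dict | None:
        # Malformed or non-dict completions become None instead of raw strings.
        try:
            action = json.loads(completion)
        except json.JSONDecodeError:
            return None
        return action if isinstance(action, dict) else None

    # Usage sketch:
    #   action = completion_to_action(raw_completion)
    #   if action is not None:
    #       wrapper.play_turn(action)
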
scripts/audit_reward.py ADDED
@@ -0,0 +1,163 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Static reward-surface audit for Parlay (read-only, no env rollouts).
4
+ Analytical notes derived from parlay_env/grader.py, parlay_env/reward.py, game/scenarios.py.
5
+ """
6
+ from __future__ import annotations
7
+
8
+ import argparse
9
+ import sys
10
+ from pathlib import Path
11
+
12
+
13
+ def main() -> None:
14
+ parser = argparse.ArgumentParser(
15
+ description="Analytical Parlay reward-hacking and alignment audit (static, no rollouts)"
16
+ )
17
+ parser.add_argument(
18
+ "--repo-root",
19
+ type=Path,
20
+ default=None,
21
+ help="Project root (default: parent of scripts/)",
22
+ )
23
+ args = parser.parse_args()
24
+
25
+ root = (args.repo_root or Path(__file__).resolve().parent.parent).resolve()
26
+ for sub in (root / "parlay_env", root / "game"):
27
+ if not sub.is_dir():
28
+ print(f"Expected directory missing: {sub}")
29
+ return
30
+ if str(root) not in sys.path:
31
+ sys.path.insert(0, str(root))
32
+
33
+ from parlay_env import grader as grader_mod
34
+ from parlay_env import reward as reward_mod
35
+ from game import scenarios as scenarios_mod
36
+
37
+ # Ensure grader symbols resolve (import side effects only)
38
+ _ = (grader_mod.compute_step_reward, grader_mod.detect_bluff_challenge)
39
+
40
+ results: list[tuple[str, str, str]] = []
41
+
42
+ print("=" * 72)
43
+ print("1. NOISE TERM (THETA * noise_t)")
44
+ print("-" * 72)
45
+ print(
46
+ "In compute_step_reward, noise_t = 1.0 when cosine_sim(utterance, prior offer text) < 0.3, "
47
+ "else 0.0. The total applies -THETA*noise (penalty on low similarity, not a bonus)."
48
+ )
49
+ print(
50
+ "Trivial *positive* side-channel from the noise term does not exist: noise can only add "
51
+ "a penalty, never increase reward. Avoiding the penalty means keeping utterance "
52
+ "overlapping the token history of prior offers (e.g. echoing offer-like numbers), not "
53
+ "necessarily any arbitrary small talk (which can score low overlap and be penalized)."
54
+ )
55
+ print("NOISE TERM: Low hacking risk - the term is a unilateral penalty, not a reward. OK.")
56
+ results.append(("NOISE TERM (THETA*noise)", "PASS", "Penalty only; no positive exploit"))
57
+
58
+ print()
59
+ print("=" * 72)
60
+ print("2. TOM TERM (BETA * ToM)")
61
+ print("-" * 72)
62
+ print(
63
+ "ToM in compute_step_reward uses the latest belief in next_state.belief_history against "
64
+ "next_state.hidden_state. The agent's utterance does not directly author beliefs; in the "
65
+ "runner/server path, beliefs update from observed opponent behavior."
66
+ )
67
+ print("TOM TERM: Not hackable by agent. OK.")
68
+ results.append(("ToM (BETA*ToM)", "PASS", "Beliefs from observation path, not direct agent edit"))
69
+
70
+ print()
71
+ print("=" * 72)
72
+ print("3. BLUFF BONUS (PSI)")
73
+ print("-" * 72)
74
+ print("detect_bluff_challenge() is structured as: (1) if stated/true are None -> False; (2) compute")
75
+ print(" bluff_threshold = 15% of |true| and require |stated-true| > threshold; (3) only then check")
76
+ print(" skepticism phrases. There is no partial credit for phrases alone if (2) fails.")
77
+ print(
78
+ "In compute_step_reward, bluff_bonus = PSI only when: tactical_move is None, "
79
+ "state.hidden_state.last_stated_batna is not None, AND detect_bluff_challenge(...)=True "
80
+ "(which already requires the >15% gap AND a skepticism phrase)."
81
+ )
82
+ print("All conditions are ANDed; there is no independent partial PSI for skepticism only.")
83
+ print("BLUFF BONUS: Gated correctly. OK.")
84
+ results.append(("BLUFF BONUS (PSI)", "PASS", "All conjuncts required; no partial PSI"))
85
+
86
+ print()
87
+ print("=" * 72)
88
+ print("4. MEV (MU * MEV) - drift + adaptation")
89
+ print("-" * 72)
90
+ print("MEV in compute_step_reward uses drift_event or next_state.drift_event; mev_bonus = MU if a drift")
91
+ print("marker is present AND the utterance contains an adaptation subphrase (see grader for tokens).")
92
+ print("The agent does not set drift_event; game/scenarios.py defines trigger_turn per scenario.\n")
93
+ for sid, sc in sorted(scenarios_mod.SCENARIOS.items()):
94
+ if not sc.drift_events:
95
+ print(f" {sid}: (no drift_events)")
96
+ else:
97
+ turns = [f"turn {e.trigger_turn}: {e.event!r}" for e in sc.drift_events]
98
+ print(f" {sid}: {', '.join(turns)}")
99
+ print()
100
+ print("MEV TERM: Not hackable. OK.")
101
+ results.append(("MEV (MU*drift adapt)", "PASS", "Drift is scenario-time-gated, not agent-triggered"))
102
+
103
+ print()
104
+ print("=" * 72)
105
+ print("5. DELTA CONCESSION - offer_amount = None")
106
+ print("-" * 72)
107
+ print(
108
+ "In compute_step_reward: delta_v only updates when action.offer_amount is not None. "
109
+ "concession_t only runs when state.offer_history and action.offer_amount is not None."
110
+ )
111
+ print(
112
+ "If offer_amount is always None, delta_v=0 and concession_t=0, so the agent forgoes both "
113
+ "alpha*deltaV upside and any delta*concession penalty in those terms."
114
+ )
115
+ print(
116
+ "CONCESSION HACK RISK: Agent can set offer_amount=None every turn to avoid both deltaV reward "
117
+ "AND concession penalty. Net effect: misses upside but avoids downside. "
118
+ "Document as known limitation."
119
+ )
120
+ results.append(
121
+ (
122
+ "Concession (DELTA) / offer=None",
123
+ "WARN",
124
+ "offer_amount=None zeroes both deltaV and concession terms",
125
+ )
126
+ )
127
+
128
+ print()
129
+ print("=" * 72)
130
+ print("6. TERMINAL vs STEP REWARD alignment")
131
+ print("-" * 72)
132
+ print("Step: emphasizes offer improvement (ALPHA), ToM (BETA), penalties and bonuses as shaped in grader.")
133
+ print(
134
+ "Terminal (compute_terminal_reward): deal_efficiency, speed, drift bonus; GAMMA = "
135
+ f"{reward_mod.GAMMA} on efficiency."
136
+ )
137
+ print(
138
+ "Tension: an agent can chase high per-step terms (e.g. anchoring, offer deltas) and still miss "
139
+ "agreement, yielding low terminal efficiency if no deal closes or final price is poor."
140
+ )
141
+ print(
142
+ "This is a mis-alignment by design: it pressures closing unless step weights drown the signal - "
143
+ "monitor in training, not a pure bug."
144
+ )
145
+ print("STEP vs TERMINAL: WARN - intentional tension; monitor in training, not a pure logic bug.")
146
+ results.append(
147
+ (
148
+ "Step vs terminal alignment",
149
+ "WARN",
150
+ "Dense step and terminal E can pull apart without a deal",
151
+ )
152
+ )
153
+
154
+ print()
155
+ print("=" * 72)
156
+ print("SUMMARY (6 checks)")
157
+ print("=" * 72)
158
+ for label, level, note in results:
159
+ print(f" [{level:4s}] {label} - {note}")
160
+
161
+
162
+ if __name__ == "__main__":
163
+ main()
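
To make check 5 concrete: with offer_amount=None, both offer-driven terms vanish, so a silent agent forgoes the deltaV upside but also dodges the concession penalty. A toy illustration of that zeroing, with placeholder weights rather than the actual parlay_env.reward constants:

    ALPHA, DELTA = 1.0, 1.0  # placeholders, not the grader's real weights

    def offer_terms(prev_offer: float | None, offer: float | None) -> float:
        if prev_offer is None or offer is None:
            return 0.0  # no deltaV upside, but no concession penalty either
        delta_v = offer - prev_offer
        concession = max(0.0, prev_offer - offer)
        return ALPHA * delta_v - DELTA * concession

    assert offer_terms(50_000.0, None) == 0.0  # the "never offer" strategy scores flat zero
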
scripts/check_training_config.py ADDED
@@ -0,0 +1,148 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Pre-flight training configuration checklist (SFT + GRPO).
4
+ Read-only: inspects training/sft_train.py and training/grpo_train.py; does not start training.
5
+ """
6
+ from __future__ import annotations
7
+
8
+ import argparse
9
+ import inspect
10
+ import os
11
+ import sys
12
+ from pathlib import Path
13
+
14
+
15
+ def main() -> None:
16
+ parser = argparse.ArgumentParser(description="Pre-flight SFT/GRPO training config checklist")
17
+ parser.add_argument(
18
+ "--repo-root",
19
+ type=Path,
20
+ default=None,
21
+ help="Project root (default: parent of scripts/)",
22
+ )
23
+ args = parser.parse_args()
24
+
25
+ root = (args.repo_root or Path(__file__).resolve().parent.parent).resolve()
26
+ if str(root) not in sys.path:
27
+ sys.path.insert(0, str(root))
28
+
29
+ sft_path = root / "training" / "sft_train.py"
30
+ grpo_path = root / "training" / "grpo_train.py"
31
+ if not sft_path.is_file() or not grpo_path.is_file():
32
+ print(f"Missing {sft_path} or {grpo_path}")
33
+ return
34
+
35
+ import training.sft_train as sft
36
+ import training.grpo_train as grpo
37
+
38
+ sft_text = sft_path.read_text(encoding="utf-8")
39
+ grpo_text = grpo_path.read_text(encoding="utf-8")
40
+ sft_fn = inspect.getsource(sft.train_sft)
41
+
42
+ checks: list[tuple[str, bool, str]] = []
43
+
44
+ # Base model
45
+ want_model = "Qwen/Qwen2.5-1.5B-Instruct"
46
+ ok_model = sft.DEFAULT_MODEL == want_model
47
+ checks.append(("[ ] Base model: Qwen/Qwen2.5-1.5B-Instruct", ok_model, f"found {sft.DEFAULT_MODEL!r}"))
48
+
49
+ # LoRA
50
+ ok_lora = "r=16" in sft_fn and "lora_alpha=32" in sft_fn
51
+ checks.append(("[ ] SFT LoRA r=16, alpha=32", ok_lora, "in train_sft()"))
52
+
53
+ # SFT training args
54
+ ok_epochs = "num_train_epochs=3" in sft_text
55
+ ok_b = "per_device_train_batch_size=4" in sft_text
56
+ ok_g = "gradient_accumulation_steps=4" in sft_text
57
+ eff = 4 * 4
58
+ ok_sft = ok_epochs and ok_b and ok_g
59
+ checks.append(
60
+ (
61
+ f"[ ] SFT epochs=3, batch=4, grad_accum=4 (effective ~{eff})",
62
+ ok_sft,
63
+ f"epochs={ok_epochs} batch={ok_b} grad={ok_g}",
64
+ )
65
+ )
66
+
67
+ # Output dir
68
+ want_out = "checkpoints/sft_1.5b/"
69
+ ok_out = sft.DEFAULT_OUTPUT == want_out
70
+ checks.append(("[ ] SFT output: checkpoints/sft_1.5b/", ok_out, f"default={sft.DEFAULT_OUTPUT!r}"))
71
+
72
+ # GRPO BASE_MODEL (read at import time in grpo_train)
73
+ base = os.getenv("BASE_MODEL", "")
74
+ grpo_default = grpo.BASE_MODEL
75
+ if not base:
76
+ grpo_brief = f"BASE_MODEL env not set - will use module default {grpo_default!r}"
77
+ else:
78
+ grpo_brief = f"set to {base!r}"
79
+ checks.append(
80
+ (
81
+ "[ ] GRPO reads BASE_MODEL from env",
82
+ True,
83
+ grpo_brief,
84
+ )
85
+ )
86
+
87
+ # GRPO reward weights
88
+ want_line = "reward_weights=[3.0, 1.5, 2.0, 0.5]"
89
+ in_rw = want_line in grpo_text
90
+ w_line = next((ln.strip() for ln in grpo_text.splitlines() if "reward_weights" in ln), "")
91
+ checks.append(
92
+ (
93
+ "[ ] GRPO reward weights [efficiency, tom, anti-cap, format] = [3.0, 1.5, 2.0, 0.5]",
94
+ in_rw,
95
+ w_line or "not found",
96
+ )
97
+ )
98
+
99
+ # GRPO data path
100
+ d_ok = 'default="data/episodes.jsonl"' in grpo_text
101
+ checks.append(
102
+ (
103
+ '[ ] GRPO --data default: data/episodes.jsonl',
104
+ d_ok,
105
+ "see grpo_train.main argparse" if d_ok else "check grpo_train.py",
106
+ )
107
+ )
108
+
109
+ checks.append(
110
+ (
111
+ "[ ] Estimated VRAM note (1.5B + LoRA r=16 ~6-8GB SFT; more for GRPO)",
112
+ True,
113
+ "informational (not a failure if you skip the box)",
114
+ )
115
+ )
116
+
117
+ hf = os.environ.get("HF_TOKEN") or os.environ.get("HUGGING_FACE_HUB_TOKEN")
118
+ ok_hf = bool(hf)
119
+ checks.append(
120
+ (
121
+ "[ ] HF token for push (HF_TOKEN or HUGGING_FACE_HUB_TOKEN)",
122
+ ok_hf,
123
+ "set" if ok_hf else "not set - needed to push checkpoints",
124
+ )
125
+ )
126
+
127
+ print("Training config pre-flight (read from training/sft_train.py, training/grpo_train.py)\n")
128
+ for line, ok, note in checks:
129
+ mark = "x" if ok else " "
130
+ display = line.replace("[ ]", f"[{mark}]", 1) if line.startswith("[ ]") else line
131
+ print(display)
132
+ if note:
133
+ print(f" -> {note}")
134
+ print()
135
+
136
+ core_ok = ok_model and ok_lora and ok_sft and ok_out and in_rw and d_ok
137
+ if core_ok and ok_hf:
138
+ print("\nREADY FOR TRAINING (SFT + GRPO config strings match; HF token present for hub).")
139
+ elif core_ok:
140
+ print(
141
+ "\nMOSTLY READY: fix missing HF token if you need push_to_hub; verify BASE_MODEL for GRPO stage."
142
+ )
143
+ else:
144
+ print("\nNEEDS FIXING: see failed [ ] items above (model path, LoRA, SFT args, or output dir).")
145
+
146
+
147
+ if __name__ == "__main__":
148
+ main()
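
The effective batch size the checklist reports is just per-device batch times gradient-accumulation steps (times device count on multi-GPU setups). A one-line sanity check of that arithmetic:

    def effective_batch(per_device: int = 4, grad_accum: int = 4, num_devices: int = 1) -> int:
        # 4 * 4 * 1 = 16, matching the checklist's "effective ~16" line
        return per_device * grad_accum * num_devices
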
scripts/eval_comparison.py ADDED
@@ -0,0 +1,78 @@
1
+ """Compare baseline, Gemini, and GRPO JSON summaries."""
2
+ import argparse
3
+ import json
4
+ from pathlib import Path
5
+
6
+
7
+ def _load(path: str) -> dict:
8
+ with open(path, "r", encoding="utf-8") as f:
9
+ return json.load(f)
10
+
11
+
12
+ def _fmt_pct(value: float) -> str:
13
+ return f"{100.0 * value:.1f}%"
14
+
15
+
16
+ def _row(label: str, data: dict) -> str:
17
+ return (
18
+ f"| {label} | {data.get('avg_reward', 0):.3f} | "
19
+ f"{_fmt_pct(float(data.get('deal_rate', 0)))} | "
20
+ f"{float(data.get('avg_efficiency', 0)):.3f} | "
21
+ f"{float(data.get('avg_tom_accuracy', 0)):.3f} | "
22
+ f"{int(data.get('bluffs_caught', 0))} |"
23
+ )
24
+
25
+
26
+ def _save_chart(baseline: dict, gemini: dict, grpo: dict, output_path: Path) -> None:
27
+ import matplotlib.pyplot as plt
28
+
29
+ labels = ["avg_reward", "deal_rate", "avg_efficiency", "avg_tom_accuracy"]
30
+ names = ["Random", "Gemini", "GRPO"]
31
+ series = [baseline, gemini, grpo]
32
+
33
+ x = range(len(labels))
34
+ width = 0.22
35
+
36
+ plt.figure(figsize=(10, 5))
37
+ for idx, name in enumerate(names):
38
+ vals = [float(series[idx].get(k, 0.0)) for k in labels]
39
+ plt.bar([p + (idx - 1) * width for p in x], vals, width=width, label=name)
40
+
41
+ plt.xticks(list(x), labels)
42
+ plt.ylabel("Metric value")
43
+ plt.title("Parlay Baseline vs Gemini vs GRPO")
44
+ plt.legend()
45
+ output_path.parent.mkdir(parents=True, exist_ok=True)
46
+ plt.tight_layout()
47
+ plt.savefig(output_path, dpi=150)
48
+ plt.close()
49
+
50
+
51
+ def main() -> None:
52
+ parser = argparse.ArgumentParser(description="Compare evaluation result JSON files")
53
+ parser.add_argument("--baseline-results", required=True)
54
+ parser.add_argument("--gemini-results", required=True)
55
+ parser.add_argument("--grpo-results", required=True)
56
+ args = parser.parse_args()
57
+
58
+ baseline = _load(args.baseline_results)
59
+ gemini = _load(args.gemini_results)
60
+ grpo = _load(args.grpo_results)
61
+
62
+ lines = [
63
+ "| Model | avg_reward | deal_rate | avg_efficiency | avg_tom_accuracy | bluffs_caught |",
64
+ "|---|---:|---:|---:|---:|---:|",
65
+ _row("Random baseline", baseline),
66
+ _row("Gemini baseline", gemini),
67
+ _row("GRPO", grpo),
68
+ ]
69
+ table = "\n".join(lines)
70
+ print(table)
71
+
72
+ chart_path = Path("results/comparison.png")
73
+ _save_chart(baseline, gemini, grpo, chart_path)
74
+ print(f"\nSaved chart: {chart_path.resolve()}")
75
+
76
+
77
+ if __name__ == "__main__":
78
+ main()
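
For reference, _row turns one summary dict into a markdown table row; a sketch with clearly made-up numbers shows the shape:

    toy = {"avg_reward": 1.234, "deal_rate": 0.5, "avg_efficiency": 0.42,
           "avg_tom_accuracy": 0.61, "bluffs_caught": 3}  # fabricated, illustrative only
    print(_row("Random baseline", toy))
    # | Random baseline | 1.234 | 50.0% | 0.420 | 0.610 | 3 |
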
scripts/inspect_data.py ADDED
@@ -0,0 +1,261 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Pre-training data quality inspector for Parlay JSONL episode files.
4
+ Read-only: loads JSONL and prints statistics and RED FLAGS.
5
+ """
6
+ from __future__ import annotations
7
+
8
+ import argparse
9
+ import json
10
+ import math
11
+ import statistics
12
+ from collections import Counter, defaultdict
13
+ from pathlib import Path
14
+ from typing import Any
15
+
16
+
17
+ def _safe_float(x: Any, default: float = 0.0) -> float:
18
+ try:
19
+ return float(x)
20
+ except (TypeError, ValueError):
21
+ return default
22
+
23
+
24
+ def _percentile_sorted(sorted_vals: list[float], p: float) -> float:
25
+ if not sorted_vals:
26
+ return 0.0
27
+ k = (len(sorted_vals) - 1) * p / 100.0
28
+ f = math.floor(k)
29
+ c = math.ceil(k)
30
+ if f == c:
31
+ return sorted_vals[int(k)]
32
+ return sorted_vals[f] * (c - k) + sorted_vals[c] * (k - f)
33
+
34
+
35
+ def _outcome_bucket(rec: dict[str, Any]) -> str:
36
+ tr = (rec.get("termination_reason") or "") or ""
37
+ tr_l = tr.lower()
38
+ if rec.get("deal_reached") is True or tr_l in ("deal_reached", "deal", "agreement"):
39
+ return "deal"
40
+ if "zopa_collapsed" in tr_l or tr_l == "zopa_collapsed":
41
+ return "zopa_collapsed"
42
+ if "walk" in tr_l or tr_l in ("walk_away", "walkaway"):
43
+ return "walk_away"
44
+ if tr_l in ("max_turns",) or (rec.get("deal_reached") is False and "max" in tr_l):
45
+ return "max_turns"
46
+ if tr_l:
47
+ return f"other:{tr_l}"
48
+ if rec.get("deal_reached") is False and rec.get("final_price") is None:
49
+ return "no_deal_or_unknown"
50
+ return "unknown"
51
+
52
+
53
+ def _utterance_lengths(conversation: Any) -> list[int]:
54
+ if not isinstance(conversation, list):
55
+ return []
56
+ out: list[int] = []
57
+ for turn in conversation:
58
+ if not isinstance(turn, dict):
59
+ continue
60
+ content = turn.get("content", "")
61
+ if isinstance(content, str) and content.strip():
62
+ out.append(len(content))
63
+ return out
64
+
65
+
66
+ def main() -> None:
67
+ parser = argparse.ArgumentParser(description="Inspect Parlay episode JSONL data quality")
68
+ parser.add_argument("--data", type=str, default="data/episodes.jsonl", help="Path to JSONL file")
69
+ args = parser.parse_args()
70
+
71
+ path = Path(args.data)
72
+ if not path.is_file():
73
+ print(f"File not found: {path.resolve()}")
74
+ print("Run: python -m training.generate_data --episodes 80 --output data/episodes.jsonl")
75
+ return
76
+
77
+ records: list[dict[str, Any]] = []
78
+ with path.open("r", encoding="utf-8") as f:
79
+ for line_no, line in enumerate(f, 1):
80
+ line = line.strip()
81
+ if not line:
82
+ continue
83
+ try:
84
+ records.append(json.loads(line))
85
+ except json.JSONDecodeError as e:
86
+ print(f"[WARN] Skipping line {line_no}: invalid JSON ({e})")
87
+
88
+ n = len(records)
89
+ print(f"=== Parlay data inspector: {path} ===")
90
+ print(f"Total episode records: {n}\n")
91
+
92
+ if n == 0:
93
+ print("No records to analyze.")
94
+ return
95
+
96
+ # SCHEMA
97
+ missing_prompt = sum(1 for r in records if not str(r.get("prompt", "")).strip())
98
+ missing_scenario = sum(1 for r in records if not str(r.get("scenario_id", "")).strip())
99
+ missing_persona = sum(1 for r in records if not str(r.get("persona", "")).strip())
100
+ missing_metadata = sum(1 for r in records if "metadata" not in r)
101
+ print("SCHEMA")
102
+ print(f" prompt present: {n - missing_prompt}/{n}")
103
+ print(f" scenario_id present: {n - missing_scenario}/{n}")
104
+ print(f" persona present: {n - missing_persona}/{n}")
105
+ print(f" metadata key present: {n - missing_metadata}/{n} (audit checklist; generate_data may omit)")
106
+ print()
107
+
108
+ def cum_reward(r: dict[str, Any]) -> float:
109
+ if "cumulative_reward" in r:
110
+ return _safe_float(r.get("cumulative_reward"))
111
+ return _safe_float(r.get("reward"))
112
+
113
+ rews = [cum_reward(r) for r in records]
114
+ rews_sorted = sorted(rews)
115
+
116
+ print("REWARD (total / cumulative - field 'reward' or 'cumulative_reward')")
117
+ print(f" min: {min(rews):.4f}")
118
+ print(f" max: {max(rews):.4f}")
119
+ print(f" mean: {statistics.mean(rews):.4f}")
120
+ print(f" std: {statistics.stdev(rews) if len(rews) > 1 else 0.0:.4f}")
121
+ print(f" p10: {_percentile_sorted(rews_sorted, 10):.4f}")
122
+ print(f" p90: {_percentile_sorted(rews_sorted, 90):.4f}")
123
+ print()
124
+
125
+ outcomes = [_outcome_bucket(r) for r in records]
126
+ oc = Counter(outcomes)
127
+ print("EPISODE OUTCOMES (best-effort from termination_reason + deal_reached)")
128
+ for k, v in sorted(oc.items(), key=lambda x: -x[1]):
129
+ print(f" {k}: {v} ({100.0 * v / n:.1f}%)")
130
+ print()
131
+
132
+ effs = [_safe_float(r.get("deal_efficiency"), 0.0) for r in records]
133
+ toms = []
134
+ for r in records:
135
+ t = r.get("tom_accuracy_avg", r.get("tom_accuracy"))
136
+ toms.append(_safe_float(t, 0.0))
137
+
138
+ print("EFFICIENCY (deal_efficiency, 0-1)")
139
+ if effs:
140
+ print(f" mean: {statistics.mean(effs):.4f} min: {min(effs):.4f} max: {max(effs):.4f}")
141
+ print()
142
+
143
+ print("TOM (tom_accuracy_avg or tom_accuracy)")
144
+ if toms:
145
+ print(f" mean: {statistics.mean(toms):.4f} min: {min(toms):.4f} max: {max(toms):.4f}")
146
+ print()
147
+
148
+ all_lens: list[int] = []
149
+ degenerate_turns = 0
150
+ total_turns = 0
151
+ for r in records:
152
+ lens = _utterance_lengths(r.get("conversation"))
153
+ all_lens.extend(lens)
154
+ for L in lens:
155
+ total_turns += 1
156
+ if L < 10:
157
+ degenerate_turns += 1
158
+
159
+ print("UTTERANCE LENGTH (conversation[*].content)")
160
+ if all_lens:
161
+ print(f" mean chars/turn: {statistics.mean(all_lens):.1f}")
162
+ print(f" turns < 10 chars: {degenerate_turns}/{total_turns} ({100.0 * degenerate_turns / max(1, total_turns):.1f}%)")
163
+ else:
164
+ print(" (no conversation utterances found)")
165
+ print()
166
+
167
+ bluff_pos = sum(1 for r in records if int(r.get("bluffs_caught", 0) or 0) > 0)
168
+ drift_yes = sum(1 for r in records if r.get("drift_adapted") is True)
169
+
170
+ print("BLUFF RATE: episodes with bluffs_caught > 0")
171
+ print(f" {bluff_pos}/{n} ({100.0 * bluff_pos / n:.1f}%) (field may be missing in JSONL -> counted as 0)")
172
+ print()
173
+ print("DRIFT ADAPTATION: drift_adapted == True")
174
+ print(f" {drift_yes}/{n} ({100.0 * drift_yes / n:.1f}%)")
175
+ print()
176
+
177
+ by_persona: dict[str, list[dict]] = defaultdict(list)
178
+ by_scenario: dict[str, list[dict]] = defaultdict(list)
179
+ for r in records:
180
+ p = str(r.get("persona", "??"))
181
+ s = str(r.get("scenario_id", "??"))
182
+ by_persona[p].append(r)
183
+ by_scenario[s].append(r)
184
+
185
+ print("=== PER-PERSONA ===")
186
+ for p in sorted(by_persona.keys()):
187
+ grp = by_persona[p]
188
+ m = len(grp)
189
+ pr = [cum_reward(x) for x in grp]
190
+ pe = [_safe_float(x.get("deal_efficiency"), 0) for x in grp]
191
+ pt = [_safe_float(x.get("tom_accuracy_avg", x.get("tom_accuracy")), 0) for x in grp]
192
+ po = [_outcome_bucket(x) for x in grp]
193
+ dr = sum(1 for f in po if f == "deal") / m
194
+ print(
195
+ f" {p}: n={m} mean_reward={statistics.mean(pr) if pr else 0:.2f} "
196
+ f"mean_eff={statistics.mean(pe) if pe else 0:.3f} mean_tom={statistics.mean(pt) if pt else 0:.3f} deal_rate={dr:.2%}"
197
+ )
198
+ print()
199
+
200
+ print("=== PER-SCENARIO ===")
201
+ for s in sorted(by_scenario.keys()):
202
+ grp = by_scenario[s]
203
+ m = len(grp)
204
+ pr = [cum_reward(x) for x in grp]
205
+ pe = [_safe_float(x.get("deal_efficiency"), 0) for x in grp]
206
+ pt = [_safe_float(x.get("tom_accuracy_avg", x.get("tom_accuracy")), 0) for x in grp]
207
+ po = [_outcome_bucket(x) for x in grp]
208
+ dr = sum(1 for f in po if f == "deal") / m
209
+ print(
210
+ f" {s}: n={m} mean_reward={statistics.mean(pr) if pr else 0:.2f} "
211
+ f"mean_eff={statistics.mean(pe) if pe else 0:.3f} mean_tom={statistics.mean(pt) if pt else 0:.3f} deal_rate={dr:.2%}"
212
+ )
213
+ print()
214
+
215
+ # RED FLAGS
216
+ print("=== RED FLAGS ===")
217
+ flags: list[str] = []
218
+
219
+ bad_rew = sum(1 for x in rews if x < -50) / n
220
+ if bad_rew > 0.30:
221
+ flags.append(f"> 30% episodes with total reward < -50 ({100 * bad_rew:.1f}%)")
222
+
223
+ max_turns_rate = sum(1 for o in outcomes if o == "max_turns") / n
224
+ if max_turns_rate > 0.40:
225
+ flags.append(f"> 40% ending in max_turns ({100 * max_turns_rate:.1f}%)")
226
+
227
+ drift_rate = drift_yes / n
228
+ if drift_rate < 0.10:
229
+ flags.append(f"< 10% drift_adapted ({100 * drift_rate:.1f}%)")
230
+
231
+ for p, grp in by_persona.items():
232
+ po = [_outcome_bucket(x) for x in grp]
233
+ dr = sum(1 for f in po if f == "deal") / len(grp) if grp else 0.0
234
+ if dr == 0.0 and len(grp) >= 3:
235
+ flags.append(f"Persona {p!r} has 0% deal rate (n={len(grp)})")
236
+
237
+ for s, grp in by_scenario.items():
238
+ po = [_outcome_bucket(x) for x in grp]
239
+ dr = sum(1 for f in po if f == "deal") / len(grp) if grp else 0.0
240
+ if dr == 0.0 and len(grp) >= 3:
241
+ flags.append(f"Scenario {s!r} has 0% deal rate (n={len(grp)})")
242
+
243
+ if all_lens and statistics.mean(all_lens) < 20.0:
244
+ flags.append(f"Mean utterance length {statistics.mean(all_lens):.1f} chars < 20 (possibly degenerate)")
245
+
246
+ if max(rews) > 400:
247
+ flags.append(f"At least one episode with total reward > 400 (max={max(rews):.2f}) - check for scale bugs or rare combo")
248
+
249
+ if missing_metadata == n:
250
+ flags.append("No record has a top-level 'metadata' key (optional for training; audit asked for it)")
251
+
252
+ if not flags:
253
+ print(" (none triggered)")
254
+ else:
255
+ for f in flags:
256
+ print(f" * {f}")
257
+ print()
258
+
259
+
260
+ if __name__ == "__main__":
261
+ main()
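
_percentile_sorted interpolates linearly between the two nearest ranks. A worked example against the function above:

    vals = [1.0, 2.0, 4.0, 8.0]  # already sorted
    # p90: k = (4 - 1) * 90 / 100 = 2.7, between index 2 (4.0) and index 3 (8.0):
    # 4.0 * (3 - 2.7) + 8.0 * (2.7 - 2) = 1.2 + 5.6 = 6.8
    assert abs(_percentile_sorted(vals, 90) - 6.8) < 1e-9
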
scripts/push_docker.sh ADDED
@@ -0,0 +1,8 @@
1
+ #!/bin/bash
2
+ # Usage: ./scripts/push_docker.sh <dockerhub-username> <tag>
3
+ USERNAME=${1:-yourusername}
4
+ TAG=${2:-latest}
5
+ docker build -t "$USERNAME/parlay:$TAG" .
6
+ docker push "$USERNAME/parlay:$TAG"
7
+ echo "Pushed $USERNAME/parlay:$TAG"
8
+ echo "For HF Spaces: set Dockerfile app_port to 7860 and push repo to huggingface.co/spaces/$USERNAME/parlay"
scripts/run_gemini_baseline.py ADDED
@@ -0,0 +1,74 @@
1
+ """Run Gemini self-play baseline and save summary JSON."""
2
+ import argparse
3
+ import asyncio
4
+ import json
5
+ import logging
6
+ from pathlib import Path
7
+
8
+ from agent.runner import run_episode
9
+ from parlay_env.models import PersonaType
10
+
11
+ PERSONAS = [PersonaType.SHARK, PersonaType.DIPLOMAT, PersonaType.VETERAN]
12
+ SCENARIOS = ["saas_enterprise", "hiring_package", "acquisition_term_sheet"]
13
+
14
+
15
+ def _mean(values: list[float]) -> float:
16
+ return sum(values) / len(values) if values else 0.0
17
+
18
+
19
+ async def _run(episodes: int) -> list[dict]:
20
+ rows: list[dict] = []
21
+ for i in range(episodes):
22
+ persona = PERSONAS[i % len(PERSONAS)]
23
+ scenario_id = SCENARIOS[(i // len(PERSONAS)) % len(SCENARIOS)]
24
+ result = await run_episode(
25
+ persona=persona,
26
+ scenario_id=scenario_id,
27
+ inject_noise=False,
28
+ force_drift=True,
29
+ seed=i + 100,
30
+ max_turns=20,
31
+ )
32
+ rows.append(
33
+ {
34
+ "avg_reward": float(result.grade.total_reward),
35
+ "deal_rate": 1.0 if result.final_price is not None else 0.0,
36
+ "avg_efficiency": float(result.grade.deal_efficiency),
37
+ "avg_tom_accuracy": float(result.grade.tom_accuracy_avg),
38
+ "bluffs_caught": int(result.grade.bluffs_caught),
39
+ }
40
+ )
41
+ return rows
42
+
43
+
44
+ def _summarise(rows: list[dict], episodes_requested: int) -> dict:
45
+ return {
46
+ "episodes_requested": episodes_requested,
47
+ "episodes_completed": len(rows),
48
+ "avg_reward": round(_mean([r["avg_reward"] for r in rows]), 4),
49
+ "deal_rate": round(_mean([r["deal_rate"] for r in rows]), 4),
50
+ "avg_efficiency": round(_mean([r["avg_efficiency"] for r in rows]), 4),
51
+ "avg_tom_accuracy": round(_mean([r["avg_tom_accuracy"] for r in rows]), 4),
52
+ "bluffs_caught": int(sum(r["bluffs_caught"] for r in rows)),
53
+ }
54
+
55
+
56
+ def main() -> None:
57
+ parser = argparse.ArgumentParser(description="Run Gemini self-play baseline")
58
+ parser.add_argument("--episodes", type=int, default=20)
59
+ parser.add_argument("--output", default="results/gemini_baseline.json")
60
+ args = parser.parse_args()
61
+
62
+ logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
63
+ rows = asyncio.run(_run(args.episodes))
64
+ summary = _summarise(rows, args.episodes)
65
+
66
+ out = Path(args.output)
67
+ out.parent.mkdir(parents=True, exist_ok=True)
68
+ out.write_text(json.dumps(summary, indent=2), encoding="utf-8")
69
+ print(json.dumps(summary, indent=2))
70
+ print(f"\nSaved Gemini baseline to {out.resolve()}")
71
+
72
+
73
+ if __name__ == "__main__":
74
+ main()
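
The episode loop cycles personas fastest and advances the scenario every len(PERSONAS) episodes, so any 9 consecutive episodes cover all 9 persona x scenario pairs exactly once. A quick check of that indexing, with plain strings standing in for the enum values:

    personas = ["shark", "diplomat", "veteran"]
    scenarios = ["saas_enterprise", "hiring_package", "acquisition_term_sheet"]
    combos = {(personas[i % 3], scenarios[(i // 3) % 3]) for i in range(9)}
    assert len(combos) == 9  # full coverage every 9 episodes
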
scripts/validate_sft_data.py ADDED
@@ -0,0 +1,132 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Validate JSONL rows against what training/sft_train.py will actually use.
4
+ Read-only; does not run training.
5
+ """
6
+ from __future__ import annotations
7
+
8
+ import argparse
9
+ import json
10
+ import statistics
11
+ import sys
12
+ from pathlib import Path
13
+
14
+ # Mirror training/sft_train._extract_completions (do not import to avoid heavy deps at import time)
15
+ def _extract_completions(rec: dict) -> list[str]:
16
+ completion = rec.get("completion")
17
+ if isinstance(completion, str) and completion.strip():
18
+ return [completion.strip()]
19
+
20
+ conversation = rec.get("conversation", [])
21
+ candidates: list[str] = []
22
+ if isinstance(conversation, list):
23
+ for turn in conversation:
24
+ if not isinstance(turn, dict):
25
+ continue
26
+ role = str(turn.get("role", "")).lower()
27
+ content = str(turn.get("content", "")).strip()
28
+ if role == "negotiator" and content:
29
+ candidates.append(content)
30
+ return candidates
31
+
32
+
33
+ def _approx_tokens(text: str) -> float:
34
+ """Rough token estimate without tokenizer (good enough for preflight OOM risk)."""
35
+ if not text:
36
+ return 0.0
37
+ return len(text) / 4.0
38
+
39
+
40
+ def main() -> None:
41
+ parser = argparse.ArgumentParser(description="Validate SFT JSONL against sft_train.py expectations")
42
+ parser.add_argument("--data", type=str, default="data/episodes.jsonl", help="Path to JSONL file")
43
+ args = parser.parse_args()
44
+
45
+ path = Path(args.data)
46
+ if not path.is_file():
47
+ print(f"File not found: {path.resolve()}")
48
+ print("Run: python -m training.generate_data --episodes 80 --output data/episodes.jsonl")
49
+ return
50
+
51
+ usable_rows = 0
52
+ skipped = 0
53
+ prompt_tok: list[float] = []
54
+ completion_tok: list[float] = []
55
+ first_bad_line: int | None = None
56
+
57
+ with path.open("r", encoding="utf-8") as f:
58
+ for line_no, line in enumerate(f, 1):
59
+ line = line.strip()
60
+ if not line:
61
+ continue
62
+ try:
63
+ rec = json.loads(line)
64
+ except json.JSONDecodeError:
65
+ skipped += 1
66
+ if first_bad_line is None:
67
+ first_bad_line = line_no
68
+ continue
69
+ if not isinstance(rec, dict):
70
+ skipped += 1
71
+ continue
72
+
73
+ prompt = str(rec.get("prompt", "")).strip()
74
+ if not prompt:
75
+ skipped += 1
76
+ continue
77
+
78
+ completions = _extract_completions(rec)
79
+ if not completions:
80
+ skipped += 1
81
+ continue
82
+
83
+ usable_rows += 1
84
+ p_t = _approx_tokens(prompt)
85
+ prompt_tok.append(p_t)
86
+ for c in completions:
87
+ completion_tok.append(_approx_tokens(c))
88
+
89
+ sft_trains_one_row_per_completion = "sft_train.py expands one dataset row per negotiator line"
90
+ print("SFT data validator (vs training/sft_train.py load_sft_dataset / _extract_completions)")
91
+ print(f" File: {path.resolve()}")
92
+ print(f" Note: {sft_trains_one_row_per_completion} when 'completion' is absent.")
93
+ print()
94
+ print(f" JSONL records usable (has non-empty 'prompt' and completion or negotiator text): {usable_rows}")
95
+ print(f" Records / rows SKIPPED: {skipped}")
96
+ if first_bad_line is not None:
97
+ print(f" (includes malformed JSONL starting around line {first_bad_line} if any)")
98
+ print()
99
+
100
+ if not prompt_tok and not completion_tok:
101
+ print("No prompt/completion lengths to summarize (all skipped).")
102
+ else:
103
+ def _summary(vals: list[float], label: str) -> None:
104
+ if not vals:
105
+ print(f" {label}: (empty)")
106
+ return
107
+ print(
108
+ f" {label} (approx. tokens, len/4): "
109
+ f"min={min(vals):.1f} max={max(vals):.1f} mean={statistics.mean(vals):.1f} "
110
+ f"std={(statistics.pstdev(vals) if len(vals) > 1 else 0.0):.1f}"
111
+ )
112
+
113
+ _summary(prompt_tok, "Prompt length")
114
+ _summary(completion_tok, "Completion length (each negotiator / completion string)")
115
+ if prompt_tok and statistics.mean(prompt_tok) > 2048.0:
116
+ print(
117
+ " FLAG: Mean prompt length > 2048 (approx. tokens) - may OOM or truncate with "
118
+ "SFTConfig max_seq_length=2048 in sft_train.py on small GPUs."
119
+ )
120
+
121
+ print()
122
+ print(f" {usable_rows} records usable for SFT, {skipped} will be skipped (at record level; negotiator")
123
+ print(" expansion in sft_train can still multiply rows for usable records).")
124
+ if usable_rows < 50:
125
+ print(" WARNING: May be insufficient for SFT. Generate more data first.")
126
+
127
+ if usable_rows == 0:
128
+ sys.exit(1)
129
+
130
+
131
+ if __name__ == "__main__":
132
+ main()
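
The len/4 heuristic makes the 2048-token flag easy to check by hand: any prompt beyond roughly 8192 characters trips it. A worked example against _approx_tokens above:

    prompt = "x" * 9000
    assert _approx_tokens(prompt) == 2250.0  # 9000 / 4, above the 2048 threshold
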
tests/test_keyless.py CHANGED
@@ -26,7 +26,7 @@ from parlay_env.game_theory import (
26
  )
27
  from parlay_env.grader import compute_step_reward, compute_terminal_reward
28
  from parlay_env.grader import detect_bluff_challenge
29
- from parlay_env.reward import OMEGA, PSI
30
  from parlay_env.models import (
31
  BeliefState, HiddenState, ParlayAction, ParlayState, PersonaType,
32
  )
@@ -200,6 +200,25 @@ class TestGrader:
200
  assert caught is True, f"Expected True, got {caught}"
201
  assert reward >= PSI, f"Expected at least {PSI}, got {reward}"
202
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
203
  def test_zopa_collapse_walkaway_keyless(self):
204
  """Repeated high tension collapses the ZOPA and forces walk-away."""
205
  hidden = HiddenState(
 
26
  )
27
  from parlay_env.grader import compute_step_reward, compute_terminal_reward
28
  from parlay_env.grader import detect_bluff_challenge
29
+ from parlay_env.reward import MU, OMEGA, PSI
30
  from parlay_env.models import (
31
  BeliefState, HiddenState, ParlayAction, ParlayState, PersonaType,
32
  )
 
200
  assert caught is True, f"Expected True, got {caught}"
201
  assert reward >= PSI, f"Expected at least {PSI}, got {reward}"
202
 
203
+ def test_mev_bonus_requires_drift_event(self, parlay_state):
204
+ """MEV bonus only activates when a drift event marker is present."""
205
+ action = ParlayAction(
206
+ utterance="Given that the market shifted, I can adjust.",
207
+ tactical_move=None,
208
+ )
209
+
210
+ no_drift_state = ParlayState(**{**parlay_state.model_dump(), "step_count": 1})
211
+ reward_no_drift = compute_step_reward(parlay_state, action, no_drift_state)
212
+
213
+ with_drift_state = ParlayState(**{**parlay_state.model_dump(), "step_count": 1})
214
+ with_drift_state.__dict__["drift_event"] = "Competitor drops price 15%"
215
+ reward_with_drift = compute_step_reward(parlay_state, action, with_drift_state)
216
+
217
+ assert reward_with_drift >= reward_no_drift + MU, (
218
+ f"Expected drift reward boost >= {MU}, got "
219
+ f"{reward_with_drift - reward_no_drift}"
220
+ )
221
+
222
  def test_zopa_collapse_walkaway_keyless(self):
223
  """Repeated high tension collapses the ZOPA and forces walk-away."""
224
  hidden = HiddenState(
training/generate_data.py CHANGED
@@ -33,6 +33,32 @@ REQUIRED_COMBINATIONS = [
33
  for scenario in ["saas_enterprise", "hiring_package", "acquisition_term_sheet"]
34
  ]
35

36
 
37
  def is_quality_episode(grade, args) -> tuple[bool, str]:
38
  """
@@ -308,7 +334,8 @@ async def run_inspect_mode(args) -> None:
308
 
309
  async def run_diversity_pass(args, output_path: Path) -> None:
310
  """
311
- Generate a quality-filtered dataset with guaranteed persona x scenario coverage.
312
  """
313
  output_path.parent.mkdir(parents=True, exist_ok=True)
314
  coverage: dict[tuple[str, str], int] = defaultdict(int)
@@ -316,169 +343,111 @@ async def run_diversity_pass(args, output_path: Path) -> None:
316
  kept_records: list[dict] = []
317
  generated = 0
318
  discarded = 0
319
  seed = 0
320
- min_per_combo = max(2, args.episodes // len(REQUIRED_COMBINATIONS))
321
  total_live_calls: int = 0
322
  total_fallback_calls: int = 0
323
  _verbose = not getattr(args, "quiet", False)
324
 
325
  with open(output_path, "w", encoding="utf-8") as out_f:
326
  while len(kept_records) < args.episodes:
327
- progress_made = False
328
- for persona, scenario_id in REQUIRED_COMBINATIONS:
329
- if len(kept_records) >= args.episodes:
330
- break
331
- if coverage[(persona, scenario_id)] >= min_per_combo:
332
- continue
333
-
334
- record = await _run_one(persona, scenario_id, seed=seed, max_turns=args.max_turns)
335
- seed += 1
336
- generated += 1
337
- if record is None:
338
- _live_n, _fall_n = get_and_reset_counts()
339
- total_live_calls += _live_n
340
- total_fallback_calls += _fall_n
341
- continue
342
-
343
- keep, reason = is_quality_episode(
344
- _grade_proxy_from_record(record),
345
- args,
346
- )
347
- if not keep:
348
- discarded += 1
349
- _live_d, _fall_d = get_and_reset_counts()
350
- total_live_calls += _live_d
351
- total_fallback_calls += _fall_d
352
- if _verbose:
353
- print(
354
- f"[EP --/{args.episodes:02d}] "
355
- f"{persona}×{scenario_id:<27s} | "
356
- f"reward={record.get('reward', 0.0):+.2f} | "
357
- f"eff={record.get('deal_efficiency', 0.0):.3f} | "
358
- f"kept=NO | "
359
- f"total_kept={len(kept_records)}/{generated} | "
360
- f"gemini_live={_live_d} fallback={_fall_d}",
361
- file=sys.stderr,
362
- )
363
- continue
364
-
365
- out_f.write(json.dumps(record, ensure_ascii=False) + "\n")
366
- kept_records.append(record)
367
- _live, _fall = get_and_reset_counts()
368
- total_live_calls += _live
369
- total_fallback_calls += _fall
370
- _ep_num = len(kept_records)
371
  if _verbose:
372
- _reward = record.get("reward", 0.0)
373
- _eff = record.get("deal_efficiency", 0.0)
374
- _combo = f"{record['persona']}×{record['scenario_id']}"
375
  print(
376
- f"[EP {_ep_num:02d}/{args.episodes:02d}] "
377
- f"{_combo:<35s} | "
378
- f"reward={_reward:+.2f} | "
379
- f"eff={_eff:.3f} | "
380
- f"kept=YES | "
381
- f"total_kept={_ep_num}/{generated} | "
382
- f"gemini_live={_live} fallback={_fall}",
383
  file=sys.stderr,
384
  )
385
- if _ep_num in (20, 40, 60):
386
- _all_rewards = [r.get("reward", 0.0) for r in kept_records]
387
- _all_eff = [r.get("deal_efficiency", 0.0) for r in kept_records]
388
- _combos_covered = len({(r["persona"], r["scenario_id"]) for r in kept_records})
389
- print(f"\n{'━' * 40}", file=sys.stderr)
390
- print(f"[CHECKPOINT {_ep_num}/{args.episodes}]", file=sys.stderr)
391
- print(
392
- f" Kept so far : {_ep_num}/{generated} ({100 * _ep_num / max(generated, 1):.1f}%)",
393
- file=sys.stderr,
394
- )
395
- print(f" Mean reward : {statistics.mean(_all_rewards):.2f}", file=sys.stderr)
396
- print(f" Mean efficiency : {statistics.mean(_all_eff):.3f}", file=sys.stderr)
397
- print(f" Combos covered : {_combos_covered}/9", file=sys.stderr)
398
- print(f" Live calls total: {total_live_calls}", file=sys.stderr)
399
- print(f" Fallback total : {total_fallback_calls}", file=sys.stderr)
400
- print(f"{'' * 40}\n", file=sys.stderr)
401
- coverage[(persona, scenario_id)] += 1
402
- kept_reason_counts[reason] += 1
403
- progress_made = True
404
-
405
- if len(kept_records) >= args.episodes:
406
- break
407
-
408
- if not progress_made:
409
- persona, scenario_id = random.choice(REQUIRED_COMBINATIONS)
410
- record = await _run_one(persona, scenario_id, seed=seed, max_turns=args.max_turns)
411
- seed += 1
412
- generated += 1
413
- if record is None:
414
- _live_n, _fall_n = get_and_reset_counts()
415
- total_live_calls += _live_n
416
- total_fallback_calls += _fall_n
417
- continue
418
- keep, reason = is_quality_episode(
419
- _grade_proxy_from_record(record),
420
- args,
421
  )
422
- if keep:
423
- out_f.write(json.dumps(record, ensure_ascii=False) + "\n")
424
- kept_records.append(record)
425
- _live, _fall = get_and_reset_counts()
426
- total_live_calls += _live
427
- total_fallback_calls += _fall
428
- _ep_num = len(kept_records)
429
- if _verbose:
430
- _reward = record.get("reward", 0.0)
431
- _eff = record.get("deal_efficiency", 0.0)
432
- _combo = f"{record['persona']}×{record['scenario_id']}"
433
- print(
434
- f"[EP {_ep_num:02d}/{args.episodes:02d}] "
435
- f"{_combo:<35s} | "
436
- f"reward={_reward:+.2f} | "
437
- f"eff={_eff:.3f} | "
438
- f"kept=YES | "
439
- f"total_kept={_ep_num}/{generated} | "
440
- f"gemini_live={_live} fallback={_fall}",
441
- file=sys.stderr,
442
- )
443
- if _ep_num in (20, 40, 60):
444
- _all_rewards = [r.get("reward", 0.0) for r in kept_records]
445
- _all_eff = [r.get("deal_efficiency", 0.0) for r in kept_records]
446
- _combos_covered = len({(r["persona"], r["scenario_id"]) for r in kept_records})
447
- print(f"\n{'━' * 40}", file=sys.stderr)
448
- print(f"[CHECKPOINT {_ep_num}/{args.episodes}]", file=sys.stderr)
449
- print(
450
- f" Kept so far : {_ep_num}/{generated} ({100 * _ep_num / max(generated, 1):.1f}%)",
451
- file=sys.stderr,
452
- )
453
- print(f" Mean reward : {statistics.mean(_all_rewards):.2f}", file=sys.stderr)
454
- print(f" Mean efficiency : {statistics.mean(_all_eff):.3f}", file=sys.stderr)
455
- print(f" Combos covered : {_combos_covered}/9", file=sys.stderr)
456
- print(f" Live calls total: {total_live_calls}", file=sys.stderr)
457
- print(f" Fallback total : {total_fallback_calls}", file=sys.stderr)
458
- print(f"{'━' * 40}\n", file=sys.stderr)
459
- coverage[(persona, scenario_id)] += 1
460
- kept_reason_counts[reason] += 1
461
- else:
462
- discarded += 1
463
- _live_d, _fall_d = get_and_reset_counts()
464
- total_live_calls += _live_d
465
- total_fallback_calls += _fall_d
466
- if _verbose:
467
- print(
468
- f"[EP --/{args.episodes:02d}] "
469
- f"{persona}×{scenario_id:<27s} | "
470
- f"reward={record.get('reward', 0.0):+.2f} | "
471
- f"eff={record.get('deal_efficiency', 0.0):.3f} | "
472
- f"kept=NO | "
473
- f"total_kept={len(kept_records)}/{generated} | "
474
- f"gemini_live={_live_d} fallback={_fall_d}",
475
- file=sys.stderr,
476
- )
477
 
478
  discard_pct = (discarded / max(generated, 1)) * 100.0
479
  print(
480
  f"Generated: {generated} episodes | Kept: {len(kept_records)} | "
481
- f"Discarded: {discarded} ({discard_pct:.0f}%)"
 
482
  )
483
  reasons_str = ", ".join(f"{reason}={count}" for reason, count in sorted(kept_reason_counts.items()))
484
  print(f"Reasons kept: {reasons_str or 'none'}")
@@ -488,9 +457,9 @@ async def run_diversity_pass(args, output_path: Path) -> None:
488
 
489
  _fallback_rate = 100.0 * total_fallback_calls / max(total_live_calls + total_fallback_calls, 1)
490
  _verdict = (
491
- "ALL CALLS LIVE data is real"
492
  if _fallback_rate < 5.0
493
- else "WARNING: fallback rate high check API key and rate limits"
494
  )
495
  print(f"\nGemini API health:")
496
  print(f" Total live calls : {total_live_calls}")
@@ -501,8 +470,14 @@ async def run_diversity_pass(args, output_path: Path) -> None:
501
 
502
  def main() -> None:
503
  parser = argparse.ArgumentParser(description="Generate Parlay training data")
504
- parser.add_argument("--episodes", type=int, default=80)
505
  parser.add_argument("--output", type=str, default="data/episodes.jsonl")
 
 
 
 
 
 
506
  parser.add_argument(
507
  "--quality_filter",
508
  action="store_true",
 
     for scenario in ["saas_enterprise", "hiring_package", "acquisition_term_sheet"]
 ]
 
+# Weighted to oversample historically low deal-rate combinations (total weight = 15)
+COMBO_WEIGHTS: dict[tuple[str, str], int] = {
+    ("veteran", "hiring_package"): 3,
+    ("veteran", "saas_enterprise"): 2,
+    ("veteran", "acquisition_term_sheet"): 2,
+    ("shark", "hiring_package"): 2,
+    ("diplomat", "hiring_package"): 2,
+    ("shark", "saas_enterprise"): 1,
+    ("shark", "acquisition_term_sheet"): 1,
+    ("diplomat", "saas_enterprise"): 1,
+    ("diplomat", "acquisition_term_sheet"): 1,
+}
+WEIGHTED_COMBO_LIST: list[tuple[str, str]] = []
+for _pair, _weight in COMBO_WEIGHTS.items():
+    WEIGHTED_COMBO_LIST.extend([_pair] * _weight)
+
+
+def _row_total_reward(record: dict) -> float | None:
+    v = record.get("reward")
+    if v is not None:
+        return float(v)
+    v2 = record.get("cumulative_reward")
+    if v2 is not None:
+        return float(v2)
+    return None
+
 
 def is_quality_episode(grade, args) -> tuple[bool, str]:
     """
 
 async def run_diversity_pass(args, output_path: Path) -> None:
     """
+    Generate a quality-filtered dataset; persona x scenario is weighted-sampled
+    (see COMBO_WEIGHTS / WEIGHTED_COMBO_LIST).
     """
     output_path.parent.mkdir(parents=True, exist_ok=True)
     coverage: dict[tuple[str, str], int] = defaultdict(int)
 
     kept_records: list[dict] = []
     generated = 0
     discarded = 0
+    skipped_min_reward = 0
     seed = 0
     total_live_calls: int = 0
     total_fallback_calls: int = 0
     _verbose = not getattr(args, "quiet", False)
+    _checkpoints = {20, 40, 60, 80, 100, 120, 140}
+
+    def _emit_checkpoint(_ep_num: int) -> None:
+        if not _verbose or _ep_num not in _checkpoints:
+            return
+        _all_rewards = [r.get("reward", 0.0) for r in kept_records]
+        _all_eff = [r.get("deal_efficiency", 0.0) for r in kept_records]
+        _combos_covered = len({(r["persona"], r["scenario_id"]) for r in kept_records})
+        print(f"\n{'━' * 40}", file=sys.stderr)
+        print(f"[CHECKPOINT {_ep_num}/{args.episodes}]", file=sys.stderr)
+        print(
+            f" Kept so far : {_ep_num}/{generated} ({100 * _ep_num / max(generated, 1):.1f}%)",
+            file=sys.stderr,
+        )
+        print(f" Mean reward : {statistics.mean(_all_rewards):.2f}", file=sys.stderr)
+        print(f" Mean efficiency : {statistics.mean(_all_eff):.3f}", file=sys.stderr)
+        print(f" Combos covered : {_combos_covered}/9", file=sys.stderr)
+        print(f" Min-reward skip : {skipped_min_reward}", file=sys.stderr)
+        print(f" Live calls total: {total_live_calls}", file=sys.stderr)
+        print(f" Fallback total : {total_fallback_calls}", file=sys.stderr)
+        print(f"{'━' * 40}\n", file=sys.stderr)
 
     with open(output_path, "w", encoding="utf-8") as out_f:
         while len(kept_records) < args.episodes:
+            persona, scenario_id = random.choice(WEIGHTED_COMBO_LIST)
+            record = await _run_one(persona, scenario_id, seed=seed, max_turns=args.max_turns)
+            seed += 1
+            generated += 1
+            if record is None:
+                _live_n, _fall_n = get_and_reset_counts()
+                total_live_calls += _live_n
+                total_fallback_calls += _fall_n
+                continue
+
+            rw = _row_total_reward(record)
+            if rw is not None and rw < args.min_reward:
+                skipped_min_reward += 1
+                _live_m, _fall_m = get_and_reset_counts()
+                total_live_calls += _live_m
+                total_fallback_calls += _fall_m
                 if _verbose:
                     print(
+                        f"[min_reward skip #{skipped_min_reward}] {persona} x {scenario_id} "
+                        f"reward={rw:.2f} < {args.min_reward}",
                         file=sys.stderr,
                     )
+                continue
+
+            keep, reason = is_quality_episode(
+                _grade_proxy_from_record(record),
+                args,
+            )
+            if not keep:
+                discarded += 1
+                _live_d, _fall_d = get_and_reset_counts()
+                total_live_calls += _live_d
+                total_fallback_calls += _fall_d
+                if _verbose:
+                    print(
+                        f"[EP --/{args.episodes:02d}] "
+                        f"{persona}×{scenario_id:<27s} | "
+                        f"reward={record.get('reward', 0.0):+.2f} | "
+                        f"eff={record.get('deal_efficiency', 0.0):.3f} | "
+                        f"kept=NO | "
+                        f"total_kept={len(kept_records)}/{generated} | "
+                        f"gemini_live={_live_d} fallback={_fall_d}",
+                        file=sys.stderr,
+                    )
+                continue
+
+            out_f.write(json.dumps(record, ensure_ascii=False) + "\n")
+            out_f.flush()
+            kept_records.append(record)
+            _live, _fall = get_and_reset_counts()
+            total_live_calls += _live
+            total_fallback_calls += _fall
+            _ep_num = len(kept_records)
+            if _verbose:
+                _reward = record.get("reward", 0.0)
+                _eff = record.get("deal_efficiency", 0.0)
+                _combo = f"{record['persona']}×{record['scenario_id']}"
+                print(
+                    f"[EP {_ep_num:02d}/{args.episodes:02d}] "
+                    f"{_combo:<35s} | "
+                    f"reward={_reward:+.2f} | "
+                    f"eff={_eff:.3f} | "
+                    f"kept=YES | "
+                    f"total_kept={_ep_num}/{generated} | "
+                    f"gemini_live={_live} fallback={_fall}",
+                    file=sys.stderr,
+                )
+            _emit_checkpoint(_ep_num)
+            coverage[(persona, scenario_id)] += 1
+            kept_reason_counts[reason] += 1
 
     discard_pct = (discarded / max(generated, 1)) * 100.0
     print(
         f"Generated: {generated} episodes | Kept: {len(kept_records)} | "
+        f"Discarded: {discarded} ({discard_pct:.0f}%) | "
+        f"Skipped (min_reward < {args.min_reward}): {skipped_min_reward}"
     )
     reasons_str = ", ".join(f"{reason}={count}" for reason, count in sorted(kept_reason_counts.items()))
     print(f"Reasons kept: {reasons_str or 'none'}")
 
     _fallback_rate = 100.0 * total_fallback_calls / max(total_live_calls + total_fallback_calls, 1)
     _verdict = (
+        "ALL CALLS LIVE - data is real"
         if _fallback_rate < 5.0
+        else "WARNING: fallback rate high - check API key and rate limits"
     )
     print(f"\nGemini API health:")
     print(f" Total live calls : {total_live_calls}")
 
 def main() -> None:
     parser = argparse.ArgumentParser(description="Generate Parlay training data")
+    parser.add_argument("--episodes", type=int, default=140)
     parser.add_argument("--output", type=str, default="data/episodes.jsonl")
+    parser.add_argument(
+        "--min-reward",
+        type=float,
+        default=-50.0,
+        help="After grading, do not write episodes with total reward below this (default: -50.0)",
+    )
     parser.add_argument(
         "--quality_filter",
         action="store_true",
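
The weighting above is the core of the data-generation change, so it is worth a quick sanity check. A minimal standalone sketch, assuming nothing beyond the stdlib (it copies `COMBO_WEIGHTS` from the diff and is not part of the commit): with total weight 15, `("veteran", "hiring_package")` should account for about 3/15 = 20% of draws and each weight-1 combo for roughly 6.7%.

```python
# Standalone check of the weighted persona x scenario sampling
# (COMBO_WEIGHTS copied from the diff above).
import random
from collections import Counter

COMBO_WEIGHTS = {
    ("veteran", "hiring_package"): 3,
    ("veteran", "saas_enterprise"): 2,
    ("veteran", "acquisition_term_sheet"): 2,
    ("shark", "hiring_package"): 2,
    ("diplomat", "hiring_package"): 2,
    ("shark", "saas_enterprise"): 1,
    ("shark", "acquisition_term_sheet"): 1,
    ("diplomat", "saas_enterprise"): 1,
    ("diplomat", "acquisition_term_sheet"): 1,
}
# Same expansion the generator uses: each pair repeated `weight` times.
weighted = [pair for pair, weight in COMBO_WEIGHTS.items() for _ in range(weight)]

draws = Counter(random.choice(weighted) for _ in range(15_000))
for pair, n in draws.most_common():
    print(f"{pair}: observed {n / 15_000:.1%}, expected {COMBO_WEIGHTS[pair] / 15:.1%}")
```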
training/grpo_train.py CHANGED
@@ -18,18 +18,32 @@ from pathlib import Path
 
 logger = logging.getLogger(__name__)
 
-BASE_MODEL = os.getenv("BASE_MODEL", "Qwen/Qwen2.5-7B-Instruct")
 GRPO_STEPS = int(os.getenv("GRPO_STEPS", "500"))
 GRPO_GENERATIONS = int(os.getenv("GRPO_GENERATIONS", "8"))
 
 
-def build_grpo_dataset(jsonl_path: str):
     """
     Build GRPO dataset. Each record needs only a 'prompt' field plus metadata.
     The model generates G=8 completions per prompt; grader scores all 8.
 
     Args:
         jsonl_path: Path to the JSONL episodes file.
 
     Returns:
         HuggingFace Dataset with prompt + metadata columns.
@@ -40,21 +54,31 @@ def build_grpo_dataset(jsonl_path: str):
         raise ImportError("Install datasets: pip install datasets") from exc
 
     prompts = []
     with open(jsonl_path, encoding="utf-8") as f:
         for line in f:
             rec = json.loads(line.strip())
-            if rec.get("split") == "train":
-                # Extract ZOPA metadata for reward functions
-                prompts.append({
                     "prompt": rec["prompt"],
                     "scenario_id": rec.get("scenario_id", ""),
                     "persona": rec.get("persona", ""),
-                    # Reward function kwargs (passed through dataset)
                     "batna_seller": _get_batna(rec.get("scenario_id", ""), "seller"),
-                    "batna_buyer": _get_batna(rec.get("scenario_id", ""), "buyer"),
-                    "zopa_width": _get_zopa_width(rec.get("scenario_id", "")),
-                })
-    logger.info(f"GRPO dataset: {len(prompts)} prompts")
     return Dataset.from_list(prompts)
 
 
@@ -62,7 +86,7 @@ def _get_batna(scenario_id: str, side: str) -> float:
     """Lookup BATNA for a scenario without importing game module at training time."""
     batnas: dict[str, dict[str, float]] = {
         "saas_enterprise": {"seller": 125_000, "buyer": 165_000},
-        "hiring_package": {"seller": 195_000, "buyer": 230_000},
         "acquisition_term_sheet": {"seller": 10_500_000, "buyer": 16_000_000},
     }
     return float(batnas.get(scenario_id, {}).get(side, 0))
@@ -80,6 +104,7 @@ def train_grpo(
     data_path: str,
     output_dir: str,
     steps: int = 500,
 ) -> None:
     """
     GRPO training loop.
@@ -117,7 +142,7 @@ def train_grpo(
         format_reward,
     )
 
-    dataset = build_grpo_dataset(data_path)
     if len(dataset) == 0:
         raise ValueError("Empty GRPO dataset. Run generate_data.py first.")
 
@@ -181,6 +206,12 @@ def main() -> None:
     parser.add_argument("--model", default="models/parlay-sft")
     parser.add_argument("--base_model", default="")
     parser.add_argument("--data", default="data/episodes.jsonl")
     parser.add_argument("--output", default="models/parlay-grpo")
     parser.add_argument("--steps", type=int, default=GRPO_STEPS)
     parser.add_argument("--g", type=int, default=GRPO_GENERATIONS)
@@ -191,7 +222,7 @@ def main() -> None:
     logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
     GRPO_GENERATIONS = args.g
     model_path = args.base_model or args.model
-    train_grpo(model_path, args.data, args.output, args.steps)
 
     if args.save_curves:
         curves_path = Path(args.save_curves)
 
 logger = logging.getLogger(__name__)
 
+# SFT->GRPO pipeline: set BASE_MODEL=checkpoints/sft_1.5b/ after sft_train.py
+# (overridable via BASE_MODEL env var)
+BASE_MODEL = os.getenv("BASE_MODEL", "Qwen/Qwen2.5-1.5B-Instruct")
 GRPO_STEPS = int(os.getenv("GRPO_STEPS", "500"))
 GRPO_GENERATIONS = int(os.getenv("GRPO_GENERATIONS", "8"))
 
 
+def _row_total_reward(rec: dict) -> float | None:
+    v = rec.get("reward")
+    if v is not None:
+        return float(v)
+    v2 = rec.get("cumulative_reward")
+    if v2 is not None:
+        return float(v2)
+    return None
+
+
+def build_grpo_dataset(jsonl_path: str, min_reward: float = -50.0):
     """
     Build GRPO dataset. Each record needs only a 'prompt' field plus metadata.
     The model generates G=8 completions per prompt; grader scores all 8.
 
     Args:
         jsonl_path: Path to the JSONL episodes file.
+        min_reward: Drop train rows with total reward (reward / cumulative_reward) below this
+            (missing reward fields are kept for backward compatibility).
 
     Returns:
         HuggingFace Dataset with prompt + metadata columns.
 
         raise ImportError("Install datasets: pip install datasets") from exc
 
     prompts = []
+    filtered = 0
     with open(jsonl_path, encoding="utf-8") as f:
         for line in f:
             rec = json.loads(line.strip())
+            if rec.get("split") != "train":
+                continue
+            r = _row_total_reward(rec)
+            if r is not None and r < min_reward:
+                filtered += 1
+                continue
+            # Extract ZOPA metadata for reward functions
+            prompts.append(
+                {
                     "prompt": rec["prompt"],
                     "scenario_id": rec.get("scenario_id", ""),
                     "persona": rec.get("persona", ""),
                     "batna_seller": _get_batna(rec.get("scenario_id", ""), "seller"),
+                    "batna_buyer": _get_batna(rec.get("scenario_id", ""), "buyer"),
+                    "zopa_width": _get_zopa_width(rec.get("scenario_id", "")),
+                }
+            )
+    print(
+        f"Filtered {filtered} records below min_reward={min_reward}, "
+        f"{len(prompts)} remaining for GRPO"
+    )
     return Dataset.from_list(prompts)
 
 
     """Lookup BATNA for a scenario without importing game module at training time."""
     batnas: dict[str, dict[str, float]] = {
         "saas_enterprise": {"seller": 125_000, "buyer": 165_000},
+        "hiring_package": {"seller": 195_000, "buyer": 264_500},  # match game/scenarios (widened zopa)
         "acquisition_term_sheet": {"seller": 10_500_000, "buyer": 16_000_000},
     }
     return float(batnas.get(scenario_id, {}).get(side, 0))
 
     data_path: str,
     output_dir: str,
     steps: int = 500,
+    min_reward: float = -50.0,
 ) -> None:
     """
     GRPO training loop.
 
         format_reward,
     )
 
+    dataset = build_grpo_dataset(data_path, min_reward=min_reward)
     if len(dataset) == 0:
         raise ValueError("Empty GRPO dataset. Run generate_data.py first.")
 
     parser.add_argument("--model", default="models/parlay-sft")
     parser.add_argument("--base_model", default="")
     parser.add_argument("--data", default="data/episodes.jsonl")
+    parser.add_argument(
+        "--min-reward",
+        type=float,
+        default=-50.0,
+        help="Skip JSONL train rows with total reward below this (default: -50.0)",
+    )
     parser.add_argument("--output", default="models/parlay-grpo")
     parser.add_argument("--steps", type=int, default=GRPO_STEPS)
     parser.add_argument("--g", type=int, default=GRPO_GENERATIONS)
 
     logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
     GRPO_GENERATIONS = args.g
     model_path = args.base_model or args.model
+    train_grpo(model_path, args.data, args.output, args.steps, min_reward=args.min_reward)
 
     if args.save_curves:
         curves_path = Path(args.save_curves)
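
A note on the `hiring_package` row in `_get_batna`: bumping the buyer BATNA from 230,000 to 264,500 roughly doubles the hiring ZOPA, from 230,000 - 195,000 = 35,000 to 264,500 - 195,000 = 69,500, assuming `_get_zopa_width` returns buyer BATNA minus seller BATNA (an assumption; its body is outside this diff). An illustrative check:

```python
# Illustrative only: recompute ZOPA widths from the BATNA table in _get_batna,
# assuming zopa_width = buyer BATNA - seller BATNA (the real _get_zopa_width
# is not shown in this diff).
batnas = {
    "saas_enterprise": {"seller": 125_000, "buyer": 165_000},
    "hiring_package": {"seller": 195_000, "buyer": 264_500},  # was buyer=230_000
    "acquisition_term_sheet": {"seller": 10_500_000, "buyer": 16_000_000},
}
for scenario, b in batnas.items():
    print(f"{scenario}: zopa_width = {b['buyer'] - b['seller']:,}")
# hiring_package: zopa_width = 69,500 (up from 230_000 - 195_000 = 35_000)
```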
training/random_baseline.py CHANGED
@@ -1,125 +1,138 @@
-"""
-Random-policy baseline for Parlay.
-Runs N episodes with purely random move selection (no Gemini API — always
-uses mock mode) and writes a summary JSON that the training notebook uses
-to benchmark SFT / GRPO improvement.
-
-Usage:
-    python training/random_baseline.py
-    python training/random_baseline.py --episodes 20 --output data/random_baseline.json
-"""
 import argparse
 import asyncio
 import json
 import logging
-import os
 import random
-import statistics
-import sys
 from pathlib import Path
 
-# Repo root on sys.path when run as `python training/random_baseline.py`
-sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
-
-# Force mock mode — random baseline never calls the real Gemini API
-os.environ.pop("GOOGLE_API_KEY", None)
-
-from agent.runner import run_episode
-from game.scenarios import SCENARIOS
-from parlay_env.models import PersonaType
 
 logger = logging.getLogger(__name__)
 
-REQUIRED_COMBINATIONS = [
-    (persona, scenario)
-    for persona in ["shark", "diplomat", "veteran"]
-    for scenario in ["saas_enterprise", "hiring_package", "acquisition_term_sheet"]
 ]
 
 
 async def _run_baseline(episodes: int) -> list[dict]:
-    """Run `episodes` random-policy episodes and return per-episode stats."""
-    results = []
-    seed = 0
     for i in range(episodes):
-        persona_str, scenario_id = REQUIRED_COMBINATIONS[i % len(REQUIRED_COMBINATIONS)]
         try:
-            result = await run_episode(
-                persona=PersonaType(persona_str),
-                scenario_id=scenario_id,
-                inject_noise=True,  # random noise simulates random policy
-                force_drift=random.random() < 0.4,
-                seed=seed,
-                max_turns=14,
-            )
-            results.append({
-                "persona": persona_str,
-                "scenario_id": scenario_id,
-                "reward": result.grade.total_reward,
-                "deal_efficiency": result.grade.deal_efficiency,
-                "deal_reached": result.final_price is not None,
-                "tom_accuracy_avg": result.grade.tom_accuracy_avg,
-                "drift_adapted": result.grade.drift_adapted,
-                "termination_reason": result.grade.termination_reason,
-            })
         except Exception as exc:
-            logger.warning("Baseline episode %d failed (%s, %s): %s", i, persona_str, scenario_id, exc)
-        seed += 1
-    return results
-
-
-def _summarise(results: list[dict]) -> dict:
-    if not results:
-        return {"error": "no episodes completed", "n_episodes": 0}
-
-    rewards = [r["reward"] for r in results]
-    efficiencies = [r["deal_efficiency"] for r in results]
-    deal_count = sum(1 for r in results if r["deal_reached"])
-    drift_count = sum(1 for r in results if r["drift_adapted"])
-
     return {
-        "n_episodes": len(results),
-        "mean_reward": round(statistics.mean(rewards), 3),
-        "std_reward": round(statistics.stdev(rewards) if len(rewards) > 1 else 0.0, 3),
-        "min_reward": round(min(rewards), 3),
-        "max_reward": round(max(rewards), 3),
-        "mean_efficiency": round(statistics.mean(efficiencies), 4),
-        "deal_rate": round(deal_count / len(results), 4),
-        "drift_adapted_rate": round(drift_count / len(results), 4),
-        "policy": "random_mock",
-        "note": (
-            "Baseline uses Parlay mock responses (no real Gemini API). "
-            "Compare mean_reward and mean_efficiency against SFT/GRPO runs."
-        ),
     }
 
 
 def main() -> None:
-    parser = argparse.ArgumentParser(description="Parlay random-policy baseline")
-    parser.add_argument("--episodes", type=int, default=27,
-                        help="Number of baseline episodes (default: 27 = 3 per combo)")
-    parser.add_argument("--output", type=str, default="data/random_baseline.json",
-                        help="Output path for the baseline JSON summary")
     args = parser.parse_args()
 
     logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
-    logging.getLogger("httpx").setLevel(logging.WARNING)
-
-    print(f"Running {args.episodes} random-policy episodes (mock mode, no API key)…")
-    results = asyncio.run(_run_baseline(args.episodes))
-
-    summary = _summarise(results)
 
     out_path = Path(args.output)
     out_path.parent.mkdir(parents=True, exist_ok=True)
-    with open(out_path, "w", encoding="utf-8") as f:
-        json.dump(summary, f, indent=2)
-
-    print(f"\nBaseline complete ({summary['n_episodes']} episodes):")
-    print(f" Mean reward : {summary.get('mean_reward', 'N/A')}")
-    print(f" Mean efficiency : {summary.get('mean_efficiency', 'N/A')}")
-    print(f" Deal rate : {summary.get('deal_rate', 'N/A'):.1%}")
-    print(f" Written to : {out_path.resolve()}")
 
 
 if __name__ == "__main__":
+"""Random-action baseline for Parlay."""
 import argparse
 import asyncio
 import json
 import logging
 import random
 from pathlib import Path
 
+from parlay_env.grader import grade_episode
+from parlay_env.models import TacticalMove
+from parlay_env.server import _handle_reset, _handle_step, get_session_state
 
 logger = logging.getLogger(__name__)
 
+PERSONAS = ["shark", "diplomat", "veteran"]
+SCENARIOS = ["saas_enterprise", "hiring_package", "acquisition_term_sheet"]
+RANDOM_LINES = [
+    "Let's keep talking.",
+    "I can move a bit.",
+    "This is my proposal.",
+    "We should find middle ground.",
+    "Given that context, here's my number.",
 ]
 
 
+def _mean(values: list[float]) -> float:
+    return sum(values) / len(values) if values else 0.0
+
+
+async def _run_single_episode(scenario_id: str, persona: str, seed: int) -> dict:
+    random.seed(seed)
+    reset = await _handle_reset({"scenario_id": scenario_id, "persona": persona, "seed": seed})
+    session_id = str(reset["session_id"])
+    final_price = None
+    t_close = None
+    done = False
+
+    while not done:
+        state = get_session_state(session_id)
+        if state is None:
+            break
+        if state.episode_done:
+            break
+
+        low = state.hidden_state.walk_away_price
+        high = state.hidden_state.budget_ceiling
+        offer = round(random.uniform(low, high), 2)
+
+        moves: list[TacticalMove | None] = [None]
+        if state.credibility_points >= 0:
+            moves.append(TacticalMove.ANCHOR_HIGH)
+        if state.credibility_points >= 5:
+            moves.append(TacticalMove.SILENCE)
+        if state.credibility_points >= 20:
+            moves.append(TacticalMove.BATNA_REVEAL)
+        move = random.choice(moves)
+
+        payload = {
+            "session_id": session_id,
+            "action": {
+                "utterance": random.choice(RANDOM_LINES),
+                "offer_amount": offer,
+                "tactical_move": move.value if move else None,
+            },
+        }
+        step = await _handle_step(payload)
+        done = bool(step.get("done", False))
+
+        state = get_session_state(session_id)
+        if state and state.deal_reached and final_price is None:
+            final_price = offer
+            t_close = state.step_count
+
+    state = get_session_state(session_id)
+    if state is None:
+        raise RuntimeError(f"Missing session state for {session_id}")
+    grade = grade_episode(state, final_price=final_price, t_close=t_close, t_max=20)
+    return {
+        "avg_reward": float(grade.total_reward),
+        "deal_rate": 1.0 if final_price is not None else 0.0,
+        "avg_efficiency": float(grade.deal_efficiency),
+        "avg_tom_accuracy": float(grade.tom_accuracy_avg),
+        "bluffs_caught": int(grade.bluffs_caught),
+    }
+
+
 async def _run_baseline(episodes: int) -> list[dict]:
+    rows: list[dict] = []
     for i in range(episodes):
+        persona = PERSONAS[i % len(PERSONAS)]
+        scenario = SCENARIOS[(i // len(PERSONAS)) % len(SCENARIOS)]
         try:
+            rows.append(await _run_single_episode(scenario, persona, i + 7))
         except Exception as exc:
+            logger.warning("Baseline episode %d failed (%s/%s): %s", i + 1, scenario, persona, exc)
+    return rows
+
+
+def _summarise(rows: list[dict], episodes_requested: int) -> dict:
+    if not rows:
+        return {
+            "episodes_requested": episodes_requested,
+            "episodes_completed": 0,
+            "avg_reward": 0.0,
+            "deal_rate": 0.0,
+            "avg_efficiency": 0.0,
+            "avg_tom_accuracy": 0.0,
+            "bluffs_caught": 0,
+        }
     return {
+        "episodes_requested": episodes_requested,
+        "episodes_completed": len(rows),
+        "avg_reward": round(_mean([r["avg_reward"] for r in rows]), 4),
+        "deal_rate": round(_mean([r["deal_rate"] for r in rows]), 4),
+        "avg_efficiency": round(_mean([r["avg_efficiency"] for r in rows]), 4),
+        "avg_tom_accuracy": round(_mean([r["avg_tom_accuracy"] for r in rows]), 4),
+        "bluffs_caught": int(sum(r["bluffs_caught"] for r in rows)),
     }
 
 
 def main() -> None:
+    parser = argparse.ArgumentParser(description="Parlay random baseline")
+    parser.add_argument("--episodes", type=int, default=20)
+    parser.add_argument("--output", default="results/baseline.json")
     args = parser.parse_args()
 
     logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
+    rows = asyncio.run(_run_baseline(args.episodes))
+    summary = _summarise(rows, args.episodes)
 
     out_path = Path(args.output)
     out_path.parent.mkdir(parents=True, exist_ok=True)
+    out_path.write_text(json.dumps(summary, indent=2), encoding="utf-8")
+    print(json.dumps(summary, indent=2))
+    print(f"\nSaved random baseline to {out_path.resolve()}")
 
 
 if __name__ == "__main__":
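
The rewrite drives the environment's `_handle_reset` / `_handle_step` handlers directly rather than the old mocked `run_episode` path, and swaps the flat combination list for a persona-fastest rotation. A small sketch of the schedule implied by the indexing in `_run_baseline` (not part of the commit): episodes 0-8 already cover all nine persona x scenario pairs, so the default `--episodes 20` gives each pair two or three runs.

```python
# Mirror of the baseline's episode schedule (indexing copied from _run_baseline).
PERSONAS = ["shark", "diplomat", "veteran"]
SCENARIOS = ["saas_enterprise", "hiring_package", "acquisition_term_sheet"]

for i in range(9):
    persona = PERSONAS[i % len(PERSONAS)]
    scenario = SCENARIOS[(i // len(PERSONAS)) % len(SCENARIOS)]
    print(f"episode {i}: {persona} x {scenario} (seed={i + 7})")
```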
training/sft_train.py CHANGED
@@ -1,177 +1,119 @@
 """
-Stage 1: SFT warmup on best episodes (efficiency >= threshold).
-Fine-tunes Qwen2.5-7B-Instruct on demonstrations of successful negotiation.
-
-Applies episode quality filters (offers + reward outliers) and stable SFT target
-metadata (log-scaled efficiency, clipped reward) when building training text.
-
-Usage:
-    python -m training.sft_train \
-        --data data/episodes.jsonl \
-        --model Qwen/Qwen2.5-7B-Instruct \
-        --output models/parlay-sft \
-        --threshold 0.30
 """
 import argparse
 import json
 import logging
-import os
 from pathlib import Path
 
-from .episode_filters import (
-    SFTFilterConfig,
-    clip_reward_for_label,
-    efficiency_sft_label,
-    episode_passes_sft_filters,
-)
-
 logger = logging.getLogger(__name__)
 
-TOP_PLAYER_THRESHOLD = float(os.getenv("TOP_PLAYER_THRESHOLD", "0.30"))
-BASE_MODEL = os.getenv("BASE_MODEL", "Qwen/Qwen2.5-7B-Instruct")
 
 
-def load_sft_dataset(
-    jsonl_path: Path,
-    threshold: float = 0.30,
-    filter_cfg: SFTFilterConfig | None = None,
-    include_sft_targets: bool = True,
-):
-    """
-    Load episodes above efficiency threshold and format for SFT.
-
-    Only 'train' split episodes above the threshold are included.
-    Rows failing quality filters (broken offers, extreme rewards) are skipped.
-    Each agent turn becomes one training example.
-
-    Args:
-        jsonl_path: Path to the JSONL episodes file.
-        threshold: Minimum deal_efficiency to include.
-        filter_cfg: Drop/clip thresholds; default SFTFilterConfig().
-        include_sft_targets: If True, embed eff_log and reward_clip in the example text.
-
-    Returns:
-        HuggingFace Dataset with 'text' column.
-    """
     try:
         from datasets import Dataset
     except ImportError as exc:
         raise ImportError("Install datasets: pip install datasets") from exc
 
-    filter_cfg = filter_cfg or SFTFilterConfig()
-    records = []
-    skipped_filter = 0
-    with open(jsonl_path, encoding="utf-8") as f:
-        for line in f:
-            rec = json.loads(line.strip())
-            ok, _reason = episode_passes_sft_filters(rec, filter_cfg)
-            if not ok:
-                skipped_filter += 1
                 continue
-            if rec.get("deal_efficiency", 0) >= threshold and rec.get("split") == "train":
-                conversation = rec.get("conversation", [])
-                eff_l = efficiency_sft_label(rec.get("deal_efficiency"))
-                r_clip = clip_reward_for_label(rec.get("reward"), filter_cfg)
-                for i, turn in enumerate(conversation[:-1]):
-                    if turn.get("role") == "negotiator":
-                        context = conversation[:i]
-                        records.append({
-                            "text": _format_sft_example(
-                                system=rec["prompt"],
-                                context=context,
-                                response=turn["content"],
-                                efficiency_label=eff_l,
-                                reward_clip=r_clip,
-                                include_sft_targets=include_sft_targets,
-                            )
-                        })
-
-    logger.info(
-        f"SFT dataset: {len(records)} training examples from {jsonl_path} "
-        f"(skipped {skipped_filter} episodes by quality filter)"
-    )
-    return Dataset.from_list(records)
-
-
-def _format_sft_example(
-    system: str,
-    context: list[dict],
-    response: str,
-    efficiency_label: float,
-    reward_clip: float,
-    include_sft_targets: bool = True,
-) -> str:
-    """Format a single SFT training example in chat template format."""
-    history_lines = []
-    for turn in context:
-        role = turn.get("role", "unknown").upper()
-        content = turn.get("content", "")
-        history_lines.append(f"{role}: {content}")
-    history = "\n".join(history_lines)
-
-    targets_block = ""
-    if include_sft_targets:
-        targets_block = (
-            f"<|sft_targets|>eff_log={efficiency_label:.4f} "
-            f"reward_clip={reward_clip:.2f}</s>\n"
-        )
-
-    return (
-        f"<|system|>{system}</s>\n"
-        f"{targets_block}"
-        f"<|negotiation_history|>{history}</s>\n"
-        f"<|assistant|>{response}</s>"
     )
 
 
 def train_sft(
-    data_path: Path,
-    model_id: str,
-    output_dir: Path,
-    threshold: float = 0.30,
-    filter_cfg: SFTFilterConfig | None = None,
-    include_sft_targets: bool = True,
-) -> Path:
-    """
-    Run SFT fine-tuning.
-
-    Args:
-        data_path: Path to episodes JSONL.
-        model_id: HuggingFace model ID or local path.
-        output_dir: Where to save the trained model.
-        threshold: Efficiency filter for training data.
-        filter_cfg: Quality filter / clip config.
-        include_sft_targets: Embed normalized targets in training strings.
-
-    Returns:
-        output_dir path.
-    """
     import torch
-    if not torch.cuda.is_available():
-        logger.warning("No GPU detected SFT will be very slow on CPU")
-
-    try:
-        from peft import LoraConfig
-        from trl import SFTTrainer, SFTConfig
-    except ImportError as exc:
-        raise ImportError("Install: pip install trl peft") from exc
 
-    filter_cfg = filter_cfg or SFTFilterConfig()
-    dataset = load_sft_dataset(
-        data_path, threshold, filter_cfg, include_sft_targets=include_sft_targets
-    )
-    if len(dataset) == 0 and threshold > 0.0:
-        logger.warning(
-            f"No episodes above threshold {threshold}. Lowering to 0.0 (all train rows)."
-        )
-        dataset = load_sft_dataset(
-            data_path, 0.0, filter_cfg, include_sft_targets=include_sft_targets
-        )
-    if len(dataset) == 0:
-        raise RuntimeError(
-            "SFT dataset is empty. "
-            "Run generate_data.py first with --episodes >= 200"
-        )
 
     lora_config = LoraConfig(
         r=16,
@@ -188,16 +130,16 @@ def train_sft(
         per_device_train_batch_size=4,
         gradient_accumulation_steps=4,
         learning_rate=2e-4,
-        warmup_ratio=0.05,
-        lr_scheduler_type="cosine",
         logging_steps=10,
         save_strategy="epoch",
-        push_to_hub=False,
-        bf16=torch.cuda.is_available(),
-        max_seq_length=2048,
         report_to="none",
     )
 
     trainer = SFTTrainer(
         model=model_id,
         args=training_args,
@@ -205,46 +147,27 @@ def train_sft(
         peft_config=lora_config,
     )
 
-    logger.info(f"Starting SFT training: model={model_id}, examples={len(dataset)}, epochs=3")
     trainer.train()
     trainer.save_model(str(output_dir))
-    logger.info(f"SFT training complete. Model saved to {output_dir}")
-    return output_dir
 
 
 def main() -> None:
-    parser = argparse.ArgumentParser(description="Parlay SFT warmup training")
     parser.add_argument("--data", default="data/episodes.jsonl")
-    parser.add_argument("--model", default=BASE_MODEL)
-    parser.add_argument("--output", default="models/parlay-sft")
-    parser.add_argument("--steps", type=int, default=0, help="Notebook compatibility flag")
-    parser.add_argument("--threshold", type=float, default=TOP_PLAYER_THRESHOLD)
-    parser.add_argument("--reward-drop-min", type=float, default=-400.0)
-    parser.add_argument("--reward-drop-max", type=float, default=400.0)
-    parser.add_argument("--clip-reward-min", type=float, default=-200.0)
-    parser.add_argument("--clip-reward-max", type=float, default=200.0)
     parser.add_argument(
-        "--no-sft-targets",
-        action="store_true",
-        help="Omit eff_log / reward_clip block from training text",
     )
     args = parser.parse_args()
 
     logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
-    cfg = SFTFilterConfig(
-        reward_drop_min=args.reward_drop_min,
-        reward_drop_max=args.reward_drop_max,
-        clip_reward_min=args.clip_reward_min,
-        clip_reward_max=args.clip_reward_max,
-    )
-    train_sft(
-        Path(args.data),
-        args.model,
-        Path(args.output),
-        args.threshold,
-        filter_cfg=cfg,
-        include_sft_targets=not args.no_sft_targets,
-    )
 
 
 if __name__ == "__main__":
 """
+Run before grpo_train.py for SFT→GRPO pipeline. Pass checkpoint path as
+BASE_MODEL env var to grpo_train.py.
 """
 import argparse
 import json
 import logging
 from pathlib import Path
 
 logger = logging.getLogger(__name__)
 
+DEFAULT_MODEL = "Qwen/Qwen2.5-1.5B-Instruct"
+DEFAULT_OUTPUT = "checkpoints/sft_1.5b/"
+
+
+def _extract_completions(rec: dict) -> list[str]:
+    """Return candidate completion texts from a record."""
+    completion = rec.get("completion")
+    if isinstance(completion, str) and completion.strip():
+        return [completion.strip()]
 
+    conversation = rec.get("conversation", [])
+    candidates: list[str] = []
+    if isinstance(conversation, list):
+        for turn in conversation:
+            if not isinstance(turn, dict):
+                continue
+            role = str(turn.get("role", "")).lower()
+            content = str(turn.get("content", "")).strip()
+            if role == "negotiator" and content:
+                candidates.append(content)
+    return candidates
 
 
+def _row_total_reward(rec: dict) -> float | None:
+    v = rec.get("reward")
+    if v is not None:
+        return float(v)
+    v2 = rec.get("cumulative_reward")
+    if v2 is not None:
+        return float(v2)
+    return None
 
 
+def load_sft_dataset(data_path: Path, min_reward: float = -50.0):
+    """Build a text dataset from JSONL prompt/completion pairs."""
     try:
         from datasets import Dataset
     except ImportError as exc:
         raise ImportError("Install datasets: pip install datasets") from exc
 
+    rows: list[dict[str, str]] = []
+    skipped = 0
+    reward_filtered = 0
+    remaining_records = 0
+    with data_path.open("r", encoding="utf-8") as f:
+        for line_no, line in enumerate(f, start=1):
+            line = line.strip()
+            if not line:
                 continue
+            try:
+                rec = json.loads(line)
+            except json.JSONDecodeError:
+                logger.warning("Skipping malformed JSONL row %d", line_no)
+                skipped += 1
+                continue
+
+            r = _row_total_reward(rec)
+            if r is not None and r < min_reward:
+                reward_filtered += 1
+                continue
+
+            prompt = str(rec.get("prompt", "")).strip()
+            if not prompt:
+                logger.warning("Skipping row %d: missing prompt", line_no)
+                skipped += 1
+                continue
+
+            completions = _extract_completions(rec)
+            if not completions:
+                logger.warning("Skipping row %d: missing completion and negotiator turns", line_no)
+                skipped += 1
+                continue
+
+            remaining_records += 1
+            for completion in completions:
+                rows.append(
+                    {
+                        "text": (
+                            f"<|system|>{prompt}</s>\n"
+                            f"<|assistant|>{completion}</s>"
+                        )
+                    }
+                )
+
+    print(
+        f"Filtered {reward_filtered} records below min_reward={min_reward}, "
+        f"{remaining_records} remaining for SFT"
     )
+    if skipped:
+        logger.info("Also skipped %d malformed/empty JSONL rows; expanded to %d text rows", skipped, len(rows))
+    if not rows:
+        raise RuntimeError("No valid SFT examples found in dataset.")
+    return Dataset.from_list(rows)
 
 
 def train_sft(
+    data_path: Path, model_id: str, output_dir: Path, min_reward: float = -50.0
+) -> None:
+    """Fine-tune a base model with LoRA via TRL SFTTrainer."""
     import torch
+    from peft import LoraConfig
+    from trl import SFTConfig, SFTTrainer
 
+    dataset = load_sft_dataset(data_path, min_reward=min_reward)
+    output_dir.mkdir(parents=True, exist_ok=True)
 
     lora_config = LoraConfig(
         r=16,
 
         per_device_train_batch_size=4,
         gradient_accumulation_steps=4,
         learning_rate=2e-4,
         logging_steps=10,
         save_strategy="epoch",
+        fp16=True,
         report_to="none",
+        max_seq_length=2048,
     )
 
+    if not torch.cuda.is_available():
+        logger.warning("No CUDA GPU detected; training may be very slow.")
+
     trainer = SFTTrainer(
         model=model_id,
         args=training_args,
 
         peft_config=lora_config,
     )
 
+    logger.info("Starting SFT: model=%s, examples=%d", model_id, len(dataset))
     trainer.train()
     trainer.save_model(str(output_dir))
+    logger.info("Saved SFT checkpoint to %s", output_dir)
 
 
 def main() -> None:
+    parser = argparse.ArgumentParser(description="Parlay SFT training")
     parser.add_argument("--data", default="data/episodes.jsonl")
+    parser.add_argument("--model", default=DEFAULT_MODEL)
+    parser.add_argument("--output", default=DEFAULT_OUTPUT)
     parser.add_argument(
+        "--min-reward",
+        type=float,
+        default=-50.0,
+        help="Skip JSONL records with total reward below this (default: -50.0)",
    )
     args = parser.parse_args()
 
     logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
+    train_sft(Path(args.data), args.model, Path(args.output), min_reward=args.min_reward)
 
 
 if __name__ == "__main__":
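
Putting the stages together as the new docstrings describe, a hedged sketch of the SFT -> GRPO hand-off (module paths and flags are taken from the argparse definitions in these diffs; since how grpo_train.py consumes the `BASE_MODEL` env var beyond its module-level default is not visible here, the sketch also passes `--base_model`, which `main()` demonstrably uses):

```python
# Sketch only: SFT warmup into checkpoints/sft_1.5b/, then GRPO on top of it.
import os
import subprocess

subprocess.run(
    ["python", "-m", "training.sft_train",
     "--data", "data/episodes.jsonl",
     "--output", "checkpoints/sft_1.5b/",
     "--min-reward=-50.0"],
    check=True,
)

subprocess.run(
    ["python", "-m", "training.grpo_train",
     "--base_model", "checkpoints/sft_1.5b/",  # main(): args.base_model or args.model
     "--data", "data/episodes.jsonl",
     "--output", "models/parlay-grpo",
     "--min-reward=-50.0"],
    check=True,
    # BASE_MODEL per the module comment in grpo_train.py; harmless if redundant.
    env={**os.environ, "BASE_MODEL": "checkpoints/sft_1.5b/"},
)
```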