sh4shv4t committed
Commit 00a2188 · 1 Parent(s): 1b1c2d9

Relocate training notebooks, add BLOG and Google Colab links (SFT + GRPO HF Job), dashboard updates, and eval artifacts
.gitignore CHANGED
@@ -1,6 +1,9 @@
 # Python
 __pycache__/
+**/__pycache__/
 *.py[cod]
+*.pyc
+*.pyo
 *.pyo
 .Python
 build/
@@ -16,7 +19,9 @@ parlay.db
 telemetry.json
 data/
 models/
-results/
+results/*
+!results/eval_results.json
+!results/random_baseline.json
 
 # Environment
 .env
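The `results/*` pattern (rather than `results/`) matters here: git cannot re-include a file whose parent directory is itself excluded, so the directory's contents are ignored instead of the directory. A rough sketch of the last-match-wins semantics, using `fnmatch` as an approximation of git's matcher (not git's actual algorithm):

```python
from fnmatch import fnmatch

# (pattern, ignores?) pairs in .gitignore order; the LAST matching pattern wins.
PATTERNS = [
    ("results/*", True),                   # ignore everything inside results/
    ("results/eval_results.json", False),  # ...except the committed eval artifacts
    ("results/random_baseline.json", False),
]

def is_ignored(path: str) -> bool:
    """Approximate gitignore matching: scan all patterns, keep the last verdict."""
    ignored = False
    for pattern, ignores in PATTERNS:
        if fnmatch(path, pattern):
            ignored = ignores
    return ignored
```

With these patterns, `results/grpo_reward_curve.png` stays untracked while the two baseline JSON files survive a fresh clone.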
.pre-commit-config.yaml ADDED
@@ -0,0 +1,11 @@
+# Optional: pip install pre-commit && pre-commit install
+repos:
+  - repo: local
+    hooks:
+      - id: no-pycache-in-commit
+        name: Forbid __pycache__ and .pyc in commits
+        entry: python scripts/check_staged_not_pycache.py
+        language: system
+        pass_filenames: false
+        always_run: true
+        stages: [pre-commit]
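The hook's entry point, `scripts/check_staged_not_pycache.py`, is not part of this diff. A minimal sketch of what such a check might look like (the `offending` helper and its exact messages are assumptions, not the repo's actual script):

```python
import subprocess
import sys

def offending(paths: list[str]) -> list[str]:
    """Staged paths that touch a __pycache__ directory or compiled bytecode."""
    return [
        p for p in paths
        if "__pycache__" in p.split("/") or p.endswith((".pyc", ".pyo"))
    ]

def main() -> int:
    # `git diff --cached --name-only` lists exactly the paths staged for commit.
    staged = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    bad = offending(staged)
    for p in bad:
        print(f"refusing to commit: {p}", file=sys.stderr)
    return 1 if bad else 0  # non-zero exit makes pre-commit abort the commit

if __name__ == "__main__":
    sys.exit(main())
```

Because the hook uses `language: system` with `always_run: true` and `pass_filenames: false`, the script is responsible for querying git itself rather than receiving filenames from pre-commit.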
parlay_hf_article.md → BLOG.md RENAMED
@@ -1,5 +1,9 @@
 # ◈ Parlay — I Built an AI That Finally Beats Me at Negotiation
 
+<p align="center">
+  <img src="images/Parlay_square%20logo.png" alt="Parlay logo" width="220">
+</p>
+
 *Teaching language models to close deals under hidden information, bluffing, and a world that doesn't stand still.*
 
 ---
@@ -130,6 +134,8 @@ No prior negotiation RL paper has this layer.
 
 ## Training Pipeline
 
+**Google Colab:** [parlay_sft_colab](https://colab.research.google.com/drive/1x5uZMbdKF7XeDNm-bM5YSPdpd1srgArA?usp=sharing) · [parlay_grpo_hf_job](https://colab.research.google.com/drive/1DNYogmRlR_YJrEO6GN3YC7xj8lfycDuL?usp=sharing) (in-repo: [`training/notebooks/parlay_sft_colab.ipynb`](https://github.com/sh4shv4t/Parlay/blob/main/training/notebooks/parlay_sft_colab.ipynb), plus [`training/GRPO_HF_RUNBOOK.md`](https://github.com/sh4shv4t/Parlay/blob/main/training/GRPO_HF_RUNBOOK.md) / `scripts/hf_grpo_entry.sh` for the job-style GRPO run).
+
 ```text
 Gemini self-play (generate_data.py)
 → 80 quality-filtered episodes across 9 persona×scenario combos
README.md CHANGED
@@ -8,24 +8,27 @@ pinned: false
 tags: ["openenv", "hackathon", "rl", "gametheory"]
 ---
 
+<p align="center">
+  <img src="images/Parlay_square%20logo.png" alt="Parlay logo" width="220">
+</p>
+
 # Parlay ◈ — The Arena Where AIs Learn to Close
 
 **[▶ Play Now — HuggingFace Space](https://huggingface.co/spaces/sh4shv4t/Parlay)** |
-[Blog Post](https://huggingface.co/blog/sh4shv4t/parlay) |
+[Blog Post](BLOG.md) |
 [SFT Model](https://huggingface.co/sh4shv4t/parlay-sft-1-5b) |
 [GRPO Model](https://huggingface.co/sh4shv4t/parlay-grpo-1-5b) |
 [Dataset](https://huggingface.co/datasets/sh4shv4t/parlay-episodes) |
 [Training (HF / TRL pipeline)](training/notebooks/parlay_training.ipynb) |
 [OpenEnv reset/step rollouts](training/notebooks/openenv_rollout_training.ipynb) |
+[SFT — Colab (`parlay_sft_colab`)](https://colab.research.google.com/drive/1x5uZMbdKF7XeDNm-bM5YSPdpd1srgArA?usp=sharing) |
+[GRPO HF Job — Colab (`parlay_grpo_hf_job`)](https://colab.research.google.com/drive/1DNYogmRlR_YJrEO6GN3YC7xj8lfycDuL?usp=sharing) |
 [OpenEnv Manifest](openenv.yaml)
 
 ![Python 3.11](https://img.shields.io/badge/Python-3.11-blue)
 ![OpenEnv Compliant](https://img.shields.io/badge/OpenEnv-Compliant-00C853)
 ![MIT License](https://img.shields.io/badge/License-MIT-green)
 ![HF Spaces](https://img.shields.io/badge/HF%20Spaces-Ready-yellow)
-
-`Python 3.11` | `FastAPI` | `Gemini` | `GRPO` | `OpenEnv WebSocket`
-
 ---
 
 ## The Problem
@@ -299,6 +302,15 @@ python smoke_test.py # 7 integration tests
 pytest tests/ -v
 ```
 
+### Git hooks (optional)
+
+```bash
+pip install -r requirements-dev.txt
+pre-commit install
+```
+
+Staged `__pycache__` paths and `.pyc` / `.pyo` files are blocked by a local pre-commit check (see `scripts/check_staged_not_pycache.py`).
+
 ### Focused modules (optional)
 
 ```bash
README_SPACES.md CHANGED
@@ -9,6 +9,10 @@ pinned: true
 tags: ["openenv", "hackathon", "rl", "gametheory"]
 ---
 
+<p align="center">
+  <img src="images/Parlay_square%20logo.png" alt="Parlay logo" width="200">
+</p>
+
 Parlay is a negotiation RL environment and playable browser game where agents bargain under hidden information.
 It combines Theory-of-Mind belief tracking, dynamic ZOPA erosion under sustained tension, and tactical negotiation moves.
 This Space exposes the live game UI and OpenEnv-style WebSocket flow so you can test policies interactively.
dashboard/api.py CHANGED
@@ -101,6 +101,20 @@ class SetOpponentRequest(BaseModel):
     model: str  # "trained" | "gemini"
 
 
+class ModelChatMessage(BaseModel):
+    role: str  # "user" | "assistant"
+    text: str
+
+
+class ModelChatRequest(BaseModel):
+    message: str
+    scenario_id: str = "saas_enterprise"
+    persona: str = "shark"
+    history: list[ModelChatMessage] = []
+    temperature: float = 0.7
+    max_tokens: int = 300
+
+
 def _get_tension(state: ParlayState, player_move: Optional[TacticalMove], opponent_move: Optional[TacticalMove]) -> float:
     base = 20.0 + ((state.step_count + 1) / MAX_TURNS) * 55.0
     if player_move == TacticalMove.ANCHOR_HIGH or opponent_move == TacticalMove.ANCHOR_HIGH:
@@ -345,6 +359,23 @@ def _training_status_payload() -> dict[str, Any]:
             rnd = float(rnd)
         except Exception:  # noqa: BLE001
             has_results = False
+    if rnd is None and eval_path.is_file():
+        for baseline_path in (
+            _RESULTS_DIR / "random_baseline.json",
+            _RESULTS_DIR / "baseline.json",
+        ):
+            if not baseline_path.is_file():
+                continue
+            try:
+                blob = json.loads(baseline_path.read_text(encoding="utf-8"))
+                v = blob.get("mean_reward")
+                if v is None:
+                    v = blob.get("avg_reward")
+                if v is not None:
+                    rnd = float(v)
+                    break
+            except Exception:  # noqa: BLE001
+                continue
     repo = (os.environ.get("HF_MODEL_REPO") or "").strip() or None
 
     sft_loss_path: str | None = None
@@ -365,6 +396,14 @@ def _training_status_payload() -> dict[str, Any]:
     elif (_RESULTS_DIR / "grpo_loss_curve.png").is_file():
         grpo_loss_url = "/results/grpo_loss_curve.png"
 
+    comparison_url: str | None = None
+    if (_RESULTS_DIR / "training_curves.png").is_file():
+        comparison_url = "/results/training_curves.png"
+    elif (_IMAGES_DIR / "training_curves.png").is_file():
+        comparison_url = "/images/training_curves.png"
+    elif (_IMAGES_DIR / "comparison.png").is_file():
+        comparison_url = "/images/comparison.png"
+
     return {
         "has_results": has_results,
         "grpo_mean_reward": grpo,
@@ -375,10 +414,11 @@ def _training_status_payload() -> dict[str, Any]:
         "sft_loss_url": sft_loss_path,
         "grpo_reward_url": grpo_reward_url,
         "grpo_loss_url": grpo_loss_url,
+        "comparison_url": comparison_url,
         "plots_available": {
             "reward_curve": grpo_reward_url is not None,
             "grpo_loss": grpo_loss_url is not None,
-            "comparison": (_RESULTS_DIR / "training_curves.png").is_file(),
+            "comparison": comparison_url is not None,
             "transcript": (_RESULTS_DIR / "before_after_transcript.html").is_file(),
             "sft_loss": sft_loss_path is not None,
         },
@@ -391,6 +431,229 @@ async def get_training_status() -> dict:
     return _training_status_payload()
 
 
+# Default model ID for docs / judge UI (see openenv.yaml grpo_model)
+GRPO_MODEL_REPO_DEFAULT = "sh4shv4t/parlay-grpo-1-5b"
+
+
+@router.get("/judge-config")
+async def get_judge_config() -> dict:
+    """
+    Status for the /judge page: whether Hub weights are configured and current opponent mode.
+    """
+    repo = (os.environ.get("HF_MODEL_REPO") or "").strip() or None
+    return {
+        "hf_model_configured": bool(repo),
+        "model_repo": repo,
+        "suggested_grpo_repo": GRPO_MODEL_REPO_DEFAULT,
+        "opponent_mode": OPPONENT_MODE,
+    }
+
+
+@router.get("/model/info")
+async def model_info() -> dict:
+    """
+    Status + metadata for the /interact page.
+    Reports whether the GRPO Hub model is reachable and what repo is configured.
+    """
+    repo = (os.environ.get("HF_MODEL_REPO") or "").strip() or None
+    fallback_repo = GRPO_MODEL_REPO_DEFAULT
+    hub_url = f"https://huggingface.co/{repo or fallback_repo}"
+    return {
+        "configured": bool(repo),
+        "model_repo": repo or fallback_repo,
+        "hub_url": hub_url,
+        "base_model": "Qwen/Qwen2.5-1.5B-Instruct",
+        "training": "GRPO (TRL) on Parlay negotiation self-play episodes",
+        "note": (
+            "Model outputs structured JSON: utterance, optional offer_amount, optional tactical_move."
+            if repo
+            else (
+                "HF_MODEL_REPO is not set — using the public fallback repo. "
+                "Set HF_MODEL_REPO in Space secrets to enable local GPU inference."
+            )
+        ),
+    }
+
+
+def _build_interact_system_prompt(scenario_id: str, persona: str) -> str:
+    """Lightweight system prompt for the /interact page (no live game state)."""
+    from agent.personas import PERSONAS
+    from parlay_env.models import PersonaType
+
+    try:
+        pt = PersonaType(persona)
+    except ValueError:
+        pt = PersonaType.SHARK
+    cfg = PERSONAS[pt]
+
+    sc = get_scenario(scenario_id)
+    mid = (sc.batna_seller + sc.batna_buyer) / 2
+
+    return (
+        f"You are {cfg.name} ({cfg.emoji}), an experienced negotiator.\n\n"
+        f"SCENARIO: {sc.title}\n"
+        f"{sc.description}\n"
+        f"The deal range is roughly {sc.batna_seller:,.0f}–{sc.batna_buyer:,.0f} {sc.currency}.\n"
+        f"You are negotiating from the opposing side, targeting around {mid:,.0f}.\n\n"
+        f"YOUR STYLE:\n{cfg.style}\n\n"
+        "RULES:\n"
+        "- Stay in character at all times.\n"
+        '- Respond ONLY with valid JSON: {"utterance": "...", "offer_amount": <number or null>, "tactical_move": <string or null>}\n'
+        "- Keep utterances under 100 words.\n"
+    )
+
+
+async def _run_hf_inference(
+    system_prompt: str,
+    history: list[ModelChatMessage],
+    message: str,
+    temperature: float,
+    max_tokens: int,
+) -> dict[str, Any]:
+    """Load the Hub model (via hf_opponent._sync_generate) and run inference."""
+    from agent.hf_opponent import _sync_generate, _get_lock, _build_prompt, _parse_json_block  # noqa: PLC0415
+
+    messages = []
+    for h in history:
+        role = "user" if h.role == "user" else "model"
+        messages.append({"role": role, "parts": [h.text]})
+    messages.append({"role": "user", "parts": [message]})
+
+    loop = asyncio.get_event_loop()
+
+    repo = (os.environ.get("HF_MODEL_REPO") or "").strip() or GRPO_MODEL_REPO_DEFAULT
+    os.environ.setdefault("HF_MODEL_REPO", repo)
+
+    result = await loop.run_in_executor(
+        None,
+        lambda: _sync_generate(system_prompt, messages, min(max_tokens, 512)),
+    )
+    return result
+
+
+async def _run_hf_api_inference(
+    system_prompt: str,
+    history: list[ModelChatMessage],
+    message: str,
+    temperature: float,
+    max_tokens: int,
+    repo: str,
+) -> dict[str, Any]:
+    """
+    Call the HF Inference API for the given repo.
+    Tries the new /v1/chat/completions endpoint first, then falls back to the
+    legacy text-generation endpoint.
+    """
+    import httpx  # noqa: PLC0415
+    from agent.hf_opponent import _parse_json_block  # noqa: PLC0415
+
+    token = os.environ.get("HF_TOKEN", "")
+    headers = {"Authorization": f"Bearer {token}"} if token else {}
+
+    # Build chat messages for /v1/chat/completions
+    chat_msgs = [{"role": "system", "content": system_prompt}]
+    for h in history:
+        chat_msgs.append({"role": h.role, "content": h.text})
+    chat_msgs.append({"role": "user", "content": message})
+
+    url = f"https://api-inference.huggingface.co/models/{repo}/v1/chat/completions"
+    payload = {
+        "model": repo,
+        "messages": chat_msgs,
+        "max_tokens": min(max_tokens, 512),
+        "temperature": temperature,
+    }
+
+    async with httpx.AsyncClient(timeout=120.0) as client:
+        resp = await client.post(url, json=payload, headers=headers)
+        if resp.status_code == 200:
+            data = resp.json()
+            raw = data["choices"][0]["message"]["content"]
+            return _parse_json_block(raw)
+        # Legacy text-generation endpoint
+        legacy_url = f"https://api-inference.huggingface.co/models/{repo}"
+        # Format as ChatML for Qwen
+        eot = "<|im_end|>"
+        prompt_parts = [f"<|im_start|>system\n{system_prompt}\n{eot}\n"]
+        for h in history:
+            r = "user" if h.role == "user" else "assistant"
+            prompt_parts.append(f"<|im_start|>{r}\n{h.text}\n{eot}\n")
+        prompt_parts.append(f"<|im_start|>user\n{message}\n{eot}\n")
+        prompt_parts.append(
+            "<|im_start|>assistant\n"
+            'Respond ONLY with valid JSON: {"utterance": "...", "offer_amount": <number or null>, "tactical_move": <string or null>}\n'
+        )
+        legacy_payload = {
+            "inputs": "".join(prompt_parts),
+            "parameters": {"max_new_tokens": min(max_tokens, 256), "temperature": temperature, "return_full_text": False},
+        }
+        resp2 = await client.post(legacy_url, json=legacy_payload, headers=headers)
+        resp2.raise_for_status()
+        data2 = resp2.json()
+        raw2 = data2[0]["generated_text"] if isinstance(data2, list) else str(data2)
+        return _parse_json_block(raw2)
+
+
+@router.post("/model/chat")
+async def model_chat(req: ModelChatRequest) -> dict:
+    """
+    Direct inference against the GRPO-finetuned negotiation model.
+    Used by the /interact page for free-form chat with the model.
+    Strategy:
+    1. If torch + model weights are loadable locally (GPU Space), load and run.
+    2. Otherwise hit the HF Inference API (works on CPU Spaces, may have cold-start).
+    """
+    try:
+        scenario = get_scenario(req.scenario_id)
+    except InvalidScenarioError:
+        raise HTTPException(status_code=400, detail=f"Unknown scenario: {req.scenario_id!r}")
+
+    valid_personas = {"shark", "diplomat", "veteran"}
+    if req.persona not in valid_personas:
+        raise HTTPException(status_code=400, detail=f"Unknown persona: {req.persona!r}")
+
+    system_prompt = _build_interact_system_prompt(req.scenario_id, req.persona)
+    repo = (os.environ.get("HF_MODEL_REPO") or "").strip() or GRPO_MODEL_REPO_DEFAULT
+
+    # Attempt 1 — local model (fast on GPU Spaces, slow on CPU)
+    try:
+        import torch  # noqa: PLC0415
+        result = await _run_hf_inference(
+            system_prompt, req.history, req.message, req.temperature, req.max_tokens
+        )
+        return {
+            "utterance": result.get("utterance", ""),
+            "offer_amount": result.get("offer_amount"),
+            "tactical_move": result.get("tactical_move"),
+            "backend": "local",
+            "model_repo": repo,
+        }
+    except Exception as local_exc:
+        logger.info("Local inference unavailable, trying HF API: %s", local_exc)
+
+    # Attempt 2 — HF Inference API (no GPU needed)
+    try:
+        result = await _run_hf_api_inference(
+            system_prompt, req.history, req.message, req.temperature, req.max_tokens, repo
+        )
+        return {
+            "utterance": result.get("utterance", ""),
+            "offer_amount": result.get("offer_amount"),
+            "tactical_move": result.get("tactical_move"),
+            "backend": "hf_api",
+            "model_repo": repo,
+        }
+    except Exception as api_exc:
+        logger.warning("HF API inference failed: %s", api_exc)
+        raise HTTPException(
+            status_code=503,
+            detail=(
+                f"Model inference failed on both local and HF API backends. "
+                f"Model: {repo}. Error: {api_exc}"
+            ),
+        )
+
+
 @router.post("/set-opponent")
 async def set_opponent(req: SetOpponentRequest) -> dict:
     """
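The legacy fallback in `_run_hf_api_inference` assembles a Qwen ChatML prompt by hand. The same turn-formatting logic, isolated as a standalone helper for clarity (a sketch mirroring the diff, not code from the repo):

```python
EOT = "<|im_end|>"

def build_chatml(system_prompt: str, history: list[tuple[str, str]], message: str) -> str:
    """Assemble a Qwen-style ChatML prompt: system turn, prior turns, the new
    user turn, then an open assistant turn for the model to complete."""
    parts = [f"<|im_start|>system\n{system_prompt}\n{EOT}\n"]
    for role, text in history:
        # Any non-"user" role (the diff stores model turns as "model") maps to assistant.
        r = "user" if role == "user" else "assistant"
        parts.append(f"<|im_start|>{r}\n{text}\n{EOT}\n")
    parts.append(f"<|im_start|>user\n{message}\n{EOT}\n")
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)
```

The endpoint goes one step further and seeds the open assistant turn with a "Respond ONLY with valid JSON" reminder, biasing the text-generation fallback toward the structured `utterance` / `offer_amount` / `tactical_move` output the parser expects.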
dashboard/index.html CHANGED
@@ -84,7 +84,9 @@
       </div>
 
       <nav class="header-nav" aria-label="Site navigation">
-        <a href="/index.html" class="active">Game</a>
+        <a href="/" class="active">Game</a>
+        <a href="/interact">Interact</a>
+        <a href="/judge">GRPO demo</a>
       </nav>
 
       <div class="header-actions">
dashboard/interact.html ADDED
@@ -0,0 +1,639 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!DOCTYPE html>
2
+ <html lang="en" data-theme="dark">
3
+ <head>
4
+ <meta charset="UTF-8" />
5
+ <meta name="viewport" content="width=device-width, initial-scale=1.0" />
6
+ <title>Parlay — Talk to the Model</title>
7
+ <link rel="icon" type="image/svg+xml" href="/static/favicon/favicon.svg?v=1" />
8
+ <link rel="preconnect" href="https://fonts.googleapis.com" />
9
+ <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin />
10
+ <link href="https://fonts.googleapis.com/css2?family=Playfair+Display:ital,wght@0,400;0,600;0,700;1,400&family=EB+Garamond:ital,wght@0,400;0,500&family=DM+Mono:wght@400;500&display=swap" rel="stylesheet" />
11
+ <style>
12
+ /* ── Design tokens (same as train_results.html) ────────────────────────── */
13
+ :root {
14
+ --felt: #1c2b1a;
15
+ --felt-light: #2a3d28;
16
+ --mahogany: #2c1810;
17
+ --mahogany-light: #3d2518;
18
+ --cream: #f5f0e8;
19
+ --gold: #c9a84c;
20
+ --smoke: #8a8070;
21
+ --ink: #1a1208;
22
+ --scarlet: #8b1a1a;
23
+ --emerald: #1a5c2a;
24
+ --ivory: #faf6ee;
25
+ --red-accent: #c0392b;
26
+ --green-accent: #27ae60;
27
+ --font-display: "Playfair Display", Georgia, serif;
28
+ --font-body: "EB Garamond", Georgia, serif;
29
+ --font-mono: "DM Mono", "Courier New", monospace;
30
+ }
31
+ *, *::before, *::after { box-sizing: border-box; }
32
+ body {
33
+ margin: 0; min-height: 100vh;
34
+ font-family: var(--font-body); font-size: 1rem;
35
+ color: var(--cream); background: var(--felt);
36
+ background-image:
37
+ repeating-linear-gradient(45deg,
38
+ rgba(255,255,255,0.018) 0, rgba(255,255,255,0.018) 1px,
39
+ transparent 1px, transparent 6px);
40
+ }
41
+
42
+ /* ── Page shell ───────────────────────────────────────────────────────── */
43
+ .page { max-width: 820px; margin: 0 auto; padding: 2.5rem 1.25rem 5rem; }
44
+ a.back {
45
+ color: var(--smoke); text-decoration: none; font-size: 0.9rem;
46
+ display: inline-flex; align-items: center; gap: 4px; margin-bottom: 1.5rem;
47
+ }
48
+ a.back:hover { color: var(--gold); }
49
+ h1 {
50
+ font-family: var(--font-display); color: var(--gold);
51
+ font-size: 2rem; font-style: italic; font-weight: 600;
52
+ margin: 0 0 0.4rem 0;
53
+ }
54
+ .subtitle { color: var(--smoke); font-size: 1.05rem; margin: 0 0 0.9rem; }
55
+ h2 {
56
+ font-family: var(--font-display); color: var(--gold); font-size: 1.3rem;
57
+ font-weight: 600; border-bottom: 1px solid rgba(201,168,76,0.3);
58
+ padding-bottom: 0.3rem; margin: 0 0 0.9rem;
59
+ }
60
+ section { margin-top: 2rem; }
61
+
62
+ /* ── Status badge ─────────────────────────────────────────────────────── */
63
+ .badge {
64
+ display: inline-block; padding: 4px 12px; border-radius: 4px;
65
+ font-size: 0.75rem; font-family: var(--font-mono); letter-spacing: 0.06em;
66
+ }
67
+ .badge.ok { background: #1a3d22; color: #a8d4b0; border: 1px solid var(--emerald); }
68
+ .badge.warn { background: #3a3018; color: #e8d49a; border: 1px solid var(--gold); }
69
+ .badge.err { background: #3d1010; color: #e8a0a0; border: 1px solid var(--scarlet); }
70
+
71
+ /* ── Model info card ──────────────────────────────────────────────────── */
72
+ .model-info-card {
73
+ background: var(--mahogany); border: 1px solid rgba(201,168,76,0.4);
74
+ border-radius: 4px; padding: 1rem 1.25rem;
75
+ display: grid; grid-template-columns: 1fr 1fr; gap: 0.6rem 1.5rem;
76
+ }
77
+ @media (max-width: 560px) { .model-info-card { grid-template-columns: 1fr; } }
78
+ .info-row { display: flex; flex-direction: column; gap: 2px; }
79
+ .info-label {
80
+ font-family: var(--font-mono); font-size: 0.68rem;
81
+ text-transform: uppercase; letter-spacing: 0.1em; color: var(--smoke);
82
+ }
83
+ .info-val { font-size: 0.9rem; color: var(--cream); word-break: break-all; }
84
+ .info-val a { color: var(--gold); text-decoration: none; }
85
+ .info-val a:hover { text-decoration: underline; }
86
+
87
+ /* ── Context pickers ──────────────────────────────────────────────────── */
88
+ .context-row {
89
+ display: grid; grid-template-columns: 1fr 1fr; gap: 1rem;
90
+ margin-bottom: 1.2rem;
91
+ }
92
+ @media (max-width: 560px) { .context-row { grid-template-columns: 1fr; } }
93
+ .context-group { display: flex; flex-direction: column; gap: 0.35rem; }
94
+ .context-group label {
95
+ font-family: var(--font-mono); font-size: 0.7rem;
96
+ text-transform: uppercase; letter-spacing: 0.1em; color: var(--smoke);
97
+ }
98
+ select {
99
+ background: var(--mahogany-light); border: 1px solid rgba(201,168,76,0.35);
100
+ color: var(--cream); font-family: var(--font-body); font-size: 0.95rem;
101
+ padding: 8px 10px; border-radius: 3px; width: 100%; cursor: pointer;
102
+ appearance: none;
103
+ background-image: url("data:image/svg+xml,%3Csvg xmlns='http://www.w3.org/2000/svg' width='12' height='8' viewBox='0 0 12 8'%3E%3Cpath fill='%23c9a84c' d='M1 1l5 5 5-5'/%3E%3C/svg%3E");
104
+ background-repeat: no-repeat; background-position: right 10px center;
105
+ }
106
+ select:focus { outline: 1px solid var(--gold); }
107
+
108
+ /* ── Chat window ──────────────────────────────────────────────────────── */
109
+ .chat-window {
110
+ background: var(--mahogany); border: 1px solid rgba(201,168,76,0.3);
111
+ border-radius: 4px; min-height: 340px; max-height: 480px;
112
+ overflow-y: auto; padding: 1rem; display: flex; flex-direction: column;
113
+ gap: 0.75rem; margin-bottom: 0.9rem;
114
+ scroll-behavior: smooth;
115
+ }
116
+ .chat-window:empty::after {
117
+ content: "Choose a scenario and persona above, then type a message to begin.";
118
+ color: var(--smoke); font-style: italic; font-size: 0.95rem;
119
+ margin: auto;
120
+ }
121
+
122
+ /* message bubbles */
123
+ .msg { display: flex; flex-direction: column; max-width: 82%; }
124
+ .msg.user { align-self: flex-end; align-items: flex-end; }
125
+ .msg.model { align-self: flex-start; align-items: flex-start; }
126
+ .msg.system { align-self: center; align-items: center; max-width: 100%; }
127
+
128
+ .msg-role {
129
+ font-family: var(--font-mono); font-size: 0.65rem;
130
+ text-transform: uppercase; letter-spacing: 0.1em;
131
+ color: var(--smoke); margin-bottom: 2px;
132
+ }
133
+ .msg-body {
134
+ padding: 0.65rem 0.9rem; border-radius: 4px;
135
+ font-size: 0.97rem; line-height: 1.5;
136
+ }
137
+ .msg.user .msg-body { background: #2d4030; border: 1px solid rgba(201,168,76,0.25); }
138
+ .msg.model .msg-body {
139
+ background: var(--mahogany-light); border: 1px solid rgba(201,168,76,0.4);
140
+ }
141
+ .msg.system .msg-body {
142
+ background: transparent; border: none;
143
+ color: var(--smoke); font-style: italic; font-size: 0.88rem; text-align: center;
144
+ }
145
+ /* offer chip inside model bubble */
146
+ .offer-chip {
147
+ display: inline-block; margin-top: 0.45rem;
148
+ background: rgba(201,168,76,0.15); border: 1px solid var(--gold);
149
+ color: var(--gold); font-family: var(--font-mono); font-size: 0.75rem;
150
+ padding: 2px 10px; border-radius: 3px;
151
+ }
152
+ .tactic-pill {
153
+ display: inline-block; margin-top: 0.35rem; margin-left: 0.4rem;
154
+ background: rgba(139,26,26,0.25); border: 1px solid var(--scarlet);
155
+ color: #e0a0a0; font-family: var(--font-mono); font-size: 0.68rem;
156
+ padding: 2px 8px; border-radius: 3px; text-transform: uppercase;
157
+ }
158
+ /* thinking pulse */
159
+ .thinking .msg-body { opacity: 0.65; }
160
+ .dot-pulse {
161
+ display: inline-flex; gap: 4px; align-items: center; padding: 2px 0;
162
+ }
163
+ .dot-pulse span {
164
+ width: 6px; height: 6px; border-radius: 50%; background: var(--gold);
165
+ animation: pulse 1.2s infinite ease-in-out;
166
+ }
167
+ .dot-pulse span:nth-child(2) { animation-delay: 0.2s; }
168
+ .dot-pulse span:nth-child(3) { animation-delay: 0.4s; }
169
+ @keyframes pulse { 0%,80%,100% { opacity: 0.2; transform: scale(0.8); } 40% { opacity: 1; transform: scale(1); } }
170
+
171
+ /* ── Input bar ────────────────────────────────────────────────────────── */
172
+ .input-bar { display: flex; gap: 0.65rem; }
173
+ .chat-input {
174
+ flex: 1; background: var(--mahogany-light);
175
+ border: 1px solid rgba(201,168,76,0.35);
+ color: var(--cream); font-family: var(--font-body); font-size: 1rem;
+ padding: 10px 14px; border-radius: 3px; resize: none; min-height: 44px;
+ }
+ .chat-input:focus { outline: 1px solid var(--gold); }
+ .chat-input::placeholder { color: var(--smoke); }
+ .btn-gold {
+ background: var(--gold); color: var(--ink);
+ border: none; padding: 10px 20px;
+ font-family: var(--font-mono); font-size: 0.85rem;
+ cursor: pointer; border-radius: 3px; font-weight: 500;
+ white-space: nowrap; align-self: flex-end;
+ }
+ .btn-gold:hover { filter: brightness(1.08); }
+ .btn-gold:disabled { opacity: 0.45; cursor: not-allowed; filter: none; }
+ .btn-ghost {
+ background: none; border: 1px solid rgba(201,168,76,0.35);
+ color: var(--smoke); padding: 8px 14px; border-radius: 3px;
+ font-family: var(--font-mono); font-size: 0.8rem; cursor: pointer;
+ }
+ .btn-ghost:hover { border-color: var(--gold); color: var(--gold); }
+
+ /* ── Footer toolbar under chat ────────────────────────────────────────── */
+ .chat-toolbar {
+ display: flex; justify-content: space-between; align-items: center;
+ margin-top: 0.55rem;
+ }
+ .char-hint { font-family: var(--font-mono); font-size: 0.7rem; color: var(--smoke); }
+ .backend-tag { font-family: var(--font-mono); font-size: 0.68rem; color: var(--smoke); }
+ .backend-tag span { color: var(--gold); }
+
+ /* ── Error box ────────────────────────────────────────────────────────── */
+ .error-box {
+ background: #3d1010; border: 1px solid var(--scarlet);
+ border-radius: 4px; padding: 0.75rem 1rem;
+ color: #e8a0a0; font-size: 0.9rem; display: none;
+ }
+
+ /* ── Explainer read-block (mirrors train_results) ─────────────────────── */
+ .read-block {
+ background: var(--mahogany); border: 1px solid rgba(201,168,76,0.28);
+ border-radius: 4px; padding: 1rem 1.15rem; margin-top: 0.8rem;
+ font-size: 0.95rem; line-height: 1.5; color: var(--cream);
+ }
+ .read-block h3 {
+ font-family: var(--font-display); color: var(--gold);
+ font-size: 1rem; margin: 0 0 0.4rem; font-style: italic;
+ }
+ .read-block p { margin: 0 0 0.6rem; }
+ .read-block p:last-child { margin-bottom: 0; }
+ .read-block code {
+ font-family: var(--font-mono); font-size: 0.82rem;
+ background: rgba(255,255,255,0.07); padding: 1px 5px; border-radius: 2px;
+ }
+
+ /* ── JSON viewer (collapsible raw output) ─────────────────────────────── */
+ details.raw-output {
+ background: var(--mahogany); border: 1px solid rgba(201,168,76,0.2);
+ border-radius: 4px; padding: 0 1rem; margin-top: 0.5rem;
+ }
+ details.raw-output summary {
+ cursor: pointer; color: var(--smoke); padding: 0.6rem 0;
+ font-family: var(--font-mono); font-size: 0.72rem; letter-spacing: 0.05em;
+ text-transform: uppercase;
+ }
+ details.raw-output pre {
+ font-family: var(--font-mono); font-size: 0.78rem; color: var(--cream);
+ background: var(--felt); padding: 0.75rem; border-radius: 3px;
+ overflow-x: auto; margin: 0 0 0.75rem;
+ }
+
+ /* ── Temperature control ──────────────────────────────────────────────── */
+ .param-row {
+ display: flex; align-items: center; gap: 0.75rem;
+ margin-top: 0.9rem;
+ }
+ .param-row label {
+ font-family: var(--font-mono); font-size: 0.7rem;
+ text-transform: uppercase; letter-spacing: 0.08em; color: var(--smoke);
+ white-space: nowrap;
+ }
+ input[type="range"] {
+ flex: 1; accent-color: var(--gold); cursor: pointer;
+ }
+ .param-val {
+ font-family: var(--font-mono); font-size: 0.82rem; color: var(--gold);
+ min-width: 28px; text-align: right;
+ }
+ </style>
264
+ </head>
+ <body>
+ <div class="page">
+
+ <p><a class="back" href="/">← Back to the Deal Room</a></p>
+
+ <header>
+ <h1>◈ Talk to the Model</h1>
+ <p class="subtitle">Direct inference with the GRPO-finetuned negotiator (Qwen2.5-1.5B)</p>
+ <p id="status-badge"></p>
+ </header>
+
+ <!-- Model info ──────────────────────────────────────────────────────── -->
+ <section>
+ <h2>Model</h2>
+ <div class="model-info-card" id="model-info-card">
+ <div class="info-row"><span class="info-label">Repo</span><span class="info-val" id="info-repo">—</span></div>
+ <div class="info-row"><span class="info-label">Base</span><span class="info-val" id="info-base">—</span></div>
+ <div class="info-row"><span class="info-label">Training</span><span class="info-val" id="info-training">—</span></div>
+ <div class="info-row"><span class="info-label">Output format</span><span class="info-val" id="info-note">—</span></div>
+ </div>
+ </section>
+
+ <!-- Chat interface ───────────────────────────────────────────────────── -->
+ <section>
+ <h2>Chat</h2>
+
+ <!-- Context pickers -->
+ <div class="context-row">
+ <div class="context-group">
+ <label for="sel-scenario">Scenario (gives the model context)</label>
+ <select id="sel-scenario">
+ <option value="saas_enterprise">Enterprise SaaS Contract — $125k–$165k ACV</option>
+ <option value="hiring_package">Senior Engineer Offer — $195k–$265k total comp</option>
+ <option value="acquisition_term_sheet">Startup Acquisition — $10.5M–$16M valuation</option>
+ </select>
+ </div>
+ <div class="context-group">
+ <label for="sel-persona">Persona (model negotiating style)</label>
+ <select id="sel-persona">
+ <option value="shark">🦈 The Shark — aggressive, anchors hard</option>
+ <option value="diplomat">🤝 The Diplomat — collaborative, reveals constraints</option>
+ <option value="veteran">🧓 The Veteran — strategic silence, k=2 ToM</option>
+ </select>
+ </div>
+ </div>
+
+ <!-- Temperature -->
+ <div class="param-row">
+ <label for="temp-slider">Temperature</label>
+ <input type="range" id="temp-slider" min="0.1" max="1.4" step="0.05" value="0.7" />
+ <span class="param-val" id="temp-val">0.7</span>
+ <button class="btn-ghost" id="btn-reset-chat" title="Start a new conversation">New chat</button>
+ </div>
+
+ <!-- Window -->
+ <div class="chat-window" id="chat-window" role="log" aria-live="polite" aria-label="Conversation with the model"></div>
+
+ <!-- Error -->
+ <div class="error-box" id="error-box"></div>
+
+ <!-- Input -->
+ <div class="input-bar">
+ <textarea
+ id="chat-input"
+ class="chat-input"
+ rows="1"
+ placeholder="Type your opening offer or message…"
+ aria-label="Your message"
+ ></textarea>
+ <button class="btn-gold" id="btn-send" type="button">Send</button>
+ </div>
+
+ <div class="chat-toolbar">
+ <span class="char-hint">Enter to send · Shift+Enter for new line</span>
+ <span class="backend-tag" id="backend-tag"></span>
+ </div>
+
+ <!-- Last raw output -->
+ <details class="raw-output" id="raw-details" style="display:none;">
+ <summary>Raw model JSON output</summary>
+ <pre id="raw-pre"></pre>
+ </details>
+ </section>
+
+ <!-- About the model ─────────────────────────────────────────────────── -->
+ <section>
+ <h2>About this model</h2>
+ <div class="read-block">
+ <h3>What it is</h3>
+ <p>
+ <strong>parlay-grpo-1-5b</strong> is a Qwen2.5-1.5B-Instruct model fine-tuned in two
+ stages: first with SFT on Gemini-generated negotiation transcripts, then with GRPO using
+ the Parlay reward function — a mix of ZOPA progress, Theory-of-Mind accuracy, tactical
+ card usage, and drift adaptation bonuses.
+ </p>
+ <h3>What it outputs</h3>
+ <p>
+ Every response is a JSON object with three fields:<br/>
+ <code>utterance</code> — the natural language negotiation turn,<br/>
+ <code>offer_amount</code> — a numeric bid (or <code>null</code> for conversational turns),<br/>
+ <code>tactical_move</code> — optional card played (<code>anchor_high</code>, <code>batna_reveal</code>, <code>silence</code>).
+ </p>
+ <h3>How to read the responses here</h3>
+ <p>
+ The <em>utterance</em> is displayed as the chat bubble. If the model includes an
+ <em>offer_amount</em>, it appears as a gold chip below the text. You can expand
+ "Raw model JSON output" to see the full structured response.
+ </p>
+ <h3>Backend</h3>
+ <p>
+ On a GPU Space the model runs locally (fast after the first load). On a CPU Space
+ inference falls back to the Hugging Face Inference API — the first request may take
+ 20–40 s while the model warms up; subsequent requests are faster.
+ </p>
+ </div>
+ </section>
+
+ </div><!-- /.page -->
+
384
+ <script>
+ // ── State ──────────────────────────────────────────────────────────────
+ let history = [];
+ let isLoading = false;
+ let lastBackend = "";
+
+ // ── Init ───────────────────────────────────────────────────────────────
+ document.addEventListener("DOMContentLoaded", () => {
+ loadModelInfo();
+
+ const sendBtn = document.getElementById("btn-send");
+ const inputEl = document.getElementById("chat-input");
+ const tempSldr = document.getElementById("temp-slider");
+ const tempVal = document.getElementById("temp-val");
+ const resetBtn = document.getElementById("btn-reset-chat");
+
+ sendBtn.addEventListener("click", sendMessage);
+
+ inputEl.addEventListener("keydown", (e) => {
+ if (e.key === "Enter" && !e.shiftKey) {
+ e.preventDefault();
+ sendMessage();
+ }
+ // auto-grow textarea
+ requestAnimationFrame(() => {
+ inputEl.style.height = "auto";
+ inputEl.style.height = Math.min(inputEl.scrollHeight, 140) + "px";
+ });
+ });
+
+ tempSldr.addEventListener("input", () => {
+ tempVal.textContent = parseFloat(tempSldr.value).toFixed(2);
+ });
+
+ resetBtn.addEventListener("click", resetChat);
+
+ // Changing scenario/persona resets the conversation
+ document.getElementById("sel-scenario").addEventListener("change", resetChat);
+ document.getElementById("sel-persona").addEventListener("change", resetChat);
+ });
+
+ // ── Load model info ────────────────────────────────────────────────────
+ async function loadModelInfo() {
+ const badge = document.getElementById("status-badge");
+ try {
+ const res = await fetch("/api/model/info");
+ const data = await res.json();
+
+ const repoEl = document.getElementById("info-repo");
+ if (data.hub_url && data.model_repo) {
+ repoEl.innerHTML = `<a href="${data.hub_url}" target="_blank" rel="noopener">${data.model_repo}</a>`;
+ }
+ document.getElementById("info-base").textContent = data.base_model || "—";
+ document.getElementById("info-training").textContent = data.training || "—";
+ document.getElementById("info-note").textContent = data.note || "—";
+
+ if (data.configured) {
+ badge.innerHTML = '<span class="badge ok">Trained model configured — inference ready</span>';
+ } else {
+ badge.innerHTML = '<span class="badge warn">Using public Hub repo — set HF_MODEL_REPO secret for local GPU inference</span>';
+ }
+ } catch (_e) {
+ badge.innerHTML = '<span class="badge err">Could not reach server</span>';
+ }
+ }
+
+ // ── Reset chat ─────────────────────────────────────────────────────────
+ function resetChat() {
+ history = [];
+ const win = document.getElementById("chat-window");
+ win.innerHTML = "";
+ setError(null);
+ document.getElementById("raw-details").style.display = "none";
+ document.getElementById("backend-tag").textContent = "";
+ addSystemMsg("Conversation reset. Send a message to begin.");
+ }
+
+ // ── Send ───────────────────────────────────────────────────────────────
+ async function sendMessage() {
+ if (isLoading) return;
+ const inputEl = document.getElementById("chat-input");
+ const text = inputEl.value.trim();
+ if (!text) return;
+
+ const scenarioId = document.getElementById("sel-scenario").value;
+ const persona = document.getElementById("sel-persona").value;
+ const temp = parseFloat(document.getElementById("temp-slider").value);
+
+ // Render user bubble
+ addMsg("user", text, null, null);
+ history.push({ role: "user", text });
+ inputEl.value = "";
+ inputEl.style.height = "auto";
+ setError(null);
+
+ // Thinking indicator
+ const thinkId = addThinking();
+ setLoading(true);
+
+ try {
+ const res = await fetch("/api/model/chat", {
+ method: "POST",
+ headers: { "Content-Type": "application/json" },
+ body: JSON.stringify({
+ message: text,
+ scenario_id: scenarioId,
+ persona,
+ history: history.slice(0, -1), // exclude the just-added user turn
+ temperature: temp,
+ max_tokens: 300,
+ }),
+ });
+
+ removeThinking(thinkId);
+
+ if (!res.ok) {
+ const err = await res.json().catch(() => ({}));
+ throw new Error(err.detail || `HTTP ${res.status}`);
+ }
+
+ const data = await res.json();
+ lastBackend = data.backend || "";
+
+ // Update backend tag
+ const backendTag = document.getElementById("backend-tag");
+ backendTag.innerHTML = `backend: <span>${lastBackend === "local" ? "local GPU" : "HF Inference API"}</span>`;
+
+ // Render model bubble
+ const utterance = data.utterance || "(no utterance)";
+ const offer = data.offer_amount ?? null;
+ const tactic = data.tactical_move || null;
+ addMsg("model", utterance, offer, tactic);
+ history.push({ role: "assistant", text: utterance });
+
+ // Show raw output
+ const rawDetails = document.getElementById("raw-details");
+ const rawPre = document.getElementById("raw-pre");
+ rawPre.textContent = JSON.stringify({
+ utterance: data.utterance,
+ offer_amount: data.offer_amount,
+ tactical_move: data.tactical_move,
+ }, null, 2);
+ rawDetails.style.display = "block";
+
+ } catch (e) {
+ removeThinking(thinkId);
+ setError("Inference failed: " + e.message);
+ // remove last user turn from history so user can retry
+ history.pop();
+ } finally {
+ setLoading(false);
+ }
+ }
+
538
+ // ── DOM helpers ────────────────────────────────────────────────────────
+ function formatCurrency(v) {
+ if (v == null) return null;
+ const n = parseFloat(v);
+ if (isNaN(n)) return null;
+ if (n >= 1_000_000) return "$" + (n / 1_000_000).toFixed(2) + "M";
+ if (n >= 1_000) return "$" + (n / 1_000).toFixed(0) + "k";
+ return "$" + n.toFixed(0);
+ }
+
+ function addMsg(role, utterance, offerAmount, tacticMove) {
+ const win = document.getElementById("chat-window");
+
+ // Remove placeholder text if present
+ const empty = win.querySelector(".empty-hint");
+ if (empty) empty.remove();
+
+ const wrap = document.createElement("div");
+ wrap.className = `msg ${role}`;
+
+ const roleLabel = document.createElement("div");
+ roleLabel.className = "msg-role";
+ roleLabel.textContent = role === "user" ? "You" : "Model";
+ wrap.appendChild(roleLabel);
+
+ const body = document.createElement("div");
+ body.className = "msg-body";
+ body.textContent = utterance;
+
+ if (offerAmount != null) {
+ const chip = document.createElement("div");
+ chip.className = "offer-chip";
+ chip.textContent = formatCurrency(offerAmount) || String(offerAmount);
+ body.appendChild(chip);
+ }
+ if (tacticMove) {
+ const pill = document.createElement("span");
+ pill.className = "tactic-pill";
+ const labels = { anchor_high: "⚓ anchor", batna_reveal: "🃏 BATNA reveal", silence: "🤫 silence" };
+ pill.textContent = labels[tacticMove] || tacticMove;
+ body.appendChild(pill);
+ }
+
+ wrap.appendChild(body);
+ win.appendChild(wrap);
+ win.scrollTop = win.scrollHeight;
+ return wrap;
+ }
+
+ function addSystemMsg(text) {
+ const win = document.getElementById("chat-window");
+ const wrap = document.createElement("div");
+ wrap.className = "msg system";
+ const body = document.createElement("div");
+ body.className = "msg-body";
+ body.textContent = text;
+ wrap.appendChild(body);
+ win.appendChild(wrap);
+ }
+
+ let _thinkingSeq = 0;
+ function addThinking() {
+ const id = "think-" + (++_thinkingSeq);
+ const win = document.getElementById("chat-window");
+ const wrap = document.createElement("div");
+ wrap.className = "msg model thinking";
+ wrap.id = id;
+ const roleLabel = document.createElement("div");
+ roleLabel.className = "msg-role";
+ roleLabel.textContent = "Model";
+ const body = document.createElement("div");
+ body.className = "msg-body";
+ body.innerHTML = '<div class="dot-pulse"><span></span><span></span><span></span></div>';
+ wrap.appendChild(roleLabel);
+ wrap.appendChild(body);
+ win.appendChild(wrap);
+ win.scrollTop = win.scrollHeight;
+ return id;
+ }
+
+ function removeThinking(id) {
+ document.getElementById(id)?.remove();
+ }
+
+ function setLoading(on) {
+ isLoading = on;
+ document.getElementById("btn-send").disabled = on;
+ document.getElementById("chat-input").disabled = on;
+ }
+
+ function setError(msg) {
+ const box = document.getElementById("error-box");
+ if (msg) {
+ box.textContent = msg;
+ box.style.display = "block";
+ } else {
+ box.style.display = "none";
+ }
+ }
+ </script>
+ </body>
+ </html>
dashboard/judge.html ADDED
@@ -0,0 +1,441 @@
+ <!DOCTYPE html>
+ <html lang="en" data-theme="dark">
+ <head>
+ <meta charset="UTF-8" />
+ <meta name="viewport" content="width=device-width, initial-scale=1.0" />
+ <meta name="description" content="Parlay — The Deal Room. An RL-powered negotiation arena." />
+ <title>Parlay — The Deal Room</title>
+
+ <link rel="icon" type="image/svg+xml" href="/static/favicon/favicon.svg?v=1" />
+ <link rel="icon" type="image/x-icon" href="/favicon.ico" />
+
+ <script src="https://cdnjs.cloudflare.com/ajax/libs/three.js/r128/three.min.js"
+ crossorigin="anonymous" referrerpolicy="no-referrer"></script>
+ <script src="https://cdnjs.cloudflare.com/ajax/libs/Chart.js/4.4.1/chart.umd.js"
+ crossorigin="anonymous" referrerpolicy="no-referrer"></script>
+
+ <link rel="stylesheet" href="/static/style.css?v=5" />
+ </head>
+ <body>
+
+ <!-- DEMO BANNER -->
+ <div id="demo-banner" class="demo-banner hidden" role="status">
+ Demo mode — AI responses are simulated · Add GOOGLE_API_KEY to .env for real gameplay
+ <button class="demo-banner-dismiss" type="button" id="btn-dismiss-demo" aria-label="Dismiss">✕</button>
+ </div>
+
+ <!-- LOADING OVERLAY -->
+ <div id="loading-overlay" class="loading-overlay hidden" role="status" aria-live="polite">
+ <div class="loading-card">
+ <div class="spinner" aria-hidden="true"></div>
+ <p class="loading-text">The room is thinking…</p>
+ </div>
+ </div>
+
+ <!-- ONBOARDING — STEP 1 -->
+ <div id="onboarding-step-1" class="onboarding-overlay start-active" role="dialog" aria-modal="true" aria-label="Enter your name">
+ <div class="onboarding-card">
+ <div class="onboarding-step-num">Step 1 of 3</div>
+ <h1 class="onboarding-headline">Who's at<br>the table?</h1>
+ <p class="onboarding-sub">Every deal starts with a name on the door.</p>
+ <input type="text" id="step1-name" class="onboarding-name-input"
+ placeholder="Your name…" maxlength="40" autocomplete="off" autofocus />
+ <div id="step1-error" class="onboarding-error"></div>
+ <div class="onboarding-footer">
+ <button id="step1-continue" class="btn btn-primary" type="button">Continue &rarr;</button>
+ </div>
+ </div>
+ </div>
+
+ <!-- ONBOARDING — STEP 2 -->
+ <div id="onboarding-step-2" class="onboarding-overlay" role="dialog" aria-modal="true" aria-label="Choose a scenario" inert>
+ <div class="onboarding-card wide">
+ <div class="onboarding-step-num">Step 2 of 3</div>
+ <h1 class="onboarding-headline">Choose your deal</h1>
+ <p class="onboarding-sub">Select a case from the dossier.</p>
+ <div id="scenario-dossier-grid" class="scenario-dossier-grid" role="radiogroup" aria-label="Negotiation scenarios"></div>
+ <div id="step2-error" class="onboarding-error"></div>
+ <div class="onboarding-footer">
+ <button id="step2-back" class="btn btn-ghost" type="button">&larr; Back</button>
+ <button id="step2-continue" class="btn btn-primary" type="button">Continue &rarr;</button>
+ </div>
+ </div>
+ </div>
+
+ <!-- ONBOARDING — STEP 3 -->
+ <div id="onboarding-step-3" class="onboarding-overlay" role="dialog" aria-modal="true" aria-label="Choose your opponent" inert>
+ <div class="onboarding-card wide">
+ <div class="onboarding-step-num">Step 3 of 3</div>
+ <h1 class="onboarding-headline">Choose your opponent</h1>
+ <p class="onboarding-sub">Study the faces across the table.</p>
+ <div id="persona-cards-grid" class="persona-cards-grid" role="radiogroup" aria-label="Negotiator personas"></div>
+ <div id="step3-error" class="onboarding-error"></div>
+ <div class="onboarding-footer">
+ <button id="step3-back" class="btn btn-ghost" type="button">&larr; Back</button>
+ <button id="step3-start" class="btn btn-primary" type="button">Enter the Room &rarr;</button>
+ </div>
+ </div>
+ </div>
+
+ <!-- APP HEADER — item 19 topbar polish -->
+ <header class="app-header" role="banner">
+ <div class="header-brand" aria-label="Parlay">
+ <span class="brand-par">par</span><span class="brand-gem">◈</span><span class="brand-lay">lay</span>
+ </div>
+
+ <nav class="header-nav" aria-label="Site navigation">
+ <a href="/">Game</a>
+ <a href="/interact">Interact</a>
+ <a href="/judge" class="active">GRPO demo</a>
+ </nav>
+
+ <div class="header-actions">
+ <a href="/train" target="_blank" rel="noopener" style="color: var(--smoke); font-size: 0.8125rem; text-decoration: none; letter-spacing: 0.06em;">Training Results &rarr;</a>
+ <p id="global-error" class="hidden text-red text-sm" role="alert"></p>
+ <!-- Theme toggle — item 27 -->
+ <button id="theme-toggle" class="dark-toggle" type="button"
+ aria-label="Toggle display mode" title="Toggle light/dark">●</button>
+ </div>
+ </header>
+
101
+ <!-- 3-COLUMN BODY -->
+ <main class="app-body" role="main">
+
+ <!-- LEFT COLUMN -->
+ <aside class="col-left" aria-label="Player info">
+
+ <!-- Player Card -->
+ <section class="player-card panel" aria-label="Player card">
+ <div class="player-card-header">
+ <div id="player-avatar" class="player-avatar" aria-hidden="true">P</div>
+ <div class="player-info">
+ <div id="player-name-display" class="player-name">Player</div>
+ <div class="player-rank">#— Unranked</div>
+ </div>
+ </div>
+ <div class="cp-section">
+ <div class="cp-label-row">
+ <span class="cp-label">
+ <span class="gloss-wrap">Credibility Points
+ <span class="gloss-icon" aria-label="What are Credibility Points?">ⓘ</span>
+ <span class="gloss-tip" role="tooltip">Your tactical budget. Spend them to play power moves. Regenerates each turn.</span>
+ </span>
+ </span>
+ <span id="cp-value" class="cp-value">100 / 100</span>
+ </div>
+ <div class="cp-track" role="progressbar" aria-label="Credibility Points" aria-valuemin="0" aria-valuemax="100">
+ <div id="cp-fill" class="cp-fill" style="width:100%;"></div>
+ </div>
+ </div>
+ </section>
+
+ </aside>
+
+ <!-- CENTER COLUMN -->
+ <section class="col-center" aria-label="Negotiation arena">
+
+ <!-- Scenario Header -->
+ <div class="scenario-header" id="scenario-header">
+ <div>
+ <div id="scenario-title" class="scenario-title">Waiting for game…</div>
+ <div id="scenario-meta" class="scenario-meta text-muted">Select a scenario to begin</div>
+ <div id="session-id-label" class="scenario-meta text-muted">Session: —</div>
+ </div>
+ </div>
+
+ <!-- Drift Alert -->
+ <div id="drift-alert" class="drift-alert hidden" role="alert" aria-live="assertive">
+ <span class="drift-alert-icon" aria-hidden="true">⚠️</span>
+ <span id="drift-alert-text" class="drift-alert-text">Market conditions have shifted.</span>
+ <button id="btn-dismiss-drift" class="drift-dismiss" type="button" aria-label="Dismiss drift alert">✕</button>
+ </div>
+
+ <!-- Chat Thread with briefing overlay — items 7 & 15 -->
+ <div class="chat-thread-wrap" style="position:relative;">
+ <!-- Deal Briefing Panel — item 15 -->
+ <div id="briefing-overlay" class="briefing-overlay" aria-label="Deal briefing" style="display:none;">
+ <div class="briefing-card">
+ <div class="briefing-case-num" id="briefing-case-num">CASE FILE #SaaS-001</div>
+ <div class="briefing-title" id="briefing-title">Enterprise SaaS Contract</div>
+
+ <div class="briefing-section">
+ <div class="briefing-section-label">Your Goal</div>
+ <div class="briefing-section-text" id="briefing-your-goal">
+ Close the deal above $125,000. Your ideal: $165,000.
+ </div>
+ </div>
+
+ <div class="briefing-section">
+ <div class="briefing-section-label">Their Goal</div>
+ <div class="briefing-section-text" id="briefing-their-goal">
+ Pay as little as possible. They'll push hard on price.
+ </div>
+ </div>
+
+ <div class="briefing-range" id="briefing-range">
+ A deal is possible between $125k and $165k.
+ </div>
+
+ <button class="briefing-begin" id="btn-briefing-begin" type="button">
+ Begin Negotiation &rarr;
+ </button>
+ </div>
+ </div>
+
+ <div id="chat-thread" class="chat-thread" role="log" aria-live="polite" aria-label="Negotiation conversation">
+ <!-- Initial system message — item 7: plain italic text, no highlight -->
+ <div class="system-msg">Step through the door to begin negotiating.</div>
+ </div>
+ </div>
+
+ <!-- Result Banner -->
+ <div id="result-banner" class="result-banner hidden" aria-live="polite">
+ <div class="result-title">—</div>
+ <div class="result-amount">—</div>
+ <div class="result-score">Score: —</div>
+ </div>
+
+ <!-- Input Area — item 14 (simplified) -->
+ <div class="input-area" role="form" aria-label="Your move">
+ <div id="tactical-buttons" class="tactical-bar">
+ <button class="tactic-btn" data-card="anchor_high" data-cost="0" type="button">
+ ⚓ Anchor High <span class="cp-cost">0 CP</span>
+ </button>
+ <button class="tactic-btn" data-card="batna_reveal" data-cost="20" type="button">
+ 🃏 BATNA Reveal <span class="cp-cost">20 CP</span>
+ </button>
+ <button class="tactic-btn" data-card="silence" data-cost="5" type="button">
+ 🤫 Silence <span class="cp-cost">5 CP</span>
+ </button>
+ </div>
+
+ <!-- Main text input -->
+ <div class="input-main-row">
+ <label class="sr-only" for="offer-input">Your message or offer</label>
+ <input
+ type="text"
+ id="offer-input"
+ class="offer-input"
+ placeholder="Type your message or make an offer…"
+ aria-label="Type your message or offer amount"
+ disabled
+ />
+ <button id="btn-submit" class="btn btn-primary" type="button" disabled>Send</button>
+ </div>
+
+ <!-- Inline offer stepper (appears when Make Offer clicked) -->
+ <div id="offer-stepper" class="offer-stepper" aria-label="Offer amount stepper">
+ <button class="stepper-btn" id="stepper-down" type="button" aria-label="Decrease offer">−</button>
+ <span class="stepper-value" id="stepper-value">$145,000</span>
+ <button class="stepper-btn" id="stepper-up" type="button" aria-label="Increase offer">+</button>
+ <button class="btn btn-sm btn-primary" id="stepper-use" type="button">Use</button>
+ <button class="btn btn-sm btn-ghost" id="stepper-cancel" type="button">✕</button>
+ </div>
+
+ <!-- Quick action chips -->
+ <div class="quick-actions">
+ <button class="quick-chip offer-chip-btn" id="chip-offer" type="button" disabled>
+ Make Offer ▼
+ </button>
+ <button class="quick-chip accept-chip" id="chip-accept" type="button" disabled>
+ Accept Deal ✓
+ </button>
+ <button class="quick-chip walk-chip" id="chip-walk" type="button" disabled>
+ Walk Away ✕
+ </button>
+ </div>
+ </div>
+
249
+ <!-- ZOPA Bar — item 10 -->
+ <section class="zopa-section" aria-label="Zone of Possible Agreement">
+ <div class="panel-header">
+ <span class="panel-title">
+ <span class="gloss-wrap">ZOPA
+ <span class="gloss-icon" aria-label="What is ZOPA?">ⓘ</span>
+ <span class="gloss-tip" role="tooltip">The price range where a deal is possible — between both parties' minimum acceptable prices</span>
+ </span>
+ </span>
+ <span class="text-xs text-muted">Zone of Possible Agreement</span>
+ </div>
+
+ <div id="zopa-track" class="zopa-track-outer" role="img" aria-label="ZOPA visual range">
+ <div id="zopa-zone" class="zopa-zone"></div>
+
+ <div id="marker-player" class="zopa-marker marker-player" style="left:20%;">
+ <div class="zopa-marker-triangle"></div>
+ <div class="zopa-marker-line"></div>
+ <span class="zopa-label">Your floor</span>
+ </div>
+
+ <div id="marker-opponent" class="zopa-marker marker-opponent" style="left:80%;">
+ <div class="zopa-marker-triangle"></div>
+ <div class="zopa-marker-line"></div>
+ <span class="zopa-label">Their floor</span>
+ </div>
+
+ <div id="marker-current" class="zopa-marker marker-current" style="left:50%; display:none;">
+ <div class="zopa-marker-triangle"></div>
+ <div class="zopa-marker-line"></div>
+ <span class="zopa-label">Offer</span>
+ </div>
+
+ <!-- Nash diamond with label — always visible -->
+ <div id="nash-marker" style="position:absolute; top:0; bottom:0; left:50%; transform:translateX(-50%); pointer-events:none; display:none;">
+ <div id="nash-diamond" class="nash-diamond" style="position:absolute; top:50%; transform:translate(-50%,-50%);"></div>
+ <span class="nash-label">Fair deal</span>
+ </div>
+ </div>
+
+ <div class="zopa-labels-row">
+ <span id="zopa-label-low">$0</span>
+ <span class="gloss-wrap text-amber text-xs">◆ Nash Point
+ <span class="gloss-icon" aria-label="What is the Nash Point?">ⓘ</span>
+ <span class="gloss-tip" role="tooltip">The mathematically fair deal price, where both sides gain equally</span>
+ </span>
+ <span id="zopa-label-high">$100K</span>
+ </div>
+ <div id="zopa-width-indicator" class="zopa-width-indicator">Deal zone: 100%</div>
+ </section>
+
+ <!-- Tension Meter — item 12 -->
+ <div class="tension-section" aria-label="Tension meter">
+ <span class="tension-label">
+ <span class="gloss-wrap">Tension
+ <span class="gloss-icon" aria-label="What is Tension?">ⓘ</span>
+ <span class="gloss-tip" role="tooltip">How heated the negotiation is. High tension = opponent may make mistakes or walk away.</span>
+ </span>
+ </span>
+ <div class="tension-track" role="progressbar" aria-label="Negotiation tension" aria-valuemin="0" aria-valuemax="100">
+ <div id="tension-fill" class="tension-fill" data-level="low" style="width:0%;"></div>
+ </div>
+ <span id="tension-value" class="tension-value">0%</span>
+ <span id="tension-descriptor" class="tension-descriptor">· Calm</span>
+ </div>
+
+ </section>
+
+ <!-- RIGHT COLUMN -->
+ <aside class="col-right" aria-label="Opponent and analytics">
+
+ <!-- Three.js Character Canvas — item 28 (280×380) -->
+ <section class="panel" aria-label="Opponent character">
+ <div class="panel-header">
+ <span class="panel-title">Opponent</span>
+ <span id="character-state-label" class="stat-chip blue">idle</span>
+ </div>
+ <div class="character-canvas-wrap">
+ <canvas id="character-canvas" width="280" height="380" aria-label="3D negotiator character"></canvas>
+ <div class="character-state-badge">idle</div>
+ </div>
+ <!-- Persona name plate — item 32 -->
+ <div id="persona-nameplate" class="persona-nameplate">
+ <span id="nameplate-symbol" class="nameplate-symbol" style="color: var(--gold);">◈</span>
+ <div class="nameplate-text">
+ <div id="nameplate-name" class="nameplate-name">—</div>
+ <div id="nameplate-tag" class="nameplate-tag">Choose a persona</div>
+ </div>
+ </div>
+ </section>
+
+ <!-- ToM Belief State — item 11 -->
+ <section class="panel tom-section" aria-label="Theory of Mind belief state">
+ <div class="panel-header">
+ <span class="panel-title">
+ <span class="gloss-wrap">ToM Belief State
+ <span class="gloss-icon" aria-label="What is ToM?">ⓘ</span>
+ <span class="gloss-tip" role="tooltip">Theory of Mind — what the AI thinks it knows about you</span>
+ </span>
+ </span>
+ </div>
+
+ <div class="tom-beliefs">
+ <div class="belief-row">
+ <span class="belief-label">Cooperative</span>
+ <div class="belief-track" role="progressbar" aria-label="Cooperative belief">
+ <div id="belief-cooperative-fill" class="belief-fill cooperative" style="width:50%;"></div>
+ </div>
+ <span id="belief-cooperative-pct" class="belief-pct">50%</span>
+ <div id="belief-cooperative-conf" class="belief-confidence confidence-medium"></div>
+ </div>
+
+ <div class="belief-row">
+ <span class="belief-label">Competitive</span>
+ <div class="belief-track" role="progressbar" aria-label="Competitive belief">
+ <div id="belief-competitive-fill" class="belief-fill competitive" style="width:50%;"></div>
+ </div>
+ <span id="belief-competitive-pct" class="belief-pct">50%</span>
+ <div id="belief-competitive-conf" class="belief-confidence confidence-medium"></div>
+ </div>
+
+ <div class="belief-row">
+ <span class="belief-label">Reservation</span>
+ <div class="belief-track" role="progressbar" aria-label="Reservation sensitivity">
+ <div id="belief-reservation-fill" class="belief-fill reservation" style="width:76%;"></div>
374
+ </div>
375
+ <span id="belief-reservation-pct" class="belief-pct">76%</span>
376
+ <div id="belief-reservation-conf" class="belief-confidence confidence-high"></div>
377
+ </div>
378
+
379
+ <div class="belief-row">
380
+ <span class="belief-label">Flexibility</span>
381
+ <div class="belief-track" role="progressbar" aria-label="Flexibility belief">
382
+ <div id="belief-flexibility-fill" class="belief-fill flexibility" style="width:50%;"></div>
383
+ </div>
384
+ <span id="belief-flexibility-pct" class="belief-pct">50%</span>
385
+ <div id="belief-flexibility-conf" class="belief-confidence confidence-medium"></div>
386
+ </div>
387
+ </div>
388
+
389
+ <div class="mt-4 sparkline-wrap" style="height:80px;">
390
+ <canvas id="belief-chart" class="sparkline-canvas" aria-label="Belief confidence over time"></canvas>
391
+ </div>
392
+ </section>
393
+
394
+ <!-- Offer History Sparkline -->
395
+ <section class="panel" aria-label="Offer history">
396
+ <div class="panel-header"><span class="panel-title">Offer History</span></div>
397
+ <div class="sparkline-wrap" style="height:110px;">
398
+ <canvas id="offer-sparkline" class="sparkline-canvas" aria-label="Offer history sparkline"></canvas>
399
+ </div>
400
+ <div class="sparkline-labels">
401
+ <span id="sparkline-lo" class="sparkline-label">—</span>
402
+ <span class="gloss-wrap text-amber">◆ Nash
403
+ <span class="gloss-icon" aria-label="What is the Nash Point?">ⓘ</span>
404
+ <span class="gloss-tip" role="tooltip">The mathematically fair deal price, where both sides gain equally</span>
405
+ </span>
406
+ <span id="sparkline-hi" class="sparkline-label">—</span>
407
+ </div>
408
+ </section>
409
+
410
+ <!-- Leaderboard -->
411
+ <section class="panel" aria-label="Leaderboard">
412
+ <div class="panel-header">
413
+ <span class="panel-title">Top 5</span>
414
+ <button class="btn btn-ghost btn-sm" type="button"
415
+ onclick="loadLeaderboard()" aria-label="Refresh leaderboard" title="Refresh">↻</button>
416
+ </div>
417
+ <table class="leaderboard-table" role="table" aria-label="Top players leaderboard">
418
+ <thead>
419
+ <tr>
420
+ <th scope="col">#</th>
421
+ <th scope="col">Player</th>
422
+ <th class="num" scope="col">Score</th>
423
+ <th class="num" scope="col">Deals</th>
424
+ </tr>
425
+ </thead>
426
+ <tbody id="leaderboard-body">
427
+ <tr><td colspan="4" class="empty-state text-muted">No games yet</td></tr>
428
+ </tbody>
429
+ </table>
430
+ </section>
431
+
432
+ </aside>
433
+
434
+ </main>
435
+
436
+ <script src="/static/character.js?v=5"></script>
437
+ <script src="/static/chart.js?v=5"></script>
438
+ <script src="/static/app.js?v=5"></script>
439
+
440
+ </body>
441
+ </html>
dashboard/train_results.html CHANGED
@@ -77,15 +77,6 @@
77
  }
78
  .fig-placeholder a { color: var(--gold); }
79
  .caption { font-size: 0.9rem; color: var(--smoke); margin-top: 0.5rem; }
80
- .transcript-box {
81
- max-height: 500px; overflow-y: auto;
82
- background: var(--mahogany);
83
- border: 1px solid var(--gold);
84
- padding: 1rem; border-radius: 4px;
85
- }
86
- .transcript-box::-webkit-scrollbar { width: 8px; }
87
- .transcript-box::-webkit-scrollbar-track { background: var(--felt); }
88
- .transcript-box::-webkit-scrollbar-thumb { background: #4a3d28; border-radius: 4px; }
89
  .hub-card {
90
  border: 2px solid var(--gold); background: var(--mahogany);
91
  padding: 1.25rem; border-radius: 4px;
@@ -105,11 +96,28 @@
105
  background: var(--felt); padding: 1rem; border-radius: 3px; overflow-x: auto; }
106
  a.back { color: var(--smoke); text-decoration: none; font-size: 0.9rem; }
107
  a.back:hover { color: var(--gold); }
 
 
 
 
 
 
 
 
 
 
 
108
  </style>
109
  </head>
110
  <body>
111
  <div class="page">
112
- <p><a class="back" href="/">← Back to the Deal Room</a></p>
 
 
 
 
 
 
113
 
114
  <header>
115
  <h1>◈ Training Results</h1>
@@ -119,13 +127,14 @@
119
 
120
  <section>
121
  <h2>Key Numbers</h2>
 
122
  <div class="card-row" id="key-numbers">
123
  <div class="num-card">
124
- <h3>Random Baseline</h3>
125
  <div class="val scar" id="k-random">—</div>
126
  </div>
127
  <div class="num-card">
128
- <h3>Base Model</h3>
129
  <div class="val smoke" id="k-base">—</div>
130
  </div>
131
  <div class="num-card">
@@ -161,8 +170,19 @@
161
  </section>
162
 
163
  <section>
164
- <h2>What Changed: Base Model vs Trained Agent</h2>
165
- <div id="transcript-container"></div>
 
 
 
 
 
 
 
 
 
 
 
166
  </section>
167
 
168
  <section>
@@ -192,9 +212,9 @@
192
  function renderStatus(data) {
193
  const badge = document.getElementById("status-badge");
194
  if (data.model_on_hub) {
195
- badge.innerHTML = '<span class="badge ok">Model on Hub</span>';
196
  } else {
197
- badge.innerHTML = '<span class="badge wait">Training not yet run</span>';
198
  }
199
  }
200
  function renderKeyNumbers(data) {
@@ -246,38 +266,28 @@
246
  cap.hidden = true;
247
  }
248
  }
249
- function compareSection(pa) {
250
  const el = document.getElementById("fig-compare");
251
  const cap = document.getElementById("cap-compare");
252
- if (pa.comparison) {
253
- el.innerHTML = '<img src="/results/training_curves.png" alt="Comparison" style="width:100%;border:1px solid var(--gold);border-radius:2px" />';
 
254
  cap.hidden = false;
255
  } else {
256
- el.innerHTML = '<div class="fig-placeholder">Four-bar chart will appear after evaluation.</div>';
257
  cap.hidden = true;
258
  }
259
  }
260
- async function transcriptSection(pa) {
261
- const c = document.getElementById("transcript-container");
262
- if (pa.transcript) {
263
- try {
264
- const res = await fetch("/results/before_after_transcript.html", { cache: "no-cache" });
265
- const html = await res.text();
266
- c.innerHTML = '<div class="transcript-box">' + html + '</div>';
267
- } catch (e) {
268
- c.innerHTML = '<div class="fig-placeholder">Could not load transcript.</div>';
269
- }
270
- } else {
271
- c.innerHTML = '<div class="fig-placeholder">Transcript comparison will appear after evaluation run.</div>';
272
- }
273
- }
274
  function hubSection(data) {
275
  const h = document.getElementById("hub-block");
 
 
276
  if (data.model_on_hub) {
277
  h.innerHTML =
278
  '<div class="hub-card">' +
279
- '<p><strong>Trained model available on Hugging Face Hub</strong></p>' +
280
- '<p><a href="https://huggingface.co/sh4shv4t/parlay-negotiator" target="_blank" rel="noopener">→ sh4shv4t/parlay-negotiator</a></p>' +
 
281
  '<button type="button" class="btn-gold" id="btn-try-trained">→ Play against the trained model</button>' +
282
  '</div>';
283
  document.getElementById("btn-try-trained").addEventListener("click", async () => {
@@ -293,9 +303,10 @@
293
  });
294
  } else {
295
  h.innerHTML =
296
- '<div class="hub-card muted">' +
297
- '<p>Model will be pushed to Hub after training completes.</p>' +
298
- '<p style="color:var(--smoke)">huggingface.co/sh4shv4t/parlay-negotiator</p>' +
 
299
  '</div>';
300
  }
301
  }
@@ -311,8 +322,7 @@
311
  sftSection(data.sft_loss_url);
312
  rewardSection(data.grpo_reward_url);
313
  grpoLossSection(data.grpo_loss_url);
314
- compareSection(data.plots_available);
315
- await transcriptSection(data.plots_available);
316
  hubSection(data);
317
  } catch (e) {
318
  document.getElementById("status-badge").innerHTML = '<span class="badge wait">Could not load status</span>';
 
77
  }
78
  .fig-placeholder a { color: var(--gold); }
79
  .caption { font-size: 0.9rem; color: var(--smoke); margin-top: 0.5rem; }
 
 
 
 
 
 
 
 
 
80
  .hub-card {
81
  border: 2px solid var(--gold); background: var(--mahogany);
82
  padding: 1.25rem; border-radius: 4px;
 
96
  background: var(--felt); padding: 1rem; border-radius: 3px; overflow-x: auto; }
97
  a.back { color: var(--smoke); text-decoration: none; font-size: 0.9rem; }
98
  a.back:hover { color: var(--gold); }
99
+ .read-block {
100
+ background: var(--mahogany);
101
+ border: 1px solid rgba(201, 168, 76, 0.35);
102
+ border-radius: 4px; padding: 1rem 1.15rem; margin-top: 0.9rem;
103
+ font-size: 0.95rem; line-height: 1.45; color: var(--cream);
104
+ }
105
+ .read-block h3 {
106
+ font-family: var(--font-display); color: var(--gold); font-size: 1.02rem; margin: 0 0 0.4rem 0; font-style: italic;
107
+ }
108
+ .read-block p { margin: 0 0 0.65rem 0; }
109
+ .read-block p:last-child { margin-bottom: 0; }
110
  </style>
111
  </head>
112
  <body>
113
  <div class="page">
114
+ <p>
115
+ <a class="back" href="/">← Deal Room</a>
116
+ &ensp;·&ensp;
117
+ <a class="back" href="/interact">Talk to the Model</a>
118
+ &ensp;·&ensp;
119
+ <a class="back" href="/judge">GRPO demo</a>
120
+ </p>
121
 
122
  <header>
123
  <h1>◈ Training Results</h1>
 
127
 
128
  <section>
129
  <h2>Key Numbers</h2>
130
+ <p class="caption" style="margin:0 0 0.75rem 0">Mean episode reward under the same eval protocol: random play, the frozen base (instruction) model, and your GRPO model.</p>
131
  <div class="card-row" id="key-numbers">
132
  <div class="num-card">
133
+ <h3>Random baseline</h3>
134
  <div class="val scar" id="k-random">—</div>
135
  </div>
136
  <div class="num-card">
137
+ <h3>Base model</h3>
138
  <div class="val smoke" id="k-base">—</div>
139
  </div>
140
  <div class="num-card">
 
170
  </section>
171
 
172
  <section>
173
+ <h2>What you are seeing (and what compute unlocks next)</h2>
174
+ <div class="read-block">
175
+ <h3>SFT loss</h3>
176
+ <p>Supervised training usually shows a clear downward trend early, then a gentler slope as the model matches the Parlay format and tone. A flatter tail often means the model is close to the local optimum for that data — not that learning has “stopped” entirely.</p>
177
+ <h3>GRPO mean reward</h3>
178
+ <p>Policy-gradient training optimizes a noisy signal: each batch samples different episodes and rollouts, so the curve wiggles. Uptrends mean the policy is moving toward higher rewards on average (GRPO uses group-relative advantages rather than a learned value head). Plateaus are common when the policy settles into a local optimum and the advantage estimates shrink.</p>
179
+ <h3>GRPO training loss</h3>
180
+ <p>Unlike SFT, this is not a simple cross-entropy to a single target. Loss can bump around while reward improves because the loss reflects ratios, clipping, and changing baselines, not just “closer to data.”</p>
181
+ <h3>Random vs base vs trained</h3>
182
+ <p>Random play anchors the scale: it shows what unstructured actions score under the same reward. The base model (before GRPO) reflects instruction-following without RL shaping; GRPO should lift the mean if the environment signal is learnable. Gaps that look small in absolute value can still be meaningful when rewards mix sparse bonuses and penalties.</p>
183
+ <h3>Compute</h3>
184
+ <p>We ran a compact schedule (1.5B + LoRA, modest generations per step) to keep iteration fast. With more budget, the same stack could support longer rollouts, a larger group size for less noisy advantage estimates, or additional GRPO steps to see whether reward plateaus or keeps climbing, plus full fine-tuning if we wanted to stress capacity over adapters.</p>
185
+ </div>
186
  </section>
187
 
188
  <section>
 
212
  function renderStatus(data) {
213
  const badge = document.getElementById("status-badge");
214
  if (data.model_on_hub) {
215
+ badge.innerHTML = '<span class="badge ok">In-app: trained model ready</span>';
216
  } else {
217
+ badge.innerHTML = '<span class="badge wait">Hub checkpoint is live; set <code>HF_MODEL_REPO</code> for the trained in-app opponent</span>';
218
  }
219
  }
220
  function renderKeyNumbers(data) {
 
266
  cap.hidden = true;
267
  }
268
  }
269
+ function compareSection(data) {
270
  const el = document.getElementById("fig-compare");
271
  const cap = document.getElementById("cap-compare");
272
+ const url = data.comparison_url;
273
+ if (url) {
274
+ el.innerHTML = '<img src="' + url + '" alt="Random vs Base vs GRPO comparison" style="width:100%;border:1px solid var(--gold);border-radius:2px" />';
275
  cap.hidden = false;
276
  } else {
277
+ el.innerHTML = '<div class="fig-placeholder">Four-bar chart: add <code>images/training_curves.png</code> (or <code>results/training_curves.png</code>) or run eval to generate a comparison.</div>';
278
  cap.hidden = true;
279
  }
280
  }
 
 
 
 
 
 
 
 
 
 
 
 
 
 
281
  function hubSection(data) {
282
  const h = document.getElementById("hub-block");
283
+ const hubUrl = "https://huggingface.co/sh4shv4t/parlay-grpo-1-5b";
284
+ const hubName = "sh4shv4t/parlay-grpo-1-5b";
285
  if (data.model_on_hub) {
286
  h.innerHTML =
287
  '<div class="hub-card">' +
288
+ '<p><strong>Trained model on Hugging Face Hub</strong></p>' +
289
+ '<p><a href="' + hubUrl + '" target="_blank" rel="noopener">→ ' + hubName + '</a></p>' +
290
+ '<p class="caption" style="color:var(--smoke)">GRPO LoRA on Qwen2.5-1.5B (Parlay episodes).</p>' +
291
  '<button type="button" class="btn-gold" id="btn-try-trained">→ Play against the trained model</button>' +
292
  '</div>';
293
  document.getElementById("btn-try-trained").addEventListener("click", async () => {
 
303
  });
304
  } else {
305
  h.innerHTML =
306
+ '<div class="hub-card">' +
307
+ '<p><strong>Trained model on Hugging Face Hub</strong></p>' +
308
+ '<p><a href="' + hubUrl + '" target="_blank" rel="noopener">→ ' + hubName + '</a></p>' +
309
+ '<p class="caption" style="color:var(--smoke)">The checkpoint is live. To use it as the opponent in this app, set the server env <code>HF_MODEL_REPO</code> to <code>' + hubName + '</code> (Hugging Face Spaces: Secrets), then use the play button on the home page after switching the opponent.</p>' +
310
  '</div>';
311
  }
312
  }
 
322
  sftSection(data.sft_loss_url);
323
  rewardSection(data.grpo_reward_url);
324
  grpoLossSection(data.grpo_loss_url);
325
+ compareSection(data);
 
326
  hubSection(data);
327
  } catch (e) {
328
  document.getElementById("status-badge").innerHTML = '<span class="badge wait">Could not load status</span>';
images/Parlay_square logo.png ADDED

Git LFS Details

  • SHA256: 0e2e1a93a0513de853c7161072b2aedee5b297fa7087ded3684513e444f1157a
  • Pointer size: 132 Bytes
  • Size of remote file: 3.16 MB
images/grpo_loss_curve.png ADDED

Git LFS Details

  • SHA256: 83ea972cabea22d17f145ed43ca92795f312b834ab83284eabcedc72d6d52a8e
  • Pointer size: 130 Bytes
  • Size of remote file: 65.2 kB
images/grpo_reward_curve.png ADDED

Git LFS Details

  • SHA256: 89fecc46790b4a680a3432d6ec09c8554018062516bf0588289d9278644c8618
  • Pointer size: 130 Bytes
  • Size of remote file: 55.8 kB
images/training_curves.png ADDED

Git LFS Details

  • SHA256: 02ee29dfcc78bd403fa8b6f91272947751be1300e6a607f85b9e82296a517df1
  • Pointer size: 130 Bytes
  • Size of remote file: 18.6 kB
main.py CHANGED
@@ -112,6 +112,24 @@ async def serve_train_results() -> FileResponse:
112
  )
113
 
114
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
115
  @app.get("/favicon.ico", include_in_schema=False, response_model=None)
116
  async def favicon():
117
  """
 
112
  )
113
 
114
 
115
+ @app.get("/judge", include_in_schema=False)
116
+ async def serve_judge_demo() -> FileResponse:
117
+ """GRPO (trained) negotiator: same game UI with opponent forced to HF model."""
118
+ return FileResponse(
119
+ "dashboard/judge.html",
120
+ headers={"Cache-Control": "no-cache, must-revalidate"},
121
+ )
122
+
123
+
124
+ @app.get("/interact", include_in_schema=False)
125
+ async def serve_interact() -> FileResponse:
126
+ """Direct model inference page — talk to the GRPO model without game scaffolding."""
127
+ return FileResponse(
128
+ "dashboard/interact.html",
129
+ headers={"Cache-Control": "no-cache, must-revalidate"},
130
+ )
131
+
132
+
133
  @app.get("/favicon.ico", include_in_schema=False, response_model=None)
134
  async def favicon():
135
  """
openenv.yaml CHANGED
@@ -17,7 +17,7 @@ license: "MIT"
17
  # URLs — judges pull the env from space_url
18
  space_url: "https://huggingface.co/spaces/sh4shv4t/Parlay"
19
  repository: "https://github.com/sh4shv4t/Parlay"
20
- blog: "https://huggingface.co/blog/sh4shv4t/parlay"
21
  dataset: "https://huggingface.co/datasets/sh4shv4t/parlay-episodes"
22
  sft_model: "https://huggingface.co/sh4shv4t/parlay-sft-1-5b"
23
  grpo_model: "https://huggingface.co/sh4shv4t/parlay-grpo-1-5b"
 
17
  # URLs — judges pull the env from space_url
18
  space_url: "https://huggingface.co/spaces/sh4shv4t/Parlay"
19
  repository: "https://github.com/sh4shv4t/Parlay"
20
+ blog: "https://github.com/sh4shv4t/Parlay/blob/main/BLOG.md"
21
  dataset: "https://huggingface.co/datasets/sh4shv4t/parlay-episodes"
22
  sft_model: "https://huggingface.co/sh4shv4t/parlay-sft-1-5b"
23
  grpo_model: "https://huggingface.co/sh4shv4t/parlay-grpo-1-5b"
requirements-dev.txt ADDED
@@ -0,0 +1,2 @@
 
 
 
1
+ # Development tooling (optional)
2
+ pre-commit>=3.7.0
results/eval_results.json ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
1
+ {
2
+ "random_mean_reward": 70.8231,
3
+ "base_mean_reward": null,
4
+ "grpo_mean_reward": null,
5
+ "_comment": "random from: python -m training.random_baseline --episodes 50 --output results/random_baseline.json (local, 2026-04-26). base_mean_reward and grpo_mean_reward need: Python with torch+GPU, data/episodes.jsonl with split=eval, then python -m training.evaluate --base ... --sft ... --grpo ... -n 50 -o results/eval_results.json (merges these keys)."
6
+ }
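The `_comment` above notes that `training.evaluate` merges its keys into this file. That read-modify-write merge can be sketched as follows (a hypothetical helper for illustration, not the repo's actual code):

```python
import json


def merge_eval_keys(existing_json: str, new_keys: dict) -> str:
    # Preserve keys already present (e.g. random_mean_reward) and
    # overwrite only the ones the new eval run produced.
    data = json.loads(existing_json)
    data.update(new_keys)
    return json.dumps(data, indent=2)
```

This keeps the locally computed random baseline intact while the GPU eval fills in the `null` model scores later.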
results/random_baseline.json ADDED
@@ -0,0 +1,9 @@
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "episodes_requested": 50,
3
+ "episodes_completed": 50,
4
+ "avg_reward": 70.8231,
5
+ "deal_rate": 1.0,
6
+ "avg_efficiency": 0.4836,
7
+ "avg_tom_accuracy": 0.663,
8
+ "bluffs_caught": 0
9
+ }
scripts/check_staged_not_pycache.py ADDED
@@ -0,0 +1,37 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Fail if any staged file is under __pycache__ or is a .pyc / .pyo (pre-commit local hook)."""
2
+ from __future__ import annotations
3
+
4
+ import subprocess
5
+ import sys
6
+ from pathlib import Path
7
+
8
+
9
+ def main() -> int:
10
+ out = subprocess.run(
11
+ ["git", "diff", "--cached", "--name-only", "-z"],
12
+ check=True,
13
+ capture_output=True,
14
+ ).stdout
15
+ if not out:
16
+ return 0
17
+ bad: list[Path] = []
18
+ for raw in out.split(b"\0"):
19
+ if not raw:
20
+ continue
21
+ p = raw.decode("utf-8", errors="replace")
22
+ pl = p.lower()
23
+ if "__pycache__" in p:
24
+ bad.append(Path(p))
25
+ elif pl.endswith(".pyc") or pl.endswith(".pyo"):
26
+ bad.append(Path(p))
27
+ if not bad:
28
+ return 0
29
+ print("Refuse to commit bytecode or __pycache__ paths:", file=sys.stderr)
30
+ for p in bad:
31
+ print(f" {p}", file=sys.stderr)
32
+ print("Remove from the index: git reset HEAD -- <file>", file=sys.stderr)
33
+ return 1
34
+
35
+
36
+ if __name__ == "__main__":
37
+ raise SystemExit(main())
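The path filter the hook applies can be exercised standalone. The helper below mirrors its rules (case-sensitive `__pycache__` segment match, case-insensitive bytecode suffixes) on a hypothetical staged-file list:

```python
def is_forbidden(path: str) -> bool:
    # Same predicate as the hook: any __pycache__ path component,
    # or a .pyc / .pyo suffix regardless of case.
    lower = path.lower()
    return "__pycache__" in path or lower.endswith((".pyc", ".pyo"))


staged = [
    "src/app.py",
    "src/__pycache__/app.cpython-311.pyc",
    "lib/legacy.PYO",
    "docs/notes.md",
]
blocked = [p for p in staged if is_forbidden(p)]
```

With this input, only the bytecode paths land in `blocked`; regular sources and docs pass through.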
scripts/push_dataset.py CHANGED
@@ -76,7 +76,7 @@ deal_efficiency, tom_accuracy, drift_adapted
76
  [Space](https://huggingface.co/spaces/sh4shv4t/Parlay) |
77
  [GitHub](https://github.com/sh4shv4t/Parlay) |
78
  [SFT Model](https://huggingface.co/sh4shv4t/parlay-sft-1-5b) |
79
- [Blog](https://huggingface.co/blog/sh4shv4t/parlay)
80
  """
81
  with tempfile.NamedTemporaryFile(
82
  mode="w",
 
76
  [Space](https://huggingface.co/spaces/sh4shv4t/Parlay) |
77
  [GitHub](https://github.com/sh4shv4t/Parlay) |
78
  [SFT Model](https://huggingface.co/sh4shv4t/parlay-sft-1-5b) |
79
+ [Blog](https://github.com/sh4shv4t/Parlay/blob/main/BLOG.md)
80
  """
81
  with tempfile.NamedTemporaryFile(
82
  mode="w",
training/GRPO_HF_RUNBOOK.md CHANGED
@@ -140,7 +140,7 @@ GRPO already **builds charts in code** (`training/grpo_train.py` → `_save_trai
140
 
141
  ## 4. Alternative: Colab (no Jobs)
142
 
143
- Use `notebooks/parlay_grpo_colab.ipynb`. In the **config** cell, set:
144
 
145
  ```python
146
  JSONL_VIA_HF = ("sh4shv4t/parlay-episodes", "episodes_v2.jsonl")
 
140
 
141
  ## 4. Alternative: Colab (no Jobs)
142
 
143
+ Use `training/notebooks/parlay_grpo_colab.ipynb`. In the **config** cell, set:
144
 
145
  ```python
146
  JSONL_VIA_HF = ("sh4shv4t/parlay-episodes", "episodes_v2.jsonl")
{notebooks → training/notebooks}/parlay_grpo_colab.ipynb RENAMED
File without changes
training/notebooks/parlay_grpo_hf_job_log_summary.ipynb ADDED
@@ -0,0 +1,86 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "nbformat": 4,
3
+ "nbformat_minor": 5,
4
+ "metadata": {
5
+ "colab": {
6
+ "provenance": [],
7
+ "gpuType": "A100"
8
+ },
9
+ "kernelspec": {
10
+ "display_name": "Python 3",
11
+ "name": "python3"
12
+ },
13
+ "language_info": {
14
+ "name": "python"
15
+ }
16
+ },
17
+ "cells": [
18
+ {
19
+ "cell_type": "markdown",
20
+ "metadata": {},
21
+ "source": [
22
+ "# Hugging Face Job — GRPO run (truncated log)\n",
23
+ "\n",
24
+ "**Why:** Stage-2 **GRPO** on top of the SFT LoRA so the policy is optimized with Parlay rewards (ToM, format, anti-capitulation, etc.) and the adapter can be **pushed to the Hub** for eval / Spaces.\n",
25
+ "\n",
26
+ "**This notebook** is not a runnable training recipe — it is a **short record** of one HF Job: intent + shell entrypoint + cherry-picked log lines."
27
+ ],
28
+ "id": "md-purpose"
29
+ },
30
+ {
31
+ "cell_type": "markdown",
32
+ "metadata": {},
33
+ "source": [
34
+ "## Command (repo root on the job, e.g. `/work`)\n",
35
+ "\n",
36
+ "After `git clone` and `pip install -r requirements-train.txt`, the job used the standard entry script with **80 steps** and **G=2** (matches console lines below)."
37
+ ],
38
+ "id": "md-cmd-intro"
39
+ },
40
+ {
41
+ "cell_type": "code",
42
+ "metadata": {},
43
+ "source": [
44
+ "%%bash\n",
45
+ "# Equivalent to what the HF Job ran (set HF_TOKEN / HUGGINGFACE_HUB_TOKEN for push)\n",
46
+ "export GRPO_STEPS=80 GRPO_G=2\n",
47
+ "export DATASET_ID=sh4shv4t/parlay-episodes EPISODE_FILE=episodes_v2.jsonl\n",
48
+ "export SFT_MODEL=sh4shv4t/parlay-sft-1-5b HF_GRPO_REPO=sh4shv4t/parlay-grpo-1-5b OUTPUT_DIR=outputs/grpo_run\n",
49
+ "bash scripts/hf_grpo_entry.sh"
50
+ ],
51
+ "id": "cell-bash-cmd",
52
+ "execution_count": null,
53
+ "outputs": []
54
+ },
55
+ {
56
+ "cell_type": "markdown",
57
+ "metadata": {},
58
+ "source": [
59
+ "## Truncated log (high signal only)\n",
60
+ "\n",
61
+ "```text\n",
62
+ "Job started at 2026-04-26 00:15:37\n",
63
+ "Cloning into '/work'...\n",
64
+ "... pip install -r requirements-train.txt ... (torch 2.8, transformers 5.6, trl 1.2, ...)\n",
65
+ "\n",
66
+ "==> Downloading episodes_v2.jsonl from dataset sh4shv4t/parlay-episodes ...\n",
67
+ "==> GRPO: SFT=sh4shv4t/parlay-sft-1-5b steps=80 G=2 out=/work/outputs/grpo_run\n",
68
+ "Filtered 0 records below min_reward=-50.0, 124 remaining for GRPO\n",
69
+ "\n",
70
+ "INFO: Loading SFT LoRA: adapter=sh4shv4t/parlay-sft-1-5b base=Qwen/Qwen2.5-1.5B-Instruct\n",
71
+ "INFO: Starting GRPO training: ... prompts=124, G=2, steps=80\n",
72
+ "\n",
73
+ " ... 80/80 steps (~15s/step) ...\n",
74
+ " {'train_loss': '0.0001051', 'epoch': '5.333', ...}\n",
75
+ "\n",
76
+ "No log history to plot\n",
77
+ "INFO: Model saved to /work/outputs/grpo_run\n",
78
+ "==> Pushing to https://huggingface.co/sh4shv4t/parlay-grpo-1-5b ...\n",
79
+ "Model uploaded successfully!\n",
80
+ "==> Done.\n",
81
+ "```"
82
+ ],
83
+ "id": "md-truncated-log"
84
+ }
85
+ ]
86
+ }
training/notebooks/parlay_hf_only_eval_colab.ipynb ADDED
@@ -0,0 +1,371 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "b3dfc8be",
6
+ "metadata": {},
7
+ "source": [
8
+ "# Parlay — Hugging Face only eval (no GitHub)\n",
9
+ "\n",
10
+ "This notebook **only** needs a Hugging Face token (Colab secret `HF_TOKEN`) and a **GPU** runtime. After the first run of the **Install dependencies** cell, use **Runtime → Restart session** and run all cells from the top (needed so the reinstalled `torch` / `torchvision` stack is loaded). If you change packages or see `torchao`, `peft`, or `BloomPreTrainedModel` import errors, restart again.\n",
11
+ "\n",
12
+ "It will:\n",
13
+ "\n",
14
+ "1. Download `episodes_v2.jsonl` from the dataset [sh4shv4t/parlay-episodes](https://huggingface.co/datasets/sh4shv4t/parlay-episodes)\n",
15
+ "2. Keep rows with `split == \"eval\"`\n",
16
+ "3. Load **Qwen2.5-1.5B-Instruct**, [sh4shv4t/parlay-sft-1-5b](https://huggingface.co/sh4shv4t/parlay-sft-1-5b), and [sh4shv4t/parlay-grpo-1-5b](https://huggingface.co/sh4shv4t/parlay-grpo-1-5b) (LoRA adapters on top of the base)\n",
17
+ "4. For each episode, generate one JSON reply in the same chat style as training (`apply_chat_template`), parse `offer_amount`, and score **terminal deal efficiency** (GAMMA=100) with buyer- vs seller-AI ZOPA logic (same rules as the Parlay `reward_fn` for efficiency)\n",
18
+ "5. Print means and a JSON blob you can paste into `results/eval_results.json`\n",
19
+ "\n",
20
+ "This is a **single-step** policy probe on the first user turn (like `training/evaluate.py` on GPU), not a full multi-turn OpenEnv roll-out."
21
+ ]
22
+ },
23
+ {
24
+ "cell_type": "code",
25
+ "execution_count": null,
26
+ "id": "00322384",
27
+ "metadata": {},
28
+ "outputs": [],
29
+ "source": [
30
+ "# @title 1) Install dependencies\n",
31
+ "%%capture\n",
32
+ "# Reinstall torch+torchvision+torchaudio from ONE CUDA index (fixes mismatched cu130/cu128 and the bogus BloomPreTrainedModel peft error). If cu130 wheels fail, switch URL to .../whl/cu128.\n",
33
+ "%pip install -q --force-reinstall torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130\n",
34
+ "%pip install -q -U transformers accelerate peft bitsandbytes safetensors huggingface_hub sentencepiece\n",
35
+ "%pip install -q -U \"torchao>=0.16.0\""
36
+ ]
37
+ },
38
+ {
39
+ "cell_type": "code",
40
+ "execution_count": null,
41
+ "metadata": {},
42
+ "outputs": [],
43
+ "source": [
44
+ "# @title 2) HF_TOKEN (only secret you need)\n",
45
+ "import os\n",
46
+ "from google.colab import userdata\n",
47
+ "\n",
48
+ "HF_TOKEN = (userdata.get(\"HF_TOKEN\") or os.environ.get(\"HF_TOKEN\") or \"\").strip()\n",
49
+ "if not HF_TOKEN:\n",
50
+ " raise RuntimeError(\n",
51
+ " \"Set Colab secret HF_TOKEN: open the key icon → add HF_TOKEN with a read token from huggingface.co/settings/tokens\"\n",
52
+ " )\n",
53
+ "os.environ[\"HF_TOKEN\"] = HF_TOKEN\n",
54
+ "os.environ[\"HUGGING_FACE_HUB_TOKEN\"] = HF_TOKEN\n",
55
+ "print(\"HF_TOKEN: OK (length)\", len(HF_TOKEN))\n"
56
+ ]
57
+ },
58
+ {
59
+ "cell_type": "code",
60
+ "execution_count": null,
61
+ "metadata": {},
62
+ "outputs": [],
63
+ "source": [
64
+ "# @title 3) Config — Hub IDs (defaults match the Parlay README)\n",
65
+ "BASE_MODEL = \"Qwen/Qwen2.5-1.5B-Instruct\"\n",
66
+ "SFT_ADAPTER = \"sh4shv4t/parlay-sft-1-5b\"\n",
67
+ "GRPO_ADAPTER = \"sh4shv4t/parlay-grpo-1-5b\"\n",
68
+ "DATASET_REPO = \"sh4shv4t/parlay-episodes\"\n",
69
+ "DATASET_FILE = \"episodes_v2.jsonl\"\n",
70
+ "N_EVAL = 50 # set smaller (e.g. 10) for a quick smoke test\n",
71
+ "\n",
72
+ "import subprocess\n",
73
+ "import torch\n",
74
+ "if not torch.cuda.is_available():\n",
75
+ " print(\"WARNING: no GPU — inference will be very slow; use Runtime → Change runtime type → GPU.\")\n",
+ "else:\n",
+ " subprocess.run([\"nvidia-smi\", \"-L\"], check=False)\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# @title 4) Load eval rows from the Hub dataset (JSONL, no local clone)\n",
+ "import json\n",
+ "from huggingface_hub import hf_hub_download\n",
+ "\n",
+ "path = hf_hub_download(\n",
+ " repo_id=DATASET_REPO,\n",
+ " filename=DATASET_FILE,\n",
+ " repo_type=\"dataset\",\n",
+ " token=HF_TOKEN,\n",
+ ")\n",
+ "print(\"Downloaded:\", path)\n",
+ "\n",
+ "rows = []\n",
+ "with open(path, \"r\", encoding=\"utf-8\") as f:\n",
+ " for line in f:\n",
+ " line = line.strip()\n",
+ " if not line:\n",
+ " continue\n",
+ " r = json.loads(line)\n",
+ " if r.get(\"split\") == \"eval\":\n",
+ " rows.append(r)\n",
+ "if not rows:\n",
+ " raise RuntimeError(\"No rows with split=eval in JSONL — check the dataset file name / version on the Hub.\")\n",
+ "rows = rows[:N_EVAL]\n",
+ "print(\"Eval rows:\", len(rows), \"(capped to N_EVAL)\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "43ef16ab",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# @title 5) Scoring + prompt (matches Parlay `prompts_qwen` + efficiency reward semantics)\n",
+ "import re\n",
+ "import json as _json\n",
+ "from typing import Any\n",
+ "\n",
+ "GAMMA = 100.0\n",
+ "BUYER_AI = frozenset({\"hiring_package\", \"acquisition_term_sheet\"})\n",
+ "\n",
+ "\n",
+ "def _first_user_content(conversation) -> str:\n",
+ " if not isinstance(conversation, list):\n",
+ " return (\n",
+ " 'Please make your opening offer. Reply in valid JSON: '\n",
+ " '{\"utterance\": \"...\", \"offer_amount\": <number or null>, \"tactical_move\": <string or null>}'\n",
+ " )\n",
+ " for turn in conversation:\n",
+ " if not isinstance(turn, dict):\n",
+ " continue\n",
+ " if turn.get(\"role\") in (\"user\", \"negotiator\"):\n",
+ " c = str(turn.get(\"content\", \"\")).strip()\n",
+ " if c:\n",
+ " return c\n",
+ " return (\n",
+ " 'Please make your opening offer. Reply in valid JSON: '\n",
+ " '{\"utterance\": \"...\", \"offer_amount\": <number or null>, \"tactical_move\": <string or null>}'\n",
+ " )\n",
+ "\n",
+ "\n",
+ "def build_generation_prompt(rec: dict, tokenizer) -> str:\n",
+ " system_msg = str(rec.get(\"prompt\", \"\")).strip()\n",
+ " user_msg = _first_user_content(rec.get(\"conversation\", []))\n",
+ " messages = [\n",
+ " {\"role\": \"system\", \"content\": system_msg},\n",
+ " {\"role\": \"user\", \"content\": user_msg},\n",
+ " ]\n",
+ " if hasattr(tokenizer, \"apply_chat_template\"):\n",
+ " return tokenizer.apply_chat_template(\n",
+ " messages, tokenize=False, add_generation_prompt=True\n",
+ " )\n",
+ " eot = str(\n",
+ " bytes((60, 124, 105, 109, 95, 101, 110, 100, 124, 62)), \"ascii\"\n",
+ " )\n",
+ " return (\n",
+ " f\"<|im_start|>system\\n{system_msg}{eot}\\n\"\n",
+ " f\"<|im_start|>user\\n{user_msg}{eot}\\n\"\n",
+ " f\"<|im_start|>assistant\\n\"\n",
+ " )\n",
+ "\n",
+ "\n",
+ "def parse_offer(text: str) -> float:\n",
+ " t = (text or \"\").replace(\"```json\", \"\").replace(\"```\", \"\").strip()\n",
+ " m = re.search(r\"\\{[\\s\\S]*\\}\", t)\n",
+ " if not m:\n",
+ " return 0.0\n",
+ " try:\n",
+ " d = _json.loads(m.group(0))\n",
+ " v = d.get(\"offer_amount\")\n",
+ " if v is None:\n",
+ " return 0.0\n",
+ " return float(v)\n",
+ " except Exception:\n",
+ " return 0.0\n",
+ "\n",
+ "\n",
+ "def efficiency_reward(offer: float, rec: dict) -> float:\n",
+ " \"\"\"Parlay-style E in [0,1] * GAMMA for terminal deal efficiency.\"\"\"\n",
+ " batna_seller = float(rec.get(\"batna_seller\", 0) or 0)\n",
+ " batna_buyer = float(rec.get(\"batna_buyer\", batna_seller) or batna_seller)\n",
+ " zopa = max(1.0, batna_buyer - batna_seller)\n",
+ " sid = str(rec.get(\"scenario_id\", \"\") or \"\")\n",
+ " is_buyer = sid in BUYER_AI\n",
+ " if offer <= 0:\n",
+ " return 0.0\n",
+ " if is_buyer:\n",
+ " e = max(0.0, min(1.0, (batna_buyer - offer) / zopa))\n",
+ " else:\n",
+ " e = max(0.0, min(1.0, (offer - batna_seller) / zopa))\n",
+ " return e * GAMMA\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "aeefab1f",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# @title 6) Load models (4-bit on GPU) — base, SFT, GRPO\n",
+ "from pathlib import Path\n",
+ "from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig\n",
+ "from peft import PeftModel\n",
+ "from huggingface_hub import hf_hub_download\n",
+ "import json as _json\n",
+ "\n",
+ "\n",
+ "def load_tokenizer_for(repo_id: str):\n",
+ " # Prefer adapter repo tokenizer; Hub adapter repos usually ship tokenizer config\n",
+ " return AutoTokenizer.from_pretrained(\n",
+ " repo_id, trust_remote_code=True, token=HF_TOKEN\n",
+ " )\n",
+ "\n",
+ "\n",
+ "def is_adapter_repo(repo_id: str) -> bool:\n",
+ " try:\n",
+ " hf_hub_download(repo_id=repo_id, filename=\"adapter_config.json\", token=HF_TOKEN)\n",
+ " return True\n",
+ " except Exception:\n",
+ " return False\n",
+ "\n",
+ "\n",
+ "def load_causal(hub_id: str, use_4bit: bool = True):\n",
+ " use_bnb = use_4bit and torch.cuda.is_available()\n",
+ " common = dict(trust_remote_code=True, token=HF_TOKEN)\n",
+ " if use_bnb:\n",
+ " bnb = BitsAndBytesConfig(\n",
+ " load_in_4bit=True,\n",
+ " bnb_4bit_compute_dtype=torch.bfloat16,\n",
+ " bnb_4bit_use_double_quant=True,\n",
+ " bnb_4bit_quant_type=\"nf4\",\n",
+ " )\n",
+ " mkw = {**common, \"quantization_config\": bnb, \"device_map\": \"auto\"}\n",
+ " else:\n",
+ " mkw = {\n",
+ " **common,\n",
+ " \"torch_dtype\": torch.bfloat16 if torch.cuda.is_available() else torch.float32,\n",
+ " \"device_map\": \"auto\" if torch.cuda.is_available() else None,\n",
+ " }\n",
+ " if is_adapter_repo(hub_id):\n",
+ " cfg_p = Path(hf_hub_download(repo_id=hub_id, filename=\"adapter_config.json\", token=HF_TOKEN))\n",
+ " ac = _json.loads(cfg_p.read_text(encoding=\"utf-8\"))\n",
+ " base = ac.get(\"base_model_name_or_path\", BASE_MODEL)\n",
+ " base_m = AutoModelForCausalLM.from_pretrained(base, **mkw)\n",
+ " m = PeftModel.from_pretrained(base_m, hub_id, token=HF_TOKEN)\n",
+ " else:\n",
+ " m = AutoModelForCausalLM.from_pretrained(hub_id, **mkw)\n",
+ " m.eval()\n",
+ " return m\n",
+ "\n",
+ "print(\"Load helpers: OK\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2eeef38d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# @title 7) Run evaluation (reload tokenizer per checkpoint for matching templates)\n",
+ "import gc\n",
+ "\n",
+ "SPECS = [\n",
+ " (\"base\", BASE_MODEL),\n",
+ " (\"sft\", SFT_ADAPTER),\n",
+ " (\"grpo\", GRPO_ADAPTER),\n",
+ "]\n",
+ "\n",
+ "\n",
+ "@torch.inference_mode()\n",
+ "def run_eval_for_repo(name: str, repo_id: str) -> float:\n",
+ " tok = load_tokenizer_for(repo_id)\n",
+ " if tok.pad_token is None:\n",
+ " tok.pad_token = tok.eos_token\n",
+ " model = None\n",
+ " try:\n",
+ " model = load_causal(repo_id, use_4bit=True)\n",
+ " rews = []\n",
+ " for i, rec in enumerate(rows):\n",
+ " prompt = build_generation_prompt(rec, tok)\n",
+ " dev = next(model.parameters()).device\n",
+ " batch = tok(\n",
+ " prompt, return_tensors=\"pt\", max_length=4096, truncation=True\n",
+ " )\n",
+ " batch = {k: v.to(dev) for k, v in batch.items()}\n",
+ " out = model.generate(\n",
+ " **batch,\n",
+ " max_new_tokens=256,\n",
+ " do_sample=True,\n",
+ " temperature=0.7,\n",
+ " pad_token_id=tok.pad_token_id,\n",
+ " )\n",
+ " gen = out[0][batch[\"input_ids\"].shape[-1] :]\n",
+ " text = tok.decode(gen, skip_special_tokens=True)\n",
+ " offer = parse_offer(text)\n",
+ " rews.append(efficiency_reward(offer, rec))\n",
+ " return sum(rews) / max(len(rews), 1)\n",
+ " finally:\n",
+ " if model is not None:\n",
+ " del model\n",
+ " gc.collect()\n",
+ " if torch.cuda.is_available():\n",
+ " torch.cuda.empty_cache()\n",
+ " if torch.cuda.is_available():\n",
+ " torch.cuda.synchronize()\n",
+ "\n",
+ "\n",
+ "results = {}\n",
+ "for name, rid in SPECS:\n",
+ " print(\"Evaluating:\", name, rid)\n",
+ " m = run_eval_for_repo(name, rid)\n",
+ " results[name] = m\n",
+ " print(f\" mean reward (eff proxy): {m:.3f}\")\n",
+ "\n",
+ "print(\"\\n--- Summary ---\", results)\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "fc806987",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# @title 8) Build `eval_results.json` (copy into your repo: results/eval_results.json)\n",
+ "import json\n",
+ "out = {\n",
+ " \"base_mean_reward\": results[\"base\"],\n",
+ " \"sft_mean_reward\": results.get(\"sft\"),\n",
+ " \"grpo_mean_reward\": results.get(\"grpo\"),\n",
+ " \"n_eval\": len(rows),\n",
+ " \"dataset\": DATASET_REPO,\n",
+ " \"data_file\": DATASET_FILE,\n",
+ " \"models\": {\n",
+ " \"base\": BASE_MODEL,\n",
+ " \"sft\": SFT_ADAPTER,\n",
+ " \"grpo\": GRPO_ADAPTER,\n",
+ " },\n",
+ "}\n",
+ "print(json.dumps(out, indent=2))\n",
+ "from google.colab import files\n",
+ "path = \"/content/eval_results.json\"\n",
+ "with open(path, \"w\", encoding=\"utf-8\") as f:\n",
+ " json.dump(out, f, indent=2)\n",
+ "files.download(path)\n",
+ "print(\"Downloaded eval_results.json — add random_mean_reward manually (e.g. from random_baseline) if needed.\")"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "name": "python",
+ "version": "3.11.0"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+ }
{notebooks → training/notebooks}/parlay_sft_colab.ipynb RENAMED
File without changes