Direct inference with the GRPO-finetuned negotiator (Qwen2.5-1.5B)
`parlay-grpo-1-5b` is a Qwen2.5-1.5B-Instruct model fine-tuned in two stages: supervised fine-tuning (SFT) on Gemini-generated negotiation transcripts, followed by GRPO (Group Relative Policy Optimization) using the Parlay reward function, which mixes ZOPA progress, Theory-of-Mind accuracy, tactical card usage, and drift-adaptation bonuses.
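For intuition, here is a minimal sketch of how such a reward mix could be combined. The component names come from the description above; the function signature and the weights are illustrative assumptions, not the values used in training.

```python
def parlay_reward(
    zopa_progress: float,    # progress toward the zone of possible agreement, 0..1
    tom_accuracy: float,     # Theory-of-Mind prediction accuracy, 0..1
    tactical_usage: float,   # reward for well-timed tactical cards, 0..1
    drift_bonus: float,      # bonus for adapting when the counterpart drifts, 0..1
) -> float:
    """Weighted mix of the four reward components (weights are made up)."""
    weights = {"zopa": 0.4, "tom": 0.25, "tactics": 0.2, "drift": 0.15}
    return (
        weights["zopa"] * zopa_progress
        + weights["tom"] * tom_accuracy
        + weights["tactics"] * tactical_usage
        + weights["drift"] * drift_bonus
    )
```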
Every response is a JSON object with three fields:
- `utterance`: the natural-language negotiation turn.
- `offer_amount`: a numeric bid, or `null` for purely conversational turns.
- `tactical_move`: an optional tactical card to play (`anchor_high`, `batna_reveal`, or `silence`).
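A representative response might look like this (the values are hand-written for illustration):

```python
import json

# A made-up response matching the schema above.
raw = '''{
  "utterance": "I can move a little, but $1,200 is as low as I can go today.",
  "offer_amount": 1200,
  "tactical_move": "anchor_high"
}'''

turn = json.loads(raw)
print(turn["utterance"])          # chat bubble text
print(turn["offer_amount"])       # 1200, or None on conversational turns
print(turn.get("tactical_move"))  # "anchor_high", or None if no card is played
```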
The `utterance` is displayed as the chat bubble. If the model includes an `offer_amount`, it appears as a gold chip below the text. You can expand "Raw model JSON output" to see the full structured response.
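A minimal sketch of that display rule, assuming the UI receives the parsed dict (the helper name and chip formatting are hypothetical, not the Space's actual rendering code):

```python
def render_turn(turn: dict) -> str:
    # Hypothetical rendering helper: bubble text, plus a chip only when a bid exists.
    lines = [turn["utterance"]]
    if turn.get("offer_amount") is not None:
        lines.append(f"[offer chip] ${turn['offer_amount']:,}")  # gold chip below the text
    return "\n".join(lines)

print(render_turn({"utterance": "Let's talk terms.", "offer_amount": None}))
print(render_turn({"utterance": "Final answer.", "offer_amount": 1200}))
```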
On a GPU Space the model runs locally (fast after the first load). On a CPU Space, inference falls back to the Hugging Face Inference API: the first request may take 20–40 s while the hosted model warms up; subsequent requests are faster.
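A sketch of that local-vs-hosted fallback, using the standard `transformers` and `huggingface_hub` APIs; the model ID and generation parameters below are illustrative assumptions, not the Space's actual code.

```python
import torch
from transformers import pipeline
from huggingface_hub import InferenceClient

MODEL_ID = "parlay-grpo-1-5b"  # placeholder; substitute the actual Hub repo id

if torch.cuda.is_available():
    # GPU Space: load the model once and run it locally (slow first load only).
    _pipe = pipeline("text-generation", model=MODEL_ID, device=0)

    def generate(prompt: str) -> str:
        out = _pipe(prompt, max_new_tokens=256, return_full_text=False)
        return out[0]["generated_text"]
else:
    # CPU Space: fall back to the hosted Inference API. The first call may
    # take a while as the hosted model warms up; later calls are faster.
    _client = InferenceClient(model=MODEL_ID)

    def generate(prompt: str) -> str:
        return _client.text_generation(prompt, max_new_tokens=256)
```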