aamrinder committed on
Commit 9bd1f77 · verified · 1 Parent(s): 4cdc991

sync Colab notebook with current train_grpo.py

Files changed (1)
  1. notebooks/train_grpo_colab.ipynb +168 -125
notebooks/train_grpo_colab.ipynb CHANGED
@@ -3,52 +3,24 @@
  {
  "cell_type": "markdown",
  "metadata": {},
- "source": [
- "# Subtext Arena — GRPO training (Colab-runnable)\n",
- "\n",
- "Re-runnable notebook for judges. Trains a Qwen2.5-3B-Instruct policy with **Unsloth + TRL `GRPOTrainer`** on the Subtext Arena task.\n",
- "\n",
- "**Architecture (Option A — single-step CoT classification)**\n",
- "\n",
- "Each training rollout:\n",
- " 1. We build ONE prompt for one MUStARD clip — system + transcript + prosody features + pitch contour, all in the user message.\n",
- " 2. The model emits ONE completion: `<think>...</think><final>{\"label\":\"sarcastic\"|\"sincere\",\"confidence\":0..1}</final>`\n",
- " 3. Reward = 0.70 · correctness (confidence-weighted) + 0.15 · reasoning_length + 0.15 · format.\n",
- " 4. GRPO updates LoRA weights from the group-relative advantage.\n",
- "\n",
- "The Subtext Arena env still supports multi-step tool calling at inference time — that's our HF Space demo. But for *training* we sidestep TRL's single-shot generate-then-score constraint by pre-rendering the tool outputs into the prompt. This is the same pattern as the deck's Wordle / Sudoku notebooks.\n",
- "\n",
- "**Stack** (deck-named, requirement #2): Unsloth + TRL. T4-medium fits.\n",
- "**Estimated runtime**: ~12 hours for 200 GRPO steps on T4-medium ($0.60/hr × 12 ≈ $8)."
- ]
  },
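
To make steps 1-2 of the rollout list above concrete, here is what one training example looks like. The transcript and prosody values are invented, and `SYSTEM_PROMPT` here is a placeholder for the real constant in `train/train_grpo.py`:

```python
# One rollout under the single-step setup: everything the agent would have
# gathered via tools is pre-rendered into the user message.
SYSTEM_PROMPT = 'You judge sarcasm from transcript + prosody evidence.'  # placeholder

messages = [
    {'role': 'system', 'content': SYSTEM_PROMPT},
    {'role': 'user', 'content': (
        'TRANSCRIPT: "Oh, great. Another meeting."\n'
        'PROSODY: pitch range 180 Hz, pre-utterance pause 320 ms\n'
        'PITCH CONTOUR: exaggerated rise-fall peaking on "great"'
    )},
]

# The single completion the policy emits; only this shape earns format reward:
completion = (
    '<think>Positive words delivered with exaggerated melody and a long '
    'pre-pause: a prosodic-lexical mismatch.</think>'
    '<final>{"label":"sarcastic","confidence":0.85}</final>'
)
```
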
  {
  "cell_type": "markdown",
  "metadata": {},
- "source": [
- "## 1. Install dependencies\n",
- "\n",
- "Replace `aamrinder` with your HF username after pushing the env to a Space."
- ]
  },
  {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
- "source": [
- "!pip install -q --upgrade unsloth \"trl>=0.11\" \"transformers>=4.46\" peft datasets accelerate matplotlib\n",
- "!pip install -q git+https://huggingface.co/spaces/aamrinder/subtext-arena\n",
- "import torch\n",
- "print('CUDA:', torch.cuda.is_available(), '|', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU only')"
- ]
  },
  {
  "cell_type": "markdown",
  "metadata": {},
- "source": [
- "## 2. Load Qwen2.5-3B-Instruct with Unsloth (4-bit + LoRA)"
- ]
  },
  {
  "cell_type": "code",
@@ -56,29 +28,25 @@
  "metadata": {},
  "outputs": [],
  "source": [
- "from unsloth import FastLanguageModel\n",
- "\n",
- "model, tokenizer = FastLanguageModel.from_pretrained(\n",
- " model_name='unsloth/Qwen2.5-3B-Instruct',\n",
- " max_seq_length=4096,\n",
- " load_in_4bit=True,\n",
- ")\n",
- "model = FastLanguageModel.get_peft_model(\n",
- " model,\n",
- " r=16, lora_alpha=16,\n",
- " target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'],\n",
- " use_gradient_checkpointing='unsloth',\n",
- ")"
  ]
  },
  {
  "cell_type": "markdown",
  "metadata": {},
- "source": [
- "## 3. Build the training dataset\n",
- "\n",
- "Each row is one MUStARD clip's full briefing (transcript + prosody summary + pitch contour) wrapped as a chat prompt. The Pivot Set is oversampled 3×."
- ]
  },
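
A minimal sketch of the 3× Pivot Set oversampling described in the cell above, assuming `scenarios` maps clip ids to dicts with an `is_pivot` flag; the real sampling lives inside `build_dataset`, and this helper name is made up:

```python
import random

def sample_training_clips(scenarios, n_rows=600, seed=0):
    """Sample clip ids for training rows, with Pivot clips weighted 3x."""
    rng = random.Random(seed)
    pool = []
    for clip_id, sc in scenarios.items():
        # Pivot clips enter the sampling pool three times, everything else once.
        pool.extend([clip_id] * (3 if sc.get('is_pivot') else 1))
    return [rng.choice(pool) for _ in range(n_rows)]
```
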
  {
  "cell_type": "code",
@@ -86,36 +54,25 @@
  "metadata": {},
  "outputs": [],
  "source": [
- "# Reuses the env's audio_tools + scenarios same prompt format that an\n",
90
- "# interactive agent would see if it called all the tools in sequence.\n",
- "from train.train_grpo import (\n",
- " SYSTEM_PROMPT, build_full_observation, build_dataset,\n",
- " parse_final, reasoning_length_score, make_reward_fn,\n",
  ")\n",
- "from server.scenarios import load_scenarios\n",
  "\n",
  "scenarios = load_scenarios()\n",
- "n_pivot = sum(1 for s in scenarios.values() if s.get('is_pivot'))\n",
- "print(f'Loaded {len(scenarios)} clips ({n_pivot} marked Pivot Set)')\n",
  "\n",
- "ds = build_dataset(scenarios, n_rows=600, seed=0)\n",
- "print(f'Built {len(ds)} training prompts. Sample row:')\n",
- "print(ds[0])"
  ]
  },
  {
  "cell_type": "markdown",
  "metadata": {},
- "source": [
- "## 4. Reward function — single scalar per completion\n",
- "\n",
- "Parses `<final>{label, confidence}</final>` from the completion, scores against the gold label from the dataset row.\n",
- "\n",
- "Reward components (all in [0, 1]):\n",
- "- **correctness** (weight 0.70): `0.5 + 0.5 × confidence` if label matches gold, `0.5 - 0.5 × confidence` if wrong, `0.0` if no valid `<final>` tag.\n",
- "- **reasoning_length** (weight 0.15): incentivizes 50-150-word `<think>` blocks; penalizes <30 (lazy) and >300 (rambling).\n",
- "- **format** (weight 0.15): 1.0 if `<final>` tag has parseable JSON with valid label, else 0."
- ]
  },
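
For reference, a hedged re-implementation of the scoring the cell above describes. The real logic is `parse_final`, `reasoning_length_score`, and `make_reward_fn` in `train/train_grpo.py`; the 0.5 shoulder for `<think>` lengths between the ideal band and the hard cutoffs, the confidence clamping, and the missing-confidence default are assumptions of this sketch:

```python
import json, re

def parse_final_sketch(text):
    # Pull the JSON payload out of a <final>...</final> tag, if any.
    m = re.search(r'<final>\s*(\{.*?\})\s*</final>', text, re.DOTALL)
    if not m:
        return None
    try:
        obj = json.loads(m.group(1))
    except json.JSONDecodeError:
        return None
    if obj.get('label') not in ('sarcastic', 'sincere'):
        return None
    conf = float(obj.get('confidence', 0.0))  # missing confidence -> 0.0 (assumption)
    return obj['label'], max(0.0, min(1.0, conf))

def reward_sketch(completion_text, gold):
    parsed = parse_final_sketch(completion_text)
    fmt = 1.0 if parsed else 0.0          # format: parseable JSON with a valid label
    if parsed is None:
        correctness = 0.0                 # no valid <final> tag
    else:
        label, conf = parsed
        correctness = 0.5 + 0.5 * conf if label == gold else 0.5 - 0.5 * conf
    think = re.search(r'<think>(.*?)</think>', completion_text, re.DOTALL)
    n_words = len(think.group(1).split()) if think else 0
    if 50 <= n_words <= 150:
        length = 1.0                      # ideal band
    elif n_words < 30 or n_words > 300:
        length = 0.0                      # lazy or rambling
    else:
        length = 0.5                      # shoulder value: an assumption, not from the script
    return 0.70 * correctness + 0.15 * length + 0.15 * fmt

demo = '<think>' + 'word ' * 60 + '</think><final>{"label":"sarcastic","confidence":0.9}</final>'
print(reward_sketch(demo, 'sarcastic'))  # 0.70*0.95 + 0.15 + 0.15 = 0.965
```
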
  {
  "cell_type": "code",
@@ -124,22 +81,18 @@
  "outputs": [],
  "source": [
  "reward_fn = make_reward_fn()\n",
- "\n",
- "# Sanity check on synthetic completions\n",
- "fake_completions = [\n",
- " [{'role':'assistant','content': '<think>Pitch HIGH, pre-pause 320ms, positive lexical content with exaggerated melody — classic sarcasm signature.</think><final>{\"label\":\"sarcastic\",\"confidence\":0.85}</final>'}],\n",
- " [{'role':'assistant','content': '<think>Flat affect on neutral content, low pitch variability.</think><final>{\"label\":\"sincere\",\"confidence\":0.65}</final>'}],\n",
  "]\n",
- "rewards = reward_fn(prompts=None, completions=fake_completions, gold=['sarcastic','sincere'])\n",
- "print('Synthetic rewards:', rewards) # should be high for both"
  ]
  },
  {
  "cell_type": "markdown",
  "metadata": {},
- "source": [
- "## 5. Run GRPO training (200 steps, ~12 h on T4-medium)"
- ]
  },
  {
  "cell_type": "code",
@@ -147,41 +100,48 @@
  "metadata": {},
  "outputs": [],
  "source": [
- "from trl import GRPOTrainer, GRPOConfig\n",
  "\n",
- "trainer = GRPOTrainer(\n",
- " model=model,\n",
- " reward_funcs=reward_fn,\n",
- " args=GRPOConfig(\n",
- " output_dir='./checkpoints/run1',\n",
- " num_generations=4,\n",
- " max_completion_length=768,\n",
- " per_device_train_batch_size=1,\n",
- " learning_rate=5e-6,\n",
- " max_steps=200,\n",
- " logging_steps=1,\n",
- " save_steps=50,\n",
- " save_total_limit=4,\n",
- " bf16=True,\n",
- " report_to='none',\n",
- " gradient_checkpointing=True,\n",
- " ),\n",
- " train_dataset=ds,\n",
- " processing_class=tokenizer,\n",
  ")\n",
- "trainer.train()\n",
- "trainer.save_model('./checkpoints/run1')\n",
- "print('checkpoint saved')"
  ]
  },
  {
  "cell_type": "markdown",
  "metadata": {},
- "source": [
- "## 6. Eval on the Prosody-Pivot Set\n",
- "\n",
- "Headline number: `X / 50` clips correct (per-clip majority across 3 seeds). Run BOTH the trained checkpoint and the base model — the delta is your story."
- ]
  },
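
A sketch of the per-clip majority vote behind the `X / 50` headline, with `predict` standing in for one sampled generation plus `parse_final`:

```python
from collections import Counter

def majority_label(clip_id, predict, seeds=(0, 1, 2)):
    # One sampled prediction per seed; the most common label wins.
    votes = Counter(predict(clip_id, seed=s) for s in seeds)
    return votes.most_common(1)[0][0]  # e.g. 2x 'sarcastic', 1x 'sincere' -> 'sarcastic'
```
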
  {
  "cell_type": "code",
@@ -189,41 +149,124 @@
  "metadata": {},
  "outputs": [],
  "source": [
- "# After training:\n",
- "# !python train/eval_pivot_set.py --checkpoint baseline-only --pivot data/pivot_set.json --out docs/plots/pivot_baseline.json\n",
- "# !python train/eval_pivot_set.py --checkpoint ./checkpoints/run1 --pivot data/pivot_set.json --out docs/plots/pivot_trained.json\n",
  "\n",
- "import json\n",
- "pivot = json.load(open('subtext_arena/data/pivot_set.json'))\n",
- "print(f'Pivot Set size: {len(pivot[\"clip_ids\"])} clips')\n",
- "print('Method:', pivot.get('method'))"
  ]
  },
  {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "## 7. Plot the reward decomposition\n",
- "\n",
- "The killer chart: 3 colored lines (correctness, reasoning_length, format) climbing at different rates over training steps. This is the visual proof judges look for."
  ]
  },
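
A hypothetical version of that plot, in the spirit of `train/plot_reward_decomp.py`: read the checkpoint's `trainer_state.json` log history and draw one line per component. The per-component key names here are assumptions, not verified TRL log fields:

```python
import json
import matplotlib.pyplot as plt

state = json.load(open('./checkpoints/run1/trainer_state.json'))
for key, label in [('rewards/correctness', 'correctness'),
                   ('rewards/reasoning_length', 'reasoning_length'),
                   ('rewards/format', 'format')]:
    # Collect (step, value) pairs for entries that logged this component.
    points = [(e['step'], e[key]) for e in state['log_history'] if key in e]
    if points:
        steps, vals = zip(*points)
        plt.plot(steps, vals, label=label)
plt.xlabel('GRPO step'); plt.ylabel('component reward'); plt.legend()
plt.savefig('docs/plots/reward_decomposition.png', dpi=150)
```
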
  {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
- "# !python train/plot_reward_decomp.py --log-jsonl ./checkpoints/run1/trainer_state.json --out docs/plots/reward_decomposition.png\n",
- "from IPython.display import Image\n",
- "# Image('docs/plots/reward_decomposition.png')"
  ]
  }
  ],
  "metadata": {
- "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
- "language_info": {"name": "python", "version": "3.11"}
  },
  "nbformat": 4,
  "nbformat_minor": 5
- }
 
  {
  "cell_type": "markdown",
  "metadata": {},
+ "source": "# Subtext Arena — GRPO training\n\nThis is the actual training that produced the README numbers — reward 0.33 → 0.97 on training, 51% on the broad held-out set, 5/6 on the Pivot Set.\n\nThe notebook imports functions straight from `subtext_arena.train.train_grpo` so it stays in sync with the script. Set the config below, run all cells. Around 2 hours on a Colab L4, ~$1.60."
  },
  {
  "cell_type": "markdown",
  "metadata": {},
+ "source": "## Install"
  },
  {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
+ "source": "!pip install -q \"trl>=0.11\" \"transformers>=4.46\" peft datasets accelerate bitsandbytes huggingface_hub\n!pip install -q git+https://huggingface.co/spaces/aamrinder/subtext-arena\nimport torch\nprint('CUDA:', torch.cuda.is_available(), '|', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU only')"
  },
  {
  "cell_type": "markdown",
  "metadata": {},
+ "source": "## Config\n\nSame args as the CLI script. Tweak `MAX_STEPS` if you're on a smaller GPU."
  },
  {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
29
  "outputs": [],
30
  "source": [
31
+ "MODEL = 'Qwen/Qwen2.5-3B-Instruct'\n",
32
+ "OUTPUT_DIR = './checkpoints/run1'\n",
33
+ "MAX_STEPS = 200\n",
34
+ "NUM_GENERATIONS = 4\n",
35
+ "PER_DEVICE_BATCH = 4 # must be divisible by NUM_GENERATIONS\n",
36
+ "LEARNING_RATE = 5e-6\n",
37
+ "MAX_COMPLETION_LEN = 768\n",
38
+ "LORA_R = 16\n",
39
+ "LORA_DROPOUT = 0.05\n",
40
+ "N_TRAIN_ROWS = 600\n",
41
+ "EVAL_RATIO = 0.2\n",
42
+ "N_EVAL_CLIPS = 80\n",
43
+ "PUSH_TO_HUB = None # e.g. 'your-username/subtext-arena-grpo' (or None to skip)"
44
  ]
45
  },
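
The `PER_DEVICE_BATCH` comment above matters because GRPO computes its group-relative baseline over the `NUM_GENERATIONS` completions sampled for each prompt, so a device batch must hold whole groups. A tiny guard you could add after the config cell:

```python
# Each group of NUM_GENERATIONS completions shares one prompt and one baseline,
# so the batch must split into full groups.
assert PER_DEVICE_BATCH % NUM_GENERATIONS == 0, 'batch must split into full generation groups'
print(f'{PER_DEVICE_BATCH // NUM_GENERATIONS} prompt group(s) per device batch')
```
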
  {
  "cell_type": "markdown",
  "metadata": {},
+ "source": "## Load data + train/eval split\n\nEval clips never appear in training. Split is seeded so it's reproducible."
  },
  {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
55
  "outputs": [],
56
  "source": [
57
+ "from subtext_arena.server.scenarios import load_scenarios\n",
58
+ "from subtext_arena.train.train_grpo import (\n",
59
+ " SYSTEM_PROMPT, build_full_observation, split_clip_ids, build_dataset,\n",
60
+ " parse_final, reasoning_length_score, make_reward_fn, reward_decomposition,\n",
 
61
  ")\n",
 
62
  "\n",
63
  "scenarios = load_scenarios()\n",
64
+ "print(f'Loaded {len(scenarios)} clips ({sum(1 for s in scenarios.values() if s.get(\"is_pivot\"))} marked Pivot)')\n",
 
65
  "\n",
66
+ "train_ids, eval_ids = split_clip_ids(scenarios, eval_ratio=EVAL_RATIO, seed=42)\n",
67
+ "dataset = build_dataset(scenarios, n_rows=N_TRAIN_ROWS, allowed_clip_ids=train_ids)\n",
68
+ "print(f'{len(dataset)} train prompt rows from {len(train_ids)} unique train clips')\n",
69
+ "print('Sample prompt:', dataset[0]['prompt'][1]['content'][:300], '...')"
70
  ]
71
  },
72
  {
73
  "cell_type": "markdown",
74
  "metadata": {},
75
+ "source": "## Reward sanity check\n\nQuick check that the reward function scores synthetic completions correctly before burning GPU time."
 
 
 
 
 
 
 
 
 
76
  },
77
  {
78
  "cell_type": "code",
 
81
  "outputs": [],
82
  "source": [
83
  "reward_fn = make_reward_fn()\n",
84
+ "fake = [\n",
85
+ " [{'role':'assistant','content': '<think>Pitch range is 180Hz — wide for a single line. Pre-utterance silence is 320ms which suggests deliberate emphasis. The literal words are positive but the prosodic delivery is exaggerated, classic sarcasm signature in TV dialogue.</think><final>{\"label\":\"sarcastic\",\"confidence\":0.85}</final>'}],\n",
86
+ " [{'role':'assistant','content': '<think>Flat affect, narrow pitch range, no internal pauses. Content is neutral and matches delivery. No prosodic-lexical mismatch.</think><final>{\"label\":\"sincere\",\"confidence\":0.65}</final>'}],\n",
 
 
87
  "]\n",
88
+ "rewards = reward_fn(prompts=None, completions=fake, gold=['sarcastic','sincere'])\n",
89
+ "print('Synthetic rewards:', rewards) # both should be > 0.85"
90
  ]
91
  },
92
  {
93
  "cell_type": "markdown",
94
  "metadata": {},
95
+ "source": "## Load Qwen2.5-3B in 4-bit + LoRA"
 
 
96
  },
97
  {
98
  "cell_type": "code",
 
100
  "metadata": {},
101
  "outputs": [],
102
  "source": [
103
+ "import torch as _t\n",
104
+ "from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig\n",
105
+ "from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training\n",
106
  "\n",
107
+ "bnb = BitsAndBytesConfig(\n",
108
+ " load_in_4bit=True,\n",
109
+ " bnb_4bit_compute_dtype=_t.bfloat16,\n",
110
+ " bnb_4bit_quant_type='nf4',\n",
111
+ " bnb_4bit_use_double_quant=True,\n",
112
+ ")\n",
113
+ "tokenizer = AutoTokenizer.from_pretrained(MODEL)\n",
114
+ "if tokenizer.pad_token is None:\n",
115
+ " tokenizer.pad_token = tokenizer.eos_token\n",
116
+ "base = AutoModelForCausalLM.from_pretrained(\n",
117
+ " MODEL, quantization_config=bnb, dtype=_t.bfloat16, device_map='auto',\n",
 
 
 
 
 
 
 
 
118
  ")\n",
119
+ "base = prepare_model_for_kbit_training(base, use_gradient_checkpointing=True)\n",
120
+ "peft_config = LoraConfig(\n",
121
+ " r=LORA_R, lora_alpha=LORA_R, lora_dropout=LORA_DROPOUT, bias='none',\n",
122
+ " target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'],\n",
123
+ " task_type='CAUSAL_LM',\n",
124
+ ")\n",
125
+ "model = get_peft_model(base, peft_config)\n",
126
+ "model.print_trainable_parameters()"
127
  ]
128
  },
  {
  "cell_type": "markdown",
  "metadata": {},
+ "source": "## Train"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": "import os, json\nfrom pathlib import Path\nfrom trl import GRPOTrainer, GRPOConfig\n\nconfig = GRPOConfig(\n output_dir=OUTPUT_DIR,\n num_generations=NUM_GENERATIONS,\n max_completion_length=MAX_COMPLETION_LEN,\n per_device_train_batch_size=PER_DEVICE_BATCH,\n learning_rate=LEARNING_RATE,\n max_steps=MAX_STEPS,\n logging_steps=1,\n save_steps=50,\n save_total_limit=4,\n bf16=True,\n report_to=('wandb' if os.environ.get('WANDB_API_KEY') else 'none'),\n gradient_checkpointing=True,\n)\ntrainer = GRPOTrainer(\n model=model,\n reward_funcs=make_reward_fn(),\n args=config,\n train_dataset=dataset,\n processing_class=tokenizer,\n)\ntrainer.train()\n\ntrainer.save_state()\ntrainer.save_model(OUTPUT_DIR)\nPath(OUTPUT_DIR, 'log_history.json').write_text(json.dumps(trainer.state.log_history, indent=2))\nprint(f'saved to {OUTPUT_DIR}')"
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": "## Held-out eval\n\nGreedy decoding on 80 unseen clips. Run #3 landed at 51% broad accuracy and 5/6 on the Pivot subset."
  },
  {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
150
  "outputs": [],
151
  "source": [
152
+ "model.eval()\n",
153
+ "if hasattr(model, 'gradient_checkpointing_disable'):\n",
154
+ " try: model.gradient_checkpointing_disable()\n",
155
+ " except Exception: pass\n",
156
  "\n",
157
+ "eval_clip_ids = sorted(eval_ids)[:N_EVAL_CLIPS]\n",
158
+ "results, eval_rewards = [], []\n",
159
+ "n_correct = n_well_formed = 0\n",
160
+ "for i, cid in enumerate(eval_clip_ids):\n",
161
+ " sc = scenarios[cid]\n",
162
+ " gold = 'sarcastic' if sc['sarcasm'] else 'sincere'\n",
163
+ " messages = [\n",
164
+ " {'role': 'system', 'content': SYSTEM_PROMPT},\n",
165
+ " {'role': 'user', 'content': build_full_observation(cid, scenarios)},\n",
166
+ " ]\n",
167
+ " encoded = tokenizer.apply_chat_template(messages, return_tensors='pt', add_generation_prompt=True)\n",
168
+ " input_ids = (encoded.input_ids if hasattr(encoded, 'input_ids') else encoded).to(model.device)\n",
169
+ " prompt_len = input_ids.shape[1]\n",
170
+ " with _t.no_grad():\n",
171
+ " out = model.generate(\n",
172
+ " input_ids=input_ids, max_new_tokens=MAX_COMPLETION_LEN,\n",
173
+ " do_sample=False, pad_token_id=tokenizer.eos_token_id, use_cache=True,\n",
174
+ " )\n",
175
+ " text = tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True)\n",
176
+ " decomp = reward_decomposition(text, gold)\n",
177
+ " results.append({\n",
178
+ " 'clip_id': cid, 'gold': gold, 'is_pivot': bool(sc.get('is_pivot')),\n",
179
+ " 'predicted': decomp['_predicted'], 'confidence': decomp['_confidence'],\n",
180
+ " 'correct': decomp['_correct'], 'well_formed': decomp['_well_formed'],\n",
181
+ " 'reward_total': decomp['_total'], 'completion_text': text[:1500],\n",
182
+ " })\n",
183
+ " eval_rewards.append(decomp['_total'])\n",
184
+ " if decomp['_correct']: n_correct += 1\n",
185
+ " if decomp['_well_formed']: n_well_formed += 1\n",
186
+ " if (i + 1) % 20 == 0:\n",
187
+ " print(f' [{i+1}/{len(eval_clip_ids)}] running mean reward = {sum(eval_rewards)/len(eval_rewards):.3f}, '\n",
188
+ " f'correct so far = {n_correct}/{i+1}', flush=True)\n",
189
+ "\n",
190
+ "n_pivot = sum(1 for r in results if r['is_pivot'])\n",
191
+ "n_pivot_correct = sum(1 for r in results if r['is_pivot'] and r['correct'])\n",
192
+ "summary = {\n",
193
+ " 'n_eval_clips': len(eval_clip_ids),\n",
194
+ " 'mean_reward': sum(eval_rewards) / max(1, len(eval_rewards)),\n",
195
+ " 'well_formed_rate': n_well_formed / max(1, len(eval_clip_ids)),\n",
196
+ " 'accuracy': n_correct / max(1, len(eval_clip_ids)),\n",
197
+ " 'pivot_in_eval': n_pivot,\n",
198
+ " 'pivot_correct': n_pivot_correct,\n",
199
+ " 'results': results,\n",
200
+ "}\n",
201
+ "Path(OUTPUT_DIR, 'held_out_eval.json').write_text(json.dumps(summary, indent=2))\n",
202
+ "print(f\"\\nHELD-OUT: mean_reward={summary['mean_reward']:.3f}, accuracy={summary['accuracy']:.2%} ({n_correct}/{len(eval_clip_ids)})\")\n",
203
+ "print(f\" pivot accuracy: {n_pivot_correct}/{n_pivot}\")"
204
  ]
  },
  {
  "cell_type": "markdown",
  "metadata": {},
+ "source": "## Reward curve"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
  "source": [
+ "import matplotlib.pyplot as plt\n",
+ "log = json.loads(Path(OUTPUT_DIR, 'log_history.json').read_text())\n",
+ "steps = [e['step'] for e in log if 'reward' in e]\n",
+ "rewards = [e['reward'] for e in log if 'reward' in e]\n",
+ "plt.figure(figsize=(8, 4))\n",
+ "plt.plot(steps, rewards, alpha=0.4, label='per-step')\n",
+ "if len(rewards) >= 10:\n",
+ " import numpy as np\n",
+ " ema = np.array(rewards)\n",
+ " for i in range(1, len(ema)):\n",
+ " ema[i] = 0.9 * ema[i-1] + 0.1 * ema[i]\n",
+ " plt.plot(steps, ema, linewidth=2, label='EMA(0.9)')\n",
+ "plt.xlabel('GRPO step'); plt.ylabel('reward'); plt.legend(); plt.grid(alpha=0.3)\n",
+ "plt.title('Subtext Arena — GRPO training reward')\n",
+ "plt.tight_layout(); plt.show()"
  ]
  },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": "## Push to Hub (optional)\n\nSet `PUSH_TO_HUB` at the top, then `huggingface-cli login` first or set `HF_TOKEN` in Colab secrets."
+ },
  {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
+ "if PUSH_TO_HUB:\n",
+ " from huggingface_hub import HfApi\n",
+ " api = HfApi()\n",
+ " api.create_repo(repo_id=PUSH_TO_HUB, repo_type='model', exist_ok=True)\n",
+ " api.upload_folder(\n",
+ " folder_path=OUTPUT_DIR, repo_id=PUSH_TO_HUB, repo_type='model',\n",
+ " commit_message=f'GRPO ({MAX_STEPS} steps, lr={LEARNING_RATE})',\n",
+ " )\n",
+ " print(f'LoRA pushed to https://huggingface.co/{PUSH_TO_HUB}')\n",
+ "else:\n",
+ " print('PUSH_TO_HUB is None — skipping')"
  ]
  }
  ],
  "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "name": "python",
+ "version": "3.10"
+ }
  },
  "nbformat": 4,
  "nbformat_minor": 5
+ }