Spaces: ycwhencpp/final-iteration

vaibhav12332112312 committed 12 days ago

Commit: e82b235
Parent: 3419724

Strip heatmap leak from prompt; let model discover peak hours via tools

- Remove explicit Mon..Sun peak-hour table from system prompt
- Drop "today's peak hours=..." from format_obs
- Compress two-phase + posting rules to essentials
- Forces model to learn timing via query_audience/query_trends

Made-with: Cursor
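
The effect of the `format_obs` change can be illustrated with a small leak check. This is a minimal sketch, not code from the notebook: `format_obs_header` is a simplified stand-in that reproduces only the first three f-string lines visible in the diff, `leaks_peak_hours` is a hypothetical helper, and the observation values are made up for illustration.

```python
from types import SimpleNamespace

_DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def format_obs_header(obs):
    """Simplified stand-in for the header lines of format_obs after this commit."""
    day_name = _DAY_NAMES[obs.day_of_week] if 0 <= obs.day_of_week < 7 else "?"
    return (f"Day: {day_name} | days_elapsed={obs.days_elapsed}\n"
            f"Energy: {obs.creator_energy:.2f} | Followers: {obs.follower_count}\n"
            f"Engagement: {obs.engagement_rate:.3f} | Queue: {obs.content_queue_size}\n")


def leaks_peak_hours(text):
    """Hypothetical check: flag text that still carries the removed heatmap hints."""
    lowered = text.lower()
    return "peak hours=" in lowered or "heatmap peak hours" in lowered


# Made-up observation; only the attribute names come from the diff.
obs = SimpleNamespace(day_of_week=2, days_elapsed=9, creator_energy=0.8,
                      follower_count=1200, engagement_rate=0.0412,
                      content_queue_size=1)
header = format_obs_header(obs)
old_header = "Day: Mon | days_elapsed=0 | today's peak hours=[14, 18, 19]\n"

print(leaks_peak_hours(header))      # new format: no heatmap hint
print(leaks_peak_hours(old_header))  # old format: hint present
```

A check along these lines could run against the real `SYSTEM_PROMPT` and `format_obs` output to guard against the hint being reintroduced. The new posting rule still mentions "the audience's peak hours" without listing them, so the check deliberately matches only the removed phrasings.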

Files changed (1)

training/train_grpo.ipynb +60 -78

training/train_grpo.ipynb

@@ -25,7 +25,9 @@
     },
     {
       "cell_type": "code",
       "metadata": {},
       "source": [
         "# Cell 1: Install dependencies (quote versions — zsh treats `>` as redirect otherwise)\n",
         "!pip install -q torch torchvision torchaudio\n",
@@ -34,13 +36,13 @@
         "!pip install -q \"typing_extensions>=4.13.0\" pydantic httpx\n",
         "!pip install -q \"openenv-core[core]>=0.2.2\"\n",
         "!pip install -q flash-attn --no-build-isolation || echo \"flash-attn install skipped; will use sdpa\""
-      ],
-      "execution_count": null,
-      "outputs": []
     },
     {
       "cell_type": "code",
       "metadata": {},
       "source": [
         "# Cell 2: Resolve repo path (Colab: fresh clone. Local: auto-detect project root)\n",
         "import os\n",
@@ -116,13 +118,13 @@
         "print(f\"Branch: {REPO_BRANCH}\")\n",
         "print(f\"Commit: {commit}\")\n",
         "print(f\"Plots dir: {PLOTS_DIR}\")"
-      ],
-      "execution_count": null,
-      "outputs": []
     },
     {
       "cell_type": "code",
       "metadata": {},
       "source": [
         "# Cell 3: Imports (with runtime validation)\n",
         "import json, random, time, textwrap, copy, os, sys\n",
@@ -176,9 +178,7 @@
         "import ast\n",
         "ast.parse(\"def _t(x: int) -> str: return f'{x}'\")\n",
         "print(\"OK: ast.parse (syntax check)\")"
-      ],
-      "execution_count": null,
-      "outputs": []
     },
     {
       "cell_type": "markdown",
@@ -191,7 +191,9 @@
     },
     {
       "cell_type": "code",
       "metadata": {},
       "source": [
         "# Cell 4: Define heuristic agents + episode runner\n",
         "_rng = random.Random(42)\n",
@@ -267,13 +269,13 @@
         "            \"rewards\": rewards, \"energies\": energies}\n",
         "\n",
         "print(\"Agents and episode runner defined.\")"
-      ],
-      "execution_count": null,
-      "outputs": []
     },
     {
       "cell_type": "code",
       "metadata": {},
       "source": [
         "# Cell 5: Run baselines (safe)\n",
         "print(\"Running heuristic baselines (5 agents × 3 tasks)...\")\n",
@@ -308,13 +310,13 @@
         "for name in BASELINE_AGENTS:\n",
         "    scores = [baseline_results[name][t][\"grader_score\"] for t in TASKS]\n",
         "    print(f\"{name:<14s} {scores[0]:>10.4f} {scores[1]:>12.4f} {scores[2]:>14.4f} {sum(scores)/3:>8.4f}\")"
-      ],
-      "execution_count": null,
-      "outputs": []
     },
     {
       "cell_type": "code",
       "metadata": {},
       "source": [
         "# Cell 6: Baseline plots\n",
         "fig, axes = plt.subplots(1, 3, figsize=(16, 5), sharey=True)\n",
@@ -332,9 +334,7 @@
         "fig.tight_layout()\n",
         "fig.savefig(f\"{PLOTS_DIR}/baseline_leaderboard.png\", dpi=150, bbox_inches='tight')\n",
         "plt.show()"
-      ],
-      "execution_count": null,
-      "outputs": []
     },
     {
       "cell_type": "markdown",
@@ -347,7 +347,9 @@
     },
     {
       "cell_type": "code",
       "metadata": {},
       "source": [
         "# Cell 7: Load model (Qwen2.5-3B bf16 on CUDA + flash-attn-2; fp16/fp32 fallback)\n",
         "from transformers import AutoTokenizer, AutoModelForCausalLM\n",
@@ -391,13 +393,13 @@
         "print(f\"Model loaded. dtype={next(model.parameters()).dtype} device={next(model.parameters()).device}\")\n",
         "if torch.cuda.is_available():\n",
         "    print(f\"CUDA memory: {torch.cuda.memory_allocated()/1e9:.2f} GB\")"
-      ],
-      "execution_count": null,
-      "outputs": []
     },
     {
       "cell_type": "code",
       "metadata": {},
       "source": [
         "# Cell 8: LLM agent functions\n",
         "_SYSTEM_BASE = textwrap.dedent(\"\"\"\\\n",
@@ -439,38 +441,21 @@
         "- topic:        free-form string\n",
         "- empty scheduled_actions = full day rest\n",
         "\n",
-        "POSTING RULES (critical — only `post` actions earn engagement reward):\n",
-        "- EVERY active day MUST schedule at least 2 `post` actions (max 3). `create_content`\n",
-        "  alone gives 0 reward — content stays in queue. Mix in 0-1 `create_content` only\n",
-        "  if the queue is empty.\n",
-        "- Schedule posts at HEATMAP PEAK HOURS (Buffer/Sprout-derived):\n",
-        "    Mon  peaks 14, 18, 19      Tue  peaks 14, 15, 19\n",
-        "    Wed  peaks 13, 14, 18      Thu  peaks 12, 13, 19\n",
-        "    Fri  peaks 12, 13, 22      Sat  peaks 21, 22, 13\n",
-        "    Sun  peaks 21, 22, 11\n",
-        "- Vary `intent` across the day; rotate `content_type` to avoid fatigue.\n",
-        "- Reuse strong tags from the Recent-days summary (those that earned reward).\"\"\")\n",
         "\n",
         "SYSTEM_PROMPT = _SYSTEM_BASE + textwrap.dedent(\"\"\"\n",
         "\n",
-        "TWO-PHASE FLOW (each day has two turns — same observation, two responses):\n",
-        "PHASE A — DISCOVERY: respond with {\"tool_calls\": [...]} only. Tools cost nothing,\n",
-        "  call as many query_* / predict_engagement / draft_review as useful. Their results\n",
-        "  are dispatched immediately and shown to you in PHASE B of the SAME day.\n",
-        "PHASE B — PLANNING: respond with {\"scheduled_actions\": [...], \"notes\": \"...\"}\n",
-        "  using the freshly returned Tool results.\n",
-        "Audience peak hours, segment affinities, trends, competitor schedules are NOT in\n",
-        "the observation — discover them in PHASE A. Useful PHASE-A starter set:\n",
-        "  query_trends(niche), query_audience(segment_id), query_creator_pool(),\n",
-        "  query_competitor(competitor_id, window_days), and on later days also\n",
-        "  predict_engagement(scheduled_actions=[...candidate plan...]).\"\"\")\n",
         "SYSTEM_PROMPT_EVAL = SYSTEM_PROMPT\n",
         "SYSTEM_PROMPT_TRAIN = SYSTEM_PROMPT\n",
         "\n",
         "\n",
         "_DAY_NAMES = [\"Mon\", \"Tue\", \"Wed\", \"Thu\", \"Fri\", \"Sat\", \"Sun\"]\n",
-        "_PEAK_HOURS = {0:[14,18,19], 1:[14,15,19], 2:[13,14,18], 3:[12,13,19],\n",
-        "               4:[12,13,22], 5:[21,22,13], 6:[21,22,11]}\n",
         "\n",
         "\n",
         "def _format_history(history, k=3):\n",
@@ -489,7 +474,6 @@
         "\n",
         "def format_obs(obs, history=None):\n",
         "    day_name = _DAY_NAMES[obs.day_of_week] if 0 <= obs.day_of_week < 7 else \"?\"\n",
-        "    peaks = _PEAK_HOURS.get(obs.day_of_week, [12, 18, 20])\n",
         "    signals_str = \"\"\n",
         "    signals = getattr(obs, \"engagement_signals\", None)\n",
         "    if signals:\n",
@@ -502,7 +486,7 @@
         "            tool_str += f\"  {tr.name}: {json.dumps(tr.data)}\\n\"\n",
         "    if not tool_str:\n",
         "        tool_str = \"  (none — call query_* tools to discover)\\n\"\n",
-        "    return (f\"Day: {day_name} | days_elapsed={obs.days_elapsed} | today's peak hours={peaks}\\n\"\n",
         "            f\"Energy: {obs.creator_energy:.2f} | Followers: {obs.follower_count}\\n\"\n",
         "            f\"Engagement: {obs.engagement_rate:.3f} | Queue: {obs.content_queue_size}\\n\"\n",
         "            f\"{signals_str}\"\n",
@@ -732,9 +716,7 @@
         "\n",
         "\n",
         "print(\"LLM agent functions defined (batched).\")"
-      ],
-      "execution_count": null,
-      "outputs": []
     },
     {
       "cell_type": "markdown",
@@ -747,7 +729,9 @@
     },
     {
       "cell_type": "code",
       "metadata": {},
       "source": [
         "# Cell 9: Run untrained model (batched: all 3 tasks in parallel envs)\n",
         "print(\"Running UNTRAINED base model on all tasks (batched)...\")\n",
@@ -761,9 +745,7 @@
         "print(f\"BEFORE TRAINING (took {time.time()-t0:.1f}s):\")\n",
         "for t in TASKS:\n",
         "    print(f\"  {t}: grader={before_results[t]['grader_score']:.4f}\")"
-      ],
-      "execution_count": null,
-      "outputs": []
     },
     {
       "cell_type": "markdown",
@@ -782,7 +764,9 @@
     },
     {
       "cell_type": "code",
       "metadata": {},
       "source": [
         "# Cell 10: Attach LoRA adapter\n",
         "from peft import LoraConfig, get_peft_model, TaskType\n",
@@ -796,13 +780,13 @@
         "model.enable_input_require_grads()\n",
         "peft_model = get_peft_model(model, lora_config)\n",
         "peft_model.print_trainable_parameters()"
-      ],
-      "execution_count": null,
-      "outputs": []
     },
     {
       "cell_type": "code",
       "metadata": {},
       "source": [
         "# Cell 11: Training loop\n",
         "from trl import SFTTrainer, SFTConfig\n",
@@ -907,9 +891,7 @@
         "elapsed = time.time() - t_start\n",
         "print(f\"\\nTraining complete in {elapsed/60:.1f} min\")\n",
         "print(pd.DataFrame(training_log).to_string(index=False))"
-      ],
-      "execution_count": null,
-      "outputs": []
     },
     {
       "cell_type": "markdown",
@@ -922,7 +904,9 @@
     },
     {
       "cell_type": "code",
       "metadata": {},
       "source": [
         "# Cell 12: Run trained model (batched)\n",
         "print(\"Running TRAINED model on all tasks (batched)...\")\n",
@@ -937,9 +921,7 @@
         "print(f\"AFTER TRAINING (took {time.time()-t0:.1f}s):\")\n",
         "for t in TASKS:\n",
         "    print(f\"  {t}: grader={after_results[t]['grader_score']:.4f}\")"
-      ],
-      "execution_count": null,
-      "outputs": []
     },
     {
       "cell_type": "markdown",
@@ -950,7 +932,9 @@
     },
     {
       "cell_type": "code",
       "metadata": {},
       "source": [
         "# Cell 13: Training curves\n",
         "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
@@ -972,13 +956,13 @@
         "fig.tight_layout()\n",
         "fig.savefig(f'{PLOTS_DIR}/reward_curve.png', dpi=150, bbox_inches='tight')\n",
         "plt.show()"
-      ],
-      "execution_count": null,
-      "outputs": []
     },
     {
       "cell_type": "code",
       "metadata": {},
       "source": [
         "# Cell 14: Before vs After\n",
         "task_labels = [t.replace('monthly_', '').title() for t in TASKS]\n",
@@ -1008,13 +992,13 @@
         "fig.tight_layout()\n",
         "fig.savefig(f'{PLOTS_DIR}/before_after.png', dpi=150, bbox_inches='tight')\n",
         "plt.show()"
-      ],
-      "execution_count": null,
-      "outputs": []
     },
     {
       "cell_type": "code",
       "metadata": {},
       "source": [
         "# Cell 15: Trajectory comparison\n",
         "fig, axes = plt.subplots(2, 3, figsize=(16, 8))\n",
@@ -1038,9 +1022,7 @@
         "fig.tight_layout()\n",
         "fig.savefig(f'{PLOTS_DIR}/training_trajectories.png', dpi=150, bbox_inches='tight')\n",
         "plt.show()"
-      ],
-      "execution_count": null,
-      "outputs": []
     },
     {
       "cell_type": "markdown",
@@ -1051,7 +1033,9 @@
     },
     {
       "cell_type": "code",
       "metadata": {},
       "source": [
         "# Cell 16: Final summary\n",
         "print(\"=\" * 67)\n",
@@ -1088,13 +1072,13 @@
         "\n",
         "print(f\"\\nSaved to {PLOTS_DIR}/\")\n",
         "print(\"All results are from real LoRA weight updates on real environment runs.\")"
-      ],
-      "execution_count": null,
-      "outputs": []
     },
     {
       "cell_type": "code",
       "metadata": {},
       "source": [
         "# Cell 17: Save adapter\n",
         "save_path = \"./viraltest_trained_adapter\"\n",
@@ -1102,9 +1086,7 @@
         "tokenizer.save_pretrained(save_path)\n",
         "print(f\"LoRA adapter saved to {save_path}\")\n",
         "print(\"Load with: PeftModel.from_pretrained(base_model, save_path)\")"
-      ],
-      "execution_count": null,
-      "outputs": []
     }
   ],
   "metadata": {
@@ -1130,4 +1112,4 @@
   },
   "nbformat": 4,
   "nbformat_minor": 4
-}

     },
     {
       "cell_type": "code",
+      "execution_count": null,
       "metadata": {},
+      "outputs": [],
       "source": [
         "# Cell 1: Install dependencies (quote versions — zsh treats `>` as redirect otherwise)\n",
         "!pip install -q torch torchvision torchaudio\n",
         "!pip install -q \"typing_extensions>=4.13.0\" pydantic httpx\n",
         "!pip install -q \"openenv-core[core]>=0.2.2\"\n",
         "!pip install -q flash-attn --no-build-isolation || echo \"flash-attn install skipped; will use sdpa\""
+      ]
     },
     {
       "cell_type": "code",
+      "execution_count": null,
       "metadata": {},
+      "outputs": [],
       "source": [
         "# Cell 2: Resolve repo path (Colab: fresh clone. Local: auto-detect project root)\n",
         "import os\n",
         "print(f\"Branch: {REPO_BRANCH}\")\n",
         "print(f\"Commit: {commit}\")\n",
         "print(f\"Plots dir: {PLOTS_DIR}\")"
+      ]
     },
     {
       "cell_type": "code",
+      "execution_count": null,
       "metadata": {},
+      "outputs": [],
       "source": [
         "# Cell 3: Imports (with runtime validation)\n",
         "import json, random, time, textwrap, copy, os, sys\n",
         "import ast\n",
         "ast.parse(\"def _t(x: int) -> str: return f'{x}'\")\n",
         "print(\"OK: ast.parse (syntax check)\")"
+      ]
     },
     {
       "cell_type": "markdown",
     },
     {
       "cell_type": "code",
+      "execution_count": null,
       "metadata": {},
+      "outputs": [],
       "source": [
         "# Cell 4: Define heuristic agents + episode runner\n",
         "_rng = random.Random(42)\n",
         "            \"rewards\": rewards, \"energies\": energies}\n",
         "\n",
         "print(\"Agents and episode runner defined.\")"
+      ]
     },
     {
       "cell_type": "code",
+      "execution_count": null,
       "metadata": {},
+      "outputs": [],
       "source": [
         "# Cell 5: Run baselines (safe)\n",
         "print(\"Running heuristic baselines (5 agents × 3 tasks)...\")\n",
         "for name in BASELINE_AGENTS:\n",
         "    scores = [baseline_results[name][t][\"grader_score\"] for t in TASKS]\n",
         "    print(f\"{name:<14s} {scores[0]:>10.4f} {scores[1]:>12.4f} {scores[2]:>14.4f} {sum(scores)/3:>8.4f}\")"
+      ]
     },
     {
       "cell_type": "code",
+      "execution_count": null,
       "metadata": {},
+      "outputs": [],
       "source": [
         "# Cell 6: Baseline plots\n",
         "fig, axes = plt.subplots(1, 3, figsize=(16, 5), sharey=True)\n",
         "fig.tight_layout()\n",
         "fig.savefig(f\"{PLOTS_DIR}/baseline_leaderboard.png\", dpi=150, bbox_inches='tight')\n",
         "plt.show()"
+      ]
     },
     {
       "cell_type": "markdown",
     },
     {
       "cell_type": "code",
+      "execution_count": null,
       "metadata": {},
+      "outputs": [],
       "source": [
         "# Cell 7: Load model (Qwen2.5-3B bf16 on CUDA + flash-attn-2; fp16/fp32 fallback)\n",
         "from transformers import AutoTokenizer, AutoModelForCausalLM\n",
         "print(f\"Model loaded. dtype={next(model.parameters()).dtype} device={next(model.parameters()).device}\")\n",
         "if torch.cuda.is_available():\n",
         "    print(f\"CUDA memory: {torch.cuda.memory_allocated()/1e9:.2f} GB\")"
+      ]
     },
     {
       "cell_type": "code",
+      "execution_count": null,
       "metadata": {},
+      "outputs": [],
       "source": [
         "# Cell 8: LLM agent functions\n",
         "_SYSTEM_BASE = textwrap.dedent(\"\"\"\\\n",
         "- topic:        free-form string\n",
         "- empty scheduled_actions = full day rest\n",
         "\n",
+        "POSTING RULES:\n",
+        "- Each active day: 2-3 `post` actions at the audience's peak hours.\n",
+        "- `create_content` alone earns 0 reward.\n",
+        "- Vary `intent` and `content_type`.\"\"\")\n",
         "\n",
         "SYSTEM_PROMPT = _SYSTEM_BASE + textwrap.dedent(\"\"\"\n",
         "\n",
+        "TWO-PHASE FLOW per day (same observation, two responses):\n",
+        "PHASE A: respond with {\"tool_calls\": [...]} only.\n",
+        "PHASE B: respond with {\"scheduled_actions\": [...], \"notes\": \"...\"} using the tool results.\"\"\")\n",
         "SYSTEM_PROMPT_EVAL = SYSTEM_PROMPT\n",
         "SYSTEM_PROMPT_TRAIN = SYSTEM_PROMPT\n",
         "\n",
         "\n",
         "_DAY_NAMES = [\"Mon\", \"Tue\", \"Wed\", \"Thu\", \"Fri\", \"Sat\", \"Sun\"]\n",
         "\n",
         "\n",
         "def _format_history(history, k=3):\n",
         "\n",
         "def format_obs(obs, history=None):\n",
         "    day_name = _DAY_NAMES[obs.day_of_week] if 0 <= obs.day_of_week < 7 else \"?\"\n",
         "    signals_str = \"\"\n",
         "    signals = getattr(obs, \"engagement_signals\", None)\n",
         "    if signals:\n",
         "            tool_str += f\"  {tr.name}: {json.dumps(tr.data)}\\n\"\n",
         "    if not tool_str:\n",
         "        tool_str = \"  (none — call query_* tools to discover)\\n\"\n",
+        "    return (f\"Day: {day_name} | days_elapsed={obs.days_elapsed}\\n\"\n",
         "            f\"Energy: {obs.creator_energy:.2f} | Followers: {obs.follower_count}\\n\"\n",
         "            f\"Engagement: {obs.engagement_rate:.3f} | Queue: {obs.content_queue_size}\\n\"\n",
         "            f\"{signals_str}\"\n",
         "\n",
         "\n",
         "print(\"LLM agent functions defined (batched).\")"
+      ]
     },
     {
       "cell_type": "markdown",
     },
     {
       "cell_type": "code",
+      "execution_count": null,
       "metadata": {},
+      "outputs": [],
       "source": [
         "# Cell 9: Run untrained model (batched: all 3 tasks in parallel envs)\n",
         "print(\"Running UNTRAINED base model on all tasks (batched)...\")\n",
         "print(f\"BEFORE TRAINING (took {time.time()-t0:.1f}s):\")\n",
         "for t in TASKS:\n",
         "    print(f\"  {t}: grader={before_results[t]['grader_score']:.4f}\")"
+      ]
     },
     {
       "cell_type": "markdown",
     },
     {
       "cell_type": "code",
+      "execution_count": null,
       "metadata": {},
+      "outputs": [],
       "source": [
         "# Cell 10: Attach LoRA adapter\n",
         "from peft import LoraConfig, get_peft_model, TaskType\n",
         "model.enable_input_require_grads()\n",
         "peft_model = get_peft_model(model, lora_config)\n",
         "peft_model.print_trainable_parameters()"
+      ]
     },
     {
       "cell_type": "code",
+      "execution_count": null,
       "metadata": {},
+      "outputs": [],
       "source": [
         "# Cell 11: Training loop\n",
         "from trl import SFTTrainer, SFTConfig\n",
         "elapsed = time.time() - t_start\n",
         "print(f\"\\nTraining complete in {elapsed/60:.1f} min\")\n",
         "print(pd.DataFrame(training_log).to_string(index=False))"
+      ]
     },
     {
       "cell_type": "markdown",
     },
     {
       "cell_type": "code",
+      "execution_count": null,
       "metadata": {},
+      "outputs": [],
       "source": [
         "# Cell 12: Run trained model (batched)\n",
         "print(\"Running TRAINED model on all tasks (batched)...\")\n",
         "print(f\"AFTER TRAINING (took {time.time()-t0:.1f}s):\")\n",
         "for t in TASKS:\n",
         "    print(f\"  {t}: grader={after_results[t]['grader_score']:.4f}\")"
+      ]
     },
     {
       "cell_type": "markdown",
     },
     {
       "cell_type": "code",
+      "execution_count": null,
       "metadata": {},
+      "outputs": [],
       "source": [
         "# Cell 13: Training curves\n",
         "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
         "fig.tight_layout()\n",
         "fig.savefig(f'{PLOTS_DIR}/reward_curve.png', dpi=150, bbox_inches='tight')\n",
         "plt.show()"
+      ]
     },
     {
       "cell_type": "code",
+      "execution_count": null,
       "metadata": {},
+      "outputs": [],
       "source": [
         "# Cell 14: Before vs After\n",
         "task_labels = [t.replace('monthly_', '').title() for t in TASKS]\n",
         "fig.tight_layout()\n",
         "fig.savefig(f'{PLOTS_DIR}/before_after.png', dpi=150, bbox_inches='tight')\n",
         "plt.show()"
+      ]
     },
     {
       "cell_type": "code",
+      "execution_count": null,
       "metadata": {},
+      "outputs": [],
       "source": [
         "# Cell 15: Trajectory comparison\n",
         "fig, axes = plt.subplots(2, 3, figsize=(16, 8))\n",
         "fig.tight_layout()\n",
         "fig.savefig(f'{PLOTS_DIR}/training_trajectories.png', dpi=150, bbox_inches='tight')\n",
         "plt.show()"
+      ]
     },
     {
       "cell_type": "markdown",
     },
     {
       "cell_type": "code",
+      "execution_count": null,
       "metadata": {},
+      "outputs": [],
       "source": [
         "# Cell 16: Final summary\n",
         "print(\"=\" * 67)\n",
         "\n",
         "print(f\"\\nSaved to {PLOTS_DIR}/\")\n",
         "print(\"All results are from real LoRA weight updates on real environment runs.\")"
+      ]
     },
     {
       "cell_type": "code",
+      "execution_count": null,
       "metadata": {},
+      "outputs": [],
       "source": [
         "# Cell 17: Save adapter\n",
         "save_path = \"./viraltest_trained_adapter\"\n",
         "tokenizer.save_pretrained(save_path)\n",
         "print(f\"LoRA adapter saved to {save_path}\")\n",
         "print(\"Load with: PeftModel.from_pretrained(base_model, save_path)\")"
+      ]
     }
   ],
   "metadata": {
   },
   "nbformat": 4,
   "nbformat_minor": 4
+}
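
The compressed two-phase prompt expects a `{"tool_calls": [...]}` response in PHASE A and a `{"scheduled_actions": [...], "notes": "..."}` response in PHASE B. The sketch below shows one way a caller might tell the two apart. It is an assumption-laden illustration, not the notebook's parser: `classify_response` is a hypothetical helper, and the shape of each tool-call entry (`name` / `arguments`) is assumed, since the diff does not show it.

```python
import json


def classify_response(raw):
    """Return 'A' for a tool-call turn, 'B' for a planning turn, else None."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    has_tools = "tool_calls" in payload
    has_plan = "scheduled_actions" in payload
    if has_tools and has_plan:
        # The prompt asks for tool calls only in PHASE A; treat a mix as invalid.
        return None
    if has_tools and isinstance(payload["tool_calls"], list):
        return "A"
    if has_plan and isinstance(payload["scheduled_actions"], list):
        return "B"
    return None


# Entry shape below is an assumption; only the tool name and `niche` appear in the diff.
phase_a = '{"tool_calls": [{"name": "query_trends", "arguments": {"niche": "fitness"}}]}'
# An empty scheduled_actions list means a full rest day per the system prompt.
phase_b = '{"scheduled_actions": [], "notes": "rest day"}'

print(classify_response(phase_a))
print(classify_response(phase_b))
```

Rejecting mixed responses keeps the discovery turn separate from the planning turn, which matches the intent of making the model learn timing from tool results rather than from the prompt.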