anuragredbus committed
Commit 4a29e22 · 1 Parent(s): e2c547b

fix: rewrite training notebook for real LoRA fine-tuning on Colab


- Add missing openenv-core dependency to install cell
- Self-contained: clones repo, installs all deps, runs end-to-end
- Real weight updates via LoRA + SFT (not prompt engineering); see the sketch after this list
- 4-bit quantization to fit free Colab T4 GPU
- Pipeline: baselines → untrained LLM → LoRA training → trained LLM → plots
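The round-based loop the notebook implements is easiest to see in miniature. Below is a minimal sketch of one reward-weighted SFT round, assuming the notebook's own `run_llm_episode` helper and an already-attached LoRA `peft_model`; the reward mix mirrors the notebook (per-step rewards plus 2× the terminal grader score), while the output path, batching, and text formatting here are illustrative rather than the notebook's exact values.

```python
# Sketch of one training round: roll out episodes, keep the top half of
# (prompt, response) pairs by episode reward, and fine-tune on them.
import numpy as np
from datasets import Dataset
from trl import SFTTrainer, SFTConfig

def training_round(peft_model, tokenizer, tasks, n_episodes=6, top_frac=0.5):
    pairs = []
    for ep in range(n_episodes):
        result = run_llm_episode(peft_model, tokenizer,
                                 tasks[ep % len(tasks)], seed=42 + ep)
        # Episode return = summed per-step rewards + 2x terminal grader score.
        ep_reward = result["total_reward"] + 2.0 * result["grader_score"]
        for pr in result["prompts_and_responses"]:
            # Illustrative concatenation; the notebook formats this pair
            # with the model's chat template instead.
            pairs.append({"text": pr["prompt"] + pr["response"],
                          "reward": ep_reward})

    # Keep only pairs at or above the reward percentile cutoff.
    cutoff = np.percentile([p["reward"] for p in pairs], (1 - top_frac) * 100)
    kept = [{"text": p["text"]} for p in pairs if p["reward"] >= cutoff]

    trainer = SFTTrainer(
        model=peft_model,
        # The "text" column is trl's default training field.
        train_dataset=Dataset.from_list(kept),
        args=SFTConfig(output_dir="./round_ckpt",  # illustrative path
                       num_train_epochs=2,
                       per_device_train_batch_size=1,
                       report_to="none"),
    )
    trainer.train()  # real gradient updates to the LoRA adapter weights
```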

Made-with: Cursor

Files changed (1)
  1. training/train_grpo.ipynb +774 -1039
training/train_grpo.ipynb CHANGED
@@ -1,1041 +1,776 @@
1
  {
2
- "cells": [
3
- {
4
- "cell_type": "markdown",
5
- "metadata": {},
6
- "source": [
7
- "# Viraltest v2 — GRPO Training on Qwen2.5-1.5B-Instruct\n",
8
- "\n",
9
- "This notebook trains an LLM to be an Instagram strategy agent using **Group Relative Policy Optimization (GRPO)**.\n",
10
- "\n",
11
- "**What we train:** The model learns to plan daily posting schedules (content type, timing, topics, tags, intent signals) that maximise engagement while managing energy/burnout.\n",
12
- "\n",
13
- "**Pipeline:**\n",
14
- "1. Run heuristic baselines (smart, spam, rest, random) to establish baseline scores\n",
15
- "2. Run the **untrained** base model and record scores\n",
16
- "3. Train with GRPO using environment rewards\n",
17
- "4. Run the **trained** model and compare\n",
18
- "5. Plot real reward curves and before/after comparisons\n",
19
- "\n",
20
- "**Requirements:** Free Colab T4 GPU, ~45 min total.\n",
21
- "\n",
22
- "**Reward:** per-step env reward (0-1) + 2× terminal `grader_score`."
23
- ]
24
  },
25
- {
26
- "cell_type": "code",
27
- "execution_count": null,
28
- "metadata": {},
29
- "outputs": [],
30
- "source": [
31
- "!pip install -q trl>=0.12.0 transformers accelerate peft bitsandbytes datasets\n",
32
- "!pip install -q openai httpx matplotlib pandas\n",
33
- "!pip install -q openenv-core[core]>=0.2.2"
34
- ]
35
- },
36
- {
37
- "cell_type": "code",
38
- "execution_count": null,
39
- "metadata": {},
40
- "outputs": [],
41
- "source": [
42
- "import json\n",
43
- "import os\n",
44
- "import time\n",
45
- "import random\n",
46
- "import copy\n",
47
- "from pathlib import Path\n",
48
- "from typing import Any, Dict, List, Optional, Tuple\n",
49
- "\n",
50
- "import matplotlib.pyplot as plt\n",
51
- "import numpy as np\n",
52
- "import pandas as pd\n",
53
- "\n",
54
- "PLOTS_DIR = Path(\"../plots\")\n",
55
- "PLOTS_DIR.mkdir(exist_ok=True)\n",
56
- "\n",
57
- "print(\"Imports OK\")"
58
- ]
59
- },
60
- {
61
- "cell_type": "markdown",
62
- "metadata": {},
63
- "source": [
64
- "## Part 1: Environment Setup — Direct In-Process Access\n",
65
- "\n",
66
- "We instantiate the environment directly (no HTTP server needed) so we can run hundreds of episodes quickly."
67
- ]
68
- },
69
- {
70
- "cell_type": "code",
71
- "execution_count": null,
72
- "metadata": {},
73
- "outputs": [],
74
- "source": [
75
- "import sys\n",
76
- "sys.path.insert(0, \"..\")\n",
77
- "\n",
78
- "from models import ScheduledAction, ViraltestAction, ToolCall\n",
79
- "from server.viraltest_environment import (\n",
80
- " ViraltestEnvironment,\n",
81
- " TAG_POOL,\n",
82
- " TOPIC_CATEGORIES,\n",
83
- " TASK_HORIZON,\n",
84
- ")\n",
85
- "\n",
86
- "ALL_TOPICS = [t for topics in TOPIC_CATEGORIES.values() for t in topics]\n",
87
- "NICHES = list(TOPIC_CATEGORIES.keys())\n",
88
- "CONTENT_TYPES = [\"reel\", \"carousel\", \"story\", \"text_post\"]\n",
89
- "INTENTS = [\"send_bait\", \"save_bait\", \"watch_bait\", \"like_bait\"]\n",
90
- "TASKS = [\"monthly_engage\", \"monthly_strategic\", \"monthly_competitive\"]\n",
91
- "\n",
92
- "print(f\"Tags: {len(TAG_POOL)}, Topics: {len(ALL_TOPICS)}, Niches: {len(NICHES)}\")\n",
93
- "print(f\"Tasks: {TASKS}\")\n",
94
- "print(f\"Horizon: {TASK_HORIZON} steps (days)\")"
95
- ]
96
- },
97
- {
98
- "cell_type": "markdown",
99
- "metadata": {},
100
- "source": [
101
- "## Part 2: Heuristic Baselines\n",
102
- "\n",
103
- "Before touching any LLM, we run scripted agents to establish a **baseline leaderboard**.\n",
104
- "This proves the environment can differentiate skill levels."
105
- ]
106
- },
107
- {
108
- "cell_type": "code",
109
- "execution_count": null,
110
- "metadata": {},
111
- "outputs": [],
112
- "source": [
113
- "_rng = random.Random(42)\n",
114
- "\n",
115
- "\n",
116
- "def plan_always_rest(obs_dict: dict, day: int) -> ViraltestAction:\n",
117
- " return ViraltestAction(scheduled_actions=[], notes=\"Rest day.\")\n",
118
- "\n",
119
- "\n",
120
- "def plan_spam(obs_dict: dict, day: int) -> ViraltestAction:\n",
121
- " actions = [\n",
122
- " {\"hour\": h, \"action_type\": \"post\", \"content_type\": \"reel\",\n",
123
- " \"topic\": \"AI tools\", \"tags\": [\"ai\"], \"intent\": \"watch_bait\"}\n",
124
- " for h in range(24)\n",
125
- " ]\n",
126
- " return ViraltestAction(scheduled_actions=[ScheduledAction(**a) for a in actions])\n",
127
- "\n",
128
- "\n",
129
- "def plan_random(obs_dict: dict, day: int) -> ViraltestAction:\n",
130
- " actions = []\n",
131
- " for h in range(24):\n",
132
- " if _rng.random() < 0.1:\n",
133
- " ct = _rng.choice(CONTENT_TYPES)\n",
134
- " topic = _rng.choice(ALL_TOPICS)\n",
135
- " tags = _rng.sample(TAG_POOL[:30], min(3, len(TAG_POOL)))\n",
136
- " intent = _rng.choice(INTENTS)\n",
137
- " actions.append({\"hour\": h, \"action_type\": \"post\", \"content_type\": ct,\n",
138
- " \"topic\": topic, \"tags\": tags, \"intent\": intent})\n",
139
- " return ViraltestAction(scheduled_actions=[ScheduledAction(**a) for a in actions])\n",
140
- "\n",
141
- "\n",
142
- "def plan_minimal(obs_dict: dict, day: int) -> ViraltestAction:\n",
143
- " topic = ALL_TOPICS[day % len(ALL_TOPICS)]\n",
144
- " tags = [TAG_POOL[i % len(TAG_POOL)] for i in range(day, day + 3)]\n",
145
- " actions = [\n",
146
- " {\"hour\": 12, \"action_type\": \"post\", \"content_type\": \"carousel\",\n",
147
- " \"topic\": topic, \"tags\": tags, \"intent\": \"save_bait\"},\n",
148
- " ]\n",
149
- " return ViraltestAction(scheduled_actions=[ScheduledAction(**a) for a in actions])\n",
150
- "\n",
151
- "\n",
152
- "def plan_smart(obs_dict: dict, day: int) -> ViraltestAction:\n",
153
- " \"\"\"Best heuristic: 2 posts at peak hours, varied content types and intents, tag rotation.\"\"\"\n",
154
- " topic1 = ALL_TOPICS[(day * 2) % len(ALL_TOPICS)]\n",
155
- " topic2 = ALL_TOPICS[(day * 2 + 1) % len(ALL_TOPICS)]\n",
156
- " ct1 = CONTENT_TYPES[(day * 2) % 4]\n",
157
- " ct2 = CONTENT_TYPES[(day * 2 + 1) % 4]\n",
158
- " intent1 = INTENTS[(day * 2) % 4]\n",
159
- " intent2 = INTENTS[(day * 2 + 1) % 4]\n",
160
- " tags1 = [TAG_POOL[(day * 6 + i) % len(TAG_POOL)] for i in range(3)]\n",
161
- " tags2 = [TAG_POOL[(day * 6 + 3 + i) % len(TAG_POOL)] for i in range(3)]\n",
162
- "\n",
163
- " actions = [\n",
164
- " {\"hour\": 8, \"action_type\": \"create_content\"},\n",
165
- " {\"hour\": 12, \"action_type\": \"post\", \"content_type\": ct1,\n",
166
- " \"topic\": topic1, \"tags\": tags1, \"intent\": intent1},\n",
167
- " {\"hour\": 19, \"action_type\": \"post\", \"content_type\": ct2,\n",
168
- " \"topic\": topic2, \"tags\": tags2, \"intent\": intent2},\n",
169
- " ]\n",
170
- " replies = [{\"post_hour\": 12, \"reply_hour\": 13}]\n",
171
- " return ViraltestAction(\n",
172
- " scheduled_actions=[ScheduledAction(**a) for a in actions],\n",
173
- " replies=[{\"post_hour\": 12, \"reply_hour\": 13}],\n",
174
- " notes=f\"Day {day}: varied content at peak hours.\",\n",
175
- " )\n",
176
- "\n",
177
- "\n",
178
- "def plan_smart_with_tools(obs_dict: dict, day: int) -> ViraltestAction:\n",
179
- " \"\"\"Smart agent that also uses tools for world discovery.\"\"\"\n",
180
- " tool_calls = []\n",
181
- " if day <= 3:\n",
182
- " tool_calls.append(ToolCall(name=\"query_trends\", arguments={\"niche\": NICHES[day % len(NICHES)]}))\n",
183
- " if day % 5 == 0:\n",
184
- " tool_calls.append(ToolCall(name=\"query_competitor\", arguments={\"competitor_id\": \"niche_expert\", \"window_days\": 7}))\n",
185
- " if day % 7 == 0:\n",
186
- " tool_calls.append(ToolCall(name=\"query_audience\", arguments={\"segment_id\": \"gen_z\"}))\n",
187
- "\n",
188
- " base = plan_smart(obs_dict, day)\n",
189
- " return ViraltestAction(\n",
190
- " tool_calls=tool_calls,\n",
191
- " scheduled_actions=base.scheduled_actions,\n",
192
- " replies=base.replies,\n",
193
- " notes=f\"Day {day}: tool-assisted planning.\",\n",
194
- " )\n",
195
- "\n",
196
- "\n",
197
- "BASELINE_AGENTS = {\n",
198
- " \"always_rest\": plan_always_rest,\n",
199
- " \"spam\": plan_spam,\n",
200
- " \"random\": plan_random,\n",
201
- " \"minimal\": plan_minimal,\n",
202
- " \"smart\": plan_smart,\n",
203
- " \"smart_with_tools\": plan_smart_with_tools,\n",
204
- "}"
205
- ]
206
- },
207
- {
208
- "cell_type": "code",
209
- "execution_count": null,
210
- "metadata": {},
211
- "outputs": [],
212
- "source": [
213
- "def run_episode(task: str, plan_fn, seed: int = 42) -> Dict[str, Any]:\n",
214
- " \"\"\"Run one full 30-day episode and return metrics.\"\"\"\n",
215
- " env = ViraltestEnvironment()\n",
216
- " obs = env.reset(task=task, seed=seed)\n",
217
- " obs_dict = obs.model_dump()\n",
218
- "\n",
219
- " rewards = []\n",
220
- " energies = [obs.creator_energy]\n",
221
- " followers_hist = [obs.follower_count]\n",
222
- "\n",
223
- " for day in range(1, TASK_HORIZON + 1):\n",
224
- " action = plan_fn(obs_dict, day)\n",
225
- " obs = env.step(action)\n",
226
- " obs_dict = obs.model_dump()\n",
227
- " r = obs.reward if obs.reward is not None else 0.0\n",
228
- " rewards.append(r)\n",
229
- " energies.append(obs.creator_energy)\n",
230
- " followers_hist.append(obs.follower_count)\n",
231
- " if obs.done:\n",
232
- " break\n",
233
- "\n",
234
- " grader_score = (obs.metadata or {}).get(\"grader_score\", 0.0)\n",
235
- "\n",
236
- " return {\n",
237
- " \"task\": task,\n",
238
- " \"steps\": len(rewards),\n",
239
- " \"total_reward\": sum(rewards),\n",
240
- " \"avg_reward\": sum(rewards) / len(rewards) if rewards else 0,\n",
241
- " \"grader_score\": grader_score,\n",
242
- " \"final_energy\": obs.creator_energy,\n",
243
- " \"min_energy\": min(energies),\n",
244
- " \"final_followers\": obs.follower_count,\n",
245
- " \"follower_delta\": obs.follower_count - 10000,\n",
246
- " \"burned_out\": obs.creator_energy <= 0,\n",
247
- " \"rewards\": rewards,\n",
248
- " \"energies\": energies,\n",
249
- " \"followers\": followers_hist,\n",
250
- " }\n",
251
- "\n",
252
- "\n",
253
- "print(\"Running heuristic baselines across all tasks...\")\n",
254
- "print(\"=\" * 80)\n",
255
- "\n",
256
- "baseline_results = {}\n",
257
- "for agent_name, plan_fn in BASELINE_AGENTS.items():\n",
258
- " baseline_results[agent_name] = {}\n",
259
- " for task in TASKS:\n",
260
- " _rng = random.Random(42)\n",
261
- " result = run_episode(task, plan_fn, seed=42)\n",
262
- " baseline_results[agent_name][task] = result\n",
263
- " print(f\" {agent_name:>20s} | {task:>22s} | score={result['grader_score']:.4f} | \"\n",
264
- " f\"reward={result['total_reward']:.3f} | energy={result['final_energy']:.2f} | \"\n",
265
- " f\"followers={result['follower_delta']:+d}\")\n",
266
- " print()\n",
267
- "\n",
268
- "print(\"\\n\" + \"=\" * 80)\n",
269
- "print(\"BASELINE LEADERBOARD (grader_score)\")\n",
270
- "print(\"=\" * 80)\n",
271
- "print(f\"{'Agent':<22s} {'engage':>10s} {'strategic':>12s} {'competitive':>14s} {'avg':>8s}\")\n",
272
- "print(\"-\" * 68)\n",
273
- "for agent_name in BASELINE_AGENTS:\n",
274
- " scores = [baseline_results[agent_name][t][\"grader_score\"] for t in TASKS]\n",
275
- " avg = sum(scores) / len(scores)\n",
276
- " print(f\"{agent_name:<22s} {scores[0]:>10.4f} {scores[1]:>12.4f} {scores[2]:>14.4f} {avg:>8.4f}\")"
277
- ]
278
- },
279
- {
280
- "cell_type": "markdown",
281
- "metadata": {},
282
- "source": [
283
- "## Part 3: Baseline Visualization\n",
284
- "\n",
285
- "Plot the heuristic baseline results to show the environment differentiates skill levels."
286
- ]
287
- },
288
- {
289
- "cell_type": "code",
290
- "execution_count": null,
291
- "metadata": {},
292
- "outputs": [],
293
- "source": [
294
- "fig, axes = plt.subplots(1, 3, figsize=(16, 5), sharey=True)\n",
295
- "agent_names = list(BASELINE_AGENTS.keys())\n",
296
- "colors = ['#E53935', '#FF9800', '#9E9E9E', '#42A5F5', '#4CAF50', '#2E7D32']\n",
297
- "\n",
298
- "for i, task in enumerate(TASKS):\n",
299
- " scores = [baseline_results[a][task][\"grader_score\"] for a in agent_names]\n",
300
- " bars = axes[i].barh(agent_names, scores, color=colors)\n",
301
- " axes[i].set_title(task.replace(\"monthly_\", \"\").title(), fontsize=13, fontweight='bold')\n",
302
- " axes[i].set_xlim(0, max(max(scores) * 1.15, 0.01))\n",
303
- " for bar, score in zip(bars, scores):\n",
304
- " axes[i].text(bar.get_width() + 0.005, bar.get_y() + bar.get_height()/2,\n",
305
- " f\"{score:.3f}\", va='center', fontsize=9)\n",
306
- "\n",
307
- "axes[0].set_ylabel(\"Agent\")\n",
308
- "fig.suptitle(\"Viraltest v2 — Heuristic Baseline Leaderboard\", fontsize=14, fontweight='bold')\n",
309
- "fig.tight_layout()\n",
310
- "fig.savefig(PLOTS_DIR / \"baseline_leaderboard.png\", dpi=150, bbox_inches='tight')\n",
311
- "plt.show()\n",
312
- "print(f\"Saved {PLOTS_DIR / 'baseline_leaderboard.png'}\")"
313
- ]
314
- },
315
- {
316
- "cell_type": "code",
317
- "execution_count": null,
318
- "metadata": {},
319
- "outputs": [],
320
- "source": [
321
- "fig, axes = plt.subplots(2, 3, figsize=(16, 8))\n",
322
- "\n",
323
- "for i, task in enumerate(TASKS):\n",
324
- " for j, agent_name in enumerate(agent_names):\n",
325
- " result = baseline_results[agent_name][task]\n",
326
- " axes[0, i].plot(result[\"rewards\"], label=agent_name, color=colors[j], alpha=0.8)\n",
327
- " axes[1, i].plot(result[\"energies\"], label=agent_name, color=colors[j], alpha=0.8)\n",
328
- "\n",
329
- " axes[0, i].set_title(f\"{task.replace('monthly_', '').title()} — Rewards\", fontsize=11)\n",
330
- " axes[0, i].set_xlabel(\"Day\")\n",
331
- " axes[0, i].set_ylabel(\"Reward\")\n",
332
- " axes[0, i].grid(True, alpha=0.3)\n",
333
- "\n",
334
- " axes[1, i].set_title(f\"{task.replace('monthly_', '').title()} — Energy\", fontsize=11)\n",
335
- " axes[1, i].set_xlabel(\"Day\")\n",
336
- " axes[1, i].set_ylabel(\"Energy\")\n",
337
- " axes[1, i].grid(True, alpha=0.3)\n",
338
- "\n",
339
- "axes[0, 2].legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=8)\n",
340
- "fig.suptitle(\"Viraltest v2 — Daily Rewards & Energy by Agent\", fontsize=14, fontweight='bold', y=1.01)\n",
341
- "fig.tight_layout()\n",
342
- "fig.savefig(PLOTS_DIR / \"baseline_trajectories.png\", dpi=150, bbox_inches='tight')\n",
343
- "plt.show()\n",
344
- "print(f\"Saved {PLOTS_DIR / 'baseline_trajectories.png'}\")"
345
- ]
346
- },
347
- {
348
- "cell_type": "markdown",
349
- "metadata": {},
350
- "source": [
351
- "## Part 4: LLM Evaluation — Untrained Baseline\n",
352
- "\n",
353
- "We run the base Qwen2.5-1.5B-Instruct model (no fine-tuning) against the environment\n",
354
- "using the same prompt format as `inference.py`. This gives us the **before** scores.\n",
355
- "\n",
356
- "### Option A: Via HTTP (if you have a running env server + model API)\n",
357
- "Set `ENV_BASE_URL` and `API_BASE_URL` environment variables.\n",
358
- "\n",
359
- "### Option B: Direct in-process (no server needed)\n",
360
- "We load the model locally and run the environment directly. This is what we do below."
361
- ]
362
- },
363
- {
364
- "cell_type": "code",
365
- "execution_count": null,
366
- "metadata": {},
367
- "outputs": [],
368
- "source": [
369
- "import textwrap\n",
370
- "import torch\n",
371
- "from transformers import AutoTokenizer, AutoModelForCausalLM\n",
372
- "\n",
373
- "MODEL_NAME = \"Qwen/Qwen2.5-1.5B-Instruct\"\n",
374
- "\n",
375
- "print(f\"Loading {MODEL_NAME}...\")\n",
376
- "tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)\n",
377
- "model = AutoModelForCausalLM.from_pretrained(\n",
378
- " MODEL_NAME,\n",
379
- " trust_remote_code=True,\n",
380
- " torch_dtype=torch.float16,\n",
381
- " device_map=\"auto\",\n",
382
- ")\n",
383
- "model.eval()\n",
384
- "print(f\"Model loaded on {model.device}\")"
385
- ]
386
- },
387
- {
388
- "cell_type": "code",
389
- "execution_count": null,
390
- "metadata": {},
391
- "outputs": [],
392
- "source": [
393
- "SYSTEM_PROMPT = textwrap.dedent(\"\"\"\\\n",
394
- "You are an Instagram content strategy agent. Each step is one full day (24 hours).\n",
395
- "You manage a creator account over a 30-day monthly cycle.\n",
396
- "\n",
397
- "You receive a SPARSE observation (energy, followers, last reward, notes echo).\n",
398
- "To learn about the world, you MUST use TOOLS before planning your day.\n",
399
- "\n",
400
- "AVAILABLE TOOLS (call via tool_calls before scheduling posts):\n",
401
- "- query_trends(niche): Get trending topics and tags for a niche\n",
402
- "- query_competitor(competitor_id, window_days): See competitor activity\n",
403
- "- query_tag_history(tag): Check your past performance with a tag\n",
404
- "- query_audience(segment_id): Learn audience segment preferences\n",
405
- "- predict_engagement(scheduled_actions): Simulate engagement without committing\n",
406
- "- draft_review(scheduled_actions): Get feedback on a draft plan\n",
407
- "\n",
408
- "RESPONSE FORMAT (JSON only, no markdown, no prose):\n",
409
- "{\n",
410
- " \"tool_calls\": [\n",
411
- " {\"name\": \"query_trends\", \"arguments\": {\"niche\": \"tech\"}}\n",
412
- " ],\n",
413
- " \"scheduled_actions\": [\n",
414
- " {\"hour\": 12, \"action_type\": \"post\", \"content_type\": \"reel\", \"topic\": \"AI tools\", \"tags\": [\"ai\", \"coding\"], \"intent\": \"watch_bait\"},\n",
415
- " {\"hour\": 19, \"action_type\": \"post\", \"content_type\": \"carousel\", \"topic\": \"startup life\", \"tags\": [\"startup\"], \"intent\": \"save_bait\"}\n",
416
- " ],\n",
417
- " \"replies\": [{\"post_hour\": 12, \"reply_hour\": 13}],\n",
418
- " \"notes\": \"Day 3: tech niche trending up.\"\n",
419
- "}\n",
420
- "\n",
421
- "RULES:\n",
422
- "- hour: 0-23. content_type: reel|story|carousel|text_post. intent: send_bait|save_bait|watch_bait|like_bait\n",
423
- "- 1-2 posts per day is optimal. More causes audience fatigue.\n",
424
- "- Empty scheduled_actions = rest all day (recovers energy)\n",
425
- "- Use notes to track hypotheses across days\n",
426
- "- Tool calls cost API budget (starts at 100). Use wisely.\n",
427
- "- Reply within 90 minutes of a post for reach bonus\"\"\")\n",
428
- "\n",
429
- "\n",
430
- "def format_obs_for_prompt(obs) -> str:\n",
431
- " \"\"\"Format environment observation into a prompt string.\"\"\"\n",
432
- " days = [\"Mon\", \"Tue\", \"Wed\", \"Thu\", \"Fri\", \"Sat\", \"Sun\"]\n",
433
- " day_name = days[obs.day_of_week] if 0 <= obs.day_of_week < 7 else \"?\"\n",
434
- " notes_echo = getattr(obs, \"agent_notes\", None) or \"none\"\n",
435
- " budget = getattr(obs, \"api_budget_remaining\", 100)\n",
436
- " burnout = getattr(obs, \"burnout_risk\", 0.0)\n",
437
- "\n",
438
- " tool_results_str = \"\"\n",
439
- " for tr in getattr(obs, \"tool_results\", []):\n",
440
- " if tr.success:\n",
441
- " tool_results_str += f\" {tr.name}: {json.dumps(tr.data)[:200]}\\n\"\n",
442
- " else:\n",
443
- " tool_results_str += f\" {tr.name}: ERROR - {tr.error}\\n\"\n",
444
- "\n",
445
- " coach = getattr(obs, \"coach_feedback\", None)\n",
446
- " coach_str = \"\"\n",
447
- " if coach:\n",
448
- " coach_str = f\"Coach: delta={coach.get('delta', 0):.3f}, suggestion={coach.get('suggestion', '')}\\n\"\n",
449
- "\n",
450
- " signals = getattr(obs, \"engagement_signals\", None)\n",
451
- " signals_str = \"\"\n",
452
- " if signals:\n",
453
- " signals_str = (\n",
454
- " f\"Signals: watch={signals.watch_time:.3f} sends={signals.sends_per_reach:.3f} \"\n",
455
- " f\"saves={signals.saves:.3f} likes={signals.likes_per_reach:.3f}\\n\"\n",
456
- " )\n",
457
- "\n",
458
- " return textwrap.dedent(f\"\"\"\\\n",
459
- "Day: {day_name} (day_of_week={obs.day_of_week}) | days_elapsed={obs.days_elapsed}\n",
460
- "Energy: {obs.creator_energy:.2f} | Burnout risk: {burnout:.2f} | Followers: {obs.follower_count}\n",
461
- "Engagement rate: {obs.engagement_rate:.3f} | Content queue: {obs.content_queue_size}\n",
462
- "API budget remaining: {budget}\n",
463
- "{signals_str}{coach_str}Tool results from last step:\n",
464
- "{tool_results_str if tool_results_str else ' (none)\\n'}Your notes from last step: {notes_echo}\n",
465
- "Plan your tool calls and actions for today:\"\"\")\n",
466
- "\n",
467
- "\n",
468
- "def parse_model_output(text: str) -> ViraltestAction:\n",
469
- " \"\"\"Parse model JSON output into a ViraltestAction.\"\"\"\n",
470
- " text = text.strip()\n",
471
- " if text.startswith(\"```\"):\n",
472
- " lines = text.split(\"\\n\")\n",
473
- " lines = [l for l in lines if not l.strip().startswith(\"```\")]\n",
474
- " text = \"\\n\".join(lines).strip()\n",
475
- "\n",
476
- " try:\n",
477
- " data = json.loads(text)\n",
478
- " tool_calls = []\n",
479
- " for tc in data.get(\"tool_calls\", []):\n",
480
- " if isinstance(tc, dict) and \"name\" in tc:\n",
481
- " tool_calls.append(ToolCall(name=tc[\"name\"], arguments=tc.get(\"arguments\", {})))\n",
482
- "\n",
483
- " scheduled = []\n",
484
- " for a in data.get(\"scheduled_actions\", []):\n",
485
- " if isinstance(a, dict):\n",
486
- " try:\n",
487
- " scheduled.append(ScheduledAction(**a))\n",
488
- " except Exception:\n",
489
- " pass\n",
490
- "\n",
491
- " return ViraltestAction(\n",
492
- " tool_calls=tool_calls,\n",
493
- " scheduled_actions=scheduled,\n",
494
- " replies=data.get(\"replies\", []),\n",
495
- " notes=data.get(\"notes\"),\n",
496
- " )\n",
497
- " except (json.JSONDecodeError, Exception):\n",
498
- " return ViraltestAction(scheduled_actions=[])\n",
499
- "\n",
500
- "\n",
501
- "def generate_action(model, tokenizer, obs, history: List[dict], temperature=0.7, max_new_tokens=512) -> Tuple[str, ViraltestAction]:\n",
502
- " \"\"\"Generate an action from the model given an observation.\"\"\"\n",
503
- " user_prompt = format_obs_for_prompt(obs)\n",
504
- " messages = [{\"role\": \"system\", \"content\": SYSTEM_PROMPT}]\n",
505
- " messages.extend(history[-4:])\n",
506
- " messages.append({\"role\": \"user\", \"content\": user_prompt})\n",
507
- "\n",
508
- " text_input = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\n",
509
- " inputs = tokenizer(text_input, return_tensors=\"pt\").to(model.device)\n",
510
- "\n",
511
- " with torch.no_grad():\n",
512
- " output_ids = model.generate(\n",
513
- " **inputs,\n",
514
- " max_new_tokens=max_new_tokens,\n",
515
- " temperature=temperature,\n",
516
- " do_sample=True,\n",
517
- " top_p=0.9,\n",
518
- " pad_token_id=tokenizer.eos_token_id,\n",
519
- " )\n",
520
- "\n",
521
- " new_tokens = output_ids[0][inputs[\"input_ids\"].shape[1]:]\n",
522
- " response = tokenizer.decode(new_tokens, skip_special_tokens=True)\n",
523
- " action = parse_model_output(response)\n",
524
- " return response, action\n",
525
- "\n",
526
- "print(\"LLM agent functions defined.\")"
527
- ]
528
- },
529
- {
530
- "cell_type": "code",
531
- "execution_count": null,
532
- "metadata": {},
533
- "outputs": [],
534
- "source": [
535
- "def run_llm_episode(model, tokenizer, task: str, seed: int = 42, verbose: bool = False) -> Dict[str, Any]:\n",
536
- " \"\"\"Run one full episode using the LLM agent.\"\"\"\n",
537
- " env = ViraltestEnvironment()\n",
538
- " obs = env.reset(task=task, seed=seed)\n",
539
- "\n",
540
- " rewards = []\n",
541
- " energies = [obs.creator_energy]\n",
542
- " history = []\n",
543
- " prompts_and_responses = []\n",
544
- "\n",
545
- " for day in range(1, TASK_HORIZON + 1):\n",
546
- " if obs.done:\n",
547
- " break\n",
548
- "\n",
549
- " if obs.creator_energy <= 0.25:\n",
550
- " action = ViraltestAction(scheduled_actions=[], notes=\"Low energy — forced rest.\")\n",
551
- " response_text = '{\"scheduled_actions\": [], \"notes\": \"Low energy — rest.\"}'\n",
552
- " else:\n",
553
- " response_text, action = generate_action(model, tokenizer, obs, history)\n",
554
- "\n",
555
- " prompt_text = format_obs_for_prompt(obs)\n",
556
- " prompts_and_responses.append({\n",
557
- " \"prompt\": prompt_text,\n",
558
- " \"response\": response_text,\n",
559
- " })\n",
560
- "\n",
561
- " obs = env.step(action)\n",
562
- " r = obs.reward if obs.reward is not None else 0.0\n",
563
- " rewards.append(r)\n",
564
- " energies.append(obs.creator_energy)\n",
565
- "\n",
566
- " history.append({\"role\": \"user\", \"content\": prompt_text})\n",
567
- " history.append({\"role\": \"assistant\", \"content\": response_text})\n",
568
- "\n",
569
- " if verbose:\n",
570
- " n_posts = len([sa for sa in action.scheduled_actions if sa.action_type == \"post\"])\n",
571
- " n_tools = len(action.tool_calls)\n",
572
- " print(f\" Day {day:2d}: reward={r:.4f} energy={obs.creator_energy:.2f} \"\n",
573
- " f\"posts={n_posts} tools={n_tools}\")\n",
574
- "\n",
575
- " if obs.done:\n",
576
- " break\n",
577
- "\n",
578
- " grader_score = (obs.metadata or {}).get(\"grader_score\", 0.0)\n",
579
- "\n",
580
- " return {\n",
581
- " \"task\": task,\n",
582
- " \"steps\": len(rewards),\n",
583
- " \"total_reward\": sum(rewards),\n",
584
- " \"avg_reward\": sum(rewards) / len(rewards) if rewards else 0,\n",
585
- " \"grader_score\": grader_score,\n",
586
- " \"final_energy\": obs.creator_energy,\n",
587
- " \"min_energy\": min(energies),\n",
588
- " \"final_followers\": obs.follower_count,\n",
589
- " \"follower_delta\": obs.follower_count - 10000,\n",
590
- " \"burned_out\": obs.creator_energy <= 0,\n",
591
- " \"rewards\": rewards,\n",
592
- " \"energies\": energies,\n",
593
- " \"prompts_and_responses\": prompts_and_responses,\n",
594
- " }\n",
595
- "\n",
596
- "print(\"LLM episode runner defined.\")"
597
- ]
598
- },
599
- {
600
- "cell_type": "code",
601
- "execution_count": null,
602
- "metadata": {},
603
- "outputs": [],
604
- "source": [
605
- "print(\"Running UNTRAINED base model...\")\n",
606
- "print(\"=\" * 60)\n",
607
- "\n",
608
- "before_results = {}\n",
609
- "for task in TASKS:\n",
610
- " print(f\"\\nTask: {task}\")\n",
611
- " result = run_llm_episode(model, tokenizer, task, seed=42, verbose=True)\n",
612
- " before_results[task] = result\n",
613
- " print(f\" => grader_score={result['grader_score']:.4f}, \"\n",
614
- " f\"total_reward={result['total_reward']:.3f}, \"\n",
615
- " f\"burned_out={result['burned_out']}\")\n",
616
- "\n",
617
- "print(\"\\n\" + \"=\" * 60)\n",
618
- "print(\"BEFORE TRAINING SCORES\")\n",
619
- "print(\"=\" * 60)\n",
620
- "for task in TASKS:\n",
621
- " r = before_results[task]\n",
622
- " print(f\" {task}: grader={r['grader_score']:.4f} reward={r['total_reward']:.3f} energy={r['final_energy']:.2f}\")"
623
- ]
624
- },
625
- {
626
- "cell_type": "markdown",
627
- "metadata": {},
628
- "source": [
629
- "## Part 5: GRPO Training\n",
630
- "\n",
631
- "We use TRL's GRPO trainer to optimize the model on environment rewards.\n",
632
- "\n",
633
- "**Approach:** For each training step, we collect a batch of episodes, score them with the environment reward, and use GRPO to reinforce high-reward responses relative to the group.\n",
634
- "\n",
635
- "Since full multi-step GRPO with TRL requires careful integration, we use a **reward-weighted SFT** approach that achieves similar results:\n",
636
- "1. Collect N episodes with the current model\n",
637
- "2. Weight each (prompt, response) pair by its environment reward\n",
638
- "3. Fine-tune on the reward-weighted dataset\n",
639
- "4. Repeat for multiple rounds"
640
- ]
641
- },
642
- {
643
- "cell_type": "code",
644
- "execution_count": null,
645
- "metadata": {},
646
- "outputs": [],
647
- "source": [
648
- "from peft import LoraConfig, get_peft_model, TaskType\n",
649
- "from transformers import TrainingArguments\n",
650
- "from trl import SFTTrainer, SFTConfig\n",
651
- "from datasets import Dataset\n",
652
- "\n",
653
- "lora_config = LoraConfig(\n",
654
- " r=16,\n",
655
- " lora_alpha=32,\n",
656
- " lora_dropout=0.05,\n",
657
- " target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\", \"gate_proj\", \"up_proj\", \"down_proj\"],\n",
658
- " task_type=TaskType.CAUSAL_LM,\n",
659
- " bias=\"none\",\n",
660
- ")\n",
661
- "\n",
662
- "model.enable_input_require_grads()\n",
663
- "peft_model = get_peft_model(model, lora_config)\n",
664
- "peft_model.print_trainable_parameters()\n",
665
- "print(\"LoRA adapter attached.\")"
666
- ]
667
- },
668
- {
669
- "cell_type": "code",
670
- "execution_count": null,
671
- "metadata": {},
672
- "outputs": [],
673
- "source": [
674
- "def collect_training_data(\n",
675
- " model, tokenizer, n_episodes: int = 8, tasks: List[str] = None\n",
676
- ") -> Tuple[List[Dict], List[float]]:\n",
677
- " \"\"\"Collect episodes and build reward-weighted training pairs.\"\"\"\n",
678
- " tasks = tasks or TASKS\n",
679
- " all_pairs = []\n",
680
- " all_episode_rewards = []\n",
681
- "\n",
682
- " for ep in range(n_episodes):\n",
683
- " task = tasks[ep % len(tasks)]\n",
684
- " seed = 42 + ep\n",
685
- " result = run_llm_episode(model, tokenizer, task, seed=seed)\n",
686
- " episode_reward = result[\"total_reward\"] + 2.0 * result[\"grader_score\"]\n",
687
- " all_episode_rewards.append(episode_reward)\n",
688
- "\n",
689
- " for pr in result[\"prompts_and_responses\"]:\n",
690
- " step_text = (\n",
691
- " f\"<|im_start|>system\\n{SYSTEM_PROMPT}<|im_end|>\\n\"\n",
692
- " f\"<|im_start|>user\\n{pr['prompt']}<|im_end|>\\n\"\n",
693
- " f\"<|im_start|>assistant\\n{pr['response']}<|im_end|>\"\n",
694
- " )\n",
695
- " all_pairs.append({\n",
696
- " \"text\": step_text,\n",
697
- " \"reward\": episode_reward,\n",
698
- " })\n",
699
- "\n",
700
- " return all_pairs, all_episode_rewards\n",
701
- "\n",
702
- "print(\"Data collection function defined.\")"
703
- ]
704
- },
705
- {
706
- "cell_type": "code",
707
- "execution_count": null,
708
- "metadata": {},
709
- "outputs": [],
710
- "source": [
711
- "NUM_ROUNDS = 4\n",
712
- "EPISODES_PER_ROUND = 6\n",
713
- "TOP_K_FRACTION = 0.5\n",
714
- "\n",
715
- "training_log = {\n",
716
- " \"round\": [],\n",
717
- " \"avg_episode_reward\": [],\n",
718
- " \"max_episode_reward\": [],\n",
719
- " \"min_episode_reward\": [],\n",
720
- " \"n_training_samples\": [],\n",
721
- " \"train_loss\": [],\n",
722
- "}\n",
723
- "\n",
724
- "for round_idx in range(1, NUM_ROUNDS + 1):\n",
725
- " print(f\"\\n{'=' * 60}\")\n",
726
- " print(f\"TRAINING ROUND {round_idx}/{NUM_ROUNDS}\")\n",
727
- " print(f\"{'=' * 60}\")\n",
728
- "\n",
729
- " print(f\"Collecting {EPISODES_PER_ROUND} episodes...\")\n",
730
- " peft_model.eval()\n",
731
- " pairs, episode_rewards = collect_training_data(\n",
732
- " peft_model, tokenizer, n_episodes=EPISODES_PER_ROUND\n",
733
- " )\n",
734
- " avg_reward = sum(episode_rewards) / len(episode_rewards)\n",
735
- " print(f\" Episode rewards: {[f'{r:.3f}' for r in episode_rewards]}\")\n",
736
- " print(f\" Avg: {avg_reward:.3f}, Max: {max(episode_rewards):.3f}, Min: {min(episode_rewards):.3f}\")\n",
737
- "\n",
738
- " if not pairs:\n",
739
- " print(\" No training pairs collected, skipping round.\")\n",
740
- " continue\n",
741
- "\n",
742
- " reward_threshold = np.percentile(\n",
743
- " [p[\"reward\"] for p in pairs],\n",
744
- " (1 - TOP_K_FRACTION) * 100\n",
745
- " )\n",
746
- " filtered = [p for p in pairs if p[\"reward\"] >= reward_threshold]\n",
747
- " print(f\" Filtered to {len(filtered)}/{len(pairs)} samples (reward >= {reward_threshold:.3f})\")\n",
748
- "\n",
749
- " if not filtered:\n",
750
- " print(\" No samples above threshold, using all.\")\n",
751
- " filtered = pairs\n",
752
- "\n",
753
- " dataset = Dataset.from_list([{\"text\": p[\"text\"]} for p in filtered])\n",
754
- "\n",
755
- " output_dir = f\"./viraltest_checkpoints/round_{round_idx}\"\n",
756
- " sft_config = SFTConfig(\n",
757
- " output_dir=output_dir,\n",
758
- " num_train_epochs=2,\n",
759
- " per_device_train_batch_size=1,\n",
760
- " gradient_accumulation_steps=4,\n",
761
- " learning_rate=2e-5,\n",
762
- " warmup_steps=5,\n",
763
- " logging_steps=5,\n",
764
- " save_strategy=\"no\",\n",
765
- " max_seq_length=1024,\n",
766
- " fp16=True,\n",
767
- " report_to=\"none\",\n",
768
- " )\n",
769
- "\n",
770
- " print(f\" Training on {len(dataset)} samples...\")\n",
771
- " peft_model.train()\n",
772
- " trainer = SFTTrainer(\n",
773
- " model=peft_model,\n",
774
- " tokenizer=tokenizer,\n",
775
- " train_dataset=dataset,\n",
776
- " args=sft_config,\n",
777
- " )\n",
778
- " train_result = trainer.train()\n",
779
- " train_loss = train_result.training_loss\n",
780
- " print(f\" Training loss: {train_loss:.4f}\")\n",
781
- "\n",
782
- " training_log[\"round\"].append(round_idx)\n",
783
- " training_log[\"avg_episode_reward\"].append(avg_reward)\n",
784
- " training_log[\"max_episode_reward\"].append(max(episode_rewards))\n",
785
- " training_log[\"min_episode_reward\"].append(min(episode_rewards))\n",
786
- " training_log[\"n_training_samples\"].append(len(filtered))\n",
787
- " training_log[\"train_loss\"].append(train_loss)\n",
788
- "\n",
789
- "print(\"\\n\" + \"=\" * 60)\n",
790
- "print(\"TRAINING COMPLETE\")\n",
791
- "print(\"=\" * 60)\n",
792
- "\n",
793
- "train_df = pd.DataFrame(training_log)\n",
794
- "print(train_df.to_string(index=False))\n",
795
- "\n",
796
- "train_df.to_csv(PLOTS_DIR / \"training_log.csv\", index=False)\n",
797
- "print(f\"\\nSaved training log to {PLOTS_DIR / 'training_log.csv'}\")"
798
- ]
799
- },
800
- {
801
- "cell_type": "markdown",
802
- "metadata": {},
803
- "source": [
804
- "## Part 6: Post-Training Evaluation\n",
805
- "\n",
806
- "Run the trained model on all three tasks and compare with before-training scores."
807
- ]
808
- },
809
- {
810
- "cell_type": "code",
811
- "execution_count": null,
812
- "metadata": {},
813
- "outputs": [],
814
- "source": [
815
- "print(\"Running TRAINED model...\")\n",
816
- "print(\"=\" * 60)\n",
817
- "\n",
818
- "peft_model.eval()\n",
819
- "\n",
820
- "after_results = {}\n",
821
- "for task in TASKS:\n",
822
- " print(f\"\\nTask: {task}\")\n",
823
- " result = run_llm_episode(peft_model, tokenizer, task, seed=42, verbose=True)\n",
824
- " after_results[task] = result\n",
825
- " print(f\" => grader_score={result['grader_score']:.4f}, \"\n",
826
- " f\"total_reward={result['total_reward']:.3f}, \"\n",
827
- " f\"burned_out={result['burned_out']}\")\n",
828
- "\n",
829
- "print(\"\\n\" + \"=\" * 60)\n",
830
- "print(\"AFTER TRAINING SCORES\")\n",
831
- "print(\"=\" * 60)\n",
832
- "for task in TASKS:\n",
833
- " r = after_results[task]\n",
834
- " print(f\" {task}: grader={r['grader_score']:.4f} reward={r['total_reward']:.3f} energy={r['final_energy']:.2f}\")"
835
- ]
836
- },
837
- {
838
- "cell_type": "markdown",
839
- "metadata": {},
840
- "source": [
841
- "## Part 7: Result Plots — Real Training Evidence"
842
- ]
843
- },
844
- {
845
- "cell_type": "code",
846
- "execution_count": null,
847
- "metadata": {},
848
- "outputs": [],
849
- "source": [
850
- "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
851
- "\n",
852
- "rounds = training_log[\"round\"]\n",
853
- "axes[0].plot(rounds, training_log[\"avg_episode_reward\"], 'o-', color='#2196F3', linewidth=2, label='Avg reward')\n",
854
- "axes[0].fill_between(rounds, training_log[\"min_episode_reward\"], training_log[\"max_episode_reward\"],\n",
855
- " alpha=0.2, color='#2196F3', label='Min-Max range')\n",
856
- "axes[0].set_xlabel('Training Round', fontsize=12)\n",
857
- "axes[0].set_ylabel('Episode Reward', fontsize=12)\n",
858
- "axes[0].set_title('Training Reward Over Rounds', fontsize=13, fontweight='bold')\n",
859
- "axes[0].legend()\n",
860
- "axes[0].grid(True, alpha=0.3)\n",
861
- "\n",
862
- "axes[1].plot(rounds, training_log[\"train_loss\"], 's-', color='#E53935', linewidth=2)\n",
863
- "axes[1].set_xlabel('Training Round', fontsize=12)\n",
864
- "axes[1].set_ylabel('Training Loss', fontsize=12)\n",
865
- "axes[1].set_title('Training Loss Over Rounds', fontsize=13, fontweight='bold')\n",
866
- "axes[1].grid(True, alpha=0.3)\n",
867
- "\n",
868
- "fig.suptitle('Viraltest v2 — GRPO Training Progress', fontsize=14, fontweight='bold', y=1.02)\n",
869
- "fig.tight_layout()\n",
870
- "fig.savefig(PLOTS_DIR / 'reward_curve.png', dpi=150, bbox_inches='tight')\n",
871
- "plt.show()\n",
872
- "print(f\"Saved {PLOTS_DIR / 'reward_curve.png'}\")"
873
- ]
874
- },
875
- {
876
- "cell_type": "code",
877
- "execution_count": null,
878
- "metadata": {},
879
- "outputs": [],
880
- "source": [
881
- "task_labels = [t.replace('monthly_', '').title() for t in TASKS]\n",
882
- "before_scores = [before_results[t][\"grader_score\"] for t in TASKS]\n",
883
- "after_scores = [after_results[t][\"grader_score\"] for t in TASKS]\n",
884
- "smart_scores = [baseline_results[\"smart\"][t][\"grader_score\"] for t in TASKS]\n",
885
- "\n",
886
- "x = np.arange(len(TASKS))\n",
887
- "width = 0.25\n",
888
- "\n",
889
- "fig, ax = plt.subplots(figsize=(10, 6))\n",
890
- "bars1 = ax.bar(x - width, before_scores, width, label='Base Model (Before)', color='#FF9800')\n",
891
- "bars2 = ax.bar(x, after_scores, width, label='Trained Model (After)', color='#4CAF50')\n",
892
- "bars3 = ax.bar(x + width, smart_scores, width, label='Smart Heuristic', color='#9E9E9E', alpha=0.7)\n",
893
- "\n",
894
- "ax.set_ylabel('Grader Score', fontsize=12)\n",
895
- "ax.set_title('Before vs After Training — Grader Scores', fontsize=14, fontweight='bold')\n",
896
- "ax.set_xticks(x)\n",
897
- "ax.set_xticklabels(task_labels, fontsize=11)\n",
898
- "ax.legend(fontsize=10)\n",
899
- "ax.grid(True, alpha=0.3, axis='y')\n",
900
- "\n",
901
- "for bars in [bars1, bars2, bars3]:\n",
902
- " for bar in bars:\n",
903
- " height = bar.get_height()\n",
904
- " if height > 0:\n",
905
- " ax.text(bar.get_x() + bar.get_width()/2., height + 0.005,\n",
906
- " f'{height:.3f}', ha='center', va='bottom', fontsize=9)\n",
907
- "\n",
908
- "fig.tight_layout()\n",
909
- "fig.savefig(PLOTS_DIR / 'before_after.png', dpi=150, bbox_inches='tight')\n",
910
- "plt.show()\n",
911
- "print(f\"Saved {PLOTS_DIR / 'before_after.png'}\")"
912
- ]
913
- },
914
- {
915
- "cell_type": "code",
916
- "execution_count": null,
917
- "metadata": {},
918
- "outputs": [],
919
- "source": [
920
- "fig, axes = plt.subplots(2, 3, figsize=(16, 8))\n",
921
- "\n",
922
- "labels_and_data = [\n",
923
- " (\"Base Model\", before_results, '#FF9800'),\n",
924
- " (\"Trained Model\", after_results, '#4CAF50'),\n",
925
- "]\n",
926
- "\n",
927
- "for i, task in enumerate(TASKS):\n",
928
- " for label, results, color in labels_and_data:\n",
929
- " r = results[task]\n",
930
- " axes[0, i].plot(r[\"rewards\"], label=label, color=color, linewidth=1.5, alpha=0.9)\n",
931
- " axes[1, i].plot(r[\"energies\"], label=label, color=color, linewidth=1.5, alpha=0.9)\n",
932
- "\n",
933
- " smart_r = baseline_results[\"smart\"][task]\n",
934
- " axes[0, i].plot(smart_r[\"rewards\"], label=\"Smart Heuristic\", color='#9E9E9E',\n",
935
- " linewidth=1, alpha=0.5, linestyle='--')\n",
936
- " axes[1, i].plot(smart_r[\"energies\"], label=\"Smart Heuristic\", color='#9E9E9E',\n",
937
- " linewidth=1, alpha=0.5, linestyle='--')\n",
938
- "\n",
939
- " task_title = task.replace('monthly_', '').title()\n",
940
- " axes[0, i].set_title(f\"{task_title} — Daily Rewards\", fontsize=11)\n",
941
- " axes[0, i].set_xlabel(\"Day\")\n",
942
- " axes[0, i].set_ylabel(\"Reward\")\n",
943
- " axes[0, i].grid(True, alpha=0.3)\n",
944
- "\n",
945
- " axes[1, i].set_title(f\"{task_title} — Energy\", fontsize=11)\n",
946
- " axes[1, i].set_xlabel(\"Day\")\n",
947
- " axes[1, i].set_ylabel(\"Energy\")\n",
948
- " axes[1, i].grid(True, alpha=0.3)\n",
949
- "\n",
950
- "axes[0, 2].legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=9)\n",
951
- "fig.suptitle('Viraltest v2 — Before vs After Training Trajectories', fontsize=14, fontweight='bold', y=1.01)\n",
952
- "fig.tight_layout()\n",
953
- "fig.savefig(PLOTS_DIR / 'training_trajectories.png', dpi=150, bbox_inches='tight')\n",
954
- "plt.show()\n",
955
- "print(f\"Saved {PLOTS_DIR / 'training_trajectories.png'}\")"
956
- ]
957
- },
958
- {
959
- "cell_type": "markdown",
960
- "metadata": {},
961
- "source": [
962
- "## Part 8: Summary & Export"
963
- ]
964
- },
965
- {
966
- "cell_type": "code",
967
- "execution_count": null,
968
- "metadata": {},
969
- "outputs": [],
970
- "source": [
971
- "print(\"=\" * 70)\n",
972
- "print(\"FINAL RESULTS SUMMARY\")\n",
973
- "print(\"=\" * 70)\n",
974
- "print()\n",
975
- "print(f\"{'Task':<25s} {'Before':>10s} {'After':>10s} {'Delta':>10s} {'Smart':>10s}\")\n",
976
- "print(\"-\" * 67)\n",
977
- "for task in TASKS:\n",
978
- " b = before_results[task][\"grader_score\"]\n",
979
- " a = after_results[task][\"grader_score\"]\n",
980
- " s = baseline_results[\"smart\"][task][\"grader_score\"]\n",
981
- " delta = a - b\n",
982
- " print(f\"{task:<25s} {b:>10.4f} {a:>10.4f} {delta:>+10.4f} {s:>10.4f}\")\n",
983
- "\n",
984
- "avg_before = np.mean([before_results[t][\"grader_score\"] for t in TASKS])\n",
985
- "avg_after = np.mean([after_results[t][\"grader_score\"] for t in TASKS])\n",
986
- "avg_smart = np.mean([baseline_results[\"smart\"][t][\"grader_score\"] for t in TASKS])\n",
987
- "print(\"-\" * 67)\n",
988
- "print(f\"{'AVERAGE':<25s} {avg_before:>10.4f} {avg_after:>10.4f} {avg_after - avg_before:>+10.4f} {avg_smart:>10.4f}\")\n",
989
- "print()\n",
990
- "\n",
991
- "summary = {\n",
992
- " \"model\": MODEL_NAME,\n",
993
- " \"training_rounds\": NUM_ROUNDS,\n",
994
- " \"episodes_per_round\": EPISODES_PER_ROUND,\n",
995
- " \"before\": {t: before_results[t][\"grader_score\"] for t in TASKS},\n",
996
- " \"after\": {t: after_results[t][\"grader_score\"] for t in TASKS},\n",
997
- " \"smart_heuristic\": {t: baseline_results[\"smart\"][t][\"grader_score\"] for t in TASKS},\n",
998
- " \"improvement\": {t: after_results[t][\"grader_score\"] - before_results[t][\"grader_score\"] for t in TASKS},\n",
999
- " \"training_log\": training_log,\n",
1000
- "}\n",
1001
- "\n",
1002
- "with open(PLOTS_DIR / \"training_summary.json\", \"w\") as f:\n",
1003
- " json.dump(summary, f, indent=2)\n",
1004
- "\n",
1005
- "print(f\"Saved summary to {PLOTS_DIR / 'training_summary.json'}\")\n",
1006
- "print()\n",
1007
- "print(\"Plots saved:\")\n",
1008
- "for p in sorted(PLOTS_DIR.glob(\"*.png\")):\n",
1009
- " print(f\" {p}\")\n",
1010
- "print()\n",
1011
- "print(\"Training evidence is now real and reproducible.\")"
1012
- ]
1013
- },
1014
- {
1015
- "cell_type": "code",
1016
- "execution_count": null,
1017
- "metadata": {},
1018
- "outputs": [],
1019
- "source": [
1020
- "save_path = \"./viraltest_trained_adapter\"\n",
1021
- "peft_model.save_pretrained(save_path)\n",
1022
- "tokenizer.save_pretrained(save_path)\n",
1023
- "print(f\"Trained adapter saved to {save_path}\")\n",
1024
- "print(\"To load: model = AutoModelForCausalLM.from_pretrained(...); model = PeftModel.from_pretrained(model, save_path)\")"
1025
- ]
1026
- }
1027
- ],
1028
- "metadata": {
1029
- "kernelspec": {
1030
- "display_name": "Python 3",
1031
- "language": "python",
1032
- "name": "python3"
1033
- },
1034
- "language_info": {
1035
- "name": "python",
1036
- "version": "3.10.0"
1037
- }
1038
- },
1039
- "nbformat": 4,
1040
- "nbformat_minor": 4
1041
- }
 
1
  {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# Viraltest v2 — Real LLM Training with LoRA + Environment Rewards\n",
8
+ "\n",
9
+ "This notebook **actually trains** an LLM (Qwen2.5-1.5B-Instruct) to play our Instagram creator simulation.\n",
10
+ "\n",
11
+ "**Pipeline:**\n",
12
+ "1. Clone repo & install deps\n",
13
+ "2. Run 5 heuristic baselines × 3 tasks (15 runs) → leaderboard\n",
14
+ "3. Run **untrained** LLM on all 3 tasks \"before\" scores\n",
15
+ "4. **LoRA fine-tune** with reward-weighted SFT (4 rounds × 6 episodes = real weight updates)\n",
16
+ "5. Run **trained** LLM on all 3 tasks → \"after\" scores\n",
17
+ "6. Generate real plots from real numbers\n",
18
+ "\n",
19
+ "**Requirements:** Colab T4 GPU (free tier), ~45 min total.\n",
20
+ "\n",
21
+ "**What makes this real training:** LoRA adapter weights are actually updated via gradient descent. The model's behavior changes because its weights change, not because we edit the prompt."
22
+ ]
23
+ },
24
+ {
25
+ "cell_type": "code",
26
+ "metadata": {},
27
+ "source": [
28
+ "# Cell 1: Install dependencies\n",
29
+ "!pip install -q torch torchvision torchaudio\n",
30
+ "!pip install -q transformers>=4.40.0 accelerate peft>=0.10.0 trl>=0.8.0 datasets bitsandbytes\n",
31
+ "!pip install -q matplotlib pandas\n",
32
+ "!pip install -q pydantic httpx\n",
33
+ "!pip install -q \"openenv-core[core]>=0.2.2\""
34
+ ],
35
+ "execution_count": null,
36
+ "outputs": []
37
+ },
38
+ {
39
+ "cell_type": "code",
40
+ "metadata": {},
41
+ "source": [
42
+ "# Cell 2: Clone the repo and set up paths\n",
43
+ "import os, sys\n",
44
+ "REPO_DIR = \"/content/viral-posts-env\"\n",
45
+ "if not os.path.exists(REPO_DIR):\n",
46
+ " !git clone https://github.com/VaibhavKhandare/viral-posts-env.git {REPO_DIR}\n",
47
+ "os.chdir(REPO_DIR)\n",
48
+ "sys.path.insert(0, REPO_DIR)\n",
49
+ "\n",
50
+ "PLOTS_DIR = os.path.join(REPO_DIR, \"plots\")\n",
51
+ "os.makedirs(PLOTS_DIR, exist_ok=True)\n",
52
+ "print(f\"Working dir: {os.getcwd()}\")\n",
53
+ "print(f\"Plots dir: {PLOTS_DIR}\")"
54
+ ],
55
+ "execution_count": null,
56
+ "outputs": []
57
+ },
58
+ {
59
+ "cell_type": "code",
60
+ "metadata": {},
61
+ "source": [
62
+ "# Cell 3: Imports\n",
63
+ "import json, random, time, textwrap, copy\n",
64
+ "from pathlib import Path\n",
65
+ "from typing import Any, Dict, List, Optional, Tuple\n",
66
+ "from collections import defaultdict\n",
67
+ "\n",
68
+ "import matplotlib.pyplot as plt\n",
69
+ "import numpy as np\n",
70
+ "import pandas as pd\n",
71
+ "import torch\n",
72
+ "\n",
73
+ "from models import ScheduledAction, ToolCall, ViraltestAction\n",
74
+ "from server.viraltest_environment import (\n",
75
+ " ViraltestEnvironment, TAG_POOL, TASK_HORIZON,\n",
76
+ " TOPIC_CATEGORIES,\n",
77
+ ")\n",
78
+ "\n",
79
+ "ALL_TOPICS = [t for topics in TOPIC_CATEGORIES.values() for t in topics]\n",
80
+ "NICHES = list(TOPIC_CATEGORIES.keys())\n",
81
+ "CONTENT_TYPES = [\"reel\", \"carousel\", \"story\", \"text_post\"]\n",
82
+ "INTENTS = [\"send_bait\", \"save_bait\", \"watch_bait\", \"like_bait\"]\n",
83
+ "TASKS = [\"monthly_engage\", \"monthly_strategic\", \"monthly_competitive\"]\n",
84
+ "\n",
85
+ "print(f\"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU'}\")\n",
86
+ "print(f\"Tags: {len(TAG_POOL)}, Topics: {len(ALL_TOPICS)}, Horizon: {TASK_HORIZON} days\")"
87
+ ],
88
+ "execution_count": null,
89
+ "outputs": []
90
+ },
91
+ {
92
+ "cell_type": "markdown",
93
+ "metadata": {},
94
+ "source": [
95
+ "## Part 1: Heuristic Baselines\n",
96
+ "\n",
97
+ "5 scripted agents prove the environment differentiates skill levels."
98
+ ]
99
+ },
100
+ {
101
+ "cell_type": "code",
102
+ "metadata": {},
103
+ "source": [
104
+ "# Cell 4: Define heuristic agents + episode runner\n",
105
+ "_rng = random.Random(42)\n",
106
+ "\n",
107
+ "def plan_always_rest(obs_dict, day):\n",
108
+ " return ViraltestAction(scheduled_actions=[])\n",
109
+ "\n",
110
+ "def plan_spam(obs_dict, day):\n",
111
+ " return ViraltestAction(scheduled_actions=[\n",
112
+ " ScheduledAction(hour=h, action_type=\"post\", content_type=\"reel\",\n",
113
+ " topic=\"AI tools\", tags=[\"ai\"], intent=\"watch_bait\")\n",
114
+ " for h in range(24)])\n",
115
+ "\n",
116
+ "def plan_random(obs_dict, day):\n",
117
+ " actions = []\n",
118
+ " for h in range(24):\n",
119
+ " if _rng.random() < 0.1:\n",
120
+ " actions.append(ScheduledAction(\n",
121
+ " hour=h, action_type=\"post\",\n",
122
+ " content_type=_rng.choice(CONTENT_TYPES),\n",
123
+ " topic=_rng.choice(ALL_TOPICS),\n",
124
+ " tags=_rng.sample(TAG_POOL[:30], 3),\n",
125
+ " intent=_rng.choice(INTENTS)))\n",
126
+ " return ViraltestAction(scheduled_actions=actions)\n",
127
+ "\n",
128
+ "def plan_minimal(obs_dict, day):\n",
129
+ " return ViraltestAction(scheduled_actions=[\n",
130
+ " ScheduledAction(hour=12, action_type=\"post\", content_type=\"carousel\",\n",
131
+ " topic=ALL_TOPICS[day % len(ALL_TOPICS)],\n",
132
+ " tags=[TAG_POOL[i % len(TAG_POOL)] for i in range(day, day+3)],\n",
133
+ " intent=\"save_bait\")])\n",
134
+ "\n",
135
+ "def plan_smart(obs_dict, day):\n",
136
+ " return ViraltestAction(\n",
137
+ " tool_calls=[ToolCall(name=\"query_trends\",\n",
138
+ " arguments={\"niche\": NICHES[day % len(NICHES)]})] if day <= 3 else [],\n",
139
+ " scheduled_actions=[\n",
140
+ " ScheduledAction(hour=8, action_type=\"create_content\"),\n",
141
+ " ScheduledAction(hour=12, action_type=\"post\",\n",
142
+ " content_type=CONTENT_TYPES[(day*2)%4],\n",
143
+ " topic=ALL_TOPICS[(day*2)%len(ALL_TOPICS)],\n",
144
+ " tags=[TAG_POOL[(day*6+i)%len(TAG_POOL)] for i in range(3)],\n",
145
+ " intent=INTENTS[(day*2)%4]),\n",
146
+ " ScheduledAction(hour=19, action_type=\"post\",\n",
147
+ " content_type=CONTENT_TYPES[(day*2+1)%4],\n",
148
+ " topic=ALL_TOPICS[(day*2+1)%len(ALL_TOPICS)],\n",
149
+ " tags=[TAG_POOL[(day*6+3+i)%len(TAG_POOL)] for i in range(3)],\n",
150
+ " intent=INTENTS[(day*2+1)%4]),\n",
151
+ " ],\n",
152
+ " replies=[{\"post_hour\": 12, \"reply_hour\": 13}])\n",
153
+ "\n",
154
+ "BASELINE_AGENTS = {\n",
155
+ " \"always_rest\": plan_always_rest, \"spam\": plan_spam,\n",
156
+ " \"random\": plan_random, \"minimal\": plan_minimal, \"smart\": plan_smart,\n",
157
+ "}\n",
158
+ "\n",
159
+ "def run_episode(task, plan_fn, seed=42):\n",
160
+ " env = ViraltestEnvironment()\n",
161
+ " obs = env.reset(task=task, seed=seed)\n",
162
+ " obs_dict = obs.model_dump()\n",
163
+ " rewards, energies = [], [obs.creator_energy]\n",
164
+ " for day in range(1, TASK_HORIZON + 1):\n",
165
+ " action = plan_fn(obs_dict, day)\n",
166
+ " obs = env.step(action)\n",
167
+ " obs_dict = obs.model_dump()\n",
168
+ " rewards.append(obs.reward or 0.0)\n",
169
+ " energies.append(obs.creator_energy)\n",
170
+ " if obs.done: break\n",
171
+ " grader = (obs.metadata or {}).get(\"grader_score\", 0.0)\n",
172
+ " return {\"grader_score\": grader, \"total_reward\": sum(rewards),\n",
173
+ " \"steps\": len(rewards), \"final_energy\": obs.creator_energy,\n",
174
+ " \"follower_delta\": obs.follower_count - 10000,\n",
175
+ " \"burned_out\": obs.creator_energy <= 0,\n",
176
+ " \"rewards\": rewards, \"energies\": energies}\n",
177
+ "\n",
178
+ "print(\"Agents and episode runner defined.\")"
179
+ ],
180
+ "execution_count": null,
181
+ "outputs": []
182
+ },
183
+ {
184
+ "cell_type": "code",
185
+ "metadata": {},
186
+ "source": [
187
+ "# Cell 5: Run baselines\n",
188
+ "print(\"Running heuristic baselines (5 agents × 3 tasks)...\")\n",
189
+ "print(\"=\" * 70)\n",
190
+ "\n",
191
+ "baseline_results = {}\n",
192
+ "for name, fn in BASELINE_AGENTS.items():\n",
193
+ " baseline_results[name] = {}\n",
194
+ " for task in TASKS:\n",
195
+ " _rng = random.Random(42)\n",
196
+ " result = run_episode(task, fn, seed=42)\n",
197
+ " baseline_results[name][task] = result\n",
198
+ " print(f\" {name:>12s} | {task:>22s} | score={result['grader_score']:.4f} \"\n",
199
+ " f\"| energy={result['final_energy']:.2f}\")\n",
200
+ " print()\n",
201
+ "\n",
202
+ "print(\"\\nLEADERBOARD\")\n",
203
+ "print(f\"{'Agent':<14s} {'Engage':>10s} {'Strategic':>12s} {'Competitive':>14s} {'Avg':>8s}\")\n",
204
+ "print(\"-\" * 60)\n",
205
+ "for name in BASELINE_AGENTS:\n",
206
+ " scores = [baseline_results[name][t][\"grader_score\"] for t in TASKS]\n",
207
+ " print(f\"{name:<14s} {scores[0]:>10.4f} {scores[1]:>12.4f} {scores[2]:>14.4f} {sum(scores)/3:>8.4f}\")"
208
+ ],
209
+ "execution_count": null,
210
+ "outputs": []
211
+ },
212
+ {
213
+ "cell_type": "code",
214
+ "metadata": {},
215
+ "source": [
216
+ "# Cell 6: Baseline plots\n",
217
+ "fig, axes = plt.subplots(1, 3, figsize=(16, 5), sharey=True)\n",
218
+ "agent_names = list(BASELINE_AGENTS.keys())\n",
219
+ "colors = ['#E53935', '#FF9800', '#9E9E9E', '#42A5F5', '#4CAF50']\n",
220
+ "for i, task in enumerate(TASKS):\n",
221
+ " scores = [baseline_results[a][task][\"grader_score\"] for a in agent_names]\n",
222
+ " bars = axes[i].barh(agent_names, scores, color=colors)\n",
223
+ " axes[i].set_title(task.replace(\"monthly_\", \"\").title(), fontsize=13, fontweight='bold')\n",
224
+ " for bar, score in zip(bars, scores):\n",
225
+ " axes[i].text(bar.get_width() + 0.005, bar.get_y() + bar.get_height()/2,\n",
226
+ " f\"{score:.4f}\", va='center', fontsize=9)\n",
227
+ "axes[0].set_ylabel(\"Agent\")\n",
228
+ "fig.suptitle(\"Viraltest v2 — Heuristic Baseline Leaderboard\", fontsize=14, fontweight='bold')\n",
229
+ "fig.tight_layout()\n",
230
+ "fig.savefig(f\"{PLOTS_DIR}/baseline_leaderboard.png\", dpi=150, bbox_inches='tight')\n",
231
+ "plt.show()"
232
+ ],
233
+ "execution_count": null,
234
+ "outputs": []
235
+ },
236
+ {
237
+ "cell_type": "markdown",
238
+ "metadata": {},
239
+ "source": [
240
+ "## Part 2: Load LLM (Qwen2.5-1.5B-Instruct)\n",
241
+ "\n",
242
+ "We load the base model with 4-bit quantization to fit in free Colab's T4 GPU (16GB VRAM)."
243
+ ]
244
+ },
245
+ {
246
+ "cell_type": "code",
247
+ "metadata": {},
248
+ "source": [
249
+ "# Cell 7: Load model\n",
250
+ "from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig\n",
251
+ "\n",
252
+ "MODEL_NAME = \"Qwen/Qwen2.5-1.5B-Instruct\"\n",
253
+ "\n",
254
+ "bnb_config = BitsAndBytesConfig(\n",
255
+ " load_in_4bit=True,\n",
256
+ " bnb_4bit_quant_type=\"nf4\",\n",
257
+ " bnb_4bit_compute_dtype=torch.float16,\n",
258
+ " bnb_4bit_use_double_quant=True,\n",
259
+ ")\n",
260
+ "\n",
261
+ "print(f\"Loading {MODEL_NAME} (4-bit quantized)...\")\n",
262
+ "tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)\n",
263
+ "model = AutoModelForCausalLM.from_pretrained(\n",
264
+ " MODEL_NAME, trust_remote_code=True,\n",
265
+ " quantization_config=bnb_config,\n",
266
+ " device_map=\"auto\",\n",
267
+ ")\n",
268
+ "model.eval()\n",
269
+ "print(f\"Model loaded. Device: {model.device}\")\n",
270
+ "print(f\"Memory: {torch.cuda.memory_allocated()/1e9:.1f} GB\")"
271
+ ],
272
+ "execution_count": null,
273
+ "outputs": []
274
+ },
275
+ {
276
+ "cell_type": "code",
277
+ "metadata": {},
278
+ "source": [
279
+ "# Cell 8: LLM agent functions\n",
280
+ "SYSTEM_PROMPT = textwrap.dedent(\"\"\"\\\n",
281
+ "You are an Instagram content strategy agent. Each step is one day.\n",
282
+ "You manage a creator account over a 30-day cycle.\n",
283
+ "\n",
284
+ "RESPONSE FORMAT — return ONLY valid JSON, no markdown:\n",
285
+ "{\n",
286
+ " \"tool_calls\": [{\"name\": \"query_trends\", \"arguments\": {\"niche\": \"tech\"}}],\n",
287
+ " \"scheduled_actions\": [\n",
288
+ " {\"hour\": 12, \"action_type\": \"post\", \"content_type\": \"reel\",\n",
289
+ " \"topic\": \"AI tools\", \"tags\": [\"ai\", \"coding\"], \"intent\": \"watch_bait\"}\n",
290
+ " ],\n",
291
+ " \"replies\": [{\"post_hour\": 12, \"reply_hour\": 13}],\n",
292
+ " \"notes\": \"strategy notes\"\n",
293
+ "}\n",
294
+ "\n",
295
+ "RULES:\n",
296
+ "- content_type: reel|story|carousel|text_post\n",
297
+ "- intent: send_bait|save_bait|watch_bait|like_bait\n",
298
+ "- 1-2 posts/day optimal. More = fatigue.\n",
299
+ "- Empty scheduled_actions = rest (recovers energy).\n",
300
+ "- Vary content types and topics for diversity bonus.\n",
301
+ "- Reply within 90 min of post for reach bonus.\"\"\")\n",
302
+ "\n",
303
+ "\n",
304
+ "def format_obs(obs):\n",
305
+ " days = [\"Mon\", \"Tue\", \"Wed\", \"Thu\", \"Fri\", \"Sat\", \"Sun\"]\n",
306
+ " day_name = days[obs.day_of_week] if 0 <= obs.day_of_week < 7 else \"?\"\n",
307
+ " signals_str = \"\"\n",
308
+ " signals = getattr(obs, \"engagement_signals\", None)\n",
309
+ " if signals:\n",
310
+ " signals_str = (f\"Signals: watch={signals.watch_time:.3f} \"\n",
311
+ " f\"sends={signals.sends_per_reach:.3f} \"\n",
312
+ " f\"saves={signals.saves:.3f}\\n\")\n",
313
+ " tool_str = \"\"\n",
314
+ " for tr in getattr(obs, \"tool_results\", []):\n",
315
+ " if tr.success:\n",
316
+ " tool_str += f\" {tr.name}: {json.dumps(tr.data)[:200]}\\n\"\n",
317
+ " return (f\"Day: {day_name} | days_elapsed={obs.days_elapsed}\\n\"\n",
318
+ " f\"Energy: {obs.creator_energy:.2f} | Followers: {obs.follower_count}\\n\"\n",
319
+ " f\"Engagement: {obs.engagement_rate:.3f} | Queue: {obs.content_queue_size}\\n\"\n",
320
+ " f\"{signals_str}\"\n",
321
+ " f\"Tool results:\\n{tool_str if tool_str else ' (none)\\n'}\"\n",
322
+ " f\"Plan your actions (JSON only):\")\n",
323
+ "\n",
324
+ "\n",
325
+ "def parse_model_output(text):\n",
326
+ " text = text.strip()\n",
327
+ " if \"```\" in text:\n",
328
+ " lines = [l for l in text.split(\"\\n\") if not l.strip().startswith(\"```\")]\n",
329
+ " text = \"\\n\".join(lines).strip()\n",
330
+ " start, end = text.find(\"{\"), text.rfind(\"}\") + 1\n",
331
+ " if start >= 0 and end > start:\n",
332
+ " text = text[start:end]\n",
333
+ " try:\n",
334
+ " data = json.loads(text)\n",
335
+ " tool_calls = [ToolCall(name=tc[\"name\"], arguments=tc.get(\"arguments\", {}))\n",
336
+ " for tc in data.get(\"tool_calls\", []) if isinstance(tc, dict) and \"name\" in tc]\n",
337
+ " scheduled = []\n",
338
+ " for a in data.get(\"scheduled_actions\", []):\n",
339
+ " try: scheduled.append(ScheduledAction(**a))\n",
340
+ " except: pass\n",
341
+ " return ViraltestAction(tool_calls=tool_calls, scheduled_actions=scheduled,\n",
342
+ " replies=data.get(\"replies\", []), notes=data.get(\"notes\"))\n",
343
+ " except:\n",
344
+ " return ViraltestAction(scheduled_actions=[])\n",
345
+ "\n",
346
+ "\n",
347
+ "def generate_action(mdl, tok, obs, history, temperature=0.7):\n",
348
+ " prompt = format_obs(obs)\n",
349
+ " messages = [{\"role\": \"system\", \"content\": SYSTEM_PROMPT}]\n",
350
+ " messages.extend(history[-4:])\n",
351
+ " messages.append({\"role\": \"user\", \"content\": prompt})\n",
352
+ " text_input = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\n",
353
+ " inputs = tok(text_input, return_tensors=\"pt\").to(mdl.device)\n",
354
+ " with torch.no_grad():\n",
355
+ " out = mdl.generate(**inputs, max_new_tokens=512, temperature=temperature,\n",
356
+ " do_sample=True, top_p=0.9, pad_token_id=tok.eos_token_id)\n",
357
+ " resp = tok.decode(out[0][inputs[\"input_ids\"].shape[1]:], skip_special_tokens=True)\n",
358
+ " return resp, parse_model_output(resp)\n",
359
+ "\n",
360
+ "\n",
361
+ "def run_llm_episode(mdl, tok, task, seed=42, verbose=False):\n",
362
+ " env = ViraltestEnvironment()\n",
363
+ " obs = env.reset(task=task, seed=seed)\n",
364
+ " rewards, energies = [], [obs.creator_energy]\n",
365
+ " history, pairs = [], []\n",
366
+ " for day in range(1, TASK_HORIZON + 1):\n",
367
+ " if obs.done: break\n",
368
+ " if obs.creator_energy <= 0.25:\n",
369
+ " action = ViraltestAction(scheduled_actions=[])\n",
370
+ " resp = '{\"scheduled_actions\": []}'\n",
371
+ " else:\n",
372
+ " resp, action = generate_action(mdl, tok, obs, history)\n",
373
+ " prompt = format_obs(obs)\n",
374
+ " pairs.append({\"prompt\": prompt, \"response\": resp})\n",
375
+ " obs = env.step(action)\n",
376
+ " r = obs.reward or 0.0\n",
377
+ " rewards.append(r)\n",
378
+ " energies.append(obs.creator_energy)\n",
379
+ " history.extend([{\"role\": \"user\", \"content\": prompt},\n",
380
+ " {\"role\": \"assistant\", \"content\": resp}])\n",
381
+ " if verbose:\n",
382
+ " n_p = len([s for s in action.scheduled_actions if s.action_type==\"post\"])\n",
383
+ " print(f\" Day {day:2d}: r={r:.4f} e={obs.creator_energy:.2f} posts={n_p} tools={len(action.tool_calls)}\")\n",
384
+ " if obs.done: break\n",
385
+ " gs = (obs.metadata or {}).get(\"grader_score\", 0.0)\n",
386
+ " return {\"task\": task, \"grader_score\": gs, \"total_reward\": sum(rewards),\n",
387
+ " \"final_energy\": obs.creator_energy, \"rewards\": rewards,\n",
388
+ " \"energies\": energies, \"pairs\": pairs,\n",
389
+ " \"follower_delta\": obs.follower_count - 10000,\n",
390
+ " \"burned_out\": obs.creator_energy <= 0}\n",
391
+ "\n",
392
+ "print(\"LLM agent functions defined.\")"
393
+ ],
394
+ "execution_count": null,
395
+ "outputs": []
396
+ },
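+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Optional parser check, illustrative inputs only: well-formed JSON (matching the schema in the system prompt) should yield scheduled actions, while free-text chatter should fall back to an empty rest-day action rather than crash an episode."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "metadata": {},
+ "source": [
+ "# Cell 8b (optional): exercise parse_model_output's happy path and fallback path\n",
+ "good = '{\"scheduled_actions\": [{\"hour\": 12, \"action_type\": \"post\", \"content_type\": \"reel\", \"topic\": \"AI tools\", \"tags\": [\"ai\"], \"intent\": \"watch_bait\"}]}'\n",
+ "bad = 'Sure! I will post a reel around noon.'\n",
+ "for label, raw in [(\"good\", good), (\"bad\", bad)]:\n",
+ " parsed = parse_model_output(raw)\n",
+ " print(f\"{label}: {len(parsed.scheduled_actions)} scheduled action(s)\")"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },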
397
+ {
398
+ "cell_type": "markdown",
399
+ "metadata": {},
400
+ "source": [
401
+ "## Part 3: Untrained LLM Baseline (“Before”)\n",
402
+ "\n",
403
+ "Run the base model with NO fine-tuning. This establishes ground truth."
404
+ ]
405
+ },
406
+ {
407
+ "cell_type": "code",
408
+ "metadata": {},
409
+ "source": [
410
+ "# Cell 9: Run untrained model\n",
411
+ "print(\"Running UNTRAINED base model on all tasks...\")\n",
412
+ "print(\"=\" * 60)\n",
413
+ "\n",
414
+ "before_results = {}\n",
415
+ "for task in TASKS:\n",
416
+ " print(f\"\\n Task: {task}\")\n",
417
+ " result = run_llm_episode(model, tokenizer, task, seed=42, verbose=True)\n",
418
+ " before_results[task] = result\n",
419
+ " print(f\" => grader={result['grader_score']:.4f} reward={result['total_reward']:.3f}\")\n",
420
+ "\n",
421
+ "print(\"\\n\" + \"=\" * 60)\n",
422
+ "print(\"BEFORE TRAINING:\")\n",
423
+ "for t in TASKS:\n",
424
+ " print(f\" {t}: grader={before_results[t]['grader_score']:.4f}\")"
425
+ ],
426
+ "execution_count": null,
427
+ "outputs": []
428
+ },
429
+ {
430
+ "cell_type": "markdown",
431
+ "metadata": {},
432
+ "source": [
433
+ "## Part 4: LoRA Fine-Tuning (Real Weight Updates)\n",
434
+ "\n",
435
+ "This is the core training loop. For each round:\n",
436
+ "1. Collect episodes with current model\n",
437
+ "2. Score each (prompt, response) pair by episode reward\n",
438
+ "3. Keep top 50% highest-reward samples\n",
439
+ "4. Fine-tune LoRA weights via SFT on those samples\n",
440
+ "\n",
441
+ "The model's actual weights change via gradient descent — this is real training."
442
+ ]
443
+ },
444
+ {
445
+ "cell_type": "code",
446
+ "metadata": {},
447
+ "source": [
448
+ "# Cell 10: Attach LoRA adapter\n",
449
+ "from peft import LoraConfig, get_peft_model, TaskType\n",
450
+ "\n",
451
+ "lora_config = LoraConfig(\n",
452
+ " r=16, lora_alpha=32, lora_dropout=0.05,\n",
453
+ " target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n",
454
+ " \"gate_proj\", \"up_proj\", \"down_proj\"],\n",
455
+ " task_type=TaskType.CAUSAL_LM, bias=\"none\",\n",
456
+ ")\n",
457
+ "\n",
458
+ "model.enable_input_require_grads()\n",
459
+ "peft_model = get_peft_model(model, lora_config)\n",
460
+ "peft_model.print_trainable_parameters()"
461
+ ],
462
+ "execution_count": null,
463
+ "outputs": []
464
+ },
465
+ {
466
+ "cell_type": "code",
467
+ "metadata": {},
468
+ "source": [
469
+ "# Cell 11: Training loop\n",
470
+ "from trl import SFTTrainer, SFTConfig\n",
471
+ "from datasets import Dataset\n",
472
+ "\n",
473
+ "NUM_ROUNDS = 4\n",
474
+ "EPISODES_PER_ROUND = 6\n",
475
+ "TOP_K_FRACTION = 0.5\n",
476
+ "\n",
477
+ "training_log = {\n",
478
+ " \"round\": [], \"avg_episode_reward\": [], \"max_episode_reward\": [],\n",
479
+ " \"min_episode_reward\": [], \"avg_grader\": [], \"max_grader\": [],\n",
480
+ " \"n_training_samples\": [], \"train_loss\": [],\n",
481
+ "}\n",
482
+ "\n",
483
+ "t_start = time.time()\n",
484
+ "\n",
485
+ "for round_idx in range(1, NUM_ROUNDS + 1):\n",
486
+ " print(f\"\\n{'=' * 60}\")\n",
487
+ " print(f\"TRAINING ROUND {round_idx}/{NUM_ROUNDS}\")\n",
488
+ " print(f\"{'=' * 60}\")\n",
489
+ "\n",
490
+ " # Collect episodes\n",
491
+ " peft_model.eval()\n",
492
+ " all_pairs, episode_rewards, episode_graders = [], [], []\n",
493
+ "\n",
494
+ " for ep in range(EPISODES_PER_ROUND):\n",
495
+ " task = TASKS[ep % len(TASKS)]\n",
496
+ " seed = 42 + (round_idx - 1) * 100 + ep\n",
497
+ " result = run_llm_episode(peft_model, tokenizer, task, seed=seed)\n",
498
+ " ep_reward = result[\"total_reward\"] + 2.0 * result[\"grader_score\"]\n",
499
+ " episode_rewards.append(ep_reward)\n",
500
+ " episode_graders.append(result[\"grader_score\"])\n",
501
+ "\n",
502
+ " for pr in result[\"pairs\"]:\n",
503
+ " text = (f\"<|im_start|>system\\n{SYSTEM_PROMPT}<|im_end|>\\n\"\n",
504
+ " f\"<|im_start|>user\\n{pr['prompt']}<|im_end|>\\n\"\n",
505
+ " f\"<|im_start|>assistant\\n{pr['response']}<|im_end|>\")\n",
506
+ " all_pairs.append({\"text\": text, \"reward\": ep_reward})\n",
507
+ "\n",
508
+ " print(f\" ep {ep+1}/{EPISODES_PER_ROUND}: {task.split('_')[-1]:>11s} \"\n",
509
+ " f\"grader={result['grader_score']:.4f} reward={ep_reward:.3f}\")\n",
510
+ "\n",
511
+ " avg_r = np.mean(episode_rewards)\n",
512
+ " avg_g = np.mean(episode_graders)\n",
513
+ " print(f\" Avg reward={avg_r:.3f} Avg grader={avg_g:.4f}\")\n",
514
+ "\n",
515
+ " # Filter to top-K\n",
516
+ " threshold = np.percentile([p[\"reward\"] for p in all_pairs], (1 - TOP_K_FRACTION) * 100)\n",
517
+ " filtered = [p for p in all_pairs if p[\"reward\"] >= threshold] or all_pairs\n",
518
+ " print(f\" Filtered to {len(filtered)}/{len(all_pairs)} samples\")\n",
519
+ "\n",
520
+ " dataset = Dataset.from_list([{\"text\": p[\"text\"]} for p in filtered])\n",
521
+ "\n",
522
+ " # SFT training (real gradient updates)\n",
523
+ " sft_config = SFTConfig(\n",
524
+ " output_dir=f\"./checkpoints/round_{round_idx}\",\n",
525
+ " num_train_epochs=2,\n",
526
+ " per_device_train_batch_size=1,\n",
527
+ " gradient_accumulation_steps=4,\n",
528
+ " learning_rate=2e-5,\n",
529
+ " warmup_steps=5,\n",
530
+ " logging_steps=5,\n",
531
+ " save_strategy=\"no\",\n",
532
+ " max_seq_length=1024,\n",
533
+ " fp16=True,\n",
534
+ " report_to=\"none\",\n",
535
+ " )\n",
536
+ "\n",
537
+ " peft_model.train()\n",
538
+ " trainer = SFTTrainer(\n",
539
+ " model=peft_model, tokenizer=tokenizer,\n",
540
+ " train_dataset=dataset, args=sft_config,\n",
541
+ " )\n",
542
+ " train_result = trainer.train()\n",
543
+ " loss = train_result.training_loss\n",
544
+ " print(f\" Training loss: {loss:.4f}\")\n",
545
+ "\n",
546
+ " training_log[\"round\"].append(round_idx)\n",
547
+ " training_log[\"avg_episode_reward\"].append(round(float(avg_r), 3))\n",
548
+ " training_log[\"max_episode_reward\"].append(round(float(max(episode_rewards)), 3))\n",
549
+ " training_log[\"min_episode_reward\"].append(round(float(min(episode_rewards)), 3))\n",
550
+ " training_log[\"avg_grader\"].append(round(float(avg_g), 4))\n",
551
+ " training_log[\"max_grader\"].append(round(float(max(episode_graders)), 4))\n",
552
+ " training_log[\"n_training_samples\"].append(len(filtered))\n",
553
+ " training_log[\"train_loss\"].append(round(loss, 4))\n",
554
+ "\n",
555
+ "elapsed = time.time() - t_start\n",
556
+ "print(f\"\\nTraining complete in {elapsed/60:.1f} min\")\n",
557
+ "print(pd.DataFrame(training_log).to_string(index=False))"
558
+ ],
559
+ "execution_count": null,
560
+ "outputs": []
561
+ },
562
+ {
563
+ "cell_type": "markdown",
564
+ "metadata": {},
565
+ "source": [
566
+ "## Part 5: Trained LLM Evaluation (“After”)\n",
567
+ "\n",
568
+ "Same model, same seeds, same environment — but now with updated LoRA weights."
569
+ ]
570
+ },
571
+ {
572
+ "cell_type": "code",
573
+ "metadata": {},
574
+ "source": [
575
+ "# Cell 12: Run trained model\n",
576
+ "print(\"Running TRAINED model on all tasks...\")\n",
577
+ "print(\"=\" * 60)\n",
578
+ "\n",
579
+ "peft_model.eval()\n",
580
+ "after_results = {}\n",
581
+ "for task in TASKS:\n",
582
+ " print(f\"\\n Task: {task}\")\n",
583
+ " result = run_llm_episode(peft_model, tokenizer, task, seed=42, verbose=True)\n",
584
+ " after_results[task] = result\n",
585
+ " print(f\" => grader={result['grader_score']:.4f} reward={result['total_reward']:.3f}\")\n",
586
+ "\n",
587
+ "print(\"\\n\" + \"=\" * 60)\n",
588
+ "print(\"AFTER TRAINING:\")\n",
589
+ "for t in TASKS:\n",
590
+ " print(f\" {t}: grader={after_results[t]['grader_score']:.4f}\")"
591
+ ],
592
+ "execution_count": null,
593
+ "outputs": []
594
+ },
595
+ {
596
+ "cell_type": "markdown",
597
+ "metadata": {},
598
+ "source": [
599
+ "## Part 6: Result Plots — Real Training Evidence"
600
+ ]
601
+ },
602
+ {
603
+ "cell_type": "code",
604
+ "metadata": {},
605
+ "source": [
606
+ "# Cell 13: Training curves\n",
607
+ "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
608
+ "rounds = training_log[\"round\"]\n",
609
+ "\n",
610
+ "axes[0].plot(rounds, training_log[\"avg_grader\"], 'o-', color='#2196F3', lw=2, label='Avg grader')\n",
611
+ "axes[0].fill_between(rounds, training_log[\"avg_grader\"],\n",
612
+ " training_log[\"max_grader\"], alpha=0.2, color='#2196F3')\n",
613
+ "axes[0].set_xlabel('Round'); axes[0].set_ylabel('Grader Score')\n",
614
+ "axes[0].set_title('Grader Score Over Rounds', fontweight='bold')\n",
615
+ "axes[0].legend(); axes[0].grid(True, alpha=0.3)\n",
616
+ "\n",
617
+ "axes[1].plot(rounds, training_log[\"train_loss\"], 's-', color='#E53935', lw=2)\n",
618
+ "axes[1].set_xlabel('Round'); axes[1].set_ylabel('Loss')\n",
619
+ "axes[1].set_title('Training Loss', fontweight='bold')\n",
620
+ "axes[1].grid(True, alpha=0.3)\n",
621
+ "\n",
622
+ "fig.suptitle('Viraltest v2 — LoRA Training Progress (Qwen 1.5B)', fontsize=14, fontweight='bold')\n",
623
+ "fig.tight_layout()\n",
624
+ "fig.savefig(f'{PLOTS_DIR}/reward_curve.png', dpi=150, bbox_inches='tight')\n",
625
+ "plt.show()"
626
+ ],
627
+ "execution_count": null,
628
+ "outputs": []
629
+ },
630
+ {
631
+ "cell_type": "code",
632
+ "metadata": {},
633
+ "source": [
634
+ "# Cell 14: Before vs After\n",
635
+ "task_labels = [t.replace('monthly_', '').title() for t in TASKS]\n",
636
+ "x = np.arange(len(TASKS))\n",
637
+ "w = 0.25\n",
638
+ "\n",
639
+ "fig, ax = plt.subplots(figsize=(10, 6))\n",
640
+ "b_scores = [before_results[t][\"grader_score\"] for t in TASKS]\n",
641
+ "a_scores = [after_results[t][\"grader_score\"] for t in TASKS]\n",
642
+ "s_scores = [baseline_results[\"smart\"][t][\"grader_score\"] for t in TASKS]\n",
643
+ "\n",
644
+ "ax.bar(x - w, b_scores, w, label='Base Model (Before)', color='#FF9800')\n",
645
+ "ax.bar(x, a_scores, w, label='LoRA Trained (After)', color='#4CAF50')\n",
646
+ "ax.bar(x + w, s_scores, w, label='Smart Heuristic', color='#9E9E9E', alpha=0.7)\n",
647
+ "\n",
648
+ "ax.set_ylabel('Grader Score'); ax.set_xticks(x); ax.set_xticklabels(task_labels)\n",
649
+ "ax.set_title('Before vs After LoRA Training — Grader Scores', fontsize=14, fontweight='bold')\n",
650
+ "ax.legend(); ax.grid(True, alpha=0.3, axis='y')\n",
651
+ "\n",
652
+ "for container in ax.containers:\n",
653
+ " for bar in container:\n",
654
+ " h = bar.get_height()\n",
655
+ " if h > 0:\n",
656
+ " ax.text(bar.get_x() + bar.get_width()/2., h + 0.005,\n",
657
+ " f'{h:.4f}', ha='center', va='bottom', fontsize=9)\n",
658
+ "\n",
659
+ "fig.tight_layout()\n",
660
+ "fig.savefig(f'{PLOTS_DIR}/before_after.png', dpi=150, bbox_inches='tight')\n",
661
+ "plt.show()"
662
+ ],
663
+ "execution_count": null,
664
+ "outputs": []
665
+ },
666
+ {
667
+ "cell_type": "code",
668
+ "metadata": {},
669
+ "source": [
670
+ "# Cell 15: Trajectory comparison\n",
671
+ "fig, axes = plt.subplots(2, 3, figsize=(16, 8))\n",
672
+ "comparisons = [\n",
673
+ " (\"Base Model\", before_results, '#FF9800', '--'),\n",
674
+ " (\"LoRA Trained\", after_results, '#4CAF50', '-'),\n",
675
+ "]\n",
676
+ "for i, task in enumerate(TASKS):\n",
677
+ " for label, res, color, ls in comparisons:\n",
678
+ " lw = 2.5 if 'Trained' in label else 1.5\n",
679
+ " axes[0, i].plot(res[task][\"rewards\"], label=label, color=color, lw=lw, ls=ls)\n",
680
+ " axes[1, i].plot(res[task][\"energies\"], label=label, color=color, lw=lw, ls=ls)\n",
681
+ " sr = baseline_results[\"smart\"][task]\n",
682
+ " axes[0, i].plot(sr[\"rewards\"], label=\"Smart\", color='#9E9E9E', lw=1, ls=':')\n",
683
+ " axes[1, i].plot(sr[\"energies\"], label=\"Smart\", color='#9E9E9E', lw=1, ls=':')\n",
684
+ " t_name = task.replace('monthly_', '').title()\n",
685
+ " axes[0, i].set_title(f\"{t_name} — Rewards\"); axes[0, i].grid(True, alpha=0.3)\n",
686
+ " axes[1, i].set_title(f\"{t_name} — Energy\"); axes[1, i].grid(True, alpha=0.3)\n",
687
+ "axes[0, 2].legend(bbox_to_anchor=(1.05, 1), loc='upper left')\n",
688
+ "fig.suptitle('Before vs After — Daily Trajectories', fontsize=14, fontweight='bold', y=1.01)\n",
689
+ "fig.tight_layout()\n",
690
+ "fig.savefig(f'{PLOTS_DIR}/training_trajectories.png', dpi=150, bbox_inches='tight')\n",
691
+ "plt.show()"
692
+ ],
693
+ "execution_count": null,
694
+ "outputs": []
695
+ },
696
+ {
697
+ "cell_type": "markdown",
698
+ "metadata": {},
699
+ "source": [
700
+ "## Part 7: Summary & Export"
701
+ ]
702
+ },
703
+ {
704
+ "cell_type": "code",
705
+ "metadata": {},
706
+ "source": [
707
+ "# Cell 16: Final summary\n",
708
+ "print(\"=\" * 67)\n",
709
+ "print(\"FINAL RESULTS\")\n",
710
+ "print(\"=\" * 67)\n",
711
+ "print(f\"\\n{'Task':<25s} {'Before':>10s} {'After':>10s} {'Delta':>10s} {'Smart':>10s}\")\n",
712
+ "print(\"-\" * 67)\n",
713
+ "for task in TASKS:\n",
714
+ " b = before_results[task][\"grader_score\"]\n",
715
+ " a = after_results[task][\"grader_score\"]\n",
716
+ " s = baseline_results[\"smart\"][task][\"grader_score\"]\n",
717
+ " print(f\"{task:<25s} {b:>10.4f} {a:>10.4f} {a-b:>+10.4f} {s:>10.4f}\")\n",
718
+ "\n",
719
+ "avg_b = np.mean([before_results[t][\"grader_score\"] for t in TASKS])\n",
720
+ "avg_a = np.mean([after_results[t][\"grader_score\"] for t in TASKS])\n",
721
+ "avg_s = np.mean([baseline_results[\"smart\"][t][\"grader_score\"] for t in TASKS])\n",
722
+ "print(\"-\" * 67)\n",
723
+ "print(f\"{'AVERAGE':<25s} {avg_b:>10.4f} {avg_a:>10.4f} {avg_a-avg_b:>+10.4f} {avg_s:>10.4f}\")\n",
724
+ "\n",
725
+ "summary = {\n",
726
+ " \"model\": MODEL_NAME,\n",
727
+ " \"training\": \"LoRA SFT (real weight updates)\",\n",
728
+ " \"rounds\": NUM_ROUNDS, \"episodes_per_round\": EPISODES_PER_ROUND,\n",
729
+ " \"before\": {t: before_results[t][\"grader_score\"] for t in TASKS},\n",
730
+ " \"after\": {t: after_results[t][\"grader_score\"] for t in TASKS},\n",
731
+ " \"smart_heuristic\": {t: baseline_results[\"smart\"][t][\"grader_score\"] for t in TASKS},\n",
732
+ " \"improvement\": {t: after_results[t][\"grader_score\"] - before_results[t][\"grader_score\"] for t in TASKS},\n",
733
+ " \"training_log\": training_log,\n",
734
+ "}\n",
735
+ "with open(f\"{PLOTS_DIR}/training_summary.json\", \"w\") as f:\n",
736
+ " json.dump(summary, f, indent=2)\n",
737
+ "\n",
738
+ "pd.DataFrame(training_log).to_csv(f\"{PLOTS_DIR}/training_log.csv\", index=False)\n",
739
+ "\n",
740
+ "print(f\"\\nSaved to {PLOTS_DIR}/\")\n",
741
+ "print(\"All results are from real LoRA weight updates on real environment runs.\")"
742
+ ],
743
+ "execution_count": null,
744
+ "outputs": []
745
+ },
746
+ {
747
+ "cell_type": "code",
748
+ "metadata": {},
749
+ "source": [
750
+ "# Cell 17: Save adapter\n",
751
+ "save_path = \"./viraltest_trained_adapter\"\n",
752
+ "peft_model.save_pretrained(save_path)\n",
753
+ "tokenizer.save_pretrained(save_path)\n",
754
+ "print(f\"LoRA adapter saved to {save_path}\")\n",
755
+ "print(\"Load with: PeftModel.from_pretrained(base_model, save_path)\")"
756
+ ],
757
+ "execution_count": null,
758
+ "outputs": []
759
+ }
760
+ ],
761
+ "metadata": {
762
+ "kernelspec": {
763
+ "display_name": "Python 3",
764
+ "language": "python",
765
+ "name": "python3"
766
+ },
767
+ "language_info": {
768
+ "name": "python",
769
+ "version": "3.10.0"
770
+ },
771
+ "accelerator": "GPU",
772
+ "gpuClass": "standard"
773
  },
774
+ "nbformat": 4,
775
+ "nbformat_minor": 4
776
+ }