muskan singh committed on
Commit e22f664 · 1 Parent(s): 9e29238

training notebook with training logs

Files changed (1)
  1. training/grpo_orgos.ipynb +199 -140
training/grpo_orgos.ipynb CHANGED
@@ -21,7 +21,7 @@
21
  "4. GRPO computes relative advantages within the group (which action did better than average?)\n",
22
  "5. Model is updated to favour higher-reward actions\n",
23
  "\n",
24
- "**Key training signal:** Schema drift creates a sharp reward gap.\n",
25
  "Using a stale field name (e.g. `priority` when schema says `severity`) → **−0.20**. \n",
26
  "Using the correct drifted name → **+0.10** adaptation bonus. \n",
27
  "The model learns to read `schema_hints` before constructing action args."
@@ -77,10 +77,52 @@
77
  },
78
  {
79
  "cell_type": "markdown",
80
- "id": "sec3",
81
  "metadata": {},
82
  "source": [
83
- "## 3. Start the OrgOS Environment Server"
84
  ]
85
  },
86
  {
@@ -101,15 +143,15 @@
101
  "\n",
102
  "health = httpx.get(\"http://localhost:8000/health\").json()\n",
103
  "assert health[\"status\"] == \"healthy\", f\"Server not healthy: {health}\"\n",
104
- "print(\"OrgOS server running:\", health)"
105
  ]
106
  },
107
  {
108
  "cell_type": "markdown",
109
- "id": "sec4",
110
  "metadata": {},
111
  "source": [
112
- "## 4. Load Model with Unsloth 4-bit LoRA"
113
  ]
114
  },
115
  {
@@ -124,6 +166,7 @@
124
  "\n",
125
  "MAX_SEQ_LEN = 2048\n",
126
  "MODEL_NAME = \"Qwen/Qwen2.5-3B-Instruct\"\n",
127
  "\n",
128
  "model, tokenizer = FastLanguageModel.from_pretrained(\n",
129
  " model_name = MODEL_NAME,\n",
@@ -134,31 +177,27 @@
134
  "\n",
135
  "model = FastLanguageModel.get_peft_model(\n",
136
  " model,\n",
137
- " r = 16,\n",
138
  " target_modules = [\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n",
139
  " \"gate_proj\", \"up_proj\", \"down_proj\"],\n",
140
- " lora_alpha = 16,\n",
141
  " lora_dropout = 0,\n",
142
  " bias = \"none\",\n",
143
  " use_gradient_checkpointing = \"unsloth\",\n",
144
  " random_state = 42,\n",
145
  ")\n",
146
  "trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)\n",
147
- "print(f\"Model loaded trainable params: {trainable:,}\")"
148
  ]
149
  },
150
  {
151
  "cell_type": "markdown",
152
- "id": "sec5",
153
  "metadata": {},
154
  "source": [
155
- "## 5. Prompt Dataset\n",
156
- "\n",
157
- "We collect **first-turn observations** from fresh episode resets as our prompt dataset.\n",
158
- "These are the most important turns — they contain `schema_hints`, `active_rules`, and the\n",
159
- "full workflow goal. The model must learn to read schema hints and produce a correct first action.\n",
160
- "\n",
161
- "During GRPO training, the reward function will reset the env and evaluate each generated action live."
162
  ]
163
  },
164
  {
@@ -168,7 +207,9 @@
168
  "metadata": {},
169
  "outputs": [],
170
  "source": [
171
- "import json\n",
172
  "from datasets import Dataset\n",
173
  "\n",
174
  "SYSTEM_PROMPT = \"\"\"\\\n",
@@ -209,6 +250,8 @@
209
  "6. Stop when pending_steps is empty or done=true.\n",
210
  "\"\"\"\n",
211
  "\n",
212
  "\n",
213
  "def obs_to_text(obs: dict) -> str:\n",
214
  " hints = obs.get(\"schema_hints\", {})\n",
@@ -243,23 +286,34 @@
243
  "\n",
244
  "\n",
245
  "def build_prompt(obs_text: str) -> str:\n",
246
- " \"\"\"Format as a chat prompt with system injected into first user message.\"\"\"\n",
247
  " messages = [{\"role\": \"user\", \"content\": SYSTEM_PROMPT + \"\\n\\n---\\n\\n\" + obs_text}]\n",
248
  " return tokenizer.apply_chat_template(\n",
249
  " messages, tokenize=False, add_generation_prompt=True\n",
250
  " )\n",
251
  "\n",
252
  "\n",
253
- "# Collect first-turn observations across all 3 workflows, multiple episodes\n",
254
- "# Each episode has a different schema version (seed varies) so we get diverse prompts\n",
255
  "N_PROMPTS_PER_WORKFLOW = 20\n",
256
  "prompt_rows = []\n",
257
  "\n",
258
  "print(\"Collecting prompts from env resets...\")\n",
259
  "for wf in [\"A\", \"B\", \"C\"]:\n",
260
  " for _ in range(N_PROMPTS_PER_WORKFLOW):\n",
261
- " result = httpx.post(\"http://localhost:8000/reset\", json={\"workflow_id\": wf}).json()\n",
262
- " obs = result[\"observation\"]\n",
263
  " obs_text = obs_to_text(obs)\n",
264
  " prompt_rows.append({\n",
265
  " \"prompt\": build_prompt(obs_text),\n",
@@ -268,25 +322,17 @@
268
  " })\n",
269
  "\n",
270
  "prompt_dataset = Dataset.from_list(prompt_rows)\n",
271
- "print(f\"Prompt dataset: {len(prompt_dataset)} examples across 3 workflows\")\n",
272
- "print(\"Sample prompt (truncated):\\n\", prompt_rows[0][\"prompt\"][:600], \"...\")"
273
  ]
274
  },
275
  {
276
  "cell_type": "markdown",
277
- "id": "sec6",
278
  "metadata": {},
279
  "source": [
280
- "## 6. Reward Function\n",
281
- "\n",
282
- "Called by GRPOTrainer during training on each batch of generated completions.\n",
283
- "For each completion:\n",
284
- "1. Parse it as action JSON\n",
285
- "2. Reset the env to a fresh episode for the right workflow\n",
286
- "3. Send the action via `/step`\n",
287
- "4. Return the reward\n",
288
- "\n",
289
- "This gives the model a live signal from the actual environment."
290
  ]
291
  },
292
  {
@@ -296,53 +342,20 @@
296
  "metadata": {},
297
  "outputs": [],
298
  "source": [
299
- "import re\n",
300
- "from typing import List\n",
301
- "\n",
302
- "ENV_URL = \"http://localhost:8000\"\n",
303
- "\n",
304
- "\n",
305
- "def parse_action(text: str):\n",
306
- " \"\"\"Extract JSON action from model output.\"\"\"\n",
307
- " text = text.strip()\n",
308
- " # Strip markdown code fences if present\n",
309
- " text = re.sub(r\"```(?:json)?\\s*\", \"\", text).strip()\n",
310
- " try:\n",
311
- " return json.loads(text)\n",
312
- " except json.JSONDecodeError:\n",
313
- " m = re.search(r\"\\{.*\\}\", text, re.DOTALL)\n",
314
- " if m:\n",
315
- " try:\n",
316
- " return json.loads(m.group())\n",
317
- " except Exception:\n",
318
- " pass\n",
319
- " return None\n",
320
- "\n",
321
- "\n",
322
  "def orgos_reward_fn(completions: List[str], prompts: List[str], **kwargs) -> List[float]:\n",
323
  " \"\"\"\n",
324
  " GRPO reward function — called by GRPOTrainer each training step.\n",
325
- "\n",
326
- " For each generated completion:\n",
327
- " - Parse as action JSON\n",
328
- " - Reset env to a fresh episode (workflow inferred from prompt)\n",
329
- " - Step the env with the action\n",
330
- " - Return the step reward\n",
331
- "\n",
332
- " Invalid JSON or failed actions return a -0.1 penalty.\n",
333
  " \"\"\"\n",
334
  " workflow_ids = kwargs.get(\"workflow_id\", [\"A\"] * len(completions))\n",
335
  " rewards = []\n",
336
  "\n",
337
  " for completion, wf_id in zip(completions, workflow_ids):\n",
338
  " action = parse_action(completion)\n",
339
- "\n",
340
  " if action is None:\n",
341
  " rewards.append(-0.1)\n",
342
  " continue\n",
343
- "\n",
344
  " try:\n",
345
- " # Fresh episode for this action evaluation\n",
346
  " httpx.post(f\"{ENV_URL}/reset\", json={\"workflow_id\": wf_id}, timeout=10)\n",
347
  " result = httpx.post(f\"{ENV_URL}/step\", json=action, timeout=10).json()\n",
348
  " rewards.append(float(result[\"reward\"]))\n",
@@ -352,24 +365,22 @@
352
  " return rewards\n",
353
  "\n",
354
  "\n",
355
- "print(\"Reward function defined.\")\n",
356
- "print(\"Quick sanity check...\")\n",
357
- "test_rewards = orgos_reward_fn(\n",
358
- " completions = ['{\"app\": \"zendesk\", \"operation\": \"list_tickets\", \"args\": {\"state\": \"new\"}}',\n",
359
- " 'this is not valid json'],\n",
360
- " prompts = [\"\", \"\"],\n",
361
- " workflow_id = [\"A\", \"A\"],\n",
362
  ")\n",
363
- "print(f\" Valid action reward: {test_rewards[0]:.4f}\")\n",
364
- "print(f\" Invalid action reward: {test_rewards[1]:.4f}\")"
365
  ]
366
  },
367
  {
368
  "cell_type": "markdown",
369
- "id": "sec7",
370
  "metadata": {},
371
  "source": [
372
- "## 7. Collect Baseline Scores (Pre-Training)"
373
  ]
374
  },
375
  {
@@ -379,8 +390,6 @@
379
  "metadata": {},
380
  "outputs": [],
381
  "source": [
382
- "import numpy as np\n",
383
- "\n",
384
  "FastLanguageModel.for_inference(model)\n",
385
  "\n",
386
  "\n",
@@ -397,9 +406,9 @@
397
  " obs_text = obs_to_text(obs)\n",
398
  " history.append({\"role\": \"user\", \"content\": obs_text})\n",
399
  "\n",
400
- " # Inject system prompt into first user message\n",
401
- " messages = list(history)\n",
402
- " messages[0] = {\"role\": \"user\", \"content\": SYSTEM_PROMPT + \"\\n\\n---\\n\\n\" + messages[0][\"content\"]}\n",
403
  "\n",
404
  " text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\n",
405
  " inputs = tokenizer(text, return_tensors=\"pt\").to(model.device)\n",
@@ -430,27 +439,29 @@
430
  " return obs.get(\"current_score\", 0.001)\n",
431
  "\n",
432
  "\n",
433
- "N_EVAL = 10 # episodes per workflow for evaluation\n",
434
  "baseline_scores = {wf: [] for wf in [\"A\", \"B\", \"C\"]}\n",
435
  "\n",
436
- "print(\"Collecting pre-training baseline scores...\")\n",
437
  "for wf in [\"A\", \"B\", \"C\"]:\n",
438
  " for ep in range(N_EVAL):\n",
439
  " score = run_episode_with_model(wf)\n",
440
  " baseline_scores[wf].append(score)\n",
441
- " print(f\" Workflow {wf} ep {ep+1}/{N_EVAL}: score={score:.4f}\", end=\"\\r\")\n",
442
- " print(f\" Workflow {wf}: mean={np.mean(baseline_scores[wf]):.4f}\")\n",
443
  "\n",
444
  "baseline_mean = np.mean([s for v in baseline_scores.values() for s in v])\n",
445
- "print(f\"\\nOverall baseline mean: {baseline_mean:.4f}\")"
446
  ]
447
  },
448
  {
449
  "cell_type": "markdown",
450
- "id": "sec8",
451
  "metadata": {},
452
  "source": [
453
- "## 8. GRPO Training"
454
  ]
455
  },
456
  {
@@ -461,57 +472,86 @@
461
  "outputs": [],
462
  "source": [
463
  "from trl import GRPOConfig, GRPOTrainer\n",
464
  "\n",
465
- "# Switch back to training mode\n",
466
  "model.train()\n",
467
  "\n",
468
  "grpo_config = GRPOConfig(\n",
469
  " output_dir = \"./orgos_grpo_ckpt\",\n",
470
- " num_train_epochs = 3,\n",
471
- " per_device_train_batch_size = 4,\n",
472
- " gradient_accumulation_steps = 2,\n",
473
- " learning_rate = 5e-5,\n",
474
  " warmup_steps = 10,\n",
475
  " logging_steps = 5,\n",
476
  " save_steps = 100,\n",
477
  " bf16 = torch.cuda.is_bf16_supported(),\n",
478
  " fp16 = not torch.cuda.is_bf16_supported(),\n",
479
  " max_grad_norm = 1.0,\n",
480
- " # GRPO-specific\n",
481
- " num_generations = 4, # G: candidate actions per prompt\n",
482
  " max_new_tokens = 256,\n",
483
- " temperature = 0.8, # exploration during training\n",
484
- " beta = 0.04, # KL penalty coefficient\n",
485
  " report_to = \"none\",\n",
486
  " seed = 42,\n",
487
  ")\n",
488
  "\n",
489
  "trainer = GRPOTrainer(\n",
490
- " model = model,\n",
491
- " args = grpo_config,\n",
492
- " reward_funcs = orgos_reward_fn,\n",
493
- " train_dataset = prompt_dataset,\n",
494
  " processing_class = tokenizer,\n",
495
  ")\n",
496
  "\n",
497
- "print(\"Starting GRPO training...\")\n",
498
- "print(f\" Prompts: {len(prompt_dataset)}\")\n",
499
- "print(f\" Generations per prompt (G): {grpo_config.num_generations}\")\n",
500
- "print(f\" Epochs: {grpo_config.num_train_epochs}\")\n",
501
- "print(f\" Total env calls per epoch: ~{len(prompt_dataset) * grpo_config.num_generations}\")\n",
502
- "print()\n",
503
- "\n",
504
  "train_result = trainer.train()\n",
505
- "print(\"\\nTraining complete!\")\n",
506
- "print(train_result.metrics)"
507
  ]
508
  },
509
  {
510
  "cell_type": "markdown",
511
- "id": "sec9",
512
  "metadata": {},
513
  "source": [
514
- "## 9. Collect Post-Training Scores"
515
  ]
516
  },
517
  {
@@ -525,25 +565,41 @@
525
  "\n",
526
  "post_scores = {wf: [] for wf in [\"A\", \"B\", \"C\"]}\n",
527
  "\n",
528
- "print(\"Collecting post-training scores...\")\n",
529
  "for wf in [\"A\", \"B\", \"C\"]:\n",
530
  " for ep in range(N_EVAL):\n",
531
  " score = run_episode_with_model(wf)\n",
532
  " post_scores[wf].append(score)\n",
533
- " print(f\" Workflow {wf} ep {ep+1}/{N_EVAL}: score={score:.4f}\", end=\"\\r\")\n",
534
- " print(f\" Workflow {wf}: mean={np.mean(post_scores[wf]):.4f}\")\n",
535
- "\n",
536
- "post_mean = np.mean([s for v in post_scores.values() for s in v])\n",
537
- "print(f\"\\nOverall post-training mean: {post_mean:.4f}\")\n",
538
- "print(f\"Improvement: {post_mean - baseline_mean:+.4f}\")"
539
  ]
540
  },
541
  {
542
  "cell_type": "markdown",
543
- "id": "sec10",
544
  "metadata": {},
545
  "source": [
546
- "## 10. Plot Before / After"
547
  ]
548
  },
549
  {
@@ -561,8 +617,7 @@
561
  " color=\"white\", fontweight=\"bold\", y=0.98)\n",
562
  "\n",
563
  "gs = gridspec.GridSpec(2, 3, figure=fig, hspace=0.45, wspace=0.35)\n",
564
- "\n",
565
- "COLORS = {\"before\": \"#f87171\", \"after\": \"#34d399\", \"bg\": \"#1e293b\", \"grid\": \"#334155\"}\n",
566
  "WF_LABELS = {\n",
567
  " \"A\": \"Workflow A\\nCustomer Bug Fix\",\n",
568
  " \"B\": \"Workflow B\\nEmployee Onboarding\",\n",
@@ -570,19 +625,16 @@
570
  "}\n",
571
  "\n",
572
  "for col, wf in enumerate([\"A\", \"B\", \"C\"]):\n",
573
- " ax = fig.add_subplot(gs[0, col])\n",
574
  " ax.set_facecolor(COLORS[\"bg\"])\n",
575
  " ax.grid(color=COLORS[\"grid\"], linewidth=0.5, alpha=0.7)\n",
576
- "\n",
577
  " before = baseline_scores[wf]\n",
578
  " after = post_scores[wf]\n",
579
  " delta = np.mean(after) - np.mean(before)\n",
580
- "\n",
581
  " ax.plot(before, color=COLORS[\"before\"], linewidth=1.5, alpha=0.8, label=\"Before GRPO\")\n",
582
  " ax.plot(after, color=COLORS[\"after\"], linewidth=1.5, alpha=0.8, label=\"After GRPO\")\n",
583
  " ax.axhline(np.mean(before), color=COLORS[\"before\"], linestyle=\"--\", linewidth=1, alpha=0.5)\n",
584
  " ax.axhline(np.mean(after), color=COLORS[\"after\"], linestyle=\"--\", linewidth=1, alpha=0.5)\n",
585
- "\n",
586
  " ax.set_title(WF_LABELS[wf] + f\"\\n(Δ = {delta:+.4f})\", color=\"white\", fontsize=9)\n",
587
  " ax.set_xlabel(\"Episode\", color=\"#94a3b8\", fontsize=8)\n",
588
  " ax.set_ylabel(\"Final Score\", color=\"#94a3b8\", fontsize=8)\n",
@@ -596,18 +648,15 @@
596
  "ax_hist = fig.add_subplot(gs[1, :])\n",
597
  "ax_hist.set_facecolor(COLORS[\"bg\"])\n",
598
  "ax_hist.grid(color=COLORS[\"grid\"], linewidth=0.5, alpha=0.5, axis=\"x\")\n",
599
- "\n",
600
  "all_before = [s for v in baseline_scores.values() for s in v]\n",
601
  "all_after = [s for v in post_scores.values() for s in v]\n",
602
  "bins = np.linspace(0, 1, 25)\n",
603
- "\n",
604
  "ax_hist.hist(all_before, bins=bins, color=COLORS[\"before\"], alpha=0.6,\n",
605
  " label=f\"Before GRPO (mean={np.mean(all_before):.4f})\", edgecolor=\"none\")\n",
606
  "ax_hist.hist(all_after, bins=bins, color=COLORS[\"after\"], alpha=0.6,\n",
607
  " label=f\"After GRPO (mean={np.mean(all_after):.4f})\", edgecolor=\"none\")\n",
608
  "ax_hist.axvline(np.mean(all_before), color=COLORS[\"before\"], linestyle=\"--\", linewidth=1.5)\n",
609
  "ax_hist.axvline(np.mean(all_after), color=COLORS[\"after\"], linestyle=\"--\", linewidth=1.5)\n",
610
- "\n",
611
  "ax_hist.set_title(\"Score Distribution Across All Workflows\", color=\"white\", fontsize=10)\n",
612
  "ax_hist.set_xlabel(\"Final Score\", color=\"#94a3b8\", fontsize=9)\n",
613
  "ax_hist.set_ylabel(\"Count\", color=\"#94a3b8\", fontsize=9)\n",
@@ -620,15 +669,16 @@
620
  "plt.savefig(\"before_after_curves.png\", dpi=150, bbox_inches=\"tight\",\n",
621
  " facecolor=\"#0f172a\", edgecolor=\"none\")\n",
622
  "plt.show()\n",
623
  "print(\"Saved: before_after_curves.png\")"
624
  ]
625
  },
626
  {
627
  "cell_type": "markdown",
628
- "id": "sec11",
629
  "metadata": {},
630
  "source": [
631
- "## 11. Save LoRA Adapter"
632
  ]
633
  },
634
  {
@@ -640,9 +690,18 @@
640
  "source": [
641
  "model.save_pretrained(\"orgos_lora_adapter\")\n",
642
  "tokenizer.save_pretrained(\"orgos_lora_adapter\")\n",
643
- "print(\"LoRA adapter saved to ./orgos_lora_adapter\")\n",
644
- "\n",
645
- "# Push to HuggingFace Hub\n",
646
  "# from huggingface_hub import login\n",
647
  "# login(token=\"YOUR_HF_TOKEN\")\n",
648
  "# model.push_to_hub(\"YOUR_USERNAME/orgos-qwen25-3b-grpo\")\n",
 
21
  "4. GRPO computes relative advantages within the group (which action did better than average?)\n",
22
  "5. Model is updated to favour higher-reward actions\n",
23
  "\n",
24
+ "**Key training signal:** Schema drift creates a sharp reward gap. \n",
25
  "Using a stale field name (e.g. `priority` when schema says `severity`) → **−0.20**. \n",
26
  "Using the correct drifted name → **+0.10** adaptation bonus. \n",
27
  "The model learns to read `schema_hints` before constructing action args."
77
  },
78
  {
79
  "cell_type": "markdown",
80
+ "id": "sec_logger",
81
  "metadata": {},
82
  "source": [
83
+ "## 3. Training Logger\n",
84
+ "\n",
85
+ "Writes structured logs to `training_log.txt` for submission. \n",
86
+ "Format mirrors the OpenEnv inference log spec:\n",
87
+ "- `[TRAIN_CONFIG]` — model, algorithm, hyperparameters\n",
88
+ "- `[EVAL]` — per-episode score during baseline or post-training eval\n",
89
+ "- `[TRAIN_STEP]` — loss, mean reward, KL per training step\n",
90
+ "- `[TRAIN_SUMMARY]` — final before/after comparison"
91
+ ]
92
+ },
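
For concreteness, a hypothetical excerpt of `training_log.txt` in this format (tag and key names match the `tlog` calls below; every value here is an illustrative placeholder, not real run output):

```
[TRAIN_CONFIG] model=Qwen/Qwen2.5-3B-Instruct lora_r=16 max_seq_len=2048 trainable_params=... quantization=4bit
[EVAL] phase=baseline workflow=A episode=1 score=0.3120
[TRAIN_STEP] step=5 loss=0.041200 mean_reward=0.1250 kl=0.002300 lr=4.50e-05
[TRAIN_SUMMARY] model=Qwen/Qwen2.5-3B-Instruct algorithm=GRPO baseline_mean=... post_training_mean=... improvement=...
```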
93
+ {
94
+ "cell_type": "code",
95
+ "execution_count": null,
96
+ "id": "logger",
97
+ "metadata": {},
98
+ "outputs": [],
99
+ "source": [
100
+ "import datetime\n",
101
+ "\n",
102
+ "LOG_FILE = \"training_log.txt\"\n",
103
+ "\n",
104
+ "# Clear any previous log\n",
105
+ "with open(LOG_FILE, \"w\") as f:\n",
106
+ " f.write(f\"# OrgOS GRPO Training Log\\n\")\n",
107
+ " f.write(f\"# Generated: {datetime.datetime.utcnow().isoformat()}Z\\n\\n\")\n",
108
+ "\n",
109
+ "\n",
110
+ "def tlog(line: str) -> None:\n",
111
+ " \"\"\"Append one structured log line to training_log.txt and print it.\"\"\"\n",
112
+ " print(line, flush=True)\n",
113
+ " with open(LOG_FILE, \"a\") as f:\n",
114
+ " f.write(line + \"\\n\")\n",
115
+ "\n",
116
+ "\n",
117
+ "print(f\"Logger ready — writing to {LOG_FILE}\")"
118
+ ]
119
+ },
120
+ {
121
+ "cell_type": "markdown",
122
+ "id": "sec4",
123
+ "metadata": {},
124
+ "source": [
125
+ "## 4. Start the OrgOS Environment Server"
126
  ]
127
  },
128
  {
143
  "\n",
144
  "health = httpx.get(\"http://localhost:8000/health\").json()\n",
145
  "assert health[\"status\"] == \"healthy\", f\"Server not healthy: {health}\"\n",
146
+ "tlog(f\"[ENV] status=healthy version={health.get('version', '?')}\")"
147
  ]
148
  },
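
For reference, the environment's HTTP surface as the notebook uses it, pulled into a standalone sketch (endpoint paths and field names are taken from the cells in this diff; the `done` field is assumed from the system prompt's stop rule):

```python
import httpx

ENV_URL = "http://localhost:8000"

# Start a fresh episode for one workflow; the reply wraps the first observation.
reset = httpx.post(f"{ENV_URL}/reset", json={"workflow_id": "A"}, timeout=10).json()
obs = reset["observation"]

# Execute one action; the reply carries the scalar reward that GRPO trains on.
action = {"app": "zendesk", "operation": "list_tickets", "args": {"state": "new"}}
result = httpx.post(f"{ENV_URL}/step", json=action, timeout=10).json()
print(result["reward"], result.get("done"))
```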
149
  {
150
  "cell_type": "markdown",
151
+ "id": "sec5",
152
  "metadata": {},
153
  "source": [
154
+ "## 5. Load Model with Unsloth 4-bit LoRA"
155
  ]
156
  },
157
  {
166
  "\n",
167
  "MAX_SEQ_LEN = 2048\n",
168
  "MODEL_NAME = \"Qwen/Qwen2.5-3B-Instruct\"\n",
169
+ "LORA_R = 16\n",
170
  "\n",
171
  "model, tokenizer = FastLanguageModel.from_pretrained(\n",
172
  " model_name = MODEL_NAME,\n",
177
  "\n",
178
  "model = FastLanguageModel.get_peft_model(\n",
179
  " model,\n",
180
+ " r = LORA_R,\n",
181
  " target_modules = [\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n",
182
  " \"gate_proj\", \"up_proj\", \"down_proj\"],\n",
183
+ " lora_alpha = LORA_R,\n",
184
  " lora_dropout = 0,\n",
185
  " bias = \"none\",\n",
186
  " use_gradient_checkpointing = \"unsloth\",\n",
187
  " random_state = 42,\n",
188
  ")\n",
189
+ "\n",
190
  "trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)\n",
191
+ "tlog(f\"[TRAIN_CONFIG] model={MODEL_NAME} lora_r={LORA_R} max_seq_len={MAX_SEQ_LEN} \"\n",
192
+ " f\"trainable_params={trainable:,} quantization=4bit\")"
193
  ]
194
  },
195
  {
196
  "cell_type": "markdown",
197
+ "id": "sec6",
198
  "metadata": {},
199
  "source": [
200
+ "## 6. Prompt Dataset"
201
  ]
202
  },
203
  {
207
  "metadata": {},
208
  "outputs": [],
209
  "source": [
210
+ "import json, re\n",
211
+ "import numpy as np\n",
212
+ "from typing import List\n",
213
  "from datasets import Dataset\n",
214
  "\n",
215
  "SYSTEM_PROMPT = \"\"\"\\\n",
250
  "6. Stop when pending_steps is empty or done=true.\n",
251
  "\"\"\"\n",
252
  "\n",
253
+ "ENV_URL = \"http://localhost:8000\"\n",
254
+ "\n",
255
  "\n",
256
  "def obs_to_text(obs: dict) -> str:\n",
257
  " hints = obs.get(\"schema_hints\", {})\n",
286
  "\n",
287
  "\n",
288
  "def build_prompt(obs_text: str) -> str:\n",
289
  " messages = [{\"role\": \"user\", \"content\": SYSTEM_PROMPT + \"\\n\\n---\\n\\n\" + obs_text}]\n",
290
  " return tokenizer.apply_chat_template(\n",
291
  " messages, tokenize=False, add_generation_prompt=True\n",
292
  " )\n",
293
  "\n",
294
  "\n",
295
+ "def parse_action(text: str):\n",
296
+ " text = re.sub(r\"```(?:json)?\\s*\", \"\", text.strip()).strip()\n",
297
+ " try:\n",
298
+ " return json.loads(text)\n",
299
+ " except json.JSONDecodeError:\n",
300
+ " m = re.search(r\"\\{.*\\}\", text, re.DOTALL)\n",
301
+ " if m:\n",
302
+ " try:\n",
303
+ " return json.loads(m.group())\n",
304
+ " except Exception:\n",
305
+ " pass\n",
306
+ " return None\n",
307
+ "\n",
308
+ "\n",
309
  "N_PROMPTS_PER_WORKFLOW = 20\n",
310
  "prompt_rows = []\n",
311
  "\n",
312
  "print(\"Collecting prompts from env resets...\")\n",
313
  "for wf in [\"A\", \"B\", \"C\"]:\n",
314
  " for _ in range(N_PROMPTS_PER_WORKFLOW):\n",
315
+ " result = httpx.post(f\"{ENV_URL}/reset\", json={\"workflow_id\": wf}).json()\n",
316
+ " obs = result[\"observation\"]\n",
317
  " obs_text = obs_to_text(obs)\n",
318
  " prompt_rows.append({\n",
319
  " \"prompt\": build_prompt(obs_text),\n",
322
  " })\n",
323
  "\n",
324
  "prompt_dataset = Dataset.from_list(prompt_rows)\n",
325
+ "tlog(f\"[TRAIN_CONFIG] algorithm=GRPO prompts={len(prompt_dataset)} \"\n",
326
+ " f\"workflows=A,B,C prompts_per_workflow={N_PROMPTS_PER_WORKFLOW}\")\n",
327
+ "print(f\"Prompt dataset ready: {len(prompt_dataset)} examples\")"
328
  ]
329
  },
330
  {
331
  "cell_type": "markdown",
332
+ "id": "sec7",
333
  "metadata": {},
334
  "source": [
335
+ "## 7. Reward Function"
336
  ]
337
  },
338
  {
342
  "metadata": {},
343
  "outputs": [],
344
  "source": [
345
  "def orgos_reward_fn(completions: List[str], prompts: List[str], **kwargs) -> List[float]:\n",
346
  " \"\"\"\n",
347
  " GRPO reward function — called by GRPOTrainer each training step.\n",
348
+ " Parses each completion as an action JSON, steps the live env, returns the reward.\n",
349
  " \"\"\"\n",
350
  " workflow_ids = kwargs.get(\"workflow_id\", [\"A\"] * len(completions))\n",
351
  " rewards = []\n",
352
  "\n",
353
  " for completion, wf_id in zip(completions, workflow_ids):\n",
354
  " action = parse_action(completion)\n",
355
  " if action is None:\n",
356
  " rewards.append(-0.1)\n",
357
  " continue\n",
358
  " try:\n",
359
  " httpx.post(f\"{ENV_URL}/reset\", json={\"workflow_id\": wf_id}, timeout=10)\n",
360
  " result = httpx.post(f\"{ENV_URL}/step\", json=action, timeout=10).json()\n",
361
  " rewards.append(float(result[\"reward\"]))\n",
365
  " return rewards\n",
366
  "\n",
367
  "\n",
368
+ "# Sanity check\n",
369
+ "test_r = orgos_reward_fn(\n",
370
+ " completions = ['{\"app\": \"zendesk\", \"operation\": \"list_tickets\", \"args\": {\"state\": \"new\"}}',\n",
371
+ " 'not json'],\n",
372
+ " prompts = [\"\", \"\"],\n",
373
+ " workflow_id = [\"A\", \"A\"],\n",
374
  ")\n",
375
+ "tlog(f\"[REWARD_FN_CHECK] valid_action={test_r[0]:.4f} invalid_action={test_r[1]:.4f}\")"
376
  ]
377
  },
378
  {
379
  "cell_type": "markdown",
380
+ "id": "sec8",
381
  "metadata": {},
382
  "source": [
383
+ "## 8. Collect Baseline Scores (Pre-Training)"
384
  ]
385
  },
386
  {
390
  "metadata": {},
391
  "outputs": [],
392
  "source": [
393
  "FastLanguageModel.for_inference(model)\n",
394
  "\n",
395
  "\n",
406
  " obs_text = obs_to_text(obs)\n",
407
  " history.append({\"role\": \"user\", \"content\": obs_text})\n",
408
  "\n",
409
+ " messages = list(history)\n",
410
+ " messages[0] = {\"role\": \"user\",\n",
411
+ " \"content\": SYSTEM_PROMPT + \"\\n\\n---\\n\\n\" + messages[0][\"content\"]}\n",
412
  "\n",
413
  " text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\n",
414
  " inputs = tokenizer(text, return_tensors=\"pt\").to(model.device)\n",
439
  " return obs.get(\"current_score\", 0.001)\n",
440
  "\n",
441
  "\n",
442
+ "N_EVAL = 10\n",
443
  "baseline_scores = {wf: [] for wf in [\"A\", \"B\", \"C\"]}\n",
444
  "\n",
445
+ "tlog(\"[EVAL_START] phase=baseline\")\n",
446
  "for wf in [\"A\", \"B\", \"C\"]:\n",
447
  " for ep in range(N_EVAL):\n",
448
  " score = run_episode_with_model(wf)\n",
449
  " baseline_scores[wf].append(score)\n",
450
+ " tlog(f\"[EVAL] phase=baseline workflow={wf} episode={ep+1} score={score:.4f}\")\n",
451
+ " wf_mean = np.mean(baseline_scores[wf])\n",
452
+ " tlog(f\"[EVAL_WORKFLOW] phase=baseline workflow={wf} \"\n",
453
+ " f\"mean={wf_mean:.4f} min={min(baseline_scores[wf]):.4f} max={max(baseline_scores[wf]):.4f}\")\n",
454
  "\n",
455
  "baseline_mean = np.mean([s for v in baseline_scores.values() for s in v])\n",
456
+ "tlog(f\"[EVAL_END] phase=baseline overall_mean={baseline_mean:.4f}\")"
457
  ]
458
  },
459
  {
460
  "cell_type": "markdown",
461
+ "id": "sec9",
462
  "metadata": {},
463
  "source": [
464
+ "## 9. GRPO Training"
465
  ]
466
  },
467
  {
472
  "outputs": [],
473
  "source": [
474
  "from trl import GRPOConfig, GRPOTrainer\n",
475
+ "from transformers import TrainerCallback\n",
476
  "\n",
477
  "model.train()\n",
478
  "\n",
479
+ "NUM_EPOCHS = 3\n",
480
+ "BATCH_SIZE = 4\n",
481
+ "GRAD_ACCUM = 2\n",
482
+ "LR = 5e-5\n",
483
+ "NUM_GEN = 4\n",
484
+ "TEMPERATURE = 0.8\n",
485
+ "BETA = 0.04\n",
486
+ "\n",
487
  "grpo_config = GRPOConfig(\n",
488
  " output_dir = \"./orgos_grpo_ckpt\",\n",
489
+ " num_train_epochs = NUM_EPOCHS,\n",
490
+ " per_device_train_batch_size = BATCH_SIZE,\n",
491
+ " gradient_accumulation_steps = GRAD_ACCUM,\n",
492
+ " learning_rate = LR,\n",
493
  " warmup_steps = 10,\n",
494
  " logging_steps = 5,\n",
495
  " save_steps = 100,\n",
496
  " bf16 = torch.cuda.is_bf16_supported(),\n",
497
  " fp16 = not torch.cuda.is_bf16_supported(),\n",
498
  " max_grad_norm = 1.0,\n",
499
+ " num_generations = NUM_GEN,\n",
500
  " max_new_tokens = 256,\n",
501
+ " temperature = TEMPERATURE,\n",
502
+ " beta = BETA,\n",
503
  " report_to = \"none\",\n",
504
  " seed = 42,\n",
505
  ")\n",
506
  "\n",
507
+ "tlog(f\"[TRAIN_CONFIG] epochs={NUM_EPOCHS} batch_size={BATCH_SIZE} \"\n",
508
+ " f\"grad_accum={GRAD_ACCUM} lr={LR} num_generations={NUM_GEN} \"\n",
509
+ " f\"temperature={TEMPERATURE} beta_kl={BETA}\")\n",
510
+ "\n",
511
+ "\n",
512
+ "class OrgOSLogCallback(TrainerCallback):\n",
513
+ " \"\"\"Logs each training step to training_log.txt.\"\"\"\n",
514
+ "\n",
515
+ " def on_log(self, args, state, control, logs=None, **kwargs):\n",
516
+ " if logs is None:\n",
517
+ " return\n",
518
+ " step = state.global_step\n",
519
+ " loss = logs.get(\"loss\", logs.get(\"train_loss\", \"?\"))\n",
520
+ " mean_reward = logs.get(\"reward\", logs.get(\"mean_reward\", \"?\"))\n",
521
+ " kl = logs.get(\"kl\", logs.get(\"approx_kl\", \"?\"))\n",
522
+ " lr_now = logs.get(\"learning_rate\", \"?\")\n",
523
+ "\n",
524
+ " loss_str = f\"{loss:.6f}\" if isinstance(loss, float) else str(loss)\n",
525
+ " reward_str = f\"{mean_reward:.4f}\" if isinstance(mean_reward, float) else str(mean_reward)\n",
526
+ " kl_str = f\"{kl:.6f}\" if isinstance(kl, float) else str(kl)\n",
527
+ " lr_str = f\"{lr_now:.2e}\" if isinstance(lr_now, float) else str(lr_now)\n",
528
+ "\n",
529
+ " tlog(f\"[TRAIN_STEP] step={step} loss={loss_str} \"\n",
530
+ " f\"mean_reward={reward_str} kl={kl_str} lr={lr_str}\")\n",
531
+ "\n",
532
+ "\n",
533
  "trainer = GRPOTrainer(\n",
534
+ " model = model,\n",
535
+ " args = grpo_config,\n",
536
+ " reward_funcs = orgos_reward_fn,\n",
537
+ " train_dataset = prompt_dataset,\n",
538
  " processing_class = tokenizer,\n",
539
+ " callbacks = [OrgOSLogCallback()],\n",
540
  ")\n",
541
  "\n",
542
+ "tlog(\"[TRAIN_START]\")\n",
543
  "train_result = trainer.train()\n",
544
+ "tlog(f\"[TRAIN_END] total_steps={train_result.global_step} \"\n",
545
+ " f\"train_loss={train_result.training_loss:.6f} \"\n",
546
+ " f\"train_runtime_s={train_result.metrics.get('train_runtime', 0):.1f}\")"
547
  ]
548
  },
549
  {
550
  "cell_type": "markdown",
551
+ "id": "sec10",
552
  "metadata": {},
553
  "source": [
554
+ "## 10. Collect Post-Training Scores"
555
  ]
556
  },
557
  {
565
  "\n",
566
  "post_scores = {wf: [] for wf in [\"A\", \"B\", \"C\"]}\n",
567
  "\n",
568
+ "tlog(\"[EVAL_START] phase=post_training\")\n",
569
  "for wf in [\"A\", \"B\", \"C\"]:\n",
570
  " for ep in range(N_EVAL):\n",
571
  " score = run_episode_with_model(wf)\n",
572
  " post_scores[wf].append(score)\n",
573
+ " tlog(f\"[EVAL] phase=post_training workflow={wf} episode={ep+1} score={score:.4f}\")\n",
574
+ " wf_mean = np.mean(post_scores[wf])\n",
575
+ " tlog(f\"[EVAL_WORKFLOW] phase=post_training workflow={wf} \"\n",
576
+ " f\"mean={wf_mean:.4f} min={min(post_scores[wf]):.4f} max={max(post_scores[wf]):.4f}\")\n",
577
+ "\n",
578
+ "post_mean = np.mean([s for v in post_scores.values() for s in v])\n",
579
+ "improvement = post_mean - baseline_mean\n",
580
+ "tlog(f\"[EVAL_END] phase=post_training overall_mean={post_mean:.4f}\")\n",
581
+ "tlog(\n",
582
+ " f\"[TRAIN_SUMMARY] \"\n",
583
+ " f\"model={MODEL_NAME} algorithm=GRPO \"\n",
584
+ " f\"baseline_mean={baseline_mean:.4f} \"\n",
585
+ " f\"post_training_mean={post_mean:.4f} \"\n",
586
+ " f\"improvement={improvement:+.4f} \"\n",
587
+ " f\"workflow_A_before={np.mean(baseline_scores['A']):.4f} \"\n",
588
+ " f\"workflow_A_after={np.mean(post_scores['A']):.4f} \"\n",
589
+ " f\"workflow_B_before={np.mean(baseline_scores['B']):.4f} \"\n",
590
+ " f\"workflow_B_after={np.mean(post_scores['B']):.4f} \"\n",
591
+ " f\"workflow_C_before={np.mean(baseline_scores['C']):.4f} \"\n",
592
+ " f\"workflow_C_after={np.mean(post_scores['C']):.4f}\"\n",
593
+ ")\n",
594
+ "print(f\"\\nImprovement: {baseline_mean:.4f} → {post_mean:.4f} ({improvement:+.4f})\")"
595
  ]
596
  },
597
  {
598
  "cell_type": "markdown",
599
+ "id": "sec11",
600
  "metadata": {},
601
  "source": [
602
+ "## 11. Plot Before / After"
603
  ]
604
  },
605
  {
617
  " color=\"white\", fontweight=\"bold\", y=0.98)\n",
618
  "\n",
619
  "gs = gridspec.GridSpec(2, 3, figure=fig, hspace=0.45, wspace=0.35)\n",
620
+ "COLORS = {\"before\": \"#f87171\", \"after\": \"#34d399\", \"bg\": \"#1e293b\", \"grid\": \"#334155\"}\n",
621
  "WF_LABELS = {\n",
622
  " \"A\": \"Workflow A\\nCustomer Bug Fix\",\n",
623
  " \"B\": \"Workflow B\\nEmployee Onboarding\",\n",
625
  "}\n",
626
  "\n",
627
  "for col, wf in enumerate([\"A\", \"B\", \"C\"]):\n",
628
+ " ax = fig.add_subplot(gs[0, col])\n",
629
  " ax.set_facecolor(COLORS[\"bg\"])\n",
630
  " ax.grid(color=COLORS[\"grid\"], linewidth=0.5, alpha=0.7)\n",
631
  " before = baseline_scores[wf]\n",
632
  " after = post_scores[wf]\n",
633
  " delta = np.mean(after) - np.mean(before)\n",
634
  " ax.plot(before, color=COLORS[\"before\"], linewidth=1.5, alpha=0.8, label=\"Before GRPO\")\n",
635
  " ax.plot(after, color=COLORS[\"after\"], linewidth=1.5, alpha=0.8, label=\"After GRPO\")\n",
636
  " ax.axhline(np.mean(before), color=COLORS[\"before\"], linestyle=\"--\", linewidth=1, alpha=0.5)\n",
637
  " ax.axhline(np.mean(after), color=COLORS[\"after\"], linestyle=\"--\", linewidth=1, alpha=0.5)\n",
638
  " ax.set_title(WF_LABELS[wf] + f\"\\n(Δ = {delta:+.4f})\", color=\"white\", fontsize=9)\n",
639
  " ax.set_xlabel(\"Episode\", color=\"#94a3b8\", fontsize=8)\n",
640
  " ax.set_ylabel(\"Final Score\", color=\"#94a3b8\", fontsize=8)\n",
648
  "ax_hist = fig.add_subplot(gs[1, :])\n",
649
  "ax_hist.set_facecolor(COLORS[\"bg\"])\n",
650
  "ax_hist.grid(color=COLORS[\"grid\"], linewidth=0.5, alpha=0.5, axis=\"x\")\n",
651
  "all_before = [s for v in baseline_scores.values() for s in v]\n",
652
  "all_after = [s for v in post_scores.values() for s in v]\n",
653
  "bins = np.linspace(0, 1, 25)\n",
654
  "ax_hist.hist(all_before, bins=bins, color=COLORS[\"before\"], alpha=0.6,\n",
655
  " label=f\"Before GRPO (mean={np.mean(all_before):.4f})\", edgecolor=\"none\")\n",
656
  "ax_hist.hist(all_after, bins=bins, color=COLORS[\"after\"], alpha=0.6,\n",
657
  " label=f\"After GRPO (mean={np.mean(all_after):.4f})\", edgecolor=\"none\")\n",
658
  "ax_hist.axvline(np.mean(all_before), color=COLORS[\"before\"], linestyle=\"--\", linewidth=1.5)\n",
659
  "ax_hist.axvline(np.mean(all_after), color=COLORS[\"after\"], linestyle=\"--\", linewidth=1.5)\n",
660
  "ax_hist.set_title(\"Score Distribution Across All Workflows\", color=\"white\", fontsize=10)\n",
661
  "ax_hist.set_xlabel(\"Final Score\", color=\"#94a3b8\", fontsize=9)\n",
662
  "ax_hist.set_ylabel(\"Count\", color=\"#94a3b8\", fontsize=9)\n",
669
  "plt.savefig(\"before_after_curves.png\", dpi=150, bbox_inches=\"tight\",\n",
670
  " facecolor=\"#0f172a\", edgecolor=\"none\")\n",
671
  "plt.show()\n",
672
+ "tlog(\"[ARTIFACT] file=before_after_curves.png\")\n",
673
  "print(\"Saved: before_after_curves.png\")"
674
  ]
675
  },
676
  {
677
  "cell_type": "markdown",
678
+ "id": "sec12",
679
  "metadata": {},
680
  "source": [
681
+ "## 12. Save LoRA Adapter & Training Log"
682
  ]
683
  },
684
  {
690
  "source": [
691
  "model.save_pretrained(\"orgos_lora_adapter\")\n",
692
  "tokenizer.save_pretrained(\"orgos_lora_adapter\")\n",
693
+ "tlog(\"[ARTIFACT] file=orgos_lora_adapter/\")\n",
694
+ "tlog(\"[ARTIFACT] file=training_log.txt\")\n",
695
+ "\n",
696
+ "print(f\"\\n{'='*60}\")\n",
697
+ "print(\" Submission artefacts\")\n",
698
+ "print(f\"{'='*60}\")\n",
699
+ "print(\" training_log.txt — structured training log\")\n",
700
+ "print(\" before_after_curves.png — score improvement chart\")\n",
701
+ "print(\" orgos_lora_adapter/ — LoRA weights\")\n",
702
+ "print(f\"{'='*60}\")\n",
703
+ "\n",
704
+ "# Optional: push to HuggingFace Hub\n",
705
  "# from huggingface_hub import login\n",
706
  "# login(token=\"YOUR_HF_TOKEN\")\n",
707
  "# model.push_to_hub(\"YOUR_USERNAME/orgos-qwen25-3b-grpo\")\n",