Spaces: srishtichugh/orgOS
muskan singh and Claude Opus 4.7 committed 12 days ago

Commit 5ebb26b (1 parent: 03d30a6)

fix: stable GRPO notebook — pin TRL<=0.24, multi-step reward, Drive checkpoints every 30 steps

Key changes vs previous run that stopped at step 21:
- Pin trl>=0.18.2,<=0.24.0 BEFORE unsloth install (trl 1.x breaks Unsloth patches)
- Multi-step reward fn (REWARD_STEPS=2) for richer training signal
- NUM_GENERATIONS=2 to halve VRAM pressure from G×reward_steps inference calls
- max_new_tokens=256 in GRPOConfig (works with pinned TRL, fixes 95% clipping)
- Drive checkpoint every 30 steps via callback (survives Colab disconnects)
- MAX_TRAIN_STEPS=150, LR=8e-6
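
The trade-off between NUM_GENERATIONS and REWARD_STEPS listed above can be made concrete with a small sketch. It shows the G×reward_steps call count and a group-relative advantage computation for one prompt. The function names and the choice of population standard deviation are assumptions for illustration; the notebook's reward function and TRL's internal normalization may differ. The example rewards (+0.10 and -0.20) are the schema-drift values quoted in the previous notebook's intro cell.

```python
from statistics import mean, pstdev
from typing import List


def calls_per_prompt(num_generations: int, reward_steps: int) -> int:
    """Calls needed to score one prompt's group: G completions x reward steps each."""
    return num_generations * reward_steps


def group_relative_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Center and scale rewards within one group of G completions.

    Assumption: population std with a small eps to avoid division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Going from G=4 to G=2 at REWARD_STEPS=2 halves the calls per prompt.
    print(calls_per_prompt(4, 2), "->", calls_per_prompt(2, 2))
    # One completion used the drifted field name, the other the stale one.
    print(group_relative_advantages([0.10, -0.20]))
    # Identical rewards give zero advantage, so the group carries no signal.
    print(group_relative_advantages([0.10, 0.10]))
```

Under this sketch, a group of two only contributes a learning signal when its two completions receive different rewards, which is one reason a richer multi-step reward can matter more at small G.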

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
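
The "Drive checkpoint every 30 steps" change can be pictured as a step gate plus a directory copy. The sketch below covers only that logic, using the standard library. The helper name `mirror_checkpoint`, the `step-<n>` folder layout, and the temporary directories standing in for the adapter folder and the mounted Drive path are all hypothetical; in the notebook this logic would presumably run inside a trainer callback hook.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def mirror_checkpoint(step: int, src_dir: Path, dst_root: Path, every: int = 30) -> Optional[Path]:
    """Copy src_dir to dst_root/step-<step> when step is a positive multiple of `every`.

    Returns the destination path, or None when no copy was made.
    """
    if step <= 0 or step % every != 0:
        return None
    dst = dst_root / f"step-{step}"
    # dirs_exist_ok (Python 3.8+) lets a re-run overwrite files from an earlier copy.
    shutil.copytree(src_dir, dst, dirs_exist_ok=True)
    return dst


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "adapter"          # stand-in for the local checkpoint folder
        src.mkdir()
        (src / "adapter_config.json").write_text("{}")
        drive = Path(tmp) / "drive"          # stand-in for a mounted Drive directory
        for step in range(1, 61):
            saved = mirror_checkpoint(step, src, drive)
            if saved is not None:
                print("saved", saved.name)
```

Copying to a persistent location at a fixed interval bounds the work lost to a disconnect to at most `every` steps, at the cost of extra I/O on each gated step.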

Files changed (1)

training/grpo_orgos.ipynb +530 -492

@@ -1,173 +1,203 @@
 {
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "title",
    "metadata": {},
    "source": [
-    "# OrgOS GRPO Training\n",
-    "\n",
-    "**Environment:** OrgOS — Multi-App Enterprise RL Environment  \n",
-    "**Model:** `Qwen/Qwen2.5-3B-Instruct` (4-bit LoRA via Unsloth)  \n",
-    "**Algorithm:** GRPO (Group Relative Policy Optimization) via HuggingFace TRL  \n",
-    "**Target hardware:** HuggingFace compute (A10G / A100)  \n",
-    "\n",
-    "## How this works\n",
-    "\n",
-    "GRPO is an **online** RL algorithm:\n",
-    "1. Each training step takes a batch of **prompts** (observations from the env)\n",
-    "2. The model generates **G candidate actions** per prompt (the group)\n",
-    "3. Each action is sent to the **live OrgOS env** to get a real reward\n",
-    "4. GRPO computes relative advantages within the group (which action did better than average?)\n",
-    "5. Model is updated to favour higher-reward actions\n",
-    "\n",
-    "**Key training signal:** Schema drift creates a sharp reward gap.  \n",
-    "Using a stale field name (e.g. `priority` when schema says `severity`) → **−0.20**.  \n",
-    "Using the correct drifted name → **+0.10** adaptation bonus.  \n",
-    "The model learns to read `schema_hints` before constructing action args."
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "sec1",
    "metadata": {},
-   "source": [
-    "## 1. Install Dependencies"
-   ]
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "id": "install",
    "metadata": {},
    "outputs": [],
    "source": [
-    "!pip install -q \"unsloth[huggingface]\" \"trl>=0.12.0\" peft accelerate bitsandbytes\n",
-    "!pip install -q fastapi uvicorn httpx openai pydantic python-dotenv\n",
-    "!pip install -q matplotlib numpy datasets"
    ]
   },
   {
-   "cell_type": "markdown",
-   "id": "sec2",
    "metadata": {},
    "source": [
-    "## 2. Clone the OrgOS Repo"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "id": "clone_repo",
    "metadata": {},
    "outputs": [],
    "source": [
-    "import os\n",
     "\n",
-    "REPO_URL = \"https://huggingface.co/spaces/tanvibisht/orgos-openenv\"\n",
-    "REPO_DIR = \"/home/user/orgos\"\n",
     "\n",
-    "if not os.path.exists(REPO_DIR):\n",
-    "    !git clone {REPO_URL} {REPO_DIR}\n",
     "\n",
-    "os.chdir(REPO_DIR)\n",
-    "print(\"Working directory:\", os.getcwd())\n",
-    "!ls"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "sec_logger",
    "metadata": {},
    "source": [
-    "## 3. Training Logger\n",
-    "\n",
-    "Writes structured logs to `training_log.txt` for submission.  \n",
-    "Format mirrors the OpenEnv inference log spec:\n",
-    "- `[TRAIN_CONFIG]` — model, algorithm, hyperparameters\n",
-    "- `[EVAL]` — per-episode score during baseline or post-training eval\n",
-    "- `[TRAIN_STEP]` — loss, mean reward, KL per training step\n",
-    "- `[TRAIN_SUMMARY]` — final before/after comparison"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "id": "logger",
    "metadata": {},
    "outputs": [],
    "source": [
-    "import datetime\n",
     "\n",
-    "LOG_FILE = \"training_log.txt\"\n",
-    "\n",
-    "# Clear any previous log\n",
-    "with open(LOG_FILE, \"w\") as f:\n",
-    "    f.write(f\"# OrgOS GRPO Training Log\\n\")\n",
-    "    f.write(f\"# Generated: {datetime.datetime.utcnow().isoformat()}Z\\n\\n\")\n",
-    "\n",
-    "\n",
-    "def tlog(line: str) -> None:\n",
-    "    \"\"\"Append one structured log line to training_log.txt and print it.\"\"\"\n",
     "    print(line, flush=True)\n",
-    "    with open(LOG_FILE, \"a\") as f:\n",
-    "        f.write(line + \"\\n\")\n",
-    "\n",
-    "\n",
-    "print(f\"Logger ready — writing to {LOG_FILE}\")"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "sec4",
    "metadata": {},
    "source": [
-    "## 4. Start the OrgOS Environment Server"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "id": "start_server",
    "metadata": {},
    "outputs": [],
    "source": [
-    "import subprocess, time, httpx\n",
-    "\n",
-    "server_proc = subprocess.Popen(\n",
-    "    [\"python\", \"-m\", \"uvicorn\", \"server.app:app\", \"--host\", \"0.0.0.0\", \"--port\", \"8000\"],\n",
     "    stdout=subprocess.DEVNULL,\n",
     "    stderr=subprocess.DEVNULL,\n",
     ")\n",
-    "time.sleep(4)\n",
-    "\n",
-    "health = httpx.get(\"http://localhost:8000/health\").json()\n",
-    "assert health[\"status\"] == \"healthy\", f\"Server not healthy: {health}\"\n",
-    "tlog(f\"[ENV] status=healthy version={health.get('version', '?')}\")"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "sec5",
    "metadata": {},
-   "source": [
-    "## 5. Load Model with Unsloth 4-bit LoRA"
-   ]
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "id": "load_model",
    "metadata": {},
    "outputs": [],
    "source": [
-    "from unsloth import FastLanguageModel\n",
-    "import torch\n",
-    "\n",
-    "MAX_SEQ_LEN = 2048\n",
-    "MODEL_NAME  = \"Qwen/Qwen2.5-3B-Instruct\"\n",
-    "LORA_R      = 16\n",
-    "\n",
     "model, tokenizer = FastLanguageModel.from_pretrained(\n",
     "    model_name     = MODEL_NAME,\n",
     "    max_seq_length = MAX_SEQ_LEN,\n",
@@ -178,548 +208,556 @@
     "model = FastLanguageModel.get_peft_model(\n",
     "    model,\n",
     "    r              = LORA_R,\n",
-    "    target_modules = [\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n",
-    "                      \"gate_proj\", \"up_proj\", \"down_proj\"],\n",
-    "    lora_alpha     = LORA_R,\n",
-    "    lora_dropout   = 0,\n",
-    "    bias           = \"none\",\n",
-    "    use_gradient_checkpointing = \"unsloth\",\n",
-    "    random_state   = 42,\n",
     ")\n",
     "\n",
     "trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)\n",
-    "tlog(f\"[TRAIN_CONFIG] model={MODEL_NAME} lora_r={LORA_R} max_seq_len={MAX_SEQ_LEN} \"\n",
-    "     f\"trainable_params={trainable:,} quantization=4bit\")"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "sec6",
    "metadata": {},
    "source": [
-    "## 6. Prompt Dataset"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "id": "build_prompts",
    "metadata": {},
    "outputs": [],
    "source": [
-    "import json, re\n",
-    "import numpy as np\n",
-    "from typing import List\n",
-    "from datasets import Dataset\n",
     "\n",
-    "SYSTEM_PROMPT = \"\"\"\\\n",
-    "You are OrgOS Agent — an enterprise workflow automation agent.\n",
-    "You operate across four SaaS applications: Jira, Zendesk, Salesforce, and Workday.\n",
     "\n",
-    "Each turn you receive a JSON observation with:\n",
-    "  - workflow_goal    : the task you must complete\n",
-    "  - pending_steps    : remaining steps in the workflow\n",
-    "  - app_states       : current state of each app\n",
-    "  - schema_hints     : field renames in effect this episode (e.g. {\"jira.priority\": \"severity\"})\n",
-    "  - active_rules     : current SLA / approval thresholds\n",
-    "  - message          : feedback from the last action\n",
-    "  - current_score    : your cumulative score (0.001-0.999)\n",
-    "\n",
-    "Respond ONLY with a valid JSON object — no markdown, no explanation.\n",
-    "\n",
-    "Action format:\n",
     "  {\"app\": \"<app>\", \"operation\": \"<op>\", \"args\": {...}}\n",
     "\n",
     "Available apps and key operations:\n",
     "  jira:       get_issue, create_issue, update_status, set_priority, assign_owner,\n",
     "              add_label, link_zendesk_ticket, close_issue, list_issues\n",
     "  zendesk:    get_ticket, acknowledge_ticket, set_urgency, assign_agent,\n",
-    "              escalate_to_jira, resolve_ticket, add_note, list_tickets,\n",
-    "              create_agent_profile\n",
     "  salesforce: get_account, list_accounts, update_deal_stage, flag_churn_risk,\n",
     "              assign_account_owner, log_interaction, get_opportunity\n",
     "  workday:    get_employee, list_employees, provision_access, log_sla_event,\n",
     "              request_budget_approval, create_onboarding_task, complete_task\n",
     "\n",
     "CRITICAL RULES:\n",
-    "1. Read schema_hints FIRST — if \"jira.priority\" -> \"severity\", use \"severity\" not \"priority\" in args.\n",
-    "2. Complete ALL pending_steps in order.\n",
-    "3. Do not repeat a successful action.\n",
-    "4. If an operation fails, read the message carefully and adapt.\n",
-    "5. Use list_* operations to discover record IDs when needed.\n",
-    "6. Stop when pending_steps is empty or done=true.\n",
-    "\"\"\"\n",
-    "\n",
-    "ENV_URL = \"http://localhost:8000\"\n",
-    "\n",
     "\n",
     "def obs_to_text(obs: dict) -> str:\n",
-    "    hints   = obs.get(\"schema_hints\", {})\n",
-    "    pending = obs.get(\"pending_steps\", [])\n",
     "    lines = [\n",
     "        f\"current_score: {obs['current_score']}\",\n",
     "        f\"step_count:    {obs['step_count']}\",\n",
     "        f\"workflow_id:   {obs['workflow_id']}\",\n",
-    "        \"\",\n",
-    "        \"=== WORKFLOW GOAL ===\",\n",
-    "        obs[\"workflow_goal\"],\n",
-    "        \"\",\n",
-    "        \"=== PENDING STEPS ===\",\n",
-    "        \"\\n\".join(f\"  - {s}\" for s in pending) or \"  (all steps complete!)\",\n",
-    "        \"\",\n",
-    "        \"=== SCHEMA HINTS (use these field names) ===\",\n",
-    "        json.dumps(hints, indent=2) if hints else \"  (no drift — use canonical names)\",\n",
-    "        \"\",\n",
-    "        \"=== ACTIVE RULES ===\",\n",
-    "        json.dumps(obs.get(\"active_rules\", {}), indent=2),\n",
-    "        \"\",\n",
-    "        \"=== LAST MESSAGE ===\",\n",
-    "        obs[\"message\"],\n",
-    "        \"\",\n",
-    "        \"=== APP STATES ===\",\n",
     "    ]\n",
-    "    for app_name, view in obs.get(\"app_states\", {}).items():\n",
-    "        lines.append(f\"  [{app_name.upper()}]\")\n",
-    "        lines.append(f\"  {view}\")\n",
-    "        lines.append(\"\")\n",
-    "    return \"\\n\".join(lines)\n",
-    "\n",
-    "\n",
-    "def build_prompt(obs_text: str) -> str:\n",
-    "    messages = [{\"role\": \"user\", \"content\": SYSTEM_PROMPT + \"\\n\\n---\\n\\n\" + obs_text}]\n",
-    "    return tokenizer.apply_chat_template(\n",
-    "        messages, tokenize=False, add_generation_prompt=True\n",
-    "    )\n",
-    "\n",
     "\n",
     "def parse_action(text: str):\n",
-    "    text = re.sub(r\"```(?:json)?\\s*\", \"\", text.strip()).strip()\n",
     "    try:\n",
     "        return json.loads(text)\n",
     "    except json.JSONDecodeError:\n",
-    "        m = re.search(r\"\\{.*\\}\", text, re.DOTALL)\n",
     "        if m:\n",
-    "            try:\n",
-    "                return json.loads(m.group())\n",
-    "            except Exception:\n",
-    "                pass\n",
     "    return None\n",
     "\n",
     "\n",
-    "N_PROMPTS_PER_WORKFLOW = 20\n",
-    "prompt_rows = []\n",
     "\n",
-    "print(\"Collecting prompts from env resets...\")\n",
-    "for wf in [\"A\", \"B\", \"C\"]:\n",
     "    for _ in range(N_PROMPTS_PER_WORKFLOW):\n",
-    "        result   = httpx.post(f\"{ENV_URL}/reset\", json={\"workflow_id\": wf}).json()\n",
-    "        obs      = result[\"observation\"]\n",
-    "        obs_text = obs_to_text(obs)\n",
-    "        prompt_rows.append({\n",
-    "            \"prompt\":      build_prompt(obs_text),\n",
-    "            \"workflow_id\": wf,\n",
-    "            \"obs_text\":    obs_text,\n",
     "        })\n",
     "\n",
-    "prompt_dataset = Dataset.from_list(prompt_rows)\n",
-    "tlog(f\"[TRAIN_CONFIG] algorithm=GRPO prompts={len(prompt_dataset)} \"\n",
-    "     f\"workflows=A,B,C prompts_per_workflow={N_PROMPTS_PER_WORKFLOW}\")\n",
-    "print(f\"Prompt dataset ready: {len(prompt_dataset)} examples\")"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "sec7",
    "metadata": {},
    "source": [
-    "## 7. Reward Function"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "id": "reward_fn",
    "metadata": {},
    "outputs": [],
    "source": [
-    "def orgos_reward_fn(completions: List[str], prompts: List[str], **kwargs) -> List[float]:\n",
-    "    \"\"\"\n",
-    "    GRPO reward function — called by GRPOTrainer each training step.\n",
-    "    Parses each completion as an action JSON, steps the live env, returns the reward.\n",
-    "    \"\"\"\n",
-    "    workflow_ids = kwargs.get(\"workflow_id\", [\"A\"] * len(completions))\n",
     "    rewards = []\n",
-    "\n",
     "    for completion, wf_id in zip(completions, workflow_ids):\n",
     "        action = parse_action(completion)\n",
     "        if action is None:\n",
     "            rewards.append(-0.1)\n",
     "            continue\n",
     "        try:\n",
-    "            httpx.post(f\"{ENV_URL}/reset\", json={\"workflow_id\": wf_id}, timeout=10)\n",
-    "            result = httpx.post(f\"{ENV_URL}/step\", json=action, timeout=10).json()\n",
-    "            rewards.append(float(result[\"reward\"]))\n",
-    "        except Exception:\n",
     "            rewards.append(-0.1)\n",
-    "\n",
     "    return rewards\n",
     "\n",
-    "\n",
     "# Sanity check\n",
-    "test_r = orgos_reward_fn(\n",
-    "    completions = ['{\"app\": \"zendesk\", \"operation\": \"list_tickets\", \"args\": {\"state\": \"new\"}}',\n",
-    "                   'not json'],\n",
-    "    prompts     = [\"\", \"\"],\n",
-    "    workflow_id = [\"A\", \"A\"],\n",
-    ")\n",
-    "tlog(f\"[REWARD_FN_CHECK] valid_action={test_r[0]:.4f} invalid_action={test_r[1]:.4f}\")"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "sec8",
    "metadata": {},
-   "source": [
-    "## 8. Collect Baseline Scores (Pre-Training)"
-   ]
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "id": "baseline",
    "metadata": {},
    "outputs": [],
    "source": [
-    "FastLanguageModel.for_inference(model)\n",
-    "\n",
-    "\n",
-    "def run_episode_with_model(workflow_id: str, max_steps: int = 15) -> float:\n",
-    "    \"\"\"Run one full episode with the current model. Returns final score.\"\"\"\n",
-    "    result  = httpx.post(f\"{ENV_URL}/reset\", json={\"workflow_id\": workflow_id}).json()\n",
-    "    obs     = result[\"observation\"]\n",
-    "    history = []\n",
-    "\n",
-    "    for _ in range(max_steps):\n",
-    "        if obs[\"done\"]:\n",
-    "            break\n",
-    "\n",
-    "        obs_text = obs_to_text(obs)\n",
-    "        history.append({\"role\": \"user\", \"content\": obs_text})\n",
     "\n",
-    "        messages    = list(history)\n",
-    "        messages[0] = {\"role\": \"user\",\n",
-    "                       \"content\": SYSTEM_PROMPT + \"\\n\\n---\\n\\n\" + messages[0][\"content\"]}\n",
-    "\n",
-    "        text   = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\n",
-    "        inputs = tokenizer(text, return_tensors=\"pt\").to(model.device)\n",
-    "\n",
-    "        with torch.no_grad():\n",
-    "            out = model.generate(\n",
-    "                **inputs,\n",
-    "                max_new_tokens = 256,\n",
-    "                temperature    = 0.0,\n",
-    "                do_sample      = False,\n",
-    "                pad_token_id   = tokenizer.eos_token_id,\n",
-    "            )\n",
-    "        action_str = tokenizer.decode(\n",
-    "            out[0][inputs[\"input_ids\"].shape[1]:], skip_special_tokens=True\n",
-    "        ).strip()\n",
-    "\n",
-    "        history.append({\"role\": \"assistant\", \"content\": action_str})\n",
-    "\n",
-    "        action = parse_action(action_str)\n",
-    "        if action is None:\n",
-    "            break\n",
-    "\n",
-    "        result = httpx.post(f\"{ENV_URL}/step\", json=action).json()\n",
-    "        obs    = result[\"observation\"]\n",
-    "        if obs[\"done\"]:\n",
-    "            break\n",
-    "\n",
-    "    return obs.get(\"current_score\", 0.001)\n",
-    "\n",
-    "\n",
-    "N_EVAL = 10\n",
-    "baseline_scores = {wf: [] for wf in [\"A\", \"B\", \"C\"]}\n",
-    "\n",
-    "tlog(\"[EVAL_START] phase=baseline\")\n",
-    "for wf in [\"A\", \"B\", \"C\"]:\n",
-    "    for ep in range(N_EVAL):\n",
-    "        score = run_episode_with_model(wf)\n",
-    "        baseline_scores[wf].append(score)\n",
-    "        tlog(f\"[EVAL] phase=baseline workflow={wf} episode={ep+1} score={score:.4f}\")\n",
-    "    wf_mean = np.mean(baseline_scores[wf])\n",
-    "    tlog(f\"[EVAL_WORKFLOW] phase=baseline workflow={wf} \"\n",
-    "         f\"mean={wf_mean:.4f} min={min(baseline_scores[wf]):.4f} max={max(baseline_scores[wf]):.4f}\")\n",
-    "\n",
-    "baseline_mean = np.mean([s for v in baseline_scores.values() for s in v])\n",
-    "tlog(f\"[EVAL_END] phase=baseline overall_mean={baseline_mean:.4f}\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "sec9",
-   "metadata": {},
-   "source": [
-    "## 9. GRPO Training"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "id": "grpo_training",
    "metadata": {},
    "outputs": [],
    "source": [
-    "from trl import GRPOConfig, GRPOTrainer\n",
-    "from transformers import TrainerCallback\n",
-    "\n",
-    "model.train()\n",
-    "\n",
-    "NUM_EPOCHS   = 3\n",
-    "BATCH_SIZE   = 4\n",
-    "GRAD_ACCUM   = 2\n",
-    "LR           = 5e-5\n",
-    "NUM_GEN      = 4\n",
-    "TEMPERATURE  = 0.8\n",
-    "BETA         = 0.04\n",
     "\n",
     "grpo_config = GRPOConfig(\n",
-    "    output_dir                  = \"./orgos_grpo_ckpt\",\n",
-    "    num_train_epochs            = NUM_EPOCHS,\n",
-    "    per_device_train_batch_size = BATCH_SIZE,\n",
     "    gradient_accumulation_steps = GRAD_ACCUM,\n",
-    "    learning_rate               = LR,\n",
-    "    warmup_steps                = 10,\n",
-    "    logging_steps               = 5,\n",
-    "    save_steps                  = 100,\n",
-    "    bf16                        = torch.cuda.is_bf16_supported(),\n",
-    "    fp16                        = not torch.cuda.is_bf16_supported(),\n",
-    "    max_grad_norm               = 1.0,\n",
-    "    num_generations             = NUM_GEN,\n",
-    "    max_new_tokens              = 256,\n",
-    "    temperature                 = TEMPERATURE,\n",
-    "    beta                        = BETA,\n",
-    "    report_to                   = \"none\",\n",
-    "    seed                        = 42,\n",
     ")\n",
     "\n",
-    "tlog(f\"[TRAIN_CONFIG] epochs={NUM_EPOCHS} batch_size={BATCH_SIZE} \"\n",
-    "     f\"grad_accum={GRAD_ACCUM} lr={LR} num_generations={NUM_GEN} \"\n",
-    "     f\"temperature={TEMPERATURE} beta_kl={BETA}\")\n",
-    "\n",
-    "\n",
-    "class OrgOSLogCallback(TrainerCallback):\n",
-    "    \"\"\"Logs each training step to training_log.txt.\"\"\"\n",
-    "\n",
-    "    def on_log(self, args, state, control, logs=None, **kwargs):\n",
-    "        if logs is None:\n",
-    "            return\n",
-    "        step        = state.global_step\n",
-    "        loss        = logs.get(\"loss\", logs.get(\"train_loss\", \"?\"))\n",
-    "        mean_reward = logs.get(\"reward\", logs.get(\"mean_reward\", \"?\"))\n",
-    "        kl          = logs.get(\"kl\", logs.get(\"approx_kl\", \"?\"))\n",
-    "        lr_now      = logs.get(\"learning_rate\", \"?\")\n",
-    "\n",
-    "        loss_str   = f\"{loss:.6f}\"        if isinstance(loss, float)        else str(loss)\n",
-    "        reward_str = f\"{mean_reward:.4f}\" if isinstance(mean_reward, float) else str(mean_reward)\n",
-    "        kl_str     = f\"{kl:.6f}\"          if isinstance(kl, float)          else str(kl)\n",
-    "        lr_str     = f\"{lr_now:.2e}\"      if isinstance(lr_now, float)      else str(lr_now)\n",
-    "\n",
-    "        tlog(f\"[TRAIN_STEP] step={step} loss={loss_str} \"\n",
-    "             f\"mean_reward={reward_str} kl={kl_str} lr={lr_str}\")\n",
-    "\n",
-    "\n",
     "trainer = GRPOTrainer(\n",
     "    model            = model,\n",
-    "    args             = grpo_config,\n",
-    "    reward_funcs     = orgos_reward_fn,\n",
-    "    train_dataset    = prompt_dataset,\n",
     "    processing_class = tokenizer,\n",
     "    callbacks        = [OrgOSLogCallback()],\n",
     ")\n",
     "\n",
-    "tlog(\"[TRAIN_START]\")\n",
-    "train_result = trainer.train()\n",
-    "tlog(f\"[TRAIN_END] total_steps={train_result.global_step} \"\n",
-    "     f\"train_loss={train_result.training_loss:.6f} \"\n",
-    "     f\"train_runtime_s={train_result.metrics.get('train_runtime', 0):.1f}\")"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "sec10",
    "metadata": {},
    "source": [
-    "## 10. Collect Post-Training Scores"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "id": "post_training",
    "metadata": {},
    "outputs": [],
    "source": [
     "FastLanguageModel.for_inference(model)\n",
     "\n",
-    "post_scores = {wf: [] for wf in [\"A\", \"B\", \"C\"]}\n",
-    "\n",
-    "tlog(\"[EVAL_START] phase=post_training\")\n",
-    "for wf in [\"A\", \"B\", \"C\"]:\n",
-    "    for ep in range(N_EVAL):\n",
-    "        score = run_episode_with_model(wf)\n",
-    "        post_scores[wf].append(score)\n",
-    "        tlog(f\"[EVAL] phase=post_training workflow={wf} episode={ep+1} score={score:.4f}\")\n",
-    "    wf_mean = np.mean(post_scores[wf])\n",
-    "    tlog(f\"[EVAL_WORKFLOW] phase=post_training workflow={wf} \"\n",
-    "         f\"mean={wf_mean:.4f} min={min(post_scores[wf]):.4f} max={max(post_scores[wf]):.4f}\")\n",
-    "\n",
-    "post_mean   = np.mean([s for v in post_scores.values() for s in v])\n",
-    "improvement = post_mean - baseline_mean\n",
-    "tlog(f\"[EVAL_END] phase=post_training overall_mean={post_mean:.4f}\")\n",
-    "tlog(\n",
-    "    f\"[TRAIN_SUMMARY] \"\n",
-    "    f\"model={MODEL_NAME} algorithm=GRPO \"\n",
-    "    f\"baseline_mean={baseline_mean:.4f} \"\n",
-    "    f\"post_training_mean={post_mean:.4f} \"\n",
-    "    f\"improvement={improvement:+.4f} \"\n",
-    "    f\"workflow_A_before={np.mean(baseline_scores['A']):.4f} \"\n",
-    "    f\"workflow_A_after={np.mean(post_scores['A']):.4f} \"\n",
-    "    f\"workflow_B_before={np.mean(baseline_scores['B']):.4f} \"\n",
-    "    f\"workflow_B_after={np.mean(post_scores['B']):.4f} \"\n",
-    "    f\"workflow_C_before={np.mean(baseline_scores['C']):.4f} \"\n",
-    "    f\"workflow_C_after={np.mean(post_scores['C']):.4f}\"\n",
-    ")\n",
-    "print(f\"\\nImprovement: {baseline_mean:.4f} → {post_mean:.4f} ({improvement:+.4f})\")"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "sec11",
    "metadata": {},
    "source": [
-    "## 11. Plot Before / After"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "id": "plot",
    "metadata": {},
    "outputs": [],
    "source": [
-    "import matplotlib.pyplot as plt\n",
-    "import matplotlib.gridspec as gridspec\n",
-    "\n",
-    "fig = plt.figure(figsize=(14, 8), facecolor=\"#0f172a\")\n",
-    "fig.suptitle(\"OrgOS: Before vs After GRPO Training\", fontsize=15,\n",
-    "             color=\"white\", fontweight=\"bold\", y=0.98)\n",
-    "\n",
-    "gs = gridspec.GridSpec(2, 3, figure=fig, hspace=0.45, wspace=0.35)\n",
-    "COLORS    = {\"before\": \"#f87171\", \"after\": \"#34d399\", \"bg\": \"#1e293b\", \"grid\": \"#334155\"}\n",
-    "WF_LABELS = {\n",
-    "    \"A\": \"Workflow A\\nCustomer Bug Fix\",\n",
-    "    \"B\": \"Workflow B\\nEmployee Onboarding\",\n",
-    "    \"C\": \"Workflow C\\nChurn Risk Alert\",\n",
-    "}\n",
-    "\n",
-    "for col, wf in enumerate([\"A\", \"B\", \"C\"]):\n",
-    "    ax     = fig.add_subplot(gs[0, col])\n",
-    "    ax.set_facecolor(COLORS[\"bg\"])\n",
-    "    ax.grid(color=COLORS[\"grid\"], linewidth=0.5, alpha=0.7)\n",
-    "    before = baseline_scores[wf]\n",
-    "    after  = post_scores[wf]\n",
-    "    delta  = np.mean(after) - np.mean(before)\n",
-    "    ax.plot(before, color=COLORS[\"before\"], linewidth=1.5, alpha=0.8, label=\"Before GRPO\")\n",
-    "    ax.plot(after,  color=COLORS[\"after\"],  linewidth=1.5, alpha=0.8, label=\"After GRPO\")\n",
-    "    ax.axhline(np.mean(before), color=COLORS[\"before\"], linestyle=\"--\", linewidth=1, alpha=0.5)\n",
-    "    ax.axhline(np.mean(after),  color=COLORS[\"after\"],  linestyle=\"--\", linewidth=1, alpha=0.5)\n",
-    "    ax.set_title(WF_LABELS[wf] + f\"\\n(Δ = {delta:+.4f})\", color=\"white\", fontsize=9)\n",
-    "    ax.set_xlabel(\"Episode\", color=\"#94a3b8\", fontsize=8)\n",
-    "    ax.set_ylabel(\"Final Score\", color=\"#94a3b8\", fontsize=8)\n",
-    "    ax.tick_params(colors=\"#64748b\", labelsize=7)\n",
-    "    ax.set_ylim(0, 1)\n",
-    "    ax.legend(fontsize=7, facecolor=\"#1e293b\", labelcolor=\"white\",\n",
-    "              edgecolor=\"#475569\", framealpha=0.8)\n",
-    "    for spine in ax.spines.values():\n",
-    "        spine.set_edgecolor(\"#334155\")\n",
-    "\n",
-    "ax_hist = fig.add_subplot(gs[1, :])\n",
-    "ax_hist.set_facecolor(COLORS[\"bg\"])\n",
-    "ax_hist.grid(color=COLORS[\"grid\"], linewidth=0.5, alpha=0.5, axis=\"x\")\n",
-    "all_before = [s for v in baseline_scores.values() for s in v]\n",
-    "all_after  = [s for v in post_scores.values() for s in v]\n",
-    "bins = np.linspace(0, 1, 25)\n",
-    "ax_hist.hist(all_before, bins=bins, color=COLORS[\"before\"], alpha=0.6,\n",
-    "             label=f\"Before GRPO  (mean={np.mean(all_before):.4f})\", edgecolor=\"none\")\n",
-    "ax_hist.hist(all_after,  bins=bins, color=COLORS[\"after\"],  alpha=0.6,\n",
-    "             label=f\"After GRPO   (mean={np.mean(all_after):.4f})\", edgecolor=\"none\")\n",
-    "ax_hist.axvline(np.mean(all_before), color=COLORS[\"before\"], linestyle=\"--\", linewidth=1.5)\n",
-    "ax_hist.axvline(np.mean(all_after),  color=COLORS[\"after\"],  linestyle=\"--\", linewidth=1.5)\n",
-    "ax_hist.set_title(\"Score Distribution Across All Workflows\", color=\"white\", fontsize=10)\n",
-    "ax_hist.set_xlabel(\"Final Score\", color=\"#94a3b8\", fontsize=9)\n",
-    "ax_hist.set_ylabel(\"Count\", color=\"#94a3b8\", fontsize=9)\n",
-    "ax_hist.tick_params(colors=\"#64748b\", labelsize=8)\n",
-    "ax_hist.legend(fontsize=9, facecolor=\"#1e293b\", labelcolor=\"white\",\n",
-    "               edgecolor=\"#475569\", framealpha=0.9)\n",
-    "for spine in ax_hist.spines.values():\n",
-    "    spine.set_edgecolor(\"#334155\")\n",
-    "\n",
-    "plt.savefig(\"before_after_curves.png\", dpi=150, bbox_inches=\"tight\",\n",
-    "            facecolor=\"#0f172a\", edgecolor=\"none\")\n",
     "plt.show()\n",
-    "tlog(\"[ARTIFACT] file=before_after_curves.png\")\n",
-    "print(\"Saved: before_after_curves.png\")"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "sec12",
    "metadata": {},
    "source": [
-    "## 12. Save LoRA Adapter & Training Log"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "id": "save_model",
    "metadata": {},
    "outputs": [],
    "source": [
-    "model.save_pretrained(\"orgos_lora_adapter\")\n",
-    "tokenizer.save_pretrained(\"orgos_lora_adapter\")\n",
-    "tlog(\"[ARTIFACT] file=orgos_lora_adapter/\")\n",
-    "tlog(\"[ARTIFACT] file=training_log.txt\")\n",
-    "\n",
-    "print(f\"\\n{'='*60}\")\n",
-    "print(\"  Submission artefacts\")\n",
-    "print(f\"{'='*60}\")\n",
-    "print(\"  training_log.txt      — structured training log\")\n",
-    "print(\"  before_after_curves.png — score improvement chart\")\n",
-    "print(\"  orgos_lora_adapter/   — LoRA weights\")\n",
-    "print(f\"{'='*60}\")\n",
-    "\n",
-    "# Optional: push to HuggingFace Hub\n",
-    "# from huggingface_hub import login\n",
-    "# login(token=\"YOUR_HF_TOKEN\")\n",
-    "# model.push_to_hub(\"YOUR_USERNAME/orgos-qwen25-3b-grpo\")\n",
-    "# tokenizer.push_to_hub(\"YOUR_USERNAME/orgos-qwen25-3b-grpo\")"
    ]
   }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "Python 3",
-   "language": "python",
-   "name": "python3"
-  },
-  "language_info": {
-   "name": "python",
-   "version": "3.10.0"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 5
 }

 {
+ "nbformat": 4,
+ "nbformat_minor": 5,
+ "metadata": {
+  "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
+  "language_info": {"name": "python", "version": "3.10.0"},
+  "accelerator": "GPU",
+  "colab": {"gpuType": "T4"}
+ },
  "cells": [
   {
    "cell_type": "markdown",
+   "id": "cell-0",
    "metadata": {},
    "source": [
+    "# OrgOS — GRPO Training on a Multi-App Enterprise RL Environment\n",
+    "\n",
+    "**Project:** OrgOS — an OpenEnv environment that simulates enterprise workflows across **Jira, Zendesk, Salesforce, and Workday** with realistic challenges: schema drift, RBAC, SLA constraints, and policy drift.\n",
+    "\n",
+    "**Goal of this notebook:** Fine-tune `Qwen2.5-3B-Instruct` with **GRPO** (Group Relative Policy Optimization) using **live environment rewards**, then compare the trained agent against the untrained baseline.\n",
+    "\n",
+    "**Hardware:** Colab T4 (free tier, 16 GB VRAM). End-to-end runtime ≈ 45–60 min.\n",
+    "\n",
+    "**Outputs (committed to the repo):**\n",
+    "- `training/training_log.txt` — structured logs (`[TRAIN_CONFIG]`, `[EVAL]`, `[TRAIN_STEP]`, …)\n",
+    "- `training/plots/training_curve.png` — mean reward vs GRPO step\n",
+    "- `training/plots/baseline_vs_trained.png` — per-workflow comparison\n",
+    "- `training/plots/score_distribution.png` — per-episode score distribution\n",
+    "- `training/orgos_lora_adapter/` — trained LoRA weights\n",
+    "\n",
+    "Reviewers can open this notebook on Colab → Runtime → *Run all* and reproduce every artifact end-to-end."
    ]
   },
   {
    "cell_type": "markdown",
+   "id": "cell-1",
    "metadata": {},
+   "source": ["## 1. Setup — install dependencies and clone the repo"]
   },
   {
    "cell_type": "code",
+   "id": "cell-2",
    "metadata": {},
    "outputs": [],
    "source": [
+    "# Pin TRL to the version Unsloth requires BEFORE installing unsloth.\n",
+    "# trl 1.x breaks Unsloth's GRPOTrainer patches — keep it <=0.24.\n",
+    "%pip install -q \"trl>=0.18.2,<=0.24.0\" peft accelerate bitsandbytes datasets\n",
+    "# Install Unsloth after TRL so its patches apply to the right TRL version.\n",
+    "%pip install -q --upgrade unsloth\n",
+    "%pip install -q fastapi 'uvicorn[standard]' pydantic httpx faker openai aiofiles"
    ]
   },
   {
+   "cell_type": "code",
+   "id": "cell-3",
    "metadata": {},
+   "outputs": [],
    "source": [
+    "# Clone the OrgOS dev repo (env server, models, business rules)\n",
+    "import os\n",
+    "REPO_URL = 'https://github.com/Tanvi51204/OpenEnv-Round-2.git'\n",
+    "if not os.path.isdir('/content/OpenEnv-Round-2'):\n",
+    "    !git clone {REPO_URL} /content/OpenEnv-Round-2\n",
+    "%cd /content/OpenEnv-Round-2"
    ]
   },
   {
    "cell_type": "code",
+   "id": "cell-4",
    "metadata": {},
    "outputs": [],
    "source": [
+    "# Imports — keep `import unsloth` first to register its patches.\n",
+    "import unsloth\n",
     "\n",
+    "import json, os, re, sys, time, subprocess\n",
+    "from pathlib import Path\n",
+    "from typing import List\n",
     "\n",
+    "import httpx\n",
+    "import numpy as np\n",
+    "import torch\n",
+    "import matplotlib.pyplot as plt\n",
+    "from datasets import Dataset\n",
+    "from transformers import TrainerCallback\n",
+    "from trl import GRPOConfig, GRPOTrainer\n",
+    "from unsloth import FastLanguageModel\n",
     "\n",
+    "torch.set_float32_matmul_precision('high')"
    ]
   },
   {
    "cell_type": "markdown",
+   "id": "cell-5",
    "metadata": {},
+   "source": ["## 2. Configuration"]
+  },
+  {
+   "cell_type": "code",
+   "id": "cell-6",
+   "metadata": {},
+   "outputs": [],
    "source": [
+    "# ---- Model ----\n",
+    "MODEL_NAME    = 'unsloth/Qwen2.5-3B-Instruct-bnb-4bit'\n",
+    "MAX_SEQ_LEN   = 4096\n",
+    "LORA_R        = 16\n",
+    "LORA_ALPHA    = 16\n",
+    "\n",
+    "# ---- Environment ----\n",
+    "ENV_URL       = 'http://localhost:8000'\n",
+    "WORKFLOWS     = ['A', 'B', 'C']\n",
+    "\n",
+    "# ---- Data / eval ----\n",
+    "N_PROMPTS_PER_WORKFLOW = 20      # 20 × 3 = 60 prompts\n",
+    "N_EVAL_EPISODES        = 5       # episodes per workflow at eval time\n",
+    "MAX_EPISODE_STEPS      = 15\n",
+    "\n",
+    "# ---- GRPO ----\n",
+    "MAX_TRAIN_STEPS        = 150     # more steps for better convergence\n",
+    "NUM_GENERATIONS        = 2       # G = candidates per prompt (keep low for T4 VRAM)\n",
+    "PER_DEVICE_BATCH       = 1\n",
+    "GRAD_ACCUM             = 2       # effective batch = 2 with grad accum\n",
+    "LEARNING_RATE          = 8e-6\n",
+    "MAX_COMPLETION_LENGTH  = 256\n",
+    "REWARD_STEPS           = 2       # multi-step rollout depth in reward fn\n",
+    "\n",
+    "# ---- Drive checkpoint (saves every N steps so Colab disconnects don't lose progress) ----\n",
+    "CKPT_EVERY_STEPS = 30\n",
+    "\n",
+    "# ---- Output paths ----\n",
+    "TRAIN_DIR   = Path('/content/OpenEnv-Round-2/training')\n",
+    "PLOTS_DIR   = TRAIN_DIR / 'plots'\n",
+    "ADAPTER_DIR = TRAIN_DIR / 'orgos_lora_adapter'\n",
+    "LOG_PATH    = TRAIN_DIR / 'training_log.txt'\n",
+    "PLOTS_DIR.mkdir(parents=True, exist_ok=True)"
    ]
   },
   {
    "cell_type": "code",
+   "id": "cell-7",
    "metadata": {},
    "outputs": [],
    "source": [
+    "# Structured logger — every important event goes through this so submission has a clean log.\n",
+    "LOG_PATH.write_text('')   # truncate\n",
     "\n",
+    "def tlog(line: str):\n",
     "    print(line, flush=True)\n",
+    "    with open(LOG_PATH, 'a') as f:\n",
+    "        f.write(line + '\\n')"
    ]
   },
   {
    "cell_type": "markdown",
+   "id": "cell-8",
    "metadata": {},
    "source": [
+    "## 3. Start the OrgOS environment server\n",
+    "\n",
+    "We launch the FastAPI env server (`server/app.py`) as a background subprocess. The reward function and eval loop call it over HTTP at `localhost:8000`."
    ]
   },
   {
    "cell_type": "code",
+   "id": "cell-9",
    "metadata": {},
    "outputs": [],
    "source": [
+    "ENV_PROC = subprocess.Popen(\n",
+    "    [sys.executable, '-m', 'uvicorn', 'server.app:app', '--host', '0.0.0.0', '--port', '8000'],\n",
+    "    cwd='/content/OpenEnv-Round-2',\n",
     "    stdout=subprocess.DEVNULL,\n",
     "    stderr=subprocess.DEVNULL,\n",
     ")\n",
+    "for _ in range(30):\n",
+    "    try:\n",
+    "        r = httpx.get(f'{ENV_URL}/health', timeout=2)\n",
+    "        if r.status_code == 200:\n",
+    "            tlog(f\"[ENV] status={r.json().get('status')} version={r.json().get('version','?')}\")\n",
+    "            break\n",
+    "    except Exception:\n",
+    "        pass  # server not accepting connections yet\n",
+    "    time.sleep(1)  # back off after a failed or non-200 probe\n",
+    "else:\n",
+    "    raise RuntimeError('Env server failed to start')"
    ]
   },
   {
    "cell_type": "markdown",
+   "id": "cell-10",
    "metadata": {},
+   "source": ["## 4. Load model — Qwen2.5-3B-Instruct, 4-bit, with LoRA adapters"]
   },
   {
    "cell_type": "code",
+   "id": "cell-11",
    "metadata": {},
    "outputs": [],
    "source": [
     "model, tokenizer = FastLanguageModel.from_pretrained(\n",
     "    model_name     = MODEL_NAME,\n",
     "    max_seq_length = MAX_SEQ_LEN,\n",
     "model = FastLanguageModel.get_peft_model(\n",
     "    model,\n",
     "    r              = LORA_R,\n",
+    "    lora_alpha     = LORA_ALPHA,\n",
+    "    target_modules = ['q_proj','k_proj','v_proj','o_proj','gate_proj','up_proj','down_proj'],\n",
+    "    use_gradient_checkpointing = 'unsloth',\n",
     ")\n",
     "\n",
+    "# Clear max_length so generate() doesn't warn about max_new_tokens vs max_length conflict.\n",
+    "model.config.max_length = None\n",
+    "\n",
     "trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)\n",
+    "tlog(f'[TRAIN_CONFIG] model={MODEL_NAME} lora_r={LORA_R} max_seq_len={MAX_SEQ_LEN} '\n",
+    "     f'trainable_params={trainable:,} quantization=4bit')"
    ]
   },
   {
    "cell_type": "markdown",
+   "id": "cell-12",
    "metadata": {},
    "source": [
+    "## 5. Helpers — system prompt, observation formatting, action parsing\n",
+    "\n",
+    "The agent gets a **stateless single-turn prompt**: `[SYSTEM_PROMPT] + [observation]` → `[action JSON]`. This matches what GRPO trains on, which is critical for eval/train alignment, and prevents context accumulation over a multi-step episode."
    ]
   },
   {
    "cell_type": "code",
+   "id": "cell-13",
    "metadata": {},
    "outputs": [],
    "source": [
+    "SYSTEM_PROMPT = '''You are OrgOS Agent — an enterprise workflow automation agent.\n",
+    "You operate across four SaaS apps: Jira, Zendesk, Salesforce, and Workday.\n",
     "\n",
+    "Each turn you receive a JSON observation with workflow_goal, pending_steps, app_states,\n",
+    "schema_hints (field renames in effect this episode, e.g. {\"jira.priority\": \"severity\"}),\n",
+    "active_rules, message (feedback from last action), and current_score.\n",
     "\n",
+    "Respond ONLY with a valid JSON object — no markdown, no explanation:\n",
     "  {\"app\": \"<app>\", \"operation\": \"<op>\", \"args\": {...}}\n",
     "\n",
     "Available apps and key operations:\n",
     "  jira:       get_issue, create_issue, update_status, set_priority, assign_owner,\n",
     "              add_label, link_zendesk_ticket, close_issue, list_issues\n",
     "  zendesk:    get_ticket, acknowledge_ticket, set_urgency, assign_agent,\n",
+    "              escalate_to_jira, resolve_ticket, add_note, list_tickets, create_agent_profile\n",
     "  salesforce: get_account, list_accounts, update_deal_stage, flag_churn_risk,\n",
     "              assign_account_owner, log_interaction, get_opportunity\n",
     "  workday:    get_employee, list_employees, provision_access, log_sla_event,\n",
     "              request_budget_approval, create_onboarding_task, complete_task\n",
     "\n",
     "CRITICAL RULES:\n",
+    "1. Read schema_hints FIRST. If \"salesforce.owner\" -> \"rep_email\", use \"rep_email\" not \"owner\".\n",
+    "2. Complete pending_steps in order.\n",
+    "3. Never repeat a failed action unchanged — read the message and adapt.\n",
+    "4. Use list_* operations to discover record IDs.\n",
+    "5. Stop when pending_steps is empty or done=true.'''"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "id": "cell-14",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "WORKFLOW_APPS = {\n",
+    "    'A': {'jira', 'zendesk', 'salesforce', 'workday'},\n",
+    "    'B': {'zendesk', 'salesforce', 'workday'},\n",
+    "    'C': {'jira', 'zendesk', 'salesforce'},\n",
+    "}\n",
     "\n",
     "def obs_to_text(obs: dict) -> str:\n",
+    "    hints   = obs.get('schema_hints', {})\n",
+    "    pending = obs.get('pending_steps', [])\n",
     "    lines = [\n",
     "        f\"current_score: {obs['current_score']}\",\n",
     "        f\"step_count:    {obs['step_count']}\",\n",
     "        f\"workflow_id:   {obs['workflow_id']}\",\n",
+    "        '',\n",
+    "        '=== WORKFLOW GOAL ===' , obs['workflow_goal'], '',\n",
+    "        '=== PENDING STEPS ===',\n",
+    "        '\\n'.join(f'  - {s}' for s in pending) or '  (all steps complete!)',\n",
+    "        '',\n",
+    "        '=== SCHEMA HINTS (use these field names) ===',\n",
+    "        json.dumps(hints, indent=2) if hints else '  (no drift — use canonical names)',\n",
+    "        '',\n",
+    "        '=== ACTIVE RULES ===',\n",
+    "        json.dumps(obs.get('active_rules', {}), indent=2),\n",
+    "        '',\n",
+    "        '=== LAST MESSAGE ===', obs['message'], '',\n",
+    "        '=== APP STATES ===',\n",
     "    ]\n",
+    "    relevant = WORKFLOW_APPS.get(obs.get('workflow_id', 'A'),\n",
+    "                                 {'jira','zendesk','salesforce','workday'})\n",
+    "    for app_name, view in obs.get('app_states', {}).items():\n",
+    "        if app_name not in relevant:\n",
+    "            continue\n",
+    "        view_str = str(view)\n",
+    "        if len(view_str) > 600:\n",
+    "            view_str = view_str[:600] + '...[truncated]'\n",
+    "        lines += [f'  [{app_name.upper()}]', f'  {view_str}', '']\n",
+    "    return '\\n'.join(lines)\n",
     "\n",
     "def parse_action(text: str):\n",
+    "    text = re.sub(r'```(?:json)?\\s*', '', text.strip()).strip()\n",
     "    try:\n",
     "        return json.loads(text)\n",
     "    except json.JSONDecodeError:\n",
+    "        m = re.search(r'\\{.*\\}', text, re.DOTALL)\n",
     "        if m:\n",
+    "            try: return json.loads(m.group())\n",
+    "            except Exception: return None\n",
     "    return None\n",
     "\n",
+    "def build_prompt(obs_text: str) -> str:\n",
+    "    msgs = [{'role': 'user', 'content': SYSTEM_PROMPT + '\\n\\n---\\n\\n' + obs_text}]\n",
+    "    return tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell-15",
+   "metadata": {},
+   "source": [
+    "## 6. Episode runner & evaluator\n",
     "\n",
+    "`run_episode_with_model` is **stateless** — each step sends just `[system + current obs]`, no chat history. This (a) keeps prompts under `MAX_SEQ_LEN`, (b) matches the GRPO training format exactly, and (c) avoids context accumulation across multi-step episodes."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "id": "cell-16",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def run_episode_with_model(workflow_id: str, max_steps: int = MAX_EPISODE_STEPS) -> float:\n",
+    "    obs = httpx.post(f'{ENV_URL}/reset', json={'workflow_id': workflow_id}).json()['observation']\n",
+    "    for _ in range(max_steps):\n",
+    "        if obs['done']:\n",
+    "            break\n",
+    "        prompt = build_prompt(obs_to_text(obs))\n",
+    "        inputs = tokenizer(prompt, return_tensors='pt').to(model.device)\n",
+    "        with torch.no_grad():\n",
+    "            out = model.generate(\n",
+    "                **inputs,\n",
+    "                max_new_tokens = 256,\n",
+    "                do_sample      = False,\n",
+    "                pad_token_id   = tokenizer.eos_token_id,\n",
+    "            )\n",
+    "        action_str = tokenizer.decode(out[0][inputs['input_ids'].shape[1]:],\n",
+    "                                      skip_special_tokens=True).strip()\n",
+    "        action = parse_action(action_str)\n",
+    "        if action is None:\n",
+    "            break\n",
+    "        result = httpx.post(f'{ENV_URL}/step', json=action).json()\n",
+    "        obs    = result['observation']\n",
+    "        if obs['done']:\n",
+    "            break\n",
+    "    return float(obs.get('current_score', 0.001))\n",
+    "\n",
+    "def evaluate(phase: str, n_eval: int = N_EVAL_EPISODES) -> dict:\n",
+    "    scores = {wf: [] for wf in WORKFLOWS}\n",
+    "    tlog(f'[EVAL_START] phase={phase}')\n",
+    "    for wf in WORKFLOWS:\n",
+    "        for ep in range(n_eval):\n",
+    "            s = run_episode_with_model(wf)\n",
+    "            scores[wf].append(s)\n",
+    "            tlog(f'[EVAL] phase={phase} workflow={wf} episode={ep+1} score={s:.4f}')\n",
+    "        m = float(np.mean(scores[wf]))\n",
+    "        tlog(f'[EVAL_WORKFLOW] phase={phase} workflow={wf} '\n",
+    "             f'mean={m:.4f} min={min(scores[wf]):.4f} max={max(scores[wf]):.4f}')\n",
+    "    overall = float(np.mean([s for v in scores.values() for s in v]))\n",
+    "    tlog(f'[EVAL_END] phase={phase} overall_mean={overall:.4f}')\n",
+    "    return scores"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell-17",
+   "metadata": {},
+   "source": [
+    "## 7. Baseline evaluation — *before* training\n",
+    "\n",
+    "This is the untrained Qwen2.5-3B reference point. We will compare against this after GRPO training."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "id": "cell-18",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "FastLanguageModel.for_inference(model)\n",
+    "baseline_scores = evaluate(phase='baseline')\n",
+    "baseline_overall = float(np.mean([s for v in baseline_scores.values() for s in v]))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cell-19",
+   "metadata": {},
+   "source": [
+    "## 8. Build the prompt dataset for GRPO\n",
     "\n",
+    "We collect 60 fresh observations (20 per workflow) by resetting the env. GRPO will sample from this dataset, generate G=2 candidate actions per prompt, score each via the live env, and update the policy from the relative advantages."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "id": "cell-20",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "rows = []\n",
+    "for wf in WORKFLOWS:\n",
     "    for _ in range(N_PROMPTS_PER_WORKFLOW):\n",
+    "        obs = httpx.post(f'{ENV_URL}/reset', json={'workflow_id': wf}).json()['observation']\n",
+    "        rows.append({\n",
+    "            'prompt':      build_prompt(obs_to_text(obs)),\n",
+    "            'workflow_id': wf,\n",
     "        })\n",
+    "prompt_dataset = Dataset.from_list(rows)\n",
+    "tlog(f'[TRAIN_CONFIG] algorithm=GRPO prompts={len(prompt_dataset)} '\n",
+    "     f'workflows={\",\".join(WORKFLOWS)} prompts_per_workflow={N_PROMPTS_PER_WORKFLOW}')\n",
     "\n",
+    "tok_len = len(tokenizer(prompt_dataset[0]['prompt']).input_ids)\n",
+    "tlog(f'[PROMPT_DEBUG] first_prompt_tokens={tok_len}')"
    ]
   },
   {
    "cell_type": "markdown",
+   "id": "cell-21",
    "metadata": {},
    "source": [
+    "## 9. Reward function — multi-step live environment rollout\n",
+    "\n",
+    "For each candidate completion we:\n",
+    "1. Parse it as a JSON action.\n",
+    "2. Reset the env and apply the action (step 1).\n",
+    "3. Continue `REWARD_STEPS-1` more steps with the current model (greedy), accumulating env transitions.\n",
+    "4. Return the **cumulative episode score** — not just single-step reward.\n",
+    "\n",
+    "This multi-step signal prevents the model from collapsing to always outputting `list_tickets` (which gives a small single-step reward but doesn't advance the workflow). Invalid JSON gets a −0.1 penalty."
    ]
   },
   {
    "cell_type": "code",
+   "id": "cell-22",
    "metadata": {},
    "outputs": [],
    "source": [
+    "def orgos_reward_fn(completions: List[str], prompts: List[str] = None, **kwargs) -> List[float]:\n",
+    "    workflow_ids = kwargs.get('workflow_id', ['A'] * len(completions))\n",
     "    rewards = []\n",
     "    for completion, wf_id in zip(completions, workflow_ids):\n",
     "        action = parse_action(completion)\n",
     "        if action is None:\n",
     "            rewards.append(-0.1)\n",
     "            continue\n",
     "        try:\n",
+    "            # Reset env and apply GRPO-generated action (step 1)\n",
+    "            obs = httpx.post(f'{ENV_URL}/reset', json={'workflow_id': wf_id}, timeout=10).json()['observation']\n",
+    "            result = httpx.post(f'{ENV_URL}/step', json=action, timeout=10).json()\n",
+    "            obs = result['observation']\n",
+    "\n",
+    "            # Continue REWARD_STEPS-1 more steps with the current model (greedy)\n",
+    "            for _ in range(REWARD_STEPS - 1):\n",
+    "                if obs.get('done'):\n",
+    "                    break\n",
+    "                prompt = build_prompt(obs_to_text(obs))\n",
+    "                inputs = tokenizer(prompt, return_tensors='pt').to(model.device)\n",
+    "                with torch.no_grad():\n",
+    "                    out = model.generate(\n",
+    "                        **inputs,\n",
+    "                        max_new_tokens = 128,\n",
+    "                        do_sample      = False,\n",
+    "                        pad_token_id   = tokenizer.eos_token_id,\n",
+    "                    )\n",
+    "                cont_str = tokenizer.decode(\n",
+    "                    out[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True\n",
+    "                ).strip()\n",
+    "                cont_action = parse_action(cont_str)\n",
+    "                if cont_action is None:\n",
+    "                    break\n",
+    "                result = httpx.post(f'{ENV_URL}/step', json=cont_action, timeout=10).json()\n",
+    "                obs    = result['observation']\n",
+    "\n",
+    "            # Return cumulative score — rewards multi-step progress, not just single actions\n",
+    "            rewards.append(float(obs.get('current_score', 0.001)))\n",
+    "        except Exception as e:\n",
     "            rewards.append(-0.1)\n",
     "    return rewards\n",
     "\n",
     "# Sanity check\n",
+    "_v = orgos_reward_fn(['{\"app\":\"zendesk\",\"operation\":\"list_tickets\",\"args\":{}}'], workflow_id=['A'])\n",
+    "_i = orgos_reward_fn(['not json'], workflow_id=['A'])\n",
+    "tlog(f'[REWARD_FN_CHECK] valid_action={_v[0]:.4f} invalid_action={_i[0]:.4f}')"
    ]
   },
   {
    "cell_type": "markdown",
+   "id": "cell-23",
    "metadata": {},
+   "source": ["## 10. GRPO training\n",
+    "\n",
+    "We log every training step's reward into `[TRAIN_STEP]` lines so we can plot a meaningful learning curve.\n",
+    "A Drive checkpoint callback saves the adapter every 30 steps so a Colab disconnect doesn't lose progress."]
   },
   {
    "cell_type": "code",
+   "id": "cell-24",
    "metadata": {},
    "outputs": [],
    "source": [
+    "training_step_rewards = []   # captured by callback for the plot\n",
     "\n",
+    "class OrgOSLogCallback(TrainerCallback):\n",
+    "    def on_log(self, args, state, control, logs=None, **kwargs):\n",
+    "        if not logs:\n",
+    "            return\n",
+    "        step   = state.global_step\n",
+    "        reward = next((logs[k] for k in ('reward', 'rewards/orgos_reward_fn', 'reward/mean')\n",
+    "                       if logs.get(k) is not None), None)  # explicit None check keeps a 0.0 reward\n",
+    "        loss   = logs.get('loss')\n",
+    "        kl     = logs.get('kl')\n",
+    "        if reward is not None:\n",
+    "            training_step_rewards.append((step, float(reward)))\n",
+    "            tlog(f'[TRAIN_STEP] step={step} reward={float(reward):.4f} '\n",
+    "                 f\"loss={('%.4f'%loss) if loss is not None else 'NA'} \"\n",
+    "                 f\"kl={('%.4f'%kl) if kl is not None else 'NA'}\")\n",
+    "\n",
+    "    def on_step_end(self, args, state, control, **kwargs):\n",
+    "        \"\"\"Save checkpoint to Drive every CKPT_EVERY_STEPS steps.\"\"\"\n",
+    "        if state.global_step % CKPT_EVERY_STEPS == 0 and state.global_step > 0:\n",
+    "            try:\n",
+    "                from google.colab import drive\n",
+    "                drive.mount('/content/drive', force_remount=False)\n",
+    "                ckpt_dir = Path(f'/content/drive/MyDrive/orgos-training/ckpt_step{state.global_step}')\n",
+    "                ckpt_dir.mkdir(parents=True, exist_ok=True)\n",
+    "                model.save_pretrained(str(ckpt_dir))\n",
+    "                import shutil\n",
+    "                shutil.copy(LOG_PATH, ckpt_dir / 'training_log.txt')\n",
+    "                tlog(f'[CHECKPOINT] step={state.global_step} saved to {ckpt_dir}')\n",
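+    "            except Exception as e:\n",
+    "                # Not on Colab, or Drive unavailable: log it so a missing checkpoint is visible.\n",
+    "                tlog(f'[CHECKPOINT_SKIPPED] step={state.global_step} reason={type(e).__name__}')"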
    ]
   },
   {
    "cell_type": "code",
+   "id": "cell-25",
    "metadata": {},
    "outputs": [],
    "source": [
+    "FastLanguageModel.for_training(model)\n",
     "\n",
+    "# GRPOConfig — using TRL <=0.24 (pinned in cell 2) so max_new_tokens is accepted.\n",
+    "# Unsloth patches this config; max_prompt_length / max_completion_length are NOT supported.\n",
     "grpo_config = GRPOConfig(\n",
+    "    output_dir                  = '/content/grpo_ckpt',\n",
+    "    num_train_epochs            = 1,\n",
+    "    max_steps                   = MAX_TRAIN_STEPS,\n",
+    "    per_device_train_batch_size = PER_DEVICE_BATCH,\n",
     "    gradient_accumulation_steps = GRAD_ACCUM,\n",
+    "    learning_rate               = LEARNING_RATE,\n",
+    "    num_generations             = NUM_GENERATIONS,\n",
+    "    max_new_tokens              = MAX_COMPLETION_LENGTH,\n",
+    "    temperature                 = 0.9,\n",
+    "    logging_steps               = 1,\n",
+    "    save_strategy               = 'no',\n",
+    "    report_to                   = 'none',\n",
+    "    bf16                        = False,\n",
+    "    fp16                        = True,\n",
+    "    optim                       = 'adamw_8bit',\n",
     ")\n",
     "\n",
     "trainer = GRPOTrainer(\n",
     "    model            = model,\n",
     "    processing_class = tokenizer,\n",
+    "    reward_funcs     = [orgos_reward_fn],\n",
+    "    train_dataset    = prompt_dataset,\n",
+    "    args             = grpo_config,\n",
     "    callbacks        = [OrgOSLogCallback()],\n",
     ")\n",
     "\n",
+    "tlog(f'[TRAIN_START] max_steps={MAX_TRAIN_STEPS} G={NUM_GENERATIONS} lr={LEARNING_RATE} reward_steps={REWARD_STEPS}')\n",
+    "trainer.train()\n",
+    "tlog(f'[TRAIN_END] steps_completed={trainer.state.global_step}')"
    ]
   },
   {
    "cell_type": "markdown",
+   "id": "cell-26",
    "metadata": {},
    "source": [
+    "## 11. Post-training evaluation\n",
+    "\n",
+    "Same protocol as the baseline (3 workflows × 5 episodes), so the comparison is apples-to-apples."
    ]
   },
   {
    "cell_type": "code",
+   "id": "cell-27",
    "metadata": {},
    "outputs": [],
    "source": [
     "FastLanguageModel.for_inference(model)\n",
+    "trained_scores = evaluate(phase='trained')\n",
+    "trained_overall = float(np.mean([s for v in trained_scores.values() for s in v]))\n",
     "\n",
+    "tlog('[TRAIN_SUMMARY] '\n",
+    "     f'baseline_overall={baseline_overall:.4f} trained_overall={trained_overall:.4f} '\n",
+    "     f'delta={trained_overall - baseline_overall:+.4f}')"
    ]
   },
   {
    "cell_type": "markdown",
+   "id": "cell-28",
    "metadata": {},
    "source": [
+    "## 12. Plots\n",
+    "\n",
+    "All plots are saved to `training/plots/` and committed to the repo so reviewers can see them in the README."
    ]
   },
   {
    "cell_type": "code",
+   "id": "cell-29",
    "metadata": {},
    "outputs": [],
    "source": [
+    "# 12a. Training curve — mean reward vs GRPO step\n",
+    "if training_step_rewards:\n",
+    "    steps, rewards = zip(*training_step_rewards)\n",
+    "    plt.figure(figsize=(8,5))\n",
+    "    plt.plot(steps, rewards, marker='o', markersize=3, linewidth=1.5, color='tab:blue', label='per-step reward')\n",
+    "    if len(rewards) >= 5:\n",
+    "        win = max(3, len(rewards)//10)\n",
+    "        roll = np.convolve(rewards, np.ones(win)/win, mode='valid')\n",
+    "        plt.plot(steps[win-1:], roll, color='tab:orange', linewidth=2.5, label=f'rolling mean (w={win})')\n",
+    "    plt.xlabel('GRPO training step')\n",
+    "    plt.ylabel('mean reward (per batch)')\n",
+    "    plt.title('OrgOS GRPO training curve — Qwen2.5-3B-Instruct')\n",
+    "    plt.legend()\n",
+    "    plt.grid(alpha=0.3)\n",
+    "    plt.tight_layout()\n",
+    "    plt.savefig(PLOTS_DIR / 'training_curve.png', dpi=150)\n",
+    "    plt.show()\n",
+    "    tlog('[ARTIFACT] training_curve.png saved')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "id": "cell-30",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 12b. Baseline vs trained — grouped bar per workflow\n",
+    "x = np.arange(len(WORKFLOWS))\n",
+    "width = 0.35\n",
+    "baseline_means = [np.mean(baseline_scores[wf]) for wf in WORKFLOWS]\n",
+    "trained_means  = [np.mean(trained_scores[wf])  for wf in WORKFLOWS]\n",
+    "\n",
+    "fig, ax = plt.subplots(figsize=(8,5))\n",
+    "ax.bar(x - width/2, baseline_means, width, label='baseline (untrained)', color='tab:gray')\n",
+    "ax.bar(x + width/2, trained_means,  width, label='GRPO-trained',         color='tab:green')\n",
+    "ax.set_xticks(x)\n",
+    "ax.set_xticklabels([f'Workflow {wf}' for wf in WORKFLOWS])\n",
+    "ax.set_ylabel('mean episode score (0–1)')\n",
+    "ax.set_ylim(0, 1)\n",
+    "ax.set_title(f'Baseline vs GRPO-trained — overall {baseline_overall:.3f} → {trained_overall:.3f}')\n",
+    "ax.legend()\n",
+    "ax.grid(axis='y', alpha=0.3)\n",
+    "for i, (b, t) in enumerate(zip(baseline_means, trained_means)):\n",
+    "    ax.text(i - width/2, b + 0.01, f'{b:.2f}', ha='center', fontsize=9)\n",
+    "    ax.text(i + width/2, t + 0.01, f'{t:.2f}', ha='center', fontsize=9)\n",
+    "plt.tight_layout()\n",
+    "plt.savefig(PLOTS_DIR / 'baseline_vs_trained.png', dpi=150)\n",
     "plt.show()\n",
+    "tlog('[ARTIFACT] baseline_vs_trained.png saved')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "id": "cell-31",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 12c. Per-episode score distribution (box plot)\n",
+    "fig, ax = plt.subplots(figsize=(9,5))\n",
+    "data, labels, colors = [], [], []\n",
+    "for wf in WORKFLOWS:\n",
+    "    data += [baseline_scores[wf], trained_scores[wf]]\n",
+    "    labels += [f'{wf} (base)', f'{wf} (trained)']\n",
+    "    colors += ['lightgray', 'lightgreen']\n",
+    "bp = ax.boxplot(data, labels=labels, patch_artist=True)\n",
+    "for patch, c in zip(bp['boxes'], colors):\n",
+    "    patch.set_facecolor(c)\n",
+    "ax.set_ylabel('episode score (0–1)')\n",
+    "ax.set_title('Per-episode score distribution — baseline vs GRPO-trained')\n",
+    "ax.grid(axis='y', alpha=0.3)\n",
+    "plt.tight_layout()\n",
+    "plt.savefig(PLOTS_DIR / 'score_distribution.png', dpi=150)\n",
+    "plt.show()\n",
+    "tlog('[ARTIFACT] score_distribution.png saved')"
    ]
   },
   {
    "cell_type": "markdown",
+   "id": "cell-32",
    "metadata": {},
    "source": [
+    "## 13. Save artifacts\n",
+    "\n",
+    "Saves the LoRA adapter and copies all artifacts to Google Drive so they survive a Colab disconnect."
    ]
   },
   {
    "cell_type": "code",
+   "id": "cell-33",
    "metadata": {},
    "outputs": [],
    "source": [
+    "model.save_pretrained(str(ADAPTER_DIR))\n",
+    "tokenizer.save_pretrained(str(ADAPTER_DIR))\n",
+    "tlog(f'[ARTIFACT] lora_adapter saved to {ADAPTER_DIR}')\n",
+    "\n",
+    "try:\n",
+    "    from google.colab import drive\n",
+    "    drive.mount('/content/drive', force_remount=False)\n",
+    "    DRIVE_DIR = Path('/content/drive/MyDrive/orgos-training')\n",
+    "    DRIVE_DIR.mkdir(parents=True, exist_ok=True)\n",
+    "    !cp    {LOG_PATH}    {DRIVE_DIR}/\n",
+    "    !cp -r {PLOTS_DIR}   {DRIVE_DIR}/\n",
+    "    !cp -r {ADAPTER_DIR} {DRIVE_DIR}/\n",
+    "    print(f'Artifacts copied to {DRIVE_DIR}')\n",
+    "except ImportError:\n",
+    "    print('Not on Colab — skipping Drive copy')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "id": "cell-34",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Stop the env server cleanly\n",
+    "ENV_PROC.terminate()\n",
+    "ENV_PROC.wait(timeout=10)   # reap the subprocess after SIGTERM\n",
+    "tlog('[RUN_END]')\n",
+    "print('\\nDone. Commit these to the repo:')\n",
+    "print(f'  - {LOG_PATH}')\n",
+    "print(f'  - {PLOTS_DIR}/*.png')\n",
+    "print(f'  - {ADAPTER_DIR}/')"
    ]
   }
+ ]
 }