rtferraz committed on
Commit c9b11b9 · verified · 1 Parent(s): a62f1dc

Upload grpo_vertex_v3.ipynb

Files changed (1)
  1. notebooks/grpo_vertex_v3.ipynb +1674 -0
notebooks/grpo_vertex_v3.ipynb ADDED
@@ -0,0 +1,1674 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# Tucano2 Commerce — GRPO Training v3 (Vertex AI Workbench / L4)\n",
8
+ "\n",
9
+ "**v3 changes over v2 — grounded in published research:**\n",
10
+ "\n",
11
+ "| Change | v2 Value | v3 Value | Paper Reference |\n",
12
+ "|--------|----------|----------|----------------|\n",
13
+ "| Temperature | 0.8 | **1.0** | Skywork-OR1 (2505.22312) §4: τ=1.0 gives 5-8% better results, delays entropy collapse |\n",
14
+ "| Completion length | 2048 | **4096** | Dr. GRPO (2503.20783) §3.1: length bias inflates wrong answers → ceiling hit blocks learning |\n",
15
+ "| Num generations | 8 | **4** | VRAM tradeoff: 4×4096 ≈ 8×2048. MC-GRPO (2601.22582): G=4 works with noise mitigation |\n",
16
+ "| Learning rate | 5e-7 | **2e-6** | Dr. GRPO Appendix G: LR=1e-6; Reasoning-SQL: LR=1e-6. v2 clip_ratio=0 → room to push 2-4× |\n",
17
+ "| β (KL penalty) | implicit | **0.0** | Dr. GRPO §3.2: β=0 optimal for rule-based rewards |\n",
18
+ "| Training data | 300 | **ALL (~1400)** | Skywork-OR1 §3.1: small prompt sets → model memorizes → entropy collapse |\n",
19
+ "| Reward functions | single composite | **staged (format→partial→task)** | Reasoning-SQL (2503.23157) §3.2: format rewards converge first, enable task learning |\n",
20
+ "| Zero-advantage groups | included | **filtered with noise injection** | Skywork-OR1 §3.1: zero-std groups destabilize training |\n",
21
+ "| Entropy monitoring | none | **EntropyMonitorCallback** | Skywork-OR1 §4: early detection prevents collapse |\n",
22
+ "| Early stopping patience | 10 | **15** | More runway for longer completions |\n",
23
+ "| Save total limit | 3 | **5** | Keep more checkpoints — v2 lost the best one |\n",
24
+ "| Eval temperature | 0.7 | **0.1** | Deterministic eval = less noisy signal |\n",
25
+ "| General reasoning mix | none | **30% (optional)** | Cocktail Effect (2410.01109): multi-task mix boosts domain performance 2-15% |\n",
26
+ "\n",
27
+ "**Prerequisites:**\n",
28
+ "- Upload `data/pairs/train.jsonl` (2.1 MB) to `./data/pairs/`\n",
29
+ "- Upload `models/tucano2-commerce-sft/` (126 MB) to `./models/tucano2-commerce-sft/`\n",
30
+ "- **NEW:** Optional `data/pairs/general_reasoning.jsonl` for 30% general data mix\n",
31
+ "\n",
32
+ "**Hardware:** L4 (24GB), PyTorch kernel, bf16 supported\n",
33
+ "\n",
34
+ "---\n",
35
+ "\n",
36
+ "## Cell 1: Dependencies\n",
37
+ "\n",
38
+ "Restart your kernel first (Kernel → Restart), then run these cells in order, one at a time:"
39
+ ]
40
+ },
41
+ {
42
+ "cell_type": "code",
43
+ "execution_count": null,
44
+ "metadata": {},
45
+ "outputs": [],
46
+ "source": [
47
+ "# Cell 1a — Nuke everything ML-related\n",
48
+ "!pip uninstall -y torch torchvision torchaudio \\\n",
49
+ " unsloth unsloth-zoo \\\n",
50
+ " trl transformers peft accelerate \\\n",
51
+ " bitsandbytes vllm vllm-flash-attn \\\n",
52
+ " datasets tokenizers safetensors huggingface-hub \\\n",
53
+ " wandb xformers triton \\\n",
54
+ " cuda-bindings cuda-python \\\n",
55
+ " sentencepiece protobuf \\\n",
56
+ " 2>/dev/null"
57
+ ]
58
+ },
59
+ {
60
+ "cell_type": "code",
61
+ "execution_count": null,
62
+ "metadata": {},
63
+ "outputs": [],
64
+ "source": [
65
+ "# Cell 1b — Kill any stragglers\n",
66
+ "!pip freeze | grep -iE \"torch|unsloth|trl|vllm|bitsandbytes|transformers|peft|accelerate\" | xargs pip uninstall -y 2>/dev/null"
67
+ ]
68
+ },
69
+ {
70
+ "cell_type": "code",
71
+ "execution_count": null,
72
+ "metadata": {},
73
+ "outputs": [],
74
+ "source": [
75
+ "# Cell 1c — Purge cache\n",
76
+ "!pip cache purge"
77
+ ]
78
+ },
79
+ {
80
+ "cell_type": "markdown",
81
+ "metadata": {},
82
+ "source": [
83
+ "**⚠️ Restart kernel again**, then:"
84
+ ]
85
+ },
86
+ {
87
+ "cell_type": "code",
88
+ "execution_count": null,
89
+ "metadata": {},
90
+ "outputs": [],
91
+ "source": [
92
+ "# Cell 1d — Clean install, correct order\n",
93
+ "!pip install \"unsloth\""
94
+ ]
95
+ },
96
+ {
97
+ "cell_type": "code",
98
+ "execution_count": null,
99
+ "metadata": {},
100
+ "outputs": [],
101
+ "source": [
102
+ "# Cell 1e — Pin TRL (Unsloth may pull a different version)\n",
103
+ "!pip install \"trl==0.24.0\" --no-deps"
104
+ ]
105
+ },
106
+ {
107
+ "cell_type": "code",
108
+ "execution_count": null,
109
+ "metadata": {},
110
+ "outputs": [],
111
+ "source": [
112
+ "# Cell 1f — Extra deps\n",
113
+ "!pip install \"rich\" \"wandb\""
114
+ ]
115
+ },
116
+ {
117
+ "cell_type": "markdown",
118
+ "metadata": {},
119
+ "source": [
120
+ "---\n",
121
+ "\n",
122
+ "## Cell 2: Hello World — GPU + Unsloth Verification"
123
+ ]
124
+ },
125
+ {
126
+ "cell_type": "code",
127
+ "execution_count": null,
128
+ "metadata": {},
129
+ "outputs": [],
130
+ "source": [
131
+ "import torch\n",
132
+ "\n",
133
+ "print(f\"CUDA available: {torch.cuda.is_available()}\")\n",
134
+ "print(f\"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB\")\n",
135
+ "print(f\"VRAM: {torch.cuda.get_device_properties(0).total_mem / 1e9:.1f} GB\")\n",
136
+ "print(f\"bf16 support: {torch.cuda.is_bf16_supported()}\")\n",
137
+ "\n",
138
+ "from unsloth import FastLanguageModel\n",
139
+ "print(\"\\n✓ Unsloth loaded successfully\")\n",
140
+ "\n",
141
+ "import trl\n",
142
+ "print(f\"✓ TRL version: {trl.__version__}\")\n",
143
+ "\n",
144
+ "import transformers\n",
145
+ "print(f\"✓ Transformers version: {transformers.__version__}\")"
146
+ ]
147
+ },
148
+ {
149
+ "cell_type": "markdown",
150
+ "metadata": {},
151
+ "source": [
152
+ "---\n",
153
+ "\n",
154
+ "## Cell 3: Config + Constants"
155
+ ]
156
+ },
157
+ {
158
+ "cell_type": "code",
159
+ "execution_count": null,
160
+ "id": "20160257",
161
+ "metadata": {},
162
+ "outputs": [],
163
+ "source": [
164
+ "import os\n",
165
+ "os.environ[\"UNSLOTH_COMPILE_DISABLE\"] = \"1\"\n",
166
+ "\n",
167
+ "import json\n",
168
+ "import re\n",
169
+ "import time\n",
170
+ "import random\n",
171
+ "import gc\n",
172
+ "from pathlib import Path\n",
173
+ "\n",
174
+ "# ══════════════════════════════════════════════════════════════════════════════\n",
175
+ "# v3 CONFIG — Every change is annotated with paper reference\n",
176
+ "# ══════════════════════════════════════════════════════════════════════════════\n",
177
+ "\n",
178
+ "MODEL_ID = \"Polygl0t/Tucano2-qwen-3.7B-Think\"\n",
179
+ "MAX_SEQ_LENGTH = 8192 # v3: increased from 4096 — model supports 32k, we need room for 4096 completion + prompt\n",
180
+ "\n",
181
+ "# ── Paths ─────────────────────────────────────────────────────────────────────\n",
182
+ "DATA_DIR = Path(\"/home/jupyter/tucano2/data\")\n",
183
+ "MODELS_DIR = Path(\"/home/jupyter/tucano2/models\")\n",
184
+ "SFT_ADAPTER_DIR = MODELS_DIR / \"tucano2-commerce-sft\"\n",
185
+ "GRPO_ADAPTER_DIR = MODELS_DIR / \"tucano2-commerce-grpo-v3\" # v3: separate dir from v2\n",
186
+ "CHECKPOINT_DIR = GRPO_ADAPTER_DIR / \"checkpoints\"\n",
187
+ "\n",
188
+ "# ── Training data ─────────────────────────────────────────────────────────────\n",
189
+ "GRPO_PROMPTS = None # v3: None = use ALL available prompts (was 300 subset in v2)\n",
190
+ "GENERAL_MIX_RATIO = 0.0 # v3: set to 0.3 if general_reasoning.jsonl exists (Cocktail Effect paper)\n",
191
+ "\n",
192
+ "# ── Valid enums for reward scoring (unchanged from v2) ────────────────────────\n",
193
+ "VALID_SENTIMENTS = {\"positive\", \"negative\", \"neutral\"}\n",
194
+ "VALID_CATEGORIES = {\n",
195
+ " \"delivery_delay\", \"product_quality\", \"product_not_received\",\n",
196
+ " \"wrong_product\", \"seller_communication\", \"app_issue\",\n",
197
+ " \"price_value\", \"other\", \"none\",\n",
198
+ "}\n",
199
+ "VALID_CHURN = {\"low\", \"medium\", \"high\"}\n",
200
+ "VALID_REPEAT = {\"yes\", \"no\", \"maybe\"}\n",
201
+ "EXTRACTION_FIELDS = [\n",
202
+ " \"sentiment\", \"sentiment_score\", \"churn_risk\", \"delivery_issue\",\n",
203
+ " \"product_issue\", \"seller_issue\", \"main_complaint\",\n",
204
+ " \"complaint_category\", \"repeat_intent\", \"would_recommend\",\n",
205
+ "]\n",
206
+ "\n",
207
+ "SYSTEM_PT = (\n",
208
+ " \"Você é um assistente de IA especializado em análise de e-commerce brasileiro. \"\n",
209
+ " \"Você compreende avaliações de clientes em português e padrões de comércio brasileiro.\"\n",
210
+ ")\n",
211
+ "\n",
212
+ "# ══════════════════════════════════════════════════════════════════════════════\n",
213
+ "# TRAINING HYPERPARAMETERS — v3 fixes (all changes annotated)\n",
214
+ "# ══════════════════════════════════════════════════════════════════════════════\n",
215
+ "\n",
216
+ "# ── Core GRPO params ──────────────────────────────────────────────────────────\n",
217
+ "BATCH_SIZE = 4\n",
218
+ "GRAD_ACCUM = 1 # v3: reduced from 2. Effective batch = 4×1 = 4 (was 8)\n",
219
+ " # With G=4: steps = prompts × 4 / 4 = prompts per epoch\n",
220
+ "NUM_GENERATIONS = 4 # v3: reduced from 8 — VRAM tradeoff for longer completions\n",
221
+ " # MC-GRPO (2601.22582): G=4 works if noise is mitigated\n",
222
+ "SCALE_REWARDS = False # Dr. GRPO (2503.20783): remove std normalization bias\n",
223
+ "\n",
224
+ "# ── v3 CRITICAL FIXES ─────────────────────────────────────────────────────────\n",
225
+ "\n",
226
+ "# FIX 1: Temperature — prevent entropy collapse\n",
227
+ "# v2 had 0.8. All published GRPO papers use 1.0.\n",
228
+ "# Skywork-OR1 (2505.22312) ablation: τ=1.0 vs τ=0.6 → 5-8% better test performance\n",
229
+ "TEMPERATURE = 1.0\n",
230
+ "\n",
231
+ "# FIX 2: Completion length — remove the ceiling\n",
232
+ "# v2: every single completion hit 2048 ceiling. Model couldn't finish reasoning.\n",
233
+ "# Dr. GRPO (2503.20783) §3.1: GRPO length bias inflates wrong answers → hitting the ceiling kills the gradient\n",
234
+ "MAX_COMPLETION_LENGTH = 4096\n",
235
+ "\n",
236
+ "# FIX 3: Learning rate — more aggressive\n",
237
+ "# v2: clip_ratio=0 on all steps → updates were too small to matter\n",
238
+ "# Dr. GRPO Appendix G: LR=1e-6 (constant). Reasoning-SQL: LR=1e-6 with cosine.\n",
239
+ "# We go 2× since v2 showed zero clipping (model can absorb stronger push)\n",
240
+ "LEARNING_RATE = 2e-6\n",
241
+ "\n",
242
+ "# FIX 4: β = 0 (no KL penalty)\n",
243
+ "# Dr. GRPO (2503.20783) §3.2: KL penalty is unnecessary for rule-based rewards\n",
244
+ "# v2 used implicit KL through default β — we explicitly disable it\n",
245
+ "BETA = 0.0\n",
246
+ "\n",
247
+ "# ── Training schedule ─────────────────────────────────────────────────────────\n",
248
+ "NUM_EPOCHS = 1\n",
249
+ "MAX_STEPS = 500 # v3: increased for expanded data; early stopping will halt if needed\n",
250
+ " # With ~1400 prompts × 4 gen / (4 batch × 1 accum) = 1400 steps/epoch\n",
251
+ " # MAX_STEPS=500 < 1 epoch — early stopping or manual extension\n",
252
+ "\n",
253
+ "# ── Checkpoint + Eval + Early-Stop ────────────────────────────────────────────\n",
254
+ "EVAL_SPLIT_RATIO = 0.15\n",
255
+ "EVAL_STEPS = 10\n",
256
+ "EARLY_STOPPING_PATIENCE = 15 # v3: increased from 10 — gives 150 steps of runway\n",
257
+ "EARLY_STOPPING_DELTA = 0.005 # v3: reduced from 0.01 — more sensitive to small gains\n",
258
+ "SAVE_STEPS = 10 # v3: more frequent (was 15) — never lose best checkpoint again\n",
259
+ "SAVE_TOTAL_LIMIT = 5 # v3: keep more checkpoints (was 3 — lost best in v2)\n",
260
+ "WANDB_PROJECT = \"tucano2-commerce\"\n",
261
+ "\n",
262
+ "# ── Eval callback ─────────────────────────────────────────────────────────────\n",
263
+ "EVAL_MAX_SAMPLES = 5\n",
264
+ "EVAL_MAX_TOKENS = 4096 # v3: match training max_completion_length (was 2048)\n",
265
+ "EVAL_TEMPERATURE = 0.1 # v3: deterministic eval for less noisy signal (was 0.7)\n",
266
+ "\n",
267
+ "# ── Backend ───────────────────────────────────────────────────────────────────\n",
268
+ "USE_VLLM = False\n",
269
+ "\n",
270
+ "# ── v3: Zero-advantage noise injection ────────────────────────────────────────\n",
271
+ "# Skywork-OR1 (2505.22312) §3.1: zero-std groups destabilize GRPO training\n",
272
+ "# When all G completions get identical rewards, the advantage is undefined.\n",
273
+ "# Noise injection breaks ties without corrupting the signal.\n",
274
+ "ZERO_ADV_NOISE_STD = 0.005 # Small gaussian noise added to zero-variance groups\n",
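+ "# Illustration only (not the trainer's code): GRPO's per-completion advantage is\n",
+ "# A_i = r_i - mean(r) with scale_rewards=False (Dr. GRPO), or\n",
+ "# A_i = (r_i - mean(r)) / (std(r) + eps) with scale_rewards=True.\n",
+ "# If every r_i in a group is identical, all A_i are exactly 0 and the group\n",
+ "# contributes no gradient; the small noise above breaks such ties.\n",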
275
+ "\n",
276
+ "os.environ[\"PYTORCH_CUDA_ALLOC_CONF\"] = \"expandable_segments:True\"\n",
277
+ "\n",
278
+ "# ── Version assertion ─────────────────────────────────────────────────────────\n",
279
+ "import trl as _trl\n",
280
+ "assert _trl.__version__ == \"0.24.0\", (\n",
281
+ " f\"UnslothGRPOTrainer was written for TRL 0.24.0, found {_trl.__version__}.\\n\"\n",
282
+ " \"Verify that GRPOTrainer._generate() still exists before proceeding.\"\n",
283
+ ")\n",
284
+ "\n",
285
+ "# ══════════════════════════════════════════════════════════════════════════════\n",
286
+ "# W&B EARLY INIT — all cells log here, outputs survive notebook disconnect\n",
287
+ "# Prompts once for API key, then caches in ~/.netrc for future runs.\n",
288
+ "# ══════════════════════════════════════════════════════════════════════════════\n",
289
+ "import wandb\n",
290
+ "\n",
291
+ "wandb.login() # reads from ~/.netrc if cached, prompts interactively if not\n",
292
+ "\n",
293
+ "RUN_NAME = f\"grpo-v3-l4-{time.strftime('%Y%m%d-%H%M')}\"\n",
294
+ "wandb.init(\n",
295
+ " project=WANDB_PROJECT,\n",
296
+ " name=RUN_NAME,\n",
297
+ " config={\n",
298
+ " \"model_id\": MODEL_ID,\n",
299
+ " \"version\": \"v3\",\n",
300
+ " \"temperature\": TEMPERATURE,\n",
301
+ " \"max_completion_length\": MAX_COMPLETION_LENGTH,\n",
302
+ " \"num_generations\": NUM_GENERATIONS,\n",
303
+ " \"learning_rate\": LEARNING_RATE,\n",
304
+ " \"beta\": BETA,\n",
305
+ " \"batch_size\": BATCH_SIZE,\n",
306
+ " \"grad_accum\": GRAD_ACCUM,\n",
307
+ " \"max_steps\": MAX_STEPS,\n",
308
+ " \"scale_rewards\": SCALE_REWARDS,\n",
309
+ " \"save_steps\": SAVE_STEPS,\n",
310
+ " \"eval_steps\": EVAL_STEPS,\n",
311
+ " \"eval_max_samples\": EVAL_MAX_SAMPLES,\n",
312
+ " \"eval_max_tokens\": EVAL_MAX_TOKENS,\n",
313
+ " \"eval_temperature\": EVAL_TEMPERATURE,\n",
314
+ " \"patience\": EARLY_STOPPING_PATIENCE,\n",
315
+ " \"delta\": EARLY_STOPPING_DELTA,\n",
316
+ " \"zero_adv_noise_std\": ZERO_ADV_NOISE_STD,\n",
317
+ " \"general_mix_ratio\": GENERAL_MIX_RATIO,\n",
318
+ " \"_ref_temperature\": \"Skywork-OR1 (2505.22312)\",\n",
319
+ " \"_ref_completion_length\": \"Dr. GRPO (2503.20783)\",\n",
320
+ " \"_ref_staged_rewards\": \"Reasoning-SQL (2503.23157)\",\n",
321
+ " \"_ref_zero_adv\": \"Skywork-OR1 (2505.22312)\",\n",
322
+ " },\n",
323
+ ")\n",
324
+ "\n",
325
+ "print(\"✓ v3 Config loaded\")\n",
326
+ "print(f\" SFT adapter: {SFT_ADAPTER_DIR} (exists: {SFT_ADAPTER_DIR.exists()})\")\n",
327
+ "print(f\" Train data: {DATA_DIR / 'pairs' / 'train.jsonl'} (exists: {(DATA_DIR / 'pairs' / 'train.jsonl').exists()})\")\n",
328
+ "print(f\" Training: batch={BATCH_SIZE}, grad_accum={GRAD_ACCUM}, eff_batch={BATCH_SIZE*GRAD_ACCUM}\")\n",
329
+ "print(f\" GRPO: G={NUM_GENERATIONS}, temp={TEMPERATURE}, LR={LEARNING_RATE}, β={BETA}\")\n",
330
+ "print(f\" Completion: max={MAX_COMPLETION_LENGTH} (v2 was 2048)\")\n",
331
+ "print(f\" ADR: save_steps={SAVE_STEPS}, eval_steps={EVAL_STEPS}, patience={EARLY_STOPPING_PATIENCE}\")\n",
332
+ "print(f\"✓ TRL {_trl.__version__} verified\")\n",
333
+ "print(f\"✓ W&B run: {wandb.run.url}\")\n",
334
+ "\n",
335
+ "# ══════════════════════════════════════════════════════════════════════════════\n",
336
+ "# v3 VRAM BUDGET (L4 24GB)\n",
337
+ "# ══════════════════════════════════════════════════════════════════════════════\n",
338
+ "# Model (NF4): ~3.5 GB\n",
339
+ "# KV Cache (8192 seq): ~3.0 GB\n",
340
+ "# Activations: ~4.0 GB\n",
341
+ "# Optimizer states: ~3.0 GB\n",
342
+ "# Generations (4×4096): ~8.0 GB\n",
343
+ "# ─────────────────────────────────\n",
344
+ "# Estimated total: ~21.5 GB\n",
345
+ "# Headroom: ~2.5 GB\n",
346
+ "#\n",
347
+ "# If OOM: reduce MAX_COMPLETION_LENGTH to 3072 first, then 2560.\n",
348
+ "# Do NOT reduce NUM_GENERATIONS below 4 — GRPO needs variance.\n",
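+ "# (the group mean is the only baseline; with fewer than 4 samples per prompt\n",
+ "# it is too noisy to tell good completions from bad ones)\n",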
349
+ "# ══════════════════════════════════════════════════════════════════════════════"
350
+ ]
351
+ },
352
+ {
353
+ "cell_type": "markdown",
354
+ "metadata": {},
355
+ "source": [
356
+ "---\n",
357
+ "\n",
358
+ "## Cell 4: Load SFT Adapter"
359
+ ]
360
+ },
361
+ {
362
+ "cell_type": "code",
363
+ "execution_count": null,
364
+ "metadata": {},
365
+ "outputs": [],
366
+ "source": [
367
+ "print(\"Loading SFT adapter...\")\n",
368
+ "model, tokenizer = FastLanguageModel.from_pretrained(\n",
369
+ " model_name=str(SFT_ADAPTER_DIR),\n",
370
+ " max_seq_length=MAX_SEQ_LENGTH,\n",
371
+ " load_in_4bit=True,\n",
372
+ " dtype=None,\n",
373
+ ")\n",
374
+ "\n",
375
+ "if tokenizer.pad_token is None:\n",
376
+ " tokenizer.pad_token = tokenizer.eos_token\n",
377
+ "\n",
378
+ "# Load chat template from base model (SFT adapter doesn't save it)\n",
379
+ "from transformers import AutoTokenizer\n",
380
+ "base_tok = AutoTokenizer.from_pretrained(MODEL_ID)\n",
381
+ "tokenizer.chat_template = base_tok.chat_template\n",
382
+ "del base_tok\n",
383
+ "\n",
384
+ "# Carried over from v2: force KV cache — Unsloth patching may reset it\n",
385
+ "model.config.use_cache = True\n",
386
+ "model.generation_config.use_cache = True\n",
387
+ "\n",
388
+ "print(f\"✓ Model loaded on {model.device}\")\n",
389
+ "print(f\" use_cache: {model.config.use_cache}\")\n",
390
+ "print(f\" Params: {sum(p.numel() for p in model.parameters()) / 1e6:.0f}M\")\n",
391
+ "print(f\" Chat template: {tokenizer.chat_template[:50]}...\")"
392
+ ]
393
+ },
394
+ {
395
+ "cell_type": "markdown",
396
+ "metadata": {},
397
+ "source": [
398
+ "---\n",
399
+ "\n",
400
+ "## Cell 5: Single Inference Test\n",
401
+ "\n",
402
+ "**Gate:** Does the model close `</think>` and produce an answer within 4096 tokens?"
403
+ ]
404
+ },
405
+ {
406
+ "cell_type": "code",
407
+ "execution_count": null,
408
+ "metadata": {},
409
+ "outputs": [],
410
+ "source": [
411
+ "FastLanguageModel.for_inference(model)\n",
412
+ "\n",
413
+ "test_msgs = [\n",
414
+ " {\"role\": \"system\", \"content\": SYSTEM_PT},\n",
415
+ " {\"role\": \"user\", \"content\": \"Quais são as categorias de reclamação mais frequentes e como afetam a nota média?\"},\n",
416
+ "]\n",
417
+ "text = tokenizer.apply_chat_template(test_msgs, tokenize=False, add_generation_prompt=True)\n",
418
+ "inputs = tokenizer(text, return_tensors=\"pt\").to(model.device)\n",
419
+ "\n",
420
+ "t0 = time.time()\n",
421
+ "outputs = model.generate(**inputs, max_new_tokens=MAX_COMPLETION_LENGTH, temperature=0.7, do_sample=True)\n",
422
+ "elapsed = time.time() - t0\n",
423
+ "\n",
424
+ "response = tokenizer.decode(outputs[0][inputs[\"input_ids\"].shape[1]:], skip_special_tokens=True)\n",
425
+ "gen_tokens = outputs.shape[1] - inputs[\"input_ids\"].shape[1]\n",
426
+ "\n",
427
+ "print(f\"Generation time: {elapsed:.1f}s ({gen_tokens} tokens, {gen_tokens/elapsed:.1f} tok/s)\")\n",
428
+ "print(f\"Response length: {len(response)} chars, {gen_tokens} tokens\")\n",
429
+ "print(f\"Hit ceiling: {gen_tokens >= MAX_COMPLETION_LENGTH}\") # v3: should NOT hit ceiling with 4096\n",
430
+ "print(f\"closed_think: {'</think>' in response}\")\n",
431
+ "print(f\"\\n{'='*60}\")\n",
432
+ "print(response[:800])\n",
433
+ "\n",
434
+ "# ── Log to W&B ───────────────────────────────────────────────────────────────\n",
435
+ "wandb.log({\n",
436
+ " \"preflight/inference_time_s\": elapsed,\n",
437
+ " \"preflight/inference_tokens\": gen_tokens,\n",
438
+ " \"preflight/tok_per_s\": gen_tokens / elapsed,\n",
439
+ " \"preflight/hit_ceiling\": int(gen_tokens >= MAX_COMPLETION_LENGTH),\n",
440
+ " \"preflight/closed_think\": int(\"</think>\" in response),\n",
441
+ "})\n",
442
+ "wandb.summary[\"preflight/inference_response_preview\"] = response[:500]"
443
+ ]
444
+ },
445
+ {
446
+ "cell_type": "markdown",
447
+ "metadata": {},
448
+ "source": [
449
+ "---\n",
450
+ "\n",
451
+ "## Cell 5b: KV Cache Diagnostic"
452
+ ]
453
+ },
454
+ {
455
+ "cell_type": "code",
456
+ "execution_count": null,
457
+ "metadata": {},
458
+ "outputs": [],
459
+ "source": [
460
+ "import time\n",
461
+ "FastLanguageModel.for_inference(model)\n",
462
+ "\n",
463
+ "_kv_msgs = [{\"role\": \"system\", \"content\": SYSTEM_PT},\n",
464
+ " {\"role\": \"user\", \"content\": \"Qual a categoria de reclamação mais frequente?\"}]\n",
465
+ "_kv_text = tokenizer.apply_chat_template(_kv_msgs, tokenize=False, add_generation_prompt=True)\n",
466
+ "_kv_inputs = tokenizer(_kv_text, return_tensors=\"pt\").to(model.device)\n",
467
+ "\n",
468
+ "_token_times, _past, _generated = [], None, _kv_inputs[\"input_ids\"]\n",
469
+ "with torch.no_grad():\n",
470
+ " for _step in range(50):\n",
471
+ " _t0 = time.time()\n",
472
+ " seq_len = _generated.shape[1]\n",
473
+ " if _past is None:\n",
474
+ " _position_ids = torch.arange(seq_len, dtype=torch.long, device=model.device).unsqueeze(0)\n",
475
+ " else:\n",
476
+ " _position_ids = torch.tensor([[seq_len - 1]], dtype=torch.long, device=model.device)\n",
477
+ " _out = model(\n",
478
+ " input_ids=_generated[:, -1:] if _past else _generated,\n",
479
+ " position_ids=_position_ids,\n",
480
+ " attention_mask=torch.ones(1, seq_len, device=model.device),\n",
481
+ " past_key_values=_past,\n",
482
+ " use_cache=True,\n",
483
+ " return_dict=True,\n",
484
+ " )\n",
485
+ " _past = _out.past_key_values\n",
486
+ " _next = _out.logits[:, -1, :].argmax(dim=-1, keepdim=True)\n",
487
+ " _generated = torch.cat([_generated, _next], dim=1)\n",
488
+ " _token_times.append(time.time() - _t0)\n",
489
+ "\n",
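+ "# With a working KV cache, per-token latency stays roughly flat; without it each\n",
+ "# step re-encodes the entire prefix, so late tokens come out much slower than\n",
+ "# early ones. The last/first ratio below makes that visible.\n",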
490
+ "_ratio = sum(_token_times[45:]) / max(sum(_token_times[:5]), 1e-9)\n",
491
+ "print(f\"First 5 tok : {[f'{t*1000:.0f}ms' for t in _token_times[:5]]}\")\n",
492
+ "print(f\"Last 5 tok : {[f'{t*1000:.0f}ms' for t in _token_times[45:]]}\")\n",
493
+ "print(f\"Ratio last/first: {_ratio:.1f}x\")\n",
494
+ "if _ratio < 3:\n",
495
+ " print(\"✓ KV cache is working correctly\")\n",
496
+ "elif _ratio < 6:\n",
497
+ " print(\"⚠ KV cache may be degraded — check model.config.use_cache\")\n",
498
+ "else:\n",
499
+ " print(\"✗ KV cache BROKEN — GRPO generation will be catastrophically slow.\")\n",
500
+ "\n",
501
+ "# ── Log to W&B ────────────────────────────────────────────────────────────────\n",
502
+ "wandb.log({\n",
503
+ " \"preflight/kv_cache_ratio\": _ratio,\n",
504
+ " \"preflight/kv_cache_ok\": int(_ratio < 3),\n",
505
+ "})\n",
506
+ "\n",
507
+ "del _past, _generated, _kv_inputs, _token_times, _out\n",
508
+ "gc.collect()\n",
509
+ "if torch.cuda.is_available(): torch.cuda.empty_cache()"
510
+ ]
511
+ },
512
+ {
513
+ "cell_type": "markdown",
514
+ "metadata": {},
515
+ "source": [
516
+ "---\n",
517
+ "\n",
518
+ "## Cell 6: Reward Functions v3\n",
519
+ "\n",
520
+ "**v3 changes:**\n",
521
+ "- Staged reward design: format → partial content → full task (Reasoning-SQL, 2503.23157)\n",
522
+ "- Zero-advantage noise injection (Skywork-OR1, 2505.22312)\n",
523
+ "- Extraction reward redesigned for completion-length-friendly scoring"
524
+ ]
525
+ },
526
+ {
527
+ "cell_type": "code",
528
+ "execution_count": null,
529
+ "metadata": {},
530
+ "outputs": [],
531
+ "source": [
532
+ "def strip_think(text: str) -> str:\n",
533
+ " \"\"\"Remove <think>...</think> block, return the answer portion.\"\"\"\n",
534
+ " return re.sub(r\"<think>.*?</think>\", \"\", text, flags=re.DOTALL).strip()\n",
535
+ "\n",
536
+ "\n",
537
+ "def has_think_block(text: str) -> bool:\n",
538
+ " \"\"\"Check if text contains a non-empty <think> block.\"\"\"\n",
539
+ " return bool(re.search(r\"<think>.+</think>\", text, flags=re.DOTALL))\n",
540
+ "\n",
541
+ "\n",
542
+ "def _classify_task_type(prompt_text: str) -> str:\n",
543
+ " \"\"\"Classify prompt into task type by keywords.\"\"\"\n",
544
+ " p = prompt_text.lower()\n",
545
+ " if \"retorne um objeto json\" in p or \"extraia dados\" in p:\n",
546
+ " return \"extraction\"\n",
547
+ " elif \"notificação push\" in p or \"notificação de reengajamento\" in p:\n",
548
+ " return \"push\"\n",
549
+ " elif \"perfil do cliente\" in p:\n",
550
+ " return \"insights\"\n",
551
+ " else:\n",
552
+ " return \"sql_qa\"\n",
553
+ "\n",
554
+ "\n",
555
+ "def _json_similarity(text: str) -> float:\n",
556
+ " \"\"\"Rough heuristic: how JSON-like is this text? 0.0 to 1.0.\"\"\"\n",
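+ " # e.g. '{\"a\": 1, \"b\": 2}' scores 0.5 + 0.2 + 0.2 + 0.1 = 1.0; plain prose scores 0.0\n",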
557
+ " text = text.strip()\n",
558
+ " if not text:\n",
559
+ " return 0.0\n",
560
+ " score = 0.0\n",
561
+ " if text.startswith(\"{\") and text.endswith(\"}\"):\n",
562
+ " score += 0.5\n",
563
+ " if '\"' in text:\n",
564
+ " score += 0.2\n",
565
+ " if \":\" in text:\n",
566
+ " score += 0.2\n",
567
+ " if \",\" in text:\n",
568
+ " score += 0.1\n",
569
+ " return min(score, 1.0)\n",
570
+ "\n",
571
+ "\n",
572
+ "def _string_similarity(a: str, b: str) -> float:\n",
573
+ " \"\"\"Simple Jaccard-like similarity for short strings. 0.0 to 1.0.\"\"\"\n",
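+ " # e.g. ('entrega atrasada', 'entrega rápida') share 1 of 3 tokens → 0.33\n",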
574
+ " if not a or not b:\n",
575
+ " return 0.0\n",
576
+ " a_set = set(a.split())\n",
577
+ " b_set = set(b.split())\n",
578
+ " intersection = len(a_set & b_set)\n",
579
+ " union = len(a_set | b_set)\n",
580
+ " return intersection / union if union > 0 else 0.0\n",
581
+ "\n",
582
+ "\n",
583
+ "# ══════════════════════════════════════════════════════════════════════════════\n",
584
+ "# v3 STAGED REWARD DESIGN\n",
585
+ "# Reference: Reasoning-SQL (2503.23157) §3.2\n",
586
+ "#\n",
587
+ "# Each reward function scores THREE stages independently:\n",
588
+ "# Stage 1 — FORMAT (0.0–0.2): Is the output well-structured?\n",
589
+ "# Stage 2 — PARTIAL (0.0–0.3): Are some content elements correct?\n",
590
+ "# Stage 3 — TASK (0.0–0.5): Is the full task completed correctly?\n",
591
+ "#\n",
592
+ "# Format rewards converge first (easy to learn), which stabilizes training\n",
593
+ "# and enables the model to then learn harder task-specific skills.\n",
594
+ "# ══════════════════════════════════════════════════════════════════════════════\n",
595
+ "\n",
596
+ "\n",
597
+ "def reward_extraction(completion: str) -> float:\n",
598
+ " \"\"\"Staged reward for structured extraction (max 1.0).\"\"\"\n",
599
+ " answer = strip_think(completion)\n",
600
+ "\n",
601
+ " # ── Stage 1: FORMAT (max 0.2) ─────────────────────────────────────────────\n",
602
+ " r_format = 0.0\n",
603
+ " if has_think_block(completion):\n",
604
+ " r_format += 0.1 # Used reasoning\n",
605
+ "\n",
606
+ " try:\n",
607
+ " data = json.loads(answer)\n",
608
+ " if isinstance(data, dict):\n",
609
+ " r_format += 0.1 # Valid JSON object\n",
610
+ " except (json.JSONDecodeError, TypeError):\n",
611
+ " r_format += 0.05 * _json_similarity(answer)\n",
612
+ " return min(r_format, 0.2)\n",
613
+ "\n",
614
+ " if not isinstance(data, dict):\n",
615
+ " return min(r_format, 0.2)\n",
616
+ "\n",
617
+ " # ── Stage 2: PARTIAL CONTENT (max 0.3) ────────────────────────────────────\n",
618
+ " r_partial = 0.0\n",
619
+ "\n",
620
+ " present = sum(1 for f in EXTRACTION_FIELDS if f in data)\n",
621
+ " r_partial += 0.15 * (present / len(EXTRACTION_FIELDS))\n",
622
+ "\n",
623
+ " type_checks = 0\n",
624
+ " type_total = 0\n",
625
+ " for field in EXTRACTION_FIELDS:\n",
626
+ " if field not in data:\n",
627
+ " continue\n",
628
+ " type_total += 1\n",
629
+ " val = data[field]\n",
630
+ " if field in (\"delivery_issue\", \"product_issue\", \"seller_issue\", \"would_recommend\"):\n",
631
+ " if isinstance(val, bool):\n",
632
+ " type_checks += 1\n",
633
+ " elif field in (\"sentiment_score\",):\n",
634
+ " if isinstance(val, (int, float)):\n",
635
+ " type_checks += 1\n",
636
+ " elif field in (\"main_complaint\", \"sentiment\", \"complaint_category\", \"churn_risk\", \"repeat_intent\"):\n",
637
+ " if isinstance(val, str):\n",
638
+ " type_checks += 1\n",
639
+ " if type_total > 0:\n",
640
+ " r_partial += 0.15 * (type_checks / type_total)\n",
641
+ "\n",
642
+ " # ── Stage 3: FULL TASK (max 0.5) ─────────────────────────────────────────\n",
643
+ " r_task = 0.0\n",
644
+ " cat_checks = 0\n",
645
+ " cat_total = 0\n",
646
+ "\n",
647
+ " checks = [\n",
648
+ " (\"sentiment\", lambda v: v in VALID_SENTIMENTS),\n",
649
+ " (\"complaint_category\", lambda v: v in VALID_CATEGORIES),\n",
650
+ " (\"churn_risk\", lambda v: v in VALID_CHURN),\n",
651
+ " (\"repeat_intent\", lambda v: v in VALID_REPEAT),\n",
652
+ " (\"sentiment_score\", lambda v: isinstance(v, (int, float)) and 1 <= v <= 5),\n",
653
+ " ]\n",
654
+ " for field, validator in checks:\n",
655
+ " cat_total += 1\n",
656
+ " if field in data and validator(data[field]):\n",
657
+ " cat_checks += 1\n",
658
+ "\n",
659
+ " for bool_field in (\"delivery_issue\", \"product_issue\", \"seller_issue\", \"would_recommend\"):\n",
660
+ " cat_total += 1\n",
661
+ " if bool_field in data and isinstance(data[bool_field], bool):\n",
662
+ " cat_checks += 1\n",
663
+ "\n",
664
+ " if cat_total > 0:\n",
665
+ " r_task += 0.35 * (cat_checks / cat_total)\n",
666
+ "\n",
667
+ " if \"main_complaint\" in data and isinstance(data[\"main_complaint\"], str):\n",
668
+ " complaint = data[\"main_complaint\"].strip()\n",
669
+ " if len(complaint) > 10:\n",
670
+ " r_task += 0.15\n",
671
+ "\n",
672
+ " return min(r_format + r_partial + r_task, 1.0)\n",
673
+ "\n",
674
+ "\n",
675
+ "def reward_sql_qa(completion: str) -> float:\n",
676
+ " \"\"\"Staged reward for SQL Q&A (max 1.0).\"\"\"\n",
677
+ " answer = strip_think(completion)\n",
678
+ "\n",
679
+ " # ── Stage 1: FORMAT (max 0.2)\n",
680
+ " r_format = 0.0\n",
681
+ " if has_think_block(completion):\n",
682
+ " r_format += 0.1\n",
683
+ " if \"```\" in answer or re.search(r\"SELECT|FROM\", answer, re.IGNORECASE):\n",
684
+ " r_format += 0.1\n",
685
+ "\n",
686
+ " # ── Stage 2: PARTIAL (max 0.3)\n",
687
+ " r_partial = 0.0\n",
688
+ " sql_keywords = r\"SELECT|FROM|WHERE|GROUP BY|ORDER BY|COUNT|SUM|AVG|JOIN|HAVING\"\n",
689
+ " matches = len(re.findall(sql_keywords, answer, re.IGNORECASE))\n",
690
+ " r_partial += min(0.15, 0.03 * matches)\n",
691
+ " numbers = re.findall(r\"\\d+(?:[.,]\\d+)?\", answer)\n",
692
+ " r_partial += min(0.15, 0.03 * len(numbers))\n",
693
+ "\n",
694
+ " # ── Stage 3: TASK (max 0.5)\n",
695
+ " r_task = 0.0\n",
696
+ " length = len(answer)\n",
697
+ " if 50 <= length <= 600:\n",
698
+ " r_task += 0.25\n",
699
+ " elif length > 0:\n",
700
+ " r_task += 0.25 * max(0, 1 - abs(length - 325) / 275)\n",
701
+ " explanation_markers = [\"para \", \"porque\", \"resultado\", \"mostra\", \"indica\", \"análise\"]\n",
702
+ " expl_matches = sum(1 for w in explanation_markers if w in answer.lower())\n",
703
+ " r_task += min(0.25, 0.05 * expl_matches)\n",
704
+ "\n",
705
+ " return min(r_format + r_partial + r_task, 1.0)\n",
706
+ "\n",
707
+ "\n",
708
+ "def reward_insights(completion: str) -> float:\n",
709
+ " \"\"\"Staged reward for insights (max 1.0).\"\"\"\n",
710
+ " answer = strip_think(completion)\n",
711
+ "\n",
712
+ " # ── Stage 1: FORMAT (max 0.2)\n",
713
+ " r_format = 0.0\n",
714
+ " if has_think_block(completion):\n",
715
+ " r_format += 0.1\n",
716
+ " structure_marks = len(re.findall(r\"^[-•*]\\s|^\\d+[.)]\\s|^#{1,3}\\s\", answer, re.MULTILINE))\n",
717
+ " r_format += min(0.1, 0.02 * structure_marks)\n",
718
+ "\n",
719
+ " # ── Stage 2: PARTIAL (max 0.3)\n",
720
+ " r_partial = 0.0\n",
721
+ " length = len(answer)\n",
722
+ " if 100 <= length <= 1200:\n",
723
+ " r_partial += 0.15\n",
724
+ " elif length > 0:\n",
725
+ " r_partial += 0.15 * max(0, 1 - abs(length - 650) / 550)\n",
726
+ " pt_markers = re.findall(r\"[ãçéêóúâõ]|você|para|como|seu|sua|cliente|produto\", answer, re.IGNORECASE)\n",
727
+ " r_partial += min(0.15, 0.01 * len(pt_markers))\n",
728
+ "\n",
729
+ " # ── Stage 3: TASK (max 0.5)\n",
730
+ " r_task = 0.0\n",
731
+ " action_words = [\"recomend\", \"implement\", \"melhor\", \"reduzir\", \"aumentar\",\n",
732
+ " \"priorizar\", \"investir\", \"otimizar\", \"estratégi\", \"suger\",\n",
733
+ " \"consider\", \"ação\", \"plano\"]\n",
734
+ " matches = sum(1 for w in action_words if w in answer.lower())\n",
735
+ " r_task += min(0.3, 0.06 * matches)\n",
736
+ " data_refs = len(re.findall(r\"\\d+%|R\\$\\s*\\d|média|percentual|comparad|taxa\", answer, re.IGNORECASE))\n",
737
+ " r_task += min(0.2, 0.04 * data_refs)\n",
738
+ "\n",
739
+ " return min(r_format + r_partial + r_task, 1.0)\n",
740
+ "\n",
741
+ "\n",
742
+ "def reward_push(completion: str) -> float:\n",
743
+ " \"\"\"Staged reward for push notifications (max 1.0).\"\"\"\n",
744
+ " answer = strip_think(completion)\n",
745
+ " if not answer:\n",
746
+ " return 0.0\n",
747
+ "\n",
748
+ " # ── Stage 1: FORMAT (max 0.2)\n",
749
+ " r_format = 0.0\n",
750
+ " if has_think_block(completion):\n",
751
+ " r_format += 0.05\n",
752
+ " length = len(answer)\n",
753
+ " if length <= 160:\n",
754
+ " r_format += 0.15\n",
755
+ " elif length <= 300:\n",
756
+ " r_format += 0.1\n",
757
+ " else:\n",
758
+ " r_format += 0.05\n",
759
+ "\n",
760
+ " # ── Stage 2: PARTIAL (max 0.3)\n",
761
+ " r_partial = 0.0\n",
762
+ " pt_markers = re.findall(r\"[ãçéêóúâõ]|você|para|como|seu|sua\", answer, re.IGNORECASE)\n",
763
+ " r_partial += min(0.15, 0.02 * len(pt_markers))\n",
764
+ " if re.search(r\"[!?]|[\\U0001F600-\\U0001F64F]|[\\U0001F300-\\U0001F5FF]\", answer):\n",
765
+ " r_partial += 0.05\n",
766
+ " if len(answer.split()) >= 5:\n",
767
+ " r_partial += 0.1\n",
768
+ "\n",
769
+ " # ── Stage 3: TASK (max 0.5)\n",
770
+ " r_task = 0.0\n",
771
+ " if length <= 120:\n",
772
+ " r_task += 0.25\n",
773
+ " else:\n",
774
+ " r_task += 0.25 * max(0, 1 - (length - 120) / 120)\n",
775
+ " generic_phrases = [\n",
776
+ " \"olá! como podemos ajudar\", \"obrigado pela sua compra\",\n",
777
+ " \"seu pedido foi confirmado\", \"agradecemos sua preferência\",\n",
778
+ " ]\n",
779
+ " max_similarity = max(_string_similarity(answer.lower(), g) for g in generic_phrases)\n",
780
+ " r_task += 0.25 * (1 - max_similarity)\n",
781
+ "\n",
782
+ " return min(r_format + r_partial + r_task, 1.0)\n",
783
+ "\n",
784
+ "\n",
785
+ "def commerce_reward_fn(completions, prompts, **kwargs) -> list[float]:\n",
786
+ " \"\"\"\n",
787
+ " Master reward function v3: dispatches by task type + zero-advantage noise.\n",
788
+ " \"\"\"\n",
789
+ " rewards = []\n",
790
+ " for completion, prompt in zip(completions, prompts):\n",
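+ " # TRL may pass conversational completions as message lists and plain ones\n",
+ " # as strings; accept both (same handling for prompts below).\n",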
791
+ " if isinstance(completion, list):\n",
792
+ " comp_text = completion[-1][\"content\"] if completion else \"\"\n",
793
+ " else:\n",
794
+ " comp_text = str(completion)\n",
795
+ "\n",
796
+ " if isinstance(prompt, list):\n",
797
+ " prompt_text = \" \".join(m.get(\"content\", \"\") for m in prompt)\n",
798
+ " else:\n",
799
+ " prompt_text = str(prompt)\n",
800
+ "\n",
801
+ " task = _classify_task_type(prompt_text)\n",
802
+ "\n",
803
+ " if task == \"extraction\":\n",
804
+ " rewards.append(reward_extraction(comp_text))\n",
805
+ " elif task == \"sql_qa\":\n",
806
+ " rewards.append(reward_sql_qa(comp_text))\n",
807
+ " elif task == \"insights\":\n",
808
+ " rewards.append(reward_insights(comp_text))\n",
809
+ " elif task == \"push\":\n",
810
+ " rewards.append(reward_push(comp_text))\n",
811
+ " else:\n",
812
+ " r = 0.15 if has_think_block(comp_text) else 0.0\n",
813
+ " r += 0.2 if comp_text.strip() else 0.0\n",
814
+ " rewards.append(r)\n",
815
+ "\n",
816
+ " # ── v3: Zero-advantage noise injection ────────────────────────────────────\n",
817
+ " if ZERO_ADV_NOISE_STD > 0 and NUM_GENERATIONS > 1:\n",
818
+ " for i in range(0, len(rewards), NUM_GENERATIONS):\n",
819
+ " group = rewards[i:i+NUM_GENERATIONS]\n",
820
+ " if len(group) < 2:\n",
821
+ " continue\n",
822
+ " if max(group) - min(group) < 0.001:\n",
823
+ " for j in range(i, min(i+NUM_GENERATIONS, len(rewards))):\n",
824
+ " rewards[j] += random.gauss(0, ZERO_ADV_NOISE_STD)\n",
825
+ "\n",
826
+ " return rewards\n",
827
+ "\n",
828
+ "\n",
829
+ "print(\"✓ v3 Reward functions defined (staged: format → partial → task)\")"
830
+ ]
831
+ },
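+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Cell 6b: Reward Sanity Check (optional)\n",
+ "\n",
+ "A minimal sketch on toy strings (hypothetical, not drawn from `train.jsonl`): a structurally valid extraction should outscore a bare sentence even before task accuracy is judged, because the format stage pays out first."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Toy completions (hypothetical): one well-formed, one bare prose.\n",
+ "_toy_good = (\n",
+ " \"<think>entrega atrasou, cliente irritado</think>\"\n",
+ " '{\"sentiment\": \"negative\", \"sentiment_score\": 2, \"churn_risk\": \"high\",'\n",
+ " ' \"delivery_issue\": true, \"product_issue\": false, \"seller_issue\": false,'\n",
+ " ' \"main_complaint\": \"produto chegou com muito atraso\",'\n",
+ " ' \"complaint_category\": \"delivery_delay\", \"repeat_intent\": \"no\",'\n",
+ " ' \"would_recommend\": false}'\n",
+ ")\n",
+ "_toy_bad = \"O cliente parece insatisfeito.\"\n",
+ "print(f\"good={reward_extraction(_toy_good):.2f} bad={reward_extraction(_toy_bad):.2f}\")\n",
+ "assert reward_extraction(_toy_good) > reward_extraction(_toy_bad)"
+ ]
+ },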
832
+ {
833
+ "cell_type": "markdown",
834
+ "metadata": {},
835
+ "source": [
836
+ "---\n",
837
+ "\n",
838
+ "## Cell 7: Reward Calibration\n",
839
+ "\n",
840
+ "**Gate:** Verify reward variance > 0. Compare v3 scoring to v2 calibration (mean=0.38)."
841
+ ]
842
+ },
843
+ {
844
+ "cell_type": "code",
845
+ "execution_count": null,
846
+ "metadata": {},
847
+ "outputs": [],
848
+ "source": [
849
+ "train_path = DATA_DIR / \"pairs\" / \"train.jsonl\"\n",
850
+ "\n",
851
+ "by_type = {\"extraction\": [], \"sql_qa\": [], \"insights\": [], \"push\": []}\n",
852
+ "with open(train_path) as f:\n",
853
+ " for line in f:\n",
854
+ " row = json.loads(line)\n",
855
+ " convs = row[\"conversations\"]\n",
856
+ " prompt_msgs = [m for m in convs if m[\"role\"] in (\"system\", \"user\")]\n",
857
+ " if not prompt_msgs:\n",
858
+ " continue\n",
859
+ " user_text = \" \".join(m[\"content\"] for m in prompt_msgs if m[\"role\"] == \"user\")\n",
860
+ " task = _classify_task_type(user_text)\n",
861
+ " by_type[task].append(prompt_msgs)\n",
862
+ "\n",
863
+ "print(f\"Prompts by type: {', '.join(f'{k}={len(v)}' for k, v in by_type.items())}\")\n",
864
+ "\n",
865
+ "rng = random.Random(42)\n",
866
+ "cal_samples = []\n",
867
+ "for task_type in [\"extraction\", \"extraction\", \"sql_qa\", \"sql_qa\", \"insights\", \"insights\", \"push\", \"push\"]:\n",
868
+ " cal_samples.append(rng.choice(by_type[task_type]))\n",
869
+ "\n",
870
+ "FastLanguageModel.for_inference(model)\n",
871
+ "print(f\"\\nReward calibration v3 ({len(cal_samples)} samples):\")\n",
872
+ "print(\"-\" * 70)\n",
873
+ "\n",
874
+ "cal_rewards = []\n",
875
+ "cal_rows = [] # collect per-sample data for W&B Table\n",
876
+ "for i, msgs in enumerate(cal_samples):\n",
877
+ " text = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)\n",
878
+ " inputs = tokenizer(text, return_tensors=\"pt\").to(model.device)\n",
879
+ " outputs = model.generate(**inputs, max_new_tokens=MAX_COMPLETION_LENGTH, temperature=0.7, do_sample=True)\n",
880
+ " response = tokenizer.decode(outputs[0][inputs[\"input_ids\"].shape[1]:], skip_special_tokens=True)\n",
881
+ " gen_tokens = outputs.shape[1] - inputs[\"input_ids\"].shape[1]\n",
882
+ "\n",
883
+ " r = commerce_reward_fn([response], [text])[0]\n",
884
+ " cal_rewards.append(r)\n",
885
+ " hit_ceiling = gen_tokens >= MAX_COMPLETION_LENGTH\n",
886
+ " has_answer = \"</think>\" in response\n",
887
+ " answer_preview = strip_think(response)[:100] if has_answer else \"[stuck in <think>]\"\n",
888
+ " task = _classify_task_type(text)\n",
889
+ " print(f\" [{task:12s}] reward={r:.2f} | tokens={gen_tokens:4d} | ceiling={'HIT' if hit_ceiling else 'ok':6s} | {answer_preview}\")\n",
890
+ "\n",
891
+ " cal_rows.append([i, task, r, gen_tokens, hit_ceiling, has_answer, answer_preview])\n",
892
+ "\n",
893
+ "print(f\"\\nMean={sum(cal_rewards)/len(cal_rewards):.2f}, Min={min(cal_rewards):.2f}, Max={max(cal_rewards):.2f}\")\n",
894
+ "print(f\"v2 calibration was: Mean=0.38, Min=0.02, Max=0.70\")\n",
895
+ "print(f\"Variance > 0: {len(set(cal_rewards)) > 1}\")\n",
896
+ "\n",
897
+ "# ── Log to W&B ───────────────────────────────────────────────────────────────\n",
898
+ "cal_table = wandb.Table(\n",
899
+ " columns=[\"sample\", \"task\", \"reward\", \"tokens\", \"hit_ceiling\", \"closed_think\", \"answer_preview\"],\n",
900
+ " data=cal_rows,\n",
901
+ ")\n",
902
+ "wandb.log({\n",
903
+ " \"calibration/mean_reward\": sum(cal_rewards) / len(cal_rewards),\n",
904
+ " \"calibration/min_reward\": min(cal_rewards),\n",
905
+ " \"calibration/max_reward\": max(cal_rewards),\n",
906
+ " \"calibration/has_variance\": int(len(set(cal_rewards)) > 1),\n",
907
+ " \"calibration/samples\": cal_table,\n",
908
+ "})\n",
909
+ "wandb.config.update({\"prompts_by_type\": {k: len(v) for k, v in by_type.items()}}, allow_val_change=True)"
910
+ ]
911
+ },
912
+ {
913
+ "cell_type": "markdown",
914
+ "metadata": {},
915
+ "source": [
916
+ "---\n",
917
+ "\n",
918
+ "## Cell 8: Dataset Preparation v3"
919
+ ]
920
+ },
921
+ {
922
+ "cell_type": "code",
923
+ "execution_count": null,
924
+ "metadata": {},
925
+ "outputs": [],
926
+ "source": [
927
+ "from datasets import Dataset\n",
928
+ "\n",
929
+ "def prepare_grpo_datasets_v3(n_prompts=GRPO_PROMPTS, eval_ratio=EVAL_SPLIT_RATIO,\n",
930
+ " general_mix=GENERAL_MIX_RATIO, seed=42):\n",
931
+ " rng = random.Random(seed)\n",
932
+ "\n",
933
+ " train_pools = {}\n",
934
+ " eval_records = []\n",
935
+ " for task, pool in by_type.items():\n",
936
+ " shuffled = pool.copy()\n",
937
+ " rng.shuffle(shuffled)\n",
938
+ " n_eval = max(1, int(len(shuffled) * eval_ratio))\n",
939
+ " eval_records.extend(shuffled[:n_eval])\n",
940
+ " train_pools[task] = shuffled[n_eval:]\n",
941
+ "\n",
942
+ " if n_prompts is None:\n",
943
+ " train_records = []\n",
944
+ " for task, pool in train_pools.items():\n",
945
+ " train_records.extend(pool)\n",
946
+ " rng.shuffle(train_records)\n",
947
+ " else:\n",
948
+ " targets = {\n",
949
+ " \"extraction\": int(n_prompts * 0.4),\n",
950
+ " \"sql_qa\": int(n_prompts * 0.4),\n",
951
+ " \"insights\": int(n_prompts * 0.1),\n",
952
+ " \"push\": int(n_prompts * 0.1),\n",
953
+ " }\n",
954
+ " train_records = []\n",
955
+ " for task, target_n in targets.items():\n",
956
+ " pool = train_pools[task]\n",
957
+ " n = min(target_n, len(pool))\n",
958
+ " train_records.extend(rng.sample(pool, n))\n",
959
+ " rng.shuffle(train_records)\n",
960
+ "\n",
961
+ " general_path = DATA_DIR / \"pairs\" / \"general_reasoning.jsonl\"\n",
962
+ " if general_mix > 0 and general_path.exists():\n",
963
+ " general_records = []\n",
964
+ " with open(general_path) as f:\n",
965
+ " for line in f:\n",
966
+ " row = json.loads(line)\n",
967
+ " convs = row[\"conversations\"]\n",
968
+ " prompt_msgs = [m for m in convs if m[\"role\"] in (\"system\", \"user\")]\n",
969
+ " if prompt_msgs:\n",
970
+ " general_records.append(prompt_msgs)\n",
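+ " # With d domain prompts, adding g general ones gives a general share of\n",
+ " # g / (d + g); solving g / (d + g) = mix yields g = d * mix / (1 - mix).\n",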
971
+ " n_general = int(len(train_records) * general_mix / (1 - general_mix))\n",
972
+ " n_general = min(n_general, len(general_records))\n",
973
+ " if n_general > 0:\n",
974
+ " train_records.extend(rng.sample(general_records, n_general))\n",
975
+ " rng.shuffle(train_records)\n",
976
+ " print(f\" Cocktail Effect: added {n_general} general reasoning samples ({general_mix:.0%} mix)\")\n",
977
+ " elif general_mix > 0:\n",
978
+ " print(f\" general_reasoning.jsonl not found — skipping mix\")\n",
979
+ "\n",
980
+ " task_dist = {}\n",
981
+ " for record in train_records:\n",
982
+ " user_text = \" \".join(m[\"content\"] for m in record if m[\"role\"] == \"user\")\n",
983
+ " task = _classify_task_type(user_text)\n",
984
+ " task_dist[task] = task_dist.get(task, 0) + 1\n",
985
+ "\n",
986
+ " n_domain = len(train_records)\n",
987
+ " steps_per_epoch = n_domain * NUM_GENERATIONS // (BATCH_SIZE * GRAD_ACCUM)\n",
988
+ "\n",
989
+ " print(f\"v3 Dataset split (eval_ratio={eval_ratio}):\")\n",
990
+ " print(f\" train : {n_domain} prompts\")\n",
991
+ " print(f\" eval : {len(eval_records)} prompts\")\n",
992
+ " print(f\" distribution: {', '.join(f'{k}={v}' for k, v in sorted(task_dist.items()))}\")\n",
993
+ " print(f\" steps/epoch: {n_domain} x {NUM_GENERATIONS} / ({BATCH_SIZE} x {GRAD_ACCUM}) = {steps_per_epoch}\")\n",
994
+ " print(f\" MAX_STEPS={MAX_STEPS} -> {'< 1 epoch' if MAX_STEPS < steps_per_epoch else f'{MAX_STEPS/steps_per_epoch:.1f} epochs'}\")\n",
995
+ "\n",
996
+ " train_ds = Dataset.from_list([{\"prompt\": msgs} for msgs in train_records])\n",
997
+ " eval_ds = Dataset.from_list([{\"prompt\": msgs} for msgs in eval_records])\n",
998
+ " return train_ds, eval_ds\n",
999
+ "\n",
1000
+ "\n",
1001
+ "train_dataset, eval_dataset = prepare_grpo_datasets_v3()\n",
1002
+ "dataset = train_dataset\n",
1003
+ "print(f\"\\n✓ v3 Datasets ready: train={len(train_dataset)}, eval={len(eval_dataset)}\")\n",
1004
+ "\n",
1005
+ "# ── Log dataset sizes to W&B ─────────────────────────────────────────────────\n",
1006
+ "wandb.config.update({\n",
1007
+ " \"train_prompts\": len(train_dataset),\n",
1008
+ " \"eval_prompts\": len(eval_dataset),\n",
1009
+ "}, allow_val_change=True)"
1010
+ ]
1011
+ },
1012
+ {
1013
+ "cell_type": "markdown",
1014
+ "metadata": {},
1015
+ "source": [
1016
+ "---\n",
1017
+ "\n",
1018
+ "## Cell 9: Smoke Test\n",
1019
+ "\n",
1020
+ "**Gate:** Runs 1 step without OOM at new completion length (4096)."
1021
+ ]
1022
+ },
1023
+ {
1024
+ "cell_type": "code",
1025
+ "execution_count": null,
1026
+ "metadata": {},
1027
+ "outputs": [],
1028
+ "source": [
1029
+ "from trl import GRPOConfig, GRPOTrainer\n",
1030
+ "\n",
1031
+ "FastLanguageModel.for_training(model)\n",
1032
+ "\n",
1033
+ "smoke_config = GRPOConfig(\n",
1034
+ " output_dir=str(CHECKPOINT_DIR / \"smoke\"),\n",
1035
+ " num_generations=NUM_GENERATIONS,\n",
1036
+ " scale_rewards=SCALE_REWARDS,\n",
1037
+ " max_completion_length=MAX_COMPLETION_LENGTH,\n",
1038
+ " max_steps=1,\n",
1039
+ " num_train_epochs=1,\n",
1040
+ " temperature=TEMPERATURE,\n",
1041
+ " per_device_train_batch_size=BATCH_SIZE,\n",
1042
+ " gradient_accumulation_steps=1,\n",
1043
+ " learning_rate=LEARNING_RATE,\n",
1044
+ " fp16=False,\n",
1045
+ " bf16=True,\n",
1046
+ " logging_steps=1,\n",
1047
+ " save_steps=999,\n",
1048
+ " report_to=\"none\",\n",
1049
+ " max_prompt_length=MAX_SEQ_LENGTH - MAX_COMPLETION_LENGTH,\n",
1050
+ " seed=42,\n",
1051
+ " remove_unused_columns=False,\n",
1052
+ ")\n",
1053
+ "\n",
1054
+ "smoke_trainer = GRPOTrainer(\n",
1055
+ " model=model,\n",
1056
+ " reward_funcs=commerce_reward_fn,\n",
1057
+ " args=smoke_config,\n",
1058
+ " train_dataset=dataset,\n",
1059
+ " processing_class=tokenizer, # TRL's GRPOTrainer expects processing_class, not tokenizer\n",
1060
+ ")\n",
1061
+ "\n",
1062
+ "t0 = time.time()\n",
1063
+ "smoke_trainer.train()\n",
1064
+ "step_time = time.time() - t0\n",
1065
+ "\n",
1066
+ "vram_used = torch.cuda.max_memory_allocated() / 1e9\n",
1067
+ "vram_total = torch.cuda.get_device_properties(0).total_memory / 1e9\n",
1068
+ "\n",
1069
+ "print(f\"\\n✓ Smoke test passed!\")\n",
1070
+ "print(f\" Step time (grad_accum=1): {step_time:.0f}s\")\n",
1071
+ "print(f\" Estimated step time (grad_accum={GRAD_ACCUM}): {step_time * GRAD_ACCUM:.0f}s\")\n",
1072
+ "print(f\" VRAM peak: {vram_used:.1f} GB / {vram_total:.1f} GB\")\n",
1073
+ "\n",
1074
+ "if vram_used > vram_total * 0.95:\n",
1075
+ " print(f\"\\n VRAM at {vram_used/vram_total:.0%} — dangerously close to OOM\")\n",
1076
+ " print(f\" Option 1: Reduce MAX_COMPLETION_LENGTH to 3072\")\n",
1077
+ " print(f\" Option 2: Reduce BATCH_SIZE to 2 (increase GRAD_ACCUM to 2)\")\n",
1078
+ "\n",
1079
+ "# ── Log to W&B ───────────────────────────────────────────────────────────────\n",
1080
+ "wandb.log({\n",
1081
+ " \"smoke/step_time_s\": step_time,\n",
1082
+ " \"smoke/vram_peak_gb\": vram_used,\n",
1083
+ " \"smoke/vram_total_gb\": vram_total,\n",
1084
+ " \"smoke/vram_pct\": vram_used / vram_total,\n",
1085
+ " \"smoke/estimated_step_time_s\": step_time * GRAD_ACCUM,\n",
1086
+ "})\n",
1087
+ "\n",
1088
+ "del smoke_trainer\n",
1089
+ "gc.collect(); torch.cuda.empty_cache()"
1090
+ ]
1091
+ },
1092
+ {
1093
+ "cell_type": "markdown",
1094
+ "metadata": {},
1095
+ "source": [
1096
+ "---\n",
1097
+ "\n",
1098
+ "## Cell 10: Probe Run (3 steps)"
1099
+ ]
1100
+ },
1101
+ {
1102
+ "cell_type": "code",
1103
+ "execution_count": null,
1104
+ "metadata": {},
1105
+ "outputs": [],
1106
+ "source": [
1107
+ "FastLanguageModel.for_training(model)\n",
1108
+ "\n",
1109
+ "probe_config = GRPOConfig(\n",
1110
+ " output_dir=str(CHECKPOINT_DIR / \"probe\"),\n",
1111
+ " num_generations=NUM_GENERATIONS,\n",
1112
+ " scale_rewards=SCALE_REWARDS,\n",
1113
+ " max_completion_length=MAX_COMPLETION_LENGTH,\n",
1114
+ " max_steps=3,\n",
1115
+ " temperature=TEMPERATURE,\n",
1116
+ " num_train_epochs=NUM_EPOCHS,\n",
1117
+ " per_device_train_batch_size=BATCH_SIZE,\n",
1118
+ " gradient_accumulation_steps=GRAD_ACCUM,\n",
1119
+ " learning_rate=LEARNING_RATE,\n",
1120
+ " warmup_ratio=0.1,\n",
1121
+ " lr_scheduler_type=\"cosine\",\n",
1122
+ " fp16=False,\n",
1123
+ " bf16=True,\n",
1124
+ " logging_steps=1,\n",
1125
+ " disable_tqdm=True,\n",
1126
+ " logging_first_step=True,\n",
1127
+ " save_steps=999,\n",
1128
+ " report_to=\"none\",\n",
1129
+ " max_prompt_length=MAX_SEQ_LENGTH - MAX_COMPLETION_LENGTH,\n",
1130
+ " seed=42,\n",
1131
+ " remove_unused_columns=False,\n",
1132
+ ")\n",
1133
+ "\n",
1134
+ "probe_trainer = GRPOTrainer(\n",
1135
+ " model=model,\n",
1136
+ " reward_funcs=commerce_reward_fn,\n",
1137
+ " args=probe_config,\n",
1138
+ " train_dataset=dataset,\n",
1139
+ " processing_class=tokenizer, # TRL's GRPOTrainer expects processing_class, not tokenizer\n",
1140
+ ")\n",
1141
+ "\n",
1142
+ "t0 = time.time()\n",
1143
+ "result = probe_trainer.train()\n",
1144
+ "elapsed = time.time() - t0\n",
1145
+ "\n",
1146
+ "has_gradient = abs(result.training_loss) >= 1e-6\n",
1147
+ "\n",
1148
+ "print(f\"\\n✓ Probe complete in {elapsed:.0f}s ({elapsed/3:.0f}s/step)\")\n",
1149
+ "print(f\" Train loss: {result.training_loss:.6f}\")\n",
1150
+ "print(f\" Estimated full run ({MAX_STEPS} steps): {elapsed/3 * MAX_STEPS / 3600:.1f}h\")\n",
1151
+ "\n",
1152
+ "if not has_gradient:\n",
1153
+ " print(\" Loss is near-zero — reward variance may be insufficient\")\n",
1154
+ "else:\n",
1155
+ " print(\" ✓ Loss is non-zero — GRPO has gradient signal\")\n",
1156
+ "\n",
1157
+ "# ── Log to W&B ───────────────────────────────────────────────────────────────\n",
1158
+ "wandb.log({\n",
1159
+ " \"probe/loss\": result.training_loss,\n",
1160
+ " \"probe/time_per_step_s\": elapsed / 3,\n",
1161
+ " \"probe/estimated_full_run_h\": elapsed / 3 * MAX_STEPS / 3600,\n",
1162
+ " \"probe/has_gradient\": int(has_gradient),\n",
1163
+ "})\n",
1164
+ "\n",
1165
+ "del probe_trainer\n",
1166
+ "gc.collect(); torch.cuda.empty_cache()"
1167
+ ]
1168
+ },
1169
+ {
1170
+ "cell_type": "markdown",
1171
+ "metadata": {},
1172
+ "source": [
1173
+ "---\n",
1174
+ "\n",
1175
+ "## Cell 11: Full Training Run v3"
1176
+ ]
1177
+ },
1178
+ {
1179
+ "cell_type": "code",
1180
+ "execution_count": null,
1181
+ "metadata": {},
1182
+ "outputs": [],
1183
+ "source": [
1184
+ "import shutil\n",
1185
+ "import torch\n",
1186
+ "from transformers import TrainerCallback\n",
1187
+ "from trl import GRPOConfig, GRPOTrainer\n",
1188
+ "\n",
1189
+ "# W&B run already active from Cell 3 — just update config with dataset counts\n",
1190
+ "wandb.config.update({\n",
1191
+ " \"train_prompts\": len(train_dataset),\n",
1192
+ " \"eval_prompts\": len(eval_dataset),\n",
1193
+ "}, allow_val_change=True)\n",
1194
+ "\n",
1195
+ "FRESH = True\n",
1196
+ "resume_from = None\n",
1197
+ "if FRESH and CHECKPOINT_DIR.exists():\n",
1198
+ " print(\"FRESH: deleting old checkpoints...\")\n",
1199
+ " shutil.rmtree(CHECKPOINT_DIR)\n",
1200
+ "elif CHECKPOINT_DIR.exists():\n",
1201
+ " checkpoints = sorted(\n",
1202
+ " [d for d in CHECKPOINT_DIR.iterdir()\n",
1203
+ " if d.is_dir() and d.name.startswith(\"checkpoint-\")],\n",
1204
+ " key=lambda d: int(d.name.split(\"-\")[-1]),\n",
1205
+ " )\n",
1206
+ " if checkpoints:\n",
1207
+ " resume_from = str(checkpoints[-1])\n",
1208
+ " print(f\"Resuming from: {resume_from}\")\n",
1209
+ "\n",
1210
+ "\n",
1211
+ "class UnslothGRPOTrainer(GRPOTrainer):\n",
1212
+ " \"\"\"Wraps generation with Unsloth for_inference()/for_training().\"\"\"\n",
1213
+ " def _generate(self, prompts, images):\n",
1214
+ " FastLanguageModel.for_inference(self.model)\n",
1215
+ " try:\n",
1216
+ " result = super()._generate(prompts, images)\n",
1217
+ " finally:\n",
1218
+ " FastLanguageModel.for_training(self.model)\n",
1219
+ " return result\n",
1220
+ "\n",
1221
+ "\n",
1222
+ "class EvalRewardCallback(TrainerCallback):\n",
1223
+ " \"\"\"v3: deterministic eval, per-task breakdown, patience=15.\"\"\"\n",
+ " def __init__(self, eval_records, reward_fn, patience=EARLY_STOPPING_PATIENCE,\n",
+ " delta=EARLY_STOPPING_DELTA):\n",
+ " self.eval_records = eval_records\n",
+ " self.reward_fn = reward_fn\n",
+ " self.patience = patience\n",
+ " self.delta = delta\n",
+ " self.best_reward = -float(\"inf\")\n",
+ " self.no_improve_count = 0\n",
+ "\n",
+ " def on_step_end(self, args, state, control, model=None, processing_class=None, **kwargs):\n",
+ " if state.global_step == 0 or state.global_step % EVAL_STEPS != 0:\n",
+ " return control\n",
+ " tokenizer = processing_class\n",
+ " if tokenizer is None:\n",
+ " print(\"[EvalRewardCallback] WARNING: tokenizer is None, skipping eval\")\n",
+ " return control\n",
+ "\n",
+ " mean_reward, task_rewards = self._run_eval(model, tokenizer, args)\n",
+ " improved = mean_reward > self.best_reward + self.delta\n",
+ " status = \"improved\" if improved else f\"no gain ({self.no_improve_count + 1}/{self.patience})\"\n",
+ "\n",
+ " log_dict = {\n",
+ " \"eval/mean_reward\": mean_reward,\n",
+ " \"eval/best_reward\": max(self.best_reward, mean_reward),\n",
+ " \"eval/no_improve_count\": self.no_improve_count,\n",
+ " }\n",
+ " for task, rewards in task_rewards.items():\n",
+ " if rewards:\n",
+ " log_dict[f\"eval/{task}_reward\"] = sum(rewards) / len(rewards)\n",
+ " wandb.log(log_dict, step=state.global_step)\n",
+ "\n",
+ " print(f\"\\n[EvalReward] step={state.global_step} | mean={mean_reward:.4f} | best={self.best_reward:.4f} | {status}\")\n",
+ " for task, rewards in task_rewards.items():\n",
+ " if rewards:\n",
+ " print(f\" {task}: {sum(rewards)/len(rewards):.3f} (n={len(rewards)})\")\n",
+ "\n",
+ " if improved:\n",
+ " self.best_reward = mean_reward\n",
+ " self.no_improve_count = 0\n",
+ " else:\n",
+ " self.no_improve_count += 1\n",
+ " if self.no_improve_count >= self.patience:\n",
+ " print(f\"[EarlyStopping] No improvement >= {self.delta} for {self.patience} consecutive evals. Halting.\")\n",
+ " wandb.log({\"early_stop/step\": state.global_step}, step=state.global_step)\n",
+ " control.should_training_stop = True\n",
+ " return control\n",
+ "\n",
+ " def _run_eval(self, model, tokenizer, args):\n",
+ " FastLanguageModel.for_inference(model)\n",
+ " rewards = []\n",
+ " task_rewards = {}\n",
+ " subset = self.eval_records[:EVAL_MAX_SAMPLES]\n",
+ " for record in subset:\n",
+ " msgs = record[\"prompt\"]\n",
+ " text = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)\n",
+ " inputs = tokenizer(text, return_tensors=\"pt\", truncation=True,\n",
+ " max_length=args.max_prompt_length).to(model.device)\n",
+ " with torch.no_grad():\n",
+ " out = model.generate(**inputs, max_new_tokens=EVAL_MAX_TOKENS,\n",
+ " temperature=EVAL_TEMPERATURE, do_sample=True)\n",
+ " resp = tokenizer.decode(out[0][inputs[\"input_ids\"].shape[1]:], skip_special_tokens=True)\n",
+ " r = self.reward_fn([resp], [text])[0]\n",
+ " rewards.append(r)\n",
+ " user_text = \" \".join(m.get(\"content\", \"\") for m in msgs if m.get(\"role\") == \"user\")\n",
+ " task = _classify_task_type(user_text)\n",
+ " task_rewards.setdefault(task, []).append(r)\n",
+ " FastLanguageModel.for_training(model)\n",
+ " mean = sum(rewards) / len(rewards) if rewards else 0.0\n",
+ " return mean, task_rewards\n",
+ "\n",
+ "\n",
+ "class EntropyMonitorCallback(TrainerCallback):\n",
+ " \"\"\"v3 NEW: Monitor entropy collapse indicators (Skywork-OR1 S4).\"\"\"\n",
+ " def __init__(self):\n",
+ " self.consecutive_ceiling_hits = 0\n",
+ "\n",
+ " def on_log(self, args, state, control, logs=None, **kwargs):\n",
+ " if not logs:\n",
+ " return\n",
+ " step = state.global_step\n",
+ " monitor = {}\n",
+ " comp_len = logs.get(\"completion_length\", 0)\n",
+ " if comp_len > 0:\n",
+ " ratio = comp_len / MAX_COMPLETION_LENGTH\n",
+ " monitor[\"monitor/completion_ratio\"] = ratio\n",
+ " if ratio > 0.95:\n",
+ " self.consecutive_ceiling_hits += 1\n",
+ " if self.consecutive_ceiling_hits >= 3:\n",
+ " print(f\"Step {step}: Completion ceiling hit {self.consecutive_ceiling_hits} consecutive times.\")\n",
+ " else:\n",
+ " self.consecutive_ceiling_hits = 0\n",
+ " reward_std = logs.get(\"reward_std\", logs.get(\"rewards/commerce_reward_fn/std\", 0))\n",
+ " if reward_std is not None:\n",
+ " monitor[\"monitor/reward_std\"] = reward_std\n",
+ " if reward_std < 0.01:\n",
+ " print(f\"Step {step}: reward_std={reward_std:.4f} — near-zero variance\")\n",
+ " clip_high = logs.get(\"clip_ratio/high_mean\", 0)\n",
1321
+ " clip_low = logs.get(\"clip_ratio/low_mean\", 0)\n",
+ " if clip_high is not None and clip_low is not None:\n",
+ " total_clip = clip_high + abs(clip_low)\n",
+ " monitor[\"monitor/total_clip_ratio\"] = total_clip\n",
+ " if total_clip > 0.01 and step > 10:\n",
+ " print(f\"Step {step}: clip_ratio={total_clip:.3f} — policy is updating\")\n",
+ " if monitor and wandb.run:\n",
+ " wandb.log(monitor, step=step)\n",
+ "\n",
+ "\n",
+ "FastLanguageModel.for_training(model)\n",
+ "\n",
+ "grpo_config = GRPOConfig(\n",
+ " output_dir=str(CHECKPOINT_DIR),\n",
+ " num_generations=NUM_GENERATIONS,\n",
+ " scale_rewards=SCALE_REWARDS,\n",
+ " max_completion_length=MAX_COMPLETION_LENGTH,\n",
+ " temperature=TEMPERATURE,\n",
+ " max_steps=MAX_STEPS,\n",
+ " num_train_epochs=NUM_EPOCHS,\n",
+ " per_device_train_batch_size=BATCH_SIZE,\n",
+ " gradient_accumulation_steps=GRAD_ACCUM,\n",
+ " learning_rate=LEARNING_RATE,\n",
+ " warmup_ratio=0.1,\n",
+ " lr_scheduler_type=\"cosine\",\n",
+ " fp16=False,\n",
+ " bf16=True,\n",
+ " logging_steps=1,\n",
+ " logging_first_step=True,\n",
+ " disable_tqdm=True,\n",
+ " save_steps=SAVE_STEPS,\n",
+ " save_total_limit=SAVE_TOTAL_LIMIT,\n",
+ " save_only_model=True,\n",
+ " eval_steps=EVAL_STEPS,\n",
+ " report_to=\"wandb\",\n",
+ " max_prompt_length=MAX_SEQ_LENGTH - MAX_COMPLETION_LENGTH,\n",
+ " seed=42,\n",
+ " remove_unused_columns=False,\n",
+ " **({\"use_vllm\": True, \"vllm_mode\": \"colocate\",\n",
+ " \"vllm_enable_sleep_mode\": True} if USE_VLLM else {}),\n",
+ ")\n",
+ "\n",
+ "eval_cb = EvalRewardCallback(eval_records=list(eval_dataset), reward_fn=commerce_reward_fn)\n",
+ "entropy_cb = EntropyMonitorCallback()\n",
+ "\n",
+ "TrainerClass = GRPOTrainer if USE_VLLM else UnslothGRPOTrainer\n",
+ "trainer = TrainerClass(\n",
+ " model=model,\n",
+ " reward_funcs=commerce_reward_fn,\n",
+ " args=grpo_config,\n",
+ " train_dataset=train_dataset,\n",
+ " processing_class=tokenizer,\n",
+ " callbacks=[eval_cb, entropy_cb],\n",
+ ")\n",
+ "\n",
+ "print(f\"{'='*70}\")\n",
+ "print(f\"GRPO v3 Training — Ready to Launch\")\n",
+ "print(f\"{'='*70}\")\n",
+ "print(f\" Trainer: {TrainerClass.__name__}\")\n",
+ "print(f\" Max steps: {MAX_STEPS}\")\n",
+ "print(f\" Temperature: {TEMPERATURE} (v2 was 0.8)\")\n",
+ "print(f\" Completion: {MAX_COMPLETION_LENGTH} tokens (v2 was 2048)\")\n",
+ "print(f\" Generations: {NUM_GENERATIONS} per prompt (v2 was 8)\")\n",
+ "print(f\" Learning rate: {LEARNING_RATE} (v2 was 5e-7)\")\n",
+ "print(f\" Save every: {SAVE_STEPS} steps (keep {SAVE_TOTAL_LIMIT})\")\n",
+ "print(f\" Eval every: {EVAL_STEPS} steps ({EVAL_MAX_SAMPLES} samples x {EVAL_MAX_TOKENS} tok)\")\n",
+ "print(f\" Patience: {EARLY_STOPPING_PATIENCE} evals ({EARLY_STOPPING_PATIENCE * EVAL_STEPS} steps)\")\n",
+ "print(f\" Resume: {resume_from is not None}\")\n",
+ "print(f\" W&B run: {wandb.run.url}\")\n",
+ "print(f\"{'='*70}\")\n",
+ "\n",
+ "t_start = time.time()\n",
+ "result = trainer.train(resume_from_checkpoint=resume_from)\n",
+ "elapsed = time.time() - t_start\n",
+ "\n",
+ "# ── Log final training metrics (run stays open for save + validation) ────────\n",
+ "wandb.log({\n",
+ " \"train/final_loss\": result.training_loss,\n",
+ " \"train/duration_hours\": elapsed / 3600,\n",
+ " \"train/total_steps\": result.global_step,\n",
+ " \"eval/best_reward_final\": eval_cb.best_reward,\n",
+ "})\n",
+ "\n",
+ "print(f\"\\n{'='*70}\")\n",
+ "print(f\"GRPO v3 Training Complete\")\n",
+ "print(f\" Loss: {result.training_loss:.6f}\")\n",
+ "print(f\" Steps: {result.global_step}\")\n",
+ "print(f\" Duration: {elapsed/3600:.1f}h\")\n",
+ "print(f\" Best eval R: {eval_cb.best_reward:.4f}\")\n",
+ "print(f\" Trainer: {TrainerClass.__name__}\")\n",
+ "print(f\" W&B run: {wandb.run.url}\")\n",
+ "print(f\"{'='*70}\")"
+ ]
+ },
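+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Aside: early-stopping budget.** A minimal, standalone sketch of the improvement counter that `EvalRewardCallback` applies above. The helper `stop_step` and the reward trace are invented purely for illustration; only `EARLY_STOPPING_PATIENCE`, `EARLY_STOPPING_DELTA`, and `EVAL_STEPS` come from the notebook config."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Dry run of the early-stopping rule on a made-up reward trace (sketch only).\n",
+ "def stop_step(trace, patience=EARLY_STOPPING_PATIENCE, delta=EARLY_STOPPING_DELTA):\n",
+ "    best, misses = -float(\"inf\"), 0\n",
+ "    for i, r in enumerate(trace, start=1):\n",
+ "        if r > best + delta:\n",
+ "            best, misses = r, 0  # improvement resets the counter\n",
+ "        else:\n",
+ "            misses += 1\n",
+ "            if misses >= patience:\n",
+ "                return i * EVAL_STEPS  # global step where training would halt\n",
+ "    return None  # patience never exhausted\n",
+ "\n",
+ "fake_trace = [0.30, 0.35, 0.36] + [0.36] * EARLY_STOPPING_PATIENCE\n",
+ "print(\"Hypothetical trace stops at step:\", stop_step(fake_trace))\n",
+ "print(\"Worst-case steps after best eval:\", EARLY_STOPPING_PATIENCE * EVAL_STEPS)"
+ ]
+ },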
+ {
+ "cell_type": "markdown",
+ "id": "b5a463d0",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "\n",
+ "## Cell 12: Save Adapter"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "GRPO_ADAPTER_DIR.mkdir(parents=True, exist_ok=True)\n",
+ "model.save_pretrained(str(GRPO_ADAPTER_DIR))\n",
+ "tokenizer.save_pretrained(str(GRPO_ADAPTER_DIR))\n",
+ "\n",
+ "summary = {\n",
+ " \"model_id\": MODEL_ID,\n",
+ " \"sft_adapter\": str(SFT_ADAPTER_DIR),\n",
+ " \"method\": \"GRPO\",\n",
+ " \"version\": \"v3\",\n",
+ " \"train_loss\": result.training_loss,\n",
+ " \"best_eval_reward\": eval_cb.best_reward,\n",
+ " \"num_prompts\": len(train_dataset),\n",
+ " \"num_generations\": NUM_GENERATIONS,\n",
+ " \"scale_rewards\": SCALE_REWARDS,\n",
+ " \"temperature\": TEMPERATURE,\n",
+ " \"learning_rate\": LEARNING_RATE,\n",
+ " \"beta\": BETA,\n",
+ " \"max_completion_length\": MAX_COMPLETION_LENGTH,\n",
+ " \"max_steps\": MAX_STEPS,\n",
+ " \"actual_steps\": result.global_step,\n",
+ " \"epochs\": NUM_EPOCHS,\n",
+ " \"max_seq_length\": MAX_SEQ_LENGTH,\n",
+ " \"duration_seconds\": round(elapsed),\n",
+ " \"gpu\": \"L4\",\n",
+ " \"platform\": \"vertex-ai-workbench\",\n",
+ " \"v3_fixes\": [\n",
+ " \"temperature=1.0 (Skywork-OR1)\",\n",
+ " \"max_completion_length=4096 (Dr. GRPO)\",\n",
+ " \"learning_rate=2e-6 (4x v2)\",\n",
+ " \"beta=0.0 (Dr. GRPO)\",\n",
+ " \"staged rewards (Reasoning-SQL)\",\n",
+ " \"zero-advantage noise (Skywork-OR1)\",\n",
+ " \"entropy monitoring callback\",\n",
+ " ],\n",
+ "}\n",
+ "with open(GRPO_ADAPTER_DIR / \"training_summary.json\", \"w\") as f:\n",
+ " json.dump(summary, f, indent=2)\n",
+ "\n",
+ "print(f\"✓ Adapter saved to {GRPO_ADAPTER_DIR}\")\n",
+ "print(f\" Files: {[f.name for f in GRPO_ADAPTER_DIR.iterdir() if f.is_file()]}\")"
+ ]
+ },
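+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Aside: read-back check.** A small sketch that re-opens the `training_summary.json` just written and spot-checks a few fields. The `required` list mirrors keys from the `summary` dict above; it is illustrative, not a schema."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import json\n",
+ "\n",
+ "# Re-open the summary written above and verify it round-trips.\n",
+ "with open(GRPO_ADAPTER_DIR / \"training_summary.json\") as f:\n",
+ "    saved = json.load(f)\n",
+ "\n",
+ "required = [\"model_id\", \"version\", \"best_eval_reward\", \"actual_steps\"]\n",
+ "missing = [k for k in required if k not in saved]\n",
+ "assert not missing, f\"training_summary.json missing keys: {missing}\"\n",
+ "print(f\"summary ok: version={saved['version']}, steps={saved['actual_steps']}, best eval R={saved['best_eval_reward']:.4f}\")"
+ ]
+ },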
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "\n",
+ "## Cell 13: Validation v3"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "FastLanguageModel.for_inference(model)\n",
+ "\n",
+ "system_msg = {\"role\": \"system\", \"content\": SYSTEM_PT}\n",
+ "\n",
+ "test_prompts = [\n",
+ " {\"role\": \"user\", \"content\": (\n",
+ " \"Analise esta avaliação de e-commerce brasileiro e extraia dados estruturados.\\n\\n\"\n",
+ " \"nota=2/5 | status=delivered\\ntítulo: decepcionado\\n\"\n",
+ " \"texto: Produto veio com defeito e o vendedor não respondeu.\\n\\n\"\n",
+ " \"Retorne um objeto JSON com exatamente estas chaves:\\n\"\n",
+ " \"sentiment, sentiment_score, churn_risk, delivery_issue, product_issue, \"\n",
+ " \"seller_issue, main_complaint, complaint_category, repeat_intent, would_recommend\"\n",
+ " )},\n",
+ " {\"role\": \"user\", \"content\": (\n",
+ " \"Analise esta avaliação de e-commerce brasileiro e extraia dados estruturados.\\n\\n\"\n",
+ " \"nota=5/5 | status=delivered\\ntítulo: adorei!\\n\"\n",
+ " \"texto: Entrega rápida e produto exatamente como descrito. Recomendo!\\n\\n\"\n",
+ " \"Retorne um objeto JSON com exatamente estas chaves:\\n\"\n",
+ " \"sentiment, sentiment_score, churn_risk, delivery_issue, product_issue, \"\n",
+ " \"seller_issue, main_complaint, complaint_category, repeat_intent, would_recommend\"\n",
+ " )},\n",
+ " {\"role\": \"user\", \"content\": \"Quais são as categorias de reclamação mais frequentes e como afetam a nota média?\"},\n",
+ " {\"role\": \"user\", \"content\": \"Analise a retenção de clientes afetados por product_quality.\"},\n",
+ " {\"role\": \"user\", \"content\": (\n",
+ " \"Perfil do cliente:\\n- Estado: MG\\n- Valor do pedido: R$150\\n\"\n",
+ " \"- Reclamação: produto com defeito\\n- Nota: 1.0/5\\n\\n\"\n",
+ " \"Este cliente deve receber uma notificação de reengajamento?\"\n",
+ " )},\n",
+ " {\"role\": \"user\", \"content\": \"Compare a satisfação de clientes em SP vs RJ.\"},\n",
+ " {\"role\": \"user\", \"content\": (\n",
+ " \"Crie uma notificação push de reengajamento para um cliente em SP \"\n",
+ " \"que reclamou de atraso na entrega. Nota: 2/5.\"\n",
+ " )},\n",
+ "]\n",
+ "\n",
1512
+ "print(\"=\" * 70)\n",
1513
+ "print(\"GRPO v3 Validation\")\n",
1514
+ "print(\"=\" * 70)\n",
1515
+ "\n",
1516
+ "v3_rewards = []\n",
1517
+ "val_rows = []\n",
1518
+ "for i, prompt in enumerate(test_prompts):\n",
1519
+ " messages = [system_msg, prompt]\n",
1520
+ " text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\n",
1521
+ " inputs = tokenizer(text, return_tensors=\"pt\").to(model.device)\n",
1522
+ "\n",
1523
+ " outputs = model.generate(**inputs, max_new_tokens=MAX_COMPLETION_LENGTH, temperature=0.1, do_sample=True)\n",
1524
+ " gen_tokens = outputs.shape[1] - inputs[\"input_ids\"].shape[1]\n",
1525
+ " response = tokenizer.decode(outputs[0][inputs[\"input_ids\"].shape[1]:], skip_special_tokens=True)\n",
1526
+ "\n",
1527
+ " reward = commerce_reward_fn([response], [text])[0]\n",
1528
+ " v3_rewards.append(reward)\n",
1529
+ " answer = strip_think(response)\n",
1530
+ " task = _classify_task_type(prompt[\"content\"])\n",
1531
+ " hit_ceiling = gen_tokens >= MAX_COMPLETION_LENGTH\n",
1532
+ "\n",
1533
+ " print(f\"\\n--- Sample {i+1} [{task}] (reward={reward:.2f}, tokens={gen_tokens}, ceiling={'HIT' if hit_ceiling else 'ok'}) ---\")\n",
1534
+ " print(f\"Prompt: {prompt['content'][:80]}...\")\n",
1535
+ " print(f\"Answer: {answer[:400]}\")\n",
1536
+ "\n",
1537
+ " val_rows.append([i + 1, task, reward, gen_tokens, hit_ceiling,\n",
1538
+ " prompt[\"content\"][:120], answer[:500]])\n",
1539
+ "\n",
1540
+ "v3_mean = sum(v3_rewards) / len(v3_rewards)\n",
1541
+ "v3_vs_v2 = (v3_mean - 0.54) / 0.54 * 100\n",
1542
+ "\n",
1543
+ "print(f\"\\n{'='*70}\")\n",
1544
+ "print(f\"v3 Validation Summary\")\n",
1545
+ "print(f\"{'='*70}\")\n",
1546
+ "print(f\" Mean reward: {v3_mean:.3f}\")\n",
1547
+ "print(f\" Min: {min(v3_rewards):.3f}\")\n",
1548
+ "print(f\" Max: {max(v3_rewards):.3f}\")\n",
1549
+ "print()\n",
1550
+ "print(f\" Comparison to baselines:\")\n",
1551
+ "print(f\" SFT calibration (Cell 7): mean=0.38\")\n",
1552
+ "print(f\" GRPO v2 validation: mean=0.54\")\n",
1553
+ "print(f\" GRPO v3 validation: mean={v3_mean:.3f}\")\n",
1554
+ "print(f\" v3 vs v2: {v3_vs_v2:+.1f}%\")\n",
1555
+ "\n",
1556
+ "# ── Log validation results to W&B ────────────────────────────────────────────\n",
1557
+ "val_table = wandb.Table(\n",
1558
+ " columns=[\"sample\", \"task\", \"reward\", \"tokens\", \"hit_ceiling\", \"prompt_preview\", \"answer_preview\"],\n",
1559
+ " data=val_rows,\n",
1560
+ ")\n",
1561
+ "wandb.log({\n",
1562
+ " \"validation/mean_reward\": v3_mean,\n",
1563
+ " \"validation/min_reward\": min(v3_rewards),\n",
1564
+ " \"validation/max_reward\": max(v3_rewards),\n",
1565
+ " \"validation/v3_vs_v2_pct\": v3_vs_v2,\n",
1566
+ " \"validation/samples\": val_table,\n",
1567
+ "})\n",
1568
+ "wandb.summary.update({\n",
1569
+ " \"validation/mean_reward\": v3_mean,\n",
1570
+ " \"validation/v3_vs_v2_pct\": v3_vs_v2,\n",
1571
+ "})\n",
1572
+ "\n",
1573
+ "# ── Close the W&B run — all outputs are now persisted ────────────────────────\n",
1574
+ "wandb.finish()\n",
1575
+ "print(f\"\\n✓ W&B run finalized — all outputs saved\")"
1576
+ ]
1577
+ },
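+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Aside: per-task breakdown.** A follow-up sketch that regroups `val_rows` from the validation cell above into per-task mean rewards, so the seven samples can be read by task type instead of a single global mean. Assumes `val_rows` is still in scope; the grouping here is illustrative, not part of the training pipeline."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from collections import defaultdict\n",
+ "\n",
+ "# Group validation rows by task and report mean reward and ceiling hits.\n",
+ "by_task = defaultdict(list)\n",
+ "for _, task, reward, tokens, hit_ceiling, *_ in val_rows:\n",
+ "    by_task[task].append((reward, hit_ceiling))\n",
+ "\n",
+ "for task, rows in sorted(by_task.items()):\n",
+ "    mean_r = sum(r for r, _ in rows) / len(rows)\n",
+ "    ceiling_hits = sum(1 for _, h in rows if h)\n",
+ "    print(f\"{task:<24} n={len(rows)}  mean_reward={mean_r:.3f}  ceiling_hits={ceiling_hits}\")"
+ ]
+ }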
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "version": "3.10.0"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+ }