rtferraz committed on
Commit a6a8b11 · verified · 1 Parent(s): 042d2b9

feat: add GRPO v3 implementation with entropy collapse fixes

Files changed (1)
  1. grpo_vertex_v3.md +1322 -0
grpo_vertex_v3.md ADDED
@@ -0,0 +1,1322 @@
1
+ # Tucano2 Commerce — GRPO Training v3 (Vertex AI Workbench / L4)
2
+
3
+ **v3 changes over v2 — grounded in published research:**
4
+
5
+ | Change | v2 Value | v3 Value | Paper Reference |
6
+ |--------|----------|----------|----------------|
7
+ | Temperature | 0.8 | **1.0** | Skywork-OR1 (2505.22312) §4: τ=1.0 gives 5-8% better results, delays entropy collapse (toy check below the table) |
8
+ | Completion length | 2048 | **4096** | Dr. GRPO (2503.20783) §3.1: length bias inflates wrong answers → ceiling hit blocks learning |
9
+ | Num generations | 8 | **4** | VRAM tradeoff: 4×4096 ≈ 8×2048. MC-GRPO (2601.22582): G=4 works with noise mitigation |
10
+ | Learning rate | 5e-7 | **2e-6** | Dr. GRPO Appendix G: LR=1e-6; Reasoning-SQL: LR=1e-6. v2 clip_ratio=0 → room to push 2-4× |
11
+ | β (KL penalty) | implicit | **0.0** | Dr. GRPO §3.2: β=0 optimal for rule-based rewards |
12
+ | Training data | 300 | **ALL (~1400)** | Skywork-OR1 §3.1: small prompt sets → model memorizes → entropy collapse |
13
+ | Reward functions | single composite | **staged (format→partial→task)** | Reasoning-SQL (2503.23157) §3.2: format rewards converge first, enable task learning |
14
+ | Zero-advantage groups | included | **filtered with noise injection** | Skywork-OR1 §3.1: zero-std groups destabilize training |
15
+ | Entropy monitoring | none | **EntropyMonitorCallback** | Skywork-OR1 §4: early detection prevents collapse |
16
+ | Early stopping patience | 10 | **15** | More runway for longer completions |
17
+ | Save total limit | 3 | **5** | Keep more checkpoints — v2 lost the best one |
18
+ | Eval temperature | 0.7 | **0.1** | Near-deterministic eval = less noisy signal |
19
+ | General reasoning mix | none | **30% (optional)** | Cocktail Effect (2410.01109): multi-task mix boosts domain performance 2-15% |
20
+
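+ To make the temperature row above concrete, here is how sampling entropy scales with τ — a minimal sketch over hypothetical logits, not model outputs:
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ # Toy logits standing in for a single decoding step (hypothetical values).
+ logits = torch.tensor([4.0, 3.0, 2.0, 0.5])
+
+ for tau in (0.8, 1.0):
+     probs = F.softmax(logits / tau, dim=-1)
+     entropy = -(probs * probs.log()).sum()
+     print(f"tau={tau}: entropy={entropy:.3f} nats, top-1 p={probs.max():.3f}")
+
+ # Higher τ → higher entropy → more diverse completions per group, which is
+ # exactly the within-group variance GRPO's relative advantage depends on.
+ ```
+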
21
+ **Prerequisites:**
22
+ - Upload `data/pairs/train.jsonl` (2.1 MB) to `./data/pairs/`
23
+ - Upload `models/tucano2-commerce-sft/` (126 MB) to `./models/tucano2-commerce-sft/`
24
+ - **NEW:** Optional `data/pairs/general_reasoning.jsonl` for 30% general data mix
25
+
26
+ **Hardware:** L4 (24GB), PyTorch kernel, bf16 supported
27
+
28
+ ---
29
+
30
+ ## Cell 1: Dependencies
31
+
32
+ Restart your kernel first (Kernel → Restart), then run these cells in order, one at a time:
33
+
34
+ ```python
35
+ # Cell 1a — Nuke everything ML-related
36
+ !pip uninstall -y torch torchvision torchaudio \
37
+ unsloth unsloth-zoo \
38
+ trl transformers peft accelerate \
39
+ bitsandbytes vllm vllm-flash-attn \
40
+ datasets tokenizers safetensors huggingface-hub \
41
+ wandb xformers triton \
42
+ cuda-bindings cuda-python \
43
+ sentencepiece protobuf \
44
+ 2>/dev/null
45
+ ```
46
+
47
+ ```python
48
+ # Cell 1b — Kill any stragglers
49
+ !pip freeze | grep -iE "torch|unsloth|trl|vllm|bitsandbytes|transformers|peft|accelerate" | xargs -r pip uninstall -y 2>/dev/null
50
+ ```
51
+
52
+ ```python
53
+ # Cell 1c — Purge cache
54
+ !pip cache purge
55
+ ```
56
+
57
+ **⚠️ Restart kernel again**, then:
58
+
59
+ ```python
60
+ # Cell 1d — Clean install, correct order
61
+ !pip install "unsloth"
62
+ ```
63
+
64
+ ```python
65
+ # Cell 1e — Pin TRL (Unsloth may pull a different version)
66
+ !pip install "trl==0.24.0" --no-deps
67
+ ```
68
+
69
+ ```python
70
+ # Cell 1f — Extra deps
71
+ !pip install "rich" "wandb"
72
+ ```
73
+
74
+ ---
75
+
76
+ ## Cell 2: Hello World — GPU + Unsloth Verification
77
+
78
+ ```python
79
+ import torch
80
+
81
+ print(f"CUDA available: {torch.cuda.is_available()}")
82
+ print(f"GPU: {torch.cuda.get_device_name(0)}")
83
+ print(f"VRAM: {torch.cuda.get_device_properties(0).total_mem / 1e9:.1f} GB")
84
+ print(f"bf16 support: {torch.cuda.is_bf16_supported()}")
85
+
86
+ from unsloth import FastLanguageModel
87
+ print("\n✓ Unsloth loaded successfully")
88
+
89
+ import trl
90
+ print(f"✓ TRL version: {trl.__version__}")
91
+
92
+ import transformers
93
+ print(f"✓ Transformers version: {transformers.__version__}")
94
+ ```
95
+
96
+ ---
97
+
98
+ ## Cell 3: Config + Constants
99
+
100
+ ```python
101
+ import os
102
+ os.environ["UNSLOTH_COMPILE_DISABLE"] = "1"
103
+
104
+ import json
105
+ import re
106
+ import time
107
+ import random
108
+ import gc
109
+ from pathlib import Path
110
+
111
+ # ══════════════════════════════════════════════════════════════════════════════
112
+ # v3 CONFIG — Every change is annotated with paper reference
113
+ # ══════════════════════════════════════════════════════════════════════════════
114
+
115
+ MODEL_ID = "Polygl0t/Tucano2-qwen-3.7B-Think"
116
+ MAX_SEQ_LENGTH = 8192 # v3: increased from 4096 — model supports 32k, we need room for 4096 completion + prompt
117
+
118
+ # ── Paths ─────────────────────────────────────────────────────────────────────
119
+ DATA_DIR = Path("/home/jupyter/tucano2/data")
120
+ MODELS_DIR = Path("/home/jupyter/tucano2/models")
121
+ SFT_ADAPTER_DIR = MODELS_DIR / "tucano2-commerce-sft"
122
+ GRPO_ADAPTER_DIR = MODELS_DIR / "tucano2-commerce-grpo-v3" # v3: separate dir from v2
123
+ CHECKPOINT_DIR = GRPO_ADAPTER_DIR / "checkpoints"
124
+
125
+ # ── Training data ─────────────────────────────────────────────────────────────
126
+ GRPO_PROMPTS = None # v3: None = use ALL available prompts (was 300 subset in v2)
127
+ GENERAL_MIX_RATIO = 0.0 # v3: set to 0.3 if general_reasoning.jsonl exists (Cocktail Effect paper)
128
+
129
+ # ── Valid enums for reward scoring (unchanged from v2) ────────────────────────
130
+ VALID_SENTIMENTS = {"positive", "negative", "neutral"}
131
+ VALID_CATEGORIES = {
132
+ "delivery_delay", "product_quality", "product_not_received",
133
+ "wrong_product", "seller_communication", "app_issue",
134
+ "price_value", "other", "none",
135
+ }
136
+ VALID_CHURN = {"low", "medium", "high"}
137
+ VALID_REPEAT = {"yes", "no", "maybe"}
138
+ EXTRACTION_FIELDS = [
139
+ "sentiment", "sentiment_score", "churn_risk", "delivery_issue",
140
+ "product_issue", "seller_issue", "main_complaint",
141
+ "complaint_category", "repeat_intent", "would_recommend",
142
+ ]
143
+
144
+ SYSTEM_PT = (
145
+ "Você é um assistente de IA especializado em análise de e-commerce brasileiro. "
146
+ "Você compreende avaliações de clientes em português e padrões de comércio brasileiro."
147
+ )
148
+
149
+ # ══════════════════════════════════════════════════════════════════════════════
150
+ # TRAINING HYPERPARAMETERS — v3 fixes (all changes annotated)
151
+ # ══════════════════════════════════════════════════════════════════════════════
152
+
153
+ # ── Core GRPO params ──────────────────────────────────────────────────────────
154
+ BATCH_SIZE = 4
155
+ GRAD_ACCUM = 1 # v3: reduced from 2. Effective batch = 4×1 = 4 (was 8)
156
+ # With G=4: steps = prompts × 4 / 4 = prompts per epoch
157
+ NUM_GENERATIONS = 4 # v3: reduced from 8 — VRAM tradeoff for longer completions
158
+ # MC-GRPO (2601.22582): G=4 works if noise is mitigated
159
+ SCALE_REWARDS = False # Dr. GRPO (2503.20783): remove std normalization bias
160
+
161
+ # ── v3 CRITICAL FIXES ────────────────────────────────────────────────────────
162
+
163
+ # FIX 1: Temperature — prevent entropy collapse
164
+ # v2 had 0.8. All published GRPO papers use 1.0.
165
+ # Skywork-OR1 (2505.22312) ablation: τ=1.0 vs τ=0.6 → 5-8% better test performance
166
+ TEMPERATURE = 1.0
167
+
168
+ # FIX 2: Completion length — remove the ceiling
169
+ # v2: every single completion hit 2048 ceiling. Model couldn't finish reasoning.
170
+ # Dr. GRPO (2503.20783) §3.1: GRPO length bias inflates wrong answers → the ceiling kills the gradient
171
+ MAX_COMPLETION_LENGTH = 4096
172
+
173
+ # FIX 3: Learning rate — more aggressive
174
+ # v2: clip_ratio=0 on all steps → updates were too small to matter
175
+ # Dr. GRPO Appendix G: LR=1e-6 (constant). Reasoning-SQL: LR=1e-6 with cosine.
176
+ # We go 2× the papers' 1e-6 (4× v2's 5e-7) since v2 showed zero clipping (model can absorb a stronger push)
177
+ LEARNING_RATE = 2e-6
178
+
179
+ # FIX 4: β = 0 (no KL penalty)
180
+ # Dr. GRPO (2503.20783) §3.2: KL penalty is unnecessary for rule-based rewards
181
+ # v2 used implicit KL through default β — we explicitly disable it
182
+ BETA = 0.0
183
+
184
+ # ── Training schedule ─────────────────────────────────────────────────────────
185
+ NUM_EPOCHS = 1
186
+ MAX_STEPS = 500 # v3: increased for expanded data; early stopping will halt if needed
187
+ # With ~1400 prompts × 4 gen / (4 batch × 1 accum) = 1400 steps/epoch
188
+ # MAX_STEPS=500 < 1 epoch — early stopping or manual extension
189
+
190
+ # ── Checkpoint + Eval + Early-Stop ────────────────────────────────────────────
191
+ EVAL_SPLIT_RATIO = 0.15
192
+ EVAL_STEPS = 10
193
+ EARLY_STOPPING_PATIENCE = 15 # v3: increased from 10 — gives 150 steps of runway
194
+ EARLY_STOPPING_DELTA = 0.005 # v3: reduced from 0.01 — more sensitive to small gains
195
+ SAVE_STEPS = 10 # v3: more frequent (was 15) — never lose best checkpoint again
196
+ SAVE_TOTAL_LIMIT = 5 # v3: keep more checkpoints (was 3 — lost best in v2)
197
+ WANDB_PROJECT = "tucano2-commerce"
198
+
199
+ # ── Eval callback ─────────────────────────────────────────────────────────────
200
+ EVAL_MAX_SAMPLES = 5
201
+ EVAL_MAX_TOKENS = 4096 # v3: match training max_completion_length (was 2048)
202
+ EVAL_TEMPERATURE = 0.1 # v3: near-deterministic eval for a less noisy signal (was 0.7)
203
+
204
+ # ── Backend ───────────────────────────────────────────────────────────────────
205
+ USE_VLLM = False
206
+
207
+ # ── v3: Zero-advantage noise injection ────────────────────────────────────────
208
+ # Skywork-OR1 (2505.22312) §3.1: zero-std groups destabilize GRPO training
209
+ # When all G completions get identical rewards, the advantage is undefined.
210
+ # Noise injection breaks ties without corrupting the signal.
211
+ ZERO_ADV_NOISE_STD = 0.005 # Small gaussian noise added to zero-variance groups
212
+
213
+ os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
214
+
215
+ # ── Version assertion ─────────────────────────────────────────────────────────
216
+ import trl as _trl
217
+ assert _trl.__version__ == "0.24.0", (
218
+ f"UnslothGRPOTrainer was written for TRL 0.24.0, found {_trl.__version__}.\n"
219
+ "Verify that GRPOTrainer._generate() still exists before proceeding."
220
+ )
221
+
222
+ print("✓ v3 Config loaded")
223
+ print(f" SFT adapter: {SFT_ADAPTER_DIR} (exists: {SFT_ADAPTER_DIR.exists()})")
224
+ print(f" Train data: {DATA_DIR / 'pairs' / 'train.jsonl'} (exists: {(DATA_DIR / 'pairs' / 'train.jsonl').exists()})")
225
+ print(f" Training: batch={BATCH_SIZE}, grad_accum={GRAD_ACCUM}, eff_batch={BATCH_SIZE*GRAD_ACCUM}")
226
+ print(f" GRPO: G={NUM_GENERATIONS}, temp={TEMPERATURE}, LR={LEARNING_RATE}, β={BETA}")
227
+ print(f" Completion: max={MAX_COMPLETION_LENGTH} (v2 was 2048)")
228
+ print(f" ADR: save_steps={SAVE_STEPS}, eval_steps={EVAL_STEPS}, patience={EARLY_STOPPING_PATIENCE}")
229
+ print(f"✓ TRL {_trl.__version__} verified")
230
+
231
+ # ══════════════════════════════════════════════════════════════════════════════
232
+ # v3 VRAM BUDGET (L4 24GB)
233
+ # ══════════════════════════════════════════════════════════════════════════════
234
+ # Model (NF4): ~3.5 GB
235
+ # KV Cache (8192 seq): ~3.0 GB
236
+ # Activations: ~4.0 GB
237
+ # Optimizer states: ~3.0 GB
238
+ # Generations (4×4096): ~8.0 GB
239
+ # ─────────────────────────────────
240
+ # Estimated total: ~21.5 GB
241
+ # Headroom: ~2.5 GB
242
+ #
243
+ # If OOM: reduce MAX_COMPLETION_LENGTH to 3072 first, then 2560.
244
+ # Do NOT reduce NUM_GENERATIONS below 4 — GRPO needs variance.
245
+ # ══════════════════════════════════════════════════════════════════════════════
246
+ ```
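+
+ Before loading the model, it is worth checking the static budget above against what CUDA actually reports. A small probe — the 22 GB threshold is an assumption (≈21.5 GB budget plus margin):
+
+ ```python
+ import torch
+
+ # Compare the hand-computed budget with the device's real free memory.
+ free_b, total_b = torch.cuda.mem_get_info()  # returns (free, total) in bytes
+ print(f"Free: {free_b / 1e9:.1f} GB / Total: {total_b / 1e9:.1f} GB")
+ if free_b / 1e9 < 22.0:  # assumed threshold: ~21.5 GB budget + margin
+     print("⚠️ Less free VRAM than the v3 budget assumes — "
+           "consider MAX_COMPLETION_LENGTH=3072 before training.")
+ ```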
247
+
248
+ ---
249
+
250
+ ## Cell 4: Load SFT Adapter
251
+
252
+ ```python
253
+ print("Loading SFT adapter...")
254
+ model, tokenizer = FastLanguageModel.from_pretrained(
255
+ model_name=str(SFT_ADAPTER_DIR),
256
+ max_seq_length=MAX_SEQ_LENGTH,
257
+ load_in_4bit=True,
258
+ dtype=None,
259
+ )
260
+
261
+ if tokenizer.pad_token is None:
262
+ tokenizer.pad_token = tokenizer.eos_token
263
+
264
+ # Load chat template from base model (SFT adapter doesn't save it)
265
+ from transformers import AutoTokenizer
266
+ base_tok = AutoTokenizer.from_pretrained(MODEL_ID)
267
+ tokenizer.chat_template = base_tok.chat_template
268
+ del base_tok
269
+
270
+ # v2: Force KV cache — Unsloth patching may reset this
271
+ model.config.use_cache = True
272
+ model.generation_config.use_cache = True
273
+
274
+ print(f"✓ Model loaded on {model.device}")
275
+ print(f" use_cache: {model.config.use_cache}")
276
+ print(f" Params: {sum(p.numel() for p in model.parameters()) / 1e6:.0f}M")
277
+ print(f" Chat template: {tokenizer.chat_template[:50]}...")
278
+ ```
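+
+ Since the chat template is copied over from the base model, a cheap render check catches a missing or broken template before Cell 5 — a sketch; the exact template text depends on the base model:
+
+ ```python
+ # Render a toy conversation and confirm the user turn survives templating.
+ _probe = tokenizer.apply_chat_template(
+     [{"role": "system", "content": SYSTEM_PT},
+      {"role": "user", "content": "teste"}],
+     tokenize=False, add_generation_prompt=True,
+ )
+ assert "teste" in _probe, "chat template dropped the user turn"
+ print(f"✓ Template renders ({len(_probe)} chars): ...{_probe[-60:]!r}")
+ ```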
279
+
280
+ ---
281
+
282
+ ## Cell 5: Single Inference Test
283
+
284
+ **Gate:** Does the model close `</think>` and produce an answer within 4096 tokens?
285
+
286
+ ```python
287
+ FastLanguageModel.for_inference(model)
288
+
289
+ test_msgs = [
290
+ {"role": "system", "content": SYSTEM_PT},
291
+ {"role": "user", "content": "Quais são as categorias de reclamação mais frequentes e como afetam a nota média?"},
292
+ ]
293
+ text = tokenizer.apply_chat_template(test_msgs, tokenize=False, add_generation_prompt=True)
294
+ inputs = tokenizer(text, return_tensors="pt").to(model.device)
295
+
296
+ t0 = time.time()
297
+ outputs = model.generate(**inputs, max_new_tokens=MAX_COMPLETION_LENGTH, temperature=0.7, do_sample=True)
298
+ elapsed = time.time() - t0
299
+
300
+ response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
301
+ gen_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
302
+
303
+ print(f"Generation time: {elapsed:.1f}s ({gen_tokens} tokens, {gen_tokens/elapsed:.1f} tok/s)")
304
+ print(f"Response length: {len(response)} chars, {gen_tokens} tokens")
305
+ print(f"Hit ceiling: {gen_tokens >= MAX_COMPLETION_LENGTH}") # v3: should NOT hit ceiling with 4096
306
+ print(f"closed_think: {'</think>' in response}")
307
+ print(f"\n{'='*60}")
308
+ print(response[:800])
309
+ ```
310
+
311
+ ---
312
+
313
+ ## Cell 5b: KV Cache Diagnostic
314
+
315
+ ```python
316
+ import time
317
+ FastLanguageModel.for_inference(model)
318
+
319
+ _kv_msgs = [{"role": "system", "content": SYSTEM_PT},
320
+ {"role": "user", "content": "Qual a categoria de reclamação mais frequente?"}]
321
+ _kv_text = tokenizer.apply_chat_template(_kv_msgs, tokenize=False, add_generation_prompt=True)
322
+ _kv_inputs = tokenizer(_kv_text, return_tensors="pt").to(model.device)
323
+
324
+ _token_times, _past, _generated = [], None, _kv_inputs["input_ids"]
325
+ with torch.no_grad():
326
+ for _step in range(50):
327
+ _t0 = time.time()
328
+ seq_len = _generated.shape[1]
329
+ if _past is None:
330
+ _position_ids = torch.arange(seq_len, dtype=torch.long, device=model.device).unsqueeze(0)
331
+ else:
332
+ _position_ids = torch.tensor([[seq_len - 1]], dtype=torch.long, device=model.device)
333
+ _out = model(
334
+ input_ids=_generated[:, -1:] if _past else _generated,
335
+ position_ids=_position_ids,
336
+ attention_mask=torch.ones(1, seq_len, device=model.device),
337
+ past_key_values=_past,
338
+ use_cache=True,
339
+ return_dict=True,
340
+ )
341
+ _past = _out.past_key_values
342
+ _next = _out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
343
+ _generated = torch.cat([_generated, _next], dim=1)
344
+ _token_times.append(time.time() - _t0)
345
+
346
+ _ratio = sum(_token_times[45:]) / max(sum(_token_times[:5]), 1e-9)
347
+ print(f"First 5 tok : {[f'{t*1000:.0f}ms' for t in _token_times[:5]]}")
348
+ print(f"Last 5 tok : {[f'{t*1000:.0f}ms' for t in _token_times[45:]]}")
349
+ print(f"Ratio last/first: {_ratio:.1f}x")
350
+ if _ratio < 3:
351
+ print("✓ KV cache is working correctly")
352
+ elif _ratio < 6:
353
+ print("⚠ KV cache may be degraded — check model.config.use_cache")
354
+ else:
355
+ print("✗ KV cache BROKEN — GRPO generation will be catastrophically slow.")
356
+
357
+ del _past, _generated, _kv_inputs, _token_times, _out
358
+ gc.collect()
359
+ if torch.cuda.is_available(): torch.cuda.empty_cache()
360
+ ```
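+
+ A coarser cross-check of the same thing through the public `generate()` path — the no-cache run recomputes attention over the whole prefix at every step, so it should be clearly slower when caching is actually engaged (a sketch; timings will vary):
+
+ ```python
+ _msgs = [{"role": "system", "content": SYSTEM_PT},
+          {"role": "user", "content": "Qual a categoria de reclamação mais frequente?"}]
+ _in = tokenizer(tokenizer.apply_chat_template(_msgs, tokenize=False, add_generation_prompt=True),
+                 return_tensors="pt").to(model.device)
+ for _uc in (True, False):
+     torch.cuda.synchronize(); _t0 = time.time()
+     model.generate(**_in, max_new_tokens=64, do_sample=False, use_cache=_uc)
+     torch.cuda.synchronize()
+     print(f"use_cache={_uc}: {time.time() - _t0:.2f}s")
+ ```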
361
+
362
+ ---
363
+
364
+ ## Cell 6: Reward Functions v3
365
+
366
+ **v3 changes:**
367
+ - Staged reward design: format → partial content → full task (Reasoning-SQL, 2503.23157)
368
+ - Zero-advantage noise injection (Skywork-OR1, 2505.22312)
369
+ - Extraction reward redesigned for completion-length-friendly scoring
370
+
371
+ ```python
372
+ def strip_think(text: str) -> str:
373
+ """Remove <think>...</think> block, return the answer portion."""
374
+ return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
375
+
376
+
377
+ def has_think_block(text: str) -> bool:
378
+ """Check if text contains a non-empty <think> block."""
379
+ return bool(re.search(r"<think>.+</think>", text, flags=re.DOTALL))
380
+
381
+
382
+ def _classify_task_type(prompt_text: str) -> str:
383
+ """Classify prompt into task type by keywords."""
384
+ p = prompt_text.lower()
385
+ if "retorne um objeto json" in p or "extraia dados" in p:
386
+ return "extraction"
387
+ elif "notificação push" in p or "notificação de reengajamento" in p:
388
+ return "push"
389
+ elif "perfil do cliente" in p:
390
+ return "insights"
391
+ else:
392
+ return "sql_qa"
393
+
394
+
395
+ def _json_similarity(text: str) -> float:
396
+ """Rough heuristic: how JSON-like is this text? 0.0 to 1.0."""
397
+ text = text.strip()
398
+ if not text:
399
+ return 0.0
400
+ score = 0.0
401
+ if text.startswith("{") and text.endswith("}"):
402
+ score += 0.5
403
+ if '"' in text:
404
+ score += 0.2
405
+ if ":" in text:
406
+ score += 0.2
407
+ if "," in text:
408
+ score += 0.1
409
+ return min(score, 1.0)
410
+
411
+
412
+ def _string_similarity(a: str, b: str) -> float:
413
+ """Simple Jaccard-like similarity for short strings. 0.0 to 1.0."""
414
+ if not a or not b:
415
+ return 0.0
416
+ a_set = set(a.split())
417
+ b_set = set(b.split())
418
+ intersection = len(a_set & b_set)
419
+ union = len(a_set | b_set)
420
+ return intersection / union if union > 0 else 0.0
421
+
422
+
423
+ # ══════════════════════════════════════════════════════════════════════════════
424
+ # v3 STAGED REWARD DESIGN
425
+ # Reference: Reasoning-SQL (2503.23157) §3.2
426
+ #
427
+ # Each reward function scores THREE stages independently:
428
+ # Stage 1 — FORMAT (0.0–0.2): Is the output well-structured?
429
+ # Stage 2 — PARTIAL (0.0–0.3): Are some content elements correct?
430
+ # Stage 3 — TASK (0.0–0.5): Is the full task completed correctly?
431
+ #
432
+ # Format rewards converge first (easy to learn), which stabilizes training
433
+ # and enables the model to then learn harder task-specific skills.
434
+ # ══════════════════════════════════════════════════════════════════════════════
435
+
436
+
437
+ def reward_extraction(completion: str) -> float:
438
+ """Staged reward for structured extraction (max 1.0)."""
439
+ answer = strip_think(completion)
440
+
441
+ # ── Stage 1: FORMAT (max 0.2) ─────────────────────────────────────────────
442
+ r_format = 0.0
443
+ if has_think_block(completion):
444
+ r_format += 0.1 # Used reasoning
445
+
446
+ try:
447
+ data = json.loads(answer)
448
+ if isinstance(data, dict):
449
+ r_format += 0.1 # Valid JSON object
450
+ except (json.JSONDecodeError, TypeError):
451
+ r_format += 0.05 * _json_similarity(answer)
452
+ return min(r_format, 0.2)
453
+
454
+ if not isinstance(data, dict):
455
+ return min(r_format, 0.2)
456
+
457
+ # ── Stage 2: PARTIAL CONTENT (max 0.3) ────────────────────────────────────
458
+ r_partial = 0.0
459
+
460
+ present = sum(1 for f in EXTRACTION_FIELDS if f in data)
461
+ r_partial += 0.15 * (present / len(EXTRACTION_FIELDS))
462
+
463
+ type_checks = 0
464
+ type_total = 0
465
+ for field in EXTRACTION_FIELDS:
466
+ if field not in data:
467
+ continue
468
+ type_total += 1
469
+ val = data[field]
470
+ if field in ("delivery_issue", "product_issue", "seller_issue", "would_recommend"):
471
+ if isinstance(val, bool):
472
+ type_checks += 1
473
+ elif field in ("sentiment_score",):
474
+ if isinstance(val, (int, float)):
475
+ type_checks += 1
476
+ elif field in ("main_complaint", "sentiment", "complaint_category", "churn_risk", "repeat_intent"):
477
+ if isinstance(val, str):
478
+ type_checks += 1
479
+ if type_total > 0:
480
+ r_partial += 0.15 * (type_checks / type_total)
481
+
482
+ # ── Stage 3: FULL TASK (max 0.5) ─────────────────────────────────────────
483
+ r_task = 0.0
484
+ cat_checks = 0
485
+ cat_total = 0
486
+
487
+ checks = [
488
+ ("sentiment", lambda v: v in VALID_SENTIMENTS),
489
+ ("complaint_category", lambda v: v in VALID_CATEGORIES),
490
+ ("churn_risk", lambda v: v in VALID_CHURN),
491
+ ("repeat_intent", lambda v: v in VALID_REPEAT),
492
+ ("sentiment_score", lambda v: isinstance(v, (int, float)) and 1 <= v <= 5),
493
+ ]
494
+ for field, validator in checks:
495
+ cat_total += 1
496
+ if field in data and validator(data[field]):
497
+ cat_checks += 1
498
+
499
+ for bool_field in ("delivery_issue", "product_issue", "seller_issue", "would_recommend"):
500
+ cat_total += 1
501
+ if bool_field in data and isinstance(data[bool_field], bool):
502
+ cat_checks += 1
503
+
504
+ if cat_total > 0:
505
+ r_task += 0.35 * (cat_checks / cat_total)
506
+
507
+ if "main_complaint" in data and isinstance(data["main_complaint"], str):
508
+ complaint = data["main_complaint"].strip()
509
+ if len(complaint) > 10:
510
+ r_task += 0.15
511
+
512
+ return min(r_format + r_partial + r_task, 1.0)
513
+
514
+
515
+ def reward_sql_qa(completion: str) -> float:
516
+ """Staged reward for SQL Q&A (max 1.0)."""
517
+ answer = strip_think(completion)
518
+
519
+ # ── Stage 1: FORMAT (max 0.2)
520
+ r_format = 0.0
521
+ if has_think_block(completion):
522
+ r_format += 0.1
523
+ if "```" in answer or re.search(r"SELECT|FROM", answer, re.IGNORECASE):
524
+ r_format += 0.1
525
+
526
+ # ── Stage 2: PARTIAL (max 0.3)
527
+ r_partial = 0.0
528
+ sql_keywords = r"SELECT|FROM|WHERE|GROUP BY|ORDER BY|COUNT|SUM|AVG|JOIN|HAVING"
529
+ matches = len(re.findall(sql_keywords, answer, re.IGNORECASE))
530
+ r_partial += min(0.15, 0.03 * matches)
531
+ numbers = re.findall(r"\d+(?:[.,]\d+)?", answer)
532
+ r_partial += min(0.15, 0.03 * len(numbers))
533
+
534
+ # ── Stage 3: TASK (max 0.5)
535
+ r_task = 0.0
536
+ length = len(answer)
537
+ if 50 <= length <= 600:
538
+ r_task += 0.25
539
+ elif length > 0:
540
+ r_task += 0.25 * max(0, 1 - abs(length - 325) / 275)
541
+ explanation_markers = ["para ", "porque", "resultado", "mostra", "indica", "análise"]
542
+ expl_matches = sum(1 for w in explanation_markers if w in answer.lower())
543
+ r_task += min(0.25, 0.05 * expl_matches)
544
+
545
+ return min(r_format + r_partial + r_task, 1.0)
546
+
547
+
548
+ def reward_insights(completion: str) -> float:
549
+ """Staged reward for insights (max 1.0)."""
550
+ answer = strip_think(completion)
551
+
552
+ # ── Stage 1: FORMAT (max 0.2)
553
+ r_format = 0.0
554
+ if has_think_block(completion):
555
+ r_format += 0.1
556
+ structure_marks = len(re.findall(r"^[-•*]\s|^\d+[.)]\s|^#{1,3}\s", answer, re.MULTILINE))
557
+ r_format += min(0.1, 0.02 * structure_marks)
558
+
559
+ # ── Stage 2: PARTIAL (max 0.3)
560
+ r_partial = 0.0
561
+ length = len(answer)
562
+ if 100 <= length <= 1200:
563
+ r_partial += 0.15
564
+ elif length > 0:
565
+ r_partial += 0.15 * max(0, 1 - abs(length - 650) / 550)
566
+ pt_markers = re.findall(r"[ãçéêóúâõ]|você|para|como|seu|sua|cliente|produto", answer, re.IGNORECASE)
567
+ r_partial += min(0.15, 0.01 * len(pt_markers))
568
+
569
+ # ── Stage 3: TASK (max 0.5)
570
+ r_task = 0.0
571
+ action_words = ["recomend", "implement", "melhor", "reduzir", "aumentar",
572
+ "priorizar", "investir", "otimizar", "estratégi", "suger",
573
+ "consider", "ação", "plano"]
574
+ matches = sum(1 for w in action_words if w in answer.lower())
575
+ r_task += min(0.3, 0.06 * matches)
576
+ data_refs = len(re.findall(r"\d+%|R\$\s*\d|média|percentual|comparad|taxa", answer, re.IGNORECASE))
577
+ r_task += min(0.2, 0.04 * data_refs)
578
+
579
+ return min(r_format + r_partial + r_task, 1.0)
580
+
581
+
582
+ def reward_push(completion: str) -> float:
583
+ """Staged reward for push notifications (max 1.0)."""
584
+ answer = strip_think(completion)
585
+ if not answer:
586
+ return 0.0
587
+
588
+ # ── Stage 1: FORMAT (max 0.2)
589
+ r_format = 0.0
590
+ if has_think_block(completion):
591
+ r_format += 0.05
592
+ length = len(answer)
593
+ if length <= 160:
594
+ r_format += 0.15
595
+ elif length <= 300:
596
+ r_format += 0.1
597
+ else:
598
+ r_format += 0.05
599
+
600
+ # ── Stage 2: PARTIAL (max 0.3)
601
+ r_partial = 0.0
602
+ pt_markers = re.findall(r"[ãçéêóúâõ]|você|para|como|seu|sua", answer, re.IGNORECASE)
603
+ r_partial += min(0.15, 0.02 * len(pt_markers))
604
+ if re.search(r"[!?]|[\U0001F600-\U0001F64F]|[\U0001F300-\U0001F5FF]", answer):
605
+ r_partial += 0.05
606
+ if len(answer.split()) >= 5:
607
+ r_partial += 0.1
608
+
609
+ # ── Stage 3: TASK (max 0.5)
610
+ r_task = 0.0
611
+ if length <= 120:
612
+ r_task += 0.25
613
+ else:
614
+ r_task += 0.25 * max(0, 1 - (length - 120) / 120)
615
+ generic_phrases = [
616
+ "olá! como podemos ajudar", "obrigado pela sua compra",
617
+ "seu pedido foi confirmado", "agradecemos sua preferência",
618
+ ]
619
+ max_similarity = max(_string_similarity(answer.lower(), g) for g in generic_phrases)
620
+ r_task += 0.25 * (1 - max_similarity)
621
+
622
+ return min(r_format + r_partial + r_task, 1.0)
623
+
624
+
625
+ def commerce_reward_fn(completions, prompts, **kwargs) -> list[float]:
626
+ """
627
+ Master reward function v3: dispatches by task type + zero-advantage noise.
628
+ """
629
+ rewards = []
630
+ for completion, prompt in zip(completions, prompts):
631
+ if isinstance(completion, list):
632
+ comp_text = completion[-1]["content"] if completion else ""
633
+ else:
634
+ comp_text = str(completion)
635
+
636
+ if isinstance(prompt, list):
637
+ prompt_text = " ".join(m.get("content", "") for m in prompt)
638
+ else:
639
+ prompt_text = str(prompt)
640
+
641
+ task = _classify_task_type(prompt_text)
642
+
643
+ if task == "extraction":
644
+ rewards.append(reward_extraction(comp_text))
645
+ elif task == "sql_qa":
646
+ rewards.append(reward_sql_qa(comp_text))
647
+ elif task == "insights":
648
+ rewards.append(reward_insights(comp_text))
649
+ elif task == "push":
650
+ rewards.append(reward_push(comp_text))
651
+ else:
652
+ r = 0.15 if has_think_block(comp_text) else 0.0
653
+ r += 0.2 if comp_text.strip() else 0.0
654
+ rewards.append(r)
655
+
656
+ # ── v3: Zero-advantage noise injection ────────────────────────────────────
657
+ if ZERO_ADV_NOISE_STD > 0 and NUM_GENERATIONS > 1:
658
+ for i in range(0, len(rewards), NUM_GENERATIONS):
659
+ group = rewards[i:i+NUM_GENERATIONS]
660
+ if len(group) < 2:
661
+ continue
662
+ if max(group) - min(group) < 0.001:
663
+ for j in range(i, min(i+NUM_GENERATIONS, len(rewards))):
664
+ rewards[j] += random.gauss(0, ZERO_ADV_NOISE_STD)
665
+
666
+ return rewards
667
+
668
+
669
+ print("✓ v3 Reward functions defined (staged: format → partial → task)")
670
+ ```
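+
+ The staged design and the tie-breaking noise can be exercised offline with hypothetical completion strings (no model needed):
+
+ ```python
+ # Staged scoring: a fully valid extraction vs. a format-credit-only answer.
+ _good = ('<think>análise</think>'
+          '{"sentiment": "negative", "sentiment_score": 2, "churn_risk": "high", '
+          '"delivery_issue": false, "product_issue": true, "seller_issue": true, '
+          '"main_complaint": "produto veio com defeito", '
+          '"complaint_category": "product_quality", "repeat_intent": "no", '
+          '"would_recommend": false}')
+ _bad = "<think>análise</think>não sei"
+ print(f"valid JSON : {reward_extraction(_good):.2f}")  # format + partial + task
+ print(f"no JSON    : {reward_extraction(_bad):.2f}")   # format credit only
+
+ # Zero-advantage noise: NUM_GENERATIONS identical completions tie on reward,
+ # so the injected gaussian noise should break the tie in the 3rd–4th decimal.
+ _prompt = [{"role": "user", "content": "retorne um objeto json ..."}]
+ _rs = commerce_reward_fn([_bad] * NUM_GENERATIONS, [_prompt] * NUM_GENERATIONS)
+ print("tied group after noise:", [f"{r:.4f}" for r in _rs])
+ ```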
671
+
672
+ ---
673
+
674
+ ## Cell 7: Reward Calibration
675
+
676
+ **Gate:** Verify reward variance > 0. Compare v3 scoring to v2 calibration (mean=0.38).
677
+
678
+ ```python
679
+ train_path = DATA_DIR / "pairs" / "train.jsonl"
680
+
681
+ by_type = {"extraction": [], "sql_qa": [], "insights": [], "push": []}
682
+ with open(train_path) as f:
683
+ for line in f:
684
+ row = json.loads(line)
685
+ convs = row["conversations"]
686
+ prompt_msgs = [m for m in convs if m["role"] in ("system", "user")]
687
+ if not prompt_msgs:
688
+ continue
689
+ user_text = " ".join(m["content"] for m in prompt_msgs if m["role"] == "user")
690
+ task = _classify_task_type(user_text)
691
+ by_type[task].append(prompt_msgs)
692
+
693
+ print(f"Prompts by type: {', '.join(f'{k}={len(v)}' for k, v in by_type.items())}")
694
+
695
+ rng = random.Random(42)
696
+ cal_samples = []
697
+ for task_type in ["extraction", "extraction", "sql_qa", "sql_qa", "insights", "insights", "push", "push"]:
698
+ cal_samples.append(rng.choice(by_type[task_type]))
699
+
700
+ FastLanguageModel.for_inference(model)
701
+ print(f"\nReward calibration v3 ({len(cal_samples)} samples):")
702
+ print("-" * 70)
703
+
704
+ cal_rewards = []
705
+ for i, msgs in enumerate(cal_samples):
706
+ text = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
707
+ inputs = tokenizer(text, return_tensors="pt").to(model.device)
708
+ outputs = model.generate(**inputs, max_new_tokens=MAX_COMPLETION_LENGTH, temperature=0.7, do_sample=True)
709
+ response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
710
+ gen_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
711
+
712
+ r = commerce_reward_fn([response], [text])[0]
713
+ cal_rewards.append(r)
714
+ hit_ceiling = gen_tokens >= MAX_COMPLETION_LENGTH
715
+ has_answer = "</think>" in response
716
+ answer_preview = strip_think(response)[:100] if has_answer else "[stuck in <think>]"
717
+ task = _classify_task_type(text)
718
+ print(f" [{task:12s}] reward={r:.2f} | tokens={gen_tokens:4d} | ceiling={'⚠️ HIT' if hit_ceiling else 'ok':6s} | {answer_preview}")
719
+
720
+ print(f"\nMean={sum(cal_rewards)/len(cal_rewards):.2f}, Min={min(cal_rewards):.2f}, Max={max(cal_rewards):.2f}")
721
+ print(f"v2 calibration was: Mean=0.38, Min=0.02, Max=0.70")
722
+ print(f"Variance > 0: {len(set(cal_rewards)) > 1}")
723
+ ```
724
+
725
+ ---
726
+
727
+ ## Cell 8: Dataset Preparation v3
728
+
729
+ ```python
730
+ from datasets import Dataset
731
+
732
+ def prepare_grpo_datasets_v3(n_prompts=GRPO_PROMPTS, eval_ratio=EVAL_SPLIT_RATIO,
733
+ general_mix=GENERAL_MIX_RATIO, seed=42):
734
+ rng = random.Random(seed)
735
+
736
+ train_pools = {}
737
+ eval_records = []
738
+ for task, pool in by_type.items():
739
+ shuffled = pool.copy()
740
+ rng.shuffle(shuffled)
741
+ n_eval = max(1, int(len(shuffled) * eval_ratio))
742
+ eval_records.extend(shuffled[:n_eval])
743
+ train_pools[task] = shuffled[n_eval:]
744
+
745
+ if n_prompts is None:
746
+ train_records = []
747
+ for task, pool in train_pools.items():
748
+ train_records.extend(pool)
749
+ rng.shuffle(train_records)
750
+ else:
751
+ targets = {
752
+ "extraction": int(n_prompts * 0.4),
753
+ "sql_qa": int(n_prompts * 0.4),
754
+ "insights": int(n_prompts * 0.1),
755
+ "push": int(n_prompts * 0.1),
756
+ }
757
+ train_records = []
758
+ for task, target_n in targets.items():
759
+ pool = train_pools[task]
760
+ n = min(target_n, len(pool))
761
+ train_records.extend(rng.sample(pool, n))
762
+ rng.shuffle(train_records)
763
+
764
+ general_path = DATA_DIR / "pairs" / "general_reasoning.jsonl"
765
+ if general_mix > 0 and general_path.exists():
766
+ general_records = []
767
+ with open(general_path) as f:
768
+ for line in f:
769
+ row = json.loads(line)
770
+ convs = row["conversations"]
771
+ prompt_msgs = [m for m in convs if m["role"] in ("system", "user")]
772
+ if prompt_msgs:
773
+ general_records.append(prompt_msgs)
774
+ n_general = int(len(train_records) * general_mix / (1 - general_mix))
775
+ n_general = min(n_general, len(general_records))
776
+ if n_general > 0:
777
+ train_records.extend(rng.sample(general_records, n_general))
778
+ rng.shuffle(train_records)
779
+ print(f" Cocktail Effect: added {n_general} general reasoning samples ({general_mix:.0%} mix)")
780
+ elif general_mix > 0:
781
+ print(f" ⚠️ general_reasoning.jsonl not found — skipping mix")
782
+
783
+ task_dist = {}
784
+ for record in train_records:
785
+ user_text = " ".join(m["content"] for m in record if m["role"] == "user")
786
+ task = _classify_task_type(user_text)
787
+ task_dist[task] = task_dist.get(task, 0) + 1
788
+
789
+ n_domain = len(train_records)
790
+ steps_per_epoch = n_domain * NUM_GENERATIONS // (BATCH_SIZE * GRAD_ACCUM)
791
+
792
+ print(f"v3 Dataset split (eval_ratio={eval_ratio}):")
793
+ print(f" train : {n_domain} prompts")
794
+ print(f" eval : {len(eval_records)} prompts")
795
+ print(f" distribution: {', '.join(f'{k}={v}' for k, v in sorted(task_dist.items()))}")
796
+ print(f" steps/epoch: {n_domain} × {NUM_GENERATIONS} / ({BATCH_SIZE} × {GRAD_ACCUM}) = {steps_per_epoch}")
797
+ print(f" MAX_STEPS={MAX_STEPS} → {'< 1 epoch' if MAX_STEPS < steps_per_epoch else f'{MAX_STEPS/steps_per_epoch:.1f} epochs'}")
798
+
799
+ train_ds = Dataset.from_list([{"prompt": msgs} for msgs in train_records])
800
+ eval_ds = Dataset.from_list([{"prompt": msgs} for msgs in eval_records])
801
+ return train_ds, eval_ds
802
+
803
+
804
+ train_dataset, eval_dataset = prepare_grpo_datasets_v3()
805
+ dataset = train_dataset
806
+ print(f"\n✓ v3 Datasets ready: train={len(train_dataset)}, eval={len(eval_dataset)}")
807
+ ```
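+
+ The mix arithmetic is easy to get wrong: to make general data a fraction r of the *final* set, the count added is n_domain × r / (1 − r), not n_domain × r. A worked example with assumed counts (~1400 prompts minus the 15% eval split):
+
+ ```python
+ # Assumed: ~1190 domain prompts survive the eval split; target 30% general mix.
+ n_domain, r = 1190, 0.30
+ n_general = int(n_domain * r / (1 - r))   # 510
+ total = n_domain + n_general              # 1700
+ print(f"add {n_general} general → {n_general / total:.0%} of {total} prompts")
+ ```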
808
+
809
+ ---
810
+
811
+ ## Cell 9: Smoke Test
812
+
813
+ **Gate:** Runs 1 step without OOM at new completion length (4096).
814
+
815
+ ```python
816
+ from trl import GRPOConfig, GRPOTrainer
817
+
818
+ FastLanguageModel.for_training(model)
819
+
820
+ smoke_config = GRPOConfig(
821
+ output_dir=str(CHECKPOINT_DIR / "smoke"),
822
+ num_generations=NUM_GENERATIONS,
823
+ scale_rewards=SCALE_REWARDS,
824
+ max_completion_length=MAX_COMPLETION_LENGTH,
825
+ max_steps=1,
826
+ num_train_epochs=1,
827
+ temperature=TEMPERATURE,
828
+ per_device_train_batch_size=BATCH_SIZE,
829
+ gradient_accumulation_steps=1,
830
+ learning_rate=LEARNING_RATE,
831
+ fp16=False,
832
+ bf16=True,
833
+ logging_steps=1,
834
+ save_steps=999,
835
+ report_to="none",
836
+ max_prompt_length=MAX_SEQ_LENGTH - MAX_COMPLETION_LENGTH,
837
+ seed=42,
838
+ remove_unused_columns=False,
839
+ )
840
+
841
+ smoke_trainer = GRPOTrainer(
842
+ model=model,
843
+ reward_funcs=commerce_reward_fn,
844
+ args=smoke_config,
845
+ train_dataset=dataset,
846
+ processing_class=tokenizer, # TRL 0.24 takes processing_class, not tokenizer
847
+ )
848
+
849
+ t0 = time.time()
850
+ smoke_trainer.train()
851
+ step_time = time.time() - t0
852
+
853
+ print(f"\n✓ Smoke test passed!")
854
+ print(f" Step time (grad_accum=1): {step_time:.0f}s")
855
+ print(f" Estimated step time (grad_accum={GRAD_ACCUM}): {step_time * GRAD_ACCUM:.0f}s")
856
+ print(f" VRAM peak: {torch.cuda.max_memory_allocated()/1e9:.1f} GB / {torch.cuda.get_device_properties(0).total_mem/1e9:.1f} GB")
857
+
858
+ vram_used = torch.cuda.max_memory_allocated() / 1e9
859
+ vram_total = torch.cuda.get_device_properties(0).total_memory / 1e9
860
+ if vram_used > vram_total * 0.95:
861
+ print(f"\n⚠️ VRAM at {vram_used/vram_total:.0%} — dangerously close to OOM")
862
+ print(f" Option 1: Reduce MAX_COMPLETION_LENGTH to 3072")
863
+ print(f" Option 2: Reduce BATCH_SIZE to 2 (increase GRAD_ACCUM to 2)")
864
+
865
+ del smoke_trainer
866
+ gc.collect(); torch.cuda.empty_cache()
867
+ ```
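+
+ If the smoke test OOMs even after the warnings above, one pattern is to retry the single step at shrinking completion lengths, mirroring the fallback order from the Cell 3 VRAM note — a sketch, not part of the main pipeline:
+
+ ```python
+ # Sketch: find the largest completion length that fits in one training step.
+ for cand_len in (4096, 3072, 2560):
+     try:
+         gc.collect(); torch.cuda.empty_cache()
+         cfg = GRPOConfig(
+             output_dir=str(CHECKPOINT_DIR / "smoke"),
+             num_generations=NUM_GENERATIONS,
+             max_completion_length=cand_len,
+             max_prompt_length=MAX_SEQ_LENGTH - cand_len,
+             max_steps=1, per_device_train_batch_size=BATCH_SIZE,
+             bf16=True, report_to="none", remove_unused_columns=False,
+         )
+         GRPOTrainer(model=model, reward_funcs=commerce_reward_fn, args=cfg,
+                     train_dataset=dataset, processing_class=tokenizer).train()
+         print(f"✓ fits at max_completion_length={cand_len}")
+         break
+     except torch.cuda.OutOfMemoryError:
+         print(f"✗ OOM at {cand_len}, stepping down")
+ ```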
868
+
869
+ ---
870
+
871
+ ## Cell 10: Probe Run (3 steps)
872
+
873
+ ```python
874
+ FastLanguageModel.for_training(model)
875
+
876
+ probe_config = GRPOConfig(
877
+ output_dir=str(CHECKPOINT_DIR / "probe"),
878
+ num_generations=NUM_GENERATIONS,
879
+ scale_rewards=SCALE_REWARDS,
880
+ max_completion_length=MAX_COMPLETION_LENGTH,
881
+ max_steps=3,
882
+ temperature=TEMPERATURE,
883
+ num_train_epochs=NUM_EPOCHS,
884
+ per_device_train_batch_size=BATCH_SIZE,
885
+ gradient_accumulation_steps=GRAD_ACCUM,
886
+ learning_rate=LEARNING_RATE,
887
+ warmup_ratio=0.1,
888
+ lr_scheduler_type="cosine",
889
+ fp16=False,
890
+ bf16=True,
891
+ logging_steps=1,
892
+ disable_tqdm=True,
893
+ logging_first_step=True,
894
+ save_steps=999,
895
+ report_to="none",
896
+ max_prompt_length=MAX_SEQ_LENGTH - MAX_COMPLETION_LENGTH,
897
+ seed=42,
898
+ remove_unused_columns=False,
899
+ )
900
+
901
+ probe_trainer = GRPOTrainer(
902
+ model=model,
903
+ reward_funcs=commerce_reward_fn,
904
+ args=probe_config,
905
+ train_dataset=dataset,
906
+ processing_class=tokenizer,
907
+ )
908
+
909
+ t0 = time.time()
910
+ result = probe_trainer.train()
911
+ elapsed = time.time() - t0
912
+
913
+ print(f"\n✓ Probe complete in {elapsed:.0f}s ({elapsed/3:.0f}s/step)")
914
+ print(f" Train loss: {result.training_loss:.6f}")
915
+ print(f" Estimated full run ({MAX_STEPS} steps): {elapsed/3 * MAX_STEPS / 3600:.1f}h")
916
+
917
+ if abs(result.training_loss) < 1e-6:
918
+ print(" ⚠️ Loss is near-zero — reward variance may be insufficient")
919
+ else:
920
+ print(" ✓ Loss is non-zero — GRPO has gradient signal")
921
+
922
+ del probe_trainer
923
+ gc.collect(); torch.cuda.empty_cache()
924
+ ```
925
+
926
+ ---
927
+
928
+ ## Cell 11: Full Training Run v3
929
+
930
+ ```python
931
+ import wandb
932
+
933
+ _wandb_key = os.environ.get("WANDB_API_KEY", "").strip()
934
+ if not _wandb_key:
935
+ raise EnvironmentError("WANDB_API_KEY is not set.")
936
+ wandb.login(key=_wandb_key, relogin=True)
937
+ print(f"✓ W&B authenticated")
938
+ ```
939
+
940
+ ```python
941
+ import shutil
942
+ import torch
943
+ from transformers import TrainerCallback
944
+ from trl import GRPOConfig, GRPOTrainer
945
+
946
+ wandb.init(
947
+ project=WANDB_PROJECT,
948
+ name=f"grpo-v3-l4-{time.strftime('%Y%m%d-%H%M')}",
949
+ config={
950
+ "model_id": MODEL_ID,
951
+ "version": "v3",
952
+ "temperature": TEMPERATURE,
953
+ "max_completion_length": MAX_COMPLETION_LENGTH,
954
+ "num_generations": NUM_GENERATIONS,
955
+ "learning_rate": LEARNING_RATE,
956
+ "beta": BETA,
957
+ "batch_size": BATCH_SIZE,
958
+ "grad_accum": GRAD_ACCUM,
959
+ "max_steps": MAX_STEPS,
960
+ "scale_rewards": SCALE_REWARDS,
961
+ "save_steps": SAVE_STEPS,
962
+ "eval_steps": EVAL_STEPS,
963
+ "eval_max_samples": EVAL_MAX_SAMPLES,
964
+ "eval_max_tokens": EVAL_MAX_TOKENS,
965
+ "eval_temperature": EVAL_TEMPERATURE,
966
+ "patience": EARLY_STOPPING_PATIENCE,
967
+ "delta": EARLY_STOPPING_DELTA,
968
+ "train_prompts": len(train_dataset),
969
+ "eval_prompts": len(eval_dataset),
970
+ "zero_adv_noise_std": ZERO_ADV_NOISE_STD,
971
+ "general_mix_ratio": GENERAL_MIX_RATIO,
972
+ "_ref_temperature": "Skywork-OR1 (2505.22312)",
973
+ "_ref_completion_length": "Dr. GRPO (2503.20783)",
974
+ "_ref_staged_rewards": "Reasoning-SQL (2503.23157)",
975
+ "_ref_zero_adv": "Skywork-OR1 (2505.22312)",
976
+ },
977
+ )
978
+ print(f"✓ W&B run: {wandb.run.url}")
979
+
980
+ FRESH = True
981
+ resume_from = None
982
+ if FRESH and CHECKPOINT_DIR.exists():
983
+ print("FRESH: deleting old checkpoints...")
984
+ shutil.rmtree(CHECKPOINT_DIR)
985
+ elif CHECKPOINT_DIR.exists():
986
+ checkpoints = sorted(
987
+ [d for d in CHECKPOINT_DIR.iterdir()
988
+ if d.is_dir() and d.name.startswith("checkpoint-")],
989
+ key=lambda d: int(d.name.split("-")[-1]),
990
+ )
991
+ if checkpoints:
992
+ resume_from = str(checkpoints[-1])
993
+ print(f"Resuming from: {resume_from}")
994
+
995
+
996
+ class UnslothGRPOTrainer(GRPOTrainer):
997
+ """Wraps generation with Unsloth for_inference()/for_training()."""
998
+ def _generate(self, prompts, images):
999
+ FastLanguageModel.for_inference(self.model)
1000
+ try:
1001
+ result = super()._generate(prompts, images)
1002
+ finally:
1003
+ FastLanguageModel.for_training(self.model)
1004
+ return result
1005
+
1006
+
1007
+ class EvalRewardCallback(TrainerCallback):
1008
+ """v3: deterministic eval, per-task breakdown, patience=15."""
1009
+ def __init__(self, eval_records, reward_fn, patience=EARLY_STOPPING_PATIENCE,
1010
+ delta=EARLY_STOPPING_DELTA):
1011
+ self.eval_records = eval_records
1012
+ self.reward_fn = reward_fn
1013
+ self.patience = patience
1014
+ self.delta = delta
1015
+ self.best_reward = -float("inf")
1016
+ self.no_improve_count = 0
1017
+
1018
+ def on_step_end(self, args, state, control, model=None, processing_class=None, **kwargs):
1019
+ if state.global_step == 0 or state.global_step % EVAL_STEPS != 0:
1020
+ return control
1021
+ tokenizer = processing_class
1022
+ if tokenizer is None:
1023
+ print("[EvalRewardCallback] WARNING: tokenizer is None, skipping eval")
1024
+ return control
1025
+
1026
+ mean_reward, task_rewards = self._run_eval(model, tokenizer, args)
1027
+ improved = mean_reward > self.best_reward + self.delta
1028
+ status = "↑ improved" if improved else f"↔ no gain ({self.no_improve_count + 1}/{self.patience})"
1029
+
1030
+ log_dict = {
1031
+ "eval/mean_reward": mean_reward,
1032
+ "eval/best_reward": max(self.best_reward, mean_reward),
1033
+ "eval/no_improve_count": self.no_improve_count,
1034
+ }
1035
+ for task, rewards in task_rewards.items():
1036
+ if rewards:
1037
+ log_dict[f"eval/{task}_reward"] = sum(rewards) / len(rewards)
1038
+ wandb.log(log_dict, step=state.global_step)
1039
+
1040
+ print(f"\n[EvalReward] step={state.global_step} | mean={mean_reward:.4f} | best={self.best_reward:.4f} | {status}")
1041
+ for task, rewards in task_rewards.items():
1042
+ if rewards:
1043
+ print(f" {task}: {sum(rewards)/len(rewards):.3f} (n={len(rewards)})")
1044
+
1045
+ if improved:
1046
+ self.best_reward = mean_reward
1047
+ self.no_improve_count = 0
1048
+ else:
1049
+ self.no_improve_count += 1
1050
+ if self.no_improve_count >= self.patience:
1051
+ print(f"[EarlyStopping] No improvement ≥ {self.delta} for {self.patience} consecutive evals. Halting.")
1052
+ wandb.log({"early_stop/step": state.global_step}, step=state.global_step)
1053
+ control.should_training_stop = True
1054
+ return control
1055
+
1056
+ def _run_eval(self, model, tokenizer, args):
1057
+ FastLanguageModel.for_inference(model)
1058
+ rewards = []
1059
+ task_rewards = {}
1060
+ subset = self.eval_records[:EVAL_MAX_SAMPLES]
1061
+ for record in subset:
1062
+ msgs = record["prompt"]
1063
+ text = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
1064
+ inputs = tokenizer(text, return_tensors="pt", truncation=True,
1065
+ max_length=args.max_prompt_length).to(model.device)
1066
+ with torch.no_grad():
1067
+ out = model.generate(**inputs, max_new_tokens=EVAL_MAX_TOKENS,
1068
+ temperature=EVAL_TEMPERATURE, do_sample=True)
1069
+ resp = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
1070
+ r = self.reward_fn([resp], [text])[0]
1071
+ rewards.append(r)
1072
+ user_text = " ".join(m.get("content", "") for m in msgs if m.get("role") == "user")
1073
+ task = _classify_task_type(user_text)
1074
+ task_rewards.setdefault(task, []).append(r)
1075
+ FastLanguageModel.for_training(model)
1076
+ mean = sum(rewards) / len(rewards) if rewards else 0.0
1077
+ return mean, task_rewards
1078
+
1079
+
1080
+ class EntropyMonitorCallback(TrainerCallback):
1081
+ """v3 NEW: Monitor entropy collapse indicators (Skywork-OR1 §4)."""
1082
+ def __init__(self):
1083
+ self.consecutive_ceiling_hits = 0
1084
+
1085
+ def on_log(self, args, state, control, logs=None, **kwargs):
1086
+ if not logs:
1087
+ return
1088
+ step = state.global_step
1089
+ monitor = {}
1090
+ comp_len = logs.get("completion_length", 0)
1091
+ if comp_len > 0:
1092
+ ratio = comp_len / MAX_COMPLETION_LENGTH
1093
+ monitor["monitor/completion_ratio"] = ratio
1094
+ if ratio > 0.95:
1095
+ self.consecutive_ceiling_hits += 1
1096
+ if self.consecutive_ceiling_hits >= 3:
1097
+ print(f"⚠️ Step {step}: Completion ceiling hit {self.consecutive_ceiling_hits} consecutive times.")
1098
+ else:
1099
+ self.consecutive_ceiling_hits = 0
1100
+ reward_std = logs.get("reward_std", logs.get("rewards/commerce_reward_fn/std", 0))
1101
+ if reward_std is not None:
1102
+ monitor["monitor/reward_std"] = reward_std
1103
+ if reward_std < 0.01:
1104
+ print(f"⚠️ Step {step}: reward_std={reward_std:.4f} — near-zero variance")
1105
+ clip_high = logs.get("clip_ratio/high_mean", 0)
1106
+ clip_low = logs.get("clip_ratio/low_mean", 0)
1107
+ if clip_high is not None and clip_low is not None:
1108
+ total_clip = clip_high + abs(clip_low)
1109
+ monitor["monitor/total_clip_ratio"] = total_clip
1110
+ if total_clip > 0.01 and step > 10:
1111
+ print(f"✓ Step {step}: clip_ratio={total_clip:.3f} — policy is updating")
1112
+ if monitor and wandb.run:
1113
+ wandb.log(monitor, step=step)
1114
+
1115
+
1116
+ FastLanguageModel.for_training(model)
1117
+
1118
+ grpo_config = GRPOConfig(
1119
+ output_dir=str(CHECKPOINT_DIR),
1120
+ num_generations=NUM_GENERATIONS,
1121
+ scale_rewards=SCALE_REWARDS,
1122
+ max_completion_length=MAX_COMPLETION_LENGTH,
1123
+ temperature=TEMPERATURE,
1124
+ max_steps=MAX_STEPS,
1125
+ num_train_epochs=NUM_EPOCHS,
1126
+ per_device_train_batch_size=BATCH_SIZE,
1127
+ gradient_accumulation_steps=GRAD_ACCUM,
1128
+ learning_rate=LEARNING_RATE,
1129
+ warmup_ratio=0.1,
1130
+ lr_scheduler_type="cosine",
1131
+ fp16=False,
1132
+ bf16=True,
1133
+ logging_steps=1,
1134
+ logging_first_step=True,
1135
+ disable_tqdm=True,
1136
+ save_steps=SAVE_STEPS,
1137
+ save_total_limit=SAVE_TOTAL_LIMIT,
1138
+ save_only_model=True, # NOTE: no optimizer state saved — set False if you need resume_from_checkpoint
1139
+ eval_steps=EVAL_STEPS,
1140
+ report_to="wandb",
1141
+ max_prompt_length=MAX_SEQ_LENGTH - MAX_COMPLETION_LENGTH,
1142
+ seed=42,
1143
+ remove_unused_columns=False,
1144
+ **({"use_vllm": True, "vllm_mode": "colocate",
1145
+ "vllm_enable_sleep_mode": True} if USE_VLLM else {}),
1146
+ )
1147
+
1148
+ eval_cb = EvalRewardCallback(eval_records=list(eval_dataset), reward_fn=commerce_reward_fn)
1149
+ entropy_cb = EntropyMonitorCallback()
1150
+
1151
+ TrainerClass = GRPOTrainer if USE_VLLM else UnslothGRPOTrainer
1152
+ trainer = TrainerClass(
1153
+ model=model,
1154
+ reward_funcs=commerce_reward_fn,
1155
+ args=grpo_config,
1156
+ train_dataset=train_dataset,
1157
+ processing_class=tokenizer,
1158
+ callbacks=[eval_cb, entropy_cb],
1159
+ )
1160
+
1161
+ print(f"{'='*70}")
1162
+ print(f"GRPO v3 Training — Ready to Launch")
1163
+ print(f"{'='*70}")
1164
+ print(f" Trainer: {TrainerClass.__name__}")
1165
+ print(f" Max steps: {MAX_STEPS}")
1166
+ print(f" Temperature: {TEMPERATURE} (v2 was 0.8)")
1167
+ print(f" Completion: {MAX_COMPLETION_LENGTH} tokens (v2 was 2048)")
1168
+ print(f" Generations: {NUM_GENERATIONS} per prompt (v2 was 8)")
1169
+ print(f" Learning rate: {LEARNING_RATE} (v2 was 5e-7)")
1170
+ print(f" Save every: {SAVE_STEPS} steps (keep {SAVE_TOTAL_LIMIT})")
1171
+ print(f" Eval every: {EVAL_STEPS} steps ({EVAL_MAX_SAMPLES} samples × {EVAL_MAX_TOKENS} tok)")
1172
+ print(f" Patience: {EARLY_STOPPING_PATIENCE} evals ({EARLY_STOPPING_PATIENCE * EVAL_STEPS} steps)")
1173
+ print(f" Resume: {resume_from is not None}")
1174
+ print(f"{'='*70}")
1175
+
1176
+ t_start = time.time()
1177
+ result = trainer.train(resume_from_checkpoint=resume_from)
1178
+ elapsed = time.time() - t_start
1179
+
1180
+ wandb.log({
1181
+ "train/final_loss": result.training_loss,
1182
+ "train/duration_hours": elapsed / 3600,
1183
+ "train/total_steps": result.global_step,
1184
+ "eval/best_reward_final": eval_cb.best_reward,
1185
+ })
1186
+ wandb.finish()
1187
+
1188
+ print(f"\n{'='*70}")
1189
+ print(f"GRPO v3 Training Complete")
1190
+ print(f" Loss: {result.training_loss:.6f}")
1191
+ print(f" Steps: {result.global_step}")
1192
+ print(f" Duration: {elapsed/3600:.1f}h")
1193
+ print(f" Best eval R: {eval_cb.best_reward:.4f}")
1194
+ print(f" Trainer: {TrainerClass.__name__}")
1195
+ print(f"{'='*70}")
1196
+ ```
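+
+ With SAVE_TOTAL_LIMIT=5 only the last five checkpoints survive, and the best eval step is not recorded on disk. If the final weights regressed past the best eval, an earlier checkpoint can be reloaded by hand — a sketch assuming the checkpoint dirs hold adapter weights (which `save_only_model=True` implies):
+
+ ```python
+ # List surviving checkpoints, oldest → newest.
+ ckpts = sorted(
+     (d for d in CHECKPOINT_DIR.iterdir()
+      if d.is_dir() and d.name.startswith("checkpoint-")),
+     key=lambda d: int(d.name.split("-")[-1]),
+ )
+ print("kept:", [c.name for c in ckpts])
+
+ # Pick the step closest to the best eval (from the [EvalReward] logs), e.g.:
+ # model, tokenizer = FastLanguageModel.from_pretrained(
+ #     model_name=str(ckpts[-1]), max_seq_length=MAX_SEQ_LENGTH, load_in_4bit=True,
+ # )
+ ```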
1197
+
1198
+ ---
1199
+
1200
+ ## Cell 12: Save Adapter
1201
+
1202
+ ```python
1203
+ GRPO_ADAPTER_DIR.mkdir(parents=True, exist_ok=True)
1204
+ model.save_pretrained(str(GRPO_ADAPTER_DIR))
1205
+ tokenizer.save_pretrained(str(GRPO_ADAPTER_DIR))
1206
+
1207
+ summary = {
1208
+ "model_id": MODEL_ID,
1209
+ "sft_adapter": str(SFT_ADAPTER_DIR),
1210
+ "method": "GRPO",
1211
+ "version": "v3",
1212
+ "train_loss": result.training_loss,
1213
+ "best_eval_reward": eval_cb.best_reward,
1214
+ "num_prompts": len(train_dataset),
1215
+ "num_generations": NUM_GENERATIONS,
1216
+ "scale_rewards": SCALE_REWARDS,
1217
+ "temperature": TEMPERATURE,
1218
+ "learning_rate": LEARNING_RATE,
1219
+ "beta": BETA,
1220
+ "max_completion_length": MAX_COMPLETION_LENGTH,
1221
+ "max_steps": MAX_STEPS,
1222
+ "actual_steps": result.global_step,
1223
+ "epochs": NUM_EPOCHS,
1224
+ "max_seq_length": MAX_SEQ_LENGTH,
1225
+ "duration_seconds": round(elapsed),
1226
+ "gpu": "L4",
1227
+ "platform": "vertex-ai-workbench",
1228
+ "v3_fixes": [
1229
+ "temperature=1.0 (Skywork-OR1)",
1230
+ "max_completion_length=4096 (Dr. GRPO)",
1231
+ "learning_rate=2e-6 (4x v2)",
1232
+ "beta=0.0 (Dr. GRPO)",
1233
+ "staged rewards (Reasoning-SQL)",
1234
+ "zero-advantage noise (Skywork-OR1)",
1235
+ "entropy monitoring callback",
1236
+ ],
1237
+ }
1238
+ with open(GRPO_ADAPTER_DIR / "training_summary.json", "w") as f:
1239
+ json.dump(summary, f, indent=2)
1240
+
1241
+ print(f"✓ Adapter saved to {GRPO_ADAPTER_DIR}")
1242
+ print(f" Files: {[f.name for f in GRPO_ADAPTER_DIR.iterdir() if f.is_file()]}")
1243
+ ```
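+
+ A quick round-trip check that the saved artifacts are complete — reading the summary back rather than reloading the adapter (which would double VRAM in-session); the expected file names are an assumption based on standard PEFT output:
+
+ ```python
+ # Verify the summary parses and the usual PEFT adapter files exist.
+ with open(GRPO_ADAPTER_DIR / "training_summary.json") as f:
+     loaded = json.load(f)
+ assert loaded["version"] == "v3" and loaded["beta"] == 0.0
+ expected = {"adapter_config.json", "adapter_model.safetensors"}  # assumed names
+ missing = expected - {p.name for p in GRPO_ADAPTER_DIR.iterdir()}
+ print(f"✓ summary ok; missing files: {missing or 'none'}")
+ ```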
1244
+
1245
+ ---
1246
+
1247
+ ## Cell 13: Validation
1248
+
1249
+ ```python
1250
+ FastLanguageModel.for_inference(model)
1251
+
1252
+ system_msg = {"role": "system", "content": SYSTEM_PT}
1253
+
1254
+ test_prompts = [
1255
+ {"role": "user", "content": (
1256
+ "Analise esta avaliação de e-commerce brasileiro e extraia dados estruturados.\n\n"
1257
+ "nota=2/5 | status=delivered\ntítulo: decepcionado\n"
1258
+ "texto: Produto veio com defeito e o vendedor não respondeu.\n\n"
1259
+ "Retorne um objeto JSON com exatamente estas chaves:\n"
1260
+ "sentiment, sentiment_score, churn_risk, delivery_issue, product_issue, "
1261
+ "seller_issue, main_complaint, complaint_category, repeat_intent, would_recommend"
1262
+ )},
1263
+ {"role": "user", "content": (
1264
+ "Analise esta avaliação de e-commerce brasileiro e extraia dados estruturados.\n\n"
1265
+ "nota=5/5 | status=delivered\ntítulo: adorei!\n"
1266
+ "texto: Entrega rápida e produto exatamente como descrito. Recomendo!\n\n"
1267
+ "Retorne um objeto JSON com exatamente estas chaves:\n"
1268
+ "sentiment, sentiment_score, churn_risk, delivery_issue, product_issue, "
1269
+ "seller_issue, main_complaint, complaint_category, repeat_intent, would_recommend"
1270
+ )},
1271
+ {"role": "user", "content": "Quais são as categorias de reclamação mais frequentes e como afetam a nota média?"},
1272
+ {"role": "user", "content": "Analise a retenção de clientes afetados por product_quality."},
1273
+ {"role": "user", "content": (
1274
+ "Perfil do cliente:\n- Estado: MG\n- Valor do pedido: R$150\n"
1275
+ "- Reclamação: produto com defeito\n- Nota: 1.0/5\n\n"
1276
+ "Este cliente deve receber uma notificação de reengajamento?"
1277
+ )},
1278
+ {"role": "user", "content": "Compare a satisfação de clientes em SP vs RJ."},
1279
+ {"role": "user", "content": (
1280
+ "Crie uma notificação push de reengajamento para um cliente em SP "
1281
+ "que reclamou de atraso na entrega. Nota: 2/5."
1282
+ )},
1283
+ ]
1284
+
1285
+ print("=" * 70)
1286
+ print("GRPO v3 Validation")
1287
+ print("=" * 70)
1288
+
1289
+ v3_rewards = []
1290
+ for i, prompt in enumerate(test_prompts):
1291
+ messages = [system_msg, prompt]
1292
+ text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
1293
+ inputs = tokenizer(text, return_tensors="pt").to(model.device)
1294
+
1295
+ outputs = model.generate(**inputs, max_new_tokens=MAX_COMPLETION_LENGTH, temperature=0.1, do_sample=True)
1296
+ gen_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
1297
+ response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
1298
+
1299
+ reward = commerce_reward_fn([response], [text])[0]
1300
+ v3_rewards.append(reward)
1301
+ answer = strip_think(response)
1302
+ task = _classify_task_type(prompt["content"])
1303
+ hit_ceiling = gen_tokens >= MAX_COMPLETION_LENGTH
1304
+
1305
+ print(f"\n--- Sample {i+1} [{task}] (reward={reward:.2f}, tokens={gen_tokens}, ceiling={'HIT' if hit_ceiling else 'ok'}) ---")
1306
+ print(f"Prompt: {prompt['content'][:80]}...")
1307
+ print(f"Answer: {answer[:400]}")
1308
+
1309
+ print(f"\n{'='*70}")
1310
+ print(f"v3 Validation Summary")
1311
+ print(f"{'='*70}")
1312
+ print(f" Mean reward: {sum(v3_rewards)/len(v3_rewards):.3f}")
1313
+ print(f" Min: {min(v3_rewards):.3f}")
1314
+ print(f" Max: {max(v3_rewards):.3f}")
1315
+ print()
1316
+ print(f" Comparison to baselines:")
1317
+ print(f" SFT calibration (Cell 7): mean=0.38")
1318
+ print(f" GRPO v2 validation: mean=0.54")
1319
+ print(f" GRPO v3 validation: mean={sum(v3_rewards)/len(v3_rewards):.3f}")
1320
+ v3_vs_v2 = (sum(v3_rewards)/len(v3_rewards) - 0.54) / 0.54 * 100
1321
+ print(f" v3 vs v2: {v3_vs_v2:+.1f}%")
1322
+ ```